CN117612188A - Text recognition model training method, text recognition device and equipment - Google Patents

Text recognition model training method, text recognition device and equipment

Info

Publication number
CN117612188A
Authority
CN
China
Prior art keywords
feature
text
features
model
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311747946.XA
Other languages
Chinese (zh)
Inventor
张大壮
徐鑫
杨月
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311747946.XA priority Critical patent/CN117612188A/en
Publication of CN117612188A publication Critical patent/CN117612188A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a text recognition model training method, a text recognition method, a device and equipment, which can be applied to the technical fields of machine vision and natural language processing. The method comprises: inputting a sample image into a visual feature extraction sub-model to extract spatial features of text information in the sample image and output visual features; inputting the visual features into a text feature extraction sub-model to perform text error correction processing and output text features; based on a plurality of attribute categories of the sample image, carrying out feature fusion on the visual features and the text features by using a feature fusion sub-model and outputting a text prediction result about the sample image; determining a model loss value of the text recognition model based on the visual features, the text prediction result and the text labels corresponding to the sample image; and respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model based on the model loss value to obtain a trained text recognition model.

Description

Text recognition model training method, text recognition device and equipment
Technical Field
The disclosure relates to the technical field of machine vision and natural language processing, in particular to a text recognition model training method, a text recognition device and text recognition equipment.
Background
When identifying text information in images of natural scenes, the text in an image is typically recognized using a computer vision model. Because text in natural-scene images may be occluded or blurred, a traditional computer vision model cannot accurately recognize the occluded or blurred text content. Accordingly, a language model is generally introduced in the related art to predict the occluded or blurred text content based on language rules and semantic content.
In the process of implementing the disclosed concept, the inventors found that at least the following problem exists in the related art: after the language model is introduced, a pre-trained language model is adopted so that the language model obtains semantic error correction capability, which reduces the overall training efficiency of the model.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a text recognition model training method, a text recognition method, a device, a medium, and a program product.
According to a first aspect of the present disclosure, there is provided a text recognition model training method, the text recognition model comprising: a visual feature extraction sub-model, a text feature extraction sub-model, and a fusion feature sub-model, the method comprising:
inputting a sample image into the visual feature extraction sub-model to extract spatial features of text information in the sample image and output visual features;
inputting the visual features into the text feature extraction submodel to perform text correction processing, and outputting text features corresponding to the text information;
based on a plurality of attribute categories of the sample image, carrying out feature fusion on the visual features and the text features by utilizing the feature fusion submodel, and outputting a text prediction result corresponding to the fused features and related to the sample image;
determining a model loss value of the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image;
and respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the fusion feature sub-model based on the model loss value to obtain a trained text recognition model.
According to an embodiment of the disclosure, the feature fusion sub-model is used to perform feature fusion on the visual feature and the text feature based on a plurality of attribute categories of the sample image, and output a text prediction result corresponding to the fused feature with respect to the sample image, including:
based on a plurality of attribute categories of the sample image, performing feature stitching on the visual features and the text features by utilizing a character string stitching function in the feature fusion submodel to obtain fusion features;
sequentially inputting the fusion features into a plurality of residual stacking modules in the feature fusion sub-model so as to perform feature reinforcement on the fusion features and output reinforced features;
and respectively carrying out confidence calculation on a plurality of characters in the enhanced feature based on a preset character feature set in the feature fusion submodel so as to determine the text prediction result.
According to an embodiment of the disclosure, the inputting the fusion features into the plurality of residual stacking modules in the feature fusion submodel sequentially, so as to perform feature enhancement on the fusion features, and outputting enhancement features, includes:
inputting the fusion features into a convolution layer of a first residual stacking module to perform feature extraction on the fusion features and outputting extracted features, wherein the first residual stacking module is a first residual stacking module in a plurality of residual stacking modules;
adding the fusion feature and the extraction feature based on a residual connection mode to obtain a spliced feature;
inputting the spliced features into a feature reinforcing network of a first residual stacking module, and outputting intermediate features;
and inputting the intermediate features into a plurality of second residual stacking modules according to the connection relation among the residual stacking modules, sequentially carrying out feature reinforcement, and outputting the reinforced features, wherein the second residual stacking modules are residual stacking modules except the first residual stacking modules of the residual stacking modules.
According to an embodiment of the disclosure, inputting the stitching feature into a feature enhancement network of a first residual stacking module, outputting an intermediate feature, includes:
determining a position coding matrix corresponding to the splicing characteristic by using a position mapping function based on the characteristic information of the splicing characteristic;
splicing the position coding matrix and the splicing features by using a character string splicing function to obtain first features;
inputting the first feature into a self-attention mechanism layer in the feature enhancement network, and outputting a second feature;
and inputting the first feature and the second feature into a normalization layer in the feature strengthening network for normalization processing, and outputting the intermediate feature.
According to an embodiment of the present disclosure, before the outputting the reinforcement feature, further comprising:
inputting the intermediate features subjected to feature reinforcement into a linear layer in the feature fusion submodel, and performing linear transformation on the intermediate features subjected to feature reinforcement to obtain the reinforced features.
According to an embodiment of the disclosure, the determining a model loss value of the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image includes:
inputting the visual features and the text labels corresponding to the sample images into a loss function of a visual feature extraction sub-model, and outputting a first loss value;
inputting the text feature and the text label into a loss function of a text feature extraction sub-model, and outputting a second loss value;
inputting the text prediction result and the text label into a loss function of a feature fusion submodel, and outputting a third loss value;
and adding the first loss value, the second loss value and the third loss value to obtain the model loss value.
Another aspect of the present disclosure provides a text recognition method, including:
inputting the target image into a visual feature extraction sub-model trained by any text recognition model training method so as to extract the spatial features of text information in the target image and obtain target visual features;
inputting the target visual features into a text feature extraction sub-model trained by any text recognition model training method to perform text correction processing, and outputting target text features;
based on a plurality of attribute categories of the target image, utilizing a feature fusion sub-model obtained through training by the text recognition model training method, carrying out feature fusion on the target visual features and the target text features, and outputting a text recognition result corresponding to the fused features and related to the target image.
Another aspect of the present disclosure provides a text recognition model training apparatus, including:
the visual extraction module is used for inputting a sample image into the visual feature extraction sub-model so as to extract spatial features of text information in the sample image and output visual features;
the text extraction module is used for inputting the visual features into the text feature extraction submodel to perform text correction processing and outputting text features corresponding to the text information;
the feature fusion module is used for carrying out feature fusion on the visual features and the text features by utilizing the feature fusion submodel based on a plurality of attribute categories of the sample image, and outputting text prediction results corresponding to the fused features and related to the sample image;
a loss determination module for determining a model loss value of the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image; and
and the model training module is used for respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the fusion feature sub-model based on the model loss value to obtain a trained text recognition model.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.
Another aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.
Another aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the text recognition model training method, the text recognition method, the device, the medium and the program product, a joint training mode is adopted: a sample image is input into the visual feature extraction sub-model, text information in the sample image is extracted to obtain visual features, and the visual features are used as the input of the text feature extraction sub-model to obtain text features, so that the text recognition model can learn the correspondence between the visual features and the text features and the text features can describe the text content in the sample image more accurately. The text features and the visual features are fused by the feature fusion sub-model, and prediction is performed based on the fused features. Because the fused features reinforce the more accurate parts of the visual features and the text features, calculating the loss value of the text recognition model from the prediction result obtained from the fused features and the text labels accurately reflects the respective weights of the visual features and the text features in the model, and parameter adjustment of the model using this loss value is more accurate, so that the overall performance of the text recognition model is improved and the trained text recognition model recognizes text in images more accurately. In addition, the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model are trained jointly, no pre-training learning strategy is needed, and some parameters can be shared among the sub-models, so that the generalization capability and the training efficiency of the text recognition model are improved to a certain extent.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a text recognition model training method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text recognition model training method in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a framework diagram of a text recognition model training method in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of text recognition model training in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a framework diagram of a feature-enhanced network in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a text recognition method according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a text recognition model training apparatus in accordance with an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a text recognition device according to an embodiment of the present disclosure; and
fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a text recognition model training method, a text recognition method, according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, the expression should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the related user information (including, but not limited to, user personal information, user image information, user equipment information, such as location information, etc.) and data (including, but not limited to, data for analysis, stored data, displayed data, etc.) are information and data authorized by the user or sufficiently authorized by each party, and the related data is collected, stored, used, processed, transmitted, provided, disclosed, applied, etc. and processed, all in compliance with the related laws and regulations and standards of the related country and region, necessary security measures are taken, no prejudice to the public order, and corresponding operation entries are provided for the user to select authorization or rejection.
The technology of recognizing text information from an image is widely used, and a main recognition method is to capture an image through an optical device and detect the captured image to recognize characters on the image, thereby extending vision and character recognition capability to a machine. For example, in business handling of a bank, bill information photographed by a mobile phone needs to be uploaded to the bank, however, text on the bill may be lost due to improper bill preservation, or text may be covered by a light spot on a photographed photo due to photographing environment factors, so that the text on the photo is difficult to be recognized by a machine.
To address this phenomenon, there are currently two types of correction networks for text recognition: vision-based spatial correction networks and language-based semantic error correction networks. A vision-based spatial correction network is a technique that embeds geometric transformations (such as rotation, translation and scaling) in a neural network so that the neural network can process input data more accurately. A language-based semantic error correction network refers to a system for detecting and correcting semantic errors in text using neural networks and natural language processing technology. Such networks are intended to help identify and correct semantic errors in text, thereby improving the accuracy and fluency of text understanding.
When text is recognized and corrected by the vision-based spatial correction network alone, interference factors such as occluded or blurred text in the picture remain, and the visual algorithm cannot correct the interfered characters using the contextual information between the texts. Therefore, a language model is introduced into the text recognition process, and the associated information between texts is learned through the language model, thereby correcting erroneously recognized text. Algorithms that combine a visual algorithm and a language algorithm often rely on a pre-trained language model to obtain semantic error correction capability; such algorithms have higher time complexity, and the data set used for pre-training may not match actual engineering requirements, so an end-to-end algorithm has more advantages than directly combining a language model and a visual model.
However, end-to-end algorithms have certain difficulties, including the problem of mismatch between the visual model and the language model: in non-ideal conditions, the language model interferes with the recognition result of the visual model, which weakens the semantic error correction capability of the whole model and leaves a large number of spelling errors in the text recognition result. In addition, fusing the two models increases the parameter count of the whole model, and pre-training the two models separately incurs more time overhead than pre-training a single model and places higher demands on the computer.
In summary, the main difficulties in the text recognition process are the following: 1) the visual model and the language model are not matched; 2) the text recognition model over-corrects, so that the text recognition result does not match the original text information; 3) the pre-training time cost of the model obtained after combining the two models increases; 4) the text recognition accuracy of a single model is low.
The embodiment of the disclosure provides a text recognition model training method, wherein the text recognition model comprises a visual feature extraction sub-model, a text feature extraction sub-model and a feature fusion sub-model. The method comprises: inputting the sample image into the visual feature extraction sub-model to extract spatial features of text information in the sample image and output visual features; inputting the visual features into the text feature extraction sub-model to perform text correction processing, and outputting text features corresponding to the text information; based on a plurality of attribute categories of the sample image, carrying out feature fusion on the visual features and the text features by using the feature fusion sub-model, and outputting a text prediction result corresponding to the fused features and related to the sample image; determining a model loss value of the text recognition model based on the visual features, the text prediction result and the text labels corresponding to the sample image; and respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model based on the model loss value to obtain a trained text recognition model.
Fig. 1 schematically illustrates an application scenario diagram of a text recognition model training method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, and the third terminal device 103 to receive or transmit a message or the like, thereby acquiring a training progress of the text recognition model. Various communication client applications can be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, and a training program for the text recognition model may run on the server 105. Acting as a background management server, the server can analyze and otherwise process the received sample images and feed the finally trained text recognition model back to the terminal devices.
It should be noted that the text recognition model training method provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text recognition model training apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The text recognition model training method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the text recognition model training apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The text recognition model training method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 5 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a text recognition model training method according to an embodiment of the present disclosure.
As shown in fig. 2, the text recognition model training method of this embodiment includes operations S210 to S250.
In operation S210, the sample image is input into the visual feature extraction sub-model to perform spatial feature extraction on text information in the sample image, and the visual features are output.
In operation S220, the visual features are input into the text feature extraction sub-model to perform text correction processing, and text features corresponding to the text information are output.
In operation S230, feature fusion is performed on the visual features and the text features using the feature fusion sub-model based on the plurality of attribute categories of the sample image, and a text prediction result regarding the sample image corresponding to the fused features is output.
In operation S240, a model loss value of the text recognition model is determined based on the visual feature, the text prediction result, and the text label corresponding to the sample image.
In operation S250, the visual feature extraction sub-model, the text feature extraction sub-model, and the fusion feature sub-model are optimized based on the model loss values, respectively, to obtain a trained text recognition model.
According to the embodiment of the disclosure, a sample set for training needs to be acquired first. An internationally public open-source data set may be acquired as the training sample set, which may comprise 3 training sets and 7 test sets. The training sets comprise training sample images and corresponding label sets.
According to an embodiment of the present disclosure, a text recognition model includes: a visual feature extraction sub-model, a text feature extraction sub-model, and a fusion feature sub-model. In the process of training the text recognition model, a plurality of sample images are randomly extracted from the training set for training; 150 cropped images may be selected from the training set as sample images.
According to an embodiment of the present disclosure, the text recognition model training method performs the processing of steps S210 to S250 on each sample image input to the model. Each sample image can be scaled to a 32×100 gray-scale image to form an input vector F of size 150×32×100. Each sample image corresponds to a text label with a unified length of 35 bits: text labels shorter than 35 bits are filled with a cut-off symbol, and sample images whose text labels exceed 35 bits are selectively discarded. Setting the text labels to the same length facilitates the data processing of the text recognition model, thereby improving the training efficiency of the text recognition model.
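For illustration only, this preprocessing might be sketched as follows in PyTorch; the function names, the padding-symbol id and the bilinear resizing are assumptions rather than details taken from the disclosure.

```python
# Illustrative preprocessing sketch: resize each sample to a 32x100 grayscale
# image and pad text labels to a fixed length of 35, dropping longer labels.
import torch
import torch.nn.functional as F
from typing import List, Optional

MAX_LABEL_LEN = 35   # unified label length from the description
PAD_ID = 0           # assumed id of the cut-off/padding symbol

def preprocess_image(img: torch.Tensor) -> torch.Tensor:
    """img: (1, H, W) grayscale tensor -> (1, 32, 100) tensor."""
    resized = F.interpolate(img.unsqueeze(0), size=(32, 100),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(0)

def pad_label(ids: List[int]) -> Optional[torch.Tensor]:
    """Pad labels shorter than 35 with the cut-off symbol; drop longer ones."""
    if len(ids) > MAX_LABEL_LEN:
        return None  # samples whose labels exceed 35 bits are discarded
    return torch.tensor(ids + [PAD_ID] * (MAX_LABEL_LEN - len(ids)))
```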
According to an embodiment of the present disclosure, the visual feature extraction sub-model extracts text from the input sample image. The visual model employs a feature extraction-sequence encoding-sequence decoding framework. The front-end feature extraction part adopts a ResNet-45 network and is responsible for extracting a spatial feature map from the input image. After sin-cos position coding, the vector F of size 150×32×100 enters an encoding/decoding framework with a Transformer as its core, and visual features of shape B×C×D are output, where B is the batch size, C is the text sequence length, and D is the encoding dimension.
According to an embodiment of the present disclosure, the visual feature extraction sub-model is used to identify text content in a sample image. The spatial features may include outline information of text on a picture, character sequence information of text, and the like. And extracting the spatial features of the text on the sample image through the visual feature extraction sub-model, carrying out coding processing on the extracted spatial features, determining the character sequence of the text, and taking the outline features of the characters and the relative sequence features of the characters in the text as the output of the visual feature extraction sub-model, namely the visual features.
According to embodiments of the present disclosure, the visual feature extraction sub-model may include a ResNet-45 network (a residual neural network) and a Transformer codec structure. The visual feature extraction sub-model may first perform feature extraction on an input sample image using the ResNet-45 network to obtain spatial features, then extract character sequence information from the spatial features using the Transformer codec structure, and finally obtain the visual features of the sample image. Such a model design can provide better performance and semantic understanding capability for the visual feature extraction sub-model during text recognition.
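A minimal, non-authoritative sketch of such a visual feature extraction sub-model is given below; a small convolution stack stands in for the ResNet-45 backbone, and the sequence length, head count and encoder depth are assumptions.

```python
# Sketch of a visual sub-model: a toy convolutional backbone (stand-in for
# ResNet-45) followed by sin-cos position coding and a Transformer encoder,
# producing visual features of shape (B, C, D).
import math
import torch
import torch.nn as nn

class VisualSubModel(nn.Module):
    def __init__(self, seq_len: int = 26, d_model: int = 512):
        super().__init__()
        # toy backbone; the disclosure uses ResNet-45 for spatial features
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, seq_len)),        # -> (B, D, 1, seq_len)
        )
        self.pos = self._sincos(seq_len, d_model)       # sin-cos position coding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)

    @staticmethod
    def _sincos(length: int, dim: int) -> torch.Tensor:
        pos = torch.arange(length).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, 1, 32, 100) grayscale batch -> visual features (B, C, D)."""
        f = self.backbone(x).squeeze(2).transpose(1, 2)  # (B, seq_len, d_model)
        return self.encoder(f + self.pos.to(f.device))
```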
According to an embodiment of the present disclosure, after word vector encoding is performed on the visual features, the visual features are input into the text feature extraction sub-model. The text feature extraction sub-model corrects the visual features output by the visual feature extraction sub-model to obtain text features. The text feature extraction sub-model plays an auxiliary role in recognition and can be composed of a bilinear convolutional network (BCN).
according to embodiments of the present disclosure, the text feature extraction sub-model may be a model that predicts and corrects text according to grammatical rules and semantic rules. The text feature extraction sub-model adjusts the vector of the visual feature part representing the text information, so as to correct the up-down prediction error or spelling error and the like of the vector representing the text information in the visual feature. The text feature extraction sub-model may correct semantic errors of the text by predicting the probability of occurrence of the next word or phrase. The text feature extraction sub-model may be a statistical-based N-Gram model (statistical language model) and a neural network-based recurrent neural network or a transducer model. Text features may be used to characterize text information after character correction.
According to embodiments of the present disclosure, the plurality of attribute categories of the sample image correspond to a plurality of encoding dimensions of the visual features and the text features; for example, the plurality of attribute categories may include a word embedding dimension, a position encoding dimension, and the like. The feature fusion sub-model can splice the visual features and the text features through a character string splicing function, and can also carry out feature reinforcement on the matrix obtained after splicing to obtain fusion features. The feature fusion sub-model obtains the prediction result for the sample image based on the obtained fusion features.
According to embodiments of the present disclosure, a cross-entropy loss function may be used to calculate the loss values between the visual features, the text features and the text prediction results on the one hand and the text labels on the other, and the resulting loss values are back-propagated into the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model, calculating gradients for the respective sub-models. Parameters of the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model are respectively updated based on the calculated gradient values of the different sub-models, and training is repeated a plurality of times to obtain the text recognition model.
Fig. 3 schematically illustrates a framework diagram of a text recognition model training method according to an embodiment of the present disclosure.
As shown in fig. 3, the sample image is input into the visual feature extraction sub-model 310 to obtain visual features 301, the visual features 301 are input into the text feature extraction sub-model 320, and the text feature extraction sub-model 320 performs error correction processing on the input visual features 301 to obtain text features 302. The feature fusion sub-model 330 is used to perform feature fusion on the visual features 301 and the text features 302, so that the fused features can reinforce the learning of semantic information while retaining the spatial features. In fig. 3, for an input sample picture of 100×32, visual features of 256×512 dimensions are output after the visual model processing, while the language model also outputs text features of 512×256 dimensions according to the recognized text. After fusion processing and feature learning by the feature fusion sub-model, classification is performed by a Softmax classifier to obtain the corrected text. To ensure the robustness of the text recognition model, a total of 15 million (1500w) pictures from three data sets are used for training, and model evaluation is performed on international public open-source test sets.
According to the embodiment of the disclosure, an end-to-end training mode is adopted, the visual feature extraction submodel, the text feature extraction submodel and the fusion feature submodel are combined for training, any pre-training learning strategy is not needed, and a part of parameters can be shared among all submodels, so that the generalization capability and training efficiency of the text recognition model are improved to a certain extent. In addition, the present disclosure performs joint training of the visual feature extraction sub-model, the text feature extraction sub-model, and the fusion feature sub-model, and uses the loss functions of the three sub-models for back propagation and parameter updating. Therefore, all the sub-models can cooperate with each other in the training process, the performance of the whole model is improved, and the text recognition model obtained through training is more accurate in recognizing the text in the image.
According to an embodiment of the present disclosure, feature fusion is performed on visual features and text features using feature fusion submodels based on a plurality of attribute categories of a sample image, and a text prediction result on the sample image corresponding to the fused features is output, including: based on a plurality of attribute categories of the sample image, performing feature stitching on the visual features and the text features by utilizing a character string stitching function in the feature fusion sub-model to obtain fusion features; sequentially inputting the fusion features into a plurality of residual stacking modules in the feature fusion sub-model to perform feature reinforcement on the fusion features and output reinforced features; and respectively carrying out confidence calculation on a plurality of characters in the enhanced features based on a preset character feature set in the feature fusion submodel so as to determine a text prediction result.
According to embodiments of the present disclosure, the visual features and the text features may each be represented as a tensor of shape B×C×D, where B is the batch size, C is the text sequence length, and D is the encoding dimension. The visual features and the text features can be feature-spliced in the encoding dimension through a concatenation (CAT) function in PyTorch (an open-source Python machine learning library) to obtain a feature-spliced matrix, and the feature-spliced matrix is input into a linear layer and a softmax classifier (multi-classifier) for weight calculation. Specifically, the intermediate fusion matrix is input into the linear layer and transformed to obtain a linearly transformed matrix. An activation function may also be used to increase the nonlinear expressive power of the attention weight matrix. The linearly transformed matrix is normalized to obtain a normalized matrix, which is multiplied element by element with the feature-spliced matrix to obtain a weighted feature map; a summation operation is then performed on the weighted feature map in the encoding dimension to obtain the final fusion feature.
According to the embodiment of the disclosure, the fusion features are sequentially input into a plurality of residual stacking modules in the feature fusion submodel to perform feature enhancement on the fusion features. The residual stacking module may employ a multi-layer neural network structure, where each module may contain convolution layers, normalization layers, activation functions, etc., for extracting and enhancing the feature representation. A preset character feature set may be prepared in the feature fusion sub-model, wherein the character feature set includes all character categories that the text recognition model needs to recognize.
According to an embodiment of the present disclosure, for each character in the reinforcement feature, a confidence calculation is performed using the preset character feature set in the feature fusion sub-model. The probability distribution over the character categories may be calculated using softmax (an activation function) or another suitable method, thereby obtaining a confidence score for each character that is used to determine the final text prediction result.
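As a simple illustration of this confidence calculation, the sketch below applies softmax to per-character logits and reads off the most probable character from a placeholder character set; the actual character set is not specified here.

```python
# Tiny sketch of per-character confidence: softmax turns logits into a
# probability distribution, the maximum gives the confidence, argmax the class.
import torch
from typing import List, Tuple

CHARSET = list("0123456789abcdefghijklmnopqrstuvwxyz") + ["<eos>"]  # assumed set

def decode(logits: torch.Tensor) -> List[Tuple[str, float]]:
    """logits: (L, R) per-character scores -> [(character, confidence), ...]."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return [(CHARSET[i], float(c)) for i, c in zip(idx.tolist(), conf.tolist())]
```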
Fig. 4 schematically illustrates a schematic diagram of text recognition model training in accordance with an embodiment of the present disclosure.
As shown in FIG. 4, the text features and the visual features input to the fusion module are spliced in the D dimension through a character string splicing function to generate a spliced feature map; the string concatenation function may be the CAT function in PyTorch. Specifically, the attention weight can be calculated by a linear function and a softmax classifier, the text features and the visual features are weighted and summed based on the attention weight, and a further linear calculation is performed through a linear function to obtain the fusion feature. The above process is shown in formulas (1) and (2):
where [;] denotes the splicing operation, Linear is a linear function, and w is the weight matrix output by the attention mechanism. Through this method, the noise-filtered fusion feature can be obtained.
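A minimal sketch of this fusion step follows; because formulas (1) and (2) are not reproduced here, the exact weighting form is an assumption (a per-channel gate), and only the splice-weight-mix structure follows the description above.

```python
# Assumed fusion step: concatenate text and visual features along the encoding
# dimension, derive a weight with a linear layer, and mix the two streams.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.weight = nn.Linear(2 * d_model, d_model)   # attention/gate weights
        self.out = nn.Linear(d_model, d_model)          # further linear step

    def forward(self, f_t: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
        f_cat = torch.cat([f_t, f_v], dim=-1)           # splice along D
        w = torch.sigmoid(self.weight(f_cat))           # assumed gating form
        fused = w * f_t + (1.0 - w) * f_v               # weighted summation
        return self.out(fused)                          # noise-filtered fusion
```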
The residual stacking module is divided into two parts: a fusion learning part and a feature reinforcement learning part. The fusion learning part extracts sequence information and spatial information from the spliced fusion features. The fusion learning network consists of a GRU module and 2 repeated convolution blocks. The fusion network uses a convolution block as its main frame: a one-dimensional GRU module serializes the input feature map, which is then processed by a convolution block consisting of a one-dimensional convolution layer, a batch normalization layer and an activation function, as shown in formula (3):
the GRU is a GRU function in Pytorch, and the number of input and output channels is D. Conv 1×1 The one-dimensional convolution with the convolution kernel of 1 has the same number of channels as the GRU function.
F_0 then enters the body portion of the fusion learning network. The body frame consists of 2 repeated convolution blocks, each composed of a convolution layer, a standard normalization layer and an activation layer. The convolution layer adopts a one-dimensional convolution layer with a convolution kernel size of 3×3, and the number of convolution channels is set to 512. The smoother SoftPlus function is used as the activation function. The above procedure is shown in formula (4):
F_1 = M_1(M_2(F_0)) (4)
where M_i (i = 1, 2) denotes a convolution block consisting of a convolution layer, a normalization layer and an activation function. The two convolution blocks M_1 and M_2 use the same network parameters.
After the GRU and the 2 stacked convolution blocks, F_1 and the fusion feature are added through a residual connection, and the result is sent to the feature enhancement network, as shown in formula (5):
according to embodiments of the present disclosure, joint feature representation and prediction of visual and text features may be achieved through feature fusion, feature reinforcement, and confidence computation. After the visual features and the text features are fused through the character string splicing function to obtain fusion features, the fused features are sequentially input into a plurality of residual error stacking modules, the residual error stacking modules can dynamically adjust the importance of residual errors according to the input fusion features, and more attention is paid to the features beneficial to the current task, so that the expression capability of the fusion features and the recognition accuracy in an actual production scene are improved.
According to an embodiment of the present disclosure, inputting fusion features sequentially into a plurality of residual stacking modules in a feature fusion sub-model to perform feature enhancement on the fusion features, outputting enhancement features, including: inputting the fusion features into a convolution layer of a first residual stacking module to perform feature extraction on the fusion features and outputting extracted features, wherein the first residual stacking module is a first residual stacking module in a plurality of residual stacking modules; adding the fusion feature and the extraction feature based on a residual connection mode to obtain a spliced feature; inputting the spliced characteristics into a characteristic reinforcing network of a first residual stacking module, and outputting intermediate characteristics; and inputting the intermediate features into a plurality of second residual stacking modules according to the connection relation among the residual stacking modules to sequentially perform feature reinforcement and output reinforced features, wherein the second residual stacking modules are residual stacking modules except the first residual stacking modules.
According to an embodiment of the disclosure, the residual stacking module may include two parts, fusion learning and feature reinforcement learning; fusion learning is completed through a fusion learning network, and feature reinforcement learning is completed through a feature enhancement network. The fusion learning network may include a gated recurrent unit (GRU) module and one or more convolution blocks. The fusion network takes a convolution block as its main frame: a one-dimensional GRU module serializes the input feature map, which is then processed by a convolution block consisting of a one-dimensional convolution layer, a batch normalization layer and an activation function. The convolution block consists of a convolution layer, a standard normalization layer and an activation layer. A smooth SoftPlus function may be chosen as the activation function in the activation layer.
According to an embodiment of the disclosure, after convolution processing of the GRU and one or more stacks, the fusion feature and the extraction feature are added to obtain a splice feature by using a residual connection mode, and the splice feature is transmitted to a feature reinforcing network. The characteristic strengthening network mainly comprises a multi-head attention mechanism, the input spliced characteristic is weighted based on the dimension of the spliced characteristic and the requirement of text prediction, and a characteristic representation with more expressive capacity, namely an intermediate characteristic, is output.
According to an embodiment of the present disclosure, in addition to the first residual stacking module, the plurality of residual stacking modules may include a plurality of residual blocks with the same structure. As the residual stacking modules accumulate, each residual stacking module can reinforce the feature representation learned by the previous residual stacking module, thereby alleviating the problem of insufficient text recognition model performance caused by an overly simple feature fusion mechanism.
According to the embodiment of the disclosure, in the whole process, the fusion feature is subjected to feature extraction through a convolution layer, and then residual connection is performed with the original feature to obtain a spliced feature. And the spliced features are processed by a feature reinforcing network and a plurality of residual stacking modules, and finally the reinforced features are obtained and used as output. The design can help the network learn richer feature representations, and transmits important feature information through residual connection, so that the network performance and training effect are improved.
According to an embodiment of the present disclosure, inputting splice features into a feature enhancement network of a first residual stacking module, outputting intermediate features, includes: determining a position coding matrix corresponding to the splicing characteristic by using a position mapping function based on characteristic information of the splicing characteristic; splicing the position coding matrix and the splicing characteristics by using a character string splicing function to obtain first characteristics; inputting the first feature into a self-attention mechanism layer in a feature enhancement network, and outputting a second feature; and inputting the first feature and the second feature into a normalization layer in the feature strengthening network for normalization processing, and outputting intermediate features.
According to embodiments of the present disclosure, a feature-enhanced network is framed with a multi-headed attentiveness mechanism. Feature information of the stitching feature refers to information in the stitching feature that may characterize the location of text characters. The position mapping function may be a sin-cos position mapping function, and using the position mapping function, a position encoding matrix corresponding to the splice feature may be generated, where the matrix may represent position information of the splice feature in the sequence. And splicing the position coding matrix and the splicing characteristic by using a character string splicing function to obtain a first characteristic. To ensure that the two dimensions can be aligned, a stitching operation along a certain axis may be employed.
According to embodiments of the present disclosure, the self-attention mechanism may capture relationships and interactions between features by calculating attention weights between features. In particular, the attention score is calculated by linear transformation of the first feature. These scores represent the correlation between the positions of the different characters in the first feature. The attention score is softmax operated to obtain the attention weight. These weights represent the importance in context of the location of the individual characters in the first feature. The first features are weighted and summed by using the attention weight to obtain a feature representation regulated by the attention mechanism, namely a second feature.
According to embodiments of the present disclosure, through such a process, the self-attention mechanism may capture the relationship between vectors or matrices representing individual characters in the first feature, thereby generating a richer feature representation, contributing to improved expressive power of the feature and performance of the model.
According to an embodiment of the disclosure, the first feature and the second feature are input into a normalization layer in a feature enhancement network for normalization processing to obtain an intermediate feature. The normalization operation may be implemented using a normalization layer algorithm (Layer Normalization, LN) function in a deep learning model. Normalization helps to improve the stability and interpretability of the feature representation.
Fig. 5 schematically illustrates a framework diagram of a feature-enhanced network according to an embodiment of the present disclosure.
As shown in fig. 5, the feature enhancement network may take a multi-head attention mechanism layer as its body framework. Sin-cos position mapping ensures that the parallel-input feature map F_2 carries the relevant position information, as shown in formula (6):
where the value of d_pos may be set to 16. The size of the sin-cos position map can be set to 32×26, and the mapped position coding matrix is F_p. F_p can simultaneously serve as the Q, K and V values of the multi-head attention mechanism. The scope of attention is restricted by an attention mask, the masked Q and K values are processed by a softmax function to obtain the attention score, and the attention score is randomly deactivated by a regularization layer (dropout layer). The deactivated feature map is then normalized with a linear function; specifically, the Layer Normalization function of the deep learning model can be applied over the 3 dimensions of the output feature map, as shown in formula (7):
wherein m and s represent the mean and standard deviation, W_a and W_b represent learnable parameter matrices, F_* denotes the intermediate result of each residual stacking block, and Multihead denotes a multi-head attention network.
The intermediate features can be subjected to multiple normalization processes to ensure the stability of the intermediate features and improve the generalization capability of the text recognition model.
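A compact sketch of one such enhancement step is given below, assuming PyTorch modules stand in for the components named above (multi-head attention with F_p as Q, K and V, dropout as the random inactivation, and layer normalization applied over the channel dimension rather than all 3 dimensions for simplicity); the class name and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class FeatureEnhancementBlock(nn.Module):
    """One residual-stacked enhancement step: multi-head attention over the position-encoded
    feature, random inactivation (dropout), then layer normalization whose learnable affine
    parameters play the role of W_a and W_b."""
    def __init__(self, dim: int, num_heads: int = 2, drop: float = 0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.drop = nn.Dropout(drop)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_p: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # f_p serves simultaneously as the query, key and value of the multi-head attention.
        attended, _ = self.mha(f_p, f_p, f_p, attn_mask=attn_mask)
        intermediate = self.norm(f_p + self.drop(attended))        # residual add, then normalization
        return intermediate

block = FeatureEnhancementBlock(dim=42)
intermediate_feature = block(torch.randn(4, 32, 42))               # (4, 32, 42)
```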
According to embodiments of the present disclosure, the visual features in the fusion feature are gradually fused with the text features as the number of residual stacking modules increases. By inputting the spliced features into the first residual stacking module for the self-attention operation and the normalization operation, the proportion of semantic information of the text during training is increased, and the sensitivity of the text recognition model to semantic information is improved.
According to an embodiment of the present disclosure, before outputting the reinforcement feature, further comprising: inputting the intermediate features subjected to feature reinforcement into a linear layer in a feature fusion submodel, and performing linear transformation on the intermediate features subjected to feature reinforcement to obtain reinforced features.
According to embodiments of the present disclosure, the intermediate features may be linearly transformed by matrix multiplication and the addition of a bias term. The intermediate features are thereby mapped onto probability distributions over the different attribute categories of the predicted text. This linearly transformed probability map helps the text recognition model obtain more representative features, so that subsequent tasks can be accomplished better.
According to embodiments of the present disclosure, the enhanced features of the text recognizer may be obtained by using a linear layer to map the dimensions of the intermediate features, yielding a tensor of size N × L × R, wherein N is the batch size, L is the maximum length of the recognized word, and R is the total number of text categories. The above procedure is shown in formula (8):
wherein Block_1 and Block_2 respectively denote the 2 residual stacking modules, which have different network depths.
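The linear mapping at the output stage of formula (8) can be sketched as below; the sizes N, L, R and the feature width are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Assumed illustrative sizes: batch N=4, maximum word length L=32, text categories R=97, feature width 42.
N, L, R, dim = 4, 32, 97, 42

# The linear layer performs a matrix multiplication plus bias, mapping each position's
# intermediate feature onto R class scores, as described above.
to_classes = nn.Linear(dim, R)
intermediate_features = torch.randn(N, L, dim)
enhanced_features = to_classes(intermediate_features)              # (N, L, R)
probabilities = enhanced_features.softmax(dim=-1)                  # distribution over the R text categories
```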
According to an embodiment of the present disclosure, determining a model loss value of a text recognition model based on a visual feature, a text prediction result, and a text label corresponding to a sample image includes: inputting the visual characteristics and the text labels corresponding to the sample images into a loss function of the visual characteristic extraction submodel, and outputting a first loss value; inputting the text feature and the text label into a loss function of the text feature extraction sub-model, and outputting a second loss value; inputting the text prediction result and the text label into a loss function of the feature fusion submodel, and outputting a third loss value; and adding the first loss value, the second loss value and the third loss value to obtain a model loss value.
According to an embodiment of the present disclosure, the model loss value represents a loss value of the gradient pass back in the model. By training a text recognition model using multiple penalty functions, the robustness and generalization ability of the model can be improved. Each loss function can be seen as a constraint on different aspects of the text recognition model, and by integrating these constraints, the model can be better constrained and the risk of overfitting reduced. Meanwhile, a plurality of loss functions are added to be used as final loss functions, and a plurality of tasks can be optimized simultaneously when the text recognition model is trained, so that the efficiency and the accuracy of the model are improved. The gradient of this composite loss function can then be calculated by a back-propagation algorithm, updating the weights and bias of the model to minimize the value of the overall loss function. Therefore, the efficiency, accuracy and robustness of the text recognition model are improved, and a better text recognition result is obtained.
In the training process, the visual feature extraction sub-model, the text feature extraction sub-model and the feature fusion sub-model each produce a prediction result and a corresponding loss value; these 3 loss values are combined, and the loss value returned by the final gradient is the sum of the losses of the 3 parts, as shown in formula (9):
loss = ||loss_V + loss_L + loss_A||  (9)
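A minimal sketch of formula (9), assuming the three sub-model losses have already been computed as scalar tensors:

```python
import torch

# Placeholders standing in for the visual, text and fusion sub-model losses.
loss_v = torch.tensor(0.8, requires_grad=True)
loss_l = torch.tensor(0.5, requires_grad=True)
loss_a = torch.tensor(0.3, requires_grad=True)

total_loss = loss_v + loss_l + loss_a    # single model loss value returned for the gradient
total_loss.backward()                    # one backward pass drives all three sub-models jointly
```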
in accordance with embodiments of the present disclosure, high performance graphics processors and processors may be employed for training and testing. The text recognition model can set the initial learning rate to 0.0001, and the learning rate generally decays to 10% after the 1 st, 3 rd and 6 th iteration cycles in the practical process. The text recognition model can be trained for 8 iteration cycles, and the text recognition model with the highest verification accuracy is selected for testing.
Fig. 6 schematically illustrates a flow chart of a text recognition method according to an embodiment of the present disclosure.
As shown in fig. 6, the text recognition method includes operations S610 to S630.
In operation S610, the target image is input into the visual feature extraction sub-model trained by the text recognition model training method according to any one of the above embodiments, so as to extract spatial features of text information in the target image, thereby obtaining target visual features.
In operation S620, the target visual feature is input into the text feature sub-extraction model trained by the text recognition model training method according to any one of the above embodiments to perform text correction processing, and the target text feature is output.
In operation S630, the feature fusion sub-model trained by the text recognition model training method according to any one of the above embodiments is used to perform feature fusion on the target visual feature and the target text feature based on the plurality of attribute categories of the target image, and a text recognition result on the target image corresponding to the fused feature is output.
According to the embodiment of the disclosure, a target image for text recognition is uploaded into a text recognition model, and feature extraction is performed by a visual feature extraction sub-model based on spatial features such as the outline and the relative position of the text on the target image, so as to obtain target visual features. Predicting the text information characterized by the target visual features by the text feature sub-extraction model, and correcting the target visual features of the visual feature extraction sub-model based on the prediction result of the text feature sub-extraction model to obtain target text features.
According to embodiments of the present disclosure, the plurality of attribute categories of the target image may include batch size, text sequence length, and encoding dimension, among others. The feature fusion sub-model is based on a plurality of attribute categories of the target image, the target visual features and the target text features are spliced in the coding dimension through a splicing function to obtain fused features, the text of the target image is predicted based on the fused features, and a text recognition result about the target image is obtained.
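The inference flow of operations S610 to S630 can be sketched as follows; the three sub-model objects are placeholders for the trained sub-models, and the concatenation is shown explicitly for clarity even though the disclosure performs it inside the feature fusion sub-model.

```python
import torch

def recognize_text(target_image, visual_model, text_model, fusion_model):
    """Hypothetical inference pipeline chaining the three trained sub-models."""
    target_visual = visual_model(target_image)                 # spatial features of the text information
    target_text = text_model(target_visual)                    # corrected text features
    fused = torch.cat([target_visual, target_text], dim=-1)    # splice along the encoding dimension
    return fusion_model(fused)                                  # text recognition result for the target image
```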
According to the embodiment of the disclosure, compared with a traditional text recognition model, the text recognition model obtained by the text recognition model training method can effectively correct and recognize unrecognizable characters in the text according to the context information of the text extracted from the target image, so that the model accuracy is greatly improved.
Based on the text recognition model training method, the disclosure also provides a text recognition model training device. The device will be described in detail below in connection with fig. 7.
Fig. 7 schematically illustrates a block diagram of a text recognition model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the text recognition model training apparatus 700 of this embodiment includes a visual extraction module 710, a text extraction module 720, a feature fusion module 730, a loss determination module 740, and a model training module 750.
The visual extraction module 710 is configured to input the sample image into a visual feature extraction sub-model to perform spatial feature extraction on text information in the sample image and output visual features. In an embodiment, the visual extraction module 710 may be configured to perform the operation S210 described above, which is not described herein.
The text extraction module 720 is configured to input the visual feature into the text feature extraction sub-model to perform text correction processing, and output a text feature corresponding to the text information. In an embodiment, the text extraction module 720 may be used to perform the operation S220 described above, which is not described herein.
The feature fusion module 730 is configured to perform feature fusion on the visual feature and the text feature by using the feature fusion submodel based on a plurality of attribute categories of the sample image, and output a text prediction result about the sample image corresponding to the fused feature. In an embodiment, the feature fusion module 730 may be configured to perform the operation S230 described above, which is not described herein.
The penalty determination module 740 is configured to determine a model penalty value for the text recognition model based on the visual features, the text prediction results, and the text labels corresponding to the sample images. The loss determination module 740 may be configured to perform the operation S240 described above, and will not be described herein.
The model training module 750 is configured to optimize the visual feature extraction sub-model, the text feature extraction sub-model, and the fusion feature sub-model based on the model loss value, respectively, to obtain a trained text recognition model. The model training module 750 may be used to perform the operation S250 described above, and will not be described herein.
According to an embodiment of the present disclosure, the feature fusion module 730 includes: the device comprises a feature fusion sub-module, a feature enhancer module and a result prediction sub-module.
And the feature fusion sub-module is used for carrying out feature fusion on the visual features and the text features by utilizing a character string splicing function in the feature fusion sub-model based on a plurality of attribute categories of the sample image to obtain fusion features.
And the feature enhancement sub-module is used for sequentially inputting the fusion features into a plurality of residual stacking modules in the feature fusion sub-model so as to enhance the features of the fusion features and output enhanced features.
And the result prediction sub-module is used for respectively carrying out confidence calculation on a plurality of characters in the enhanced features based on a preset character feature set in the feature fusion sub-model so as to determine a text prediction result.
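A rough sketch of the confidence calculation over a preset character set follows, with a tiny hypothetical vocabulary; real category sets and decoding rules would differ.

```python
import torch

charset = ["<blank>", "a", "b", "c"]                       # hypothetical preset character feature set
enhanced_features = torch.randn(1, 5, len(charset))        # (N=1, L=5, R) scores per character slot
confidence = enhanced_features.softmax(dim=-1)             # per-position confidence over the character set
best_conf, best_idx = confidence.max(dim=-1)               # most confident character at each position
predicted = "".join(charset[i] for i in best_idx[0].tolist() if charset[i] != "<blank>")
```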
According to an embodiment of the present disclosure, a feature enhancement submodule includes: the device comprises a feature extraction unit, a feature splicing unit, a feature input unit and a feature output unit.
The feature extraction unit is used for inputting the fusion features into a convolution layer of the first residual stacking module to perform feature extraction on the fusion features and outputting the extracted features, wherein the first residual stacking module is a first residual stacking module in the plurality of residual stacking modules.
And the characteristic splicing unit is used for adding the fusion characteristic and the extraction characteristic based on a residual connection mode to obtain a spliced characteristic.
And the characteristic input unit is used for inputting the spliced characteristic into the characteristic reinforcing network of the first residual stacking module and outputting an intermediate characteristic.
And the characteristic output unit is used for inputting the intermediate characteristics into a plurality of second residual stacking modules according to the connection relation among the residual stacking modules to sequentially perform characteristic reinforcement and output reinforced characteristics, wherein the second residual stacking modules are residual stacking modules except the first residual stacking modules.
According to an embodiment of the present disclosure, a feature input unit includes: the device comprises a matrix determining subunit, a first determining subunit, a second determining subunit and a characteristic output subunit.
And the matrix determining subunit is used for determining a position coding matrix corresponding to the splicing characteristic by utilizing a position mapping function based on the characteristic information of the splicing characteristic.
And the first determining subunit is used for splicing the position coding matrix and the splicing characteristic by utilizing the character string splicing function to obtain a first characteristic.
A second determining subunit, configured to input the first feature into a self-attention mechanism layer in the feature enhancement network and output a second feature.
And the feature output subunit is used for inputting the first feature and the second feature into a normalization layer in the feature reinforcing network for normalization processing and outputting an intermediate feature.
According to an embodiment of the present disclosure, the feature enhancer module further comprises: a first determination unit.
The first determining unit is used for inputting the intermediate features subjected to feature reinforcement into a linear layer in the feature fusion submodel, and performing linear transformation on the intermediate features subjected to feature reinforcement to obtain reinforced features.
According to an embodiment of the present disclosure, the loss determination module 740 includes: the first, second, third and fourth determining sub-modules.
And the first determining sub-module is used for inputting the visual characteristics and the text labels corresponding to the sample images into the loss function of the visual characteristic extraction sub-model and outputting a first loss value.
And the second determining submodule is used for inputting the text feature and the text label into the loss function of the text feature extraction submodule and outputting a second loss value.
And the third determining sub-module is used for inputting the text prediction result and the text label into the loss function of the feature fusion sub-model and outputting a third loss value.
And the fourth determining submodule is used for adding the first loss value, the second loss value and the third loss value to obtain a model loss value.
Any of the multiple modules of the visual extraction module 710, the text extraction module 720, the feature fusion module 730, the loss determination module 740, and the model training module 750 may be combined in one module, or any of the modules may be split into multiple modules, according to embodiments of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the visual extraction module 710, the text extraction module 720, the feature fusion module 730, the loss determination module 740, and the model training module 750 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of any of the three. Alternatively, at least one of the visual extraction module 710, the text extraction module 720, the feature fusion module 730, the loss determination module 740, and the model training module 750 may be at least partially implemented as computer program modules that, when executed, perform the corresponding functions.
Based on the text recognition method, the disclosure also provides a text recognition device. The device will be described in detail below in connection with fig. 8.
Fig. 8 schematically shows a block diagram of a text recognition device according to an embodiment of the present disclosure.
As shown in fig. 8, the text recognition apparatus 800 of this embodiment includes a first determination module 810, a second determination module 820, and a third determination module 830.
The first determining module is configured to input the target image into the visual feature extraction sub-model trained by the text recognition model training method according to any one of the above embodiments, so as to extract spatial features of text information in the target image, and obtain target visual features.
And the second determining module is used for inputting the target visual characteristics into the text characteristic sub-extraction model trained by the text recognition model training method in any one of the embodiments to perform text correction processing and outputting the target text characteristics.
And the third determining module is used for carrying out feature fusion on the target visual feature and the target text feature by utilizing the feature fusion sub-model trained by the text recognition model training method according to any one of the embodiments based on a plurality of attribute categories of the target image, and outputting a text recognition result corresponding to the fused feature and related to the target image.
Any of the first, second, and third determining modules 810, 820, and 830 may be combined in one module to be implemented, or any of them may be split into a plurality of modules, according to an embodiment of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first determination module 810, the second determination module 820, and the third determination module 830 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, at least one of the first determination module 810, the second determination module 820, and the third determination module 830 may be at least partially implemented as computer program modules, which when executed, may perform the corresponding functions.
Fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement a text recognition model training method, a text recognition method, according to an embodiment of the disclosure.
As shown in fig. 9, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to an input/output (I/O) interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to an input/output (I/O) interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the text recognition model training method and the text recognition method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, such as Java, c++, python, "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (11)

1. A text recognition model training method, the text recognition model comprising: a visual feature extraction sub-model, a text feature extraction sub-model, and a fusion feature sub-model, the method comprising:
Inputting a sample image into the visual feature extraction sub-model to extract spatial features of text information in the sample image and output visual features;
inputting the visual features into the text feature extraction submodel to perform text correction processing, and outputting text features corresponding to the text information;
based on a plurality of attribute categories of the sample image, carrying out feature fusion on the visual features and the text features by utilizing the feature fusion submodel, and outputting a text prediction result corresponding to the fused features and related to the sample image;
determining a model loss value of the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image;
and respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the fusion feature sub-model based on the model loss value to obtain a trained text recognition model.
2. The method of claim 1, wherein the feature fusion of the visual feature and the text feature using the feature fusion sub-model based on the plurality of attribute categories of the sample image, outputting a text prediction result for the sample image corresponding to the fused feature, comprises:
Based on a plurality of attribute categories of the sample image, performing feature stitching on the visual features and the text features by utilizing a character string stitching function in the feature fusion submodel to obtain fusion features;
sequentially inputting the fusion features into a plurality of residual stacking modules in the feature fusion sub-model so as to perform feature reinforcement on the fusion features and output reinforced features;
and respectively carrying out confidence calculation on a plurality of characters in the enhanced feature based on a preset character feature set in the feature fusion submodel so as to determine the text prediction result.
3. The method of claim 2, wherein the inputting the fusion feature into the plurality of residual stacking modules in the feature fusion sub-model in sequence to feature enhance the fusion feature, outputting an enhanced feature, comprises:
inputting the fusion features into a convolution layer of a first residual stacking module to perform feature extraction on the fusion features and outputting extracted features, wherein the first residual stacking module is a first residual stacking module in a plurality of residual stacking modules;
adding the fusion feature and the extraction feature based on a residual connection mode to obtain a spliced feature;
Inputting the spliced features into a feature reinforcing network of a first residual stacking module, and outputting intermediate features;
and inputting the intermediate features into a plurality of second residual stacking modules according to the connection relation among the residual stacking modules, sequentially carrying out feature reinforcement, and outputting the reinforced features, wherein the second residual stacking modules are residual stacking modules except the first residual stacking modules of the residual stacking modules.
4. A method according to claim 3, wherein said inputting the splice feature into a feature enhancement network of a first residual stacking module, outputting an intermediate feature, comprises:
determining a position coding matrix corresponding to the splicing characteristic by using a position mapping function based on the characteristic information of the splicing characteristic;
splicing the position coding matrix and the splicing features by using a character string splicing function to obtain first features;
inputting the first feature into a self-attention mechanism layer in the feature enhancement network, and outputting a second feature;
and inputting the first feature and the second feature into a normalization layer in the feature strengthening network for normalization processing, and outputting the intermediate feature.
5. The method of claim 3, wherein prior to said outputting the reinforcing feature, further comprising:
inputting the intermediate features subjected to feature reinforcement into a linear layer in the feature fusion submodel, and performing linear transformation on the intermediate features subjected to feature reinforcement to obtain the reinforced features.
6. The method of claim 1, wherein the determining a model loss value for the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image comprises:
inputting the visual features and the text labels corresponding to the sample images into a loss function of a visual feature extraction sub-model, and outputting a first loss value;
inputting the text feature and the text label into a loss function of a text feature extraction sub-model, and outputting a second loss value;
inputting the text prediction result and the text label into a loss function of a feature fusion submodel, and outputting a third loss value;
and adding the first loss value, the second loss value and the third loss value to obtain the model loss value.
7. A text recognition method, comprising:
Inputting a target image into a visual feature extraction sub-model trained by the text recognition model training method according to any one of claims 1 to 6 so as to extract spatial features of text information in the target image and obtain target visual features;
inputting the target visual characteristics into a text characteristic sub-extraction model trained by the text recognition model training method according to any one of claims 1-6 for text correction processing, and outputting target text characteristics;
based on a plurality of attribute categories of the target image, feature fusion is carried out on the target visual feature and the target text feature by utilizing a feature fusion sub-model obtained through training by the text recognition model training method according to any one of claims 1 to 6, and a text recognition result corresponding to the fused feature and related to the target image is output.
8. A text recognition model training apparatus comprising:
the visual extraction module is used for inputting a sample image into the visual feature extraction sub-model so as to extract spatial features of text information in the sample image and output visual features;
the text extraction module is used for inputting the visual features into the text feature extraction submodel to perform text correction processing and outputting text features corresponding to the text information;
The feature fusion module is used for carrying out feature fusion on the visual features and the text features by utilizing the feature fusion submodel based on a plurality of attribute categories of the sample image, and outputting text prediction results corresponding to the fused features and related to the sample image;
a loss determination module for determining a model loss value of the text recognition model based on the visual feature, the text prediction result, and a text label corresponding to the sample image; and
and the model training module is used for respectively optimizing the visual feature extraction sub-model, the text feature extraction sub-model and the fusion feature sub-model based on the model loss value to obtain a trained text recognition model.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6 or claim 7.
10. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to perform the method according to any one of claims 1 to 6 or claim 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6 or claim 7.
CN202311747946.XA 2023-12-19 2023-12-19 Text recognition model training method, text recognition device and equipment Pending CN117612188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311747946.XA CN117612188A (en) 2023-12-19 2023-12-19 Text recognition model training method, text recognition device and equipment

Publications (1)

Publication Number Publication Date
CN117612188A true CN117612188A (en) 2024-02-27

Family

ID=89953552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311747946.XA Pending CN117612188A (en) 2023-12-19 2023-12-19 Text recognition model training method, text recognition device and equipment

Country Status (1)

Country Link
CN (1) CN117612188A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination