CN114973267A - Model training method, text recognition method, device, medium and equipment - Google Patents

Model training method, text recognition method, device, medium and equipment

Info

Publication number
CN114973267A
Authority
CN
China
Prior art keywords
sample
text
recognition result
network
text recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210615726.0A
Other languages
Chinese (zh)
Inventor
王少康 (Wang Shaokang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhitong Oriental Software Technology Co., Ltd.
Original Assignee
Beijing Zhitong Oriental Software Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhitong Oriental Software Technology Co., Ltd.
Priority to CN202210615726.0A
Publication of CN114973267A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/16 Image preprocessing
    • G06V 30/1607 Correcting image deformation, e.g. trapezoidal deformation caused by perspective
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to a model training method, a text recognition method, an apparatus, a medium, and a device, which address the low text recognition accuracy of the related art. The model training method comprises the following steps: obtaining a sample image, wherein the sample image comprises a sample text, and the sample text is marked with a sample text label; inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network; calculating a loss value according to the first sample recognition result, the second sample recognition result and the sample text label; and adjusting parameters of the text recognition model according to the loss value.

Description

Model training method, text recognition method, device, medium and equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a model training method, a text recognition method, an apparatus, a medium, and a device.
Background
OCR (Optical Character Recognition) is one of the important directions of computer vision. OCR may perform character recognition on scanned-document objects, and may also perform character recognition in natural scenes, that is, Scene Text Recognition (STR). OCR in educational scenes typically performs character recognition on images of test papers, scanned documents, and the like taken by a mobile device. Such photographed document images suffer from affine distortion, large scale variation, curved text, background interference, varied fonts, insufficient illumination, motion blur, multiple languages, and other problems, all of which pose great technical challenges to text recognition. Text recognition methods for educational scenes in the related art cannot handle these problems well, so their text recognition accuracy is low.
Disclosure of Invention
The present disclosure is directed to a model training method, a text recognition method, an apparatus, a medium, and a device, so as to solve the problem of low text recognition accuracy in the related art.
In order to achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided a model training method, the method including:
obtaining a sample image, wherein the sample image comprises a sample text, and the sample text is marked with a sample text label;
inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network;
calculating a loss value according to the first sample recognition result, the second sample recognition result and the sample text label;
and adjusting parameters of the text recognition model according to the loss value.
Optionally, the text recognition model further includes a high resolution network, and the inputting of the sample image into the text recognition model to obtain the first sample recognition result and the second sample recognition result output by the text recognition model includes:
inputting the sample image into the high-resolution network to obtain an initial feature map sample;
determining a text contour feature map sample, a text direction feature map sample and a text character class feature map sample according to the initial feature map sample;
performing a sequence conversion operation on the text contour feature map sample, the text direction feature map sample and the text character class feature map sample to obtain a feature vector sample sequence;
inputting the feature vector sample sequence into the attention mechanism network and the connection timing classification network to obtain a first character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the connection timing classification network and a second character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the attention mechanism network;
and obtaining the first sample recognition result according to the first character class distribution probability sample, and obtaining the second sample recognition result according to the second character class distribution probability sample.
Optionally, the attention mechanism network comprises an encoder and a decoder, and the inputting the feature vector sample sequence into the attention mechanism network and the connection timing classification network comprises:
inputting the characteristic vector sample sequence into an encoder of the attention mechanism network to obtain an encoded characteristic vector sample sequence;
inputting the sequence of encoded feature vector samples into a decoder of the attention mechanism network and the connection timing classification network.
Optionally, the method further comprises:
determining a text boundary offset feature map sample according to the initial feature map sample, wherein the text boundary offset feature map sample represents the position of the sample text recognition result in the sample image.
Optionally, the training parameters include a loss weight value, and the loss value is calculated by the following formula:
L_MTL = λ·L_CTC + (1 - λ)·L_Attention
wherein L_MTL is the loss value, L_CTC is the loss function of the connection timing classification network, L_Attention is the loss function of the attention mechanism network, and λ is the loss weight value.
Optionally, the text recognition model is a fully convolutional point gathering network (PGNet) model.
According to a second aspect of the embodiments of the present disclosure, there is provided a text recognition method, the method including:
acquiring an image to be recognized, wherein the image to be recognized comprises a text to be recognized;
inputting the image to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition model is obtained by training through the model training method of any one of the first aspect.
Optionally, the text recognition result includes a first recognition result and a second recognition result, and the method further includes:
and performing weighted calculation according to the first recognition result and the second recognition result to obtain a target text recognition result.
According to a third aspect of embodiments of the present disclosure, there is provided a model training apparatus, the apparatus comprising:
the device comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is used for obtaining a sample image, the sample image comprises a sample text, and the sample text is marked with a sample text label;
the first input module is used for inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connection time sequence classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection time sequence classification network and a second sample recognition result output by the attention mechanism network;
the calculation module is used for calculating a loss value according to the first sample identification result, the second sample identification result and the sample text label;
and the adjusting module is used for adjusting the parameters of the text recognition model according to the loss value.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a text recognition apparatus, the apparatus including:
the second acquisition module is used for acquiring an image to be recognized, wherein the image to be recognized comprises a text to be recognized;
and the second input module is used for inputting the image to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition model is obtained by training through the model training method in any one of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first or second aspects described above.
According to a sixth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any of the first or second aspects above.
With the above technical solution, the sample image is input into the text recognition model to obtain a sample text recognition result. The text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network. Because the loss value is calculated from both the first sample recognition result and the second sample recognition result, the connection timing classification network can be used to force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flowchart illustrating a model training method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a text recognition method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a diagram illustrating a text recognition result according to an exemplary embodiment of the present disclosure.
FIG. 4 is a block diagram of a model training apparatus, shown in an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a text recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram of an electronic device shown in an exemplary embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
It should be noted that all actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
Related-art OCR in educational scenes involves both conventional methods and deep learning methods, where deep learning includes a single-stage approach (i.e., end-to-end text detection and recognition) and a two-stage approach (i.e., separate text detection and text recognition).
The conventional OCR method relies on image processing techniques and statistical machine learning: it usually performs text detection with hand-crafted features (for example, SWT, MSER, and similar methods) and then recognizes the detected text by template matching or a trained model. The conventional OCR method can be divided into three stages: image preprocessing, character recognition and post-processing. Image preprocessing completes character region localization, character rectification, character segmentation and similar processing. Character recognition recognizes the segmented characters; usually, features of the characters are extracted by hand-crafted descriptors (such as HOG features) or by a CNN, and then recognized by a machine learning classifier (such as an SVM), as sketched below. Post-processing corrects the recognition result with preset rules, language models and the like, so as to perform page restoration, recognition correction and similar operations. However, the conventional method adapts poorly to variations in character shape (for example, blurred characters, stroke adhesion, broken strokes, uneven stroke darkness, ink bleed-through and the like), has weak anti-interference ability, and cannot adapt to text recognition in different scenes (the conventional method usually requires the parameters of each module to be designed separately for each scene, and it is difficult to design a model with good generalization for complex scenes).
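For illustration of the hand-crafted pipeline just described, the following minimal sketch pairs HOG features with an SVM classifier on the scikit-learn digits dataset; the dataset, feature settings and classifier are stand-ins chosen for this example, not methods prescribed by this disclosure:

```python
# Sketch of the traditional "hand-crafted feature + classifier" stage: the
# digits dataset stands in for segmented character crops produced by the
# preprocessing stage; parameters are illustrative.
from skimage.feature import hog
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()  # 8x8 grayscale "character" images
features = [hog(img, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
            for img in digits.images]
X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.2, random_state=0)

clf = svm.SVC(kernel="rbf")  # per-character classifier
clf.fit(X_train, y_train)
print("character accuracy:", clf.score(X_test, y_test))
```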
OCR based on deep learning generally uses a convolutional neural network instead of hand-crafted feature extraction for text detection, and then uses a neural network to recognize the detected text. The basic idea of the single-stage approach is to design one model with both a detection unit and a recognition unit, where the two units share the CNN features (i.e., the features extracted by the convolutional neural network) and are trained jointly. In the application stage, the end-to-end recognition model can predict both the position and the content of the text in an image in one forward propagation. However, training a single-stage model often requires character-level annotations (e.g., Mask TextSpotter and CharNet), which are costly. Moreover, the single-stage approach usually predefines the text reading direction (for example, TextDragon and Mask TextSpotter strongly assume that the reading direction of a text region is left-to-right or top-to-bottom), so its recognition of text with a non-traditional reading direction is poor.
The two-stage approach splits text detection and text recognition into two parts: text lines are located by text detection, and the content of the located text lines is then identified by text recognition. For text detection, regression-based or segmentation-based methods may be employed. Regression-based methods detect irregularly shaped or sparsely distributed text poorly (for example, CTPN handles inclined and curved text poorly, and SegLink handles sparsely distributed text poorly). Segmentation-based algorithms require complex post-processing, raise runtime performance concerns, and detect overlapping text poorly. Text recognition may include regular text recognition (e.g., recognizing horizontally laid-out text such as print or scanned text) and irregular text recognition (i.e., recognizing text that does not lie in horizontal lines). For regular text recognition, a network model such as CTC or Sequence2Sequence may be used. For irregular text recognition, a network model such as STAR-Net, RARE, or Transformer may be used. However, the two-stage approach optimizes multiple stages during training and involves time-consuming operations such as non-maximum suppression (NMS) and RoI (Region of Interest) extraction, which brings non-negligible computational overhead and affects the training of the model.
In view of the above, the present disclosure provides a model training method, a text recognition method, an apparatus, a medium, and a device. The text recognition model is trained in a single-stage manner, avoiding the computational overhead of the two-stage approach, and the feature map is decoded jointly by a connection timing classification network and an attention mechanism network. The connection timing classification network forces a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
The following provides a detailed description of embodiments of the present disclosure.
Fig. 1 is a flowchart illustrating a model training method according to an exemplary embodiment of the disclosure, where the method includes:
s101, obtaining a sample image, wherein the sample image comprises a sample text, and the sample text is marked with a sample text label.
The sample image may be an image of a test paper, a scanned document, or the like photographed by a mobile device in an educational scene, and the image may include sample text. The sample text label may be the ground-truth text of the sample image (i.e., the sample text in the sample image), and may be labeled manually before model training or labeled by a machine learning method of the related art.
S102, inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network.
The connection timing classification network may be a CTC (Connectionist Temporal Classification) network, and the attention mechanism network may be an Attention network. A CTC network uses a forward-backward algorithm that forces a monotonic alignment between input and output, thus ensuring a stable alignment even when the input data is noisy. However, a CTC network assumes independence between outputs at different time steps, each output being an independent single-character probability, so the overall information is ignored. An Attention network introduces no alignment constraint; during alignment it selects, from all inputs, the target input corresponding to an output, in which case decoding can leave the input and output misaligned.
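The forced monotonic alignment of CTC can be exercised directly with PyTorch's built-in CTC loss; the following is a minimal sketch with illustrative shapes, assuming a 37-class vocabulary (1 blank, 26 letters, 10 digits) as in the English setting discussed later:

```python
# Hedged sketch: scoring a feature sequence against a label sequence under
# CTC's forced monotonic alignment, via PyTorch's built-in CTCLoss.
import torch
import torch.nn as nn

T, N, C = 32, 4, 37                                # time steps, batch, classes
log_probs = torch.randn(T, N, C).log_softmax(2)    # per-step class log-probs
targets = torch.randint(1, C, (N, 10))             # label sequences (no blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)                     # index 0 reserved for blank
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```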
According to the model training method provided by this embodiment of the present disclosure, the connection timing classification network and the attention mechanism network are combined, so that the connection timing classification network can force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is thus performed with both the overall information and the input-output alignment taken into account, which improves its accuracy. In addition, the alignment time of the attention mechanism network is reduced, so model convergence can be accelerated and model accuracy improved.
And S103, calculating a loss value according to the first sample identification result, the second sample identification result and the sample text label.
And S104, adjusting parameters of the text recognition model according to the loss value.
It is easy to understand that, by calculating the loss value from the first sample recognition result output by the connection timing classification network and the second sample recognition result output by the attention mechanism network, and adjusting the text recognition model parameters according to the loss value, the text recognition model can perform text recognition with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
With the above technical solution, the sample image is input into the text recognition model to obtain a sample text recognition result. The text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network. Because the loss value is calculated from both the first sample recognition result and the second sample recognition result, the connection timing classification network can be used to force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition. A schematic training step covering S101 to S104 is sketched below.
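In the sketch, the model, optimizer and the two loss callables are assumed to exist and are hypothetical names; the weighted sum follows the loss formula given later in this description:

```python
# A schematic single training step for S101-S104, assuming a model that
# returns both branch outputs. The weighted sum implements
# L_MTL = lambda * L_CTC + (1 - lambda) * L_Attention.
def train_step(model, optimizer, images, labels,
               compute_ctc_loss, compute_attention_loss, lambda_=0.5):
    ctc_out, att_out = model(images)   # S102: first / second sample results
    loss = (lambda_ * compute_ctc_loss(ctc_out, labels)
            + (1 - lambda_) * compute_attention_loss(att_out, labels))  # S103
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # S104: adjust model parameters
    return loss.item()
```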
Optionally, the text recognition model is a fully convolutional point gathering network (PGNet) model.
The fully convolutional point gathering network model may be referred to as PGNet (Point Gathering Network); this network model is an end-to-end character recognition model.
It should be noted that PGNet in the related art is mainly applied to English OCR in natural scenes, for detecting and recognizing English letters and digits. When PGNet is used for English OCR, 37 text character categories are set (26 English letters, 10 digits, and 1 background category). In this case, PGNet performs image feature extraction using ResNet. In OCR for Chinese educational scenes, since there are more than 6,000 commonly used Chinese characters, extracting image features is much harder for the model, and on this basis the effect of extracting image features with ResNet is not good. This is because visual feature extraction involves both region-level and pixel-level problems, and the representations learned by classification networks such as ResNet and VGGNet have low resolution; a high-resolution representation space restored from them is not discriminative enough, which makes it difficult for ResNet to obtain accurate predictions on tasks sensitive to spatial precision. Therefore, in addition to improving the decoding network of the related-art PGNet (i.e., decoding jointly with a connection timing classification network and an attention mechanism network), the present disclosure also improves the network PGNet uses to extract image features.
Optionally, the text recognition model further includes a high resolution network, and the step of inputting the sample image into the text recognition model to obtain a first sample recognition result and a second sample recognition result output by the text recognition model may include:
inputting the sample image into the high-resolution network to obtain an initial feature map sample;
determining a text contour feature map sample, a text direction feature map sample and a text character class feature map sample according to the initial feature map sample;
performing a sequence conversion operation on the text contour feature map sample, the text direction feature map sample and the text character class feature map sample to obtain a feature vector sample sequence;
inputting the feature vector sample sequence into the attention mechanism network and the connection timing classification network to obtain a first character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the connection timing classification network and a second character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the attention mechanism network;
and obtaining a first sample recognition result according to the first character class distribution probability sample, and obtaining a second sample recognition result according to the second character class distribution probability sample.
The high resolution network may be an HRNet (High-Resolution Network). HRNet changes the basic network structure by replacing the traditional serial connection of high-to-low resolution convolutions with parallel high- and low-resolution branches, so that a high resolution is maintained throughout the convolution process. Through repeated information exchange between the high- and low-resolution representations, the network learns rich high-resolution representations and accurate spatial information, yielding a feature map with stronger representation capability and higher resolution. A usage sketch follows.
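As a usage sketch, a high-resolution backbone can be dropped in via the timm library; the `hrnet_w18` variant and the input size here are illustrative assumptions, not prescribed by this disclosure:

```python
# Sketch of extracting multi-scale features with an HRNet backbone; the timm
# implementation is used purely for illustration.
import timm
import torch

backbone = timm.create_model("hrnet_w18", pretrained=False, features_only=True)
x = torch.randn(1, 3, 64, 256)      # e.g. a text-line image (dims divisible by 32)
feature_maps = backbone(x)          # list of feature maps, high to low resolution
print([tuple(f.shape) for f in feature_maps])
```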
It is understood that, in the embodiment of the present disclosure, HRNet is used to perform image feature extraction, and compared with ResNet, HRNet can obtain a feature map with stronger representation capability and higher resolution.
It should be noted that after feature extraction is performed on the sample image through the high-resolution network, an initial feature map sample may be obtained. Based on the initial feature map sample, a text contour feature map sample, a text direction feature map sample, a text character class feature map sample and a text boundary offset feature map sample can be obtained by parallel multi-task learning. The text contour feature map may be a 1-channel feature map used to characterize the Text Center Line (TCL). The text direction feature map may be a 2-channel feature map used to characterize the Text Direction Offset (TDO), where the text direction offset may refer to the offset of each TCL pixel in the text contour feature map to the next text reading position. The text character class feature map may be an n-channel feature map (n being the number of character classes, e.g., n = 37 in the English OCR setting above) used to characterize the Text Character Classification (TCC). The text boundary offset feature map may be a 4-channel feature map used to characterize the position of the sample text recognition result in the sample image, where the Text Boundary Offset (TBO) may refer to the offset of each TCL pixel in the text contour feature map to the upper and lower boundary points of the text region.
After the text contour feature map sample, the text direction feature map sample and the text character class feature map sample are obtained, a fused feature map of the three may be computed by concatenating them along the channel dimension (so the channel counts add up), and a sequence conversion operation (Map-to-Seq, an operation converting a feature map into a feature vector sequence) may be performed on the fused feature map to obtain a feature vector sample sequence, as sketched below. On this basis, the feature vector sample sequence can be input into the attention mechanism network and the connection timing classification network, so as to obtain a first character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the connection timing classification network, and a second character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the attention mechanism network. A character class distribution probability sample represents, for a feature vector sample, the probability of each character class.
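A hedged sketch of the fusion and Map-to-Seq steps described above, with illustrative shapes (n = 37 character classes as in the English example):

```python
# Map-to-Seq sketch: concatenate the branch feature maps along the channel
# axis, then unroll the width axis so each column becomes one feature vector.
import torch

tcl = torch.randn(1, 1, 16, 128)     # text contour (center line) map
tdo = torch.randn(1, 2, 16, 128)     # text direction offset map
tcc = torch.randn(1, 37, 16, 128)    # text character class map

fused = torch.cat([tcl, tdo, tcc], dim=1)              # channel-wise fusion
n, c, h, w = fused.shape
seq = fused.permute(0, 3, 1, 2).reshape(n, w, c * h)   # (batch, len, feature)
print(seq.shape)                                       # torch.Size([1, 128, 640])
```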
It is understood that the character class with the highest probability in the character class distribution probability samples of the feature vector may be the predicted character (i.e., the sample recognition result) corresponding to the feature vector. On the basis, a first sample recognition result can be obtained according to the first character type distribution probability sample, and a second sample recognition result can be obtained according to the second character type distribution probability sample. And, a target sample recognition result can be obtained by performing a weighted calculation according to the first sample recognition result and the second sample recognition result. Wherein the target sample identification result can be used to characterize the sample text of the sample image. The weight of the weighted calculation of the first sample recognition result and the second sample recognition result may be a training parameter of the model, and the training parameter may be adjusted according to the result of the iterative training of the model.
Optionally, the technical solution provided by the embodiment of the present disclosure may further include:
and determining a text boundary offset feature map sample according to the initial feature map sample, wherein the text boundary offset feature map sample represents the position of the sample text recognition result in the sample image.
It should be noted that the sample recognition result may include a sample text and a sample text box, and the sample text box may represent the position of the sample text in the sample image. It will be understood that the sample text box may be derived from text boundary offset feature map samples.
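As an illustration, boundary points could be recovered from the 4-channel offset map roughly as follows; the channel layout (x/y offsets to the upper boundary, then to the lower boundary) is an assumption made for this sketch:

```python
# Illustrative recovery of text-box boundary points from the text boundary
# offset map, under the channel layout assumed above.
import numpy as np

def boundary_points(tcl_mask: np.ndarray, tbo: np.ndarray):
    """tcl_mask: (H, W) bool center-line mask; tbo: (4, H, W) offsets."""
    ys, xs = np.nonzero(tcl_mask)
    upper = np.stack([xs + tbo[0, ys, xs], ys + tbo[1, ys, xs]], axis=1)
    lower = np.stack([xs + tbo[2, ys, xs], ys + tbo[3, ys, xs]], axis=1)
    return upper, lower   # polygon sides of the sample text box
```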
In addition, it should be noted that the text contour feature map sample, the text direction feature map sample and the text boundary offset feature map sample can be learned under supervision from label feature maps of the same scale, so that the text recognition model predicts them more accurately. Moreover, because the text recognition model is trained to learn the text direction feature map sample, the model's recognition can be extended to non-traditional reading directions, improving the accuracy of the model's text recognition.
Optionally, the attention mechanism network comprises an encoder and a decoder, and the step of inputting the feature vector sample sequence into the attention mechanism network and the connection timing classification network may comprise:
inputting the characteristic vector sample sequence into an encoder of the attention mechanism network to obtain an encoded characteristic vector sample sequence;
the sequence of encoded feature vector samples is input to a decoder and a connection timing classification network of the attention mechanism network.
It should be noted that the connection timing classification network and the attention mechanism network can be combined by sharing the encoder of the attention mechanism network with the connection timing classification network: the encoded feature vector sample sequence produced by the encoder is input into both the decoder of the attention mechanism network and the connection timing classification network. On this basis, the first sample recognition result output by the connection timing classification network and the second sample recognition result output by the attention mechanism network can be combined in a beam search algorithm to eliminate irregular alignments, as sketched below. In this way, the problem that the connection timing classification network ignores the overall information when predicting local information (i.e., single-character probabilities) in the first sample recognition result is effectively alleviated, and so is the problem that attention-mechanism decoding is not constrained by alignment. In addition, the alignment time of the attention mechanism network is reduced, so model convergence can be accelerated and model accuracy improved.
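A minimal sketch of the joint scoring such a beam search could use; the weight w is illustrative and plays the same role as the loss weight λ in the formula below:

```python
# Joint scoring for beam search: each partial hypothesis is ranked by a
# weighted sum of its CTC and attention log-probabilities.
def joint_score(ctc_log_prob: float, att_log_prob: float, w: float = 0.5) -> float:
    """Score one beam hypothesis from both decoding branches."""
    return w * ctc_log_prob + (1 - w) * att_log_prob

def prune_beam(hypotheses, beam_width=5, w=0.5):
    """hypotheses: list of (text, ctc_log_prob, att_log_prob) tuples."""
    return sorted(hypotheses,
                  key=lambda h: joint_score(h[1], h[2], w),
                  reverse=True)[:beam_width]
```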
Optionally, the training parameters include a loss weight value, and the loss value may be calculated by the following formula:
L_MTL = λ·L_CTC + (1 - λ)·L_Attention
wherein L_MTL is the loss value, L_CTC is the loss function of the connection timing classification network, L_Attention is the loss function of the attention mechanism network, and λ is the loss weight value.
The loss weight value is a training parameter of the text recognition model, and can be adjusted according to the result of model iterative training. The loss function of the connection timing classification network may be a loss function of the CTC network, and the loss function of the Attention mechanism network may be a loss function of the Attention network. The loss function of the CTC network and the loss function of the Attention network are prior art and are not described herein again.
It is noted that pixel-level text character class feature map samples can be obtained by training with L_MTL, so that the text recognition model requires no character-level annotation and no NMS or RoI operations.
By adopting the formula to calculate the loss value to adjust the training parameters of the model, the monotonous alignment of input and output can be forced by using the connection time sequence classification network, and the neglect of the whole information when the connection time sequence classification network predicts the local information can be avoided by using the attention mechanism network, so that the text recognition is carried out under the condition of considering the whole information and the input and output alignment, and the accuracy of the text recognition is improved.
With the above technical solution, the sample image is input into the text recognition model to obtain a sample text recognition result. The text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network. Because the loss value is calculated from both the first sample recognition result and the second sample recognition result, the connection timing classification network can be used to force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
Fig. 2 is a flowchart illustrating a text recognition method according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the method includes:
s201, obtaining an image to be recognized, wherein the image to be recognized comprises a text to be recognized.
S202, inputting the image to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition model is obtained by training through the model training method.
It is easy to understand that the text recognition model obtained by training through the model training method can avoid the calculation overhead brought by a two-stage mode, reduce the time consumption and improve the accuracy of text recognition. On the basis, the image to be recognized is input into the text recognition model, so that a text recognition result with higher accuracy can be obtained.
Optionally, the text recognition result may include a first recognition result and a second recognition result, and on this basis, the technical solution provided by the embodiment of the present disclosure may further include:
and performing weighted calculation according to the first recognition result and the second recognition result to obtain a target text recognition result.
It should be noted that, after the image to be recognized is input into the text recognition model, the first recognition result may be obtained according to the connection timing classification network of the text recognition model, and the second recognition result may be obtained according to the attention mechanism network of the text recognition model. On the basis, weighting calculation can be carried out according to the first recognition result and the second recognition result to obtain a target text recognition result, and the target text recognition result can be used for representing a text to be recognized of an image to be recognized.
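The inference-time weighted combination might be sketched as follows, fusing the per-step character class distributions of the two branches before taking the argmax; the weight w stands in for the trained fusion weight, and 0.5 is only a placeholder:

```python
# Hedged sketch of the weighted fusion of the first and second recognition
# results at inference time.
import torch

def fuse_results(ctc_probs: torch.Tensor, att_probs: torch.Tensor,
                 w: float = 0.5) -> torch.Tensor:
    """ctc_probs, att_probs: (seq_len, num_classes) probability tensors."""
    fused = w * ctc_probs + (1 - w) * att_probs
    return fused.argmax(dim=-1)      # predicted character class per step
```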
It should also be noted that the target text recognition result may further include a recognition text box, which may characterize the position of the text to be recognized in the image to be recognized.
Fig. 3 is a diagram illustrating a text recognition result according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the text recognition result may include a recognition text box and a recognition text (for example, a recognition text in chinese and an english text in the recognition text box shown in fig. 3).
Because the text recognition model is trained by the above model training method, in which the loss value is calculated from the first sample recognition result output by the connection timing classification network and the second sample recognition result output by the attention mechanism network, the connection timing classification network can force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when local information is predicted. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
Based on the same inventive concept, the present disclosure also provides a model training apparatus, and referring to fig. 4, fig. 4 is a block diagram of a model training apparatus shown in an exemplary embodiment of the present disclosure. As shown in fig. 4, the model training apparatus 100 includes:
the first obtaining module 101 is configured to obtain a sample image, where the sample image includes a sample text and the sample text is labeled with a sample text label;
the first input module 102 is configured to input the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, where the text recognition model includes a connection timing classification network and an attention mechanism network, and the sample text recognition result includes a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network;
a calculating module 103, configured to calculate a loss value according to the first sample recognition result, the second sample recognition result, and the sample text label;
and an adjusting module 104, configured to adjust a parameter of the text recognition model according to the loss value.
With the above technical solution, the sample image is input into the text recognition model to obtain a sample text recognition result. The text recognition model comprises a connection timing classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection timing classification network and a second sample recognition result output by the attention mechanism network. Because the loss value is calculated from both the first sample recognition result and the second sample recognition result, the connection timing classification network can be used to force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when the connection timing classification network predicts local information. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
Optionally, the text recognition model further comprises a high resolution network, and the first input module 102 is configured to:
inputting the sample image into the high-resolution network to obtain an initial feature map sample;
determining a text contour feature map sample, a text direction feature map sample and a text character class feature map sample according to the initial feature map sample;
performing a sequence conversion operation on the text contour feature map sample, the text direction feature map sample and the text character class feature map sample to obtain a feature vector sample sequence;
inputting the feature vector sample sequence into the attention mechanism network and the connection timing classification network to obtain a first character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the connection timing classification network and a second character class distribution probability sample of each feature vector sample in the feature vector sample sequence output by the attention mechanism network;
and obtaining the first sample recognition result according to the first character class distribution probability sample, and obtaining the second sample recognition result according to the second character class distribution probability sample.
Optionally, the attention mechanism network comprises an encoder and a decoder, and the first input module 102 is configured to:
inputting the characteristic vector sample sequence into an encoder of the attention mechanism network to obtain an encoded characteristic vector sample sequence;
inputting the sequence of encoded feature vector samples into a decoder of the attention mechanism network and the connection timing classification network.
Optionally, the apparatus 100 further comprises:
and the determining module is used for determining a text boundary offset feature map sample according to the initial feature map sample, wherein the text boundary offset feature map sample represents the position of the sample text recognition result in the sample image.
Optionally, the training parameters include a loss weight value, the loss value is calculated by the following formula:
L_MTL = λ·L_CTC + (1 - λ)·L_Attention
wherein L_MTL is the loss value, L_CTC is the loss function of the connection timing classification network, L_Attention is the loss function of the attention mechanism network, and λ is the loss weight value.
Optionally, the text recognition model is a fully convolutional point gathering network (PGNet) model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the same inventive concept, the present disclosure also provides a text recognition apparatus; referring to fig. 5, fig. 5 is a block diagram of a text recognition apparatus shown in an exemplary embodiment of the present disclosure. As shown in fig. 5, the text recognition apparatus 200 includes:
a second obtaining module 201, configured to obtain an image to be recognized, where the image to be recognized includes a text to be recognized;
the second input module 202 is configured to input the image to be recognized into a text recognition model, so as to obtain a text recognition result output by the text recognition model, where the text recognition model is obtained by training through the model training method.
Because the text recognition model is trained by the above model training method, in which the loss value is calculated from the first sample recognition result output by the connection timing classification network and the second sample recognition result output by the attention mechanism network, the connection timing classification network can force a monotonic alignment between input and output, while the attention mechanism network prevents the overall information from being ignored when local information is predicted. Text recognition is therefore performed with both the overall information and the input-output alignment taken into account, which improves the accuracy of text recognition.
Optionally, the text recognition result includes a first recognition result and a second recognition result, and the apparatus 200 further includes:
and the weighting calculation module is used for carrying out weighting calculation according to the first recognition result and the second recognition result to obtain a target text recognition result.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing a computer program in a memory to implement the steps of the model training method or the text recognition method described above.
Fig. 6 is a block diagram of an electronic device shown in an exemplary embodiment of the present disclosure. As shown in fig. 6, the electronic device 300 may include: a processor 301 and a memory 302. The electronic device 300 may also include one or more of a multimedia component 303, an input/output (I/O) interface 304, and a communication component 305.
The processor 301 is configured to control the overall operation of the electronic device 300, so as to complete all or part of the steps of the above model training method or text recognition method. The memory 302 is used to store various types of data to support operation on the electronic device 300, such as instructions for any application or method operating on the electronic device 300 and application-related data, for example contact data, transmitted and received messages, pictures, audio, video, and the like. The memory 302 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk. The multimedia component 303 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 302 or transmitted through the communication component 305. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 304 provides an interface between the processor 301 and other interface modules, such as a keyboard, a mouse or buttons, where the buttons may be virtual or physical. The communication component 305 is used for wired or wireless communication between the electronic device 300 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G or 5G, NB-IoT (Narrowband Internet of Things), or a combination of one or more of them, so the corresponding communication component 305 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described model training method or text recognition method.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the model training method or the text recognition method described above is also provided. For example, the computer readable storage medium may be the memory 302 described above including program instructions that are executable by the processor 301 of the electronic device 300 to perform the model training method or the text recognition method described above.
With regard to the computer-readable storage medium in the above-described embodiments, the steps of implementing the model training method or the text recognition method when the computer program stored thereon is executed will be described in detail in relation to the embodiments of the method, and will not be elaborated herein.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described model training method or text recognition method when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of those embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and all such simple modifications fall within the protection scope of the present disclosure.
It should be noted that the specific features described in the foregoing embodiments may be combined in any suitable manner; to avoid unnecessary repetition, the possible combinations are not described separately in the present disclosure.
In addition, the various embodiments of the present disclosure may be combined arbitrarily, and such combinations should likewise be regarded as part of the disclosure of the present disclosure, provided that they do not depart from its spirit.

Claims (12)

1. A method of model training, the method comprising:
obtaining a sample image, wherein the sample image comprises a sample text, and the sample text is marked with a sample text label;
inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connectionist temporal classification (CTC) network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connectionist temporal classification network and a second sample recognition result output by the attention mechanism network;
calculating a loss value according to the first sample recognition result, the second sample recognition result and the sample text label;
and adjusting parameters of the text recognition model according to the loss value.
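For a developer audience, the following is a minimal, hypothetical sketch of one training step of the kind recited in claim 1 (and using the weighted loss of claim 5), written in PyTorch. The model interface, the class count, the use of index 0 for both the CTC blank and target padding, and the default weight lam=0.5 are all illustrative assumptions, not details taken from this disclosure.

```python
# Hypothetical single training step for a joint CTC + attention text
# recognition model. Shapes, names, and padding conventions are assumptions.
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)  # CTC branch loss
att_criterion = nn.CrossEntropyLoss(ignore_index=0)      # attention branch loss; 0 = pad/blank

def training_step(model, optimizer, images, targets, target_lengths, lam=0.5):
    # Forward pass: the model is assumed to return both branches' outputs.
    # ctc_logits: (T, N, C) per-timestep scores from the CTC branch
    # att_logits: (N, S, C) per-character scores from the attention branch
    ctc_logits, att_logits = model(images)
    T, N, _ = ctc_logits.shape
    input_lengths = torch.full((N,), T, dtype=torch.long)

    # L_CTC: alignment-free sequence loss on the CTC branch;
    # targets is assumed to be a (N, S) tensor of character ids, 0-padded.
    l_ctc = ctc_criterion(ctc_logits.log_softmax(2), targets,
                          input_lengths, target_lengths)
    # L_Attention: per-character cross-entropy on the attention branch.
    l_att = att_criterion(att_logits.flatten(0, 1), targets.flatten())

    # Weighted multi-task loss: L_MTL = lam * L_CTC + (1 - lam) * L_Attention.
    loss = lam * l_ctc + (1.0 - lam) * l_att
    optimizer.zero_grad()
    loss.backward()   # adjust the model parameters according to the loss value
    optimizer.step()
    return loss.item()
```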
2. The method of claim 1, wherein the text recognition model further comprises a high-resolution network, and wherein inputting the sample image into the text recognition model to obtain the first sample recognition result and the second sample recognition result output by the text recognition model comprises:
inputting the sample image into the high-resolution network to obtain an initial feature map sample;
determining a text contour feature map sample, a text direction feature map sample and a text character category feature map sample according to the initial feature map sample;
performing a sequence conversion operation on the text contour feature map sample, the text direction feature map sample and the text character category feature map sample to obtain a feature vector sample sequence;
inputting the feature vector sample sequence into the attention mechanism network and the connectionist temporal classification network, to obtain a first character category distribution probability sample of each feature vector sample in the feature vector sample sequence output by the connectionist temporal classification network, and a second character category distribution probability sample of each feature vector sample in the feature vector sample sequence output by the attention mechanism network;
and obtaining the first sample recognition result according to the first character category distribution probability sample, and obtaining the second sample recognition result according to the second character category distribution probability sample.
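As a hedged illustration of the "sequence conversion operation" in claim 2: one common way to turn the three feature map samples into a feature vector sequence is to fuse them along the channel axis, collapse the height, and treat width as the time axis. The pooling strategy and all shapes below are assumptions.

```python
# Hypothetical sequence conversion: fuse the contour, direction and
# character-category feature maps, then unroll width into a sequence.
import torch

def to_feature_sequence(contour_map, direction_map, char_map):
    """Each map: (N, C, H, W). Returns a (T, N, 3C) feature vector sequence."""
    fused = torch.cat([contour_map, direction_map, char_map], dim=1)  # (N, 3C, H, W)
    pooled = fused.mean(dim=2)        # collapse height -> (N, 3C, W)
    return pooled.permute(2, 0, 1)    # width becomes the time axis -> (W, N, 3C)

# Usage: three 64-channel maps on an 8x100 grid for a batch of 2 images.
maps = [torch.randn(2, 64, 8, 100) for _ in range(3)]
seq = to_feature_sequence(*maps)      # seq.shape == torch.Size([100, 2, 192])
```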
3. The method of claim 2, wherein the attention mechanism network comprises an encoder and a decoder, and wherein inputting the feature vector sample sequence into the attention mechanism network and the connectionist temporal classification network comprises:
inputting the feature vector sample sequence into the encoder of the attention mechanism network to obtain an encoded feature vector sample sequence;
inputting the encoded feature vector sample sequence into the decoder of the attention mechanism network and into the connectionist temporal classification network.
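One plausible, but assumed, wiring for claim 3 is a shared encoder whose output feeds both the CTC classifier and the attention decoder. The layer types and sizes below are illustrative choices, not details from this disclosure.

```python
# Hypothetical shared-encoder wiring: the encoded feature vector sample
# sequence feeds both the CTC branch and the attention decoder.
import torch
import torch.nn as nn

class SharedEncoderHeads(nn.Module):
    def __init__(self, d_in=192, d_model=256, num_classes=97):
        super().__init__()
        self.encoder = nn.LSTM(d_in, d_model, bidirectional=True)       # attention-network encoder
        self.ctc_fc = nn.Linear(2 * d_model, num_classes)               # CTC branch classifier
        self.decoder = nn.MultiheadAttention(2 * d_model, num_heads=4)  # stand-in attention decoder

    def forward(self, seq, queries):
        encoded, _ = self.encoder(seq)     # encoded sequence (T, N, 2*d_model)
        ctc_logits = self.ctc_fc(encoded)  # CTC branch reads the encoded sequence
        att_out, _ = self.decoder(queries, encoded, encoded)  # decoder attends over it
        return ctc_logits, att_out

# Usage with the (100, 2, 192) sequence from the previous sketch:
model = SharedEncoderHeads()
seq, queries = torch.randn(100, 2, 192), torch.randn(25, 2, 512)
ctc_logits, att_out = model(seq, queries)  # (100, 2, 97) and (25, 2, 512)
```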
4. The method of claim 2, further comprising:
determining a text boundary offset feature map sample according to the initial feature map sample, wherein the text boundary offset feature map sample represents the position of the sample text recognition result in the sample image.
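A minimal sketch of one way to realize claim 4, assuming the boundary offsets are regressed with a 1×1 convolution over the initial feature map; the four-channel (left/right/top/bottom) layout and channel counts are assumptions.

```python
# Hypothetical text-boundary-offset head: regress per-pixel offsets
# from the initial feature map sample.
import torch
import torch.nn as nn

boundary_head = nn.Conv2d(64, 4, kernel_size=1)  # 64-ch initial map -> 4 offset channels
initial_map = torch.randn(2, 64, 8, 100)         # initial feature map sample (N, C, H, W)
offsets = boundary_head(initial_map)             # (2, 4, 8, 100) boundary offset map
```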
5. The method of claim 1, wherein the training parameters comprise a loss weight value, and wherein the loss value is calculated by the following equation:
L_MTL = λ·L_CTC + (1 − λ)·L_Attention
wherein L_MTL is the loss value, L_CTC is the loss function of the connectionist temporal classification network, L_Attention is the loss function of the attention mechanism network, and λ is the loss weight value.
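As an illustrative check with assumed values (not taken from this disclosure): with λ = 0.7, L_CTC = 2.0 and L_Attention = 1.0, the formula gives L_MTL = 0.7 × 2.0 + 0.3 × 1.0 = 1.7, so a larger λ weights the connectionist temporal classification branch more heavily.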
6. The method of any of claims 1-5, wherein the text recognition model is a fully convolutional point aggregation network model.
7. A method of text recognition, the method comprising:
acquiring an image to be recognized, wherein the image to be recognized comprises a text to be recognized;
inputting the image to be recognized into a text recognition model to obtain a text recognition result output by the text recognition model, wherein the text recognition model is obtained by training according to the model training method of any one of claims 1 to 6.
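A hedged inference sketch for claim 7; the checkpoint path, the input size, and the two-headed return signature are assumptions carried over from the training sketch above.

```python
# Hypothetical inference: run a trained joint model on one image.
import torch

model = torch.load("text_recognition_model.pt", weights_only=False)  # hypothetical checkpoint
model.eval()
image = torch.randn(1, 3, 32, 128)         # assumed (N, C, H, W) input size
with torch.no_grad():
    ctc_logits, att_logits = model(image)  # first and second recognition results
```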
8. The method of claim 7, wherein the text recognition result comprises a first recognition result and a second recognition result, and wherein the method further comprises:
performing a weighted calculation on the first recognition result and the second recognition result to obtain a target text recognition result.
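One possible reading of claim 8's weighted calculation, assuming the two branches' outputs have been aligned to the same length; the weight w and the greedy decode step are assumptions.

```python
# Hypothetical weighted fusion of the two branch outputs.
import torch

def fuse_results(ctc_probs, att_probs, w=0.5):
    """ctc_probs, att_probs: (T, C) per-step class distributions, pre-aligned."""
    fused = w * ctc_probs + (1.0 - w) * att_probs  # weighted combination
    return fused.argmax(dim=-1)                    # greedy decode to character ids
```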
9. A model training apparatus, the apparatus comprising:
a first obtaining module, configured to obtain a sample image, wherein the sample image comprises a sample text, and the sample text is marked with a sample text label;
the first input module is used for inputting the sample image into a text recognition model to obtain a sample text recognition result output by the text recognition model, wherein the text recognition model comprises a connection time sequence classification network and an attention mechanism network, and the sample text recognition result comprises a first sample recognition result output by the connection time sequence classification network and a second sample recognition result output by the attention mechanism network;
a calculation module, configured to calculate a loss value according to the first sample recognition result, the second sample recognition result, and the sample text label;
and an adjusting module, configured to adjust the parameters of the text recognition model according to the loss value.
10. A text recognition apparatus, characterized in that the apparatus comprises:
a second acquisition module, configured to acquire an image to be recognized, wherein the image to be recognized comprises a text to be recognized;
a second input module, configured to input the image to be recognized into a text recognition model, so as to obtain a text recognition result output by the text recognition model, where the text recognition model is obtained by training according to the model training method of any one of claims 1 to 6.
11. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6 or any one of claims 7 to 8.
12. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 6 or any one of claims 7 to 8.
CN202210615726.0A 2022-05-31 2022-05-31 Model training method, text recognition method, device, medium and equipment Pending CN114973267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615726.0A CN114973267A (en) 2022-05-31 2022-05-31 Model training method, text recognition method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615726.0A CN114973267A (en) 2022-05-31 2022-05-31 Model training method, text recognition method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN114973267A true CN114973267A (en) 2022-08-30

Family

ID=82959201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615726.0A Pending CN114973267A (en) 2022-05-31 2022-05-31 Model training method, text recognition method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN114973267A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination