CN115393867A - Text recognition model generation method, text recognition device, and storage medium - Google Patents

Text recognition model generation method, text recognition device, and storage medium

Info

Publication number
CN115393867A
CN115393867A (application CN202210859202.6A)
Authority
CN
China
Prior art keywords
character
text line
text
image
recognition
Prior art date
Legal status
Pending
Application number
CN202210859202.6A
Other languages
Chinese (zh)
Inventor
朱远志
丁威
汤俊
何梦超
姚聪
刘腾
张洁靖
朱雅丽
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210859202.6A
Publication of CN115393867A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1916Validation; Performance evaluation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

A text recognition model generation method, a text recognition method and device, and a storage medium are disclosed. The text recognition method comprises: acquiring a text line image to be recognized; performing line-level text recognition on the text line image with a trained text line recognition model to obtain a text line recognition result; when the confidence of any character in the text line recognition result is below a preset confidence threshold, or the energy of the character is above a preset energy threshold, cutting a character sub-image of preset width centered on the corresponding character in the text line image; for each character sub-image, performing single character detection and recognition with a trained single character detection and recognition model and outputting a corresponding character recognition result; and combining the text line recognition result with the corresponding character recognition results to obtain the character recognition result of the text line image. This scheme improves both the accuracy and the efficiency of text recognition on electronic devices with limited computing power.

Description

Text recognition model generation method, text recognition device and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of data processing technologies, and in particular, to a text recognition model generation method, a text recognition device, and a storage medium.
Background
In recent years, with the continuous development of Artificial Intelligence (AI) technologies, Optical Character Recognition (OCR) has made great breakthroughs and has become an important basic capability for digital transformation, intelligent upgrading, and integrated innovation in vertical industries such as finance, transportation, logistics, education, and government affairs. As the mobile internet matures and the industrial internet accelerates, the service carriers and forms of OCR have diversified, and offline OCR that balances performance and efficiency has become one of the trends of future technical development. Offline OCR is a new product form following the public cloud Application Programming Interface (API) and privatized deployment; it extends the boundary of general-purpose OCR systems and, compared with online services, offers low-cost deployment, zero traffic consumption, privacy protection, and what-you-see-is-what-you-get immediacy.
Educational intelligent hardware integrates AI algorithms, software, and content into products of various forms that help students obtain personalized learning content and improve their learning methods, help parents guide and supervise their children's learning, and help teachers improve teaching content and reduce teaching pressure.
According to an iResearch estimate, the educational intelligent hardware market reached 343 billion yuan in 2020 and is expected to approach 1,000 billion yuan in 2024, with emerging products such as AI dictionary pens growing especially fast. The AI dictionary pen is a new generation of artificial-intelligence dictionary pen for student users that supports language learning through operations such as scanning and voice, helping students with listening, speaking, reading, recitation, and translation. Its most essential basic function, scanning and word lookup, is performed by an offline algorithm, so the pen works normally without a network connection, which greatly improves its convenience and usability. Offline OCR is the entry point for extracting text information: once recognition goes wrong, the entire function becomes unavailable. Moreover, because such a device is a consumer product, user experience and hardware cost must be weighed carefully, so achieving fast and high-precision OCR on electronic devices with limited computing power is a technical problem for the industry.
The statements in this background section merely provide background information to the public and do not necessarily constitute prior art in this field.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text recognition model generation method, a text recognition device, and a storage medium, which can improve accuracy and efficiency of text recognition for electronic devices with limited computing power.
First, an embodiment of the present specification provides a text recognition model generation method, including:
respectively acquiring a text line image training set and a character image training set;
inputting the text line image training set into a preset text line recognition model, training the text line recognition model, inputting the character image training set into a preset single character detection and recognition model, and training the single character detection and recognition model;
acquiring a text line image test set, inputting it into the text line recognition model, and outputting a text line recognition result;
evaluating the confidence and energy of each character in the text line recognition result;
when the confidence of any character in the text line recognition result is below a preset confidence threshold or the energy of the character is above a preset energy threshold, cutting a character test sub-image of preset width centered on the corresponding character in the text line image;
inputting each character test sub-image into the single character detection and recognition model, and outputting a corresponding character recognition result;
combining the text line recognition result with the corresponding character recognition results to obtain a character recognition test result of the text line image test set;
and determining, based on whether the character recognition test result reaches a preset performance evaluation index, whether to continue training the text line recognition model and the single character detection and recognition model, until the character recognition test result reaches the preset performance evaluation index.
Optionally, the respectively acquiring a text line image training set and a character image training set includes:
mixing a first text line image training set and a second text line image training set according to a first mixing ratio preset for the corresponding training batch to obtain the text line image training set, where the first text line image training set is obtained by cutting acquired real text images, and the second text line image training set is obtained by collecting character libraries in different fonts and synthesizing text lines according to the arrangement rules of characters in text lines;
and mixing a first character image training set and a second character image training set according to a second mixing ratio preset for the corresponding training batch to obtain the character image training set, where the first character image training set is obtained by cutting acquired real text images, and the second character image training set is obtained by collecting character libraries in different fonts, synthesizing text images according to the arrangement rules of characters in text lines, and then cutting the synthesized images.
Optionally, the first text line image training set and the second text line image training set are mixed according to the first mixing ratio preset for the corresponding training batch, and the first character image training set and the second character image training set are mixed according to the second mixing ratio preset for the corresponding training batch, so that the text line recognition model trained on the text line image training set and the single character detection and recognition model trained on the character image training set reach a preset generalization performance index threshold.
Optionally, the text line recognition model includes: a convolutional recurrent neural network.
Optionally, the method further comprises: performing image augmentation operations on the text line image training set and the character image training set.
Optionally, the performing image augmentation operations on the text line image training set and the character image training set includes:
performing the image augmentation operations on the text line image training set and the character image training set according to hyper-parameters set for the corresponding training batch, wherein the hyper-parameters decay as the training batch index increases.
An embodiment of the present specification further provides a text recognition method, including:
acquiring a text line image to be recognized;
performing line-level text recognition on the text line image using a trained text line recognition model to obtain a text line recognition result;
evaluating the confidence and energy of each character in the text line recognition result;
when the confidence of any character in the text line recognition result is below a preset confidence threshold or the energy of the character is above a preset energy threshold, cutting a character sub-image of preset width centered on the corresponding character in the text line image;
for each character sub-image, performing single character detection and recognition using a trained single character detection and recognition model, and outputting a corresponding character recognition result;
and combining the text line recognition result with the corresponding character recognition results to obtain the character recognition result of the text line image.
Optionally, the text line recognition model includes: a convolutional recurrent neural network.
Optionally, the energy of each character in the text line recognition result is evaluated according to the following formula:

$$E(x; f) = -T \log \sum_{i=1}^{K} e^{f_i(x)/T}$$

where $x$ denotes the image of the character, $T$ denotes a temperature hyper-parameter, $K$ denotes the number of categories, $f$ is the text line recognition model network, and $f_i(x)$ denotes the feature value corresponding to the i-th of the $K$ categories.
Optionally, before performing line text recognition on the text line image by using the trained text line recognition model, the method further includes:
and carrying out image enhancement preprocessing on the text line image to be recognized.
An embodiment of the present specification further provides a text recognition system, including:
an acquisition unit, adapted to acquire a text line image to be recognized;
a text line recognition unit, adapted to perform line-level text recognition on the text line image using a trained text line recognition model to obtain a text line recognition result;
an evaluation unit, adapted to evaluate the confidence and energy of each character in the text line recognition result;
a cutting unit, adapted to cut a character sub-image of preset width centered on the corresponding character in the text line image when the evaluation unit determines that the confidence of any character in the text line recognition result is below a preset confidence threshold or the energy of the character is above a preset energy threshold;
a character recognition unit, adapted to perform single character detection and recognition on each character sub-image using a trained single character detection and recognition model and to output a corresponding character recognition result;
and a recognition result output unit, adapted to combine the text line recognition result with the corresponding character recognition results to obtain and output the character recognition result of the text line image.
The present specification further provides an electronic device, including a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, performs the steps of the text recognition model generation method or of the text recognition method according to any one of the foregoing embodiments.
The present specification further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed, performs the steps of the text recognition model generation method or of the text recognition method according to any one of the foregoing embodiments.
In the text recognition model generation method of the embodiments of the present specification, a text line image training set and a character image training set are acquired respectively; the text line recognition model is trained with the text line image training set, the single character detection and recognition model is trained with the character image training set, and testing is then performed with a text line image test set. For the text line recognition result, the confidence and energy of each character are evaluated; when the confidence of any character is below a preset confidence threshold or its energy is above a preset energy threshold, a character test sub-image of preset width is cut centered on the corresponding character in the text line image. Each character test sub-image is input into the single character detection and recognition model, which outputs a corresponding character recognition result; combining the text line recognition result with the corresponding character recognition results yields the character recognition test result of the text line image test set. Whether to continue training the text line recognition model and the single character detection and recognition model is then determined based on whether the character recognition test result reaches a preset performance evaluation index, and training continues until it does. Because the resulting text recognition model judges the text line recognition result jointly by energy and confidence, the characters that require single character detection and recognition can be selected more efficiently; moreover, because line recognition is combined with a single character detection and recognition model dedicated to rare characters, character recognition with this text recognition model is more accurate.
In the text recognition method of the embodiments of the present specification, the confidence and energy of each character in the text line recognition result are evaluated, so that common characters and rare characters can be distinguished accurately. When the confidence of any character is below a preset confidence threshold or its energy is above a preset energy threshold, i.e., when the character is judged to be a rare character, a character sub-image of preset width is cut centered on the corresponding character in the text line image; each character sub-image is then detected and recognized by the trained single character detection and recognition model, which outputs a corresponding character recognition result, and the character recognition result of the text line image is obtained by combining the text line recognition result with the corresponding character recognition results. On the one hand, because the whole recognition process judges the text line recognition result jointly by energy and confidence, the characters requiring single character detection and recognition can be selected more efficiently; on the other hand, because line recognition is combined with a single character detection and recognition model for rare characters, character recognition with this text recognition model is more accurate.
Further, a first text line image training set, obtained by cutting acquired real text images, is mixed according to a first mixing ratio preset for the corresponding training batch with a second text line image training set, obtained by collecting character libraries in different fonts and synthesizing text lines according to the arrangement rules of characters in text lines, to obtain the text line image training set; likewise, a first character image training set obtained by cutting acquired real text images is mixed according to a second mixing ratio preset for the corresponding training batch with a second character image training set obtained by synthesizing text images from character libraries in different fonts and then cutting them, to obtain the character image training set. By adding the synthesized second text line image training set and second character image training set to the training process, training sets that meet the training requirements can be obtained more efficiently, which in turn improves the training efficiency of the text recognition model.
Furthermore, the first and second text line image training sets are mixed according to the first mixing ratio preset for the corresponding training batch, and the first and second character image training sets are mixed according to the second mixing ratio preset for the corresponding training batch, so that the text line recognition model trained on the text line image training set and the single character detection and recognition model trained on the character image training set reach a preset generalization performance index threshold. The trained text recognition model can thus express the characteristics of real text image data more effectively, improving its generalization capability.
When a text line image is input into a convolutional recurrent neural network, the convolutional layers extract features of the image, a bidirectional long short-term memory network then fuses the feature vectors to extract contextual features of the character sequence, the probability distribution over each column of features is obtained, and finally the transcription layer predicts the text sequence.
Furthermore, by performing image augmentation operations on the text line image training set and the character image training set, the character recognition rate of the text recognition model in general scenarios can be improved.
Further, image augmentation operations are performed on the text line image training set and the character image training set according to hyper-parameters set for the corresponding training batch. Because the hyper-parameters decay as the training batch index increases, the fit to the original training sets is gradually strengthened during training, which further improves the character recognition rate in general scenarios.
Drawings
To illustrate the technical solutions of the embodiments of the present specification more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present specification, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flowchart of a text recognition model generation method in an embodiment of the present specification.
Fig. 2 shows a flowchart of a text recognition method in an embodiment of the present specification.
Fig. 3 is a schematic diagram illustrating an example scenario of a text recognition method in an embodiment of the present specification.
Fig. 3a shows a text line image to be recognized in the embodiment of the present specification shown in fig. 3.
Fig. 3b shows a text line recognition result corresponding to the text line image to be recognized shown in fig. 3a in this embodiment.
Fig. 3c shows the confidence and energy corresponding to each character in the text line recognition result shown in fig. 3b in the embodiment of the present specification.
Fig. 3d shows a partial character sub-image obtained for the text line image shown in fig. 3a in an embodiment of the present specification.
Fig. 3e shows a character recognition result obtained after the preset single character detection and recognition model is input into the character sub-image corresponding to fig. 3d in the embodiment of the present specification.
Fig. 3f shows a text recognition result obtained by combining the character recognition results of fig. 3b and fig. 3e in the embodiment of the present specification.
Fig. 4 shows a schematic structural diagram of a text recognition system in an embodiment of the present specification.
Fig. 5 shows a schematic structural diagram of an electronic device in an embodiment of the present specification.
Detailed Description
As described above, achieving fast and accurate OCR on electronic devices with limited computing power is a technical problem in the industry; in particular, for languages whose characters follow a long-tail distribution and include large numbers of visually similar characters, how to improve the accuracy and efficiency of text recognition is an urgent technical problem.
In view of the above problems, one aspect of the embodiments of the present specification provides a text recognition model generation method. On the one hand, by judging the text line recognition result jointly by energy and confidence, the characters that require single character detection and recognition can be selected more efficiently; on the other hand, because line recognition is combined with a single character detection and recognition model for rare characters, character recognition with the resulting text recognition model is more accurate.
In view of the above problems, another aspect of the embodiments of the present specification provides a text recognition method. On the one hand, because the whole recognition process judges the text line recognition result jointly by energy and confidence, the characters that require single character detection and recognition can be selected more efficiently; on the other hand, because line recognition is combined with a single character detection and recognition model for rare characters, character recognition with the text recognition model is more accurate.
To enable those skilled in the art to better understand the concepts, principles, and advantages of the embodiments of the present specification, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
To make the text recognition model reach the expected performance indexes, it must be trained and tested in advance, and the required text recognition model is generated once it meets the expected performance requirements. For a better understanding and implementation by those skilled in the art, the text recognition model generation process is first described in detail with reference to specific application examples and the accompanying drawings.
Referring to the flowchart of the text recognition model generation method shown in fig. 1, in some embodiments of the present invention, the following steps may be employed to generate a text recognition model for character recognition:
and S11, respectively acquiring a text line image training set and a character image training set.
In specific implementations, the text line image training set can be obtained by identifying and cutting the region of each line in each text image of a text image set. The character image training set can be obtained by recognizing the characters in text line images and cutting around each character at a preset width; alternatively, each text image in the acquired text image set can first be cut into text line images, which are then cut into characters; or characters can be recognized directly in the acquired text images and cut out within a preset region centered on each recognized character.
In specific implementations, the text line image training set and the character image training set may be obtained by collecting real text images or by artificial synthesis.
In specific implementations, to balance the training efficiency of the text recognition model with the robustness of the training result, collected real text line images may be mixed with synthesized text line images to obtain the text line image training set, and collected real character images may be mixed with artificially synthesized character images to obtain the character image training set.
S12: input the text line image training set into a preset text line recognition model and train the text line recognition model; input the character image training set into a preset single character detection and recognition model and train the single character detection and recognition model.
In some embodiments of the present invention, a Recurrent Neural Network (RNN) may be used as the text line recognition model. In specific implementations, the text line recognition model may also be a combined model formed by combining an RNN with other neural networks or algorithms, or a network evolved, modified, or extended from an RNN.
As an optional example, the text line recognition model may be a Convolutional Recurrent Neural Network (CRNN). A CRNN recognizes text sequences of indefinite length end to end without cutting out individual characters, converting text recognition into a time-series-dependent sequence learning problem, i.e., image-based sequence recognition.
Specifically, the overall network structure of a CRNN comprises three parts: a convolutional layer, a recurrent layer, and a transcription layer. As a specific example, first, in the convolutional layers, features are extracted from the input text line image by a deep CNN to obtain a feature map; then, in the recurrent layers, a bidirectional (and more specifically deep) RNN such as a bidirectional Long Short-Term Memory network (BiLSTM) predicts the feature sequence, learning each feature vector in the sequence and outputting a predicted label distribution; finally, in the transcription layer, the series of label distributions obtained from the recurrent layers is converted into the final label sequence using a Connectionist Temporal Classification (CTC) loss function.
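As an illustrative aid (not part of the patent disclosure), this three-part structure can be sketched in PyTorch; the layer sizes below are assumptions chosen for brevity rather than the dimensions of any embodiment:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: convolutional layers -> BiLSTM -> per-step logits."""

    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        # Convolutional layers: extract a feature map from the text line image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # pool height only, keep width (sequence) resolution
        )
        feat_h = img_height // 8
        # Recurrent layers: a BiLSTM reads the feature map column by column.
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Per-timestep class logits, with one extra class for the CTC blank.
        self.fc = nn.Linear(512, num_classes + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cnn(x)                                  # (N, C, H', W')
        n, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(n, w, c * h)   # (N, W', C*H'): width = time
        out, _ = self.rnn(f)
        return self.fc(out)                              # (N, W', num_classes + 1)
```

Because the feature-map width serves directly as the sequence length, lines of arbitrary length are handled without segmenting individual characters.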
CTC is a loss-function formulation that does not require the training samples to be aligned; it introduces a blank character to handle positions where no character appears, and its gradient can be computed quickly by recursion.
Specifically, for the LSTM there is a training set $S = \{(x_1, z_1), (x_2, z_2), \ldots, (x_N, z_N)\}$, where $x$ is the feature map computed by the CNN from the image and $z$ is the OCR character label corresponding to the image; the label contains no blank characters. The gradient

$$\frac{\partial p(l \mid x)}{\partial w}$$

is used to adjust the parameters $w$ of the LSTM so that, for an input sample, $p(l \mid x)$ is maximized over the paths $\pi \in B^{-1}(z)$. Considering a single value $y_{l_k}^t$ of the $y$ matrix that CTC takes as input (i.e., the LSTM output), namely the probability that the path satisfies $\pi_t = l_k$ at time $t$, the gradient is

$$\frac{\partial p(l \mid x)}{\partial y_{l_k}^t} = \frac{\alpha_t(l_k)\,\beta_t(l_k)}{\left(y_{l_k}^t\right)^2},$$

where $\alpha_t(l_k)\beta_t(l_k)$ is a constant obtained by the forward-backward recursion and can be computed quickly at any time step. The gradient $\partial p(l \mid x)/\partial w$ can therefore be computed quickly, and training proceeds by following it.
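In practice the forward-backward recursion and its gradient are rarely implemented by hand; frameworks ship CTC as a differentiable loss. A minimal sketch using PyTorch's built-in `nn.CTCLoss` (the shapes and random labels here are illustrative only):

```python
import torch
import torch.nn as nn

# Suppose the recurrent layers emit log-probabilities of shape (T, N, K + 1):
# T time steps, N batch items, K character classes plus the blank (index 0).
T, N, K = 24, 4, 100
log_probs = torch.randn(T, N, K + 1).log_softmax(2).requires_grad_()

# Targets are the unpadded label sequences; they contain no blank characters.
targets = torch.randint(1, K + 1, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)   # -log p(l|x), averaged over the batch
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()             # gradient computed internally via the alpha/beta recursion
```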
As another alternative, the text line recognition model may be a CNN combined with a Seq2Seq model structure and an Attention model structure. The Seq2Seq structure is an encoder-decoder structure whose basic idea is to use two RNNs, one as the encoder and the other as the decoder; the Attention structure weights the context vector produced by the encoder at each step of decoding, so that each decoding step uses a different context vector.
It should be understood that the embodiments of the present disclosure do not limit the specific type of text line recognition model used, and those skilled in the art may choose an implementation according to actual requirements.
In some embodiments of the present invention, the single character detection and recognition model first detects the single character and then recognizes which specific character in the dictionary it is. The embodiments of the present invention do not limit the specific type of single character detection and recognition model used. As optional examples, it may be a Region Proposal Network (RPN) structure, a Rotation Region Proposal Network (RRPN) structure, an RCNN structure, a Fast R-CNN (Fast Region-based Convolutional Neural Network) structure, or a combination thereof.
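As one possible concrete choice for this detect-then-recognize stage (an assumption for illustration, not an architecture prescribed by the embodiments), an off-the-shelf Faster R-CNN from torchvision (0.13 or later) can be configured with one class per character:

```python
import torch
import torchvision

# Faster R-CNN configured with one class per character (plus background at 0),
# so detection and classification of the single character happen in one pass.
K = 2000  # assumed vocabulary size
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=K + 1)
model.eval()

with torch.no_grad():
    sub_image = torch.rand(3, 32, 32)        # one cropped character sub-image
    detections = model([sub_image])[0]       # dict with boxes, labels, scores
    if len(detections["labels"]) > 0:
        best = detections["scores"].argmax()
        predicted_class = detections["labels"][best].item()
```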
S13: acquire a text line image test set, input it into the text line recognition model, and output a text line recognition result.
S14: evaluate the confidence and energy of each character in the text line recognition result; when the confidence of any character in the text line recognition result is below the preset confidence threshold or the energy of the character is above the preset energy threshold, perform step S15.
S15: cut a character test sub-image of preset width centered on the corresponding character in the text line image.
In specific implementations, the confidence may be calculated by the following formula:

$$\mathrm{Conf}(x) = \max_i \, y_i, \qquad y_i = \frac{e^{f_i(x)}}{\sum_{j=1}^{K} e^{f_j(x)}},$$

where $x$ denotes the image of the character, $K$ denotes the number of categories, $y_i$ is the softmax probability of the i-th category, and $f$ is the text line recognition model network.
In specific implementations, the energy of each character in the text line recognition result can be evaluated according to the following formula:

$$E(x; f) = -T \log \sum_{i=1}^{K} e^{f_i(x)/T},$$

where $x$ denotes the image of the character, $T$ denotes a temperature hyper-parameter, $K$ denotes the number of categories, $f$ is the text line recognition model network, and $f_i(x)$ denotes the feature value corresponding to the i-th of the $K$ categories.
In specific implementations, rare characters can be framed as an Out-of-Distribution (OOD) problem, also called anomalous sample detection. Judging jointly by the energy and the confidence of a character gives rare characters and common characters a clear separating interface in the energy distribution, so the parts of a text line image that require single character detection and recognition can be picked out more efficiently.
In specific implementations, a rare character is generally hard to recognize, so its confidence is generally low, while its energy is generally high. Therefore, if the confidence of a character is below the preset confidence threshold or its energy is above the preset energy threshold, the character can be treated as a rare character and input into the single character detection and recognition model for more accurate recognition.
The preset confidence threshold and the preset energy threshold can be set according to empirical values obtained by analyzing samples with the trained network. The embodiments of the present disclosure do not limit the specific confidence threshold or energy threshold; they may be set according to factors such as the language of the text, the text corpus and character libraries used for training, and the training network used.
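To make the joint gate concrete, the following sketch computes both scores from a line model's per-character logits and applies the two thresholds; the threshold values and vocabulary size are illustrative assumptions, since the embodiments leave them to empirical tuning:

```python
import torch

def char_scores(logits: torch.Tensor, temperature: float = 1.0):
    """logits: (L, K) scores f_i(x) for each of L decoded characters.

    Returns the softmax confidence and the energy
    E(x; f) = -T * log(sum_i exp(f_i(x) / T)) for every character.
    """
    conf = logits.softmax(dim=-1).max(dim=-1).values           # max_i softmax prob
    energy = -temperature * torch.logsumexp(logits / temperature, dim=-1)
    return conf, energy

Tc, Te = 0.8, -5.0                 # assumed thresholds, tuned empirically in practice
logits = torch.randn(12, 2000)     # 12 decoded characters, 2000-class vocabulary
conf, energy = char_scores(logits)
needs_single_char = (conf < Tc) | (energy > Te)   # route these to the single-char model
```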
S16: input each character test sub-image into the single character detection and recognition model, and output a corresponding character recognition result.
S17: combine the text line recognition result with the corresponding character recognition results to obtain the character recognition test result of the text line image test set.
S18: determine whether the character recognition test result reaches the preset performance evaluation index; if so, end the training; if not, return to step S11 and continue training the text line recognition model and the single character detection and recognition model until the character recognition test result reaches the preset performance evaluation index.
In specific implementations, the text recognition model may be trained through steps S11 and S12 and then tested through steps S13 to S17; whether the model reaches the preset performance evaluation index is determined from the character recognition test result, and whether to continue training is decided from that evaluation. The text recognition model comprises a text line recognition model for recognizing text line images and a single character detection and recognition model for recognizing individual characters. By comprehensively evaluating the text line recognition result against the preset confidence threshold and energy threshold, the characters that require single character detection and recognition can be selected more efficiently, and single character recognition is then performed by the model dedicated to rare characters. In summary, once trained, this text recognition model improves both the accuracy and the efficiency of character recognition on text line images.
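The outer train-test loop of steps S11 to S18 can be summarized schematically as follows; `train_one_round` and `evaluate_combined` are hypothetical helpers standing in for the training and gated-evaluation logic described above:

```python
def generate_text_recognition_model(line_model, char_model, loaders,
                                    target_metric: float, max_rounds: int = 50):
    """Outer loop of S11-S18; train_one_round and evaluate_combined are
    hypothetical helpers for the training and combined test logic above."""
    for _ in range(max_rounds):
        train_one_round(line_model, loaders["line_train"])   # S12: line model
        train_one_round(char_model, loaders["char_train"])   # S12: single-char model
        # S13-S17: line recognition on the test set, confidence/energy gating,
        # single-character fallback, then merge into one combined test metric.
        metric = evaluate_combined(line_model, char_model, loaders["line_test"])
        if metric >= target_metric:                          # S18: stop criterion met
            break
    return line_model, char_model
```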
In specific implementations, because languages such as Chinese and Japanese have long-tail character distributions and large numbers of visually similar characters, directly training a text line recognition model such as a CRNN on the tens of thousands of Chinese character categories causes problems. One is precision: the many similar characters lower the recognition rate of common characters, and the line model's recognition precision on rare characters is low. The inventors found through research and practice that rare and common characters are hard to separate effectively in the high-confidence interval. When the threshold is set by confidence alone, a threshold that is too low lets many common characters be misjudged as rare and enter the single character branch, increasing the computation, while a threshold that is too high keeps many rare characters out of the rare-character branch, lowering their recognition rate. In this embodiment, rare characters are identified jointly by energy and confidence; compared with judging by confidence alone, with the same total number of characters (for example, twenty thousand), the accuracy can be improved by about 3%. This also avoids the inefficiency of directly training a line recognition model on twenty thousand categories while preserving the recognition rate of common characters, so both recognition efficiency and recognition accuracy are achieved.
In addition, in specific implementations, the inventors found that directly training a text line recognition model such as a CRNN on tens of thousands of Chinese character categories also brings other problems, such as data problems: real data is costly, corpora for rare characters are scarce, and training on randomly generated corpora works poorly.
To alleviate the above problems, in step S11 the first text line image training set and the second text line image training set may be mixed according to a first mixing ratio preset for the corresponding training batch to obtain the text line image training set, where the first text line image training set is obtained by cutting acquired real text images, and the second is obtained by collecting character libraries in different fonts and synthesizing text lines according to the arrangement rules of characters in text lines. Similarly, the first character image training set and the second character image training set may be mixed according to a second mixing ratio preset for the corresponding training batch to obtain the character image training set, where the first character image training set is obtained by cutting acquired real text images, and the second is obtained by collecting character libraries in different fonts, synthesizing text images according to the arrangement rules of characters in text lines, and then cutting them.
Because real data is costly to acquire, the ratio of synthetic data to real data is typically on the order of 100:1. If samples are drawn in that same proportion during training, the synthetic data is usually fitted better than the real data; fitting hard samples from real scenes is especially difficult, and the problem is more prominent on a small model with limited capacity. Therefore, some embodiments of the present invention adopt a batch-proportional sampling strategy. Specifically, the first text line image training set and the second text line image training set may be mixed according to a first mixing ratio preset for the corresponding training batch, and the first character image training set and the second character image training set may be mixed according to a second mixing ratio preset for the corresponding training batch, so that the text line recognition model trained on the text line image training set and the single character detection and recognition model trained on the character image training set reach a preset generalization performance index threshold. As an optional example, the first mixing ratio may be 1:1, and similarly the second mixing ratio may be 1:1. These mixing ratios are only examples; in specific implementations it suffices that the models trained on the mixed training sets reach the preset generalization performance index threshold.
This batch-proportional sampling strategy raises the sampling rate of real data, improves recognition accuracy in real scenes, and alleviates the training data problem to a certain extent.
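A minimal sketch of such a batch-proportional sampler, assuming indexable real and synthetic datasets (the interface is an assumption for illustration):

```python
import random
from torch.utils.data import Dataset

def mixed_batch(real_ds: Dataset, synth_ds: Dataset,
                batch_size: int, real_ratio: float):
    """Draw one training batch with a fixed real/synthetic mixing ratio.

    real_ratio is the fraction of the batch taken from real cropped images;
    the rest comes from the synthesized corpus. The ratio can be scheduled
    per training batch, e.g. to favour real data as training progresses.
    """
    n_real = int(batch_size * real_ratio)
    batch = [real_ds[random.randrange(len(real_ds))] for _ in range(n_real)]
    batch += [synth_ds[random.randrange(len(synth_ds))]
              for _ in range(batch_size - n_real)]
    random.shuffle(batch)   # avoid a fixed real/synthetic ordering inside the batch
    return batch
```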
To generate a better text recognition model, after step S11 an image augmentation operation may also be performed on the text line image training set and the character image training set before step S12 is performed, improving the robustness of the generated text recognition model.
Image augmentation applies a series of random changes to the training images to generate similar but different training samples, expanding the size of the image set used for training. In specific implementations, the image augmentation operation may include one or more of random cropping, random noise, affine transformation, projective transformation, and illumination/contrast transformation; the embodiments of the present invention do not limit the specific augmentation methods.
Regularization techniques such as data augmentation and Dropout overcome overfitting by adding noise and have been highly successful on large neural networks. Dropout means that during training of a deep network, neural network units are temporarily dropped from the network with a certain probability.
However, these conventional regularization techniques can harm the performance of small neural networks, because small models have limited capacity and are often under-fitted. To address this, some embodiments of the present invention adopt a self-decaying image augmentation method: the image augmentation operation is performed on the text line image training set and the character image training set according to hyper-parameters set for the corresponding training batch, and these hyper-parameters decay as the training batch index increases. In specific implementations, the augmentation strength can be controlled by a hyper-parameter epsilon: a larger epsilon means a larger difference between the augmented image and the original, and epsilon can be decayed during training just like the learning rate. When epsilon is large, self-decaying augmentation can be regarded as a kind of pre-training; as epsilon gradually shrinks, the fit to the original image data is gradually strengthened, which improves the character recognition rate in general scenarios.
With this self-decaying image augmentation method, the fit to the original data is gradually strengthened during training; testing showed that it ultimately improves the OCR recognition rate in general scenarios by about 2%.
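A minimal sketch of the self-decaying augmentation schedule; the decay law and the single rotation-based perturbation are assumptions standing in for the fuller set of augmentations listed above:

```python
import random

def epsilon_schedule(batch_idx: int, eps0: float = 1.0, decay: float = 1e-4) -> float:
    """Augmentation strength that decays as the training batch index grows."""
    return eps0 / (1.0 + decay * batch_idx)

def augment(image, eps: float):
    """Apply a random perturbation whose magnitude is bounded by eps.

    Placeholder: eps only scales a random rotation here; a real pipeline
    would likewise scale noise, affine/projective warps, contrast, etc.
    """
    angle = random.uniform(-5.0, 5.0) * eps   # larger eps => stronger distortion
    return image.rotate(angle)                # assumes a PIL.Image input

# Early batches: eps is near eps0, so heavy augmentation acts like a pre-training.
# Late batches: eps -> 0, so the model increasingly fits the original images.
```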
After the text recognition model is trained, it can be used for text recognition, which is described in detail below with specific embodiments and application scenarios.
Referring to the flowchart of the text recognition method shown in fig. 2, in some embodiments of the present invention the method may include the following steps:
and S21, acquiring a text line image to be recognized.
And S22, performing line text recognition on the text line image by adopting the trained text line recognition model to obtain a text line recognition result.
As an optional example, the text line recognition model may include a CRNN. As described in the foregoing embodiments, the text line recognition model may also use neural network models or algorithms of other structures; the embodiments of the present invention do not limit the specific type or structure of the model.
S23: evaluate the confidence and energy of each character in the text line recognition result; when the confidence of any character in the text line recognition result is below the preset confidence threshold or the energy of the character is above the preset energy threshold, perform step S24.
As an optional example, the energy of each character in the text line recognition result may be evaluated according to the following formula:

$$E(x; f) = -T \log \sum_{i=1}^{K} e^{f_i(x)/T},$$

where $x$ denotes the image of the character, $T$ denotes a temperature hyper-parameter, $K$ denotes the number of categories, $f$ is the text line recognition model network (for example, a CRNN), and $f_i(x)$ denotes the feature value corresponding to the i-th of the $K$ categories.
S24: cut a character sub-image of preset width centered on the corresponding character in the text line image.
In specific implementations, the preset width may be set according to the language and character type to be recognized, the image size, the character size, and so on; the embodiments of the present invention do not limit its specific value.
S25: for each character sub-image, perform single character detection and recognition using the trained single character detection and recognition model, and output a corresponding character recognition result.
As described in the foregoing embodiments, the single character detection and recognition model first detects a character and then recognizes which specific character in the dictionary it is. The embodiments of the present invention do not limit its specific type; as optional examples, it may be an RPN structure, an RRPN structure, an RCNN structure, a Fast R-CNN structure, or a combination thereof.
S26: combine the text line recognition result with the corresponding character recognition results to obtain and output the character recognition result of the text line image.
With this embodiment, on the one hand, judging the text line recognition result jointly by energy and confidence allows the characters that require single character detection and recognition to be selected more efficiently; on the other hand, combining line recognition with a single character detection and recognition model for rare characters makes character recognition with the text recognition model more accurate.
In specific implementations, to improve the robustness of the text recognition model in general scenarios, image enhancement preprocessing may be performed on the text line image to be recognized before line-level text recognition is performed with the trained text line recognition model. Specifically, image enhancement processes such as median filtering, rotation, or size scaling may be applied to the text line image.
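Putting steps S21 to S26 together, the inference path can be sketched as follows; `decode`, the two models, and the thresholds are hypothetical placeholders for the trained components described above:

```python
import torch

def recognize_text_line(image: torch.Tensor, line_model, char_model,
                        Tc: float, Te: float, char_width: int = 32) -> str:
    """S21-S26: line recognition, confidence/energy gating, single-character
    fallback on suspected rare characters, then merge of the two results."""
    logits = line_model(image)                 # (L, K) per-character scores
    chars, centers = decode(logits)            # hypothetical CTC decoder giving
                                               # characters and their x-centers
    conf = logits.softmax(-1).max(-1).values
    energy = -torch.logsumexp(logits, dim=-1)  # temperature T = 1 for brevity
    merged = list(chars)
    for i, (ch, x) in enumerate(zip(chars, centers)):
        if conf[i] < Tc or energy[i] > Te:     # suspected rare character (S23)
            left = max(0, x - char_width // 2) # cut centered on the character (S24)
            sub = image[..., left:left + char_width]
            merged[i] = char_model(sub)        # single-char detect + recognize (S25)
    return "".join(merged)                     # combined result (S26)
```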
For better understanding and implementation by those skilled in the art, an example scenario of the text recognition method is described with reference to fig. 3 and figs. 3a to 3f. For example, a text line image to be recognized as shown in fig. 3a is input into the preset text recognition model; within it, the text line recognition model, for example a CRNN comprising a CNN layer and a CTC layer, outputs the text line recognition result shown in fig. 3b.
The confidence Conf and the energy Energy of each character in the text line recognition result are evaluated and satisfy the correspondence shown in fig. 3c.
As an optional example, the energy may be calculated using the following formula:

$$E(x; f) = -T \log \sum_{i=1}^{K} e^{f_i(x)/T},$$

where $x$ denotes the image of the character, $T$ denotes a temperature hyper-parameter, $K$ denotes the number of categories, $f$ is the text line recognition model network, and $f_i(x)$ denotes the feature value corresponding to the i-th of the $K$ categories.
After the judgment, referring to fig. 3c, the confidence of character A01 in the text line recognition result is 0.64, which is below the preset confidence threshold Tc, and the energy value of character A02 is 400, which is above the preset energy threshold Te. Character sub-images of preset width are therefore cut centered on the corresponding characters in the text line image of fig. 3a, yielding for example the two character sub-images B01 and B02 shown in fig. 3d. The two character sub-images are then input into the preset single character detection and recognition model, which produces the character recognition results shown in fig. 3e with confidences of 0.93 and 0.96 respectively, satisfying the preset recognition accuracy requirement. The text line recognition result of fig. 3b and the corresponding character recognition results from the single character detection and recognition of fig. 3e are combined to obtain the text recognition result of the text line image shown in fig. 3f.
It can thus be seen that combining the joint judgment of character confidence and energy with the combined recognition of the text line recognition model and the single character recognition model improves text recognition precision. Moreover, the joint judgment of confidence and energy screens out only the characters that need further single character detection and recognition for rare characters, avoiding character-by-character single character detection and recognition over the whole text line image and thereby improving overall recognition efficiency.
The embodiments of the present specification further provide a corresponding text recognition system. As shown in the structural diagram of fig. 4, the text recognition system 40 may include an acquisition unit 41, a text line recognition unit 42, an evaluation unit 43, a cutting unit 44, a character recognition unit 45, and a recognition result output unit 46, where:
the acquiring unit 41 is adapted to acquire a text line image to be recognized;
the text line recognition unit 42 is adapted to perform line text recognition on the text line image by using a trained text line recognition model to obtain a text line recognition result;
the evaluation unit 43 is adapted to evaluate the confidence and energy of each character in the text line recognition result;
the cutting unit 44 is adapted to cut a character sub-image with a preset width by taking a corresponding character in the text line image as a center when the evaluating unit determines that the confidence of any character in the text line recognition result is smaller than a preset confidence threshold or the energy of the character is larger than a preset energy threshold;
the character recognition unit 45 is adapted to perform single character detection and recognition by using the trained single character detection and recognition model for each character sub-image, and output a corresponding character recognition result;
the recognition result output unit 46 is adapted to combine the text line recognition result and the corresponding character recognition result to obtain and output a character recognition result of the text line image.
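The following is a minimal sketch of how these six units might compose, under the assumption that each unit is supplied as a callable; all names are illustrative and not part of the embodiments described here.

```python
class TextRecognitionSystem:
    def __init__(self, acquire, recognize_line, evaluate, cut,
                 recognize_char, combine):
        self.acquire = acquire                # obtaining unit 41
        self.recognize_line = recognize_line  # text line recognition unit 42
        self.evaluate = evaluate              # evaluation unit 43
        self.cut = cut                        # cutting unit 44
        self.recognize_char = recognize_char  # character recognition unit 45
        self.combine = combine                # recognition result output unit 46

    def run(self, source):
        image = self.acquire(source)
        line_result = self.recognize_line(image)
        # The evaluation unit flags characters with low confidence or high energy.
        flagged = self.evaluate(line_result)
        sub_images = [self.cut(image, ch) for ch in flagged]
        char_results = [self.recognize_char(s) for s in sub_images]
        return self.combine(line_result, char_results)
```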
The present specification further provides a corresponding electronic device. Referring to the schematic structural diagram of fig. 5, the electronic device 50 includes a memory 51 and a processor 52, the memory 51 storing a computer program executable on the processor 52; when the processor 52 executes the computer program, the steps of the text recognition model generation method or of the text recognition method according to any one of the foregoing embodiments may be performed. For specific steps, reference may be made to the detailed description of the method embodiments above.
In addition, in an implementation, with continued reference to fig. 5, the electronic device 50 may further include a display module 53 adapted to display the output text line recognition result.
In a specific implementation, with continued reference to fig. 5, the electronic device may further include an input interface 54, through which the user may select the text line image to be recognized or perform basic or personalized settings. As an alternative example, the input interface 54 includes an optical scanning interface, through which the text line image to be recognized may be acquired directly; as another optional example, the input interface 54 includes a camera module, with which a page or interface to be recognized can be photographed to obtain the text line image to be recognized.
In other embodiments, the text line image to be recognized may be acquired through the communication interface 55, which may be a Bluetooth interface or another short-range communication interface.
In one embodiment, the memory 51, the processor 52, the display module 53, the input interface 54, and the communication interface 55 may communicate with each other via a bus 56.
As an alternative example, the electronic device 50 may be a data processing device capable of offline processing, for example a dictionary pen, and the processor 52 may be a single-core processor, a multi-core processor, a general-purpose processor, or a custom processor; the specific configuration and implementation of the processor are not limited here.
In specific implementation, the dictionary pen is a product for student users that supports scanning as well as language learning through voice interaction and other operation modes, helping users with listening, speaking, reading, recitation, translation, and the like. Its core basic function is scanning words to look them up, and because an offline algorithm is adopted, the dictionary pen can be used normally in a network-free offline state, which greatly improves its convenience and usability.
Offline OCR serves as the entry point for extracting text information, that is, as the input interface, so once a recognition error occurs the whole function may become unavailable. Meanwhile, as a product facing end consumers, the dictionary pen must balance user experience against hardware cost, and its data processing capability is generally limited; for example, it may adopt a single-core, dual-core, or quad-core low-power system-on-chip or processor.
Test verification shows that, with the optimizations and improvements of the embodiments of the present invention, the accuracy of the dictionary pen on a rarely-used character test set can be improved by more than 30% overall, while the recognition rate of common characters is hardly affected. In addition, the multiple improved designs for small-model training in the embodiments of the present specification can meet the requirements of efficient and fast user applications in the dictionary pen scenario.
It can be understood that the electronic device may specifically be a wearable device such as smart glasses or a smart watch, a low-end mobile phone, or an Internet of things device such as a portable scanner or a smart speaker; the embodiments of the present specification do not limit the specific type of electronic device to which the text recognition method and system of the embodiments of the present invention can be applied.
The embodiments of the present specification further provide a computer-readable storage medium on which a computer program is stored; when executed, the computer program performs the steps of the text recognition model generation method or of the text recognition method described in any of the foregoing embodiments. For specific steps, reference may be made to the foregoing embodiments, which are not repeated here.
In particular implementations, the computer-readable storage medium may be a variety of suitable readable storage media such as an optical disk, a mechanical hard disk, a solid state disk, and so on.
Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A text recognition model generation method, comprising:
respectively acquiring a text line image training set and a character image training set;
inputting the text line image training set into a preset text line recognition model, training the text line recognition model, inputting the character image training set into a preset single character detection and recognition model, and training the single character detection and recognition model;
acquiring a text line image test set, inputting it into the text line recognition model, and outputting a text line recognition result;
evaluating the confidence coefficient and energy of each character in the text line recognition result;
when the confidence coefficient of any character in the text line recognition result is smaller than a preset confidence coefficient threshold value or the energy of the character is larger than a preset energy threshold value, cutting a character test sub-image with a preset width by taking a corresponding character in the text line image as a center;
inputting each character test sub-image into the single character detection and recognition model, and outputting a corresponding character recognition result;
combining the text line recognition result and the corresponding character recognition result to obtain a character recognition test result of the text line image test set;
and determining, based on whether the character recognition test result reaches a preset performance evaluation index, whether to continue training the text line recognition model and the single character detection and recognition model, until the character recognition test result reaches the preset performance evaluation index.
2. The method of claim 1, wherein the separately obtaining a text line image training set and a character image training set comprises:
mixing a first text line image training set and a second text line image training set according to a first mixing proportion preset in a corresponding training batch to obtain a text line image training set; the first text line image training set is obtained by cutting an acquired real text image; the second text line image training set is obtained by collecting word libraries with different fonts and synthesizing text lines according to the arrangement rule of characters in the text lines;
mixing the first character image training set and the second character image training set according to a second mixing proportion preset in a corresponding training batch to obtain a character image training set; the first character image training set is obtained by cutting an acquired real text image; and the second character image training set is obtained by collecting word libraries with different fonts, synthesizing a text image according to the arrangement rule of characters in a text line and then cutting the text image.
3. The method of claim 2, wherein the first text line image training set and the second text line image training set are mixed according to a first mixing ratio preset for a corresponding training batch, and the first character image training set and the second character image training set are mixed according to a second mixing ratio preset for a corresponding training batch, so that the text line recognition model obtained through the training of the text line image training set and the single character detection and recognition model obtained through the training of the character image training set reach a preset generalization performance index threshold.
4. The method of claim 1, wherein the text line recognition model comprises: a convolutional recurrent neural network.
5. The method of any of claims 1-4, further comprising: and performing image augmentation operation on the text line image training set and the character image training set.
6. The method of claim 5, wherein the performing an image augmentation operation on the text line image training set and character image training set comprises:
and executing image augmentation operations on the text line image training set and the character image training set according to hyper-parameters set for the corresponding training batches, wherein the hyper-parameters decay as the number of training batches increases.
7. A text recognition method, comprising:
acquiring a text line image to be recognized;
performing line text recognition on the text line image by adopting a trained text line recognition model to obtain a text line recognition result;
evaluating the confidence coefficient and energy of each character in the text line recognition result;
when the confidence coefficient of any character in the text line recognition result is smaller than a preset confidence coefficient threshold value or the energy of the character is larger than a preset energy threshold value, cutting a character sub-image with a preset width by taking a corresponding character in the text line image as a center;
for each character sub-image, adopting a trained single character detection and recognition model to perform single character detection and recognition, and outputting a corresponding character recognition result;
and combining the text line recognition result and the corresponding character recognition result to obtain the character recognition result of the text line image.
8. The method of claim 7, wherein the text line recognition model comprises: a convolutional recurrent neural network.
9. The method of claim 7, wherein the energy of each character in the text line recognition result is evaluated according to the following formula:
$$E(X; f) = -T \log \sum_{i=1}^{K} e^{f_i(X)/T}$$

wherein X represents the image of the character, T represents the temperature hyper-parameter, K represents the number of categories, f is the text line recognition model network, and f_i(X) represents the feature value corresponding to the i-th category among the K categories.
10. The method of any of claims 7-9, wherein prior to performing line text recognition on the text line image using the trained text line recognition model, further comprising:
and carrying out image enhancement preprocessing on the text line image to be recognized.
11. A text recognition system, comprising:
the acquisition unit is suitable for acquiring a text line image to be recognized;
the text line recognition unit is suitable for performing line text recognition on the text line image by adopting a trained text line recognition model to obtain a text line recognition result;
the evaluation unit is suitable for evaluating the confidence coefficient and the energy of each character in the text line recognition result;
the cutting unit is suitable for cutting a character sub-image with a preset width by taking a corresponding character in the text line image as a center when the evaluation unit determines that the confidence coefficient of any character in the text line recognition result is smaller than a preset confidence coefficient threshold value or the energy of the character is larger than a preset energy threshold value;
the character recognition unit is suitable for detecting and recognizing the single characters by adopting the trained single character detection and recognition model for each character sub-image and outputting a corresponding character recognition result;
and the recognition result output unit is suitable for combining the text line recognition result and the corresponding character recognition result to obtain and output the character recognition result of the text line image.
12. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor performs the steps of the text recognition model generation method of any one of claims 1 to 6 or of the text recognition method of any one of claims 7 to 10 when executing the computer program.
13. A computer-readable storage medium, on which a computer program is stored, wherein the computer program executes the steps of the text recognition model generation method of any one of claims 1 to 6 or the text recognition method of any one of claims 7 to 10.
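As an aid to implementation, the following is a minimal sketch of the generation method of claim 1: training both models, testing with the confidence/energy screening and single character fallback, and continuing training until the preset performance evaluation index is reached. Every name below is an illustrative assumption rather than part of the claims.

```python
def generate_models(line_train_set, char_train_set, line_test_set,
                    line_model, char_model, conf_threshold: float,
                    energy_threshold: float, performance_index: float,
                    crop_width: int = 32, max_rounds: int = 10):
    for _ in range(max_rounds):
        line_model.train(line_train_set)   # train the text line recognition model
        char_model.train(char_train_set)   # train the single character model
        correct, total = 0, 0
        for image, ground_truth in line_test_set:
            pieces = []
            for ch in line_model.recognize(image):
                # Screen suspicious characters by confidence and energy.
                if ch.confidence < conf_threshold or ch.energy > energy_threshold:
                    left = max(0, int(ch.center_x) - crop_width // 2)
                    sub = image[:, left:left + crop_width]
                    pieces.append(char_model.recognize(sub).text)
                else:
                    pieces.append(ch.text)
            correct += int("".join(pieces) == ground_truth)
            total += 1
        # Stop once the character recognition test result reaches the index.
        if total and correct / total >= performance_index:
            break
    return line_model, char_model
```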
CN202210859202.6A 2022-07-21 2022-07-21 Text recognition model generation method, text recognition device, and storage medium Pending CN115393867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210859202.6A CN115393867A (en) 2022-07-21 2022-07-21 Text recognition model generation method, text recognition device, and storage medium

Publications (1)

Publication Number Publication Date
CN115393867A true CN115393867A (en) 2022-11-25

Family

ID=84117254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210859202.6A Pending CN115393867A (en) 2022-07-21 2022-07-21 Text recognition model generation method, text recognition device, and storage medium

Country Status (1)

Country Link
CN (1) CN115393867A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912852A (en) * 2023-07-25 2023-10-20 京东方科技集团股份有限公司 Method, device and storage medium for identifying text of business card

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination