CN116798044A - Text recognition method and device and electronic equipment
- Publication number
- CN116798044A CN116798044A CN202211242295.4A CN202211242295A CN116798044A CN 116798044 A CN116798044 A CN 116798044A CN 202211242295 A CN202211242295 A CN 202211242295A CN 116798044 A CN116798044 A CN 116798044A
- Authority
- CN
- China
- Prior art keywords
- sequence
- character recognition
- feature
- sample
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/16—Image preprocessing
- G06V30/164—Noise filtering
Abstract
The application discloses a text recognition method, which belongs to the technical field of optical character recognition and helps improve text recognition accuracy. The method comprises the following steps: inputting a target image into a convolutional neural network in a pre-trained character recognition model, and acquiring a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1; reorganizing the feature map to obtain a feature sequence of the target image; performing coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence; and decoding the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image. By extracting a feature map whose height is greater than 1, character recognition can be performed on finer-grained features, which improves the text recognition accuracy for complex text such as arc-shaped text images and stamp images.
Description
Technical Field
The present application relates to the field of optical character recognition technology, and in particular, to a text recognition method, apparatus, electronic device, and computer readable storage medium.
Background
For text recognition in an image, the traditional approach is to first segment the image into text-line images, then segment each text line into single characters, and finally recognize each character separately. With the widespread use of neural network models, end-to-end text recognition based on deep learning has emerged. In such a scheme, an explicit text-line segmentation step is no longer required; optical character recognition is converted into a sequence learning problem, and the whole text image, regardless of its scale or text length, can be recognized after passing through a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) and being transcribed by CTC (Connectionist Temporal Classification) at the output stage. However, the accuracy of existing end-to-end deep-learning text recognition methods still needs to be improved when processing, for example, text in stamp images or arc-shaped text.
Disclosure of Invention
The embodiment of the application provides a text recognition method, which helps improve the text recognition accuracy for complex text images such as arc-shaped text images and stamp images.
In a first aspect, an embodiment of the present application provides a text recognition method, including:
inputting a target image into a convolutional neural network in a pre-trained character recognition model, and acquiring a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1;
obtaining, for each width position, a feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions;
sequentially splicing the feature subsequences corresponding to the width positions into a feature sequence of the target image;
performing coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence;
and decoding the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, including:
a feature map acquisition module, configured to input a target image into a convolutional neural network in a pre-trained character recognition model, and acquire a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1;
a first feature conversion module, configured to obtain, for each width position, a feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions;
a second feature conversion module, configured to sequentially splice the feature subsequences corresponding to the width positions into a feature sequence of the target image;
a feature coding module, configured to perform coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence;
and a decoding output module, configured to decode the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
In a third aspect, the embodiment of the application also discloses an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the text recognition method according to the embodiment of the application when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text recognition method disclosed in the embodiments of the present application.
According to the text recognition method disclosed by the embodiment of the application, a target image is input into a convolutional neural network in a pre-trained character recognition model, and a feature map with height D and width n output by the convolutional neural network for the target image is acquired, wherein D and n are integers greater than 1; for each width position, a feature subsequence corresponding to that width position is obtained according to the D groups of features in the feature map that correspond to the same width position and to different height positions; the feature subsequences corresponding to the width positions are sequentially spliced into a feature sequence of the target image; the feature sequence is coded and mapped through a sequence coding network in the character recognition model to obtain a coding sequence; and the coding sequence is decoded through a CTC decoder in the character recognition model to obtain a character recognition result of the target image. By configuring the convolutional neural network to output a feature map whose height is greater than 1, the character recognition model extracts finer-grained image features from the target image; when the subsequent sequence coding network encodes these finer-grained features, interference from noise in the image is reduced and text recognition can be performed on finer-grained features, which improves the text recognition accuracy for complex text images such as arc-shaped text images and stamp images.
The foregoing description is only an overview of the technical solution of the present application. In order that the technical means of the present application may be more clearly understood and implemented in accordance with the content of the specification, and to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the present application are described below.
Drawings
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
FIG. 1 is a flow chart of a text recognition method of one embodiment of the present application;
FIG. 2 is a schematic diagram of a character recognition model in one embodiment of the application;
FIG. 3 is a schematic diagram of a correspondence between an input image and an output feature map of a convolutional neural network according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a process for reorganizing a feature map output by a convolutional neural network into a feature sequence in one embodiment of the present application;
FIG. 5 is a schematic diagram of the correspondence between a feature sequence input to a sequence encoding network and the position in an image in the prior art;
FIG. 6 is a schematic diagram showing the correspondence between a feature sequence input to a sequence encoding network and the position in an image according to an embodiment of the present application;
FIG. 7 is another flow chart of a text recognition method in one embodiment of the application;
FIG. 8 is a schematic diagram of a text recognition device in accordance with one embodiment of the present application;
fig. 9 schematically shows a block diagram of an electronic device for performing the method according to the application; and
fig. 10 schematically shows a memory unit for holding or carrying program code for implementing the method according to the application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Example 1
The embodiment of the application discloses a text recognition method, as shown in fig. 1, which comprises the following steps: steps 110 to 150.
Step 110, inputting a target image into a convolutional neural network in a pre-trained character recognition model, and acquiring a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1.
The target image in the embodiment of the application is a grayscale image obtained by preprocessing an image containing the text to be recognized, and its height and width are normalized to the input image size of the pre-trained character recognition model, for example, height H and width W.
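As an illustration of this preprocessing step, the following sketch converts an input image to a normalized grayscale array. The target size (32x320), the use of OpenCV, and the scaling to [0, 1] are assumptions for illustration and are not specified by the patent.

```python
import cv2
import numpy as np

def preprocess(image_path, target_h=32, target_w=320):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)   # read as a grayscale image
    img = cv2.resize(img, (target_w, target_h))          # normalize height/width to the model input size
    img = img.astype(np.float32) / 255.0                 # scale pixel values to [0, 1]
    return img[np.newaxis, np.newaxis, :, :]             # (1, 1, H, W) layout expected by the CNN
```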
In the embodiment of the application, the character recognition model is a neural network model which is trained offline in advance. The character recognition model may employ a CRNN model.
The CRNN model combines a CNN and an RNN for training. It recognizes text sequences of indefinite length end-to-end: no prior segmentation into single characters is required, and text recognition is converted into a time-dependent sequence learning problem, i.e., image-based sequence recognition. The CRNN model mainly includes three layers: the CNN (convolutional layer), which uses a deep CNN to extract features from the input image and obtain a feature map; the RNN (recurrent layer), which uses a bidirectional RNN (BLSTM) to predict the feature sequence, learning each feature vector in the sequence and outputting a predicted label distribution; and CTC loss (transcription layer), which converts the series of label distributions obtained from the recurrent layer into the final label sequence.
In the prior art, for an image with height H and width W input to the CRNN model, the CNN outputs, after convolution, a feature map with height 1 and width W/m, where m is the width of the time window. The feature vector sequence required by the RNN is extracted from this feature map, and the features corresponding to each width position are classified.
Taking a target image of size HxWx1 as an example, where H denotes the height, W the width, and 1 the number of channels (a grayscale image), the target image input to a prior-art CRNN model is processed by the feature layers of the CNN (a series of convolution layers, pooling layers, batch normalization layers and activation layers) to obtain a feature map of shape 1xnxL, where 1 denotes the height of the feature map, n denotes the width of the feature map, and L denotes the number of feature maps.
As shown in fig. 2, the character recognition model employed in the embodiment of the present application includes: convolutional neural network 210, data structure conversion layer 220, sequence encoding network 230, and CTC decoder 240. In the text recognition stage, the convolutional neural network 210 is configured to perform feature extraction on an input image (such as a target image) to obtain a feature map; the data structure conversion layer 220 is configured to convert the feature map output by the convolutional neural network 210 into a feature sequence that meets the input requirement of the sequence encoding network 230; the sequence coding network 230 is configured to code the input feature sequence to obtain a coding sequence; CTC decoder 240 is configured to transcribe the coding sequence output by the sequence encoding network 230 into a character coding sequence, i.e., a text recognition result.
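The following is a minimal, hypothetical sketch of how these four components could be assembled. The layer sizes, the use of PyTorch, the choice of a bidirectional LSTM as the sequence encoding network, and the adaptive pooling used to fix the feature-map height at D are illustrative assumptions rather than the patent's actual network design.

```python
import torch
import torch.nn as nn

class CharRecognitionModel(nn.Module):
    def __init__(self, num_classes, D=3, channels=256, hidden=256):
        super().__init__()
        # Convolutional feature extractor (210); configured so that the output
        # feature map keeps a height of D (> 1) instead of collapsing to 1.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((D, None)),   # force the feature-map height to D
        )
        # Sequence encoding network (230): bidirectional LSTM over the feature sequence.
        self.encoder = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)  # per-step class scores

    def forward(self, x):                       # x: (batch, 1, H, W)
        fmap = self.cnn(x)                      # (batch, channels, D, n)
        b, c, d, n = fmap.shape
        # Data structure conversion (220): for each width position, stack the D
        # height-wise vectors front to back, giving a sequence of length D * n.
        seq = fmap.permute(0, 3, 2, 1).reshape(b, n * d, c)
        enc, _ = self.encoder(seq)              # coding sequence
        return self.classifier(enc).log_softmax(-1)  # log-probabilities for the CTC decoder (240)
```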
In an embodiment of the present application, the convolutional neural network 210 is configured to output L feature maps with height D and width n, where L, D and n are all integers greater than 1. The number L and the width n of the feature maps are determined according to the computing capability of the processing device executing the method and according to industry experience. Further, the value of the height D is determined according to the curvature of the text lines matched with the target image: the larger the curvature of the text line, the larger the height D. For example, D may be set to 3, 4 or more for a character recognition model that recognizes the text in stamp images, and to 2 for a character recognition model that recognizes multiple text lines or a single text line with slight curvature.
In the implementation process, a character recognition model may be first constructed according to a specific application scenario of the text recognition method, a height value D of a feature map output by the convolutional neural network 210 in the character recognition model is determined, and a structure of the convolutional neural network 210 is designed according to an output requirement that the height value of the feature map is D. Thus, after training, the obtained character recognition model obtains a feature map with a height D after feature extraction by the convolutional neural network 210 for each input image.
In some embodiments of the present application, the height of the feature map output by the convolutional neural network 210 may be configured by adjusting the sampling window (kernel) size or stride of one or more convolutional layers and/or max-pooling layers located at the back end of the convolutional neural network 210 shown in fig. 2.
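Purely as an illustration of this configuration idea, the toy example below shows how the kernel size and stride of a pooling layer determine whether the feature-map height collapses to 1 or is kept at a larger value D; the specific layer settings are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 8, 80)                 # (batch, channels, height, width)

# A conventional CRNN backend might collapse the height to 1:
collapse = nn.MaxPool2d(kernel_size=(8, 1), stride=(8, 1))
print(collapse(x).shape)                       # torch.Size([1, 256, 1, 80])

# Keeping a larger height D = 4 by using a smaller vertical stride:
keep_d = nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1))
print(keep_d(x).shape)                         # torch.Size([1, 256, 4, 80])
```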
In the embodiment of the present application, taking a configured feature-map height of 3 as an example, after a target image of size HxWx1 is input into the character recognition model designed in the embodiment of the present application, the convolutional neural network 210 extracts a feature map of height D from the input target image. That is, the feature map output by the convolutional neural network 210 has shape DxnxL, where D takes the value 3 and denotes the height of the feature map. In other words, in the character recognition model designed in the embodiment of the present application, the convolutional neural network 210 outputs L feature maps with width n and height 3 (i.e., D). According to the mapping relationship between the feature map output by the convolutional neural network 210 and the pixel positions of the target image, the image features at the same width position of the target image (i.e., the same time window of the convolutional neural network) are represented by D feature vectors, which provides stronger feature expression capability and is better suited to expressing curved text lines.
Because the convolutional layers, max-pooling layers and activation functions operate on local regions, they are translation invariant. Therefore, each feature vector in the output feature map of the convolutional neural network 210 corresponds to a rectangular region (i.e., a receptive field) of the input image (e.g., the target image), and these rectangular regions have the same left-to-right order as the corresponding columns of the feature map. Taking the input image and the feature map shown in fig. 3 as an example: if each feature map output by the convolutional neural network 210 is divided into n time windows from left to right by width position, and the input image is divided into n rectangular regions from left to right, then in each feature map the feature vector corresponding to each time window corresponds to one rectangular region of the input image. For example, the feature vectors located in the cuboid region 320 in fig. 3 are the feature vectors of the image region 310 in the input image. In the embodiment of the application, the height of each feature vector is D, i.e., each feature vector has D dimensions, and different dimensions of the feature vector correspond to different height positions of the input image.
Therefore, the L feature maps of the input image extracted by the convolutional neural network 210 designed in the embodiment of the present application have D groups of features at each width position, and each group of features corresponds to a different image height.
Step 120, obtaining, for each width position, a feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions.
In the embodiment of the present application, the sequence encoding network 230 is implemented based on a sequence network structure such as RNN or LSTM. The input data of the sequence encoding network 230 is a feature sequence, and thus, it is necessary to convert the feature map output by the convolutional neural network 210 into a feature sequence that can be processed by the sequence encoding network 230.
In an embodiment of the present application, the conversion from feature map to feature sequence is achieved by providing a data structure conversion layer 220 between convolutional neural network 210 and sequence coding network 230.
In some embodiments of the present application, obtaining the feature subsequence corresponding to a width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions includes: for the D groups of features corresponding to the same width position in the feature map, splicing the D groups of features front to back in the top-to-bottom order of the positions in the target image to which each group of features corresponds, to obtain the feature subsequence corresponding to that width position.
As can be seen from the feature-map mapping diagram of fig. 3, the feature vectors corresponding to the same width position in each feature map (the feature vectors in a cuboid such as 320) express the image features of a rectangular image area (e.g., rectangular area 310) in the input image. In the embodiment of the present application, by improving the structure of the convolutional neural network 210, the height of the feature map output by the convolutional neural network 210 is D; that is, the feature vector corresponding to the same width position in each feature map is represented by D-dimensional features, where the feature of each dimension expresses the image features of the image area at that width position and at a particular height position. Equivalently, the features of the image area at each width position of the input image are expressed by D groups of feature vectors, each group expressing the image features of the image area at a specified image height position within that width position.
When performing the data structure conversion of the features, the D groups of feature vectors of the image region corresponding to one time window (i.e., the feature vectors corresponding to each height) are used as the feature subsequence of that image region, and the feature subsequences corresponding to the image regions are then spliced in order, yielding the feature vector sequence of the input image.
For example, for a convolutional neural network 210 with n time windows, denote the feature vector at height position j and width position i as v(j, i), where j denotes a height position and takes values 1 to D, and i denotes a time window number, i.e., a width position, and takes values 1 to n. The feature map output by the convolutional neural network 210 that corresponds to height position 1 of the input image can then be expressed as the vector sequence (v(1, 1), v(1, 2), ..., v(1, n)), and the feature map corresponding to height position D of the input image can be expressed as the vector sequence (v(D, 1), v(D, 2), ..., v(D, n)).
Wherein the image height position is determined according to the characteristic height output by the designed convolutional neural network 210. For example, when the feature height output by the convolutional neural network 210 is configured as D, the input image (e.g., the target image) may be divided into D rows, one image height position for each row, on average from top to bottom in the height direction. For example, the uppermost row may be defined to correspond to image height position 1 and the lowermost row to correspond to image height position D, thereby determining the height position in the input image.
The D groups of feature vectors corresponding to the same width position i, i.e., the feature vectors corresponding to width position i and height positions 1 to D (D groups in total), are spliced front to back in the top-to-bottom order of the image height positions to which each group corresponds (i.e., in the order of height positions 1 to D), giving the feature subsequence corresponding to width position i. Specifically, the feature vector v(1, i) corresponding to height position 1 of the target image, the feature vector v(2, i) corresponding to height position 2 of the target image, and so on up to the feature vector v(D, i) corresponding to height position D of the target image, are spliced in order front to back to obtain the feature subsequence corresponding to width position i. The effect of the data structure conversion is shown in fig. 4: the top three rows are feature vectors at different heights, and the bottom row is the feature subsequence obtained after the data structure conversion.
In some embodiments of the present application, the feature vector corresponding to a certain time window may be obtained by fusing, across the L feature maps output by the convolutional neural network 210, the feature vectors that correspond to that time window and to the same image height position. For example, in the column of feature vectors corresponding to the first time window in the L feature maps, the feature vectors at image height position 1 are fused to obtain the feature vector v(1, 1); and in the column of feature vectors corresponding to the n-th time window in the L feature maps, the feature vectors at image height position D are fused to obtain the feature vector v(D, n).
Step 130, sequentially splicing the feature subsequences corresponding to the width positions into the feature sequence of the target image.
In the above manner, the feature subsequence corresponding to each width position of the target image can be obtained.
For a convolutional neural network 210 with n time windows (i.e., the feature map of the target image corresponds to n width positions), if the convolutional neural network 210 is configured to output feature maps of height D, it outputs L feature maps with width n and height D, and after the data structure conversion a feature sequence of length D×n is obtained by splicing the n feature subsequences of length D.
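A minimal sketch of the data structure conversion in steps 120 and 130 follows, assuming the feature map is available as a tensor of shape (L, D, n) (channels, height, width); the tensor layout and the use of PyTorch are assumptions for illustration.

```python
import torch

def feature_map_to_sequence(fmap):
    # fmap: (L, D, n) = (channels, height, width)
    L, D, n = fmap.shape
    # For every width position i, take the D height-wise feature vectors
    # (top to bottom) and splice them front to back; then concatenate the
    # n subsequences in width order, giving a sequence of length D * n.
    return fmap.permute(2, 1, 0).reshape(n * D, L)   # (D * n, L)

fmap = torch.randn(256, 3, 40)                  # L=256 channels, D=3, n=40 time windows
print(feature_map_to_sequence(fmap).shape)      # torch.Size([120, 256])
```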
In the prior art, the features input from the convolutional layer to the recurrent layer of the CRNN model form a sequence of feature vectors, one per column image region (i.e., per width position), as shown in fig. 5. In the embodiment of the present application, by configuring the height of the output feature map of the convolutional neural network 210 to a value greater than 1, the features input from the convolutional neural network 210 to the sequence coding network 230 (corresponding to the recurrent layer in the prior art) form a sequence of feature vectors, one per grid region, as shown in fig. 6. Because the feature vectors input to the sequence coding network 230 are generated from feature maps at several different image height positions, they carry finer image features and express the image in finer detail, so the character recognition model is more robust to interference and the recognition result is more accurate. For example, when encoding and transcribing based on the sequence of feature vectors corresponding to each grid region as shown in fig. 6, the features of the five-pointed-star region in the grid cause much less interference with text recognition. If recognition were performed as in the prior art, the image features of the five-pointed star would occupy a large proportion of the feature vector corresponding to the width position where the star is located, which would cause character recognition errors at that width position.
Step 140, performing coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence.
Next, the feature sequence with the length d×n obtained by the data structure conversion is input to the sequence coding network 230, and the sequence coding network 230 performs code mapping on the feature sequence to obtain a code sequence.
The specific implementation of the sequence coding network may be referred to the specific implementation of the RNN network (i.e. the loop layer) in the CRNN network in the prior art, which is not described in detail in the embodiment of the present application.
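For illustration only, the following sketch shows one conventional way such a sequence coding network could be realized, using a bidirectional LSTM over the D×n feature sequence followed by a per-step linear projection; the hidden size and the character-set size of 6000 are arbitrary placeholders.

```python
import torch
import torch.nn as nn

D, n, feat_dim = 3, 40, 256
seq = torch.randn(1, D * n, feat_dim)                     # feature sequence from step 130
encoder = nn.LSTM(feat_dim, 256, num_layers=2,
                  bidirectional=True, batch_first=True)   # BLSTM, as in the CRNN recurrent layer
project = nn.Linear(512, 6000)                            # 6000: assumed character-set size + blank
coding_sequence, _ = encoder(seq)                         # (1, D*n, 512)
logits = project(coding_sequence)                         # per-step class scores fed to the CTC decoder
```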
Step 150, decoding the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
And then, the coding sequence obtained by the coding mapping processing of the sequence coding network 230 is input to the CTC decoder 240, and the CTC decoder 240 transcribes the coding sequence output by the sequence coding network 230, so as to obtain a text recognition result of the target image. The text recognition result is a character coding sequence corresponding to the text line in the target image.
The specific implementation of CTC decoders may be referred to CTC decoders in CRNN networks in the prior art, and will not be described in detail in the embodiments of the present application.
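As a rough illustration of the transcription idea, the sketch below performs greedy (best-path) CTC decoding: repeated labels are collapsed and blanks are removed. Practical CRNN implementations may instead use beam-search decoding; the blank index and tensor shapes here are assumptions.

```python
import torch

def ctc_greedy_decode(logits, blank=0):
    # logits: (T, num_classes) per-step class scores for one image
    best_path = logits.argmax(dim=-1).tolist()
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:   # collapse repeats, drop blanks
            decoded.append(label)
        prev = label
    return decoded                              # character code sequence

logits = torch.randn(120, 6000)
print(ctc_greedy_decode(logits))
```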
In order to facilitate readers to understand the text recognition method disclosed by the embodiment of the application, a training scheme of the character recognition model is further described below.
Referring to fig. 7, in some embodiments of the present application, before the target image is input to the convolutional neural network in the pre-trained character recognition model, the method further includes: step 100.
Step 100, training a character recognition model.
The training samples for training the character recognition model are as follows: a sample image provided with a tag, the tag being a true coded sequence of characters in the sample image, the convolutional neural network being configured to output a feature map of height D. The configuration and network structure of the convolutional neural network in the character recognition model can be seen from the previous description.
Training the character recognition model comprises executing the following character recognition steps for each training sample to obtain a predicted coding sequence of the characters in the corresponding sample image: inputting the sample image into the convolutional neural network in the character recognition model, and acquiring a sample feature map with height D and width n output by the convolutional neural network for the sample image; obtaining, for each width position, a sample feature subsequence corresponding to that width position according to the D groups of features in the sample feature map that correspond to the same width position and to different height positions; sequentially splicing the sample feature subsequences corresponding to the width positions into a feature sequence of the sample image; performing coding mapping on the feature sequence through the sequence coding network to obtain a sample coding sequence; and decoding the sample coding sequence through the CTC decoder to obtain a predicted coding sequence of the characters in the sample image. CTC loss is then calculated from the predicted coding sequences and the real coding sequences of the characters in the sample images of the training samples, and the character recognition model is trained iteratively by optimizing the CTC loss.
In some embodiments of the application, the sample image comprises one or more of the following: a stamp image, an arc text line image, and a multi-line text image.
When training the character recognition model, the character recognition model performs feature extraction, coding mapping and transcription on each training sample to obtain the predicted coding sequence corresponding to that training sample. Then, according to the labels of the training samples (i.e., the real coding sequences) and the predicted coding sequences output by the character recognition model, the CTC (Connectionist Temporal Classification) loss of the character recognition model is calculated through a CTC loss function, and the model parameters of the convolutional neural network 210 and the sequence coding network 230 in the character recognition model are adjusted with the objective of optimizing the CTC loss, thereby training the character recognition model.
In the model training process, after a sample image of a current training sample is input to a character recognition model, feature extraction is performed on the input sample image through a convolutional neural network 210 of the character recognition model, and the convolutional neural network 210 outputs a sample feature map with a height of D and a width of n for the sample image. The specific embodiment of the convolutional neural network 210 for performing feature extraction on the input image is described above, and will not be described herein.
The sample feature map is then input to the data structure conversion layer 220, and the data structure conversion layer 220 converts the sample feature map into a feature sequence matching the input requirement of the sequence encoding network 230. For example, for each width position, a sample feature subsequence corresponding to that width position is obtained according to the D groups of features in the sample feature map that correspond to the same width position and to different height positions; the sample feature subsequences corresponding to the width positions are then sequentially spliced into the feature sequence of the sample image. For the specific implementation of converting the sample feature map into the feature sequence, refer to the implementation described above for converting the feature map of the target image in the text recognition stage, which is not repeated here.
Next, the feature sequence of the sample image is input to the sequence encoding network 230, and the sequence encoding network 230 performs coding mapping on the feature sequence to obtain a sample coding sequence. For the specific implementation of this coding mapping, refer to the implementation described above for the text recognition stage, in which the feature sequence of the target image is coded and mapped through the sequence coding network in the character recognition model to obtain the coding sequence; it is not repeated here.
Then, the sample coding sequence is decoded through the CTC decoder to obtain the predicted coding sequence of the characters in the sample image. For the specific implementation of decoding the sample coding sequence by the CTC decoder, refer to the implementation described above for the text recognition stage, in which the coding sequence is decoded through the CTC decoder in the character recognition model to obtain the character recognition result of the target image; it is not repeated here.
In an embodiment of the present application, the CTC loss function may be a generic CTC loss function in the CRNN model of the prior art. The optimization method of the model parameters and the iterative training process of the character recognition model refer to the prior art, and are not repeated here.
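A hedged sketch of a single training iteration using a generic CTC loss is given below; the model class reuses the earlier illustrative sketch, and the batch size, label length, alphabet size and optimizer settings are assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

model = CharRecognitionModel(num_classes=6000)           # sketch from the earlier example
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 1, 32, 320)                      # a batch of sample images
targets = torch.randint(1, 6000, (8, 12))                # real coding sequences (labels)
target_lengths = torch.full((8,), 12, dtype=torch.long)

log_probs = model(images).permute(1, 0, 2)               # CTCLoss expects (T, batch, classes)
input_lengths = torch.full((8,), log_probs.size(0), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()                                          # optimize the CTC loss
optimizer.step()
```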
According to the text recognition method disclosed by the embodiment of the application, a target image is input into a convolutional neural network in a pre-trained character recognition model, and a feature map with height D and width n output by the convolutional neural network for the target image is acquired, wherein D and n are integers greater than 1; for each width position, a feature subsequence corresponding to that width position is obtained according to the D groups of features in the feature map that correspond to the same width position and to different height positions; the feature subsequences corresponding to the width positions are sequentially spliced into a feature sequence of the target image; the feature sequence is coded and mapped through a sequence coding network in the character recognition model to obtain a coding sequence; and the coding sequence is decoded through a CTC decoder in the character recognition model to obtain a character recognition result of the target image, thereby improving the text recognition accuracy for complex text images such as arc-shaped text images and stamp images.
In the text recognition method disclosed by the embodiment of the application, because the convolutional neural network is configured to output a feature map with a height greater than 1, the character recognition model extracts finer-grained image features when performing feature extraction on the target image. When the subsequent sequence coding network encodes these finer-grained image features, interference from noise in the image is reduced, and text recognition can be performed on finer-grained features, which improves the text recognition accuracy for the target image.
The text recognition method disclosed by the embodiment of the application significantly improves the recognition of complex text images (such as multi-line text images, stamp images and arc-shaped text line images).
Example two
The embodiment of the application discloses a text recognition device, as shown in fig. 8, which comprises:
a feature map acquisition module 810, configured to input a target image into a convolutional neural network in a pre-trained character recognition model, and acquire a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1;
a first feature conversion module 820, configured to obtain, for each width position, a feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions;
a second feature conversion module 830, configured to sequentially splice the feature subsequences corresponding to the width positions into a feature sequence of the target image;
a feature coding module 840, configured to perform coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence;
and a decoding output module 850, configured to decode the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
In some embodiments of the present application, the first feature conversion module 820 is further configured to:
for the D groups of features corresponding to the same width position in the feature map, splice the D groups of features front to back in the top-to-bottom order of the positions in the target image to which each group of features corresponds, to obtain the feature subsequence corresponding to that width position.
In some embodiments of the application, the apparatus further comprises:
a character recognition model training module (not shown in the figure), configured to train the character recognition model, wherein the training samples for training the character recognition model are sample images provided with labels, each label being the real coding sequence of the characters in the corresponding sample image, and the convolutional neural network is configured to output feature maps of height D; the character recognition model training module is further configured to:
execute the following character recognition steps for each training sample to obtain a predicted coding sequence of the characters in the corresponding sample image:
inputting the sample image into the convolutional neural network in the character recognition model, and acquiring a sample feature map with height D and width n output by the convolutional neural network for the sample image;
obtaining, for each width position, a sample feature subsequence corresponding to that width position according to the D groups of features in the sample feature map that correspond to the same width position and to different height positions;
sequentially splicing the sample feature subsequences corresponding to the width positions into a feature sequence of the sample image;
performing coding mapping on the feature sequence through the sequence coding network to obtain a sample coding sequence;
decoding the sample coding sequence through the CTC decoder to obtain a predicted coding sequence of the characters in the sample image;
and calculating CTC loss from the predicted coding sequences and the real coding sequences of the characters in the sample images of the training samples, and iteratively training the character recognition model by optimizing the CTC loss.
In some embodiments of the present application, the value of the height D is determined according to the curvature of the text lines matched with the target image.
In some embodiments of the application, the sample image comprises one or more of the following: a stamp image, an arc text line image, and a multi-line text image.
The embodiment of the application discloses a text recognition device for implementing the text recognition method described in the first embodiment of the application, and the specific implementation of each module of the device is not repeated, and can be referred to the specific implementation of the corresponding step of the method embodiment.
According to the text recognition device disclosed by the embodiment of the application, a target image is input into a convolutional neural network in a pre-trained character recognition model, and a feature map with height D and width n output by the convolutional neural network for the target image is acquired, wherein D and n are integers greater than 1; for each width position, a feature subsequence corresponding to that width position is obtained according to the D groups of features in the feature map that correspond to the same width position and to different height positions; the feature subsequences corresponding to the width positions are sequentially spliced into a feature sequence of the target image; the feature sequence is coded and mapped through a sequence coding network in the character recognition model to obtain a coding sequence; and the coding sequence is decoded through a CTC decoder in the character recognition model to obtain a character recognition result of the target image, thereby improving the text recognition accuracy for complex text images such as arc-shaped text images and stamp images.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The foregoing has described in detail a text recognition method and apparatus provided by the present application. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the above examples are provided only to assist in understanding the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application; in view of the above, the content of this description should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative. The components described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without undue effort.
Various component embodiments of the application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in an electronic device according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, fig. 9 shows an electronic device in which the method according to the application may be implemented. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc. The electronic device conventionally comprises a processor 910, a memory 920, and program code 930 stored on the memory 920 and executable on the processor 910; the processor 910 implements the method described in the above embodiments when executing the program code 930. The memory 920 may be a computer program product or a computer-readable medium, for example an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 920 has a storage space 9201 for the program code 930 of a computer program for performing any of the method steps described above. For example, the storage space 9201 for the program code 930 may include individual computer programs each implementing one of the steps of the above methods. The program code 930 is computer readable code. These computer programs may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform the method according to the above-described embodiments.
The embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the text recognition method according to the embodiment of the application.
Such a computer program product may be a computer-readable storage medium, which may have memory segments and memory spaces arranged similarly to the memory 920 in the electronic device shown in fig. 9. The program code may be stored in the computer-readable storage medium, for example, in a suitable form. The computer-readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 10. In general, the storage unit comprises computer readable code 930', i.e., code that can be read by a processor; when this code is executed by the processor, the steps of the method described above are implemented.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A method of text recognition, comprising:
inputting a target image into a convolutional neural network in a pre-trained character recognition model, and acquiring a feature map with height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1;
obtaining, for each width position, a feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions;
sequentially splicing the feature subsequences corresponding to the width positions into a feature sequence of the target image;
performing coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence;
and decoding the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
2. The method according to claim 1, wherein obtaining, for each width position, the feature subsequence corresponding to that width position according to the D groups of features in the feature map that correspond to the same width position and to different height positions comprises:
for the D groups of features corresponding to the same width position in the feature map, splicing the D groups of features front to back in the top-to-bottom order of the positions in the target image to which each group of features corresponds, to obtain the feature subsequence corresponding to that width position.
3. The method according to claim 1 or 2, wherein the inputting the target image into the convolutional neural network in the pre-trained character recognition model, before obtaining the feature map with height D and width n output by the convolutional neural network for the target image, further comprises:
training the character recognition model, wherein a training sample for training the character recognition model is: a sample image provided with a tag, the tag being a true coded sequence of characters in the sample image, the convolutional neural network being configured to output a feature map of height D, the training the character recognition model comprising: the following character recognition steps are respectively executed for each training sample, and a predictive coding sequence of the characters in the corresponding sample image is obtained:
inputting the sample image into the convolutional neural network in the character recognition model, and obtaining a sample feature map of height D and width n output by the convolutional neural network for the sample image;
obtaining a sample feature subsequence corresponding to each width position from the D groups of features in the sample feature map that correspond to the same width position and to different height positions;
sequentially splicing the sample feature subsequences corresponding to the width positions into a feature sequence of the sample image;
performing coding mapping on the feature sequence through the sequence coding network to obtain a sample coding sequence;
decoding the sample coding sequence through the CTC decoder to obtain a predictive coding sequence of the characters in the sample image;
calculating a CTC loss according to the predictive coding sequences and the true coding sequences of the characters in the sample images of the training samples, and iteratively training the character recognition model by optimizing the CTC loss.
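A hedged sketch of one training step for claim 3, using `torch.nn.CTCLoss`. The `backbone` and `encoder` modules, the optimizer, and the label format (concatenated per-character class indices, as CTCLoss accepts) are assumptions made for illustration rather than details taken from the patent text.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)


def training_step(images, targets, target_lengths, backbone, encoder, optimizer):
    # images: [B, C, H, W]; targets: 1-D concatenated label indices; target_lengths: [B]
    fmap = backbone(images)                              # [B, C', D, n]
    b, c, d, n = fmap.shape
    seq = fmap.permute(0, 3, 2, 1).reshape(b, n, d * c)  # feature sequence
    logits = encoder(seq)                                # [B, n, num_classes]

    # nn.CTCLoss expects log-probabilities of shape [T, B, num_classes].
    log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)
    input_lengths = torch.full((b,), n, dtype=torch.long)

    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```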
4. The method according to claim 3, wherein the value of the height D is determined according to the curvature of the text line to which the target image corresponds.
5. The method according to claim 3, wherein the sample image comprises one or more of the following: a stamp image, an arc-shaped text line image, and a multi-line text image.
6. A text recognition device, comprising:
a feature map acquisition module, configured to input a target image into a convolutional neural network in a pre-trained character recognition model, and obtain a feature map of height D and width n output by the convolutional neural network for the target image, wherein D and n are integers greater than 1;
a first feature conversion module, configured to obtain a feature subsequence corresponding to each width position from the D groups of features in the feature map that correspond to the same width position and to different height positions;
a second feature conversion module, configured to sequentially splice the feature subsequences corresponding to the width positions into a feature sequence of the target image;
a feature coding module, configured to perform coding mapping on the feature sequence through a sequence coding network in the character recognition model to obtain a coding sequence; and
a decoding output module, configured to decode the coding sequence through a CTC decoder in the character recognition model to obtain a character recognition result of the target image.
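For the device of claim 6, one possible way to group the modules is a single PyTorch model in which a bidirectional LSTM with a linear head stands in for the "sequence coding network"; both that layer choice and the hidden sizes are assumptions, not taken from the patent. The decoding output module would then apply CTC decoding to the returned coding sequence, as in the earlier inference sketch.

```python
import torch
import torch.nn as nn


class TextRecognizer(nn.Module):
    """Groups the feature map acquisition, feature conversion, and feature
    coding modules; CTC decoding is applied to the returned coding sequence."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # feature map acquisition module
        # Stand-in sequence coding network: bidirectional LSTM + linear head.
        self.rnn = nn.LSTM(feat_dim, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(image)                   # [B, C, D, n]
        b, c, d, n = fmap.shape
        # First + second feature conversion: per-width splicing, then the
        # ordered feature sequence of the image; feat_dim must equal D * C.
        seq = fmap.permute(0, 3, 2, 1).reshape(b, n, d * c)
        enc, _ = self.rnn(seq)                        # [B, n, 512]
        return self.head(enc)                         # coding sequence [B, n, num_classes]
```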
7. The apparatus of claim 6, wherein the first feature conversion module is further configured to:
for the D groups of features corresponding to the same width position in the feature map, splice the D groups of features front to back in the top-to-bottom order of the positions in the target image to which the groups of features correspond, to obtain the feature subsequence corresponding to that width position.
8. The apparatus according to claim 6 or 7, characterized in that the apparatus further comprises:
a character recognition model training module, configured to train the character recognition model, wherein a training sample for training the character recognition model is a sample image provided with a label, the label being a true coding sequence of the characters in the sample image, and the convolutional neural network is configured to output a feature map of height D; the character recognition model training module is further configured to:
perform the following character recognition steps for each training sample to obtain a predictive coding sequence of the characters in the corresponding sample image:
inputting the sample image into the convolutional neural network in the character recognition model, and obtaining a sample feature map of height D and width n output by the convolutional neural network for the sample image;
obtaining a sample feature subsequence corresponding to each width position from the D groups of features in the sample feature map that correspond to the same width position and to different height positions;
sequentially splicing the sample feature subsequences corresponding to the width positions into a feature sequence of the sample image;
performing coding mapping on the feature sequence through the sequence coding network to obtain a sample coding sequence;
decoding the sample coding sequence through the CTC decoder to obtain a predictive coding sequence of the characters in the sample image;
calculating a CTC loss according to the predictive coding sequences and the true coding sequences of the characters in the sample images of the training samples, and iteratively training the character recognition model by optimizing the CTC loss.
9. An electronic device comprising a memory, a processor, and program code stored on the memory and executable on the processor, wherein the processor implements the text recognition method of any one of claims 1 to 5 when executing the program code.
10. A computer readable storage medium having program code stored thereon, wherein the program code, when executed by a processor, implements the steps of the text recognition method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211242295.4A CN116798044A (en) | 2022-10-11 | 2022-10-11 | Text recognition method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116798044A (en) | 2023-09-22
Family
ID=88035104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211242295.4A Pending CN116798044A (en) | 2022-10-11 | 2022-10-11 | Text recognition method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116798044A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117809314A (en) * | 2023-11-21 | 2024-04-02 | 中化现代农业有限公司 | Character recognition method, character recognition device, electronic equipment and storage medium |
2022-10-11: CN application CN202211242295.4A filed (publication CN116798044A (en)); status: Pending
Similar Documents
Publication | Title
---|---
US10558893B2 | Systems and methods for recognizing characters in digitized documents
CN111428718B | Natural scene text recognition method based on image enhancement
CN113313022B | Training method of character recognition model and method for recognizing characters in image
CN113254654B | Model training method, text recognition method, device, equipment and medium
CN111738169B | Handwriting formula recognition method based on end-to-end network model
CN110188827B | Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN114022882B | Text recognition model training method, text recognition device, text recognition equipment and medium
CN112509555B | Dialect voice recognition method, device, medium and electronic equipment
CN114495129B | Character detection model pre-training method and device
CN112800768A | Training method and device for nested named entity recognition model
CN113298151A | Remote sensing image semantic description method based on multi-level feature fusion
CN112084435A | Search ranking model training method and device and search ranking method and device
US11568140B2 | Optical character recognition using a combination of neural network models
CN110472248A | A kind of recognition methods of Chinese text name entity
CN111428727A | Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN111428750A | Text recognition model training and text recognition method, device and medium
CN114529903A | Text refinement network
CN113221718A | Formula identification method and device, storage medium and electronic equipment
CN112836702A | Text recognition method based on multi-scale feature extraction
CN116798044A | Text recognition method and device and electronic equipment
Hemanth et al. | CNN-RNN based handwritten text recognition
CN113947773A | Training method and device of character recognition model
CN112307749A | Text error detection method and device, computer equipment and storage medium
Liu et al. | Multi-digit recognition with convolutional neural network and long short-term memory
CN116012850A | Handwritten mathematical expression recognition method based on octave convolution and encoding and decoding
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination