CN114495119A

CN114495119A - Real-time irregular text recognition method under complex scene

Info

Publication number: CN114495119A
Application number: CN202111452587.6A
Authority: CN
Inventors: 张三元; 刘旭
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-05-13

Abstract

The invention discloses a real-time irregular text recognition method in a complex scene. The method comprises the following steps: after natural complex scene images containing characters are preprocessed, preprocessed text block images are obtained and input into a backbone convolutional neural network for feature extraction and feature slicing, a plurality of slice local text features are obtained and input into a plurality of local self-attention modules correspondingly respectively, and local text feature aggregation vectors are output correspondingly respectively after local text feature aggregation is performed in parallel by each local self-attention module; and inputting the global text feature vector into a global self-attention module for splicing, down-sampling and global text feature extraction, outputting the global text feature vector, inputting the global text feature vector into a character decoding module for decoding, and outputting the recognized text character string. The method effectively balances the speed and the precision of character recognition in the complex scene, has excellent effect in the curved text scene, leads the recognition precision in the horizontal text to the prior algorithm, and has more advantages in speed.

Description

Real-time irregular text recognition method under complex scene

Technical Field

The invention relates to a character image recognition method in the field of character recognition, in particular to a real-time irregular text recognition method in a complex scene.

Background

Optical Character Recognition (OCR) is to convert characters in an image into Character information that can be edited by a computer, and mainly includes two stages of Character detection and Character Recognition. As an important application branch of computer vision, character recognition has wide landing scenes and research significance, such as entry of bill-card information, handwritten character recognition in an education scene, content information audit of network pictures and the like. The character recognition algorithm aims at the problems that character recognition of printed text and network text is mature day by day, and the character recognition algorithm under natural complex scenes has too wide data distribution, such as large curvature or slope of character distribution shape, small proportion of character part in picture, large background noise and the like. The currently mainstream text recognition method deals with texts in these scenes by using an additional rectification network or a computationally intensive attention mechanism, which makes it difficult for the recognition algorithm to take account of accuracy and efficiency. Therefore, for irregular texts in many real scenes, the current algorithms are difficult to simultaneously meet the requirement that the speed and the performance reach the level of industrial application.

At present, many mainstream segmentation-based text detection algorithms can cut out irregular texts and match with a correction network-based recognition algorithm to recognize characters, but once a scene is complex and the curvature of the characters is too large, the correction capability of the correction network cannot meet the recognition requirement.

Disclosure of Invention

The invention aims to solve the problem that the existing irregular character recognition algorithm cannot consider the speed and precision of model reasoning, and provides a real-time irregular text recognition method in a complex scene.

The technical scheme adopted by the invention is as follows:

the invention comprises the following steps:

1) preprocessing a natural complex scene image containing characters to obtain a preprocessed text block image;

2) inputting the preprocessed text block image into a backbone convolutional neural network for feature extraction and feature slicing, obtaining a plurality of slice local text features and recording the serial number of each slice local text feature;

3) the local text features of the plurality of slices are respectively and correspondingly input into a plurality of local self-attention modules according to serial numbers of the local text features of the slices, and after the local text features are aggregated in parallel by each local self-attention module, local text feature aggregation vectors are respectively and correspondingly output;

4) simultaneously inputting a plurality of local text feature aggregation vectors into a global self-attention module for splicing, down-sampling and global text feature extraction, and outputting global text feature vectors;

5) and after the global text feature vector is input into a character decoding module for decoding, outputting the recognized text character string.

The step 1) is specifically as follows:

firstly, performing character interception on a natural complex scene image containing characters by using a character detection algorithm to obtain an original text block image, then interpolating the height of the original text block image to a preset height H, zooming the width of the original text block image after the height is adjusted according to the aspect ratio of the original text block image to obtain a standard text block image, and finally performing normalization operation on the standard text block image with the mean value of 0 and the variance of 1 to obtain a preprocessed text block image.

The backbone convolution neural network in the step 2) is mainly formed by connecting a convolution down-sampling module with a feature slicing module after sequentially passing through a depth convolution module with 6 layers, a first separable convolution down-sampling module, a depth convolution module with 12 layers and a second separable convolution down-sampling module, wherein a preprocessed text block image is input into the convolution down-sampling module, and the feature slicing module outputs a plurality of slice local text features and records the serial number of each slice local text feature.

The input of the feature slicing module is a visual feature graph output by the second separable convolution down-sampling module, the feature slicing module determines the size of a sliding window according to the height of the visual feature graph, pixel sliding is carried out on the visual feature graph by using the sliding window, the visual feature graph covered by the sliding window is copied into a slice after sliding and then is used as a local text feature of the slice and output, sliding slicing is continuously carried out, and therefore a plurality of local text features of the slice are output and serial numbers of the local text features of the slices are recorded.

The structures of the local self-attention modules in the step 3) are the same, and specifically include:

the local text feature of the slice is formed by splicing a plurality of column vectors in the width dimension, each local self-attention module copies the most middle column vector of the input local text feature of the slice and then linearly transforms the column vector to obtain a local text query vector, then the input local text feature of the slice is subjected to column vector cutting and height dimension rearrangement, different linear transformations are carried out on the rearranged local text feature of the slice to respectively obtain a local text key vector and a local text value vector, then Softmax operation is carried out after matrix dot product is carried out on the local text query vector and the local text key vector to obtain a weight distribution matrix of a local attention mechanism, and then matrix dot product is carried out on the weight distribution matrix of the local attention mechanism and the local text value vector to obtain a local text feature aggregation vector output by the current local self-attention module, the calculation formula is as follows:

Ki＝W^KTi_r

Vi＝W^VTi_r

wherein, W^Q，W^K，

W^Q，W^K，W^VRespectively representing a first, a second and a third local linear transformation parameter matrix,

the dimension is represented as C x C,

the most central column vector representing the local text feature of the ith slice, H' is the height of the local text feature of the slice, Ti_rThe slice local text features are represented after the ith slice local text feature is subjected to column vector cutting and height dimension rearrangement, Qi, Ki and Vi are respectively a local text query vector, a local text key vector and a local text value vector corresponding to the ith slice local text feature, and Ti is_outThe feature aggregation vector is a local text feature aggregation vector corresponding to the ith slice local text feature, C is a feature dimension of the local text feature aggregation vector, and T represents transposition operation; softmax (·) denotes a Softmax operation, specifically a normalization process.

The step 4) is specifically as follows:

the global self-attention module splices a plurality of local text feature aggregation vectors in width dimension according to sequence numbers of corresponding slice local text features to obtain text feature aggregation vectors, then performs mean pooling downsampling on the text feature aggregation vectors in height dimension to obtain a global text feature sequence X, performs different linear transformation on the global text feature sequence to respectively obtain query vectors, key vectors and value vectors of a global text, performs Softmax operation after performing matrix dot product on the query vectors and the key vectors of the global text to obtain a weight distribution matrix of a global attention system, performs matrix dot product on the weight distribution matrix of the global attention system and the value vectors of the global text to obtain the global text feature vectors, and the calculation formula is as follows:

Q_g＝W_g ^qX

K_g＝w_g ^kX

V_g＝W_g ^vX

wherein, [ ·]Representing the stitching operation in the width dimension, averagepoolling (·) represents the operation of mean-pooling downsampling to 1 in the height dimension,

representing the ith local text feature aggregation vector, Q_g，K_g，V_gQuery vectors, key vectors, value vectors, W, for global text_g ^q，W_g ^k，W_g ^vRepresenting a first, a second and a third global linear transformation parameter matrix, X_outThe feature vector of the global text is taken as C, the feature dimension of the feature vector of the global text is taken as C, and T represents transposition operation; softmax (·) denotes a Softmax operation, specifically a normalization process.

The step 5) is specifically as follows:

in the character decoding module, after the global text feature vector is subjected to linear transformation, the feature dimension of the global text feature vector subjected to linear transformation is equal to the number of character categories, and then the global text feature vector subjected to linear transformation is subjected to Softmax operation to obtain character probability distribution, and then the global text feature vector subjected to linear transformation is subjected to character dictionary mapping according to the character probability distribution to realize decoding, and a recognized character recognition sequence is output.

The invention has the following beneficial effects:

compared with the current mainstream irregular character recognition method, the local self-attention module is used for replacing the conventional attention module with higher complexity, the context sequence modeling module is decoupled from the part independently, and the global self-attention mechanism operation is only carried out on the sequence which is down-sampled into one dimension, so that the operation complexity is greatly reduced, the character feature extraction of a local region and the context semantic extraction of a global sequence are considered, and the recognition performance of the model is effectively improved. Meanwhile, a light-weight backbone network is used, so that the algorithm can achieve a real-time effect on each computing platform.

The method and the device can not only identify the irregular text with high precision in a complex scene, but also well support the identification of the conventional horizontal print form text, and can be used for more and wider service scenes.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Fig. 2 is a structural logic diagram of the entire network in the present invention.

Fig. 3 is a schematic diagram of the backbone network structure in the present invention.

Fig. 4 is a schematic structural diagram of a local self-attention module (LSA) in the present invention.

Fig. 5 is a schematic structural diagram of a global self-attention module (SA) in the present invention.

FIG. 6 is an exemplary diagram of two preprocessed text block images in an embodiment of the invention.

Fig. 7 is a schematic diagram of a weight distribution matrix of local attention mechanisms of the two preprocessed text block images in the respective local self-attention modules in fig. 6.

Fig. 8 is a weight distribution visualization diagram corresponding to the original image after converting and scaling the weight distribution matrix schematic diagrams of the two local attention mechanisms of fig. 7 to the original image size, wherein a whiter area of the weight diagram indicates that the text features of the area are richer, and otherwise, the position is a background area with a high probability.

Detailed Description

In order to more clearly illustrate the object and technical solution of the present invention, the present invention will be further described in detail with reference to the accompanying drawings.

As shown in fig. 1 and 2, the present invention includes the following steps:

the step 1) is specifically as follows:

firstly, character detection algorithm is utilized to intercept characters of natural complex scene images containing characters to obtain original text block images, then, the height of the original text block images is interpolated to preset height H, the width of the original text block images after height adjustment is zoomed according to the width-height ratio of the original text block images, and standard text block images x belonging to R are obtained^H ^×W×3And finally, carrying out normalization operation with the mean value of 0 and the variance of 1 on the standard text block image to obtain a preprocessed text block image.

as shown in fig. 3, the backbone convolutional neural network in step 2) is mainly formed by connecting a convolutional downsampling module sequentially through a 6-layer deep convolutional module, a first separable convolutional downsampling module, a 12-layer deep convolutional module and a second separable convolutional downsampling module with a feature slicing module, wherein the preprocessed text block image is input into the convolutional downsampling module, and the feature slicing module outputs a plurality of slice local text features;

the input of the feature slicing module is a visual feature map output by the second separable convolution down-sampling module, the feature slicing module determines the size of a sliding window according to the height of the visual feature map, the dimensions of the visual feature map are (H ', L, C), wherein H ' and L are respectively the height and width of the visual feature map, in this embodiment, the height H ' of the visual feature map is equal to four pixels, C is the feature dimension of the visual feature map, the height and width of the sliding window are respectively H ' and H ' +1, and sliding is utilizedThe window carries out pixel sliding on the visual characteristic diagram, specifically, the sliding window slides along one direction in sequence by taking 1 pixel as a unit on the width dimension of the visual characteristic diagram, the visual characteristic diagram covered by the sliding window is copied into a slice and then is used as a slice local text characteristic and output after sliding, sliding slicing is continuously carried out, and therefore a plurality of slice local text characteristics T are output,

dimensions representing local text features of the slice are H '× W' × C, W '═ H' + 1. During initial sliding, filling zero vectors with the width and the height of H' at both ends of the width of the visual feature map, and recording the serial numbers of local text features of all the slices;

3) the method comprises the following steps that a plurality of slice Local text features are respectively and correspondingly input into a plurality of Local Self-Attention (LSA) modules according to serial numbers of the slice Local text features, after the Local text features are aggregated in parallel by each Local Self-Attention module, Local text feature aggregation vectors are respectively and correspondingly output, wherein the number of the Local Self-Attention modules is equal to the number of the slice Local text features output by a backbone convolutional neural network and is also equal to the number of column vectors in a visual feature map; the local self-attention mechanism is used in parallel to calculate the spatial attention distribution of the picture where the characters are located, the parameter sharing and parallel calculation of the self-attention mechanism are achieved, the weight distribution of the traditional spatial attention is effectively improved, the recognition precision is improved, the problem of attention drift is solved, meanwhile, the parameter quantity and the operation quantity are greatly reduced, and the optimal result is obtained on a plurality of public test sets for character recognition.

The structures of the local self-attention modules in the step 3) are the same, and specifically are as follows:

as shown in fig. 4, the slice local text feature is formed by splicing a plurality of column vectors in the width dimension, in this embodiment, the width of a column vector is equal to one pixel, each local self-attention module copies the most middle column vector of the input slice local text feature and then linearly transforms the copied column vector to obtain a local text query vector Q,

then, after column vector cutting and height dimension rearrangement are carried out on the input local text characteristics of the slices, different linear transformations are carried out on the rearranged local text characteristics of the slices to respectively obtain local text key vectors K,

and a vector of local text values V,

performing matrix dot product on the local text query vector and the local text key vector, and performing Softmax operation to obtain a weight distribution matrix of a local attention mechanism, and performing matrix dot product on the weight distribution matrix of the local attention mechanism and the local text value vector to obtain a local text feature aggregation vector output by the current local self-attention module, wherein the calculation formula is as follows:

Ki＝W^KTi_r

Vi＝W^VTi_r

wherein Ti represents the local text feature of the ith slice, i is also the sequence number of the column vector in the visual feature map, i belongs to [1, 2]；

Visual feature map, W, representing the overlay of a sliding window when the ith column vector is the intermediate vector of the local text feature of the slice^Q，W^K，

the dimension is represented as C x C,

the most central column vector representing the ith slice local text feature, H' is the height of the slice local text feature, i.e. the height of the visual feature map, Ti_rThe slice local text features are represented after the ith slice local text feature is subjected to column vector cutting and height dimension rearrangement, Qi, Ki and Vi are respectively a local text query vector, a local text key vector and a local text value vector corresponding to the ith slice local text feature, and Ti is_outThe feature aggregation vector is a local text feature aggregation vector corresponding to the ith slice local text feature, C is a feature dimension of the local text feature aggregation vector, and T represents transposition operation; softmax (·) represents Softmax operation, specifically, the result of matrix dot product is normalized to 0-1, so that the local self-attention module assigns higher weight to the part of the image corresponding to the Chinese character region and lower weight to the background region, thereby eliminating the interference of complex background, and different from the previous method, the method is not an iterative auto-regression type to generate an attention weight map, but can perform parallel calculation on the local text features of the slices in each sliding window to obtain a local text vector group, thereby greatly improving the efficiency.

the step 4) is specifically as follows:

as shown in fig. 5, the global self-attention module aggregates the plurality of local text feature aggregation vectors according to the sequence numbers of the corresponding slice local text features in the width dimensionSplicing to obtain text feature aggregate vectors, then performing mean pooling downsampling on the text feature aggregate vectors in height dimension to obtain a global text feature sequence X,

after different linear transformations are carried out on the global text feature sequence, a query vector, a key vector and a value vector of the global text are respectively obtained, matrix dot product is carried out on the query vector and the key vector of the global text, Softmax operation is carried out to obtain a weight distribution matrix of a global attention mechanism, matrix dot product is carried out on the weight distribution matrix of the global attention mechanism and the value vector of the global text, and therefore context semantic association of the global text feature sequence is modeled, the global text feature vector is obtained, and the calculation formula is as follows:

Q_g＝W_g ^qX

K_g＝W_g ^kX

K_g＝W_g ^vX

wherein [ ·]Representing the stitching operation in the width dimension, averagepoolling (·) represents the operation of mean-pooling downsampling to 1 in the height dimension,

represents the ith local text feature aggregation vector, i ∈ [1, 2]，，Q_g，K_g，V_gQuery vector, key vector, value vector, W, for global text_g ^q，W_g ^k，W_g ^vRepresenting a first, a second and a third global linear transformation parameter matrix satisfying W_g ^q，W_g ^k，

X_outIs a global text feature vector, satisfies

C is the feature dimension of the global text feature vector, namely the feature dimension of the visual feature map and the feature dimension of the local text feature aggregation vector, and T represents transposition operation; softmax (-) denotes a Softmax operation, specifically normalizing the results of the matrix dot product to between 0-1.

5) And inputting the global text feature vector into a character decoding module for decoding, and outputting the recognized text character string.

The step 5) is specifically as follows:

in the character decoding module, after the global text feature vector is subjected to linear transformation, the feature dimension of the global text feature vector subjected to linear transformation is equal to the number of character categories, and then the global text feature vector subjected to linear transformation is subjected to Softmax operation to obtain character probability distribution, and then the global text feature vector subjected to linear transformation is subjected to character dictionary mapping according to the character probability distribution to realize decoding, and a recognized character recognition sequence is output. The number of character categories is specifically 5001, and the characters include 5000 Chinese and English characters commonly used and an end character EOS as separate character categories for marking the end of decoding of the current character string. And when the decoding is carried out from left to right in sequence, slicing is carried out and the decoding is terminated to obtain a final character recognition sequence until the last character is decoded or the first termination character EOS is decoded. The invention does not need to use complex autoregressive decoding, and compared with the common CTC decoding, the decoding efficiency is further improved because blank characters do not exist.

Examples of the implementation of the method according to the invention are as follows:

the training and testing data sets used by the method are respectively derived from the public general synthetic data sets of SynthText (English), MJ-Synth (English) and SynthText-Chinese (Chinese), the data volume is about 1400 thousands, and the text dictionary is set to 5001 and comprises 5000 common Chinese and English characters and an end character EOS. The network model presented herein was trained on a 2080Ti GPU using an Adam optimizer and setting a batch size of 512 (bach size), 25 million iterations.

Performing character interception on two natural complex scene images containing characters by using a character detection algorithm, then performing scaling and zero value filling to obtain pictures with the height of 32 and the width of 256, performing normalization processing, and then sending the pictures into a bone dry convolution neural network (shown in fig. 3) for feature extraction to obtain a visual feature map, wherein the pictures are respectively shown in fig. 6 (a) and (b);

and after the visual feature map passes through the feature slicing module, a plurality of sliced local text features can be obtained, and the local text features are respectively input into a plurality of local self-attention modules to obtain a plurality of corresponding local text feature aggregation vectors.

After the local text feature aggregation vectors are spliced in the width dimension, the feature distribution visualization effects are respectively shown in fig. 7 (a) and (b), and the weight distribution graphs of fig. 8 (a) and (b) can be obtained by converting and scaling the feature graphs to the size of the original graph and normalizing and visualizing the feature graphs. And sampling the spliced features to 1 from top to bottom in height to obtain global text features.

And performing global self-attention operation on the global text features to model context relations thereof, wherein the part of the network structure diagram is shown in FIG. 5, and the part can obtain global text feature vectors.

And mapping the output of the global text feature vector after passing through the character decoding module by a character dictionary to obtain a finally recognized character string.

Claims

1. A real-time irregular text recognition method under a complex scene is characterized by comprising the following steps:

2. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 1) is specifically as follows:

3. The method for identifying real-time irregular texts under complex scenes according to claim 1, wherein the backbone convolutional neural network in the step 2) is mainly formed by connecting a convolutional downsampling module with a feature slicing module after sequentially passing through a 6-layer deep convolution module, a first separable convolutional downsampling module, a 12-layer deep convolution module and a second separable convolutional downsampling module, the preprocessed text block image is input into the convolutional downsampling module, and the feature slicing module outputs a plurality of slice local text features and records serial numbers of the slice local text features.

4. The method for real-time irregular text recognition under a complex scene according to claim 3, wherein the input of the feature slicing module is the visual feature map output by the second separable convolution down-sampling module, the feature slicing module determines the size of a sliding window according to the height of the visual feature map, pixel sliding is performed on the visual feature map by using the sliding window, after sliding, the visual feature map covered by the sliding window is copied into a slice and then is used as a slice local text feature to be output, and sliding slicing is continuously performed, so that a plurality of slice local text features are output, and the serial number of each slice local text feature is recorded.

5. The method according to claim 1, wherein the structures of the local self-attention modules in the step 3) are the same, specifically:

Ki＝W^KTi_r

Vi＝W^VTi_r

wherein, W^Q,W^K,

W^Q,W^K,W^VRespectively representing a first, a second and a third local linear transformation parameter matrix,

the dimension is represented as C x C,

the most central column vector, H, representing the local text feature of the ith slice^′Height of local text features for slicing, Ti_rThe slice local text features are represented after the ith slice local text feature is subjected to column vector cutting and height dimension rearrangement, Qi, Ki and Vi are respectively a local text query vector, a local text key vector and a local text value vector corresponding to the ith slice local text feature, and Ti is_outThe feature aggregation vector is a local text feature aggregation vector corresponding to the ith slice local text feature, C is a feature dimension of the local text feature aggregation vector, and T represents transposition operation; softmax (·) denotes a Softmax operation, specifically a normalization process.

6. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 4) is specifically as follows:

the global self-attention module splices a plurality of local text feature aggregation vectors in a width dimension according to sequence numbers of corresponding slice local text features to obtain text feature aggregation vectors, then performs mean pooling downsampling on the text feature aggregation vectors in a height dimension to obtain a global text feature sequence X, performs different linear transformations on the global text feature sequence to respectively obtain query vectors, key vectors and value vectors of a global text, performs Softmax operation after performing matrix dot product on the query vectors and the key vectors of the global text to obtain a weight distribution matrix of a global attention mechanism, and then performs matrix dot product on the weight distribution matrix of the global attention mechanism and the value vectors of the global text to obtain the global text feature vectors, wherein the calculation formula is as follows:

representing the ith local text feature aggregation vector, Q_g,K_g,V_gQuery vectors, key vectors, value vectors for global text,

representing a first, a second and a third global linear transformation parameter matrix, X_outThe feature vector of the global text is taken as C, the feature dimension of the feature vector of the global text is taken as C, and T represents transposition operation; softmax (·) denotes a Softmax operation, specifically a normalization process.

7. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 5) is specifically as follows:

in the character decoding module, after the global text feature vector is subjected to linear transformation, the feature dimension of the global text feature vector subjected to linear transformation is equal to the number of character categories, and then the global text feature vector subjected to linear transformation is subjected to Softmax operation to obtain character probability distribution, and the global text feature vector subjected to linear transformation is subjected to character dictionary mapping according to the character probability distribution to realize decoding, and a recognized character recognition sequence is output.