CN114495119A - Real-time irregular text recognition method under complex scene - Google Patents


Info

Publication number
CN114495119A
CN114495119A
Authority
CN
China
Prior art keywords
text
local
feature
global
vector
Prior art date
Legal status
Pending
Application number
CN202111452587.6A
Other languages
Chinese (zh)
Inventor
张三元 (Zhang Sanyuan)
刘旭 (Liu Xu)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN202111452587.6A
Publication of CN114495119A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a real-time irregular text recognition method for complex scenes. The method comprises the following steps: a natural complex-scene image containing text is preprocessed to obtain a preprocessed text block image, which is input into a backbone convolutional neural network for feature extraction and feature slicing; the resulting slice local text features are input, one per module, into a set of local self-attention modules, which aggregate the local text features in parallel and output local text feature aggregation vectors; these vectors are input into a global self-attention module for concatenation, down-sampling and global text feature extraction, which outputs a global text feature vector; finally, the global text feature vector is decoded by a character decoding module, which outputs the recognized text string. The method effectively balances the speed and accuracy of text recognition in complex scenes: it performs excellently on curved text, leads prior algorithms in recognition accuracy on horizontal text, and holds a clear speed advantage.

Description

Real-time irregular text recognition method under complex scene
Technical Field
The invention relates to a character image recognition method in the field of character recognition, in particular to a real-time irregular text recognition method in a complex scene.
Background
Optical Character Recognition (OCR) converts the characters in an image into text that a computer can edit, and mainly comprises two stages: text detection and text recognition. As an important application branch of computer vision, text recognition has wide deployment scenarios and research significance, such as entry of card and bill information, handwritten-character recognition in education, and content auditing of network images. Recognition of printed and web text is increasingly mature, but in natural complex scenes the data distribution is very broad: text may be strongly curved or slanted, occupy only a small fraction of the image, or sit against heavy background noise. Current mainstream recognition methods handle text in these scenes with an additional rectification network or a computationally intensive attention mechanism, which makes it difficult for the recognition algorithm to achieve both accuracy and efficiency. Therefore, for the irregular text common in real scenes, current algorithms struggle to reach industrial-application levels of speed and performance at the same time.
At present, many mainstream segmentation-based text detection algorithms can crop out irregular text and pair it with a rectification-network-based recognition algorithm, but once the scene is complex and the text curvature is too large, the rectification network's correction capability cannot meet the recognition requirement.
Disclosure of Invention
The invention aims to solve the problem that existing irregular-text recognition algorithms cannot balance the speed and precision of model inference, and provides a real-time irregular text recognition method for complex scenes.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
1) preprocessing a natural complex scene image containing characters to obtain a preprocessed text block image;
2) inputting the preprocessed text block image into a backbone convolutional neural network for feature extraction and feature slicing, obtaining a plurality of slice local text features and recording the serial number of each slice local text feature;
3) the local text features of the plurality of slices are respectively and correspondingly input into a plurality of local self-attention modules according to serial numbers of the local text features of the slices, and after the local text features are aggregated in parallel by each local self-attention module, local text feature aggregation vectors are respectively and correspondingly output;
4) simultaneously inputting a plurality of local text feature aggregation vectors into a global self-attention module for splicing, down-sampling and global text feature extraction, and outputting global text feature vectors;
5) and after the global text feature vector is input into a character decoding module for decoding, outputting the recognized text character string.
The step 1) is specifically as follows:
firstly, a text detection algorithm crops the text from a natural complex-scene image containing characters to obtain an original text block image; the height of this image is then interpolated to a preset height H and its width is scaled according to the original aspect ratio, yielding a standard text block image; finally, the standard text block image is normalized to mean 0 and variance 1 to obtain the preprocessed text block image.
The backbone convolutional neural network in step 2) consists of a convolutional down-sampling module followed in sequence by a 6-layer depthwise convolution module, a first separable convolutional down-sampling module, a 12-layer depthwise convolution module, a second separable convolutional down-sampling module and a feature slicing module. The preprocessed text block image is input into the convolutional down-sampling module, and the feature slicing module outputs a plurality of slice local text features and records the serial number of each.
The input of the feature slicing module is the visual feature map output by the second separable convolutional down-sampling module. The module determines the sliding-window size from the height of the visual feature map and slides the window over the map pixel by pixel; after each slide, the region of the visual feature map covered by the window is copied out as one slice local text feature. Sliding continues until the map is covered, so that a plurality of slice local text features are output and the serial number of each is recorded.
The local self-attention modules in step 3) share the same structure, as follows:
A slice local text feature is a concatenation of several column vectors along the width dimension. Each local self-attention module copies the middle column vector of its input slice local text feature and applies a linear transformation to obtain the local text query vector. The input slice local text feature is then cut into column vectors and rearranged along the height dimension, and two further linear transformations of the rearranged feature yield the local text key vector and the local text value vector. A matrix dot product of the query vector with the key vector followed by a Softmax operation gives the weight distribution matrix of the local attention mechanism, and a matrix dot product of this weight matrix with the value vector gives the local text feature aggregation vector output by the current local self-attention module. The calculation formula is as follows:
Q_i = W_Q T_i^c
K_i = W_K T_i^r
V_i = W_V T_i^r
T_i^out = Softmax(Q_i^T K_i) V_i^T
wherein W_Q, W_K, W_V ∈ R^(C×C) are respectively the first, second and third local linear transformation parameter matrices; T_i^c is the middle column vector of the i-th slice local text feature; H' is the height of the slice local text feature; T_i^r is the i-th slice local text feature after column-vector cutting and height-dimension rearrangement; Q_i, K_i and V_i are respectively the local text query, key and value vectors corresponding to the i-th slice local text feature; T_i^out is the local text feature aggregation vector corresponding to the i-th slice local text feature; C is the feature dimension of the local text feature aggregation vector; the superscript T denotes the transposition operation; Softmax(·) denotes the Softmax operation, specifically a normalization.
The step 4) is specifically as follows:
The global self-attention module concatenates the local text feature aggregation vectors along the width dimension in the serial-number order of their corresponding slice local text features, giving a text feature aggregation vector; this is mean-pooled down to height 1 to obtain the global text feature sequence X. Three different linear transformations of X yield the query, key and value vectors of the global text. A matrix dot product of the global query vector with the global key vector followed by a Softmax operation gives the weight distribution matrix of the global attention mechanism, and a matrix dot product of this weight matrix with the global value vector gives the global text feature vector. The calculation formula is as follows:
X = AveragePooling([T_1^out, T_2^out, …, T_L^out])
Q_g = W_g^q X
K_g = W_g^k X
V_g = W_g^v X
X_out = Softmax(Q_g^T K_g) V_g^T
wherein [·] denotes the concatenation operation along the width dimension; AveragePooling(·) denotes mean-pooling down-sampling to 1 along the height dimension; T_i^out is the i-th local text feature aggregation vector; Q_g, K_g and V_g are the query, key and value vectors of the global text; W_g^q, W_g^k and W_g^v are respectively the first, second and third global linear transformation parameter matrices; X_out is the global text feature vector; C is the feature dimension of the global text feature vector; the superscript T denotes the transposition operation; Softmax(·) denotes the Softmax operation, specifically a normalization.
The step 5) is specifically as follows:
In the character decoding module, the global text feature vector is linearly transformed so that its feature dimension equals the number of character categories; a Softmax operation on the transformed vector yields the character probability distribution, and character-dictionary mapping according to this distribution realizes decoding, outputting the recognized character sequence.
The invention has the following beneficial effects:
compared with the current mainstream irregular character recognition method, the local self-attention module is used for replacing the conventional attention module with higher complexity, the context sequence modeling module is decoupled from the part independently, and the global self-attention mechanism operation is only carried out on the sequence which is down-sampled into one dimension, so that the operation complexity is greatly reduced, the character feature extraction of a local region and the context semantic extraction of a global sequence are considered, and the recognition performance of the model is effectively improved. Meanwhile, a light-weight backbone network is used, so that the algorithm can achieve a real-time effect on each computing platform.
The method can not only recognize irregular text with high precision in complex scenes but also supports conventional horizontal printed text well, so it can serve broader business scenarios.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a structural logic diagram of the entire network in the present invention.
Fig. 3 is a schematic diagram of the backbone network structure in the present invention.
Fig. 4 is a schematic structural diagram of a local self-attention module (LSA) in the present invention.
Fig. 5 is a schematic structural diagram of a global self-attention module (SA) in the present invention.
FIG. 6 is an exemplary diagram of two preprocessed text block images in an embodiment of the invention.
Fig. 7 is a schematic diagram of a weight distribution matrix of local attention mechanisms of the two preprocessed text block images in the respective local self-attention modules in fig. 6.
Fig. 8 is the weight distribution visualized on the original images after the two weight distribution matrices of fig. 7 are converted and scaled to the original image size; whiter areas of a weight map indicate richer text features at that location, while darker positions are, with high probability, background regions.
Detailed Description
In order to more clearly illustrate the object and technical solution of the present invention, the present invention will be further described in detail with reference to the accompanying drawings.
As shown in fig. 1 and 2, the present invention includes the following steps:
1) preprocessing a natural complex scene image containing characters to obtain a preprocessed text block image;
the step 1) is specifically as follows:
firstly, a text detection algorithm crops the text from a natural complex-scene image containing characters to obtain an original text block image; the height of this image is then interpolated to a preset height H and its width is scaled according to the original aspect ratio, yielding a standard text block image x ∈ R^(H×W×3); finally, a normalization with mean 0 and variance 1 is applied to the standard text block image to obtain the preprocessed text block image.
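As a concrete illustration of this preprocessing step, the following is a minimal numpy sketch: height interpolation to a preset H, aspect-ratio-preserving width scaling, and zero-mean unit-variance normalization. The function name, the nearest-neighbour interpolation, and the default target height of 32 are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def preprocess_text_block(img, target_h=32):
    """Resize a cropped text-block image (H, W, 3) to a fixed height while
    preserving its aspect ratio, then normalize to mean 0 and variance 1.
    Nearest-neighbour resizing stands in for the unspecified interpolation."""
    h, w, _ = img.shape
    target_w = max(1, round(w * target_h / h))           # width follows the aspect ratio
    rows = (np.arange(target_h) * h / target_h).astype(int)
    cols = (np.arange(target_w) * w / target_w).astype(int)
    resized = img[rows][:, cols].astype(np.float64)      # nearest-neighbour sampling
    return (resized - resized.mean()) / (resized.std() + 1e-8)

x = preprocess_text_block(np.random.randint(0, 256, (17, 100, 3)).astype(np.uint8))
```

The normalized block can then be fed to the backbone network; a production pipeline would use a proper interpolation kernel rather than nearest-neighbour sampling.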
2) Inputting the preprocessed text block image into a backbone convolutional neural network for feature extraction and feature slicing, obtaining a plurality of slice local text features and recording the serial number of each slice local text feature;
as shown in fig. 3, the backbone convolutional neural network in step 2) consists of a convolutional down-sampling module followed in sequence by a 6-layer depthwise convolution module, a first separable convolutional down-sampling module, a 12-layer depthwise convolution module, a second separable convolutional down-sampling module and a feature slicing module; the preprocessed text block image is input into the convolutional down-sampling module, and the feature slicing module outputs a plurality of slice local text features;
the input of the feature slicing module is a visual feature map output by the second separable convolution down-sampling module, the feature slicing module determines the size of a sliding window according to the height of the visual feature map, the dimensions of the visual feature map are (H ', L, C), wherein H ' and L are respectively the height and width of the visual feature map, in this embodiment, the height H ' of the visual feature map is equal to four pixels, C is the feature dimension of the visual feature map, the height and width of the sliding window are respectively H ' and H ' +1, and sliding is utilizedThe window carries out pixel sliding on the visual characteristic diagram, specifically, the sliding window slides along one direction in sequence by taking 1 pixel as a unit on the width dimension of the visual characteristic diagram, the visual characteristic diagram covered by the sliding window is copied into a slice and then is used as a slice local text characteristic and output after sliding, sliding slicing is continuously carried out, and therefore a plurality of slice local text characteristics T are output,
Figure BDA0003386744450000051
dimensions representing local text features of the slice are H '× W' × C, W '═ H' + 1. During initial sliding, filling zero vectors with the width and the height of H' at both ends of the width of the visual feature map, and recording the serial numbers of local text features of all the slices;
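The sliding-window slicing described above can be sketched as follows. The padding width (half a window, so that each column of the visual feature map centres exactly one slice) is an assumption made to match the stated one-module-per-column correspondence; the embodiment only says zero vectors of height H' are padded at both ends.

```python
import numpy as np

def slice_features(fmap):
    """Cut a visual feature map of shape (H', L, C) into overlapping slices
    using a window of height H' and width W' = H' + 1 that slides one pixel
    at a time along the width. Returns one slice per original column."""
    hp, L, C = fmap.shape
    wp = hp + 1                          # window width W' = H' + 1
    pad = wp // 2                        # assumed padding so each column is a window centre
    z = np.zeros((hp, pad, C))
    padded = np.concatenate([z, fmap, z], axis=1)
    # slice s is centred on original column s; list order is the serial number
    return [padded[:, s:s + wp, :].copy() for s in range(L)]

slices = slice_features(np.random.randn(4, 10, 8))
```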
3) the slice local text features are input, in serial-number order, into a corresponding set of Local Self-Attention (LSA) modules; each local self-attention module aggregates its local text features in parallel and outputs a local text feature aggregation vector. The number of local self-attention modules equals the number of slice local text features output by the backbone convolutional neural network, which in turn equals the number of column vectors in the visual feature map. Computing the spatial attention distribution of the text image with local self-attention in parallel allows parameter sharing and parallel computation of the self-attention mechanism; it effectively improves on the weight distribution of traditional spatial attention, raising recognition precision and alleviating attention drift, while greatly reducing parameter count and computation, and achieves the best results on several public text recognition test sets.
The structures of the local self-attention modules in the step 3) are the same, and specifically are as follows:
as shown in fig. 4, a slice local text feature is a concatenation of several column vectors along the width dimension; in this embodiment the width of a column vector equals one pixel. Each local self-attention module copies the middle column vector of its input slice local text feature and applies a linear transformation to obtain the local text query vector Q_i. The input slice local text feature is then cut into column vectors and rearranged along the height dimension, and two further linear transformations of the rearranged feature yield the local text key vector K_i and the local text value vector V_i. A matrix dot product of the query vector with the key vector followed by a Softmax operation gives the weight distribution matrix of the local attention mechanism, and a matrix dot product of this weight matrix with the value vector gives the local text feature aggregation vector output by the current local self-attention module. The calculation formula is as follows:
Q_i = W_Q T_i^c
K_i = W_K T_i^r
V_i = W_V T_i^r
T_i^out = Softmax(Q_i^T K_i) V_i^T
wherein T_i denotes the i-th slice local text feature, i being the serial number of the corresponding column vector in the visual feature map, i ∈ [1, L]; T_i is the region of the visual feature map covered by the sliding window when the i-th column vector is its middle vector; W_Q, W_K, W_V ∈ R^(C×C) are respectively the first, second and third local linear transformation parameter matrices; T_i^c is the middle column vector of the i-th slice local text feature; H' is the height of the slice local text feature, i.e. the height of the visual feature map; T_i^r is the i-th slice local text feature after column-vector cutting and height-dimension rearrangement; Q_i, K_i and V_i are respectively the local text query, key and value vectors corresponding to the i-th slice local text feature; T_i^out is the local text feature aggregation vector corresponding to the i-th slice local text feature; C is the feature dimension of the local text feature aggregation vector; the superscript T denotes the transposition operation; Softmax(·) normalizes the result of the matrix dot product to the range 0-1. In this way the local self-attention module assigns higher weight to the image regions containing text and lower weight to the background, eliminating interference from complex backgrounds. Unlike previous methods, the attention weight map is not generated autoregressively by iteration; instead the slice local text features in all sliding windows are computed in parallel to obtain the group of local text vectors, greatly improving efficiency.
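A numpy sketch of one LSA module's aggregation, under stated assumptions: the query is taken as a single C-dimensional vector obtained by averaging the middle column over height before the linear transform (the patent linearly transforms the copied middle column but leaves its exact vectorization to the figure), and no scaling factor is applied inside the Softmax since none is described.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_self_attention(slice_feat, Wq, Wk, Wv):
    """Aggregate one slice local text feature (H', W', C) into a C-dim vector.
    The middle column supplies the query; every spatial position supplies a
    key and a value; Wq/Wk/Wv are the three C x C local linear transforms."""
    hp, wp, C = slice_feat.shape
    mid = slice_feat[:, wp // 2, :].mean(axis=0)  # (C,) summary of middle column (assumption)
    q = Wq @ mid                                  # local text query vector
    flat = slice_feat.reshape(hp * wp, C).T       # rearranged feature, (C, H'*W')
    K, V = Wk @ flat, Wv @ flat                   # local text key / value vectors
    attn = softmax(q @ K)                         # weights over the H'*W' positions
    return attn @ V.T                             # local text feature aggregation vector, (C,)

agg = local_self_attention(np.random.randn(4, 5, 8), np.eye(8), np.eye(8), np.eye(8))
```

Because each slice is processed independently, the modules can share the three parameter matrices and run in parallel, as the description emphasizes.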
4) Simultaneously inputting a plurality of local text feature aggregation vectors into a global self-attention module for splicing, down-sampling and global text feature extraction, and outputting global text feature vectors;
the step 4) is specifically as follows:
as shown in fig. 5, the global self-attention module concatenates the local text feature aggregation vectors along the width dimension in the serial-number order of their corresponding slice local text features, giving a text feature aggregation vector; this is mean-pooled down to height 1 to obtain the global text feature sequence X. Three different linear transformations of X yield the query, key and value vectors of the global text. A matrix dot product of the global query vector with the global key vector followed by a Softmax operation gives the weight distribution matrix of the global attention mechanism, and a matrix dot product of this weight matrix with the global value vector models the contextual semantic associations of the global text feature sequence and yields the global text feature vector. The calculation formula is as follows:
X = AveragePooling([T_1^out, T_2^out, …, T_L^out])
Q_g = W_g^q X
K_g = W_g^k X
V_g = W_g^v X
X_out = Softmax(Q_g^T K_g) V_g^T
wherein [·] denotes the concatenation operation along the width dimension; AveragePooling(·) denotes mean-pooling down-sampling to 1 along the height dimension; T_i^out denotes the i-th local text feature aggregation vector, i ∈ [1, L]; Q_g, K_g and V_g are the query, key and value vectors of the global text; W_g^q, W_g^k and W_g^v ∈ R^(C×C) are respectively the first, second and third global linear transformation parameter matrices; X_out is the global text feature vector; C is the feature dimension of the global text feature vector, which equals the feature dimension of the visual feature map and of the local text feature aggregation vectors; the superscript T denotes the transposition operation; Softmax(·) normalizes the result of the matrix dot product to the range 0-1.
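The global stage can be sketched likewise. Since the local aggregation vectors are already one per slice, stacking them is equivalent to width-dimension concatenation followed by mean pooling to height 1; again no Softmax scaling factor is assumed, as none is described.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def global_self_attention(local_vecs, Wq, Wk, Wv):
    """Model context over the global text feature sequence X (C, L), built
    from L local aggregation vectors in slice serial-number order."""
    X = np.stack(local_vecs, axis=1)        # (C, L) global text feature sequence
    Qg, Kg, Vg = Wq @ X, Wk @ X, Wv @ X     # global query / key / value vectors
    A = softmax(Qg.T @ Kg)                  # (L, L) global attention weight matrix
    return A @ Vg.T                         # (L, C) global text features

G = global_self_attention([np.random.randn(8) for _ in range(6)],
                          np.eye(8), np.eye(8), np.eye(8))
```

Running a single L × L attention over a one-dimensional sequence, rather than over every spatial position, is what keeps this stage cheap.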
5) And inputting the global text feature vector into a character decoding module for decoding, and outputting the recognized text character string.
The step 5) is specifically as follows:
in the character decoding module, the global text feature vector is linearly transformed so that its feature dimension equals the number of character categories; a Softmax operation on the transformed vector yields the character probability distribution, and character-dictionary mapping according to this distribution realizes decoding, outputting the recognized character sequence. The number of character categories is 5001: 5000 commonly used Chinese and English characters plus an end token EOS, a separate category that marks the end of decoding of the current string. Decoding proceeds from left to right and terminates when the last character has been decoded or the first EOS is decoded, yielding the final character recognition sequence. The invention needs no complex autoregressive decoding, and compared with common CTC decoding, the absence of blank characters further improves decoding efficiency.
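The decoder thus reduces to a per-position classification: a linear projection onto the character classes, Softmax, argmax, and a cut at the first EOS. The toy 4-entry dictionary and identity projection below are illustrative assumptions only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(global_feats, W_cls, charset):
    """Map each global text feature (one row per sequence position) to a
    character; decoding runs left to right and stops at the first EOS,
    assumed here to be the final dictionary entry. No CTC blanks involved."""
    eos = len(charset) - 1
    probs = softmax(global_feats @ W_cls)   # per-position character distribution
    chars = []
    for idx in probs.argmax(axis=1):
        if idx == eos:                      # first end token terminates decoding
            break
        chars.append(charset[idx])
    return "".join(chars)

charset = ["a", "b", "c", "<EOS>"]
feats = np.eye(4)[[0, 1, 3, 2]]             # positions decode as 'a', 'b', EOS, 'c'
result = decode(feats, np.eye(4), charset)
```

With argmax decoding and no blank symbol, the position after the first EOS ('c' above) is simply discarded, which is why no alignment post-processing is needed.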
Examples of the implementation of the method according to the invention are as follows:
the training and testing data sets used by the method are respectively derived from the public general synthetic data sets of SynthText (English), MJ-Synth (English) and SynthText-Chinese (Chinese), the data volume is about 1400 thousands, and the text dictionary is set to 5001 and comprises 5000 common Chinese and English characters and an end character EOS. The network model presented herein was trained on a 2080Ti GPU using an Adam optimizer and setting a batch size of 512 (bach size), 25 million iterations.
Two natural complex-scene images containing text are cropped out with a text detection algorithm, then scaled and zero-padded to height 32 and width 256 (shown in fig. 6 (a) and (b) respectively); after normalization, they are fed into the backbone convolutional neural network (fig. 3) for feature extraction, yielding visual feature maps;
After each visual feature map passes through the feature slicing module, a plurality of slice local text features are obtained; these are input into the corresponding local self-attention modules to obtain the local text feature aggregation vectors.
After the local text feature aggregation vectors are concatenated along the width dimension, their feature distributions are visualized in fig. 7 (a) and (b); converting and scaling these feature maps to the original image size and normalizing them yields the weight distribution maps of fig. 8 (a) and (b). Mean-pooling the concatenated features down to height 1 gives the global text features.
Global self-attention is then applied to the global text features to model their context relations (this part of the network is shown in fig. 5), yielding the global text feature vectors.
Finally, the global text feature vectors are passed through the character decoding module and mapped through the character dictionary to obtain the recognized string.

Claims (7)

1. A real-time irregular text recognition method under a complex scene is characterized by comprising the following steps:
1) preprocessing a natural complex scene image containing characters to obtain a preprocessed text block image;
2) inputting the preprocessed text block image into a backbone convolutional neural network for feature extraction and feature slicing, obtaining a plurality of slice local text features and recording the serial number of each slice local text feature;
3) the local text features of the plurality of slices are respectively and correspondingly input into a plurality of local self-attention modules according to serial numbers of the local text features of the slices, and after the local text features are aggregated in parallel by each local self-attention module, local text feature aggregation vectors are respectively and correspondingly output;
4) simultaneously inputting a plurality of local text feature aggregation vectors into a global self-attention module for splicing, down-sampling and global text feature extraction, and outputting global text feature vectors;
5) and inputting the global text feature vector into a character decoding module for decoding, and outputting the recognized text character string.
2. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 1) is specifically as follows:
firstly, performing text interception on a natural complex scene image containing text by using a text detection algorithm to obtain an original text block image; then interpolating the height of the original text block image to a preset height H and scaling its width according to the aspect ratio of the original text block image to obtain a standard text block image; and finally normalizing the standard text block image to a mean value of 0 and a variance of 1 to obtain the preprocessed text block image.
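The preprocessing above can be sketched in NumPy as follows, using the embodiment's sizes (height 32, padded width 256); the nearest-neighbour resize and the fixed padding width are assumptions standing in for the interpolation used in practice:

```python
import numpy as np

def preprocess(img, H=32, W=256):
    """Resize a grayscale text crop to height H keeping the aspect ratio
    (nearest-neighbour as a stand-in for proper interpolation), right-pad
    with zeros to width W, then normalize to mean 0 and variance 1."""
    h, w = img.shape
    new_w = min(W, max(1, round(w * H / h)))       # width scaled by aspect ratio
    ys = (np.arange(H) * h / H).astype(int)        # nearest-neighbour sample grid
    xs = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[np.ix_(ys, xs)].astype(np.float64)
    canvas = np.zeros((H, W))                      # zero-value filling
    canvas[:, :new_w] = resized
    return (canvas - canvas.mean()) / (canvas.std() + 1e-6)
```

The output is always an H x W array with zero mean and unit variance, regardless of the crop's original size.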
3. The method for real-time irregular text recognition under a complex scene according to claim 1, wherein the backbone convolutional neural network in the step 2) is mainly formed by sequentially connecting a convolutional downsampling module, a 6-layer deep convolution module, a first separable convolutional downsampling module, a 12-layer deep convolution module, a second separable convolutional downsampling module and a feature slicing module; the preprocessed text block image is input into the convolutional downsampling module, and the feature slicing module outputs a plurality of slice local text features and records the serial number of each slice local text feature.
4. The method for real-time irregular text recognition under a complex scene according to claim 3, wherein the input of the feature slicing module is the visual feature map output by the second separable convolution down-sampling module, the feature slicing module determines the size of a sliding window according to the height of the visual feature map, pixel sliding is performed on the visual feature map by using the sliding window, after sliding, the visual feature map covered by the sliding window is copied into a slice and then is used as a slice local text feature to be output, and sliding slicing is continuously performed, so that a plurality of slice local text features are output, and the serial number of each slice local text feature is recorded.
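The sliding-window slicing of claim 4 can be sketched as follows, assuming a square window whose width equals the feature-map height and a non-overlapping stride (the claim fixes the window size by the height but does not fix the stride):

```python
import numpy as np

def slice_features(fmap, stride=None):
    """Cut a (C, H, W) visual feature map into H-wide windows slid along
    the width axis; each covered region is copied out as one slice local
    text feature, returned together with its serial number."""
    C, H, W = fmap.shape
    stride = stride or H                        # assumed non-overlapping stride
    slices = []
    for idx, x in enumerate(range(0, W - H + 1, stride)):
        slices.append((idx, fmap[:, :, x:x + H].copy()))
    return slices
```

The recorded serial numbers let the later global module re-splice the per-slice aggregation vectors in the original left-to-right order.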
5. The method according to claim 1, wherein the structures of the local self-attention modules in the step 3) are the same, specifically:
the local text feature of the slice is formed by splicing a plurality of column vectors in the width dimension, each local self-attention module copies the most middle column vector of the input local text feature of the slice and then linearly transforms the column vector to obtain a local text query vector, then the input local text feature of the slice is subjected to column vector cutting and height dimension rearrangement, different linear transformations are carried out on the rearranged local text feature of the slice to respectively obtain a local text key vector and a local text value vector, then Softmax operation is carried out after matrix dot product is carried out on the local text query vector and the local text key vector to obtain a weight distribution matrix of a local attention mechanism, and then matrix dot product is carried out on the weight distribution matrix of the local attention mechanism and the local text value vector to obtain a local text feature aggregation vector output by the current local self-attention module, the calculation formula is as follows:
Q_i = W_Q t_i^m
K_i = W_K T_i^r
V_i = W_V T_i^r
T_i^out = Softmax((Q_i^T K_i) / sqrt(C)) V_i^T
wherein W_Q, W_K, W_V ∈ R^(C×C) respectively denote the first, second and third local linear transformation parameter matrices; t_i^m ∈ R^(C×H) denotes the most central column vector of the i-th slice local text feature, H being the height of the slice local text feature; T_i^r denotes the i-th slice local text feature after column vector cutting and height dimension rearrangement; Q_i, K_i and V_i are respectively the local text query vector, the local text key vector and the local text value vector corresponding to the i-th slice local text feature; T_i^out is the local text feature aggregation vector corresponding to the i-th slice local text feature; C is the feature dimension of the local text feature aggregation vector; T denotes the transposition operation; Softmax(·) denotes the Softmax operation, specifically a normalization process.
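The local self-attention computation of claim 5 can be sketched in NumPy as follows, under assumed shapes: each slice local text feature is a (C, H, w) array, its middle width column serves as the query source, and identity matrices stand in for the learned W_Q, W_K, W_V:

```python
import numpy as np

def local_self_attention(T, Wq, Wk, Wv):
    """One local self-attention module: aggregate a (C, H, w) slice local
    text feature into a (C, H) local text feature aggregation vector."""
    C, H, w = T.shape
    t_mid = T[:, :, w // 2]                    # most central column vector, (C, H)
    Tr = T.reshape(C, H * w)                   # column cutting + height rearrangement
    Q = Wq @ t_mid                             # local text query vector
    K = Wk @ Tr                                # local text key vector
    V = Wv @ Tr                                # local text value vector
    logits = Q.T @ K / np.sqrt(C)              # scaled dot product, (H, H*w)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)          # weight distribution matrix
    return (A @ V.T).T                         # aggregation vector, (C, H)
```

Because each output entry is a convex combination of value-vector entries, the aggregation vector always stays within the range of the input features.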
6. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 4) is specifically as follows:
the global self-attention module splices a plurality of local text feature aggregation vectors in a width dimension according to sequence numbers of corresponding slice local text features to obtain text feature aggregation vectors, then performs mean pooling downsampling on the text feature aggregation vectors in a height dimension to obtain a global text feature sequence X, performs different linear transformations on the global text feature sequence to respectively obtain query vectors, key vectors and value vectors of a global text, performs Softmax operation after performing matrix dot product on the query vectors and the key vectors of the global text to obtain a weight distribution matrix of a global attention mechanism, and then performs matrix dot product on the weight distribution matrix of the global attention mechanism and the value vectors of the global text to obtain the global text feature vectors, wherein the calculation formula is as follows:
X = AveragePooling([T_1^out, T_2^out, ..., T_N^out])
Q_g = W_Q^g X
K_g = W_K^g X
V_g = W_V^g X
X_out = Softmax((Q_g^T K_g) / sqrt(C)) V_g^T
wherein [·] denotes the splicing operation in the width dimension; AveragePooling(·) denotes the operation of mean-pooling downsampling to 1 in the height dimension; T_i^out denotes the i-th local text feature aggregation vector; Q_g, K_g, V_g are respectively the query vector, key vector and value vector of the global text; W_Q^g, W_K^g, W_V^g ∈ R^(C×C) respectively denote the first, second and third global linear transformation parameter matrices; X_out is the global text feature vector; C is the feature dimension of the global text feature vector; T denotes the transposition operation; Softmax(·) denotes the Softmax operation, specifically a normalization process.
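The splicing, mean-pooling and global self-attention of claim 6 can be sketched in NumPy as follows; the shapes are assumptions (each aggregation vector taken as a (C, H) array, identity matrices standing in for the learned global transformation matrices):

```python
import numpy as np

def splice_and_pool(aggs):
    """Splice (C, H) aggregation vectors along the width, then mean-pool
    the height dimension down to 1, giving the (C, N) global sequence X."""
    stacked = np.stack(aggs, axis=2)           # (C, H, N), ordered by serial number
    return stacked.mean(axis=1)                # (C, N)

def global_self_attention(X, Wq, Wk, Wv):
    """Global self-attention over the (C, N) global text feature sequence."""
    C, N = X.shape
    Q, K, V = Wq @ X, Wk @ X, Wv @ X           # global query / key / value
    logits = Q.T @ K / np.sqrt(C)              # (N, N)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)          # global weight distribution matrix
    return (A @ V.T).T                         # global text feature vectors, (C, N)
```

Each global output position attends over all N slice positions, which is how the module models context across the whole text line.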
7. The method for recognizing the irregular text in the complex scene according to claim 1, wherein the step 5) is specifically as follows:
in the character decoding module, the global text feature vector is first linearly transformed so that its feature dimension equals the number of character categories; a Softmax operation then yields the character probability distribution, and decoding is performed by mapping this distribution through the character dictionary, outputting the recognized character sequence.
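The decoding in claim 7 can be sketched as greedy argmax decoding over the character probability distribution; the three-character dictionary, the identity classifier weights, and the "<EOS>" end symbol below are illustrative assumptions (the embodiment uses a 5001-entry dictionary ending in EOS):

```python
import numpy as np

def decode(X_out, W_cls, charset):
    """Map (C, N) global text feature vectors to characters: linear
    transform to |charset| classes, Softmax for the probability
    distribution, then greedy dictionary lookup until the end symbol."""
    logits = W_cls @ X_out                     # (n_classes, N)
    p = np.exp(logits - logits.max(axis=0))
    p /= p.sum(axis=0)                         # character probability distribution
    out = []
    for t in range(p.shape[1]):
        ch = charset[int(p[:, t].argmax())]
        if ch == "<EOS>":
            break                              # stop at the end character
        out.append(ch)
    return "".join(out)
```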
CN202111452587.6A 2021-12-01 2021-12-01 Real-time irregular text recognition method under complex scene Pending CN114495119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111452587.6A CN114495119A (en) 2021-12-01 2021-12-01 Real-time irregular text recognition method under complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111452587.6A CN114495119A (en) 2021-12-01 2021-12-01 Real-time irregular text recognition method under complex scene

Publications (1)

Publication Number Publication Date
CN114495119A true CN114495119A (en) 2022-05-13

Family

ID=81492376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111452587.6A Pending CN114495119A (en) 2021-12-01 2021-12-01 Real-time irregular text recognition method under complex scene

Country Status (1)

Country Link
CN (1) CN114495119A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743196A (en) * 2022-05-18 2022-07-12 北京百度网讯科技有限公司 Neural network for text recognition, training method thereof and text recognition method
WO2023221422A1 (en) * 2022-05-18 2023-11-23 北京百度网讯科技有限公司 Neural network used for text recognition, training method thereof and text recognition method
CN115222947A (en) * 2022-09-21 2022-10-21 武汉珈鹰智能科技有限公司 Rock joint segmentation method and device based on global self-attention transformation network
CN115222947B (en) * 2022-09-21 2022-12-20 武汉珈鹰智能科技有限公司 Rock joint segmentation method and device based on global self-attention transformation network

Similar Documents

Publication Publication Date Title
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111858954B (en) Task-oriented text-generated image network model
CN113343707B (en) Scene text recognition method based on robustness characterization learning
CN111723585A (en) Style-controllable image text real-time translation and conversion method
CN114495119A (en) Real-time irregular text recognition method under complex scene
CN105678293A (en) Complex image and text sequence identification method based on CNN-RNN
CN111160343A (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN115424282A (en) Unstructured text table identification method and system
CN109670555B (en) Instance-level pedestrian detection and pedestrian re-recognition system based on deep learning
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN113837366A (en) Multi-style font generation method
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN115862045B (en) Case automatic identification method, system, equipment and storage medium based on image-text identification technology
CN113963232A (en) Network graph data extraction method based on attention learning
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN110348339B (en) Method for extracting handwritten document text lines based on case segmentation
CN109886325B (en) Template selection and accelerated matching method for nonlinear color space classification
CN111401434A (en) Image classification method based on unsupervised feature learning
CN111242839A (en) Image scaling and cutting method based on scale grade
CN116703725A (en) Method for realizing super resolution for real world text image by double branch network for sensing multiple characteristics
CN111428447A (en) Intelligent image-text typesetting method based on significance detection
CN115797179A (en) Street view Chinese text image super-resolution reconstruction method
Chen et al. Scene text recognition based on deep learning: a brief survey
Mosannafat et al. Farsi text detection and localization in videos and images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination