CN113159023A - Scene text recognition method based on explicit supervision mechanism - Google Patents


Info

Publication number
CN113159023A
CN113159023A (application CN202110273068.7A)
Authority
CN
China
Prior art keywords
attention
feature
module
character
channel
Prior art date
Legal status
Pending
Application number
CN202110273068.7A
Other languages
Chinese (zh)
Inventor
王鹏 (Wang Peng)
郑财源 (Zheng Caiyuan)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110273068.7A
Publication of CN113159023A

Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images (G Physics > G06 Computing; Calculating or Counting > G06V Image or video recognition or understanding > G06V 20/00 Scenes; scene-specific elements > G06V 20/60 Type of objects)
    • G06N 3/045: Combinations of networks (G06N Computing arrangements based on specific computational models > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks (same branch)
    • G06N 3/048: Activation functions (same branch)
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene text recognition method based on an explicit supervision mechanism, and belongs to the field of scene text recognition. In the first part, the feature extraction stage of a ResNet convolutional neural network equipped with attention mechanisms extracts and encodes features from the text image, producing a feature map and a global representation. In the second part, the relationships between characters are modeled by combining the previously predicted characters, the position information, and the global representation; attention weights are then generated over the feature map, the weights are multiplied with the feature map to obtain the features of a single character, and these features are fed into a feedforward neural network to predict the character. The process then moves to the prediction of the next character, and so on, until the end-of-recognition token is produced. Because the method automatically localizes the relevant region at each prediction step, it improves the recognition result and addresses the poor recognition of curved or slanted text.

Description

Scene text recognition method based on explicit supervision mechanism
Technical Field
The invention belongs to the field of scene text recognition, and specifically provides a text image recognition method and system adopting an encoder-decoder structure with an explicitly supervised attention mechanism. The system uses a ResNet34 convolutional neural network with spatial attention and channel attention mechanisms to extract text image features, and uses a Transformer structure based on the self-attention mechanism for decoding and recognition.
Background
Scene text recognition is an important challenge in the field of computer vision; its task is to automatically detect and recognize text in natural images. Text serves as a physical carrier of language that can be used to store and transmit information, and with the help of text detection and recognition technology, important semantic information in visual images can be decoded. Because of its enormous application value, scene text recognition has attracted extensive research and exploration in industry and academia in recent years. At present, however, the texts that are recognized well are mostly horizontal texts with simple backgrounds. In real scenes, factors such as illumination, occlusion, imaging devices and shooting angles, together with properties of the text itself such as curvature, slant, and artistic fonts, mean that scene text recognition, and especially irregular scene text recognition, still faces major bottlenecks.
To address irregular text recognition, existing scene text recognition techniques (such as MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition, and ASTER: An Attentional Scene Text Recognizer with Flexible Rectification) adopt an attention-based decoder in the decoding stage, so that the character regions in the picture can be attended to automatically. Such methods handle irregular text recognition reasonably well, but because scene pictures are often too noisy, problems such as "attention drift" frequently occur, which lowers text recognition accuracy.
Disclosure of Invention
Technical problem to be solved
To solve the problem of low text recognition accuracy caused by the "attention drift" of attention-based decoders in the prior art, the invention provides a scene text recognition method based on an explicitly supervised attention mechanism, enabling scene text recognition that accounts for curvature and slant.
Technical scheme
A scene text recognition method based on an explicit supervision mechanism is characterized by comprising the following steps:
Step 1: input the scene text picture into a ResNet34 convolutional neural network to extract a feature map, denoted F, where F ∈ R^(25×8×512); input F into a global feature extraction layer of six Bottleneck blocks to obtain a global feature representation G, where G ∈ R^(1×1×512); meanwhile, apply a 1×1 convolution to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024); a channel attention and spatial attention mechanism is added to each block of the four layers of ResNet34;
the spatial attention is explicitly supervised according to the characters' labeled bounding boxes, with the loss computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise; the supervision signal is added only to the last block of each layer;
Step 2: at decoding time step t, add the embeddings of the previously predicted characters and the position information to obtain E, where E ∈ R^(t×512), and then concatenate E with the global feature representation G to obtain a vector C, where C ∈ R^(t×1024);
Step 3: input C into a masked self-attention mechanism to model the dependencies between different characters in the output word; the self-attention mechanism is as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
the attention calculation is divided into three steps: the first step calculates the similarity between the query and each key to obtain the weights; the second step normalizes the weights, typically using a softmax function; the final step computes the weighted sum of the weights and the corresponding values to obtain the final attention output;
connect the encoder and the decoder with a two-dimensional attention module, whose structure is essentially the same as the self-attention module, except that its K and V both come from the F' obtained in the encoding stage while Q is the output of the masked self-attention module; the invention performs explicitly supervised training on the attention weights, with the loss function defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise;
Step 4: a picture feature vector is obtained after the two-dimensional attention module; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
The technical scheme of the invention further provides that, in step 1: the channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron; the features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map; this map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module; the spatial attention module takes the feature map output by the channel attention module as its input feature map, first performs channel-based global max pooling and global average pooling, then concatenates the two results along the channel dimension; a convolution operation then reduces the result to a single channel; a sigmoid generates the spatial attention map; finally, this map is multiplied with the module's input features to obtain the final features.
A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the above-described method.
A computer-readable storage medium having stored thereon computer-executable instructions for performing the above-described method when executed.
A computer program comprising computer executable instructions which when executed perform the method described above.
Advantageous effects
The scene text recognition method based on the explicit supervision mechanism can recognize curved and slanted scene text pictures: using the two-dimensional supervision mechanism, the information of the picture is converted into an attention weight matrix, and the features of the relevant region are localized automatically at each prediction step, which improves the recognition result and solves the problem of poor recognition under curved or slanted conditions. The explicitly supervised attention mechanism effectively alleviates the attention drift problem, allowing the model to find the key region of each scene text character at every decoding step and, combined with the character features, to better recognize complex scene text pictures. The method also recognizes standard horizontal scene text, so the whole system is highly practical and can handle scene text recognition under curved, slanted, horizontal, and other conditions.
Drawings
FIG. 1 structural diagram of CBAM
FIG. 2 ResNet structure diagram
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the system comprises two parts, wherein the first part is a process for extracting and coding the features of the scene picture based on a space attention and channel attention mechanism, the second part is a decoding process based on a transform of a self-attention mechanism, and the recognition of the scene text is realized through a coding and decoding structure and the attention mechanism. In the first part, a feature extraction part of a ResNet convolutional neural network with an attention mechanism is adopted to perform feature extraction and coding on a text image, and a feature map and a global representation are obtained. In the second part, the relationship modeling between characters is carried out by combining the information of the previous predicted character, the position information and the global representation, then the attention weight is generated according to the feature diagram, the feature of a single character is obtained by multiplying the weight by the feature diagram, the feature is input into a feedforward neural network to obtain the predicted character, and then the predicted character enters the predicted recognition process of the next character, and so on until the recognition ending identifier is obtained. It should be noted that, in order to extract more text information while ignoring the background information of the picture as much as possible in the encoding stage and to focus the model on the corresponding feature map region in the decoding stage, the present invention explicitly supervises the attention mechanism used in the encoder and the decoder respectively according to the frame information of the characters. The method comprises the following steps:
(1) A scene text image is processed by a ResNet34 convolutional neural network to obtain a feature map, denoted F, where F ∈ R^(25×8×512). The invention adds a spatial attention and a channel attention mechanism to each block of the four layers of the ResNet; this combination is known as CBAM (Convolutional Block Attention Module), and its structure is shown in FIG. 1.
The channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron. The features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map. This map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module.
The spatial attention module takes the feature map output by the channel attention module as its input feature map. First, channel-based global max pooling and global average pooling are performed, and the two results are concatenated along the channel dimension. A convolution operation then reduces the result to a single channel, and a sigmoid generates the spatial attention map. Finally, this map is multiplied with the module's input features to obtain the final features.
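As a minimal PyTorch sketch of the CBAM-style block just described (an illustration, not the patent's exact implementation; the reduction ratio and convolution kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):          # reduction ratio assumed
        super().__init__()
        self.mlp = nn.Sequential(                        # shared multilayer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_feat = torch.amax(x, dim=(2, 3))             # global max pooling -> (B, C)
        avg_feat = torch.mean(x, dim=(2, 3))             # global average pooling -> (B, C)
        attn = torch.sigmoid(self.mlp(max_feat) + self.mlp(avg_feat))  # element-wise add
        return x * attn.view(b, c, 1, 1)                 # rescale input channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):                   # kernel size assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                # x: output of channel attention
        max_map = torch.amax(x, dim=1, keepdim=True)     # channel-wise max -> (B, 1, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)     # channel-wise mean -> (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn, attn                            # attn is the supervised weight map
```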
In order to extract as much information about the text as possible, the invention explicitly supervises the spatial attention according to the characters' labeled bounding boxes; the loss is computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise. The supervision signal is added only to the last block of each layer.
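Under the assumption that this supervision takes the binary cross-entropy form reconstructed above, a sketch of the loss is:

```python
import torch.nn.functional as F

def attention_supervision_loss(attn_map, char_mask):
    """attn_map: (B, 1, H, W) sigmoid attention weights y_pred;
    char_mask: (B, 1, H, W) binary labels y_label (1 inside character boxes)."""
    return F.binary_cross_entropy(attn_map, char_mask.float())
```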
(2) To maintain dimensional consistency in the two-dimensional attention calculation of the decoding stage, the channels of F are changed using a 1×1 convolution to obtain F', where F' ∈ R^(25×8×1024). Meanwhile, F passes through six Bottleneck layers to obtain another feature G called the global representation, where G ∈ R^(1×1×512), i.e., G is a 512-dimensional vector. Inspired by the Transformer, the designed attention-based sequence decoder consists of three layers: a masked self-attention mechanism that models the dependencies between different characters in the output word; a two-dimensional attention module connecting the encoder and the decoder; and a position-wise feed-forward layer applied at each decoding position. Each of the three layers uses an additive residual connection followed by layer normalization. Together these three layers form one module, and such modules can be stacked without sharing parameters.
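One such decoder module could be sketched as follows; nn.MultiheadAttention serves here as a stand-in for the patent's attention sub-modules, and the number of heads is an assumption.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_ff=2048):   # head count assumed
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))    # position-wise feed-forward
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, causal_mask):
        # masked self-attention over previously predicted characters
        sa, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + sa)                                # residual + layer norm
        # two-dimensional attention: K and V come from the flattened encoder features F'
        ca, attn_w = self.cross_attn(x, memory, memory)
        x = self.norm2(x + ca)
        x = self.norm3(x + self.ffn(x))
        return x, attn_w    # attn_w can be reshaped to 25x8 for explicit supervision
```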
(3) At time step t (t starts from 0), the t previously predicted characters are embedded and then added to the positional encodings (position embeddings) to obtain 512-dimensional vectors, which are concatenated with the global representation G, finally giving t 1024-dimensional inputs C, where C ∈ R^(t×1024).
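An illustrative assembly of the decoder input C is sketched below; the sinusoidal positional encoding and the enlarged vocabulary (94 character classes plus assumed START/END tokens) are assumptions consistent with the text.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 96            # 94 character classes + assumed START/END tokens
char_embed = nn.Embedding(vocab_size, d_model)

def positional_encoding(t, d=d_model):
    """Standard sinusoidal position embedding, shape (t, d)."""
    pos = torch.arange(t, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(t, d)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(angle), torch.cos(angle)
    return pe

def build_decoder_input(prev_chars, G):
    """prev_chars: (t,) ids of previously predicted characters; G: (512,) global feature."""
    E = char_embed(prev_chars) + positional_encoding(len(prev_chars))   # (t, 512)
    C = torch.cat([E, G.expand(len(prev_chars), -1)], dim=-1)           # (t, 1024)
    return C
```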
(4) C is input to a masked self-attention mechanism that models the dependencies between different characters in the output word, as shown below.
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The attention computation consists of three main steps: first, compute the similarity between the query and each key to obtain the weights; second, normalize the weights, typically with a softmax function; finally, compute the weighted sum of the weights and the corresponding values to obtain the final attention output.
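These three steps are the standard scaled dot-product attention, sketched below with an optional causal mask for the masked self-attention:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # step 1: similarity of the query with each key
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:                          # causal mask hides future characters
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # step 2: normalize the weights with softmax
    weights = torch.softmax(scores, dim=-1)
    # step 3: weighted sum over the values
    return weights @ V, weights
```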
(5) The encoder and the decoder are connected using a two-dimensional attention module, whose structure is essentially the same as the self-attention module except that its K and V both come from the F' obtained in the encoding stage, while Q is the output of the masked self-attention module. To make the model focus on the region of the corresponding character, i.e., to increase the attention weight of that region, the invention performs explicitly supervised training on the attention weights. The loss function is defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise.
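A sketch of this two-dimensional attention follows, reusing the scaled_dot_product_attention function above; flattening F' to 25 × 8 = 200 positions follows the text, while the function name and shapes are illustrative assumptions.

```python
import torch

def two_dim_attention(q, f_prime):
    """q: (t, 1024) masked self-attention output; f_prime: (25, 8, 1024) encoder features."""
    kv = f_prime.reshape(-1, f_prime.size(-1))            # (200, 1024): keys = values = F'
    out, w = scaled_dot_product_attention(q, kv, kv)      # out: (t, 1024), w: (t, 200)
    attn_2d = w.view(q.size(0), 25, 8)                    # per-step 2-D attention weight map
    return out, attn_2d                                   # attn_2d receives explicit supervision
```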
(6) After the two-dimensional attention module, a picture feature vector in R^(1×1×1024) is obtained; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
(7) Operations (3)-(6) are repeated at the next time step to obtain the predictions over successive time steps, until the END terminator is obtained.
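Steps (3)-(7) amount to the greedy autoregressive decoding loop sketched below; decoder_step and classifier stand in for the stacked attention modules and the final fully connected layer, and the special token ids and length cap are assumptions.

```python
import torch

START, END, MAX_LEN = 94, 95, 25          # special token ids and length cap assumed

@torch.no_grad()
def greedy_decode(f_prime, G, decoder_step, classifier):
    ids = [START]
    for _ in range(MAX_LEN):
        C = build_decoder_input(torch.tensor(ids), G)   # (t, 1024), see sketch above
        S, _ = decoder_step(C, f_prime)                 # stacked attention modules (assumed)
        logits = classifier(S[-1])                      # vector sized as the character set
        next_id = int(torch.argmax(logits))             # argmax prediction at this step
        if next_id == END:                              # stop at the END terminator
            break
        ids.append(next_id)
    return ids[1:]                                      # predicted character ids
```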
(8) During training, the decoder's input is the vector obtained by word embedding of the ground-truth label; at test time, since the ground-truth label is unknown, the decoder's previous output is used as the input at the current step. Back-propagation is involved only in the training phase.
The specific process of the embodiment is as follows:
1. label making of attention mechanism:
the invention provides the bounding box information of each character in the picture in the synthetic scene text data set SynthText, and a label of an attention mechanism is made according to the information for carrying out explicit supervision training on attention.
The model's input pictures have size 400 × 128, with width 400 and height 128. In the encoding stage, the invention adds an attention supervision signal on the last block of each layer of the ResNet. The spatial attention weight maps of the four stages have sizes 100 × 32, 50 × 16, 25 × 8, and 25 × 8 respectively, and labels of the corresponding sizes are made accordingly: the character bounding boxes are first scaled to the corresponding size according to the ratio between the original image and the attention weight map, and the attention label is then generated, with value 1 inside a character bounding box and 0 outside. In the decoding stage, a corresponding label must be made for the attention mechanism of each decoding step, each of size 25 × 8: the character bounding boxes are again scaled to the corresponding size, but the attention label is 1 only inside the current character's bounding box and 0 elsewhere.
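A sketch of this label-making step is given below; the box format (corner coordinates in input-image pixels) is an assumption.

```python
import numpy as np

def make_attention_label(boxes, map_w, map_h, img_w=400, img_h=128):
    """boxes: per-character (x1, y1, x2, y2) in input-image pixels;
    returns a (map_h, map_w) mask with 1 inside the scaled character boxes."""
    label = np.zeros((map_h, map_w), dtype=np.float32)
    sx, sy = map_w / img_w, map_h / img_h                # scale to attention-map size
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1 * sx), int(y1 * sy)
        c2, r2 = int(np.ceil(x2 * sx)), int(np.ceil(y2 * sy))
        label[r1:r2, c1:c2] = 1.0                        # 1 inside the character box
    return label

# Encoder labels (100x32, 50x16, 25x8, 25x8) use all character boxes; each
# decoding step's 25x8 label uses only the current character's box.
```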
2. Scene text picture preprocessing
To make the input pictures 400 × 128, each picture is resized to 400 × 128 using bilinear interpolation. The data augmentation used in training consists of random cropping and changes to the image's brightness, contrast, saturation, and hue.
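A possible preprocessing and augmentation pipeline matching this description is sketched below; the crop size and jitter magnitudes are assumptions.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop((120, 380), pad_if_needed=True),        # random cropping (size assumed)
    T.ColorJitter(brightness=0.3, contrast=0.3,          # brightness/contrast/
                  saturation=0.3, hue=0.1),              # saturation/hue changes
    T.Resize((128, 400),                                 # height 128, width 400
             interpolation=T.InterpolationMode.BILINEAR),
    T.ToTensor(),
])
```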
3. ResNet scene text picture feature extraction based on explicit supervision mechanism
The tensor (400 × 128 × 3) obtained by picture preprocessing is input to the feature extraction layers of ResNet34. In each block of the four layers of ResNet34, the channel attention and spatial attention mechanism CBAM is added, as shown in FIG. 1.
To enlarge the extracted feature map, the stride of the last layer of ResNet34 is changed from 2 to 1. The overall framework of the adjusted ResNet34 is shown in FIG. 2. Feature extraction yields F, where F ∈ R^(25×8×512); F is input into a global feature extraction layer of six Bottleneck blocks to obtain the global feature representation G, where G ∈ R^(1×1×512); meanwhile, a 1×1 convolution is applied to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024).
4. feature decoding of Transformer structure based on explicit supervised attention mechanism
At decoding time step t, the embeddings of the previously predicted characters are added to the positional encodings (position embeddings) to obtain E, where E ∈ R^(t×512); E is then concatenated with the global representation G to obtain a vector C, where C ∈ R^(t×1024), which is input into the self-attention module to model the relationships between characters. In the two-dimensional attention module, the output of the self-attention module serves as Q and the encoded features F' serve as K and V; at each time step, the feature vector currently attended to, S_t ∈ R^1024, is computed. A position-wise feed-forward layer is added after both the self-attention module and the two-dimensional attention module; it can be regarded as a two-layer fully connected network whose input and output dimensions are 1024 and whose hidden dimension is 2048.
5. Model training
S_t is passed through a fully connected layer whose output dimension equals the number of character classes; a softmax operation then converts the output vector into a probability distribution over the characters, and the character with the highest probability is taken as the prediction at this step. By analogy, the predictions obtained over the successive time steps form the complete text of the scene image. The recognition loss uses a cross-entropy loss function:
$$\mathcal{L}_{rec} = -\log\frac{e^{x_{gt}}}{\sum_{j=1}^{94} e^{x_{j}}}$$
where x is the predicted 94-dimensional vector and gt is the true character label. The final loss function is:
$$\mathcal{L} = \alpha\left(\mathcal{L}_{att}^{enc}+\mathcal{L}_{att}^{dec}\right)+\beta\,\mathcal{L}_{rec}$$
where α and β are balancing coefficients; here α is 0.1 and β is 1.
ADADELTA is chosen as the optimizer to compute gradients and perform back-propagation. The training batch size is set to 112; one epoch requires 64638 iterations, and 6 epochs are trained in total.
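A hedged sketch of the total objective and optimizer setup follows; splitting the attention term into encoder and decoder parts is an assumption consistent with the two supervised attention losses defined above.

```python
import torch
import torch.nn.functional as F

alpha, beta = 0.1, 1.0                                    # coefficients from the text

def total_loss(logits, target_ids, enc_attn_losses, dec_attn_losses):
    rec = F.cross_entropy(logits, target_ids)             # recognition loss over 94 classes
    att = sum(enc_attn_losses) + sum(dec_attn_losses)     # supervised attention losses
    return alpha * att + beta * rec

# optimizer = torch.optim.Adadelta(model.parameters())    # batch size 112, 6 epochs
```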
8. Model application
After the training process, several model checkpoints are obtained, and the best one (with the smallest loss value) is selected for application. At this stage, no data augmentation of the image data is needed: the image only has to be resized to 400 × 128 and normalized before being used as model input. With the parameters of the whole network fixed, the image data is simply fed forward, yielding in turn the feature map F' ∈ R^(25×8×1024) and G ∈ R^(1×1×512); these are then passed into the decoding network for automatic decoding, and the recognition result is obtained directly from the full model. When a large number of scene text pictures must be tested, they can all be packed into a single lmdb-format file so that all pictures can conveniently be read at once.
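A minimal inference sketch under these settings; the normalization statistics and the model's encode/decode interfaces are assumptions.

```python
import torch
import torchvision.transforms as T
from PIL import Image

infer_transform = T.Compose([
    T.Resize((128, 400), interpolation=T.InterpolationMode.BILINEAR),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),   # stats assumed
])

@torch.no_grad()
def recognize(model, image_path):
    x = infer_transform(Image.open(image_path).convert('RGB')).unsqueeze(0)
    f_prime, G = model.encode(x)                  # assumed encoder interface
    ids = greedy_decode(f_prime[0], G[0], model.decoder_step, model.classifier)
    return ids                                    # map ids back to characters as needed
```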
The invention is not to be considered as limited to the particular embodiments shown, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A scene text recognition method based on an explicit supervision mechanism is characterized by comprising the following steps:
Step 1: input the scene text picture into a ResNet34 convolutional neural network to extract a feature map, denoted F, where F ∈ R^(25×8×512); input F into a global feature extraction layer of six Bottleneck blocks to obtain a global feature representation G, where G ∈ R^(1×1×512); meanwhile, apply a 1×1 convolution to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024); a channel attention and spatial attention mechanism is added to each block of the four layers of ResNet34;
the spatial attention is explicitly supervised according to the characters' labeled bounding boxes, with the loss computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise; the supervision signal is added only to the last block of each layer;
Step 2: at decoding time step t, add the embeddings of the previously predicted characters and the position information to obtain E, where E ∈ R^(t×512), and then concatenate E with the global feature representation G to obtain a vector C, where C ∈ R^(t×1024);
Step 3: input C into a masked self-attention mechanism to model the dependencies between different characters in the output word; the self-attention mechanism is as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
the attention calculation is divided into three steps: the first step calculates the similarity between the query and each key to obtain the weights; the second step normalizes the weights, typically using a softmax function; the final step computes the weighted sum of the weights and the corresponding values to obtain the final attention output;
connect the encoder and the decoder with a two-dimensional attention module, whose structure is essentially the same as the self-attention module, except that its K and V both come from the F' obtained in the encoding stage while Q is the output of the masked self-attention module; the invention performs explicitly supervised training on the attention weights, with the loss function defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise;
Step 4: a picture feature vector is obtained after the two-dimensional attention module; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
2. The scene text recognition method based on an explicitly supervised attention mechanism as recited in claim 1, wherein in step 1: the channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron; the features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map; this map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module; the spatial attention module takes the feature map output by the channel attention module as its input feature map, first performs channel-based global max pooling and global average pooling, then concatenates the two results along the channel dimension; a convolution operation then reduces the result to a single channel; a sigmoid generates the spatial attention map; finally, this map is multiplied with the module's input features to obtain the final features.
3. A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
4. A computer-readable storage medium having stored thereon computer-executable instructions for, when executed, implementing the method of claim 1.
5. A computer program comprising computer executable instructions which when executed perform the method of claim 1.
CN202110273068.7A 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism Pending CN113159023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273068.7A CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110273068.7A CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Publications (1)

Publication Number Publication Date
CN113159023A true CN113159023A (en) 2021-07-23

Family

ID=76886970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273068.7A Pending CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Country Status (1)

Country Link
CN (1) CN113159023A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569871A (en) * 2021-08-03 2021-10-29 内蒙古工业大学 Library automatic book-making method and system based on deep learning
CN114020881A (en) * 2022-01-10 2022-02-08 珠海金智维信息科技有限公司 Topic positioning method and system
CN115019143A (en) * 2022-06-16 2022-09-06 湖南大学 Text detection method based on CNN and Transformer mixed model
CN115147381A (en) * 2022-07-08 2022-10-04 烟台大学 Pavement crack detection method based on image segmentation
CN115067945A (en) * 2022-08-22 2022-09-20 深圳市海清视讯科技有限公司 Fatigue detection method, device, equipment and storage medium
CN115424330A (en) * 2022-09-16 2022-12-02 郑州轻工业大学 Single-mode face in-vivo detection method based on DFMN and DSD
CN115424330B (en) * 2022-09-16 2023-08-11 郑州轻工业大学 Single-mode face living body detection method based on DFMN and DSD

Similar Documents

Publication Publication Date Title
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN113065550B (en) Text recognition method based on self-attention mechanism
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN114495129B (en) Character detection model pre-training method and device
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
Tang et al. FontRNN: Generating Large‐scale Chinese Fonts via Recurrent Neural Network
CN114596566B (en) Text recognition method and related device
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN115222998B (en) Image classification method
Fan et al. A novel sonar target detection and classification algorithm
CN114581918A (en) Text recognition model training method and device
CN111242114B (en) Character recognition method and device
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Brzeski et al. Evaluating performance and accuracy improvements for attention-OCR
US11816909B2 (en) Document clusterization using neural networks
CN114692715A (en) Sample labeling method and device
Zhu et al. Dc-net: Divide-and-conquer for salient object detection
Lin et al. Spatio-temporal co-attention fusion network for video splicing localization
CN114882412B (en) Labeling-associated short video emotion recognition method and system based on vision and language
CN118038451B (en) Open world fruit detection model construction method, detection method and electronic equipment
Ma et al. Har enhanced weakly-supervised semantic segmentation coupled with adversarial learning
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination