CN112070114B - Scene character recognition method and system based on Gaussian constraint attention mechanism network


Info

Publication number
CN112070114B
CN112070114B (application CN202010767079.6A)
Authority
CN
China
Prior art keywords
time step
original
attention
weighted
feature vector
Prior art date
Legal status
Active
Application number
CN202010767079.6A
Other languages
Chinese (zh)
Other versions
CN112070114A (en)
Inventor
王伟平
乔峙
秦绪功
周宇
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN202010767079.6A
Publication of CN112070114A
Application granted
Publication of CN112070114B

Classifications

    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06F40/30 Handling natural language data: semantic analysis
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: combinations of networks
    • G06V30/10 Character recognition

Abstract

The invention provides a scene character recognition method and system based on a Gaussian-constrained attention mechanism network, relating to the field of image information recognition. Visual features are extracted from a picture to be recognized to obtain a two-dimensional feature map; the two-dimensional feature map is converted into a one-dimensional feature sequence, from which global semantic information is extracted; the global semantic information is input at the first time step to initialize the decoder hidden state, and at each time step an original attention weight is computed from the hidden state and the two-dimensional feature map, with a weighted summation yielding an original weighted feature vector; a two-dimensional Gaussian distribution mask is constructed from the hidden state and the original weighted feature vector and multiplied with the original attention weight to obtain a corrected attention weight, from which a corrected weighted feature vector is obtained; finally, the original and corrected weighted feature vectors are fused to predict the characters of the picture to be recognized, thereby alleviating attention dispersion.

Description

Scene character recognition method and system based on Gaussian constraint attention mechanism network
Technical Field
The invention relates to the field of image information recognition, in particular to a scene text recognition method and system based on a Gaussian-constrained attention mechanism network.
Background
Text detection and recognition in scene images is a recent research hotspot, and text recognition is the core of the whole pipeline: its task is to transcribe the text in a picture into a form that a computer can directly edit. With the development of deep learning, the field has advanced rapidly. Inspired by machine translation, current mainstream methods are based on an encoder-decoder structure: the encoder extracts rich visual features through a convolutional neural network and a recurrent neural network, while the decoder acquires the required features through an attention mechanism and predicts each character in order along the text sequence.
However, the prior art has the following defects:
1. At each decoding time step, text recognition only needs the region of one specific character in the text picture, and existing methods do not fully exploit this characteristic of text recognition.
2. Existing methods do not constrain the attention weight but let the model predict it freely, so attention dispersion occurs on some pictures, i.e. the weight cannot concentrate on a specific character.
3. Some existing approaches use a Gaussian-distributed label for each character's position to supervise the attention weight, thereby constraining it implicitly. However, because no explicit constraint is introduced into the computation, attention dispersion still occurs on some pictures.
Disclosure of Invention
The invention aims to provide a scene character recognition method and system based on a Gaussian-constrained attention mechanism network, which introduce an explicit constraint into the computation of the attention weight to correct the original attention weight, so that the corrected attention weight concentrates more on the region corresponding to a character and the problem of attention dispersion is alleviated.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a scene text recognition method based on Gaussian constraint attention mechanism network comprises the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting the global semantic information at the first time step to initialize the decoder hidden state; in each time step, calculating an original attention weight according to the hidden state and the two-dimensional feature map, and obtaining an original weighted feature vector by a weighted summation with the weights;
constructing a two-dimensional Gaussian distribution mask according to the hidden state and the original weighted feature vector, multiplying the mask by the original attention weight to obtain corrected attention weight, and obtaining a corrected weighted feature vector according to the weight;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict the character of the picture to be identified.
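The steps above can be sketched as a single decoding loop. The module objects (`encoder`, `attention`, `correction`, `classifier`, `decoder_cell`) are hypothetical stand-ins for the components the method describes, and the toy lambdas at the bottom only exercise the control flow:

```python
import torch

def decode(feat_2d, encoder, decoder_cell, attention, correction, classifier, max_len=25):
    """Sketch of the claimed decoding loop; all module arguments are
    hypothetical stand-ins for the components described above."""
    hidden = encoder(feat_2d)            # global semantic information initializes the hidden state
    chars = []
    for _ in range(max_len):
        glimpse, alpha = attention(feat_2d, hidden)                        # original weighted feature
        glimpse_c, _alpha_c = correction(feat_2d, hidden, glimpse, alpha)  # corrected feature
        fused = torch.cat([glimpse, glimpse_c], dim=1)                     # fuse both vectors
        chars.append(classifier(fused).argmax(dim=1))                      # predict one character
        hidden = decoder_cell(fused, hidden)                               # update for next step
    return torch.stack(chars, dim=1)

# Toy stand-ins that only exercise the control flow:
feat = torch.randn(2, 8, 4, 16)
preds = decode(
    feat,
    encoder=lambda f: torch.zeros(2, 8),
    decoder_cell=lambda x, h: h,
    attention=lambda f, h: (f.mean(dim=(2, 3)), torch.full((2, 1, 4, 16), 1 / 64)),
    correction=lambda f, h, g, a: (g, a),
    classifier=torch.nn.Linear(16, 37),   # 37 character classes is an assumption
    max_len=5,
)
```

In a real implementation the stand-ins would be the trained modules; the loop itself mirrors the claim: original weights, Gaussian-corrected weights, then fusion for prediction.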
A scene text recognition system based on a gaussian constrained attention mechanism network, comprising:
the feature extraction module comprises a multi-layer residual network and is responsible for extracting visual features of the picture to be recognized to obtain a two-dimensional feature map;
the encoder module comprises a unidirectional two-layer long short-term memory (LSTM) network and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence, which is then input into the LSTM to extract global semantic information;
the decoder module comprises a unidirectional two-layer attention-based LSTM (AM-LSTM) and is responsible for initializing the hidden state of the AM-LSTM at the first time step with the global semantic information; in each time step it calculates an original attention weight according to the AM-LSTM hidden state and the two-dimensional feature map, obtains an original weighted feature vector by a weighted summation with the weights, and fuses the original weighted feature vector with the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the Gaussian-constraint-based correction module is responsible for constructing a two-dimensional Gaussian distribution mask according to the AM-LSTM hidden state and the original weighted feature vector, multiplying the mask with the original attention weight to obtain a corrected attention weight, and obtaining the corrected weighted feature vector from that weight.
Further, the feature extraction module includes a 31-layer residual network.
Further, the encoder module is responsible for max-pooling the two-dimensional feature map into the one-dimensional feature sequence.
Further, for each time step starting from the second, the decoder module updates the hidden state of the AM-LSTM according to the decoding result of the previous time step.
Further, the Gaussian-constraint-based correction module is responsible for concatenating the AM-LSTM hidden state with the original weighted feature vector, predicting a group of Gaussian distribution parameters, comprising a mean and a variance, through a fully connected layer, and constructing a two-dimensional Gaussian distribution mask from these parameters.
Further, the system is trained by optimizing a character recognition loss and an attention weight loss: the character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels, and the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
Compared with the prior art, the invention provides a novel Gaussian-constraint-based correction module, which predicts a Gaussian mask to correct the original attention weight. Since the characters in a word recognition task are usually regular in shape, the model is made to additionally predict a Gaussian mask as an explicit constraint for correcting the original attention weight. The corrected attention weight concentrates more on the region corresponding to the character, which alleviates attention dispersion. Experiments show that the invention achieves superior performance on existing datasets, and the proposed module is flexible enough to be used in existing attention-based methods.
Drawings
Fig. 1 is a schematic diagram of a scene text recognition network structure based on a gaussian constraint attention mechanism network according to an embodiment.
Fig. 2 is a schematic diagram of a decoder structure according to an embodiment.
Fig. 3 is a diagram comparing visualizations of the recognition results of a prior-art method and of the present invention.
Detailed Description
In order to make the technical scheme of the invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.
This embodiment discloses a scene character recognition method and system based on a Gaussian-constrained attention mechanism network (GCAN). As shown in Fig. 1, GCAN is a recognition model based on a two-dimensional attention mechanism into which a novel Gaussian-constraint-based correction module (GCRM) is introduced. The input of the GCRM is the original, unconstrained attention-weighted feature; its output is the corrected attention-weighted feature. The two feature vectors are fused and then used to predict the character of the current decoding time step; a time step is one iteration of the step-by-step decoding, each predicting one character of the word. The system consists of four parts: a feature extraction module, an encoder module, a decoder module, and a Gaussian-constraint-based correction module.
The feature extraction module is a 31-layer residual network, which extracts rich visual features for the subsequent encoding and decoding processes.
The encoder module consists of a unidirectional two-layer long short-term memory (LSTM) network. The two-dimensional feature map output by the feature extraction module is first max-pooled along the vertical direction to obtain a one-dimensional feature sequence, which is then input into the LSTM to extract context information. The output of the encoder module is the hidden state of the LSTM at the last moment, which serves as global semantic information to guide the decoder.
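A minimal PyTorch sketch of this encoder, assuming a 512-channel feature map and a 256-dimensional hidden state (the sizes are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Vertical max-pooling followed by a unidirectional two-layer LSTM.
    The 512-channel input and 256-dim hidden state are assumptions."""
    def __init__(self, channels=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, num_layers=2, batch_first=True)

    def forward(self, feat_2d):              # feat_2d: (B, C, H, W)
        seq = feat_2d.max(dim=2).values      # max-pool over height -> (B, C, W)
        seq = seq.permute(0, 2, 1)           # one feature vector per image column
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                       # last hidden state = global semantics

enc = Encoder()
g = enc(torch.randn(2, 512, 8, 32))          # a batch of two 8x32 feature maps
```

The vertical pooling collapses each column of the feature map into one vector, so the LSTM reads the image left to right, matching the horizontal reading order of most text.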
The decoder module consists of a unidirectional two-layer attention-based LSTM (AM-LSTM), whose structure is shown in Fig. 2. The global semantic information output by the encoder is input at the first decoding time step; at every subsequent time step, the decoding result of the previous time step is input to update the hidden state of the decoder AM-LSTM. At each time step, the original attention weight is computed from the hidden state of the AM-LSTM and the feature map output by the feature extraction module, and the feature map is summed, weighted by the original attention weight, to obtain the original weighted feature vector. Finally, the original weighted feature vector and the corrected weighted feature vector are fused to predict the character of the current time step.
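One decoding step of this attention computation might be sketched as follows, using additive attention; the exact scoring function and all dimensions are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention2D(nn.Module):
    """One step of 2D additive attention; the scoring function and all
    dimensions are assumptions, since the patent does not fix them."""
    def __init__(self, feat_dim=512, hid_dim=256, att_dim=256):
        super().__init__()
        self.w_f = nn.Conv2d(feat_dim, att_dim, kernel_size=1)
        self.w_h = nn.Linear(hid_dim, att_dim)
        self.score = nn.Conv2d(att_dim, 1, kernel_size=1)

    def forward(self, feat, hidden):         # feat: (B, C, H, W), hidden: (B, hid)
        e = torch.tanh(self.w_f(feat) + self.w_h(hidden)[:, :, None, None])
        alpha = F.softmax(self.score(e).flatten(1), dim=1)     # original attention weight
        alpha = alpha.view(feat.size(0), 1, *feat.shape[2:])   # back to (B, 1, H, W)
        glimpse = (alpha * feat).sum(dim=(2, 3))               # original weighted feature vector
        return glimpse, alpha

att = Attention2D()
glimpse, alpha = att(torch.randn(2, 512, 8, 32), torch.randn(2, 256))
```

The softmax normalizes the weights over all spatial positions, which is exactly what allows them to "disperse" when unconstrained; the correction module below addresses this.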
In the Gaussian-constraint-based correction module, the hidden state of the decoder AM-LSTM at the corresponding time step is concatenated with the original weighted feature vector, a group of Gaussian distribution parameters (mean and variance) is predicted through a fully connected layer, and a two-dimensional Gaussian distribution constructed from these parameters serves as a mask. The mask is multiplied with the original attention weight to obtain the corrected attention weight, from which a new, corrected weighted feature vector is computed. Compared with the original attention weight, the corrected attention is more concentrated, which alleviates attention dispersion.
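A sketch of this correction module under the definitions above. The patent only states that a fully connected layer predicts the mean and variance; the sigmoid/softplus squashing used here to keep those parameters in valid ranges, and all sizes, are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianCorrection(nn.Module):
    """A fully connected layer predicts the mean and variance of a 2D
    Gaussian from the concatenated hidden state and original glimpse;
    the resulting mask rescales the original attention weights.
    Sigmoid/softplus squashing and all sizes are assumptions."""
    def __init__(self, feat_dim=512, hid_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim + hid_dim, 4)   # mu_x, mu_y, sigma_x, sigma_y

    def forward(self, feat, hidden, glimpse, alpha): # alpha: (B, 1, H, W)
        _, _, H, W = feat.shape
        p = self.fc(torch.cat([hidden, glimpse], dim=1))
        mu = torch.sigmoid(p[:, :2])                 # means in [0, 1] map coordinates
        sigma = F.softplus(p[:, 2:]) + 1e-3          # keep the deviations positive
        ys = torch.linspace(0, 1, H, device=feat.device)
        xs = torch.linspace(0, 1, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        mask = torch.exp(
            -((gx - mu[:, 0, None, None]) ** 2) / (2 * sigma[:, 0, None, None] ** 2)
            - ((gy - mu[:, 1, None, None]) ** 2) / (2 * sigma[:, 1, None, None] ** 2)
        )
        alpha_c = alpha * mask[:, None]              # explicitly constrain the raw weights
        alpha_c = alpha_c / alpha_c.sum(dim=(2, 3), keepdim=True)
        glimpse_c = (alpha_c * feat).sum(dim=(2, 3)) # corrected weighted feature vector
        return glimpse_c, alpha_c

corr = GaussianCorrection()
feat = torch.randn(2, 512, 8, 32)
alpha = torch.softmax(torch.randn(2, 1, 8, 32).flatten(1), dim=1).view(2, 1, 8, 32)
glimpse_c, alpha_c = corr(feat, torch.randn(2, 256), (alpha * feat).sum(dim=(2, 3)), alpha)
```

Because the mask is unimodal, multiplying it with the raw weights suppresses attention far from the predicted character center, which is the explicit constraint the patent describes.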
The whole process for identifying the scene characters by adopting the method and the system comprises the following steps:
1. and the input picture extracts visual features through a feature extraction module to obtain a two-dimensional feature map.
2. The extracted visual features are passed through an encoder module to extract global semantic information, which is then input into a decoder module.
3. The decoder module adopts an attention mechanism: the original attention weight is calculated from the hidden state and the feature map output by the feature extraction module, and the original weighted feature vector is obtained by weighted summation.
4. The original weighted feature vector and the hidden state are input into the Gaussian-constraint-based correction module, which corrects the original attention weight predicted by the decoder with a two-dimensional Gaussian mask to obtain the corrected attention weight and, from it, the corrected weighted feature vector.
5. And fusing the corrected weighted feature vector and the original weighted feature vector together to predict corresponding characters.
6. The whole model is trained by optimizing a character recognition loss and an attention weight loss. The character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels; the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
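The combined objective of step 6 can be sketched as follows; the balance factor `lam` between the two losses is an assumption, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, alpha_pred, alpha_gt, lam=1.0):
    """Character cross-entropy plus L1 regression on attention maps;
    the balance factor `lam` is an assumption."""
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    l1 = F.l1_loss(alpha_pred, alpha_gt)
    return ce + lam * l1

logits = torch.randn(2, 5, 37)            # (batch, decoding steps, character classes)
targets = torch.randint(0, 37, (2, 5))    # recognition labels
alpha_pred = torch.rand(2, 5, 8, 32)      # predicted attention map per step
alpha_gt = torch.rand(2, 5, 8, 32)        # Gaussian position labels (stand-in)
loss = total_loss(logits, targets, alpha_pred, alpha_gt)
```

The L1 term supervises where the model looks, while the cross-entropy term supervises what it reads; both gradients flow through the shared decoder.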
Extensive experiments were conducted to evaluate the effectiveness of the GCAN. GCAN is trained on two synthetic datasets, Syn90k and SynthText, and tested on several mainstream scene text datasets. IIIT5K has 3000 images, mostly high-quality horizontal text; SVT has 647 images, mostly horizontal text; SVT-Perspective (SVTP) has 645 images, most with strong distortion; ICDAR2013 (IC13) has 1015 images, mostly high-quality horizontal text; ICDAR2015 (IC15) has 1811 images, mostly low-quality text of arbitrary shape; CUTE has 288 images, mostly high-quality curved text.
Table 1 compares the effect of each module of the GCAN; the results show that the proposed GCRM brings a clear improvement, while character supervision alone, as in existing methods, does not improve performance significantly. Table 2 compares the invention with other mainstream methods on the test datasets; the best performance on multiple datasets demonstrates its effectiveness. Fig. 3 visualizes the recognition results and attention weights of a prior-art method and of the invention: for each picture on the left, the first result on the right is from the prior-art method and the second is from the invention; the white rings mark the attended positions and the letters below are the recognized characters. It can be seen that the invention effectively resolves attention dispersion and obtains better recognition results.
Table 1 comparative experiments of the respective modules
Table 2 comparison of GCAN with other methods on various datasets
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention, and the scope of the present invention is defined by the claims.

Claims (8)

1. A scene text recognition method based on a Gaussian constraint attention mechanism network is characterized by comprising the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting the global semantic information at the first time step to initialize the decoder hidden state; in each time step, calculating an original attention weight according to the hidden state and the two-dimensional feature map, and obtaining an original weighted feature vector by a weighted summation with the weights;
in each decoding time step, concatenating the hidden state of the corresponding time step with the original weighted feature vector, and then predicting a group of parameters of a Gaussian distribution through a fully connected layer, wherein the parameters comprise a mean and a variance; constructing a two-dimensional Gaussian distribution from the parameters as a mask; and multiplying the mask with the original attention weight to obtain a corrected attention weight, from which a corrected weighted feature vector is obtained;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict the character of the picture to be identified.
2. The method of claim 1, wherein, starting from the second time step, each time step inputs the decoding result of the previous time step to update the hidden state.
3. A scene text recognition system based on a gaussian constrained attention mechanism network, comprising:
the feature extraction module comprises a multi-layer residual error network and is responsible for extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
the encoder module comprises a unidirectional two-layer long short-term memory (LSTM) network and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence, inputting the one-dimensional feature sequence into the LSTM to extract global semantic information, and outputting the hidden state of the LSTM at the last moment;
the decoder module comprises a unidirectional two-layer attention-based LSTM (AM-LSTM) and is responsible for updating the hidden state of the AM-LSTM at each time step based on the global semantic information; in each time step it calculates an original attention weight according to the AM-LSTM hidden state and the two-dimensional feature map, obtains an original weighted feature vector by a weighted summation with the weights, and fuses the original weighted feature vector with the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the Gaussian-constraint-based correction module is responsible, at each decoding time step, for concatenating the AM-LSTM hidden state of the corresponding time step with the original weighted feature vector, predicting a group of Gaussian distribution parameters comprising a mean and a variance through a fully connected layer, constructing a two-dimensional Gaussian distribution from the parameters as a mask, multiplying the mask with the original attention weight to obtain a corrected attention weight, and obtaining the corrected weighted feature vector from that weight.
4. The system of claim 3, wherein the feature extraction module comprises a 31-layer residual network.
5. The system of claim 3, wherein the encoder module is responsible for max-pooling the two-dimensional feature map into the one-dimensional feature sequence.
6. The system of claim 3, wherein the decoder module is responsible for inputting the global semantic information at the first decoding time step, and thereafter updating the hidden state of the AM-LSTM at each time step according to the decoding result of the previous time step.
7. The system of claim 3, wherein the system optimizes training by calculating character recognition loss and attention weight loss.
8. The system of claim 7, wherein the character recognition penalty is optimized by calculating a cross entropy penalty between predicted character probabilities and recognition labels, and the attention weight penalty is optimized by calculating an L1 regression penalty between predicted character attention distribution and character position labels.
CN202010767079.6A 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network Active CN112070114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767079.6A CN112070114B (en) 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network


Publications (2)

Publication Number Publication Date
CN112070114A (en) 2020-12-11
CN112070114B (en) 2023-05-16

Family

ID=73657592


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501B (en) * 2020-12-18 2021-09-07 北京中科研究院 Scene character recognition method based on visual language modeling network
CN113065561A (en) * 2021-03-15 2021-07-02 国网河北省电力有限公司 Scene text recognition method based on fine character segmentation
CN113221874A (en) * 2021-06-09 2021-08-06 上海交通大学 Character recognition system based on Gabor convolution and linear sparse attention
CN113591546B (en) * 2021-06-11 2023-11-03 中国科学院自动化研究所 Semantic enhancement type scene text recognition method and device
CN114463675B (en) * 2022-01-11 2023-04-28 北京市农林科学院信息技术研究中心 Underwater fish group activity intensity identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5710833A (en) * 1995-04-20 1998-01-20 Massachusetts Institute Of Technology Detection, recognition and coding of complex objects using probabilistic eigenspace analysis
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Recognition system for text images with uneven illumination based on deep learning; 何鎏一 et al.; Computer Applications and Software; 2020-06-12 (No. 06); full text *


Similar Documents

Publication Publication Date Title
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN107239801B (en) Video attribute representation learning method and video character description automatic generation method
CN111753827B (en) Scene text recognition method and system based on semantic enhancement encoder and decoder framework
CN106570464A (en) Human face recognition method and device for quickly processing human face shading
CN110188762B (en) Chinese-English mixed merchant store name identification method, system, equipment and medium
CN111967471A (en) Scene text recognition method based on multi-scale features
CN115080766B (en) Multi-modal knowledge graph characterization system and method based on pre-training model
CN112733768A (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN112329767A (en) Contract text image key information extraction system and method based on joint pre-training
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN111259197B (en) Video description generation method based on pre-coding semantic features
CN113191355A (en) Text image synthesis method, device, equipment and storage medium
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN111814508A (en) Character recognition method, system and equipment
CN115718815A (en) Cross-modal retrieval method and system
CN113535975A (en) Chinese character knowledge graph-based multi-dimensional intelligent error correction method
CN112287938A (en) Text segmentation method, system, device and medium
CN117612071B (en) Video action recognition method based on transfer learning
CN116311275B (en) Text recognition method and system based on seq2seq language model
CN116452600B (en) Instance segmentation method, system, model training method, medium and electronic equipment
Beche et al. Narrowing the semantic gap between real and synthetic data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant