CN111967470A - Text recognition method and system based on decoupling attention mechanism - Google Patents
- Publication number
- CN111967470A (application CN202010841738.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- neural network
- layer
- image
- convolution
- Prior art date
- Legal status
- Pending
Classifications
- G06V20/63—Scene text, e.g. street names
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/40—Extraction of image or video features
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention discloses a text recognition method and system based on a decoupling attention mechanism, comprising a feature coding module, a convolution alignment module and a text decoding module. The feature coding module extracts visual features from an input image with a deep convolutional neural network. The convolution alignment module replaces the traditional score-based recurrent alignment module: it takes multi-scale visual features from the feature coding module as input and generates attention maps channel by channel with a fully convolutional neural network. The text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result. The method is simple to implement, achieves high recognition accuracy, and is effective, flexible and robust; it performs well in scene text recognition, handwritten text recognition and other text recognition tasks, and has good practical value.
Description
Technical Field
The invention belongs to the technical field of pattern recognition and artificial intelligence, and in particular relates to an accurate image recognition method based on deep neural networks.
Background
In recent years, text recognition has attracted broad research interest. Thanks to deep learning and advances in sequence modeling, many text recognition techniques have achieved significant success. Connectionist temporal classification (CTC) and the attention mechanism are two popular approaches to the sequence problem; of the two, the attention mechanism shows stronger performance and has been widely studied in recent years.
The attention mechanism was first proposed for machine translation and is increasingly used for scene text recognition. Since then, attention-based techniques have driven much of the progress in the field of text recognition, where they are used to align and recognize characters. In prior work, the alignment operation of the attention mechanism is always coupled with the decoding operation. Specifically, the alignment operation of conventional attention-based techniques uses two kinds of information. The first is the feature map, i.e. the visual information obtained by encoding the image with an encoder; the second is historical decoding information, which may be the hidden state of the recurrence or the embedding vector of the previous decoding result. The core idea behind the attention mechanism is matching: given a part of the feature map, an attention score is computed by scoring how well that part matches the historical decoding information.
Conventional attention-based techniques therefore often face serious alignment problems, because coupling the alignment and decoding operations inevitably leads to the accumulation and propagation of errors. Matching-based alignment is highly susceptible to the decoding result: for example, when a string contains two similar substrings, the attention focus can easily jump from one substring to the other through the historical decoding information. This is also why the attention mechanism has been observed in the literature to struggle with long sequences, since longer sequences are more likely to contain similar substrings. This motivates us to decouple the alignment operation from the historical decoding information, thereby alleviating this negative effect.
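To make the coupling concrete, the following is a minimal PyTorch sketch of the matching-based scoring just described; it is not the patent's own code, and the module name, dimensions and the additive scoring form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CoupledAttention(nn.Module):
    """Conventional (coupled) attention: the score at every spatial position
    depends on the decoder hidden state h_prev, i.e. on historical decoding
    information, so a decoding error distorts the next alignment."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_f = nn.Linear(feat_dim, attn_dim)    # projects visual features
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects decoding history
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feat, h_prev):
        # feat: (B, W*H, feat_dim); h_prev: (B, hidden_dim)
        score = self.v(torch.tanh(self.w_f(feat)
                                  + self.w_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(score, dim=1)         # (B, W*H, 1)
        return alpha
```

The decoupled design described below removes the `h_prev` term from the alignment entirely, computing attention maps from visual features alone.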
Disclosure of Invention
The invention aims to provide a text recognition method and system based on a decoupling attention mechanism.
In order to achieve the purpose, the invention provides the following scheme:
a text recognition method based on a decoupling attention mechanism comprises the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map;
S2, aligning the feature map to obtain a target image, constructing a deep convolutional neural network model, and processing the target image with this model to obtain an attention map and to train the model;
S3, performing accurate character recognition on the feature map and the attention map with the deep convolutional neural network recognition model;
preferably, the text image is a scene text image and/or a handwritten character image;
preferably, the scene text image and/or the handwritten text image are characterized as follows:
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles;
preferably, in the scene text training data set the text part is complete and occupies more than two thirds of the image area; the data set comprises a variety of font styles and is allowed to cover lighting changes and resolution changes;
preferably, the scene text real evaluation data set is captured with mobile phones and dedicated camera hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, tilt and blur are allowed, and the captured scene text images cover application scenes with different font styles;
preferably, the handwritten text real training data and real evaluation data are written and collected by different people, so that the training data and the evaluation data are independent.
Preferably, the text image alignment processing method comprises:
stretching the image data of the scene text training data set and the scene text real evaluation data set to a uniform size;
scaling the handwritten text real training data set and the handwritten text real evaluation data set while keeping the original aspect ratio, and padding the borders until the sizes are uniform.
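A minimal sketch of these two resizing strategies follows; the 256x32 target size is an assumption, since the patent does not fix one.

```python
from PIL import Image, ImageOps

TARGET_W, TARGET_H = 256, 32  # assumed target size

def resize_scene_text(img: Image.Image) -> Image.Image:
    """Scene text: stretch directly to the uniform size."""
    return img.resize((TARGET_W, TARGET_H))

def resize_handwritten(img: Image.Image, fill=255) -> Image.Image:
    """Handwritten text: scale while keeping the original aspect ratio,
    then pad the borders up to the uniform size."""
    scale = min(TARGET_W / img.width, TARGET_H / img.height)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    img = img.resize((new_w, new_h))
    pad_w, pad_h = TARGET_W - new_w, TARGET_H - new_h
    # pad (left, top, right, bottom) so the content stays centered
    return ImageOps.expand(img, (pad_w // 2, pad_h // 2,
                                 pad_w - pad_w // 2, pad_h - pad_h // 2),
                           fill=fill)
```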
Preferably, the deep convolutional neural network construction method comprises the following steps:
extracting multi-scale visual features based on the feature encoding;
performing convolution and deconvolution with a fully convolutional neural network to construct the deep convolutional neural network model;
in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage;
the convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution;
preferably, the network structure of the deep convolutional neural network model consists of an input layer, convolutional layers and residual layers;
preferably, the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer.
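As a sketch, the residual layer described above could look as follows in PyTorch; channel counts, kernel sizes and strides are assumptions, not values taken from the patent's tables.

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """conv-BN-ReLU-conv-BN with a conv+BN down-sampling branch on the
    shortcut, followed by the second ReLU, matching the layer order above."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # down-sampling layer realized by a convolution plus batch normalization
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.bn2(self.conv2(self.relu1(self.bn1(self.conv1(x)))))
        return self.relu2(out + identity)
```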
Preferably, the training of the deep convolutional neural network model in S2 uses the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
preferably, the training strategy of the deep convolutional neural network model is supervised: a general deep network recognition model is trained with text image data and the corresponding label information;
preferably, the input image of the deep convolutional neural network model is a handwritten text image and/or a scene text image, and the output is the character sequence in the text image and/or the scene text image.
Preferably, the parameters of the deep convolutional neural network model training are set as follows:
the number of iterations of the deep convolutional neural network is 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the deep convolutional neural network learning rate is 1.0;
deep convolutional neural network learning rate updating strategy: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
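Under these settings, the training setup could be expressed as follows in PyTorch; `model` is a stand-in for the full recognition network, and the training-step body is elided.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the full recognition network

TOTAL_ITERS = 1_000_000
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
# reduce the learning rate to one tenth at 50% and 75% of the total iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[TOTAL_ITERS // 2, TOTAL_ITERS * 3 // 4], gamma=0.1)

for it in range(TOTAL_ITERS):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad()
    scheduler.step()
```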
Preferably, the specific method for the character recognition in S3 is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t.
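A minimal PyTorch sketch of equations (1)-(3) follows; the class name, dimensions and the use of nn.GRUCell and nn.Embedding are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: context c_t from the attention map and the feature
    map (eq. 1), one GRU step (eq. 3), and the output projection (eq. 2)."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)  # e_{t-1}
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)      # W_o, b_o

    def forward(self, feat, alpha_t, y_prev, h_prev):
        # feat: (B, C, H, W); alpha_t: (B, 1, H, W) attention map at step t
        c_t = (alpha_t * feat).sum(dim=(2, 3))             # equation (1)
        h_t = self.gru(torch.cat([self.embed(y_prev), c_t], dim=1),
                       h_prev)                             # equation (3)
        y_t = self.out(h_t)                                # equation (2)
        return y_t, h_t

# equation (4) amounts to cross-entropy against the labels g_t, summed over t:
# loss = sum(nn.functional.cross_entropy(y_t, g_t) for each step t)
```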
A text recognition system based on a decoupling attention mechanism comprises a feature coding module, a convolution alignment module and a text decoding module,
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
and the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
Preferably, the network structure of the deep convolutional neural network unit is an input layer unit, a convolutional layer unit and a residual layer unit;
preferably, the residual layer unit is divided into a first convolution layer unit, a first batch of normalization layer units, a first nonlinear layer unit, a second convolution layer unit, a second batch of normalization layer units, a down-sampling layer unit and a second nonlinear layer unit;
preferably, the nonlinear layer units in the residual layer unit all adopt a ReLU activation function;
preferably, the downsampling layer unit is implemented by the convolutional layer unit and the batch normalization layer unit.
The invention has the technical effects that:
(1) The present invention decouples the conventional attention module. Compared with the traditional attention mechanism, the alignment no longer depends on information fed back from the decoding stage, which avoids the accumulation and propagation of decoding errors and yields higher recognition accuracy.
(2) The method is simple to use, can easily be embedded into other models, is very flexible, and switches freely between one-dimensional and two-dimensional text.
(3) The back-propagation algorithm adjusts the convolution kernel parameters automatically, producing more robust filters that adapt to a variety of complex environments.
(4) Compared with manual transcription, the invention recognizes scene text and handwritten text automatically, saving labor and material resources.
(5) Through the decoupled attention algorithm, the invention provides more reliable alignment for the attention mechanism and is notably more robust than the traditional attention mechanism on long text.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of the deep convolutional network recognition model structure of the present invention.
FIG. 2 is a flow chart of a text recognition method based on a decoupling attention mechanism according to the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Example 1: as shown in fig. 1, the text recognition system based on the decoupling attention mechanism includes a feature encoding module, a convolution alignment module and a text decoding module;
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
and the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
As shown in fig. 2, the text recognition method based on the decoupling attention mechanism specifically includes the following steps:
Firstly, the feature coding module performs feature extraction and encoding on a scene text image and/or a handwritten character image to form a feature map;
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles;
in the scene text image training data the text part is complete and occupies more than two thirds of the image area; the data comprises a variety of font styles and is allowed to cover a certain degree of lighting change and resolution change;
the scene text real evaluation data set is captured with camera equipment such as mobile phones and dedicated hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, a certain degree of tilt and blur is allowed, and the captured scene text images cover application scenes with different font styles;
the handwritten text real training data and real evaluation data are written and collected by different people, so that the training data and the evaluation data are independent;
Secondly, the convolution alignment module performs convolution alignment on the scene text image and/or the handwritten character image; the structure of the convolution alignment module is shown in Table 1:
the image data of the scene text training data set and the scene text real evaluation data set are stretched to a uniform size;
the handwritten text real training data set and the handwritten text real evaluation data set are scaled while keeping the original aspect ratio, and the borders are padded until the sizes are uniform;
TABLE 1
The deep convolutional neural network is constructed and trained as shown in Table 2. The construction method is as follows: based on a convolutional neural network, visual features are extracted from the scene text image and/or the handwritten character image, and multi-scale visual features from the feature coding module are taken as input. Convolution and deconvolution are performed with a fully convolutional neural network; in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage. The convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution. The number of output channels of the last deconvolution layer is maxT, whose value depends on the text type: 25 for scene text and 150 for handwritten text. The last nonlinear layer uses a Sigmoid function to keep the output attention maps between 0 and 1. The deep neural network model is trained with the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
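The following is a simplified PyTorch sketch of this convolution alignment module (down-sampling convolutions, up-sampling deconvolutions with additive skip connections, ReLU everywhere except a final Sigmoid over maxT channels); the depth and channel counts are assumptions, since the contents of Tables 1 and 2 are not reproduced in this text.

```python
import torch
import torch.nn as nn

class ConvAlignmentModule(nn.Module):
    """Two down-sampling convolutions, two up-sampling deconvolutions with
    additive skips from the matching conv stage; the last deconvolution
    outputs maxT attention channels squashed to (0, 1) by Sigmoid."""
    def __init__(self, in_ch=512, mid_ch=128, max_t=25):  # max_t=150 for handwriting
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, 2, 1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(mid_ch, mid_ch, 4, 2, 1)
        self.relu = nn.ReLU()
        self.up2 = nn.ConvTranspose2d(mid_ch, max_t, 4, 2, 1)  # maxT channels

    def forward(self, feat):
        d1 = self.down1(feat)              # convolution = down-sampling
        d2 = self.down2(d1)
        u1 = self.relu(self.up1(d2) + d1)  # add the matching conv feature map
        attn = torch.sigmoid(self.up2(u1)) # attention maps kept between 0 and 1
        return attn                        # (B, maxT, H, W)
```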
TABLE 2
TABLE 3
As shown in Table 3, the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer;
the nonlinear layers in the residual layer all use the ReLU activation function;
the down-sampling layer is realized by a convolutional layer and a batch normalization layer;
the deep neural network model training strategy adopts a supervision mode: training a universal deep network recognition model by using text image data and corresponding labeling information;
the input image of the deep neural network model is a handwritten text image and/or a scene text image, and the input image is output as a character sequence in the text image and/or the scene text image;
the parameters of the deep neural network model training are set as follows:
the number of iterations of the deep neural network is 1,000,000;
the deep neural network optimizer is Adadelta;
the deep neural network learning rate is 1.0;
deep neural network learning rate updating strategy: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
Thirdly, character recognition is performed on the feature map and the attention map by the character recognition module: with the feature map and the attention map as input, the image is accurately recognized by the deep network recognition model based on the decoupling attention mechanism;
The specific method for character recognition is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t;
a text image is input, and the deep network recognition model based on the decoupling attention mechanism recognizes it accurately to obtain the characters in the text image.
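Tying the three modules together, one recognition pass might look as follows; this reuses the sketches above, and the greedy argmax decoding loop, the start symbol and the per-step attention indexing are assumptions for illustration.

```python
import torch

def recognize(image, encoder, align, decoder, max_t=25, sos=0):
    """One forward pass: feature coding, convolution alignment, text decoding."""
    feat = encoder(image)                       # feature coding module
    attn = align(feat)                          # convolution alignment module
    B = image.size(0)
    y_prev = torch.full((B,), sos, dtype=torch.long)
    h = torch.zeros(B, decoder.gru.hidden_size)
    chars = []
    for t in range(max_t):                      # text decoding module
        logits, h = decoder(feat, attn[:, t:t+1], y_prev, h)
        y_prev = logits.argmax(dim=1)           # greedy choice per step
        chars.append(y_prev)
    return torch.stack(chars, dim=1)            # (B, maxT) character indices
```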
The above-described embodiments merely illustrate the preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from its spirit shall fall within the protection scope defined by the claims.
Claims (10)
1. A text recognition method based on a decoupling attention mechanism is characterized by comprising the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map;
S2, aligning the feature map to obtain a target image, constructing a deep convolutional neural network model, and processing the target image with this model to obtain an attention map and to train the model;
and S3, performing accurate character recognition on the feature map and the attention map with the deep convolutional neural network recognition model.
2. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the text image is a scene text image and/or a handwritten character image;
the scene text image and/or the handwritten character image are characterized as follows:
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles.
3. The text recognition method based on the decoupling attention mechanism as claimed in claim 2, wherein:
in the scene text training data set, the text part is complete and occupies more than two thirds of the image area; the data set comprises a variety of font styles and is allowed to cover lighting changes and resolution changes;
the scene text real evaluation data set is captured with mobile phones and dedicated camera hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, tilt and blur are allowed, and the captured scene text images cover application scenes with a variety of font styles;
the handwritten text real training data set and the handwritten text real evaluation data set are written and collected by different people, so that the training data and the evaluation data are independent.
4. The text recognition method based on the decoupling attention mechanism as claimed in claim 2, wherein:
the text image alignment processing method comprises the following steps:
stretching the image data of the scene text training data set and the scene text real evaluation data set to a uniform size;
and scaling the handwritten text real training data set and the handwritten text real evaluation data set while keeping the original aspect ratio, and padding the borders until the sizes are uniform.
5. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
in S2, the method for constructing the deep convolutional neural network includes:
extracting multi-scale visual features based on the feature encoding;
performing convolution and deconvolution with a fully convolutional neural network to construct the deep convolutional neural network model;
in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage;
the convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution;
the network structure of the deep convolutional neural network model consists of an input layer, convolutional layers and residual layers;
the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer.
6. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the training of the deep convolutional neural network model in S2 uses the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
the training strategy of the deep convolutional neural network model is supervised: a general deep network recognition model is trained with text image data and the corresponding label information;
and the input image of the deep convolutional neural network model is the handwritten text image and/or the scene text image, and the output is the character sequence in the text image and/or the scene text image.
7. The text recognition method based on the decoupling attention mechanism as claimed in claim 6, wherein:
the parameters of the deep convolutional neural network model training are set as follows:
the number of iterations of the deep convolutional neural network is 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the deep convolutional neural network learning rate is 1.0;
the deep convolutional neural network learning rate updating strategy is as follows: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
8. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the specific method for the character recognition in S3 is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t.
9. A text recognition system based on a decoupling attention mechanism is characterized by comprising a feature coding module, a convolution alignment module and a text decoding module,
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
10. The system for text recognition based on a decoupled attention mechanism of claim 9,
the network structure of the deep convolutional neural network unit comprises an input layer unit, a convolutional layer unit and a residual layer unit;
the residual layer unit comprises a first convolutional layer unit, a first batch normalization layer unit, a first nonlinear layer unit, a second convolutional layer unit, a second batch normalization layer unit, a down-sampling layer unit and a second nonlinear layer unit;
the nonlinear layer units in the residual layer units all adopt ReLU activation functions;
the down-sampling layer unit is realized by the convolution layer unit and the batch normalization layer unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841738.6A CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841738.6A CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111967470A true CN111967470A (en) | 2020-11-20 |
Family
ID=73387925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010841738.6A Pending CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111967470A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580738A (en) * | 2020-12-25 | 2021-03-30 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112597925A (en) * | 2020-12-28 | 2021-04-02 | 作业帮教育科技(北京)有限公司 | Handwritten handwriting recognition/extraction and erasing method, handwritten handwriting erasing system and electronic equipment |
CN112686345A (en) * | 2020-12-31 | 2021-04-20 | 江南大学 | Off-line English handwriting recognition method based on attention mechanism |
CN112686219A (en) * | 2021-03-11 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Handwritten text recognition method and computer storage medium |
CN112733830A (en) * | 2020-12-31 | 2021-04-30 | 上海芯翌智能科技有限公司 | Shop signboard identification method and device, storage medium and computer equipment |
CN113052175A (en) * | 2021-03-26 | 2021-06-29 | 北京百度网讯科技有限公司 | Target detection method and device, electronic equipment and readable storage medium |
CN113065550A (en) * | 2021-03-12 | 2021-07-02 | 国网河北省电力有限公司 | Text recognition method based on self-attention mechanism |
CN113158776A (en) * | 2021-03-08 | 2021-07-23 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN113240056A (en) * | 2021-07-12 | 2021-08-10 | 北京百度网讯科技有限公司 | Multi-mode data joint learning model training method and device |
CN113705730A (en) * | 2021-09-24 | 2021-11-26 | 江苏城乡建设职业学院 | Handwriting equation image recognition method based on convolution attention and label sampling |
CN113807340A (en) * | 2021-09-07 | 2021-12-17 | 南京信息工程大学 | Method for recognizing irregular natural scene text based on attention mechanism |
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
RU2768211C1 (en) * | 2020-11-23 | 2022-03-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Optical character recognition by means of combination of neural network models |
CN114548067A (en) * | 2022-01-14 | 2022-05-27 | 哈尔滨工业大学(深圳) | Multi-modal named entity recognition method based on template and related equipment |
CN117934974A (en) * | 2024-03-21 | 2024-04-26 | 中国科学技术大学 | Scene text task processing method, system, equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717336A (en) * | 2019-09-23 | 2020-01-21 | 华南理工大学 | Scene text recognition method based on semantic relevance prediction and attention decoding |
- 2020-08-20: application CN202010841738.6A filed in China; published as CN111967470A, status pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717336A (en) * | 2019-09-23 | 2020-01-21 | 华南理工大学 | Scene text recognition method based on semantic relevance prediction and attention decoding |
Non-Patent Citations (1)
Title |
---|
Wang Tianwei et al.: "Decoupled Attention Network for Text Recognition", 34th AAAI Conference on Artificial Intelligence, pages 12216-12224 *
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2768211C1 (en) * | 2020-11-23 | 2022-03-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Optical character recognition by means of combination of neural network models |
US11568140B2 (en) | 2020-11-23 | 2023-01-31 | Abbyy Development Inc. | Optical character recognition using a combination of neural network models |
CN112580738A (en) * | 2020-12-25 | 2021-03-30 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112580738B (en) * | 2020-12-25 | 2021-07-23 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112597925A (en) * | 2020-12-28 | 2021-04-02 | 作业帮教育科技(北京)有限公司 | Handwritten handwriting recognition/extraction and erasing method, handwritten handwriting erasing system and electronic equipment |
CN112597925B (en) * | 2020-12-28 | 2023-08-29 | 北京百舸飞驰科技有限公司 | Handwriting recognition/extraction and erasure method, handwriting recognition/extraction and erasure system and electronic equipment |
CN112686345A (en) * | 2020-12-31 | 2021-04-20 | 江南大学 | Off-line English handwriting recognition method based on attention mechanism |
CN112733830A (en) * | 2020-12-31 | 2021-04-30 | 上海芯翌智能科技有限公司 | Shop signboard identification method and device, storage medium and computer equipment |
CN112686345B (en) * | 2020-12-31 | 2024-03-15 | 江南大学 | Offline English handwriting recognition method based on attention mechanism |
CN113158776A (en) * | 2021-03-08 | 2021-07-23 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN113158776B (en) * | 2021-03-08 | 2022-11-11 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN112686219A (en) * | 2021-03-11 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Handwritten text recognition method and computer storage medium |
CN113065550A (en) * | 2021-03-12 | 2021-07-02 | 国网河北省电力有限公司 | Text recognition method based on self-attention mechanism |
CN113052175B (en) * | 2021-03-26 | 2024-03-29 | 北京百度网讯科技有限公司 | Target detection method, target detection device, electronic equipment and readable storage medium |
CN113052175A (en) * | 2021-03-26 | 2021-06-29 | 北京百度网讯科技有限公司 | Target detection method and device, electronic equipment and readable storage medium |
CN113240056A (en) * | 2021-07-12 | 2021-08-10 | 北京百度网讯科技有限公司 | Multi-mode data joint learning model training method and device |
CN113807340A (en) * | 2021-09-07 | 2021-12-17 | 南京信息工程大学 | Method for recognizing irregular natural scene text based on attention mechanism |
CN113807340B (en) * | 2021-09-07 | 2024-03-15 | 南京信息工程大学 | Attention mechanism-based irregular natural scene text recognition method |
CN113705730A (en) * | 2021-09-24 | 2021-11-26 | 江苏城乡建设职业学院 | Handwriting equation image recognition method based on convolution attention and label sampling |
CN114548067B (en) * | 2022-01-14 | 2023-04-18 | 哈尔滨工业大学(深圳) | Template-based multi-modal named entity recognition method and related equipment |
CN114548067A (en) * | 2022-01-14 | 2022-05-27 | 哈尔滨工业大学(深圳) | Multi-modal named entity recognition method based on template and related equipment |
CN114170468B (en) * | 2022-02-14 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN117934974A (en) * | 2024-03-21 | 2024-04-26 | 中国科学技术大学 | Scene text task processing method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111967470A (en) | Text recognition method and system based on decoupling attention mechanism | |
CN109543667B (en) | Text recognition method based on attention mechanism | |
CN109558832B (en) | Human body posture detection method, device, equipment and storage medium | |
CN109726657B (en) | Deep learning scene text sequence recognition method | |
CN112733822B (en) | End-to-end text detection and identification method | |
CN114187450A (en) | Remote sensing image semantic segmentation method based on deep learning | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN114596500B (en) | Remote sensing image semantic segmentation method based on channel-space attention and DeeplabV plus | |
CN108898138A (en) | Scene text recognition methods based on deep learning | |
CN109635726B (en) | Landslide identification method based on combination of symmetric deep network and multi-scale pooling | |
CN111428727B (en) | Natural scene text recognition method based on sequence transformation correction and attention mechanism | |
CN109934272B (en) | Image matching method based on full convolution network | |
CN111310766A (en) | License plate identification method based on coding and decoding and two-dimensional attention mechanism | |
CN111639564A (en) | Video pedestrian re-identification method based on multi-attention heterogeneous network | |
CN113011288A (en) | Mask RCNN algorithm-based remote sensing building detection method | |
CN110969089A (en) | Lightweight face recognition system and recognition method under noise environment | |
CN111079514A (en) | Face recognition method based on CLBP and convolutional neural network | |
CN111985332A (en) | Gait recognition method for improving loss function based on deep learning | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN114581905A (en) | Scene text recognition method and system based on semantic enhancement mechanism | |
CN117079288B (en) | Method and model for extracting key information for recognizing Chinese semantics in scene | |
AU2021104479A4 (en) | Text recognition method and system based on decoupled attention mechanism | |
CN116758621A (en) | Self-attention mechanism-based face expression depth convolution identification method for shielding people | |
CN117058437A (en) | Flower classification method, system, equipment and medium based on knowledge distillation | |
CN116630610A (en) | ROI region extraction method based on semantic segmentation model and conditional random field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201120 |