CN115116066A - Scene text recognition method based on character distance perception - Google Patents

Scene text recognition method based on character distance perception

Info

Publication number
CN115116066A
Authority
CN
China
Prior art keywords
character
semantic
character position
visual
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210689812.6A
Other languages
Chinese (zh)
Inventor
陈智能
郑天伦
姜育刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202210689812.6A priority Critical patent/CN115116066A/en
Publication of CN115116066A publication Critical patent/CN115116066A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/158Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Abstract

The invention belongs to the technical field of image text recognition, and particularly relates to a scene text recognition method based on character distance perception. The method performs scene text recognition by combining information from three domains: vision, semantics and character position. It first encodes the semantic, visual and character position features of the text to be recognized simultaneously, then iteratively decodes and fuses them using self-attention and cross-attention, strengthening the character position features and fusing character semantic information and visual information into them in parallel, so that the character position features carry more accurate content-aware embeddings and can describe character distances in the semantic and visual spaces. Compared with mainstream recognition methods in current industry, the method models character distance more accurately and gains a recognition accuracy advantage on challenging datasets with large variation in character spacing.

Description

Scene text recognition method based on character distance perception
Technical Field
The invention belongs to the technical field of scene text detection and recognition, and particularly relates to a scene text recognition method based on character distance perception.
Background
Scene text recognition is one of the important research directions in computer vision; it aims to recognize the character content of text in images captured in natural scenes. Because text reflects the high-level linguistic information contained in an image, it helps people better understand the image content. Scene text recognition technology is now widely applied in scenarios such as street-view image recognition and understanding and bill recognition, and the related techniques have also received close attention from well-known internet companies such as Baidu, Microsoft and Tencent.
Scene text recognition has been studied extensively, and many strong recognition methods have been developed. When characters are regular and the background is relatively simple, current recognition algorithms already achieve fairly high accuracy. Research therefore focuses mainly on recognition in more complex scenarios, including but not limited to: low resolution, blurred characters, complex backgrounds, partial occlusion, curved text, irregular text and artistic fonts. To address these problems, research directions include designing more refined recognition network structures, extracting more effective visual features, making better use of the contextual semantic relations of text, and exploiting vision-text pre-trained models. In terms of exploiting contextual semantic relations and character position relations in particular, scene text recognition has mainly gone through the following development stages.
Document [1] extracts visual features of an image with a convolutional neural network and recombines the features into a sequence of feature vectors. On this basis, it introduces a bidirectional long short-term memory network (BiLSTM) for the first time to capture contextual semantic cues in the feature sequence. To address the misalignment between sequence blocks and text characters, a decoding scheme based on Connectionist Temporal Classification (CTC) is introduced, giving the method both good recognition performance and high recognition speed. This technique has attracted wide attention in industry and has been adopted by a number of commercial systems. However, this work does not exploit the contextual semantic relations or the character position relations of the text to be recognized.
Document [2] applies the attention mechanism from natural language processing to the scene text recognition task for the first time. With an attention-based encoder-decoder framework, it constrains decoding with contextual information and semantically aligns text with visual blocks. Although iterative decoding slows recognition down, semantic association is genuinely achieved in the decoding stage.
Document [3] applies the Transformer's dot-product attention from natural language processing to the scene text recognition task for the first time. Its higher-level global attention mechanism captures better semantic association of character context and achieves highly competitive results on public scene text recognition datasets.
Document [4] designs a parallel attention mechanism. To address the problem that character position information carries no content, the decoder is forced to learn contextual semantic associations through a cloze-style (fill-in-the-blank) formulation, and a real text corpus is introduced to train the decoder, further improving scene text recognition accuracy on public datasets.
Document [5] proposes a dual-branch decoder with semantic and character position branches to alleviate the attention drift problem and reduce the interference of semantic information with character recognition results. The added position branch improves character localization within the visual encoding through hard-coded sequence positions, reducing the influence of strong semantic information. However, the dual-branch design has no interaction between position and semantic information during decoding, and the hard-coded form cannot learn changes in character distance.
Scene text recognition differs from a common classification task: to guarantee recognition accuracy, deeper visual feature information must be extracted and the semantic relations of natural language must be exploited. Since the character distance in irregular text varies greatly, and document [5] shows that relying purely on semantic information easily causes attention drift, how to build an accurate character distance perception module that uses character semantics and vision to decode text characters better is a current technical challenge. Most existing methods extract information by combining vision and semantics; although a few methods use character positions, hard-coded character position features without content easily cause a single time step to attend to multiple characters, leading to information loss. A text recognition method that jointly exploits vision, semantics and character position therefore still has substantial room for improvement.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a content-embedded text recognition method based on a more refined mechanism for the interaction and fusion of visual, semantic and character position information, namely a scene text recognition method based on accurate character distance perception. Under an attention-based encoder-decoder framework, the method perceives constantly changing character distances more robustly, perceives character distances in complex scenes more accurately, and alleviates the problems of attention drift and multi-character attention at a single time step, thereby achieving a better text recognition effect.
The invention combines information from the three domains of vision, semantics and character position for recognition. It first encodes the semantic, visual and character position features of the text to be recognized simultaneously, then iteratively decodes and fuses them using self-attention and cross-attention to strengthen the character position features, fusing character semantic information and visual information into the character position features in parallel, so that the character position features carry more accurate content-aware embeddings and can describe character distances in the semantic and visual spaces. The technical scheme of the invention is as follows.
A scene text recognition method based on character distance perception adopts an encoder-decoder framework based on an attention mechanism to carry out scene text recognition;
at the encoder end, constructing visual characteristic representation of visual branches, semantic characteristic representation of semantic branches and character position characteristic representation of character position branches;
at the decoder end, character position feature enhancement is performed first, i.e. the character position features of the character position branch are enhanced with a self-attention mechanism; double-domain cross enhancement is then performed, i.e. using cross-attention decoding, the character position features are input as queries to the visual branch and the semantic branch respectively to enhance the visual features and the semantic features; dynamic sharing fusion is then performed, i.e. the enhanced visual and semantic features are fused into the character position features to generate new character position features that depict character content more accurately;
the process of character position feature enhancement, double-domain cross enhancement and dynamic sharing fusion is repeated several times, gradually producing character position features that better fuse the visual and semantic features; character prediction at the current time step is then performed based on these character position features; the semantic features and character position features at the encoder end are updated according to the current recognition result, the decoder-end operations are executed again, and the character at the next time step is recognized;
the above processes are repeated until the input text to be recognized is recognized.
In the invention, at the encoder end, the three-branch characteristic representation of the visual branch, the semantic branch and the character position branch is specifically as follows:
the visual branch first adopts a key point correction network TPS to rectify the text image; a convolutional neural network ResNet-50 then extracts visual features from the rectified image, downsampling the feature sequence at stages 1, 3 and 5 of the network; the generated feature map is then reordered and a Transformer encoder extracts features, further capturing the global context among different features and generating enhanced visual features; preferably, the numbers of convolution modules in the five stages of the ResNet-50 network are configured as 3, 4, 6, 6 and 3, and in the Transformer encoder the hidden layer is set to 1024;
during training, the semantic branch represents the character sequence label corresponding to the text image as a sequence of word embedding vectors, where each vector corresponds to one character token and the sequence length equals the label length; this sequence serves as the semantic features and is input to a Transformer decoder; in each training round, to prevent information leakage, the Transformer decoder may only see the semantic features of the current character to be decoded and of the characters before it; the decoding training process can be performed in parallel; during testing, the semantic features are modeled in the same way, with the difference that they are extracted from the characters predicted by iterative decoding rather than from the character labels, so serial iterative decoding is adopted;
the character position branch first applies sine-cosine position coding to hard-code the position sequence blocks, then maps the hard-coded position information into denser character position codes through a two-layer perceptron with learnable parameters, which gains the ability to perceive character positions through back-propagation; the number of character position sequence blocks is determined by the length of the semantic vector sequence: in the training stage it equals the length of the semantic label, and in the testing stage it equals the length of the characters decoded so far at the current step.
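For illustration only, the character position branch described above can be sketched in PyTorch roughly as follows; the hidden size d_model, the maximum length and the module name are assumptions for this sketch, not values fixed by the invention.

```python
import math
import torch
import torch.nn as nn

class CharPositionEncoder(nn.Module):
    """Sinusoidal hard coding of the position sequence blocks, refined by a
    two-layer learnable perceptron, as in the character position branch.
    d_model and max_len are illustrative assumptions."""
    def __init__(self, d_model=512, max_len=64):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)          # fixed (hard-coded) part
        self.mlp = nn.Sequential(               # learnable two-layer refinement
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, num_blocks: int) -> torch.Tensor:
        # num_blocks = label length (training) or decoded length (testing)
        return self.mlp(self.pe[:num_blocks])   # (num_blocks, d_model)
```

During training num_blocks would equal the label length, while during testing it grows with the number of characters decoded so far, matching the description above.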
In the invention, at the decoder end, the character position characteristic enhancement process specifically comprises the following steps:
the character position features of the character position branch are enhanced through a dot-product self-attention module, and an upper-triangular mask matrix is applied so that information is not leaked.
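A minimal sketch of this self-attention enhancement, assuming batch-first tensors and a hypothetical hidden size of 512; the boolean upper-triangular mask blocks each time step from attending to later steps.

```python
import torch
import torch.nn as nn

def upper_triangular_mask(t: int) -> torch.Tensor:
    # True entries are positions a query may NOT attend to (future steps)
    return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

class PositionSelfAttention(nn.Module):
    """Dot-product self-attention over the character position features."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, pos_feat: torch.Tensor) -> torch.Tensor:  # (B, T, d_model)
        t = pos_feat.size(1)
        mask = upper_triangular_mask(t).to(pos_feat.device)
        out, _ = self.attn(pos_feat, pos_feat, pos_feat, attn_mask=mask)
        return self.norm(pos_feat + out)         # residual enhancement
```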
In the invention, at the decoder end, the double-domain cross fusion process specifically comprises the following steps:
step 1, character position-visual cross attention: the character position features serve as the Query vectors and the visual features as the Key and Value vectors for cross-attention computation; after the cross attention is computed, a multilayer perceptron further transforms the features to obtain the enhanced visual features;
step 2, character position-semantic cross attention: the character position features serve as the Query vectors and the semantic features as the Key and Value vectors for cross-attention computation; during the computation an upper-triangular mask is applied to prevent semantic information after the current time step from leaking into the calculation, and after the cross attention a multilayer perceptron transforms the features to obtain the enhanced semantic features.
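The two cross-attention steps can be sketched as one module as below; the structure, the feed-forward widths and the names are assumptions for illustration, not the exact implementation of the invention.

```python
import torch
import torch.nn as nn

class DualDomainCrossAttention(nn.Module):
    """Character position features query the visual branch (no mask) and the
    semantic branch (upper-triangular mask), each followed by an MLP."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.vis_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
        self.sem_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))

    def forward(self, pos_feat, vis_feat, sem_feat):
        # step 1: character position-visual cross attention (no mask needed)
        v, _ = self.vis_attn(pos_feat, vis_feat, vis_feat)
        v = self.vis_mlp(v)
        # step 2: character position-semantic cross attention with causal mask
        t, s = pos_feat.size(1), sem_feat.size(1)
        mask = torch.triu(torch.ones(t, s, dtype=torch.bool,
                                     device=pos_feat.device), diagonal=1)
        m, _ = self.sem_attn(pos_feat, sem_feat, sem_feat, attn_mask=mask)
        return v, self.sem_mlp(m)
```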
In the invention, at the decoder end, the dynamic sharing fusion process specifically comprises the following steps:
step 1: based on the enhanced visual features and semantic features computed in the previous stage, the two feature types are spliced along the feature channel dimension and a fully connected layer performs dimension reduction, so that the result matches the dimension of the visual and semantic features;
step 2: the dimension-reduced feature is passed through a sigmoid function to obtain the attention score Atten; under this gated attention mechanism, following the form of cross entropy, Atten and (1 - Atten) are multiplied element-wise with the enhanced visual and semantic feature vectors respectively and then added to form the fused feature:

F_fuse = Atten · F_vis + (1 - Atten) · F_sem

where F_fuse denotes the fused feature, F_vis and F_sem denote the visual and semantic features enhanced in step 1, and Atten denotes the attention score;
step 3: the parameters used to compute the attention score Atten are shared in subsequent dynamic sharing fusion calculations, which saves parameters while providing more effective dynamic sharing fusion.
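A sketch of the gated fusion defined by the formula above; one instance of this module would be reused across the stacked decoder blocks so that the gate parameters are shared (the feature size is an assumption).

```python
import torch
import torch.nn as nn

class DynamicSharedFusion(nn.Module):
    """Gated fusion: Atten * visual + (1 - Atten) * semantic."""
    def __init__(self, d_model=512):
        super().__init__()
        self.reduce = nn.Linear(2 * d_model, d_model)   # concat then reduce

    def forward(self, vis_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        atten = torch.sigmoid(self.reduce(torch.cat([vis_feat, sem_feat], dim=-1)))
        return atten * vis_feat + (1.0 - atten) * sem_feat
```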
In the invention, at the decoder end, the process of "character position feature enhancement-double-domain cross enhancement-dynamic sharing fusion" is repeated 3 times.
In the invention, at the decoder end, when character prediction at the current time step is performed based on the character position features that better fuse the visual and semantic features, the obtained character position features are input to a linear classifier, which computes the predicted character of the current time step. Preferably, the number of character classes K is set to 36, according to the number of commonly occurring characters in pictures.
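For illustration, the prediction head can be sketched as a single linear classifier over K = 36 character classes; the feature size of 512 and the helper name are assumptions of this sketch.

```python
import torch.nn as nn

K = 36                                   # preferred number of character classes
classifier = nn.Linear(512, K)           # 512 is an assumed feature dimension

def predict_current_step(pos_feat_t):
    # pos_feat_t: (batch, 512) character position feature at the current step
    return classifier(pos_feat_t).argmax(dim=-1)
```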
The invention also provides a device for the scene text recognition method based on character distance perception; the device consists of an image input module, a text recognition module and a result display module. The image input module receives the text image to be recognized; the text recognition module performs text image recognition by encapsulating the above scene text recognition method based on character distance perception; and the result display module outputs and displays the recognition result.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention adopts the idea of multi-domain fusion: time-step decoding is established on the character position branch, which progressively queries attention features on the semantic and visual branches and performs dynamic sharing fusion, yielding a text recognizer with character distance perception capability. Unlike traditional semantic character attention methods, the invention models the decoding work on the character position branch to alleviate the attention drift problem of semantic attention decoding. Unlike other character position attention coding methods, the character position branch dynamically obtains parallel content embeddings from the semantic and visual branches and iteratively fuses the attention information, which alleviates the problem of attending to multiple characters at the same time step caused by character position attention. This makes the method adaptable to a variety of scenarios with widely varying character distances;
(2) The invention provides a multi-domain attention decoder based on vision, semantics and character position, which performs character position modeling with visual and semantic distance perception and generates more effective visual-semantic alignment;
(3) The invention develops a novel text recognition method based on character distance perception, which prepares three-branch inputs of character position, vision and semantics and feeds them into a decoder with multi-domain attention; it can effectively exploit semantic, visual and character position information and better model the visual and semantic distances among the characters of a text image, thereby better recognizing difficult text with large character-spacing variation and high recognition difficulty and obtaining a recognition accuracy advantage.
Drawings
Fig. 1 is a flowchart of long text recognition based on character distance perception proposed by the present invention.
Detailed Description
The technical solution of the present invention will be specifically described below with reference to the accompanying drawings and embodiments.
The invention provides a scene text recognition method based on character distance perception, which adopts an encoder-decoder framework.
The encoder end constructs the input information of three branches: visual, semantic and character position. Through parallel content embedding from the semantic and visual branches, the character position branch undergoes iterative attention information fusion, so that the decoder has more accurate character distance perception capability. In the visual branch, the invention adopts a visual key point correction network TPS and a ResNet-50 that better suits the needs of scene text feature extraction; the downsampled feature information serves as the input of the visual branch, and a Transformer encoder extracts the global context. In the semantic branch, word embeddings of the language labels serve as the semantic input during training, enabling parallel training; during testing, the feature iteration result of each step is used as the input of the semantic branch. The character position branch uses sine-cosine coded position information to encode the positional relation between sequence blocks, dynamically encodes the hard-coded position information with two layers of learnable parameters, and realizes character position perception through back-propagation. The three-branch information is then decoded and output by a content-embedded distance-aware decoder.
The character position, visual and semantic branch input method specifically comprises the following operation steps:
Step 1, visual branch input: key point correction of the picture is performed with the key point correction network TPS, with the grid reference points K set to 12. ResNet-50 is used for image feature extraction of the visual feature blocks, and the feature sequence is downsampled at stages 1, 3 and 5. The residual blocks are configured as 3, 4, 6, 6 and 3 to meet the high semantic-relation requirements of text. A Transformer encoder is used for global semantic association, with the hidden layer set to 1024 (a simplified sketch of this branch is given after step 3).
Step 2, semantic branch input: during training, the label information passes through a word embedding layer to obtain high-dimensional hidden semantic feature information, so that the decoding steps run in parallel during training. In the testing stage, the iteratively output high-dimensional semantic features serve as the input of the semantic branch, requiring serial iterative decoding.
Step 3, character position branch input: sine-cosine position coding hard-codes the positional relation between the sequence blocks, and two layers of learnable parameters then produce more accurate character position information. The number of character position sequence blocks is determined by the length of the semantic vector sequence: in the training stage the label length determines the number of character positions, and in the testing stage the iteratively output length determines the number of position tokens at the current iteration step.
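For illustration, step 1 can be approximated by the following sketch, in which a stock torchvision ResNet-50 stands in for the modified backbone of the invention and TPS rectification is omitted; the projection size, layer counts and class name are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class VisualBranch(nn.Module):
    """CNN trunk + Transformer encoder producing a sequence of visual tokens.
    The invention's backbone downsamples stage-wise and is preceded by TPS;
    this sketch only illustrates the overall flow."""
    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        trunk = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(trunk.children())[:-2])  # drop pool/fc
        self.proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, img: torch.Tensor) -> torch.Tensor:   # (B, 3, 32, 128)
        f = self.cnn(img)                                    # (B, 2048, h, w)
        f = f.flatten(2).transpose(1, 2)                     # reorder to tokens
        return self.encoder(self.proj(f))                    # (B, h*w, d_model)
```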
The decoder mainly consists of three parts: branch enhancement, double-domain cross fusion and dynamic sharing fusion:
(1) Branch enhancement: self-attention enhancement is performed on the features of the character position branch, fusing the visual and semantic content-embedded information of the previous stage in the form of dot-product attention.
(2) Double-domain cross fusion: the character position branch serves as the query vector to query spatial content information on the semantic branch and the visual branch respectively, and channel content information is embedded through a fully connected dimension-raising layer.
(3) Dynamic sharing fusion: character position-semantic and character position-visual information are fused dynamically. A gating mechanism fuses the two feature vectors in a cross-entropy-like form. This module shares parameters among all the stacked blocks.
The branch enhancing step is as follows:
Step 1: the character position branch passes through a dot-product self-attention module. As the blocks are stacked, semantic and visual branch information is embedded into the content of the character position branch, so a single branch can perform attention enhancement over the embedded content of all three branches. During attention, if no mask or only a single-character mask is used, character information of later time steps leaks into earlier time steps. Therefore, unlike ordinary self-attention, character position branch enhancement requires an upper-triangular mask to ensure that information is not leaked.
The above two-domain cross fusion step is specifically as follows:
Step 1: the character position branch features serve as the query vectors and the visual features as the Key and Value vectors for a cross-attention mechanism; in this cross attention the visual features do not require time-step decoding, so no mask is needed. After the spatial cross-attention embedding, fully connected layers (MLP) are required for high-dimensional channel embedding of the visual features.
Step 2: the character position branch serves as the query vector and the semantic features as the Key and Value vectors, again with a cross-attention mechanism; an upper-triangular mask prevents semantic information of later time steps from leaking. A fully connected layer (MLP) is still used for high-dimensional channel embedding of the semantic features.
The dynamic sharing fusion specifically comprises the following operation steps:
step 1: character position-visual and character position-semantic feature vectors are generated at the previous stage. And performing dimensionality splicing by adopting a Concat splicing mode, and performing dimensionality reduction operation through a full connection layer.
Step 2: the attention score Atten is obtained through a sigmoid function; under the gated attention mechanism, following the form of cross entropy, Atten and (1 - Atten) are multiplied with the two feature vectors respectively to obtain the final fused score.
Step 3: the fully connected layer of the dynamic fusion module shares its parameters across all stacked modules, saving parameters while providing more effective double-domain feature fusion.
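Putting the three parts together, a stacked decoder can be sketched as below. It reuses the PositionSelfAttention, DualDomainCrossAttention and DynamicSharedFusion sketches given earlier and instantiates the fusion gate once so its parameters are shared across blocks; the depth of 3 follows the text, everything else is an assumption.

```python
import torch.nn as nn

class DistanceAwareDecoder(nn.Module):
    """Stack of (branch enhancement -> double-domain cross fusion ->
    dynamic sharing fusion) blocks refining the character position features."""
    def __init__(self, d_model=512, num_blocks=3):
        super().__init__()
        self.self_attn = nn.ModuleList(
            [PositionSelfAttention(d_model) for _ in range(num_blocks)])
        self.cross = nn.ModuleList(
            [DualDomainCrossAttention(d_model) for _ in range(num_blocks)])
        self.fusion = DynamicSharedFusion(d_model)   # single shared gate

    def forward(self, pos_feat, vis_feat, sem_feat):
        for sa, ca in zip(self.self_attn, self.cross):
            pos_feat = sa(pos_feat)                   # branch enhancement
            v, s = ca(pos_feat, vis_feat, sem_feat)   # double-domain cross fusion
            pos_feat = self.fusion(v, s)              # dynamic sharing fusion
        return pos_feat
```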
Fig. 1 is a flowchart of long text recognition based on character distance sensing proposed by the present invention; the method comprises the following steps:
s1, extracting visual features of the text image to be recognized and semantic and character position features at the current time step;
s2, enhancing the character position characteristics based on the self-attention mechanism;
s3, taking the enhanced character position features as query, respectively calculating cross attention with the visual and semantic features to obtain the enhanced visual and semantic features;
s4, fusing the enhanced visual and semantic features through dynamic sharing and fusion to generate character position features for accurately depicting character contents in the next stage;
s5, repeating the steps S2-S4 for a plurality of times to gradually generate character position features which better integrate visual and semantic features;
s6, inputting the character position characteristics into a linear classifier to obtain a character prediction result under the current time step;
and S7, generating the semantic and character position features at the next time step, repeating the steps S2-S6, and continuing this process until all characters are recognized.
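Steps S1-S7 correspond to a serial decoding loop of roughly the following form; the module interfaces, the start/end token ids and the maximum length are assumptions of this sketch rather than the exact interface of the invention.

```python
import torch

def recognize(image, visual_branch, pos_encoder, embed, decoder, classifier,
              max_len=25, sos_id=0, eos_id=1):
    """Schematic serial decoding loop (S1-S7) using the sketches above."""
    vis_feat = visual_branch(image)                        # S1: visual features
    tokens = [sos_id]                                      # decoded prefix
    for _ in range(max_len):
        sem_feat = embed(torch.tensor([tokens]))           # semantics of prefix
        pos_feat = pos_encoder(len(tokens)).unsqueeze(0)   # S1: position features
        pos_feat = decoder(pos_feat, vis_feat, sem_feat)   # S2-S5: refine positions
        pred = classifier(pos_feat[:, -1]).argmax(-1).item()  # S6: current char
        if pred == eos_id:
            break
        tokens.append(pred)                                # S7: next time step
    return tokens[1:]
```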
Example 1
Static pictures of a fixed size of 32 x 128 are used as input to the visual branch. Pictures of various sizes are normalized to this size by scaling. The visual branch performs feature extraction through the key point correction network TPS, the convolutional neural network ResNet-50 and a Transformer encoder. The semantic branch performs word embedding over a label vocabulary of 37 classes. The label length provides the hard-coded form of the character position branch, and sine-cosine character position codes are obtained through position embedding; two layers of learnable weights are used to obtain dynamic character positions. The three-branch input then passes through the multi-domain attention module.
The character position branch serves as the query vector for a self-attention module, with the hidden layer set to 1024. After the attention scores are obtained, an upper-triangular mask prevents time-step information leakage when the semantic information is embedded. The character position branch features then serve as the query vector, with the visual features as the Key and Value vectors, for an attention mechanism; under this cross-attention mechanism the visual features do not require time-step decoding, so no mask is needed, but after the spatial cross-attention embedding a multilayer perceptron (MLP) is used for high-dimensional channel embedding of the visual features. The character position branch also serves as the query vector, with the semantic features as the Key and Value vectors, for a cross-attention mechanism, where an upper-triangular mask prevents semantic information of later time steps from leaking; a fully connected layer again performs high-dimensional channel embedding of the semantic features. Character position-semantic and character position-visual information are then fused dynamically: a gating mechanism fuses the two feature vectors in a cross-entropy-like form, and this module shares parameters among all the stacked blocks.
This completes the overall process of content-embedded, character distance-aware text recognition. The Transformer encoder is set to 4 layers, the multi-domain perception module to 2 layers, and TPS samples 12 reference points. The model is trained for 10 rounds with a warmup training strategy.
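For reference, the hyper-parameters stated in this example can be collected as follows; the dictionary keys are labels chosen for this sketch, not terms of the invention.

```python
# Settings taken from Example 1 of the description.
config = dict(
    input_size=(32, 128),          # fixed input resolution of the visual branch
    num_classes=37,                # word-embedding label classes
    transformer_encoder_layers=4,  # visual Transformer encoder depth
    multi_domain_layers=2,         # multi-domain perception module depth
    tps_reference_points=12,       # TPS reference point samples
    attention_hidden_size=1024,    # hidden layer of the attention modules
    epochs=10,                     # trained for 10 rounds with warmup
)
```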
The character distance-aware text recognizer achieves 97.4%, 96.4% and 93.4% accuracy on the public text recognition datasets IC13, IIIT5K and CUTE80 respectively, reaching the highest recognition performance reported so far and clearly exceeding the recognizers of documents [1-5].
References
[1] Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(11): 2298-2304.
[2] Lee C Y, Osindero S. Recursive recurrent nets with attention modeling for OCR in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2231-2239.
[3] Sheng F, Chen Z, Xu B. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition[C]//2019 International Conference on Document Analysis and Recognition. 2019: 781-786.
[4] Fang S, Xie H, Wang Y, et al. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7098-7107.
[5] Yue X, Kuang Z, Lin C, et al. RobustScanner: dynamically enhancing positional clues for robust text recognition[C]//Proceedings of the European Conference on Computer Vision. 2020: 135-151.

Claims (7)

1. A scene text recognition method based on character distance perception is characterized in that an encoder-decoder framework based on an attention mechanism is adopted for scene text recognition;
at an encoder end, constructing visual characteristic representation of visual branches, semantic characteristic representation of semantic branches and character position characteristic representation of character position branches;
at the decoder end, firstly, character position characteristic enhancement is carried out, namely, the character position characteristic of the character position branch is enhanced by using a self-attention mechanism; then, performing double-domain cross enhancement, namely using cross attention decoding, and inputting the character position characteristics as queries into the visual branch and the semantic branch respectively to realize enhancement of the visual characteristics and the semantic characteristics; then, dynamic sharing fusion is carried out, namely the enhanced visual features and semantic features are fused to character position features to generate new character position features which can depict character contents more accurately;
the process of character position feature enhancement, double-domain cross enhancement and dynamic sharing fusion is repeated for a plurality of times, and character position features which better fuse visual and semantic features are gradually generated; then, character prediction of the current time step is carried out based on the character position characteristics which are better fused with the visual and semantic characteristics; updating the semantic features and character position features of the encoder end according to the current recognition condition, re-executing the operation of the decoder end, and recognizing the characters of the next time step;
the above processes are repeated until the input text to be recognized is recognized.
2. The method for recognizing scene text based on character distance perception according to claim 1, wherein at the encoder end, the three-branch feature representation of the visual branch, the semantic branch and the character position branch is specifically as follows:
the visual branch firstly adopts a key point correction network TPS to correct the text image; then, performing visual feature extraction on the corrected image by using a convolutional neural network ResNet-50, and performing downsampling on the feature sequences at stages 1, 3 and 5 of the network respectively; secondly, reordering the generated feature graph, and extracting features by using a Transformer encoder so as to further extract the global context relationship among different features and generate more enhanced visual features;
in the semantic branch and training stage, the label information passes through a word embedding layer to obtain high-dimensional hidden semantic feature information, so that the decoding steps are parallel during training; in the testing stage, the output iterative high-dimensional semantic features are used as the input of semantic branches, so that serial iterative decoding is adopted;
the character position branch firstly adopts sine and cosine position codes to carry out hard coding on a character position sequence block, then maps the hard coded position information into character position codes with denser characteristics through a multilayer perceptron with two layers of learnable parameters, and endows the character position codes with the capability of perceiving character positions through reverse propagation; the number of character position sequence blocks is determined by the length of the semantic vector sequence; in the training stage, the length of the semantic label is equal to the number of character position sequence blocks; and in the testing stage, the length of the decoded character output by iteration is equal to the number of character position sequence blocks in the current step.
3. The method for recognizing scene text based on character distance perception according to claim 1, wherein at a decoder end, the character position feature enhancement process specifically comprises:
the character position features of the character position branch are enhanced through a dot-product self-attention module, and an upper-triangular mask matrix is applied so that information is not leaked.
4. The method for recognizing scene text based on character distance perception according to claim 1, wherein at a decoder side, a double-domain cross fusion process specifically comprises:
step 1: character position-visual cross attention calculation: the character position features serve as the Query vectors and the visual features as the Key and Value vectors for cross-attention computation; after the cross attention is computed, a multilayer perceptron further transforms the features to obtain the enhanced visual features;
step 2: character position-semantic cross attention calculation: the character position features serve as the Query vectors and the semantic features as the Key and Value vectors for cross-attention computation; during the computation an upper-triangular mask is applied to prevent semantic information after the current time step from leaking into the calculation, and after the cross attention a multilayer perceptron transforms the features to obtain the enhanced semantic features.
5. The method for recognizing scene text based on character distance perception according to claim 1, wherein at a decoder side, the dynamic sharing fusion process specifically comprises:
step 1: based on the enhanced visual features and semantic features obtained by the calculation of the last stage, splicing the two types of features on the dimension of a feature channel, and performing feature dimension reduction operation through a full connection layer to make the two types of features consistent with the dimension of the visual and semantic features;
step 2: the dimension-reduced feature is passed through a sigmoid function to obtain the attention score Atten; under this gated attention mechanism, following the form of cross entropy, Atten and (1 - Atten) are multiplied element-wise with the enhanced visual and semantic feature vectors respectively and then added to form the fused feature:

F_fuse = Atten · F_vis + (1 - Atten) · F_sem

where F_fuse denotes the fused feature, F_vis and F_sem denote the visual and semantic features enhanced in step 1, and Atten denotes the attention score;
and step 3: the parameters used to compute the attention score Atten are shared in subsequent dynamic sharing fusion calculations, which saves parameters while providing more effective dynamic sharing fusion.
6. The method of claim 1, wherein the "character position feature enhancement-two-domain cross enhancement-dynamic sharing fusion" process is repeated 3 times at the decoder side.
7. The method for recognizing scene texts based on character distance perception according to claim 1, wherein at a decoder side, when character prediction of a current time step is performed based on character position features better fusing visual and semantic features, the obtained character position features are input to a linear classifier, and predicted characters of the current time step are calculated.
CN202210689812.6A 2022-06-17 2022-06-17 Scene text recognition method based on character distance perception Pending CN115116066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210689812.6A CN115116066A (en) 2022-06-17 2022-06-17 Scene text recognition method based on character distance perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210689812.6A CN115116066A (en) 2022-06-17 2022-06-17 Scene text recognition method based on character distance perception

Publications (1)

Publication Number Publication Date
CN115116066A (en) 2022-09-27

Family

ID=83329107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210689812.6A Pending CN115116066A (en) 2022-06-17 2022-06-17 Scene text recognition method based on character distance perception

Country Status (1)

Country Link
CN (1) CN115116066A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357705A (en) * 2022-10-24 2022-11-18 成都晓多科技有限公司 Method, device and equipment for generating entity attribute in question text and storage medium
CN116343190A (en) * 2023-05-30 2023-06-27 中国科学技术大学 Natural scene character recognition method, system, equipment and storage medium
CN116343190B (en) * 2023-05-30 2023-08-29 中国科学技术大学 Natural scene character recognition method, system, equipment and storage medium
CN116842127A (en) * 2023-08-31 2023-10-03 中国人民解放军海军航空大学 Self-adaptive auxiliary decision-making intelligent method and system based on multi-source dynamic data
CN116842127B (en) * 2023-08-31 2023-12-05 中国人民解放军海军航空大学 Self-adaptive auxiliary decision-making intelligent method and system based on multi-source dynamic data
CN117079288A (en) * 2023-10-19 2023-11-17 华南理工大学 Method and model for extracting key information for recognizing Chinese semantics in scene
CN117079288B (en) * 2023-10-19 2023-12-29 华南理工大学 Method and model for extracting key information for recognizing Chinese semantics in scene
CN117150436A (en) * 2023-10-31 2023-12-01 上海大智慧财汇数据科技有限公司 Multi-mode self-adaptive fusion topic identification method and system
CN117150436B (en) * 2023-10-31 2024-01-30 上海大智慧财汇数据科技有限公司 Multi-mode self-adaptive fusion topic identification method and system
CN117708568A (en) * 2024-02-02 2024-03-15 智慧眼科技股份有限公司 Feature extraction method and device for large language model, computer equipment and medium

Similar Documents

Publication Publication Date Title
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN115116066A (en) Scene text recognition method based on character distance perception
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
JP7345050B2 (en) Contextual grounding of natural language phrases in images
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
WO2021098689A1 (en) Text recognition method for natural scene, storage apparatus, and computer device
Lei et al. Scene text recognition using residual convolutional recurrent neural network
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112016604A (en) Zero-resource machine translation method applying visual information
Cui et al. Representation and correlation enhanced encoder-decoder framework for scene text recognition
Ma et al. PIEED: Position information enhanced encoder-decoder framework for scene text recognition
Xu et al. TransMIN: Transformer-guided multi-interaction network for remote sensing object detection
Selvam et al. A transformer-based framework for scene text recognition
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN115455955A (en) Chinese named entity recognition method based on local and global character representation enhancement
Cai et al. Hcadecoder: A hybrid ctc-attention decoder for chinese text recognition
Zheng et al. Cmfn: Cross-modal fusion network for irregular scene text recognition
Bai et al. Parallel global convolutional network for semantic image segmentation
Gaonkar et al. Language Linguist using Image Processing on Intelligent Transport Systems
CN115587160B (en) Phrase-level text image generation method and system based on self-attention mechanism
Zhi et al. A Feature Refinement Patch Embedding-Based Recognition Method for Printed Tibetan Cursive Script
CN118038497A (en) SAM-based text information driven pedestrian retrieval method and system
Peng et al. TMCR: A Twin Matching Networks for Chinese Scene Text Retrieval
Fanjie et al. SUST and RUST: Two Datasets for Uyghur Scene Text Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination