CN114241470A - Natural scene character detection method based on attention mechanism - Google Patents
- Publication number
- CN114241470A (application CN202111603367.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- scene
- feature
- convolution
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Character Discrimination (AREA)
Abstract
A natural scene character detection method based on an attention mechanism comprises: designing a convolutional neural network model for extracting text targets according to the feature information of text center blocks and stroke areas, and training the model with text center block and stroke information as supervision data; in the model testing stage, inputting the test image respectively into a text center block model and a stroke model to obtain probability maps of the text center block and the word stroke area; obtaining the final text area through reasoning and marking it, thereby completing the scene image character detection task. The method solves problems of prior-art scene character detection such as the difficulty of regressing the direction information of curved text, adhesion between closely spaced text lines, and information redundancy caused by multi-level feature integration.
Description
Technical Field
The invention relates to the field of computer analysis, in particular to a natural scene character detection method based on an attention mechanism.
Background
Characters, as carriers of human knowledge and information, exist widely in real daily-life scenes, and extracting them is valuable in many applications based on image content. Character extraction from scene images has broad application prospects in blind navigation, blind reading, image retrieval and tagging, human-computer interaction, autonomous driving, and similar scenarios. Scene character detection determines the specific position of characters in an image, and character recognition then identifies the text within the bounding box as character strings. Scene character detection plays an important role in extracting and understanding character information in scene images, and its performance directly determines the performance of character recognition. As a technology for extracting character information from images to assist or augment reality applications, scene character detection and recognition has become a challenging research field in academia and industry and has drawn wide attention from researchers at home and abroad.
In recent years, with the rapid development of general object detection and semantic segmentation technology, scene character detection has been widely researched and has achieved remarkable results. Although many scene text detection methods with superior performance have been proposed, accurate localization of scene text remains difficult in some challenging scenes. The challenges of scene text detection come mainly from three aspects: scene characters are affected by noise, blur, occlusion, strong light, and low resolution; scene characters appear in diverse forms with large aspect-ratio variation; and scene text differs in size, color, font, language, and style. For these three reasons, scene text detection is still an open problem.
Currently, mainstream scene character detection methods can be roughly divided into two types: methods based on general-object bounding-box regression and methods based on semantic segmentation. The prior art exhibits the following defects in use:
the direction of curved text is difficult to regress: scene character detection methods based on bounding-box regression must regress direction information to handle multi-oriented scene text instances; however, for text instances of arbitrary shape, such as curved text, the direction information cannot be regressed;
adhesion between closely spaced text lines: scene character detection methods based on semantic segmentation perform well on scene text of arbitrary shape and direction; however, when different text lines lie close to each other, they easily stick together;
multi-level feature integration yields information redundancy: when predicting text-region information, semantic-segmentation-based methods utilize multi-level features from shallow and deep layers; however, the important information of text targets is concentrated in the deep features, and the integration process does not consider the differing importance of features.
disclosure of Invention
The invention aims to provide a natural scene character detection method based on an attention mechanism, so that the problems in the prior art are solved.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a natural scene character detection method based on an attention mechanism comprises the following steps:
s1, constructing a candidate text instance prediction network: adding a context feature extraction module, a bidirectional long-short time memory and feature fusion module, and an attention-based feature integration module to the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting the features of the scene character images through the convolution feature extraction network to obtain the corresponding convolution feature maps, which are taken from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f3, f4, f5}, where F represents the convolution feature map set of the scene text image;
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F′ = {f′3, f′4, f′5}, where F′ represents the set of multi-scale context information feature maps and f′i the context information feature map of convolution stage i, i ∈ {3, 4, 5};
s303, encoding the multi-scale context information feature maps of each convolution stage by sliding a 3 × 3 window from left to right over each map to obtain the corresponding feature sequence set Si, whose sequences are indexed by k ∈ {1, …, Hi} with time steps t ∈ {1, …, Wi}, where Hi and Wi are the height and width of feature map f′i and i ∈ {3, 4, 5} is the convolution-stage index; the feature sequence set Si is input in forward and backward order into the bidirectional long-short time memory and feature fusion module to obtain, for the state map of each sliding window in the multi-scale context information feature map of each convolution stage, a probability map of scene characters, denoted Md, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map;
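The sliding-window encoding of step S303 can be illustrated with a minimal single-channel sketch; the stride of 1 and the zero padding used here are illustrative assumptions, and the actual module slides a 3 × 3 window over multi-channel maps:

```python
def feature_sequences(fmap, win=3):
    # fmap: an H x W single-channel feature map. Slide a win-wide window from
    # left to right along each row (stride 1, zero padding), so every row of
    # height H yields a sequence of W window states.
    w = len(fmap[0])
    pad = win // 2
    sequences = []
    for row in fmap:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        sequences.append([tuple(padded[t:t + win]) for t in range(w)])
    return sequences

# A 2 x 3 toy map f'_i yields H_i = 2 sequences of W_i = 3 steps each.
seqs = feature_sequences([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])
```

Each element of `seqs` then plays the role of one sequence of the set Si fed to the bidirectional recurrent module.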
s304, mapping the state maps to a probability distribution over the prediction-category values in (0, 1) to obtain the weight maps Wm = {Wconv3_3, Wconv4_3, Wconv5_3} for the scene characters displayed in the state maps, and outputting the probability map Fout of scene characters at each pixel position of the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, Wl and bl are the weight and bias of convolution layer l ∈ L showing the scene text, and Wl reflects the degree of attention paid to the feature maps at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene character image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine-tuning is performed on the training set to generate the text center block model and the character stroke area model;
s5, applying the text center block model and the character stroke area model to the test image Fout of the scene character image processed in step S3, finally generating a probability map of the text center block and a probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center block through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking.
Preferably, the context feature extraction module includes four parallel hole convolution layers, the expansion coefficients of the four parallel hole convolution layers are 1, 3, 5 and 7, the size of the convolution kernel is 3 × 3, and the context feature extraction module is added after each convolution layer in the convolution feature extraction network.
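The coverage of the four parallel hole (dilated) convolution branches can be checked with a short sketch: a 3 × 3 kernel with dilation rate d spans 3 + 2(d − 1) input positions per axis, so the four branches observe progressively larger contexts from the same 3 × 3 kernel.

```python
def effective_kernel_size(k, d):
    # A k x k convolution with dilation d covers k + (k - 1) * (d - 1)
    # input positions per axis.
    return k + (k - 1) * (d - 1)

# The four parallel branches described above: 3 x 3 kernels with
# expansion coefficients 1, 3, 5 and 7.
branches = [effective_kernel_size(3, d) for d in (1, 3, 5, 7)]
print(branches)  # [3, 7, 11, 15]
```

The widening spans (3, 7, 11, 15 positions) are what give the module its multi-scale context without adding parameters.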
Preferably, the bidirectional long-short time memory and feature fusion module includes a forward LSTM layer, a backward LSTM layer, and a Concat layer, and the feature extraction step in the bidirectional long-short time memory and feature fusion module is as follows:
the characteristic sequence S belongs to SiInputting the data into the bidirectional long-time and short-time memory and feature fusion module; for characteristic sequencesRespectively using a forward LSTM layer and a backward LSTM layer to calculate the state sequence of the hidden layers, and splicing the states of all the hidden layers at each time step to obtain the state diagrams of all the hidden layersWhereinWherein B represents the state diagram of the hidden layer of each characteristic diagram, i belongs to {3,4,5}, and represents the sequence number of the convolution stage; and respectively mapping the hidden layer sequences to corresponding deconvolution layers, respectively outputting a probability graph of scene characters in a state graph of each sliding window in the multi-scale context information characteristic graph in each convolution stage, and marking the probability graph as a character:Wherein d ∈ {0, 1} represents two prediction categories with scene text displayed in each state diagram; will MdCutting the character image into a character image with the same size as the input scene, and marking the character image as a feature imageSplicing the two convolutional layers through a Concat layer, inputting the spliced convolutional layers into the two convolutional layers, wherein the first convolutional layer comprises 512 channels, and the size of a convolutional kernel is 3 multiplied by 3; the second convolutional layer contains 3 channels and the convolutional kernel size is 1 × 1.
Preferably, the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers; the attention-based feature integration module comprises the following execution steps:
the three channels of the second convolutional layer correspond to the characteristic diagram respectively In the Softmax layer, the feature map F is weightedcThe weight of (2) is mapped to the probability distribution of the value (0, 1) to obtain the probability distribution; in the Slice layer, dividing the probability distribution into weight maps Wm={Wconv33,Wconv43,Wconv53}; the attention-based feature integration module integrates the feature according to the feature map FcThe probability map F of the scene characters displayed at each pixel position in the scene character image is outputout。
Preferably, the judgment condition of the text center block is as follows:
Smin≤Stcb≤Smax
wherein Smin and Smax are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and Stcb represents the area of the candidate text center block.
Preferably, the labeling method for the detected text in step S6 is: for the text with the curved shape, marking a text example by adopting the outline of the text area; for a straight line text, the outline of the text region is fitted by using a minimum rectangle, and a text example is marked by using a rectangular box.
Preferably, the loss function of the text center block model and the character stroke area model in step S4 is L = −Σi,j [Gij log(Pij) + (1 − Gij) log(1 − Pij)], wherein Gij is the label of the pixel at (i, j) and Pij represents the probability that the pixel at (i, j) belongs to the foreground.
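The pixel-wise cross-entropy loss above can be written directly as a short sketch:

```python
import math

def text_region_loss(G, P, eps=1e-12):
    # L = -sum_{i,j} [ G_ij * log(P_ij) + (1 - G_ij) * log(1 - P_ij) ]
    total = 0.0
    for g_row, p_row in zip(G, P):
        for g, p in zip(g_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
            total += g * math.log(p) + (1 - g) * math.log(1 - p)
    return -total

# Two pixels, labels 1 and 0, both predicted at probability 0.5:
loss = text_region_loss([[1, 0]], [[0.5, 0.5]])
```

At P = 0.5 every pixel contributes log 2, the maximally uncertain case, and the loss decreases as predictions align with the labels.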
The invention has the beneficial effects that: the invention discloses a natural scene character detection method based on an attention mechanism, in which a context feature extraction module, a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module are added on the basis of a VGG-16 framework to construct a network model that can effectively predict character areas in scene images, overcoming the problems of prior-art scene character detection: the direction information of curved text is difficult to regress, closely spaced text lines stick together, and multi-level feature integration produces information redundancy.
Drawings
FIG. 1 is a test flow diagram of a natural scene text detection method based on an attention mechanism;
FIG. 2 is a comparison diagram of the impact of three modules, a context feature extraction module, a two-way long-and-short term memory and feature fusion module, and an attention-based feature integration module, on detection performance;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
S1, constructing a candidate text instance prediction network, and simultaneously adding a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module in the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
the convolution feature extraction network is created based on VGG-16, the last pooling layer and three full-connection layers in the VGG-16 structure are deleted, and three modules are added: the system comprises a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module; the context feature extraction module comprises four parallel cavity convolution layers, the expansion coefficients of the four cavity convolution layers are respectively 1, 3, 5 and 7, the convolution kernel size is 3 multiplied by 3, and the context feature extraction module is added behind each convolution layer in the convolution feature extraction network; the bidirectional long-short time memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer and a Concat layer; the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers.
S2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
the generating method of the text center block label adopts an algebraic method, and the coordinates of each scene character in the number set are directly calculated through a zoom factor: assume that each instance polygon P in the scene text image dataset contains 2N vertices, where N ≧ 2, a set of vertices { P { (N ≧ 2) }1,…,pN,p′1,…,p′NDenotes wherein (p)i,p′i) Called a point pair, i ∈ {1, …, N }; the scaled polygon P' still has 2N vertices, using a set of vertices { P }01,…,p0N,p′01,…,p′0NRepresents it. Suppose a vertex pi、p′i、p0i、p′0iAre respectively expressed as: (x)i,yi),(x′i,y′i),(x0i,y0i),(x′0i,y′0i). Given (x)i,yi),(x′i,y′i) And a scaling factor lambda0、x0i、y0i、x′0iAnd y'0iCalculated from the following formula:
the generation mode of the character stroke area label is obtained through an area growing algorithm, and for scene image characters with simple backgrounds, fewer seeds are selected to generate a stroke horizontal area; for scene image characters with complex background, more seeds need to be selected; for scene characters with complex backgrounds, if the number of the selected seeds exceeds 10 and the generated area is not perfect, in this case, the selected seeds are discarded, and other seeds are replaced to generate a stroke level area until 1-10 seeds are adopted, so that the area with the scene image characters can be completely generated.
S3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting the features of the scene character images through the convolution feature extraction network to obtain the corresponding convolution feature maps, which are taken from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f3, f4, f5}, where F represents the convolution feature map set of the scene text image;
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F′ = {f′3, f′4, f′5}, where F′ represents the set of multi-scale context information feature maps and f′i the context information feature map of convolution stage i, i ∈ {3, 4, 5};
s303, encoding the multi-scale context information feature maps of each convolution stage by sliding a 3 × 3 window from left to right over each map to obtain the corresponding feature sequence set Si, indexed by k ∈ {1, …, Hi} with time steps t ∈ {1, …, Wi}, where Hi and Wi are the height and width of feature map f′i and i ∈ {3, 4, 5} is the convolution-stage index; each feature sequence s ∈ Si is input into the bidirectional long-short time memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer compute the hidden-layer state sequences, and the hidden states at each time step are spliced to obtain the hidden-layer state maps Bi, i ∈ {3, 4, 5}; the hidden-layer sequences are mapped to the corresponding deconvolution layers, which output the probability map of scene characters for the state map of each sliding window, denoted Md, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map; Md is cropped to the same size as the input scene character image, the resulting feature maps are spliced through the Concat layer and input to two convolutional layers: the first convolutional layer contains 512 channels with a 3 × 3 convolution kernel, and the second contains 3 channels with a 1 × 1 convolution kernel;
s304, the three channels of the second convolutional layer in step S303 correspond respectively to the feature maps of the conv3_3, conv4_3 and conv5_3 stages; in the Softmax layer, the weights of the feature map Fc are mapped to a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps Wm = {Wconv3_3, Wconv4_3, Wconv5_3}; the attention-based feature integration module outputs, from the feature map Fc, the probability map Fout of scene characters displayed at each pixel position of the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, Wl and bl are the weight and bias of convolution layer l ∈ L showing the scene text, and Wl reflects the degree of attention paid to the feature maps at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene character image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine-tuning is performed on the training set to generate the text center block model and the character stroke area model;
the loss function of the text center module and the character stroke area model is L ═ sigmai,jGijlog(Pij)+(1-Gij)log(1-Pij) Wherein G isijIs the label of the pixel at (i, j), PijRepresenting the probability that the pixel at (i, j) belongs to the foreground;
s5, applying the text center block model and the character stroke area model to the test image Fout of the scene character image processed in step S3, finally generating a probability map of the text center block and a probability map of the character stroke area corresponding to the scene text image;
test image F containing scene charactersoutRespectively inputting into the text center model and the character stroke region model to generate a probability map (F) of the text center blocktcb) And probability map of word stroke area (F)wsr);
And S6, judging the text center blocks through the area limit of the text center block, removing false candidate text center blocks, finally obtaining the text detected in the scene text image, and adopting different labeling modes for different text objects.
The judgment condition of the text center block is as follows:
Smin≤Stcb≤Smax
wherein Smin and Smax are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and Stcb represents the area of a candidate text center block;
When judging candidate text center blocks by the area of the text center block, Smin and Smax are set to 211 and 81179 respectively, because 99% of text center blocks have an area of at least 211 and less than 81179; the finally screened text center block and word stroke area instances are marked; the final character area is obtained, completing the scene image character detection task. The marking mode of the detected text is: for text with a curved shape, the outline of the text area marks the text instance; for straight-line text, the outline of the text region is fitted with a minimum rectangle and the text instance is marked with a rectangular box.
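The area screening of candidate text center blocks reduces to one comparison per block against the thresholds quoted above, following the judgment condition Smin ≤ Stcb ≤ Smax:

```python
S_MIN, S_MAX = 211, 81179  # thresholds reported for the Total-Text experiments

def keep_text_center_block(area, s_min=S_MIN, s_max=S_MAX):
    # A candidate survives only if s_min <= area <= s_max.
    return s_min <= area <= s_max

candidates = [120, 211, 5000, 81179, 90000]
kept = [a for a in candidates if keep_text_center_block(a)]
```

Blocks outside the band (tiny noise blobs or background-sized regions) are discarded as false candidates before marking.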
Examples
In this embodiment, a comparative experiment is performed to verify the technical effect of the scene text detection method described in the first embodiment, and the experimental environment and the experimental result are as follows:
(1) experimental Environment
The system environment is as follows: ubuntu 16.04;
hardware environment: GPU, GTX 1080Ti, memory: 512G.
(2) Experimental data set
Training data: first, the text center block model was pre-trained for 4 × 10^5 iterations using the 7200 training images of MLT2017; then the text center block model and the word stroke region model were fine-tuned for 4 × 10^5 iterations on Total-Text (1255 training images).
Test data: Total-Text (300 test images).
(3) Evaluation method
Curved shape text: pascal evaluation methods.
In order to show the effectiveness of the invention, four groups of experiments were set up, each training the model with the same training set and evaluated on the test set of the Total-Text dataset:
the first set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, predicting a target region by using the combination of the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of a context feature extraction module;
the second set of experiments: training by using a combination of a context feature extraction module and an attention-based feature integration module, marking as the context feature extraction module and the attention-based feature integration module, predicting a target region by using the combination of the context feature extraction module and the attention-based feature integration module, and verifying the validity of a bidirectional long-short time memory and feature fusion module;
the third set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and a context feature extraction module, marking as the bidirectional long-short time memory and feature fusion module and the context feature extraction module, predicting a target region by using the combination of the context feature extraction module and the bidirectional long-short time memory and feature fusion module, and verifying the effectiveness of a feature integration module based on attention;
fourth set of experiments: the method comprises the following steps of training by using a combination of three modules, namely a bidirectional long-short time memory and feature fusion module, a context feature extraction module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module, the context feature extraction module and the attention-based feature integration module, predicting a target region by using a combination of the three modules, namely the context feature extraction module, the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of the three modules as a comparison group;
setting parameters: t iswAnd TbSet to 0.55, 0.60, respectively;
validity of the context feature extraction module: in order to obtain multi-scale context information, the invention designs a context feature extraction module; from fig. 2, it can be seen that the method of "two-way long-short time memory and feature fusion module + attention-based feature integration module" reduces F-measure by 1.54% (76.99% vs. 78.53%), reduces Precision by 4.52% (74.39% vs. 78.91%), but increases Recall by 1.63% (79.79% vs. 78.16%) without using the context feature extraction module.
Validity of the bidirectional long-time and short-time memory and feature fusion module: in order to utilize the space sequence characteristics of characters in text objects (words and text lines), the invention designs a bidirectional long-time and short-time memory and feature fusion module. From fig. 2, it can be seen that the method of "context feature extraction module + attention-based feature integration module" reduces F-measure by 3.25% (75.28% vs. 78.53%), Precision by 7.19% (71.72% vs. 78.91%), but increases Recall by 1.06% (79.22% vs. 78.16%) without using the two-way long-short time memory and feature fusion module.
Effectiveness of the attention-based feature integration module: in order to make the trained model strengthen its attention to text regions in the scene image, the invention designs an attention-based feature integration module. As can be seen from fig. 2, without this module, the "context feature extraction module + bidirectional long-short-term memory and feature fusion module" method reduces F-measure by 0.71% (77.82% vs. 78.53%) and Recall by 1.94% (76.22% vs. 78.16%), but increases Precision by 0.51% (79.48% vs. 78.91%).
As shown in fig. 2, the test results reflect the influence of the "context feature extraction module", the "bidirectional long-short-term memory and feature fusion module", and the "attention-based feature integration module" on detection performance. Comparing the first and second groups of experiments with the fourth group shows that the "context feature extraction module" and the "bidirectional long-short-term memory and feature fusion module" each significantly improve Precision of the method while slightly reducing Recall; comparing the third group with the fourth group shows that the "attention-based feature integration module" significantly improves Recall of the method of the invention while slightly reducing Precision. Meanwhile, the comparison experiments show that the three modules are complementary in improving Precision, Recall, and F-measure of the method.
The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. It is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (for example, a CD-ROM, a USB disk, or a removable hard disk) and includes instructions for causing a computing device (for example, a personal computer, a server, or a network appliance) to perform the methods described in the various embodiments of the invention.
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the invention discloses a natural scene character detection method based on an attention mechanism, in which a context feature extraction module, a bidirectional long-short-term memory and feature fusion module, and an attention-based feature integration module are added on the basis of the VGG-16 framework to construct a network model that can effectively predict character areas in a scene image, so as to overcome the problems of scene character detection in the prior art: the direction information of curved text is difficult to regress, adjacent text lines stick closely together, and multi-level feature integration produces redundant information.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.
Claims (7)
1. A natural scene character detection method based on an attention mechanism is characterized by comprising the following steps:
s1, constructing a candidate text instance prediction network: adding a context feature extraction module, a bidirectional long-short-term memory and feature fusion module, and an attention-based feature integration module to the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting features of the scene character image through the convolution feature extraction network to obtain convolution feature maps of the corresponding scene character image; the convolution feature maps are extracted from the third, fourth, and fifth of the five convolution stages of the VGG-16 convolution network and marked as F = {f_3, f_4, f_5}, where F represents the convolution feature map set of the scene text image; the last convolution layer of each of these convolution stages is marked as L = {conv3_3, conv4_3, conv5_3};
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, marked as F′ = {f′_3, f′_4, f′_5}, where F′ represents the set of multi-scale context information feature maps and f′_i represents the multi-scale context information feature map of each convolution stage, i ∈ {3, 4, 5};
s303, coding all the multi-scale context information feature maps of each convolution stage: sliding a 3 × 3 window from left to right over each multi-scale context information feature map to obtain the corresponding feature sequence set, marked as S_i, where C represents the number of channels, H_i and W_i are the height and width of the feature map f′_i, and i ∈ {3, 4, 5} is the sequence number of the convolution stage; the feature sequence set S_i is input into the bidirectional long-short-term memory and feature fusion module in forward and backward order, and the probability map of scene characters in the state map of each sliding window of the multi-scale context information feature map of each convolution stage is obtained, marked as M_a^d, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map;
s304, according to the probability distribution obtained by mapping the state maps onto the prediction category values (0, 1), obtaining the weight maps W_m = {W_conv3_3, W_conv4_3, W_conv5_3} for displaying scene characters in the state maps, and outputting the probability map F_out of scene characters at each pixel position in the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, W_l and b_l are the weight and bias of the convolution layer l ∈ L displaying the scene text, and W_l reflects the degree of attention paid to the feature maps at different positions;
s4, training a text center block model and a character stroke area model: pre-training the two models to convergence with the text center block labels and character stroke area labels in the training set of the scene character image dataset of step S2, and fine-tuning on the basis of the training set to generate the text center block model and the character stroke area model;
s5, processing the scene character image to be tested through step S3 to obtain F_out, and calculating through the text center block model and the character stroke area model to finally generate the probability map of the text center block and the probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center blocks through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking it.
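The VGG-16 backbone of step S301 can be sketched in code. The following is a minimal PyTorch illustration (an assumption-laden sketch, not the patent's implementation) of the standard VGG-16 convolution stack, returning the outputs of stages 3-5 (i.e. after conv3_3, conv4_3, conv5_3) as the feature map set F = {f3, f4, f5}:

```python
import torch
import torch.nn as nn

def vgg_stage(in_c, out_c, n_convs):
    """One VGG convolution stage: n_convs pairs of (3x3 conv + ReLU)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class VGG16Backbone(nn.Module):
    """Standard VGG-16 convolution stack (no pretrained weights here);
    forward() taps the outputs of stages 3, 4, 5, giving F = {f3, f4, f5}."""
    def __init__(self):
        super().__init__()
        self.stage1 = vgg_stage(3, 64, 2)
        self.stage2 = vgg_stage(64, 128, 2)
        self.stage3 = vgg_stage(128, 256, 3)   # ends at conv3_3
        self.stage4 = vgg_stage(256, 512, 3)   # ends at conv4_3
        self.stage5 = vgg_stage(512, 512, 3)   # ends at conv5_3
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(self.stage1(x))
        x = self.pool(self.stage2(x))
        f3 = self.stage3(x)                    # stride 4
        f4 = self.stage4(self.pool(f3))        # stride 8
        f5 = self.stage5(self.pool(f4))        # stride 16
        return f3, f4, f5

f3, f4, f5 = VGG16Backbone()(torch.randn(1, 3, 64, 64))
print(f3.shape, f4.shape, f5.shape)
```

In practice the backbone would be initialized from ImageNet-pretrained VGG-16 weights before the fine-tuning of step S4.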
2. The attention mechanism-based natural scene character detection method according to claim 1, wherein the context feature extraction module comprises four parallel dilated (hole) convolution layers with dilation coefficients 1, 3, 5, and 7 and convolution kernel size 3 × 3, and a context feature extraction module is added after each convolution layer in the convolution feature extraction network.
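A minimal PyTorch sketch of the context feature extraction module of claim 2: four parallel 3 × 3 dilated convolutions with dilation coefficients 1, 3, 5, and 7. Fusing the four branches by summation is an assumption; the claim does not state how the branch outputs are combined.

```python
import torch
import torch.nn as nn

class ContextFeatureExtraction(nn.Module):
    """Four parallel 3x3 dilated convolutions (dilations 1, 3, 5, 7).
    Assumption: the four branch outputs are summed, keeping the channel
    count and spatial size unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5, 7)   # padding == dilation keeps H x W unchanged
        )

    def forward(self, f):
        return sum(branch(f) for branch in self.branches)

f3 = torch.randn(1, 256, 56, 56)
print(ContextFeatureExtraction(256)(f3).shape)  # same shape as the input
```

Setting `padding` equal to the dilation rate for a 3 × 3 kernel preserves the feature map resolution, so the module can be dropped in after any convolution stage without resizing.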
3. The attention mechanism-based natural scene text detection method according to claim 1, wherein the bidirectional long-short-term memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer, and a Concat layer, and the feature extraction steps in the bidirectional long-short-term memory and feature fusion module are as follows:
each feature sequence s ∈ S_i is input into the bidirectional long-short-term memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer are used respectively to calculate the hidden-layer state sequences, and the states of all hidden layers at each time step are spliced to obtain the hidden-layer state maps B_i, where B represents the hidden-layer state map of each feature map and i ∈ {3, 4, 5} is the sequence number of the convolution stage; the hidden-layer sequences are mapped onto corresponding deconvolution layers, and the probability map of scene characters in the state map of each sliding window of the multi-scale context information feature map of each convolution stage is output, marked as M_a^d, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map; M_a is cut to the same size as the input scene character image and marked as the feature map F_c; the maps are spliced through the Concat layer and input into two convolution layers, wherein the first convolution layer comprises 512 channels with convolution kernel size 3 × 3, and the second convolution layer comprises 3 channels with convolution kernel size 1 × 1.
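A hedged PyTorch sketch of the sliding-window encoding and bidirectional LSTM step of claim 3: each row of the feature map is read left-to-right as a sequence of 3 × 3 windows (extracted with `unfold`), and the forward and backward hidden states are spliced at each time step. The hidden size, the one-sequence-per-row layout, and the window stride are assumptions; the deconvolution, Concat layer, and the 512-/3-channel convolution head are omitted for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMFusion(nn.Module):
    """Reads each row of a C x H x W feature map as a left-to-right sequence
    of 3x3 windows and runs a bidirectional LSTM over it, concatenating the
    forward and backward hidden states (the splicing step of claim 3)."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(channels * 9, hidden, batch_first=True,
                           bidirectional=True)

    def forward(self, f):
        n, c, h, w = f.shape
        # 3x3 sliding windows at every position: (N, C*9, H*W)
        patches = nn.functional.unfold(f, kernel_size=3, padding=1)
        patches = patches.view(n, c * 9, h, w).permute(0, 2, 3, 1)  # (N, H, W, C*9)
        seqs = patches.reshape(n * h, w, c * 9)   # one sequence per row
        states, _ = self.rnn(seqs)                # (N*H, W, 2*hidden)
        return states.reshape(n, h, w, -1).permute(0, 3, 1, 2)  # (N, 2*hidden, H, W)

out = BiLSTMFusion(32)(torch.randn(1, 32, 8, 10))
print(out.shape)
```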
4. The attention-based natural scene text detection method of claim 3, wherein the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer, and three spatial product layers; the attention-based feature integration module comprises the following execution steps:
the three channels of the second convolution layer correspond to the feature maps of the three convolution stages respectively; in the Softmax layer, the weights of the feature map F_c are mapped onto a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps W_m = {W_conv3_3, W_conv4_3, W_conv5_3}; the attention-based feature integration module outputs, according to the feature map F_c, the probability map F_out of scene characters displayed at each pixel position in the scene character image.
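A sketch of the attention-based feature integration module of claim 4, assuming the three softmax channels weight three per-stage score maps that are then summed; the claim specifies only the layer inventory (two convolutions, Softmax, Slice, three spatial products), so the fusion by weighted sum is an assumption.

```python
import torch
import torch.nn as nn

class AttentionFeatureIntegration(nn.Module):
    """Two convolutions (512ch 3x3, then 3ch 1x1), a channel-wise softmax,
    a slice into three weight maps W_conv3_3 / W_conv4_3 / W_conv5_3, and
    three element-wise (spatial) products with the per-stage maps.
    Summing the weighted maps into F_out is an assumed fusion rule."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 512, 3, padding=1)  # 512 ch, 3x3
        self.conv2 = nn.Conv2d(512, 3, 1)                       # 3 ch, 1x1

    def forward(self, fc, stage_maps):
        # Softmax layer: weights become a (0, 1) probability distribution
        # over the three channels; Slice layer: one weight map per stage.
        weights = torch.softmax(self.conv2(self.conv1(fc)), dim=1)  # (N, 3, H, W)
        # three spatial products, then (assumed) summation into F_out
        return sum(weights[:, i:i + 1] * stage_maps[i] for i in range(3))

fc = torch.randn(1, 768, 32, 32)                      # concatenated F_c (channel count assumed)
maps = [torch.randn(1, 1, 32, 32) for _ in range(3)]  # per-stage score maps
print(AttentionFeatureIntegration(768)(fc, maps).shape)
```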
5. The attention mechanism-based natural scene character detection method according to claim 1, wherein the judgment condition of the text center block is:
S_min ≤ S_tcb ≤ S_max
wherein S_min and S_max are the thresholds of the minimum area and the maximum area of the text center block respectively, and S_tcb represents the area of the candidate text center block.
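The area check of claim 5 amounts to a simple filter over candidate blocks. A sketch, where the `(block_id, area)` pair representation is an assumption for illustration:

```python
def filter_text_center_blocks(blocks, s_min, s_max):
    """Keep candidate text center blocks whose area S_tcb satisfies
    S_min <= S_tcb <= S_max (claim 5); the rest are eliminated as
    false candidates. `blocks` is a list of (block_id, area) pairs."""
    return [b for b in blocks if s_min <= b[1] <= s_max]

candidates = [("a", 50), ("b", 400), ("c", 9000)]
print(filter_text_center_blocks(candidates, s_min=100, s_max=5000))  # [('b', 400)]
```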
6. The attention mechanism-based natural scene character detection method of claim 1, wherein the labeling manner for the detected text in step S6 is: for text with a curved shape, the outline of the text area is used to mark the text instance; for straight-line text, the outline of the text region is fitted with a minimum rectangle, and the text instance is marked with a rectangular box.
7. The attention mechanism-based natural scene text detection method according to claim 1, wherein the loss function of the text center block model and the character stroke area model in step S4 is L = -Σ_{i,j} [G_ij log(P_ij) + (1 - G_ij) log(1 - P_ij)], where G_ij is the label of the pixel at (i, j) and P_ij represents the probability that the pixel at (i, j) belongs to the foreground.
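The loss of claim 7 is a summed pixel-wise binary cross-entropy. A direct PyTorch sketch (the epsilon clamp is an added numerical-stability assumption):

```python
import torch

def center_block_loss(p, g, eps=1e-7):
    """Pixel-wise binary cross-entropy of claim 7:
    L = -sum_{i,j} [ G_ij * log(P_ij) + (1 - G_ij) * log(1 - P_ij) ],
    where g holds the labels G_ij and p the foreground probabilities P_ij."""
    p = p.clamp(eps, 1 - eps)   # avoid log(0)
    return -(g * torch.log(p) + (1 - g) * torch.log(1 - p)).sum()

p = torch.tensor([[0.9, 0.1]])   # predicted foreground probabilities
g = torch.tensor([[1.0, 0.0]])   # ground-truth labels
print(float(center_block_loss(p, g)))  # -2*log(0.9) ≈ 0.2107
```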
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111603367.9A CN114241470A (en) | 2021-12-24 | 2021-12-24 | Natural scene character detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114241470A true CN114241470A (en) | 2022-03-25 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115438214A (en) * | 2022-11-07 | 2022-12-06 | 北京百度网讯科技有限公司 | Method for processing text image, neural network and training method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||