CN116630755B - Method, system and storage medium for detecting text position in scene image - Google Patents
- Publication number
- CN116630755B CN202310373895.2A
- Authority
- CN
- China
- Prior art keywords
- training
- scene
- text position
- training scene
- position detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a method, a system and a storage medium for detecting text positions in a scene image, wherein the method comprises the following steps: training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; the preset text position detection model is used for fusing, through a feature fusion module, the different-scale features of the image extracted by a feature extraction module, and for predicting the text position of the image by sequentially adopting a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network; and inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain the recognition result. The invention supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features so that particularly large or particularly small characters are not missed; the text position in a scene image can be detected more accurately and more quickly.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, system, and storage medium for detecting text positions in a scene image.
Background
Reading text in natural scene images has recently attracted more and more attention in the field of computer vision. In many practical applications, however, the large variance of text patterns and highly cluttered backgrounds pose major challenges for accurate text localization. Accurate scene text detection helps improve the precision and efficiency of text recognition and helps extend text recognition to more application scenarios.
Current text detection methods mainly adopt a bottom-up recognition pipeline, which usually starts from low-level character or stroke detection and then passes through complicated steps such as non-text filtering, text-line construction and text-line verification before finally locating the region where the target text lies. These multi-step bottom-up approaches are often complex, less robust and less reliable, and their performance depends heavily on the results of character detection. Other neural-network algorithms, mainly based on connected-component or sliding-window methods, also exploit low-level features to distinguish text candidates from the background; however, identifying individual strokes or characters in isolation, without contextual information, is not robust. On the one hand, the context among multiple characters provides reasoning clues for blurred fonts and thus helps recognize blurred characters. On the other hand, discarding such context typically yields a large number of non-text components in character detection, which causes major difficulties in the subsequent steps. Moreover, these false detections tend to accumulate continuously during the bottom-up recognition process, so that the final recognition result cannot meet requirements.
Generally, the model algorithms used for scene text detection at the present stage have drawbacks in the following aspects: they ignore context, they rely too heavily on low-level text features, and a generic object detection system using an RPN (Region Proposal Network) is difficult to apply directly to scene text detection, which generally requires higher localization accuracy.
Therefore, there is a need for a technical solution that solves the above technical problems.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method, a system and a storage medium for detecting text positions in a scene image.
The technical scheme of the method for detecting the text position in the scene image is as follows:
training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; any target training scene image contains text information, and the preset text position detection model comprises: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The method for detecting the text position in the scene image has the following beneficial effects:
the method of the invention supports the picture with any shape as input, is not affected by the picture with small resolution, can extract the character features of multiple scales, and can not miss the characters with special large or small size. By taking the context relation of the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence characteristics of the characters, the text position in the scene image can be detected more accurately and more rapidly.
On the basis of the scheme, the method for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and returning to the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
Further, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Further, the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample.
Further, the method further comprises the following steps:
acquiring original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively; wherein the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
The technical scheme of the system for detecting the text position in the scene image is as follows:
comprising the following steps: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The system for detecting the text position in the scene image has the following beneficial effects:
the system of the invention supports the picture with any shape as input, is not influenced by the picture with small resolution, can extract the character features of multiple scales, and does not miss particularly large or particularly small characters. By taking the context relation of the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence characteristics of the characters, the text position in the scene image can be detected more accurately and more rapidly.
Based on the scheme, the system for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical scheme of the storage medium is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of a method of detecting text position in an image of a scene as in the present invention.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 2 is a schematic structural diagram of a preset text position detection model in an embodiment of a method for detecting text positions in a scene image according to the present invention;
FIG. 3 is a flow chart illustrating step 110 in an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention.
Detailed Description
Fig. 1 is a flow chart of an embodiment of a method for detecting text position in a scene image according to the present invention. As shown in fig. 1, the method comprises the steps of:
step 110: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model.
Wherein, (1) as shown in fig. 2, the preset text position detection model includes: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged. (2) The preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network. (3) A target training scene image is a scene image of a training scene used for training the preset text position detection model, and it contains text information. (4) The target text position detection model is the text position detection model obtained by training. (5) Any target training scene image contains text information.
Step 120: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Wherein, (1) the scene to be detected is the scene that needs to be detected in this embodiment. (2) The scene image to be detected is a scene image captured in the scene to be detected. (3) The text position recognition result is the position of each predicted text box containing text content in the scene image to be detected.
It should be noted that (1) a scene image to be detected may have one predicted text box or a plurality of predicted text boxes; the specific number is determined by the text content in the scene image. (2) In the detection process, one scene image of the scene to be detected may be input, or a plurality of scene images of the scene to be detected may be input at the same time, which is not limited here.
Preferably, any training scene corresponds to at least one target training scene image; as shown in fig. 3, step 110 includes:
step 111: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model.
Wherein, (1) the training detection result comprises the detection result of at least one predicted text box corresponding to the training scene. (2) The training label image is an image obtained by labeling, before model training, the text boxes present in the training scene. (3) The target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample. (4) The target loss value is used to represent the degree of difference between the predicted text boxes and the labeled text boxes.
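For illustration only, the sketch below shows one way such a combined objective could be computed in PyTorch, summing an IoU-based box term and a text/non-text classification term; the exact weighting and averaging scheme of the patented loss is not reproduced here, and the function names and the equal weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """1 - IoU averaged over matched (x1, y1, x2, y2) box pairs; both inputs have shape (M, 4)."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-6)
    return (1.0 - iou).mean()

def detection_loss(pred_boxes, gt_boxes, text_logits, text_labels):
    """Combined objective: box regression term plus text / non-text classification term."""
    box_term = iou_loss(pred_boxes, gt_boxes)
    cls_term = F.binary_cross_entropy_with_logits(text_logits, text_labels.float())
    return box_term + cls_term
```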
Step 112: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: the training iteration number reaches the maximum iteration number or the model loss function converges.
Wherein, (1) the preset training conditions are: the training iteration number reaches the maximum iteration number or the model loss function converges. (2) The first text position detection model is the text position detection model obtained in the current iteration of the training process.
Step 113A: and when the judgment result is yes, determining the first text position detection model as the target text position detection model.
Step 113B: and when the judging result is negative, taking the first text position detection model as the preset text position detection model and returning to execute step 111, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
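A minimal training-loop sketch reflecting steps 111 to 113B, assuming a PyTorch model, an externally supplied optimizer and loss function, and a simple loss-difference convergence test (the optimizer choice and the convergence tolerance are assumptions):

```python
import torch

def train(model, optimizer, scene_batches, loss_fn, max_iters: int, tol: float = 1e-4):
    """Optimize the preset model until the iteration limit is reached or the loss converges."""
    prev_loss = float("inf")
    for it in range(max_iters):
        total_loss = 0.0
        for images, labels in scene_batches:   # one batch per training scene
            preds = model(images)              # training detection results
            loss = loss_fn(preds, labels)      # target loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if abs(prev_loss - total_loss) < tol:  # loss function has converged
            break
        prev_loss = total_loss
    return model                               # target text position detection model
```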
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
Preferably, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
and inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene.
Wherein, (1) the first feature image is an image of N×C×W×H, where C represents the channels of the first feature image, W represents the width of the first feature image, and H represents the height of the first feature image. (2) Each downsampling layer corresponds to a different degree of downsampling. (3) Each second feature image is a feature image of a specific scale; the plurality of second feature images correspond to feature images of different scales.
It should be noted that, the process of feature extraction of the image through the network of Resnet-34 is the prior art, and detailed processes are not repeated here.
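As a hedged illustration of this step, the sketch below uses the torchvision ResNet-34 truncated before its pooling and classification head as the feature extractor and stride-2 max-pooling layers as the plurality of different downsampling layers; the number of scales and the truncation point are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class MultiScaleExtractor(nn.Module):
    def __init__(self, num_scales: int = 3):
        super().__init__()
        base = resnet34(weights=None)
        # Keep the convolutional body only (drop avgpool / fc); its output is N x 512 x H x W.
        self.body = nn.Sequential(*list(base.children())[:-2])
        # Each downsampling layer reduces the resolution by a different factor.
        self.downsamples = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=2 ** k, stride=2 ** k) for k in range(1, num_scales + 1)]
        )

    def forward(self, x):
        first = self.body(x)                                  # first feature image
        return first, [d(first) for d in self.downsamples]   # second feature images of different scales

# Example: a batch of 2 RGB scene images.
extractor = MultiScaleExtractor()
first_feat, scales = extractor(torch.randn(2, 3, 256, 256))
```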
And based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene.
Wherein, (1) the feature fusion function is the Concat function, which fuses the second feature images of different scales across channels. (2) The third feature image is the image obtained by fusing the second feature images of different scales. (3) The normalization layer, the activation function layer and the 3×3 convolution layer are applied in combination to restore the channel number of the third feature image to the original channel number C, thereby obtaining a fourth feature image of N×C×W×H.
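A sketch of the fusion step, assuming the differently scaled second feature images are interpolated back to a common resolution before the channel-wise Concat (the patent does not spell out how the scales are aligned), followed by the normalization layer, activation function layer and 3×3 convolution layer that restore the original channel count C:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        total = channels * num_scales
        self.bn = nn.BatchNorm2d(total)                                     # normalization layer
        self.act = nn.ReLU(inplace=True)                                    # activation function layer
        self.conv = nn.Conv2d(total, channels, kernel_size=3, padding=1)   # restore C channels

    def forward(self, second_feats):
        # Resize every scale to the first (largest) one, then fuse across channels (Concat).
        h, w = second_feats[0].shape[-2:]
        aligned = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                   for f in second_feats]
        third = torch.cat(aligned, dim=1)            # third feature image
        return self.conv(self.act(self.bn(third)))   # fourth feature image, N x C x H x W

fusion = FusionBlock(channels=512, num_scales=3)
fourth = fusion([torch.randn(2, 512, 16, 16),
                 torch.randn(2, 512, 8, 8),
                 torch.randn(2, 512, 4, 4)])
```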
And inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene.
Wherein, (1) the preset size is 3×3. (2) The fifth feature image is a feature image of N×9C×H×W.
Specifically, a 3×3 sliding window process is performed on the fourth feature image, that is, each point combines the features of its surrounding 3×3 region to obtain a feature vector of length 3×3×C, and a fifth feature image of N×9C×H×W is finally output.
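One possible realization of this sliding-window step is torch.nn.functional.unfold with a 3×3 kernel and padding 1, which gathers the 3×3 neighbourhood of every position into a 9C-dimensional feature vector; using unfold here is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def sliding_window_3x3(fourth: torch.Tensor) -> torch.Tensor:
    """N x C x H x W -> N x 9C x H x W, each position carrying its 3x3 neighbourhood."""
    n, c, h, w = fourth.shape
    cols = F.unfold(fourth, kernel_size=3, padding=1)   # N x 9C x (H*W)
    return cols.view(n, 9 * c, h, w)                    # fifth feature image

fifth = sliding_window_3x3(torch.randn(2, 512, 16, 16))
print(fifth.shape)  # torch.Size([2, 4608, 16, 16])
```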
And inputting the fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene.
Wherein, (1) the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length. (2) The first matrix conversion module converts the fifth feature image into a feature map of (N×H)×W×9C, i.e., the first intermediate feature map. (3) The second intermediate feature map is of size (N×H)×W×256. (4) The second matrix conversion module converts the second intermediate feature map into a feature map of N×256×H×W, i.e., the sixth feature image.
It should be noted that, the process of acquiring the corresponding feature map by using the bidirectional LSTM model for learning the sequence features of each line in the image is the prior art, and will not be repeated here.
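A sketch of the first matrix conversion, bidirectional LSTM and second matrix conversion chain, assuming a hidden size of 128 per direction so that the concatenated forward and backward outputs yield the 256 channels mentioned above:

```python
import torch
import torch.nn as nn

class BiLSTMBlock(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, bidirectional=True, batch_first=True)

    def forward(self, fifth: torch.Tensor) -> torch.Tensor:
        n, c, h, w = fifth.shape                                   # N x 9C x H x W
        # First matrix conversion: every image row becomes a sequence of length W (Batch = N*H, T_max = W).
        seq = fifth.permute(0, 2, 3, 1).reshape(n * h, w, c)       # (N*H) x W x 9C
        out, _ = self.lstm(seq)                                    # (N*H) x W x 256, second intermediate feature map
        # Second matrix conversion: back to an image-shaped tensor.
        return out.reshape(n, h, w, 2 * self.lstm.hidden_size).permute(0, 3, 1, 2)  # N x 256 x H x W

sixth = BiLSTMBlock(in_channels=4608)(torch.randn(2, 4608, 16, 16))
```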
And inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Wherein, (1) the seventh feature image is a feature map of N×512×H×W. (2) The non-maximum suppression algorithm is used to preserve the text candidate boxes with the highest probability.
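A hedged sketch of the final stage: a per-position projection to 512 channels standing in for the full link layer, a small RPN-style head predicting a text score and box offsets for k anchors at every position, and torchvision's non-maximum suppression for keeping the highest-probability candidates; the anchor count k and the use of 1×1 convolutions are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class TextProposalHead(nn.Module):
    def __init__(self, in_channels: int = 256, hidden: int = 512, num_anchors: int = 10):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, hidden, kernel_size=1)        # "full link" applied per position
        self.score = nn.Conv2d(hidden, num_anchors, kernel_size=1)     # text / non-text score per anchor
        self.bbox = nn.Conv2d(hidden, num_anchors * 4, kernel_size=1)  # box offsets per anchor

    def forward(self, sixth: torch.Tensor):
        seventh = torch.relu(self.fc(sixth))   # seventh feature image, N x 512 x H x W
        return self.score(seventh), self.bbox(seventh)

def keep_best_candidates(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """Non-maximum suppression over decoded candidate boxes (x1, y1, x2, y2)."""
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]

head = TextProposalHead()
scores, offsets = head(torch.randn(2, 256, 16, 16))
```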
Preferably, the method further comprises:
the method comprises the steps of obtaining original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively.
Wherein, the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
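A minimal sketch of such a preprocessing filter, assuming OpenCV is available and using the mean grey level to flag overexposure and the variance of the Laplacian to flag blur; the thresholds are illustrative assumptions rather than values from the patent.

```python
import cv2

def is_usable(path: str, max_brightness: float = 240.0, min_sharpness: float = 50.0) -> bool:
    """Reject unreadable, overexposed, or blurred original scene images."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:                      # incomplete / unreadable image
        return False
    if img.mean() > max_brightness:      # overexposed image
        return False
    if cv2.Laplacian(img, cv2.CV_64F).var() < min_sharpness:  # blurred image
        return False
    return True
```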
In this embodiment, after the text position recognition result of the scene image to be detected is obtained, a corresponding text recognition method may be further used to recognize text content in the target text box, so as to obtain text information in the target text box.
The technical solution of this embodiment supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features, so that particularly large or particularly small characters are not missed. By taking the contextual relations within the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
Fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention. As shown in fig. 4, the system 200 includes: a training unit 210 and a detection unit 220.
The training unit 210 is configured to: train a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; any target training scene image contains text information, and the preset text position detection model comprises: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit 220 is configured to: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Preferably, any training scene corresponds to at least one target training scene image; the training unit 210 includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical solution of this embodiment supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features, so that particularly large or particularly small characters are not missed. By taking the contextual relations within the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
For the steps by which the parameters and modules in the system 200 for detecting the text position in a scene image of this embodiment implement their corresponding functions, reference may be made to the above embodiments of the method for detecting the text position in a scene image, and they are not described herein again.
The storage medium provided by the embodiment of the invention stores instructions which, when read by a computer, cause the computer to perform the steps of the above method for detecting the text position in a scene image; specific reference may be made to the parameters and steps in the above embodiments of the method, which are not described herein again.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. Wherein the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.
Claims (5)
1. A method of detecting text position in an image of a scene, comprising:
training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
when the judgment result is negative, the first text position detection model is used as the preset text position detection model, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model is carried out in a returning mode, and when the judgment result is positive, the first text position detection model is determined to be the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene comprises the following steps:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
2. The method of detecting text position in an image of a scene of claim 1, wherein the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample.
3. The method of detecting text position in an image of a scene as recited in claim 1 or 2, further comprising:
acquiring original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively; wherein the preprocessing comprises the following steps: and eliminating the original scene image with overexposure, incomplete image and blurred image.
4. A system for detecting text position in an image of a scene, comprising: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: when the judgment result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judgment result is positive, and determining the first text position detection model as the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the first training unit is specifically configured to:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
the any one is subjected toA fifth characteristic image corresponding to a training scene is input to the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, the first intermediate characteristic image of the training scene is input to the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and the second intermediate characteristic image of the training scene is input to the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein, the preset input conditions are as follows: batch=nh and T max Data flow of =w, batch is the Batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, T max Is the maximum length of time;
and inputting the sixth feature image corresponding to any training scene into the fully connected layer for conversion to obtain a seventh feature image corresponding to the training scene, inputting the seventh feature image into the RPN network to obtain at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as the training detection result of the training scene.
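Finally, a sketch of the fully connected conversion, an RPN-style candidate-box prediction and the non-maximum suppression step; the anchor count, the 1×1 convolution standing in for the fully connected layer, and the omission of anchor decoding are assumptions of the example, not the patent's exact network.

```python
# Sketch: a per-position fully connected layer (1x1 conv), a simplified RPN-style head
# predicting candidate boxes and scores, and torchvision NMS to keep target candidate boxes.
import torch
import torch.nn as nn
from torchvision.ops import nms

class ProposalHead(nn.Module):
    def __init__(self, in_channels, fc_channels=512, num_anchors=10):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, fc_channels, kernel_size=1)       # fully connected layer per position
        self.box_head = nn.Conv2d(fc_channels, num_anchors * 4, kernel_size=1)
        self.score_head = nn.Conv2d(fc_channels, num_anchors, kernel_size=1)

    def forward(self, sixth, iou_threshold=0.7):
        seventh = self.fc(sixth)                                            # seventh feature image
        # for this sketch the box outputs are assumed to already be (x1, y1, x2, y2) pixels;
        # anchor decoding is omitted
        boxes = self.box_head(seventh).permute(0, 2, 3, 1).reshape(-1, 4)   # text candidate boxes
        scores = self.score_head(seventh).permute(0, 2, 3, 1).reshape(-1).sigmoid()
        keep = nms(boxes, scores, iou_threshold)                            # non-maximum suppression
        return boxes[keep], scores[keep]                                    # target candidate boxes
```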
5. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the method of detecting text position in an image of a scene as claimed in any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310373895.2A CN116630755B (en) | 2023-04-10 | 2023-04-10 | Method, system and storage medium for detecting text position in scene image |
Publications (2)
Publication Number | Publication Date
---|---
CN116630755A (en) | 2023-08-22
CN116630755B (en) | 2024-04-02
Family
ID=87635463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310373895.2A CN116630755B (en) (Active) | Method, system and storage medium for detecting text position in scene image | 2023-04-10 | 2023-04-10
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630755B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299274A (en) * | 2018-11-07 | 2019-02-01 | Nanjing University | A natural scene text detection method based on fully convolutional neural networks
CN109711401A (en) * | 2018-12-03 | 2019-05-03 | Guangdong University of Technology | A text detection method for natural scene images based on Faster R-CNN
CN110110715A (en) * | 2019-04-30 | 2019-08-09 | Beijing Kingsoft Cloud Network Technology Co., Ltd. | Text detection model training method, and text region and content determination method and apparatus
AU2020101229A4 (en) * | 2020-07-02 | 2020-08-06 | South China University Of Technology | A Text Line Recognition Method in Chinese Scenes Based on Residual Convolutional and Recurrent Neural Networks
CN112070174A (en) * | 2020-09-11 | 2020-12-11 | Shanghai Maritime University | Text detection method in natural scenes based on deep learning
CN112418225A (en) * | 2020-10-16 | 2021-02-26 | Sun Yat-sen University | Offline character recognition method for address scene recognition
CN112926372A (en) * | 2020-08-22 | 2021-06-08 | Tsinghua University | Scene character detection method and system based on sequence deformation
CN113837168A (en) * | 2021-09-22 | 2021-12-24 | Yilianzhong Zhiding (Xiamen) Technology Co., Ltd. | Image text detection and OCR recognition method, device and storage medium
CN114494678A (en) * | 2021-12-02 | 2022-05-13 | National Computer Network and Information Security Management Center | Character recognition method and electronic equipment
WO2022147965A1 (en) * | 2021-01-09 | 2022-07-14 | Jiangsu Tuoyou Information Intelligent Technology Research Institute Co., Ltd. | Arithmetic question marking system based on MixNet-YOLOv3 and convolutional recurrent neural network (CRNN)
WO2023040068A1 (en) * | 2021-09-16 | 2023-03-23 | Huizhou Desay SV Automotive Electronics Co., Ltd. | Perception model training method, and perception model-based scene perception method
CN115909336A (en) * | 2021-08-17 | 2023-04-04 | Tencent Technology (Shenzhen) Co., Ltd. | Text recognition method and device, computer equipment and computer-readable storage medium
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230036812A1 (en) * | 2020-01-17 | 2023-02-02 | Microsoft Technology Licensing, Llc | Text Line Detection |
Non-Patent Citations (3)
Title |
---|
A Deep Learning Framework for Recognizing Vertical Texts in Natural Scene; Yi Ling Ong et al.; IEEE; full text *
Detecting Text in Natural Image with Connectionist Text Proposal Network; Zhi Tian et al.; arXiv; 2016; full text *
Research on Scene Text Detection Technology Based on Deep Learning; Li Xiangxiang; China Master's Theses Full-text Database; pp. 11-23 *
Similar Documents
Publication | Title
---|---
CN109086756B (en) | Text detection analysis method, device and equipment based on deep neural network
CN110390251B (en) | Image and character semantic segmentation method based on multi-neural-network model fusion processing
CN110490081B (en) | Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN112818975B (en) | Text detection model training method and device, text detection method and device
US20190019055A1 (en) | Word segmentation system, method and device
CN110782420A (en) | Small target feature representation enhancement method based on deep learning
CN113326380B (en) | Equipment measurement data processing method, system and terminal based on deep neural network
Mor et al. | Confidence prediction for lexicon-free OCR
CN111680753A (en) | Data labeling method and device, electronic equipment and storage medium
CN113657098B (en) | Text error correction method, device, equipment and storage medium
CN112991280B (en) | Visual detection method, visual detection system and electronic equipment
Silanon | Thai Finger-Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features
Alon et al. | Deep-hand: a deep inference vision approach of recognizing a hand sign language using american alphabet
Liu et al. | Scene text recognition with CNN classifier and WFST-based word labeling
CN114882204A (en) | Automatic ship name recognition method
CN117593514B (en) | Image target detection method and system based on deep principal component analysis assistance
CN112418207B (en) | Weak supervision character detection method based on self-attention distillation
CN117115565B (en) | Autonomous perception-based image classification method and device and intelligent terminal
CN114266308A (en) | Detection model training method and device, and image detection method and device
Kumari et al. | A comprehensive handwritten paragraph text recognition system: Lexiconnet
Annisa et al. | Analysis and Implementation of CNN in Real-time Classification and Translation of Kanji Characters
CN113177511A (en) | Rotating frame intelligent perception target detection method based on multiple data streams
CN116630755B (en) | Method, system and storage medium for detecting text position in scene image
CN114022684B (en) | Human body posture estimation method and device
CN116071544A (en) | Image description prediction method oriented to weak supervision directional visual understanding
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant