CN116630755B - Method, system and storage medium for detecting text position in scene image - Google Patents
- Publication number
- CN116630755B CN202310373895.2A
- Authority
- CN
- China
- Prior art keywords
- training
- scene
- text position
- training scene
- position detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a method, a system and a storage medium for detecting text positions in a scene image, wherein the method comprises the following steps: training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; the preset text position detection model is used for fusing, through a feature fusion module, the different-scale features of the image extracted by a feature extraction module, and for predicting the text position of the image by sequentially adopting a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network; and inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain the recognition result. The invention supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features so that particularly large or particularly small characters are not missed; the text position in a scene image can be detected more accurately and more quickly.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, system, and storage medium for detecting text positions in a scene image.
Background
Reading text in natural scene images has recently attracted more and more attention in the field of computer vision. In many practical applications, however, the large variance of text patterns and highly cluttered backgrounds pose major challenges for accurate text localization. Accurate scene text detection helps improve the precision and efficiency of text recognition and helps extend text recognition to more application scenarios.
Current text detection methods mainly adopt a bottom-up recognition pipeline, which usually starts from low-level character or stroke detection and then passes through complicated steps such as non-text filtering, text-line construction and text-line verification before finally locating the region where the target text lies. These multi-step bottom-up approaches are often complex, less robust and less reliable, and their performance depends heavily on the results of character detection. Other neural-network algorithms, mainly based on connected-component or sliding-window methods, also exploit low-level features to distinguish text candidates from the background; however, identifying individual strokes or characters in isolation, without contextual information, is not robust. On the one hand, the context among multiple characters provides reasoning clues for blurred fonts and thus helps recognize blurred characters. On the other hand, discarding such context typically yields a large number of non-text components in character detection, which causes major difficulties in the subsequent steps. Moreover, these false detections tend to accumulate continuously during the bottom-up recognition process, so that the final recognition result cannot meet requirements.
Generally, the model algorithms used for scene text detection at the present stage have drawbacks in the following aspects: they ignore context, they rely too heavily on low-level text features, and a generic object detection system using an RPN (Region Proposal Network) is difficult to apply directly to scene text detection, which generally requires higher localization accuracy.
Therefore, there is a need for a technical solution that solves the above technical problems.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method, a system and a storage medium for detecting text positions in a scene image.
The technical scheme of the method for detecting the text position in the scene image is as follows:
training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; any target training scene image contains text information, and the preset text position detection model comprises: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The method for detecting the text position in the scene image has the following beneficial effects:
the method of the invention supports the picture with any shape as input, is not affected by the picture with small resolution, can extract the character features of multiple scales, and can not miss the characters with special large or small size. By taking the context relation of the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence characteristics of the characters, the text position in the scene image can be detected more accurately and more rapidly.
On the basis of the scheme, the method for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and returning to the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
Further, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Further, the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample.
Further, the method further comprises the following steps:
acquiring original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively; wherein the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
The technical scheme of the system for detecting the text position in the scene image is as follows:
comprising the following steps: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The system for detecting the text position in the scene image has the following beneficial effects:
the system of the invention supports the picture with any shape as input, is not influenced by the picture with small resolution, can extract the character features of multiple scales, and does not miss particularly large or particularly small characters. By taking the context relation of the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence characteristics of the characters, the text position in the scene image can be detected more accurately and more rapidly.
Based on the scheme, the system for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical scheme of the storage medium is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of a method of detecting text position in an image of a scene as in the present invention.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 2 is a schematic structural diagram of a preset text position detection model in an embodiment of a method for detecting text positions in a scene image according to the present invention;
FIG. 3 is a flow chart illustrating step 110 in an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention.
Detailed Description
Fig. 1 is a flow chart of an embodiment of a method for detecting text position in a scene image according to the present invention. As shown in fig. 1, the method comprises the steps of:
step 110: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model.
Wherein, (1) as shown in fig. 2, the preset text position detection model includes: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged. (2) The preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network. (3) A target training scene image is a scene image of a training scene used for training the preset text position detection model, and it contains text information. (4) The target text position detection model is the text position detection model obtained by training. (5) Any target training scene image contains text information.
Step 120: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Wherein, (1) the scene to be detected is the scene that needs to be detected in this embodiment. (2) The scene image to be detected is a scene image captured in the scene to be detected. (3) The text position recognition result is the position of each predicted text box containing text content in the scene image to be detected.
It should be noted that (1) a scene image to be detected may have one predicted text box or a plurality of predicted text boxes; the specific number is determined by the text content in the scene image. (2) In the detection process, one scene image of the scene to be detected may be input, or a plurality of scene images of the scene to be detected may be input at the same time, which is not limited here.
Preferably, any training scene corresponds to at least one target training scene image; as shown in fig. 3, step 110 includes:
step 111: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model.
Wherein, (1) the training detection result comprises the detection result of at least one predicted text box corresponding to the training scene. (2) The training label image is an image obtained by labeling, before model training, the text boxes present in the training scene. (3) The target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample. (4) The target loss value is used to represent the degree of difference between the predicted text boxes and the labeled text boxes.
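For illustration only, the sketch below shows one way such a combined objective could be computed in PyTorch, summing an IoU-based box term and a text/non-text classification term; the exact weighting and averaging scheme of the patented loss is not reproduced here, and the function names and the equal weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """1 - IoU averaged over matched (x1, y1, x2, y2) box pairs; both inputs have shape (M, 4)."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-6)
    return (1.0 - iou).mean()

def detection_loss(pred_boxes, gt_boxes, text_logits, text_labels):
    """Combined objective: box regression term plus text / non-text classification term."""
    box_term = iou_loss(pred_boxes, gt_boxes)
    cls_term = F.binary_cross_entropy_with_logits(text_logits, text_labels.float())
    return box_term + cls_term
```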
Step 112: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: the training iteration number reaches the maximum iteration number or the model loss function converges.
Wherein, (1) the preset training conditions are: the training iteration number reaches the maximum iteration number or the model loss function converges. (2) The first text position detection model is the text position detection model obtained in the current iteration of the training process.
Step 113A: and when the judgment result is yes, determining the first text position detection model as the target text position detection model.
Step 113B: and when the judging result is negative, taking the first text position detection model as the preset text position detection model and returning to execute step 111, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
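A minimal training-loop sketch reflecting steps 111 to 113B, assuming a PyTorch model, an externally supplied optimizer and loss function, and a simple loss-difference convergence test (the optimizer choice and the convergence tolerance are assumptions):

```python
import torch

def train(model, optimizer, scene_batches, loss_fn, max_iters: int, tol: float = 1e-4):
    """Optimize the preset model until the iteration limit is reached or the loss converges."""
    prev_loss = float("inf")
    for it in range(max_iters):
        total_loss = 0.0
        for images, labels in scene_batches:   # one batch per training scene
            preds = model(images)              # training detection results
            loss = loss_fn(preds, labels)      # target loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if abs(prev_loss - total_loss) < tol:  # loss function has converged
            break
        prev_loss = total_loss
    return model                               # target text position detection model
```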
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
Preferably, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
and inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene.
Wherein, (1) the first feature image is an image of N×C×W×H, where C represents the channels of the first feature image, W represents the width of the first feature image, and H represents the height of the first feature image. (2) Each downsampling layer corresponds to a different degree of downsampling. (3) Each second feature image is a feature image of a specific scale; the plurality of second feature images correspond to feature images of different scales.
It should be noted that, the process of feature extraction of the image through the network of Resnet-34 is the prior art, and detailed processes are not repeated here.
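As a hedged illustration of this step, the sketch below uses the torchvision ResNet-34 truncated before its pooling and classification head as the feature extractor and stride-2 max-pooling layers as the plurality of different downsampling layers; the number of scales and the truncation point are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class MultiScaleExtractor(nn.Module):
    def __init__(self, num_scales: int = 3):
        super().__init__()
        base = resnet34(weights=None)
        # Keep the convolutional body only (drop avgpool / fc); its output is N x 512 x H x W.
        self.body = nn.Sequential(*list(base.children())[:-2])
        # Each downsampling layer reduces the resolution by a different factor.
        self.downsamples = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=2 ** k, stride=2 ** k) for k in range(1, num_scales + 1)]
        )

    def forward(self, x):
        first = self.body(x)                                  # first feature image
        return first, [d(first) for d in self.downsamples]   # second feature images of different scales

# Example: a batch of 2 RGB scene images.
extractor = MultiScaleExtractor()
first_feat, scales = extractor(torch.randn(2, 3, 256, 256))
```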
And based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene.
Wherein, (1) the feature fusion function is the Concat function, which fuses the second feature images of different scales across channels. (2) The third feature image is the image obtained by fusing the second feature images of different scales. (3) The normalization layer, the activation function layer and the 3×3 convolution layer are applied in combination to restore the channel number of the third feature image to the original channel number C, thereby obtaining a fourth feature image of N×C×W×H.
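A sketch of the fusion step, assuming the differently scaled second feature images are interpolated back to a common resolution before the channel-wise Concat (the patent does not spell out how the scales are aligned), followed by the normalization layer, activation function layer and 3×3 convolution layer that restore the original channel count C:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        total = channels * num_scales
        self.bn = nn.BatchNorm2d(total)                                     # normalization layer
        self.act = nn.ReLU(inplace=True)                                    # activation function layer
        self.conv = nn.Conv2d(total, channels, kernel_size=3, padding=1)   # restore C channels

    def forward(self, second_feats):
        # Resize every scale to the first (largest) one, then fuse across channels (Concat).
        h, w = second_feats[0].shape[-2:]
        aligned = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                   for f in second_feats]
        third = torch.cat(aligned, dim=1)            # third feature image
        return self.conv(self.act(self.bn(third)))   # fourth feature image, N x C x H x W

fusion = FusionBlock(channels=512, num_scales=3)
fourth = fusion([torch.randn(2, 512, 16, 16),
                 torch.randn(2, 512, 8, 8),
                 torch.randn(2, 512, 4, 4)])
```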
And inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene.
Wherein, (1) the preset size is 3×3. (2) The fifth feature image is a feature image of N×9C×H×W.
Specifically, a 3×3 sliding window process is performed on the fourth feature image, that is, each point combines the features of its surrounding 3×3 region to obtain a feature vector of length 3×3×C, and a fifth feature image of N×9C×H×W is finally output.
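One possible realization of this sliding-window step is torch.nn.functional.unfold with a 3×3 kernel and padding 1, which gathers the 3×3 neighbourhood of every position into a 9C-dimensional feature vector; using unfold here is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def sliding_window_3x3(fourth: torch.Tensor) -> torch.Tensor:
    """N x C x H x W -> N x 9C x H x W, each position carrying its 3x3 neighbourhood."""
    n, c, h, w = fourth.shape
    cols = F.unfold(fourth, kernel_size=3, padding=1)   # N x 9C x (H*W)
    return cols.view(n, 9 * c, h, w)                    # fifth feature image

fifth = sliding_window_3x3(torch.randn(2, 512, 16, 16))
print(fifth.shape)  # torch.Size([2, 4608, 16, 16])
```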
And inputting the fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene.
Wherein, (1) the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length. (2) The first matrix conversion module converts the fifth feature image into a feature map of (N×H)×W×9C, i.e., the first intermediate feature map. (3) The second intermediate feature map is of size (N×H)×W×256. (4) The second matrix conversion module converts the second intermediate feature map into a feature map of N×256×H×W, i.e., the sixth feature image.
It should be noted that, the process of acquiring the corresponding feature map by using the bidirectional LSTM model for learning the sequence features of each line in the image is the prior art, and will not be repeated here.
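A sketch of the first matrix conversion, bidirectional LSTM and second matrix conversion chain, assuming a hidden size of 128 per direction so that the concatenated forward and backward outputs yield the 256 channels mentioned above:

```python
import torch
import torch.nn as nn

class BiLSTMBlock(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, bidirectional=True, batch_first=True)

    def forward(self, fifth: torch.Tensor) -> torch.Tensor:
        n, c, h, w = fifth.shape                                   # N x 9C x H x W
        # First matrix conversion: every image row becomes a sequence of length W (Batch = N*H, T_max = W).
        seq = fifth.permute(0, 2, 3, 1).reshape(n * h, w, c)       # (N*H) x W x 9C
        out, _ = self.lstm(seq)                                    # (N*H) x W x 256, second intermediate feature map
        # Second matrix conversion: back to an image-shaped tensor.
        return out.reshape(n, h, w, 2 * self.lstm.hidden_size).permute(0, 3, 1, 2)  # N x 256 x H x W

sixth = BiLSTMBlock(in_channels=4608)(torch.randn(2, 4608, 16, 16))
```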
And inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Wherein, (1) the seventh feature image is a feature map of N×512×H×W. (2) The non-maximum suppression algorithm is used to preserve the text candidate boxes with the highest probability.
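A hedged sketch of the final stage: a per-position projection to 512 channels standing in for the full link layer, a small RPN-style head predicting a text score and box offsets for k anchors at every position, and torchvision's non-maximum suppression for keeping the highest-probability candidates; the anchor count k and the use of 1×1 convolutions are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class TextProposalHead(nn.Module):
    def __init__(self, in_channels: int = 256, hidden: int = 512, num_anchors: int = 10):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, hidden, kernel_size=1)        # "full link" applied per position
        self.score = nn.Conv2d(hidden, num_anchors, kernel_size=1)     # text / non-text score per anchor
        self.bbox = nn.Conv2d(hidden, num_anchors * 4, kernel_size=1)  # box offsets per anchor

    def forward(self, sixth: torch.Tensor):
        seventh = torch.relu(self.fc(sixth))   # seventh feature image, N x 512 x H x W
        return self.score(seventh), self.bbox(seventh)

def keep_best_candidates(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """Non-maximum suppression over decoded candidate boxes (x1, y1, x2, y2)."""
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]

head = TextProposalHead()
scores, offsets = head(torch.randn(2, 256, 16, 16))
```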
Preferably, the method further comprises:
the method comprises the steps of obtaining original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively.
Wherein, the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
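A minimal sketch of such a preprocessing filter, assuming OpenCV is available and using the mean grey level to flag overexposure and the variance of the Laplacian to flag blur; the thresholds are illustrative assumptions rather than values from the patent.

```python
import cv2

def is_usable(path: str, max_brightness: float = 240.0, min_sharpness: float = 50.0) -> bool:
    """Reject unreadable, overexposed, or blurred original scene images."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:                      # incomplete / unreadable image
        return False
    if img.mean() > max_brightness:      # overexposed image
        return False
    if cv2.Laplacian(img, cv2.CV_64F).var() < min_sharpness:  # blurred image
        return False
    return True
```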
In this embodiment, after the text position recognition result of the scene image to be detected is obtained, a corresponding text recognition method may be further used to recognize text content in the target text box, so as to obtain text information in the target text box.
The technical solution of this embodiment supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features, so that particularly large or particularly small characters are not missed. By taking the contextual relations within the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
Fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention. As shown in fig. 4, the system 200 includes: a training unit 210 and a detection unit 220.
The training unit 210 is configured to: train a preset text position detection model based on target training scene images corresponding to a plurality of training scenes respectively, to obtain a target text position detection model; any target training scene image contains text information, and the preset text position detection model comprises: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit 220 is configured to: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Preferably, any training scene corresponds to at least one target training scene image; the training unit 210 includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module includes: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical solution of this embodiment supports images of arbitrary shape as input, is not affected by low-resolution images, and extracts multi-scale character features, so that particularly large or particularly small characters are not missed. By taking the contextual relations within the scene image into consideration and adopting the bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
For the steps by which the parameters and modules in the system 200 for detecting the text position in a scene image of this embodiment implement their corresponding functions, reference may be made to the above embodiments of the method for detecting the text position in a scene image, and they are not described herein again.
The storage medium provided by the embodiment of the invention stores instructions which, when read by a computer, cause the computer to perform the steps of the above method for detecting the text position in a scene image; specific reference may be made to the parameters and steps in the above embodiments of the method, which are not described herein again.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. Wherein the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.
Claims (5)
1. A method of detecting text position in an image of a scene, comprising:
training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
when the judgment result is negative, the first text position detection model is used as the preset text position detection model, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model is carried out in a returning mode, and when the judgment result is positive, the first text position detection model is determined to be the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene comprises the following steps:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
2. The method of detecting text position in an image of a scene of claim 1, wherein the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i is the number of target candidate boxes of the i-th training scene; U_i is the intersection-over-union loss between the target candidate boxes of the i-th training scene and the labeled text boxes of the corresponding training label image; N_s is the number of target training scene images of all training scenes; a positive-sample count denotes the number of target candidate boxes containing text positive samples in the i-th training scene; IOU denotes the intersection-over-union ratio of a target candidate box and a labeled text box; and a classification-loss term indicates whether a target candidate box contains a text positive sample.
3. The method of detecting text position in an image of a scene as recited in claim 1 or 2, further comprising:
acquiring original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively; wherein the preprocessing comprises the following steps: and eliminating the original scene image with overexposure, incomplete image and blurred image.
4. A system for detecting text position in an image of a scene, comprising: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: when the judgment result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judgment result is positive, and determining the first text position detection model as the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the first training unit is specifically configured to:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
the any one is subjected toA fifth characteristic image corresponding to a training scene is input to the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, the first intermediate characteristic image of the training scene is input to the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and the second intermediate characteristic image of the training scene is input to the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein, the preset input conditions are as follows: batch=nh and T max Data flow of =w, batch is the Batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, T max Is the maximum length of time;
and inputting the sixth feature image corresponding to any training scene into the fully connected layer for conversion to obtain a seventh feature image corresponding to the training scene, inputting the seventh feature image into the RPN network to obtain at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as the training detection result of the training scene.
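Finally, a sketch of the fully connected conversion, an RPN-style candidate-box prediction and the non-maximum suppression step; the anchor count, the 1×1 convolution standing in for the fully connected layer, and the omission of anchor decoding are assumptions of the example, not the patent's exact network.

```python
# Sketch: a per-position fully connected layer (1x1 conv), a simplified RPN-style head
# predicting candidate boxes and scores, and torchvision NMS to keep target candidate boxes.
import torch
import torch.nn as nn
from torchvision.ops import nms

class ProposalHead(nn.Module):
    def __init__(self, in_channels, fc_channels=512, num_anchors=10):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, fc_channels, kernel_size=1)       # fully connected layer per position
        self.box_head = nn.Conv2d(fc_channels, num_anchors * 4, kernel_size=1)
        self.score_head = nn.Conv2d(fc_channels, num_anchors, kernel_size=1)

    def forward(self, sixth, iou_threshold=0.7):
        seventh = self.fc(sixth)                                            # seventh feature image
        # for this sketch the box outputs are assumed to already be (x1, y1, x2, y2) pixels;
        # anchor decoding is omitted
        boxes = self.box_head(seventh).permute(0, 2, 3, 1).reshape(-1, 4)   # text candidate boxes
        scores = self.score_head(seventh).permute(0, 2, 3, 1).reshape(-1).sigmoid()
        keep = nms(boxes, scores, iou_threshold)                            # non-maximum suppression
        return boxes[keep], scores[keep]                                    # target candidate boxes
```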
5. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the method of detecting text position in an image of a scene as claimed in any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310373895.2A CN116630755B (en) | 2023-04-10 | 2023-04-10 | Method, system and storage medium for detecting text position in scene image |
Publications (2)
Publication Number | Publication Date
---|---
CN116630755A (en) | 2023-08-22
CN116630755B (en) | 2024-04-02
Family
ID=87635463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310373895.2A CN116630755B (en) (Active) | Method, system and storage medium for detecting text position in scene image | 2023-04-10 | 2023-04-10
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630755B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299274A (en) * | 2018-11-07 | 2019-02-01 | Nanjing University | A natural scene text detection method based on fully convolutional neural networks
CN109711401A (en) * | 2018-12-03 | 2019-05-03 | Guangdong University of Technology | A text detection method for natural scene images based on Faster R-CNN
CN110110715A (en) * | 2019-04-30 | 2019-08-09 | Beijing Kingsoft Cloud Network Technology Co., Ltd. | Text detection model training method, and text region and content determination method and apparatus
AU2020101229A4 (en) * | 2020-07-02 | 2020-08-06 | South China University Of Technology | A Text Line Recognition Method in Chinese Scenes Based on Residual Convolutional and Recurrent Neural Networks
CN112070174A (en) * | 2020-09-11 | 2020-12-11 | Shanghai Maritime University | Text detection method in natural scenes based on deep learning
CN112418225A (en) * | 2020-10-16 | 2021-02-26 | Sun Yat-sen University | Offline character recognition method for address scene recognition
CN112926372A (en) * | 2020-08-22 | 2021-06-08 | Tsinghua University | Scene character detection method and system based on sequence deformation
CN113837168A (en) * | 2021-09-22 | 2021-12-24 | Yilianzhong Zhiding (Xiamen) Technology Co., Ltd. | Image text detection and OCR recognition method, device and storage medium
CN114494678A (en) * | 2021-12-02 | 2022-05-13 | National Computer Network and Information Security Management Center | Character recognition method and electronic equipment
WO2022147965A1 (en) * | 2021-01-09 | 2022-07-14 | Jiangsu Tuoyou Information Intelligent Technology Research Institute Co., Ltd. | Arithmetic question marking system based on MixNet-YOLOv3 and convolutional recurrent neural network (CRNN)
WO2023040068A1 (en) * | 2021-09-16 | 2023-03-23 | Huizhou Desay SV Automotive Electronics Co., Ltd. | Perception model training method, and perception model-based scene perception method
CN115909336A (en) * | 2021-08-17 | 2023-04-04 | Tencent Technology (Shenzhen) Co., Ltd. | Text recognition method and device, computer equipment and computer-readable storage medium
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230036812A1 (en) * | 2020-01-17 | 2023-02-02 | Microsoft Technology Licensing, Llc | Text Line Detection |
Non-Patent Citations (3)
Title |
---|
A Deep Learning Framework for Recognizing Vertical Texts in Natural Scene; Yi Ling Ong et al.; IEEE; full text *
Detecting Text in Natural Image with Connectionist Text Proposal Network; Zhi Tian et al.; arXiv; 2016; full text *
Research on Scene Text Detection Technology Based on Deep Learning; Li Xiangxiang; China Master's Theses Full-text Database; pp. 11-23 *
Similar Documents
Publication | Title
---|---
CN109086756B (en) | Text detection analysis method, device and equipment based on deep neural network
CN110390251B (en) | Image and character semantic segmentation method based on multi-neural-network model fusion processing
CN110490081B (en) | Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN112818975B (en) | Text detection model training method and device, text detection method and device
US20190019055A1 (en) | Word segmentation system, method and device
CN110782420A (en) | Small target feature representation enhancement method based on deep learning
CN113326380B (en) | Equipment measurement data processing method, system and terminal based on deep neural network
Mor et al. | Confidence prediction for lexicon-free OCR
CN111680753A (en) | Data labeling method and device, electronic equipment and storage medium
CN113657098B (en) | Text error correction method, device, equipment and storage medium
CN112991280B (en) | Visual detection method, visual detection system and electronic equipment
Silanon | Thai Finger-Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features
Alon et al. | Deep-hand: a deep inference vision approach of recognizing a hand sign language using american alphabet
Liu et al. | Scene text recognition with CNN classifier and WFST-based word labeling
CN114882204A (en) | Automatic ship name recognition method
CN117593514B (en) | Image target detection method and system based on deep principal component analysis assistance
CN112418207B (en) | Weak supervision character detection method based on self-attention distillation
CN117115565B (en) | Autonomous perception-based image classification method and device and intelligent terminal
CN114266308A (en) | Detection model training method and device, and image detection method and device
Kumari et al. | A comprehensive handwritten paragraph text recognition system: Lexiconnet
Annisa et al. | Analysis and Implementation of CNN in Real-time Classification and Translation of Kanji Characters
CN113177511A (en) | Rotating frame intelligent perception target detection method based on multiple data streams
CN116630755B (en) | Method, system and storage medium for detecting text position in scene image
CN114022684B (en) | Human body posture estimation method and device
CN116071544A (en) | Image description prediction method oriented to weak supervision directional visual understanding
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant