CN116630755B - Method, system and storage medium for detecting text position in scene image - Google Patents

Method, system and storage medium for detecting text position in scene image

Info

Publication number
CN116630755B
CN116630755B (application CN202310373895.2A)
Authority
CN
China
Prior art keywords
training
scene
text position
training scene
position detection
Prior art date
Legal status
Active
Application number
CN202310373895.2A
Other languages
Chinese (zh)
Other versions
CN116630755A (en)
Inventor
马宗润
李�浩
黄向生
Current Assignee
Xiong'an Innovation Research Institute
Original Assignee
Xiong'an Innovation Research Institute
Priority date
Filing date
Publication date
Application filed by Xiong'an Innovation Research Institute
Priority to CN202310373895.2A
Publication of CN116630755A
Application granted
Publication of CN116630755B
Active legal status (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system and a storage medium for detecting text positions in a scene image. The method comprises the following steps: training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes to obtain a target text position detection model, where the preset text position detection model fuses, through a feature fusion module, the different-scale features of the image extracted by a feature extraction module, and then predicts the text position of the image using, in sequence, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network; and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain the recognition result. The invention accepts images of arbitrary shape as input, is not affected by low-resolution images, extracts character features at multiple scales so that unusually large or small characters are not missed, and detects the text position in a scene image more accurately and more quickly.

Description

Method, system and storage medium for detecting text position in scene image
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, system, and storage medium for detecting text positions in a scene image.
Background
Reading text in natural scene images has recently attracted increasing attention in the field of computer vision. In many practical applications, however, the large variance of text patterns and highly cluttered backgrounds constitute major challenges for accurate text localization. Accurate scene text detection helps improve the precision and efficiency of text recognition and helps extend text recognition to more application scenarios.
Current text detection methods mainly adopt a bottom-up recognition pipeline, which usually starts from low-level character or stroke detection and then goes through complicated steps such as non-text filtering, text-line construction and text-line verification before the region containing the target text is finally detected. These multi-step bottom-up approaches are often complex, less robust and less reliable, and their performance depends heavily on the results of character detection. Other neural network algorithms, based mainly on connected-component or sliding-window methods, also rely on low-level features to distinguish text candidates from the background; however, without contextual information they are not robust when identifying individual strokes or characters in isolation. On the one hand, the contextual relation among multiple texts provides reasoning clues for blurry fonts and thereby helps recognize blurred characters. On the other hand, such text defects typically produce a large number of non-text components during character detection, which causes major difficulties in the subsequent steps. In addition, these false detections tend to accumulate throughout the bottom-up recognition process, so the recognition result cannot meet the requirements.
In general, the model algorithms for scene text detection at the present stage have shortcomings in the following respects: they ignore context, they rely too heavily on low-level text features, and a general object detection system using an RPN (Region Proposal Network) is difficult to apply directly to scene text detection, which generally requires higher localization accuracy.
Therefore, a technical solution is needed to solve the above technical problems.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method, a system and a storage medium for detecting text positions in a scene image.
The technical scheme of the method for detecting the text position in the scene image is as follows:
training a preset text position detection model based on target training scene images corresponding to a plurality of training scenes to obtain a target text position detection model; each target training scene image contains text information, and the preset text position detection model comprises: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are arranged in sequence; the preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially applying the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The method for detecting the text position in the scene image has the following beneficial effects:
The method of the invention accepts images of arbitrary shape as input, is not affected by low-resolution images, can extract character features at multiple scales, and does not miss unusually large or unusually small characters. By taking the contextual relations in the scene image into account and using a bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
On the basis of the scheme, the method for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
and when the judging result is negative, taking the first text position detection model as the preset text position detection model and returning to the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are arranged in sequence.
Further, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Further, the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i denotes the number of target candidate boxes of the i-th training scene; U_i denotes the intersection-over-union loss between the target candidate boxes of the i-th training scene and the annotated text boxes of the corresponding training label image; N_s denotes the number of target training scene images over all training scenes; the number of target candidate boxes containing text positive samples in the i-th training scene also enters the loss; IOU denotes the intersection-over-union between a target candidate box and an annotated text box; and the classification loss of whether a target candidate box contains a text positive sample forms the remaining term.
Further, the method further comprises the following steps:
acquiring original training scene images corresponding to a plurality of training scenes, and preprocessing the original training scene images of each training scene to obtain target training scene images corresponding to the plurality of training scenes; wherein the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
The technical scheme of the system for detecting the text position in the scene image is as follows:
comprising the following steps: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
The system for detecting the text position in the scene image has the following beneficial effects:
The system of the invention accepts images of arbitrary shape as input, is not affected by low-resolution images, can extract character features at multiple scales, and does not miss unusually large or unusually small characters. By taking the contextual relations in the scene image into account and using a bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
Based on the scheme, the system for detecting the text position in the scene image can be improved as follows.
Further, any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: when the judging result is negative, taking the first text position detection model as the preset text position detection model and invoking the first training unit again, until the judging result is positive and the first text position detection model is determined to be the target text position detection model.
Further, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical scheme of the storage medium is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of a method of detecting text position in an image of a scene as in the present invention.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 2 is a schematic structural diagram of a preset text position detection model in an embodiment of a method for detecting text positions in a scene image according to the present invention;
FIG. 3 is a flow chart illustrating step 110 in an embodiment of a method for detecting text position in an image of a scene provided by the present invention;
fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention.
Detailed Description
Fig. 1 is a flow chart of an embodiment of a method for detecting text position in a scene image according to the present invention. As shown in fig. 1, the method comprises the steps of:
step 110: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model.
Wherein, (1) as shown in fig. 2, the preset text position detection model includes: a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are arranged in sequence. (2) The preset text position detection model is used for: fusing, through the feature fusion module, the different-scale features of the image extracted by the feature extraction module, and predicting the text position of the image by sequentially applying the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network. (3) A target training scene image is a scene image of a training scene used to train the preset text position detection model, and it contains text information. (4) The target text position detection model is the text position detection model obtained after training. (5) Every target training scene image contains text information.
Step 120: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Wherein, (1) the scene to be detected is the scene in which text position detection is required in this embodiment. (2) The scene image to be detected is a scene image captured in the scene to be detected. (3) The text position recognition result is the position of each predicted text box containing text content in the scene image to be detected.
It should be noted that (1) a scene image to be detected may contain one predicted text box or several predicted text boxes; the specific number is determined by the text content in the scene image. (2) During detection, a single scene image of the scene to be detected may be input, or several scene images of the scene to be detected may be input at the same time; no limitation is imposed here.
Preferably, any training scene corresponds to at least one target training scene image; as shown in fig. 3, step 110 includes:
step 111: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model.
Wherein, (1) the training detection result comprises the detection result of at least one predicted text box corresponding to the training scene. (2) A training label image is an image obtained before model training by annotating, in advance, the text boxes present in the training scene. (3) The target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i denotes the number of target candidate boxes of the i-th training scene; U_i denotes the intersection-over-union loss between the target candidate boxes of the i-th training scene and the annotated text boxes of the corresponding training label image; N_s denotes the number of target training scene images over all training scenes; the number of target candidate boxes containing text positive samples in the i-th training scene also enters the loss; IOU denotes the intersection-over-union between a target candidate box and an annotated text box; and the classification loss of whether a target candidate box contains a text positive sample forms the remaining term. (4) The target loss value represents the degree of difference between the predicted text boxes and the annotated text boxes.
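The loss formula itself is not reproduced in the text; purely as a reading aid, one plausible arrangement of the quantities listed above is sketched in LaTeX below. The symbols U_{i,j}, S_i^{+} and L_{cls}^{(i,j)}, and the normalization of the two terms, are illustrative assumptions rather than the patent's own definition.

$$L \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{S_i} U_{i,j} \;+\; \frac{1}{N_s}\sum_{i=1}^{N}\sum_{j=1}^{S_i^{+}} L_{\mathrm{cls}}^{(i,j)},$$

where $U_{i,j}$ is the intersection-over-union loss between the $j$-th target candidate box of the $i$-th training scene and its annotated text box, $S_i^{+}$ is the number of target candidate boxes containing text positive samples in that scene, and $L_{\mathrm{cls}}^{(i,j)}$ is the classification loss of whether the candidate box contains a text positive sample.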
Step 112: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: the training iteration number reaches the maximum iteration number or the model loss function converges.
Wherein, (1) the preset training condition is that the number of training iterations reaches the maximum number of iterations or the model loss function converges. (2) The first text position detection model is the text position detection model obtained during the iterative training process.
Step 113A: and when the judgment result is yes, determining the first text position detection model as the target text position detection model.
Step 113B: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and returning to the execution step 111 until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises: a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are arranged in sequence.
Preferably, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene includes:
and inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene.
Wherein, (1) the first feature image is an image of size N×C×W×H, where C is the number of channels of the first feature image, W is its width and H is its height. (2) Each downsampling layer corresponds to a different degree of downsampling. (3) A second characteristic image is a feature image at a specific scale; the plurality of second feature images are feature images at different scales.
It should be noted that extracting features from an image with a Resnet-34 network is prior art, and its detailed procedure is not repeated here.
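A minimal sketch of the feature extraction module is given below, using torchvision's resnet34 as a stand-in for the Resnet-34 network; the number of downsampling layers and their pooling configurations are illustrative assumptions, not the patent's configuration.

```python
# Sketch of the feature extraction module: a ResNet-34 backbone produces the first
# feature image, and several different downsampling layers produce second feature
# images at different scales.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class FeatureExtraction(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Keep only the convolutional stages; drop the average pool and the fully connected head.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # 512-channel feature maps
        # Two downsampling layers with different strides (illustrative choice).
        self.downsamples = nn.ModuleList([
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.MaxPool2d(kernel_size=4, stride=4),
        ])

    def forward(self, x):
        first = self.backbone(x)                         # first feature image
        seconds = [d(first) for d in self.downsamples]   # second feature images, different scales
        return first, seconds

# Example: a batch of two 3-channel scene images (arbitrary input size is supported).
f = FeatureExtraction()
first, seconds = f(torch.randn(2, 3, 224, 320))
print(first.shape, [s.shape for s in seconds])
```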
And based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene.
Wherein, (1) the feature fusion function is the Concat function, which fuses the second characteristic images of different scales across channels. (2) The third feature image is the image obtained by fusing the second characteristic images of different scales. (3) The combined processing by the normalization layer, the activation function layer and the 3×3 convolution layer restores the number of channels of the third characteristic image to the original channel number C, thereby yielding a fourth feature image of size N×C×W×H.
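The feature fusion module can be sketched as follows. The choice of ReLU as the activation function and of bilinear resizing so that second feature images of different scales share a spatial size before the Concat are assumptions; the patent only fixes the Concat function, the normalization layer, the activation function layer and the convolution layer.

```python
# Sketch of the feature fusion module: cross-channel Concat of the second feature
# images, then normalization, activation and a 3x3 convolution restoring C channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, in_channels_total, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels_total)      # normalization layer
        self.act = nn.ReLU(inplace=True)                 # activation function layer (assumed ReLU)
        self.conv = nn.Conv2d(in_channels_total, out_channels,
                              kernel_size=3, padding=1)  # 3x3 convolution restores C channels

    def forward(self, second_images):
        target_size = second_images[0].shape[-2:]
        resized = [F.interpolate(s, size=target_size, mode="bilinear", align_corners=False)
                   for s in second_images]               # bring all scales to one spatial size
        third = torch.cat(resized, dim=1)                # cross-channel fusion (Concat) -> third image
        fourth = self.conv(self.act(self.bn(third)))     # fourth feature image, N x C x H x W
        return fourth

# Example with two second feature images of 512 channels each.
fusion = FeatureFusion(in_channels_total=2 * 512, out_channels=512)
x = [torch.randn(2, 512, 4, 6), torch.randn(2, 512, 2, 3)]
print(fusion(x).shape)   # torch.Size([2, 512, 4, 6])
```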
And inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene.
Wherein, (1) the preset size is 3×3. (2) The fifth feature image is a feature image of size N×9C×H×W.
Specifically, a 3×3 sliding window operation is performed on the fourth feature image, that is, each point combines the features of its surrounding 3×3 region to obtain a feature vector of length 3×3×C, and a fifth feature image of size N×9C×H×W is finally output.
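The 3×3 sliding window step can be expressed compactly with torch.nn.Unfold, as in the sketch below; zero padding at the border is an assumption that keeps H and W unchanged.

```python
# Sketch of the 3x3 sliding window: every spatial position gathers its 3x3 neighborhood,
# so a fourth feature image of N x C x H x W becomes a fifth feature image of N x 9C x H x W.
import torch
import torch.nn as nn

def sliding_window_3x3(fourth):
    n, c, h, w = fourth.shape
    unfold = nn.Unfold(kernel_size=3, padding=1)     # 3x3 window, zero padding at the border
    fifth = unfold(fourth)                           # N x (9C) x (H*W)
    return fifth.view(n, 9 * c, h, w)                # N x 9C x H x W

print(sliding_window_3x3(torch.randn(2, 512, 8, 11)).shape)  # torch.Size([2, 4608, 8, 11])
```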
And inputting the fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene.
Wherein, (1) the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length. (2) The first matrix conversion module converts the fifth characteristic image into a tensor of size (N×H)×W×9C, i.e., the first intermediate feature map. (3) The second intermediate feature map has size (N×H)×W×256. (4) The second matrix conversion module converts the second intermediate feature map into a feature map of size N×256×H×W, i.e., the sixth feature image.
It should be noted that using a bidirectional LSTM model to learn the sequence features of each row of the image and obtain the corresponding feature map is prior art, and is not repeated here.
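A sketch of the bidirectional LSTM module with its two matrix conversions is given below; the hidden size of 128 per direction (so that the concatenated output is 256-dimensional, matching the second intermediate feature map described above) is the only assumption beyond what is stated here.

```python
# Sketch of the bidirectional LSTM module: reshape N x 9C x H x W into a sequence batch
# with Batch = N*H and T_max = W, run a bidirectional LSTM, then reshape back to N x 256 x H x W.
import torch
import torch.nn as nn

class BiLSTMModule(nn.Module):
    def __init__(self, in_features, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_features, hidden, bidirectional=True, batch_first=True)

    def forward(self, fifth):                          # fifth: N x 9C x H x W
        n, c9, h, w = fifth.shape
        # First matrix conversion: (N*H) x W x 9C, i.e. Batch = N*H and T_max = W.
        first_inter = fifth.permute(0, 2, 3, 1).reshape(n * h, w, c9)
        second_inter, _ = self.bilstm(first_inter)     # (N*H) x W x 256
        # Second matrix conversion back to an image-shaped tensor: N x 256 x H x W.
        sixth = second_inter.reshape(n, h, w, 256).permute(0, 3, 1, 2)
        return sixth

m = BiLSTMModule(in_features=9 * 512)
print(m(torch.randn(2, 9 * 512, 8, 11)).shape)         # torch.Size([2, 256, 8, 11])
```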
And inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
Wherein, (1) the seventh feature image is a feature map of size N×512×H×W. (2) The non-maximum suppression algorithm is used to retain the text candidate boxes with the highest probability.
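A rough sketch of the final stage follows: the full link layer is modeled as a 1×1 convolution applied per position, an RPN-style head predicts text/non-text scores and box offsets for k anchors per position, and torchvision's non-maximum suppression retains the best candidate boxes. The anchor count k, the omitted anchor decoding and the 0.7 NMS threshold are assumptions.

```python
# Sketch of the full link layer, the RPN-style prediction head, and NMS over candidate boxes.
import torch
import torch.nn as nn
from torchvision.ops import nms

class TextRPNHead(nn.Module):
    def __init__(self, in_channels=256, mid_channels=512, k=10):
        super().__init__()
        self.fc = nn.Conv2d(in_channels, mid_channels, kernel_size=1)  # "full link layer" per position
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)       # text / non-text scores
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)       # box offsets per anchor

    def forward(self, sixth):                          # sixth: N x 256 x H x W
        seventh = torch.relu(self.fc(sixth))           # seventh feature image, N x 512 x H x W
        return self.cls(seventh), self.reg(seventh)

# After decoding anchors into candidate boxes (not shown), NMS keeps the best candidates:
boxes = torch.tensor([[10., 10., 60., 30.], [12., 11., 62., 31.], [100., 40., 150., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.7)
print(keep)   # indices of the retained text candidate boxes
```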
Preferably, the method further comprises:
the method comprises the steps of obtaining original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively.
Wherein, the preprocessing comprises: eliminating original scene images that are overexposed, incomplete or blurred.
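As an illustration of this preprocessing, the sketch below discards overexposed or blurred images using simple heuristics; the brightness and Laplacian-variance thresholds are assumptions, and the check for incomplete images is reduced to whether the file can be read at all.

```python
# Heuristic preprocessing sketch: keep only readable, non-overexposed, non-blurred images.
import cv2

def keep_image(path, bright_thresh=240.0, blur_thresh=100.0):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:                                           # unreadable / incomplete file
        return False
    if img.mean() > bright_thresh:                            # overexposed
        return False
    if cv2.Laplacian(img, cv2.CV_64F).var() < blur_thresh:    # blurred (low edge energy)
        return False
    return True

paths = ["scene_001.jpg", "scene_002.jpg"]
target_training_images = [p for p in paths if keep_image(p)]
```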
In this embodiment, after the text position recognition result of the scene image to be detected is obtained, a corresponding text recognition method may be further used to recognize text content in the target text box, so as to obtain text information in the target text box.
The technical scheme of this embodiment accepts images of arbitrary shape as input, is not affected by low-resolution images, can extract multi-scale character features, and does not miss unusually large or unusually small characters. By taking the contextual relations in the scene image into account and using a bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
Fig. 4 is a schematic structural diagram of an embodiment of a system for detecting text position in an image of a scene provided by the present invention. As shown in fig. 4, the system 200 includes: a training unit 210 and a detection unit 220.
The training unit 210 is configured to: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit 220 is configured to: and inputting a scene image to be detected corresponding to the scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected.
Preferably, any training scene corresponds to at least one target training scene image; the training unit 210 includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: and when the judging result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judging result is positive, and determining the first text position detection model as the target text position detection model.
Preferably, the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged.
The technical scheme of this embodiment accepts images of arbitrary shape as input, is not affected by low-resolution images, can extract multi-scale character features, and does not miss unusually large or unusually small characters. By taking the contextual relations in the scene image into account and using a bidirectional LSTM structure to acquire the sequence features of the characters, the text position in the scene image can be detected more accurately and more quickly.
The steps for implementing the corresponding functions by the parameters and the modules in the system 200 for detecting the text position in the scene image according to the present embodiment are referred to in the embodiments of the method for detecting the text position in the scene image according to the present embodiment, and are not described herein.
The storage medium provided by the embodiment of the invention comprises: the storage medium stores instructions that, when read by a computer, cause the computer to perform steps such as a method for detecting a text position in a scene image, and specific reference may be made to the parameters and steps in the above embodiments of a method for detecting a text position in a scene image, which are not described herein.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. Wherein the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (5)

1. A method of detecting text position in an image of a scene, comprising:
training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the step of training the preset text position detection model based on the target training scene images corresponding to the training scenes respectively to obtain the target text position detection model comprises the following steps:
inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
when the judgment result is negative, the first text position detection model is used as the preset text position detection model, the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model is carried out in a returning mode, and when the judgment result is positive, the first text position detection model is determined to be the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the step of inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain a training detection result of the training scene comprises the following steps:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
inputting a fifth characteristic image corresponding to any training scene into the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, inputting the first intermediate characteristic image of the training scene into the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and inputting the second intermediate characteristic image of the training scene into the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein the preset input condition is a data stream with Batch = N×H and T_max = W, where Batch is the batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, and T_max is the maximum time-step length;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
2. The method of detecting text position in an image of a scene of claim 1, wherein the target loss function is defined in terms of the following quantities: N is the number of target training scene images; S_i denotes the number of target candidate boxes of the i-th training scene; U_i denotes the intersection-over-union loss between the target candidate boxes of the i-th training scene and the annotated text boxes of the corresponding training label image; N_s denotes the number of target training scene images over all training scenes; the number of target candidate boxes containing text positive samples in the i-th training scene also enters the loss; IOU denotes the intersection-over-union between a target candidate box and an annotated text box; and the classification loss of whether a target candidate box contains a text positive sample forms the remaining term.
3. The method of detecting text position in an image of a scene as recited in claim 1 or 2, further comprising:
acquiring original training scene images corresponding to a plurality of training scenes respectively, and preprocessing the original training scene images of each training scene respectively to obtain target training scene images corresponding to the plurality of training scenes respectively; wherein the preprocessing comprises the following steps: and eliminating the original scene image with overexposure, incomplete image and blurred image.
4. A system for detecting text position in an image of a scene, comprising: a training unit and a detection unit;
the training unit is used for: training the preset text position detection model based on target training scene images corresponding to the training scenes respectively to obtain a target text position detection model; the text information is contained in any target training scene image, and the preset text position detection model comprises: the system comprises a feature extraction module, a feature fusion module, a sliding window module, a bidirectional LSTM module, a full link layer and an RPN network which are sequentially arranged; the preset text position detection model is used for: fusing the different scale features of the image extracted by the feature extraction module through the feature fusion module, and predicting the text position of the image by sequentially adopting the sliding window module, the bidirectional LSTM module, the full link layer and the RPN network;
the detection unit is used for: inputting a scene image to be detected corresponding to a scene to be detected into the target text position detection model to obtain a text position identification result of the scene to be detected;
any training scene corresponds to at least one target training scene image; the training unit includes: the system comprises a first training unit, a model optimizing unit, a first processing unit and a second processing unit;
the first training unit is used for: inputting all target training scene images corresponding to any training scene into the preset text position detection model to obtain training detection results of the training scenes until training detection results of each training scene are obtained, substituting the training detection results of each training scene and training label images into a target loss function of the preset text position detection model to obtain a target loss value of the preset text position detection model;
the model optimizing unit is used for: optimizing network parameters of the preset text position detection model based on the target loss value to obtain a first text position detection model, and judging whether the first text position detection model meets preset training conditions or not to obtain a judgment result; wherein, the preset training conditions are as follows: training iteration times reach the maximum iteration times or model loss function convergence;
the first processing unit is used for: when the judgment result is yes, the first text position detection model is determined to be the target text position detection model;
the second processing unit is used for: when the judgment result is negative, taking the first text position detection model as the preset text position detection model, and calling the first training unit back until the judgment result is positive, and determining the first text position detection model as the target text position detection model;
the feature extraction module includes: a Resnet-34 network and a plurality of different downsampling layers; the feature fusion module comprises the following components: the device comprises a feature fusion function, a normalization layer, an activation function layer and a convolution layer; the bidirectional LSTM module includes: the system comprises a first matrix conversion module, a bidirectional LSTM model and a second matrix conversion module which are sequentially arranged;
the first training unit is specifically configured to:
inputting all target training scene images corresponding to any training scene into the Resnet-34 network for feature extraction to obtain a first feature image corresponding to the training scene, and respectively inputting the first feature image to the plurality of different downsampling layers to obtain a plurality of second feature images with different scales corresponding to the training scene;
based on the feature fusion function, performing cross-channel fusion on all the second feature images corresponding to any training scene to obtain a third feature image corresponding to the training scene, and sequentially inputting the normalization layer, the activation function layer and the convolution layer for processing to obtain a fourth feature image corresponding to the training scene;
inputting the fourth characteristic image corresponding to any training scene into the sliding window module to perform sliding window processing of a preset size, so as to obtain a fifth characteristic image corresponding to the training scene;
the any one is subjected toA fifth characteristic image corresponding to a training scene is input to the first matrix conversion module for matrix conversion to obtain a first intermediate characteristic image of the training scene, the first intermediate characteristic image of the training scene is input to the bidirectional LSTM model based on a preset input condition to obtain a second intermediate characteristic image of the training scene, and the second intermediate characteristic image of the training scene is input to the second matrix conversion module for matrix conversion to obtain a sixth characteristic image corresponding to the training scene; wherein, the preset input conditions are as follows: batch=nh and T max Data flow of =w, batch is the Batch size of the first intermediate feature map, N is the number of target training scene images corresponding to the training scene, H is the height of the first intermediate feature map, W is the width of the first intermediate feature map, T max Is the maximum length of time;
and inputting the sixth characteristic image corresponding to any training scene into the full link layer for conversion, obtaining a seventh characteristic image corresponding to the training scene, inputting the seventh characteristic image into the RPN network, obtaining at least one text candidate box corresponding to the training scene, determining a target candidate box corresponding to the training scene from the at least one text candidate box corresponding to the training scene based on a non-maximum suppression algorithm, and taking the target candidate box corresponding to the training scene as a training detection result of the training scene.
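The optimizing and processing units recited in the claim above amount to a train-until-done loop: update the parameters with the target loss value, then stop once the iteration count reaches the maximum or the loss stops changing. The following is a minimal PyTorch-style sketch of that stopping logic only; the names compute_target_loss and convergence_tol, and the convergence test itself, are illustrative assumptions rather than the patented implementation.

```python
import torch

def train_until_condition(model, optimizer, data_loader, compute_target_loss,
                          max_iterations=10000, convergence_tol=1e-4):
    """Optimize until the preset training condition holds: the number of
    training iterations reaches max_iterations or the loss has converged."""
    previous_loss = None
    iteration = 0
    while True:
        for batch in data_loader:
            optimizer.zero_grad()
            loss = compute_target_loss(model, batch)   # target loss value
            loss.backward()
            optimizer.step()                           # optimize network parameters
            iteration += 1

            converged = (previous_loss is not None and
                         abs(previous_loss - loss.item()) < convergence_tol)
            previous_loss = loss.item()
            if iteration >= max_iterations or converged:
                return model   # returned as the target text position detection model
```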
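A minimal sketch of the feature extraction and feature fusion steps, assuming PyTorch and torchvision. The truncation point of ResNet-34, the use of max pooling for the downsampling layers, the channel counts and the ReLU activation are illustrative assumptions; the claim only specifies the module composition, not these hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class FeatureExtractionAndFusion(nn.Module):
    """ResNet-34 feature extraction, multi-scale downsampling, cross-channel
    fusion, then normalization / activation / convolution."""
    def __init__(self, fused_channels=256):
        super().__init__()
        backbone = resnet34(weights=None)
        # keep the backbone up to its third residual stage (256-channel output)
        self.features = nn.Sequential(*list(backbone.children())[:-3])
        self.down1 = nn.MaxPool2d(2, 2)   # downsampling layer, scale 1/2
        self.down2 = nn.MaxPool2d(4, 4)   # downsampling layer, scale 1/4
        self.post_fuse = nn.Sequential(
            nn.BatchNorm2d(256 * 3),                               # normalization layer
            nn.ReLU(inplace=True),                                 # activation function layer
            nn.Conv2d(256 * 3, fused_channels, kernel_size=1),     # convolution layer
        )

    def forward(self, images):                      # (N, 3, H_in, W_in)
        first = self.features(images)               # first feature image
        h, w = first.shape[2:]
        # second feature images at different scales, resized back for fusion
        second_a = F.interpolate(self.down1(first), size=(h, w),
                                 mode='bilinear', align_corners=False)
        second_b = F.interpolate(self.down2(first), size=(h, w),
                                 mode='bilinear', align_corners=False)
        third = torch.cat([first, second_a, second_b], dim=1)      # cross-channel fusion
        fourth = self.post_fuse(third)               # fourth feature image
        return fourth
```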
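A sketch of the sliding-window and bidirectional LSTM steps, including the matrix conversion to a data flow with Batch = N × H and T_max = W, so that each image row becomes one LSTM sequence. Implementing the preset-size sliding window as a 3×3 convolution and the choice of hidden size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SlidingWindowBiLSTM(nn.Module):
    """3x3 sliding window over the fused feature map, then a bidirectional
    LSTM run over every image row with Batch = N*H and T_max = W."""
    def __init__(self, channels=256, hidden=128):
        super().__init__()
        # a 3x3 convolution gathers the preset-size window around each position
        self.window = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)

    def forward(self, fourth):                       # (N, C, H, W)
        fifth = self.window(fourth)                  # fifth feature image
        n, c, h, w = fifth.shape
        # first matrix conversion: (N, C, H, W) -> (N*H, W, C), i.e. Batch = N*H, T_max = W
        rows = fifth.permute(0, 2, 3, 1).reshape(n * h, w, c)
        lstm_out, _ = self.bilstm(rows)              # second intermediate feature map
        # second matrix conversion: back to an image-shaped tensor
        sixth = lstm_out.reshape(n, h, w, -1).permute(0, 3, 1, 2)
        return sixth                                 # sixth feature image, fed to the full link layer / RPN
```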
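A sketch of selecting the target candidate box among the RPN text proposals with non-maximum suppression, using torchvision.ops.nms; the (x1, y1, x2, y2) box layout, the IoU threshold and the dummy proposals are illustrative.

```python
import torch
from torchvision.ops import nms

def select_target_box(candidate_boxes, scores, iou_threshold=0.5):
    """Suppress overlapping text candidate boxes, then keep the
    highest-scoring survivor as the target candidate box."""
    keep = nms(candidate_boxes, scores, iou_threshold)   # indices retained by NMS
    best = keep[scores[keep].argmax()]
    return candidate_boxes[best]

# usage with dummy proposals
boxes = torch.tensor([[10., 20., 200., 60.],
                      [12., 22., 198., 58.],
                      [300., 40., 420., 90.]])
scores = torch.tensor([0.92, 0.88, 0.75])
target_box = select_target_box(boxes, scores)         # tensor([10., 20., 200., 60.])
```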
5. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the method of detecting text position in an image of a scene as claimed in any one of claims 1 to 3.
CN202310373895.2A 2023-04-10 2023-04-10 Method, system and storage medium for detecting text position in scene image Active CN116630755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310373895.2A CN116630755B (en) 2023-04-10 2023-04-10 Method, system and storage medium for detecting text position in scene image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310373895.2A CN116630755B (en) 2023-04-10 2023-04-10 Method, system and storage medium for detecting text position in scene image

Publications (2)

Publication Number Publication Date
CN116630755A CN116630755A (en) 2023-08-22
CN116630755B true CN116630755B (en) 2024-04-02

Family

ID=87635463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310373895.2A Active CN116630755B (en) 2023-04-10 2023-04-10 Method, system and storage medium for detecting text position in scene image

Country Status (1)

Country Link
CN (1) CN116630755B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230036812A1 (en) * 2020-01-17 2023-02-02 Microsoft Technology Licensing, Llc Text Line Detection

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN109711401A (en) * 2018-12-03 2019-05-03 广东工业大学 A kind of Method for text detection in natural scene image based on Faster Rcnn
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text filed, content determine method and apparatus
AU2020101229A4 (en) * 2020-07-02 2020-08-06 South China University Of Technology A Text Line Recognition Method in Chinese Scenes Based on Residual Convolutional and Recurrent Neural Networks
CN112926372A (en) * 2020-08-22 2021-06-08 清华大学 Scene character detection method and system based on sequence deformation
CN112070174A (en) * 2020-09-11 2020-12-11 上海海事大学 Text detection method in natural scene based on deep learning
CN112418225A (en) * 2020-10-16 2021-02-26 中山大学 Offline character recognition method for address scene recognition
WO2022147965A1 (en) * 2021-01-09 2022-07-14 江苏拓邮信息智能技术研究院有限公司 Arithmetic question marking system based on mixnet-yolov3 and convolutional recurrent neural network (crnn)
CN115909336A (en) * 2021-08-17 2023-04-04 腾讯科技(深圳)有限公司 Text recognition method and device, computer equipment and computer-readable storage medium
WO2023040068A1 (en) * 2021-09-16 2023-03-23 惠州市德赛西威汽车电子股份有限公司 Perception model training method, and perception model-based scene perception method
CN113837168A (en) * 2021-09-22 2021-12-24 易联众智鼎(厦门)科技有限公司 Image text detection and OCR recognition method, device and storage medium
CN114494678A (en) * 2021-12-02 2022-05-13 国家计算机网络与信息安全管理中心 Character recognition method and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Deep Learning Framework for Recognizing Vertical Texts in Natural Scene; Yi Ling Ong, et al.; IEEE; full text *
Zhi Tian, et al.; Detecting Text in Natural Image with Connectionist Text Proposal Network; arXiv; 2016; full text *
Research on Scene Text Detection Technology Based on Deep Learning; Li Xiangxiang; China Master's Theses Full-text Database; pp. 11-23 *

Also Published As

Publication number Publication date
CN116630755A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN109086756B (en) Text detection analysis method, device and equipment based on deep neural network
CN110390251B (en) Image and character semantic segmentation method based on multi-neural-network model fusion processing
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN112818975B (en) Text detection model training method and device, text detection method and device
US20190019055A1 (en) Word segmentation system, method and device
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN113326380B (en) Equipment measurement data processing method, system and terminal based on deep neural network
Mor et al. Confidence prediction for lexicon-free OCR
CN111680753A (en) Data labeling method and device, electronic equipment and storage medium
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN112991280B (en) Visual detection method, visual detection system and electronic equipment
Silanon Thai Finger‐Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features
Alon et al. Deep-hand: a deep inference vision approach of recognizing a hand sign language using american alphabet
Liu et al. Scene text recognition with CNN classifier and WFST-based word labeling
CN114882204A (en) Automatic ship name recognition method
CN117593514B (en) Image target detection method and system based on deep principal component analysis assistance
CN112418207B (en) Weak supervision character detection method based on self-attention distillation
CN117115565B (en) Autonomous perception-based image classification method and device and intelligent terminal
CN114266308A (en) Detection model training method and device, and image detection method and device
Kumari et al. A comprehensive handwritten paragraph text recognition system: Lexiconnet
Annisa et al. Analysis and Implementation of CNN in Real-time Classification and Translation of Kanji Characters
CN113177511A (en) Rotating frame intelligent perception target detection method based on multiple data streams
CN116630755B (en) Method, system and storage medium for detecting text position in scene image
CN114022684B (en) Human body posture estimation method and device
CN116071544A (en) Image description prediction method oriented to weak supervision directional visual understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant