CN113888505B - Natural scene text detection method based on semantic segmentation - Google Patents
Info
- Publication number
- CN113888505B (application CN202111157377.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- size
- output
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T7/0002 — Inspection of images, e.g. flaw detection
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06T7/11 — Region-based segmentation
- G06T7/13 — Edge detection
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; Image merging
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention applies deep learning to computer vision and provides a natural scene text detection method based on semantic segmentation. The method first constructs a feature extraction network, then screens effective information with a feature selection module, fuses the screened multi-scale feature information through a feature pyramid network, and finally obtains, through an edge enhancement network and a semantic segmentation network, a semantic segmentation result in which the edges of text regions are markedly strengthened, from which the boundary coordinates of the text regions are derived. The invention realizes a fast, lightweight text detection model that detects text regions with varied, complex shapes and backgrounds both quickly and accurately.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to deep learning and computer vision, and particularly relates to a natural scene text detection method.
Background
Text detection is an important step in enabling a computer to acquire information and to realize human-machine interaction; its goal is to let the computer, like a human, quickly locate the text regions in its field of view that carry useful information. In a natural scene image, the region with the highest information density is usually text, so the first step in acquiring information is to find where the text is. By selecting only the text regions that contain valid information, the computer acquires information more accurately and efficiently and wastes fewer computing and storage resources downstream, which improves overall image-understanding performance. In general, an image contains both text regions carrying valid information and background regions carrying unwanted information; understanding the image requires attending only to the former while ignoring the latter, a foreground/background distinction that corresponds naturally to semantic segmentation in computer vision. It is therefore feasible to perform scene text detection by having a computer emulate the human visual system.
Earlier text detection methods used traditional machine learning to statistically analyze the pixel distribution in an image. Such methods cannot fully exploit global information; they merely traverse the image with a fixed algorithm, so neither their speed nor their accuracy is ideal. Deep-learning-based methods effectively address both problems. Early deep methods mainly used a neural network to regress the bounding-box parameters of text regions directly; limited by the expressive power of the network, direct box regression can only detect simple text regions, and performs poorly when the text is hard to separate from the background or when the text is curved. Semantic segmentation handles these cases well. First, thanks to the development of deep learning and the rapid growth of computing power, neural networks can now process images fast enough for real-time use. Second, semantic segmentation can precisely separate a target from its background even when the target has a complex outline, so detection remains possible in complex scenes with complex text. By tracing the contours of the detected semantic mask, the exact outline of each text region can be obtained, which makes extraction of complex text in natural scenes more effective.
Disclosure of Invention
The invention aims to solve the technical problems that: the defect of the current scene text detection is overcome, and the edge-enhanced natural scene text detection method based on semantic segmentation is provided, so that the purpose of high-precision and high-efficiency detection is achieved.
The technical scheme of the invention is as follows:
a natural scene text detection method based on semantic segmentation comprises the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network adopts a classical ResNet or MobileNet structure as the backbone. Features at 1/4, 1/8, 1/16 and 1/32 of the input image size are extracted from different layers as outputs, with 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
The input of the feature screening module has two parts, i and h: i represents the output feature of the feature extraction network, and h represents the output feature of the previous feature screening module. The two parts are fused by convolution and normalized with a sigmoid function; the normalized result serves as a weight for selectively fusing the two inputs i and h into the final fused output feature. The whole operation is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
where S represents the normalized feature screening heat map, conv(x) denotes a sub-network consisting of convolution, batch normalization and ReLU activation, and out represents the final output feature map, fixed at 64 channels. Note that the above operations also imply a channel transformation step;
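As a minimal illustrative sketch of this gating, the learned conv1/conv2/conv3/conv4 sub-networks can be collapsed into fixed scalar weights (an assumption for illustration only, not the patent's implementation); the fused output is then a pixel-wise convex combination of h and i:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_screen(h, i, w_h=0.5, w_i=0.5):
    # conv1/conv2/conv3 are reduced to scalar weights for illustration;
    # a real implementation would use learned convolutions (and conv4).
    s = sigmoid(w_h * h + w_i * i)   # normalized screening heat map S
    return (1.0 - s) * h + s * i     # out = (1-S)*h + S*i

rng = np.random.default_rng(0)
h = rng.standard_normal((64, 32, 32))  # previous-stage feature, 64 channels
i = rng.standard_normal((64, 32, 32))  # backbone feature after channel transform
out = feature_screen(h, i)
```

Because S lies in (0, 1), every output pixel lies between the corresponding pixels of h and i, which makes the module a soft selector rather than a hard switch.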
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules. The feature screening module is used at three places in the network, but there is only one module structure, i.e., one module is multiplexed three times. First, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network (ASPP), yielding a 1/32-size feature map res4. res4 is upsampled to 1/16 size and then, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively; the module outputs a 1/16-size feature map res3. Repeating these steps yields res2 and res1, at 1/8 and 1/4 size respectively. Finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension, giving a multi-scale fusion feature map with 256 channels;
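The tensor sizes involved in this fusion can be traced with a shape-only sketch. A hypothetical 128x128 input is assumed, and nearest-neighbour repetition stands in for the real upsampling:

```python
import numpy as np

def upsample2x(x, times=1):
    # nearest-neighbour repetition stands in for the actual upsampling
    for _ in range(times):
        x = x.repeat(2, axis=1).repeat(2, axis=2)
    return x

H = W = 128                              # hypothetical input image size
res1 = np.zeros((64, H // 4,  W // 4))   # 1/4 size
res2 = np.zeros((64, H // 8,  W // 8))   # 1/8 size
res3 = np.zeros((64, H // 16, W // 16))  # 1/16 size
res4 = np.zeros((64, H // 32, W // 32))  # 1/32 size (after ASPP)

# upsample res2/res3/res4 to the size of res1, then concatenate on channels
fused = np.concatenate(
    [res1, upsample2x(res2, 1), upsample2x(res3, 2), upsample2x(res4, 3)],
    axis=0)
```

The result is a 256-channel map at 1/4 of the input resolution, matching the description above.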
(4) Constructing edge-enhanced networks
The edge enhancement network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation. The result is a 1-channel edge enhancement heat map with pixel values in [0,1], where a larger value indicates a pixel closer to an edge position;
(5) Constructing semantic segmentation networks
First, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension, and the result is fed into a 3-layer convolutional neural network. The first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, where the upsampling uses bilinear interpolation to double the feature-map size. The final layer uses convolution, bias and sigmoid activation to produce a 1-channel semantic segmentation heat map with values between 0 and 1. The heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7;
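A shape-only sketch of this head, assuming a hypothetical 128x128 input, nearest-neighbour repetition in place of bilinear upsampling, and a channel mean standing in for the learned conv layers:

```python
import numpy as np

rng = np.random.default_rng(0)
fpn  = rng.random((256, 32, 32))  # 1/4-size multi-scale fusion map, 256 ch
edge = rng.random((1, 32, 32))    # 1-channel edge enhancement heat map
x = np.concatenate([fpn, edge], axis=0)  # channel concatenation -> 257 ch

def up2x(t):
    # nearest-neighbour stand-in for bilinear 2x upsampling
    return t.repeat(2, axis=1).repeat(2, axis=2)

# the two conv+upsample stages and the sigmoid head are replaced by a
# channel mean here, purely to trace the tensor shapes
heat = up2x(up2x(x.mean(axis=0, keepdims=True)))
binary = (heat >= 0.7).astype(np.uint8)  # binarize at the 0.7 threshold
```

Two doublings bring the 1/4-size map back to full input resolution, where the 0.7 threshold yields the binary text mask.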
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of this polygon are the position coordinates of the text region in the image. For a rectangular text region, the coordinates consist of 4 points. For irregular text regions, OpenCV determines the number of polygon vertices automatically.
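The region-separation step can be approximated without OpenCV by a plain connected-component pass; axis-aligned bounding boxes stand in for the minimum-perimeter polygon here, which is exact only for axis-aligned rectangular text:

```python
import numpy as np
from collections import deque

def text_region_boxes(binary):
    """Separate connected text regions and return one 4-point box per
    region. A stand-in for the OpenCV step (findContours plus a
    minimum-perimeter polygon); each region is reduced to its
    axis-aligned bounding box."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                q = deque([(y, x)])
                seen[y, x] = True
                ys, xs = [y], [x]
                while q:  # flood-fill one connected region
                    cy, cx = q.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                            ys.append(ny)
                            xs.append(nx)
                x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
                boxes.append([(x0, y0), (x1, y0), (x1, y1), (x0, y1)])
    return boxes

page = np.zeros((20, 40), dtype=np.uint8)
page[2:5, 3:15] = 1    # first text line
page[10:14, 5:30] = 1  # second text line
boxes = text_region_boxes(page)
```

Each box is returned as 4 (x, y) vertices in clockwise order, mirroring the 4-point output described for rectangular regions.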
(7) Training method
For a ResNet backbone, the network is first pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved. The whole network is then warmed up on the synthetic dataset SynthText so that the model converges on the task scene. Finally, formal training is performed on the specific scene dataset. In addition, the OHEM algorithm is used in the design of the loss function to mine hard examples and balance the area gap between foreground and background.
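The OHEM selection can be sketched on a flattened loss map as keeping all positives plus only the hardest negatives; the 3:1 negative-to-positive ratio below is a commonly used choice assumed for illustration, not one fixed by this document:

```python
import numpy as np

def ohem_mask(loss, gt, neg_ratio=3):
    """Keep every positive pixel plus only the hardest negatives,
    at most neg_ratio negatives per positive (assumed ratio)."""
    pos = gt > 0.5
    n_pos = int(pos.sum())
    n_neg = min(n_pos * neg_ratio, int((~pos).sum()))
    keep = pos.copy()
    if n_neg > 0:
        neg_loss = np.where(pos, -np.inf, loss)      # mask out positives
        idx = np.argsort(neg_loss.ravel())[-n_neg:]  # hardest negatives
        keep.ravel()[idx] = True
    return keep

gt   = np.array([1.0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0])
loss = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
mask = ohem_mask(loss, gt)
```

Only the pixels selected by the mask contribute to the loss, which keeps the abundant easy background from swamping the scarce text pixels.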
The beneficial effects of the invention are as follows. The invention fully exploits the strong foreground/background discrimination of the semantic segmentation algorithm and performs multi-scale feature extraction through the feature pyramid network, ensuring that both small and large text in the image can be detected effectively. By introducing the information selection gate structure, the upsampling and feature fusion stages propagate only effective information, removing redundant information from the network. In addition, because both the semantic segmentation algorithm and the contour-forming algorithm handle irregular areas naturally, the scheme detects irregular text regions accurately.
Drawings
Fig. 1 illustrates the multi-scale feature extraction network. The top row represents the feature extraction backbone, the progressively smaller shapes indicating progressively smaller extracted feature maps. The middle row represents the two-input feature filter gates, with ASPP denoting the pyramid pooling network. The next row of differently sized boxes represents the extracted multi-scale feature maps. Finally, the feature maps are aggregated together through an upsampling step;
FIG. 2 shows the internal structure of the feature filter gate, where conv(x) denotes several layers of convolutional networks, × denotes pixel-wise multiplication, and + denotes pixel-wise addition;
FIG. 3 is a schematic diagram of an edge enhancement network, a semantic segmentation network, and a binarization process, wherein conv (x) represents a number of layers of convolutional networks;
FIG. 4 is the ground-truth map for the edge enhancement structure. Of the three lines, the innermost represents the boundary after the text outline is shrunk to 0.5 of its original size, and all pixel values inside it are set to 0. The outermost boundary represents the text outline enlarged to 1.25 times its original size, and all pixel values outside it are set to 0. The middle black line, with value 1, represents the original boundary; pixel values between the three boundary lines are linearly interpolated;
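The interpolation scheme of FIG. 4 can be sketched as a 1-D profile along the normal of a text boundary; the margin values below are illustrative stand-ins for the pixel distances implied by the 0.5x shrink and 1.25x enlargement, which depend on the actual contour size:

```python
import numpy as np

def edge_truth(dist, inner_margin=2.0, outer_margin=1.0):
    """Hypothetical 1-D edge ground-truth profile. dist is the signed
    distance from the original boundary (negative inside the text).
    Value is 1 on the boundary, falling linearly to 0 at the shrunk
    boundary (dist = -inner_margin) and the enlarged boundary
    (dist = +outer_margin), and 0 beyond both."""
    inner = np.clip(1.0 + dist / inner_margin, 0.0, 1.0)
    outer = np.clip(1.0 - dist / outer_margin, 0.0, 1.0)
    return np.where(dist < 0.0, inner, outer)

profile = edge_truth(np.array([-3.0, -2.0, -1.0, 0.0, 0.5, 1.0, 2.0]))
```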
fig. 5 is an input image example;
FIG. 6 is a semantic segmentation result example;
fig. 7 is a frame example of a text region.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
A natural scene text detection method based on semantic segmentation comprises the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network employs the ResNet architecture as backbone, shown as the top-row conv(x) in fig. 1. Its input is a 3-channel RGB image, as shown in fig. 5. Features at 1/4, 1/8, 1/16 and 1/32 of the input image size are extracted from layers 4, 6, 9 and 13 of ResNet as outputs, with 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
As shown in fig. 2, the inputs of the feature screening module are i and h: i represents the output feature of the feature extraction network, and h represents the output feature of the previous feature screening module. The two parts are fused by convolution and normalized with a sigmoid function; the normalized result serves as a weight for selectively fusing the i and h inputs into the final fused output feature. The whole operation is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
where S is the normalized feature screening heat map and out is the final output feature map, which has 64 channels and the same size as i and h;
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules. As shown in fig. 1, the feature screening module is used at three places in the network, but there is only one module structure, i.e., one module is multiplexed three times. First, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network (ASPP), yielding a 1/32-size feature map res4. res4 is upsampled to 1/16 size and then, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively; the module outputs a 1/16-size feature map res3. Repeating these steps yields res2 and res1, at 1/8 and 1/4 size respectively. Finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension, giving a multi-scale fusion feature map with 256 channels;
(4) Constructing edge-enhanced networks
The edge enhancement network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation. The result is a 1-channel edge enhancement heat map with pixel values in [0,1], where a larger value indicates a pixel closer to an edge position. FIG. 4 illustrates the distribution of pixel values at text edge locations in the heat map;
(5) Constructing semantic segmentation networks
First, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension, and the result is fed into a 3-layer convolutional neural network. The first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, where the upsampling uses bilinear interpolation to double the feature-map size. The final layer uses convolution, bias and sigmoid activation to produce a 1-channel semantic segmentation heat map with values between 0 and 1. The heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7, as shown in fig. 6, where the black areas mark the positions of characters and the white area is background;
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of the polygon are the position coordinates of the text region in the image. In fig. 6, a total of 3 text regions are detected by semantic segmentation and binarization; in fig. 7, the border of each text region is derived from the binary map using OpenCV. For the 3 rectangular text regions in fig. 7, OpenCV outputs the coordinates of 4 vertices each, and these coordinate points are taken as the text region coordinates. For irregular text regions, OpenCV determines the number of polygon vertices automatically.
(7) Training method
Using ResNet as the backbone network, it is pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved. The whole network is then pre-trained on the synthetic dataset SynthText so that the model converges on the task scene. Finally, formal training is performed on the specific scene dataset. In addition, the OHEM algorithm is used in the design of the loss function to balance positive and negative samples and the area gap between foreground and background. The optimizer is Adam with a batch size of 8 and an exponentially decaying learning rate: the initial learning rate is 0.0001 and is multiplied by 0.95 after every 10,000 iterations, for 100,000 iterations in total.
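The learning-rate schedule described above amounts to a step-wise exponential decay, which can be written directly:

```python
def learning_rate(step, base_lr=1e-4, decay=0.95, interval=10_000):
    # step-wise exponential decay: multiply by 0.95 after every
    # 10,000 iterations, starting from 0.0001
    return base_lr * decay ** (step // interval)
```

Over the stated 100,000 iterations the rate falls in ten steps from 1e-4 to roughly 6e-5.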
Claims (1)
1. A natural scene text detection method based on semantic segmentation is characterized by comprising the following steps:
(1) Constructing a basic feature extraction network
The feature extraction network adopts ResNet or MobileNet network structure as backbone, 1/4, 1/8, 1/16 and 1/32 features of the input image size are extracted from different layers as output, and the number of channels corresponding to the output features is 64, 128, 256 and 512 channels respectively;
(2) Construction of feature screening Module
The input of the feature screening module is divided into two parts i and h, i represents the output feature of the feature extraction network, h represents the output feature of the feature screening module at the upper stage, the two parts are subjected to convolution fusion and then normalized by using a sigmoid function, the normalized result is used as a weight, the two inputs i and h are subjected to selective fusion, and finally the fused output feature is obtained; the whole operation process is defined as follows:
S=sigmoid(conv3(conv1(h),conv2(i)))
out=conv4((1-S)·h+S·i)
wherein S represents the normalized feature screening heat map; conv(x) represents a sub-network consisting of convolution, batch normalization and ReLU activation; out represents the final output feature map, fixed at 64 channels; a channel transformation step is also implied in the above operations;
(3) Construction of feature pyramid networks
The feature pyramid network fuses the outputs of the feature screening modules; the feature screening module is used at three places in the feature pyramid network, but there is only one module structure, i.e., one module is multiplexed at three places; first, the 1/32-size feature map output by the feature extraction network is expanded with a pyramid pooling network to obtain a 1/32-size feature map res4; res4 is upsampled to 1/16 size and, together with the 1/16-size feature map output by the feature extraction network, fed into a feature screening module as the h and i inputs respectively, the module outputting a 1/16-size feature map res3; these steps are repeated to obtain res2 and res1 at 1/8 and 1/4 size respectively; finally, res2, res3 and res4 are upsampled to the size of res1 and concatenated along the channel dimension to obtain a multi-scale fusion feature map with 256 channels;
(4) Constructing edge-enhanced networks
The edge strengthening network consists of 3 neural-network layers: the first two layers each consist of convolution, batch normalization and ReLU activation, and the last layer consists of convolution, bias and sigmoid activation; the result is an edge strengthening heat map with 1 channel and pixel values in [0,1], where a larger value indicates a pixel closer to an edge position;
(5) Constructing semantic segmentation networks
Firstly, the 256-channel feature map output by the feature pyramid network and the 1-channel feature map output by the edge enhancement network are concatenated along the channel dimension and the result is input into a 3-layer convolutional neural network; the first 2 layers consist of upsampling, convolution, batch normalization and ReLU activation, the upsampling using bilinear interpolation to double the feature-map size; the final layer uses convolution, bias and sigmoid activation to obtain a 1-channel semantic segmentation heat map with values between 0 and 1; the heat map is converted into a binary map containing only the values 0 and 1 using a threshold of 0.7;
(6) Contour forming
Different text regions are separated from the binary map using OpenCV software, and for each region the closed polygon of minimum perimeter enclosing the region is computed; the vertex coordinates of the polygon are the position coordinates of the text region in the image; for a rectangular text region, the coordinates consist of 4 points; for irregular text regions, OpenCV automatically determines the number of polygon vertices;
(7) Training method
Using ResNet as the backbone network, it is pre-trained on the image classification dataset ImageNet and the pre-trained weight parameters are saved; the whole network is then pre-trained on the synthetic dataset SynthText so that the model converges on the task scene; finally, formal training is performed on the specific scene dataset; in addition, the OHEM algorithm is used in the design of the loss function to balance positive and negative samples and the area gap between foreground and background; the optimizer is Adam with a batch size of 8 and an exponentially decaying learning rate, the initial learning rate being 0.0001 and multiplied by 0.95 after every 10,000 iterations, for 100,000 iterations in total.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111157377.4A CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111157377.4A CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113888505A CN113888505A (en) | 2022-01-04 |
CN113888505B true CN113888505B (en) | 2024-05-07 |
Family
ID=79004733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111157377.4A Active CN113888505B (en) | 2021-09-30 | 2021-09-30 | Natural scene text detection method based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113888505B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114399710A (en) * | 2022-01-06 | 2022-04-26 | 昇辉控股有限公司 | Identification detection method and system based on image segmentation and readable storage medium |
CN114092930B (en) * | 2022-01-07 | 2022-05-03 | 中科视语(北京)科技有限公司 | Character recognition method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | Natural scene text detection method and system |
CN110322495A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | Scene text segmentation method based on weakly supervised deep learning |
CN111553351A (en) * | 2020-04-26 | 2020-08-18 | 佛山市南海区广工大数控装备协同创新研究院 | Semantic segmentation based text detection method for arbitrary scene shape |
CN112966691A (en) * | 2021-04-14 | 2021-06-15 | 重庆邮电大学 | Multi-scale text detection method and device based on semantic segmentation and electronic equipment |
Non-Patent Citations (2)
Title |
---|
Yu Zeng; Yunzhi Zhuge; Huchuan Lu; Lihe Zhang. Joint Learning of Saliency Detection and Weakly Supervised Semantic Segmentation. arXiv, 2019, full text. * |
Text detection in natural scenes based on lightweight networks; Sun Jingjing; Zhang Qinglin; Electronic Measurement Technology; 2020-04-23 (No. 08); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN113888505A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287849B (en) | Lightweight deep network image object detection method suitable for Raspberry Pi | |
CN111047551B (en) | Remote sensing image change detection method and system based on U-net improved algorithm | |
CN111640125B (en) | Aerial image building detection and segmentation method and device based on Mask R-CNN | |
CN109034210A (en) | Object detection method based on super feature fusion and multi-scale pyramid network | |
CN113888505B (en) | Natural scene text detection method based on semantic segmentation | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN113807355A (en) | Image semantic segmentation method based on coding and decoding structure | |
CN111046917B (en) | Object-based enhanced target detection method using a deep neural network | |
CN111797841B (en) | Visual saliency detection method based on depth residual error network | |
CN110532946A (en) | Method for identifying axle types of green-channel vehicles based on convolutional neural networks | |
CN115620010A (en) | Semantic segmentation method for RGB-T bimodal feature fusion | |
CN113706545A (en) | Semi-supervised image segmentation method based on dual-branch neural discriminative dimensionality reduction | |
CN111353544A (en) | Target detection method based on improved Mixed Pooling-YOLOv3 | |
CN114820579A (en) | Semantic segmentation based image composite defect detection method and system | |
CN110852330A (en) | Single-stage behavior recognition method | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN114943876A (en) | Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium | |
CN113516126A (en) | Adaptive threshold scene text detection method based on attention feature fusion | |
CN113298817A (en) | High-accuracy semantic segmentation method for remote sensing image | |
Zhang et al. | R2net: Residual refinement network for salient object detection | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN113408524A (en) | Crop image segmentation and extraction algorithm based on MASK RCNN | |
CN107766838B (en) | Video scene switching detection method | |
CN115578721A (en) | Streetscape text real-time detection method based on attention feature fusion | |
CN112861860B (en) | Text detection method in natural scene based on upper and lower boundary extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||