CN112712078A - Text detection method and device - Google Patents

Text detection method and device

Info

Publication number
CN112712078A
CN112712078A
Authority
CN
China
Prior art keywords
image
feature map
text
semantic segmentation
semantic
Prior art date
Legal status
Pending
Application number
CN202011644150.8A
Other languages
Chinese (zh)
Inventor
崔淼
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN202011644150.8A priority Critical patent/CN112712078A/en
Publication of CN112712078A publication Critical patent/CN112712078A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text detection method and a text detection device, wherein the method comprises the following steps: acquiring an image; extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network; up-sampling the feature map to obtain a target feature map of the image; convolving the target feature map by using different convolution kernels to obtain semantic features of the image; and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image. The method in the embodiment of the invention can improve the accuracy of natural scene text detection.

Description

Text detection method and device
Technical Field
The invention relates to the technical field of image recognition, and in particular to a text detection method and device.
Background
With the rapid development of neural networks, natural scene text detection (scene text detection) has attracted wide attention in numerous applications such as target localization for automatic driving, subtitle recognition, sensitive word recognition, scene understanding, and product recognition. However, text in natural scene images varies greatly in size, shape, color, font, orientation, brightness, contrast, occlusion, and style, so text instances of arbitrary shape (e.g., curved text, heavily inclined text, and long text) frequently appear.
Currently, there are three main methods for text detection. The first, based on projection, is limited in its applicable scenes: the input text data must have no background interference, the text must be short, and the characters must not be curved. The second, based on detection-box regression, can only locate rectangular or quadrilateral text boxes with a specific orientation, so its detection of arbitrarily shaped text instances is unsatisfactory. The third, based on semantic segmentation, can detect heavily inclined text, but it still cannot separate two text instances that lie close together, and it still misses curved text and long text. The effect of text detection in natural scenes is therefore not yet ideal.
Therefore, how to improve the accuracy of natural scene text detection has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text detection method and apparatus, so as to solve the problem in the prior art that the accuracy of natural scene text detection is low.
In a first aspect, the present invention provides a method for text detection, including:
acquiring an image; extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network; up-sampling the feature map to obtain a target feature map of the image; convolving the target feature map by using different convolution kernels to obtain semantic features of the image; and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image.
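Read as a pipeline, the five claimed steps can be sketched in Python with stand-in functions. Every function name and the trivial operation inside it is a hypothetical placeholder for the corresponding claimed step, not an implementation from the patent:

```python
import numpy as np

def extract_features(image):
    # Stand-in for the lightweight CNN backbone (step 2): here simply a
    # 4x-downsampled single-channel map so the sketch stays runnable.
    return image[::4, ::4].mean(axis=-1, keepdims=True)

def upsample(feature_map, factor=4):
    # Stand-in for step 3: nearest-neighbour upsampling back to image size.
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)

def multi_kernel_conv(feature_map):
    # Stand-in for step 4: in the patent this is convolution with
    # different kernels; here an identity placeholder.
    return feature_map

def segment(semantic_features, threshold=0.5):
    # Stand-in for step 5: per-pixel text/background decision.
    return (semantic_features > threshold).astype(np.uint8)

image = np.random.rand(32, 32, 3)           # step 1: acquire an image
features = extract_features(image)          # step 2: feature map
target_map = upsample(features)             # step 3: target feature map
semantic = multi_kernel_conv(target_map)    # step 4: semantic features
mask = segment(semantic)                    # step 5: text-region mask
```

The mask plays the role of the semantic segmentation result from which the target text region is determined.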
In the embodiments of the invention, the different convolution kernels have different receptive fields. Convolving the target feature map with these kernels yields the semantic features of the image, and performing semantic segmentation on those features yields the segmentation result, which improves the detection of extremely long text and curved characters.
Meanwhile, because a lightweight network extracts the feature map of the image, the text detection speed can be further improved.
In one embodiment, the convolutional neural network is a GhostNet network.
In one embodiment, the upsampling the feature map to obtain a target feature map of the image includes:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
In one embodiment, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image includes:
and performing variable convolution on the target feature map to obtain the semantic features of the image.
In one embodiment, after the semantic segmentation of the semantic features obtains a semantic segmentation result of the image, the method further includes:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result; and determining a target text area in the image based on the binarization result.
In one embodiment, the deformable convolution has an adaptive receptive field.
In one embodiment, the deformable convolution can adjust its receptive field according to the size of the text lines in the target feature map.
In a second aspect, an apparatus for text detection is provided, the apparatus comprising:
an image acquisition unit for acquiring an image; the characteristic extraction unit is used for extracting a characteristic diagram of the image by using a convolutional neural network, and the convolutional neural network is a lightweight network; the up-sampling unit is used for up-sampling the feature map to obtain a target feature map of the image; the convolution unit is used for performing convolution on the target feature map by using different convolution kernels to obtain the semantic features of the image; and the semantic segmentation unit is used for performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, and the semantic segmentation result is used for determining a target text region in the image.
In a third aspect, the present invention provides an apparatus for text detection, where the apparatus is configured to perform the method in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an apparatus for text detection is provided, where the apparatus includes a storage medium, which may be a non-volatile storage medium, and a central processing unit, where a computer-executable program is stored in the storage medium, and the central processing unit is connected to the non-volatile storage medium and executes the computer-executable program to implement the first aspect or the method in any possible implementation manner of the first aspect.
In a fifth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the first aspect or any possible implementation manner of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the first aspect or the method in any possible implementation manner of the first aspect.
A sixth aspect provides a computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
In the embodiments of the invention, the different convolution kernels have different receptive fields. Convolving the target feature map with these kernels yields the semantic features of the image, and performing semantic segmentation on those features yields the segmentation result, which improves the detection of extremely long text and curved characters. Meanwhile, because a lightweight network extracts the feature map of the image, the text detection speed can be further improved.
Drawings
Fig. 1 is a schematic block diagram of a text detection method according to an embodiment of the present invention.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a detection result of a text detection method based on semantic segmentation in the prior art.
Fig. 4 is a schematic diagram of a detection result of the text detection method according to an embodiment of the present invention.
Fig. 5 is a schematic block diagram of an apparatus for text detection according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of a text detection apparatus according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the rapid development of neural networks, natural scene text detection has gained wide attention in many applications such as automatic driving target positioning, subtitle recognition, sensitive word recognition, scene understanding, product recognition and the like in recent years. However, due to the large differences between foreground text and background objects, as well as text variations of various shapes, colors, fonts, orientations and scales, and extreme lighting and occlusion, text detection and text recognition in natural scenes still face considerable challenges.
At present, the schemes that achieve better recognition are not end-to-end: most first perform text detection to obtain a text box and then perform character recognition on it, so the quality of the text detection directly affects the character recognition. The text detection methods currently in use fall into the following three categories:
(1) text detection method based on projection method
Although this method is faster than deep learning methods, its applicable scenes are limited: the input text data must have no background interference, the text must be short, and the characters must not be curved.
(2) Text detection method based on detection box regression
A text detection method based on detection-box regression can only locate rectangular or quadrilateral text boxes with a specific orientation. However, text instances of arbitrary shape (e.g., curved text, heavily inclined text, and long text) frequently appear in natural scenes, and the method's detection of such instances is not ideal.
(3) Text detection method based on semantic segmentation
A text detection method based on semantic segmentation can detect heavily inclined text. However, it still cannot separate two text instances that lie close together, it still misses curved text and long text, and most segmentation-based methods need a deep convolutional structure to extract text line boundary information and complex post-processing to resolve text lines, which slows detection.
Based on the above problems, the embodiment of the invention provides a text detection method, which can effectively improve the accuracy of natural scene text detection.
FIG. 1 is a schematic block diagram of a method 100 of text detection in accordance with one embodiment of the present invention.
It should be understood that fig. 1 shows the steps or operations of the method 100, but these steps are only examples; embodiments of the present invention may perform other operations or variations of the operations in fig. 1, not all steps need be performed, and the steps may be performed in other orders.
As shown in fig. 1, the method 100 may include steps 110, 120, 130, 140, and 150, which are as follows:
and S110, acquiring an image.
The image may be an image captured by a camera of the terminal device, or may also be an image stored in the terminal device.
The terminal device herein may refer to a device such as a mobile phone and a tablet computer, and may also be a vehicle-mounted device in a vehicle, which is not limited in the embodiment of the present invention.
And S120, extracting a feature map of the image by using a convolutional neural network.
Wherein the convolutional neural network may be a lightweight network.
Alternatively, the convolutional neural network may be a backbone network (of a text detection model). The convolutional neural network may include a plurality of convolutional layers.
The structure of the convolutional neural network is described in detail below with reference to fig. 2.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
As shown in fig. 2, the convolutional neural network (i.e., the backbone network) may include 4 bottleneck layers: a first bottleneck layer, a second bottleneck layer, a third bottleneck layer, and a fourth bottleneck layer.
Optionally, the convolutional neural network may be a GhostNet network.
For example, the convolutional neural network may include 4 bottleneck layers, namely the 4 convolutional layers Ghost bottleneck 1, Ghost bottleneck 2, Ghost bottleneck 4 and Ghost bottleneck 6 in the GhostNet network, whose input feature maps have 1/4, 1/8, 1/16 and 1/32 of the input image resolution, respectively.
In the embodiment of the invention, the backbone network (i.e., the convolutional neural network) adopts a GhostNet network, which is lighter than MobileNetV2, so the text detection speed can be improved.
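What makes GhostNet light is its "Ghost module", which produces half of its output channels with an ordinary convolution and the other half with cheap per-channel linear operations. The patent does not spell this structure out, so the following NumPy sketch follows the generally known GhostNet design (a 1 × 1 primary convolution plus one 3 × 3 depthwise "cheap operation" per intrinsic channel) and is illustrative only:

```python
import numpy as np

def ghost_module(x, w_primary, w_cheap):
    """x: (H, W, C_in); returns (H, W, 2*m) where m = w_primary.shape[1]."""
    # Primary part: a 1x1 convolution == per-pixel channel mixing.
    intrinsic = np.tensordot(x, w_primary, axes=([2], [0]))   # (H, W, m)
    # Cheap part: one 3x3 depthwise filter per intrinsic channel.
    h, w, m = intrinsic.shape
    padded = np.pad(intrinsic, ((1, 1), (1, 1), (0, 0)))
    ghost = np.zeros_like(intrinsic)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3, :]               # (3, 3, m)
            ghost[i, j] = np.einsum('ijc,ijc->c', patch, w_cheap)
    # Intrinsic and "ghost" features are concatenated channel-wise.
    return np.concatenate([intrinsic, ghost], axis=2)         # (H, W, 2*m)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
out = ghost_module(x,
                   rng.standard_normal((16, 12)),   # 1x1 weights -> 12 channels
                   rng.standard_normal((3, 3, 12))) # one 3x3 filter per channel
```

Because the depthwise part costs far less than a full convolution over all input channels, the module roughly halves the FLOPs of an equivalent ordinary convolution, which is the source of the speed advantage claimed above.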
And S130, performing up-sampling on the feature map to obtain a target feature map of the image.
Optionally, the feature map may be up-sampled by using a Feature Pyramid Network (FPN) to obtain a target feature map of the image.
Accordingly, the target feature map may be a feature map of a pyramid structure.
For example, as shown in fig. 2, the outputs of the 4 bottleneck layers may each be upsampled by a factor of 2 (up2).
The features obtained after upsampling the first bottleneck layer undergo a 1 × 1 convolution with 16 channels; the features from the second bottleneck layer undergo a 1 × 1 convolution with 32 channels and are then upsampled by a factor of 2; the features from the third bottleneck layer undergo a 1 × 1 convolution with 112 channels and are then upsampled by a factor of 2; and the features from the fourth bottleneck layer may undergo a 1 × 1 convolution with 320 channels and are then upsampled by a factor of 2.
At this point, the outputs obtained from the 4 bottleneck layers by the above operations can be fused (concatenated), and the fusion result can be regarded as the target feature map.
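A NumPy sketch of this fusion step is given below. The channel counts 16/32/112/320 are taken from the paragraph above, but the interpolation mode and the number of upsampling steps per level are assumptions chosen so that all four levels reach the common 1/4 scale and can be concatenated; the patent text is ambiguous on these details:

```python
import numpy as np

def up2(x):
    """2x nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(x, c_out, seed=0):
    """1x1 convolution == channel mixing with a (C_in, c_out) weight matrix."""
    w = np.random.default_rng(seed).standard_normal((x.shape[2], c_out))
    return np.tensordot(x, w, axes=([2], [0]))

# Backbone outputs at 1/4, 1/8, 1/16 and 1/32 scale of a 64x64 input.
p2 = np.random.rand(16, 16, 16)
p3 = np.random.rand(8, 8, 32)
p4 = np.random.rand(4, 4, 112)
p5 = np.random.rand(2, 2, 320)

# Bring every level to the 1/4 scale, mix channels, then concatenate.
f2 = conv1x1(p2, 16)
f3 = up2(conv1x1(p3, 32))
f4 = up2(up2(conv1x1(p4, 112)))
f5 = up2(up2(up2(conv1x1(p5, 320))))
target = np.concatenate([f2, f3, f4, f5], axis=2)  # the target feature map
```

Concatenation of all four pyramid levels gives the target feature map a mix of fine spatial detail (shallow levels) and high-level semantics (deep levels).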
And S140, convolving the target feature map by using different convolution kernels to obtain the semantic features of the image.
Wherein the semantic features of the image may include multi-scale feature information of the image.
For example, the multi-scale feature information may include boundary feature information of curved text lines and boundary feature information of oblique text lines.
Optionally, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image may include:
and performing variable convolution (DCN) on the target feature map to obtain the semantic features of the image.
For example, as shown in fig. 2, after the semantic features of the image are obtained by performing variable convolution on the target feature map, 1 × 1 convolution (Conv 1 × 1) may be performed on the semantic features.
In the embodiment of the present invention, the variable convolution may include an adaptive receptive field, and the variable convolution may further adjust the receptive field according to the size of the text line in the target feature map, so that the text detection effect of the long text and the curved text region may be improved.
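The mechanism behind the adaptive receptive field is that a deformable convolution learns a 2-D offset for every kernel sampling position and reads the input at those shifted, bilinearly interpolated locations. A minimal sketch of that sampling step for a single 3 × 3 kernel position follows; the offsets are set by hand here, whereas in an actual DCN they are produced by an extra convolution branch:

```python
import numpy as np

def bilinear_sample(x, y_pos, x_pos):
    """Bilinearly interpolate a 2-D map x at fractional (y_pos, x_pos)."""
    y0, x0 = int(np.floor(y_pos)), int(np.floor(x_pos))
    dy, dx = y_pos - y0, x_pos - x0
    h, w = x.shape
    def at(i, j):  # zero padding outside the map
        return x[i, j] if 0 <= i < h and 0 <= j < w else 0.0
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0) + dy * dx * at(y0 + 1, x0 + 1))

def deform_conv_point(x, weights, center, offsets):
    """One output value of a 3x3 deformable convolution at `center`.
    offsets: (3, 3, 2) learned (dy, dx) shifts, one per kernel tap."""
    cy, cx = center
    out = 0.0
    for ky in range(3):
        for kx in range(3):
            dy, dx = offsets[ky, kx]
            out += weights[ky, kx] * bilinear_sample(
                x, cy + ky - 1 + dy, cx + kx - 1 + dx)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
# With all-zero offsets this reduces to an ordinary 3x3 average at (2, 2).
plain = deform_conv_point(x, w, (2, 2), np.zeros((3, 3, 2)))
```

Non-zero offsets let the same 3 × 3 kernel gather samples from a stretched or bent neighbourhood, which is how the effective receptive field adapts to long or curved text lines.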
S150, performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image.
Wherein the semantic segmentation result can be used to determine a target text region in the image.
Optionally, after the semantically segmenting the semantic features to obtain a semantic segmentation result of the image (i.e., S150), the method 100 may further include:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
In the embodiment of the present invention, a differentiable binarization process (instead of a conventional binarization process with a fixed threshold) may be used to obtain the segmentation feature boundaries in the semantic segmentation result.
Different thresholds strongly affect text detection. Differentiable binarization places the binarization operation inside the network, where its parameters are continuously optimized, so the threshold of each pixel can be adjusted adaptively and the boundary regions of dense text lines can be separated better. The text detection box is then obtained from the text semantic segmentation and differentiable binarization results, which effectively improves the accuracy of text detection.
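The patent does not give the formula, but in the differentiable-binarization literature (the DB method of Liao et al., which this step appears to follow) the hard step function is replaced by a steep sigmoid of the gap between the probability map P and a learned per-pixel threshold map T, i.e. B = 1 / (1 + e^(-k(P - T))) with amplification factor k (50 in that paper). A NumPy sketch under that assumption:

```python
import numpy as np

def differentiable_binarize(P, T, k=50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))).
    P: probability map, T: per-pixel threshold map, both in [0, 1].
    Differentiable w.r.t. both P and T, so T can be learned in-network."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([[0.90, 0.20],
              [0.55, 0.45]])
T = np.full_like(P, 0.5)          # a learned threshold map; constant here
B = differentiable_binarize(P, T)
# Pixels well above the threshold saturate toward 1, well below toward 0,
# while pixels near the threshold keep a usable gradient for training.
```

Because the sigmoid is steep but smooth, gradients flow through the binarization during training, which is what allows the per-pixel threshold map to be optimized together with the segmentation network.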
Further, as shown in fig. 2, the determining a target text region in the image based on the binarization result may include:
and determining a target text region in the image based on the binarization result and the semantic segmentation result.
In the embodiment of the invention, deformable convolution (DCN) and a feature pyramid network (FPN) form the semantic segmentation network structure. The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, adopting a GhostNet network as the backbone further improves the text detection speed.
Fig. 3 is a schematic diagram of a detection result of a text detection method based on semantic segmentation in the prior art.
As the backgrounds of characters in natural scenes grow increasingly complex, most existing schemes adopt a semantic-segmentation-based text detection method to detect heavily inclined and curved text.
However, when two text instances lie close together, the semantic-segmentation-based method still may not separate them; it still misses curved text and long text; and most segmentation-based methods (of which PSENet is currently the best model) need a deep convolutional structure to extract text line boundary information and complex post-processing to resolve text lines, which slows detection.
For example, the text detection result based on the PSENet network is shown in fig. 3; the left image in fig. 3 is the input image and the right image is the detection result. As can be seen in fig. 3, the curved text "Japanese Restaurant" is missed, and the label of the phone includes falsely detected text.
Fig. 4 is a schematic diagram of a detection result of the text detection method according to an embodiment of the present invention.
In the embodiment of the invention, deformable convolution (DCN) and a feature pyramid network (FPN) form the semantic segmentation network structure. The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, adopting GhostNet as the backbone network further improves the text detection speed.
Meanwhile, the embodiment of the invention adopts differentiable binarization, which places the binarization operation inside the network and optimizes it jointly, so the threshold of each pixel can be predicted adaptively.
Quantitative evaluation on the ICDAR2015 dataset shows the scheme of the embodiment of the invention to be 5% more accurate and 8% faster than the prior-art scheme based on the PSENet network.
For example, fig. 4 shows a text detection result according to an embodiment of the present invention; the left image in fig. 4 is the input image and the right image is the detection result. As can be seen from fig. 4, the curved text "Japanese Restaurant" is not missed, and the label of the phone is more accurate than in the detection result of fig. 3.
Fig. 5 is a schematic block diagram of an apparatus 500 for text detection according to an embodiment of the present invention. It should be understood that the apparatus 500 for text detection illustrated in fig. 5 is only an example, and the apparatus 500 of an embodiment of the present invention may further include other modules or units.
It should be understood that the apparatus 500 is capable of performing the various steps in the method of fig. 1 and, to avoid repetition, will not be described in detail herein.
In one possible implementation manner of the present invention, the apparatus includes: an image acquisition unit 510, a feature extraction unit 520, an upsampling unit 530, a convolution unit 540, and a semantic segmentation unit 550.
An image acquisition unit 510 for acquiring an image.
The image may be an image captured by a camera of the terminal device, or may also be an image stored in the terminal device.
The terminal device herein may refer to a device such as a mobile phone and a tablet computer, and may also be a vehicle-mounted device in a vehicle, which is not limited in the embodiment of the present invention.
A feature extraction unit 520, configured to extract a feature map of the image using a convolutional neural network.
Wherein the convolutional neural network may be a lightweight network.
Alternatively, the convolutional neural network may be a backbone network (of a text detection model). The convolutional neural network may include a plurality of convolutional layers.
The structure of the convolutional neural network is described in detail below with reference to fig. 2.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
As shown in fig. 2, the convolutional neural network (i.e., the backbone network) may include 4 bottleneck layers: a first bottleneck layer, a second bottleneck layer, a third bottleneck layer, and a fourth bottleneck layer.
Optionally, the convolutional neural network may be a GhostNet network.
For example, the convolutional neural network may include 4 bottleneck layers, namely the 4 convolutional layers Ghost bottleneck 1, Ghost bottleneck 2, Ghost bottleneck 4 and Ghost bottleneck 6 in the GhostNet network, whose input feature maps have 1/4, 1/8, 1/16 and 1/32 of the input image resolution, respectively.
In the embodiment of the invention, the backbone network (i.e., the convolutional neural network) adopts a GhostNet network, which is lighter than MobileNetV2, so the text detection speed can be improved.
An upsampling unit 530, configured to upsample the feature map to obtain a target feature map of the image.
Optionally, the feature map may be up-sampled by using a Feature Pyramid Network (FPN) to obtain a target feature map of the image.
Accordingly, the target feature map may be a feature map of a pyramid structure.
For example, as shown in fig. 2, the outputs of the 4 bottleneck layers may each be upsampled by a factor of 2 (up2).
The features obtained after upsampling the first bottleneck layer undergo a 1 × 1 convolution with 16 channels; the features from the second bottleneck layer undergo a 1 × 1 convolution with 32 channels and are then upsampled by a factor of 2; the features from the third bottleneck layer undergo a 1 × 1 convolution with 112 channels and are then upsampled by a factor of 2; and the features from the fourth bottleneck layer may undergo a 1 × 1 convolution with 320 channels and are then upsampled by a factor of 2.
At this point, the outputs obtained from the 4 bottleneck layers by the above operations can be fused (concatenated), and the fusion result can be regarded as the target feature map.
And a convolution unit 540, configured to perform convolution on the target feature map with different convolution kernels to obtain a semantic feature of the image.
Wherein the semantic features of the image may include multi-scale feature information of the image.
For example, the multi-scale feature information may include boundary feature information of curved text lines and boundary feature information of oblique text lines.
Optionally, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image may include:
and performing variable convolution (DCN) on the target feature map to obtain the semantic features of the image.
For example, as shown in fig. 2, after the semantic features of the image are obtained by performing variable convolution on the target feature map, 1 × 1 convolution (Conv 1 × 1) may be performed on the semantic features.
In the embodiment of the present invention, the variable convolution may include an adaptive receptive field, and the variable convolution may further adjust the receptive field according to the size of the text line in the target feature map, so that the text detection effect of the long text and the curved text region may be improved.
And a semantic segmentation unit 550, configured to perform semantic segmentation on the semantic features to obtain a semantic segmentation result of the image.
Wherein the semantic segmentation result can be used to determine a target text region in the image.
Optionally, after the semantic segmentation unit 550 obtains the semantic segmentation result of the image, the apparatus 500 may be further configured for:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
In the embodiment of the present invention, a differentiable binarization process (instead of a conventional binarization process with a fixed threshold) may be used to obtain the segmentation feature boundaries in the semantic segmentation result.
The choice of threshold strongly affects text detection. Differentiable binarization moves the binarization operation into the network itself, where its parameters can be optimized continuously, so that the threshold of each pixel is adjusted adaptively and the boundary regions of dense text lines are separated more cleanly. A text detection box is then obtained from the text semantic segmentation result and the differentiable binarization result, which effectively improves the accuracy of text detection.
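The per-pixel soft thresholding can be sketched as follows. This is an illustrative example assuming the differentiable binarization formulation popularized in the text-detection literature ("DB"), not necessarily the patent's exact formula: a steep sigmoid of the difference between the probability map P and a learned per-pixel threshold map T approximates a hard step while remaining differentiable, so T can be optimized by gradient descent.

```python
import math

# Illustrative sketch (assumed formulation): B[i][j] = sigmoid(k * (P[i][j] - T[i][j])),
# where P is the segmentation probability map, T is a learned per-pixel threshold
# map, and the amplification factor k (e.g. 50) sharpens the transition.

def differentiable_binarize(prob_map, thresh_map, k=50.0):
    return [[1.0 / (1.0 + math.exp(-k * (p - t)))
             for p, t in zip(prow, trow)]
            for prow, trow in zip(prob_map, thresh_map)]

P = [[0.9, 0.2], [0.5, 0.7]]  # hypothetical probability map
T = [[0.3, 0.3], [0.5, 0.3]]  # hypothetical per-pixel threshold map
B = differentiable_binarize(P, T)
```

Pixels well above their local threshold saturate toward 1, pixels below toward 0, and pixels exactly at the threshold land at 0.5, so adjacent dense text lines can receive different effective thresholds.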
Further, as shown in fig. 2, the determining a target text region in the image based on the binarization result may include:
and determining a target text region in the image based on the binarization result and the semantic segmentation result.
In the embodiment of the invention, a semantic segmentation network structure is formed from deformable convolution (DCN) and a feature pyramid network (FPN). The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, using a GhostNet network as the backbone further increases the text detection speed.
Optionally, the convolutional neural network is a GhostNet network.
Optionally, the upsampling unit 530 is specifically configured to:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
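The FPN top-down upsampling step can be sketched as below. This is an illustrative example, not the patent's exact network: a coarse, high-level feature map is upsampled 2× (nearest neighbor here for simplicity; real FPNs often use bilinear interpolation) and merged with the same-sized lateral feature map by elementwise addition.

```python
# Illustrative sketch of one FPN top-down merge step.

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out

def fpn_merge(coarse, lateral):
    """Upsample the coarse map and add the same-sized lateral map elementwise."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(urow, lrow)]
            for urow, lrow in zip(up, lateral)]

coarse = [[1.0, 2.0], [3.0, 4.0]]          # hypothetical low-resolution, high-level map
lateral = [[0.5] * 4 for _ in range(4)]    # hypothetical lateral map at the finer scale
merged = fpn_merge(coarse, lateral)
```

Repeating this merge down the pyramid yields the multi-scale target feature map that the later convolutions consume.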
Optionally, the convolution unit 540 is specifically configured to:
and performing deformable convolution on the target feature map to obtain the semantic features of the image.
Optionally, after the semantic segmentation is performed on the semantic features to obtain a semantic segmentation result of the image, the apparatus 500 further includes a binarization unit 560, where the binarization unit 560 is configured to:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result; and determining a target text area in the image based on the binarization result.
Optionally, the deformable convolution includes an adaptive receptive field.
Optionally, the deformable convolution can adjust the receptive field according to the text line size in the target feature map.
It should be appreciated that the apparatus 500 for text detection herein is embodied in the form of functional modules. The term "module" herein may be implemented in software and/or hardware, which is not specifically limited. For example, a "module" may be a software program, a hardware circuit, or a combination of the two that implements the functionality described above. The hardware circuitry may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
As an example, the apparatus 500 for text detection provided by the embodiment of the present invention may be a processor or a chip, and is configured to perform the method according to the embodiment of the present invention.
FIG. 6 is a schematic block diagram of an apparatus 400 for text detection in accordance with one embodiment of the present invention. The apparatus 400 shown in fig. 6 includes a memory 401, a processor 402, a communication interface 403, and a bus 404. The memory 401, the processor 402 and the communication interface 403 are connected to each other by a bus 404.
The memory 401 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 401 may store a program; when the program stored in the memory 401 is executed by the processor 402, the processor 402 is configured to perform the steps of the text detection method of the embodiment of the present invention, for example, the steps of the embodiment shown in fig. 1.
The processor 402 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the method for text detection according to the embodiment of the present invention.
The processor 402 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method for text detection according to the embodiment of the present invention may be implemented by integrated logic circuits of hardware in the processor 402 or instructions in the form of software.
The processor 402 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 401; the processor 402 reads the information in the memory 401 and, in combination with its hardware, performs the functions of the units included in the text detection apparatus of the embodiment of the present invention, or performs the text detection method of the method embodiment of the present invention, for example, the steps/functions of the embodiment shown in fig. 1.
The communication interface 403 may use transceiver means, such as, but not limited to, a transceiver, to enable communication between the apparatus 400 and other devices or communication networks.
Bus 404 may include a path that transfers information between various components of apparatus 400 (e.g., memory 401, processor 402, communication interface 403).
It should be understood that the apparatus 400 shown in the embodiment of the present invention may be a processor or a chip for executing the method of text detection described in the embodiment of the present invention.
It should be understood that the processor in the embodiments of the present invention may be a Central Processing Unit (CPU), and the processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. In addition, "/" in this document generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with reference to the surrounding text.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and the like that are within the spirit and principle of the present invention are included in the present invention.

Claims (10)

1. A method of text detection, comprising:
acquiring an image;
extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network;
up-sampling the feature map to obtain a target feature map of the image;
convolving the target feature map by using different convolution kernels to obtain semantic features of the image;
and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image.
2. The method of claim 1, wherein the convolutional neural network is a GhostNet network.
3. The method of claim 2, wherein the upsampling the feature map to obtain a target feature map of the image comprises:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
4. The method according to any one of claims 1 to 3, wherein the convolving the target feature map with different convolution kernels to obtain semantic features of the image comprises:
and performing deformable convolution on the target feature map to obtain the semantic features of the image.
5. The method of claim 4, wherein after the performing semantic segmentation on the semantic features to obtain the semantic segmentation result of the image, the method further comprises:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
6. The method of claim 5, wherein the deformable convolution comprises an adaptive receptive field.
7. The method of claim 6, wherein the deformable convolution is capable of adjusting the receptive field according to the text line size in the target feature map.
8. An apparatus for text detection, comprising:
an image acquisition unit for acquiring an image;
the characteristic extraction unit is used for extracting a characteristic diagram of the image by using a convolutional neural network, and the convolutional neural network is a lightweight network;
the up-sampling unit is used for up-sampling the feature map to obtain a target feature map of the image;
the convolution unit is used for performing convolution on the target feature map by using different convolution kernels to obtain the semantic features of the image;
and the semantic segmentation unit is used for performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, and the semantic segmentation result is used for determining a target text region in the image.
9. An apparatus for text detection comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium comprising computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
CN202011644150.8A 2020-12-31 2020-12-31 Text detection method and device Pending CN112712078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644150.8A CN112712078A (en) 2020-12-31 2020-12-31 Text detection method and device


Publications (1)

Publication Number Publication Date
CN112712078A 2021-04-27

Family

ID=75548076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644150.8A Pending CN112712078A (en) 2020-12-31 2020-12-31 Text detection method and device

Country Status (1)

Country Link
CN (1) CN112712078A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313083A (en) * 2021-07-28 2021-08-27 北京世纪好未来教育科技有限公司 Text detection method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880000A (en) * 2019-11-27 2020-03-13 上海智臻智能网络科技股份有限公司 Picture character positioning method and device, computer equipment and storage medium
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111444919A (en) * 2020-04-17 2020-07-24 南京大学 Method for detecting text with any shape in natural scene
CN111652217A (en) * 2020-06-03 2020-09-11 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN111967488A (en) * 2020-06-22 2020-11-20 南昌大学 Mobile phone shot text image matching method based on twin convolutional neural network
CN112115900A (en) * 2020-09-24 2020-12-22 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210427