CN112712078A - Text detection method and device - Google Patents

Text detection method and device

Info

Publication number
CN112712078A
CN112712078A
Authority
CN
China
Prior art keywords
image
feature map
text
semantic segmentation
semantic
Prior art date
Legal status
Pending
Application number
CN202011644150.8A
Other languages
Chinese (zh)
Inventor
崔淼
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN202011644150.8A priority Critical patent/CN112712078A/en
Publication of CN112712078A publication Critical patent/CN112712078A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text detection method and a text detection device, wherein the method comprises the following steps: acquiring an image; extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network; up-sampling the feature map to obtain a target feature map of the image; convolving the target feature map by using different convolution kernels to obtain semantic features of the image; and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image. The method in the embodiment of the invention can improve the accuracy of natural scene text detection.

Description

Text detection method and device
Technical Field
The invention relates to the technical field of image recognition, and in particular to a text detection method and device.
Background
With the rapid development of neural networks, natural scene text detection (scene text detection) has attracted wide attention in numerous applications such as target localization for automatic driving, subtitle recognition, sensitive word recognition, scene understanding, and product recognition. However, text in natural scene images varies greatly in size, shape, color, font, orientation, brightness, contrast, occlusion, and style, so text instances of arbitrary shape (e.g., curved text, heavily inclined text, and long text) frequently appear.
Currently, there are three main methods for text detection. The first, based on projection, is limited in its applicable scenes: the input text data must have no background interference, the text must be short, and the characters must not be curved. The second, based on detection-box regression, can only locate rectangular or quadrilateral text boxes with a specific orientation, so its detection of arbitrarily shaped text instances is unsatisfactory. The third, based on semantic segmentation, can detect heavily inclined text, but it still cannot separate two text instances that lie close together, and it still misses curved text and long text. The effect of text detection in natural scenes is therefore not yet ideal.
Therefore, how to improve the accuracy of natural scene text detection has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text detection method and apparatus, so as to solve the problem in the prior art that the accuracy of natural scene text detection is low.
In a first aspect, the present invention provides a method for text detection, including:
acquiring an image; extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network; up-sampling the feature map to obtain a target feature map of the image; convolving the target feature map by using different convolution kernels to obtain semantic features of the image; and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image.
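Read as a pipeline, the five claimed steps can be sketched in Python with stand-in functions. Every function name and the trivial operation inside it is a hypothetical placeholder for the corresponding claimed step, not an implementation from the patent:

```python
import numpy as np

def extract_features(image):
    # Stand-in for the lightweight CNN backbone (step 2): here simply a
    # 4x-downsampled single-channel map so the sketch stays runnable.
    return image[::4, ::4].mean(axis=-1, keepdims=True)

def upsample(feature_map, factor=4):
    # Stand-in for step 3: nearest-neighbour upsampling back to image size.
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)

def multi_kernel_conv(feature_map):
    # Stand-in for step 4: in the patent this is convolution with
    # different kernels; here an identity placeholder.
    return feature_map

def segment(semantic_features, threshold=0.5):
    # Stand-in for step 5: per-pixel text/background decision.
    return (semantic_features > threshold).astype(np.uint8)

image = np.random.rand(32, 32, 3)           # step 1: acquire an image
features = extract_features(image)          # step 2: feature map
target_map = upsample(features)             # step 3: target feature map
semantic = multi_kernel_conv(target_map)    # step 4: semantic features
mask = segment(semantic)                    # step 5: text-region mask
```

The mask plays the role of the semantic segmentation result from which the target text region is determined.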
In the embodiments of the invention, the different convolution kernels have different receptive fields. Convolving the target feature map with these kernels yields the semantic features of the image, and performing semantic segmentation on those features yields the segmentation result, which improves the detection of extremely long text and curved characters.
Meanwhile, because a lightweight network extracts the feature map of the image, the text detection speed can be further improved.
In one embodiment, the convolutional neural network is a GhostNet network.
In one embodiment, the upsampling the feature map to obtain a target feature map of the image includes:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
In one embodiment, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image includes:
and performing variable convolution on the target feature map to obtain the semantic features of the image.
In one embodiment, after the semantic segmentation of the semantic features obtains a semantic segmentation result of the image, the method further includes:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result; and determining a target text area in the image based on the binarization result.
In one embodiment, the deformable convolution has an adaptive receptive field.
In one embodiment, the deformable convolution can adjust its receptive field according to the size of the text lines in the target feature map.
In a second aspect, an apparatus for text detection is provided, the apparatus comprising:
an image acquisition unit for acquiring an image; the characteristic extraction unit is used for extracting a characteristic diagram of the image by using a convolutional neural network, and the convolutional neural network is a lightweight network; the up-sampling unit is used for up-sampling the feature map to obtain a target feature map of the image; the convolution unit is used for performing convolution on the target feature map by using different convolution kernels to obtain the semantic features of the image; and the semantic segmentation unit is used for performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, and the semantic segmentation result is used for determining a target text region in the image.
In a third aspect, the present invention provides an apparatus for text detection, where the apparatus is configured to perform the method in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an apparatus for text detection is provided, where the apparatus includes a storage medium, which may be a non-volatile storage medium, and a central processing unit, where a computer-executable program is stored in the storage medium, and the central processing unit is connected to the non-volatile storage medium and executes the computer-executable program to implement the first aspect or the method in any possible implementation manner of the first aspect.
In a fifth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the first aspect or any possible implementation manner of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the first aspect or the method in any possible implementation manner of the first aspect.
A sixth aspect provides a computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
In the embodiments of the invention, the different convolution kernels have different receptive fields. Convolving the target feature map with these kernels yields the semantic features of the image, and performing semantic segmentation on those features yields the segmentation result, which improves the detection of extremely long text and curved characters. Meanwhile, because a lightweight network extracts the feature map of the image, the text detection speed can be further improved.
Drawings
Fig. 1 is a schematic block diagram of a text detection method according to an embodiment of the present invention.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a detection result of a text detection method based on semantic segmentation in the prior art.
Fig. 4 is a schematic diagram of a detection result of the text detection method according to an embodiment of the present invention.
Fig. 5 is a schematic block diagram of an apparatus for text detection according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of a text detection apparatus according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the rapid development of neural networks, natural scene text detection has gained wide attention in many applications such as automatic driving target positioning, subtitle recognition, sensitive word recognition, scene understanding, product recognition and the like in recent years. However, due to the large differences between foreground text and background objects, as well as text variations of various shapes, colors, fonts, orientations and scales, and extreme lighting and occlusion, text detection and text recognition in natural scenes still face considerable challenges.
At present, the schemes that achieve better recognition are not end-to-end: most first perform text detection to obtain a text box and then perform character recognition on it, so the quality of the text detection directly affects the character recognition. The text detection methods currently in use fall into the following three categories:
(1) text detection method based on projection method
Although this method is faster than deep learning methods, its applicable scenes are limited: the input text data must have no background interference, the text must be short, and the characters must not be curved.
(2) Text detection method based on detection box regression
A text detection method based on detection-box regression can only locate rectangular or quadrilateral text boxes with a specific orientation. However, text instances of arbitrary shape (e.g., curved text, heavily inclined text, and long text) frequently appear in natural scenes, and the method's detection of such instances is not ideal.
(3) Text detection method based on semantic segmentation
A text detection method based on semantic segmentation can detect heavily inclined text. However, it still cannot separate two text instances that lie close together, it still misses curved text and long text, and most segmentation-based methods need a deep convolutional structure to extract text line boundary information and complex post-processing to resolve text lines, which slows detection.
Based on the above problems, the embodiment of the invention provides a text detection method, which can effectively improve the accuracy of natural scene text detection.
FIG. 1 is a schematic block diagram of a method 100 of text detection in accordance with one embodiment of the present invention.
It should be understood that fig. 1 shows the steps or operations of the method 100, but these steps are only examples; embodiments of the present invention may perform other operations or variations of the operations in fig. 1, not all steps need be performed, and the steps may be performed in other orders.
As shown in fig. 1, the method 100 may include steps 110, 120, 130, 140, and 150, which are as follows:
and S110, acquiring an image.
The image may be an image captured by a camera of the terminal device, or may also be an image stored in the terminal device.
The terminal device herein may refer to a device such as a mobile phone and a tablet computer, and may also be a vehicle-mounted device in a vehicle, which is not limited in the embodiment of the present invention.
And S120, extracting a feature map of the image by using a convolutional neural network.
Wherein the convolutional neural network may be a lightweight network.
Alternatively, the convolutional neural network may be a backbone network (of a text detection model). The convolutional neural network may include a plurality of convolutional layers.
The structure of the convolutional neural network is described in detail below with reference to fig. 2.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
As shown in fig. 2, the convolutional neural network (i.e., the backbone network) may include 4 bottleneck layers: a first bottleneck layer, a second bottleneck layer, a third bottleneck layer, and a fourth bottleneck layer.
Optionally, the convolutional neural network may be a GhostNet network.
For example, the convolutional neural network may include 4 bottleneck layers, namely the 4 convolutional layers Ghost bottleneck 1, Ghost bottleneck 2, Ghost bottleneck 4 and Ghost bottleneck 6 in the GhostNet network, whose input feature maps have 1/4, 1/8, 1/16 and 1/32 of the input image resolution, respectively.
In the embodiment of the invention, the backbone network (i.e., the convolutional neural network) adopts a GhostNet network, which is lighter than MobileNetV2, so the text detection speed can be improved.
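What makes GhostNet light is its "Ghost module", which produces half of its output channels with an ordinary convolution and the other half with cheap per-channel linear operations. The patent does not spell this structure out, so the following NumPy sketch follows the generally known GhostNet design (a 1 × 1 primary convolution plus one 3 × 3 depthwise "cheap operation" per intrinsic channel) and is illustrative only:

```python
import numpy as np

def ghost_module(x, w_primary, w_cheap):
    """x: (H, W, C_in); returns (H, W, 2*m) where m = w_primary.shape[1]."""
    # Primary part: a 1x1 convolution == per-pixel channel mixing.
    intrinsic = np.tensordot(x, w_primary, axes=([2], [0]))   # (H, W, m)
    # Cheap part: one 3x3 depthwise filter per intrinsic channel.
    h, w, m = intrinsic.shape
    padded = np.pad(intrinsic, ((1, 1), (1, 1), (0, 0)))
    ghost = np.zeros_like(intrinsic)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3, :]               # (3, 3, m)
            ghost[i, j] = np.einsum('ijc,ijc->c', patch, w_cheap)
    # Intrinsic and "ghost" features are concatenated channel-wise.
    return np.concatenate([intrinsic, ghost], axis=2)         # (H, W, 2*m)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
out = ghost_module(x,
                   rng.standard_normal((16, 12)),   # 1x1 weights -> 12 channels
                   rng.standard_normal((3, 3, 12))) # one 3x3 filter per channel
```

Because the depthwise part costs far less than a full convolution over all input channels, the module roughly halves the FLOPs of an equivalent ordinary convolution, which is the source of the speed advantage claimed above.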
And S130, performing up-sampling on the feature map to obtain a target feature map of the image.
Optionally, the feature map may be up-sampled by using a Feature Pyramid Network (FPN) to obtain a target feature map of the image.
Accordingly, the target feature map may be a feature map of a pyramid structure.
For example, as shown in fig. 2, the outputs of the 4 bottleneck layers may each be upsampled by a factor of 2 (up2).
The features obtained after upsampling the first bottleneck layer undergo a 1 × 1 convolution with 16 channels; the features from the second bottleneck layer undergo a 1 × 1 convolution with 32 channels and are then upsampled by a factor of 2; the features from the third bottleneck layer undergo a 1 × 1 convolution with 112 channels and are then upsampled by a factor of 2; and the features from the fourth bottleneck layer may undergo a 1 × 1 convolution with 320 channels and are then upsampled by a factor of 2.
At this point, the outputs obtained from the 4 bottleneck layers by the above operations can be fused (concatenated), and the fusion result can be regarded as the target feature map.
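A NumPy sketch of this fusion step is given below. The channel counts 16/32/112/320 are taken from the paragraph above, but the interpolation mode and the number of upsampling steps per level are assumptions chosen so that all four levels reach the common 1/4 scale and can be concatenated; the patent text is ambiguous on these details:

```python
import numpy as np

def up2(x):
    """2x nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv1x1(x, c_out, seed=0):
    """1x1 convolution == channel mixing with a (C_in, c_out) weight matrix."""
    w = np.random.default_rng(seed).standard_normal((x.shape[2], c_out))
    return np.tensordot(x, w, axes=([2], [0]))

# Backbone outputs at 1/4, 1/8, 1/16 and 1/32 scale of a 64x64 input.
p2 = np.random.rand(16, 16, 16)
p3 = np.random.rand(8, 8, 32)
p4 = np.random.rand(4, 4, 112)
p5 = np.random.rand(2, 2, 320)

# Bring every level to the 1/4 scale, mix channels, then concatenate.
f2 = conv1x1(p2, 16)
f3 = up2(conv1x1(p3, 32))
f4 = up2(up2(conv1x1(p4, 112)))
f5 = up2(up2(up2(conv1x1(p5, 320))))
target = np.concatenate([f2, f3, f4, f5], axis=2)  # the target feature map
```

Concatenation of all four pyramid levels gives the target feature map a mix of fine spatial detail (shallow levels) and high-level semantics (deep levels).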
And S140, convolving the target feature map by using different convolution kernels to obtain the semantic features of the image.
Wherein the semantic features of the image may include multi-scale feature information of the image.
For example, the multi-scale feature information may include boundary feature information of curved text lines and boundary feature information of oblique text lines.
Optionally, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image may include:
and performing variable convolution (DCN) on the target feature map to obtain the semantic features of the image.
For example, as shown in fig. 2, after the semantic features of the image are obtained by performing variable convolution on the target feature map, 1 × 1 convolution (Conv 1 × 1) may be performed on the semantic features.
In the embodiment of the present invention, the variable convolution may include an adaptive receptive field, and the variable convolution may further adjust the receptive field according to the size of the text line in the target feature map, so that the text detection effect of the long text and the curved text region may be improved.
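The mechanism behind the adaptive receptive field is that a deformable convolution learns a 2-D offset for every kernel sampling position and reads the input at those shifted, bilinearly interpolated locations. A minimal sketch of that sampling step for a single 3 × 3 kernel position follows; the offsets are set by hand here, whereas in an actual DCN they are produced by an extra convolution branch:

```python
import numpy as np

def bilinear_sample(x, y_pos, x_pos):
    """Bilinearly interpolate a 2-D map x at fractional (y_pos, x_pos)."""
    y0, x0 = int(np.floor(y_pos)), int(np.floor(x_pos))
    dy, dx = y_pos - y0, x_pos - x0
    h, w = x.shape
    def at(i, j):  # zero padding outside the map
        return x[i, j] if 0 <= i < h and 0 <= j < w else 0.0
    return ((1 - dy) * (1 - dx) * at(y0, x0) + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0) + dy * dx * at(y0 + 1, x0 + 1))

def deform_conv_point(x, weights, center, offsets):
    """One output value of a 3x3 deformable convolution at `center`.
    offsets: (3, 3, 2) learned (dy, dx) shifts, one per kernel tap."""
    cy, cx = center
    out = 0.0
    for ky in range(3):
        for kx in range(3):
            dy, dx = offsets[ky, kx]
            out += weights[ky, kx] * bilinear_sample(
                x, cy + ky - 1 + dy, cx + kx - 1 + dx)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
# With all-zero offsets this reduces to an ordinary 3x3 average at (2, 2).
plain = deform_conv_point(x, w, (2, 2), np.zeros((3, 3, 2)))
```

Non-zero offsets let the same 3 × 3 kernel gather samples from a stretched or bent neighbourhood, which is how the effective receptive field adapts to long or curved text lines.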
S150, performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image.
Wherein the semantic segmentation result can be used to determine a target text region in the image.
Optionally, after the semantically segmenting the semantic features to obtain a semantic segmentation result of the image (i.e., S150), the method 100 may further include:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
In the embodiment of the present invention, a differentiable binarization process (instead of a conventional binarization process with a fixed threshold) may be used to obtain the segmentation feature boundaries in the semantic segmentation result.
Different thresholds strongly affect text detection. Differentiable binarization places the binarization operation inside the network, where its parameters are continuously optimized, so the threshold of each pixel can be adjusted adaptively and the boundary regions of dense text lines can be separated better. The text detection box is then obtained from the text semantic segmentation and differentiable binarization results, which effectively improves the accuracy of text detection.
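The patent does not give the formula, but in the differentiable-binarization literature (the DB method of Liao et al., which this step appears to follow) the hard step function is replaced by a steep sigmoid of the gap between the probability map P and a learned per-pixel threshold map T, i.e. B = 1 / (1 + e^(-k(P - T))) with amplification factor k (50 in that paper). A NumPy sketch under that assumption:

```python
import numpy as np

def differentiable_binarize(P, T, k=50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))).
    P: probability map, T: per-pixel threshold map, both in [0, 1].
    Differentiable w.r.t. both P and T, so T can be learned in-network."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([[0.90, 0.20],
              [0.55, 0.45]])
T = np.full_like(P, 0.5)          # a learned threshold map; constant here
B = differentiable_binarize(P, T)
# Pixels well above the threshold saturate toward 1, well below toward 0,
# while pixels near the threshold keep a usable gradient for training.
```

Because the sigmoid is steep but smooth, gradients flow through the binarization during training, which is what allows the per-pixel threshold map to be optimized together with the segmentation network.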
Further, as shown in fig. 2, the determining a target text region in the image based on the binarization result may include:
and determining a target text region in the image based on the binarization result and the semantic segmentation result.
In the embodiment of the invention, deformable convolution (DCN) and a feature pyramid network (FPN) form the semantic segmentation network structure. The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, adopting a GhostNet network as the backbone further improves the text detection speed.
Fig. 3 is a schematic diagram of a detection result of a text detection method based on semantic segmentation in the prior art.
As the backgrounds of characters in natural scenes grow increasingly complex, most existing schemes adopt a semantic-segmentation-based text detection method to detect heavily inclined and curved text.
However, when two text instances lie close together, the semantic-segmentation-based method still may not separate them; it still misses curved text and long text; and most segmentation-based methods (of which PSENet is currently the best model) need a deep convolutional structure to extract text line boundary information and complex post-processing to resolve text lines, which slows detection.
For example, the text detection result based on the PSENet network is shown in fig. 3; the left image in fig. 3 is the input image and the right image is the detection result. As can be seen in fig. 3, the curved text "Japanese Restaurant" is missed, and the label of the phone includes falsely detected text.
Fig. 4 is a schematic diagram of a detection result of the text detection method according to an embodiment of the present invention.
In the embodiment of the invention, deformable convolution (DCN) and a feature pyramid network (FPN) form the semantic segmentation network structure. The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, adopting GhostNet as the backbone network further improves the text detection speed.
Meanwhile, the embodiment of the invention adopts differentiable binarization, which places the binarization operation inside the network and optimizes it jointly, so the threshold of each pixel can be predicted adaptively.
Quantitative evaluation on the ICDAR2015 dataset shows the scheme of the embodiment of the invention to be 5% more accurate and 8% faster than the prior-art scheme based on the PSENet network.
For example, fig. 4 shows a text detection result according to an embodiment of the present invention; the left image in fig. 4 is the input image and the right image is the detection result. As can be seen from fig. 4, the curved text "Japanese Restaurant" is not missed, and the label of the phone is more accurate than in the detection result of fig. 3.
Fig. 5 is a schematic block diagram of an apparatus 500 for text detection according to an embodiment of the present invention. It should be understood that the apparatus 500 for text detection illustrated in fig. 5 is only an example, and the apparatus 500 of an embodiment of the present invention may further include other modules or units.
It should be understood that the apparatus 500 is capable of performing the various steps in the method of fig. 1 and, to avoid repetition, will not be described in detail herein.
In one possible implementation manner of the present invention, the apparatus includes: an image acquisition unit 510, a feature extraction unit 520, an upsampling unit 530, a convolution unit 540, and a semantic segmentation unit 550.
An image acquisition unit 510 for acquiring an image.
The image may be an image captured by a camera of the terminal device, or may also be an image stored in the terminal device.
The terminal device herein may refer to a device such as a mobile phone and a tablet computer, and may also be a vehicle-mounted device in a vehicle, which is not limited in the embodiment of the present invention.
A feature extraction unit 520, configured to extract a feature map of the image using a convolutional neural network.
Wherein the convolutional neural network may be a lightweight network.
Alternatively, the convolutional neural network may be a backbone network (of a text detection model). The convolutional neural network may include a plurality of convolutional layers.
The structure of the convolutional neural network is described in detail below with reference to fig. 2.
Fig. 2 is a schematic block diagram of a text detection method based on deformable-convolution semantic segmentation according to an embodiment of the present invention.
As shown in fig. 2, the convolutional neural network (i.e., the backbone network) may include 4 bottleneck layers: a first bottleneck layer, a second bottleneck layer, a third bottleneck layer, and a fourth bottleneck layer.
Optionally, the convolutional neural network may be a GhostNet network.
For example, the convolutional neural network may include 4 bottleneck layers, namely the 4 convolutional layers Ghost bottleneck 1, Ghost bottleneck 2, Ghost bottleneck 4 and Ghost bottleneck 6 in the GhostNet network, whose input feature maps have 1/4, 1/8, 1/16 and 1/32 of the input image resolution, respectively.
In the embodiment of the invention, the backbone network (i.e., the convolutional neural network) adopts a GhostNet network, which is lighter than MobileNetV2, so the text detection speed can be improved.
An upsampling unit 530, configured to upsample the feature map to obtain a target feature map of the image.
Optionally, the feature map may be up-sampled by using a Feature Pyramid Network (FPN) to obtain a target feature map of the image.
Accordingly, the target feature map may be a feature map of a pyramid structure.
For example, as shown in fig. 2, the outputs of the 4 bottleneck layers may each be upsampled by a factor of 2 (up2).
The features obtained after upsampling the first bottleneck layer undergo a 1 × 1 convolution with 16 channels; the features from the second bottleneck layer undergo a 1 × 1 convolution with 32 channels and are then upsampled by a factor of 2; the features from the third bottleneck layer undergo a 1 × 1 convolution with 112 channels and are then upsampled by a factor of 2; and the features from the fourth bottleneck layer may undergo a 1 × 1 convolution with 320 channels and are then upsampled by a factor of 2.
At this point, the outputs obtained from the 4 bottleneck layers by the above operations can be fused (concatenated), and the fusion result can be regarded as the target feature map.
And a convolution unit 540, configured to perform convolution on the target feature map with different convolution kernels to obtain a semantic feature of the image.
Wherein the semantic features of the image may include multi-scale feature information of the image.
For example, the multi-scale feature information may include boundary feature information of curved text lines and boundary feature information of oblique text lines.
Optionally, the convolving the target feature map with different convolution kernels to obtain the semantic features of the image may include:
and performing variable convolution (DCN) on the target feature map to obtain the semantic features of the image.
For example, as shown in fig. 2, after the semantic features of the image are obtained by performing variable convolution on the target feature map, 1 × 1 convolution (Conv 1 × 1) may be performed on the semantic features.
In the embodiment of the present invention, the variable convolution may include an adaptive receptive field, and the variable convolution may further adjust the receptive field according to the size of the text line in the target feature map, so that the text detection effect of the long text and the curved text region may be improved.
And a semantic segmentation unit 550, configured to perform semantic segmentation on the semantic features to obtain a semantic segmentation result of the image.
Wherein the semantic segmentation result can be used to determine a target text region in the image.
Optionally, after the semantic segmentation unit 550 obtains the semantic segmentation result of the image, the apparatus 500 may be further configured for:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
In the embodiment of the present invention, a differentiable binarization process (instead of a conventional binarization process with a fixed threshold) may be used to obtain the segmentation feature boundaries in the semantic segmentation result.
The choice of threshold strongly affects text detection. Differentiable binarization moves the binarization operation into the network itself, where its parameters can be optimized continuously, so that the threshold of each pixel is adjusted adaptively and the boundary regions of dense text lines are separated more cleanly. A text detection box is then obtained from the text semantic segmentation result and the differentiable binarization result, which effectively improves the accuracy of text detection.
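The per-pixel soft thresholding can be sketched as follows. This is an illustrative example assuming the differentiable binarization formulation popularized in the text-detection literature ("DB"), not necessarily the patent's exact formula: a steep sigmoid of the difference between the probability map P and a learned per-pixel threshold map T approximates a hard step while remaining differentiable, so T can be optimized by gradient descent.

```python
import math

# Illustrative sketch (assumed formulation): B[i][j] = sigmoid(k * (P[i][j] - T[i][j])),
# where P is the segmentation probability map, T is a learned per-pixel threshold
# map, and the amplification factor k (e.g. 50) sharpens the transition.

def differentiable_binarize(prob_map, thresh_map, k=50.0):
    return [[1.0 / (1.0 + math.exp(-k * (p - t)))
             for p, t in zip(prow, trow)]
            for prow, trow in zip(prob_map, thresh_map)]

P = [[0.9, 0.2], [0.5, 0.7]]  # hypothetical probability map
T = [[0.3, 0.3], [0.5, 0.3]]  # hypothetical per-pixel threshold map
B = differentiable_binarize(P, T)
```

Pixels well above their local threshold saturate toward 1, pixels below toward 0, and pixels exactly at the threshold land at 0.5, so adjacent dense text lines can receive different effective thresholds.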
Further, as shown in fig. 2, the determining a target text region in the image based on the binarization result may include:
and determining a target text region in the image based on the binarization result and the semantic segmentation result.
In the embodiment of the invention, a semantic segmentation network structure is formed from deformable convolution (DCN) and a feature pyramid network (FPN). The deformable convolution layers have adaptive receptive fields, which greatly improves the detection of extremely long text and curved text regions; meanwhile, using a GhostNet network as the backbone further increases the text detection speed.
Optionally, the convolutional neural network is a GhostNet network.
Optionally, the upsampling unit 530 is specifically configured to:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
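The FPN top-down upsampling step can be sketched as below. This is an illustrative example, not the patent's exact network: a coarse, high-level feature map is upsampled 2× (nearest neighbor here for simplicity; real FPNs often use bilinear interpolation) and merged with the same-sized lateral feature map by elementwise addition.

```python
# Illustrative sketch of one FPN top-down merge step.

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out

def fpn_merge(coarse, lateral):
    """Upsample the coarse map and add the same-sized lateral map elementwise."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(urow, lrow)]
            for urow, lrow in zip(up, lateral)]

coarse = [[1.0, 2.0], [3.0, 4.0]]          # hypothetical low-resolution, high-level map
lateral = [[0.5] * 4 for _ in range(4)]    # hypothetical lateral map at the finer scale
merged = fpn_merge(coarse, lateral)
```

Repeating this merge down the pyramid yields the multi-scale target feature map that the later convolutions consume.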
Optionally, the convolution unit 540 is specifically configured to:
and performing deformable convolution on the target feature map to obtain the semantic features of the image.
Optionally, after the semantic segmentation is performed on the semantic features to obtain a semantic segmentation result of the image, the apparatus 500 further includes a binarization unit 560, where the binarization unit 560 is configured to:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result; and determining a target text area in the image based on the binarization result.
Optionally, the deformable convolution includes an adaptive receptive field.
Optionally, the deformable convolution can adjust the receptive field according to the text line size in the target feature map.
It should be appreciated that the apparatus 500 for text detection herein is embodied in the form of functional modules. The term "module" herein may be implemented in software and/or hardware, which is not specifically limited. For example, a "module" may be a software program, a hardware circuit, or a combination of the two that implements the functionality described above. The hardware circuitry may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
As an example, the apparatus 500 for text detection provided by the embodiment of the present invention may be a processor or a chip, and is configured to perform the method according to the embodiment of the present invention.
FIG. 6 is a schematic block diagram of an apparatus 400 for text detection in accordance with one embodiment of the present invention. The apparatus 400 shown in fig. 6 includes a memory 401, a processor 402, a communication interface 403, and a bus 404. The memory 401, the processor 402 and the communication interface 403 are connected to each other by a bus 404.
The memory 401 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 401 may store a program; when the program stored in the memory 401 is executed by the processor 402, the processor 402 is configured to perform the steps of the text detection method of the embodiment of the present invention, for example, the steps of the embodiment shown in fig. 1.
The processor 402 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the method for text detection according to the embodiment of the present invention.
The processor 402 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method for text detection according to the embodiment of the present invention may be implemented by integrated logic circuits of hardware in the processor 402 or instructions in the form of software.
The processor 402 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 401; the processor 402 reads the information in the memory 401 and, in combination with its hardware, performs the functions of the units included in the text detection apparatus of the embodiment of the present invention, or performs the text detection method of the method embodiment of the present invention, for example, the steps/functions of the embodiment shown in fig. 1.
The communication interface 403 may use transceiver means, such as, but not limited to, a transceiver, to enable communication between the apparatus 400 and other devices or communication networks.
Bus 404 may include a path that transfers information between various components of apparatus 400 (e.g., memory 401, processor 402, communication interface 403).
It should be understood that the apparatus 400 shown in the embodiment of the present invention may be a processor or a chip for executing the method of text detection described in the embodiment of the present invention.
It should be understood that the processor in the embodiments of the present invention may be a Central Processing Unit (CPU), and the processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. In addition, "/" in this document generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with reference to the surrounding text.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and the like that are within the spirit and principle of the present invention are included in the present invention.

Claims (10)

1. A method of text detection, comprising:
acquiring an image;
extracting a feature map of the image by using a convolutional neural network, wherein the convolutional neural network is a lightweight network;
up-sampling the feature map to obtain a target feature map of the image;
convolving the target feature map by using different convolution kernels to obtain semantic features of the image;
and performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, wherein the semantic segmentation result is used for determining a target text region in the image.
2. The method of claim 1, wherein the convolutional neural network is a GhostNet network.
3. The method of claim 2, wherein the upsampling the feature map to obtain a target feature map of the image comprises:
and performing up-sampling on the feature map by using a feature pyramid network to obtain a target feature map of the image.
4. The method according to any one of claims 1 to 3, wherein the convolving the target feature map with different convolution kernels to obtain semantic features of the image comprises:
and performing deformable convolution on the target feature map to obtain the semantic features of the image.
5. The method of claim 4, wherein after the performing semantic segmentation on the semantic features to obtain the semantic segmentation result of the image, the method further comprises:
carrying out differentiable binarization processing on the semantic segmentation result to obtain a binarization result;
and determining a target text area in the image based on the binarization result.
6. The method of claim 5, wherein the deformable convolution comprises an adaptive receptive field.
7. The method of claim 6, wherein the deformable convolution is capable of adjusting the receptive field according to the text line size in the target feature map.
8. An apparatus for text detection, comprising:
an image acquisition unit for acquiring an image;
the characteristic extraction unit is used for extracting a characteristic diagram of the image by using a convolutional neural network, and the convolutional neural network is a lightweight network;
the up-sampling unit is used for up-sampling the feature map to obtain a target feature map of the image;
the convolution unit is used for performing convolution on the target feature map by using different convolution kernels to obtain the semantic features of the image;
and the semantic segmentation unit is used for performing semantic segmentation on the semantic features to obtain a semantic segmentation result of the image, and the semantic segmentation result is used for determining a target text region in the image.
9. An apparatus for text detection comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium comprising computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
CN202011644150.8A 2020-12-31 2020-12-31 Text detection method and device Pending CN112712078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644150.8A CN112712078A (en) 2020-12-31 2020-12-31 Text detection method and device


Publications (1)

Publication Number Publication Date
CN112712078A 2021-04-27

Family

ID=75548076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644150.8A Pending CN112712078A (en) 2020-12-31 2020-12-31 Text detection method and device

Country Status (1)

Country Link
CN (1) CN112712078A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313083A (en) * 2021-07-28 2021-08-27 北京世纪好未来教育科技有限公司 Text detection method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880000A (en) * 2019-11-27 2020-03-13 上海智臻智能网络科技股份有限公司 Picture character positioning method and device, computer equipment and storage medium
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111444919A (en) * 2020-04-17 2020-07-24 南京大学 Method for detecting text with any shape in natural scene
CN111652217A (en) * 2020-06-03 2020-09-11 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN111967488A (en) * 2020-06-22 2020-11-20 南昌大学 Mobile phone shot text image matching method based on twin convolutional neural network
CN112115900A (en) * 2020-09-24 2020-12-22 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210427