CN110751146A - Text region detection method, text region detection device, electronic terminal and computer-readable storage medium - Google Patents

Text region detection method, text region detection device, electronic terminal and computer-readable storage medium

Info

Publication number
CN110751146A
CN110751146A (application number CN201911011794.0A)
Authority
CN
China
Prior art keywords
image
region
text
text region
region detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911011794.0A
Other languages
Chinese (zh)
Other versions
CN110751146B (en)
Inventor
Xie Zhaoxia (谢朝霞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Graphic Communication
Original Assignee
Beijing Institute of Graphic Communication
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Graphic Communication filed Critical Beijing Institute of Graphic Communication
Priority to CN201911011794.0A priority Critical patent/CN110751146B/en
Publication of CN110751146A publication Critical patent/CN110751146A/en
Application granted granted Critical
Publication of CN110751146B publication Critical patent/CN110751146B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a text region detection method and device, an electronic terminal and a computer-readable storage medium, relating to the technical fields of information detection and image processing. The text region detection method comprises the following steps: acquiring an image to be identified carrying character information; performing visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region; and extracting character features contained in the candidate text region based on a preset text region detection network, and classifying the extracted character features to obtain a background region in the candidate text region and a text region containing the character features. The method and the device can effectively improve the accuracy of the text region detection result.

Description

Text region detection method, text region detection device, electronic terminal and computer-readable storage medium
Technical Field
The present application relates to the field of information detection and image processing technologies, and in particular, to a text region detection method, apparatus, electronic terminal, and computer-readable storage medium.
Background
Existing OCR (Optical Character Recognition) technology is mainly used for recognizing characters in document images containing print characters, such as files and books, and its recognition accuracy can reach more than 96%. However, with the rapid development and popularization of portable electronic mobile devices, more and more text information is carried by natural scene images, so that simple document image recognition can no longer meet people's growing demands.
For example, in a natural scene, due to the randomness of the image capture manner, the text in a scene image varies in font, size, arrangement and the like. In addition, the image background is extremely complex: it may be a natural landscape, an outdoor street view or an indoor environment. These factors undoubtedly increase the complexity of character recognition, so improving the accuracy of character recognition in carriers such as natural scene images by distinguishing the text region from the background region becomes important.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present application provides a text region detection method, apparatus, electronic terminal and computer-readable storage medium.
In a first aspect, an embodiment of the present invention provides a text region detection method, including:
acquiring an image to be identified carrying character information;
performing visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region;
extracting character features contained in the candidate text regions based on a preset text region detection network, and classifying the extracted character features to obtain background regions in the candidate text regions and target text regions containing the character features.
In an alternative embodiment, the step of performing a visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized includes:
carrying out image segmentation on the image to be identified by utilizing a superpixel image segmentation method to obtain a plurality of superpixel blocks;
and respectively calculating the similarity between the super pixel blocks, and determining the significance of the super pixel blocks according to the similarity to obtain a significance region in the image to be identified.
In an optional embodiment, the step of performing image segmentation on the image to be identified by using a super-pixel image segmentation method to obtain a plurality of super-pixel blocks includes:
and adjusting the number of seed points and the side length of the super pixels in the SLIC super pixel algorithm, and carrying out image segmentation on the image to be identified based on the adjusted SLIC super pixel algorithm to obtain a plurality of super pixel blocks under multiple scales.
In an alternative embodiment, the similarity d(R_i, R_j) is calculated by the following formula (the equation is rendered only as an image, Figure BDA0002244419810000021, in the original publication):
wherein R_i represents superpixel block i, R_j represents superpixel block j, d_col represents the color similarity between superpixel blocks, d_pos represents the spatial-distance similarity between superpixel blocks, 1 ≤ α ≤ 10, and α is a positive integer.
In an optional embodiment, before performing the step of extracting the text features included in the candidate text region based on the preset text region detection network, the method further includes:
extracting the region coordinates of the candidate text region, and determining the region type of the candidate text region according to the region coordinates;
and adjusting the convolution kernel of the preset text area detection network according to the area type of the candidate text area.
In an alternative embodiment, the region type includes one or more of a region shape, a region size, and a region angle.
In an alternative embodiment, the target text region y is expressed as y = x ∗ k + b, where x is the image corresponding to the candidate text region, k is the convolution kernel with k = n × m, m = δ × n, δ > 2, m and n are natural numbers, and b is a bias value.
In a second aspect, an embodiment of the present invention provides a text region detecting apparatus, including:
the image acquisition module is used for acquiring an image to be identified with character information;
the region determining module is used for carrying out visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region;
and the text region classification module is used for extracting character features contained in the candidate text regions based on a preset text region detection network, and classifying the extracted character features to obtain background regions in the candidate text regions and target text regions containing the character features.
In a third aspect, an embodiment of the present invention provides an electronic terminal, including a processor and a memory, where the memory stores machine-executable instructions executable by the processor, and the processor may execute the machine-executable instructions to implement the text region detection method according to any one of the foregoing embodiments.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text region detection method according to any one of the foregoing embodiments.
In the text region detection method, the text region detection device, the electronic terminal and the computer readable storage medium, the saliency region in the image to be recognized is obtained by performing visual saliency analysis on the image to be recognized and is used as a candidate text region, and then the candidate text region is detected and classified based on a preset text region detection network to obtain a background region in the candidate text region and a target text region containing character features, so that a basis is provided for subsequent character recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a text region detection method according to an embodiment of the present application.
Fig. 3 is a sub-flowchart of step S12 shown in fig. 2.
Fig. 4(a), 4(b) and 4(c) are schematic diagrams of three different arrangement modes of candidate text regions according to an embodiment of the present application.
Fig. 5 is a functional block diagram of a text region detection apparatus according to the present application.
Icon: 10-an electronic terminal; 11-text region detection means; 110-an image acquisition module; 120-a region determination module; 130-text region classification module; 12-a processor; 13-a memory; 14-communication module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, which is a block schematic diagram of an electronic terminal 10 according to an embodiment of the present disclosure, the electronic terminal 10 may perform, but is not limited to, the text region detection method according to the embodiment of the present disclosure. The electronic terminal 10 may include, but is not limited to, the processor 12, the memory 13 and the communication module 14 shown in fig. 1. The processor 12, the memory 13 and the communication module 14 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
Wherein the memory 13 is used for storing programs or data. The memory 13 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 12 is used to read/write data or programs stored in the memory 13 and perform corresponding functions.
The communication module 14 is used for establishing a communication connection between the electronic terminal 10 and other terminal devices through a network, and for transceiving data through the network, such as receiving an image to be recognized.
It should be understood that the configuration shown in fig. 1 is merely schematic; the electronic terminal 10 may include more or fewer components than those shown in fig. 1, or have a different configuration from that shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof. In addition, in the embodiment of the present application, the electronic terminal 10 may be, but is not limited to, a computer, a mobile phone, an iPad, a server, a mobile internet access device, and the like.
Please refer to fig. 2, which is a flowchart illustrating a text region detection method according to an embodiment of the present disclosure. The text region detection method may be executed by, but is not limited to, a text region detection device 11; the text region detection device 11 may be implemented by software and/or hardware, and may be configured in an electronic terminal 10 installed with an operating system such as Android. The specific flow of the text region detection method is described below with reference to fig. 2.
And step S11, acquiring the image to be identified carrying the text information.
The image to be recognized may be, but is not limited to, an image with a natural landscape, an outdoor street view, an indoor environment, or the like as a background or a carrier. For example, the image to be recognized may be, but is not limited to, a natural scene image containing text information acquired by an image acquisition device such as a camera or a video camera, such as: traffic alert images, street name images, advertising logo images, poster banner images, book and periodical images, package printed text images, and the like.
In addition, in an implementation manner, the image to be recognized may be directly acquired by the electronic terminal 10, or may be acquired by an image acquisition device independent from the electronic terminal 10 and then sent to the electronic terminal 10, and the like, which is not limited in this embodiment.
And step S12, performing visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region.
A natural scene image usually contains a large number of non-text background regions such as trees, buildings, flowers and plants, and these backgrounds differ significantly from text regions in image texture, color, structure and the like; in other words, text regions in a natural scene image are more likely to attract human visual attention than non-text regions. Therefore, after the image to be recognized is acquired in step S11, the embodiment of the present application may perform visual saliency analysis on the image to be recognized by using, but not limited to, a human visual attention saliency mechanism, so as to quickly browse the image to be recognized globally and accurately acquire the saliency region in it, while ignoring other irrelevant or unimportant regions.
For example, assuming that the image to be recognized is a natural scene image, this embodiment may obtain a candidate text region in the natural scene image based on a saliency visual attention model that exploits the human visual attention mechanism, so as to detect and locate the saliency region (candidate text region) in the natural scene image. The saliency visual attention model is a model based on multi-scale low-level features, constructed by extracting low-level features (such as color and texture) in the image to be recognized and then applying a superpixel segmentation method and the like, which is not described in detail in this embodiment.
Alternatively, as shown in fig. 3, the candidate text regions in step S12 described above can be implemented by steps S121 and S122 described below, as follows.
And step S121, carrying out image segmentation on the image to be identified by using a superpixel image segmentation method to obtain a plurality of superpixel blocks.
The image segmentation of the image to be identified by using the superpixel image segmentation method to obtain a plurality of superpixel blocks may include: adjusting the number of seed points and the superpixel side length in the SLIC (Simple Linear Iterative Clustering) superpixel algorithm, and performing image segmentation on the image to be identified based on the adjusted SLIC algorithm to obtain a plurality of superpixel blocks at multiple scales (such as different sizes or specifications).
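As a rough sketch of how these two parameters relate: for an image of N pixels and K seed points, the superpixel side length is S = sqrt(N/K), and the initial cluster centers sit on a regular grid of spacing S. The Python below illustrates only this grid initialization (the SLIC clustering iterations are omitted); the image size and seed counts are illustrative assumptions, not values from the patent.

```python
import math

def slic_seed_grid(height, width, num_seeds):
    """Place initial SLIC cluster centers on a regular grid.

    The superpixel side length S follows from the image area N and the
    number of seed points K as S = sqrt(N / K); changing num_seeds
    (and hence S) yields superpixel blocks at different scales.
    """
    side = math.sqrt(height * width / num_seeds)  # superpixel side length S
    seeds = []
    y = side / 2
    while y < height:
        x = side / 2
        while x < width:
            seeds.append((round(y), round(x)))
            x += side
        y += side
    return side, seeds

# Multi-scale segmentation: the same (hypothetical) 240x320 image with
# different seed counts produces superpixel blocks of different sizes.
for k in (64, 256):
    side, seeds = slic_seed_grid(240, 320, k)
```

Each grid center would then be refined by the usual SLIC local k-means iterations to yield the final superpixel blocks.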
And step S122, respectively calculating the similarity between the super pixel blocks, and determining the significance of the super pixel blocks according to the similarity to obtain a significance region in the image to be identified.
Wherein the similarity d(R_i, R_j) can be calculated by a formula (rendered only as an image, Figure BDA0002244419810000081, in the original publication), where R_i represents superpixel block i, R_j represents superpixel block j, d_col represents the color similarity between superpixel blocks, d_pos represents the spatial-distance similarity between superpixel blocks, 1 ≤ α ≤ 10, and α is a positive integer.
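Since the exact similarity formula is published only as an equation image, the sketch below shows one plausible instantiation of combining d_col and d_pos; the Euclidean measures, the mean-color and centroid block statistics, and the weighted-sum combination are all assumptions for illustration, not the patent's disclosed formula.

```python
import math

def color_similarity(mean_rgb_i, mean_rgb_j):
    # d_col: here taken as the Euclidean distance between mean block colors
    return math.dist(mean_rgb_i, mean_rgb_j)

def position_similarity(centroid_i, centroid_j):
    # d_pos: here taken as the Euclidean distance between block centroids
    return math.dist(centroid_i, centroid_j)

def block_similarity(mean_rgb_i, mean_rgb_j, centroid_i, centroid_j, alpha=2):
    # Hypothetical combination of the two terms; the patent only states
    # that d(R_i, R_j) depends on d_col, d_pos and a positive integer
    # alpha with 1 <= alpha <= 10.
    return (color_similarity(mean_rgb_i, mean_rgb_j)
            + alpha * position_similarity(centroid_i, centroid_j))
```

A block whose summed dissimilarity to all other blocks is large would then be scored as salient, which is one common way such similarity measures are turned into a saliency map.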
In the above steps S121 and S122, the structural features of the characters in the image to be recognized are first taken into account; then, by using the SLIC superpixel segmentation method and varying the number of seed points and the superpixel side length, a plurality of superpixel blocks of the natural scene image are obtained at multiple scales, which speeds up the positioning of the salient region in the image to be recognized.
In addition, different from existing methods that detect and position the text region in an image by using color features, texture features, edge features, stroke features and the like, the method of positioning the candidate text region by using a visual saliency model of multi-scale low-level features can effectively eliminate the interference caused by redundant and non-text backgrounds, and can also accurately detect and position the candidate text region. It greatly improves the positioning accuracy of the candidate text region and provides a guarantee for the region classification performed by the preset text region detection network in the subsequent step S13.
Step S13, extracting the text features included in the candidate text regions based on a preset text region detection network, and classifying the extracted text features to obtain a background region in the candidate text regions and a target text region including the text features.
Feature extraction with a deep learning Convolutional Neural Network (CNN) can better simulate the human brain's understanding of images: it can extract image edge features, color features, high-order semantic features and the like, and can also learn autonomously from the characteristics of the training samples, thereby greatly reducing the uncertainty caused by manually designed structural features and improving the accuracy of the network's recognition results.
Based on this, in this embodiment, the preset text region detection network may be, but is not limited to, obtained by training an existing CNN on samples. To ensure the generalization of the CNN, the training samples may include natural scene images and street view images carrying text information, as well as document images containing print characters such as files and books; the embodiment is not specifically limited herein.
In summary, compared with existing character region detection methods such as OCR technology, the preset text region detection network implemented based on a CNN in the embodiment of the present application can extract more robust text semantic features, and the fitting and generalization capabilities of the model are stronger. It can therefore effectively overcome interference such as text diversity and scene-image complexity in text region positioning, and efficiently and accurately extract and position the target text region in a natural scene image, providing a basis for subsequent text detection; meanwhile, character detection performed on the target text region has higher accuracy. It is understood that text detection based on the target text region can be implemented by, but is not limited to, a text detection method as in OCR technology.
Further, the candidate text regions in a natural scene image may be arranged in different forms: as shown in fig. 4(a), 4(b) and 4(c), they may be arranged horizontally, at an angle, or vertically, and the width of a candidate text region may also vary. In an optional implementation of the embodiment of the present application, before the image corresponding to a candidate text region is input to the preset text region detection network for text region detection and classification, the convolution kernel of the network is adaptively designed according to the form of the candidate text region, so as to improve the accuracy of text region detection. The specific process may include: extracting the region coordinates of the candidate text region, and determining the region type of the candidate text region according to the region coordinates; and adjusting the convolution kernel of the preset text region detection network according to the region type. Optionally, the region type includes one or more of a region shape, a region size, and a region angle.
For example, assuming that the region type is a region shape, the region coordinates may be the edge coordinates of the candidate text region; the region shape of the candidate text region, such as a rectangle, triangle or circle, is then determined according to the edge coordinates, and the convolution kernel k of the preset text region detection network is adjusted according to the δ values corresponding to the different region shapes, where k = n × m, m = δ × n, δ > 2, and m and n are natural numbers.
For another example, if the region type is a region angle, the region coordinates may be the coordinates of the edge of the candidate text region closest to a preset reference axis (e.g., the horizontal coordinate axis). The included angle between the line segment corresponding to these edge coordinates and the reference axis is calculated and taken as the region angle, and the convolution kernel k of the preset text region detection network is adjusted according to the δ values corresponding to different region angles, where k = n × m, m = δ × n, δ > 2, and m and n are natural numbers.
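The adaptive kernel adjustment in the two examples above can be sketched as a lookup from region type to δ. The concrete δ values per region type are not disclosed in the patent, so the table below is purely hypothetical; only the constraints k = n × m, m = δ × n, δ > 2 come from the text.

```python
def kernel_shape(region_type, n=3, delta_table=None):
    """Map a candidate region's type to convolution-kernel dimensions.

    The patent states only that k = n x m with m = delta * n and
    delta > 2; the per-type delta values below are hypothetical
    placeholders for the preset correspondence table.
    """
    if delta_table is None:
        delta_table = {"horizontal": 3, "angled": 4, "vertical": 5}
    delta = delta_table[region_type]
    if delta <= 2:
        raise ValueError("patent requires delta > 2")
    m = delta * n  # elongated kernel matching the text line's aspect
    return (n, m)
```

With the default table, a horizontally arranged region gets a 3 × 9 kernel, while a vertically arranged one gets a longer 3 × 15 kernel.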
In addition to the above-mentioned several ways of adjusting the convolution kernel of the preset text region detection network, in an implementation manner of this embodiment, for a candidate text region that presents a certain angular arrangement, a preset reference axis (e.g., a horizontal coordinate axis) may be used as a reference to rotationally adjust the candidate text region to a horizontal arrangement mode or a vertical arrangement mode, and then the text region is identified based on an image corresponding to the rotated candidate text region.
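The rotation adjustment for an angled candidate region can be sketched on its corner coordinates; rotating about the region centroid and the (x, y) coordinate convention are assumptions for illustration, since the patent does not specify the rotation implementation.

```python
import math

def rotate_region(corners, angle_deg):
    """Rotate candidate-region corner coordinates about their centroid
    by -angle_deg, so a region measured at angle_deg to the horizontal
    reference axis becomes horizontally arranged."""
    theta = math.radians(-angle_deg)  # undo the measured region angle
    cx = sum(x for x, _ in corners) / len(corners)
    cy = sum(y for _, y in corners) / len(corners)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        (cx + (x - cx) * cos_t - (y - cy) * sin_t,
         cy + (x - cx) * sin_t + (y - cy) * cos_t)
        for x, y in corners
    ]
```

In practice the image pixels inside the region would be resampled with the same transform before being fed to the detection network.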
When the target text region is detected based on the preset text region detection network after the convolution kernel has been designed and adjusted, the target text region y can be expressed as y = x ∗ k + b, where x is the image corresponding to the candidate text region, k is the convolution kernel with k = n × m, m = δ × n, δ > 2, m and n are natural numbers, and b is a bias value. It can be understood that, before the text region detection method provided in the present application is executed, the electronic terminal 10 may be preset with a correspondence between different region types and δ values, and the adjustment of the convolution kernel is then implemented according to this correspondence.
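A minimal sketch of the detection response y = x ∗ k + b: a single-channel valid cross-correlation plus bias with a rectangular n × m kernel. Stride, padding, multi-channel handling, and the surrounding network layers are unspecified in the patent and omitted here.

```python
def detect_response(x, k, b):
    """Compute y = x * k + b (valid cross-correlation plus a scalar
    bias) for a single-channel image x and a rectangular kernel k.
    In the patent, k is n x m with m = delta * n and delta > 2."""
    H, W = len(x), len(x[0])
    n, m = len(k), len(k[0])
    out = []
    for i in range(H - n + 1):          # slide over valid positions only
        row = []
        for j in range(W - m + 1):
            s = b
            for a in range(n):
                for c in range(m):
                    s += x[i + a][j + c] * k[a][c]
            row.append(s)
        out.append(row)
    return out
```

For example, a 1 × 3 kernel (n = 1, δ = 3) over a 1 × 3 image reduces to the dot product of the window with the kernel, plus the bias.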
Through this adaptive convolution-kernel design, target text regions with irregular structures can be extracted more accurately. In addition, compared with the fixed-geometry convolution kernel of a conventional CNN, the embodiment of the application considers the inherent structural characteristics of text in actual scenes, sets the convolution kernel to k = n × m with m = δ × n, δ > 2, and m and n natural numbers, and constructs a convolution kernel that dynamically adapts to candidate regions of different widths by adjusting the parameter δ. Features of candidate text regions with various irregular structures can thus be extracted, improving the accuracy of text region detection and recognition.
Further, in order to perform corresponding steps in the embodiment or each possible implementation manner of the present application, an implementation manner of the text region detecting apparatus 11 is given below, and optionally, the text region detecting apparatus 11 may adopt the device structure of the electronic terminal 10 shown in fig. 1. In one implementation, the text region detecting device 11 may be understood as the processor 12 in the electronic terminal 10, or may be understood as a software functional module that is independent from the electronic terminal 10 or the processor 12 and implements the text region detecting method under the control of the electronic terminal 10.
It should be noted that the basic principle and the technical effects of the text area detecting apparatus 11 provided in the present embodiment are the same as those of the above embodiments, and for the sake of brief description, no part of the present embodiment is mentioned, and reference may be made to the corresponding contents in the above embodiments. The text region detecting apparatus 11 may include an image acquiring module 110, a region determining module 120, and a text region classifying module 130 as shown in fig. 5.
The image obtaining module 110 is configured to obtain an image to be identified with text information; in this embodiment, the detailed description of the step S11 may be referred to for the description of the image obtaining module 110, that is, the step S11 may be executed by the image obtaining module 110, and thus will not be further described here.
The region determining module 120 is configured to perform visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized, where the saliency region is used as a candidate text region; in this embodiment, the detailed description of the step S12 may be referred to for the description of the area determination module 120, that is, the step S12 may be executed by the area determination module 120, and therefore, will not be further described here.
The text region classification module 130 is configured to extract text features included in the candidate text regions based on a preset text region detection network, and classify the extracted text features to obtain background regions in the candidate text regions and text regions including the text features. In this embodiment, the detailed description of the text region classification module 130 may refer to the above-mentioned detailed description of step S13, that is, step S13 may be executed by the text region classification module 130, and thus, will not be further described herein.
Based on the text region detection method provided in the foregoing embodiments, the present embodiment also provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the text region detection method described in the foregoing embodiments.
In summary, in the text region detection method and apparatus, the electronic terminal 10, and the computer-readable storage medium provided by the present application, visual saliency analysis is first performed on the image to be recognized to obtain a saliency region as a candidate text region; the candidate text region is then detected and classified based on a preset text region detection network to obtain the background region and the text region containing the character features. In this way, the accuracy of text region detection can be effectively improved.
In addition, compared with traditional text region detection methods that directly extract text region features from the image using character color features, edge features, texture features, stroke features, and the like, the text region detection method provided by the present application does not require a large amount of time and effort to design and select features when identifying text regions. It can also effectively resist interference from factors such as blurring, complex backgrounds, text distortion, adhesion, and noise in the text region, thereby improving the accuracy of text region detection.
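As a rough illustration of this two-stage pipeline (saliency analysis to produce candidate text regions, then classification of candidates into text and background), the following sketch uses trivial stand-ins for both stages. The functions detect_saliency and classify_candidates are hypothetical placeholders, not the patent's actual saliency analysis or detection network:

```python
import numpy as np

def detect_saliency(image):
    # Placeholder saliency: pixels whose contrast against the global
    # mean is above average (the patent's actual analysis uses
    # multi-scale superpixel contrast instead).
    diff = np.abs(image - image.mean())
    return diff > diff.mean()

def classify_candidates(image, candidate_mask):
    # Placeholder for the "preset text region detection network":
    # any candidate pixel brighter than the global mean is labelled
    # text (1); everything else stays background (0).
    labels = np.zeros_like(image, dtype=int)
    labels[candidate_mask & (image > image.mean())] = 1
    return labels

def detect_text_regions(image):
    # Step 1: visual saliency analysis -> candidate text region.
    candidates = detect_saliency(image)
    # Step 2: classification of candidates -> text vs. background.
    return classify_candidates(image, candidates)

image = np.zeros((8, 8))
image[2:4, 1:7] = 1.0          # a bright synthetic "text line"
labels = detect_text_regions(image)
print(labels.sum())            # pixels labelled as text
```

The two stages are deliberately decoupled, mirroring the claim structure: the saliency stage only proposes candidates, and all text/background decisions happen in the second stage.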
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A text region detection method, comprising:
acquiring an image to be identified carrying character information;
performing visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region;
extracting character features contained in the candidate text regions based on a preset text region detection network, and classifying the extracted character features to obtain background regions in the candidate text regions and target text regions containing the character features.
2. The text region detection method according to claim 1, wherein the step of performing visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized comprises:
carrying out image segmentation on the image to be identified by utilizing a superpixel image segmentation method to obtain a plurality of superpixel blocks;
and respectively calculating the similarity between the super pixel blocks, and determining the significance of the super pixel blocks according to the similarity to obtain a significance region in the image to be identified.
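Claim 2 computes pairwise similarities between superpixel blocks and derives each block's saliency from them, but does not publish the aggregation rule. A hedged sketch under one common assumption (a block is salient in proportion to its total colour contrast against all other blocks, down-weighted by spatial distance):

```python
import numpy as np

def block_saliency(features, positions):
    """Score each superpixel block by its summed contrast against all
    other blocks. The combination d_col / (1 + d_pos) is an assumed
    form, not taken from the patent text."""
    features = np.asarray(features, dtype=float)
    positions = np.asarray(positions, dtype=float)
    n = len(features)
    saliency = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_col = np.linalg.norm(features[i] - features[j])
            d_pos = np.linalg.norm(positions[i] - positions[j])
            saliency[i] += d_col / (1.0 + d_pos)
    return saliency

# Three blocks: two dark background blocks and one bright "text" block.
feats = [[0.1], [0.12], [0.9]]
pos = [[0, 0], [0, 1], [1, 0]]
s = block_saliency(feats, pos)
print(s.argmax())   # index of the most salient block
```

Blocks that differ strongly in colour from everything around them accumulate the largest scores, which is what lets high-contrast text pop out as a saliency region.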
3. The text region detection method according to claim 2, wherein the step of performing image segmentation on the image to be recognized by using a super-pixel image segmentation method to obtain a plurality of super-pixel blocks comprises:
and adjusting the number of seed points and the side length of the super pixels in the SLIC super pixel algorithm, and carrying out image segmentation on the image to be identified based on the adjusted SLIC super pixel algorithm to obtain a plurality of super pixel blocks under multiple scales.
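Claim 3's multi-scale idea (varying the number of seed points and the superpixel side length) can be illustrated without the full SLIC algorithm. The sketch below uses a plain grid assignment as a stand-in for SLIC's seed initialisation; real SLIC would then refine each block by iterative colour-and-distance clustering:

```python
import numpy as np

def grid_superpixels(h, w, side):
    """Assign each pixel of an h x w image to a square block of the
    given side length -- a simplified stand-in for SLIC's seed grid."""
    rows = np.arange(h) // side
    cols = np.arange(w) // side
    n_cols = -(-w // side)          # blocks per row (ceil division)
    return rows[:, None] * n_cols + cols[None, :]

def multiscale_segmentation(h, w, sides=(4, 8, 16)):
    # Varying the side length (equivalently, the seed-point count)
    # yields one label map per scale, as in claim 3.
    return {s: grid_superpixels(h, w, s) for s in sides}

maps = multiscale_segmentation(32, 32)
for side, labels in maps.items():
    print(side, labels.max() + 1)   # superpixel blocks at each scale
```

In a real implementation one would call something like skimage.segmentation.slic with a different n_segments per scale; the grid version above only shows how the scale parameter controls the block count.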
4. The text region detection method according to claim 2, wherein the similarity d(Ri, Rj) is calculated by the following formula:
wherein Ri represents superpixel block i, Rj represents superpixel block j, d_col represents the color similarity between superpixel blocks, d_pos represents the distance similarity between superpixel blocks, 1 ≤ α ≤ 10, and α is a positive integer.
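The formula body of claim 4 did not survive in the published text; only the symbols d_col, d_pos, and α are defined. A common formulation in the superpixel-saliency literature divides colour contrast by a distance term, so the sketch below assumes d(Ri, Rj) = d_col / (1 + α · d_pos). That combining form is an assumption, not the patent's published equation:

```python
import numpy as np

def similarity(block_i, block_j, alpha=3):
    """Pairwise block similarity under the ASSUMED form
    d_col / (1 + alpha * d_pos); the claim defines d_col, d_pos and
    alpha (1 <= alpha <= 10) but not how they combine."""
    d_col = np.linalg.norm(block_i["color"] - block_j["color"])
    d_pos = np.linalg.norm(block_i["pos"] - block_j["pos"])
    return d_col / (1.0 + alpha * d_pos)

# Two blocks with distinct mean colours and nearby centroids.
r_i = {"color": np.array([0.9, 0.1, 0.1]), "pos": np.array([0.2, 0.2])}
r_j = {"color": np.array([0.1, 0.1, 0.1]), "pos": np.array([0.2, 0.7])}
print(round(similarity(r_i, r_j, alpha=3), 3))
```

Under this form, larger α makes the measure more local: distant block pairs contribute less contrast regardless of their colour difference.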
5. The text region detection method according to claim 1, wherein before the step of extracting the text features contained in the candidate text regions based on a preset text region detection network is performed, the method further comprises:
extracting the region coordinates of the candidate text region, and determining the region type of the candidate text region according to the region coordinates;
and adjusting the convolution kernel of the preset text area detection network according to the area type of the candidate text area.
6. The text region detection method according to claim 5, wherein the region type includes one or more of a region shape, a region size, and a region angle.
7. The text region detection method according to claim 1, wherein the target text region y is represented as y = x ∗ k + b, where x is the image corresponding to the candidate text region, k is a convolution kernel of size n × m with m = δ · n and δ > 2, m and n are natural numbers, and b is a bias value.
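Reading the (partly garbled) claim as y = x ∗ k + b with an elongated n × m kernel where m = δ · n and δ > 2, a minimal sliding-window sketch might look as follows. The kernel shape, δ value, and input data are illustrative assumptions; for this symmetric averaging kernel the window correlation coincides with true convolution:

```python
import numpy as np

def conv2d_valid(x, k, b=0.0):
    """y = x * k + b over all fully-overlapping ('valid') positions.
    Computed as a sliding-window correlation, which equals convolution
    here because the example kernel is symmetric."""
    n, m = k.shape
    H, W = x.shape
    out = np.empty((H - n + 1, W - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + n, j:j + m] * k) + b
    return out

# Elongated kernel per the reconstructed constraint: n=1, m = 3*n.
n, delta = 1, 3
k = np.ones((n, delta * n)) / (delta * n)   # 1x3 averaging kernel
x = np.zeros((3, 5))
x[1, 1:4] = 1.0                             # one short "text stroke"
y = conv2d_valid(x, k, b=0.0)
print(y.shape)
```

An elongated (wide, short) kernel is a natural fit for text lines, whose strokes run much farther horizontally than vertically; that is presumably the motivation behind requiring δ > 2.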
8. A text region detecting apparatus, comprising:
the image acquisition module is used for acquiring an image to be identified carrying character information;
the region determining module is used for carrying out visual saliency analysis on the image to be recognized to obtain a saliency region in the image to be recognized as a candidate text region;
and the text region classification module is used for extracting character features contained in the candidate text regions based on a preset text region detection network, and classifying the extracted character features to obtain background regions in the candidate text regions and target text regions containing the character features.
9. An electronic terminal comprising a processor and a memory, the memory storing machine executable instructions executable by the processor to implement the text region detection method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the text region detection method according to any one of claims 1 to 7.
CN201911011794.0A 2019-10-23 2019-10-23 Text region detection method, device, electronic terminal and computer readable storage medium Active CN110751146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911011794.0A CN110751146B (en) 2019-10-23 2019-10-23 Text region detection method, device, electronic terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110751146A true CN110751146A (en) 2020-02-04
CN110751146B CN110751146B (en) 2023-06-20

Family

ID=69279547


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353502A (en) * 2020-02-25 2020-06-30 北京眸视科技有限公司 Digital table identification method and device and electronic equipment
CN111782313A (en) * 2020-05-15 2020-10-16 北京完美知识科技有限公司 Display method, device, equipment, system and storage medium
CN111832403A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Document structure recognition method, and model training method and device for document structure recognition
CN112733829A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Feature block identification method, electronic device and computer-readable storage medium
CN113221801A (en) * 2021-05-24 2021-08-06 北京奇艺世纪科技有限公司 Version number information identification method and device, electronic equipment and readable storage medium
CN113744365A (en) * 2021-07-19 2021-12-03 稿定(厦门)科技有限公司 Intelligent document layout method, system and storage medium based on significance perception

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106778757A (en) * 2016-12-12 2017-05-31 哈尔滨工业大学 Scene text detection method based on text conspicuousness
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109753962A (en) * 2019-01-13 2019-05-14 南京邮电大学盐城大数据研究院有限公司 Text filed processing method in natural scene image based on hybrid network
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text filed, content determine method and apparatus

Non-Patent Citations (2)

Title
N. TONG 等: "Saliency detection with multi-scale superpixels", 《IEEE SIGNAL PROCESSING LETTERS》 *
Z. XIE: "Saliency Estimation Model Based on Superpixel and Regions Contrast", 《2017 10TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID)》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant