CN107480665B - Character detection method and device and computer readable storage medium


Publication number
CN107480665B
Authority
CN
China
Prior art keywords
region
candidate
regions
text
sub
Prior art date
Legal status
Active
Application number
CN201710675521.0A
Other languages
Chinese (zh)
Other versions
CN107480665A (en)
Inventor
万韶华
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority to CN201710675521.0A
Publication of CN107480665A
Application granted
Publication of CN107480665B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The disclosure relates to a character detection method, a character detection device, and a computer-readable storage medium in the technical field of information processing. The method comprises: processing a target image on which character detection is to be performed through a plurality of convolutional layers included in a Faster RCNN model to obtain a plurality of candidate regions; and processing the plurality of candidate regions through a pooling layer included in the Faster RCNN model to obtain a character region, namely a candidate region that includes characters, and the position of the character region in the target image. Because the embodiments of the disclosure use the Faster RCNN model to detect characters, they avoid the problems that arise when an SVM merges multiple character-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged character region, and the character detection precision is thereby improved.

Description

Character detection method and device and computer readable storage medium
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular to a character detection method and device and a computer-readable storage medium.
Background
In daily life, a user often needs a smart device to recognize the characters contained in an image in order to obtain the character information. For example, when the user wants to convert the characters in a paper document into editable text on the smart device, the user can photograph the document to obtain an image; the smart device can then recognize the characters contained in the image and convert them into editable text according to the recognition result. Before the smart device can recognize the characters contained in an image, it must first detect them to determine their positions in the image.
In the related art, the smart device may binarize an image using a plurality of grayscale thresholds, each threshold yielding a binarized image containing black regions and white regions. The smart device then extracts, across the binarized images produced by the different thresholds, the black and white regions whose position and shape coincide most stably, as MSERs (Maximally Stable Extremal Regions). In this way the smart device can obtain a large number of MSERs for a single image. The smart device may then use an SVM (Support Vector Machine) to filter out the many MSERs that do not contain text, and finally merge the remaining text-containing regions to generate a complete text region composed of all the text in the image.
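For readers unfamiliar with this pipeline, the following is a minimal sketch of the related-art flow using OpenCV's MSER detector; the SVM filter and the final merge step are represented by placeholders, since the patent does not specify their implementations:

```python
import cv2

def detect_text_related_art(image_bgr, svm_filter):
    """Sketch of the related-art MSER + SVM pipeline (illustrative only).

    svm_filter is an assumed callable: (gray_image, bbox) -> True if the
    box contains text, standing in for a pre-trained SVM classifier.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()               # Maximally Stable Extremal Regions
    _, bboxes = mser.detectRegions(gray)   # candidate boxes as (x, y, w, h)
    # Filter out the many MSERs that do not contain text.
    text_boxes = [b for b in bboxes if svm_filter(gray, b)]
    # The remaining step, merging text_boxes into one complete text region,
    # is the part the disclosure criticizes as complex and error-prone for
    # the left-right and top-bottom structures of Chinese characters.
    return text_boxes
```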
Disclosure of Invention
To overcome the complex processing and low detection precision of the related-art approach of extracting MSERs and filtering them with an SVM, the present disclosure provides a character detection method and device and a computer-readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a text detection method, including:
processing a target image on which character detection is to be performed through a plurality of convolutional layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions;
and processing the plurality of candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and the position of the text region in the target image, wherein the text region is a region, among the plurality of candidate regions, that includes text.
Optionally, the processing the candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image includes:
taking the plurality of candidate regions as input to a pooling layer of the Faster RCNN model;
for each candidate region of the plurality of candidate regions, dividing the candidate region into a plurality of sub-regions of the same size;
performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after division;
and when it is determined, according to the maximum pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
Optionally, the performing, according to a plurality of sub-regions obtained after the dividing, a maximum pooling operation on the candidate region includes:
for each sub-region in the plurality of sub-regions, determining a maximum pixel value in pixel values of a plurality of pixel points included in the sub-region, and taking the determined maximum pixel value as the pixel value of the sub-region;
and determining the determined pixel values of the plurality of sub-regions as the pooling result of the candidate region.
Optionally, the processing the candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image includes:
taking the plurality of candidate regions as input to a pooling layer of the Faster RCNN model;
for each candidate region of the plurality of candidate regions, performing an average pooling operation on the candidate region;
and when it is determined, according to the average pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
According to a second aspect of the embodiments of the present disclosure, there is provided a text detection apparatus, the apparatus including:
the processing module is used for processing a target image on which character detection is to be performed through a plurality of convolutional layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions;
a determining module, configured to process the multiple candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image, where the text region is a region including text in the multiple candidate regions.
Optionally, the determining module includes:
an input sub-module for taking the plurality of candidate regions as input to a pooling layer of the Faster RCNN model;
a dividing sub-module for dividing, for each of the plurality of candidate regions, the candidate region into a plurality of sub-regions of the same size;
the first pooling sub-module is used for performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after division;
and the first determining submodule is used for determining the candidate region as a text region, and determining the position of the text region in the target image, when it is determined according to the maximum pooling result that the candidate region includes text.
Optionally, the first pooling sub-module is for:
for each sub-region in the plurality of sub-regions, determining a maximum pixel value in pixel values of a plurality of pixel points included in the sub-region, and taking the determined maximum pixel value as the pixel value of the sub-region;
and determining the determined pixel values of the plurality of sub-regions as the pooling result of the candidate region.
Optionally, the determining module includes:
the input submodule is used for taking the candidate regions as the input of the pooling layer of the Faster RCNN model;
a second pooling sub-module for performing an average pooling operation on each candidate region of the plurality of candidate regions;
and the second determining submodule is used for determining the candidate region as a text region, and determining the position of the text region in the target image, when it is determined according to the average pooling result that the candidate region includes text.
According to a third aspect of the embodiments of the present disclosure, there is provided a character detection apparatus, the apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of any one of the methods of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions which, when executed by a processor, implement the steps of any one of the methods of the first aspect described above.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. The target image on which character detection is to be performed is processed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions; the candidate regions are then processed through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the embodiments of the present disclosure use the Faster RCNN model to detect text, they avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is therefore improved. In addition, the plurality of candidate regions are processed through a single pooling layer, which replaces the three fully connected layers of the Faster RCNN model used for object detection, reducing the computational complexity of the Faster RCNN model during text detection and increasing the processing speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is an architecture diagram illustrating a Faster RCNN model for object detection in the related art, according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a text detection method according to an example embodiment.
FIG. 3A is a flow diagram illustrating a text detection method according to an example embodiment.
FIG. 3B is a schematic diagram illustrating a max-pooling operation according to an exemplary embodiment.
FIG. 4 is a flow diagram illustrating a text detection method according to an example embodiment.
FIG. 5A is a block diagram illustrating a text detection apparatus according to an exemplary embodiment.
FIG. 5B is a block diagram illustrating a determination module in accordance with an exemplary embodiment.
FIG. 5C is a block diagram illustrating a determination module in accordance with an exemplary embodiment.
FIG. 6 is a block diagram illustrating a text detection apparatus according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating a text detection apparatus according to an exemplary embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Before explaining the embodiments of the present disclosure in detail, an application scenario of the embodiments of the present disclosure will be described.
In daily life, a user often needs a smart device to recognize the characters contained in an image in order to obtain the character information. For example, when the user wants to convert the characters in a paper document into editable text on the smart device, the user can photograph the document to obtain an image, and the smart device can then recognize the characters contained in the image and convert them into editable text according to the recognition result. For another example, urban roads are now equipped with surveillance cameras that photograph passing vehicles; when a user needs to identify a designated license plate number from a captured image, the smart device must process the image to find the character region containing characters and then recognize the characters within it. Therefore, before the smart device recognizes the characters in an image, it must first detect them to determine their positions in the image. The character detection method provided by the embodiments of the present disclosure can be used in the above scenarios to detect the characters contained in an image and determine their positions, so that the characters can subsequently be recognized.
In the related art, text is generally detected by extracting MSERs and filtering them with an SVM. After the SVM filters the large number of MSERs, the multiple regions that include text must be merged; because of the left-right and top-bottom structures of Chinese characters, this merging process is very complex and the error rate of the text in the merged text region is high, which seriously affects the precision of text detection.
To solve the above technical problem, the embodiments of the present disclosure improve the Faster RCNN model used for object detection and apply it to text detection, so as to improve the accuracy of text detection.
The Faster RCNN model is developed from the RCNN (Region-based Convolutional Neural Networks) model for object detection. When object detection is performed with the Faster RCNN model, the detection process mainly comprises four basic steps: candidate region generation, feature extraction, classification, and position refinement. Fig. 1 is a schematic diagram of a conventional Faster RCNN model according to an embodiment of the disclosure. As shown in Fig. 1, the Faster RCNN model includes a plurality of convolutional layers 101 to 105 and fully connected layers 106, 107, and 108. The last convolutional layer 105 is connected to the fully connected layer 106, the fully connected layer 106 is connected to the fully connected layer 107, and the fully connected layer 107 is connected to the fully connected layer 108.
During object detection, the pixel values of the pixel points included in the target image are input to the convolutional layer 101, which performs a convolution operation on the received pixel values and outputs the result as the input value of the convolutional layer 102. By analogy, the output value of each convolutional layer serves as the input value of the next, until the last convolutional layer 105 determines a plurality of candidate regions based on the output value of the convolutional layer 104. The determined candidate regions are then used as input values of the fully connected layer 106; for each candidate region, the fully connected layers 106 and 107 determine whether the candidate region includes the object to be detected and determine the position coordinates of the candidate region. Finally, the fully connected layer 108 outputs, for each candidate region, the probability that the candidate region includes the object to be detected, the probability that it does not, and the corresponding position coordinates. Therefore, when the number of candidate regions is N, the fully connected layers must perform N discriminations and determine position coordinates N times. In practical applications, if the Faster RCNN model for object detection were applied directly to text detection, the last convolutional layer could output thousands of candidate regions, and discriminating thousands of candidate regions through the three fully connected layers would incur a high computational overhead and seriously affect the processing speed.
As can be seen from the above description of the architecture of the commonly used Faster RCNN model, that model includes three fully connected layers, and when characters in an image are detected with it, whether each of the plurality of candidate regions includes characters must be determined through those three layers. Since the number of candidate regions is usually large, with thousands of candidate regions the three fully connected layers must each perform thousands of discriminations, resulting in a large computational overhead and a slow processing speed. Therefore, to improve the precision of text detection while ensuring its processing speed, the embodiments of the present disclosure provide a text detection method, which is explained in detail below with reference to the drawings.
Fig. 2 is a flowchart illustrating a text detection method according to an exemplary embodiment. As shown in Fig. 2, the method may be used in a terminal or a server and includes the following steps:
in step 201, a plurality of candidate regions are obtained by processing a target image to be subjected to text detection through a plurality of convolutional layers included in a fast convolutional neural network (fast RCNN) model based on regions.
In step 202, the plurality of candidate regions are processed through the pooling layer included in the Faster RCNN model to obtain a text region and the position of the text region in the target image, where the text region is a region, among the plurality of candidate regions, that includes text.
In the embodiments of the present disclosure, the target image on which text detection is to be performed is processed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions. The candidate regions can then be processed through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the Faster RCNN model is used to detect text, the embodiments avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is improved. In addition, the plurality of candidate regions are processed through a single pooling layer, which replaces the three fully connected layers of the Faster RCNN model used for object detection, reducing the computational complexity of the Faster RCNN model during text detection and increasing the processing speed.
Optionally, processing the plurality of candidate regions through the pooling layer included in the Faster RCNN model to obtain the text region and the position of the text region in the target image includes:
taking a plurality of candidate regions as input of a pooling layer of a Faster RCNN model;
for each candidate region in the plurality of candidate regions, dividing the candidate region into a plurality of sub-regions with the same size;
performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after division;
and when it is determined, according to the maximum pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
After the pooling layer performs the maximum pooling operation on each sub-region of the candidate region to obtain the maximum pooling result, an estimated probability may be determined according to the maximum pooling result, where the estimated probability may be the probability that the candidate region is a text region, or the probability that it is not. The terminal may then compare the estimated probability with a preset probability to determine whether the candidate region includes text.
Optionally, performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after the division, including:
for each sub-area in the plurality of sub-areas, determining the maximum pixel value in the pixel values of a plurality of pixel points included in the sub-area, and taking the determined maximum pixel value as the pixel value of the sub-area;
and determining the determined pixel values of the plurality of sub-regions as the pooling result of the candidate region.
Optionally, processing the plurality of candidate regions through the pooling layer included in the Faster RCNN model to obtain the text region and the position of the text region in the target image includes:
taking a plurality of candidate regions as input of a pooling layer of a Faster RCNN model;
for each candidate region in the plurality of candidate regions, performing an average pooling operation on the candidate region;
and when it is determined, according to the average pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
After the pooling layer performs the average pooling operation on the candidate region to obtain the average pooling result, the estimated probability may be determined according to the average pooling result. The terminal may then compare the estimated probability with a preset probability to determine whether the candidate region includes text.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present disclosure, which are not described in detail again here.
In the embodiments of the present disclosure, when text in an image is detected with the Faster RCNN model, after the plurality of candidate regions corresponding to the target image are determined, either a maximum pooling operation or an average pooling operation may be performed on the candidate regions through the pooling layer. The text detection method using the maximum pooling operation is explained first, with reference to Fig. 3A.
Fig. 3A is a flowchart of a text detection method according to an exemplary embodiment. The method may be used in a terminal or a server; in the embodiment of the present disclosure it is explained with the terminal as the executing entity, but when the executing entity is a server the method described below may still be used to detect the text in an image. As shown in Fig. 3A, the method includes the following steps:
in step 301, a target image to be subjected to text detection is processed through a plurality of convolutional layers included in a fast convolutional neural network (fast RCNN) model based on a region, so as to obtain a plurality of candidate regions.
In the embodiments of the present disclosure, the Faster RCNN model includes a plurality of convolutional layers and a pooling layer, and no fully connected layers. When performing text detection, the terminal may process the target image through the input layer to obtain the pixel values of the pixel points included in the image, and then pass those pixel values to the first convolutional layer connected to the input layer. The first convolutional layer performs a convolution operation on the received pixel values, and its result serves as the input value of the second convolutional layer; by analogy, the output value of each convolutional layer serves as the input value of the next, and after processing by the plurality of convolutional layers, the last convolutional layer performs a convolution operation on the output of the preceding convolutional layer to obtain a plurality of feature maps. For each of the feature maps, the last convolutional layer convolves the feature map with a preset convolution kernel to obtain a plurality of candidate regions, each feature map corresponding to a plurality of candidate regions.
For example, suppose the last convolutional layer obtains 256 feature maps of size 40 × 60 through its convolution operation. For each of the 256 feature maps, a convolution operation is performed with a 3 × 3 convolution kernel, that is, a 3 × 3 sliding window is slid over the feature map, and for the center point of each 3 × 3 window, 9 candidate regions with different scales and different aspect ratios are generated. Thus 9 candidate regions are obtained for each window position, and the feature map yields 40 × 60 × 9 = 21600 candidate regions.
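The arithmetic of this example can be reproduced directly. In the sketch below, the feature-map size and the 9 anchors per position come from the example above, while the concrete scales and aspect ratios are assumptions in the spirit of Faster RCNN:

```python
import itertools

feat_h, feat_w = 40, 60            # feature-map size from the example
scales = (128, 256, 512)           # assumed anchor side lengths in pixels
aspect_ratios = (0.5, 1.0, 2.0)    # assumed width/height ratios

# 3 scales x 3 ratios = 9 candidate regions per 3 x 3 window position.
anchors_per_position = len(scales) * len(aspect_ratios)
total_candidates = feat_h * feat_w * anchors_per_position

# One (width, height) shape per scale/ratio pair, centered on each position.
anchor_shapes = [(s * r ** 0.5, s / r ** 0.5)
                 for s, r in itertools.product(scales, aspect_ratios)]

print(anchors_per_position, total_candidates)   # 9 21600
```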
After determining the candidate regions through the convolutional layers of the Faster RCNN model, the terminal may process them through the pooling layer included in the Faster RCNN model according to steps 302 to 304 below, so as to obtain the text region and the position of the text region in the target image.
In step 302, a plurality of candidate regions are used as input to the pooling layer of the Faster RCNN model.
In the embodiments of the present disclosure, the Faster RCNN model includes a plurality of convolutional layers and a pooling layer, with the last convolutional layer connected to the pooling layer. After determining the candidate regions through the last convolutional layer, the terminal may use them directly as the input values of the pooling layer and process them through the pooling layer, so as to determine the text region among the candidate regions. That is, in the embodiments of the present disclosure the Faster RCNN model does not include the three fully connected layers, and the plurality of candidate regions can be processed by a single pooling layer, which greatly reduces the computational complexity and increases the text detection speed.
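As a rough illustration of this architectural change (an assumption about how such a head could be wired, not code from the disclosure), the following sketch runs every candidate region through one pooling step and one probability check, in place of per-region passes through three fully connected layers:

```python
import numpy as np

def detect_text_regions(candidates, pool_fn, prob_fn, preset_prob=0.5):
    """candidates: list of (pixel_array, box) pairs from the last conv layer.

    pool_fn reduces a region's pixels to a pooled result (max or average
    pooling, as in steps 303/403 below); prob_fn turns that result into an
    estimated text probability. Both callables and the 0.5 default
    threshold are illustrative assumptions.
    """
    text_regions = []
    for pixels, box in candidates:
        pooled = pool_fn(np.asarray(pixels))
        # Compare the estimated probability with the preset probability.
        if prob_fn(pooled) > preset_prob:
            text_regions.append(box)  # coordinates from bounding-box regression
    return text_regions
```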
In step 303, for each candidate region of the plurality of candidate regions, the candidate region is divided into a plurality of sub-regions with the same size, and the candidate region is subjected to the maximum pooling operation according to the plurality of sub-regions obtained after division.
After determining the plurality of candidate regions, for each of the plurality of candidate regions, the terminal may divide the candidate region into a plurality of sub-regions having the same size. For each of the plurality of sub-regions, the terminal may determine a maximum pixel value among pixel values of a plurality of pixel points of the sub-region as the pixel value of the sub-region. Then, the terminal may determine the determined pixel values of the plurality of sub-regions as the pooling result of the candidate region.
Fig. 3B is a schematic diagram of maximum pooling provided by the embodiments of the present disclosure. As shown in Fig. 3B(a), assume the candidate region is divided into 4 sub-regions of size 2 × 2, each including 4 pixel points. The maximum pixel value in the first sub-region is 6, in the second sub-region 8, in the third sub-region 3, and in the fourth sub-region 4. After the maximum pixel value in each sub-region is determined, the pooling result of the candidate region is as shown in Fig. 3B(b).
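The 2 × 2 example of Fig. 3B can be written out as follows; the per-sub-region maxima 6, 8, 3, and 4 come from the description, while the remaining pixel values are filled in as assumptions:

```python
import numpy as np

# A 4 x 4 candidate region whose four 2 x 2 sub-regions have maxima 6, 8, 3, 4.
candidate = np.array([[1, 6, 8, 2],
                      [3, 5, 7, 4],
                      [3, 1, 0, 4],
                      [2, 0, 1, 3]])

def max_pool(region, out_h, out_w):
    """Divide region into out_h x out_w equal sub-regions and keep each maximum."""
    h, w = region.shape
    sub = region.reshape(out_h, h // out_h, out_w, w // out_w)
    return sub.max(axis=(1, 3))

print(max_pool(candidate, 2, 2))
# [[6 8]
#  [3 4]]
```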
After the terminal performs maximum pooling on the plurality of candidate regions through the pooling layer, the pooling layer of the Faster RCNN model may determine, according to the pooling result, an estimated probability used to judge whether each candidate region is a text region; meanwhile, the terminal may also determine the position coordinates of each candidate region through the pooling layer by bounding-box regression.
In step 304, when it is determined that the candidate region includes a text according to the maximum pooling result, the candidate region is determined as a text region, and a position of the text region in the target image is determined.
After determining the maximum pooling result of a candidate region, the terminal may determine from it an estimated probability used to judge whether the candidate region is a text region. The estimated probability may be the probability that the candidate region includes text; in that case the terminal compares the estimated probability with a preset probability, and if the estimated probability is greater, the terminal determines that the candidate region includes text, that is, determines the candidate region as a text region. It should be noted that the preset probability is the preset minimum probability at which a candidate region is considered to include text. Since in step 303, after performing maximum pooling on the candidate region, the terminal determines the position coordinates of the candidate region by bounding-box regression, the position coordinates of the text region can be obtained directly once the terminal determines that the candidate region is a text region.
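A minimal sketch of this decision follows, assuming the estimated probability comes from a learned linear scoring of the pooled values followed by a sigmoid; the disclosure itself only states that an estimated probability is compared with the preset probability. The resulting function can serve as prob_fn in the earlier detect_text_regions sketch:

```python
import numpy as np

def make_prob_fn(weights, bias):
    """Build an assumed probability head over a pooled result."""
    def prob_fn(pooled):
        score = float(np.dot(np.ravel(pooled), weights) + bias)
        return 1.0 / (1.0 + np.exp(-score))   # estimated probability of text
    return prob_fn

# Illustrative weights for the 2 x 2 pooled result of the Fig. 3B example.
prob_fn = make_prob_fn(weights=np.array([0.3, 0.2, 0.1, 0.4]), bias=-1.0)
print(prob_fn([[6, 8], [3, 4]]) > 0.5)   # True: candidate kept as a text region
```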
In the embodiments of the present disclosure, the terminal may process the target image on which text detection is to be performed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions, and then process the candidate regions through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the Faster RCNN model is used to detect text, the embodiments avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is therefore improved. In addition, the embodiments of the present disclosure improve the conventional Faster RCNN model by processing the plurality of candidate regions through a single pooling layer in place of the three fully connected layers of the related-art Faster RCNN model, reducing the computational complexity of the model during text detection and increasing the processing speed. Meanwhile, maximum pooling of the candidate regions through the pooling layer better preserves the features within the candidate regions, so that determining the position coordinates of the candidate regions according to the pooling result improves the sensitivity to position information and thus the accuracy of text detection.
The above embodiment describes the text detection method when the maximum pooling operation is performed on a plurality of candidate regions by the pooling layer, and next, the text detection method when the average pooling operation is performed on a plurality of candidate regions by the pooling layer will be described with reference to fig. 4.
Fig. 4 is a flowchart of a text detection method according to an exemplary embodiment. The method may be used in a terminal or a server; in the embodiment of the present disclosure it is explained with the terminal as the executing entity, but when the executing entity is a server the method described below may still be used to detect the text in an image. As shown in Fig. 4, the method comprises the following steps:
In step 401, a target image on which text detection is to be performed is processed through a plurality of convolutional layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions.
The implementation manner of this step may refer to the implementation manner in step 301 in the foregoing embodiment, and details are not described in this embodiment of the disclosure.
In step 402, a plurality of candidate regions are input to the pooling layer of the Faster RCNN model.
The implementation manner of this step may refer to the implementation manner in step 302 in the foregoing embodiment, and details are not described in this embodiment of the disclosure.
In step 403, for each candidate region of the plurality of candidate regions, an average pooling operation is performed on the candidate region.
After the candidate regions are input into the pooling layer, for each of the candidate regions, a pixel average value of a plurality of pixel points included in the candidate region may be calculated, and the calculated pixel average value is used as a pixel value of the candidate region.
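A minimal sketch of this step, contrasting with the sub-region maxima kept by max_pool in the earlier example (the input values are the same assumed 4 × 4 region):

```python
import numpy as np

def average_pool(region):
    """The mean of all pixel values becomes the region's single pooled value."""
    return float(np.mean(region))

print(average_pool([[1, 6, 8, 2],
                    [3, 5, 7, 4],
                    [3, 1, 0, 4],
                    [2, 0, 1, 3]]))   # 3.125
```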
After the terminal performs average pooling on the plurality of candidate regions through the pooling layer, the pooling layer of the Faster RCNN model may output, according to the pooling result, an estimated probability used to judge whether each candidate region is a text region; meanwhile, the terminal may also determine the position coordinates of each candidate region through the pooling layer by bounding-box regression.
In step 404, when it is determined that the candidate region includes a text according to the average pooling result, the candidate region is determined as a text region, and a position of the text region in the target image is determined.
After determining the average pooling result of a candidate region, the terminal may determine from it an estimated probability used to judge whether the candidate region is a text region, and may then judge, according to the estimated probability, whether the candidate region includes text. If the candidate region includes text, the candidate region is determined as a text region. Since in step 403, after performing average pooling on the candidate region, the terminal determines the position coordinates of the candidate region by bounding-box regression, the position coordinates of the text region can be obtained directly once the terminal determines that the candidate region is a text region.
In the embodiments of the present disclosure, the terminal may process the target image on which text detection is to be performed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions, and then process the candidate regions through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the Faster RCNN model is used to detect text, the embodiments avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is therefore improved. In addition, the embodiments of the present disclosure improve the conventional Faster RCNN model by processing the plurality of candidate regions through a single pooling layer in place of the three fully connected layers of the related-art Faster RCNN model, reducing the computational complexity of the model during text detection and increasing the processing speed.
After explaining the text detection method provided by the embodiment of the present disclosure, a text detection device provided by the embodiment of the present disclosure is introduced next.
Fig. 5A is a block diagram illustrating a text detection apparatus 500 according to an exemplary embodiment. Referring to Fig. 5A, the apparatus includes a processing module 501 and a determining module 502.
A processing module 501, configured to process a target image on which text detection is to be performed through a plurality of convolutional layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions;
the determining module 502 is configured to process the multiple candidate regions through a pooling layer included in the fast RCNN model to obtain a text region and a position of the text region in the target image, where the text region is a region including text in the multiple candidate regions.
Optionally, referring to Fig. 5B, the determining module 502 includes:
an input sub-module 5021 for taking a plurality of candidate regions as input of the pooling layer of the Faster RCNN model;
a dividing submodule 5022, configured to, for each candidate region in the plurality of candidate regions, divide the candidate region into a plurality of sub-regions having the same size;
the first pooling sub-module 5023 is used for performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after division;
the first determining sub-module 5024 is configured to determine the candidate region as a text region and determine a position of the text region in the target image when it is determined that the candidate region includes text according to the maximum pooling result.
Optionally, the first pooling sub-module 5023 is configured to:
for each sub-area in the plurality of sub-areas, determining the maximum pixel value in the pixel values of a plurality of pixel points included in the sub-area, and taking the determined maximum pixel value as the pixel value of the sub-area;
and determining the determined pixel values of the plurality of sub-areas as the pooling result of the candidate area.
Optionally, referring to Fig. 5C, the determining module 502 includes:
an input sub-module 5021 for taking a plurality of candidate regions as input of the pooling layer of the Faster RCNN model;
a second pooling sub-module 5025 for performing an average pooling operation on the candidate regions for each of a plurality of candidate regions;
the second determining sub-module 5026 is configured to determine the candidate region as a text region and determine a position of the text region in the target image when it is determined that the text is included in the candidate region according to the average pooling result.
In the embodiments of the present disclosure, the terminal may process the target image on which text detection is to be performed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions, and then process the candidate regions through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the Faster RCNN model is used to detect text, the embodiments avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is therefore improved. In addition, the embodiments of the present disclosure improve the conventional Faster RCNN model by processing the plurality of candidate regions through a single pooling layer in place of the three fully connected layers of the related-art Faster RCNN model, reducing the computational complexity of the model during text detection and increasing the processing speed.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an apparatus 600 for text detection in accordance with an example embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to Fig. 6, the apparatus 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the apparatus 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions so as to perform all or part of the steps of the methods described above. Further, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and the other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The power component 606 provides power to the various components of the apparatus 600, and may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 600.
The multimedia component 608 includes a screen that provides an output interface between the apparatus 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, audio component 610 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the apparatus 600. For example, the sensor component 614 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as the display and keypad of the apparatus 600; it may also detect a change in position of the apparatus 600 or of a component of the apparatus 600, the presence or absence of user contact with the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in temperature of the apparatus 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the apparatus 600 and other devices in a wired or wireless manner. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the methods provided by the embodiments illustrated in Figs. 2 to 4 and described above.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is also provided, wherein instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the text detection method provided by the embodiments illustrated in Figs. 2, 3A, and 4.
Fig. 7 is a block diagram illustrating an apparatus 700 for text detection according to an exemplary embodiment. For example, the apparatus 700 may be provided as a server. Referring to Fig. 7, the apparatus 700 includes a processor 722, which further includes one or more processors, and memory resources, represented by a memory 732, for storing instructions executable by the processor 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. Further, the processor 722 is configured to execute the instructions so as to perform the methods provided by the embodiments illustrated in Figs. 2 to 4 and described above.
The apparatus 700 may also include a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as the memory 732 that includes instructions, which are executable by the processor 722 of the device 700 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of a server, enable the server to perform a method of text detection, the method comprising:
processing a target image on which text detection is to be performed through a plurality of convolutional layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions;
and processing the plurality of candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and the position of the text region in the target image, wherein the text region is a region, among the plurality of candidate regions, that includes text.
Optionally, the processing the candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image includes:
taking the plurality of candidate regions as input to a pooling layer of the Faster RCNN model;
for each candidate region of the plurality of candidate regions, dividing the candidate region into a plurality of sub-regions of the same size;
performing maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after division;
and when it is determined, according to the maximum pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
Optionally, the performing, according to a plurality of sub-regions obtained after the dividing, a maximum pooling operation on the candidate region includes:
for each sub-region in the plurality of sub-regions, determining a maximum pixel value in pixel values of a plurality of pixel points included in the sub-region, and taking the determined maximum pixel value as the pixel value of the sub-region;
and determining the determined pixel values of the plurality of sub-regions as the pooling result of the candidate region.
Optionally, the processing the candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image includes:
taking the plurality of candidate regions as input to a pooling layer of the Faster RCNN model;
for each candidate region of the plurality of candidate regions, performing an average pooling operation on the candidate region;
and when it is determined, according to the average pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
In the embodiments of the present disclosure, the server may process the target image on which text detection is to be performed through the plurality of convolutional layers included in the Faster RCNN model to obtain a plurality of candidate regions, and then process the candidate regions through the pooling layer included in the Faster RCNN model to obtain a text region, that is, a candidate region that includes text, and the position of the text region in the target image. Because the Faster RCNN model is used to detect text, the embodiments avoid the problems that arise when an SVM merges multiple text-containing regions, where the left-right and top-bottom structures of Chinese characters make the merging process complicated and leave a high error rate in the merged text region, and the text detection precision is therefore improved. In addition, the embodiments of the present disclosure improve the conventional Faster RCNN model by processing the plurality of candidate regions through a single pooling layer in place of the three fully connected layers of the related-art Faster RCNN model, reducing the computational complexity of the model during text detection and increasing the processing speed.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (6)

1. A method for detecting text, the method comprising:
processing a target image on which text detection is to be performed through a plurality of convolution layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions, wherein obtaining the plurality of candidate regions comprises: taking the output value of a preceding convolutional layer of the plurality of convolutional layers as the input value of the next convolutional layer, and obtaining a plurality of feature maps by performing a convolution operation on the input value of the last convolutional layer through the processing of the plurality of convolutional layers; and for each feature map of the plurality of feature maps, convolving the feature map with a preset convolution kernel at the last convolutional layer to obtain the plurality of candidate regions;
processing the plurality of candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image, wherein the text region is a region including text among the plurality of candidate regions;
wherein the processing the plurality of candidate regions through the pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image includes:
taking the plurality of candidate regions as input to the pooling layer of the Faster RCNN model;
for each candidate region of the plurality of candidate regions, dividing the candidate region into a plurality of sub-regions of the same size and performing a maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after the division; or, for each candidate region of the plurality of candidate regions, performing an average pooling operation on the candidate region;
and when it is determined, according to the maximum pooling result or the average pooling result, that the candidate region includes text, determining the candidate region as a text region, and determining the position of the text region in the target image.
2. The method of claim 1, wherein the performing the maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after the division comprises:
for each sub-region of the plurality of sub-regions, determining the maximum pixel value among the pixel values of the pixels included in the sub-region, and taking the determined maximum pixel value as the pixel value of the sub-region;
and taking the pixel values determined for the plurality of sub-regions as the maximum pooling result of the candidate region.
3. A text detection apparatus, the apparatus comprising:
a processing module, configured to process a target image on which text detection is to be performed through a plurality of convolution layers included in a Faster Region-based Convolutional Neural Network (Faster RCNN) model to obtain a plurality of candidate regions;
a determining module, configured to process the plurality of candidate regions through a pooling layer included in the Faster RCNN model to obtain a text region and a position of the text region in the target image, wherein the text region is a region including text among the plurality of candidate regions;
wherein the processing module is further configured to take the output value of a preceding convolutional layer of the plurality of convolutional layers as the input value of the next convolutional layer, and obtain a plurality of feature maps by performing a convolution operation on the input value of the last convolutional layer through the processing of the plurality of convolutional layers; and, for each feature map of the plurality of feature maps, convolve the feature map with a preset convolution kernel at the last convolutional layer to obtain the plurality of candidate regions;
and the determining module is further configured to take the plurality of candidate regions as input to the pooling layer of the Faster RCNN model; for each candidate region of the plurality of candidate regions, divide the candidate region into a plurality of sub-regions of the same size and perform a maximum pooling operation on the candidate region according to the plurality of sub-regions obtained after the division, or, for each candidate region of the plurality of candidate regions, perform an average pooling operation on the candidate region; and when it is determined, according to the maximum pooling result or the average pooling result, that the candidate region includes text, determine the candidate region as a text region and determine the position of the text region in the target image.
4. The apparatus of claim 3, wherein the determining module is further configured to:
for each sub-region of the plurality of sub-regions, determine the maximum pixel value among the pixel values of the pixels included in the sub-region, and take the determined maximum pixel value as the pixel value of the sub-region;
and take the pixel values determined for the plurality of sub-regions as the maximum pooling result of the candidate region.
5. A text detection apparatus, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any one of claims 1-2.
6. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method of any one of claims 1-2.
CN201710675521.0A 2017-08-09 2017-08-09 Character detection method and device and computer readable storage medium Active CN107480665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710675521.0A CN107480665B (en) 2017-08-09 2017-08-09 Character detection method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710675521.0A CN107480665B (en) 2017-08-09 2017-08-09 Character detection method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN107480665A CN107480665A (en) 2017-12-15
CN107480665B true CN107480665B (en) 2020-08-11

Family

ID=60599000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710675521.0A Active CN107480665B (en) 2017-08-09 2017-08-09 Character detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN107480665B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875723B (en) * 2018-01-03 2023-01-06 北京旷视科技有限公司 Object detection method, device and system and storage medium
CN108229463A (en) * 2018-02-07 2018-06-29 众安信息技术服务有限公司 Character recognition method based on image
CN108241874B (en) * 2018-02-13 2020-12-18 河南科技大学 Video character region positioning method based on BP neural network and spectrum analysis
CN108595544A (en) * 2018-04-09 2018-09-28 深源恒际科技有限公司 A kind of document picture classification method
CN108564084A (en) * 2018-05-08 2018-09-21 北京市商汤科技开发有限公司 character detecting method, device, terminal and storage medium
CN108665769B (en) * 2018-05-11 2021-04-06 深圳市鹰硕技术有限公司 Network teaching method and device based on convolutional neural network
CN108921166A (en) * 2018-06-22 2018-11-30 深源恒际科技有限公司 Medical bill class text detection recognition method and system based on deep neural network
CN109002768A (en) * 2018-06-22 2018-12-14 深源恒际科技有限公司 Medical bill class text extraction method based on the identification of neural network text detection
CN109241962A (en) * 2018-08-30 2019-01-18 云南电网有限责任公司普洱供电局 A kind of character identifying method and device
CN111259878A (en) * 2018-11-30 2020-06-09 中移(杭州)信息技术有限公司 Method and equipment for detecting text

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022285A (en) * 2016-05-30 2016-10-12 北京智芯原动科技有限公司 Vehicle type identification method and vehicle type identification device based on convolutional neural network
CN106530305B (en) * 2016-09-23 2019-09-13 北京市商汤科技开发有限公司 Semantic segmentation model training and image partition method and device calculate equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016665A (en) * 2017-02-16 2017-08-04 浙江大学 A kind of CT pulmonary nodule detection methods based on depth convolutional neural networks
CN106960243A (en) * 2017-03-06 2017-07-18 中南大学 A kind of method for improving convolutional neural networks structure

Also Published As

Publication number Publication date
CN107480665A (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN107480665B (en) Character detection method and device and computer readable storage medium
CN106651955B (en) Method and device for positioning target object in picture
CN105488527B (en) Image classification method and device
US9674395B2 (en) Methods and apparatuses for generating photograph
CN109446994B (en) Gesture key point detection method and device, electronic equipment and storage medium
CN108010060B (en) Target detection method and device
CN107784279B (en) Target tracking method and device
US11455491B2 (en) Method and device for training image recognition model, and storage medium
EP3890306B1 (en) Photographing method and device, mobile terminal and storage medium
CN106127751B (en) Image detection method, device and system
CN106557759B (en) Signpost information acquisition method and device
US10216976B2 (en) Method, device and medium for fingerprint identification
CN108062547B (en) Character detection method and device
CN108668080B (en) Method and device for prompting degree of dirt of lens and electronic equipment
CN107464253B (en) Eyebrow positioning method and device
CN109784164B (en) Foreground identification method and device, electronic equipment and storage medium
US20220222831A1 (en) Method for processing images and electronic device therefor
CN113409342A (en) Training method and device for image style migration model and electronic equipment
CN109509195B (en) Foreground processing method and device, electronic equipment and storage medium
CN112188091B (en) Face information identification method and device, electronic equipment and storage medium
CN110619325B (en) Text recognition method and device
CN111104920A (en) Video processing method and device, electronic equipment and storage medium
CN112927122A (en) Watermark removing method, device and storage medium
CN108717542B (en) Method and device for recognizing character area and computer readable storage medium
CN108171222B (en) Real-time video classification method and device based on multi-stream neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant