CN111652218A - Text detection method, electronic device and computer readable medium - Google Patents


Info

Publication number
CN111652218A
CN111652218A (application CN202010496954.1A)
Authority
CN
China
Prior art keywords
text
image
detected
feature
text region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010496954.1A
Other languages
Chinese (zh)
Inventor
秦勇
李兵
张子浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd filed Critical Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010496954.1A priority Critical patent/CN111652218A/en
Publication of CN111652218A publication Critical patent/CN111652218A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a text detection method, an electronic device and a computer-readable medium. The text detection method comprises the following steps: performing feature extraction and segmentation on a text image to be detected to obtain a text region probability map of the text image to be detected; determining a text region binary map of the text image to be detected according to the text region probability map; extracting edge information of the text region binary map to obtain a text region edge map; performing connected component detection on the text region edge map, and obtaining the minimum circumscribed rectangle of the text region according to the detection result; and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle. The embodiments of the invention improve the speed and efficiency of text detection, in particular dense text detection.

Description

Text detection method, electronic device and computer readable medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a text detection method, electronic equipment and a computer readable medium.
Background
Text detection is a technique for detecting text regions in images and marking their bounding boxes. It has a wide range of applications and is a front-end step of many computer vision tasks, such as image search, character recognition, identity authentication and visual navigation.
The main purpose of text detection is to locate the position of text lines or characters in an image. A currently popular approach is sliding-window text detection. Based on the idea of generic object detection, it sets a large number of anchor boxes with different aspect ratios and sizes, uses these anchor boxes as sliding windows to traverse the image (or a feature map obtained by performing convolution operations on the image), and classifies each searched window as containing text or not.
However, this approach is too computationally intensive: it not only requires a large amount of computing resources, but also takes a long time.
Disclosure of Invention
The present invention provides a text detection scheme to at least partially address the above-mentioned problems.
According to a first aspect of the embodiments of the present invention, there is provided a text detection method, comprising: performing feature extraction and segmentation on a text image to be detected to obtain a text region probability map of the text image to be detected; determining a text region binary map of the text image to be detected according to the text region probability map; extracting edge information of the text region binary map to obtain a text region edge map; performing connected component detection on the text region edge map, and obtaining the minimum circumscribed rectangle of the text region according to the detection result; and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle.
According to a second aspect of embodiments of the present invention, there is provided an electronic apparatus, the apparatus including: one or more processors; a computer readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text detection method according to the first aspect.
According to a third aspect of embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the text detection method according to the first aspect.
According to the scheme provided by the embodiments of the invention, when performing text detection, in particular dense text detection where the text density is high, a text region probability map of the text image to be detected is obtained from the results of feature extraction and segmentation of that image. A text region binary map of the image is then determined from the probability map, an edge map is determined from the edge information of the binary map, and connected component detection is performed on the edge map. On the one hand, the text region can be determined more accurately through the text region probability map; on the other hand, compared with connected component detection over the whole text region, detection over the edge map alone greatly reduces the amount of data to be processed and improves detection speed and efficiency. Once the detection result is obtained, the minimum circumscribed rectangle of the text region can be determined, and the text detection result follows from that rectangle. Through this process, text, and dense text in particular, can be detected effectively; compared with the conventional approach, the amount of computation is reduced, computing resources are saved, and detection speed and efficiency are improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1A is a flowchart illustrating a text detection method according to a first embodiment of the present invention;
FIG. 1B is a schematic structural diagram of a pixel aggregation network PAN;
FIG. 1C is a schematic diagram of a differentiable binarization network;
FIG. 1D is a schematic diagram of a neural network model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a text detection method according to a second embodiment of the invention;
FIG. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of it. It should also be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that the embodiments, and the features of the embodiments, may be combined with each other in the absence of conflict. The present invention will now be described in detail with reference to the embodiments and the attached drawings.
Example one
Referring to FIG. 1A, a flowchart illustrating the steps of a text detection method according to a first embodiment of the present invention is shown.
The text detection method of the embodiment comprises the following steps:
Step S102: perform feature extraction and segmentation on the text image to be detected to obtain a text region probability map of the text image to be detected.
The scheme of the embodiments of the invention can be applied to text detection at various text densities, including but not limited to regular, dense and sparse text, and is particularly suited to dense text. The specific criterion for deciding whether a given text is dense may be set by a person skilled in the art according to the practical situation, including but not limited to: the spacing between texts (e.g., spacing less than 2 points), the number of texts per unit area (e.g., more than 3 texts per square centimeter), and so on; embodiments of the present invention are not limited in this regard.
Feature extraction is performed on the text image to be detected to obtain a feature extraction result, i.e. the corresponding features, which form a feature map. In this embodiment, after the feature map is obtained, image segmentation is performed on the basis of the feature map to obtain the text region probability map of the text image to be detected. The text region probability map represents the probability that each pixel in the text image to be detected belongs to the foreground or the background, so that the text region can subsequently be determined more accurately.
In this embodiment, this step may be implemented as follows: performing feature extraction on the text image to be detected to obtain a feature map; upsampling the feature map and concatenating the upsampled features; and performing image segmentation based on the feature map corresponding to the concatenated features to obtain the text region probability map. Optionally, to obtain an accurate text region probability map quickly, in one feasible approach this process may be implemented by a neural network model.
For example, the neural network model may include a PAN structure and a DB structure. The PAN structure of the model performs feature extraction on the text image to be detected to obtain a PAN feature extraction result; the PAN feature extraction result is input into the DB structure of the model for upsampling, and the upsampled features are concatenated by the DB structure; image segmentation is then performed based on the feature map corresponding to the concatenated features to obtain the probability map of the text image to be detected. In this way, the faster forward (front-end) part of PAN is used for feature extraction, and the faster backward (back-end) part of DB is used to obtain the text region probability map, which improves the speed and efficiency of feature extraction and probability map acquisition, and thereby of text detection as a whole.
Specifically, performing feature extraction on the text image to be detected using the PAN structure to obtain a PAN feature extraction result may include: inputting the text image to be detected into the residual network part (such as a ResNet network) of the PAN structure to obtain a first text image feature. To further improve the representational power of the image features, optionally, after the first text image feature is obtained, it may additionally be input into the feature pyramid enhancement part of the PAN structure to obtain a second text image feature. Using the forward part of PAN for feature extraction improves the speed of feature extraction.
Further, the first and second text image features may each include at least one of: texture features, edge features, corner features and semantic features of the image region of the text to be detected. These features effectively characterize the image region where the text is located, providing a basis for subsequent processing.
PAN is short for pixel aggregation network; a PAN structure is shown in FIG. 1B. In FIG. 1B, the input text image is received through the input layer; the backbone adopts ResNet, which extracts features from the text image and delivers them to two FPEMs (Feature Pyramid Enhancement Modules). The FPEMs extract features again to enhance them, giving the features greater representational power. After the two FPEMs, a feature fusion module (FFM) fuses the features output by the FPEMs; text pixels in the text region are then guided to the correct kernel to realize text detection.
The embodiment of the invention uses a partial structure of PAN, comprising ResNet18 (the residual network part) and the FPEMs (Feature Pyramid Enhancement Modules), as shown by the dashed box in FIG. 1B. Specifically, the PAN structure used in this embodiment takes ResNet18 as the basic network backbone and extracts features such as texture, edges, corners and semantic information from the input text image to be detected; these features are represented by 4 groups of multi-channel feature maps of different sizes. The extracted features are then processed by 2 FPEM modules, which extract texture, edge, corner and semantic features again.
Compared with a single FPEM module, 2 FPEM modules achieve the best effect. The processing of each FPEM module is the same. The 4 groups of multi-channel feature maps of different sizes obtained earlier are referred to, from largest to smallest, as the forward first, second, third and fourth group feature maps. In the up-scale enhancement pass, the forward fourth group feature map is first upsampled by a factor of 2, i.e. its size is enlarged 2 times, and added point by point, channel by channel, to the forward third group feature map; the result undergoes a depthwise separable convolution followed by a convolution, batch normalization and an activation function, and is called the reverse second group feature map. The same operation applied to the reverse second and forward second group feature maps yields the reverse third group, and applied to the reverse third and forward first group feature maps yields the reverse fourth group; the forward fourth group feature map is regarded as the reverse first group, giving 4 groups of reverse feature maps. In the down-scale enhancement pass, the reverse fourth group feature map is taken as the target first group feature map; it is downsampled by a factor of 2, i.e. its size is reduced 2 times, and added point by point, channel by channel, to the reverse third group feature map, and the result undergoes a depthwise separable convolution followed by a convolution, batch normalization and an activation function, yielding the target second group feature map. The same operation applied to the target second and reverse second group feature maps yields the target third group, and applied to the target third and reverse first group feature maps yields the target fourth group. The target first, second, third and fourth group feature maps are the output of the FPEM module. The second FPEM module takes the output of the first as input and performs the same operations to obtain its output, whose features form the feature maps used in subsequent processing.
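The first, upsample-and-add pass of the FPEM processing described above can be sketched roughly as follows. This is a minimal illustration assuming NumPy arrays in (channels, height, width) layout and nearest-neighbour upsampling; the depthwise separable convolution, convolution, batch normalization and activation that follow each addition are trained layers and are replaced here by an identity stand-in:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(a, b):
    # Point-by-point, channel-wise addition; in the real FPEM this is
    # followed by a depthwise separable conv + conv + BN + activation,
    # which are omitted here (identity stand-in).
    return a + b

def fpem_upscale(f1, f2, f3, f4):
    """Up-scale enhancement: f1..f4 are the forward feature map groups,
    ordered from largest (f1) to smallest (f4)."""
    r1 = f4                         # reverse first group
    r2 = fuse(upsample2x(f4), f3)   # reverse second group
    r3 = fuse(upsample2x(r2), f2)   # reverse third group
    r4 = fuse(upsample2x(r3), f1)   # reverse fourth group
    return r1, r2, r3, r4

# Toy example: 4 groups, 8 channels each, spatial sizes 32, 16, 8, 4.
f1, f2, f3, f4 = (np.ones((8, s, s)) for s in (32, 16, 8, 4))
r1, r2, r3, r4 = fpem_upscale(f1, f2, f3, f4)
print(r4.shape)  # largest reverse map matches the largest forward map
```

The down-scale pass is symmetric, replacing `upsample2x` with a stride-2 downsampling.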
In a feasible manner, obtaining the probability map of the text image to be detected by concatenating and segmenting the upsampled features through the DB structure may be implemented as follows: the upsampled features are concatenated using the differentiable binarization (DB) network structure, and the image is segmented based on the feature map corresponding to the concatenated features to obtain the text region probability map of the text image to be detected. Image segmentation is the process of dividing an image into several mutually disjoint regions and extracting the object of interest. It is also a labelling process: pixels belonging to the same region are assigned the same label. In this embodiment, image segmentation is realized through the DB structure to obtain the corresponding text region probability map; compared with probability maps obtained by other methods, the one obtained through the DB structure is more accurate, and the DB structure processes features faster.
Specifically, the features obtained through the PAN structure may be upsampled, using the DB structure, to a preset size, such as 1/4 of the size of the original picture of the text image to be detected; the upsampled features are then concatenated, and image segmentation is performed on the concatenation result to obtain the text region probability map of the text image to be detected.
The differentiable binarization network, or DB network, is likewise based on the ResNet18 architecture; a schematic DB structure is shown in FIG. 1C. In FIG. 1C, the input image is fed to a feature-pyramid backbone; the pyramid features are upsampled to the same size and cascaded to produce a feature F; a probability map (P) and a threshold map (T) are then predicted simultaneously from F; finally, an approximate binary map is calculated from P and T. In the embodiment of the invention, as shown by the dashed box in FIG. 1C, during training the feature map output by the PAN structure part is input to the DB part. The DB part extracts features from the feature map, upsamples all of them to 1/4 of the original image size and concatenates them, then performs a convolution to obtain a 2-channel feature map as output: the first channel outputs the text region probability map, and the second channel outputs the text region threshold map. The text region is then distinguished from the background region by a differentiable binarization function, whose parameters can be trained along with the model. The binary map of the image's text region can then be calculated from the text region threshold map and the text region probability map; connected components are computed on the binary map to obtain the contracted text region, which is then expanded outwards according to a certain rule and proportion to obtain the real text region.
It should be noted that, after training is complete, when the DB is used directly, a preset threshold may be applied to binarize the text region probability map according to the magnitude of each value, without needing the threshold map of the training stage. In one approach, the preset threshold may be set according to the thresholds of the preceding model training stage; in another, it may be determined by analysing a large number of thresholds used in the binarization of a large number of text region probability maps.
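At inference time, the thresholding just described reduces to a single element-wise comparison. A minimal sketch follows; the 0.3 threshold is an illustrative assumption, not a value given in the patent:

```python
import numpy as np

def binarize(prob_map, threshold=0.3):
    # Pixels whose text-region probability exceeds the preset threshold
    # become foreground (1); all others become background (0).
    return (prob_map > threshold).astype(np.uint8)

# Toy 2x3 text region probability map.
prob = np.array([[0.9, 0.8, 0.1],
                 [0.7, 0.2, 0.05]])
print(binarize(prob))
```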
The structure of a neural network model combining the above PAN structure and DB structure is shown in FIG. 1D. As can be seen from FIG. 1D, the neural network model of the embodiment of the invention effectively utilizes the forward processing part of PAN and the backward processing part of DB. Note that FIG. 1D only illustrates the processing up to the output of the upsampled feature maps; the processing of those output feature maps to obtain the text region probability map of the text image to be detected, and the subsequent processing, can be derived by a person skilled in the art from the textual portion of this embodiment. Through the structure shown in FIG. 1D, feature extraction and segmentation of the text image to be detected can be performed using the PAN and DB structures to obtain its text region probability map.
Step S104: and determining a text region binary image of the text image to be detected according to the text region probability image.
As mentioned above, in the application stage (also referred to as the inference or test stage), the text region probability map may be binarized according to a preset threshold, and the text region binary map of the text image to be detected is obtained from the binarization result. The preset threshold can be set appropriately by a person skilled in the art according to actual requirements, so that pixels in the text image to be detected are effectively distinguished and an effective binary map is obtained.
In this embodiment, the text detection scheme of the embodiment of the invention applies the PAN part's feature extraction processing together with the DB part's processing from upsampling through obtaining the binary map. On this basis, after the PAN part outputs its feature extraction result (the features or feature maps), the feature maps are all upsampled to 1/4 of the original image size and concatenated; a convolution then yields the feature map serving as the text region probability map of the contracted text region, after which the binary map of the image's text region is calculated by the differentiable binarization function.
Step S106: and extracting the edge information of the binary image of the text area to obtain an edge image of the text area.
In a feasible manner, a Canny operator can be used to extract the edge information of the text region binary map, obtaining the text region edge map.
The Canny operator, also known as the Canny edge detector, is a multi-stage edge detection algorithm. It operates, for example, by: smoothing the image with a Gaussian filter; computing gradient magnitude and direction using finite differences of first-order partial derivatives; applying non-maximum suppression to the gradient magnitude; and detecting and connecting edges with a double-threshold algorithm. Compared with other methods, the Canny operator yields more accurate edge information. From the obtained edge information, a text region edge map can be produced, that is, an image consisting mainly of the text region's edge information.
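The Canny operator targets grayscale images; since the input here is already a binary map, the sketch below uses a much simpler neighbour test (a foreground pixel with at least one background 4-neighbour is an edge pixel) purely to illustrate what a text region edge map looks like. It is not an implementation of Canny:

```python
import numpy as np

def binary_edges(bin_map):
    # A foreground pixel is an edge pixel if at least one 4-neighbour
    # is background; pad with background so borders count as edges.
    b = bin_map.astype(bool)
    padded = np.pad(b, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return (b & ~interior).astype(np.uint8)

region = np.zeros((5, 6), dtype=np.uint8)
region[1:4, 1:5] = 1          # a filled 3x4 text region
edges = binary_edges(region)
print(edges.sum())            # only the boundary pixels remain
```

The point of the edge map, as the embodiment notes, is that subsequent connected component detection runs over far fewer pixels than the filled region.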
Step S108: and detecting a connected domain of the text region edge image, and obtaining the minimum circumscribed rectangle of the text region according to the detection result.
Connected component detection on the text region edge map can be realized by a person skilled in the art in any appropriate manner according to the actual situation. Since connected component detection need only be carried out on the edges of the text region, the amount of detection data is greatly reduced and detection efficiency and speed are improved. From the connected component detection on the text region edge map, the minimum circumscribed rectangle of the text region can be obtained. In the embodiment of the invention, the text region is the portion of the image occupied by the text to be detected.
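A minimal sketch of connected component detection via flood fill, returning one bounding box per component. As a simplification it returns axis-aligned boxes; the patent calls for the minimum circumscribed rectangle, which may be rotated (e.g. OpenCV's `cv2.minAreaRect` in practice):

```python
import numpy as np
from collections import deque

def connected_components_boxes(edge_map):
    """Label 8-connected components of a binary map and return one
    (min_row, min_col, max_row, max_col) bounding box per component."""
    h, w = edge_map.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if edge_map[r, c] and not seen[r, c]:
                # Breadth-first flood fill from this unvisited pixel.
                q = deque([(r, c)])
                seen[r, c] = True
                rmin, rmax, cmin, cmax = r, r, c, c
                while q:
                    y, x = q.popleft()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and edge_map[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
                boxes.append((rmin, cmin, rmax, cmax))
    return boxes

edge_map = np.zeros((6, 10), dtype=np.uint8)
edge_map[1:3, 1:4] = 1    # first text region edge pixels
edge_map[4:6, 6:9] = 1    # second, disjoint text region edge pixels
print(connected_components_boxes(edge_map))
```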
Step S110: and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle.
In the embodiment of the invention, the text detection result of the text image to be detected can be understood as the text boxes in which the texts to be detected are located; at least one text box exists in a text image to be detected.
In a feasible manner, each minimum circumscribed rectangle can further be expanded outwards, and the text detection result of the text image to be detected is obtained from the expansion result. An expansion ratio can be preset, and each minimum circumscribed rectangle is expanded according to that ratio. This approach handles situations in which, for some reason, the detected text regions are smaller than those in the original image.
For example, when the text region probability map is obtained using a DB structure, the text regions are contracted during the DB's processing, and accordingly the obtained probability map is a contracted text region probability map. In this case, expanding each minimum circumscribed rectangle and obtaining the text detection result of the text image to be detected from the expansion result may be implemented as follows: according to the contraction information of the contracted text region probability map, determine matching expansion information; then, according to the expansion information, expand each minimum circumscribed rectangle correspondingly. That is, the expansion corresponds to the contraction: the amount of contraction determines the subsequent amount of expansion of the minimum circumscribed rectangle. The contraction information may be a contraction ratio, and correspondingly the expansion information may be an expansion ratio. During training of the DB structure, the annotated text boxes are contracted, so the text region probability map of the trained DB structure is annotated on the basis of the contracted text boxes. Hence, after training, the features in the text region probability map and binary map obtained by the DB structure in the application stage are the features after contraction. The text region edge map obtained from the text region binary map, and the minimum circumscribed rectangle of the text region obtained from the edge map, therefore need to be expanded outwards to recover the corresponding text's image region in the original image, ensuring the accuracy of the result.
In one feasible approach, expanding each minimum circumscribed rectangle may include: for each minimum circumscribed rectangle, obtaining the coordinates of its centre point from the coordinates of its four vertices; obtaining the four corresponding coordinate difference vectors from the coordinates of the four vertices and of the centre point; scaling the four coordinate difference vectors according to the expansion information to obtain four new vectors; and expanding the current minimum circumscribed rectangle according to the new vectors and the centre coordinates, the expanded rectangle being the minimum circumscribed rectangle of the text in the original image. By this method, the expansion computation is simple and fast.
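The expansion steps just described map directly onto a few vector operations; a minimal NumPy sketch (the 1.5 expansion ratio is an illustrative assumption — in practice it is matched to the contraction ratio used in training):

```python
import numpy as np

def expand_rect(vertices, ratio):
    """Expand a minimum circumscribed rectangle about its centre.

    vertices: (4, 2) array of the rectangle's four vertex coordinates.
    ratio: expansion factor matched to the contraction information.
    """
    vertices = np.asarray(vertices, dtype=float)
    center = vertices.mean(axis=0)   # centre point from the 4 vertices
    diff = vertices - center         # 4 coordinate difference vectors
    return center + ratio * diff     # 4 new, expanded vertices

rect = [[2, 2], [6, 2], [6, 4], [2, 4]]   # contracted text box
print(expand_rect(rect, 1.5))
```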
Through this embodiment, when performing text detection, in particular dense text detection where the text density is high, a text region probability map of the text image to be detected can be obtained from the feature extraction and segmentation results of that image; a text region binary map is then determined from the probability map, an edge map is determined from the edge information of the binary map, and connected component detection is performed on the edge map. Once the detection result is obtained, the minimum circumscribed rectangle of the text region can be determined, and the text detection result follows from that rectangle. Through this process, text, and dense text in particular, can be detected effectively; compared with the conventional approach, the amount of computation is reduced, computing resources are saved, and detection speed and efficiency are improved.
The text detection method of this embodiment may be performed by any suitable electronic device having data processing capability, including but not limited to: servers, mobile terminals (such as mobile phones, tablets, etc.), PCs and the like.
Example two
Referring to fig. 2, a schematic flow chart of a text detection method according to a second embodiment of the invention is shown.
The text detection method of the present embodiment is implemented by a neural network model as shown in fig. 1D, and includes the following steps:
step S202: the text image to be detected is input into the Resnet18 network.
In this embodiment, the Resnet18 network is part of the PAN (Pixel Aggregation Network) and has already been trained. It extracts features from the input image, the features of each channel forming that channel's feature map.
Step S204: feature extraction is performed through the Resnet18 network.
To distinguish this step from the subsequent feature extraction, its output is labeled extracted feature 1. Extracted feature 1 captures image features such as texture, edges, corners, and semantic information.
Step S206: the extracted features are processed again through two FPEM modules.
In this step, building on step S204, feature extraction is performed again through the two FPEM (Feature Pyramid Enhancement Module) modules; the output is labeled extracted feature 2. Extracted feature 2 again captures image features such as texture, edges, corners, and semantic information, and yields 4 corresponding groups of feature maps.
The above steps S202 to S206 implement the PAN part of the processing (the pre-processing part uses PAN); PAN uses the FPEM modules to speed up the forward computation. For the structure and processing of the Resnet18 network and the FPEM modules, refer to the PAN network; details are not repeated here.
Step S208: the feature maps formed from the re-extracted features are up-sampled to 1/4 of the original image size and concatenated.
In this step, the 4 groups of feature maps obtained in step S206 are all up-sampled to 1/4 of the original image size and concatenated together, where concatenation means joining the 4 equally sized groups along the channel axis. For example, if each group has 128 channels, concatenating the 4 groups yields a single group of feature maps with 512 channels.
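The up-sampling and channel-wise concatenation can be sketched as follows (a minimal numpy sketch: the 128 channels per group and the 160x160 target size are illustrative, and nearest-neighbour upsampling stands in for the interpolation used in practice):

```python
import numpy as np

def upsample_nearest(fmap, target_hw):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    c, h, w = fmap.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th   # source row for each target row
    cols = np.arange(tw) * w // tw   # source column for each target column
    return fmap[:, rows][:, :, cols]

# 4 FPEM feature-map groups at different strides, 128 channels each (illustrative sizes)
groups = [np.random.rand(128, 160 // s, 160 // s).astype(np.float32) for s in (1, 2, 4, 8)]

target = (160, 160)  # 1/4 of a hypothetical 640x640 input image
upsampled = [upsample_nearest(g, target) for g in groups]

# Concatenate along the channel axis: 4 groups x 128 channels -> 512 channels
fused = np.concatenate(upsampled, axis=0)
print(fused.shape)  # (512, 160, 160)
```
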
Step S210: a convolution operation is performed on the concatenated feature maps to obtain a text region probability map, and the probability map is binarized against a set threshold to obtain a text region binary map.
From the application (inference) perspective, the text region probability map is obtained from the concatenated feature maps. The probability map is then binarized directly against a preset threshold to obtain the text region binary map; the threshold can be set by those skilled in the art according to actual requirements.
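The inference-time binarization is a plain threshold comparison; a minimal sketch (the threshold value 0.3 is illustrative, not mandated by the text):

```python
import numpy as np

def binarize(prob_map, threshold=0.3):
    """Inference-time binarization: pixels above the threshold are text (1), others background (0)."""
    return (prob_map > threshold).astype(np.uint8)

prob = np.array([[0.9, 0.2],
                 [0.4, 0.05]])
bin_map = binarize(prob, threshold=0.3)
print(bin_map)
# [[1 0]
#  [1 0]]
```
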
In the training phase, the concatenated feature maps undergo one convolution operation, which outputs a feature map with 2 channels whose size is 1/4 of the original image. The feature maps in this step are the result of up-sampling and concatenating the maps produced by extracted feature 2; since up-sampling enlarges a feature map by interpolation, the maps obtained here are a scaled, channel-concatenated version of those produced by extracted feature 2.
The obtained 2-channel feature map is then taken as input and passed sequentially through 1 convolution operation, 1 batch normalization operation, 1 ReLU activation, 1 deconvolution operation, and 1 Sigmoid activation, with 2 output channels, producing 1 group of feature maps at 1/4 of the original image size. The feature map of the 1st channel serves as the threshold map, and that of the 2nd channel as the text region probability map. In the training stage, the threshold map is subtracted pixel by pixel from the text region probability map to obtain 1 image of the same size as the original; passing this image through a Sigmoid function yields the binary map.
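The training-time approximate binarization described above can be sketched as follows (the steepness factor k is an assumption: the text describes a plain sigmoid of the pixel-wise difference, while the DB paper uses a large factor such as k = 50):

```python
import numpy as np

def approx_binarize(prob_map, thresh_map, k=1.0):
    """Training-time differentiable binarization: sigmoid of the pixel-wise
    difference between the probability map and the threshold map.
    k is a steepness factor (k = 1 matches the plain sigmoid in the text;
    the DB paper uses a much larger value, e.g. k = 50)."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

p = np.array([[0.9, 0.2]])   # probability map (channel 2)
t = np.array([[0.5, 0.5]])   # threshold map (channel 1)
b = approx_binarize(p, t, k=50.0)
print(b.round(3))  # pixels well above the threshold -> ~1, well below -> ~0
```
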
The above steps S208 to S210 implement the DB part of the processing (the post-processing part uses the DB network); DB's post-processing is simpler and faster than PAN's.
Step S212: the edges of the text region binary map are extracted with the Canny operator to obtain the text region edge map.
Specifically: the Canny operator is applied to the text region binary map to obtain its edge information, and a binary image describing the edges of the text region is obtained from that edge information.
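Because the input here is already a binary map, the Canny result reduces in essence to the foreground boundary; a minimal numpy sketch of that boundary extraction (a stand-in for the actual Canny operator, which also performs smoothing and hysteresis thresholding):

```python
import numpy as np

def binary_edges(bin_map):
    """Edge map of a binary image: a pixel is an edge if it is foreground (1)
    and at least one of its 4-neighbours is background (0)."""
    padded = np.pad(bin_map, 1, constant_values=0)
    up, down = padded[:-2, 1:-1], padded[2:, 1:-1]
    left, right = padded[1:-1, :-2], padded[1:-1, 2:]
    neighbour_min = np.minimum(np.minimum(up, down), np.minimum(left, right))
    return ((bin_map == 1) & (neighbour_min == 0)).astype(np.uint8)

bin_map = np.zeros((5, 5), dtype=np.uint8)
bin_map[1:4, 1:4] = 1          # a 3x3 text blob
edges = binary_edges(bin_map)
print(edges[2, 2])  # interior pixel -> 0
print(edges[1, 1])  # boundary pixel -> 1
```
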
Step S214: a connected domain operation is performed on the text region edge map to obtain connected domain information, and the minimum circumscribed rectangle is computed from the obtained connected domain information.
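The connected domain detection and rectangle fitting can be sketched as follows (a simplified illustration: BFS labelling with 8-connectivity returning axis-aligned bounding boxes, whereas an implementation would typically use cv2.connectedComponents and cv2.minAreaRect to obtain rotated minimum-area rectangles):

```python
import numpy as np
from collections import deque

def connected_components(edge_map):
    """8-connected component labelling via BFS; returns, for each component,
    an axis-aligned bounding box (x_min, y_min, x_max, y_max)."""
    h, w = edge_map.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if edge_map[sy, sx] and not seen[sy, sx]:
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                ys, xs = [], []
                while q:
                    y, x = q.popleft()
                    ys.append(y)
                    xs.append(x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < h and 0 <= nx < w and edge_map[ny, nx] and not seen[ny, nx]:
                                seen[ny, nx] = True
                                q.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

edge_map = np.zeros((6, 10), dtype=np.uint8)
edge_map[1:3, 1:4] = 1   # first text blob
edge_map[4:6, 6:9] = 1   # second text blob
boxes = connected_components(edge_map)
print(boxes)  # [(1, 1, 3, 2), (6, 4, 8, 5)]
```
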
Step S216: the center point coordinates of each minimum circumscribed rectangle are computed from its 4 vertex coordinates.
Step S218: for each minimum circumscribed rectangle, the center point coordinates are subtracted from each of the 4 vertex coordinates to obtain 4 vectors.
Step S220: based on the vertex information of the minimum circumscribed rectangle, the rectangle is expanded by vector addition and subtraction according to a set rule and expansion ratio, yielding the 4 vertex coordinates of the final required text region frame.
Specifically: each of the 4 vectors obtained in step S218 is scaled by a set factor to obtain a new vector, and the center point coordinates are then added to each new vector to obtain the 4 vertex coordinates of the text region frame.
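Steps S216 to S220 can be sketched as follows (the expansion factor is illustrative; in the method it is derived from the contraction/expansion information):

```python
def expand_rectangle(vertices, scale=1.2):
    """Expand a rectangle about its centre: for each vertex, scale the
    centre-to-vertex vector by `scale` and add the centre back.
    `scale` is an illustrative expansion factor."""
    cx = sum(x for x, _ in vertices) / 4.0   # centre = mean of the 4 vertices
    cy = sum(y for _, y in vertices) / 4.0
    expanded = []
    for x, y in vertices:
        vx, vy = x - cx, y - cy              # coordinate difference vector
        expanded.append((cx + scale * vx, cy + scale * vy))
    return expanded

rect = [(0, 0), (4, 0), (4, 2), (0, 2)]
expanded = expand_rectangle(rect, scale=1.5)
print(expanded)
# [(-1.0, -0.5), (5.0, -0.5), (5.0, 2.5), (-1.0, 2.5)]
```
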
Step S222: the dense text detection is complete.
As the above process shows, in the model training phase the Resnet18 network serves as the base network model: convolution operations extract features from the input image, and the extracted features are then processed twice by FPEM modules (the input and output of an FPEM module are 4 corresponding groups of feature maps with the same sizes and channel counts). All processed feature maps are up-sampled to 1/4 of the original image size and concatenated, and a convolution operation on the concatenated maps produces a 2-channel output feature map, consistent with the DB idea: the first channel represents the text region probability map and the second the text region threshold map. A differentiable binarization function then processes the probability map and the threshold map to obtain the text region binary map. In the testing stage, the text region probability map is converted directly into a binary map with a preset threshold, which reduces the amount of computation. The Canny operator then extracts edge information from the binary map, giving a binary image that describes the text region edges; connected domains are computed on this image, and the minimum circumscribed rectangle is obtained for each connected domain. From the 4 vertex coordinates of the minimum circumscribed rectangle, the center point coordinates are computed; the center point coordinates are subtracted from the 4 vertex coordinates to obtain four vectors; each vector is scaled by a factor (which can be set to a fixed value according to the specific situation); and the center point coordinates are added to the new vectors to obtain the 4 vertex coordinates of the text region frame, completing the dense text detection.
This embodiment combines the advantages of PAN and DB to optimize the post-processing of the text region binary map. While preserving the final text detection effect, it achieves a smaller amount of computation than either PAN or DB and greatly improves dense text detection speed.
Example three
Fig. 3 shows the hardware structure of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device may include: a processor 301, a communication interface 302, a memory 303, and a communication bus 304.
Wherein:
the processor 301, the communication interface 302, and the memory 303 communicate with each other via a communication bus 304.
A communication interface 302 for communicating with other electronic devices or servers.
The processor 301 is configured to execute the program 305, and may specifically perform relevant steps in the text detection method embodiment described above.
In particular, program 305 may include program code comprising computer operating instructions.
The processor 301 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The electronic device comprises one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 303 stores a program 305. Memory 303 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 305 may specifically be configured to cause the processor 301 to perform the following operations: extracting features of a text image to be detected to obtain a text region probability map of the text image to be detected; determining a text region binary image of the text image to be detected according to the text region probability image; extracting edge information of the binary image of the text area to obtain an edge image of the text area; detecting a connected domain of the edge graph of the text region, and obtaining a minimum circumscribed rectangle of the text region according to a detection result; and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle.
In an alternative embodiment, the program 305 is further configured to enable the processor 301, when obtaining the text detection result of the text image to be detected according to the minimum bounding rectangle: and performing rectangle external expansion on each minimum external rectangle, and obtaining a text detection result of the text image to be detected according to the external expansion result.
In an optional implementation, the text region probability map is a contracted text region probability map, and accordingly, the minimum bounding rectangle of the text region is the minimum bounding rectangle of the contracted text region; the program 305 is further configured to cause the processor 301, when performing rectangle-wrapping on each of the minimum bounding rectangles: according to the contracted information of the probability map of the contracted text region, determining expanded information matched with the contracted information; and according to the external expansion information, performing external expansion corresponding to the external expansion information on each minimum external rectangle.
In an alternative embodiment, the program 305 is further configured to enable the processor 301 to, when performing the dilation corresponding to the dilation information on each minimum bounding rectangle, obtain, for each minimum bounding rectangle, center point coordinates according to four vertex coordinates; obtaining four corresponding coordinate difference vectors according to the coordinates of the four vertexes and the coordinates of the central point; according to the external expansion information, expanding the four coordinate difference vectors to obtain four new vectors; and according to the new vector and the central point coordinate, carrying out outward expansion on the current minimum circumscribed rectangle.
In an alternative embodiment, the program 305 is further configured to enable the processor 301, when performing feature extraction and segmentation on the text image to be detected to obtain a text region probability map of the text image to be detected, to: perform feature extraction on the text image to be detected to obtain feature maps; up-sample the feature maps and concatenate the up-sampled features; and perform image segmentation based on the feature maps corresponding to the concatenated features to obtain the text region probability map.
In an optional implementation, the program 305 is further configured to enable the processor 301, when performing feature extraction on the text image to be detected to obtain feature maps, to perform the feature extraction using the pixel aggregation network (PAN) structure of a neural network model, obtaining a PAN feature extraction result. The program 305 is further configured to enable the processor 301, when up-sampling the feature maps, concatenating the up-sampled features, and performing image segmentation based on the concatenated features to obtain the text region probability map, to input the PAN feature extraction result into the DB structure of the neural network model for up-sampling, concatenate the up-sampled features through the DB structure, and perform image segmentation based on the feature maps corresponding to the concatenated features to obtain the probability map of the text image to be detected.
In an optional implementation, the program 305 is further configured to enable the processor 301, when using the PAN structure of the neural network model to perform feature extraction on the text image to be detected and obtain the PAN feature extraction result, to input the text image to be detected into the residual network portion of the PAN structure to obtain a first text image feature.
Further, in an alternative embodiment, the program 305 is further configured to enable the processor 301 to input the first text image feature into the feature pyramid enhancement structure portion in the PAN structure of the pixel aggregation network to obtain the second text image feature.
In an alternative embodiment, the first and second text image features each comprise at least one of: and texture features, edge features, corner features and semantic features of the image region of the text to be detected.
In an alternative embodiment, the program 305 is further configured to enable the processor 301, when determining the text region binary map of the text image to be detected according to the text region probability map, to: and carrying out binarization on the text region probability map according to a preset threshold value, and obtaining a text region binary map of the text image to be detected according to a binarization result.
In an alternative embodiment, the program 305 is further configured to enable the processor 301 to, when extracting edge information of the text region binary map to obtain a text region edge map, perform edge information extraction on the text region binary map using a Canny operator to obtain a text region edge map.
For specific implementation of each step in the program 305, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing text detection method embodiments, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
With this electronic device, when performing text detection, particularly detection of dense text with high text density, a text region probability map of the text image to be detected can be obtained from its feature extraction result. A text region binary map of the text image to be detected is then determined from the probability map, an edge map is determined from the edge information of the binary map, and connected domain detection is performed on the edge map. From the detection result, the minimum circumscribed rectangle of the text region can be determined, and the text detection result is then obtained from the minimum circumscribed rectangle. This process detects text, particularly dense text, effectively; compared with conventional approaches it reduces the amount of computation, saves computing resources, and improves detection speed and efficiency.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code configured to perform the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program performs the above-described functions defined in the method in the embodiment of the present invention when executed by a central processing unit (CPU). It should be noted that the computer readable medium in the embodiments of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In an embodiment of the invention, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code configured to carry out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions configured to implement the specified logical function(s). In the above embodiments, specific precedence relationships are provided, but these precedence relationships are only exemplary, and in particular implementations, the steps may be fewer, more, or the execution order may be modified. That is, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an access module and a transmit module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, an embodiment of the present invention further provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the text detection method described in the above embodiments.
As another aspect, an embodiment of the present invention further provides a computer-readable medium, which may be included in the apparatus described in the above embodiment; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: extracting features of a text image to be detected to obtain a text region probability map of the text image to be detected; determining a text region binary image of the text image to be detected according to the text region probability image; extracting edge information of the binary image of the text area to obtain an edge image of the text area; detecting a connected domain of the edge graph of the text region, and obtaining a minimum circumscribed rectangle of the text region according to a detection result; and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle.
The expressions "first", "second", "said first" or "said second" used in various embodiments of the invention may modify various components without regard to order and/or importance, but these expressions do not limit the respective components. They are used only to distinguish one element from another.
The foregoing description presents only the preferred embodiments of the invention and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention according to the embodiments of the present invention is not limited to the specific combination of the above-mentioned features, and also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept described above. For example, technical solutions may be formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present invention.

Claims (13)

1. A text detection method, comprising:
performing feature extraction and segmentation on a text image to be detected to obtain a text region probability map of the text image to be detected;
determining a text region binary image of the text image to be detected according to the text region probability image;
extracting edge information of the binary image of the text area to obtain an edge image of the text area;
detecting a connected domain of the edge graph of the text region, and obtaining a minimum circumscribed rectangle of the text region according to a detection result;
and obtaining a text detection result of the text image to be detected according to the minimum circumscribed rectangle.
2. The method according to claim 1, wherein the obtaining a text detection result of the text image to be detected according to the minimum bounding rectangle comprises:
and performing rectangle external expansion on each minimum external rectangle, and obtaining a text detection result of the text image to be detected according to the external expansion result.
3. The method according to claim 2, wherein the text region probability map is a contracted text region probability map, and accordingly, the minimum bounding rectangle of the text region is the minimum bounding rectangle of the contracted text region;
the rectangle expanding is carried out on each minimum external rectangle, and the method comprises the following steps: according to the contracted information of the probability map of the contracted text region, determining expanded information matched with the contracted information; and according to the external expansion information, performing external expansion corresponding to the external expansion information on each minimum external rectangle.
4. The method of claim 3, wherein the step of performing the flaring corresponding to the flaring information on each minimum bounding rectangle comprises:
for each minimum circumscribed rectangle, acquiring center point coordinates according to the coordinates of its four vertexes;
obtaining four corresponding coordinate difference vectors according to the coordinates of the four vertexes and the coordinates of the central point;
according to the external expansion information, expanding the four coordinate difference vectors to obtain four new vectors;
and according to the new vector and the central point coordinate, carrying out outward expansion on the current minimum circumscribed rectangle.
5. The method according to claim 1, wherein the extracting and segmenting features of the text image to be detected to obtain the text region probability map of the text image to be detected comprises:
extracting the characteristics of the text image to be detected to obtain a characteristic mapping chart;
the feature mapping graph is subjected to upsampling, and the upsampled features are connected in series; and carrying out image segmentation based on the feature mapping image corresponding to the features after the series connection to obtain the text region probability map.
6. The method of claim 5,
the feature extraction of the text image to be detected to obtain a feature mapping chart comprises the following steps: using the PAN structure of the neural network model to extract the features of the text image to be detected, and obtaining a PAN feature extraction result;
the feature mapping graph is subjected to upsampling, and the upsampled features are connected in series; performing image segmentation based on the feature mapping graph corresponding to the features after the series connection to obtain the text region probability graph, wherein the image segmentation comprises the following steps: inputting the PAN feature extraction result into a DB structure of the neural network model for up-sampling, and connecting the up-sampled features in series through the DB structure; and performing image segmentation based on the feature mapping image corresponding to the features after the serial connection to obtain a probability map of the text image to be detected.
7. The method according to claim 6, wherein the feature extraction of the text image to be detected by using the PAN structure of the neural network model to obtain a PAN feature extraction result comprises:
and inputting the text image to be detected into a residual error network part in the PAN structure to obtain a first text image characteristic.
8. The method of claim 7, wherein after the obtaining the first text image feature, the method further comprises:
and inputting the first text image feature into a feature pyramid enhancement structure part in the PAN structure to obtain a second text image feature.
9. The method of claim 8, wherein the first text image feature and the second text image feature each comprise at least one of: and texture features, edge features, corner features and semantic features of the image region of the text to be detected.
10. The method according to claim 1, wherein the determining the text region binary map of the text image to be detected according to the text region probability map comprises:
and carrying out binarization on the text region probability map according to a preset threshold value, and obtaining a text region binary map of the text image to be detected according to a binarization result.
11. The method according to claim 1, wherein the extracting edge information of the binary map of the text region to obtain an edge map of the text region comprises:
and using a Canny operator to solve the edge information of the binary image of the text region to obtain an edge image of the text region.
12. An electronic device, characterized in that the device comprises:
one or more processors;
a computer readable medium configured to store one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text detection method of any of claims 1-11.
13. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the text detection method according to any one of claims 1 to 11.
CN202010496954.1A 2020-06-03 2020-06-03 Text detection method, electronic device and computer readable medium Pending CN111652218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010496954.1A CN111652218A (en) 2020-06-03 2020-06-03 Text detection method, electronic device and computer readable medium

Publications (1)

Publication Number Publication Date
CN111652218A true CN111652218A (en) 2020-09-11

Family

ID=72350375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010496954.1A Pending CN111652218A (en) 2020-06-03 2020-06-03 Text detection method, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN111652218A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250831A (en) * 2016-07-22 2016-12-21 北京小米移动软件有限公司 Image detecting method, device and the device for image detection
CN110781967A (en) * 2019-10-29 2020-02-11 华中科技大学 Real-time text detection method based on differentiable binarization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGHUI LIAO et al.: "Real-time Scene Text Detection with Differentiable Binarization", https://arxiv.org/abs/1911.08947 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814794A (en) * 2020-09-15 2020-10-23 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112183322A (en) * 2020-09-27 2021-01-05 成都数之联科技有限公司 Text detection and correction method for any shape
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112380899A (en) * 2020-09-30 2021-02-19 深圳点猫科技有限公司 Method, device and equipment for recognizing text in advertisement image
CN112016551B (en) * 2020-10-23 2021-04-09 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112200191A (en) * 2020-12-01 2021-01-08 北京京东尚科信息技术有限公司 Image processing method, image processing device, computing equipment and medium
CN112581301A (en) * 2020-12-17 2021-03-30 塔里木大学 Detection and early warning method and system for residual film quantity of farmland based on deep learning
CN112581301B (en) * 2020-12-17 2023-12-29 塔里木大学 Detection and early warning method and system for residual quantity of farmland residual film based on deep learning
CN112597878A (en) * 2020-12-21 2021-04-02 安徽七天教育科技有限公司 Sample making and identifying method for scanning test paper layout analysis
CN112287924B (en) * 2020-12-24 2021-03-16 北京易真学思教育科技有限公司 Text region detection method, text region detection device, electronic equipment and computer storage medium
CN112287924A (en) * 2020-12-24 2021-01-29 北京易真学思教育科技有限公司 Text region detection method, text region detection device, electronic equipment and computer storage medium
CN113076814A (en) * 2021-03-15 2021-07-06 腾讯科技(深圳)有限公司 Text area determination method, device, equipment and readable storage medium
CN113052558A (en) * 2021-03-30 2021-06-29 浙江畅尔智能装备股份有限公司 Automatic piece counting system for machining parts of power transmission tower and automatic piece counting method thereof
CN112990204A (en) * 2021-05-11 2021-06-18 北京世纪好未来教育科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113326887A (en) * 2021-06-16 2021-08-31 深圳思谋信息科技有限公司 Text detection method and device and computer equipment
CN113326887B (en) * 2021-06-16 2024-03-29 深圳思谋信息科技有限公司 Text detection method, device and computer equipment
CN113628184A (en) * 2021-08-06 2021-11-09 信利光电股份有限公司 Method and device for detecting defects of display screen based on Fourier transform and readable storage medium

Similar Documents

Publication Publication Date Title
CN111652218A (en) Text detection method, electronic device and computer readable medium
CN111709420B (en) Text detection method, electronic device and computer readable medium
CN111652217B (en) Text detection method and device, electronic equipment and computer storage medium
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN108229504B (en) Image analysis method and device
CN108830780B (en) Image processing method and device, electronic device and storage medium
CN112016551B (en) Text detection method and device, electronic equipment and computer storage medium
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN111145209A (en) Medical image segmentation method, device, equipment and storage medium
CN111275034B (en) Method, device, equipment and storage medium for extracting text region from image
CN112101386B (en) Text detection method, device, computer equipment and storage medium
US8494284B2 (en) Methods and apparatuses for facilitating detection of text within an image
CN111507337A (en) License plate recognition method based on hybrid neural network
CN108960247B (en) Image significance detection method and device and electronic equipment
KR20180067909A (en) Apparatus and method for segmenting image
CN111967449A (en) Text detection method, electronic device and computer readable medium
CN113269280A (en) Text detection method and device, electronic equipment and computer readable storage medium
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN111783777A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN114511862B (en) Form identification method and device and electronic equipment
CN112101347B (en) Text detection method and device, electronic equipment and computer storage medium
CN111967460B (en) Text detection method and device, electronic equipment and computer storage medium
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN115393868A (en) Text detection method and device, electronic equipment and storage medium
Rani et al. Object Detection in Natural Scene Images Using Thresholding Techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200911