CN111814794A - Text detection method and device, electronic equipment and storage medium - Google Patents

Text detection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN111814794A
CN111814794A (application CN202010963784.3A)
Authority
CN
China
Prior art keywords
text
image
text region
feature
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010963784.3A
Other languages
Chinese (zh)
Other versions
CN111814794B (en)
Inventor
刘军 (Liu Jun)
李盼盼 (Li Panpan)
秦勇 (Qin Yong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd filed Critical Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010963784.3A priority Critical patent/CN111814794B/en
Publication of CN111814794A publication Critical patent/CN111814794A/en
Application granted granted Critical
Publication of CN111814794B publication Critical patent/CN111814794B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Abstract

The application provides a text detection method and apparatus, an electronic device, and a storage medium. The scheme is as follows: feature extraction is performed on a text image to obtain a feature image; the feature image is processed by a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region; the feature image is processed by a sequence model to obtain the key point positions of text regions in the text image; and the text region detection result of the text image is determined according to the probability that the pixel points belong to the text region and the key point positions of the text regions. The method and apparatus increase the text detection speed while improving the anti-interference capability of text detection, making the method more robust.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information technologies, and in particular, to a text detection method and apparatus, an electronic device, and a storage medium.
Background
Text detection has a wide range of applications and is a preliminary step in many computer vision tasks; for example, image search, character recognition, identity authentication, and visual navigation all require text detection as a preceding step. The main purpose of text detection is to locate the positions of text lines or characters in an image. Currently popular text detection methods, such as sliding-window-based methods and connected-component-based methods, involve a large amount of computation, consume substantial computing resources and time, and cannot meet the speed requirements of practical application scenarios.
Disclosure of Invention
The embodiments of the present application provide a text detection method, a text detection apparatus, an electronic device, and a storage medium, aiming to solve the above problems in the related art. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a text detection method, including:
performing feature extraction on a text image to obtain a feature image;
processing the feature image by using a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region;
processing the feature image by using a sequence model to obtain the key point positions of text regions in the text image;
and determining a text region detection result of the text image according to the probability that the pixel points belong to the text region and the key point positions of the text regions.
In one embodiment, performing feature extraction on the text image to obtain a feature image includes:
performing a convolution operation on the text image by using a residual neural network to obtain the feature image.
In one embodiment, processing the feature image by using a convolutional neural network to obtain a probability that a pixel point in the text image belongs to a text region includes:
extracting the features of the feature image by using a feature pyramid enhancement module;
and carrying out convolution operation on the feature image after feature extraction by utilizing a convolution neural network to obtain a probability image of a text region of the text image, wherein the probability image of the text region comprises the probability that pixel points in the text image belong to the text region.
In one embodiment, performing a convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image, includes:
performing up-sampling operation on the feature image after feature extraction, and performing series operation on the image obtained by the up-sampling operation;
and carrying out convolution operation and deconvolution operation on the images obtained by the series operation by utilizing a convolution neural network to obtain a probability image of a text region of the text image.
In one embodiment, processing the feature image by using a sequence model to obtain the location of a key point of a text region in the text image includes:
processing the feature image by using a feature pyramid network to obtain a feature vector;
and processing the feature vector by using the sequence model to obtain a key point coordinate vector of a text region in the text image.
In one embodiment, processing the feature image by using the feature pyramid network to obtain a feature vector includes:
performing pooling operation on the feature images by using a feature pyramid network;
and performing a series (concatenation) operation on the pooled feature images to obtain a feature vector.
In one embodiment, processing the feature vector by using a sequence model to obtain a key point coordinate vector of a text region in the text image includes:
inputting the feature vector into the sequence model and outputting a low-dimensional representation vector of the high-dimensional vector of the key point coordinates of the text region in the text image;
and restoring the low-dimensional representation vector to the higher dimension by using a principal component analysis algorithm to obtain the key point coordinate vector of the text region in the text image.
In one embodiment, determining a text region detection result of a text image according to a probability that a pixel belongs to a text region and a key point position of the text region includes:
obtaining pixel points belonging to the text region in the text image according to the probability that the pixel points belong to the text region;
and under the condition that the key point positions of the text regions are matched with the pixel points belonging to the text regions, taking the text regions corresponding to the key point positions of the text regions as text region detection results of the text images.
In one embodiment, obtaining a pixel point belonging to a text region in a text image according to a probability that the pixel point belongs to the text region includes:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
In one embodiment, the matching of the key point positions of the text region with the pixel points belonging to the text region includes at least one of:
the coordinates of the key point position of the text region are the same as the coordinates of a pixel point belonging to the text region;
a preset neighborhood centered on the key point position of the text region contains pixel points belonging to the text region.
In a second aspect, an embodiment of the present application provides a text detection apparatus, including:
the extraction unit is used for performing feature extraction on a text image to obtain a feature image;
the first processing unit is used for processing the feature image by using a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region;
the second processing unit is used for processing the feature image by using a sequence model to obtain the key point positions of text regions in the text image;
and the determining unit is used for determining the text region detection result of the text image according to the probability that the pixel point belongs to the text region and the key point position of the text region.
In one embodiment, the extraction unit is configured to:
perform a convolution operation on the text image by using a residual neural network to obtain the feature image.
In one embodiment, the first processing unit comprises:
the extraction subunit is used for extracting the features of the feature image by using the feature pyramid enhancement module;
the first processing subunit is configured to perform convolution operation on the feature image after feature extraction by using a convolution neural network to obtain a probability image of a text region of the text image, where the probability image of the text region includes a probability that a pixel point in the text image belongs to the text region.
In one embodiment, the first processing subunit is configured to:
performing up-sampling operation on the feature image after feature extraction, and performing series operation on the image obtained by the up-sampling operation;
and carrying out convolution operation and deconvolution operation on the images obtained by the series operation by utilizing a convolution neural network to obtain a probability image of a text region of the text image.
In one embodiment, the second processing unit comprises:
the second processing subunit is used for processing the feature image by using the feature pyramid network to obtain a feature vector;
and the third processing subunit is used for processing the feature vector by using the sequence model to obtain a key point coordinate vector of the text region in the text image.
In one embodiment, the second processing subunit is configured to:
performing pooling operation on the feature images by using a feature pyramid network;
and performing a series (concatenation) operation on the pooled feature images to obtain a feature vector.
In one embodiment, the third processing subunit is configured to:
inputting the feature vector into the sequence model and outputting a low-dimensional representation vector of the high-dimensional vector of the key point coordinates of the text region in the text image;
and restoring the low-dimensional representation vector to the higher dimension by using a principal component analysis algorithm to obtain the key point coordinate vector of the text region in the text image.
In one embodiment, the determining unit comprises:
the fourth processing subunit is used for obtaining the pixel points belonging to the text region in the text image according to the probability that the pixel points belong to the text region;
and the matching subunit is used for taking the text region corresponding to the key point position of the text region as the text region detection result of the text image under the condition that the key point position of the text region is matched with the pixel point belonging to the text region.
In one embodiment, the fourth processing subunit is configured to:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
In one embodiment, the matching subunit is configured to determine that the keypoint location of the text region matches a pixel point belonging to the text region if at least one of:
the coordinates of the key point position of the text region are the same as the coordinates of a pixel point belonging to the text region;
a preset neighborhood centered on the key point position of the text region contains pixel points belonging to the text region.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, wherein the memory and the processor communicate with each other via an internal connection path; the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory, so as to perform the method of any of the above aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects of the above technical solution include at least the following: the text detection speed is increased while the anti-interference capability of the text detection method is improved, so that the text detection method is more robust.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flow chart of a text detection method according to an embodiment of the present application;
FIG. 2 is a flow chart of processing steps of a text detection method according to another embodiment of the present application;
FIG. 3 is a flow chart of processing steps of a text detection method according to another embodiment of the present application;
FIG. 4 is a flow chart of a key point identification step of a text detection method according to another embodiment of the present application;
FIG. 5 is a flow chart of a keypoint identification step of a text detection method according to another embodiment of the present application;
FIG. 6 is a flow chart of a key point identification step of a text detection method according to another embodiment of the present application;
FIG. 7 is a diagram illustrating key points of a text region in a text detection method according to another embodiment of the present application;
FIG. 8 is a flowchart of post-processing operations of a text detection method according to another embodiment of the present application;
FIG. 9 is a flow diagram of a text detection method according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of a text detection apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a first processing unit of a text detection apparatus according to an embodiment of the present application;
FIG. 12 is a diagram illustrating a second processing unit of the text detection apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a determination unit of a text detection apparatus according to an embodiment of the present application;
FIG. 14 is a block diagram of an electronic device used to implement embodiments of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 is a flowchart of a text detection method according to an embodiment of the present application. As shown in fig. 1, the text detection method may include:
step S110, performing feature extraction on the text image to obtain a feature image;
step S120, processing the feature image by using a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region;
step S130, processing the feature image by using a sequence model to obtain the key point positions of text regions in the text image;
step S140, determining a text region detection result of the text image according to the probability that the pixel point belongs to the text region and the key point position of the text region.
Text detection locates the positions of text lines or characters in an image. Text detection methods in the related art have the drawbacks that the amount of computation is too large, a large amount of computing resources and time are consumed, and the speed requirements of practical application scenarios cannot be met. For example, currently popular methods include sliding-window-based text detection and connected-component-based detection. A sliding-window-based method needs to set a large number of anchor boxes of different aspect ratios and sizes, slide them as windows over the image or over the feature map obtained by convolution, and traverse the resulting feature map, making a classification decision for each searched region box as to whether it contains text. Such methods involve a large amount of computation, consume substantial computing resources, and are time-consuming. Connected-component-based methods, also called segmentation-based methods, first extract image features with a fully convolutional neural network model, then binarize the feature map and compute its connected components, and finally judge the text line positions with specific heuristics chosen for the application scenario (that is, for different training data sets). The drawback of such methods is that the post-processing steps are cumbersome, involving a large amount of computation and tuning; they not only consume a great deal of time, but the performance of the whole algorithm also depends strictly on whether the post-processing strategy is reasonable and effective.
In the embodiment of the present application, in step S110, feature extraction is performed on the text image to be detected to obtain a feature image. The feature image obtained in step S110 is then processed along two lines to obtain information about the text regions in the text image. On the one hand, in step S120, the feature image is processed by a convolutional neural network to obtain a probability image of the true text regions; the probability image contains the probability that each pixel point in the text image belongs to a text region. On the other hand, in step S130, the feature image is processed by a sequence model, for example a long short-term memory network, to obtain the positions of the key points that mark the text regions in the text image. In one example, a text region in the text image is a rectangular text box, and the four vertices of the rectangle can be used as the key points of the text region.
In step S140, a post-processing operation may be performed on the network outputs of steps S120 and S130 to determine the text region detection result of the text image. For example, the post-processing operation may include: judging, according to the probability obtained in step S120 that the pixel points in the text image belong to a text region, whether the key points of the text regions obtained in step S130 belong to a text region, that is, whether the key points lie inside a text box. In one example, if all the key points of a certain text region obtained in step S130 belong to the text region, the text region is determined to be the final correct detection result.
In the above processing flow, step S120 and step S130 may be executed sequentially or in parallel. In the case that step S120 and step S130 are executed sequentially, the execution order is not limited in the embodiment of the present application, and step S120 may be executed first and then step S130 may be executed, or step S130 may be executed first and then step S120 may be executed.
The backgrounds of text images of natural scenes are extremely diverse; for example, complex interfering textures may exist near a character area, or textures resembling characters may exist in non-character areas. The strength of the anti-interference capability is therefore an important performance indicator of a text detection method. In the embodiment of the present application, two kinds of information about the text regions are obtained first: the probability that each pixel point belongs to a text region is obtained in step S120, and the key point positions of the text regions are obtained in step S130. These two kinds of information are then combined in the post-processing operation: whether the key point positions of a text region are accurately identified is judged according to the probability that the pixel points belong to the text region, which greatly improves the accuracy and anti-interference capability of text detection and makes it more robust. In addition, the post-processing operation is simple, easy to implement, and computationally light, which increases the text detection speed.
In one embodiment, step S110 in fig. 1, performing feature extraction on the text image to obtain a feature image, may include:
performing a convolution operation on the text image by using a residual neural network to obtain the feature image.
In the embodiment of the present application, the text image to be detected can be input into a residual neural network (ResNet) model to obtain the feature image. Residual neural network models are used in fields such as object classification and can serve as part of the backbone network of a computer vision task. Residual neural network models include resnet18, resnet50, resnet101, and so on, where resnet18 denotes a residual neural network whose hidden layers number 18. In one example, the Resnet18 network model may be used as the base network model: a convolution operation is performed on the input text image, and features are extracted through the convolution operation to obtain the feature image.
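As an illustration, a minimal sketch of this feature-extraction step in Python/PyTorch is given below. The use of torchvision's resnet18 and the choice to tap the four residual stages (at 1/4, 1/8, 1/16, and 1/32 of the input size) are assumptions made for the sketch, not details fixed by the embodiment.

import torch
import torchvision

# A minimal sketch of the feature-extraction step, assuming a standard
# torchvision ResNet-18 backbone (torchvision >= 0.13 API). Tapping the four
# residual stages, at 1/4, 1/8, 1/16 and 1/32 of the input size, is an
# assumption made for illustration.
class ResnetBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # forward first..fourth groups of feature maps
        return feats

backbone = ResnetBackbone()
feats = backbone(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in feats])  # strides 4, 8, 16, 32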
Fig. 2 is a flowchart of processing steps of a text detection method according to another embodiment of the present application. As shown in fig. 2, in an embodiment, in step S120 in fig. 1, processing the feature image by using a convolutional neural network to obtain a probability that a pixel point in the text image belongs to the text region may specifically include:
step S210, a characteristic pyramid enhancement module is used for extracting characteristics of the characteristic image;
step S220, performing convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image, where the probability image of the text region includes a probability that a pixel point in the text image belongs to the text region.
In the embodiment of the present application, before the feature image is processed by the convolutional neural network, the feature image extracted in step S110 may be processed in step S210 by a Feature Pyramid Enhancement Module (FPEM) to obtain more detailed feature information. One or more FPEM modules can be used to process the feature image, and the exact number can be determined by the actual application scenario. For example, the number of FPEM passes may be chosen according to the characteristics of the text images to be detected, and the number that achieves the best feature-extraction effect can be found by experiment.
In one example, 2 FPEM modules may be used to process the feature image. Each FPEM module performs the same processing, specifically processing the 4 groups of multi-channel feature maps of different sizes obtained in the previous step. These 4 groups of feature maps may be referred to, in order, as: the forward first group of feature maps, the forward second group, the forward third group, and the forward fourth group.
For example, in step S110 the text image is processed by the residual neural network to obtain the feature image. Among the feature maps output by the residual neural network model, the outputs from front to back are: the forward first group of feature maps, the forward second group, the forward third group, and the forward fourth group.
In the 1st FPEM module, the forward fourth group of feature maps is first up-sampled by a factor of 2, i.e., its size is expanded to twice the original. The up-sampled forward fourth group is then added point by point, channel by channel, to the forward third group. After one depthwise separable convolution operation on the summed result, one or more further convolution, batch normalization, and activation-function operations are performed. Through these operations further features are extracted, and the final result is called the reverse second group of feature maps. The depthwise separable convolution operation may include: first performing a standard convolution on the feature map of each channel separately, and then fusing the information of the channels with a 1 × 1 convolution kernel. Using depthwise separable convolutions reduces the amount of computation and thus further increases the text detection speed.
Summarizing the above steps, the operation that produces the reverse second group of feature maps from the forward fourth and forward third groups comprises: up-sampling the forward fourth group by a factor of 2; adding the up-sampled forward fourth group to the forward third group point by point, channel by channel; and, after one depthwise separable convolution on the sum, performing one or more convolution, batch normalization, and activation-function operations. The result obtained is called the reverse second group of feature maps.
In the 1st FPEM module, the same operation is then applied to the reverse second group and the forward second group of feature maps to obtain the reverse third group, and to the reverse third group and the forward first group to obtain the reverse fourth group. Meanwhile, the forward fourth group is treated as the reverse first group. Four groups of reverse feature maps are thus obtained.
In the 1st FPEM module, the reverse fourth group of feature maps is then taken as the target first group, and the target first group is down-sampled by a factor of 2, i.e., its size is reduced to half the original. The down-sampled target first group and the reverse third group are then added point by point, channel by channel. After one depthwise separable convolution on the sum, one or more convolution, batch normalization, and activation-function operations are performed, and the result is called the target second group of feature maps. The same operation is applied to the target second group and the reverse second group to obtain the target third group, and to the target third group and the reverse first group to obtain the target fourth group. The target first, second, third, and fourth groups of feature maps are taken as the output of the 1st FPEM module.
In the 2nd FPEM module, the output of the 1st FPEM module is used as the input, and the same operations as in the 1st FPEM module are performed to obtain the four groups of feature maps output by the 2nd FPEM module. When the feature image is processed with 2 FPEM modules, the output of the 2nd FPEM module serves as the result of the feature extraction performed by the feature pyramid enhancement module in step S210. Then, in step S220, the probability image of the text region of the text image is obtained by using the convolutional neural network.
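The following condensed sketch illustrates one FPEM module as described above. It assumes all four groups of feature maps share one channel count (128 here) and that the 2× down-sampling is done by max pooling; both are assumptions, as the embodiment does not fix these details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPEM(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        def block():
            # depthwise separable convolution (per-channel 3x3 conv + 1x1
            # channel fusion) followed by batch normalization and activation
            return nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                nn.Conv2d(ch, ch, 1),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            )
        self.up_blocks = nn.ModuleList([block() for _ in range(3)])
        self.down_blocks = nn.ModuleList([block() for _ in range(3)])

    def forward(self, f1, f2, f3, f4):
        # f1 is the forward first group (1/4 size), f4 the forward fourth (1/32)
        # up-scale enhancement: upsample the deeper result by 2, add point by
        # point channel-wise, then convolve; f4 doubles as the reverse first group
        r3 = self.up_blocks[0](f3 + F.interpolate(f4, scale_factor=2))
        r2 = self.up_blocks[1](f2 + F.interpolate(r3, scale_factor=2))
        r1 = self.up_blocks[2](f1 + F.interpolate(r2, scale_factor=2))
        # down-scale enhancement: downsample the shallower result by 2, add,
        # then convolve; r1 doubles as the target first group
        t2 = self.down_blocks[0](F.max_pool2d(r1, 2) + r2)
        t3 = self.down_blocks[1](F.max_pool2d(t2, 2) + r3)
        t4 = self.down_blocks[2](F.max_pool2d(t3, 2) + f4)
        return r1, t2, t3, t4  # the four groups output by the module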
Fig. 3 is a flowchart of processing steps of a text detection method according to another embodiment of the present application. As shown in fig. 3, in an embodiment, in step S220 in fig. 2, performing a convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image, which may specifically include:
step S310, carrying out up-sampling operation on the feature image after feature extraction, and carrying out series operation on the image obtained by the up-sampling operation;
and step S320, carrying out convolution operation and deconvolution operation on the images obtained by the series connection operation by using a convolution neural network to obtain the probability image of the text region of the text image.
In step S310, an up-sampling operation may be performed on all the feature images processed in step S210; for example, all four groups of feature maps output by the 2nd FPEM module may be up-sampled to 1/4 of the original text image size. These four 1/4-size feature images are then concatenated to form one image: in the series operation, the four feature images are stacked, each serving as one group of channels, so that a four-channel image is formed. In one example, if the up-sampling yields four feature images of size 128 × 128 × 1, the series operation forms a 128 × 128 × 4 image, where "128 × 128" means the width and height of the image are both 128 pixels, and "1" and "4" are the numbers of channels, i.e., the feature dimensions of the image.
In step S320, features may be further extracted from the image resulting from the series operation. For example, one convolution operation and two deconvolution operations may be performed on it to obtain a 1-channel feature image of the same size as the original text image, that is, the probability image of the text region of the text image. The probability image of the text region contains the probability that each pixel point in the text image belongs to the text region. The convolution and deconvolution operations further extract image features; the deconvolution operation also restores the image, recovering information from before the convolution so that more image information is retained.
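A minimal sketch of steps S310 and S320 follows; the channel counts and the use of ReLU/sigmoid activations are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A sketch of steps S310/S320: the four enhanced groups of feature maps are
# upsampled to 1/4 of the original image size, concatenated (series
# operation), then passed through one convolution and two deconvolutions,
# each deconvolution restoring a factor of 2, to reach full resolution.
class ProbabilityHead(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.conv = nn.Conv2d(4 * ch, ch, 3, padding=1)
        self.deconv1 = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.deconv2 = nn.ConvTranspose2d(ch, 1, 2, stride=2)

    def forward(self, feats):                      # feats: four maps, largest first
        size = feats[0].shape[-2:]                 # 1/4 of the original image
        ups = [F.interpolate(f, size=size) for f in feats]
        x = torch.cat(ups, dim=1)                  # series operation -> 4*ch channels
        x = F.relu(self.conv(x))
        x = F.relu(self.deconv1(x))                # 1/4 -> 1/2
        return torch.sigmoid(self.deconv2(x))      # 1/2 -> full size, 1 channel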
Referring to fig. 1, in the embodiment of the present application, feature extraction is first performed on the text image to be detected in step S110 to obtain a feature image. The subsequent processing flow is divided into two branches: the first branch executes step S120, in which the feature image is processed by the convolutional neural network to obtain the probability image of the true text regions; the second branch executes step S130, in which the feature image is processed by the sequence model to obtain the positions of the key points marking the text regions in the text image.
Fig. 4 is a flowchart of a key point identification step of a text detection method according to another embodiment of the present application. As shown in fig. 4, in an embodiment, in step S130 in fig. 1, processing the feature image by using the sequence model to obtain the key point positions of the text regions in the text image may specifically include:
step S410, processing the feature image by using a feature pyramid network to obtain a feature vector;
step S420, processing the feature vector by using the sequence model to obtain a key point coordinate vector of a text region in the text image.
The feature pyramid network can solve the multi-scale problem in text detection. The feature map used for prediction in each layer of the network fuses features of different resolutions, and the fused feature maps of different resolutions are used for text detection at the corresponding resolutions. Using a feature pyramid network improves text detection performance without adding extra time or computation.
Fig. 5 is a flowchart of a key point identification step of a text detection method according to another embodiment of the present application. As shown in fig. 5, in an embodiment, step S410 in fig. 4, processing the feature image by using the feature pyramid network to obtain a feature vector includes:
step S510, performing pooling operation on the feature images by using a feature pyramid network;
and step S520, performing a series operation on the pooled feature images to obtain a feature vector.
In step S510, the feature image extracted in step S110 may be pooled by using a feature pyramid network; the pooling operation reduces the dimensionality of the features. In one example, four groups of feature maps are obtained after the text image to be detected passes through the Resnet18 network. Each group has 128 channels, and their sizes are 1/4, 1/8, 1/16, and 1/32 of the original text image. The pooling of these four groups may specifically include: pooling the 1/32-size feature map with a 2 × 2 window; pooling the 1/16-size feature map with a 4 × 4 window; pooling the 1/8-size feature map with an 8 × 8 window; and pooling the 1/4-size feature map with a 16 × 16 window, where "2 × 2", "4 × 4", "8 × 8", and "16 × 16" denote the pooling window sizes in pixels. These pooling operations may be performed in parallel, and every feature map obtained after pooling is 1/64 the size of the original text image.
In step S520, the feature maps obtained after the pooling operation are concatenated. For example, the four 1/64-size feature images obtained above may be stacked, each serving as one group of channels, so that a four-channel image is formed after the series operation. The four-channel image is then flattened into a one-dimensional tensor in channel order; this one-dimensional tensor is the feature vector finally obtained in step S520.
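A short sketch of the pooling and series operations follows, under the example figures above. Whether max or average pooling is used is not specified, so max pooling is an assumption here.

import torch
import torch.nn.functional as F

# A sketch of steps S510/S520: four groups of feature maps at 1/4, 1/8, 1/16
# and 1/32 of the original image are pooled with 16x16, 8x8, 4x4 and 2x2
# windows respectively, so each ends up at 1/64 of the original image, then
# concatenated and flattened in channel order.
def pyramid_pool_to_vector(f4, f8, f16, f32):
    pooled = [
        F.max_pool2d(f4, 16),   # 1/4  -> 1/64
        F.max_pool2d(f8, 8),    # 1/8  -> 1/64
        F.max_pool2d(f16, 4),   # 1/16 -> 1/64
        F.max_pool2d(f32, 2),   # 1/32 -> 1/64
    ]
    x = torch.cat(pooled, dim=1)     # series operation over the channel axis
    return x.flatten(start_dim=1)    # one-dimensional tensor per image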
Fig. 6 is a flowchart of a key point identification step of a text detection method according to another embodiment of the present application. As shown in fig. 6, in an embodiment, in step S420 in fig. 4, the processing the feature vector by using the sequence model to obtain a coordinate vector of a key point of a text region in the text image may specifically include:
step S610, inputting the feature vector into the sequence model, and outputting a low-dimensional representation vector of the high-dimensional vector of the key point coordinates of a text region in the text image;
and step S620, restoring the low-dimensional representation vector to the higher dimension by using a principal component analysis algorithm to obtain the key point coordinate vector of the text region in the text image.
A model whose input or output contains sequence data is called a sequence model. For example, the long short-term memory network (LSTM), a variant of the recurrent neural network, is a sequence model. A recurrent neural network is a model with short-term memory: its neurons can receive information from other neurons as well as from themselves, and its parameters can be learned by the backpropagation-through-time algorithm, which propagates error information backward step by step in time order. When the input sequence is long, a recurrent neural network may suffer from exploding or vanishing gradients. To address this problem, the recurrent neural network can be modified, for example by introducing a gating mechanism; the LSTM is a recurrent neural network based on a gating mechanism.
In the embodiment of the present application, the sequence model allows the information of the individual text regions in the text image to be combined with one another, so that a better detection effect is obtained and the accuracy of text detection is improved.
In one example, the one-dimensional tensor obtained in step S520, i.e., the feature vector, may be processed with a single-layer LSTM. The number of time steps of the LSTM may be set according to the length of the one-dimensional tensor and the fixed input length per time step of the LSTM. For example, if the one-dimensional tensor input to the LSTM has length 1500 and the input length per time step is 10 (that is, the values of 10 elements of the feature vector are fed into the LSTM at each time step), the number of time steps is 150. In the LSTM network, the input at each time step corresponds to one output result, and each output result corresponds to one text region in the text image, for example a rectangular text box. A dense text image can generally contain over a hundred text boxes, so in the embodiment of the present application the number of time steps may be set to around 150.
In step S610, the one-dimensional tensor obtained in step S520 is used as the input of the LSTM network. The output at each time step of the LSTM represents the predicted key point coordinates of one text region. In one example, the text region is a rectangular text box, and the key point coordinates are the position coordinates of the 4 vertices of the rectangle in the text image. In another example, the text region is a polygonal text box represented by 14 vertices in the LSTM output, and the key point coordinates are the position coordinates of those 14 vertices in the text image.
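The following sketch illustrates the sequence branch under the example figures above (input length 1500, 10 values per time step, 150 time steps); the hidden size and the 2-dimensional output are illustrative assumptions.

import torch
import torch.nn as nn

# A sketch of the sequence branch: the length-1500 feature vector is sliced
# into 150 time steps of 10 values each and fed to a single-layer LSTM; each
# time step's output is projected to a low-dimensional keypoint
# representation. Hidden size 256 and low_dim = 2 are assumptions.
class KeypointLSTM(nn.Module):
    def __init__(self, step_len=10, hidden=256, low_dim=2):
        super().__init__()
        self.step_len = step_len
        self.lstm = nn.LSTM(step_len, hidden, num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden, low_dim)

    def forward(self, vec):                                # vec: (batch, 1500)
        steps = vec.view(vec.size(0), -1, self.step_len)   # (batch, 150, 10)
        out, _ = self.lstm(steps)                          # one output per time step
        return self.proj(out)                              # (batch, 150, low_dim)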
Fig. 7 is a schematic diagram of the key points of a text region in a text detection method according to another embodiment of the present application. The rectangular text box shown in fig. 7 has 4 vertices with coordinates (5, 10), (10, 10), (5, 20), and (10, 20). In this coordinate representation, the 4 vertices require 8 numerical values, and the vector composed of these 8 values is called the high-dimensional vector of the key point coordinates of the text region. The high-dimensional vector can be converted into a corresponding low-dimensional representation vector by dimension reduction; conversely, the low-dimensional representation vector can be restored to the corresponding high-dimensional vector by raising the dimension. In step S610, the output of the LSTM network is a low-dimensional representation vector of the high-dimensional vector of the key point coordinates of a text region in the text image. For example, the low-dimensional representation vector for the rectangular text box shown in fig. 7 may be (4.26, 3.18).
In step S620, the low-dimensional representation vector is restored to the higher dimension using a principal component analysis algorithm. In the above example, the low-dimensional representation vector (4.26, 3.18) can be restored to obtain the key point coordinate vector of the text region in the text image shown in fig. 7, that is, the coordinates (5, 10), (10, 10), (5, 20), and (10, 20) of the 4 vertices of the rectangular text box.
The principal component analysis (PCA) algorithm can be used both to reduce and to restore the dimensionality of data. In dimension reduction, the algorithm processes all the original variables and removes redundancy: of several closely related variables, one may be kept and the other redundant ones deleted. Through this processing, as few new variables as possible are established such that they are pairwise uncorrelated while reflecting the information of the image and preserving as much of the original information as possible. That is, principal component analysis for dimension reduction is a statistical method that recombines the original variables into a set of new, mutually independent comprehensive variables, from which a smaller number of combined variables can be extracted according to actual needs so as to reflect as much of the information of the original variables as possible.
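As an illustration, the dimension reduction and restoration pair can be sketched with scikit-learn's PCA; fitting on the annotated keypoint vectors and the 2-component setting (mirroring the (4.26, 3.18) example above) are assumptions.

import numpy as np
from sklearn.decomposition import PCA

# A sketch of the dimension reduction / restoration pair. The PCA model is
# fitted on annotated keypoint vectors (8 values per rectangular box);
# transform() gives the low-dimensional representation used as the LSTM
# training target, and inverse_transform() restores predicted vectors to
# keypoint coordinates. The random data is a stand-in for real annotations.
boxes = np.random.rand(1000, 8) * 100      # stand-in for annotated box vertices
pca = PCA(n_components=2).fit(boxes)

low = pca.transform(boxes[:1])             # dimension reduction: shape (1, 2)
restored = pca.inverse_transform(low)      # dimension restoration: shape (1, 8)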
In summary, referring to fig. 1 to 7, in the embodiment of the present application the processing flow after the feature image is obtained in step S110 splits into two branches. The first branch executes step S120 and outputs the 1-channel probability image of the true text regions. The second branch executes step S130 and outputs continuous sequence information, in which each output result represents a low-dimensional representation of the high-dimensional vector of the key point coordinates of a text region; the positions of the key points marking the text regions in the text image are then obtained by dimension restoration.
When the model of the second branch is trained in advance, the principal component analysis algorithm may be used to reduce the dimensionality of the annotation information. For example, the annotation of the key point positions of a text region comprises the coordinates (5, 10), (10, 10), (5, 20), and (10, 20) of the 4 vertices of a rectangular text box in the text image. Before training, this annotation is reduced in dimension with the principal component analysis algorithm to obtain the low-dimensional representation vector (4.26, 3.18). The LSTM model of the second branch is then trained and optimized with the dimension-reduced annotations.
In the model training stage, on the one hand, the intersection-over-union value can be used as the basis of the target loss function to train and optimize the probability image of the true text region output by the first branch. Specifically, the Dice coefficient loss function (Dice Loss) may be used to calculate the training loss. In one example, the output result of the first branch is optimized by adopting the standard form of the Dice loss as the target loss function:
L_dice = 1 - (2 · Σ_i p_i · g_i) / (Σ_i p_i² + Σ_i g_i²)
where L_dice denotes the loss value of the first branch, p_i denotes the probability that the i-th pixel belongs to a text region, and g_i denotes the ground truth (correct reference value) of the text region. The ground truth is a binary image in which text pixels are 1 and non-text pixels are 0; text pixels are the pixel points belonging to the text region, and non-text pixels are the pixel points not belonging to it.
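As an illustration, a minimal sketch of this loss in Python/PyTorch is given below, following the formula above; the epsilon smoothing term is an added assumption to avoid division by zero.

import torch

# A sketch of the Dice loss of the first branch: p is the predicted
# probability image, g the binary ground truth. The eps smoothing term is an
# assumption added for numerical stability.
def dice_loss(p: torch.Tensor, g: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    p, g = p.flatten(), g.flatten()
    intersection = (p * g).sum()
    return 1 - (2 * intersection + eps) / ((p * p).sum() + (g * g).sum() + eps)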
On the other hand, the coordinate vector of the key point of the text region output by the second branch can be trained and optimized by adopting a mean square error loss function.
In one example, the model training of the above two aspects may be treated as a multi-task training process throughout the model training phase. That is, the loss functions of the two branches are optimized individually, instead of optimizing the two branches as a whole.
FIG. 8 is a flowchart of post-processing operations of a text detection method according to another embodiment of the present application. As shown in fig. 8, in an embodiment, in step S140 in fig. 1, determining a text region detection result of the text image according to the probability that the pixel point belongs to the text region and the key point position of the text region may specifically include:
step S710, obtaining pixel points belonging to the text region in the text image according to the probability that the pixel points belong to the text region;
step S720, under the condition that the key point position of the text area is matched with the pixel point belonging to the text area, the text area corresponding to the key point position of the text area is used as the text area detection result of the text image.
In this embodiment, in step S710, the pixel points belonging to the text regions in the text image may be obtained according to the probability image of the true text regions output by the first branch. Then, in step S720, the key point coordinates of the text regions output by the second branch are matched against the pixel points belonging to the text regions, and the text region corresponding to the successfully matched key point positions is taken as the text region detection result of the text image.
In an embodiment, in step S710, obtaining a pixel point belonging to a text region in a text image according to a probability that the pixel point belongs to the text region may specifically include:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
In this embodiment, the probability image of the true text region output by the first branch is first binarized to obtain a binary image of the text region. The binarization operation may include processing the probability value of each pixel point in the probability image: probability values greater than or equal to a certain probability threshold are set to the maximum value, and values below the threshold are set to the minimum value. In the binary image of the text region, pixel points with the maximum value can be determined to belong to the text region, and pixel points with the minimum value to not belong to it. Therefore, the pixel points belonging to the text regions in the text image can be obtained from the binary image of the text region.
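A minimal sketch of the binarization step follows; the 0.5 threshold is an assumption, since the embodiment leaves the probability threshold open.

import numpy as np

# A sketch of the binarization step; the 0.5 threshold is an assumption.
def binarize(prob_image, threshold=0.5):
    # probability values at or above the threshold become 1 (text pixels),
    # values below it become 0 (non-text pixels)
    return (prob_image >= threshold).astype(np.uint8)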
In one embodiment, the matching of the key point positions of the text region with the pixel points belonging to the text region includes at least one of:
the coordinates of the key point position of the text region are the same as the coordinates of a pixel point belonging to the text region;
a preset neighborhood centered on the key point position of the text region contains pixel points belonging to the text region.
In one example, the key point positions of the text regions obtained in step S130 may be matched against the pixel points belonging to the text regions in the binary image. For example, if the key point coordinate vector restored in step S620 contains the coordinates (5, 10), (10, 10), (5, 20), and (10, 20) of the 4 vertices of a text box, each of the 4 vertices is checked against the binary image of the text region to determine whether it is a pixel point belonging to the text region. If all 4 vertices are pixel points belonging to the text region in the binary image, the text region corresponding to these 4 vertices is determined to be a correct detection result. If at least one of the 4 vertices is not a pixel point belonging to the text region in the binary image, the text region corresponding to the 4 vertices is determined not to be a correct detection result, that is, the detection of that text region is inaccurate.
In another example, a certain error tolerance may be set when matching the key point positions of a text region against the pixel points belonging to the text region. For example, the radius of the error range may be set to r = 3: a circular area of radius 3 centered on the key point position is taken as the preset neighborhood. If a pixel point belonging to the text region is detected inside this circular area, the key point position is determined to match successfully. If all the coordinate points obtained by dimension restoration from a low-dimensional representation vector output by the LSTM match successfully, the text region corresponding to that low-dimensional representation vector is determined to be a correct detection result. If at least one coordinate point obtained from the low-dimensional representation vector fails to match, the detection result of the corresponding text region is determined to be inaccurate.
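The following sketch illustrates this matching rule; r = 3 follows the example above, and the rasterization of the circular neighborhood is an implementation assumption.

import numpy as np

# A sketch of the matching rule: a key point matches if any text pixel of the
# binary image lies inside a circular neighborhood of radius r around it, and
# a text box is accepted only if every one of its key points matches.
def keypoint_matches(binary, x, y, r=3):
    h, w = binary.shape
    y0, y1 = max(0, y - r), min(h, y + r + 1)
    x0, x1 = max(0, x - r), min(w, x + r + 1)
    ys, xs = np.ogrid[y0:y1, x0:x1]
    inside = (ys - y) ** 2 + (xs - x) ** 2 <= r * r   # circular neighborhood
    return bool(binary[y0:y1, x0:x1][inside].any())

def box_is_valid(binary, keypoints, r=3):
    # keypoints: iterable of (x, y) pairs; binary is indexed [row, col] = [y, x]
    return all(keypoint_matches(binary, int(x), int(y), r) for x, y in keypoints)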
In practical application scenarios with very dense text, the speed of text detection methods in the related art is greatly affected by the number of text boxes. For example, in an arithmetic exercise book for primary school students, there may be over a hundred text regions in one image. With the text detection methods of the related art, the detection speed decreases almost linearly as the number of text boxes increases, and the speed requirements of practical application scenarios cannot be met; the slowdown is mainly because the post-processing is too complex and time-consuming. In contrast, the post-processing operation of step S140 in the embodiment of the present application only needs to judge whether the key point positions are pixel points belonging to the text regions. This post-processing is simple, easy to implement, and computationally light, and can greatly increase the text detection speed, especially in application scenarios with dense text images. In addition, judging whether the key point positions of a text region are accurately identified according to the probability that the pixel points belong to the text region greatly improves the accuracy and anti-interference capability of text detection, making it more robust.
In summary, the embodiment of the present application exploits the strength of the residual neural network and the feature pyramid enhancement module in extracting image features, combines the sequence model's ability to perform temporal modeling with the principal component analysis method's ability to effectively reduce data dimensionality and accurately restore the data, and adopts a simple and efficient post-processing scheme, so that the speed and robustness of text detection are improved while the detection effect is maintained.
Fig. 9 is a flowchart of a text detection method according to another embodiment of the present application. As shown in fig. 9, an exemplary text detection method may include the following steps (a minimal code sketch of the overall flow is given after step 11):
Step 1: input the text image to be detected.
Step 2: using the Resnet18 network model as the basic network model, perform feature extraction on the input text image to obtain a first feature image.
Step 3: perform feature extraction on the first feature image using 2 FPEM modules. The first feature image extracted in step 2 is processed again by the two FPEM modules, obtaining a second feature image consisting of 4 groups of feature maps.
Step 4: up-sample all the feature maps obtained in step 3 to 1/4 of the size of the original text image, and perform a concatenation (series) operation on the images obtained by the up-sampling operation.
Step 5: perform one convolution operation and two deconvolution operations on the concatenated image obtained in step 4.
Step 6: output a feature map with 1 channel and the same size as the original image; this feature map is a probability image representing the real text region.
Step 7: convert the probability image of the real text region into a binary image of the real text region.
Step 8: perform dimensionality reduction on the first feature image extracted in step 2 using the feature pyramid pooling method to obtain a one-dimensional tensor. For the specific operation steps, refer to the corresponding description of step S510, which is not repeated here.
Step 9: use the one-dimensional tensor obtained in step 8 as the input of an LSTM network with a one-layer structure. At each time step, the LSTM network outputs a vector that is a low-dimensional representation of a high-dimensional vector representing the key point coordinates of a text box.
Step 10: output the low-dimensional representation vectors of the high-dimensional vectors representing the key point coordinates of the text boxes.
Step 11: perform dimension-raising restoration on the low-dimensional representation vectors obtained in step 10 to obtain the key point coordinate vector representing each text box. According to the binary image of the real text region obtained in step 7, determine whether each coordinate point in the key point coordinate vector belongs to the text region. If all coordinate points in the key point coordinate vector representing a text box belong to the text region, the text box is determined to be a correct detection result.
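For illustration, the following PyTorch sketch wires the two branches of steps 1-11 together. It is a simplification under stated assumptions, not the embodiment's actual network: the two FPEM modules are replaced by a single plain convolution, the feature pyramid pooling of step 8 by one global average pooling, and the layer sizes (hidden size 128, low dimension 8, at most 100 text boxes) are invented for the example. The spatial bookkeeping of steps 4-6 (up-sampling to 1/4 size before the convolution and deconvolutions) is likewise only approximated.

import torch
import torch.nn as nn
import torchvision

class TextDetectorSketch(nn.Module):
    def __init__(self, low_dim=8, max_boxes=100, hidden=128):
        super().__init__()
        # Step 2: Resnet18 (untrained here) as the basic network model.
        backbone = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        # Step 3 stand-in: one convolution in place of the two FPEM modules.
        self.fpem_like = nn.Conv2d(512, 128, 3, padding=1)
        # Steps 5-6: one convolution and two deconvolutions produce a
        # 1-channel probability image of the real text region.
        self.seg_head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2),
            nn.Sigmoid(),
        )
        # Step 8 stand-in: pool the first feature image to a one-dimensional tensor.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Steps 9-10: a one-layer LSTM emits one low-dimensional
        # representation vector per time step (one per candidate text box).
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, num_layers=1,
                            batch_first=True)
        self.to_low_dim = nn.Linear(hidden, low_dim)
        self.max_boxes = max_boxes

    def forward(self, image):
        feat = self.backbone(image)                     # first feature image
        prob_map = self.seg_head(self.fpem_like(feat))  # steps 3-6
        binary_map = (prob_map > 0.5).float()           # step 7: binarization
        vec = self.pool(feat).flatten(1)                # step 8: 1-D tensor
        seq = vec.unsqueeze(1).expand(-1, self.max_boxes, -1)
        out, _ = self.lstm(seq)                         # steps 9-10
        return prob_map, binary_map, self.to_low_dim(out)

Step 11 would then restore each low-dimensional vector to a key point coordinate vector (see the principal component analysis sketch further below) and validate it against binary_map with the matching check sketched earlier.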
Fig. 10 is a schematic structural diagram of a text detection apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus may include:
an extraction unit 100, configured to perform feature extraction on a text image to obtain a feature image;
the first processing unit 200 is configured to process the feature image by using a convolutional neural network to obtain a probability that a pixel point in the text image belongs to a text region;
the second processing unit 300 is configured to process the feature image by using the sequence model to obtain the key point positions of the text regions in the text image;
the determining unit 400 is configured to determine a text region detection result of the text image according to the probability that the pixel belongs to the text region and the position of the key point of the text region.
In one embodiment, the extraction unit 100 is configured to:
and carrying out convolution operation on the text image by utilizing a residual error neural network to obtain a characteristic image.
Fig. 11 is a schematic structural diagram of a first processing unit of a text detection apparatus according to an embodiment of the present application. As shown in fig. 11, in one embodiment, the first processing unit 200 includes:
an extracting subunit 210, configured to perform feature extraction on the feature image by using the feature pyramid enhancement module;
the first processing subunit 220 is configured to perform convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image, where the probability image of the text region includes a probability that a pixel point in the text image belongs to the text region.
In one embodiment, the first processing subunit 220 is configured to:
performing up-sampling operation on the feature image after feature extraction, and performing series operation on the image obtained by the up-sampling operation;
and carrying out convolution operation and deconvolution operation on the images obtained by the series operation by utilizing a convolution neural network to obtain a probability image of a text region of the text image.
Fig. 12 is a schematic structural diagram of a second processing unit of the text detection apparatus according to the embodiment of the present application. As shown in fig. 12, in one embodiment, the second processing unit 300 includes:
the second processing subunit 310 is configured to process the feature image by using the feature pyramid network to obtain a feature vector;
the third processing subunit 320 is configured to process the feature vector by using the sequence model to obtain a coordinate vector of a key point of the text region in the text image.
In one embodiment, the second processing subunit 310 is configured to:
performing pooling operation on the feature images by using a feature pyramid network;
and performing series operation on the characteristic images after the pooling operation to obtain characteristic vectors.
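A minimal sketch of this pooling-and-concatenation, assuming three pyramid levels (1, 2 and 4) chosen for illustration; the levels actually used by the embodiment are those described for step S510:

import torch
import torch.nn.functional as F

def pyramid_pool_to_vector(feature_image, levels=(1, 2, 4)):
    # Pool the feature image at several grid resolutions, flatten each
    # result, and concatenate them into one feature vector per sample.
    pooled = [F.adaptive_avg_pool2d(feature_image, lvl).flatten(1) for lvl in levels]
    return torch.cat(pooled, dim=1)

feat = torch.randn(2, 128, 40, 40)  # a batch of 2 feature images
vec = pyramid_pool_to_vector(feat)
print(vec.shape)  # torch.Size([2, 2688]) = 128 * (1 + 4 + 16) channels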
In one embodiment, the third processing subunit 320 is configured to:
inputting the feature vector into a sequence model, and outputting a low-dimensional representation vector of a high-dimensional vector of the key point coordinates of a text region in the text image;
and performing dimension-raising restoration on the low-dimensional representation vector by using a principal component analysis algorithm to obtain the key point coordinate vector of the text region in the text image.
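The dimension-raising restoration can be sketched with scikit-learn's PCA, assuming the transform was fitted beforehand on high-dimensional key point coordinate vectors from training data; the dimensions below (28-dimensional coordinate vectors, 8 principal components) and the random stand-in data are illustrative only:

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on training key point coordinate vectors
# (e.g. 14 key points per text region -> 28 coordinate values).
train_vectors = np.random.rand(1000, 28)      # stand-in training data
pca = PCA(n_components=8).fit(train_vectors)

# The sequence model outputs a low-dimensional representation vector ...
low_dim = pca.transform(train_vectors[:1])    # stand-in for an LSTM output
# ... and the dimension-raising restoration maps it back to the full
# key point coordinate vector of the text region.
restored = pca.inverse_transform(low_dim)
print(restored.shape)  # (1, 28)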
Fig. 13 is a schematic structural diagram of a determination unit of a text detection apparatus according to an embodiment of the present application. As shown in fig. 13, in one embodiment, the determining unit 400 includes:
a fourth processing subunit 410, configured to obtain, according to the probability that the pixel belongs to the text region, a pixel belonging to the text region in the text image;
the matching subunit 420 is configured to, in a case that the key point position of the text region matches a pixel point belonging to the text region, use the text region corresponding to the key point position of the text region as a text region detection result of the text image.
In one embodiment, the fourth processing subunit 410 is configured to:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
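As a sketch, the binarization can be a fixed threshold on the probability image; the threshold value 0.5 is assumed here, since the embodiment does not fix a value in this passage:

import numpy as np

prob_img = np.random.rand(640, 640)             # probability image of the text region
binary_img = (prob_img > 0.5).astype(np.uint8)  # binary image of the text region
# Pixel points with value 1 are taken as belonging to the text region.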
In one embodiment, the matching subunit 420 is configured to determine that the key point position of the text region matches a pixel point belonging to the text region if at least one of the following holds:
the coordinates of the key point position of the text region are the same as the coordinates of a pixel point belonging to the text region;
a preset area centered on the key point position of the text region contains pixel points belonging to the text region.
For the functions of each module in each apparatus of the embodiments of the present application, reference may be made to the corresponding description in the above method, which is not repeated here.
FIG. 14 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 14, the electronic device includes: a memory 910 and a processor 920, the memory 910 storing a computer program executable on the processor 920. The processor 920 implements the text detection method of the above embodiments when executing the computer program. There may be one or more memories 910 and one or more processors 920.
The electronic device further includes:
a communication interface 930, configured to communicate with external devices for interactive data transmission.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, they may be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 14, but this does not mean that there is only one bus or one type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor configured to call and execute instructions stored in a memory, so that a communication device in which the chip is installed executes the method provided in the embodiments of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. A general-purpose processor may be a microprocessor or any conventional processor. It should be noted that the processor may also be a processor based on the advanced RISC machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a non-volatile random access memory. The memory may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may include a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM) and direct Rambus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may also be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (22)

1. A text detection method, comprising:
performing feature extraction on the text image to obtain a feature image;
processing the characteristic image by using a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region;
processing the characteristic image by using a sequence model to obtain the position of a key point of a text region in the text image;
and determining a text region detection result of the text image according to the probability that the pixel point belongs to the text region and the position of the key point of the text region.
2. The method of claim 1, wherein extracting features of the text image to obtain a feature image comprises:
and carrying out convolution operation on the text image by utilizing a residual error neural network to obtain the characteristic image.
3. The method of claim 1, wherein processing the feature image using a convolutional neural network to obtain a probability that a pixel point in the text image belongs to a text region comprises:
extracting the features of the feature image by using a feature pyramid enhancement module;
and carrying out convolution operation on the feature image after feature extraction by utilizing a convolution neural network to obtain a probability image of a text region of the text image, wherein the probability image of the text region comprises the probability that pixel points in the text image belong to the text region.
4. The method of claim 3, wherein performing a convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image comprises:
performing up-sampling operation on the feature image after the feature extraction, and performing series operation on the image obtained by the up-sampling operation;
and carrying out convolution operation and deconvolution operation on the images obtained by the series operation by utilizing a convolution neural network to obtain a probability image of a text region of the text image.
5. The method according to any one of claims 1 to 4, wherein processing the feature image using a sequence model to obtain the key point positions of text regions in the text image comprises:
processing the characteristic image by utilizing a characteristic pyramid network to obtain a characteristic vector;
and processing the characteristic vector by using a sequence model to obtain a key point coordinate vector of a text region in the text image.
6. The method of claim 5, wherein processing the feature image using a feature pyramid network to obtain a feature vector comprises:
performing pooling operation on the feature images by using a feature pyramid network;
and performing series operation on the characteristic images after the pooling operation to obtain characteristic vectors.
7. The method of claim 5, wherein processing the feature vector using a sequence model to obtain a key point coordinate vector of a text region in the text image comprises:
inputting the characteristic vector into a sequence model, and outputting a low-dimensional expression vector of a high-dimensional vector of a key point coordinate of a text region in a text image;
and performing dimension-raising restoration on the low-dimensional expression vector by utilizing a principal component analysis algorithm to obtain a key point coordinate vector of a text region in the text image.
8. The method according to claim 3 or 4, wherein determining the text region detection result of the text image according to the probability that the pixel point belongs to the text region and the key point position of the text region comprises:
obtaining pixel points belonging to the text region in the text image according to the probability that the pixel points belong to the text region;
and under the condition that the key point positions of the text regions are matched with the pixel points belonging to the text regions, taking the text regions corresponding to the key point positions of the text regions as text region detection results of the text images.
9. The method according to claim 8, wherein obtaining the pixel point belonging to the text region in the text image according to the probability that the pixel point belongs to the text region comprises:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
10. The method of claim 8, wherein the matching of the keypoint locations of the text region to the pixel points belonging to the text region comprises at least one of:
the coordinates of the key point positions of the text regions are the same as the coordinates of the pixel points belonging to the text regions;
and pixel points belonging to the text region are included in a preset area centered on the key point position of the text region.
11. A text detection apparatus, comprising:
the extraction unit is used for extracting the characteristics of the text image to obtain a characteristic image;
the first processing unit is used for processing the characteristic image by using a convolutional neural network to obtain the probability that pixel points in the text image belong to a text region;
the second processing unit is used for processing the characteristic image by using a sequence model to obtain the position of a key point of a text region in the text image;
and the determining unit is used for determining the text region detection result of the text image according to the probability that the pixel point belongs to the text region and the key point position of the text region.
12. The apparatus of claim 11, wherein the extraction unit is configured to:
and carrying out convolution operation on the text image by utilizing a residual error neural network to obtain the characteristic image.
13. The apparatus of claim 11, wherein the first processing unit comprises:
the extraction subunit is used for extracting the features of the feature image by using a feature pyramid enhancement module;
the first processing subunit is configured to perform convolution operation on the feature image after feature extraction by using a convolutional neural network to obtain a probability image of a text region of the text image, where the probability image of the text region includes a probability that a pixel point in the text image belongs to the text region.
14. The apparatus of claim 13, wherein the first processing subunit is configured to:
performing up-sampling operation on the feature image after the feature extraction, and performing series operation on the image obtained by the up-sampling operation;
and carrying out convolution operation and deconvolution operation on the images obtained by the series operation by utilizing a convolution neural network to obtain a probability image of a text region of the text image.
15. The apparatus according to any one of claims 11 to 14, wherein the second processing unit comprises:
the second processing subunit is used for processing the characteristic image by utilizing a characteristic pyramid network to obtain a characteristic vector;
and the third processing subunit is used for processing the feature vector by using the sequence model to obtain a key point coordinate vector of a text region in the text image.
16. The apparatus of claim 15, wherein the second processing subunit is configured to:
performing pooling operation on the feature images by using a feature pyramid network;
and performing series operation on the characteristic images after the pooling operation to obtain characteristic vectors.
17. The apparatus of claim 15, wherein the third processing subunit is configured to:
inputting the characteristic vector into a sequence model, and outputting a low-dimensional expression vector of a high-dimensional vector of a key point coordinate of a text region in a text image;
and performing dimension-raising restoration on the low-dimensional expression vector by utilizing a principal component analysis algorithm to obtain a key point coordinate vector of a text region in the text image.
18. The apparatus according to claim 13 or 14, wherein the determining unit comprises:
the fourth processing subunit is used for obtaining the pixel points belonging to the text region in the text image according to the probability that the pixel points belong to the text region;
and the matching subunit is configured to, when the key point position of the text region is matched with the pixel point belonging to the text region, use the text region corresponding to the key point position of the text region as a text region detection result of the text image.
19. The apparatus of claim 18, wherein the fourth processing subunit is configured to:
carrying out binarization operation on the probability image of the text region of the text image to obtain a binary image of the text region;
and obtaining pixel points belonging to the text region in the text image according to the binary image of the text region.
20. The apparatus according to claim 18, wherein the matching subunit is configured to determine that the keypoint location of a text region matches a pixel point belonging to the text region if at least one of:
the coordinates of the key point positions of the text regions are the same as the coordinates of the pixel points belonging to the text regions;
and pixel points belonging to the text region are included in a preset area centered on the key point position of the text region.
21. An electronic device comprising a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 10.
22. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 10.