CN112287924B - Text region detection method, text region detection device, electronic equipment and computer storage medium - Google Patents


Info

Publication number
CN112287924B
Authority
CN
China
Prior art keywords
text region
model
map
probability
target
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202011546450.2A
Other languages
Chinese (zh)
Other versions
CN112287924A (en)
Inventor
杨家博 (Yang Jiabo)
秦勇 (Qin Yong)
Current Assignee (the listed assignees may be inaccurate)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd filed Critical Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202011546450.2A priority Critical patent/CN112287924B/en
Publication of CN112287924A publication Critical patent/CN112287924A/en
Application granted granted Critical
Publication of CN112287924B publication Critical patent/CN112287924B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/28: Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/045: Combinations of networks
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text region detection method and apparatus, an electronic device, and a computer storage medium. The scheme comprises the following steps: inputting a target image into a target model to obtain a plurality of channels output by the target model, the channels comprising a contracted text region probability map and a corner probability map; obtaining a probability threshold based on a gradient map corresponding to the target image and the contracted text region probability map; binarizing the contracted text region probability map and the corner probability map based on the probability threshold to obtain a contracted text region binary map and a corner binary map; and determining the text region contained in the target image based on the contracted text region binary map and the corner binary map.

Description

Text region detection method, text region detection device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a text region detection method and apparatus, an electronic device, and a computer storage medium.
Background
Text region detection has a wide range of applications and is a preliminary step in many computer vision tasks, such as image search, character recognition, identity authentication and visual navigation. Its main purpose is to locate the position of a text line or character in an image, so accurate text localization is both important and challenging.
Although current text region detection methods can detect text in images, when they are applied to dense text regions the detection speed is low and the detection quality is poor, which severely limits the efficiency of text region detection.
Disclosure of Invention
In order to solve at least one of the above problems in the prior art, embodiments of the present application provide a text region detection method, apparatus, electronic device, and computer storage medium.
In a first aspect, an embodiment of the present application provides a text region detection method, where the method includes:
inputting the target image into a target model to obtain a plurality of channels output by the target model; the plurality of channels comprise contracted text region probability maps and corner probability maps;
obtaining a probability threshold value based on the gradient map corresponding to the target image and the probability map of the contracted text region;
based on the probability threshold, carrying out binarization on the probability map of the contracted text regions and carrying out binarization on the probability map of the corner points to obtain a binary map of the contracted text regions and a binary map of the corner points;
and determining the text region contained in the target image based on the contracted text region binary map and the corner binary map.
In a second aspect, an embodiment of the present application provides a text region detection apparatus, including:
the model processing unit is used for inputting the target image into the target model to obtain a plurality of channels output by the target model; the plurality of channels comprise contracted text region probability maps and corner probability maps;
a threshold determining unit, configured to obtain a probability threshold based on the gradient map corresponding to the target image and the probability map of the contracted text region;
a binarization unit, configured to perform binarization on the probability map of the contracted text regions and perform binarization on the probability map of the corner points based on the probability threshold to obtain a binary map of the contracted text regions and a binary map of the corner points;
a text region determining unit, configured to determine the text region included in the target image based on the contracted text region binary map and the corner binary map.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
One embodiment of the above application has the following advantages or benefits: a target image is input into a target model to obtain a plurality of channels output by the target model, the channels comprising a contracted text region probability map and a corner probability map; a probability threshold is obtained based on a gradient map corresponding to the target image and the contracted text region probability map; based on the probability threshold, the contracted text region probability map and the corner probability map are binarized to obtain a contracted text region binary map and a corner binary map; and the text region contained in the target image is determined based on these two binary maps. This provides an extremely simple procedure for determining text regions: the text region of the target image can be accurately determined from the contracted text region binary map and the corner binary map, improving both the accuracy and the efficiency of text region detection.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a first flowchart illustrating a text region detection method according to an embodiment of the present disclosure;
FIG. 2 is a second flowchart illustrating a text region detection method according to an embodiment of the present application;
FIG. 3 is a third flowchart illustrating a text region detection method according to an embodiment of the present disclosure;
FIG. 4 is a fourth flowchart illustrating a text region detection method according to an embodiment of the present disclosure;
FIG. 5 is a first flowchart illustrating a method for training a target model according to an embodiment of the present disclosure;
FIG. 6 is a second flowchart illustrating a method for training a target model according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a text region detection process according to an embodiment of the present application;
FIG. 8 is a block diagram of a text region detection apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of a text region detecting apparatus according to an embodiment of the present application;
fig. 10 is a schematic diagram of a composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
The embodiment of the application provides a text region detection method, which can be applied to electronic equipment, in particular to terminal equipment or a server. As shown in fig. 1, the text region detection method includes:
s101: inputting the target image into a target model to obtain a plurality of channels output by the target model; the channels comprise a contracted text region probability map and a corner probability map;
s102: obtaining a probability threshold value based on a gradient map corresponding to the target image and a probability map of the contracted text region;
s103: based on the probability threshold, carrying out binarization on the probability map of the contracted text region and carrying out binarization on the probability map of the corner point to obtain a binary map of the contracted text region and a binary map of the corner point;
s104: and determining the text region contained in the target image based on the contracted text region binary map and the corner binary map.
In S101 above, the target image may be any image containing text, for example any text-containing image acquired by the electronic device. The electronic device may obtain the target image from an image containing text captured in advance by a camera, or may acquire it from another electronic device.
Further, the embodiment of the application is suited to processing dense text contained in the target image. Dense text may be understood as the data text in a table, or as text containing a plurality of words or formulas. The embodiment is particularly suited to target images containing long curved text, or long curved dense text; long curved text may be understood as text in which the lowest contour pixels of several consecutively adjacent characters do not lie on the same straight line.
In the embodiment of the present application, the corner probability map corresponding to a text region is a probability map of the marker points of that text region. A corner of a text region is the intersection of two adjacent edges of the region that meet at an angle. For example, if the text region is rectangular, its corners are the four points at top left, top right, bottom left and bottom right; if the text region is an irregular polygon, it may have more corners, which are not exhaustively listed here.
In the embodiment of the present application, the contracted text region probability map is a probability map of the text region shrunk in proportion according to a certain rule. It contains, for each coordinate point in the target image, the probability that the point belongs to the contracted text region.
The corner probability map contains, for each coordinate point in the target image, the probability that the point is a corner.
In contrast to approaches in which the probability threshold is set manually, this embodiment obtains the probability threshold from the gradient map corresponding to the target image and the contracted text region probability map, providing a new way of selecting the threshold. The text region contained in the target image is then determined from the contracted text region binary map and the corner binary map; the operation is simple, the text region can be accurately determined, and both the accuracy and the efficiency of text region detection are improved.
Based on the text region detection method shown in fig. 1, in some embodiments, as shown in fig. 2, S101 includes:
s101 a: inputting the target image into a first network model of the target model to obtain a feature map output by the first network model of the target model;
s101 b: inputting the feature map into a second network model in the target model to obtain a plurality of channels output by the second network model in the target model.
The first network model in the target model may specifically use a Resnet18 (deep residual network) as its base network, with the regular convolutions in the Resnet18 network replaced by pyramid convolutions.
In one example, a first network model of the target models, namely the Resnet18 network as a base network model, may include 4 residual blocks (hereinafter referred to as blocks for simplicity of description).
Specifically, the inputting the target image into a first network model of the target model to obtain a feature map output by the first network model of the target model may include:
the first network model in the target model is responsible for converting the original image into high-dimensional features, such as extracting features of textures, edges, corners, semantic information and the like from the input image. The first network model takes a classical convolutional neural network as a basic network model, and can be specifically a Resnet18 network.
In one example, the first network model in the target model, namely the Resnet18 base network, may include 4 residual blocks (hereinafter referred to as blocks for simplicity of description) connected in series. Each block comprises several convolutional and pooling layers with residual connections and halves the size of the feature map output by the previous stage, so the four blocks output feature maps at 1/4, 1/8, 1/16 and 1/32 of the original size.
The 4 blocks are connected in series after the input layer of the Resnet18 network. When the input target image is 512 × 512, the outputs of the 4 blocks are 128 × 128, 64 × 64, 32 × 32 and 16 × 16 respectively. Each group contains 128 feature maps, and the 4 groups of feature maps contain information at different scales.
In this embodiment, the number of feature maps output by each block is small; the blocks do not output hundreds or thousands of feature maps as other network models do. This connection mode makes feature transfer more effective and the model easier to train. Replacing the conventional convolutions with pyramid convolutions reduces the number of feature mappings and hence the computation, and improves both the speed and the quality of dense text detection.
After a feature map is output by a first network model of the target model, the feature map may be input into a second network model; accordingly, the processing of the second network model in the object model includes:
connecting the input feature maps in series to obtain feature mapping;
and performing convolution operation and deconvolution operation on the feature mapping once, and outputting the channel information with the size consistent with that of the target image.
Specifically: the second network model in the target model is connected to each block of the first network model; the feature maps output by the first network model are input to the second network model, which again extracts features such as texture, edges, corners and semantic information to complete the recombination of feature information. The second network model may comprise an upsampling layer and a channel-dimension attention layer. The upsampling layer adjusts the sizes of the feature maps output by the blocks, for example recombining and scaling feature maps of various scales to the same scale; the channel-dimension attention layer fuses the adjusted feature maps to obtain a multi-channel feature map.
Specifically, the second network model may be a DB (Differentiable Binarization) network, as described in "Real-time Scene Text Detection with Differentiable Binarization".
In one example, the upsampling layer of the second network model transforms all four groups of feature maps to 1/4 of the original image size by interpolation and concatenates them in series; since each block outputs 128 feature maps, this yields a group of feature maps with 512 channels in total. A convolution operation and a deconvolution operation are then applied to this 512-channel feature mapping at the channel attention layer corresponding to the channel outputs, producing the multiple channels at the same size as the target image.
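The upsample-and-concatenate step above can be sketched as follows. This is a minimal numpy illustration, not the patented implementation: nearest-neighbour repetition stands in for the unspecified interpolation, and the subsequent convolution/deconvolution at the attention layer is omitted.

```python
import numpy as np

def upsample_and_concat(feature_groups, target_hw):
    """Nearest-neighbour upsample each (C, H, W) feature group to target_hw
    and concatenate the groups along the channel axis."""
    th, tw = target_hw
    fused = []
    for f in feature_groups:
        c, h, w = f.shape
        # integer scale factors; assumes the target size is a multiple of each map size
        fused.append(np.repeat(np.repeat(f, th // h, axis=1), tw // w, axis=2))
    return np.concatenate(fused, axis=0)

# Four groups of 128 maps at 1/4, 1/8, 1/16 and 1/32 scale of a 512x512 image,
# all brought to the 1/4 scale and stacked into one 512-channel group
groups = [np.zeros((128, 512 // s, 512 // s), np.float32) for s in (4, 8, 16, 32)]
fused = upsample_and_concat(groups, (128, 128))
```

Here `fused` has shape (512, 128, 128), matching the 512-channel feature mapping described above.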
Based on the text region detection method shown in fig. 1, in some embodiments, as shown in fig. 3, S102 includes:
s102 a: performing AND operation on the gradient map corresponding to the target image and the probability map of the contracted text region to obtain the processed gradient map;
s102 b: adding the processed gradient map and the contracted text region probability map point by point to obtain the processed contracted text region probability map;
s102 c: calculating the probability average of the processed contracted text region probability map, and taking the calculated average as the probability threshold.
Obtaining the probability average from the processed contracted text region probability map and using it as the probability threshold may specifically be: summing all the probability values contained in the processed contracted text region probability map, computing their average, and taking the result as the probability threshold.
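Steps S102a to S102c can be sketched in a few lines of numpy. The exact form of the "AND" operation between a real-valued gradient map and a probability map is not spelled out in the text, so gating the gradient map by non-zero probabilities is an assumption here:

```python
import numpy as np

def probability_threshold(gradient_map, shrunk_prob_map):
    # S102a: "AND" the gradient map with the contracted text region probability
    # map -- interpreted as keeping gradient values only where the probability
    # is non-zero (assumption; the exact operation is not given).
    gated = gradient_map * (shrunk_prob_map > 0)
    # S102b: add the processed gradient map and the probability map point by point
    enhanced = gated + shrunk_prob_map
    # S102c: the probability average of the processed map is the threshold
    return float(enhanced.mean())
```

For example, a uniform gradient map of ones combined with probabilities [[0, 0.5], [0.5, 1]] yields a threshold of 1.25.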
In S103, binarizing the contracted text region probability map and the corner probability map based on the probability threshold to obtain the contracted text region binary map and the corner binary map may specifically be:
comparing the probability value of each coordinate point in the contracted text region probability map with the probability threshold, setting points greater than the threshold to 1 and all other points to 0 to obtain the contracted text region binary map; and comparing the probability value of each point in the corner probability map with the probability threshold, setting points greater than the threshold to 1 and all others to 0 to obtain the corner binary map.
Thus, in terms of threshold setting, the probability threshold is obtained from the gradient map corresponding to the target image and the contracted text region probability map. Compared with a manually set threshold, the threshold determined in this way is more accurate, which improves both the speed and the quality of dense text region detection and meets the requirements of dense-text scenarios.
Based on the text region detection method shown in fig. 1, in some embodiments the plurality of channels further include corner offsets; as shown in fig. 4, S104 includes:
s104 a: determining a first connected component of the contracted text region binary map, and obtaining the contracted text region according to the first connected component;
s104 b: determining a second connected component of the corner binary map, and correcting the part of the corner binary map corresponding to the second connected component according to the corner offsets to obtain a corrected corner binary map;
s104 c: expanding the contracted text region outward based on the corrected corner binary map to obtain the text region contained in the target image.
Each corner offset comprises the offset in the x direction and the offset in the y direction corresponding to that corner.
When the corrected corner binary map is used to expand the contracted text region, the expansion stops in each direction as soon as it touches a corner point.
Expanding the contracted text region based on the corner binary map to obtain the text region contained in the target image may proceed as follows: determine the coordinates of one or more corners from the corner binary map; expand the contracted text region outward, stopping the expansion in a direction when it touches the coordinate of a corner; the expanded region finally obtained is taken as the text region contained in the target image.
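As a toy illustration of this expansion step, the sketch below grows a shrunk box one pixel per step and stops each side once it reaches a corner coordinate. Axis-aligned integer boxes and the per-side stopping rule are simplifying assumptions; the patent expands arbitrary connected components.

```python
def expand_to_corners(box, corners, max_steps=1000):
    """Expand a shrunk box (x0, y0, x1, y1) outward one pixel at a time,
    stopping each side when it touches a corner coordinate."""
    x0, y0, x1, y1 = box
    cx = {x for x, _ in corners}   # x coordinates of detected corners
    cy = {y for _, y in corners}   # y coordinates of detected corners
    for _ in range(max_steps):
        moved = False
        if x0 not in cx: x0 -= 1; moved = True
        if x1 not in cx: x1 += 1; moved = True
        if y0 not in cy: y0 -= 1; moved = True
        if y1 not in cy: y1 += 1; moved = True
        if not moved:               # every side has reached a corner
            break
    return (x0, y0, x1, y1)
```

For instance, a shrunk box (4, 4, 6, 6) with corners at (2, 3), (8, 3), (2, 7), (8, 7) expands back to (2, 3, 8, 7).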
It should be noted that this embodiment does not limit the number of text regions the target image may contain; in practice there may be one or more text regions, and each is processed in the same way as described above, which is not repeated here.
Compared with existing text region detection methods, whose post-processing is very complex, the embodiment of the application thus provides a very simple and novel post-processing mode, which helps comprehensively improve the speed and quality of dense text region detection.
The above text region detection process is implemented based on a target model; how the target model is obtained by training is described below. The training method includes: training a preset model based on training samples to obtain a trained preset model, and taking the trained preset model as the target model; each training sample is an image in which transparency has been set for the text regions.
Setting transparency for the text regions means adding transparency to each text region in the image, i.e. marking each text region with a mask. The mask's transparency can be set through a hyper-parameter with a value between 0 and 1: when the hyper-parameter is 0, the transparency is 0 and the text region is completely occluded; when it is 1, the transparency is 1 and the text region is not occluded at all. This strengthens the position information of the text region while weakening its handwriting texture information: a region whose handwriting is blurred is certainly harder to detect than one whose handwriting is clear, so a transparency smaller than 1 weakens the texture information to a certain degree. This transparency preprocessing makes the training result more accurate, and a model trained in this way improves the quality of text region detection.
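The transparency preprocessing above amounts to alpha-blending each masked text region toward a flat mask colour. The sketch below assumes a black mask colour and a boolean region mask; both are illustrative choices, not specified in the text:

```python
import numpy as np

def mask_text_region(image, region_mask, alpha):
    """Blend the pixels of a text region toward a flat mask colour.
    alpha = 0 fully occludes the region, alpha = 1 leaves it untouched."""
    mask_colour = 0.0                      # assumption: occlude toward black
    out = image.astype(np.float32).copy()
    out[region_mask] = alpha * out[region_mask] + (1.0 - alpha) * mask_colour
    return out

# A 0.5 transparency halves the region's intensity, weakening stroke texture
img = np.full((2, 2), 100.0)
m = np.array([[True, False], [False, False]])
out = mask_text_region(img, m, 0.5)
```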
Specifically, a training sample may be any sample in a training sample set. It should be noted that training the preset model is not limited to using a single training sample, nor is it completed in a single training iteration.
For each training sample, the corresponding multi-channel information labels may be determined from the image in which transparency has been set for the text regions. The multi-channel information labels comprise the contracted text region probability map label, the corner probability map label and the corner offset labels for the image, and further comprise:
a label for the contracted text region binary map of the image;
a label for the corner binary map corresponding to the text regions of the image;
a label for the x-axis offsets of the corners of the text regions of the image;
a label for the y-axis offsets of the corners of the text regions of the image.
For example, determining the corresponding multi-channel information labels from the image with transparency set for the text regions may proceed as follows. The boundary coordinate points of a text region are determined from the boundary of the transparent area; a boundary-point binary map corresponding to the text region is built from these coordinates and used as the label of the corner binary map, with the coordinate points corresponding to corners set to 1 and all other points set to 0. The text region itself can be determined from the coverage of the transparent area; shrinking this coverage yields the contracted text region, and setting the coordinate points of the contracted region to 1 and all remaining points to 0 yields the label of the contracted text region binary map. The labels for the x-axis and y-axis offsets of the corners of the text regions may be set to default values, such as zero.
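The label construction just described can be sketched for the simplest case. The sketch assumes regions are axis-aligned boxes and that "shrinking" is a fixed one-pixel inset; the patent shrinks by a proportional rule and supports arbitrary polygons:

```python
import numpy as np

def make_labels(boxes, hw):
    """Build the contracted-region binary label, the corner binary label and
    default (zero) corner-offset labels for one image of size hw = (h, w)."""
    h, w = hw
    region = np.zeros((h, w), np.uint8)
    corner = np.zeros((h, w), np.uint8)
    off_x = np.zeros((h, w), np.float32)   # default x-offset label
    off_y = np.zeros((h, w), np.float32)   # default y-offset label
    for x0, y0, x1, y1 in boxes:
        region[y0 + 1:y1, x0 + 1:x1] = 1                      # contracted region -> 1
        for cx, cy in ((x0, y0), (x1, y0), (x0, y1), (x1, y1)):
            corner[cy, cx] = 1                                 # corner points -> 1
    return region, corner, off_x, off_y
```

For a single box (1, 1, 5, 5) on an 8 x 8 grid, this marks a 3 x 3 contracted interior and exactly four corner points.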
As shown in fig. 5, training the preset model based on the training samples includes:
step S201: inputting a training sample into a preset first network model of a preset model to obtain a characteristic diagram output by the preset first network model of the preset model;
step S202: inputting the characteristic diagram into a preset second network model of the preset model to obtain a plurality of channel information output by the preset second network model of the preset model;
step S203: determining a first-type loss function, a second-type loss function and a third-type loss function based on the plurality of channels corresponding to the training sample and the transparency set for the text region in the training sample;
step S204: updating the preset model by back-propagation according to the first-type, second-type and third-type loss functions.
The preset first network model of the preset model uses pyramid convolution to compute the feature map, and the preset second network model uses a Feature Pyramid Enhancement Module (FPEM) to process the feature map.
In some embodiments, the multi-channel output includes four channels: the first channel represents the contracted text region probability map, the second channel represents the corner probability map, and the third and fourth channels represent the corner offsets; specifically, the third channel represents the corner offset in the x direction and the fourth channel the corner offset in the y direction. The first channel corresponds to the first-type loss function, the second channel to the second-type loss function, and the third and fourth channels to the third-type loss function. The first-type loss function is the Dice loss; the second-type loss function is binary cross-entropy; the third-type loss function is the smooth L1 loss.
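The three loss families named above are all standard, so they can be sketched directly in numpy. This is a minimal reference implementation under the usual textbook definitions; the function names are illustrative, and the patent does not specify details such as the smoothing constant `eps`.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # First-type loss: Dice loss on the contracted text region probability map.
    inter = float((pred * target).sum())
    return 1.0 - (2.0 * inter + eps) / (float(pred.sum()) + float(target.sum()) + eps)

def bce_loss(pred, target, eps=1e-7):
    # Second-type loss: binary cross-entropy on the corner probability map.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def smooth_l1_loss(pred, target):
    # Third-type loss: smooth L1 on the corner offsets
    # (quadratic for |d| < 1, linear beyond).
    d = np.abs(pred - target)
    return float(np.where(d < 1.0, 0.5 * d * d, d - 0.5).mean())
```

A perfect prediction drives the Dice loss to zero, a confident correct corner probability gives a near-zero cross-entropy, and sub-pixel offset errors fall in the quadratic regime of the smooth L1 loss.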
In some embodiments, the pyramid convolution is in the format of:
I*W1*H1*(O/n) + I*W2*H2*(O/n) + … + I*Wn*Hn*(O/n);
wherein I represents the number of input feature maps, W the width of a convolution kernel, H the height of a convolution kernel, O the number of output feature maps, and n the number of pyramid convolution levels; the kernel widths satisfy W1 > W2 > … > Wn and the kernel heights satisfy H1 > H2 > … > Hn, while I*W*H*O is the format of a conventional convolution.
It should be noted that pyramid convolution replaces the conventional 3×3 convolution used by the Resnet18 network in the preset first network model. Generally speaking, a convolution operation has the format I*W*H*O, where I is the number of input feature maps, W and H are the width and height of the convolution kernel, and O is the number of output feature maps. A feature map is used to find a given pattern, and compared with natural scenes, text images contain fewer patterns: in a face image, for example, the eyes, ears and nose are all different patterns, whereas a text image essentially contains only text lines. For the task of text region detection, the number of feature maps can therefore be reduced to save computation. On the other hand, the aspect ratios of text lines vary greatly, so a large receptive field is helpful for text region detection. The present application therefore converts the conventional 3×3 convolution into a pyramid convolution: for example, an original convolution of 10*3*3*75 becomes 10*7*7*25 + 10*5*5*25 + 10*3*3*25, which resembles a pyramid shape. Before the next convolution operation, the parts obtained by the 7×7 and 5×5 kernels are enlarged by bilinear interpolation to the size of the feature map obtained by the 3×3 kernel, and the extracted feature map is then input into the preset second network model.
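The parameter arithmetic of the example decomposition can be checked directly. The short sketch below only tallies weight counts in the I*W*H*O format; the function names are illustrative, and note that the pyramid variant trades extra parameters in the larger kernels for a wider receptive field while each branch produces only O/n output maps.

```python
def conv_format(i, w, h, o):
    # Parameter count of a conventional convolution in the I*W*H*O format.
    return i * w * h * o

def pyramid_format(i, kernels, o):
    # Pyramid convolution: n parallel square kernels, each producing O/n maps.
    n = len(kernels)
    return sum(conv_format(i, k, k, o // n) for k in kernels)

conventional = conv_format(10, 3, 3, 75)      # 10*3*3*75
pyramid = pyramid_format(10, [7, 5, 3], 75)   # 10*7*7*25 + 10*5*5*25 + 10*3*3*25
```

Here `conventional` evaluates to 6750 and `pyramid` to 20750 (12250 + 6250 + 2250): the 7×7 and 5×5 branches dominate the cost, which is why their outputs are produced at reduced resolution and enlarged by bilinear interpolation afterwards.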
In some embodiments, inputting the training sample into the preset first network model of the preset model to obtain the feature map output by that model may specifically include: the preset first network model is responsible for converting the training target image into high-dimensional features, for example extracting texture, edge, corner and semantic-information features from it. The preset first network model is a classical convolutional neural network in which the conventional convolutions are replaced with pyramid convolutions; preferably, the Resnet18 network is used as the base network of the preset first network model of the preset model.
In an example, the preset first network model in the preset model is based on the Resnet18 network and may include 4 blocks connected in series. Each block comprises several convolutional and pooling layers with residual connections and halves the size of the feature map output by the previous stage; the feature maps output by the four blocks are, for example, 1/4, 1/8, 1/16 and 1/32 of the input image size, respectively.
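The scale schedule of the four blocks can be stated as a one-liner. This sketch only encodes the halving arithmetic implied by the 1/4, 1/8, 1/16, 1/32 scales quoted above (the assumption being that the first block already reduces the input by 4×); the function name is illustrative.

```python
def block_output_sizes(input_size, num_blocks=4):
    # Each block halves the previous stage's output; the first block outputs
    # 1/4 of the input, consistent with the 1/4, 1/8, 1/16, 1/32 scales.
    sizes, s = [], input_size // 4
    for _ in range(num_blocks):
        sizes.append(s)
        s //= 2
    return sizes
```

For a 640-pixel input side the four blocks output sides of 160, 80, 40 and 20 pixels.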
In some embodiments, as shown in fig. 6, step S202 includes:
step S202 a: connecting the input feature maps in series to obtain feature mapping;
step S202 b: performing one convolution operation and one deconvolution operation on the feature mapping, and outputting channels whose size is consistent with that of the target image.
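The two steps above can be sketched as a tiny numpy head. This is a shape-level illustration only: the learned convolution is omitted and the deconvolution is approximated by nearest-neighbour upsampling to the target size, both stated assumptions, and `detection_head` is a hypothetical name.

```python
import numpy as np

def detection_head(feature_groups, target_hw):
    # Step S202a: serially concatenate the feature-map groups along channels.
    fmap = np.concatenate(feature_groups, axis=0)      # (C_total, h, w)
    # Step S202b: one convolution + one deconvolution; approximated here by
    # nearest-neighbour upsampling back to the target image size.
    _, h, w = fmap.shape
    th, tw = target_hw
    return fmap.repeat(th // h, axis=1).repeat(tw // w, axis=2)
```

Four groups of 4-channel maps at 1/4 scale concatenate into a 16-channel map that is restored to the full target resolution.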
In practical application, step S202 may be implemented by FPEMs. The number of FPEMs is not strictly limited in the embodiments of the present application: the processing performed by each FPEM is the same, and the specific number may be determined according to experimental results. The number of FPEMs may be m, where m is a positive integer; for example, m = 2.
Taking the selection of 2 FPEMs as an example: the 1st FPEM processes the input feature maps and outputs the initial target first, second, third and fourth groups of feature mappings; the output of the 1st FPEM is then taken as input to the 2nd FPEM, which performs the same processing to obtain the final target first, second, third and fourth groups of feature mappings.
Processing the input feature maps with the 1st FPEM and outputting the initial target first, second, third and fourth groups of feature mappings comprises the following steps:
the 1st FPEM receives 4 groups of 4-channel feature maps of different sizes and sorts them from large to small, front to back; the sorted groups are called, in order, the forward first, second, third and fourth groups of feature maps;
first, the forward fourth group of feature maps is upsampled by a factor of 2; the upsampled fourth group is then added point by point, channel by channel, to the forward third group, a depthwise separable convolution is applied to the result, and one further round of convolution, batch normalization and activation yields the reverse second group of feature maps. The same operation applied to the reverse second group and the forward second group yields the reverse third group, and applied to the reverse third group and the forward first group yields the reverse fourth group; meanwhile, the forward fourth group is regarded as the reverse first group, giving 4 reverse groups of feature maps;
the reverse fourth group is taken as the target first group of feature mappings. The target first group is then downsampled by a factor of 2 and added point by point, channel by channel, to the reverse third group; a depthwise separable convolution is applied to the result, followed by one further round of convolution, batch normalization and activation, yielding the target second group. The same operation applied to the target second group and the reverse second group yields the target third group, and applied to the target third group and the reverse first group yields the target fourth group.
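The up-scale and down-scale phases just described can be sketched as pure scale arithmetic. In this minimal sketch the depthwise separable convolution, batch normalization and activation are deliberately omitted (assumption: only the resampling-and-addition skeleton is shown), upsampling is nearest-neighbour, and the helper names are illustrative.

```python
import numpy as np

def up2(x):    # 2x nearest-neighbour upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down2(x):  # 2x downsampling by striding
    return x[::2, ::2]

def fpem(f1, f2, f3, f4):
    """f1..f4: forward groups sorted large to small; conv/BN/activation
    between stages are omitted, leaving only the scale arithmetic."""
    # Up-scale phase: build the reverse groups.
    r1 = f4                      # forward 4th regarded as reverse 1st
    r2 = f3 + up2(f4)            # -> reverse 2nd
    r3 = f2 + up2(r2)            # -> reverse 3rd
    r4 = f1 + up2(r3)            # -> reverse 4th
    # Down-scale phase: build the target groups.
    t1 = r4                      # reverse 4th taken as target 1st
    t2 = r3 + down2(t1)          # -> target 2nd
    t3 = r2 + down2(t2)          # -> target 3rd
    t4 = r1 + down2(t3)          # -> target 4th
    return t1, t2, t3, t4
```

Running the sketch on groups of sides 8, 4, 2 and 1 confirms that the target groups come out at the same four scales as the forward inputs, which is what lets several FPEMs be stacked in series.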
In some embodiments, determining the first type loss function, the second type loss function, and the third type loss function based on the channels corresponding to the training samples and the transparency set in the text region in the training samples includes:
calculating a first-class loss function based on the probability graph of the contracted text region corresponding to the training sample and the label of the binary graph of the contracted text region corresponding to the training sample; and the number of the first and second groups,
calculating a second type loss function based on the corner probability graph corresponding to the training sample and the label of the corner binary graph corresponding to the training sample;
and calculating a third-class loss function based on the corner offset corresponding to the training sample and the label of the corner offset corresponding to the training sample.
The first Loss function is a Dice Loss function; the second type loss function is binary cross entropy; the third type of loss function is the smooth L1 loss function.
In some embodiments, the preset model is updated by back-propagation according to the first-type, second-type and third-type loss functions. Updating the preset model may specifically refer to updating its parameters; more specifically, it may mean updating the parameters of the preset first network model of the preset model and/or the parameters of the preset second network model of the preset model.
When the number of iterations of the preset model reaches a preset threshold, or an index (such as accuracy or recall) no longer changes during iterative training, training may be deemed complete; the resulting trained preset model is the target model of the embodiments of the present application.
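The two stopping conditions above can be made concrete in a few lines. This is one possible reading, not the patent's specification: "does not change any more" is interpreted as an exact plateau over a fixed number of checks, and `training_done`, `max_iters` and `patience` are hypothetical names and defaults.

```python
def training_done(iteration, metric_history, max_iters=100, patience=3):
    # Stop when the iteration budget is exhausted, or when the tracked index
    # (e.g. accuracy or recall) has been flat for `patience` + 1 checks.
    if iteration >= max_iters:
        return True
    recent = metric_history[-(patience + 1):]
    return len(recent) == patience + 1 and len(set(recent)) == 1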
The text region detection scheme of the present application performs dense text region detection with a segmentation-based method, applies transparency processing to the training data, introduces a pyramid convolution operation in the backbone network, provides an extremely simple new post-processing procedure, and improves the binarization threshold selection in post-processing, thereby comprehensively improving both the speed and the quality of dense text region detection.
Finally, the text region detection processing flow of the embodiment of the present application is described in detail with reference to fig. 7:
s301: and inputting the target image into a first network model of the target model to obtain a characteristic diagram output by the first network model of the target model.
S302: and inputting the characteristic diagram into a second network model in the target model to obtain a plurality of channel information output by the second network model in the target model.
Specifically, the processing of the second network model may include:
connecting the input feature maps in series to obtain feature mapping;
and performing convolution operation and deconvolution operation on the feature mapping once, and outputting the channel information with the size consistent with that of the target image.
The channel information comprises a contracted text region probability map, a corner probability map and a corner offset.
S303: performing an AND operation on the gradient map corresponding to the target image and the contracted text region probability map to obtain the processed gradient map; adding the processed gradient map and the contracted text region probability map point by point to obtain the processed contracted text region probability map; and calculating the average probability of the processed contracted text region probability map, using this average as the probability threshold.
S304: binarizing the contracted text region probability map based on the probability threshold to obtain the contracted text region binary map; determining the first connected domain of the contracted text region binary map and obtaining the contracted text region from it; binarizing the corner probability map based on the probability threshold to obtain the corner binary map; determining the second connected domain of the corner binary map; correcting the corner binary map according to the corner offsets to obtain the corrected corner binary map; and performing outward expansion on the contracted text region based on the corrected corner binary map to obtain the text regions contained in the target image.
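The threshold computation of step S303 and the binarization of step S304 can be sketched in numpy. One caveat, flagged as an assumption: the "AND" of a real-valued gradient map with a probability map is read here as masking the gradient map by the probability map's support; the patent does not pin down this detail, and the function names are illustrative.

```python
import numpy as np

def adaptive_threshold(prob_map, grad_map):
    # "AND" of gradient map and probability map, read here as masking the
    # gradient map by the probability map's support (assumption).
    masked_grad = grad_map * (prob_map > 0)
    fused = prob_map + masked_grad     # point-by-point addition
    return float(fused.mean())         # average probability as the threshold

def binarize(prob_map, thr):
    # Threshold the probability map into a binary map.
    return (prob_map > thr).astype(np.uint8)
```

On a toy 2×2 probability map with two confident pixels, the fused mean lands between the foreground and background values, so binarization keeps exactly the confident pixels; connected-domain extraction (e.g. via a flood fill) would then run on this binary map.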
The target image may include N text regions; the above flow is applied to each of the N text regions, so that all text regions contained in the target image are finally detected.
According to an embodiment of the present application, there is also provided a text region detecting apparatus 400, as shown in fig. 8, including:
a model processing unit 401, configured to input a target image into a target model to obtain multiple channels output by the target model; the plurality of channels comprise contracted text region probability maps and corner probability maps;
a threshold determining unit 402, configured to obtain a probability threshold based on the gradient map corresponding to the target image and the probability map of the contracted text region;
a binarization unit 403, configured to perform binarization on the intra-contracted text region probability map and perform binarization on the corner point probability map based on the probability threshold to obtain an intra-contracted text region binary map and a corner point binary map;
a text region determining unit 404, configured to determine a text region included in the target image based on the reduced text region binary map and the corner binary map.
In some embodiments, the threshold determination unit 402 is configured to:
performing AND operation on the gradient map corresponding to the target image and the probability map of the contracted text region to obtain the processed gradient map;
adding the processed gradient map and the contracted text region probability map point by point to obtain the processed contracted text region probability map;
and calculating the average probability of the processed contracted text region probability map, and using the calculated average as the probability threshold.
In some embodiments, the plurality of channels further comprises: offset of angular points;
the text region determination unit 404 is configured to:
determining a first connected domain of the contracted text region binary map, and obtaining the contracted text region according to the first connected domain;
determining a second connected domain of the corner point binary image, and correcting the corner point binary image corresponding to the second connected domain according to the corner point offset to obtain a corrected corner point binary image;
and carrying out external expansion processing on the contracted-in text region based on the corrected corner binary image to obtain a text region contained in the target image.
In some embodiments, the model processing unit 401 is configured to:
inputting the target image into a first network model of the target model to obtain a feature map output by the first network model of the target model;
and inputting the characteristic diagram into a second network model in the target model to obtain a plurality of channel information output by the second network model in the target model.
In some embodiments, the model processing unit 401 is configured to:
connecting the input feature maps in series to obtain feature mapping;
and performing convolution operation and deconvolution operation on the feature map once, and outputting the channels with the sizes consistent with the target image.
In some embodiments, the apparatus further comprises:
a model training unit 405, configured to train a preset model based on a training sample to obtain the trained preset model, and use the trained preset model as the target model;
the training sample is an image for setting transparency for a text region.
In some embodiments, the model training unit 405 is configured to:
inputting the training sample into a preset first network model of the preset model to obtain a characteristic diagram output by the preset first network model of the preset model;
inputting the characteristic diagram into a preset second network model of the preset model to obtain a plurality of channel information output by the preset second network model of the preset model;
determining a first type loss function, a second type loss function and a third type loss function based on a plurality of channels corresponding to the training samples and the transparency set by the text area in the training samples;
and updating the preset model by back-propagation according to the first-type, second-type and third-type loss functions.
The preset second network model uses a feature pyramid enhancement module FPEM to determine the feature map.
In one embodiment, the model training unit 405 is configured to:
processing the input feature map by using the 1 st FPEM, and outputting an initial target first group feature map, a target second group feature map, a target third group feature map and a target fourth group feature map;
taking the output of the 1 st FPEM module as input, and adopting the 2 nd FPEM to perform the same processing operation as the 1 st FPEM, so as to obtain a final target first group feature mapping, a target second group feature mapping, a target third group feature mapping and a target fourth group feature mapping;
processing the input feature map by adopting the 1 st FPEM, and outputting an initial target first group feature map, a target second group feature map, a target third group feature map and a target fourth group feature map, wherein the method comprises the following steps:
the 1st FPEM receives 4 groups of 4-channel feature maps of different sizes and sorts them from large to small, front to back; the sorted groups are called, in order, the forward first, second, third and fourth groups of feature maps;
first, the forward fourth group of feature maps is upsampled by a factor of 2; the upsampled fourth group is then added point by point, channel by channel, to the forward third group, a depthwise separable convolution is applied to the result, and one further round of convolution, batch normalization and activation yields the reverse second group of feature maps. The same operation applied to the reverse second group and the forward second group yields the reverse third group, and applied to the reverse third group and the forward first group yields the reverse fourth group; meanwhile, the forward fourth group is regarded as the reverse first group, giving 4 reverse groups of feature maps;
the reverse fourth group is taken as the target first group of feature mappings. The target first group is then downsampled by a factor of 2 and added point by point, channel by channel, to the reverse third group; a depthwise separable convolution is applied to the result, followed by one further round of convolution, batch normalization and activation, yielding the target second group. The same operation applied to the target second group and the reverse second group yields the target third group, and applied to the target third group and the reverse first group yields the target fourth group.
The text region detection apparatus of the present application can accurately determine the text regions in a target image, thereby improving the accuracy and efficiency of text region detection.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 10, is a block diagram of an electronic device according to an embodiment of the application. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 10 illustrates an example with one processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the text region detection method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the text region detection method provided by the present application.
The memory 802, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the text region detection method in the embodiments of the present application. The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the text region detection method in the above-described method embodiment.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 10.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. A text region detection method, comprising:
inputting a target image into a target model to obtain a plurality of channels output by the target model; wherein the plurality of channels comprise a contracted text region probability map and a corner probability map;
obtaining a probability threshold based on a gradient map corresponding to the target image and the contracted text region probability map;
based on the probability threshold, binarizing the contracted text region probability map and binarizing the corner probability map to obtain a contracted text region binary map and a corner binary map;
determining a text region contained in the target image based on the contracted text region binary map and the corner binary map;
wherein obtaining the probability threshold based on the gradient map corresponding to the target image and the contracted text region probability map comprises:
performing an AND operation on the gradient map corresponding to the target image and the contracted text region probability map to obtain a processed gradient map;
adding the processed gradient map and the contracted text region probability map point by point to obtain a processed contracted text region probability map;
and calculating a probability average value of the processed contracted text region probability map, and taking the calculated probability average value as the probability threshold.
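The claim is recited in prose only; purely as a reading aid, the threshold derivation can be sketched in NumPy. Interpreting the AND operation on real-valued maps as element-wise multiplication is an assumption here, since the claim does not fix the arithmetic:

```python
import numpy as np

def adaptive_threshold(gradient_map: np.ndarray, prob_map: np.ndarray) -> float:
    """Derive a probability threshold from a gradient map and a contracted
    text region probability map (both H x W float arrays in [0, 1])."""
    # AND the gradient map with the probability map
    # (read here as element-wise multiplication -- an assumption).
    processed_gradient = gradient_map * prob_map
    # Add the processed gradient map to the probability map point by point.
    processed_prob = processed_gradient + prob_map
    # The mean of the processed probability map serves as the threshold.
    return float(processed_prob.mean())

def binarize(prob_map: np.ndarray, threshold: float) -> np.ndarray:
    """Binarize a probability map against the derived threshold."""
    return (prob_map >= threshold).astype(np.uint8)
```

Because the threshold is the mean of an image-specific processed map, it adapts per input image rather than being a fixed constant.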
2. The method of claim 1, wherein the plurality of channels further comprise: offsets of corner points;
wherein determining the text region contained in the target image based on the contracted text region binary map and the corner binary map comprises:
determining a contracted text region based on the contracted text region binary map;
correcting the corner binary map according to the offsets of the corner points to obtain a corrected corner binary map;
and performing expansion processing on the contracted text region based on the corrected corner binary map to obtain the text region contained in the target image.
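As a reading aid for claim 2, the correction-and-expansion steps might be sketched as follows. The (dy, dx) offset format, the rounding-and-clipping scheme, and the axis-aligned bounding-box expansion are illustrative assumptions; the patent does not fix these details:

```python
import numpy as np

def correct_corners(corner_binary: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Shift each detected corner pixel by its predicted (dy, dx) offset.
    corner_binary: H x W uint8 map; offsets: H x W x 2 float map."""
    h, w = corner_binary.shape
    corrected = np.zeros_like(corner_binary)
    for y, x in zip(*np.nonzero(corner_binary)):
        dy, dx = offsets[y, x]
        ny = int(np.clip(round(y + dy), 0, h - 1))
        nx = int(np.clip(round(x + dx), 0, w - 1))
        corrected[ny, nx] = 1
    return corrected

def expand_to_corners(contracted_binary: np.ndarray,
                      corrected_corners: np.ndarray) -> np.ndarray:
    """Expand the contracted text region to the bounding box spanned by the
    corrected corners (a simple axis-aligned stand-in for the expansion step)."""
    ys, xs = np.nonzero(corrected_corners | contracted_binary)
    expanded = np.zeros_like(contracted_binary)
    expanded[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return expanded
```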
3. The method according to any one of claims 1-2, wherein the inputting the target image into the target model to obtain the plurality of channels output by the target model comprises:
inputting the target image into a first network model of the target model to obtain a feature map output by the first network model of the target model;
and inputting the feature map into a second network model of the target model to obtain the plurality of channels output by the second network model of the target model.
4. The method of claim 3, wherein the inputting the feature map into the second network model of the target model to obtain the plurality of channels output by the second network model of the target model comprises:
concatenating the input feature maps to obtain a fused feature map;
and performing one convolution operation and one deconvolution operation on the fused feature map, and outputting the plurality of channels with sizes consistent with the target image.
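The output head of claim 4 can be illustrated with a minimal sketch. The 1x1 convolution (an einsum channel mix) and the nearest-neighbour upsampling standing in for the deconvolution are illustrative assumptions; the claim fixes neither kernel sizes nor the upsampling scheme:

```python
import numpy as np

def output_head(feature_maps, w_conv, up_factor=4):
    """Concatenate feature maps along the channel axis, mix channels with a
    1x1 convolution, then upsample so the output matches the input image size.
    feature_maps: list of (C_i, H, W) arrays; w_conv: (C_out, sum C_i) weights."""
    # Concatenate: list of (C_i, H, W) -> (sum C_i, H, W).
    fused = np.concatenate(feature_maps, axis=0)
    # 1x1 convolution as a per-pixel channel mix.
    mixed = np.einsum('oc,chw->ohw', w_conv, fused)
    # Deconvolution stand-in: upsample H and W by `up_factor`.
    return mixed.repeat(up_factor, axis=1).repeat(up_factor, axis=2)
```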
5. The method according to any one of claims 1-2, further comprising:
training a preset model based on a training sample to obtain a trained preset model, and taking the trained preset model as the target model; wherein the training sample is an image in which a transparency is set for a text region.
6. The method of claim 5, wherein the training the preset model based on the training sample comprises:
inputting the training sample into a preset first network model of the preset model to obtain a feature map output by the preset first network model;
inputting the feature map into a preset second network model of the preset model to obtain a plurality of channels output by the preset second network model;
determining a first type loss function, a second type loss function and a third type loss function based on the plurality of channels corresponding to the training sample and the transparency set for the text region in the training sample;
and updating the preset model by back propagation according to the first type loss function, the second type loss function and the third type loss function.
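Claim 6 names three loss types but not their formulas. Purely as an illustration of combining three terms into the scalar that drives the back-propagation update, one might use binary cross-entropy for the region map, a Dice-style loss for the corner map, and an L1 loss for the offsets; all three choices are assumptions, not the patent's specification:

```python
import numpy as np

def total_loss(pred: dict, target: dict, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of three illustrative loss terms.
    pred/target: dicts with float arrays under 'region', 'corner', 'offset'."""
    eps = 1e-6
    # First type (illustrative): binary cross-entropy on the region map.
    p, t = pred['region'], target['region']
    bce = -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    # Second type (illustrative): Dice-style loss on the corner map.
    pc, tc = pred['corner'], target['corner']
    dice = 1 - (2 * (pc * tc).sum() + eps) / (pc.sum() + tc.sum() + eps)
    # Third type (illustrative): L1 loss on the corner offsets.
    l1 = np.abs(pred['offset'] - target['offset']).mean()
    w1, w2, w3 = weights
    return float(w1 * bce + w2 * dice + w3 * l1)
```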
7. The method according to claim 6, wherein the preset first network model of the preset model uses pyramid convolution to perform feature map calculation, and the preset second network model uses a Feature Pyramid Enhancement Module (FPEM) to determine a feature image.
8. A text region detecting apparatus, characterized in that the apparatus comprises:
a model processing unit, configured to input a target image into a target model to obtain a plurality of channels output by the target model; wherein the plurality of channels comprise a contracted text region probability map and a corner probability map;
a threshold determining unit, configured to obtain a probability threshold based on a gradient map corresponding to the target image and the contracted text region probability map;
a binarization unit, configured to binarize the contracted text region probability map and binarize the corner probability map based on the probability threshold to obtain a contracted text region binary map and a corner binary map;
a text region determining unit, configured to determine a text region contained in the target image based on the contracted text region binary map and the corner binary map;
wherein the threshold determining unit is configured to:
perform an AND operation on the gradient map corresponding to the target image and the contracted text region probability map to obtain a processed gradient map;
add the processed gradient map and the contracted text region probability map point by point to obtain a processed contracted text region probability map;
and calculate a probability average value of the processed contracted text region probability map, and take the calculated probability average value as the probability threshold.
9. The apparatus of claim 8, wherein the plurality of channels further comprise: offsets of corner points;
wherein the text region determining unit is configured to:
determine a contracted text region based on the contracted text region binary map;
correct the corner binary map according to the offsets of the corner points to obtain a corrected corner binary map;
and perform expansion processing on the contracted text region based on the corrected corner binary map to obtain the text region contained in the target image.
10. The apparatus according to any one of claims 8-9, wherein the model processing unit is configured to:
input the target image into a first network model of the target model to obtain a feature map output by the first network model of the target model;
and input the feature map into a second network model of the target model to obtain the plurality of channels output by the second network model of the target model.
11. The apparatus of claim 10, wherein the model processing unit is configured to:
concatenate the input feature maps to obtain a fused feature map;
and perform one convolution operation and one deconvolution operation on the fused feature map, and output the plurality of channels with sizes consistent with the target image.
12. The apparatus according to any one of claims 8-9, further comprising:
a model training unit, configured to train a preset model based on a training sample to obtain a trained preset model, and take the trained preset model as the target model;
wherein the training sample is an image in which a transparency is set for a text region.
13. The apparatus of claim 12, wherein the model training unit is configured to:
input the training sample into a preset first network model of the preset model to obtain a feature map output by the preset first network model;
input the feature map into a preset second network model of the preset model to obtain a plurality of channels output by the preset second network model;
determine a first type loss function, a second type loss function and a third type loss function based on the plurality of channels corresponding to the training sample and the transparency set for the text region in the training sample;
and update the preset model by back propagation according to the first type loss function, the second type loss function and the third type loss function.
14. The apparatus of claim 13, wherein the preset first network model of the preset model uses pyramid convolution to perform feature map calculation, and the preset second network model uses a Feature Pyramid Enhancement Module (FPEM) to determine a feature image.
15. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202011546450.2A 2020-12-24 2020-12-24 Text region detection method, text region detection device, electronic equipment and computer storage medium Active CN112287924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011546450.2A CN112287924B (en) 2020-12-24 2020-12-24 Text region detection method, text region detection device, electronic equipment and computer storage medium


Publications (2)

Publication Number Publication Date
CN112287924A CN112287924A (en) 2021-01-29
CN112287924B true CN112287924B (en) 2021-03-16

Family

ID=74426743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011546450.2A Active CN112287924B (en) 2020-12-24 2020-12-24 Text region detection method, text region detection device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112287924B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191390B (en) * 2021-04-01 2022-06-14 华中科技大学 Image classification model construction method, image classification method and storage medium
CN112990201A (en) * 2021-05-06 2021-06-18 北京世纪好未来教育科技有限公司 Text box detection method and device, electronic equipment and computer storage medium
CN113610540B (en) * 2021-07-09 2024-02-02 北京农业信息技术研究中心 River crab anti-counterfeiting tracing method and system
CN113469878B (en) * 2021-09-02 2021-11-12 北京世纪好未来教育科技有限公司 Text erasing method and training method and device of model thereof, and storage medium
CN115631493B (en) * 2022-11-04 2023-05-09 金蝶软件(中国)有限公司 Text region determining method, system and related device

Citations (5)

Publication number Priority date Publication date Assignee Title
CN110781967A (en) * 2019-10-29 2020-02-11 华中科技大学 Real-time text detection method based on differentiable binarization
CN111091123A (en) * 2019-12-02 2020-05-01 上海眼控科技股份有限公司 Text region detection method and equipment
CN111652218A (en) * 2020-06-03 2020-09-11 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN111797821A (en) * 2020-09-09 2020-10-20 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN111814794A (en) * 2020-09-15 2020-10-23 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium


Non-Patent Citations (1)

Title
Scene Text Detection and Segmentation Based on Cascaded Convolution Neural Networks; Youbao Tang et al.; IEEE Transactions on Image Processing; 2017-01-20; pp. 1509-1520 *

Also Published As

Publication number Publication date
CN112287924A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112287924B (en) Text region detection method, text region detection device, electronic equipment and computer storage medium
CN112528976B (en) Text detection model generation method and text detection method
CN112308051B (en) Text box detection method and device, electronic equipment and computer storage medium
CN112381183B (en) Target detection method and device, electronic equipment and storage medium
CN111709878A (en) Face super-resolution implementation method and device, electronic equipment and storage medium
CN111753961A (en) Model training method and device, and prediction method and device
CN112990203B (en) Target detection method and device, electronic equipment and storage medium
JP7142121B2 (en) Character recognition method, device, electronic equipment, computer-readable storage medium, and program
CN112508003B (en) Character recognition processing method and device
US11810384B2 (en) Method and apparatus for recognizing text content and electronic device
CN111739005A (en) Image detection method, image detection device, electronic equipment and storage medium
JP7264929B2 (en) Backgroundless image generation method and apparatus, electronic device, storage medium, and computer program
CN111709428B (en) Method and device for identifying positions of key points in image, electronic equipment and medium
CN111967297A (en) Semantic segmentation method and device for image, electronic equipment and medium
CN112115921A (en) True and false identification method and device and electronic equipment
CN111275827B (en) Edge-based augmented reality three-dimensional tracking registration method and device and electronic equipment
CN112232315B (en) Text box detection method and device, electronic equipment and computer storage medium
CN111597987A (en) Method, apparatus, device and storage medium for generating information
JP7389824B2 (en) Object identification method and device, electronic equipment and storage medium
CN112508027B (en) Head model for instance segmentation, instance segmentation model, image segmentation method and device
CN117422851A (en) Virtual clothes changing method and device and electronic equipment
CN112990201A (en) Text box detection method and device, electronic equipment and computer storage medium
CN111027387A (en) Method and device for evaluating number of people and obtaining evaluation model and storage medium
CN112558810B (en) Method, apparatus, device and storage medium for detecting fingertip position
CN112150380B (en) Method, apparatus, electronic device, and readable storage medium for correcting image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant