CN112990204A - Target detection method and device, electronic equipment and storage medium

Info

Publication number: CN112990204A (application CN202110507957.5A; granted as CN112990204B)
Authority: CN (China)
Prior art keywords: map, text, detection, module, image
Legal status: Granted; active
Other languages: Chinese (zh)
Inventors: 王翔 (Wang Xiang), 秦勇 (Qin Yong)
Assignee: Beijing Century TAL Education Technology Co Ltd
Legal events: application filed by Beijing Century TAL Education Technology Co Ltd; priority to CN202110507957.5A; publication of CN112990204A; application granted; publication of CN112990204B

Classifications

    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06V30/10 Character recognition


Abstract

The application discloses a target detection method and device, an electronic device, and a storage medium. The specific implementation scheme is as follows: performing feature extraction on a first text image based on a feature extraction module to obtain a feature image; inputting the feature image into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region; inputting the feature image into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing the text region coordinates required by regression processing; taking a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map as a target detection network; and detecting a corresponding text region in a second text image according to the target detection network, and locating the text region. By the method and the device, the accuracy of target detection can be improved.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
As electronic equipment such as portable devices and mobile phone terminals becomes ever more intelligent and chips gain stronger analysis capability, image-and-text information, video information, and the like can be analyzed efficiently through computer vision technology, and the target objects they contain can be detected.
Taking the target object as a text object as an example, the main purpose of text detection is to locate the position of a text line or a character in an image. Because characters vary in direction, have irregular shapes and sometimes extreme aspect ratios, and come in various fonts, colors, and backgrounds, and especially because large amounts of dense text exist, a general-purpose target detection method cannot achieve a good localization effect, so target detection accuracy is not high.
Disclosure of Invention
The application provides a target detection method, a target detection device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a target detection method including:
performing feature extraction on the first text image based on a feature extraction module to obtain a feature image;
inputting the characteristic image into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region;
inputting the characteristic image into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing the text region coordinates required by regression processing;
taking a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map and the regression prediction map as a target detection network;
and detecting a corresponding text region in the second text image according to the target detection network, and positioning the text region.
According to another aspect of the present application, there is provided an object detecting apparatus including:
the feature extraction branch module is used for extracting features of the first text image based on the feature extraction module to obtain a feature image;
the first detection branch module is used for inputting the characteristic image into the first detection module to obtain a probability map of the contracted text region and a threshold map of the text region;
the second detection branch module is used for inputting the characteristic image into the second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing the text region coordinates required by regression processing;
a target detection network determining module, configured to use a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map as a target detection network;
and the target detection processing module is used for detecting a corresponding text area in the second text image according to the target detection network and positioning the text area.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as provided by any one of the embodiments of the present application.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
By adopting the method and the device, feature extraction can be performed on a first text image based on a feature extraction module to obtain a feature image; the feature image is input into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region; the feature image is input into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing the text region coordinates required by regression processing; a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map is taken as a target detection network; and a corresponding text region in a second text image is detected according to the target detection network and located. After the features are extracted, a plurality of comparison maps for target detection (namely the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map) are obtained through a plurality of detection branches (namely, the feature image is input into the first detection module and the second detection module respectively for further operation), so that the target detection network trained on these comparison maps is used to detect text regions. This can solve the problem that conventional general-purpose target detection methods have a poor localization effect; the text region can be accurately located, so the target detection accuracy is high.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart diagram of a target detection method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating dense text detection using regression boxes on a binary map in an application example according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of an object detection device according to an embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing the object detection method of the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B exist simultaneously, or that B exists alone. The term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C. The terms "first" and "second" are used to distinguish similar objects and do not necessarily imply a sequence or order, nor do they imply that there are only two objects; there may be one or more of each.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Text detection has a wide range of applications and is a preliminary step in many computer vision tasks, such as image search, character recognition, identity authentication, and visual navigation. Its main purpose is to locate the position of a text line or a character in an image, so accurate localization of text is both very important and challenging.
Text detection method based on sliding window
This method is mainly based on the idea of general-purpose target detection: a large number of anchors of different aspect ratios and sizes are set and used as sliding windows to perform a traversal search over the image, or over a feature map obtained from the image by convolution operations, and each searched position box is classified as containing text or not. The advantage of the sliding-window text detection method is that once the text boxes have been judged, subsequent work can proceed without further post-processing; the disadvantages are an excessive amount of computation, heavy consumption of computing resources, and long running time.
Method for calculating connected domain
This method is mainly based on a segmentation idea: first, a fully convolutional neural network model is used to extract image features; then binarization is performed on the feature map and its connected domains are calculated; training data sets suited to different application scenes are then used to judge the positions of the text lines in those scenes. The advantages of the connected-domain method are fast calculation and a small amount of computation; the disadvantage is that the post-processing steps are complex, involving a large amount of computation and tuning and therefore a large amount of time, and whether the post-processing strategy is reasonable and effective also strictly limits the performance of the algorithm.
According to an embodiment of the present application, a target detection method is provided. Fig. 1 is a flowchart of the target detection method according to the embodiment of the present application. The method may be applied to a target detection apparatus; for example, where the apparatus is deployed in a terminal, a server, or another processing device, it may perform feature extraction, target detection, and the like. The terminal may be a User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 1, the method includes:
s101, performing feature extraction on the first text image based on a feature extraction module to obtain a feature image.
In an example, when the feature extraction module includes a backbone network module and a Feature Pyramid Enhancement Module (FPEM), the first text image may be input to the backbone network module for feature extraction to obtain a plurality of feature vectors, and the feature image is obtained after the feature vectors undergo feature extraction, up-sampling, and concatenation again through at least one FPEM. The FPEM may be used together with a Feature Fusion Module (FFM); it is a cascadable U-shaped module that introduces multi-level information during segmentation processing and guides better segmentation.
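For illustration, a minimal sketch (assuming PyTorch, which this disclosure does not name) of the up-sampling and concatenation step described above; the channel counts and input sizes are assumptions chosen for the example.

    import torch
    import torch.nn.functional as F

    def fuse_features(feature_maps):
        # Up-sample every multi-scale feature map to the largest spatial
        # size and concatenate along the channel axis, as described above.
        target_size = feature_maps[0].shape[2:]   # the largest map comes first
        upsampled = [F.interpolate(f, size=target_size, mode="bilinear",
                                   align_corners=False) for f in feature_maps]
        return torch.cat(upsampled, dim=1)

    # Example: four maps at strides 4, 8, 16, 32 of a 640x640 input
    # (the channel count of 128 is an illustrative assumption).
    feats = [torch.randn(1, 128, 160, 160), torch.randn(1, 128, 80, 80),
             torch.randn(1, 128, 40, 40), torch.randn(1, 128, 20, 20)]
    fused = fuse_features(feats)                  # shape: (1, 512, 160, 160)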
And S102, inputting the characteristic image into a first detection module to obtain a probability map of the contracted text region and a threshold map of the text region.
In an example, when the first detection module adopts a DB (Real-time Scene Text Detection with Differentiable Binarization) model, the feature image is input into the DB model for convolution and deconvolution processing, and a multi-channel feature image is output, where the feature image output by the first channel is the probability map of the contracted text region, and the feature image output by the second channel is the threshold map of the text region.
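A minimal sketch, assuming PyTorch, of a two-channel detection head of the kind described here (convolution plus deconvolution producing the probability map and the threshold map); the kernel sizes, intermediate width, and normalization choices are illustrative assumptions rather than details from this disclosure.

    import torch
    import torch.nn as nn

    class DBHead(nn.Module):
        # Illustrative first-detection-module head: one convolution and two
        # transposed convolutions yield a 2-channel, sigmoid-activated map;
        # channel 0 is the probability map of the contracted text region and
        # channel 1 is the threshold map of the text region.
        def __init__(self, in_channels=512, mid=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, mid, 3, padding=1, bias=False),
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(mid, mid, 2, stride=2), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(mid, 2, 2, stride=2),
                nn.Sigmoid(),          # both output maps take values in [0, 1]
            )

        def forward(self, fused):
            out = self.net(fused)
            return out[:, 0:1], out[:, 1:2]   # probability map, threshold map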
S103, inputting the characteristic image into a second detection module to obtain a score map for representing the probability whether the pixel belongs to the text region or not and a regression prediction map for representing the coordinates of the text region required by regression processing.
In an example, when the second detection module adopts an EAST (Efficient and Accurate Scene Text detector) model, the feature image is input into the EAST model for convolution and deconvolution processing, a group of first feature mapping data is output, the first feature mapping data is subjected to a first convolution processing to obtain the score map, and the first feature mapping data is subjected to a second convolution processing to obtain the regression prediction map. For the first feature mapping data, the feature vector of the high-dimensional multimedia data is mapped into a one-dimensional or low-dimensional space through dimension-reduction processing during feature mapping.
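A matching sketch, again assuming PyTorch, of a second-detection-module head in which a shared group of first feature maps feeds one convolution for the score map and a second convolution for the regression prediction map; the 5 geometry channels correspond to the RBOX case described later, and all layer shapes are assumptions.

    import torch
    import torch.nn as nn

    class EASTHead(nn.Module):
        def __init__(self, in_channels=512, mid=32, geo_channels=5):
            super().__init__()
            self.shared = nn.Sequential(   # yields the "first feature mapping data"
                nn.Conv2d(in_channels, mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(mid, mid, 2, stride=2), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(mid, mid, 2, stride=2), nn.ReLU(inplace=True),
            )
            self.score = nn.Conv2d(mid, 1, 1)           # first convolution -> score map
            self.geo = nn.Conv2d(mid, geo_channels, 1)  # second convolution -> regression map

        def forward(self, fused):
            shared = self.shared(fused)
            return torch.sigmoid(self.score(shared)), self.geo(shared)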
And S104, taking a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map and the regression prediction map as a target detection network.
In an example, the probability map of the contracted text region, the threshold map of the text region, the score map and the regression prediction map may be used as sample data to perform network training, so that the detection network obtained by training is used as a target detection network for final use.
And S105, detecting a corresponding text area in the second text image according to the target detection network, and positioning the text area.
In an example, based on the above S101-S104, the sample data may be obtained by performing the feature extraction and multi-branch detection processing on the first text image, so as to obtain the finally used target detection network. When the target detection network is used in S105, a second text image may be selected arbitrarily. The second text image may contain one or more text lines; the text lines are not limited to English characters, Chinese characters, or mixed Chinese and English, and may also contain non-character symbols and the like. The target detection network may be used to detect the one or more text lines and the text content contained in them.
By adopting the method and the device, after features are extracted, a plurality of comparison maps for target detection (namely the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map) are obtained through a plurality of detection branches (namely, the feature image is input into the first detection module and the second detection module respectively for further operation), so that the target detection network trained on these comparison maps is used to detect text regions. This can solve the problem that current general-purpose target detection methods have a poor localization effect; text regions can be accurately located, and the target detection accuracy is high.
In one embodiment, the method further comprises: training based on the output of a first detection branch corresponding to the detection processing of the first detection module and the output of a second detection branch corresponding to the detection processing of the second detection module; training the probability map of the contracted text region and the threshold map of the text region output by the first detection branch with a first loss function (namely, the loss function corresponding to the DB model), and training the score map and the regression prediction map output by the second detection branch with a second loss function (namely, the loss function corresponding to the EAST model); and obtaining a total loss function from the first loss function and the second loss function, and obtaining the target detection network through back-propagation of the total loss function. With this embodiment, for dense text detection, the advantages of the PAN (Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network) technique, the EAST model, and the DB model can be combined. Specifically, 2 FPEM modules from the PAN technique can be used for feature extraction; then the probability map of the contracted text region and the threshold map of the text region corresponding to the first detection branch, and the score map and the regression prediction map corresponding to the second detection branch, are obtained through 2 detection branches (such as the first detection branch where the DB model is located and the second detection branch where the EAST model is located), so that the first detection branch and the second detection branch are combined for joint training, and the finally used target detection network is obtained through back-propagation of the resulting total loss function. The target detection network obtained through joint training can therefore realize more accurate target detection.
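A minimal sketch of one joint training step as described in this embodiment, assuming PyTorch; the detector and loss-function signatures are placeholders invented for the example, and only the structure (a total loss that is the sum of the two branch losses, followed by back-propagation) comes from the text above.

    import torch

    def joint_training_step(detector, batch, db_loss_fn, east_loss_fn, optimizer):
        # Forward pass produces the four comparison maps from both branches.
        prob_map, thresh_map, score_map, geo_map = detector(batch["image"])
        loss_db = db_loss_fn(prob_map, thresh_map, batch)    # first loss function
        loss_east = east_loss_fn(score_map, geo_map, batch)  # second loss function
        total_loss = loss_db + loss_east                     # total loss function
        optimizer.zero_grad()
        total_loss.backward()                                # back-propagation
        optimizer.step()
        return total_loss.item()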
The respective advantages of the above-mentioned PAN technique, EAST model and DB model are introduced as follows:
First: the PAN technique uses Resnet18 as the basic network framework. Features of the input image, such as texture, edge, corner, and semantic information, are extracted through Resnet18 and represented by 4 groups of multi-channel feature maps of different sizes. The extracted features are then processed through 2 FPEM modules; for example, a process combining convolution, deconvolution, and batch normalization is performed through the FPEM modules to extract features such as texture, edge, corner, and semantic information again. Finally, the output feature maps are up-sampled to obtain a 6-channel feature mapping.
In the 6-channel feature mapping, the feature map of the first channel is a probability map representing the text line regions; after binarization, connected domains are computed to obtain the specific text line regions. The feature map of the second channel is a probability map of the text line regions after they have been contracted inward according to certain rules and proportions; after binarization, connected domains are computed to obtain the specific contracted text line regions. The remaining 4 channels are combined to represent a 4-dimensional feature vector at each position of the feature map; then, using a clustering method, combined with the text region map and the contracted text region map, the distance between the 4-dimensional feature vector at each pixel position and each cluster center is computed, to judge which text region a pixel that appears in the text region but not in the contracted text region belongs to.
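A minimal sketch, assuming PyTorch, of this pixel-assignment step; the distance threshold, the tensor shapes, and the background label are illustrative assumptions.

    import torch

    def assign_border_pixels(pixel_emb, kernel_means, dist_thresh=3.0):
        # pixel_emb: (N, 4) feature vectors of pixels that lie in the text
        #            region map but not in the contracted region map
        # kernel_means: (K, 4) mean feature vectors (cluster centers) of the
        #            K contracted text regions
        dists = torch.cdist(pixel_emb, kernel_means)   # (N, K) Euclidean distances
        nearest = dists.argmin(dim=1)
        near_dist = dists.gather(1, nearest[:, None]).squeeze(1)
        # label 1..K for assigned pixels, 0 for background (threshold assumed)
        return torch.where(near_dist < dist_thresh, nearest + 1,
                           torch.zeros_like(nearest))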
It should be noted that the processing is not limited to 2 FPEM modules; the benefit of selecting 2 FPEM modules is that more accurate features can be extracted at a minimal time cost. The processing in each of the 2 FPEM modules is the same, and each FPEM module processes the extracted features as follows. The 4 groups of multi-channel feature maps of different sizes extracted by Resnet18 in the previous step are ordered from large to small and may be called, in sequence, the forward first, forward second, forward third, and forward fourth groups of feature maps.
First, the forward fourth group of feature maps is up-sampled 2x, that is, its size is enlarged by a factor of 2; it is then added point by point, channel by channel, to the forward third group; a depthwise separable convolution operation is performed on the point-wise sum, followed by one more convolution, batch normalization, and activation operation; the result is called the reverse second group of feature maps. Correspondingly, the same operation (a depthwise separable convolution followed by convolution, batch normalization, and activation) is applied to the reverse second group and the forward second group to obtain the reverse third group, and then to the reverse third group and the forward first group to obtain the reverse fourth group; meanwhile, the forward fourth group is regarded as the reverse first group, so that 4 groups of reverse feature maps are obtained. Next, the reverse fourth group is taken as the target first group of feature maps and down-sampled 2x, that is, its size is reduced by a factor of 2; it is then added point by point, channel by channel, to the reverse third group; a depthwise separable convolution is performed on the point-wise sum, followed by one more convolution, batch normalization, and activation; the result is called the target second group of feature maps. Correspondingly, the same operation is applied to the target second group and the reverse second group to obtain the target third group, and then to the target third group and the reverse first group to obtain the target fourth group. Finally, the target first, second, third, and fourth groups of feature maps are taken as the output of the first FPEM module; the second FPEM module takes the output of the first FPEM module as its input and performs the same operations to produce the output of the second FPEM module.
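The following is a minimal PyTorch sketch of an FPEM of the kind just described, with an up-scale pass followed by a down-scale pass built from depthwise separable convolution blocks; the uniform channel count and the use of average pooling for the 2x down-sampling are assumptions made for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SeparableConvBlock(nn.Module):
        # Depthwise separable convolution followed by convolution, batch
        # normalization, and activation, as described in the text above.
        def __init__(self, channels):
            super().__init__()
            self.depthwise = nn.Conv2d(channels, channels, 3, padding=1,
                                       groups=channels, bias=False)
            self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
            self.bn = nn.BatchNorm2d(channels)

        def forward(self, x):
            return F.relu(self.bn(self.pointwise(self.depthwise(x))))

    class FPEM(nn.Module):
        def __init__(self, channels=128):
            super().__init__()
            self.blocks = nn.ModuleList(SeparableConvBlock(channels) for _ in range(6))

        def forward(self, f1, f2, f3, f4):   # f1 largest ... f4 smallest
            # Up-scale pass: enlarge the deeper map 2x, add point by point, enhance.
            r2 = self.blocks[0](f3 + F.interpolate(f4, scale_factor=2.0))
            r3 = self.blocks[1](f2 + F.interpolate(r2, scale_factor=2.0))
            r4 = self.blocks[2](f1 + F.interpolate(r3, scale_factor=2.0))
            # Down-scale pass: shrink 2x, add point by point, enhance.
            t1 = r4                            # target first group (largest)
            t2 = self.blocks[3](F.avg_pool2d(t1, 2) + r3)
            t3 = self.blocks[4](F.avg_pool2d(t2, 2) + r2)
            t4 = self.blocks[5](F.avg_pool2d(t3, 2) + f4)  # f4 = reverse first group
            return t1, t2, t3, t4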
Second: the DB model is also based on the Resnet18 network architecture. Features of the input image are extracted through Resnet18, the extracted feature maps are all up-sampled to one quarter of the original image size and concatenated, and a 2-channel feature map is then obtained through one convolution operation as the output.
In the 2-channel feature map, the first channel represents the probability map of the contracted text region; the second channel represents the threshold map of the text region, in which the distance of each pixel from the real text region box is normalized to a value between 0 and 1. A differentiable binarization function is also designed, whose parameters can be learned along with the network; a binary map of the image text region can then be calculated from the threshold map and the probability map, connected domains are calculated on the binary map to obtain the contracted text regions, and the contracted text regions are expanded outward according to certain rules and proportions to obtain the real text regions.
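A minimal sketch of such a differentiable binarization function, assuming PyTorch; the steep-sigmoid form and the steepness factor k = 50 follow the published DB model and are assumptions here rather than values stated in this disclosure.

    import torch

    def differentiable_binarization(prob_map, thresh_map, k=50.0):
        # Soft, trainable approximation of binarization: a steep sigmoid of
        # the gap between probability map P and learned threshold map T,
        # so gradients can flow through the binarization during training.
        return torch.sigmoid(k * (prob_map - thresh_map))

    def hard_binarization(prob_map, threshold=0.3):
        # At test time a fixed threshold can replace the soft function
        # (the 0.3 value is an illustrative assumption).
        return (prob_map > threshold).float()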
Third: the EAST model is a regression-based text detection model that can directly predict the existence of text instances and their geometry from a complete image. Its output comprises two branches. The first branch is a score map with pixel values in the range [0, 1], which expresses the probability that each pixel belongs to a text region; the second branch is a regression prediction branch that can generate candidate prediction bounding boxes of two geometries for text regions, such as a rotated box (RBOX) or a quadrangle (QUAD). After the score map and the rotated boxes are obtained, a post-processing algorithm thresholds each pixel to obtain the text regions whose score exceeds a preset confidence threshold. Since these regions are considered valid, a text box is predicted at each such pixel position, and all candidate prediction bounding boxes belonging to the pixels of the same text region are then merged into a final prediction bounding box by a locality-aware non-maximum suppression (LNMS) algorithm to represent the text region. The output of the LNMS post-processing is taken as the final output of the whole text detection algorithm.
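A minimal sketch of the merging pass of locality-aware NMS, simplified to axis-aligned boxes for brevity (the EAST model itself merges rotated or quadrilateral boxes); the score-weighted averaging follows the published LNMS idea, and the threshold value is an assumption.

    def iou(a, b):
        # Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-9)

    def locality_aware_merge(boxes, scores, thresh=0.5):
        # First pass of LNMS: walk the per-pixel boxes in reading order and
        # merge each box into the previous one (score-weighted average of
        # coordinates) whenever their IoU exceeds the threshold; a standard
        # NMS over the merged boxes follows this pass.
        merged, merged_scores = [], []
        for box, s in zip(boxes, scores):
            if merged and iou(merged[-1], box) > thresh:
                w1, w2 = merged_scores[-1], s
                merged[-1] = [(w1 * c1 + w2 * c2) / (w1 + w2)
                              for c1, c2 in zip(merged[-1], box)]
                merged_scores[-1] = w1 + w2   # accumulated evidence
            else:
                merged.append(list(box))
                merged_scores.append(s)
        return merged, merged_scores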
It can be seen that the PAN technique, the DB model, and the EAST model each have advantages. The PAN technique has a more pronounced advantage in feature extraction because of the FPEM module; the post-processing of the DB model is simpler than that of the PAN technique, so its post-processing is faster. The EAST model takes a different approach from the PAN technique and the DB model; although it is faster and its post-processing is simpler, its detection capability for wide or long text regions is weaker, resulting in a poorer edge regression effect.
On some open scene text detection data sets, for example where each image contains 4 to 5 text boxes, the detection speed and detection results of the above PAN technique, DB model, and EAST model are almost the same. But in practical application scenes where text is very dense, for example where a single image of a pupil's arithmetic exercise book contains at least 100 text regions, effects such as text sticking can occur because of the dense text, and the PAN technique and the DB model cannot handle target detection on dense text well.
For the dense text situation, considering the combined effect of time cost and detection accuracy, the advantages of the PAN technique, the DB model, and the EAST model can be combined, that is: the 2 FPEM modules of the PAN technique are applied in the backbone network for the detection performed by the DB model and the EAST model, and at the same time the final outputs of the detection performed by the DB model and the EAST model are combined to jointly train the target detection network. In one example, a brand-new post-processing screening method, in which regression boxes are used on a binary map to obtain the real text regions and the target regression boxes, can be realized based on this target detection network, so that the performance of dense text detection is improved while its speed is preserved, balancing time cost and detection accuracy.
In one embodiment, the method further comprises: performing binarization processing on the threshold map of the text region to obtain a text box binary map; performing binarization processing on the score map to obtain a score binary map; obtaining regression boxes according to the score binary map and the regression prediction map; and taking the regression boxes that fall on the text box binary map as objects to be compared, and screening target regression boxes from the objects to be compared based on an intersection-over-union (IoU) operation on the regression boxes. With this embodiment, the final target regression boxes can be screened from the multiple candidate regression boxes based on binarization processing, thereby improving the accuracy of target detection.
For the above example combining the advantages of the PAN technology, the DB model and the EAST model, the following contents are included:
First, in the feature extraction process preceding the first detection branch of the DB model and the second detection branch of the EAST model, the FPEM modules of the PAN technique are applied in the backbone network: a convolution operation is performed on the input image to extract features, the extracted features are then processed with the FPEM module 2 times, and all processed feature maps are up-sampled to the original image size and concatenated.
Second, 2 detection branches are used respectively, namely the first detection branch of the DB model and the second detection branch of the EAST model, which can execute multi-branch detection synchronously in parallel or asynchronously in a time-shared manner. Finally, the outputs of the DB model and the EAST model are combined, so that the real text regions can subsequently be obtained by using regression boxes on the binary map.
Third, in the first detection branch, one convolution and two deconvolutions are performed on the concatenated feature images to obtain a 2-channel output feature image, where the first channel represents the probability map of the contracted text region and the second channel represents the threshold map of the text region.
Fourth, in the second detection branch, a convolution operation and two deconvolution operations may be performed on the concatenated feature images to obtain a group of 32-channel feature maps; a convolution operation is then performed on this group of feature maps to obtain a 1-channel feature map (such as the above score map) indicating the probability that a pixel belongs to a text region. Correspondingly, another convolution operation is performed on the group of 32-channel feature maps to obtain a 5-channel or 8-channel feature map (such as the regression prediction map) representing the coordinates of the regression text box (the number of coordinates depends on the type of regression box); candidate prediction bounding boxes of two geometries can be generated for the text region according to these coordinates.
In the training stage, the loss function corresponding to the DB model may be used for the output of the first detection branch, and the loss function corresponding to the EAST model for the output of the second detection branch; the total loss function is the sum of the two, and multi-task joint training is realized through the total loss function, so that the finally trained target detection model is more accurate, and using this target detection network for target detection can improve character detection on dense text. When processing the contracted text probability map represented by the first channel and the text region threshold map represented by the second channel of the first detection branch, the Dice loss function and the smooth L1 loss function may be used to train the two channels respectively, and a binary map of the contracted text region can be obtained after binarization. For the second detection branch, the loss function used for the EAST model outputs may likewise be the Dice loss function or the smooth L1 loss function.
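Minimal sketches, assuming PyTorch, of the two named loss functions; restricting the smooth L1 term to a supervision mask is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def dice_loss(pred, target, eps=1e-6):
        # Dice loss on a probability map: 1 - 2*sum(P*G) / (sum(P) + sum(G)).
        inter = (pred * target).sum()
        return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def smooth_l1_map_loss(pred, target, mask):
        # Smooth L1 loss for the threshold map or the regression channels,
        # supervised only where the mask is 1 (the mask is an assumption).
        return F.smooth_l1_loss(pred * mask, target * mask)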
In the testing stage, binarization processing is performed on the text region threshold map obtained by the first detection branch according to a set first threshold, obtaining a text box binary map, in which boxes may overlap or intersect. Binarization processing is then performed on the score map obtained by the second detection branch according to a set second threshold (since the value of each pixel on the score map indicates the probability of belonging to text, the second threshold can be set higher than the first threshold so that most points are screened out), obtaining a score binary map. Then, from the outputs of the other channels of the second detection branch, the regression box corresponding to each pixel remaining after binarization of the score map can be obtained.
In this example, after a large number of regression boxes are removed by the score map binarization, the following processing is performed on the remaining regression boxes and the text box binary map: if all four edges of a regression box fall on pixels whose value is 1 on the text box binary map, the box is kept. Finally, all remaining regression boxes are checked pairwise for intersection, and if the intersection-over-union of two intersecting boxes exceeds 0.8 (a specified high threshold), only one of them is kept; at this point, all the text boxes have been obtained.
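A minimal NumPy sketch of this screening step, simplified to axis-aligned boxes; the edge test checks the four border lines of each box directly, and the compact iou() helper repeats the one sketched earlier.

    import numpy as np

    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-9)

    def box_on_binary_map(box, text_binary):
        # Keep a regression box only if all four of its edges fall on pixels
        # whose value is 1 in the text box binary map.
        h, w = text_binary.shape
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        x1, x2 = np.clip([x1, x2], 0, w - 1)
        y1, y2 = np.clip([y1, y2], 0, h - 1)
        return bool(text_binary[y1, x1:x2 + 1].all() and
                    text_binary[y2, x1:x2 + 1].all() and
                    text_binary[y1:y2 + 1, x1].all() and
                    text_binary[y1:y2 + 1, x2].all())

    def screen_boxes(boxes, text_binary, iou_thresh=0.8):
        # Keep boxes lying on the binary map, then keep only one of any
        # pair whose intersection-over-union exceeds the 0.8 threshold.
        kept = [b for b in boxes if box_on_binary_map(b, text_binary)]
        final = []
        for b in kept:
            if all(iou(b, f) <= iou_thresh for f in final):
                final.append(b)
        return final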
Compared with using only the DB model and the PAN technique, this method effectively solves the problem of text sticking and improves the ability to detect long text; and compared with using only the PAN technique, the DB model, or the EAST model, the application process after the target detection model is obtained (namely the post-processing process) is simpler and more efficient.
Application example:
fig. 2 is a schematic flowchart of dense text detection using regression boxes on a binary map in an application example according to an embodiment of the present application, and the flow includes the following contents:
First, a dense text image is input into a Resnet18 network for feature extraction.
Second, the features extracted in the first step are refined again through two FPEM modules, obtaining 4 groups of feature maps corresponding to the extracted features.
Third, the 4 groups of feature maps obtained in the second step are all up-sampled to 1/4 of the original image size and concatenated.
Fourth, one convolution operation and two deconvolution operations are performed on the feature map obtained in the third step, outputting a 2-channel feature map whose size is consistent with the original image; the first channel represents the probability map of the contracted text region, and the second channel represents the threshold map of the text region (the threshold map can be used to indicate the border of the text).
Fifth, one convolution operation and two deconvolution operations are performed on the feature map obtained in the third step, outputting a group of 32-channel feature maps whose size is consistent with the original image.
Note that the features obtained by the above one convolution operation and two deconvolution operations are more refined than the features obtained in the first and second steps; if the features obtained in the first step are called features 1 and those obtained in the second step features 2, the features obtained in this step may be called features 3.
Sixth, a convolution operation is performed on the feature maps obtained in the fifth step (i.e., on the group of 32-channel feature maps) to obtain a 1-channel feature map with the same size as the original image, representing the score map.
Seventh, a convolution operation is performed on the feature maps obtained in the fifth step (i.e., on the group of 32-channel feature maps) to obtain a 5-channel (or 8-channel) feature map with the same size as the original image, indicating the coordinate offsets of the regression box.
It should be noted that if a rotated box (RBOX) is predicted, the output has the above 5 channels, which respectively represent the distances from the current pixel to the edges of the text box and the rotation angle of the text box; if a quadrangle (QUAD) is predicted, the 8 channels respectively represent the 4 vertex coordinates of the quadrangle (an illustrative decoding sketch is given after the steps below).
Eighth, in the training stage, the DB loss function is used for the output of the fourth step, and the EAST loss function is used for the outputs of the sixth and seventh steps; in view of the multi-task learning mode, the total loss function finally adopted is the sum of the DB loss function and the EAST loss function.
Ninth, in the testing stage, binarization processing is performed on the text region threshold map obtained in the fourth step according to a set threshold, obtaining a text box binary map.
Tenth, binarization is performed on the score map obtained in the sixth step according to a set high threshold, obtaining a score binary map.
Eleventh, the score binary map from the tenth step is combined with the output of the seventh step to obtain the regression box corresponding to each pixel whose value is 1 on the score binary map.
Twelfth, according to the text box binary map obtained in the ninth step, it is judged whether each regression box obtained in the eleventh step falls on the text box binary map, and the regression boxes that do fall on it are recorded.
Thirteenth, for the regression boxes obtained in the twelfth step that intersect one another, it is judged whether their intersection-over-union exceeds a set high threshold; if so, the box is filtered out, otherwise it is kept.
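For illustration, a minimal NumPy sketch of decoding the 5-channel RBOX output mentioned in the seventh step; the channel order (distances to the top, right, bottom, and left edges, then the rotation angle) follows the published EAST model and is an assumption here. For brevity the sketch handles the zero-angle case and reads the maps at original-image resolution, as the steps above describe; the rotated case additionally rotates the four corners around the pixel by the predicted angle.

    import numpy as np

    def decode_rbox(score_map, geo_map, score_thresh=0.8):
        # score_map: (H, W) probabilities; geo_map: (5, H, W) offsets + angle.
        ys, xs = np.where(score_map > score_thresh)   # pixels considered valid
        boxes = []
        for y, x in zip(ys, xs):
            d_top, d_right, d_bottom, d_left, angle = geo_map[:, y, x]
            # Zero-angle decode: each valid pixel predicts one candidate box.
            boxes.append([x - d_left, y - d_top, x + d_right, y + d_bottom])
        return boxes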
The present application provides a target detection apparatus. Fig. 3 is a schematic structural diagram of the target detection apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus includes: a feature extraction branch module 41, configured to perform feature extraction on a first text image based on the feature extraction module to obtain a feature image; a first detection branch module 42, configured to input the feature image into the first detection module to obtain a probability map of a contracted text region and a threshold map of the text region; a second detection branch module 43, configured to input the feature image into the second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing the text region coordinates required by regression processing; a target detection network determining module 44, configured to take a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map as the target detection network; and a target detection processing module 45, configured to detect a corresponding text region in a second text image according to the target detection network and locate the text region.
In an embodiment, the feature extraction branch module is configured to, when the feature extraction module includes a backbone network module and an FPEM module, input the first text image into the backbone network module to perform feature extraction, so as to obtain a plurality of feature vectors, and perform feature extraction, upsampling, and concatenation processing on the plurality of feature vectors again through at least one of the FPEM modules, so as to obtain the feature image.
In one embodiment, the first detection branch module is configured to, when the first detection module adopts a DB model, input the feature image into the DB model to perform convolution and deconvolution processing, and output a multi-channel feature image; in the multi-channel characteristic images, the characteristic image output by a first channel is a probability map of the contracted text region, and the characteristic image output by a second channel is a threshold map of the text region.
In one embodiment, the second detection branch module is configured to, when the second detection module adopts an EAST model, input the feature image into the EAST model to perform convolution and deconvolution processing, and output a set of first feature mapping data; performing first convolution processing on the first feature mapping data to obtain the score map; and performing second convolution processing on the first feature mapping data to obtain the regression prediction graph.
In an embodiment, the apparatus further includes a training module, configured to perform training by combining an output of a first detection branch corresponding to the detection processing performed by the first detection module with an output of a second detection branch corresponding to the detection processing performed by the second detection module; training the probability map of the contracted text region and the threshold map of the text region output by the first detection branch by adopting a first loss function, and training the score map and the regression prediction map output by the second detection branch by adopting a second loss function; and obtaining a total loss function according to the first loss function and the second loss function, and obtaining the target detection network according to the back propagation of the total loss function.
In one embodiment, the apparatus further includes a screening module, configured to perform binarization processing on the threshold map of the text region to obtain a text box binary map; perform binarization processing on the score map to obtain a score binary map; obtain regression boxes according to the score binary map and the regression prediction map; and take the regression boxes that fall on the text box binary map as objects to be compared, and screen target regression boxes from the objects to be compared based on an intersection-over-union operation on the regression boxes.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for implementing the object detection method according to the embodiment of the present application. The electronic device may be the aforementioned deployment device or proxy device. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 801 is taken as an example.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the object detection method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the object detection method provided by the present application.
The memory 802, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the object detection methods in the embodiments of the present application. The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the object detection method in the above-described method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the target detection method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this regard as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A target detection method, the method comprising:
performing feature extraction on a first text image based on a feature extraction module to obtain a feature image;
inputting the feature image into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region;
inputting the feature image into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing text region coordinates required for regression processing;
taking, as a target detection network, a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map; and
detecting a corresponding text region in a second text image according to the target detection network, and locating the text region.
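For orientation only, the following is a minimal sketch (assuming PyTorch) of the shared-feature, two-branch layout recited in claim 1. All names, channel counts, and the 4-channel regression layout are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    """Sketch: one feature extractor feeding two detection branches."""
    def __init__(self, channels=64):
        super().__init__()
        # Stand-in for the feature extraction module (backbone + FPEM, claim 2).
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        # First detection module: two channels (probability map, threshold map).
        self.db_head = nn.Conv2d(channels, 2, 1)
        # Second detection module: score map (1 ch) + regression prediction map (4 ch).
        self.score_head = nn.Conv2d(channels, 1, 1)
        self.geo_head = nn.Conv2d(channels, 4, 1)

    def forward(self, first_text_image):
        f = self.features(first_text_image)        # shared feature image
        db = torch.sigmoid(self.db_head(f))        # ch 0: probability map, ch 1: threshold map
        score = torch.sigmoid(self.score_head(f))  # per-pixel text probability
        geo = self.geo_head(f)                     # text-region coordinates for regression
        return db[:, 0:1], db[:, 1:2], score, geo

# Usage: prob, thresh, score, geo = DualBranchDetector()(torch.randn(1, 3, 640, 640))
```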
2. The method of claim 1, wherein the performing feature extraction on the first text image based on a feature extraction module to obtain a feature image comprises:
in a case where the feature extraction module comprises a backbone network module and at least one Feature Pyramid Enhancement Module (FPEM), inputting the first text image into the backbone network module for feature extraction to obtain a plurality of feature vectors, and performing feature extraction, up-sampling, and concatenation on the plurality of feature vectors again through the at least one FPEM module to obtain the feature image.
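In the text detection literature, FPEM denotes the Feature Pyramid Enhancement Module introduced with the PAN detector (Wang et al., ICCV 2019). Claim 2's final up-sampling and concatenation step can be pictured with the short sketch below; the four pyramid levels and channel width are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fuse_pyramid(feats, out_size):
    """Up-sample multi-scale feature maps to a common size and concatenate
    them along the channel axis, as in the final step of claim 2."""
    upsampled = [F.interpolate(f, size=out_size, mode='bilinear',
                               align_corners=False) for f in feats]
    return torch.cat(upsampled, dim=1)

# Example: four assumed pyramid levels of 128 channels each.
feats = [torch.randn(1, 128, s, s) for s in (160, 80, 40, 20)]
feature_image = fuse_pyramid(feats, out_size=(160, 160))  # shape (1, 512, 160, 160)
```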
3. The method according to claim 1 or 2, wherein the inputting the feature image into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region comprises:
in a case where the first detection module adopts a Differentiable Binarization (DB) model, inputting the feature image into the DB model for convolution and deconvolution processing, and outputting a multi-channel feature image;
wherein, in the multi-channel feature image, the feature map output by a first channel is the probability map of the contracted text region, and the feature map output by a second channel is the threshold map of the text region.
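A hedged sketch of such a DB-style head follows. The layer sizes are assumptions, and the approximate binarization formula B = 1 / (1 + exp(-k(P - T))) with k ≈ 50 comes from the original DB paper (Liao et al., AAAI 2020), not from the claim text.

```python
import torch
import torch.nn as nn

class DBHead(nn.Module):
    """Convolution followed by transposed convolutions, ending in a
    two-channel feature image: channel 0 is the probability map of the
    contracted text region, channel 1 the threshold map (claim 3)."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 2, stride=2))  # 2 output channels

    def forward(self, feature_image, k=50.0):
        out = torch.sigmoid(self.net(feature_image))
        prob_map, thresh_map = out[:, 0:1], out[:, 1:2]
        # Differentiable binarization (DB paper): B = 1 / (1 + exp(-k(P - T)))
        approx_binary = torch.sigmoid(k * (prob_map - thresh_map))
        return prob_map, thresh_map, approx_binary
```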
4. The method according to claim 1 or 2, wherein the inputting the feature image into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing text region coordinates required for regression processing comprises:
in a case where the second detection module adopts an Efficient and Accurate Scene Text detector (EAST) model, inputting the feature image into the EAST model for convolution and deconvolution processing, and outputting a set of first feature mapping data;
performing first convolution processing on the first feature mapping data to obtain the score map; and
performing second convolution processing on the first feature mapping data to obtain the regression prediction map.
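The sketch below shows one plausible shape of such a head. The five-channel geometry (four box-edge distances plus a rotation angle) follows the EAST paper's RBOX convention and is an assumption here, since the claim speaks only of text region coordinates.

```python
import torch
import torch.nn as nn

class EASTHead(nn.Module):
    """Convolution/deconvolution stem producing the first feature mapping
    data, then two separate 1x1 convolutions for the score map and the
    regression prediction map (claim 4)."""
    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(mid_ch, mid_ch, 2, stride=2), nn.ReLU())
        self.score_conv = nn.Conv2d(mid_ch, 1, 1)  # first convolution processing
        self.geo_conv = nn.Conv2d(mid_ch, 5, 1)    # second convolution processing

    def forward(self, feature_image):
        fmap = self.stem(feature_image)             # first feature mapping data
        score_map = torch.sigmoid(self.score_conv(fmap))
        regression_map = self.geo_conv(fmap)        # 4 edge distances + angle (assumed)
        return score_map, regression_map
```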
5. The method of claim 1 or 2, further comprising:
performing training based on an output of a first detection branch corresponding to detection processing of the first detection module and an output of a second detection branch corresponding to detection processing of the second detection module;
applying a first loss function to the probability map of the contracted text region and the threshold map of the text region output by the first detection branch, and applying a second loss function to the score map and the regression prediction map output by the second detection branch; and
obtaining a total loss function from the first loss function and the second loss function, and obtaining the target detection network through back-propagation of the total loss function.
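A minimal training-step sketch of this two-loss scheme is given below, reusing the DualBranchDetector sketch from claim 1. The unweighted sum of the two losses and the loss-function signatures are assumptions; the claim states only that a total loss is obtained from the first and second losses and back-propagated.

```python
def train_step(model, optimizer, image, targets, first_loss_fn, second_loss_fn):
    """One optimization step: first loss on the DB branch, second loss on
    the EAST branch, summed into a total loss and back-propagated."""
    prob_map, thresh_map, score_map, regression_map = model(image)
    first_loss = first_loss_fn(prob_map, thresh_map, targets)         # first detection branch
    second_loss = second_loss_fn(score_map, regression_map, targets)  # second detection branch
    total_loss = first_loss + second_loss                             # total loss function
    optimizer.zero_grad()
    total_loss.backward()                                             # back-propagation
    optimizer.step()
    return total_loss.item()
```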
6. The method of claim 1 or 2, further comprising:
performing binarization processing on the threshold map of the text region to obtain a text box binary map;
performing binarization processing on the score map to obtain a score binary map;
obtaining regression boxes according to the score binary map and the regression prediction map; and
taking the regression boxes falling on the text box binary map as objects to be compared, and screening a target regression box from the objects to be compared based on an intersection-over-union (IoU) operation on the regression boxes.
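Claim 6's screening can be pictured with the sketch below: binarize the maps, keep regression boxes whose centre lands on the text-box binary map, and suppress overlaps with an intersection-over-union pass. Axis-aligned boxes, the 0.5 thresholds, and the area-ordered suppression are all assumptions made for illustration.

```python
import numpy as np

def binarize(m, thresh=0.5):
    """Fixed-threshold binarization of a map in [0, 1] (threshold assumed)."""
    return (m > thresh).astype(np.uint8)

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def screen_boxes(regression_boxes, text_box_binary, iou_thresh=0.5):
    """Keep boxes falling on the text-box binary map, then screen a target
    box set by suppressing high-IoU overlaps (largest boxes kept first)."""
    objects_to_compare = []
    for b in regression_boxes:
        cx, cy = int((b[0] + b[2]) / 2), int((b[1] + b[3]) / 2)
        if text_box_binary[cy, cx]:            # box falls on the binary map
            objects_to_compare.append(b)
    kept = []
    for b in sorted(objects_to_compare,
                    key=lambda x: -(x[2]-x[0]) * (x[3]-x[1])):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```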
7. A target detection apparatus, characterized in that the apparatus comprises:
a feature extraction branch module, configured to perform feature extraction on a first text image based on a feature extraction module to obtain a feature image;
a first detection branch module, configured to input the feature image into a first detection module to obtain a probability map of a contracted text region and a threshold map of the text region;
a second detection branch module, configured to input the feature image into a second detection module to obtain a score map representing the probability that a pixel belongs to a text region and a regression prediction map representing text region coordinates required for regression processing;
a target detection network determining module, configured to take, as a target detection network, a detection network obtained by training based on the probability map of the contracted text region, the threshold map of the text region, the score map, and the regression prediction map; and
a target detection processing module, configured to detect a corresponding text region in a second text image according to the target detection network and locate the text region.
8. The apparatus of claim 7, wherein the feature extraction branch module is configured to, in a case where the feature extraction module comprises a backbone network module and at least one Feature Pyramid Enhancement Module (FPEM), input the first text image into the backbone network module for feature extraction to obtain a plurality of feature vectors, and perform feature extraction, up-sampling, and concatenation on the plurality of feature vectors again through the at least one FPEM module to obtain the feature image.
9. The apparatus of claim 7 or 8, wherein the first detection branch module is configured to:
in a case where the first detection module adopts a Differentiable Binarization (DB) model, input the feature image into the DB model for convolution and deconvolution processing, and output a multi-channel feature image;
wherein, in the multi-channel feature image, the feature map output by a first channel is the probability map of the contracted text region, and the feature map output by a second channel is the threshold map of the text region.
10. The apparatus of claim 7 or 8, wherein the second detection branch module is configured to:
in a case where the second detection module adopts an Efficient and Accurate Scene Text detector (EAST) model, input the feature image into the EAST model for convolution and deconvolution processing, and output a set of first feature mapping data;
perform first convolution processing on the first feature mapping data to obtain the score map; and
perform second convolution processing on the first feature mapping data to obtain the regression prediction map.
11. The apparatus of claim 7 or 8, further comprising a training module configured to:
perform training based on an output of a first detection branch corresponding to detection processing of the first detection module and an output of a second detection branch corresponding to detection processing of the second detection module;
apply a first loss function to the probability map of the contracted text region and the threshold map of the text region output by the first detection branch, and apply a second loss function to the score map and the regression prediction map output by the second detection branch; and
obtain a total loss function from the first loss function and the second loss function, and obtain the target detection network through back-propagation of the total loss function.
12. The apparatus of claim 7 or 8, further comprising a screening module configured to:
perform binarization processing on the threshold map of the text region to obtain a text box binary map;
perform binarization processing on the score map to obtain a score binary map;
obtain regression boxes according to the score binary map and the regression prediction map; and
take the regression boxes falling on the text box binary map as objects to be compared, and screen a target regression box from the objects to be compared based on an intersection-over-union (IoU) operation on the regression boxes.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
CN202110507957.5A 2021-05-11 2021-05-11 Target detection method and device, electronic equipment and storage medium Active CN112990204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110507957.5A CN112990204B (en) 2021-05-11 2021-05-11 Target detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112990204A true CN112990204A (en) 2021-06-18
CN112990204B CN112990204B (en) 2021-08-24

Family

ID=76337459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110507957.5A Active CN112990204B (en) 2021-05-11 2021-05-11 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112990204B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200012876A1 (en) * 2017-09-25 2020-01-09 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device
US20190272438A1 (en) * 2018-01-30 2019-09-05 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for detecting text
CN110032998A (en) * 2019-03-18 2019-07-19 华南师范大学 Character detecting method, system, device and the storage medium of natural scene picture
CN110298298A (en) * 2019-06-26 2019-10-01 北京市商汤科技开发有限公司 Target detection and the training method of target detection network, device and equipment
CN110781967A (en) * 2019-10-29 2020-02-11 华中科技大学 Real-time text detection method based on differentiable binarization
CN110991440A (en) * 2019-12-11 2020-04-10 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN111652218A (en) * 2020-06-03 2020-09-11 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN111652217A (en) * 2020-06-03 2020-09-11 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN111709420A (en) * 2020-06-18 2020-09-25 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN111738262A (en) * 2020-08-21 2020-10-02 北京易真学思教育科技有限公司 Target detection model training method, target detection model training device, target detection model detection device, target detection equipment and storage medium
CN112016551A (en) * 2020-10-23 2020-12-01 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112132143A (en) * 2020-11-23 2020-12-25 北京易真学思教育科技有限公司 Data processing method, electronic device and computer readable medium
CN112348021A (en) * 2021-01-08 2021-02-09 北京易真学思教育科技有限公司 Text detection method, device, equipment and storage medium
CN112528976A (en) * 2021-02-09 2021-03-19 北京世纪好未来教育科技有限公司 Text detection model generation method and text detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINYU ZHOU et al.: "EAST: An Efficient and Accurate Scene Text Detector", CVPR *
WANG Deqing et al.: "A Survey of Scene Text Recognition Techniques", Computer Engineering and Applications *
CAI Xinxin et al.: "Segmentation-Based Detection of Arbitrarily Shaped Scene Text", Computer Systems & Applications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591839A (en) * 2021-06-28 2021-11-02 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device
CN113591839B (en) * 2021-06-28 2023-05-09 北京有竹居网络技术有限公司 Feature extraction model construction method, target detection method and device
CN113343987A (en) * 2021-06-30 2021-09-03 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113378838A (en) * 2021-06-30 2021-09-10 北京邮电大学 Method for detecting text region of nameplate of mutual inductor based on deep learning
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium
CN113269280A (en) * 2021-07-21 2021-08-17 北京世纪好未来教育科技有限公司 Text detection method and device, electronic equipment and computer readable storage medium
CN113313083A (en) * 2021-07-28 2021-08-27 北京世纪好未来教育科技有限公司 Text detection method and device
CN113780283A (en) * 2021-09-17 2021-12-10 湖北天天数链技术有限公司 Model training method, text detection method and device and lightweight network model

Also Published As

Publication number Publication date
CN112990204B (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN112990204B (en) Target detection method and device, electronic equipment and storage medium
CN112990203B (en) Target detection method and device, electronic equipment and storage medium
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
US11321593B2 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN112528976B (en) Text detection model generation method and text detection method
CN114399769B (en) Training method of text recognition model, and text recognition method and device
CN111488791A (en) On-device classification of fingertip movement patterns as gestures in real time
CN112767418B (en) Mirror image segmentation method based on depth perception
CN112381183B (en) Target detection method and device, electronic equipment and storage medium
CN112949767B (en) Sample image increment, image detection model training and image detection method
CN113313083B (en) Text detection method and device
CN111640123B (en) Method, device, equipment and medium for generating background-free image
CN111079638A (en) Target detection model training method, device and medium based on convolutional neural network
CN113903036B (en) Text recognition method and device, electronic equipment, medium and product
CN111738263A (en) Target detection method and device, electronic equipment and storage medium
CN113378712A (en) Training method of object detection model, image detection method and device thereof
CN113269280B (en) Text detection method and device, electronic equipment and computer readable storage medium
CN113177497B (en) Training method of visual model, vehicle identification method and device
CN113627298A (en) Training method of target detection model and method and device for detecting target object
CN113569911A (en) Vehicle identification method and device, electronic equipment and storage medium
CN113326766A (en) Training method and device of text detection model and text detection method and device
CN112101347B (en) Text detection method and device, electronic equipment and computer storage medium
CN113887394A (en) Image processing method, device, equipment and storage medium
CN112558810A (en) Method, device, equipment and storage medium for detecting fingertip position
CN111738250B (en) Text detection method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant