CN113255669B - Method and system for detecting text of natural scene with any shape

Info

Publication number
CN113255669B
CN113255669B (application CN202110715820.9A)
Authority
CN
China
Prior art keywords
mask
candidate frame
frame
candidate
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110715820.9A
Other languages
Chinese (zh)
Other versions
CN113255669A (en)
Inventor
许信顺
刘新宇
罗昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202110715820.9A
Publication of CN113255669A
Application granted
Publication of CN113255669B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a method and a system for detecting natural scene text of any shape, comprising the following steps: acquiring a text image to be detected; inputting the image to be detected into a trained detection model to obtain final detection boxes; and post-processing the obtained final detection boxes to form text regions. The detection model screens the candidate detection boxes through the classification score and the mask score to obtain the final detection boxes. The invention also designs a mask attention module that connects the mask generation process and the mask quality scoring process; the mask attention module has a positive effect on the prediction of the mask score.

Description

Method and system for detecting text of natural scene with any shape
Technical Field
The invention relates to the technical field of natural scene text detection, in particular to a method and a system for detecting a natural scene text with any shape.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text appears in every corner of our daily lives as the most direct means of information dissemination. Due to its potential application value in automatic driving, navigation for the blind, information retrieval and other areas, the text understanding task in natural scenes has received more and more attention. In general, the natural scene text understanding task involves two steps: text detection and text recognition. The first step locates text regions; the second step recognizes the content in those text regions. As the precursor of the text understanding task, text detection occupies a crucial position.
Traditional text detection methods are mainly based on connected-region analysis and sliding windows. Both rely on hand-crafted features: they can work in some simple scenes but fail in complex ones. In recent years, thanks to the improvement of computer performance, deep learning has developed rapidly, and text detection technology based on deep learning has improved accordingly. Deep-learning-based text detection can be mainly divided into two categories: regression-based methods and segmentation-based methods. Regression-based methods can detect horizontal or multi-oriented text, while segmentation-based methods can detect text of arbitrary shape, so segmentation-based methods currently dominate natural scene text detection.
One of the mainstream segmentation-based approaches is the instance-segmentation-based method. Such methods typically first use a horizontal candidate box (proposal) to locate a region; a classification score is then generated to determine whether the region enclosed by the candidate box belongs to text, and a segmentation mask is generated to delineate the text region. Such methods typically use the classification score as the only criterion for evaluating the quality of a predicted candidate box, which can lead to serious false positive problems. False positive problems can be divided into three categories:
(1) False positives caused by classification. As shown in fig. 2(a), some regions in natural scenes have features similar to text, such as graffiti on a wall, lines in a book, or cracks on a road surface; these regions may be mistakenly classified as text, resulting in false positive samples.
(2) False positives caused by regression. As shown in fig. 2(b), for long text or text with large character spacing (such as Chinese), a candidate box may contain only a partial text segment, and an incomplete text segment may cause ambiguity in the subsequent recognition module.
(3) False positives caused by segmentation. As shown in fig. 2(c), for irregular text, a horizontal candidate box may contain a large amount of background noise, so the final segmented mask may not represent the text region well.
For systems with high precision requirements, false positive samples can cause immeasurable losses; many systems would rather miss a detection than produce a wrong one, and false positive samples in the detection result can have a fatal influence on the recognition result. For example, in automatic driving, if only the second half of a 'no parking' sign is detected, the vehicle may be parked illegally; in information retrieval, if only the first half of a query word such as 'football' is detected, the retrieved results may differ greatly from the desired ones.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for detecting a natural scene text with any shape;
in a first aspect, the invention provides a method for detecting a text in a natural scene in any shape;
the method for detecting the text of the natural scene with any shape comprises the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
In a second aspect, the invention provides a system for detecting text in a natural scene in any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention analyzes and summarizes the false positive problems of traditional instance-segmentation-based text detection methods, and proposes a mask quality scoring mechanism to suppress false positive samples.
2. Based on the proposed mask quality scoring mechanism, the invention designs a new method for detecting natural scene text of any shape; the proposed method can suppress all types of false positive samples in a simple and uniform manner.
3. The invention designs a mask attention module that connects the mask generation process and the mask quality scoring process; the mask attention module has a positive effect on the prediction of the mask score.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2(a) is a graph of false positives resulting from classification according to the first embodiment;
FIG. 2(b) is a false positive resulting from the regression of the first embodiment;
FIG. 2(c) is a graph of false positives resulting from the segmentation of the first embodiment;
FIG. 2(d) is a sample of the true positive of the first embodiment;
FIG. 3 is a detailed structure of the MAM of the first embodiment;
fig. 4 is a detailed structure of the Mask head of the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in this embodiment are obtained legally and are used on the basis of compliance with laws and regulations and with user consent.
In order to uniformly solve the above false positive problems, the invention designs an arbitrary-shape text detection method based on a mask quality scoring mechanism. Unlike previous methods that use only the classification score of a candidate box to evaluate its quality, in the model designed by the invention whether a candidate box is kept is determined by both its classification score and its mask score. Based on this mechanism, the model can evaluate the quality of candidate boxes more reasonably, and false positive samples are more likely to be found and filtered out. The overall framework of the model is shown in fig. 1. The model designed by the invention consists of four parts: a skeleton network (Backbone), a Region Proposal Network (RPN), a bounding box module (Box head) and a mask module (Mask head). The bounding box module (Box head) comprises two fully-connected layers which are connected in sequence.
Example one
The embodiment provides a method for detecting a natural scene text in any shape;
as shown in fig. 1, the method for detecting a text in a natural scene with an arbitrary shape includes:
s1: acquiring a to-be-detected text image;
s2: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
Further, the S2: inputting the image to be detected into the trained detection model to obtain a final detection frame; the method specifically comprises the following steps:
s21: carrying out feature extraction on the image to be detected;
s22: constructing an initial candidate frame based on the extracted image features;
s23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
s24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame;
expanding the adjusted candidate frame to obtain an expanded candidate frame;
for the expansion candidate frame, generating an expansion candidate frame characteristic;
s25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
s26: and screening the adjusted candidate frames through the classification score and the mask score to form a final detection frame.
Further, the S21: carrying out feature extraction on the image to be detected; the method specifically comprises the following steps:
A deep residual network (ResNet) is adopted as the skeleton network Backbone to extract the features of the image to be detected, and a Feature Pyramid Network (FPN) is used to enhance the feature representation.
Further, the S22: constructing an initial candidate frame based on the extracted image features; the method specifically comprises the following steps:
The extracted features are input into a Region Proposal Network (RPN) to obtain the constructed initial candidate boxes.
Illustratively, the RPN outputs several horizontal candidate boxes, and a candidate box can be expressed as
$$b = (x, y, w, h)$$
where $(x, y)$ are the coordinates of the upper-left corner of the candidate box $b$, and $w$ and $h$ are its width and height, respectively.
further, the S23: generating an initial candidate frame feature based on the initial candidate frame; the method specifically comprises the following steps:
and generating initial candidate frame features by adopting a candidate region alignment operation RoIAlign based on the initial candidate frame and the extracted image features.
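As an illustration of this step, the following sketch extracts fixed-size candidate-box features with torchvision's roi_align; the feature-map stride (8), the 14x14 output size and the tensor shapes are illustrative assumptions, not values taken from the patent.

```python
import torch
from torchvision.ops import roi_align

# feature map from the backbone/FPN: (batch, channels, H, W)
features = torch.randn(1, 256, 100, 168)

# candidate boxes b as (x1, y1, x2, y2) in image coordinates, one image in the batch
boxes = [torch.tensor([[30.0, 40.0, 180.0, 90.0],
                       [200.0, 60.0, 420.0, 120.0]])]

# 14x14 pooled feature per box; spatial_scale maps image coords to feature-map coords
# (an assumed stride of 8 for this feature level -- an illustrative choice)
box_feats = roi_align(features, boxes, output_size=(14, 14),
                      spatial_scale=1.0 / 8, aligned=True)
print(box_feats.shape)  # torch.Size([2, 256, 14, 14])
```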
Further, the S23: predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
Illustratively, the S23: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; specifically comprises the following steps:
S231: the candidate box $b$ is mapped by the candidate region alignment operation RoIAlign into a fixed-size candidate box feature $f_b$;
S232: the candidate box feature $f_b$ is reduced in dimension through two fully-connected layers;
S233: the reduced feature is passed through a fully-connected layer that outputs a two-dimensional vector $(o_{text}, o_{bg})$, where $o_{text}$ is the output for the text category and $o_{bg}$ is the output for the background category; the classification score $s_{cls}$ of a candidate box is used to determine whether the region enclosed by the candidate box belongs to text, and is computed from these two outputs as a normalized score in $[0, 1]$ (e.g. with a softmax);
S234: the reduced feature is simultaneously passed through another fully-connected layer that outputs a four-dimensional vector $t = (t_x, t_y, t_w, t_h)$ for bounding-box regression; the initial candidate box $b$ is adjusted in position and size by the box regression branch to form the adjusted candidate box $b'$.
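For concreteness, a minimal PyTorch sketch of a Box head of this kind is given below: two shared fully-connected layers followed by a two-way classification branch and a four-way regression branch. The hidden size, the pooled feature size and the softmax-based classification score are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxHead(nn.Module):
    """Two shared FC layers, then a 2-way classification branch and a 4-way regression branch."""
    def __init__(self, in_channels=256, pool_size=14, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * pool_size * pool_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.cls_branch = nn.Linear(hidden, 2)   # (o_text, o_bg)
        self.reg_branch = nn.Linear(hidden, 4)   # (t_x, t_y, t_w, t_h)

    def forward(self, box_feats):                # box_feats: (N, C, pool, pool)
        x = box_feats.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        logits = self.cls_branch(x)
        s_cls = F.softmax(logits, dim=1)[:, 0]   # probability of the text class
        deltas = self.reg_branch(x)              # offsets used to adjust the candidate box
        return s_cls, deltas

head = BoxHead()
s_cls, deltas = head(torch.randn(3, 256, 14, 14))
print(s_cls.shape, deltas.shape)  # torch.Size([3]) torch.Size([3, 4])
```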
Further, the S24: generating characteristics of the adjusted candidate frame for the adjusted candidate frame; the method specifically comprises the following steps:
The feature of the adjusted candidate frame is generated using the candidate region alignment operation RoIAlign, based on the adjusted candidate frame and the image features extracted in S21.
Further, the S24: expanding the adjusted candidate frame to obtain an expanded candidate frame; the method specifically comprises the following steps:
An extension operation is applied to the adjusted candidate frame to form an expanded candidate frame.
The extension operation keeps the center position of the candidate box unchanged and expands its width and height by an expansion factor $\alpha$; in practice $\alpha$ is typically set to 2.
Meanwhile, the extension operation also ensures that the candidate frame after expansion does not exceed the image boundary.
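A minimal sketch of such an extension operation is given below: the center is kept fixed, the width and height are scaled by the expansion factor, and the result is clipped to the image boundary. The function name and the (x, y, w, h) box layout are assumptions made for illustration.

```python
def expand_box(x, y, w, h, img_w, img_h, alpha=2.0):
    """Expand a box (x, y, w, h) about its center by factor alpha, clipped to the image."""
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = alpha * w, alpha * h
    x_e = max(0.0, cx - new_w / 2.0)
    y_e = max(0.0, cy - new_h / 2.0)
    # clip the far edge to the image boundary as well
    w_e = min(img_w, cx + new_w / 2.0) - x_e
    h_e = min(img_h, cy + new_h / 2.0) - y_e
    return x_e, y_e, w_e, h_e

# example: a 100x40 box near the right edge of a 640x480 image
print(expand_box(500, 200, 100, 40, img_w=640, img_h=480))
```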
Further, the S24: generating an expansion candidate frame characteristic for the expansion candidate frame; the method specifically comprises the following steps:
and for the expansion candidate frame, generating an expansion candidate frame characteristic by adopting a candidate region alignment operation RoIAlign.
Further, the S25: generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream; two workflows are connected through a Mask Attention Module (MAM).
The upper workflow in the Mask head of fig. 4 is a Mask generation flow, and outputs a Mask with the adjusted candidate box feature as an input. A mask generation stream comprising: the segmentation mask comprises a convolutional layer C1, a convolutional layer C2 and an deconvolution layer, wherein the characteristics of the deconvolution layer are subjected to dimensionality reduction through a convolutional layer C3 with a 1x1 convolutional kernel, and a segmentation mask is output.
The next workflow in the Mask head of fig. 4 is a Mask score stream, and outputs a Mask score with the adjusted candidate box features and the expanded candidate box features as inputs. The Mask score stream firstly adopts two layers of convolution layers and two Mask Attention Modules (MAM) to fuse the input characteristics, and the Mask score stream comprises: convolutional layer D1, convolutional layer D2, convolutional layer D3, full-link layer FC1, full-link layer FC2, and full-link layer FC 3.
The Mask of the feature and Mask generation stream output after the Mask Attention Module (MAM) in fig. 4 is stacked into the convolutional layer D3, the full connection layer FC1, the full connection layer FC2, and the full connection layer FC3, and outputs a predicted Mask score.
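A rough PyTorch sketch of this two-stream Mask head is given below: the mask generation stream C1-C2-deconvolution-C3 and the mask score stream D1-D3 plus FC1-FC3, with the generated mask stacked onto the score-stream features before D3. The two mask attention modules that link the streams are omitted here (a separate MAM sketch appears later); all channel counts, kernel sizes and the interpolation of the mask to the score-stream resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Two-stream mask head: a mask generation stream and a mask score stream.
    The two Mask Attention Modules linking the streams are omitted for brevity."""
    def __init__(self, in_channels=256):
        super().__init__()
        # mask generation stream: C1, C2, deconvolution, then 1x1 conv C3 -> segmentation mask
        self.c1 = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.c2 = nn.Conv2d(256, 256, 3, padding=1)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.c3 = nn.Conv2d(256, 1, 1)
        # mask score stream: D1, D2 fuse adjusted + expanded box features; D3 + FC1-FC3 -> score
        self.d1 = nn.Conv2d(2 * in_channels, 256, 3, padding=1)
        self.d2 = nn.Conv2d(256, 256, 3, padding=1)
        self.d3 = nn.Conv2d(256 + 1, 256, 3, stride=2, padding=1)
        self.fc = nn.Sequential(nn.Linear(256 * 7 * 7, 1024), nn.ReLU(),
                                nn.Linear(1024, 1024), nn.ReLU(),
                                nn.Linear(1024, 1))

    def forward(self, f_box, f_ext):                      # both (N, C, 14, 14)
        g = F.relu(self.c2(F.relu(self.c1(f_box))))       # mask generation stream feature
        s = F.relu(self.d2(F.relu(self.d1(torch.cat([f_box, f_ext], dim=1)))))
        mask = torch.sigmoid(self.c3(F.relu(self.deconv(g))))            # (N, 1, 28, 28)
        m_small = F.interpolate(mask, size=s.shape[-2:], mode="bilinear",
                                align_corners=False)
        x = F.relu(self.d3(torch.cat([s, m_small], dim=1)))  # stack mask with score-stream feature
        score = torch.sigmoid(self.fc(x.flatten(1))).squeeze(1)          # predicted mask score
        return mask, score

head = MaskHead()
m, sc = head(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(m.shape, sc.shape)  # torch.Size([2, 1, 28, 28]) torch.Size([2])
```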
Further, the S26: screening the adjusted candidate frame through the classification score and the mask score to obtain a final detection frame; the method specifically comprises the following steps:
S261: all adjusted candidate boxes are first deduplicated by non-maximum suppression (NMS);
S262: the quality of a candidate box is measured by the candidate-box score $s_{final}$, whose specific calculation formula is
$$s_{final} = s_{cls} \times s_{mask}$$
where $s_{cls}$ is the classification score and $s_{mask}$ is the predicted mask score; candidate boxes with $s_{final}$ smaller than 0.5 are filtered out, and the retained candidate boxes form the final detection boxes;
S263: the largest connected region in the mask of each retained detection box is selected as the final detection result.
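A minimal sketch of this screening step is given below, using torchvision's non-maximum suppression; the NMS IoU threshold and the use of the classification score for NMS are assumptions, while the 0.5 threshold on the candidate-box score comes from the text above.

```python
import torch
from torchvision.ops import nms

def screen_candidates(boxes, s_cls, s_mask, nms_iou=0.5, score_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); s_cls, s_mask: (N,) scores in [0, 1]."""
    keep = nms(boxes, s_cls, nms_iou)            # remove duplicate candidate boxes first
    boxes, s_cls, s_mask = boxes[keep], s_cls[keep], s_mask[keep]
    s_final = s_cls * s_mask                     # candidate-box score = product of the two scores
    keep = s_final >= score_thresh               # filter out boxes scoring below 0.5
    return boxes[keep], s_final[keep]

boxes = torch.tensor([[10., 10., 110., 40.], [12., 11., 108., 42.], [200., 50., 300., 90.]])
s_cls = torch.tensor([0.95, 0.90, 0.92])
s_mask = torch.tensor([0.80, 0.75, 0.30])        # the last box is a likely false positive
print(screen_candidates(boxes, s_cls, s_mask))
```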
Further, the model structure of the detection model includes:
a skeleton network Backbone, which is used for receiving the input text image to be detected;
the output end of the skeleton network Backbone is connected with the input end of a Region Proposal Network (RPN);
the output end of the Region Proposal Network (RPN) is connected with the input end of the RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the frame module Box head; the frame module Box head comprises two fully-connected layers which are connected in sequence;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
Further, the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch includes: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the adjusted candidate frame feature;
wherein the second branch includes: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate frame feature and the expanded candidate frame feature;
the output terminal of convolutional layer C2 is connected to the first input terminal of the first Mask Attention Module (MAM);
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
Illustratively, the Mask head module is divided into two workflows: the mask generates a stream and a masked score stream. The mask generation flow generates a corresponding mask for the candidate box feature by using the adjusted candidate box feature, and the mask scoring flow evaluates the mask quality by using the adjusted candidate box feature and the expanded candidate box feature, wherein the expanded candidate box comprises more peripheral information which is helpful for predicting the mask quality.
Illustratively, the Mask head module specifically works as follows:
Step (1): the adjusted candidate box feature $f_{b'}$ passes through two convolutional layers to form the mask generation stream feature $F_{gen}$;
Step (2): the adjusted candidate box feature $f_{b'}$ and the expanded candidate box feature $f_{e}$ are concatenated and passed through two convolutional layers to form the mask score stream feature $F_{score}$;
Step (3): $F_{gen}$ and $F_{score}$ are fed into the first Mask Attention Module (MAM); the first mask attention module MAM causes the mask score stream to focus on the regions contained in the mask, so that the mask quality can be predicted more accurately; the detailed structure of the MAM is shown in fig. 3;
Step (4): the features of the two workflows pass through the second mask attention module;
Step (5): the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask $M$;
Step (6): the mask score stream feature and $M$ are stacked and passed through a convolutional layer and three fully-connected layers to output the predicted mask score $s_{mask}$.
The specific process of the step (3) is as follows:
Step (3.1): the mask generation stream feature $F_{gen}$ passes through a convolutional layer to generate an intermediate (staged) mask that serves as an attention map:
$$A = \mathrm{Conv}_{a}(F_{gen})$$
where $\mathrm{Conv}_{a}$ denotes the convolutional layer and $A$ is the resulting attention map; regions of $A$ whose response values are higher than a set threshold represent the regions attended to in the segmentation process, while regions whose response values are lower than the threshold represent regions not attended to in the segmentation process;
Step (3.2): the representation of the regions of interest indicated by $A$ is enhanced on the mask score stream feature; the specific operation is
$$F' = \mathrm{expand}(A) \odot F_{score}$$
where $\mathrm{expand}(\cdot)$ is a function for expanding the number of channels of a feature map; in actual operation it copies $A$ so that its channel dimension is extended from a single channel to the number of channels of $F_{score}$, and the element-wise (dot) product with the expanded attention map yields the enhanced feature $F'$;
Step (3.3): the response values of the regions that $A$ does not attend to are usually very low, so in $F'$ the responses of these unattended regions are greatly suppressed; in order to prevent the loss of information about the whole region, the original feature is added back:
$$F'' = F' + F_{score}$$
Step (3.4): $F_{gen}$ and $F''$ each pass through a convolutional layer that fuses the features to obtain the output values of the module.
Wherein, the internal structures of the first mask attention module and the second mask attention module are the same.
Further, the first masked attention module includes:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
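A minimal PyTorch sketch of one such mask attention module, following the structure above (E1 on the generation-stream path, F1 producing the attention map that multiplies the score-stream feature, an addition of the original feature, and G1 fusing the result), is given below; the channel counts, kernel sizes and activation functions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskAttentionModule(nn.Module):
    """Links the mask generation stream and the mask score stream (a sketch)."""
    def __init__(self, channels=256):
        super().__init__()
        self.e1 = nn.Conv2d(channels, channels, 3, padding=1)  # generation-stream output path
        self.f1 = nn.Conv2d(channels, 1, 1)                    # staged mask used as attention map
        self.g1 = nn.Conv2d(channels, channels, 3, padding=1)  # fuses the enhanced score feature

    def forward(self, f_gen, f_score):
        attn = torch.sigmoid(self.f1(f_gen))            # A: single-channel attention map
        enhanced = attn.expand_as(f_score) * f_score    # expand channels, element-wise product
        enhanced = enhanced + f_score                   # add the original feature back
        out_gen = torch.relu(self.e1(f_gen))            # first output: generation-stream feature
        out_score = torch.relu(self.g1(enhanced))       # second output: score-stream feature
        return out_gen, out_score, attn

mam = MaskAttentionModule()
g, s, a = mam(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(g.shape, s.shape, a.shape)
```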
Further, the training of the trained detection model comprises:
sa 1: constructing a training set, wherein the training set is an image of a known candidate frame label;
sa 2: inputting the training set into the detection model, training the detection model,
sa 3: carrying out feature extraction on the image of the known candidate frame tag;
sa 4: constructing an initial candidate frame based on the extracted features;
sa 5: generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
sa 6: generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
sa 7: generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
sa 8: and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the attention map generated in the step Sa7, and optimizing network parameters through back propagation to obtain a trained detection model.
Exemplarily, the Sa6: for the initial candidate frame, generating characteristics of the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expanded candidate frame characteristic; specifically comprises the following steps:
Sa61: the candidate box $b$ is mapped by RoIAlign into a fixed-size candidate box feature $f_b$;
Sa62: $b$ is expanded by the extension operation into the expanded candidate box $b_e$; the extension operation is specifically
$$b_e = \left(x - \tfrac{(\alpha - 1)w}{2},\; y - \tfrac{(\alpha - 1)h}{2},\; \alpha w,\; \alpha h\right)$$
where the first two components are the coordinates of the upper-left corner of $b_e$, the last two components are its width and height, respectively, and $\alpha$ represents the expansion factor, i.e. the multiple by which the candidate box is expanded;
Sa63: $b_e$ is likewise mapped by RoIAlign into a fixed-size expanded candidate box feature $f_e$.
Further, in Sa8, the loss function of the model is calculated and the parameters of the entire model are optimized by back propagation. The specific form of the loss function is as follows:
$$L = L_{rpn} + \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{mask} + \lambda_4 L_{mam} + \lambda_5 L_{ms}$$
where $L$ is the loss function of the model and consists of six parts, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are used for balancing the importance between the loss functions;
$L_{rpn}$, $L_{cls}$ and $L_{reg}$ are the losses of the RPN network and the Box head and take the same form as in prior instance-segmentation-based methods: $L_{rpn}$ involves two parts, a two-class log loss and a bounding-box regression smooth-$L_1$ loss; $L_{cls}$ adopts the cross-entropy form and $L_{reg}$ adopts the smooth-$L_1$ form; they are not described in detail here.
$L_{mask}$ is the mask generation stream loss of the Mask head and adopts the cross-entropy form:
$$L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log m_i + (1 - y_i)\log(1 - m_i)\Big]$$
where $m_i$ denotes the value of the $i$-th pixel in the mask $M$, $y_i$ denotes the mask label of the $i$-th pixel (obtained from the real labels of the training data), and $N$ denotes the total number of pixels in $M$.
$L_{mam}$ is the loss function for the mask attention maps in the MAMs, and its specific form is
$$L_{mam} = L_{mam}^{(1)} + L_{mam}^{(2)}$$
where $L_{mam}^{(1)}$ and $L_{mam}^{(2)}$ denote the losses of the two MAMs, respectively, and the two losses take the same form. In each, the loss is computed pixel by pixel between the mask attention map $A$ and a mask label: $a_i$ denotes the value of the $i$-th pixel of the mask attention map $A$, and $y^{a}_{i}$ denotes the mask label of that pixel, which is obtained from the mask label of the candidate box by interpolation, i.e. by resizing the full-resolution mask label to the resolution of the attention map.
$L_{ms}$ is the mask scoring loss of the Mask head and adopts the smooth-$L_1$ form:
$$L_{ms} = \mathrm{smooth}_{L_1}\!\left(s_{mask} - s^{*}_{mask}\right)$$
where $s_{mask}$ is the mask score predicted by the model and $s^{*}_{mask}$ is the true mask score, defined as the intersection-over-union between the generated mask and the true mask. $s^{*}_{mask}$ can be obtained through the following process:
if a candidate box has an intersection-over-union greater than 0.2 with a true horizontal text box, the true mask score of the candidate box is calculated as
$$s^{*}_{mask} = \mathrm{IoU}\!\left(\mathrm{bin}(M),\, M^{*}\right)$$
where the generated mask $M$ is first binarized, $\mathrm{bin}(M)$ denotes the binarized mask and $M^{*}$ denotes the true mask;
if a candidate box does not intersect any real text box, or its intersection-over-union with every real text box is less than 0.2, its true mask score $s^{*}_{mask}$ is directly set to 0.
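A minimal sketch of how the true mask score and the mask scoring loss could be computed is given below; the binarization threshold of 0.5 and the use of PyTorch's smooth-L1 loss function are illustrative assumptions, while the 0.2 box-IoU threshold comes from the text above.

```python
import torch
import torch.nn.functional as F

def target_mask_score(pred_mask, gt_mask, box_iou, box_iou_thresh=0.2, bin_thresh=0.5):
    """pred_mask, gt_mask: (H, W) tensors; gt_mask is binary; box_iou: IoU of the
    candidate box with its best-matching ground-truth horizontal text box."""
    if box_iou <= box_iou_thresh:
        return torch.tensor(0.0)                 # no sufficiently overlapping text box
    m = (pred_mask >= bin_thresh).float()        # binarize the generated mask
    inter = (m * gt_mask).sum()
    union = ((m + gt_mask) > 0).float().sum()
    return inter / union.clamp(min=1.0)          # IoU between binarized mask and true mask

pred = torch.rand(28, 28)
gt = (torch.rand(28, 28) > 0.5).float()
s_star = target_mask_score(pred, gt, box_iou=0.6)
s_pred = torch.tensor(0.55)
loss_ms = F.smooth_l1_loss(s_pred, s_star)       # mask scoring loss (smooth-L1 form)
print(s_star.item(), loss_ms.item())
```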
Fig. 2(a)-2(d) show three types of false positive samples and one true positive sample. The rectangular box represents the final detection box predicted by the model; the shaded portion in each box represents the segmentation mask of that detection box. Here cls-score is the classification score predicted by the model and ms-score is the mask score predicted by the model. Traditional methods screen candidate boxes only by the classification score, and the three types of false positive samples usually have high classification scores, so they are retained; the method proposed by the invention screens candidate boxes by both the classification score and the mask score: the mask scores of the three types of false positive samples are very low, so they can be filtered out, while both scores of the true positive sample are high, so it is retained in the final detection result.
Training process:
step (1): acquiring original pictures of a training set and original labels of text regions in each picture (generally, horizontal or multidirectional texts are labeled by quadrangles, and irregular texts are labeled by polygons), and generating mask labels and candidate frame labels of the text regions by using the original labels. Marking all pixels inside the quadrangle or the polygon as a text category (namely, marking the pixel value as 1), and marking all pixels outside the quadrangle or the polygon as a background category (namely, marking the pixel value as 0), and forming a text region mask; taking the minimum horizontal frame capable of surrounding the quadrangle or the polygon as a candidate frame label;
step (2): sending the pictures into the Backbone to extract features and constructing initial candidate boxes through the RPN;
step (3): the initial candidate box features and the expanded candidate box features are sent to the Box head and the Mask head to generate the classification score $s_{cls}$, the box offset $t$, the segmentation mask $M$ and the mask score $s_{mask}$; the MAM modules in the Mask head output the attention maps $A$;
And (4): calculating a loss function of the model, and optimizing the whole model through back propagation;
and (5): after the whole training set trains K epochs, the fixed model stores network parameters, and K is a positive integer in the range of 30-40.
The testing process comprises the following steps:
step (1): acquiring a picture to be tested;
step (2): sending the picture into the Backbone to extract features and constructing initial candidate boxes through the RPN;
step (3): the initial candidate box features are fed into the Box head to generate the classification score $s_{cls}$ and the box offset $t$, and $t$ is used to adjust the original candidate box;
step (4): the adjusted candidate box features and the expanded adjusted candidate box features are sent into the Mask head to generate the segmentation mask $M$ and the mask score $s_{mask}$;
step (5): non-maximum suppression is used to filter out duplicate candidate boxes; the classification score $s_{cls}$ and the mask score $s_{mask}$ are then used to calculate the candidate box score $s_{final}$, and candidate boxes with $s_{final}$ smaller than 0.5 are filtered out;
step (6): the largest connected region in the mask of each retained candidate box is selected as the final detection result.
Example two
The embodiment provides a natural scene text detection system with any shape;
an arbitrarily shaped natural scene text detection system, comprising:
an acquisition module configured to: acquiring a to-be-detected text image;
a detection module configured to: inputting the image to be detected into the trained detection model to obtain a final detection frame; carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the classification score and the mask score to obtain the final detection frame.
It should be noted that the above-mentioned acquiring module and detecting module correspond to steps S1 to S2 in the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiment is merely illustrative; the division of the above modules is merely a logical division, and in actual implementation there may be other divisions: for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. The method for detecting the text of the natural scene in any shape is characterized by comprising the following steps:
acquiring a to-be-detected text image;
inputting the image to be detected into the trained detection model to obtain a final detection frame, wherein the method specifically comprises the following steps:
carrying out feature extraction on the image to be detected;
constructing an initial candidate frame based on the extracted image features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame;
generating characteristics of the adjusted candidate frame for the adjusted candidate frame; expanding the adjusted candidate frame to obtain an expanded candidate frame; for the expansion candidate frame, generating an expansion candidate frame characteristic;
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score;
screening the adjusted candidate frame by the product of the classification score and the mask score to form a final detection frame;
carrying out post-processing on the obtained final detection frame to form a text area;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frame.
2. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the classification score of the candidate box is predicted based on the initial candidate box feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and the position of the initial candidate frame to obtain an adjusted candidate frame; the method specifically comprises the following steps:
reducing the dimension of the initial candidate frame features through two full-connection layers, and simultaneously and respectively sending the dimension-reduced features to a classification branch and a regression branch;
the classification branch is a full-connection layer with two-dimensional vector output, and a classification score is obtained by calculation according to the output of the classification branch;
the regression branch is a full connection layer with four-dimensional vector output, and the initial candidate frame is subjected to frame regression according to the output of the regression branch.
3. The method for detecting text in a natural scene with an arbitrary shape as set forth in claim 1,
generating a mask for the adjusted candidate frame based on the features of the adjusted candidate frame and the expanded candidate frame features; evaluating the mask quality to obtain a mask score; the method specifically comprises the following steps:
the adjusted features of the candidate box and the expanded features of the candidate box are input into a Mask module Mask head, and the Mask module Mask head comprises two workflows: mask generating stream and mask score stream;
a mask generation stream, which takes the adjusted candidate box characteristics as input and outputs a mask;
and the mask score stream takes the adjusted candidate box characteristics and the expanded candidate box characteristics as input and outputs a mask score.
4. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the model structure of the detection model comprises:
a skeleton network Backbone, which is used for receiving the input text image to be detected;
the output end of the Backbone network Backbone is connected with the input end of the candidate region generation network RPN;
the output end of the candidate region generation network RPN is connected with the input end of a RoIAlign layer; the output end of the RoIAlign layer is connected with the input end of the frame module Box head; the frame module Box head comprises two fully-connected layers which are connected in sequence;
the output end of the RoIAlign layer is also connected with the input end of the Mask head module.
5. The method for detecting the text of the natural scene with the arbitrary shape as claimed in claim 4, wherein the Mask head module comprises: two parallel working branches: a first branch and a second branch;
wherein the first branch includes: a convolutional layer C1 and a convolutional layer C2 connected in sequence; the input end of convolutional layer C1 is used for inputting the adjusted candidate frame feature;
wherein the second branch includes: a convolutional layer D1 and a convolutional layer D2 connected in sequence; the input end of convolutional layer D1 is used for inputting the concatenation of the adjusted candidate frame feature and the expanded candidate frame feature;
the output end of the convolutional layer C2 is connected to the first input end of the first mask attention module MAM;
the output end of the convolutional layer D2 is connected with the second input end of the first mask attention module;
the first output end of the first mask attention module is connected with the first input end of the second mask attention module;
the second output end of the first mask attention module is connected with the second input end of the second mask attention module;
a first output end of the second mask attention module is connected with an input end of the deconvolution layer, an output end of the deconvolution layer is connected with an input end of the convolution layer C3, and an output end of the convolution layer C3 generates a predicted mask;
the second output end of the second mask attention module is connected with the input end of the convolutional layer D3, the output end of the convolutional layer C3 is connected with the input end of the convolutional layer D3, the characteristics of the output end of the convolutional layer D3 are connected with three full-connection layers after being subjected to size adjustment, and the last full-connection layer outputs a mask score.
6. The method for detecting the text of the natural scene with the arbitrary shape as claimed in claim 4, wherein the Mask head module specifically works as follows:
the adjusted candidate box feature passes through two convolutional layers to form the mask generation stream feature;
the adjusted candidate box feature and the expanded candidate box feature are concatenated and passed through two convolutional layers to form the mask score stream feature;
the mask generation stream feature and the mask score stream feature are fed into the first mask attention module; the first mask attention module causes the mask score stream to focus on the regions contained in the mask;
the features of the two workflows pass through the second mask attention module;
the mask generation stream feature passes through a deconvolution layer and a convolutional layer to generate the predicted mask;
the mask score stream feature and the predicted mask are stacked and passed through a convolutional layer and three fully-connected layers to output the predicted mask score.
7. The method for detecting text in an arbitrarily-shaped natural scene as recited in claim 5, wherein the first masking attention module comprises:
convolutional layer E1; an input of the convolutional layer E1 is for connection with a first mask attention module first input; the output end of the convolutional layer E1 is used for being connected with a first output end of a first mask attention module;
a convolutional layer F1; an input of the convolutional layer F1 is for connection with a first mask attention module first input; the output end of the convolutional layer F1 is used for being connected with the input end of the multiplier;
the input end of the multiplier is also connected with the second input end of the first mask attention module; the output end of the multiplier is connected with the input end of the adder, and the input end of the adder is also connected with the second input end of the first mask attention module; the output of the adder is further adapted to be coupled to a second output of the first mask attention module via convolutional layer G1.
8. The method for detecting the text in the natural scene with the arbitrary shape as claimed in claim 1, wherein the training step of the trained detection model comprises:
constructing a training set, wherein the training set is an image of a known candidate frame label;
inputting the training set into the detection model, training the detection model,
carrying out feature extraction on the image of the known candidate frame tag;
constructing an initial candidate frame based on the extracted features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate box based on the initial candidate box feature; meanwhile, generating a four-dimensional regression bias vector for the initial candidate frame based on the characteristics of the initial candidate frame;
generating characteristics of the initial candidate frame for the initial candidate frame; expanding the initial candidate frame to obtain an expanded candidate frame; generating an expansion candidate frame characteristic for the expansion candidate frame;
generating a mask based on the initial candidate box feature and the expanded candidate box feature; evaluating the mask quality to obtain a mask score;
and calculating a loss function according to the generated classification score, the regression bias vector, the mask score and the generated attention map, and obtaining a trained candidate frame screening model by reversely propagating and optimizing network parameters.
9. A system for detecting arbitrary-shape natural scene text, characterized by comprising:
an acquisition module configured to: acquire a text image to be detected;
a detection module configured to: input the image to be detected into the trained detection model to obtain final detection frames, which specifically comprises:
performing feature extraction on the image to be detected;
constructing an initial candidate frame based on the extracted image features;
generating an initial candidate frame feature based on the initial candidate frame; predicting a classification score of the candidate frame based on the initial candidate frame feature; meanwhile, performing frame regression on the initial candidate frame, and adjusting the size and position of the initial candidate frame to obtain an adjusted candidate frame;
generating an adjusted candidate frame feature for the adjusted candidate frame; expanding the adjusted candidate frame to obtain an expanded candidate frame; generating an expanded candidate frame feature for the expanded candidate frame;
generating a mask for the adjusted candidate frame based on the adjusted candidate frame feature and the expanded candidate frame feature; evaluating the mask quality to obtain a mask score;
screening the adjusted candidate frames by the product of the classification score and the mask score to form the final detection frames;
performing post-processing on the obtained final detection frames to form text regions;
and the detection model screens the candidate detection frames through the product of the classification score and the mask score to obtain the final detection frames.
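The screening step in claim 9 keeps an adjusted candidate frame only when the product of its classification score and mask score is high enough. A minimal sketch, assuming NumPy arrays and a hypothetical threshold value:

```python
import numpy as np

def screen_candidates(boxes, cls_scores, mask_scores, threshold=0.5):
    """Keep candidate frames whose combined score
    (classification score x mask score) reaches the threshold;
    the threshold of 0.5 is an assumption, not taken from the patent."""
    combined = np.asarray(cls_scores) * np.asarray(mask_scores)
    keep = combined >= threshold
    kept_boxes = [box for box, flag in zip(boxes, keep) if flag]
    return kept_boxes, combined[keep]
```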
CN202110715820.9A 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape Active CN113255669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110715820.9A CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Publications (2)

Publication Number Publication Date
CN113255669A CN113255669A (en) 2021-08-13
CN113255669B (en) 2021-10-01

Family

ID=77189947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110715820.9A Active CN113255669B (en) 2021-06-28 2021-06-28 Method and system for detecting text of natural scene with any shape

Country Status (1)

Country Link
CN (1) CN113255669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN111754531A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Image instance segmentation method and device
CN111950545A (en) * 2020-07-23 2020-11-17 南京大学 Scene text detection method based on MSNDET and space division
CN112183545A (en) * 2020-09-29 2021-01-05 佛山市南海区广工大数控装备协同创新研究院 Method for recognizing natural scene text in any shape
CN112446356A (en) * 2020-12-15 2021-03-05 西北工业大学 Method for detecting text with any shape in natural scene based on multiple polar coordinates

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275308B2 (en) * 2013-05-31 2016-03-01 Google Inc. Object detection using deep neural networks
JP2020181255A (en) * 2019-04-23 2020-11-05 国立大学法人 東京大学 Image analysis device, image analysis method, and image analysis program
CN110287960B (en) * 2019-07-02 2021-12-10 中国科学院信息工程研究所 Method for detecting and identifying curve characters in natural scene image
CN110895695B (en) * 2019-07-31 2023-02-24 上海海事大学 Deep learning network for character segmentation of text picture and segmentation method
CN110807422B (en) * 2019-10-31 2023-05-23 华南理工大学 Natural scene text detection method based on deep learning
CN112749704A (en) * 2019-10-31 2021-05-04 北京金山云网络技术有限公司 Text region detection method and device and server
CN112183322B (en) * 2020-09-27 2022-07-19 成都数之联科技股份有限公司 Text detection and correction method for any shape
CN112163634B (en) * 2020-10-14 2023-09-05 平安科技(深圳)有限公司 Sample screening method and device for instance segmentation model, computer equipment and medium
AU2020103585A4 (en) * 2020-11-20 2021-02-04 Sonia Ahsan CDN- Object Detection System: Object Detection System with Image Classification and Deep Neural Networks
CN112861855A (en) * 2021-02-02 2021-05-28 华南农业大学 Group-raising pig instance segmentation method based on confrontation network model
CN112989927B (en) * 2021-02-03 2024-03-05 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training

Also Published As

Publication number Publication date
CN113255669A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110348445B (en) Instance segmentation method fusing void convolution and edge information
Varghese et al. ChangeNet: A deep learning architecture for visual change detection
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
CN101971190B (en) Real-time body segmentation system
CN111461212B (en) Compression method for point cloud target detection model
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
Aggarwal et al. A robust method to authenticate car license plates using segmentation and ROI based approach
CN111767927A (en) Lightweight license plate recognition method and system based on full convolution network
CN114648665A (en) Weak supervision target detection method and system
CN112990282B (en) Classification method and device for fine-granularity small sample images
CN103198479A (en) SAR image segmentation method based on semantic information classification
CN110705412A (en) Video target detection method based on motion history image
CN112287941A (en) License plate recognition method based on automatic character region perception
CN112507876A (en) Wired table picture analysis method and device based on semantic segmentation
CN113255669B (en) Method and system for detecting text of natural scene with any shape
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
Tao et al. Contour-based smoky vehicle detection from surveillance video for alarm systems
CN113763364B (en) Image defect detection method based on convolutional neural network
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
CN111832390B (en) Handwritten ancient character detection method
CN113496480A (en) Method for detecting weld image defects
CN110363198B (en) Neural network weight matrix splitting and combining method
CN111178275A (en) Fire detection method based on convolutional neural network
Li et al. An improved PCB defect detector based on feature pyramid networks
CN117115824A (en) Visual text detection method based on stroke region segmentation strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant