CN113255679A - Text detection method, device, medium and electronic equipment - Google Patents

Text detection method, device, medium and electronic equipment

Publication number
CN113255679A
CN113255679A
Authority
CN
China
Prior art keywords
feature map
frames
text
candidate
rectangular frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110676649.5A
Other languages
Chinese (zh)
Other versions
CN113255679B (en)
Inventor
欧阳世壮
刘霄
熊泽法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110676649.5A
Publication of CN113255679A
Application granted
Publication of CN113255679B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/1475Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a text detection method, a text detection device, a medium and an electronic device. The text detection method is applied to a Mask RCNN model in which the region proposal network RPN is replaced by a segmentation network, and comprises the following steps: acquiring a first feature map of an original image containing text; obtaining a first segmentation result of the first feature map through the segmentation network, wherein the first segmentation result comprises a plurality of polygonal frames surrounding text regions; post-processing the polygonal frames to obtain a plurality of corrected rectangular frames; extracting second feature maps of the corrected rectangular frames, and multiplying the second feature maps by the corresponding local feature maps in the first feature map to obtain a third feature map; and obtaining, through the model, a detection result and a second segmentation result for the third feature map, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises the feature map surrounding the text region in each candidate rectangular frame. Embodiments of the invention can more accurately detect and recognize distorted, tilted and similar text in images.

Description

Text detection method, device, medium and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of image processing, and in particular, to a text detection method, a text detection device, a computer-readable storage medium for implementing the text detection method, and an electronic device.
Background
In the face of large volumes of image data, how to efficiently and automatically acquire useful text information from images has become a research hotspot.
In the related art, image text detection schemes adopt a two-stage object detection network, such as the Faster Region-based Convolutional Neural Network (Faster RCNN), which can improve the efficiency of automatic text-information detection to a certain extent.
However, current image text detection schemes fail to consider scenes in which the text in the image is tilted or distorted, so the accuracy and efficiency of image text detection and recognition in such scenes are low.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide a text detection method, a text detection apparatus, and a computer-readable storage medium and an electronic device implementing the text detection method.
In a first aspect, an embodiment of the present disclosure provides a text detection method applied to an instance segmentation Mask RCNN model, where the region proposal network RPN in the Mask RCNN model is replaced by a segmentation network, and the method includes:
acquiring a first feature map of an original image, wherein the original image at least comprises text;
obtaining a first segmentation result of the first feature map through the segmentation network, wherein the first segmentation result at least comprises a plurality of polygon frames surrounding a text region;
carrying out post-processing on the polygonal frames to obtain a plurality of corresponding correction rectangular frames;
extracting second feature maps of the plurality of correction rectangular frames, and multiplying the plurality of second feature maps by a plurality of local feature maps in the first feature map to obtain a third feature map, wherein each local feature map corresponds to one second feature map;
and obtaining a detection result of the third feature map and a second segmentation result through the Mask RCNN model, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises feature maps surrounding text regions in each candidate rectangular frame.
In some embodiments of the present disclosure, the obtaining, by the segmentation network, a first segmentation result of the first feature map includes:
performing reduction processing on the first feature map based on a preset reduction scale value;
and acquiring a first segmentation result of the reduced first feature map through the segmentation network.
In some embodiments of the present disclosure, after performing post-processing on the plurality of polygon frames to obtain a plurality of corresponding modified rectangular frames, the method further includes:
amplifying the plurality of correction rectangular frames according to a preset amplification ratio value, wherein the preset amplification ratio value is related to the preset reduction ratio value;
the extracting of the second feature maps of the plurality of the correction rectangular frames includes:
and extracting second feature maps of the amplified plurality of correction rectangular frames.
In some embodiments of the present disclosure, the first segmentation result further includes a plurality of rectangular frames and a first confidence degree corresponding to each of the rectangular frames and the polygonal frames, the detection result further includes a second confidence degree corresponding to each of the candidate rectangular frames, the total number of the candidate rectangular frames is the same as the sum of the numbers of the rectangular frames and the polygonal frames, and the method further includes:
determining a final confidence coefficient of each candidate rectangular frame based on a first confidence coefficient corresponding to each rectangular frame and the polygonal frame and a second confidence coefficient corresponding to each candidate rectangular frame;
and determining a target candidate rectangular frame meeting a preset confidence degree condition and a feature map surrounding a text region in the target candidate rectangular frame based on the final confidence degree of each candidate rectangular frame.
In some embodiments of the present disclosure, the determining a final confidence level of each of the candidate rectangular frames based on the first confidence level corresponding to each of the rectangular frames and the polygonal frames and the second confidence level corresponding to each of the candidate rectangular frames includes:
when the second confidence coefficient corresponding to each candidate rectangular frame is zero, the final confidence coefficient of each candidate rectangular frame is zero;
when the second confidence coefficient corresponding to each candidate rectangular frame is larger than zero, calculating the average value of the second confidence coefficient of each candidate rectangular frame and the first confidence coefficient of the corresponding rectangular frame or polygonal frame;
and taking each calculated average value as the final confidence of each corresponding candidate rectangular frame.
In some embodiments of the present disclosure, the determining, based on the final confidence of each candidate rectangular box, a target candidate rectangular box that satisfies a preset confidence condition includes:
filtering the candidate rectangular frames with the final confidence coefficient smaller than the preset confidence coefficient in each candidate rectangular frame to obtain candidate rectangular frames to be selected;
and screening based on the final confidence of the candidate rectangular frame to be selected through a non-maximum suppression algorithm to obtain a target candidate rectangular frame.
In some embodiments of the present disclosure, the segmentation network comprises at least a differentiable binarization segmentation network.
In a second aspect, an embodiment of the present disclosure provides a text detection apparatus applied to an instance segmentation Mask RCNN model, where the region proposal network RPN in the Mask RCNN model is replaced by a segmentation network, and the apparatus includes:
a feature extraction module, configured to acquire a first feature map of an original image, wherein the original image at least comprises text;
a segmentation module, configured to obtain a first segmentation result of the first feature map through the segmentation network, where the first segmentation result includes at least a plurality of polygon frames surrounding a text region;
the correction module is used for carrying out post-processing on the polygonal frames to obtain a plurality of corresponding correction rectangular frames;
the feature processing module is configured to extract second feature maps of the plurality of modified rectangular frames, and multiply the plurality of second feature maps with a plurality of local feature maps in the first feature map to obtain a third feature map, where each local feature map corresponds to one second feature map;
and a result determining module, configured to obtain a detection result of the third feature map and a second segmentation result through the Mask RCNN model, where the detection result includes multiple candidate rectangular frames, and the second segmentation result includes a feature map surrounding a text region in each candidate rectangular frame.
In a third aspect, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text detection method described in any of the above embodiments.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including:
a processor; and
a memory for storing a computer program;
wherein the processor is configured to perform the steps of the text detection method of any of the above embodiments via execution of the computer program.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the text detection method, the text detection device, the text detection medium and the electronic device provided by the embodiment of the disclosure are characterized in that a region suggestion network RPN in a Mask RCNN model is replaced by a segmentation network, a first segmentation result of a first feature map of an original image is obtained by the segmentation network in a first stage, the first segmentation result at least comprises a plurality of polygon frames surrounding a text region, the polygon frames are subjected to post-processing to obtain a plurality of corresponding correction rectangular frames, a second feature map of the correction rectangular frames is extracted, the plurality of second feature maps are multiplied by a plurality of local feature maps in the first feature map to obtain a third feature map, and each local feature map corresponds to one second feature map, namely, the spatial position corresponds to each other; and during the second-stage detection, obtaining a detection result and a second segmentation result of the third feature map through the Mask RCNN model, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises feature maps surrounding text areas in each candidate rectangular frame. Thus, in this embodiment, the segmentation network is combined with the conventional Mask RCNN model, and the segmentation network is used to replace the RPN part in the Mask RCNN model, because the current RPN stage can only obtain a rectangular frame and cannot well support the detection of the tilted and warped text, after the replacement in this embodiment, when detecting, the first-stage detection segmentation network can obtain a polygonal frame surrounding the text region, which can well support the detection of the tilted and warped text. Meanwhile, in order to adapt to the detection of the second stage in the Mask RCNN model, the polygonal frame obtained by the detection of the first stage is subjected to post-processing to obtain a corresponding correction rectangular frame, a second feature map of the correction rectangular frame is extracted, the second feature map is multiplied by the first feature map of the original image to obtain a third feature map, and then the detection of the second stage is performed, so that the attention (attention) mechanism can be played, the accuracy of the Mask RCNN model on the detection and identification of the image text in scenes such as character inclination and distortion is high as a whole, and the processing efficiency is greatly improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it is obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a text detection method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a Mask RCNN model text detection architecture after RPN is replaced according to the embodiment of the present disclosure;
FIG. 3 is a diagram illustrating the detection result and segmentation result of the second stage of the detection in the embodiment of the disclosure;
FIG. 4 is a schematic view of a text detection device according to an embodiment of the disclosure;
fig. 5 is a schematic view of an electronic device implementing a text detection method according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
It is to be understood that, hereinafter, "at least one" means one or more, "a plurality" means two or more. "and/or" is used to describe the association relationship of the associated objects, meaning that there may be three relationships, for example, "a and/or B" may mean: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
Fig. 1 provides a text detection method for an instance segmentation Mask RCNN (Region-based Convolutional Neural Network) model. The method is applied to the Mask RCNN model shown in fig. 2, where the region proposal network RPN (Region Proposal Network) in the Mask RCNN model is replaced by a segmentation network; the remaining parts of the model may be understood with reference to the prior art and will not be described herein again. The method may comprise the following steps:
step S101: the method comprises the steps of obtaining a first feature map of an original image, wherein the original image at least comprises a text.
Illustratively, the original image may be obtained by, for example, a photographing function of a mobile device such as a mobile phone or a scanning function of a scanner, and the original image may be, for example, an image of a page of a book, but is not limited thereto. After the original image is obtained, the original image may be input into the feature extraction network in the embodiment shown in fig. 2, and the feature extraction network may be formed by a convolutional neural network, but is not limited thereto. In some embodiments, the feature extraction network may also be part of the Mask RCNN model, but is not so limited. The feature extraction network may extract a feature map (feature map), i.e., a first feature map, of the original image.
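As an illustration only, the following is a minimal Python/PyTorch sketch of step S101. The ResNet-50 backbone, input size and variable names are assumptions for the sketch; the patent only requires a feature extraction network formed by a convolutional neural network.

```python
import torch
import torchvision

# Feature extraction network: a CNN backbone with the classification head
# removed. ResNet-50 is an assumed choice, not specified by the patent.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 800, 800)           # original image, e.g. a photographed page
first_feature_map = feature_extractor(image)  # the first feature map of step S101
print(first_feature_map.shape)                # torch.Size([1, 2048, 25, 25])
```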
Step S102: obtaining a first segmentation result of the first feature map through the segmentation network, the first segmentation result including at least a plurality of polygon boxes surrounding a text region.
Optionally, in some embodiments of the present disclosure, the segmentation network at least includes a Differentiable Binarization segmentation network, referred to as a DB segmentation network for short. The DB segmentation network classifies on the basis of pixel points: it determines whether each pixel point in the whole image belongs to text, and segments out the region of each piece of text through post-processing; these specific processes can be understood with reference to the prior art and are not described herein again. Because the DB segmentation network classifies in units of pixel points, unlike traditional object detection methods that detect in units of whole objects, it can better solve the detection problems of tilted and distorted text.
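As background from the DB literature (this formula is not part of the patent text): DB-style networks make the binarization step differentiable by replacing the hard threshold with a scaled sigmoid, so that the threshold map can be learned end-to-end:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$

where $P$ is the predicted probability map, $T$ is the learned threshold map, and $k$ is an amplification factor.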
Specifically, in one embodiment, during the first-stage detection, the DB segmentation network may obtain a first segmentation result of the first feature map, where the first segmentation result may include at least a plurality of polygonal frames surrounding text regions, and each polygonal frame may surround exactly one line of text or character region, such as a tilted or distorted line. The polygonal frame surrounding a text region is generally non-rectangular.
Step S103: and carrying out post-processing on the plurality of polygonal frames to obtain a plurality of corresponding correction rectangular frames.
Specifically, in order to adapt to the subsequent second-stage detection in the Mask RCNN model, the polygonal frame obtained by the first-stage detection needs to be post-processed to obtain a corresponding modified rectangular frame.
Exemplary post-processing may specifically include: obtaining a feature map of the polygonal frame; converting the feature map into a binary map, such as a black-and-white map marked with '0' and '1'; computing the connected domains of the binary map; obtaining an approximate outline of each connected domain in order to reduce the amount of computation; and then expanding the outline to obtain a corrected rectangular frame of the text region, such as the region marked with '1'.
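A minimal sketch of this post-processing chain (binary map, connected domains, approximate outline, expansion), assuming OpenCV and a probability-map input; the threshold and expansion margin are illustrative values, not taken from the patent:

```python
import cv2
import numpy as np

def polygons_to_corrected_rects(seg_prob, thresh=0.3, expand=0.1):
    """Step S103 post-processing sketch: binarize the segmentation output
    into a '0'/'1' map, extract connected domains, approximate their
    outlines to cut computation, then expand the outlines into corrected
    rectangular frames."""
    binary = (seg_prob > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    rects = []
    for contour in contours:
        # Approximate outline of the connected domain (reduces computation).
        approx = cv2.approxPolyDP(contour, 0.01 * cv2.arcLength(contour, True), True)
        x, y, w, h = cv2.boundingRect(approx)
        m = int(expand * min(w, h)) + 1          # expansion margin (assumed)
        rects.append((x - m, y - m, w + 2 * m, h + 2 * m))
    return rects
```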
Step S104: and extracting second feature maps of the plurality of correction rectangular frames, and multiplying the plurality of second feature maps and a plurality of local feature maps in the first feature map to obtain a third feature map, wherein each local feature map corresponds to one second feature map.
Illustratively, the local feature map is a feature map of a local region in the first feature map, and each second feature map has a local feature map of a corresponding one of the spatial location regions in the first feature map.
Specifically, after the post-processing, the second feature map of each corrected rectangular frame may be extracted and multiplied by the local feature map at the corresponding spatial position in the first feature map of the original image to obtain a third feature map, which serves as an attention mechanism in the subsequent second-stage detection. The attention mechanism selects key information for processing, such as the feature information of tilted or distorted character parts, thereby improving the information processing efficiency of the neural network and making image text detection and recognition more accurate in scenes with tilted or distorted characters.
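A minimal sketch of the multiplication in step S104, assuming the first feature map is a [1, C, H, W] tensor and the second feature map of each corrected rectangular frame is the first-stage segmentation probability cropped to that frame (an assumed concrete form, not fixed by the patent):

```python
import torch

def box_attention(first_fmap, seg_prob, rects):
    """Step S104 sketch: multiply each second feature map with the local
    feature map at the same spatial position in the first feature map,
    yielding the third feature maps that feed the second-stage detection
    (the attention mechanism described above)."""
    third_fmaps = []
    for (x, y, w, h) in rects:
        local = first_fmap[:, :, y:y + h, x:x + w]   # local feature map
        second = seg_prob[:, :, y:y + h, x:x + w]    # second feature map
        third_fmaps.append(local * second)           # [1,1,h,w] broadcasts over channels
    return third_fmaps
```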
Step S105: and obtaining a detection result of the third feature map and a second segmentation result through the Mask RCNN model, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises feature maps surrounding text regions in each candidate rectangular frame.
Specifically, as shown in fig. 2, the Feature Pyramid Network FPN in the Mask RCNN model may obtain feature information of a plurality of third feature maps corresponding to the plurality of second feature maps, scale the third feature maps to a fixed size through a pooling layer such as ROI (Region Of Interest) Align or ROI Pooling, and perform classification and regression to obtain the detection result, while the Mask branch obtains the segmentation result. The segmentation result is usually overlaid semi-transparently on the object to be segmented, such as a line of text regions, and is also referred to as a mask. Illustratively, the feature map 302 surrounding a text region, such as the line of characters "1234567", within the candidate rectangular frame 301 shown in fig. 3 is embodied as a mask; for example, a line of character regions is surrounded by feature maps of different colors, as in the transparency map in fig. 3.
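For the pooling step just mentioned, a sketch using torchvision's ROI Align; the 7x7 output size is a common Mask RCNN default but is an assumption here, not stated in the patent:

```python
import torch
from torchvision.ops import roi_align

fmap = torch.randn(1, 256, 50, 50)                  # one FPN feature level
# Boxes given as (batch_index, x1, y1, x2, y2) in feature-map coordinates.
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 12.0]])
pooled = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)                                 # torch.Size([1, 256, 7, 7])
```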
It should be noted that, the specific processing procedure of the second-stage detection may be understood by referring to the prior art, and is not described herein again. In this embodiment, the segmentation result in the first stage is post-processed, the second feature map of the modified rectangular frame after the post-processing is extracted, the second feature map is multiplied by the first feature map of the original image to obtain a third feature map, and then the second-stage detection is performed based on the third feature map, that is, an attention (attention) mechanism is added during the second-stage detection.
In this embodiment, a segmentation network is combined with the traditional Mask RCNN model and used to replace its RPN part. Because the RPN stage can only produce rectangular frames, it cannot properly support the detection of tilted and warped text; after the replacement, the first-stage segmentation network can produce polygonal frames surrounding the text regions, which supports the detection of tilted and warped text well. Meanwhile, to fit the second-stage detection in the Mask RCNN model, the polygonal frames obtained in the first-stage detection are post-processed into corresponding corrected rectangular frames, the second feature maps of the corrected rectangular frames are extracted and multiplied by the corresponding local feature maps in the first feature map of the original image to obtain the third feature map, and the second-stage detection is then performed. This serves as an attention mechanism, so that the Mask RCNN model as a whole detects and recognizes image text more accurately in scenes with tilted or distorted characters, while greatly improving processing efficiency.
Optionally, in some embodiments of the present disclosure, the obtaining, in step S102, the first segmentation result of the first feature map through the segmentation network may specifically include the following steps:
step 1): and carrying out reduction processing on the first feature map based on a preset reduction scale value.
Step 2): and acquiring a first segmentation result of the reduced first feature map through the segmentation network.
Specifically, the preset reduction scale value may be, for example, 1/16, which is not limited in this embodiment and may be set according to specific needs. After step 2), the flow continues with step S103. Since a segmentation network such as the DB segmentation network classifies every pixel point, and considering the balance between the amount of computation and the segmentation effect, the DB segmentation network in this embodiment can reduce the first feature map to, for example, 1/16 of the original image for segmentation; this reduces the amount of computation, greatly reduces the processing time of the segmentation network, and still obtains good polygonal-frame text detection results.
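A sketch of the reduction step; the bilinear mode is an assumed detail, and the linear factor of 1/4 per side (i.e. 1/16 by area) is one reading that reconciles the "1/16" reduction here with the 4x coordinate enlargement mentioned later:

```python
import torch
import torch.nn.functional as F

first_feature_map = torch.randn(1, 256, 100, 100)
# Shrink the first feature map before segmentation, trading resolution
# for much cheaper per-pixel classification in the DB segmentation network.
reduced = F.interpolate(first_feature_map, scale_factor=0.25,
                        mode="bilinear", align_corners=False)
print(reduced.shape)  # torch.Size([1, 256, 25, 25])
```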
Optionally, on the basis of the foregoing embodiment, because reduction processing is performed on the first feature map, in some embodiments of the present disclosure, after post-processing the plurality of polygonal frames in step S103 to obtain the plurality of corresponding corrected rectangular frames, the method may further include the following steps:
step a): and amplifying the plurality of correction rectangular frames according to a preset amplification ratio value, wherein the preset amplification ratio value is related to the preset reduction ratio value.
For example, the preset reduction scale value is 1/16, and the preset enlargement scale value is 16, but not limited thereto.
The extracting of the second feature maps of the plurality of the correction rectangular frames includes:
step b): and extracting second feature maps of the amplified plurality of correction rectangular frames.
Specifically, as an example, in the above embodiment the first feature map was reduced to, for example, 1/16 of the original image in order to reduce the amount of computation; the width x values and height y values of the corrected rectangular frames obtained here may therefore be multiplied by 4 to enlarge them onto the corresponding target regions of the original image. The second feature maps of the enlarged corrected rectangular frames are then extracted.
Optionally, in some embodiments of the present disclosure, in the first-stage detection, the first segmentation result obtained by the DB segmentation network may further include a plurality of rectangular frames and a first confidence corresponding to each of the rectangular frames and polygonal frames, where the first confidence may be a value between zero and one, such as 0.1 or 0.2. Each rectangular frame corresponds to one polygonal frame, and the polygonal frame is typically located within the corresponding rectangular frame. In the second-stage detection, the obtained detection result may further include a second confidence corresponding to each candidate rectangular frame, and the total number of candidate rectangular frames is the same as the sum of the numbers of rectangular frames and polygonal frames; that is, the total number of detection frames in the first-stage detection is the same as that in the second-stage detection, which may be understood in combination with the prior art and is not described here again. In a scene with tilted or curved characters, the main body of the characters occupies only a small part of the rectangular frame, so traditional detection tends to give tilted or curved text a low confidence. To mitigate this problem and further improve the accuracy of image text detection and recognition, the method may further include the following steps:
step i): and determining the final confidence coefficient of each candidate rectangular frame based on the first confidence coefficient corresponding to each rectangular frame and the polygonal frame and the second confidence coefficient corresponding to each candidate rectangular frame.
That is to say, in this embodiment, the first confidence degrees corresponding to the rectangular frame and the corresponding polygonal frame obtained in the first-stage detection and the second confidence degrees corresponding to the candidate rectangular frames obtained in the second-stage detection may be combined to determine the final confidence degrees of the candidate rectangular frames obtained in the second-stage detection.
Step ii): and determining a target candidate rectangular frame meeting a preset confidence degree condition and a feature map surrounding a text region in the target candidate rectangular frame based on the final confidence degree of each candidate rectangular frame.
Specifically, after the final confidence of the candidate rectangular frame obtained by the second-stage detection is determined, the final target candidate rectangular frame and the feature map surrounding the text region in the target candidate rectangular frame are obtained by screening based on the preset confidence condition.
Therefore, in the embodiment, the final confidence coefficient is determined based on the comprehensive confidence coefficients of the detection frames obtained by the first-stage detection and the second-stage detection, and then screening is performed, so that the confidence coefficient of the oblique curved text can be improved to a certain extent, and the accuracy of image text detection and identification is further improved.
Optionally, in some embodiments of the present disclosure, in step i), the final confidence of each candidate rectangular frame is determined based on the first confidence corresponding to each rectangular frame and the polygonal frame, and the second confidence corresponding to each candidate rectangular frame, which may specifically include the following steps:
step c): and when the second confidence degree corresponding to each candidate rectangular frame is zero, the final confidence degree of each candidate rectangular frame is zero.
Step d): and when the second confidence degree corresponding to each candidate rectangular frame is larger than zero, calculating the average value of the second confidence degree of each candidate rectangular frame and the first confidence degree of the corresponding rectangular frame or polygonal frame.
Step e): and taking each calculated average value as the final confidence of each corresponding candidate rectangular frame.
By way of example, the above embodiments may be understood in conjunction with the following formula, reconstructed here in standard notation from steps c) to e) above (the original equation images are not reproduced):

$$s_i^{\mathrm{final}} = \begin{cases} 0, & \text{if } s_i^{(2)} = 0 \\ \dfrac{s_i^{(1)} + s_i^{(2)}}{2}, & \text{if } s_i^{(2)} > 0 \end{cases}$$

wherein $i$ denotes the $i$-th detection frame, $s_i^{(1)}$ is the confidence of the $i$-th detection frame (rectangular frame or corresponding polygonal frame) in the first-stage detection, i.e. the first confidence, $s_i^{(2)}$ is the confidence of the $i$-th detection frame (e.g., the candidate rectangular frame) in the second-stage detection, i.e. the second confidence, and $s_i^{\mathrm{final}}$ is the final confidence of each candidate rectangular frame.
In this embodiment, when the final confidence is determined based on the combination of the confidence of the detection frames obtained by the first-stage detection and the second-stage detection, the final confidence can be obtained through the above calculation method, and then subsequent screening is performed, so that the confidence of the oblique and curved text can be improved to a certain extent, and the accuracy of image text detection and identification can be further improved.
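The fusion rule of steps c) to e) is simple enough to state directly in code (a sketch; the function and variable names are illustrative):

```python
def final_confidence(first_conf, second_conf):
    """Steps c)-e): zero if the second-stage confidence is zero, otherwise
    the average of the first- and second-stage confidences."""
    if second_conf == 0:
        return 0.0
    return (first_conf + second_conf) / 2.0

assert final_confidence(0.8, 0.0) == 0.0
assert final_confidence(0.8, 0.6) == 0.7  # average of the two confidences
```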
Optionally, in some embodiments of the present disclosure, the determining, in step ii), a target rectangular candidate box that satisfies a preset confidence condition based on the final confidence of each rectangular candidate box may specifically include the following steps: filtering the candidate rectangular frames with the final confidence coefficient smaller than the preset confidence coefficient in each candidate rectangular frame to obtain candidate rectangular frames to be selected; and screening based on the final confidence of the candidate rectangular frame to be selected through a non-maximum suppression algorithm to obtain a target candidate rectangular frame.
Specifically, the preset confidence may be set as needed, for example to 0.01, which is not limited in this embodiment. After the final confidence of each candidate rectangular frame is obtained, candidate rectangular frames with low confidence can be filtered out based on the preset confidence, and the final target candidate rectangular frames are then obtained by screening with the non-maximum suppression algorithm according to the confidences and the intersection-over-union of the rectangular frames. The essence of the Non-Maximum Suppression (NMS) algorithm is to search for local maxima and suppress non-maximum elements. The algorithm is widely applied in the fields of object detection and object localization; here it aims to eliminate redundant candidate rectangular frames and keep the best remaining text candidate rectangular frames. The specific calculation process can be understood with reference to the prior art and is not described herein again.
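A sketch of this screening, assuming torchvision's NMS implementation; the IoU threshold is an assumed parameter that the patent does not specify:

```python
import torch
from torchvision.ops import nms

def select_target_boxes(boxes, final_confs, conf_thresh=0.01, iou_thresh=0.5):
    """Step ii) sketch: filter out candidate rectangular frames whose final
    confidence is below the preset confidence, then screen the remaining
    frames with non-maximum suppression to remove redundant frames."""
    keep = final_confs >= conf_thresh               # confidence filtering
    boxes, final_confs = boxes[keep], final_confs[keep]
    kept = nms(boxes, final_confs, iou_thresh)      # NMS screening
    return boxes[kept], final_confs[kept]
```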
The foregoing solution of the embodiments of the present disclosure is a tilted/warped text detection scheme in which the RPN of a two-stage detection network is replaced by a segmentation network. The scheme combines the strengths of segmentation networks in current character detection with the strong detection performance of two-stage networks in the object detection field. Exploiting the second-pass detection mechanism of the two-stage network, the feature map of the first-stage segmentation network can be deliberately reduced in size, enhancing the two-stage network's detection of characters in tilted, distorted and dense scenes without introducing extra time cost. Meanwhile, the attention mechanism is used to strengthen the two-stage detection network's detection of tilted and distorted characters.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc. Additionally, it will also be readily appreciated that the steps may be performed synchronously or asynchronously, e.g., among multiple modules/processes/threads.
Based on the same inventive concept, an embodiment of the present disclosure provides a text detection apparatus applied to an instance segmentation Mask RCNN model, where the region proposal network RPN in the Mask RCNN model is replaced by a segmentation network. As shown in fig. 4, the apparatus includes:
a feature extraction module 401, configured to obtain a first feature map of an original image, where the original image at least includes a text;
a segmentation module 402, configured to obtain a first segmentation result of the first feature map through the segmentation network, where the first segmentation result includes at least a plurality of polygon boxes surrounding a text region;
a correction module 403, configured to perform post-processing on the multiple polygonal frames to obtain multiple corresponding corrected rectangular frames;
a feature processing module 404, configured to extract second feature maps of the plurality of corrected rectangular frames, and multiply the plurality of second feature maps with a plurality of local feature maps in the first feature map to obtain a third feature map, where each local feature map corresponds to one second feature map;
a result determining module 405, configured to obtain a detection result of the third feature map and a second segmentation result through the Mask RCNN model, where the detection result includes a plurality of candidate rectangular frames, and the second segmentation result includes a feature map surrounding a text region in each candidate rectangular frame.
In this embodiment, a segmentation network is combined with the traditional Mask RCNN model and used to replace its RPN part. Because the RPN stage can only produce rectangular frames, it cannot properly support the detection of tilted and warped text; after the replacement, the first-stage segmentation network can produce polygonal frames surrounding the text regions, which supports the detection of tilted and warped text well. Meanwhile, to fit the second-stage detection in the Mask RCNN model, the polygonal frames obtained in the first-stage detection are post-processed into corresponding corrected rectangular frames, the second feature maps of the corrected rectangular frames are extracted and multiplied by the corresponding local feature maps in the first feature map of the original image to obtain the third feature map, and the second-stage detection is then performed. This serves as an attention mechanism, so that the Mask RCNN model as a whole detects and recognizes image text more accurately in scenes with tilted or distorted characters, while greatly improving processing efficiency.
In some embodiments of the disclosure, the obtaining, by the segmentation module, the first segmentation result of the first feature map through the segmentation network may specifically include: performing reduction processing on the first feature map based on a preset reduction scale value; and acquiring a first segmentation result of the reduced first feature map through the segmentation network.
In some embodiments of the disclosure, after the correction module performs post-processing on the plurality of polygonal frames to obtain the plurality of corresponding corrected rectangular frames, the apparatus may further include an enlargement module configured to enlarge the plurality of corrected rectangular frames according to a preset enlargement scale value, where the preset enlargement scale value is related to the preset reduction scale value. Correspondingly, the extraction, by the feature processing module, of the second feature maps of the plurality of corrected rectangular frames may specifically include: extracting second feature maps of the enlarged plurality of corrected rectangular frames.
Optionally, in some embodiments of the present disclosure, the first segmentation result may further include a plurality of rectangular frames and a first confidence level corresponding to each of the rectangular frames and the polygonal frames, and the detection result may further include a second confidence level corresponding to each of the candidate rectangular frames, where the total number of the candidate rectangular frames is the same as the sum of the numbers of the rectangular frames and the polygonal frames. Accordingly, the result determination module may be further configured to: determining a final confidence coefficient of each candidate rectangular frame based on a first confidence coefficient corresponding to each rectangular frame and the polygonal frame and a second confidence coefficient corresponding to each candidate rectangular frame; and determining a target candidate rectangular frame meeting a preset confidence degree condition and a feature map surrounding a text region in the target candidate rectangular frame based on the final confidence degree of each candidate rectangular frame.
Optionally, in some embodiments of the present disclosure, the determining the final confidence of each candidate rectangular frame by the result determining module based on the first confidence corresponding to each rectangular frame and the polygonal frame and the second confidence corresponding to each candidate rectangular frame may specifically include:
when the second confidence coefficient corresponding to each candidate rectangular frame is zero, the final confidence coefficient of each candidate rectangular frame is zero;
when the second confidence coefficient corresponding to each candidate rectangular frame is larger than zero, calculating the average value of the second confidence coefficient of each candidate rectangular frame and the first confidence coefficient of the corresponding rectangular frame or polygonal frame;
and taking each calculated average value as the final confidence of each corresponding candidate rectangular frame.
Optionally, in some embodiments of the present disclosure, the determining, by the result determining module, a target rectangular candidate box that satisfies a preset confidence condition based on the final confidence of each rectangular candidate box may specifically include: filtering the candidate rectangular frames with the final confidence coefficient smaller than the preset confidence coefficient in each candidate rectangular frame to obtain candidate rectangular frames to be selected; and screening based on the final confidence of the candidate rectangular frame to be selected through a non-maximum suppression algorithm to obtain a target candidate rectangular frame.
Optionally, in some embodiments of the present disclosure, the segmentation network may include at least, but is not limited to, a differentiable binarization segmentation network.
The specific manner in which the above-mentioned embodiments of the apparatus, and the corresponding technical effects brought about by the operations performed by the respective modules, have been described in detail in the embodiments related to the method, and will not be described in detail herein.
It should be noted that although several modules or units of the device are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units. The components shown as modules or units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed scheme. One of ordinary skill in the art can understand and implement this without inventive effort.
The disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text detection method according to any one of the above embodiments.
By way of example, and not limitation, such readable storage media can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
An embodiment of the present disclosure further provides an electronic device, as shown in fig. 5, including a processor 501 and a memory 502, where the memory 502 is used to store a computer program. Wherein the processor 501 is configured to perform the steps of the text detection method in any of the above embodiments via execution of the computer program.
The various aspects, implementations, or features of the described embodiments can be used alone or in any combination. Aspects of the described embodiments may be implemented by software, hardware, or a combination of software and hardware. The described embodiments may also be embodied by a computer-readable medium having computer-readable code stored thereon, the computer-readable code comprising instructions executable by at least one computing device. The computer readable medium can be associated with any data storage device that can store data which can be read by a computer system. Exemplary computer readable media can include read-only memory, random-access memory, CD-ROMs, HDDs, DVDs, magnetic tape, and optical data storage devices, among others. The computer readable medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
The above description of the technology may refer to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration embodiments in which the described embodiments may be practiced. These embodiments, while described in sufficient detail to enable those skilled in the art to practice them, are non-limiting; other embodiments may be utilized and changes may be made without departing from the scope of the described embodiments. For example, the order of operations described in a flowchart is non-limiting, and thus the order of two or more operations illustrated in and described in accordance with the flowchart may be altered in accordance with several embodiments. As another example, in several embodiments, one or more operations illustrated in and described with respect to the flowcharts are optional or may be eliminated. Additionally, certain steps or functions may be added to the disclosed embodiments, or two or more steps may be permuted in order. All such variations are considered to be encompassed by the disclosed embodiments and the claims.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text detection method applied to an instance segmentation Mask RCNN model, characterized in that the region proposal network RPN in the Mask RCNN model is replaced by a segmentation network, and the method comprises the following steps:
acquiring a first feature map of an original image, wherein the original image at least comprises text;
obtaining a first segmentation result of the first feature map through the segmentation network, wherein the first segmentation result at least comprises a plurality of polygon frames surrounding a text region;
carrying out post-processing on the polygonal frames to obtain a plurality of corresponding correction rectangular frames;
extracting second feature maps of the plurality of correction rectangular frames, and multiplying the plurality of second feature maps by a plurality of local feature maps in the first feature map to obtain a third feature map, wherein each local feature map corresponds to one second feature map;
and obtaining a detection result of the third feature map and a second segmentation result through the Mask RCNN model, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises feature maps surrounding text regions in each candidate rectangular frame.
2. The text detection method of claim 1, wherein the obtaining of the first segmentation result of the first feature map by the segmentation network comprises:
performing reduction processing on the first feature map based on a preset reduction scale value;
and acquiring a first segmentation result of the reduced first feature map through the segmentation network.
3. The text detection method according to claim 2, wherein after the post-processing of the plurality of polygon frames to obtain a corresponding plurality of modified rectangular frames, the method further comprises:
amplifying the plurality of correction rectangular frames according to a preset amplification ratio value, wherein the preset amplification ratio value is related to the preset reduction ratio value;
the extracting of the second feature maps of the plurality of the correction rectangular frames includes:
and extracting second feature maps of the amplified plurality of correction rectangular frames.
4. The text detection method according to claim 1, wherein the first segmentation result further includes a plurality of rectangular frames and a first confidence level corresponding to each of the rectangular frames and the polygonal frames, the detection result further includes a second confidence level corresponding to each of the candidate rectangular frames, and the total number of the candidate rectangular frames is equal to the sum of the numbers of the rectangular frames and the polygonal frames, the method further comprising:
determining a final confidence coefficient of each candidate rectangular frame based on a first confidence coefficient corresponding to each rectangular frame and the polygonal frame and a second confidence coefficient corresponding to each candidate rectangular frame;
and determining a target candidate rectangular frame meeting a preset confidence degree condition and a feature map surrounding a text region in the target candidate rectangular frame based on the final confidence degree of each candidate rectangular frame.
5. The text detection method of claim 4, wherein the determining of the final confidence of each candidate rectangular frame based on the first confidence of each rectangular frame or polygonal frame and the second confidence of each candidate rectangular frame comprises:
when the second confidence corresponding to a candidate rectangular frame is zero, setting the final confidence of the candidate rectangular frame to zero;
when the second confidence corresponding to a candidate rectangular frame is greater than zero, calculating the average of the second confidence of the candidate rectangular frame and the first confidence of the corresponding rectangular frame or polygonal frame; and
taking each calculated average as the final confidence of the corresponding candidate rectangular frame.
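The fusion rule of claim 5 can be transcribed directly: a zero second confidence zeroes the final confidence; otherwise the final confidence is the average of the two confidences for the matching frame. The helper name below is hypothetical, but the logic follows the claim.

def final_confidence(first_conf: float, second_conf: float) -> float:
    # Zero-gate: a candidate the Mask RCNN stage scores as zero is discarded.
    if second_conf == 0.0:
        return 0.0
    # Otherwise average the segmentation-stage and detection-stage confidences.
    return (first_conf + second_conf) / 2.0

print(final_confidence(0.9, 0.0))  # 0.0 -> candidate suppressed
print(final_confidence(0.9, 0.7))  # ~0.8, the average of the two stages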
6. The text detection method according to claim 4 or 5, wherein the determining of a target candidate rectangular frame meeting the preset confidence condition based on the final confidence of each candidate rectangular frame comprises:
filtering out, from the candidate rectangular frames, those whose final confidence is less than a preset confidence, to obtain candidate rectangular frames to be selected; and
screening the candidate rectangular frames to be selected, based on their final confidences, through a non-maximum suppression algorithm, to obtain the target candidate rectangular frame.
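Claim 6 describes a two-stage filter: a confidence threshold followed by non-maximum suppression. A non-limiting sketch using torchvision's standard nms operator, with both the confidence threshold and the IoU threshold set to assumed example values of 0.5:

import torch
from torchvision.ops import nms

candidate_boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                                [1.0, 1.0, 11.0, 11.0],
                                [20.0, 20.0, 30.0, 30.0]])
final_confs = torch.tensor([0.90, 0.85, 0.30])

# Step 1: drop candidates whose final confidence is below the preset value.
keep = final_confs >= 0.5
to_select, scores = candidate_boxes[keep], final_confs[keep]

# Step 2: non-maximum suppression over the remaining frames to be selected.
kept = nms(to_select, scores, iou_threshold=0.5)
print(to_select[kept])  # the second box overlaps the first and is suppressed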
7. The text detection method according to any one of claims 1 to 5, wherein the segmentation network comprises at least a differentiable binarization (DB) segmentation network.
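The differentiable binarization (DB) segmentation network referenced in claim 7 is, in the DB literature, built around the approximate step function B = 1 / (1 + exp(-k(P - T))), which replaces hard thresholding so that the threshold map T can be learned end to end. A minimal sketch of that function, with the commonly used amplification factor k = 50:

import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    # Soft, differentiable substitute for the hard step (prob_map > thresh_map).
    return torch.sigmoid(k * (prob_map - thresh_map))

P = torch.rand(1, 1, 32, 32)   # probability map from the segmentation head
T = torch.full_like(P, 0.3)    # threshold map (constant here for illustration)
print(differentiable_binarization(P, T).shape)  # torch.Size([1, 1, 32, 32])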
8. A text detection device, applied to an instance segmentation Mask RCNN model, wherein a region proposal network (RPN) in the Mask RCNN model is replaced by a segmentation network, the device comprising:
a feature extraction module, configured to acquire a first feature map of an original image, wherein the original image contains at least one piece of text;
a segmentation module, configured to obtain, through the segmentation network, a first segmentation result of the first feature map, wherein the first segmentation result comprises at least a plurality of polygonal frames surrounding text regions;
a correction module, configured to post-process the plurality of polygonal frames to obtain a corresponding plurality of corrected rectangular frames;
a feature processing module, configured to extract second feature maps of the plurality of corrected rectangular frames, and to multiply the plurality of second feature maps by a plurality of local feature maps in the first feature map to obtain a third feature map, wherein each local feature map corresponds to one second feature map; and
a result determination module, configured to obtain, through the Mask RCNN model, a detection result and a second segmentation result of the third feature map, wherein the detection result comprises a plurality of candidate rectangular frames, and the second segmentation result comprises a feature map surrounding the text region in each candidate rectangular frame.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the text detection method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing a computer program;
wherein the processor is configured to perform the steps of the text detection method of any one of claims 1 to 7 via execution of the computer program.
CN202110676649.5A 2021-06-18 2021-06-18 Text detection method, device, medium and electronic equipment Active CN113255679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110676649.5A CN113255679B (en) 2021-06-18 2021-06-18 Text detection method, device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113255679A (en) 2021-08-13
CN113255679B (en) 2021-09-21

Family

ID=77188694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110676649.5A Active CN113255679B (en) 2021-06-18 2021-06-18 Text detection method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113255679B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190325628A1 (en) * 2018-04-23 2019-10-24 Accenture Global Solutions Limited Ai-driven design platform
CN109712118A (en) * 2018-12-11 2019-05-03 武汉三江中电科技有限责任公司 A kind of substation isolating-switch detection recognition method based on Mask RCNN
CN112150804A (en) * 2020-08-31 2020-12-29 中国地质大学(武汉) City multi-type intersection identification method based on MaskRCNN algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, QINYI: "Research on Person and Vehicle Detection and Tracking Algorithms Based on Deep Convolutional Networks", China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998906A (en) * 2022-05-25 2022-09-02 北京百度网讯科技有限公司 Text detection method, model training method, device, electronic equipment and medium
CN114998906B (en) * 2022-05-25 2023-08-08 北京百度网讯科技有限公司 Text detection method, training method and device of model, electronic equipment and medium

Also Published As

Publication number Publication date
CN113255679B (en) 2021-09-21

Similar Documents

Publication Title
US10896349B2 (en) Text detection method and apparatus, and storage medium
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN110046529B (en) Two-dimensional code identification method, device and equipment
US5872864A (en) Image processing apparatus for performing adaptive data processing in accordance with kind of image
CN108805128B (en) Character segmentation method and device
US20060245650A1 (en) Precise grayscale character segmentation apparatus and method
US9082039B2 (en) Method and apparatus for recognizing a character based on a photographed image
US8254690B2 (en) Information processing apparatus, information processing method, and program
CN110276351B (en) Multi-language scene text detection and identification method
CN112070649B (en) Method and system for removing specific character string watermark
CN112801918A (en) Training method of image enhancement model, image enhancement method and electronic equipment
JP2003030672A (en) Document recognition device, method, program and storage medium
CN113255679B (en) Text detection method, device, medium and electronic equipment
JP2019164618A (en) Signal processing apparatus, signal processing method and program
CN108960247B (en) Image significance detection method and device and electronic equipment
CN111626145A (en) Simple and effective incomplete form identification and page-crossing splicing method
US7970228B2 (en) Image enhancement methods with consideration of the smooth region of the image and image processing apparatuses utilizing the same
JP3099771B2 (en) Character recognition method and apparatus, and recording medium storing character recognition program
US20160343142A1 (en) Object Boundary Detection in an Image
CN107330470B (en) Method and device for identifying picture
CN115862044A (en) Method, apparatus, and medium for extracting target document part from image
JPH08190690A (en) Method for determining number plate
CN114025089A (en) Video image acquisition jitter processing method and system
CN113052836A (en) Electronic identity photo detection method and device, electronic equipment and storage medium
JP4162195B2 (en) Image processing apparatus and image processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant