CN112560857A - Character area boundary detection method, equipment, storage medium and device

Info

Publication number
CN112560857A
CN112560857A (application CN202110190870.XA; granted publication CN112560857B)
Authority
CN
China
Prior art keywords
image
text
boundary detection
region
initial
Prior art date
Legal status
Granted
Application number
CN202110190870.XA
Other languages
Chinese (zh)
Other versions
CN112560857B (en)
Inventor
操晓春
代朋纹
张华
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202110190870.XA
Publication of CN112560857A
Application granted
Publication of CN112560857B
Current legal status: Active


Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners, strokes; connectivity analysis
    • G06V 30/153: Segmentation of character regions using recognition of characters or words


Abstract

The invention discloses a character region boundary detection method, equipment, storage medium and device. Compared with existing scene character detection methods, which only explore the representation of characters of arbitrary shape or enhance feature expression, the invention extracts features from an image to be processed through a preset backbone network to obtain image features; determines an initial candidate region according to the image features and a preset region suggestion network; performs pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature; analyzes the first fixed feature through a preset character region adjustment network to obtain a feature analysis result; and determines a target character region boundary detection result according to a preset character mask segmentation network, the second fixed feature and the feature analysis result. This overcomes the defect that the prior art cannot accurately identify the region boundary of characters of arbitrary shape, so the character region boundary detection process can be optimized and the accuracy of character region boundary detection improved.

Description

Character area boundary detection method, equipment, storage medium and device
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a storage medium, and a device for detecting a text region boundary.
Background
In the prior art, detecting scene text of arbitrary shape usually relies on exploring the representation of arbitrarily shaped text, for example by learning the attributes of pixels or text segments and the relationships between them in order to distinguish text regions, or on enhancing feature expression, for example by combining features of different granularities or learning contextual features.
However, the prior art cannot accurately identify the region boundaries of characters of arbitrary shape, so the detection accuracy for scene text of arbitrary shape is low and the reliability is poor.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a method, equipment, a storage medium and a device for detecting the boundary of a text area, and aims to solve the technical problem of how to optimize the text area boundary detection process.
In order to achieve the above object, the present invention provides a text region boundary detection method, which includes the following steps:
acquiring an image to be processed, and extracting the features of the image to be processed through a preset backbone network to obtain image features;
determining an initial candidate region according to the image features and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature;
analyzing the first fixed characteristic through a preset character area adjusting network to obtain a characteristic analysis result;
determining an initial character area boundary detection result according to a preset character mask segmentation network and the second fixed characteristic;
and adjusting the initial character area boundary detection result according to the characteristic analysis result to obtain a target character area boundary detection result.
Preferably, the step of adjusting the initial text region boundary detection result according to the feature analysis result to obtain a target text region boundary detection result specifically includes:
obtaining confidence and position offset from the feature analysis result, and determining an initial character mask according to the initial character region boundary and the position offset;
determining a word mask overlapping rate according to the initial word mask, and determining a target word mask according to the word mask overlapping rate and the confidence;
and carrying out boundary detection on the target character mask to obtain a detection result, and determining a target character area boundary detection result according to the detection result.
Preferably, the step of determining a word mask overlap ratio according to the initial word mask and determining a target word mask according to the word mask overlap ratio and the confidence degree specifically includes:
determining the word mask overlapping rate according to the initial word mask, and judging whether the word mask overlapping rate is greater than a preset threshold value;
when the word mask overlapping rate is larger than a preset threshold, sequencing the initial word mask according to the confidence coefficient to obtain a sequencing result;
and screening the initial character mask according to the sorting result to obtain a target character mask.
Preferably, the step of determining an initial candidate region according to the image feature and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature specifically includes:
determining an initial candidate area according to the image characteristics and a preset area suggestion network;
analyzing the initial candidate region through a preset deformation interest region pooling model to obtain deformation offset;
and performing pooling processing on the initial candidate region according to the deformation offset to obtain a first fixed feature and a second fixed feature.
Preferably, before the step of obtaining the image to be processed and performing feature extraction on the image to be processed through a preset backbone network to obtain image features, the text region boundary detection method further includes:
obtaining an initial sample image, and carrying out scale adjustment on the initial sample image to obtain a sample image to be processed;
performing image extraction on the sample image to be processed through a preset sliding window to obtain a sample subimage to be processed;
performing feature analysis on the sub-image to be processed to obtain a positive sample sub-image and a negative sample sub-image;
and training an initial region proposal network according to the positive example sample sub-image and the negative example sample sub-image to obtain the preset region proposal network.
Preferably, the step of performing feature analysis on the sub-image to be processed to obtain a positive example sample sub-image and a negative example sample sub-image specifically includes:
acquiring the image size of the sub-image to be processed, and searching a threshold range corresponding to the image size;
analyzing the sub-image to be processed to obtain a character enclosing frame;
and acquiring the boundary length of the text bounding box, and determining a positive sample sub-image and a negative sample sub-image according to the boundary length and the threshold range.
Preferably, before the step of determining the initial character region boundary detection result according to the preset character mask segmentation network and the second fixed feature, the character region boundary detection method further includes:
determining a shape structure constraint function according to the image to be processed;
and training the initial character mask segmentation network according to the shape structure constraint function to obtain a preset character mask segmentation network.
In addition, in order to achieve the above object, the present invention further provides a text region boundary detection apparatus, which includes a memory, a processor, and a text region boundary detection program stored in the memory and executable on the processor, wherein the text region boundary detection program is configured to implement the steps of the text region boundary detection method as described above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which a text region boundary detection program is stored, wherein the text region boundary detection program, when executed by a processor, implements the steps of the text region boundary detection method as described above.
In order to achieve the above object, the present invention further provides a character region boundary detection apparatus, which comprises: an acquisition module, a processing module, an analysis module, a detection module and an adjustment module;
the acquisition module is used for acquiring an image to be processed and extracting the characteristics of the image to be processed through a preset backbone network to obtain image characteristics;
the processing module is used for determining an initial candidate region according to the image features and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature;
the analysis module is used for analyzing the first fixed characteristic through a preset character area adjustment network to obtain a characteristic analysis result;
the detection module is used for determining an initial character area boundary detection result according to a preset character mask segmentation network and the second fixed characteristic;
and the adjusting module is used for adjusting the initial character region boundary detection result according to the characteristic analysis result to obtain a target character region boundary detection result.
Compared with existing scene character detection methods, which only explore the representation of characters of arbitrary shape or enhance feature expression, the invention acquires an image to be processed and extracts features from it through a preset backbone network to obtain image features; determines an initial candidate region according to the image features and a preset region suggestion network; performs pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature; analyzes the first fixed feature through a preset character region adjustment network to obtain a feature analysis result; determines an initial character region boundary detection result according to a preset character mask segmentation network and the second fixed feature; and adjusts the initial character region boundary detection result according to the feature analysis result to obtain a target character region boundary detection result. This overcomes the defect that the prior art cannot accurately identify the region boundaries of characters of arbitrary shape, optimizes the character region boundary detection process, and improves the accuracy and reliability of character region boundary detection, thereby meeting the needs of scene character detection.
Drawings
Fig. 1 is a schematic structural diagram of a text region boundary detection device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text region boundary detection method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a text region boundary detection method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a text region boundary detection method according to a third embodiment of the present invention;
fig. 5 is a block diagram of a first embodiment of a text region boundary detection apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a text region boundary detection device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the text region boundary detection device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface, and in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the text region boundary detection apparatus and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
As shown in FIG. 1, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a text region boundary detection program.
In the text area boundary detection device shown in fig. 1, the network interface 1004 is mainly used for connecting a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the text area boundary detection device calls a text area boundary detection program stored in the memory 1005 through the processor 1001, and executes the text area boundary detection method provided by the embodiment of the present invention.
Based on the hardware structure, the embodiment of the character area boundary detection method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a text region boundary detection method according to a first embodiment of the present invention.
Step S10: acquiring an image to be processed, and extracting the features of the image to be processed through a preset backbone network to obtain image features.
It should be understood that the execution subject of this embodiment is the text region boundary detection device, where the text region boundary detection device may be an electronic device such as a computer or a server, or another device that can achieve the same or similar functions.
It should be noted that the image to be processed may be a scene image input by a user through a user interaction interface of the text region boundary detection device, or may also be a scene image input by a user through a terminal device that establishes a communication connection with the text region boundary detection device in advance, which is not limited in this embodiment.
The preset backbone network is an image feature extraction network preset by a user; in this embodiment, a ResNet-101 network with embedded deformable convolutions is taken as an example for description, which is not a limitation of this embodiment.
In a specific implementation, for example, the image to be processed is acquired, and the ResNet-101 network with embedded deformable convolutions is used as the backbone network to extract image features.
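As a purely illustrative sketch (not the patent's reference implementation), a deformable convolution block of the kind embedded in such a backbone can be built with torchvision's DeformConv2d; the block structure, layer names and channel sizes here are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution: a plain conv predicts per-location
    sampling offsets, which DeformConv2d uses to deform its grid."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels
        self.offset_pred = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)
        return self.deform_conv(x, offsets)

# e.g. such a block could replace a 3x3 conv inside a ResNet-101 stage
block = DeformableConvBlock(256, 256)
feat = block(torch.randn(1, 256, 64, 64))  # -> [1, 256, 64, 64]
```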
Step S20: and determining an initial candidate region according to the image features and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature.
It can be understood that, determining an initial candidate region according to the image feature and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature may be determining an initial candidate region according to the image feature and the preset region suggestion network, analyzing the initial candidate region through a preset deformation interest region pooling model to obtain a deformation offset, and performing pooling processing on the initial candidate region according to the deformation offset to obtain a first fixed feature and a second fixed feature.
In a specific implementation, for example, the image features are processed through a Region Proposal Network (RPN) to generate candidate regions, and in the process of mapping a candidate region to a fixed size, a deformation offset is learned using position-sensitive Deformable Region-of-Interest (DROI) pooling to produce more accurately aligned features. Two DROI pooling layers that do not share learning parameters are used here to generate the two different fixed-size features.
Step S30: and analyzing the first fixed characteristic through a preset character area adjusting network to obtain a characteristic analysis result.
It should be noted that the preset text region adjustment network may be a region refinement network preset by a user; in this embodiment, a Text Region Refinement Network (TRRN) is taken as an example for description.
In a specific implementation, for example, the first fixed-size feature is input to the Text Region Refinement Network (TRRN). The structure of the TRRN is the same as that of the corresponding head in Mask R-CNN, except that the TRRN targets the 2-class case (text versus background). The TRRN is configured to obtain the confidence and the position offset of the refined text region.
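For illustration only, a Mask R-CNN-style two-class refinement head of this kind might look as follows; the layer sizes, names and RoI size are assumptions rather than the patent's specification:

```python
import torch
import torch.nn as nn

class TextRegionRefinementHead(nn.Module):
    """Two-class (text / background) box head: outputs a confidence
    score and a 4-d position offset (dx, dy, dw, dh) per RoI."""
    def __init__(self, in_ch: int = 256, roi_size: int = 7):
        super().__init__()
        flat = in_ch * roi_size * roi_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, 2)   # text vs. background confidence
        self.bbox_delta = nn.Linear(1024, 4)  # position offsets

    def forward(self, roi_feat: torch.Tensor):
        h = self.fc(roi_feat)                 # roi_feat: [N, C, 7, 7]
        return self.cls_score(h), self.bbox_delta(h)
```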
Step S40: and determining an initial character area boundary detection result according to a preset character mask segmentation network and the second fixed characteristic.
It should be noted that the preset text mask segmentation network may be a segmentation network preset by a user; in this embodiment, a Text Mask Segmentation Network (TMSN) is taken as an example for description.
In a specific implementation, for example, the second fixed-size feature is input to the Text Mask Segmentation Network (TMSN). The structure of the text mask segmentation network is consistent with the mask branch of Mask R-CNN, and the TMSN is used to obtain segmentation results for scene text of arbitrary shape.
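Again purely as a hedged sketch, a Mask R-CNN-style mask branch producing 28 x 28 two-class maps (the 28 x 28 size appears later in this description) could look like this; the channel widths and depth are assumptions:

```python
import torch
import torch.nn as nn

class TextMaskHead(nn.Module):
    """Mask R-CNN-style mask branch: conv stack plus upsampling,
    predicting a 28x28 map for each of the 2 classes."""
    def __init__(self, in_ch: int = 256, num_classes: int = 2):
        super().__init__()
        self.convs = nn.Sequential(
            *[layer for _ in range(4)
              for layer in (nn.Conv2d(in_ch, in_ch, 3, padding=1),
                            nn.ReLU(inplace=True))]
        )
        self.upsample = nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2)
        self.predict = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        h = self.convs(roi_feat)              # [N, C, 14, 14]
        h = torch.relu(self.upsample(h))      # [N, C, 28, 28]
        return self.predict(h)                # [N, 2, 28, 28]
```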
Step S50: and adjusting the initial character area boundary detection result according to the characteristic analysis result to obtain a target character area boundary detection result.
It should be understood that, adjusting the initial text region boundary detection result according to the feature analysis result to obtain the target text region boundary detection result may be obtaining a confidence degree and a position offset from the feature analysis result, determining an initial text mask according to the initial text region boundary and the position offset, determining a text mask overlap rate according to the initial text mask, determining a target text mask according to the text mask overlap rate and the confidence degree, performing boundary detection on the target text mask to obtain a detection result, and determining the target text region boundary detection result according to the detection result.
Further, in order to improve the accuracy and reliability of the target word mask, the determining the word mask overlap ratio according to the initial word mask and determining the target word mask according to the word mask overlap ratio and the confidence includes:
determining a word mask overlapping rate according to the initial word mask, judging whether the word mask overlapping rate is greater than a preset threshold, when the word mask overlapping rate is greater than the preset threshold, sorting the initial word mask according to the confidence to obtain a sorting result, and screening the initial word mask according to the sorting result to obtain a target word mask.
In a specific implementation, for example, the absolute position of an arbitrarily shaped character region in the input image is obtained from the segmented binary map and the corresponding position of the character region. Non-Maximum Suppression (NMS) over maximized intersection regions computes the overlap rate of two character masks as max(O/A, O/B), where O is the overlap area of the two masks and A and B are the areas of the two masks, respectively. When the overlap rate is greater than 0.8, the mask with the lower confidence of the two is removed.
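A minimal sketch of this mask-level NMS, assuming the masks are given as boolean arrays with associated confidence scores (the function name and array layout are assumptions, not the patent's notation):

```python
import numpy as np

def mask_nms(masks: np.ndarray, scores: np.ndarray, thresh: float = 0.8):
    """masks: [N, H, W] boolean, scores: [N]. Keeps the higher-confidence
    mask whenever max(O/A, O/B) exceeds thresh."""
    order = np.argsort(-scores)          # process by descending confidence
    keep = []
    areas = masks.reshape(len(masks), -1).sum(axis=1)
    for i in order:
        suppressed = False
        for j in keep:
            overlap = np.logical_and(masks[i], masks[j]).sum()
            rate = max(overlap / (areas[i] + 1e-6),
                       overlap / (areas[j] + 1e-6))
            if rate > thresh:
                suppressed = True        # a kept, higher-scoring mask covers it
                break
        if not suppressed:
            keep.append(i)
    return keep
```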
Compared with existing scene character detection methods, which only explore the representation of characters of arbitrary shape or enhance feature expression, in this embodiment an image to be processed is acquired and features are extracted from it through a preset backbone network to obtain image features; an initial candidate region is determined according to the image features and a preset region suggestion network; the initial candidate region is pooled to obtain a first fixed feature and a second fixed feature; the first fixed feature is analyzed through a preset character region adjustment network to obtain a feature analysis result; an initial character region boundary detection result is determined according to a preset character mask segmentation network and the second fixed feature; and the initial character region boundary detection result is adjusted according to the feature analysis result to obtain a target character region boundary detection result. This overcomes the defect that the prior art cannot accurately identify the region boundaries of characters of arbitrary shape, optimizes the character region boundary detection process, and improves the accuracy and reliability of character region boundary detection, thereby meeting the needs of scene character detection.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the text region boundary detection method according to the present invention, and the second embodiment of the text region boundary detection method according to the present invention is proposed based on the first embodiment illustrated in fig. 2.
In the second embodiment, the step S20 includes:
step S201: and determining an initial candidate region according to the image characteristics and a preset region suggestion network.
It should be noted that the preset region proposal network may be a processing network preset by a user; in this embodiment, a Region Proposal Network (RPN) is taken as an example for description, which is not a limitation of this embodiment.
In a specific implementation, for example, the image features are processed through the Region Proposal Network (RPN) to generate candidate regions.
Step S202: and analyzing the initial candidate region through a preset deformation interest region pooling model to obtain deformation offset.
It should be noted that the preset deformable region-of-interest pooling model may be a pooling processing model preset by a user; in this embodiment, a position-sensitive Deformable Region-of-Interest (DROI) pooling model is taken as an example for description, which is not a limitation of this embodiment.
Step S203: and performing pooling processing on the initial candidate region according to the deformation offset to obtain a first fixed feature and a second fixed feature.
In a specific implementation, for example, in the process of mapping a text region of arbitrary size to a fixed size, a deformation offset is learned using position-sensitive Deformable Region-of-Interest (DROI) pooling, so that more accurately aligned features are generated. Two DROI pooling layers that do not share learning parameters are used to generate the two different fixed-size features.
In the second embodiment, an initial candidate region is determined according to the image features and a preset region suggestion network, the initial candidate region is analyzed through a preset deformable region-of-interest pooling model to obtain a deformation offset, and the initial candidate region is pooled according to the deformation offset to obtain a first fixed feature and a second fixed feature, so that a candidate region of any scale can be mapped to a fixed-size feature.
In the second embodiment, the step S50 includes:
step S501: and obtaining confidence coefficient and position offset from the feature analysis result, and determining an initial word mask according to the initial word region boundary and the position offset.
It is understood that the obtaining of the confidence and the position offset from the feature analysis result may be performing feature extraction on the feature analysis result, obtaining a text feature, and determining the confidence and the position offset according to the text feature.
Step S502: and determining a word mask overlapping rate according to the initial word mask, and determining a target word mask according to the word mask overlapping rate and the confidence.
It can be understood that, determining a word mask overlap ratio according to the initial word mask, and determining a target word mask according to the word mask overlap ratio and the confidence level may be determining a word mask overlap ratio according to the initial word mask, and determining whether the word mask overlap ratio is greater than a preset threshold, when the word mask overlap ratio is greater than the preset threshold, sorting the initial word mask according to the confidence level to obtain a sorting result, and screening the initial word mask according to the sorting result to obtain the target word mask.
Further, in order to improve the accuracy and reliability of the target word mask, the step S502 includes:
determining the word mask overlapping rate according to the initial word mask, and judging whether the word mask overlapping rate is greater than a preset threshold value;
when the word mask overlapping rate is larger than a preset threshold, sequencing the initial word mask according to the confidence coefficient to obtain a sequencing result;
and screening the initial character mask according to the sorting result to obtain a target character mask.
In a specific implementation, for example, the absolute position of an arbitrarily shaped character region in the input image is obtained from the segmented binary map and the corresponding position of the character region. Non-Maximum Suppression (NMS) over maximized intersection regions computes the overlap rate of two character masks as max(O/A, O/B), where O is the overlap area of the two masks and A and B are the areas of the two masks, respectively. When the overlap rate is greater than 0.8, the mask with the lower confidence of the two is removed.
Step S503: and carrying out boundary detection on the target character mask to obtain a detection result, and determining a target character area boundary detection result according to the detection result.
It can be understood that, the boundary detection of the target word mask is performed to obtain a detection result, and the determination of the boundary detection result of the target word area according to the detection result may be that the boundary detection of the target word mask is performed, the boundary of the target word mask is determined according to the detection result, and the boundary of the target word mask is used as the boundary detection result of the target word area.
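One plausible realization of this boundary detection step, assuming OpenCV and a binary mask already placed in absolute image coordinates (purely illustrative):

```python
import cv2
import numpy as np

def mask_to_boundary(mask: np.ndarray):
    """mask: [H, W] binary text mask in absolute image coordinates.
    Returns the polygon(s) tracing the text region boundary."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # each contour is a [K, 1, 2] array of (x, y) boundary points
    return [c.reshape(-1, 2) for c in contours]
```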
In a second embodiment, the confidence and the position offset are obtained from the feature analysis result, an initial word mask is determined according to the initial word region boundary and the position offset, a word mask overlapping rate is determined according to the initial word mask, a target word mask is determined according to the word mask overlapping rate and the confidence, boundary detection is performed on the target word mask, a detection result is obtained, and a target word region boundary detection result is determined according to the detection result, so that the accuracy of the target word region boundary detection result can be improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating a text region boundary detection method according to a third embodiment of the present invention, and the third embodiment of the text region boundary detection method is proposed based on the first embodiment shown in fig. 2.
In the third embodiment, before the step S20, the method further includes:
step S110: obtaining an initial sample image, and carrying out scale adjustment on the initial sample image to obtain a sample image to be processed.
It should be noted that the initial sample image may be a sample image input by a user through a text region boundary detection device, which is not limited in this embodiment.
In a specific implementation, for example, the initial sample image is rescaled to three scales to obtain the sample images to be processed (the specific scale values appear only as an equation image in the original).
Step S120: and carrying out image extraction on the sample image to be processed through a preset sliding window to obtain a sample subimage to be processed.
It should be noted that the preset sliding window may be a sliding window preset by a user, and in this embodiment, a sliding window of 512 × 512 is taken as an example for description.
In a specific implementation, for example, for each scale of the image, a 512 x 512 window is slid over the image to generate sub-images, as sketched below.
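A hedged sketch of this multi-scale sliding-window extraction; the stride and the scale values are assumptions, since the patent only fixes the 512 x 512 window:

```python
import cv2
import numpy as np

def sliding_window_crops(image: np.ndarray, scales=(0.5, 1.0, 2.0),
                         win: int = 512, stride: int = 256):
    """Rescales the image to each scale, then slides a win x win window
    with the given stride, yielding (scale, x, y, crop) tuples."""
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s)
        h, w = scaled.shape[:2]
        for y in range(0, max(h - win, 0) + 1, stride):
            for x in range(0, max(w - win, 0) + 1, stride):
                yield s, x, y, scaled[y:y + win, x:x + win]
```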
Step S130: and performing characteristic analysis on the sub-image to be processed to obtain a positive sample sub-image and a negative sample sub-image.
It can be understood that, the step of performing feature analysis on the to-be-processed sub-image to obtain the positive example sample sub-image and the negative example sample sub-image may be to obtain an image size of the to-be-processed sub-image, find a threshold range corresponding to the image size, analyze the to-be-processed sub-image to obtain a text enclosure frame, obtain each boundary length of the text enclosure frame, and determine the positive example sample sub-image and the negative example sample sub-image according to the boundary length and the threshold range.
Further, the step S130 includes:
acquiring the image size of the sub-image to be processed, and searching a threshold range corresponding to the image size;
analyzing the sub-image to be processed to obtain a character enclosing frame;
and acquiring the boundary length of the text bounding box, and determining a positive sample sub-image and a negative sample sub-image according to the boundary length and the threshold range.
It is understood that the threshold range corresponding to the image size may be searched in a preset mapping table. The preset mapping table includes a corresponding relationship between the image size and the threshold range, which is not limited in this embodiment.
In a specific implementation, for example, a range is designed for each scale (the range values appear only as an equation image in the original). When the shortest side of a text bounding box falls within the range for a scale, that text participates in the training process at that scale and is recorded as a valid text. The sub-images covering the largest number of valid texts are selected as the positive example sub-images. To select the negative example sub-images, a negative example mining technique is employed. Specifically, the generated positive example images are first used to train a Region Proposal Network (RPN), which generates candidate boxes. The candidate boxes already covered by the positive example sub-images are then removed, and the sub-image areas covering the remaining in-range candidate boxes are taken as the negative example images at that scale. In the learning process, the size of each sub-image region is 512 x 512 and the size of each minibatch is 10, where the ratio of positive to negative example images is 4:1. Since many positive example sub-images contain only a small number of valid texts, the number of positive samples in the RPN is limited; text segments are therefore used to increase the number of positive samples. Specifically, when the overlap rate between a prior box and a valid text region is greater than a threshold of 0.7, and the horizontal extent of the overlap region is not less than 1/3 of the horizontal extent of the entire text region, the prior box is taken as a positive sample. Furthermore, for invalid text in a sub-image, prior boxes are obtained in the same way and removed from the negative samples, to reduce the ambiguity of the negative samples.
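An illustrative sketch of the prior-box labelling rule described above; the overlap rate is interpreted here as intersection-over-union, which is an assumption, and the box layout and helper name are likewise assumptions:

```python
import numpy as np

def label_prior_box(prior: np.ndarray, text: np.ndarray,
                    iou_thresh: float = 0.7) -> bool:
    """prior, text: [x1, y1, x2, y2]. Returns True when the prior box
    counts as a positive sample: overlap rate > 0.7 and the horizontal
    span of the overlap >= 1/3 of the text region's horizontal span."""
    ix1, iy1 = max(prior[0], text[0]), max(prior[1], text[1])
    ix2, iy2 = min(prior[2], text[2]), min(prior[3], text[3])
    iw, ih = max(ix2 - ix1, 0.0), max(iy2 - iy1, 0.0)
    inter = iw * ih
    area_p = (prior[2] - prior[0]) * (prior[3] - prior[1])
    area_t = (text[2] - text[0]) * (text[3] - text[1])
    iou = inter / (area_p + area_t - inter + 1e-6)
    horiz_ok = iw >= (text[2] - text[0]) / 3.0
    return iou > iou_thresh and horiz_ok
```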
Step S140: training an initial region proposal network according to the positive example sample sub-image and the negative example sample sub-image to obtain the preset region proposal network.
It should be noted that the initial region proposal network may be a region proposal network to be trained, preset by a user, and this embodiment is not limited thereto.
In the third embodiment, an initial sample image is obtained and rescaled to obtain the sample images to be processed; image extraction is performed on the sample images through a preset sliding window to obtain sample sub-images; feature analysis is performed on the sub-images to obtain positive example sample sub-images and negative example sample sub-images; and an initial region proposal network is trained according to the positive and negative example sub-images to obtain the preset region proposal network. In this way, the generalization ability of the region proposal network can be improved under the condition of limited data.
In the third embodiment, before the step S40, the method further includes:
step S310: and determining a shape structure constraint function according to the image to be processed.
In a specific implementation, for example, the shape structure constraint function determined from the image to be processed is a Shape Structure Constraint (SSC) that encourages similarity between the text regions generated by the network and the text region ground truth, and between the generated background regions and the background ground truth. Compared with the popular pixel-level cross-entropy loss, the shape structure constraint, used as an auxiliary function, favors the network's global perception of the text region. The loss function is calculated as follows:
$$\mathcal{L}_{ssc} = 1 - \frac{1}{C}\sum_{c=1}^{C}\operatorname{avg}\!\left(\frac{\left(2\,\mu_c^{p}\odot\mu_c^{g}+C_1\right)\odot\left(2\,\sigma_c^{pg}+C_2\right)}{\left((\mu_c^{p})^2+(\mu_c^{g})^2+C_1\right)\odot\left((\sigma_c^{p})^2+(\sigma_c^{g})^2+C_2\right)}\right)$$

where C represents the number of categories, set to C = 2 (text and background); avg(·) represents the matrix averaging operation; ⊙ represents multiplication of corresponding matrix elements; C_1 and C_2 are the factors that stabilize the division, whose experimental values appear only as equation images in the original; μ_c^p and μ_c^g are the mean maps of the class-c prediction map and truth map, respectively; (σ_c^p)^2 and (σ_c^g)^2 are the variance maps of the class-c prediction map and truth map, respectively; and σ_c^pg is the covariance map between the class-c prediction map and the truth map. These maps are calculated as follows (reconstructed here from the surrounding definitions; the original renders them as images):

$$\mu_c^{p}=w * P_c,\qquad (\sigma_c^{p})^2=w * (P_c\odot P_c)-\mu_c^{p}\odot\mu_c^{p},\qquad \sigma_c^{pg}=w * (P_c\odot G_c)-\mu_c^{p}\odot\mu_c^{g}$$

where P_c and G_c are the class-c prediction map and truth map, respectively, each of size 28 x 28; w represents a Gaussian weight filter of size 3 x 3; * denotes the correlation operation; and μ_c^g and (σ_c^g)^2 are obtained from G_c in the same way.
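A hedged PyTorch sketch of this SSIM-style constraint; the 3 x 3 Gaussian kernel values and the stability constants (borrowed from common SSIM practice) are assumptions, since the original gives them only as images:

```python
import torch
import torch.nn.functional as F

def ssc_loss(pred: torch.Tensor, truth: torch.Tensor,
             c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """pred, truth: [N, C, 28, 28] per-class maps with C = 2 (text, background).
    Returns 1 minus the mean SSIM computed with a 3x3 Gaussian window."""
    C = pred.shape[1]
    g = torch.tensor([1.0, 2.0, 1.0], device=pred.device)
    w = (g[:, None] * g[None, :] / 16.0).view(1, 1, 3, 3).repeat(C, 1, 1, 1)
    blur = lambda x: F.conv2d(x, w, padding=1, groups=C)  # per-class filtering

    mu_p, mu_g = blur(pred), blur(truth)
    var_p = blur(pred * pred) - mu_p * mu_p
    var_g = blur(truth * truth) - mu_g * mu_g
    cov = blur(pred * truth) - mu_p * mu_g

    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) \
         / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1.0 - ssim.mean()
```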
The shape structure constraint is used as an auxiliary loss function and added into the original network to form end-to-end learning, and the total loss function equation is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{rpn} + \lambda_1\,\mathcal{L}_{trrn} + \lambda_2\,\mathcal{L}_{mask} + \lambda_3\,\mathcal{L}_{ssc}$$

where the loss function balance factors λ are all set to 1 (the exact arrangement of the factors appears only as an equation image in the original); L_rpn represents the loss function of the RPN; L_trrn represents the loss function of the TRRN; and L_mask represents the cross-entropy loss function in the text mask segmentation network. These three functions are identical to the loss functions in Mask R-CNN, except that the above loss functions are for the case of C = 2.
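Under the same assumptions, the end-to-end objective could be assembled as in this illustrative fragment; the individual loss terms are assumed to come from the RPN, TRRN and TMSN heads sketched earlier:

```python
# all balance factors set to 1, as stated above (illustrative only)
loss_total = loss_rpn + loss_trrn + loss_mask + ssc_loss(pred_maps, truth_maps)
loss_total.backward()
```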
Step S320: and training the initial character mask segmentation network according to the shape structure constraint function to obtain a preset character mask segmentation network.
It should be noted that the initial word mask segmentation network may be a word mask segmentation network to be trained, which is preset by a user, and this is not limited in this embodiment.
In the third embodiment, the shape structure constraint function is determined according to the image to be processed, and the initial character mask segmentation network is trained according to the shape structure constraint function to obtain the preset character mask segmentation network, so that the reliability of the preset character mask segmentation network can be improved.
In addition, an embodiment of the present invention further provides a storage medium, where a text region boundary detection program is stored on the storage medium, and when executed by a processor, the text region boundary detection program implements the steps of the text region boundary detection method described above.
In addition, referring to fig. 5, an embodiment of the present invention further provides a text region boundary detection apparatus, where the text region boundary detection apparatus includes: the system comprises an acquisition module 10, a processing module 20, an analysis module 30, a detection module 40 and an adjustment module 50;
the acquiring module 10 is configured to acquire an image to be processed, and perform feature extraction on the image to be processed through a preset backbone network to obtain image features.
It should be noted that the image to be processed may be a scene image input by a user through a user interaction interface of the text region boundary detection device, or may also be a scene image input by a user through a terminal device that establishes a communication connection with the text region boundary detection device in advance, which is not limited in this embodiment.
The preset backbone network is an image feature extraction network preset by a user, and in this embodiment, a ResNet-101 network embedded with a deformed convolution is taken as an example for description, which is not limited in this embodiment.
In a specific implementation, for example, an image to be processed is acquired, and the ResNet-101 network embedded with deformation convolution is used as a backbone network to extract image features.
The processing module 20 is configured to determine an initial candidate region according to the image feature and a preset region suggestion network, and perform pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature.
It can be understood that, determining an initial candidate region according to the image feature and a preset region suggestion network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature may be determining an initial candidate region according to the image feature and the preset region suggestion network, analyzing the initial candidate region through a preset deformation interest region pooling model to obtain a deformation offset, and performing pooling processing on the initial candidate region according to the deformation offset to obtain a first fixed feature and a second fixed feature.
In a specific implementation, for example, the image features are processed through a Region Proposal Network (RPN) to generate candidate regions, and in the process of mapping a candidate region to a fixed size, a deformation offset is learned using position-sensitive Deformable Region-of-Interest (DROI) pooling to produce more accurately aligned features. Two DROI pooling layers that do not share learning parameters are used here to generate the two different fixed-size features.
The analysis module 30 is configured to analyze the first fixed feature through a preset text area adjustment network to obtain a feature analysis result.
It should be noted that the preset text region adjustment network may be a region refinement network preset by a user; in this embodiment, a Text Region Refinement Network (TRRN) is taken as an example for description.
In a specific implementation, for example, the first fixed-size feature is input to the Text Region Refinement Network (TRRN). The structure of the TRRN is the same as that of the corresponding head in Mask R-CNN, except that the TRRN targets the 2-class case (text versus background). The TRRN is configured to obtain the confidence and the position offset of the refined text region.
The detection module 40 is configured to determine an initial text region boundary detection result according to a preset text mask segmentation network and the second fixed feature.
It should be noted that the preset text mask segmentation network may be a segmentation network preset by a user; in this embodiment, a Text Mask Segmentation Network (TMSN) is taken as an example for description.
In a specific implementation, for example, the second fixed-size feature is input to the Text Mask Segmentation Network (TMSN). The structure of the text mask segmentation network is consistent with the mask branch of Mask R-CNN, and the TMSN is used to obtain segmentation results for scene text of arbitrary shape.
The adjusting module 50 is configured to adjust the initial text region boundary detection result according to the feature analysis result, so as to obtain a target text region boundary detection result.
It should be understood that, adjusting the initial text region boundary detection result according to the feature analysis result to obtain the target text region boundary detection result may be obtaining a confidence degree and a position offset from the feature analysis result, determining an initial text mask according to the initial text region boundary and the position offset, determining a text mask overlap rate according to the initial text mask, determining a target text mask according to the text mask overlap rate and the confidence degree, performing boundary detection on the target text mask to obtain a detection result, and determining the target text region boundary detection result according to the detection result.
Further, in order to improve the accuracy and reliability of the target word mask, the adjusting module 50 is further configured to determine a word mask overlap ratio according to the initial word mask, determine whether the word mask overlap ratio is greater than a preset threshold, rank the initial word mask according to the confidence when the word mask overlap ratio is greater than the preset threshold, obtain a ranking result, and screen the initial word mask according to the ranking result, so as to obtain the target word mask.
In a specific implementation, for example, the absolute position of an arbitrarily shaped character region in the input image is obtained from the segmented binary map and the corresponding position of the character region. Non-Maximum Suppression (NMS) over maximized intersection regions computes the overlap rate of two character masks as max(O/A, O/B), where O is the overlap area of the two masks and A and B are the areas of the two masks, respectively. When the overlap rate is greater than 0.8, the mask with the lower confidence of the two is removed.
Compared with existing scene character detection methods, which only explore the representation of characters of arbitrary shape or enhance feature expression, in this embodiment an image to be processed is acquired and features are extracted from it through a preset backbone network to obtain image features; an initial candidate region is determined according to the image features and a preset region suggestion network; the initial candidate region is pooled to obtain a first fixed feature and a second fixed feature; the first fixed feature is analyzed through a preset character region adjustment network to obtain a feature analysis result; an initial character region boundary detection result is determined according to a preset character mask segmentation network and the second fixed feature; and the initial character region boundary detection result is adjusted according to the feature analysis result to obtain a target character region boundary detection result. This overcomes the defect that the prior art cannot accurately identify the region boundaries of characters of arbitrary shape, optimizes the character region boundary detection process, and improves the accuracy and reliability of character region boundary detection, thereby meeting the needs of scene character detection.
In an embodiment, the adjusting module 50 is further configured to obtain a confidence degree and a position offset from the feature analysis result, determine an initial word mask according to the initial word region boundary and the position offset, determine a word mask overlap rate according to the initial word mask, determine a target word mask according to the word mask overlap rate and the confidence degree, perform boundary detection on the target word mask, obtain a detection result, and determine a target word region boundary detection result according to the detection result;
in an embodiment, the adjusting module 50 is further configured to determine a word mask overlap ratio according to the initial word mask, determine whether the word mask overlap ratio is greater than a preset threshold, rank the initial word mask according to the confidence when the word mask overlap ratio is greater than the preset threshold, obtain a ranking result, and filter the initial word mask according to the ranking result to obtain a target word mask;
in an embodiment, the processing module 20 is further configured to determine an initial candidate region according to the image feature and a preset region suggestion network, analyze the initial candidate region through a preset deformation interest region pooling model to obtain a deformation offset, perform pooling processing on the initial candidate region according to the deformation offset, and obtain a first fixed feature and a second fixed feature;
in an embodiment, the text region boundary detecting apparatus further includes: a training module;
the training module is used for acquiring an initial sample image, rescaling the initial sample image to obtain sample images to be processed, performing image extraction on the sample images to be processed through a preset sliding window to obtain sample sub-images to be processed, performing feature analysis on the sub-images to be processed to obtain positive example sample sub-images and negative example sample sub-images, and training an initial region proposal network according to the positive example sample sub-images and the negative example sample sub-images to obtain the preset region proposal network;
in an embodiment, the training module is further configured to obtain an image size of the sub-image to be processed, search a threshold range corresponding to the image size, analyze the sub-image to be processed, obtain a text bounding box, obtain each boundary length of the text bounding box, and determine a positive sample sub-image and a negative sample sub-image according to the boundary length and the threshold range;
in an embodiment, the training module is further configured to determine a shape structure constraint function according to the image to be processed, and train the initial word mask segmentation network according to the shape structure constraint function to obtain a preset word mask segmentation network.
Other embodiments or specific implementation manners of the text region boundary detection device according to the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order, but rather the words first, second, third, etc. are to be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., a Read Only Memory (ROM)/Random Access Memory (RAM), a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text region boundary detection method, characterized in that the text region boundary detection method comprises the following steps:
acquiring an image to be processed, and performing feature extraction on the image to be processed through a preset backbone network to obtain image features;
determining an initial candidate region according to the image features and a preset region proposal network, and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature;
analyzing the first fixed feature through a preset text region adjustment network to obtain a feature analysis result;
determining an initial text region boundary detection result according to a preset text mask segmentation network and the second fixed feature; and
adjusting the initial text region boundary detection result according to the feature analysis result to obtain a target text region boundary detection result.

2. The text region boundary detection method according to claim 1, wherein the step of adjusting the initial text region boundary detection result according to the feature analysis result to obtain a target text region boundary detection result specifically comprises:
obtaining a confidence and a position offset from the feature analysis result, and determining initial text masks according to the initial text region boundary and the position offset;
determining a text mask overlap ratio according to the initial text masks, and determining a target text mask according to the text mask overlap ratio and the confidence; and
performing boundary detection on the target text mask to obtain a detection result, and determining the target text region boundary detection result according to the detection result.

3. The text region boundary detection method according to claim 2, wherein the step of determining a text mask overlap ratio according to the initial text masks and determining a target text mask according to the text mask overlap ratio and the confidence specifically comprises:
determining the text mask overlap ratio according to the initial text masks, and judging whether the text mask overlap ratio is greater than a preset threshold;
when the text mask overlap ratio is greater than the preset threshold, sorting the initial text masks according to the confidence to obtain a sorting result; and
screening the initial text masks according to the sorting result to obtain the target text mask.
4. The text region boundary detection method according to claim 1, wherein the step of determining an initial candidate region according to the image features and a preset region proposal network and performing pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature specifically comprises:
determining the initial candidate region according to the image features and the preset region proposal network;
analyzing the initial candidate region through a preset deformable region-of-interest pooling model to obtain a deformation offset; and
performing pooling processing on the initial candidate region according to the deformation offset to obtain the first fixed feature and the second fixed feature.

5. The text region boundary detection method according to claim 1, wherein before the step of acquiring an image to be processed and performing feature extraction on the image to be processed through a preset backbone network to obtain image features, the text region boundary detection method further comprises:
acquiring an initial sample image, and performing scale adjustment on the initial sample image to obtain a sample image to be processed;
performing image extraction on the sample image to be processed through a preset sliding window to obtain sample sub-images to be processed;
performing feature analysis on the sub-images to be processed to obtain positive sample sub-images and negative sample sub-images; and
training an initial region proposal network according to the positive sample sub-images and the negative sample sub-images to obtain the preset region proposal network.

6. The text region boundary detection method according to claim 5, wherein the step of performing feature analysis on the sub-images to be processed to obtain positive sample sub-images and negative sample sub-images specifically comprises:
acquiring the image size of each sub-image to be processed, and looking up the threshold range corresponding to the image size;
analyzing the sub-image to be processed to obtain a text bounding box; and
acquiring each boundary length of the text bounding box, and determining the positive sample sub-images and the negative sample sub-images according to the boundary lengths and the threshold range.
7. The text region boundary detection method according to any one of claims 1 to 6, wherein before the step of determining an initial text region boundary detection result according to a preset text mask segmentation network and the second fixed feature, the text region boundary detection method further comprises:
determining a shape structure constraint function according to the image to be processed; and
training an initial text mask segmentation network according to the shape structure constraint function to obtain the preset text mask segmentation network.

8. A text region boundary detection device, characterized in that the text region boundary detection device comprises: a memory, a processor, and a text region boundary detection program stored on the memory and executable on the processor, wherein the text region boundary detection program, when executed by the processor, implements the steps of the text region boundary detection method according to any one of claims 1 to 7.

9. A storage medium, characterized in that a text region boundary detection program is stored on the storage medium, and the text region boundary detection program, when executed by a processor, implements the steps of the text region boundary detection method according to any one of claims 1 to 7.

10. A text region boundary detection apparatus, characterized in that the text region boundary detection apparatus comprises: an acquisition module, a processing module, an analysis module, a detection module, and an adjustment module;
the acquisition module is configured to acquire an image to be processed, and perform feature extraction on the image to be processed through a preset backbone network to obtain image features;
the processing module is configured to determine an initial candidate region according to the image features and a preset region proposal network, and perform pooling processing on the initial candidate region to obtain a first fixed feature and a second fixed feature;
the analysis module is configured to analyze the first fixed feature through a preset text region adjustment network to obtain a feature analysis result;
the detection module is configured to determine an initial text region boundary detection result according to the preset text mask segmentation network and the second fixed feature; and
the adjustment module is configured to adjust the initial text region boundary detection result according to the feature analysis result to obtain a target text region boundary detection result.
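For illustration only, the following minimal Python sketch shows a conventional confidence-sorted, overlap-based mask screening routine consistent with the procedure recited in claims 2 and 3; the greedy loop, the intersection-over-union overlap measure, and the default threshold value are assumptions for this example rather than the claimed algorithm itself.

```python
# Minimal sketch of confidence-sorted mask screening in the spirit of
# claims 2-3: when initial text masks overlap beyond a preset threshold,
# keep the higher-confidence one. The greedy NMS-style loop and the IoU
# overlap measure are assumptions made for illustration.
from typing import List

import numpy as np

def mask_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Overlap ratio of two boolean masks (intersection over union)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def filter_text_masks(masks: List[np.ndarray], confidences: List[float],
                      overlap_threshold: float = 0.5) -> List[np.ndarray]:
    """Sort the initial text masks by confidence and greedily keep a
    mask only if its overlap with every already-kept mask stays at or
    below the preset threshold; the survivors are the target masks."""
    order = np.argsort(confidences)[::-1]  # highest confidence first
    kept: List[int] = []
    for i in order:
        if all(mask_overlap(masks[i], masks[j]) <= overlap_threshold
               for j in kept):
            kept.append(int(i))
    return [masks[i] for i in kept]
```

A routine of this kind keeps, among any group of mutually overlapping initial text masks, only the most confident one, which matches the claimed behavior of sorting by confidence and screening by the text mask overlap ratio.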
CN202110190870.XA 2021-02-20 2021-02-20 Character area boundary detection method, equipment, storage medium and device Active CN112560857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190870.XA CN112560857B (en) 2021-02-20 2021-02-20 Character area boundary detection method, equipment, storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110190870.XA CN112560857B (en) 2021-02-20 2021-02-20 Character area boundary detection method, equipment, storage medium and device

Publications (2)

Publication Number Publication Date
CN112560857A 2021-03-26
CN112560857B CN112560857B (en) 2021-06-08

Family

ID=75034372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110190870.XA Active CN112560857B (en) 2021-02-20 2021-02-20 Character area boundary detection method, equipment, storage medium and device

Country Status (1)

Country Link
CN (1) CN112560857B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794246A (en) * 2015-05-15 2015-07-22 百度在线网络技术(北京)有限公司 Information search method and information search device
CN108470172A (en) * 2017-02-23 2018-08-31 阿里巴巴集团控股有限公司 A kind of text information identification method and device
CN108062547A (en) * 2017-12-13 2018-05-22 北京小米移动软件有限公司 Character detecting method and device
CN108062531A (en) * 2017-12-25 2018-05-22 南京信息工程大学 A kind of video object detection method that convolutional neural networks are returned based on cascade
WO2020005731A1 (en) * 2018-06-29 2020-01-02 Microsoft Technology Licensing, Llc Text entity detection and recognition from images
KR102030628B1 (en) * 2019-04-04 2019-10-10 (주)아이엠시티 Recognizing method and system of vehicle license plate based convolutional neural network
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 Detection and Recognition Method of Curved Characters in Natural Scene Images
CN111553347A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 A scene text detection method for any angle
CN111553349A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Scene text positioning and identifying method based on full convolution network
CN111724401A (en) * 2020-05-08 2020-09-29 华中科技大学 An Image Segmentation Method and System Based on Boundary Constrained Cascade U-Net
CN112149620A (en) * 2020-10-14 2020-12-29 南昌慧亦臣科技有限公司 Method for constructing natural scene character region detection model based on no anchor point

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863431A (en) * 2022-04-14 2022-08-05 中国银行股份有限公司 Text detection method, device and equipment
CN115035377A (en) * 2022-06-15 2022-09-09 天津大学 Saliency detection network system based on dual-stream encoding and interactive decoding
CN115035377B (en) * 2022-06-15 2024-08-23 天津大学 Saliency detection network system based on dual-stream encoding and interactive decoding
CN119942131A (en) * 2025-04-08 2025-05-06 四川科莫生医疗科技有限公司 An instance segmentation method based on edge attention enhancement

Also Published As

Publication number Publication date
CN112560857B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
WO2021147563A1 (en) Object detection method and apparatus, electronic device, and computer readable storage medium
CN108229509B (en) Method and device for identifying object class and electronic equipment
CN112560857B (en) Character area boundary detection method, equipment, storage medium and device
CN110147774B (en) Table format picture layout analysis method and computer storage medium
CN105574513B (en) Character detection method and device
CN113255915B (en) Knowledge distillation method, device, equipment and medium based on structured instance graph
US8019164B2 (en) Apparatus, method and program product for matching with a template
EP3101594A1 (en) Saliency information acquisition device and saliency information acquisition method
JP5298831B2 (en) Image processing apparatus and program
CN107784282A (en) The recognition methods of object properties, apparatus and system
CN111814905A (en) Target detection method, target detection device, computer equipment and storage medium
CN115375917B (en) Target edge feature extraction method, device, terminal and storage medium
CN110442719B (en) Text processing method, device, equipment and storage medium
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN114494775A (en) Video segmentation method, device, device and storage medium
CN106548114B (en) Image processing method, device and computer-readable medium
CN113128604A (en) Page element identification method and device, electronic equipment and storage medium
US20180061078A1 (en) Image processing device, image processing method, and non-transitory computer-readable recording medium
CN115223173A (en) Object identification method and device, electronic equipment and storage medium
CN116228644A (en) Image detection method, electronic device and storage medium
CN113516609A (en) Split screen video detection method and device, computer equipment and storage medium
JP6202938B2 (en) Image recognition apparatus and image recognition method
CN111325194B (en) Character recognition method, device and equipment and storage medium
JP2018180646A (en) Object candidate area estimation device, object candidate area estimation method, and object candidate area estimation program
CN115775386A (en) User interface component identification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant