CN114283419A - Text image area detection method, related equipment and readable storage medium - Google Patents

Text image area detection method, related equipment and readable storage medium

Info

Publication number
CN114283419A
Authority
CN
China
Prior art keywords
region
frame
area
target
text image
Prior art date
Legal status
Pending
Application number
CN202111616149.9A
Other languages
Chinese (zh)
Inventor
张晋
张银田
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111616149.9A
Publication of CN114283419A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a text image region detection method, related equipment, and a readable storage medium. In the scheme, after a text image to be subjected to region detection is acquired, instance segmentation is performed on a first target in the text image to obtain a region detection result of the first target, semantic segmentation is performed on a second target in the text image to obtain a region detection result of the second target, and the region detection result of the text image is then determined from the two. Because different targets are detected in different ways, the scheme avoids the missed or false detections that arise when a single detection method must cover every kind of target, and the accuracy of region detection is thereby improved.

Description

Text image area detection method, related equipment and readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a text image region detection method, a related device, and a readable storage medium.
Background
In some scenarios, region detection must be performed on a text image and the detected regions must then undergo content recognition. For example, in automatic grading of primary and secondary school homework, the question number, question stem, and answer region of each question in the homework image must be detected and recognized to support question-type retrieval, answer extraction, automatic grading, and the like; in document layout analysis, each independent region of a document image must be detected and its content recognized so that the layout content can be analyzed and corrected. In these scenarios, the accuracy of region detection has a decisive impact on the accuracy of the subsequent content recognition.
Text image region detection schemes already exist, but none of them applies to every kind of target: some targets are often missed or falsely detected, so the accuracy of region detection is low.
Improving the accuracy of text image region detection is therefore a pressing technical problem for those skilled in the art.
Disclosure of Invention
In view of the foregoing problems, the present application provides a text image region detection method, a related device, and a readable storage medium. The specific scheme is as follows:
a text image region detection method, the method comprising:
acquiring a text image to be subjected to region detection;
performing instance segmentation on a first target in the text image to obtain a region detection result of the first target;
performing semantic segmentation on a second target in the text image to obtain a region detection result of the second target;
determining a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
Optionally, the performing instance segmentation on the first target in the text image to obtain a region detection result of the first target includes:
inputting the text image into a trained instance segmentation model, which outputs the region detection result of the first target; the instance segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the first target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
Optionally, the instance segmentation model includes a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module;
the training process of the instance segmentation model includes:
the feature map extraction module extracts a feature map of the training text image;
the region box generation module determines, based on the feature map of the training text image, a plurality of region boxes contained in the first target in the training text image;
the region box screening module determines, for each region box generated by the region box generation module, a non-maximum suppression threshold corresponding to that region box based on the shape information of the region box, screens the region boxes to be detected out of the generated region boxes based on the non-maximum suppression threshold corresponding to each region box, and inputs the region boxes to be detected into the region box detection module.
Optionally, the determining, based on the shape information of the region box, the non-maximum suppression threshold corresponding to the region box includes:
calculating the tilt angle and the aspect ratio of the minimum bounding rectangle of the region box;
acquiring a preset reference non-maximum suppression threshold, a tilt angle threshold, and an aspect ratio threshold;
when the tilt angle of the minimum bounding rectangle of the region box is greater than the tilt angle threshold and the aspect ratio of the minimum bounding rectangle is greater than the aspect ratio threshold, adjusting the reference non-maximum suppression threshold and taking the adjusted threshold, which is greater than the reference threshold, as the non-maximum suppression threshold corresponding to the region box;
otherwise, taking the reference non-maximum suppression threshold as the non-maximum suppression threshold corresponding to the region box.
Optionally, the screening the region boxes to be detected out of the plurality of region boxes generated by the region box generation module, based on the non-maximum suppression threshold corresponding to each region box, includes:
for each region box generated by the region box generation module, calculating the intersection-over-union (IoU) between the region box and the region box with the highest current confidence score;
discarding the region box when that IoU is greater than the non-maximum suppression threshold corresponding to the region box; otherwise, determining the region box as a region box to be detected.
Optionally, the region box detection module includes three cascaded detection submodules, the output of each detection submodule serving as the input of the next; the output of the last detection submodule serves as the output of the instance segmentation model; each detection submodule is provided with a reference IoU threshold;
the training process of the instance segmentation model further includes:
for each region box input into a detection submodule, the detection submodule determines the IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box, and assigns a label to the region box based on the IoU threshold corresponding to the region box.
Optionally, the determining the IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box includes:
if the area of the region box is smaller than an area threshold, adjusting the reference IoU threshold and taking the adjusted threshold, which is smaller than the reference IoU threshold, as the IoU threshold corresponding to the region box;
if the area of the region box is greater than or equal to the area threshold, taking the reference IoU threshold as the IoU threshold corresponding to the region box.
Optionally, the performing semantic segmentation on the second target in the text image to obtain a region detection result of the second target includes:
inputting the text image into a trained semantic segmentation model, which outputs the region detection result of the second target; the semantic segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the second target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
Optionally, the determining the region detection result of the text image based on the region detection results of the first and second targets includes:
optimizing the region detection result of the first target to obtain an optimized region detection result of the first target;
determining the region detection result of the text image from the optimized region detection result of the first target and the region detection result of the second target.
Optionally, the optimizing the region detection result of the first target includes:
for each region box type in the detection result of the first target, calculating the overlap ratio of any two region boxes of that type;
acquiring a preset overlap threshold for same-type region boxes;
and if the overlap ratio of the two region boxes is greater than the same-type threshold, discarding the region box with the lower confidence score.
Optionally, the optimizing the region detection result of the first target may also include:
for any two region boxes of different types whose positions satisfy a preset condition in the detection result of the first target, calculating the IoU of the two region boxes;
acquiring a preset IoU threshold for different-type region boxes;
and if the IoU of the two region boxes is greater than the different-type threshold, discarding the region box with the lower confidence score.
A text image region detection apparatus, the apparatus comprising:
a text image acquisition unit, configured to acquire a text image to be subjected to region detection;
an instance segmentation processing unit, configured to perform instance segmentation on a first target in the text image to obtain a region detection result of the first target;
a semantic segmentation processing unit, configured to perform semantic segmentation on a second target in the text image to obtain a region detection result of the second target;
a region detection result determination unit, configured to determine a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
Optionally, the instance segmentation processing unit includes:
an instance segmentation model application unit, configured to input the text image into a trained instance segmentation model, which outputs the region detection result of the first target; the instance segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the first target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
Optionally, the instance segmentation model includes a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module;
the apparatus further comprises an instance segmentation model training unit, which includes:
a feature map extraction unit, configured to extract, with the feature map extraction module, a feature map of the training text image;
a region box generation unit, configured to determine, with the region box generation module and based on the feature map of the training text image, a plurality of region boxes contained in the first target in the training text image;
a region box screening unit, configured to determine, with the region box screening module and for each region box generated by the region box generation module, a non-maximum suppression threshold corresponding to that region box based on its shape information, to screen the region boxes to be detected out of the generated region boxes based on the non-maximum suppression threshold corresponding to each region box, and to input the region boxes to be detected into the region box detection module.
Optionally, the region box screening unit includes:
a calculation unit, configured to calculate the tilt angle and the aspect ratio of the minimum bounding rectangle of the region box;
a threshold acquisition unit, configured to acquire a preset reference non-maximum suppression threshold, a tilt angle threshold, and an aspect ratio threshold;
a non-maximum suppression threshold determination unit, configured to, when the tilt angle of the minimum bounding rectangle of the region box is greater than the tilt angle threshold and the aspect ratio of the minimum bounding rectangle is greater than the aspect ratio threshold, adjust the reference non-maximum suppression threshold and take the adjusted threshold, which is greater than the reference threshold, as the non-maximum suppression threshold corresponding to the region box; and otherwise to take the reference non-maximum suppression threshold as the non-maximum suppression threshold corresponding to the region box.
Optionally, the region box screening unit includes:
an IoU calculation unit, configured to calculate, for each region box generated by the region box generation module, the IoU between the region box and the region box with the highest current confidence score;
a to-be-detected region box determination unit, configured to discard the region box when that IoU is greater than the non-maximum suppression threshold corresponding to the region box, and otherwise to determine the region box as a region box to be detected.
Optionally, the region box detection module includes three cascaded detection submodules, the output of each detection submodule serving as the input of the next; the output of the last detection submodule serves as the output of the instance segmentation model; each detection submodule is provided with a reference IoU threshold;
the instance segmentation model training unit further includes:
a label allocation unit, configured so that, for each region box input into a detection submodule, the detection submodule determines the IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box, and assigns a label to the region box based on that IoU threshold.
Optionally, the label allocation unit includes:
an IoU threshold adjustment unit, configured to adjust the reference IoU threshold if the area of the region box is smaller than the area threshold, taking the adjusted threshold, which is smaller than the reference IoU threshold, as the IoU threshold corresponding to the region box; and, if the area of the region box is greater than or equal to the area threshold, to take the reference IoU threshold as the IoU threshold corresponding to the region box.
Optionally, the semantic segmentation processing unit includes:
a semantic segmentation model application unit, configured to input the text image into a trained semantic segmentation model, which outputs the region detection result of the second target; the semantic segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the second target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
Optionally, the region detection result determination unit includes:
an optimization unit, configured to optimize the region detection result of the first target to obtain an optimized region detection result of the first target;
a region detection result determination subunit, configured to determine the region detection result of the text image from the optimized region detection result of the first target and the region detection result of the second target.
Optionally, the optimization unit includes:
a same-type region box optimization unit, configured to calculate, for each region box type in the detection result of the first target, the overlap ratio of any two region boxes of that type; to acquire a preset overlap threshold for same-type region boxes; and, if the overlap ratio of the two region boxes is greater than the same-type threshold, to discard the region box with the lower confidence score.
Optionally, the optimization unit includes:
a different-type region box optimization unit, configured to calculate the IoU of two region boxes of different types whose positions satisfy a preset condition in the detection result of the first target; to acquire a preset IoU threshold for different-type region boxes; and, if the IoU of the two region boxes is greater than the different-type threshold, to discard the region box with the lower confidence score.
A text image region detection device includes a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the text image region detection method.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text image region detection method described above.
By means of the above technical solution, the application discloses a text image region detection method, related equipment, and a readable storage medium. In the scheme, after a text image to be subjected to region detection is acquired, instance segmentation is performed on a first target in the text image to obtain a region detection result of the first target, semantic segmentation is performed on a second target in the text image to obtain a region detection result of the second target, and the region detection result of the text image is then determined from the two. Because different targets are detected in different ways, the scheme avoids the missed or false detections that arise when a single detection method must cover every kind of target, and the accuracy of region detection is thereby improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a text image region detection method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an instance segmentation model disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of another instance segmentation model disclosed in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text image region detection apparatus disclosed in an embodiment of the present application;
fig. 5 is a block diagram of the hardware structure of a text image region detection device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a text image, the regions to be detected often differ in size and shape. For example, in automatic grading of primary and secondary school homework, the answer regions of different question types differ in size: the answer regions of solution (free-response) questions are large and easy to detect, while those of multiple-choice and fill-in-the-blank questions are small, densely distributed, and hard to detect. In addition, the region to be detected is easily distorted while the text image is being produced; for example, if a homework sheet is bent while being photographed, the region corresponding to the bent part of the text image is distorted. These situations make text image region detection more difficult.
Text image region detection schemes already exist, but none of them applies to every kind of target; some targets are often missed or falsely detected, so the accuracy of region detection is low.
To improve this accuracy, the inventors of the present application, through extensive research, propose a text image region detection method in which region detection is performed in different ways for different targets. This avoids the missed and false detections that arise when a single method must cover every kind of target, and improves the accuracy of region detection.
Next, a text image region detection method provided by the present application will be described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text image region detection method disclosed in an embodiment of the present application. The method may include:
Step S101: acquiring a text image to be subjected to region detection.
In the present application, the text image to be subjected to region detection may be any text image for which region detection is required, for example an image of primary or secondary school homework, an image of a test paper, or the like; the application places no limitation on this.
The text may be photographed with any image acquisition device, such as a camera or a scanner, to obtain the corresponding text image, or a pre-stored text image may be retrieved directly from a storage device; the application places no limitation on this either.
Step S102: performing instance segmentation on the first target in the text image to obtain a region detection result of the first target.
It should be noted that instance segmentation works well on regions that are regular in shape and large in size. Therefore, in the present application, the first target may be detected with an instance segmentation method. The first target may be any target in the text image that is suited to instance segmentation, determined according to the needs of the scenario.
As one implementation, the first target may be a target of larger area in the text image, for example a target whose area is greater than a preset threshold. Taking automatic grading of primary and secondary school homework as an example, the first target may be a major question type. The region detection result of the first target indicates the different types of region boxes contained in the first target; in the same example, it may consist of region boxes indicating the question number, the question stem, and the answer region of each major question.
In the present application, the instance segmentation of the first target may be implemented with a neural network model. An existing general-purpose instance segmentation model may be used, but such a model may not perform well in every scenario, so an instance segmentation model may instead be retrained for the needs of the scenario; this is described in detail in the following embodiments.
Step S103: performing semantic segmentation on the second target in the text image to obtain a region detection result of the second target.
It should be noted that for small, dense, distorted, or elongated regions, instance segmentation performs relatively poorly, whereas semantic segmentation places few requirements on the size and form of a region. Therefore, in the present application, the second target may be detected with a semantic segmentation method. The second target may be any target in the text image that is suited to semantic segmentation, determined according to the needs of the scenario.
As one implementation, the second target may be a target of smaller area in the text image, for example a target whose area is smaller than a preset threshold. Taking automatic grading of primary and secondary school homework as an example, the second target may be a special question type such as a fill-in-the-blank or multiple-choice question. The region detection result of the second target indicates the different types of region boxes contained in the second target; in the same example, it may consist of region boxes indicating the question stem and the answer region of each fill-in-the-blank question.
In the present application, the semantic segmentation of the second target may be implemented with a neural network model. An existing general-purpose semantic segmentation model may be used, but such a model may not perform well in every scenario, so a semantic segmentation model may instead be retrained for the needs of the scenario; this is described in detail in the following embodiments.
Step S104: determining the region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
After the region detection results of the first and second targets are obtained, one implementation is to take them directly, together, as the region detection result of the text image. Another implementation is to post-process the region detection result of the first target and/or of the second target before determining the region detection result of the text image.
This embodiment discloses a text image region detection method in which, after a text image to be subjected to region detection is acquired, instance segmentation is performed on a first target in the text image to obtain a region detection result of the first target, semantic segmentation is performed on a second target to obtain a region detection result of the second target, and the region detection result of the text image is then determined from the two. Because different targets are detected in different ways, the method avoids the missed and false detections that arise when a single method must cover every kind of target, and thereby improves the accuracy of region detection.
In another embodiment of the present application, a specific implementation of step S102, performing instance segmentation on the first target in the text image to obtain a region detection result of the first target, is described. It may include:
inputting the text image into a trained instance segmentation model, which outputs the region detection result of the first target; the instance segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the first target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
It should be noted that the training text images may be text images from the scenario at hand, and the annotated ground-truth region boxes of the first target, and their types, are likewise adapted to that scenario. For ease of understanding, taking primary and secondary school homework as an example, the training text images may be a large number of homework images, the first target may be a major question type (e.g., question-and-answer items), and the types of the ground-truth region boxes contained in the first target may include the question number, the question stem, the answer region, and so on.
As one implementation, please refer to fig. 2, which is a schematic structural diagram of an instance segmentation model disclosed in an embodiment of the present application. As shown in fig. 2, the instance segmentation model includes a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module.
Based on the instance segmentation model shown in fig. 2, the present application also discloses a training process for the model, which may include the following steps:
Step S201: the feature map extraction module extracts a feature map of the training text image.
In this application, as one implementation, the feature map extraction module may be a residual network (e.g., ResNet50); as another implementation, a Feature Pyramid Network (FPN) may be added on top of the residual network to strengthen the expressive power of the features and cover multi-scale information.
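For illustration only, a minimal sketch of such a feature map extraction module follows. The application names only the network types (a residual network plus an FPN); the use of PyTorch/torchvision and its resnet_fpn_backbone helper is an assumption made here, not something the application prescribes.

```python
# Illustrative sketch of the feature map extraction module: ResNet50 with
# a Feature Pyramid Network on top. torchvision is an assumed framework
# choice; requires a recent torchvision (keyword arguments shown below).
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)

images = torch.randn(1, 3, 800, 800)      # a dummy text-image batch
feature_maps = backbone(images)           # OrderedDict of multi-scale maps
for level, fmap in feature_maps.items():
    print(level, tuple(fmap.shape))       # e.g. '0' (1, 256, 200, 200), ...
```

The multi-scale maps produced by the FPN are what allow region boxes of very different sizes to be proposed from the same image.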
Step S202: the region box generation module determines, based on the feature map of the training text image, a plurality of region boxes contained in the first target in the training text image.
As one implementation, the region box generation module may be the Region Proposal Network (RPN) of Faster R-CNN, or the RPN of Cascade Mask R-CNN.
In this step, region boxes of different sizes and aspect ratios are generated at the positions of the feature map corresponding to the first target in the training text image.
Step S203: the region box screening module determines, for each region box generated by the region box generation module, the non-maximum suppression threshold corresponding to that region box; screens the region boxes to be detected out of the generated region boxes based on the non-maximum suppression threshold corresponding to each region box; and inputs the region boxes to be detected into the region box detection module.
Among the region boxes generated in step S202 there are often boxes with excessive overlap. Processing all of them would lengthen training and also harm the training result, so in the present application the region box screening module screens the boxes generated by the region box generation module and selects the region boxes to be detected for further processing.
Conventionally, the RPN applies non-maximum suppression (NMS) to boxes with excessive overlap: a fixed NMS threshold (usually 0.7) is set, and boxes overlapping more than this threshold are suppressed. However, among the region boxes of the first target generated in step S202 there are boxes that should not be suppressed; with a fixed threshold they would be, and the model would detect them poorly. Therefore, in the present application, different NMS thresholds are used for different region boxes, weakening the suppression of certain boxes.
As one implementation, the NMS threshold corresponding to a region box may be determined from the shape information of the box, which can be characterized in various ways; for example, by the tilt angle and aspect ratio of the box's minimum bounding rectangle. On this basis, determining the NMS threshold corresponding to the region box may include: calculating the tilt angle and the aspect ratio of the minimum bounding rectangle of the region box; acquiring a preset reference NMS threshold, tilt angle threshold, and aspect ratio threshold; when the tilt angle of the minimum bounding rectangle is greater than the tilt angle threshold and its aspect ratio is greater than the aspect ratio threshold, adjusting the reference NMS threshold upward and taking the adjusted value as the NMS threshold corresponding to the region box; and otherwise taking the reference NMS threshold itself.
For ease of understanding, let angle be the tilt angle of the minimum bounding rectangle of the region box and ratio its aspect ratio; let thr_a be the tilt angle threshold and thr_r the aspect ratio threshold, both obtainable from prior statistics of the first target in the training text images; and let thr_IoU0 be the reference NMS threshold, whose value in the RPN training stage is usually 0.7. The NMS threshold thr_IoU corresponding to the region box is then the piecewise function

    thr_IoU = thr'      if angle > thr_a and ratio > thr_r
    thr_IoU = thr_IoU0  otherwise

where thr' > thr_IoU0 is the adjusted threshold.
On the above basis, as one implementation, screening the region boxes to be detected out of the generated region boxes based on the NMS threshold corresponding to each box proceeds as follows: for each region box generated by the region box generation module, calculate the IoU between the box and the box with the highest current confidence score; if that IoU is greater than the NMS threshold corresponding to the box, discard the box; otherwise, keep it as a region box to be detected.
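For illustration only, the sketch below puts the two pieces together: a per-box NMS threshold derived from the minimum bounding rectangle, followed by a greedy NMS pass. Quadrilateral boxes, OpenCV's minAreaRect, shapely for polygon IoU, and the values of delta, thr_a, and thr_r are all assumptions not fixed by the application (which states only that the adjusted threshold exceeds the reference).

```python
# Minimal sketch of shape-adaptive NMS (illustrative; not the claimed
# implementation). Boxes are assumed to be quadrilaterals as 4x2 arrays.
import numpy as np
import cv2
from shapely.geometry import Polygon

def poly_iou(a, b):
    pa, pb = Polygon(a), Polygon(b)
    union = pa.union(pb).area
    return pa.intersection(pb).area / union if union > 0 else 0.0

def nms_threshold(box, thr_a, thr_r, thr_iou0, delta=0.1):
    # Shape of the minimum bounding rectangle; note that the angle
    # convention of cv2.minAreaRect differs between OpenCV versions.
    (_, _), (w, h), angle = cv2.minAreaRect(box.astype(np.float32))
    if w < h:                       # measure tilt along the long side
        w, h, angle = h, w, angle + 90.0
    ratio = w / max(h, 1e-6)
    if abs(angle) > thr_a and ratio > thr_r:
        return thr_iou0 + delta     # raise the threshold: suppress less
    return thr_iou0

def adaptive_nms(boxes, scores, thr_a=30.0, thr_r=5.0, thr_iou0=0.7):
    """Greedy NMS in which each candidate carries its own threshold."""
    keep = []
    for i in np.argsort(scores)[::-1]:          # highest score first
        thr = nms_threshold(boxes[i], thr_a, thr_r, thr_iou0)
        if all(poly_iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)
    return keep                                  # indices of kept boxes
```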
As another implementation, please refer to fig. 3, which is a schematic structural diagram of another instance segmentation model disclosed in an embodiment of the present application. As shown in fig. 3, the instance segmentation model includes a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module, where the region box detection module includes three cascaded detection submodules, the output of each serving as the input of the next; the output of the last detection submodule serves as the output of the instance segmentation model.
It should be noted that each detection submodule is provided with a reference IoU threshold: for each input region box, the IoU between the box and an annotated ground-truth box is calculated; if that IoU is greater than the reference IoU threshold, a positive label is assigned to the box, and otherwise a negative label.
As one implementation, the three detection submodules may be the three cascaded classification-regression heads of Cascade Mask R-CNN. In Cascade Mask R-CNN, different heads are given different, fixed reference IoU thresholds, each later head's threshold larger than the previous one's; as the threshold grows, small targets are increasingly assigned negative labels, so the model learns small targets poorly during training. To avoid this, in the training of the instance segmentation model of fig. 3, for each region box input into a detection submodule, the submodule determines the IoU threshold corresponding to the box from the reference IoU threshold and the area of the box, and assigns the box a label based on that per-box threshold.
As one implementation, determining the IoU threshold corresponding to the region box from the reference IoU threshold and the area of the box includes: if the area of the box is smaller than an area threshold, adjusting the reference IoU threshold downward and taking the adjusted value, which is smaller than the reference IoU threshold, as the IoU threshold corresponding to the box; if the area of the box is greater than or equal to the area threshold, taking the reference IoU threshold itself.
For ease of understanding, let area be the area of the region box, thr_area the area threshold (obtainable from prior statistics of the first target in the training text images), and IoU_0 the reference IoU threshold. The IoU threshold IoU_thr corresponding to the region box is then the piecewise function

    IoU_thr = IoU'   if area < thr_area
    IoU_thr = IoU_0  otherwise

where IoU' < IoU_0 is the adjusted threshold.
On this basis, as one implementation, assigning a label to the region box based on its corresponding IoU threshold may include: calculating the IoU between the region box and the annotated ground-truth box; if that IoU is greater than the IoU threshold corresponding to the region box, assigning the box a positive label; otherwise, assigning it a negative label.
It should be noted that this scheme lowers the IoU threshold for small region boxes, relaxing the condition under which a small box is judged a positive sample and thus improving the model's ability to learn small targets during training.
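For illustration only, the following sketch shows one possible form of this area-adaptive label assignment. Axis-aligned boxes, the decrement delta, and the example stage thresholds are assumptions; the application states only that the adjusted IoU threshold lies below the reference.

```python
# Illustrative sketch of area-adaptive label assignment in a cascaded
# detection head. Boxes are assumed axis-aligned (x1, y1, x2, y2).
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)

def assign_label(box, gt_boxes, iou0, area_thr, delta=0.05):
    """iou0 is the stage's reference threshold (e.g. 0.5/0.6/0.7 across
    three cascaded heads, an assumed example); delta is an assumed
    decrement -- the application only requires the adjusted threshold
    to be smaller than the reference."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    iou_thr = (iou0 - delta) if area < area_thr else iou0
    best_iou = max((box_iou(box, gt) for gt in gt_boxes), default=0.0)
    return 1 if best_iou > iou_thr else 0   # 1 = positive, 0 = negative
```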
In another embodiment of the present application, an implementation of step S103, performing semantic segmentation on the second target in the text image to obtain a region detection result of the second target, is described. It may include:
inputting the text image into a trained semantic segmentation model, which outputs the region detection result of the second target; the semantic segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the second target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
It should be noted that the training text images may be text images from the scenario at hand, and the annotated ground-truth region boxes of the second target, and their types, are likewise adapted to that scenario. For ease of understanding, taking primary and secondary school homework as an example, the training text images may be a large number of homework images, the second target may be a minor question type (e.g., fill-in-the-blank or multiple-choice questions), and the types of the ground-truth region boxes contained in the second target may include the question number, the question stem, the answer region, and so on.
As one implementation, the semantic segmentation model may adopt the structure of the DBNet network ("Real-time Scene Text Detection with Differentiable Binarization").
In another embodiment of the present application, implementations of step S104, determining the region detection result of the text image based on the region detection results of the first and second targets, are introduced, as follows:
As one implementation, the region detection result of the text image may be determined directly from the region detection result of the first target and the region detection result of the second target.
However, the region detection result of the first target may contain near-duplicate region boxes, which lower its accuracy. As another implementation, therefore, the determination may include: optimizing the region detection result of the first target to obtain an optimized result, and then determining the region detection result of the text image from the optimized region detection result of the first target and the region detection result of the second target.
As one implementation, the optimizing of the region detection result of the first target may include the following steps:
Step S301: for each region box type in the detection result of the first target, calculate the overlap ratio of any two region boxes of that type.
For ease of understanding, if two region boxes of a given type in the detection result of the first target are A and B, their overlap ratio may be calculated as overlap(A, B) = area(A ∩ B) / min(area(A), area(B)), where min(area(A), area(B)) is the smaller of the two box areas and A ∩ B is their overlapping region.
Step S302: acquire the preset overlap threshold for same-type region boxes.
Step S303: if the overlap ratio of the two region boxes is greater than the same-type threshold, discard the region box with the lower confidence score.
As another implementation, the optimizing of the region detection result of the first target may include the following steps:
Step S401: for any two region boxes of different types in the detection result of the first target whose positions satisfy a preset condition, calculate the IoU of the two boxes.
In this step, the preset condition on the positions may be that the two region boxes intersect.
For ease of understanding, if the two different-type region boxes whose positions satisfy the preset condition are A and B, their IoU may be calculated as IoU = area(A ∩ B) / area(A ∪ B), where the union area area(A ∪ B) equals area(A) + area(B) − area(A ∩ B).
Step S402: acquire the preset IoU threshold for different-type region boxes.
Step S403: if the IoU of the two region boxes is greater than the different-type threshold, discard the region box with the lower confidence score.
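For illustration only, the corresponding sketch for different-type region boxes, this time with the standard IoU of the two boxes; the intersection test implements the assumed preset condition, and the 0.5 threshold is an assumed value.

```python
# Illustrative sketch of cross-type conflict resolution between two lists
# of region boxes (e.g. question-stem boxes vs. answer-region boxes).
def iou(a, b):
    """Standard IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / max(union, 1e-6)

def resolve_cross_type(boxes_a, scores_a, boxes_b, scores_b, thr_diff=0.5):
    """Where boxes of different types intersect and their IoU exceeds
    thr_diff (an assumed preset), drop the lower-scoring box."""
    keep_a, keep_b = set(range(len(boxes_a))), set(range(len(boxes_b)))
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            if i in keep_a and j in keep_b and iou(a, b) > thr_diff:
                if scores_a[i] < scores_b[j]:
                    keep_a.discard(i)
                else:
                    keep_b.discard(j)
    return sorted(keep_a), sorted(keep_b)
```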
It should be noted that either of the two optimization methods may be applied on its own, or both may be applied. Optimizing the detection result of the first target makes the region detection result of the text image more accurate.
The text image region detection apparatus disclosed in the embodiments of the present application is described below; it and the text image region detection method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text image region detection apparatus disclosed in an embodiment of the present application. As shown in fig. 4, the apparatus may include:
a text image acquisition unit 41, configured to acquire a text image to be subjected to region detection;
an instance segmentation processing unit 42, configured to perform instance segmentation on a first target in the text image to obtain a region detection result of the first target;
a semantic segmentation processing unit 43, configured to perform semantic segmentation on a second target in the text image to obtain a region detection result of the second target;
a region detection result determination unit 44, configured to determine a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
Optionally, the instance segmentation processing unit includes:
an instance segmentation model application unit, configured to input the text image into a trained instance segmentation model, which outputs the region detection result of the first target; the instance segmentation model is trained using training text images as training samples, with the ground-truth region boxes contained in the first target annotated in each training text image, and the type of each ground-truth region box, as sample labels.
Optionally, the instance segmentation model includes a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module;
the apparatus further comprises an instance segmentation model training unit, which includes:
a feature map extraction unit, configured to extract, with the feature map extraction module, a feature map of the training text image;
a region box generation unit, configured to determine, with the region box generation module and based on the feature map of the training text image, a plurality of region boxes contained in the first target in the training text image;
a region box screening unit, configured to determine, with the region box screening module and for each region box generated by the region box generation module, a non-maximum suppression threshold corresponding to that region box based on its shape information, to screen the region boxes to be detected out of the generated region boxes based on the non-maximum suppression threshold corresponding to each region box, and to input the region boxes to be detected into the region box detection module.
Optionally, the region box screening unit includes:
a calculation unit, configured to calculate the tilt angle and the aspect ratio of the minimum bounding rectangle of the region box;
a threshold acquisition unit, configured to acquire a preset reference non-maximum suppression threshold, a tilt angle threshold, and an aspect ratio threshold;
a non-maximum suppression threshold determination unit, configured to, when the tilt angle of the minimum bounding rectangle of the region box is greater than the tilt angle threshold and the aspect ratio of the minimum bounding rectangle is greater than the aspect ratio threshold, adjust the reference non-maximum suppression threshold and take the adjusted threshold, which is greater than the reference threshold, as the non-maximum suppression threshold corresponding to the region box; and otherwise to take the reference non-maximum suppression threshold as the non-maximum suppression threshold corresponding to the region box.
Optionally, the region box screening unit includes:
an IoU calculation unit, configured to calculate, for each region box generated by the region box generation module, the IoU between the region box and the region box with the highest current confidence score;
a to-be-detected region box determination unit, configured to discard the region box when that IoU is greater than the non-maximum suppression threshold corresponding to the region box, and otherwise to determine the region box as a region box to be detected.
Optionally, the region box detection module includes three cascaded detection submodules, the output of each detection submodule serving as the input of the next; the output of the last detection submodule serves as the output of the instance segmentation model; each detection submodule is provided with a reference IoU threshold;
the instance segmentation model training unit further includes:
a label allocation unit, configured so that, for each region box input into a detection submodule, the detection submodule determines the IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box, and assigns a label to the region box based on that IoU threshold.
Optionally, the label allocation unit includes:
an IoU threshold adjustment unit, configured to: if the area of the region box is smaller than an area threshold, adjust the reference IoU threshold to obtain an adjusted IoU threshold as the IoU threshold corresponding to the region box, the adjusted threshold being smaller than the reference threshold;
and if the area of the region box is greater than or equal to the area threshold, use the reference IoU threshold as the IoU threshold corresponding to the region box. A sketch of this area-adaptive label assignment follows.
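A sketch of the area-adaptive label assignment inside one cascade stage, reusing iou_xyxy from above; the area threshold and the size of the downward adjustment are illustrative assumptions, since the patent only requires the small-box threshold to be lower:

def assign_label(box, gt_boxes, gt_types, base_iou_thresh,
                 area_thresh=32 * 32, relax=0.1):  # both values assumed
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    # Small boxes rarely reach a high IoU with their ground truth, so the
    # matching threshold is lowered for them.
    thresh = base_iou_thresh - relax if area < area_thresh else base_iou_thresh
    best_iou, best_type = 0.0, None
    for gt, t in zip(gt_boxes, gt_types):
        iou = iou_xyxy(box, gt)
        if iou > best_iou:
            best_iou, best_type = iou, t
    # Positive label if matched above the (possibly relaxed) threshold.
    return best_type if best_iou >= thresh else "background"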
Optionally, the semantic segmentation processing unit includes:
a semantic segmentation model application unit, configured to input the text image into a trained semantic segmentation model, which outputs the region detection result of the second target; the semantic segmentation model is trained by using training text images as training samples, and using the real region boxes contained in the second target labeled in each training text image, together with the type of each real region box, as sample labels.
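The patent leaves the inference procedure open, so the sketch below assumes a PyTorch-style model producing a per-pixel class map; converting each class's connected components into region boxes is our assumption, not something the patent prescribes:

import cv2
import numpy as np
import torch

def detect_second_target(model, image_tensor, num_classes):
    # Per-pixel class prediction for the second target.
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))        # (1, C, H, W)
    class_map = logits.argmax(dim=1)[0].cpu().numpy().astype(np.uint8)
    results = []
    for cls in range(1, num_classes):                    # class 0 = background
        mask = (class_map == cls).astype(np.uint8)
        n, labels = cv2.connectedComponents(mask)
        for k in range(1, n):                            # each component -> one box
            ys, xs = np.where(labels == k)
            results.append((cls, (int(xs.min()), int(ys.min()),
                                  int(xs.max()), int(ys.max()))))
    return results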
Optionally, the region detection result determination unit includes:
an optimization unit, configured to optimize the region detection result of the first target to obtain an optimized region detection result of the first target;
a region detection result determination subunit, configured to determine the region detection result of the text image from the optimized region detection result of the first target and the region detection result of the second target.
Optionally, the optimization unit includes:
a same-type region box optimization unit, configured to calculate, for each region box type in the first target's detection result, the IoU of any two region boxes of that type; to acquire a preset same-type region box IoU threshold; and, if the IoU of two region boxes is greater than the same-type threshold, to discard the region box with the lower confidence score of the two.
Optionally, the optimization unit includes:
a different-type region box optimization unit, configured to calculate the IoU of two region boxes of different types whose positions meet a preset condition in the first target's detection result; to acquire a preset different-type region box IoU threshold; and, if the IoU of the two region boxes is greater than the different-type threshold, to discard the region box with the lower confidence score of the two. Both rules are sketched below.
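Both rules reduce to keeping the higher-scoring of two heavily overlapping boxes. A combined sketch, with the positional precondition for different-type pairs folded into the IoU check for brevity, and with illustrative threshold values:

def deduplicate_boxes(detections, same_type_thresh=0.7, cross_type_thresh=0.8):
    # detections: list of (box, box_type, confidence_score). Thresholds assumed.
    detections = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    for box, typ, score in detections:
        duplicate = False
        for kept_box, kept_typ, _ in kept:
            thresh = same_type_thresh if typ == kept_typ else cross_type_thresh
            if iou_xyxy(box, kept_box) > thresh:
                duplicate = True   # overlaps a higher-scoring box: discard
                break
        if not duplicate:
            kept.append((box, typ, score))
    return kept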
Referring to fig. 5, fig. 5 is a block diagram of the hardware structure of a text image region detection device according to an embodiment of the present application. As shown in fig. 5, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement embodiments of the present application, or the like.
The memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory.
The memory stores a program, and the processor may call the program stored in the memory, the program being configured to:
acquire a text image to be subjected to region detection;
perform instance segmentation processing on a first target in the text image to obtain a region detection result of the first target;
perform semantic segmentation processing on a second target in the text image to obtain a region detection result of the second target;
and determine a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
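Put together, the program amounts to the short flow below; run_instance_segmentation is a hypothetical helper standing in for the instance segmentation model described earlier:

def detect_regions(text_image, instance_model, semantic_model, num_classes):
    # First target: instance segmentation followed by the IoU-based
    # optimization. run_instance_segmentation is a hypothetical helper
    # returning (box, type, score) triples.
    first = deduplicate_boxes(run_instance_segmentation(instance_model, text_image))
    # Second target: semantic segmentation (sketched above), normalized to
    # the same (box, type, score) format.
    second = [(box, cls, 1.0)
              for cls, box in detect_second_target(semantic_model, text_image,
                                                   num_classes)]
    # Region detection result of the text image: the union of both results.
    return first + second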
Optionally, the refined and extended functions of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, which may store a program executable by a processor, the program being configured to:
acquire a text image to be subjected to region detection;
perform instance segmentation processing on a first target in the text image to obtain a region detection result of the first target;
perform semantic segmentation processing on a second target in the text image to obtain a region detection result of the second target;
and determine a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
Optionally, the refined and extended functions of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A text image region detection method, characterized in that the method comprises:
acquiring a text image to be subjected to region detection;
performing instance segmentation processing on a first target in the text image to obtain a region detection result of the first target;
performing semantic segmentation processing on a second target in the text image to obtain a region detection result of the second target;
determining a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
2. The method according to claim 1, wherein the performing instance segmentation processing on the first target in the text image to obtain a region detection result of the first target comprises:
inputting the text image into a trained instance segmentation model, the instance segmentation model outputting the region detection result of the first target, wherein the instance segmentation model is trained by using a training text image as a training sample, and using the real region boxes contained in the first target labeled in the training text image, together with the type of each real region box, as a sample label.
3. The method according to claim 2, wherein the instance segmentation model comprises a feature map extraction module, a region box generation module, a region box screening module, and a region box detection module;
the training process of the instance segmentation model comprises:
the feature map extraction module extracting a feature map of the training text image;
the region box generation module determining, based on the feature map of the training text image, a plurality of region boxes contained in a first target in the training text image;
the region box screening module determining, for each region box generated by the region box generation module, a non-maximum suppression (NMS) threshold corresponding to the region box based on shape information of the region box; screening out the region boxes to be detected from the plurality of region boxes generated by the region box generation module based on the NMS threshold corresponding to each region box; and inputting the region boxes to be detected into the region box detection module.
4. The method according to claim 3, wherein determining the NMS threshold corresponding to the region box based on the shape information of the region box comprises:
calculating the tilt angle and the aspect ratio of the minimum bounding rectangle of the region box;
acquiring a preset reference NMS threshold, a tilt-angle threshold, and an aspect-ratio threshold;
when the tilt angle of the minimum bounding rectangle of the region box is greater than the tilt-angle threshold and the aspect ratio of the minimum bounding rectangle of the region box is greater than the aspect-ratio threshold, adjusting the reference NMS threshold, and determining the adjusted NMS threshold as the NMS threshold corresponding to the region box, the adjusted NMS threshold being greater than the reference NMS threshold;
otherwise, determining the reference NMS threshold as the NMS threshold corresponding to the region box.
5. The method according to claim 3, wherein screening out the region boxes to be detected from the plurality of region boxes generated by the region box generation module based on the NMS threshold corresponding to each region box comprises:
for each region box generated by the region box generation module, calculating the intersection-over-union (IoU) between the region box and the region box with the highest current confidence score;
when the IoU between the region box and the region box with the highest current confidence score is greater than the NMS threshold corresponding to the region box, discarding the region box; otherwise, determining the region box as a region box to be detected.
6. The method according to claim 3, wherein the region box detection module comprises three cascaded detection sub-modules, the output of each detection sub-module serving as the input of the next; the output of the last detection sub-module serves as the output of the instance segmentation model; and each detection sub-module is provided with a reference intersection-over-union (IoU) threshold;
the training process of the instance segmentation model further comprises:
for each detection sub-module, and for each region box input into the detection sub-module, the detection sub-module determining an IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box, and allocating a label to the region box based on the IoU threshold corresponding to the region box.
7. The method according to claim 6, wherein determining the IoU threshold corresponding to the region box based on the reference IoU threshold and the area of the region box comprises:
if the area of the region box is smaller than an area threshold, adjusting the reference IoU threshold to obtain an adjusted IoU threshold as the IoU threshold corresponding to the region box, the adjusted IoU threshold being smaller than the reference IoU threshold;
and if the area of the region box is greater than or equal to the area threshold, using the reference IoU threshold as the IoU threshold corresponding to the region box.
8. The method according to claim 2, wherein the performing semantic segmentation processing on the second target in the text image to obtain a region detection result of the second target comprises:
inputting the text image into a trained semantic segmentation model, the semantic segmentation model outputting the region detection result of the second target, wherein the semantic segmentation model is trained by using the training text image as a training sample, and using the real region boxes contained in the second target labeled in the training text image, together with the type of each real region box, as a sample label.
9. The method according to claim 1, wherein determining the region detection result of the text image based on the region detection result of the first target and the region detection result of the second target comprises:
optimizing the region detection result of the first target to obtain an optimized region detection result of the first target;
and determining the region detection result of the text image from the optimized region detection result of the first target and the region detection result of the second target.
10. The method according to claim 9, wherein optimizing the region detection result of the first target to obtain the optimized region detection result of the first target comprises:
for each region box type in the first target's detection result, calculating the intersection-over-union (IoU) of any two region boxes of that type;
acquiring a preset same-type region box IoU threshold;
and if the IoU of the two region boxes is greater than the same-type region box IoU threshold, discarding the region box with the lower confidence score of the two.
11. The method according to claim 9 or 10, wherein optimizing the region detection result of the first target to obtain the optimized region detection result of the first target comprises:
for two region boxes of different types whose positions meet a preset condition in the first target's detection result, calculating the intersection-over-union (IoU) of the two region boxes;
acquiring a preset different-type region box IoU threshold;
and if the IoU of the two region boxes is greater than the different-type region box IoU threshold, discarding the region box with the lower confidence score of the two.
12. A text image region detection apparatus, characterized in that the apparatus comprises:
a text image acquisition unit, configured to acquire a text image to be subjected to region detection;
an instance segmentation processing unit, configured to perform instance segmentation processing on a first target in the text image to obtain a region detection result of the first target;
a semantic segmentation processing unit, configured to perform semantic segmentation processing on a second target in the text image to obtain a region detection result of the second target;
a region detection result determination unit, configured to determine a region detection result of the text image based on the region detection result of the first target and the region detection result of the second target.
13. A text image region detection apparatus, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the text image region detection method according to any one of claims 1 to 11.
14. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text image region detection method according to any one of claims 1 to 11.
CN202111616149.9A 2021-12-27 2021-12-27 Text image area detection method, related equipment and readable storage medium Pending CN114283419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111616149.9A CN114283419A (en) 2021-12-27 2021-12-27 Text image area detection method, related equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN114283419A 2022-04-05

Family

ID=80876459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111616149.9A Pending CN114283419A (en) 2021-12-27 2021-12-27 Text image area detection method, related equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114283419A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721145A (en) * 2023-08-09 2023-09-08 安徽容知日新科技股份有限公司 Liquid medium leakage detection method and device, electronic equipment and storage medium
CN116721145B (en) * 2023-08-09 2023-10-31 安徽容知日新科技股份有限公司 Liquid medium leakage detection method and device, electronic equipment and storage medium


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination