CN112801097B - Training method and device of text detection model and readable storage medium - Google Patents

Training method and device of text detection model and readable storage medium

Info

Publication number
CN112801097B
CN112801097B (application CN202110397684.3A)
Authority
CN
China
Prior art keywords
text
detection result
training
detection
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110397684.3A
Other languages
Chinese (zh)
Other versions
CN112801097A (en)
Inventor
王德强
刘霄
熊泽法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110397684.3A
Publication of CN112801097A
Application granted
Publication of CN112801097B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The embodiments of the disclosure relate to a training method and device for a text detection model, and a readable storage medium. The method includes: acquiring a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region; inputting the training sample into an initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches a preset number of iterations, thereby obtaining the text detection model. Because the standard text region is obtained by shrinking the initial text region and the reduction distance is related only to the minimum of the length and width of the initial text region, the method is better suited to line-text detection scenarios, the stability of the training of the text detection model is ensured, and the detection accuracy of the text detection model is improved.

Description

Training method and device of text detection model and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training a text detection model, and a readable storage medium.
Background
In the "Artificial Intelligence (AI) + education" scenario, locating text lines in image text and detecting multiple text types such as handwritten text and formula text are prerequisite steps for layout restoration and content understanding. At present, text detection for different text types is usually realized with a pixel-segmentation-based approach. Specifically, a text detection model performs feature extraction on the image text and classifies the pixel points of the image text according to the extracted feature map; then, a connected domain of each text type is extracted from the classification prediction result to serve as a text detection instance for subsequent text recognition.
For dense text, to avoid the adhesion problem between text lines, the text detection model predicts one or more text regions with the same shape but different sizes, which are typically smaller than the actual text regions. In the prior art, the text detection model adopts a polygon clipping algorithm that determines a reduction distance according to the area and the perimeter of the real text region, and shrinks the real text region by the determined distance to obtain the predicted text region.
Although shrinking the real text region can alleviate the adhesion problem of adjacent text lines, the reduction distance determined by the polygon clipping algorithm depends on the area and the perimeter of the real text region; when different text lines have the same width but different lengths, their reduction distances differ greatly, which easily makes the training of the text detection model unstable.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present disclosure provides a method and an apparatus for training a text detection model, and a readable storage medium.
In a first aspect, the present disclosure provides a method for training a text detection model, including:
obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
inputting the training sample into an initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs;
acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result;
and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and obtaining the text detection model.
In some possible designs, the reduction distance when reducing the initial text region satisfies the formula:
d = min(w, h) / a
where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter.
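As a concrete illustration, the following minimal Python sketch applies this shrink rule to an axis-aligned text box; the function names and the default value a = 10.0 are illustrative assumptions rather than values fixed by this design.

    def shrink_distance(w: float, h: float, a: float = 10.0) -> float:
        """Reduction distance of the formula above: d = min(w, h) / a."""
        return min(w, h) / a

    def shrink_rect(x0, y0, x1, y1, a=10.0):
        """Shrink an axis-aligned initial text box inward by d on every side."""
        d = shrink_distance(x1 - x0, y1 - y0, a)
        return x0 + d, y0 + d, x1 - d, y1 - d

    # Two text lines with the same height shrink by the same distance,
    # no matter how long they are:
    print(shrink_distance(200.0, 20.0))   # 2.0
    print(shrink_distance(1000.0, 20.0))  # 2.0
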
In some possible designs, the initial text detection model includes a feature extraction sub-model, a feature fusion sub-model, and a classification prediction sub-model; the inputting the training sample into an initial text detection model to obtain a plurality of detection results of the training sample includes:
inputting the training sample into the feature extraction submodel, and extracting a plurality of first feature maps with different scales of the training sample, a plurality of second feature maps with different scales of the training sample and a first detection result;
fusing the third feature map and the fourth feature map through the feature fusion submodel, and outputting a first fusion feature map; wherein the first feature maps with different scales comprise the third feature map, the second feature maps with different scales comprise the fourth feature map, and the third feature map and the fourth feature map have the same scale;
fusing the first fused feature map and the first detection result to obtain a second fused feature map; inputting the second fusion characteristic diagram into the classification predictor model to obtain a second detection result;
the plurality of detection results includes the first detection result and the second detection result.
In some possible designs, the feature extraction submodel includes: a first feature extraction submodel and a second feature extraction submodel;
the first feature extraction submodel is used for carrying out multiple times of downsampling processing on the original feature map of the training sample and extracting the first feature maps with different scales;
the second feature extraction submodel is used for performing multiple upsampling operations on the first feature map with the smallest scale and extracting a plurality of second feature maps with different scales; and acquiring the first detection result according to the second feature map with the largest scale.
In some possible designs, the third feature map is the first feature map with the largest scale in the plurality of first feature maps with different scales; the fourth feature map is the second feature map with the same scale as the third feature map.
In some possible designs, the first fused feature map is fused with the first detection result to obtain a second fused feature map; and inputting the second fusion feature map into the classification predictor model to obtain the second detection result, wherein the second detection result comprises:
adding the probability values of the same pixel points in the first fusion characteristic diagram and the first detection result to obtain a second fusion characteristic diagram;
inputting the second fusion characteristic graph into N channels of the classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively; n is an integer greater than 2; the N channels correspond to N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same;
aiming at each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
In some possible designs, the obtaining a first loss value according to a preset loss function, the detection result, and the standard detection result includes:
respectively acquiring a second loss value of the first detection result and the standard detection result and a third loss value between the second detection result and the standard detection result according to a preset loss function;
and acquiring the first loss value according to the second loss value and the third loss value.
In a second aspect, the present disclosure provides a training apparatus for a text detection model, including:
an obtaining module, configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
the processing module is used for inputting the training sample into an initial text detection model and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs; acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and acquiring the text detection model.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: memory, processor, and computer program instructions;
the memory configured to store the computer program instructions;
the processor configured to execute the computer program instructions, the processor executing the computer program instructions to perform the training method of the text detection model according to any one of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a readable storage medium, including: computer program instructions;
the computer program instructions, when executed by a processor of an electronic device, are configured to perform the method of training a text detection model according to any of the first aspect.
In a fifth aspect, the disclosed embodiments also provide a program product, where the program product includes a computer program, the computer program is stored in a readable storage medium, the computer program can be read by at least one processor of a training apparatus of the text detection model, and the at least one processor executes the computer program to make the training apparatus of the text detection model execute the training method of the text detection model according to any one of the first aspect.
The embodiments of the disclosure provide a training method and device for a text detection model, and a readable storage medium. The method includes: obtaining a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region and a text type identifier to which each standard text region belongs; inputting the training sample into the initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches the preset number of iterations, thereby obtaining the text detection model. In this scheme, the standard text region is obtained by shrinking the initial text region, and the reduction distance is related only to the minimum of the length and width of the initial text region, which makes the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, avoiding differences in reduction distance caused by differing text region lengths. This in turn ensures the stability of the training of the text detection model and improves its detection accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart of a training method of a text detection model according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a comparison between an initial text region and a standard text region of a training sample according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an initial text detection model according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure;
fig. 5a is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure;
fig. 5b is a schematic structural diagram of a feature fusion submodel according to an embodiment of the disclosure;
FIG. 6 is a schematic structural diagram of a classification predictor model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training apparatus for a text detection model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In the "AI + education" scenario, multi-class detection of image text is crucial. For example, in a science and chemistry scenario, besides printed text, there are a large number of special characters, such as formula text, handwritten text, and the like, and the text recognition of these categories is difficult, and therefore, the content recognition model is usually customized for these categories individually. If the text type detection is wrong, the recognized text instance is input into the unmatched content recognition model, and further serious errors occur in text content recognition.
At present, there are two main approaches to deep-learning-based text detection: regression methods based on preset boxes (anchors), and methods based on pixel segmentation.
A typical regression method based on preset boxes is Faster RCNN, which has the advantages of strong classification capability and insensitivity to image noise, but adapts poorly to dense and curved text lines; moreover, its processing time for a single image text is long in practical applications, making real-time requirements difficult to meet.
The pixel-segmentation-based method performs dense prediction on the text and extracts text instances of various types through a dedicated post-processing strategy, which gives it clear advantages for dense text and curved text. Current pixel-segmentation-based methods include the PSENet and DBNet algorithms. To avoid the adhesion problem of adjacent dense text lines, these algorithms usually predict one or more text regions with the same shape but different sizes, smaller than the real text region; during training of the text detection model, both PSENet and DBNet adopt a polygon clipping algorithm (the Vatti clipping algorithm) to shrink an initial text region inward by a certain distance, thereby obtaining the text regions used for multi-class text detection. The reduction distance satisfies formula (1):
d_i = A_i × (1 − r²) / P_i    formula (1)
where A_i represents the area of the i-th initial text region in the image text; P_i represents the perimeter of the i-th initial text region; and r represents a hyper-parameter (the shrink ratio).
although the above method can avoid the problem of text region blocking, it can be known from formula (1) that the calculation formula of the reduced distance of the text region is a complex function of the area and the perimeter in the prior art. By adopting a calculation mode in the prior art, when the text regions have the same width and different lengths, the difference of the reduction distances of the text regions is large, which easily causes unstable training of the text detection model, and further causes the accuracy of the text region predicted by the text detection model to be reduced.
In order to solve the problems in the prior art, the present disclosure provides a training method for a text detection model. The following describes the training method of the text detection model provided by the present disclosure in detail through several specific embodiments.
Fig. 1 is a flowchart of a training method of a text detection model according to an embodiment of the present disclosure. The execution subject of the method can be a training device of the text detection model provided by the embodiment of the disclosure, and the device can be realized in a software and/or hardware manner. As shown in fig. 1, the method includes:
s101, obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing an initial text region, and a reduction distance is determined according to a minimum value of a length and a width of the initial text region.
Each training sample carries a standard detection result. Specifically, the standard detection result of each training sample includes at least one standard text region; the standard text region is smaller than the real text region and is obtained by reducing the initial text region. The initial text region may be manually labeled in advance.
In the scheme, the reduction distance corresponding to each initial text region has an association relation with the minimum value of the length and the width of the initial text region.
Alternatively, the reduction distance satisfies the following formula (2):
d = min(w, h) / a    formula (2)
where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter. Optionally, a has a value range of [8.0, 10.0]; through the hyper-parameter a, the reduction distance of each initial text region can be controlled to avoid the adhesion problem of text regions. It should be understood that the hyper-parameter a in formula (2) is distinct from the hyper-parameter r in formula (1). In addition, in this scheme, the reduction distance corresponding to each initial text region is related only to the minimum side length of that region, making the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, which avoids differences in reduction distance caused by differing text region lengths.
Exemplarily, fig. 2 shows a comparison result between an initial text region and a standard text region. Referring to fig. 2, the standard text area is smaller than the initial text area, and in practical applications, areas of different text types may be marked by different identifications, for example, as shown in fig. 2, printed text is marked by solid lines, and a table is marked by dotted lines.
S102, inputting the training sample into the initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs.
A specific implementation in which the initial text detection model outputs a plurality of detection results is described with reference to the embodiment of fig. 4.
S103, acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result.
The purpose of this step is: and performing statistical analysis on loss values corresponding to the detection results respectively to obtain the loss value of the text detection model, and providing a guidance basis for adjusting the weight value of the parameter in the initial text detection model.
Specifically, for each training sample, calculating a loss value between each detection result and a standard detection result according to a preset loss function corresponding to each detection result; and then, weighting the loss values corresponding to the detection results according to the weight coefficients of the loss values corresponding to the detection results, so as to obtain a first loss value of the text detection model.
Take the case where the plurality of detection results includes two detection results: the loss value between one detection result and the standard detection result is the second loss value, denoted Loss0; the loss value between the other detection result and the standard detection result is the third loss value, denoted Loss1. The first loss value is then Loss = w · Loss0 + (1 − w) · Loss1, where w is the adjustment coefficient between the second loss value and the third loss value. Optionally, w = 0.2.
And S104, training the initial text detection model according to the first loss value until the training times meet preset iteration times, and obtaining the text detection model.
Specifically, according to the first loss value obtained by calculation in S103, the weight values of one or more parameters in the initial text detection model are adjusted, and the training is repeated until the training times satisfy the preset iteration times, so as to obtain the text detection model.
The specific implementation manner of performing parameter adjustment on the initial text detection model based on the loss value may be an implementation manner in the prior art, which is not limited in the embodiment of the present disclosure.
In the scheme, when the training times meet the preset iteration times, the output text detection model meets the preset precision requirement. The preset iteration number may be set according to a requirement, which is not limited in the embodiments of the present disclosure.
The training method for the text detection model provided by this embodiment includes: obtaining a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region and a text type identifier to which each standard text region belongs; inputting the training sample into the initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches the preset number of iterations, thereby obtaining the text detection model. In this scheme, the standard text region is obtained by shrinking the initial text region, and the reduction distance is related only to the minimum of the length and width of the initial text region, making the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, which avoids differences in reduction distance caused by differing text region lengths, thereby ensuring the stability of the training of the text detection model and improving its detection accuracy.
On the basis of the embodiment shown in fig. 1, a structure of an initial text detection model and a specific implementation manner for obtaining a plurality of detection results by the initial text detection model are described in detail, where fig. 3 is a schematic structural diagram of the initial text detection model provided in an embodiment of the present disclosure; fig. 4 is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure. Specifically, the method comprises the following steps:
one possible implementation, as shown in fig. 3, the initial text detection model 300 includes: a feature extraction sub-model 301, a feature fusion sub-model 302, and a classification prediction sub-model 303.
In some possible implementations, the feature extraction submodel 301 may include two parts: a first feature extraction sub-model 3011 and a second feature extraction sub-model 3012, where the second feature extraction sub-model 3012 is connected to at least one output interface of the first feature extraction sub-model 3011.
On the basis of fig. 3, inputting a training sample into the initial text detection model, and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region and a text type identifier to which each text region belongs, and the method may include the following steps:
s401, inputting the training sample into the feature extraction submodel, and acquiring a plurality of first feature maps with different scales, a plurality of second feature maps with different scales and a first detection result output by the feature extraction submodel.
It should be noted that the scale of the plurality of first feature maps is smaller than or equal to the scale of the original feature map corresponding to the training sample. And the scales of the plurality of second feature maps are smaller than or equal to the scales of the original feature maps corresponding to the training samples.
Specifically, as shown in fig. 3, the first feature extraction submodel 3011 is configured to perform continuous downsampling processing on an original feature map of a training sample for multiple times, and output multiple first feature maps with different scales; illustratively, the first feature extraction submodel 3011 may employ ResNet18 as the primary network architecture. The second feature extraction submodel 3012 is configured to perform multiple times of continuous upsampling processing on the first feature map with the smallest scale, and output multiple second feature maps with different scales; the second feature extraction submodel outputs a first detection result according to the second feature graph with the largest scale; illustratively, the portion of the second feature extraction submodel for extracting the second feature map of the plurality of scales may be implemented using a Feature Pyramid Network (FPN). The part of the second feature extraction submodel for obtaining the first detection result may adopt the same network structure as the classification prediction submodel.
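As a rough illustration of this two-part structure, the following PyTorch sketch pairs a four-stage downsampling branch with an FPN-style upsampling branch and a small prediction head; the stage layout, the 64-channel width, and the six output classes are assumptions for illustration, and a real implementation would substitute ResNet18 stages for the plain convolutions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureExtraction(nn.Module):
        def __init__(self, ch=64, num_classes=6):
            super().__init__()
            # downsampling branch: four stages, each halving the spatial scale
            self.stages = nn.ModuleList(
                nn.Sequential(
                    nn.Conv2d(ch if i else 3, ch, 3, stride=2, padding=1),
                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
                for i in range(4))
            # FPN-style lateral 1x1 convolutions and a small prediction head
            self.laterals = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in range(4))
            self.head = nn.Conv2d(ch, num_classes, 1)

        def forward(self, x):
            firsts = []                              # scales 1/2, 1/4, 1/8, 1/16
            for stage in self.stages:
                x = stage(x)
                firsts.append(x)
            seconds = [self.laterals[3](firsts[3])]  # start from the smallest scale
            for i in (2, 1, 0):                      # upsample coarse -> fine
                up = F.interpolate(seconds[-1], scale_factor=2,
                                   mode="bilinear", align_corners=False)
                seconds.append(up + self.laterals[i](firsts[i]))
            # first detection result from the largest-scale second feature map
            det0 = self.head(F.interpolate(seconds[-1], scale_factor=2,
                                           mode="bilinear", align_corners=False))
            return firsts, seconds, det0
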
S402, fusing the third feature map and the fourth feature map through the feature fusion sub-model, and outputting a first fused feature map, wherein the first feature maps with different scales comprise the third feature map, the second feature maps with different scales comprise the fourth feature map, and the scales of the third feature map and the fourth feature map are the same.
Optionally, the third feature map is a first feature map with a largest scale in the plurality of first feature maps with different scales; the fourth feature map is a second feature map having the same scale as the third feature map among a plurality of second feature maps having different scales. For example, the scale of the third feature map and the scale of the fourth feature map are both half of the scale of the original feature map.
Specifically, the third feature map and the fourth feature map are spliced along the channel direction, and the obtained fusion feature map is subjected to continuous multiple downsampling and continuous multiple upsampling to obtain a first fusion feature map.
S403, fusing the first fusion characteristic diagram with the first detection result to obtain a second fusion characteristic diagram; and inputting the second fusion characteristic graph to the classification predictor model to obtain a second detection result.
In one possible implementation, S403 may include the following steps:
step one, adding probability values of the same pixel points in the first fusion characteristic graph and the first detection result to obtain a second fusion characteristic graph.
Step two, inputting the second fusion characteristic graph into N channels of a classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively, wherein N is an integer greater than 2; the N channels correspond to the N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same.
And step three, aiming at each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively.
And step four, acquiring a second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
It should be understood that, in this embodiment, the plurality of detection results corresponding to the training samples at least include: a first detection result and a second detection result.
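The following PyTorch sketch illustrates steps two to four above: grouped confidence normalization over the channels, per-pixel type assignment by the maximum probability, and connected-domain extraction per text type. The 4/2 channel grouping follows the embodiment described later; the use of cv2.connectedComponents is an implementation choice assumed here, not one mandated by the disclosure.

    import cv2
    import numpy as np
    import torch
    import torch.nn.functional as F

    def predict_regions(logits: torch.Tensor):
        """logits: (6, H, W) scores of the second fused feature map."""
        probs = torch.empty_like(logits)
        probs[:4] = F.softmax(logits[:4], dim=0)  # group 1: mutually exclusive
        probs[4:] = torch.sigmoid(logits[4:])     # group 2: non-exclusive
        # step three: per-pixel text type from the maximum probability
        types = probs.argmax(dim=0).cpu().numpy().astype(np.uint8)
        regions = {}                              # step four: connected domains
        for t in range(6):
            mask = (types == t).astype(np.uint8)
            count, labels = cv2.connectedComponents(mask)
            regions[t] = [np.argwhere(labels == k) for k in range(1, count)]
        return regions
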
It should be noted that, in practical application, the shallow fine-grained feature map retains more texture information and position information, and can better improve the classification capability of the text class and the background class; the deep coarse-grained characteristic diagram has richer semantic information, and is beneficial to improving the distinguishing capability among different categories. Therefore, according to the characteristics of multi-class text detection tasks, the shallow feature map and the deep feature map are subjected to multi-level fusion, and the advantages of the shallow feature and the deep feature are fully exerted in a multi-prediction mode, so that the detection capability of a text detection model is improved.
In addition, the scheme divides the N text types into a plurality of text type groups, thereby realizing the pixel classification of category mutual exclusion and non-mutual exclusion and avoiding the problem of text type confusion.
The following describes the training process of the text detection model in detail by taking a training sample as an example:
(1) Each initial text region in the training sample is shrunk inward, and the reduction distance corresponding to each initial text region can be obtained through formula (2), so as to obtain the training sample carrying the standard detection result.
(2) Referring to fig. 5a, the original feature map of the training sample, denoted here F0, is extracted, and the first feature extraction submodel performs multiple consecutive downsampling operations on F0 and outputs first feature maps of four scales, denoted C1, C2, C3 and C4. It should be noted that the original feature map F0 has the same scale as the training sample.
(3) The second feature extraction submodel performs multiple consecutive upsampling operations on the first feature map C4 and outputs second feature maps of four scales, denoted P1, P2, P3 and P4; the second feature map P4 is upsampled to obtain the first detection result D0.
In this embodiment, the first feature map C1 is one half of the scale of the original feature map F0, and the second feature map P4 is likewise one half of the scale of F0. That is, the first feature map C1 corresponds to the third feature map in the previous embodiment, and the second feature map P4 corresponds to the fourth feature map in the previous embodiment.
(4) Through the feature fusion submodel, the second feature map P4 is downsampled, the downsampled P4 and the first feature map C1 are concatenated along the channel direction, and after processing by the corresponding convolution layers a fused feature map H0 is obtained; the fused feature map H0 is then downsampled several consecutive times to obtain fused feature maps H1 and H2; the fused feature map H2 is upsampled several consecutive times to obtain fused feature maps H3, H4, H5 and H6, where the fused feature map H6 is the first fused feature map in the previous embodiment.
Referring to fig. 5b, the feature fusion submodel includes 3×3 convolution layers with a stride of 2, 3×3 convolution layers with a stride of 1, and one prediction layer. Specifically, the downsampled second feature map P4 and the first feature map C1 are concatenated along the channel direction, and the resulting fused feature map H0 serves as the input of the feature fusion submodel. The fused feature map H0 passes through the 3×3 convolution layers with a stride of 2 to obtain the fused feature map H2, whose scale is one sixteenth of the scale of the training sample; the number of feature map channels in this process can be 64, so that the expressive capability of the network can be improved without significantly increasing the model parameters.
Next, starting from the fused feature map H2, bilinear interpolation is used to successively expand the scale of each layer to twice that of the previous layer, and each upsampled map is fused, by element-wise addition, with the same-scale feature map output by the earlier convolution layers. This process uses the 3×3 convolution layers with a stride of 1, which reduces the aliasing effect produced by the fusion process.
After the fused feature map H5 is obtained, a further 2× upsampling is performed, thereby obtaining the fused feature map H6.
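A PyTorch sketch of this fusion path is given below, assuming the shapes described above: a 3×3 convolution after channel concatenation, three stride-2 3×3 convolutions down to one sixteenth scale, and bilinear 2× upsampling with element-wise addition followed by stride-1 3×3 smoothing convolutions. Channel counts and module names are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusion(nn.Module):
        def __init__(self, in_ch=128, ch=64):
            super().__init__()
            self.reduce = nn.Conv2d(in_ch, ch, 3, padding=1)   # after concat
            self.down = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3))
            self.smooth = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3))

        def forward(self, third_map, fourth_map):
            # concatenate along the channel direction at 1/2 scale
            x = self.reduce(torch.cat([third_map, fourth_map], dim=1))
            laterals = [x]
            for conv in self.down:                 # 1/2 -> 1/4 -> 1/8 -> 1/16
                laterals.append(conv(laterals[-1]))
            y = laterals[-1]
            for conv, skip in zip(self.smooth, laterals[-2::-1]):
                y = F.interpolate(y, scale_factor=2, mode="bilinear",
                                  align_corners=False)
                y = conv(y + skip)                 # element add, 3x3 smoothing
            return y                               # first fused feature map
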
(5) The fused feature map H6 and the first detection result D0 are fused by adding corresponding elements, obtaining the second fused feature map Y0.
(6) The second fused feature map Y0 is input into the classification prediction submodel for multi-class detection, obtaining the second detection result P. In this embodiment, the text types include: background, printed text, handwritten text, formula text, illustration and table, where background, printed text, handwritten text and formula text belong to text type group 1, and illustration and table belong to text type group 2.
Confidence normalization is performed among the text types in text type group 1 using softmax; the explicit category mutual exclusion reduces confusion among these text types.
Confidence normalization is performed among the text types in text type group 2 using the sigmoid function, which effectively prevents table and illustration areas from suppressing the other text types.
In the model training process, the difference between the detection result output by the model and the standard detection result can be measured by the loss function, which then drives the update of the model's parameter weights and thereby trains the model. The loss functions of the classification prediction submodel are as follows:
For text type group 1, a multi-class cross-entropy loss is adopted, satisfying formula (3):
L_softmax = −Σ_{i∈A} Σ_{j=1}^{M} y_{ij} · log( e^{x_{ij}} / Σ_{k=1}^{M} e^{x_{ik}} )    formula (3)
where A represents the set of pixel points belonging to text type group 1; y_{ij} indicates whether the i-th pixel point belongs to the j-th text type; x_{ij} represents the predicted value of the i-th pixel point for the j-th text type; x_{ik} represents the predicted value of the i-th pixel point for the k-th text type; M represents the total number of text types and is an integer greater than or equal to 2; and e is the natural constant. In formula (3), j and k are both traversal parameters that range over the M text types.
For text type group 2, a binary cross-entropy loss is adopted, satisfying formula (4):
L_sigmoid = −Σ_{i∈B} [ y_i · log σ(x_i) + (1 − y_i) · log(1 − σ(x_i)) ], with σ(x_i) = 1 / (1 + e^{−x_i})    formula (4)
where B represents the set of pixel points belonging to text type group 2; y_i indicates whether the i-th pixel point belongs to the positive sample; x_i represents the predicted value that the i-th pixel point belongs to the positive sample; σ(x_i) is an intermediate variable (the sigmoid value of the pixel point); and e is the natural constant.
When the sigmoid value of the pixel point meets the corresponding preset condition, the pixel point is determined to belong to one text type in the text type group 2, and when the sigmoid value of the pixel point does not meet the corresponding preset condition, the pixel point is determined to belong to the other text type in the text type group 2. For example, when the sigmoid value of the pixel point is greater than 0.5, the pixel point is determined to belong to the table, and when the sigmoid value of the pixel point is less than 0.5, the pixel point is determined to belong to the illustration.
The structure of the classification prediction submodel is shown in fig. 6, the classification prediction submodel includes 6 channels, wherein the text type group 1 corresponds to 4 channels, the text type group 2 corresponds to 2 channels, the 4 channels corresponding to the text type group 1 all adopt a softmax mode to perform confidence normalization, and the 2 channels corresponding to the text type group 2 all adopt a sigmoid mode to perform confidence normalization.
The loss function corresponding to the classification prediction submodel is calculated from the multi-class cross entropy corresponding to text type group 1 and the binary cross entropy corresponding to text type group 2. Specifically, the loss function corresponding to the classification prediction submodel satisfies formula (5):
L_cls = L_softmax + L_sigmoid    formula (5)
where L_cls represents the loss function of the classification prediction submodel; L_softmax represents the multi-class cross-entropy loss, i.e., formula (3) above; and L_sigmoid represents the binary cross-entropy loss, corresponding to formula (4) above.
It should be noted that the loss function corresponding to the first detection result D0 (i.e., the aforementioned second loss value) and the loss function corresponding to the second detection result P (i.e., the aforementioned third loss value) are calculated in the same manner, which is not repeated here.
(7) According to formula (6), the first loss value is calculated from the loss function Loss0 corresponding to the first detection result D0 and the loss function Loss1 corresponding to the second detection result P, where formula (6) is:
Loss = w · Loss0 + (1 − w) · Loss1    formula (6)
where Loss represents the first loss value, and w represents the adjustment coefficient between Loss0 and Loss1; see the detailed description in embodiment 1.
(8) And updating the weight values of one or more parameters of the initial text detection model according to the first loss value, and retraining. When retraining, the training samples may be the same as the training samples of the previous round, or may be different from the training samples of the previous round, which is not limited in this disclosure.
In practical application, the number of training samples is large, each training sample corresponds to one first loss value, and the weight values of one or more parameters of the initial text detection model can be updated according to the maximum first loss value. The maximum first loss value can reflect the situation that the text detection effect is the worst, so the weight value of the model parameter is adjusted according to the maximum first loss value, and the performance of the text detection model can be effectively improved.
And repeatedly executing the training process until the training times meet the preset iteration times, stopping training and outputting the text detection model.
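Putting formula (6) and the hardest-sample update together, one training iteration can be sketched as follows; the model and loss-function interfaces are assumptions, and w = 0.2 follows the optional adjustment coefficient mentioned earlier.

    import torch

    def train_step(model, optimizer, batch, loss_fn, w=0.2):
        losses = []
        for sample, standard in batch:       # per-sample first loss values
            det0, det1 = model(sample)       # first / second detection result
            loss0 = loss_fn(det0, standard)  # second loss value
            loss1 = loss_fn(det1, standard)  # third loss value
            losses.append(w * loss0 + (1 - w) * loss1)   # formula (6)
        hardest = torch.stack(losses).max()  # update on the maximum first loss
        optimizer.zero_grad()
        hardest.backward()
        optimizer.step()
        return hardest.item()
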
The embodiment has at least the following beneficial effects:
the reduction distance corresponding to each initial text region in the training sample is only related to the minimum value of the length and the width of the initial text region, so that the detection result of the line text is more fit with the text boundary, the method is more suitable for the detection scene of the dense line text, and the problem of inconsistent reduction distances of the text lines with the same width and different lengths is remarkably solved.
Secondly, the shallow and deep feature maps are fused at multiple levels in a cascaded pixel-prediction manner, and the advantages of shallow and deep features are fully exploited through two rounds of prediction, effectively improving the overall detection capability.
In addition, to realize multi-class detection of text types such as tables, illustrations, printed text, handwritten text and formula text, the scheme can simultaneously perform mutually exclusive and non-mutually-exclusive pixel classification tasks through a grouped category normalization strategy. Specifically, sigmoid is used to normalize the pixel points of the two text types table and illustration, while the category mutual exclusion of softmax markedly reduces category confusion among printed text, handwritten text and formula text, and at the same time avoids the contradiction that tables and illustrations are not mutually exclusive with the other text types.
Fig. 7 is a schematic structural diagram of a training apparatus for a text detection model according to an embodiment of the present disclosure. Referring to fig. 7, the training apparatus 700 for text detection model according to this embodiment includes: an acquisition module 701 and a processing module 702.
The obtaining module 701 is configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing an initial text region, and a reduction distance is determined according to a minimum value of a length and a width of the initial text region.
A processing module 702, configured to input the training sample into an initial text detection model, and obtain a plurality of detection results of the training sample, where each detection result includes at least one text region and a text type identifier to which each text region belongs; acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and acquiring the text detection model.
In some possible designs, the reduction distance when reducing the initial text region satisfies the formula: d = min(w, h) / a, where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter.
In some possible designs, the initial text detection model includes a feature extraction sub-model, a feature fusion sub-model, and a classification prediction sub-model; the processing module 702 is configured to input the training sample to the feature extraction sub-model and extract a plurality of first feature maps with different scales of the training sample, a plurality of second feature maps with different scales of the training sample, and a first detection result; fuse the third feature map and the fourth feature map through the feature fusion submodel and output a first fused feature map, wherein the first feature maps with different scales include the third feature map, the second feature maps with different scales include the fourth feature map, and the third feature map and the fourth feature map have the same scale; fuse the first fused feature map and the first detection result to obtain a second fused feature map; and input the second fused feature map to the classification prediction submodel to obtain the second detection result.
The plurality of detection results includes the first detection result and the second detection result.
In some possible designs, the feature extraction submodel includes: a first feature extraction submodel and a second feature extraction submodel; the first feature extraction submodel is used for performing multiple downsampling operations on the original feature map of the training sample and extracting the first feature maps with different scales; the second feature extraction submodel is used for performing multiple upsampling operations on the first feature map with the smallest scale and extracting a plurality of second feature maps with different scales, and acquiring the first detection result according to the second feature map with the largest scale.
In some possible designs, the third feature map is the first feature map with the largest scale in the plurality of first feature maps with different scales; the fourth feature map is the second feature map with the same scale as the third feature map.
In some possible designs, the processing module 702 is specifically configured to add probability values of the same pixel points in the first fusion feature map and the first detection result to obtain the second fusion feature map; inputting the second fusion characteristic graph into N channels of the classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively; n is an integer greater than 2; the N channels correspond to N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same; for each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
In some possible designs, the processing module 702 is specifically configured to obtain, according to a preset loss function, a second loss value of the first detection result and the standard detection result, and a third loss value between the second detection result and the standard detection result, respectively; and acquiring the first loss value according to the second loss value and the third loss value.
The training apparatus for text detection models provided in this embodiment may be used to implement the technical solution in any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 8 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure. As shown in fig. 8, the electronic device 800 provided in the present embodiment includes: a memory 801 and a processor 802.
The memory 801 and the processor 802 may be separate physical units connected via a bus 803; alternatively, the memory 801 and the processor 802 may be integrated and implemented in hardware.
The memory 801 is used to store program instructions that are called by the processor 802 to perform the operations of any of the above method embodiments.
Alternatively, when some or all of the methods of the above embodiments are implemented in software, the electronic device 800 may include only the processor 802. In that case, the memory 801 for storing the program is located outside the electronic device 800, and the processor 802 is connected to the memory via circuits/wires to read and execute the program stored in the memory.
The processor 802 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 802 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 801 may include a volatile memory such as a random-access memory (RAM); it may also include a non-volatile memory such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 801 may also comprise a combination of the above types of memory.
The present disclosure also provides a computer-readable storage medium comprising computer program instructions which, when executed by a processor of a training apparatus for a text detection model, perform the solution of any of the above method embodiments.
The present disclosure also provides a program product comprising a computer program stored in a readable storage medium. At least one processor of the training apparatus of the text detection model can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the training apparatus of the text detection model to carry out the solution of any one of the above method embodiments.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The foregoing descriptions are merely exemplary embodiments of the present disclosure, presented to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A training method of a text detection model is characterized by comprising the following steps:
obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
inputting the training sample into an initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs;
acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result;
training the initial text detection model according to the first loss value until the number of training iterations reaches a preset number of iterations, to obtain a text detection model;
wherein a reduction distance when the initial text region is reduced satisfies a formula:
[Formula rendered as an image in the original; it expresses the reduction distance d in terms of min(w, h) and the hyper-parameter a.]
wherein d represents a reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; a represents a hyper-parameter.
2. The training method of the text detection model according to claim 1, wherein the initial text detection model comprises a feature extraction sub-model, a feature fusion sub-model and a classification prediction sub-model; the inputting the training sample into an initial text detection model to obtain a plurality of detection results of the training sample includes:
inputting the training sample into the feature extraction sub-model, and extracting a plurality of first feature maps of the training sample at different scales, a plurality of second feature maps of the training sample at different scales, and a first detection result;
fusing a third feature map and a fourth feature map through the feature fusion sub-model, and outputting a first fused feature map; wherein the plurality of first feature maps at different scales comprise the third feature map, the plurality of second feature maps at different scales comprise the fourth feature map, and the third feature map and the fourth feature map have the same scale;
fusing the first fused feature map with the first detection result to obtain a second fused feature map; and inputting the second fused feature map into the classification prediction sub-model to obtain a second detection result;
the plurality of detection results includes the first detection result and the second detection result.
3. The training method of the text detection model according to claim 2, wherein the feature extraction sub-model comprises: a first feature extraction sub-model and a second feature extraction sub-model;
the first feature extraction sub-model is used for performing multiple downsampling operations on the original feature map of the training sample to extract the plurality of first feature maps at different scales;
the second feature extraction sub-model is used for performing multiple upsampling operations on the first feature map with the smallest scale to extract the plurality of second feature maps at different scales; and the first detection result is acquired according to the second feature map with the largest scale.
4. The training method of the text detection model according to claim 2, wherein the third feature map is the first feature map with the largest scale among the plurality of first feature maps at different scales, and the fourth feature map is the second feature map with the same scale as the third feature map.
5. The training method of the text detection model according to claim 2, wherein the fusing the first fused feature map with the first detection result to obtain a second fused feature map, and the inputting the second fused feature map into the classification prediction sub-model to obtain the second detection result, comprise:
adding the probability values of the same pixel points in the first fused feature map and the first detection result to obtain the second fused feature map;
inputting the second fused feature map into each of N channels of the classification prediction sub-model, and computing the second fused feature map according to the classification function of each channel to acquire the probability values of each pixel point belonging to each of N text types; wherein N is an integer greater than 2, the N channels correspond one-to-one to the N text types, the N text types are divided into a plurality of text type groups, and the classification functions corresponding to the text type groups are not completely the same;
for each pixel point, determining the text type to which the pixel point belongs according to the maximum among the probability values of the pixel point belonging to the N text types; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domains of each text type.
6. The training method of the text detection model according to claim 2, wherein the acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result comprises:
acquiring, according to the preset loss function, a second loss value between the first detection result and the standard detection result and a third loss value between the second detection result and the standard detection result, respectively;
and acquiring the first loss value according to the second loss value and the third loss value.
7. An apparatus for training a text detection model, comprising:
an obtaining module, configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
a processing module, configured to input the training sample into an initial text detection model and obtain a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs; acquire a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and train the initial text detection model according to the first loss value until the number of training iterations reaches a preset number of iterations, to obtain a text detection model;
wherein a reduction distance when the initial text region is reduced satisfies a formula:
[Formula rendered as an image in the original; it expresses the reduction distance d in terms of min(w, h) and the hyper-parameter a.]
wherein d represents a reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; a represents a hyper-parameter.
8. An electronic device, comprising: memory, processor, and computer program instructions;
the memory configured to store the computer program instructions;
the processor configured to execute the computer program instructions to perform the training method of the text detection model according to any one of claims 1 to 6.
9. A readable storage medium, comprising: computer program instructions;
the computer program instructions, when executed by a processor of an electronic device, perform a method of training a text detection model according to any of claims 1 to 6.
CN202110397684.3A 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium Active CN112801097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397684.3A CN112801097B (en) 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium

Publications (2)

Publication Number Publication Date
CN112801097A (en) 2021-05-14
CN112801097B (en) 2021-07-16

Family

ID=75817101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397684.3A Active CN112801097B (en) 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium

Country Status (1)

Country Link
CN (1) CN112801097B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342997B * 2021-05-18 2022-11-11 Chengdu Kuaiyan Technology Co., Ltd. Cross-image textbook reading method based on text line matching
CN113326766B * 2021-05-27 2023-09-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Training method and device of text detection model, and text detection method and device
CN115223160A * 2022-09-20 2022-10-21 Hengyin Financial Technology Co., Ltd. Design method of sign recognition system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103632159A * 2012-08-23 2014-03-12 Alibaba Group Holding Limited Method and system for training classifier and detecting text area in image
US10049289B2 * 2016-02-12 2018-08-14 Wacom Co., Ltd. Method and system for generating and selectively outputting two types of ink vector data
CN110378338A * 2019-07-11 2019-10-25 Tencent Technology (Shenzhen) Co., Ltd. Text recognition method, device, electronic equipment and storage medium
CN111079632A * 2019-12-12 2020-04-28 Shanghai Eye Control Technology Co., Ltd. Training method and device of text detection model, computer equipment and storage medium
CN111753839A * 2020-05-18 2020-10-09 Beijing Jietong Huasheng Technology Co., Ltd. Text detection method and device
CN112528976A * 2021-02-09 2021-03-19 Beijing Century TAL Education Technology Co., Ltd. Text detection model generation method and text detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110715A * 2019-04-30 2019-08-09 Beijing Kingsoft Cloud Network Technology Co., Ltd. Text detection model training method, and text region and content determination method and apparatus
CN111932577B * 2020-09-16 2021-01-08 Beijing Yizhen Xuesi Education Technology Co., Ltd. Text detection method, electronic device and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Shape Robust Text Detection With Progressive Scale Expansion Network; Wenhai Wang et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020-01-09; pp. 9328-9337 *
A Survey of Natural Scene Text Detection Techniques; Bai Zhicheng et al.; Chinese Journal of Engineering; 2020-11-30; Vol. 42, No. 11; pp. 1433-1448 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant