CN112801097B - Training method and device of text detection model and readable storage medium - Google Patents

Training method and device of text detection model and readable storage medium

Info

Publication number
CN112801097B
CN112801097B (application CN202110397684.3A)
Authority
CN
China
Prior art keywords
text
detection result
training
detection
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110397684.3A
Other languages
Chinese (zh)
Other versions
CN112801097A (en)
Inventor
王德强
刘霄
熊泽法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110397684.3A
Publication of CN112801097A
Application granted
Publication of CN112801097B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The embodiments of the disclosure relate to a training method and device for a text detection model, and a readable storage medium. The method includes: acquiring a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region; inputting the training sample into an initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches a preset number of iterations, thereby obtaining the text detection model. Because the standard text region is obtained by shrinking the initial text region and the reduction distance is related only to the minimum of the length and width of the initial text region, the method is better suited to line-text detection scenarios, the stability of the training of the text detection model is ensured, and the detection accuracy of the text detection model is improved.

Description

Training method and device of text detection model and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training a text detection model, and a readable storage medium.
Background
In the "Artificial Intelligence (AI) + education" scenario, locating text lines in image text and detecting multiple text types such as handwritten text and formula text are prerequisite steps for layout restoration and content understanding. At present, text detection for different text types is usually realized with a pixel-segmentation-based approach. Specifically, a text detection model performs feature extraction on the image text and classifies the pixel points of the image text according to the extracted feature map; then, a connected domain of each text type is extracted from the classification prediction result to serve as a text detection instance for subsequent text recognition.
For dense text, to avoid the adhesion problem between text lines, the text detection model predicts one or more text regions with the same shape but different sizes, which are typically smaller than the actual text regions. In the prior art, the text detection model adopts a polygon clipping algorithm that determines a reduction distance according to the area and the perimeter of the real text region, and shrinks the real text region by the determined distance to obtain the predicted text region.
Although shrinking the real text region can alleviate the adhesion problem of adjacent text lines, the reduction distance determined by the polygon clipping algorithm depends on the area and the perimeter of the real text region; when different text lines have the same width but different lengths, their reduction distances differ greatly, which easily makes the training of the text detection model unstable.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present disclosure provides a method and an apparatus for training a text detection model, and a readable storage medium.
In a first aspect, the present disclosure provides a method for training a text detection model, including:
obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
inputting the training sample into an initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs;
acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result;
and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and obtaining the text detection model.
In some possible designs, the reduction distance when reducing the initial text region satisfies the formula:
d = min(w, h) / a
where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter.
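As a concrete illustration, the following minimal Python sketch applies this shrink rule to an axis-aligned text box; the function names and the default value a = 10.0 are illustrative assumptions rather than values fixed by this design.

    def shrink_distance(w: float, h: float, a: float = 10.0) -> float:
        """Reduction distance of the formula above: d = min(w, h) / a."""
        return min(w, h) / a

    def shrink_rect(x0, y0, x1, y1, a=10.0):
        """Shrink an axis-aligned initial text box inward by d on every side."""
        d = shrink_distance(x1 - x0, y1 - y0, a)
        return x0 + d, y0 + d, x1 - d, y1 - d

    # Two text lines with the same height shrink by the same distance,
    # no matter how long they are:
    print(shrink_distance(200.0, 20.0))   # 2.0
    print(shrink_distance(1000.0, 20.0))  # 2.0
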
In some possible designs, the initial text detection model includes a feature extraction sub-model, a feature fusion sub-model, and a classification prediction sub-model; the inputting the training sample into an initial text detection model to obtain a plurality of detection results of the training sample includes:
inputting the training sample into the feature extraction submodel, and extracting a plurality of first feature maps with different scales of the training sample, a plurality of second feature maps with different scales of the training sample and a first detection result;
fusing the third feature map and the fourth feature map through the feature fusion submodel, and outputting a first fusion feature map; wherein the first feature maps with different scales comprise the third feature map, the second feature maps with different scales comprise the fourth feature map, and the third feature map and the fourth feature map have the same scale;
fusing the first fused feature map and the first detection result to obtain a second fused feature map; inputting the second fusion characteristic diagram into the classification predictor model to obtain a second detection result;
the plurality of detection results includes the first detection result and the second detection result.
In some possible designs, the feature extraction submodel includes: a first feature extraction submodel and a second feature extraction submodel;
the first feature extraction submodel is used for carrying out multiple times of downsampling processing on the original feature map of the training sample and extracting the first feature maps with different scales;
the second feature extraction submodel is used for performing multiple upsampling operations on the first feature map with the smallest scale and extracting a plurality of second feature maps with different scales; and acquiring the first detection result according to the second feature map with the largest scale.
In some possible designs, the third feature map is the first feature map with the largest scale in the plurality of first feature maps with different scales; the fourth feature map is the second feature map with the same scale as the third feature map.
In some possible designs, the first fused feature map is fused with the first detection result to obtain a second fused feature map; and inputting the second fusion feature map into the classification predictor model to obtain the second detection result, wherein the second detection result comprises:
adding the probability values of the same pixel points in the first fusion characteristic diagram and the first detection result to obtain a second fusion characteristic diagram;
inputting the second fusion characteristic graph into N channels of the classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively; n is an integer greater than 2; the N channels correspond to N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same;
aiming at each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
In some possible designs, the obtaining a first loss value according to a preset loss function, the detection result, and the standard detection result includes:
respectively acquiring a second loss value of the first detection result and the standard detection result and a third loss value between the second detection result and the standard detection result according to a preset loss function;
and acquiring the first loss value according to the second loss value and the third loss value.
In a second aspect, the present disclosure provides a training apparatus for a text detection model, including:
an obtaining module, configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
the processing module is used for inputting the training sample into an initial text detection model and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs; acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and acquiring the text detection model.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: memory, processor, and computer program instructions;
the memory configured to store the computer program instructions;
the processor configured to execute the computer program instructions, the processor executing the computer program instructions to perform the training method of the text detection model according to any one of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a readable storage medium, including: computer program instructions;
the computer program instructions, when executed by a processor of an electronic device, are configured to perform the method of training a text detection model according to any of the first aspect.
In a fifth aspect, the disclosed embodiments also provide a program product, where the program product includes a computer program, the computer program is stored in a readable storage medium, the computer program can be read by at least one processor of a training apparatus of the text detection model, and the at least one processor executes the computer program to make the training apparatus of the text detection model execute the training method of the text detection model according to any one of the first aspect.
The embodiments of the disclosure provide a training method and device for a text detection model, and a readable storage medium. The method includes: obtaining a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region and a text type identifier to which each standard text region belongs; inputting the training sample into the initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches the preset number of iterations, thereby obtaining the text detection model. In this scheme, the standard text region is obtained by shrinking the initial text region, and the reduction distance is related only to the minimum of the length and width of the initial text region, which makes the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, avoiding differences in reduction distance caused by differing text region lengths. This in turn ensures the stability of the training of the text detection model and improves its detection accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart of a training method of a text detection model according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a comparison between an initial text region and a standard text region of a training sample according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an initial text detection model according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure;
fig. 5a is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure;
fig. 5b is a schematic structural diagram of a feature fusion submodel according to an embodiment of the disclosure;
FIG. 6 is a schematic structural diagram of a classification predictor model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training apparatus for a text detection model according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In the "AI + education" scenario, multi-class detection of image text is crucial. For example, in a science and chemistry scenario, besides printed text, there are a large number of special characters, such as formula text, handwritten text, and the like, and the text recognition of these categories is difficult, and therefore, the content recognition model is usually customized for these categories individually. If the text type detection is wrong, the recognized text instance is input into the unmatched content recognition model, and further serious errors occur in text content recognition.
At present, there are two main approaches to deep-learning-based text detection: regression methods based on preset boxes (anchors), and methods based on pixel segmentation.
A typical regression method based on preset boxes is Faster RCNN, which has the advantages of strong classification capability and insensitivity to image noise, but adapts poorly to dense and curved text lines; moreover, its processing time for a single image text is long in practical applications, making real-time requirements difficult to meet.
The pixel-segmentation-based method performs dense prediction on the text and extracts text instances of various types through a dedicated post-processing strategy, which gives it clear advantages for dense text and curved text. Current pixel-segmentation-based methods include the PSENet and DBNet algorithms. To avoid the adhesion problem of adjacent dense text lines, these algorithms usually predict one or more text regions with the same shape but different sizes, smaller than the real text region; during training of the text detection model, both PSENet and DBNet adopt a polygon clipping algorithm (the Vatti clipping algorithm) to shrink an initial text region inward by a certain distance, thereby obtaining the text regions used for multi-class text detection. The reduction distance satisfies formula (1):
d_i = A_i × (1 − r²) / P_i    formula (1)
where A_i represents the area of the i-th initial text region in the image text; P_i represents the perimeter of the i-th initial text region; and r represents a hyper-parameter (the shrink ratio).
although the above method can avoid the problem of text region blocking, it can be known from formula (1) that the calculation formula of the reduced distance of the text region is a complex function of the area and the perimeter in the prior art. By adopting a calculation mode in the prior art, when the text regions have the same width and different lengths, the difference of the reduction distances of the text regions is large, which easily causes unstable training of the text detection model, and further causes the accuracy of the text region predicted by the text detection model to be reduced.
In order to solve the problems in the prior art, the present disclosure provides a training method for a text detection model. The following describes the training method of the text detection model provided by the present disclosure in detail through several specific embodiments.
Fig. 1 is a flowchart of a training method of a text detection model according to an embodiment of the present disclosure. The execution subject of the method can be a training device of the text detection model provided by the embodiment of the disclosure, and the device can be realized in a software and/or hardware manner. As shown in fig. 1, the method includes:
s101, obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing an initial text region, and a reduction distance is determined according to a minimum value of a length and a width of the initial text region.
Each training sample carries a standard detection result. Specifically, the standard detection result of each training sample includes at least one standard text region; the standard text region is smaller than the real text region and is obtained by reducing the initial text region. The initial text region may be manually labeled in advance.
In the scheme, the reduction distance corresponding to each initial text region has an association relation with the minimum value of the length and the width of the initial text region.
Alternatively, the reduction distance satisfies the following formula (2):
d = min(w, h) / a    formula (2)
where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter. Optionally, a has a value range of [8.0, 10.0]; through the hyper-parameter a, the reduction distance of each initial text region can be controlled to avoid the adhesion problem of text regions. It should be understood that the hyper-parameter a in formula (2) is distinct from the hyper-parameter r in formula (1). In addition, in this scheme, the reduction distance corresponding to each initial text region is related only to the minimum side length of that region, making the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, which avoids differences in reduction distance caused by differing text region lengths.
Exemplarily, fig. 2 shows a comparison result between an initial text region and a standard text region. Referring to fig. 2, the standard text area is smaller than the initial text area, and in practical applications, areas of different text types may be marked by different identifications, for example, as shown in fig. 2, printed text is marked by solid lines, and a table is marked by dotted lines.
S102, inputting the training sample into the initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs.
A specific implementation in which the initial text detection model outputs a plurality of detection results is described with reference to the embodiment of fig. 4.
S103, acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result.
The purpose of this step is: and performing statistical analysis on loss values corresponding to the detection results respectively to obtain the loss value of the text detection model, and providing a guidance basis for adjusting the weight value of the parameter in the initial text detection model.
Specifically, for each training sample, calculating a loss value between each detection result and a standard detection result according to a preset loss function corresponding to each detection result; and then, weighting the loss values corresponding to the detection results according to the weight coefficients of the loss values corresponding to the detection results, so as to obtain a first loss value of the text detection model.
Take the case where the plurality of detection results includes two detection results: the loss value between one detection result and the standard detection result is the second loss value, denoted Loss0; the loss value between the other detection result and the standard detection result is the third loss value, denoted Loss1. The first loss value is then Loss = w · Loss0 + (1 − w) · Loss1, where w is the adjustment coefficient between the second loss value and the third loss value. Optionally, w = 0.2.
And S104, training the initial text detection model according to the first loss value until the training times meet preset iteration times, and obtaining the text detection model.
Specifically, according to the first loss value obtained by calculation in S103, the weight values of one or more parameters in the initial text detection model are adjusted, and the training is repeated until the training times satisfy the preset iteration times, so as to obtain the text detection model.
The specific implementation manner of performing parameter adjustment on the initial text detection model based on the loss value may be an implementation manner in the prior art, which is not limited in the embodiment of the present disclosure.
In the scheme, when the training times meet the preset iteration times, the output text detection model meets the preset precision requirement. The preset iteration number may be set according to a requirement, which is not limited in the embodiments of the present disclosure.
The training method for the text detection model provided by this embodiment includes: obtaining a training sample carrying a standard detection result, where the standard detection result includes at least one standard text region and a text type identifier to which each standard text region belongs; inputting the training sample into the initial text detection model and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region; acquiring a first loss value according to a preset loss function, the plurality of detection results, and the standard detection result; and training the initial text detection model according to the first loss value until the number of training rounds reaches the preset number of iterations, thereby obtaining the text detection model. In this scheme, the standard text region is obtained by shrinking the initial text region, and the reduction distance is related only to the minimum of the length and width of the initial text region, making the method better suited to line-text detection scenarios. Adjacent text regions with the same width but different lengths have the same reduction distance, which avoids differences in reduction distance caused by differing text region lengths, thereby ensuring the stability of the training of the text detection model and improving its detection accuracy.
On the basis of the embodiment shown in fig. 1, a structure of an initial text detection model and a specific implementation manner for obtaining a plurality of detection results by the initial text detection model are described in detail, where fig. 3 is a schematic structural diagram of the initial text detection model provided in an embodiment of the present disclosure; fig. 4 is a flowchart of a training method of a text detection model according to another embodiment of the present disclosure. Specifically, the method comprises the following steps:
one possible implementation, as shown in fig. 3, the initial text detection model 300 includes: a feature extraction sub-model 301, a feature fusion sub-model 302, and a classification prediction sub-model 303.
In some possible implementations, the feature extraction submodel 301 may include two parts: a first feature extraction sub-model 3011 and a second feature extraction sub-model 3012, where the second feature extraction sub-model 3012 is connected to at least one output interface of the first feature extraction sub-model 3011.
On the basis of fig. 3, inputting a training sample into the initial text detection model, and obtaining a plurality of detection results of the training sample, where each detection result includes at least one text region and a text type identifier to which each text region belongs, and the method may include the following steps:
s401, inputting the training sample into the feature extraction submodel, and acquiring a plurality of first feature maps with different scales, a plurality of second feature maps with different scales and a first detection result output by the feature extraction submodel.
It should be noted that the scale of the plurality of first feature maps is smaller than or equal to the scale of the original feature map corresponding to the training sample. And the scales of the plurality of second feature maps are smaller than or equal to the scales of the original feature maps corresponding to the training samples.
Specifically, as shown in fig. 3, the first feature extraction submodel 3011 is configured to perform continuous downsampling processing on an original feature map of a training sample for multiple times, and output multiple first feature maps with different scales; illustratively, the first feature extraction submodel 3011 may employ ResNet18 as the primary network architecture. The second feature extraction submodel 3012 is configured to perform multiple times of continuous upsampling processing on the first feature map with the smallest scale, and output multiple second feature maps with different scales; the second feature extraction submodel outputs a first detection result according to the second feature graph with the largest scale; illustratively, the portion of the second feature extraction submodel for extracting the second feature map of the plurality of scales may be implemented using a Feature Pyramid Network (FPN). The part of the second feature extraction submodel for obtaining the first detection result may adopt the same network structure as the classification prediction submodel.
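As a rough illustration of this two-part structure, the following PyTorch sketch pairs a four-stage downsampling branch with an FPN-style upsampling branch and a small prediction head; the stage layout, the 64-channel width, and the six output classes are assumptions for illustration, and a real implementation would substitute ResNet18 stages for the plain convolutions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureExtraction(nn.Module):
        def __init__(self, ch=64, num_classes=6):
            super().__init__()
            # downsampling branch: four stages, each halving the spatial scale
            self.stages = nn.ModuleList(
                nn.Sequential(
                    nn.Conv2d(ch if i else 3, ch, 3, stride=2, padding=1),
                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
                for i in range(4))
            # FPN-style lateral 1x1 convolutions and a small prediction head
            self.laterals = nn.ModuleList(nn.Conv2d(ch, ch, 1) for _ in range(4))
            self.head = nn.Conv2d(ch, num_classes, 1)

        def forward(self, x):
            firsts = []                              # scales 1/2, 1/4, 1/8, 1/16
            for stage in self.stages:
                x = stage(x)
                firsts.append(x)
            seconds = [self.laterals[3](firsts[3])]  # start from the smallest scale
            for i in (2, 1, 0):                      # upsample coarse -> fine
                up = F.interpolate(seconds[-1], scale_factor=2,
                                   mode="bilinear", align_corners=False)
                seconds.append(up + self.laterals[i](firsts[i]))
            # first detection result from the largest-scale second feature map
            det0 = self.head(F.interpolate(seconds[-1], scale_factor=2,
                                           mode="bilinear", align_corners=False))
            return firsts, seconds, det0
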
S402, fusing the third feature map and the fourth feature map through the feature fusion sub-model, and outputting a first fused feature map, wherein the first feature maps with different scales comprise the third feature map, the second feature maps with different scales comprise the fourth feature map, and the scales of the third feature map and the fourth feature map are the same.
Optionally, the third feature map is a first feature map with a largest scale in the plurality of first feature maps with different scales; the fourth feature map is a second feature map having the same scale as the third feature map among a plurality of second feature maps having different scales. For example, the scale of the third feature map and the scale of the fourth feature map are both half of the scale of the original feature map.
Specifically, the third feature map and the fourth feature map are spliced along the channel direction, and the obtained fusion feature map is subjected to continuous multiple downsampling and continuous multiple upsampling to obtain a first fusion feature map.
S403, fusing the first fusion characteristic diagram with the first detection result to obtain a second fusion characteristic diagram; and inputting the second fusion characteristic graph to the classification predictor model to obtain a second detection result.
In one possible implementation, S403 may include the following steps:
step one, adding probability values of the same pixel points in the first fusion characteristic graph and the first detection result to obtain a second fusion characteristic graph.
Step two, inputting the second fusion characteristic graph into N channels of a classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively, wherein N is an integer greater than 2; the N channels correspond to the N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same.
And step three, aiming at each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively.
And step four, acquiring a second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
It should be understood that, in this embodiment, the plurality of detection results corresponding to the training samples at least include: a first detection result and a second detection result.
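The following PyTorch sketch illustrates steps two to four above: grouped confidence normalization over the channels, per-pixel type assignment by the maximum probability, and connected-domain extraction per text type. The 4/2 channel grouping follows the embodiment described later; the use of cv2.connectedComponents is an implementation choice assumed here, not one mandated by the disclosure.

    import cv2
    import numpy as np
    import torch
    import torch.nn.functional as F

    def predict_regions(logits: torch.Tensor):
        """logits: (6, H, W) scores of the second fused feature map."""
        probs = torch.empty_like(logits)
        probs[:4] = F.softmax(logits[:4], dim=0)  # group 1: mutually exclusive
        probs[4:] = torch.sigmoid(logits[4:])     # group 2: non-exclusive
        # step three: per-pixel text type from the maximum probability
        types = probs.argmax(dim=0).cpu().numpy().astype(np.uint8)
        regions = {}                              # step four: connected domains
        for t in range(6):
            mask = (types == t).astype(np.uint8)
            count, labels = cv2.connectedComponents(mask)
            regions[t] = [np.argwhere(labels == k) for k in range(1, count)]
        return regions
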
It should be noted that, in practical application, the shallow fine-grained feature map retains more texture information and position information, and can better improve the classification capability of the text class and the background class; the deep coarse-grained characteristic diagram has richer semantic information, and is beneficial to improving the distinguishing capability among different categories. Therefore, according to the characteristics of multi-class text detection tasks, the shallow feature map and the deep feature map are subjected to multi-level fusion, and the advantages of the shallow feature and the deep feature are fully exerted in a multi-prediction mode, so that the detection capability of a text detection model is improved.
In addition, the scheme divides the N text types into a plurality of text type groups, thereby realizing the pixel classification of category mutual exclusion and non-mutual exclusion and avoiding the problem of text type confusion.
The following describes the training process of the text detection model in detail by taking a training sample as an example:
(1) Each initial text region in the training sample is shrunk inward, and the reduction distance corresponding to each initial text region can be obtained through formula (2), so as to obtain the training sample carrying the standard detection result.
(2) Referring to fig. 5a, the original feature map of the training sample, denoted here F0, is extracted, and the first feature extraction submodel performs multiple consecutive downsampling operations on F0 and outputs first feature maps of four scales, denoted C1, C2, C3 and C4. It should be noted that the original feature map F0 has the same scale as the training sample.
(3) The second feature extraction submodel performs multiple consecutive upsampling operations on the first feature map C4 and outputs second feature maps of four scales, denoted P1, P2, P3 and P4; the second feature map P4 is upsampled to obtain the first detection result D0.
In this embodiment, the first feature map C1 is one half of the scale of the original feature map F0, and the second feature map P4 is likewise one half of the scale of F0. That is, the first feature map C1 corresponds to the third feature map in the previous embodiment, and the second feature map P4 corresponds to the fourth feature map in the previous embodiment.
(4) Through the feature fusion submodel, the second feature map P4 is downsampled, the downsampled P4 and the first feature map C1 are concatenated along the channel direction, and after processing by the corresponding convolution layers a fused feature map H0 is obtained; the fused feature map H0 is then downsampled several consecutive times to obtain fused feature maps H1 and H2; the fused feature map H2 is upsampled several consecutive times to obtain fused feature maps H3, H4, H5 and H6, where the fused feature map H6 is the first fused feature map in the previous embodiment.
Referring to fig. 5b, the feature fusion submodel includes 3×3 convolution layers with a stride of 2, 3×3 convolution layers with a stride of 1, and one prediction layer. Specifically, the downsampled second feature map P4 and the first feature map C1 are concatenated along the channel direction, and the resulting fused feature map H0 serves as the input of the feature fusion submodel. The fused feature map H0 passes through the 3×3 convolution layers with a stride of 2 to obtain the fused feature map H2, whose scale is one sixteenth of the scale of the training sample; the number of feature map channels in this process can be 64, so that the expressive capability of the network can be improved without significantly increasing the model parameters.
Next, starting from the fused feature map H2, bilinear interpolation is used to successively expand the scale of each layer to twice that of the previous layer, and each upsampled map is fused, by element-wise addition, with the same-scale feature map output by the earlier convolution layers. This process uses the 3×3 convolution layers with a stride of 1, which reduces the aliasing effect produced by the fusion process.
After the fused feature map H5 is obtained, a further 2× upsampling is performed, thereby obtaining the fused feature map H6.
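A PyTorch sketch of this fusion path is given below, assuming the shapes described above: a 3×3 convolution after channel concatenation, three stride-2 3×3 convolutions down to one sixteenth scale, and bilinear 2× upsampling with element-wise addition followed by stride-1 3×3 smoothing convolutions. Channel counts and module names are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusion(nn.Module):
        def __init__(self, in_ch=128, ch=64):
            super().__init__()
            self.reduce = nn.Conv2d(in_ch, ch, 3, padding=1)   # after concat
            self.down = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3))
            self.smooth = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3))

        def forward(self, third_map, fourth_map):
            # concatenate along the channel direction at 1/2 scale
            x = self.reduce(torch.cat([third_map, fourth_map], dim=1))
            laterals = [x]
            for conv in self.down:                 # 1/2 -> 1/4 -> 1/8 -> 1/16
                laterals.append(conv(laterals[-1]))
            y = laterals[-1]
            for conv, skip in zip(self.smooth, laterals[-2::-1]):
                y = F.interpolate(y, scale_factor=2, mode="bilinear",
                                  align_corners=False)
                y = conv(y + skip)                 # element add, 3x3 smoothing
            return y                               # first fused feature map
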
(5) The fused feature map H6 and the first detection result D0 are fused by adding corresponding elements, obtaining the second fused feature map Y0.
(6) The second fused feature map Y0 is input into the classification prediction submodel for multi-class detection, obtaining the second detection result P. In this embodiment, the text types include: background, printed text, handwritten text, formula text, illustration and table, where background, printed text, handwritten text and formula text belong to text type group 1, and illustration and table belong to text type group 2.
Confidence normalization is performed among the text types in text type group 1 using softmax; the explicit category mutual exclusion reduces confusion among these text types.
Confidence normalization is performed among the text types in text type group 2 using the sigmoid function, which effectively prevents table and illustration areas from suppressing the other text types.
In the model training process, the difference between the detection result output by the model and the standard detection result can be measured by the loss function, which then drives the update of the model's parameter weights and thereby trains the model. The loss functions of the classification prediction submodel are as follows:
For text type group 1, a multi-class cross-entropy loss is adopted, satisfying formula (3):
L_softmax = −Σ_{i∈A} Σ_{j=1}^{M} y_{ij} · log( e^{x_{ij}} / Σ_{k=1}^{M} e^{x_{ik}} )    formula (3)
where A represents the set of pixel points belonging to text type group 1; y_{ij} indicates whether the i-th pixel point belongs to the j-th text type; x_{ij} represents the predicted value of the i-th pixel point for the j-th text type; x_{ik} represents the predicted value of the i-th pixel point for the k-th text type; M represents the total number of text types and is an integer greater than or equal to 2; and e is the natural constant. In formula (3), j and k are both traversal parameters that range over the M text types.
For text type group 2, a binary cross-entropy loss is adopted, satisfying formula (4):
L_sigmoid = −Σ_{i∈B} [ y_i · log σ(x_i) + (1 − y_i) · log(1 − σ(x_i)) ], with σ(x_i) = 1 / (1 + e^{−x_i})    formula (4)
where B represents the set of pixel points belonging to text type group 2; y_i indicates whether the i-th pixel point belongs to the positive sample; x_i represents the predicted value that the i-th pixel point belongs to the positive sample; σ(x_i) is an intermediate variable (the sigmoid value of the pixel point); and e is the natural constant.
When the sigmoid value of the pixel point meets the corresponding preset condition, the pixel point is determined to belong to one text type in the text type group 2, and when the sigmoid value of the pixel point does not meet the corresponding preset condition, the pixel point is determined to belong to the other text type in the text type group 2. For example, when the sigmoid value of the pixel point is greater than 0.5, the pixel point is determined to belong to the table, and when the sigmoid value of the pixel point is less than 0.5, the pixel point is determined to belong to the illustration.
The structure of the classification prediction submodel is shown in fig. 6, the classification prediction submodel includes 6 channels, wherein the text type group 1 corresponds to 4 channels, the text type group 2 corresponds to 2 channels, the 4 channels corresponding to the text type group 1 all adopt a softmax mode to perform confidence normalization, and the 2 channels corresponding to the text type group 2 all adopt a sigmoid mode to perform confidence normalization.
The loss function corresponding to the classification prediction submodel is calculated from the multi-class cross entropy corresponding to text type group 1 and the binary cross entropy corresponding to text type group 2. Specifically, the loss function corresponding to the classification prediction submodel satisfies formula (5):
L_cls = L_softmax + L_sigmoid    formula (5)
where L_cls represents the loss function of the classification prediction submodel; L_softmax represents the multi-class cross-entropy loss, i.e., formula (3) above; and L_sigmoid represents the binary cross-entropy loss, corresponding to formula (4) above.
It should be noted that the loss function corresponding to the first detection result D0 (i.e., the aforementioned second loss value) and the loss function corresponding to the second detection result P (i.e., the aforementioned third loss value) are calculated in the same manner, which is not repeated here.
(7) According to formula (6), the first loss value is calculated from the loss function Loss0 corresponding to the first detection result D0 and the loss function Loss1 corresponding to the second detection result P, where formula (6) is:
Loss = w · Loss0 + (1 − w) · Loss1    formula (6)
where Loss represents the first loss value, and w represents the adjustment coefficient between Loss0 and Loss1; see the detailed description in embodiment 1.
(8) And updating the weight values of one or more parameters of the initial text detection model according to the first loss value, and retraining. When retraining, the training samples may be the same as the training samples of the previous round, or may be different from the training samples of the previous round, which is not limited in this disclosure.
In practical application, the number of training samples is large, each training sample corresponds to one first loss value, and the weight values of one or more parameters of the initial text detection model can be updated according to the maximum first loss value. The maximum first loss value can reflect the situation that the text detection effect is the worst, so the weight value of the model parameter is adjusted according to the maximum first loss value, and the performance of the text detection model can be effectively improved.
And repeatedly executing the training process until the training times meet the preset iteration times, stopping training and outputting the text detection model.
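Putting formula (6) and the hardest-sample update together, one training iteration can be sketched as follows; the model and loss-function interfaces are assumptions, and w = 0.2 follows the optional adjustment coefficient mentioned earlier.

    import torch

    def train_step(model, optimizer, batch, loss_fn, w=0.2):
        losses = []
        for sample, standard in batch:       # per-sample first loss values
            det0, det1 = model(sample)       # first / second detection result
            loss0 = loss_fn(det0, standard)  # second loss value
            loss1 = loss_fn(det1, standard)  # third loss value
            losses.append(w * loss0 + (1 - w) * loss1)   # formula (6)
        hardest = torch.stack(losses).max()  # update on the maximum first loss
        optimizer.zero_grad()
        hardest.backward()
        optimizer.step()
        return hardest.item()
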
The embodiment has at least the following beneficial effects:
the reduction distance corresponding to each initial text region in the training sample is only related to the minimum value of the length and the width of the initial text region, so that the detection result of the line text is more fit with the text boundary, the method is more suitable for the detection scene of the dense line text, and the problem of inconsistent reduction distances of the text lines with the same width and different lengths is remarkably solved.
Secondly, the shallow and deep feature maps are fused at multiple levels in a cascaded pixel-prediction manner, and the advantages of shallow and deep features are fully exploited through two rounds of prediction, effectively improving the overall detection capability.
In addition, to realize multi-class detection of text types such as tables, illustrations, printed text, handwritten text and formula text, the scheme can simultaneously perform mutually exclusive and non-mutually-exclusive pixel classification tasks through a grouped category normalization strategy. Specifically, sigmoid is used to normalize the pixel points of the two text types table and illustration, while the category mutual exclusion of softmax markedly reduces category confusion among printed text, handwritten text and formula text, and at the same time avoids the contradiction that tables and illustrations are not mutually exclusive with the other text types.
Fig. 7 is a schematic structural diagram of a training apparatus for a text detection model according to an embodiment of the present disclosure. Referring to fig. 7, the training apparatus 700 for text detection model according to this embodiment includes: an acquisition module 701 and a processing module 702.
The obtaining module 701 is configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing an initial text region, and a reduction distance is determined according to a minimum value of a length and a width of the initial text region.
A processing module 702, configured to input the training sample into an initial text detection model, and obtain a plurality of detection results of the training sample, where each detection result includes at least one text region and a text type identifier to which each text region belongs; acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and training the initial text detection model according to the first loss value until the training times meet the preset iteration times, and acquiring the text detection model.
In some possible designs, the reduction distance when reducing the initial text region satisfies the formula: d = min(w, h) / a, where d represents the reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; and a represents a hyper-parameter.
In some possible designs, the initial text detection model includes a feature extraction sub-model, a feature fusion sub-model, and a classification prediction sub-model; the processing module 702 is configured to input the training sample to the feature extraction sub-model and extract a plurality of first feature maps with different scales of the training sample, a plurality of second feature maps with different scales of the training sample, and a first detection result; fuse the third feature map and the fourth feature map through the feature fusion submodel and output a first fused feature map, wherein the first feature maps with different scales include the third feature map, the second feature maps with different scales include the fourth feature map, and the third feature map and the fourth feature map have the same scale; fuse the first fused feature map and the first detection result to obtain a second fused feature map; and input the second fused feature map to the classification prediction submodel to obtain the second detection result.
The plurality of detection results includes the first detection result and the second detection result.
In some possible designs, the feature extraction submodel includes: a first feature extraction submodel and a second feature extraction submodel; the first feature extraction submodel is used for performing multiple downsampling operations on the original feature map of the training sample and extracting the first feature maps with different scales; the second feature extraction submodel is used for performing multiple upsampling operations on the first feature map with the smallest scale and extracting a plurality of second feature maps with different scales, and acquiring the first detection result according to the second feature map with the largest scale.
In some possible designs, the third feature map is the first feature map with the largest scale in the plurality of first feature maps with different scales; the fourth feature map is the second feature map with the same scale as the third feature map.
In some possible designs, the processing module 702 is specifically configured to add probability values of the same pixel points in the first fusion feature map and the first detection result to obtain the second fusion feature map; inputting the second fusion characteristic graph into N channels of the classification prediction submodel respectively, calculating the second fusion characteristic graph according to the classification function of each channel, and acquiring probability values of each pixel point belonging to N text types respectively; n is an integer greater than 2; the N channels correspond to N text types one by one, the N text types are divided into a plurality of text type groups, and classification functions corresponding to the text type groups are not completely the same; for each pixel point, determining the text type to which the pixel point belongs according to the maximum value in the probability values of the pixel point belonging to the N text types respectively; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domain of each text type.
In some possible designs, the processing module 702 is specifically configured to obtain, according to a preset loss function, a second loss value of the first detection result and the standard detection result, and a third loss value between the second detection result and the standard detection result, respectively; and acquiring the first loss value according to the second loss value and the third loss value.
The training apparatus for text detection models provided in this embodiment may be used to implement the technical solution in any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 8 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure. As shown in fig. 8, the electronic device 800 provided in the present embodiment includes: a memory 801 and a processor 802.
The memory 801 and the processor 802 may be separate physical units connected via a bus 803; alternatively, the memory 801 and the processor 802 may be integrated and implemented in hardware.
The memory 801 is used to store program instructions that are called by the processor 802 to perform the operations of any of the above method embodiments.
Alternatively, when some or all of the methods of the above embodiments are implemented in software, the electronic device 800 may include only the processor 802. In that case, the memory 801 for storing the program is located outside the electronic device 800, and the processor 802 is connected to the memory via circuits/wires to read and execute the program stored in the memory.
The processor 802 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 802 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 801 may include a volatile memory such as a random-access memory (RAM); it may also include a non-volatile memory such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 801 may also comprise a combination of the above types of memory.
The present disclosure also provides a computer-readable storage medium comprising computer program instructions which, when executed by a processor of a training apparatus for a text detection model, perform the solution of any of the above method embodiments.
The present disclosure also provides a program product comprising a computer program stored in a readable storage medium. At least one processor of the training apparatus of the text detection model can read the computer program from the readable storage medium, and execution of the computer program by the at least one processor causes the training apparatus of the text detection model to carry out the solution of any one of the above method embodiments.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The foregoing descriptions are merely exemplary embodiments of the present disclosure, presented to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A training method of a text detection model is characterized by comprising the following steps:
obtaining a training sample, wherein the standard detection result of the training sample comprises: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
inputting the training sample into an initial text detection model, and obtaining a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs;
acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result;
training the initial text detection model according to the first loss value until the number of training iterations reaches a preset number of iterations, to obtain a text detection model;
wherein a reduction distance when the initial text region is reduced satisfies a formula:
[Formula rendered as an image in the original; it expresses the reduction distance d in terms of min(w, h) and the hyper-parameter a.]
wherein d represents a reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; a represents a hyper-parameter.
2. The training method of the text detection model according to claim 1, wherein the initial text detection model comprises a feature extraction sub-model, a feature fusion sub-model and a classification prediction sub-model; the inputting the training sample into an initial text detection model to obtain a plurality of detection results of the training sample includes:
inputting the training sample into the feature extraction sub-model, and extracting a plurality of first feature maps of the training sample at different scales, a plurality of second feature maps of the training sample at different scales, and a first detection result;
fusing a third feature map and a fourth feature map through the feature fusion sub-model, and outputting a first fused feature map; wherein the plurality of first feature maps at different scales comprise the third feature map, the plurality of second feature maps at different scales comprise the fourth feature map, and the third feature map and the fourth feature map have the same scale;
fusing the first fused feature map with the first detection result to obtain a second fused feature map; and inputting the second fused feature map into the classification prediction sub-model to obtain a second detection result;
the plurality of detection results includes the first detection result and the second detection result.
3. The training method of the text detection model according to claim 2, wherein the feature extraction sub-model comprises: a first feature extraction sub-model and a second feature extraction sub-model;
the first feature extraction sub-model is used for performing multiple downsampling operations on the original feature map of the training sample to extract the plurality of first feature maps at different scales;
the second feature extraction sub-model is used for performing multiple upsampling operations on the first feature map with the smallest scale to extract the plurality of second feature maps at different scales; and the first detection result is acquired according to the second feature map with the largest scale.
4. The training method of the text detection model according to claim 2, wherein the third feature map is the first feature map with the largest scale among the plurality of first feature maps at different scales, and the fourth feature map is the second feature map with the same scale as the third feature map.
5. The training method of the text detection model according to claim 2, wherein the fusing the first fused feature map with the first detection result to obtain a second fused feature map, and the inputting the second fused feature map into the classification prediction sub-model to obtain the second detection result, comprise:
adding the probability values of the same pixel points in the first fused feature map and the first detection result to obtain the second fused feature map;
inputting the second fused feature map into each of N channels of the classification prediction sub-model, and computing the second fused feature map according to the classification function of each channel to acquire the probability values of each pixel point belonging to each of N text types; wherein N is an integer greater than 2, the N channels correspond one-to-one to the N text types, the N text types are divided into a plurality of text type groups, and the classification functions corresponding to the text type groups are not completely the same;
for each pixel point, determining the text type to which the pixel point belongs according to the maximum among the probability values of the pixel point belonging to the N text types; and acquiring the second detection result according to the text type to which each pixel point belongs and the connected domains of each text type.
6. The training method of the text detection model according to claim 2, wherein the acquiring a first loss value according to a preset loss function, the plurality of detection results and the standard detection result comprises:
acquiring, according to the preset loss function, a second loss value between the first detection result and the standard detection result and a third loss value between the second detection result and the standard detection result, respectively;
and acquiring the first loss value according to the second loss value and the third loss value.
7. An apparatus for training a text detection model, comprising:
an obtaining module, configured to obtain a training sample, where a standard detection result of the training sample includes: at least one standard text region and a text type identifier to which each standard text region belongs; the standard text region is obtained by reducing the initial text region, and the reduction distance is determined according to the minimum value of the length and the width of the initial text region;
a processing module, configured to input the training sample into an initial text detection model and obtain a plurality of detection results of the training sample, wherein each detection result comprises at least one text region and a text type identifier to which each text region belongs; acquire a first loss value according to a preset loss function, the plurality of detection results and the standard detection result; and train the initial text detection model according to the first loss value until the number of training iterations reaches a preset number of iterations, to obtain a text detection model;
wherein a reduction distance when the initial text region is reduced satisfies a formula:
[Formula rendered as an image in the original; it expresses the reduction distance d in terms of min(w, h) and the hyper-parameter a.]
wherein d represents a reduction distance; w represents the length of the initial text region; h represents the width of the initial text region; a represents a hyper-parameter.
8. An electronic device, comprising: memory, processor, and computer program instructions;
the memory configured to store the computer program instructions;
the processor configured to execute the computer program instructions to perform the training method of the text detection model according to any one of claims 1 to 6.
9. A readable storage medium, comprising: computer program instructions;
the computer program instructions, when executed by a processor of an electronic device, perform a method of training a text detection model according to any of claims 1 to 6.
CN202110397684.3A 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium Active CN112801097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397684.3A CN112801097B (en) 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium

Publications (2)

Publication Number Publication Date
CN112801097A (en) 2021-05-14
CN112801097B (en) 2021-07-16

Family

ID=75817101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397684.3A Active CN112801097B (en) 2021-04-14 2021-04-14 Training method and device of text detection model and readable storage medium

Country Status (1)

Country Link
CN (1) CN112801097B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342997B * 2021-05-18 2022-11-11 Chengdu Kuaiyan Technology Co., Ltd. Cross-image textbook reading method based on text line matching
CN113326766B * 2021-05-27 2023-09-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Training method and device of text detection model, and text detection method and device
CN115223160A * 2022-09-20 2022-10-21 Hengyin Financial Technology Co., Ltd. Design method of sign recognition system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103632159A * 2012-08-23 2014-03-12 Alibaba Group Holding Limited Method and system for training classifier and detecting text area in image
US10049289B2 * 2016-02-12 2018-08-14 Wacom Co., Ltd. Method and system for generating and selectively outputting two types of ink vector data
CN110378338A * 2019-07-11 2019-10-25 Tencent Technology (Shenzhen) Co., Ltd. Text recognition method, device, electronic equipment and storage medium
CN111079632A * 2019-12-12 2020-04-28 Shanghai Eye Control Technology Co., Ltd. Training method and device of text detection model, computer equipment and storage medium
CN111753839A * 2020-05-18 2020-10-09 Beijing Jietong Huasheng Technology Co., Ltd. Text detection method and device
CN112528976A * 2021-02-09 2021-03-19 Beijing Century TAL Education Technology Co., Ltd. Text detection model generation method and text detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110715A * 2019-04-30 2019-08-09 Beijing Kingsoft Cloud Network Technology Co., Ltd. Text detection model training method, and text region and content determination method and apparatus
CN111932577B * 2020-09-16 2021-01-08 Beijing Yizhen Xuesi Education Technology Co., Ltd. Text detection method, electronic device and computer readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Shape Robust Text Detection With Progressive Scale Expansion Network; Wenhai Wang et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020-01-09; pp. 9328-9337 *
A Survey of Natural Scene Text Detection Techniques; Bai Zhicheng et al.; Chinese Journal of Engineering; 2020-11-30; Vol. 42, No. 11; pp. 1433-1448 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant