CN116206313A - Text detection and model training method, device and system and readable storage medium - Google Patents


Info

Publication number
CN116206313A
Authority
CN
China
Prior art keywords
map, sample, threshold, probability, picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211441255.2A
Other languages
Chinese (zh)
Inventor
谌贵雄
张丽民
徐兵
张楠赓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Canaan Creative Information Technology Ltd
Original Assignee
Hangzhou Canaan Creative Information Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Canaan Creative Information Technology Ltd filed Critical Hangzhou Canaan Creative Information Technology Ltd
Priority to CN202211441255.2A priority Critical patent/CN116206313A/en
Publication of CN116206313A publication Critical patent/CN116206313A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/16: Image preprocessing
    • G06V30/162: Quantising the image signal
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/18: Extraction of features or characteristics of the image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/191: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147: Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text detection method, a model training method, corresponding devices and systems, and a readable storage medium. The model training method comprises the following steps: acquiring a sample feature map of a sample picture, and inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, each smaller in size than the sample picture; performing differentiable binarization on the first probability map and the first threshold map to obtain an approximate binary map; and performing supervised learning on the approximate binary map based on the sample label of the sample picture to train and generate a text detection model. With this method, the post-processing efficiency of the text detection model can be significantly improved.

Description

Text detection and model training method, device and system and readable storage medium
Technical Field
The invention belongs to the field of text recognition, and particularly relates to a text detection and model training method, device and system and a readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
As application demand for text recognition on edge devices grows, the requirements on a model's running efficiency on hardware become ever higher, making model optimization and acceleration particularly important. Common model acceleration methods include pruning, quantization, distillation, and network structure modification.
Existing text detection networks are mainly based on regression methods and segmentation methods built on convolutional neural networks. Regression-based algorithms generally adapt general object detection algorithms, detecting text by setting regression detection boxes or performing pixel regression directly; they achieve good results on regular-shaped text, but their disadvantage is relatively poor results on irregularly shaped text.
Therefore, how to perform optimization acceleration under the premise of ensuring the model accuracy is a problem to be solved urgently.
Disclosure of Invention
In order to solve the problems in the prior art, a text detection method, a text detection model training method, corresponding devices and systems, and a computer-readable storage medium are provided.
The present invention provides the following.
In a first aspect, a text detection model training method is provided, including: acquiring a sample feature map of a sample picture, and inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, each smaller in size than the sample picture; performing differentiable binarization on the first probability map and the first threshold map to obtain an approximate binary map; and performing supervised learning on the approximate binary map based on the sample label of the sample picture to train and generate a text detection model.
In one embodiment, the method further comprises: inputting the sample feature map into a second prediction network, wherein the output feature size of the second prediction network is greater than or equal to the size of the sample picture, and performing intermediate supervised learning on the text detection model using the output features of the second prediction network.
In one embodiment, the method further comprises: inputting the sample feature map into a second prediction network to obtain a second probability map and a second threshold map, wherein the sizes of the second probability map and the second threshold map are larger than or equal to the sizes of the sample pictures; calculating a probability map loss of the first probability map using the second probability map; calculating a threshold map loss of the first threshold map using the second threshold map; and performing intermediate supervised learning on the text detection model by using the probability map loss and the threshold map loss.
In one embodiment, the second probability map, the second threshold map, and the sample picture are the same size.
In one embodiment, the probability map loss and/or the threshold map loss are calculated using a KL divergence loss function.
In one embodiment, the intermediate supervised learning is performed using the following KL divergence loss function formula:
L_KL = (1/N) · Σ_{i=1}^{N} pred'(x_i) · log( pred'(x_i) / pred(x_i) )

wherein pred refers to the output of the first prediction network, pred' refers to the output of the second prediction network, i refers to the pixel index, x_i refers to the i-th pixel point, and N refers to the total number of pixels.
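A minimal numpy sketch of a per-pixel KL-divergence loss of this kind between the two heads' outputs. Treating the full-resolution head's map pred' as the reference distribution and averaging over pixels is our reading of the formula, and the epsilon clamp is added for numerical stability; neither detail is stated in this text.

```python
import numpy as np

def kl_map_loss(pred, pred_ref, eps=1e-8):
    """Average over pixels of pred'(x_i) * log(pred'(x_i) / pred(x_i)),
    where pred is the first (low-resolution) head's map and pred_ref
    is the second (full-resolution) head's map, both valued in (0, 1)."""
    p = np.clip(pred_ref, eps, 1.0)  # reference distribution pred'
    q = np.clip(pred, eps, 1.0)      # supervised distribution pred
    return float(np.mean(p * np.log(p / q)))

# identical maps give zero loss; diverging maps give positive loss
ref = np.full((4, 4), 0.9)
low = np.full((4, 4), 0.5)
```

In training, the two maps would first be resampled to a common size before this loss is computed, as described for the loss calculation below.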
In one embodiment, the first probability map and the first threshold map are 1/2 the size of the sample picture.
In one embodiment, the first prediction network performs: a convolution operation on the sample feature map to obtain a first intermediate map; batch normalization and activation on the first intermediate map to obtain a second intermediate map; a deconvolution operation on the second intermediate map to output a third intermediate map; a convolution operation on the third intermediate map to output a fourth intermediate map with one channel; and outputting the first probability map and the first threshold map according to the sigmoid function applied to the fourth intermediate map.
In one embodiment, the second prediction network performs: a convolution operation on the sample feature map to obtain a fifth intermediate map; batch normalization and activation on the fifth intermediate map to obtain a sixth intermediate map; a deconvolution operation on the sixth intermediate map to output a seventh intermediate map; a deconvolution operation on the seventh intermediate map to output an eighth intermediate map with one channel; and outputting the second probability map and the second threshold map according to the sigmoid function applied to the eighth intermediate map.
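The size bookkeeping implied by the two embodiments can be sketched as follows: both heads start from a feature map at 1/4 of the sample-picture size; the first head applies one deconvolution (reaching 1/2 of the sample-picture size) while the second applies two (recovering the full size). A stride of 2 per deconvolution is our assumption, chosen to be consistent with the stated output sizes.

```python
def head_output_size(sample_size, feature_scale=4, num_deconvs=1, stride=2):
    """Spatial output size of a prediction head that starts from a feature
    map at 1/feature_scale of the sample picture and applies num_deconvs
    stride-`stride` deconvolutions."""
    size = sample_size // feature_scale
    for _ in range(num_deconvs):
        size *= stride
    return size

# first prediction network: one deconvolution  -> half the sample size
# second prediction network: two deconvolutions -> full sample size
```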
In one embodiment, obtaining a sample feature map of a sample picture includes: acquiring a training sample set, wherein the training sample set comprises a plurality of sample pictures carrying sample labels; and inputting the sample picture into a feature extraction network to obtain a sample feature map.
In one embodiment, performing supervised learning on the approximate binary map based on the sample label of the sample picture further comprises: downsampling the sample label so that its size matches that of the approximate binary map.
In one embodiment, the text detection model includes at least: the trained feature extraction network and the probability map branch of the first prediction network.
In a second aspect, a text detection method is provided, including: acquiring a picture to be detected, inputting it into a text detection model trained by the method of the first aspect, and outputting a probability map of the picture to be detected; binarizing the probability map using a fixed threshold or the threshold map output by the text detection model to obtain a binary map; and performing text segmentation on the picture to be detected using the binary map.
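A minimal numpy sketch of the fixed-threshold branch of this detection step. The threshold value 0.3 is a placeholder, not specified by this text; contour extraction and box fitting would follow on the binary map.

```python
import numpy as np

def binarize_prob_map(prob_map, threshold=0.3):
    """Turn the model's probability map into a 0/1 binary map by
    comparing each pixel against a fixed threshold."""
    return (prob_map > threshold).astype(np.uint8)

prob = np.array([[0.10, 0.85],
                 [0.40, 0.20]])
```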
In a third aspect, a text detection model training apparatus configured to perform the method of the first aspect is provided, comprising: a feature extraction module for inputting a sample picture into a feature extraction network to obtain a sample feature map; a first prediction module for inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, each smaller in size than the sample picture; a differentiable binarization module for performing differentiable binarization on the first probability map and the first threshold map to obtain an approximate binary map; and a training module for performing supervised learning on the approximate binary map based on the sample label of the sample picture to train and generate a text detection model.
In a fourth aspect, a text detection device configured to perform the method of the second aspect is provided, the device comprising: a detection module for acquiring a picture to be detected, inputting it into a text detection model trained by the method of the first aspect, and outputting a probability map of the picture to be detected; a binarization module for binarizing the probability map using a fixed threshold or the threshold map output by the text detection model to obtain a binary map; and a text segmentation module for performing text segmentation on the picture to be detected using the binary map.
In a fifth aspect, a text detection model training system is provided, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform: the method of the first aspect.
In a sixth aspect, a text detection system is provided, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform: the method of the second aspect.
In a seventh aspect, there is provided a computer readable storage medium storing a program which, when executed by a processor, causes the processor to perform a method as in the first or second aspect.
One advantage of the embodiments is that training with a probability map and a threshold map smaller in size than the sample picture can significantly improve the post-processing efficiency of the trained text detection model.
Other advantages of the present invention will be explained in more detail in connection with the following description and accompanying drawings.
It should be understood that the foregoing description is only an overview of the technical solutions of the present invention, so that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the specification. The following specific embodiments of the present invention are described in order to make the above and other objects, features and advantages of the present invention more comprehensible.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will become apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic diagram of a text detection model training apparatus according to an embodiment of the present invention;
FIG. 2 is a flow chart of a text detection model training method according to an embodiment of the invention;
FIG. 3 is a flowchart of another text detection model training method according to an embodiment of the present invention;
FIG. 4 is a flowchart of yet another text detection model training method according to an embodiment of the present invention;
FIG. 5 is a flowchart of still another text detection model training method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a calculation process of a first predictive network according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a calculation process of a second predictive network according to an embodiment of the invention;
FIG. 8 is a flow chart of a text detection method according to an embodiment of the invention;
FIG. 9 is a flowchart of another text detection method according to an embodiment of the invention;
FIG. 10 is a schematic diagram of a training device for text detection models according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a text detection device according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the description of embodiments of the present application, it should be understood that terms such as "comprises" or "comprising" are intended to indicate the presence of features, numbers, steps, acts, components, portions or combinations thereof disclosed in the present specification, and are not intended to exclude the possibility of the presence of one or more other features, numbers, steps, acts, components, portions or combinations thereof.
Unless otherwise indicated, "/" means or, e.g., A/B may represent A or B; "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone.
The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Referring first to FIG. 1, which schematically shows a computing device 100 in which exemplary implementations according to the present disclosure may be used. FIG. 1 is a schematic structural diagram of the hardware running environment for the text detection model training method. The text detection model training device in the embodiments of the invention may be a terminal device such as a PC or a portable computer.
As shown in fig. 1, the text detection model training apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the text detection model training device structure shown in FIG. 1 does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange components differently.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a text detection model training program may be included in a memory 1005, which is a type of computer storage medium. The operating system is a program for managing and controlling hardware and software resources of the text detection model training device, and supports the running of the text detection model training program and other software or programs.
In the text detection model training apparatus shown in FIG. 1, the user interface 1003 is mainly used for receiving requests, data, etc. sent by the first terminal, the second terminal, and the supervision terminal; the network interface 1004 is mainly used for connecting to the background server for data communication; and the processor 1001 may be configured to invoke the text detection model training program stored in the memory 1005 and perform the following operations:
acquiring a sample feature map of a sample picture; inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, each smaller in size than the sample picture; performing differentiable binarization on the first probability map and the first threshold map to obtain an approximate binary map; and performing supervised learning on the approximate binary map based on the sample label of the sample picture to train and generate a text detection model. Training with a probability map and a threshold map smaller than the sample picture significantly improves the post-processing efficiency of the trained text detection model.
The invention also optimizes the neural network structure of the first prediction network, adjusting its convolution kernels and weights accordingly, so that the first prediction network outputs a first probability map and a first threshold map smaller in size than the sample picture; because the output size is smaller, the processing speed of the first prediction network is improved. Outputting a probability map and a threshold map smaller than the sample picture for post-processing reduces the amount of computation and improves text detection efficiency. Here, post-processing refers to computations such as binarization, contour calculation, contour dilation, and bounding-rectangle fitting performed on the first probability map and the first threshold map to generate the segmentation features for text detection.
FIG. 2 illustrates a flow chart for performing a text detection model training method according to an embodiment of the present disclosure. The method may be performed, for example, by a computing device 100 as shown in fig. 1. It should be understood that method 200 may also include additional blocks not shown and/or that the blocks shown may be omitted, the scope of the disclosure being not limited in this respect.
Step 210, obtaining a sample feature map of a sample picture;
step 220, inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map.
In particular, the first probability map and the first threshold map have a size smaller than the sample picture, e.g. the first probability map and the first threshold map may have a size of 1/2 of the sample picture.
Step 230, performing differentiable binarization on the first probability map and the first threshold map to obtain an approximate binary map.
Step 240, performing supervised learning on the probability map, the threshold map and the approximate binary map based on the sample labels of the sample pictures, and training to generate a text detection model.
Referring to fig. 3, during training, a sample picture is input into the text detection model to be trained. The model is based on the DBNet network structure and comprises a feature extraction network, a first prediction network, and a differentiable binarization module. The feature extraction network comprises an FPN network and a concat operation module. In step 210, the sample picture passes through the FPN structure of the feature extraction network to obtain four feature maps at 1/4, 1/8, 1/16, and 1/32 of the sample picture size; these four feature maps are each upsampled and fused, and then concatenated, yielding the sample feature map F in fig. 3, whose size is 1/4 of the original sample picture. The first prediction network may include a pred' prediction head. The feature extraction network is not limited to this procedure, and the size of the finally extracted feature map is not limited to 1/4 of the original sample picture; it may also be 1/8, 1/16, etc.
In step 220, the sample feature map is input into the first prediction network; the pred' prediction head performs a series of convolution, batch normalization, ReLU, deconvolution, and sigmoid operations on the 1/4-size feature map to output a first probability map and a first threshold map, each at 1/2 of the sample picture size. Further, DBNet proposes a differentiable binarization (DB) module that performs binarization within the segmentation network; differentiable binarization with an adaptive threshold can not only distinguish text regions from the background but also separate closely adjacent text instances. Based on this, in step 230, the generated first probability map and first threshold map are input into the DB module, which outputs an approximate binary map. Then, in step 240, loss function 1 is calculated from the output probability map, threshold map, approximate binary map, and the sample label of the sample picture, and the parameters of the feature extraction network, the first prediction network, and the differentiable binarization module in the text detection model are adjusted according to the loss until the model converges.
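The DB module's approximate binarization can be sketched as a steep sigmoid of the probability/threshold difference. The amplification factor k = 50 is the value used in the DBNet paper and is not stated in this text.

```python
import math

def db_approx_binarize(p, t, k=50.0):
    """Differentiable binarization: B = 1 / (1 + exp(-k * (P - T))).
    Close to 1 where the probability map exceeds the threshold map,
    close to 0 otherwise, yet smooth enough to backpropagate through."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))
```

Unlike a hard step function, this form has a usable gradient everywhere, which is what allows the binarization step to sit inside the training graph.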
The sample label is a sample label map, and includes a sample probability map, a sample threshold map, and a sample binary map, e.g. a binary map that clearly marks the foreground and background portions of the sample picture.
Probability map: the value of each pixel represents the probability that the position belongs to a text region; it corresponds to the sample probability map during model training. Threshold map: the value of each pixel represents the binarization threshold at that position; it corresponds to the sample threshold map during model training. Approximate binary map: the value of each pixel is 0 or 1, computed from the probability map and the threshold map by the DB algorithm; it corresponds to the sample binary map during model training.
Before calculating the loss function using the sample label of the sample picture together with the probability map, the threshold map, and the approximate binary map, the sample label may be downsampled so that it matches the sizes of the probability map, threshold map, and approximate binary map, which facilitates model training.
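A nearest-neighbour downsample of the label map by strided slicing is one simple way to match the half-size outputs; this text does not specify the interpolation method, so the choice below is illustrative.

```python
import numpy as np

def downsample_label(label, factor=2):
    """Nearest-neighbour downsample of a label map by an integer factor."""
    return label[::factor, ::factor]

label = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]])
```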
It can be understood that, because the output probability map and threshold map are smaller in size than the original sample picture, the trained text detection model is markedly more efficient in post-processing computations such as binarization, contour calculation, contour dilation, and bounding-rectangle fitting.
In one embodiment, in step 210, to obtain the sample feature map, the following operations may be performed: firstly, collecting a training sample set, wherein the training sample set comprises a plurality of sample pictures carrying sample labels; the sample tag may be text outline annotation information of the sample picture. And then, inputting the sample picture into a feature extraction network to obtain a sample feature map.
Fig. 4 is a schematic flow chart of a text detection model training method according to another exemplary embodiment of the present invention; this embodiment further improves detection accuracy on the basis of the embodiment shown in fig. 2.
The sample feature map is also input into a second prediction network whose output feature size is greater than or equal to the size of the sample picture, and intermediate supervised learning is performed on the text detection model using the output features of the second prediction network. The first prediction network is supervised by the output features of a second prediction network that has been trained to meet the accuracy requirement; because the second prediction network's output feature size is large enough, and its accuracy sufficient, to detect text precisely, it provides a reference for the first prediction network, whose output is smaller. The parameters of the feature extraction network, the first prediction network, and the differentiable binarization module in the text detection model are adjusted so that the output of the first prediction network meets the accuracy requirement and text can be detected accurately.
As shown in fig. 4, the method provided in this embodiment may further include the following steps:
step 251, the sample feature map is input into a second prediction network, so as to obtain a second probability map and a second threshold map.
The sizes of the second probability map and the second threshold map are larger than or equal to the size of the sample picture;
step 252, calculating a probability map loss of the first probability map using the second probability map;
step 253, calculating a threshold map loss of the first threshold map by using the second threshold map;
and step 254, performing intermediate supervised learning on the text detection model by using the probability map loss and the threshold map loss.
Referring to fig. 5, the second prediction network may include a pred prediction head. In step 251, the sample feature map is input into the second prediction network; through a series of convolution, batch normalization, ReLU, deconvolution, convolution, and sigmoid operations performed by the pred prediction head, a second probability map and a second threshold map are output, each with size greater than or equal to that of the sample picture. In steps 252 and 253, after downsampling the second probability map and second threshold map, or upsampling the first probability map and first threshold map, the probability map loss between the second and first probability maps and the threshold map loss between the second and first threshold maps are calculated, loss function 2 is computed, and the parameters of the feature extraction network, the first prediction network, and the differentiable binarization module in the text detection model are adjusted according to the loss.
It will be appreciated that the second probability map and the second threshold map have the same size as the sample picture and therefore higher accuracy; loss function 2 can effectively transfer the high-resolution, more accurate segmentation edge information to the low-resolution output segmentation map, so that the low-resolution prediction result can be corrected.
Optionally, the second probability map and the second threshold map are the same size as the sample picture. In this way, sufficiently high image accuracy is retained, and the low-resolution prediction result can be effectively corrected.
Optionally, the first probability map and the first threshold map are 1/2 of the size of the sample picture. The amount of data to be calculated is thereby significantly reduced, and post-processing efficiency can be effectively improved.
Optionally, during training, cross training may be performed alternately according to loss function 1 and loss function 2, or loss function 1 and loss function 2 may be combined into one loss function by weighting and training performed as a whole. This is not particularly limited in this application.
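The two schedules mentioned above can be sketched as follows; the per-step alternation pattern and the unit weights are assumptions, since the application leaves both unspecified:

```python
def training_step_loss(step, loss1, loss2, mode="weighted", w=(1.0, 1.0)):
    """Return the scalar loss to backpropagate at this training step.

    mode="weighted": merge the two losses (the combined option above).
    mode="cross": alternate between them per step (cross training).
    The weights w and the even/odd alternation are assumptions.
    """
    if mode == "cross":
        return loss1 if step % 2 == 0 else loss2
    return w[0] * loss1 + w[1] * loss2

assert training_step_loss(0, 0.8, 0.2, mode="cross") == 0.8
assert training_step_loss(1, 0.8, 0.2, mode="cross") == 0.2
```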
It should be noted that, the steps not described in detail in this embodiment may refer to descriptions of related steps in the embodiment shown in fig. 2, which are not described herein.
In one embodiment, in step 252 and step 253 described above, the probability map loss and/or the threshold map loss may be calculated using a KL divergence loss function. The KL divergence is the expected value of the logarithmic difference between the probabilities of the original distribution and the approximate distribution, and can effectively present the degree of distribution difference between the lower-accuracy first probability map/first threshold map and the higher-accuracy second probability map/second threshold map.
Specifically, the following KL divergence loss function formula may be utilized for intermediate supervised learning:
$$\mathcal{L}_{KL} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{pred}(x_i)\,\log\frac{\mathrm{pred}(x_i)}{\mathrm{pred}'(x_i)}$$
wherein pred'(x_i) refers to the value of the first probability map/first threshold map output by the first prediction network at pixel x_i, pred(x_i) refers to the value of the second probability map/second threshold map output by the second prediction network at pixel x_i, i is the pixel number, and N is the total number of pixels.
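A concrete numpy sketch of the per-pixel KL computation described above, with the second (teacher) map as the reference distribution; the clipping epsilon is an assumption added for numerical safety, as the text only gives the bare formula:

```python
import numpy as np

def kl_map_loss(pred_t, pred_s, eps=1e-6):
    """KL-style loss between a teacher map pred_t (second prediction
    network) and a student map pred_s (first prediction network),
    both already resized to the same shape. eps guards the log and
    is an assumption, not part of the patent's formula.
    """
    pred_t = np.clip(pred_t, eps, 1.0)
    pred_s = np.clip(pred_s, eps, 1.0)
    return float(np.mean(pred_t * np.log(pred_t / pred_s)))

teacher = np.array([0.9, 0.1, 0.5])
student = np.array([0.9, 0.1, 0.5])
assert kl_map_loss(teacher, student) == 0.0  # identical maps, zero loss
```

The loss grows as the student map drifts from the teacher map, which is exactly the supervision signal transferred in steps 252-254.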
Alternatively, other loss functions capable of exhibiting a degree of distribution discrepancy may be used to calculate the probability map loss and/or the threshold map loss, as not particularly limited in this application.
In one embodiment, to obtain a first probability map and a first threshold map of size 1/2 of a sample picture, the first prediction network may perform the following operations:
(1) Performing convolution operation on the sample feature map to obtain a first intermediate map;
(2) Performing batch standardization processing and activation processing on the first intermediate graph to obtain a second intermediate graph;
(3) Performing deconvolution operation on the second intermediate graph, and outputting a third intermediate graph;
(4) Performing convolution operation on the third intermediate graph, and outputting a fourth intermediate graph with the number of channels being 1;
(5) And outputting a first probability map and a first threshold map according to the sigmoid function and the fourth intermediate map.
Referring to FIG. 6, for example, first a 3×3 convolution operation is performed on the sample feature map, compressing the channels to 1/4 of the input, followed by batch normalization and activation processing to obtain a feature map of size (batchsize, 64, 1/4W, 1/4H), i.e. the second intermediate map; a deconvolution operation with a 2×2 kernel is performed on the second intermediate map to obtain a feature map of size (batchsize, 256, 1/2W, 1/2H), i.e. the third intermediate map, whose size is 1/2 of the original picture; a 3×3 convolution operation is performed on the third intermediate map to output a feature map with 1 channel and size (batchsize, 1, W/2, H/2), i.e. the fourth intermediate map; finally, the fourth intermediate map is passed through a sigmoid function to output the first probability map P and the first threshold map T, each at 1/2 of the original picture size. The neural network structure of the first prediction network is not limited to the above embodiment; after the structure is adjusted, the convolution kernels and weights can be adapted accordingly, as long as the output is a feature map with 1 channel and size (batchsize, 1, W/2, H/2), i.e. the fourth intermediate map.
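The shape bookkeeping in the example above can be checked with a small helper. The channel counts (a 256-channel input compressed to 64, then expanded by the deconvolution) follow the text; treating W and H as multiples of 4 is an assumption made for the integer arithmetic:

```python
def first_head_shapes(batch, w, h):
    """Track feature-map shapes through the first prediction head,
    given an original picture of size (w, h); the head's input is the
    1/4-size sample feature map (batch, 256, w/4, h/4).
    """
    return [
        (batch, 64, w // 4, h // 4),   # 3x3 conv (256 -> 64 ch) + BN + ReLU
        (batch, 256, w // 2, h // 2),  # 2x2 stride-2 deconv doubles spatial size
        (batch, 1, w // 2, h // 2),    # 3x3 conv down to 1 channel
    ]  # a sigmoid on the last map yields P or T at 1/2 the picture size

assert first_head_shapes(8, 640, 640)[-1] == (8, 1, 320, 320)
```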
Then, supervised learning is performed on the first prediction network using the trained second prediction network and the sample labels, to adjust the parameters of the first prediction network and also the parameters of the feature extraction network, so as to train a text detection model meeting the accuracy requirement. The finally output first probability map and first threshold map are smaller than the original picture yet of high accuracy, participate in the post-processing calculation, and shorten the text detection time.
In one embodiment, to obtain a second probability map and a second threshold map of the same size as the sample picture, the second prediction network may perform the following operations:
(1) Performing convolution operation on the sample feature map to obtain a fifth intermediate map;
(2) Performing batch standardization processing and activation processing on the fifth intermediate graph to obtain a sixth intermediate graph;
(3) Performing deconvolution operation on the sixth intermediate graph, and outputting a seventh intermediate graph;
(4) Performing deconvolution operation on the seventh intermediate graph, and outputting an eighth intermediate graph with the number of channels being 1;
(5) And outputting a second probability map and a second threshold map according to the sigmoid function and the eighth intermediate map.
Referring to fig. 7, first a feature map of size (batchsize, 256, 1/4W, 1/4H), i.e. 1/4 of the sample picture, is obtained through the feature extraction network. The specific process of obtaining the second probability map and the second threshold map through the pred prediction head is as follows: a 3×3 convolution operation is performed on the sample feature map, compressing the channels to 1/4 of the input, followed by batch normalization and activation processing to obtain a feature map of size (batchsize, 64, 1/4W, 1/4H), i.e. the sixth intermediate map; a deconvolution operation with a 2×2 kernel is performed on the sixth intermediate map to obtain a feature map of size (batchsize, 256, 1/2W, 1/2H), i.e. the seventh intermediate map, whose size is 1/2 of the original picture; a 2×2 deconvolution operation is performed on the seventh intermediate map to output a feature map with 1 channel and size (batchsize, 1, W, H), i.e. the eighth intermediate map; finally, the eighth intermediate map is passed through a sigmoid function to output the second probability map P2 and the second threshold map T2 at the original picture size. The neural network structure of the second prediction network is not limited to the above embodiment. It can be understood that the second prediction network is a relatively mature trained model for text detection in the prior art, and its output second probability map and second threshold map meet the accuracy requirement, so a good text detection effect can be achieved.
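The second head differs from the first only in replacing the final convolution with a second stride-2 deconvolution, which restores the full picture size. A shape-tracking sketch, under the same assumptions as before (channel counts from the text, W and H multiples of 4):

```python
def second_head_shapes(batch, w, h):
    """Track feature-map shapes through the second (teacher)
    prediction head, given an original picture of size (w, h).
    """
    return [
        (batch, 64, w // 4, h // 4),   # 3x3 conv (256 -> 64 ch) + BN + ReLU
        (batch, 256, w // 2, h // 2),  # first 2x2 stride-2 deconv
        (batch, 1, w, h),              # second 2x2 stride-2 deconv, 1 channel
    ]  # a sigmoid on the last map yields P2 or T2 at full picture size

assert second_head_shapes(8, 640, 640)[-1] == (8, 1, 640, 640)
```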
The second probability map and second threshold map output by the second prediction network are used to supervise the first probability map and first threshold map of the first prediction network, to adjust the parameters of the first prediction network; the parameters of the feature extraction network arranged at the front end of the first prediction network can also be adjusted correspondingly.
In one embodiment, the text detection model includes: the trained feature extraction network and the probability map branch of the first prediction network. In this way, the threshold map calculation can be omitted in the subsequent text detection model and a fixed threshold used instead, which can effectively improve post-processing efficiency.
Alternatively, the text detection model may of course also include: the trained feature extraction network and the whole first prediction network. In this case, detection accuracy can be better ensured.
In the description of the present specification, reference to the terms "some possible embodiments," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiments or examples is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the various embodiments or examples described in this specification and the features of the various embodiments or examples may be combined and combined by those skilled in the art without contradiction.
With respect to the method flowcharts of the embodiments of the present application, certain operations are described as distinct steps performed in a certain order. Such a flowchart is illustrative and not limiting. Some steps described herein may be grouped together and performed in a single operation, may be partitioned into multiple sub-steps, and may be performed in an order different than that shown herein. The various steps illustrated in the flowcharts may be implemented in any manner by any circuit structure and/or tangible mechanism (e.g., by software running on a computer device, hardware (e.g., processor or chip implemented logic functions), etc., and/or any combination thereof).
Based on the same technical concept, an embodiment of the present invention further provides a text detection method, which is an inference method for text contour detection using the text detection model trained in any one of the above embodiments. Fig. 8 is a flowchart of a text detection method according to an embodiment of the present invention.
Referring to fig. 8, the method 800 includes:
step 810, obtaining a picture to be detected, inputting the picture to be detected into a text detection model trained by the method in the embodiment, and outputting a probability map of the picture to be detected;
Step 820, binarizing the probability map by using a fixed threshold or a threshold map output by the text detection model to obtain a binary map;
And step 830, performing text segmentation on the picture to be detected by using the binary map. In this way, the background part and the text part of the picture to be detected are determined, and a text detection frame is formed. Before text segmentation, the binary map needs to be upsampled to be consistent with the size of the picture to be detected.
Referring to fig. 9, for example, an acquired picture to be detected may be input into the text detection model: a feature map is generated through the trained feature extraction network, the feature map is input into the first prediction network, a probability map is generated by the probability map branch of the trained first prediction network, and fixed-threshold binarization is performed on the probability map to obtain an approximate binary map; the approximate binary map is post-processed, the post-processed map is finally upsampled to the original picture size, the background part and the text part of the picture to be detected are identified according to the upsampled post-processed map, and the detection frame of the picture to be detected is determined to obtain the picture with the detection frame. The fixed-threshold method saves computation and improves processing efficiency.
Alternatively, instead of the fixed threshold method, a threshold map may be generated based on the threshold map branch of the first prediction network, and binarization processing may be performed based on the threshold map and the probability map to obtain the approximate binary map. In this way, a more accurate detection effect can be obtained, which is not limited in this application.
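A minimal numpy sketch of the fixed-threshold inference path (steps 820 and 830 above): binarize the half-resolution probability map, then upsample the binary map to the original picture size. The threshold value 0.3 and the nearest-neighbour upsampling are assumptions; the text only specifies "a fixed threshold" and upsampling to the picture size:

```python
import numpy as np

def detect_binary_mask(prob_map, orig_shape, fixed_thresh=0.3):
    """Binarize a probability map with a fixed threshold and upsample
    the result to orig_shape (assumed an integer multiple of the map
    size). The threshold 0.3 is an illustrative assumption.
    """
    binary = (prob_map > fixed_thresh).astype(np.uint8)
    fy = orig_shape[0] // binary.shape[0]
    fx = orig_shape[1] // binary.shape[1]
    return np.repeat(np.repeat(binary, fy, axis=0), fx, axis=1)

prob = np.array([[0.9, 0.1], [0.2, 0.7]])   # 2x2 half-size probability map
mask = detect_binary_mask(prob, (4, 4))
assert mask.shape == (4, 4)
```

The resulting mask separates text from background; forming the detection frames from it (connected components, polygon fitting) is the remaining post-processing step.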
Based on the same technical concept, the embodiment of the invention also provides a text detection model training device, which is used for executing the text detection model training method provided by any embodiment. Fig. 10 is a schematic structural diagram of a training device for text detection models according to an embodiment of the present invention.
Referring to fig. 10, the apparatus 100 includes:
the feature extraction module 101 is configured to input a sample picture into a feature extraction network to obtain a sample feature map;
the first prediction module 102 is configured to input the sample feature map into a first prediction network, and obtain a first probability map and a first threshold map, which are smaller than the sample picture in size;
the differentiable binarization module 103 is configured to perform differentiable binarization processing on the first probability map and the first threshold map to obtain an approximate binary map;
and the training module 104 is used for performing supervised learning on the approximate binary image based on the sample label of the sample image, and training to generate a text detection model.
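The differentiable binarization performed by module 103 above is, in DB-style detectors, typically a scaled sigmoid of the difference between the probability map and the threshold map; a minimal numpy sketch, where the steepness factor k=50 follows the DBNet paper and is an assumption here, since the patent does not fix the formula:

```python
import numpy as np

def approx_binarize(prob_map, thresh_map, k=50.0):
    """Differentiable binarization: B = 1 / (1 + exp(-k * (P - T))).

    Pixels well above the local threshold saturate near 1, pixels well
    below it near 0, while the function stays differentiable so the
    threshold map can be learned end to end. k=50 is an assumption.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

P = np.array([[0.9, 0.1], [0.6, 0.4]])  # probability map
T = np.full((2, 2), 0.5)                # threshold map
B = approx_binarize(P, T)
assert B[0, 0] > 0.99 and B[0, 1] < 0.01  # near-binary output
```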
Based on the same technical concept, the embodiment of the invention also provides a text detection device, which is used for executing the text detection method provided by any one of the embodiments. Fig. 11 is a schematic structural diagram of a text detection device according to an embodiment of the present invention.
Referring to fig. 11, the apparatus includes:
the detection module 111 is configured to obtain a picture to be detected, input the picture to be detected into the text detection model trained by the method of any one of the above embodiments, and output a probability map of the picture to be detected;
the binarization module 112 is configured to perform binarization processing on the probability map by using a fixed threshold or a threshold map output by the text detection model, to obtain an approximate binary map;
the text segmentation module 113 is configured to perform text segmentation on the image to be detected by using the approximate binary image.
It should be noted that, the apparatus in the embodiments of the present application may implement each process of the foregoing method embodiment and achieve the same effects and functions, which are not described herein again.
According to some embodiments of the present application, there is provided a non-transitory computer storage medium having stored thereon computer executable instructions configured to, when executed by a processor, perform: the method according to the above embodiment.
In this application, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are referred to each other, and each embodiment is mainly described as different from other embodiments. In particular, for apparatus, devices and computer readable storage medium embodiments, the description thereof is simplified as it is substantially similar to the method embodiments, as relevant points may be found in part in the description of the method embodiments.
The apparatus, the device, and the computer readable storage medium provided in the embodiments of the present application are in one-to-one correspondence with the methods, and therefore, the apparatus, the device, and the computer readable storage medium also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the apparatus, the device, and the computer readable storage medium are not repeated herein.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus (device or system), or computer readable storage medium. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer-readable storage medium embodied in one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices or systems) and computer-readable storage media according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in these aspects cannot be combined to advantage; this division is merely for convenience of description. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (18)

1. A training method for a text detection model is characterized by comprising the following steps:
acquiring a sample feature map of a sample picture, and inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, wherein the sizes of the first probability map and the first threshold map are smaller than the size of the sample picture;
performing differentiable binarization processing on the first probability map and the first threshold map to obtain an approximate binary map;
and performing supervised learning on the approximate binary image based on the sample label of the sample image, and training to generate a text detection model.
2. The method according to claim 1, wherein the method further comprises:
and inputting the sample feature map into a second prediction network, wherein the output feature size of the second prediction network is larger than or equal to the size of the sample picture, and performing intermediate supervised learning on the text detection model by utilizing the output feature of the second prediction network.
3. The method according to claim 2, wherein the method further comprises:
inputting the sample feature map into a second prediction network to obtain a second probability map and a second threshold map, wherein the sizes of the second probability map and the second threshold map are larger than or equal to the sizes of the sample pictures;
calculating a probability map penalty for the first probability map using the second probability map;
calculating a threshold map penalty for the first threshold map using the second threshold map;
and performing intermediate supervised learning on the text detection model by utilizing the probability map loss and the threshold map loss.
4. A method according to claim 3, wherein the second probability map, the second threshold map and the sample picture are the same size.
5. A method according to claim 3, further comprising: the probability map loss and/or the threshold map loss are calculated using a KL divergence loss function.
6. The method according to claim 5, wherein the intermediate supervised learning is performed using the KL divergence loss function formula:
$$\mathcal{L}_{KL} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{pred}'(x_i)\,\log\frac{\mathrm{pred}'(x_i)}{\mathrm{pred}(x_i)}$$
wherein pred refers to the output of the first prediction network, pred' refers to the output of the second prediction network, i is the pixel number, x_i is the pixel point, and N is the total number of pixels.
7. The method of claim 1, wherein the first probability map and the first threshold map are 1/2 of the sample picture in size.
8. The method of claim 1, wherein the first predictive network performs:
performing convolution operation on the sample feature map to obtain a first intermediate map;
performing batch standardization processing and activation processing on the first intermediate graph to obtain a second intermediate graph;
performing deconvolution operation on the second intermediate graph, and outputting a third intermediate graph;
performing convolution operation on the third intermediate graph, and outputting a fourth intermediate graph with the number of channels being 1;
and outputting the first probability map and the first threshold map according to a sigmoid function and the fourth intermediate map.
9. A method according to claim 3, wherein the second predictive network performs:
performing convolution operation on the sample feature map to obtain a fifth intermediate map;
performing batch standardization processing and activation processing on the fifth intermediate graph to obtain a sixth intermediate graph;
performing deconvolution operation on the sixth intermediate graph, and outputting a seventh intermediate graph;
performing deconvolution operation on the seventh intermediate graph, and outputting an eighth intermediate graph with the number of channels being 1;
And outputting the second probability map and the second threshold map according to a sigmoid function and the eighth intermediate map.
10. The method of claim 1, wherein the obtaining a sample feature map of a sample picture comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of sample pictures carrying sample labels;
and inputting the sample picture into a feature extraction network to obtain a sample feature map.
11. The method of claim 1, wherein the approximate binary image is supervised learning based on sample labels of the sample images, further comprising:
and carrying out downsampling treatment on the sample label to ensure that the size of the sample label is the same as that of the approximate binary image.
12. The method of claim 10, wherein the text detection model comprises at least: the trained probability map branches of the feature extraction network and the first prediction network.
13. A text detection method, characterized by comprising the following steps:
obtaining a picture to be detected, inputting the picture to be detected into a character detection model trained by the method according to any one of claims 1-11, and outputting a probability map of the picture to be detected;
Performing binarization processing on the probability map by using a fixed threshold or a threshold map output by the text detection model to obtain a binary map;
and performing text segmentation on the image to be detected by using the binary image.
14. A text detection model training apparatus configured to perform the method of any of claims 1-12, comprising:
the feature extraction module is used for inputting the sample picture into a feature extraction network to obtain a sample feature map;
the first prediction module is used for inputting the sample feature map into a first prediction network to obtain a first probability map and a first threshold map, wherein the sizes of the first probability map and the first threshold map are smaller than the size of the sample picture;
the differentiable binarization module is used for performing differentiable binarization processing on the first probability map and the first threshold map to obtain an approximate binary map;
and the training module is used for performing supervised learning on the approximate binary map based on the sample label of the sample picture, and training to generate a text detection model.
15. A text detection device configured to perform the method of claim 13, the device comprising:
the detection module is used for acquiring a picture to be detected, inputting the picture to be detected into a text detection model trained by the method according to any one of claims 1-12, and outputting a probability map of the picture to be detected;
The binarization module is used for carrying out binarization processing on the probability map by using a fixed threshold or a threshold map output by the text detection model to obtain a binary map;
and the character segmentation module is used for carrying out character segmentation on the image to be detected by utilizing the binary image.
16. A text detection model training system, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform: the method of any one of claims 1-12.
17. A text detection system, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform: the method of claim 13.
18. A computer readable storage medium storing a program which, when executed by a processor, causes the processor to perform the method of any one of claims 1-12 or the method of claim 13.
CN202211441255.2A 2022-11-17 2022-11-17 Text detection and model training method, device and system and readable storage medium Pending CN116206313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211441255.2A CN116206313A (en) 2022-11-17 2022-11-17 Text detection and model training method, device and system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211441255.2A CN116206313A (en) 2022-11-17 2022-11-17 Text detection and model training method, device and system and readable storage medium

Publications (1)

Publication Number Publication Date
CN116206313A true CN116206313A (en) 2023-06-02

Family

ID=86513685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211441255.2A Pending CN116206313A (en) 2022-11-17 2022-11-17 Text detection and model training method, device and system and readable storage medium

Country Status (1)

Country Link
CN (1) CN116206313A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination