CN115937870A - Training method and device of character-level text detection model, medium and terminal - Google Patents

Training method and device of character-level text detection model, medium and terminal

Info

Publication number
CN115937870A
CN115937870A (application CN202111159043.0A)
Authority
CN
China
Prior art keywords
sample
character
level
detection model
text detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111159043.0A
Other languages
Chinese (zh)
Inventor
聂诗武
沈晓静
张子也
何思清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fudan Microelectronics Group Co Ltd
Original Assignee
Shanghai Fudan Microelectronics Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fudan Microelectronics Group Co Ltd filed Critical Shanghai Fudan Microelectronics Group Co Ltd
Priority to CN202111159043.0A priority Critical patent/CN115937870A/en
Publication of CN115937870A publication Critical patent/CN115937870A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A training method, an apparatus, a medium and a terminal for a character-level text detection model are provided. The method includes the following steps: in the k-th iteration, inferring the enhanced sample set of each sample in the service candidate data set with the text detection model obtained in the (k-1)-th iteration to obtain a prediction box set; calculating a prediction-result consistency index for each sample; selecting target samples according to the consistency index of each sample; adding the target samples, after manual word-level labeling, to the selected sample set; inferring the target samples with the model from the (k-1)-th iteration to obtain character-level pseudo labels, and training that model on the labeled target samples and their character-level pseudo labels to obtain the text detection model of the k-th iteration; and evaluating the text detection model obtained in the k-th iteration, and if the evaluation passes, obtaining the character-level text detection model. The scheme reduces the time and cost of labeling.

Description

Training method and device of character-level text detection model, medium and terminal
Technical Field
The embodiment of the invention relates to the field of visual text detection, in particular to a training method and device, a medium and a terminal for a character-level text detection model.
Background
With the rapid development of deep learning and the great improvement of computing power, many tasks have been made intelligent with the help of deep learning models. However, deep learning requires massive labeled samples to train a model to the expected generalization capability.
In the field of visual text detection, text information in real scenes is rich, and fonts, character sizes, bending degrees, slopes and the like are uncertain. To accurately predict text information in a real scene, massive training data needs to be manually labeled before training the model so that it can learn rich text knowledge. However, the training data may reach tens of thousands or even millions of samples, so the labeling process involves a large data volume, a long labeling time and a high cost.
Disclosure of Invention
The embodiments of the invention address the technical problems of large labeling data volume, long labeling time and high labeling cost.
In order to solve the above technical problems, an embodiment of the present invention provides a method for training a character-level text detection model, including: in the k-th iteration, inferring the enhanced sample set of each sample in the service candidate data set with the text detection model obtained in the (k-1)-th iteration to obtain a prediction box set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set includes an original sample and enhanced samples of the original sample, the prediction box set includes the prediction boxes respectively corresponding to the original sample and its enhanced samples, and k is an integer greater than 1; for each sample, calculating a prediction-result consistency index according to the prediction box set of its enhanced sample set, wherein the consistency index characterizes the degree of consistency of the prediction results; selecting target samples from the service candidate data set according to the consistency index of each sample; performing manual word-level labeling on the target samples, and adding the labeled target samples to a selected sample set, wherein the selected sample set includes a plurality of word-level labeled target samples; inferring each labeled target sample with the text detection model obtained in the (k-1)-th iteration to obtain a character-level pseudo label for each target sample, and training that model on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the k-th iteration; and evaluating the text detection model obtained in the k-th iteration with the test set, and if the evaluation passes, obtaining the character-level text detection model.
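As an illustrative, non-limiting sketch of the selection step above, the following Python function ranks candidate samples by their prediction-result consistency index and picks the least consistent ones. Treating low consistency as the selection criterion is an assumption; the claim only states that targets are selected according to the index.

```python
def select_targets(candidates, consistency_index, n_select):
    """Pick the n_select candidate samples whose predictions across the
    enhanced sample set are least consistent. Low consistency suggests the
    current model is uncertain about the sample, so manual word-level
    labeling of it is assumed to benefit the next iteration most.

    candidates: list of sample identifiers
    consistency_index: dict mapping sample identifier -> score in [0, 1]
    """
    ranked = sorted(candidates, key=lambda s: consistency_index[s])
    return ranked[:n_select]
```

Any monotone selection rule over the index (for example a threshold instead of a top-n cut) would fit the claim equally well.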
Optionally, the training method for the character-level text detection model further includes: and if the evaluation is not passed, continuing to carry out iterative training based on the text detection model of the kth iteration until the evaluation is passed, and obtaining the character-level text detection model.
Optionally, the calculating, for each sample, the prediction-result consistency index according to the prediction box set of its enhanced sample set includes: for each target object in the sample, calculating the ratio of the union of the target object's prediction boxes to their intersection, and obtaining the prediction-result consistency index of the target object from this ratio; and, for each sample, obtaining the prediction-result consistency index of the sample from the consistency indexes of the target objects in the sample.
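One plausible concrete reading of this calculation, sketched below for a pair of boxes at a time, is an IoU-style score, i.e. the reciprocal of the union-to-intersection ratio named above. The pairwise-average form and the axis-aligned (x1, y1, x2, y2) simplification are assumptions; the patent allows quadrilateral boxes.

```python
def box_area(b):
    # b = (x1, y1, x2, y2), axis-aligned
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes; 1.0 means the
    boxes coincide, 0.0 means they are disjoint."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def consistency_index(original_box, enhanced_boxes):
    """Average IoU between the original-sample prediction box of a target
    object and each enhanced-sample box (assumed already mapped back to
    original-image coordinates)."""
    return sum(iou(original_box, b) for b in enhanced_boxes) / len(enhanced_boxes)
```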
Optionally, the obtaining, for each sample, the predicted result consistency index of each sample according to the predicted result consistency index of the target object in each sample includes: determining a weight of each target object when the sample includes a plurality of target objects; and weighting the predicted result consistency indexes of the plurality of target objects according to the weight of each target object and the predicted result consistency index of each target object, and taking the weighted result as the predicted result consistency index of the sample.
Optionally, the determining the weight of each target object includes: determining a weight according to the size of each target object in the prediction frame of the original sample, wherein the weight is positively correlated with the size of the target object in the prediction frame of the original sample.
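A minimal sketch of this size-weighted aggregation, assuming axis-aligned boxes and weights directly proportional to box area; the patent only requires that the weight be positively correlated with the target object's size in the original-sample prediction box.

```python
def sample_consistency(object_boxes, object_scores):
    """Weighted prediction-result consistency index for one sample.

    object_boxes: (x1, y1, x2, y2) box of each target object in the
        original sample; the box area serves as the (unnormalized) weight.
    object_scores: per-object consistency index, same order.
    """
    areas = [max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1]) for b in object_boxes]
    total = sum(areas)
    weights = [a / total for a in areas]  # assumes at least one non-empty box
    return sum(w * s for w, s in zip(weights, object_scores))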
Optionally, the inferring a target sample labeled at each word level by using the text detection model obtained by the (k-1) th iteration to obtain a character-level pseudo label of each target sample, and training the text detection model obtained by the (k-1) th iteration based on the character-level pseudo label of the target sample labeled at each word level and each target sample to obtain a text detection model of the kth iteration, includes: performing word-level slicing on each target sample in the selected sample set according to word-level labels to obtain one or more slices; deducing the slices of each target sample by adopting the text detection model obtained by the (k-1) th iteration to obtain a character-level region probability heat map; cutting each slice by adopting an image segmentation algorithm according to the character level region probability heat map of the slice of each target sample, and predicting to obtain a character prediction frame of each character; mapping and producing a two-dimensional Gaussian heatmap on a character prediction frame of each character; aiming at each target sample, obtaining a character-level pseudo label of each target sample according to the two-dimensional Gaussian heatmap and each target sample; and training the text detection model obtained by the k-1 th iteration according to the character-level pseudo label of each target sample and each target sample to obtain the text detection model of the k-th iteration.
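The Gaussian-heatmap step above can be sketched as follows; tying sigma to the box size is an assumption, as the patent only states that a two-dimensional Gaussian heat map is mapped onto each character prediction box.

```python
import math

def gaussian_heatmap(height, width, box):
    """Render an isotropic 2-D Gaussian centred on a character prediction
    box (x1, y1, x2, y2) over a height x width grid. The choice
    sigma = max(box side) / 4 is illustrative, not from the patent."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    sigma = max(box[2] - box[0], box[3] - box[1]) / 4.0 or 1.0
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]
```

Summing (or taking the per-pixel maximum of) one such map per character would give the character-level pseudo label for the whole target sample.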
Optionally, the training the text detection model obtained in the (k-1)-th iteration on each target sample and its pseudo label includes: for each slice, predicting the word prediction length of the word in the slice from the character prediction boxes of its characters; for each slice, calculating a prediction-result truth index from the word prediction length and the word true length of the slice; for each target sample, determining its learning weight from the prediction-result truth indexes of all its slices; and training the text detection model obtained in the (k-1)-th iteration with the learning weight, the character-level pseudo label and the word-level labels of each target sample.
Optionally, after selecting a target sample from the service candidate data set, the target sample is deleted from the service candidate data set to update the service candidate data set.
Optionally, the training method of the character-level text detection model further includes: and when k is 1, in the 1 st iteration process, deducing the enhanced sample set of each sample in the service candidate data set by adopting an initial iteration text detection model.
Optionally, the initial iterative text detection model is obtained as follows: performing word-level slicing on each image sample in an open source training sample set to obtain a plurality of slices, wherein the open source training sample set includes a plurality of word-level labeled image samples; inferring each slice with an original model to obtain a character-level region probability heat map; for each slice, predicting the word prediction length in the slice from its character-level region probability heat map; calculating a prediction-result truth index for each slice from its word prediction length and word true length; for each image sample, determining its learning weight from the prediction-result truth indexes of all its slices, wherein the truth index is positively correlated with the learning weight; and training the original model with each image sample, its character-level region probability heat map and its learning weight to obtain the initial iterative text detection model.
Optionally, the original model is obtained by training in the following manner: generating word text information by randomly arranging and combining single characters; attaching the word text information to background pictures without text information to generate synthesized character-level labeled text image samples; and training with the synthesized character-level labeled text image samples to obtain the original model.
Optionally, the enhanced samples of an original sample are obtained by performing at least one of the following data enhancement operations on the original sample: motion blur, zooming, rotation, noise addition, flipping, brightness adjustment and color adjustment.
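A few of the listed operations can be sketched on a plain 2-D grayscale pixel grid as follows; the parameter choices (noise amplitude, brightness factor, 90-degree rotation) are illustrative assumptions, and motion blur, zooming and color adjustment would follow the same pattern.

```python
import random

def augment(image, op, rng=None):
    """Apply one data enhancement operation to `image`, a 2-D list of
    grayscale pixel values in [0, 255]."""
    rng = rng or random.Random(0)
    if op == "flip":          # horizontal flip
        return [row[::-1] for row in image]
    if op == "rotate":        # 90-degree clockwise rotation
        return [list(col) for col in zip(*image[::-1])]
    if op == "noise":         # additive uniform noise, clipped to [0, 255]
        return [[min(255, max(0, p + rng.randint(-10, 10))) for p in row]
                for row in image]
    if op == "brightness":    # scale pixel values by a fixed factor
        return [[min(255, int(p * 1.2)) for p in row] for row in image]
    raise ValueError(op)
```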
The embodiment of the invention also provides a text detection method, including: acquiring a text image to be detected; detecting the text image with the character-level text detection model obtained by the above training method to obtain a character detection result, wherein the character detection result includes a character probability heat map characterizing the probabilities of character regions; and connecting character regions according to the character detection result and the predicted connection (affinity) regions between characters to obtain a word-level detection result.
The embodiment of the present invention further provides a training apparatus for a character-level text detection model, including: an iteration unit, configured to infer, in the k-th iteration, the enhanced sample set of each sample in the service candidate data set with the text detection model obtained in the (k-1)-th iteration, to obtain a prediction box set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set includes an original sample and enhanced samples of the original sample, the prediction box set includes the prediction boxes respectively corresponding to the original sample and its enhanced samples, and k is an integer greater than 1; a calculation unit, configured to calculate the prediction-result consistency index of each sample according to the prediction box set of its enhanced sample set; a selection unit, configured to select target samples from the service candidate data set according to the consistency index of each sample; a training unit, configured to perform manual word-level labeling on the target samples, add the labeled target samples to a selected sample set that includes a plurality of word-level labeled target samples, infer each word-level labeled target sample with the text detection model obtained in the (k-1)-th iteration to obtain its character-level pseudo label, and train that model on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the k-th iteration; and an evaluation unit, configured to evaluate the text detection model obtained in the k-th iteration with the test set and, if the evaluation passes, obtain the character-level text detection model.
An embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile storage medium or a non-transitory storage medium, and on which a computer program is stored, where the computer program, when executed by a processor, performs any of the steps of the training method for the character-level text detection model.
The embodiment of the present invention further provides a terminal, which includes a memory and a processor, where the memory stores a computer program capable of running on the processor, and the processor executes the steps of any one of the above training methods for the character-level text detection model when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
in the k-th iteration, the text detection model obtained in the (k-1)-th iteration is used to infer the enhanced sample set of each sample in the service candidate data set, so as to obtain a prediction box set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set of each sample includes an original sample and enhanced samples of the original sample, and the prediction box set includes the prediction boxes respectively corresponding to the original sample and its enhanced samples. A prediction-result consistency index is calculated for each sample from the prediction box set of its enhanced sample set; target samples are selected from the service candidate data set according to the consistency index of each sample; manual word-level labeling is performed on the selected target samples, and the labeled target samples are added to the selected sample set. Each word-level labeled target sample is then inferred with the text detection model obtained in the (k-1)-th iteration to obtain its character-level pseudo label, and that model is trained on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the k-th iteration.
In the scheme, the prediction box sets obtained by inferring each sample's enhanced sample set with the text detection model of the (k-1)-th iteration are used to compute a prediction-result consistency index per sample, and target samples are then selected according to these indexes for manual word-level labeling. In each iteration, the model from the previous iteration thus selects the target samples that benefit it most; those samples receive manual word-level labels and are fed back for the current iteration of training, realizing active learning. Over multiple iterations, the samples most helpful to model training are selected automatically, and iterative training on the word-level labeled target samples continually and efficiently enriches the features learned by the text detection model, improving the accuracy of the detection results until training is completed. Selecting target samples by combining iterative active learning with character-level pseudo labels has two benefits: on the one hand, the target samples only need manual word-level labeling rather than character-level labeling, which reduces labeling difficulty and hence labeling workload; on the other hand, the total number of samples that finally need manual labeling is usually far lower than the size of the candidate sample corpus, so the amount of sample labeling, the labeling time and the labeling cost in training the character-level text detection model are all reduced.
Drawings
FIG. 1 is a flow chart of a method for training a character-level text detection model in an embodiment of the present invention;
FIG. 2 is a flow chart of training of an initial iterative text detection model in an embodiment of the present invention;
FIG. 3 is a flowchart of one embodiment of step S15;
FIG. 4 is a flowchart of one embodiment of step S156;
FIG. 5 is a flow chart illustrating another exemplary training process for a character-level text detection model in an embodiment of the present invention;
FIG. 6 is a flow chart of a text detection method in an embodiment of the invention;
FIG. 7 is a schematic structural diagram of a training apparatus for a character-level text detection model in an embodiment of the present invention.
Detailed Description
As described above, the models employed by existing text detection tasks typically include text detection models based on word-level regression boxes and text detection models based on character-level probability heat maps. A text detection model based on word-level regression boxes mostly outputs word-level predictions, that is, a rectangular box is used to locate each whole word. However, such models have two problems. First, the inference results identify the detected text with rectangular boxes, but the slope and bending degree of characters in real scenes vary, so locating words and characters directly with rectangular boxes yields low accuracy. Second, when learning and predicting long words, the receptive field can hardly cover the word size, which also lowers the accuracy of the text detection result. A text detection model based on character-level probability heat maps predicts the regions where characters may exist in the image and the affinity (connection) regions between characters, outputs probability heat maps, and connects character regions in post-processing; the output word locating box can be non-rectangular, and since the detection target is a character, the insufficient-receptive-field problem of word-level regression-box models is avoided. But when training such a model, the input labels must also be character-level labeling boxes, and for the text detection task of English words, character-level labeling usually costs about 6 times as much as word-level labeling.
Manually labeling training data at the word level is already resource-consuming; character-level manual labeling raises the difficulty by another level, resulting in a large labeling data volume, a long labeling time and a high cost.
In order to solve the above problems, in the embodiment of the present invention, in the k-th iteration, the text detection model obtained in the (k-1)-th iteration is used to infer the enhanced sample set of each sample in the service candidate data set, so as to obtain a prediction box set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set of each sample includes an original sample and enhanced samples of the original sample, and the prediction box set includes the prediction boxes respectively corresponding to the original sample and its enhanced samples. A prediction-result consistency index is calculated for each sample from the prediction box set of its enhanced sample set; target samples are selected from the service candidate data set according to the consistency index of each sample; manual word-level labeling is performed on the selected target samples, and the labeled target samples are added to the selected sample set. Each word-level labeled target sample is then inferred with the text detection model obtained in the (k-1)-th iteration to obtain its character-level pseudo label, and that model is trained on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the k-th iteration.
According to the method, the prediction box sets obtained by inferring each sample's enhanced sample set with the text detection model of the (k-1)-th iteration are used to compute a prediction-result consistency index per sample, and target samples are then selected according to these indexes for manual word-level labeling. In each iteration, the model from the previous iteration selects the target samples that benefit it most; those samples receive manual word-level labels and are fed back for the current iteration of training, realizing active learning. Over multiple iterations, the samples most helpful to model training are selected automatically, and iterative training on the word-level labeled target samples continually and efficiently enriches the features learned by the text detection model, improving the accuracy of the detection results until training is completed. Selecting target samples by combining iterative active learning with character-level pseudo labels means, on the one hand, that target samples only need manual word-level labeling rather than character-level labeling, which reduces labeling difficulty and hence labeling workload; on the other hand, the total number of samples that finally need manual labeling is usually far lower than the size of the candidate sample corpus, so the amount of sample labeling, the labeling time and the labeling cost in training the character-level text detection model are all reduced.
In order to make the aforementioned objects, features and advantages of the embodiments of the present invention more comprehensible, specific embodiments accompanied with figures are described in detail below.
The embodiment of the invention provides a training method for a character-level text detection model, which can be used in text detection scenarios for English, German, French, Spanish, or other languages in which words are formed from characters.
Referring to fig. 1, a flowchart of a training method for a character-level text detection model in the embodiment of the present invention is shown, which specifically includes the following steps:
Step S11: in the k-th iteration, infer the enhanced sample set of each sample in the service candidate data set with the text detection model obtained in the (k-1)-th iteration to obtain a prediction box set corresponding to the enhanced sample set of each sample.
In a specific implementation, the enhanced sample set of each sample (which may also be called the PATCH data set of the sample) includes the original sample and enhanced samples of the original sample, so the prediction box set corresponding to the enhanced sample set of each sample includes the prediction box of the original sample and the prediction boxes of its enhanced samples.
In computer-vision detection tasks, text detection is usually performed at the word level, and the words in the text need to be predicted. The prediction box therefore refers to the quadrilateral box that the text detection model infers and outputs for each word in the image, and it is used to frame the word.
In some non-limiting embodiments, a data enhancement operation may be performed on the original sample to obtain an enhanced sample of the original sample, and the number of the enhanced samples may be one or more. In each iteration process, each enhanced sample in the enhanced sample set of the adopted samples can be obtained by adopting different data enhancement operations. Specifically, in each iteration process, the data enhancement operation performed on the original sample may be set according to a text detection result and the like.
The data enhancement operations may include motion blur, zooming, rotation, noise addition, flipping, brightness adjustment, color adjustment, and the like. Performing data enhancement on the original samples increases the amount of sample data on the one hand; on the other hand, training the character-level text detection model with the enhanced samples improves the performance of the trained model, as well as its generalization capability on business data and its robustness to interference.
In specific implementation, the data enhancement mode of the original sample can be selected based on the prior knowledge of experts in combination with the application scene of the character-level text detection model, the characteristics of the text image to be detected in the application scene, and the like.
When k is an integer greater than 1, that is, starting from the 2nd iteration, in the k-th iteration the text detection model obtained in the (k-1)-th iteration is used to infer the enhanced sample set of each sample in the service candidate data set and obtain the prediction box set corresponding to each enhanced sample set; that is, for the enhanced sample set of each sample, the prediction box set includes the prediction box of the original sample and the prediction boxes of the enhanced samples.
When k is 1, that is, in the 1st iteration, the initial iterative text detection model may be used to infer the enhanced sample set of each sample in the service candidate data set and obtain the prediction box set corresponding to each enhanced sample set.
Referring to fig. 2, a flowchart of training an initial iterative text detection model in an embodiment of the present invention is shown. In some non-limiting embodiments, the initial iterative text detection model may be trained in the following manner, which may specifically include the following steps.
Step S21: perform word-level slicing on each image sample in an open source training sample set to obtain a plurality of slices, wherein the open source training sample set includes a plurality of word-level labeled image samples.
An open-source word-level labeled image text detection data set collected in real scenes can be used as the open source training samples. Because such a data set is already labeled, no manual labeling is needed when training the initial iterative text detection model, which reduces the manual labeling workload of the character-level text detection model during training. Training on real-scene data also improves the initial model's adaptability to noise, blur, font variation and the like, giving it a certain generalized prediction capability. This further improves the accuracy with which the trained character-level text detection model predicts text information, lays a better foundation for the generation of prediction boxes in step S11 (that is, the generation of pseudo labels for samples in the service candidate data set), and improves the accuracy of those prediction boxes.
Because the open-source training sample set consists of word-level labeled image samples, each image sample can be sliced according to its word-level ground-truth (GROUND TRUTH) boxes to obtain slices. Each word-level annotation may cover one or more words in the image sample, with one word per slice. In supervised learning, data is labeled and appears in the form (x, t), where x is the input data and t is the label; the correct label t is the ground truth.
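As a minimal sketch of this slicing step, assuming axis-aligned ground-truth boxes given as (x1, y1, x2, y2) pixel coordinates (the actual annotation format may use quadrilaterals, which would require a perspective crop):

```python
import numpy as np

def slice_words(image, word_boxes):
    """Crop one slice per word-level ground-truth box.

    image: H x W (x C) numpy array.
    word_boxes: list of (x1, y1, x2, y2) axis-aligned boxes, a simplifying
    assumption of this sketch.
    """
    slices = []
    for x1, y1, x2, y2 in word_boxes:
        slices.append(image[y1:y2, x1:x2])
    return slices

# one 100x200 sample with two word-level boxes
sample = np.zeros((100, 200), dtype=np.uint8)
crops = slice_words(sample, [(10, 20, 60, 40), (70, 20, 150, 40)])
```

Each crop then feeds the original model independently, so per-word supervision stays decoupled from the rest of the image.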
And S22, inferring each slice obtained from the open-source training sample set by using an original model, to obtain a character-level region probability heat map.
In specific implementation, the original model performs text detection on each slice to obtain a character-level region probability heat map, from which the character-level pseudo label is then obtained. The region probability heat map represents the probability that each region is a certain character. A pseudo label is a label that is not manually annotated but is inferred by a model; it may not be completely correct and has a certain error rate. A character-level pseudo label is a pseudo label at the character level.
And S23, predicting the word length in each slice according to its character-level region probability heat map.
Specifically, according to the character-level region probability heat map of each slice (also called the ground-truth slice region), image segmentation is performed using an image segmentation algorithm to obtain character bounding boxes, and the predicted word length in each slice is then obtained from those character bounding boxes. The image segmentation algorithm may include a watershed segmentation algorithm or the like.
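A dependency-light sketch of this length prediction, in which thresholding plus connected-component counting stands in for the watershed segmentation mentioned above (an assumption made to keep the example self-contained; the threshold value is also illustrative):

```python
import numpy as np

def predict_word_length(heatmap, threshold=0.5):
    """Estimate the number of characters in a slice from its character-level
    region probability heat map: threshold the map, then count connected
    blobs, each blob standing for one character region."""
    mask = heatmap >= threshold
    visited = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not visited[i, j]:
                count += 1
                stack = [(i, j)]          # flood-fill one character blob
                visited[i, j] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
    return count
```

A watershed algorithm would additionally split touching characters that this simple blob count merges into one.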
And step S24, calculating the prediction result plausibility index of each slice according to the predicted word length and the true word length in each slice.
Because the word-level labeled image samples in the open-source training sample set are already labeled, the word-level labels carry the true word lengths, which are therefore known. The prediction result plausibility index is calculated from the predicted word length and the true word length of each slice. The index characterizes the consistency between the two: the higher the consistency between the predicted and true word lengths, the larger the prediction result plausibility index, and the higher the accuracy of the prediction result.
In some non-limiting embodiments, for each slice, the ratio of the predicted word length to the true word length may be used as the prediction result plausibility index of that slice.
In other non-limiting embodiments, considering that the predicted word length may be either greater or smaller than the true word length, the prediction result plausibility index of each slice may be determined using the following formula (1) to improve its reliability.
W_i = (l_i - min(l_i, |l_i - l̂_i|)) / l_i        (1)

wherein W_i is the prediction result plausibility index of the ith slice; l_i is the true word length in the ith slice; l̂_i is the predicted word length in the ith slice; |l_i - l̂_i| is the absolute value of the difference between the true and predicted word lengths in the ith slice; and min(l_i, |l_i - l̂_i|) takes the minimum of l_i and |l_i - l̂_i|.
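One consistent reading of formula (1) from these definitions, namely W_i = (l_i - min(l_i, |l_i - l̂_i|)) / l_i (an assumption on this sketch's part), can be transcribed directly:

```python
def plausibility_index(true_len, pred_len):
    """Prediction result plausibility index of formula (1), under the assumed
    reading W_i = (l_i - min(l_i, |l_i - l_hat_i|)) / l_i.
    Equals 1 when the predicted word length matches the true length, and
    reaches 0 once the length error is at least the true length."""
    diff = abs(true_len - pred_len)
    return (true_len - min(true_len, diff)) / true_len

plausibility_index(5, 5)   # exact match -> 1.0
plausibility_index(5, 3)   # off by two characters -> 0.6
```

The min(...) term caps the penalty, so the index stays in [0, 1] no matter how far the predicted length overshoots.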
And step S25, for each image sample, determining the learning weight of the image sample according to the prediction result plausibility indexes of all the slices in it.
In particular implementations, the learning weights for individual image samples may be determined in a variety of ways.
In some non-limiting embodiments, when the image sample comprises a single slice, the prediction result plausibility index of that slice is taken as the learning weight of the image sample.
In other non-limiting embodiments, when the image sample includes a plurality of slices, a learning weight is obtained for each slice from its prediction result plausibility index; that is, the learning weights of the regions of the image sample corresponding to different slices may differ, so an image sample comprising a plurality of slices may have a plurality of learning weights.
In a specific implementation, the prediction result plausibility index is positively correlated with the learning weight. For each slice, the higher the consistency between the predicted and true word lengths, the larger the plausibility index and, correspondingly, the larger the learning weight; conversely, the lower the consistency, the smaller the index and the smaller the learning weight.
And S26, training the original model by combining the learning weight of each image sample with each image sample and its character-level region probability heat map, to obtain the initial iterative text detection model.
The manual word-level labels serve as a constraint: character pseudo labels are generated for the model to learn from, and the learning weight is determined from the true and predicted word lengths of each slice. This prevents pseudo-label generation errors from causing model oscillation, difficult convergence, false recalls or missed recalls, and prevents highly confident (but wrong) outlier sample predictions from damaging the training of the whole initial iterative text detection model.
In some non-limiting embodiments, the original model may be trained as follows: generating the word text information by randomly arranging and combining the single characters; attaching word text information to a background picture without text information to generate a synthesized character-level annotation text image sample; and training by adopting the synthesized character-level labeling text image sample to obtain the original model.
The word text information may be generated by randomly arranging and combining single characters; that is, word text information is generated from one or more characters in units of single characters. In the generation process, factors such as word fonts, word lengths, character combinations, and the frequency of character collocations can be considered, so that word text information meeting the requirements is obtained.
For example, word text information is generated by randomly arranging and combining single characters, attached to a background picture without text information, and a synthesized character-level-labeled text image sample is generated automatically, e.g., by a synthesis script. Because the character position information is known when generation uses a synthesis script, no manual labeling is needed, which greatly reduces the labeling amount and labeling time.
Training the original model with the synthesized character-level-labeled text image samples endows it with the ability to predict character-level information; at the same time, the massive number of synthesized samples ensures its generalized prediction capability for character information, laying a foundation for the generation of subsequent character-level pseudo labels.
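The random word generation above can be sketched as follows; the character set, length range, and uniform sampling are illustrative assumptions (the text also mentions weighting by font, word length, and collocation frequency, which this sketch omits):

```python
import random
import string

def make_word_text(min_len=2, max_len=8,
                   charset=string.ascii_letters + string.digits):
    """Generate word text information by randomly combining single
    characters. All parameters here are hypothetical defaults."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(charset) for _ in range(length))

random.seed(0)  # reproducible for the example
word = make_word_text()
```

Each generated word would then be rendered in some font and pasted onto a text-free background, with the per-character positions recorded as free labels.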
And step S12, aiming at each sample, calculating a predicted result consistency index of each sample according to the prediction frame set of the enhanced sample set of each sample.
In specific implementation, for the target object in each sample, the ratio of the intersection of the prediction boxes of the target object to their union is calculated, and the prediction result consistency index of the target object is obtained from this ratio. This ratio may also be referred to as the Intersection over Union (IoU) of the prediction boxes of the target object. The target object is in units of words; that is, a target object is a word.
Specifically, for each sample and for each (same) target object in it, the ratio of the intersection of the target object's prediction boxes to their union is calculated, and the prediction result consistency index of the target object is obtained from this ratio. The prediction result consistency index of the sample is then obtained from the consistency indexes of its target objects. That is, each sample corresponds to one enhanced sample set and to one prediction result consistency index.
For example, suppose the enhanced sample set of a sample contains n pictures (the original sample and its enhanced samples), where n is a positive integer greater than 1. The prediction result consistency index of a target object may then be calculated over the enhanced sample set using the following formula (2).
S_j = (box_1 ∩ box_2 ∩ … ∩ box_n) / (box_1 ∪ box_2 ∪ … ∪ box_n)        (2)

wherein S_j is the prediction result consistency index of target object j; box_1 is the prediction box of target object j in the first picture; box_2 is its prediction box in the second picture; and box_n is its prediction box in the nth picture. box_1 ∩ box_2 ∩ … ∩ box_n denotes the intersection of prediction boxes box_1 through box_n, and box_1 ∪ box_2 ∪ … ∪ box_n denotes their union.
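A sketch of this n-box IoU for axis-aligned boxes, rasterized onto a fixed pixel canvas so the n-box union stays simple (both the box format and the canvas size are illustrative assumptions of this example):

```python
import numpy as np

def multi_box_iou(boxes, canvas=(200, 200)):
    """Formula (2): area of the intersection of all n prediction boxes over
    the area of their union. Boxes are (x1, y1, x2, y2) in integer pixel
    coordinates inside the canvas."""
    h, w = canvas
    masks = np.zeros((len(boxes), h, w), dtype=bool)
    for k, (x1, y1, x2, y2) in enumerate(boxes):
        masks[k, y1:y2, x1:x2] = True
    inter = np.logical_and.reduce(masks, axis=0).sum()
    union = np.logical_or.reduce(masks, axis=0).sum()
    return inter / union if union else 0.0

multi_box_iou([(0, 0, 10, 10), (0, 0, 10, 10)])  # identical boxes -> 1.0
```

Identical predictions across all augmented views give S_j = 1; the index drops toward 0 as the boxes drift apart under augmentation.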
When the sample comprises a single target object, the prediction result consistency index of that target object is taken as the prediction result consistency index of the sample.
When the sample includes a plurality of target objects, the prediction result consistency index of the sample may be obtained in the following manner. Specifically, when the number of target objects in the sample is plural, the weight of each target object is determined. And calculating the consistency index of the prediction result of each target object, weighting the consistency index of the prediction result of each target object according to the weight of each target object and the consistency index of the prediction result, and taking the weighted result as the consistency index of the prediction result of the sample.
For example, the sample includes m target objects with weights w_1, w_2, …, w_m; the prediction result consistency index of the sample may be calculated using the following formula (3).
S = w_1·S_1 + w_2·S_2 + … + w_m·S_m        (3)

wherein S is the prediction result consistency index of the sample; S_1 to S_m are the prediction result consistency indexes of the m target objects; and m is the total number of target objects, a positive integer greater than or equal to 1.
Research shows that the smaller a target object, the more easily it is affected by data enhancement such as added noise and blurring, so the prediction result consistency index of a smaller target object in the enhanced sample set is less accurate. To improve the accuracy of the sample's consistency index, in the embodiment of the present invention the weight may be determined according to the size of each target object's prediction box in the original sample, with the weight positively correlated with that size: the larger the target object's prediction box in the original sample, the larger the weight; conversely, the smaller the box, the smaller the weight. The weight characterizes how much the target object's consistency index contributes to the sample's consistency index.
Determining the weight of a target object from the size of its prediction box in the original sample reduces the influence of small-object consistency indexes on the sample's consistency index, improving its accuracy. Because the original sample has not undergone data enhancement, basing the size on the original sample's prediction box avoids the influence of data enhancement on the target object, improves the accuracy of the size estimate, and makes the weight assigned to each target object more reasonable, avoiding the unfairness caused by many small target objects and further improving the accuracy of the sample's prediction result consistency index.
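Combining formula (3) with the size-based weighting described here gives a short sketch; normalizing the box areas into weights that sum to 1 is an assumption the text leaves implicit:

```python
def sample_consistency_index(object_indexes, box_sizes):
    """Formula (3): weighted sum of per-target-object consistency indexes.
    Weights are taken proportional to each object's prediction-box area in
    the original sample, normalized to sum to 1 (an assumed convention)."""
    total = float(sum(box_sizes))
    return sum((size / total) * s
               for s, size in zip(object_indexes, box_sizes))

# a large stable word dominates a small unstable one
sample_consistency_index([1.0, 0.5], [300, 100])  # -> 0.875
```

With equal sizes this reduces to a plain average, so the size weighting only matters when objects differ noticeably in scale.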
And S13, selecting a target sample from the service candidate data set according to the predicted result consistency index of each sample.
In a specific implementation, the target sample may be selected from the service candidate data set according to the predicted result consistency index of each sample.
In some non-limiting embodiments, a sample with a predicted result consistency index below a set threshold may be selected as the target sample.
In other non-limiting embodiments, the prediction result consistency indexes of the samples may be sorted in ascending order, and the first P samples (those with the lowest indexes) selected as the target samples, where P is a positive integer. It is understood that other ways of selecting the target sample from the service candidate data set may also be adopted.
Specifically, the smaller the value of a sample's prediction result consistency index, the more inconsistent the inference results of the text detection model obtained by the (k-1)th iteration are across the sample's enhanced sample set; that is, the fewer of the sample's intrinsic features the model has learned, and the poorer its anti-interference capability. Selecting target samples according to the consistency indexes therefore screens out, from the service candidate data set, the samples with the smallest indexes; these target samples are the ones most worth learning for the text detection model obtained by the (k-1)th iteration, and learning from them yields the greatest benefit.
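Both selection strategies can be sketched in a few lines (the dict-based interface and parameter names are illustrative choices of this example):

```python
def select_targets(consistency, threshold=None, top_p=None):
    """Select target samples with a low prediction result consistency index,
    using either strategy described above: all samples below a set
    threshold, or the P samples with the lowest index.

    consistency: dict mapping sample id -> consistency index.
    """
    if threshold is not None:
        return [sid for sid, s in consistency.items() if s < threshold]
    ranked = sorted(consistency, key=consistency.get)  # ascending index
    return ranked[:top_p]
```

The threshold variant adapts its batch size to the data, while the top-P variant gives a fixed manual-labeling budget per iteration.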
And S14, carrying out artificial word level labeling on the target sample, and adding the target sample subjected to the artificial word level labeling into the selected sample set.
In specific implementation, the target samples are subjected to manual word-level labeling, and the target samples subjected to the manual word-level labeling are added into a selected sample set, wherein the selected sample set comprises a plurality of target samples subjected to word-level labeling.
In some non-limiting embodiments, the target sample may be labeled with an artificial word-level text box by a professional labeling person, so as to obtain the target sample after artificial word-level labeling.
And S15, inferring each word-level labeled target sample with the text detection model obtained by the (k-1)th iteration to obtain the character-level pseudo label of each target sample, and training the text detection model obtained by the (k-1)th iteration based on each word-level labeled target sample and its character-level pseudo label, to obtain the text detection model of the kth iteration.
In particular implementation, referring to fig. 3, a flowchart of one embodiment of step S15 is given. The step S15 may specifically include the following steps S151 to S156:
step S151, performing word-level slicing on each target sample in the selected sample set according to word-level labels to obtain one or more slices;
s152, deducing the slices of each target sample by using the text detection model obtained by the k-1 iteration to obtain a character-level region probability heat map;
step S153, cutting each slice by adopting an image segmentation algorithm according to the character level region probability heat map of the slice of each target sample, and predicting to obtain a character prediction frame of each character;
step S154, mapping a two-dimensional Gaussian heat map onto the character prediction box of each character;
step S155, aiming at each target sample, obtaining a character-level pseudo label of each target sample according to the two-dimensional Gaussian heatmap and each target sample;
and step S156, training the text detection model obtained by the k-1 th iteration according to the character-level pseudo label of each target sample and each target sample to obtain the text detection model of the k-th iteration.
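Steps S153 to S155 can be sketched as follows for axis-aligned character boxes; the template size, sigma, and nearest-neighbor resize are illustrative choices, and the patent's pipeline may instead warp the template with a perspective transform for quadrilateral boxes:

```python
import numpy as np

def gaussian_heatmap(size=32, sigma_ratio=0.25):
    """Isotropic 2D Gaussian template of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = size * sigma_ratio
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))

def render_pseudo_label(image_shape, char_boxes, template=None):
    """Map a 2D Gaussian onto each character prediction box (x1, y1, x2, y2)
    to form the character-level pseudo label for a target sample."""
    if template is None:
        template = gaussian_heatmap()
    label = np.zeros(image_shape, dtype=np.float32)
    for x1, y1, x2, y2 in char_boxes:
        h, w = y2 - y1, x2 - x1
        # nearest-neighbor resize of the template into the box
        ys = np.arange(h) * template.shape[0] // h
        xs = np.arange(w) * template.shape[1] // w
        patch = template[np.ix_(ys, xs)]
        label[y1:y2, x1:x2] = np.maximum(label[y1:y2, x1:x2], patch)
    return label
```

The resulting soft map, rather than hard boxes, is what the (k-1)th-iteration model is trained to regress in step S156.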
Further, in order to improve the performance of the trained kth-iteration text detection model and prevent highly confident (but wrong) outlier target samples from damaging the model, in the embodiment of the present invention, referring to fig. 4, a flowchart of a specific implementation manner of step S156 is given. Step S156 may include the following steps S1561 to S1564.
Step S1561, for each slice, estimating the predicted word length of the word in the slice based on the character prediction box of each character of the slice.
Step S1562, for each slice, calculating the prediction result plausibility index of the slice from its predicted word length and true word length.
In a specific implementation, the formula (1) may be referred to for a calculation manner of the predicted result plausibility index, which is not described herein again.
Step S1563, for each target sample, determining the learning weight of the target sample according to the prediction result plausibility indexes of all the slices in it;
and step S1564, training the text detection model obtained by the k-1 iteration by combining the learning weight of each target sample, the character-level pseudo label of each target sample and the word-level label of each target sample.
In this way, the target samples are manually labeled at the word level and added to the selected sample set, which is then used to train the text detection model obtained by the (k-1)th iteration to obtain the kth-iteration model. The kth-iteration model can thus learn the features of the target samples, enriching the features it has learned and improving its performance.
And S16, evaluating the text detection model obtained by the k iteration by using the test set, and if the evaluation is passed, obtaining the character-level text detection model.
In specific implementation, after the text detection model of the kth iteration is obtained, the text detection model obtained by the kth iteration can be evaluated by using a test set, and if the evaluation passes, the text detection model obtained by the kth iteration is used as a character-level text detection model.
In specific implementation, if the evaluation fails, the iterative training is continued based on the text detection model of the kth iteration until the evaluation passes, so as to obtain the character-level text detection model. For a specific process of the iterative training performed subsequently, reference may be made to the description of the kth iterative training in the above embodiment, and details are not described here again.
In a specific implementation, after a target sample is selected from the service candidate data set, it is deleted from the set to update the service candidate data set, and the updated set is used as the service candidate data set of the (k+1)th iteration.
As can be seen from the above, in the kth iteration process, the text detection model obtained by the (k-1)th iteration is used to infer the enhanced sample set of each sample in the service candidate data set, obtaining the prediction box set corresponding to each enhanced sample set, where the enhanced sample set of each sample includes the original sample and its enhanced samples, and the prediction box set includes the prediction boxes corresponding to the original sample and to its enhanced samples respectively. The prediction result consistency index of each sample is calculated from the prediction box set of its enhanced sample set; target samples are selected from the service candidate data set according to these indexes; the selected target samples are manually labeled at the word level and added to the selected sample set. The text detection model obtained by the (k-1)th iteration then infers each word-level labeled target sample to obtain its character-level pseudo label, and is trained on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the kth iteration.
According to the above scheme, the prediction box sets obtained by inferring the enhanced sample sets with the (k-1)th-iteration text detection model are used to calculate each sample's prediction result consistency index, and target samples are then selected by this index for manual word-level labeling. In each iteration, the model obtained by the previous iteration thus selects the target samples from which it gains the most; these are manually labeled at the word level and fed back for the current round of training, realizing active learning. Over multiple iterations, the target samples that most help model training are selected automatically and used for iterative training, so the features learned by the text detection model are continually and efficiently enriched, further improving the accuracy of the detection results until training is completed. Because target samples are selected by combining iterative active learning with character-level pseudo labels, they only need manual word-level labeling rather than character-level labeling, which lowers the labeling difficulty and hence the labeling workload; on the other hand, the total number of target samples that finally require manual labeling is usually lower than the size of the candidate sample set, so the amount of sample labeling in training the character-level text detection model is reduced, along with the labeling time and cost.
In order to facilitate better understanding and implementation of the embodiments of the present invention for those skilled in the art, a specific flow of the training method for the character-level text detection model is described below with reference to a specific embodiment. Referring to fig. 5, a training process of another character-level text detection model in the embodiment of the present invention is shown, and is described in detail below.
Training of the character-level text detection model may be divided into an initialization phase and an iteration phase.
Regarding the initialization phase, its main purpose is to obtain the initial iterative text detection model M_0 and give the subsequent iterations a good starting point, preventing poor model performance from causing poor pseudo-label generation on the service data or unsuitable target sample selection. This phase can be divided into 2 steps.
Step one, training an original model.
Step two, training based on the original model to obtain an initial model Q2, namely the initial iterative text detection model M_0.
Specifically, word text information is generated by randomly arranging and combining single characters and attached to a background picture without text information, and a synthesized character-level-labeled text image sample is generated automatically, e.g., by a synthesis script. Because the character position information is known when generation uses a synthesis script, no manual labeling is needed, which greatly reduces the labeling amount and labeling time. Training the original model with the synthesized character-level-labeled text image samples endows it with the ability to predict character-level information; at the same time, the massive number of synthesized samples ensures its generalized prediction capability for character information, laying a foundation for the generation of subsequent character-level pseudo labels.
For step two, reference may be specifically made to the related descriptions in step S21 to step S26, and details are not repeated here.
Regarding the iteration stage, for convenience of description, taking the kth iteration as an example, the following steps S401 to S409 may be specifically included.
Step S401, performing data enhancement on each sample in the candidate sample set C; the original samples together with their enhanced samples form the PATCH set C.
In step S402, each sample in the PATCH set C is inferred to obtain a prediction frame set.
For a specific implementation scheme of step S402, reference may be made to relevant descriptions in step S11 in the foregoing embodiments, and details are not described here again.
In step S403, for the PATCH set C, the prediction result consistency index is calculated using the IoU in units of patches.
For a specific implementation scheme of step S403, reference may be made to relevant descriptions in step S12 in the foregoing embodiment, and details are not described here again.
And S404, performing reverse order arrangement on the predicted result consistency indexes, and selecting a target sample.
For a specific implementation of step S404, reference may be made to the relevant description in step S13 in the foregoing embodiment, and details are not described here again.
And S405, performing artificial word-level labeling on the target sample, adding the target sample subjected to the artificial word-level labeling into the selected sample set S, and removing the target sample from the candidate sample set C.
Step S406, training the model M_(k-1) with the selected sample set S to obtain the model M_k.
For a specific implementation of step S406, reference may be made to the relevant description in step S15 in the foregoing embodiment, and details are not described here again.
Wherein, when k is 1, the model M_(k-1) is the initial iterative text detection model M_0.
Step S407, evaluating the model M_k with the test set A.
Step S408, judging whether the service index is reached.
If the judgment result is yes, the flow ends; if no, step S409 is executed and the next round of iterative training continues.
In step S409, k = k +1 is assigned.
When the (k+1)th iteration starts, the evaluation result can be analyzed and the data enhancement scheme V updated, and execution continues from step S401 until the iterations end.
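The iteration stage (steps S401 to S409) can be condensed into a loop skeleton; every callable name below is a hypothetical stand-in for the corresponding patent step, injected so the control flow itself stays visible:

```python
def active_learning_loop(model, candidate_set, selected_set, steps,
                         max_iters=5):
    """Skeleton of steps S401-S409. `steps` maps step names to callables:
    augment, infer, consistency, select, label, train, evaluate (all
    assumed interfaces, not actual APIs from the patent)."""
    for k in range(1, max_iters + 1):
        enhanced = steps["augment"](candidate_set)        # S401
        boxes = steps["infer"](model, enhanced)           # S402
        scores = steps["consistency"](boxes)              # S403
        targets = steps["select"](scores)                 # S404
        selected_set.extend(steps["label"](targets))      # S405
        candidate_set = [s for s in candidate_set if s not in targets]
        model = steps["train"](model, selected_set)       # S406
        if steps["evaluate"](model):                      # S407/S408
            break                                         # else S409: k += 1
    return model
```

The loop terminates either when the evaluation reaches the service index or when the labeling budget (max_iters) runs out.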
An embodiment of the present invention further provides a text detection method, and referring to fig. 6, a flowchart of the text detection method in the embodiment of the present invention is provided, which specifically includes the following steps:
s61, acquiring a text image to be detected;
step S62, detecting the text image to be detected by using the character-level text detection model to obtain a character detection result, wherein the character detection result comprises a character probability heat map, which is used to characterize the probability that each region is a character region.
In a specific implementation, the character-level text detection model may be obtained by training using the training method of the character-level text detection model provided in any of the above embodiments. For a specific implementation of the training method for the character-level text detection model, the description in the training method for the character-level text detection model provided in fig. 1 to 5 and with reference to any of the above embodiments may be combined, and details are not repeated here.
And S63, connecting character regions according to the character detection result and the inter-character connected-domain prediction result, to obtain a word-level detection result.
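A minimal sketch of this grouping step, in which a simple horizontal-gap heuristic stands in for the inter-character connected-domain prediction (an illustrative simplification; the gap threshold relative to character height is also an assumption):

```python
def group_characters(char_boxes, gap_ratio=0.5):
    """Merge character detections into word-level boxes by horizontal
    proximity. char_boxes: (x1, y1, x2, y2) tuples in pixel coordinates."""
    if not char_boxes:
        return []
    boxes = sorted(char_boxes)            # left to right by x1
    words = [list(boxes[0])]
    for x1, y1, x2, y2 in boxes[1:]:
        last = words[-1]
        height = last[3] - last[1]
        if x1 - last[2] <= gap_ratio * height:   # small gap: same word
            last[1] = min(last[1], y1)
            last[2] = max(last[2], x2)
            last[3] = max(last[3], y2)
        else:                                    # large gap: new word
            words.append([x1, y1, x2, y2])
    return [tuple(w) for w in words]
```

A learned affinity map, as the text describes, would handle curved or rotated text that this left-to-right heuristic cannot.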
An embodiment of the present invention further provides a training apparatus for a character-level text detection model. Referring to fig. 7, a schematic structural diagram of the training apparatus for the character-level text detection model in an embodiment of the present invention is shown.
The training apparatus 70 for the character-level text detection model may include:
an iteration unit 71, configured to, in the k-th iteration, infer the enhanced sample set of each sample in a service candidate data set by using the text detection model obtained in the (k-1)-th iteration, to obtain a prediction frame set corresponding to the enhanced sample set of each sample, where the enhanced sample set includes an original sample and enhanced samples of the original sample, the prediction frame set includes the prediction frames respectively corresponding to the original sample and its enhanced samples, and k is an integer greater than 1;
a calculating unit 72, configured to calculate, for each sample, a prediction result consistency index of the sample according to the prediction frame set of its enhanced sample set;
a selecting unit 73, configured to select target samples from the service candidate data set according to the prediction result consistency index of each sample;
a training unit 74, configured to perform manual word-level labeling on the target samples and add the manually word-level labeled target samples to a selected sample set, where the selected sample set includes a plurality of word-level labeled target samples; infer each word-level labeled target sample by using the text detection model obtained in the (k-1)-th iteration to obtain a character-level pseudo label of each target sample; and train the text detection model obtained in the (k-1)-th iteration based on each word-level labeled target sample and its character-level pseudo label, to obtain the text detection model of the k-th iteration;
and an evaluation unit 75, configured to evaluate the text detection model obtained in the k-th iteration by using a test set, where if the evaluation passes, the character-level text detection model is obtained.
In a specific implementation, for the working principle and the working flow of the training apparatus 70 for the character-level text detection model, reference may be made to the description of the training method for the character-level text detection model in the foregoing embodiments, which is not repeated here.
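As an illustration of how the selecting unit 73 might rank candidates, the sketch below picks the samples whose predictions vary most across augmentations, i.e. those with the highest union/intersection ratio. The patent states only that selection is performed "according to" the consistency index; choosing the least consistent samples is one plausible active-learning rule, not the claimed method itself.

```python
import numpy as np

def select_targets(scores, n):
    # scores: one prediction result consistency index per candidate sample,
    # where a larger union/intersection ratio means less consistent
    # predictions across the enhanced sample set.
    # Returns the indices of the n least consistent samples, which are the
    # most informative ones to send for manual word-level labeling.
    order = np.argsort(np.asarray(scores))[::-1]
    return order[:n].tolist()
```

For example, `select_targets([1.0, 3.2, 1.1, 2.0], 2)` returns `[1, 3]`: the two samples whose predictions disagree most under augmentation.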
An embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile or non-transitory storage medium having a computer program stored thereon, where the computer program, when executed by a processor, performs the steps of the training method for the character-level text detection model provided in any of the foregoing embodiments.
An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores a computer program capable of running on the processor, and the processor, when running the computer program, performs the steps of the training method for the character-level text detection model provided in any of the foregoing embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in any computer-readable storage medium, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Although the present invention is disclosed above, it is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, which is therefore defined by the appended claims.

Claims (16)

1. A training method for a character-level text detection model, characterized by comprising:
in the k-th iteration, inferring the enhanced sample set of each sample in a service candidate data set by using the text detection model obtained in the (k-1)-th iteration, to obtain a prediction frame set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set comprises an original sample and enhanced samples of the original sample, the prediction frame set comprises the prediction frames respectively corresponding to the original sample and its enhanced samples, and k is an integer greater than 1;
for each sample, calculating a prediction result consistency index of the sample according to the prediction frame set of its enhanced sample set, wherein the prediction result consistency index characterizes the degree of consistency of the prediction results;
selecting target samples from the service candidate data set according to the prediction result consistency index of each sample;
performing manual word-level labeling on the target samples, and adding the manually word-level labeled target samples to a selected sample set, wherein the selected sample set comprises a plurality of word-level labeled target samples;
inferring each word-level labeled target sample by using the text detection model obtained in the (k-1)-th iteration to obtain a character-level pseudo label of each target sample, and training the text detection model obtained in the (k-1)-th iteration based on each word-level labeled target sample and its character-level pseudo label, to obtain the text detection model of the k-th iteration;
and evaluating the text detection model obtained in the k-th iteration by using a test set, wherein if the evaluation passes, the character-level text detection model is obtained.
2. The training method for a character-level text detection model according to claim 1, further comprising:
if the evaluation does not pass, continuing the iterative training based on the text detection model of the k-th iteration until the evaluation passes, to obtain the character-level text detection model.
3. The training method for a character-level text detection model according to claim 1, wherein the calculating, for each sample, a prediction result consistency index of the sample according to the prediction frame set of its enhanced sample set comprises:
for each target object in each sample, calculating the ratio of the union of the prediction frames of the target object to the intersection of those prediction frames, and obtaining the prediction result consistency index of the target object according to the ratio;
and for each sample, obtaining the prediction result consistency index of the sample according to the prediction result consistency indexes of the target objects in the sample.
4. The training method for a character-level text detection model according to claim 3, wherein the obtaining the prediction result consistency index of the sample according to the prediction result consistency indexes of the target objects in the sample comprises:
when the sample includes a plurality of target objects, determining a weight for each target object;
and weighting the prediction result consistency indexes of the plurality of target objects according to the weight and the prediction result consistency index of each target object, and taking the weighted result as the prediction result consistency index of the sample.
5. The training method for a character-level text detection model according to claim 4, wherein the determining a weight for each target object comprises:
determining the weight according to the size of each target object in the prediction frame of the original sample, wherein the weight is positively correlated with the size of the target object in the prediction frame of the original sample.
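Claims 3 to 5 can be read together as a concrete computation. The sketch below rasterises the prediction frames of one target object to measure the union/intersection area ratio (1.0 means the frames agree exactly), then averages per-object ratios with size-proportional weights. The grid-based rasterisation and the normalised-area weighting are illustrative assumptions; the claims fix only the union-to-intersection ratio and the positive correlation between weight and object size.

```python
import numpy as np

def consistency_ratio(boxes, canvas=(256, 256)):
    # boxes: (x1, y1, x2, y2) prediction frames of one target object across
    # the original sample and its enhanced samples (mapped back to a common
    # coordinate frame). Returns union area / intersection area (>= 1.0).
    masks = np.zeros((len(boxes),) + canvas, dtype=bool)
    for m, (x1, y1, x2, y2) in zip(masks, boxes):
        m[int(y1):int(y2), int(x1):int(x2)] = True
    union = np.logical_or.reduce(masks).sum()
    inter = np.logical_and.reduce(masks).sum()
    return float(union) / inter if inter else float("inf")

def sample_consistency(object_ratios, object_sizes):
    # Weight each object's ratio by its size in the original sample's
    # prediction frame (weight positively correlated with size, claim 5),
    # and return the weighted average as the sample-level index.
    w = np.asarray(object_sizes, dtype=float)
    w /= w.sum()
    return float(np.dot(w, object_ratios))
```

Identical frames give a ratio of 1.0; two 10x10 frames overlapping by half give 150/50 = 3.0, signalling unstable predictions for that object.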
6. The training method for a character-level text detection model according to claim 1, wherein the inferring each word-level labeled target sample by using the text detection model obtained in the (k-1)-th iteration to obtain a character-level pseudo label of each target sample, and training the text detection model obtained in the (k-1)-th iteration based on each word-level labeled target sample and its character-level pseudo label to obtain the text detection model of the k-th iteration comprises:
performing word-level slicing on each target sample in the selected sample set according to its word-level labels to obtain one or more slices;
inferring the slices of each target sample by using the text detection model obtained in the (k-1)-th iteration to obtain character-level region probability heat maps;
segmenting each slice by using an image segmentation algorithm according to the character-level region probability heat map of the slice, and predicting a character prediction frame of each character;
mapping a two-dimensional Gaussian heat map onto the character prediction frame of each character;
for each target sample, obtaining the character-level pseudo label of the target sample according to the two-dimensional Gaussian heat maps and the target sample;
and training the text detection model obtained in the (k-1)-th iteration according to each target sample and its character-level pseudo label, to obtain the text detection model of the k-th iteration.
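The Gaussian-mapping step above can be sketched as follows: build an isotropic 2-D Gaussian template and paste it into each character prediction frame, accumulating the result into the character-level pseudo label. The template size, sigma, and the axis-aligned nearest-neighbour resize are assumptions for illustration; CRAFT-style pipelines typically perspective-warp the template onto quadrilateral frames instead.

```python
import numpy as np

def gaussian_heatmap(size=32, sigma_scale=0.35):
    # Isotropic 2-D Gaussian template peaking near 1.0 at the centre.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = sigma_scale * size
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))

def paste_gaussian(canvas, box, template):
    # Map the template onto one character prediction frame (x1, y1, x2, y2)
    # via a nearest-neighbour resize, taking the pixelwise maximum so
    # overlapping characters keep their peaks.
    x1, y1, x2, y2 = map(int, box)
    h, w = y2 - y1, x2 - x1
    ys = np.arange(h) * template.shape[0] // max(h, 1)
    xs = np.arange(w) * template.shape[1] // max(w, 1)
    patch = template[np.ix_(ys, xs)]
    canvas[y1:y2, x1:x2] = np.maximum(canvas[y1:y2, x1:x2], patch)
    return canvas
```

Calling `paste_gaussian` once per character prediction frame on a zero canvas yields the character-region pseudo label for the whole slice.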
7. The training method for a character-level text detection model according to claim 6, wherein the training the text detection model obtained in the (k-1)-th iteration according to each target sample and its character-level pseudo label comprises:
for each slice, predicting the word prediction length of the word in the slice according to the character prediction frames of the characters in the slice;
for each slice, calculating a prediction result truth index of the slice according to the word prediction length and the word true length of the slice;
for each target sample, determining the learning weight of the target sample according to the prediction result truth indexes of all the slices in the target sample;
and training the text detection model obtained in the (k-1)-th iteration by combining the learning weight of each target sample, the character-level pseudo label of each target sample, and the word-level labels of each target sample.
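One plausible form of the prediction result truth index compares the predicted character count of a slice with the true word length, similar in spirit to the confidence score used in CRAFT's weakly-supervised training. The exact formula and the averaging into a per-sample learning weight below are assumptions; the claims only require that the truth index and the learning weight be positively correlated.

```python
def truth_index(pred_len, true_len):
    # 1.0 when the predicted character count matches the true word length,
    # decaying linearly toward 0 as they diverge (assumed form).
    return 1.0 - abs(pred_len - true_len) / max(pred_len, true_len)

def learning_weight(slice_indices):
    # Aggregate per-slice truth indexes into one sample-level learning
    # weight (positively correlated with the indexes, per claims 7 and 10).
    return sum(slice_indices) / len(slice_indices)
```

A slice where 3 of 5 characters were found gets index 0.6, so samples with unreliable pseudo labels contribute less to the loss.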
8. The training method for a character-level text detection model according to claim 1, wherein after the target samples are selected from the service candidate data set, the target samples are deleted from the service candidate data set to update the service candidate data set.
9. The training method for a character-level text detection model according to claim 1, further comprising: when k is 1, in the 1st iteration, inferring the enhanced sample set of each sample in the service candidate data set by using an initial iteration text detection model.
10. The training method for a character-level text detection model according to claim 9, wherein the initial iteration text detection model is obtained by:
performing word-level slicing on each image sample in an open-source training sample set to obtain a plurality of slices, wherein the open-source training sample set comprises a plurality of word-level labeled image samples;
inferring each slice obtained from the open-source training sample set by using an original model to obtain character-level region probability heat maps;
for each slice, predicting the word prediction length in the slice according to the character-level region probability heat map of the slice;
calculating a prediction result truth index of each slice according to the word prediction length and the word true length in the slice;
for each image sample, determining the learning weight of the image sample according to the prediction result truth indexes of all the slices in the image sample, wherein the prediction result truth index is positively correlated with the learning weight;
and training the original model by combining the learning weight of each image sample, each image sample, and its character-level region probability heat map, to obtain the initial iteration text detection model.
11. The training method for a character-level text detection model according to claim 10, wherein the original model is trained by:
generating word text information by randomly arranging and combining single characters;
attaching the word text information to background pictures without text information to generate synthesized character-level labeled text image samples;
and training with the synthesized character-level labeled text image samples to obtain the original model.
12. The training method for a character-level text detection model according to claim 1, wherein an enhanced sample of the original sample is obtained by:
performing at least one of the following data enhancement operations on the original sample to obtain an enhanced sample of the original sample: motion blur, scaling, rotation, noise addition, flipping, brightness adjustment, and color adjustment.
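A minimal sketch of the data enhancement of claim 12, applying one randomly chosen operation per call in plain NumPy. It covers only flipping, rotation, noise addition, and brightness adjustment; motion blur, scaling, and color adjustment are omitted, and a real pipeline would typically use a library such as albumentations or torchvision.

```python
import numpy as np

def augment(img, rng):
    # img: float image with values in [0, 255]; rng: np.random.Generator.
    op = rng.choice(["flip", "rotate", "noise", "brightness"])
    if op == "flip":
        return img[:, ::-1]                  # horizontal flip
    if op == "rotate":
        return np.rot90(img)                 # 90-degree rotation
    if op == "noise":
        return np.clip(img + rng.normal(0, 5, img.shape), 0, 255)
    return np.clip(img * 1.2, 0, 255)        # brightness adjustment
```

Each original sample plus several such augmented copies forms the enhanced sample set that the iteration unit runs inference on.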
13. A text detection method, comprising:
acquiring a text image to be detected;
detecting the text image to be detected by using the character-level text detection model trained by the training method for a character-level text detection model according to any one of claims 1 to 12, to obtain a character detection result, wherein the character detection result comprises a character probability heat map used to characterize the probabilities of character regions;
and connecting character regions according to the character detection result and the inter-character connected-domain prediction result, to obtain a word-level detection result.
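The region-connecting step of claim 13 can be sketched as thresholding the character heat map and the inter-character connection heat map, merging the two masks, and labeling connected regions, so that each labeled region corresponds to one word. The thresholds and the flood-fill labeling below are illustrative assumptions; production code would normally use cv2.connectedComponents.

```python
import numpy as np

def connect_characters(char_heat, link_heat, char_thr=0.5, link_thr=0.4):
    # Binarise both heat maps, OR them together, and label 4-connected
    # regions with a simple stack-based flood fill.
    mask = (char_heat >= char_thr) | (link_heat >= link_thr)
    labels = np.zeros(mask.shape, dtype=int)
    n_words = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and labels[i, j] == 0:
                n_words += 1
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]
                            and mask[y, x] and labels[y, x] == 0):
                        labels[y, x] = n_words
                        stack += [(y + 1, x), (y - 1, x),
                                  (y, x + 1), (y, x - 1)]
    return labels, n_words
```

Two character blobs with no connection response between them yield two words; adding a bridge in the link heat map merges them into one word-level detection.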
14. A training apparatus for a character-level text detection model, comprising:
an iteration unit, configured to, in the k-th iteration, infer the enhanced sample set of each sample in a service candidate data set by using the text detection model obtained in the (k-1)-th iteration, to obtain a prediction frame set corresponding to the enhanced sample set of each sample, wherein the enhanced sample set comprises an original sample and enhanced samples of the original sample, the prediction frame set comprises the prediction frames respectively corresponding to the original sample and its enhanced samples, and k is an integer greater than 1;
a calculating unit, configured to calculate, for each sample, a prediction result consistency index of the sample according to the prediction frame set of its enhanced sample set;
a selecting unit, configured to select target samples from the service candidate data set according to the prediction result consistency index of each sample;
a training unit, configured to perform manual word-level labeling on the target samples and add the manually word-level labeled target samples to a selected sample set, wherein the selected sample set comprises a plurality of word-level labeled target samples; infer each word-level labeled target sample by using the text detection model obtained in the (k-1)-th iteration to obtain a character-level pseudo label of each target sample; and train the text detection model obtained in the (k-1)-th iteration based on each word-level labeled target sample and its character-level pseudo label, to obtain the text detection model of the k-th iteration;
and an evaluation unit, configured to evaluate the text detection model obtained in the k-th iteration by using a test set, wherein if the evaluation passes, the character-level text detection model is obtained.
15. A computer-readable storage medium, being a non-volatile or non-transitory storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the training method for a character-level text detection model according to any one of claims 1 to 12.
16. A terminal, comprising a memory and a processor, the memory storing a computer program capable of running on the processor, wherein the processor, when running the computer program, performs the steps of the training method for a character-level text detection model according to any one of claims 1 to 12.
CN202111159043.0A 2021-09-30 2021-09-30 Training method and device of character-level text detection model, medium and terminal Pending CN115937870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111159043.0A CN115937870A (en) 2021-09-30 2021-09-30 Training method and device of character-level text detection model, medium and terminal

Publications (1)

Publication Number Publication Date
CN115937870A true CN115937870A (en) 2023-04-07

Family

ID=86556327



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114141A (en) * 2023-10-20 2023-11-24 安徽蔚来智驾科技有限公司 Model training method, evaluation method, computer device and storage medium
CN117114141B (en) * 2023-10-20 2024-02-27 安徽蔚来智驾科技有限公司 Model training method, evaluation method, computer device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination