CN116935368A - Deep learning model training method, text line detection method, device and equipment

Deep learning model training method, text line detection method, device and equipment

Info

Publication number
CN116935368A
Authority
CN
China
Prior art keywords
sample
training
sample images
pseudo
deep learning
Prior art date
Legal status
Pending
Application number
CN202310706632.9A
Other languages
Chinese (zh)
Inventor
万星宇
吕鹏原
范森
章成全
姚锟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310706632.9A
Publication of CN116935368A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Abstract

The disclosure provides a deep learning model training method, a text line detection method, a device and equipment, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be used in optical character recognition scenarios. The specific implementation scheme is as follows: a plurality of first sample images included in an unlabeled sample set are respectively processed using a target detection model to obtain pseudo labels of the first sample images, where the target detection model is trained using a labeled sample set; and an initial model is trained using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.

Description

Deep learning model training method, text line detection method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of computer vision and deep learning, and may be applied in optical character recognition scenarios. More specifically, a deep learning model training method, a text line detection method, corresponding apparatuses, an electronic device and a storage medium are disclosed.
Background
Text line detection is an important task in the field of computer vision; it refers to the process of locating and identifying text lines in an image. In many application scenarios, such as license plate recognition, identity card recognition and bill recognition, the accuracy of the text line detection technique directly affects the performance of the whole system.
Disclosure of Invention
The disclosure provides a deep learning model training method, a text line detection method, corresponding apparatuses, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a deep learning model training method, including: respectively processing a plurality of first sample images included in an unlabeled sample set by using a target detection model to obtain pseudo labels of the first sample images, wherein the target detection model is trained using a labeled sample set; and training an initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.
According to another aspect of the present disclosure, there is provided a text line detection method, including: processing an image to be detected by using a deep learning model to obtain a text line detection result, wherein the deep learning model is trained using the deep learning model training method described above.
According to another aspect of the present disclosure, there is provided a deep learning model training apparatus, including: a first processing module configured to respectively process a plurality of first sample images included in an unlabeled sample set by using a target detection model to obtain pseudo labels of the first sample images, wherein the target detection model is trained using a labeled sample set; and a training module configured to train an initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.
According to another aspect of the present disclosure, there is provided a text line detection apparatus, including: a second processing module configured to process an image to be detected by using a deep learning model to obtain a text line detection result, wherein the deep learning model is trained using the deep learning model training method described above.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates an exemplary system architecture to which a deep learning model training method or text line detection method and apparatus may be applied, according to an embodiment of the present disclosure.
Fig. 2 schematically illustrates a flow chart of a deep learning model training method according to an embodiment of the present disclosure.
Fig. 3 schematically illustrates a schematic diagram of a deep learning model training method according to an embodiment of the present disclosure.
Fig. 4 schematically illustrates a schematic diagram of a deep learning model training method according to another embodiment of the present disclosure.
Fig. 5 schematically illustrates a flow chart of a text line detection method according to an embodiment of the present disclosure.
Fig. 6 schematically illustrates a block diagram of a deep learning model training apparatus according to an embodiment of the present disclosure.
Fig. 7 schematically illustrates a block diagram of a text line detection device according to an embodiment of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, including various details of the embodiments to facilitate understanding; these details should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Text line detection may be used to locate and identify text lines in an image. Text lines vary widely, covering changes in size, font, orientation, color, background and other factors, which poses great challenges to the text line detection task. Therefore, to improve the accuracy and robustness of the text line detection technique, as well as its generalization to different scenarios, a large amount of data is needed to train the text line detection model.
In the related art, text line detection methods usually require a large amount of annotated data to train a model, but acquiring annotated data costs considerable time and labor, and the coverage of the annotated data may not be wide enough, resulting in poor generalization performance of the model. In addition, owing to limits on the number of parameters, a small model often has to be trained separately for different scenarios, which brings many problems, such as the large amount of time and resources consumed in maintaining multiple models and unstable model precision. To enhance the generalization ability of a text line detection model across different scenarios, a large model is needed to improve its characterization capability. However, because a large model has a huge number of parameters, training it directly on large-scale data incurs high computational costs.
In view of this, the embodiments of the present disclosure provide a deep learning model training method, a text line detection method, apparatuses, an electronic device and a storage medium. By adopting a semi-supervised training method, a large amount of unlabeled data can be used to improve the generalization ability of the model and the accuracy and robustness of text line detection. The method can be applied to various text line detection scenarios, such as license plate recognition, identity card recognition and ticket recognition, thereby providing a better solution for text recognition tasks in practical application scenarios.
Specifically, the deep learning model training method includes: respectively processing a plurality of first sample images included in an unlabeled sample set by using a target detection model to obtain pseudo labels of the first sample images, wherein the target detection model is trained using a labeled sample set; and training an initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.
Fig. 1 schematically illustrates an exemplary system architecture to which a deep learning model training method or text line detection method and apparatus may be applied, according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that embodiments of the present disclosure cannot be used in other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the deep learning model training method or the text line detection method and apparatus may be applied may include a terminal device, and the terminal device may implement the deep learning model training method or the text line detection method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105.
The terminal devices 101, 102, 103 may be a variety of electronic devices, including but not limited to smartphones, tablets, laptop computers, desktop computers and the like. Optionally, the terminal devices 101, 102, 103 may be configured with a GPU for completing the training of the deep learning model. Optionally, the terminal devices 101, 102, 103 may be configured with an image capture apparatus for acquiring the first sample images.
The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The server 105 may be a server providing various services, or may be various cloud servers, and is not limited herein.
It should be noted that, the deep learning model training method or the text line detection method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the deep learning model training method or the text line detection apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103. Alternatively, the deep learning model training method or text line detection method provided by embodiments of the present disclosure may also be generally performed by the server 105. Accordingly, the deep learning model training method or text line detection device provided by the embodiments of the present disclosure may be generally provided in the server 105. The deep learning model training method or text line detection method provided by the embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the deep learning model training method or text line detection apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, a user may initiate a model training request via the terminal devices 101, 102, 103, and the request, together with the labeled sample set and the unlabeled sample set, may be transmitted to the server 105 via the network 104. The server 105 may respectively process the plurality of first sample images included in the unlabeled sample set by using the target detection model to obtain pseudo labels of the plurality of first sample images, and train the initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain the deep learning model.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of users' personal information all conform to the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 2 schematically illustrates a flow chart of a deep learning model training method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 includes operations S210-S220.
In operation S210, a plurality of first sample images included in an unlabeled sample set are respectively processed using a target detection model to obtain pseudo labels of the plurality of first sample images.
In operation S220, an initial model is trained using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.
According to embodiments of the present disclosure, the annotation sample set may include text line annotation data under various general scenarios, and each text line annotation data may be represented as an image. Various general scenes may include handwritten text, such as handwritten letters, handwritten numbers, handwritten symbols, etc., printed text, such as printed letters, printed numbers, printed symbols, etc., natural scene text, such as road signs, billboards, shop signs, license plates, etc., and forms and book text, such as reports, invoices, contracts, novels, textbooks, etc., without limitation.
According to embodiments of the present disclosure, the target detection model may be a model that has already been trained and can be used to complete target detection tasks. For example, the target detection model may be obtained by training a DINO-SwinL model using a manually annotated text line detection dataset, i.e., the labeled sample set. The DINO-SwinL model refers to a DINO (DETR with Improved Denoising Anchor Boxes) model with SwinL as its backbone network, a model based on the Transformer network structure. The target detection model may have a large number of network parameters, i.e., it may have more network layers, and each network layer may include more neurons.
According to embodiments of the present disclosure, the SwinL backbone of the DINO-SwinL model may be preloaded with network weights trained on a public dataset, and the DINO-SwinL model may have model parameters fine-tuned on other public datasets. When the labeled sample set is used to retrain the DINO-SwinL model, various data enhancement processes may be applied to the labeled sample set, including but not limited to random image flipping, random rotation, random scaling, random cropping, multi-scale resizing by the short side, image normalization and the like, which are not limited herein. The model training process may use the manual annotation of each training sample as the ground truth to calculate and optimize the loss between the model prediction and the ground truth. The loss functions employed may include, but are not limited to, the classification loss, the L1 loss and GIoU loss for the detection box, and the L1 loss and GIoU loss for the polygon. An AdamW optimizer may be used to accelerate model training.
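As an illustration of this preprocessing step, the following is a minimal sketch of such an augmentation pipeline, assuming torchvision-style transforms; the concrete parameter values (rotation range, candidate short sides, normalization statistics) are assumptions rather than values given by the disclosure, and for detection training the box annotations would have to be transformed in step with the image.

```python
import random
import torchvision.transforms.functional as F
from torchvision import transforms

# ImageNet statistics: a common but assumed choice for normalization
NORMALIZE = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def augment(img, short_sides=(480, 576, 688, 800)):
    """Apply one random pass of the listed augmentations to a PIL image.
    Only the image side is shown; labels must be transformed consistently."""
    if random.random() < 0.5:                        # random image flipping
        img = F.hflip(img)
    img = F.rotate(img, random.uniform(-10, 10))     # random rotation
    img = F.resize(img, random.choice(short_sides))  # multi-scale resize by the short side
    img = F.to_tensor(img)
    return NORMALIZE(img)                            # image normalization
```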
In accordance with embodiments of the present disclosure, the unlabeled sample set may include sample data derived from multiple scenarios, which are not limited herein. For example, the unlabeled sample set may include sample data from the various general scenarios covered by the labeled sample set. The unlabeled sample set may also include sample data from specific text recognition scenarios, such as vertical text scenarios, ancient book scenarios, minority language scenarios and the like. Optionally, the unlabeled sample set may further include sample data from bad-case scenarios. Sample data in a bad-case scenario may be, for example, samples in the labeled sample set whose evaluation parameter, determined by checking the labeled sample set with the target detection model, falls below a certain value. The proportion of sample data from each scenario is not limited herein. For example, sample data from the various general scenarios may be set to 70% of the total number of samples, sample data from specific text recognition scenarios to 20%, and sample data from bad-case scenarios to 10%.
According to an embodiment of the present disclosure, the unlabeled sample set includes sample data, i.e., first sample images. When the first sample images are processed using the target detection model, each first sample image may be input into the target detection model, and the output of the target detection model is the pseudo label of that first sample image.
According to embodiments of the present disclosure, the pseudo labels output by the target detection model may have a data structure similar to that of the manually annotated labels in the labeled sample set. Specifically, a pseudo label may include position information of a detection box, category information of the object within the detection box, confidence data of the detection box determined based on the category of the object within the detection box, and the like. The position information of the detection box may include the coordinate position of the detection box, the size of the detection box, the rotation angle of the detection box, and the like. The category information of the object within the detection box may be the category to which the object in the region framed by the detection box belongs; for example, it may indicate that the text line object within the detection box belongs to handwritten text, printed text or natural scene text. The confidence data of a detection box determined based on the category of the object within it may be expressed as the probability that the object is contained within the detection box. For example, if the category information indicates that a detection box contains object A, but the box actually contains no object, the confidence data determined for that detection box should be a value close to 0.
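As a concrete illustration, the pseudo label described above might be held in a structure like the following sketch; the field names and the box parameterization are assumptions made for illustration, not the disclosure's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PseudoLabel:
    """One detected text line (illustrative fields, assumed parameterization)."""
    box: List[float]    # position information: [x, y, width, height]
    angle: float        # rotation angle of the detection box, in degrees
    category: str       # e.g. "handwritten", "printed", "natural_scene"
    confidence: float   # probability that the box contains an object, in [0, 1]
```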
According to embodiments of the present disclosure, the initial model may also be a model for completing target detection tasks, similar to the target detection model. For example, the initial model may be a DINO model with ResNet50 as its backbone network.
According to embodiments of the present disclosure, the set of annotation samples may include a plurality of annotation sample images and respective labels for the plurality of annotation sample images. When the initial model is trained, the first sample image and the label sample image can be used as training samples, and the pseudo tag of the first sample image and the tag of the label sample image can be used as tags of the training samples. The label of the training sample can be used as a true value to calculate a loss value between a predicted result obtained after the training sample is input into the initial model and the true value, and model parameters of the initial model are optimized by using the loss value, so that a deep learning model is finally obtained.
According to the embodiments of the disclosure, training of the deep learning model can be completed in a semi-supervised manner: the trained target detection model annotates a plurality of first sample images to obtain their pseudo labels, and the first sample images with pseudo labels are then combined with the manually annotated labeled sample set to train the deep learning model. Since an existing model completes the annotation of a large amount of sample data, manual annotation costs can be effectively reduced. Meanwhile, accumulating more samples enlarges the coverage of application scenarios in the training data, so that the model can adapt to different application scenarios, and the generalization ability and robustness of the model can be effectively improved.
The method illustrated in fig. 2 is further described below with reference to fig. 3 and 4 in conjunction with the exemplary embodiment.
Fig. 3 schematically illustrates a schematic diagram of a deep learning model training method according to an embodiment of the present disclosure.
As shown in fig. 3, the deep learning model training method may include a training process of the target detection model, a generating process of the pseudo tag, and a training process of the deep learning model.
According to the embodiment of the present disclosure, in the training process of the target detection model, the training of the target detection model 302 may be completed by using the labeling sample set 301, and the specific training process is not described herein.
In accordance with an embodiment of the present disclosure, in the pseudo tag generation process, the target detection model 302 may be used to process the unlabeled exemplar set 303 to obtain the pseudo tags 304 of each of the plurality of first exemplar images in the unlabeled exemplar set 303.
According to the embodiment of the disclosure, the quality of the pseudo tag can directly influence the training effect and performance of the deep learning model, and in order to reduce the influence of misprediction, noise data and the like of the target detection model, a generalized screening processing strategy can be adopted to optimize the generation flow of the pseudo tag.
According to an embodiment of the present disclosure, processing a plurality of first sample images included in a label-free sample set, respectively, using an object detection model, to obtain pseudo labels of the plurality of first sample images, respectively, may include the operations of:
For each first sample image, data enhancement processing is performed on the first sample image to obtain a plurality of second sample images; the plurality of second sample images are respectively processed using the target detection model to obtain a plurality of first labels; and the pseudo label of the first sample image is determined based on the plurality of first labels.
According to embodiments of the present disclosure, generalization of samples may be achieved using data enhancement methods. The data enhancement process may include, but is not limited to, image random flipping, random rotation, random scaling, random cropping, multi-scale scaling by short side, image normalization, etc., without limitation.
According to an embodiment of the present disclosure, through the data enhancement processing, one first sample image may generate a plurality of different second sample images, and each second sample image, after being input into the target detection model, yields a first label as output.
According to the embodiments of the disclosure, the pseudo label of the first sample image may be screened from the plurality of first labels, and the screening manner is not limited herein. For example, the pseudo label may be selected based on the confidence of each first label. In particular, determining the pseudo label of the first sample image based on the plurality of first labels may include the following operation:
Non-maximum suppression processing is performed on the plurality of first labels based on the confidence values included in the plurality of first labels, thereby obtaining the pseudo label of the first sample image.
According to embodiments of the present disclosure, non-maximum suppression (NMS) searches for local maxima and suppresses non-maximum elements. Specifically, each first label may be represented as a detection box over a target region of the first sample image. After the plurality of first labels are mapped onto the same first sample image, their detection boxes over a target region may overlap. Non-maximum suppression selects, from the overlapping detection boxes, the detection box with the largest confidence value for retention and eliminates the redundant remaining boxes. The retained detection box with the maximum confidence value is the detection box corresponding to the pseudo label.
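For reference, a minimal NumPy sketch of this standard NMS procedure follows; it assumes axis-aligned [x1, y1, x2, y2] boxes and an illustrative IoU threshold, whereas the rotated text-line boxes described above would need a rotated-IoU variant.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Keep the highest-confidence box among overlapping candidates.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    Returns the indices of the retained boxes.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU between the current best box and all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # suppress boxes whose overlap with the kept box exceeds the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```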
According to the embodiment of the disclosure, the quality of the generated pseudo tag can be effectively improved through the screening processing based on non-maximum suppression, and further the influence of the error tag on the training effect of the deep learning model can be effectively reduced.
According to the embodiments of the disclosure, given that data enhancement may already be applied to the labeled sample set during the training of the target detection model, the target detection model generalizes well to changes in the position of detected objects, i.e., it can output similar detection results for the same object at different positions. Therefore, as an alternative, the data enhancement processing applied to the first sample image may be scale transformation, so that the target detection model can extract features of objects of different sizes. Specifically, performing data enhancement processing on the first sample image to obtain a plurality of second sample images may include the following operation:
Multiple scale transformations are performed on the first sample image to obtain a plurality of second sample images, wherein the scaling ratios used by the scale transformations differ from one another.
According to an embodiment of the present disclosure, for example, one scale transformation may enlarge the first sample image by 20% about the geometric center of the image, with the scaling ratio expressed as 120%. As another example, a scale transformation may shrink the first sample image by 30% about a corner of the image, with the scaling ratio expressed as 70%.
According to the embodiments of the disclosure, since scaling at different ratios is applied, the detection boxes represented by different first labels may differ in size for the same object. Therefore, for each second sample image, the first label may be restored based on the scaling ratio used when generating that second sample image, yielding a second label. The pseudo label of the first sample image is then determined based on the plurality of second labels.
According to the embodiment of the present disclosure, for example, if the second sample image was generated by enlarging the first sample image by 25%, the detection box represented by the first label may be shrunk to 80% of its size, inverting the enlargement, to obtain the second label.
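A minimal sketch of this inverse mapping, under the same axis-aligned-box assumption as the NMS sketch above; it assumes scaling anchored at the image origin, so center-anchored scaling would additionally need a translation, and the per-scale prediction pipeline in the comment is illustrative.

```python
import numpy as np

def restore_boxes(boxes: np.ndarray, scale: float) -> np.ndarray:
    """Map detection boxes predicted on a rescaled second sample image back
    into the coordinate frame of the original first sample image.

    boxes: (N, 4) array of [x1, y1, x2, y2] in the rescaled image;
    scale: ratio used to generate the second sample image, e.g. 1.25
           for a 125% enlargement (boxes then shrink to 80%).
    """
    return boxes / scale

# Collecting second labels over several assumed scales, then applying NMS:
# candidates = np.vstack([restore_boxes(detect(resize(img, s)), s)
#                         for s in (0.75, 1.0, 1.25)])
```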
According to the embodiment of the present disclosure, similar to the method of determining the pseudo tag based on the plurality of first tags, the determination of the pseudo tag based on the plurality of second tags may be obtained using a non-maximum suppression method, which is not described herein.
According to the embodiment of the disclosure, the robustness of the target detection model to different-size image input can be effectively enhanced through the multi-scale data enhancement strategy, so that the accuracy of the pseudo tag can be effectively improved.
According to an embodiment of the present disclosure, in a training process of the deep learning model, a label sample set 301 and a label-free sample set 303 may be fused to be used as a training sample set, and a label included in the label sample set 301 and a pseudo label 304 are fused to be used as a label of the training sample set, so as to train an initial model 305 to obtain a deep learning model 306.
According to an embodiment of the present disclosure, training the initial model using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain the deep learning model may include the following operations:
The plurality of first sample images and the labeled sample set are sampled to obtain a training sample set, and the initial model is trained using the training sample set to obtain the deep learning model.
According to embodiments of the present disclosure, each iteration of the deep learning model training phase may be trained with a different training set. For example, if the labeled sample set contains 200,000 samples and the unlabeled sample set contains 1,000,000 samples, then before each round of iterative training starts, 100,000 samples may be randomly selected from the labeled sample set and 900,000 samples from the unlabeled sample set and added to the training sample set, and that round of training is completed with this training sample set. The proportion of training samples drawn from the labeled sample set and the unlabeled sample set may be set according to the specific application scenario and is not limited herein.
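A minimal sketch of this per-round resampling, using the example sizes above; plain Python lists stand in for the actual dataset objects, and the 1:9 split is scenario-dependent.

```python
import random

def build_round_training_set(labeled, unlabeled,
                             n_labeled=100_000, n_unlabeled=900_000):
    """Draw a fresh mixed training set before each training round.

    The defaults mirror the worked example (100k of 200k labeled samples,
    900k of 1M pseudo-labeled samples).
    """
    round_set = random.sample(labeled, n_labeled) \
              + random.sample(unlabeled, n_unlabeled)
    random.shuffle(round_set)  # mix labeled and pseudo-labeled samples
    return round_set
```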
According to embodiments of the present disclosure, the initial model may include a regression branch network and a classification branch network. The regression branch network may be used to determine location information of the detection frame and confidence values of the detection frame, and the classification branch network may be used to determine a class of objects within the detection frame.
According to embodiments of the present disclosure, when the initial model is trained, different training samples may be used for the regression branch network and the classification branch network. Specifically, in regression analysis, the model is expected to locate the target object accurately, so the desired training samples are those with higher confidence, i.e., their labels should have higher confidence values. Therefore, training of the regression branch network may be completed using the labeled sample set together with the first sample images having higher confidence data in the unlabeled sample set. In classification analysis, the model is expected to accurately distinguish different types of objects and to distinguish objects from the background, so the desired training samples should include both positive and negative samples, i.e., their labels should include labels indicating the foreground and labels indicating the background. Therefore, training of the classification branch network may be completed using the labeled sample set and the complete unlabeled sample set.
According to embodiments of the present disclosure, the first confidence threshold may be determined based on confidence values included in the respective pseudo tags of the plurality of first sample images. The first confidence threshold may be used for distinguishing between positive and negative samples, i.e. the plurality of first sample images may be divided into a first sample subset and a second sample subset based on the first confidence threshold, wherein the confidence value associated with the first sample image comprised by the first sample subset is greater than or equal to the first confidence threshold and the confidence value associated with the first sample image comprised by the second sample subset is less than the first confidence threshold.
According to embodiments of the present disclosure, the first confidence threshold may be determined based on a distribution trend of confidence values for each of the pseudo tags of each of the first sample images. Specifically, determining the first confidence threshold based on the confidence values included in the pseudo tags of the respective plurality of first sample images includes:
Based on the confidence values included in the pseudo labels of the plurality of first sample images, ratio data related to each of a plurality of preset confidence intervals is determined. The first confidence threshold is determined based on the ratio data related to each of the plurality of preset confidence intervals and the interval endpoint values of the plurality of preset confidence intervals.
According to an embodiment of the present disclosure, the confidence value may be a value between 0 and 1, and the preset confidence intervals may be expressed as value intervals distributed between 0 and 1. For example, the plurality of preset confidence intervals may be [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8) and [0.8, 1]. Based on these preset confidence intervals, the interval endpoint values include 0.2, 0.4, 0.6 and 0.8, i.e., one value may be selected from 0.2, 0.4, 0.6 and 0.8 as the first confidence threshold.
According to embodiments of the present disclosure, the ratio data related to the preset confidence interval may be expressed as a ratio between the number of pseudo tags whose confidence value falls within the preset confidence interval and the total number of pseudo tags.
According to the embodiments of the disclosure, the specific value of the first confidence threshold may be determined from the positive-to-negative sample ratio required in a specific application scenario and the ratio data related to each of the plurality of preset confidence intervals. For example, in an application scenario, the desired ratio of the numbers of positive and negative samples may be 1:1. The plurality of preset confidence intervals may be [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8) and [0.8, 1], and the ratio data related to each interval may be as shown in Table 1: the ratio data related to [0, 0.2) is 5%, to [0.2, 0.4) is 41%, to [0.4, 0.6) is 28%, to [0.6, 0.8) is 19%, and to [0.8, 1] is 7%. From the desired ratio, the negative sample proportion is 50% and the positive sample proportion is 50%. Based on Table 1, the cumulative ratio data for the interval [0, 0.4) is 46% and for [0.4, 1] is 54%; therefore, the first confidence threshold may be determined to be 0.4.
Table 1

| Preset confidence interval | [0, 0.2) | [0.2, 0.4) | [0.4, 0.6) | [0.6, 0.8) | [0.8, 1] |
| --- | --- | --- | --- | --- | --- |
| Ratio data | 5% | 41% | 28% | 19% | 7% |
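The threshold selection just described can be sketched as follows; the function name and the desired negative-sample share are illustrative. It reproduces the worked example: the cumulative share below 0.4 is 46%, the closest candidate to 50%, so 0.4 is chosen.

```python
import numpy as np

def pick_confidence_threshold(confidences, desired_negative_ratio=0.5,
                              bin_edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Choose the interval endpoint whose cumulative share of pseudo labels
    below it best matches the desired share of negative samples."""
    hist, _ = np.histogram(confidences, bins=bin_edges)
    ratios = hist / hist.sum()               # ratio data per preset interval
    cumulative = np.cumsum(ratios)           # share below each right endpoint
    endpoints = np.asarray(bin_edges[1:-1])  # candidates: 0.2, 0.4, 0.6, 0.8
    best = np.argmin(np.abs(cumulative[:-1] - desired_negative_ratio))
    return float(endpoints[best])
```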
According to an embodiment of the present disclosure, as an alternative implementation, the first confidence threshold may be set to a fixed value, for example 0.3. If the positive-to-negative sample ratio produced by a fixed first confidence threshold is not as desired, the positive or negative samples may be resampled to correct the ratio.
According to an embodiment of the present disclosure, when regression analysis is performed, training of the regression branch network is completed using the labeled sample set and the first sample images with higher confidence data in the unlabeled sample set, that is, the regression branch network is trained with the first sample subset and the labeled sample set. When classification analysis is performed, the labeled sample set and the complete unlabeled sample set are used, that is, the classification branch network is trained with the first sample subset, the second sample subset and the labeled sample set.
According to the embodiments of the disclosure, as an optional implementation, in the training stage of the initial model, the pseudo labels may be further screened based on their confidence values, so as to make full use of high-quality pseudo labels in the back-propagation parameter tuning process, thereby effectively improving the robustness of the model. Specifically, training the initial model using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain the deep learning model may include the following operations:
A first loss is obtained based on the plurality of first sample images, the pseudo labels of the plurality of first sample images and the mask values of the plurality of first sample images; a second loss is obtained based on the plurality of annotated sample images and the labels of the plurality of annotated sample images; and the model parameters of the initial model are adjusted using the first loss and the second loss to finally train the deep learning model.
According to the embodiments of the disclosure, in the forward stage of initial model training, all first sample images may participate, that is, each first sample image is input into the initial model to obtain a detection result. In the back-propagation stage, the mask values may be multiplied in as weights on the loss values to obtain the total loss value, as shown in equation (1):

$L = \sum_{i} \mathrm{mask}_i \cdot \mathrm{Loss}(\hat{y}_i, y_i)$    (1)

In equation (1), $L$ represents the total loss value, which may be used to adjust the model parameters of the initial model; $\mathrm{mask}_i$ represents the mask value of the i-th first sample image, which is 1 if the image has a high-quality pseudo label as determined from the confidence value and 0 otherwise; $\mathrm{Loss}(\cdot)$ represents the loss function; $\hat{y}_i$ represents the detection result obtained after the i-th first sample image is input into the initial model; and $y_i$ represents the pseudo label of the i-th first sample image.
Whether the pseudo tag is a high quality pseudo tag may be determined using a second confidence threshold according to embodiments of the present disclosure.
According to embodiments of the present disclosure, the second confidence threshold may be determined based on confidence values included in the respective pseudo tags of the plurality of first sample images. Mask values for each of the plurality of first sample images are determined based on the respective pseudo tags of the plurality of first sample images and a second confidence threshold.
According to embodiments of the present disclosure, the second confidence threshold may be determined using the same or similar method as when the first confidence threshold is determined, and will not be described in detail herein.
According to an embodiment of the present disclosure, in a case where the confidence value of the pseudo tag of the first sample image is greater than or equal to the second confidence threshold value, it may be determined that the mask value of the first sample image is 1. In the case where the confidence value of the pseudo tag of the first sample image is less than the second confidence threshold, the mask value of the first sample image may be determined to be 0.
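Putting equation (1) and the mask rule together, a minimal PyTorch-style sketch might look as follows; the per-sample loss function and the threshold value are placeholders, and in practice this first loss would be combined with the second loss from the labeled samples.

```python
import torch

def first_loss(preds, pseudo_labels, confidences, loss_fn, second_conf_thresh):
    """Equation (1): weight each first sample image's loss by a 0/1 mask so
    that only high-quality pseudo labels contribute to backpropagation.

    confidences: (N,) tensor of pseudo-label confidence values;
    second_conf_thresh: the second confidence threshold described above.
    """
    mask = (confidences >= second_conf_thresh).float()  # mask_i in {0, 1}
    per_sample = torch.stack([loss_fn(p, y)             # Loss(y_hat_i, y_i)
                              for p, y in zip(preds, pseudo_labels)])
    return (mask * per_sample).sum()                    # total loss L
```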
According to the embodiments of the disclosure, a mask value of 0 or 1 is generated for each first sample image, and the mask value can be multiplied in as a weight on the loss function during the model training stage to constrain first sample images of different confidence levels. In this way, during model training only high-confidence pseudo labels are retained as positive samples while insufficiently confident pseudo labels are ignored, so that high-quality pseudo labels are fully utilized and the robustness of the model is effectively improved.
According to embodiments of the present disclosure, since the first sample image in the unlabeled exemplar set may be derived from multiple scenes, the unlabeled exemplar set may be classified into multiple subsets based on the source of the first sample image when training of the initial model is performed. The sample ratio, the first confidence threshold, and the second confidence threshold used may differ when model training is performed with each subset.
Fig. 4 schematically illustrates a schematic diagram of a deep learning model training method according to another embodiment of the present disclosure.
As shown in fig. 4, the first sample image in the unlabeled exemplar set 403 may originate from 3 scenes, the 3 scenes being bad case scenes, general scenes, and special scenes, respectively.
According to an embodiment of the present disclosure, in the bad-case scenario, the first sample images may be screened from the labeled sample set 401 based on preset evaluation parameters 407. The preset evaluation parameters 407 may include, for example, metrics for evaluating the target detection model 402, including but not limited to precision, recall, F1 score and the like.
For example, the target detection model 402 may be evaluated using the labeled sample set 401, and the F1 score of each annotated sample image may be determined from the output results and the labels of the labeled sample set 401. Annotated sample images with an F1 score below 0.7 may then be screened from the labeled sample set 401 as first sample images and added to the unlabeled sample set 403.
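A minimal sketch of this bad-case mining step; the per-image F1 computation from precision and recall is standard, and the 0.7 cutoff follows the example above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of per-image precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mine_bad_cases(images, f1_per_image, threshold=0.7):
    """Select annotated images on which the target detection model scores
    below the F1 threshold; they join the unlabeled set as bad-case
    first sample images, to be re-annotated with pseudo labels."""
    return [img for img, f1 in zip(images, f1_per_image) if f1 < threshold]
```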
According to embodiments of the present disclosure, in the general scenario, the first sample images may be screened from the reflow data 409 of a front-end image recognition service 408. The front-end image recognition service 408 may be, for example, an online OCR (Optical Character Recognition) service: a user uploads an image to the online OCR service, which returns the corresponding text recognition result. The reflow data 409 of the front-end image recognition service 408 consists of the images users have uploaded to the online OCR service.
According to an embodiment of the present disclosure, when first sample images are screened from the reflow data 409, the scene type of each image included in the reflow data 409 may be determined first; screening may then be performed on the principle of full coverage of scene types, or on the principle of preferentially selecting high-frequency scenes.
According to an embodiment of the present disclosure, in a special scenario, a first sample image may be screened from the shared resource 411 based on the preset scenario features 410.
According to embodiments of the present disclosure, the preset scene features 410 may include feature words, shape features, and the like. The feature words may include, for example, "ancient books", "street views", "vertical rows", etc. The shape feature may be, for example, "elongated," "irregular," or the like. The shared resource 411 may be, for example, a public resource on the internet, or may be various resource data for which data use is permitted.
According to the embodiments of the disclosure, by screening first sample images from the shared resources 411 based on the preset scene features 410, certain specific vertical-domain text recognition scenes, including vertical text scenes, ancient book scenes, street view scenes, minority languages and program screenshots, can be collected in a targeted manner, so that the coverage of the unlabeled sample set 403 can be enlarged.
According to embodiments of the present disclosure, the unlabeled exemplar set 403 may be divided into 3 sub-exemplar sets, respectively, a first sub-exemplar set, a second sub-exemplar set, and a third sub-exemplar set, depending on the scene. In training the initial model 405, the first sub-sample set, the second sub-sample set, and the third sub-sample set may have respective sampling ratios, a first confidence threshold, and a second confidence threshold, and may be trained to obtain the deep learning model 406 by using the method described above, which is not described herein.
According to an embodiment of the present disclosure, in a training process of the deep learning model, a label sample set 401 and a label-free sample set 403 may be fused to be used as a training sample set, and a label included in the label sample set 401 and a pseudo label 404 are fused to be used as a label of the training sample set, so as to train an initial model 405 to obtain a deep learning model 406.
According to the embodiments of the disclosure, the characterization capability of a small-parameter model in multiple different text scenarios is improved by learning the features of massive data. In the supervised learning of the related art, by contrast, multiple rounds of training on different scene data are often required, producing multiple different models to achieve text line detection across multiple scenarios. The massive unlabeled data used in the embodiments of the disclosure include image data of various scenes, from simple to difficult, and the distinct data features of the various scenes can be effectively migrated into a small model, achieving the effect of one model, trained once, serving multiple scenarios.
According to the embodiment of the disclosure, on the other hand, the cost of manually labeling data can be reduced by generating the pseudo tag through the DINO large model. Labeling data is one of the key resources of machine learning, and labeling a piece of data often requires a lot of time and effort. The pseudo tag is used for semi-supervised learning, so that the required quantity of marking data can be reduced, and the marking cost is reduced. The embodiment of the disclosure avoids consuming a great deal of time and resources for manual labeling by generating the pseudo tag, and greatly shortens the time and cost of model training.
According to the embodiments of the disclosure, in addition, semi-supervised learning and multiple pseudo-label screening strategies are adopted, improving the training efficiency and performance of the model. Semi-supervised learning with pseudo labels can increase the amount of training data by exploiting unlabeled data, thereby improving the generalization ability and performance of the model. The introduction of pseudo labels also increases data diversity, making the model more robust and further improving its performance and generalization ability. Moreover, the embodiments of the disclosure provide multiple pseudo-label cleaning and screening strategies, including multi-scale testing with NMS post-processing and per-scene ignore strategies, which ensure that abundant, high-quality pseudo-label data are used in the semi-supervised learning stage.
Fig. 5 schematically illustrates a flow chart of a text line detection method according to an embodiment of the present disclosure.
As shown in fig. 5, the method 500 includes operation S510.
In operation S510, the image to be detected is processed by using the deep learning model, and a text line detection result is obtained.
According to the embodiments of the present disclosure, the deep learning model may be trained by using the deep learning model training method as described above, and will not be described herein.
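A minimal inference sketch for operation S510, assuming a PyTorch model and a preprocessed image tensor; the output format follows whatever the trained detection model produces (boxes, categories, confidences).

```python
import torch

def detect_text_lines(model, image_tensor):
    """Run the trained deep learning model on an image to be detected and
    return its text line detection result."""
    model.eval()
    with torch.no_grad():
        return model(image_tensor.unsqueeze(0))  # add a batch dimension
```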
Fig. 6 schematically illustrates a block diagram of a deep learning model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the deep learning model training apparatus 600 includes a first processing module 610 and a training module 620.
The first processing module 610 is configured to process, respectively, a plurality of first sample images included in the label-free sample set by using a target detection model to obtain pseudo labels of each of the plurality of first sample images, where the target detection model is obtained by training using the labeling sample set.
The training module 620 is configured to train the initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeled sample set to obtain a deep learning model.
According to an embodiment of the present disclosure, the first processing module 610 includes a first processing unit, a second processing unit, and a third processing unit.
The first processing unit is configured to perform, for each first sample image, data enhancement processing on the first sample image to obtain a plurality of second sample images.
And the second processing unit is used for respectively processing the plurality of second sample images by utilizing the target detection model to obtain a plurality of first labels.
And a third processing unit for determining a pseudo tag of the first sample image based on the plurality of first tags.
According to an embodiment of the present disclosure, the third processing unit comprises a first processing subunit.
And the first processing subunit is used for performing non-maximum value inhibition processing on the plurality of first labels based on the confidence coefficient values included in the plurality of first labels, so as to obtain pseudo labels of the first sample image.
According to an embodiment of the present disclosure, the first processing unit comprises a second processing subunit.
And the second processing subunit is used for carrying out multiple scale transformation on the first sample image to obtain a plurality of second sample images, wherein the scaling ratios used by the multiple scale transformation are different.
According to an embodiment of the present disclosure, the third processing unit comprises a third processing subunit and a fourth processing subunit.
And the third processing subunit is used for restoring the first label based on the scaling used in the process of generating the second sample image for each second sample image to obtain a second label.
A fourth processing subunit is configured to determine the pseudo label of the first sample image based on the plurality of second labels.
According to an embodiment of the present disclosure, the training module 620 includes a first training unit and a second training unit.
The first training unit is used for sampling the plurality of first sample images and the labeled sample set to obtain a training sample set.
And the second training unit is used for training the initial model by utilizing the training sample set to obtain a deep learning model.
According to an embodiment of the present disclosure, the deep learning model training apparatus 600 further includes a first determination module and a division module.
The first determining module is used for determining a first confidence threshold value based on confidence coefficient values included in the pseudo tags of the first sample images.
And the dividing module is used for dividing the plurality of first sample images into a first sample subset and a second sample subset based on a first confidence threshold value, wherein the confidence value related to the first sample image included in the first sample subset is larger than or equal to the first confidence threshold value, and the confidence value related to the first sample image included in the second sample subset is smaller than the first confidence threshold value.
According to an embodiment of the present disclosure, the initial model includes a regression branch network and a classification branch network.
According to an embodiment of the present disclosure, the training module 620 includes a third training unit and a fourth training unit.
And the third training unit is used for training the regression branch network by using the first sample subset and the labeling sample set.
And the fourth training unit is used for training the classification branch network by using the first sample subset, the second sample subset and the labeling sample set.
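The division of supervision between the two branches may be sketched as follows; the per-branch loss methods on the model are hypothetical names for this sketch, not an API given in the disclosure:

```python
def branch_losses(model, first_subset, second_subset, labeled_samples):
    # Regression branch: high-confidence pseudo labels plus real labels only.
    regression_data = list(first_subset) + list(labeled_samples)
    # Classification branch: additionally consumes the low-confidence subset.
    classification_data = (list(first_subset) + list(second_subset)
                           + list(labeled_samples))
    reg_loss = sum(model.regression_loss(image, label)
                   for image, label in regression_data)
    cls_loss = sum(model.classification_loss(image, label)
                   for image, label in classification_data)
    return reg_loss, cls_loss
```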
According to an embodiment of the present disclosure, the first determination module includes a first determination unit and a second determination unit.
And the first determining unit is used for determining proportion data for each of a plurality of preset confidence intervals based on the confidence values included in the pseudo labels of the plurality of first sample images.
And the second determining unit is used for determining the first confidence threshold based on the proportion data of each of the plurality of preset confidence intervals and the interval endpoint values of the plurality of preset confidence intervals.
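One plausible reading of this threshold selection is sketched below; both the ten equal-width intervals and the minimum-proportion rule are assumptions of the sketch, not values or rules given in the disclosure:

```python
def first_confidence_threshold(confidence_values, num_intervals=10,
                               min_proportion=0.1):
    # Bin pseudo-label confidences into preset intervals, then return the
    # left endpoint of the highest interval whose proportion of all samples
    # reaches min_proportion.
    if not confidence_values:
        return 0.0
    n = len(confidence_values)
    endpoints = [i / num_intervals for i in range(num_intervals + 1)]
    for i in reversed(range(num_intervals)):
        lo, hi = endpoints[i], endpoints[i + 1]
        count = sum(1 for c in confidence_values
                    if lo <= c < hi or (i == num_intervals - 1 and c == hi))
        if count / n >= min_proportion:
            return lo
    return endpoints[0]
```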
According to an embodiment of the present disclosure, the labeling sample set includes a plurality of labeled sample images and respective labels of the plurality of labeled sample images.
According to an embodiment of the present disclosure, the training module 620 includes a fifth training unit, a sixth training unit, and a seventh training unit.
And a fifth training unit, configured to obtain a first loss based on the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the mask values of the plurality of first sample images.
And the sixth training unit is configured to obtain a second loss based on the plurality of labeled sample images and the labels of the plurality of labeled sample images.
And the seventh training unit is used for adjusting the model parameters of the initial model by using the first loss and the second loss, so as to obtain the trained deep learning model.
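A minimal sketch of the loss combination, where the pseudo_weight balance factor is an assumption of the sketch rather than a disclosed value:

```python
def total_loss(first_sample_losses, mask_values, labeled_sample_losses,
               pseudo_weight=1.0):
    # Each pseudo-labeled sample's loss is gated by its mask value, then
    # added to the fully supervised loss over the labeled sample images.
    first_loss = sum(m * l for m, l in zip(mask_values, first_sample_losses))
    second_loss = sum(labeled_sample_losses)
    return second_loss + pseudo_weight * first_loss
```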
According to an embodiment of the present disclosure, the deep learning model training apparatus 600 further includes a second determination module and a third determination module.
And the second determining module is used for determining a second confidence threshold based on the confidence values included in the pseudo labels of the plurality of first sample images.
And the third determining module is used for determining the mask value of each of the plurality of first sample images based on the pseudo label of each of the plurality of first sample images and the second confidence threshold.
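Assuming the same per-image confidence value as above, the mask computation reduces to a thresholded indicator:

```python
def mask_values_from_threshold(pseudo_labels, second_confidence_threshold):
    # A mask value of 1.0 keeps a first sample image's loss in training,
    # while 0.0 silences it; the per-image "confidence" key is an assumption.
    return [1.0 if p["confidence"] >= second_confidence_threshold else 0.0
            for p in pseudo_labels]
```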
According to an embodiment of the present disclosure, the deep learning model training apparatus 600 further includes a first screening module.
And the first screening module is used for screening out the first sample images from the labeling sample set based on preset evaluation parameters.
According to an embodiment of the present disclosure, the deep learning model training apparatus 600 further includes a second screening module.
And the second screening module is used for screening out the first sample images from the reflow data of the front-end image recognition service.
According to an embodiment of the present disclosure, the deep learning model training apparatus 600 further includes a third screening module.
And the third screening module is used for screening out the first sample images from shared resources based on preset scene characteristics.
Fig. 7 schematically illustrates a block diagram of a text line detection device according to an embodiment of the present disclosure.
As shown in fig. 7, the text line detecting apparatus 700 includes a second processing module 710.
The second processing module 710 is configured to process the image to be detected by using the deep learning model, so as to obtain a text line detection result.
According to an embodiment of the present disclosure, the deep learning model is trained using the deep learning model training method described above.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.
According to an embodiment of the present disclosure, there is provided a computer program product including a computer program which, when executed by a processor, implements the method described above.
FIG. 8 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to an input/output (I/O) interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a deep learning model training method or a text line detection method. For example, in some embodiments, the deep learning model training method or text line detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the deep learning model training method or text line detection method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a deep learning model training method or a text line detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (33)

1. A deep learning model training method, comprising:
respectively processing a plurality of first sample images included in an unlabeled sample set by using a target detection model to obtain respective pseudo labels of the plurality of first sample images, wherein the target detection model is obtained by training with a labeling sample set; and
and training an initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeling sample set to obtain a deep learning model.
2. The method of claim 1, wherein the processing the plurality of first sample images included in the unlabeled sample set by using the target detection model to obtain the respective pseudo labels of the plurality of first sample images includes:
for each first sample image, carrying out data enhancement processing on the first sample image to obtain a plurality of second sample images;
respectively processing the plurality of second sample images by using the target detection model to obtain a plurality of first labels; and
based on the plurality of first labels, a pseudo label of the first sample image is determined.
3. The method of claim 2, wherein the determining a pseudo label for the first sample image based on the plurality of first labels comprises:
and performing non-maximum suppression on the plurality of first labels based on confidence values included in the plurality of first labels, so as to obtain the pseudo label of the first sample image.
4. The method of claim 2, wherein the performing data enhancement processing on the first sample image to obtain a plurality of second sample images includes:
and performing multiple scale transformations on the first sample image to obtain a plurality of second sample images, wherein each scale transformation uses a different scaling ratio.
5. The method of claim 4, wherein the determining a pseudo label for the first sample image based on the plurality of first labels comprises:
for each second sample image, restoring the first label based on the scaling ratio used when generating the second sample image, to obtain a second label; and
based on the plurality of second labels, a pseudo label of the first sample image is determined.
6. The method of claim 1, wherein training the initial model with the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the labeled sample set to obtain a deep learning model comprises:
sampling the plurality of first sample images and the labeling sample set to obtain a training sample set; and
and training the initial model by using the training sample set to obtain the deep learning model.
7. The method of claim 1, further comprising:
determining a first confidence threshold based on confidence values included in the pseudo labels of the plurality of first sample images; and
the plurality of first sample images are divided into a first sample subset and a second sample subset based on the first confidence threshold, wherein a confidence value associated with a first sample image included in the first sample subset is greater than or equal to the first confidence threshold and a confidence value associated with a first sample image included in the second sample subset is less than the first confidence threshold.
8. The method of claim 7, wherein the initial model comprises a regression branch network and a classification branch network;
wherein training the initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the labeling sample set to obtain a deep learning model includes:
training the regression branch network with the first subset of samples and the set of labeling samples; and
training the classification branch network using the first subset of samples, the second subset of samples, and the labeled set of samples.
9. The method of claim 7, wherein the determining a first confidence threshold based on the confidence values included in the pseudo labels of the respective plurality of first sample images comprises:
determining proportion data related to each of a plurality of preset confidence intervals based on confidence values included in the pseudo labels of each of the plurality of first sample images; and
the first confidence threshold is determined based on the proportion data associated with each of the plurality of preset confidence intervals and the interval endpoint value of each of the plurality of preset confidence intervals.
10. The method of claim 1, wherein the labeling sample set comprises a plurality of labeled sample images and respective labels of the plurality of labeled sample images;
wherein training the initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the labeling sample set to obtain a deep learning model includes:
obtaining a first loss based on the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the mask values of the plurality of first sample images;
obtaining a second loss based on the plurality of labeled sample images and the labels of the plurality of labeled sample images; and
and adjusting model parameters of the initial model by using the first loss and the second loss so as to finally train to obtain the deep learning model.
11. The method of claim 10, further comprising:
determining a second confidence threshold based on confidence values included in the pseudo labels of the plurality of first sample images; and
determining mask values for each of the plurality of first sample images based on the pseudo labels for each of the plurality of first sample images and the second confidence threshold.
12. The method of any one of claims 1-11, further comprising:
and screening the first sample image from the marked sample set based on preset evaluation parameters.
13. The method of any one of claims 1-11, further comprising:
and screening from the reflow data of the front-end image recognition service to obtain the first sample image.
14. The method of any one of claims 1-11, further comprising:
and screening the first sample image from the shared resource based on the preset scene characteristics.
15. A text line detection method, comprising:
processing the image to be detected by using a deep learning model to obtain a text line detection result;
wherein the deep learning model is trained using the deep learning model training method according to any one of claims 1 to 14.
16. A deep learning model training apparatus comprising:
the first processing module is used for respectively processing a plurality of first sample images included in an unlabeled sample set by using a target detection model to obtain respective pseudo labels of the plurality of first sample images, wherein the target detection model is obtained by training with a labeling sample set; and
and the training module is used for training the initial model by using the plurality of first sample images, the pseudo labels of the plurality of first sample images and the labeling sample set to obtain a deep learning model.
17. The apparatus of claim 16, wherein the first processing module comprises a first processing unit, a second processing unit, and a third processing unit;
the first processing unit is used for carrying out data enhancement processing on each first sample image to obtain a plurality of second sample images;
the second processing unit is used for respectively processing the plurality of second sample images by utilizing the target detection model to obtain a plurality of first labels; and
and a third processing unit, configured to determine a pseudo tag of the first sample image based on the plurality of first tags.
18. The apparatus of claim 17, wherein the third processing unit comprises a first processing subunit;
and the first processing subunit is used for performing non-maximum suppression on the plurality of first labels based on the confidence values included in the plurality of first labels, so as to obtain the pseudo label of the first sample image.
19. The apparatus of claim 17, wherein the first processing unit comprises a second processing subunit;
and the second processing subunit is used for performing multiple scale transformations on the first sample image to obtain a plurality of second sample images, wherein each scale transformation uses a different scaling ratio.
20. The apparatus of claim 19, wherein the third processing unit comprises a third processing subunit and a fourth processing subunit;
a third processing subunit, configured to, for each of the second sample images, perform reduction processing on the first label based on a scaling used when the second sample image is generated, to obtain a second label; and
a fourth processing subunit for determining a pseudo tag of the first sample image based on the plurality of second tags.
21. The apparatus of claim 16, wherein the training module comprises a first training unit and a second training unit;
the first training unit is used for carrying out sampling processing on the plurality of first sample images and the labeling sample set to obtain a training sample set; and
and the second training unit is used for training the initial model by using the training sample set to obtain the deep learning model.
22. The apparatus of claim 16, further comprising a first determination module and a partitioning module;
a first determining module, configured to determine a first confidence threshold based on confidence values included in the pseudo tags of the respective plurality of first sample images; and
A dividing module, configured to divide the plurality of first sample images into a first sample subset and a second sample subset based on the first confidence threshold, where a confidence value associated with a first sample image included in the first sample subset is greater than or equal to the first confidence threshold and a confidence value associated with a first sample image included in the second sample subset is less than the first confidence threshold.
23. The apparatus of claim 22, wherein the initial model comprises a regression branch network and a classification branch network;
the training module comprises a third training unit and a fourth training unit;
a third training unit for training the regression branch network using the first sample subset and the labeled sample set; and
and a fourth training unit, configured to train the classification branch network by using the first sample subset, the second sample subset and the labeling sample set.
24. The apparatus of claim 22, wherein the first determination module comprises a first determination unit and a second determination unit;
a first determining unit configured to determine proportion data related to each of a plurality of preset confidence intervals based on confidence values included in pseudo labels of each of the plurality of first sample images; and
a second determining unit configured to determine the first confidence threshold based on the proportion data related to each of the plurality of preset confidence intervals and the interval endpoint value of each of the plurality of preset confidence intervals.
25. The apparatus of claim 16, wherein the labeling sample set comprises a plurality of labeled sample images and respective labels for the plurality of labeled sample images;
the training module comprises a fifth training unit, a sixth training unit and a seventh training unit;
a fifth training unit, configured to obtain a first loss based on the plurality of first sample images, the pseudo labels of the plurality of first sample images, and the mask values of the plurality of first sample images;
a sixth training unit, configured to obtain a second loss based on the plurality of labeled sample images and the labels of the plurality of labeled sample images; and
and the seventh training unit is used for adjusting the model parameters of the initial model by utilizing the first loss and the second loss so as to finally train to obtain the deep learning model.
26. The apparatus of claim 25, further comprising a second determination module and a third determination module;
A second determining module, configured to determine a second confidence threshold based on confidence values included in the pseudo labels of the respective plurality of first sample images; and
and a third determining module, configured to determine mask values of each of the plurality of first sample images based on the pseudo labels of each of the plurality of first sample images and the second confidence threshold.
27. The apparatus of any one of claims 16-26, further comprising a first screening module;
and the first screening module is used for screening the first sample image from the marked sample set based on preset evaluation parameters.
28. The apparatus of any one of claims 16-26, further comprising a second screening module;
and the second screening module is used for screening from the reflow data of the front-end image recognition service to obtain the first sample image.
29. The apparatus of any one of claims 16-26, further comprising a third screening module;
and the third screening module is used for screening and obtaining the first sample image from the shared resource based on the preset scene characteristics.
30. A text line detection device comprising:
the second processing module is used for processing the image to be detected by using the deep learning model to obtain a text line detection result;
wherein the deep learning model is trained using the deep learning model training method according to any one of claims 1 to 14.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-15.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274965A (en) * 2023-11-21 2023-12-22 浙江恒逸石化有限公司 Training method of image recognition model, spinneret plate detection method and device
CN117274965B (en) * 2023-11-21 2024-03-05 浙江恒逸石化有限公司 Training method of image recognition model, spinneret plate detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination