CN115019321A - Text recognition method, text model training method, text recognition device, text model training equipment and storage medium


Info

Publication number: CN115019321A
Application number: CN202210800458.XA
Authority: CN (China)
Legal status: Pending
Prior art keywords: image, text, scaling, height, sample
Other languages: Chinese (zh)
Inventors: 范森, 王晓燕, 吕鹏原, 姚锟
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections

Abstract

The disclosure provides a text recognition method, a model training method, corresponding apparatuses, equipment and a storage medium; it relates to the technical field of artificial intelligence, specifically to the technical fields of deep learning, image processing and computer vision, and can be applied to scenes such as OCR (optical character recognition). The specific implementation scheme is as follows: scale the image to be recognized to obtain a first image with a reference size; extract image features of the first image; determine the minimum text height of the text in the first image according to the image features; obtain, based on the minimum text height and the reference text height corresponding to the reference size, a scaling coefficient that keeps the text clear after the first image is scaled; scale the first image according to the scaling coefficient to obtain a second image; and perform text recognition on the second image. By applying the scheme provided by the embodiments of the disclosure, text recognition can be performed on images of various sizes.

Description

Text recognition method, text model training method, text recognition device, text model training equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning, image processing and computer vision, and can be applied to scenes such as OCR (optical character recognition).
Background
Images are more and more common in daily life, and some images record text; for example, text such as articles is recorded in web page images, and text such as invoice numbers and invoice dates is recorded in invoice images.
However, the sizes of images in different scenes may differ widely; for example, an emoticon image tends to be small, while a slide screenshot image tends to be large.
In view of the foregoing, it is desirable to provide an image-based text recognition method capable of performing text recognition for images of various sizes.
Disclosure of Invention
The disclosure provides a text recognition method, a model training method, a text recognition apparatus, a model training apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a text recognition method including:
scaling the image to be recognized to obtain a first image with a reference size;
extracting image features of the first image;
determining the minimum text height of the text in the first image according to the image characteristics;
obtaining a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and the reference text height corresponding to the reference size;
carrying out scaling processing on the first image according to the scaling coefficient to obtain a second image;
and performing text recognition on the second image.
According to another aspect of the present disclosure, there is provided a model training method, including:
acquiring a sample original image and a labeling frame of a text in the sample original image;
obtaining a scaling ratio for scaling the first size to a reference size according to the first size of the sample original image;
scaling the sample original image according to the scaling ratio to obtain a sample processing image;
scaling the labeling frame according to the scaling ratio;
determining the minimum frame height of the scaled labeling frame;
acquiring a first scaling coefficient based on the minimum frame height and the reference text height corresponding to the reference size;
inputting the sample processing image into a preset neural network model to obtain an output second scaling coefficient;
and adjusting network parameters of the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
According to another aspect of the present disclosure, there is provided a text recognition apparatus including:
the first image obtaining module is used for scaling the image to be recognized to obtain a first image with a reference size;
the scaling coefficient acquisition module is used for extracting image features of the first image, determining the minimum text height of a text in the first image according to the image features, and acquiring a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and the reference text height corresponding to the reference size;
the second image obtaining module is used for carrying out scaling processing on the first image according to the scaling coefficient to obtain a second image;
and the text recognition module is used for performing text recognition on the second image.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the sample original image obtaining module is used for obtaining a sample original image and a labeling frame of a text in the sample original image;
the scaling acquisition module is used for acquiring the scaling of scaling the first size to a reference size according to the first size of the sample original image;
the sample processing image obtaining module is used for carrying out scaling processing on the original sample image according to the scaling ratio to obtain a sample processing image and carrying out scaling processing on the marking frame according to the scaling ratio;
the minimum frame height determining module is used for determining the minimum frame height of the zoomed labeling frame;
a first scaling factor obtaining module, configured to obtain a first scaling factor based on the minimum frame height and a reference text height corresponding to the reference size;
the sample processing image input module is used for inputting the sample processing image into a preset neural network model to obtain an output second scaling coefficient;
and the parameter adjusting module is used for adjusting network parameters of the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text recognition method or the model training method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above-described text recognition method or model training method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the text recognition method or the model training method described above.
As can be seen from the above, when text recognition is performed by applying the scheme provided by the embodiments of the present disclosure, the image to be recognized is first scaled into a first image with the reference size; a scaling coefficient that keeps the text clear after scaling is then obtained based on the minimum text height in the first image and the reference text height corresponding to the reference size; the first image is then scaled according to the scaling coefficient to obtain a second image in which the text is clearly displayed; and finally the text in the second image is recognized. That is, images to be recognized of various original sizes are all scaled into second images with clear text, which improves the accuracy of recognizing the text in those images. Therefore, the scheme provided by the embodiments of the present disclosure can perform text recognition on images of various sizes.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating a first model training method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram illustrating a second model training method provided by the present disclosure;
fig. 4 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a text recognition method or a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The text recognition method provided by the embodiment of the present disclosure is explained in detail by specific embodiments below.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present disclosure, where the method includes the following steps S101 to S106.
Step S101: scale the image to be recognized to obtain a first image with the reference size.
The image to be recognized may be any of various types of images; for example, it may be an RGB color image, a grayscale image, a black-and-white binary image, or the like.
The reference size may be an image size which is set by a worker according to actual needs and is convenient for text recognition.
The reference size may be a size in which the length and width of the image are the same, for example 640 pixels × 640 pixels or 960 pixels × 960 pixels; or a size in which the length and width of the image are different, for example 640 pixels × 320 pixels or 960 pixels × 480 pixels.
Specifically, the image to be recognized may be scaled to the first image of the reference size as follows:
calculate the ratio of the length of the reference size to the length of the image to be recognized as the scaling ratio of the length of the image to be recognized, and calculate the ratio of the width of the reference size to the width of the image to be recognized as the scaling ratio of the width. Then scale the image to be recognized along the length direction and the width direction according to the corresponding scaling ratios to obtain a first image of the reference size.
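As a minimal sketch of this step (assuming OpenCV, a reference size of 640 pixels × 640 pixels, and an illustrative function name):

    import cv2


    def scale_to_reference(image, ref_w=640, ref_h=640):
        """Scale an image to the reference size (illustrative sketch).

        The length-wise and width-wise scaling ratios are computed
        independently, so the aspect ratio need not be preserved.
        """
        h, w = image.shape[:2]
        scale_w = ref_w / w  # ratio of reference length to image length
        scale_h = ref_h / h  # ratio of reference width to image width
        first_image = cv2.resize(image, (ref_w, ref_h),
                                 interpolation=cv2.INTER_LINEAR)
        return first_image, scale_w, scale_h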
Step S102: image features of the first image are extracted.
The manner of extracting the image features of the first image is not limited in this step, and the image features of the first image may be extracted based on a feature extraction algorithm such as an edge extraction operator, a texture feature extraction algorithm, or a convolutional neural network algorithm, for example.
Step S103: the minimum text height of the text in the first image is determined according to the image features.
The minimum text height is a text height of a text having a minimum height in the first image. Since the text height can measure the size of one text, the text corresponding to the minimum text height determined in this step can be understood as the smallest text in the first image.
The manner in which the minimum text height of the text in the first image is determined from the extracted image features is described below.
In one embodiment, each text region in the first image may be determined based on the extracted image features, and then the minimum region height among those text regions may be determined and taken as the minimum text height.
For example, the rectangular regions where text is located in the first image may be determined based on the image features, where a rectangular region may be the minimum bounding rectangle of a line of text or of a single character, and then the minimum height among these rectangular regions is selected as the minimum text height.
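In code, once the text regions are available as bounding rectangles, this selection is a one-liner (the (x, y, width, height) tuple layout is an assumption for illustration):

    def minimum_text_height(boxes):
        """Return the smallest height among text bounding rectangles.

        `boxes` is assumed to be a list of (x, y, width, height)
        tuples produced by a text region detector.
        """
        if not boxes:
            raise ValueError("no text regions detected")
        return min(h for (_, _, _, h) in boxes)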
Step S104: obtain a scaling coefficient that keeps the text clear after the first image is scaled, based on the minimum text height and the reference text height corresponding to the reference size.
The reference text height is a text height at which a text can be clearly displayed in an image of a reference size.
It is to be understood that the reference text heights are different in different reference sizes, and the worker may set the reference text heights corresponding to the respective reference sizes in advance.
For example, the reference size S1 is 640 pixels × 640 pixels, and a text having a text height of 20 pixels can be clearly displayed in an image of the reference size S1, so that the reference text height corresponding to the reference size S1 can be 20 pixels; the reference size S2 is 960 pixels × 960 pixels, and text having a text height of 30 pixels can be clearly displayed in an image of the reference size S2, so that the reference text height corresponding to the reference size S2 may be 30 pixels.
Because the text corresponding to the minimum text height is the minimum text in the first image with the reference size, and the reference text height is the text height capable of clearly displaying the text in the image with the reference size, the zoom factor for keeping the text clear after the first image is zoomed can be obtained based on the minimum text height and the reference text height. The detailed description can be seen in the following examples, which are not detailed here.
In an embodiment of the present disclosure, this step may be implemented by a pre-trained scale factor prediction model, which is described in detail in the following embodiments and will not be described in detail here.
Step S105: scale the first image according to the scaling coefficient to obtain a second image.
In this step, the first image may be scaled according to the scaling coefficients along both the length direction and the width direction, so as to obtain a scaled second image.
Step S106: perform text recognition on the second image.
In this step, when performing text recognition on the second image, a pre-trained text recognition model may be used, and the pre-trained text recognition model may include a text region detection function and a text recognition function. The text region detection function is used for detecting each text region in the second image, and the text recognition function is used for recognizing the text in each text region.
The model architecture of the above-described text recognition model is briefly described below by way of example.
The text region detection function can be realized based on lightweight networks, namely a Mobile-v3 network and a Unet network, where Mobile-v3 serves as the backbone network and Unet serves as the task network.
After the second image is input into the text recognition model, the text region detection function extracts image features of the second image based on these networks and outputs a text region score map based on the extracted image features.
For example, when the type of the second image is an RGB image, the second image having 3 channels is input to the model, and the model may output a text region score map. For example, a text region is scored as 1 and a non-text region is scored as 0, so that a region scored as 1 may be determined as a text region.
The text area determined by the text area detection function may be each text area, or each text line area or text column area.
The text recognition function may be implemented based on a CRNN (Convolutional Recurrent Neural Network) + CTC (Connectionist Temporal Classification) architecture, that is, CRNN serves as the backbone network and CTC serves as the decoding layer.
The CRNN may use a Resnet18 network as an encoder to obtain global image features of the text regions determined by the text region detection function; the CTC serves as a text-line decoder that performs serialized modeling on the global image features, predicts each character type by combining the contextual features of the text, and outputs the recognition result for each character.
It should be noted that the above pre-trained text recognition model for recognizing the text in the second image is only an example, and the disclosure does not limit the specific model for recognizing the text in the second image.
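The two-stage flow described above might be wired together roughly as follows. This is a sketch only, in which detection_model is assumed to return a 2D text-region score map in [0, 1] and crnn_ctc_model is a placeholder for the CRNN + CTC recognizer; neither is the patent's actual implementation.

    import cv2
    import numpy as np


    def recognize_text(second_image, detection_model, crnn_ctc_model):
        """Sketch of the detection-then-recognition flow described above."""
        # 1. Detection: the Mobile-v3 + Unet branch outputs a per-pixel
        #    text-region score map (assumed here to lie in [0, 1]).
        score_map = detection_model(second_image)
        binary = (score_map > 0.5).astype(np.uint8)

        # 2. Group text pixels into regions via connected components.
        num, _, stats, _ = cv2.connectedComponentsWithStats(binary)

        results = []
        for i in range(1, num):  # label 0 is the background
            x = stats[i, cv2.CC_STAT_LEFT]
            y = stats[i, cv2.CC_STAT_TOP]
            w = stats[i, cv2.CC_STAT_WIDTH]
            h = stats[i, cv2.CC_STAT_HEIGHT]
            crop = second_image[y:y + h, x:x + w]
            # 3. Recognition: CRNN encodes the crop, CTC decodes characters.
            results.append(crnn_ctc_model(crop))
        return results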
It should be noted that, as described above, the scaling coefficient in step S104 may be obtained through a pre-trained scaling coefficient prediction model. When this model is used, it may be a sub-model of the text recognition model; or one or more network layers in the text recognition model; or a functional module of the text recognition model.
As can be seen from the above, when text recognition is performed by applying the scheme provided by the embodiments of the present disclosure, the image to be recognized is first scaled into a first image with the reference size; a scaling coefficient that keeps the text clear after scaling is then obtained based on the minimum text height in the first image and the reference text height corresponding to the reference size; the first image is then scaled according to the scaling coefficient to obtain a second image in which the text is clearly displayed; and finally the text in the second image is recognized. That is, images to be recognized of various original sizes are all scaled into second images with clear text, which improves the accuracy of recognizing the text in those images.
Therefore, the scheme provided by the embodiment of the disclosure can perform text recognition on images of various sizes, improves the universality of the text recognition scheme, can be applied to more recognition scenes, and improves the experience of users who need to perform text recognition.
The differences between the solutions provided by the embodiments of the present disclosure and the prior art are described below.
In the prior art, before text recognition is performed, the sizes of the images to be recognized must be unified, so every image to be recognized is scaled to a preset fixed size. However, in actual scenes the original sizes of the images to be recognized vary, so scaling images with different original sizes to one fixed size may leave the text size in the scaled images far from what is expected, which in turn affects the text recognition result: on the one hand, after an image with an oversized original size is scaled to the fixed size, the text in the scaled image may be too small to recognize; on the other hand, after an image with a small original size is scaled to the fixed size, the text in the scaled image may be too large, causing redundant calculation during text recognition and increasing the recognition time.
Compared with the prior art, when the scheme provided by the embodiments of the present disclosure is adopted, the scaling coefficient can be predicted specifically for images to be recognized of various original sizes, so that the text size in the second image obtained based on the scaling coefficient approaches the expected size and the text is displayed clearly, and this second image containing clear text is used as the image for text recognition. Therefore, compared with the prior art, in which every image to be recognized is scaled to a fixed size and that fixed-size image is used for text recognition, the text size in the second image of the present scheme is close to the expected size, which effectively reduces the probability that too-small text becomes unrecognizable or too-large text causes redundant calculation and wasted resources, and improves the efficiency of text recognition.
In an embodiment of the present disclosure, steps S102 to S104 above may be implemented by a pre-trained scaling coefficient prediction model:
the first image is input into the pre-trained scaling coefficient prediction model to obtain the output scaling coefficient.
The scaling coefficient prediction model is a model for predicting scaling coefficients that is obtained by training a preset neural network model with sample images of the reference size as input information and sample scaling coefficients as training labels, where each sample scaling coefficient is determined according to the minimum text height in the sample image and the reference text height.
The specific training manner of the above-mentioned scale factor prediction model is detailed in the following embodiment shown in fig. 2, and will not be detailed here.
The following describes a manner of acquiring the zoom factor for keeping the text clear after the zoom process is performed on the first image in step S104:
in one embodiment, a ratio between the reference text height and the minimum text height may be first calculated, and then a scaling factor for sharpening the text after scaling the first image may be obtained based on the ratio.
The above-mentioned ratio between the reference text height and the minimum text height may characterize the following meanings:
when the ratio is greater than 1, it indicates that the minimum text height in the zoomed first image is smaller than the reference text height, that is, the minimum text with the minimum text height is too small to be clearly displayed in the image with the reference size, and the first image may be enlarged to facilitate the subsequent preparation for recognizing the text in the first image.
When the ratio is smaller than 1, it indicates that the minimum text height in the scaled first image is larger than the reference text height, that is, the minimum text with the minimum text height is too large to be clearly displayed in the image with the reference size.
Based on the meaning of the above ratio, a scaling factor for making the text clear after the scaling processing is performed on the first image can be obtained in the following manner.
In a first mode, the ratio may be directly determined as the scaling factor.
For example, if the reference text height corresponding to the reference size S1 is 20 pixels and the minimum text height in the first image is 10 pixels, the ratio is 20/10 = 2, and 2 may be determined as the scaling factor.
In this way, the minimum text height in the scaled second image equals the reference text height, i.e., the minimum text can be clearly displayed in the second image; and since the other texts in the second image are no smaller than the minimum text, they can also be clearly displayed. That is, all the text in a second image scaled by this scaling factor can be clearly displayed.
In a second mode, a float value may be preset; a float range is determined based on the ratio and the float value, and any value within the float range may be determined as the scaling factor.
For example, if the reference text height corresponding to the reference size S1 is 20 pixels and the minimum text height in the first image is 10 pixels, the ratio is 20/10 = 2; with a preset float value of 5%, the float range is [1.9, 2.1], and any value within [1.9, 2.1] may be determined as the scaling factor.
This may allow the minimum text height in the scaled second image to approach the reference text height.
As can be seen from the foregoing, when the minimum text height in the scaled second image equals the reference text height, all text in the second image can be clearly displayed; accordingly, when the minimum text height in the scaled second image approaches the reference text height, all text in the second image can likewise be clearly displayed.
Thus, determining the scaling factor based on the ratio of the reference text height to the minimum text height helps the minimum text height in the second image approach the reference text height, i.e., helps the text in the second image display clearly after scaling, which facilitates subsequent text recognition.
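Both modes can be sketched in a few lines; the 5% float value mirrors the example above and is only illustrative:

    import random


    def scaling_factor(ref_text_height, min_text_height, float_value=0.05):
        """Derive the scaling factor from the reference/minimum height ratio."""
        ratio = ref_text_height / min_text_height

        # Mode 1: use the ratio directly.
        direct = ratio

        # Mode 2: pick any value inside the float range around the ratio.
        low, high = ratio * (1 - float_value), ratio * (1 + float_value)
        floated = random.uniform(low, high)
        return direct, floated


    print(scaling_factor(20, 10))  # ratio 20/10 = 2, float range [1.9, 2.1]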
Because the scaling coefficient prediction model is obtained by pre-training on a large number of sample images, it can learn, from scaling those samples into images with clear text, the relationship among sample image features, the final output image and the scaling coefficient, and can therefore predict scaling coefficients accurately. In addition, obtaining the scaling coefficient through the pre-trained scaling coefficient prediction model shortens the time needed to obtain it and improves the efficiency of obtaining it, which further improves the efficiency of the whole text recognition process.
In another embodiment of the present disclosure, steps S102 and S103, together with the steps described above for implementing step S104, may all be implemented by the scaling coefficient prediction model.
Corresponding to the text recognition method, the embodiment of the present disclosure further provides a model training method, which is used for training the scaling coefficient prediction model.
Referring to fig. 2, fig. 2 is a schematic flow chart diagram of a first model training method provided in the embodiment of the present disclosure, where the method includes the following steps S201 to S208.
Step S201: acquire the sample original image and the labeling frame of the text in the sample original image.
Specifically, the labeling box of the text in the sample original image can be obtained in the following manner.
In one embodiment, image features of the sample original image may be extracted, each text region in the image may be determined according to the extracted image features, and a frame of the determined text region may be labeled as a labeling frame of a text in the sample original image.
The manner of extracting the image features and determining each text region in the image may refer to the embodiment shown in fig. 1; specifically, it may follow the way image features of the first image are extracted and text regions are determined based on them in steps S102 and S103 of that embodiment, and is not described again here.
In another embodiment, manual labeling may be adopted to label the labeling frames of the text in the sample original image, so that accurate labeling frames can be obtained by virtue of rich manual experience and visual judgment.
It is worth mentioning that, as can be seen from the embodiment shown in fig. 1, text recognition on the second image may be implemented based on a pre-trained text recognition model, and training that model also requires sample images whose text regions are labeled. Therefore, when training the scaling coefficient prediction model here, the sample images and text region label information used in training the text recognition model can be reused. This adds no labeling cost, helps reduce resource consumption, and improves the efficiency of obtaining the sample original images and the labeling frames of their texts, thereby improving the training efficiency of the scaling coefficient prediction model.
Step S202: according to the first size of the sample original image, obtain the scaling ratio for scaling the first size to the reference size.
The scaling ratios are the respective scaling ratios of the length and the width of the sample original image.
Specifically, the ratio of the length of the reference size to the length of the sample original image may be calculated as the scaling ratio of the length of the sample original image, and the ratio of the width of the reference size to the width of the sample original image may be calculated as the scaling ratio of the width of the sample original image.
For example, if the reference size is 640 pixels × 640 pixels and the first size is 1280 pixels × 640 pixels, then when the sample original image is scaled to the reference size, the scaling ratio of its length is 640/1280 = 0.5 and the scaling ratio of its width is 640/640 = 1.
Step S203: scale the sample original image according to the scaling ratio to obtain a sample processing image.
Specifically, the sample processing image may be obtained by scaling the sample original image along the length direction and the width direction according to the determined scaling ratios.
Step S204: scale the labeling frame according to the scaling ratios.
Because the labeling frame is labeled for the text in the sample original image, after the sample original image is scaled according to the scaling ratio to obtain the sample processed image, the scaling processing can be performed on the length and width of the labeling frame according to the scaling ratios corresponding to the length and width of the sample original image.
Because the labeling box is labeled according to the text in the sample original image, it reflects the size of that text; and since the text is scaled together with the image when the sample processing image is produced, the size of the labeling box scaled by the same ratios actually reflects the size of the text in the scaled sample processing image.
Step S205: determine the minimum frame height of the scaled labeling frame.
As can be seen from the above step S204, the scaled annotation box size actually reflects the text size in the scaled sample processing image, and therefore, the minimum box height of the scaled annotation box is also the minimum text height in the scaled sample processing image.
Step S206: acquire a first scaling coefficient based on the minimum frame height and the reference text height corresponding to the reference size.
In this step, the first scaling coefficient obtained based on the minimum box height and the reference text height is the scaling coefficient that keeps the text in the sample processing image clear after the image is scaled according to it, namely the scaling coefficient that the model is expected to output.
The specific manner of obtaining the first scaling coefficient is described in the following embodiments, and will not be detailed here.
Step S207: input the sample processing image into a preset neural network model to obtain an output second scaling coefficient.
Specifically, the preset neural network model can extract image features of the sample processed image, and predict the second scaling factor of the sample processed image based on the image features. The preset neural network model can adopt a lightweight Mobile-v3 model as a backbone network.
It should be noted that the Mobile-v3 model is only an example, and the disclosure does not limit the specific architecture of the preset neural network model.
Step S208: adjust network parameters of the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
Specifically, a loss value generated when the neural network model predicts the scaling coefficient may be calculated according to a difference between the first scaling coefficient and the second scaling coefficient, the network parameter of the neural network model is adjusted by using the loss value, iterative training is continued based on the adjusted parameter, and the training is completed after a preset training end condition is met, so as to obtain a trained scaling coefficient prediction model. The preset training end condition may be that the loss value is smaller than a preset value, a preset training number is reached, and the like.
The specific way of calculating the loss value may be to calculate by using a preset loss function, for example, the loss function may be a mean square error loss function, a cross entropy loss function, and the like, and the specific way of calculating the loss value is not limited in this disclosure.
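In PyTorch, one parameter update of this kind might look as follows; this is a sketch assuming the model is any network that regresses a scaling coefficient, with the mean square error loss as one of the admissible choices named above:

    import torch.nn as nn


    def train_step(model, optimizer, sample_images, first_scale_factors):
        """One update from the first/second scaling coefficient difference."""
        criterion = nn.MSELoss()  # one admissible loss choice
        second_scale_factors = model(sample_images)  # model output (S207)
        loss = criterion(second_scale_factors, first_scale_factors)

        optimizer.zero_grad()
        loss.backward()   # gradient of the first difference information
        optimizer.step()  # adjust the network parameters (S208)
        return loss.item()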
As can be seen from the above, when the scheme provided by the embodiments of the present disclosure is applied to model training, a sample original image and the labeling frame of its text are obtained first; the sample original image is then scaled to a sample processing image with the reference size, and the labeling frame is scaled with the same ratios, so that the minimum frame height after scaling is the minimum text height in the sample processing image. Based on this minimum frame height and the reference text height corresponding to the reference size, a first scaling coefficient that keeps the text clear after the sample processing image is scaled can be obtained. Training the model with the first scaling coefficient as the sample's label information lets the model learn the relationship between the sample processing image and the first scaling coefficient, so a model capable of predicting scaling coefficients can be trained.
The manner in which the aforementioned first scaling factor is obtained is explained below:
specifically, a ratio between the reference text height corresponding to the reference size and the minimum box height may be calculated, and the first scaling factor is the ratio.
In this way, using the ratio of the reference text height to the minimum box height as the first scaling coefficient enables the preset neural network model to learn the relationship between the image features of a sample processing image and its expected scaling coefficient, which improves the accuracy of the model in predicting scaling coefficients.
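Steps S202 to S206 can be condensed into a single label-computation helper; a sketch, in which the (width, height) tuples and the function name are assumptions:

    def first_scaling_coefficient(sample_size, ref_size, box_heights,
                                  ref_text_height):
        """Compute the training label for one sample original image.

        sample_size, ref_size: (width, height) tuples (S202).
        box_heights: heights of the text labeling boxes before scaling.
        """
        scale_h = ref_size[1] / sample_size[1]  # height scaling ratio
        # Scale the labeling boxes with the same ratio as the image (S204).
        scaled_heights = [h * scale_h for h in box_heights]
        min_box_height = min(scaled_heights)  # minimum frame height (S205)
        return ref_text_height / min_box_height  # the ratio is the label (S206)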
In an embodiment of the present disclosure, the foregoing step S206 may be implemented by:
in response to determining that the minimum box height is greater than the preset height, a first scaling factor is obtained based on the minimum box height and the reference text height.
That is, it is determined whether the minimum frame height is greater than the preset height; if so, step S207 and the subsequent steps are executed; if not, step S207 and the subsequent steps are not executed.
The preset height may be set by a worker based on experience, and may be 1, 2, etc., for example.
A minimum frame height not greater than the preset height means that the minimum text in the scaled sample processing image would be too small, or that the relative difference between the size of the minimum text in the sample original image and the size of that image is too large.
The following takes the preset height as 1 and the reference size as 640 × 640 pixels as an example, and the above-mentioned case where the minimum frame height is not greater than the preset height is exemplified.
In the first case, the smallest text in a sample original image of huge size may not be large enough.
For example: the size of the sample original image P1 is 12800 pixels × 12800 pixels, the text height of the smallest text in P1 is 20 pixels, and the height of the labeling box corresponding to that text is 20 pixels. The scaling ratio of the length and width of P1 is calculated as 640/12800 = 1/20, so after the length and width of the text labeling boxes in P1 are scaled by 1/20, the height of the labeling box corresponding to the smallest text is 1; that is, the minimum box height is 1, which is not greater than the preset height of 1.
It can be seen that in this case the smallest text in P1 is tiny relative to the overall size of P1, so if P1 were scaled and then recognized, the text in the scaled image would be too small to recognize. Therefore, no subsequent steps are performed on the sample original image P1; that is, it is discarded and not used as a sample for training the neural network model.
In the second case, the smallest text in a sample original image of normal size may be extremely small.
For example: the size of the sample original image P2 is 1280 pixels × 1280 pixels, the text height of the smallest text in P2 is 1 pixel, and the height of the labeling box corresponding to that text is 1 pixel. The scaling ratio of the length and width of P2 is calculated as 640/1280 = 0.5, so after the length and width of the text labeling boxes in P2 are scaled by 0.5, the height of the labeling box corresponding to the smallest text is 0.5; that is, the minimum box height is 0.5, which is not greater than the preset height of 1.
It can be seen that in this case the smallest text in P2 is extremely small and difficult to recognize. Therefore, no subsequent steps are performed on the sample original image P2; that is, it is discarded and not used as a sample for training the neural network model.
In this way, sample processing images whose text is difficult to recognize can be filtered out, and the model is trained only on sample processing images whose text is convenient to recognize. This prevents the model from learning the image features of hard-to-recognize samples and thus from having its training result disturbed, reduces interference during model training, and improves the accuracy of the trained model when predicting scaling coefficients.
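This filtering rule reduces to a simple predicate; a sketch with the preset height of 1 pixel used in the examples above:

    def keep_sample(min_box_height, preset_height=1.0):
        """Keep a sample only if its scaled minimum box height is large enough.

        Samples such as P1 (min box height 1) and P2 (min box height 0.5)
        above would both be dropped.
        """
        return min_box_height > preset_height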
On the basis of the embodiment shown in fig. 2, the neural network model may output, in addition to the second scaling coefficient, at least one of the following: first information characterizing whether text exists in the sample processing image; second information characterizing the direction of the text in the sample processing image. In view of this, the embodiments of the present disclosure provide a second model training method.
Referring to fig. 3, fig. 3 is a schematic flow chart of a second model training method provided in the embodiment of the present disclosure, where the method includes the following steps S301 to S308.
Step S301: obtain the sample original image and the labeling box of the text in the sample original image.
Step S302: according to the first size of the sample original image, obtain the scaling ratio for scaling the first size to the reference size.
Step S303: scale the sample original image according to the scaling ratio to obtain a sample processing image.
Step S304: scale the labeling frame according to the scaling ratio.
Step S305: determine the minimum frame height of the scaled labeling frame.
Step S306: obtain a first scaling coefficient based on the minimum frame height and the reference text height corresponding to the reference size.
The steps S301 to S306 are the same as the steps S201 to S206 in the embodiment shown in fig. 2, and are not described again here.
Step S307: input the sample processing image into a preset neural network model to obtain an output second scaling coefficient and at least one of the following: first information representing whether text exists in the sample processing image, and second information representing the text direction in the sample processing image.
It can be seen that the output of the neural network model may take several forms: it may include the second scaling coefficient and the first information; the second scaling coefficient and the second information; or the second scaling coefficient, the first information and the second information. The manner of adjusting the network parameters differs with the output; for details, see the subsequent step S308.
After the sample processing image is input into the preset neural network model, the model can extract features of the sample processing image and then predict the scaling coefficient based on them. The first information and the second information are also obtained using these features, so outputting the three kinds of information from the same model lets them share the features of the sample processing image; image feature extraction need not be performed multiple times, which saves computing resources.
Step S308: adjust network parameters of the neural network model according to the extended information and first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
The extended information includes at least one of:
second difference information between the second information and third information, where the third information represents the text direction in the labeling box; and
third difference information between the first information and fourth information, where the fourth information represents that text exists in the sample processing image.
Specifically, the extended information includes different contents according to different outputs of the neural network model, and the network parameter adjustment mode for the neural network model is different according to different contents included in the extended information. The following conditions are specific:
in a first case, the output parameter of the neural network model includes a second scaling factor and second information, and the extension information includes the second difference information.
In this case, when the network parameter of the neural network model is adjusted, on one hand, the model parameter is adjusted according to the first difference information, and the specific adjustment manner may refer to step S208 in the embodiment shown in fig. 2, which is not described herein again; on the other hand, according to the second difference information, the model parameters are adjusted so that the model learns the relationship between the image features of the sample processing image and the text direction in the labeling frame, and further the model can output second information representing the text direction in the sample processing image.
The manner of adjusting the network parameters according to the second difference information may be obtained based on the manner of adjusting the model parameters according to the first difference information, and is not described here again.
In a second case, the output parameters of the neural network model may include a second scaling factor and the first information, and the extension information may include the third difference information.
In this case, when the network parameter of the neural network model is adjusted, on one hand, the model parameter is adjusted according to the first difference information, and the specific adjustment manner may refer to step S208 in the embodiment shown in fig. 2, which is not described herein again; on the other hand, the model parameters are adjusted according to the third difference information, so that the model learns the relationship between the image features of the sample processing image and whether the text exists in the sample processing image, and further, the model can output the first information representing whether the text exists in the sample processing image.
The manner of adjusting the network parameters according to the third difference information may be obtained on the basis of the manner of adjusting the model parameters according to the first difference information, and is not described herein again.
In a third case, the output parameters of the neural network model may include a second scaling factor, the first information, and the second information, and the extension information includes the second difference information and the third difference information.
In this case, when the network parameters of the neural network model are adjusted, the model parameters are adjusted according to the first difference information, the second difference information and the third difference information respectively. In this way, a scaling coefficient prediction model can be trained that outputs the second scaling coefficient, the second information representing the text direction in the sample processing image, and the first information representing whether text exists in the sample processing image. For the adjustment of the network parameters, refer to the foregoing description; details are not repeated here.
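For the third case, the combined objective might be assembled as follows in PyTorch; a sketch in which the per-task loss choices and the equal weighting are assumptions, not taken from the patent:

    import torch.nn as nn

    mse = nn.MSELoss()            # first difference information (scaling)
    bce = nn.BCEWithLogitsLoss()  # third difference information (text exists)
    ce = nn.CrossEntropyLoss()    # second difference information (direction)


    def combined_loss(scale_pred, scale_label,
                      exists_logit, exists_label,
                      direction_logits, direction_label):
        """Sum the three per-task losses; equal weights are an assumption."""
        return (mse(scale_pred, scale_label)
                + bce(exists_logit, exists_label.float())
                + ce(direction_logits, direction_label))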
As can be seen from the above, the scaling coefficient prediction model trained with the scheme provided by the embodiments of the present disclosure can output, for the input first image, not only the second scaling coefficient but also at least one of first information representing whether text exists in the first image and second information representing the text direction in the first image. Therefore, before recognizing the second image obtained based on the scaling coefficient, the second image can be filtered or adjusted according to at least one of the first information and the second information, which facilitates smooth text recognition and improves recognition efficiency.
In an embodiment of the present disclosure, since the scaling coefficient prediction model trained in the embodiment shown in fig. 3 can output, for the first image, the second scaling coefficient together with at least one of first information representing whether text exists in the first image and second information representing the text direction in the first image, the second image obtained based on the scaling coefficient may be filtered or adjusted according to at least one of the first information and the second information before text recognition is performed on it. The specific implementations are as follows:
In one embodiment, the scaling coefficient prediction model may output the second scaling coefficient and the first information. In this case, it may be determined whether the first information represents that text exists in the first image; if so, the first image is scaled based on the scaling coefficient to obtain a second image, and text recognition is performed on the second image; if not, no text exists in the first image, so no subsequent processing is performed on it and the text recognition process ends.
Therefore, the text recognition process can be ended when no text exists in the first image, and the waste of computing resources caused by text recognition of the image without the text is reduced.
In another embodiment, the scale factor prediction model may output the second scale factor and the second information. In this case, it may be determined whether the second information represents that a text direction in the first image is a preset arrangement direction, and if so, the first image is scaled based on the scaling factor to obtain a second image, and the second image is subjected to text recognition; if not, the text in the first image is not in the preset arrangement direction, so that the first image can be further rotated to obtain a new first image with the text arrangement direction being in the preset arrangement direction, then the new first image is zoomed based on the zoom coefficient to obtain a second image, and the text recognition is carried out on the second image. The preset arrangement direction may be a preset text direction for facilitating subsequent text recognition, and may be a horizontal arrangement direction, for example.
In this way, when the text direction in the first image is not the preset arrangement direction, a new first image whose text is arranged in the preset direction can be obtained by rotating the first image, and the new first image is then scaled based on the scaling coefficient to obtain a second image whose text direction is convenient for subsequent recognition.
In another embodiment, the scaling coefficient prediction model may output the second scaling coefficient, the first information and the second information. In this case, it may first be determined whether the first information represents that text exists in the first image. If so, it may then be determined whether the second information represents that the text direction in the first image is the preset arrangement direction, the first image is adjusted according to the determination result, the adjusted first image is scaled based on the scaling coefficient to obtain a second image, and text recognition is performed on the second image. If not, no text exists in the first image, so no subsequent processing is performed on it and the text recognition process ends. For the adjustment, refer to the foregoing embodiment; details are not repeated here.
Therefore, the first image can be preprocessed based on the first information and the second information, and the first image without the text can be filtered or the text arrangement direction in the first image can be adjusted, so that the waste of computing resources is reduced, and the accuracy of text recognition is improved.
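Put together, the preprocessing gate for the third embodiment might look like this; a sketch in which rotate_to_preset_direction is a hypothetical helper:

    import cv2


    def preprocess_and_scale(first_image, scale_factor, has_text,
                             direction_ok, rotate_to_preset_direction):
        """Filter or adjust the first image before scaling and recognition."""
        if not has_text:      # first information: no text, end the process
            return None
        if not direction_ok:  # second information: rotate to preset direction
            first_image = rotate_to_preset_direction(first_image)
        h, w = first_image.shape[:2]
        second_image = cv2.resize(
            first_image, (int(w * scale_factor), int(h * scale_factor)))
        return second_image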
Corresponding to the above text recognition method, an embodiment of the present disclosure provides a text recognition apparatus.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure, where the apparatus includes the following modules 401-404.
A first image obtaining module 401, configured to perform scaling processing on an image to be recognized to obtain a first image with a reference size;
a scaling factor obtaining module 402, configured to extract an image feature of the first image, determine a minimum text height of a text in the first image according to the image feature, and obtain a scaling factor for keeping the text clear after scaling the first image based on the minimum text height and a reference text height corresponding to the reference size;
a second image obtaining module 403, configured to perform scaling processing on the first image according to the scaling factor to obtain a second image;
a text recognition module 404, configured to perform text recognition on the second image.
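To make the division of labor among modules 401-404 concrete, here is a minimal Python sketch of the same pipeline; the reference size, the reference text height, and the two callables are illustrative assumptions, not values fixed by this disclosure.

```python
import cv2

REFERENCE_SIZE = (320, 32)   # (width, height) for cv2.resize; assumed values
REFERENCE_TEXT_HEIGHT = 16   # assumed reference text height in pixels


def recognize_text(image_to_recognize, predict_scaling_factor, recognize):
    # Module 401: scale the image to be recognized to the reference size.
    first_image = cv2.resize(image_to_recognize, REFERENCE_SIZE)
    # Module 402: obtain the scaling factor, e.g. from the trained
    # scaling factor prediction model.
    factor = predict_scaling_factor(first_image)
    # Module 403: scale the first image by the factor to get the second image.
    h, w = first_image.shape[:2]
    second_image = cv2.resize(first_image,
                              (round(w * factor), round(h * factor)))
    # Module 404: recognize the text in the second image.
    return recognize(second_image)
```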
As can be seen from the above, when text recognition is performed with the scheme provided by the embodiment of the present disclosure, the image to be recognized is first scaled into a first image with the reference size; a scaling factor that keeps the text clear after scaling is then obtained based on the minimum text height in the first image and the reference text height corresponding to the reference size; the first image is scaled according to that factor to obtain a second image in which the text is clearly displayed; and the text in the second image is recognized. That is, images to be recognized of various original sizes are all scaled into second images with clear text, which improves the accuracy of recognizing the text in them.
Therefore, the scheme provided by the embodiment of the present disclosure can perform text recognition on images of various sizes, which improves the universality of the text recognition scheme, allows it to be applied in more recognition scenarios, and improves the experience of users who need text recognition.
In an embodiment of the present disclosure, the scaling factor obtaining module 402 is specifically configured to input the first image into a pre-trained scaling factor prediction model to obtain the output scaling factor; the scaling factor prediction model is a model for predicting scaling factors, obtained by training a preset neural network model with sample images of the reference size as input information and sample scaling factors as training labels, where each sample scaling factor is determined from the minimum text height in the sample image and the reference text height.
Because the scaling factor prediction model is pre-trained on a large number of sample images, it can learn, while scaling those samples into images with clear text, the relationship between a sample image's features, the final output image, and the scaling factor, so it can predict scaling factors accurately. In addition, obtaining the scaling factor through a pre-trained model shortens the time needed to obtain it, which improves the efficiency of obtaining the scaling factor and, in turn, of the whole text recognition process.
In an embodiment of the present disclosure, the scaling factor obtaining module 402 is specifically configured to extract an image feature of the first image, determine the minimum text height of the text in the first image according to the image feature, calculate the ratio between the reference text height and the minimum text height, and obtain the scaling factor based on the ratio.
Thus, the scaling factor is determined from the ratio of the reference text height to the minimum text height, which helps the minimum text height in the second image, scaled according to this factor, approach the reference text height; that is, it helps the text in the second image remain clearly displayed after scaling, facilitating subsequent text recognition.
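As a worked example under assumed numbers: with a reference text height of 16 pixels and a minimum text height of 8 pixels, the ratio gives a factor of 2, so the first image is doubled in size. A one-line sketch, with the reference height an assumed value:

```python
REFERENCE_TEXT_HEIGHT = 16  # assumed reference text height in pixels


def scaling_factor(min_text_height):
    """Ratio of reference to minimum text height; >1 enlarges, <1 shrinks."""
    return REFERENCE_TEXT_HEIGHT / min_text_height
```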
Corresponding to the above model training method, an embodiment of the present disclosure provides a model training apparatus.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure, where the apparatus includes the following modules 501-507.
A sample original image obtaining module 501, configured to obtain a sample original image and an annotation box of a text in the sample original image;
a scaling obtaining module 502, configured to obtain, according to a first size of the sample original image, a scaling ratio for scaling the first size to a reference size;
a sample processing image obtaining module 503, configured to perform scaling processing on the sample original image according to the scaling ratio to obtain a sample processing image, and perform scaling processing on the labeling frame according to the scaling ratio;
a minimum frame height determining module 504, configured to determine a minimum frame height of the scaled labeling frame;
a first scaling factor obtaining module 505, configured to obtain a first scaling coefficient based on the minimum frame height and a reference text height corresponding to the reference size;
a sample processing image input module 506, configured to input the sample processing image into a preset neural network model, so as to obtain an output second scaling coefficient;
and a parameter adjusting module 507, configured to perform network parameter adjustment on the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient, to obtain a scaling coefficient prediction model.
As can be seen from the above, when the scheme provided by the embodiment of the present disclosure is applied to model training, a sample original image and the labeling frame of the text in it are obtained first; the sample original image is scaled into a sample processing image with the reference size, and the labeling frame is scaled by the same ratio, so the minimum frame height after scaling equals the minimum text height in the sample processing image. Based on the minimum frame height and the reference text height corresponding to the reference size, a first scaling coefficient that keeps the text clear after the sample processing image is scaled can therefore be obtained. Training the model with this first scaling coefficient as the sample's label information lets the model learn the relationship between sample processing images and their first scaling coefficients, yielding a model that can predict scaling coefficients.
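The following PyTorch sketch compresses modules 501-507 into one training step; the tensor layout, the (x1, y1, x2, y2) box format, the L1 loss, and the numeric constants are all assumptions made for the example, not details fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

REFERENCE_SIZE = (32, 320)   # (height, width); assumed values
REFERENCE_TEXT_HEIGHT = 16   # assumed reference text height in pixels


def training_step(model, optimizer, sample_image, boxes):
    """One training step; sample_image is a float (C, H, W) tensor and boxes
    holds the labeling frames as (x1, y1, x2, y2) rows (both assumptions)."""
    scale_h = REFERENCE_SIZE[0] / sample_image.shape[-2]
    # Modules 502-503: scale the image and its labeling frames to the reference size.
    processed = F.interpolate(sample_image.unsqueeze(0), size=REFERENCE_SIZE)
    # Module 504: minimum frame height after scaling.
    min_frame_height = ((boxes[:, 3] - boxes[:, 1]) * scale_h).min()
    # Module 505: first scaling coefficient, used as the training label.
    first_coeff = REFERENCE_TEXT_HEIGHT / min_frame_height
    # Module 506: second scaling coefficient predicted by the network.
    second_coeff = model(processed).squeeze()
    # Module 507: adjust parameters from the first difference information.
    loss = F.l1_loss(second_coeff, first_coeff)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```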
In an embodiment of the present disclosure, the first scaling factor obtaining module 505 is specifically configured to calculate the ratio between the reference text height and the minimum frame height;
the first scaling coefficient is this ratio.
Therefore, taking the ratio of the reference text height to the minimum frame height as the first scaling coefficient enables the preset neural network model to learn the relationship between the image features of a sample processing image and the scaling coefficient expected for it, which improves the accuracy of the model in predicting scaling coefficients.
In an embodiment of the present disclosure, the first scaling factor obtaining module 505 is specifically configured to, in response to determining that the minimum frame height is greater than a preset height, obtain the first scaling coefficient based on the minimum frame height and the reference text height.
In this way, sample processing images whose text is too small to identify easily can be filtered out, and the model is trained only on sample processing images whose text is convenient to identify. This prevents the model from learning the image features of hard-to-identify samples, which would interfere with the training result, thereby reducing interference during model training and improving the accuracy of the trained model when predicting scaling coefficients.
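A small sketch of this filtering rule; the preset height and reference height are chosen arbitrarily for illustration.

```python
PRESET_HEIGHT = 4            # assumed preset height in pixels
REFERENCE_TEXT_HEIGHT = 16   # assumed reference text height in pixels


def first_coefficient_or_skip(min_frame_height):
    """Return the training label, or None for hard-to-identify samples."""
    if min_frame_height <= PRESET_HEIGHT:
        return None  # filter this sample processing image out of training
    return REFERENCE_TEXT_HEIGHT / min_frame_height
```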
In one embodiment of the present disclosure, the output of the neural network model further comprises at least one of: first information for representing whether a text exists in the sample processing image and second information for representing the direction of the text in the sample processing image;
the parameter adjusting module 507 is specifically configured to perform network parameter adjustment on the neural network model according to extension information and the first difference information to obtain a scaling coefficient prediction model;
wherein the extension information includes at least one of: second difference information between the second information and third information, wherein the third information represents the text direction in the labeling frame;
third difference information between the first information and fourth information, the fourth information characterizing that text exists in the sample processing image.
As can be seen from the above, the scaling factor prediction model trained with the scheme provided by the embodiment of the present disclosure can output, for an input first image, not only the second scaling factor but also at least one of first information representing whether text exists in the first image and second information representing the text direction in the first image. Before recognizing the second image obtained based on the scaling factor, the second image can therefore be filtered or adjusted according to at least one of the first information and the second information, which helps text recognition proceed smoothly and improves recognition efficiency.
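A sketch of the combined training objective described above, assuming an L1 loss for the scaling regression, binary cross-entropy for the text-presence head, and cross-entropy for the direction head; the equal weighting of the terms is likewise an assumption.

```python
import torch.nn.functional as F


def multitask_loss(second_coeff, first_coeff,
                   has_text_logit=None, has_text_label=None,
                   direction_logits=None, direction_label=None):
    """Sum the first difference information with the optional extension info."""
    loss = F.l1_loss(second_coeff, first_coeff)        # first difference info
    if has_text_logit is not None:                     # third difference info
        loss = loss + F.binary_cross_entropy_with_logits(
            has_text_logit, has_text_label)
    if direction_logits is not None:                   # second difference info
        loss = loss + F.cross_entropy(direction_logits, direction_label)
    return loss
```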
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An embodiment of the present disclosure provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a text recognition method or a model training method.
The disclosed embodiments provide a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to execute a text recognition method or a model training method.
Embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements a text recognition method or a model training method.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 performs the respective methods and processes described above, such as a text recognition method or a model training method. For example, in some embodiments, the text recognition method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the text recognition method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the text recognition method or the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A text recognition method, comprising:
scaling an image to be recognized to obtain a first image with a reference size;
extracting image features of the first image;
determining the minimum text height of the text in the first image according to the image characteristics;
obtaining a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and the reference text height corresponding to the reference size;
carrying out scaling processing on the first image according to the scaling coefficient to obtain a second image;
and performing text recognition on the second image.
2. The method according to claim 1, wherein the extracting image features of the first image, determining a minimum text height of text in the first image according to the image features, and obtaining a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and a reference text height corresponding to the reference size comprises:
inputting the first image into a pre-trained scaling coefficient prediction model to obtain the output scaling coefficient;
wherein the scaling coefficient prediction model is: a model for predicting scaling coefficients, obtained by training a preset neural network model with a sample image of the reference size as input information and a sample scaling coefficient as a training label, the sample scaling coefficient being determined according to the minimum text height in the sample image and the reference text height.
3. The method according to claim 1, wherein the obtaining a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and a reference text height corresponding to the reference size comprises:
calculating a ratio between the reference text height and the minimum text height;
and obtaining the scaling coefficient based on the ratio.
4. A model training method, comprising:
acquiring a sample original image and a labeling frame of a text in the sample original image;
obtaining a scaling ratio for scaling the first size to a reference size according to the first size of the sample original image;
scaling the sample original image according to the scaling ratio to obtain a sample processing image;
scaling the labeling frame according to the scaling ratio;
determining the minimum frame height of the zoomed labeling frame;
acquiring a first scaling coefficient based on the minimum frame height and the reference text height corresponding to the reference size;
inputting the sample processing image into a preset neural network model to obtain an output second scaling coefficient;
and adjusting network parameters of the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
5. The method according to claim 4, wherein the obtaining a first scaling factor based on the minimum frame height and a reference text height corresponding to the reference size comprises:
calculating a ratio between the reference text height and the minimum frame height;
wherein the first scaling coefficient is the ratio.
6. The method of claim 4, wherein the obtaining a first scaling coefficient based on the minimum frame height and a reference text height corresponding to the reference size comprises:
in response to determining that the minimum frame height is greater than a preset height, obtaining the first scaling coefficient based on the minimum frame height and the reference text height.
7. The method of any one of claims 4-6,
the output of the neural network model further comprises at least one of: first information for representing whether a text exists in the sample processing image and second information for representing the direction of the text in the sample processing image;
the performing network parameter adjustment on the neural network model according to the first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model comprises:
performing network parameter adjustment on the neural network model according to extension information and the first difference information to obtain a scaling coefficient prediction model; wherein
the extension information includes at least one of:
second difference information between the second information and third information, wherein the third information represents the text direction in the labeling frame;
third difference information between the first information and fourth information, the fourth information characterizing that text exists in the sample processing image.
8. A text recognition apparatus comprising:
the first image obtaining module is used for carrying out scaling processing on an image to be recognized to obtain a first image with a reference size;
the scaling coefficient acquisition module is used for extracting image features of the first image, determining the minimum text height of a text in the first image according to the image features, and acquiring a scaling coefficient for keeping the text clear after scaling the first image based on the minimum text height and the reference text height corresponding to the reference size;
the second image obtaining module is used for carrying out scaling processing on the first image according to the scaling coefficient to obtain a second image;
and the text recognition module is used for performing text recognition on the second image.
9. The apparatus of claim 8, wherein,
the scaling coefficient obtaining module is specifically configured to obtain the output scaling coefficient by inputting the first image into a pre-trained scaling coefficient prediction model; wherein the scaling coefficient prediction model is: a model for predicting scaling coefficients, obtained by training a preset neural network model with a sample image of the reference size as input information and a sample scaling coefficient as a training label, the sample scaling coefficient being determined according to the minimum text height in the sample image and the reference text height.
10. The apparatus of claim 8, wherein,
the scaling coefficient obtaining module is specifically configured to extract image features of the first image, determine a minimum text height of a text in the first image according to the image features, and calculate a ratio between the reference text height and the minimum text height; and obtaining the scaling coefficient based on the ratio.
11. A model training apparatus comprising:
a sample original image obtaining module, used for obtaining a sample original image and a labeling frame of a text in the sample original image;
the scaling acquisition module is used for acquiring the scaling of scaling the first size to a reference size according to the first size of the sample original image;
the sample processing image obtaining module is used for carrying out scaling processing on the sample original image according to the scaling ratio to obtain a sample processing image and carrying out scaling processing on the labeling frame according to the scaling ratio;
the minimum frame height determining module is used for determining the minimum frame height of the zoomed labeling frame;
a first scaling factor obtaining module, configured to obtain a first scaling factor based on the minimum frame height and a reference text height corresponding to the reference size;
the sample processing image input module is used for inputting the sample processing image into a preset neural network model to obtain an output second scaling coefficient;
and the parameter adjusting module is used for adjusting network parameters of the neural network model according to first difference information between the first scaling coefficient and the second scaling coefficient to obtain a scaling coefficient prediction model.
12. The apparatus of claim 11, wherein,
the first scaling factor obtaining module is specifically configured to calculate a ratio between the reference text height and the minimum frame height;
wherein the first scaling coefficient is the ratio.
13. The apparatus of claim 11, wherein,
the first scaling factor obtaining module is specifically configured to, in response to determining that the minimum frame height is greater than a preset height, obtain the first scaling coefficient based on the minimum frame height and the reference text height.
14. The apparatus of any one of claims 11-13,
the output of the neural network model further comprises at least one of: first information for representing whether a text exists in the sample processing image and second information for representing the direction of the text in the sample processing image;
the parameter adjusting module is specifically configured to perform network parameter adjustment on the neural network model according to extension information and the first difference information to obtain a scaling coefficient prediction model;
wherein the extension information includes at least one of: second difference information between the second information and third information, wherein the third information represents the text direction in the labeling frame;
third difference information between the first information and fourth information, the fourth information characterizing that text exists in the sample processing image.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210800458.XA 2022-07-06 2022-07-06 Text recognition method, text model training method, text recognition device, text model training equipment and storage medium Pending CN115019321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210800458.XA CN115019321A (en) 2022-07-06 2022-07-06 Text recognition method, text model training method, text recognition device, text model training equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210800458.XA CN115019321A (en) 2022-07-06 2022-07-06 Text recognition method, text model training method, text recognition device, text model training equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115019321A true CN115019321A (en) 2022-09-06

Family

ID=83079894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210800458.XA Pending CN115019321A (en) 2022-07-06 2022-07-06 Text recognition method, text model training method, text recognition device, text model training equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115019321A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115330777A (en) * 2022-10-13 2022-11-11 浙江华是科技股份有限公司 Ship detection method and system for training picture scaling size

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination