WO2024040870A1

WO2024040870A1 - Text image generation, training, and processing methods, and electronic device

Info

Publication number: WO2024040870A1
Application number: PCT/CN2023/074125
Authority: WO
Inventors: 郭若愚; 杜宇宁; 赖宝华; 马艳军
Original assignee: 北京百度网讯科技有限公司
Priority date: 2022-08-24
Filing date: 2023-02-01
Publication date: 2024-02-29
Also published as: CN115082598A; CN115082598B

Abstract

The present invention relates to the technical field of artificial intelligence, and provides text image generation, training, and processing methods, and an electronic device. A specific implementation solution comprises: dividing a sample text image set into at least one sample text image subset, according to a sample text output result set and a sample label set of the sample text image set (S210); according to a sample text output result set of a set of sample text images to be clipped, determining a target clipping location set of the set of sample text images to be clipped (S220); clipping the set of sample text images to be clipped on the basis of the target clipping location set, to obtain at least one clipped sample text image subset (S230); obtaining a target sample text image set according to the at least one clipped sample text image subset and the at least one sample text image subset (S240). The accuracy of the target clipping location can be effectively ensured, character information is effectively prevented from being damaged, and image background complexity and image diversity of sample text images in the target sample text image set are improved.

Description

Text image generation, training, and processing methods, and electronic device

This application claims priority to Chinese Patent Application No. 202211015424.6, filed on August 24, 2022, the content of which is hereby incorporated by reference.

Technical field

The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of computer vision and deep learning technology, and can be applied to optical character recognition scenarios. Specifically, it relates to text image generation, training, and processing methods, and an electronic device.

Background

With the development of computer technology, artificial intelligence technology has also developed. Artificial intelligence technology can include computer vision technology, speech recognition technology, natural language processing technology, machine learning, deep learning, big data processing technology and knowledge graph technology, etc.

Artificial intelligence technology has been widely used in various fields. For example, AI technology can be leveraged to generate text images for training deep learning models.

Summary

The present disclosure provides text image generation, training, and processing methods, and an electronic device.

According to an aspect of the present disclosure, a text image generation method is provided, including: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset; cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and obtaining a target sample text image set based on the at least one cropped sample text image subset and the at least one sample text image subset.

According to another aspect of the present disclosure, a training method for a deep learning model is provided, including: obtaining a target sample text image set; and training the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the text image generation method according to the present disclosure.

According to another aspect of the present disclosure, a text image processing method is provided, including: obtaining a text image to be processed; and inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the training method according to the present disclosure.

According to another aspect of the present disclosure, a text image generating device is provided, including: a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; a determination module configured to determine a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset; a first acquisition module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a second acquisition module configured to obtain a target sample text image set based on the at least one cropped sample text image subset and the at least one sample text image subset.

According to another aspect of the present disclosure, a training device for a deep learning model is provided, including: a first acquisition module configured to acquire a target sample text image set; and a third acquisition module configured to train the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the text image generating device according to the present disclosure.

According to another aspect of the present disclosure, a text image processing device is provided, including: a second acquisition module configured to acquire a text image to be processed; and a fourth acquisition module configured to input the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the training device according to the present disclosure.

According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method according to the present disclosure.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief description of the drawings

The accompanying drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:

Figure 1 schematically illustrates an exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and device according to an embodiment of the present disclosure may be applied;

Figure 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present disclosure;

FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure;

FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the present disclosure;

FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure;

FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure;

FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure;

Figure 4 schematically shows a flow chart of a training method for a deep learning model according to an embodiment of the present disclosure;

Figure 5 schematically shows a flow chart of a text image processing method according to an embodiment of the present disclosure;

Figure 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure;

Figure 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure;

Figure 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure; and

FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a text image generation method, a deep learning model training method, and a text image processing method according to an embodiment of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be considered to be exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.

FIG. 1 schematically illustrates an exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and device according to an embodiment of the present disclosure may be applied.

It should be noted that Figure 1 is only an example of a system architecture to which embodiments of the present disclosure can be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments or scenarios. For example, in another embodiment, the exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and device can be applied may include a terminal device, and the terminal device may implement the text image generation method, the deep learning model training method, and the text image processing method and device provided by the embodiments of the present disclosure without interacting with a server.

As shown in Figure 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, for example, at least one of wired and wireless communication links.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, etc. Various communication client applications can be installed on the terminal devices 101, 102, and 103, for example, at least one of a knowledge reading application, a web browser application, a search application, an instant messaging tool, an email client, and social platform software.

The terminal devices 101, 102, and 103 may be various electronic devices having a display screen and supporting web browsing, for example, at least one of smartphones, tablets, laptop computers, and desktop computers.

The server 105 may be any of various types of servers providing various services. For example, the server 105 can be a cloud server, also known as a cloud computing server or a cloud host. It is a host product in the cloud computing service system that overcomes the difficult management and weak business scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server 105 can also be a server of a distributed system, or a server combined with a blockchain.

It should be noted that the text image generation method and text image processing method provided by the embodiments of the present disclosure can generally be executed by the terminal device 101, 102, or 103. Correspondingly, the text image generating device and the text image processing device provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.

Alternatively, the text image generation method and text image processing method provided by the embodiments of the present disclosure can also be executed by the server 105. Correspondingly, the text image generation device and the text image processing device provided by the embodiments of the present disclosure may be provided in the server 105. The text image generation method and text image processing method provided by the embodiments of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the text image generation device and the text image processing device provided by the embodiments of the present disclosure may also be provided in a server or server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be noted that the training method of the deep learning model provided by the embodiment of the present disclosure can generally be executed by the server 105. Correspondingly, the training device for the deep learning model provided by the embodiment of the present disclosure may generally be provided in the server 105. The deep learning model training method provided by the embodiment of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the training device of the deep learning model provided by the embodiment of the present disclosure can also be provided in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.

Alternatively, the deep learning model training method provided by the embodiment of the present disclosure can generally also be executed by the terminal device 101, 102, or 103. Correspondingly, the training device for the deep learning model provided by the embodiment of the present disclosure can also be provided in the terminal device 101, 102, or 103.

It should be understood that the numbers of terminal devices, networks and servers in Figure 1 are only illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers.

It should be noted that the sequence number of each operation in the following method is only used as a representation of the operation for the purpose of description, and should not be regarded as indicating the execution order of the respective operations. Unless explicitly stated, the methods need not be performed in exactly the order shown.

FIG. 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present disclosure.

As shown in Figure 2, the method 200 includes operations S210 to S240.

In operation S210, the sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set.

In operation S220, a target cropping position set of the sample text image set to be cropped is determined based on the sample text output result set of the sample text image set to be cropped.

In operation S230, the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample text image subset.

In operation S240, a target sample text image set is obtained based on at least one cropped sample text image subset and at least one sample text image subset.

According to an embodiment of the present disclosure, the at least one sample text image subset may include a first sample text image subset. The first subset of sample text images may include sample text images whose sample text output results are correct. The set of sample text images to be cropped may be determined based on the first subset of sample text images.

According to embodiments of the present disclosure, the text image may include at least one of the following: a document text image and a scene text image. Document text images can refer to text images with neat layout, controlled lighting and a relatively simple background. Scene text images can refer to text images with complex backgrounds, diverse text forms, and uncontrolled lighting. Text form may include at least one of the following: text color, size, font, direction, irregular layout, etc. Layout irregularities may include at least one of bends, tilts, wrinkles, deformations, and mutilations.

According to embodiments of the present disclosure, the sample text image set may include at least one sample text image. The sample text image may include at least one of the following: a sample document text image and a sample scene text image. The sample text image set may be an image set for a text vision task. The sample text images can be text images for various text vision tasks. For example, the text vision task may include at least one of the following: text image recognition task, text image classification task, text image segmentation task, text image detection task, text image retrieval task, etc. In addition, text vision tasks can also include at least one of the following: subdivided domain tasks corresponding to text image recognition tasks, subdivided domain tasks corresponding to text image classification tasks, subdivided domain tasks corresponding to text image segmentation tasks, subdivided domain tasks corresponding to text image detection tasks, and subdivided domain tasks corresponding to text image retrieval tasks.

According to embodiments of the present disclosure, for example, the subdivided domain tasks corresponding to the text image recognition task may include at least one of the following: bill image recognition tasks, medical text image recognition tasks, financial product text image recognition tasks, video subtitle recognition tasks, security monitoring recognition tasks, etc. The subdivided domain tasks corresponding to the text image classification task may include at least one of the following: bill image classification tasks, medical text image classification tasks, financial product text image classification tasks, video subtitle classification tasks, security monitoring classification tasks, etc. The subdivided domain tasks corresponding to the text image segmentation task may include at least one of the following: bill image segmentation tasks, medical text image segmentation tasks, financial product text image segmentation tasks, etc. The subdivided domain tasks corresponding to the text image detection task may include at least one of the following: bill image detection tasks, medical text image detection tasks, financial product text image detection tasks, video subtitle detection tasks, security monitoring detection tasks, etc. The subdivided domain tasks corresponding to the text image retrieval task may include at least one of the following: bill image retrieval tasks, medical text image retrieval tasks, financial product text image retrieval tasks, video subtitle retrieval tasks, security monitoring retrieval tasks, etc.

According to an embodiment of the present disclosure, there may be a sample text output result set and a sample label set corresponding to a sample text image set. The set of sample text output results may include at least one sample text output result. The sample label set may include at least one sample label. The sample text image may have a sample text output result and a sample label corresponding to the sample text image. The sample text output result can characterize the predicted text result of the sample text image. The sample text output result may include at least one of a sample text recognition output result and a sample text semantic output result. The sample text recognition output result can characterize the predicted text recognition result of the sample text image. The sample text semantic output result can characterize the predicted semantic result of the sample text image. Sample labels can characterize the real text results of sample text images. The sample label may include at least one of a sample text recognition label and a sample text semantic label. The sample text recognition label can characterize the real text recognition results of the sample text image. Sample text semantic labels can characterize the real semantic results of sample text images. The text recognition result may refer to a sequence of characters included in the text image.

According to an embodiment of the present disclosure, the sample text image set may include a first sample text image subset. The sample text images in the first sample text image subset may refer to sample text images whose sample text output results are correct. The first sample text image subset may include a set of sample text images to be cropped. The set of sample text images to be cropped may include at least one sample text image to be cropped. A sample text image to be cropped may refer to a sample text image in the first sample text image subset that satisfies a predetermined cropping condition. The predetermined cropping condition can be configured according to actual business needs and is not limited here. For example, the predetermined cropping condition may include that a predetermined probability value corresponding to the sample text image is less than or equal to a predetermined probability threshold.
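The probability-threshold form of the predetermined cropping condition can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `select_images_to_crop`, the image identifiers, and the choice to draw each image's probability value uniformly at random are all assumptions, since the text does not say how the predetermined probability value is obtained.

```python
import random
from typing import List, Tuple


def select_images_to_crop(
    image_ids: List[str], threshold: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split image ids into (to_crop, untouched) using a per-image probability value."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    to_crop: List[str] = []
    untouched: List[str] = []
    for image_id in image_ids:
        probability = rng.random()  # assumption: drawn uniformly from [0, 1)
        if probability <= threshold:  # the predetermined cropping condition
            to_crop.append(image_id)
        else:
            untouched.append(image_id)
    return to_crop, untouched


# Hypothetical ids of images in the first sample text image subset.
first_subset_ids = ["img_0", "img_1", "img_2", "img_3"]
to_crop, untouched = select_images_to_crop(first_subset_ids, threshold=0.5)
```

Under this assumption the threshold acts as the expected fraction of correctly recognized images that are sent on to cropping.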

According to an embodiment of the present disclosure, the sample text image to be cropped may have at least one cropping position corresponding to the sample text image to be cropped. The target cropping position may refer to a cropping position that satisfies a predetermined position condition among the at least one cropping position. The predetermined position condition can be configured according to actual business needs and is not limited here. For example, the predetermined position condition may be that the cropping position is randomly selected from the at least one cropping position.

According to embodiments of the present disclosure, the subset of cropped sample text images may include at least one cropped sample text image. The cropped sample text image may be obtained by cropping the sample text image to be cropped based on the target cropping position.

According to embodiments of the present disclosure, a sample text image set may be obtained from a data source in response to detecting a text image generation instruction. Data sources may include at least one of the following: local databases, cloud databases, and network resources. A data interface can be called, and the sample text image set can be obtained from the data source through the data interface. The set of sample text images may include at least one sample text image. The sample text image may be at least one of the following: a simulated sample text image and a real sample text image. Real sample text images can be sample text images in public datasets. A simulated sample text image is generated in one of the following ways: based on predetermined image parameters, or by processing predetermined random noise data with a generative adversarial network model.

According to an embodiment of the present disclosure, for a sample text image in the sample text image set, first local feature extraction can be performed on the sample text image to obtain a first local sample feature map. Global feature extraction can be performed on the first local sample feature map to obtain a global sample feature sequence. The global sample feature sequence can be sequence decoded to obtain the sample text recognition output result of the sample text image. Second local feature extraction can be performed on the sample text image to obtain a second local sample feature map. Semantic understanding can be performed on the second local sample feature map to obtain the sample text semantic output result of the sample text image. The sample text output result of the sample text image is obtained according to at least one of the sample text recognition output result and the sample text semantic output result of the sample text image. For example, the sample text image can be processed based on a deep learning model to obtain the sample text output result. Deep learning models can include deep learning models capable of text recognition of variable-length character sequences and deep learning models capable of text semantic understanding. The model structure of the deep learning model can be configured according to actual business needs and is not limited here. For example, a deep learning model may include at least one model structure. The model structure may include at least one model substructure and the connection relationships between the model substructures. The model structure may be a structure obtained by connecting at least one model substructure based on the connection relationships between the model substructures. The at least one model substructure included in the model structure may be a structure from at least one operation layer.
For example, the model structure may be a structure obtained by connecting at least one model substructure from at least one operation layer based on the connection relationships between the model substructures. For example, the at least one operation layer may include at least one of the following: an input layer, a convolutional layer, a hidden layer, a transcription layer, a pooling layer, an unpooling layer, a deconvolution layer, a feedforward neural network layer, an attention layer, a residual layer, a fully connected layer, a batch normalization layer, a linear embedding layer and a non-linear layer, etc.

According to embodiments of the present disclosure, the deep learning model for text recognition may include one of the following: a text recognition model based on a CRNN (Convolutional Recurrent Neural Network) and a text recognition model based on an encoder-decoder. A CRNN can include convolutional layers, recurrent layers, and a transcription layer. The encoder-decoder can include one of the following: a symmetric encoder-decoder and an asymmetric encoder-decoder.

According to embodiments of the present disclosure, the CRNN-based text recognition model may include at least one of the following: a CRNN model based on CTC (Connectionist Temporal Classification), a CRNN model based on attention, and a CRNN model based on ACE (Aggregation Cross Entropy). The encoder-decoder based text recognition model may include a Seq-To-Seq (Sequence-To-Sequence) based text recognition model.
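To illustrate how a CTC-based transcription step can turn per-frame predictions into a character sequence (a sample text recognition output result), the following is a minimal sketch of standard greedy CTC decoding: merge consecutive repeats, then drop blanks. It is a generic illustration rather than the model used in the disclosure; the function name `ctc_greedy_decode`, the toy alphabet, and the frame indices are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], alphabet: List[str], blank: int = 0) -> str:
    """Collapse repeated frame predictions, then remove the blank symbol."""
    collapsed = [class_id for class_id, _ in groupby(frame_ids)]
    return "".join(alphabet[class_id] for class_id in collapsed if class_id != blank)


# Hypothetical alphabet: index 0 is the CTC blank, the rest are characters.
alphabet = ["<blank>", "c", "a", "t"]
# Hypothetical per-frame argmax indices produced by a recognition model.
frame_ids = [1, 1, 0, 2, 2, 0, 0, 3]
decoded = ctc_greedy_decode(frame_ids, alphabet)
```

A blank between two identical indices keeps them as separate characters, which is how such a decoder can represent doubled letters.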

According to embodiments of the present disclosure, the deep learning model for text semantic understanding may include at least one of the following: a convolutional neural network-based text semantic understanding model, a recurrent neural network-based text semantic understanding model, and a Transformer-based text semantic understanding model.

According to embodiments of the present disclosure, the training method of the deep learning model can be configured according to actual business needs, and is not limited here. For example, the training method may include at least one of the following: unsupervised training, supervised training, and semi-supervised training.

According to an embodiment of the present disclosure, the sample text image set may be divided into at least one sample text image subset according to the sample text output result and the sample label of the sample text image. For example, the at least one sample text image subset may include a first sample text image subset. In addition, the at least one sample text image subset may also include a second sample text image subset. The sample text images in the second subset of sample text images may refer to sample text images whose sample text output results are incorrect sample text output results.
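The division into a first subset (correct output results) and a second subset (incorrect output results) can be sketched as below. This is only an illustration under stated assumptions: the `Sample` record and its field names are hypothetical, and "correct" is taken to mean an exact string match between the output result and the label, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    image_id: str  # hypothetical identifier of a sample text image
    output: str    # sample text output result (predicted text)
    label: str     # sample label (real text)


def split_by_correctness(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Return (first_subset, second_subset): correct vs. incorrect output results."""
    # Assumption: an output result is "correct" when it equals the label exactly.
    first_subset = [s for s in samples if s.output == s.label]
    second_subset = [s for s in samples if s.output != s.label]
    return first_subset, second_subset


samples = [
    Sample("img_0", "hello", "hello"),
    Sample("img_1", "w0rld", "world"),
    Sample("img_2", "ocr", "ocr"),
]
first_subset, second_subset = split_by_correctness(samples)
```

Only images in `first_subset` would be candidates for cropping, since their predicted text can be trusted to locate characters.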

According to an embodiment of the present disclosure, for a sample text image to be cropped in a set of sample text images to be cropped, a plurality of candidate cropping positions may be determined based on the sample text output result of the sample text image to be cropped. At least one target cropping position may then be determined from the plurality of candidate cropping positions. For example, at least one target cropping position may be randomly determined from the plurality of candidate cropping positions. Alternatively, a position corresponding to at least one target character may be determined from the plurality of candidate cropping positions, and the position corresponding to the at least one target character is determined as the at least one target cropping position.
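The random selection of target cropping positions from candidates can be sketched as follows. The sketch assumes that the recognition output yields a horizontal extent for each character (the hypothetical `char_boxes`), and that candidates are the midpoints of the gaps between adjacent characters; the passage above does not fix how candidate positions are derived, so both choices are illustrative only.

```python
import random
from typing import List, Tuple

CharBox = Tuple[int, int]  # hypothetical (x_start, x_end) of one recognized character


def candidate_crop_positions(char_boxes: List[CharBox]) -> List[int]:
    """Candidate positions: the midpoint of the gap between adjacent characters."""
    return [(left[1] + right[0]) // 2 for left, right in zip(char_boxes, char_boxes[1:])]


def choose_target_positions(candidates: List[int], count: int, rng: random.Random) -> List[int]:
    """Randomly pick up to `count` candidates as target cropping positions."""
    count = min(count, len(candidates))
    return sorted(rng.sample(candidates, count))


# Hypothetical character extents for a four-character text line.
char_boxes = [(0, 10), (12, 20), (24, 30), (32, 40)]
candidates = candidate_crop_positions(char_boxes)
targets = choose_target_positions(candidates, count=2, rng=random.Random(0))
```

Because every candidate lies between two characters, any chosen target position avoids cutting through a character.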

According to an embodiment of the present disclosure, for a sample text image to be cropped in a set of sample text images to be cropped, the sample text image to be cropped can be cropped based on at least one target cropping position corresponding to the sample text image to be cropped, to obtain at least one cropped sample text image.
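Cropping at the target positions can be sketched as vertical cuts through the image. The example below uses nested lists as a stand-in for a real image array and treats each target cropping position as a column index; the function name `crop_at_positions` and this representation are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array


def crop_at_positions(image: Image, positions: List[int]) -> List[Image]:
    """Cut the image vertically at each x position and return the pieces left to right."""
    width = len(image[0])
    # Ignore positions on or outside the borders, and drop duplicates.
    cuts = sorted({p for p in positions if 0 < p < width})
    bounds = [0] + cuts + [width]
    return [[row[start:end] for row in image] for start, end in zip(bounds, bounds[1:])]


image = [list(range(8)) for _ in range(2)]  # 2 rows, 8 columns
pieces = crop_at_positions(image, [3, 6])
```

Two cut positions give three cropped pieces whose widths sum to the original width.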

According to an embodiment of the present disclosure, after obtaining at least one cropped sample text image corresponding to each of the sample text images to be cropped included in the sample text image set to be cropped, the cropped sample text images corresponding to the sample text images to be cropped may be combined to obtain at least one combined sample text image.
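One way to combine cropped pieces is to concatenate them side by side and join their texts. The passage above does not specify the combination rule, so horizontal concatenation of equal-height pieces, the `Piece` type, and the function name `combine_pieces` are assumptions for this sketch.

```python
from typing import List, Tuple

Image = List[List[int]]    # rows of pixel values; a stand-in for a real image array
Piece = Tuple[Image, str]  # (pixels, text contained in the piece)


def combine_pieces(pieces: List[Piece]) -> Piece:
    """Concatenate pieces left to right and join their texts in the same order."""
    heights = {len(pixels) for pixels, _ in pieces}
    if len(heights) != 1:
        raise ValueError("pieces must be non-empty and share the same height")
    height = heights.pop()
    combined = [
        [value for pixels, _ in pieces for value in pixels[row]] for row in range(height)
    ]
    text = "".join(piece_text for _, piece_text in pieces)
    return combined, text


# Hypothetical pieces cropped from two different sample text images.
left = ([[1, 1], [1, 1]], "ab")
right = ([[2, 2, 2], [2, 2, 2]], "cd")
combined_image, combined_text = combine_pieces([left, right])
```

Joining the texts alongside the pixels keeps a label available for the combined image, which matters if it is later used for training.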

According to an embodiment of the present disclosure, obtaining a target sample text image set based on at least one cropped sample text image subset and at least one sample text image subset may include: obtaining the target sample text image set based on the sample text image subsets, among the at least one sample text image subset, other than the first sample text image subset, the sample text images in the first sample text image subset other than the sample text image set to be cropped, and the at least one combined sample text image. Alternatively, the target sample text image set may be obtained based on the sample text image set and the at least one combined sample text image.

According to embodiments of the present disclosure, the text image generation method of the embodiment of the present disclosure can be executed by an electronic device. For example, the electronic device may be a server or a terminal device. The electronic device may include at least one processor. The processor may be used to execute the text image generation method provided by embodiments of the present disclosure. For example, a single processor may be used to execute the text image generation method provided by embodiments of the present disclosure, or multiple processors may be used to execute the text image generation method in parallel.

According to an embodiment of the present disclosure, the target cropping position set is determined based on the sample text output result set of the sample text image set to be cropped, and the sample text image set to be cropped is determined based on the first sample text image subset. The sample text images in the first sample text image subset are sample text images that have correct sample text output results and are determined from the sample text image set based on the sample text output result set and the sample label set of the sample text image set. Therefore, the accuracy of the target cropping position can be effectively ensured, and character information can be effectively prevented from being destroyed. In addition, the target sample text image set is obtained based on the at least one sample text image subset and the at least one cropped sample text image subset obtained by cropping the sample text image set to be cropped based on the target cropping position set. This improves the image background complexity and image diversity of the sample text images in the target sample text image set, so that a target sample text image set with richer contextual information can be obtained. As a result, using the target sample text image set for subsequent model training and optimization can reduce the number of model iterations and increase the model training speed, thereby reducing the data processing volume and resource consumption of the electronic device. This achieves the effect of improving the internal performance of the electronic device in conformity with the laws of nature, thereby enhancing the core competitiveness of the electronic device.

According to embodiments of the present disclosure, the above text image generation method may further include the following operations.

Perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set. According to the original sample text image set and the intermediate sample text image set, a sample text image set is obtained.

According to embodiments of the present disclosure, the set of original sample text images may include at least one original sample text image. Data augmentation may include at least one of the following: supervised data augmentation and unsupervised data augmentation. Supervised data augmentation may include at least one of the following: single-sample data augmentation and multi-sample data augmentation. Unsupervised data augmentation may include at least one of the following: data augmentation to generate new data and data augmentation to learn an augmentation strategy.

According to embodiments of the present disclosure, single-sample data enhancement may include at least one of the following: a geometric transformation class and a color transformation class. The geometric transformation class may include at least one of the following: flipping, rotation, random cropping, deformation, scaling, etc. The color transformation class may include at least one of the following: noise, blur, color transformation, erasure and fill, etc.

According to embodiments of the present disclosure, multi-sample data enhancement may include at least one of the following: SMOTE (Synthetic Minority Over-sampling Technique), Sample Pairing, Mixup, Cutout, Cutmix, Fmix and ROImix, etc.

According to embodiments of the present disclosure, data augmentation to generate new data may include data augmentation based on a generative adversarial network model. Data augmentation for learning augmentation strategies can include automatic data augmentation.

According to embodiments of the present disclosure, data enhancement can be performed on an original sample text image in the original sample text image set to obtain at least one intermediate sample text image corresponding to the original sample text image. The data enhancement applied to the respective original sample text images may be different from each other, partially the same, or completely the same. For example, the original sample text image set may include original sample text image A and original sample text image B. Data enhancement of the geometric transformation class can be performed on the original sample text image A to obtain at least one intermediate sample text image corresponding to the original sample text image A. Data enhancement of the color transformation class can be performed on the original sample text image B to obtain at least one intermediate sample text image corresponding to the original sample text image B.
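
The per-image choice of data enhancement described above can be sketched as follows. This is only an illustrative sketch: the nested-list image representation, the function names `flip_horizontal`, `invert_colors` and `augment`, and the particular choice of one geometric and one color transformation are assumptions made for the example and are not taken from the disclosure.

```python
from typing import Callable, Dict, List

# Assumption: a grayscale image is a list of rows of pixel values in 0..255.
Image = List[List[int]]


def flip_horizontal(image: Image) -> Image:
    """Geometric transformation class: mirror each row left to right."""
    return [row[::-1] for row in image]


def invert_colors(image: Image) -> Image:
    """Color transformation class: invert every pixel value."""
    return [[255 - pixel for pixel in row] for row in image]


def augment(originals: Dict[str, Image],
            plan: Dict[str, List[Callable[[Image], Image]]]) -> Dict[str, List[Image]]:
    """Apply each image's own list of enhancements and collect the
    resulting intermediate images per original image."""
    return {name: [transform(image) for transform in plan.get(name, [])]
            for name, image in originals.items()}


# Hypothetical original sample text images A and B.
originals = {"A": [[0, 10, 20], [30, 40, 50]], "B": [[0, 128, 255]]}
# A receives a geometric transformation, B receives a color transformation.
plan = {"A": [flip_horizontal], "B": [invert_colors]}
intermediates = augment(originals, plan)
```

Here `intermediates["A"]` and `intermediates["B"]` each hold one intermediate image; listing several transformations for one image would yield several intermediate images for it.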

According to an embodiment of the present disclosure, obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include: determining the intermediate sample text image set as the sample text image set. Alternatively, at least part of the original sample text image set and at least part of the intermediate sample text image set are determined as the sample text image set.

According to embodiments of the present disclosure, since different data enhancements can be performed on different original sample text images, the image diversity of the third sample text image in the third sample text image subset can be effectively guaranteed. On this basis, training the deep learning model using the third sample text image subset can improve the generalization performance of the model.

According to an embodiment of the present disclosure, obtaining a sample text image set based on the original sample text image set and the intermediate sample text image set may include the following operations.

For an original sample text image in the original sample text image set, when it is determined that the height of the original sample text image is not a predetermined height, the height of the original sample text image is adjusted to the predetermined height while keeping the aspect ratio of the original sample text image unchanged, to obtain an adjusted original sample text image. For an intermediate sample text image in the intermediate sample text image set, when it is determined that the height of the intermediate sample text image is not the predetermined height, the height of the intermediate sample text image is adjusted to the predetermined height while keeping the aspect ratio of the intermediate sample text image unchanged, to obtain an adjusted intermediate sample text image. The sample text image set is obtained according to at least one of the original sample text image set, the at least one adjusted original sample text image, the intermediate sample text image set, and the at least one adjusted intermediate sample text image.
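
The height adjustment above reduces to a small size computation, sketched below. The function name `size_at_predetermined_height` and the rounding of the new width to the nearest pixel are assumptions for illustration; the disclosure does not specify a rounding rule or a particular predetermined height.

```python
from typing import Tuple


def size_at_predetermined_height(width: int, height: int,
                                 predetermined_height: int) -> Tuple[int, int]:
    """Return the (width, height) an image should be resized to so that its
    height equals the predetermined height and its aspect ratio is kept."""
    if width <= 0 or height <= 0 or predetermined_height <= 0:
        raise ValueError("all dimensions must be positive")
    if height == predetermined_height:
        # Already at the predetermined height: no adjustment is needed.
        return width, height
    scale = predetermined_height / height
    # Assumption: round to the nearest pixel, never below one pixel wide.
    new_width = max(1, round(width * scale))
    return new_width, predetermined_height
```

For instance, with a hypothetical predetermined height of 48, a 100 x 32 image would be resized to 150 x 48. Bringing all images to one height also makes it possible to join cropped pieces from different images side by side later.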

According to an embodiment of the present disclosure, operation S210 may include the following operations.

Compare the sample text output result set of the sample text image set and the sample label set to obtain a comparison result. According to the comparison result, the sample text image set is divided into at least one sample text image subset.

According to an embodiment of the present disclosure, the comparison result may include that the relationship between the two objects satisfies the predetermined matching condition and the relationship between the two objects does not satisfy the predetermined matching condition. The two objects can refer to sample text output results and sample labels. The predetermined matching conditions can be configured according to actual business needs and are not limited here. For example, the predetermined matching condition may include two objects matching.

According to embodiments of the present disclosure, for the sample text image in the sample text image set, the sample text output result of the sample text image and the sample label can be compared to obtain a comparison result corresponding to the sample text image. According to the comparison result corresponding to the sample text image, the sample text image can be divided into a sample text image subset corresponding to the comparison result.

According to embodiments of the present disclosure, the sample text image set may include a plurality of sample text images. The at least one subset of sample text images may also include a second subset of sample text images.

According to an embodiment of the present disclosure, dividing the sample text image set into at least one sample text image subset according to the comparison result may include the following operations.

For a sample text image among the plurality of sample text images, when it is determined that the relationship between the sample text output result of the sample text image and the sample label satisfies the predetermined matching condition, the sample text image is determined to be a sample text image in the first sample text image subset. When it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, the sample text image is determined to be a sample text image in the second sample text image subset.

According to an embodiment of the present disclosure, the predetermined matching condition may be used as a basis for dividing the sample text image subsets. The predetermined matching condition may include that the difference between the sample text output result and the sample label is less than or equal to a predetermined difference threshold. The predetermined difference threshold can be configured according to actual business needs and is not limited here. For example, the predetermined difference threshold may be 0.1.

According to an embodiment of the present disclosure, the sample text images in the first sample text image subset may refer to sample text images whose sample text output results are correct sample text output results. The sample text images in the second sample text image subset may refer to sample text images whose sample text output results are incorrect sample text output results.

According to an embodiment of the present disclosure, for a sample text image among a plurality of sample text images, it is determined whether a difference between a sample text output result of the sample text image and a sample label is less than or equal to a predetermined difference threshold. When it is determined that the difference between the sample text output result of the sample text image and the sample label is less than or equal to the predetermined difference threshold, the sample text image may be determined to be a sample text image in the first sample text image subset. When it is determined that the difference between the sample text output result of the sample text image and the sample label is greater than the predetermined difference threshold, the sample text image may be determined to be a sample text image in the second sample text image subset.
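
The threshold-based division above can be sketched as follows. The disclosure does not define how the difference between an output result and a label is measured; the sketch assumes, purely for illustration, a normalized edit distance, and the names `edit_distance`, `difference` and `split_by_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def difference(output: str, label: str) -> float:
    """Assumed difference measure: edit distance divided by the longer length."""
    longest = max(len(output), len(label))
    return 0.0 if longest == 0 else edit_distance(output, label) / longest


def split_by_threshold(samples: Sequence[Tuple[str, str, str]],
                       threshold: float = 0.1) -> Tuple[List[str], List[str]]:
    """Divide (image_id, output, label) samples into the first subset
    (difference <= threshold) and the second subset (difference > threshold)."""
    first, second = [], []
    for image_id, output, label in samples:
        target = first if difference(output, label) <= threshold else second
        target.append(image_id)
    return first, second
```

With the example threshold of 0.1, an exact match goes to the first subset, while a single wrong character in a five-character label (difference 0.2) sends the image to the second subset.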

According to an embodiment of the present disclosure, the target cropping position set is determined based on the sample text output result set of the sample text image set to be cropped, and the sample text image set to be cropped is determined based on the first sample text image subset. The first sample text images in the first sample text image subset are sample text images for which the relationship between the sample text output result and the sample label satisfies the predetermined matching condition. Therefore, the accuracy of the target cropping position can be effectively ensured, and character information can be effectively prevented from being destroyed.

According to embodiments of the present disclosure, the first sample text image subset may include a plurality of first sample text images.

According to an embodiment of the present disclosure, the sample text image set to be cropped may be determined in the following manner:

For a first sample text image among the plurality of first sample text images, when it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined to be a sample text image to be cropped in the sample text image set to be cropped.

According to an embodiment of the present disclosure, the predetermined probability value and the predetermined probability threshold may be used to determine whether a first sample text image in the first sample text image subset is a sample text image to be cropped in the sample text image set to be cropped. The predetermined probability value and the predetermined probability threshold can be configured according to actual business requirements and are not limited here. The predetermined probability value may be a number greater than or equal to 0 and less than 1. The predetermined probability threshold may be a number greater than or equal to 0 and less than or equal to 1. For example, the predetermined probability threshold can be determined based on model characteristics of the deep learning model. Model characteristics may include at least one of model structural complexity, degree of fitting, and generality. For example, if the model characteristics of the deep learning model include at least one of strong generality, high complexity, and proneness to overfitting, a predetermined probability threshold with a larger value may be configured. If the model characteristics of the deep learning model include at least one of weak generality, low complexity, and proneness to underfitting, a predetermined probability threshold with a smaller value may be configured.
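
The selection of sample text images to be cropped can be sketched as below. The disclosure only says that the probability value is configured; drawing it uniformly at random per image is an assumption made for this sketch, as is the function name `select_to_be_cropped`.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_to_be_cropped(first_subset: Sequence[T],
                         probability_threshold: float,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[T], List[T]]:
    """Split the first subset into images to be cropped and images kept as-is."""
    if not 0.0 <= probability_threshold <= 1.0:
        raise ValueError("probability threshold must lie in [0, 1]")
    rng = rng or random.Random()
    to_crop: List[T] = []
    kept: List[T] = []
    for sample in first_subset:
        # Assumption: the probability value is drawn uniformly from [0, 1),
        # which matches the stated range of the predetermined probability value.
        probability_value = rng.random()
        if probability_value <= probability_threshold:
            to_crop.append(sample)
        else:
            kept.append(sample)
    return to_crop, kept
```

Under this assumption, a larger threshold sends a larger share of correctly recognized images to cropping and recombination, which is consistent with choosing a larger value for models that are prone to overfitting.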

According to embodiments of the present disclosure, the set of sample text images to be cropped may include a plurality of sample text images to be cropped.

According to an embodiment of the present disclosure, operation S220 may include the following operations.

For the sample text image to be cropped in the sample text image set to be cropped, at least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.

According to embodiments of the present disclosure, a plurality of candidate cropping positions may be determined based on the sample text output results of the sample text image to be cropped. At least one target cropping position is randomly determined from a plurality of candidate cropping positions.

According to embodiments of the present disclosure, image diversity of sample text images can be improved by randomly determining at least one target cropping position from a plurality of candidate cropping positions.

According to embodiments of the present disclosure, the sample text image set may include a plurality of sample text images.

According to embodiments of the present disclosure, the sample text recognition output result may be obtained by sequentially decoding the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by extracting global features from the first local sample feature map of the sample text image. The first local sample feature map may be obtained by extracting the first local feature from the sample text image.

According to embodiments of the present disclosure, the sample text semantic output result may be obtained by semantic understanding of the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.

According to embodiments of the present disclosure, a CRNN-based text recognition model can be used to process sample text images to obtain sample text recognition output results. CRNN can include convolutional layers, recurrent layers and transcription layers. The convolutional layer can be used to process the sample text image to obtain the first local sample feature map. The recurrent layer can be used to process the first local sample feature map to obtain the global sample feature sequence. The transcription layer can be used to process the global sample feature sequence to obtain the sample text recognition output result.
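
As an illustration of what a transcription layer may do, the sketch below performs greedy CTC-style decoding of per-frame class scores. The disclosure does not state that the transcription layer uses CTC; that choice, the blank index, and the function name `greedy_ctc_decode` are assumptions for this example. The sketch also returns the frame at which each character first appears, since such frame indices are one conceivable source of horizontal character positions.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # Assumption: class index 0 is the CTC blank symbol.


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: Sequence[str]) -> Tuple[str, List[int]]:
    """Pick the best class per frame, merge repeats, and drop blanks.

    Returns the decoded text and, for each decoded character, the index of
    the frame where that character was first emitted.
    """
    characters: List[str] = []
    frames: List[int] = []
    previous = BLANK
    for t, scores in enumerate(frame_scores):
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK and best != previous:
            characters.append(alphabet[best])
            frames.append(t)
        previous = best
    return "".join(characters), frames
```

Here `alphabet[0]` is a placeholder for the blank class, and each row of `frame_scores` would come from one step of the global sample feature sequence.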

According to an embodiment of the present disclosure, in the case where the sample text output result includes a sample text recognition output result and a sample text semantic output result, determining at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped may include the following operations.

Multiple candidate cropping positions are determined based on the sample text recognition output results of the sample text image to be cropped. Determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.

According to an embodiment of the present disclosure, for example, the sample text recognition output result of the sample text image to be cropped may be a five-character text meaning "Go to work today", whose characters are glossed here as "today", "day", "go", "up" and "class". According to the sample text recognition output result, four candidate cropping positions are determined, namely the candidate cropping position between "today" and "day", the candidate cropping position between "day" and "go", the candidate cropping position between "go" and "up", and the candidate cropping position between "up" and "class". According to the sample text semantic output result, it can be determined that "today" and "day" should not be separated, and that "up" and "class" should not be separated. Therefore, two target cropping positions can be determined from the four candidate cropping positions, namely the candidate cropping position between "day" and "go" and the candidate cropping position between "go" and "up".

According to an embodiment of the present disclosure, at least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped, thereby improving the accuracy of the target cropping position.
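
The filtering of candidate cropping positions by the semantic output result can be sketched as follows. The sketch assumes that the semantic output result is available as a segmentation of the recognized text into words that must stay intact; that representation and the function names are hypothetical.

```python
from typing import List, Sequence


def candidate_positions(text: str) -> List[int]:
    """Candidate position i means a cut between character i-1 and character i."""
    return list(range(1, len(text)))


def target_positions(text: str, words: Sequence[str]) -> List[int]:
    """Keep only the candidate positions that fall on a word boundary."""
    if "".join(words) != text:
        raise ValueError("the words must concatenate to the recognized text")
    boundaries = set()
    offset = 0
    for word in words[:-1]:
        offset += len(word)
        boundaries.add(offset)
    return [p for p in candidate_positions(text) if p in boundaries]
```

For a five-character text "ABCDE" segmented as "AB", "C", "DE", there are four candidate positions (1 to 4) and two target positions (2 and 3), which mirrors the structure of the example above: cuts inside "AB" or "DE" are rejected.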

According to an embodiment of the present disclosure, operation S230 may include the following operations.

The sample text image set to be cropped is cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.

According to an embodiment of the present disclosure, the first cropped sample text image subset may include at least one first cropped sample text image. The second subset of cropped sample text images may include at least one second cropped sample text image. The at least one target cropping position corresponding to the sample text image to be cropped may include a first target cropping position and a second target cropping position.

According to an embodiment of the present disclosure, the sample text image to be cropped in the sample text image set to be cropped may be cropped based on the first target cropping position corresponding to the sample text image to be cropped, to obtain a first cropped sample text image corresponding to the sample text image to be cropped. The sample text image to be cropped may be cropped based on the second target cropping position corresponding to the sample text image to be cropped, to obtain a second cropped sample text image corresponding to the sample text image to be cropped.
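
A minimal sketch of cropping at a target cropping position is given below. It assumes a nested-list image and a target cropping position expressed as a pixel column `x`; the function name `crop_at` and the convention of returning the left piece and the right piece are illustrative assumptions, and how these pieces map onto the first and second cropped sample text image subsets is not fixed by this sketch.

```python
from typing import List, Tuple

# Assumption: a grayscale image is a list of equally long rows of pixel values.
Image = List[List[int]]


def crop_at(image: Image, x: int) -> Tuple[Image, Image]:
    """Split an image vertically at column x into a left and a right piece."""
    width = len(image[0]) if image else 0
    if not 0 < x < width:
        raise ValueError("the cropping column must lie strictly inside the image")
    left = [row[:x] for row in image]
    right = [row[x:] for row in image]
    return left, right
```

Because the cut is made only at a position between characters, each piece keeps whole characters and the full image height.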

According to an embodiment of the present disclosure, operation S240 may include the following operations.

A third sample text image subset is obtained based on the at least one cropped sample text image subset. The target sample text image set is obtained according to the at least one sample text image subset and the third sample text image subset.

According to embodiments of the present disclosure, at least one cropped sample text image subset may be combined to obtain a third sample text image subset. The target sample text image set can be obtained according to the second sample text image subset and the third sample text image subset.

According to an embodiment of the present disclosure, obtaining a third sample text image subset based on at least one cropped sample text image subset may include the following operations.

Based on a predetermined combination strategy, the cropped sample text images in at least one cropped sample text image subset are combined to obtain a third sample text image subset.

According to an embodiment of the present disclosure, the predetermined combination strategy may refer to a strategy for combining cropped sample text images. For example, the predetermined combination strategy may include at least one of the following: a random combination strategy and a fixed combination strategy. The third sample text image subset may include at least one third sample text image. The third sample text image may be the same as or different from the sample text image in the sample text image set.

According to an embodiment of the present disclosure, for a cropped sample text image in a cropped sample text image subset of the at least one cropped sample text image subset, the cropped sample text image may be combined with cropped sample text images in other cropped sample text image subsets to obtain at least one third sample text image. The other cropped sample text image subsets may be any one or more cropped sample text image subsets in the at least one cropped sample text image subset other than the cropped sample text image subset.

For example, the at least one cropped sample text image subset may include a first cropped sample text image subset and a second cropped sample text image subset. The first cropped sample text image subset may represent a cropped sample text image subset in the first direction. The second cropped sample text image subset may represent a cropped sample text image subset in the second direction. The first direction may refer to the right direction. The second direction may refer to the left direction. For a first cropped sample text image in the first cropped sample text image subset, the first cropped sample text image may be combined with at least one second cropped sample text image in the second cropped sample text image subset to obtain at least one third sample text image.

According to an embodiment of the present disclosure, since the third sample text image subset is obtained by combining cropped sample text images in at least one cropped sample text image subset based on a predetermined combination strategy, a random combination of cropped sample text images is achieved, which improves the image background complexity and image diversity of the third sample text images in the third sample text image subset. On this basis, using the third sample text image subset to train the deep learning model can improve the generalization performance of the model.
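
A random combination strategy can be sketched as below. The sketch assumes nested-list images of equal height that are joined side by side, and pairs each left piece with a randomly chosen right piece by shuffling; the names `concat_horizontal` and `combine_randomly` and this particular pairing scheme are assumptions, not the strategy defined by the disclosure.

```python
import random
from typing import List, Optional, Sequence

# Assumption: a grayscale image is a list of equally long rows of pixel values.
Image = List[List[int]]


def concat_horizontal(left: Image, right: Image) -> Image:
    """Join two images of the same height side by side."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [left_row + right_row for left_row, right_row in zip(left, right)]


def combine_randomly(left_pieces: Sequence[Image],
                     right_pieces: Sequence[Image],
                     rng: Optional[random.Random] = None) -> List[Image]:
    """Pair every left piece with a randomly assigned right piece."""
    rng = rng or random.Random()
    shuffled = list(right_pieces)
    rng.shuffle(shuffled)
    return [concat_horizontal(left, right)
            for left, right in zip(left_pieces, shuffled)]
```

Pieces that originate from different images bring different backgrounds into one combined image, which is where the added background complexity comes from.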

The sample label set of the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample label subset. A target sample label set is obtained based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset.

According to an embodiment of the present disclosure, obtaining a target sample label set based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset may include the following operations.

According to at least one cropped sample label subset, a sample label subset corresponding to the third sample text image subset is obtained. A target sample label set is obtained according to the sample label subset corresponding to at least one sample text image subset and the sample label subset corresponding to the third sample text image subset.

According to an embodiment of the present disclosure, obtaining a sample label subset corresponding to the third sample text image subset based on at least one cropped sample label subset may include the following operations.

Based on a predetermined combination strategy, the cropped sample labels in at least one cropped sample label subset are combined to obtain a sample label subset corresponding to the third sample text image subset.
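
The label handling above mirrors the image handling: labels are cut at the same character boundary as the image and recombined with the same pairing. The sketch below assumes labels are plain strings and that the pairing used for the images is available as index pairs; `crop_label` and `combine_labels` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def crop_label(label: str, char_index: int) -> Tuple[str, str]:
    """Cut a label between character char_index-1 and character char_index."""
    if not 0 < char_index < len(label):
        raise ValueError("the cut must lie strictly inside the label")
    return label[:char_index], label[char_index:]


def combine_labels(left_labels: Sequence[str],
                   right_labels: Sequence[str],
                   pairing: Sequence[Tuple[int, int]]) -> List[str]:
    """Build one combined label per (left index, right index) pair, using the
    same pairing that was applied to the cropped images."""
    return [left_labels[i] + right_labels[j] for i, j in pairing]


# Hypothetical labels of two sample text images to be cropped.
left_a, right_a = crop_label("ABCD", 2)
left_b, right_b = crop_label("EF", 1)
combined_labels = combine_labels([left_a, left_b], [right_a, right_b],
                                 [(0, 1), (1, 0)])
```

Reusing one pairing for images and labels keeps every third sample text image consistent with its sample label.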

The text image generation method according to the embodiment of the present disclosure will be further described below with reference to FIG. 3A, FIG. 3B, FIG. 3C, FIG. 3D, and FIG. 3E in conjunction with specific embodiments.

FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure.

As shown in Figure 3A, in step 300A, the sample text image set 303 is divided into a first sample text image subset 303_1 and a second sample text image subset 303_2 according to the sample text output result set 301 and the sample label set 302 of the sample text image set 303. The sample text image set 304 to be cropped is determined according to the first sample text image subset 303_1.

According to the sample text output result set 305 of the sample text image set 304 to be cropped, the target cropping position set 306 of the sample text image set 304 to be cropped is determined. The to-be-cropped sample text image set 304 is cropped based on the target cropping position set 306 to obtain at least one cropped sample text image subset 307. According to at least one cropped sample text image subset 307, the first sample text image subset 303_1 and the second sample text image subset 303_2, a target sample text image set 308 is obtained.

FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the present disclosure.

As shown in FIG. 3B, in 300B, the sample text image set 309 to be cropped may include a sample text image to be cropped 309_1 and a sample text image to be cropped 309_2.

According to the sample text output result of the sample text image 309_1 to be cropped, the target cropping position is determined from multiple candidate cropping positions to be the position between the characters "Ying" and "Bai". The sample text image 309_1 to be cropped is cropped based on the target cropping position to obtain a cropped sample text image 309_1_1 and a cropped sample text image 309_1_2. The cropped sample text image 309_1_1 is a sample text image corresponding to "Muying" ("mother and baby"). The cropped sample text image 309_1_2 is a sample text image corresponding to "Baihui".

According to the sample text output result of the sample text image 309_2 to be cropped, the target cropping position is determined from multiple candidate cropping positions to be the position between the characters "Zhuan" and "Rang" (which together mean "transfer"). The sample text image 309_2 to be cropped is cropped based on the target cropping position to obtain a cropped sample text image 309_2_1 and a cropped sample text image 309_2_2. The cropped sample text image 309_2_1 is a sample text image corresponding to "Zhuan". The cropped sample text image 309_2_2 is a sample text image corresponding to "Rang".

Based on the predetermined combination strategy, the cropped sample text image 309_1_1 and the cropped sample text image 309_2_2 are combined to obtain the third sample text image 310_1 in the third sample text image subset 310, and the cropped sample text image 309_1_2 and the cropped sample text image 309_2_1 are combined to obtain the third sample text image 310_2 in the third sample text image subset 310. The third sample text image 310_1 is a sample text image corresponding to "Muying Rang" (the "mother and baby" piece followed by "Rang"). The third sample text image 310_2 is a sample text image corresponding to "Zhuan Baihui".

FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.

As shown in FIG. 3C, in 300C, what is different from FIG. 3B is the order in which the cropped sample text images are combined: the third sample text image 311_1 is a sample text image corresponding to "Rang Muying" ("Rang" followed by the "mother and baby" piece). The third sample text image 311_2 is a sample text image corresponding to "Baihui Zhuan".

FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.

As shown in Figure 3D, in 300D, what is different from Figure 3B is that, based on the predetermined combination strategy, the cropped sample text image 309_1_1 and the cropped sample text image 309_2_1 are combined to obtain the third sample text image 312_1 in the third sample text image subset 312, and the cropped sample text image 309_1_2 and the cropped sample text image 309_2_2 are combined to obtain the third sample text image 312_2 in the third sample text image subset 312. The third sample text image 312_1 is a sample text image corresponding to "Muying Zhuan" (the "mother and baby" piece followed by "Zhuan"). The third sample text image 312_2 is a sample text image corresponding to "Baihui Rang".

FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.

As shown in FIG. 3E, in 300E, what is different from FIG. 3D is the order in which the cropped sample text images are combined: the third sample text image 313_1 is a sample text image corresponding to "Zhuan Muying" ("Zhuan" followed by the "mother and baby" piece). The third sample text image 313_2 is a sample text image corresponding to "Rang Baihui".
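
Taken together, FIGS. 3B to 3E cover every way of joining one piece of the first image with one piece of the second image, in either order. The sketch below enumerates these combinations at the label level. The romanized names "Muying", "Baihui", "Zhuan" and "Rang" are stand-ins for the text of the four cropped pieces 309_1_1, 309_1_2, 309_2_1 and 309_2_2, and the function name `cross_image_combinations` is hypothetical.

```python
from itertools import product
from typing import List, Sequence


def cross_image_combinations(pieces_a: Sequence[str],
                             pieces_b: Sequence[str]) -> List[str]:
    """List every combination of one piece from each image, in both orders."""
    combinations: List[str] = []
    for piece_a, piece_b in product(pieces_a, pieces_b):
        combinations.append(f"{piece_a} {piece_b}")
        combinations.append(f"{piece_b} {piece_a}")
    return combinations


# Stand-in labels for the pieces of images 309_1 and 309_2.
labels = cross_image_combinations(["Muying", "Baihui"], ["Zhuan", "Rang"])
```

The eight resulting labels correspond to the eight third sample text images shown across the four figures, two per figure.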

Figure 4 schematically shows a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure.

As shown in Figure 4, the method 400 may include operations S410 to S420.

In operation S410, a target sample text image set is obtained.

In operation S420, a deep learning model is trained using the target sample text image set to obtain a text image processing model.

According to embodiments of the present disclosure, the target sample text image set may be obtained according to the text image generation method described in the embodiments of the present disclosure.

According to an embodiment of the present disclosure, the target cropping position set used to obtain the target sample text image set is determined based on the sample text output result set of the sample text image set to be cropped, and the sample text image set to be cropped is determined based on the first sample text image subset. The sample text images in the first sample text image subset are sample text images that have correct sample text output results and are determined from the sample text image set according to the sample text output result set and the sample label set of the sample text image set. Therefore, the accuracy of the target cropping position can be effectively ensured, and character information can be effectively prevented from being destroyed. On this basis, the target sample text image set is obtained based on at least one cropped sample text image subset and at least one sample text image subset, so that a target sample text image set with richer contextual information can be obtained. As a result, using the target sample text image set for subsequent model training and optimization can reduce the number of model iterations and increase the training speed of the model, thereby reducing the data processing volume and resource consumption of the electronic device. This achieves the effect of improving the internal performance of the electronic device in conformity with the laws of nature, thereby enhancing the core competitiveness of the electronic device.

FIG. 5 schematically shows a flowchart of a text image processing method according to an embodiment of the present disclosure.

As shown in Figure 5, the method 500 includes operations S510 to S520.

In operation S510, a text image to be processed is obtained.

In operation S520, the text image to be processed is input into the text image processing model to obtain a text image processing result.

According to embodiments of the present disclosure, the text image processing model may be trained according to the deep learning model training method described in the embodiments of the present disclosure.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of user personal information comply with relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.

The above are only exemplary embodiments, and the present disclosure is not limited thereto. Other text image generation methods, deep learning model training methods and text image processing methods known in the art may also be included, as long as the accuracy of the target cropping position can be ensured and a target sample text image set with richer contextual information can be obtained.

FIG. 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure.

As shown in FIG. 6, the text image generating device 600 may include a dividing module 610, a determining module 620, a first obtaining module 630 and a second obtaining module 640.

The dividing module 610 is configured to divide the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set. At least one sample text image subset includes a first sample text image subset. The first sample text image subset includes sample text images with correct sample text output results.

The determination module 620 is configured to determine a target cropping position set of the sample text image set to be cropped based on the sample text output result set of the sample text image set to be cropped. The set of sample text images to be cropped is determined based on the first subset of sample text images.

The first obtaining module 630 is configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset.

The second obtaining module 640 is configured to obtain a target sample text image set based on at least one cropped sample text image subset and at least one sample text image subset.

According to an embodiment of the present disclosure, the dividing module 610 may include a comparison sub-module and a dividing sub-module.

The comparison submodule is used to compare the sample text output result set of the sample text image set and the sample label set to obtain the comparison result.

The dividing submodule is used to divide the sample text image set into at least one sample text image subset according to the comparison result.

According to an embodiment of the present disclosure, the sample text image set includes a plurality of sample text images, and at least one sample text image subset further includes a second sample text image subset.

According to an embodiment of the present disclosure, for the sample text image among the plurality of sample text images, the dividing sub-module may include a first determination unit and a second determination unit.

The first determination unit is configured to determine the sample text image as a sample text image in the first sample text image subset when it is determined that the relationship between the sample text output result of the sample text image and the sample label satisfies the predetermined matching condition.

The second determination unit is configured to determine the sample text image as a sample text image in the second sample text image subset when it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition.
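
The division performed by the two determination units can be illustrated with a minimal Python sketch. The `Sample` record, the function names, and the use of exact string equality as the predetermined matching condition are all assumptions made for illustration; the disclosure only requires that some predetermined matching condition be evaluated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """Hypothetical record pairing an image identifier with its text data."""
    image_id: str
    output_text: str  # sample text output result produced by a model
    label: str        # sample label (ground-truth text)


def exact_match(output_text: str, label: str) -> bool:
    # Assumed matching condition: the output equals the label exactly.
    return output_text == label


def divide_samples(
    samples: List[Sample],
    matches: Callable[[str, str], bool] = exact_match,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (first subset, second subset)."""
    first_subset: List[Sample] = []   # condition satisfied
    second_subset: List[Sample] = []  # condition not satisfied
    for sample in samples:
        if matches(sample.output_text, sample.label):
            first_subset.append(sample)
        else:
            second_subset.append(sample)
    return first_subset, second_subset
```

Passing a different `matches` callable (for example, one that tolerates case differences) would change the matching condition without changing the division logic.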

According to an embodiment of the present disclosure, for the sample text image to be cropped in the set of sample text images to be cropped, the determining module 620 may include a determining sub-module.

The determining submodule is configured to determine at least one target cropping position from a plurality of candidate cropping positions based on the sample text output result of the sample text image to be cropped.

According to an embodiment of the present disclosure, the sample text output result may include at least one of the following: a sample text recognition output result and a sample text semantic output result.

According to embodiments of the present disclosure, the sample text semantic output result may be obtained by performing semantic understanding on a second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.

According to an embodiment of the present disclosure, in the case where the sample text output result includes a sample text recognition output result and a sample text semantic output result, the determination sub-module may include a third determination unit and a fourth determination unit.

The third determination unit is used to determine multiple candidate cropping positions based on the sample text recognition output result of the sample text image to be cropped.

The fourth determination unit is configured to determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
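
The two-stage selection carried out by the third and fourth determination units may be sketched as follows. The disclosure does not specify the data layout of either output result, so this sketch assumes that the recognition output yields a horizontal span for each recognized character and that the semantic output yields one boolean per gap indicating whether a semantic unit (such as a word) ends there. Both representations and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

CharSpan = Tuple[int, int]  # assumed (x_start, x_end) of one recognized character


def candidate_positions(char_spans: Sequence[CharSpan]) -> List[int]:
    """Candidate crop x-coordinates: midpoints of gaps between adjacent characters."""
    candidates = []
    for (_, left_end), (right_start, _) in zip(char_spans, char_spans[1:]):
        candidates.append((left_end + right_start) // 2)
    return candidates


def target_positions(
    candidates: Sequence[int], unit_ends_after: Sequence[bool]
) -> List[int]:
    """Keep the candidates that the assumed semantic output marks as boundaries.

    unit_ends_after[i] is True when a semantic unit ends after character i,
    so it aligns with the gap between characters i and i + 1.
    """
    return [x for x, is_boundary in zip(candidates, unit_ends_after) if is_boundary]
```

Under these assumptions, cutting only at semantic boundaries keeps each cropped piece free of split characters or split words.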

According to an embodiment of the present disclosure, the first obtaining module 630 may include a first obtaining sub-module.

The first obtaining submodule is used to crop the sample text image set to be cropped based on the target cropping position set, and obtain the first cropped sample text image subset and the second cropped sample text image subset.
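
A minimal sketch of the cropping step is given below. It assumes one target cropping position per image, represents an image as a list of pixel rows in place of a real image type, and assumes that the first cropped subset collects the left-hand parts and the second cropped subset collects the right-hand parts. None of these choices is stated in the disclosure.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image


def crop_at(image: Image, x: int) -> Tuple[Image, Image]:
    """Split an image vertically at column x into left and right parts."""
    width = len(image[0])
    if not 0 < x < width:
        raise ValueError("crop position must fall strictly inside the image")
    left = [row[:x] for row in image]
    right = [row[x:] for row in image]
    return left, right


def crop_set(
    images: List[Image], positions: List[int]
) -> Tuple[List[Image], List[Image]]:
    """Crop every image at its target position (assumed: one position each)."""
    first_cropped: List[Image] = []   # assumed to hold the left parts
    second_cropped: List[Image] = []  # assumed to hold the right parts
    for image, x in zip(images, positions):
        left, right = crop_at(image, x)
        first_cropped.append(left)
        second_cropped.append(right)
    return first_cropped, second_cropped
```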

According to an embodiment of the present disclosure, the second obtaining module 640 may include a second obtaining sub-module and a third obtaining sub-module.

The second obtaining submodule is used to obtain a third sample text image subset based on at least one cropped sample text image subset.

The third obtaining submodule is used to obtain a target sample text image set based on at least one sample text image subset and a third sample text image subset.

According to embodiments of the present disclosure, the second obtaining sub-module may include an obtaining unit.

The obtaining unit is configured to combine the cropped sample text images in at least one cropped sample text image subset based on a predetermined combination strategy to obtain a third sample text image subset.
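
One possible combination strategy is sketched below purely as an illustration. It assumes that each cropped piece carries the text it contains, that all pieces share the same height, and that the strategy joins the left piece of one image with the right piece of the next image in cyclic order. The disclosure leaves the predetermined combination strategy open, so the pairing rule and all names here are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]    # rows of pixel values; a stand-in for a real image
Piece = Tuple[Image, str]  # (cropped image, text that the piece contains)


def hconcat(left: Image, right: Image) -> Image:
    """Concatenate two images of equal height side by side."""
    if len(left) != len(right):
        raise ValueError("pieces must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def combine(first_cropped: List[Piece], second_cropped: List[Piece]) -> List[Piece]:
    """Assumed strategy: join left piece i with right piece i + 1 (cyclically)."""
    if not second_cropped:
        return []
    n = len(second_cropped)
    combined: List[Piece] = []
    for i, (left_img, left_text) in enumerate(first_cropped):
        right_img, right_text = second_cropped[(i + 1) % n]
        combined.append((hconcat(left_img, right_img), left_text + right_text))
    return combined
```

Because each combined image mixes content from two source images, its label is formed by concatenating the texts of the two pieces.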

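According to an embodiment of the present disclosure, the first sample text image subset may include a plurality of first sample text images, and the sample text image set to be cropped may be determined in the following way.

For a first sample text image among the plurality of first sample text images, when it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined as a sample text image to be cropped in the set of sample text images to be cropped.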
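
The threshold test above can be sketched in a few lines of Python. This passage does not state how the predetermined probability value of an image is produced, so the sketch assumes it is drawn uniformly at random for each image; the function name `select_to_crop` and the optional `rng` parameter are likewise hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_to_crop(
    first_subset: Sequence[T],
    threshold: float,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return the images whose probability value is at most the threshold."""
    rng = rng or random.Random()
    selected: List[T] = []
    for image in first_subset:
        probability_value = rng.random()  # assumed: uniform draw from [0, 1)
        if probability_value <= threshold:
            selected.append(image)
    return selected
```

Under this assumption the threshold directly controls the expected fraction of correctly recognized images that are sent to cropping; passing a seeded `random.Random` makes the selection reproducible.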

According to an embodiment of the present disclosure, the text image generating device may further include a third obtaining module and a fourth obtaining module.

The third obtaining module is used to perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set.

The fourth obtaining module is used to obtain a sample text image set based on the original sample text image set and the intermediate sample text image set.
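
The relationship between the original, intermediate and resulting sample text image sets may be sketched as follows. The brightness shift used as the data enhancement and the simple concatenation of the two sets are assumptions chosen for illustration; the disclosure names neither a specific enhancement nor a specific merging rule.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]      # rows of pixel values; a stand-in for a real image
Labeled = Tuple[Image, str]  # (image, sample label)


def brighten(image: Image, delta: int = 20) -> Image:
    """Assumed enhancement: shift pixel values, clamped to the 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def build_sample_set(
    original: List[Labeled],
    enhance: Callable[[Image], Image] = brighten,
) -> List[Labeled]:
    """Return the original samples followed by their enhanced copies."""
    # The enhancement leaves the text unchanged, so each label is reused as is.
    intermediate = [(enhance(image), label) for image, label in original]
    return list(original) + intermediate
```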

According to an embodiment of the present disclosure, the sample text image set may be a text image set of a text vision task.

Figure 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure.

As shown in FIG. 7, the deep learning model training device 700 may include a first acquisition module 710 and a fifth obtaining module 720.

The first acquisition module 710 is used to acquire the target sample text image set.

The fifth obtaining module 720 is used to train a deep learning model using the target sample text image set to obtain a text image processing model.

According to an embodiment of the present disclosure, the target sample text image set may be obtained by the text image generating device described in the embodiments of the present disclosure.

FIG. 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure.

As shown in FIG. 8, the text image processing device 800 may include a second acquisition module 810 and a sixth obtaining module 820.

The second acquisition module 810 is used to acquire text images to be processed.

The sixth obtaining module 820 is used to input the text image to be processed into the text image processing model to obtain the text image processing result.

According to embodiments of the present disclosure, the text image processing model may be trained by the training device of the deep learning model described in the embodiments of the present disclosure.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method as described above.

According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method as described above.

According to an embodiment of the present disclosure, a computer program product includes a computer program, and when executed by a processor, the computer program implements the method as described above.

FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a text image generation method, a deep learning model training method, and a text image processing method according to an embodiment of the present disclosure. Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902 and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Multiple components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard, a mouse, etc.; an output unit 907, such as various types of displays, speakers, etc.; a storage unit 908, such as a magnetic disk, an optical disk, etc.; and a communication unit 909, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.

The computing unit 901 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, for example, the text image generation method, the deep learning model training method, and the text image processing method. For example, in some embodiments, the text image generation method, the deep learning model training method and the text image processing method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text image generation method, the deep learning model training method and the text image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text image generation method, the deep learning model training method, and the text image processing method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other. The server can be a cloud server, a distributed system server, or a server combined with a blockchain.

It should be understood that steps may be reordered, added or deleted using the various forms of the processes shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, and no limitation is imposed here as long as the desired results of the technical solution disclosed in the present disclosure can be achieved.

The above-mentioned specific embodiments do not constitute a limitation on the scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, subcombinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of this disclosure shall be included in the protection scope of this disclosure.

Claims

A method for generating text images, including:

According to the sample text output result set and the sample label set of the sample text image set, the sample text image set is divided into at least one sample text image subset, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images with correct sample text output results;

Determine the target cropping position set of the sample text image set to be cropped according to the sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset;

Crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and

According to the at least one cropped sample text image subset and the at least one sample text image subset, a target sample text image set is obtained.
The method according to claim 1, wherein dividing the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set includes:

Compare the sample text output result set and the sample label set of the sample text image set to obtain a comparison result; and

According to the comparison result, the sample text image set is divided into the at least one sample text image subset.
The method of claim 2, wherein the sample text image set includes a plurality of sample text images, and the at least one sample text image subset further includes a second sample text image subset;

Wherein, according to the comparison result, dividing the sample text image set into the at least one sample text image subset includes:

For a sample text image among the plurality of sample text images,

When it is determined that the relationship between the sample text output result of the sample text image and the sample label satisfies the predetermined matching condition, the sample text image is determined to be a sample text image in the first sample text image subset; and

When it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, the sample text image is determined to be a sample text image in the second sample text image subset.
The method according to any one of claims 1 to 3, wherein the sample text image set to be cropped includes a plurality of sample text images to be cropped;

Wherein, determining the target cropping position set of the sample text image set to be cropped based on the sample text output result set of the sample text image set to be cropped includes:

For the sample text image to be cropped in the set of sample text images to be cropped,

At least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
The method according to claim 4, wherein the sample text output result includes at least one of the following: a sample text recognition output result and a sample text semantic output result.
The method of claim 5, wherein the set of sample text images includes a plurality of sample text images;

Wherein, the sample text recognition output result is obtained by decoding the global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on the first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;

Wherein, the sample text semantic output result is obtained by performing semantic understanding on the second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
The method according to claim 5, wherein, in the case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, determining at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped includes:

Determine the plurality of candidate cropping positions according to the sample text recognition output result of the sample text image to be cropped; and

At least one target cropping position is determined from the plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
The method according to any one of claims 1 to 3, wherein the cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset includes:

The sample text image set to be cropped is cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
The method according to any one of claims 1 to 3, wherein obtaining the target sample text image set based on the at least one cropped sample text image subset and the at least one sample text image subset includes:

Obtain a third sample text image subset according to the at least one cropped sample text image subset; and

The target sample text image set is obtained according to the at least one sample text image subset and the third sample text image subset.
The method of claim 9, wherein obtaining a third sample text image subset based on the at least one cropped sample text image subset includes:

Based on a predetermined combination strategy, the cropped sample text images in the at least one cropped sample text image subset are combined to obtain the third sample text image subset.
The method according to any one of claims 1 to 3, wherein the first sample text image subset includes a plurality of first sample text images;

Wherein, the sample text image set to be cropped is determined in the following way:

For a first sample text image among the plurality of first sample text images,

If it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined as the sample text image to be cropped in the set of sample text images to be cropped.
The method according to any one of claims 1 to 3, further comprising:

Perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and

The sample text image set is obtained according to the original sample text image set and the intermediate sample text image set.
The method according to any one of claims 1 to 3, wherein the sample text image set is a text image set of a text vision task.
A training method for a deep learning model, including:

Obtain the target sample text image set; and

Use the target sample text image set to train the deep learning model to obtain a text image processing model,

Wherein, the target sample text image set is obtained by using the method according to any one of claims 1 to 13.
A text image processing method, including:

Get the text image to be processed; and

Input the text image to be processed into the text image processing model to obtain the text image processing result,

Wherein, the text image processing model is trained using the method according to claim 14.
A text image generating device, including:

A dividing module, configured to divide the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images with correct sample text output results;

A determination module, configured to determine a target cropping position set of the sample text image set to be cropped based on a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset;

A first obtaining module, configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and

The second obtaining module is configured to obtain a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
The device according to claim 16, wherein the dividing module includes:

A comparison submodule, used to compare the sample text output result set of the sample text image set and the sample label set to obtain a comparison result; and

A dividing submodule, configured to divide the sample text image set into the at least one sample text image subset according to the comparison result.
The apparatus of claim 17, wherein the set of sample text images includes a plurality of sample text images, and the at least one subset of sample text images further includes a second subset of sample text images;

Wherein, for the sample text image among the plurality of sample text images, the dividing sub-module includes:

A first determination unit, configured to determine the sample text image as a sample text image in the first sample text image subset when it is determined that the relationship between the sample text output result of the sample text image and the sample label satisfies a predetermined matching condition; and

A second determination unit, configured to determine the sample text image as a sample text image in the second sample text image subset when it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition.
The device according to any one of claims 16 to 18, wherein the sample text image set to be cropped includes a plurality of sample text images to be cropped;

Wherein, for the sample text image to be cropped in the sample text image set to be cropped, the determination module includes:

Determining submodule, configured to determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
The device according to claim 19, wherein the sample text output result includes at least one of the following: a sample text recognition output result and a sample text semantic output result.
The device of claim 20, wherein the set of sample text images includes a plurality of sample text images;

Wherein, the sample text recognition output result is obtained by decoding the global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on the first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;

Wherein, the sample text semantic output result is obtained by performing semantic understanding on the second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
The device according to claim 20, wherein, in the case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, the determination sub-module includes:

A third determination unit configured to determine the plurality of candidate cropping positions based on the sample text recognition output result of the sample text image to be cropped; and

A fourth determination unit configured to determine at least one target cropping position from the plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
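The two-stage selection carried out by the third and fourth determination units might look like the sketch below. The claim does not fix the form of either output, so everything concrete here is an assumption: the recognition output is taken to be a per-time-step symbol sequence with `-` as a blank symbol, candidate positions are the blank steps, and the semantic output is taken to be a list of hypothetical `(start, end)` spans marking semantic units that should not be cut.

```python
from typing import List, Tuple

BLANK = "-"  # assumed blank symbol in the per-step recognition output


def candidate_positions(step_symbols: List[str]) -> List[int]:
    """Candidate cropping positions: time steps where the recognizer emits a blank."""
    return [i for i, sym in enumerate(step_symbols) if sym == BLANK]


def target_positions(
    candidates: List[int],
    protected_spans: List[Tuple[int, int]],
) -> List[int]:
    """Keep only candidates that do not fall strictly inside a protected span."""
    return [
        pos
        for pos in candidates
        if not any(start < pos < end for start, end in protected_spans)
    ]


# Steps 0..9 of an assumed recognition output for the text "hi you".
steps = list("hh-i--yo-u")
candidates = candidate_positions(steps)
# Assumed semantic output: "hi" spans steps 0..3 and "you" spans steps 6..9.
targets = target_positions(candidates, [(0, 3), (6, 9)])
```

In this example the blanks inside the two words are discarded, and only the blanks between the words remain as target cropping positions.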
The device according to any one of claims 16 to 18, wherein the first acquisition module includes:

The first obtaining sub-module is used to crop the sample text image set to be cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
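As an illustration of the first obtaining sub-module, the sketch below cuts each image at one target cropping position and collects the two parts into two subsets. Representing an image as a list of pixel rows, treating the cropping position as a column index, and assigning left parts to the first cropped subset and right parts to the second are assumptions made for the example; the claim does not specify them.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: rows of pixel values


def crop_at(image: Image, x: int) -> Tuple[Image, Image]:
    """Split an image at column x into a left part and a right part."""
    if not image or not 0 < x < len(image[0]):
        raise ValueError("cropping position must fall strictly inside the image width")
    left = [row[:x] for row in image]
    right = [row[x:] for row in image]
    return left, right


def crop_set(images: List[Image], positions: List[int]) -> Tuple[List[Image], List[Image]]:
    """Crop every image at its own position; return (first_subset, second_subset)."""
    first, second = [], []
    for image, x in zip(images, positions):
        left, right = crop_at(image, x)
        first.append(left)
        second.append(right)
    return first, second


images = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[9, 8, 7], [6, 5, 4]],
]
first_cropped, second_cropped = crop_set(images, [1, 2])
```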
The device according to any one of claims 16 to 18, wherein the second acquisition module includes:

a second obtaining submodule, configured to obtain a third sample text image subset according to the at least one cropped sample text image subset; and

The third obtaining sub-module is used to obtain the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
The device according to claim 24, wherein the second obtaining sub-module includes:

The obtaining unit is configured to combine the cropped sample text images in the at least one cropped sample text image subset based on a predetermined combination strategy to obtain the third sample text image subset.
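One possible combination strategy is sketched below. The claim only requires a predetermined strategy, so the specific choice here is an assumption: each piece from one cropped subset is joined horizontally with a piece taken from a different image in the other subset, using a fixed index shift. The `Piece` type, which pairs pixel rows with the text the piece carries, is hypothetical.

```python
from typing import List, Tuple

Piece = Tuple[List[List[int]], str]  # (pixel rows, text carried by the piece)


def hconcat(a: Piece, b: Piece) -> Piece:
    """Join two pieces side by side and concatenate their text."""
    (pixels_a, text_a), (pixels_b, text_b) = a, b
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pieces must have the same height")
    return [ra + rb for ra, rb in zip(pixels_a, pixels_b)], text_a + text_b


def combine(first: List[Piece], second: List[Piece], shift: int = 1) -> List[Piece]:
    """Pair first[i] with second[(i + shift) % n] so that parts of different images meet."""
    if len(first) != len(second):
        raise ValueError("subsets must have the same size")
    n = len(first)
    return [hconcat(first[i], second[(i + shift) % n]) for i in range(n)]


first = [([[1, 1]], "ab"), ([[2, 2]], "cd")]
second = [([[3]], "X"), ([[4]], "Y")]
third_subset = combine(first, second)
```

A random pairing would serve the same purpose; the fixed shift is used here only to keep the example deterministic.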
The device according to any one of claims 16 to 18, wherein the first sample text image set includes a plurality of first sample text images;

Wherein, the sample text image set to be cropped is determined in the following way:

For a first sample text image among the plurality of first sample text images,

If it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined as the sample text image to be cropped in the set of sample text images to be cropped.
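The threshold test above can be sketched in a few lines. The identifiers, the `select_for_cropping` helper, and the example values are hypothetical; how the predetermined probability value of each image is assigned is not stated in this claim, so the values are simply passed in.

```python
from typing import Dict, List


def select_for_cropping(
    image_ids: List[str],
    probability_values: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Keep the images whose probability value is at most the threshold."""
    return [i for i in image_ids if probability_values[i] <= threshold]


first_subset_ids = ["img0", "img1", "img2"]
values = {"img0": 0.2, "img1": 0.5, "img2": 0.9}  # assumed example values
to_crop = select_for_cropping(first_subset_ids, values, threshold=0.5)
```

Note that the comparison is inclusive, matching the "less than or equal to" wording of the claim.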
The device according to any one of claims 16 to 18, further comprising:

The third acquisition module is used to perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and

The fourth obtaining module is used to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
The device according to any one of claims 16 to 18, wherein the sample text image set is a text image set of a text vision task.
A training device for a deep learning model, including:

The first acquisition module is used to acquire the target sample text image set; and

The fifth acquisition module is used to train the deep learning model using the target sample text image set to obtain a text image processing model,

Wherein, the target sample text image set is obtained by using the device according to any one of claims 16 to 28.
A text image processing device, including:

The second acquisition module is used to acquire the text image to be processed; and

The sixth acquisition module is used to input the text image to be processed into the text image processing model to obtain the text image processing result,

Wherein, the text image processing model is trained using the device according to claim 29.
An electronic device including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method according to any one of claims 1 to 15.
A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method according to any one of claims 1 to 15.
A computer program product, including a computer program, characterized in that when the computer program is executed by a processor, the steps of the method described in any one of claims 1 to 15 are implemented.