CN115082598A - Text image generation method, text image training method, text image processing method and electronic equipment - Google Patents


Info

Publication number
CN115082598A
Authority
CN
China
Prior art keywords
sample text
text image
sample
subset
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211015424.6A
Other languages
Chinese (zh)
Other versions
CN115082598B (en)
Inventor
郭若愚
杜宇宁
赖宝华
马艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211015424.6A priority Critical patent/CN115082598B/en
Publication of CN115082598A publication Critical patent/CN115082598A/en
Priority to PCT/CN2023/074125 priority patent/WO2024040870A1/en
Application granted granted Critical
Publication of CN115082598B publication Critical patent/CN115082598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a text image generation method, a method for training a deep learning model, a text image processing method, and an electronic device, and relates to the technical field of artificial intelligence. A specific implementation scheme is as follows: a sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set; a target cropping position set of a sample text image set to be cropped is determined according to the sample text output result set of the sample text image set to be cropped; the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a target sample text image set is obtained from the at least one cropped sample text image subset and the at least one sample text image subset. The method effectively guarantees the accuracy of the target cropping positions, effectively prevents character information from being damaged, and improves the background complexity and image diversity of the sample text images in the target sample text image set.

Description

Text image generation method, text image training method, text image processing method and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to the fields of computer vision and deep learning, and can be applied to optical character recognition scenarios. More particularly, it relates to a text image generation method, a method for training a deep learning model, a text image processing method, and an electronic device.
Background
With the development of computer technology, artificial intelligence technology has also been developed. Artificial intelligence techniques may include computer vision techniques, speech recognition techniques, natural language processing techniques, machine learning, deep learning, big data processing techniques, knowledge-graph techniques, and the like.
Artificial intelligence technology has found wide application in a variety of fields. For example, text images for training the deep learning model may be generated using artificial intelligence techniques.
Disclosure of Invention
The invention provides a text image generation method, a text image training method, a text image processing method and electronic equipment.
According to an aspect of the present invention, there is provided a text image generation method including: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined from the first sample text image subset; cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and obtaining a target sample text image set from the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present invention, there is provided a training method of a deep learning model, including: acquiring a target sample text image set; and training the deep learning model by using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained by using the method according to the invention.
According to another aspect of the present invention, there is provided a text image processing method including: acquiring a text image to be processed; and inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is obtained by utilizing the method for training.
According to another aspect of the present invention, there is provided a text image generation apparatus including: a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, where the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; a determining module configured to determine a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, where the sample text image set to be cropped is determined from the first sample text image subset; a first obtaining module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a second obtaining module configured to obtain a target sample text image set from the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present invention, there is provided a training apparatus for deep learning models, including: the first acquisition module is used for acquiring a target sample text image set; and a third obtaining module, configured to train the deep learning model with the target sample text image set to obtain a text image processing model, where the target sample text image set is obtained by using the apparatus according to the present invention.
According to another aspect of the present invention, there is provided a text image processing apparatus including: the second acquisition module is used for acquiring a text image to be processed; and a fourth obtaining module, configured to input the to-be-processed text image into a text image processing model to obtain a text image processing result, where the text image processing model is obtained by training using the apparatus according to the present invention.
According to another aspect of the present invention, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the present invention.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the present invention.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the invention. Wherein:
fig. 1 schematically shows an exemplary system architecture of a text image generation method, a training method of a deep learning model, and a text image processing method and apparatus according to an embodiment of the present invention;
FIG. 2 schematically shows a flow diagram of a text image generation method according to an embodiment of the invention;
FIG. 3A schematically illustrates a schematic diagram of a text image generation method according to an embodiment of the invention;
FIG. 3B schematically illustrates an example schematic of a process of generating a third sample text image subset according to an embodiment of the invention;
FIG. 3C schematically illustrates an example schematic of a process of generating a third sample text image subset according to another embodiment of the invention;
FIG. 3D schematically illustrates an example schematic of a process of generating a third sample text image subset according to another embodiment of the invention;
FIG. 3E schematically illustrates an example schematic of a process of generating a third sample text image subset according to another embodiment of the invention;
FIG. 4 schematically shows a flow diagram of a method of training a deep learning model according to an embodiment of the invention;
FIG. 5 schematically shows a flow chart of a text image processing method according to an embodiment of the invention;
fig. 6 schematically shows a block diagram of a text image generating apparatus according to an embodiment of the present invention;
FIG. 7 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present invention;
fig. 8 schematically shows a block diagram of a text image processing apparatus according to an embodiment of the present invention; and
fig. 9 schematically shows a block diagram of an electronic device adapted to implement a text image generation method, a training method of a deep learning model, and a text image processing method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 schematically shows an exemplary system architecture of a text image generation method, a deep learning model training method, and a text image processing method and apparatus according to an embodiment of the present invention.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present invention may be applied, provided to help those skilled in the art understand the technical content of the invention; it does not mean that embodiments of the present invention cannot be used in other devices, systems, environments or scenarios. For example, in another embodiment, the exemplary system architecture may include only a terminal device, and the terminal device may implement the text image generation method, the deep learning model training method, and the text image processing method and apparatus provided in the embodiments of the present invention without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, for example at least one of wired and wireless communication links.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103. For example, at least one of a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including, for example, at least one of a smartphone, a tablet, a laptop computer, a desktop computer, and the like.
The server 105 may be any type of server that provides various services. For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that the text image generation method and the text image processing method provided by the embodiment of the present invention may be generally executed by the terminal device 101, 102, or 103. Accordingly, the text image generation apparatus and the text image processing apparatus provided in the embodiments of the present invention may also be provided in the terminal device 101, 102, or 103.
Alternatively, the text image generation method and the text image processing method provided by the embodiment of the present invention may also be generally executed by the server 105. Accordingly, the text image generation apparatus and the text image processing apparatus provided by the embodiments of the present invention may be generally provided in the server 105. The text image generation method and the text image processing method provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the text image generating apparatus and the text image processing apparatus provided in the embodiments of the present invention may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, and 103 and/or the server 105.
It should be noted that the training method of the deep learning model provided by the embodiment of the present invention may be generally executed by the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present invention may be generally disposed in the server 105. The training method of the deep learning model provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, and 103 and/or the server 105. Correspondingly, the training device for the deep learning model provided in the embodiment of the present invention may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Alternatively, the training method of the deep learning model provided by the embodiment of the present invention may also be generally executed by the terminal device 101, 102, or 103. Correspondingly, the training device for the deep learning model provided by the embodiment of the invention can also be arranged in the terminal equipment 101, 102 or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present invention.
As shown in FIG. 2, the method 200 includes operations S210-S240.
In operation S210, the sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set.
In operation S220, a target cropping position set of the sample text image set to be cropped is determined according to the sample text output result set of the sample text image set to be cropped.
In operation S230, the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample text image subset.
In operation S240, a target sample text image set is obtained according to the at least one subset of cropped sample text images and the at least one subset of sample text images.
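As a hedged illustration, operations S210 to S240 can be sketched end to end in plain Python. Every name below (`divide`, `find_crop_positions`, `crop`) is an assumption for illustration only, and each "image" is modeled as a string of characters so that a crop position is simply a character index:

```python
def divide(images, outputs, labels):
    """S210: split the sample set by whether the text output matches the label."""
    correct = [img for img, out, lab in zip(images, outputs, labels) if out == lab]
    wrong = [img for img, out, lab in zip(images, outputs, labels) if out != lab]
    return correct, wrong  # first and second sample text image subsets

def find_crop_positions(images_to_crop, outputs):
    """S220: derive a crop position per image from its text output.

    Assume uniformly spaced characters and crop at the boundary after
    roughly half of them, so no character is cut in two.
    """
    positions = []
    for img, out in zip(images_to_crop, outputs):
        char_width = len(img) // max(len(out), 1)
        positions.append(char_width * max(len(out) // 2, 1))
    return positions

def crop(images_to_crop, positions):
    """S230: crop each image at its target position."""
    return [img[:pos] for img, pos in zip(images_to_crop, positions)]

images = ["hello", "world", "text!"]
outputs = ["hello", "w0rld", "text!"]   # model predictions
labels = ["hello", "world", "text!"]    # ground-truth labels

first_subset, second_subset = divide(images, outputs, labels)
correct_outputs = [out for out, lab in zip(outputs, labels) if out == lab]
positions = find_crop_positions(first_subset, correct_outputs)
cropped = crop(first_subset, positions)
# S240: merge the cropped subset with the original subsets.
target_set = cropped + first_subset + second_subset
```

Under these toy assumptions, "w0rld" (a misrecognition) falls into the second subset, while the correctly recognized images are cropped at a character boundary and merged back into the target set.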
According to an embodiment of the present invention, the at least one sample text image subset may include a first sample text image subset. The first sample text image subset may include sample text images for which the sample text output results are correct. The set of sample text images to be cropped may be determined from the first subset of sample text images.
According to an embodiment of the present invention, a text image may include at least one of: a document text image and a scene text image. A document text image refers to a text image with a regular layout, controlled lighting, and a relatively uniform background. A scene text image refers to a text image with a complex background, varied character forms, and uncontrolled lighting. The character forms may vary in at least one of: color, size, font, orientation, layout, and the like. Layout irregularities may include at least one of bending, tilting, wrinkling, deformation, distortion, incompleteness, and the like.
According to an embodiment of the present invention, the sample text image set may include at least one sample text image. The sample text image may include at least one of: a sample document text image and a sample scene text image. The sample text image set may be an image set of a text vision task. The sample text images may be text images of various text vision tasks. For example, the text vision task may include at least one of: the method comprises a text image recognition task, a text image classification task, a text image segmentation task, a text image detection task, a text image retrieval task and the like. Additionally, the text vision task may also include at least one of: a segment domain task corresponding to the text image recognition task, a segment domain task corresponding to the text image classification task, a segment domain task corresponding to the text image segmentation task, a segment domain task corresponding to the text image detection task, and a segment domain task corresponding to the text image retrieval task.
According to an embodiment of the present invention, for example, the segment domain task corresponding to the text image recognition task may include at least one of: the system comprises a bill image recognition task, a medical text image recognition task, a financial product text image recognition task, a video subtitle recognition task, a safety monitoring recognition task and the like. The segment domain task corresponding to the text image classification task may include at least one of: the system comprises a bill image classification task, a medical text image classification task, a financial product text image classification task, a video subtitle classification task, a safety monitoring classification task and the like. The segment domain task corresponding to the text image segmentation task may include at least one of: the system comprises a bill image segmentation task, a medical text image segmentation task, a financial product text image segmentation task and the like. The segment domain task corresponding to the text image detection task may include at least one of: the system comprises a bill image detection task, a medical text image detection task, a financial product text image detection task, a video subtitle detection task, a safety monitoring detection task and the like. The segment domain task corresponding to the text image retrieval task may include at least one of: the system comprises a bill image retrieval task, a medical text image retrieval task, a financial product text image retrieval task, a video subtitle retrieval task, a safety monitoring retrieval task and the like.
According to embodiments of the present invention, there may be a sample text output result set and a sample label set corresponding to the sample text image set. The sample text output result set may include at least one sample text output result, and the sample label set may include at least one sample label. A sample text image may have a sample text output result and a sample label corresponding to it. The sample text output result may characterize a predicted text result of the sample text image and may include at least one of a sample text recognition output result and a sample text semantic output result. The sample text recognition output result may characterize a predicted text recognition result of the sample text image; the sample text semantic output result may characterize a predicted semantic result of the sample text image. The sample label may characterize the true text result of the sample text image and may include at least one of a sample text recognition label and a sample text semantic label. The sample text recognition label may characterize the true text recognition result of the sample text image; the sample text semantic label may characterize the true semantic result of the sample text image. A text recognition result refers to the character sequence contained in a text image.
According to an embodiment of the invention, the sample text image set may include the first sample text image subset. A sample text image in the first sample text image subset is one whose sample text output result is correct. The sample text image set to be cropped may be determined from the first sample text image subset and may include at least one sample text image to be cropped. A sample text image to be cropped refers to a sample text image in the first subset that satisfies a predetermined cropping condition. The predetermined cropping condition may be configured according to actual service requirements and is not limited herein. For example, the predetermined cropping condition may include that a predetermined probability value corresponding to the sample text image is less than or equal to a predetermined probability threshold.
According to an embodiment of the present invention, a sample text image to be cropped may have at least one candidate cropping position. The target cropping position refers to a candidate cropping position that satisfies a predetermined position condition. The predetermined position condition may be configured according to actual service requirements and is not limited herein. For example, the predetermined position condition may be that the position is randomly selected from the at least one candidate cropping position.
According to an embodiment of the present invention, the cropped sample text image subset may include at least one cropped sample text image. A cropped sample text image may be obtained by cropping a sample text image to be cropped at the target cropping position.
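As a minimal sketch of the probability-threshold variant of the predetermined cropping condition described above, the selection of images to crop might look like the following; the threshold value and every name here are assumptions, not details taken from the patent:

```python
def select_to_crop(first_subset, probabilities, threshold=0.9):
    """Keep for cropping only the correctly recognized images whose
    predicted probability (model confidence) is at or below the threshold."""
    return [img for img, p in zip(first_subset, probabilities) if p <= threshold]

# Hypothetical image identifiers and per-image confidence scores.
first_subset = ["img_a", "img_b", "img_c"]
predicted_probs = [0.95, 0.80, 0.90]
to_crop = select_to_crop(first_subset, predicted_probs)
```

The intuition under this assumption is that low-confidence but still-correct samples are the most useful ones to augment by cropping.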
According to embodiments of the present invention, a sample text image set may be obtained from a data source in response to detecting a text image generation instruction. The data source may include at least one of: a local database, a cloud database, and network resources. A data interface may be invoked, and the sample text image set obtained from the data source using that interface. The sample text image set may include at least one sample text image. A sample text image may be at least one of: a simulated sample text image and a real sample text image. A real sample text image may be a sample text image from a public data set. A simulated sample text image may be generated in one of the following ways: based on predetermined image parameters, or by using a generative adversarial network model to process predetermined random noise data.
According to the embodiment of the invention, for the sample text image in the sample text image set, the first local feature extraction may be performed on the sample text image to obtain the first local sample feature map. Global feature extraction can be performed on the first local sample feature map to obtain a global sample feature sequence. The global sample feature sequence can be subjected to sequence decoding to obtain a sample text recognition output result of the sample text image. Second local feature extraction can be performed on the sample text image to obtain a second local sample feature map. Semantic understanding can be carried out on the second local sample characteristic graph to obtain a sample text semantic output result of the sample text image. And obtaining a sample text output result of the sample text image according to at least one of a sample text recognition output result and a sample text semantic output result of the sample text image. For example, the sample text image may be processed based on a deep learning model, resulting in a sample text output result. The deep learning model can comprise a deep learning model capable of realizing text recognition of character sequences with indefinite lengths and a deep learning model capable of realizing semantic understanding of texts. The model structure of the deep learning model can be configured according to actual business requirements, and is not limited herein. For example, the deep learning model may include at least one model structure. The model structure may comprise at least one model substructure and a connection relationship of the respective model substructures to each other. The model structure may be a structure obtained by connecting at least one model substructure based on a connection relationship between the model substructures. The at least one model substructure comprised by the model structure may be a structure from at least one operational layer. 
For example, the model structure may be a structure obtained by connecting at least one model substructure from at least one operation layer based on the connection relationships between the model substructures. For example, the at least one operation layer may include at least one of: an input layer, a convolutional layer, a hidden layer, a transcription layer, a pooling layer, a deconvolutional layer, a feedforward neural network layer, an attention layer, a residual layer, a fully-connected layer, a batch normalization layer, a linear embedding layer, a nonlinear layer, and the like.
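To make the recognition pipeline described above concrete, here is a deliberately toy sketch of its three stages (local feature extraction, global sequence modeling, sequence decoding). All function bodies are readable stand-ins, not the patent's actual convolutional, recurrent, or decoding layers:

```python
def extract_local_features(image_rows):
    """Stand-in for local feature extraction: per-column pixel sums."""
    width = len(image_rows[0])
    return [sum(row[c] for row in image_rows) for c in range(width)]

def extract_global_sequence(features, window=2):
    """Stand-in for global sequence modeling: a sliding-window average,
    so each step mixes in context from its neighbors."""
    seq = []
    for i in range(len(features)):
        lo = max(0, i - window + 1)
        seq.append(sum(features[lo:i + 1]) / (i + 1 - lo))
    return seq

def decode_sequence(seq, threshold=0.5):
    """Stand-in for sequence decoding: threshold each step into a token."""
    return ["X" if v > threshold else "." for v in seq]

# A 2x4 toy "image": two rows of pixel values.
image = [[0, 1, 0, 1],
         [0, 1, 0, 1]]
features = extract_local_features(image)
sequence = extract_global_sequence(features)
text = decode_sequence(sequence)
```

The point of the sketch is only the data flow: pixels become local features, local features become a context-aware sequence, and the sequence is decoded into symbols.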
According to an embodiment of the present invention, the deep learning model for text recognition may comprise one of: a CRNN (Convolutional Recurrent Neural Network)-based text recognition model and an encoder-decoder-based text recognition model. The CRNN may include a convolutional layer, a recurrent layer, and a transcription layer. The encoder-decoder may include one of: a symmetric encoder-decoder and an asymmetric encoder-decoder.
According to an embodiment of the present invention, the CRNN-based text recognition model may include at least one of: a CRNN model based on CTC (i.e., Connectionist Temporal Classification), a CRNN model based on an Attention mechanism, and a CRNN model based on ACE (i.e., Aggregation Cross-Entropy). The encoder-decoder-based text recognition model may include a Sequence-To-Sequence-based text recognition model.
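As an illustration of the CTC transcription step named above, the following minimal sketch shows best-path (greedy) CTC decoding, in which repeated per-frame labels are collapsed and the blank symbol is removed. The function name and label values are illustrative only, not part of the patented method:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: collapse consecutive repeats, drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev:        # collapse consecutive repeated labels
            if label != blank:   # drop the CTC blank symbol
                decoded.append(label)
        prev = label
    return decoded

# Per-frame argmax output of a recognizer: blank=0, 'h'=8, 'i'=9
assert ctc_greedy_decode([0, 8, 8, 0, 9, 9, 0]) == [8, 9]
```

In a CTC-based CRNN, a step like this turns the per-frame output of the transcription layer into the final character sequence of indefinite length.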
According to an embodiment of the present invention, the deep learning model for semantic understanding of text may include at least one of: a text semantic understanding model based on a convolutional neural network, a text semantic understanding model based on a recurrent neural network, and a text semantic understanding model based on a Transformer.
According to the embodiment of the invention, the training mode of the deep learning model can be configured according to the actual business requirements, and is not limited herein. For example, the training mode may include at least one of: unsupervised training, supervised training and semi-supervised training.
According to the embodiment of the invention, the sample text image set can be divided into at least one sample text image subset according to the sample text output result and the sample label of the sample text image. For example, the at least one sample text image subset may include a first sample text image subset. Additionally, the at least one subset of sample text images may also include a second subset of sample text images. The sample text images in the second subset of sample text images may refer to sample text images whose sample text output results are erroneous sample text output results.
According to an embodiment of the present invention, for a sample text image to be cropped in the sample text image set to be cropped, a plurality of candidate crop positions may be determined according to the sample text output result of the sample text image to be cropped. At least one target crop position is then determined from the plurality of candidate crop positions. For example, the at least one target crop position may be randomly determined from the plurality of candidate crop positions. Alternatively, a position corresponding to at least one target character may be determined from the plurality of candidate crop positions, and the position corresponding to the at least one target character may be determined as the at least one target crop position.
According to an embodiment of the present invention, for a sample text image to be cropped in the sample text image set to be cropped, the sample text image to be cropped may be cropped based on the at least one target crop position corresponding to the sample text image to be cropped, to obtain at least one cropped sample text image.
According to an embodiment of the present invention, after the at least one cropped sample text image corresponding to each sample text image to be cropped included in the sample text image set to be cropped is obtained, these cropped sample text images may be combined to obtain at least one combined sample text image.
According to an embodiment of the present invention, obtaining the target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset may include: obtaining the target sample text image set according to the sample text image subsets other than the first sample text image subset among the at least one sample text image subset, the sample text images in the first sample text image subset other than those in the sample text image set to be cropped, and the at least one combined sample text image. Alternatively, the target sample text image set may be derived from the sample text image set and the at least one combined sample text image.
According to the embodiment of the present invention, the text image generation method of the embodiment of the present invention can be performed by an electronic device. For example, the electronic device may be a server or a terminal device. The electronic device may include at least one processor. The processor can be used for executing the text image generation method provided by the embodiment of the invention. For example, a single processor may be used to execute the text image generation method provided by the embodiment of the present invention, or a plurality of processors may be used to execute the text image generation method provided by the embodiment of the present invention in parallel.
According to an embodiment of the present invention, the target crop position set is determined according to the sample text output result set of the sample text image set to be cropped; the sample text image set to be cropped is determined according to the first sample text image subset; and the sample text images in the first sample text image subset are those determined, from the sample text image set according to the sample text output result set and the sample label set, to have correct output results. The accuracy of the target crop positions can therefore be effectively ensured, and character information is effectively prevented from being damaged. In addition, the target sample text image set is obtained according to the at least one sample text image subset and the at least one cropped sample text image subset obtained by cropping the sample text image set to be cropped based on the target crop position set, which increases the image background complexity and image diversity of the sample text images in the target sample text image set, so that a target sample text image set with richer context information can be obtained. Training and optimizing subsequent models with the target sample text image set therefore reduces the number of model iterations and increases the training speed of the model, which in turn reduces the data processing load and resource consumption of the electronic device, improves the internal performance of the electronic device, and enhances its core competitiveness.
According to an embodiment of the present invention, the text image generation method may further include the following operations.
Data enhancement processing is performed on the original sample text image set to obtain an intermediate sample text image set. A sample text image set is obtained according to the original sample text image set and the intermediate sample text image set.
According to an embodiment of the present invention, the original sample text image set may include at least one original sample text image. The data enhancement may include at least one of: supervised data enhancement and unsupervised data enhancement. The supervised data enhancement may include at least one of: single-sample data enhancement and multi-sample data enhancement. The unsupervised data enhancement may include at least one of: data enhancement that generates new data and data enhancement that learns an enhancement strategy.
According to embodiments of the invention, the single sample data enhancement may comprise at least one of: a geometric transformation class and a color transformation class. The class of geometric transformations may include at least one of: flipping, rotating, random cropping, morphing, scaling, and the like. The color transform class may include at least one of: noise, blur, color transformation, erasure and padding, etc.
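A minimal sketch of the two single-sample enhancement classes above, with an image modelled as a nested list of pixel values; the function names, parameters, and seed are illustrative assumptions, not part of the patented method:

```python
import random

def horizontal_flip(image):
    """Geometric-transform augmentation: mirror each row of a 2-D image."""
    return [row[::-1] for row in image]

def add_noise(image, amplitude=10, seed=0):
    """Color-transform augmentation: add bounded random noise per pixel,
    clamped to the valid 0..255 range."""
    rng = random.Random(seed)
    return [[min(255, max(0, p + rng.randint(-amplitude, amplitude)))
             for p in row] for row in image]

img = [[0, 64], [128, 255]]
assert horizontal_flip(img) == [[64, 0], [255, 128]]
assert all(0 <= p <= 255 for row in add_noise(img) for p in row)
```

In practice a library such as Pillow or OpenCV would perform these transforms on real image arrays; the sketch only shows the shape of the operations.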
According to an embodiment of the present invention, the multi-sample data enhancement may comprise at least one of: SMOTE (i.e., Synthetic Minority Over-sampling Technique), SamplePairing, Mixup, Cutout, CutMix, FMix, ROIMix, and the like.
According to an embodiment of the present invention, data enhancement that generates new data may include data enhancement based on a generative adversarial network model. Data enhancement that learns an enhancement strategy may include automatic data enhancement.
According to an embodiment of the present invention, for an original sample text image in the original sample text image set, data enhancement may be performed on the original sample text image to obtain at least one intermediate sample text image corresponding to the original sample text image. The data enhancement applied to the respective original sample text images may be mutually different, partially identical, or wholly identical. For example, the original sample text image set may include an original sample text image A and an original sample text image B. The original sample text image A may be subjected to geometric-transformation data enhancement to obtain at least one intermediate sample text image corresponding to the original sample text image A. The original sample text image B may be subjected to color-transformation data enhancement to obtain at least one intermediate sample text image corresponding to the original sample text image B.
According to an embodiment of the present invention, obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include: determining the intermediate sample text image set as the sample text image set; or determining at least part of the original sample text image set and at least part of the intermediate sample text image set as the sample text image set.
According to the embodiment of the invention, different original sample text images can be subjected to different data enhancement, so that the image diversity of the third sample text image in the third sample text image subset can be effectively ensured. On the basis, the third sample text image subset is used for training the deep learning model, so that the generalization performance of the model can be improved.
According to an embodiment of the present invention, obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include the following operations.
For an original sample text image in the original sample text image set, in a case where it is determined that the height of the original sample text image is not a preset height, the height of the original sample text image is adjusted to the preset height while keeping its aspect ratio unchanged, to obtain an adjusted original sample text image. For an intermediate sample text image in the intermediate sample text image set, in a case where it is determined that the height of the intermediate sample text image is not the preset height, the height of the intermediate sample text image is adjusted to the preset height while keeping its aspect ratio unchanged, to obtain an adjusted intermediate sample text image. The sample text image set is obtained according to at least one of the original sample text image set, the at least one adjusted original sample text image, the intermediate sample text image set, and the at least one adjusted intermediate sample text image.
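The aspect-ratio-preserving height adjustment above amounts to scaling the width by the same factor as the height. A minimal sketch, in which the preset height of 32 pixels is an illustrative assumption (a common choice for text-line recognizers, but not stated in the source):

```python
def resized_shape(height, width, target_height=32):
    """Return (h, w) after scaling to the preset height with the
    aspect ratio kept unchanged; width is rounded to a whole pixel."""
    if height == target_height:
        return height, width          # already at the preset height
    scale = target_height / height
    return target_height, max(1, round(width * scale))

assert resized_shape(64, 200) == (32, 100)   # halved in both dimensions
assert resized_shape(32, 100) == (32, 100)   # unchanged at target height
```

The actual pixel resampling would then be done by an image library at the computed shape; only the shape arithmetic is shown here.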
According to an embodiment of the present invention, operation S210 may include the following operations.
The sample text output result set of the sample text image set is compared with the sample label set to obtain a comparison result. The sample text image set is divided into at least one sample text image subset according to the comparison result.
According to an embodiment of the present invention, the comparison result may indicate either that the relationship between the two objects satisfies a predetermined matching condition or that it does not. The two objects refer to a sample text output result and the corresponding sample label. The predetermined matching condition may be configured according to actual service requirements and is not limited herein. For example, the predetermined matching condition may be that the two objects match.
According to the embodiment of the invention, for the sample text image in the sample text image set, the sample text output result of the sample text image can be compared with the sample label, so as to obtain the comparison result corresponding to the sample text image. The sample text image may be divided into a subset of sample text images corresponding to the comparison result according to the comparison result corresponding to the sample text image.
According to an embodiment of the present invention, the sample text image set may include a plurality of sample text images. The at least one subset of sample text images may also include a second subset of sample text images.
According to an embodiment of the present invention, dividing the sample text image set into at least one sample text image subset according to the comparison result may include the following operations.
For a sample text image of the plurality of sample text images, determining the sample text image as a sample text image in a first sample text image subset if it is determined that a relationship between a sample text output result of the sample text image and the sample label satisfies a predetermined matching condition. In a case where it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, the sample text image is determined as a sample text image in the second sample text image subset.
According to an embodiment of the present invention, the predetermined matching condition may refer to a criterion for dividing a subset of the sample text image. The predetermined matching condition may include a difference between the sample text output result and the sample label being less than or equal to a predetermined difference threshold. The predetermined difference threshold may be configured according to actual service requirements, and is not limited herein. For example, the predetermined difference threshold may be 0.1.
According to an embodiment of the present invention, the sample text image in the first sample text image subset may refer to a sample text image whose sample text output result is a correct sample text output result. The sample text images in the second subset of sample text images may refer to sample text images whose sample text output results are erroneous sample text output results.
According to an embodiment of the present invention, for each sample text image among the plurality of sample text images, it is determined whether the difference between the sample text output result of the sample text image and the sample label is less than or equal to the predetermined difference threshold. In a case where the difference is less than or equal to the predetermined difference threshold, the sample text image may be determined as a sample text image in the first sample text image subset. In a case where the difference is greater than the predetermined difference threshold, the sample text image may be determined as a sample text image in the second sample text image subset.
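The subset division described above can be sketched as a partition driven by a difference function and a threshold; the function names and the exact-match difference measure are illustrative assumptions:

```python
def split_by_output(samples, difference, threshold=0.1):
    """Partition (image, output, label) samples into (first, second) subsets
    by comparing each sample text output result with its sample label."""
    first, second = [], []
    for image, output, label in samples:
        if difference(output, label) <= threshold:
            first.append(image)    # correct output: first subset
        else:
            second.append(image)   # erroneous output: second subset
    return first, second

exact = lambda out, lab: 0.0 if out == lab else 1.0
samples = [("img_a", "cat", "cat"), ("img_b", "cot", "cat")]
assert split_by_output(samples, exact) == (["img_a"], ["img_b"])
```

A real system might use an edit-distance-based difference instead of exact match; the partition logic is unchanged.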
According to an embodiment of the present invention, the target clipping position set is determined according to the sample text output result set of the sample text image set to be clipped; the sample text image set to be clipped is determined according to the first sample text image subset; and each first sample text image in the first sample text image subset is a sample text image for which the relationship between the sample text output result and the sample label satisfies the predetermined matching condition. The accuracy of the target clipping positions can therefore be effectively ensured, and character information is effectively prevented from being damaged.
According to an embodiment of the present invention, the first sample text image subset may include a plurality of first sample text images.
According to an embodiment of the present invention, the sample text image set to be cropped may be determined by:
for a first sample text image of the plurality of first sample text images, determining the first sample text image as a sample text image to be cropped in a sample text image set to be cropped if it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold.
According to an embodiment of the present invention, the predetermined probability value and the predetermined probability threshold may be used to determine whether a first sample text image in the first sample text image subset is a sample text image to be cropped in the sample text image set to be cropped. The predetermined probability value and the predetermined probability threshold may be configured according to actual service requirements, which are not limited herein. The predetermined probability value may be a number greater than or equal to 0 and less than 1. The predetermined probability threshold may be a number greater than or equal to 0 and less than or equal to 1. For example, the predetermined probability threshold may be determined based on model characteristics of the deep learning model. The model characteristics may include at least one of the complexity, fit, and generality of the model structure. For example, if the model structure of the deep learning model is characterized by at least one of high generality, high complexity, and a tendency to overfit, a numerically larger predetermined probability threshold may be configured. If the model structure of the deep learning model is characterized by at least one of low generality, low complexity, and a tendency to underfit, a numerically smaller predetermined probability threshold may be configured.
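The probabilistic selection above can be sketched as drawing a probability value per first sample text image and keeping those at or below the threshold. The function name, the default threshold, and the fixed seed are illustrative assumptions:

```python
import random

def select_to_crop(first_subset, threshold=0.5, seed=42):
    """Keep a first sample text image for cropping when its drawn
    predetermined probability value is at or below the threshold."""
    rng = random.Random(seed)
    return [img for img in first_subset if rng.random() <= threshold]

subset = ["img_%d" % i for i in range(10)]
to_crop = select_to_crop(subset)
assert set(to_crop) <= set(subset)                       # always a subset
assert select_to_crop(subset, threshold=1.0) == subset   # keep everything
```

A larger threshold admits more images into the to-be-cropped set, matching the guidance that models prone to overfitting warrant a larger threshold (more synthetic data).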
According to an embodiment of the present invention, the sample text image set to be cut may include a plurality of sample text images to be cut.
According to an embodiment of the present invention, operation S220 may include the following operations.
At least one target clipping position is determined from a plurality of candidate clipping positions according to the sample text output result of a sample text image to be clipped in the sample text image set to be clipped.
According to the embodiment of the invention, a plurality of candidate clipping positions can be determined according to the sample text output result of the sample text image to be clipped. At least one target clipping location is randomly determined from the plurality of candidate clipping locations.
According to the embodiment of the invention, by randomly determining at least one target clipping position from a plurality of candidate clipping positions, the image diversity of the sample text image can be improved.
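The random determination of target clipping positions above can be sketched as sampling without replacement from the candidate positions; the function name, the number of targets, and the seed are illustrative assumptions:

```python
import random

def pick_target_positions(candidates, k=1, seed=0):
    """Randomly choose k target clipping positions from the candidates,
    returned in ascending order for deterministic downstream cropping."""
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))

candidates = [1, 2, 3, 4]     # boundaries between five characters
targets = pick_target_positions(candidates, k=2)
assert len(targets) == 2 and set(targets) <= set(candidates)
```

Sampling without replacement guarantees each chosen position is distinct, so the same boundary is never used twice on one image.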
According to an embodiment of the present invention, the sample text image set may include a plurality of sample text images.
According to the embodiment of the invention, the sample text recognition output result can be obtained by performing sequence decoding on the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on the first local sample feature map of the sample text image. The first local sample feature map may be a result of a first local feature extraction performed on the sample text image.
According to the embodiment of the invention, the sample text semantic output result can be obtained by performing semantic understanding on the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
According to an embodiment of the present invention, the sample text image may be processed by using a CRNN-based text recognition model to obtain the sample text recognition output result. The CRNN may include a convolutional layer, a recurrent layer, and a transcription layer. The sample text image may be processed by using the convolutional layer to obtain the first local sample feature map. The first local sample feature map may be processed by using the recurrent layer to obtain the global sample feature sequence. The global sample feature sequence may be processed by using the transcription layer to obtain the sample text recognition output result.
According to an embodiment of the present invention, in a case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, determining at least one target clipping position from the plurality of candidate clipping positions according to the sample text output result of the sample text image to be clipped may include the following operations.
A plurality of candidate clipping positions are determined according to the sample text recognition output result of the sample text image to be clipped. At least one target clipping position is determined from the plurality of candidate clipping positions according to the sample text semantic output result of the sample text image to be clipped.
According to an embodiment of the present invention, for example, the sample text recognition output result of the sample text image to be clipped may be "今天去上班" ("go to work today"). According to the sample text recognition output result, four candidate clipping positions are determined, namely the position between "今" and "天", the position between "天" and "去", the position between "去" and "上", and the position between "上" and "班". According to the sample text semantic output result, it can be determined that "今" and "天" should not be separated and that "上" and "班" should not be separated. Therefore, two target clipping positions, i.e., the candidate clipping position between "天" and "去" and the candidate clipping position between "去" and "上", can be determined from the four candidate clipping positions.
According to the embodiment of the invention, at least one target clipping position is determined from a plurality of candidate clipping positions according to the sample text semantic output result of the sample text image to be clipped, so that the accuracy of the target clipping position is improved.
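The semantic filtering above can be sketched by treating the recognized line as a sequence of semantic units (words) and keeping only the inter-character positions that fall between units, never inside one. This is a simplified model; the function name and the word segmentation are illustrative:

```python
def semantic_crop_positions(words):
    """Candidate positions lie between adjacent characters; keep only the
    positions on boundaries between semantic units (words)."""
    positions, offset = [], 0
    for word in words[:-1]:
        offset += len(word)
        positions.append(offset)   # character index right after this word
    return positions

# "今天 / 去 / 上班": the kept boundaries are after character indices 2 and 3
assert semantic_crop_positions(["今天", "去", "上班"]) == [2, 3]
```

Positions 1 and 4 (inside "今天" and "上班") are excluded, which is how the method avoids damaging character or word information.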
According to an embodiment of the present invention, operation S230 may include the following operations.
The sample text image set to be cropped is cropped based on the target crop position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
According to an embodiment of the present invention, the first cropped sample text image subset may include at least one first cropped sample text image. The second cropped sample text image subset may include at least one second cropped sample text image. The at least one target crop position corresponding to a sample text image to be cropped may include a first target crop position and a second target crop position.
According to an embodiment of the present invention, a sample text image to be cropped in the sample text image set to be cropped may be cropped based on the first target crop position corresponding to the sample text image to be cropped, to obtain the first cropped sample text image corresponding to the sample text image to be cropped, and may be cropped based on the second target crop position corresponding to the sample text image to be cropped, to obtain the second cropped sample text image corresponding to the sample text image to be cropped.
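The cropping at a target position above can be sketched by modelling a text-line image as its character sequence and splitting it at the position. This is a simplification for readability; real cropping operates on pixel columns of the image at the corresponding x-coordinate:

```python
def crop_at(image_row, position):
    """Split one text-line image (modelled as a character row) at a crop
    position, yielding the left and right cropped images."""
    return image_row[:position], image_row[position:]

left_1, right_1 = crop_at("today", 2)    # crop at the first target position
assert (left_1, right_1) == ("to", "day")
left_2, right_2 = crop_at("today", 3)    # crop at the second target position
assert (left_2, right_2) == ("tod", "ay")
```

Applying the first and second target positions to the same line yields the first and second cropped sample images described above.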
According to an embodiment of the present invention, operation S240 may include the following operations.
A third sample text image subset is obtained according to the at least one cropped sample text image subset. A target sample text image set is obtained according to the at least one sample text image subset and the third sample text image subset.
According to an embodiment of the present invention, at least one subset of cropped sample text images may be combined to obtain a third subset of sample text images. A target sample text image set may be obtained from the second sample text image subset and the third sample text image subset.
According to an embodiment of the present invention, obtaining the third sample text image subset from the at least one cropped sample text image subset may include the following operations.
The cropped sample text images in the at least one cropped sample text image subset are combined based on a predetermined combination strategy to obtain the third sample text image subset.
According to an embodiment of the present invention, the predetermined combination policy may refer to a policy for combining the trimming sample text images. For example, the predetermined combination policy may include at least one of: a random combining strategy and a fixed combining strategy. The third subset of sample text images may include at least one third sample text image. The third sample text image may be the same as or different from the sample text images in the set of sample text images.
According to an embodiment of the present invention, for a cropped sample text image subset among the at least one cropped sample text image subset, the cropped sample text images in the cropped sample text image subset may be combined with the cropped sample text images in other cropped sample text image subsets to obtain at least one third sample text image. The other cropped sample text image subsets may be any one or more of the at least one cropped sample text image subset other than the cropped sample text image subset itself.
For example, the at least one cropped sample text image subset may include a first cropped sample text image subset and a second cropped sample text image subset. The first cropped sample text image subset may represent a subset of cropped sample text images in a first direction. The second cropped sample text image subset may represent a subset of cropped sample text images in a second direction. The first direction may refer to the right direction. The second direction may refer to the left direction. For a first cropped sample text image in the first cropped sample text image subset, the first cropped sample text image may be combined with at least one second cropped sample text image in the second cropped sample text image subset to obtain at least one third sample text image.
According to the embodiment of the invention, the third sample text image subset is obtained by combining the cut sample text images in at least one cut sample text image subset based on the preset combination strategy, so that the random combination of the cut sample text images is realized, and the image background complexity and the image diversity of the third sample text image in the third sample text image subset are improved. On the basis, the third sample text image subset is used for training the deep learning model, so that the generalization performance of the model can be improved.
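The cross-combination of the two crop subsets above can be sketched as follows, in a simplified model where images are strings concatenated side by side; the function name and pairing order are illustrative:

```python
def combine_crops(first_direction, second_direction):
    """Pair every crop from one subset with every crop from the other,
    concatenating them side by side into third sample text images."""
    return [left + right
            for left in first_direction
            for right in second_direction]

third = combine_crops(["AB", "CD"], ["xy"])
assert third == ["ABxy", "CDxy"]
```

Because crops from different source lines are joined, the resulting third images carry mixed backgrounds and character contexts, which is the diversity gain described above.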
According to an embodiment of the present invention, the text image generation method may further include the following operations.
The sample label set of the sample text image set to be cropped is cropped based on the target crop position set to obtain at least one cropped sample label subset. A target sample label set is obtained according to the sample label subsets corresponding to the at least one sample text image subset and the at least one cropped sample label subset.
According to an embodiment of the present invention, obtaining the target sample label set according to the sample label subsets corresponding to the at least one sample text image subset and the at least one cropped sample label subset may include the following operations.
A sample label subset corresponding to the third sample text image subset is obtained according to the at least one cropped sample label subset. The target sample label set is obtained according to the sample label subsets corresponding to the at least one sample text image subset and the sample label subset corresponding to the third sample text image subset.
According to an embodiment of the present invention, obtaining a sample label subset corresponding to the third sample text image subset according to the at least one cropping sample label subset may include the following operations.
The cropped sample labels in the at least one cropped sample label subset are combined based on the predetermined combination strategy to obtain the sample label subset corresponding to the third sample text image subset.
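Combining the labels with the same predetermined strategy as the images keeps every synthetic image aligned with its synthetic label. A minimal sketch assuming a shared seeded shuffle as the strategy; all names and the seed are illustrative:

```python
import random

def combine_pairs(crop_pairs, seed=7):
    """Combine cropped (image, label) pairs under one shared random
    strategy, so images and labels are concatenated in the same order."""
    rng = random.Random(seed)
    shuffled = crop_pairs[:]
    rng.shuffle(shuffled)
    # The same pairing drives both the image and the label concatenation.
    images = [a_img + b_img for (a_img, _), (b_img, _) in zip(crop_pairs, shuffled)]
    labels = [a_lab + b_lab for (_, a_lab), (_, b_lab) in zip(crop_pairs, shuffled)]
    return images, labels

pairs = [("imgA", "母婴"), ("imgB", "百汇")]
images, labels = combine_pairs(pairs)
assert len(images) == len(labels) == 2
assert all(len(lab) == 4 for lab in labels)   # each label joins two 2-char labels
```

If images and labels were shuffled with independent seeds instead, the target labels would no longer describe their images, so sharing one strategy (here, one seeded shuffle) is essential.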
The text image generation method according to the embodiment of the present invention is further described with reference to fig. 3A, fig. 3B, fig. 3C, fig. 3D, and fig. 3E in conjunction with specific embodiments.
Fig. 3A schematically shows a schematic diagram of a text image generation method according to an embodiment of the present invention.
As shown in fig. 3A, in 300A, a sample text image set 303 is divided into a first sample text image subset 303_1 and a second sample text image subset 303_2 according to a sample text output result set 301 and a sample label set 302 of the sample text image set. A sample text image set to be cropped 304 is determined from the first sample text image subset 303_1.
A target crop position set 306 of the sample text image set to be cropped 304 is determined from a sample text output result set 305 of the sample text image set to be cropped 304. The sample text image set to be cropped 304 is cropped based on the target crop position set 306, resulting in at least one cropped sample text image subset 307. A target sample text image set 308 is derived from the at least one cropped sample text image subset 307, the first sample text image subset 303_1, and the second sample text image subset 303_2.
Fig. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the invention.
As shown in fig. 3B, in 300B, the sample text image set to be clipped 309 may include a sample text image to be clipped 309_1 and a sample text image to be clipped 309_2.
According to the sample text output result of the sample text image to be clipped 309_1 (corresponding to the text "母婴百汇"), the target clipping position determined from the plurality of candidate clipping positions is the position between "婴" and "百". The sample text image to be clipped 309_1 is clipped based on the target clipping position to obtain a clipped sample text image 309_1_1 and a clipped sample text image 309_1_2. The clipped sample text image 309_1_1 is the sample text image corresponding to "母婴". The clipped sample text image 309_1_2 is the sample text image corresponding to "百汇".
According to the sample text output result of the sample text image to be clipped 309_2 (corresponding to the text "转让"), the target clipping position determined from the plurality of candidate clipping positions is the position between "转" and "让". The sample text image to be clipped 309_2 is clipped based on the target clipping position to obtain a clipped sample text image 309_2_1 and a clipped sample text image 309_2_2. The clipped sample text image 309_2_1 is the sample text image corresponding to "转". The clipped sample text image 309_2_2 is the sample text image corresponding to "让".
Based on a predetermined combination strategy, the clipped sample text image 309_1_1 and the clipped sample text image 309_2_2 are combined to obtain a third sample text image 310_1 in the third sample text image subset 310, and the clipped sample text image 309_1_2 and the clipped sample text image 309_2_1 are combined to obtain a third sample text image 310_2 in the third sample text image subset 310. The third sample text image 310_1 is a sample text image corresponding to "mother-and-baby give". The third sample text image 310_2 is a sample text image corresponding to "transfer hundreds".
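The clipping-and-recombination step illustrated above can be sketched in code. In the sketch below, plain strings stand in for text-line images, and the function names (`crop_at`, `recombine`) and the tuple-based pairing strategy are illustrative assumptions rather than the patent's actual implementation:

```python
def crop_at(line, pos):
    # Clip a text line (a string standing in for a text-line image) at pos.
    return line[:pos], line[pos:]

def recombine(samples, positions, strategy):
    # Clip every sample at its target position, then join halves across
    # samples according to a predetermined combination strategy. Each
    # strategy entry is a pair of (sample_index, half_index) tuples.
    halves = [crop_at(s, p) for s, p in zip(samples, positions)]
    return ["".join(halves[i][h] for i, h in pair) for pair in strategy]

# Two samples to be clipped, each with one target clipping position,
# mirroring FIG. 3B: the left half of sample 0 joins the right half of
# sample 1, and the left half of sample 1 joins the right half of sample 0.
samples = ["AB", "CD"]
positions = [1, 1]
third_subset = recombine(samples, positions, [((0, 0), (1, 1)), ((1, 0), (0, 1))])
# third_subset == ["AD", "CB"]
```

Changing the strategy entries reproduces the alternative combinations of the other figures, where the same clipped halves are joined in a different order or pairing.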
FIG. 3C schematically illustrates an example schematic of a generation process of a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3C, in 300C, unlike fig. 3B, the third sample text image 311_1 is a sample text image corresponding to "let mother and baby". The third sample text image 311_2 is a sample text image corresponding to "hundred remittance".
Fig. 3D schematically illustrates an example schematic of a generation process of a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3D, in 300D, unlike fig. 3B, the clipped sample text image 309_1_1 and the clipped sample text image 309_2_1 are combined based on a predetermined combination strategy to obtain a third sample text image 312_1 in the third sample text image subset 312, and the clipped sample text image 309_1_2 and the clipped sample text image 309_2_2 are combined to obtain a third sample text image 312_2 in the third sample text image subset 312. The third sample text image 312_1 is a sample text image corresponding to "mother-to-baby" operation. The third sample text image 312_2 is a sample text image corresponding to "baihui".
FIG. 3E schematically illustrates an example schematic of a process of generating a third sample text image subset according to another embodiment of the invention.
As shown in fig. 3E, in 300E, unlike fig. 3D, the third sample text image 313_1 is a sample text image corresponding to "mother-to-mother". The third sample text image 313_2 is a sample text image corresponding to "give hundred remittance".
FIG. 4 schematically shows a flowchart of a training method of a deep learning model according to an embodiment of the present invention.
As shown in FIG. 4, the method 400 may include operations S410-S420.
In operation S410, a target sample text image set is acquired.
In operation S420, a deep learning model is trained using the target sample text image set, so as to obtain a text image processing model.
According to an embodiment of the present invention, the target sample text image set may be obtained according to the text image generation method described in the embodiment of the present invention.
According to the embodiment of the invention, for the target sample text image set, the target clipping position set is determined according to the sample text output result set of the sample text image set to be clipped; the sample text image set to be clipped is determined according to the first sample text image subset; and the first sample text image subset consists of the sample text images, determined from the sample text image set according to its sample text output result set and sample label set, whose sample text output results are correct. The accuracy of the target clipping positions can therefore be effectively ensured, and damage to character information is effectively avoided. On this basis, the target sample text image set is obtained according to the at least one clipped sample text image subset and the at least one sample text image subset, yielding a target sample text image set with richer context information. Training and optimizing a subsequent model with this target sample text image set therefore reduces the number of model iterations, increases the training speed of the model, and reduces the data processing load and resource consumption of the electronic device, thereby improving the internal performance of the electronic device and enhancing its core competitiveness.
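Operations S410-S420 can be illustrated with a minimal, framework-free sketch. The linear one-parameter "model" and squared-error loss below are stand-ins for the deep learning model and its objective, and `target_samples` is a hypothetical stand-in for the target sample text image set:

```python
def train(samples, lr=0.1, epochs=50):
    # Fit a single weight w by per-sample gradient descent; samples are
    # (feature, label) pairs standing in for (image, label) pairs drawn
    # from the target sample text image set.
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (w * x - y) * x  # gradient of (w*x - y)**2 w.r.t. w
            w -= lr * grad
    return w

target_samples = [(1.0, 2.0), (2.0, 4.0)]  # labels follow y = 2x
model_w = train(target_samples)
# model_w converges to 2.0
```

A real text image processing model would of course be a deep network trained with a framework such as PaddlePaddle or PyTorch; the sketch only shows the S410 (acquire samples) then S420 (iterate to fit parameters) structure.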
Fig. 5 schematically shows a flow chart of a text image processing method according to an embodiment of the present invention.
As shown in FIG. 5, the method 500 includes operations S510-S520.
In operation S510, a text image to be processed is acquired.
In operation S520, the text image to be processed is input to the text image processing model, and a text image processing result is obtained.
According to the embodiment of the invention, the text image processing model can be obtained by training according to the training method of the deep learning model described in the embodiment of the invention.
In the technical solution of the present invention, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
The above are only exemplary embodiments, and the present invention is not limited thereto; other text image generation methods, training methods of deep learning models, and text image processing methods known in the art may also be used, as long as the accuracy of the target clipping position can be effectively ensured and a target sample text image set with richer context information can be obtained.
Fig. 6 schematically shows a block diagram of a text image generating apparatus according to an embodiment of the present invention.
As shown in fig. 6, the text image generating apparatus 600 may include a dividing module 610, a determining module 620, a first obtaining module 630, and a second obtaining module 640.
The dividing module 610 is configured to divide the sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set. The at least one sample text image subset includes a first sample text image subset. The first sample text image subset includes sample text images whose sample text output results are correct.

The determining module 620 is configured to determine a target clipping position set of the sample text image set to be clipped according to a sample text output result set of the sample text image set to be clipped. The sample text image set to be clipped is determined according to the first sample text image subset.

The first obtaining module 630 is configured to clip the sample text image set to be clipped based on the target clipping position set to obtain at least one clipped sample text image subset.

The second obtaining module 640 is configured to obtain a target sample text image set according to the at least one clipped sample text image subset and the at least one sample text image subset.
The partitioning module 610 may include a comparison sub-module and a partitioning sub-module according to an embodiment of the present invention.
The comparison submodule is configured to compare the sample text output result set of the sample text image set with the sample label set to obtain a comparison result.

The dividing submodule is configured to divide the sample text image set into the at least one sample text image subset according to the comparison result.
According to an embodiment of the present invention, the sample text image set includes a plurality of sample text images, and the at least one sample text image subset further includes a second sample text image subset.
According to an embodiment of the present invention, the dividing sub-module may include a first determining unit and a second determining unit for a sample text image among the plurality of sample text images.
A first determining unit configured to determine the sample text image as a sample text image in the first sample text image subset, in a case where it is determined that a relationship between a sample text output result of the sample text image and the sample label satisfies a predetermined matching condition.
A second determination unit configured to determine the sample text image as a sample text image in the second sample text image subset, in a case where it is determined that a relationship between the sample text output result of the sample text image and the sample label does not satisfy a predetermined matching condition.
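The behaviour of the first and second determining units can be sketched as follows. Exact string equality stands in for the predetermined matching condition, and representing the subsets as index lists is an illustrative assumption:

```python
def partition(sample_outputs, sample_labels):
    # Indices whose output matches the label go to the first subset
    # (correct results); the rest go to the second subset.
    first_subset, second_subset = [], []
    for i, (out, label) in enumerate(zip(sample_outputs, sample_labels)):
        (first_subset if out == label else second_subset).append(i)
    return first_subset, second_subset

outputs = ["cat", "dgo", "dog"]  # "dgo" is a misrecognition
labels = ["cat", "dog", "dog"]
first, second = partition(outputs, labels)
# first == [0, 2], second == [1]
```

Other matching conditions (e.g., edit distance below a threshold) would slot into the same comparison without changing the partition structure.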
According to an embodiment of the present invention, the sample text image set to be clipped may include a plurality of sample text images to be clipped.

According to an embodiment of the present invention, the determining module 620 may include a determining submodule for a sample text image to be clipped in the sample text image set to be clipped.

The determining submodule is configured to determine at least one target clipping position from a plurality of candidate clipping positions according to the sample text output result of the sample text image to be clipped.
According to an embodiment of the present invention, the sample text output result may include at least one of a sample text recognition output result and a sample text semantic output result.
According to an embodiment of the present invention, the sample text image set may include a plurality of sample text images.
According to the embodiment of the invention, the sample text recognition output result can be obtained by performing sequence decoding on the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on a first local sample feature map of the sample text image. The first local sample feature map may be a result of a first local feature extraction performed on the sample text image.
According to the embodiment of the invention, the sample text semantic output result can be obtained by performing semantic understanding on the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
According to an embodiment of the present invention, in a case where the sample text output result includes the sample text recognition output result and the sample text semantic output result, the determining submodule may include a third determining unit and a fourth determining unit.
The third determining unit is configured to determine a plurality of candidate clipping positions according to the sample text recognition output result of the sample text image to be clipped.

The fourth determining unit is configured to determine at least one target clipping position from the plurality of candidate clipping positions according to the sample text semantic output result of the sample text image to be clipped.
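The two-stage position selection performed by the third and fourth determining units can be sketched as follows. Deriving candidates from character boundaries and filtering by word-span boundaries is an assumed, simplified reading of how the recognition and semantic outputs could be used:

```python
def candidate_positions(recognized_chars):
    # One candidate clipping position between each pair of recognized
    # characters, derived from the recognition output.
    return list(range(1, len(recognized_chars)))

def target_positions(candidates, word_spans):
    # Keep only candidates lying on a boundary between semantic units,
    # so that clipping does not damage character or word information.
    boundaries = {end for _start, end in word_spans}
    return [p for p in candidates if p in boundaries]

chars = list("turnlet")                               # recognition output
cands = candidate_positions(chars)                    # [1, 2, 3, 4, 5, 6]
targets = target_positions(cands, [(0, 4), (4, 7)])   # semantic units "turn", "let"
# targets == [4], the position between "turn" and "let"
```

The final span end (7) is also a boundary but is not a candidate, so only interior word boundaries survive as target clipping positions.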
According to an embodiment of the present invention, the first obtaining module 630 may include a first obtaining sub-module.
The first obtaining submodule is configured to clip the sample text image set to be clipped based on the target clipping position set to obtain a first clipped sample text image subset and a second clipped sample text image subset.
The second obtaining module 640 may include a second obtaining sub-module and a third obtaining sub-module according to an embodiment of the present invention.
The second obtaining submodule is configured to obtain a third sample text image subset according to the at least one clipped sample text image subset.

The third obtaining submodule is configured to obtain a target sample text image set according to the at least one sample text image subset and the third sample text image subset.
According to an embodiment of the present invention, the second obtaining sub-module may include an obtaining unit.
The obtaining unit is configured to combine the clipped sample text images in the at least one clipped sample text image subset based on a predetermined combination strategy to obtain a third sample text image subset.
According to an embodiment of the present invention, the first sample text image subset may include a plurality of first sample text images.

According to an embodiment of the present invention, the sample text image set to be clipped may be determined in the following manner:

for a first sample text image of the plurality of first sample text images,

in a case where it is determined that a predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold, the first sample text image is determined as a sample text image to be clipped in the sample text image set to be clipped.
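The selection rule above can be sketched directly. Passing the predetermined probability values as an explicit list is an illustrative assumption (in practice each value might be drawn at random per image):

```python
def select_to_clip(first_subset, prob_values, threshold):
    # A first-subset image joins the set to be clipped when its
    # predetermined probability value is <= the predetermined threshold.
    return [img for img, p in zip(first_subset, prob_values) if p <= threshold]

to_clip = select_to_clip(["img_a", "img_b", "img_c"], [0.2, 0.9, 0.5], 0.5)
# to_clip == ["img_a", "img_c"]
```

With randomly drawn probability values, the threshold effectively controls what fraction of correct samples is routed into the clipping branch.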
According to an embodiment of the present invention, the text image generating apparatus may further include a third obtaining module and a fourth obtaining module.
The third obtaining module is configured to perform data enhancement processing on an original sample text image set to obtain an intermediate sample text image set.

The fourth obtaining module is configured to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
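The third and fourth obtaining modules can be sketched together. String reversal is a stand-in for real image augmentations (rotation, blur, colour jitter, and the like), and list concatenation stands in for merging the two sets:

```python
def build_sample_set(original_set, enhance):
    # Data-enhance each original sample to form the intermediate set,
    # then take the sample set as the union of both sets.
    intermediate_set = [enhance(sample) for sample in original_set]
    return original_set + intermediate_set

sample_set = build_sample_set(["abc", "xyz"], lambda s: s[::-1])
# sample_set == ["abc", "xyz", "cba", "zyx"]
```

Keeping the originals alongside the enhanced copies doubles the sample count while preserving the undistorted data distribution.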
According to an embodiment of the invention, the sample text image set may be a text image set of a text vision task.
FIG. 7 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present invention.
As shown in fig. 7, the training apparatus 700 for deep learning model may include a first obtaining module 710 and a fifth obtaining module 720.
A first obtaining module 710, configured to obtain a target sample text image set.
The fifth obtaining module 720 is configured to train a deep learning model using the target sample text image set to obtain a text image processing model.

According to an embodiment of the present invention, the target sample text image set may be obtained by the text image generating apparatus according to an embodiment of the present invention.
Fig. 8 schematically shows a block diagram of a text image processing apparatus according to an embodiment of the present invention.
As shown in fig. 8, the text image processing apparatus 800 may include a second obtaining module 810 and a sixth obtaining module 820.

The second obtaining module 810 is configured to acquire a text image to be processed.

The sixth obtaining module 820 is configured to input the text image to be processed into a text image processing model to obtain a text image processing result.

According to an embodiment of the present invention, the text image processing model may be trained by the training apparatus of the deep learning model according to an embodiment of the present invention.
The invention also provides an electronic device, a readable storage medium and a computer program product according to the embodiments of the invention.
According to an embodiment of the present invention, an electronic apparatus includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present invention, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the method described above.
According to an embodiment of the invention, a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement a text image generation method, a training method of a deep learning model, and a text image processing method according to an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, for example, a text image generation method, a training method of a deep learning model, and a text image processing method. For example, in some embodiments, the text image generation method, the training method of the deep learning model, and the text image processing method may be implemented as computer software programs that are tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text image generation method, the training method of the deep learning model, and the text image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform a text image generation method, a training method of a deep learning model, and a text image processing method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present invention can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (32)

1. A text image generation method, comprising:
dividing the sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset comprises a first sample text image subset, and the first sample text image subset comprises a sample text image with a correct sample text output result;
determining a target clipping position set of a sample text image set to be clipped according to a sample text output result set of the sample text image set to be clipped, wherein the sample text image set to be clipped is determined according to the first sample text image subset;
clipping the sample text image set to be clipped based on the target clipping position set to obtain at least one clipping sample text image subset; and

obtaining a target sample text image set according to the at least one clipping sample text image subset and the at least one sample text image subset.
2. The method of claim 1, wherein said partitioning the sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set comprises:
comparing the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
dividing the sample text image set into the at least one sample text image subset according to the comparison result.
3. The method of claim 2, wherein the set of sample text images comprises a plurality of sample text images, the at least one subset of sample text images further comprising a second subset of sample text images;
wherein, according to the comparison result, dividing the sample text image set into the at least one sample text image subset comprises:
for a sample text image of the plurality of sample text images,
determining the sample text image as a sample text image in the first sample text image subset if it is determined that a relationship between a sample text output result of the sample text image and a sample label satisfies a predetermined matching condition; and
determining the sample text image as a sample text image in the second subset of sample text images if it is determined that the relationship between the sample text output result and the sample label of the sample text image does not satisfy the predetermined matching condition.
4. The method according to any one of claims 1-3, wherein the sample text image set to be clipped comprises a plurality of sample text images to be clipped;

wherein the determining a target clipping position set of the sample text image set to be clipped according to the sample text output result set of the sample text image set to be clipped comprises:

for a sample text image to be clipped in the sample text image set to be clipped,

determining at least one target clipping position from a plurality of candidate clipping positions according to a sample text output result of the sample text image to be clipped.
5. The method of claim 4, wherein the sample text output result comprises at least one of: a sample text recognition output result and a sample text semantic output result.
6. The method of claim 5, wherein the set of sample text images comprises a plurality of sample text images;
the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;
the sample text semantic output result is obtained by performing semantic understanding on a second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
7. The method of claim 5, wherein, in a case that the sample text output result includes the sample text recognition output result and the sample text semantic output result, the determining at least one target clipping position from a plurality of candidate clipping positions according to the sample text output result of the sample text image to be clipped comprises:
determining the plurality of candidate clipping positions according to a sample text recognition output result of the sample text image to be clipped; and

determining the at least one target clipping position from the plurality of candidate clipping positions according to a sample text semantic output result of the sample text image to be clipped.
8. The method according to any one of claims 1-3, wherein the cropping the set of sample text images to be cropped based on the set of target cropping positions to obtain at least one subset of cropped sample text images comprises:
clipping the sample text image set to be clipped based on the target clipping position set to obtain a first clipping sample text image subset and a second clipping sample text image subset.
9. The method of any of claims 1-3, wherein said deriving a target set of sample text images from said at least one subset of cropped sample text images and said at least one subset of sample text images comprises:
obtaining a third sample text image subset according to the at least one clipping sample text image subset; and
obtaining the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
10. The method of claim 9, wherein said deriving a third sample text image subset from said at least one cropped sample text image subset comprises:
combining the clipping sample text images in the at least one clipping sample text image subset based on a predetermined combination strategy to obtain the third sample text image subset.
11. The method according to any one of claims 1-3, wherein the first sample text image subset comprises a plurality of first sample text images;
wherein the sample text image set to be clipped is determined by:
for a first sample text image of the plurality of first sample text images,
determining the first sample text image as a sample text image to be clipped in the sample text image set to be clipped if it is determined that a predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold.
12. The method of any of claims 1-3, further comprising:
carrying out data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and
obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set.
13. The method of any of claims 1-3, wherein the sample set of text images is a set of text images for a text vision task.
14. A training method of a deep learning model comprises the following steps:
acquiring a target sample text image set; and
training the deep learning model by using the target sample text image set to obtain a text image processing model,
wherein the target sample text image set is obtained by the method according to any one of claims 1-13.
15. A text image processing method, comprising:
acquiring a text image to be processed; and
inputting the text image to be processed into a text image processing model to obtain a text image processing result,
wherein the text image processing model is trained using the method of claim 14.
16. A text image generating apparatus comprising:
a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset comprises a first sample text image subset, and the first sample text image subset comprises sample text images with correct sample text output results;
a determining module configured to determine a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined according to the first sample text image subset;
a first obtaining module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and
a second obtaining module configured to obtain a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
17. The apparatus of claim 16, wherein the dividing module comprises:
a comparison sub-module configured to compare the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
a dividing sub-module configured to divide the sample text image set into the at least one sample text image subset according to the comparison result.
18. The apparatus of claim 17, wherein the set of sample text images comprises a plurality of sample text images, the at least one subset of sample text images further comprising a second subset of sample text images;
wherein, for a sample text image of the plurality of sample text images, the partitioning sub-module comprises:
a first determination unit configured to determine the sample text image as a sample text image in the first sample text image subset, in a case where it is determined that a relationship between a sample text output result of the sample text image and a sample label satisfies a predetermined matching condition; and
a second determination unit configured to determine the sample text image as a sample text image in the second sample text image subset, in a case where it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition.
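A minimal Python sketch of the partitioning described in claims 17-18, assuming exact string equality as the "predetermined matching condition" (the claims leave the condition open; `partition_samples` and its argument names are illustrative):

```python
def partition_samples(images, outputs, labels):
    """Divide a sample text image set into a first subset (sample text
    output result matches the label) and a second subset (it does not).

    Exact equality stands in for the claims' matching condition; a
    fuzzy match such as edit distance would slot in here equally well.
    """
    first_subset, second_subset = [], []
    for image, output, label in zip(images, outputs, labels):
        if output == label:           # matching condition satisfied
            first_subset.append(image)
        else:                         # matching condition not satisfied
            second_subset.append(image)
    return first_subset, second_subset
```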
19. The apparatus of any of claims 16-18, wherein the sample text image set to be cropped comprises a plurality of sample text images to be cropped;
wherein, for a sample text image to be cropped in the sample text image set to be cropped, the determining module comprises:
a determining sub-module configured to determine at least one target cropping position from a plurality of candidate cropping positions according to a sample text output result of the sample text image to be cropped.
20. The apparatus of claim 19, wherein the sample text output result comprises at least one of: a sample text recognition output result and a sample text semantic output result.
21. The apparatus of claim 20, wherein the set of sample text images comprises a plurality of sample text images;
the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;
the sample text semantic output result is obtained by performing semantic understanding on a second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
22. The apparatus of claim 20, wherein, in a case where the sample text output result comprises the sample text recognition output result and the sample text semantic output result, the determining sub-module comprises:
a third determining unit configured to determine the plurality of candidate cropping positions according to a sample text recognition output result of the sample text image to be cropped; and
a fourth determining unit configured to determine at least one target cropping position from the plurality of candidate cropping positions according to a sample text semantic output result of the sample text image to be cropped.
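Claim 22 derives candidate cropping positions from the recognition output and narrows them with the semantic output. One illustrative reading: per-character boundaries supply the candidates, and only cuts that land on a word boundary survive the semantic filter (the names and the whitespace heuristic are assumptions, not the patent's method):

```python
def target_crop_positions(recognized_text, char_boundaries):
    """char_boundaries[i] is the pixel column just after character i of
    the recognized text; keep only interior cuts followed by a space,
    so every resulting crop contains whole words."""
    targets = []
    for i, x in enumerate(char_boundaries):
        if i + 1 < len(recognized_text) and recognized_text[i + 1] == " ":
            targets.append(x)
    return targets
```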
23. The apparatus of any of claims 16-18, wherein the first obtaining module comprises:
a first obtaining sub-module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
24. The apparatus of any of claims 16-18, wherein the second obtaining module comprises:
a second obtaining sub-module configured to obtain a third sample text image subset according to the at least one cropped sample text image subset; and
a third obtaining sub-module configured to obtain the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
25. The apparatus of claim 24, wherein the second obtaining submodule comprises:
an obtaining unit configured to combine the cropped sample text images in the at least one cropped sample text image subset based on a preset combination strategy to obtain the third sample text image subset.
26. The apparatus of any of claims 16-18, wherein the first sample text image subset comprises a plurality of first sample text images;
wherein the sample text image set to be cropped is determined by:
for a first sample text image of the plurality of first sample text images,
determining the first sample text image as a sample text image to be cropped in the sample text image set to be cropped, in a case where it is determined that a predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold.
27. The apparatus of any of claims 16-18, further comprising:
a third obtaining module configured to perform data enhancement processing on an original sample text image set to obtain an intermediate sample text image set; and
a fourth obtaining module configured to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
28. The apparatus of any of claims 16-18, wherein the sample set of text images is a set of text images for a text vision task.
29. A training apparatus for deep learning models, comprising:
a first acquisition module configured to acquire a target sample text image set; and
a fifth obtaining module configured to train the deep learning model with the target sample text image set to obtain a text image processing model,
wherein the target sample text image set is obtained using an apparatus according to any one of claims 16 to 28.
30. A text image processing apparatus comprising:
a second acquisition module configured to acquire a text image to be processed; and
a sixth obtaining module configured to input the text image to be processed into a text image processing model to obtain a text image processing result,
wherein the text image processing model is trained using the apparatus of claim 29.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-15.
CN202211015424.6A 2022-08-24 2022-08-24 Text image generation, training, text image processing method and electronic equipment Active CN115082598B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211015424.6A CN115082598B (en) 2022-08-24 2022-08-24 Text image generation, training, text image processing method and electronic equipment
PCT/CN2023/074125 WO2024040870A1 (en) 2022-08-24 2023-02-01 Text image generation, training, and processing methods, and electronic device


Publications (2)

Publication Number Publication Date
CN115082598A true CN115082598A (en) 2022-09-20
CN115082598B CN115082598B (en) 2023-07-11

Family

ID=83244124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211015424.6A Active CN115082598B (en) 2022-08-24 2022-08-24 Text image generation, training, text image processing method and electronic equipment

Country Status (2)

Country Link
CN (1) CN115082598B (en)
WO (1) WO2024040870A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024040870A1 (en) * 2022-08-24 2024-02-29 北京百度网讯科技有限公司 Text image generation, training, and processing methods, and electronic device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0567344A2 (en) * 1992-04-24 1993-10-27 Canon Kabushiki Kaisha Method and apparatus for character recognition
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
US20190294921A1 (en) * 2018-03-23 2019-09-26 Abbyy Production Llc Field identification in an image using artificial intelligence
CN110796031A (en) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Table identification method and device based on artificial intelligence and electronic equipment
CN111695385A (en) * 2019-03-15 2020-09-22 杭州海康威视数字技术股份有限公司 Text recognition method, device and equipment
CN112766418A (en) * 2021-03-02 2021-05-07 阳光财产保险股份有限公司 Image text direction classification method, device, equipment and storage medium
CN113657370A (en) * 2021-08-26 2021-11-16 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113920296A (en) * 2021-11-23 2022-01-11 厦门市美亚柏科信息股份有限公司 Text recognition method and system based on comparative learning
CN114529909A (en) * 2022-02-17 2022-05-24 北京百度网讯科技有限公司 Sample data set generation method and device and electronic equipment
CN114639064A (en) * 2022-05-18 2022-06-17 智洋创新科技股份有限公司 Water level identification method and device
WO2022154787A1 (en) * 2021-01-13 2022-07-21 Hewlett-Packard Development Company, L.P. Image region of interest defect detection
CN114818708A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Key information extraction method, model training method, related device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489682B1 (en) * 2017-12-21 2019-11-26 Automation Anywhere, Inc. Optical character recognition employing deep learning with machine generated training data
CN109978044B (en) * 2019-03-20 2021-03-19 广州云测信息技术有限公司 Training data generation method and device, and model training method and device
US11295155B2 (en) * 2020-04-08 2022-04-05 Konica Minolta Business Solutions U.S.A., Inc. Online training data generation for optical character recognition
CN115082598B (en) * 2022-08-24 2023-07-11 北京百度网讯科技有限公司 Text image generation, training, text image processing method and electronic equipment



Also Published As

Publication number Publication date
CN115082598B (en) 2023-07-11
WO2024040870A1 (en) 2024-02-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant