WO2024040870A1 - Text image generation, training, and processing methods, and electronic device - Google Patents

Text image generation, training, and processing methods, and electronic device

Info

Publication number
WO2024040870A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample text
text image
sample
cropped
subset
Prior art date
Application number
PCT/CN2023/074125
Other languages
French (fr)
Chinese (zh)
Inventor
郭若愚
杜宇宁
赖宝华
马艳军
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2024040870A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning technology, and can be applied to optical character recognition scenarios. Specifically, it relates to a text image generation method, a training method for a deep learning model, a text image processing method, and an electronic device.
  • Artificial intelligence technology can include computer vision technology, speech recognition technology, natural language processing technology, machine learning, deep learning, big data processing technology and knowledge graph technology, etc.
  • AI technology can be leveraged to generate text images for training deep learning models.
  • The present disclosure provides a text image generation method, a training method for a deep learning model, a text image processing method, and an electronic device.
  • A text image generation method, including: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images with correct sample text output results; determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset; cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and obtaining a target sample text image set based on the at least one cropped sample text image subset and the at least one sample text image subset.
  • A training method for a deep learning model, including: obtaining a target sample text image set; and training the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the text image generation method according to the present disclosure.
  • A text image processing method, including: obtaining a text image to be processed; and inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the training method according to the present disclosure.
  • A text image generating device, including: a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images with correct sample text output results; a determination module configured to determine a target cropping position set of a sample text image set to be cropped based on a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined based on the first sample text image subset; a first acquisition module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a second acquisition module configured to obtain a target sample text image set based on the at least one cropped sample text image subset and the at least one sample text image subset.
  • A training device for a deep learning model, including: a first acquisition module for acquiring a target sample text image set; and a third acquisition module for training the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the text image generating device according to the present disclosure.
  • A text image processing device, including: a second acquisition module for acquiring a text image to be processed; and a fourth acquisition module for inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the training device according to the present disclosure.
  • An electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method according to the present disclosure.
  • Figure 1 schematically illustrates an exemplary system architecture that may include a text image generation method, a deep learning model training method, and a text image processing method and device according to an embodiment of the present disclosure
  • Figure 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present disclosure
  • FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure
  • FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the present disclosure
  • FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure
  • FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure
  • FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure
  • Figure 4 schematically shows a flow chart of a training method for a deep learning model according to an embodiment of the present disclosure
  • Figure 5 schematically shows a flow chart of a text image processing method according to an embodiment of the present disclosure
  • Figure 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure
  • Figure 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure
  • Figure 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure.
  • FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a text image generation method, a deep learning model training method, and a text image processing method according to an embodiment of the present disclosure.
  • FIG. 1 schematically illustrates an exemplary system architecture that may include a text image generation method, a deep learning model training method, and a text image processing method and device according to an embodiment of the present disclosure.
  • Figure 1 is only an example of a system architecture to which embodiments of the present disclosure can be applied, to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure cannot be used in other applications.
  • The exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and device can be applied may alternatively include only a terminal device; that is, the terminal device may implement the text image generation method, deep learning model training method, and text image processing method and device provided by the embodiments of the present disclosure without interacting with a server.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105.
  • the network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105.
  • Network 104 may include various connection types, for example, at least one of wired and wireless communication links.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, etc.
  • Various communication client applications can be installed on the terminal devices 101, 102, and 103, for example, at least one of a knowledge reading application, a web browser application, a search application, an instant messaging tool, an email client, and social platform software.
  • The terminal devices 101, 102, and 103 may be various electronic devices having a display screen and supporting web browsing, for example, at least one of smartphones, tablets, laptops, and desktop computers.
  • Server 105 may be various types of servers providing various services.
  • The server 105 can be a cloud server, also known as a cloud computing server or a cloud host. It is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS (Virtual Private Server) services.
  • the server 105 can also be a server of a distributed system, or a server combined with a blockchain.
  • the text image generation method and text image processing method provided by the embodiments of the present disclosure can generally be executed by the terminal device 101, 102, or 103.
  • the text image generating device and the text image processing device provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
  • the text image generation method and text image processing method provided by the embodiments of the present disclosure can also be executed by the server 105.
  • the text image generation device and the text image processing device provided by the embodiments of the present disclosure may also be provided in the server 105.
  • the text image generation method and text image processing method provided by the embodiments of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
  • the text image generation device and the text image processing device provided by the embodiments of the present disclosure may also be provided in a server or server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
  • the training method of the deep learning model provided by the embodiment of the present disclosure can generally be executed by the server 105.
  • the training device for the deep learning model provided by the embodiment of the present disclosure may generally be provided in the server 105.
  • the deep learning model training method provided by the embodiment of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
  • the training device of the deep learning model provided by the embodiment of the present disclosure can also be set up in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
  • the deep learning model training method provided by the embodiment of the present disclosure can generally also be executed by the terminal device 101, 102, or 103.
  • the training device for the deep learning model provided by the embodiment of the present disclosure can also be provided in the terminal device 101, 102, or 103.
  • FIG. 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present disclosure.
  • the method 200 includes operations S210 to S240.
  • In operation S210, the sample text image set is divided into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set.
  • In operation S220, a target cropping position set of the sample text image set to be cropped is determined based on the sample text output result set of the sample text image set to be cropped.
  • In operation S230, the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample text image subset.
  • In operation S240, a target sample text image set is obtained based on the at least one cropped sample text image subset and the at least one sample text image subset.
  • the at least one sample text image subset may include a first sample text image subset.
  • the first subset of sample text images may include sample text images whose sample text output results are correct.
  • the set of sample text images to be cropped may be determined based on the first subset of sample text images.
  • the text image may include at least one of the following: a document text image and a scene text image.
  • Document text images can refer to text images with neat layout, controlled lighting and a relatively simple background.
  • Scene text images can refer to text images with complex backgrounds, diverse text forms, and uncontrolled lighting.
  • Text form may include at least one of the following: text color, size, font, direction, irregular layout, etc.
  • Layout irregularities may include at least one of bends, tilts, wrinkles, deformations, and mutilations.
  • the sample text image set may include at least one sample text image.
  • the sample text image may include at least one of the following: a sample document text image and a sample scene text image.
  • the sample text image set may be an image set for a text vision task.
  • the sample text images can be text images for various text vision tasks.
  • the text vision task may include at least one of the following: text image recognition task, text image classification task, text image segmentation task, text image detection task, text image retrieval task, etc.
  • Text vision tasks may also include at least one of the following: subdivided domain tasks corresponding to text image recognition tasks, subdivided domain tasks corresponding to text image classification tasks, subdivided domain tasks corresponding to text image segmentation tasks, subdivided domain tasks corresponding to text image detection tasks, and subdivided domain tasks corresponding to text image retrieval tasks.
  • The subdivided domain tasks corresponding to the text image recognition task may include at least one of the following: bill image recognition tasks, medical text image recognition tasks, financial product text image recognition tasks, video subtitle recognition tasks, security monitoring recognition tasks, etc.
  • the subdivision tasks corresponding to the text image classification task may include at least one of the following: bill image classification tasks, medical text image classification tasks, financial product text image classification tasks, video subtitle classification tasks, security monitoring classification tasks, etc.
  • the subdivided domain tasks corresponding to the text image segmentation task may include at least one of the following: bill image segmentation tasks, medical text image segmentation tasks, financial product text image segmentation tasks, etc.
  • the subdivision tasks corresponding to the text image detection task may include at least one of the following: bill image detection tasks, medical text image detection tasks, financial product text image detection tasks, video subtitle detection tasks, security monitoring detection tasks, etc.
  • the subdivision tasks corresponding to text image retrieval tasks may include at least one of the following: bill image retrieval tasks, medical text image retrieval tasks, financial product text image retrieval tasks, video subtitle retrieval tasks, security monitoring retrieval tasks, etc.
  • The sample text image set may have a corresponding sample text output result set and a corresponding sample label set.
  • the set of sample text output results may include at least one sample text output result.
  • the sample label set may include at least one sample label.
  • the sample text image may have a sample text output result and a sample label corresponding to the sample text image.
  • the sample text output result can characterize the predicted text result of the sample text image.
  • the sample text output result may include at least one of a sample text recognition output result and a sample text semantic output result.
  • the sample text recognition output result can characterize the predicted text recognition result of the sample text image.
  • sample text semantic output result can characterize the predicted semantic result of the sample text image.
  • Sample labels can characterize the real text results of sample text images.
  • the sample label may include at least one of a sample text recognition label and a sample text semantic label.
  • the sample text recognition label can characterize the real text recognition results of the sample text image.
  • Sample text semantic labels can characterize the real semantic results of sample text images.
  • the text recognition result may refer to a sequence of characters included in the text image.
  • The sample text image set may include a first sample text image subset.
  • The sample text images in the first sample text image subset may refer to sample text images whose sample text output results are correct.
  • the first subset of sample text images may include a set of sample text images to be cropped.
  • the set of sample text images to be cropped may include at least one sample text image to be cropped.
  • the sample text image to be cropped may refer to the sample text image in the first sample text image subset that satisfies the predetermined cropping condition.
  • The predetermined cropping condition can be configured according to actual business needs and is not limited here.
  • the predetermined cropping condition may include that a predetermined probability value corresponding to the sample text image is less than or equal to a predetermined probability threshold.
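A minimal sketch of the probability-threshold form of the cropping condition follows. It assumes that the predetermined probability value is drawn uniformly at random for each sample; the disclosure does not say how the value is obtained, and the names `select_to_be_cropped` and `prob_threshold` are hypothetical.

```python
import random


def select_to_be_cropped(first_subset, prob_threshold, rng):
    """Return the samples whose probability value is <= prob_threshold."""
    selected = []
    for sample in first_subset:
        # Assumption: the probability value is a uniform draw from [0, 1).
        p = rng.random()
        if p <= prob_threshold:
            selected.append(sample)
    return selected


rng = random.Random(0)  # seeded for repeatability
first_subset = ["img_%d" % i for i in range(10)]  # stand-in sample identifiers
to_crop = select_to_be_cropped(first_subset, prob_threshold=0.5, rng=rng)
print(len(to_crop), "of", len(first_subset), "samples selected for cropping")
```

With this reading, the threshold acts as the expected fraction of correctly recognized samples that get cropped.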
  • the sample text image to be cropped may have at least one cropping position corresponding to the sample text image to be cropped.
  • the target cropping position may refer to a cropping position that satisfies a predetermined position condition among at least one cropping position.
  • The predetermined position condition can be configured according to actual business needs and is not limited here.
  • For example, the predetermined position condition may be that the cropping position is randomly selected from the at least one cropping position.
  • the subset of cropped sample text images may include at least one cropped sample text image.
  • the cropped sample text image may be obtained by cropping the sample text image to be cropped based on the target cropping position.
  • a sample text image set may be obtained from a data source in response to detecting the text image generation instruction.
  • Data sources may include at least one of the following: local databases, cloud databases, and network resources.
  • For example, a data interface can be called, and the sample text image set can be obtained from the data source through the data interface.
  • the set of sample text images may include at least one sample text image.
  • the sample text image may be at least one of the following: a simulated sample text image and a real sample text image.
  • Real sample text images can be sample text images in public datasets.
  • The simulated sample text image may be generated in one of the following ways: based on predetermined image parameters, or by processing predetermined random noise data with a generative adversarial network model.
  • the first local feature extraction can be performed on the sample text image to obtain the first local sample feature map.
  • Global features can be extracted from the first local sample feature map to obtain a global sample feature sequence.
  • the global sample feature sequence can be sequence decoded to obtain the sample text recognition output result of the sample text image.
  • the second local feature extraction can be performed on the sample text image to obtain a second local sample feature map.
  • the second local sample feature map can be semantically understood to obtain the sample text semantic output result of the sample text image.
  • The sample text output result of the sample text image is obtained based on at least one of the sample text recognition output result and the sample text semantic output result of the sample text image.
  • Deep learning models can include deep learning models that can realize text recognition of variable-length character sequences and deep learning models that can realize text semantic understanding.
  • the model structure of the deep learning model can be configured according to actual business needs and is not limited here.
  • a deep learning model may include at least one model structure.
  • the model structure may include at least one model substructure and connection relationships between each model substructure.
  • the model structure may be a structure obtained by connecting at least one model substructure based on the connection relationship between the model substructures.
  • The at least one model substructure included in the model structure may be a structure from at least one operation layer.
  • the model structure may be a structure obtained by connecting at least one model substructure from at least one operation layer based on the connection relationship between the model substructures.
  • The at least one operation layer may include at least one of the following: an input layer, a convolutional layer, a hidden layer, a transcription layer, a pooling layer, an unpooling layer, a deconvolution layer, a feedforward neural network layer, an attention layer, a residual layer, a fully connected layer, a batch normalization layer, a linear embedding layer, a non-linear layer, etc.
  • The deep learning model for text recognition may include one of the following: a text recognition model based on CRNN (Convolutional Recurrent Neural Network) and a text recognition model based on an encoder-decoder.
  • CRNN can include convolutional layers, recurrent layers, and transcription layers.
  • the encoder-decoder can include one of the following: symmetric encoder-decoder and asymmetric encoder-decoder.
  • The CRNN-based text recognition model may include at least one of the following: a CRNN model based on CTC (i.e., Connectionist Temporal Classification), a CRNN model based on attention, and a CRNN model based on ACE (i.e., Aggregation Cross Entropy).
  • The encoder-decoder based text recognition model may include a Seq-To-Seq (i.e., Sequence-To-Sequence) based text recognition model.
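As an illustration of the sequence decoding step for the CTC option named above, the following is a minimal sketch of greedy (best-path) CTC decoding: consecutive repeated labels are collapsed, then blanks are removed. The blank index, the toy alphabet, and the function name `ctc_greedy_decode` are assumptions made for this example and do not come from the disclosure.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks (CTC best-path rule).

    frame_ids: per-frame argmax class indices from a recognizer (assumed input).
    blank: index reserved for the CTC blank symbol (assumed to be 0 here).
    """
    decoded = []
    previous = None
    for idx in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


# Toy alphabet and frame sequence, purely illustrative.
alphabet = {1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
ids = ctc_greedy_decode(frames)
text = "".join(alphabet[i] for i in ids)
print(text)  # cat
```

A blank between two identical labels keeps both of them, which is how CTC represents doubled characters.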
  • The deep learning model for text semantic understanding may include at least one of the following: a convolutional neural network-based text semantic understanding model, a recurrent neural network-based text semantic understanding model, and a Transformer-based text semantic understanding model.
  • the training method of the deep learning model can be configured according to actual business needs, and is not limited here.
  • the training method may include at least one of the following: unsupervised training, supervised training, and semi-supervised training.
  • the sample text image set may be divided into at least one sample text image subset according to the sample text output result and the sample label of the sample text image.
  • the at least one sample text image subset may include a first sample text image subset.
  • the at least one sample text image subset may also include a second sample text image subset.
  • the sample text images in the second subset of sample text images may refer to sample text images whose sample text output results are incorrect sample text output results.
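The division into a first (correct) and second (incorrect) subset can be sketched as a comparison of each output result with its label. This is a minimal illustration that assumes exact string equality decides correctness; the `Sample` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    image_id: str  # hypothetical identifier of the sample text image
    output: str    # sample text output result (model prediction)
    label: str     # sample label (ground truth)


def split_by_correctness(samples):
    """Return (first_subset, second_subset): correct and incorrect outputs."""
    first, second = [], []
    for s in samples:
        # Assumption: an output is "correct" when it equals the label exactly.
        (first if s.output == s.label else second).append(s)
    return first, second


samples = [
    Sample("a", "hello", "hello"),
    Sample("b", "w0rld", "world"),
    Sample("c", "text", "text"),
]
first_subset, second_subset = split_by_correctness(samples)
print([s.image_id for s in first_subset])   # ['a', 'c']
print([s.image_id for s in second_subset])  # ['b']
```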
  • a plurality of candidate cropping positions may be determined based on the sample text output result of the sample text image to be cropped. Determine at least one target cropping position from a plurality of candidate cropping positions. For example, at least one target cropping position may be randomly determined from a plurality of candidate cropping positions. Alternatively, a position corresponding to at least one target character may be determined from a plurality of candidate cropping positions. A position corresponding to at least one target character is determined as at least one target cropping position.
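The following sketch shows one way candidate cropping positions could be derived from a text output result and then sampled at random. It assumes that characters are evenly spaced across the image width, which is a simplification not stated in the disclosure; both function names are hypothetical.

```python
import random


def candidate_crop_positions(image_width, text):
    """Candidate x-coordinates at the boundaries between adjacent characters.

    Assumption: characters are evenly spaced across the image width.
    """
    n = len(text)
    if n < 2:
        return []  # nothing to split for empty or single-character text
    char_width = image_width / n
    return [round(i * char_width) for i in range(1, n)]


def pick_target_positions(candidates, k, rng):
    """Randomly choose up to k target positions, sorted left to right."""
    k = min(k, len(candidates))
    return sorted(rng.sample(candidates, k))


candidates = candidate_crop_positions(image_width=100, text="HELLO")
print(candidates)  # [20, 40, 60, 80]
targets = pick_target_positions(candidates, k=2, rng=random.Random(0))
print(targets)
```

Because the character count comes from an output result known to be correct, positions placed at character boundaries are less likely to cut through a character.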
  • The sample text image to be cropped can be cropped based on the at least one target cropping position corresponding to the sample text image to be cropped, to obtain at least one cropped sample text image.
  • The at least one cropped sample text image corresponding to each sample text image to be cropped included in the sample text image set to be cropped may be combined to obtain at least one combined sample text image.
  • Obtaining a target sample text image set based on at least one cropped sample text image subset and at least one sample text image subset may include: obtaining the target sample text image set based on the sample text image subsets other than the first sample text image subset among the at least one sample text image subset, the sample text images in the first sample text image subset other than the sample text image set to be cropped, and the at least one combined sample text image.
  • a target sample text image set may be obtained based on the sample text image set and at least one combined sample text image.
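Cropping at a target position and combining the pieces can be sketched as follows. The example assumes grayscale images stored as lists of pixel rows, horizontal concatenation of pieces taken from two different images of equal height, and labels joined in the same order; these choices and the function names are illustrative assumptions rather than the method of the disclosure.

```python
def crop_at(image, text, x, char_index):
    """Split an image (list of pixel rows) and its text at column x."""
    left = ([row[:x] for row in image], text[:char_index])
    right = ([row[x:] for row in image], text[char_index:])
    return left, right


def combine(piece_a, piece_b):
    """Concatenate two cropped pieces side by side and join their labels."""
    (img_a, txt_a), (img_b, txt_b) = piece_a, piece_b
    if len(img_a) != len(img_b):
        raise ValueError("pieces must share the same height")
    return [ra + rb for ra, rb in zip(img_a, img_b)], txt_a + txt_b


# Stand-in images: 4 rows high, 10 and 8 columns wide.
img1 = [[0] * 10 for _ in range(4)]    # pretend this renders "HELLO"
img2 = [[255] * 8 for _ in range(4)]   # pretend this renders "TEXT"

left1, _ = crop_at(img1, "HELLO", x=4, char_index=2)
_, right2 = crop_at(img2, "TEXT", x=4, char_index=2)
combined_img, combined_txt = combine(left1, right2)
print(len(combined_img), len(combined_img[0]), combined_txt)  # 4 8 HEXT
```

The combined image and its joined label form a new training sample whose background and context differ from both source images.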
  • the text image generation method of the embodiment of the present disclosure can be executed by an electronic device.
  • the electronic device may be a server or a terminal device.
  • the electronic device may include at least one processor.
  • the processor may be used to execute the text image generation method provided by embodiments of the present disclosure.
  • a single processor may be used to execute the text image generation method provided by embodiments of the present disclosure.
  • multiple processors may also be used to execute the text image generation method provided by the embodiments of the present disclosure in parallel.
  • The target cropping position set is determined based on the sample text output result set of the sample text image set to be cropped, and the sample text image set to be cropped is determined based on the first sample text image subset. The sample text images in the first sample text image subset are those with correct sample text output results, determined from the sample text image set according to its sample text output result set and sample label set. The accuracy of the target cropping positions can therefore be effectively guaranteed, which prevents character information from being destroyed.
  • In addition, the target sample text image set is obtained based on the at least one sample text image subset and the at least one cropped sample text image subset, the latter obtained by cropping the sample text image set to be cropped based on the target cropping position set. This increases the background complexity and diversity of the sample text images, so that a target sample text image set with richer contextual information can be obtained.
  • On this basis, using the target sample text image set for subsequent model training and optimization reduces the number of model iterations and increases the model training speed, which in turn reduces the data processing volume and resource consumption of the electronic device. This achieves the effect of improving the internal performance of the electronic device in conformity with natural laws, thereby enhancing the core competitiveness of the electronic device.
  • the above text image generation method may further include the following operations.
  • the set of original sample text images may include at least one original sample text image.
  • Data augmentation may include at least one of the following: supervised data augmentation and unsupervised data augmentation.
  • Supervised data augmentation may include at least one of the following: single-sample data augmentation and multi-sample data augmentation.
  • Unsupervised data augmentation may include at least one of the following: data augmentation to generate new data and data augmentation to learn an augmentation strategy.
  • single-sample data enhancement may include at least one of the following: a geometric transformation class and a color transformation class.
  • the geometric transformation class may include at least one of the following: flipping, rotation, random cropping, deformation, scaling, etc.
  • the color transformation class may include at least one of the following: noise, blur, color transformation, erasure and fill, etc.
  • multi-sample data enhancement may include at least one of the following: SMOTE (Synthetic Minority Over-sampling Technique), Sample Pairing, Mixup, Cutout, Cutmix, Fmix and ROImix, etc.
  • data augmentation to generate new data may include data augmentation based on a generative adversarial network model.
  • Data augmentation for learning augmentation strategies can include automatic data augmentation.
  • data enhancement can be performed on the original sample text image in the original sample text image set to obtain at least one intermediate sample text image corresponding to the original sample text image.
  • the data augmentation applied to the respective original sample text images may be entirely different from one another, partially the same, or completely the same.
  • the original sample text image set may include original sample text image A and original sample text image B.
  • the original sample text image A can be subjected to geometric transformation data enhancement to obtain at least one intermediate sample text image corresponding to the original sample text image A.
  • Data enhancement such as color transformation can be performed on the original sample text image B to obtain at least one intermediate sample text image corresponding to the original sample text image B.
  • obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include: determining the intermediate sample text image set as the sample text image set. Alternatively, at least part of the original sample text image set and at least part of the intermediate sample text image set are determined as the sample text image set.
  • the image diversity of the third sample text image in the third sample text image subset can be effectively guaranteed.
  • training the deep learning model using the third sample text image subset can improve the generalization performance of the model.
  • obtaining a sample text image set based on the original sample text image set and the intermediate sample text image set may include the following operations.
  • for an original sample text image in the original sample text image set, when it is determined that the height of the original sample text image is not a predetermined height, the height of the original sample text image is adjusted to the predetermined height while keeping the aspect ratio of the original sample text image unchanged, to obtain the adjusted original sample text image.
  • for an intermediate sample text image in the intermediate sample text image set, when it is determined that the height of the intermediate sample text image is not the predetermined height, the height of the intermediate sample text image is adjusted to the predetermined height while keeping the aspect ratio of the intermediate sample text image unchanged, to obtain the adjusted intermediate sample text image.
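The height adjustment described above can be illustrated by computing the target dimensions only. This is a minimal sketch, not the disclosed implementation: the function name `resize_to_height` and the rounding rule are assumptions, and the actual pixel resampling would be delegated to an image library.

```python
def resize_to_height(width: int, height: int, target_height: int) -> tuple[int, int]:
    """Return (new_width, new_height) with the aspect ratio preserved.

    Hypothetical helper: only the dimensions are computed here.
    """
    if height <= 0 or target_height <= 0:
        raise ValueError("heights must be positive")
    if height == target_height:
        # The image already has the predetermined height; leave it unchanged.
        return width, height
    scale = target_height / height
    # Round to the nearest pixel, but never collapse the image to zero width.
    new_width = max(1, round(width * scale))
    return new_width, target_height


print(resize_to_height(200, 64, 32))  # a 200x64 image scaled to height 32
```

Bringing every image to the same height is what later allows cropped parts from different images to be joined side by side.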
  • operation S210 may include the following operations.
  • the sample text image set is divided into at least one sample text image subset.
  • the comparison result may include that the relationship between the two objects satisfies the predetermined matching condition and the relationship between the two objects does not satisfy the predetermined matching condition.
  • the two objects can refer to sample text output results and sample labels.
  • the predetermined matching conditions can be configured according to actual business needs and are not limited here.
  • the predetermined matching condition may include two objects matching.
  • the sample text output result of the sample text image and the sample label can be compared to obtain a comparison result corresponding to the sample text image.
  • the sample text image can be divided into a sample text image subset corresponding to the comparison result.
  • the sample text image set may include a plurality of sample text images.
  • the at least one subset of sample text images may also include a second subset of sample text images.
  • dividing the sample text image set into at least one sample text image subset according to the comparison result may include the following operations.
  • for a sample text image among the plurality of sample text images, when it is determined that the relationship between the sample text output result of the sample text image and the sample label satisfies the predetermined matching condition, the sample text image is determined to be a sample text image in the first sample text image subset. When it is determined that the relationship between the sample text output result of the sample text image and the sample label does not satisfy the predetermined matching condition, the sample text image is determined to be a sample text image in the second sample text image subset.
  • the predetermined matching condition may be used as a basis for dividing the sample text image subsets.
  • the predetermined matching condition may include that the difference between the sample text output result and the sample label is less than or equal to a predetermined difference threshold.
  • the predetermined difference threshold can be configured according to actual business needs and is not limited here.
  • the predetermined difference threshold may be 0.1.
  • the sample text images in the first subset of sample text images may refer to sample text images whose sample text output results are correct sample text output results.
  • the sample text images in the second sample text image subset may refer to sample text images whose sample text output results are incorrect sample text output results.
  • for a sample text image among the plurality of sample text images, it is determined whether the difference between the sample text output result of the sample text image and the sample label is less than or equal to the predetermined difference threshold.
  • when it is determined that the difference is less than or equal to the predetermined difference threshold, the sample text image may be determined to be a sample text image in the first sample text image subset.
  • when it is determined that the difference is greater than the predetermined difference threshold, the sample text image may be determined to be a sample text image in the second sample text image subset.
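The division into a first and a second subset can be sketched as follows. The disclosure does not say how the difference between an output result and a label is measured; the normalized edit distance used here is an assumption chosen for illustration, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def difference(output: str, label: str) -> float:
    """Assumed difference measure: edit distance divided by the longer length."""
    longest = max(len(output), len(label))
    return 0.0 if longest == 0 else edit_distance(output, label) / longest


def partition(samples, threshold=0.1):
    """Split (image_id, output, label) triples into first and second subsets."""
    first, second = [], []
    for image_id, output, label in samples:
        if difference(output, label) <= threshold:
            first.append(image_id)   # output result treated as correct
        else:
            second.append(image_id)  # output result treated as incorrect
    return first, second


samples = [("img_a", "hello", "hello"), ("img_b", "hallo", "hello")]
print(partition(samples))
```

With the example threshold of 0.1 from the text, a single wrong character in a five-character label (difference 0.2) already sends the image to the second subset.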
  • the target cropping position set is determined based on the sample text output result set of the sample text image set to be cropped, and the sample text image set to be cropped is determined based on the first sample text image subset. The first sample text images in the first sample text image subset are sample text images for which the relationship between the sample text output result and the sample label satisfies the predetermined matching condition. Therefore, the accuracy of the target cropping position can be effectively guaranteed, and character information is effectively prevented from being destroyed.
  • the first sample text image subset may include a plurality of first sample text images.
  • the sample text image set to be cropped may be determined in the following manner:
  • for a first sample text image among the plurality of first sample text images, when it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined to be a sample text image to be cropped.
  • the predetermined probability value and the predetermined probability threshold may be used to determine that the first sample text image in the first sample text image subset is a sample text image to be cropped in the sample text image set to be cropped.
  • the predetermined probability value and the predetermined probability threshold can be configured according to actual business requirements and are not limited here.
  • the predetermined probability value may be a number greater than or equal to 0 and less than 1.
  • the predetermined probability threshold may be a number greater than or equal to 0 and less than or equal to 1.
  • the predetermined probability threshold can be determined based on model characteristics of the deep learning model. Model characteristics may include at least one of model structural complexity, fit, and generality.
  • if the model structure of the deep learning model is characterized by at least one of strong generality, high complexity, and a tendency to overfit, a predetermined probability threshold with a larger value may be configured. If the model structure of the deep learning model is characterized by at least one of weak generality, low complexity, and a tendency to underfit, a predetermined probability threshold with a smaller value may be configured.
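The probability-based selection of images to be cropped might look like the sketch below. How the predetermined probability value of each image is produced is not stated in the text; drawing it uniformly from [0, 1) is an assumption, and `select_images_to_crop` is a hypothetical name.

```python
import random


def select_images_to_crop(first_subset, probability_threshold, rng=None):
    """Keep each first sample text image whose probability value is at most the threshold."""
    if rng is None:
        rng = random.Random()
    selected = []
    for image in first_subset:
        # Assumption: the per-image probability value is a uniform draw in [0, 1).
        probability_value = rng.random()
        if probability_value <= probability_threshold:
            selected.append(image)
    return selected


images = ["img_0", "img_1", "img_2", "img_3"]
print(select_images_to_crop(images, 0.5, random.Random(0)))
```

Under this assumption the threshold acts as the expected fraction of correctly recognized images that get cropped, which is why a larger value suits models that tend to overfit.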
  • the set of sample text images to be cropped may include a plurality of sample text images to be cropped.
  • operation S220 may include the following operations.
  • At least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
  • a plurality of candidate cropping positions may be determined based on the sample text output results of the sample text image to be cropped. At least one target cropping position is randomly determined from a plurality of candidate cropping positions.
  • image diversity of sample text images can be improved by randomly determining at least one target cropping position from a plurality of candidate cropping positions.
  • the sample text image set may include a plurality of sample text images.
  • the sample text recognition output result may be obtained by sequentially decoding the global sample feature sequence of the sample text image.
  • the global sample feature sequence may be obtained by extracting global features from the first local sample feature map of the sample text image.
  • the first local sample feature map may be obtained by extracting the first local feature from the sample text image.
  • the sample text semantic output result may be obtained by semantic understanding of the second local sample feature map of the sample text image.
  • the second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
  • a CRNN-based text recognition model can be used to process sample text images to obtain sample text recognition output results.
  • CRNN can include convolutional layers, recurrent layers and transcription layers.
  • the convolutional layer can be used to process the sample text image to obtain the first local sample feature map.
  • the recurrent layer can be used to process the first local sample feature map to obtain the global sample feature sequence.
  • the transcription layer can be used to process the global sample feature sequence and obtain the sample text recognition output result.
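As an illustration of what a transcription layer does, the sketch below decodes a sequence of per-frame class scores into text. The text does not name the decoding rule; greedy CTC decoding (take the best class per frame, collapse repeats, drop blanks) is assumed here because it is the usual choice for CRNN-style recognizers. The blank class is assumed to sit at index 0.

```python
def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding; class 0 is the blank, class i maps to alphabet[i - 1]."""
    blank = 0
    # Best-scoring class for every frame of the global feature sequence.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    previous = blank
    for index in best:
        # Emit a character only when it is not blank and not a repeat of the previous frame.
        if index != blank and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


def one_hot(cls, size=3):
    return [1.0 if k == cls else 0.0 for k in range(size)]


frames = [one_hot(c) for c in [1, 1, 0, 1, 2, 2, 0]]
print(ctc_greedy_decode(frames, "ab"))
```

The blank between the two runs of class 1 is what lets the decoder output a doubled character instead of merging the runs.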
  • in a case where the sample text output result includes a sample text recognition output result and a sample text semantic output result, determining at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped may include the following operations.
  • Multiple candidate cropping positions are determined based on the sample text recognition output results of the sample text image to be cropped. Determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
  • the sample text recognition output result of the sample text image to be cropped may be "Go to work today.”
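  • four candidate cropping positions are determined, namely the candidate cropping position between "today" and "day", the candidate cropping position between "day" and "go", the candidate cropping position between "go" and the next character, and so on for the remaining adjacent characters.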
  • At least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped, thereby improving the accuracy of the target cropping position.
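The two-stage choice of cropping positions can be sketched on character indices. The form of the sample text semantic output result is not specified; here it is assumed to be a word segmentation of the recognized text, so that target positions are restricted to word boundaries. All function names are hypothetical, and mapping a character index to a pixel column is left out.

```python
import random


def candidate_positions(recognized_text):
    """One candidate position between every pair of adjacent characters."""
    return list(range(1, len(recognized_text)))


def target_positions(recognized_text, words, count=1, rng=None):
    """Pick up to `count` candidates that coincide with word boundaries.

    Assumption: `words` is a segmentation whose concatenation equals `recognized_text`.
    """
    if rng is None:
        rng = random.Random()
    boundaries = set()
    offset = 0
    for word in words[:-1]:
        offset += len(word)
        boundaries.add(offset)
    allowed = [p for p in candidate_positions(recognized_text) if p in boundaries]
    # Random choice among the allowed positions keeps the generated images diverse.
    return sorted(rng.sample(allowed, min(count, len(allowed))))


print(candidate_positions("abcde"))
print(target_positions("abcde", ["ab", "c", "de"], count=2))
```

A five-character text yields four candidates, matching the count in the example above; filtering by the segmentation avoids cutting through the middle of a word.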
  • operation S230 may include the following operations.
  • the sample text image set to be cropped is cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
  • the first cropped sample text image subset may include at least one first cropped sample text image.
  • the second subset of cropped sample text images may include at least one second cropped sample text image.
  • the at least one target cropping position corresponding to the sample text image to be cropped may include a first target cropping position and a second target cropping position.
  • the sample text image to be cropped in the sample text image set to be cropped can be cropped based on the first target cropping position corresponding to the sample text image to be cropped, to obtain a first cropped sample text image corresponding to the sample text image to be cropped.
  • Cropping may be performed based on the second target cropping position corresponding to the sample text image to be cropped, to obtain a second cropped sample text image corresponding to the sample text image to be cropped.
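Cropping at a target position can be illustrated on a toy image stored as a list of pixel rows. This is a sketch under assumptions: the target cropping position is taken to be a column index, and `crop_at_column` is a hypothetical helper rather than a function from the disclosure.

```python
def crop_at_column(image, column):
    """Split an image (list of equal-length pixel rows) into left and right parts."""
    width = len(image[0])
    if not 0 < column < width:
        raise ValueError("cropping column must fall strictly inside the image")
    left = [row[:column] for row in image]
    right = [row[column:] for row in image]
    return left, right


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
left_part, right_part = crop_at_column(image, 1)
print(left_part, right_part)
```

Both parts keep the full image height, so they can later be joined with parts cut from other images of the same height.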
  • operation S240 may include the following operations.
  • a third sample text image subset is obtained based on at least one cropped sample text image subset. The target sample text image set is obtained according to at least one sample text image subset and the third sample text image subset.
  • At least one cropped sample text image subset may be combined to obtain a third sample text image subset.
  • the target sample text image set can be obtained according to the second sample text image subset and the third sample text image subset.
  • obtaining a third sample text image subset based on at least one cropped sample text image subset may include the following operations.
  • the cropped sample text images in at least one cropped sample text image subset are combined to obtain a third sample text image subset.
  • the predetermined combination strategy may refer to a strategy for combining cropped sample text images.
  • the predetermined combination strategy may include at least one of the following: a random combination strategy and a fixed combination strategy.
  • the third sample text image subset may include at least one third sample text image.
  • the third sample text image may be the same as or different from the sample text image in the sample text image set.
  • the cropped sample text image may be combined with cropped sample text images in other cropped sample text image subsets to obtain at least one third sample text image.
  • Other cropped sample text image subsets may be any other one or more cropped sample text image subsets in at least one cropped sample text image subset except the cropped sample text image subset.
  • the at least one cropped sample text image subset may include a first cropped sample text image subset and a second cropped sample text image subset.
  • the first cropped sample text image subset may represent a cropped sample text image subset in the first direction.
  • the second cropped sample text image subset may represent the cropped sample text image subset in the second direction.
  • the first direction may refer to the right direction.
  • the second direction may refer to the left direction.
  • the first cropped sample text image in the first cropped sample text image subset may be combined with at least one second cropped sample text image in the second cropped sample text image subset to obtain at least one third sample text image.
  • since the third sample text image subset is obtained by combining cropped sample text images in at least one cropped sample text image subset based on a predetermined combination strategy, random combination of cropped sample text images is achieved, which improves the image background complexity and image diversity of the third sample text images in the third sample text image subset. On this basis, using the third sample text image subset to train the deep learning model can improve the generalization performance of the model.
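The combination of cropped images under a predetermined strategy can be sketched as below. Images are toy lists of pixel rows of equal height. The text names random and fixed strategies without defining them; shuffling the partners for "random" and rotating them by one for "fixed" are assumptions made for this example, as is the function name `combine`.

```python
import random


def combine(left_parts, right_parts, strategy="random", rng=None):
    """Join each left part with a right part chosen by the given strategy."""
    if rng is None:
        rng = random.Random()
    partners = list(right_parts)
    if strategy == "random":
        rng.shuffle(partners)
    elif strategy == "fixed":
        # Assumed fixed rule: pair each left part with the right part of the next image.
        partners = partners[1:] + partners[:1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    combined = []
    for left, right in zip(left_parts, partners):
        # Horizontal concatenation, row by row; both parts share the same height.
        combined.append([l_row + r_row for l_row, r_row in zip(left, right)])
    return combined


lefts = [[[1], [2]], [[3], [4]]]
rights = [[[5, 6], [7, 8]], [[9], [0]]]
print(combine(lefts, rights, strategy="fixed"))
```

Because each combined image mixes the background of two source images, the resulting subset is more varied than the originals.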
  • the above text image generation method may further include the following operations.
  • the sample label set of the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample label subset.
  • a target sample label set is obtained based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset.
  • obtaining a target sample label set based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset may include the following operations.
  • a sample label subset corresponding to the third sample text image subset is obtained.
  • a target sample label set is obtained according to the sample label subset corresponding to at least one sample text image subset and the sample label subset corresponding to the third sample text image subset.
  • obtaining a sample label subset corresponding to the third sample text image subset based on at least one cropped sample label subset may include the following operations.
  • the cropped sample labels in at least one cropped sample label subset are combined to obtain a sample label subset corresponding to the third sample text image subset.
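Label handling mirrors image handling: each label is cut at the character index of its target cropping position, and the pieces are joined with the same pairing used for the images. The sketch below uses placeholder labels; the function names and the explicit `pairing` list are assumptions introduced for illustration.

```python
def crop_label(label, position):
    """Split a label string at the character index of the target cropping position."""
    return label[:position], label[position:]


def combine_labels(left_labels, right_labels, pairing):
    """Join left and right label pieces; `pairing` lists (left_index, right_index) pairs."""
    return [left_labels[i] + right_labels[j] for i, j in pairing]


# Two placeholder labels, cropped after the second and first character respectively.
left_a, right_a = crop_label("ABCD", 2)
left_b, right_b = crop_label("EF", 1)

# The pairing must be the one that was applied to the cropped images.
pairing = [(0, 1), (1, 0)]
print(combine_labels([left_a, left_b], [right_a, right_b], pairing))
```

Reusing the image pairing for the labels is what keeps every third sample text image aligned with its label.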
  • FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure.
  • the sample text image set 303 is divided into a first sample text image subset 303_1 and a second sample text image subset according to the sample text output result set 301 and the sample label set 302 of the sample text image set.
  • the sample text image set 304 to be cropped is determined according to the first sample text image subset 303_1.
  • the target cropping position set 306 of the sample text image set 304 to be cropped is determined.
  • the to-be-cropped sample text image set 304 is cropped based on the target cropping position set 306 to obtain at least one cropped sample text image subset 307.
  • a target sample text image set 308 is obtained.
  • FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the present disclosure.
  • the sample text image set 309 to be cropped may include a sample text image to be cropped 309_1 and a sample text image to be cropped 309_2.
  • the target cropping position is determined to be "the position between Ying and Bai" from multiple candidate cropping positions.
  • the sample text image 309_1 to be cropped is cropped based on the target cropping position to obtain a cropped sample text image 309_1_1 and a cropped sample text image 309_1_2.
  • the cropped sample text image 309_1_1 is a sample text image corresponding to "Mother and Baby”.
  • the cropped sample text image 309_1_2 is a sample text image corresponding to "Parkway”.
  • the target cropping position is the "position between transfer and transfer”.
  • the to-be-cropped sample text image 309_2 is cropped based on the target cropping position to obtain a cropped sample text image 309_2_1 and a cropped sample text image 309_2_2.
  • the cropped sample text image 309_2_1 is a sample text image corresponding to "turn”.
  • the cropped sample text image 309_2_2 is a sample text image corresponding to "Let".
  • the cropped sample text image 309_1_1 and the cropped sample text image 309_2_2 are combined to obtain the third sample text image 310_1 in the third sample text image subset 310, and the cropped sample text image 309_1_2 and the cropped sample text image 309_2_1 are combined to obtain the third sample text image 310_2 in the third sample text image subset 310.
  • the third sample text image 310_1 is a sample text image corresponding to "Mother and Infant Let”.
  • the third sample text image 310_2 is a sample text image corresponding to "Zhuanbahui".
  • FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
  • the third sample text image 311_1 is a sample text image corresponding to "Let mother and baby”.
  • the third sample text image 311_2 is a sample text image corresponding to "Baihuizhuan”.
  • FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
  • FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
  • the third sample text image 313_1 is a sample text image corresponding to "transformation of mother and child”.
  • the third sample text image 313_2 is a sample text image corresponding to "Let Baihui”.
  • Figure 4 schematically shows a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure.
  • the method 400 may include operations S410 to S420.
  • a deep learning model is trained using the target sample text image set to obtain a text image processing model.
  • the target sample text image set may be obtained according to the text image generation method described in the embodiments of the present disclosure.
  • the sample text image set to be cropped is determined based on the first sample text image subset, and the sample text images in the first sample text image subset are sample text images that have correct sample text output results and are determined from the sample text image set according to the sample text output result set and the sample label set of the sample text image set. Therefore, the accuracy of the target cropping position can be effectively ensured, and character information is effectively prevented from being destroyed.
  • a target sample text image set is obtained based on at least one cropped sample text image subset and at least one sample text image subset, and a target sample text image set with richer contextual information can be obtained.
  • the target sample text image set is used for subsequent model training and optimization, which reduces the number of model iterations and increases the training speed of the model. This reduces the data processing volume and resource consumption of the electronic device, thereby achieving the effect of improving the internal performance of the electronic device in a manner that conforms to natural laws and enhancing the core competitiveness of the electronic device.
  • FIG. 5 schematically shows a flowchart of a text image processing method according to an embodiment of the present disclosure.
  • the method 500 includes operations S510 to S520.
  • the text image to be processed is input into the text image processing model to obtain a text image processing result.
  • the text image processing model may be trained according to the deep learning model training method described in the embodiments of the present disclosure.
  • the collection, storage, use, processing, transmission, provision, disclosure and application of user personal information comply with relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
  • the user's authorization or consent is obtained before obtaining or collecting the user's personal information.
  • FIG. 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure.
  • the text image generating device 600 may include a dividing module 610, a determining module 620, a first obtaining module 630 and a second obtaining module 640.
  • the dividing module 610 is configured to divide the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set. At least one sample text image subset includes a first sample text image subset. The first sample text image subset includes sample text images with correct sample text output results.
  • the determination module 620 is configured to determine a target cropping position set of the sample text image set to be cropped based on the sample text output result set of the sample text image set to be cropped.
  • the set of sample text images to be cropped is determined based on the first subset of sample text images.
  • the first obtaining module 630 is configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset.
  • the second obtaining module 640 is configured to obtain a target sample text image set based on at least one cropped sample text image subset and at least one sample text image subset.
  • the dividing module 610 may include a comparison sub-module and a dividing sub-module.
  • the comparison submodule is used to compare the sample text output result set of the sample text image set and the sample label set to obtain the comparison result.
  • the dividing submodule is used to divide the sample text image set into at least one sample text image subset according to the comparison result.
  • the sample text image set includes a plurality of sample text images, and at least one sample text image subset further includes a second sample text image subset.
  • the dividing sub-module may include a first determination unit and a second determination unit.
  • a first determination unit configured to determine the sample text image as a sample text image in the first sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image satisfies the predetermined matching condition.
  • a second determination unit configured to determine the sample text image as a sample text image in the second sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image does not satisfy the predetermined matching condition.
  • the set of sample text images to be cropped may include a plurality of sample text images to be cropped.
  • the determining module 620 may include a determining sub-module.
  • the determining submodule is configured to determine at least one target cropping position from a plurality of candidate cropping positions based on the sample text output result of the sample text image to be cropped.
  • the sample text output result may include at least one of the following: a sample text recognition output result and a sample text semantic output result.
  • the sample text image set may include a plurality of sample text images.
  • the sample text recognition output result may be obtained by sequentially decoding the global sample feature sequence of the sample text image.
  • the global sample feature sequence may be obtained by extracting global features from the first local sample feature map of the sample text image.
  • the first local sample feature map may be obtained by extracting the first local feature from the sample text image.
  • the sample text semantic output result may be obtained by semantic understanding of the second local sample feature map of the sample text image.
  • the second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
  • the determination sub-module may include a third determination unit and a fourth determination unit.
  • the third determination unit is used to determine multiple candidate cropping positions based on the sample text recognition output result of the sample text image to be cropped.
  • the fourth determination unit is configured to determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
  • the first obtaining module 630 may include a first obtaining sub-module.
  • the first obtaining submodule is used to crop the sample text image set to be cropped based on the target cropping position set, and obtain the first cropped sample text image subset and the second cropped sample text image subset.
  • the second obtaining module 640 may include a second obtaining sub-module and a third obtaining sub-module.
  • the second obtaining submodule is used to obtain a third sample text image subset based on at least one cropped sample text image subset.
  • the third obtaining submodule is used to obtain a target sample text image set based on at least one sample text image subset and a third sample text image subset.
  • the second obtaining sub-module may include an obtaining unit.
  • the obtaining unit is configured to combine the cropped sample text images in at least one cropped sample text image subset based on a predetermined combination strategy to obtain a third sample text image subset.
  • the first sample text image subset may include a plurality of first sample text images.
  • the sample text image set to be cropped may be determined in the following manner:
  • for a first sample text image among the plurality of first sample text images, when it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined as a sample text image to be cropped in the sample text image set to be cropped.
  • the text image generating device may further include a third obtaining module and a fourth obtaining module.
  • the third obtaining module is used to perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set.
  • the fourth obtaining module is used to obtain a sample text image set based on the original sample text image set and the intermediate sample text image set.
  • the sample text image set may be a text image set of a text vision task.
  • Figure 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure.
  • the deep learning model training device 700 may include a first acquisition module 710 and a fifth acquisition module 720.
  • the first acquisition module 710 is used to acquire the target sample text image set.
  • the fifth acquisition module 720 is used to train a deep learning model using the target sample text image set to obtain a text image processing model.
  • the target sample text image set may be obtained by the text image generating device according to the embodiments of the present disclosure.
  • FIG. 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure.
  • the text image processing device 800 may include a second acquisition module 810 and a sixth acquisition module 820.
  • the second acquisition module 810 is used to acquire text images to be processed.
  • the sixth obtaining module 820 is used to input the text image to be processed into the text image processing model to obtain the text image processing result.
  • The text image processing model may be trained using the training device for the deep learning model according to the embodiments of the present disclosure.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • An electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method as described above.
  • a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method as described above.
  • a computer program product includes a computer program, and when executed by a processor, the computer program implements the method as described above.
  • FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a text image generation method, a deep learning model training method, and a text image processing method according to an embodiment of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random access memory (RAM) 903.
  • The RAM 903 can also store various programs and data required for the operation of the electronic device 900.
  • Computing unit 901, ROM 902 and RAM 903 are connected to each other via bus 904.
  • An input/output (I/O) interface 905 is also connected to bus 904.
  • Multiple components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard, a mouse, etc.; an output unit 907, such as various types of displays, speakers, etc.; a storage unit 908, such as a magnetic disk, an optical disk, etc.; and a communication unit 909, such as a network card, modem, wireless communication transceiver, etc.
  • the communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
  • Computing unit 901 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 901 performs various methods and processes described above, for example, a text image generation method, a deep learning model training method, and a text image processing method.
  • The text image generation method, the deep learning model training method, and the text image processing method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 908.
  • Part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909.
  • The computing unit 901 may be configured to perform the text image generation method, the deep learning model training method, and the text image processing method in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • The systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, a distributed system server, or a server combined with a blockchain.

Abstract

The present invention relates to the technical field of artificial intelligence, and provides text image generation, training, and processing methods, and an electronic device. A specific implementation solution comprises: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set (S210); determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped (S220); cropping the sample text image set to be cropped on the basis of the target cropping position set to obtain at least one cropped sample text image subset (S230); and obtaining a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset (S240). The accuracy of the target cropping positions can be effectively ensured, character information is effectively prevented from being damaged, and the image background complexity and image diversity of the sample text images in the target sample text image set are improved.

Description

Text image generation, training, and processing methods, and electronic device
This application claims priority to Chinese Patent Application No. 202211015424.6, filed on August 24, 2022, the content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of computer vision and deep learning, and can be applied to optical character recognition scenarios. Specifically, it relates to a text image generation method, a training method, a text image processing method, and an electronic device.
Background
With the development of computer technology, artificial intelligence technology has also developed. Artificial intelligence technology may include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, knowledge graph technology, and the like.
Artificial intelligence technology has been widely applied in various fields. For example, it can be used to generate text images for training deep learning models.
Summary
The present disclosure provides a text image generation method, a training method, a text image processing method, and an electronic device.
According to an aspect of the present disclosure, a text image generation method is provided, including: dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined according to the first sample text image subset; cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and obtaining a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present disclosure, a training method for a deep learning model is provided, including: acquiring a target sample text image set; and training the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the method described above according to the present disclosure.
According to another aspect of the present disclosure, a text image processing method is provided, including: acquiring a text image to be processed; and inputting the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the method described above according to the present disclosure.
According to another aspect of the present disclosure, a text image generating device is provided, including: a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset includes a first sample text image subset, and the first sample text image subset includes sample text images whose sample text output results are correct; a determination module configured to determine a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined according to the first sample text image subset; a first obtaining module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and a second obtaining module configured to obtain a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
According to another aspect of the present disclosure, a training device for a deep learning model is provided, including: a first acquisition module configured to acquire a target sample text image set; and a third obtaining module configured to train the deep learning model using the target sample text image set to obtain a text image processing model, wherein the target sample text image set is obtained using the device described above according to the present disclosure.
According to another aspect of the present disclosure, a text image processing device is provided, including: a second acquisition module configured to acquire a text image to be processed; and a fourth obtaining module configured to input the text image to be processed into a text image processing model to obtain a text image processing result, wherein the text image processing model is trained using the device described above according to the present disclosure.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method according to the present disclosure.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The accompanying drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 schematically shows an exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and devices may be applied according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flowchart of a text image generation method according to an embodiment of the present disclosure;
FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure;
FIG. 3B schematically shows an example of a generation process of a third sample text image subset according to an embodiment of the present disclosure;
FIG. 3C schematically shows an example of a generation process of a third sample text image subset according to another embodiment of the present disclosure;
FIG. 3D schematically shows an example of a generation process of a third sample text image subset according to another embodiment of the present disclosure;
FIG. 3E schematically shows an example of a generation process of a third sample text image subset according to another embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a training method for a deep learning model according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a text image processing method according to an embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a text image generation method, a deep learning model training method, and a text image processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
FIG. 1 schematically shows an exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and devices may be applied according to an embodiment of the present disclosure.
It should be noted that FIG. 1 is only an example of a system architecture to which embodiments of the present disclosure can be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios. For example, in another embodiment, the exemplary system architecture to which the text image generation method, the deep learning model training method, and the text image processing method and devices can be applied may include a terminal device, and the terminal device may implement the text image generation method, the deep learning model training method, and the text image processing method and devices provided by the embodiments of the present disclosure without interacting with a server.
As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, for example, at least one of wired and wireless communication links.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, for example, at least one of a knowledge reading application, a web browser application, a search application, an instant messaging tool, an email client, and social platform software.
The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, for example, at least one of a smartphone, a tablet computer, a laptop computer, and a desktop computer.
The server 105 may be any of various types of servers providing various services. For example, the server 105 may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that the text image generation method and the text image processing method provided by the embodiments of the present disclosure can generally be executed by the terminal device 101, 102, or 103. Correspondingly, the text image generating device and the text image processing device provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the text image generation method and the text image processing method provided by the embodiments of the present disclosure can generally also be executed by the server 105. Correspondingly, the text image generating device and the text image processing device provided by the embodiments of the present disclosure may generally be provided in the server 105. The text image generation method and the text image processing method can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the text image generating device and the text image processing device may also be provided in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
It should be noted that the training method for the deep learning model provided by the embodiments of the present disclosure can generally be executed by the server 105. Correspondingly, the training device for the deep learning model provided by the embodiments of the present disclosure may generally be provided in the server 105. The training method can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the training device may also be provided in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
Alternatively, the training method for the deep learning model provided by the embodiments of the present disclosure can generally also be executed by the terminal device 101, 102, or 103. Correspondingly, the training device for the deep learning model provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, depending on implementation needs.
应注意,以下方法中各个操作的序号仅作为该操作的表示以便描述,而不应被看作表示该各个操作的执行顺序。除非明确指出,否则该方法不需要完全按照所示顺序来执行。It should be noted that the sequence number of each operation in the following method is only used as a representation of the operation for the purpose of description, and should not be regarded as indicating the execution order of the respective operations. Unless explicitly stated, the methods need not be performed in exactly the order shown.
图2示意性示出了根据本公开实施例的文本图像生成方法的流程图。FIG. 2 schematically shows a flow chart of a text image generation method according to an embodiment of the present disclosure.
如图2所示,该方法200包括操作S210~S240。As shown in Figure 2, the method 200 includes operations S210 to S240.
In operation S210, a sample text image set is divided into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set.
In operation S220, a target cropping position set of a sample text image set to be cropped is determined according to the sample text output result set of the sample text image set to be cropped.
In operation S230, the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample text image subset.
In operation S240, a target sample text image set is obtained according to the at least one cropped sample text image subset and the at least one sample text image subset.
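Operations S210 to S240 can be sketched as a small pipeline. The following is only an illustrative outline under stated assumptions: the function name `generate_target_set`, the dictionary-based sample record, and the four callables passed in are hypothetical and do not come from the disclosure, and the way the cropped pieces enter the target set is simplified.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]  # hypothetical record: {"image": ..., "output": ..., "label": ...}


def generate_target_set(
    samples: List[Sample],
    is_correct: Callable[[Sample], bool],
    needs_cropping: Callable[[Sample], bool],
    find_positions: Callable[[Sample], List[int]],
    crop: Callable[[Sample, List[int]], List[Any]],
) -> List[Any]:
    # S210: divide the sample set by comparing outputs with labels.
    first_subset = [s for s in samples if is_correct(s)]
    second_subset = [s for s in samples if not is_correct(s)]

    # The set to be cropped is drawn from the first (correct) subset only.
    to_crop = [s for s in first_subset if needs_cropping(s)]
    kept = [s for s in first_subset if not needs_cropping(s)]

    # S220: determine target cropping positions from each sample's output.
    positions = [find_positions(s) for s in to_crop]

    # S230: crop each selected sample at its target positions.
    cropped = [crop(s, p) for s, p in zip(to_crop, positions)]

    # S240 (simplified): merge the untouched samples with the cropped pieces.
    return second_subset + kept + [piece for pieces in cropped for piece in pieces]
```

Passing the steps in as callables keeps the outline independent of how any single step is realized.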
According to an embodiment of the present disclosure, the at least one sample text image subset may include a first sample text image subset. The first sample text image subset may include sample text images whose sample text output results are correct. The sample text image set to be cropped may be determined according to the first sample text image subset.
According to an embodiment of the present disclosure, a text image may include at least one of a document text image and a scene text image. A document text image may refer to a text image with a neat layout, controlled lighting, and a relatively uniform background. A scene text image may refer to a text image with a relatively complex background, diverse text forms, and uncontrolled lighting. The text form may include at least one of the color, size, font, orientation, and layout irregularity of the text. Layout irregularity may include at least one of bending, tilting, wrinkling, deformation, and incompleteness.
According to an embodiment of the present disclosure, the sample text image set may include at least one sample text image. A sample text image may include at least one of a sample document text image and a sample scene text image. The sample text image set may be an image set for a text vision task, and the sample text images may be text images for various text vision tasks. For example, the text vision task may include at least one of the following: a text image recognition task, a text image classification task, a text image segmentation task, a text image detection task, and a text image retrieval task. In addition, the text vision task may also include at least one of the following: a domain-specific task corresponding to the text image recognition task, a domain-specific task corresponding to the text image classification task, a domain-specific task corresponding to the text image segmentation task, a domain-specific task corresponding to the text image detection task, and a domain-specific task corresponding to the text image retrieval task.
According to an embodiment of the present disclosure, for example, the domain-specific task corresponding to the text image recognition task may include at least one of the following: a bill image recognition task, a medical text image recognition task, a financial product text image recognition task, a video subtitle recognition task, and a security monitoring recognition task. The domain-specific task corresponding to the text image classification task may include at least one of the following: a bill image classification task, a medical text image classification task, a financial product text image classification task, a video subtitle classification task, and a security monitoring classification task. The domain-specific task corresponding to the text image segmentation task may include at least one of the following: a bill image segmentation task, a medical text image segmentation task, and a financial product text image segmentation task. The domain-specific task corresponding to the text image detection task may include at least one of the following: a bill image detection task, a medical text image detection task, a financial product text image detection task, a video subtitle detection task, and a security monitoring detection task. The domain-specific task corresponding to the text image retrieval task may include at least one of the following: a bill image retrieval task, a medical text image retrieval task, a financial product text image retrieval task, a video subtitle retrieval task, and a security monitoring retrieval task.
According to an embodiment of the present disclosure, there may be a sample text output result set and a sample label set corresponding to the sample text image set. The sample text output result set may include at least one sample text output result, and the sample label set may include at least one sample label. Each sample text image may have a corresponding sample text output result and a corresponding sample label. The sample text output result may represent the predicted text result of the sample text image and may include at least one of a sample text recognition output result and a sample text semantic output result. The sample text recognition output result may represent the predicted text recognition result of the sample text image. The sample text semantic output result may represent the predicted semantic result of the sample text image. The sample label may represent the ground-truth text result of the sample text image and may include at least one of a sample text recognition label and a sample text semantic label. The sample text recognition label may represent the ground-truth text recognition result of the sample text image. The sample text semantic label may represent the ground-truth semantic result of the sample text image. A text recognition result may refer to the character sequence contained in a text image.
According to an embodiment of the present disclosure, the sample text image set may include the first sample text image subset. A sample text image in the first sample text image subset may refer to a sample text image whose sample text output result is a correct sample text output result. The first sample text image subset may include the sample text image set to be cropped, which may include at least one sample text image to be cropped. A sample text image to be cropped may refer to a sample text image in the first sample text image subset that satisfies a predetermined cropping condition. The predetermined cropping condition may be configured according to actual business requirements and is not limited here. For example, the predetermined cropping condition may include that a predetermined probability value corresponding to the sample text image is less than or equal to a predetermined probability threshold.
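As an illustration of the example cropping condition, the sketch below selects the images whose probability value does not exceed a threshold. How the probability values are obtained is not specified in the text, so they are simply passed in; the function name and the default threshold of 0.5 are assumptions made for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_images_to_crop(
    first_subset: Sequence[T],
    probability_values: Sequence[float],
    probability_threshold: float = 0.5,  # assumed default, not from the disclosure
) -> List[T]:
    """Return the images whose probability value is <= the threshold."""
    if len(first_subset) != len(probability_values):
        raise ValueError("one probability value is required per image")
    return [
        image
        for image, value in zip(first_subset, probability_values)
        if value <= probability_threshold
    ]
```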
According to an embodiment of the present disclosure, a sample text image to be cropped may have at least one corresponding cropping position. A target cropping position may refer to a cropping position, among the at least one cropping position, that satisfies a predetermined position condition. The predetermined position condition may be configured according to actual business requirements and is not limited here. For example, the predetermined position condition may be that the cropping position is randomly selected from the at least one cropping position.
According to an embodiment of the present disclosure, a cropped sample text image subset may include at least one cropped sample text image. A cropped sample text image may refer to an image obtained by cropping a sample text image to be cropped based on a target cropping position.
According to an embodiment of the present disclosure, the sample text image set may be acquired from a data source in response to detecting a text image generation instruction. The data source may include at least one of a local database, a cloud database, and a network resource. A data interface may be called and used to acquire the sample text image set from the data source. The sample text image set may include at least one sample text image. A sample text image may be at least one of a simulated sample text image and a real sample text image. A real sample text image may be a sample text image from a public dataset. A simulated sample text image may be generated in one of the following ways: generated based on predetermined image parameters, or generated by processing predetermined random noise data with a generative adversarial network model.
According to an embodiment of the present disclosure, for a sample text image in the sample text image set, first local feature extraction may be performed on the sample text image to obtain a first local sample feature map. Global feature extraction may be performed on the first local sample feature map to obtain a global sample feature sequence. Sequence decoding may be performed on the global sample feature sequence to obtain the sample text recognition output result of the sample text image. Second local feature extraction may be performed on the sample text image to obtain a second local sample feature map. Semantic understanding may be performed on the second local sample feature map to obtain the sample text semantic output result of the sample text image. The sample text output result of the sample text image is obtained according to at least one of the sample text recognition output result and the sample text semantic output result. For example, the sample text image may be processed by a deep learning model to obtain the sample text output result. The deep learning model may include a deep learning model capable of recognizing character sequences of variable length and a deep learning model capable of text semantic understanding.
The model structure of the deep learning model may be configured according to actual business requirements and is not limited here. For example, the deep learning model may include at least one model structure. A model structure may include at least one model substructure and the connection relationships among the model substructures. The model structure may be obtained by connecting the at least one model substructure based on those connection relationships. The at least one model substructure may come from at least one operation layer; for example, the model structure may be obtained by connecting at least one model substructure from at least one operation layer based on the connection relationships among the model substructures. The at least one operation layer may include at least one of the following: an input layer, a convolutional layer, a hidden layer, a transcription layer, a pooling layer, an unpooling layer, a deconvolution layer, a feedforward neural network layer, an attention layer, a residual layer, a fully connected layer, a batch normalization layer, a linear embedding layer, and a nonlinear layer.
According to an embodiment of the present disclosure, the deep learning model for text recognition may include one of the following: a text recognition model based on a CRNN (Convolutional Recurrent Neural Network) and a text recognition model based on an encoder-decoder. A CRNN may include a convolutional layer, a recurrent layer, and a transcription layer. The encoder-decoder may be one of a symmetric encoder-decoder and an asymmetric encoder-decoder.
According to an embodiment of the present disclosure, the CRNN-based text recognition model may include at least one of the following: a CRNN model based on CTC (Connectionist Temporal Classification), a CRNN model based on attention, and a CRNN model based on ACE (Aggregation Cross Entropy). The encoder-decoder based text recognition model may include a text recognition model based on Seq-To-Seq (Sequence-To-Sequence).
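For a CTC-based model, the sequence decoding step can be illustrated by greedy (best-path) decoding: take the highest-scoring class in each frame, merge consecutive repeats, and drop the blank symbol. The disclosure does not state which decoding strategy is used, so the choice of greedy decoding, the function name, and the convention that index 0 is the blank are assumptions of this sketch.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    blank: int = 0,  # assumed convention: class index 0 is the CTC blank
) -> str:
    """Collapse the per-frame best path into a character sequence."""
    # Pick the highest-scoring class index for every frame.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded: List[str] = []
    previous = None
    for index in best_path:
        # Emit a character only when it is not blank and not a repeat
        # of the immediately preceding frame.
        if index != blank and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)
```

A blank frame between two identical characters keeps them separate, which is how a CTC output can represent doubled letters.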
According to an embodiment of the present disclosure, the deep learning model for text semantic understanding may include at least one of the following: a text semantic understanding model based on a convolutional neural network, a text semantic understanding model based on a recurrent neural network, and a text semantic understanding model based on a Transformer.
According to an embodiment of the present disclosure, the training method of the deep learning model may be configured according to actual business requirements and is not limited here. For example, the training method may include at least one of unsupervised training, supervised training, and semi-supervised training.
According to an embodiment of the present disclosure, the sample text image set may be divided into at least one sample text image subset according to the sample text output results and sample labels of the sample text images. For example, the at least one sample text image subset may include the first sample text image subset. In addition, the at least one sample text image subset may also include a second sample text image subset. A sample text image in the second sample text image subset may refer to a sample text image whose sample text output result is an incorrect sample text output result.
According to an embodiment of the present disclosure, for a sample text image to be cropped in the sample text image set to be cropped, a plurality of candidate cropping positions may be determined according to the sample text output result of that image, and at least one target cropping position may be determined from the plurality of candidate cropping positions. For example, the at least one target cropping position may be randomly selected from the plurality of candidate cropping positions. Alternatively, positions corresponding to at least one target character may be determined from the plurality of candidate cropping positions and used as the at least one target cropping position.
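One possible reading of this step is sketched below. It assumes, hypothetically, that the recognition output supplies a horizontal pixel span for each recognized character, so that the gaps between adjacent characters can serve as candidate cropping positions; the disclosure does not specify how candidates are derived. The function names and the span format are illustrative only.

```python
import random
from typing import List, Optional, Sequence, Tuple


def candidate_crop_positions(char_spans: Sequence[Tuple[int, int]]) -> List[int]:
    """Midpoints of the gaps between horizontally adjacent characters.

    char_spans holds assumed (x_start, x_end) pixel spans, left to right.
    """
    return [
        (left[1] + right[0]) // 2
        for left, right in zip(char_spans, char_spans[1:])
    ]


def choose_target_positions(
    candidates: Sequence[int],
    count: int = 1,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Randomly pick up to `count` target positions from the candidates."""
    if rng is None:
        rng = random.Random()
    count = min(count, len(candidates))
    return sorted(rng.sample(list(candidates), count))
```

Cutting only in the gaps between characters is one way to avoid slicing through a glyph, which matches the stated aim of not destroying character information.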
According to an embodiment of the present disclosure, for a sample text image to be cropped in the sample text image set to be cropped, the image may be cropped based on the at least one target cropping position corresponding to it to obtain at least one cropped sample text image.
According to an embodiment of the present disclosure, after the at least one cropped sample text image corresponding to each sample text image to be cropped in the sample text image set to be cropped has been obtained, these cropped sample text images may be combined to obtain at least one combined sample text image.
According to an embodiment of the present disclosure, obtaining the target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset may include: obtaining the target sample text image set according to (i) the sample text image subsets, among the at least one sample text image subset, other than the first sample text image subset, (ii) the sample text images in the first sample text image subset other than the sample text image set to be cropped, and (iii) the at least one combined sample text image. Alternatively, the target sample text image set may be obtained according to the sample text image set and the at least one combined sample text image.
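The cropping, combining, and assembling steps can be illustrated with images modeled as lists of pixel rows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, cropping is taken to be along the width only, and pieces are combined by horizontal concatenation; the disclosure does not fix any of these details.

```python
from typing import List, Sequence

Image = List[List[int]]  # rows of grayscale pixels; an assumed toy representation


def crop_at(image: Image, positions: Sequence[int]) -> List[Image]:
    """Split an image into vertical strips at the given x positions."""
    bounds = [0, *sorted(positions), len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]


def combine(left: Image, right: Image) -> Image:
    """Concatenate two pieces of equal height side by side."""
    if len(left) != len(right):
        raise ValueError("pieces must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_target_set(
    other_subsets: Sequence[Sequence[Image]],
    first_subset_not_cropped: Sequence[Image],
    combined_images: Sequence[Image],
) -> List[Image]:
    """Merge items (i), (ii), and (iii) of the first alternative above."""
    target = [image for subset in other_subsets for image in subset]
    target.extend(first_subset_not_cropped)
    target.extend(combined_images)
    return target
```

In practice the label of a combined image would also have to be assembled from the labels of its pieces; that bookkeeping is omitted here.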
According to an embodiment of the present disclosure, the text image generation method of the embodiments of the present disclosure may be executed by an electronic device. For example, the electronic device may be a server or a terminal device. The electronic device may include at least one processor, which may be used to execute the text image generation method provided by the embodiments of the present disclosure. For example, the method may be executed by a single processor, or executed in parallel by multiple processors.
According to an embodiment of the present disclosure, the target cropping position set is determined according to the sample text output result set of the sample text image set to be cropped, the sample text image set to be cropped is determined according to the first sample text image subset, and the sample text images in the first sample text image subset are those sample text images with correct sample text output results, determined from the sample text image set according to its sample text output result set and sample label set. Therefore, the accuracy of the target cropping positions can be effectively ensured, and destruction of character information can be effectively avoided. In addition, the target sample text image set is obtained according to the at least one sample text image subset and the at least one cropped sample text image subset obtained by cropping the sample text image set to be cropped based on the target cropping position set. This increases the background complexity and diversity of the sample text images in the target sample text image set, so that a target sample text image set with richer contextual information can be obtained. Using the target sample text image set for subsequent model training and optimization reduces the number of model iterations and increases the training speed, which in turn reduces the data processing volume and resource consumption of the electronic device, thereby achieving an improvement in the internal performance of the electronic device that conforms to natural laws and enhancing the core competitiveness of the electronic device.
According to an embodiment of the present disclosure, the above text image generation method may further include the following operations.
Data augmentation is performed on an original sample text image set to obtain an intermediate sample text image set. The sample text image set is obtained according to the original sample text image set and the intermediate sample text image set.
According to an embodiment of the present disclosure, the original sample text image set may include at least one original sample text image. Data augmentation may include at least one of supervised data augmentation and unsupervised data augmentation. Supervised data augmentation may include at least one of single-sample data augmentation and multi-sample data augmentation. Unsupervised data augmentation may include at least one of data augmentation that generates new data and data augmentation that learns an augmentation policy.
According to an embodiment of the present disclosure, single-sample data augmentation may include at least one of geometric transformations and color transformations. Geometric transformations may include at least one of flipping, rotation, random cropping, deformation, and scaling. Color transformations may include at least one of noise, blurring, color transformation, erasing, and filling.
According to an embodiment of the present disclosure, multi-sample data augmentation may include at least one of the following: SMOTE (Synthetic Minority Over-sampling Technique), SamplePairing, Mixup, Cutout, CutMix, FMix, and ROIMix.
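Of the multi-sample methods listed, Mixup is the simplest to illustrate: two inputs and their label vectors are blended with the same weight. The sketch below uses flat lists of floats for both inputs and labels; the function name and data layout are assumptions made for the example, and how such blending would be applied to text sequence labels is not addressed by the text.

```python
from typing import List, Sequence, Tuple


def mixup(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Blend two samples: lam * a + (1 - lam) * b for inputs and labels."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("samples must have matching shapes")

    def blend(u: Sequence[float], v: Sequence[float]) -> List[float]:
        return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]

    return blend(x_a, x_b), blend(y_a, y_b)
```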
According to an embodiment of the present disclosure, data augmentation that generates new data may include data augmentation based on a generative adversarial network model. Data augmentation that learns an augmentation policy may include automatic data augmentation.
According to an embodiment of the present disclosure, for an original sample text image in the original sample text image set, data augmentation may be performed on that image to obtain at least one intermediate sample text image corresponding to it. The data augmentation applied to the individual original sample text images may be entirely different, partially the same, or entirely the same. For example, the original sample text image set may include an original sample text image A and an original sample text image B. A geometric transformation may be applied to the original sample text image A to obtain at least one intermediate sample text image corresponding to image A. A color transformation may be applied to the original sample text image B to obtain at least one intermediate sample text image corresponding to image B.
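The example with images A and B can be sketched as follows, again with images modeled as lists of grayscale pixel rows. A horizontal flip stands in for the geometric transformation and a brightness shift for the color transformation; both choices, and the function names, are illustrative assumptions rather than the specific transformations of the disclosure.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels in the range 0..255


def flip_horizontal(image: Image) -> Image:
    """Geometric transformation: mirror every row."""
    return [row[::-1] for row in image]


def adjust_brightness(image: Image, delta: int) -> Image:
    """Color transformation: shift every pixel, clamped to 0..255."""
    return [[min(255, max(0, pixel + delta)) for pixel in row] for row in image]


# Different original images may receive different augmentations.
image_a: Image = [[1, 2], [3, 4]]
image_b: Image = [[250, 10]]
intermediate_a = flip_horizontal(image_a)
intermediate_b = adjust_brightness(image_b, 10)
```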
According to an embodiment of the present disclosure, obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include: determining the intermediate sample text image set as the sample text image set. Alternatively, at least part of the original sample text image set and at least part of the intermediate sample text image set may be determined as the sample text image set.
According to an embodiment of the present disclosure, since different data augmentation can be applied to different original sample text images, the image diversity of the third sample text images in the third sample text image subset can be effectively ensured. On this basis, training the deep learning model with the third sample text image subset can improve the generalization performance of the model.
According to an embodiment of the present disclosure, obtaining the sample text image set according to the original sample text image set and the intermediate sample text image set may include the following operations.
For an original sample text image in the original sample text image set, when it is determined that the height of the original sample text image is not a predetermined height, the height is adjusted to the predetermined height while the aspect ratio of the original sample text image is kept unchanged, yielding an adjusted original sample text image. For an intermediate sample text image in the intermediate sample text image set, when it is determined that the height of the intermediate sample text image is not the predetermined height, the height is adjusted to the predetermined height while the aspect ratio of the intermediate sample text image is kept unchanged, yielding an adjusted intermediate sample text image. The sample text image set is obtained according to at least one of the original sample text image set, the at least one adjusted original sample text image, the intermediate sample text image set, and the at least one adjusted intermediate sample text image.
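The height adjustment amounts to scaling the width by the same factor as the height. The sketch below computes only the new dimensions; the function name and the default predetermined height of 32 pixels are assumptions, since the disclosure does not give a value. The resulting size could then be passed to any image resizing routine.

```python
from typing import Tuple


def size_at_predetermined_height(
    width: int,
    height: int,
    predetermined_height: int = 32,  # assumed value for illustration
) -> Tuple[int, int]:
    """Return (new_width, new_height) with the aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height == predetermined_height:
        # Already at the predetermined height: no adjustment is needed.
        return width, height
    new_width = max(1, round(width * predetermined_height / height))
    return new_width, predetermined_height
```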
According to an embodiment of the present disclosure, operation S210 may include the following operations.
The sample text output result set of the sample text image set is compared with the sample label set to obtain a comparison result. The sample text image set is divided into at least one sample text image subset according to the comparison result.
According to an embodiment of the present disclosure, the comparison result may be either that the relationship between two objects satisfies a predetermined matching condition or that it does not. The two objects may refer to a sample text output result and a sample label. The predetermined matching condition may be configured according to actual business requirements and is not limited here. For example, the predetermined matching condition may include that the two objects match.
According to an embodiment of the present disclosure, for a sample text image in the sample text image set, the sample text output result of the image may be compared with its sample label to obtain a comparison result corresponding to the image. According to that comparison result, the sample text image may be assigned to the sample text image subset corresponding to the comparison result.
According to an embodiment of the present disclosure, the sample text image set may include a plurality of sample text images. The at least one sample text image subset may also include a second sample text image subset.
According to an embodiment of the present disclosure, dividing the sample text image set into at least one sample text image subset according to the comparison result may include the following operations.
For a sample text image among the plurality of sample text images, when it is determined that the relationship between the sample text output result of the sample text image and its sample label satisfies the predetermined matching condition, the sample text image is determined to be a sample text image in the first sample text image subset. When it is determined that the relationship does not satisfy the predetermined matching condition, the sample text image is determined to be a sample text image in the second sample text image subset.
According to an embodiment of the present disclosure, the predetermined matching condition may serve as the basis for dividing the sample text image set into subsets. The predetermined matching condition may include that the difference between the sample text output result and the sample label is less than or equal to a predetermined difference threshold. The predetermined difference threshold may be configured according to actual business requirements and is not limited here. For example, the predetermined difference threshold may be 0.1.
根据本公开的实施例,第一样本文本图像子集中的样本文本图像可以指样本文本输出结果为正确样本文本输出结果的样本文本图像。第二样本文本图像子集中的样本文本图像可以指样本文本输出结果为错误样本文本输出结果的样本文本图像。According to an embodiment of the present disclosure, the sample text images in the first sample text image subset may refer to sample text images whose sample text output results are correct sample text output results. The sample text images in the second sample text image subset may refer to sample text images whose sample text output results are incorrect sample text output results.
根据本公开的实施例,针对多个样本文本图像中的样本文本图像,确定该样本文本图像的样本文本输出结果和样本标签之间的差值是否小于或等于预定差值阈值。在确定该样本文本图像的样本文本输出结果和样本标签之间的差值小于或等于预定差值阈值的情况下,可以将该样本文本图像确定为第一样本文本图像子集中的样本文本图像。在确定该样本文本图像的样本文本输出结果和样本标签之间的差值大于预定差值阈值的情况下,可以将该样本文本图像确定为第二样本文本图像子集中的样本文本图像。According to an embodiment of the present disclosure, for a sample text image among a plurality of sample text images, it is determined whether a difference between a sample text output result of the sample text image and a sample label is less than or equal to a predetermined difference threshold. When it is determined that the difference between the sample text output result of the sample text image and the sample label is less than or equal to the predetermined difference threshold, the sample text image may be determined to be a sample text image in the first sample text image subset . When it is determined that the difference between the sample text output result of the sample text image and the sample label is greater than the predetermined difference threshold, the sample text image may be determined to be a sample text image in the second sample text image subset.
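The threshold-based division described above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the disclosure does not specify how the difference between a sample text output result and a sample label is computed, so the normalized edit distance used here is an assumption, and the names `Sample`, `difference`, and `partition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    image_id: str
    label: str        # sample label (ground-truth text)
    prediction: str   # sample text output result produced by the model


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def difference(prediction: str, label: str) -> float:
    """Assumed difference measure: edit distance normalized to [0, 1]."""
    longest = max(len(prediction), len(label))
    return 0.0 if longest == 0 else edit_distance(prediction, label) / longest


def partition(samples, threshold=0.1):
    """Split samples into (first subset, second subset) by the difference threshold."""
    first, second = [], []
    for sample in samples:
        if difference(sample.prediction, sample.label) <= threshold:
            first.append(sample)   # matching condition satisfied
        else:
            second.append(sample)  # matching condition not satisfied
    return first, second
```

With the example threshold of 0.1, a four-character label with one wrong character has a difference of 0.25 and therefore falls into the second subset.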
根据本公开的实施例,由于目标裁剪位置集是根据待裁剪样本文本图像集的样本文本输出结果集确定的,待裁剪样本文本图像集是根据第一样本文本图像子集确定的,第一样本文本图像子集中的第一样本文本图像是样本文本输出结果和样本标签之间的关系满足预定匹配条件的样本文本图像,因此,能够有效保证目标裁剪位置的准确性,有效避免字符信息被破坏。According to an embodiment of the present disclosure, the target cropping position set is determined based on the sample text output result set of the sample text image set to be cropped, the sample text image set to be cropped is determined based on the first sample text image subset, and each first sample text image in the first sample text image subset is a sample text image for which the relationship between the sample text output result and the sample label satisfies the predetermined matching condition. Therefore, the accuracy of the target cropping position can be effectively guaranteed and the destruction of character information can be effectively avoided.
根据本公开的实施例,第一样本文本图像集可以包括多个第一样本文本图像。According to embodiments of the present disclosure, the first sample text image set may include a plurality of first sample text images.
根据本公开的实施例,待裁剪样本文本图像集可以是通过以下方式确定的:According to an embodiment of the present disclosure, the sample text image set to be cropped may be determined in the following manner:
针对多个第一样本文本图像中的第一样本文本图像,在确定第一样本文本图像的预定概率值小于或等于预定概率阈值的情况下,将第一样本文本图像确定为待裁剪样本文本图像集中的待裁剪样本文本图像。For a first sample text image among the plurality of first sample text images, when it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined to be a sample text image to be cropped in the sample text image set to be cropped.
根据本公开的实施例,预定概率值和预定概率阈值可以用于作为确定第一样本文本图像子集中的第一样本文本图像是待裁剪样本文本图像集中的待裁剪样本文本图像。预定概率值和预定概率阈值可以根据实际业务需求进行配置,在此不作限定。预定概率值可以是大于或等于0且小于1的数。预定概率阈值可以是大于或等于0且小于或等于1的数。例如,预定概率阈值可以根据深度学习模型的模型特点来确定。模型特点可以包括模型结构的复杂性、拟合性和通用性的至少之一。例如,如果深度学习模型的模型结构的模型特点是通用性较强、复杂性较大和容易过拟合中的至少之一,则可以配置数值较大的预定概率阈值。如果深度学习模型的模型结构的模型特点是通用性较弱、复杂性较小和容易欠拟合中的至少之一,则可以配置数值较小的预定概率阈值。According to an embodiment of the present disclosure, the predetermined probability value and the predetermined probability threshold may be used as the basis for determining whether a first sample text image in the first sample text image subset is a sample text image to be cropped in the sample text image set to be cropped. The predetermined probability value and the predetermined probability threshold can be configured according to actual business requirements and are not limited here. The predetermined probability value may be a number greater than or equal to 0 and less than 1. The predetermined probability threshold may be a number greater than or equal to 0 and less than or equal to 1. For example, the predetermined probability threshold can be determined based on model characteristics of the deep learning model. Model characteristics may include at least one of the complexity, fit, and generality of the model structure. For example, if the model structure of the deep learning model is characterized by at least one of strong generality, high complexity, and a tendency to overfit, a predetermined probability threshold with a larger value may be configured. If the model structure of the deep learning model is characterized by at least one of weak generality, low complexity, and a tendency to underfit, a predetermined probability threshold with a smaller value may be configured.
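The selection of images to be cropped can be sketched as below. The disclosure states only that each first sample text image has a predetermined probability value in [0, 1) that is compared with a threshold in [0, 1]; drawing that value uniformly at random per image is an assumption made here for illustration, and `select_for_cropping` is a hypothetical name.

```python
import random


def select_for_cropping(first_subset, probability_threshold, rng=None):
    """Return the images whose probability value is <= the threshold.

    Assumption: the probability value of each image is a uniform random
    draw in [0, 1), so a larger threshold sends more images to cropping.
    """
    rng = rng or random.Random()
    selected = []
    for image in first_subset:
        probability_value = rng.random()  # value in [0, 1)
        if probability_value <= probability_threshold:
            selected.append(image)
    return selected
```

Under this assumption, a threshold of 1 selects every correctly recognized image for cropping, while a threshold near 0 leaves almost all of them untouched, which is consistent with configuring a larger threshold for models that tend to overfit.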
根据本公开的实施例,待裁剪样本文本图像集可以包括多个待裁剪样本文本图像。According to embodiments of the present disclosure, the set of sample text images to be cropped may include a plurality of sample text images to be cropped.
根据本公开的实施例,操作S220可以包括如下操作。According to an embodiment of the present disclosure, operation S220 may include the following operations.
针对待裁剪样本文本图像集中的待裁剪样本文本图像,根据待裁剪样本文本图像的样本文本输出结果,从多个候选裁剪位置中确定至少一个所述目标裁剪位置。For the sample text image to be cropped in the sample text image set to be cropped, at least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
根据本公开的实施例,可以根据待裁剪样本文本图像的样本文本输出结果,确定多个候选裁剪位置。随机从多个候选裁剪位置中确定至少一个目标裁剪位置。According to embodiments of the present disclosure, a plurality of candidate cropping positions may be determined based on the sample text output results of the sample text image to be cropped. At least one target cropping position is randomly determined from a plurality of candidate cropping positions.
根据本公开的实施例,通过随机从多个候选裁剪位置中确定至少一个目标裁剪位置,能够提高样本文本图像的图像多样性。According to embodiments of the present disclosure, image diversity of sample text images can be improved by randomly determining at least one target cropping position from a plurality of candidate cropping positions.
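A minimal sketch of the random selection follows. It assumes that candidate cropping positions are the gaps between adjacent recognized characters, expressed as character offsets; mapping an offset to a pixel column is outside this sketch. The function names are hypothetical.

```python
import random


def candidate_positions(recognized_text):
    """Offsets i such that a cut falls between character i-1 and character i."""
    return list(range(1, len(recognized_text)))


def pick_random_targets(candidates, count=1, rng=None):
    """Randomly choose up to `count` target cropping positions, in ascending order."""
    rng = rng or random.Random()
    count = min(count, len(candidates))
    return sorted(rng.sample(candidates, count))
```

Because the draw differs from one pass to the next, the same source image can yield different cropped pieces over time, which is where the gain in image diversity comes from.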
根据本公开的实施例,样本文本图像集可以包括多个样本文本图像。According to embodiments of the present disclosure, the sample text image set may include a plurality of sample text images.
根据本公开的实施例,样本文本识别输出结果可以是对样本文本图像的全局样本特征序列进行序列解码得到的。全局样本特征序列可以是对样本文本图像的第一局部样本特征图进行全局特征提取得到的。第一局部样本特征图可以是对样本文本图像进行第一局部特征提取得到的。According to embodiments of the present disclosure, the sample text recognition output result may be obtained by performing sequence decoding on the global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on the first local sample feature map of the sample text image. The first local sample feature map may be obtained by performing first local feature extraction on the sample text image.
根据本公开的实施例,样本文本语义输出结果可以是对样本文本图像的第二局部样本特征图进行语义理解得到的。第二局部样本特征图可以是对样本文本图像进行第二局部特征提取得到的。According to embodiments of the present disclosure, the sample text semantic output result may be obtained by semantic understanding of the second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
根据本公开的实施例,可以利用基于CRNN的文本识别模型处理样本文本图像,得到样本文本识别输出结果。CRNN可以包括卷积层、循环层和转录层。可以利用卷积层处理样本文本图像,得到第一局部样本特征图。可以利用循环层处理第一局部样本特征图,得到全局样本特征序列。可以利用转录层处理全局样本特征序列,得到样本文本识别输出结果。According to embodiments of the present disclosure, a CRNN-based text recognition model can be used to process the sample text image to obtain the sample text recognition output result. The CRNN may include a convolutional layer, a recurrent layer, and a transcription layer. The convolutional layer can be used to process the sample text image to obtain the first local sample feature map. The recurrent layer can be used to process the first local sample feature map to obtain the global sample feature sequence. The transcription layer can be used to process the global sample feature sequence to obtain the sample text recognition output result.
根据本公开的实施例,在样本文本输出结果包括样本文本识别结果和样本文本语义输出结果的情况下,根据待裁剪样本文本图像的样本文本输出结果,从多个候选裁剪位置中确定至少一个所述目标裁剪位置,可以包括如下操作。According to an embodiment of the present disclosure, in the case where the sample text output result includes a sample text recognition result and a sample text semantic output result, determining the at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped may include the following operations.
根据待裁剪样本文本图像的样本文本识别输出结果,确定多个候选裁剪位置。根据待裁剪样本文本图像的样本文本语义输出结果,从多个候选裁剪位置中确定至少一个目标裁剪位置。Multiple candidate cropping positions are determined based on the sample text recognition output results of the sample text image to be cropped. Determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
根据本公开的实施例,例如,待裁剪样本文本图像的样本文本识别输出结果可以是“今天去上班”。根据样本文本识别输出结果,确定四个候选裁剪位置,即“今”和“天”之间的候选裁剪位置、“天”和“去”之间的候选裁剪位置、“去”和“上”之间的候选裁剪位置以及“上”和“班”之间的候选裁剪位置。根据样本文本语义输出结果,可以确定“今”和“天”不应该被分开,“上”和“班”不应该被分开,因此,可以从四个候选裁剪位置中确定两个目标裁剪位置,即“天”和“去”之间的候选裁剪位置以及“去”和“上”之间的候选裁剪位置。According to an embodiment of the present disclosure, for example, the sample text recognition output result of the sample text image to be cropped may be “今天去上班” ("going to work today"). According to the sample text recognition output result, four candidate cropping positions are determined: the position between “今” and “天”, the position between “天” and “去”, the position between “去” and “上”, and the position between “上” and “班”. According to the sample text semantic output result, it can be determined that “今” and “天” should not be separated (together they form “今天”, "today") and that “上” and “班” should not be separated (together they form “上班”, "go to work"). Therefore, two target cropping positions can be determined from the four candidate cropping positions: the position between “天” and “去” and the position between “去” and “上”.
根据本公开的实施例,根据待裁剪样本文本图像的样本文本语义输出结果,从多个候选裁剪位置中确定至少一个目标裁剪位置,提高了目标裁剪位置的准确性。According to an embodiment of the present disclosure, at least one target cropping position is determined from a plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped, thereby improving the accuracy of the target cropping position.
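The semantic filtering in the “今天去上班” example can be sketched as follows. The disclosure does not state the form of the sample text semantic output result; this sketch assumes it is available as a word segmentation of the recognized text, and the function names are hypothetical.

```python
def word_boundaries(words):
    """Character offsets at which one word ends and the next begins."""
    boundaries = []
    offset = 0
    for word in words[:-1]:
        offset += len(word)
        boundaries.append(offset)
    return boundaries


def filter_candidates(candidates, words):
    """Keep only the candidate cropping positions that do not split a word."""
    allowed = set(word_boundaries(words))
    return [position for position in candidates if position in allowed]
```

For the segmentation `["今天", "去", "上班"]` and the four character-gap candidates `[1, 2, 3, 4]`, the positions inside “今天” and “上班” are rejected and the two gaps on either side of “去” remain.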
根据本公开的实施例,操作S230可以包括如下操作。According to an embodiment of the present disclosure, operation S230 may include the following operations.
基于目标裁剪位置集对待裁剪样本文本图像集进行裁剪,得到第一裁剪样本文本图像子集和第二裁剪样本文本图像子集。The sample text image set to be cropped is cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
根据本公开的实施例,第一裁剪样本文本图像子集可以包括至少一个第一裁剪样本文本图像。第二裁剪样本文本图像子集可以包括至少一个第二裁剪样本文本图像。与待裁剪样本文本图像对应的至少一个目标裁剪位置可以包括第一目标裁剪位置和第二目标裁剪位置。According to an embodiment of the present disclosure, the first cropped sample text image subset may include at least one first cropped sample text image. The second subset of cropped sample text images may include at least one second cropped sample text image. The at least one target cropping position corresponding to the sample text image to be cropped may include a first target cropping position and a second target cropping position.
根据本公开的实施例,针对待裁剪样本文本图像集中的待裁剪样本文本图像,可以基于与该待裁剪样本文本图像对应的第一目标裁剪位置进行裁剪,得到与该待裁剪样本文本图像对应的第一裁剪样本文本图像。可以基于与该待裁剪样本文本图像对应的第二目标裁剪位置进行裁剪,得到与该待裁剪样本文本图像对应的第二裁剪样本文本图像。According to an embodiment of the present disclosure, for a sample text image to be cropped in the sample text image set to be cropped, cropping may be performed based on the first target cropping position corresponding to the sample text image to be cropped, to obtain a first cropped sample text image corresponding to the sample text image to be cropped. Cropping may be performed based on the second target cropping position corresponding to the sample text image to be cropped, to obtain a second cropped sample text image corresponding to the sample text image to be cropped.
根据本公开的实施例,操作S240可以包括如下操作。According to an embodiment of the present disclosure, operation S240 may include the following operations.
根据至少一个裁剪样本文本图像子集,得到第三样本文本图像子集。根据至少一个样本文本图像子集和第三样本文本图像子集,得到目标样本文本图像集。A third sample text image subset is obtained based on at least one cropped sample text image subset. The target sample text image set is obtained based on at least one sample text image subset and the third sample text image subset.
根据本公开的实施例,可以对至少一个裁剪样本文本图像子集进行组合,得到第三样本文本图像子集。可以根据第二样本文本图像子集和第三样本文本图像子集,得到目标样本文本图像集。According to embodiments of the present disclosure, at least one cropped sample text image subset may be combined to obtain a third sample text image subset. The target sample text image set can be obtained according to the second sample text image subset and the third sample text image subset.
根据本公开的实施例,根据至少一个裁剪样本文本图像子集,得到第三样本文本图像子集,可以包括如下操作。According to an embodiment of the present disclosure, obtaining a third sample text image subset based on at least one cropped sample text image subset may include the following operations.
基于预定组合策略,将至少一个裁剪样本文本图像子集中的裁剪样本文本图像进行组合,得到第三样本文本图像子集。Based on a predetermined combination strategy, the cropped sample text images in at least one cropped sample text image subset are combined to obtain a third sample text image subset.
根据本公开的实施例,预定组合策略可以指用于对裁剪样本文本图像进行组合的策略。例如,预定组合策略可以包括以下至少之一:随机组合策略和固定组合策略。第三样本文本图像子集可以包括至少一个第三样本文本图像。第三样本文本图像可以与样本文本图像集中的样本文本图像相同或不同。According to an embodiment of the present disclosure, the predetermined combination strategy may refer to a strategy for combining cropped sample text images. For example, the predetermined combination strategy may include at least one of the following: a random combination strategy and a fixed combination strategy. The third sample text image subset may include at least one third sample text image. A third sample text image may be the same as or different from a sample text image in the sample text image set.
根据本公开的实施例,针对至少一个裁剪样本文本图像子集中的裁剪样本文本图像子集,针对该裁剪样本文本图像子集中的裁剪样本文本图像,可以将该裁剪样本文本图像与其他裁剪样本文本图像子集中的裁剪样本文本图像进行组合,得到至少一个第三样本文本图像。其他裁剪样本文本图像子集可以是至少一个裁剪样本文本图像子集中除该裁剪样本文本图像子集以外的其他任意一个或多个裁剪样本文本图像子集。According to an embodiment of the present disclosure, for a cropped sample text image subset among the at least one cropped sample text image subset, and for a cropped sample text image in that subset, the cropped sample text image may be combined with cropped sample text images in other cropped sample text image subsets to obtain at least one third sample text image. The other cropped sample text image subsets may be any one or more of the at least one cropped sample text image subset other than the subset in question.
例如,至少一个裁剪样本文本图像子集可以包括第一裁剪样本文本图像子集和第二裁剪样本文本图像子集。第一裁剪样本文本图像子集可以表征第一方向的裁剪样本文本图像子集。第二裁剪样本文本图像子集可以表征第二方向的裁剪样本文本图像子集。第一方向可以指右方向。第二方向可以指左方向。针对第一裁剪样本文本图像子集中的第一裁剪样本文本图像,可以将第一裁剪样本文本图像与第二裁剪样本文本图像子集中的至少一个第二裁剪样本文本图像进行组合,得到至少一个第三样本文本图像。For example, the at least one cropped sample text image subset may include a first cropped sample text image subset and a second cropped sample text image subset. The first cropped sample text image subset may represent the cropped sample text image subset in a first direction. The second cropped sample text image subset may represent the cropped sample text image subset in a second direction. The first direction may refer to the right direction. The second direction may refer to the left direction. For a first cropped sample text image in the first cropped sample text image subset, the first cropped sample text image may be combined with at least one second cropped sample text image in the second cropped sample text image subset to obtain at least one third sample text image.
根据本公开的实施例,由于第三样本文本图像子集是基于预定组合策略将至少一个裁剪样本文本图像子集中的裁剪样本文本图像进行组合得到的,因此,实现了裁剪样本文本图像的随机组合,提高了第三样本文本图像子集中第三样本文本图像的图像背景复杂度和图像多样性。在此基础上,利用第三样本文本图像子集训练深度学习模型,能够提高模型的泛化性能。According to an embodiment of the present disclosure, since the third sample text image subset is obtained by combining cropped sample text images in at least one cropped sample text image subset based on a predetermined combination strategy, a random combination of cropped sample text images is achieved, which increases the image background complexity and image diversity of the third sample text images in the third sample text image subset. On this basis, training the deep learning model with the third sample text image subset can improve the generalization performance of the model.
根据本公开的实施例,上述文本图像生成方法还可以包括如下操作。According to embodiments of the present disclosure, the above text image generation method may further include the following operations.
基于所述目标裁剪位置集对所述待裁剪样本文本图像集的样本标签集进行裁剪,得到至少一个裁剪样本标签子集。根据与至少一个样本文本图像子集对应的样本标签子集和至少一个裁剪样本标签子集,得到目标样本标签集。The sample label set of the sample text image set to be cropped is cropped based on the target cropping position set to obtain at least one cropped sample label subset. A target sample label set is obtained based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset.
根据本公开的实施例,根据与至少一个样本文本图像子集对应的样本标签子集和至少一个裁剪样本标签子集,得到目标样本标签集,可以包括如下操作。According to an embodiment of the present disclosure, obtaining a target sample label set based on a sample label subset corresponding to at least one sample text image subset and at least one cropped sample label subset may include the following operations.
根据至少一个裁剪样本标签子集,得到与第三样本文本图像子集对应的样本标签子集。根据与至少一个样本文本图像子集对应的样本标签子集和与第三样本文本图像子集对应的样本标签子集,得到目标样本标签集。According to at least one cropped sample label subset, a sample label subset corresponding to the third sample text image subset is obtained. A target sample label set is obtained according to the sample label subset corresponding to at least one sample text image subset and the sample label subset corresponding to the third sample text image subset.
根据本公开的实施例,根据至少一个裁剪样本标签子集,得到与第三样本文本图像子集对应的样本标签子集,可以包括如下操作。According to an embodiment of the present disclosure, obtaining a sample label subset corresponding to the third sample text image subset based on at least one cropped sample label subset may include the following operations.
基于预定组合策略,将至少一个裁剪样本标签子集中的裁剪样本标签进行组合,得到与第三样本文本图像子集对应的样本标签子集。Based on a predetermined combination strategy, the cropped sample labels in at least one cropped sample label subset are combined to obtain a sample label subset corresponding to the third sample text image subset.
下面参考图3A、图3B、图3C、图3D和图3E,结合具体实施例对根据本公开实施例所述的文本图像生成方法做进一步说明。The text image generation method according to the embodiment of the present disclosure will be further described below with reference to FIG. 3A, FIG. 3B, FIG. 3C, FIG. 3D, and FIG. 3E in conjunction with specific embodiments.
图3A示意性示出了根据本公开实施例的文本图像生成方法的原理示意图。FIG. 3A schematically shows a principle diagram of a text image generation method according to an embodiment of the present disclosure.
如图3A所示,在300A中,根据样本文本图像集的样本文本输出结果集301和样本标签集302,将样本文本图像集303划分为第一样本文本图像子集303_1和第二样本文本图像子集303_2。根据第一样本文本图像子集303_1确定待裁剪样本文本图像集304。As shown in FIG. 3A, in 300A, the sample text image set 303 is divided into a first sample text image subset 303_1 and a second sample text image subset 303_2 according to the sample text output result set 301 and the sample label set 302 of the sample text image set. The sample text image set 304 to be cropped is determined according to the first sample text image subset 303_1.
根据待裁剪样本文本图像集304的样本文本输出结果集305,确定待裁剪样本文本图像集304的目标裁剪位置集306。基于目标裁剪位置集306对待裁剪样本文本图像集304进行裁剪,得到至少一个裁剪样本文本图像子集307。根据至少一个裁剪样本文本图像子集307、第一样本文本图像子集303_1和第二样本文本图像子集303_2,得到目标样本文本图像集308。 According to the sample text output result set 305 of the sample text image set 304 to be cropped, the target cropping position set 306 of the sample text image set 304 to be cropped is determined. The to-be-cropped sample text image set 304 is cropped based on the target cropping position set 306 to obtain at least one cropped sample text image subset 307. According to at least one cropped sample text image subset 307, the first sample text image subset 303_1 and the second sample text image subset 303_2, a target sample text image set 308 is obtained.
图3B示意性示出了根据本公开实施例的第三样本文本图像子集的生成过程的示例示意图。FIG. 3B schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to an embodiment of the present disclosure.
如图3B所示,在300B中,待裁剪样本文本图像集309可以包括待裁剪样本文本图像309_1和待裁剪样本文本图像309_2。As shown in FIG. 3B , in 300B, the sample text image set 309 to be cropped may include a sample text image to be cropped 309_1 and a sample text image to be cropped 309_2.
根据待裁剪样本文本图像309_1的样本文本输出结果,从多个候选裁剪位置中确定目标裁剪位置是“婴和百之间的位置”。基于目标裁剪位置对待裁剪样本文本图像309_1进行裁剪,得到裁剪样本文本图像309_1_1和裁剪样本文本图像309_1_2。裁剪样本文本图像309_1_1是与“母婴”对应的样本文本图像。裁剪样本文本图像309_1_2是与“百汇”对应的样本文本图像。According to the sample text output result of the sample text image 309_1 to be cropped, the target cropping position is determined from multiple candidate cropping positions to be the position between “婴” and “百”. The sample text image 309_1 to be cropped is cropped based on the target cropping position to obtain a cropped sample text image 309_1_1 and a cropped sample text image 309_1_2. The cropped sample text image 309_1_1 is a sample text image corresponding to “母婴” ("mother and baby"). The cropped sample text image 309_1_2 is a sample text image corresponding to “百汇”.
根据待裁剪样本文本图像309_2的样本文本输出结果,从多个候选裁剪位置中确定目标裁剪位置是“转和让之间的位置”。基于目标裁剪位置对待裁剪样本文本图像309_2进行裁剪,得到裁剪样本文本图像309_2_1和裁剪样本文本图像309_2_2。裁剪样本文本图像309_2_1是与“转”对应的样本文本图像。裁剪样本文本图像309_2_2是与“让”对应的样本文本图像。According to the sample text output result of the sample text image 309_2 to be cropped, the target cropping position is determined from multiple candidate cropping positions to be the position between “转” and “让”. The sample text image 309_2 to be cropped is cropped based on the target cropping position to obtain a cropped sample text image 309_2_1 and a cropped sample text image 309_2_2. The cropped sample text image 309_2_1 is a sample text image corresponding to “转”. The cropped sample text image 309_2_2 is a sample text image corresponding to “让”.
基于预定组合策略,将裁剪样本文本图像309_1_1和裁剪样本文本图像309_2_2进行组合,得到第三样本文本图像子集310中的第三样本文本图像310_1,以及将裁剪样本文本图像309_1_2和裁剪样本文本图像309_2_1进行组合,得到第三样本文本图像子集310中的第三样本文本图像310_2。第三样本文本图像310_1是与“母婴让”对应的样本文本图像。第三样本文本图像310_2是与“转百汇”对应的样本文本图像。Based on the predetermined combination strategy, the cropped sample text image 309_1_1 and the cropped sample text image 309_2_2 are combined to obtain the third sample text image 310_1 in the third sample text image subset 310, and the cropped sample text image 309_1_2 and the cropped sample text image 309_2_1 are combined to obtain the third sample text image 310_2 in the third sample text image subset 310. The third sample text image 310_1 is a sample text image corresponding to “母婴让”. The third sample text image 310_2 is a sample text image corresponding to “转百汇”.
图3C示意性示出了根据本公开另一实施例的第三样本文本图像子集的生成过程的示例示意图。FIG. 3C schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
如图3C所示,在300C中,与图3B不同的是,第三样本文本图像311_1是与“让母婴”对应的样本文本图像。第三样本文本图像311_2是与“百汇转”对应的样本文本图像。As shown in FIG. 3C, in 300C, the difference from FIG. 3B is that the third sample text image 311_1 is a sample text image corresponding to “让母婴” and the third sample text image 311_2 is a sample text image corresponding to “百汇转”.
图3D示意性示出了根据本公开另一实施例的第三样本文本图像子集的生成过程的示例示意图。FIG. 3D schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
如图3D所示,在300D中,与图3B不同的是,基于预定组合策略,将裁剪样本文本图像309_1_1和裁剪样本文本图像309_2_1进行组合,得到第三样本文本图像子集312中的第三样本文本图像312_1,以及将裁剪样本文本图像309_1_2和裁剪样本文本图像309_2_2进行组合,得到第三样本文本图像子集312中的第三样本文本图像312_2。第三样本文本图像312_1是与“母婴转”对应的样本文本图像。第三样本文本图像312_2是与“百汇让”对应的样本文本图像。As shown in FIG. 3D, in 300D, the difference from FIG. 3B is that, based on the predetermined combination strategy, the cropped sample text image 309_1_1 and the cropped sample text image 309_2_1 are combined to obtain the third sample text image 312_1 in the third sample text image subset 312, and the cropped sample text image 309_1_2 and the cropped sample text image 309_2_2 are combined to obtain the third sample text image 312_2 in the third sample text image subset 312. The third sample text image 312_1 is a sample text image corresponding to “母婴转”. The third sample text image 312_2 is a sample text image corresponding to “百汇让”.
图3E示意性示出了根据本公开另一实施例的第三样本文本图像子集的生成过程的示例示意图。FIG. 3E schematically illustrates an example schematic diagram of a generation process of a third sample text image subset according to another embodiment of the present disclosure.
如图3E所示,在300E中,与图3D不同的是,第三样本文本图像313_1是与“转母婴”对应的样本文本图像。第三样本文本图像313_2是与“让百汇”对应的样本文本图像。As shown in FIG. 3E, in 300E, the difference from FIG. 3D is that the third sample text image 313_1 is a sample text image corresponding to “转母婴” and the third sample text image 313_2 is a sample text image corresponding to “让百汇”.
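The four combinations in FIG. 3B to FIG. 3E can be reproduced at the label level with the sketch below. It works on label strings only; the corresponding image pieces would be concatenated in the same order. The strategy keys "3B" to "3E" and the function names are hypothetical labels chosen here to mirror the figures, not terms from the disclosure.

```python
def split_label(label, position):
    """Crop a label at a character offset into (head, tail)."""
    return label[:position], label[position:]


def combine(a, b, strategy):
    """Combine the pieces of two cropped labels following one figure's layout."""
    a_head, a_tail = a
    b_head, b_tail = b
    layouts = {
        "3B": [(a_head, b_tail), (b_head, a_tail)],
        "3C": [(b_tail, a_head), (a_tail, b_head)],
        "3D": [(a_head, b_head), (a_tail, b_tail)],
        "3E": [(b_head, a_head), (b_tail, a_tail)],
    }
    return [first + second for first, second in layouts[strategy]]


# Labels of images 309_1 and 309_2, cropped at the target positions.
pieces_1 = split_label("母婴百汇", 2)  # ("母婴", "百汇")
pieces_2 = split_label("转让", 1)      # ("转", "让")
```

Calling `combine(pieces_1, pieces_2, "3B")` gives the labels of third sample text images 310_1 and 310_2, and the other keys give the labels shown in the remaining figures. A random combination strategy could pick one of these layouts at random for each pair.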
图4示意性示出了根据本公开实施例的深度学习模型的训练方法的流程图。Figure 4 schematically shows a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure.
如图4所示,该方法400可以包括操作S410~S420。As shown in Figure 4, the method 400 may include operations S410 to S420.
在操作S410,获取目标样本文本图像集。In operation S410, a target sample text image set is obtained.
在操作S420,利用目标样本文本图像集训练深度学习模型,得到文本图像处理模型。In operation S420, a deep learning model is trained using the target sample text image set to obtain a text image processing model.
根据本公开的实施例,目标样本文本图像集可以是根据本公开实施例所述的文本图像生成方法得到的。According to embodiments of the present disclosure, the target sample text image set may be obtained according to the text image generation method described in the embodiments of the present disclosure.
根据本公开的实施例,由于目标样本文本图像集目标裁剪位置集是根据待裁剪样本文本图像集的样本文本输出结果集确定的,待裁剪样本文本图像集是根据第一样本文本图像子集确定的,第一样本文本图像子集是根据样本文本图像集的样本文本输出结果集和样本标签集从样本文本图像集中确定的包括样本文本输出结果正确的样本文本图像,因此,能够有效保证目标裁剪位置的准确性,有效避免字符信息被破坏。在此基础上,根据至少一个裁剪样本文本图像子集和至少一个样本文本图像子集,得到目标样本文本图像集,能够获得上下文信息更为丰富的目标样本文本图像集。由此,利用目标样本文本图像集进行后续模型的训练优化,降低了模型迭代次数,提高了模型的训练速度,由此,降低了电子设备的数据处理量和资源消耗量,进而获得符合自然规律的电子设备内部性能改进的效果,从而提升电子设备的核心竞争力。According to an embodiment of the present disclosure, the target cropping position set used for the target sample text image set is determined based on the sample text output result set of the sample text image set to be cropped, the sample text image set to be cropped is determined based on the first sample text image subset, and the first sample text image subset is determined from the sample text image set according to the sample text output result set and the sample label set of the sample text image set and includes sample text images whose sample text output results are correct. Therefore, the accuracy of the target cropping position can be effectively guaranteed and the destruction of character information can be effectively avoided. On this basis, the target sample text image set is obtained based on at least one cropped sample text image subset and at least one sample text image subset, so that a target sample text image set with richer contextual information can be obtained. Using the target sample text image set for subsequent model training and optimization therefore reduces the number of model iterations and increases the training speed of the model. This reduces the data processing volume and resource consumption of the electronic device, improving the internal performance of the electronic device in conformity with natural laws and thereby enhancing the core competitiveness of the electronic device.
图5示意性示出了根据本公开实施例的文本图像处理方法的流程图。FIG. 5 schematically shows a flowchart of a text image processing method according to an embodiment of the present disclosure.
如图5所示,该方法500包括操作S510~S520。As shown in Figure 5, the method 500 includes operations S510 to S520.
在操作S510,获取待处理文本图像。In operation S510, a text image to be processed is obtained.
在操作S520,将待处理文本图像输入文本图像处理模型,得到文本图像处理结果。In operation S520, the text image to be processed is input into the text image processing model to obtain a text image processing result.
根据本公开的实施例,文本图像处理模型可以是根据本公开实施例所述的深度学习模型的训练方法训练得到的。According to embodiments of the present disclosure, the text image processing model may be trained according to the deep learning model training method described in the embodiments of the present disclosure.
本公开的技术方案中,所涉及的用户个人信息的收集、存储、使用、加工、传输、提供、公开和应用等处理,均符合相关法律法规的规定,采取了必要保密措施,且不违背公序良俗。在本公开的技术方案中,在获取或采集用户个人信息之前,均获取了用户的授权或同意。In the technical solution of this disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of user personal information are in compliance with relevant laws and regulations, necessary confidentiality measures have been taken, and do not violate public order and good customs . In the technical solution of the present disclosure, the user's authorization or consent is obtained before obtaining or collecting the user's personal information.
以上仅是示例性实施例,但不限于此,还可以包括本领域已知的其他文本图像生成方法、深度学习模型的训练方法和文本图像处理方法,只要能够有效保证目标裁剪位置的准确性和获得上下文信息更为丰富的目标样本文本图像集即可。The above are only exemplary embodiments, but are not limited thereto, and may also include other text image generation methods, deep learning model training methods and text image processing methods known in the art, as long as the accuracy of the target cropping position and Just obtain a target sample text image set with richer contextual information.
图6示意性示出了根据本公开实施例的文本图像生成装置的框图。FIG. 6 schematically shows a block diagram of a text image generating device according to an embodiment of the present disclosure.
如图6所示,文本图像生成装置600可以包括划分模块610、确定模块620、第一获得模块630和第二获得模块640。As shown in FIG. 6 , the text image generating device 600 may include a dividing module 610 , a determining module 620 , a first obtaining module 630 and a second obtaining module 640 .
划分模块610,用于根据样本文本图像集的样本文本输出结果集和样本标签集,将样本文本图像集划分为至少一个样本文本图像子集。至少一个样本文本图像子集包括第一样本文本图像子集。第一样本文本图像子集包括样本文本输出结果正确的样本文本图像。The dividing module 610 is configured to divide the sample text image set into at least one sample text image subset according to the sample text output result set and the sample label set of the sample text image set. At least one sample text image subset includes a first sample text image subset. The first sample text image subset includes sample text images with correct sample text output results.
确定模块620,用于根据待裁剪样本文本图像集的样本文本输出结果集,确定待裁剪样本文本图像集的目标裁剪位置集。待裁剪样本文本图像集是根据第一样本文本图像子集确定的。The determination module 620 is configured to determine a target cropping position set of the sample text image set to be cropped based on the sample text output result set of the sample text image set to be cropped. The set of sample text images to be cropped is determined based on the first subset of sample text images.
第一获得模块630,用于基于目标裁剪位置集对待裁剪样本文本图像集进行裁剪,得到至少一个裁剪样本文本图像子集。The first obtaining module 630 is configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset.
第二获得模块640,用于根据至少一个裁剪样本文本图像子集和至少一个样本文本图像子集,得到目标样本文本图像集。 The second obtaining module 640 is configured to obtain a target sample text image set based on at least one cropped sample text image subset and at least one sample text image subset.
According to an embodiment of the present disclosure, the dividing module 610 may include a comparison sub-module and a dividing sub-module.
The comparison sub-module is configured to compare the sample text output result set of the sample text image set with the sample label set to obtain a comparison result.
The dividing sub-module is configured to divide the sample text image set into the at least one sample text image subset according to the comparison result.
According to an embodiment of the present disclosure, the sample text image set includes a plurality of sample text images, and the at least one sample text image subset further includes a second sample text image subset.
According to an embodiment of the present disclosure, for a sample text image among the plurality of sample text images, the dividing sub-module may include a first determination unit and a second determination unit.
The first determination unit is configured to determine the sample text image as a sample text image in the first sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image satisfies a predetermined matching condition.
The second determination unit is configured to determine the sample text image as a sample text image in the second sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image does not satisfy the predetermined matching condition.
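As a non-limiting illustration, the division performed by the first and second determination units can be sketched as follows. The tuple layout `(image_id, output, label)`, the function names, and the use of exact string equality as the "predetermined matching condition" are assumptions made for this sketch; the disclosure does not fix any of them.

```python
from typing import Callable, List, Tuple

# Hypothetical sample layout: (image_id, model_output, label).
Sample = Tuple[str, str, str]


def exact_match(output: str, label: str) -> bool:
    """Assumed matching condition: the output equals the label exactly."""
    return output == label


def divide_samples(
    samples: List[Sample],
    matches: Callable[[str, str], bool] = exact_match,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (first_subset, second_subset).

    first_subset holds samples whose output satisfies the matching condition;
    second_subset holds the remaining samples.
    """
    first_subset: List[Sample] = []
    second_subset: List[Sample] = []
    for sample in samples:
        _, output, label = sample
        if matches(output, label):
            first_subset.append(sample)
        else:
            second_subset.append(sample)
    return first_subset, second_subset


samples = [
    ("img_0", "hello world", "hello world"),
    ("img_1", "he1lo", "hello"),
    ("img_2", "patent", "patent"),
]
first, second = divide_samples(samples)
print([s[0] for s in first])   # ['img_0', 'img_2']
print([s[0] for s in second])  # ['img_1']
```

Passing the matching condition as a callable keeps the sketch open to looser conditions (for example, a similarity threshold) without changing the division logic.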
According to an embodiment of the present disclosure, the sample text image set to be cropped may include a plurality of sample text images to be cropped.
According to an embodiment of the present disclosure, for a sample text image to be cropped in the sample text image set to be cropped, the determining module 620 may include a determining sub-module.
The determining sub-module is configured to determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
According to an embodiment of the present disclosure, the sample text output result may include at least one of the following: a sample text recognition output result and a sample text semantic output result.
According to an embodiment of the present disclosure, the sample text image set may include a plurality of sample text images.
According to an embodiment of the present disclosure, the sample text recognition output result may be obtained by performing sequence decoding on a global sample feature sequence of the sample text image. The global sample feature sequence may be obtained by performing global feature extraction on a first local sample feature map of the sample text image. The first local sample feature map may be obtained by performing first local feature extraction on the sample text image.
According to an embodiment of the present disclosure, the sample text semantic output result may be obtained by performing semantic understanding on a second local sample feature map of the sample text image. The second local sample feature map may be obtained by performing second local feature extraction on the sample text image.
According to an embodiment of the present disclosure, in the case where the sample text output result includes the sample text recognition result and the sample text semantic output result, the determining sub-module may include a third determination unit and a fourth determination unit.
The third determination unit is configured to determine the plurality of candidate cropping positions according to the sample text recognition output result of the sample text image to be cropped.
The fourth determination unit is configured to determine the at least one target cropping position from the plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
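One possible reading of the two-stage selection performed by the third and fourth determination units is sketched below. It rests on several assumptions that are not stated in the disclosure: the recognition output is taken to be a list of per-character boxes `(character, x_start, x_end)`, the candidate cropping positions are taken to be the midpoints of the gaps between adjacent characters, and the semantic output is taken to be a word segmentation of the recognized text, so that only candidates lying on word boundaries are kept as target cropping positions. All names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical recognition output: one (character, x_start, x_end) per character.
CharBox = Tuple[str, int, int]


def candidate_positions(chars: List[CharBox]) -> List[int]:
    """Return one candidate x-coordinate per gap between adjacent characters.

    candidates[i] lies between character i and character i + 1.
    """
    return [(left[2] + right[1]) // 2 for left, right in zip(chars, chars[1:])]


def target_positions(chars: List[CharBox], words: List[str]) -> List[int]:
    """Keep only the candidates that coincide with a word boundary.

    `words` is the assumed semantic output: a segmentation whose
    concatenation equals the recognized character sequence.
    """
    candidates = candidate_positions(chars)
    boundary_gaps = set()
    consumed = 0
    for word in words[:-1]:  # the end of the last word is the image edge
        consumed += len(word)
        boundary_gaps.add(consumed - 1)  # gap after the word's last character
    return [x for i, x in enumerate(candidates) if i in boundary_gaps]


chars = [("a", 0, 10), ("b", 12, 22), ("c", 26, 36), ("d", 38, 48)]
print(candidate_positions(chars))            # [11, 24, 37]
print(target_positions(chars, ["ab", "cd"]))  # [24]
```

Restricting cuts to word boundaries is one way to avoid cropping through a character or through the middle of a semantic unit; other selection rules are equally compatible with the text above.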
According to an embodiment of the present disclosure, the first obtaining module 630 may include a first obtaining sub-module.
The first obtaining sub-module is configured to crop the sample text image set to be cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
According to an embodiment of the present disclosure, the second obtaining module 640 may include a second obtaining sub-module and a third obtaining sub-module.
The second obtaining sub-module is configured to obtain a third sample text image subset according to the at least one cropped sample text image subset.
The third obtaining sub-module is configured to obtain the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
According to an embodiment of the present disclosure, the second obtaining sub-module may include an obtaining unit.
The obtaining unit is configured to combine the cropped sample text images in the at least one cropped sample text image subset based on a predetermined combination strategy to obtain the third sample text image subset.
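The cropping and combining steps can be illustrated with a minimal sketch under stated assumptions. Images are represented here as lists of pixel rows, the first and second cropped subsets are assumed to hold the left and right parts of each image, and the "predetermined combination strategy" is assumed to be horizontal concatenation of a left part from one image with a right part from another, with the labels joined accordingly. None of these choices is prescribed by the disclosure, and the function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values (hypothetical representation)
LabeledCrop = Tuple[Image, str]  # (pixels, label)


def crop_at(image: Image, label: str, x: int, k: int) -> Tuple[LabeledCrop, LabeledCrop]:
    """Split an image at column x and its label after k characters."""
    left = [row[:x] for row in image]
    right = [row[x:] for row in image]
    return (left, label[:k]), (right, label[k:])


def combine(left: LabeledCrop, right: LabeledCrop) -> LabeledCrop:
    """Concatenate two equal-height crops side by side and join their labels."""
    (left_img, left_label), (right_img, right_label) = left, right
    if len(left_img) != len(right_img):
        raise ValueError("crops must have the same height")
    pixels = [a + b for a, b in zip(left_img, right_img)]
    return pixels, left_label + right_label


img_a = [[1, 1, 2, 2], [1, 1, 2, 2]]  # labeled "ab"
img_b = [[3, 3, 4, 4], [3, 3, 4, 4]]  # labeled "cd"

a_left, a_right = crop_at(img_a, "ab", x=2, k=1)
b_left, b_right = crop_at(img_b, "cd", x=2, k=1)

# Assumed strategy: pair the left part of one image with the right part of another.
new_image, new_label = combine(a_left, b_right)
print(new_image)  # [[1, 1, 4, 4], [1, 1, 4, 4]]
print(new_label)  # ad
```

Because each crop carries its own label fragment, the combined sample is labeled without any further annotation.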
According to an embodiment of the present disclosure, the first sample text image set may include a plurality of first sample text images.
According to an embodiment of the present disclosure, the sample text image set to be cropped may be determined in the following manner:
for a first sample text image among the plurality of first sample text images,
when it is determined that a predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold, the first sample text image is determined as a sample text image to be cropped in the sample text image set to be cropped.
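The threshold test above can be sketched as follows. The text does not say how the predetermined probability value is assigned to each image; this sketch assumes, purely for illustration, that it is a uniform random draw per image, so that on average a fraction of the first sample text images roughly equal to the threshold is selected for cropping. The function name and the use of image identifiers are hypothetical.

```python
import random
from typing import List, Sequence


def select_for_cropping(
    image_ids: Sequence[str],
    threshold: float,
    rng: random.Random,
) -> List[str]:
    """Return the images whose drawn probability value is <= threshold."""
    selected = []
    for image_id in image_ids:
        value = rng.random()  # assumed source of the per-image probability value
        if value <= threshold:
            selected.append(image_id)
    return selected


rng = random.Random(0)  # seeded so that a run is repeatable
first_subset = [f"img_{i}" for i in range(10)]
to_crop = select_for_cropping(first_subset, threshold=0.5, rng=rng)
print(len(to_crop), "of", len(first_subset), "images selected for cropping")
```

Under this assumption, the threshold controls how much of the correctly recognized data is recycled into cropped samples.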
According to an embodiment of the present disclosure, the text image generating device may further include a third obtaining module and a fourth obtaining module.
The third obtaining module is configured to perform data augmentation on an original sample text image set to obtain an intermediate sample text image set.
The fourth obtaining module is configured to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
According to an embodiment of the present disclosure, the sample text image set may be a text image set for a text vision task.
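A minimal sketch of the third and fourth obtaining modules follows, assuming that the augmentation is a simple brightness shift and that the sample text image set is the union of the original and intermediate sets. Both choices, as well as the function names and the list-of-rows image representation, are assumptions made for illustration; the disclosure does not specify the augmentation or the way the two sets are merged.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of 8-bit pixel values (hypothetical representation)


def brighten(image: Image, delta: int = 20) -> Image:
    """Example augmentation: shift every pixel by delta, clamped to [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def build_sample_set(
    original: List[Image],
    augment: Callable[[Image], Image] = brighten,
) -> List[Image]:
    """Augment each original image, then merge originals and augmented copies."""
    intermediate = [augment(image) for image in original]
    return original + intermediate


original_set = [[[0, 250]], [[100, 100]]]
sample_set = build_sample_set(original_set)
print(sample_set)  # [[[0, 250]], [[100, 100]], [[20, 255]], [[120, 120]]]
```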
FIG. 7 schematically shows a block diagram of a training device for a deep learning model according to an embodiment of the present disclosure.
As shown in FIG. 7, the training device 700 for a deep learning model may include a first acquisition module 710 and a fifth obtaining module 720.
The first acquisition module 710 is configured to acquire a target sample text image set.
The fifth obtaining module 720 is configured to train the deep learning model using the target sample text image set to obtain a text image processing model.
According to an embodiment of the present disclosure, the target sample text image set may be obtained by the text image generating device according to an embodiment of the present disclosure.
FIG. 8 schematically shows a block diagram of a text image processing device according to an embodiment of the present disclosure.
As shown in FIG. 8, the text image processing device 800 may include a second acquisition module 810 and a sixth obtaining module 820.
The second acquisition module 810 is configured to acquire a text image to be processed.
The sixth obtaining module 820 is configured to input the text image to be processed into a text image processing model to obtain a text image processing result.
According to an embodiment of the present disclosure, the text image processing model may be trained by the training device for a deep learning model according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing the text image generation method, the deep learning model training method, and the text image processing method according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 may also store various programs and data required for the operation of the electronic device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to one another via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A plurality of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays and speakers; a storage unit 908, such as a magnetic disk or an optical disk; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 901 performs the methods and processing described above, for example, the text image generation method, the deep learning model training method, and the text image processing method. For example, in some embodiments, the text image generation method, the deep learning model training method, and the text image processing method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text image generation method, the deep learning model training method, and the text image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text image generation method, the deep learning model training method, and the text image processing method in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
The specific embodiments described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (33)

  1. A text image generation method, comprising:
    dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset comprises a first sample text image subset, and the first sample text image subset comprises sample text images whose sample text output results are correct;
    determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined according to the first sample text image subset;
    cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and
    obtaining a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
  2. The method according to claim 1, wherein the dividing a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set comprises:
    comparing the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
    dividing the sample text image set into the at least one sample text image subset according to the comparison result.
  3. The method according to claim 2, wherein the sample text image set comprises a plurality of sample text images, and the at least one sample text image subset further comprises a second sample text image subset;
    wherein the dividing the sample text image set into the at least one sample text image subset according to the comparison result comprises:
    for a sample text image among the plurality of sample text images,
    determining the sample text image as a sample text image in the first sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image satisfies a predetermined matching condition; and
    determining the sample text image as a sample text image in the second sample text image subset when it is determined that the relationship between the sample text output result and the sample label of the sample text image does not satisfy the predetermined matching condition.
  4. The method according to any one of claims 1 to 3, wherein the sample text image set to be cropped comprises a plurality of sample text images to be cropped;
    wherein the determining a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped comprises:
    for a sample text image to be cropped in the sample text image set to be cropped,
    determining at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
  5. The method according to claim 4, wherein the sample text output result comprises at least one of the following: a sample text recognition output result and a sample text semantic output result.
  6. The method according to claim 5, wherein the sample text image set comprises a plurality of sample text images;
    wherein the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;
    wherein the sample text semantic output result is obtained by performing semantic understanding on a second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
  7. The method according to claim 5, wherein, in the case where the sample text output result comprises the sample text recognition result and the sample text semantic output result, the determining at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped comprises:
    determining the plurality of candidate cropping positions according to the sample text recognition output result of the sample text image to be cropped; and
    determining the at least one target cropping position from the plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
  8. The method according to any one of claims 1 to 3, wherein the cropping the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset comprises:
    cropping the sample text image set to be cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
  9. The method according to any one of claims 1 to 3, wherein the obtaining a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset comprises:
    obtaining a third sample text image subset according to the at least one cropped sample text image subset; and
    obtaining the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
  10. The method according to claim 9, wherein the obtaining a third sample text image subset according to the at least one cropped sample text image subset comprises:
    combining the cropped sample text images in the at least one cropped sample text image subset based on a predetermined combination strategy to obtain the third sample text image subset.
  11. 根据权利要求1~3中任一项所述的方法,其中,所述第一样本文本图像集包括多个第一样本文本图像;The method according to any one of claims 1 to 3, wherein the first sample text image set includes a plurality of first sample text images;
    其中,所述待裁剪样本文本图像集是通过以下方式确定的:Wherein, the sample text image set to be cropped is determined in the following way:
    针对所述多个第一样本文本图像中的第一样本文本图像,For a first sample text image among the plurality of first sample text images,
    在确定所述第一样本文本图像的预定概率值小于或等于预定概率阈值的情况下,将所述第一样本文本图像确定为所述待裁剪样本文本图像集中的待裁剪样本文本图像。If it is determined that the predetermined probability value of the first sample text image is less than or equal to the predetermined probability threshold, the first sample text image is determined as the sample text image to be cropped in the set of sample text images to be cropped.
  12. 根据权利要求1~3中任一项所述的方法,还包括:The method according to any one of claims 1 to 3, further comprising:
    对原始样本文本图像集进行数据增强处理,得到中间样本文本图像集;以及 Perform data enhancement processing on the original sample text image set to obtain an intermediate sample text image set; and
    根据所述原始样本文本图像集和所述中间样本文本图像集,得到所述样本文本图像集。The sample text image set is obtained according to the original sample text image set and the intermediate sample text image set.
  13. The method according to any one of claims 1 to 3, wherein the sample text image set is a text image set for a text vision task.
  14. A method for training a deep learning model, comprising:
    acquiring a target sample text image set; and
    training the deep learning model using the target sample text image set to obtain a text image processing model,
    wherein the target sample text image set is obtained using the method according to any one of claims 1 to 13.
  15. A text image processing method, comprising:
    acquiring a text image to be processed; and
    inputting the text image to be processed into a text image processing model to obtain a text image processing result,
    wherein the text image processing model is trained using the method according to claim 14.
  16. A text image generation apparatus, comprising:
    a dividing module configured to divide a sample text image set into at least one sample text image subset according to a sample text output result set and a sample label set of the sample text image set, wherein the at least one sample text image subset comprises a first sample text image subset, and the first sample text image subset comprises sample text images whose sample text output results are correct;
    a determination module configured to determine a target cropping position set of a sample text image set to be cropped according to a sample text output result set of the sample text image set to be cropped, wherein the sample text image set to be cropped is determined according to the first sample text image subset;
    a first obtaining module configured to crop the sample text image set to be cropped based on the target cropping position set to obtain at least one cropped sample text image subset; and
    a second obtaining module configured to obtain a target sample text image set according to the at least one cropped sample text image subset and the at least one sample text image subset.
  17. The apparatus according to claim 16, wherein the dividing module comprises:
    a comparison submodule configured to compare the sample text output result set of the sample text image set with the sample label set to obtain a comparison result; and
    a dividing submodule configured to divide the sample text image set into the at least one sample text image subset according to the comparison result.
  18. The apparatus according to claim 17, wherein the sample text image set comprises a plurality of sample text images, and the at least one sample text image subset further comprises a second sample text image subset;
    wherein, for a sample text image among the plurality of sample text images, the dividing submodule comprises:
    a first determination unit configured to determine, in a case where a relationship between the sample text output result and the sample label of the sample text image satisfies a predetermined matching condition, the sample text image as a sample text image in the first sample text image subset; and
    a second determination unit configured to determine, in a case where the relationship between the sample text output result and the sample label of the sample text image does not satisfy the predetermined matching condition, the sample text image as a sample text image in the second sample text image subset.
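The division performed by the two determination units of claim 18 can be sketched as a simple partition. The claim leaves the predetermined matching condition open; exact string equality between the model output and the label is an assumption of this example, and the `Sample` layout and function names are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, str]  # (image_id, sample_text_output, sample_label)


def exact_match(output: str, label: str) -> bool:
    """Assumed matching condition: the output equals the label exactly."""
    return output == label


def partition(
    samples: List[Sample],
    matches: Callable[[str, str], bool] = exact_match,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (first subset: condition met, second subset: condition not met)."""
    first: List[Sample] = []
    second: List[Sample] = []
    for sample in samples:
        _, output, label = sample
        (first if matches(output, label) else second).append(sample)
    return first, second
```

Passing a different `matches` callable (for example, a case-insensitive comparison) would model a different matching condition without changing the partition logic.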
  19. The apparatus according to any one of claims 16 to 18, wherein the sample text image set to be cropped comprises a plurality of sample text images to be cropped;
    wherein, for a sample text image to be cropped in the sample text image set to be cropped, the determination module comprises:
    a determination submodule configured to determine at least one target cropping position from a plurality of candidate cropping positions according to the sample text output result of the sample text image to be cropped.
  20. The apparatus according to claim 19, wherein the sample text output result comprises at least one of: a sample text recognition output result and a sample text semantic output result.
  21. The apparatus according to claim 20, wherein the sample text image set comprises a plurality of sample text images;
    wherein the sample text recognition output result is obtained by performing sequence decoding on a global sample feature sequence of the sample text image, the global sample feature sequence is obtained by performing global feature extraction on a first local sample feature map of the sample text image, and the first local sample feature map is obtained by performing first local feature extraction on the sample text image;
    wherein the sample text semantic output result is obtained by performing semantic understanding on a second local sample feature map of the sample text image, and the second local sample feature map is obtained by performing second local feature extraction on the sample text image.
  22. The apparatus according to claim 20, wherein, in a case where the sample text output result comprises the sample text recognition result and the sample text semantic output result, the determination submodule comprises:
    a third determination unit configured to determine the plurality of candidate cropping positions according to the sample text recognition output result of the sample text image to be cropped; and
    a fourth determination unit configured to determine at least one target cropping position from the plurality of candidate cropping positions according to the sample text semantic output result of the sample text image to be cropped.
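The two-stage selection of claim 22 can be illustrated with the sketch below. The claim does not specify the form of either output. This example assumes that the recognition output provides a horizontal extent for each recognized character, that candidate cropping positions are the midpoints of the gaps between adjacent characters, and that the semantic output is a set of character indices at which a semantic unit (such as a word) begins. All of these, and the names used, are hypothetical.

```python
from typing import List, Set, Tuple

CharBox = Tuple[str, int, int]  # (character, x_start, x_end), assumed recognition output


def candidate_positions(recognition: List[CharBox]) -> List[Tuple[int, int]]:
    """Return (index_of_next_char, x) for the gap between each pair of adjacent characters."""
    candidates = []
    for i in range(len(recognition) - 1):
        _, _, end = recognition[i]
        _, start, _ = recognition[i + 1]
        candidates.append((i + 1, (end + start) // 2))
    return candidates


def target_positions(
    candidates: List[Tuple[int, int]],
    semantic_unit_starts: Set[int],
) -> List[int]:
    """Keep only the candidates that coincide with the start of a semantic unit."""
    return [x for index, x in candidates if index in semantic_unit_starts]
```

Under these assumptions, a cut is made only where it does not split a semantic unit, so each cropped part still contains whole words.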
  23. The apparatus according to any one of claims 16 to 18, wherein the first obtaining module comprises:
    a first obtaining submodule configured to crop the sample text image set to be cropped based on the target cropping position set to obtain a first cropped sample text image subset and a second cropped sample text image subset.
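The cropping step of claim 23 can be sketched as follows. The claim does not say how crops are assigned to the two subsets; this example assumes one vertical cut per image, with the left part placed in the first cropped subset and the right part in the second. Images are modelled as lists of pixel rows, and the function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels (hypothetical representation)


def crop_at(image: Image, x: int) -> Tuple[Image, Image]:
    """Cut an image vertically at column x into a left part and a right part."""
    left = [row[:x] for row in image]
    right = [row[x:] for row in image]
    return left, right


def crop_set(images: List[Image], positions: List[int]) -> Tuple[List[Image], List[Image]]:
    """Crop each image at its target position; return (first subset, second subset)."""
    if len(images) != len(positions):
        raise ValueError("one target cropping position is expected per image")
    first: List[Image] = []
    second: List[Image] = []
    for image, x in zip(images, positions):
        left, right = crop_at(image, x)
        first.append(left)
        second.append(right)
    return first, second
```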
  24. The apparatus according to any one of claims 16 to 18, wherein the second obtaining module comprises:
    a second obtaining submodule configured to obtain a third sample text image subset according to the at least one cropped sample text image subset; and
    a third obtaining submodule configured to obtain the target sample text image set according to the at least one sample text image subset and the third sample text image subset.
  25. The apparatus according to claim 24, wherein the second obtaining submodule comprises:
    an obtaining unit configured to combine, based on a predetermined combination strategy, the cropped sample text images in the at least one cropped sample text image subset to obtain the third sample text image subset.
  26. The apparatus according to any one of claims 16 to 18, wherein the first sample text image set comprises a plurality of first sample text images;
    wherein the sample text image set to be cropped is determined by:
    for a first sample text image among the plurality of first sample text images,
    determining, in a case where a predetermined probability value of the first sample text image is less than or equal to a predetermined probability threshold, the first sample text image as a sample text image to be cropped in the sample text image set to be cropped.
  27. The apparatus according to any one of claims 16 to 18, further comprising:
    a third obtaining module configured to perform data augmentation on an original sample text image set to obtain an intermediate sample text image set; and
    a fourth obtaining module configured to obtain the sample text image set according to the original sample text image set and the intermediate sample text image set.
  28. The apparatus according to any one of claims 16 to 18, wherein the sample text image set is a text image set for a text vision task.
  29. An apparatus for training a deep learning model, comprising:
    a first acquisition module configured to acquire a target sample text image set; and
    a fifth obtaining module configured to train the deep learning model using the target sample text image set to obtain a text image processing model,
    wherein the target sample text image set is obtained using the apparatus according to any one of claims 16 to 28.
  30. A text image processing apparatus, comprising:
    a second acquisition module configured to acquire a text image to be processed; and
    a sixth obtaining module configured to input the text image to be processed into a text image processing model to obtain a text image processing result,
    wherein the text image processing model is trained using the apparatus according to claim 29.
  31. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1 to 15.
  32. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 15.
  33. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 15.
PCT/CN2023/074125 2022-08-24 2023-02-01 Text image generation, training, and processing methods, and electronic device WO2024040870A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211015424.6A CN115082598B (en) 2022-08-24 2022-08-24 Text image generation, training, text image processing method and electronic equipment
CN202211015424.6 2022-08-24

Publications (1)

Publication Number Publication Date
WO2024040870A1 (en) 2024-02-29

Family

ID=83244124

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074125 WO2024040870A1 (en) 2022-08-24 2023-02-01 Text image generation, training, and processing methods, and electronic device

Country Status (2)

Country Link
CN (1) CN115082598B (en)
WO (1) WO2024040870A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082598B (en) * 2022-08-24 2023-07-11 北京百度网讯科技有限公司 Text image generation, training, text image processing method and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978044A (en) * 2019-03-20 2019-07-05 广州云测信息技术有限公司 The training method and device of training data generation method and device and model
US10489682B1 (en) * 2017-12-21 2019-11-26 Automation Anywhere, Inc. Optical character recognition employing deep learning with machine generated training data
US20210319246A1 (en) * 2020-04-08 2021-10-14 Konica Minolta Business Solutions U.S.A., Inc. Online training data generation for optical character recognition
CN113657370A (en) * 2021-08-26 2021-11-16 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN114529909A (en) * 2022-02-17 2022-05-24 北京百度网讯科技有限公司 Sample data set generation method and device and electronic equipment
CN115082598A (en) * 2022-08-24 2022-09-20 北京百度网讯科技有限公司 Text image generation method, text image training method, text image processing method and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5680479A (en) * 1992-04-24 1997-10-21 Canon Kabushiki Kaisha Method and apparatus for character recognition
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
RU2695489C1 (en) * 2018-03-23 2019-07-23 Общество с ограниченной ответственностью "Аби Продакшн" Identification of fields on an image using artificial intelligence
CN111695385B (en) * 2019-03-15 2023-09-26 杭州海康威视数字技术股份有限公司 Text recognition method, device and equipment
CN110796031A (en) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Table identification method and device based on artificial intelligence and electronic equipment
WO2022154787A1 (en) * 2021-01-13 2022-07-21 Hewlett-Packard Development Company, L.P. Image region of interest defect detection
CN112766418A (en) * 2021-03-02 2021-05-07 阳光财产保险股份有限公司 Image text direction classification method, device, equipment and storage medium
CN113920296B (en) * 2021-11-23 2022-07-15 厦门市美亚柏科信息股份有限公司 Text recognition method and system based on comparative learning
CN114818708B (en) * 2022-04-20 2023-04-18 北京百度网讯科技有限公司 Key information extraction method, model training method, related device and electronic equipment
CN114639064B (en) * 2022-05-18 2022-09-02 智洋创新科技股份有限公司 Water level identification method and device

Also Published As

Publication number Publication date
CN115082598A (en) 2022-09-20
CN115082598B (en) 2023-07-11

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23855967; Country of ref document: EP; Kind code of ref document: A1)