WO2023035531A1

WO2023035531A1 - Super-resolution reconstruction method for text image and related device thereof

Info

Publication number: WO2023035531A1
Application number: PCT/CN2022/071883
Authority: WO
Inventors: 郑喜民; 翟尤; 舒畅; 陈又新
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-09-10
Filing date: 2022-01-13
Publication date: 2023-03-16
Also published as: CN113763249A

Abstract

Embodiments of the present application belong to the technical field of artificial intelligence, are applied to the field of intelligent medical treatment, and relate to a super-resolution reconstruction method for a text image and a related device thereof. The method comprises: inputting a low-resolution picture into a scene text recognition model to obtain text position and text content information; generating a text mask on the basis of the text position information and the text content information, and upsampling the text mask to obtain a target mask; inputting the low-resolution picture and the target mask into an adversarial network to obtain a discrimination result, and calculating discrimination accuracy on the basis of the discrimination result; calculating a loss function on the basis of the low-resolution picture and the target mask until the loss function converges, and obtaining a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold; and inputting a received low-resolution picture to be converted into the trained adversarial network to obtain a target super-resolution picture. The trained adversarial network may be stored in a blockchain. The present application may ensure the quality of the super-resolution reconstruction of a text image.

Description

Text image super-resolution reconstruction method and related equipment

This application claims priority to Chinese patent application No. 202111061974.7, filed with the China Patent Office on September 10, 2021 and titled "Text Image Super-Resolution Reconstruction Method and Related Devices", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of artificial intelligence, in particular to a text image super-resolution reconstruction method and related equipment.

Background

Super-resolution reconstruction means that for any given low-resolution picture, the corresponding high-resolution picture is generated through a convolutional neural network, and the details and textures in the picture are preserved and restored as much as possible. Super-resolution reconstruction technology plays a good role in promoting the development of related fields such as image classification, segmentation, tracking, and dehazing, and plays an important role in the development of neural networks.

The inventors realized that text pictures differ from natural scenes: text content has fixed shapes and clear edges, so the reconstruction requirements are higher. In ordinary pictures, most of the scene is natural and random, which makes it easier to convert a low-resolution picture into a high-resolution one. For text in a scene, however, any distortion, abrupt color change, or blurring of text edges in the reconstructed image significantly reduces its quality.

Summary of the invention

The purpose of the embodiment of the present application is to propose a text image super-resolution reconstruction method and its related equipment, so as to ensure the quality of the text image super-resolution reconstruction.

In order to solve the above technical problems, the embodiment of the present application provides a text image super-resolution reconstruction method, which adopts the following technical solution:

A text image super-resolution reconstruction method, comprising the steps of:

receiving a low-resolution picture and a corresponding high-resolution picture, inputting the low-resolution picture into a pre-trained scene text recognition model, and obtaining output text position information and text content information;

generating a text mask based on the text position information and the text content information, and upsampling the text mask to obtain a target mask;

inputting the low-resolution picture and the target mask into the generation layer of a preset adversarial network to obtain an output super-resolution picture;

simultaneously inputting the super-resolution picture and the high-resolution picture into the discrimination layer of the adversarial network, obtaining an output discrimination result, and calculating a discrimination accuracy based on the discrimination result;

calculating the loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and obtaining a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold;

receiving a low-resolution picture to be converted, inputting the low-resolution picture to be converted into the trained adversarial network, and obtaining an output target super-resolution picture.

In order to solve the above technical problems, the embodiment of the present application also provides a text image super-resolution reconstruction device, which adopts the following technical solutions:

A text image super-resolution reconstruction device, comprising:

a receiving module, configured to receive a low-resolution picture and a corresponding high-resolution picture, input the low-resolution picture into a pre-trained scene text recognition model, and obtain output text position information and text content information;

an upsampling module, configured to generate a text mask based on the text position information and the text content information, and upsample the text mask to obtain a target mask;

a generating module, configured to input the low-resolution picture and the target mask into the generation layer of a preset adversarial network to obtain an output super-resolution picture;

a discrimination module, configured to simultaneously input the super-resolution picture and the high-resolution picture into the discrimination layer of the adversarial network, obtain an output discrimination result, and calculate a discrimination accuracy based on the discrimination result;

a calculation module, configured to calculate the loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and to obtain a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold;

an obtaining module, configured to receive a low-resolution picture to be converted, input the low-resolution picture to be converted into the trained adversarial network, and obtain an output target super-resolution picture.

In order to solve the above technical problems, the embodiment of the present application also provides a computer device, which adopts the following technical solution:

A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor, when executing the computer-readable instructions, implements the steps of the text image super-resolution reconstruction method described above.

In order to solve the above technical problems, the embodiment of the present application also provides a computer-readable storage medium, which adopts the following technical solution:

A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the text image super-resolution reconstruction method described above are implemented.

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:

The application obtains text position information and text content information from the received low-resolution picture and generates a text mask based on both. Because the text mask takes both position and content into account, it makes the boundary between the text and the surrounding image clear, so that the text in the subsequently generated super-resolution picture is sharp and the quality of the reconstructed picture is significantly improved. Upsampling enlarges the text mask and increases its resolution, which facilitates the subsequent generation of super-resolution pictures. Through adversarial training of the generation layer and the discrimination layer in the adversarial network, a trained adversarial network is obtained that generates a target super-resolution picture of better quality.

Description of drawings

In order to illustrate the solution in this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;

FIG. 2 is a flowchart of an embodiment of the text image super-resolution reconstruction method according to the present application;

FIG. 3 is a schematic structural diagram of an embodiment of a text image super-resolution reconstruction device according to the present application;

FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.

Reference numerals: 200, computer equipment; 201, memory; 202, processor; 203, network interface; 300, text image super-resolution reconstruction device; 301, receiving module; 302, upsampling module; 303, generating module; 304, discrimination module; 305, calculation module; 306, obtaining module.

Detailed description

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the application. The terms used in the description of the application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The terms "comprising" and "having" and any variations thereof in the specification and claims of the present application and in the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the description and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are independent or alternative embodiments mutually exclusive of other embodiments. It is understood explicitly and implicitly by those skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings.

As shown in FIG. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.

The terminal devices 101, 102, 103 can be various electronic devices that have display screens and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers and desktop computers.

The server 105 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that the text image super-resolution reconstruction method provided in the embodiment of the present application is generally executed by a server/terminal device, and correspondingly, the text image super-resolution reconstruction device is generally set in the server/terminal device.

It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.

Continuing to refer to FIG. 2 , it shows a flowchart of an embodiment of a text image super-resolution reconstruction method according to the present application. The described text image super-resolution reconstruction method comprises the following steps:

S1: Receive a low-resolution picture and a corresponding high-resolution picture, input the low-resolution picture into a pre-trained scene text recognition model, and obtain output text position information and text content information.

In this embodiment, the size of the low-resolution picture (Low Resolution image, LR image) is W*H. The low-resolution picture is input into the scene text recognition model to obtain the position and content of the scene text. The scene text recognition model used in this application is the text recognition model ASTER ("ASTER: An Attentional Scene Text Recognizer with Flexible Rectification"). The text recognition model is trained in advance.

In this embodiment, the electronic device on which the text image super-resolution reconstruction method runs (such as the server/terminal device shown in FIG. 1) can receive the low-resolution picture and the corresponding high-resolution picture through a wired or wireless connection. It should be pointed out that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband), and other wireless connection methods now known or developed in the future.

S2: Generate a text mask based on the text position information and the text content information, and perform upsampling on the text mask to obtain a target mask.

In this embodiment, a text mask (textmask) emphasizing only the text part is generated based on the text position information and the text content information, and the size of the text mask is the same as that of the low-resolution picture. In the text mask, pixels where text exists are marked as 1 and pixels where text does not exist are marked as 0, so a two-dimensional mask is obtained. At size W*H, it is the text mask of the low-resolution picture. The text mask is upsampled to obtain a new text mask, the target mask, whose size is rW*rH; the target mask then has the same size as the generated high-resolution (super-resolution) picture. The received high-resolution picture needs to be resized to match the size of the target mask, which simplifies subsequent calculations. The target mask is used to supervise the output of the generation layer (i.e., the super-resolution picture). This application does not need annotated high-resolution pictures to complete the scene text recognition part of the operation.
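The 0/1 marking described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the text position information arrives as axis-aligned boxes `(x0, y0, x1, y1)` with exclusive upper bounds; the source does not state the format, and `make_text_mask` is a hypothetical helper name. Plain Python lists are used so the sketch has no dependencies.

```python
def make_text_mask(width, height, boxes):
    """Return a height x width mask holding 1 where text exists, else 0.

    boxes: iterable of (x0, y0, x1, y1), upper bounds exclusive.
    The box format is an assumption, not taken from the source.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the picture so out-of-range detections are safe.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


# A 6x4 low-resolution picture with one detected text region.
mask = make_text_mask(width=6, height=4, boxes=[(1, 1, 4, 3)])
for row in mask:
    print(row)
```

A real mask would likely follow the glyph shapes more closely than a rectangle does; the rectangle only shows the marking rule.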

Specifically, the step of generating a text mask based on the text position information and the text content information includes:

modifying the text position information based on the text content information to obtain target text position information;

generating the text mask based on the target text position information.

In this embodiment, the text mask is generated from both the text position information and the text content information. Some regions of a picture may be captured unclearly, or the recognized position may be wrong; by also using the recognized text content, the computer generates a more accurate mask. For example, a mask generated without considering content might read "goed", whereas a network that considers content outputs "good".

In addition, as another embodiment of the present application, the step of upsampling the text mask and obtaining the target mask includes:

performing multiple upsampling on the text mask to obtain the target mask.

In this embodiment, the text mask is upsampled by a factor of 5: the text mask is enlarged 5 times to increase its resolution, and the generated super-resolution picture is 5 times the size of the low-resolution picture.
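As a rough illustration of the enlargement step, the sketch below upsamples a 0/1 mask by an integer factor r. The interpolation method is not specified in the source; nearest-neighbour repetition is assumed here because it keeps the mask strictly binary, and `upsample_mask` is a hypothetical name.

```python
def upsample_mask(mask, r):
    """Nearest-neighbour upsampling of a 2-D 0/1 mask by an integer factor r."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(r)]  # repeat each pixel r times
        out.extend(list(wide) for _ in range(r))   # repeat each row r times
    return out


low = [[0, 1],
       [1, 0]]
target = upsample_mask(low, 5)  # factor 5, as in this embodiment
print(len(target), len(target[0]))  # 10 10
```

With a W*H input the result has size rW*rH, matching the target mask size used in the rest of the description.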

S3: Input the low-resolution picture and the target mask into the generation layer of a preset adversarial network to obtain an output super-resolution picture.

In this embodiment, after the scene text is recognized, the computer generates a super-resolution picture through the generation layer of the generative adversarial network (Generative Adversarial Network, GAN): the low-resolution picture and the target mask are input simultaneously into the generation layer (Generator), which generates a super-resolution picture (Super Resolution image, SR image).

S4: Simultaneously input the super-resolution picture and the high-resolution picture into the discrimination layer of the adversarial network, obtain an output discrimination result, and calculate a discrimination accuracy based on the discrimination result.

In this embodiment, the super-resolution picture and the high-resolution picture (High Resolution image, HR image) are input into the discrimination layer (Discriminator) at the same time, and the discrimination layer outputs a discrimination result indicating whether its input is a super-resolution picture or a high-resolution picture. For example, the discrimination layer outputs 0 or 1, where 0 means "the picture is a generated picture (super-resolution picture)" and 1 means "the picture is a real picture (high-resolution picture)". Through adversarial training of the generation layer and the discrimination layer, the super-resolution pictures produced by the generation layer become more and more similar to high-resolution pictures of natural scenes and therefore harder and harder to tell apart. When the accuracy of the discrimination layer's output falls below the accuracy threshold, the discrimination layer can hardly tell whether its input is a real picture or a super-resolution picture generated by the generation layer. This indicates that the generated super-resolution pictures are of high quality and close to real pictures, that is, the training goal has been reached and the network can be used in practical applications. The discrimination accuracy is the ratio of the number of correct discrimination results output by the discrimination layer within a preset time period to the total number of discrimination results in that period.
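The accuracy ratio defined above can be sketched in a few lines. The labels follow the convention in the text (0 for a generated picture, 1 for a real one); the function name and the threshold value are hypothetical, since the source gives no concrete threshold.

```python
def discrimination_accuracy(predictions, labels):
    """Fraction of discriminator outputs that match the true labels.

    predictions, labels: equal-length sequences of 0 (generated) or 1 (real).
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)


# Four judgments collected in one period, three of them correct.
acc = discrimination_accuracy([1, 0, 0, 1], [1, 0, 1, 1])
ACCURACY_THRESHOLD = 0.8  # hypothetical value; the source does not give one
print(acc, acc < ACCURACY_THRESHOLD)  # 0.75 True
```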

S5: Calculate the loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and when the discrimination accuracy is lower than the accuracy threshold, obtain the trained adversarial network.

In this embodiment, the loss functions involved are mainly those used when the generative adversarial network generates pictures, including a content loss function (content loss), an adversarial loss function (adversarial loss), and a regularization loss function (regularization loss), together with a text-aware loss function (text perceptual loss) designed for text masks. When the loss function has converged and the discrimination accuracy is lower than the accuracy threshold, training of the adversarial network is determined to be complete, and an adversarial network with better performance is obtained.
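The stopping rule in S5 combines two conditions. The sketch below is one possible reading, under assumptions the source does not state: the four loss terms are combined as a weighted sum (all weights default to 1.0 as placeholders), and "converges" is taken to mean that the last few loss values differ by less than a tolerance. All names and default values are hypothetical.

```python
def total_loss(content, adversarial, regularization, text_aware,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms (weights are placeholders)."""
    terms = (content, adversarial, regularization, text_aware)
    return sum(w * t for w, t in zip(weights, terms))


def training_finished(loss_history, accuracy, accuracy_threshold,
                      tolerance=1e-4, window=3):
    """True when the loss has stopped changing and the accuracy is below the threshold."""
    if len(loss_history) < window + 1:
        return False
    recent = loss_history[-(window + 1):]
    # Assumed convergence test: consecutive losses differ by less than tolerance.
    converged = all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))
    return converged and accuracy < accuracy_threshold


history = [0.9, 0.5, 0.30001, 0.30002, 0.30001, 0.30000]
print(training_finished(history, accuracy=0.55, accuracy_threshold=0.6))  # True
print(training_finished(history, accuracy=0.90, accuracy_threshold=0.6))  # False
```

Checking both conditions together reflects the text: a flat loss alone is not enough if the discrimination layer still separates generated pictures from real ones easily.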

Specifically, the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask includes:

calculating the content loss function of the adversarial network based on the low-resolution picture, where, in the content loss function, G_θG(I^LR)_{x,y} is the value of the pixel of the super-resolution picture at position (x, y) and is compared with the value of the pixel of the high-resolution picture at the same position, rW and rH are the width and length of the super-resolution picture, respectively, and r²WH is the total number of pixels in the super-resolution picture.

In this embodiment, the content loss function is a mean squared error. The super-resolution picture and the high-resolution picture both have width rW and length rH. The squared differences between the pixels of the two pictures are summed over all positions and divided by the number of pixels to give the content loss, that is, the loss between the super-resolution picture and the high-resolution picture.
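The formula itself is not reproduced in this text. Assuming it is the usual mean squared error, (1 / (r²WH)) · Σ_x Σ_y (I^HR_{x,y} − G_θG(I^LR)_{x,y})², where I^HR is an assumed name for the high-resolution picture, a minimal sketch on single-channel pictures stored as nested lists could look like this:

```python
def content_loss(hr, sr):
    """Mean squared error between two equally sized 2-D pictures."""
    if len(hr) != len(sr) or any(len(a) != len(b) for a, b in zip(hr, sr)):
        raise ValueError("pictures must have the same size")
    n_pixels = len(hr) * len(hr[0])  # r^2 * W * H in the text
    squared = sum((h - s) ** 2
                  for hr_row, sr_row in zip(hr, sr)
                  for h, s in zip(hr_row, sr_row))
    return squared / n_pixels


hr = [[1.0, 0.0], [0.0, 1.0]]
sr = [[0.5, 0.0], [0.0, 0.5]]
print(content_loss(hr, sr))  # 0.125
```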

As another embodiment of the present application, the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask includes:

calculating the adversarial loss function of the adversarial network based on the low-resolution picture, where, in the adversarial loss function, G_θG(I^LR) is the super-resolution picture, D_θD is the discrimination layer, M is the total number of super-resolution pictures, and m is the index of a super-resolution picture.

In this embodiment, the adversarial loss requires the discriminative layer D to successfully distinguish between the super-resolution image generated by the generative layer G and the natural high-resolution image input therein. Through the adversarial training of the generation layer and the discriminative layer, the quality of the super-resolution images generated by the network is gradually improved. M is the total number of super-resolution pictures input into the discriminative layer.
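The adversarial loss formula is likewise not reproduced here. One common form consistent with the symbols above is the generator-side GAN loss, the sum over m = 1..M of −log D_θD(G_θG(I^LR)); the sketch below assumes that form, which may differ from the formula in the original filing. `d_outputs` is a hypothetical list of discrimination layer probabilities.

```python
import math


def adversarial_loss(d_outputs, eps=1e-12):
    """Sum of -log D(G(I_LR)) over the M generated pictures.

    d_outputs: probabilities in (0, 1] that the discrimination layer
    assigns to each super-resolution picture being real.
    """
    # eps guards against log(0) when the discriminator outputs exactly 0.
    return sum(-math.log(max(p, eps)) for p in d_outputs)


rejected = adversarial_loss([0.1, 0.2])  # discriminator spots the fakes
fooled = adversarial_loss([0.9, 0.8])    # discriminator is mostly fooled
print(fooled < rejected)  # True
```

Under this form the loss shrinks as the discrimination layer rates generated pictures as more likely to be real.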

In addition, as another embodiment of the present application, the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask includes:

calculating the regularization loss function of the adversarial network based on the low-resolution picture, where, in the regularization loss function, G_θG(I^LR)_{x,y} is the value of the pixel of the super-resolution picture at position (x, y), rW and rH are the width and length, respectively, r²WH is the total number of pixels in the target mask, ‖·‖ denotes the norm, and ∇ denotes the gradient.

In this embodiment, by adding a regularized loss function, network overfitting is prevented and the overall loss function converges faster.
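Given the symbols listed for this loss (a norm of the gradient, averaged over r²WH pixels), a total-variation style term (1 / (r²WH)) · Σ_x Σ_y ‖∇G_θG(I^LR)_{x,y}‖ is a plausible reading. The sketch below assumes that reading, and further assumes forward differences and a Euclidean norm, neither of which is stated in the source.

```python
import math


def regularization_loss(sr):
    """Mean gradient magnitude (total variation) of a 2-D picture."""
    height, width = len(sr), len(sr[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            # Forward differences; the gradient is taken as 0 past the border.
            dx = sr[y][x + 1] - sr[y][x] if x + 1 < width else 0.0
            dy = sr[y + 1][x] - sr[y][x] if y + 1 < height else 0.0
            total += math.hypot(dx, dy)
    return total / (height * width)


flat = [[0.5, 0.5], [0.5, 0.5]]
edge = [[0.0, 1.0], [0.0, 1.0]]
print(regularization_loss(flat), regularization_loss(edge))  # 0.0 0.5
```

A flat picture scores zero while a picture with abrupt changes scores higher, so minimizing the term favors smooth output.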

Calculate the text-aware loss function of the adversarial network based on the low-resolution picture, where l^TR is the text-aware loss function, N is the total number of pixels at positions where text exists, and the two remaining terms of the function are the target mask and the super-resolution picture.

In this embodiment, the pixel value difference is calculated between the positions in the target mask where text exists and the corresponding positions in the picture generated by the generation layer, where N is the total number of pixels where text exists. Summing all the differences and dividing by N gives the text-aware loss. With the text-aware loss, the generation layer produces clearer text when constructing new pictures. In this application, pixels in the mask where text exists are marked as 1 and pixels where it does not are marked as 0, and the target mask is obtained after upsampling. Through the text-aware loss function, the target mask supervises the output of the generation layer so that only the text is emphasized.
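A literal reading of this description can be sketched as follows. Several details are assumptions because the source leaves them open: the difference is taken as an absolute value, the super-resolution picture is single-channel with values scaled to [0, 1], and the function returns 0 when the mask contains no text. `text_aware_loss` is a hypothetical name.

```python
def text_aware_loss(target_mask, sr):
    """Average |mask - sr| over the N pixels where the mask marks text."""
    diffs = [abs(m - s)
             for mask_row, sr_row in zip(target_mask, sr)
             for m, s in zip(mask_row, sr_row)
             if m == 1]  # only positions where text exists contribute
    if not diffs:
        return 0.0  # no text pixels: nothing to supervise
    return sum(diffs) / len(diffs)


target_mask = [[1, 0], [1, 0]]
sr = [[0.75, 0.3], [0.25, 0.9]]
print(text_aware_loss(target_mask, sr))  # 0.5
```

Pixels outside the text region (the right-hand column here) do not affect the result, which is how the loss emphasizes only the text.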

S6: Receive the low-resolution picture to be converted, input the low-resolution picture to be converted into the trained adversarial network, and obtain the output target super-resolution picture.

In this embodiment, according to the trained adversarial network, a target super-resolution picture with higher quality can be generated to ensure that the text information in the picture is clear and complete.

It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned trained adversarial network, the above-mentioned trained adversarial network can also be stored in a block chain node.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.

Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The application can be applied in the field of smart medical treatment, and can be used to restore low-resolution pictures in the medical field, thereby promoting the construction of a smart city.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions that instruct the related hardware. The computer-readable instructions can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (Read-Only Memory, ROM), or it may be a random access memory (Random Access Memory, RAM).

It should be understood that although the steps in the flowcharts of the accompanying drawings are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict restriction on the order of execution, and the steps can be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time and may be executed at different times; nor are they necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Further referring to FIG. 3, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a text image super-resolution reconstruction device, which corresponds to the method embodiment shown in FIG. 2. The device can be applied to various electronic devices.

As shown in FIG. 3, the text image super-resolution reconstruction device 300 in this embodiment includes: a receiving module 301, an upsampling module 302, a generating module 303, a discrimination module 304, a calculation module 305 and an obtaining module 306. The receiving module 301 is configured to receive a low-resolution picture and a corresponding high-resolution picture, input the low-resolution picture into a pre-trained scene text recognition model, and obtain output text position information and text content information; the upsampling module 302 is configured to generate a text mask based on the text position information and the text content information, and upsample the text mask to obtain a target mask; the generating module 303 is configured to input the low-resolution picture and the target mask into the generation layer of the preset adversarial network to obtain an output super-resolution picture; the discrimination module 304 is configured to simultaneously input the super-resolution picture and the high-resolution picture into the discrimination layer of the adversarial network, obtain an output discrimination result, and calculate a discrimination accuracy based on the discrimination result; the calculation module 305 is configured to calculate the loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and to obtain the trained adversarial network when the discrimination accuracy is lower than the accuracy threshold; the obtaining module 306 is configured to receive a low-resolution picture to be converted, input the low-resolution picture to be converted into the trained adversarial network, and obtain an output target super-resolution picture.

In this embodiment, the application obtains text position information and text content information from the received low-resolution picture and generates a text mask based on both. Because the text mask takes both position and content into account, it makes the boundary between the text and the surrounding image clear, so that the text in the subsequently generated super-resolution picture is sharp and the quality of the reconstructed picture is significantly improved. Upsampling enlarges the text mask and increases its resolution, which facilitates the subsequent generation of super-resolution pictures. Through adversarial training of the generation layer and the discrimination layer in the adversarial network, a trained adversarial network is obtained that generates a target super-resolution picture of better quality.

The upsampling module 302 includes a correction submodule and a generation submodule, wherein the correction submodule is configured to modify the text position information based on the text content information to obtain target text position information, and the generation submodule is configured to generate the text mask based on the target text position information.
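As a rough illustration of generating a mask from position information, the Python sketch below builds a binary mask from rectangular text boxes. The box format `(top, left, bottom, right)`, the function name `make_text_mask`, and the use of rectangles are assumptions; how the text content is used to modify the positions is not detailed here, so the sketch starts from boxes that are taken to be already corrected.

```python
def make_text_mask(height, width, boxes):
    """Return a height x width binary mask: 1 inside any text box, 0 elsewhere.

    boxes: iterable of (top, left, bottom, right) with bottom/right exclusive
    (a hypothetical format for the target text position information).
    """
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        # Clip each box to the picture so out-of-range coordinates are safe.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                mask[y][x] = 1
    return mask


# One text region covering rows 1-2 and columns 1-3 of a 4 x 6 picture.
mask = make_text_mask(4, 6, [(1, 1, 3, 4)])
```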

In some optional implementations of this embodiment, the upsampling module 302 is further configured to upsample the text mask multiple times to obtain the target mask.
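Repeated upsampling can be sketched as follows. Nearest-neighbour interpolation and a factor of 2 per pass are assumptions chosen for simplicity; the application does not name the interpolation method or the number of passes.

```python
def upsample_nearest(mask, factor=2):
    """Enlarge a 2D mask by repeating every pixel `factor` times in each direction."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]   # repeat columns
        out.extend(list(wide) for _ in range(factor))    # repeat rows
    return out


def upsample_repeatedly(mask, times, factor=2):
    """Apply nearest-neighbour upsampling `times` times in succession."""
    for _ in range(times):
        mask = upsample_nearest(mask, factor)
    return mask


# Two passes of 2x upsampling turn a 2 x 2 mask into an 8 x 8 target mask.
target_mask = upsample_repeatedly([[0, 1], [1, 0]], times=2)
```

Nearest-neighbour repetition keeps the mask binary, which a smoothing interpolation would not.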

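In some optional implementations of this embodiment, the calculation module 305 is further configured to calculate a content loss function of the adversarial network based on the low-resolution picture. The content loss function is characterized by a formula defined over the following quantities: the value of the pixel of the high-resolution picture at position (x, y); $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and length of the super-resolution picture, respectively; and $r^2WH$, the total number of pixels in the super-resolution picture.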
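As a hedged illustration, a content loss defined over the pixel values of the high-resolution and super-resolution pictures and normalised by the total pixel count can be read as a pixel-wise mean squared error, $\frac{1}{r^2WH}\sum_{x=1}^{rW}\sum_{y=1}^{rH}\left(I^{HR}_{x,y}-G_{\theta_G}(I^{LR})_{x,y}\right)^2$. That exact form, the symbol $I^{HR}$, the function name `content_loss`, and the nested-list picture representation are assumptions of this sketch.

```python
def content_loss(hr, sr):
    """Pixel-wise mean squared error between two equally sized grayscale pictures.

    hr: high-resolution picture, sr: super-resolution picture, both given as
    lists of rows of floats (an assumed representation).
    """
    height, width = len(hr), len(hr[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            diff = hr[y][x] - sr[y][x]
            total += diff * diff
    # Divide by the total number of pixels (r^2 * W * H in the text's notation).
    return total / (height * width)


hr = [[1.0, 0.0], [0.0, 1.0]]
sr = [[0.5, 0.0], [0.0, 1.0]]
loss = content_loss(hr, sr)
```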

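In some optional implementations of this embodiment, the calculation module 305 is further configured to calculate an adversarial loss function of the adversarial network based on the low-resolution picture. The adversarial loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})$, the super-resolution picture; $D_{\theta_D}$, the discrimination layer; $M$, the total number of super-resolution pictures; and $m$, the index of a super-resolution picture.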
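One common form of a generator-side adversarial loss over $M$ generated pictures is $\sum_{m=1}^{M} -\log D_{\theta_D}(G_{\theta_G}(I^{LR}))$. The Python sketch below assumes that form; the function name, the clamping constant `eps`, and the interpretation of the discrimination layer output as a probability in (0, 1] are assumptions, not details stated in the application.

```python
import math


def adversarial_loss(disc_scores, eps=1e-12):
    """Sum of -log D(G(I_LR)) over the generated pictures.

    disc_scores: discrimination layer outputs for the M super-resolution
    pictures, each read as the probability that the picture is real.
    """
    # Clamp to eps so a score of exactly 0 does not raise a math domain error.
    return sum(-math.log(max(s, eps)) for s in disc_scores)


loss = adversarial_loss([0.5, 0.25])
```

The loss shrinks as the discrimination layer assigns higher "real" probabilities to generated pictures, which is the direction the generation layer is trained towards.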

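In some optional implementations of this embodiment, the calculation module 305 is further configured to calculate a regularization loss function of the adversarial network based on the low-resolution picture. The regularization loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and the length, respectively; $r^2WH$, the total number of pixels in the target mask; $\|\cdot\|$, the norm; and the gradient.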
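A regularization loss built from the norm of the gradient of the super-resolution picture, averaged over all pixels, has the shape of a total variation regularizer, $\frac{1}{r^2WH}\sum_{x=1}^{rW}\sum_{y=1}^{rH}\left\|\nabla G_{\theta_G}(I^{LR})_{x,y}\right\|$. The sketch below assumes that form and, as further assumptions, uses forward differences for the gradient, the Euclidean norm, and a zero gradient at the picture border.

```python
import math


def regularization_loss(sr):
    """Mean Euclidean norm of the forward-difference gradient of a picture."""
    height, width = len(sr), len(sr[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            # Forward differences; the gradient is taken as 0 past the border.
            dx = sr[y][x + 1] - sr[y][x] if x + 1 < width else 0.0
            dy = sr[y + 1][x] - sr[y][x] if y + 1 < height else 0.0
            total += math.hypot(dx, dy)
    return total / (height * width)


# A vertical edge: only the left column has a non-zero horizontal gradient.
loss = regularization_loss([[0.0, 1.0], [0.0, 1.0]])
```

A term of this kind penalises abrupt pixel-to-pixel changes and therefore favours spatially smooth outputs.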

In some optional implementations of this embodiment, the calculation module 305 is further configured to calculate a text-aware loss function of the adversarial network based on the low-resolution picture. The text-aware loss function is characterized by a formula defined over the following quantities: $l_{TR}$, the text-aware loss function; $N$, the total number of pixels at positions where text exists; the target mask; and the super-resolution picture.
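The exact form of the text-aware loss is not reproduced here, so the following Python sketch is only one plausible reading: a squared error restricted to the pixels that the target mask marks as text, averaged over the number $N$ of such pixels. The comparison against a reference picture, the function name, and the squared-error choice are all assumptions of this sketch rather than the application's definition.

```python
def text_aware_loss(target_mask, sr, reference):
    """Masked mean squared error over text pixels only (assumed form).

    target_mask: binary mask, 1 where text exists.
    sr: super-resolution picture; reference: picture it is compared with
    (hypothetically the high-resolution picture).
    """
    n_text = sum(sum(row) for row in target_mask)  # N: number of text pixels
    if n_text == 0:
        return 0.0  # no text in the picture, nothing to penalise
    total = 0.0
    for mask_row, sr_row, ref_row in zip(target_mask, sr, reference):
        for m, s, r in zip(mask_row, sr_row, ref_row):
            if m:
                total += (s - r) ** 2
    return total / n_text


mask = [[1, 0], [0, 1]]
sr = [[0.5, 9.0], [9.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0]]
loss = text_aware_loss(mask, sr, ref)
```

In this reading, errors outside the text regions (the 9.0 values above) do not contribute, so the term concentrates the training signal on text pixels.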

To solve the above technical problems, an embodiment of the present application further provides a computer device. Referring to FIG. 4, FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.

The computer device 200 includes a memory 201, a processor 202, and a network interface 203 that are communicatively connected to each other through a system bus. It should be noted that the figure shows only the computer device 200 with components 201-203, but it is not required to implement all of the illustrated components, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.

The computer device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device.

The memory 201 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 201 may be an internal storage unit of the computer device 200, such as a hard disk or internal memory of the computer device 200. In other embodiments, the memory 201 may be an external storage device of the computer device 200, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 200. The memory 201 may also include both the internal storage unit of the computer device 200 and its external storage device. In this embodiment, the memory 201 is generally used to store the operating system and the various application software installed on the computer device 200, such as the computer-readable instructions of the text image super-resolution reconstruction method. In addition, the memory 201 may be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 202 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 202 is generally used to control the overall operation of the computer device 200. In this embodiment, the processor 202 is configured to run the computer-readable instructions stored in the memory 201 or to process data, for example to run the computer-readable instructions of the text image super-resolution reconstruction method.

The network interface 203 may include a wireless network interface or a wired network interface, and the network interface 203 is generally used to establish a communication connection between the computer device 200 and other electronic devices.

In this embodiment, the generation of the text mask takes both the position information and the content information of the text into account, so that the boundary between the text in the picture and the surrounding image is clarified and the quality of the reconstructed picture is significantly improved. Through adversarial training of the generation layer and the discrimination layer in the adversarial network, a trained adversarial network is obtained that generates a target super-resolution picture of better quality.

The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the steps of the above text image super-resolution reconstruction method.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.

Obviously, the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the present application but do not limit its patent scope. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the present application is understood more thoroughly and comprehensively. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of their technical features. All equivalent structures made using the contents of the description and drawings of this application, whether used directly or indirectly in other related technical fields, likewise fall within the scope of protection of this application.

Claims

1. A text image super-resolution reconstruction method, comprising the steps of:

receiving a low-resolution picture and a corresponding high-resolution picture, inputting the low-resolution picture into a pre-trained scene text recognition model, and obtaining output text position information and text content information;

generating a text mask based on the text position information and the text content information, and upsampling the text mask to obtain a target mask;

inputting the low-resolution picture and the target mask into a generation layer of a preset adversarial network to obtain an output super-resolution picture;

inputting the super-resolution picture and the high-resolution picture simultaneously into a discrimination layer of the adversarial network to obtain an output discrimination result, and calculating a discrimination accuracy based on the discrimination result;

calculating a loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and obtaining a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold; and

receiving a low-resolution picture to be converted, and inputting the low-resolution picture to be converted into the trained adversarial network to obtain an output target super-resolution picture.
2. The text image super-resolution reconstruction method according to claim 1, wherein the step of generating a text mask based on the text position information and the text content information comprises:

modifying the text position information based on the text content information to obtain target text position information; and

generating the text mask based on the target text position information.
3. The text image super-resolution reconstruction method according to claim 1, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a content loss function of the adversarial network based on the low-resolution picture, wherein the content loss function is characterized by a formula defined over the following quantities: the value of the pixel of the high-resolution picture at position (x, y); $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and length of the super-resolution picture, respectively; and $r^2WH$, the total number of pixels in the super-resolution picture.
4. The text image super-resolution reconstruction method according to claim 1, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating an adversarial loss function of the adversarial network based on the low-resolution picture, wherein the adversarial loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})$, the super-resolution picture; $D_{\theta_D}$, the discrimination layer; $M$, the total number of super-resolution pictures; and $m$, the index of a super-resolution picture.
5. The text image super-resolution reconstruction method according to claim 1, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a regularization loss function of the adversarial network based on the low-resolution picture, wherein the regularization loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and the length, respectively; $r^2WH$, the total number of pixels in the target mask; $\|\cdot\|$, the norm; and the gradient.
6. The text image super-resolution reconstruction method according to claim 1, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a text-aware loss function of the adversarial network based on the low-resolution picture, wherein the text-aware loss function is characterized by a formula defined over the following quantities: $l_{TR}$, the text-aware loss function; $N$, the total number of pixels at positions where text exists; the target mask; and the super-resolution picture.
7. The text image super-resolution reconstruction method according to claim 1, wherein the step of upsampling the text mask to obtain a target mask comprises:

upsampling the text mask multiple times to obtain the target mask.
8. A text image super-resolution reconstruction device, comprising:

a receiving module, configured to receive a low-resolution picture and a corresponding high-resolution picture, input the low-resolution picture into a pre-trained scene text recognition model, and obtain output text position information and text content information;

an upsampling module, configured to generate a text mask based on the text position information and the text content information, and upsample the text mask to obtain a target mask;

a generation module, configured to input the low-resolution picture and the target mask into a generation layer of a preset adversarial network to obtain an output super-resolution picture;

a discrimination module, configured to input the super-resolution picture and the high-resolution picture simultaneously into a discrimination layer of the adversarial network to obtain an output discrimination result, and calculate a discrimination accuracy based on the discrimination result;

a calculation module, configured to calculate a loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and obtain a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold; and

an obtaining module, configured to receive a low-resolution picture to be converted, and input the low-resolution picture to be converted into the trained adversarial network to obtain an output target super-resolution picture.
9. A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and the processor, when executing the computer-readable instructions, implements the following steps of a text image super-resolution reconstruction method:

receiving a low-resolution picture and a corresponding high-resolution picture, inputting the low-resolution picture into a pre-trained scene text recognition model, and obtaining output text position information and text content information;

generating a text mask based on the text position information and the text content information, and upsampling the text mask to obtain a target mask;

inputting the low-resolution picture and the target mask into a generation layer of a preset adversarial network to obtain an output super-resolution picture;

inputting the super-resolution picture and the high-resolution picture simultaneously into a discrimination layer of the adversarial network to obtain an output discrimination result, and calculating a discrimination accuracy based on the discrimination result;

calculating a loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and obtaining a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold; and

receiving a low-resolution picture to be converted, and inputting the low-resolution picture to be converted into the trained adversarial network to obtain an output target super-resolution picture.
10. The computer device according to claim 9, wherein the step of generating a text mask based on the text position information and the text content information comprises:

modifying the text position information based on the text content information to obtain target text position information; and

generating the text mask based on the target text position information.
11. The computer device according to claim 9, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a content loss function of the adversarial network based on the low-resolution picture, wherein the content loss function is characterized by a formula defined over the following quantities: the value of the pixel of the high-resolution picture at position (x, y); $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and length of the super-resolution picture, respectively; and $r^2WH$, the total number of pixels in the super-resolution picture.
12. The computer device according to claim 9, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating an adversarial loss function of the adversarial network based on the low-resolution picture, wherein the adversarial loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})$, the super-resolution picture; $D_{\theta_D}$, the discrimination layer; $M$, the total number of super-resolution pictures; and $m$, the index of a super-resolution picture.
13. The computer device according to claim 9, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a regularization loss function of the adversarial network based on the low-resolution picture, wherein the regularization loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and the length, respectively; $r^2WH$, the total number of pixels in the target mask; $\|\cdot\|$, the norm; and the gradient.
14. The computer device according to claim 9, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a text-aware loss function of the adversarial network based on the low-resolution picture, wherein the text-aware loss function is characterized by a formula defined over the following quantities: $l_{TR}$, the text-aware loss function; $N$, the total number of pixels at positions where text exists; the target mask; and the super-resolution picture.
15. The computer device according to claim 9, wherein the step of upsampling the text mask to obtain a target mask comprises:

upsampling the text mask multiple times to obtain the target mask.
16. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and the computer-readable instructions, when executed by a processor, implement the following steps of a text image super-resolution reconstruction method:

receiving a low-resolution picture and a corresponding high-resolution picture, inputting the low-resolution picture into a pre-trained scene text recognition model, and obtaining output text position information and text content information;

generating a text mask based on the text position information and the text content information, and upsampling the text mask to obtain a target mask;

inputting the low-resolution picture and the target mask into a generation layer of a preset adversarial network to obtain an output super-resolution picture;

inputting the super-resolution picture and the high-resolution picture simultaneously into a discrimination layer of the adversarial network to obtain an output discrimination result, and calculating a discrimination accuracy based on the discrimination result;

calculating a loss function of the adversarial network based on the low-resolution picture and the target mask until the loss function converges, and obtaining a trained adversarial network when the discrimination accuracy is lower than an accuracy threshold; and

receiving a low-resolution picture to be converted, and inputting the low-resolution picture to be converted into the trained adversarial network to obtain an output target super-resolution picture.
17. The computer-readable storage medium according to claim 16, wherein the step of generating a text mask based on the text position information and the text content information comprises:

modifying the text position information based on the text content information to obtain target text position information; and

generating the text mask based on the target text position information.
18. The computer-readable storage medium according to claim 16, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a content loss function of the adversarial network based on the low-resolution picture, wherein the content loss function is characterized by a formula defined over the following quantities: the value of the pixel of the high-resolution picture at position (x, y); $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and length of the super-resolution picture, respectively; and $r^2WH$, the total number of pixels in the super-resolution picture.
19. The computer-readable storage medium according to claim 16, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating an adversarial loss function of the adversarial network based on the low-resolution picture, wherein the adversarial loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})$, the super-resolution picture; $D_{\theta_D}$, the discrimination layer; $M$, the total number of super-resolution pictures; and $m$, the index of a super-resolution picture.
20. The computer-readable storage medium according to claim 16, wherein the step of calculating the loss function of the adversarial network based on the low-resolution picture and the target mask comprises:

calculating a regularization loss function of the adversarial network based on the low-resolution picture, wherein the regularization loss function is characterized by a formula defined over the following quantities: $G_{\theta_G}(I^{LR})_{x,y}$, the value of the pixel of the super-resolution picture at position (x, y); $rW$ and $rH$, the width and the length, respectively; $r^2WH$, the total number of pixels in the target mask; $\|\cdot\|$, the norm; and the gradient.