CN114170608A - Super-resolution text image recognition method, device, equipment and storage medium - Google Patents

Super-resolution text image recognition method, device, equipment and storage medium

Info

Publication number
CN114170608A
Authority
CN
China
Prior art keywords
image
text
resolution
super
model
Prior art date
Legal status
Pending
Application number
CN202111455688.9A
Other languages
Chinese (zh)
Inventor
衡鹤瑞
杨周龙
李斯
Current Assignee
Dongpu Software Co Ltd
Original Assignee
Dongpu Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Dongpu Software Co Ltd
Priority to CN202111455688.9A
Publication of CN114170608A
Legal status: Pending

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06T3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T2207/20081 Special algorithmic details: training; learning
    • G06T2207/20084 Special algorithmic details: artificial neural networks [ANN]
    • G06T2207/20221 Special algorithmic details: image fusion; image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image recognition, and discloses a super-resolution text image recognition method, a super-resolution text image recognition device, super-resolution text image recognition equipment and a storage medium. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected; inputting the pixel data into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and inputting the text region of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text region according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low text detection accuracy and the imbalance between accuracy and detection speed.

Description

Super-resolution text image recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of image recognition, in particular to a super-resolution text image recognition method, a super-resolution text image recognition device, super-resolution text image recognition equipment and a storage medium.
Background
Existing key-information extraction technology for express delivery documents works as follows: the text content of the whole bill is obtained by text detection and recognition technology, and the key information is then extracted by template matching or regular expressions.
However, in daily operations, due to the influence of factors such as environment and equipment, the images acquired are often of low quality. With the increasing demand for intelligent processing, recognizing low-quality images remains a significant challenge: the image resolution is insufficient, edges and details are blurred, character positioning is difficult, and the recognition rate is low. The main drawbacks and deficiencies of this type of technology are as follows. The template-matching method can only extract key information from express documents of a fixed template type; once the format of a document picture does not conform to a template contained in the system, the correct key information cannot be extracted. The regular-expression method requires the format of the key information to be analyzed manually and a regular expression to be designed; once key information falls outside the regular expression, it cannot be extracted correctly. Moreover, the result recognized by OCR technology is only a string of editable characters, which is of limited value to users, so the practicality of image character recognition technology is reduced.
Disclosure of Invention
The invention mainly aims to improve the deep learning capability of the model through a GAN network and to solve the technical problems of low text detection accuracy and the imbalance between accuracy and detection speed.
A first aspect of the present invention provides a super-resolution text image recognition method, which comprises the following steps: acquiring an image to be detected;
inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected; inputting the pixel data into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
Optionally, in a first implementation manner of the first aspect of the present invention, before the inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected, the method further includes: acquiring a low-resolution image from a source database; receiving a data enhancement request, and performing data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image; and building an initial super-resolution reconstruction model, and training the super-resolution reconstruction model through the enhanced image to obtain the trained super-resolution reconstruction model.
Optionally, in a second implementation manner of the first aspect of the present invention, the receiving a data enhancement request, and performing data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image includes: receiving a data enhancement request, determining a request scene according to the data enhancement request, and acquiring the number of images of each label in the request scene; extracting a request label from the label according to the number of the images, and acquiring a request image corresponding to the request label; carrying out image fusion on any two images in the request image to obtain a fused image; generating a fusion label of the fusion image according to the request labels of any two images; and splicing the request image and the fusion image according to the fusion label to obtain an enhanced image.
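The image-fusion step above can be sketched as a mixup-style pixel-wise blend with soft label weights. This is only an illustrative reading of the claim; the blend function, the `alpha` weight, and the label format are assumptions, not specified by the patent:

```python
import numpy as np

def fuse_images(img_a, img_b, label_a, label_b, alpha=0.5):
    # Pixel-wise weighted blend of two request images (mixup-style);
    # alpha is the blend weight assigned to the first image.
    blended = alpha * img_a.astype(np.float32) + (1.0 - alpha) * img_b.astype(np.float32)
    # Fusion label: each request label keeps its blend weight.
    fused_label = {label_a: alpha, label_b: 1.0 - alpha}
    return blended.astype(np.uint8), fused_label

# Two tiny 4x4 grayscale "request images" with hypothetical labels.
a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 200, dtype=np.uint8)
fused, fused_label = fuse_images(a, b, "waybill", "invoice")
```

The fused images would then be spliced with the original request images, per the fusion labels, to form the enhanced training set.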
Optionally, in a third implementation manner of the first aspect of the present invention, the building an initial super-resolution reconstruction model, and training the super-resolution reconstruction model through the enhanced image to obtain a trained super-resolution reconstruction model includes: building a super-resolution reconstruction model, and inputting the enhanced image into the super-resolution reconstruction model to obtain an alternative high-resolution image; carrying out image format conversion on the alternative high-resolution image and the standard high-resolution image to obtain a first image and a second image; constructing a loss function from a difference between the first image and the second image; and carrying out iterative training on the initial super-resolution reconstruction model based on the loss function to obtain a trained super-resolution reconstruction model.
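A minimal sketch of constructing the loss from the difference between the first image (the converted candidate high-resolution image) and the second image (the converted standard high-resolution image). The patent does not name a specific loss; mean squared error is assumed here:

```python
import numpy as np

def reconstruction_loss(first_image, second_image):
    # Loss built from the pixel-wise difference between the candidate
    # high-resolution image and the standard high-resolution image;
    # a mean-squared-error form is assumed.
    diff = first_image.astype(np.float64) - second_image.astype(np.float64)
    return float(np.mean(diff ** 2))

candidate = np.zeros((2, 2))
reference = np.full((2, 2), 3.0)
loss_same = reconstruction_loss(candidate, candidate)  # identical images
loss_diff = reconstruction_loss(candidate, reference)  # constant offset of 3
```

Iterative training would then minimise this value over the enhanced image set.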
Optionally, in a fourth implementation manner of the first aspect of the present invention, before the inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image, the method further includes: determining a target training image, and inputting the target training image into a first initial model, wherein the first initial model comprises a feature extraction network, a feature fusion network and an output network; inputting the target training image into a feature extraction network of the first initial model for feature extraction to obtain an initial feature map of the target training image; inputting the initial feature map of the target training image into a feature fusion network of the first initial model for feature fusion to obtain a fusion feature map; inputting the fusion feature map into the output network to obtain candidate regions of a text region in the target training image and a probability value of each candidate region; determining the candidate regions and the loss value of the probability value of each candidate region based on a preset detection loss function; and training the first initial model according to the loss value until parameters in the first initial model are converged to obtain a text detection network model.
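One common way to realise the feature fusion network described above is to upsample coarser feature maps to the finest spatial size and concatenate along the channel axis. The shapes, the nearest-neighbour upsampling, and the two-stage input are illustrative assumptions, not the patent's definition:

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    # Bring every map to the spatial size of the finest (first) map via
    # nearest-neighbour upsampling, then concatenate along channels.
    target_h, target_w = feature_maps[0].shape[1:]
    upsampled = []
    for fm in feature_maps:
        _, h, w = fm.shape
        upsampled.append(fm.repeat(target_h // h, axis=1).repeat(target_w // w, axis=2))
    return np.concatenate(upsampled, axis=0)

fine = np.zeros((8, 4, 4))     # (channels, H, W) from an early stage
coarse = np.ones((16, 2, 2))   # deeper, spatially coarser stage
fused_map = fuse_feature_maps([fine, coarse])
```

The fused map would then feed the output network that produces candidate regions and probability values.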
Optionally, in a fifth implementation manner of the first aspect of the present invention, the inputting the text region of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text region according to the recognition result includes: inputting the text region of the target high-resolution image into a preset text recognition model, and performing feature extraction on the text region through the text recognition model to obtain a feature map corresponding to the text region; performing language classification processing on the feature map through a classification channel in the text recognition model to obtain a language deviation classification result corresponding to the text image; and performing text recognition on the feature map according to the language deviation classification result to obtain a corresponding text recognition result, and determining the text content in the text region according to the recognition result.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the inputting the target high-resolution image into a preset text detection network model for detection, and obtaining a text region of the target high-resolution image includes: inputting the target high-resolution image into the text detection network model to obtain a plurality of candidate text regions of the text regions in the target high-resolution image and a probability value of each candidate text region; determining a text region in the target high-resolution image from the plurality of text candidate regions according to the probability value of the text candidate region and the degree of overlap between the plurality of text candidate regions.
The second aspect of the present invention provides a super-resolution text image recognition apparatus, including: the first acquisition module is used for acquiring an image to be detected; the input module is used for inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected; the pixel extraction module is used for inputting the pixel data into the sub-pixel convolution layer of the super-resolution reconstruction model to carry out pixel extraction so as to obtain a target high-resolution image; the detection module is used for inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and the recognition module is used for inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
Optionally, in a first implementation manner of the second aspect of the present invention, the super-resolution text image recognition apparatus further includes:
the second acquisition module is used for acquiring the low-resolution image from the source database;
the data enhancement module is used for receiving a data enhancement request and carrying out data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image;
and the building module is used for building an initial super-resolution reconstruction model, and training the super-resolution reconstruction model through the enhanced image to obtain the trained super-resolution reconstruction model.
Optionally, in a second implementation manner of the second aspect of the present invention, the data enhancement module is specifically configured to: receiving a data enhancement request, determining a request scene according to the data enhancement request, and acquiring the number of images of each label in the request scene; extracting a request label from the label according to the number of the images, and acquiring a request image corresponding to the request label; carrying out image fusion on any two images in the request image to obtain a fused image; generating a fusion label of the fusion image according to the request labels of any two images; and splicing the request image and the fusion image according to the fusion label to obtain an enhanced image.
Optionally, in a third implementation manner of the second aspect of the present invention, the building module is specifically configured to: building a super-resolution reconstruction model, and inputting the enhanced image into the super-resolution reconstruction model to obtain an alternative high-resolution image; carrying out image format conversion on the alternative high-resolution image and the standard high-resolution image to obtain a first image and a second image; constructing a loss function from a difference between the first image and the second image; and carrying out iterative training on the initial super-resolution reconstruction model based on the loss function to obtain a trained super-resolution reconstruction model.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the super-resolution text image recognition apparatus further includes: a first determining module, configured to determine a target training image and input the target training image into a first initial model, wherein the first initial model comprises a feature extraction network, a feature fusion network and an output network; a feature extraction module, configured to input the target training image into the feature extraction network of the first initial model for feature extraction to obtain an initial feature map of the target training image; a fusion module, configured to input the initial feature map of the target training image into the feature fusion network of the first initial model for feature fusion to obtain a fusion feature map, and to input the fusion feature map into the output network to obtain candidate regions of a text region in the target training image and a probability value of each candidate region; a second determining module, configured to determine a loss value for the candidate regions and the probability value of each candidate region based on a preset detection loss function; and a training module, configured to train the first initial model according to the loss value until parameters in the first initial model converge, to obtain a text detection network model.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the recognition module is specifically configured to: input the text region of the target high-resolution image into a preset text recognition model, and perform feature extraction on the text region through the text recognition model to obtain a feature map corresponding to the text region; perform language classification processing on the feature map through a classification channel in the text recognition model to obtain a language deviation classification result corresponding to the text image; and perform text recognition on the feature map according to the language deviation classification result to obtain a corresponding text recognition result, and determine the text content in the text region according to the recognition result.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the detection module includes: the detection unit is used for inputting the target high-resolution image into the text detection network model to obtain a plurality of candidate text regions of the text regions in the target high-resolution image and a probability value of each candidate text region; a determining unit, configured to determine a text region in the target high resolution image from the plurality of text candidate regions according to a probability value of the text candidate region and a degree of overlap between the plurality of text candidate regions.
A third aspect of the present invention provides a super-resolution text image recognition apparatus comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the super-resolution text image recognition device to perform the steps of the super-resolution text image recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the above-mentioned super-resolution text image recognition method.
According to the technical scheme provided by the invention, an image to be detected is obtained; inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected; inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and inputting the text region of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text region according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low text detection accuracy and the imbalance between accuracy and detection speed.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a super-resolution text image recognition method provided by the present invention;
FIG. 2 is a diagram of a super-resolution text image recognition method according to a second embodiment of the present invention;
FIG. 3 is a diagram of a super-resolution text image recognition method according to a third embodiment of the present invention;
FIG. 4 is a diagram of a fourth embodiment of a super-resolution text image recognition method provided by the present invention;
FIG. 5 is a diagram of a fifth embodiment of a super-resolution text image recognition method according to the present invention;
FIG. 6 is a schematic diagram of a super-resolution text image recognition apparatus according to a first embodiment of the present invention;
FIG. 7 is a diagram of a super-resolution text image recognition apparatus according to a second embodiment of the present invention;
fig. 8 is a schematic diagram of an embodiment of a super-resolution text image recognition device provided by the present invention.
Detailed Description
According to the super-resolution text image recognition method, device, equipment and storage medium, an image to be detected is obtained; inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected; inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and inputting the text region of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text region according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low text detection accuracy and the imbalance between accuracy and detection speed.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a detailed flow of an embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of a super-resolution text image recognition method according to an embodiment of the present invention includes:
101. acquiring an image to be detected;
in this embodiment, the image to be detected is obtained from the source database through a preset interface, for example, the image to be processed is obtained through a fromPixels interface of tensflo. TensorFlow is a second generation artificial intelligence learning system developed based on DistBeief, and is a system for transmitting a complex data structure to an artificial intelligence neural network for analysis and processing. TensorFlow can be used in multiple machine deep learning fields such as speech recognition or image recognition.
102. Inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
in this embodiment, the neural network model may be trained and deployed in a browser based on a machine learning framework, tensrflow. In the step, after the browser obtains the low-resolution image to be processed, the image to be processed is acquired to the super-resolution reconstruction model through a fromPixels interface of TensorFlow.
In this embodiment, the preset super-resolution reconstruction model is deployed on the server. The model is a general image-processing CNN whose details are set by the implementer with reference to the prior art and are not described here again. The super-resolution reconstruction model is a preset CNN model stored in a file format recognizable by the server side; through conversion processing, it is obtained in a file format that the browser can recognize.
In this step, the image to be processed is processed by the super-resolution reconstruction model: feature extraction and identification are performed on the image through one or more convolutional layers of the model to obtain first information corresponding to the image to be processed, where the first information is the pixel data obtained by enhancing the image quality of the low-resolution image to be processed.
103. Inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
in this embodiment, the sub-pixel convolution layer is an image upsampling method, and if r is a magnification factor, the layer will assemble the r x r dimension low resolution image into a high resolution image.
In this step, the pixel data obtained in step 102 is enlarged by the sub-pixel convolution layer of the super-resolution model, and the target high-resolution image is obtained through rendering.
104. Inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
in the embodiment, an image to be detected is obtained; the image to be detected can be a picture, or a video frame captured from a video file or a live video. Inputting an image to be detected into a text detection model which is trained in advance, and outputting a plurality of candidate regions of the text region in the image to be detected and the probability value of each candidate region; the text detection model is obtained by training through the training method of the text detection model; and determining a text region in the image to be detected from the multiple candidate regions according to the probability value of the candidate regions and the overlapping degree of the multiple candidate regions.
In the candidate regions output by the text detection model, a plurality of candidate regions may correspond to the same text line; in order to find the region that best matches the text line from the plurality of candidate regions, the plurality of candidate regions need to be filtered. In most cases, a plurality of candidate regions with higher mutual overlapping degree usually correspond to the same text line, and then the text region corresponding to the text line can be determined according to the probability values of the candidate regions with higher mutual overlapping degree; for example, a candidate region having the highest probability value among a plurality of candidate regions that overlap each other to a high degree is determined as the text region. If there are multiple lines of text in the image, multiple text regions are typically finalized.
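The filtering described above is standard non-maximum suppression: keep the highest-probability candidate among heavily overlapping ones. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes and an IoU threshold (both illustrative choices, not fixed by the patent):

```python
def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def select_text_regions(boxes, scores, iou_threshold=0.5):
    # Visit candidates in descending probability; keep a candidate only
    # if it does not overlap an already-kept one beyond the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two candidates covering the same text line plus one distinct line.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = select_text_regions(boxes, scores)
```

With multiple lines of text in the image, each surviving index corresponds to one finalised text region.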
105. Inputting the text region of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text region according to the recognition result.
In this embodiment, the text recognition model is a neural network model for classifying and recognizing images containing text. Specifically, the computer equipment obtains a text recognition model trained by the server, and performs feature extraction on the text image through the text recognition model to obtain a feature map corresponding to the text image.
In an embodiment, the computer device may select a neural network such as VGG (Visual Geometry Group network) or ResNet as the network for feature extraction, which is not limited in the embodiments of the present application. For example, the computer device uses two residual modules of a ResNet neural network as convolution layers to perform low-level feature extraction, thereby extracting a feature map from the text image.
In one embodiment, after the computer device extracts the feature map from the text image, the feature map may be subjected to language classification processing and text recognition processing through a corresponding channel in the text recognition model. For example, the computer device performs language classification processing on the feature map through a classification channel in the text recognition model; and the computer equipment performs text recognition processing on the feature map through a text recognition channel in the text recognition model.
Further, the text recognition result is the text content recognized by the trained text recognition model. The text recognition channel is a channel for recognizing text in an image, and is divided into at least a first text recognition channel and a second text recognition channel. For example, the first text recognition channel is mainly used for recognizing images in which characters of a first language predominate, and the second text recognition channel for images in which characters of a second language predominate; when texts in more languages are present, corresponding text recognition channels can be added. Specifically, when the language classification result indicates a bias toward the first language category, the computer device inputs the feature map corresponding to the first language category into the first text recognition channel for text recognition, and obtains the corresponding text recognition result.
In one embodiment, the first text recognition channel adopts an LSTM (Long Short-Term Memory) network combined with CTC (Connectionist Temporal Classification); through the combination of the two, character recognition can be achieved even when the character region of the text image is not of fixed length.
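For illustration, the decoding side of CTC can be sketched as follows; the greedy best-path rule shown here (merge consecutive repeats, then drop blanks), along with the toy alphabet, is a common simplification and not necessarily the exact decoder used in this scheme:

```python
def ctc_greedy_decode(timestep_ids, blank=0):
    # CTC best-path decoding: merge consecutive duplicate class indices,
    # then remove the blank symbol. This is what lets the LSTM emit one
    # prediction per timestep without fixed character positions.
    out = []
    prev = None
    for t in timestep_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Map class indices to characters (hypothetical alphabet; index 0 is the blank).
ALPHABET = {1: 'c', 2: 'a', 3: 't'}
ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0]   # raw per-timestep argmax output
text = ''.join(ALPHABET[i] for i in ctc_greedy_decode(ids))
```

Note that a blank between two identical indices keeps them as two separate characters, which is how CTC distinguishes repeated letters from a held prediction.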
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image; the pixel data is input into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low accuracy of text detection networks and the imbalance between detection accuracy and speed.
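The sub-pixel convolution layer referred to in the steps above can be illustrated by the pixel rearrangement it performs: a feature map with r² times the channels is reshuffled into an image r times larger in each spatial dimension. A plain-Python sketch under that assumption (an ESPCN-style channel layout is assumed; the actual layer also contains learned convolutions that this sketch omits):

```python
def pixel_shuffle(feat, r):
    # feat: list of C*r*r channels, each an H x W grid (list of rows).
    # Channel c*r*r + i*r + j supplies output pixel (h*r + i, w*r + j)
    # of output channel c, i.e. the sub-pixel rearrangement step.
    crr, h, w = len(feat), len(feat[0]), len(feat[0][0])
    c = crr // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(crr):
        base, i, j = ch // (r * r), (ch // r) % r, ch % r
        for y in range(h):
            for x in range(w):
                out[base][y * r + i][x * r + j] = feat[ch][y][x]
    return out

# 4 channels of a 2x2 feature map rearrange into one 4x4 image (r = 2).
feat = [[[k * 4 + y * 2 + x for x in range(2)] for y in range(2)]
        for k in range(4)]
hr = pixel_shuffle(feat, 2)
```

Because the layer only rearranges values computed by the preceding convolutions, upscaling happens at the very end of the network, which keeps all convolutions in low-resolution space and cheap.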
Referring to fig. 2, a second embodiment of the super-resolution text image recognition method according to the embodiment of the present invention includes:
201. acquiring an image to be detected;
202. acquiring a low-resolution image from a source database;
In this embodiment, a high-quality text image has no interference from external conditions, the recognition process is straightforward, and no additional operations are needed. For a low-quality text image, however, blurred handwriting, a complex background and similar conditions seriously affect text positioning. Poor text positioning in turn reduces the later character recognition rate, so improving text positioning accuracy is of great significance for character recognition.
Text positioning is the process of locating the text information needed in an image. If the image is of high quality, without blurred writing or a complex background, the text region can be obtained with a connected-domain-based positioning method. For low-quality images, the present technical scheme uses a text positioning method based on deep learning.
The EAST algorithm can be used as an end-to-end text detection algorithm based on deep learning; it uses a fully convolutional network with non-maximum suppression, eliminating intermediate steps such as text word segmentation, candidate region aggregation and post-processing, and thereby removing redundancy. The EAST network is mainly divided into three parts: a feature extraction branch, a feature fusion branch and an output layer. In the feature extraction part, image features are extracted using a convolutional network pre-trained on ImageNet. For the input low-quality image, image features at 1/32, 1/16, 1/8 and 1/4 of the input picture size are extracted for feature merging.
Alternatively, the CTPN algorithm combines CNN and LSTM deep networks and can effectively detect horizontally distributed text in complex scenes. The CTPN algorithm is based on two observations: assuming the text is horizontal, a text line can be viewed as a sequence of small "letter"-like segments; and in text detection scenes, general object detection methods do not achieve good results, especially for text whose length varies widely. CTPN therefore divides the text into small segments, detects these segments, and finally merges segments belonging to the same horizontal line into text lines using rules.
203. Receiving a data enhancement request, and performing data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image;
In this embodiment, the public data sets T91 and BSD200 are used, giving a total of 291 images as the training set, covering various categories of text images. The standard super-resolution evaluation data sets Set5, Set14 and BSD100 are used as test sets.
A data enhancement operation is performed on the data set before the training set is fed into the network. Data augmentation generally expands the data set by rotation, adding noise, reflection, flipping, translation, scaling, scale transformation, warping and so on, so that the model is more robust to small changes. In the present technical scheme, the 291 training images are expanded in three ways: 1) scaling: the training images are scaled by ratios of 0.5 and 0.7; 2) rotation: the images are rotated by 90, 180 and 270 degrees; 3) the images are flipped horizontally and vertically, each with a probability of 0.5. All images are cropped to 96 × 96 size, yielding a total of 191,552 LR-HR training image patch pairs.
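The rotation and flipping operations above can be sketched as follows; this is a minimal plain-Python illustration on a list-of-lists image, with the 0.5/0.7 scaling and the 96 × 96 crop omitted for brevity:

```python
import random

def rotate90(img):
    # Rotate an H x W image (list of rows) 90 degrees clockwise.
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    # Mirror each row left-to-right.
    return [row[::-1] for row in img]

def vflip(img):
    # Reverse the order of rows, i.e. mirror top-to-bottom.
    return img[::-1]

def augment(img, rng=random):
    # One augmented copy: rotate by 90/180/270 degrees, then flip
    # horizontally and vertically, each with probability 0.5 (as in the text).
    for _ in range(rng.choice([1, 2, 3])):
        img = rotate90(img)
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    return img
```

In practice each augmented copy is added to the training set rather than replacing the original, which is how 291 images expand into the 191,552 patch pairs mentioned above.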
The batch size is set to 64 and the network is trained for 100 epochs; the deconvolution filters are initialized with a Gaussian distribution with mean 0 and standard deviation 0.001. The network is optimized with the Adam training algorithm. The learning rate is set to 0.0001 and decays by a factor of 0.5 every 20 epochs; the weight decay is set to 0.0001. The effect of 2× and 4× image magnification is tested on the test sets Set5 and Set14, respectively.
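The learning-rate schedule described above amounts to a step decay; a sketch (Adam itself, the Gaussian initialization and the weight decay are applied by the training framework and are not re-implemented here):

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, step=20):
    # Step decay: the rate starts at 0.0001 and is halved every
    # 20 epochs, matching the schedule given in the text.
    return base_lr * decay ** (epoch // step)
```

Over the 100 training epochs this yields five plateaus, from 1e-4 down to 6.25e-6.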
As can be seen from the three groups of images, the reconstruction effect of the LapSRN algorithm is relatively good and the reconstructed details are relatively complete, but the reconstructed edges are still blurred. The super-resolution image reconstructed by the present technical scheme is visibly clearer: the colors within the same part of the text are more distinct, and the image texture and edge information are finer and smoother. Compared with the traditional bicubic interpolation method the algorithm is markedly improved, and it also improves on other mainstream deep learning algorithms.
204. Building an initial super-resolution reconstruction model, and training the super-resolution reconstruction model through the enhanced image to obtain a trained super-resolution reconstruction model;
In this embodiment, a super-resolution reconstruction method based on deep learning is used instead of relying on sensor and optical manufacturing technology: super-resolution reconstruction is first performed on the low-quality image, and the image is then subjected to subsequent processing. The model is a single-image super-resolution reconstruction model. Early super-resolution reconstruction first used interpolation, which estimates missing pixel values from neighborhood pixels; the commonly used interpolation methods mainly include the following three:
(1) Nearest neighbor interpolation method
The nearest neighbor interpolation method interpolates using the value of the pixel closest to the pixel being computed within its neighborhood. This greatly reduces the amount of computation, but it ignores the correlation between the surrounding pixels and the interpolated point, causing jagged artifacts; its fitting ability is low and the reconstruction effect is poor.
(2) Bilinear interpolation
The bilinear interpolation method mainly improves on the fitting ability of nearest neighbor interpolation: it considers the correlation between the four adjacent sampling points and the interpolated point and interpolates in two directions. Although bilinear interpolation is slower to reconstruct than nearest neighbor interpolation, it eliminates the jagged artifacts of that method; however, it tends to produce images with unclear details and edges.
(3) Bicubic interpolation method
The bicubic interpolation method extends the 4 neighboring points of bilinear interpolation to 16 pixels and interpolates them with cubic polynomials. Because the partial derivatives of the bicubic interpolation function are continuous, the edges of the reconstructed image are smooth, but fine details in the image are lost.
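The three interpolation schemes can be contrasted with two small sketches: integer-factor nearest-neighbor upscaling, and a bilinear sample that weights the four surrounding pixels (bicubic, which fits cubic polynomials through 16 neighbors, is omitted for brevity):

```python
def nearest_upscale(img, scale):
    # Nearest-neighbor interpolation: each output pixel copies the closest
    # input pixel. Fast, but produces the blocky, jagged edges noted above.
    h, w = len(img), len(img[0])
    return [[img[y // scale][x // scale]
             for x in range(w * scale)]
            for y in range(h * scale)]

def bilinear_sample(img, y, x):
    # Bilinear interpolation: weight the 4 surrounding pixels by their
    # distance to the sample point, interpolating in two directions.
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```

The nearest variant simply repeats pixels, while the bilinear sample produces intermediate values, which is why its edges are smoother but softer.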
In recent years, image super-resolution algorithms based on convolutional neural networks have developed rapidly. These algorithms mainly improve the network structure through deeper models, residual learning, skip connections, attention mechanisms, generative adversarial networks and so on. The present technical scheme adopts the LapSRN algorithm, as follows:
LapSRN (Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution) is a deep Laplacian pyramid network model with a feature extraction branch and an image reconstruction branch, which reconstructs the super-resolution image in a step-by-step progressive manner. LapSRN can take the original low-resolution image directly as network input. However, the LapSRN model does not extract image features sufficiently and pays no attention to information features at different scales. Addressing these problems, the present technical scheme proposes an optimization, and the images reconstructed by the optimized model are improved in both subjective vision and objective evaluation.
205. Inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
206. inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
207. inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
208. and inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
The steps 201, 205-208 in this embodiment are similar to the steps 101, 102-105 in the first embodiment, and are not described herein again.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image; the pixel data is input into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low accuracy of text detection networks and the imbalance between detection accuracy and speed.
Referring to fig. 3, a third embodiment of the super-resolution text image recognition method according to the embodiment of the present invention includes:
301. acquiring an image to be detected;
302. acquiring a low-resolution image from a source database;
303. receiving a data enhancement request, determining a request scene according to the data enhancement request, and acquiring the number of images of each label in the request scene;
in this embodiment, the data enhancement request may be triggered and generated by any image processing user. Wherein, the data enhancement request carries the relevant information of the request scene. Further, the request scene refers to an application scene where an image which needs to be subjected to data enhancement in the data enhancement request is located. For example, the request scenario may be to identify an animal species.
In this embodiment, an annotation label is a specific label corresponding to the request scene; for example, if the request scene is animal species identification, the annotation labels may include: cat, dog, rabbit, and so on. The number of images is the number of images corresponding to each annotation label.
304. Extracting a request label from the label according to the number of the images, and acquiring a request image corresponding to the request label;
in this embodiment, the request tag is a tag that needs to be data enhanced, that is, the labeling proportion of the request tag is low. The request image refers to an image corresponding to the request tag.
In another embodiment, the electronic device extracting the request label from the annotation labels according to the number of images comprises: counting the number of annotation labels, and taking the reciprocal of that number as the initial proportion of each annotation label; adjusting the initial proportion according to a preset adjustment proportion to obtain the target proportion of the labels; summing the numbers of images to obtain the total number of images; determining the labeling proportion of each annotation label as the ratio of its number of images to that total; and determining each annotation label whose labeling proportion is smaller than the target proportion as a request label.
The preset adjustment proportion refers to a proportion set according to an actual tolerance requirement, wherein the actual tolerance requirement refers to a tolerance requirement for data balance in the request scene. It is understood that the preset adjustment ratio is smaller than the initial ratio. The target proportion refers to a difference value between the initial proportion and the preset adjusting proportion.
Adjusting the initial proportion by the preset adjustment proportion increases the tolerance of the target proportion; labels whose labeling proportion is smaller than the target proportion are then selected as request labels, while labels whose proportion already exceeds that threshold are not selected as request labels.
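The proportion arithmetic above can be sketched as follows; the adjustment value of 0.1 and the example label counts are illustrative assumptions, not values fixed by this disclosure:

```python
def select_request_labels(image_counts, adjustment=0.1):
    # image_counts: {annotation label: number of annotated images}.
    # Initial proportion is 1/num_labels; the target proportion lowers it
    # by a preset adjustment (which must be smaller than 1/num_labels),
    # so only genuinely under-represented labels become request labels.
    n = len(image_counts)
    target = 1 / n - adjustment
    total = sum(image_counts.values())
    return [label for label, count in image_counts.items()
            if count / total < target]
```

With three labels the target proportion is 1/3 - 0.1 ≈ 0.233, so a label holding only 5% of the images is selected for enhancement while the well-represented labels are not.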
305. Carrying out image fusion on any two images in the request image to obtain a fused image;
In this embodiment, the fused image is a synthesized image generated from an image pair. Specifically, the electronic device fusing an image pair formed by any two of the request images to obtain a fused image includes: acquiring a first size of the first image in the pair and a second size of the second image in the pair; if the first size is not equal to the second size, performing size transformation on the second image according to the first size and the second size to obtain a third image; cropping the first image based on a preset cropping ratio to obtain a cropped image; determining the image position of the cropped image in the first image; determining the region corresponding to that image position in the third image as the image region; and replacing the image region with the cropped image to obtain the fused image. The third image is equal in size to the first size; it will be appreciated that the image content of the third image is the same as that of the second image, the third image being generated by compressing or stretching the second image.
The preset cropping ratio is a ratio set according to actual requirements and obtained from actual experiments; it is usually set to 30%. Setting the preset cropping ratio avoids an uneven fusion of the first image and the third image, which would otherwise prevent the fusion label from being generated accurately.
In another embodiment, the electronic device fusing an image pair formed by any two of the request images to obtain a fused image further comprises: randomly acquiring a fusion ratio from a preset interval; acquiring the pixel value of each pixel in the first image to obtain first pixel values, and the pixel value of each pixel in the third image to obtain second pixel values; performing a weighted sum of the first and second pixel values according to the fusion ratio to obtain target pixel values; and assembling the target pixel values to obtain the fused image. The preset interval is generally set to (0, 1); that is, the fusion ratio may be any real number in (0, 1).
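The two fusion variants described above (region replacement of a cropped patch, and a pixel-wise weighted sum with a randomly drawn ratio) can be sketched as follows; the image sizes, function names, and the slight clipping of the sampling interval are illustrative:

```python
import random

def weighted_fuse(img_a, img_b, lam=None, rng=random):
    # Pixel-wise weighted sum of two same-size images; the fusion ratio
    # lam is drawn from the open interval (0, 1), as in the text
    # (clipped slightly here to keep it strictly inside the interval).
    if lam is None:
        lam = rng.uniform(0.01, 0.99)
    return [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(img_a, img_b)]

def paste_fuse(img, patch, top, left):
    # Region-replacement fusion: the patch cropped from the first image
    # replaces the region at the same position in the (resized) third image.
    out = [row[:] for row in img]
    for dy, prow in enumerate(patch):
        for dx, p in enumerate(prow):
            out[top + dy][left + dx] = p
    return out
```

Both variants keep the output the same size as the inputs, so the fused image can carry the combined request labels of the pair.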
306. Generating a fusion label of the fusion image according to the request labels of any two images;
in this embodiment, the fusion tag refers to all request tags corresponding to the image pair.
In another embodiment, the electronic device generating the fusion tag of the fusion image according to the request tag of any image pair includes: and determining all the corresponding request tags in any image pair as the fusion tags. Wherein, the all request tags refer to all tags corresponding to any image pair. By directly determining all the request labels as the fusion labels without adopting a mode of weighting the request labels, the adaptability of data enhancement on multi-label classification can be improved, and the training accuracy of the model can be improved.
307. Splicing the request image and the fusion image according to the fusion label to obtain an enhanced image;
in this embodiment, the enhanced image includes the fused image, and an image generated by stitching the request image and the fused image. The preset ratio may be set as the target ratio. It is emphasized that the enhanced image may also be stored in a node of a blockchain in order to further ensure privacy and security of the enhanced image.
In another embodiment, the electronic device splicing the request image and the fused image according to the label proportion, until the label proportion is greater than or equal to a preset proportion, to obtain the enhanced image includes: determining each label whose label proportion is smaller than the preset proportion as a target label; selecting images corresponding to the target label from the request image and the fused image as target images; and randomly splicing a preset number of the target images until the label proportion is greater than or equal to the preset proportion, obtaining the enhanced image. The preset number is the number of target images required to generate the enhanced image; to avoid image gaps when generating the enhanced image, it is typically set to an even number.
The enhanced images are generated by splicing the target images with the number of the preset even numbers, so that the generated enhanced images can be prevented from having image gaps, and the image quality of the enhanced images is improved.
308. Building a super-resolution reconstruction model, and inputting the enhanced image into the super-resolution reconstruction model to obtain an alternative high-resolution image;
in this embodiment, each set of training samples in the training sample set includes a low resolution enhanced image, and a corresponding real high resolution enhanced image and a standard image. The real high-resolution enhanced image and the standard image are used for guiding the preset image super-resolution reconstruction model to carry out model training so as to obtain the optimal mapping relation from the low-resolution enhanced image to the high-resolution enhanced image, namely the trained preset image super-resolution reconstruction model, and therefore the optimal mapping relation is utilized to realize the enhanced image super-resolution reconstruction.
309. Carrying out image format conversion on the alternative high-resolution image and the standard high-resolution image to obtain a first image and a second image;
in this embodiment, a pre-trained image mode conversion model may be used, and in practical applications, the image mode conversion model may select a deep neural network model, and the like, to convert the candidate high-resolution image and the real high-resolution image into corresponding visible light images, respectively.
Specifically, a current image to be converted is obtained, and convolutional coding is performed on the current image to be converted to obtain a coding tensor of the image to be converted; mapping an image coding tensor to be converted into a visible light image tensor based on the multilayer residual error network; and performing transpose convolution decoding on the tensor of the visible light image to obtain the visible light image with the same size as the current image to be converted.
For example, the SVTN network consists of an encoding module, a mapping module and a decoding module, which respectively execute the above steps. The encoding module consists of five small convolution modules; its input is an enhanced image with a resolution of 256 × 256, and its output is a 1024-channel encoding tensor of size 8 × 8. A 4 × 4 convolution layer is used to enlarge the receptive field, so that the convolution operation can take more neighborhood information of the enhanced image into account and reduce the interference of speckle noise. Through this process, the encoded representation and high-level semantic information of the enhanced image are obtained.
The mapping module adopts a multilayer residual structure to increase the mapping capability of the network, and specifically consists of 3 residual blocks. Its input is the 8 × 8 enhanced-image encoding tensor and its output is an 8 × 8 visible-light-image encoding tensor; through this process, the enhanced image tensor obtained by the preceding encoding is mapped to a visible light image tensor. The decoding module upsamples the visible-light-image tensor to the same size as the input enhanced image using transposed convolutions; it takes the 1024-channel 8 × 8 encoding tensor as input and outputs a 3-channel 256 × 256 visible light image.
310. Constructing a loss function from a difference between the first image and the second image;
in this embodiment, specifically, the loss between the first image and the second image is estimated through the visible light space and by using the standard image, so as to realize that the high-frequency information of the enhanced image is mined from the high-resolution visible light image for feedback, and thus by increasing the estimation loss of the visible light space in the conventional loss function of the preset image super-resolution reconstruction model, the preset image super-resolution reconstruction model can be guided to output an enhanced image closer to the true high-resolution in the enhanced image super-resolution reconstruction process, so that the reconstructed enhanced image retains the texture details.
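The augmented loss can be sketched as a conventional reconstruction term plus a visible-light-space term over the format-converted first and second images; the L1 form and the balancing weight are illustrative assumptions, since the disclosure does not fix the exact loss formula:

```python
def l1(a, b):
    # Mean absolute error over two same-size images (lists of rows).
    n = sum(len(row) for row in a)
    return sum(abs(x - y)
               for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / n

def total_loss(sr, hr, sr_vis, hr_vis, vis_weight=0.1):
    # Conventional reconstruction loss between the reconstructed and real
    # high-resolution images, plus the visible-light-space loss between
    # the converted first and second images; vis_weight is an assumed
    # balancing coefficient, not a value from the text.
    return l1(sr, hr) + vis_weight * l1(sr_vis, hr_vis)
```

During training, the visible-light term feeds high-frequency information from the high-resolution visible light image back into the reconstruction model, as described above.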
311. Performing iterative training on the initial super-resolution reconstruction model based on the loss function to obtain a trained super-resolution reconstruction model;
in this embodiment, the model parameters are continuously adjusted by using the calculation result calculated by each training sample in the training sample set until the calculation result of the loss function meets the preset numerical requirement, and the preset image super-resolution reconstruction model obtained by training can be used for super-resolution reconstruction of the low-resolution image.
312. Inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
313. inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
314. inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
315. and inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
Steps 301, 302, 309 and 312 in this embodiment are similar to steps 101, 202 and 102 and 105 in the first embodiment, and are not repeated herein.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image; the pixel data is input into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain a text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and solves the technical problems of low accuracy of text detection networks and the imbalance between detection accuracy and speed.
Referring to fig. 4, a fourth embodiment of the super-resolution text image recognition method according to the embodiment of the present invention includes:
401. acquiring an image to be detected;
402. inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
403. inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
404. determining a target training image, and inputting the target training image into a first initial model, wherein the first initial model comprises a feature extraction network, a feature fusion network and an output network;
in this embodiment, the training set may include a plurality of images, and in order to improve the application range of the detection model, the images in the training set may include images in various scenes, for example, live view images, game view images, outdoor view images, indoor view images, and the like; the images in the training set may also contain text lines of various font sizes, shapes, fonts, and languages, so that the trained detection model can detect various text lines.
Each image comprises a text area of a text line marked manually, and the text area can be marked by rectangular frames and other quadrilateral frames, and can also be marked by other polygonal frames; the text region of the callout can typically cover the entire line of text in its entirety, and the text region can fit closely with the line of text.
In one embodiment, the target training image may be adjusted to a predetermined size, such as 512 x 512, before being input to the first initial model.
405. Inputting the target training image into a feature extraction network of a first initial model for feature extraction to obtain an initial feature map of the target training image;
In this embodiment, the feature extraction network may be implemented with multiple convolutional layers, generally connected in sequence, each provided with a different convolution kernel to extract feature maps of different scales. Among the plurality of initial feature maps of the target training image, each is obtained by the convolution calculation of the corresponding convolutional layer. Taking four convolutional layers as an example, each outputs one initial feature map; since each layer can be provided with convolution kernels of different sizes, the scales of the initial feature maps differ. In practical implementation, the convolutional layer that receives the target training image outputs the initial feature map of the largest scale, and the scale of the initial feature map output by each subsequent convolutional layer gradually decreases.
406. Inputting the initial feature map of the target training image into a feature fusion network of a first initial model for feature fusion to obtain a fusion feature map;
In this embodiment, a smaller convolution kernel perceives high-frequency features in the image, so the initial feature map output by a convolutional layer using a smaller kernel carries small-scale text line features; a larger convolution kernel perceives low-frequency features, so the initial feature map output by a convolutional layer using a larger kernel carries large-scale text line features. The initial feature maps of different scales therefore carry text line features of various scales, and the fused feature map obtained by fusing them carries text line features of all these scales. In this way, the detection model can detect text lines of various scales without manually scaling the image before detection.
In practical implementation, because the initial feature maps differ in scale, an interpolation operation can be performed on the smaller-scale initial feature maps before fusion to expand them to match the largest-scale initial feature map. During fusion, feature points at the same position in different initial feature maps can be multiplied or added to obtain the final fused feature map.
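The interpolate-then-combine fusion just described can be sketched in NumPy; nearest-neighbour interpolation and the helper names are assumptions made for illustration:

```python
import numpy as np

def upsample_nearest(fmap, target_shape):
    """Nearest-neighbour interpolation used to expand a smaller-scale
    initial feature map to the scale of the largest one."""
    h, w = fmap.shape
    th, tw = target_shape
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return fmap[rows][:, cols]

def fuse_feature_maps(feature_maps, mode="add"):
    """Expand every map to the largest scale, then combine feature points
    at the same position by addition or multiplication."""
    target = max(f.shape for f in feature_maps)
    aligned = [upsample_nearest(f, target) for f in feature_maps]
    fused = aligned[0]
    for f in aligned[1:]:
        fused = fused + f if mode == "add" else fused * f
    return fused

maps = [np.ones((8, 8)), np.ones((4, 4)) * 2, np.ones((2, 2)) * 3]
print(fuse_feature_maps(maps)[0, 0])  # 6.0
```

The `mode` switch mirrors the patent's "multiplied or added" alternatives at matching positions.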
407. Inputting the fusion characteristic graph into an output network to obtain candidate regions of a text region in the target training image and a probability value of each candidate region;
in this embodiment, the first output network is configured to extract the required features from the fused feature map to obtain the output result. If the detection model outputs a single result, the first output network typically comprises one group of networks; if the detection model outputs multiple results, the first output network typically comprises multiple groups of networks arranged in parallel, with each group outputting one result. The first output network may be composed of convolutional layers or fully connected layers. In the step above, the first output network needs to output two results, namely the candidate regions and the probability value of each candidate region, so it may include two groups of networks, each of which may be a convolutional network or a fully connected network.
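Two parallel output heads over the fused features can be sketched as follows; the fully connected form, the dimensions, and the sigmoid used for the probability value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class OutputNetwork:
    """Two parallel fully connected heads over the fused feature map:
    one regresses candidate-region vertex coordinates, the other
    outputs a probability value per candidate region."""
    def __init__(self, feat_dim, n_candidates):
        self.w_box = rng.standard_normal((feat_dim, n_candidates * 8))  # 4 vertices, (x, y) each
        self.w_prob = rng.standard_normal((feat_dim, n_candidates))

    def forward(self, fused_feat):
        flat = fused_feat.reshape(-1)
        boxes = (flat @ self.w_box).reshape(-1, 8)            # candidate-region coordinates
        probs = 1.0 / (1.0 + np.exp(-(flat @ self.w_prob)))   # sigmoid -> probability values
        return boxes, probs

net = OutputNetwork(feat_dim=16, n_candidates=5)
boxes, probs = net.forward(rng.standard_normal((4, 4)))
print(boxes.shape, probs.shape)  # (5, 8) (5,)
```

The two heads share the fused feature input but are otherwise independent, matching the "groups of networks arranged in parallel" description.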
408. Determining candidate regions and a loss value of the probability value of each candidate region based on a preset detection loss function;
in this embodiment, a standard text region is pre-labeled in a target training image, and a coordinate matrix of the text region and a probability matrix of the text region can be generated based on the position of the labeled text region; the coordinate matrix of the text area comprises the vertex coordinates of the standard text area; the probability matrix of the text region contains a probability value of the text region, which is typically 1.
409. Training the first initial model according to the loss value until the parameters in the first initial model are converged to obtain a text detection network model;
in this embodiment, the detection loss function compares the difference between the coordinate matrix of each candidate region and that of the standard text region, and the difference between the probability value of each candidate region and that of the standard text region; the larger the difference, the larger the first loss value. Based on the first loss value, the parameters of each part of the first initial model can be adjusted for training. When every parameter in the model has converged, training ends and the text detection network model is obtained.
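As a toy illustration of adjusting parameters from a loss value until convergence (the squared-error loss, learning rate, and stopping rule here are assumptions, not the patent's actual detection loss):

```python
import numpy as np

def train_until_converged(pred_init, target, lr=0.1, tol=1e-6, max_steps=10000):
    """Gradient descent on a squared-error loss between predicted and
    standard (labelled) region coordinates; training stops when the
    parameter update falls below a convergence tolerance."""
    params = pred_init.astype(float).copy()
    for step in range(max_steps):
        loss = np.mean((params - target) ** 2)        # larger difference -> larger loss
        grad = 2.0 * (params - target) / params.size
        new_params = params - lr * grad
        if np.max(np.abs(new_params - params)) < tol:  # parameters converged
            return new_params, loss, step
        params = new_params
    return params, loss, max_steps

target = np.array([10.0, 20.0, 110.0, 60.0])  # standard text-region box
params, loss, steps = train_until_converged(np.zeros(4), target)
print(np.allclose(params, target, atol=1e-3))  # True
```

The real model would backpropagate this loss through all network parts rather than over a single coordinate vector.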
410. Inputting the target high-resolution image into a text detection network model to obtain a plurality of candidate text regions of the text region in the target high-resolution image and a probability value of each candidate text region;
in this embodiment, among the candidate regions output by the text detection model, multiple candidate regions may all correspond to the same text line.
411. Determining a text region in the target high-resolution image from the plurality of text candidate regions according to the probability value of the text candidate regions and the overlapping degree of the plurality of text candidate regions;
in this embodiment, in order to find the region that best matches each text line, the candidate regions need to be filtered. In most cases, several candidate regions with a high degree of mutual overlap correspond to the same text line, so the text region for that line can be determined from the probability values of those highly overlapping candidates; for example, the candidate region with the highest probability value among a group of highly overlapping candidate regions is determined to be the text region. If the image contains multiple text lines, multiple text regions are typically determined.
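The filtering described here is standard non-maximum suppression; a minimal sketch using intersection-over-union as the overlap degree (the threshold value and box format are assumptions):

```python
import numpy as np

def iou(a, b):
    """Overlap degree (intersection over union) of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_text_regions(boxes, probs, iou_thresh=0.5):
    """Non-maximum suppression: among candidate regions that highly overlap
    one another, keep the one with the highest probability value."""
    order = np.argsort(probs)[::-1]  # visit candidates by descending probability
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]

boxes = [[0, 0, 100, 30], [2, 1, 101, 31], [0, 50, 100, 80]]  # two text lines
probs = np.array([0.9, 0.8, 0.95])
print(len(select_text_regions(boxes, probs)))  # 2
```

The first two boxes overlap heavily and collapse to the higher-probability one; the third survives as a separate text region.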
412. And inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
Steps 401 to 403 and step 412 in this embodiment are similar to steps 101 to 103 and step 105 in the first embodiment, and are not described again here.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into the convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain the pixel data of the image; the pixel data are input into the sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain the text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and addresses the technical problems of low text detection accuracy and imbalanced detection speed.
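The sub-pixel convolution layer in this pipeline upscales by rearranging channels into spatial positions (the pixel-shuffle operation of ESPCN-style models); a minimal NumPy sketch, with the upscaling factor r=2 and array shapes chosen as illustrative assumptions:

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Sub-pixel convolution rearrangement: a (C*r^2, H, W) feature tensor
    becomes a (C, H*r, W*r) image, trading channels for resolution."""
    c2, h, w = feat.shape
    assert c2 % (r * r) == 0
    c = c2 // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    return feat.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

lr_feat = np.arange(2 * 2 * 3 * 3).reshape(4, 3, 3).astype(float)  # C=1, r=2
hr = pixel_shuffle(lr_feat, r=2)
print(hr.shape)  # (1, 6, 6)
```

Each output 2×2 block gathers one pixel from each of the four input channels, which is how the layer performs "pixel extraction" to build the high-resolution image.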
Referring to fig. 5, a fifth embodiment of the super-resolution text image recognition method according to the embodiment of the present invention includes:
501. acquiring an image to be detected;
502. inputting an image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
503. inputting the pixel data into a sub-pixel convolution layer of a super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
504. inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
505. inputting the text region of the target high-resolution image into a preset text recognition model, and extracting features of the text region through the text recognition model to obtain a feature map corresponding to the text region;
in this embodiment, the text recognition model is a neural network model for classifying and recognizing images that contain text. Specifically, the computer device obtains the text recognition model trained by the server and performs feature extraction on the text image through the model to obtain the feature map corresponding to the text image.
In an embodiment, the computer device may select a neural network such as VGG (Visual Geometry Group network) or ResNet as the feature extraction network, which is not limited in the embodiments of the present application. For example, the computer device may use two residual modules of a ResNet neural network as convolution layers for low-level feature extraction, thereby extracting a feature map from the text image.
In one embodiment, after the computer device extracts the feature map from the text image, the feature map may be subjected to language classification processing and text recognition processing through a corresponding channel in the text recognition model. For example, the computer device performs language classification processing on the feature map through a classification channel in the text recognition model; and the computer equipment performs text recognition processing on the feature map through a text recognition channel in the text recognition model.
506. Performing language classification processing on the feature map through a classification channel in the text recognition model to obtain a language deviation classification result corresponding to the text image;
in this embodiment, the classification channel is the channel used for language classification within the text recognition model; a channel can also be understood as a network structure or a channel branch. Language classification is the process of classifying each image by determining the preset features of each language's text in the image. The language deviation classification result is the deviation category produced by the trained text recognition model. It can be understood that, since each image contains text in at least one language, the corresponding classification result is a deviation classification result, indicating which language's text in the current image exhibits more of the preset features.
Specifically, the computer device inputs the feature map into a classification channel in the text recognition model, performs language classification processing on the feature map through the classification channel, and takes an output result of the classification channel as a language deviation classification result corresponding to the text image.
In one embodiment, the computer device performs language classification processing on the feature map through the classification channel in the text recognition model, determines the language of each text in the text image corresponding to the feature map, and counts the number of texts in each language. The computer device then determines the language deviation classification result of the text image by comparing the number of texts in each language.
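The count-and-compare deviation classification can be sketched in plain Python; the `(text, language)` input representation and function name are assumed for illustration:

```python
from collections import Counter

def language_deviation(tokens_with_lang):
    """Count recognized text items per language and return the dominant
    (deviation) language category for the image, plus the raw counts."""
    counts = Counter(lang for _, lang in tokens_with_lang)
    lang, _ = counts.most_common(1)[0]
    return lang, dict(counts)

items = [("hello", "en"), ("world", "en"), ("你好", "zh")]
print(language_deviation(items))  # ('en', {'en': 2, 'zh': 1})
```

The returned category then selects which text recognition channel processes the feature map in step 507.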
507. And performing text recognition on the feature map according to the language deviation classification result to obtain a corresponding text recognition result, and determining text content in the text region according to the recognition result.
In this embodiment, the text recognition result is the text content output by the trained text recognition model. A text recognition channel is a channel for recognizing text in an image. The text recognition channels are divided into at least a first text recognition channel and a second text recognition channel. For example, the first text recognition channel mainly recognizes images in which characters of the first language are most numerous, while the second text recognition channel mainly recognizes images in which characters of the second language are most numerous. When texts in more languages are present, corresponding text recognition channels can be added.
Specifically, when the language deviation classification result is a deviation to a first language category, the computer device inputs a feature map corresponding to the first language category into a first text recognition channel for text recognition, and obtains a corresponding text recognition result.
In one embodiment, the first text recognition channel adopts LSTM and CTC network structures; by combining the two, character recognition can be achieved even when the character region of the text image is not of fixed size.
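The CTC half of the LSTM+CTC pairing is what frees recognition from a fixed character-region length: the LSTM emits a label per frame, and CTC decoding collapses repeats and removes blanks. A sketch of CTC best-path (greedy) decoding (label values are illustrative):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """CTC best-path decoding: collapse consecutive repeated labels,
    then drop blank symbols, so output length need not be fixed."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax labels from an LSTM (0 is the CTC blank).
print(ctc_greedy_decode([0, 3, 3, 0, 3, 7, 7, 0]))  # [3, 3, 7]
```

Note how the blank between the two 3s preserves a genuine repeated character, while the consecutive 3s and 7s collapse.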
Steps 501 to 504 in this embodiment are similar to steps 101 to 104 in the first embodiment, and are not described again here.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into the convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain the pixel data of the image; the pixel data are input into the sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain the text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and addresses the technical problems of low text detection accuracy and imbalanced detection speed.
The super-resolution text image recognition method according to the embodiment of the present invention is described above; the super-resolution text image recognition apparatus according to the embodiment of the present invention is described below. Referring to fig. 6, a first embodiment of the super-resolution text image recognition apparatus according to the embodiment of the present invention includes:
a first obtaining module 601, configured to obtain an image to be detected;
an input module 602, configured to input the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing, so as to obtain pixel data of the image to be detected;
the pixel extraction module 603 is configured to input the pixel data into the sub-pixel convolution layer of the super-resolution reconstruction model to perform pixel extraction, so as to obtain a target high-resolution image;
the detection module 604 is configured to input the target high-resolution image into a preset text detection network model for detection, so as to obtain a text region of the target high-resolution image;
the identification module 605 is configured to input a text region of the target high-resolution image into a preset text identification model for identification, and determine text content in the text region according to an identification result.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into the convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain the pixel data of the image; the pixel data are input into the sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain the text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and addresses the technical problems of low text detection accuracy and imbalanced detection speed.
Referring to fig. 7, a super-resolution text image recognition apparatus according to a second embodiment of the present invention specifically includes:
a first obtaining module 601, configured to obtain an image to be detected;
an input module 602, configured to input the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing, so as to obtain pixel data of the image to be detected;
the pixel extraction module 603 is configured to input the pixel data into the sub-pixel convolution layer of the super-resolution reconstruction model to perform pixel extraction, so as to obtain a target high-resolution image;
the detection module 604 is configured to input the target high-resolution image into a preset text detection network model for detection, so as to obtain a text region of the target high-resolution image;
the identification module 605 is configured to input a text region of the target high-resolution image into a preset text identification model for identification, and determine text content in the text region according to an identification result.
In this embodiment, the super-resolution text image recognition apparatus further includes:
a second obtaining module 606, configured to obtain a low-resolution image from a source database;
a data enhancement module 607, configured to receive a data enhancement request, and perform data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image;
and a building module 608, configured to build an initial super-resolution reconstruction model, and train the super-resolution reconstruction model through the enhanced image to obtain a trained super-resolution reconstruction model.
In this embodiment, the data enhancement module 607 is specifically configured to:
receiving a data enhancement request, determining a request scene according to the data enhancement request, and acquiring the number of images of each label in the request scene;
extracting a request label from the label according to the number of the images, and acquiring a request image corresponding to the request label; carrying out image fusion on any two images in the request image to obtain a fused image;
generating a fusion label of the fusion image according to the request labels of any two images; and splicing the request image and the fusion image according to the fusion label to obtain an enhanced image.
In this embodiment, the building module 608 is specifically configured to:
building a super-resolution reconstruction model, and inputting the enhanced image into the super-resolution reconstruction model to obtain an alternative high-resolution image;
carrying out image format conversion on the alternative high-resolution image and the standard high-resolution image to obtain a first image and a second image;
constructing a loss function from a difference between the first image and the second image;
and carrying out iterative training on the initial super-resolution reconstruction model based on the loss function to obtain a trained super-resolution reconstruction model.
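The patent does not specify the image format conversion or the form of the loss; a common choice, shown here purely as an assumption, is mean-squared error over the luminance (Y) channel after an RGB-to-Y conversion of both the candidate and standard high-resolution images:

```python
import numpy as np

def rgb_to_y(img):
    """Illustrative image-format conversion: RGB -> luminance (Y) channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def reconstruction_loss(candidate_hr, standard_hr):
    """Loss built from the difference between the converted candidate
    image (first image) and the converted standard image (second image)."""
    first, second = rgb_to_y(candidate_hr), rgb_to_y(standard_hr)
    return float(np.mean((first - second) ** 2))

cand = np.zeros((8, 8, 3))
std = np.zeros((8, 8, 3))
print(reconstruction_loss(cand, std))  # 0.0
```

Iterative training then minimizes this value so the reconstructed image approaches the standard high-resolution image.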
In this embodiment, the super-resolution text image recognition apparatus further includes:
a first determining module 609, configured to determine a target training image, and input the target training image to a first initial model, where the first initial model includes a feature extraction network, a feature fusion network, and an output network;
a feature extraction module 610, configured to input the target training image into a feature extraction network of the first initial model to perform feature extraction, so as to obtain an initial feature map of the target training image;
the fusion module 611 is configured to input the initial feature map of the target training image into the feature fusion network of the first initial model to perform feature fusion, so as to obtain a fusion feature map; inputting the fusion feature map into the output network to obtain candidate regions of a text region in the target training image and a probability value of each candidate region;
a second determining module 612, configured to determine the candidate regions and a loss value of the probability value of each candidate region based on a preset detection loss function;
a training module 613, configured to train the first initial model according to the loss value until a parameter in the first initial model converges, to obtain a text detection network model.
In this embodiment, the recognition module 605 is specifically configured to:
inputting the text region of the target high-resolution image into a preset text recognition model, and performing feature extraction on the text region through the text recognition model to obtain a feature map corresponding to the text region;
performing language classification processing on the feature map through a classification channel in the text recognition model to obtain a language deviation classification result corresponding to the text image;
and performing text recognition on the feature map according to the language deviation classification result to obtain a corresponding text recognition result, and determining text content in the text region according to the recognition result.
In this embodiment, the detecting module 604 includes:
a detecting unit 6041, configured to input the target high-resolution image into the text detection network model, to obtain a plurality of candidate text regions of text regions in the target high-resolution image, and a probability value of each candidate text region;
a determining unit 6042 configured to determine a text region in the target high-resolution image from among the plurality of text candidate regions according to the probability value of the text candidate region and a degree of overlap between the plurality of text candidate regions.
In the embodiment of the invention, an image to be detected is obtained; the image to be detected is input into the convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain the pixel data of the image; the pixel data are input into the sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image; the target high-resolution image is input into a preset text detection network model for detection to obtain the text region of the target high-resolution image; and the text region of the target high-resolution image is input into a preset text recognition model for recognition, with the text content in the text region determined according to the recognition result. The invention improves the deep learning capability of the model through a GAN network and addresses the technical problems of low text detection accuracy and imbalanced detection speed.
Fig. 6 and 7 describe the super-resolution text image recognition apparatus in the embodiment of the present invention in detail from the perspective of the modular functional entity, and the super-resolution text image recognition apparatus in the embodiment of the present invention is described in detail from the perspective of hardware processing.
Fig. 8 is a schematic structural diagram of a super-resolution text image recognition apparatus according to an embodiment of the present invention, where the super-resolution text image recognition apparatus 800 may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 810 (e.g., one or more processors) and a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 833 or data 832. Memory 820 and storage medium 830 may be, among other things, transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the super-resolution text image recognition apparatus 800. Further, the processor 810 may be configured to communicate with the storage medium 830, and execute a series of instruction operations in the storage medium 830 on the super-resolution text image recognition apparatus 800 to implement the steps of the super-resolution text image recognition method provided by the above-mentioned method embodiments.
The super-resolution text image recognition device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input-output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will appreciate that the device configuration shown in fig. 8 does not limit the super-resolution text image recognition device provided herein, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and may also be a volatile computer-readable storage medium, where instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the steps of the super-resolution text image recognition method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A super-resolution text image recognition method is characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
inputting the pixel data into a sub-pixel convolution layer of the super-resolution reconstruction model for pixel extraction to obtain a target high-resolution image;
inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
and inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
2. The super-resolution text image recognition method according to claim 1, wherein before the image to be detected is input into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected, the method further comprises:
acquiring a low-resolution image from a source database;
receiving a data enhancement request, and performing data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image;
and building an initial super-resolution reconstruction model, and training the super-resolution reconstruction model through the enhanced image to obtain the trained super-resolution reconstruction model.
3. The method of claim 2, wherein the receiving a data enhancement request and performing data enhancement on the low-resolution image according to the data enhancement request to obtain an enhanced image comprises:
receiving a data enhancement request, determining a request scene according to the data enhancement request, and acquiring the number of images of each label in the request scene;
extracting a request label from the label according to the number of the images, and acquiring a request image corresponding to the request label;
carrying out image fusion on any two images in the request image to obtain a fused image;
generating a fusion label of the fusion image according to the request labels of any two images;
and splicing the request image and the fusion image according to the fusion label to obtain an enhanced image.
4. The super-resolution text image recognition method according to claim 2, wherein the building of an initial super-resolution reconstruction model and the training of the super-resolution reconstruction model through the enhanced image to obtain the trained super-resolution reconstruction model comprises:
building a super-resolution reconstruction model, and inputting the enhanced image into the super-resolution reconstruction model to obtain an alternative high-resolution image;
carrying out image format conversion on the alternative high-resolution image and the standard high-resolution image to obtain a first image and a second image;
constructing a loss function from a difference between the first image and the second image;
and carrying out iterative training on the initial super-resolution reconstruction model based on the loss function to obtain a trained super-resolution reconstruction model.
5. The method for recognizing the super-resolution text image according to claim 1, wherein before the target high-resolution image is input into a preset text detection network model for detection, and a text region of the target high-resolution image is obtained, the method further comprises:
determining a target training image, and inputting the target training image into a first initial model, wherein the first initial model comprises a feature extraction network, a feature fusion network and an output network;
inputting the target training image into a feature extraction network of the first initial model for feature extraction to obtain an initial feature map of the target training image;
inputting the initial feature map of the target training image into a feature fusion network of the first initial model for feature fusion to obtain a fusion feature map;
inputting the fusion feature map into the output network to obtain candidate regions of a text region in the target training image and a probability value of each candidate region;
determining the candidate regions and the loss value of the probability value of each candidate region based on a preset detection loss function;
and training the first initial model according to the loss value until parameters in the first initial model are converged to obtain a text detection network model.
6. The super-resolution text image recognition method according to claim 1, wherein the inputting a text region of the target high-resolution image into a preset text recognition model for recognition, and determining text content in the text region according to a recognition result comprises:
inputting the text region of the target high-resolution image into a preset text recognition model, and performing feature extraction on the text region through the text recognition model to obtain a feature map corresponding to the text region;
performing language classification processing on the feature map through a classification channel in the text recognition model to obtain a language deviation classification result corresponding to the text image;
and performing text recognition on the feature map according to the language deviation classification result to obtain a corresponding text recognition result, and determining text content in the text region according to the recognition result.
7. The super-resolution text image recognition method according to claim 5, wherein the inputting the target high-resolution image into a preset text detection network model for detection to obtain the text region of the target high-resolution image comprises:
inputting the target high-resolution image into the text detection network model to obtain a plurality of candidate text regions in the target high-resolution image and a probability value of each candidate text region;
and determining the text region in the target high-resolution image from the plurality of candidate text regions according to the probability value of each candidate text region and the degree of overlap among the candidate text regions.
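Selecting final regions by probability value and mutual overlap is the standard non-maximum suppression (NMS) pattern, although the claim does not name it. A hedged sketch with an assumed IoU threshold:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes, discarding any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, of two heavily overlapping candidates only the higher-probability one survives, while a distant candidate is kept regardless of its score.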
8. A super-resolution text image recognition apparatus, comprising:
the first acquisition module is used for acquiring an image to be detected;
the input module is used for inputting the image to be detected into a convolutional neural network layer of a preset super-resolution reconstruction model for processing to obtain pixel data of the image to be detected;
the pixel extraction module is used for inputting the pixel data into the sub-pixel convolution layer of the super-resolution reconstruction model to carry out pixel extraction so as to obtain a target high-resolution image;
the detection module is used for inputting the target high-resolution image into a preset text detection network model for detection to obtain a text region of the target high-resolution image;
and the recognition module is used for inputting the text area of the target high-resolution image into a preset text recognition model for recognition, and determining the text content in the text area according to the recognition result.
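The sub-pixel convolution layer used by the pixel extraction module typically ends with a depth-to-space rearrangement (pixel shuffle) that turns C·r² low-resolution channels into a C-channel image upscaled by the factor r. A NumPy sketch of that rearrangement alone, with the convolution itself omitted (the channel layout follows the common ESPCN formulation and is an assumption here):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into a (C, H*r, W*r) image (depth-to-space)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave the r*r sub-pixel channels spatially
    return x.reshape(c, h * r, w * r)

# 4 one-pixel channels become a single 2x2 high-resolution patch
hr = pixel_shuffle(np.arange(4.0).reshape(4, 1, 1), 2)
```

Because it is a pure rearrangement, every output pixel comes from exactly one convolution output, which is why sub-pixel convolution upscales without interpolation artifacts.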
9. A super-resolution text image recognition apparatus, characterized by comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the super resolution text image recognition device to perform the steps of the super resolution text image recognition method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for super-resolution text image recognition according to any one of claims 1 to 7.
CN202111455688.9A 2021-12-01 2021-12-01 Super-resolution text image recognition method, device, equipment and storage medium Pending CN114170608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111455688.9A CN114170608A (en) 2021-12-01 2021-12-01 Super-resolution text image recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114170608A true CN114170608A (en) 2022-03-11

Family

ID=80482368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111455688.9A Pending CN114170608A (en) 2021-12-01 2021-12-01 Super-resolution text image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114170608A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035541A (en) * 2022-06-27 2022-09-09 中核核电运行管理有限公司 Large-size complex pdf engineering drawing text detection and identification method
CN115063876A (en) * 2022-08-17 2022-09-16 季华实验室 Image recognition rate improving method and device, electronic equipment and storage medium
CN115063876B (en) * 2022-08-17 2022-11-18 季华实验室 Image recognition rate improving method and device, electronic equipment and storage medium
CN116188293A (en) * 2022-12-21 2023-05-30 北京海天瑞声科技股份有限公司 Image processing method, device, apparatus, medium, and program product
CN116188293B (en) * 2022-12-21 2023-08-29 北京海天瑞声科技股份有限公司 Image processing method, device, apparatus, medium, and program product
CN116863456A (en) * 2023-05-30 2023-10-10 中国科学院自动化研究所 Video text recognition method, device and storage medium
CN116863456B (en) * 2023-05-30 2024-03-22 中国科学院自动化研究所 Video text recognition method, device and storage medium
CN117252927A (en) * 2023-11-20 2023-12-19 华中科技大学同济医学院附属协和医院 Catheter lower intervention target positioning method and system based on small target detection
CN117252927B (en) * 2023-11-20 2024-02-02 华中科技大学同济医学院附属协和医院 Catheter lower intervention target positioning method and system based on small target detection

Similar Documents

Publication Publication Date Title
Lei et al. Coupled adversarial training for remote sensing image super-resolution
CN114170608A (en) Super-resolution text image recognition method, device, equipment and storage medium
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN111325203B (en) American license plate recognition method and system based on image correction
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
US20190180154A1 (en) Text recognition using artificial intelligence
CN108288075A (en) A kind of lightweight small target detecting method improving SSD
CN112613502A (en) Character recognition method and device, storage medium and computer equipment
Shivakumara et al. Fractals based multi-oriented text detection system for recognition in mobile video images
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
CN116645592B (en) Crack detection method based on image processing and storage medium
CN112052845A (en) Image recognition method, device, equipment and storage medium
CN111666937A (en) Method and system for recognizing text in image
Kölsch et al. Recognizing challenging handwritten annotations with fully convolutional networks
CN103455816B (en) Stroke width extraction method and device and character recognition method and system
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
Wang et al. CSA-CDGAN: channel self-attention-based generative adversarial network for change detection of remote sensing images
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN116703725A (en) Method for realizing super resolution for real world text image by double branch network for sensing multiple characteristics
CN114283431B (en) Text detection method based on differentiable binarization
JP5211449B2 (en) Program, apparatus and method for adjusting recognition distance, and program for recognizing character string
Marnissi et al. GAN-based Vision Transformer for High-Quality Thermal Image Enhancement
CN117523219A (en) Image processing method and device, electronic equipment and storage medium
CN115909378A (en) Document text detection model training method and document text detection method
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination