CN112365451A - Method, device and equipment for determining image quality grade and computer readable medium

Info

Publication number: CN112365451A
Application number: CN202011147351.7A
Authority: CN (China)
Prior art keywords: image, processed, text, text region, feature
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 毕姚姚, 陈琳, 吴伟佳, 李羽
Current assignee: Weimin Insurance Agency Co Ltd
Original assignee: Weimin Insurance Agency Co Ltd
Application filed by Weimin Insurance Agency Co Ltd
Priority to CN202011147351.7A
Publication of CN112365451A

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/24: Pattern recognition; classification techniques
    • G06N 3/045: Neural networks; combinations of networks
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06T 2207/10004: Still image; photographic image
    • G06T 2207/20081: Training; learning
    • G06T 2207/30168: Image quality inspection
    • G06T 2207/30176: Document

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a device, equipment and a computer readable medium for determining image quality grade. The method comprises the following steps: acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of a target service; extracting text region characteristics of the text region, wherein the text region characteristics are used for describing the relationship between pixel points and characters in the text region; when the quality grade of the image to be processed is evaluated, the target quality grade to which the image to be processed belongs is determined based on the overall image characteristic and the text region characteristic of the image to be processed. The method and the device solve the technical problem that the evaluation result of the bill text quality is inaccurate when the bill image quality is evaluated.

Description

Method, device and equipment for determining image quality grade and computer readable medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a computer readable medium for determining an image quality level.
Background
The most important aspects of examining bill quality are whether the characters on the bill are of a proper size, whether local keywords are legible, whether the handwriting and printing are continuous, and so on. Manual examination of bills not only consumes manpower but also has a long review cycle and a poor user experience.
At present, the related art usually adopts a bill image recognition model (such as a CV operator, a machine learning model, or a bill classification depth model) to automatically recognize the quality of a bill image, generally based on hand-crafted or automatically constructed features generated from the whole image. The methods adopted are all general methods of natural image quality recognition, so for bill image scenes in which quality recognition of the character region is the key point, the evaluation result for the bill text quality during bill image quality evaluation is not accurate.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The application provides a method, a device and equipment for determining image quality grade and a computer readable medium, which are used for solving the technical problem that an evaluation result for bill text quality is inaccurate when bill image quality is evaluated.
According to an aspect of an embodiment of the present application, there is provided a method for determining an image quality level, including: acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of a target service; extracting text region characteristics of the text region, wherein the text region characteristics are used for describing the relationship between pixel points and characters in the text region; when the quality grade of the image to be processed is evaluated, the target quality grade to which the image to be processed belongs is determined based on the overall image characteristic and the text region characteristic of the image to be processed.
According to another aspect of the embodiments of the present application, there is provided an apparatus for determining an image quality level, including: the image acquisition module is used for acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of the target service; the text feature extraction module is used for extracting text region features of the text region, and the text region features are used for describing the relationship between pixel points and characters in the text region; and the image classification module is used for determining the target quality grade of the image to be processed based on the integral image characteristic and the text region characteristic of the image to be processed when the quality grade of the image to be processed is evaluated.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, and the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor implements the method when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the above-mentioned method.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
the technical scheme includes that the image to be processed is obtained and comprises a text area recorded with an acceptance record of a target service; extracting text region characteristics of the text region, wherein the text region characteristics are used for describing the relationship between pixel points and characters in the text region; when the quality grade of the image to be processed is evaluated, the target quality grade to which the image to be processed belongs is determined based on the overall image characteristic and the text region characteristic of the image to be processed. The method and the device solve the technical problem that the evaluation result of the bill text quality is inaccurate when the bill image quality is evaluated.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art will be briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from these drawings without any creative effort.
Fig. 1 is a hardware environment diagram of an alternative image quality level determination method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an alternative method for determining an image quality level according to an embodiment of the present disclosure;
fig. 3 is a flowchart of an optional text region feature extraction provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative input-fused text detection model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative text detection model fused from a feature layer according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative text detection model using multi-task learning fusion according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative fusion of a pre-trained model and a text detection model according to an embodiment of the present application;
fig. 8 is a block diagram of an alternative apparatus for determining an image quality level according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of describing the present application and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
First, some of the nouns or terms appearing in the description of the embodiments of the present application are explained as follows:
a neural network: the neural network may be composed of neural units, which may be referred to as xsAnd an arithmetic unit with intercept b as input, the output of the arithmetic unit may be:
Figure BDA0002740088580000041
wherein s is 1, 2, … … n, n is a natural number greater than 1, and W issIs xsB is the bias of the neural unit. f is an activation function (activation functions) of the neural unit for introducing a nonlinear characteristic into the neural network to convert an input signal in the neural unit into an output signal. The output signal of the activation function may be used as an input to the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by a number of the above-mentioned single neural units joined together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receiving domain of the previous layer to extract the characteristics of the local receiving domain, and the local receiving domain can be a region composed of a plurality of neural units.
Deep neural network: deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular metric. From the division of the DNN by the positions of different layers, the neural network inside the DNN can be divided intoThe method is classified into three types: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. For example, a fully-connected neural network is fully connected between layers, that is, any neuron at the i-th layer must be connected with any neuron at the i + 1-th layer. Although DNN appears complex, it is not really complex in terms of the work of each layer, simply the following linear relational expression:
Figure BDA0002740088580000051
wherein the content of the first and second substances,
Figure BDA0002740088580000052
is the input vector of the input vector,
Figure BDA0002740088580000053
is the output vector of the output vector,
Figure BDA0002740088580000054
is an offset vector, W is a weight matrix (also called coefficient), and α () is an activation function. Each layer is only for the input vector
Figure BDA0002740088580000055
Obtaining the output vector through such simple operation
Figure BDA0002740088580000056
Due to the large number of DNN layers, the coefficient W and the offset vector
Figure BDA0002740088580000057
The number of the same is large. The definition of these parameters in DNN is as follows: taking coefficient W as an example: assume that in a three-layer DNN, the linear coefficients of the 4 th neuron of the second layer to the 2 nd neuron of the third layer are defined as
Figure BDA0002740088580000058
The superscript 3 represents the number of layers in which the coefficient W is located, while the subscripts correspond to the third layer index 2 of the output and the second layer index 4 of the input.The summary is that: the coefficients of the kth neuron of the L-1 th layer to the jth neuron of the L-1 th layer are defined as
Figure BDA0002740088580000059
Note that the input layer is without the W parameter. In deep neural networks, more hidden layers make the network more able to depict complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the larger the "capacity", which means that it can accomplish more complex learning tasks. The final goal of the process of training the deep neural network, i.e., learning the weight matrix, is to obtain the weight matrix (the weight matrix formed by the vectors W of many layers) of all the layers of the deep neural network that is trained.
A convolutional neural network: a Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
CRAFT: a residual network, one of deep neural networks, is characterized by being easily optimized and capable of improving accuracy by adding a considerable depth. The inner residual block uses jump connection, and the problem of gradient disappearance caused by depth increase in a deep neural network is relieved.
Multi-task learning: an inductive transfer mechanism whose main goal is to improve generalization ability by using the domain-specific information implicit in the training signals of multiple related tasks; multi-task learning accomplishes this goal by training multiple tasks in parallel using a shared representation.
Pixel value: the pixel value of an image may be a Red Green Blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, a pixel value may be 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, the smaller the numerical value, the lower the luminance; the larger the numerical value, the higher the luminance. For a grayscale image, the pixel value may be a grayscale value.
In the related art, a bill image recognition model (such as a CV operator, a machine learning model, or a bill classification depth model) is generally adopted to automatically recognize the quality of a bill image, generally based on hand-crafted or automatically constructed features generated from the whole image. The methods adopted are all general methods of natural image quality recognition, so for bill image scenes in which quality recognition of the character region is the key point, the evaluation result for the bill text quality during bill image quality evaluation is not accurate.
In order to solve the problems mentioned in the background, according to an aspect of embodiments of the present application, an embodiment of a method for determining an image quality level is provided.
Alternatively, in the embodiment of the present application, the method for determining the image quality level may be applied to a hardware environment formed by the terminal 101 and the server 103 as shown in fig. 1. As shown in fig. 1, a server 103 is connected to a terminal 101 through a network, which may be used to provide services for the terminal or a client installed on the terminal, and a database 105 may be provided on the server or separately from the server, and is used to provide data storage services for the server 103, and the network includes but is not limited to: wide area network, metropolitan area network, or local area network, and the terminal 101 includes but is not limited to a PC, a cell phone, a tablet computer, and the like.
In an embodiment of the present application, a method for determining an image quality level may be performed by the server 103, or may be performed by both the server 103 and the terminal 101, as shown in fig. 2, where the method may include the following steps:
Step S202, acquiring an image to be processed, where the image to be processed includes a text area recorded with an acceptance record of the target service.
The bill image processing method based on the text detection model in the embodiment of the application can be applied to business scenes in which an applicant applies to transact some business and there is a requirement on the definition of the uploaded bill images. For example, when the applicant handles a claim settlement service, bill images need to be uploaded to the claim settlement service system, and the claim settlement service system judges the definition of the currently acquired bill image so as to determine whether to enter the handling stage next or to inform the applicant to re-upload according to the system prompt. The business scenario may also be that the applicant transacts financial business related to personal information with a bank, and the like, which is not limited in the embodiment of the present application.
Optionally, in the embodiment of the present application, a claim settlement service scenario is taken as an example to explain the ticket image processing method based on the text detection model. The applicant can upload the image to be processed into the claim settlement service system, wherein the image to be processed is a bill image of the application acceptance target service, for example, the bill image is the image information of the insurance bill of the user.
The background server of the claim settlement service system can receive the to-be-processed image uploaded by the client of the applicant (i.e., the client applying for the acceptance target service), so as to acquire the to-be-processed image.
In step S204, text region features of the text region are extracted, where the text region features are used to describe a relationship between pixel points and characters in the text region.
In the embodiment of the application, when the quality of the bill image is automatically identified, in order to fit the particularity that the character region of the bill image is the key recognition region, a deep text detection model can be adopted to detect the text region of the bill image so as to find the positions of the characters in the image. Common deep text detection models include CTPN, SegLink, EAST, PSENet, LSAE, ATRR, CRAFT, and the like.
And step S206, when the quality grade of the image to be processed is evaluated, determining the target quality grade of the image to be processed based on the overall image characteristic and the text region characteristic of the image to be processed.
In the embodiment of the application, the text region characteristics extracted by the text detection model on the image to be processed can be combined with the target classification model, so that the text detection model is fused into the target classification model, and the evaluation accuracy rate of the quality of the bill image is improved. The target classification model can be obtained by training with training data with labeled information by using a convolutional neural network model as an initial training model. The labeling information labels at least an image quality level of the training data.
By adopting the technical scheme, the text detection model is fused into the general quality evaluation model, and the technical problem that the evaluation result of the bill text quality is inaccurate in the bill image quality evaluation can be solved.
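For illustration only, the following is a minimal sketch of steps S202 to S206 in PyTorch-style Python. The names text_detector, quality_classifier, and prob_threshold are assumptions introduced for the example and are not defined by the application.

```python
import torch

def determine_quality_level(image: torch.Tensor,
                            text_detector,          # hypothetical text detection model (e.g. CRAFT-like)
                            quality_classifier,     # hypothetical target classification model
                            prob_threshold: float = 0.5) -> int:
    """Sketch of steps S202-S206: acquire image, extract text-region feature, grade quality."""
    # S202: the image to be processed, shape (1, 3, H, W), contains the bill text region.
    # S204: text-region feature describing the relation between pixels and characters.
    region_score = text_detector(image)                    # e.g. (1, 1, H, W) character-center probabilities
    # S206: combine whole-image features with the text-region feature to grade quality.
    class_prob = quality_classifier(image, region_score)   # (1, num_levels) class probabilities
    if float(class_prob.max()) >= prob_threshold:          # probability within the preset threshold range
        return int(class_prob.argmax(dim=1))
    return -1                                              # no quality level determined
```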
Optionally, as shown in fig. 3, extracting the text region feature of the image to be processed by using the text detection model may include the following steps:
step S302, zooming the image to be processed to obtain an intermediate image meeting the target size requirement;
step S304, utilizing a text detection model to perform up-sampling on the intermediate image so as to extract single character features;
step S306, combining the extracted single character features to obtain multi-character features;
step S308, determining the probability that each pixel point of the intermediate image belongs to each character center in the multi-character feature to obtain a text feature map.
In the embodiment of the application, downsampling, namely scaling, can be performed on the image to be processed so that the length and width of the obtained intermediate image meet the target size requirement, where the target size is determined from the length and width of the image to be processed. The image of a single character is cut out from the intermediate image by upsampling, and the character region can be segmented with a watershed algorithm to obtain single characters; at this point each character is enclosed in a polygonal box, and the center of the polygonal box is the character center of that character. The coordinates of the polygonal boxes of the segmented single characters are converted back to coordinates on the image to be processed, i.e. the single characters are combined, and continuous multiple characters are obtained according to the coordinate order. Finally, the probability that each pixel belongs to a character center is calculated pixel by pixel to obtain the text feature map.
In the embodiment of the application, a CRAFT model can be used as the text detection model. The backbone network of the CRAFT model adopts the backbone of VGG-16; VGG-16 is a deep convolutional neural network, and the backbone is the trunk part of the network structure, the part of the network used in the CV field to extract features of an image. When CRAFT extracts the text region features of the image to be processed, similar to the downsample-then-upsample approach of a U-Net structure, downsampling can be performed multiple times. The downsampling adjusts the length and width of the input picture to the multiples of 32 closest to the original length and width values; for example, a 500 × 400 input picture can be padded to 512 × 416, which effectively avoids the pixel drift phenomenon in segmentation. Pixel drift, i.e. phase drift of a digital image, is the jitter phenomenon occurring in a series of digital images repeatedly acquired from a stationary optical image. The image that undergoes the up-sampling and feature combination operations after down-sampling is the intermediate image; after the CRAFT model performs up-sampling and feature combination on the intermediate image, the model outputs two channel feature maps: a region score map and an affinity score map, which give, respectively, the probability of being the center of a single character region and the probability of being the center of the region between adjacent characters.
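For illustration only, the sketch below (PyTorch-style) shows how the intermediate image and the text feature map of steps S302 to S308 might be obtained; the two-output craft_model interface is an assumption about the detector, not a fixed implementation.

```python
import torch
import torch.nn.functional as F

def extract_text_region_feature(image: torch.Tensor, craft_model) -> torch.Tensor:
    """image: (1, 3, H, W) tensor. Returns the region score map used as the text feature map."""
    _, _, h, w = image.shape
    # Step S302: scale/pad the height and width to the nearest multiples of 32
    # (e.g. a 500x400 input becomes 512x416), producing the intermediate image.
    target_h, target_w = ((h + 31) // 32) * 32, ((w + 31) // 32) * 32
    intermediate = F.pad(image, (0, target_w - w, 0, target_h - h))
    # Steps S304-S308: the detector upsamples its internal features and outputs two maps:
    # the region score (probability of single-character centers) and the affinity score
    # (probability of centers between adjacent characters); the region score map is kept
    # as the text feature map.
    region_score, affinity_score = craft_model(intermediate)    # assumed two-output interface
    return region_score                                          # (1, 1, target_h, target_w)
```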
Since the bill image quality recognition scene needs to focus on the features of the text region and weaken the features of the non-text region, the text region and the non-text region can be distinguished through the region score map, i.e. the text region feature. In general, when the degree of blurring of a text region differs, the region probability value also differs, so the text region probability value can be used to distinguish the degree of blurring of the image.
The present application provides 4 methods for fusing a text detection model into a general quality assessment model, and various aspects of the present application are described in detail below with reference to fig. 4 to 7.
Optionally, when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed may include the following steps:
step 1, zooming the text characteristic graph to adjust the text characteristic graph to be consistent with the length and the width of the image to be processed.
In the embodiment of the application, up-sampling and down-sampling can be used to scale the image, so that the target feature map can be adjusted to be consistent with the length and width of the image to be processed. For an image I of size M × N, down-sampling it by a factor of s yields an image of resolution (M/s) × (N/s), where s is a common divisor of M and N. Up-sampling can adopt an interpolation method, i.e. new elements are inserted between the pixel points on the basis of the original image pixels using a suitable interpolation algorithm.
And 2, inputting the three color components of the image to be processed into a target classification model as image information, and inputting a text feature map with the length and the width consistent with those of the image to be processed into the target classification model as additional image information so that the target classification model can identify the image to be processed by utilizing the text feature map.
In the field of computer vision, the size of general image input information is height, width and channel number, and the channel input of color images is color three-channel data, such as RGB three channels, HSV three channels, YUV three channels and the like.
In the embodiment of the application, as shown in fig. 4, the text region feature extracted by the text detection model from the image to be processed (that is, the text feature map whose length and width are consistent with those of the image to be processed) can be input through a fourth channel as one more dimension of the image input information, so as to enhance the target classification model's ability to distinguish text regions from non-text regions, and clear text regions from blurred text regions, in the bill.
And 3, determining the target quality grade of the image to be processed according to the recognition result of the target classification model by using the text feature map to recognize the image to be processed.
Optionally, determining the target quality level to which the image to be processed belongs according to the recognition result of the target classification model by using the text feature map includes the following steps:
step 31, inputting the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain a first image feature;
step 32, inputting the first image characteristics into a second convolution layer of the target classification model to obtain the class probability output by the output layer, outputting the output result of the second convolution layer through the output layer, and using the class probability to evaluate the quality grade of the image to be processed;
and step 33, determining the target quality level of the image to be processed when the class probability is within the preset class probability threshold range.
In this embodiment, the first convolution layer is a plurality of convolution layers in a hidden layer of the target classification model, and is used to extract image features, and the second convolution layer is a 1 × 1 convolution layer and is used to calculate a probability.
In the embodiment of the present application, the convolutional layer may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix usually processes the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) along the horizontal direction, so as to complete the task of extracting specific features from the image. The size of the weight matrix is related to the size of the image to be processed.
It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image (the image to be processed and the target feature map), and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix will produce a single depth dimension of the convolved output, but in most cases not a single weight matrix is used, but a plurality of weight matrices of the same size (row by column), i.e. a plurality of matrices of the same type, are applied. The outputs of each weight matrix are stacked to form the depth dimension of the convolved image, where the dimension is understood to be determined by "plurality" as described above.
Different weight matrices may be used to extract different features in an image, e.g., one weight matrix may be used to extract image edge information, another weight matrix to extract a particular color of an image, yet another weight matrix to blur unwanted noise in an image, etc. The plurality of weight matrices have the same size (row × column), the feature maps extracted by the plurality of weight matrices having the same size also have the same size, and the extracted feature maps having the same size are combined to form the output of the convolution operation.
The weight values in the weight matrixes need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used for extracting information from the input image, so that the neural network can carry out correct prediction. In the embodiment of the application, the first image feature can be obtained through the convolutional layer, and the first image feature is obtained by combining the to-be-processed image and the text region feature identification.
In the embodiment of the present application, the first image feature enters a 1 × 1 convolutional layer for probability prediction, and the prediction result (i.e. the class probability that the image to be processed belongs to each class) is output by the output layer. Finally, the target quality level of the image to be processed is determined according to the preset class probability threshold range. A high target quality level indicates that the character region of the image to be processed is clear and the bill image quality is high; a low target quality level indicates that the character region of the image to be processed is blurred and the bill image quality is low.
By adopting the technical scheme, the text detection model can be fused to the universal quality evaluation model from the input end, and the classification model can be enhanced to distinguish the character region and the non-character region in the bill, and the clear characters and the fuzzy characters.
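A minimal sketch of the input-side fusion of fig. 4, assuming a PyTorch implementation: the text feature map is scaled to the length and width of the image to be processed, stacked as a fourth input channel, and passed through the first convolution layers and a 1 × 1 convolution head. The layer sizes and the number of quality levels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputFusionClassifier(nn.Module):
    """Target classification model taking RGB + text feature map as a 4-channel input."""
    def __init__(self, num_levels: int = 3):
        super().__init__()
        # "First convolution layer": several conv layers extracting the first image feature.
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # "Second convolution layer": a 1x1 convolution used for class probability prediction.
        self.head = nn.Conv2d(64, num_levels, kernel_size=1)

    def forward(self, image: torch.Tensor, text_map: torch.Tensor) -> torch.Tensor:
        # Resize the text feature map to the length and width of the image to be processed.
        text_map = F.interpolate(text_map, size=image.shape[-2:], mode="bilinear",
                                 align_corners=False)
        x = torch.cat([image, text_map], dim=1)         # fourth channel carries text-region info
        x = self.features(x)                            # first image feature
        logits = self.head(x).mean(dim=(2, 3))          # output layer: global average, then softmax
        return F.softmax(logits, dim=1)                 # class probability per quality level
```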
Optionally, when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed may further include:
step 1, inputting an image to be processed into a target classification model to obtain a second image characteristic of the image to be processed output by a first convolution layer of the target classification model, wherein the second image characteristic is a result of performing characteristic pre-extraction on the image to be processed.
In this embodiment of the application, the second image feature is obtained by performing feature recognition on only the image to be processed by the target classification model, that is, the overall image feature of the image to be processed. The first convolution layer is a multilayer convolution layer in the hidden layer of the target classification model and is used for extracting image features.
And 2, inputting the second image characteristics and the text region characteristics into a characteristic layer of the target classification model to perform characteristic fusion on the second image characteristics and the text region characteristics to obtain third image characteristics.
In the embodiment of the present application, the third image feature is obtained by fusing an overall image feature (i.e., the second image feature) of the image to be processed and a text region feature.
And 3, inputting the third image characteristics into a second convolution layer of the target classification model to obtain the class probability output by the output layer, outputting the output result of the second convolution layer through the output layer, and using the class probability to evaluate the quality level of the image to be processed.
In the embodiment of the present application, the second convolution layer is a 1 × 1 convolution layer and is used for probability prediction.
And 4, determining the target quality grade of the image to be processed under the condition that the class probability is within the preset class probability threshold range.
In the embodiment of the present application, as shown in fig. 5, a text detection model may also be fused in a feature layer of a target classification model, that is, feature fusion may be performed on a text region feature extracted by the text detection model on an image to be processed and an image overall feature preliminarily extracted by a convolution layer of the target classification model on the image to be processed, and a classification task is identified on the fused feature.
Optionally, the feature fusing the second image feature and the text region feature may include at least one of the following ways:
scaling the second image characteristic and the text region characteristic to adjust the second image characteristic and the text region characteristic to be consistent in size; adding the second image features with the same size and the text region features to carry out feature splicing to obtain third image features;
multiplying the second image characteristic and the text region characteristic to obtain a characteristic matrix; performing pooling operation on the feature matrix to obtain a feature vector; and normalizing the feature vectors to perform bilinear pooling to obtain third image features.
In the embodiment of the present application, the second image feature and the text region feature are both represented by matrices, and the matrix splicing requires the same dimension, so that scaling processing needs to be performed on the second image feature and the text region feature before the matrix splicing, so that the two matrix dimensions are the same.
In this embodiment of the present application, when bilinear pooling is performed, the second image feature is multiplied by the text region feature, that is, a matrix representing the second image feature is multiplied by a matrix representing the text feature, and if the second image feature is M rows and the text region feature is N columns, the obtained fusion feature is an M × N dimensional matrix. The pooling operation may be a maximum pooling operation or an average pooling operation.
By adopting the technical scheme, the text detection model can be fused to the universal quality evaluation model from the characteristic layer of the universal quality evaluation model, and the classification model can be used for enhancing the distinction of character areas and non-character areas in the bill, and clear characters and fuzzy characters.
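The two fusion variants described above can be sketched as follows (PyTorch-style; the tensor shapes and the choice of average pooling are assumptions for the example): the first function scales and concatenates the second image feature with the text region feature, and the second multiplies them, pools the resulting feature matrix, and normalizes the vector (bilinear pooling).

```python
import torch
import torch.nn.functional as F

def fuse_by_concat(image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """image_feat: (B, C1, H, W); text_feat: (B, C2, h, w). Concatenation-style fusion."""
    text_feat = F.interpolate(text_feat, size=image_feat.shape[-2:], mode="bilinear",
                              align_corners=False)        # adjust both features to the same size
    return torch.cat([image_feat, text_feat], dim=1)      # feature splicing -> third image feature

def fuse_by_bilinear_pooling(image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Bilinear pooling: multiply the two features, pool the matrix, normalize the vector."""
    b, c1, h, w = image_feat.shape
    text_feat = F.interpolate(text_feat, size=(h, w), mode="bilinear", align_corners=False)
    c2 = text_feat.shape[1]
    x = image_feat.reshape(b, c1, h * w)                  # second image feature, M rows
    y = text_feat.reshape(b, c2, h * w)                   # text region feature, N "columns"
    matrix = torch.bmm(x, y.transpose(1, 2))              # (B, M, N) feature matrix
    vector = matrix.mean(dim=2)                           # average pooling (max pooling also possible)
    return F.normalize(vector, dim=1)                     # normalization -> third image feature
```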
Optionally, the extracting text region features of the image to be processed by using the text detection model further includes:
extracting features of an image to be processed by adopting a target intermediate layer in a target classification model to obtain an intermediate layer feature map, wherein the text region features comprise the intermediate layer feature map, and the target intermediate layer is obtained by extracting the text region features of training data by using a text detection model and using the text region features as supervision labels for supervision training;
when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed may further include: determining the mean square error loss of the intermediate layer characteristic diagram and the text characteristic diagram extracted by the text detection model; determining a first quality grade obtained by identifying an image to be processed by a target classification model; and taking the weighted sum of the mean square error loss and the first quality level as a target quality level to which the image to be processed belongs.
In the embodiment of the application, as shown in fig. 6, a text detection task can be added on top of the general quality assessment (classification) task by reference to a multi-task learning model; that is, in addition to the current classification task, task supervision for text region detection can be added. An intermediate layer can be selected in the backbone network of the original classification model to predict the character region probability, and the supervision label can use the region score map output by CRAFT on the same image (the image to be processed). For example, the probability map predicted by the intermediate layer for the image to be processed serves as the intermediate-layer feature map, the region score map (text region feature) predicted by the text detection model for the image to be processed serves as the supervision label, and the mean square error between the intermediate-layer feature map and the supervision label provides the task supervision on the text region feature. When the size of the intermediate-layer feature map does not match that of the supervision label, at least one of up-sampling and down-sampling is required for normalization.
In the embodiment of the present application, making the mean square error of the intermediate layer feature map and the supervision tag may be to use the intermediate layer feature map as an estimator and the supervision tag as an estimated quantity, and reflect the difference degree of the intermediate layer feature map and the supervision tag by calculating the mean square error of the intermediate layer feature map and the supervision tag, and specifically may be to calculate an expectation of a square difference between a probability predicted by the intermediate layer (intermediate layer feature map) and a probability predicted by the text detection model (supervision tag).
In this embodiment of the present application, the target classification model may also predict a first quality level of the image to be processed according to the overall image feature of the image to be processed, where the first quality level is obtained by evaluating the image without being combined with the text region feature, and therefore, a weight may be given to the first quality level and the above-mentioned mean square error loss, and the weighted sum of the first quality level and the above-mentioned mean square error loss is used as the final target quality level of the image to be processed, so that the text region feature is combined into a classification task of the image quality level.
By adopting the technical scheme, the text detection model can be fused to the general quality evaluation model from the perspective of multi-task supervision, and the classification model can be enhanced to distinguish the character region and the non-character region in the bill, and the clear characters and the fuzzy characters.
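As a rough sketch of the multi-task variant of fig. 6 (PyTorch-style; the classifier is assumed to expose both the intermediate-layer probability map and the first quality level, and the weights are illustrative assumptions): the intermediate map is supervised with a mean square error loss against the CRAFT region score map, and the final grade is the weighted sum described above.

```python
import torch
import torch.nn.functional as F

def multitask_quality_level(image: torch.Tensor, classifier, craft_model,
                            w_level: float = 1.0, w_mse: float = 0.5) -> torch.Tensor:
    """classifier is assumed to return (intermediate-layer probability map, first quality level)."""
    mid_prob_map, first_level = classifier(image)            # auxiliary text branch + class branch
    region_score, _ = craft_model(image)                     # supervision label from CRAFT
    if mid_prob_map.shape[-2:] != region_score.shape[-2:]:
        # Normalize sizes by resampling when the intermediate map and the label differ.
        region_score = F.interpolate(region_score, size=mid_prob_map.shape[-2:],
                                     mode="bilinear", align_corners=False)
    mse_loss = F.mse_loss(mid_prob_map, region_score)        # mean square error loss
    # Weighted sum of the mean square error loss and the first quality level, as described above;
    # the weights w_level / w_mse are assumptions for the sketch.
    return w_level * first_level + w_mse * mse_loss
```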
Optionally, before extracting the text region feature of the image to be processed by using the text detection model, the method further includes:
replacing the output layer of the text detection model with a fully connected layer, and training the text detection model on a classification task with training data to obtain the target classification model, where the text detection model serves as a pre-training model of the target classification model, the classification task is used to fine-tune the training parameters of the text detection model, and the classification task is the task of determining the image quality level.
In the embodiment of the application, the text detection model is used for performing a task of identifying the text region, and the output layer outputs the identification result in the model. The target classification model is used for classification tasks, and the probability of prediction is output in the model usually at a full connection layer. Therefore, in order to use the text detection model as a pre-training model of the target classification model, the last layer (i.e., the output layer) of the text detection model can be replaced by a fully-connected layer to perform a classification task.
In the embodiment of the present application, as shown in fig. 7, the text detection model may also be used as a pre-training model of the classification model, specifically, an output layer of the text detection model may be replaced with a full connection layer, training of a classification task is performed on the text detection model by using training data to obtain a target classification model, the text detection model is used as the pre-training model of the target classification model, and the classification task is used to perform fine tuning on a training parameter of the text detection model, and the classification task is a task for determining an image quality level.
Each neuron in the fully-connected layer is fully connected with all neurons in the layer before the neuron. The fully connected layer may integrate local information with category distinctiveness in the convolutional layer or the pooling layer.
Extracting text region features of the text region further includes: and inputting the image to be processed into the target classification model so as to extract the text region characteristic and the integral image characteristic of the image to be processed by utilizing the main network of the target classification model.
When the quality grade of the image to be processed is evaluated, determining the target quality grade to which the image to be processed belongs based on the overall image characteristic and the text region characteristic of the image to be processed further comprises: inputting the text region characteristics and the integral image characteristics into the full-connection layer to obtain class probability output by the full-connection layer, wherein the class probability is used for evaluating the quality grade of the image to be processed; and under the condition that the class probability is within the preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
In the embodiment of the application, a target classification model obtained by using a text detection model as a pre-training model can extract the overall image characteristics and the text region characteristics of an image from the image to be processed, and then the overall image characteristics and the text region characteristics of the image are input into a full connection layer to perform probability prediction, so that the target quality grade of the image to be processed is obtained.
By adopting the technical scheme, the text detection model can be fused to the general quality evaluation model from the perspective of the pre-training model, and the classification model can be enhanced to distinguish the character region and the non-character region in the bill, and the clear characters and the fuzzy characters.
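A brief sketch of the pre-training fusion of fig. 7 (PyTorch-style; the backbone attribute and layer sizes are assumptions): the detection output layer of a CRAFT-like model is replaced with a fully connected layer, and the resulting network is fine-tuned on the quality classification task.

```python
import torch
import torch.nn as nn

def build_classifier_from_text_detector(craft_model: nn.Module,
                                        feat_dim: int = 256,
                                        num_levels: int = 3) -> nn.Module:
    """Replace the output layer of the text detection model with a fully connected layer."""
    backbone = craft_model.backbone          # assumed attribute: the VGG-16-style feature extractor
    classifier = nn.Sequential(
        backbone,                            # extracts whole-image and text-region features
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(feat_dim, num_levels),     # fully connected layer replacing the detection output
        nn.Softmax(dim=1),                   # class probability for each quality level
    )
    return classifier

# Fine-tuning sketch: all parameters, including the pre-trained detection weights, are then
# updated on the classification task, typically with a small learning rate, e.g.
# optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-4)
```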
According to still another aspect of the embodiments of the present application, as shown in fig. 8, there is provided a document image processing apparatus based on a text detection model, including: an image obtaining module 801, configured to obtain an image to be processed, where the image to be processed includes a text area in which a receiving record of a target service is recorded; a text feature extraction module 803, configured to extract a text region feature of the text region, where the text region feature is used to describe a relationship between a pixel point and a character in the text region; the image classification module 805 is configured to determine, when the quality level of the image to be processed is evaluated, a target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed.
It should be noted that the image obtaining module 801 in this embodiment may be configured to execute step S202 in this embodiment, the text feature extracting module 803 in this embodiment may be configured to execute step S204 in this embodiment, and the image classifying module 805 in this embodiment may be configured to execute step S206 in this embodiment.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.
Optionally, the text feature extraction module is specifically configured to: obtaining an intermediate image meeting the target size requirement by carrying out scaling processing on the image to be processed; utilizing a text detection model to perform upsampling on the intermediate image so as to extract single character features; combining the extracted single character features to obtain multi-character features; and determining the probability that each pixel point of the intermediate image belongs to each character center in the multi-character feature to obtain a text feature map.
Optionally, the image classification module is specifically configured to: scaling the text characteristic graph to adjust the text characteristic graph to be consistent with the length and the width of the image to be processed; inputting three color components of an image to be processed into a target classification model as image information, and inputting a text feature map with the length and width consistent with the length and width of the image to be processed into the target classification model as additional image information so that the target classification model can identify the image to be processed by utilizing the text feature map; and determining the target quality grade of the image to be processed according to the recognition result of the target classification model by utilizing the text feature map.
Optionally, the image classification module is further configured to: inputting the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain a first image feature; inputting the first image characteristics into a second convolution layer of the target classification model to obtain the class probability output by the output layer, outputting the output result of the second convolution layer through the output layer, and using the class probability to evaluate the quality grade of the image to be processed; and under the condition that the class probability is within the preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
Optionally, the image classification module is further configured to: inputting the image to be processed into a target classification model to obtain a second image characteristic of the image to be processed output by the first convolution layer of the target classification model, wherein the second image characteristic is a result of performing characteristic pre-extraction on the image to be processed; inputting the second image characteristics and the text region characteristics into a characteristic layer of the target classification model so as to perform characteristic fusion on the second image characteristics and the text region characteristics to obtain third image characteristics; inputting the third image characteristics into a second convolution layer of the target classification model to obtain the class probability output by the output layer, outputting the output result of the second convolution layer through the output layer, and using the class probability to evaluate the quality grade of the image to be processed; and under the condition that the class probability is within the preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
Optionally, the image classification module further includes a feature fusion unit, configured to: scaling the second image characteristic and the text region characteristic to adjust the second image characteristic and the text region characteristic to be consistent in size; adding the second image features with the same size and the text region features to carry out feature splicing to obtain third image features; multiplying the second image characteristic and the text region characteristic to obtain a characteristic matrix; performing pooling operation on the feature matrix to obtain a feature vector; and normalizing the feature vectors to perform bilinear pooling to obtain third image features.
Optionally, the text feature extraction module is further configured to: and extracting the features of the image to be processed by adopting a target intermediate layer in the target classification model to obtain an intermediate layer feature map, wherein the text region features comprise the intermediate layer feature map, and the target intermediate layer is obtained by extracting the text region features from the training data by using a text detection model and using the text region features as supervision labels for supervision training.
Optionally, the image classification module is further configured to: determining the mean square error loss of the intermediate layer characteristic diagram and the text characteristic diagram extracted by the text detection model; determining a first quality grade obtained by identifying an image to be processed by a target classification model; and taking the weighted sum of the mean square error loss and the first quality level as a target quality level to which the image to be processed belongs.
Optionally, the apparatus for determining an image quality grade further includes a pre-training model module, configured to: replacing an output layer of the text detection model with a fully connected layer, and training a classification task on the text detection model by using training data to obtain the target classification model, wherein the text detection model serves as a pre-training model of the target classification model, the classification task is used to fine-tune the training parameters of the text detection model, and the classification task is a task of determining the image quality grade.
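By way of illustration only, the following sketch shows one way of re-heading a pre-trained text detection model for the classification task: the detection output layer is discarded, a fully connected layer is attached, and all parameters remain trainable so the classification task fine-tunes the pre-trained weights. The stand-in backbone, layer sizes, and training step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TargetClassificationModel(nn.Module):
    """Sketch: keep the pre-trained detection layers, replace the detection
    output layer with a fully connected classification layer."""
    def __init__(self, detection_backbone, feature_dim=256, num_levels=3):
        super().__init__()
        self.backbone = detection_backbone                 # pre-trained detection layers
        self.fc = nn.Linear(feature_dim, num_levels)       # replacement fully connected layer

    def forward(self, image):
        features = self.backbone(image)                    # (B, feature_dim, H, W)
        return self.fc(features.mean(dim=(2, 3)))          # global pooling, then class logits

# Illustrative fine-tuning step: the backbone parameters stay trainable, so the
# classification task adjusts (fine-tunes) the pre-trained detection weights.
backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU())   # stand-in backbone
model = TargetClassificationModel(backbone)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
images, labels = torch.rand(4, 3, 224, 224), torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```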
Optionally, the text feature extraction module is further configured to: inputting the image to be processed into the target classification model, so as to extract the text region features and the overall image features of the image to be processed by using the backbone network of the target classification model.
Optionally, the image classification module is further configured to: inputting the text region features and the overall image features into the fully connected layer to obtain a class probability output by the fully connected layer, wherein the class probability is used for evaluating the quality grade of the image to be processed; and under the condition that the class probability is within a preset class probability threshold range, determining the target quality level to which the image to be processed belongs.
According to another aspect of the embodiments of the present application, an electronic device is provided, as shown in fig. 9, and includes a memory 901, a processor 903, a communication interface 905, and a communication bus 907, where a computer program operable on the processor 903 is stored in the memory 901, the memory 901 and the processor 903 communicate through the communication interface 905 and the communication bus 907, and the steps of the method are implemented when the processor 903 executes the computer program.
The memory and the processor in the electronic device communicate with the communication interface through the communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
There is also provided, in accordance with yet another aspect of an embodiment of the present application, a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the embodiments described above.
Optionally, in an embodiment of the present application, the computer program product or the computer program comprises program code for causing a processor to execute the following steps (sketched in code after the list):
acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of a target service;
extracting text region characteristics of the text region, wherein the text region characteristics are used for describing the relationship between pixel points and characters in the text region;
when the quality grade of the image to be processed is evaluated, the target quality grade to which the image to be processed belongs is determined based on the overall image characteristic and the text region characteristic of the image to be processed.
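By way of illustration only, the three steps above can be arranged as the following end-to-end sketch, in which extract_text_region_features and classify_quality are hypothetical placeholders standing in for the text detection and target classification models described in this application.

```python
import torch

def extract_text_region_features(image: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real implementation would run the text detection model and return
    # per-pixel features describing the relationship between pixels and characters.
    return torch.rand(1, 1, *image.shape[-2:])

def classify_quality(image: torch.Tensor, text_region_features: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real implementation would fuse whole-image features with the
    # text region features and return class probabilities over the quality levels.
    return torch.softmax(torch.rand(1, 3), dim=1)

def determine_quality_level(image: torch.Tensor) -> int:
    text_region_features = extract_text_region_features(image)   # step 2: extract text region features
    probs = classify_quality(image, text_region_features)        # step 3: evaluate with both feature kinds
    return int(probs.argmax())                                    # target quality level

level = determine_quality_level(torch.rand(1, 3, 224, 224))
```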
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are implemented, reference may be made to the above embodiments, and corresponding technical effects can be achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for determining a level of image quality, comprising:
acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of a target service;
extracting text region features of the text region, wherein the text region features are used for describing the relationship between pixel points and characters in the text region;
when the quality grade of the image to be processed is evaluated, determining a target quality grade to which the image to be processed belongs based on the overall image characteristic and the text region characteristic of the image to be processed.
2. The method of claim 1, wherein extracting text region features of the text region comprises:
obtaining an intermediate image meeting the target size requirement by carrying out scaling processing on the image to be processed;
and performing feature recognition on the intermediate image to obtain a text feature map, wherein the text region features comprise the text feature map, the text feature map is used for representing the probability that pixel points are located at character centers, and the character centers are the center positions of polygon frames which are generated by a text detection model and surround each character in the text region.
3. The method of claim 2, wherein performing feature recognition on the intermediate image to obtain a text feature map comprises:
utilizing a text detection model to perform upsampling on the intermediate image so as to extract single character features;
combining the extracted single character features to obtain multi-character features;
and determining the probability that each pixel point on the intermediate image belongs to each character center in the multi-character feature to obtain the text feature map.
4. The method according to claim 3, wherein in evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed comprises:
scaling the text feature map to adjust the text feature map to be consistent with the length and the width of the image to be processed;
inputting the three color components of the image to be processed into a target classification model as image information, and inputting the text feature map with the length and the width consistent with those of the image to be processed into the target classification model as additional image information so that the target classification model can identify the image to be processed by utilizing the text feature map;
and determining the target quality grade to which the image to be processed belongs according to the recognition result of the target classification model by utilizing the text feature map to recognize the image to be processed.
5. The method according to claim 4, wherein determining the target quality level to which the image to be processed belongs according to the recognition result of the target classification model by using the text feature map comprises:
inputting the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain a first image feature;
inputting the first image characteristics into a second convolution layer of the target classification model to obtain a class probability output by an output layer, wherein the output result of the second convolution layer is output through the output layer, and the class probability is used for evaluating the quality grade of the image to be processed;
and under the condition that the class probability is within a preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
6. The method of claim 1, wherein when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed comprises:
inputting the image to be processed into a target classification model to obtain a second image characteristic of the image to be processed output by a first convolution layer of the target classification model, wherein the second image characteristic is a result of performing characteristic pre-extraction on the image to be processed;
inputting the second image features and the text region features into a feature layer of the target classification model to perform feature fusion on the second image features and the text region features to obtain third image features;
inputting the third image characteristics into a second convolution layer of the target classification model to obtain a class probability output by an output layer, wherein the output result of the second convolution layer is output through the output layer, and the class probability is used for evaluating the quality grade of the image to be processed;
and under the condition that the class probability is within a preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
7. The method of claim 6, wherein feature fusing the second image features and the text region features to obtain third image features comprises at least one of:
scaling the second image features and the text region features to adjust the second image features and the text region features to be consistent in size; adding the second image features with the consistent sizes and the text region features for feature splicing to obtain third image features;
multiplying the second image characteristic and the text region characteristic to obtain a characteristic matrix; performing pooling operation on the feature matrix to obtain a feature vector; and normalizing the feature vectors to perform bilinear pooling to obtain the third image feature.
8. The method of claim 1, wherein extracting text region features of the text region further comprises:
extracting features of the image to be processed by adopting a target intermediate layer in a target classification model to obtain an intermediate layer feature map, wherein the text region features comprise the intermediate layer feature map, the target intermediate layer is at least one layer of feature extraction network layer in the target classification model, and the target intermediate layer is obtained by extracting the text region features from training data by using a text detection model and using the text region features as a supervision label for supervision training;
when the quality level of the image to be processed is evaluated, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed comprises: determining the mean square error loss of the intermediate layer characteristic diagram and the text characteristic diagram extracted by the text detection model; determining a first quality grade obtained by identifying the image to be processed by the target classification model; and taking the weighted sum of the mean square error loss and the first quality level as the target quality level to which the image to be processed belongs.
9. The method of claim 1, wherein prior to extracting text region features of the text region, the method further comprises:
replacing an output layer of a text detection model with a full connection layer, training a classification task on the text detection model by using training data to obtain a target classification model, taking the text detection model as a pre-training model of the target classification model, and finely adjusting training parameters of the text detection model by using the classification task, wherein the classification task is a task for determining the image quality grade;
extracting text region features of the text region further comprises: inputting the image to be processed into the target classification model so as to extract the text region characteristic and the overall image characteristic of the image to be processed by utilizing the target classification model;
when the quality level of the image to be processed is evaluated, determining the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed further comprises: inputting the text region characteristics and the overall image characteristics into the full-connection layer to obtain class probability output by the full-connection layer, wherein the class probability is used for evaluating the quality level of the image to be processed; and under the condition that the class probability is within a preset class probability threshold value range, determining the target quality level to which the image to be processed belongs.
10. An apparatus for determining a quality level of an image, comprising:
the image acquisition module is used for acquiring an image to be processed, wherein the image to be processed comprises a text area recorded with an acceptance record of a target service;
the text feature extraction module is used for extracting text region features of the text region, wherein the text region features are used for describing the relationship between pixel points and characters in the text region;
and the image quality evaluation module is used for determining the target quality grade of the image to be processed based on the overall image characteristic and the text region characteristic of the image to be processed when the quality grade of the image to be processed is evaluated.
11. An electronic device comprising a memory, a processor, a communication interface and a communication bus, wherein the memory stores a computer program operable on the processor, and the memory and the processor communicate with the communication interface via the communication bus, wherein the processor implements the method of any of claims 1 to 9 when executing the computer program.
12. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 9.
CN202011147351.7A 2020-10-23 2020-10-23 Method, device and equipment for determining image quality grade and computer readable medium Pending CN112365451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147351.7A CN112365451A (en) 2020-10-23 2020-10-23 Method, device and equipment for determining image quality grade and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011147351.7A CN112365451A (en) 2020-10-23 2020-10-23 Method, device and equipment for determining image quality grade and computer readable medium

Publications (1)

Publication Number Publication Date
CN112365451A true CN112365451A (en) 2021-02-12

Family

ID=74511910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011147351.7A Pending CN112365451A (en) 2020-10-23 2020-10-23 Method, device and equipment for determining image quality grade and computer readable medium

Country Status (1)

Country Link
CN (1) CN112365451A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222985A (en) * 2021-06-04 2021-08-06 中国人民解放军总医院 Image processing method, image processing device, computer equipment and medium
CN113448925A (en) * 2021-06-25 2021-09-28 东莞市小精灵教育软件有限公司 Test question picture optimization method and device, computer equipment and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005114548A1 (en) * 2004-05-20 2005-12-01 Enseal Systems Limited A method for the assessment of the quality and usability of digital cheque images
WO2018036276A1 (en) * 2016-08-22 2018-03-01 平安科技(深圳)有限公司 Image quality detection method, device, server and storage medium
WO2019057067A1 (en) * 2017-09-20 2019-03-28 众安信息技术服务有限公司 Image quality evaluation method and apparatus
US20190102603A1 (en) * 2017-09-29 2019-04-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for determining image quality
CN108710882A (en) * 2018-05-11 2018-10-26 武汉科技大学 A kind of screen rendering text recognition method based on convolutional neural networks
CN110866471A (en) * 2019-10-31 2020-03-06 Oppo广东移动通信有限公司 Face image quality evaluation method and device, computer readable medium and communication terminal


Similar Documents

Publication Publication Date Title
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN112766188B (en) Small target pedestrian detection method based on improved YOLO algorithm
CN111754446A (en) Image fusion method, system and storage medium based on generation countermeasure network
Türkyılmaz et al. License plate recognition system using artificial neural networks
CN112560831A (en) Pedestrian attribute identification method based on multi-scale space correction
CN113920468B (en) Multi-branch pedestrian detection method based on cross-scale feature enhancement
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN114841972A (en) Power transmission line defect identification method based on saliency map and semantic embedded feature pyramid
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
Saidane et al. Robust binarization for video text recognition
CN114331946A (en) Image data processing method, device and medium
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN113688821B (en) OCR text recognition method based on deep learning
CN117372898A (en) Unmanned aerial vehicle aerial image target detection method based on improved yolov8
CN114926826A (en) Scene text detection system
CN111476226B (en) Text positioning method and device and model training method
CN111814562A (en) Vehicle identification method, vehicle identification model training method and related device
Li et al. A new algorithm of vehicle license plate location based on convolutional neural network
Xiang et al. Recognition of characters on curved metal workpiece surfaces based on multi-exposure image fusion and deep neural networks
CN115909378A (en) Document text detection model training method and document text detection method
CN111583502B (en) Renminbi (RMB) crown word number multi-label identification method based on deep convolutional neural network
CN113706636A (en) Method and device for identifying tampered image
CN117636080B (en) Scene classification method, device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination