CN112365451B - Method, apparatus, device, and computer-readable medium for determining an image quality level - Google Patents

Method, apparatus, device, and computer-readable medium for determining an image quality level

Info

Publication number
CN112365451B
Authority
CN
China
Prior art keywords
image
processed
text
feature
target
Prior art date
Legal status
Active
Application number
CN202011147351.7A
Other languages
Chinese (zh)
Other versions
CN112365451A (en)
Inventor
毕姚姚
陈琳
吴伟佳
李羽
Current Assignee
Weimin Insurance Agency Co Ltd
Original Assignee
Weimin Insurance Agency Co Ltd
Priority date
Filing date
Publication date
Application filed by Weimin Insurance Agency Co Ltd
Priority to CN202011147351.7A
Publication of CN112365451A
Application granted
Publication of CN112365451B
Status: Active


Classifications

    • G06T 7/0002 Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/24 Pattern recognition; classification techniques
    • G06N 3/045 Neural networks; combinations of networks
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06T 2207/10004 Still image; photographic image
    • G06T 2207/20081 Training; learning
    • G06T 2207/30168 Image quality inspection
    • G06T 2207/30176 Document (subject of image; context of image processing)
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The application relates to a method, an apparatus, a device, and a computer-readable medium for determining an image quality level. The method includes the following steps: acquiring an image to be processed, where the image to be processed contains a text region recording an acceptance record of a target service; extracting text region features of the text region, where the text region features are used to describe the relationship between pixel points and characters in the text region; and, when evaluating the quality level of the image to be processed, determining the target quality level of the image to be processed based on the overall image features and the text region features of the image to be processed. The application addresses the technical problem that the evaluation result for the text quality of a bill is inaccurate when the quality of the bill image is evaluated.

Description

Method, apparatus, device, and computer-readable medium for determining an image quality level
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a computer readable medium for determining an image quality level.
Background
The most important checks of bill quality are whether the characters on the bill are of a proper size, whether local keywords are legible, whether handwriting and printing are continuous, and the like. Manual bill auditing is labor-intensive, has a long auditing cycle, and gives a poor user experience.
At present, the related art generally uses a bill image recognition model (such as a CV operator, a machine learning model, or a bill classification depth model) to automatically recognize the quality of a bill image. Such models are generally based on manually or automatically constructed features generated from the whole image, and the methods adopted are general methods for natural image quality recognition. Consequently, for bill image scenarios in which the quality of the character regions is the focus of recognition, the evaluation result for the text quality of the bill is inaccurate when the quality of the bill image is evaluated.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The application provides a method, an apparatus, a device, and a computer-readable medium for determining an image quality level, which are used to solve the technical problem that the evaluation result for the text quality of a bill is inaccurate when the quality of the bill image is evaluated.
According to an aspect of the embodiments of the present application, there is provided a method for determining an image quality level, including: acquiring an image to be processed, where the image to be processed contains a text region recording an acceptance record of a target service; extracting text region features of the text region, where the text region features are used to describe the relationship between pixel points and characters in the text region; and, when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed.
According to another aspect of the embodiments of the present application, there is provided an apparatus for determining an image quality level, including: an image acquisition module configured to acquire an image to be processed, where the image to be processed contains a text region recording an acceptance record of the target service; a text feature extraction module configured to extract text region features of the text region, where the text region features are used to describe the relationship between pixel points and characters in the text region; and an image classification module configured to determine, when evaluating the quality level of the image to be processed, the target quality level to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed.
According to another aspect of the embodiments of the present application, there is provided an electronic device including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, the memory and the processor communicate through the communication bus and the communication interface, and the processor implements the above method when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the above-described method.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
The technical solution of the application is: acquiring an image to be processed, where the image to be processed contains a text region recording an acceptance record of a target service; extracting text region features of the text region, where the text region features are used to describe the relationship between pixel points and characters in the text region; and, when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed. The application thereby addresses the technical problem that the evaluation result for the text quality of a bill is inaccurate when the quality of the bill image is evaluated.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment for an alternative image quality level determination method according to an embodiment of the present application;
FIG. 2 is a flowchart of an alternative method for determining an image quality level according to an embodiment of the present application;
FIG. 3 is an alternative text region feature extraction flow chart provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative text detection model fused at the input end according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative text detection model fused at the feature layer according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative text detection model fused by multi-task learning according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative text detection model used as a pre-training model according to an embodiment of the present application;
FIG. 8 is a block diagram of an alternative image quality level determination apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only to facilitate the description of the present application and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
First, some of the terms or terminology appearing in the description of the embodiments of the application are explained as follows:
Neural network: the neural network may be composed of neural units, which may refer to an arithmetic unit with x s and an intercept b as inputs, and the output of the arithmetic unit may be:
Where s=1, 2, … … n, n is a natural number greater than 1, W s is the weight of x s, and b is the bias of the neural unit. f is an activation function (activation functions) of the neural unit for introducing a nonlinear characteristic into the neural network to convert an input signal in the neural unit to an output signal. The output signal of the activation function may be used as an input to the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining together a number of the above-described single neural units, i.e., the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of a previous layer to extract features of the local receptive field, which may be an area composed of several neural units.
Deep neural network: deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, many of which are not particularly metrics. From DNNs, which are divided by the location of the different layers, the neural networks inside the DNNs can be divided into three categories: input layer, hidden layer, output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. For example, layers in a fully connected neural network are fully connected, that is, any neuron in layer i must be connected to any neuron in layer i+1. Although DNN appears to be complex, it is not really complex in terms of the work of each layer, simply the following linear relational expression: wherein/> Is an input vector,/>Is an output vector,/>Is an offset vector,/>Is a weight matrix (also called coefficient)/>Is an activation function. Each layer is only for input vectors/>The output vector/>, is obtained through the simple operation. Since the number of DNN layers is large, the coefficient/>And offset vector/>And thus a large number. The definition of these parameters in DNN is as follows: by a factor/>The following are examples: it is assumed that in a three-layer DNN, the linear coefficients of the 4 th neuron of the second layer to the 2 nd neuron of the third layer are defined as/>. Superscript 3 represents coefficient/>The number of layers is located, and the subscripts correspond to the output third layer index 2 and the input second layer index 4. The summary is: the coefficients from the kth neuron of the L-1 th layer to the jth neuron of the L-th layer are defined as/>. It is noted that the input layer is absent/>Parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the greater the "capacity", meaning that it can accomplish more complex learning tasks. The process of training the deep neural network, i.e. learning the weight matrix, has the final objective of obtaining a weight matrix (a weight matrix formed by a number of layers of vectors W) for all layers of the trained deep neural network.
Convolutional neural network: the convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input image or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. The same learned image information can be used for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation. The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
CRAFT: the residual network, one of the deep neural networks, is characterized by easy optimization and can improve accuracy by increasing considerable depth. The residual blocks inside the deep neural network are connected in a jumping mode, and the gradient disappearance problem caused by depth increase in the deep neural network is relieved.
Multi-task learning: a generalized transfer mechanism whose main objective is to improve generalization ability by using the domain-specific information implicit in the training signals of multiple related tasks, which is achieved by training the multiple tasks in parallel using a shared representation.
Pixel value: the pixel value of the image may be a Red Green Blue (RGB) color value and the pixel value may be a long integer representing the color. For example, the pixel value is 256×red+100×green+76blue, where Blue represents the Blue component, green represents the Green component, and Red represents the Red component. The smaller the value, the lower the luminance, the larger the value, and the higher the luminance in each color component. For a gray image, the pixel value may be a gray value.
In the related art, a bill image recognition model (such as a CV operator, a machine learning model, or a bill classification depth model) is generally used to automatically recognize the quality of a bill image. Such models are generally based on manually or automatically constructed features generated from the whole image, and the methods adopted are general methods for natural image quality recognition. Consequently, for bill image scenarios in which the quality of the character regions is the focus of recognition, the evaluation result for the text quality of the bill is inaccurate when the quality of the bill image is evaluated.
In order to solve the problems mentioned in the background art, according to an aspect of the embodiments of the present application, an embodiment of a method of determining an image quality level is provided.
Alternatively, in the embodiment of the present application, the above method of determining the image quality level may be applied to a hardware environment constituted by the terminal 101 and the server 103 shown in fig. 1. As shown in fig. 1, the server 103 is connected to the terminal 101 through a network and may be used to provide services to the terminal or to a client installed on the terminal. A database 105 may be provided on the server or independently of the server to provide data storage services for the server 103. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network, and the terminal 101 includes, but is not limited to, a PC, a mobile phone, a tablet computer, and the like.
A method for determining an image quality level in an embodiment of the present application may be performed by the server 103, or may be performed by the server 103 and the terminal 101 together, as shown in fig. 2, and the method may include the following steps:
Step S202, acquiring an image to be processed, where the image to be processed contains a text region recording an acceptance record of a target service.
The bill image processing method based on a text detection model in the embodiment of the application can be applied to business scenarios in which an applicant applies to transact some business and there are certain requirements on the clarity of the uploaded bill images. For example, when the applicant transacts a claim business, the bill images need to be uploaded to a claim settlement business system, and the claim settlement business system judges the clarity of the currently acquired bill images in order to determine whether to proceed to the acceptance stage or to notify the applicant to upload again according to the system's prompts. The business scenario may also be a financial business, involving personal information, transacted by the applicant at a bank, which is not limited in the embodiment of the present application.
Optionally, the embodiment of the application takes the claim business scenario as an example to explain the bill image processing method based on the text detection model. The applicant can upload the image to be processed to the claim business system, where the image to be processed is a bill image submitted when applying for acceptance of the target service; for example, the bill image is the insurance bill image information of the user.
The background server of the claim service system may receive the image to be processed uploaded by the client of the applicant (i.e., the client applying for accepting the target service), thereby acquiring the image to be processed.
In step S204, text region features of the text region are extracted, where the text region features are used to describe a relationship between pixels and characters in the text region.
In the embodiment of the application, when the quality of a bill image is automatically recognized, in order to fit the particularity that the text regions of the bill image are the key recognition regions, a deep text detection model can be used to detect text regions in the bill image so as to find the positions of the text in the image. Common deep text detection models include the CTPN, SegLink, EAST, PSENet, LSAE, ATRR, and CRAFT models.
Step S206, when evaluating the quality level of the image to be processed, determining the target quality level to which the image to be processed belongs based on the overall image characteristics and the text region characteristics of the image to be processed.
According to the method and apparatus for evaluating bill image quality in the embodiments of the application, the text region features extracted from the image to be processed by the text detection model can be incorporated into the target classification model, so that the text detection model is fused into the target classification model and the evaluation accuracy of bill image quality is improved. The target classification model can be obtained by training a convolutional neural network model, serving as the initial training model, with training data carrying labeling information, where the labeling information labels at least the image quality level of the training data.
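A minimal training sketch for such a target classification model is shown below; the backbone architecture, the number of quality levels, and the optimizer settings are assumptions made only to illustrate how training data labeled with quality levels might be used:

```python
import torch
import torch.nn as nn

NUM_QUALITY_LEVELS = 3  # assumed number of quality levels

# Assumed stand-in backbone for the convolutional neural network used as the initial training model.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, NUM_QUALITY_LEVELS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, H, W) bill images; labels: (B,) quality-level indices from the labeling information."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch, only to show the call signature.
print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, NUM_QUALITY_LEVELS, (4,))))
```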
By adopting the above technical solution, fusing the text detection model into the general quality assessment model can solve the technical problem that the evaluation result of the bill text quality is inaccurate when the bill image quality is evaluated.
Alternatively, as shown in fig. 3, step S204 may include the steps of:
Step S2041, obtaining an intermediate image meeting the target size requirement by scaling the image to be processed;
Step S2042, up-sampling the intermediate image by using the text detection model to extract single character features;
Step S2043, combining the extracted single character features to obtain multi-character features;
And step S2044, determining the probability that each pixel point of the intermediate image belongs to the center of each character in the multi-character feature, and obtaining a text feature map.
In the embodiment of the application, the image to be processed can be downsampled, i.e., scaled, so that the length and width of the obtained intermediate image meet the target size requirement, where the target size is determined from the size of the image to be processed. An image of a single character is then cut out of the intermediate image by means of upsampling, and the character region is segmented with a watershed algorithm to obtain the single characters; at this point each character is enclosed in a polygonal box, and the center of the polygonal box is the character center of that character. The coordinates of the polygonal boxes of the segmented single characters are converted back to coordinates on the image to be processed, i.e., the single characters are combined, and continuous multi-character strings can be obtained according to the order of the coordinates. Finally, the probability that each pixel belongs to a character center is computed pixel by pixel to obtain the text feature map.
In the embodiment of the application, the CRAFT model can be adopted as the text detection model. The backbone network of the CRAFT model adopts the backbone of VGG-16; VGG-16 is a deep convolutional neural network, the backbone is the main part of a network structure, and in the CV field a backbone network generally refers to the part of a network that extracts features from an image. When CRAFT extracts the text region features of the image to be processed, the procedure is similar to the downsample-then-upsample method of a U-Net structure, and multiple downsampling steps can be performed. Downsampling can scale the length and width of the input picture to the multiple of 32 nearest to the original length and width values; for example, for an input picture of 500×400, the picture is scaled to 512×416, which effectively avoids pixel drift during segmentation. Pixel drift, i.e., the phase shift of a digital image, refers to the jitter phenomenon observed in a series of digital images repeatedly acquired from a stationary optical image. The downsampled image on which the upsampling and feature-merging operations are to be performed is the intermediate image, and after the CRAFT model performs upsampling and feature merging on the intermediate image, the model outputs a two-channel feature map: the region score map and the affinity score map, which are respectively the probability of a single character's center region and the probability of the center between adjacent character regions.
Since the bill image quality recognition scenario needs to focus on the features of text regions and weaken the features of non-text regions, text regions and non-text regions can be distinguished by the region score map, i.e., the text region features. Under normal circumstances, when the degree of blur of a text region differs, the region probability values also differ, so the degree of blur of the image can be distinguished by using the text region probability values.
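The extraction flow of steps S2041 to S2044 might be organized as in the following sketch; the craft_model object and its two-channel output interface are assumptions standing in for a real CRAFT implementation, and the watershed-based character post-processing is omitted:

```python
import torch
import torch.nn.functional as F

def extract_text_region_features(image, craft_model):
    """image: (1, 3, H, W) tensor of the image to be processed."""
    _, _, h, w = image.shape
    # Scale the length and width to the nearest multiple of 32 (the intermediate image),
    # which helps avoid pixel drift during segmentation.
    new_h, new_w = max(32, round(h / 32) * 32), max(32, round(w / 32) * 32)
    intermediate = F.interpolate(image, size=(new_h, new_w),
                                 mode="bilinear", align_corners=False)
    # Assumed interface: the model returns, per pixel, two channels:
    # channel 0 = region score (probability of a character-center region),
    # channel 1 = affinity score (probability of the center between adjacent characters).
    with torch.no_grad():
        scores = craft_model(intermediate)
    region_score, affinity_score = scores[:, 0:1], scores[:, 1:2]
    return region_score, affinity_score
```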
The present application provides 4 methods of fusing text detection models into a generic quality assessment model, and various aspects of the present application are described in detail below in conjunction with fig. 4-7.
Alternatively, step S206 may include the steps of:
In step S2061, the text feature map is scaled so that it matches the length and width of the image to be processed.
In the embodiment of the application, both upsampling and downsampling can scale an image, so the target feature map can be adjusted to match the length and width of the image to be processed. For an image I of size M×N, downsampling it by a factor of s yields an image of resolution (M/s)×(N/s), where s is a common divisor of M and N. Upsampling may use interpolation, i.e., a suitable interpolation algorithm is used to insert new pixels between the original image pixels.
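A small sketch of this scaling step, assuming bilinear interpolation is used (any suitable interpolation algorithm could be substituted):

```python
import torch.nn.functional as F

def resize_to_image(text_feature_map, image):
    """Resize a (1, 1, h, w) text feature map to the length and width of the (1, 3, H, W) image."""
    _, _, H, W = image.shape
    return F.interpolate(text_feature_map, size=(H, W),
                         mode="bilinear", align_corners=False)
```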
In step S2062, the three color components of the image to be processed are input as image information into the target classification model, and the text feature map which is consistent with the length and width of the image to be processed is input as additional image information into the target classification model, so that the target classification model can identify the image to be processed by using the text feature map.
In the field of computer vision, the size of image input information is generally height by width by number of channels, and the channel input of a color image is generally three-channel color data, such as the three channels of RGB, HSV, or YUV.
In the embodiment of the application, as shown in fig. 4, the text region features extracted from the image to be processed by the text detection model (namely, the text feature map adjusted to match the length and width of the image to be processed) can be input through a fourth channel as one dimension of the image input information, so that the target classification model's ability to distinguish text regions from non-text regions in the bill, and clear text from blurred text, is enhanced.
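The input-end fusion of FIG. 4 might be sketched as follows; the channel counts are assumptions, and the point is only that the first convolution layer accepts four channels, with the text feature map concatenated to the three color components:

```python
import torch
import torch.nn as nn

# The first convolution accepts 4 channels: the three color components plus the text feature map.
first_conv = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3, padding=1)

def fuse_at_input(image, text_feature_map):
    """image: (1, 3, H, W); text_feature_map: (1, 1, H, W), already resized to match the image."""
    x = torch.cat([image, text_feature_map], dim=1)  # (1, 4, H, W) input information
    return first_conv(x)
```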
Step S2063, determining the target quality grade of the image to be processed according to the recognition result of the target classification model for recognizing the image to be processed by using the text feature map.
Optionally, step S2063 of determining, according to the target classification model, the target quality level to which the image to be processed belongs by using the recognition result of recognizing the image to be processed by using the text feature map includes the steps of:
Step S20631, inputting the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain a first image feature;
Step S20632, inputting the first image features into a second convolution layer of the target classification model to obtain class probabilities output by an output layer, wherein the output results of the second convolution layer are output by the output layer, and the class probabilities are used for evaluating the quality grade of the image to be processed;
And step S20633, determining the target quality grade of the image to be processed under the condition that the class probability is within the preset class probability threshold value range.
In the embodiment of the application, the first convolution layer consists of multiple convolutional layers in the hidden layers of the target classification model and is used to extract image features, and the second convolution layer is a 1×1 convolution layer used to compute probabilities.
In the embodiment of the present application, a convolution layer may include a plurality of convolution operators, also called kernels, whose role in image processing is that of a filter extracting specific information from the input image matrix. A convolution operator may in essence be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid over the input image along the horizontal direction one pixel at a time (or several pixels at a time, depending on the stride), thereby completing the task of extracting a specific feature from the image. The size of the weight matrix is related to the size of the image to be processed.
It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image (the image to be processed and the target feature map), and the weight matrix extends over the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output of a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows by columns), i.e., multiple matrices of the same shape, are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where this dimension is understood to be determined by the "multiple" described above.
Different weight matrices may be used to extract different features in the image, e.g., one weight matrix may be used to extract image edge information, another weight matrix may be used to extract a particular color of the image, yet another weight matrix may be used to blur unwanted noise in the image, etc. The plurality of weight matrixes have the same size (row and column), the feature images extracted by the plurality of weight matrixes with the same size have the same size, and the extracted feature images with the same size are combined to form the output of convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the neural network can make correct predictions. In the embodiment of the application, the first image feature can be obtained through the convolution layer, and the first image feature is obtained by jointly recognizing the image to be processed and the text region features.
In the embodiment of the application, the first image feature enters the 1×1 convolution layer for probability prediction, and the output layer outputs the prediction result (i.e., the class probability that the image to be processed belongs to each class). Finally, the target quality level to which the image to be processed belongs is determined according to the preset class probability threshold range: a high quality level means that the character regions of the image to be processed are clear and the bill image quality is high, while a low quality level means that the character regions of the image to be processed are blurred and the bill image quality is low.
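Steps S20631 to S20633 might look like the following sketch, in which a stack of convolution layers plays the role of the first convolution layer and a 1×1 convolution plays the role of the second; the layer sizes, the pooling used as the output layer, the batch size of one, and the probability threshold are all assumptions:

```python
import torch
import torch.nn as nn

NUM_QUALITY_LEVELS = 3   # assumed
PROB_THRESHOLD = 0.5     # assumed preset class probability threshold

first_conv = nn.Sequential(                       # multi-layer "first convolution layer"
    nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)
second_conv = nn.Conv2d(32, NUM_QUALITY_LEVELS, kernel_size=1)   # 1x1 convolution layer

def classify(image_with_text_channel):
    """image_with_text_channel: (1, 4, H, W) image plus the text feature map channel."""
    feat = first_conv(image_with_text_channel)      # first image feature
    logits = second_conv(feat).mean(dim=(2, 3))     # output layer: spatial average to (1, C)
    probs = torch.softmax(logits, dim=1)            # class probabilities
    conf, level = probs.max(dim=1)
    # The target quality level is returned only when its probability lies
    # within the preset class probability threshold range.
    return int(level) if conf.item() >= PROB_THRESHOLD else None
```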
By adopting this technical solution of the application, the text detection model can be fused into the general quality assessment model at the input end, enhancing the classification model's ability to distinguish text regions from non-text regions in the bill and clear text from blurred text.
Optionally, when evaluating the quality level of the image to be processed, step S206 determines the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed, and may further include the steps of:
Step S2061, inputting the image to be processed into the target classification model to obtain the second image feature of the image to be processed, which is the result of feature pre-extraction of the image to be processed, and is output by the first convolution layer of the target classification model.
In the embodiment of the application, the second image features are obtained by performing feature recognition on the image to be processed only by the target classification model, namely, the overall image features of the image to be processed. The first convolution layer is a plurality of convolution layers in an implicit layer of the target classification model and is used for extracting image features.
Step S2062, inputting the second image feature and the text region feature into the feature layer of the target classification model to perform feature fusion on the second image feature and the text region feature to obtain a third image feature.
In the embodiment of the present application, the third image feature is obtained by fusing the overall image feature (i.e., the second image feature) of the image to be processed and the text region feature.
Step S2063, inputting the third image feature into the second convolution layer of the target classification model to obtain the class probability output by the output layer, wherein the output result of the second convolution layer is output by the output layer, and the class probability is used for evaluating the quality grade of the image to be processed.
In the embodiment of the present application, the second convolution layer is a 1×1 convolution layer used for probability prediction.
Step S2064, determining the target quality level to which the image to be processed belongs, in the case that the category probability is within the preset category probability threshold.
In the embodiment of the application, as shown in fig. 5, the text detection model can be fused at a feature layer of the target classification model; that is, the text region features extracted from the image to be processed by the text detection model can be fused with the overall image features initially extracted from the image to be processed by the convolution layer of the target classification model, and the classification task is performed on the fused features.
Optionally, feature fusing the second image feature and the text region feature may include at least one of:
scaling the second image feature and the text region feature to adjust the second image feature and the text region feature to be consistent in size; adding the second image features and the text region features which are consistent in size to perform feature stitching to obtain third image features;
Multiplying the second image features by the text region features to obtain a feature matrix; pooling operation is carried out on the feature matrix to obtain feature vectors; and normalizing the feature vector to perform bilinear pooling to obtain a third image feature.
In the embodiment of the application, the second image feature and the text region feature are both represented by the matrix, and the same dimension is required for matrix splicing, so that the second image feature and the text region feature are required to be scaled before the matrix splicing, so that the dimensions of the two matrices are the same.
In the embodiment of the application, when features are fused by bilinear pooling, the second image feature and the text region feature are multiplied, i.e., the matrix representing the second image feature is multiplied by the matrix representing the text region feature; if the second image feature has M rows and the text region feature has N columns, the resulting fused feature is an M×N matrix. The pooling operation may be a max pooling operation or an average pooling operation.
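The two fusion options can be sketched as follows; the tensor shapes and the choice of average pooling are assumptions, and the bilinear-pooling variant follows the multiply, pool, and normalize order described above:

```python
import torch
import torch.nn.functional as F

def fuse_by_concat(second_image_feature, text_region_feature):
    """Both inputs: (B, C, H, W); the text feature is resized, then stitched along the channel dimension."""
    text = F.interpolate(text_region_feature, size=second_image_feature.shape[2:],
                         mode="bilinear", align_corners=False)
    return torch.cat([second_image_feature, text], dim=1)   # third image feature

def fuse_by_bilinear_pooling(second_image_feature, text_region_feature):
    """second_image_feature: (B, M, H*W); text_region_feature: (B, N, H*W), both flattened spatially."""
    # Multiply the two feature matrices to obtain a (B, M, N) fused feature matrix.
    fused = torch.bmm(second_image_feature, text_region_feature.transpose(1, 2))
    # Pooling over one dimension turns the matrix into a feature vector (average pooling here).
    vector = fused.mean(dim=2)                               # (B, M)
    # Normalize the feature vector to complete the bilinear pooling; this is the third image feature.
    return F.normalize(vector, dim=1)
```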
By adopting this technical solution of the application, the text detection model can be fused into the general quality assessment model at its feature layer, enhancing the classification model's ability to distinguish text regions from non-text regions in the bill and clear text from blurred text.
Optionally, step S204 of extracting text region features of the image to be processed using the text detection model may further include the following step: extracting features of the image to be processed using a target middle layer of the target classification model to obtain a middle layer feature map, where the text region features include the middle layer feature map, and the target middle layer is obtained by supervised training that uses the text region features extracted from the training data by the text detection model as supervision labels.
Step S206, when evaluating the quality level of the image to be processed, determines the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed, and may further include the steps of:
step S2061, determining the mean square error loss of the middle layer feature map and the text feature map extracted by the text detection model;
Step S2062, determining a first quality grade obtained by identifying the image to be processed by the target classification model;
step S2063, taking the weighted sum of the mean square error loss and the first quality level as the target quality level to which the image to be processed belongs.
In the embodiment of the application, as shown in fig. 6, a task of the text detection model can be added on top of the task of the general quality assessment model (the classification model) by drawing on a multi-task learning model; that is, task supervision for text region detection can be added alongside the current classification task. A middle layer can be selected in the backbone network of the original classification model to predict the probability of a text region, and the supervision label can use the region score map output by CRAFT on the same image (the image to be processed). For example, the probability map produced by the middle layer for the image to be processed is taken as the middle layer feature map, the region score map (text region features) obtained for the image to be processed by the text detection model is taken as the supervision label, and the mean square error between the middle layer feature map and the supervision label is then used as the task supervision for the text region features. When the size of the middle layer feature map is inconsistent with that of the supervision label, at least one of upsampling and downsampling is required to normalize the middle layer feature map.
In the embodiment of the application, the mean square error between the middle layer feature map and the supervision label can be calculated by treating the middle layer feature map as the estimator and the supervision label as the quantity being estimated; this mean square error reflects the degree of difference between the two. Specifically, the expectation of the square of the difference between the probability predicted by the middle layer (the middle layer feature map) and the probability predicted by the text detection model (the supervision label) can be computed.
In the embodiment of the application, the target classification model can also predict a first quality level of the image to be processed from the overall image features of the image to be processed; this first quality level is an evaluation that does not incorporate the text region features. Therefore, the first quality level and the mean square error loss can each be given a weight, and their weighted sum is taken as the final target quality level of the image to be processed, thereby incorporating the text region features into the image quality level classification task.
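A sketch of the combination described in steps S2061 to S2063; the weights given to the first quality level and to the mean square error loss are assumptions:

```python
import torch.nn.functional as F

W_LEVEL, W_MSE = 1.0, 0.5   # assumed weights

def grade_with_region_supervision(mid_layer_map, craft_region_map, first_quality_level):
    """mid_layer_map: middle-layer text-region probability map; craft_region_map: region score map
    from CRAFT on the same image (the supervision label); first_quality_level: level predicted by
    the classification branch from the overall image features."""
    # Normalize sizes before comparing, as described above.
    if mid_layer_map.shape[2:] != craft_region_map.shape[2:]:
        mid_layer_map = F.interpolate(mid_layer_map, size=craft_region_map.shape[2:],
                                      mode="bilinear", align_corners=False)
    mse_loss = F.mse_loss(mid_layer_map, craft_region_map)            # step S2061
    # Step S2063: weighted sum of the mean square error loss and the first quality level.
    return W_LEVEL * first_quality_level + W_MSE * mse_loss.item()
```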
By adopting this technical solution of the application, the text detection model can be fused into the general quality assessment model from the perspective of multi-task supervision, enhancing the classification model's ability to distinguish text regions from non-text regions in the bill and clear text from blurred text.
Optionally, before extracting the text region features of the image to be processed using the text detection model in step S204, the method further includes:
replacing the output layer of the text detection model with a fully connected layer, and training the text detection model on a classification task using the training data to obtain the target classification model, where the text detection model serves as a pre-training model of the target classification model whose training parameters are fine-tuned by the classification task, and the classification task is the task of determining the image quality level.
In the embodiment of the application, the text detection model is used to perform the task of recognizing text regions, and in that model an output layer outputs the recognition result. The target classification model is used to perform the classification task, and such a model generally outputs the predicted probabilities at a fully connected layer. Therefore, in order to use the text detection model as a pre-training model of the target classification model, the last layer of the text detection model (i.e., its output layer) can be replaced with a fully connected layer for the classification task.
In the embodiment of the present application, as shown in fig. 7, the text detection model may be used as a pre-training model of the classification model. Specifically, the output layer of the text detection model is replaced with a fully connected layer, the text detection model is trained on the classification task using the training data to obtain the target classification model, and the training parameters of the text detection model, serving as the pre-training model of the target classification model, are fine-tuned by the classification task, where the classification task is the task of determining the image quality level.
Each neuron in the fully connected layer is fully connected to all neurons in the previous layer. The fully connected layer can integrate the class-discriminative local information from the convolutional layers or pooling layers.
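A sketch of using the text detection model as a pre-training model, as in FIG. 7; the backbone interface, the feature channel count, and the number of quality levels are assumptions:

```python
import torch
import torch.nn as nn

NUM_QUALITY_LEVELS = 3   # assumed

class QualityClassifier(nn.Module):
    """Wraps a pre-trained text detection backbone and replaces its output layer
    with a fully connected layer for the quality classification task."""
    def __init__(self, text_detection_backbone, feature_channels=512):
        super().__init__()
        self.backbone = text_detection_backbone        # assumed to return (B, C, H, W) features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(feature_channels, NUM_QUALITY_LEVELS)   # replaces the output layer

    def forward(self, image):
        features = self.backbone(image)                # text region and overall image features
        return self.fc(self.pool(features).flatten(1)) # class probability logits

# Fine-tuning: all parameters, including the pre-trained backbone, are updated on the
# classification task, typically with a small learning rate, e.g.:
# model = QualityClassifier(pretrained_text_detection_backbone)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```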
Step S204 of extracting text region features of the text region may further include the following step: inputting the image to be processed into the target classification model to extract text region features and overall image features of the image to be processed using the backbone network of the target classification model.
Step S206, when evaluating the quality level of the image to be processed, determines the target quality level to which the image to be processed belongs based on the overall image feature and the text region feature of the image to be processed, and may further include the steps of:
Step S2061, inputting the text region features and the overall image features into the fully connected layer to obtain the class probability output by the fully connected layer, where the class probability is used to evaluate the quality level of the image to be processed;
Step S2062, determining a target quality level to which the image to be processed belongs, in the case that the category probability is within the preset category probability threshold.
In the embodiment of the application, the target classification model obtained by using the text detection model as the pre-training model can extract the overall image features and the text region features from the image to be processed, and these features are then input into the fully connected layer for probability prediction to obtain the target quality level of the image to be processed.
By adopting this technical solution of the application, the text detection model can be fused into the general quality assessment model from the perspective of a pre-training model, enhancing the classification model's ability to distinguish text regions from non-text regions in the bill and clear text from blurred text.
According to still another aspect of the embodiments of the present application, as shown in fig. 8, there is provided a bill image processing apparatus based on a text detection model, including: an image acquisition module 801 configured to acquire an image to be processed, where the image to be processed contains a text region recording an acceptance record of the target service; a text feature extraction module 803 configured to extract text region features of the text region, where the text region features are used to describe the relationship between pixel points and characters in the text region; and an image classification module 805 configured to determine, when evaluating the quality level of the image to be processed, the target quality level to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed.
It should be noted that, the image obtaining module 801 in this embodiment may be used to perform step S202 in the embodiment of the present application, the text feature extracting module 803 in this embodiment may be used to perform step S204 in the embodiment of the present application, and the image classifying module 805 in this embodiment may be used to perform step S206 in the embodiment of the present application.
It should be noted that the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should be noted that the above modules may be implemented in software or hardware as a part of the apparatus in the hardware environment shown in fig. 1.
Optionally, the text feature extraction module is specifically configured to: obtaining an intermediate image meeting the target size requirement by scaling the image to be processed; upsampling the intermediate image using a text detection model to extract single character features; combining the extracted single character features to obtain multi-character features; and determining the probability that each pixel point of the intermediate image belongs to the center of each character in the multi-character feature, and obtaining a text feature map.
Optionally, the image classification module is specifically configured to: scaling the text feature map to adjust the text feature map to be consistent with the length and width of the image to be processed; inputting three color components of an image to be processed as image information into a target classification model, and inputting a text feature map which is consistent with the length and the width of the image to be processed as additional image information into the target classification model so that the target classification model can identify the image to be processed by using the text feature map; and determining the target quality grade of the image to be processed according to the recognition result of recognizing the image to be processed by using the text feature map according to the target classification model.
Optionally, the image classification module is further configured to: inputting the text feature map and the image to be processed into the first convolution layer of the target classification model to obtain a first image feature; inputting the first image feature into the second convolution layer of the target classification model to obtain the class probability output by the output layer, where the output result of the second convolution layer is output through the output layer and the class probability is used to evaluate the quality level of the image to be processed; and, when the class probability is within the preset class probability threshold range, determining the target quality level to which the image to be processed belongs.
Optionally, the image classification module is further configured to: inputting the image to be processed into the target classification model to obtain a second image feature of the image to be processed, where the second image feature is the result of feature pre-extraction on the image to be processed and is output by the first convolution layer of the target classification model; inputting the second image feature and the text region features into the feature layer of the target classification model to fuse the two and obtain a third image feature; inputting the third image feature into the second convolution layer of the target classification model to obtain the class probability output by the output layer, where the output result of the second convolution layer is output through the output layer and the class probability is used to evaluate the quality level of the image to be processed; and, when the class probability is within the preset class probability threshold range, determining the target quality level to which the image to be processed belongs.
Optionally, the image classification module further includes a feature fusion unit, configured to: scaling the second image feature and the text region feature to adjust the second image feature and the text region feature to be consistent in size; adding the second image features and the text region features which are consistent in size to perform feature stitching to obtain third image features; multiplying the second image features by the text region features to obtain a feature matrix; pooling operation is carried out on the feature matrix to obtain feature vectors; and normalizing the feature vector to perform bilinear pooling to obtain a third image feature.
Optionally, the text feature extraction module is further configured to: and extracting features of the image to be processed by adopting a target middle layer in the target classification model to obtain a middle layer feature map, wherein the text region features comprise the middle layer feature map, and the target middle layer is obtained by using a text detection model to extract text region features from training data as a supervision label for supervision training.
Optionally, the image classification module is further configured to: determining the mean square error loss of the middle layer feature map and the text feature map extracted by the text detection model; determining a first quality grade obtained by identifying an image to be processed by a target classification model; and taking the weighted sum of the mean square error loss and the first quality grade as a target quality grade to which the image to be processed belongs.
Optionally, the bill image processing device based on the text detection model further comprises a pre-training model module configured to: replace the output layer of the text detection model with a fully connected layer, and train the text detection model on a classification task by using the training data to obtain the target classification model, where the text detection model serves as a pre-training model of the target classification model, its training parameters are fine-tuned by the classification task, and the classification task is the task of determining the image quality grade.
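One way this re-purposing could look is sketched below; the wrapper class, feature dimension, number of grades, and learning rates are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class QualityClassifier(nn.Module):
        def __init__(self, text_detection_backbone: nn.Module, feat_dim: int, num_grades: int):
            super().__init__()
            self.backbone = text_detection_backbone       # pre-trained text detection model
            self.head = nn.Sequential(                    # output layer replaced by a fully connected layer
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(feat_dim, num_grades),
            )

        def forward(self, x):
            return self.head(self.backbone(x))

    # Fine-tuning sketch: backbone parameters are updated with a smaller learning rate (an assumption).
    # model = QualityClassifier(backbone, feat_dim=512, num_grades=3)
    # optimizer = torch.optim.Adam([
    #     {"params": model.backbone.parameters(), "lr": 1e-5},
    #     {"params": model.head.parameters(), "lr": 1e-4},
    # ])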
Optionally, the text feature extraction module is further configured to: input the image to be processed into the target classification model to extract the text region features and the overall image features of the image to be processed by using a backbone network of the target classification model.
Optionally, the image classification module is further configured to: input the text region features and the overall image features into a fully connected layer to obtain the class probability output by the fully connected layer, where the class probability is used for evaluating the quality grade of the image to be processed; and determine the target quality grade to which the image to be processed belongs when the class probability is within a preset class probability threshold range.
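The threshold check on the fully connected layer's output can be sketched as follows; the softmax choice and the threshold value are assumptions for illustration only.

    import torch
    import torch.nn.functional as F

    def decide_quality_grade(fc_logits: torch.Tensor, prob_threshold: float = 0.6):
        """fc_logits: (N, num_grades) raw outputs of the fully connected layer."""
        probs = F.softmax(fc_logits, dim=1)               # class probability per quality grade
        confidence, grade = probs.max(dim=1)
        # Only return a target quality grade when the class probability falls within the preset range.
        return [g.item() if c.item() >= prob_threshold else None
                for g, c in zip(grade, confidence)]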
According to another aspect of the embodiments of the present application, as shown in fig. 9, the present application provides an electronic device, including a memory 901, a processor 903, a communication interface 905, and a communication bus 907, where the memory 901 stores a computer program executable on the processor 903, the memory 901 and the processor 903 communicate with each other through the communication interface 905 and the communication bus 907, and when the processor 903 executes the computer program, the steps of the above method are implemented.
The memory and the processor in the electronic device communicate with each other through the communication interface and the communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like.
The memory may include random access memory (RAM) or non-volatile memory, such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
There is also provided, in accordance with yet another aspect of embodiments of the present application, a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the embodiments described above.
Optionally, in an embodiment of the present application, the computer program product or the computer program comprises program code for causing the processor to perform the following steps:
acquiring an image to be processed, wherein the image to be processed comprises a text region in which an acceptance record of a target service is recorded;
extracting text region features of the text region, wherein the text region features are used for describing the relation between pixel points and characters in the text region;
when evaluating the quality grade of the image to be processed, determining the target quality grade to which the image to be processed belongs based on the overall image features and the text region features of the image to be processed.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, which are not repeated here.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, micro-controllers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of determining an image quality level, comprising:
step S1, acquiring an image to be processed, wherein the image to be processed comprises a text region in which an acceptance record of a target service is recorded;
step S2, extracting text region features of the text region, wherein the text region features are used for describing the relation between pixel points and characters in the text region;
step S3, when evaluating the quality grade of the image to be processed, determining the target quality grade of the image to be processed based on the overall image features and the text region features of the image to be processed;
The step S3 includes: scaling the text feature map so that its length and width are consistent with those of the image to be processed, wherein the text region features comprise the text feature map; inputting the three color components of the image to be processed into a target classification model as image information, and inputting the scaled text feature map into the target classification model as additional image information, so that the target classification model recognizes the image to be processed by using the text feature map; and determining the target quality grade to which the image to be processed belongs according to the result of the target classification model recognizing the image to be processed by using the text feature map;
The step of determining the target quality grade to which the image to be processed belongs according to the result of the target classification model recognizing the image to be processed by using the text feature map comprises: inputting the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain first image features; inputting the first image features into a second convolution layer of the target classification model to obtain a class probability output by an output layer, wherein the output result of the second convolution layer is output through the output layer, and the class probability is used for evaluating the quality grade of the image to be processed; and when the class probability is within a preset class probability threshold range, determining the target quality grade to which the image to be processed belongs.
2. The method according to claim 1, wherein the step S2 comprises:
obtaining an intermediate image meeting the target size requirement by scaling the image to be processed;
and carrying out feature recognition on the intermediate image to obtain a text feature map, wherein the text region features comprise the text feature map, the text feature map is used for representing the probability that a pixel point is located at the center of a character, and the center of a character is the center position of a polygonal frame, generated by a text detection model, surrounding each character in the text region.
3. The method according to claim 2, wherein the step of performing feature recognition on the intermediate image to obtain a text feature map includes:
upsampling the intermediate image using a text detection model to extract single character features;
combining the extracted single character features to obtain multi-character features;
And determining the probability that each pixel point on the intermediate image belongs to the center of each character in the multi-character feature, and obtaining the text feature map.
4. The method according to claim 1, wherein the step S3 further comprises:
inputting the image to be processed into a target classification model to obtain second image features of the image to be processed output by a first convolution layer of the target classification model, wherein the second image features are the result of feature pre-extraction performed on the image to be processed;
inputting the second image features and the text region features into a feature layer of the target classification model to perform feature fusion on the second image features and the text region features, so as to obtain third image features;
inputting the third image features into a second convolution layer of the target classification model to obtain a class probability output by an output layer, wherein the output result of the second convolution layer is output through the output layer, and the class probability is used for evaluating the quality grade of the image to be processed;
and when the class probability is within a preset class probability threshold range, determining the target quality grade to which the image to be processed belongs.
5. The method according to claim 4, wherein the step of performing feature fusion on the second image features and the text region features to obtain third image features comprises at least one of:
scaling the second image features and the text region features so that they are consistent in size, and adding the size-matched second image features and text region features to perform feature stitching and obtain the third image features;
multiplying the second image features by the text region features to obtain a feature matrix, performing a pooling operation on the feature matrix to obtain a feature vector, and normalizing the feature vector to perform bilinear pooling and obtain the third image features.
6. The method according to claim 1, wherein:
the step S2 further comprises: extracting features of the image to be processed by using a target intermediate layer in a target classification model to obtain an intermediate-layer feature map, wherein the text region features comprise the intermediate-layer feature map, the target intermediate layer is at least one feature extraction network layer in the target classification model, and the target intermediate layer is obtained through supervised training in which the text region features extracted from training data by a text detection model serve as supervision labels;
the step S3 further comprises: determining the mean square error loss between the intermediate-layer feature map and the text feature map extracted by the text detection model; determining a first quality grade obtained by the target classification model recognizing the image to be processed; and taking the weighted sum of the mean square error loss and the first quality grade as the target quality grade to which the image to be processed belongs.
7. The method according to claim 1, wherein:
prior to the step S2, the method further comprises: replacing an output layer of a text detection model with a fully connected layer, and training the text detection model on a classification task by using training data to obtain a target classification model, wherein the text detection model serves as a pre-training model of the target classification model, training parameters of the text detection model are fine-tuned by the classification task, and the classification task is a task of determining an image quality grade;
the step S2 further comprises: inputting the image to be processed into the target classification model to extract the text region features and the overall image features of the image to be processed by using the target classification model;
the step S3 further comprises: inputting the text region features and the overall image features into the fully connected layer to obtain a class probability output by the fully connected layer, wherein the class probability is used for evaluating the quality grade of the image to be processed; and when the class probability is within a preset class probability threshold range, determining the target quality grade to which the image to be processed belongs.
8. An image quality level determining apparatus, comprising:
an image acquisition module, configured to acquire an image to be processed, wherein the image to be processed comprises a text region in which an acceptance record of a target service is recorded;
a text feature extraction module, configured to extract text region features of the text region, wherein the text region features are used for describing the relation between pixel points and characters in the text region;
an image quality evaluation module, configured to determine the target quality grade of the image to be processed based on the overall image features and the text region features of the image to be processed when evaluating the quality grade of the image to be processed;
wherein the image quality evaluation module is configured to: scale the text feature map so that its length and width are consistent with those of the image to be processed, wherein the text region features comprise the text feature map; input the three color components of the image to be processed into a target classification model as image information, and input the scaled text feature map into the target classification model as additional image information, so that the target classification model recognizes the image to be processed by using the text feature map; and determine the target quality grade to which the image to be processed belongs according to the result of the target classification model recognizing the image to be processed by using the text feature map;
the image quality evaluation module is specifically further configured to: input the text feature map and the image to be processed into a first convolution layer of the target classification model to obtain first image features; input the first image features into a second convolution layer of the target classification model to obtain a class probability output by an output layer, wherein the output result of the second convolution layer is output through the output layer, and the class probability is used for evaluating the quality grade of the image to be processed; and determine the target quality grade to which the image to be processed belongs when the class probability is within a preset class probability threshold range.
9. An electronic device comprising a memory, a processor, a communication interface and a communication bus, the memory storing a computer program executable on the processor, and the memory and the processor communicating with each other through the communication interface and the communication bus, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any one of claims 1 to 7.
CN202011147351.7A 2020-10-23 2020-10-23 Method, device, equipment and computer readable medium for determining image quality grade Active CN112365451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147351.7A CN112365451B (en) 2020-10-23 2020-10-23 Method, device, equipment and computer readable medium for determining image quality grade

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011147351.7A CN112365451B (en) 2020-10-23 2020-10-23 Method, device, equipment and computer readable medium for determining image quality grade

Publications (2)

Publication Number Publication Date
CN112365451A CN112365451A (en) 2021-02-12
CN112365451B true CN112365451B (en) 2024-06-21

Family

ID=74511910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011147351.7A Active CN112365451B (en) 2020-10-23 2020-10-23 Method, device, equipment and computer readable medium for determining image quality grade

Country Status (1)

Country Link
CN (1) CN112365451B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222985B (en) * 2021-06-04 2022-01-21 中国人民解放军总医院 Image processing method, image processing device, computer equipment and medium
CN113343845A (en) * 2021-06-04 2021-09-03 北京捷通华声科技股份有限公司 Table detection method and device, electronic equipment and storage medium
CN113448925A (en) * 2021-06-25 2021-09-28 东莞市小精灵教育软件有限公司 Test question picture optimization method and device, computer equipment and storage medium
CN113723392A (en) * 2021-09-29 2021-11-30 广联达科技股份有限公司 Document image quality evaluation method and device, computer equipment and storage medium
WO2023201509A1 (en) * 2022-04-19 2023-10-26 Paypal, Inc. Document image quality detection

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710882A (en) * 2018-05-11 2018-10-26 武汉科技大学 A kind of screen rendering text recognition method based on convolutional neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0411245D0 (en) * 2004-05-20 2004-06-23 Enseal Systems Ltd A method for the assessment of quality and usability of digital cheque images with minimal computational requirements
CN106372651B (en) * 2016-08-22 2018-03-06 平安科技(深圳)有限公司 The detection method and device of picture quality
CN107481238A (en) * 2017-09-20 2017-12-15 众安信息技术服务有限公司 Image quality measure method and device
CN107679490B (en) * 2017-09-29 2019-06-28 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
CN110866471A (en) * 2019-10-31 2020-03-06 Oppo广东移动通信有限公司 Face image quality evaluation method and device, computer readable medium and communication terminal

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710882A (en) * 2018-05-11 2018-10-26 武汉科技大学 A kind of screen rendering text recognition method based on convolutional neural networks

Also Published As

Publication number Publication date
CN112365451A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN112365451B (en) Method, device, equipment and computer readable medium for determining image quality grade
CN110245659B (en) Image salient object segmentation method and device based on foreground and background interrelation
CN112800876B (en) Super-spherical feature embedding method and system for re-identification
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN104299006A (en) Vehicle license plate recognition method based on deep neural network
CN111723822B (en) RGBD image significance detection method and system based on multi-level fusion
CN107273870A (en) The pedestrian position detection method of integrating context information under a kind of monitoring scene
CN111553363B (en) End-to-end seal identification method and system
CN114444565B (en) Image tampering detection method, terminal equipment and storage medium
CN113920468A (en) Multi-branch pedestrian detection method based on cross-scale feature enhancement
JP2010157118A (en) Pattern identification device and learning method for the same and computer program
CN113688821B (en) OCR text recognition method based on deep learning
CN115063786A (en) High-order distant view fuzzy license plate detection method
CN113591866A (en) Special job certificate detection method and system based on DB and CRNN
CN110826609A (en) Double-flow feature fusion image identification method based on reinforcement learning
CN115797731A (en) Target detection model training method, target detection model detection method, terminal device and storage medium
CN116645592A (en) Crack detection method based on image processing and storage medium
CN111814562A (en) Vehicle identification method, vehicle identification model training method and related device
CN115810123A (en) Small target pest detection method based on attention mechanism and improved feature fusion
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN115187456A (en) Text recognition method, device, equipment and medium based on image enhancement processing
Qaddour et al. Automatic damaged vehicle estimator using enhanced deep learning algorithm
Xiang et al. Recognition of characters on curved metal workpiece surfaces based on multi-exposure image fusion and deep neural networks
CN112132867B (en) Remote sensing image change detection method and device
CN111476226B (en) Text positioning method and device and model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant