CN113392232A - Data quality evaluation method and device, computer and readable storage medium

Info

Publication number: CN113392232A
Application number: CN202011338955.XA
Authority: CN (China)
Prior art keywords: channel, feature, data, pixel, fusion
Other languages: Chinese (zh)
Inventors: 谢植淮, 李松南
Current and original assignee: Tencent Technology Shenzhen Co Ltd (application filed by Tencent Technology Shenzhen Co Ltd)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43: Querying
    • G06F 16/432: Query formulation
    • G06F 16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483: Retrieval using metadata automatically derived from the content
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The embodiments of this application disclose a data quality evaluation method and device, a computer, and a readable storage medium. The method includes the following steps: acquiring a media feature map of multimedia data, and acquiring at least two data channels constituting the multimedia data; generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map; generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map; and outputting the predicted media quality of the multimedia data according to the spatial feature map. With this method and device, the accuracy of quality prediction for multimedia data can be improved.

Description

Data quality evaluation method and device, computer and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data quality assessment method, apparatus, computer, and readable storage medium.
Background
During video acquisition, processing, transmission, and recording, distortion and degradation of the video are inevitable due to imperfections in imaging systems, processing methods, transmission media, and recording equipment, as well as object motion and noise pollution. When a recommendation system recommends videos, low-quality videos need to be filtered out to avoid an unpleasant viewing experience. In no-reference video quality assessment there is no reference image, so distortion and degradation of the video are difficult to quantify, which poses a great challenge for quality evaluation. A no-reference video quality evaluation algorithm directly evaluates the quality of a given video segment without the aid of original lossless reference video information. Most conventional no-reference video quality evaluation methods are based on statistical analysis, which is generally performed for specific video distortion types; the application range of such methods is therefore limited, and their real-time performance is poor. Alternatively, the temporal distortion information of a video is represented by the block-structure similarity between adjacent frames; however, since the inter-frame noise intensity within the same video tends to be similar, the structural similarity of corresponding small blocks between adjacent frames can hardly reflect video quality objectively, so the accuracy of such quality evaluation is low.
Disclosure of Invention
The embodiment of the application provides a data quality evaluation method, a data quality evaluation device, a computer and a readable storage medium, which can improve the accuracy of quality prediction of multimedia data.
An aspect of the embodiments of the present application provides a data quality assessment method, including:
acquiring a media feature map of multimedia data, and acquiring at least two data channels constituting the multimedia data;
generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and outputting the predicted media quality of the multimedia data according to the spatial feature map.
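Read end to end, these four steps describe a two-stage attention pipeline. The following is a minimal PyTorch-style sketch of that flow; the module decomposition, tensor shapes, and the global pooling before the prediction head are illustrative assumptions rather than details fixed by this aspect:

```python
import torch
import torch.nn as nn

def predict_media_quality(f: torch.Tensor,
                          channel_attention: nn.Module,
                          spatial_attention: nn.Module,
                          head: nn.Module) -> torch.Tensor:
    """f: media feature map of shape (N, C, H, W) for N media items."""
    mc = channel_attention(f)                 # channel attention masks, (N, C, 1, 1)
    f_channel = mc * f                        # channel feature map, (N, C, H, W)
    ms = spatial_attention(f_channel)         # spatial attention mask, (N, 1, H, W)
    f_spatial = ms * f_channel                # spatial feature map, (N, C, H, W)
    return head(f_spatial.mean(dim=(2, 3)))   # predicted media quality, (N, 1)
```

Sketches of the assumed channel_attention and spatial_attention modules appear with the corresponding sub-claims below.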
Wherein the at least two data channels comprise a data channel i; i is a positive integer less than or equal to the number of the at least two data channels;
generating a channel attention mask corresponding to at least two data channels based on a channel attention mechanism, comprising:
determining the data features of the multimedia data in the data channel i based on the media feature map;
performing feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain channel fusion features of the data channel i;
and when the channel fusion features corresponding to each data channel are obtained, generating the channel attention masks corresponding to the at least two data channels according to the channel fusion features corresponding to each data channel.
Wherein the channel fusion features comprise a first channel fusion feature and a second channel fusion feature;
performing feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain the channel fusion features of the data channel i includes:
performing mean pooling on the data features of the data channel i based on the channel attention mechanism to obtain the first channel fusion feature of the data channel i;
and acquiring the mean square error of the data features of the data channel i, and determining the mean square error as the second channel fusion feature of the data channel i.
Wherein generating the channel attention masks corresponding to the at least two data channels according to the channel fusion features respectively corresponding to each data channel includes:
performing feature splicing on the first channel fusion features respectively corresponding to each data channel to obtain a first mean feature, and performing feature splicing on the second channel fusion features respectively corresponding to each data channel to obtain a first mean square feature;
weighting the first mean feature to obtain a first mean weighted feature, and weighting the first mean square feature to obtain a first mean square weighted feature;
and performing feature fusion on the first mean weighted feature and the first mean square weighted feature to obtain the channel attention masks corresponding to the at least two data channels.
Wherein the channel attention mask includes a sub-mask for each data channel;
generating the channel feature map of the multimedia data according to the channel attention mask and the media feature map includes:
performing mask weighting on the data features in the data channel i based on the sub-mask of the data channel i to obtain channel mask features of the data channel i;
and when the channel mask features corresponding to each data channel are obtained, performing feature splicing on the channel mask features corresponding to each data channel to obtain the channel feature map of the multimedia data.
Wherein generating the spatial attention mask corresponding to the channel feature map based on the spatial attention mechanism includes:
acquiring at least two pixel points constituting the channel feature map; the at least two pixel points comprise a pixel point j, where j is a positive integer less than or equal to the number of the at least two pixel points;
determining channel pixel features corresponding to the pixel point j in each data channel based on the media feature map;
performing feature fusion on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain a pixel fusion feature of the pixel point j;
and when the pixel fusion features corresponding to each pixel point are obtained, generating the spatial attention mask corresponding to the channel feature map according to the pixel fusion features corresponding to each pixel point.
Wherein the pixel fusion feature comprises a first pixel fusion feature and a second pixel fusion feature;
performing feature fusion on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain the pixel fusion feature of the pixel point j includes:
performing mean pooling on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain the first pixel fusion feature of the pixel point j;
and acquiring the mean square error of the at least two channel pixel features corresponding to the pixel point j, and determining the mean square error corresponding to the pixel point j as the second pixel fusion feature of the pixel point j.
Wherein generating the spatial attention mask corresponding to the channel feature map according to the pixel fusion features respectively corresponding to each pixel point includes:
performing feature splicing on the first pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean feature, and performing feature splicing on the second pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean square feature;
performing feature splicing on the second mean feature and the second mean square feature to obtain a pixel splicing feature;
and convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map.
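The three steps above pool across the channel axis rather than the spatial axes. A minimal sketch of one plausible implementation in PyTorch follows; the kernel size, padding, and sigmoid activation are assumptions (the convolution parameters are only pinned down by the later sub-claims):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention mask generation described above."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Padding keeps the mask at H x W (cf. the "default feature value" step below).
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_channel: torch.Tensor) -> torch.Tensor:
        # f_channel: channel feature map, (N, C, H, W)
        mean = f_channel.mean(dim=1, keepdim=True)                 # first pixel fusion features
        mse = ((f_channel - mean) ** 2).mean(dim=1, keepdim=True)  # second pixel fusion features (mean square error)
        stitched = torch.cat([mean, mse], dim=1)                   # pixel splicing feature, (N, 2, H, W)
        return torch.sigmoid(self.conv(stitched))                  # spatial attention mask, (N, 1, H, W)
```

Under this reading, the spatial feature map is then `SpatialAttention(...)(f_channel) * f_channel`.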
Wherein convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map includes:
acquiring the data size and convolution parameters of the multimedia data, and determining a convolution feature size;
adding default feature values to the pixel splicing feature to obtain a pixel feature to be convolved, the size of the pixel feature to be convolved being the convolution feature size;
and convolving the pixel feature to be convolved to generate the spatial attention mask corresponding to the channel feature map.
Wherein the convolution parameters include expansion coefficients;
convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map includes:
determining a k-th convolution position corresponding to the convolution kernel in the pixel splicing feature based on the expansion coefficient (dilation rate), and convolving the elements at the k-th convolution position in the pixel splicing feature with the convolution kernel to obtain a k-th convolution element; k is a positive integer;
acquiring a convolution step length (stride), determining a (k+1)-th convolution position corresponding to the convolution kernel in the pixel splicing feature based on the convolution step length, the expansion coefficient, and the k-th convolution position, and convolving the elements at the (k+1)-th convolution position in the pixel splicing feature with the convolution kernel to obtain a (k+1)-th convolution element;
and when the convolution of the pixel splicing feature is completed, generating the spatial attention mask according to the obtained convolution elements.
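This sub-claim describes a standard dilated (atrous) convolution: the expansion coefficient spaces the kernel taps apart within one window, and the convolution step length offsets consecutive windows. A hedged sketch using PyTorch's built-in dilation support, with example values that the claim does not fix:

```python
import torch
import torch.nn as nn

kernel_size, stride, dilation = 3, 1, 2   # assumed example values
conv = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=kernel_size,
                 stride=stride, dilation=dilation,
                 padding=dilation * (kernel_size - 1) // 2)  # keeps the H x W size

pixel_splicing_feature = torch.randn(1, 2, 32, 32)   # (N, 2, H, W) from the previous step
convolution_elements = conv(pixel_splicing_feature)  # one element per convolution position
# With dilation 2, a 3x3 kernel samples taps 2 elements apart (a 5x5 receptive field),
# and the (k+1)-th convolution position starts `stride` elements after the k-th.
```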
Wherein the multimedia data comprises a first image frame and a second image frame;
outputting the predicted media quality of the multimedia data according to the spatial feature map includes:
predicting the spatial feature map of the first image frame based on a fully connected layer to obtain a first prediction quality of the first image frame;
predicting the spatial feature map of the second image frame based on the fully connected layer to obtain a second prediction quality of the second image frame;
acquiring a first evaluation weight of the first image frame and a second evaluation weight of the second image frame, weighting the first prediction quality based on the first evaluation weight to obtain a first weighted prediction value, and weighting the second prediction quality based on the second evaluation weight to obtain a second weighted prediction value;
and determining the sum of the first weighted prediction value and the second weighted prediction value as the predicted media quality of the multimedia data.
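For two frames this is a weighted sum of per-frame scores produced by a shared fully connected layer. A minimal sketch generalized to T frames; the global pooling before the layer and the uniform weights are assumptions:

```python
import torch
import torch.nn as nn

def predict_video_quality(frame_spatial_maps: torch.Tensor,
                          fc: nn.Linear,
                          frame_weights: torch.Tensor) -> torch.Tensor:
    """frame_spatial_maps: (T, C, H, W) spatial feature maps, one per image frame.
    frame_weights: (T,) evaluation weights for the frames."""
    pooled = frame_spatial_maps.mean(dim=(2, 3))   # assumed pooling before the fully connected layer
    per_frame = fc(pooled).squeeze(-1)             # (T,) prediction quality per frame
    return (frame_weights * per_frame).sum()       # predicted media quality

fc = nn.Linear(64, 1)
frames = torch.randn(2, 64, 8, 8)                  # first and second image frames
quality = predict_video_quality(frames, fc, torch.tensor([0.5, 0.5]))
```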
Wherein the method further comprises:
acquiring a recommendation threshold, and if the predicted media quality is greater than or equal to the recommendation threshold, recommending and displaying the multimedia data;
and if the predicted media quality is less than the recommendation threshold, acquiring the user equipment that uploaded the multimedia data and sending a media quality exception message to the user equipment; the media quality exception message is used to instruct the user equipment to optimize the media quality of the multimedia data.
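The recommendation logic reduces to a threshold test. A small sketch follows; the threshold value and the messaging mechanism are placeholders, not specified by the claim:

```python
RECOMMENDATION_THRESHOLD = 3.5  # hypothetical value on the model's quality scale

def handle_predicted_quality(predicted_quality: float, media_id: str, uploader: str) -> str:
    if predicted_quality >= RECOMMENDATION_THRESHOLD:
        return f"recommend_and_display:{media_id}"
    # Below threshold: notify the uploading user equipment so it can
    # optimize the media quality of the multimedia data.
    return f"media_quality_exception->{uploader}"
```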
An aspect of an embodiment of the present application provides a data quality evaluation apparatus, including:
the channel acquisition module is used for acquiring a media feature map of multimedia data and acquiring at least two data channels constituting the multimedia data;
the channel processing module is used for generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
the spatial processing module is used for generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and the quality prediction module is used for outputting the predicted media quality of the multimedia data according to the spatial feature map.
Wherein the at least two data channels comprise a data channel i; i is a positive integer less than or equal to the number of the at least two data channels;
in terms of generating the channel attention masks corresponding to the at least two data channels based on the channel attention mechanism, the channel processing module includes:
the feature determining unit is used for determining the data features of the multimedia data in the data channel i based on the media feature map;
the channel feature fusion unit is used for performing feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain channel fusion features of the data channel i;
and the channel mask generating unit is used for generating, when the channel fusion features respectively corresponding to each data channel are obtained, the channel attention masks corresponding to the at least two data channels according to the channel fusion features respectively corresponding to each data channel.
Wherein the channel fusion feature comprises a first channel fusion feature and a second channel fusion feature;
the channel feature fusion unit includes:
the first pooling subunit is used for performing mean pooling on the data features of the data channel i based on the channel attention mechanism to obtain the first channel fusion feature of the data channel i;
and the second pooling subunit is used for acquiring the mean square error of the data features of the data channel i and determining the mean square error as the second channel fusion feature of the data channel i.
Wherein the channel mask generating unit includes:
the first splicing subunit is used for performing feature splicing on the first channel fusion features respectively corresponding to each data channel to obtain a first mean feature, and performing feature splicing on the second channel fusion features respectively corresponding to each data channel to obtain a first mean square feature;
the second splicing subunit is used for weighting the first mean feature to obtain a first mean weighted feature, and weighting the first mean square feature to obtain a first mean square weighted feature;
and the pooling fusion subunit is used for performing feature fusion on the first mean weighted feature and the first mean square weighted feature to obtain the channel attention masks corresponding to the at least two data channels.
Wherein the channel attention mask includes a sub-mask for each data channel;
in terms of generating the channel feature map of the multimedia data based on the channel attention mask and the media feature map, the channel processing module includes:
the feature weighting unit is used for performing mask weighting on the data features in the data channel i based on the sub-mask of the data channel i to obtain channel mask features of the data channel i;
and the map generating unit is used for performing, when the channel mask features respectively corresponding to each data channel are obtained, feature splicing on the channel mask features respectively corresponding to each data channel to obtain the channel feature map of the multimedia data.
In terms of generating the spatial attention mask corresponding to the channel feature map based on the spatial attention mechanism, the spatial processing module includes:
the pixel acquisition unit is used for acquiring at least two pixel points constituting the channel feature map; the at least two pixel points comprise a pixel point j, where j is a positive integer less than or equal to the number of the at least two pixel points;
the feature acquisition unit is used for determining channel pixel features corresponding to the pixel point j in each data channel based on the media feature map;
the pixel feature fusion unit is used for performing feature fusion on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain a pixel fusion feature of the pixel point j;
and the spatial mask generating unit is used for generating, when the pixel fusion features respectively corresponding to each pixel point are obtained, the spatial attention mask corresponding to the channel feature map according to the pixel fusion features respectively corresponding to each pixel point.
The pixel fusion feature comprises a first pixel fusion feature and a second pixel fusion feature;
the pixel feature fusion unit comprises:
the third pooling subunit is used for performing mean pooling on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain the first pixel fusion feature of the pixel point j;
and the fourth pooling subunit is used for acquiring the mean square error of the at least two channel pixel features corresponding to the pixel point j and determining the mean square error corresponding to the pixel point j as the second pixel fusion feature of the pixel point j.
Wherein the spatial mask generating unit includes:
the third splicing subunit is used for performing feature splicing on the first pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean feature, and performing feature splicing on the second pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean square feature;
the fourth splicing subunit is used for performing feature splicing on the second mean feature and the second mean square feature to obtain a pixel splicing feature;
and the feature convolution subunit is used for convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map.
Wherein the feature convolution subunit includes:
the size determining subunit is used for acquiring the data size and convolution parameters of the multimedia data and determining a convolution feature size;
the feature updating subunit is used for adding default feature values to the pixel splicing feature to obtain a pixel feature to be convolved, the size of the pixel feature to be convolved being the convolution feature size;
and the updating convolution subunit is used for convolving the pixel feature to be convolved to generate the spatial attention mask corresponding to the channel feature map.
Wherein the convolution parameters include expansion coefficients;
the updating convolution subunit comprises:
the first convolution subunit is used for determining a k-th convolution position corresponding to the convolution kernel in the pixel splicing feature based on the expansion coefficient, and convolving the elements at the k-th convolution position in the pixel splicing feature with the convolution kernel to obtain a k-th convolution element; k is a positive integer;
the second convolution subunit is used for acquiring a convolution step length, determining a (k+1)-th convolution position corresponding to the convolution kernel in the pixel splicing feature based on the convolution step length, the expansion coefficient, and the k-th convolution position, and convolving the elements at the (k+1)-th convolution position in the pixel splicing feature with the convolution kernel to obtain a (k+1)-th convolution element;
and the element combination subunit is used for generating the spatial attention mask according to the obtained convolution elements when the convolution of the pixel splicing feature is completed.
Wherein the multimedia data comprises a first image frame and a second image frame;
the quality prediction module comprises:
the first prediction unit is used for predicting the spatial feature map of the first image frame based on a fully connected layer to obtain a first prediction quality of the first image frame;
the second prediction unit is used for predicting the spatial feature map of the second image frame based on the fully connected layer to obtain a second prediction quality of the second image frame;
the prediction weighting unit is used for acquiring a first evaluation weight of the first image frame and a second evaluation weight of the second image frame, weighting the first prediction quality based on the first evaluation weight to obtain a first weighted prediction value, and weighting the second prediction quality based on the second evaluation weight to obtain a second weighted prediction value;
and the quality determining unit is used for determining the sum of the first weighted prediction value and the second weighted prediction value as the predicted media quality of the multimedia data.
Wherein the apparatus further includes:
the data recommendation module is used for acquiring a recommendation threshold value, and recommending and displaying the multimedia data if the predicted media quality is greater than or equal to the recommendation threshold value;
and the exception feedback module is used for, if the predicted media quality is less than the recommendation threshold, acquiring the user equipment that uploaded the multimedia data and sending a media quality exception message to the user equipment; the media quality exception message is used to instruct the user equipment to optimize the media quality of the multimedia data.
One aspect of the embodiments of the present application provides a computer device, including a processor, a memory, and an input/output interface;
the processor is respectively connected with the memory and the input/output interface, wherein the input/output interface is used for receiving data and outputting data, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the data quality evaluation method in one aspect of the embodiment of the application.
In an aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method for evaluating data quality in the aspect of the embodiment of the present application is performed.
An aspect of an embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternatives in one aspect of the embodiments of the application.
The embodiment of the application has the following beneficial effects:
in the embodiments of this application, the computer device acquires a media feature map of multimedia data and acquires at least two data channels constituting the multimedia data; generates channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generates a channel feature map of the multimedia data according to the channel attention masks and the media feature map; generates a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generates a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map; and outputs the predicted media quality of the multimedia data according to the spatial feature map. Through the channel attention mechanism and the spatial attention mechanism, the network can, during training, focus on the data channels constituting the multimedia data and pay increased attention to distorted and degraded regions, which improves the accuracy of quality prediction for the multimedia data.
Drawings
In order to more clearly illustrate the embodiments of this application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of this application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a network architecture diagram for data quality evaluation provided by an embodiment of the present application;
fig. 2 is a schematic diagram of a data quality evaluation scenario provided in an embodiment of the present application;
FIG. 3 is a flowchart of a data quality evaluation method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of channel feature map generation provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a channel feature map generation model according to an embodiment of the present application;
fig. 6 is a flow chart of a recommendation determination scenario provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a spatial feature map generation scene provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a generative model of a spatial feature map provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a convolution scenario provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a structure of a multi-layer dilation convolution according to an embodiment of the present application;
fig. 11 is a schematic diagram of an architecture of a quality prediction network according to an embodiment of the present application;
FIG. 12 is a diagram illustrating a quality prediction result provided by an embodiment of the present application;
fig. 13 is a schematic diagram of a data quality evaluation apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of this application.
In the embodiments of this application, feature extraction and feature processing may be performed on multimedia data based on artificial intelligence and machine learning techniques. For example, the multimedia data is processed based on a channel attention mechanism to obtain a channel feature map; the channel feature map is processed based on a spatial attention mechanism to obtain a spatial feature map; and the spatial feature map is used to predict the media quality of the multimedia data. This implements region-level attention on the multimedia data and ranks the importance of each data channel, so that distorted or degraded areas of the multimedia data can be focused on, further improving the accuracy of quality prediction.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that react in a manner similar to human intelligence; here, it processes the features of multimedia data on each data channel and each pixel point so that the processing result approximates, as closely as possible, a human quality evaluation of the multimedia data. Artificial intelligence research covers the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
This application mainly relates to machine learning/deep learning (such as the channel attention mechanism and the spatial attention mechanism). A quality prediction model can be obtained through learning; the model includes a channel attention module and a spatial attention module, which are used to predict the quality of multimedia data.
Deep Learning (DL) is a new research direction in the field of Machine Learning (ML). Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far surpass earlier related technology; it generally encompasses techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Further, the data in this application can be stored through cloud storage technology or in the storage space of a computer device. Since a large amount of multimedia data may be involved, the multimedia data in this application may also be processed using big data technology.
A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that, through functions such as cluster applications, grid technology, and distributed storage file systems, integrates a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) to work cooperatively via application software or application interfaces, providing data storage and service access functions externally.
Storing the multimedia data of this application through cloud storage technology improves data storage efficiency and data interaction efficiency.
In the embodiments of this application, please refer to fig. 1; fig. 1 is a network architecture diagram for data quality evaluation provided by an embodiment of this application, and the embodiments of this application may be implemented by a computer device. The network for data quality evaluation may include computer devices and user devices, such as the computer device 101 and the user devices 102a, 102b, and 102c. The computer devices may perform data interaction with the user devices, and a computer device may implement the scheme in this application to perform quality prediction on multimedia data.
As shown in fig. 1, the computer device 101 may obtain multimedia data that needs quality prediction from any user device, or it may obtain multimedia data from its own storage space and perform quality prediction on it. After acquiring the multimedia data, the computer device 101 may perform quality prediction immediately; it may also cache the multimedia data, for example by storing it into the storage space of the computer device 101, and perform periodic quality prediction on the stored multimedia data based on a prediction period; alternatively, the computer device 101 may perform quality prediction after receiving a quality prediction request for the multimedia data, and so on, which is not limited here.
For example, the user equipment 102a sends a quality prediction request for the multimedia data to the computer device 101; the quality prediction request includes the multimedia data, and the computer device 101 may perform quality prediction based on the request and send the quality prediction result to the user equipment 102a. Alternatively, the user device 102b transmits multimedia data to the computer device 101, and the computer device 101 may perform quality prediction on the multimedia data and determine whether to recommend it based on the quality prediction result. The computer device 101 may process the features of the multimedia data in each data channel based on a channel attention mechanism to obtain a channel feature map of the multimedia data, so as to attend to the features in each data channel of the multimedia data. Further, the computer device 101 performs feature processing on the channel feature map based on a spatial attention mechanism to obtain a spatial feature map of the multimedia data, so that distorted or degraded regions in the multimedia data receive particular attention. Through the channel attention mechanism and the spatial attention mechanism, quality-degraded or distorted areas of the multimedia data can be focused on, improving the accuracy of quality prediction.
It can be understood that the computer device mentioned in the embodiments of this application includes, but is not limited to, a terminal device or a server. In other words, the computer device or the user device may be a server, a terminal device, or a system composed of a server and a terminal device. The terminal device may be an electronic device with a display function, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, an Augmented Reality/Virtual Reality (AR/VR) device, a head-mounted display, a wearable device, a smart speaker, a digital camera, and other Mobile Internet Devices (MID) with network access capability. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Optionally, the data involved in the embodiments of this application may be stored in a server, in the memory (i.e., the storage space) of a computer device, or based on cloud storage technology, which is not limited here.
Further, please refer to fig. 2; fig. 2 is a schematic diagram of a data quality evaluation scenario provided by an embodiment of this application. As shown in fig. 2, the quality prediction process for multimedia data in this application may be implemented by a quality prediction model, which may include a channel attention module and a spatial attention module. The computer device acquires the multimedia data 201, acquires the media feature map 202 of the multimedia data 201, and acquires at least two data channels constituting the multimedia data 201. In the channel attention module 203, feature processing is performed on the media feature map 202 based on a channel attention mechanism to obtain channel attention masks 2031 corresponding to the at least two data channels, and the channel feature map 204 of the multimedia data 201 is generated according to the channel attention masks 2031 and the media feature map 202. In the spatial attention module 205, the channel feature map 204 is subjected to feature processing based on a spatial attention mechanism to obtain a spatial attention mask 2051 corresponding to the channel feature map 204, and the spatial feature map 206 of the multimedia data is generated according to the spatial attention mask 2051 and the channel feature map 204. The predicted media quality of the multimedia data is output according to the spatial feature map 206.
Further, please refer to fig. 3; fig. 3 is a flowchart of a data quality evaluation method provided by an embodiment of this application. As shown in fig. 3, the data quality evaluation process includes the following steps:
Step S301, a media feature map of the multimedia data is obtained, and at least two data channels constituting the multimedia data are obtained.
In the embodiments of this application, the computer device obtains a media feature map of the multimedia data and obtains at least two data channels constituting the multimedia data, where a data channel is a channel forming the features of a pixel point; a data channel may be, for example, a color channel. For example, assuming that the multimedia data is Red-Green-Blue (RGB) data, the multimedia data can be considered to be composed of three data channels, i.e., a red (R) channel, a green (G) channel, and a blue (B) channel. Assuming that the multimedia data is Cyan-Magenta-Yellow-Black (CMYK) data, the multimedia data may be considered to be composed of four data channels, i.e., a cyan (C) channel, a magenta (M) channel, a yellow (Y) channel, and a black (K) channel. Assuming that the multimedia data is 256-color data, the multimedia data can be considered to be composed of 256 data channels; assuming that the multimedia data is gray-scale data, the multimedia data may be considered to be composed of one data channel, and so on, without limitation. In other words, the number and types of the data channels constituting the multimedia data are determined by the data type of the multimedia data. Each data channel can be regarded as a dedicated detector for one component of the data.
The multimedia data may be a single image, such as a Joint Photographic Experts Group (JPEG) image, a Portable Network Graphics (PNG) image, a Bitmap (BMP) image, or a Tag Image File Format (TIFF) image; or it may be a dynamic image, an image combination, or a video formed by at least two image frames, such as a Graphics Interchange Format (GIF) animated image, an Audio Video Interleaved (AVI) video, or a Moving Picture Experts Group 4 (MP4) video, and the like, without limitation. In general, any data that can be split into images can be considered multimedia data.
The computer device acquires a media feature map of the multimedia data, where the media feature map is composed of the data features respectively corresponding to the at least two data channels constituting the multimedia data. Denote the height of the multimedia data as H and the width as W, so that the size of the multimedia data is H × W, and denote the number of data channels constituting the multimedia data as C. The computer device processes the multimedia data, extracts the data features corresponding to each data channel, and composes the media feature map of the multimedia data from these data features. The size of the media feature map is C × H × W, indicating that the multimedia data includes data features corresponding to C data channels, each data feature having dimension H × W. The media feature map can be denoted as F, with F ∈ ℝ^{C×H×W}, meaning that F is a feature of dimension C × H × W. If the multimedia data is composed of at least two image frames, the computer device can split the multimedia data into at least two image frames when acquiring it, obtain a media feature map for each image frame, and process each of the resulting media feature maps separately.
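This paragraph fixes only the shapes (an H × W input yielding a feature map F ∈ ℝ^{C×H×W}); the extractor itself is left open here. A sketch under the assumption that a convolutional layer plays that role:

```python
import torch
import torch.nn as nn

H, W, C = 224, 224, 64                    # example sizes; only the shape convention is from the text
frame = torch.randn(1, 3, H, W)           # one RGB image frame (three data channels)

extractor = nn.Conv2d(3, C, kernel_size=3, padding=1)  # assumed feature extractor
media_feature_map = extractor(frame)      # F with shape (1, C, H, W)
assert media_feature_map.shape == (1, C, H, W)
```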
Further, optionally, when the time for predicting the quality of the multimedia data is reached, the computer device executes this step: obtaining a media feature map of the multimedia data and obtaining at least two data channels constituting the multimedia data. The quality prediction opportunity may be determined based on a prediction period: when a new prediction period starts, the quality prediction opportunity of the multimedia data is considered to be reached. For example, if the prediction period is one day and the last quality prediction of historical multimedia data was performed at 10:00 on November 23, then when the current system network time reaches 10:00 on November 24, it is determined that the quality prediction opportunity has been reached; at this time, the computer device may obtain the multimedia data and perform quality prediction on it, where the number of multimedia data items may be one or at least two. In this mode, after acquiring multimedia data, the computer device stores it, and when the quality prediction time is reached, retrieves the multimedia data from its storage location to perform quality prediction. Alternatively, the quality prediction opportunity may be the time when the computer device acquires the multimedia data; in other words, when the computer device acquires multimedia data, it determines that the quality prediction opportunity has been reached and performs quality prediction. Alternatively, the quality prediction opportunity may be the acquisition time of a quality prediction request; in other words, when the computer device receives a quality prediction request for multimedia data, it determines that the quality prediction opportunity has been reached and performs quality prediction on the multimedia data.
Step S302, generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map.
In this embodiment, the computer device may perform feature fusion on the features of all pixel points in one data channel based on the channel attention mechanism to obtain the channel fusion features of that data channel, repeating until the channel fusion features of every data channel are obtained, and then generate the channel attention mask corresponding to the at least two data channels according to the channel fusion features of each data channel. The fusion processes for the respective data channels may be executed simultaneously or separately, and the execution order is not limited. The channel attention mask can be denoted as Mc, with dimension C × 1 × 1, i.e., Mc ∈ ℝ^{C×1×1}: the dimension-reduction operation on the data features reduces each data channel's feature from H × W to 1 × 1, while the number of channels remains C.
Specifically, the at least two data channels include a data channel i, where i is a positive integer less than or equal to the number of the at least two data channels. The computer device can determine the data features of the multimedia data in the data channel i based on the media feature map, and perform feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain the channel fusion features of the data channel i. When the channel fusion features respectively corresponding to each data channel are obtained, the computer device may generate the channel attention masks corresponding to the at least two data channels according to these channel fusion features. The computer device can then generate the channel feature map of the multimedia data according to the channel attention mask Mc and the media feature map F. The channel feature map is denoted F′; since the size of Mc is C × 1 × 1 and the size of F is C × H × W, the size of F′ is C × H × W, i.e., F′ ∈ ℝ^{C×H×W}.
For example, please refer to fig. 4; fig. 4 is a schematic diagram of channel feature map generation provided by an embodiment of this application. As shown in fig. 4, it is assumed that the multimedia data is composed of three data channels: data channel 1, data channel 2, and data channel 3, so the dimension of the media feature map of the multimedia data is 3 × H × W. The computer device can obtain the data feature 4011 of data channel 1, the data feature 4012 of data channel 2, and the data feature 4013 of data channel 3 based on the media feature map 401, where the dimensions of the three data features are all H × W; in other words, the media feature map 401 of the multimedia data is composed of three data features with dimension H × W. The computer device can perform feature fusion on the data feature 4011 of data channel 1 based on the channel attention mechanism, thereby performing a dimension-reduction operation on the data feature 4011 to obtain the channel fusion feature 4021 of data channel 1, whose dimension is 1 × 1. It performs feature fusion on the data feature 4012 of data channel 2 to obtain the channel fusion feature 4022 of data channel 2, whose dimension is 1 × 1, and performs feature fusion on the data feature 4013 of data channel 3 to obtain the channel fusion feature 4023 of data channel 3, whose dimension is 1 × 1. The channel fusion feature 4021 of data channel 1, the channel fusion feature 4022 of data channel 2, and the channel fusion feature 4023 of data channel 3 are processed to obtain the channel attention mask 403 with dimension 3 × 1 × 1. The channel feature map 404 of the multimedia data is generated based on the channel attention mask 403 and the media feature map 401.
The channel fusion features comprise a first channel fusion feature and a second channel fusion feature. When the computer device performs feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain the channel fusion features of the data channel i, it specifically performs mean pooling on the data features of the data channel i based on the channel attention mechanism to obtain the first channel fusion feature of the data channel i, and acquires the mean square error of the data features of the data channel i, determining the mean square error as the second channel fusion feature of the data channel i.
Further, the computer device may perform feature splicing on the first channel fusion features respectively corresponding to each data channel to obtain a first mean feature, and perform feature splicing on the second channel fusion features respectively corresponding to each data channel to obtain a first mean square feature; it then weights the first mean feature to obtain a first mean weighted feature, and weights the first mean square feature to obtain a first mean square weighted feature. The computer device may perform feature fusion on the first mean weighted feature and the first mean square weighted feature to obtain the channel attention masks corresponding to the at least two data channels.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a generation model of a channel feature map provided by an embodiment of this application. As shown in fig. 5, the computer device obtains the media feature map 501 of the multimedia data and performs mean pooling on the data features of each data channel in the media feature map 501 (i.e., mean pooling on the data features of data channel 1, on the data features of data channel 2, ..., and on the data features of data channel C) to obtain the first channel fusion features respectively corresponding to each data channel. Feature splicing is performed on the first channel fusion features respectively corresponding to each data channel to obtain the first mean feature 5021, which is a C × 1 × 1 feature. The feature splicing may be direct splicing, or weighted splicing of the first channel fusion features respectively corresponding to each data channel, which is not limited here. The first mean feature 5021 is weighted based on the first single-layer neural network 5031 to obtain the first mean weighted feature 5041, where the first single-layer neural network 5031 may include C neurons corresponding to the weights for the first mean feature 5021; in other words, the C first channel fusion features composing the first mean feature 5021 are in one-to-one correspondence with the C neurons in the first single-layer neural network 5031.
Similarly, from the media feature map 501 of the multimedia data, the computer device obtains the mean square error of the data features of each data channel (i.e., the mean square error of the data features of data channel 1, of data channel 2, ..., and of data channel C), obtaining the second channel fusion feature corresponding to each data channel. Taking data channel 1 as an example, the computer device can obtain the mean of the data features of data channel 1, obtain the sum of the squared deviations from that mean of the features of each pixel point included in the data features of data channel 1, and divide this sum by the total number of pixels included in the multimedia data to obtain an average value; this average value is determined as the second channel fusion feature of data channel 1, and the second channel fusion features corresponding to the other data channels are acquired in the same way. Feature splicing is performed on the second channel fusion features respectively corresponding to each data channel to obtain the first mean square feature 5022, which is a C × 1 × 1 feature. The feature splicing may be direct splicing, or weighted splicing of the second channel fusion features respectively corresponding to each data channel, which is not limited here. The first mean square feature 5022 is weighted based on the second single-layer neural network 5032 to obtain the first mean square weighted feature 5042, where the second single-layer neural network 5032 may include C neurons corresponding to the weights for the first mean square feature 5022; in other words, the C second channel fusion features composing the first mean square feature 5022 are in one-to-one correspondence with the C neurons in the second single-layer neural network 5032.
Further, the computer device performs feature fusion on the first mean weighted feature 5041 and the first mean square weighted feature 5042, and then processes the feature fusion result based on an activation function to obtain a channel attention mask 507 corresponding to the at least two data channels. The first mean weighted feature 5041 may be fused with the first mean square weighted feature 5042 by, for example, element-wise feature addition (indicated by the symbol at 505), and the feature fusion result is processed based on an activation function 506, which may be a sigmoid function, a Rectified Linear Unit (ReLU) activation function, or the like, without limitation. Further, the computer device may generate a channel feature map 508 of the multimedia data based on the channel attention mask 507 and the media feature map 501.
Further, the channel attention mask includes a sub-mask for each data channel. When generating the channel feature map of the multimedia data according to the channel attention mask and the media feature map, the computer device may perform mask weighting on the data features in data channel i based on the sub-mask of data channel i to obtain the channel mask feature of data channel i; and when the channel mask features corresponding to each data channel are obtained, perform feature splicing on the channel mask features respectively corresponding to each data channel to obtain the channel feature map of the multimedia data. For example, suppose the multimedia data includes three data channels and the channel attention mask includes a sub-mask "0.2" for data channel 1, a sub-mask "0.4" for data channel 2, and a sub-mask "0.4" for data channel 3. The data feature F1 of data channel 1 is mask-weighted based on the sub-mask of data channel 1 to obtain the channel mask feature of data channel 1; the data feature F2 of data channel 2 is mask-weighted based on the sub-mask of data channel 2 to obtain the channel mask feature of data channel 2; the data feature F3 of data channel 3 is mask-weighted based on the sub-mask of data channel 3 to obtain the channel mask feature of data channel 3. Feature splicing is then performed on the three channel mask features to obtain the channel feature map of the multimedia data, namely F' = 0.2F1 ++ 0.4F2 ++ 0.4F3, where "++" denotes feature splicing.
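As a concrete illustration of the channel attention computation above, the following is a minimal PyTorch-style sketch covering the per-channel mean and mean square statistics, the two independent single-layer networks (5031 and 5032), the additive fusion, the sigmoid activation, and the final mask weighting; the module and variable names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module of fig. 5, assuming PyTorch."""
    def __init__(self, num_channels: int):
        super().__init__()
        # Two independent single-layer networks with C neurons each, so the
        # mean branch and the mean square branch train without interfering.
        self.mean_fc = nn.Linear(num_channels, num_channels, bias=False)
        self.msq_fc = nn.Linear(num_channels, num_channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: media feature map of shape (N, C, H, W)
        first_mean = x.mean(dim=(2, 3))                # C first channel fusion features
        first_msq = x.var(dim=(2, 3), unbiased=False)  # C second channel fusion features
        # Weight each branch, fuse by element-wise addition, apply sigmoid.
        mask = torch.sigmoid(self.mean_fc(first_mean) + self.msq_fc(first_msq))
        # Mask weighting: broadcast each per-channel sub-mask over H × W.
        return mask[:, :, None, None] * x              # channel feature map F'
```

With the three-channel example above, a mask of (0.2, 0.4, 0.4) reproduces F' = 0.2F1 ++ 0.4F2 ++ 0.4F3.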
As shown in fig. 5, the mean pooling of the data features of data channel 1 and the computation of their mean square error may be performed independently, so that the training and learning of the mean branch and of the mean square branch do not interfere with each other; this improves the mining capability of the channel attention module and yields a better learning effect. Mean pooling reduces the prediction error caused by the increase in the variance of the estimated value due to the field size of the multimedia data, and retains more information of the multimedia data; obtaining the mean square error captures how the multimedia data deviates from the mean, reflecting the feature variation of the multimedia data and improving the accuracy of the channel attention mask.
Step S303, generating a spatial attention mask corresponding to the channel feature map based on the spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map.
In this embodiment, the computer device may perform feature fusion on the features of a pixel point in each data channel of the multimedia data based on a spatial attention mechanism to obtain the pixel fusion feature of that pixel point, obtain the pixel fusion features corresponding to each pixel point based on this process, and generate a spatial attention mask corresponding to the channel feature map according to the pixel fusion features corresponding to each pixel point. The feature fusion of the individual pixel points may be executed simultaneously or separately, and the execution order is not limited. If the size of the channel feature map is C × H × W, the spatial attention mask, denoted Ms, has size 1 × H × W, i.e., Ms ∈ ℝ^(1×H×W). This is because a dimension reduction operation is performed on the channel pixel feature of each pixel point, reducing its dimension from C to 1, while no dimension reduction is performed across the pixel points; that is, the spatial attention mask keeps the H × W pixel dimensions, so its size is 1 × H × W.
Specifically, the computer device may obtain at least two pixel points for forming the channel feature map, where the at least two pixel points include a pixel point j, j is a positive integer, and j is less than or equal to the number of the at least two pixel points. The channel pixel features corresponding to pixel point j in each data channel are determined based on the media feature map; feature fusion is performed on the at least two channel pixel features corresponding to pixel point j based on the spatial attention mechanism to obtain the pixel fusion feature of pixel point j. When the pixel fusion features corresponding to each pixel point are obtained, the spatial attention mask corresponding to the channel feature map is generated according to the pixel fusion features corresponding to each pixel point. Further, the computer device may generate a spatial feature map of the multimedia data based on the spatial attention mask Ms and the channel feature map F′. The spatial feature map is denoted F″; since the spatial attention mask Ms has size 1 × H × W and the channel feature map F′ has size C × H × W, the spatial feature map F″ has size C × H × W, i.e., F″ ∈ ℝ^(C×H×W).
The pixel fusion feature comprises a first pixel fusion feature and a second pixel fusion feature. When the computer device performs feature fusion on at least two channel pixel features corresponding to a pixel point j based on a spatial attention mechanism to obtain a pixel fusion feature of the pixel point j, the computer device performs mean pooling on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain a first pixel fusion feature of the pixel point j; and acquiring the mean square error of at least two channel pixel characteristics corresponding to the pixel point j, and determining the mean square error corresponding to the pixel point j as a second pixel fusion characteristic of the pixel point j.
Further, the computer device can perform feature splicing on the first pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean feature, and perform feature splicing on the second pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean square feature. The computer device can perform feature splicing on the second mean feature and the second mean square feature to obtain a pixel splicing feature, and convolve the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map. Optionally, the computer device may further perform weighting processing on the second mean feature to obtain a second mean weighted feature, perform weighting processing on the second mean square feature to obtain a second mean square weighted feature, and perform feature splicing on the second mean weighted feature and the second mean square weighted feature to obtain the pixel splicing feature.
Further, the spatial attention mask includes a sub-mask for each pixel point. When generating the spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map, the computer device may perform mask weighting on the channel pixel features of pixel point j based on the sub-mask of pixel point j to obtain the pixel mask feature of pixel point j. When the pixel mask features corresponding to each pixel point are obtained, feature splicing is performed on the pixel mask features respectively corresponding to each pixel point to obtain the spatial feature map of the multimedia data. For example, if the size of the multimedia data is 100 × 50 and the multimedia data is composed of C data channels, the obtained spatial attention mask may be regarded as a 100 × 50 matrix, and the channel feature map has size C × 100 × 50; that is, there are C data channels, each corresponding to a 100 × 50 feature matrix. The 100 × 50 feature matrix in each data channel is dot-multiplied with the spatial attention mask to obtain C weighted 100 × 50 feature matrices, and these C weighted feature matrices constitute the spatial feature map of the multimedia data.
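A minimal sketch of this mask weighting step, assuming PyTorch; the tensor shapes follow the 100 × 50 example above and the channel count C = 8 is illustrative:

```python
import torch

F_prime = torch.randn(8, 100, 50)   # channel feature map with C = 8 data channels
Ms = torch.rand(100, 50)            # 100 × 50 spatial attention mask

# Dot-multiply each channel's 100 × 50 feature matrix with the mask;
# broadcasting applies the same mask to all C channels.
F_double_prime = F_prime * Ms       # spatial feature map, shape (8, 100, 50)
```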
Step S304, outputting the predicted media quality of the multimedia data according to the spatial feature map.
In the embodiment of the present application, if the multimedia data is a single image, the multimedia data is predicted based on the full connection layer to obtain the predicted media quality of the multimedia data. If the multimedia data is a dynamic image, an image combination, or a video composed of at least two image frames, the at least two image frames forming the multimedia data are obtained, the prediction quality corresponding to each of the at least two image frames is obtained based on the full connection layer, and the predicted media quality of the multimedia data is determined according to these per-frame prediction qualities. When doing so, the computer device may determine the average of the prediction qualities of the at least two image frames as the predicted media quality of the multimedia data; the computer device may also obtain a weight for each image frame and compute a weighted sum of the per-frame prediction qualities based on these weights to obtain the predicted media quality; or, key image frames of the at least two image frames may be acquired, the prediction qualities of the key image frames weighted and summed based on a key frame weight to obtain the key quality, the image frames other than the key image frames weighted and summed based on a conventional frame weight to obtain the conventional quality, and the sum of the key quality and the conventional quality determined as the predicted media quality of the multimedia data.
Taking multimedia data comprising a first image frame and a second image frame as an example, the computer device may predict the spatial feature map of the first image frame based on the full connection layer to obtain a first prediction quality of the first image frame, and predict the spatial feature map of the second image frame based on the full connection layer to obtain a second prediction quality of the second image frame. Further, the computer device may obtain a first evaluation weight of the first image frame and a second evaluation weight of the second image frame, weight the first prediction quality based on the first evaluation weight to obtain a first weighted prediction value, weight the second prediction quality based on the second evaluation weight to obtain a second weighted prediction value, and determine the sum of the first weighted prediction value and the second weighted prediction value as the predicted media quality of the multimedia data. Optionally, a first frame position of the first image frame in the multimedia data may be obtained; if the first frame position belongs to a first position range, the first frame weight is determined as the first evaluation weight, and if the first frame position belongs to a second position range, the second frame weight is determined as the first evaluation weight. A second frame position of the second image frame in the multimedia data is acquired; if the second frame position belongs to the first position range, the first frame weight is determined as the second evaluation weight, and if the second frame position belongs to the second position range, the second frame weight is determined as the second evaluation weight. The first prediction quality is weighted by the first evaluation weight to obtain the first weighted prediction value, the second prediction quality is weighted by the second evaluation weight to obtain the second weighted prediction value, and the sum of the two weighted prediction values is determined as the predicted media quality of the multimedia data. The first position range may correspond to the more key image frames and the second position range to the more conventional image frames; that is, the importance of the first position range is greater than that of the second position range, and the first frame weight is greater than the second frame weight. For example, assuming the multimedia data is a video and the cover frame is considered more important than the content frames, the cover frame position can be regarded as the first position range and the content frame positions as the second position range. The first position range and the second position range may be modified as required, and are not limited herein.
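The position-range weighting just described can be sketched as follows; this is a hedged Python illustration, and the weight values, position set, and function name are assumptions rather than values from the patent:

```python
def predicted_media_quality(frame_scores, first_position_range,
                            first_frame_weight=0.7, second_frame_weight=0.3):
    """Weighted sum of per-frame prediction qualities (illustrative weights)."""
    total = 0.0
    for position, score in enumerate(frame_scores):
        weight = (first_frame_weight if position in first_position_range
                  else second_frame_weight)
        total += weight * score          # weighted prediction value of this frame
    return total                         # predicted media quality

# Frame 0 as the cover frame (first position range), frame 1 as a content frame.
quality = predicted_media_quality([2.4, 1.8], first_position_range={0})
```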
Optionally, if the multimedia data is a single image, the computer device may perform feature fusion on the media feature map and the spatial feature map to obtain an output feature map, and predict the output feature map based on the full connection layer to obtain the predicted media quality of the multimedia data. If the multimedia data is a dynamic image, an image combination, or a video composed of at least two image frames, then when obtaining the prediction quality of each image frame based on the full connection layer, taking image frame p (p a positive integer) as an example, the media feature map and the spatial feature map corresponding to image frame p may be feature-fused to obtain the output feature map of image frame p, and this output feature map is predicted based on the full connection layer to obtain the prediction quality of image frame p; the prediction quality of each other image frame is obtained similarly.
Further, the computer device may obtain a recommendation threshold, and if the predicted media quality is greater than or equal to the recommendation threshold, recommend and display the multimedia data; and if the predicted media quality is smaller than the recommendation threshold, acquiring user equipment uploading the multimedia data, and sending a media quality abnormal message to the user equipment, wherein the media quality abnormal message is used for indicating the user equipment to optimize the media quality of the multimedia data. Optionally, if the first evaluation weight and the second evaluation weight are determined according to the first location range and the second location range, the recommendation threshold includes a first recommendation threshold and a second recommendation threshold, the first weighted prediction value and the first recommendation threshold are compared, and the second weighted prediction value and the second recommendation threshold are compared. If the first weighted prediction value is larger than or equal to the first recommendation threshold value and the second weighted prediction value is larger than or equal to the second recommendation threshold value, recommending and displaying the multimedia data; and if the first weighted prediction value is smaller than the first recommendation threshold value or the second weighted prediction value is smaller than the second recommendation threshold value, acquiring the user equipment uploading the multimedia data, and sending a media quality abnormal message to the user equipment, wherein the media quality abnormal message is used for indicating the user equipment to optimize the media quality of the multimedia data.
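A minimal sketch of the recommendation decision under a single recommendation threshold; the threshold value and return labels are assumptions for illustration:

```python
def recommendation_decision(predicted_media_quality, recommendation_threshold=1.5):
    """Return the action for a predicted media quality (illustrative threshold)."""
    if predicted_media_quality >= recommendation_threshold:
        return "recommend_and_display"
    # Below the threshold: notify the uploading user equipment so it can
    # optimize the media quality of the multimedia data.
    return "send_media_quality_abnormal_message"
```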
For example, referring to fig. 6, fig. 6 is a flow chart of a recommendation determination scenario provided in an embodiment of the present application. As shown in fig. 6, assuming that the multimedia data is a video, at least two image frames constituting the multimedia data are acquired, and the at least two image frames are divided into a video cover frame and a video content frame. Performing quality prediction on the video cover frame based on a quality prediction model to obtain first prediction quality, and weighting the first prediction quality by adopting a first evaluation weight to obtain a first weighted prediction value; and performing quality prediction on the video content frame based on the quality prediction model to obtain second prediction quality, and weighting the second prediction quality by adopting a second evaluation weight to obtain a second weighted prediction value. And determining a recommendation result of the multimedia data based on the first weighted prediction value and the second weighted prediction value. When the computer equipment acquires at least two image frames forming the multimedia data, the computer equipment can collect at least two image frames to be detected from the at least two image frames based on a sampling period, determine the prediction quality corresponding to the at least two image frames to be detected respectively based on a quality prediction model, and determine the predicted media quality of the multimedia data according to the prediction quality corresponding to the at least two image frames to be detected respectively. The quality prediction model comprises a channel attention module, a space attention module, a full connection layer and the like, and is used for realizing the steps S301 to S304, wherein the channel attention module is used for realizing the step S302 and acquiring a channel feature map of the multimedia data; the spatial attention module is used for realizing the step S303, and acquiring a spatial feature map of the multimedia data; the full connection layer is used to implement step S304, outputting the predicted media quality of the multimedia data.
In the embodiment of the application, the computer equipment acquires a media characteristic map of multimedia data and acquires at least two data channels for forming the multimedia data; generating channel attention masks corresponding to at least two data channels based on a channel attention mechanism, and generating a channel characteristic map of the multimedia data according to the channel attention masks and the media characteristic map; generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map; and outputting the predicted media quality of the multimedia data according to the spatial feature map. In the application, the network can pay important attention to the data channel forming the multimedia data in the training process through the channel attention mechanism and the space attention mechanism, the attention to distortion and degradation areas is improved, and the accuracy of the quality prediction of the multimedia data is improved.
Further, referring to fig. 7, fig. 7 is a schematic view of a spatial feature map generation scene provided in an embodiment of the present application. As shown in fig. 7, the method for generating the spatial feature map includes the following steps:
Step S701, performing feature fusion processing among the data channels on the channel feature map based on a spatial attention mechanism to obtain a pixel splicing feature.
In this embodiment of the application, the computer device may obtain the second mean feature and the second mean square feature of the at least two pixel points based on the spatial attention mechanism, and perform feature splicing on the second mean feature and the second mean square feature to obtain the pixel splicing feature; for this process, reference may be made to the related description in step S303 of fig. 3. Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a generation model of a spatial feature map provided in an embodiment of the present application. As shown in fig. 8, the computer device obtains a channel feature map 801 of the multimedia data and performs mean pooling on the channel pixel features of each pixel point in the channel feature map 801 (i.e., mean pooling on the channel pixel features of pixel point 1, of pixel point 2, ..., and of pixel point H × W) to obtain the first pixel fusion feature corresponding to each pixel point. Feature splicing is performed on the first pixel fusion features respectively corresponding to each pixel point to obtain a second mean feature 8021, where the second mean feature 8021 is a feature of size 1 × H × W. The feature splicing may be direct splicing, or weighted splicing of the first pixel fusion features corresponding to each pixel point, which is not limited herein. The second mean feature 8021 is weighted based on the third single-layer neural network 8031 to obtain a second mean weighted feature, where the third single-layer neural network 8031 may include H × W neurons that act as weights for the second mean feature 8021; in other words, the H × W first pixel fusion features composing the second mean feature 8021 are in one-to-one correspondence with the H × W neurons in the third single-layer neural network 8031.
Similarly, the computer device obtains the channel feature map 801 of the multimedia data and obtains the mean square error of the channel pixel features of each pixel point in the channel feature map 801 (i.e., the mean square error of the channel pixel features of pixel point 1, of pixel point 2, ..., and of pixel point H × W) to obtain the second pixel fusion feature corresponding to each pixel point. Taking pixel point 1 as an example, the computer device may obtain the mean value of the channel pixel features corresponding to pixel point 1 in each data channel, obtain the sum of squared differences between the channel pixel feature of pixel point 1 in each data channel and that mean value, divide this sum by the total number of data channels included in the multimedia data to obtain an average, and determine the average as the second pixel fusion feature of pixel point 1; the second pixel fusion features of the other pixel points are obtained in the same way. Feature splicing is performed on the second pixel fusion features respectively corresponding to each pixel point to obtain a second mean square feature 8022, where the second mean square feature 8022 is a feature of size 1 × H × W. The feature splicing may be direct splicing or weighted splicing of the second pixel fusion features corresponding to each pixel point, which is not limited herein. The second mean square feature 8022 is weighted based on the fourth single-layer neural network 8032 to obtain a second mean square weighted feature, where the fourth single-layer neural network 8032 may include H × W neurons that act as weights for the second mean square feature 8022; in other words, the H × W second pixel fusion features composing the second mean square feature 8022 are in one-to-one correspondence with the H × W neurons in the fourth single-layer neural network 8032.
The third single-layer neural network 8031 and the fourth single-layer neural network 8032 can be convolutional networks.
Further, the computer device performs feature splicing on the second mean weighted feature and the second mean square weighted feature to obtain a pixel splicing feature 804, where the pixel splicing feature 804 is a feature of size 2 × H × W.
Step S702, processing the pixel splicing characteristics to generate a spatial characteristic map.
In this embodiment, the computer device may convolve the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map, and generate the spatial feature map according to the spatial attention mask and the channel feature map. The size of the convolution kernel used to convolve the pixel splicing feature is determined according to the size of the pixel splicing feature; if the size of the pixel splicing feature is 2 × H × W, the size of the convolution kernel may be 2 × n × n, where n is a positive integer whose value may be changed as needed, for example n may be 3 or 7. Optionally, the computer device may perform continuous convolution on the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map; the computer device may also perform dilation convolution on the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map. Continuous convolution means that when the pixel splicing feature is convolved, the convolved elements are adjacent; dilation convolution means that elements separated by element intervals in the pixel splicing feature are convolved, which enlarges the receptive field of each element in the spatial attention mask. In a convolutional neural network, the receptive field refers to the size of the area on the input matrix or input picture onto which a pixel of the feature map output by each layer is mapped; in brief, a point on the feature map corresponds to an area of the input matrix or input picture. In the embodiment of the present application, the number of elements in the pixel splicing feature fused into one element of the spatial attention mask is the receptive field of the spatial attention mask with respect to the input layer.
Taking the case of the computer device performing dilation convolution on the pixel splicing feature as an example, the convolution parameters include an expansion coefficient, which indicates the element interval between the convolved elements when the pixel splicing feature is convolved. The computer device can determine the kth convolution position of the convolution kernel in the pixel splicing feature based on the expansion coefficient, and convolve the elements at the kth convolution position with the convolution kernel to obtain the kth convolution element, where k is a positive integer. A convolution step is acquired; the (k+1)th convolution position of the convolution kernel in the pixel splicing feature is determined based on the convolution step, the expansion coefficient, and the kth convolution position, and the elements at the (k+1)th convolution position are convolved with the convolution kernel to obtain the (k+1)th convolution element. When the convolution of the pixel splicing feature is completed, the spatial attention mask is generated according to the obtained convolution elements.
Optionally, the computer device may obtain the data size of the multimedia data and the convolution parameters, and determine a convolution feature size. Default feature values are added to the pixel splicing feature to obtain a pixel feature to be convolved, so that the pixel splicing feature can be convolved without reducing its spatial size; the size of the pixel feature to be convolved is the convolution feature size. The pixel feature to be convolved is then convolved to generate the spatial attention mask corresponding to the channel feature map. Optionally, the computer device may perform continuous convolution on the pixel feature to be convolved to generate the spatial attention mask corresponding to the channel feature map; the computer device may also perform dilation convolution on the pixel feature to be convolved to generate the spatial attention mask. The process of performing dilation convolution on the pixel feature to be convolved is the same as that of directly performing dilation convolution on the pixel splicing feature, and is not repeated herein.
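The size-preserving convolution can be reproduced with standard zero padding (the added default feature values); a minimal PyTorch sketch, where the kernel size and expansion coefficient are illustrative:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 7, 8)          # pixel splicing feature, size 2 × 7 × 8
kernel = torch.randn(1, 2, 3, 3)     # one 2 × 3 × 3 convolution kernel
dilation = 2                         # element interval of 1 between convolved elements
padding = dilation * (3 - 1) // 2    # default feature values added on each side

# The output keeps the 7 × 8 size, so no spatial size reduction occurs.
mask_logits = F.conv2d(x, kernel, padding=padding, dilation=dilation)
print(mask_logits.shape)             # torch.Size([1, 1, 7, 8])
```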
As shown in fig. 8, the computer device may convolve the pixel splicing feature 804 to generate a pixel fusion map 805, and process the pixel fusion map 805 based on an activation function 806 to obtain a spatial attention mask 807. The activation function 806 may be a sigmoid function, a Rectified Linear Unit (ReLU) activation function, or the like, which is not limited herein. Further, the computer device may generate a spatial feature map 808 of the multimedia data based on the spatial attention mask 807 and the channel feature map 801.
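Putting steps S701 and S702 together, the spatial attention branch can be sketched as follows in PyTorch; for brevity this illustration omits the third and fourth single-layer weighting networks (8031 and 8032) and lets the convolution learn the weighting, so it is an approximation of the structure in fig. 8, not the patent's exact network:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention module of fig. 8, assuming PyTorch."""
    def __init__(self, kernel_size: int = 7, dilation: int = 1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2   # size-preserving padding
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=padding, dilation=dilation, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: channel feature map F' of shape (N, C, H, W)
        mean = x.mean(dim=1, keepdim=True)                # second mean feature, 1 × H × W
        msq = x.var(dim=1, keepdim=True, unbiased=False)  # second mean square feature
        stitched = torch.cat([mean, msq], dim=1)          # pixel splicing feature, 2 × H × W
        mask = torch.sigmoid(self.conv(stitched))         # spatial attention mask, 1 × H × W
        return mask * x                                   # spatial feature map F''
```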
Please refer to fig. 9, which is a schematic diagram of a convolution scenario provided in the embodiment of the present application. As shown in fig. 9, assume that the pixel splicing feature 901 is a 7 × 8 feature (so the obtained data size of the multimedia data is 7 × 8) and that the expansion coefficient in the convolution parameters indicates an element interval of 1. To convolve the pixel splicing feature 901 without reducing its size, i.e., so that the spatial attention mask obtained after convolution is also 7 × 8, the computer device may obtain the data size of the multimedia data and the convolution parameters, and determine the convolution feature size to be 11 × 12. Default feature values are added to the pixel splicing feature 901 to obtain a pixel feature 902 to be convolved. The adding positions of the default feature values may be set as needed, for example distributed uniformly on the four sides of the pixel splicing feature 901, or placed on a single side of its width or height; for example, if the four sides of the pixel splicing feature 901 are denoted A1, B1, A2, and B2, where A1 and A2 are opposite sides and B1 and B2 are opposite sides, default feature values may be added on the A1 and B1 sides. Alternatively, the adding position may be the start or end position of the pixel splicing feature 901, which is not limited herein.
The pixel feature 902 to be convolved obtained by the computer device has size 11 × 12, and the features at the dotted-line parts are the added default feature values. Further, assuming the convolution step of the convolution kernel is 1, the computer device determines, based on the expansion coefficient, that the first convolution position of the convolution kernel in the pixel feature 902 to be convolved is (first, third, and fifth elements of the first row; first, third, and fifth elements of the third row; first, third, and fifth elements of the fifth row), and convolves the elements at the first convolution position with the convolution kernel to obtain the first convolution element. According to the convolution step "1", the expansion coefficient, and the first convolution position, the second convolution position is determined to be (second, fourth, and sixth elements of the first row; second, fourth, and sixth elements of the third row; second, fourth, and sixth elements of the fifth row), and the elements at the second convolution position in the pixel feature 902 to be convolved are convolved with the convolution kernel to obtain the second convolution element; and so on. According to the convolution step "1", the expansion coefficient, and the 55th convolution position, the last convolution position of the convolution kernel is determined to be (seventh, ninth, and eleventh elements of the eighth row; seventh, ninth, and eleventh elements of the tenth row; seventh, ninth, and eleventh elements of the twelfth row), and the elements at the last convolution position in the pixel feature 902 to be convolved are convolved with the convolution kernel to obtain the last convolution element. A pixel fusion map 903 is generated from the obtained convolution elements, and the pixel fusion map 903 is processed based on an activation function to obtain a spatial attention mask 904.
Optionally, the computer device may perform multilayer convolution on the pixel splicing feature to obtain the spatial attention mask; or generate the pixel feature to be convolved from the pixel splicing feature and perform multilayer convolution on it to obtain the spatial attention mask. Referring to fig. 10, fig. 10 is a schematic diagram of a multilayer dilated convolution structure provided in an embodiment of the present application. As shown in fig. 10, assume the number of convolution layers is 3, the step size is 1, and, taking the pixel splicing feature as an example, the pixel splicing feature is (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11). The expansion coefficient of the first expanded convolutional layer is acquired; if it is 1, the element interval of the first expanded convolutional layer is 0, and the computer device convolves x1, x2, and x3 according to this expansion coefficient to obtain the first element in the first expanded convolutional layer; convolves x2, x3, and x4 to obtain the second element; convolves x3, x4, and x5 to obtain the third element; and so on, until convolving x9, x10, and x11 yields the ninth element in the first expanded convolutional layer. The nine elements in the first expanded convolutional layer are spliced to obtain the output matrix of the first expanded convolutional layer.
The computer device determines the expansion coefficient of the second expanded convolutional layer based on that of the first expanded convolutional layer, obtaining an expansion coefficient of 2. The element interval of the second expanded convolutional layer is therefore 1, and the first, third, and fifth elements in the output matrix of the first expanded convolutional layer are convolved to obtain the first element of the second expanded convolutional layer; the second, fourth, and sixth elements are convolved to obtain the second element; and so on, until the fifth, seventh, and ninth elements are convolved to obtain the fifth element of the second expanded convolutional layer. The five elements of the second expanded convolutional layer are spliced to obtain its output matrix. The expansion coefficient of the third expanded convolutional layer is determined according to the expansion coefficient "2" of the second expanded convolutional layer, obtaining an expansion coefficient of 4. The element interval of the third expanded convolutional layer is therefore 3, and the first, fifth, and ninth elements in the output matrix of the second expanded convolutional layer are convolved to obtain the first element of the third expanded convolutional layer; since that output matrix has only five elements, it may be zero-padded so that it can be convolved in the third expanded convolutional layer. The output matrix of the third expanded convolutional layer is obtained from its first element and determined as the spatial attention mask.
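A minimal sketch of the three-layer example above using 1-D convolutions in PyTorch; the weights are random, since only the convolution positions and output sizes are being demonstrated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.arange(1.0, 12.0).view(1, 1, 11)   # (x1, x2, ..., x11)
conv1 = nn.Conv1d(1, 1, 3, dilation=1)       # expansion coefficient 1, element interval 0
conv2 = nn.Conv1d(1, 1, 3, dilation=2)       # expansion coefficient 2, element interval 1
conv3 = nn.Conv1d(1, 1, 3, dilation=4)       # expansion coefficient 4, element interval 3

out1 = conv1(x)                              # nine elements, shape (1, 1, 9)
out2 = conv2(out1)                           # five elements, shape (1, 1, 5)
# Zero-pad so the third layer can read the first, fifth, and ninth positions.
out3 = conv3(F.pad(out2, (0, 4)))            # one element, shape (1, 1, 1)
```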
Through multilayer convolution, the expansion coefficient can be increased layer by layer, so that the convolution of the pixel splicing feature or the pixel feature to be convolved proceeds more gradually; the receptive field of the obtained spatial attention mask is enlarged, the mask can carry more information, and accuracy is improved.
In the embodiment of the application, a spatial attention mechanism is described, the channel feature map is processed, and the spatial feature map of the multimedia data is finally obtained, so that each element in the spatial feature map can carry more information, and the accuracy of the model is improved.
Further, the quality prediction model in the embodiment of the present application may be implemented by being embedded in a neural network, which may be a ResNet18 network, a Visual Geometry Group network (VGGNet), an AlexNet network, or the like. The number in the ResNet18 network indicates the depth of the network, counting only the layers with weights, i.e., the convolutional layers and the full connection layers, and excluding the pooling layers and the batch normalization (BN) layers. The embedded position of the quality prediction model in the neural network may be changed; that is, the quality prediction model may be inserted at any position in the neural network, or any number of quality prediction models may be inserted, to obtain the quality prediction network of the embodiment of the present application.
For example, in the ResNet18 network, the quality prediction model may be embedded between the third convolutional layer and the fourth convolutional layer, or between the fourth convolutional layer and the fifth convolutional layer. Embedding the quality prediction model in an intermediate layer lets it operate at a higher resolution; that is, the global features in the media feature map can be extracted while partial local details are still acquired, thereby improving the accuracy of quality prediction on the multimedia data.
Referring to fig. 11, fig. 11 is a schematic diagram of an architecture of a quality prediction network according to an embodiment of the present disclosure. As shown in fig. 11, multimedia data is input into the quality prediction network and convolved by a plurality of convolution blocks to obtain a media feature map 1101. A channel attention module 1102 generates a channel attention mask 11021 corresponding to the media feature map 1101 in at least two data channels based on the channel attention mechanism, and a channel feature map 1103 of the multimedia data is generated according to the channel attention mask 11021 and the media feature map 1101. In the spatial attention module 1104, a spatial attention mask 11041 corresponding to the channel feature map 1103 is generated based on the spatial attention mechanism, and a spatial feature map 1105 of the multimedia data is generated from the spatial attention mask 11041 and the channel feature map 1103. The computer device may directly determine the spatial feature map 1105 as the output feature map 1106 of the multimedia data, or may feature-fuse the spatial feature map 1105 with the media feature map 1101 to obtain the output feature map 1106. The output feature map 1106 is input into the next convolution block of the quality prediction network and convolved until it reaches the last layer (namely, the full connection layer) of the quality prediction network, where the acquired features are predicted to obtain the prediction quality of the image frame corresponding to the media feature map 1101. If the multimedia data comprises a single image, this prediction quality is determined as the predicted media quality of the multimedia data; if the multimedia data comprises at least two image frames, the prediction quality of each image frame is acquired based on the above process, and the predicted media quality is determined according to the per-frame prediction qualities.
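A hedged sketch of embedding the attention modules into a torchvision ResNet18 backbone between the third and fourth residual stages; the wrapper class and insertion point are illustrative assumptions, reusing the ChannelAttention and SpatialAttention sketches above:

```python
import torch.nn as nn
from torchvision.models import resnet18

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention (sketched earlier)."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        return self.spatial_att(self.channel_att(x))

backbone = resnet18()
# layer3 outputs 256 channels; inserting here keeps some local detail
# while the global features are already formed.
backbone.layer3 = nn.Sequential(backbone.layer3, AttentionBlock(256))
```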
Assume the predicted media quality of multimedia data ranges from 0 to 3, with a higher value indicating better quality of the corresponding multimedia data. As shown in fig. 12, which is a schematic diagram of a quality prediction result provided in the embodiment of the present application, the predicted media quality of multimedia data 1201 is 0.66, that of multimedia data 1202 is 1.32, and that of multimedia data 1203 is 0.77. It can be seen that the embodiment of the present application predicts a reasonable media quality for each item: for multimedia data with higher compression noise, the predicted media quality is smaller. With 848 test samples, the experimental results are as follows: the Pearson Linear Correlation Coefficient (PLCC) is 0.88 and the Spearman Rank-Order Correlation Coefficient (SROCC) is 0.87. Based on these two indexes, the no-reference quality prediction performed on the multimedia data in the test sample set is excellent.
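The two reported indexes can be computed with SciPy as below; `predicted` and `subjective` stand in for the 848 test-sample scores, which are not reproduced here:

```python
from scipy.stats import pearsonr, spearmanr

def correlation_indexes(predicted, subjective):
    plcc, _ = pearsonr(predicted, subjective)     # Pearson linear correlation
    srocc, _ = spearmanr(predicted, subjective)   # Spearman rank-order correlation
    return plcc, srocc
```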
Further, please refer to fig. 13, which is a schematic diagram of a data quality evaluation apparatus according to an embodiment of the present application. The data quality evaluation apparatus may be a computer program (including program code) running on a computer device; for example, the data quality evaluation apparatus may be application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 13, the data quality evaluation apparatus 1300 may be used in the computer device in the embodiment corresponding to fig. 3. Specifically, the apparatus may include: a channel acquisition module 11, a channel processing module 12, a spatial processing module 13, and a quality prediction module 14.
A channel obtaining module 11, configured to obtain a media feature map of multimedia data, and obtain at least two data channels for forming the multimedia data;
the channel processing module 12 is configured to generate channel attention masks corresponding to at least two data channels based on a channel attention mechanism, and generate a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
the spatial processing module 13 is configured to generate a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generate a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and the quality prediction module 14 is used for outputting the predicted media quality of the multimedia data according to the spatial feature map.
Wherein the at least two data channels comprise a data channel i; i is a positive integer, and i is less than or equal to the number of the at least two data channels;
in generating a channel attention mask corresponding to at least two data channels based on a channel attention mechanism, the channel processing module 12 includes:
a feature determining unit 121, configured to determine a data feature of the multimedia data in the data channel i based on the media feature map;
the channel feature fusion unit 122 is configured to perform feature fusion on the data features in the data channel i based on a channel attention mechanism to obtain channel fusion features of the data channel i;
the channel mask generating unit 123 is configured to, when the channel fusion features respectively corresponding to each data channel are obtained, generate a channel attention mask corresponding to at least two data channels according to the channel fusion features respectively corresponding to each data channel.
Wherein the channel fusion feature comprises a first channel fusion feature and a second channel fusion feature;
the channel feature fusion unit 122 includes:
a first pooling subunit 1221, configured to perform mean pooling on the data characteristics of the data channel i based on a channel attention mechanism, to obtain a first channel fusion characteristic of the data channel i;
and the second pooling subunit 1222 is configured to obtain a mean square error of the data feature of the data channel i, and determine the mean square error as a second channel fusion feature of the data channel i.
The channel mask generating unit 123 includes:
the first splicing subunit 1231 is configured to perform feature splicing on the first channel fusion features respectively corresponding to each data channel to obtain a first mean feature, and perform feature splicing on the second channel fusion features respectively corresponding to each data channel to obtain a first mean square feature;
the second splicing subunit 1232 is configured to perform weighting processing on the first mean value feature to obtain a first mean value weighted feature, and perform weighting processing on the first mean square feature to obtain a first mean square weighted feature;
and a pooling fusion subunit 1233, configured to perform feature fusion on the first mean weighted feature and the first mean square weighted feature, so as to obtain channel attention masks corresponding to the at least two data channels.
Wherein the channel attention mask includes a sub-mask for each data channel;
in generating a channel feature map of multimedia data based on a channel attention mask and a media feature map, the channel processing module 12 includes:
the feature weighting unit 124 is configured to perform mask weighting on the data feature in the data channel i based on the sub-mask of the data channel i, so as to obtain a channel mask feature of the data channel i;
and the map generating unit 125 is configured to, when the channel mask features respectively corresponding to each data channel are obtained, perform feature splicing on the channel mask features respectively corresponding to each data channel to obtain a channel feature map of the multimedia data.
In terms of generating a spatial attention mask corresponding to a channel feature map based on a spatial attention mechanism, the spatial processing module 13 includes:
a pixel obtaining unit 131, configured to obtain at least two pixel points that are used to form a channel feature map; the at least two pixel points comprise pixel points j, j is a positive integer, and j is less than or equal to the number of the pixel points of the at least two pixel points;
the feature obtaining unit 132 is configured to determine, based on the media feature map, channel pixel features corresponding to the pixel points j in each data channel;
the pixel feature fusion unit 133 is configured to perform feature fusion on at least two channel pixel features corresponding to the pixel point j based on a spatial attention mechanism to obtain a pixel fusion feature of the pixel point j;
and the spatial mask generating unit 134 is configured to, when the pixel fusion features corresponding to each pixel point are obtained, generate a spatial attention mask corresponding to the channel feature map according to the pixel fusion features corresponding to each pixel point.
The pixel fusion feature comprises a first pixel fusion feature and a second pixel fusion feature;
the pixel feature fusion unit 133 includes:
a third pooling sub-unit 1331, configured to perform mean pooling on at least two channel pixel characteristics corresponding to the pixel point j based on a spatial attention mechanism, to obtain a first pixel fusion characteristic of the pixel point j;
and a fourth pooling subunit 1332, configured to obtain mean square errors of the at least two channel pixel characteristics corresponding to the pixel point j, and determine the mean square error corresponding to the pixel point j as a second pixel fusion characteristic of the pixel point j.
The spatial mask generating unit 134 includes:
a third splicing subunit 1341, configured to perform feature splicing on first pixel fusion features corresponding to at least two pixel points, respectively, to obtain a second mean value feature, and perform feature splicing on second pixel fusion features corresponding to at least two pixel points, respectively, to obtain a second mean square feature;
a fourth splicing subunit 1342, configured to perform feature splicing on the second mean feature and the second mean-square feature to obtain a pixel splicing feature;
and a feature convolution subunit 1343, configured to perform convolution on the pixel stitching features to generate a spatial attention mask corresponding to the channel feature map.
The feature convolution subunit 1343 includes:
a size determination subunit 13431, configured to obtain a data size and a convolution parameter of the multimedia data, and determine a convolution feature size;
a feature update subunit 13432, configured to add a default feature value to the pixel splicing feature to obtain a pixel feature to be convolved; the size of the pixel feature to be convolved is the size of the convolution feature;
and the update convolution subunit 13433 is configured to convolve the pixel features to be convolved, and generate a spatial attention mask corresponding to the channel feature map.
Wherein the convolution parameters include expansion coefficients;
the update convolution subunit 13433 includes:
the first convolution subunit 1343a is configured to determine a kth convolution position corresponding to the convolution kernel in the pixel stitching feature based on the expansion coefficient, and perform convolution on an element at the kth convolution position in the pixel stitching feature by using the convolution kernel to obtain a kth convolution element; k is a positive integer;
the second convolution subunit 1343b is configured to obtain a convolution step size, determine, based on the convolution step size, the expansion coefficient, and the kth convolution position, a (k +1) th convolution position corresponding to the convolution kernel in the pixel stitching feature, and perform convolution on an element at the (k +1) th convolution position in the pixel stitching feature by using the convolution kernel to obtain a (k +1) th convolution element;
and the element combination subunit 1343c is configured to, when convolution is completed on the pixel stitching feature, generate a spatial attention mask according to the obtained convolution elements.
Wherein the multimedia data comprises a first image frame and a second image frame;
the quality prediction module 14 includes:
the first prediction unit 141 is configured to predict a spatial feature map of the first image frame based on the full connection layer, so as to obtain a first prediction quality of the first image frame;
the second prediction unit 142 is configured to predict a spatial feature map of the second image frame based on the full connection layer, so as to obtain a second prediction quality of the second image frame;
the prediction weighting unit 143 is configured to obtain a first evaluation weight of the first image frame and a second evaluation weight of the second image frame, weight the first prediction quality based on the first evaluation weight to obtain a first weighted prediction value, and weight the second prediction quality based on the second evaluation weight to obtain a second weighted prediction value;
and a quality determining unit 144, configured to determine a sum of the first weighted prediction value and the second weighted prediction value as a predicted media quality of the multimedia data.
Wherein, the apparatus 1300 further comprises:
the data recommendation module 15 is configured to obtain a recommendation threshold, and recommend and display the multimedia data if the predicted media quality is greater than or equal to the recommendation threshold;
the anomaly feedback module 16 is configured to, if the predicted media quality is smaller than the recommendation threshold, acquire the user equipment that uploads the multimedia data, and send a media quality anomaly message to the user equipment; the media quality exception message is used to instruct the user equipment to optimize the media quality of the multimedia data.
The embodiment of the application provides a data quality evaluation device, which is used for acquiring a media characteristic map of multimedia data and acquiring at least two data channels for forming the multimedia data; generating channel attention masks corresponding to at least two data channels based on a channel attention mechanism, and generating a channel characteristic map of the multimedia data according to the channel attention masks and the media characteristic map; generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map; and outputting the predicted media quality of the multimedia data according to the spatial feature map. In the application, the network can pay important attention to the data channel forming the multimedia data in the training process through the channel attention mechanism and the space attention mechanism, the attention to distortion and degradation areas is improved, and the accuracy of the quality prediction of the multimedia data is improved.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 14, the computer device in the embodiment of the present application may include: one or more processors 1401, a memory 1402, and an input-output interface 1403. The processor 1401, the memory 1402, and the input/output interface 1403 are connected by a bus 1404. The memory 1402 is used for storing a computer program comprising program instructions, the input output interface 1403 being used for receiving data and outputting data, such as for data interaction between the computer device and a user device; processor 1401 is configured to execute program instructions stored by memory 1402.
The processor 1401 performs the following operations:
acquiring a media feature map of multimedia data, and acquiring at least two data channels forming the multimedia data;
generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and outputting the predicted media quality of the multimedia data according to the spatial feature map.
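For illustration only, the four operations above could be realized roughly as in the following PyTorch sketch. This is not the patent's implementation: it assumes a CBAM-style convolutional attention module, and every class, layer, and shape choice is an assumption.

```python
import torch
import torch.nn as nn

class AttentionQualityNet(nn.Module):
    """Sketch of the pipeline: media feature map -> channel attention ->
    spatial attention -> predicted media quality (a scalar per input)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, padding=1)  # media feature map
        # Channel attention: fuse per-channel statistics into a C-dim mask.
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Sigmoid())
        # Spatial attention: fuse two pooled maps into a 1-channel mask.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.head = nn.Linear(channels, 1)                    # quality score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.backbone(x)                                  # B x C x H x W
        mean = f.mean(dim=(2, 3))                             # per-channel mean
        spread = f.std(dim=(2, 3))                            # per-channel spread
        ch_mask = self.channel_mlp(torch.cat([mean, spread], dim=1))
        f = f * ch_mask[:, :, None, None]                     # channel feature map
        stats = torch.cat([f.mean(dim=1, keepdim=True),
                           f.std(dim=1, keepdim=True)], dim=1)
        sp_mask = torch.sigmoid(self.spatial_conv(stats))     # spatial attention mask
        f = f * sp_mask                                       # spatial feature map
        return self.head(f.mean(dim=(2, 3)))                  # predicted media quality

# Usage: scores = AttentionQualityNet()(torch.rand(2, 3, 224, 224))  # shape 2 x 1
```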
In some possible implementations, the processor 1401 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 1402 may include read-only memory and random access memory, and provides instructions and data to the processor 1401 and the input/output interface 1403. A portion of the memory 1402 may also include non-volatile random access memory. For example, the memory 1402 may also store device type information.
In a specific implementation, the computer device may execute, through its built-in functional modules, the implementations provided in the steps of fig. 3; for details, refer to the implementations provided in those steps, which are not repeated here.
The embodiment of the present application provides a computer device, including a processor, an input/output interface, and a memory. The processor retrieves the computer instructions from the memory and executes the steps of the method shown in fig. 3 to perform the data quality evaluation operation. The embodiment of the application thereby acquires a media feature map of multimedia data and at least two data channels forming the multimedia data; generates channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generates a channel feature map of the multimedia data according to the channel attention masks and the media feature map; generates a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generates a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map; and outputs the predicted media quality of the multimedia data according to the spatial feature map. In this application, the channel attention mechanism and the spatial attention mechanism allow the network, during training, to focus on the data channels that constitute the multimedia data and to pay increased attention to distorted and degraded regions, which improves the accuracy of the quality prediction of the multimedia data.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program that comprises program instructions. When the program instructions are executed by a processor, the data quality evaluation method provided in each step in fig. 3 can be implemented; for details, refer to the implementations provided in those steps, which are not repeated here, and the beneficial effects of the same method are likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium referred to in the present application, refer to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computer device, or on multiple computer devices located at one site or distributed across multiple sites and interconnected by a communication network.
The computer-readable storage medium may be the data quality evaluation apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer device, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in the various optional manners in fig. 3. A channel feature map of the multimedia data is thereby obtained based on the channel attention mechanism, the channel feature map is processed based on the spatial attention mechanism to obtain a spatial feature map of the multimedia data, and the predicted media quality of the multimedia data is determined according to the spatial feature map; combining the channel attention mechanism with the spatial attention mechanism directs attention to the distorted and degraded regions in the multimedia data, which improves the accuracy of quality prediction. Because the quality prediction model in this application is trained end-to-end, it can be optimized iteratively while the convolutional attention module remains simple, which improves prediction speed.
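Because the model is trained end-to-end, a minimal training-loop sketch is easy to state. It reuses the hypothetical AttentionQualityNet above and assumes scalar mean-opinion-score labels; both are assumptions rather than the patent's actual training setup:

```python
import torch

model = AttentionQualityNet()                  # hypothetical model from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()                   # regress toward quality labels

def train_step(frames: torch.Tensor, mos_labels: torch.Tensor) -> float:
    """One end-to-end optimization step; gradients flow through both the
    channel attention mask and the spatial attention mask."""
    optimizer.zero_grad()
    predicted = model(frames).squeeze(1)       # predicted media quality, shape B
    loss = loss_fn(predicted, mos_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```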
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application therefore remain within its scope.

Claims (15)

1. A method for evaluating data quality, the method comprising:
acquiring a media feature map of multimedia data, and acquiring at least two data channels forming the multimedia data;
generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and outputting the predicted media quality of the multimedia data according to the spatial feature map.
2. The method of claim 1, wherein the at least two data channels comprise a data channel i, where i is a positive integer less than or equal to the number of the at least two data channels;
the generating the channel attention masks corresponding to the at least two data channels based on the channel attention mechanism comprises:
determining data features of the multimedia data in the data channel i based on the media feature map;
performing feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain channel fusion features of the data channel i;
and when the channel fusion features corresponding to each data channel are obtained, generating the channel attention masks corresponding to the at least two data channels according to the channel fusion features corresponding to each data channel.
3. The method of claim 2, wherein the channel fusion feature comprises a first channel fusion feature and a second channel fusion feature;
the performing feature fusion on the data features in the data channel i based on the channel attention mechanism to obtain the channel fusion features of the data channel i comprises:
performing mean pooling on the data features of the data channel i based on the channel attention mechanism to obtain the first channel fusion feature of the data channel i;
and acquiring a mean square error of the data features of the data channel i, and determining the mean square error as the second channel fusion feature of the data channel i.
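For illustration only (this sketch is not part of the claim language, and the tensor layout is an assumption), the two channel fusion features of claim 3 can be read as a per-channel spatial mean and a per-channel mean square error about that mean:

```python
import torch

def channel_fusion_features(media_features: torch.Tensor):
    """media_features: a B x C x H x W media feature map (assumed layout).
    Returns the first channel fusion feature (per-channel mean pooling) and
    the second channel fusion feature (per-channel mean square error)."""
    mean = media_features.mean(dim=(2, 3))                     # B x C
    mse = ((media_features - mean[:, :, None, None]) ** 2
           ).mean(dim=(2, 3))                                  # B x C
    return mean, mse
```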
4. The method according to claim 3, wherein the generating the channel attention mask corresponding to the at least two data channels according to the channel fusion feature corresponding to each data channel respectively comprises:
performing feature splicing on the first channel fusion features respectively corresponding to each data channel to obtain a first mean value feature, and performing feature splicing on the second channel fusion features respectively corresponding to each data channel to obtain a first mean square feature;
weighting the first mean value characteristic to obtain a first mean value weighted characteristic, and weighting the first mean square characteristic to obtain a first mean square weighted characteristic;
and performing feature fusion on the first mean weighted feature and the first mean square weighted feature to obtain channel attention masks corresponding to the at least two data channels.
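For illustration only (the shared MLP weighting and the sigmoid fusion below are assumptions; the claim says only that the spliced features are weighted and then fused), claim 4 might be sketched as:

```python
import torch
import torch.nn as nn

def channel_attention_mask(mean_feat: torch.Tensor,
                           mse_feat: torch.Tensor,
                           mlp: nn.Module) -> torch.Tensor:
    """mean_feat / mse_feat: B x C spliced fusion features (claim 4's first
    mean feature and first mean square feature). The mlp weighting and the
    additive sigmoid fusion are illustrative assumptions."""
    weighted_mean = mlp(mean_feat)    # first mean weighted feature
    weighted_mse = mlp(mse_feat)      # first mean square weighted feature
    return torch.sigmoid(weighted_mean + weighted_mse)  # channel attention mask

# One plausible (assumed) weighting: a shared bottleneck MLP, e.g.
# mlp = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))
```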
5. The method of claim 2, wherein the channel attention mask comprises a sub-mask for each of the data channels;
generating a channel feature map of the multimedia data according to the channel attention mask and the media feature map, including:
performing mask weighting on the data features in the data channel i based on the sub-mask of the data channel i to obtain channel mask features of the data channel i;
and when the channel mask features respectively corresponding to each data channel are obtained, performing feature splicing on the channel mask features respectively corresponding to each data channel to obtain a channel feature map of the multimedia data.
6. The method of claim 1, wherein generating the spatial attention mask corresponding to the channel feature map based on the spatial attention mechanism comprises:
acquiring at least two pixel points forming the channel feature map, the at least two pixel points comprising a pixel point j, where j is a positive integer less than or equal to the number of the at least two pixel points;
determining channel pixel features corresponding to the pixel point j in each data channel respectively based on the media feature map;
performing feature fusion on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain a pixel fusion feature of the pixel point j;
and when the pixel fusion features corresponding to each pixel point are obtained, generating the spatial attention mask corresponding to the channel feature map according to the pixel fusion features corresponding to each pixel point.
7. The method of claim 6, wherein the pixel fusion feature comprises a first pixel fusion feature and a second pixel fusion feature;
the method for performing feature fusion on at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain the pixel fusion feature of the pixel point j includes:
performing mean pooling on the at least two channel pixel features corresponding to the pixel point j based on the spatial attention mechanism to obtain the first pixel fusion feature of the pixel point j;
and acquiring a mean square error of the at least two channel pixel features corresponding to the pixel point j, and determining the mean square error corresponding to the pixel point j as the second pixel fusion feature of the pixel point j.
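For illustration only (the tensor layout is an assumption), claim 7's two pixel fusion features can be read as, for every pixel, the mean and the mean square error of its values across the data channels:

```python
import torch

def pixel_fusion_features(channel_features: torch.Tensor):
    """channel_features: a B x C x H x W channel feature map (assumed).
    Returns the first pixel fusion feature (per-pixel mean across channels)
    and the second pixel fusion feature (per-pixel mean square error)."""
    mean = channel_features.mean(dim=1, keepdim=True)              # B x 1 x H x W
    mse = ((channel_features - mean) ** 2).mean(dim=1, keepdim=True)
    return mean, mse
```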
8. The method according to claim 7, wherein the generating a spatial attention mask corresponding to the channel feature map according to the pixel fusion feature corresponding to each pixel point respectively comprises:
performing feature splicing on first pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean value feature, and performing feature splicing on second pixel fusion features respectively corresponding to the at least two pixel points to obtain a second mean square feature;
performing feature splicing on the second mean value feature and the second mean square feature to obtain a pixel splicing feature;
and performing convolution on the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map.
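For illustration only (the sigmoid and the example kernel size are assumptions), claim 8's splice-then-convolve step might look like:

```python
import torch
import torch.nn as nn

def spatial_attention_mask(mean_map: torch.Tensor,
                           mse_map: torch.Tensor,
                           conv: nn.Conv2d) -> torch.Tensor:
    """Splice the two B x 1 x H x W pixel fusion maps along the channel
    axis (the pixel splicing feature), then convolve to a 1-channel mask."""
    spliced = torch.cat([mean_map, mse_map], dim=1)  # pixel splicing feature
    return torch.sigmoid(conv(spliced))              # spatial attention mask

# Example (assumed) convolution: conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
```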
9. The method of claim 8, wherein the convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map comprises:
acquiring the data size of the multimedia data and the convolution parameters, and determining a convolution feature size according to the data size and the convolution parameters;
adding a default feature value to the pixel splicing feature to obtain a pixel feature to be convolved, the size of the pixel feature to be convolved being the convolution feature size;
and performing convolution on the pixel feature to be convolved to generate the spatial attention mask corresponding to the channel feature map.
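For illustration only (zero as the default feature value and the "same"-padding formula are assumptions; the claim says only that a default feature value is added so that the feature reaches the convolution feature size):

```python
import torch.nn.functional as F
from torch import Tensor

def pad_to_convolution_size(features: Tensor, kernel_size: int,
                            dilation: int = 1) -> Tensor:
    """Pad the pixel splicing feature with a default value (zeros here) so
    that convolving it yields a mask with the original data size."""
    pad = dilation * (kernel_size - 1) // 2  # symmetric "same" padding
    return F.pad(features, (pad, pad, pad, pad), value=0.0)
```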
10. The method of claim 8, wherein the convolution parameters include a dilation coefficient;
the convolving the pixel splicing feature to generate the spatial attention mask corresponding to the channel feature map comprises:
determining a k-th convolution position corresponding to a convolution kernel in the pixel splicing feature based on the expansion coefficient, and performing convolution on an element at the k-th convolution position in the pixel splicing feature by adopting the convolution kernel to obtain a k-th convolution element, where k is a positive integer;
acquiring a convolution step length, determining a (k+1)-th convolution position corresponding to the convolution kernel in the pixel splicing feature based on the convolution step length, the expansion coefficient and the k-th convolution position, and performing convolution on an element at the (k+1)-th convolution position in the pixel splicing feature by adopting the convolution kernel to obtain a (k+1)-th convolution element;
and when the convolution of the pixel splicing feature is finished, generating the spatial attention mask according to the convolution elements obtained.
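For illustration only (the parameter values are assumptions), claim 10's expansion coefficient and convolution step length map directly onto the dilation and stride arguments of a standard 2-D convolution:

```python
import torch
import torch.nn as nn

# dilation spaces the kernel taps (the expansion coefficient); stride moves
# the kernel from the k-th to the (k+1)-th convolution position.
dilated_conv = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=3,
                         stride=1, dilation=2, padding=2)
mask = torch.sigmoid(dilated_conv(torch.rand(1, 2, 56, 56)))  # 1 x 1 x 56 x 56
```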
11. The method of claim 1, wherein the multimedia data comprises a first image frame and a second image frame;
the outputting the predicted media quality of the multimedia data according to the spatial feature map comprises:
predicting the spatial feature map of the first image frame based on a fully connected layer to obtain a first prediction quality of the first image frame;
predicting the spatial feature map of the second image frame based on the fully connected layer to obtain a second prediction quality of the second image frame;
acquiring a first evaluation weight of the first image frame and a second evaluation weight of the second image frame, weighting the first prediction quality based on the first evaluation weight to obtain a first weighted prediction value, and weighting the second prediction quality based on the second evaluation weight to obtain a second weighted prediction value;
and determining the sum of the first weighted prediction value and the second weighted prediction value as the predicted media quality of the multimedia data.
12. The method of any one of claims 1 to 11, further comprising:
acquiring a recommendation threshold, and if the predicted media quality is greater than or equal to the recommendation threshold, recommending and displaying the multimedia data;
if the predicted media quality is less than the recommendation threshold, acquiring the user equipment that uploaded the multimedia data, and sending a media quality anomaly message to the user equipment; the media quality anomaly message is used to instruct the user equipment to optimize the media quality of the multimedia data.
13. An apparatus for evaluating data quality, the apparatus comprising:
the channel acquisition module is used for acquiring a media feature map of multimedia data and acquiring at least two data channels forming the multimedia data;
the channel processing module is used for generating channel attention masks corresponding to the at least two data channels based on a channel attention mechanism, and generating a channel feature map of the multimedia data according to the channel attention masks and the media feature map;
the spatial processing module is used for generating a spatial attention mask corresponding to the channel feature map based on a spatial attention mechanism, and generating a spatial feature map of the multimedia data according to the spatial attention mask and the channel feature map;
and the quality prediction module is used for outputting the predicted media quality of the multimedia data according to the spatial feature map.
14. A computer device comprising a processor, a memory, and an input/output interface;
the processor is connected to the memory and the input/output interface, respectively, wherein the input/output interface is configured to receive data and output data, the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-12.
CN202011338955.XA 2020-11-25 2020-11-25 Data quality evaluation method and device, computer and readable storage medium Pending CN113392232A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011338955.XA CN113392232A (en) 2020-11-25 2020-11-25 Data quality evaluation method and device, computer and readable storage medium

Publications (1)

Publication Number Publication Date
CN113392232A true CN113392232A (en) 2021-09-14

Family

ID=77616593

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code; ref country code: HK; ref legal event code: DE; ref document number: 40051852; country of ref document: HK