CN112995652B - Video quality evaluation method and device

Info

Publication number
CN112995652B
Authority
CN
China
Prior art keywords
video
quality
module
feature
content
Prior art date
Legal status
Active
Application number
CN202110138817.5A
Other languages
Chinese (zh)
Other versions
CN112995652A (en)
Inventor
余冠东
易高雄
吴庆波
龚桂良
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110138817.5A
Publication of CN112995652A
Application granted
Publication of CN112995652B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00 Diagnosis, testing or measuring for television systems or their details
    • H04N17/004 Diagnosis, testing or measuring for digital television systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

Embodiments of the present application provide a video quality evaluation method and device. The method comprises the following steps: acquiring a video to be evaluated, the video comprising a plurality of video frames; performing feature extraction on each of the video frames to obtain an image feature vector of each frame; determining, from the image feature vector, a content feature vector of each video frame related to picture content, and determining, from the image feature vector and the content feature vector, a quality feature vector of each video frame related to picture quality; and determining a picture quality score for each video frame from its quality feature vector, and generating a quality evaluation result for the video to be evaluated from the per-frame picture quality scores. The technical solution of the embodiments of the present application can improve the accuracy of video quality evaluation.

Description

Video quality evaluation method and device
Technical Field
The present application relates to the field of computer and communication technologies, and in particular, to a method and an apparatus for video quality assessment.
Background
With the advent of the multimedia information age, video processing and video communication technologies are emerging one after another, and video quality assessment technology is therefore becoming increasingly important. However, existing video quality assessment techniques suffer from many defects, such as large computational overhead, high complexity, and low accuracy.
Disclosure of Invention
The embodiments of the present application provide a video quality evaluation method and device, which can improve the accuracy of video quality evaluation at least to a certain extent.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a video quality assessment method, including: acquiring a video to be evaluated, wherein the video to be evaluated comprises a plurality of video frames; extracting the features of each video frame in the plurality of video frames to obtain the image feature vector of each video frame; determining content feature vectors of the video frames related to picture content according to the image feature vectors, and determining quality feature vectors of the video frames related to picture quality according to the image feature vectors and the content feature vectors; and determining the picture quality scores of the video frames according to the quality feature vectors, and generating a quality evaluation result aiming at the video to be evaluated according to the picture quality scores of the video frames.
According to an aspect of an embodiment of the present application, there is provided a video quality evaluation apparatus including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a video to be evaluated, and the video to be evaluated comprises a plurality of video frames; the extraction unit is configured to perform feature extraction on each video frame in the plurality of video frames to obtain an image feature vector of each video frame; a first determining unit, configured to determine a content feature vector related to picture content of each video frame according to the image feature vector, and determine a quality feature vector related to picture quality of each video frame according to the image feature vector and the content feature vector; and the second determining unit is configured to determine the picture quality scores of the video frames according to the quality feature vectors, and generate a quality evaluation result for the video to be evaluated according to the picture quality scores of the video frames.
In some embodiments of the present application, based on the foregoing scheme, the second determining unit includes: an obtaining subunit configured to obtain the stall count and stall duration of the video to be evaluated, and determine the video fluency impairment value of the video to be evaluated according to the stall count and the stall duration; a determining subunit configured to determine the picture quality score of the video to be evaluated according to the picture quality scores of the video frames; and a generating subunit configured to generate a quality evaluation result for the video to be evaluated according to the video fluency impairment value and the picture quality score of the video to be evaluated.
In some embodiments of the present application, based on the foregoing scheme, the determining subunit is configured to: and calculating the average value of a plurality of quality scores according to the picture quality scores of all the video frames, and taking the calculated average value as the picture quality score of the video to be evaluated.
In some embodiments of the present application, based on the foregoing solution, the obtaining subunit is configured to: determine a first video fluency impairment value of the video to be evaluated according to the stall count, and determine a second video fluency impairment value of the video to be evaluated according to the stall duration; and determine the video fluency impairment value of the video to be evaluated according to the first video fluency impairment value and the second video fluency impairment value.
In some embodiments of the present application, based on the foregoing solution, the image feature vector is extracted by a feature extraction module of a quality assessment model, and the content feature vector is obtained by inputting the image feature vector into a residual module of the quality assessment model. The second determining unit is configured to determine the picture quality score of each video frame according to the quality feature vector by inputting the quality feature vector of each video frame into a quality evaluation module of the quality assessment model to obtain the picture quality score of each video frame output by the quality evaluation module.
In some embodiments of the present application, based on the foregoing solution, the quality assessment model further includes a correlation module, a feature correlation coefficient calculation module, and a content classification module, and is trained in the following manner: acquiring a training sample set, wherein the training sample set comprises a plurality of batch processing sets, each batch processing set of the plurality of batch processing sets comprises a plurality of video samples, and each video sample comprises a quality score label and a video content label; determining a first feature correlation coefficient corresponding to each batch processing set through the feature extraction module, the residual error module, the correlation module and the feature correlation coefficient calculation module, determining a first loss function according to the first feature correlation coefficient, and adjusting parameters of the correlation module according to the first loss function; determining a second feature correlation coefficient, a quality loss value and a content loss value corresponding to each batch processing set through the feature extraction module, the residual error module, the parameter-adjusted correlation module, the feature correlation coefficient calculation module, the quality evaluation module and the content classification module, determining a second loss function according to the second feature correlation coefficient, the quality loss value and the content loss value, and adjusting parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module according to the second loss function; based on the feature extraction module after parameter adjustment, the residual error module after parameter adjustment, the quality evaluation module after parameter adjustment and the content classification module after parameter adjustment, the parameters of the correlation module are adjusted again, and based on the correlation module after parameter adjustment, the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module are continuously adjusted until convergence.
In some embodiments of the present application, based on the foregoing solution, determining, by the feature extraction module, the residual error module, and the correlation module, a first feature correlation coefficient corresponding to each batch processing set includes: extracting image feature vectors of a plurality of video sample frames in each video sample contained in each batch processing set through the feature extraction module, inputting the image feature vectors of each video sample frame into the residual error module to obtain content feature vectors, which are output by the residual error module and are related to picture content, of each video sample frame, and determining quality feature vectors, which are related to picture quality, of each video sample frame according to the image feature vectors of each video sample frame and the content feature vectors of each video sample frame; and determining a feature correlation coefficient corresponding to each video sample frame through the correlation module and the feature correlation coefficient calculation module according to the content feature vector of each video sample frame and the quality feature vector of each video sample frame, and taking the determined feature correlation coefficient corresponding to each video sample frame as a first feature correlation coefficient corresponding to each batch processing set.
In some embodiments of the present application, based on the foregoing scheme, determining a first loss function according to the first feature correlation coefficient includes: and acquiring feature correlation coefficients corresponding to the plurality of video sample frames respectively, and taking the inverse number of the minimum feature correlation coefficient in the plurality of feature correlation coefficients as the first loss function.
In some embodiments of the present application, based on the foregoing solution, the method further includes: determining, by the quality evaluation module, an output picture quality score of each video sample included in each batch processing set, and determining, by the content classification module, an output video content category of each video sample included in each batch processing set; determining a quality loss value corresponding to each batch processing set according to the output picture quality score of each video sample and the quality score label of each video sample; and determining the content loss value corresponding to each batch processing set according to the output video content category of each video sample and the video content label of each video sample.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing the video quality assessment method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video quality assessment method as described in the above embodiments.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video quality assessment method provided in the various alternative embodiments described above.
In the technical solutions provided in some embodiments of the present application, feature extraction is performed on each video frame in a video to be evaluated to obtain an image feature vector of each video frame, then a content feature vector related to picture content of each video frame is determined according to the image feature vector, a quality feature vector related to picture quality of each video frame is determined according to the image feature vector and the content feature vector, a picture quality score of each video frame is determined according to the quality feature vector of each video frame, and finally a quality evaluation result for the video to be evaluated is generated according to the picture quality score of each video frame. According to the technical scheme of the embodiment of the application, the characteristics related to the picture content and the characteristics related to the picture quality are separated, so that mutual interference among the characteristics is avoided, the accuracy of video quality assessment is improved, meanwhile, a model with a large and complex structure is avoided, the calculation cost is saved, and the practicability is higher.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 shows a flow diagram of a video quality assessment method according to an embodiment of the present application;
FIG. 3 shows a flow diagram of a video quality assessment method according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a quality assessment model according to an embodiment of the present application;
FIG. 5 shows a flow diagram of a video quality assessment method according to an embodiment of the present application;
FIG. 6 shows a flow diagram of a video quality assessment method according to an embodiment of the present application;
fig. 7 shows a block diagram of a video quality assessment apparatus according to an embodiment of the present application;
FIG. 8 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
It is to be noted that the terms used in the specification and claims of the present application and the above-described drawings are only for describing the embodiments and are not intended to limit the scope of the present application. It will be understood that the terms "comprises," "comprising," "includes," "including," "has," "having," and the like, when used herein, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element without departing from the scope of the present invention. Similarly, a second element may be termed a first element. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should be noted that: reference herein to "a plurality" means two or more. "and/or" describe the association relationship of the associated objects, meaning that there may be three relationships, e.g., A and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
With the research and development of Artificial Intelligence (AI) technology, AI is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned and autonomous driving, unmanned aerial vehicles, robots, intelligent medical services, and intelligent customer service. It is believed that, as the technology develops, AI will be applied in ever more fields and deliver increasingly important value.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning and other directions.
Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
It should be understood that the technical solution provided in the present application can be applied to artificial-intelligence-based video processing scenarios, and in particular to scenarios in which video quality is evaluated. Evaluating video quality effectively and in real time helps video service providers offer better service. In recent years, with the large-scale deployment of fourth-generation mobile communication terminals and continued breakthroughs in fifth-generation mobile communication technology, wireless network speeds have increased rapidly, and applications for video calls and video conferences (e.g., instant messaging applications, online conference applications) are increasingly used in daily life. However, various distortions are introduced during video acquisition, encoding, transmission, and decoding, which seriously affect video quality and the user's viewing experience. As a provider of video services, in order to give users a better viewing experience, it is necessary to establish an evaluation criterion for the quality of the videos users receive, that is, to detect and quantify the quality of experience of these videos. The need for video quality assessment addressed by the present application is therefore urgent.
At present, video quality evaluation methods fall mainly into two types: those based on traditional natural scene statistics, and those based on deep learning. Traditional statistical methods generally have low time and model complexity, but their accuracy is limited; in most cases they can only evaluate samples whose quality is obviously degraded, so many failure cases arise when facing high-definition videos with rich scene details or complicated, variable content. In recent years, deep learning methods have shown remarkable performance in the field of image and video processing, but many of the best-performing models have complex structures and very high computational overhead, and methods developed in the academic community tend to grow ever more complex. For example, a large residual network (ResNet-50) or even a 3D convolutional neural network may be chosen for spatial-domain feature extraction; such models place very high demands on hardware and cannot be deployed on mobile devices in practical applications. In addition, for temporal feature extraction, the Long Short-Term Memory network (LSTM) is increasingly widely used, which also limits the real-time performance of model prediction. It should be noted that the performance improvement obtained through this increased model complexity is not significant: the extra computational overhead mostly serves to capture temporal features, yet experimental results and practical experience show that video fluency affects users far less than definition does. From a practical point of view, the benefit existing deep learning methods gain from capturing temporal features is therefore far from enough to compensate for the computational overhead sacrificed.
Based on this, an embodiment of the present application provides a video quality evaluation method: feature extraction is performed on each video frame of the video to be evaluated to obtain an image feature vector of each video frame; a content feature vector of each video frame related to picture content is determined from the image feature vector; a quality feature vector of each video frame related to picture quality is determined from the image feature vector and the content feature vector; a picture quality score of each video frame is determined from its quality feature vector; and finally a quality evaluation result for the video to be evaluated is generated from the per-frame picture quality scores. By separating the features related to picture content from the features related to picture quality, the technical scheme of the embodiments avoids mutual interference between the features and improves the accuracy of video quality assessment, while avoiding a model with a large and complex structure, saving computational overhead, and offering greater practicability.
For easy understanding, the present embodiment of the application provides a method for video quality assessment, which is applied to the system architecture shown in fig. 1, and referring to fig. 1, the system architecture 100 may include a terminal device 102, a network and a server 104. The network serves as a medium for providing a communication link between the terminal device 102 and the server 104. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal device 102 may be a variety of electronic devices having a display screen including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. The server 104 may be an independent physical server, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The video quality assessment method provided by the embodiment of the present application is generally executed by the server 104, and accordingly, the video quality assessment apparatus is generally disposed in the server 104. However, it is easily understood by those skilled in the art that the video quality assessment method provided in the embodiment of the present application may also be executed by the terminal device 102, and accordingly, the video quality assessment apparatus may also be disposed in the terminal device 102, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, a user uploads a video to be evaluated to the server 104 through the terminal device 102, and the server 104 evaluates the video by using the video quality evaluation method provided in the embodiment of the present application, and sends an obtained quality evaluation result to the terminal device 102.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that there may be any number of terminal devices, networks, and servers, as desired for an implementation. For example, server 104 may be a server cluster comprised of multiple servers, or the like.
In combination with the above description, the solution provided in the embodiments of the present application relates to artificial intelligence technologies such as Computer Vision (CV). Computer Vision is the science of how to make a machine "see": using a camera and a computer instead of human eyes to recognize, track, and measure targets, and further processing the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also covers common biometric technologies such as face recognition and fingerprint recognition.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a flow diagram of a video quality assessment method according to an embodiment of the present application, which may be performed by a server, which may be the server 104 shown in fig. 1. Referring to fig. 2, the video quality assessment method at least includes the following steps:
step S210, obtaining a video to be evaluated, wherein the video to be evaluated comprises a plurality of video frames;
step S220, extracting the characteristics of each video frame in the plurality of video frames to obtain the image characteristic vector of each video frame;
step S230, determining content feature vectors of each video frame related to the picture content according to the image feature vectors, and determining quality feature vectors of each video frame related to the picture quality according to the image feature vectors and the content feature vectors;
step S240, determining the picture quality scores of the video frames according to the quality feature vectors, and generating the quality evaluation results aiming at the video to be evaluated according to the picture quality scores of the video frames.
These steps are described in detail below.
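Before the step-by-step description, the overall flow of steps S210 to S240 can be summarized in a short sketch. This is an illustrative outline only; the callables extract_features, residual_module, and quality_head are hypothetical stand-ins for the modules described in the following sections, and simple averaging is used as one possible aggregation.

```python
# Illustrative outline of steps S210-S240. The three callables are
# hypothetical stand-ins for the modules described below.
def evaluate_video(frames, extract_features, residual_module, quality_head):
    frame_scores = []
    for frame in frames:                              # S210: frames of the video
        f_img = extract_features(frame)               # S220: image feature vector
        f_content = residual_module(f_img)            # S230: content feature vector
        f_quality = f_img - f_content                 # S230: quality feature vector
        frame_scores.append(quality_head(f_quality))  # S240: per-frame score
    # S240: aggregate per-frame scores into a video-level result
    # (averaging is one option given later in this description)
    return sum(frame_scores) / len(frame_scores)
```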
Step S210, a video to be evaluated is obtained, where the video to be evaluated includes a plurality of video frames.
In this embodiment, the video to be evaluated is a video whose quality needs to be evaluated; it may be a real-time streaming video, a captured video, or a downloaded video, and its duration and type are not limited. The video to be evaluated comprises a plurality of video frames: frame 1, frame 2, ..., frame i−1, frame i, frame i+1, ..., frame n−1, frame n.
Step S220, performing feature extraction on each video frame of the plurality of video frames to obtain an image feature vector of each video frame.
The image feature vector expresses the image features of a video frame in vector form; these image features include at least information about the picture content and information about the picture quality of the frame.
Optionally, the feature extraction performed on each video frame in the multiple video frames may be performed by inputting each video frame into a pre-trained feature extraction network to obtain an image feature vector output by the feature extraction network, where the feature extraction network may be obtained based on deep convolutional neural network training, for example, the feature extraction network may adopt a Visual Geometry Group (VGG) network structure. Besides the feature extraction by the model method, the feature extraction of each video frame can be performed by adopting methods such as geometric feature extraction and signal processing method feature extraction, and the embodiment of the application does not specifically limit the specific way of extracting the image features.
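As a concrete illustration of this step, per-frame feature extraction with a pretrained backbone might look like the following PyTorch sketch. The choice of MobileNetV3-Small, the input size, and the preprocessing are assumptions made for illustration (consistent with the lightweight backbones mentioned later in this description), not details fixed by the embodiment.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# A pretrained lightweight CNN standing in for the feature extraction
# network; backbone choice and preprocessing are illustrative assumptions.
backbone = models.mobilenet_v3_small(weights="DEFAULT")
backbone.classifier = torch.nn.Identity()  # keep pooled features, drop classifier
backbone.eval()

preprocess = T.Compose([
    T.ToTensor(),                 # HxWx3 uint8 -> 3xHxW float in [0, 1]
    T.Resize((224, 224)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frame):
    x = preprocess(frame).unsqueeze(0)  # add batch dimension
    return backbone(x).squeeze(0)       # image feature vector of the frame
```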
Step S230, determining a content feature vector related to the picture content of each video frame according to the image feature vector, and determining a quality feature vector related to the picture quality of each video frame according to the image feature vector and the content feature vector.
The content feature vector refers to a feature vector related to the picture content of the video frame, that is, a vector of picture content features expressed in a vector form, where the picture content features at least may include object information, scene information, color information, position information, and the like in the video frame.
Since the image features at least include information related to the picture content and information related to the picture quality of the video frames, after the image feature vectors are extracted, the content feature vectors related to the picture content of the video frames can be determined according to the image feature vectors, and further, the quality feature vectors related to the picture quality of the video frames are determined by subtracting the content feature vectors from the image feature vectors.
Step S240, determining the picture quality scores of the video frames according to the quality feature vectors, and generating the quality evaluation results aiming at the video to be evaluated according to the picture quality scores of the video frames.
In particular, the picture quality score is used to indicate the picture quality status of each video frame. In one embodiment, each quality feature vector may correspond to a picture quality score, for example, when a certain quality feature vector is obtained, a corresponding picture quality score may be obtained.
In an embodiment, a quality evaluation model can be built and trained in advance, and the quality feature vector is input into the quality evaluation model to obtain the picture quality score of each video frame output by the model. The quality evaluation model is trained by taking the video samples in a video sample set as input and their overall quality scores as output. It should be noted that this embodiment does not limit the type of the quality evaluation model. Alternatively, the quality assessment model may be any of the following: a neural network model, a deep learning model, or a machine learning model.
Since the picture quality score may be used to indicate the picture quality condition of each video frame, after the picture quality score of each video frame is determined, further, a quality evaluation result for the video to be evaluated may be generated according to the picture quality score of each video frame. In an embodiment, the generating of the quality evaluation result for the video to be evaluated according to the picture quality scores of the video frames may be calculating an average value of a plurality of quality scores according to the picture quality scores of the video frames, and then taking the calculated average value as the picture quality evaluation result of the video to be evaluated. Of course, it may also be another feasible implementation manner to generate the quality evaluation result for the video to be evaluated according to the quality score of each video frame, and is not specifically limited herein.
Based on the technical scheme of the embodiment, the content feature vector related to the picture content of each video frame is determined according to the image feature vector of each video frame, the quality feature vector related to the picture quality of each video frame is determined according to the image feature vector and the content feature vector, the picture quality score of each video frame is further determined according to the quality feature vector of each video frame, and finally, the quality evaluation result for the video to be evaluated is generated according to the picture quality score of each video frame. According to the technical scheme of the embodiment of the application, the characteristics related to the picture content and the characteristics related to the picture quality are separated, so that mutual interference between the content characteristics and the quality characteristics is avoided, the accuracy of video quality assessment is improved, meanwhile, a model with a large and complex structure is avoided, the calculation cost is saved, and the practicability is higher.
In an embodiment of the application, after the picture quality scores of the video frames of the video to be evaluated are determined, the obtained picture quality scores can be combined with characteristics related to playback fluency, namely the stall count and the stall duration, to finally generate a quality evaluation result for the video to be evaluated. In this embodiment, as shown in fig. 3, the step S240 of generating a quality evaluation result for the video to be evaluated according to the picture quality score of each video frame may specifically include steps S310 to S330, which are described as follows:
in step S310, the number of times the video to be evaluated is blocked and the duration of the video to be evaluated are obtained, and the video smoothness damage value of the video to be evaluated is determined according to the number of times the video is blocked and the duration of the video to be evaluated.
The video playing is usually caused by delay in operation of the terminal device or occurrence of bad network conditions such as network delay, jitter, packet loss, and the like.
When the human eye images on the retina when watching an object and inputs the image into the human brain through the optic nerve, the image of the object is sensed, when the object is removed, the impression of the optic nerve on the object does not disappear immediately and can be delayed, for example, 0.1 to 0.2 seconds is delayed, based on the persistence of vision effect, the katon information can be used as a factor for evaluating the video quality, and the katon information can reflect the fluency of the video. In this embodiment, the stuck information at least includes the stuck times and the stuck duration, and in specific implementation, the execution main body may calculate the stuck times and the stuck duration according to the stuck event of the video to be evaluated, so as to obtain the stuck times and the stuck duration of the video to be evaluated.
For example, within a preset time period, the stall-start events reported by the player are counted to obtain the stall count; likewise, the time between each stall-start event and the corresponding stall-end event reported by the player within the preset time period is accumulated to determine the stall duration.
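A minimal sketch of deriving the stall count and total stall duration from player events follows; the event representation (timestamp plus an "enter_stall"/"exit_stall" type) is a hypothetical schema, not one defined by the embodiment.

```python
# Hypothetical event schema: (timestamp_seconds, event_type), with
# event_type "enter_stall" or "exit_stall" as reported by the player.
def stall_stats(events, window_start, window_end):
    count, duration, enter_t = 0, 0.0, None
    for t, kind in sorted(events):
        if not (window_start <= t <= window_end):
            continue
        if kind == "enter_stall":
            count += 1
            enter_t = t
        elif kind == "exit_stall" and enter_t is not None:
            duration += t - enter_t
            enter_t = None
    return count, duration  # n (stall count), d (total stall seconds)
```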
After the stall count and stall duration are obtained, the video fluency impairment value of the video to be evaluated can be determined from them. The video fluency impairment value determined from the stall count and stall duration reflects, from the perspective of playback fluency, how severely the fluency of the video to be evaluated is impaired: the larger the value, the worse the playback fluency.

In an embodiment of the application, when determining the video fluency impairment value of the video to be evaluated, a first video fluency impairment value may be determined according to the stall count, a second video fluency impairment value may be determined according to the stall duration, and finally the video fluency impairment value of the video to be evaluated may be determined from the first and second video fluency impairment values.
In a specific implementation, considering the sensitivity and saturation characteristics of the degradation of user experience, a first video fluency impairment value D_num of the video to be evaluated can be calculated according to formula one, and a second video fluency impairment value D_dur according to formula two; considering the overlap between D_num and D_dur and the diversity of practical situations, the video fluency impairment value D_stall of the video to be evaluated is then calculated according to formula three:

[Formula one: D_num as a function of the stall count n — rendered only as an image in the source]

[Formula two: D_dur as a function of the stall duration d — rendered only as an image in the source]

D_stall = max(D_num, D_dur) + 0.5 · min(D_num, D_dur)   (formula three)

where n and d denote the stall count and the stall duration, respectively.
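The computation can be sketched as follows. Formula three is taken as given in the text; formulas one and two survive only as images in the source, so the saturating exponential forms (and their coefficients) used for D_num and D_dur below are stand-in assumptions that merely exhibit the stated sensitivity and saturation behavior.

```python
import math

def fluency_impairment(n, d):
    # Stand-in assumptions for formulas one and two: curves that rise
    # quickly for small n or d and saturate for large values.
    d_num = 1.0 - math.exp(-0.5 * n)  # assumed form of formula one
    d_dur = 1.0 - math.exp(-0.1 * d)  # assumed form of formula two
    # Formula three, as given in the text:
    return max(d_num, d_dur) + 0.5 * min(d_num, d_dur)
```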
In step S320, the picture quality score of the video to be evaluated is determined according to the picture quality score of each video frame.
Since the video to be evaluated comprises a plurality of video frames, after the picture quality score of each video frame is determined, the picture quality score of the video to be evaluated can be determined according to the per-frame picture quality scores. Specifically, a weighting operation is applied to the per-frame picture quality scores to obtain the picture quality score of the video to be evaluated.
In an embodiment of the present application, determining the quality score of the video to be evaluated according to the picture quality score of each video frame may also be calculating an average value of a plurality of quality scores according to the quality score of each video frame, and then taking the calculated average value as the picture quality score of the video to be evaluated.
In step S330, a quality evaluation result for the video to be evaluated is generated according to the video fluency impairment value of the video to be evaluated and the picture quality score of the video to be evaluated.
The picture quality score of the video to be evaluated, determined from the per-frame picture quality scores, reflects the picture quality of the video; the video fluency impairment value, determined from the stall count and stall duration, reflects the influence of stall events on the overall quality of the video. The larger the video fluency impairment value, the greater the influence of stall events on the overall quality of the video to be evaluated, and the worse that overall quality.
In this embodiment, after the video fluency impairment value and the picture quality score of the video to be evaluated have been determined, the quality evaluation result for the video to be evaluated can be generated from them. In a specific implementation, the quality evaluation result predicted_score can be calculated according to formula four:

predicted_score = frame_score − 0.1 · D_stall   (formula four)

where frame_score is the picture quality score of the video to be evaluated and D_stall is the video fluency impairment value of the video to be evaluated.
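Under the reconstruction of formula four above, the final result is a one-line computation:

```python
def quality_result(frame_score, d_stall):
    # Formula four (as reconstructed above): subtract the weighted
    # fluency impairment from the video's picture quality score.
    return frame_score - 0.1 * d_stall
```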
In an embodiment of the present application, a quality evaluation model may be deployed on the server side or on the terminal device side, and the quality of the video to be evaluated is evaluated through this model. For convenience of description, refer to fig. 4, a schematic structural diagram of the quality evaluation model in this embodiment. As shown in fig. 4, during prediction the video to be evaluated passes through the quality evaluation model and a picture quality score is output; the output may be the picture quality score of each video frame in the video, or a picture quality score for the whole video derived from the predicted per-frame scores. Specifically, the quality evaluation model used for evaluation comprises a feature extraction module, a residual module, and a quality evaluation module; during training, a correlation module 1, a correlation module 2, a feature correlation coefficient calculation module, and a content classification module are added, so that the quality evaluation model used for training comprises the feature extraction module, the residual module, correlation module 1, correlation module 2, the feature correlation coefficient calculation module, the quality evaluation module, and the content classification module.
In this embodiment, evaluating the video quality of the video to be evaluated with the quality evaluation model may specifically include the following. First, the feature extraction module of the quality evaluation model extracts the image feature vector of each video frame in the video to be evaluated. Each image feature vector is then input into the residual module of the quality evaluation model to obtain the content feature vector of the frame related to picture content, and the content feature vector is subtracted from the image feature vector to obtain the quality feature vector of the frame related to picture quality. Finally, the quality feature vector of each video frame is input into the quality evaluation module of the model to obtain the picture quality score of each video frame output by that module.
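A compact PyTorch sketch of the inference-time structure just described (feature extraction, residual module, subtraction, quality evaluation module) is given below. The layer sizes and the use of simple fully connected layers are assumptions; only the structure — extract, derive content features, subtract, score — follows the description.

```python
import torch
import torch.nn as nn

class QualityModel(nn.Module):
    # Sketch with assumed layer sizes; nn.Identity stands in for the
    # CNN backbone of the feature extraction module.
    def __init__(self, feat_dim=576):
        super().__init__()
        self.extractor = nn.Identity()
        self.residual = nn.Sequential(        # residual module -> content features
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.quality_head = nn.Sequential(    # quality evaluation module
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, frame_features):
        f_img = self.extractor(frame_features)  # image feature vectors
        f_content = self.residual(f_img)        # content feature vectors
        f_quality = f_img - f_content           # quality feature vectors
        return self.quality_head(f_quality).squeeze(-1)  # per-frame scores
```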
Optionally, the feature extraction module of the quality evaluation model may select a lightweight network as a backbone network of the feature extraction module, where the lightweight network may be MobileNet V3, ShuffleNet, or the like.
In an embodiment of the present application, a method for training a quality assessment model is further provided, which may specifically include:
the method comprises the steps of firstly, obtaining a training sample set, wherein the training sample set comprises a plurality of batch processing sets, each batch processing set in the plurality of batch processing sets comprises a plurality of video samples, and each video sample comprises a quality score label and a video content label.
Specifically, before training the quality assessment model, a training sample set for training the model needs to be constructed. Under an optimization method using mini-batch gradient descent, the training sample set is not passed through the model all at once but is fed in batch by batch: the training sample set is divided into a plurality of batch processing sets, and each batch processing set is used to train the quality assessment model and update its parameters once.
For example, suppose the training data set contains 200 video samples and the batch size is 5. The training data set is then divided into 40 batch processing sets of 5 video samples each, and after each batch of 5 video samples the parameters of the quality assessment model to be trained are updated, as illustrated below.
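A trivial illustration of this batching:

```python
def make_batches(samples, batch_size=5):
    # 200 samples with batch_size=5 yields 40 batch processing sets;
    # the model parameters are updated once per batch.
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]
```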
It should further be noted that each video sample in the training sample set carries a quality score label and a video content label. The quality score label is the actual quality score of the video sample, which can be annotated according to human subjective perception; the video content label relates to the content of the video sample.
Secondly, after the training sample set is obtained, a first feature correlation coefficient corresponding to each batch processing set is determined through the feature extraction module, residual module, correlation module, and feature correlation coefficient calculation module of the quality assessment model to be trained; a first loss function is determined according to the first feature correlation coefficient; and the parameters of the correlation module are adjusted according to the first loss function.
In this step, after the training sample set is obtained, each batch processing set may be transmitted to the quality assessment model to be trained to perform the first stage of training. The training in the first stage is mainly used for adjusting the parameters of the correlation module 1 and the correlation module 2 in the quality assessment model to be trained.
In the training process of the first stage, first feature correlation coefficients corresponding to each batch processing set may be determined through a feature extraction module, a residual module, a correlation module 1 and a correlation module 2 of the quality assessment model to be trained, where the first feature correlation coefficients are used to represent content feature vectors related to picture content of a plurality of video samples included in each batch processing set and correlations between quality feature vectors related to picture quality of the plurality of video samples.
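The two-stage alternation described here can be sketched as follows. The optimizer choices, learning rates, the helper functions stage1_loss and stage2_loss, and the content_head attribute (standing in for the content classification module) are all assumptions for illustration; only the alternation — update the correlation modules first, then the remaining modules with the updated correlation modules held fixed — follows the description.

```python
import torch

def train(model, correlation_modules, batches, epochs=10):
    # Assumed optimizers; stage1_loss/stage2_loss are hypothetical helpers
    # computing the first and second loss functions described in the text.
    opt_corr = torch.optim.Adam(
        [p for m in correlation_modules for p in m.parameters()], lr=1e-4)
    opt_main = torch.optim.Adam(
        [p for m in (model.extractor, model.residual,
                     model.quality_head, model.content_head)
         for p in m.parameters()], lr=1e-4)

    for _ in range(epochs):
        for batch in batches:
            # Stage 1: adjust only the correlation modules (first loss).
            loss1 = stage1_loss(model, correlation_modules, batch)
            opt_corr.zero_grad(); loss1.backward(); opt_corr.step()

            # Stage 2: adjust the remaining modules (second loss) with the
            # freshly updated correlation modules held fixed.
            loss2 = stage2_loss(model, correlation_modules, batch)
            opt_main.zero_grad(); loss2.backward(); opt_main.step()
```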
As shown in fig. 5, determining the first feature correlation coefficient corresponding to each batch processing set by the feature extraction module, the residual error module, the correlation module, and the feature correlation coefficient calculation module of the quality assessment model to be trained may specifically include steps S510 to S520, which are described in detail as follows:
in step S510, image feature vectors of a plurality of video sample frames in each video sample included in each batch processing set are extracted by the feature extraction module, the image feature vectors of each video sample frame are input to the residual error module, content feature vectors of each video sample frame output by the residual error module and related to the picture content are obtained, and a quality feature vector of each video sample frame and related to the picture quality is determined according to the image feature vector of each video sample frame and the content feature vector of each video sample frame.
In the present embodiment, the extracted image feature vectors are those of a plurality of video sample frames in each video sample, where the plurality of video sample frames may be predefined, for example the 1st and 2nd frames. After the image feature vectors of the plurality of video sample frames are extracted by the feature extraction module, the image feature vector of each video sample frame can be input into the residual module to obtain the content feature vector of each video sample frame, output by the residual module and related to the picture content.
After obtaining the image feature vector of each video sample frame and the content feature vector of each video sample frame, the content feature vector may be subtracted from the image feature vector to obtain a quality feature vector of each video sample frame related to the picture quality.
In step S520, according to the content feature vector of each video sample frame and the quality feature vector of each video sample frame, a feature correlation coefficient corresponding to each video sample frame is determined by the correlation module and the feature correlation coefficient calculation module, and the determined feature correlation coefficient corresponding to each video sample frame is used as the first feature correlation coefficient corresponding to each batch processing set.
Determining the feature correlation coefficient corresponding to each video sample frame through the correlation module and the feature correlation coefficient calculation module may specifically include: first, correlation module 1 reduces the quality feature vector of each video sample frame to one dimension, and correlation module 2 reduces the content feature vector of each video sample frame to one dimension; the feature correlation coefficient calculation module then calculates the feature correlation coefficient corresponding to each video sample frame.
In a specific implementation process, the feature correlation coefficient $\rho_j$ of each video sample frame (the j-th frame) can be calculated according to the following formula five:

$$\rho_j=\frac{\frac{1}{m}\sum_{i=1}^{m}\left(c_i^{\,j}-\mu_c\right)\left(q_i^{\,j}-\mu_q\right)}{\sigma_c\,\sigma_q+\varepsilon}$$

where m is the number of video samples in the batch processing set, $c_i^{\,j}$ is the dimension-reduced value of the content feature vector of the j-th frame in video sample i, $q_i^{\,j}$ is the dimension-reduced value of the quality feature vector of the j-th frame in video sample i, $\mu_c$ and $\mu_q$ are the means of $c_i^{\,j}$ and $q_i^{\,j}$ over the batch, $\sigma_c$ and $\sigma_q$ are their standard deviations, and $\varepsilon$ is a small constant used to maintain numerical stability.
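By way of illustration, a minimal sketch of formula five for a single frame index j is given below, assuming the correlation modules have already reduced each feature vector to a scalar per video sample; the function name and the value of ε are assumptions.

```python
import torch

def frame_correlation(c: torch.Tensor, q: torch.Tensor,
                      eps: float = 1e-8) -> torch.Tensor:
    """Formula five: correlation of the j-th frame across a batch.

    c: (m,) dimension-reduced content feature values, one per video sample
    q: (m,) dimension-reduced quality feature values, one per video sample
    """
    mu_c, mu_q = c.mean(), q.mean()
    cov = ((c - mu_c) * (q - mu_q)).mean()  # (1/m) sum of products
    sigma_c = c.std(unbiased=False)         # biased std, matching the 1/m mean
    sigma_q = q.std(unbiased=False)
    return cov / (sigma_c * sigma_q + eps)  # eps keeps the division stable
```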
For the plurality of video sample frames in each video sample contained in each batch processing set, after the feature vectors have been reduced in dimension by the correlation module and the feature correlation coefficient of each video sample frame has been calculated by the feature correlation coefficient calculation module, the calculated per-frame feature correlation coefficients can be used as the first feature correlation coefficient corresponding to the batch processing set.
For example, suppose batch processing set A contains 3 video samples (video sample 1, video sample 2 and video sample 3) and the plurality of video sample frames are the 1st and 2nd frames. The feature correlation coefficient of the 1st frame is calculated from the 1st frames of video samples 1, 2 and 3, and the feature correlation coefficient of the 2nd frame is calculated from the 2nd frames of video samples 1, 2 and 3. The two calculated coefficients are then used together as the first feature correlation coefficient corresponding to batch processing set A. A usage sketch of this grouping follows.
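Continuing the example with invented tensor names; torch.corrcoef is used here in place of formula five and omits its ε stabilizer.

```python
import torch

# Batch processing set A: m = 3 video samples, 2 predefined frames each.
c = torch.randn(3, 2)  # c[i, j]: reduced content value of frame j in sample i
q = torch.randn(3, 2)  # q[i, j]: reduced quality value of frame j in sample i

# One coefficient per frame index, computed across the m samples.
rho_1 = torch.corrcoef(torch.stack([c[:, 0], q[:, 0]]))[0, 1]
rho_2 = torch.corrcoef(torch.stack([c[:, 1], q[:, 1]]))[0, 1]
first_coeffs = torch.stack([rho_1, rho_2])  # first feature correlation coefficient
```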
Continuing back to the second step, after determining the first feature correlation coefficient corresponding to each batch processing set, a first loss function may be determined according to the first feature correlation coefficient, and then, a parameter of the correlation module may be adjusted according to the first loss function.
In an embodiment of the present application, the first feature correlation coefficient represents the correlation between the content feature vectors, related to picture content, and the quality feature vectors, related to picture quality, of the video samples contained in each batch processing set. To evaluate video quality well, the model should be prevented from learning features unrelated to picture quality, which would interfere with its regression decision process and degrade its performance. In the first stage of model training, the first loss function is therefore determined by maximizing the feature correlation coefficient of the video sample frame whose coefficient is smallest, that is, the first loss function may be set as

$$\text{Loss}_1=-\min_j\left|\rho_j\right|$$

where $\min_j|\rho_j|$ is the first feature correlation coefficient and $\rho_j$ is the feature correlation coefficient of the j-th video sample frame. A minimal sketch of this loss follows.
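A minimal sketch of the first-stage loss, assuming rho holds the per-frame coefficients ρ_j:

```python
import torch

def stage_one_loss(rho: torch.Tensor) -> torch.Tensor:
    """Loss_1 = -min_j |rho_j|.

    Minimizing this loss maximizes the smallest absolute per-frame
    coefficient, i.e. it trains the correlation module to expose as much
    correlation between content and quality features as it can find.
    """
    return -rho.abs().min()
```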
Thirdly, a second feature correlation coefficient, a quality loss value and a content loss value corresponding to each batch processing set are determined through the feature extraction module, the residual error module, the parameter-adjusted correlation module, the feature correlation coefficient calculation module, the quality evaluation module and the content classification module. A second loss function is then determined according to the second feature correlation coefficient, the quality loss value and the content loss value, and the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module are adjusted according to the second loss function.
Specifically, after the parameters of the correlation module in the first stage are adjusted in the second step, the second stage of model training may be entered, and the second stage of training is mainly used to adjust the parameters of the feature extraction module, the residual error module, the quality evaluation module, and the content classification module.
In specific implementation, first, a second feature correlation coefficient corresponding to each batch processing set is determined through a feature extraction module, a residual error module, a correlation module after parameter adjustment, and a feature correlation coefficient calculation module of the quality assessment model to be trained, where the second feature correlation coefficient is used to represent a content feature vector related to picture content of a plurality of video samples included in each batch processing set and a correlation between quality feature vectors related to picture quality of the plurality of video samples. The method for determining the second feature correlation coefficient is similar to the method for determining the first feature correlation coefficient in the second step, and therefore, the description thereof is omitted.
Meanwhile, the quality loss value corresponding to each batch processing set can be determined through the quality evaluation module of the quality evaluation model to be trained, and the content loss value corresponding to each batch processing set can be determined through the content classification module of the quality evaluation model to be trained. Specifically, as shown in fig. 6, determining the quality loss value and the content loss value corresponding to each batch processing set may specifically include steps S610 to S630, which are described in detail as follows:
step S610, determining, by the quality evaluation module, an output picture quality score of each video sample included in each batch processing set, and determining, by the content classification module, an output video content category of each video sample included in each batch processing set.
With reference to fig. 4, for each video sample included in each batch processing set, the feature extraction module first extracts the image feature vector of each video sample frame in the video sample. The image feature vector of each video sample frame is then input into the residual error module to obtain the content feature vector of that frame, and the quality feature vector of each video sample frame is obtained by subtracting the content feature vector from the image feature vector. The quality feature vector is input into the quality evaluation module and the content feature vector into the content classification module, yielding the picture quality score of each video sample frame output by the quality evaluation module and the video content category of each video sample frame output by the content classification module. Finally, the output picture quality score of each video sample is obtained from the picture quality scores of its video sample frames, and the output video content category of each video sample is obtained from the video content categories of its video sample frames.
In an embodiment, the output picture quality score of each video sample may be obtained as the average of the picture quality scores of all its video sample frames output by the quality evaluation module.
Step S620, determining a quality loss value corresponding to each batch processing set according to the output picture quality score of each video sample and the quality score label of each video sample.
Specifically, determining the quality loss value corresponding to each batch processing set according to the output picture quality scores and the quality score labels of the video samples may include the following. First, for each video sample contained in the batch processing set, the difference between its output picture quality score and its quality score label is calculated (the absolute difference, consistent with the L1 loss used below), and the calculated difference is used as the quality loss value corresponding to that video sample. Then, the ratio of the sum of the quality loss values of the video samples to the number of video samples contained in the batch processing set is calculated, and the calculated ratio is used as the quality loss value corresponding to the batch processing set. A minimal sketch of this computation follows.
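A minimal sketch of this per-batch quality loss, assuming the scores and labels are given as 1-D tensors over one batch processing set:

```python
import torch

def quality_loss(pred_scores: torch.Tensor,
                 label_scores: torch.Tensor) -> torch.Tensor:
    """Per-batch quality loss: mean absolute difference (L1) between the
    output picture quality scores and the quality score labels."""
    return (pred_scores - label_scores).abs().mean()
```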
Step S630, determining the content loss value corresponding to each batch processing set according to the output video content category of each video sample and the video content label of each video sample.
After the output video content category of each video sample included in each batch processing set is obtained in step S610, a content loss value corresponding to each video sample may be determined according to the output video content category and the video content label of that video sample, where the content loss value indicates the inconsistency between the output video content category and the video content label. Since each batch processing set contains a plurality of video samples, the content loss value corresponding to each batch processing set can then be determined from the content loss values corresponding to its video samples.
Returning to the third step: after the second feature correlation coefficient, the quality loss value and the content loss value corresponding to each batch processing set have been determined, a second loss function is determined according to these three quantities, and the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module are then adjusted according to the second loss function.
Since the second feature correlation coefficient represents the correlation between the content feature vectors, related to picture content, and the quality feature vectors, related to picture quality, of the video samples contained in each batch processing set, and since the model should again be prevented from learning features unrelated to picture quality that would interfere with its regression decision process and degrade its performance, the second stage of model training minimizes the feature correlation coefficient of the video sample frame whose coefficient is largest and combines it with the content loss value and the quality loss value. The second loss function is therefore determined as

$$\text{Loss}_2=l_q+0.1\,l_c+0.1\times\max_j\left|\rho_j\right|$$

where $l_q$ is the quality loss value (computed as the L1 loss), $l_c$ is the content loss value (computed as the cross-entropy loss), $\max_j|\rho_j|$ is the second feature correlation coefficient, and $\rho_j$ is the feature correlation coefficient of the j-th video sample frame. A sketch of this combined loss follows.
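A sketch of the combined second-stage loss, assuming content classification is an ordinary multi-class problem over logits:

```python
import torch
import torch.nn.functional as F

def stage_two_loss(pred_scores: torch.Tensor,
                   label_scores: torch.Tensor,
                   content_logits: torch.Tensor,
                   content_labels: torch.Tensor,
                   rho: torch.Tensor) -> torch.Tensor:
    """Loss_2 = l_q + 0.1 * l_c + 0.1 * max_j |rho_j|."""
    l_q = F.l1_loss(pred_scores, label_scores)             # quality loss (L1)
    l_c = F.cross_entropy(content_logits, content_labels)  # content loss
    return l_q + 0.1 * l_c + 0.1 * rho.abs().max()
```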
And fourthly, based on the feature extraction module after parameter adjustment, the residual error module after parameter adjustment, the quality evaluation module after parameter adjustment and the content classification module after parameter adjustment, re-adjusting the parameters of the correlation module, and based on the correlation module after parameter re-adjustment, continuing to adjust the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module until convergence.
In the training process of the quality assessment model to be trained, the first stage and the second stage alternate in a loop: after the second-stage training is completed, the parameters of the correlation module are adjusted again based on the parameter-adjusted feature extraction module, residual error module, quality evaluation module and content classification module, and the parameters of the feature extraction module, residual error module, quality evaluation module and content classification module are then adjusted further based on the re-adjusted correlation module, until convergence. A sketch of this alternation is given below.
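The alternation can be sketched as follows with toy stand-ins for the two parameter groups; the module shapes, optimizer choice and learning rates are assumptions, and the quality and content terms of Loss_2 are omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: the correlation module on one side; feature
# extraction, residual, quality evaluation and content classification
# modules (lumped together) on the other.
main_modules = nn.Linear(8, 8)
corr_module = nn.Linear(8, 1)

opt_corr = torch.optim.Adam(corr_module.parameters(), lr=1e-4)
opt_main = torch.optim.Adam(main_modules.parameters(), lr=1e-4)

def batch_rho(x: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the per-frame formula-five coefficients."""
    feats = main_modules(x)
    c = corr_module(feats).squeeze(-1)          # reduced "content" values
    q = corr_module(feats.flip(0)).squeeze(-1)  # reduced "quality" values
    cov = ((c - c.mean()) * (q - q.mean())).mean()
    return (cov / (c.std() * q.std() + 1e-8)).reshape(1)

for step in range(100):
    x = torch.randn(16, 8)  # toy batch processing set

    # Stage 1: adjust only the correlation module (Loss_1 = -min_j |rho_j|).
    rho = batch_rho(x)
    loss1 = -rho.abs().min()
    opt_corr.zero_grad()
    loss1.backward()
    opt_corr.step()

    # Stage 2: adjust the remaining modules against the re-adjusted
    # correlation module (only the max |rho_j| term of Loss_2 shown).
    rho = batch_rho(x)
    loss2 = 0.1 * rho.abs().max()
    opt_main.zero_grad()
    loss2.backward()
    opt_main.step()
```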
Based on the technical scheme of this embodiment, by virtue of the strong feature learning capability of a deep convolutional network, the content feature vector and the quality feature vector of each video sample frame are obtained through the feature extraction module and the residual error module. By reducing the maximum canonical correlation, the correlation between the content feature vector and the quality feature vector of a video sample frame is minimized, so that features related to picture quality and features unrelated to picture quality are separated and do not interfere with each other across the different tasks. This improves both the interpretability of the model and its prediction accuracy on quality regression.
Based on the video quality assessment method provided by the embodiment of the present application, a test was performed on the public data set LIVE-NFLX-II; please refer to Table 1, which compares the prediction accuracy of the method provided by the present application with that of prior-art methods. Many indexes exist for measuring image quality evaluation results, each with its own characteristics; they usually compare the difference and correlation between the objective values produced by the model and the observed subjective values. Two common evaluation indexes are the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-order Correlation Coefficient (SRCC).
The Pearson linear correlation coefficient describes the linear correlation between the subjective evaluation and the objective evaluation. Its value ranges from -1 to 1, and the larger the absolute value, the better the performance of the algorithm. It is defined by formula six:

$$\text{PLCC}=\frac{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i-\bar{\hat{y}}\right)^{2}}}$$

where N represents the number of distortion samples, $y_i$ and $\hat{y}_i$ represent the label value of the i-th sample and the corresponding predicted value of the algorithm, and $\bar{y}$ and $\bar{\hat{y}}$ represent the mean of the sample label values and the mean of the predicted values, respectively.
The Spearman rank-order correlation coefficient measures the monotonicity of the algorithm's predictions. Its value ranges from -1 to 1, and the larger the absolute value, the better the algorithm's performance. It is calculated by formula seven:

$$\text{SRCC}=1-\frac{6\sum_{i=1}^{N}\left(v_i-p_i\right)^{2}}{N\left(N^{2}-1\right)}$$

where $v_i$ and $p_i$ represent the rank positions of $y_i$ and $\hat{y}_i$ in the label-value sequence and the predicted-value sequence, respectively.
In addition, there is the Kendall Rank Correlation Coefficient (KRCC). Like SRCC, KRCC also measures the monotonicity of the algorithm's predictions. A sketch computing all three indexes is given below.
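All three indexes can be computed with SciPy; a small sketch on invented toy numbers:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Toy label values y and predicted values y_hat (illustrative only).
y = np.array([62.0, 71.5, 55.0, 80.2, 47.9])
y_hat = np.array([60.1, 70.0, 58.3, 78.8, 50.2])

plcc, _ = pearsonr(y, y_hat)    # formula six
srcc, _ = spearmanr(y, y_hat)   # formula seven (rank-based)
krcc, _ = kendalltau(y, y_hat)  # Kendall rank correlation
print(f"PLCC={plcc:.3f} SRCC={srcc:.3f} KRCC={krcc:.3f}")
```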
Table 1: comparison of prediction accuracy (PLCC, SRCC and KRCC) on LIVE-NFLX-II between the method provided by the present application and the comparison methods [table image not reproducible in this text].
As can be seen from Table 1, the method provided by the present application achieves a Pearson linear correlation coefficient of 0.948, a Spearman rank-order correlation coefficient of 0.808 and a Kendall rank correlation coefficient of 0.934 on the test data set, and all three indexes are substantially better than those of all comparison methods. Table 1 also lists other commonly used methods, including 8 classical parameterized models (FTW, Mok2011, Liu2012, Xue2014, Yin2015, Spiteri2016, Bentaleb2016 and SQI) and 2 learning-based models (VideoATLAS and P.1203). To account for the randomness of the comparison experiments, the model experiments were performed 30 times in total, and the results shown in Table 1 are the medians of the 30 results.
The following describes embodiments of the apparatus of the present application, which can be used to perform the video quality assessment method in the above embodiments of the present application. For details that are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the video quality evaluation method described above in the present application.
Fig. 7 shows a block diagram of a video quality assessment apparatus according to an embodiment of the present application.
Referring to fig. 7, a video quality assessment apparatus 700 according to an embodiment of the present application includes: an obtaining unit 702, configured to obtain a video to be evaluated, where the video to be evaluated includes a plurality of video frames; an extracting unit 704 configured to perform feature extraction on each of the plurality of video frames to obtain an image feature vector of each of the video frames; a first determining unit 706 configured to determine a content feature vector related to picture content of the respective video frame according to the image feature vector, and determine a quality feature vector related to picture quality of the respective video frame according to the image feature vector and the content feature vector; a second determining unit 708, configured to determine the picture quality score of each video frame according to the quality feature vector, and generate a quality evaluation result for the video to be evaluated according to the picture quality score of each video frame.
In some embodiments of the present application, the second determining unit 708 comprises: the obtaining subunit is configured to obtain the pause times and pause time of the video to be evaluated, and determine the video fluency damage value of the video to be evaluated according to the pause times and the pause time; the determining subunit is configured to determine the picture quality scores of the videos to be evaluated according to the picture quality scores of the video frames; the generating subunit is configured to generate a quality evaluation result for the video to be evaluated according to the video fluency impairment value of the video to be evaluated and the picture quality score of the video to be evaluated.
In some embodiments of the present application, the determining subunit is configured to: and calculating the average value of a plurality of quality scores according to the quality scores of the video frames, and taking the calculated average value as the picture quality score of the video to be evaluated.
In some embodiments of the present application, the obtaining subunit is configured to: determining a first video fluency damage value of the video to be evaluated according to the pause times, and determining a second video fluency damage value of the video to be evaluated according to the pause time; and determining the video fluency impairment value of the video to be evaluated according to the first video fluency impairment value and the second video fluency impairment value.
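The embodiments leave the concrete combination of the two impairment values open; the following purely hypothetical sketch assumes linear impairment terms and a weighted sum, with all names and weights invented for illustration.

```python
def fluency_impairment(stall_count: int, stall_seconds: float,
                       w_count: float = 1.0, w_duration: float = 0.5) -> float:
    """Hypothetical combination of the two video fluency impairment values;
    the weights and the linear form are illustrative assumptions only."""
    first = w_count * stall_count        # impairment from the number of pauses
    second = w_duration * stall_seconds  # impairment from the pause duration
    return first + second
```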
In some embodiments of the present application, the image feature vector is extracted by a feature extraction module of a quality assessment model, and the content feature vector is obtained by inputting the image feature vector into a residual module of the quality assessment model; the second determination unit is configured to: determining a picture quality score of each video frame according to the quality feature vector, comprising: and inputting the quality characteristic vector of each video frame into a quality evaluation module of the quality evaluation model to obtain the picture quality score of each video frame output by the quality evaluation module.
In some embodiments of the present application, the quality evaluation model further includes a correlation module, a feature correlation coefficient calculation module, and a content classification module, and is trained in the following manner: acquiring a training sample set, wherein the training sample set comprises a plurality of batch processing sets, each batch processing set of the plurality of batch processing sets comprises a plurality of video samples, and each video sample comprises a quality score label and a video content label; determining a first feature correlation coefficient corresponding to each batch processing set through the feature extraction module, the residual error module, the correlation module and the feature correlation coefficient calculation module, determining a first loss function according to the first feature correlation coefficient, and adjusting parameters of the correlation module according to the first loss function; determining a second feature correlation coefficient, a quality loss value and a content loss value corresponding to each batch processing set through the feature extraction module, the residual error module, the parameter-adjusted correlation module, the feature correlation coefficient calculation module, the quality evaluation module and the content classification module, determining a second loss function according to the second feature correlation coefficient, the quality loss value and the content loss value, and adjusting parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module according to the second loss function; based on the feature extraction module after parameter adjustment, the residual error module after parameter adjustment, the quality evaluation module after parameter adjustment and the content classification module after parameter adjustment, the parameters of the correlation module are adjusted again, and based on the correlation module after parameter adjustment, the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module are continuously adjusted until convergence.
In some embodiments of the present application, determining, by the feature extraction module, the residual error module, the correlation module, and the feature correlation coefficient calculation module, a first feature correlation coefficient corresponding to each batch processing set includes: extracting image feature vectors of a plurality of video sample frames in each video sample contained in each batch processing set through the feature extraction module, inputting the image feature vectors of each video sample frame into the residual error module to obtain content feature vectors, which are output by the residual error module and are related to picture content, of each video sample frame, and determining quality feature vectors, which are related to picture quality, of each video sample frame according to the image feature vectors of each video sample frame and the content feature vectors of each video sample frame; and determining a feature correlation coefficient corresponding to each video sample frame through the correlation module and the feature correlation coefficient calculation module according to the content feature vector of each video sample frame and the quality feature vector of each video sample frame, and taking the determined feature correlation coefficient corresponding to each video sample frame as a first feature correlation coefficient corresponding to each batch processing set.
In some embodiments of the present application, determining a first loss function based on the first characteristic correlation coefficient comprises: and acquiring feature correlation coefficients corresponding to the plurality of video sample frames respectively, and taking the inverse number of the minimum feature correlation coefficient in the plurality of feature correlation coefficients as the first loss function.
In some embodiments of the present application, further comprising: determining, by the quality evaluation module, an output picture quality score of each video sample included in each batch processing set, and determining, by the content classification module, an output video content category of each video sample included in each batch processing set; determining a quality loss value corresponding to each batch processing set according to the output picture quality score of each video sample and the quality score label of each video sample; and determining the content loss value corresponding to each batch processing set according to the output video content category of each video sample and the video content label of each video sample.
FIG. 8 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 800 of the electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, a computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes, such as performing the methods described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for system operation are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An Input/Output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output section 807 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 808 including a hard disk and the like; and a communication section 809 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is mounted on the storage section 808 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When the computer program is executed by the Central Processing Unit (CPU) 801, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for video quality assessment, the method comprising:
acquiring a video to be evaluated, wherein the video to be evaluated comprises a plurality of video frames;
extracting features of each video frame in the plurality of video frames to obtain an image feature vector of each video frame, wherein the image feature vector is a vector of image features expressed in a vector form, and the image features at least comprise information related to picture content and information related to picture quality of the video frames;
determining content feature vectors of the video frames related to picture content according to the image feature vectors, and determining quality feature vectors of the video frames related to picture quality according to the image feature vectors and the content feature vectors;
and determining the picture quality scores of the video frames according to the quality feature vectors, and generating a quality evaluation result aiming at the video to be evaluated according to the picture quality scores of the video frames.
2. The method according to claim 1, wherein generating a quality assessment result for the video to be assessed according to the picture quality scores of the video frames comprises:
acquiring the pause times and pause time of the video to be evaluated, and determining the video fluency damage value of the video to be evaluated according to the pause times and the pause time;
determining the picture quality score of the video to be evaluated according to the picture quality score of each video frame;
and generating a quality evaluation result aiming at the video to be evaluated according to the video fluency damage value of the video to be evaluated and the picture quality score of the video to be evaluated.
3. The method according to claim 2, wherein determining the picture quality score of the video to be evaluated from the picture quality scores of the respective video frames comprises:
and calculating the average value of a plurality of quality scores according to the picture quality scores of all the video frames, and taking the calculated average value as the picture quality score of the video to be evaluated.
4. The method of claim 2, wherein determining the video fluency impairment value of the video to be evaluated according to the number of pauses and the pause duration comprises:
determining a first video fluency damage value of the video to be evaluated according to the pause times, and determining a second video fluency damage value of the video to be evaluated according to the pause time;
and determining the video fluency impairment value of the video to be evaluated according to the first video fluency impairment value and the second video fluency impairment value.
5. The method according to claim 1, wherein the image feature vector is extracted by a feature extraction module of a quality assessment model, and the content feature vector is obtained by inputting the image feature vector into a residual module of the quality assessment model;
determining a picture quality score of each video frame according to the quality feature vector, comprising: and inputting the quality characteristic vector of each video frame into a quality evaluation module of the quality evaluation model to obtain the picture quality score of each video frame output by the quality evaluation module.
6. The method of claim 5, wherein the quality assessment model further comprises a correlation module, a feature correlation coefficient calculation module, and a content classification module, and is trained by:
acquiring a training sample set, wherein the training sample set comprises a plurality of batch processing sets, each batch processing set of the plurality of batch processing sets comprises a plurality of video samples, and each video sample comprises a quality score label and a video content label;
determining a first feature correlation coefficient corresponding to each batch processing set through the feature extraction module, the residual error module, the correlation module and the feature correlation coefficient calculation module, determining a first loss function according to the first feature correlation coefficient, and adjusting parameters of the correlation module according to the first loss function;
determining a second feature correlation coefficient, a quality loss value and a content loss value corresponding to each batch processing set through the feature extraction module, the residual error module, the parameter-adjusted correlation module, the feature correlation coefficient calculation module, the quality evaluation module and the content classification module, determining a second loss function according to the second feature correlation coefficient, the quality loss value and the content loss value, and adjusting parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module according to the second loss function;
based on the feature extraction module after parameter adjustment, the residual error module after parameter adjustment, the quality evaluation module after parameter adjustment and the content classification module after parameter adjustment, the parameters of the correlation module are adjusted again, and based on the correlation module after parameter adjustment, the parameters of the feature extraction module, the residual error module, the quality evaluation module and the content classification module are continuously adjusted until convergence.
7. The method of claim 6, wherein determining the first feature correlation coefficient corresponding to each batch processing set through the feature extraction module, the residual module, the correlation module, and the feature correlation coefficient calculation module comprises:
extracting image feature vectors of a plurality of video sample frames in each video sample contained in each batch processing set through the feature extraction module, inputting the image feature vectors of each video sample frame into the residual error module to obtain content feature vectors, which are output by the residual error module and are related to picture content, of each video sample frame, and determining quality feature vectors, which are related to picture quality, of each video sample frame according to the image feature vectors of each video sample frame and the content feature vectors of each video sample frame;
and determining a feature correlation coefficient corresponding to each video sample frame through the correlation module and the feature correlation coefficient calculation module according to the content feature vector of each video sample frame and the quality feature vector of each video sample frame, and taking the determined feature correlation coefficient corresponding to each video sample frame as a first feature correlation coefficient corresponding to each batch processing set.
8. The method of claim 7, wherein determining a first loss function based on the first characteristic correlation coefficient comprises:
and acquiring feature correlation coefficients corresponding to the plurality of video sample frames respectively, and taking the inverse number of the minimum feature correlation coefficient in the plurality of feature correlation coefficients as the first loss function.
9. The method of claim 6, further comprising:
determining, by the quality evaluation module, an output picture quality score of each video sample included in each batch processing set, and determining, by the content classification module, an output video content category of each video sample included in each batch processing set;
determining a quality loss value corresponding to each batch processing set according to the output picture quality score of each video sample and the quality score label of each video sample;
and determining the content loss value corresponding to each batch processing set according to the output video content category of each video sample and the video content label of each video sample.
10. A video quality assessment apparatus, characterized in that said apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a video to be evaluated, and the video to be evaluated comprises a plurality of video frames;
an extracting unit configured to perform feature extraction on each of the plurality of video frames to obtain an image feature vector of each video frame, where the image feature vector is a vector of image features expressed in a vector form, and the image features at least include information related to picture content and information related to picture quality of the video frame;
a first determining unit, configured to determine a content feature vector related to picture content of each video frame according to the image feature vector, and determine a quality feature vector related to picture quality of each video frame according to the image feature vector and the content feature vector;
and the second determining unit is configured to determine the picture quality scores of the video frames according to the quality feature vectors, and generate a quality evaluation result for the video to be evaluated according to the picture quality scores of the video frames.