CN113869178A - Feature extraction system and video quality evaluation system based on space-time dimension - Google Patents

Feature extraction system and video quality evaluation system based on space-time dimension

Info

Publication number
CN113869178A
Authority
CN
China
Prior art keywords
video
feature extraction
layer
dimension
features
Prior art date
Legal status
Granted
Application number
CN202111113707.XA
Other languages
Chinese (zh)
Other versions
CN113869178B (en)
Inventor
余烨
路强
程茹秋
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202111113707.XA priority Critical patent/CN113869178B/en
Publication of CN113869178A publication Critical patent/CN113869178A/en
Application granted granted Critical
Publication of CN113869178B publication Critical patent/CN113869178B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a feature extraction system and a video quality evaluation system based on the space-time dimension. The feature extraction system comprises an image feature extraction module, a video feature extraction module, a time weight processing module and a semantic feature extraction module, which together yield target features based on the space-time dimension. The number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features. The resulting space-time feature matrix therefore agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.

Description

Feature extraction system and video quality evaluation system based on space-time dimension
Technical Field
The invention relates to the technical field of video feature extraction, and in particular to a feature extraction system and a video quality evaluation system based on the space-time dimension.
Background
In modern society, with rising living standards and accelerating urbanization, images and videos have become the most widely used data media in daily life. Such data play an important role in smart cities, public services and urban traffic, and their quality determines how well they can be applied in each of these scenarios. Quality evaluation is therefore an important branch of computer vision and is widely applied in video surveillance, live streaming, image super-resolution and image/video compression.
In video quality evaluation, a widely used approach is to extract features with a classification network and predict the video quality with a pooling method; within this process, handling the information along the temporal dimension of the video has always been a key and challenging problem. Some prior work builds the spatio-temporal dependency between video frames with a recurrent neural network and its variants. Although the overall effect is good, recurrent neural networks have difficulty processing high-dimensional features and tend to overfit during training. In addition, the spatio-temporal features mined by the prior art correlate weakly with subjective human perception, so the final evaluation result often deviates substantially from what human viewers actually perceive.
In summary, prior-art feature extraction methods suffer from large errors, weak correlation with subjective human perception, and related problems.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a feature extraction system and a video quality evaluation system based on the space-time dimension, so as to solve the technical problems of large error, easy overfitting and poor robustness in prior-art feature extraction methods.
To achieve the above and other related objects, the present invention provides a feature extraction system based on spatiotemporal dimensions, comprising:
the image feature extraction module is used for carrying out frame decomposition on the experimental video and extracting image features from the experimental video;
the video feature extraction module is used for combining the image features on a time dimension to obtain video features;
the time weight processing module is used for extracting weight information of the video features in different time periods to obtain time weight features;
and the semantic feature extraction module is used for performing high-low dimensional semantic feature re-mining on the time weight features to obtain target features based on space-time dimensions.
In an embodiment of the present invention, the time weight processing module includes three first convolution layers and a feature weighting layer:
the first convolution layers are used for changing the number of channels of the video features in the time dimension: their input ends receive the video features, and their output ends are connected to the input ends of the feature weighting layer;
the feature weighting layer is configured to weight the outputs of the three first convolution layers to obtain the time weight feature: its first input end receives the video features, and its second input ends are connected to the output ends of the first convolution layers;
the output end of the feature weighting layer serves as the output end of the time weight processing module and is connected to the semantic feature extraction module.
In an embodiment of the present invention, the feature weighting layer calculates the time weight feature by using the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
In an embodiment of the present invention, the semantic feature extraction module includes four space dimension processing units connected in sequence, and is configured to change the number of channels of the time weight feature in a space dimension;
after convolution, the output of the first spatial dimension processing unit is added to the output matrix of the third spatial dimension processing unit to be used as the input of the fourth spatial dimension processing unit;
and after convolution, the output of the first space dimension processing unit and the output of the second space dimension processing unit are added with the output matrix of the fourth space dimension processing unit to be used as the output of the semantic feature extraction module.
In an embodiment of the present invention, the spatial dimension processing unit includes three second convolution layers, three matrix dot-by-dot layers, three active layers, a tensor splicing layer, and a third convolution layer:
the input end of the second convolution layer is used for receiving the input of the current space dimension processing unit, the first output end of the second convolution layer is connected to the first input end of the corresponding matrix dot-product layer, and the second output end of the second convolution layer is connected to the input end of the corresponding active layer;
the input ends of the three second convolution layers form the input end of the current space dimension processing unit;
the second input end of the matrix dot multiplication layer is connected to the output end of the corresponding activation layer, and the output end of the matrix dot multiplication layer is connected to the input end of the tensor splicing layer;
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
and the output end of the third convolution layer is used as the output end of the space dimension processing unit.
The invention also discloses a feature extraction method based on the space-time dimension, which comprises the feature extraction system, wherein the feature extraction method comprises the following steps:
carrying out frame decomposition on the experimental video, and extracting image features from the experimental video;
merging the image features in a time dimension to obtain video features;
extracting weight information of the video features in different time periods to obtain time weight features;
and performing high-low dimensional semantic feature re-excavation on the time weight features to finally obtain a target feature based on space-time dimensions.
The invention also discloses a feature extraction device based on the space-time dimension, which comprises a processor, wherein the processor is coupled with a memory, the memory stores program instructions, and the feature extraction method is realized when the program instructions stored in the memory are executed by the processor.
The present invention also discloses a computer-readable storage medium containing a program which, when run on a computer, causes the computer to execute the above-described feature extraction method.
The invention also discloses a video quality evaluation method based on the space-time dimension, which adopts the target characteristics based on the space-time dimension processed by the characteristic extraction system, and comprises the following steps:
and mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
The invention also discloses a video quality evaluation system based on the space-time dimension, which adopts the characteristic extraction system and comprises the following components:
and the quality pooling module is used for mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
According to the feature extraction system and the video quality evaluation system based on the space-time dimension, the number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features, so that the finally obtained space-time feature matrix agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a temporal weight processing module of the spatiotemporal dimension-based feature extraction system of the present invention in one embodiment;
FIG. 3 is a schematic diagram illustrating a spatial dimension processing unit of the spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention;
FIG. 4 is a block diagram of a spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention;
FIG. 5 is a system flow diagram illustrating a spatiotemporal dimension-based feature extraction method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a spatiotemporal dimension-based feature extraction apparatus according to an embodiment of the present invention;
FIG. 7 is a flow chart illustrating a spatiotemporal dimension-based video quality assessment method according to an embodiment of the present invention;
FIG. 8 is a block diagram of a spatiotemporal dimension-based video quality evaluation system according to an embodiment of the present invention.
Description of the element reference numerals
100. feature extraction system based on the space-time dimension; 110. image feature extraction module; 120. video feature extraction module; 130. time weight processing module; 140. semantic feature extraction module; 150. quality pooling module; 200. feature extraction device based on the space-time dimension; 210. processor; 220. memory; 300. video quality evaluation system.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. It is also to be understood that the terminology used in the examples is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention. Test methods in which specific conditions are not specified in the following examples are generally carried out under conventional conditions or under conditions recommended by the respective manufacturers.
Please refer to FIGS. 1 to 8. It should be understood that the structures, ratios and sizes shown in the drawings are used only to illustrate the contents of the disclosure and are not intended to limit it; modifications of the structures, changes of the ratios or adjustments of the sizes still fall within the scope of the disclosure as long as they do not affect the function and purpose that the disclosure can achieve. In addition, the terms "upper", "lower", "left", "right", "middle" and "one" used in this specification are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive changes to the technical content, are also to be regarded as within the implementable scope of the invention.
When numerical ranges are given in the examples, it is understood that both endpoints of each of the numerical ranges and any value therebetween can be selected unless the invention otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the description of the present invention, and any methods, apparatuses, and materials similar or equivalent to those described in the examples of the present invention may be used to practice the present invention.
Please refer to FIG. 1, a schematic structural diagram of the space-time dimension-based feature extraction system 100 in this embodiment, and FIG. 4, its block diagram. The feature extraction system 100 includes an image feature extraction module 110, a video feature extraction module 120, a time weight processing module 130 and a semantic feature extraction module 140. The image feature extraction module 110 is configured to decompose the experimental video into frames and extract image features from each frame image; the video feature extraction module 120 is configured to combine the image features along the time dimension to obtain the video features, which here are two-dimensional feature vectors; the time weight processing module 130 is configured to extract weight information of the video features over different time periods to obtain the time weight features; and the semantic feature extraction module 140 is configured to re-mine high- and low-dimensional semantic features from the time weight features to obtain the target features based on the space-time dimension.
It should be noted that, performing frame decomposition on the video, extracting image features from the video, and further combining the image features to obtain video features is a technical means well known to those skilled in the art, and is not described in detail herein.
In this embodiment, a two-dimensional convolutional neural network is used to extract image features from each frame image; after the image features are combined along the time dimension, a two-dimensional feature vector of size (fs, cs) is obtained and used as the video feature, where fs denotes the number of video frames and cs denotes the channel dimension. The number of frames carries the time-dimension characteristic and the channel dimension carries the space-dimension characteristic.
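For concreteness, a minimal sketch of this per-frame feature extraction step is given below, assuming a PyTorch implementation with a torchvision ResNet-50 backbone; the backbone choice, the function name and cs = 2048 are illustrative assumptions rather than requirements of the embodiment.

```python
# Illustrative sketch only: any 2-D CNN can serve as the image feature extractor;
# a torchvision ResNet-50 with its classification head removed is assumed here.
import torch
import torchvision.models as models

def extract_video_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: decomposed video frames of shape (fs, 3, H, W).
    Returns the video feature, a two-dimensional vector of size (fs, cs)."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()      # keep the pooled cs-dimensional features
    backbone.eval()
    with torch.no_grad():
        feats = backbone(frames)           # (fs, cs); cs = 2048 for ResNet-50
    return feats
```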
Referring to fig. 2, which is a schematic structural diagram of the time weight processing module 130 in the present embodiment, the time weight processing module 130 includes three first convolution layers and a feature weighting layer:
the first convolution layer is used to change the number of channels in the time dimension for video features: its input end receives video features and its output end is connected to the input end of the feature weighting layer;
the characteristic weighting layer is used for carrying out weighting processing on the outputs of the three first convolution layers to obtain a time weight characteristic matrix: a first input terminal of which receives the video features and a second input terminal of which is connected to an output terminal of the first convolution layer;
the output end of the semantic feature extraction module 140 is connected to the output end of the temporal weight processing module 130.
In this embodiment, the first of the first convolution layers has fs input channels and 1/4fs output channels, the second has fs input channels and 1/2fs output channels, and the third has fs input channels and 1/4fs output channels. The three first convolution layers thus adjust the time-dimension characteristic of the video feature, namely the number of video frames fs, and output features at different temporal resolutions: three feature vectors of sizes (1/4fs, cs), (1/2fs, cs) and (1/4fs, cs), respectively.
Specifically, the feature weighting layer in this embodiment is the Catmul module in fig. 2.
The feature weighting layer comprises tensor splicing operation, activation function operation and matrix dot multiplication operation, and time weight features are obtained by adopting the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
Specifically, W1 is a feature vector of size (1/4fs, cs), W2 is a feature vector of size (1/2fs, cs), W3 is a feature vector of size (1/4fs, cs), and I0 is the two-dimensional feature vector of size (fs, cs).
The three feature vectors output by the first convolution layers are passed through the sigmoid activation function to obtain weights at different temporal resolutions, and these weights are point-multiplied with the two-dimensional feature vector (fs, cs) to obtain the time weight feature I1 with the time-dimension weights applied; the size of the time weight feature I1 is (fs, cs).
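A minimal PyTorch-style sketch of this time weight processing module (the Catmul structure described above) follows; the class name, the kernel size of 1 and the use of Conv1d over the frame axis are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TemporalWeightModule(nn.Module):
    """Sketch of the time weight processing module: three first convolution layers
    change the channel count fs in the time dimension (fs/4, fs/2, fs/4), their
    outputs are spliced, passed through sigmoid and dot-multiplied with I0."""
    def __init__(self, fs: int):
        super().__init__()
        self.conv_a = nn.Conv1d(fs, fs // 4, kernel_size=1)
        self.conv_b = nn.Conv1d(fs, fs // 2, kernel_size=1)
        self.conv_c = nn.Conv1d(fs, fs // 4, kernel_size=1)

    def forward(self, i0: torch.Tensor) -> torch.Tensor:
        # i0: video feature of size (fs, cs)
        x = i0.unsqueeze(0)                                   # (1, fs, cs) for Conv1d
        w = torch.cat([self.conv_a(x), self.conv_b(x), self.conv_c(x)], dim=1)
        i1 = torch.sigmoid(w) * x                             # I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
        return i1.squeeze(0)                                  # time weight feature, (fs, cs)
```

Note that the spliced channels (fs/4 + fs/2 + fs/4) again total fs, so the weight tensor matches I0 element-wise.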
Referring to FIG. 1, the semantic feature extraction module 140 includes four spatial dimension processing units Ri connected in sequence, where i ∈ {1, 2, 3, 4}; the semantic feature extraction module 140 is configured to change the number of channels of the time weight feature matrix in the spatial dimension.
The output of the first spatial dimension processing unit R1, after convolution, is added to the output matrix of the third spatial dimension processing unit R3 to form the input of the fourth spatial dimension processing unit R4.
The outputs of the first spatial dimension processing unit R1 and the second spatial dimension processing unit R2, after convolution, are added to the output matrix of the fourth spatial dimension processing unit R4 to form the output of the semantic feature extraction module 140.
A feature vector with a large number of channels in the space-time dimension represents high-dimensional semantic features, and a feature vector with a small number of channels represents low-dimensional semantic features. The semantic feature extraction module 140 adjusts the channel count of the input time weight feature I1 in the spatial dimension and splices the high-dimensional and low-dimensional semantic features again, thereby completing the construction of the dependency between them.
Referring to FIG. 3, which shows a spatial dimension processing unit Ri of this embodiment, each spatial dimension processing unit Ri comprises three second convolution layers, three matrix dot-multiplication layers, three activation layers, one tensor splicing layer and one third convolution layer:
the input end of each second convolution layer receives the input of the current spatial dimension processing unit Ri; its first output end is connected to the first input end of the corresponding matrix dot-multiplication layer, and its second output end is connected to the input end of the corresponding activation layer;
the input ends of the three second convolution layers together form the input end of the current spatial dimension processing unit Ri;
the second input end of each matrix dot-multiplication layer is connected to the output end of the corresponding activation layer, and its output end is connected to the input end of the tensor splicing layer;
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
the output end of the third convolution layer serves as the output end of the spatial dimension processing unit Ri.
For the first spatial dimension processing unit R1, the three second convolution layers each have cs input channels and 1/4cs, 1/8cs and 1/16cs output channels, respectively. Their input is the time weight feature I1, and they output three feature vectors of sizes (fs, 1/4cs), (fs, 1/8cs) and (fs, 1/16cs). These three feature vectors are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/4cs), δ(fs, 1/8cs) and δ(fs, 1/16cs). The matrix dot-multiplication layers of R1 then dot-multiply the three feature vectors (fs, 1/4cs), (fs, 1/8cs) and (fs, 1/16cs) with the corresponding δ(fs, 1/4cs), δ(fs, 1/8cs) and δ(fs, 1/16cs) and pass the results to the tensor splicing layer of R1, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the first spatial dimension processing unit R1 has 7/16cs input channels and 1/4cs output channels, so the feature vector it outputs has size (fs, 1/4cs).
For the second spatial dimension processing unit R2, the three second convolution layers each have 1/4cs input channels and 1/16cs, 1/32cs and 1/64cs output channels, respectively. Their input is the feature vector of size (fs, 1/4cs) output by the first spatial dimension processing unit R1, and they output three feature vectors of sizes (fs, 1/16cs), (fs, 1/32cs) and (fs, 1/64cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/16cs), δ(fs, 1/32cs) and δ(fs, 1/64cs). The matrix dot-multiplication layers of R2 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/16cs), δ(fs, 1/32cs) and δ(fs, 1/64cs) and pass the results to the tensor splicing layer of R2, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the second spatial dimension processing unit R2 has 7/64cs input channels and 1/16cs output channels, so the feature vector it outputs has size (fs, 1/16cs).
For the third spatial dimension processing unit R3, the three second convolution layers each have 1/16cs input channels and 1/64cs, 1/128cs and 1/256cs output channels, respectively. Their input is the feature vector of size (fs, 1/16cs) output by the second spatial dimension processing unit R2, and they output three feature vectors of sizes (fs, 1/64cs), (fs, 1/128cs) and (fs, 1/256cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/64cs), δ(fs, 1/128cs) and δ(fs, 1/256cs). The matrix dot-multiplication layers of R3 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/64cs), δ(fs, 1/128cs) and δ(fs, 1/256cs) and pass the results to the tensor splicing layer of R3, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the third spatial dimension processing unit R3 has 7/256cs input channels and 1/64cs output channels, so the feature vector it outputs has size (fs, 1/64cs).
Referring to FIG. 1, the output feature vector of the first spatial dimension processing unit R1 is convolved by convolution layer conv3 and then added, as matrices, to the output feature vector of the third spatial dimension processing unit R3 to obtain the input of the fourth spatial dimension processing unit R4.
For the fourth spatial dimension processing unit R4, the three second convolution layers each have 1/64cs input channels and 1/256cs, 1/512cs and 1/1024cs output channels, respectively. Their input is the matrix sum described above, and they output three feature vectors of sizes (fs, 1/256cs), (fs, 1/512cs) and (fs, 1/1024cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/256cs), δ(fs, 1/512cs) and δ(fs, 1/1024cs). The matrix dot-multiplication layers of R4 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/256cs), δ(fs, 1/512cs) and δ(fs, 1/1024cs) and pass the results to the tensor splicing layer of R4, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the fourth spatial dimension processing unit R4 has 7/1024cs input channels and 1/256cs output channels, so the feature vector it outputs has size (fs, 1/256cs).
Referring to FIG. 1, the output feature vector of the first spatial dimension processing unit R1, of size (fs, 1/4cs), is convolved by convolution layer conv1, the output feature vector of the second spatial dimension processing unit R2, of size (fs, 1/16cs), is convolved by convolution layer conv2, and the two results are added, as matrices, to the output (fs, 1/256cs) of the fourth spatial dimension processing unit R4 to obtain the target feature based on the space-time dimension.
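Putting the four units and the skip connections of FIG. 1 together, a sketch of the whole semantic feature extraction module could look as follows; the 1×1 convolutions conv1, conv2 and conv3 are assumed here to match channel counts so that the matrix additions are well defined, which the text above does not spell out.

```python
class SemanticFeatureModule(nn.Module):
    """Sketch of the semantic feature extraction module: four spatial dimension
    processing units R1-R4 in sequence plus the skip connections of FIG. 1."""
    def __init__(self, cs: int):
        super().__init__()
        self.r1 = SpatialDimUnit(cs)              # outputs cs/4 channels
        self.r2 = SpatialDimUnit(cs // 4)         # outputs cs/16 channels
        self.r3 = SpatialDimUnit(cs // 16)        # outputs cs/64 channels
        self.r4 = SpatialDimUnit(cs // 64)        # outputs cs/256 channels
        self.conv3 = nn.Conv1d(cs // 4, cs // 64, kernel_size=1)    # R1 -> input of R4
        self.conv1 = nn.Conv1d(cs // 4, cs // 256, kernel_size=1)   # R1 -> module output
        self.conv2 = nn.Conv1d(cs // 16, cs // 256, kernel_size=1)  # R2 -> module output

    def forward(self, i1: torch.Tensor) -> torch.Tensor:
        # i1: time weight feature of size (fs, cs)
        x = i1.t().unsqueeze(0)                   # (1, cs, fs): spatial dim on the channel axis
        y1 = self.r1(x)
        y2 = self.r2(y1)
        y3 = self.r3(y2)
        y4 = self.r4(self.conv3(y1) + y3)         # matrix addition feeding R4
        out = self.conv1(y1) + self.conv2(y2) + y4
        return out.squeeze(0).t()                 # target feature based on the space-time dimension
```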
Referring to fig. 5, the present embodiment further discloses a feature extraction method based on spatiotemporal dimensions, including the above feature extraction system 100, where the feature extraction method includes:
s100, performing frame decomposition on the experimental video, and extracting image features from the experimental video;
s200, combining image characteristics in a time dimension to obtain video characteristics;
s300, extracting weight information of the video features in different time periods to obtain time weight features;
and S400, performing high-low dimensional semantic feature re-mining on the time weight features to finally obtain target features based on space-time dimensions.
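As a usage illustration, the steps S100 to S400 map onto the sketches given earlier roughly as follows (all module names and sizes are assumptions carried over from those sketches):

```python
frames = torch.randn(16, 3, 224, 224)           # fs = 16 decomposed frames (dummy data)
i0 = extract_video_features(frames)             # S100/S200: video feature, (fs, cs)
i1 = TemporalWeightModule(fs=16)(i0)            # S300: time weight feature, (fs, cs)
target = SemanticFeatureModule(cs=2048)(i1)     # S400: target feature, (fs, cs/256)
```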
Referring to fig. 6, this embodiment further discloses a space-time dimension-based feature extraction device 200, which includes a processor 210 coupled to a memory 220. The memory 220 stores program instructions, and the feature extraction method described above is implemented when these program instructions are executed by the processor 210. The processor 210 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The memory 220 may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory. The processor 210 and the memory 220 may be integrated into one or more independent circuits or pieces of hardware, such as an ASIC. It should be noted that the computer program in the memory 220 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention.
The present embodiment also provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the above-mentioned feature extraction method. The storage medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or a propagation medium. The storage medium may also include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a Random Access Memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Optical disks may include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-RW), and DVD.
Referring to fig. 7, the present embodiment further discloses a method for evaluating video quality based on spatiotemporal dimensions, where the target features based on spatiotemporal dimensions are obtained by processing with the feature extraction system, and the method for evaluating video quality includes:
and S500, mapping the target characteristics based on the space-time dimension into the quality scores of the videos by adopting a quality pooling method to obtain the evaluation results of the experimental videos.
It should be noted that, mapping the feature vector to the quality score of the video by using the quality pooling method is a technical means well known to those skilled in the art, and is not described in detail herein.
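For completeness, one common quality pooling choice, temporal average pooling followed by a small fully connected head, is sketched below; this particular pooling method is an assumption, not something the embodiment prescribes.

```python
class QualityPooling(nn.Module):
    """Sketch of a quality pooling module: average the target feature over time
    and map it to a scalar quality score with a fully connected layer."""
    def __init__(self, c_feat: int):
        super().__init__()
        self.head = nn.Linear(c_feat, 1)

    def forward(self, target_feature: torch.Tensor) -> torch.Tensor:
        # target_feature: target feature based on the space-time dimension, (fs, c_feat)
        pooled = target_feature.mean(dim=0)       # temporal average pooling
        return self.head(pooled).squeeze(-1)      # predicted quality score of the video
```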
Referring to fig. 8, the present embodiment further discloses a video quality evaluation system 300 based on spatiotemporal dimensions, and with the above feature extraction system 100, the video quality evaluation system 300 includes:
and the quality pooling module 150 is used for mapping the target characteristics based on the space-time dimension into the quality scores of the videos by adopting a quality pooling method to obtain the evaluation results of the experimental videos.
According to the feature extraction system and the video quality evaluation system based on the space-time dimension, the number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features, so that the finally obtained space-time feature matrix agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A system for feature extraction based on spatiotemporal dimensions, comprising:
the image feature extraction module is used for carrying out frame decomposition on the experimental video and extracting image features from the experimental video;
the video feature extraction module is used for combining the image features on a time dimension to obtain video features;
the time weight processing module is used for extracting weight information of the video features in different time periods to obtain time weight features;
and the semantic feature extraction module is used for performing high-low dimensional semantic feature re-mining on the time weight features to obtain target features based on space-time dimensions.
2. The feature extraction system of claim 1, wherein the temporal weight processing module comprises three first convolution layers, a feature weighting layer:
the first convolution layers are used for changing the number of channels of the video features in the time dimension: their input ends receive the video features, and their output ends are connected to the input ends of the feature weighting layer;
the feature weighting layer is configured to weight the outputs of the three first convolution layers to obtain the time weight feature: its first input end receives the video features, and its second input ends are connected to the output ends of the first convolution layers;
the output end of the feature weighting layer serves as the output end of the time weight processing module and is connected to the semantic feature extraction module.
3. The feature extraction system of claim 2, wherein the feature weighting layer calculates the temporal weight features using the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
4. The feature extraction system of claim 1, wherein the semantic feature extraction module comprises four spatial dimension processing units connected in sequence, and is configured to change the number of channels of the time-weighted features in a spatial dimension;
after convolution, the output of the first spatial dimension processing unit is added to the output matrix of the third spatial dimension processing unit to be used as the input of the fourth spatial dimension processing unit;
and after convolution, the output of the first space dimension processing unit and the output of the second space dimension processing unit are added with the output matrix of the fourth space dimension processing unit to be used as the output of the semantic feature extraction module.
5. The feature extraction system of claim 4, wherein the spatial dimension processing unit comprises three second convolutional layers, three matrix dot-product layers, three active layers, one tensor concatenation layer, and one third convolutional layer:
the input end of the second convolution layer is used for receiving the input of the current space dimension processing unit, the first output end of the second convolution layer is connected to the first input end of the corresponding matrix dot-product layer, and the second output end of the second convolution layer is connected to the input end of the corresponding active layer;
the input ends of the three second convolution layers form the input end of the current space dimension processing unit;
the second input end of the matrix dot multiplication layer is connected to the output end of the corresponding activation layer, and the output end of the matrix dot multiplication layer is connected to the input end of the tensor splicing layer:
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
and the output end of the third convolution layer is used as the output end of the space dimension processing unit.
6. A method for extracting features based on spatiotemporal dimensions, comprising the system for extracting features according to any one of claims 1 to 5, the method comprising:
carrying out frame decomposition on the experimental video, and extracting image features from the experimental video;
merging the image features in a time dimension to obtain video features;
extracting weight information of the video features in different time periods to obtain time weight features;
and performing high-low dimensional semantic feature re-excavation on the time weight features to finally obtain a target feature based on space-time dimensions.
7. A spatiotemporal dimension-based feature extraction device comprising a processor coupled to a memory storing program instructions that, when executed by the processor, implement the feature extraction method of claim 6.
8. A computer-readable storage medium characterized by comprising a program which, when run on a computer, causes the computer to execute the feature extraction method according to claim 6.
9. A video quality evaluation method based on spatiotemporal dimension, characterized in that, the target feature based on spatiotemporal dimension obtained by the feature extraction system of any claim 1 to 5 is adopted, the video quality evaluation method comprises:
and mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
10. A video quality evaluation system based on spatiotemporal dimensions, characterized in that the feature extraction system of any one of claims 1-5 is employed, the video quality evaluation system comprising:
and the quality pooling module is used for mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
CN202111113707.XA 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension Active CN113869178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111113707.XA CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111113707.XA CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Publications (2)

Publication Number Publication Date
CN113869178A true CN113869178A (en) 2021-12-31
CN113869178B CN113869178B (en) 2022-07-15

Family

ID=78993422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111113707.XA Active CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Country Status (1)

Country Link
CN (1) CN113869178B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357258A1 (en) * 2015-06-05 2018-12-13 Beijing Jingdong Shangke Information Technology Co., Ltd. Personalized search device and method based on product image features
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357258A1 (en) * 2015-06-05 2018-12-13 Beijing Jingdong Shangke Information Technology Co., Ltd. Personalized search device and method based on product image features
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium

Also Published As

Publication number Publication date
CN113869178B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
WO2020177651A1 (en) Image segmentation method and image processing device
US11216910B2 (en) Image processing system, image processing method and display device
CN111914997B (en) Method for training neural network, image processing method and device
CN112446834A (en) Image enhancement method and device
CN111445418A (en) Image defogging method and device and computer equipment
KR20120115407A (en) Method and system for determining a quality measure for an image using multi-level decomposition of images
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110020639B (en) Video feature extraction method and related equipment
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN111797882A (en) Image classification method and device
CN112488923A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
US20220188595A1 (en) Dynamic matrix convolution with channel fusion
CN115439470B (en) Polyp image segmentation method, computer readable storage medium and computer device
CN112052808A (en) Human face living body detection method, device and equipment for refining depth map and storage medium
EP3663938B1 (en) Signal processing method and apparatus
CN113869178B (en) Feature extraction system and video quality evaluation system based on space-time dimension
CN111291631A (en) Video analysis method and related model training method, device and apparatus
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CA2688041C (en) Method and device for selecting transform matrices for down-sampling dct image using learning with forgetting algorithm
CN114420135A (en) Attention mechanism-based voiceprint recognition method and device
Lee et al. Dual-branch vision transformer for blind image quality assessment
CN116503895A (en) Multi-fine-granularity shielding pedestrian re-recognition method based on visual transducer
Yang et al. Blind image quality measurement via data-driven transform-based feature enhancement
CN116797510A (en) Image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant