CN113869178A - Feature extraction system and video quality evaluation system based on space-time dimension - Google Patents

Feature extraction system and video quality evaluation system based on space-time dimension

Info

Publication number
CN113869178A
Authority
CN
China
Prior art keywords
video
feature extraction
layer
dimension
features
Prior art date
Legal status
Granted
Application number
CN202111113707.XA
Other languages
Chinese (zh)
Other versions
CN113869178B (en)
Inventor
余烨
路强
程茹秋
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202111113707.XA priority Critical patent/CN113869178B/en
Publication of CN113869178A publication Critical patent/CN113869178A/en
Application granted granted Critical
Publication of CN113869178B publication Critical patent/CN113869178B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a feature extraction system and a video quality evaluation system based on the space-time dimension. The feature extraction system comprises an image feature extraction module, a video feature extraction module, a time weight processing module and a semantic feature extraction module, which together yield target features based on the space-time dimension. The number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features. The resulting space-time feature matrix therefore agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.

Description

Feature extraction system and video quality evaluation system based on space-time dimension
Technical Field
The invention relates to the technical field of video feature extraction, and in particular to a feature extraction system and a video quality evaluation system based on the space-time dimension.
Background
In modern society, with rising living standards and accelerating urbanization, images and videos have become the most widely used data media in daily life. Such data play an important role in smart cities, public services and urban traffic, and their quality determines how well they can be applied in each of these scenarios. Quality evaluation is therefore an important branch of computer vision and is widely applied in video surveillance, live streaming, image super-resolution and image/video compression.
In video quality evaluation, a widely used approach is to extract features with a classification network and predict the video quality with a pooling method; within this process, handling the information along the temporal dimension of the video has always been a key and challenging problem. Some prior work builds the spatio-temporal dependency between video frames with a recurrent neural network and its variants. Although the overall effect is good, recurrent neural networks have difficulty processing high-dimensional features and tend to overfit during training. In addition, the spatio-temporal features mined by the prior art correlate weakly with subjective human perception, so the final evaluation result often deviates substantially from what human viewers actually perceive.
In summary, prior-art feature extraction methods suffer from large errors, weak correlation with subjective human perception, and related problems.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a feature extraction system and a video quality evaluation system based on the space-time dimension, so as to solve the technical problems of large error, easy overfitting and poor robustness in prior-art feature extraction methods.
To achieve the above and other related objects, the present invention provides a feature extraction system based on spatiotemporal dimensions, comprising:
the image feature extraction module is used for carrying out frame decomposition on the experimental video and extracting image features from the experimental video;
the video feature extraction module is used for combining the image features on a time dimension to obtain video features;
the time weight processing module is used for extracting weight information of the video features in different time periods to obtain time weight features;
and the semantic feature extraction module is used for performing high-low dimensional semantic feature re-mining on the time weight features to obtain target features based on space-time dimensions.
In an embodiment of the present invention, the time weight processing module includes three first convolution layers and a feature weighting layer:
the first convolution layers are used for changing the number of channels of the video features in the time dimension: their input ends receive the video features, and their output ends are connected to the input ends of the feature weighting layer;
the feature weighting layer is configured to weight the outputs of the three first convolution layers to obtain the time weight feature: its first input end receives the video features, and its second input ends are connected to the output ends of the first convolution layers;
the output end of the feature weighting layer serves as the output end of the time weight processing module and is connected to the semantic feature extraction module.
In an embodiment of the present invention, the feature weighting layer calculates the time weight feature by using the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
In an embodiment of the present invention, the semantic feature extraction module includes four space dimension processing units connected in sequence, and is configured to change the number of channels of the time weight feature in a space dimension;
after convolution, the output of the first spatial dimension processing unit is added to the output matrix of the third spatial dimension processing unit to be used as the input of the fourth spatial dimension processing unit;
and after convolution, the output of the first space dimension processing unit and the output of the second space dimension processing unit are added with the output matrix of the fourth space dimension processing unit to be used as the output of the semantic feature extraction module.
In an embodiment of the present invention, the spatial dimension processing unit includes three second convolution layers, three matrix dot-by-dot layers, three active layers, a tensor splicing layer, and a third convolution layer:
the input end of the second convolution layer is used for receiving the input of the current space dimension processing unit, the first output end of the second convolution layer is connected to the first input end of the corresponding matrix dot-product layer, and the second output end of the second convolution layer is connected to the input end of the corresponding active layer;
the input ends of the three second convolution layers form the input end of the current space dimension processing unit;
the second input end of the matrix dot multiplication layer is connected to the output end of the corresponding activation layer, and the output end of the matrix dot multiplication layer is connected to the input end of the tensor splicing layer;
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
and the output end of the third convolution layer is used as the output end of the space dimension processing unit.
The invention also discloses a feature extraction method based on the space-time dimension, which comprises the feature extraction system, wherein the feature extraction method comprises the following steps:
carrying out frame decomposition on the experimental video, and extracting image features from the experimental video;
merging the image features in a time dimension to obtain video features;
extracting weight information of the video features in different time periods to obtain time weight features;
and performing high-low dimensional semantic feature re-excavation on the time weight features to finally obtain a target feature based on space-time dimensions.
The invention also discloses a feature extraction device based on the space-time dimension, which comprises a processor, wherein the processor is coupled with a memory, the memory stores program instructions, and the feature extraction method is realized when the program instructions stored in the memory are executed by the processor.
The present invention also discloses a computer-readable storage medium containing a program which, when run on a computer, causes the computer to execute the above-described feature extraction method.
The invention also discloses a video quality evaluation method based on the space-time dimension, which adopts the target characteristics based on the space-time dimension processed by the characteristic extraction system, and comprises the following steps:
and mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
The invention also discloses a video quality evaluation system based on the space-time dimension, which adopts the characteristic extraction system and comprises the following components:
and the quality pooling module is used for mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
According to the feature extraction system and the video quality evaluation system based on the space-time dimension, the number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features, so that the finally obtained space-time feature matrix agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a temporal weight processing module of the spatiotemporal dimension-based feature extraction system of the present invention in one embodiment;
FIG. 3 is a schematic diagram illustrating a spatial dimension processing unit of the spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention;
FIG. 4 is a block diagram of a spatiotemporal dimension-based feature extraction system according to an embodiment of the present invention;
FIG. 5 is a system flow diagram illustrating a spatiotemporal dimension-based feature extraction method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a spatiotemporal dimension-based feature extraction apparatus according to an embodiment of the present invention;
FIG. 7 is a flow chart illustrating a spatiotemporal dimension-based video quality assessment method according to an embodiment of the present invention;
FIG. 8 is a block diagram of a spatiotemporal dimension-based video quality evaluation system according to an embodiment of the present invention.
Description of the element reference numerals
100. feature extraction system based on the space-time dimension; 110. image feature extraction module; 120. video feature extraction module; 130. time weight processing module; 140. semantic feature extraction module; 150. quality pooling module; 200. feature extraction device based on the space-time dimension; 210. processor; 220. memory; 300. video quality evaluation system.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. It is also to be understood that the terminology used in the examples is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention. Test methods in which specific conditions are not specified in the following examples are generally carried out under conventional conditions or under conditions recommended by the respective manufacturers.
Please refer to FIGS. 1 to 8. It should be understood that the structures, ratios and sizes shown in the drawings are used only to illustrate the contents of the disclosure and are not intended to limit it; modifications of the structures, changes of the ratios or adjustments of the sizes still fall within the scope of the disclosure as long as they do not affect the function and purpose that the disclosure can achieve. In addition, the terms "upper", "lower", "left", "right", "middle" and "one" used in this specification are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive changes to the technical content, are also to be regarded as within the implementable scope of the invention.
When numerical ranges are given in the examples, it is understood that both endpoints of each of the numerical ranges and any value therebetween can be selected unless the invention otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the description of the present invention, and any methods, apparatuses, and materials similar or equivalent to those described in the examples of the present invention may be used to practice the present invention.
Please refer to FIG. 1, a schematic structural diagram of the space-time dimension-based feature extraction system 100 in this embodiment, and FIG. 4, its block diagram. The feature extraction system 100 includes an image feature extraction module 110, a video feature extraction module 120, a time weight processing module 130 and a semantic feature extraction module 140. The image feature extraction module 110 is configured to decompose the experimental video into frames and extract image features from each frame image; the video feature extraction module 120 is configured to combine the image features along the time dimension to obtain the video features, which here are two-dimensional feature vectors; the time weight processing module 130 is configured to extract weight information of the video features over different time periods to obtain the time weight features; and the semantic feature extraction module 140 is configured to re-mine high- and low-dimensional semantic features from the time weight features to obtain the target features based on the space-time dimension.
It should be noted that, performing frame decomposition on the video, extracting image features from the video, and further combining the image features to obtain video features is a technical means well known to those skilled in the art, and is not described in detail herein.
In this embodiment, a two-dimensional convolutional neural network is used to extract image features from each frame image; after the image features are combined along the time dimension, a two-dimensional feature vector of size (fs, cs) is obtained and used as the video feature, where fs denotes the number of video frames and cs denotes the channel dimension. The number of frames carries the time-dimension characteristic and the channel dimension carries the space-dimension characteristic.
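For concreteness, a minimal sketch of this per-frame feature extraction step is given below, assuming a PyTorch implementation with a torchvision ResNet-50 backbone; the backbone choice, the function name and cs = 2048 are illustrative assumptions rather than requirements of the embodiment.

```python
# Illustrative sketch only: any 2-D CNN can serve as the image feature extractor;
# a torchvision ResNet-50 with its classification head removed is assumed here.
import torch
import torchvision.models as models

def extract_video_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: decomposed video frames of shape (fs, 3, H, W).
    Returns the video feature, a two-dimensional vector of size (fs, cs)."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()      # keep the pooled cs-dimensional features
    backbone.eval()
    with torch.no_grad():
        feats = backbone(frames)           # (fs, cs); cs = 2048 for ResNet-50
    return feats
```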
Referring to fig. 2, which is a schematic structural diagram of the time weight processing module 130 in the present embodiment, the time weight processing module 130 includes three first convolution layers and a feature weighting layer:
the first convolution layer is used to change the number of channels in the time dimension for video features: its input end receives video features and its output end is connected to the input end of the feature weighting layer;
the characteristic weighting layer is used for carrying out weighting processing on the outputs of the three first convolution layers to obtain a time weight characteristic matrix: a first input terminal of which receives the video features and a second input terminal of which is connected to an output terminal of the first convolution layer;
the output end of the semantic feature extraction module 140 is connected to the output end of the temporal weight processing module 130.
In this embodiment, the first of the first convolution layers has fs input channels and 1/4fs output channels, the second has fs input channels and 1/2fs output channels, and the third has fs input channels and 1/4fs output channels. The three first convolution layers thus adjust the time-dimension characteristic of the video feature, namely the number of video frames fs, and output features at different temporal resolutions: three feature vectors of sizes (1/4fs, cs), (1/2fs, cs) and (1/4fs, cs), respectively.
Specifically, the feature weighting layer in this embodiment is the Catmul module in fig. 2.
The feature weighting layer comprises tensor splicing operation, activation function operation and matrix dot multiplication operation, and time weight features are obtained by adopting the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
Specifically, W1 is a feature vector of size (1/4fs, cs), W2 is a feature vector of size (1/2fs, cs), W3 is a feature vector of size (1/4fs, cs), and I0 is the two-dimensional feature vector of size (fs, cs).
The three feature vectors output by the first convolution layers are passed through the sigmoid activation function to obtain weights at different temporal resolutions, and these weights are point-multiplied with the two-dimensional feature vector (fs, cs) to obtain the time weight feature I1 with the time-dimension weights applied; the size of the time weight feature I1 is (fs, cs).
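A minimal PyTorch-style sketch of this time weight processing module (the Catmul structure described above) follows; the class name, the kernel size of 1 and the use of Conv1d over the frame axis are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TemporalWeightModule(nn.Module):
    """Sketch of the time weight processing module: three first convolution layers
    change the channel count fs in the time dimension (fs/4, fs/2, fs/4), their
    outputs are spliced, passed through sigmoid and dot-multiplied with I0."""
    def __init__(self, fs: int):
        super().__init__()
        self.conv_a = nn.Conv1d(fs, fs // 4, kernel_size=1)
        self.conv_b = nn.Conv1d(fs, fs // 2, kernel_size=1)
        self.conv_c = nn.Conv1d(fs, fs // 4, kernel_size=1)

    def forward(self, i0: torch.Tensor) -> torch.Tensor:
        # i0: video feature of size (fs, cs)
        x = i0.unsqueeze(0)                                   # (1, fs, cs) for Conv1d
        w = torch.cat([self.conv_a(x), self.conv_b(x), self.conv_c(x)], dim=1)
        i1 = torch.sigmoid(w) * x                             # I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
        return i1.squeeze(0)                                  # time weight feature, (fs, cs)
```

Note that the spliced channels (fs/4 + fs/2 + fs/4) again total fs, so the weight tensor matches I0 element-wise.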
Referring to FIG. 1, the semantic feature extraction module 140 includes four spatial dimension processing units Ri connected in sequence, where i ∈ {1, 2, 3, 4}; the semantic feature extraction module 140 is configured to change the number of channels of the time weight feature matrix in the spatial dimension.
The output of the first spatial dimension processing unit R1, after convolution, is added to the output matrix of the third spatial dimension processing unit R3 to form the input of the fourth spatial dimension processing unit R4.
The outputs of the first spatial dimension processing unit R1 and the second spatial dimension processing unit R2, after convolution, are added to the output matrix of the fourth spatial dimension processing unit R4 to form the output of the semantic feature extraction module 140.
A feature vector with a large number of channels in the space-time dimension represents high-dimensional semantic features, and a feature vector with a small number of channels represents low-dimensional semantic features. The semantic feature extraction module 140 adjusts the channel count of the input time weight feature I1 in the spatial dimension and splices the high-dimensional and low-dimensional semantic features again, thereby completing the construction of the dependency between them.
Referring to FIG. 3, which shows a spatial dimension processing unit Ri of this embodiment, each spatial dimension processing unit Ri comprises three second convolution layers, three matrix dot-multiplication layers, three activation layers, one tensor splicing layer and one third convolution layer:
the input end of each second convolution layer receives the input of the current spatial dimension processing unit Ri; its first output end is connected to the first input end of the corresponding matrix dot-multiplication layer, and its second output end is connected to the input end of the corresponding activation layer;
the input ends of the three second convolution layers together form the input end of the current spatial dimension processing unit Ri;
the second input end of each matrix dot-multiplication layer is connected to the output end of the corresponding activation layer, and its output end is connected to the input end of the tensor splicing layer;
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
the output end of the third convolution layer serves as the output end of the spatial dimension processing unit Ri.
For the first spatial dimension processing unit R1, the three second convolution layers each have cs input channels and 1/4cs, 1/8cs and 1/16cs output channels, respectively. Their input is the time weight feature I1, and they output three feature vectors of sizes (fs, 1/4cs), (fs, 1/8cs) and (fs, 1/16cs). These three feature vectors are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/4cs), δ(fs, 1/8cs) and δ(fs, 1/16cs). The matrix dot-multiplication layers of R1 then dot-multiply the three feature vectors (fs, 1/4cs), (fs, 1/8cs) and (fs, 1/16cs) with the corresponding δ(fs, 1/4cs), δ(fs, 1/8cs) and δ(fs, 1/16cs) and pass the results to the tensor splicing layer of R1, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the first spatial dimension processing unit R1 has 7/16cs input channels and 1/4cs output channels, so the feature vector it outputs has size (fs, 1/4cs).
For the second spatial dimension processing unit R2, the three second convolution layers each have 1/4cs input channels and 1/16cs, 1/32cs and 1/64cs output channels, respectively. Their input is the feature vector of size (fs, 1/4cs) output by the first spatial dimension processing unit R1, and they output three feature vectors of sizes (fs, 1/16cs), (fs, 1/32cs) and (fs, 1/64cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/16cs), δ(fs, 1/32cs) and δ(fs, 1/64cs). The matrix dot-multiplication layers of R2 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/16cs), δ(fs, 1/32cs) and δ(fs, 1/64cs) and pass the results to the tensor splicing layer of R2, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the second spatial dimension processing unit R2 has 7/64cs input channels and 1/16cs output channels, so the feature vector it outputs has size (fs, 1/16cs).
For the third spatial dimension processing unit R3, the three second convolution layers each have 1/16cs input channels and 1/64cs, 1/128cs and 1/256cs output channels, respectively. Their input is the feature vector of size (fs, 1/16cs) output by the second spatial dimension processing unit R2, and they output three feature vectors of sizes (fs, 1/64cs), (fs, 1/128cs) and (fs, 1/256cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/64cs), δ(fs, 1/128cs) and δ(fs, 1/256cs). The matrix dot-multiplication layers of R3 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/64cs), δ(fs, 1/128cs) and δ(fs, 1/256cs) and pass the results to the tensor splicing layer of R3, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the third spatial dimension processing unit R3 has 7/256cs input channels and 1/64cs output channels, so the feature vector it outputs has size (fs, 1/64cs).
Referring to FIG. 1, the output feature vector of the first spatial dimension processing unit R1 is convolved by convolution layer conv3 and then added, as matrices, to the output feature vector of the third spatial dimension processing unit R3 to obtain the input of the fourth spatial dimension processing unit R4.
For the fourth spatial dimension processing unit R4, the three second convolution layers each have 1/64cs input channels and 1/256cs, 1/512cs and 1/1024cs output channels, respectively. Their input is the matrix sum described above, and they output three feature vectors of sizes (fs, 1/256cs), (fs, 1/512cs) and (fs, 1/1024cs). These are passed through the sigmoid of the corresponding activation layers to obtain δ(fs, 1/256cs), δ(fs, 1/512cs) and δ(fs, 1/1024cs). The matrix dot-multiplication layers of R4 then dot-multiply the three feature vectors with the corresponding δ(fs, 1/256cs), δ(fs, 1/512cs) and δ(fs, 1/1024cs) and pass the results to the tensor splicing layer of R4, which splices them and outputs the result to the third convolution layer for convolution.
The third convolution layer of the fourth spatial dimension processing unit R4 has 7/1024cs input channels and 1/256cs output channels, so the feature vector it outputs has size (fs, 1/256cs).
Referring to FIG. 1, the output feature vector of the first spatial dimension processing unit R1, of size (fs, 1/4cs), is convolved by convolution layer conv1, the output feature vector of the second spatial dimension processing unit R2, of size (fs, 1/16cs), is convolved by convolution layer conv2, and the two results are added, as matrices, to the output (fs, 1/256cs) of the fourth spatial dimension processing unit R4 to obtain the target feature based on the space-time dimension.
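Putting the four units and the skip connections of FIG. 1 together, a sketch of the whole semantic feature extraction module could look as follows; the 1×1 convolutions conv1, conv2 and conv3 are assumed here to match channel counts so that the matrix additions are well defined, which the text above does not spell out.

```python
class SemanticFeatureModule(nn.Module):
    """Sketch of the semantic feature extraction module: four spatial dimension
    processing units R1-R4 in sequence plus the skip connections of FIG. 1."""
    def __init__(self, cs: int):
        super().__init__()
        self.r1 = SpatialDimUnit(cs)              # outputs cs/4 channels
        self.r2 = SpatialDimUnit(cs // 4)         # outputs cs/16 channels
        self.r3 = SpatialDimUnit(cs // 16)        # outputs cs/64 channels
        self.r4 = SpatialDimUnit(cs // 64)        # outputs cs/256 channels
        self.conv3 = nn.Conv1d(cs // 4, cs // 64, kernel_size=1)    # R1 -> input of R4
        self.conv1 = nn.Conv1d(cs // 4, cs // 256, kernel_size=1)   # R1 -> module output
        self.conv2 = nn.Conv1d(cs // 16, cs // 256, kernel_size=1)  # R2 -> module output

    def forward(self, i1: torch.Tensor) -> torch.Tensor:
        # i1: time weight feature of size (fs, cs)
        x = i1.t().unsqueeze(0)                   # (1, cs, fs): spatial dim on the channel axis
        y1 = self.r1(x)
        y2 = self.r2(y1)
        y3 = self.r3(y2)
        y4 = self.r4(self.conv3(y1) + y3)         # matrix addition feeding R4
        out = self.conv1(y1) + self.conv2(y2) + y4
        return out.squeeze(0).t()                 # target feature based on the space-time dimension
```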
Referring to fig. 5, the present embodiment further discloses a feature extraction method based on spatiotemporal dimensions, including the above feature extraction system 100, where the feature extraction method includes:
s100, performing frame decomposition on the experimental video, and extracting image features from the experimental video;
s200, combining image characteristics in a time dimension to obtain video characteristics;
s300, extracting weight information of the video features in different time periods to obtain time weight features;
and S400, performing high-low dimensional semantic feature re-mining on the time weight features to finally obtain target features based on space-time dimensions.
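As a usage illustration, the steps S100 to S400 map onto the sketches given earlier roughly as follows (all module names and sizes are assumptions carried over from those sketches):

```python
frames = torch.randn(16, 3, 224, 224)           # fs = 16 decomposed frames (dummy data)
i0 = extract_video_features(frames)             # S100/S200: video feature, (fs, cs)
i1 = TemporalWeightModule(fs=16)(i0)            # S300: time weight feature, (fs, cs)
target = SemanticFeatureModule(cs=2048)(i1)     # S400: target feature, (fs, cs/256)
```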
Referring to fig. 6, this embodiment further discloses a space-time dimension-based feature extraction device 200, which includes a processor 210 coupled to a memory 220. The memory 220 stores program instructions, and the feature extraction method described above is implemented when these program instructions are executed by the processor 210. The processor 210 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The memory 220 may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory. The processor 210 and the memory 220 may be integrated into one or more independent circuits or pieces of hardware, such as an ASIC. It should be noted that the computer program in the memory 220 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention.
The present embodiment also provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the above-mentioned feature extraction method. The storage medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or a propagation medium. The storage medium may also include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a Random Access Memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Optical disks may include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-RW), and DVD.
Referring to fig. 7, the present embodiment further discloses a method for evaluating video quality based on spatiotemporal dimensions, where the target features based on spatiotemporal dimensions are obtained by processing with the feature extraction system, and the method for evaluating video quality includes:
and S500, mapping the target characteristics based on the space-time dimension into the quality scores of the videos by adopting a quality pooling method to obtain the evaluation results of the experimental videos.
It should be noted that, mapping the feature vector to the quality score of the video by using the quality pooling method is a technical means well known to those skilled in the art, and is not described in detail herein.
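For completeness, one common quality pooling choice, temporal average pooling followed by a small fully connected head, is sketched below; this particular pooling method is an assumption, not something the embodiment prescribes.

```python
class QualityPooling(nn.Module):
    """Sketch of a quality pooling module: average the target feature over time
    and map it to a scalar quality score with a fully connected layer."""
    def __init__(self, c_feat: int):
        super().__init__()
        self.head = nn.Linear(c_feat, 1)

    def forward(self, target_feature: torch.Tensor) -> torch.Tensor:
        # target_feature: target feature based on the space-time dimension, (fs, c_feat)
        pooled = target_feature.mean(dim=0)       # temporal average pooling
        return self.head(pooled).squeeze(-1)      # predicted quality score of the video
```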
Referring to fig. 8, the present embodiment further discloses a video quality evaluation system 300 based on spatiotemporal dimensions, and with the above feature extraction system 100, the video quality evaluation system 300 includes:
and the quality pooling module 150 is used for mapping the target characteristics based on the space-time dimension into the quality scores of the videos by adopting a quality pooling method to obtain the evaluation results of the experimental videos.
According to the feature extraction system and the video quality evaluation system based on the space-time dimension, the number of channels of the video feature vector is changed in the time dimension to assign weights to different time periods, and the number of channels is then changed in the space dimension to splice and re-mine the high-dimensional and low-dimensional semantic features, so that the finally obtained space-time feature matrix agrees more closely with subjective human perception, correlates with it more strongly, and gives higher accuracy.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A system for feature extraction based on spatiotemporal dimensions, comprising:
the image feature extraction module is used for carrying out frame decomposition on the experimental video and extracting image features from the experimental video;
the video feature extraction module is used for combining the image features on a time dimension to obtain video features;
the time weight processing module is used for extracting weight information of the video features in different time periods to obtain time weight features;
and the semantic feature extraction module is used for performing high-low dimensional semantic feature re-mining on the time weight features to obtain target features based on space-time dimensions.
2. The feature extraction system of claim 1, wherein the temporal weight processing module comprises three first convolution layers, a feature weighting layer:
the first convolution layers are used for changing the number of channels of the video features in the time dimension: their input ends receive the video features, and their output ends are connected to the input ends of the feature weighting layer;
the feature weighting layer is configured to weight the outputs of the three first convolution layers to obtain the time weight feature: its first input end receives the video features, and its second input ends are connected to the output ends of the first convolution layers;
the output end of the feature weighting layer serves as the output end of the time weight processing module and is connected to the semantic feature extraction module.
3. The feature extraction system of claim 2, wherein the feature weighting layer calculates the temporal weight features using the following formula:
I1 = δ(W1 ⊕ W2 ⊕ W3) ⊙ I0
wherein I1 represents the time weight feature; δ denotes the sigmoid activation function; Wi represents the output of the i-th first convolution layer; ⊙ represents the matrix dot product; I0 represents the video feature; and ⊕ represents the tensor splicing operation.
4. The feature extraction system of claim 1, wherein the semantic feature extraction module comprises four spatial dimension processing units connected in sequence, and is configured to change the number of channels of the time-weighted features in a spatial dimension;
after convolution, the output of the first spatial dimension processing unit is added to the output matrix of the third spatial dimension processing unit to be used as the input of the fourth spatial dimension processing unit;
and after convolution, the output of the first space dimension processing unit and the output of the second space dimension processing unit are added with the output matrix of the fourth space dimension processing unit to be used as the output of the semantic feature extraction module.
5. The feature extraction system of claim 4, wherein the spatial dimension processing unit comprises three second convolutional layers, three matrix dot-product layers, three active layers, one tensor concatenation layer, and one third convolutional layer:
the input end of the second convolution layer is used for receiving the input of the current space dimension processing unit, the first output end of the second convolution layer is connected to the first input end of the corresponding matrix dot-product layer, and the second output end of the second convolution layer is connected to the input end of the corresponding active layer;
the input ends of the three second convolution layers form the input end of the current space dimension processing unit;
the second input end of the matrix dot multiplication layer is connected to the output end of the corresponding activation layer, and the output end of the matrix dot multiplication layer is connected to the input end of the tensor splicing layer:
the output end of the tensor splicing layer is connected to the input end of the third convolution layer;
and the output end of the third convolution layer is used as the output end of the space dimension processing unit.
6. A method for extracting features based on spatiotemporal dimensions, comprising the system for extracting features according to any one of claims 1 to 5, the method comprising:
carrying out frame decomposition on the experimental video, and extracting image features from the experimental video;
merging the image features in a time dimension to obtain video features;
extracting weight information of the video features in different time periods to obtain time weight features;
and performing high-low dimensional semantic feature re-excavation on the time weight features to finally obtain a target feature based on space-time dimensions.
7. A spatiotemporal dimension-based feature extraction device comprising a processor coupled to a memory storing program instructions that, when executed by the processor, implement the feature extraction method of claim 6.
8. A computer-readable storage medium characterized by comprising a program which, when run on a computer, causes the computer to execute the feature extraction method according to claim 6.
9. A video quality evaluation method based on spatiotemporal dimension, characterized in that, the target feature based on spatiotemporal dimension obtained by the feature extraction system of any claim 1 to 5 is adopted, the video quality evaluation method comprises:
and mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
10. A video quality evaluation system based on spatiotemporal dimensions, characterized in that the feature extraction system of any one of claims 1-5 is employed, the video quality evaluation system comprising:
and the quality pooling module is used for mapping the target characteristics based on the space-time dimension into the quality fraction of the video by adopting a quality pooling method to obtain the evaluation result of the experimental video.
CN202111113707.XA 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension Active CN113869178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111113707.XA CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111113707.XA CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Publications (2)

Publication Number Publication Date
CN113869178A true CN113869178A (en) 2021-12-31
CN113869178B CN113869178B (en) 2022-07-15

Family

ID=78993422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111113707.XA Active CN113869178B (en) 2021-09-18 2021-09-18 Feature extraction system and video quality evaluation system based on space-time dimension

Country Status (1)

Country Link
CN (1) CN113869178B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357258A1 (en) * 2015-06-05 2018-12-13 Beijing Jingdong Shangke Information Technology Co., Ltd. Personalized search device and method based on product image features
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357258A1 (en) * 2015-06-05 2018-12-13 Beijing Jingdong Shangke Information Technology Co., Ltd. Personalized search device and method based on product image features
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on space-time semantic features
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243031A (en) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 Video spatiotemporal feature optimization method and system based on quality attention mechanism, electronic device and storage medium

Also Published As

Publication number Publication date
CN113869178B (en) 2022-07-15

Similar Documents

Publication Publication Date Title
WO2020177651A1 (en) Image segmentation method and image processing device
US11216910B2 (en) Image processing system, image processing method and display device
CN111914997B (en) Method for training neural network, image processing method and device
CN112446834A (en) Image enhancement method and device
CN111445418A (en) Image defogging method and device and computer equipment
KR20120115407A (en) Method and system for determining a quality measure for an image using multi-level decomposition of images
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN110020639B (en) Video feature extraction method and related equipment
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN111797882A (en) Image classification method and device
CN112488923A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
US20220188595A1 (en) Dynamic matrix convolution with channel fusion
CN115439470B (en) Polyp image segmentation method, computer readable storage medium and computer device
CN112052808A (en) Human face living body detection method, device and equipment for refining depth map and storage medium
EP3663938B1 (en) Signal processing method and apparatus
CN113869178B (en) Feature extraction system and video quality evaluation system based on space-time dimension
CN111291631A (en) Video analysis method and related model training method, device and apparatus
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CA2688041C (en) Method and device for selecting transform matrices for down-sampling dct image using learning with forgetting algorithm
CN114420135A (en) Attention mechanism-based voiceprint recognition method and device
Lee et al. Dual-branch vision transformer for blind image quality assessment
CN116503895A (en) Multi-fine-granularity shielding pedestrian re-recognition method based on visual transducer
Yang et al. Blind image quality measurement via data-driven transform-based feature enhancement
CN116797510A (en) Image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant