WO2021134485A1 - Video scoring method and apparatus, storage medium, and electronic device - Google Patents

Video scoring method and apparatus, storage medium, and electronic device

Info

Publication number
WO2021134485A1
WO2021134485A1, PCT/CN2019/130520, CN2019130520W
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature
detected
features
neural network
Prior art date
Application number
PCT/CN2019/130520
Other languages
English (en)
French (fr)
Inventor
高洪涛
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司 filed Critical 深圳市欢太科技有限公司
Priority to CN201980100393.4A priority Critical patent/CN114375466A/zh
Priority to PCT/CN2019/130520 priority patent/WO2021134485A1/zh
Publication of WO2021134485A1 publication Critical patent/WO2021134485A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • This application relates to the technical field of video scoring, in particular to a video scoring method, device, storage medium and electronic equipment.
  • Video information has become an important way of information dissemination on the Internet, changing people's lives in all aspects.
  • There is a huge variety of videos, and their content is uneven: some are positive and full of positive energy, some are gloomy and depressing, and some are angry and violent. It is therefore especially urgent to evaluate and screen videos along the emotional dimension.
  • The embodiments of the present application provide a video scoring method, device, storage medium, and electronic equipment, which can score a video along emotional dimensions.
  • In a first aspect, an embodiment of the present application provides a video scoring method, including: obtaining a video to be detected; extracting video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm; performing feature fusion processing on the video features of the multiple dimensions based on a preset feature fusion algorithm to generate a fusion feature; and
  • calculating, based on a preset regression algorithm and the fusion feature, the scores of the video to be detected in multiple emotional dimensions.
  • an embodiment of the present application provides a video scoring device, including:
  • the data acquisition module is used to acquire the video to be detected
  • the feature extraction module is configured to extract video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm
  • a score calculation module configured to perform feature fusion processing on the video features of the multiple dimensions based on a preset feature fusion algorithm to generate a fusion feature
  • and the score calculation module is further configured to calculate, based on a preset regression algorithm and the fusion feature, the scores of the video to be detected in multiple emotional dimensions.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the video scoring method provided in any embodiment of the present application.
  • an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program and the processor is configured to execute, by invoking the computer program, the video scoring method provided in any embodiment of the present application.
  • the solution provided by the embodiments of this application separately extracts video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm, performs fusion processing on these multi-dimensional video features to obtain a fusion feature, and then calculates, according to a preset regression algorithm and the fusion feature, the scores of the video to be detected in multiple emotional dimensions.
  • On this basis, this solution effectively combines the multiple types of features extracted from the video and uses the fused feature as the basis for emotional scoring: a feedforward neural network scores the video to be detected on multiple emotional dimensions and obtains multiple scores, thereby realizing scoring of the video along emotional dimensions.
  • FIG. 1 is a schematic diagram of the first flow of a video scoring method provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of a ring model based on Valence-Arousal in the video scoring method proposed in an embodiment of the application.
  • FIG. 3 is a schematic diagram of the second flow of a video scoring method provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a deep neural network model of a video scoring method provided by an embodiment of the application.
  • Fig. 5 is a schematic structural diagram of a video scoring device provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a video scoring circuit of an electronic device provided by an embodiment of the application.
  • the embodiments of the present application provide a video scoring method.
  • the execution subject of the video scoring method may be the video scoring device provided in the embodiments of the application, or an electronic device integrated with the video scoring device, where the video scoring device may be implemented in hardware or software.
  • the electronic device can be a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer and other devices.
  • the electronic device may also be a server.
  • FIG. 1 is a schematic diagram of the first flow of a video scoring method provided by an embodiment of this application.
  • the specific process of the video scoring method provided in the embodiment of the application may be as follows:
  • the video to be detected is obtained.
  • the video scoring scheme in this application can be applied to various video platforms, for example, online video viewing websites, video sharing apps, etc.
  • For a video system, when a video uploaded by a user is received, the video can be emotionally scored on the server side according to the solution of this embodiment; only when the emotional score meets preset conditions is the video published to the video platform for sharing.
  • Unlike approaches that classify a video based on a single feature, this application scores the video from multiple emotional angles; there can be two or more emotional dimensions.
  • For example, in some embodiments the emotional dimensions include the degree of positivity/negativity (valence) and the emotional intensity (arousal), and a ring model based on Valence-Arousal is used to score the emotion of the video.
  • The degree of positivity/negativity can be understood as the positive or negative emotional tendency reflected in the video.
  • A positive emotional tendency covers emotions such as happiness and satisfaction.
  • A negative emotional tendency covers emotions such as anger and disappointment.
  • The degree of excitement can be divided into mild (such as calm or fatigued), neutral, and severe (such as irritable or intense).
  • FIG. 2 is a schematic diagram of a ring model based on Valence-Arousal in the video scoring method proposed in an embodiment of the application.
  • the horizontal axis is the score for the degree of positivity/negativity, and the vertical axis is the score for the degree of excitement.
  • On the horizontal axis, 0 to -1 corresponds to negative emotional tendencies and 0 to 1 to positive emotional tendencies; the closer the video's score is to 1, the more positive the emotion reflected in the video, and the closer it is to -1, the more negative the emotion.
  • On the vertical axis, the closer the video's score is to -1, the milder the degree of excitement reflected in the video, and the closer it is to 1, the more intense the excitement.
  • The Valence-Arousal-based ring model above includes two emotional dimensions; in other embodiments, more emotional dimensions can be set for scoring the video according to evaluation requirements.
  • video features of multiple dimensions are extracted from the video to be detected.
  • video features of multiple dimensions may include facial features, audio features, and visual features.
  • the video features of multiple dimensions may also include features obtained by fusion of any two of the above three features.
  • video features of multiple dimensions may also include features of other dimensions.
  • For each of the above feature dimensions, a corresponding feature extraction method can be used to extract that specific feature.
  • an image containing human face information can be used to train a deep neural network in advance to determine network parameters, as a feature extraction network.
  • When extracting the facial features of a video, the video frame image containing a human face is input into the feature extraction network, the feature output by its last convolutional layer is obtained, and this feature is reduced in dimensionality to obtain the face feature vector, which is used as the facial feature.
  • Alternatively, the feature vector output by the fully connected layer can be extracted and used as the face feature.
  • Alternatively, the Histogram of Oriented Gradient (HOG) feature of the video frame image containing the face can be extracted and used as the face feature.
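As an illustration of the convolutional-network option described above, the following is a minimal sketch (assuming PyTorch/torchvision and a ResNet-18 backbone, neither of which is specified in the document) of taking the last convolutional feature map of a face-containing frame and reducing it to the 256-dimensional vector length used later in this text. The pooling-based reduction is a simplification of the row-splicing or fully-connected reduction described above.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone used purely as a feature extractor (network choice is an assumption).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Keep everything up to the last convolutional block; drop the avgpool and fc layers.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def face_feature_vector(face_frame: Image.Image) -> torch.Tensor:
    """Return a 256-dim face feature vector from a frame containing a face."""
    with torch.no_grad():
        fmap = feature_extractor(preprocess(face_frame).unsqueeze(0))  # (1, 512, 7, 7)
    pooled = fmap.mean(dim=(2, 3)).flatten()                           # (512,) channel descriptor
    # Reduce to 256 dims by averaging channel pairs; the document instead splices
    # feature-map rows or uses a fully connected layer for this reduction.
    return pooled.reshape(256, 2).mean(dim=1)
```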
  • the audio data in the video can be extracted separately, the audio data can be converted into a spectrogram, and then the spectrogram can be converted into a semantic vector according to a pre-trained deep neural network as the audio feature. Or, it is also possible to directly input audio data into a pre-trained self-encoding recurrent neural network to generate semantic feature vectors as audio features.
  • the histogram of the pixel values of the video frame image can be extracted to reflect the brightness and tone of the image. According to the appearance times of each pixel value in the histogram, a feature vector is generated as the visual feature of the video.
  • feature fusion processing is performed on video features of multiple dimensions based on a preset feature fusion algorithm to generate fusion features.
  • the above features are fused, for example, a weight value is assigned to each feature, and multiple feature vectors are weighted and averaged according to the weight value to obtain a fused feature vector.
  • the multiple feature vectors of the above three types of features are spliced into a feature matrix, and the feature matrix is convolved according to a preset convolution layer to perform feature fusion.
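A minimal sketch of the first fusion option mentioned above, weighted averaging of equal-length feature vectors; the specific weight values below are illustrative assumptions, not taken from the document.

```python
import numpy as np

def weighted_average_fusion(face, audio, visual, weights=(0.4, 0.3, 0.3)):
    """Fuse three equal-length feature vectors into one by weighted averaging."""
    feats = np.stack([face, audio, visual])        # (3, 256)
    w = np.asarray(weights, dtype=float)[:, None]  # (3, 1)
    return (w * feats).sum(axis=0) / w.sum()       # (256,) fused feature vector
```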
  • the scores of the video to be detected on multiple emotional dimensions are calculated.
  • the fusion feature is input into the preset regression algorithm to calculate the score.
  • the preset regression algorithm may be a feedforward neural network, a logistic regression algorithm, and the like.
  • the feed-forward neural network can be trained on fusion features that carry scores in multiple emotional dimensions.
  • the number of neurons in the output layer of the network is equal to the number of emotional dimensions, and one neuron corresponds to one emotional dimension.
  • Perform normalization calculation on each neuron in the output layer to obtain a number between -1 and 1 as the score of the video to be detected in the emotional dimension corresponding to the neuron.
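A hedged sketch of such a regression head, with one output neuron per emotional dimension. Using tanh for the normalization and a 64-unit hidden layer are assumptions; the document only states that each output neuron is normalized to a value between -1 and 1.

```python
import torch.nn as nn

def make_regression_head(in_dim: int = 256, n_emotion_dims: int = 2) -> nn.Module:
    """Feed-forward regressor: one output neuron per emotional dimension, scores in [-1, 1]."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ReLU(),
        nn.Linear(64, n_emotion_dims),  # one neuron per emotional dimension
        nn.Tanh(),                      # normalizes each score into [-1, 1]
    )
```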
  • the present application is not limited by the order of execution of the various steps described, and certain steps may also be performed in other order or at the same time if there is no conflict.
  • the video scoring method proposed in the embodiments of this application extracts video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm, fuses these multi-dimensional features to obtain a fusion feature, and then calculates the scores of the video to be detected in multiple emotional dimensions according to a preset regression algorithm and the fusion feature.
  • On this basis, this solution effectively combines the multiple types of features extracted from the video and uses the fusion feature as the basis for emotional scoring: a feed-forward neural network scores the video to be detected on multiple emotional dimensions and obtains multiple scores, realizing scoring of the video along emotional dimensions; the emotional score can be used as a basis for sharing or recommending the video.
  • the facial expressions of the characters can reflect the emotional state expressed in the video as a whole.
  • a video frame image containing human face information is acquired as an analysis object, and features are extracted therefrom as the facial features corresponding to the video.
  • In some embodiments, extracting facial features from the video to be detected according to the preset feature extraction algorithm includes: obtaining a video frame image containing human face information from the video to be detected; generating a face feature matrix according to a preset first convolutional neural network and the video frame image; and reducing the dimensionality of the face feature matrix to generate a face feature vector.
  • multiple frames of video frame images included in the video can be acquired, and one or more frames of video frame images containing human face information can be selected from the multiple frames of video frame images as the target video frame image.
  • When the video contains multiple human subjects, multiple video frames containing human face information can be selected as the target video frames.
  • When the video contains only one human subject, the video frame image in which the face region occupies the largest proportion of the whole image can be selected as the target video frame image.
  • After the target video frame image is obtained, it is input into the pre-trained first convolutional neural network, and the feature map output by the last convolutional layer of the network is obtained; a face feature vector is then generated from this feature map. For example, if the feature map output by the last convolutional layer has a size of 10×10, it can be reduced to a 1×100 vector by splicing its 10 rows together, and this vector is used as the face feature vector.
  • Alternatively, the fully connected layer of the network can be used to reduce the dimensionality of this 10×10 feature map, and the resulting feature vector can be used as the face feature vector.
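A tiny sketch of the row-splicing dimensionality reduction in the example above (a 10×10 feature map flattened row by row into a 1×100 vector):

```python
import numpy as np

feature_map = np.random.rand(10, 10)                 # stand-in for the last conv layer's output
face_feature_vector = feature_map.reshape(1, 100)    # the 10 rows spliced into a 1x100 vector
```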
  • In other embodiments, extracting facial features from the video to be detected includes: obtaining a video frame image containing human face information from the video to be detected; calculating a histogram of oriented gradients feature vector of the video frame image; and using this histogram of oriented gradients feature vector as the face feature.
  • In this case, for each acquired target video frame image, its histogram of oriented gradients feature is calculated; this feature generally takes the form of a vector.
  • the histogram of oriented gradients feature of the target video frame image is calculated as follows: divide the image into multiple regions, compute the gradient values in different directions within each region, and then accumulate them to obtain the histogram feature.
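A short sketch of this HOG computation using scikit-image (an assumed library choice; the parameter values below are common defaults, not taken from the document):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_face_feature(face_frame_rgb: np.ndarray) -> np.ndarray:
    """face_frame_rgb: H x W x 3 frame containing a face; returns a 1-D HOG feature vector."""
    gray = rgb2gray(face_frame_rgb)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")
```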
  • FIG. 3 is a schematic diagram of a second process of a video scoring method provided by an embodiment of the present invention.
  • the method includes:
  • a video to be detected is obtained.
  • the video scoring scheme in this application can be applied to various video platforms, for example, online video viewing websites, video sharing applications, etc.
  • For a video system, when a video uploaded by a user is received, the video can be emotionally scored on the server side according to the solution of this embodiment; only when the emotional score meets preset conditions is the video published to the video platform for sharing.
  • a target video frame image is obtained from the video to be detected, and a face feature vector is generated according to the first convolutional neural network and the target video frame image.
  • For example, a video frame image containing human face information is obtained from the video to be detected, a face feature matrix is generated according to the preset first convolutional neural network and the video frame image, and the face feature matrix is reduced in dimensionality to generate a face feature vector.
  • multiple frames of video frame images contained in the video can be obtained, and one or more frames of video frame images containing human face information can be selected from the multiple frames of video frame images as the target video frame image.
  • When the video contains multiple human subjects, multiple video frames containing human face information can be selected as the target video frames.
  • When there are multiple target video frame images, multiple face feature vectors may be obtained: if the person in each target frame is different, the number of face feature vectors equals the number of target video frame images; if several target frames show the same person, the feature vectors computed from those frames can be averaged into a single feature vector.
  • After the target video frame image is obtained, it is input into the pre-trained first convolutional neural network, and the feature map output by the last convolutional layer of the network is obtained; a face feature vector is then generated from this feature map, for example by performing a dimensionality-reduction operation on the feature map.
  • Each target video frame image corresponds to a face feature vector.
  • the face feature vector can be expressed as s1 = {x1, x2, …, xn}.
  • When there are multiple face feature vectors, they may be given the same weight value during feature fusion.
  • the audio data in the video to be detected is obtained, and the audio data is converted into an audio feature vector.
  • the audio feature extraction algorithm can be the MFCC (Mel-Frequency Cepstral Coefficient) algorithm or the FFT (Fast Fourier Transform) algorithm.
  • the audio feature extraction algorithm converts the voice data into a spectrogram.
  • the spectrogram is used as the input data and output data of the self-encoding convolutional neural network, and the semantic feature vector is extracted from the network.
  • When the second convolutional neural network is trained, its output data is made consistent with its input data so that valuable information is captured in its middle hidden layer; the spectrogram is then fed into this second convolutional neural network, and the feature vector output by the middle hidden layer is used as the audio feature vector.
  • the audio feature vector can be expressed as s2 = {y1, y2, …, yn}.
  • audio data can also be directly input into a pre-trained self-encoding recurrent neural network to generate semantic feature vectors as audio features.
  • the self-encoding (autoencoder) neural network model consists of an encoder and a decoder, and the output of the network is equal to its input.
  • the network includes an intermediate hidden layer, which can extract the semantic feature vector of the speech data.
  • In this solution, a self-encoding recurrent neural network is used to extract semantic feature vectors from the speech data, and both the input data and the output data of this network are the above speech data. During training, the speech data does not need to be labeled: a large amount of speech data is collected in advance as the network's input and output, and the network determines its parameters through self-learning.
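The following is a hedged sketch of the convolutional self-encoding branch described above: the soundtrack is turned into a log-mel spectrogram patch and passed through a small autoencoder trained to reproduce its input, with the 256-dimensional bottleneck activation used as the audio feature vector. The use of librosa, the mel representation, the fixed 64×64 patch size, and the layer sizes are all assumptions.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_patch(path: str, sr: int = 16000, n_mels: int = 64, n_frames: int = 64) -> torch.Tensor:
    """Load the soundtrack and return a fixed-size (1, 1, 64, 64) log-mel spectrogram patch."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    mel = np.pad(mel, ((0, 0), (0, max(0, n_frames - mel.shape[1]))))[:, :n_frames]
    return torch.tensor(mel, dtype=torch.float32)[None, None]

class AudioAutoencoder(nn.Module):
    """Self-encoding CNN: trained so its output reproduces its input spectrogram;
    the 256-dim bottleneck activation is taken as the audio (semantic) feature vector."""
    def __init__(self, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(), nn.Linear(32 * 16 * 16, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 32x32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),               # 32x32 -> 64x64
        )

    def forward(self, x):
        z = self.encoder(x)            # hidden-layer "semantic" audio feature, shape (batch, 256)
        return self.decoder(z), z

# Training (not shown) minimizes the reconstruction error between the decoder output and the
# input spectrogram, so no emotion labels are needed for this branch.
```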
  • the to-be-detected image is obtained from the to-be-detected video, and a visual feature vector is generated according to the pixel value distribution histogram of the to-be-detected image.
  • For example, the image to be detected is obtained from the video to be detected, a pixel value distribution histogram of the image on one or more pixel channels is obtained, and a visual feature vector on the one or more pixel channels is generated from that pixel value distribution histogram.
  • In this embodiment, multiple video frame images contained in the video may be obtained, and one or more of these frames may be selected as the image(s) to be detected. The pixel value distribution histograms of the RGB channels of the image to be detected are extracted to represent the brightness and tone of the image, and each element ci (i = 0, 1, …, 255) of the feature vector is computed from the corresponding pixel-value count.
  • where mi is the number of occurrences of the pixel value i.
  • the calculated visual feature vector can be expressed as s3 = {c1, c2, …, cn}.
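A minimal sketch of this visual feature: a 256-bin histogram of pixel values per channel. Normalizing each count by the total pixel count is an assumption; the document only states that each element ci is derived from mi.

```python
import numpy as np

def visual_feature_vectors(frame_rgb: np.ndarray) -> np.ndarray:
    """frame_rgb: H x W x 3 uint8 image; returns a (3, 256) array, one histogram per channel."""
    feats = []
    for ch in range(3):
        m, _ = np.histogram(frame_rgb[..., ch], bins=256, range=(0, 256))  # m[i] = occurrences of value i
        feats.append(m / m.sum())        # assumed normalization of the counts
    return np.stack(feats)
```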
  • In this way, the feature vector corresponding to each pixel channel can be calculated, and each such feature vector has 256 dimensions. To facilitate subsequent feature fusion operations, the face feature vector, audio feature vector, and visual feature vector need to have the same dimensionality, so the parameters of the first convolutional neural network and the second convolutional neural network can be adjusted in advance.
  • With these adjusted parameters, the vector obtained by reducing the dimensionality of the output feature map is also 256-dimensional. It can be understood that, in other embodiments, the dimensions of the face feature vector, audio feature vector, and visual feature vector can also be fixed to other values as needed.
  • Alternatively, in some embodiments, it is not necessary to set the lengths of the face feature vector, audio feature vector, and visual feature vector to be the same; the network parameters of the first convolutional neural network and the second convolutional neural network are instead each set from the perspective of feature-extraction accuracy.
  • In that case, before the feature vectors are spliced, it is first checked whether the length of each feature vector reaches the preset length; if not, the feature vector can be extended to the preset length by zero padding.
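A simple helper for this zero-padding step (the truncation branch for over-long vectors is an assumption; the document only describes padding up to the preset length):

```python
import numpy as np

def pad_to_length(vec: np.ndarray, preset_len: int = 256) -> np.ndarray:
    """Extend a feature vector to the preset length with zeros."""
    if len(vec) >= preset_len:
        return vec[:preset_len]                     # assumed behavior for over-long vectors
    return np.pad(vec, (0, preset_len - len(vec)))  # zero padding up to the preset length
```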
  • the face feature vector, audio feature vector, and visual feature vector are spliced into a feature matrix.
  • the feature matrix is input into a preset deep neural network model, where the deep neural network model includes a convolutional layer and a feedforward neural network.
  • a convolution operation is performed on the feature matrix according to the convolution layer to generate a fusion feature.
  • the scores of the video to be detected in multiple emotional dimensions are generated.
  • FIG. 4 is a schematic structural diagram of a deep neural network model of a video scoring method provided by an embodiment of the application.
  • the deep neural network model includes a data layer and a regression layer composed of a feedforward neural network.
  • The feature matrix is input into the preset deep neural network model, and the convolutional layer performs a convolution operation on the feature matrix to generate the fusion feature; the convolution kernel is sized so that it spans all rows of the input feature matrix.
  • For example, the feature matrix has a size of 3×256, and after a 3×1 convolution kernel operation, a fusion feature with a size of 1×256 is obtained.
  • the fusion features output by the convolutional layer are input into the feedforward neural network for calculation, and the scores of the video to be detected in multiple emotional dimensions are generated.
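Putting the pieces together, the following is a hedged end-to-end sketch of the model in FIG. 4 (assuming PyTorch): the three 256-dimensional feature vectors are stacked into a 3×256 matrix, a 3×1 convolution fuses them into a 1×256 feature, and a feed-forward regression layer outputs one normalized score per emotional dimension. The hidden-layer width and the tanh normalization are assumptions.

```python
import torch
import torch.nn as nn

class VideoScoringModel(nn.Module):
    def __init__(self, feat_dim: int = 256, n_emotion_dims: int = 2):
        super().__init__()
        # The 3x1 kernel spans the three stacked modality rows (face, audio, visual).
        self.fusion = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(3, 1))
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, n_emotion_dims), nn.Tanh(),   # one score per emotional dimension, in [-1, 1]
        )

    def forward(self, feature_matrix: torch.Tensor) -> torch.Tensor:
        # feature_matrix: (batch, 3, 256); add a channel dimension for Conv2d.
        fused = self.fusion(feature_matrix.unsqueeze(1))   # (batch, 1, 1, 256)
        fused = fused.flatten(1)                           # (batch, 256) fusion feature
        return self.regressor(fused)                       # (batch, n_emotion_dims)

# Example: scores = VideoScoringModel()(torch.randn(1, 3, 256))  # e.g. [[valence, arousal]]
```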
  • the deep neural network model is trained on sample videos that carry scores in multiple emotional dimensions. For example, [sample video A: valence (positive/negative) score 0.6, arousal (excitement) score 0.1] can be used as one training sample, and in this way multiple training samples are obtained in advance.
  • For each sample video, the facial feature vector, audio feature vector, and visual feature vector are extracted according to the corresponding feature extraction algorithms.
  • the face feature vectors, audio feature vectors, and visual feature vectors carrying scores in multiple emotional dimensions are spliced and input into a pre-built deep neural network for training to determine model parameters.
  • During training, the model learns how strongly each feature affects the scoring result on each emotional dimension: features with a large impact on the score are given relatively large weights, and features with a small impact are given relatively small weights.
  • the trained model can effectively utilize the various features of the input and realize the effective quantification of the video from multiple emotional dimensions.
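An illustrative training loop for the model sketched above, using the sample format given in the text (sample video A: valence 0.6, arousal 0.1). The optimizer, loss function, and epoch count are assumptions; the document only says the model is trained on sample videos annotated with scores on each emotional dimension.

```python
import torch

model = VideoScoringModel()                   # model sketched in the previous example
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# Each sample: a 3x256 feature matrix (face/audio/visual vectors spliced together) and its scores.
dataset = [(torch.randn(3, 256), torch.tensor([0.6, 0.1]))]   # placeholder for "sample video A"

for epoch in range(10):
    for feats, target in dataset:
        optimizer.zero_grad()
        pred = model(feats.unsqueeze(0)).squeeze(0)   # predicted [valence, arousal]
        loss = loss_fn(pred, target)
        loss.backward()
        optimizer.step()
```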
  • a video scoring device is also provided.
  • FIG. 5 is a schematic structural diagram of a video scoring device 300 provided by an embodiment of the application.
  • the video scoring device 300 is applied to electronic equipment, and the video scoring device 300 includes a data acquisition module 301, a feature extraction module 302, and a score calculation module 303, as follows:
  • the data acquisition module 301 is used to acquire the video to be detected
  • the feature extraction module 302 is configured to extract video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm
  • the score calculation module 303 is configured to perform feature fusion processing on the video features of the multiple dimensions based on a preset feature fusion algorithm to generate a fusion feature;
  • the scores of the video to be detected in multiple emotional dimensions are calculated.
  • the score calculation module 303 is also used to:
  • the feature matrix is input into a preset deep neural network model, where the deep neural network model includes a convolutional layer and a feedforward neural network, the deep neural network model is obtained by training on sample videos, and the sample videos carry scores in the multiple emotional dimensions;
  • the scores of the video to be detected in multiple emotional dimensions are generated.
  • the multiple-dimensional video features include facial features, audio features, and visual features.
  • the feature extraction module 302 is further configured to: obtain a video frame image containing human face information from the video to be detected:
  • the dimensionality reduction processing of the face feature matrix is performed to generate a face feature vector.
  • the feature extraction module 302 is further configured to: obtain a video frame image containing human face information from the video to be detected;
  • the feature extraction module 302 is further configured to: obtain audio data contained in the video to be detected;
  • the audio feature vector of the audio data is generated.
  • the feature extraction module 302 is also used to:
  • the image to be detected is obtained from the video to be detected
  • each of the above modules can be implemented as an independent entity, or can be combined arbitrarily, and implemented as the same or several entities.
  • each of the above modules please refer to the previous method embodiments, which will not be repeated here.
  • the video scoring device provided in this embodiment of the application belongs to the same concept as the video scoring method in the above embodiment. Any method provided in the video scoring method embodiment can be run on the video scoring device, and its specific implementation For details of the process, refer to the embodiment of the video scoring method, which will not be repeated here.
  • the video scoring device proposed in this embodiment of the application separately extracts facial features, audio features, and visual features from the video to be detected according to preset feature extraction algorithms, fuses these three types of features to obtain a fusion feature, and then calculates the scores of the video to be detected in multiple emotional dimensions according to a preset regression algorithm and the fusion feature.
  • On this basis, this solution effectively combines the multiple types of features extracted from the video and uses the fusion feature as the basis for emotional scoring: a feed-forward neural network scores the video to be detected on multiple emotional dimensions and obtains multiple scores, realizing scoring of the video along emotional dimensions.
  • the embodiments of the present application also provide an electronic device, which may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 6, is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • the electronic device 800 may include a camera module 801, a memory 802, a processor 803, a touch display screen 804, a speaker 805, a microphone 806 and other components.
  • the camera module 801 may include a video scoring circuit, which may be implemented by hardware and/or software components, and may include various processing units that define an image signal processing (Image Signal Processing) pipeline.
  • the video scoring circuit may at least include a camera, an image signal processor (Image Signal Processor, ISP processor), a control logic, an image memory, a display, and so on.
  • the camera can include at least one or more lenses and image sensors.
  • the image sensor may include a color filter array (such as a Bayer filter). The image sensor can obtain the light intensity and wavelength information captured by each imaging pixel of the image sensor, and provide a set of raw image data that can be processed by the image signal processor.
  • the image signal processor can process the original image data pixel by pixel in a variety of formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the image signal processor may perform one or more video scoring operations on the original image data, and collect statistical information about the image data. Among them, the video scoring operation can be performed with the same or different bit depth accuracy.
  • the original image data can be stored in the image memory after being processed by the image signal processor.
  • the image signal processor can also receive image data from the image memory.
  • the image memory may be a part of a memory device, a storage device, or an independent dedicated memory in an electronic device, and may include DMA (Direct Memory Access) features.
  • the image signal processor can perform one or more video scoring operations, such as temporal filtering.
  • the processed image data can be sent to the image memory for additional processing before being displayed.
  • the image signal processor may also receive processed data from the image memory, and perform image data processing in the original domain and in the RGB and YCbCr color spaces on the processed data.
  • the processed image data can be output to a display for viewing by the user and/or further processed by a graphics engine or GPU (Graphics Processing Unit, graphics processor).
  • the output of the image signal processor can also be sent to the image memory, and the display can read image data from the image memory.
  • the image memory may be configured to implement one or more frame buffers.
  • the statistical data determined by the image signal processor can be sent to the control logic.
  • the statistical data may include the statistical information of the image sensor such as automatic exposure, automatic white balance, automatic focus, flicker detection, black level compensation, and lens shading correction.
  • the control logic may include a processor and/or microcontroller that executes one or more routines (such as firmware).
  • routines can determine the control parameters of the camera and the ISP control parameters based on the received statistical data.
  • the control parameters of the camera may include camera flash control parameters, lens control parameters (for example, focal length for focusing or zooming), or a combination of these parameters.
  • ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (for example, during RGB processing).
  • FIG. 7 is a schematic diagram of the structure of the video scoring circuit in this embodiment. For ease of description, only various aspects of the video scoring technology related to the embodiments of the present invention are shown.
  • the video scoring circuit may include: a camera, an image signal processor, a control logic, an image memory, and a display.
  • the camera may include one or more lenses and image sensors.
  • the camera may be any one of a telephoto camera or a wide-angle camera.
  • the images collected by the camera are transmitted to the image signal processor for processing.
  • the image signal processor processes the image, it can send the statistical data of the image (such as the brightness of the image, the contrast value of the image, the color of the image, etc.) to the control logic.
  • the control logic can determine the control parameters of the camera according to the statistical data, so that the camera can perform operations such as autofocus and automatic exposure according to the control parameters.
  • the image can be stored in the image memory after being processed by the image signal processor.
  • the image signal processor can also read the image stored in the image memory for processing.
  • the image can be directly sent to the monitor for display after being processed by the image signal processor.
  • the display can also read the image in the image memory for display.
  • the electronic device may also include a CPU and a power supply module.
  • the CPU is connected to the logic controller, image signal processor, image memory, and display, and the CPU is used to implement global control.
  • the power supply module is used to supply power to each module.
  • the application program stored in the memory 802 contains executable code.
  • Application programs can be composed of various functional modules.
  • the processor 803 executes various functional applications and data processing by running application programs stored in the memory 802.
  • the processor 803 is the control center of the electronic device: it uses various interfaces and lines to connect the various parts of the entire electronic device, and executes the device's various functions and processes data by running or executing the application programs stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the electronic device as a whole.
  • the touch display screen 804 may be used to receive a user's touch control operation on the electronic device.
  • the speaker 805 can play sound signals.
  • the microphone 806 can be used to pick up sound signals.
  • In this embodiment, the processor 803 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and the processor 803 runs the application programs stored in the memory 802, thereby executing:
  • obtaining a video to be detected; extracting video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm; performing feature fusion processing on the multi-dimensional video features based on a preset feature fusion algorithm to generate a fusion feature; and calculating, based on a preset regression algorithm and the fusion feature, the scores of the video to be detected on multiple emotional dimensions.
  • an embodiment of the present application thus provides an electronic device that extracts video features of multiple dimensions from a video to be detected according to a preset feature extraction algorithm and fuses these multi-dimensional video features to obtain a fusion feature.
  • The scores of the video to be detected in multiple emotional dimensions are then calculated according to a preset regression algorithm and the fusion feature; on this basis, this solution effectively combines the multiple types of features extracted from the video and uses the fusion feature as the basis for emotional scoring.
  • Based on a feed-forward neural network, the video to be detected is scored on multiple emotional dimensions and multiple scores are obtained, realizing scoring of the video along emotional dimensions.
  • An embodiment of the present application also provides a storage medium in which a computer program is stored.
  • the computer program When the computer program is run on a computer, the computer executes the video scoring method described in any of the above embodiments.
  • the storage medium may include, but is not limited to: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

A video scoring method and apparatus, a storage medium, and an electronic device, wherein: a video to be detected is obtained (101); video features of multiple dimensions are extracted from the video to be detected according to a preset feature extraction algorithm (102); feature fusion processing is performed on the multi-dimensional video features based on a preset feature fusion algorithm to generate a fusion feature (103); and emotional scores of the video to be detected in multiple emotional dimensions are calculated based on a preset regression algorithm and the fusion feature (104), thereby realizing scoring of a video along emotional dimensions.

Description

视频评分方法、装置、存储介质及电子设备 技术领域
本申请涉及视频评分技术领域,具体涉及一种视频评分方法、装置、存储介质及电子设备。
背景技术
近年来随着科技的发展,拍摄获取和浏览视频变得非常便捷,视频信息已经成为互联网上信息传播的重要方式,从各方面改变着人们的生活。然而视频的种类繁多数量巨大,里面的内容也良莠不齐,有的积极向上充满正能量,有的低沉压抑,有的很愤怒有暴力倾向,因此对视频进行情感维度上的评价与甄别变得尤为迫切。
发明内容
本申请实施例提供一种视频评分方法、装置、存储介质及电子设备,能够对视频从情感维度上对视频进行情感评分。
第一方面,本申请实施例提供一种视频评分方法,包括:
获取待检测视频;
根据预设的特征提取算法,从所述待检测视频中提取多个维度的视频特征;
基于预设特征融合算法对所述多个维度的视频特征进行特征融合处理,生成融合特征;
基于预设回归算法和所述融合特征,计算所述待检测视频在多个情感维度上的分数。
第二方面,本申请实施例提供一种视频评分装置,包括:
数据获取模块,用于获取待检测视频;
特征提取模块,用于根据预设的特征提取算法,从所述待检测视频中提取多个维度的视频特征;
分数计算模块,用于基于预设特征融合算法对所述多个维度的视频特征进行特征融合处理,生成融合特征;
以及,基于预设回归算法和所述融合特征,计算所述待检测视频在多个情感维度上的分数。
第三方面,本申请实施例提供一种存储介质,其上存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行如本申请任一实施例 提供的视频评分方法。
第四方面,本申请实施例提供一种电子设备,包括处理器和存储器,所述存储器有计算机程序,所述处理器通过调用所述计算机程序,用于执行如本申请任一实施例提供的视频评分方法。
本申请实施例提供的方案,根据预设的特征提取算法从待检测视频中分别提取多个维度的视频特征,将这多个维度的视频特征进行融合处理得到融合特征,然后根据预设回归算法和融合特征计算待检测视频在多个情感维度上的分数,基于此,本方案实现了将从视频中提取出的多种类型的特征有效结合,将该融合特征作为视频情感打分的依据,基于前馈神经网络对待检测视频在多个情感维度上打分,得到多个分数,实现了从情感维度上对视频进行评分。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的视频评分方法的第一种流程示意图。
图2为本申请实施例提出的视频评分方法中基于Valence-Arousal的环状模型示意图。
图3为本申请实施例提供的视频评分方法的第二种流程示意图。
图4为本申请实施例提供的视频评分方法的深度神经网络模型的结构示意图。
图5为本申请实施例提供的视频评分装置的结构示意图。
图6为本申请实施例提供的电子设备的结构示意图。
图7为本申请实施例提供的电子设备的视频评分电路的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有付出创造性劳动前提下所获得的所有其他实施例,都属于本申请的保护范围。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实 施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
本申请实施例提供一种视频评分方法,该视频评分方法的执行主体可以是本申请实施例提供的视频评分装置,或者集成了该视频评分装置的电子设备,其中该视频评分装置可以采用硬件或者软件的方式实现。其中,电子设备可以是智能手机、平板电脑、掌上电脑、笔记本电脑、或者台式电脑等设备。在一些实施例中,电子设备还可以是服务器。
请参照图1,图1为本申请实施例提供的视频评分方法的第一种流程示意图。本申请实施例提供的视频评分方法的具体流程可以如下:
在101中,获取待检测视频。
本申请中的视频评分方案可以应用于各种视频平台,例如,在线视频观看网站,视频分享APP等。对于视频系统来说,当接收到用户上传的视频时,可以在服务端,按照本申请实施例的方案对该视频进行情感评分,当情感评分满足预设条件时,才会将该视频上传到视频平台进行分享。
在102中,根据预设的特征提取算法,从待检测视频中提取多个维度的视频特征。
区别于基于单一特征对视频进行分类,本申请从多个情感角度对视频进行评分。其中,情感维度可以为两个或者两个以上。例如,在一些实施例中,情感维度包括的正负面程度和情感激烈程度,采用基于Valence-Arousal(正负面程度和激动程度)的环状模型对视频的情感进行打分。正负面程度可以理解为视频画面体现出的正面情感倾向或者负面情感倾向,例如,正面情感倾向可以是积极的,例如开心、满足等,负面情感倾向可以是消极的,例如生气、失望等。激动程度可以分为轻微(如平静、疲劳等)、中立、剧烈(如暴躁、激烈等)等。
请参阅图2,图2为本申请实施例提出的视频评分方法中基于Valence-Arousal的环状模型示意图。其中,横轴为正负面程度对应的分数,纵轴为激动程度的分数。在横轴上,0~-1为负面情感倾向,0~1为正面情感倾向,视频的分数越接近于1,情感越正面,比如,视频所体现出的积极程度越高,越接近于-1,情感越负面,视频所体现出的消极程度越高。同理,在纵轴上,视频的分数越接近于-1,视频中体现出的激动程度越轻微,视频的分数越接近 于1,视频中体现出的激动程度越剧烈。
可以理解的是,上述基于Valence-Arousal的环状模型包含有两个情感维度,在其他的实施例中,还可以根据评价需要,设置更多的情感维度对视频打分。
本实施例中,从待检测视频中提取多个维度的视频特征,例如,多个维度的视频特征可以包括人脸特征、音频特征和视觉特征。或者,在一些实施例中,除了上述三种特征之外,多个维度的视频特征还可以包括上述三种特征任意两种特征融合处理到的特征。或者,在一些实施例中,除了上述三种特征之外,多个维度的视频特征还可以包括其他维度的特征。
对于上述各个维度的特征,可以采取对应的特征提取段进行特定的特征的提取。
例如,对于人脸特征,可以预先使用包含有人脸信息的图像训练第深度神经网路以确定网络参数,作为特征提取网络。在提取视频的人脸特征时,将视频中包含有人脸的视频帧图像输入到该特征提取网络中,获取最后一个卷积层输出的特征,将该特征降维处理得到人脸特征向量,作为人脸特征。或者将全连接层输出的特征向量作为人脸特征。或者,还可以提取包含有人脸的视频帧图像的方向梯度直方图(Histogram of Oriented Gradient,HOG)特征作为人脸特征。
关于音频特征,可以单独提取出视频中的音频数据,将该音频数据转换为频谱图,再根据预先训练好的深度神经网络将该频谱图转换为语义向量,作为音频特征。或者,还可以直接将音频数据输入到预先训练好的自编码循环神经网络中,生成语义特征向量,作为音频特征。
关于视觉特征,可以提取视频帧图像的像素值的直方图,来体现出图像的明暗程度、色调。根据直方图中各个像素值出现次数,生成特征向量,作为视频的视觉特征。
在103中,基于预设特征融合算法对多个维度的视频特征进行特征融合处理,生成融合特征。
在获取到视频的上述三种特征后,将上述特征进行融合处理,例如,分别为每种特征赋予一个权重值,根据该权重值对多个特征向量进行加权平均,得到融合后的特征向量。或者,将上述三种特征的多个特征向量拼接为特征矩阵,按照预设的卷积层对该特征矩阵进行卷积操作,以进行特征的融合。
在104中,基于预设回归算法和融合特征,计算待检测视频在多个情感维度上的分数。
在得到融合特征后,将该融合特征输入到预设回归算法进行分数的计算。其中,预设回归算法可以是前馈神经网络、逻辑回归算法等。以前馈神经网络为例,前馈神经网络可以是通过携带有在多个情感维度上的分数的融合特征训练得到的。该网络的输出层中的神经元的数量等于情感维度的数量,一个神经元对应于一个情感维度。在输出层的每一个神经元进行归一化计算,得到一个-1~1之间的数字,作为待检测视频在该神经元对应的情感维度上的分数。
具体实施时,本申请不受所描述的各个步骤的执行顺序的限制,在不产生冲突的情况下,某些步骤还可以采用其它顺序进行或者同时进行。
由上可知,本申请实施例提出的视频评分方法,根据预设的特征提取算法从待检测视频中分别提取多个维度的视频特征,将这多个维度的视频特征进行融合处理得到融合特征,然后根据预设回归算法和融合特征计算待检测视频在多个情感维度上的分数,基于此,本方案实现了将从视频中提取出的多种类型的特征有效结合,将该融合特征作为视频情感打分的依据,基于前馈神经网络对待检测视频在多个情感维度上打分,得到多个分数,实现了从情感维度上对视频进行评分,该情感评分可以作为分享或者推荐该视频的依据。
人物的表情可以体现出视频整体表达的情感状态。本申请实施例中获取包含有人脸信息的视频帧图像作为分析对象,从中提取特征,作为视频对应的人脸特征。
在一些实施方式中,根据预设的特征提取算法,从待检测视频中提取人脸特征,包括:从待检测视频中获取包含有人脸信息的视频帧图像;根据预设的第一卷积神经网络和视频帧图像,生成人脸特征矩阵;对人脸特征矩阵降维处理,生成人脸特征向量。
该实施方式中,可以获取视频包含的多帧视频帧图像,从多帧视频帧图像中选择一帧或者多帧包含有人脸信息的视频帧图像作为目标视频帧图像。其中,当视频中包含有多个人物对象时,可以选择多帧包含有人脸信息的视频帧作为目标视频帧。当视频中只包含有一个人物对象时,可以选择一帧人脸区域在整个图像中占比最大的视频帧图像作为目标视频帧图像。
获取到目标视频帧图像后,将该目标视频帧图像输入到预先训练好的第一 卷积神经网络中进行运算,获取该网络的最后一个卷积层输出的feature map(特征地图),再根据该featuremap生成人脸特征向量。例如,最后一个卷积层输出的feature map为10×10的大小,则可以通过将10行拼接到一起的方式将其降维为1×100的向量,将该向量作为人脸特征向量。
或者,还可以获取该网络的全连接对该10×10的feature map降维操作后,得到的特征向量作为人脸特征向量。
在一些实施方式中,根据预设的特征提取算法,从待检测视频中提取人脸特征,包括:从待检测视频中获取包含有人脸信息的视频帧图像;计算视频帧图像的方向梯度直方图特征向量,并将方向梯度直方图特征向量作为人脸特征。
在该实施方式中,针对获取到的每一帧目标视频帧图像,计算其方向梯度直方图特征,该特征的形式一般为向量。其中,计算目标视频帧图像的方向梯度直方图特征的方式如下:将图像划分为多个区域,计算每一个区域中不同方向上梯度的值,然后进行累积,得到直方图特征。
下面将在上述实施例描述的方法基础上,对本申请的视频评分方法做进一步详细介绍。请参阅图3,图3是本发明实施例提供的视频评分方法的第二流程示意图。该方法包括:
在201中,获取待检测视频。
本申请中的视频评分方案可以应用于各种视频平台,例如,在线视频观看网站,视频分享应用程序等。对于视频系统来说,当接收到用户上传的视频时,可以在服务端,按照本申请实施例的方案对该视频进行情感评分,当情感评分满足预设条件时,才会将该视频上传到视频平台进行分享。
在202中,从待检测视频中获取目标视频帧图像,根据第一卷积神经网络和目标视频帧图像,生成人脸特征向量。
例如,从待检测视频中获取包含有人脸信息的视频帧图像,根据预设的第一卷积神经网络和视频帧图像,生成人脸特征矩阵,对人脸特征矩阵降维处理,生成人脸特征向量。
该实施例中,可以获取视频包含的多帧视频帧图像,从多帧视频帧图像中选择一帧或者多帧包含有人脸信息的视频帧图像作为目标视频帧图像。其中,当视频中包含有多个人物对象时,可以选择多帧包含有人脸信息的视频帧作为目标视频帧。当视频中只包含有一个人物对象时,可以选择一帧人脸区域在整 个图像中占比最大的视频帧图像作为目标视频帧图像。
可以理解的是,当目标视频帧图像有多帧时,计算得到的人脸特征向量也可以有多个。如果每一帧目标视频帧图像中的人物是不同的,则得到的人脸特征向量的数量等于目标视频帧图像的数量。如果多帧目标视频帧图像中的人物是相同的,则可以将这多帧具有相同人物的多帧目标视频帧图像计算得到的特征向量求平均值后,得到一个特征向量。
获取到目标视频帧图像后,将该目标视频帧图像输入到预先训练好的第一卷积神经网络中进行运算,获取该网络的最后一个卷积层输出的feature map(特征地图),再根据该feature map生成人脸特征向量。例如,对最后一个卷积层输出的feature map进行降维操作,生成一个特征向量。每一个目标视频帧图像对应于一个人脸特征向量。其中,人脸特征向量可以表示如下:
s 1={x 1,x 2,…,x n}
当有多个人脸特征向量时,在进行特征融合时,这多个人脸特征向量可以具有相同的权重值。
在203中,获取待检测视频中的音频数据,将音频数据转换为音频特征向量。
例如,获取待检测视频中包含的音频数据;根据音频特征提取算法将音频数据转换为频谱图;根据预先训练好的第二卷积神经网络和频谱图,生成音频数据的音频特征向量。
音频特征提取算法可以是MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)算法或者FFT(Fast Fourier Transformation,快速傅里叶变换)算法,通过音频特征提取算法将语音数据转换为频谱图,将频谱图作为自编码卷积神经网络的输入数据和输出数据,从网络中提取语义特征向量。第二卷积神经网络在训练时,其输出数据与输入数据一致,以获取其中间隐藏层中有价值的信息。将频谱图输入该第二卷积神经网络计算,将该网络的中间隐藏层输出的特征向量作为音频特征向量。其中,音频特征向量可以表示如下:
s 2={y 1,y 2,…,y n}
在其他实施例中,还可以直接将音频数据输入到预先训练好的自编码循环神经网络中,生成语义特征向量,作为音频特征。自编码神经网络模型由一个encoder编码器和一个decoder解码器组成,该网络的输出等于输入,网络包括 有中间隐藏层,中间隐藏层能够提取语音数据的语义特征向量。本方案中采用自编码循环神经网络从语音数据中提取语义特征向量,自编码循环神经网络的输入数据和输出数据均为上述语音数据。该网络在训练时,无需对语音数据贴标签,预先采集大量的语音数据作为网络的输入和输出,网络通过自学习确定网络参数。
在204中,从待检测视频中获取待检测图像,根据待检测图像的像素值分布直方图生成视觉特征向量。
例如,从待检测视频中获取待检测图像,获取待检测图像在一个或者多个像素通道上的像素值分布直方图,根据像素值分布直方图生成待检测图像在一个或者多个像素通道上的视觉特征向量。该实施例中,可以获取视频包含的多帧视频帧图像,从多帧视频帧图像中选择一帧或者多帧频帧图像作为待检测图像。
本实施例中,本方案提取待检测图像的RGB三通道的像素值分布直方图,作为图像明暗程度、色调的代表。假设用c i,i=0,1,2,3……,255来表示特征向量中的每个元素值,该元素的计算公式如下:
Figure PCTCN2019130520-appb-000001
其中,m i为像素值i的出现次数。计算得到的视觉特征向量可以表示如下:
s 3={c 1,c 2,…,c n}
根据上述计算方式可以计算得到每一个像素通道对应的特征向量,该特征向量为256维。因此,为了便于后续的特征融合操作,人脸特征向量、音频特征向量和视觉特征向量需要具有相同的维数,因此,可以通过预先调整第一卷积神经网络和第二卷积神经网络参数,以使其输出的featuremap降维后得到的向量也为256维。可以理解的是,在其他实施例中,也可以根据需要将人脸特征向量、音频特征向量和视觉特征向量的维数固定设置为其他值。
或者,在一些实施例中,不需要将人脸特征向量、音频特征向量和视觉特征向量的长度设置为相同,第一卷积神经网络和第二卷积神经网络的网络参数分别根据从提取特征准确度的角度设置。但是,在进行特征向量的拼接之前,先判断上述特征向量的长度是否达到预设长度,如果没有,则可以采用补零的方式,将特征向量的长度延伸到预设长度。
在205中,将人脸特征向量、音频特征向量和视觉特征向量拼接为特征矩阵。
在得到人脸特征向量、音频特征向量和视觉特征向量之后,将这三个向量拼接为特征矩阵。如下:
Figure PCTCN2019130520-appb-000002
在206中,将特征矩阵输入预设的深度神经网络模型,其中,深度神经网络模型包括卷积层和前馈神经网络。
在207中,根据卷积层对特征矩阵进行卷积运算,生成融合特征。
在208中,根据前馈神经网络和融合特征,生成待检测视频在多个情感维度上的分数。
请参阅图4,图4为本申请实施例提供的视频评分方法的深度神经网络模型的结构示意图。该深度神经网络模型包括据基层和由前馈神经网络层构成的回归层。将特征矩阵输入预设的深度神经网络模型,卷积层对特征矩阵进行卷积运算,生成融合特征;卷积核的大小可以为k×f,其中,f需要与输入的特征矩阵的行数匹配,例如输入的特征矩阵的行数为3,则f=3。比如,特征矩阵的尺寸为3×256,经过3×1的卷积核运算后,得到尺寸为1×256的融合特征。将卷积层输出的融合特征输入前馈神经网络进行计算,生成待检测视频在多个情感维度上的分数。
该深度神经网络模型由样本视频训练得到,样本视频携带有在多个情感维度上的分数,例如,【样本视频A:正负面程度分数为0.6,激动程度为0.1】可以作为一条训练样本,按照这样的方式,预先获取多条训练样本。针对携带有每一条样本视频,分别按照对应的特征提取算法,提取人脸特征向量、音频特征向量和视觉特征向量。再将携带有在多个情感维度上的分数的人脸特征向量、音频特征向量和视觉特征向量进行拼接后输入到预先构建好的深度神经网络中进行训练,确定模型参数。在模型训练过程中,模型可以学习得到每一种特征对各个情感维度上的评分结果的影响的大小,对评分结果影响大的特征会被赋予相对较大的权重,反之,对评分结果影响小的特征会被赋予相对较小的权重。基于此,训练得到的模型能够对输入的多种特征有效利用,实现了对视频从多个情感维度的有效量化。
在一实施例中还提供了一种视频评分装置。请参阅图5,图5为本申请实施例提供的视频评分装置300的结构示意图。其中该视频评分装置300应用于电子设备,该视频评分装置300包括数据获取模块301、特征提取模块302以及分数计算模块303,如下:
数据获取模块301,用于获取待检测视频;
特征提取模块302,用于根据预设的特征提取算法,从所述待检测视频中提取多个维度的视频特征;
分数计算模块303,用于基于预设特征融合算法对所述多个维度的视频特征进行特征融合处理,生成融合特征;
以及,基于预设回归算法和所述融合特征,计算所述待检测视频在多个情感维度上的分数。
在一些实施例中,分数计算模块303还用于:
将所述多个维度的视频特征拼接为特征矩阵;
将所述特征矩阵输入预设的深度神经网络模型,其中,所述深度神经网络模型包括卷积层和前馈神经网络,所述深度神经网络模型由样本视频训练得到,所述样本视频携带有在所述多个情感维度上的分数;
根据所述卷积层对所述特征矩阵进行卷积运算,生成融合特征;
根据所述前馈神经网络和所述融合特征,生成所述待检测视频在多个情感维度上的分数。
在一些实施例中,所述多个维度的视频特征包括人脸特征、音频特征和视觉特征。
在一些实施例中,特征提取模块302还用于:从所述待检测视频中获取包含有人脸信息的视频帧图像:
根据预设的第一卷积神经网络和所述视频帧图像,生成人脸特征矩阵;
对所述人脸特征矩阵降维处理,生成人脸特征向量。
在一些实施例中,特征提取模块302还用于:从所述待检测视频中获取包含有人脸信息的视频帧图像;
计算所述视频帧图像的方向梯度直方图特征向量,并将所述方向梯度直方图特征向量作为人脸特征。
在一些实施例中,特征提取模块302还用于:获取所述待检测视频中包含 的音频数据;
根据音频特征提取算法将所述音频数据转换为频谱图;
根据预先训练好的第二卷积神经网络和所述频谱图,生成所述音频数据的音频特征向量。
在一些实施例中,特征提取模块302还用于:
在一些实施例中,从所述待检测视频中获取待检测图像;
获取所述待检测图像在一个或者多个像素通道上的像素值分布直方图;
根据所述像素值分布直方图统计各个像素值的数量,并根据所述各个像素值的数量,生成所述待检测图像在一个或者多个像素通道上的视觉特征向量。
具体实施时,以上各个模块可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个模块的具体实施可参见前面的方法实施例,在此不再赘述。
应当说明的是,本申请实施例提供的视频评分装置与上文实施例中的视频评分方法属于同一构思,在视频评分装置上可以运行视频评分方法实施例中提供的任一方法,其具体实现过程详见视频评分方法实施例,此处不再赘述。
由上可知,本申请实施例提出的视频评分装置,根据预设的特征提取算法从待检测视频中分别提取人脸特征、音频特征和视觉特征,将这三种特征进行融合处理得到融合特征,然后根据预设回归算法和融合特征计算待检测视频在多个情感维度上的分数,基于此,本方案实现了将从视频中提取出的多种类型的特征有效结合,将该融合特征作为视频情感打分的依据,基于前馈神经网络对待检测视频在多个情感维度上打分,得到多个分数,实现了从情感维度上对视频进行评分。
本申请实施例还提供一种电子设备,该电子设备可以是诸如平板电脑或者智能手机等移动终端。请参阅图6,图6为本申请实施例提供的电子设备的结构示意图。电子设备800可以包括摄像模组801、存储器802、处理器803、触摸显示屏804、扬声器805、麦克风806等部件。
摄像模组801可以包括视频评分电路,视频评分电路可以利用硬件和/或软件组件实现,可包括定义图像信号处理(Image Signal Processing)管线的各种处理单元。视频评分电路至少可以包括:摄像头、图像信号处理器(Image Signal Processor,ISP处理器)、控制逻辑器、图像存储器以及显示器等。其中摄像头 至少可以包括一个或多个透镜和图像传感器。图像传感器可包括色彩滤镜阵列(如Bayer滤镜)。图像传感器可获取用图像传感器的每个成像像素捕捉的光强度和波长信息,并提供可由图像信号处理器处理的一组原始图像数据。
图像信号处理器可以按多种格式逐个像素地处理原始图像数据。例如,每个图像像素可具有8、10、12或14比特的位深度,图像信号处理器可对原始图像数据进行一个或多个视频评分操作、收集关于图像数据的统计信息。其中,视频评分操作可按相同或不同的位深度精度进行。原始图像数据经过图像信号处理器处理后可存储至图像存储器中。图像信号处理器还可从图像存储器处接收图像数据。
图像存储器可为存储器装置的一部分、存储设备、或电子设备内的独立的专用存储器,并可包括DMA(Direct Memory Access,直接直接存储器存取)特征。
当接收到来自图像存储器的图像数据时,图像信号处理器可进行一个或多个视频评分操作,如时域滤波。处理后的图像数据可发送给图像存储器,以便在被显示之前进行另外的处理。图像信号处理器还可从图像存储器接收处理数据,并对所述处理数据进行原始域中以及RGB和YCbCr颜色空间中的图像数据处理。处理后的图像数据可输出给显示器,以供用户观看和/或由图形引擎或GPU(Graphics Processing Unit,图形处理器)进一步处理。此外,图像信号处理器的输出还可发送给图像存储器,且显示器可从图像存储器读取图像数据。在一种实施方式中,图像存储器可被配置为实现一个或多个帧缓冲器。
图像信号处理器确定的统计数据可发送给控制逻辑器。例如,统计数据可包括自动曝光、自动白平衡、自动聚焦、闪烁检测、黑电平补偿、透镜阴影校正等图像传感器的统计信息。
控制逻辑器可包括执行一个或多个例程(如固件)的处理器和/或微控制器。一个或多个例程可根据接收的统计数据,确定摄像头的控制参数以及ISP控制参数。例如,摄像头的控制参数可包括照相机闪光控制参数、透镜的控制参数(例如聚焦或变焦用焦距)、或这些参数的组合。ISP控制参数可包括用于自动白平衡和颜色调整(例如,在RGB处理期间)的增益水平和色彩校正矩阵等。
请参阅图7,图7为本实施例中视频评分电路的结构示意图。为便于说明,仅示出与本发明实施例相关的视频评分技术的各个方面。
例如视频评分电路可以包括:摄像头、图像信号处理器、控制逻辑器、图像存储器、显示器。其中,摄像头可以包括一个或多个透镜和图像传感器。在一些实施例中,摄像头可为长焦摄像头或广角摄像头中的任一者。
摄像头采集的图像传输给图像信号处理器进行处理。图像信号处理器处理图像后,可将图像的统计数据(如图像的亮度、图像的反差值、图像的颜色等)发送给控制逻辑器。控制逻辑器可根据统计数据确定摄像头的控制参数,从而摄像头可根据控制参数进行自动对焦、自动曝光等操作。图像经过图像信号处理器进行处理后可存储至图像存储器中。图像信号处理器也可以读取图像存储器中存储的图像以进行处理。另外,图像经过图像信号处理器进行处理后可直接发送至显示器进行显示。显示器也可以读取图像存储器中的图像以进行显示。
此外,图中没有展示的,电子设备还可以包括CPU和供电模块。CPU和逻辑控制器、图像信号处理器、图像存储器和显示器均连接,CPU用于实现全局控制。供电模块用于为各个模块供电。
存储器802存储的应用程序中包含有可执行代码。应用程序可以组成各种功能模块。处理器803通过运行存储在存储器802的应用程序,从而执行各种功能应用以及数据处理。
处理器803是电子设备的控制中心,利用各种接口和线路连接整个电子设备的各个部分,通过运行或执行存储在存储器802内的应用程序,以及调用存储在存储器802内的数据,执行电子设备的各种功能和处理数据,从而对电子设备进行整体监控。
触摸显示屏804可以用于接收用户对电子设备的触摸控制操作。扬声器805可以播放声音信号。麦克风806可以用于拾取声音信号。
在本实施例中,电子设备中的处理器803会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行代码加载到存储器802中,并由处理器803来运行存储在存储器802中的应用程序,从而执行:
获取待检测视频;
根据预设的特征提取算法,从所述待检测视频中提取多个维度的视频特征;
基于预设特征融合算法对所述多个维度的视频特征进行特征融合处理,生成融合特征;
基于预设回归算法和所述融合特征,计算所述待检测视频在多个情感维度 上的分数。
由上可知,本申请实施例提供了一种电子设备,所述电子设备根据预设的特征提取算法从待检测视频中分别提取多个维度的视频特征,将这多个维度的视频特征进行融合处理得到融合特征,然后根据预设回归算法和融合特征计算待检测视频在多个情感维度上的分数,基于此,本方案实现了将从视频中提取出的多种类型的特征有效结合,将该融合特征作为视频情感打分的依据,基于前馈神经网络对待检测视频在多个情感维度上打分,得到多个分数,实现了从情感维度上对视频进行评分。
本申请实施例还提供一种存储介质,所述存储介质中存储有计算机程序,当所述计算机程序在计算机上运行时,所述计算机执行上述任一实施例所述的视频评分方法。
需要说明的是,本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过计算机程序来指令相关的硬件来完成,所述计算机程序可以存储于计算机可读存储介质中,所述存储介质可以包括但不限于:只读存储器(ROM,Read Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁盘或光盘等。
此外,本申请中的术语“第一”、“第二”和“第三”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或模块的过程、方法、系统、产品或设备没有限定于已列出的步骤或模块,而是某些实施例还包括没有列出的步骤或模块,或某些实施例还包括对于这些过程、方法、产品或设备固有的其它步骤或模块。
以上对本申请实施例所提供的视频评分方法、装置、存储介质及电子设备进行了详细介绍。本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. A video scoring method, characterized by comprising:
    obtaining a video to be detected;
    extracting video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm;
    performing feature fusion processing on the video features of the multiple dimensions based on a preset feature fusion algorithm to generate a fusion feature; and
    calculating, based on a preset regression algorithm and the fusion feature, scores of the video to be detected in multiple emotional dimensions.
  2. The video scoring method according to claim 1, wherein performing feature fusion processing on the video features of the multiple dimensions based on the preset feature fusion algorithm to generate the fusion feature, and calculating, based on the preset regression algorithm and the fusion feature, the scores of the video to be detected in the multiple emotional dimensions comprises:
    splicing the video features of the multiple dimensions into a feature matrix;
    inputting the feature matrix into a preset deep neural network model, wherein the deep neural network model comprises a convolutional layer and a feedforward neural network, the deep neural network model is obtained by training on sample videos, and the sample videos carry scores in the multiple emotional dimensions;
    performing a convolution operation on the feature matrix according to the convolutional layer to generate the fusion feature; and
    generating, according to the feedforward neural network and the fusion feature, the scores of the video to be detected in the multiple emotional dimensions.
  3. The video scoring method according to claim 1, wherein the video features of the multiple dimensions comprise facial features, audio features, and visual features.
  4. The video scoring method according to claim 3, wherein extracting the facial features from the video to be detected according to the preset feature extraction algorithm comprises:
    obtaining, from the video to be detected, a video frame image containing human face information;
    generating a face feature matrix according to a preset first convolutional neural network and the video frame image; and
    performing dimensionality reduction on the face feature matrix to generate a face feature vector.
  5. The video scoring method according to claim 3, wherein extracting the facial features from the video to be detected according to the preset feature extraction algorithm comprises:
    obtaining, from the video to be detected, a video frame image containing human face information; and
    calculating a histogram of oriented gradients feature vector of the video frame image, and using the histogram of oriented gradients feature vector as the facial feature.
  6. The video scoring method according to claim 3, wherein extracting the audio features from the video to be detected according to the preset feature extraction algorithm comprises:
    obtaining audio data contained in the video to be detected;
    converting the audio data into a spectrogram according to an audio feature extraction algorithm; and
    generating an audio feature vector of the audio data according to a pre-trained second convolutional neural network and the spectrogram.
  7. The video scoring method according to claim 3, wherein extracting the visual features from the video to be detected according to the preset feature extraction algorithm comprises:
    obtaining an image to be detected from the video to be detected;
    obtaining a pixel value distribution histogram of the image to be detected on one or more pixel channels; and
    counting the number of each pixel value according to the pixel value distribution histogram, and generating, according to the number of each pixel value, a visual feature vector of the image to be detected on the one or more pixel channels.
  8. A video scoring device, characterized by comprising:
    a data acquisition module, configured to acquire a video to be detected;
    a feature extraction module, configured to extract video features of multiple dimensions from the video to be detected according to a preset feature extraction algorithm; and
    a score calculation module, configured to perform feature fusion processing on the video features of the multiple dimensions based on a preset feature fusion algorithm to generate a fusion feature,
    and to calculate, based on a preset regression algorithm and the fusion feature, scores of the video to be detected in multiple emotional dimensions.
  9. The video scoring device according to claim 8, wherein the score calculation module is further configured to: splice the video features of the multiple dimensions into a feature matrix;
    input the feature matrix into a preset deep neural network model, wherein the deep neural network model comprises a convolutional layer and a feedforward neural network, the deep neural network model is obtained by training on sample videos, and the sample videos carry scores in the multiple emotional dimensions;
    perform a convolution operation on the feature matrix according to the convolutional layer to generate the fusion feature; and
    generate, according to the feedforward neural network and the fusion feature, the scores of the video to be detected in the multiple emotional dimensions.
  10. The video scoring device according to claim 8, wherein the video features of the multiple dimensions comprise facial features, audio features, and visual features.
  11. The video scoring device according to claim 10, wherein the feature extraction module is further configured to: obtain, from the video to be detected, a video frame image containing human face information;
    generate a face feature matrix according to a preset first convolutional neural network and the video frame image; and
    perform dimensionality reduction on the face feature matrix to generate a face feature vector.
  12. The video scoring device according to claim 10, wherein the feature extraction module is further configured to: obtain, from the video to be detected, a video frame image containing human face information; and
    calculate a histogram of oriented gradients feature vector of the video frame image, and use the histogram of oriented gradients feature vector as the facial feature.
  13. The video scoring device according to claim 10, wherein the feature extraction module is further configured to: obtain audio data contained in the video to be detected;
    convert the audio data into a spectrogram according to an audio feature extraction algorithm; and
    generate an audio feature vector of the audio data according to a pre-trained second convolutional neural network and the spectrogram.
  14. The video scoring device according to claim 10, wherein the feature extraction module is further configured to: obtain an image to be detected from the video to be detected;
    obtain a pixel value distribution histogram of the image to be detected on one or more pixel channels; and
    count the number of each pixel value according to the pixel value distribution histogram, and generate, according to the number of each pixel value, a visual feature vector of the image to be detected on the one or more pixel channels.
  15. A storage medium on which a computer program is stored, wherein, when the computer program runs on a computer, the computer is caused to execute the video scoring method according to any one of claims 1 to 7.
  16. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to execute, by invoking the computer program, the video scoring method according to any one of claims 1 to 7.
  17. The electronic device according to claim 16, wherein the processor, by invoking the computer program, further executes:
    splicing the video features of the multiple dimensions into a feature matrix;
    inputting the feature matrix into a preset deep neural network model, wherein the deep neural network model comprises a convolutional layer and a feedforward neural network, the deep neural network model is obtained by training on sample videos, and the sample videos carry scores in the multiple emotional dimensions;
    performing a convolution operation on the feature matrix according to the convolutional layer to generate the fusion feature; and
    generating, according to the feedforward neural network and the fusion feature, the scores of the video to be detected in the multiple emotional dimensions.
  18. The electronic device according to claim 16, wherein the video features of the multiple dimensions comprise facial features, audio features, and visual features.
  19. The electronic device according to claim 18, wherein the processor, by invoking the computer program, further executes:
    obtaining, from the video to be detected, a video frame image containing human face information;
    generating a face feature matrix according to a preset first convolutional neural network and the video frame image; and
    performing dimensionality reduction on the face feature matrix to generate a face feature vector.
  20. The electronic device according to claim 16, wherein the processor, by invoking the computer program, further executes:
    obtaining audio data contained in the video to be detected;
    converting the audio data into a spectrogram according to an audio feature extraction algorithm; and
    generating an audio feature vector of the audio data according to a pre-trained second convolutional neural network and the spectrogram.
PCT/CN2019/130520 2019-12-31 2019-12-31 视频评分方法、装置、存储介质及电子设备 WO2021134485A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980100393.4A CN114375466A (zh) 2019-12-31 2019-12-31 视频评分方法、装置、存储介质及电子设备
PCT/CN2019/130520 WO2021134485A1 (zh) 2019-12-31 2019-12-31 视频评分方法、装置、存储介质及电子设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130520 WO2021134485A1 (zh) 2019-12-31 2019-12-31 视频评分方法、装置、存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2021134485A1 true WO2021134485A1 (zh) 2021-07-08

Family

ID=76687522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130520 WO2021134485A1 (zh) 2019-12-31 2019-12-31 视频评分方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN114375466A (zh)
WO (1) WO2021134485A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565814A (zh) * 2022-02-25 2022-05-31 平安国际智慧城市科技股份有限公司 一种特征检测方法、装置及终端设备
CN115495712A (zh) * 2022-09-28 2022-12-20 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115953715A (zh) * 2022-12-22 2023-04-11 北京字跳网络技术有限公司 一种视频检测方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109475294A (zh) * 2016-05-06 2019-03-15 斯坦福大学托管董事会 用于治疗精神障碍的移动和可穿戴视频捕捉和反馈平台
CN110147936A (zh) * 2019-04-19 2019-08-20 深圳壹账通智能科技有限公司 基于情绪识别的服务评价方法、装置、存储介质
US20190278978A1 (en) * 2018-03-08 2019-09-12 Electronics And Telecommunications Research Institute Apparatus and method for determining video-related emotion and method of generating data for learning video-related emotion
CN110414323A (zh) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 情绪检测方法、装置、电子设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109475294A (zh) * 2016-05-06 2019-03-15 斯坦福大学托管董事会 用于治疗精神障碍的移动和可穿戴视频捕捉和反馈平台
US20190278978A1 (en) * 2018-03-08 2019-09-12 Electronics And Telecommunications Research Institute Apparatus and method for determining video-related emotion and method of generating data for learning video-related emotion
CN110147936A (zh) * 2019-04-19 2019-08-20 深圳壹账通智能科技有限公司 基于情绪识别的服务评价方法、装置、存储介质
CN110414323A (zh) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 情绪检测方法、装置、电子设备及存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565814A (zh) * 2022-02-25 2022-05-31 平安国际智慧城市科技股份有限公司 一种特征检测方法、装置及终端设备
CN115495712A (zh) * 2022-09-28 2022-12-20 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115495712B (zh) * 2022-09-28 2024-04-16 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115953715A (zh) * 2022-12-22 2023-04-11 北京字跳网络技术有限公司 一种视频检测方法、装置、设备及存储介质
CN115953715B (zh) * 2022-12-22 2024-04-19 北京字跳网络技术有限公司 一种视频检测方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN114375466A (zh) 2022-04-19

Similar Documents

Publication Publication Date Title
WO2021087985A1 (zh) 模型训练方法、装置、存储介质及电子设备
US20230008363A1 (en) Audio matching method and related device
US20200334830A1 (en) Method, apparatus, and storage medium for processing video image
WO2020192483A1 (zh) 图像显示方法和设备
US20210012127A1 (en) Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium
US10832069B2 (en) Living body detection method, electronic device and computer readable medium
WO2021134485A1 (zh) 视频评分方法、装置、存储介质及电子设备
CN111401324A (zh) 图像质量评估方法、装置、存储介质及电子设备
WO2022161298A1 (zh) 信息生成方法、装置、设备、存储介质及程序产品
CN111209970B (zh) 视频分类方法、装置、存储介质及服务器
WO2021138855A1 (zh) 模型训练方法、视频处理方法、装置、存储介质及电子设备
WO2023178906A1 (zh) 活体检测方法及装置、电子设备、存储介质、计算机程序、计算机程序产品
WO2021092808A1 (zh) 网络模型的训练方法、图像的处理方法、装置及电子设备
CN110348358B (zh) 一种肤色检测系统、方法、介质和计算设备
Geng et al. Learning deep spatiotemporal feature for engagement recognition of online courses
CN109784277A (zh) 一种基于智能眼镜的情绪识别方法
CN109035147A (zh) 图像处理方法及装置、电子装置、存储介质和计算机设备
CN113516990A (zh) 一种语音增强方法、训练神经网络的方法以及相关设备
CN109145861B (zh) 情绪识别装置及方法、头戴式显示设备、存储介质
CN113611318A (zh) 一种音频数据增强方法及相关设备
CN115620054A (zh) 一种缺陷分类方法、装置、电子设备及存储介质
CN110826726B (zh) 目标处理方法、目标处理装置、目标处理设备及介质
WO2021189321A1 (zh) 一种图像处理方法和装置
Nakanishi et al. Facial expression recognition of a speaker using thermal image processing and reject criteria in feature vector space
CN112818782B (zh) 一种基于媒介感知的泛化性静默活体检测方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.11.2022)