CN103500184A - Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data - Google Patents

Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data

Info

Publication number
CN103500184A
Authority
CN
China
Prior art keywords
video
feature
key frame
video data
memorability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310418333.1A
Other languages
Chinese (zh)
Other versions
CN103500184B (en)
Inventor
韩军伟
刘念
郭雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201310418333.1A priority Critical patent/CN103500184B/en
Publication of CN103500184A publication Critical patent/CN103500184A/en
Application granted granted Critical
Publication of CN103500184B publication Critical patent/CN103500184B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • G06F16/786Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video memorability judging method based on low-level visual and auditory features. Object Bank, saliency, color, motion and audio features are extracted from a video and combined into a single representation, and a support vector regression machine is trained on this representation to obtain a video memorability judging model. Given a new video, the model then judges its memorability value. The method can be used to judge the memorability of videos in industries such as advertising and news editing, helping practitioners pick suitable videos, and therefore has considerable commercial value.

Description

A video memorability determination method based on low-level visual and auditory features of video data
Technical field
The invention belongs to the field of computer image processing and relates to the processing of computer video data. In particular, it relates to a video memorability determination method based on low-level visual and auditory features of video data, which can be applied to judge the memorability values of various kinds of videos.
Background technology
The memorability of image and video data is a new research direction in digital image and video processing. Research results are still scarce and concentrate on image memorability; no work on video memorability has been published yet.
For image memorability, only a few methods exist. The existing approaches first extract global image features (such as SIFT, GIST and HOG), train a classifier on them, and then use the classifier to judge the memorability of a given image. Image memorability has many applications: an editor can select an easily remembered image as a magazine cover, and an advertiser can select an easily remembered image for a poster. It would therefore be highly significant if a computer could automatically determine whether a given image will be remembered.
For video memorability, no determination method has been published so far. Video memorability has very wide applications, for example in the evaluation of video advertisements. If people can still remember an advertisement video some time after watching it, the advertisement is of high value; if they cannot, its value is lower. Research on video memorability therefore has strong practical significance.
Summary of the invention
Technical problem to be solved
To solve the problem of determining video memorability, the present invention proposes a video memorability determination method based on low-level visual and auditory features of video data.
Technical scheme
A video memorability determination method based on low-level visual and auditory features, with the following feature extraction steps:
Step 1: extract the visual and auditory features of the video data in the video database.
Step a: the video database contains N ∈ [100, 1000] videos in total; for each video, extract the first frame of every second as a key frame.
Step b: compute the Object Bank feature of each video. Using the Object Bank program package released by Li-Jia Li in 2010, down-sample each key frame to obtain 12 scale images of the input image, and convolve these 12 scale images with the 208 object templates in the Object Bank program, yielding 208 × 12 response images per key frame.
Using bilinear interpolation, bring the 12 response images of each template (one per scale) to the same size.
Take the per-pixel maximum over the 12 same-size response images to form a maximum response image; then take the mean pixel value of each maximum response image, giving a 208-dimensional feature vector per key frame.
Take the per-dimension maximum of the 208-dimensional features over all key frames of a video, giving a 208-dimensional feature vector per video.
Compute the mean and variance of the 208-dimensional vector of each video, giving a 2-dimensional feature.
Then find the value max of the largest component of the 208-dimensional vector of each video, and take the fraction of components whose value exceeds 1.5·max (out of the 208 dimensions) as the Object Bank simplicity feature.
Concatenate the 208-dimensional feature with the mean, variance and Object Bank simplicity feature to obtain the 211-dimensional Object Bank feature of the video.
Step c: compute the saliency feature of each video.
First extract the saliency image of each key frame of each video in the database and binarize it to obtain a binary image; compute the number of connected regions in each binary image and their areas; then
(1) compute the saliency entropy feature of the image, denoted SE, as
SE = -Σ_{k=1}^{N} (S_k / S) · ln(S_k / S)
where N is the number of connected regions in the binary image, S_k is the area of the k-th region, and S is the total area of all regions of the binary image;
(2) generate a normalized Gaussian template of the same size as the video frame, centered at the frame center, with a standard deviation fixed as a function of the frame size, and compute the average saliency intensity SI as
SI = (1 / S_frame) · Σ_{k=1}^{N} Σ_{(i,j)∈R_k} B_{i,j} · ω_Gaus(i,j)
where (x, y) is the size of the video frame, S_frame is the area of the key frame, N is the number of connected regions in the binary image, R_k is the k-th region, B_{i,j} is the saliency value of pixel (i, j) in the saliency image, and ω_Gaus(i,j) is the weight of the previously obtained Gaussian template at pixel (i, j).
This gives the two features SE and SI for each key frame; averaging each of them over all key frames of a video gives the 2-dimensional saliency feature vector of the video.
Step d: compute the color feature of each video.
(1) Convert each key frame from the RGB color space to the HSV color space; the mean of the V values of the key frame in HSV space is the brightness feature, and the mean of its S values is the saturation feature.
(2) Convert each key frame from the RGB color space to the HSL color space; the unbiased standard deviation of the L values of the key frame in HSL space is the contrast feature.
(3) Compute the colorfulness feature: in the RGB color space compute rg = R − G and yb = (R + G)/2 − B for each pixel of the key frame, then compute the means μ_rg and μ_yb and variances σ²_rg and σ²_yb of the rg and yb values, and
colorfulness = sqrt(σ_rg² + σ_yb²) + 0.3 · sqrt(μ_rg² + μ_yb²).
(4) Compute the simplicity feature: build the histogram of the key frame in RGB space by uniformly quantizing each of the three RGB channels into 16 bins, giving 4096 bins for the whole RGB space, and count the pixels belonging to each bin to obtain a 4096-dimensional histogram per key frame; find the maximum amplitude max in this histogram, and take the fraction of bins whose amplitude exceeds 0.01·max (out of the 4096 bins) as the simplicity feature.
Each key frame thus yields 5 values: brightness, saturation, contrast, colorfulness and simplicity. The mean and variance of each of these 5 values over all key frames of a video give 10 values in total, which are concatenated into the 10-dimensional color feature vector of the video.
Step e: compute the motion feature of each video. First sample each video uniformly at 5 frames per second, then apply standard block-based motion estimation to the sampled frames.
For N sampled frames this gives motion vectors for the macroblocks of N−1 pairs of adjacent frames. Compute the mean magnitude of the motion vectors of all macroblocks between each pair of adjacent frames, giving N−1 mean motion intensities; their mean and variance form the 2-dimensional motion feature of the video.
Step f: compute the audio feature of each video. First extract the audio signal of each video, then use the MIRtoolbox program package released by Olivier Lartillot et al. to extract the 13-dimensional MFCC feature of the audio signal, together with the brightness, roughness, novelty, low-energy rate, root-mean-square energy, zero-crossing rate, roll-off, pitch estimation and Shannon entropy features; concatenate these 22 dimensions into one long vector as the audio feature of the video.
Step 2: model training.
Take a given video database with memorability values as training samples. Using the method of step 1, extract the Object Bank, saliency, color, motion and audio features of each training video and concatenate them end to end into the 247-dimensional low-level visual and auditory feature vector of the video; using the memorability value of each training video as its label, train a support vector regression model.
Step 3: predict the memorability value of a video.
For a video with unknown memorability value, extract its Object Bank, saliency, color, motion and audio features as in step 1, concatenate them end to end into a 247-dimensional low-level visual and auditory feature vector, and input it into the support vector regression model obtained in step 2 to obtain the memorability value of the unknown video.
The video database with memorability values used in step 2 is the one obtained with the video memorability determination method based on functional magnetic resonance imaging of patent application No. 201310332613.0.
Beneficial effects
The video memorability determination method based on low-level visual and auditory features of video data proposed by the present invention builds on the observation that several visual and auditory factors of a video all influence its memorability: the objects contained in the video, represented here by the Object Bank feature; how strongly the video frames attract human vision, represented by the saliency feature of the frames; the color of the frames, represented by the color feature; the motion amplitude of the frames, represented by the motion feature; and the auditory factor of the video sound, represented by the audio feature. Because all five factors strongly affect video memorability, combining the five features gives good memorability judging results.
For computer video data, the present invention extracts the Object Bank, saliency, color, motion and audio features of a video and studies the influence of these visual and auditory factors on video memorability, including the object information in the video, how strongly the frames attract human vision, the color of the frames, the motion amplitude of the frames, and the auditory information of the video sound. Because the extracted features represent the memorability of a video well, they can be used to judge its memorability value accurately.
Description of the drawings
Fig. 1: basic flowchart of the method of the invention
Embodiment
The invention is now further described with reference to the embodiments and the accompanying drawing:
The hardware environment used for implementation is an AMD Athlon 64 X2 5000+ computer with 2 GB of memory and a 256 MB graphics card; the software environment is Matlab 2009a under Windows XP. The method proposed by the present invention was implemented in Matlab.
The present invention is specifically implemented as follows:
1. Extract the low-level visual and auditory feature vectors of the N videos:
Step a: extract the first frame of every second of each video as its key frames.
Step b: compute the Object Bank feature of each video. Using the Object Bank program package released by Li-Jia Li in 2010, down-sample each key frame to obtain 12 scale images of the input image, and convolve these 12 scale images with the 208 object templates in the Object Bank program, so that each key frame yields 208 × 12 response images. Using bilinear interpolation, bring the 12 response images of each template to the same size, take the per-pixel maximum over the 12 same-size response images to form a maximum response image, and take its mean pixel value, giving a 208-dimensional feature vector per key frame. Take the per-dimension maximum of these vectors over all key frames of a video to obtain a 208-dimensional vector per video. Then compute the mean and variance of this vector (2 dimensions), find the value max of its largest component, and take the fraction of the 208 components whose value exceeds 1.5·max as the Object Bank simplicity feature. Concatenating the 208-dimensional feature with the mean, variance and simplicity feature gives the 211-dimensional Object Bank feature of the video. The Object Bank program released by Li-Jia Li in 2010 is described in: Li-Jia Li, Hao Su, Eric Xing, et al. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 2010.
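For illustration, a minimal Python sketch of this video-level aggregation is given below; it is a sketch rather than the patented implementation, and assumes the 208-dimensional per-key-frame Object Bank vectors have already been computed with the Object Bank package (the input name `keyframe_ob` is hypothetical).

```python
import numpy as np

def object_bank_video_feature(keyframe_ob: np.ndarray) -> np.ndarray:
    """Aggregate per-key-frame Object Bank vectors (num_keyframes x 208)
    into the 211-dimensional video-level Object Bank feature."""
    # Per-dimension maximum over all key frames of the video -> 208 dims.
    video_ob = keyframe_ob.max(axis=0)
    # Mean and variance of the 208-dimensional vector -> 2 dims.
    mean, var = video_ob.mean(), video_ob.var()
    # Simplicity: fraction of components exceeding 1.5 times the largest component.
    max_val = video_ob.max()
    simplicity = np.sum(video_ob > 1.5 * max_val) / video_ob.size
    return np.concatenate([video_ob, [mean, var, simplicity]])
```

The returned vector has 208 + 3 = 211 dimensions, matching the description above.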
Step c: compute the saliency feature of each video. For each video, first obtain the saliency image of each key frame with the method proposed by Tilke Judd et al. Then binarize the saliency image of each key frame with a threshold to obtain the corresponding binary image; compute the number of connected regions in the binary image and their areas; then (1) compute the saliency entropy feature of the image, denoted SE, with
SE = -Σ_{k=1}^{N} (S_k / S) · ln(S_k / S)
where N is the number of connected regions in the binary image, S_k is the area of the k-th region, and S is the total area of all regions of the binary image; (2) generate a normalized Gaussian template of the same size as the video frame, centered at the frame center; if the frame size is (x, y), the standard deviation of the Gaussian template is set empirically as a function of (x, y); then compute the average saliency intensity of a video frame, denoted SI, with
SI = (1 / S_frame) · Σ_{k=1}^{N} Σ_{(i,j)∈R_k} B_{i,j} · ω_Gaus(i,j)
where S_frame is the area of the key frame, N is the number of connected regions in the binary image, R_k is the k-th region, B_{i,j} is the saliency value of pixel (i, j) in the saliency image, and ω_Gaus(i,j) is the weight of the previously obtained Gaussian template at pixel (i, j). This gives the two features SE and SI for each key frame; averaging each of them over all key frames of a video gives the 2-dimensional saliency feature vector of the video.
The method proposed by Tilke Judd et al. is described in: Tilke Judd, Krista Ehinger, Frédo Durand, et al. Learning to Predict Where Humans Look. ICCV, 2009, 2106-2113.
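A minimal Python sketch of the SE and SI computations follows, assuming the saliency map of a key frame has already been obtained (the `saliency_map` array and `threshold` value are hypothetical inputs); the Gaussian standard deviation used here is an assumption, since the patent fixes it empirically.

```python
import numpy as np
from scipy import ndimage

def saliency_features(saliency_map: np.ndarray, threshold: float) -> tuple:
    """Compute the saliency entropy SE and average saliency intensity SI
    of one key frame from its saliency map (values in [0, 1])."""
    binary = saliency_map > threshold
    labels, num_regions = ndimage.label(binary)          # connected regions R_1..R_N
    if num_regions == 0:
        return 0.0, 0.0
    areas = ndimage.sum(binary, labels, range(1, num_regions + 1))
    p = areas / areas.sum()
    # SE = -sum_k (S_k / S) * ln(S_k / S)
    se = -np.sum(p * np.log(p))
    # Centered normalized Gaussian template of the same size as the frame;
    # the standard deviation below is an assumption (the patent sets it empirically).
    h, w = saliency_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = 0.25 * np.hypot(h, w)
    gauss = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    gauss /= gauss.sum()
    # SI = (1 / S_frame) * sum over salient pixels of B_ij * w_Gaus(i, j)
    si = np.sum(saliency_map * gauss * binary) / (h * w)
    return se, si
```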
Step d: compute the color feature of each video. (1) Convert each key frame from the RGB color space to the HSV color space; the mean of the V values of the key frame in HSV space is the brightness feature, and the mean of its S values is the saturation feature. (2) Convert each key frame from the RGB color space to the HSL color space; the unbiased standard deviation of the L values of the key frame in HSL space is the contrast feature. (3) Compute the colorfulness feature: for each pixel of a key frame compute rg = R − G and yb = (R + G)/2 − B in the RGB color space, then compute the means μ_rg and μ_yb and variances σ²_rg and σ²_yb of the rg and yb values, and
colorfulness = sqrt(σ_rg² + σ_yb²) + 0.3 · sqrt(μ_rg² + μ_yb²).
(4) Compute the simplicity feature: build the histogram of a key frame in RGB space by quantizing each of the three RGB channels into 16 bins (4096 bins for the whole RGB space) and counting the pixels in each bin, giving a 4096-dimensional histogram per key frame; find its maximum amplitude max, and take the fraction of bins whose amplitude exceeds 0.01·max (out of the 4096 bins) as the simplicity feature. Each key frame of a video thus yields 5 values (brightness, saturation, contrast, colorfulness, simplicity); the mean and variance of each of these values over all key frames give 10 values in total, which are concatenated into the 10-dimensional color feature vector of the video.
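A minimal Python sketch of the five per-key-frame color values, using OpenCV for the color-space conversions (`frame_bgr` is a hypothetical uint8 key frame in OpenCV's BGR channel order):

```python
import cv2
import numpy as np

def color_features(frame_bgr: np.ndarray) -> np.ndarray:
    """Brightness, saturation, contrast, colorfulness and simplicity of one key frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)
    brightness = hsv[..., 2].mean()            # mean of V in HSV
    saturation = hsv[..., 1].mean()            # mean of S in HSV
    contrast = hls[..., 1].std(ddof=1)         # unbiased std of L in HSL
    b, g, r = [frame_bgr[..., i].astype(np.float64) for i in range(3)]
    rg, yb = r - g, 0.5 * (r + g) - b
    colorfulness = (np.hypot(rg.std(), yb.std())
                    + 0.3 * np.hypot(rg.mean(), yb.mean()))
    # Simplicity: 16 bins per RGB channel -> 4096-bin joint histogram.
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None, [16, 16, 16],
                        [0, 256, 0, 256, 0, 256]).ravel()
    simplicity = np.sum(hist > 0.01 * hist.max()) / hist.size
    return np.array([brightness, saturation, contrast, colorfulness, simplicity])
```

The 10-dimensional video-level color feature is then the mean and variance of each of these five values over all key frames.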
Step e: compute the motion feature of each video. Sample each video uniformly at 5 frames per second, then apply standard block-based motion estimation to the sampled frames. If N frames are sampled, this gives motion vectors for the macroblocks of N−1 pairs of adjacent frames. Compute the mean magnitude of the motion vectors of the macroblocks between each pair of adjacent frames, giving N−1 mean motion intensities; their mean and variance form the 2-dimensional motion feature of the video.
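A minimal Python sketch of the motion feature is given below; purely for illustration it substitutes OpenCV's dense optical flow for the standard block-based (macroblock) motion estimation named above, so it is a stand-in rather than the exact procedure. `frames` is a hypothetical list of grayscale frames sampled at 5 frames per second.

```python
import cv2
import numpy as np

def motion_feature(frames: list) -> np.ndarray:
    """Mean and variance of the per-frame-pair mean motion magnitude.
    Dense optical flow is used here in place of macroblock matching."""
    intensities = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.hypot(flow[..., 0], flow[..., 1])
        intensities.append(magnitude.mean())    # mean motion intensity of this frame pair
    intensities = np.asarray(intensities)
    return np.array([intensities.mean(), intensities.var()])  # 2-dim motion feature
```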
Step f: compute the audio feature of each video. First extract the audio signal of each video, then use the MIRtoolbox program package released by Olivier Lartillot et al. to extract the 13-dimensional MFCC feature of the audio signal, together with the brightness, roughness, novelty, low-energy rate, root-mean-square energy, zero-crossing rate, roll-off, pitch estimation and Shannon entropy features; concatenate these 22 dimensions into one long vector as the audio feature of the video.
The MIRtoolbox program package released by Olivier Lartillot et al. is available at: www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox
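MIRtoolbox is a Matlab package; as a hedged Python substitute, the sketch below extracts only the subset of the 22 audio dimensions that librosa exposes directly (MFCC, zero-crossing rate, RMS energy and roll-off). The remaining descriptors (brightness, roughness, novelty, low-energy rate, pitch estimation, Shannon entropy) would still have to come from MIRtoolbox or equivalent code.

```python
import librosa
import numpy as np

def partial_audio_feature(audio_path: str) -> np.ndarray:
    """Subset of the audio descriptors, averaged over time (librosa substitute)."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)   # 13-dim MFCC
    zcr = librosa.feature.zero_crossing_rate(y).mean()                # zero-crossing rate
    rms = librosa.feature.rms(y=y).mean()                             # root-mean-square energy
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()     # roll-off
    return np.concatenate([mfcc, [zcr, rms, rolloff]])
```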
2. Model training: take the training videos and their corresponding memorability values and, using the method of step 1, extract the Object Bank, saliency, color, motion and audio features of each training video, then concatenate them into a 247-dimensional feature vector. Using this feature vector as the input data and the memorability value of the video as the label, train a support vector regression model; this is implemented with the libSVM software package with parameters '-s 3 -t 2 -c 0.5 -p 0.04 -g 0.125', and the model is used to judge the memorability values of video data.
The video database with memorability values used here is the one obtained with the video memorability determination method based on functional magnetic resonance imaging of patent application No. 201310332613.0.
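The libSVM flags '-s 3 -t 2 -c 0.5 -p 0.04 -g 0.125' select epsilon-SVR with an RBF kernel, C = 0.5, epsilon = 0.04 and gamma = 0.125. A minimal sketch of equivalent training and prediction with scikit-learn's SVR, on stand-in data, is:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical training data: N videos, each a 247-dim low-level feature vector
# with an associated memorability value (replace with the real database).
rng = np.random.default_rng(0)
X_train = rng.random((213, 247))
y_train = rng.random(213)

# epsilon-SVR with an RBF kernel, mirroring '-s 3 -t 2 -c 0.5 -p 0.04 -g 0.125'.
model = SVR(kernel="rbf", C=0.5, epsilon=0.04, gamma=0.125)
model.fit(X_train, y_train)

# Step 3: predict the memorability value of a new video from its 247-dim feature vector.
x_new = rng.random((1, 247))                 # placeholder; replace with a real feature vector
predicted_memorability = model.predict(x_new)[0]
```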
3. Predict the memorability value of a video: for a video with unknown memorability value, first extract its Object Bank, saliency, color, motion and audio features, concatenate them into a 247-dimensional feature vector, and input this vector into the trained support vector regression (SVR) model to obtain the memorability value of the video.
4. Computation of the correlation coefficient: to verify the effect of the proposed method, the correlation coefficient between the predicted and true memorability values is computed with leave-one-out cross validation. For the 213 experimental videos, each video in turn is taken as the test sample, the remaining 212 videos are used as the training set to train the support vector regression (SVR) model, and the memorability value of the test sample is then predicted; this is repeated 213 times. The correlation coefficient between the predicted values of all 213 videos and their true values is computed as
ρ = Σ_{i=1}^{N} (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_{i=1}^{N} (X_i − X̄)² · Σ_{i=1}^{N} (Y_i − Ȳ)² )
where ρ is the correlation coefficient, N = 213 is the number of videos, X_i is the true memorability value of test video i, Y_i is its predicted memorability value, X̄ is the mean of the true memorability values of the 213 videos, and Ȳ is the mean of their predicted memorability values. Table 1 shows the experimental results obtained with the proposed method.
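A minimal sketch of the leave-one-out evaluation and the correlation coefficient ρ, again on stand-in data (the real feature matrix and memorability labels would replace the random arrays):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVR

def loo_correlation(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out predictions with the SVR model, then Pearson's rho
    between predicted and true memorability values."""
    predictions = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = SVR(kernel="rbf", C=0.5, epsilon=0.04, gamma=0.125)
        model.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    # rho = sum((X_i - mean)(Y_i - mean)) / sqrt(sum(X_i - mean)^2 * sum(Y_i - mean)^2)
    return float(np.corrcoef(y, predictions)[0, 1])

# Example with random stand-in data for the 213 videos.
rng = np.random.default_rng(0)
rho = loo_correlation(rng.random((213, 247)), rng.random(213))
```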
Table 1. Correlation coefficient of the experimental results

Claims (2)

1. A video memorability determination method based on low-level visual and auditory features of video data, characterized in that the steps are as follows:
Step 1: extract the visual and auditory features of the video data in the video database.
Step a: the video database contains N ∈ [100, 1000] videos in total; for each video, extract the first frame of every second as a key frame.
Step b: compute the Object Bank feature of each video. Using the Object Bank program package released by Li-Jia Li in 2010, down-sample each key frame to obtain 12 scale images of the input image, and convolve these 12 scale images with the 208 object templates in the Object Bank program, yielding 208 × 12 response images per key frame.
Using bilinear interpolation, bring the 12 response images of each template (one per scale) to the same size.
Take the per-pixel maximum over the 12 same-size response images to form a maximum response image; then take the mean pixel value of each maximum response image, giving a 208-dimensional feature vector per key frame.
Take the per-dimension maximum of the 208-dimensional features over all key frames of a video, giving a 208-dimensional feature vector per video.
Compute the mean and variance of the 208-dimensional vector of each video, giving a 2-dimensional feature; then find the value max of the largest component of the 208-dimensional vector of each video, and take the fraction of components whose value exceeds 1.5·max (out of the 208 dimensions) as the Object Bank simplicity feature.
Concatenate the 208-dimensional feature with the mean, variance and Object Bank simplicity feature to obtain the 211-dimensional Object Bank feature of the video.
Step c: compute the saliency feature of each video.
First extract the saliency image of each key frame of each video in the database and binarize it to obtain a binary image; compute the number of connected regions in each binary image and their areas; then
(1) compute the saliency entropy feature of the image, denoted SE, as
SE = -Σ_{k=1}^{N} (S_k / S) · ln(S_k / S)
where N is the number of connected regions in the binary image, S_k is the area of the k-th region, and S is the total area of all regions of the binary image;
(2) generate a normalized Gaussian template of the same size as the video frame, centered at the frame center, with a standard deviation fixed as a function of the frame size, and compute the average saliency intensity SI as
SI = (1 / S_frame) · Σ_{k=1}^{N} Σ_{(i,j)∈R_k} B_{i,j} · ω_Gaus(i,j)
where (x, y) is the size of the video frame, S_frame is the area of the key frame, N is the number of connected regions in the binary image, R_k is the k-th region, B_{i,j} is the saliency value of pixel (i, j) in the saliency image, and ω_Gaus(i,j) is the weight of the previously obtained Gaussian template at pixel (i, j).
This gives the two features SE and SI for each key frame; averaging each of them over all key frames of a video gives the 2-dimensional saliency feature vector of the video.
Step d: compute the color feature of each video.
(1) Convert each key frame from the RGB color space to the HSV color space; the mean of the V values of the key frame in HSV space is the brightness feature, and the mean of its S values is the saturation feature.
(2) Convert each key frame from the RGB color space to the HSL color space; the unbiased standard deviation of the L values of the key frame in HSL space is the contrast feature.
(3) Compute the colorfulness feature: in the RGB color space compute rg = R − G and yb = (R + G)/2 − B for each pixel of the key frame, then compute the means μ_rg and μ_yb and variances σ²_rg and σ²_yb of the rg and yb values, and
colorfulness = sqrt(σ_rg² + σ_yb²) + 0.3 · sqrt(μ_rg² + μ_yb²).
(4) Compute the simplicity feature: build the histogram of the key frame in RGB space by uniformly quantizing each of the three RGB channels into 16 bins, giving 4096 bins for the whole RGB space, and count the pixels belonging to each bin to obtain a 4096-dimensional histogram per key frame; find the maximum amplitude max in this histogram, and take the fraction of bins whose amplitude exceeds 0.01·max (out of the 4096 bins) as the simplicity feature.
Each key frame thus yields 5 values: brightness, saturation, contrast, colorfulness and simplicity. The mean and variance of each of these 5 values over all key frames of a video give 10 values in total, which are concatenated into the 10-dimensional color feature vector of the video.
Step e: compute the motion feature of each video. First sample each video uniformly at 5 frames per second, then apply standard block-based motion estimation to the sampled frames.
For N sampled frames this gives motion vectors for the macroblocks of N−1 pairs of adjacent frames. Compute the mean magnitude of the motion vectors of all macroblocks between each pair of adjacent frames, giving N−1 mean motion intensities; their mean and variance form the 2-dimensional motion feature of the video.
Step f: compute the audio feature of each video. First extract the audio signal of each video, then use the MIRtoolbox program released by Olivier Lartillot et al. to extract the 13-dimensional MFCC feature of the audio signal, together with the brightness, roughness, novelty, low-energy rate, root-mean-square energy, zero-crossing rate, roll-off, pitch estimation and Shannon entropy features; concatenate these 22 dimensions into one long vector as the audio feature of the video.
Step 2: model training.
Take a given video database with memorability values as training samples. Using the method of step 1, extract the Object Bank, saliency, color, motion and audio features of each training video and concatenate them end to end into the 247-dimensional low-level visual and auditory feature vector of the video; using the memorability value of each training video as its label, train a support vector regression model.
Step 3: predict the memorability value of a video.
For a video with unknown memorability value, extract its Object Bank, saliency, color, motion and audio features as in step 1, concatenate them end to end into a 247-dimensional low-level visual and auditory feature vector, and input it into the support vector regression model obtained in step 2 to obtain the memorability value of the unknown video.
2. The video memorability determination method based on low-level visual and auditory features of video data according to claim 1, characterized in that: the video database with memorability values used in step 2 is the one obtained with the video memorability determination method based on functional magnetic resonance imaging of patent application No. 201310332613.0.
CN201310418333.1A 2013-09-13 2013-09-13 Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data Active CN103500184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310418333.1A CN103500184B (en) 2013-09-13 2013-09-13 Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310418333.1A CN103500184B (en) 2013-09-13 2013-09-13 Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data

Publications (2)

Publication Number Publication Date
CN103500184A true CN103500184A (en) 2014-01-08
CN103500184B CN103500184B (en) 2017-05-24

Family

ID=49865395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310418333.1A Active CN103500184B (en) 2013-09-13 2013-09-13 Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data

Country Status (1)

Country Link
CN (1) CN103500184B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020130870A1 (en) 2018-12-21 2020-06-25 Акционерное общество "Нейротренд" Method for measuring the memorability of a multimedia message

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102142037A (en) * 2011-05-05 2011-08-03 西北工业大学 Video data search method based on functional magnetic resonance imaging
CN102855352A (en) * 2012-08-17 2013-01-02 西北工业大学 Method for clustering videos by using brain imaging space features and bottom layer vision features

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102142037A (en) * 2011-05-05 2011-08-03 西北工业大学 Video data search method based on functional magnetic resonance imaging
CN102855352A (en) * 2012-08-17 2013-01-02 西北工业大学 Method for clustering videos by using brain imaging space features and bottom layer vision features

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020130870A1 (en) 2018-12-21 2020-06-25 Акционерное общество "Нейротренд" Method for measuring the memorability of a multimedia message
CN111787860A (en) * 2018-12-21 2020-10-16 尼罗特兰德股份公司 Measuring method for storing multimedia message

Also Published As

Publication number Publication date
CN103500184B (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN105608456B (en) A kind of multi-direction Method for text detection based on full convolutional network
US20190138798A1 (en) Time domain action detecting methods and system, electronic devices, and computer storage medium
US20190172193A1 (en) Method and apparatus for evaluating image definition, computer device and storage medium
CN101819638B (en) Establishment method of pornographic detection model and pornographic detection method
CN109657600B (en) Video area removal tampering detection method and device
CN103810504A (en) Image processing method and device
CN110415260B (en) Smoke image segmentation and identification method based on dictionary and BP neural network
CN104092988A (en) Method, device and system for managing passenger flow in public place
CN103853724A (en) Multimedia data sorting method and device
CN106792005B (en) Content detection method based on audio and video combination
CN106055653A (en) Video synopsis object retrieval method based on image semantic annotation
CN105608457B (en) Gray Histogram square thresholding method
CN113239807B (en) Method and device for training bill identification model and bill identification
CN104239852A (en) Infrared pedestrian detecting method based on motion platform
CN115205890A (en) Method and system for re-identifying pedestrians of non-motor vehicles
US12020510B2 (en) Person authentication apparatus, control method, and non-transitory storage medium
Yanagisawa et al. Face detection for comic images with deformable part model
CN102609715B (en) Object type identification method combining plurality of interest point testers
CN116486296A (en) Target detection method, device and computer readable storage medium
CN104282019A (en) Blind image quality evaluation method based on natural scene statistics and perceived quality propagation
CN111833347A (en) Transmission line damper defect detection method and related device
CN112925905B (en) Method, device, electronic equipment and storage medium for extracting video subtitles
CN113901924A (en) Document table detection method and device
CN110121723B (en) Artificial neural network
CN103500184A (en) Video memorability judging method based on bottom visual sense and auditory sense characteristics of video data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant