CN110324657A - Model generation and video processing methods, apparatus, electronic device and storage medium - Google Patents

Model generation and video processing methods, apparatus, electronic device and storage medium

Info

Publication number
CN110324657A
CN110324657A
Authority
CN
China
Prior art keywords
video
processed
unit
sample
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910459442.5A
Other languages
Chinese (zh)
Inventor
贾少勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910459442.5A
Publication of CN110324657A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/23424 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The present invention provides a model generation method, a video processing method, an apparatus, an electronic device, and a storage medium. The model generation method includes: obtaining a training sample, where the training sample includes a sample video and annotation information of the sample video, the annotation information indicating whether the sample video belongs to the theme-song class; dividing the sample video into multiple unit sample videos; obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video; and training a preset initial model with the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, the trained model being determined as the video processing model. Detection is based on audio feature vectors and does not require the videos to belong to the same album, so videos of many types can share one video processing model, giving stronger adaptability.

Description

Model generation and video processing methods, apparatus, electronic device and storage medium
Technical field
The present invention relates to the field of Internet technologies, and in particular to a model generation method, a video processing method, an apparatus, an electronic device, and a storage medium.
Background art
Film and television works, carried on media such as film copies, tapes, and digital storage and intended for presentation on cinema and television screens, are an art form watched both visually and aurally; as a comprehensive form of modern art they include films, TV series, animation, and similar content. Such videos generally contain theme songs, including an opening theme and an ending theme. In practice there is a demand for detecting the theme songs in film and television videos: for example, to save the viewer's time, the theme songs can be detected during playback and skipped so that the main content plays directly.
In the prior art, the theme songs in film and television videos are usually detected by template matching. The specific practice is as follows: one template is generated for the videos belonging to the same album (for example, the same series), and the template contains the theme-song features of that album's videos. When detecting the theme songs in a video of the album, the album's template is matched against the video to be detected, and the segment of the video whose features match the theme-song features in the template is taken as the theme song.
However, because the theme-song features differ between the videos of different albums, each album needs its own template; for example, one template is generated per TV series. Moreover, this approach suits only works containing multiple episodes, where every episode can reuse the same template; for a film, which has no episodes, a separate template must be generated per film. Existing theme-song detection methods therefore adapt poorly and are unsuitable for massive and diverse film and television video libraries.
Summary of the invention
Embodiments of the present invention provide a model generation method, a video processing method, an apparatus, an electronic device, and a storage medium, to solve the problem that existing theme-song detection methods adapt poorly and are unsuitable for massive and diverse film and television video libraries.
In a first aspect, an embodiment of the present invention provides a model generation method, the method comprising:
obtaining a training sample, where the training sample includes a sample video and annotation information of the sample video, the annotation information indicating whether the sample video belongs to the theme-song class;
dividing the sample video into multiple unit sample videos;
obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video;
training a preset initial model with the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output; and
determining the trained model as the video processing model.
Optionally, obtaining the audio feature vector corresponding to a unit sample video comprises: generating the spectrogram corresponding to the audio signal in the unit sample video; inputting that spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
Optionally, generating the spectrogram corresponding to the audio signal in the unit sample video comprises: framing the audio signal in the unit sample video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is taken as the spectrogram corresponding to the audio signal in the unit sample video.
Optionally, training the preset initial model with the audio feature vectors corresponding to at least two unit sample videos as input and the annotation information of the sample video as the target output comprises: randomly selecting at least two consecutive unit sample videos, concatenating their audio feature vectors, and inputting the result into the initial model to obtain the predicted probability that the sample video belongs to the theme-song class; computing the loss value of the sample video from that predicted probability and the annotation information of the sample video; and determining that training is complete when the loss value is less than a set loss threshold.
In a second aspect, an embodiment of the present invention provides a video processing method, the method comprising:
obtaining a video to be processed;
extracting a head segment and a tail segment from the video to be processed;
dividing the head segment and the tail segment each into multiple unit videos to be processed;
obtaining, for each unit video to be processed, the audio feature vector corresponding to that unit video;
inputting the audio feature vectors corresponding to at least two consecutive unit videos, including the unit video to be processed, into a pre-generated video processing model, and determining from the model's output whether the unit video belongs to the theme-song class, wherein the video processing model is generated by any of the methods described above; and
splicing, among the unit videos that belong to the theme-song class, the consecutive unit videos, to obtain the opening-theme segment and ending-theme segment in the video to be processed.
Optionally, obtaining the audio feature vector corresponding to a unit video to be processed comprises: generating the spectrogram corresponding to the audio signal in the unit video; inputting that spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video.
Optionally, generating the spectrogram corresponding to the audio signal in the unit video to be processed comprises: framing the audio signal in the unit video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is taken as the spectrogram corresponding to the audio signal in the unit video.
Optionally, obtaining, for each unit video to be processed, the corresponding audio feature vector comprises: calling a preset first process and a preset second process simultaneously; obtaining, with the first process, the audio feature vector corresponding to each unit video divided from the head segment; and obtaining, with the second process, the audio feature vector corresponding to each unit video divided from the tail segment.
Optionally, determining from the output of the video processing model whether a unit video to be processed belongs to the theme-song class comprises: comparing whether the predicted probability, output by the video processing model, that the unit video belongs to the theme-song class is greater than or equal to a set probability threshold; and if so, determining that the unit video belongs to the theme-song class.
Optionally, splicing the consecutive unit videos among those belonging to the theme-song class, to obtain the opening-theme segment and ending-theme segment in the video to be processed, comprises: splicing the consecutive unit videos among the theme-song-class unit videos divided from the head segment, to obtain the opening-theme segment of the video to be processed; and splicing the consecutive unit videos among the theme-song-class unit videos divided from the tail segment, to obtain the ending-theme segment of the video to be processed.
Optionally, after dividing the head segment and the tail segment each into multiple unit videos to be processed, the method further comprises: marking the start time and end time of each unit video. After splicing the consecutive unit videos among those belonging to the theme-song class to obtain the opening-theme segment and ending-theme segment in the video to be processed, the method further comprises: taking the start time of the first unit video in the opening-theme segment as the start time of the opening-theme segment, and the end time of the last unit video in the opening-theme segment as its end time; and taking the start time of the first unit video in the ending-theme segment as the start time of the ending-theme segment, and the end time of the last unit video in the ending-theme segment as its end time.
In a third aspect, an embodiment of the present invention provides a model generation apparatus, the apparatus comprising:
a sample obtaining module, configured to obtain a training sample, where the training sample includes a sample video and annotation information of the sample video, the annotation information indicating whether the sample video belongs to the theme-song class;
a first division module, configured to divide the sample video into multiple unit sample videos;
a first vector obtaining module, configured to obtain, for each unit sample video, the audio feature vector corresponding to that unit sample video; and
a training module, configured to train a preset initial model with the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and to determine the trained model as the video processing model.
In a fourth aspect, an embodiment of the present invention provides a video processing apparatus, the apparatus comprising:
a video obtaining module, configured to obtain a video to be processed;
a segment extraction module, configured to extract a head segment and a tail segment from the video to be processed;
a second division module, configured to divide the head segment and the tail segment each into multiple unit videos to be processed;
a second vector obtaining module, configured to obtain, for each unit video to be processed, the audio feature vector corresponding to that unit video;
a class determination module, configured to input the audio feature vectors corresponding to at least two consecutive unit videos, including the unit video to be processed, into a pre-generated video processing model, and to determine from the model's output whether the unit video belongs to the theme-song class, wherein the video processing model is generated by the apparatus described above; and
a segment determination module, configured to splice, among the unit videos that belong to the theme-song class, the consecutive unit videos, to obtain the opening-theme segment and ending-theme segment in the video to be processed.
Optionally, the second vector obtaining module comprises: a calling unit, configured to call a preset first process and a preset second process simultaneously; a head obtaining unit, configured to obtain, with the first process, the audio feature vector corresponding to each unit video divided from the head segment; and a tail obtaining unit, configured to obtain, with the second process, the audio feature vector corresponding to each unit video divided from the tail segment.
Optionally, the segment determination module comprises: an opening-theme determination unit, configured to splice the consecutive unit videos among the theme-song-class unit videos divided from the head segment, to obtain the opening-theme segment of the video to be processed; and an ending-theme determination unit, configured to splice the consecutive unit videos among the theme-song-class unit videos divided from the tail segment, to obtain the ending-theme segment of the video to be processed.
Optionally, the apparatus further comprises: a marking module, configured to mark the start time and end time of each unit video after the second division module divides the head segment and the tail segment each into multiple unit videos; and a time determination module, configured to take the start time of the first unit video in the opening-theme segment as the start time of the opening-theme segment and the end time of the last unit video in the opening-theme segment as its end time, and to take the start time of the first unit video in the ending-theme segment as the start time of the ending-theme segment and the end time of the last unit video in the ending-theme segment as its end time.
In a fifth aspect, an embodiment of the present invention provides an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the model generation methods described above, and/or any of the video processing methods described above.
In a sixth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform any of the model generation methods described above, and/or any of the video processing methods described above.
In the embodiments of the present invention, a training sample is obtained, where the training sample includes a sample video and annotation information indicating whether the sample video belongs to the theme-song class; the sample video is divided into multiple unit sample videos; the audio feature vector corresponding to each unit sample video is obtained; and a preset initial model is trained with the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, the trained model being determined as the video processing model. Because the audio of a video's theme-song portions clearly differs from the audio of its main content, the embodiments use many sample videos that belong to the theme-song class and many that do not, and train, from the samples' audio feature vectors, a video processing model for detecting the theme songs in a video; the model can then detect theme-song segments from the audio feature vectors of a video to be detected. Detection is based on audio feature vectors and does not require the videos to belong to the same album, so videos of many types can share one video processing model, giving stronger adaptability.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a model generation method according to an embodiment of the present invention;
Fig. 2 is a flow chart of the steps of a video processing method according to an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of another video processing method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a video processing procedure according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a model generation apparatus according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a video processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a flow chart of the steps of a model generation method according to an embodiment of the present invention is shown.
The model generation method of this embodiment comprises the following steps:
Step 101: obtain a training sample.
When training the model, a large number of sample videos from film and television videos can first be collected from the Internet. The sample videos may include theme-song videos and non-theme-song videos: the theme-song videos may include opening-theme videos and ending-theme videos from film and television videos, and the non-theme-song videos may include speech videos, cheering videos, applause videos, and the like. Annotators label each sample video to obtain its annotation information, which indicates whether the sample video belongs to the theme-song class; for example, annotation "1" indicates that the sample video is of the theme-song class, and "0" that it is not. One sample video together with its annotation information forms one training sample, and a large number of such training samples form the training sample set. Each training sample is processed identically, so this embodiment mainly describes the processing of one training sample.
In this embodiment, sample diversity can be ensured by collecting sample videos from many different types of film and television videos, and sample balance by collecting equal numbers of theme-song and non-theme-song videos. For example: 2000 sample videos from TV-series videos, of which 1000 are theme-song videos and 1000 are not; 2000 sample videos from film videos, of which 1000 are theme-song videos and 1000 are not; and 2000 sample videos from animation videos, of which 1000 are theme-song videos and 1000 are not. These 6000 sample videos, with their annotation information, form the training sample set.
As for the specific duration of each sample video, persons skilled in the art may choose any suitable value based on practical experience, for example 3 s, 4 s, or 5 s.
Step 102: divide the sample video into multiple unit sample videos.
This embodiment trains a video processing model for detecting the theme-song segments in a video. Considering that a video's theme-song segments are consistent in their audio — that is, the audio inside a theme-song segment belongs to the theme-song class — the audio feature vector can determine whether something belongs to the theme-song class; the video processing model of this embodiment therefore detects the theme-song class mainly from audio feature vectors.
A sample video is divided into multiple unit sample videos for analysis.
In an optional implementation, the sample video can be divided into multiple unit sample videos in units of a set duration. Persons skilled in the art may choose any suitable value for the set duration based on practical experience; for example, if the audio feature vectors are obtained with a neural network model that processes 1 s of audio signal at a time, the set duration can be set to 1 s.
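As an illustrative sketch of this division step (a minimal Python example, under the assumption that the audio track has already been decoded into a sample array; the names are illustrative, not from this disclosure):

    import numpy as np

    def split_into_units(samples: np.ndarray, sample_rate: int, unit_seconds: float = 1.0):
        """Cut an audio track into consecutive fixed-length units (e.g. 1 s each)."""
        unit_len = int(sample_rate * unit_seconds)
        n_units = len(samples) // unit_len
        return [samples[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]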
Step 103: for each unit sample video, obtain the audio feature vector corresponding to that unit sample video.
The audio feature vector corresponding to each unit sample video is obtained separately.
For example, sample video A has a duration of 5 s and is divided, in units of 1 s, into unit sample videos 1 to 5. The audio feature vectors corresponding to unit sample videos 1, 2, 3, 4, and 5 are then obtained respectively.
In an optional implementation, obtaining the audio feature vector corresponding to one unit sample video may include steps A1 and A2.
Step A1: generate the spectrogram corresponding to the audio signal in the unit sample video.
Step A1 may further include steps A11 to A13:
Step A11: frame the audio signal in the unit sample video to obtain multiple audio signal frames.
The audio signal is extracted from the unit sample video and framed.
An audio signal is non-stationary macroscopically but stationary microscopically: it has short-term stationarity (over roughly 10-30 ms the signal can be regarded as approximately unchanged), so it can be cut into short sections for processing. This is framing, and each short section after framing is called an audio signal frame. For example, an overlapping framing method can be used: instead of cutting sections back-to-back, successive sections overlap by a part. The overlap between one frame and the next is called the frame shift, and the ratio of frame shift to frame length is generally 0 to 0.5. The specific frame length can be set according to the actual situation, and the number of frames per second can be set to 33 to 100.
Step A12: apply windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video.
Over long spans audio keeps changing and, lacking fixed characteristics, cannot be processed directly, so each audio signal frame is windowed: the frame is multiplied by a window function. Windowing eliminates the discontinuities that may arise at both ends of a frame and makes the whole more continuous; its cost is that the two ends of a frame are attenuated, which is why frames must overlap during framing. In practice, common window functions for audio frames include the rectangular window, the Hamming window, and the Hanning window; given its frequency-domain characteristics, the Hamming window is preferable.
Because the characteristics of an audio signal are generally hard to see from its time-domain form, the signal is usually transformed so that its energy distribution in the frequency domain can be observed; different energy distributions represent the characteristics of different speech. So after windowing, a Fourier transform is applied to each windowed audio signal frame to obtain the energy distribution on the spectrum, giving the spectrum of each frame and thereby the initial spectrogram corresponding to the audio signal in the unit sample video.
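Steps A11 and A12 together amount to a short-time Fourier analysis. A minimal NumPy sketch, assuming 16 kHz audio, a 25 ms frame length, and a 10 ms frame shift (illustrative values; the text above only constrains the shift-to-length ratio and the frames per second):

    import numpy as np

    def initial_spectrogram(samples, frame_len=400, frame_shift=160):
        """Frame with overlap (step A11), apply a Hamming window and take the
        magnitude of the Fourier transform of each frame (step A12)."""
        n_frames = 1 + (len(samples) - frame_len) // frame_shift
        window = np.hamming(frame_len)
        frames = np.stack([samples[i * frame_shift:i * frame_shift + frame_len] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)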
Step A13: apply a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and take the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
The initial spectrogram is often rather large; to obtain audio features of a suitable size, the initial spectrogram can be passed through a Mel filter bank and transformed into a Mel spectrogram.
The unit of frequency is hertz (Hz), and the human ear hears frequencies from 20 to 20000 Hz, but its perception of this scale is not linear. For example, after adapting to a 1000 Hz tone, if the pitch is raised to 2000 Hz the ear perceives only a slight increase in frequency, not the doubling that actually occurred. Ordinary frequency is converted to Mel frequency by the following mapping:
mel(f) = 2595 * log10(1 + f / 700)
where f is the ordinary frequency and mel(f) is the Mel frequency.
Under this formula the ear's perception of frequency becomes linear: if the Mel frequencies of two pieces of audio differ by a factor of two, the pitch the ear perceives also differs by roughly a factor of two.
According to the sensitivity of the human ear, the frequency range is divided among multiple Mel filters to form a Mel filter bank, which may contain 20 to 40 Mel filters. On the Mel scale the centre frequencies of the filters are spaced at equal, linear intervals, but on the ordinary frequency scale the intervals are not equal. The initial spectrogram is filtered with the Mel filter bank to obtain the Mel spectrogram, which is determined as the spectrogram corresponding to the audio signal in the unit sample video.
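A sketch of step A13 using librosa's Mel filter bank (the parameter values are illustrative assumptions; passing htk=True selects the mel(f) = 2595 * log10(1 + f / 700) mapping given above):

    import librosa

    def mel_spectrogram(samples, sample_rate=16000, n_mels=32):
        """Filter the spectrum with a Mel filter bank (20-40 filters; 32 here)."""
        return librosa.feature.melspectrogram(
            y=samples, sr=sample_rate, n_fft=400, hop_length=160,
            n_mels=n_mels, htk=True)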
Step A2: input the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In this embodiment a neural network model can be used: the spectrogram corresponding to the audio signal in the unit sample video is input into the neural network model, features are extracted inside the model, and the model outputs an audio feature vector, which is the audio feature vector corresponding to the unit sample video.
In an optional implementation, audio feature vectors can be extracted with the VGGish (VGG, Visual Geometry Group) model built on the open-source TensorFlow deep learning framework. The VGGish model may include convolutional layers, fully connected layers, and so on: the convolutional layers extract features, and the fully connected layers classify the extracted features into a corresponding feature vector. The spectrogram corresponding to the audio signal in the unit sample video is therefore input into the VGGish model; the convolutional layers extract the audio features in the spectrogram and feed them to the fully connected layers, which classify them and output a 128-dimensional audio feature vector.
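A sketch of step A2, assuming the reference VGGish release from the TensorFlow models repository (the vggish_input, vggish_slim, and vggish_params modules and the vggish_model.ckpt checkpoint ship with that release; this code is not part of the original disclosure):

    import tensorflow.compat.v1 as tf
    import vggish_input, vggish_params, vggish_slim

    def unit_embedding(wav_path):
        """Return the 128-dimensional VGGish embedding(s) for one ~1 s unit."""
        examples = vggish_input.wavfile_to_examples(wav_path)  # log-Mel patches
        with tf.Graph().as_default(), tf.Session() as sess:
            vggish_slim.define_vggish_slim(training=False)
            vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
            features = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
            embedding = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
            [vec] = sess.run([embedding], feed_dict={features: examples})
        return vec  # shape: (n_patches, 128), one patch per ~1 s of audio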
In this embodiment, the audio feature vector corresponding to each unit sample video can be saved in TFRecord format. TFRecord data are stored in a binary format, occupy less disk space, and are faster to read.
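A minimal sketch of writing the per-unit vectors to a TFRecord file (the field names are illustrative assumptions):

    import tensorflow as tf

    def save_unit_vectors(path, unit_vectors, label):
        """Write one tf.train.Example per unit: its 128-d vector plus the label."""
        with tf.io.TFRecordWriter(path) as writer:
            for vec in unit_vectors:
                example = tf.train.Example(features=tf.train.Features(feature={
                    'audio_embedding': tf.train.Feature(
                        float_list=tf.train.FloatList(value=list(vec))),
                    'label': tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[label])),
                }))
                writer.write(example.SerializeToString())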
Step 104: train a preset initial model with the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and determine the trained model as the video processing model.
If a single unit sample video's feature vector were used to represent a whole sample video during training, then, because one unit sample video is short, its feature vector might not represent the whole sample video accurately and completely. This embodiment therefore represents one sample video with the audio feature vectors of at least two consecutive unit sample videos for training.
For one sample video, the audio feature vectors corresponding to at least two consecutive unit sample videos divided from it are taken as input, and the annotation information of the sample video as the target output, to train the preset initial model.
Training the preset initial model may include steps B1 to B3:
Step B1: randomly select at least two consecutive unit sample videos, concatenate their audio feature vectors, input the result into the initial model, and obtain the predicted probability that the sample video belongs to the theme-song class.
The initial model is an as-yet-untrained model with a classification function. It can analyse the input audio feature vectors and output the predicted probability that the sample video belongs to the theme-song class, but its predictions are usually inaccurate, so it must be trained to obtain an accurate video processing model.
From the unit sample videos divided from the sample video, at least two consecutive unit sample videos are randomly selected; their audio feature vectors are concatenated and input into the initial model, which outputs the predicted probability that the sample video belongs to the theme-song class.
For example, sample video A is divided, in units of 1 s, into unit sample videos 1 to 5. Three consecutive unit sample videos are randomly selected from the five; each corresponds to a 128-dimensional audio feature vector, so the three vectors are concatenated into a 128*3 = 384-dimensional audio feature vector and input into the initial model, which outputs the predicted probability that sample video A belongs to the theme-song class.
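A sketch of the concatenation together with one possible initial model — a small fully connected classifier, an assumption consistent with the FCs model named in Fig. 4 rather than a structure this disclosure fixes:

    import numpy as np
    import tensorflow as tf

    unit_vecs = [np.random.rand(128).astype('float32') for _ in range(5)]  # stand-ins

    # Concatenate 3 consecutive 128-d unit vectors into one 384-d input.
    x = np.concatenate(unit_vecs[1:4])  # shape: (384,)

    initial_model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(384,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # P(theme-song class)
    ])
    p = float(initial_model(x[None, :])[0, 0])  # predicted probability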
Step B2: compute the loss value corresponding to the sample video from the predicted probability that the sample video belongs to the theme-song class and the annotation information of the sample video.
The predicted probability that the sample video belongs to the theme-song class is the actual output of the initial model, and the annotation information of the sample video is the target output; the loss value corresponding to the selected sample is computed from the actual output and the target output. The loss value indicates how far the predicted probability deviates from the annotation information of the sample video.
In an optional implementation, the difference between the annotation information of the sample video and the predicted probability that it belongs to the theme-song class can be taken as the loss value. For example, if the predicted probability is 0.8 and the annotation information is 1, the loss value can be 0.2.
Step B3: when the loss value is less than a set loss threshold, determine that training is complete.
The smaller the loss value, the more robust the model. This embodiment presets a loss threshold for judging whether training is complete. If the loss value is less than the set loss threshold, the predicted probability deviates little from the annotation information and training can be considered complete; if the loss value is greater than or equal to the threshold, the deviation is large, in which case the model parameters can be adjusted and training continues with the next training sample.
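A sketch of steps B1-B3 as a training loop over the sample set, under the assumptions above (absolute-difference loss, a 0.2 threshold; the optimizer is an illustrative choice):

    import tensorflow as tf

    def train_until_threshold(model, training_samples, loss_threshold=0.2):
        """Iterate samples; stop once |annotation - prediction| < threshold."""
        optimizer = tf.keras.optimizers.Adam(1e-3)
        for x, y in training_samples:          # x: (384,) vector, y: 0.0 or 1.0
            with tf.GradientTape() as tape:
                p = model(x[None, :], training=True)[0, 0]
                loss = tf.abs(y - p)           # loss value as in step B2
            if float(loss) < loss_threshold:   # step B3: training is complete
                return model
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return model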
Persons skilled in the art may choose any suitable value for the set loss threshold based on practical experience, for example 0.1, 0.2, or 0.3.
The trained model can serve as the video processing model and subsequently be used to detect the theme-song segments in videos.
In addition, in this embodiment a test sample set can be obtained alongside the training sample set; it is similar to the training sample set, a test sample comprising a test video and its annotation information. After the video processing model is trained, it is tested with the test sample set. Testing may include: dividing the test video into multiple unit test videos; obtaining the audio feature vector corresponding to each unit test video; inputting the audio feature vectors corresponding to at least two consecutive unit test videos into the video processing model; and comparing the predicted probability, output by the model, that the test video belongs to the theme-song class with the annotation information of the test video, so as to test whether the video processing model is accurate.
Considering that the audio of a video's theme-song portions differs from the audio of its main content, this embodiment uses many sample videos that belong to the theme-song class and many that do not, and trains, from the samples' audio feature vectors, a video processing model for detecting the theme songs of a video; the model can then detect theme-song segments from the audio feature vectors of a video to be detected. Detection is based on audio feature vectors and does not require the videos to belong to the same album, so videos of many types can share one video processing model, giving stronger adaptability.
Referring to Fig. 2, a flow chart of the steps of a video processing method according to an embodiment of the present invention is shown.
The video processing method of this embodiment comprises the following steps:
Step 201: obtain a video to be processed.
A video to be processed is a film or television video for which theme-song detection is required. For example, for a TV series, the theme-song segments can be detected when each episode is played, and skipped so that the main content plays directly and the viewer's time is saved; the video of each episode can therefore serve as one video to be processed.
Step 202: extract a head segment and a tail segment from the video to be processed.
The theme songs include the opening theme and the ending theme; the opening theme lies at the beginning of the video to be processed and the ending theme at its end. To save processing time, a head segment and a tail segment can therefore be extracted from the video to be processed, and only the head segment containing the opening theme and the tail segment containing the ending theme are detected.
In an optional implementation, a segment amounting to a set percentage can be extracted from the beginning of the video to be processed as its head segment, and a segment amounting to a set percentage from its end as its tail segment. Persons skilled in the art may set any suitable value for the set percentage according to the actual situation, for example 10%, 15%, or 20%.
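A minimal sketch of the extraction by percentage (10% is one of the example values above):

    def head_tail_ranges(duration_seconds: float, percentage: float = 0.10):
        """Return the (start, end) time ranges of the head and tail segments."""
        head = (0.0, duration_seconds * percentage)
        tail = (duration_seconds * (1.0 - percentage), duration_seconds)
        return head, tail

    # e.g. a 45-minute episode: scan 0-270 s for the opening theme
    # and 2430-2700 s for the ending theme
    head, tail = head_tail_ranges(45 * 60)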
Step 203: divide the head segment and the tail segment each into multiple unit videos to be processed.
As in step 102 above, based on the audio consistency of the theme-song segments in the video to be processed, the audio feature vector can determine whether something belongs to the theme-song class.
The head segment and tail segment of a video to be processed are divided into multiple unit videos for analysis; for example, each can be divided into unit videos in units of a set duration. The set duration in step 203 can equal the set duration in step 102 above.
Step 204: for each unit video to be processed, obtain the audio feature vector corresponding to that unit video.
Obtaining the audio feature vector corresponding to a unit video to be processed may include: generating the spectrogram corresponding to the audio signal in the unit video; inputting that spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video.
Generating the spectrogram corresponding to the audio signal in a unit video to be processed may include: framing the audio signal in the unit video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is taken as the spectrogram corresponding to the audio signal in the unit video.
Step 204 is similar to step 103 above; see the description of step 103 for details, which are not repeated here.
For example, the video to be processed is divided, in units of 1 s, into unit videos 1, 2, 3, and so on, and the audio feature vector corresponding to each unit video is obtained.
Step 205: input the audio feature vectors corresponding to at least two consecutive unit videos, including the unit video to be processed, into the pre-generated video processing model, and determine from the model's output whether the unit video belongs to the theme-song class.
If the feature vector of a single unit video were used directly to detect whether that unit video belongs to the theme-song class, then, because one unit video is short, its feature vector might not determine accurately whether the unit video really belongs to the theme-song class. This embodiment therefore determines whether a unit video belongs to the theme-song class using the audio feature vectors of at least two consecutive unit videos that include it.
For one unit video, the audio feature vectors corresponding to at least two consecutive unit videos including it are input into the video processing model generated in the embodiment of Fig. 1 above. The video processing model analyses the audio feature vectors and outputs the predicted probability that the unit video belongs to the theme-song class. After the model's output is obtained, the predicted probability is compared with a set probability threshold; if it is greater than or equal to the threshold, the unit video is determined to belong to the theme-song class.
Persons skilled in the art may choose any suitable value for the set probability threshold based on practical experience, for example 0.7, 0.8, or 0.9.
For example, for unit video 3, the three consecutive unit videos including unit video 3 can be unit videos 1, 2, 3; or unit videos 2, 3, 4; or unit videos 3, 4, 5. Of these, the scheme using unit videos 2, 3, 4 considers both the audio feature vector before unit video 3 and the one after it, so using the audio feature vectors of the three consecutive unit videos 2, 3, 4 yields a more accurate result for unit video 3 than the other two schemes.
Taking the three consecutive unit videos 2, 3, 4 including unit video 3 as an example: the 128-dimensional audio feature vectors corresponding to unit videos 2, 3, and 4 are concatenated into a 128*3 = 384-dimensional audio feature vector and input into the video processing model, which outputs the predicted probability that unit video 3 belongs to the theme-song class; if that probability is greater than the set probability threshold, unit video 3 is determined to belong to the theme-song class.
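A sketch of step 205 over all units of a segment, preferring the centred three-unit window discussed above and clamping it at the segment boundaries (the clamping rule is an illustrative assumption; at least three units are assumed):

    import numpy as np

    def classify_units(video_model, unit_vecs, probability_threshold=0.7):
        """Feed the 3-unit window centred on each unit (clamped at the edges)
        to the video processing model and threshold the predicted probability."""
        flags = []
        for i in range(len(unit_vecs)):
            start = min(max(i - 1, 0), len(unit_vecs) - 3)
            x = np.concatenate(unit_vecs[start:start + 3])[None, :]
            p = float(video_model(x)[0, 0])
            flags.append(p >= probability_threshold)
        return flags  # True = the unit belongs to the theme-song class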
Step 206: splice, among the unit videos that belong to the theme-song class, the consecutive unit videos, to obtain the opening-theme segment and ending-theme segment in the video to be processed.
After it has been determined whether each unit video belongs to the theme-song class, a unit video that belongs to the theme-song class can be determined to be part of a theme-song segment, and one that does not, to be part of a non-theme-song segment. So if multiple consecutive unit videos belong to the theme-song class, those consecutive theme-song-class unit videos are spliced to obtain the opening-theme segment and ending-theme segment in the video to be processed.
Normally the theme songs of a video to be processed may include an opening theme and an ending theme, so an opening-theme segment and an ending-theme segment can be determined from the video: the consecutive unit videos among the theme-song-class unit videos divided from the head segment are spliced to obtain the opening-theme segment in the video to be processed, and the consecutive unit videos among the theme-song-class unit videos divided from the tail segment are spliced to obtain the ending-theme segment.
After the video to be processed is divided into multiple unit videos, the start time and end time corresponding to each unit video can also be marked. Then, after the consecutive theme-song-class unit videos are spliced to obtain the opening-theme segment and ending-theme segment in the video to be processed, the start time of the first unit video in the opening-theme segment can be taken as the start time of the opening-theme segment and the end time of its last unit video as its end time; likewise, the start time of the first unit video in the ending-theme segment is taken as the start time of the ending-theme segment and the end time of its last unit video as its end time.
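A sketch of step 206 together with the time-stamping just described, assuming 1 s units so that a unit's index fixes its start and end times (offset is the segment's own start time within the full video):

    def merge_theme_units(flags, unit_seconds=1.0, offset=0.0):
        """Splice runs of consecutive theme-song-class units into (start, end) spans."""
        spans, run_start = [], None
        for i, is_theme in enumerate(flags):
            if is_theme and run_start is None:
                run_start = i
            elif not is_theme and run_start is not None:
                spans.append((offset + run_start * unit_seconds,
                              offset + i * unit_seconds))
                run_start = None
        if run_start is not None:
            spans.append((offset + run_start * unit_seconds,
                          offset + len(flags) * unit_seconds))
        return spans  # e.g. [(0.0, 92.0)] -> opening theme from 0 s to 92 s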
In this embodiment, based on the audio consistency of the theme-song segments in a video, the video processing model detects theme-song segments from audio feature vectors; the detection results are more accurate, and the model's adaptability is stronger.
Referring to Fig. 3, a flowchart of the steps of another video processing method according to an embodiment of the present invention is shown.
The video processing method of the embodiment of the present invention includes the following steps:
Step 301: a video to be processed is obtained.
Step 302: a head segment and a tail segment are extracted from the video to be processed.
Step 303: the head segment and the tail segment are each divided into multiple unit videos to be processed.
Fig. 4 is a schematic diagram of a video processing procedure according to an embodiment of the present invention. The long video in Fig. 4 is the video to be processed; the long video is divided to obtain multiple unit videos to be processed.
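A minimal Python sketch of this division step follows; it computes the start and end time of each unit video within a head or tail segment (the unit length of 2.0 seconds is an illustrative assumption; this embodiment does not fix a specific duration):

def split_into_units(segment_start, segment_end, unit_length=2.0):
    """Divide a head or tail segment into fixed-length unit videos,
    recording each unit's start time and end time."""
    units, t = [], segment_start
    while t < segment_end:
        units.append((t, min(t + unit_length, segment_end)))
        t += unit_length
    return units

# e.g. a 7-second head segment:
print(split_into_units(0.0, 7.0))
# -> [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 7.0)]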
Step 304: a preset first process and a preset second process are called simultaneously.
In the embodiment of the present invention, if a single process were used to handle the multiple unit videos to be processed divided from the head segment and from the tail segment, the processing efficiency would be low. Therefore, a first process and a second process can be set up and called simultaneously, so that the multiple unit videos to be processed divided from the head segment and the multiple unit videos to be processed divided from the tail segment are handled respectively, thereby improving the processing efficiency. The first process and the second process can be stored in a process pool.
In Fig. 4, the process pool includes a first process process1 and a second process process2.
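The two-process arrangement can be sketched in Python with the standard multiprocessing module (the helper name and the placeholder unit lists are assumptions for illustration; the per-process pipeline is the one described in steps 305-307 and 308-310 below):

from multiprocessing import Pool

def detect_theme_units(unit_videos):
    # Placeholder for the per-segment pipeline: feature extraction,
    # classification by the video processing model, and splicing.
    ...

if __name__ == "__main__":
    head_units, tail_units = [], []  # unit videos from step 303 (placeholders)
    with Pool(processes=2) as pool:  # process1 handles the head, process2 the tail
        head_result = pool.apply_async(detect_theme_units, (head_units,))
        tail_result = pool.apply_async(detect_theme_units, (tail_units,))
        opening_theme = head_result.get()
        ending_theme = tail_result.get()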
Step 305: for each unit video to be processed divided from the head segment, the first process is used to obtain the audio feature vector corresponding to the unit video to be processed.
In the first process, for each unit video to be processed divided from the head segment, the audio feature vector corresponding to the unit video to be processed is obtained.
The first process contains a neural network model, which may specifically be the Audio VGGish model in Fig. 4. The unit videos to be processed divided from the head segment are input into the Audio VGGish model in the first process, and Audio VGGish is used to obtain the 128-dimensional audio feature vector corresponding to each unit video to be processed.
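For illustration, feature extraction with a VGGish model might look like the following Python sketch (this assumes the publicly available harritaylor/torchvggish port of VGGish and a placeholder audio file name; this embodiment names "Audio VGGish" but not a specific implementation):

import torch

# Load a public VGGish port via torch.hub (an assumption for illustration).
vggish = torch.hub.load('harritaylor/torchvggish', 'vggish')
vggish.eval()

with torch.no_grad():
    # The model returns one 128-dim embedding per ~0.96 s of audio; for a unit
    # video these can be averaged into a single 128-dim audio feature vector.
    embeddings = vggish.forward('unit_video_audio.wav')  # placeholder file name
    feature_vector = embeddings.mean(dim=0)  # shape (128,)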
Step 305 is similar to step 204 above; for details, refer to the related description of step 204, which the embodiment of the present invention will not discuss again in detail.
Step 306: the first process is used to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into the pre-generated video processing model, and whether the unit video to be processed belongs to the theme song category is determined according to the output of the video processing model.
In the first process, the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, are input into the pre-trained video processing model. In Fig. 4, the first process contains a video processing model, which may specifically be a stack of fully connected layers (FCs).
The video processing model outputs the prediction probability that the unit video to be processed belongs to the theme song category, and when the prediction probability is greater than a set probability threshold, it is determined that the unit video to be processed belongs to the theme song category. In Fig. 4, the confidence represents the prediction probability, and a unit video is determined to belong to the theme song category when the confidence is greater than or equal to 0.7.
Step 307: the first process is used to splice the consecutive unit videos to be processed among those belonging to the theme song category, obtaining the opening theme segment in the video to be processed.
In the first process, the consecutive unit videos belonging to the theme song category are spliced to obtain the opening theme segment in the video to be processed, such as the opening theme obtained as the detection result in Fig. 4.
Step 308: for each unit video to be processed divided from the tail segment, the second process is used to obtain the audio feature vector corresponding to the unit video to be processed.
In the second process, for each unit video to be processed divided from the tail segment, the audio feature vector corresponding to the unit video to be processed is obtained.
The second process contains a neural network model, which may specifically be the Audio VGGish model in Fig. 4. The unit videos to be processed divided from the tail segment are input into the Audio VGGish model in the second process, and Audio VGGish is used to obtain the audio feature vector corresponding to each unit video to be processed.
Step 308 is similar to step 204 above; for details, refer to the related description of step 204, which the embodiment of the present invention will not discuss again in detail.
Step 309: the second process is used to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into the pre-generated video processing model, and whether the unit video to be processed belongs to the theme song category is determined according to the output of the video processing model.
In the second process, the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, are input into the pre-trained video processing model. In Fig. 4, the second process contains a video processing model, which may specifically be a stack of fully connected layers (FCs).
The video processing model outputs the prediction probability that the unit video to be processed belongs to the theme song category, and when the prediction probability is greater than a set probability threshold, it is determined that the unit video to be processed belongs to the theme song category. In Fig. 4, the confidence represents the prediction probability, and a unit video is determined to belong to the theme song category when the confidence is greater than or equal to 0.7.
Step 310: the second process is used to splice the consecutive unit videos to be processed among those belonging to the theme song category, obtaining the ending theme segment in the video to be processed.
In the second process, the consecutive unit videos belonging to the theme song category are spliced to obtain the ending theme segment in the video to be processed, such as the ending theme obtained as the detection result in Fig. 4.
In the embodiment of the present invention, the use of the process pool technique greatly improves the processing efficiency.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of a model generating apparatus according to an embodiment of the present invention is shown.
The model generating apparatus of the embodiment of the present invention includes a sample acquisition module 501, a first division module 502, a first vector acquisition module 503 and a training module 504.
The sample acquisition module 501 is used to obtain a training sample. The training sample includes a sample video and annotation information of the sample video; the annotation information is used to indicate whether the sample video belongs to the theme song category.
The first division module 502 is used to divide the sample video into multiple unit sample videos.
The first vector acquisition module 503 is used to obtain, for each unit sample video, the audio feature vector corresponding to the unit sample video.
The training module 504 is used to take the audio feature vectors corresponding to at least two consecutive unit sample videos as the input and the annotation information of the sample video as the target of the output, train a preset initial model, and determine the trained model as the video processing model.
In an optional embodiment, the first vector acquisition module 503 includes: a first generation unit for generating the spectrogram corresponding to the audio signal in the unit sample video; and a first determination unit for inputting the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In an optional embodiment, the first generation unit includes: a first framing subunit for framing the audio signal in the unit sample video to obtain multiple audio signal frames; a first processing subunit for applying windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and a first transform subunit for applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, the Mel spectrogram being used as the spectrogram corresponding to the audio signal in the unit sample video.
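The framing, windowing, Fourier transform and Mel transform chain can be sketched in Python with librosa (the sample rate, FFT size, hop length and Mel band count are illustrative assumptions; this embodiment does not fix their values):

import librosa

def unit_audio_to_mel(path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    """Framing, windowing and FFT followed by a Mel transform."""
    y, _ = librosa.load(path, sr=sr)
    # librosa frames the signal, applies a Hann window and a short-time
    # Fourier transform, then maps the power spectrum onto the Mel scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log-Mel spectrogram, shape (n_mels, frames)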
In an optional embodiment, the training module 504 includes: a probability acquisition unit for randomly selecting at least two consecutive unit sample videos, splicing the audio feature vectors corresponding to the selected unit sample videos and inputting the result into the initial model, obtaining the prediction probability that the sample video belongs to the theme song category; a loss acquisition unit for calculating the loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the theme song category and the annotation information of the sample video; and a training detection unit for determining that training is complete when the loss value is less than a set loss threshold.
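A minimal training step consistent with this description might look as follows in Python (the fully connected layer sizes, learning rate and loss threshold value are illustrative assumptions; only the 384-dimensional spliced input, the probability output and the loss-threshold stopping rule come from the description):

import torch
import torch.nn as nn

# A small fully connected classifier ("FCs") over the spliced 384-dim input.
model = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
loss_threshold = 0.05  # the "set loss threshold" (value assumed)

def training_step(batch_features, batch_labels):
    """batch_features: (B, 384) spliced vectors of consecutive unit sample videos;
    batch_labels: (B, 1) annotation information (1 = theme song category)."""
    optimizer.zero_grad()
    pred = model(batch_features)        # prediction probability
    loss = loss_fn(pred, batch_labels)  # loss between prediction and annotation
    loss.backward()
    optimizer.step()
    return loss.item() < loss_threshold  # True once training is deemed complete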
Referring to Fig. 6, a structural block diagram of a video processing apparatus according to an embodiment of the present invention is shown.
The video processing apparatus of the embodiment of the present invention includes a video acquisition module 601, a segment extraction module 602, a second division module 603, a second vector acquisition module 604, a category determination module 605 and a segment determination module 606.
The video acquisition module 601 is used to obtain a video to be processed.
The segment extraction module 602 is used to extract a head segment and a tail segment from the video to be processed.
The second division module 603 is used to divide the head segment and the tail segment each into multiple unit videos to be processed.
The second vector acquisition module 604 is used to obtain, for each unit video to be processed, the audio feature vector corresponding to the unit video to be processed.
The category determination module 605 is used to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into the pre-generated video processing model, and to determine whether the unit video to be processed belongs to the theme song category according to the output of the video processing model. The video processing model is generated using the model generating apparatus shown in Fig. 5.
The segment determination module 606 is used to splice the consecutive unit videos to be processed among those belonging to the theme song category, obtaining the opening theme segment and the ending theme segment in the video to be processed.
In an optional embodiment, the second vector acquisition module 604 includes: a second generation unit for generating the spectrogram corresponding to the audio signal in the unit video to be processed; and a second determination unit for inputting the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
In an optional embodiment, the second generation unit includes: a second framing subunit for framing the audio signal in the unit video to be processed to obtain multiple audio signal frames; a second processing subunit for applying windowing and a Fourier transform to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video to be processed; and a second transform subunit for applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, the Mel spectrogram being used as the spectrogram corresponding to the audio signal in the unit video to be processed.
In an optional embodiment, the second vector acquisition module 604 includes: a call unit for calling a preset first process and a preset second process simultaneously; a head acquisition unit for obtaining, for each unit video to be processed divided from the head segment, the audio feature vector corresponding to the unit video to be processed using the first process; and a tail acquisition unit for obtaining, for each unit video to be processed divided from the tail segment, the audio feature vector corresponding to the unit video to be processed using the second process.
In an optional embodiment, the category determination module 605 includes: a comparison unit for comparing whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the theme song category is greater than or equal to a set probability threshold; and a result determination unit for determining that the unit video to be processed belongs to the theme song category when the prediction probability is greater than or equal to the threshold.
In an optional embodiment, the segment determination module 606 includes: an opening theme determination unit for splicing, among the unit videos to be processed divided from the head segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the opening theme segment in the video to be processed; and an ending theme determination unit for splicing, among the unit videos to be processed divided from the tail segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the ending theme segment in the video to be processed.
In an optional embodiment, the apparatus further includes: a marking module for marking the start time and end time of each unit video to be processed after the second division module divides the head segment and the tail segment each into multiple unit videos to be processed; and a time determination module for taking the start time of the first unit video to be processed in the opening theme segment as the start time of the opening theme segment, the end time of the last unit video to be processed in the opening theme segment as the end time of the opening theme segment, the start time of the first unit video to be processed in the ending theme segment as the start time of the ending theme segment, and the end time of the last unit video to be processed in the ending theme segment as the end time of the ending theme segment.
In the embodiments of the present invention, in view of the difference between the audio of the theme song portions of a video and the audio of its feature content portions, multiple sample videos belonging to the theme song category and multiple sample videos not belonging to the theme song category are used to train, from the audio feature vectors corresponding to the sample videos, a video processing model for detecting theme songs in videos; the video processing model can then be used to detect theme song segments in a video to be detected from the corresponding audio feature vectors. Because the detection is based on audio feature vectors, it does not require the videos to belong to the same album, and videos of multiple types can share the same video processing model, so the adaptability is stronger.
As for the apparatus embodiments, since they are basically similar to the method embodiments, the description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
In an embodiment of the present invention, an electronic device is also provided. For example, the electronic device may be provided as a server. The electronic device may include one or more processors and a memory for storing processor-executable instructions, such as an application program. The processor is configured to execute the above model generating method and/or video processing method.
In an embodiment of the present invention, a non-transitory computer-readable storage medium including instructions is also provided, for example a memory including instructions; the instructions can be executed by the processor of the electronic device to complete the above model generating method and/or video processing method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts between the embodiments can be referred to each other.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus or a computer program product. Therefore, the embodiments of the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of the present invention are described with reference to the flowcharts and/or block diagrams of the methods, terminal devices (systems) and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, once those skilled in the art learn of the basic creative concept, additional changes and modifications can be made to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or terminal device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or terminal device including that element.
The model generation and video processing methods, apparatuses, electronic device and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention; the above description of the embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the contents of this description should not be construed as limiting the present invention.

Claims (18)

1. A model generating method, characterized in that the method comprises:
obtaining a training sample, wherein the training sample comprises a sample video and annotation information of the sample video, and the annotation information is used to indicate whether the sample video belongs to a theme song category;
dividing the sample video into multiple unit sample videos;
for each unit sample video, obtaining an audio feature vector corresponding to the unit sample video;
taking audio feature vectors corresponding to at least two consecutive unit sample videos as an input and the annotation information of the sample video as a target of an output, training a preset initial model; and
determining the trained model as a video processing model.
2. The method according to claim 1, characterized in that obtaining the audio feature vector corresponding to the unit sample video comprises:
generating a spectrogram corresponding to an audio signal in the unit sample video; and
inputting the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
3. The method according to claim 2, characterized in that generating the spectrogram corresponding to the audio signal in the unit sample video comprises:
framing the audio signal in the unit sample video to obtain multiple audio signal frames;
applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram corresponding to the audio signal in the unit sample video; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
4. The method according to claim 1, characterized in that taking the audio feature vectors corresponding to at least two unit sample videos as the input and the annotation information of the sample video as the target of the output and training the preset initial model comprises:
randomly selecting at least two consecutive unit sample videos, splicing the audio feature vectors corresponding to the selected unit sample videos and inputting the result into the initial model, obtaining a prediction probability that the sample video belongs to the theme song category;
calculating a loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the theme song category and the annotation information of the sample video; and
determining that training is complete when the loss value is less than a set loss threshold.
5. A video processing method, characterized in that the method comprises:
obtaining a video to be processed;
extracting a head segment and a tail segment from the video to be processed;
dividing the head segment and the tail segment each into multiple unit videos to be processed;
for each unit video to be processed, obtaining an audio feature vector corresponding to the unit video to be processed;
inputting audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model, and determining whether the unit video to be processed belongs to a theme song category according to an output of the video processing model, wherein the video processing model is generated using the method according to any one of claims 1 to 4; and
splicing, among the unit videos to be processed that belong to the theme song category, the consecutive unit videos to be processed, obtaining an opening theme segment and an ending theme segment in the video to be processed.
6. The method according to claim 5, characterized in that obtaining the audio feature vector corresponding to the unit video to be processed comprises:
generating a spectrogram corresponding to an audio signal in the unit video to be processed; and
inputting the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
7. The method according to claim 6, characterized in that generating the spectrogram corresponding to the audio signal in the unit video to be processed comprises:
framing the audio signal in the unit video to be processed to obtain multiple audio signal frames;
applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram corresponding to the audio signal in the unit video to be processed; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit video to be processed.
8. The method according to claim 5, characterized in that obtaining, for each unit video to be processed, the audio feature vector corresponding to the unit video to be processed comprises:
calling a preset first process and a preset second process simultaneously;
for each unit video to be processed divided from the head segment, obtaining the audio feature vector corresponding to the unit video to be processed using the first process; and
for each unit video to be processed divided from the tail segment, obtaining the audio feature vector corresponding to the unit video to be processed using the second process.
9. The method according to claim 5, characterized in that determining whether the unit video to be processed belongs to the theme song category according to the output of the video processing model comprises:
comparing whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the theme song category is greater than or equal to a set probability threshold; and
if it is greater than or equal to the threshold, determining that the unit video to be processed belongs to the theme song category.
10. The method according to claim 5, characterized in that splicing, among the unit videos to be processed that belong to the theme song category, the consecutive unit videos to be processed and obtaining the opening theme segment and the ending theme segment in the video to be processed comprises:
splicing, among the unit videos to be processed divided from the head segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the opening theme segment in the video to be processed; and
splicing, among the unit videos to be processed divided from the tail segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the ending theme segment in the video to be processed.
11. The method according to claim 5, characterized in that:
after dividing the head segment and the tail segment each into multiple unit videos to be processed, the method further comprises: marking a start time and an end time of each unit video to be processed;
after splicing the consecutive unit videos to be processed belonging to the theme song category and obtaining the opening theme segment and the ending theme segment in the video to be processed, the method further comprises:
taking the start time of the first unit video to be processed in the opening theme segment as a start time of the opening theme segment, and the end time of the last unit video to be processed in the opening theme segment as an end time of the opening theme segment; and
taking the start time of the first unit video to be processed in the ending theme segment as a start time of the ending theme segment, and the end time of the last unit video to be processed in the ending theme segment as an end time of the ending theme segment.
12. A model generating apparatus, characterized in that the apparatus comprises:
a sample acquisition module for obtaining a training sample, wherein the training sample comprises a sample video and annotation information of the sample video, and the annotation information is used to indicate whether the sample video belongs to a theme song category;
a first division module for dividing the sample video into multiple unit sample videos;
a first vector acquisition module for obtaining, for each unit sample video, an audio feature vector corresponding to the unit sample video; and
a training module for taking audio feature vectors corresponding to at least two consecutive unit sample videos as an input and the annotation information of the sample video as a target of an output, training a preset initial model, and determining the trained model as a video processing model.
13. A video processing apparatus, characterized in that the apparatus comprises:
a video acquisition module for obtaining a video to be processed;
a segment extraction module for extracting a head segment and a tail segment from the video to be processed;
a second division module for dividing the head segment and the tail segment each into multiple unit videos to be processed;
a second vector acquisition module for obtaining, for each unit video to be processed, an audio feature vector corresponding to the unit video to be processed;
a category determination module for inputting audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model and determining whether the unit video to be processed belongs to a theme song category according to an output of the video processing model, wherein the video processing model is generated using the apparatus according to claim 12; and
a segment determination module for splicing, among the unit videos to be processed that belong to the theme song category, the consecutive unit videos to be processed, obtaining an opening theme segment and an ending theme segment in the video to be processed.
14. The apparatus according to claim 13, characterized in that the second vector acquisition module comprises:
a call unit for calling a preset first process and a preset second process simultaneously;
a head acquisition unit for obtaining, for each unit video to be processed divided from the head segment, the audio feature vector corresponding to the unit video to be processed using the first process; and
a tail acquisition unit for obtaining, for each unit video to be processed divided from the tail segment, the audio feature vector corresponding to the unit video to be processed using the second process.
15. The apparatus according to claim 13, characterized in that the segment determination module comprises:
an opening theme determination unit for splicing, among the unit videos to be processed divided from the head segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the opening theme segment in the video to be processed; and
an ending theme determination unit for splicing, among the unit videos to be processed divided from the tail segment that belong to the theme song category, the consecutive unit videos to be processed, obtaining the ending theme segment in the video to be processed.
16. The apparatus according to claim 13, characterized in that the apparatus further comprises:
a marking module for marking a start time and an end time of each unit video to be processed after the second division module divides the head segment and the tail segment each into multiple unit videos to be processed; and
a time determination module for taking the start time of the first unit video to be processed in the opening theme segment as a start time of the opening theme segment, the end time of the last unit video to be processed in the opening theme segment as an end time of the opening theme segment, the start time of the first unit video to be processed in the ending theme segment as a start time of the ending theme segment, and the end time of the last unit video to be processed in the ending theme segment as an end time of the ending theme segment.
17. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the model generating method according to any one of claims 1 to 4, and/or the video processing method according to any one of claims 5 to 11.
18. A non-transitory computer-readable storage medium, characterized in that, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the model generating method according to any one of claims 1 to 4, and/or the video processing method according to any one of claims 5 to 11.
CN201910459442.5A 2019-05-29 2019-05-29 Model generation, method for processing video frequency, device, electronic equipment and storage medium Pending CN110324657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910459442.5A CN110324657A (en) 2019-05-29 2019-05-29 Model generation, method for processing video frequency, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110324657A true CN110324657A (en) 2019-10-11

Family

ID=68119305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910459442.5A Pending CN110324657A (en) 2019-05-29 2019-05-29 Model generation, method for processing video frequency, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110324657A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102497594A (en) * 2011-12-16 2012-06-13 乐视网信息技术(北京)股份有限公司 Play method of serial video files
CN103325403A (en) * 2013-06-20 2013-09-25 富泰华工业(深圳)有限公司 Electronic device and video playing method thereof
CN105227999A (en) * 2015-09-29 2016-01-06 北京奇艺世纪科技有限公司 A kind of method and apparatus of video cutting
US20180102136A1 (en) * 2016-10-11 2018-04-12 Cirrus Logic International Semiconductor Ltd. Detection of acoustic impulse events in voice applications using a neural network
CN108024142A (en) * 2017-12-05 2018-05-11 深圳市茁壮网络股份有限公司 A kind of video flow detection method and system
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment
CN108989882A (en) * 2018-08-03 2018-12-11 百度在线网络技术(北京)有限公司 Method and apparatus for exporting the snatch of music in video
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JI, ZHONG ET AL.: "A Survey of News Video Story Unit Segmentation Techniques", JOURNAL OF IMAGE AND GRAPHICS *
LI, MINGHAO: "Research on Continuous Speech Recognition Based on Deep Neural Networks", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *
GUO, CHAOYUAN: "Design and Implementation of an Audio Data Acquisition System", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *
HAN, NING: "Research on Automatic Music Annotation Technology Based on Deep Neural Networks", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182301A (en) * 2020-09-30 2021-01-05 北京百度网讯科技有限公司 Method and device for extracting video clip
EP3836141A3 (en) * 2020-09-30 2021-10-20 Beijing Baidu Netcom Science And Technology Co. Ltd. Method and apparatus for extracting video clip
US11646050B2 (en) 2020-09-30 2023-05-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for extracting video clip
CN113569740A (en) * 2021-07-27 2021-10-29 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device and video recognition method and device
CN113569740B (en) * 2021-07-27 2023-11-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device, and video recognition method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191011