CN110324726A - Model generation method, video processing method, apparatus, electronic device and storage medium - Google Patents
Model generation method, video processing method, apparatus, electronic device and storage medium
- Publication number
- CN110324726A CN110324726A CN201910458806.8A CN201910458806A CN110324726A CN 110324726 A CN110324726 A CN 110324726A CN 201910458806 A CN201910458806 A CN 201910458806A CN 110324726 A CN110324726 A CN 110324726A
- Authority
- CN
- China
- Prior art keywords
- video
- unit
- processed
- sample
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/08 — Neural network learning methods
- G10L25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks
- G10L25/51 — Speech or voice analysis specially adapted for comparison or discrimination
- G10L25/57 — Speech or voice analysis for comparison or discrimination, for processing of video signals
- H04N21/233 — Server-side processing of audio elementary streams
- H04N21/234 — Server-side processing of video elementary streams
- H04N21/23424 — Splicing one content stream with another, e.g. for inserting or substituting an advertisement
- H04N21/4394 — Client-side analysis of the audio stream, e.g. detecting features or characteristics in audio streams
- H04N21/44 — Client-side processing of video elementary streams
- H04N21/44016 — Splicing one content stream with another, e.g. for substituting a video clip
- H04N21/8455 — Structuring of content involving pointers to the content, e.g. to the I-frames of the video stream
- H04N21/8456 — Structuring of content by decomposing it in the time domain, e.g. into time segments
Abstract
The present invention provides a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium. The model generation method includes: obtaining a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category; dividing the sample video into multiple unit sample videos; obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video; and training a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, the trained model being determined as the video processing model. The invention avoids the inaccurate splitting caused by splitting a video solely according to whether its scene images change substantially, so the music segments obtained by splitting are more accurate.
Description
Technical field
The present invention relates to the field of Internet technologies, and in particular to a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium.
Background
With the rapid development of the Internet, music-program videos of all kinds keep emerging, capturing viewers' attention and offering an audiovisual feast: for example, music competition videos from variety channels, such as The Voice of China, and concert videos from music channels. To meet users' needs — a user may care most about a particular music segment within a music-program video — such a video can be split into multiple music segments.
The prior art generally splits music-program videos using a scene-change-based method: the video is split according to whether the scene images change substantially, and the time points at which substantial scene-image change occurs are used as split points.
However, in music-program videos the scene often switches while the audio does not. Splitting at such scene-switch time points is inaccurate, so the accuracy of the above method is low.
Summary of the invention
Embodiments of the present invention provide a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium, to solve the problem that existing methods for splitting music-program videos have low accuracy.
In a first aspect, an embodiment of the invention provides a model generation method, the method including:
obtaining a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category;
dividing the sample video into multiple unit sample videos;
obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video; and
training a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and determining the trained model as the video processing model.
Optionally, obtaining the audio feature vector corresponding to the unit sample video includes: generating a spectrogram of the audio signal in the unit sample video; inputting the spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
Optionally, generating the spectrogram of the audio signal in the unit sample video includes: framing the audio signal in the unit sample video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram of the audio signal in the unit sample video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram of the audio signal in the unit sample video.
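The framing, windowing, Fourier-transform and Mel-transform steps above can be sketched as follows. This is a minimal illustration with assumed frame/hop sizes and a naive O(n²) DFT standing in for the FFT a real pipeline would use:

```python
import math

def mel_spectrogram(signal, sr=8000, frame_len=64, hop=32, n_mels=8):
    """Framing -> Hann window -> DFT power spectrum -> Mel filterbank.
    Parameter values are illustrative, not from the patent."""
    hz2mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Framing: split the signal into overlapping frames
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]

    # Windowing + Fourier transform -> initial (power) spectrogram
    n_bins = frame_len // 2 + 1
    spec = []
    for frame in frames:
        win = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len))
               for i, s in enumerate(frame)]
        row = []
        for k in range(n_bins):
            re = sum(w * math.cos(2 * math.pi * k * i / frame_len) for i, w in enumerate(win))
            im = sum(w * math.sin(2 * math.pi * k * i / frame_len) for i, w in enumerate(win))
            row.append(re * re + im * im)
        spec.append(row)

    # Mel transform: triangular filters evenly spaced on the Mel scale
    mels = [hz2mel(0) + (hz2mel(sr / 2) - hz2mel(0)) * j / (n_mels + 1)
            for j in range(n_mels + 2)]
    bins = [int((frame_len + 1) * mel2hz(m) / sr) for m in mels]
    mel_spec = []
    for row in spec:
        mel_row = []
        for m in range(1, n_mels + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            e = sum(row[k] * (k - l) / max(c - l, 1) for k in range(l, c))
            e += sum(row[k] * (r - k) / max(r - c, 1) for k in range(c, r))
            mel_row.append(e)
        mel_spec.append(mel_row)
    return mel_spec
```

In practice a library such as librosa would compute this directly, but the three stages map one-to-one onto the claim's framing, windowing/Fourier, and Mel-conversion steps.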
Optionally, training the preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output includes: randomly selecting at least two consecutive unit sample videos; concatenating the audio feature vectors of the selected unit sample videos and inputting the result into the initial model to obtain the predicted probability that the sample video belongs to the music category; computing the loss value of the sample video from the predicted probability and the annotation information of the sample video; and determining that training is complete when the loss value is smaller than a set loss threshold.
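The training procedure above (concatenate the feature vectors of consecutive units, predict a music probability, compute a loss against the annotation, stop once the loss falls below a set threshold) can be sketched with a plain logistic model on hypothetical toy features; the patent's initial model is a neural network, which this stand-in only approximates:

```python
import math
import random

random.seed(0)
DIM = 8  # per-unit feature dimension (illustrative)

def make_sample(label):
    """Two consecutive units' feature vectors, concatenated.
    Toy data: music clusters around +1, non-music around -1."""
    base = 1.0 if label == 1 else -1.0
    return [base + random.gauss(0, 0.3) for _ in range(2 * DIM)], label

samples = [make_sample(i % 2) for i in range(64)]

w = [0.0] * (2 * DIM)
b = 0.0
lr, loss_threshold = 0.5, 0.05
loss = float("inf")

for step in range(5000):
    x, y = random.choice(samples)                 # randomly selected consecutive units
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))                # predicted music-category probability
    loss = -(y * math.log(p + 1e-9) + (1 - y) * math.log(1 - p + 1e-9))
    if loss < loss_threshold:                     # training is determined complete
        break
    g = p - y                                     # cross-entropy gradient
    w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    b -= lr * g
```

The stopping rule mirrors the claim: training ends when the sample's loss value drops below the preset loss threshold.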
In a second aspect, an embodiment of the invention provides a video processing method, the method including:
obtaining a video to be processed;
dividing the video to be processed into multiple unit videos;
obtaining, for each unit video, the audio feature vector corresponding to that unit video;
inputting the audio feature vectors of at least two consecutive unit videos, including the unit video in question, into a pre-generated video processing model, and determining from the output of the video processing model whether the unit video belongs to the music category, where the video processing model is generated using any of the methods described above; and
splicing consecutive unit videos among those that belong to the music category, to obtain the music segments in the video to be processed.
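The final splicing step — merging runs of consecutive music-category units into complete music segments — can be sketched as follows, assuming one model probability per fixed-length unit (unit length and threshold are illustrative):

```python
def music_segments(unit_probs, unit_seconds=1.0, threshold=0.5):
    """Group consecutive music-category units into (start, end) segments in seconds."""
    is_music = [p >= threshold for p in unit_probs]
    segments, start = [], None
    for i, m in enumerate(is_music):
        if m and start is None:
            start = i                              # a music segment begins
        if not m and start is not None:
            segments.append((start * unit_seconds, i * unit_seconds))
            start = None                           # the segment ends
    if start is not None:                          # segment runs to the end of the video
        segments.append((start * unit_seconds, len(is_music) * unit_seconds))
    return segments

# Units 1-2 and 4-5 are music, so two segments are spliced out.
segs = music_segments([0.1, 0.9, 0.8, 0.2, 0.7, 0.9])
```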
Optionally, obtaining the audio feature vector corresponding to the unit video includes: generating a spectrogram of the audio signal in the unit video; inputting the spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video.
Optionally, generating the spectrogram of the audio signal in the unit video includes: framing the audio signal in the unit video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram of the audio signal in the unit video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram of the audio signal in the unit video.
Optionally, dividing the video to be processed into multiple unit videos includes: evenly dividing the video to be processed into multiple video clips, and dividing each video clip into multiple unit videos.
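This two-level division (video → equal clips → fixed-length units) could look like the sketch below; the clip count and unit length are illustrative choices, not values from the patent:

```python
def split_video(duration, n_clips, unit_seconds):
    """Evenly split a video of `duration` seconds into `n_clips` clips,
    then split each clip into fixed-length units.
    Returns, per clip, a list of (start, end) unit intervals in seconds."""
    clip_len = duration / n_clips
    clips = []
    for c in range(n_clips):
        start, end = c * clip_len, (c + 1) * clip_len
        units, t = [], start
        while t < end - 1e-9:
            units.append((t, min(t + unit_seconds, end)))  # last unit may be shorter
            t += unit_seconds
        clips.append(units)
    return clips

# A 60 s video split into 3 clips of 20 s, each holding four 5 s units.
clips = split_video(60, 3, 5)
```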
Optionally, obtaining, for each unit video, the corresponding audio feature vector includes: calling multiple preset processes simultaneously, and for each unit video divided from a given video clip, obtaining the audio feature vector of that unit video using one process.
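A hedged sketch of this per-clip parallel feature extraction. For portability it uses the thread-backed `multiprocessing.dummy.Pool`, which exposes the same `map` API as the process-backed `multiprocessing.Pool` the patent implies; `extract_features` is a hypothetical stand-in for the real spectrogram-plus-network pipeline:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; same API as multiprocessing.Pool

def extract_features(unit_samples):
    """Hypothetical per-unit feature extraction: mean and peak of the raw samples."""
    return (sum(unit_samples) / len(unit_samples), max(unit_samples))

def features_for_clip(clip_units, n_workers=4):
    """Fan the units of one clip out across the worker pool, one unit per worker call."""
    with Pool(n_workers) as pool:
        return pool.map(extract_features, clip_units)

# Two tiny "units" of audio samples from one clip.
feats = features_for_clip([[1, 2, 3], [4, 5, 6]])
```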
Optionally, before splicing the consecutive unit videos that belong to the music category to obtain the music segments in the video to be processed, the method further includes: finding, among the multiple unit videos, a unit video whose category changes abruptly; and determining whether the unit video with the abrupt category change belongs to the music category according to at least three consecutive unit videos that include it.
Optionally, determining whether the unit video with the abrupt category change belongs to the music category according to the at least three consecutive unit videos that include it includes: obtaining, among those consecutive unit videos, the number of unit videos that belong to the music category and the number that belong to the non-music category, and taking the category with the greater number as the category of the unit video with the abrupt category change.
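The majority vote over a window of consecutive units might be implemented as below. For simplicity this sketch re-labels every unit, whereas the patent only re-examines units whose category changes abruptly; labels are 1 for music and 0 for non-music:

```python
def smooth_labels(labels, window=3):
    """Correct isolated category mutations by majority vote over `window`
    consecutive units centred on each unit (window assumed odd, >= 3)."""
    half = window // 2
    out = list(labels)
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        neigh = labels[lo:hi]
        music = sum(neigh)                       # count of music-category units
        out[i] = 1 if music > len(neigh) - music else 0
    return out

# A lone non-music unit inside a music run is flipped back to music.
smoothed = smooth_labels([1, 1, 0, 1, 1])
```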
Optionally, determining from the output of the video processing model whether the unit video belongs to the music category includes: determining whether the predicted probability, output by the video processing model, that the unit video belongs to the music category is greater than or equal to a set probability threshold, and if so, determining that the unit video belongs to the music category.
In a third aspect, an embodiment of the invention provides a model generation apparatus, the apparatus including:
a sample acquisition module, configured to obtain a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category;
a first division module, configured to divide the sample video into multiple unit sample videos;
a first vector acquisition module, configured to obtain, for each unit sample video, the audio feature vector corresponding to that unit sample video; and
a training module, configured to train a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and to determine the trained model as the video processing model.
In a fourth aspect, an embodiment of the invention provides a video processing apparatus, the apparatus including:
a video acquisition module, configured to obtain a video to be processed;
a second division module, configured to divide the video to be processed into multiple unit videos;
a second vector acquisition module, configured to obtain, for each unit video, the audio feature vector corresponding to that unit video;
a category determination module, configured to input the audio feature vectors of at least two consecutive unit videos, including the unit video in question, into a pre-generated video processing model, and to determine from the output of the video processing model whether the unit video belongs to the music category, where the video processing model is generated using the method described above; and
a segment determination module, configured to splice consecutive unit videos among those that belong to the music category, to obtain the music segments in the video to be processed.
Optionally, the second division module includes: a clip extraction unit, configured to evenly divide the video to be processed into multiple video clips; and a clip division unit, configured to divide each video clip into multiple unit videos.
Optionally, the second vector acquisition module includes: a process calling unit, configured to call multiple preset processes simultaneously; and a process handling unit, configured to obtain, for each unit video divided from a given video clip, the audio feature vector of that unit video using one process.
Optionally, the apparatus further includes: a search module, configured to find, among the multiple unit videos, a unit video whose category changes abruptly; and a determination module, configured to determine whether the unit video with the abrupt category change belongs to the music category according to at least three consecutive unit videos that include it.
Optionally, the determination module includes: a count acquisition unit, configured to obtain, among the consecutive unit videos, the number of unit videos that belong to the music category and the number that belong to the non-music category; and a count comparison unit, configured to take the category with the greater number as the category of the unit video with the abrupt category change.
In a fifth aspect, an embodiment of the invention provides an electronic device, including a processor and a memory storing instructions executable by the processor, where the processor is configured to execute any of the model generation methods described above and/or any of the video processing methods described above.
In a sixth aspect, an embodiment of the invention provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to execute any of the model generation methods described above and/or any of the video processing methods described above.
In embodiments of the present invention, multiple sample videos that belong to the music category and multiple sample videos that do not are obtained, and a video processing model for detecting music segments in a video is trained on the audio feature vectors of those sample videos, so the model can detect from a video's audio feature vectors whether the video belongs to the music category. A video to be detected contains both music segments and non-music segments; when the video processing model is used to detect the music segments, the video is divided into multiple unit videos, and the model detects from each unit's audio feature vector whether that unit belongs to the music category. If a unit video belongs to the music category, it can be determined to be part of a music segment; if it does not, it can be determined to be part of a non-music segment. Consequently, if several consecutive unit videos all belong to the music category, they can be determined to belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment.
In the prior art, a music-program video is split according to whether its scene images change substantially, with the time points of substantial scene-image change used as split points; but the scene may switch while the audio does not, so this approach can split a single music segment apart, and splitting at scene-switch time points is inaccurate. Compared with the prior art, embodiments of the present invention split a music-program video by recognizing whether each unit video itself belongs to the music category, and hence whether it is part of a music segment. This avoids the inaccurate splitting caused by relying solely on substantial scene-image changes, so the music segments obtained by the embodiments of the present invention are more accurate.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of a model generation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the steps of a video processing method according to an embodiment of the present invention;
Fig. 3 is a flowchart of the steps of another video processing method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a video processing procedure according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a model generation apparatus according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a video processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, a flowchart of the steps of a model generation method according to an embodiment of the present invention is shown.
The model generation method of the embodiment of the present invention includes the following steps:
Step 101: obtain training samples.
When training the model, a large number of sample videos from music-program-class videos can first be obtained from the Internet. The sample videos may include music videos and non-music videos: a music video may be a music segment in a music-program-class video, and a non-music video may be a non-music segment, such as talking or an advertisement, in a music-program-class video. The sample videos are annotated by annotators to obtain the annotation information of each sample video; the annotation information indicates whether the sample video belongs to the music category. For example, annotation information "1" indicates that a sample video is of the music category, and annotation information "0" indicates that a sample video is of the non-music category. A sample video together with its annotation information forms one training sample, and a large number of such training samples form the training sample set. Each training sample is processed in the same way, so the embodiment of the present invention mainly describes the processing of one training sample.
In the embodiment of the present invention, sample videos can be obtained from multiple different types of music-program-class videos to ensure the diversity of the samples, and equal numbers of music videos and non-music videos can be obtained to ensure the balance of the samples. For example, 2000 sample videos are obtained from music talent-show videos on a variety channel, of which 1000 are music videos and 1000 are non-music videos; and 2000 sample videos are obtained from concert videos on a music channel, of which 1000 are music videos and 1000 are non-music videos. The above 4000 sample videos and their annotation information are used as the training sample set.
As for the specific duration of each sample video, those skilled in the art may select any suitable value based on practical experience; for example, the duration can be 8 s, 9 s, 10 s, and so on.
Step 102: divide the sample video into multiple unit sample videos.
The embodiment of the present invention trains a video processing model for detecting the music segments in a video. Considering that a music segment in a video has audio consistency, that is, the audio within one music segment should all be of the music category, whether a video is of the music category can be judged from its audio feature vector. The video processing model in the embodiment of the present invention therefore detects the music category mainly based on audio feature vectors.
A sample video is divided into multiple unit sample videos for analysis.
In an optional embodiment, the sample video can be divided into multiple unit sample videos in units of a set duration. As for the specific value of the set duration, those skilled in the art may select any suitable value based on practical experience. For example, if a neural network model is used to obtain the audio feature vectors and the neural network model can process audio signals of 1 s, the set duration can be set to 1 s, and so on.
Step 103: for each unit sample video, obtain the audio feature vector corresponding to the unit sample video.
The audio feature vector corresponding to each unit sample video is obtained separately.
For example, for a sample video A with a duration of 9 s, sample video A is divided in units of 1 s into unit sample videos 1 through 9, i.e., 9 unit sample videos in total. The audio feature vectors corresponding to unit sample videos 1 through 9 are then obtained separately.
In an optional embodiment, obtaining the audio feature vector corresponding to a unit sample video may include steps A1 to A2.
Step A1: generate the spectrogram corresponding to the audio signal in the unit sample video.
Step A1 can further include steps A11 to A13.
Step A11: perform framing processing on the audio signal in the unit sample video to obtain multiple audio signal frames.
The audio signal is extracted from the unit sample video, and framing processing is performed on the audio signal in the unit sample video.
An audio signal is non-stationary macroscopically but stationary microscopically: it has short-term stationarity (within roughly 10 to 30 ms the audio signal can be considered approximately unchanged). The audio signal can therefore be divided into short sections for processing; this is framing, and each short section after framing is called an audio signal frame. For example, an overlapping framing method can be used: instead of intercepting frames back to back, successive frames overlap by a part. The overlapping part between the previous frame and the next frame is called the frame shift, and the ratio of the frame shift to the frame length is generally 0 to 0.5. The specific frame length can be set according to actual conditions, and the number of frames per second can be set to 33 to 100.
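As a sketch of the overlapped framing above, the following uses illustrative values only (a 16 kHz sample rate, 25 ms frame length, and 10 ms frame shift are assumptions, not values fixed by the embodiment; the frame-shift/frame-length ratio here is 0.4, within the 0-0.5 range mentioned):

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D audio signal into overlapping frames (frame shift = hop)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

sr = 16000                                        # assumed sample rate
audio = np.zeros(sr)                              # one 1 s audio unit
frames = frame_signal(audio,
                      frame_len=int(0.025 * sr),  # 25 ms frame length
                      hop=int(0.010 * sr))        # 10 ms frame shift
# frames.shape == (98, 400): roughly 100 frames per second of audio
```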
Step A12: perform windowing processing and Fourier transform processing on each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video.
Audio changes constantly over a long range, and a signal without fixed characteristics cannot be processed, so windowing processing is performed on each audio signal frame: the audio signal frame is multiplied by a window function. The purpose of windowing is to eliminate the signal discontinuities that may arise at the two ends of each audio signal frame, making the whole more continuous. The cost of windowing is that the parts at the two ends of an audio signal frame are weakened, which is why frames must overlap during framing. In practical applications, common window functions for windowing audio signal frames include the rectangular window, the Hamming window, the Hanning window, and so on; according to the frequency-domain characteristics of the window functions, the Hamming window may preferably be used.
Since it is generally difficult to see the characteristics of an audio signal from its transformation in the time domain, the signal is usually converted into an energy distribution in the frequency domain for observation, and different energy distributions can represent the characteristics of different speech. So after the windowing processing, Fourier transform processing is performed on each windowed audio signal frame to obtain the energy distribution over the spectrum, yielding the spectrum of each audio signal frame and thereby the initial spectrogram corresponding to the audio signal in the unit sample video.
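The windowing and Fourier transform of step A12 can be sketched as follows (a minimal illustration; the frame count and frame length are assumed values carried over from the framing sketch, not parameters specified by the embodiment):

```python
import numpy as np

def initial_spectrogram(frames):
    """Window each frame with a Hamming window, then take the magnitude
    of its Fourier transform: the energy distribution over frequency."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window))

frames = np.random.randn(98, 400)   # 98 frames of 400 samples (assumed)
spec = initial_spectrogram(frames)  # initial spectrogram, shape (98, 201)
```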
Step A13: perform Mel transformation processing on the initial spectrogram to obtain a Mel spectrogram, and use the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
The initial spectrogram is often rather large. In order to obtain audio features of a suitable size, Mel transformation processing can be performed on the initial spectrogram through a Mel filter bank, transforming it into a Mel spectrogram.
The unit of frequency is hertz (Hz), and the frequency range audible to the human ear is 20-20000 Hz, but the human ear does not perceive the Hz scale linearly. For example, after adapting to a tone of 1000 Hz, if the tone frequency is raised to 2000 Hz, our ears can only perceive that the frequency has risen a little, and cannot perceive at all that the frequency has doubled. Ordinary frequency is converted into mel frequency by the following mapping relationship:
mel(f) = 2595 * log10(1 + f / 700)
where f is the ordinary frequency and mel(f) is the mel frequency.
Through the above formula, the human ear's perception of frequency becomes a linear relationship. That is, under the mel frequency, if the mel frequencies of two sections of audio differ by a factor of two, the tones perceived by the human ear will probably also differ by about a factor of two.
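The mapping above can be written as a small helper; note that doubling the ordinary frequency from 1000 Hz to 2000 Hz raises the mel frequency by well under a factor of two, matching the perception example in the text:

```python
import math

def mel(f):
    """Map ordinary frequency f (Hz) to mel frequency."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```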
According to the sensitivity of the human ear to different frequencies in practice, the frequency range is divided into multiple Mel filters to obtain a Mel filter bank; a Mel filter bank may include 20 to 40 Mel filters. In the mel frequency range, the center frequencies of the Mel filters are distributed linearly at equal intervals, but in the ordinary frequency range the intervals are not equal. The initial spectrogram is filtered using the Mel filter bank to obtain a Mel spectrogram, and the Mel spectrogram is determined as the spectrogram corresponding to the audio signal in the unit sample video.
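A Mel filter bank of the kind described can be sketched as follows. The 26 filters, 201 frequency bins, and 8000 Hz upper limit are illustrative assumptions, not parameters specified by the embodiment; each triangular filter's center frequency is equally spaced on the mel scale but not in hertz:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft_bins=201, f_max=8000.0):
    """Triangular Mel filters, uniformly spaced on the mel scale."""
    def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge frequencies: equal intervals in mel, unequal in hertz
    edges = imel(np.linspace(0.0, mel(f_max), n_filters + 2))
    bins = np.floor(edges / f_max * (n_fft_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)  # rising edge
        fbank[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)  # falling edge
    return fbank

fbank = mel_filterbank()
# filtering an initial spectrogram yields the Mel spectrogram
mel_spec = np.random.rand(98, 201) @ fbank.T
```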
Step A2: input the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In the embodiment of the present invention, a neural network model can be used: the spectrogram corresponding to the audio signal in the unit sample video is input into the neural network model, feature extraction is performed inside the neural network model, and the neural network model outputs an audio feature vector, which is the audio feature vector corresponding to the unit sample video.
In an optional embodiment, the VGGish model (based on VGG, Visual Geometry Group), built on the TensorFlow open-source deep learning framework, can be used to extract the audio feature vectors. The VGGish model may include convolutional layers, fully connected layers, and so on, where the convolutional layers can be used to extract features and the fully connected layers can be used to map the extracted features to a corresponding feature vector. The spectrogram corresponding to the audio signal in the unit sample video is therefore input into the VGGish model; the convolutional layers extract the audio features in the spectrogram and feed them into the fully connected layers, which produce a 128-dimensional audio feature vector that the model outputs.
In the embodiment of the present invention, the audio feature vector corresponding to each unit sample video can be saved in the TFRecord format. Data in the TFRecord format is stored in binary form, occupies less disk space, and is faster to read.
Step 104: taking the audio feature vectors corresponding to at least two consecutive unit sample videos as the input, and the annotation information of the sample video as the target of the output, train a preset initial model, and determine the trained model as the video processing model.
If the feature vector corresponding to a single unit sample video were used to represent a whole sample video during training, then, because the duration of one unit sample video is short, its feature vector might not represent the entire sample video accurately and comprehensively. Therefore, in the embodiment of the present invention, the audio feature vectors corresponding to at least two consecutive unit sample videos are used to represent a sample video during training.
For a sample video, the audio feature vectors corresponding to at least two consecutive unit sample videos divided from the sample video are taken as the input, and the annotation information of the sample video is taken as the target of the output, to train the preset initial model.
The process of training the preset initial model may include steps B1 to B3.
Step B1: randomly select at least two consecutive unit sample videos, splice the audio feature vectors corresponding to the selected unit sample videos, and input the result into the initial model to obtain the predicted probability that the sample video belongs to the music category.
The initial model refers to a model with a classification function that has not yet been trained. The initial model can analyze the input audio feature vectors and output the predicted probability that the sample video belongs to the music category, but the predicted probability output by the initial model is usually inaccurate, so the initial model is trained to obtain an accurate video processing model.
From the unit sample videos divided from the sample video, at least two consecutive unit sample videos are randomly selected; their corresponding audio feature vectors are spliced and input into the initial model, and the initial model outputs the predicted probability that the sample video belongs to the music category.
For example, sample video A is divided in units of 1 s into unit sample videos 1 through 9, i.e., 9 unit sample videos in total. Five consecutive unit sample videos are randomly selected from the 9 unit sample videos; each unit sample video corresponds to a 128-dimensional audio feature vector, so the feature vectors corresponding to the 5 unit sample videos are spliced into a 128 * 5 = 640-dimensional audio feature vector, which is input into the initial model. The initial model outputs the predicted probability that sample video A belongs to the music category.
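The splicing in this example amounts to a simple concatenation (the feature vectors below are random placeholders, not real VGGish outputs):

```python
import numpy as np

# five consecutive 128-dimensional unit-level vectors (random placeholders)
unit_vectors = [np.random.rand(128) for _ in range(5)]

# spliced into one 128 * 5 = 640-dimensional model input
model_input = np.concatenate(unit_vectors)
```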
Step B2: calculate the loss value corresponding to the sample video according to the predicted probability that the sample video belongs to the music category and the annotation information of the sample video.
The predicted probability that the sample video belongs to the music category is the actual output of the initial model, and the annotation information of the sample video is the target of the output; the loss value corresponding to the sample video is calculated from the actual output and the target of the output. The loss value can indicate the degree of deviation between the predicted probability that the sample video belongs to the music category and the annotation information of the sample video.
In an optional embodiment, the difference between the annotation information of the sample video and the predicted probability that the sample video belongs to the music category can be used as the loss value. For example, if the predicted probability that the sample video belongs to the music category is 0.8 and the annotation information of the sample video is 1, the loss value can be 0.2.
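This optional loss can be sketched as follows (the 0.8 prediction and annotation 1 reproduce the example above):

```python
def penalty(prediction, annotation):
    """Loss value: deviation of the predicted music-category probability
    from the annotation information (0 or 1)."""
    return abs(annotation - prediction)

loss = penalty(0.8, 1)   # the example above: loss value 0.2
```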
Step B3: when the loss value is less than a set loss threshold, determine that the training is completed.
The smaller the loss value, the better the robustness of the model. In the embodiment of the present invention, a loss threshold for measuring whether the training of the model is completed is preset. If the loss value is less than the set loss threshold, it indicates that the degree of deviation between the predicted probability that the sample video belongs to the music category and the annotation information of the sample video is small, and it can be considered that the training is completed. If the loss value is greater than or equal to the set loss threshold, it indicates that the degree of deviation between the predicted probability and the annotation information is large; in this case the parameters of the model can be adjusted and training continues with the next training sample.
As for the specific value of the set loss threshold, those skilled in the art may select any suitable value based on practical experience; for example, it can be set to 0.1, 0.2, 0.3, and so on.
The trained model can be used as the video processing model and subsequently used to detect music segments in videos.
In addition, in the embodiment of the present invention, a test sample set can also be obtained when the training sample set is obtained. The test sample set is similar to the training sample set: a test sample includes a test video and the annotation information of the test video. After the video processing model is obtained through training, the video processing model is tested using the test sample set. The test process may include: dividing the test video into multiple unit test videos; for each unit test video, obtaining the audio feature vector corresponding to the unit test video; and inputting the audio feature vectors corresponding to at least two consecutive unit test videos into the video processing model, which outputs the predicted probability that the test video belongs to the music category. This predicted probability is compared with the annotation information of the test video to test whether the video processing model is accurate.
In the embodiment of the present invention, multiple sample videos belonging to the music category and multiple sample videos not belonging to the music category are obtained, and the video processing model for detecting music segments in videos is trained from the audio feature vectors corresponding to the sample videos, so the video processing model can detect whether a video belongs to the music category according to the video's corresponding audio feature vectors. A to-be-detected video contains both music segments and non-music segments. When the video processing model is used to detect the music segments in a to-be-detected video, the to-be-detected video is divided into multiple to-be-processed unit videos, and the video processing model can detect, based on the audio feature vectors, whether each to-be-processed unit video belongs to the music category. If a to-be-processed unit video belongs to the music category, it can be determined that the unit video is part of a music segment; if a to-be-processed unit video does not belong to the music category, it can be determined that the unit video is part of a non-music segment. Thus, if several consecutive to-be-processed unit videos all belong to the music category, it can be determined that these consecutive unit videos belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment. In the prior art, when a music-program-class video is split, the splitting is performed according to whether a large change occurs in the scene image information in the video, and the time point at which the scene image information changes greatly is taken as the split point. However, in this splitting mode there are cases in which the scene switches while the audio does not, so the same music segment may be split apart; splitting music segments at scene-switch time points is therefore inaccurate. Compared with the prior art, when splitting a to-be-processed video of the music-program class, the embodiment of the present invention identifies whether each to-be-processed unit video itself belongs to the music category, and thereby determines whether that unit video is part of a music segment, so that the to-be-processed video can be split accordingly. This avoids the inaccurate splitting caused by relying only on whether a large change occurs in the scene image information, so the music segments obtained by splitting in the embodiment of the present invention are more accurate.
Referring to Fig. 2, a flowchart of the steps of a video processing method according to an embodiment of the present invention is shown.
The video processing method of the embodiment of the present invention includes the following steps:
Step 201: obtain a to-be-processed video.
A to-be-processed video refers to a music-program-class video for which music segment detection is required. For example, for music talent-show videos, users may pay more attention to a particular music segment in the music-program-class video, so the music talent-show video of each episode can serve as a to-be-processed video.
Step 202: divide the to-be-processed video into multiple to-be-processed unit videos.
Similarly to step 102 above, based on the audio consistency of the music segments in the to-be-processed video, whether a video belongs to the music category can be determined through audio feature vectors.
A to-be-processed video is divided into multiple to-be-processed unit videos for analysis. For example, the to-be-processed video can be divided into multiple to-be-processed unit videos in units of a set duration. The set duration involved in step 202 can be the same as the set duration involved in step 102 above.
Step 203: for each to-be-processed unit video, obtain the audio feature vector corresponding to the to-be-processed unit video.
Obtaining the audio feature vector corresponding to a to-be-processed unit video may include: generating the spectrogram corresponding to the audio signal in the to-be-processed unit video; inputting the spectrogram corresponding to the audio signal in the to-be-processed unit video into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the to-be-processed unit video.
Generating the spectrogram corresponding to the audio signal in the to-be-processed unit video may include: performing framing processing on the audio signal in the to-be-processed unit video to obtain multiple audio signal frames; performing windowing processing and Fourier transform processing on each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the to-be-processed unit video; and performing Mel transformation processing on the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram corresponding to the audio signal in the to-be-processed unit video.
Step 203 is similar to step 103 above; refer to the relevant description of step 103 for details, which the embodiment of the present invention does not discuss again here.
For example, a to-be-processed video is divided in units of 1 s into to-be-processed unit videos 1, 2, 3, and so on, and the audio feature vector corresponding to each to-be-processed unit video is obtained separately.
Step 204: input the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos, including the to-be-processed unit video, into the pre-generated video processing model, and determine whether the to-be-processed unit video belongs to the music category according to the output of the video processing model.
If the feature vector corresponding to a single to-be-processed unit video were used directly to detect whether the unit video belongs to the music category, then, because the duration of one to-be-processed unit video is short, its feature vector might not accurately determine whether the unit video really belongs to the music category. Therefore, in the embodiment of the present invention, the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos, including the unit video in question, are used to determine whether that to-be-processed unit video belongs to the music category.
For a to-be-processed unit video, the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos including it are input into the video processing model generated in the embodiment shown in Fig. 1 above. After analyzing the audio feature vectors, the video processing model outputs the predicted probability that the to-be-processed unit video belongs to the music category. After the output of the video processing model is obtained, the predicted probability output by the video processing model is compared against a set probability threshold; if the predicted probability is greater than or equal to the set probability threshold, it is determined that the to-be-processed unit video belongs to the music category.
As for the specific value of the set probability threshold, those skilled in the art may select any suitable value based on practical experience; for example, it can be set to 0.7, 0.8, 0.9, and so on.
For example, take to-be-processed unit video 3, and the 5 consecutive to-be-processed unit videos including it: to-be-processed unit videos 1, 2, 3, 4, and 5. The corresponding 128-dimensional audio feature vectors of to-be-processed unit videos 1 through 5 are spliced into a 128 * 5 = 640-dimensional audio feature vector and input into the video processing model, which outputs the predicted probability that to-be-processed unit video 3 belongs to the music category. If the predicted probability is greater than the set probability threshold, it is determined that to-be-processed unit video 3 belongs to the music category. In this scheme, both the audio feature vectors before to-be-processed unit video 3 and the audio feature vectors after it are considered, so by using the audio feature vectors corresponding to the 5 consecutive to-be-processed unit videos 1 through 5, the result determined for to-be-processed unit video 3 is more accurate.
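The per-unit decision in this example can be sketched as follows. The stand-in model returning a fixed probability and the 0.7 threshold are assumptions for illustration, and the text does not specify how windows are formed at the start and end of the video:

```python
import numpy as np

THRESHOLD = 0.7            # example probability threshold from the text

def fake_model(vec_640):   # placeholder for the trained video processing model
    return 0.8             # pretend music-category probability

def is_music_unit(unit_vectors, i):
    """Classify unit i from the spliced features of units i-2 .. i+2."""
    window = unit_vectors[i - 2 : i + 3]        # 5 consecutive units
    prob = fake_model(np.concatenate(window))   # 128 * 5 = 640-dim input
    return prob >= THRESHOLD

vectors = [np.random.rand(128) for _ in range(9)]
decision = is_music_unit(vectors, 4)
```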
Step 205: among the to-be-processed unit videos belonging to the music category, splice the consecutive to-be-processed unit videos to obtain the music segments in the to-be-processed video.
After determining whether each to-be-processed unit video belongs to the music category: if a to-be-processed unit video belongs to the music category, it can be determined that the unit video is part of a music segment; if a to-be-processed unit video does not belong to the music category, it can be determined that the unit video is part of a non-music segment. Thus, if several consecutive to-be-processed unit videos all belong to the music category, it can be determined that these consecutive unit videos belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment. Therefore, splicing the to-be-processed unit videos that consecutively belong to the music category yields the music segments in the to-be-processed video. In general, a to-be-processed video may contain multiple music segments.
When the to-be-processed video is divided into multiple to-be-processed unit videos, the start time and end time corresponding to each to-be-processed unit video can also be recorded. Therefore, after the to-be-processed unit videos that consecutively belong to the music category are spliced into a music segment, the start time of the first to-be-processed unit video in the music segment can be taken as the start time of the music segment, and the end time of the last to-be-processed unit video in the music segment can be taken as the end time of the music segment.
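The splicing of step 205, together with the start/end-time bookkeeping, can be sketched as follows (a minimal illustration assuming 1 s units; the input flags are made-up classification results):

```python
def music_segments(flags, unit_seconds=1):
    """flags[i] is True if unit i belongs to the music category.
    Returns (start_time, end_time) pairs, one per spliced music segment:
    the start of its first unit and the end of its last unit."""
    segments, start = [], None
    for i, is_music in enumerate(flags):
        if is_music and start is None:
            start = i                     # segment begins at this unit
        elif not is_music and start is not None:
            segments.append((start * unit_seconds, i * unit_seconds))
            start = None                  # segment ended at previous unit
    if start is not None:                 # segment runs to the end of the video
        segments.append((start * unit_seconds, len(flags) * unit_seconds))
    return segments

# units 0-1 and 3-5 are music: two segments, 0-2 s and 3-6 s
segs = music_segments([True, True, False, True, True, True])
```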
In the embodiment of the present invention, based on the audio consistency of the music segments in a video, the video processing model detects music segments according to audio feature vectors; the detection result is more accurate, and the adaptivity of the video processing model is stronger.
Referring to Fig. 3, a flowchart of the steps of another video processing method according to an embodiment of the present invention is shown.
The video processing method of the embodiment of the present invention includes the following steps:
Step 301: obtain a to-be-processed video.
Step 302: divide the to-be-processed video evenly into multiple video clips, and divide each video clip into multiple to-be-processed unit videos.
A to-be-processed video may contain multiple music segments. Therefore, in order to save processing time, the to-be-processed video can be divided evenly into multiple video clips that are processed simultaneously.
Fig. 4 is a schematic diagram of a video processing process according to an embodiment of the present invention. The long video in Fig. 4 is the to-be-processed video, and the long video is divided into multiple to-be-processed unit videos.
Step 303: call multiple preset processes simultaneously.
In the embodiment of the present invention, if a single process were used to handle the multiple to-be-processed unit videos divided from the multiple video clips, the processing efficiency would be low. Therefore, multiple processes equal in number to the video clips can be set up and called simultaneously, each handling the to-be-processed unit videos divided from one video clip, to improve processing efficiency. The multiple processes can be stored in a process pool.
Taking 3 processes as an example, the process pool in Fig. 4 includes a first process process1, a second process process2, and a third process process3.
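The process pool described above can be sketched with Python's multiprocessing module. The placeholder worker function and the string "unit videos" are assumptions for illustration, not the embodiment's actual feature-extraction worker:

```python
from multiprocessing import Pool

def process_clip(clip_units):
    # placeholder worker: would extract one audio feature vector per
    # to-be-processed unit video in the clip (hypothetical input format)
    return [len(unit) for unit in clip_units]

if __name__ == "__main__":
    # three video clips, each already divided into unit videos
    clips = [["u1", "u2"], ["u3", "u4"], ["u5", "u6"]]
    with Pool(processes=len(clips)) as pool:   # one process per clip
        results = pool.map(process_clip, clips)
```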
Step 304: for each to-be-processed unit video divided from a video clip, use one process to obtain the audio feature vector corresponding to the to-be-processed unit video.
In each process, for each to-be-processed unit video divided from one video clip, the audio feature vector corresponding to the to-be-processed unit video is obtained.
Each of the 3 processes in Fig. 4 is provided with a neural network model, which can specifically be Audio VGGish. The to-be-processed unit videos divided from each video clip are respectively input into the Audio VGGish in one process, and the 128-dimensional audio feature vector corresponding to each to-be-processed unit video is obtained using Audio VGGish.
Step 304 is similar to step 203 above; refer to the relevant description of step 203 for details, which the embodiment of the present invention does not discuss again here.
Step 305: input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, into the pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model.
In each process, the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, are input into the pre-trained video processing model. Each of the three processes in Fig. 4 is provided with a video processing model, which may specifically be fully connected layers (FCs).
The video processing model outputs the prediction probability that the unit video to be processed belongs to the music category; when the prediction probability is greater than or equal to a set probability threshold, the unit video to be processed is determined to belong to the music category. In Fig. 4, the confidence indicates this prediction probability; when the confidence is greater than or equal to 0.7, the unit is determined to belong to the music category.
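The thresholding step can be sketched as follows (the 0.7 threshold comes from the text; the probability values are made-up illustrations):

```python
THRESHOLD = 0.7

def is_music(prob, threshold=THRESHOLD):
    # A unit belongs to the music category when its predicted
    # probability meets or exceeds the threshold.
    return prob >= threshold

probs = [0.92, 0.65, 0.70, 0.31]
labels = [is_music(p) for p in probs]
print(labels)  # [True, False, True, False]
```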
Step 306: search the multiple unit videos to be processed for unit videos to be processed in which a category mutation occurs.
In practical applications, a music segment may briefly mutate to non-music, and vice versa; for example, a short spoken passage may occur inside a song. In such cases, a unit video to be processed whose category mutates will appear among multiple consecutive unit videos to be processed. For example, if among 20 consecutive unit videos to be processed the 10th belongs to the non-music category while the other 19 belong to the music category, then the 10th unit video to be processed is a unit video to be processed in which a category mutation occurs. If such a category-mutated unit video to be processed were used as the split point between a music segment and a non-music segment, segmentation errors would result.
Step 307: according to at least three consecutive unit videos to be processed, including the unit video to be processed in which the category mutation occurs, determine whether the category-mutated unit video to be processed belongs to the music category.
For the above situation, in this embodiment of the present invention, after determining whether each unit video to be processed belongs to the music category, the determined results may be smoothed so that the results are more accurate. The smoothing may be performed by voting; that is, whether the category-mutated unit video to be processed belongs to the music category is determined by voting.
Determining by voting whether the category-mutated unit video to be processed belongs to the music category may include: among at least three consecutive unit videos to be processed, including the category-mutated unit video to be processed, obtaining the number of unit videos to be processed belonging to the music category and the number belonging to the non-music category, and taking the category with the larger count as the category of the category-mutated unit video to be processed.
For example, for a unit video to be processed in which a category mutation occurs, the categories of that unit video to be processed and of the 7 unit videos to be processed on each side of it are obtained, i.e., the categories of 15 unit videos to be processed in total. If the majority of them belong to the music category, the category-mutated unit video to be processed is determined to belong to the music category, and vice versa.
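The voting smoothing described above can be sketched as a sliding majority vote over a 15-unit window (the window size follows the example; the label sequence is illustrative):

```python
def smooth_labels(labels, half_window=7):
    """Majority-vote each label over itself and half_window neighbors per side."""
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half_window)
        hi = min(len(labels), i + half_window + 1)
        window = labels[lo:hi]
        # Keep the category held by the majority of the window.
        smoothed.append(sum(window) > len(window) / 2)
    return smoothed

# 19 music units with one non-music mutation at index 9.
labels = [True] * 9 + [False] + [True] * 10
print(smooth_labels(labels)[9])  # True: the mutation is voted away
```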
Step 308: splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
The unit videos to be processed that consecutively belong to the music category are spliced to obtain the music segments in the video to be processed. For example, the detection result in Fig. 4 yields Song 1, Song 2, Song 3, and Song 4, i.e., 4 music segments.
In this embodiment of the present invention, the process pool technique greatly improves processing efficiency.
It should be noted that, for simplicity of description, the method embodiments are stated as a series of action combinations. However, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of a model generating apparatus according to an embodiment of the present invention is shown.
The model generating apparatus of this embodiment of the present invention includes a sample acquisition module 501, a first division module 502, a first vector obtaining module 503, and a training module 504.
The sample acquisition module 501 is configured to obtain a training sample. The training sample includes a sample video and annotation information of the sample video; the annotation information indicates whether the sample video belongs to the music category.
The first division module 502 is configured to divide the sample video into multiple unit sample videos.
The first vector obtaining module 503 is configured to, for each unit sample video, obtain the audio feature vector corresponding to that unit sample video.
The training module 504 is configured to take the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, train a preset initial model, and determine the trained model as the video processing model.
In an optional implementation, the first vector obtaining module 503 includes: a first generation unit, configured to generate the spectrogram corresponding to the audio signal in the unit sample video; and a first determination unit, configured to input the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In an optional implementation, the first generation unit includes: a first framing subunit, configured to frame the audio signal in the unit sample video to obtain multiple audio signal frames; a first processing subunit, configured to apply windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and a first transformation subunit, configured to apply a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and take the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
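The framing, windowing, Fourier transform, and Mel transform performed by these subunits can be sketched in NumPy (parameter values such as a 16 kHz sample rate, 400-sample frames, and 64 mel bands are illustrative assumptions, not specified by the patent):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Framing: slice the signal into overlapping frames.
    n = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale (simplified).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(x, sr=16000, frame_len=400, hop=160, n_mels=64):
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)  # windowing
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2               # initial spectrogram
    mel = spectrum @ mel_filterbank(n_mels, frame_len, sr).T          # Mel transform
    return np.log(mel + 1e-6)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
print(log_mel_spectrogram(x).shape)  # (98, 64)
```

Audio libraries such as librosa provide equivalent, more robust implementations of this pipeline.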
In an optional implementation, the training module 504 includes: a probability obtaining unit, configured to randomly select at least two consecutive unit sample videos, splice the audio feature vectors corresponding to the selected unit sample videos, and input the result into the initial model to obtain the prediction probability that the sample video belongs to the music category; a loss obtaining unit, configured to calculate the loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the music category and the annotation information of the sample video; and a training detection unit, configured to determine that training is complete when the loss value is less than a set loss threshold.
Referring to Fig. 6, a structural block diagram of a video processing apparatus according to an embodiment of the present invention is shown.
The video processing apparatus of this embodiment of the present invention includes a video acquiring module 601, a second division module 602, a second vector obtaining module 603, a category determination module 604, and a segment determining module 605.
The video acquiring module 601 is configured to obtain the video to be processed.
The second division module 602 is configured to divide the video to be processed into multiple unit videos to be processed.
The second vector obtaining module 603 is configured to, for each unit video to be processed, obtain the audio feature vector corresponding to that unit video to be processed.
The category determination module 604 is configured to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, into the pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model. The video processing model is generated by the model generating apparatus shown in Fig. 5.
The segment determining module 605 is configured to splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
In an optional implementation, the second vector obtaining module 603 includes: a second generation unit, configured to generate the spectrogram corresponding to the audio signal in the unit video to be processed; and a second determination unit, configured to input the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
In an optional implementation, the second generation unit includes: a second framing subunit, configured to frame the audio signal in the unit video to be processed to obtain multiple audio signal frames; a second processing subunit, configured to apply windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video to be processed; and a second transformation subunit, configured to apply a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and take the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit video to be processed.
In an optional implementation, the second division module 602 includes: a snippet extraction unit, configured to evenly divide the video to be processed into multiple video clips; and a segment division unit, configured to divide each video clip into multiple unit videos to be processed.
In an optional implementation, the second vector obtaining module 603 includes: a process call unit, configured to simultaneously invoke multiple preset processes; and a process processing unit, configured to, for each unit video to be processed divided from one video clip, obtain the audio feature vector corresponding to that unit video to be processed using one process.
In an optional implementation, the apparatus further includes: a searching module, configured to search the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and a determining module, configured to determine, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
In an optional implementation, the determining module includes: a number acquiring unit, configured to obtain, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number belonging to the non-music category; and a number comparing unit, configured to take the category with the larger count as the category of the category-mutated unit video to be processed.
In an optional implementation, the category determination module 604 is configured to compare whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the music category is greater than or equal to a set probability threshold, and, when it is, determine that the unit video to be processed belongs to the music category.
Compared with the prior art, when splitting a music-program video to be processed, the embodiments of the present invention determine whether each unit video to be processed itself belongs to the music category, and thereby whether it is part of a music segment, before splitting the video to be processed. This avoids the inaccurate splitting that results from splitting only according to whether the scene image information changes significantly, so the music segments obtained by splitting in the embodiments of the present invention are more accurate.
Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
In an embodiment of the present invention, an electronic device is further provided. For example, the electronic device may be provided as a server. The electronic device may include one or more processors and a memory for storing processor-executable instructions, such as an application program. The processor is configured to execute the above model generating method and/or video processing method.
In an embodiment of the present invention, a non-transitory computer-readable storage medium including instructions is further provided, for example a memory including instructions, where the instructions are executable by the processor of an electronic device to complete the above model generating method and/or video processing method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, once persons skilled in the art learn of the basic inventive concept, additional changes and modifications can be made to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. In the absence of further limitations, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The model generation and video processing methods, apparatuses, electronic device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is merely intended to help understand the method of the present invention and its core ideas. Meanwhile, for persons of ordinary skill in the art, there will be changes in the specific implementations and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (20)
1. A model generating method, characterized in that the method comprises:
obtaining a training sample, wherein the training sample includes a sample video and annotation information of the sample video, and the annotation information indicates whether the sample video belongs to the music category;
dividing the sample video into multiple unit sample videos;
for each unit sample video, obtaining the audio feature vector corresponding to the unit sample video; and
taking the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, training a preset initial model, and determining the trained model as the video processing model.
2. The method according to claim 1, characterized in that obtaining the audio feature vector corresponding to the unit sample video comprises:
generating the spectrogram corresponding to the audio signal in the unit sample video; and
inputting the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
3. The method according to claim 2, characterized in that generating the spectrogram corresponding to the audio signal in the unit sample video comprises:
framing the audio signal in the unit sample video to obtain multiple audio signal frames;
applying windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
4. The method according to claim 1, characterized in that taking the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output and training the preset initial model comprises:
randomly selecting at least two consecutive unit sample videos, splicing the audio feature vectors corresponding to the selected unit sample videos, and inputting the result into the initial model to obtain the prediction probability that the sample video belongs to the music category;
calculating the loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the music category and the annotation information of the sample video; and
determining that training is complete when the loss value is less than a set loss threshold.
5. A video processing method, characterized in that the method comprises:
obtaining a video to be processed;
dividing the video to be processed into multiple unit videos to be processed;
for each unit video to be processed, obtaining the audio feature vector corresponding to the unit video to be processed;
inputting the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model, and determining whether the unit video to be processed belongs to the music category according to the output of the video processing model, wherein the video processing model is generated by the method according to any one of claims 1 to 4; and
splicing the consecutive unit videos to be processed among those belonging to the music category to obtain the music segments in the video to be processed.
6. The method according to claim 5, characterized in that obtaining the audio feature vector corresponding to the unit video to be processed comprises:
generating the spectrogram corresponding to the audio signal in the unit video to be processed; and
inputting the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
7. The method according to claim 6, characterized in that generating the spectrogram corresponding to the audio signal in the unit video to be processed comprises:
framing the audio signal in the unit video to be processed to obtain multiple audio signal frames;
applying windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video to be processed; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit video to be processed.
8. The method according to claim 5, characterized in that dividing the video to be processed into multiple unit videos to be processed comprises:
evenly dividing the video to be processed into multiple video clips; and
dividing each video clip into multiple unit videos to be processed.
9. The method according to claim 8, characterized in that, for each unit video to be processed, obtaining the audio feature vector corresponding to the unit video to be processed comprises:
simultaneously invoking multiple preset processes; and
for each unit video to be processed divided from one video clip, obtaining the audio feature vector corresponding to the unit video to be processed using one process.
10. The method according to claim 5, characterized in that, before splicing the consecutive unit videos to be processed among those belonging to the music category to obtain the music segments in the video to be processed, the method further comprises:
searching the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and
determining, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
11. The method according to claim 10, characterized in that determining, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category comprises:
obtaining, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number of unit videos to be processed belonging to the non-music category; and
taking the category with the larger count as the category of the category-mutated unit video to be processed.
12. The method according to claim 5, characterized in that determining whether the unit video to be processed belongs to the music category according to the output of the video processing model comprises:
comparing whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the music category is greater than or equal to a set probability threshold; and
when it is greater than or equal to the threshold, determining that the unit video to be processed belongs to the music category.
13. A model generating apparatus, characterized in that the apparatus comprises:
a sample acquisition module, configured to obtain a training sample, wherein the training sample includes a sample video and annotation information of the sample video, and the annotation information indicates whether the sample video belongs to the music category;
a first division module, configured to divide the sample video into multiple unit sample videos;
a first vector obtaining module, configured to, for each unit sample video, obtain the audio feature vector corresponding to the unit sample video; and
a training module, configured to take the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, train a preset initial model, and determine the trained model as the video processing model.
14. A video processing apparatus, characterized in that the apparatus comprises:
a video acquiring module, configured to obtain a video to be processed;
a second division module, configured to divide the video to be processed into multiple unit videos to be processed;
a second vector obtaining module, configured to, for each unit video to be processed, obtain the audio feature vector corresponding to the unit video to be processed;
a category determination module, configured to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model, wherein the video processing model is generated by the apparatus according to claim 13; and
a segment determining module, configured to splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
15. The apparatus according to claim 14, characterized in that the second division module comprises:
a snippet extraction unit, configured to evenly divide the video to be processed into multiple video clips; and
a segment division unit, configured to divide each video clip into multiple unit videos to be processed.
16. The apparatus according to claim 15, characterized in that the second vector obtaining module comprises:
a process call unit, configured to simultaneously invoke multiple preset processes; and
a process processing unit, configured to, for each unit video to be processed divided from one video clip, obtain the audio feature vector corresponding to the unit video to be processed using one process.
17. The apparatus according to claim 14, characterized in that the apparatus further comprises:
a searching module, configured to search the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and
a determining module, configured to determine, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
18. The apparatus according to claim 17, characterized in that the determining module comprises:
a number acquiring unit, configured to obtain, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number of unit videos to be processed belonging to the non-music category; and
a number comparing unit, configured to take the category with the larger count as the category of the category-mutated unit video to be processed.
19. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions,
wherein the processor is configured to execute the model generating method according to any one of claims 1 to 4 and/or the video processing method according to any one of claims 5 to 12.
20. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the model generation method according to any one of claims 1-4, and/or the video processing method according to any one of claims 5-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910458806.8A CN110324726B (en) | 2019-05-29 | 2019-05-29 | Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110324726A true CN110324726A (en) | 2019-10-11 |
CN110324726B CN110324726B (en) | 2022-02-18 |
Family
ID=68119101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910458806.8A Active CN110324726B (en) | 2019-05-29 | 2019-05-29 | Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110324726B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112104892A (en) * | 2020-09-11 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Multimedia information processing method and device, electronic equipment and storage medium |
CN112750469A (en) * | 2020-02-26 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Method for detecting music in voice, voice communication optimization method and corresponding device |
CN113096624A (en) * | 2021-03-24 | 2021-07-09 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for automatically creating symphony music |
CN113992970A (en) * | 2020-07-27 | 2022-01-28 | 阿里巴巴集团控股有限公司 | Video data processing method and device, electronic equipment and computer storage medium |
CN114222159A (en) * | 2021-12-01 | 2022-03-22 | 北京奇艺世纪科技有限公司 | Method and system for determining video scene change point and generating video clip |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1825936A (en) * | 2006-02-24 | 2006-08-30 | 北大方正集团有限公司 | News video retrieval method based on speech classifying indentification |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN105931635A (en) * | 2016-03-31 | 2016-09-07 | 北京奇艺世纪科技有限公司 | Audio segmentation method and device |
US20170228385A1 (en) * | 2016-02-08 | 2017-08-10 | Hulu, LLC | Generation of Video Recommendations Using Connection Networks |
CN107066488A (en) * | 2016-12-27 | 2017-08-18 | 上海东方明珠新媒体股份有限公司 | Video display bridge section automatic division method based on movie and television contents semantic analysis |
CN108307229A (en) * | 2018-02-02 | 2018-07-20 | 新华智云科技有限公司 | A kind of processing method and equipment of video-audio data |
CN108897829A (en) * | 2018-06-22 | 2018-11-27 | 广州多益网络股份有限公司 | Modification method, device and the storage medium of data label |
CN108989882A (en) * | 2018-08-03 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Method and apparatus for exporting the snatch of music in video |
CN109344780A (en) * | 2018-10-11 | 2019-02-15 | 上海极链网络科技有限公司 | A kind of multi-modal video scene dividing method based on sound and vision |
Also Published As
Publication number | Publication date |
---|---|
CN110324726B (en) | 2022-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110324726A (en) | Model generation, method for processing video frequency, device, electronic equipment and storage medium | |
CN110213670A (en) | Method for processing video frequency, device, electronic equipment and storage medium | |
Giannoulis et al. | A database and challenge for acoustic scene classification and event detection | |
CN107086040A (en) | Speech recognition capabilities method of testing and device | |
Krijnders et al. | Sound event recognition through expectancy-based evaluation of signal-driven hypotheses |
CN110324657A (en) | Model generation, method for processing video frequency, device, electronic equipment and storage medium | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN109979483A (en) | Melody detection method, device and the electronic equipment of audio signal | |
CN109979485B (en) | Audio evaluation method and device | |
Wang et al. | Local business ambience characterization through mobile audio sensing | |
CN111696580A (en) | Voice detection method and device, electronic equipment and storage medium | |
Müller et al. | Interactive fundamental frequency estimation with applications to ethnomusicological research | |
CN109997186A (en) | A kind of device and method for acoustic environment of classifying | |
CN104700831B (en) | The method and apparatus for analyzing the phonetic feature of audio file | |
CN113781989B (en) | Audio animation playing and rhythm stuck point identifying method and related device | |
JP2005292207A (en) | Method of music analysis | |
CN104143340B (en) | A kind of audio frequency assessment method and device | |
Qais et al. | Deepfake audio detection with neural networks using audio features | |
WO2019053544A1 (en) | Identification of audio components in an audio mix | |
Karantaidis et al. | Assessing spectral estimation methods for electric network frequency extraction | |
CN104882152A (en) | Method and apparatus for generating lyric file | |
KR102077642B1 (en) | Sight-singing evaluation system and Sight-singing evaluation method using the same | |
Bhatia et al. | Analysis of audio features for music representation | |
Shirali-Shahreza et al. | Fast and scalable system for automatic artist identification | |
Ganapathy et al. | Temporal resolution analysis in frequency domain linear prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||