CN110324726A - Model generation method, video processing method, apparatus, electronic device and storage medium - Google Patents
Model generation method, video processing method, apparatus, electronic device and storage medium
- Publication number
- CN110324726A CN110324726A CN201910458806.8A CN201910458806A CN110324726A CN 110324726 A CN110324726 A CN 110324726A CN 201910458806 A CN201910458806 A CN 201910458806A CN 110324726 A CN110324726 A CN 110324726A
- Authority
- CN
- China
- Prior art keywords
- video
- unit
- processed
- sample
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/08 — Neural network learning methods
- G10L25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks
- G10L25/51 — Speech or voice analysis specially adapted for comparison or discrimination
- G10L25/57 — Speech or voice analysis for comparison or discrimination, for processing of video signals
- H04N21/233 — Server-side processing of audio elementary streams
- H04N21/234 — Server-side processing of video elementary streams
- H04N21/23424 — Splicing one content stream with another, e.g. for inserting or substituting an advertisement
- H04N21/4394 — Client-side analysis of the audio stream, e.g. detecting features or characteristics in audio streams
- H04N21/44 — Client-side processing of video elementary streams
- H04N21/44016 — Splicing one content stream with another, e.g. for substituting a video clip
- H04N21/8455 — Structuring of content involving pointers to the content, e.g. to the I-frames of the video stream
- H04N21/8456 — Structuring of content by decomposing it in the time domain, e.g. into time segments
Abstract
The present invention provides a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium. The model generation method includes: obtaining a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category; dividing the sample video into multiple unit sample videos; obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video; and training a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, the trained model being determined as the video processing model. The invention avoids the inaccurate splitting caused by splitting a video solely according to whether its scene images change substantially, so the music segments obtained by splitting are more accurate.
Description
Technical field
The present invention relates to the field of Internet technologies, and in particular to a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium.
Background
With the rapid development of the Internet, music-program videos of all kinds keep emerging, capturing viewers' attention and offering an audiovisual feast: for example, music competition videos from variety channels, such as The Voice of China, and concert videos from music channels. To meet users' needs — a user may care most about a particular music segment within a music-program video — such a video can be split into multiple music segments.
The prior art generally splits music-program videos using a scene-change-based method: the video is split according to whether the scene images change substantially, and the time points at which substantial scene-image change occurs are used as split points.
However, in music-program videos the scene often switches while the audio does not. Splitting at such scene-switch time points is inaccurate, so the accuracy of the above method is low.
Summary of the invention
Embodiments of the present invention provide a model generation method, a video processing method, corresponding apparatus, an electronic device, and a storage medium, to solve the problem that existing methods for splitting music-program videos have low accuracy.
In a first aspect, an embodiment of the invention provides a model generation method, the method including:
obtaining a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category;
dividing the sample video into multiple unit sample videos;
obtaining, for each unit sample video, the audio feature vector corresponding to that unit sample video; and
training a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and determining the trained model as the video processing model.
Optionally, obtaining the audio feature vector corresponding to the unit sample video includes: generating a spectrogram of the audio signal in the unit sample video; inputting the spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
Optionally, generating the spectrogram of the audio signal in the unit sample video includes: framing the audio signal in the unit sample video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram of the audio signal in the unit sample video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram of the audio signal in the unit sample video.
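The framing, windowing, Fourier-transform and Mel-transform steps above can be sketched as follows. This is a minimal illustration with assumed frame/hop sizes and a naive O(n²) DFT standing in for the FFT a real pipeline would use:

```python
import math

def mel_spectrogram(signal, sr=8000, frame_len=64, hop=32, n_mels=8):
    """Framing -> Hann window -> DFT power spectrum -> Mel filterbank.
    Parameter values are illustrative, not from the patent."""
    hz2mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Framing: split the signal into overlapping frames
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]

    # Windowing + Fourier transform -> initial (power) spectrogram
    n_bins = frame_len // 2 + 1
    spec = []
    for frame in frames:
        win = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len))
               for i, s in enumerate(frame)]
        row = []
        for k in range(n_bins):
            re = sum(w * math.cos(2 * math.pi * k * i / frame_len) for i, w in enumerate(win))
            im = sum(w * math.sin(2 * math.pi * k * i / frame_len) for i, w in enumerate(win))
            row.append(re * re + im * im)
        spec.append(row)

    # Mel transform: triangular filters evenly spaced on the Mel scale
    mels = [hz2mel(0) + (hz2mel(sr / 2) - hz2mel(0)) * j / (n_mels + 1)
            for j in range(n_mels + 2)]
    bins = [int((frame_len + 1) * mel2hz(m) / sr) for m in mels]
    mel_spec = []
    for row in spec:
        mel_row = []
        for m in range(1, n_mels + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            e = sum(row[k] * (k - l) / max(c - l, 1) for k in range(l, c))
            e += sum(row[k] * (r - k) / max(r - c, 1) for k in range(c, r))
            mel_row.append(e)
        mel_spec.append(mel_row)
    return mel_spec
```

In practice a library such as librosa would compute this directly, but the three stages map one-to-one onto the claim's framing, windowing/Fourier, and Mel-conversion steps.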
Optionally, training the preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output includes: randomly selecting at least two consecutive unit sample videos; concatenating the audio feature vectors of the selected unit sample videos and inputting the result into the initial model to obtain the predicted probability that the sample video belongs to the music category; computing the loss value of the sample video from the predicted probability and the annotation information of the sample video; and determining that training is complete when the loss value is smaller than a set loss threshold.
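The training procedure above (concatenate the feature vectors of consecutive units, predict a music probability, compute a loss against the annotation, stop once the loss falls below a set threshold) can be sketched with a plain logistic model on hypothetical toy features; the patent's initial model is a neural network, which this stand-in only approximates:

```python
import math
import random

random.seed(0)
DIM = 8  # per-unit feature dimension (illustrative)

def make_sample(label):
    """Two consecutive units' feature vectors, concatenated.
    Toy data: music clusters around +1, non-music around -1."""
    base = 1.0 if label == 1 else -1.0
    return [base + random.gauss(0, 0.3) for _ in range(2 * DIM)], label

samples = [make_sample(i % 2) for i in range(64)]

w = [0.0] * (2 * DIM)
b = 0.0
lr, loss_threshold = 0.5, 0.05
loss = float("inf")

for step in range(5000):
    x, y = random.choice(samples)                 # randomly selected consecutive units
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))                # predicted music-category probability
    loss = -(y * math.log(p + 1e-9) + (1 - y) * math.log(1 - p + 1e-9))
    if loss < loss_threshold:                     # training is determined complete
        break
    g = p - y                                     # cross-entropy gradient
    w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    b -= lr * g
```

The stopping rule mirrors the claim: training ends when the sample's loss value drops below the preset loss threshold.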
In a second aspect, an embodiment of the invention provides a video processing method, the method including:
obtaining a video to be processed;
dividing the video to be processed into multiple unit videos;
obtaining, for each unit video, the audio feature vector corresponding to that unit video;
inputting the audio feature vectors of at least two consecutive unit videos, including the unit video in question, into a pre-generated video processing model, and determining from the output of the video processing model whether the unit video belongs to the music category, where the video processing model is generated using any of the methods described above; and
splicing consecutive unit videos among those that belong to the music category, to obtain the music segments in the video to be processed.
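The final splicing step — merging runs of consecutive music-category units into complete music segments — can be sketched as follows, assuming one model probability per fixed-length unit (unit length and threshold are illustrative):

```python
def music_segments(unit_probs, unit_seconds=1.0, threshold=0.5):
    """Group consecutive music-category units into (start, end) segments in seconds."""
    is_music = [p >= threshold for p in unit_probs]
    segments, start = [], None
    for i, m in enumerate(is_music):
        if m and start is None:
            start = i                              # a music segment begins
        if not m and start is not None:
            segments.append((start * unit_seconds, i * unit_seconds))
            start = None                           # the segment ends
    if start is not None:                          # segment runs to the end of the video
        segments.append((start * unit_seconds, len(is_music) * unit_seconds))
    return segments

# Units 1-2 and 4-5 are music, so two segments are spliced out.
segs = music_segments([0.1, 0.9, 0.8, 0.2, 0.7, 0.9])
```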
Optionally, obtaining the audio feature vector corresponding to the unit video includes: generating a spectrogram of the audio signal in the unit video; inputting the spectrogram into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video.
Optionally, generating the spectrogram of the audio signal in the unit video includes: framing the audio signal in the unit video to obtain multiple audio signal frames; applying windowing and a Fourier transform to each audio signal frame to obtain an initial spectrogram of the audio signal in the unit video; and applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram of the audio signal in the unit video.
Optionally, dividing the video to be processed into multiple unit videos includes: evenly dividing the video to be processed into multiple video clips, and dividing each video clip into multiple unit videos.
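This two-level division (video → equal clips → fixed-length units) could look like the sketch below; the clip count and unit length are illustrative choices, not values from the patent:

```python
def split_video(duration, n_clips, unit_seconds):
    """Evenly split a video of `duration` seconds into `n_clips` clips,
    then split each clip into fixed-length units.
    Returns, per clip, a list of (start, end) unit intervals in seconds."""
    clip_len = duration / n_clips
    clips = []
    for c in range(n_clips):
        start, end = c * clip_len, (c + 1) * clip_len
        units, t = [], start
        while t < end - 1e-9:
            units.append((t, min(t + unit_seconds, end)))  # last unit may be shorter
            t += unit_seconds
        clips.append(units)
    return clips

# A 60 s video split into 3 clips of 20 s, each holding four 5 s units.
clips = split_video(60, 3, 5)
```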
Optionally, obtaining, for each unit video, the corresponding audio feature vector includes: calling multiple preset processes simultaneously, and for each unit video divided from a given video clip, obtaining the audio feature vector of that unit video using one process.
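A hedged sketch of this per-clip parallel feature extraction. For portability it uses the thread-backed `multiprocessing.dummy.Pool`, which exposes the same `map` API as the process-backed `multiprocessing.Pool` the patent implies; `extract_features` is a hypothetical stand-in for the real spectrogram-plus-network pipeline:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; same API as multiprocessing.Pool

def extract_features(unit_samples):
    """Hypothetical per-unit feature extraction: mean and peak of the raw samples."""
    return (sum(unit_samples) / len(unit_samples), max(unit_samples))

def features_for_clip(clip_units, n_workers=4):
    """Fan the units of one clip out across the worker pool, one unit per worker call."""
    with Pool(n_workers) as pool:
        return pool.map(extract_features, clip_units)

# Two tiny "units" of audio samples from one clip.
feats = features_for_clip([[1, 2, 3], [4, 5, 6]])
```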
Optionally, before splicing the consecutive unit videos that belong to the music category to obtain the music segments in the video to be processed, the method further includes: finding, among the multiple unit videos, a unit video whose category changes abruptly; and determining whether the unit video with the abrupt category change belongs to the music category according to at least three consecutive unit videos that include it.
Optionally, determining whether the unit video with the abrupt category change belongs to the music category according to the at least three consecutive unit videos that include it includes: obtaining, among those consecutive unit videos, the number of unit videos that belong to the music category and the number that belong to the non-music category, and taking the category with the greater number as the category of the unit video with the abrupt category change.
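The majority vote over a window of consecutive units might be implemented as below. For simplicity this sketch re-labels every unit, whereas the patent only re-examines units whose category changes abruptly; labels are 1 for music and 0 for non-music:

```python
def smooth_labels(labels, window=3):
    """Correct isolated category mutations by majority vote over `window`
    consecutive units centred on each unit (window assumed odd, >= 3)."""
    half = window // 2
    out = list(labels)
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        neigh = labels[lo:hi]
        music = sum(neigh)                       # count of music-category units
        out[i] = 1 if music > len(neigh) - music else 0
    return out

# A lone non-music unit inside a music run is flipped back to music.
smoothed = smooth_labels([1, 1, 0, 1, 1])
```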
Optionally, determining from the output of the video processing model whether the unit video belongs to the music category includes: determining whether the predicted probability, output by the video processing model, that the unit video belongs to the music category is greater than or equal to a set probability threshold, and if so, determining that the unit video belongs to the music category.
In a third aspect, an embodiment of the invention provides a model generation apparatus, the apparatus including:
a sample acquisition module, configured to obtain a training sample, the training sample including a sample video and annotation information of the sample video, where the annotation information indicates whether the sample video belongs to the music category;
a first division module, configured to divide the sample video into multiple unit sample videos;
a first vector acquisition module, configured to obtain, for each unit sample video, the audio feature vector corresponding to that unit sample video; and
a training module, configured to train a preset initial model using the audio feature vectors of at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, and to determine the trained model as the video processing model.
In a fourth aspect, an embodiment of the invention provides a video processing apparatus, the apparatus including:
a video acquisition module, configured to obtain a video to be processed;
a second division module, configured to divide the video to be processed into multiple unit videos;
a second vector acquisition module, configured to obtain, for each unit video, the audio feature vector corresponding to that unit video;
a category determination module, configured to input the audio feature vectors of at least two consecutive unit videos, including the unit video in question, into a pre-generated video processing model, and to determine from the output of the video processing model whether the unit video belongs to the music category, where the video processing model is generated using the method described above; and
a segment determination module, configured to splice consecutive unit videos among those that belong to the music category, to obtain the music segments in the video to be processed.
Optionally, the second division module includes: a clip extraction unit, configured to evenly divide the video to be processed into multiple video clips; and a clip division unit, configured to divide each video clip into multiple unit videos.
Optionally, the second vector acquisition module includes: a process calling unit, configured to call multiple preset processes simultaneously; and a process handling unit, configured to obtain, for each unit video divided from a given video clip, the audio feature vector of that unit video using one process.
Optionally, the apparatus further includes: a search module, configured to find, among the multiple unit videos, a unit video whose category changes abruptly; and a determination module, configured to determine whether the unit video with the abrupt category change belongs to the music category according to at least three consecutive unit videos that include it.
Optionally, the determination module includes: a count acquisition unit, configured to obtain, among the consecutive unit videos, the number of unit videos that belong to the music category and the number that belong to the non-music category; and a count comparison unit, configured to take the category with the greater number as the category of the unit video with the abrupt category change.
In a fifth aspect, an embodiment of the invention provides an electronic device, including a processor and a memory storing instructions executable by the processor, where the processor is configured to execute any of the model generation methods described above and/or any of the video processing methods described above.
In a sixth aspect, an embodiment of the invention provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device is enabled to execute any of the model generation methods described above and/or any of the video processing methods described above.
In embodiments of the present invention, multiple sample videos that belong to the music category and multiple sample videos that do not are obtained, and a video processing model for detecting music segments in a video is trained on the audio feature vectors of those sample videos, so the model can detect from a video's audio feature vectors whether the video belongs to the music category. A video to be detected contains both music segments and non-music segments; when the video processing model is used to detect the music segments, the video is divided into multiple unit videos, and the model detects from each unit's audio feature vector whether that unit belongs to the music category. If a unit video belongs to the music category, it can be determined to be part of a music segment; if it does not, it can be determined to be part of a non-music segment. Consequently, if several consecutive unit videos all belong to the music category, they can be determined to belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment.
In the prior art, a music-program video is split according to whether its scene images change substantially, with the time points of substantial scene-image change used as split points; but the scene may switch while the audio does not, so this approach can split a single music segment apart, and splitting at scene-switch time points is inaccurate. Compared with the prior art, embodiments of the present invention split a music-program video by recognizing whether each unit video itself belongs to the music category, and hence whether it is part of a music segment. This avoids the inaccurate splitting caused by relying solely on substantial scene-image changes, so the music segments obtained by the embodiments of the present invention are more accurate.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of a model generation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the steps of a video processing method according to an embodiment of the present invention;
Fig. 3 is a flowchart of the steps of another video processing method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a video processing procedure according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a model generation apparatus according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a video processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, a flowchart of the steps of a model generation method according to an embodiment of the present invention is shown.
The model generation method of the embodiment of the present invention includes the following steps:
Step 101: obtain training samples.
When training the model, a large number of sample videos from music-program-class videos can first be obtained from the Internet. The sample videos may include music videos and non-music videos: a music video may be a music segment in a music-program-class video, and a non-music video may be a non-music segment, such as talking or an advertisement, in a music-program-class video. The sample videos are annotated by annotators to obtain the annotation information of each sample video; the annotation information indicates whether the sample video belongs to the music category. For example, annotation information "1" indicates that a sample video is of the music category, and annotation information "0" indicates that a sample video is of the non-music category. A sample video together with its annotation information forms one training sample, and a large number of such training samples form the training sample set. Each training sample is processed in the same way, so the embodiment of the present invention mainly describes the processing of one training sample.
In the embodiment of the present invention, sample videos can be obtained from multiple different types of music-program-class videos to ensure the diversity of the samples, and equal numbers of music videos and non-music videos can be obtained to ensure the balance of the samples. For example, 2000 sample videos are obtained from music talent-show videos on a variety channel, of which 1000 are music videos and 1000 are non-music videos; and 2000 sample videos are obtained from concert videos on a music channel, of which 1000 are music videos and 1000 are non-music videos. The above 4000 sample videos and their annotation information are used as the training sample set.
As for the specific duration of each sample video, those skilled in the art may select any suitable value based on practical experience; for example, the duration can be 8 s, 9 s, 10 s, and so on.
Step 102: divide the sample video into multiple unit sample videos.
The embodiment of the present invention trains a video processing model for detecting the music segments in a video. Considering that a music segment in a video has audio consistency, that is, the audio within one music segment should all be of the music category, whether a video is of the music category can be judged from its audio feature vector. The video processing model in the embodiment of the present invention therefore detects the music category mainly based on audio feature vectors.
A sample video is divided into multiple unit sample videos for analysis.
In an optional embodiment, the sample video can be divided into multiple unit sample videos in units of a set duration. As for the specific value of the set duration, those skilled in the art may select any suitable value based on practical experience. For example, if a neural network model is used to obtain the audio feature vectors and the neural network model can process audio signals of 1 s, the set duration can be set to 1 s, and so on.
Step 103: for each unit sample video, obtain the audio feature vector corresponding to the unit sample video.
The audio feature vector corresponding to each unit sample video is obtained separately.
For example, for a sample video A with a duration of 9 s, sample video A is divided in units of 1 s into unit sample videos 1 through 9, i.e., 9 unit sample videos in total. The audio feature vectors corresponding to unit sample videos 1 through 9 are then obtained separately.
In an optional embodiment, obtaining the audio feature vector corresponding to a unit sample video may include steps A1 to A2.
Step A1: generate the spectrogram corresponding to the audio signal in the unit sample video.
Step A1 can further include steps A11 to A13.
Step A11: perform framing processing on the audio signal in the unit sample video to obtain multiple audio signal frames.
The audio signal is extracted from the unit sample video, and framing processing is performed on the audio signal in the unit sample video.
An audio signal is non-stationary macroscopically but stationary microscopically: it has short-term stationarity (within roughly 10 to 30 ms the audio signal can be considered approximately unchanged). The audio signal can therefore be divided into short sections for processing; this is framing, and each short section after framing is called an audio signal frame. For example, an overlapping framing method can be used: instead of intercepting frames back to back, successive frames overlap by a part. The overlapping part between the previous frame and the next frame is called the frame shift, and the ratio of the frame shift to the frame length is generally 0 to 0.5. The specific frame length can be set according to actual conditions, and the number of frames per second can be set to 33 to 100.
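As a sketch of the overlapped framing above, the following uses illustrative values only (a 16 kHz sample rate, 25 ms frame length, and 10 ms frame shift are assumptions, not values fixed by the embodiment; the frame-shift/frame-length ratio here is 0.4, within the 0-0.5 range mentioned):

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D audio signal into overlapping frames (frame shift = hop)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

sr = 16000                                        # assumed sample rate
audio = np.zeros(sr)                              # one 1 s audio unit
frames = frame_signal(audio,
                      frame_len=int(0.025 * sr),  # 25 ms frame length
                      hop=int(0.010 * sr))        # 10 ms frame shift
# frames.shape == (98, 400): roughly 100 frames per second of audio
```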
Step A12: perform windowing processing and Fourier transform processing on each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video.
Audio changes constantly over a long range, and a signal without fixed characteristics cannot be processed, so windowing processing is performed on each audio signal frame: the audio signal frame is multiplied by a window function. The purpose of windowing is to eliminate the signal discontinuities that may arise at the two ends of each audio signal frame, making the whole more continuous. The cost of windowing is that the parts at the two ends of an audio signal frame are weakened, which is why frames must overlap during framing. In practical applications, common window functions for windowing audio signal frames include the rectangular window, the Hamming window, the Hanning window, and so on; according to the frequency-domain characteristics of the window functions, the Hamming window may preferably be used.
Since it is generally difficult to see the characteristics of an audio signal from its transformation in the time domain, the signal is usually converted into an energy distribution in the frequency domain for observation, and different energy distributions can represent the characteristics of different speech. So after the windowing processing, Fourier transform processing is performed on each windowed audio signal frame to obtain the energy distribution over the spectrum, yielding the spectrum of each audio signal frame and thereby the initial spectrogram corresponding to the audio signal in the unit sample video.
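The windowing and Fourier transform of step A12 can be sketched as follows (a minimal illustration; the frame count and frame length are assumed values carried over from the framing sketch, not parameters specified by the embodiment):

```python
import numpy as np

def initial_spectrogram(frames):
    """Window each frame with a Hamming window, then take the magnitude
    of its Fourier transform: the energy distribution over frequency."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window))

frames = np.random.randn(98, 400)   # 98 frames of 400 samples (assumed)
spec = initial_spectrogram(frames)  # initial spectrogram, shape (98, 201)
```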
Step A13: perform Mel transformation processing on the initial spectrogram to obtain a Mel spectrogram, and use the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
The initial spectrogram is often rather large. In order to obtain audio features of a suitable size, Mel transformation processing can be performed on the initial spectrogram through a Mel filter bank, transforming it into a Mel spectrogram.
The unit of frequency is hertz (Hz), and the frequency range audible to the human ear is 20-20000 Hz, but the human ear does not perceive the Hz scale linearly. For example, after adapting to a tone of 1000 Hz, if the tone frequency is raised to 2000 Hz, our ears can only perceive that the frequency has risen a little, and cannot perceive at all that the frequency has doubled. Ordinary frequency is converted into mel frequency by the following mapping relationship:
mel(f) = 2595 * log10(1 + f / 700)
where f is the ordinary frequency and mel(f) is the mel frequency.
Through the above formula, the human ear's perception of frequency becomes a linear relationship. That is, under the mel frequency, if the mel frequencies of two sections of audio differ by a factor of two, the tones perceived by the human ear will probably also differ by about a factor of two.
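The mapping above can be written as a small helper; note that doubling the ordinary frequency from 1000 Hz to 2000 Hz raises the mel frequency by well under a factor of two, matching the perception example in the text:

```python
import math

def mel(f):
    """Map ordinary frequency f (Hz) to mel frequency."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```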
According to the sensitivity of the human ear to different frequencies in practice, the frequency range is divided into multiple Mel filters to obtain a Mel filter bank; a Mel filter bank may include 20 to 40 Mel filters. In the mel frequency range, the center frequencies of the Mel filters are distributed linearly at equal intervals, but in the ordinary frequency range the intervals are not equal. The initial spectrogram is filtered using the Mel filter bank to obtain a Mel spectrogram, and the Mel spectrogram is determined as the spectrogram corresponding to the audio signal in the unit sample video.
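A Mel filter bank of the kind described can be sketched as follows. The 26 filters, 201 frequency bins, and 8000 Hz upper limit are illustrative assumptions, not parameters specified by the embodiment; each triangular filter's center frequency is equally spaced on the mel scale but not in hertz:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft_bins=201, f_max=8000.0):
    """Triangular Mel filters, uniformly spaced on the mel scale."""
    def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge frequencies: equal intervals in mel, unequal in hertz
    edges = imel(np.linspace(0.0, mel(f_max), n_filters + 2))
    bins = np.floor(edges / f_max * (n_fft_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)  # rising edge
        fbank[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)  # falling edge
    return fbank

fbank = mel_filterbank()
# filtering an initial spectrogram yields the Mel spectrogram
mel_spec = np.random.rand(98, 201) @ fbank.T
```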
Step A2: input the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In the embodiment of the present invention, a neural network model can be used: the spectrogram corresponding to the audio signal in the unit sample video is input into the neural network model, feature extraction is performed inside the neural network model, and the neural network model outputs an audio feature vector, which is the audio feature vector corresponding to the unit sample video.
In an optional embodiment, the VGGish model (based on VGG, Visual Geometry Group), built on the TensorFlow open-source deep learning framework, can be used to extract the audio feature vectors. The VGGish model may include convolutional layers, fully connected layers, and so on, where the convolutional layers can be used to extract features and the fully connected layers can be used to map the extracted features to a corresponding feature vector. The spectrogram corresponding to the audio signal in the unit sample video is therefore input into the VGGish model; the convolutional layers extract the audio features in the spectrogram and feed them into the fully connected layers, which produce a 128-dimensional audio feature vector that the model outputs.
In the embodiment of the present invention, the audio feature vector corresponding to each unit sample video can be saved in the TFRecord format. Data in the TFRecord format is stored in binary form, occupies less disk space, and is faster to read.
Step 104: taking the audio feature vectors corresponding to at least two consecutive unit sample videos as the input, and the annotation information of the sample video as the target of the output, train a preset initial model, and determine the trained model as the video processing model.
If the feature vector corresponding to a single unit sample video were used to represent a whole sample video during training, then, because the duration of one unit sample video is short, its feature vector might not represent the entire sample video accurately and comprehensively. Therefore, in the embodiment of the present invention, the audio feature vectors corresponding to at least two consecutive unit sample videos are used to represent a sample video during training.
For a sample video, the audio feature vectors corresponding to at least two consecutive unit sample videos divided from the sample video are taken as the input, and the annotation information of the sample video is taken as the target of the output, to train the preset initial model.
The process of training the preset initial model may include steps B1 to B3.
Step B1: randomly select at least two consecutive unit sample videos, splice the audio feature vectors corresponding to the selected unit sample videos, and input the result into the initial model to obtain the predicted probability that the sample video belongs to the music category.
The initial model refers to a model with a classification function that has not yet been trained. The initial model can analyze the input audio feature vectors and output the predicted probability that the sample video belongs to the music category, but the predicted probability output by the initial model is usually inaccurate, so the initial model is trained to obtain an accurate video processing model.
From the unit sample videos divided from the sample video, at least two consecutive unit sample videos are randomly selected; their corresponding audio feature vectors are spliced and input into the initial model, and the initial model outputs the predicted probability that the sample video belongs to the music category.
For example, sample video A is divided in units of 1 s into unit sample videos 1 through 9, i.e., 9 unit sample videos in total. Five consecutive unit sample videos are randomly selected from the 9 unit sample videos; each unit sample video corresponds to a 128-dimensional audio feature vector, so the feature vectors corresponding to the 5 unit sample videos are spliced into a 128 * 5 = 640-dimensional audio feature vector, which is input into the initial model. The initial model outputs the predicted probability that sample video A belongs to the music category.
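The splicing in this example amounts to a simple concatenation (the feature vectors below are random placeholders, not real VGGish outputs):

```python
import numpy as np

# five consecutive 128-dimensional unit-level vectors (random placeholders)
unit_vectors = [np.random.rand(128) for _ in range(5)]

# spliced into one 128 * 5 = 640-dimensional model input
model_input = np.concatenate(unit_vectors)
```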
Step B2: calculate the loss value corresponding to the sample video according to the predicted probability that the sample video belongs to the music category and the annotation information of the sample video.
The predicted probability that the sample video belongs to the music category is the actual output of the initial model, and the annotation information of the sample video is the target of the output; the loss value corresponding to the sample video is calculated from the actual output and the target of the output. The loss value can indicate the degree of deviation between the predicted probability that the sample video belongs to the music category and the annotation information of the sample video.
In an optional embodiment, the difference between the annotation information of the sample video and the predicted probability that the sample video belongs to the music category can be used as the loss value. For example, if the predicted probability that the sample video belongs to the music category is 0.8 and the annotation information of the sample video is 1, the loss value can be 0.2.
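This optional loss can be sketched as follows (the 0.8 prediction and annotation 1 reproduce the example above):

```python
def penalty(prediction, annotation):
    """Loss value: deviation of the predicted music-category probability
    from the annotation information (0 or 1)."""
    return abs(annotation - prediction)

loss = penalty(0.8, 1)   # the example above: loss value 0.2
```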
Step B3: when the loss value is less than a set loss threshold, determine that the training is completed.
The smaller the loss value, the better the robustness of the model. In the embodiment of the present invention, a loss threshold for measuring whether the training of the model is completed is preset. If the loss value is less than the set loss threshold, it indicates that the degree of deviation between the predicted probability that the sample video belongs to the music category and the annotation information of the sample video is small, and it can be considered that the training is completed. If the loss value is greater than or equal to the set loss threshold, it indicates that the degree of deviation between the predicted probability and the annotation information is large; in this case the parameters of the model can be adjusted and training continues with the next training sample.
As for the specific value of the set loss threshold, those skilled in the art may select any suitable value based on practical experience; for example, it can be set to 0.1, 0.2, 0.3, and so on.
The trained model can be used as the video processing model and subsequently used to detect music segments in videos.
In addition, in the embodiment of the present invention, a test sample set can also be obtained when the training sample set is obtained. The test sample set is similar to the training sample set: a test sample includes a test video and the annotation information of the test video. After the video processing model is obtained through training, the video processing model is tested using the test sample set. The test process may include: dividing the test video into multiple unit test videos; for each unit test video, obtaining the audio feature vector corresponding to the unit test video; and inputting the audio feature vectors corresponding to at least two consecutive unit test videos into the video processing model, which outputs the predicted probability that the test video belongs to the music category. This predicted probability is compared with the annotation information of the test video to test whether the video processing model is accurate.
In the embodiment of the present invention, multiple sample videos belonging to the music category and multiple sample videos not belonging to the music category are obtained, and the video processing model for detecting music segments in videos is trained from the audio feature vectors corresponding to the sample videos, so the video processing model can detect whether a video belongs to the music category according to the video's corresponding audio feature vectors. A to-be-detected video contains both music segments and non-music segments. When the video processing model is used to detect the music segments in a to-be-detected video, the to-be-detected video is divided into multiple to-be-processed unit videos, and the video processing model can detect, based on the audio feature vectors, whether each to-be-processed unit video belongs to the music category. If a to-be-processed unit video belongs to the music category, it can be determined that the unit video is part of a music segment; if a to-be-processed unit video does not belong to the music category, it can be determined that the unit video is part of a non-music segment. Thus, if several consecutive to-be-processed unit videos all belong to the music category, it can be determined that these consecutive unit videos belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment. In the prior art, when a music-program-class video is split, the splitting is performed according to whether a large change occurs in the scene image information in the video, and the time point at which the scene image information changes greatly is taken as the split point. However, in this splitting mode there are cases in which the scene switches while the audio does not, so the same music segment may be split apart; splitting music segments at scene-switch time points is therefore inaccurate. Compared with the prior art, when splitting a to-be-processed video of the music-program class, the embodiment of the present invention identifies whether each to-be-processed unit video itself belongs to the music category, and thereby determines whether that unit video is part of a music segment, so that the to-be-processed video can be split accordingly. This avoids the inaccurate splitting caused by relying only on whether a large change occurs in the scene image information, so the music segments obtained by splitting in the embodiment of the present invention are more accurate.
Referring to Fig. 2, a flowchart of the steps of a video processing method according to an embodiment of the present invention is shown.
The video processing method of the embodiment of the present invention includes the following steps:
Step 201: obtain a to-be-processed video.
A to-be-processed video refers to a music-program-class video for which music segment detection is required. For example, for music talent-show videos, users may pay more attention to a particular music segment in the music-program-class video, so the music talent-show video of each episode can serve as a to-be-processed video.
Step 202: divide the to-be-processed video into multiple to-be-processed unit videos.
Similarly to step 102 above, based on the audio consistency of the music segments in the to-be-processed video, whether a video belongs to the music category can be determined through audio feature vectors.
A to-be-processed video is divided into multiple to-be-processed unit videos for analysis. For example, the to-be-processed video can be divided into multiple to-be-processed unit videos in units of a set duration. The set duration involved in step 202 can be the same as the set duration involved in step 102 above.
Step 203: for each to-be-processed unit video, obtain the audio feature vector corresponding to the to-be-processed unit video.
Obtaining the audio feature vector corresponding to a to-be-processed unit video may include: generating the spectrogram corresponding to the audio signal in the to-be-processed unit video; inputting the spectrogram corresponding to the audio signal in the to-be-processed unit video into a preset neural network model; and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the to-be-processed unit video.
Generating the spectrogram corresponding to the audio signal in the to-be-processed unit video may include: performing framing processing on the audio signal in the to-be-processed unit video to obtain multiple audio signal frames; performing windowing processing and Fourier transform processing on each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the to-be-processed unit video; and performing Mel transformation processing on the initial spectrogram to obtain a Mel spectrogram, which is used as the spectrogram corresponding to the audio signal in the to-be-processed unit video.
Step 203 is similar to step 103 above; refer to the relevant description of step 103 for details, which the embodiment of the present invention does not discuss again here.
For example, a to-be-processed video is divided in units of 1 s into to-be-processed unit videos 1, 2, 3, and so on, and the audio feature vector corresponding to each to-be-processed unit video is obtained separately.
Step 204: input the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos, including the to-be-processed unit video, into the pre-generated video processing model, and determine whether the to-be-processed unit video belongs to the music category according to the output of the video processing model.
If the feature vector corresponding to a single to-be-processed unit video were used directly to detect whether the unit video belongs to the music category, then, because the duration of one to-be-processed unit video is short, its feature vector might not accurately determine whether the unit video really belongs to the music category. Therefore, in the embodiment of the present invention, the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos, including the unit video in question, are used to determine whether that to-be-processed unit video belongs to the music category.
For a to-be-processed unit video, the audio feature vectors corresponding to at least two consecutive to-be-processed unit videos including it are input into the video processing model generated in the embodiment shown in Fig. 1 above. After analyzing the audio feature vectors, the video processing model outputs the predicted probability that the to-be-processed unit video belongs to the music category. After the output of the video processing model is obtained, the predicted probability output by the video processing model is compared against a set probability threshold; if the predicted probability is greater than or equal to the set probability threshold, it is determined that the to-be-processed unit video belongs to the music category.
As for the specific value of the set probability threshold, those skilled in the art may select any suitable value based on practical experience; for example, it can be set to 0.7, 0.8, 0.9, and so on.
For example, take to-be-processed unit video 3, and the 5 consecutive to-be-processed unit videos including it: to-be-processed unit videos 1, 2, 3, 4, and 5. The corresponding 128-dimensional audio feature vectors of to-be-processed unit videos 1 through 5 are spliced into a 128 * 5 = 640-dimensional audio feature vector and input into the video processing model, which outputs the predicted probability that to-be-processed unit video 3 belongs to the music category. If the predicted probability is greater than the set probability threshold, it is determined that to-be-processed unit video 3 belongs to the music category. In this scheme, both the audio feature vectors before to-be-processed unit video 3 and the audio feature vectors after it are considered, so by using the audio feature vectors corresponding to the 5 consecutive to-be-processed unit videos 1 through 5, the result determined for to-be-processed unit video 3 is more accurate.
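The per-unit decision in this example can be sketched as follows. The stand-in model returning a fixed probability and the 0.7 threshold are assumptions for illustration, and the text does not specify how windows are formed at the start and end of the video:

```python
import numpy as np

THRESHOLD = 0.7            # example probability threshold from the text

def fake_model(vec_640):   # placeholder for the trained video processing model
    return 0.8             # pretend music-category probability

def is_music_unit(unit_vectors, i):
    """Classify unit i from the spliced features of units i-2 .. i+2."""
    window = unit_vectors[i - 2 : i + 3]        # 5 consecutive units
    prob = fake_model(np.concatenate(window))   # 128 * 5 = 640-dim input
    return prob >= THRESHOLD

vectors = [np.random.rand(128) for _ in range(9)]
decision = is_music_unit(vectors, 4)
```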
Step 205: among the to-be-processed unit videos belonging to the music category, splice the consecutive to-be-processed unit videos to obtain the music segments in the to-be-processed video.
After determining whether each to-be-processed unit video belongs to the music category: if a to-be-processed unit video belongs to the music category, it can be determined that the unit video is part of a music segment; if a to-be-processed unit video does not belong to the music category, it can be determined that the unit video is part of a non-music segment. Thus, if several consecutive to-be-processed unit videos all belong to the music category, it can be determined that these consecutive unit videos belong to the same music segment, and splicing these consecutive music-category unit videos yields the corresponding complete music segment. Therefore, splicing the to-be-processed unit videos that consecutively belong to the music category yields the music segments in the to-be-processed video. In general, a to-be-processed video may contain multiple music segments.
When the to-be-processed video is divided into multiple to-be-processed unit videos, the start time and end time corresponding to each to-be-processed unit video can also be recorded. Therefore, after the to-be-processed unit videos that consecutively belong to the music category are spliced into a music segment, the start time of the first to-be-processed unit video in the music segment can be taken as the start time of the music segment, and the end time of the last to-be-processed unit video in the music segment can be taken as the end time of the music segment.
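The splicing of step 205, together with the start/end-time bookkeeping, can be sketched as follows (a minimal illustration assuming 1 s units; the input flags are made-up classification results):

```python
def music_segments(flags, unit_seconds=1):
    """flags[i] is True if unit i belongs to the music category.
    Returns (start_time, end_time) pairs, one per spliced music segment:
    the start of its first unit and the end of its last unit."""
    segments, start = [], None
    for i, is_music in enumerate(flags):
        if is_music and start is None:
            start = i                     # segment begins at this unit
        elif not is_music and start is not None:
            segments.append((start * unit_seconds, i * unit_seconds))
            start = None                  # segment ended at previous unit
    if start is not None:                 # segment runs to the end of the video
        segments.append((start * unit_seconds, len(flags) * unit_seconds))
    return segments

# units 0-1 and 3-5 are music: two segments, 0-2 s and 3-6 s
segs = music_segments([True, True, False, True, True, True])
```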
In the embodiment of the present invention, based on the audio consistency of the music segments in a video, the video processing model detects music segments according to audio feature vectors; the detection result is more accurate, and the adaptivity of the video processing model is stronger.
Referring to Fig. 3, a flowchart of the steps of another video processing method according to an embodiment of the present invention is shown.
The video processing method of the embodiment of the present invention includes the following steps:
Step 301: obtain a to-be-processed video.
Step 302: divide the to-be-processed video evenly into multiple video clips, and divide each video clip into multiple to-be-processed unit videos.
A to-be-processed video may contain multiple music segments. Therefore, in order to save processing time, the to-be-processed video can be divided evenly into multiple video clips that are processed simultaneously.
Fig. 4 is a schematic diagram of a video processing process according to an embodiment of the present invention. The long video in Fig. 4 is the to-be-processed video, and the long video is divided into multiple to-be-processed unit videos.
Step 303: call multiple preset processes simultaneously.
In the embodiment of the present invention, if a single process were used to handle the multiple to-be-processed unit videos divided from the multiple video clips, the processing efficiency would be low. Therefore, multiple processes equal in number to the video clips can be set up and called simultaneously, each handling the to-be-processed unit videos divided from one video clip, to improve processing efficiency. The multiple processes can be stored in a process pool.
Taking 3 processes as an example, the process pool in Fig. 4 includes a first process process1, a second process process2, and a third process process3.
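The process pool described above can be sketched with Python's multiprocessing module. The placeholder worker function and the string "unit videos" are assumptions for illustration, not the embodiment's actual feature-extraction worker:

```python
from multiprocessing import Pool

def process_clip(clip_units):
    # placeholder worker: would extract one audio feature vector per
    # to-be-processed unit video in the clip (hypothetical input format)
    return [len(unit) for unit in clip_units]

if __name__ == "__main__":
    # three video clips, each already divided into unit videos
    clips = [["u1", "u2"], ["u3", "u4"], ["u5", "u6"]]
    with Pool(processes=len(clips)) as pool:   # one process per clip
        results = pool.map(process_clip, clips)
```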
Step 304: for each to-be-processed unit video divided from a video clip, use one process to obtain the audio feature vector corresponding to the to-be-processed unit video.
In each process, for each to-be-processed unit video divided from one video clip, the audio feature vector corresponding to the to-be-processed unit video is obtained.
Each of the 3 processes in Fig. 4 is provided with a neural network model, which can specifically be Audio VGGish. The to-be-processed unit videos divided from each video clip are respectively input into the Audio VGGish in one process, and the 128-dimensional audio feature vector corresponding to each to-be-processed unit video is obtained using Audio VGGish.
Step 304 is similar to step 203 above; refer to the relevant description of step 203 for details, which the embodiment of the present invention does not discuss again here.
Step 305: input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, into the pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model.
In each process, the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, are input into the pre-trained video processing model. Each of the three processes in Fig. 4 is provided with a video processing model, which may specifically be fully connected layers (FCs).
The video processing model outputs the prediction probability that the unit video to be processed belongs to the music category; when the prediction probability is greater than or equal to a set probability threshold, the unit video to be processed is determined to belong to the music category. In Fig. 4, the confidence indicates this prediction probability; when the confidence is greater than or equal to 0.7, the unit is determined to belong to the music category.
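The thresholding step can be sketched as follows (the 0.7 threshold comes from the text; the probability values are made-up illustrations):

```python
THRESHOLD = 0.7

def is_music(prob, threshold=THRESHOLD):
    # A unit belongs to the music category when its predicted
    # probability meets or exceeds the threshold.
    return prob >= threshold

probs = [0.92, 0.65, 0.70, 0.31]
labels = [is_music(p) for p in probs]
print(labels)  # [True, False, True, False]
```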
Step 306: search the multiple unit videos to be processed for unit videos to be processed in which a category mutation occurs.
In practical applications, a music segment may briefly mutate to non-music, and vice versa; for example, a short spoken passage may occur inside a song. In such cases, a unit video to be processed whose category mutates will appear among multiple consecutive unit videos to be processed. For example, if among 20 consecutive unit videos to be processed the 10th belongs to the non-music category while the other 19 belong to the music category, then the 10th unit video to be processed is a unit video to be processed in which a category mutation occurs. If such a category-mutated unit video to be processed were used as the split point between a music segment and a non-music segment, segmentation errors would result.
Step 307: according to at least three consecutive unit videos to be processed, including the unit video to be processed in which the category mutation occurs, determine whether the category-mutated unit video to be processed belongs to the music category.
For the above situation, in this embodiment of the present invention, after determining whether each unit video to be processed belongs to the music category, the determined results may be smoothed so that the results are more accurate. The smoothing may be performed by voting; that is, whether the category-mutated unit video to be processed belongs to the music category is determined by voting.
Determining by voting whether the category-mutated unit video to be processed belongs to the music category may include: among at least three consecutive unit videos to be processed, including the category-mutated unit video to be processed, obtaining the number of unit videos to be processed belonging to the music category and the number belonging to the non-music category, and taking the category with the larger count as the category of the category-mutated unit video to be processed.
For example, for a unit video to be processed in which a category mutation occurs, the categories of that unit video to be processed and of the 7 unit videos to be processed on each side of it are obtained, i.e., the categories of 15 unit videos to be processed in total. If the majority of them belong to the music category, the category-mutated unit video to be processed is determined to belong to the music category, and vice versa.
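The voting smoothing described above can be sketched as a sliding majority vote over a 15-unit window (the window size follows the example; the label sequence is illustrative):

```python
def smooth_labels(labels, half_window=7):
    """Majority-vote each label over itself and half_window neighbors per side."""
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half_window)
        hi = min(len(labels), i + half_window + 1)
        window = labels[lo:hi]
        # Keep the category held by the majority of the window.
        smoothed.append(sum(window) > len(window) / 2)
    return smoothed

# 19 music units with one non-music mutation at index 9.
labels = [True] * 9 + [False] + [True] * 10
print(smooth_labels(labels)[9])  # True: the mutation is voted away
```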
Step 308: splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
The unit videos to be processed that consecutively belong to the music category are spliced to obtain the music segments in the video to be processed. For example, the detection result in Fig. 4 yields Song 1, Song 2, Song 3, and Song 4, i.e., 4 music segments.
In this embodiment of the present invention, the process pool technique greatly improves processing efficiency.
It should be noted that, for simplicity of description, the method embodiments are stated as a series of action combinations. However, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of a model generating apparatus according to an embodiment of the present invention is shown.
The model generating apparatus of this embodiment of the present invention includes a sample acquisition module 501, a first division module 502, a first vector obtaining module 503, and a training module 504.
The sample acquisition module 501 is configured to obtain a training sample. The training sample includes a sample video and annotation information of the sample video; the annotation information indicates whether the sample video belongs to the music category.
The first division module 502 is configured to divide the sample video into multiple unit sample videos.
The first vector obtaining module 503 is configured to, for each unit sample video, obtain the audio feature vector corresponding to that unit sample video.
The training module 504 is configured to take the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, train a preset initial model, and determine the trained model as the video processing model.
In an optional implementation, the first vector obtaining module 503 includes: a first generation unit, configured to generate the spectrogram corresponding to the audio signal in the unit sample video; and a first determination unit, configured to input the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
In an optional implementation, the first generation unit includes: a first framing subunit, configured to frame the audio signal in the unit sample video to obtain multiple audio signal frames; a first processing subunit, configured to apply windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and a first transformation subunit, configured to apply a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and take the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
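The framing, windowing, Fourier transform, and Mel transform performed by these subunits can be sketched in NumPy (parameter values such as a 16 kHz sample rate, 400-sample frames, and 64 mel bands are illustrative assumptions, not specified by the patent):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Framing: slice the signal into overlapping frames.
    n = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale (simplified).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(x, sr=16000, frame_len=400, hop=160, n_mels=64):
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)  # windowing
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2               # initial spectrogram
    mel = spectrum @ mel_filterbank(n_mels, frame_len, sr).T          # Mel transform
    return np.log(mel + 1e-6)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
print(log_mel_spectrogram(x).shape)  # (98, 64)
```

Audio libraries such as librosa provide equivalent, more robust implementations of this pipeline.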
In an optional implementation, the training module 504 includes: a probability obtaining unit, configured to randomly select at least two consecutive unit sample videos, splice the audio feature vectors corresponding to the selected unit sample videos, and input the result into the initial model to obtain the prediction probability that the sample video belongs to the music category; a loss obtaining unit, configured to calculate the loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the music category and the annotation information of the sample video; and a training detection unit, configured to determine that training is complete when the loss value is less than a set loss threshold.
Referring to Fig. 6, a structural block diagram of a video processing apparatus according to an embodiment of the present invention is shown.
The video processing apparatus of this embodiment of the present invention includes a video acquiring module 601, a second division module 602, a second vector obtaining module 603, a category determination module 604, and a segment determining module 605.
The video acquiring module 601 is configured to obtain the video to be processed.
The second division module 602 is configured to divide the video to be processed into multiple unit videos to be processed.
The second vector obtaining module 603 is configured to, for each unit video to be processed, obtain the audio feature vector corresponding to that unit video to be processed.
The category determination module 604 is configured to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the current unit video to be processed, into the pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model. The video processing model is generated by the model generating apparatus shown in Fig. 5.
The segment determining module 605 is configured to splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
In an optional implementation, the second vector obtaining module 603 includes: a second generation unit, configured to generate the spectrogram corresponding to the audio signal in the unit video to be processed; and a second determination unit, configured to input the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model, and determine the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
In an optional implementation, the second generation unit includes: a second framing subunit, configured to frame the audio signal in the unit video to be processed to obtain multiple audio signal frames; a second processing subunit, configured to apply windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video to be processed; and a second transformation subunit, configured to apply a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and take the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit video to be processed.
In an optional implementation, the second division module 602 includes: a snippet extraction unit, configured to evenly divide the video to be processed into multiple video clips; and a segment division unit, configured to divide each video clip into multiple unit videos to be processed.
In an optional implementation, the second vector obtaining module 603 includes: a process call unit, configured to simultaneously invoke multiple preset processes; and a process processing unit, configured to, for each unit video to be processed divided from one video clip, obtain the audio feature vector corresponding to that unit video to be processed using one process.
In an optional implementation, the apparatus further includes: a searching module, configured to search the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and a determining module, configured to determine, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
In an optional implementation, the determining module includes: a number acquiring unit, configured to obtain, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number belonging to the non-music category; and a number comparing unit, configured to take the category with the larger count as the category of the category-mutated unit video to be processed.
In an optional implementation, the category determination module 604 is configured to compare whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the music category is greater than or equal to a set probability threshold, and, when it is, determine that the unit video to be processed belongs to the music category.
Compared with the prior art, when splitting a music-program video to be processed, the embodiments of the present invention determine whether each unit video to be processed itself belongs to the music category, and thereby whether it is part of a music segment, before splitting the video to be processed. This avoids the inaccurate splitting that results from splitting only according to whether the scene image information changes significantly, so the music segments obtained by splitting in the embodiments of the present invention are more accurate.
Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
In an embodiment of the present invention, an electronic device is further provided. For example, the electronic device may be provided as a server. The electronic device may include one or more processors and a memory for storing processor-executable instructions, such as an application program. The processor is configured to execute the above model generating method and/or video processing method.
In an embodiment of the present invention, a non-transitory computer-readable storage medium including instructions is further provided, for example a memory including instructions, where the instructions are executable by the processor of an electronic device to complete the above model generating method and/or video processing method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, once persons skilled in the art learn of the basic inventive concept, additional changes and modifications can be made to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. In the absence of further limitations, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The model generation and video processing methods, apparatuses, electronic device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is merely intended to help understand the method of the present invention and its core ideas. Meanwhile, for persons of ordinary skill in the art, there will be changes in the specific implementations and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (20)
1. A model generating method, characterized in that the method comprises:
obtaining a training sample, wherein the training sample includes a sample video and annotation information of the sample video, and the annotation information indicates whether the sample video belongs to the music category;
dividing the sample video into multiple unit sample videos;
for each unit sample video, obtaining the audio feature vector corresponding to the unit sample video; and
taking the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, training a preset initial model, and determining the trained model as the video processing model.
2. The method according to claim 1, characterized in that obtaining the audio feature vector corresponding to the unit sample video comprises:
generating the spectrogram corresponding to the audio signal in the unit sample video; and
inputting the spectrogram corresponding to the audio signal in the unit sample video into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit sample video.
3. The method according to claim 2, characterized in that generating the spectrogram corresponding to the audio signal in the unit sample video comprises:
framing the audio signal in the unit sample video to obtain multiple audio signal frames;
applying windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit sample video; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit sample video.
4. The method according to claim 1, characterized in that taking the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output and training the preset initial model comprises:
randomly selecting at least two consecutive unit sample videos, splicing the audio feature vectors corresponding to the selected unit sample videos, and inputting the result into the initial model to obtain the prediction probability that the sample video belongs to the music category;
calculating the loss value corresponding to the sample video according to the prediction probability that the sample video belongs to the music category and the annotation information of the sample video; and
determining that training is complete when the loss value is less than a set loss threshold.
5. A video processing method, characterized in that the method comprises:
obtaining a video to be processed;
dividing the video to be processed into multiple unit videos to be processed;
for each unit video to be processed, obtaining the audio feature vector corresponding to the unit video to be processed;
inputting the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model, and determining whether the unit video to be processed belongs to the music category according to the output of the video processing model, wherein the video processing model is generated by the method according to any one of claims 1 to 4; and
splicing the consecutive unit videos to be processed among those belonging to the music category to obtain the music segments in the video to be processed.
6. The method according to claim 5, characterized in that obtaining the audio feature vector corresponding to the unit video to be processed comprises:
generating the spectrogram corresponding to the audio signal in the unit video to be processed; and
inputting the spectrogram corresponding to the audio signal in the unit video to be processed into a preset neural network model, and determining the audio feature vector output by the neural network model as the audio feature vector corresponding to the unit video to be processed.
7. The method according to claim 6, characterized in that generating the spectrogram corresponding to the audio signal in the unit video to be processed comprises:
framing the audio signal in the unit video to be processed to obtain multiple audio signal frames;
applying windowing and Fourier transform processing to each audio signal frame to obtain the initial spectrogram corresponding to the audio signal in the unit video to be processed; and
applying a Mel transform to the initial spectrogram to obtain a Mel spectrogram, and taking the Mel spectrogram as the spectrogram corresponding to the audio signal in the unit video to be processed.
8. The method according to claim 5, characterized in that dividing the video to be processed into multiple unit videos to be processed comprises:
evenly dividing the video to be processed into multiple video clips; and
dividing each video clip into multiple unit videos to be processed.
9. The method according to claim 8, characterized in that, for each unit video to be processed, obtaining the audio feature vector corresponding to the unit video to be processed comprises:
simultaneously invoking multiple preset processes; and
for each unit video to be processed divided from one video clip, obtaining the audio feature vector corresponding to the unit video to be processed using one process.
10. The method according to claim 5, characterized in that, before splicing the consecutive unit videos to be processed among those belonging to the music category to obtain the music segments in the video to be processed, the method further comprises:
searching the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and
determining, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
11. The method according to claim 10, characterized in that determining, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category comprises:
obtaining, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number of unit videos to be processed belonging to the non-music category; and
taking the category with the larger count as the category of the category-mutated unit video to be processed.
12. The method according to claim 5, characterized in that determining whether the unit video to be processed belongs to the music category according to the output of the video processing model comprises:
comparing whether the prediction probability, output by the video processing model, that the unit video to be processed belongs to the music category is greater than or equal to a set probability threshold; and
when it is greater than or equal to the threshold, determining that the unit video to be processed belongs to the music category.
13. A model generating apparatus, characterized in that the apparatus comprises:
a sample acquisition module, configured to obtain a training sample, wherein the training sample includes a sample video and annotation information of the sample video, and the annotation information indicates whether the sample video belongs to the music category;
a first division module, configured to divide the sample video into multiple unit sample videos;
a first vector obtaining module, configured to, for each unit sample video, obtain the audio feature vector corresponding to the unit sample video; and
a training module, configured to take the audio feature vectors corresponding to at least two consecutive unit sample videos as input and the annotation information of the sample video as the target output, train a preset initial model, and determine the trained model as the video processing model.
14. A video processing apparatus, characterized in that the apparatus comprises:
a video acquiring module, configured to obtain a video to be processed;
a second division module, configured to divide the video to be processed into multiple unit videos to be processed;
a second vector obtaining module, configured to, for each unit video to be processed, obtain the audio feature vector corresponding to the unit video to be processed;
a category determination module, configured to input the audio feature vectors corresponding to at least two consecutive unit videos to be processed, including the unit video to be processed, into a pre-generated video processing model, and determine whether the unit video to be processed belongs to the music category according to the output of the video processing model, wherein the video processing model is generated by the apparatus according to claim 13; and
a segment determining module, configured to splice the consecutive unit videos to be processed among those belonging to the music category, to obtain the music segments in the video to be processed.
15. The apparatus according to claim 14, characterized in that the second division module comprises:
a snippet extraction unit, configured to evenly divide the video to be processed into multiple video clips; and
a segment division unit, configured to divide each video clip into multiple unit videos to be processed.
16. The apparatus according to claim 15, characterized in that the second vector obtaining module comprises:
a process call unit, configured to simultaneously invoke multiple preset processes; and
a process processing unit, configured to, for each unit video to be processed divided from one video clip, obtain the audio feature vector corresponding to the unit video to be processed using one process.
17. The apparatus according to claim 14, characterized in that the apparatus further comprises:
a searching module, configured to search the multiple unit videos to be processed for a unit video to be processed in which a category mutation occurs; and
a determining module, configured to determine, according to at least three consecutive unit videos to be processed including the category-mutated unit video to be processed, whether the category-mutated unit video to be processed belongs to the music category.
18. The apparatus according to claim 17, characterized in that the determining module comprises:
a number acquiring unit, configured to obtain, among the at least three consecutive unit videos to be processed, the number of unit videos to be processed belonging to the music category and the number of unit videos to be processed belonging to the non-music category; and
a number comparing unit, configured to take the category with the larger count as the category of the category-mutated unit video to be processed.
19. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions,
wherein the processor is configured to execute the model generating method according to any one of claims 1 to 4 and/or the video processing method according to any one of claims 5 to 12.
20. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the model generation method according to any one of claims 1-4, and/or the video processing method according to any one of claims 5-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910458806.8A CN110324726B (en) | 2019-05-29 | 2019-05-29 | Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110324726A true CN110324726A (en) | 2019-10-11 |
CN110324726B CN110324726B (en) | 2022-02-18 |
Family
ID=68119101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910458806.8A Active CN110324726B (en) | 2019-05-29 | 2019-05-29 | Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110324726B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112104892A (en) * | 2020-09-11 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Multimedia information processing method and device, electronic equipment and storage medium |
CN112750469A (en) * | 2020-02-26 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Method for detecting music in voice, voice communication optimization method and corresponding device |
CN113096624A (en) * | 2021-03-24 | 2021-07-09 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for automatically creating symphony music |
CN113992970A (en) * | 2020-07-27 | 2022-01-28 | 阿里巴巴集团控股有限公司 | Video data processing method and device, electronic equipment and computer storage medium |
CN114222159A (en) * | 2021-12-01 | 2022-03-22 | 北京奇艺世纪科技有限公司 | Method and system for determining video scene change point and generating video clip |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1825936A (en) * | 2006-02-24 | 2006-08-30 | 北大方正集团有限公司 | News video retrieval method based on speech classifying indentification |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN105931635A (en) * | 2016-03-31 | 2016-09-07 | 北京奇艺世纪科技有限公司 | Audio segmentation method and device |
US20170228385A1 (en) * | 2016-02-08 | 2017-08-10 | Hulu, LLC | Generation of Video Recommendations Using Connection Networks |
CN107066488A (en) * | 2016-12-27 | 2017-08-18 | 上海东方明珠新媒体股份有限公司 | Video display bridge section automatic division method based on movie and television contents semantic analysis |
CN108307229A (en) * | 2018-02-02 | 2018-07-20 | 新华智云科技有限公司 | A kind of processing method and equipment of video-audio data |
CN108897829A (en) * | 2018-06-22 | 2018-11-27 | 广州多益网络股份有限公司 | Modification method, device and the storage medium of data label |
CN108989882A (en) * | 2018-08-03 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Method and apparatus for exporting the snatch of music in video |
CN109344780A (en) * | 2018-10-11 | 2019-02-15 | 上海极链网络科技有限公司 | A kind of multi-modal video scene dividing method based on sound and vision |
Also Published As
Publication number | Publication date |
---|---|
CN110324726B (en) | 2022-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110324726A (en) | Model generation, method for processing video frequency, device, electronic equipment and storage medium | |
CN110213670A (en) | Method for processing video frequency, device, electronic equipment and storage medium | |
Giannoulis et al. | A database and challenge for acoustic scene classification and event detection | |
CN107086040A (en) | Speech recognition capabilities method of testing and device | |
Krijnders et al. | Sound event recognition through expectancy-based evaluation of signal-driven hypotheses |
CN110324657A (en) | Model generation, method for processing video frequency, device, electronic equipment and storage medium | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN109979483A (en) | Melody detection method, device and the electronic equipment of audio signal | |
CN109979485B (en) | Audio evaluation method and device | |
Wang et al. | Local business ambience characterization through mobile audio sensing | |
CN111696580A (en) | Voice detection method and device, electronic equipment and storage medium | |
Müller et al. | Interactive fundamental frequency estimation with applications to ethnomusicological research | |
CN109997186A (en) | A kind of device and method for acoustic environment of classifying | |
CN104700831B (en) | The method and apparatus for analyzing the phonetic feature of audio file | |
CN113781989B (en) | Audio animation playing and rhythm stuck point identifying method and related device | |
JP2005292207A (en) | Method of music analysis | |
CN104143340B (en) | A kind of audio frequency assessment method and device | |
Qais et al. | Deepfake audio detection with neural networks using audio features | |
WO2019053544A1 (en) | Identification of audio components in an audio mix | |
Karantaidis et al. | Assessing spectral estimation methods for electric network frequency extraction | |
CN104882152A (en) | Method and apparatus for generating lyric file | |
KR102077642B1 (en) | Sight-singing evaluation system and Sight-singing evaluation method using the same | |
Bhatia et al. | Analysis of audio features for music representation | |
Shirali-Shahreza et al. | Fast and scalable system for automatic artist identification | |
Ganapathy et al. | Temporal resolution analysis in frequency domain linear prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||