CN109936766A - End-to-end generation method for water-scene audio - Google Patents
- Publication number: CN109936766A (application CN201910091367.1A)
- Authority: CN (China)
- Prior art keywords: audio, video, generation method, water scene, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention belongs to the technical field of audio processing and in particular relates to an end-to-end generation method for water-scene audio, comprising the following steps: step 1, select various water-scene videos and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. The invention achieves end-to-end automatic generation of outdoor water-scene sound, solving the problem that dubbing such scenes is time-consuming and laborious; at the same time, generating water-scene audio with the trained model improves generation speed and synchronization, and thus working efficiency.
Description
Technical field
The invention belongs to the technical field of audio processing and in particular relates to an end-to-end generation method for water-scene audio.
Background art
With the continuous development of computer graphics, people place ever higher demands on the sound quality of video and animation. Water scenes, especially outdoor water scenes, are ubiquitous in film and games, so developing a method that can automatically generate the corresponding scene sound from an outdoor water-scene video is highly necessary. At present, physically based methods are mostly used to generate water-scene sound.
Physically based water-sound generation methods rest mainly on one theory: the formation and resonance of bubbles are the most important source of underwater sound. In their work on harmonic bubbles, Zheng et al. proposed a flowing-water sound generation method that, by taking the sound-propagation process into account, generates a variety of flowing-water sounds, including running tap water; its results, however, require cumbersome manual adjustment. Later, in work on the fluid dynamics of complex acoustic bubbles, Langlois et al. proposed a sound generation method based on two-phase incompressible fluid simulation to improve the acoustic results of fluid sound generated from bubbles: the bubbles in the liquid no longer follow a random model but are generated from the state of the fluid, yielding more realistic bubbles and a more lifelike final sound. However, the main object of study of these methods is confined to small-scale water flows, and as the acoustic results keep improving, the algorithmic complexity keeps rising, which prevents them from being applied to sound synthesis for outdoor water scenes.
Deep-learning sound generation methods generate the corresponding sound from video. Owens et al., in "Visually Indicated Sounds", proposed a neural network composed of a convolutional neural network (CNN) and long short-term memory (LSTM) units: it takes as input the image features of a space-time volume formed from each video frame's grayscale image and those of its neighboring frames, outputs a sound cochleagram corresponding to the video, and then searches a sound bank for the samples that best match this cochleagram, splicing them into the final result. Chen et al., in work on deep cross-modal audio-visual generation, designed two translation modes with GAN networks: one converts the log-magnitude mel spectrogram (LMS) of an instrument's sound into the corresponding instrument image, the other converts an instrument image into the corresponding LMS, from which a matching instrument sound is then retrieved. The outputs of both deep networks are spectrogram-like images rather than raw sound signals. Zhou et al., in "Visual to Sound: Generating Natural Sound for Videos in the Wild", made a first attempt at generating sound for natural-scene videos with a SampleRNN model, feeding features of the video images or optical-flow maps into the RNN to generate the sound signal directly; however, it still has problems with audio-visual synchronization.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing an end-to-end generation method for water-scene audio that achieves end-to-end automatic generation of outdoor water-scene sound and solves the problem that dubbing such scenes is time-consuming and laborious; at the same time, generating water-scene audio with the trained model improves generation speed and synchronization, and thus working efficiency.
To achieve the above object, the invention adopts the following technical scheme:
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
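The four steps can be sketched as a minimal dataflow. Everything below is a hypothetical stand-in for illustration: `preprocess`, `generator` and `enhancer` are toy placeholders for the trained models the method describes, not the patent's networks.

```python
import numpy as np

SR_AUDIO = 44100   # audio sample rate used in the patent

def preprocess(frames):
    # Step 1 (stand-in): reduce one second of frames to an
    # audio-dimensional video-information vector.
    return np.resize(np.stack([f.ravel() for f in frames]), SR_AUDIO)

def generator(v):
    # Steps 2-3 (stand-in for the trained generator model):
    # map the video vector to one second of audio in [-1, 1].
    return np.tanh(v - v.mean())

def enhancer(audio):
    # Step 4 (stand-in for the trained timbre enhancer): here just a
    # pass-through that bounds the signal; the real model conditions
    # on the audio's envelope.
    return np.clip(audio, -1.0, 1.0)

silent_video = np.random.rand(30, 256, 256, 3)   # one second at 30 fps
out = enhancer(generator(preprocess(silent_video)))
print(out.shape)   # (44100,): one second of generated audio
```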
It should be noted that in the generation method of the invention, in step 1, selecting various water-scene videos for training helps to optimize model training and reduce error; at the same time, because there is a large dimensional gap between a video's image information and its sound, preprocessing puts the image information and the sound into the same dimension. In step 2, training the generator model on the preprocessed data makes it possible to synthesize automatically fluid sound synchronized with an outdoor water-scene video, with no need for a professional Foley artist to synthesize synchronized water-scene sound, and no need to design different algorithms by hand for different scene characteristics to generate the sound of each kind of scene; this saves labor and material resources while improving the accuracy of the generator model and meeting users' needs. A discriminator is also required to assess the quality of the generator's output and feed the assessment back into the generator model; through repeated feedback and adjustment, the generator model is trained effectively, improving its accuracy in matching synchronized sound to a silent video. In step 3, a silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (for example, a waterfall scene), the timbre needs to be enhanced to meet the actual demand; at the same time, to raise the level of automation further, the timbre is enhanced with a trained timbre-enhancer model, achieving end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the sound's envelope, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable and conversion methods), greatly improving processing speed and reducing the user's waiting time.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 1 the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the generation method of the envelope comprises the following steps:
D1: input an audio sequence G_V and the sampling interval L_step of the envelope;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E(1:len) = interp(E_p), E_p = p_1 ⊕ p_2 ⊕ …,
where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
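Steps D1-D3 translate directly into a short routine; the sampling interval and the test signal below are illustrative choices, not values fixed by the method.

```python
import numpy as np

def make_envelope(g_v, l_step):
    """Steps D1-D3: per-interval maximum of |g_v|, linearly
    interpolated back to the full length of the audio sequence."""
    # D2: one envelope point p_i per sampling interval L_step.
    starts = np.arange(0, len(g_v), l_step)
    points = np.array([np.abs(g_v[s:s + l_step]).max() for s in starts])
    # D3: connect the points into E_p and interpolate to a sequence
    # E(1:len) with the same length as g_v.
    return np.interp(np.arange(len(g_v)), starts, points)

# A decaying 50 Hz test tone sampled at 1 kHz.
t = np.arange(1000) / 1000.0
g_v = np.exp(-3 * t) * np.sin(2 * np.pi * 50 * t)
env = make_envelope(g_v, l_step=50)
print(env.shape)   # same length as the input sequence: (1000,)
```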
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step A2 the video-information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the loss function used to output the audio signal is as follows, with λ = 100, where X denotes the ground-truth sound, V denotes the video-frame information, G denotes the generator's output, D denotes the discriminator's assessment, and E denotes the average.
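The loss formula itself did not survive extraction; the quantities described (ground truth X, conditional assessment D, a weight λ = 100) are consistent with a pix2pix-style conditional-GAN generator objective, so one plausible reconstruction, offered only as an assumption, is:

```latex
\mathcal{L}_{G} = \mathbb{E}_{V}\left[\log\left(1 - D(V, G(V))\right)\right]
  + \lambda\,\mathbb{E}_{X,V}\left[\lVert X - G(V)\rVert_{1}\right],
  \qquad \lambda = 100
```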
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the loss function used to assess the audio signal is as follows, where V denotes the video-frame information, G denotes the generator's output, D denotes the assessment result, and E denotes the average.
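This discriminator formula is likewise missing from the extraction. A standard conditional-GAN discriminator objective using the symbols defined here would be the following; the real-sample term D(V, X) is our assumption, since the source lists only V, G, D and E:

```latex
\mathcal{L}_{D} = -\,\mathbb{E}_{X,V}\left[\log D(V, X)\right]
  - \mathbb{E}_{V}\left[\log\left(1 - D(V, G(V))\right)\right]
```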
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network, and the GAN network comprises a generator, a discriminator and a timbre enhancer.
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network; in step 1, the vector V_t generated from the preprocessed video frames can be expressed as the concatenation of the per-frame features v_{t,q}, where ⊕ denotes the concatenation operation, v_{t,q} denotes the features extracted from frame q of second t, and Floor denotes rounding down.
The sound-generation task can further be expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}.
The beneficial effects of the invention are as follows. The invention comprises the following steps: step 1, select various water-scene videos and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. In step 1 of the generation method of the invention, selecting various water-scene videos for training helps to optimize model training and reduce error; because there is a large dimensional gap between a video's image information and its sound, preprocessing puts them into the same dimension. In step 2, training the generator model on the preprocessed data makes it possible to synthesize automatically fluid sound synchronized with an outdoor water-scene video, without a professional Foley artist to synthesize synchronized water-scene sound and without hand-designing different algorithms for different scene characteristics; this saves labor and material resources while improving the generator model's accuracy and meeting users' needs. A discriminator is also set up to assess the quality of the generator's output and feed the assessment back into the generator model; repeated feedback and adjustment train the generator model effectively and improve its accuracy in matching synchronized sound to a silent video. In step 3, a silent video has no sound, so the trained generator model generates the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (for example, a waterfall scene), the timbre is enhanced with a trained timbre-enhancer model to meet the actual demand and raise the level of automation further, achieving end-to-end automatic generation of outdoor water-scene sound; the trained timbre-enhancer model obtains the enhanced audio directly from the sound's envelope, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable and conversion methods), greatly improving processing speed and reducing the user's waiting time. The invention thus achieves end-to-end automatic generation of outdoor water-scene sound, solves the problem that dubbing such scenes is time-consuming and laborious, and, by generating water-scene audio with the trained model, improves generation speed, synchronization and working efficiency.
Detailed description of the invention
Fig. 1 is a flow diagram of the invention;
Fig. 2 is an operation schematic of the invention;
Fig. 3 is a waveform diagram of a water scene and its corresponding audio signal in the invention;
Fig. 4 is a spectral comparison before and after timbre enhancement in the invention.
Specific embodiment
Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This specification and the claims do not distinguish components by difference of name but by difference of function. The word "comprising" used throughout the specification and claims is an open-ended term and should therefore be construed as "including but not limited to". "Substantially" means within an acceptable error range within which a person skilled in the art can solve the technical problem and basically achieve the technical effect.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "front", "rear", "left", "right" and "horizontal", are based on the orientations or positional relationships shown in the drawings, are merely for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be construed as limiting the invention.
In the invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled" and "fixed" are to be understood broadly: for example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the invention can be understood according to the particular circumstances.
The invention is described in further detail below with reference to Figs. 1 to 4, but the figures are not to be taken as limiting the invention.
Embodiment 1
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
Preferably, in step 1, the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
In the above preprocessing method, in step A1, a complete water-scene video occupies a large amount of memory, which is unfavorable to obtaining the video's information, and the amount of computation is large; extracting the features of the video frames reduces the computation while still obtaining the video's information, improving processing speed. In step A2, the large dimensional gap between the video's image information and the sound would not only add much computation but also increase the generator model's error and degrade the matching of water-scene sound to video.
Preferably, in step 2, the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
In the above training method, in step B2, the initial, untrained generator model does not necessarily output corresponding audio signals; training on various water-scene videos, with the video-information vectors fed back to the generator model in real time, helps to optimize the model's training and reduce its output error.
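The B1-B2 feedback loop can be illustrated with a deliberately tiny example: a one-parameter "generator" and a mean-squared "assessment" stand in for the deep generator and discriminator, which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = 0.5                      # hidden video-to-audio mapping to recover
w = 0.0                           # the toy generator's single parameter

for step in range(200):
    v = rng.standard_normal(64)   # B1: a batch of video-information vectors
    audio = w * v                 # B1: generator outputs an audio signal
    target = true_w * v           # ground-truth audio for this video
    error = audio - target        # B2: assessment of the generated audio
    w -= 0.1 * np.mean(error * v) # B2: feed the assessment back and adjust

print(round(w, 3))   # close to the true mapping 0.5
```

The repeated assess-feedback-adjust cycle is the same shape as steps B1-B2; in the patent the "assessment" is the discriminator's judgment rather than a known ground-truth error.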
Preferably, in step 4, the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
In the above training method, in step C2, the initial, untrained timbre-enhancer model does not necessarily output audio sequences corresponding to the envelope of the target audio; training on the envelopes of each class of audio, with real-time feedback to the timbre-enhancer model, helps to optimize the model's training and reduce its output error.
Preferably, in step 4, the generation method of the envelope comprises the following steps:
D1: input an audio sequence G_V and the sampling interval L_step of the envelope;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E(1:len) = interp(E_p), E_p = p_1 ⊕ p_2 ⊕ …,
where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
Preferably, in step A2, the video-information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
Preferably, in step 2, the loss function used to output the audio signal is as follows, with λ = 100, where X denotes the ground-truth sound, V denotes the video-frame information, G denotes the generator's output, D denotes the discriminator's assessment, and E denotes the average.
Preferably, in step 2, the loss function used to assess the audio signal is as follows, where V denotes the video-frame information, G denotes the generator's output, D denotes the assessment result, and E denotes the average.
Embodiment 2
This embodiment differs from Embodiment 1 as follows: in its video preprocessing, because input videos generally differ in image size, the input images are scaled to 256 × 256 × 3 to reduce computation and unify management, and the 30 images of size 256 × 256 × 3 in each second are then encoded into a vector matching the audio scale.
First, for each video frame y_i, its feature vector v_i is extracted with a VGG19 network; its dimension is 1 × 4096 × 1. Let SR_video and SR_audio be the sample rates of the video and the audio, which in the invention are 30 and 44100. For the t-th second of video, the corresponding preprocessed vector V_t can be expressed as the concatenation of the per-frame features, where ⊕ denotes the concatenation operation, v_{t,q} denotes the VGG19 features extracted from frame q of second t, and Floor denotes rounding down; in the invention, p = 10 and q = 3. For the length missing from the spliced result because of rounding, the invention uniformly pads the vacancy with zeros. In this way, the conversion from the original video to audio can be expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}; V_t and X_t then have the same dimension.
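The preprocessing of this embodiment (scale each frame to 256 × 256 × 3, sample every q-th frame, extract a 4096-dimensional feature per frame, concatenate, and zero-pad the vacancy to 44100) can be sketched as follows; `fake_vgg19` and `scale_frame` are stand-ins, since the patent uses a real VGG19 network and does not specify the scaling method.

```python
import numpy as np

SR_VIDEO, SR_AUDIO = 30, 44100   # video/audio sample rates in the patent
FEAT_DIM, Q = 4096, 3            # per-frame feature size; every 3rd frame

def scale_frame(frame):
    # Stand-in for image scaling: nearest-neighbour resize to 256x256x3.
    h, w, _ = frame.shape
    ys = np.arange(256) * h // 256
    xs = np.arange(256) * w // 256
    return frame[ys][:, xs]

def fake_vgg19(frame):
    # Stand-in for VGG19: any fixed map from a frame to 4096 values.
    return np.resize(frame.mean(axis=2), FEAT_DIM)

def preprocess_second(frames):
    feats = [fake_vgg19(scale_frame(f)) for f in frames[::Q]]  # p = 10 frames
    flat = np.concatenate(feats)                               # 40960 values
    return np.pad(flat, (0, SR_AUDIO - flat.size))             # zero-pad vacancy

frames = np.random.rand(SR_VIDEO, 480, 640, 3)  # one second of raw video
v_t = preprocess_second(frames)
print(v_t.shape)   # (44100,), the same dimension as one second of audio
```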
Embodiment 3
This embodiment differs from Embodiment 1 as follows: the generation method of the water-scene audio of this embodiment is based on a GAN network, and the GAN network comprises a generator, a discriminator and a timbre enhancer. The network in the invention is adjusted to the inputs and outputs demanded by sound generation, so that the receptive field of each convolutional layer used in the original image networks (receptive field: in a convolutional neural network, CNN, the region of the input layer that determines one element of a given layer's output) no longer applies. Image networks usually use convolutional layers with a 3 × 3 receptive field; for the 44100-dimensional inputs and outputs of the invention, the receptive fields of the generator's and the discriminator's convolutional layers are changed, and larger receptive fields are used to perform the corresponding convolution operations. In addition, during convolution the two-dimensional filters used for images are discarded and, to suit the characteristics of the sound dimension, the invention uses one-dimensional filters. To remove unwanted frequency information from the digital sound result, a filter is added at the end of the generator to filter out partial frequency components, keeping the length of the output sequence unchanged during filtering. The specific structures of the generator and the discriminator are given in Tables 1 and 2.
Table 1
Table 2
Note that the ReLU, LeakyReLU and BatchNorm layers following the convolutional (Conv1D) and transposed-convolutional (TransConv1D) layers involve neither convolution kernels nor changes of output size and are therefore not shown in the tables. Stride denotes the convolution step during convolution or deconvolution. The three parameters in the "kernel size" column are, respectively, the receptive-field size, the layer's number of input channels and its number of output channels; the three parameters in the "output shape" column are, respectively, the layer's batch size, input dimension and number of channels. To guarantee the correspondence between the convolution and deconvolution processes, the invention uses continually varying receptive fields and strides during convolution or deconvolution, so that the conversion between a layer's input and output neither discards nor adds dimensions.
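The receptive-field discussion can be made concrete: for a stack of 1-D convolutions, each layer enlarges the receptive field by (kernel − 1) times the product of the preceding strides. The layer shapes below are illustrative, since Tables 1 and 2 did not survive extraction.

```python
def receptive_field(layers):
    """Receptive field of stacked 1-D convolutions.

    layers: list of (kernel_size, stride) from first to last layer.
    """
    rf, jump = 1, 1   # span of one output element; input step per output step
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Image networks commonly stack 3x3 convolutions; two such layers
# cover only a 5-element span of the input:
print(receptive_field([(3, 1), (3, 1)]))   # 5

# For 44100-dimensional audio, far larger kernels and strides are
# needed; these layer shapes are illustrative, not the patent's Table 1:
print(receptive_field([(25, 4), (25, 4), (25, 4), (25, 4)]))
```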
In the light of the disclosure and teachings of the above specification, those skilled in the art may also change and modify the above embodiments. The invention is therefore not limited to the specific embodiments above; any obvious improvement, replacement or modification made by those skilled in the art on the basis of the invention falls within the scope of protection of the invention. In addition, although some specific terms are used in this specification, they are merely for convenience of description and do not constitute any limitation of the invention.
Claims (10)
1. An end-to-end generation method for water-scene audio, characterized by comprising the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
2. The end-to-end generation method for water-scene audio according to claim 1, characterized in that in step 1 the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
3. The end-to-end generation method for water-scene audio according to claim 2, characterized in that in step 2 the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
4. The end-to-end generation method for water-scene audio according to claim 1, characterized in that in step 4 the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
5. The end-to-end water scene audio generation method according to claim 1, characterised in that in Step 4 the generation of the envelope comprises the following steps:
D1: input an audio sequence G_V and the envelope sampling interval L_step;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as the envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E_p = p_1 ⊕ p_2 ⊕ … ⊕ p_k,  E(1:len) = interp(E_p)
where p_i ∈ G_V, interp() denotes linear interpolation, and ⊕ denotes the concatenation operation.
6. The end-to-end water scene audio generation method according to claim 2, characterised in that in Step A2 the video information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the value of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
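The mapping in claim 6 takes color-channel matrices with values in 0–255 and produces audio values in -1 to 1. A minimal pre-processing sketch follows; the function name `frames_to_vector` and the choice of linear interpolation to reach the audio dimension are assumptions, since the patent only requires the per-second vector to match the audio dimension:

```python
import numpy as np

def frames_to_vector(frames, n):
    """Flatten one second of video frames into a length-n vector in [-1, 1].

    frames: list of H x W x C uint8 arrays (channel values 0-255).
    n: the audio dimension for one second (e.g. the sampling rate).
    """
    # Concatenate all channel values of all frames into one flat array
    flat = np.concatenate([f.astype(float).ravel() for f in frames])
    # Map pixel range 0..255 to the audio range -1..1
    flat = flat / 127.5 - 1.0
    # Resample to the audio dimension by linear interpolation (assumed)
    return np.interp(np.linspace(0, len(flat) - 1, n),
                     np.arange(len(flat)), flat)
```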
7. The end-to-end water scene audio generation method according to claim 3, characterised in that in Step 2 the loss function used to output the audio signal is:
L_G = E[log(1 - D(V, G(V)))] + λ E[||X - G(V)||_1]
where λ = 100, X denotes the ground-truth sound, V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes the average.
8. The end-to-end water scene audio generation method according to claim 3, characterised in that in Step 2 the loss function used to assess the audio signal is:
L_D = E[log D(V, X)] + E[log(1 - D(V, G(V)))]
where X denotes the ground-truth sound, V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes the average.
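The loss functions referenced in claims 7 and 8 were rendered as images in the original patent; the sketch below assumes the pix2pix-style conditional-GAN form suggested by the λ = 100 weight and the listed variables (generator loss with an L1 reconstruction term, standard discriminator objective). Function names and the sigmoid-score convention are assumptions:

```python
import numpy as np

def generator_loss(d_fake, x_true, x_gen, lam=100.0):
    """Assumed generator loss (claim 7): adversarial term plus
    lambda-weighted L1 distance to the ground-truth sound.

    d_fake: discriminator scores D(V, G(V)) in (0, 1).
    """
    adv = np.mean(np.log(1.0 - d_fake + 1e-12))  # fooling term (minimized)
    l1 = np.mean(np.abs(x_true - x_gen))         # reconstruction term
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    """Assumed discriminator objective (claim 8): score real pairs
    (V, X) high and generated pairs (V, G(V)) low."""
    return (np.mean(np.log(d_real + 1e-12))
            + np.mean(np.log(1.0 - d_fake + 1e-12)))
```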
9. The end-to-end water scene audio generation method according to claim 1, characterised in that the water scene audio generation method is based on a GAN network, the GAN network comprising a generator, a discriminator, and a timbre enhancer.
10. The end-to-end water scene audio generation method according to claim 1, characterised in that in Step 1 the vector V_t generated from the pre-processed video frames can be expressed in the following form:
V_t = v_{t,1} ⊕ v_{t,2} ⊕ … ⊕ v_{t,Floor(fps)}
where ⊕ denotes the concatenation operation, v_{t,q} denotes the feature extracted from the q-th frame of the t-th second, and Floor denotes rounding down;
the sound generation task can then be further expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}, and SR_audio denotes the audio sampling rate.
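The per-second vector construction in claim 10 is a concatenation of the per-frame features of that second, with the frame count floored. A minimal sketch, where the function name `second_vector` is an assumption:

```python
import numpy as np

def second_vector(features, fps):
    """Build V_t by concatenating the per-frame features of one second.

    features: list of 1-D feature arrays v_{t,q}, one per frame.
    fps: frame rate; Floor(fps) frames are used, matching the floor
    operation in the claim.
    """
    k = int(np.floor(fps))
    # V_t = v_{t,1} (+) v_{t,2} (+) ... (+) v_{t,Floor(fps)}
    return np.concatenate(features[:k])
```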
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910091367.1A CN109936766B (en) | 2019-01-30 | 2019-01-30 | End-to-end-based method for generating audio of water scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109936766A true CN109936766A (en) | 2019-06-25 |
CN109936766B CN109936766B (en) | 2021-04-13 |
Family
ID=66985371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910091367.1A Active CN109936766B (en) | 2019-01-30 | 2019-01-30 | End-to-end-based method for generating audio of water scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109936766B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111435591A (en) * | 2020-01-17 | 2020-07-21 | 珠海市杰理科技股份有限公司 | Sound synthesis method and system, audio processing chip and electronic equipment |
CN113223493A (en) * | 2020-01-20 | 2021-08-06 | Tcl集团股份有限公司 | Voice nursing method, device, system and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5831518A (en) * | 1995-06-16 | 1998-11-03 | Sony Corporation | Sound producing method and sound producing apparatus |
CN101299241A (en) * | 2008-01-14 | 2008-11-05 | 浙江大学 | Method for detecting multi-mode video semantic conception based on tensor representation |
US20090241753A1 (en) * | 2004-12-30 | 2009-10-01 | Steve Mann | Acoustic, hyperacoustic, or electrically amplified hydraulophones or multimedia interfaces |
CN102222506A (en) * | 2010-04-15 | 2011-10-19 | 迪尔公司 | Context-based sound generation |
CN103117057A (en) * | 2012-12-27 | 2013-05-22 | 安徽科大讯飞信息科技股份有限公司 | Application method of special human voice synthesis technique in mobile phone cartoon dubbing |
WO2018039433A1 (en) * | 2016-08-24 | 2018-03-01 | Delos Living Llc | Systems, methods and articles for enhancing wellness associated with habitable environments |
Non-Patent Citations (3)
Title |
---|
ROGER B DANNENBERG ET AL.: "Sound Synthesis from Real-Time Video Image", 《RESEARCHGATE》 *
LIU JIE: "Research and System Implementation of an Ocean Wave Simulation Algorithm Based on the Wave Spectrum", 《China Master's Theses Full-text Database, Information Science and Technology》 *
WANG KAI ET AL.: "Efficient Sound Synthesis for Natural Scenes", 《IEEE》 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102449664B (en) | Gradual-change animation generating method and apparatus | |
CN112465955B (en) | Dynamic human body three-dimensional reconstruction and visual angle synthesis method | |
CN109711401B (en) | Text detection method in natural scene image based on Faster Rcnn | |
CN102054287B (en) | Facial animation video generating method and device | |
CN113706412B (en) | SDR (read-write memory) to HDR (high-definition digital interface) conversion method | |
US20070009180A1 (en) | Real-time face synthesis systems | |
CN111783658B (en) | Two-stage expression animation generation method based on dual-generation reactance network | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
CN109936766A (en) | A kind of generation method based on water scene audio end to end | |
CN106709933B (en) | Motion estimation method based on unsupervised learning | |
Zhang et al. | Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video | |
CN116721191A (en) | Method, device and storage medium for processing mouth-shaped animation | |
Xu et al. | Deep video inverse tone mapping | |
CN109525787B (en) | Live scene oriented real-time subtitle translation and system implementation method | |
CN108765246B (en) | A kind of selection method of steganographic system carrier image | |
CN109658369A (en) | Video intelligent generation method and device | |
CN113160358A (en) | Non-green-curtain cutout rendering method | |
Nakatsuka et al. | Audio-guided Video Interpolation via Human Pose Features. | |
Zhang et al. | Deep Learning Technology in Film and Television Post-Production | |
Xiao et al. | Dense convolutional recurrent neural network for generalized speech animation | |
CN112287998A (en) | Method for detecting target under low-light condition | |
CN114679605B (en) | Video transition method, device, computer equipment and storage medium | |
CN115829868B (en) | Underwater dim light image enhancement method based on illumination and noise residual image | |
CN117593442A (en) | Portrait generation method based on multi-stage fine grain rendering | |
CN113806584B (en) | Self-supervision cross-modal perception loss-based method for generating command actions of band |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||