CN109936766A - An end-to-end generation method for water-scene audio - Google Patents

An end-to-end generation method for water-scene audio

Info

Publication number
CN109936766A
CN109936766A (application CN201910091367.1A; granted as CN109936766B)
Authority
CN
China
Prior art keywords
audio
video
generation method
water scene
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910091367.1A
Other languages
Chinese (zh)
Other versions
CN109936766B (en)
Inventor
Liu Shiguang (刘世光)
Cheng Haonan (程皓楠)
Wang Kai (王凯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910091367.1A
Publication of CN109936766A
Application granted
Publication of CN109936766B
Legal status: Active
Anticipated expiration

Abstract

The invention belongs to the technical field of audio processing and specifically relates to an end-to-end generation method for water-scene audio, comprising the following steps: step 1, select various kinds of water-scene video and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. The invention achieves end-to-end automatic generation of outdoor water-scene sound, solving the time-consuming and laborious problem of dubbing such scenes; at the same time, generating water-scene audio with the trained model improves generation speed and audio-visual synchronization, and thereby working efficiency.

Description

An end-to-end generation method for water-scene audio
Technical field
The invention belongs to the technical field of audio processing and specifically relates to an end-to-end generation method for water-scene audio.
Background art
With the continuous development of computer graphics, people place ever higher demands on the sound quality of video and animation. Water scenes, especially outdoor water scenes, are ubiquitous in film and games, so it is necessary to develop a method that can automatically generate the corresponding scene sound from outdoor water-scene video. At present, water-scene sound is mostly generated with physics-based methods.
Physics-based water-sound generation methods mainly rest on one theory: the formation and resonance of bubbles are the most important source of underwater sound. Zheng et al. proposed a water-flow sound generation method based on harmonic bubbles that, by taking the sound-propagation process into account, generates a variety of flowing-water sounds, including running tap water, but its results require cumbersome manual adjustment. Later, Langlois et al., working with the fluid dynamics of complex acoustic bubbles, proposed a sound-generation method based on two-phase incompressible-fluid simulation to improve the acoustic results of bubble-generated fluid sound: the bubbles in the liquid are no longer produced by a random model but are generated from the state of the fluid, making the final sound more lifelike. However, the main subjects of these methods are confined to small-scale water flows, and as the acoustic results keep improving, algorithmic complexity keeps rising, which prevents them from being applied to sound synthesis for outdoor water scenes.
Deep-learning sound-generation methods generate the corresponding sound from video. In "Visually Indicated Sounds", Owens et al. proposed a neural network composed of a convolutional neural network (CNN) and long short-term memory units (LSTM): it takes as input image features formed from the space-time stack of each video frame's grayscale image and the grayscale images of its neighboring frames, outputs a sound cochleagram corresponding to the video, and then searches a sound bank for the samples that best match this cochleagram, splicing them into the final result. In "Deep Cross-Modal Audio-Visual Generation", Chen et al. used GAN networks to design two translation modes: one converts the log-amplitude mel-spectrogram (LMS) of an input instrument sound into the corresponding instrument image, and the other converts an instrument image into the corresponding LMS, after which the instrument sound matching the LMS is retrieved; the outputs of both deep networks are image-like spectrograms rather than raw audio signals. In "Visual to Sound: Generating Natural Sound for Videos in the Wild", Zhou et al. made a first attempt at generating sound for natural-scene videos with a SampleRNN model, feeding features of the video images or optical-flow maps into the RNN to generate the audio signal directly; however, it still has problems with audio-visual synchronization.
Summary of the invention
The object of the invention is to provide, in view of the deficiencies of the prior art, an end-to-end generation method for water-scene audio that achieves end-to-end automatic generation of outdoor water-scene sound, solves the time-consuming and laborious problem of dubbing such scenes, and, by generating water-scene audio with the trained model, improves generation speed and audio-visual synchronization, thereby improving working efficiency.
To achieve the above object, the present invention adopts the following technical scheme:
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various kinds of water-scene video and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the sequence of the audio, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
It should be noted that, in the generation method of the invention: in step 1, training on various kinds of water-scene video helps optimize model training and reduce error; at the same time, because there is a large dimensional difference between the image information of a video and its sound, preprocessing brings image information and sound into the same dimension. In step 2, the generator model trained on the preprocessed data can automatically synthesize fluid sound synchronized with outdoor water-scene video: no professional Foley artist is needed to synthesize synchronized water-scene sound, nor do different algorithms have to be designed by hand for different scene characteristics; this saves manpower and material resources while improving the accuracy of the generator model and meeting users' needs. A discriminator also needs to be set up to assess the quality of the generator's results and feed the assessment back into the generator model; through repeated feedback and adjustment the generator model is trained effectively, improving its accuracy in synchronously matching sound to silent video. In step 3, a silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (a waterfall scene, for example), the timbre needs to be enhanced to meet the actual demand; at the same time, to further raise the level of automation, the timbre is enhanced with a trained timbre-enhancer model, achieving end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the envelope of the sound, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable, and conversion methods), which greatly increases processing speed and reduces the user's waiting time.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 1 the preprocessing comprises the following steps:
A1: extract the features of the video frames to obtain the video information;
A2: convert each second of video information into a vector of the same dimension as the audio.
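As an illustration of step A1, the sketch below computes a simple hand-crafted per-frame feature. It is a hypothetical stand-in for the learned features the method actually uses; the function name, the histogram choice, and the bin count are assumptions made only for illustration.

```python
import numpy as np

def frame_feature(frame, bins=64):
    # Stand-in for a learned frame feature: a normalized per-channel
    # intensity histogram of one H x W x 3 video frame (values 0..255).
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    feat = np.concatenate(hists).astype(float)
    return feat / feat.sum()  # normalize so the feature sums to 1
```

Any fixed-length per-frame feature of this kind could then be packed into the audio-dimension vector of step A2.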
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the training of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until a corresponding audio signal is output; if it corresponds, continue training on the next piece of video information.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the training of the timbre-enhancer model comprises the following steps:
C1: input the envelope of a target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next timbre-enhancement training.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the envelope is generated as follows:
D1: input an audio sequence G_V and the envelope sampling interval L_step;
D2: take the maximum absolute value of G_V within each sampling interval L_step as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array Ep and, by linear interpolation, form a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:

E(1:len) = interp(p_1 ⊕ p_2 ⊕ … ⊕ p_n)

where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
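Steps D1-D3 can be sketched as follows. The function name and the use of NumPy are assumptions, and placing the interpolation nodes at window centers is one reasonable reading of the linear-interpolation step.

```python
import numpy as np

def envelope(g_v, l_step):
    # D2: one envelope point per interval -- the maximum absolute value there.
    points = [np.max(np.abs(g_v[i:i + l_step]))
              for i in range(0, len(g_v), l_step)]
    # D3: linearly interpolate the points back to the length of g_v.
    xp = np.arange(len(points)) * l_step + l_step / 2.0  # window centers
    return np.interp(np.arange(len(g_v)), xp, points)
```

For one second of audio at 44100 samples and L_step = 100, this yields 441 envelope points interpolated back to a 44100-sample envelope.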
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step A2 the video information is converted according to:

G(y_1, …, y_m) → x_1, …, x_n,  x ∈ {audio}, y ∈ {video}

where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (value range −1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (range −1 to 1).
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the loss function used to output the audio signal is:

L_G = E_V[log(1 − D(G(V)))] + λ · E_{X,V}[‖X − G(V)‖_1]

where λ = 100, X denotes the ground-truth sound, V the video-frame information, G the generator's result, D the assessment result, and E the mean (expectation).
As a kind of a kind of improvement based on the generation method of water scene audio end to end of the present invention, step 2 In, assess loss function used in the audio signal are as follows:
Wherein, V indicates that video frame information, G indicate that generator generates as a result, that D expression is assessed as a result, E expression is averaged.
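A conditional-GAN loss pair of the kind described in step 2 (an adversarial term plus a λ-weighted L1 term with λ = 100) can be sketched numerically as follows. The exact form is an assumption, since the patent's formula images are not reproduced in this text; the function names and the ε log-stability guard are illustrative additions.

```python
import numpy as np

LAMBDA = 100.0  # weight of the L1 term, per the patent's lambda = 100
EPS = 1e-12     # illustrative guard against log(0)

def generator_loss(d_fake, fake_audio, real_audio):
    # Adversarial term (minimized as D's score on generated audio rises),
    # plus an L1 term pulling the waveform toward the ground truth X.
    adv = np.mean(np.log(1.0 - d_fake + EPS))
    l1 = np.mean(np.abs(real_audio - fake_audio))
    return adv + LAMBDA * l1

def discriminator_loss(d_real, d_fake):
    # Standard GAN assessment, written as a loss to minimize: reward
    # high scores on real audio and low scores on generated audio.
    return -np.mean(np.log(d_real + EPS)) - np.mean(np.log(1.0 - d_fake + EPS))
```

In training, the two losses alternate: the assessment result is fed back into the generator, matching the repeated feedback-and-adjustment process described above.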
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network, and the GAN network comprises a generator, a discriminator, and a timbre enhancer.
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network, and in step 1 the vector V_t generated from the preprocessed video frames can be expressed in the following form:

V_t = v_{t,1} ⊕ v_{t,1+q} ⊕ … ⊕ v_{t,1+(p−1)q}

where ⊕ denotes the concatenation operation, v_{t,q} denotes the feature extracted from frame q of second t, Floor denotes rounding down, and p and q are determined by the audio and video sample rates (p = Floor(SR_audio / dim(v_{t,1})), q = Floor(SR_video / p));

The sound-generation task can then be further expressed in the following form:

G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt

where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}.
The beneficial effects of the invention are as follows. The invention comprises the following steps: step 1, select various kinds of water-scene video and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. In the generation method of the invention: in step 1, training on various kinds of water-scene video helps optimize model training and reduce error; at the same time, because there is a large dimensional difference between the image information of a video and its sound, preprocessing brings image information and sound into the same dimension. In step 2, the generator model trained on the preprocessed data can automatically synthesize fluid sound synchronized with outdoor water-scene video: no professional Foley artist is needed to synthesize synchronized water-scene sound, nor do different algorithms have to be designed by hand for different scene characteristics; this saves manpower and material resources while improving the accuracy of the generator model and meeting users' needs. A discriminator also needs to be set up to assess the quality of the generator's results and feed the assessment back into the generator model; through repeated feedback and adjustment the generator model is trained effectively, improving its accuracy in synchronously matching sound to silent video. In step 3, a silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (a waterfall scene, for example), the timbre needs to be enhanced to meet the actual demand; at the same time, to further raise the level of automation, the timbre is enhanced with a trained timbre-enhancer model, achieving end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the envelope of the sound, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable, and conversion methods), which greatly increases processing speed and reduces the user's waiting time. The invention achieves end-to-end automatic generation of outdoor water-scene sound, solves the time-consuming and laborious problem of scene dubbing, and, by generating water-scene audio with the trained model, improves generation speed and synchronization, thereby improving working efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention;
Fig. 2 is an operation schematic of the invention;
Fig. 3 shows a water scene and the waveform of its corresponding audio signal in the invention;
Fig. 4 compares spectra before and after timbre enhancement in the invention.
Detailed description
Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may call the same component by different names. This specification and the claims distinguish components not by differences in name but by differences in function. "Comprising", as used throughout the specification and claims, is an open term and should therefore be construed as "including but not limited to". "Substantially" means within an acceptable error range in which a person skilled in the art can solve the technical problem and essentially achieve the technical effect.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "front", "rear", "left", "right", and "horizontal", are based on the orientations or positional relationships shown in the drawings and are used only for convenience and simplicity of description; they do not indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the invention.
In the invention, unless otherwise expressly specified and limited, terms such as "installed", "connected", "coupled", and "fixed" are to be understood broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediary; or an internal connection between two elements. For those of ordinary skill in the art, the specific meanings of these terms in the invention can be understood according to the particular circumstances.
The invention is described in further detail below in conjunction with Figs. 1-4, which are not to be taken as limiting the invention.
Embodiment 1
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various kinds of water-scene video and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the sequence of the audio, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
It should be noted that, in the generation method of the invention: in step 1, training on various kinds of water-scene video helps optimize model training and reduce error; at the same time, because there is a large dimensional difference between the image information of a video and its sound, preprocessing brings image information and sound into the same dimension. In step 2, the generator model trained on the preprocessed data can automatically synthesize fluid sound synchronized with outdoor water-scene video: no professional Foley artist is needed to synthesize synchronized water-scene sound, nor do different algorithms have to be designed by hand for different scene characteristics; this saves manpower and material resources while improving the accuracy of the generator model and meeting users' needs. A discriminator also needs to be set up to assess the quality of the generator's results and feed the assessment back into the generator model; through repeated feedback and adjustment the generator model is trained effectively, improving its accuracy in synchronously matching sound to silent video. In step 3, a silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (a waterfall scene, for example), the timbre needs to be enhanced to meet the actual demand; at the same time, to further raise the level of automation, the timbre is enhanced with a trained timbre-enhancer model, achieving end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the envelope of the sound, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable, and conversion methods), which greatly increases processing speed and reduces the user's waiting time.
Preferably, in step 1, the preprocessing comprises the following steps:
A1: extract the features of the video frames to obtain the video information;
A2: convert each second of video information into a vector of the same dimension as the audio.
In the above preprocessing, in step A1, a complete water-scene video occupies a large amount of memory, which hinders obtaining the video information and makes computation expensive; by extracting features of the video frames, the amount of computation can be reduced while the video information is still obtained, increasing processing speed. In step A2, the large dimensional difference between the image information of a video and its sound would not only inflate the amount of computation but also increase the error of the generator model and degrade the match between water-scene sound and video.
Preferably, in step 2, the training of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until a corresponding audio signal is output; if it corresponds, continue training on the next piece of video information.
In the above training, in step B2, the initial generator model is untrained and its output audio signal does not necessarily correspond; training on the vectors of video information from various kinds of water-scene video, with real-time feedback to the generator model, helps optimize the model's training and reduces the output error.
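The feed-back-and-adjust loop of steps B1-B2 can be illustrated with a deliberately tiny stand-in: a one-parameter "generator" adjusted from the assessment of its output until the output corresponds to the target. Everything here (the linear model, the learning rate, the toy data) is an assumption for illustration; the actual model is a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=8)           # toy stand-in for a video-information vector
x = 0.5 * v                      # toy target audio for that vector
w = 0.0                          # one-parameter "generator": G(v) = w * v

for _ in range(2000):            # B2: assess, feed back, adjust, repeat
    err = w * v - x              # assessment of the generated audio
    w -= 0.1 * np.mean(err * v)  # adjustment driven by the feedback
```

After the loop, w has converged near 0.5 and the "generated audio" corresponds to the target, mirroring the repeat-until-corresponding criterion of step B2.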
Preferably, in step 4, the training of the timbre-enhancer model comprises the following steps:
C1: input the envelope of a target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next timbre-enhancement training.
In the above training, in step C2, the initial timbre-enhancer model is untrained and its output audio sequence may not correspond to the envelope of the target audio; training on the envelopes of each class of audio, with real-time feedback to the timbre-enhancer model, helps optimize the model's training and reduces the output error.
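As a much-simplified stand-in for the learned timbre-enhancer model, the sketch below rescales a waveform so that its magnitude follows a target envelope. The real enhancer is a trained network; this closed-form, sample-wise gain is purely an illustrative assumption.

```python
import numpy as np

def enhance_to_envelope(audio, target_env, eps=1e-8):
    # Sample-wise gain pushing |audio| toward the target envelope;
    # eps guards against division by zero at silent samples.
    gain = target_env / (np.abs(audio) + eps)
    return audio * gain
```

A learned enhancer would instead infer such a mapping from the envelopes of each class of target audio during the C1-C2 training loop.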
Preferably, in step 4, the envelope is generated as follows:
D1: input an audio sequence G_V and the envelope sampling interval L_step;
D2: take the maximum absolute value of G_V within each sampling interval L_step as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array Ep and, by linear interpolation, form a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:

E(1:len) = interp(p_1 ⊕ p_2 ⊕ … ⊕ p_n)

where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
Preferably, in step A2, the video information is converted according to:

G(y_1, …, y_m) → x_1, …, x_n,  x ∈ {audio}, y ∈ {video}

where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (value range −1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (range −1 to 1).
Preferably, in step 2, the loss function used to output the audio signal is:

L_G = E_V[log(1 − D(G(V)))] + λ · E_{X,V}[‖X − G(V)‖_1]

where λ = 100, X denotes the ground-truth sound, V the video-frame information, G the generator's result, D the assessment result, and E the mean (expectation).
Preferably, in step 2, the loss function used to assess the audio signal is:

L_D = E_X[log D(X)] + E_V[log(1 − D(G(V)))]

where X denotes the ground-truth sound, V the video-frame information, G the generator's result, D the assessment result, and E the mean (expectation).
Embodiment 2
Unlike Embodiment 1: in the video preprocessing of this embodiment, the input videos differ in image size, so, to reduce computation and unify handling, the input images are scaled to 256 × 256 × 3, and the 30 such images of each second are then encoded into a single vector on the audio scale. First, for each video frame y_i, its feature vector v_i under the VGG19 network is extracted, of dimension 1 × 4096 × 1. Let SR_video and SR_audio be the sample rates of video and audio, which are 30 and 44100 in the invention. For the t-th second of video, the preprocessed vector V_t can be expressed in the following form:

V_t = v_{t,1} ⊕ v_{t,1+q} ⊕ … ⊕ v_{t,1+(p−1)q},  p = Floor(SR_audio / 4096), q = Floor(SR_video / p)

where ⊕ denotes the concatenation operation, v_{t,q} denotes the VGG19 feature extracted from frame q of second t, and Floor denotes rounding down; thus p = 10 and q = 3 in the invention. Any missing final length caused by rounding during splicing is uniformly zero-filled. In this way, the conversion from video to audio can be expressed in the following form:

G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt

where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}; V_t and X_t now have the same dimension.
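The per-second packing of Embodiment 2 can be sketched as follows, under the stated values SR_video = 30, SR_audio = 44100, 4096-dimensional frame features, p = 10, q = 3, and zero-filling of the rounding remainder; the function name and the NumPy layout are assumptions.

```python
import numpy as np

SR_VIDEO, SR_AUDIO, FEAT_DIM = 30, 44100, 4096
P = SR_AUDIO // FEAT_DIM  # p = Floor(44100 / 4096) = 10 features per second
Q = SR_VIDEO // P         # q = Floor(30 / 10) = 3: keep every 3rd frame

def second_to_vector(frame_feats):
    """frame_feats: (30, 4096) per-frame features of one second -> (44100,)."""
    kept = frame_feats[::Q][:P]                   # p frames, one every q
    v_t = np.concatenate(kept)                    # p * 4096 = 40960 values
    return np.pad(v_t, (0, SR_AUDIO - v_t.size))  # zero-fill the rounding gap
```

The resulting V_t has exactly the dimension of one second of audio, so the generator can map it to X_t of the same shape.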
Embodiment 3
Unlike Embodiment 1: the generation method of the water-scene audio of this embodiment is based on a GAN network, and the GAN network comprises a generator, a discriminator, and a timbre enhancer. The network of the invention is adjusted according to the input and output demanded by sound generation, so the receptive fields of the convolutional layers in the original image networks (receptive field: in a convolutional neural network CNN, the region of the input layer that determines one element of a given layer's output) no longer apply. Image networks usually use convolutional layers with 3 × 3 receptive fields; to handle the 44100-dimensional input and output of the invention, the receptive fields of the generator's and discriminator's convolutional layers are changed, and larger receptive fields are used to complete the corresponding convolution operations. In addition, during convolution, the two-dimensional filters used for images are discarded and, in view of the characteristics of the sound dimension, the invention convolves with one-dimensional filters. To remove unwanted frequency information from the generated audio result, a filter is finally added in the generator to filter out those frequency components while keeping the length of the output sequence unchanged. The specific structures of the generator and discriminator are given in Tables 1 and 2.
Table 1
Table 2
Here, because the ReLU, LeakyReLU, and BatchNorm layers following the convolutional (Conv1D) and transposed-convolutional (TransConv1D) layers involve neither convolution kernels nor changes to the output size, they are not shown in the tables. Stride denotes the convolution step during convolution or deconvolution. The three parameters in the "kernel size" column are, respectively, the receptive-field size, the layer's number of input channels, and its number of output channels. The three parameters in the "output shape" column are, respectively, the layer's batch size, input dimension, and channel count. To guarantee the correspondence between the convolution and deconvolution processes, the invention uses continually varying receptive fields and convolution strides so that the conversion between a layer's input and output involves neither discarding nor padding of dimensions.
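The effect of enlarging receptive fields that Embodiment 3 relies on can be checked with a small helper that composes per-layer kernel sizes and strides into the overall receptive field of a stack; the kernel/stride values in the usage note are hypothetical, since Tables 1 and 2 are not reproduced in this text.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns the overall receptive field of the stack on the input."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens rf by (k - 1) input steps
        jump *= s             # stride compounds the step between outputs
    return rf
```

Two image-style 3-wide layers give receptive_field([(3, 1), (3, 1)]) = 5 input samples, while two hypothetical 1-D layers with kernel 25 and stride 4 already give 1 + 24 + 24·4 = 121, illustrating why 44100-dimensional audio calls for much larger receptive fields than image networks use.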
In light of the disclosure and teachings of the above specification, those skilled in the art may also change and modify the above embodiments. Therefore, the invention is not limited to the specific embodiments described above; any obvious improvement, replacement, or modification made by those skilled in the art on the basis of the invention falls within the scope of protection of the invention. In addition, although certain specific terms are used in this specification, they are chosen merely for convenience of description and do not constitute any limitation on the invention.

Claims (10)

1. An end-to-end generation method for water-scene audio, characterized by comprising the following steps:
Step 1: select various kinds of water-scene video and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the sequence of the audio, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
2. The end-to-end generation method for water-scene audio of claim 1, characterized in that, in step 1, the preprocessing comprises the following steps:
A1: extract the features of the video frames to obtain the video information;
A2: convert each second of video information into a vector of the same dimension as the audio.
3. The end-to-end generation method for water-scene audio of claim 2, characterized in that, in step 2, the training of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until a corresponding audio signal is output; if it corresponds, continue training on the next piece of video information.
4. The end-to-end water scene audio generation method according to claim 1, characterized in that in Step 4 the training method of the timbre enhancer model comprises the following steps:
C1: inputting the envelope of a target audio and outputting a sequence of the audio through the timbre enhancer model;
C2: assessing the sequence of the audio; if it is not the target sequence, feeding the assessment back to the timbre enhancer model and adjusting again until the sequence of the target audio is output; if it is the target sequence, continuing with the next timbre enhancement training.
5. The end-to-end water scene audio generation method according to claim 1, characterized in that in Step 4 the generation method of the envelope comprises the following steps:
D1: inputting an audio sequence G_V and the sampling interval L_step of the envelope;
D2: taking, within each sampling interval L_step of the audio sequence G_V, the maximum absolute value as the envelope point p_i of that interval;
D3: connecting the envelope points p_i of all sampling intervals into an array E_p, and forming by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E_p = p_1 ⊕ p_2 ⊕ … ⊕ p_n, E(1:len) = interp(E_p)
where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
6. The end-to-end water scene audio generation method according to claim 2, characterized in that in Step A2 the video information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the value of the audio signal generated from the video frames (ranging from −1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from −1 to 1).
7. The end-to-end water scene audio generation method according to claim 3, characterized in that in Step 2 the loss function used for outputting the audio signal is as follows:
L_G = E[log(1 − D(V, G(V)))] + λ·E[||X − G(V)||_1]
where λ = 100, X denotes the ground-truth sound, V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes averaging.
8. The end-to-end water scene audio generation method according to claim 3, characterized in that in Step 2 the loss function used for assessing the audio signal is as follows:
where V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes averaging.
9. The end-to-end water scene audio generation method according to claim 1, characterized in that the water scene audio generation method is based on a GAN network, and the GAN network comprises a generator, a discriminator and a timbre enhancer.
10. The end-to-end water scene audio generation method according to claim 1, characterized in that in Step 1 the vector V_t generated from the pre-processed video frames can be expressed in the following form:
V_t = V_{t,1} ⊕ V_{t,2} ⊕ … ⊕ V_{t,Floor(fps)}
where ⊕ denotes the concatenation operation, V_{t,q} denotes the features extracted from the q-th frame of the t-th second, and Floor denotes rounding down;
the sound generation task can be further expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}.
CN201910091367.1A 2019-01-30 2019-01-30 End-to-end-based method for generating audio of water scene Active CN109936766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910091367.1A CN109936766B (en) 2019-01-30 2019-01-30 End-to-end-based method for generating audio of water scene


Publications (2)

Publication Number Publication Date
CN109936766A true CN109936766A (en) 2019-06-25
CN109936766B CN109936766B (en) 2021-04-13

Family

ID=66985371


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435591A (en) * 2020-01-17 2020-07-21 珠海市杰理科技股份有限公司 Sound synthesis method and system, audio processing chip and electronic equipment
CN113223493A (en) * 2020-01-20 2021-08-06 Tcl集团股份有限公司 Voice nursing method, device, system and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5831518A (en) * 1995-06-16 1998-11-03 Sony Corporation Sound producing method and sound producing apparatus
CN101299241A (en) * 2008-01-14 2008-11-05 浙江大学 Method for detecting multi-mode video semantic conception based on tensor representation
US20090241753A1 (en) * 2004-12-30 2009-10-01 Steve Mann Acoustic, hyperacoustic, or electrically amplified hydraulophones or multimedia interfaces
CN102222506A (en) * 2010-04-15 2011-10-19 迪尔公司 Context-based sound generation
CN103117057A (en) * 2012-12-27 2013-05-22 安徽科大讯飞信息科技股份有限公司 Application method of special human voice synthesis technique in mobile phone cartoon dubbing
WO2018039433A1 (en) * 2016-08-24 2018-03-01 Delos Living Llc Systems, methods and articles for enhancing wellness associated with habitable environments


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ROGER B. DANNENBERG ET AL.: "Sound Synthesis from Real-Time Video Images", ResearchGate *
LIU Jie: "Research on ocean wave simulation algorithms based on wave spectra and system implementation", China Master's Theses Full-text Database, Information Science and Technology *
WANG Kai et al.: "Efficient Sound Synthesis for Natural Scenes", IEEE *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant