CN109936766A - End-to-end generation method for water-scene audio - Google Patents
- Publication number: CN109936766A (application CN201910091367.1A)
- Authority: CN (China)
- Prior art keywords: audio, video, generation method, water scene, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention belongs to the technical field of audio processing and in particular relates to an end-to-end generation method for water-scene audio, comprising the following steps: step 1, select various water-scene videos and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. The invention achieves end-to-end automatic generation of outdoor water-scene sound, solving the problem that dubbing such scenes is time-consuming and laborious; at the same time, generating water-scene audio with the trained model improves generation speed and synchronization, and thus working efficiency.
Description
Technical field
The invention belongs to the technical field of audio processing and in particular relates to an end-to-end generation method for water-scene audio.
Background art
With the continuous development of computer graphics, people place ever higher demands on the sound quality of video and animation. Water scenes, especially outdoor water scenes, are ubiquitous in film and games, so developing a method that can automatically generate the corresponding scene sound from an outdoor water-scene video is highly necessary. At present, physically based methods are mostly used to generate water-scene sound.
Physically based water-sound generation methods rest mainly on one theory: the formation and resonance of bubbles are the most important source of underwater sound. In their work on harmonic bubbles, Zheng et al. proposed a flowing-water sound generation method that, by taking the sound-propagation process into account, generates a variety of flowing-water sounds, including running tap water; its results, however, require cumbersome manual adjustment. Later, in work on the fluid dynamics of complex acoustic bubbles, Langlois et al. proposed a sound generation method based on two-phase incompressible fluid simulation to improve the acoustic results of fluid sound generated from bubbles: the bubbles in the liquid no longer follow a random model but are generated from the state of the fluid, yielding more realistic bubbles and a more lifelike final sound. However, the main object of study of these methods is confined to small-scale water flows, and as the acoustic results keep improving, the algorithmic complexity keeps rising, which prevents them from being applied to sound synthesis for outdoor water scenes.
Deep-learning sound generation methods generate the corresponding sound from video. Owens et al., in "Visually Indicated Sounds", proposed a neural network composed of a convolutional neural network (CNN) and long short-term memory (LSTM) units: it takes as input the image features of a space-time volume formed from each video frame's grayscale image and those of its neighboring frames, outputs a sound cochleagram corresponding to the video, and then searches a sound bank for the samples that best match this cochleagram, splicing them into the final result. Chen et al., in work on deep cross-modal audio-visual generation, designed two translation modes with GAN networks: one converts the log-magnitude mel spectrogram (LMS) of an instrument's sound into the corresponding instrument image, the other converts an instrument image into the corresponding LMS, from which a matching instrument sound is then retrieved. The outputs of both deep networks are spectrogram-like images rather than raw sound signals. Zhou et al., in "Visual to Sound: Generating Natural Sound for Videos in the Wild", made a first attempt at generating sound for natural-scene videos with a SampleRNN model, feeding features of the video images or optical-flow maps into the RNN to generate the sound signal directly; however, it still has problems with audio-visual synchronization.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing an end-to-end generation method for water-scene audio that achieves end-to-end automatic generation of outdoor water-scene sound and solves the problem that dubbing such scenes is time-consuming and laborious; at the same time, generating water-scene audio with the trained model improves generation speed and synchronization, and thus working efficiency.
To achieve the above object, the invention adopts the following technical scheme:
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
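The four steps can be sketched as a minimal dataflow. Everything below is a hypothetical stand-in for illustration: `preprocess`, `generator` and `enhancer` are toy placeholders for the trained models the method describes, not the patent's networks.

```python
import numpy as np

SR_AUDIO = 44100   # audio sample rate used in the patent

def preprocess(frames):
    # Step 1 (stand-in): reduce one second of frames to an
    # audio-dimensional video-information vector.
    return np.resize(np.stack([f.ravel() for f in frames]), SR_AUDIO)

def generator(v):
    # Steps 2-3 (stand-in for the trained generator model):
    # map the video vector to one second of audio in [-1, 1].
    return np.tanh(v - v.mean())

def enhancer(audio):
    # Step 4 (stand-in for the trained timbre enhancer): here just a
    # pass-through that bounds the signal; the real model conditions
    # on the audio's envelope.
    return np.clip(audio, -1.0, 1.0)

silent_video = np.random.rand(30, 256, 256, 3)   # one second at 30 fps
out = enhancer(generator(preprocess(silent_video)))
print(out.shape)   # (44100,): one second of generated audio
```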
It should be noted that in the generation method of the invention, in step 1, selecting various water-scene videos for training helps to optimize model training and reduce error; at the same time, because there is a large dimensional gap between a video's image information and its sound, preprocessing puts the image information and the sound into the same dimension. In step 2, training the generator model on the preprocessed data makes it possible to synthesize automatically fluid sound synchronized with an outdoor water-scene video, with no need for a professional Foley artist to synthesize synchronized water-scene sound, and no need to design different algorithms by hand for different scene characteristics to generate the sound of each kind of scene; this saves labor and material resources while improving the accuracy of the generator model and meeting users' needs. A discriminator is also required to assess the quality of the generator's output and feed the assessment back into the generator model; through repeated feedback and adjustment, the generator model is trained effectively, improving its accuracy in matching synchronized sound to a silent video. In step 3, a silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (for example, a waterfall scene), the timbre needs to be enhanced to meet the actual demand; at the same time, to raise the level of automation further, the timbre is enhanced with a trained timbre-enhancer model, achieving end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the sound's envelope, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable and conversion methods), greatly improving processing speed and reducing the user's waiting time.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 1 the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 4 the generation method of the envelope comprises the following steps:
D1: input an audio sequence G_V and the sampling interval L_step of the envelope;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E(1:len) = interp(E_p), E_p = p_1 ⊕ p_2 ⊕ …,
where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
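Steps D1-D3 translate directly into a short routine; the sampling interval and the test signal below are illustrative choices, not values fixed by the method.

```python
import numpy as np

def make_envelope(g_v, l_step):
    """Steps D1-D3: per-interval maximum of |g_v|, linearly
    interpolated back to the full length of the audio sequence."""
    # D2: one envelope point p_i per sampling interval L_step.
    starts = np.arange(0, len(g_v), l_step)
    points = np.array([np.abs(g_v[s:s + l_step]).max() for s in starts])
    # D3: connect the points into E_p and interpolate to a sequence
    # E(1:len) with the same length as g_v.
    return np.interp(np.arange(len(g_v)), starts, points)

# A decaying 50 Hz test tone sampled at 1 kHz.
t = np.arange(1000) / 1000.0
g_v = np.exp(-3 * t) * np.sin(2 * np.pi * 50 * t)
env = make_envelope(g_v, l_step=50)
print(env.shape)   # same length as the input sequence: (1000,)
```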
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step A2 the video-information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the loss function used to output the audio signal is as follows, with λ = 100, where X denotes the ground-truth sound, V denotes the video-frame information, G denotes the generator's output, D denotes the discriminator's assessment, and E denotes the average.
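The loss formula itself did not survive extraction; the quantities described (ground truth X, conditional assessment D, a weight λ = 100) are consistent with a pix2pix-style conditional-GAN generator objective, so one plausible reconstruction, offered only as an assumption, is:

```latex
\mathcal{L}_{G} = \mathbb{E}_{V}\left[\log\left(1 - D(V, G(V))\right)\right]
  + \lambda\,\mathbb{E}_{X,V}\left[\lVert X - G(V)\rVert_{1}\right],
  \qquad \lambda = 100
```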
As an improvement of the end-to-end generation method for water-scene audio of the invention, in step 2 the loss function used to assess the audio signal is as follows, where V denotes the video-frame information, G denotes the generator's output, D denotes the assessment result, and E denotes the average.
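This discriminator formula is likewise missing from the extraction. A standard conditional-GAN discriminator objective using the symbols defined here would be the following; the real-sample term D(V, X) is our assumption, since the source lists only V, G, D and E:

```latex
\mathcal{L}_{D} = -\,\mathbb{E}_{X,V}\left[\log D(V, X)\right]
  - \mathbb{E}_{V}\left[\log\left(1 - D(V, G(V))\right)\right]
```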
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network, and the GAN network comprises a generator, a discriminator and a timbre enhancer.
As an improvement of the end-to-end generation method for water-scene audio of the invention, the generation method of the water-scene audio is based on a GAN network; in step 1, the vector V_t generated from the preprocessed video frames can be expressed as the concatenation of the per-frame features v_{t,q}, where ⊕ denotes the concatenation operation, v_{t,q} denotes the features extracted from frame q of second t, and Floor denotes rounding down.
The sound-generation task can further be expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}.
The beneficial effects of the invention are as follows. The invention comprises the following steps: step 1, select various water-scene videos and preprocess them; step 2, train a generator model on the preprocessed data; step 3, preprocess a silent video, feed it into the trained generator model, and output audio corresponding to the silent video; step 4, generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio. In step 1 of the generation method of the invention, selecting various water-scene videos for training helps to optimize model training and reduce error; because there is a large dimensional gap between a video's image information and its sound, preprocessing puts them into the same dimension. In step 2, training the generator model on the preprocessed data makes it possible to synthesize automatically fluid sound synchronized with an outdoor water-scene video, without a professional Foley artist to synthesize synchronized water-scene sound and without hand-designing different algorithms for different scene characteristics; this saves labor and material resources while improving the generator model's accuracy and meeting users' needs. A discriminator is also set up to assess the quality of the generator's output and feed the assessment back into the generator model; repeated feedback and adjustment train the generator model effectively and improve its accuracy in matching synchronized sound to a silent video. In step 3, a silent video has no sound, so the trained generator model generates the corresponding audio data from each second's silent-video information vector, thereby dubbing the silent video. In step 4, because the audio data output by the generator model may not match the actual water scene (for example, a waterfall scene), the timbre is enhanced with a trained timbre-enhancer model to meet the actual demand and raise the level of automation further, achieving end-to-end automatic generation of outdoor water-scene sound; the trained timbre-enhancer model obtains the enhanced audio directly from the sound's envelope, dispensing with intermediate physical methods (e.g., imaging, comparison, synthesis, controlled-variable and conversion methods), greatly improving processing speed and reducing the user's waiting time. The invention thus achieves end-to-end automatic generation of outdoor water-scene sound, solves the problem that dubbing such scenes is time-consuming and laborious, and, by generating water-scene audio with the trained model, improves generation speed, synchronization and working efficiency.
Detailed description of the invention
Fig. 1 is a flow diagram of the invention;
Fig. 2 is an operation schematic of the invention;
Fig. 3 is a waveform diagram of a water scene and its corresponding audio signal in the invention;
Fig. 4 is a spectral comparison before and after timbre enhancement in the invention.
Specific embodiment
Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may refer to the same component by different names. This specification and the claims do not distinguish components by difference of name but by difference of function. The word "comprising" used throughout the specification and claims is an open-ended term and should therefore be construed as "including but not limited to". "Substantially" means within an acceptable error range within which a person skilled in the art can solve the technical problem and basically achieve the technical effect.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "front", "rear", "left", "right" and "horizontal", are based on the orientations or positional relationships shown in the drawings, are merely for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be construed as limiting the invention.
In the invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled" and "fixed" are to be understood broadly: for example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the invention can be understood according to the particular circumstances.
The invention is described in further detail below with reference to Figs. 1 to 4, but the figures are not to be taken as limiting the invention.
Embodiment 1
An end-to-end generation method for water-scene audio comprises the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
Preferably, in step 1, the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
In the above preprocessing method, in step A1, a complete water-scene video occupies a large amount of memory, which is unfavorable to obtaining the video's information, and the amount of computation is large; extracting the features of the video frames reduces the computation while still obtaining the video's information, improving processing speed. In step A2, the large dimensional gap between the video's image information and the sound would not only add much computation but also increase the generator model's error and degrade the matching of water-scene sound to video.
Preferably, in step 2, the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
In the above training method, in step B2, the initial, untrained generator model does not necessarily output corresponding audio signals; training on various water-scene videos, with the video-information vectors fed back to the generator model in real time, helps to optimize the model's training and reduce its output error.
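The B1-B2 feedback loop can be illustrated with a deliberately tiny example: a one-parameter "generator" and a mean-squared "assessment" stand in for the deep generator and discriminator, which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = 0.5                      # hidden video-to-audio mapping to recover
w = 0.0                           # the toy generator's single parameter

for step in range(200):
    v = rng.standard_normal(64)   # B1: a batch of video-information vectors
    audio = w * v                 # B1: generator outputs an audio signal
    target = true_w * v           # ground-truth audio for this video
    error = audio - target        # B2: assessment of the generated audio
    w -= 0.1 * np.mean(error * v) # B2: feed the assessment back and adjust

print(round(w, 3))   # close to the true mapping 0.5
```

The repeated assess-feedback-adjust cycle is the same shape as steps B1-B2; in the patent the "assessment" is the discriminator's judgment rather than a known ground-truth error.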
Preferably, in step 4, the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
In the above training method, in step C2, the initial, untrained timbre-enhancer model does not necessarily output audio sequences corresponding to the envelope of the target audio; training on the envelopes of each class of audio, with real-time feedback to the timbre-enhancer model, helps to optimize the model's training and reduce its output error.
Preferably, in step 4, the generation method of the envelope comprises the following steps:
D1: input an audio sequence G_V and the sampling interval L_step of the envelope;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as one envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E(1:len) = interp(E_p), E_p = p_1 ⊕ p_2 ⊕ …,
where p_i ∈ G_V, interp(·) denotes linear interpolation, and ⊕ denotes the concatenation operation.
Preferably, in step A2, the video-information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
Preferably, in step 2, the loss function used to output the audio signal is as follows, with λ = 100, where X denotes the ground-truth sound, V denotes the video-frame information, G denotes the generator's output, D denotes the discriminator's assessment, and E denotes the average.
Preferably, in step 2, the loss function used to assess the audio signal is as follows, where V denotes the video-frame information, G denotes the generator's output, D denotes the assessment result, and E denotes the average.
Embodiment 2
This embodiment differs from Embodiment 1 as follows: in its video preprocessing, because input videos generally differ in image size, the input images are scaled to 256 × 256 × 3 to reduce computation and unify management, and the 30 images of size 256 × 256 × 3 in each second are then encoded into a vector matching the audio scale.
First, for each video frame y_i, its feature vector v_i is extracted with a VGG19 network; its dimension is 1 × 4096 × 1. Let SR_video and SR_audio be the sample rates of the video and the audio, which in the invention are 30 and 44100. For the t-th second of video, the corresponding preprocessed vector V_t can be expressed as the concatenation of the per-frame features, where ⊕ denotes the concatenation operation, v_{t,q} denotes the VGG19 features extracted from frame q of second t, and Floor denotes rounding down; in the invention, p = 10 and q = 3. For the length missing from the spliced result because of rounding, the invention uniformly pads the vacancy with zeros. In this way, the conversion from the original video to audio can be expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}; V_t and X_t then have the same dimension.
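The preprocessing of this embodiment (scale each frame to 256 × 256 × 3, sample every q-th frame, extract a 4096-dimensional feature per frame, concatenate, and zero-pad the vacancy to 44100) can be sketched as follows; `fake_vgg19` and `scale_frame` are stand-ins, since the patent uses a real VGG19 network and does not specify the scaling method.

```python
import numpy as np

SR_VIDEO, SR_AUDIO = 30, 44100   # video/audio sample rates in the patent
FEAT_DIM, Q = 4096, 3            # per-frame feature size; every 3rd frame

def scale_frame(frame):
    # Stand-in for image scaling: nearest-neighbour resize to 256x256x3.
    h, w, _ = frame.shape
    ys = np.arange(256) * h // 256
    xs = np.arange(256) * w // 256
    return frame[ys][:, xs]

def fake_vgg19(frame):
    # Stand-in for VGG19: any fixed map from a frame to 4096 values.
    return np.resize(frame.mean(axis=2), FEAT_DIM)

def preprocess_second(frames):
    feats = [fake_vgg19(scale_frame(f)) for f in frames[::Q]]  # p = 10 frames
    flat = np.concatenate(feats)                               # 40960 values
    return np.pad(flat, (0, SR_AUDIO - flat.size))             # zero-pad vacancy

frames = np.random.rand(SR_VIDEO, 480, 640, 3)  # one second of raw video
v_t = preprocess_second(frames)
print(v_t.shape)   # (44100,), the same dimension as one second of audio
```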
Embodiment 3
This embodiment differs from Embodiment 1 as follows: the generation method of the water-scene audio of this embodiment is based on a GAN network, and the GAN network comprises a generator, a discriminator and a timbre enhancer. The network in the invention is adjusted to the inputs and outputs demanded by sound generation, so that the receptive field of each convolutional layer used in the original image networks (receptive field: in a convolutional neural network, CNN, the region of the input layer that determines one element of a given layer's output) no longer applies. Image networks usually use convolutional layers with a 3 × 3 receptive field; for the 44100-dimensional inputs and outputs of the invention, the receptive fields of the generator's and the discriminator's convolutional layers are changed, and larger receptive fields are used to perform the corresponding convolution operations. In addition, during convolution the two-dimensional filters used for images are discarded and, to suit the characteristics of the sound dimension, the invention uses one-dimensional filters. To remove unwanted frequency information from the digital sound result, a filter is added at the end of the generator to filter out partial frequency components, keeping the length of the output sequence unchanged during filtering. The specific structures of the generator and the discriminator are given in Tables 1 and 2.
Table 1
Table 2
Note that the ReLU, LeakyReLU and BatchNorm layers following the convolutional (Conv1D) and transposed-convolutional (TransConv1D) layers involve neither convolution kernels nor changes of output size and are therefore not shown in the tables. Stride denotes the convolution step during convolution or deconvolution. The three parameters in the "kernel size" column are, respectively, the receptive-field size, the layer's number of input channels and its number of output channels; the three parameters in the "output shape" column are, respectively, the layer's batch size, input dimension and number of channels. To guarantee the correspondence between the convolution and deconvolution processes, the invention uses continually varying receptive fields and strides during convolution or deconvolution, so that the conversion between a layer's input and output neither discards nor adds dimensions.
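The receptive-field discussion can be made concrete: for a stack of 1-D convolutions, each layer enlarges the receptive field by (kernel − 1) times the product of the preceding strides. The layer shapes below are illustrative, since Tables 1 and 2 did not survive extraction.

```python
def receptive_field(layers):
    """Receptive field of stacked 1-D convolutions.

    layers: list of (kernel_size, stride) from first to last layer.
    """
    rf, jump = 1, 1   # span of one output element; input step per output step
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Image networks commonly stack 3x3 convolutions; two such layers
# cover only a 5-element span of the input:
print(receptive_field([(3, 1), (3, 1)]))   # 5

# For 44100-dimensional audio, far larger kernels and strides are
# needed; these layer shapes are illustrative, not the patent's Table 1:
print(receptive_field([(25, 4), (25, 4), (25, 4), (25, 4)]))
```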
In the light of the disclosure and teachings of the above specification, those skilled in the art may also change and modify the above embodiments. The invention is therefore not limited to the specific embodiments above; any obvious improvement, replacement or modification made by those skilled in the art on the basis of the invention falls within the scope of protection of the invention. In addition, although some specific terms are used in this specification, they are merely for convenience of description and do not constitute any limitation of the invention.
Claims (10)
1. An end-to-end generation method for water-scene audio, characterized by comprising the following steps:
Step 1: select various water-scene videos and preprocess them;
Step 2: train a generator model on the preprocessed data;
Step 3: preprocess a silent video, feed it into the trained generator model, and output the audio corresponding to the silent video;
Step 4: generate an envelope from the audio sequence, feed it into a trained timbre-enhancer model, and output the timbre-enhanced audio.
2. The end-to-end generation method for water-scene audio according to claim 1, characterized in that in step 1 the preprocessing method comprises the following steps:
A1: extract the features of the video frames to obtain the video's information;
A2: convert each second's video information into a vector of the same dimension as the audio.
3. The end-to-end generation method for water-scene audio according to claim 2, characterized in that in step 2 the training method of the generator model comprises the following steps:
B1: input the vector of video information and output an audio signal through the generator model;
B2: assess the audio signal; if it does not correspond, feed it back to the generator model and adjust again until the corresponding audio signal is output; if it corresponds, continue with the training of the next piece of video information.
4. The end-to-end generation method for water-scene audio according to claim 1, characterized in that in step 4 the training method of the timbre-enhancer model comprises the following steps:
C1: input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;
C2: assess the audio sequence; if it is not the target sequence, feed it back to the timbre-enhancer model and adjust again until the target audio sequence is output; if it is the target sequence, continue with the next piece of timbre-enhancement training.
5. The end-to-end water scene audio generation method according to claim 1, characterised in that in Step 4 the generation of the envelope comprises the following steps:
D1: input an audio sequence G_V and the envelope sampling interval L_step;
D2: within each sampling interval L_step of the audio sequence G_V, take the maximum absolute value as the envelope point p_i of that interval;
D3: connect the envelope points p_i of all sampling intervals into an array E_p, and form by linear interpolation a sequence E(1:len) of the same length as G_V, which is the envelope corresponding to the audio sequence G_V:
E_p = p_1 ⊕ p_2 ⊕ … ⊕ p_k,  E(1:len) = interp(E_p)
where p_i ∈ G_V, interp() denotes linear interpolation, and ⊕ denotes the concatenation operation.
6. The end-to-end water scene audio generation method according to claim 2, characterised in that in Step A2 the video information conversion formula is:
G(y_1, …, y_m) → x_1, …, x_n, x ∈ {audio}, y ∈ {video}
where y_1, …, y_m represent the color-channel information of the video frames, each channel being a matrix of values between 0 and 255; G(y_1, …, y_m) denotes the value of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, …, x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
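The mapping in claim 6 takes color-channel matrices with values in 0–255 and produces audio values in -1 to 1. A minimal pre-processing sketch follows; the function name `frames_to_vector` and the choice of linear interpolation to reach the audio dimension are assumptions, since the patent only requires the per-second vector to match the audio dimension:

```python
import numpy as np

def frames_to_vector(frames, n):
    """Flatten one second of video frames into a length-n vector in [-1, 1].

    frames: list of H x W x C uint8 arrays (channel values 0-255).
    n: the audio dimension for one second (e.g. the sampling rate).
    """
    # Concatenate all channel values of all frames into one flat array
    flat = np.concatenate([f.astype(float).ravel() for f in frames])
    # Map pixel range 0..255 to the audio range -1..1
    flat = flat / 127.5 - 1.0
    # Resample to the audio dimension by linear interpolation (assumed)
    return np.interp(np.linspace(0, len(flat) - 1, n),
                     np.arange(len(flat)), flat)
```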
7. The end-to-end water scene audio generation method according to claim 3, characterised in that in Step 2 the loss function used to output the audio signal is:
L_G = E[log(1 - D(V, G(V)))] + λ E[||X - G(V)||_1]
where λ = 100, X denotes the ground-truth sound, V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes the average.
8. The end-to-end water scene audio generation method according to claim 3, characterised in that in Step 2 the loss function used to assess the audio signal is:
L_D = E[log D(V, X)] + E[log(1 - D(V, G(V)))]
where X denotes the ground-truth sound, V denotes the video frame information, G denotes the result generated by the generator, D denotes the assessment result, and E denotes the average.
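The loss functions referenced in claims 7 and 8 were rendered as images in the original patent; the sketch below assumes the pix2pix-style conditional-GAN form suggested by the λ = 100 weight and the listed variables (generator loss with an L1 reconstruction term, standard discriminator objective). Function names and the sigmoid-score convention are assumptions:

```python
import numpy as np

def generator_loss(d_fake, x_true, x_gen, lam=100.0):
    """Assumed generator loss (claim 7): adversarial term plus
    lambda-weighted L1 distance to the ground-truth sound.

    d_fake: discriminator scores D(V, G(V)) in (0, 1).
    """
    adv = np.mean(np.log(1.0 - d_fake + 1e-12))  # fooling term (minimized)
    l1 = np.mean(np.abs(x_true - x_gen))         # reconstruction term
    return adv + lam * l1

def discriminator_loss(d_real, d_fake):
    """Assumed discriminator objective (claim 8): score real pairs
    (V, X) high and generated pairs (V, G(V)) low."""
    return (np.mean(np.log(d_real + 1e-12))
            + np.mean(np.log(1.0 - d_fake + 1e-12)))
```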
9. The end-to-end water scene audio generation method according to claim 1, characterised in that the water scene audio generation method is based on a GAN network, the GAN network comprising a generator, a discriminator, and a timbre enhancer.
10. The end-to-end water scene audio generation method according to claim 1, characterised in that in Step 1 the vector V_t generated from the pre-processed video frames can be expressed in the following form:
V_t = v_{t,1} ⊕ v_{t,2} ⊕ … ⊕ v_{t,Floor(fps)}
where ⊕ denotes the concatenation operation, v_{t,q} denotes the feature extracted from the q-th frame of the t-th second, and Floor denotes rounding down;
the sound generation task can then be further expressed in the following form:
G(V_1, V_2, …, V_Δt) → X_1, X_2, …, X_Δt
where X_t = {x_{t,1}, x_{t,2}, …, x_{t,SR_audio}}, t ∈ {1, 2, …, Δt}, and SR_audio denotes the audio sampling rate.
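The per-second vector construction in claim 10 is a concatenation of the per-frame features of that second, with the frame count floored. A minimal sketch, where the function name `second_vector` is an assumption:

```python
import numpy as np

def second_vector(features, fps):
    """Build V_t by concatenating the per-frame features of one second.

    features: list of 1-D feature arrays v_{t,q}, one per frame.
    fps: frame rate; Floor(fps) frames are used, matching the floor
    operation in the claim.
    """
    k = int(np.floor(fps))
    # V_t = v_{t,1} (+) v_{t,2} (+) ... (+) v_{t,Floor(fps)}
    return np.concatenate(features[:k])
```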
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910091367.1A CN109936766B (en) | 2019-01-30 | 2019-01-30 | End-to-end-based method for generating audio of water scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109936766A true CN109936766A (en) | 2019-06-25 |
CN109936766B CN109936766B (en) | 2021-04-13 |
Family
ID=66985371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910091367.1A Active CN109936766B (en) | 2019-01-30 | 2019-01-30 | End-to-end-based method for generating audio of water scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109936766B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111435591A (en) * | 2020-01-17 | 2020-07-21 | 珠海市杰理科技股份有限公司 | Sound synthesis method and system, audio processing chip and electronic equipment |
CN113223493A (en) * | 2020-01-20 | 2021-08-06 | Tcl集团股份有限公司 | Voice nursing method, device, system and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5831518A (en) * | 1995-06-16 | 1998-11-03 | Sony Corporation | Sound producing method and sound producing apparatus |
CN101299241A (en) * | 2008-01-14 | 2008-11-05 | 浙江大学 | Method for detecting multi-mode video semantic conception based on tensor representation |
US20090241753A1 (en) * | 2004-12-30 | 2009-10-01 | Steve Mann | Acoustic, hyperacoustic, or electrically amplified hydraulophones or multimedia interfaces |
CN102222506A (en) * | 2010-04-15 | 2011-10-19 | 迪尔公司 | Context-based sound generation |
CN103117057A (en) * | 2012-12-27 | 2013-05-22 | 安徽科大讯飞信息科技股份有限公司 | Application method of special human voice synthesis technique in mobile phone cartoon dubbing |
WO2018039433A1 (en) * | 2016-08-24 | 2018-03-01 | Delos Living Llc | Systems, methods and articles for enhancing wellness associated with habitable environments |
Non-Patent Citations (3)
Title |
---|
ROGER B DANNENBERG ET AL.: "Sound Synthesis from Real-Time Video Image", 《RESEARCHGATE》 *
LIU JIE: "Research and System Implementation of an Ocean Wave Simulation Algorithm Based on the Wave Spectrum", 《China Master's Theses Full-text Database, Information Science and Technology》 *
WANG KAI ET AL.: "Efficient Sound Synthesis for Natural Scenes", 《IEEE》 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102449664B (en) | Gradual-change animation generating method and apparatus | |
CN112465955B (en) | Dynamic human body three-dimensional reconstruction and visual angle synthesis method | |
CN109711401B (en) | Text detection method in natural scene image based on Faster Rcnn | |
CN102054287B (en) | Facial animation video generating method and device | |
CN113706412B (en) | SDR (read-write memory) to HDR (high-definition digital interface) conversion method | |
US20070009180A1 (en) | Real-time face synthesis systems | |
CN111783658B (en) | Two-stage expression animation generation method based on dual-generation reactance network | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
CN109936766A (en) | A kind of generation method based on water scene audio end to end | |
CN106709933B (en) | Motion estimation method based on unsupervised learning | |
Zhang et al. | Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video | |
CN116721191A (en) | Method, device and storage medium for processing mouth-shaped animation | |
Xu et al. | Deep video inverse tone mapping | |
CN109525787B (en) | Live scene oriented real-time subtitle translation and system implementation method | |
CN108765246B (en) | A kind of selection method of steganographic system carrier image | |
CN109658369A (en) | Video intelligent generation method and device | |
CN113160358A (en) | Non-green-curtain cutout rendering method | |
Nakatsuka et al. | Audio-guided Video Interpolation via Human Pose Features. | |
Zhang et al. | Deep Learning Technology in Film and Television Post-Production | |
Xiao et al. | Dense convolutional recurrent neural network for generalized speech animation | |
CN112287998A (en) | Method for detecting target under low-light condition | |
CN114679605B (en) | Video transition method, device, computer equipment and storage medium | |
CN115829868B (en) | Underwater dim light image enhancement method based on illumination and noise residual image | |
CN117593442A (en) | Portrait generation method based on multi-stage fine grain rendering | |
CN113806584B (en) | Self-supervision cross-modal perception loss-based method for generating command actions of band |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||