CN109887515A

CN109887515A - Audio processing method and device, electronic equipment and storage medium

Info

Publication number: CN109887515A
Application number: CN201910086763.5A
Authority: CN
Inventors: 周航; 刘子纬; 徐旭东; 罗平; 王晓刚
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2019-06-14
Anticipated expiration: 2039-01-29
Also published as: CN109887515B

Abstract

The present disclosure relates to an audio processing method and device, an electronic device, and a storage medium. The method comprises: performing spectrum conversion on corrupted audio to be processed to obtain a first spectral image of the corrupted audio; performing spectrum completion on the first spectral image to obtain a completed second spectral image; and completing the corrupted audio according to the second spectral image to obtain a completed first audio, so that the completed first audio presents a good auditory effect.

Description

Audio processing method and device, electronic equipment and storage medium

Technical field

The present disclosure relates to the field of signal processing technology, and in particular to an audio processing method and device, an electronic device, and a storage medium.

Background

Audio completion refers to regenerating the missing part of an audio signal and filling it in naturally when a section of the audio is lost because of noise interference or an accident. This technique is widely used in audio information restoration and noise reduction. The related art relies mainly on traditional audio processing methods: using sparse audio representations, it searches the known segments around the missing segment for similar parts and uses them to fill the gap.

Summary of the invention

The present disclosure proposes a technical solution for audio processing.

According to one aspect of the present disclosure, an audio processing method is provided, comprising: performing spectrum conversion on corrupted audio to be processed to obtain a first spectral image of the corrupted audio; performing spectrum completion on the first spectral image to obtain a completed second spectral image; and completing the corrupted audio according to the second spectral image to obtain a completed first audio.

In one possible implementation, performing spectrum completion on the first spectral image to obtain the completed second spectral image comprises: performing feature extraction on the first spectral image to obtain a first spectral feature; and performing spectrum reconstruction on the first spectral feature to obtain the second spectral image.

In one possible implementation, performing spectrum completion on the first spectral image to obtain the completed second spectral image comprises: performing feature extraction on the first spectral image to obtain a second spectral feature; performing feature extraction on related information of the corrupted audio to obtain a supervision feature; aligning the second spectral feature with the supervision feature; and performing spectrum reconstruction on the first spectral feature according to the aligned supervision feature to obtain the second spectral image, wherein the related information includes at least one of video information and optical flow information corresponding to the corrupted audio.

In one possible implementation, the corrupted audio includes a corrupted audio segment, and completing the corrupted audio according to the second spectral image to obtain the completed first audio comprises: performing spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image to obtain a completed audio segment; and completing the corrupted audio with the completed audio segment to obtain the completed first audio.

In one possible implementation, the corrupted audio includes a corrupted audio segment and an undamaged audio segment, and completing the corrupted audio according to the second spectral image to obtain the completed first audio comprises: predicting the completed audio segment according to the undamaged audio segment and the spectral image corresponding to the corrupted audio segment in the second spectral image; and completing the corrupted audio with the completed audio segment to obtain the completed first audio.

In one possible implementation, the operation of completing the corrupted audio according to the second spectral image to obtain the completed first audio is implemented by a WaveNet decoding network.

In one possible implementation, the first spectral image and the second spectral image include a Mel spectrum image or a Mel cepstrum image.

According to one aspect of the present disclosure, an audio processing device is provided, comprising: a spectrum conversion module, configured to perform spectrum conversion on corrupted audio to be processed to obtain a first spectral image of the corrupted audio; a spectrum completion module, configured to perform spectrum completion on the first spectral image to obtain a completed second spectral image; and an audio completion module, configured to complete the corrupted audio according to the second spectral image to obtain a completed first audio.

In one possible implementation, the spectrum completion module includes: a first feature extraction submodule, configured to perform feature extraction on the first spectral image to obtain a first spectral feature; and a first spectrum reconstruction submodule, configured to perform spectrum reconstruction on the first spectral feature to obtain the second spectral image.

In one possible implementation, the spectrum completion module includes: a second feature extraction submodule, configured to perform feature extraction on the first spectral image to obtain a second spectral feature; a third feature extraction submodule, configured to perform feature extraction on related information of the corrupted audio to obtain a supervision feature; an alignment submodule, configured to align the second spectral feature with the supervision feature; and a second spectrum reconstruction submodule, configured to perform spectrum reconstruction on the first spectral feature according to the aligned supervision feature to obtain the second spectral image, wherein the related information includes at least one of video information and optical flow information corresponding to the corrupted audio.

In one possible implementation, the corrupted audio includes a corrupted audio segment, and the audio completion module includes: a first spectrum-to-audio conversion submodule, configured to perform spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image to obtain a completed audio segment; and a first audio completion submodule, configured to complete the corrupted audio with the completed audio segment to obtain the completed first audio.

In one possible implementation, the corrupted audio includes a corrupted audio segment and an undamaged audio segment, and the audio completion module includes: a prediction submodule, configured to predict the completed audio segment according to the undamaged audio segment and the spectral image corresponding to the corrupted audio segment in the second spectral image; and a second audio completion submodule, configured to complete the corrupted audio with the completed audio segment to obtain the completed first audio.

In one possible implementation, the audio completion module is implemented by a WaveNet decoding network.

According to one aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the above audio processing method.

According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, the computer program instructions implementing the above audio processing method when executed by a processor.

In the embodiments of the present disclosure, spectrum conversion is performed on corrupted audio to be processed to obtain a first spectral image of the corrupted audio; spectrum completion is performed on the first spectral image to obtain a completed second spectral image; and the corrupted audio is completed according to the second spectral image to obtain a completed first audio. This converts the problem of audio completion into a problem of spectrum completion, which reduces excessive dependence on the audio information itself. The audio processing method can complete audio that has been corrupted, for example, by noise interference, by segments with popping noise, or by local loss in which some segments have been erased, so that the completed audio presents a good auditory effect.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.

Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure.

Fig. 1 shows a flowchart of an audio processing method according to an embodiment of the present disclosure.

Fig. 2 shows a schematic diagram of audio information and spectral images in an audio processing method according to an embodiment of the present disclosure.

Fig. 3 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure.

Fig. 4 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure.

Fig. 5 shows a block diagram of an audio processing device according to an embodiment of the present disclosure.

Fig. 6 is a block diagram of an electronic device according to an exemplary embodiment.

Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" as used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" should not necessarily be construed as preferred over or superior to other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them. For example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set consisting of A, B, and C.

In addition, numerous specific details are given in the following embodiments in order to better explain the present disclosure. Those skilled in the art will understand that the present disclosure can also be practiced without certain of these details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.

Fig. 1 shows a flowchart of an audio processing method according to an embodiment of the present disclosure. The audio processing method can be executed by a terminal device or another processing device, where the terminal device can be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. The other processing device can be a server, a cloud server, or the like. In some possible implementations, the audio processing method can be implemented by a processor calling computer-readable instructions stored in a memory.

As shown in Fig. 1, the method comprises:

Step S11: perform spectrum conversion on corrupted audio to be processed to obtain a first spectral image of the corrupted audio.

In one possible implementation, the corrupted audio to be processed can be derived from the audio information of a complete song (that is, complete audio without any damage). For example, the damage may result from information loss during transmission of the audio information, from a virus carried in the audio information, or from the accidental deletion of part of the information while the audio information is being edited.

In the implementations of the present disclosure, the number of audio frames in the corrupted audio to be processed is not limited. The corrupted audio can include all audio frames in the audio information of a complete song (for example, 1000 frames), in which case, as an example, the corrupted audio segment can be the audio frames at frames 8 to 10. It can also include only some of the audio frames in the audio information of the complete song (for example, 10 frames), in which case, as an example, the corrupted audio segment can be the audio frame at frame 7.

In one possible implementation, the corrupted audio to be processed can be represented by an audio signal of any format. As an example, as shown in Fig. 2, the corrupted audio can be represented by sound spectrogram 12. Compared with sound spectrogram 11, which represents the intact audio, sound spectrogram 12 has a clearly visible missing region (the blank rectangular part of sound spectrogram 12), and this missing region can represent the damaged segment of the corrupted audio to be processed.

In one possible implementation, an audio signal represents audio information in the time domain, and a spectral image (signal) represents audio information in the frequency domain. In this implementation, the first spectral image and the corrupted audio can therefore be regarded as different representations of the same information.
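The relationship between the time-domain signal and its spectral image can be sketched with a short-time Fourier transform. This is a minimal illustration, not the conversion used in the disclosure: the function name `magnitude_spectrogram`, the Hann window, the frame length of 256, the hop of 128, and the 8 kHz test tone with a zeroed gap are all assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return the STFT magnitude with shape (frequency_bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One real FFT per frame; transpose so time runs along the columns.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                       # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 1000 * t)     # one second of a 1 kHz tone
audio[3000:4000] = 0.0                   # simulate a damaged (missing) segment

first_spectral_image = magnitude_spectrogram(audio)
```

In the resulting array, the columns that fall entirely inside the zeroed samples are empty, which corresponds to the blank rectangular region described for Fig. 2.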

In one possible implementation, the first spectral image and the second spectral image include a Mel spectrum image or a Mel cepstrum image.
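For readers unfamiliar with the Mel scale, the following sketch shows one widely used Hz-to-Mel mapping and how band edges spaced evenly in Mel become progressively wider in Hz. The disclosure does not state which Mel formula or band count it uses; the formula below and the helper names `hz_to_mel`, `mel_to_hz`, and `mel_band_edges` are assumptions for illustration.

```python
import math

def hz_to_mel(freq_hz):
    """Common Mel mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Edges (in Hz) of n_bands triangular filters spaced evenly in Mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(8, 0.0, 4000.0)   # hypothetical: 8 bands up to 4 kHz
```

A Mel spectrum image is then obtained by weighting the linear-frequency spectrogram with filters placed at these edges; a Mel cepstrum additionally applies a logarithm and a cosine transform.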

Step S12: perform spectrum completion on the first spectral image to obtain a completed second spectral image.

The first spectral image includes at least one missing region, that is, a region that needs to be completed, and this region represents a missing segment in the corrupted audio. As an example, the first spectral image may include multiple contiguous missing regions or multiple missing regions separated by intervals. In one possible implementation, the larger the area of the missing regions in the first spectral image, the more damaged segments the corrupted audio contains; the smaller the area of the missing regions, the fewer damaged segments it contains.

In one possible implementation, spectrum completion can be achieved by filling the missing region of the first spectral image with the pixels surrounding that region. It can also be achieved with deep learning, by predicting (or inferring) each pixel in the missing region from the pixels of the other regions of the first spectral image.
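The first option, filling the missing region from surrounding pixels, can be sketched as linear interpolation along the time axis. This is only one simple way to use neighboring pixels and is not claimed to be the method of the disclosure; the function name `fill_gap_along_time` is hypothetical, and the sketch assumes the gap does not touch the first or last column.

```python
import numpy as np

def fill_gap_along_time(spec, start, stop):
    """Fill columns start..stop-1 by interpolating between the
    known columns start-1 and stop (assumes 1 <= start < stop < width)."""
    filled = spec.copy()
    left, right = spec[:, start - 1], spec[:, stop]
    width = stop - start + 1
    for k, col in enumerate(range(start, stop), start=1):
        w = k / width                      # interpolation weight in (0, 1)
        filled[:, col] = (1 - w) * left + w * right
    return filled

# Toy spectral image: 4 frequency rows, 10 time columns, values 0..9 per row.
intact = np.tile(np.arange(10, dtype=float), (4, 1))
first_spectral_image = intact.copy()
first_spectral_image[:, 3:6] = 0.0        # missing region (columns 3 to 5)

second_spectral_image = fill_gap_along_time(first_spectral_image, 3, 6)
```

Interpolation of this kind only works for short gaps in slowly varying spectra, which is one motivation for the learned prediction described as the second option.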

In one possible implementation, the difference between the completed second spectral image and the spectral image of the intact audio is small. The spectrum completion operation fills in the missing part of the first spectral image so that the completed second spectral image approximates the intact spectral image.

As an example, as shown in Fig. 2, spectral image 13 is the first spectral image that needs to be completed, the darker rectangular region in spectral image 13 is the missing region of the first spectral image, and spectral image 14 is the completed second spectral image.

Step S13: complete the corrupted audio according to the second spectral image to obtain a completed first audio.

The difference between the completed first audio and the intact audio is small. The completion operation fills in the missing segment of the corrupted audio so that the completed first audio approximates the intact audio.

The corrupted audio includes at least one corrupted audio segment, which is an audio segment that needs to be completed. As an example, the corrupted audio may include multiple contiguous corrupted audio segments or multiple corrupted audio segments separated by intervals. The more corrupted audio segments the corrupted audio contains, the more severe the damage; the fewer corrupted audio segments it contains, the milder the damage.

In one possible implementation, audio completion can be achieved by performing spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image to obtain a completed audio segment, and then completing the corrupted audio with the completed audio segment. It can also be achieved with deep learning, by taking the second spectral image as the learning target and predicting (or inferring) the information of the damaged segment from the other audio segments of the corrupted audio.

In one possible implementation, the operation of completing the corrupted audio according to the second spectral image to obtain the completed first audio is implemented by a WaveNet decoding network. For example, the convolutional layers in the WaveNet decoding network can perform spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image, and the resulting completed audio segment can be used to complete the corrupted audio. As another example, the dilated causal convolution layers (dilated causal convolutions) in the WaveNet decoding network can be used to predict the information of the corrupted audio segment in the corrupted audio.
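The dilated causal convolution mentioned above can be illustrated in isolation. The sketch below implements a single one-dimensional layer in plain Python and is not the WaveNet decoding network of the disclosure; the function name `dilated_causal_conv`, the kernel weights, and the dilation value are assumptions for the example.

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of x are treated as zero, so each output
    depends only on the current and earlier inputs (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = dilated_causal_conv(x, [1.0, 1.0], dilation=2)   # y[t] = x[t] + x[t-2]

# Changing a later sample must not affect earlier outputs.
x_changed = x[:5] + [100.0]
y_changed = dilated_causal_conv(x_changed, [1.0, 1.0], dilation=2)
```

Stacking such layers with growing dilations lets each predicted sample depend on a long stretch of earlier, undamaged audio, which is why this layer type suits predicting a missing segment from its context.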

In the embodiments of the present disclosure, spectrum conversion is performed on corrupted audio to be processed to obtain a first spectral image of the corrupted audio; spectrum completion is performed on the first spectral image to obtain a completed second spectral image; and the corrupted audio is completed according to the second spectral image to obtain a completed first audio. This converts the problem of audio completion into a problem of spectrum completion, which reduces excessive dependence on the audio information itself. The audio processing method can complete audio that has been corrupted, for example, by noise interference, by segments with popping noise, or by local loss in which some segments have been erased, so that the completed audio presents a good auditory effect.

In one possible implementation, step S12, performing spectrum completion on the first spectral image to obtain the completed second spectral image, comprises: performing feature extraction on the first spectral image to obtain a first spectral feature; and performing spectrum reconstruction on the first spectral feature to obtain the second spectral image.

Spectrum reconstruction can be understood as filling the missing region of the first spectral image with the pixels surrounding that region. It can also be understood as predicting (or inferring) each pixel in the missing region from the pixels of the other regions of the first spectral image.

In one possible implementation, a convolutional neural network can be used to perform the feature extraction and spectrum reconstruction operations on the first spectral image. The convolutional neural network may include at least one convolutional layer, which performs convolution on the first spectral image to extract its features.

In one possible implementation, feature extraction can be performed on the first spectral image with feature extraction methods such as the short-time Fourier transform (STFT) and Mel filters, to obtain the first spectral feature.

In one possible implementation, when the corrupted audio contains too many corrupted audio segments, the corrupted audio can be split into pieces of corrupted audio that each contain only a small number of audio segments. Each piece obtained by the split is then converted into its own first spectral image, and feature extraction and spectrum reconstruction are performed on each first spectral image separately to obtain a second spectral image corresponding to each piece of corrupted audio.
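The splitting step can be sketched as follows. The disclosure does not specify the splitting rule; this example assumes a per-frame damage flag and a hypothetical limit on the number of damaged frames allowed in each piece, and the function name `split_by_damage` is likewise an assumption.

```python
def split_by_damage(damaged_flags, max_damaged_per_piece):
    """Split frame indices into consecutive pieces, each holding at most
    max_damaged_per_piece damaged frames."""
    pieces, current, count = [], [], 0
    for i, damaged in enumerate(damaged_flags):
        # Start a new piece before exceeding the damage limit.
        if damaged and count == max_damaged_per_piece:
            pieces.append(current)
            current, count = [], 0
        current.append(i)
        count += int(damaged)
    if current:
        pieces.append(current)
    return pieces

# True marks a damaged frame in an 8-frame corrupted audio.
flags = [False, True, False, True, True, False, True, False]
pieces = split_by_damage(flags, max_damaged_per_piece=2)
```

Each piece would then be converted into its own first spectral image and completed separately, as described above.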

In one possible implementation, step S12, performing spectrum completion on the first spectral image to obtain the completed second spectral image, comprises: performing feature extraction on the first spectral image to obtain a second spectral feature; performing feature extraction on related information of the corrupted audio to obtain a supervision feature; aligning the second spectral feature with the supervision feature; and performing spectrum reconstruction on the first spectral feature according to the aligned supervision feature to obtain the second spectral image, wherein the related information includes at least one of video information and optical flow information corresponding to the corrupted audio.

In this implementation, the corrupted audio can come from a video that carries audio information, in which each video frame (or section) has a corresponding audio segment and the content of the video frame matches that of the audio segment. This implementation can therefore use video information that is naturally aligned with the corrupted audio to complete the corrupted audio.

As an example, the corrupted audio can come from a recorded video of a cello performance. When the performer's playing motion in a video frame is large and the relative distance between the strings and the body is large, the audio segment corresponding to that video frame is loud; when the playing motion in a video frame is fast, the rhythm of the corresponding audio segment is quick. Conversely, when the playing motion in a video frame is small and the relative distance between the strings and the body is small, the corresponding audio is quiet; when the playing motion in a section of video is slow, the rhythm of the corresponding audio segment is slow.

In one possible implementation, the video information and optical flow information corresponding to the corrupted audio are complete and unaffected by noise interference.

The video information includes the video frames corresponding to the audio information. The optical flow information represents how the pixels of the image sequence in the video information change over time, the correlation between adjacent video frames, and the relative motion of objects between adjacent video frames. In this implementation, both the video information and the optical flow information can serve as references for the corrupted audio, so that the completed first audio is more complete.

In one possible implementation, feature extraction can be performed on the related information of the corrupted audio with feature extraction methods such as the short-time Fourier transform (STFT) and Mel filters, to obtain the supervision feature of the corrupted audio.

In one possible implementation, the supervision feature can be a feature extracted from the video information and/or optical flow information corresponding to the corrupted audio, for example an edge feature, a texture feature, or a style feature of the video information and/or optical flow information.

In one possible implementation, deep learning can be used to extract features from the corrupted audio and from its related information. For example, the convolutional layers of a convolutional neural network can perform convolution (feature extraction) on the corrupted audio and on its related information to extract the second spectral feature and the supervision feature.

In one possible implementation, aligning the second spectral feature with the supervision feature serves to reduce the distance between the two as much as possible, so that the second spectral feature and the supervision feature lie in the same space.

In one possible implementation, aligning the second spectral feature with the supervision feature can be accomplished by fusing (for example, concatenating) the two. In this implementation, the width of the second spectral feature can be the same as that of the supervision feature, with no restriction on whether their heights are the same, so that the corresponding second spectral feature and supervision feature can be concatenated along the width direction. Alternatively, the height of the second spectral feature can be the same as that of the supervision feature, with no restriction on whether their widths are the same, so that the corresponding second spectral feature and supervision feature can be concatenated along the height direction.

For example, if the dimension of the second spectral feature is 1 × 4 × 1 and the dimension of the supervision feature is 1 × 4 × 1, then during feature fusion the corresponding second spectral feature and supervision feature can be concatenated along the height to obtain a feature of dimension (1+1) × 4 × 1.
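The dimension arithmetic of this example can be checked with a few lines of NumPy. The sketch assumes the three dimensions are ordered as height × width × channels, which the text implies but does not state, and the array contents are placeholders.

```python
import numpy as np

# Placeholder features with the dimensions from the example (assumed H x W x C).
second_spectral_feature = np.zeros((1, 4, 1))
supervision_feature = np.ones((1, 4, 1))

# Concatenate along the height axis; width and channel sizes must match.
fused_feature = np.concatenate(
    [second_spectral_feature, supervision_feature], axis=0
)
```

The fused array has shape (2, 4, 1), matching the (1+1) × 4 × 1 result above; its first row holds the spectral feature and its second row holds the supervision feature.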

In one possible implementation, the spectrum reconstruction operation on the first spectral feature can be performed with deep learning. During spectrum reconstruction, the self-supervision provided by the natural alignment between the corrupted audio and its related information (such as video information and optical flow information) can be used to predict (or infer) each pixel to be reconstructed from the pixels of the other regions of the first spectral image, thereby achieving spectrum reconstruction and making the resulting second spectral image better suited to completing the corrupted audio.

In this implementation, the related information of the corrupted audio (at least one of the video information and optical flow information corresponding to the corrupted audio) can be used as supervision information to guide the completion of the corrupted audio, which improves the completeness of the completed first audio and optimizes how the first audio is presented.

In one possible implementation, the corrupted audio includes a corrupted audio segment, and step S13, completing the corrupted audio according to the second spectral image to obtain the completed first audio, comprises: performing spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image to obtain a completed audio segment; and completing the corrupted audio with the completed audio segment to obtain the completed first audio.

The corrupted audio can consist of corrupted audio segments and undamaged audio segments, and the second spectral image includes an image region corresponding to each segment of the corrupted audio (both corrupted and undamaged segments). The completed audio segment is used to replace the corrupted audio segment in the corrupted audio. Spectrum-to-audio conversion is the process of converting a spectral image into audio and can be understood as the inverse of the spectrum conversion in step S11.

In this implementation, the spectral image in the image region of the second spectral image that corresponds to the corrupted audio segment can be converted into a completed audio segment, and the completed audio segment can replace the corrupted audio segment in the corrupted audio, thereby achieving audio completion.

As an example, the corrupted audio consists of 5 audio segments, of which segments 1, 2, 3, and 4 are undamaged audio segments and segment 5 is a corrupted audio segment. In this implementation, spectrum-to-audio conversion can be performed on the spectral image in the image region of the second spectral image that corresponds to segment 5 to obtain a completed audio segment, and the completed audio segment then replaces segment 5 in the corrupted audio. The resulting completed first audio consists of undamaged segments 1, 2, 3, and 4 and the completed audio segment.
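The replacement step in this five-segment example can be sketched as below. The sample values are invented placeholders, the completed segment is simply given rather than produced by spectrum-to-audio conversion, and the function name `replace_corrupted_segments` is hypothetical.

```python
def replace_corrupted_segments(segments, completed_by_index):
    """Return a new segment list in which each corrupted segment is
    replaced by its completed counterpart; undamaged ones are kept."""
    result = list(segments)
    for index, completed_segment in completed_by_index.items():
        result[index] = completed_segment
    return result

# Five segments; the fifth (index 4) is corrupted and holds only zeros.
corrupted_audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.0, 0.0]]

# Placeholder for the segment recovered from the second spectral image.
completed_segments = {4: [0.9, 1.0]}

first_audio = replace_corrupted_segments(corrupted_audio, completed_segments)
```

The first four segments pass through unchanged, so only the region that was actually damaged is affected by the completion.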

In one possible implementation, the corrupted audio includes corrupted audio segment and undamaged audio fragment； Step 13, described that completion is carried out to the corrupted audio according to second spectral image, the first audio after obtaining completion, packet It includes: according to spectral image corresponding with corrupted audio segment in the second spectral image and undamaged audio fragment, predicting the benefit Full acoustic frequency segment；Completion is carried out to corrupted audio using completion audio fragment, the first audio after obtaining completion.

In this implementation, the undamaged audio segments in the corrupted audio can be used when determining the completion audio segment that replaces the corrupted audio segment. For example (continuing the example above), when determining the completion audio segment, this implementation can use the undamaged audio segments located at the 1st, 2nd, 3rd and 4th positions to predict (or infer) the content of the completion audio segment, and use the spectral image in the second spectral image corresponding to the corrupted audio segment to guide the generation of the completion audio segment, thereby obtaining a more accurate completion audio segment. The completion audio segment then replaces the 5th audio segment in the corrupted audio, yielding the first audio after completion, which consists of the 1st, 2nd, 3rd and 4th undamaged segments and the completion audio segment.

In one possible implementation, step 12 and step 13 can be executed by a neural network. The operations in step 12 of performing feature extraction on the first spectral image to obtain a first spectral feature, and performing spectrum reconstruction on the first spectral feature to obtain the second spectral image, can be implemented by a first generation network consisting of a first encoding network and a first decoding network. The operation in step 13 of completing the corrupted audio according to the second spectral image to obtain the first audio after completion can be implemented by a second decoding network.

Fig. 3 shows a schematic structural diagram of the neural network used in an audio processing method according to an embodiment of the present disclosure. In one possible implementation, as shown in Fig. 3, spectrum conversion is performed on the corrupted audio 201 to obtain the spectrum conversion result (the first spectral image 202). The first encoding network E_a compresses the first spectral image 202 into a feature vector and sends this feature vector, which represents the first spectral image, to the first decoding network G_a. The first decoding network G_a can perform the spectrum completion operation based on this feature vector and send the second spectral image 203 obtained after spectrum completion to the second decoding network 204. The second decoding network 204 can then complete the corrupted audio according to the completed second spectral image 203 and the corrupted audio 201, finally obtaining the first audio 205 after completion.
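The data flow of Fig. 3 can be sketched as a composition of four stages. This is only a wiring sketch under stated assumptions: the function `complete_audio`, its parameter names, and the toy stand-in stages are hypothetical, and the real stages would be a spectrum transform and trained networks rather than the simple lambdas used here.

```python
from typing import Callable, List

Audio = List[float]
Spectrum = List[List[float]]
Feature = List[float]


def complete_audio(
    damaged_audio: Audio,
    to_spectrum: Callable[[Audio], Spectrum],
    encoder: Callable[[Spectrum], Feature],
    spectrum_decoder: Callable[[Feature], Spectrum],
    audio_decoder: Callable[[Spectrum, Audio], Audio],
) -> Audio:
    """Chain the stages of Fig. 3 in order."""
    first_spectrum = to_spectrum(damaged_audio)    # spectrum conversion (201 -> 202)
    feature = encoder(first_spectrum)              # first encoding network E_a
    second_spectrum = spectrum_decoder(feature)    # first decoding network G_a (203)
    # The second decoding network receives both the completed spectrum and the
    # corrupted audio, as described for elements 203, 201 and 204.
    return audio_decoder(second_spectrum, damaged_audio)


# Toy stand-ins (assumptions): a zero sample marks the damaged position.
result = complete_audio(
    [0.2, 0.0, 0.4],
    to_spectrum=lambda audio: [[abs(x)] for x in audio],
    encoder=lambda spec: [row[0] for row in spec],
    spectrum_decoder=lambda feat: [[v if v > 0 else 0.5] for v in feat],
    audio_decoder=lambda spec, audio: [
        x if x != 0 else row[0] for x, row in zip(audio, spec)
    ],
)
```

Passing the stages in as callables keeps the generation network and the second decoding network independent, which matches the statement that they can be trained separately.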

In one possible implementation, step 12 and step 13 can be executed by a neural network. The operations in step 12 of performing feature extraction on the first spectral image to obtain a second spectral feature, performing feature extraction on the relevant information of the corrupted audio to obtain a supervision feature, aligning the second spectral feature and the supervision feature, and performing spectrum reconstruction on the first spectral feature according to the aligned supervision feature to obtain the second spectral image, can be implemented by a second generation network consisting of the first encoding network, the first decoding network and a second encoding network. The operation in step 13 of completing the corrupted audio according to the second spectral image to obtain the first audio after completion can be implemented by the second decoding network.

Fig. 4 shows a schematic structural diagram of the neural network used in an audio processing method according to an embodiment of the present disclosure. In one possible implementation, as shown in Fig. 4, the audio processing method can also be applied to the case where the corrupted audio is completed using the self-supervision provided by the natural alignment between the corrupted audio and its relevant information (such as video information and optical flow information).

The detailed process includes the following. First, spectrum conversion is performed on the corrupted audio 301 to obtain the first spectral image 302, and the first spectral image 302 is sent to the first encoding network E_a. The first encoding network E_a can compress the first spectral image 302 into a feature vector. At the same time, the video information and optical flow information of the corrupted audio (not shown) can be sent to the second encoding network E_v to obtain the supervision feature f_v. The feature vector representing the first spectral image is fused with the supervision feature f_v, and the fusion result is sent to the first decoding network G_av. The first decoding network G_av can perform the spectrum completion operation based on this feature vector and the supervision feature, and send the second spectral image 304 obtained after spectrum completion to the second decoding network 305. The second decoding network 305 can complete the corrupted audio according to the completed second spectral image 304 and the corrupted audio 301, finally obtaining the first audio 306 after completion.
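The disclosure does not specify how the audio feature and the supervision feature f_v are aligned and fused. As one hedged illustration, the sketch below assumes both are sequences of per-time-step vectors, aligns the supervision sequence to the audio sequence by nearest-index resampling, and fuses by concatenation. The names `align_to_length` and `fuse`, and both design choices, are assumptions and not the method of the disclosure.

```python
from typing import List

Sequence2D = List[List[float]]


def align_to_length(features: Sequence2D, target_len: int) -> Sequence2D:
    """Resample a feature sequence to target_len steps by nearest-index lookup."""
    if not features or target_len <= 0:
        raise ValueError("features must be non-empty and target_len positive")
    n = len(features)
    return [features[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def fuse(spectrum_features: Sequence2D, supervision_features: Sequence2D) -> Sequence2D:
    """Align the supervision features in time, then concatenate per time step."""
    aligned = align_to_length(supervision_features, len(spectrum_features))
    return [a + v for a, v in zip(spectrum_features, aligned)]


# Four audio time steps against two video time steps (assumed toy values).
audio_feat = [[1.0], [2.0], [3.0], [4.0]]
video_feat = [[10.0], [20.0]]
fused = fuse(audio_feat, video_feat)
```

In this toy case each video step is repeated twice, so every audio time step is paired with the video feature that covers the same moment.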

In one possible implementation, the first generation network and the second decoding network can be trained separately: the first generation network can be trained by an adversarial training method, and the second decoding network can be trained with a mixed discrete loss function.

The network loss function used to train the first generation network can be expressed as the sum of two network losses: one term indicates the network loss of the first encoding network, and the other indicates the network loss of the first encoding network together with the first decoding network. In this loss function, a denotes the corrupted audio input to each network, and β denotes the weight of the network loss of the first decoding network.

In this implementation, the first generation network and the second decoding network can be trained by the following process:

Step 401: obtain, from a training set, a plurality of training samples (corrupted audio) and annotation information corresponding to each training sample (the undamaged audio and the spectral image of the undamaged audio);

Step 402: for each training sample, perform spectrum conversion on the training sample to obtain the first spectral image of the training sample;

Step 403: input the first spectral image to the first generation network, and perform spectrum completion on the first spectral image of the training sample based on the first generation network to obtain the completed second spectral image of the training sample;

Step 404: determine the two network losses of the first generation network from the spectral image of the undamaged audio in the annotation information and the second spectral image of the training sample, and adjust the network parameters of each network in the first generation network according to the sum of the two network losses;

Step 405: input the training sample and the second spectral image of the training sample to the second decoding network to obtain the audio resulting from completing the training sample;

Step 406: determine the network loss of the second decoding network according to the undamaged audio in the annotation information and the audio resulting from completing the training sample, and adjust the network parameters of the second decoding network according to the network loss of the second decoding network.

Wherein, this implementation does not limit the execution order of step 405 and step 406.
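The separate training of the two networks in steps 401 to 406 can be illustrated with a deliberately simplified toy. Every detail here is an assumption made for illustration: each network is replaced by a single scalar weight (`ScalarNet`), the adversarial loss and the mixed discrete loss are both swapped for a plain squared error, and the second decoding network sees only the completed spectrum rather than the spectrum together with the training sample. The sketch shows only that each network is updated from its own loss.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ScalarNet:
    """Toy stand-in for a network: output = weight * input."""
    weight: float = 0.0

    def __call__(self, x: float) -> float:
        return self.weight * x

    def update(self, x: float, target: float, lr: float = 0.1) -> float:
        # One gradient-descent step on the squared error (weight * x - target) ** 2.
        error = self.weight * x - target
        self.weight -= lr * 2.0 * error * x
        return error * error


def train_separately(
    samples: Iterable[Tuple[float, float, float]],
    generator: ScalarNet,
    audio_decoder: ScalarNet,
    epochs: int = 200,
) -> None:
    samples = list(samples)
    for _ in range(epochs):
        for first_spec, clean_spec, clean_audio in samples:
            # Steps 403-404: the generation network is updated from the spectral loss.
            generator.update(first_spec, clean_spec)
            # Steps 405-406: the second decoding network is updated from the audio
            # loss; the generator output is treated as a fixed input here.
            second_spec = generator(first_spec)
            audio_decoder.update(second_spec, clean_audio)


generator = ScalarNet()
audio_decoder = ScalarNet()
# One toy sample: (first spectrum, undamaged spectrum, undamaged audio).
train_separately([(1.0, 2.0, 4.0)], generator, audio_decoder)
```

Because the audio loss never flows back into the generator, the two sets of parameters are adjusted independently, which is the property the separate training scheme relies on.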

In one possible implementation, the second generation network and the second decoding network can be trained separately: the second generation network can be trained by an adversarial training method, and the second decoding network can be trained with a mixed discrete loss function.

The loss function used to train the second generation network can be expressed as the sum of three network losses. The first term indicates the network loss of the first encoding network and the first decoding network, and η₂ is the weight of this network loss; t denotes time, and η₂ may decay as time increases. The second term indicates the network loss of the first encoding network, the first decoding network and the second encoding network. The third term, L_Sync, indicates the network loss of the first encoding network and the second encoding network. The total loss is the sum of these three network losses.
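The combination of a time-decaying weight with the three loss terms can be sketched as below. The disclosure states only that η₂ may decay as time increases; the exponential form, the parameters `eta0` and `decay_rate`, and the function names are assumptions chosen for illustration.

```python
import math


def eta2(t: float, eta0: float = 1.0, decay_rate: float = 0.1) -> float:
    """Weight of the first loss term; exponential decay is an assumed schedule."""
    return eta0 * math.exp(-decay_rate * t)


def total_loss(
    loss_enc_dec: float,         # first encoding network + first decoding network
    loss_three_networks: float,  # both of the above + second encoding network
    loss_sync: float,            # L_Sync: first and second encoding networks
    t: float,
) -> float:
    """Sum of the three network losses, with the first term weighted by eta2(t)."""
    return eta2(t) * loss_enc_dec + loss_three_networks + loss_sync


early = total_loss(2.0, 1.0, 0.5, t=0.0)
late = total_loss(2.0, 1.0, 0.5, t=50.0)
```

With a decaying η₂, the audio-only term dominates early in training and contributes progressively less later, so the terms involving the supervision feature carry more of the total.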

In this implementation, the second generation network and the second decoding network can be trained by the following process:

Step 411: obtain, from a training set, a plurality of training samples (corrupted audio, the spectral image of the corrupted audio, and the video information and optical flow information of the corrupted audio) and annotation information corresponding to each training sample (the undamaged audio and the spectral image of the undamaged audio);

Step 412: for each training sample, perform spectrum conversion on the corrupted audio of the training sample to obtain the first spectral image of the training sample;

Step 413: input the first spectral image to the first encoding network to obtain the second spectral feature;

Step 414: input the video information and optical flow information of the corrupted audio to the second encoding network to obtain the supervision feature;

Step 415: align the second spectral feature and the supervision feature, and determine L_Sync according to the aligned second spectral feature and supervision feature;

Step 416: input the aligned second spectral feature and supervision feature to the second decoding network and, based on the first generation network, perform spectrum completion on the first spectral image of the training sample to obtain the completed second spectral image of the training sample;

Step 417: determine the other two network losses according to the completed second spectral image of the training sample and the undamaged audio and the spectral image of the undamaged audio in the annotation information, and adjust the network parameters of each network in the (second) generation network according to the sum of L_Sync and these two network losses;

Step 418: input the training sample and the second spectral image of the training sample to the second decoding network to obtain the audio resulting from completing the training sample;

Step 419: calculate the network loss of the second decoding network according to the undamaged audio in the annotation information and the audio resulting from completing the training sample, and adjust the network parameters of the second decoding network according to the network loss of the second decoding network.

Wherein, this implementation does not limit the execution order of step 418 and step 419.

In one possible implementation, the audio processing method can be applied in a video/audio repair tool: a piece of audio information containing corrupted audio is input to the repair tool, and the repair tool outputs a piece of intact audio after completion.

It can be understood that the method embodiments mentioned in the present disclosure can be combined with one another to form combined embodiments without departing from their principles and logic; for reasons of space, these are not described again in the present disclosure.

In addition, the present disclosure also provides an audio processing device, an electronic device, a computer-readable storage medium and a program, all of which can be used to implement any audio processing method provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section; they are not repeated here.

Those skilled in the art will understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order and does not limit the implementation process in any way; the specific execution order of the steps should be determined by their functions and possible internal logic.

Fig. 5 shows a block diagram of an audio processing device according to an embodiment of the present disclosure. As shown in Fig. 5, the audio processing device includes a spectrum conversion module 501, a spectrum completion module 502 and an audio completion module 503.

Wherein, the spectrum conversion module 501 is configured to perform spectrum conversion on the corrupted audio to be processed to obtain the first spectral image of the corrupted audio;

The spectrum completion module 502 is configured to perform spectrum completion on the first spectral image to obtain the completed second spectral image;

The audio completion module 503 is configured to complete the corrupted audio according to the second spectral image to obtain the first audio after completion.

In some embodiments, the functions or modules of the device provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above. For their specific implementation, refer to the description of the method embodiments above; for brevity, it is not repeated here.

The embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, the computer program instructions implementing the above audio processing method when executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.

The embodiments of the present disclosure also provide an electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the above audio processing method.

Fig. 6 is a block diagram of an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment or a personal digital assistant.

Referring to Fig. 6, electronic equipment 800 may include following one or more components: processing component 802, memory 804, Power supply module 806, multimedia component 808, audio component 810, the interface 812 of input/output (I/O), sensor module 814, And communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phone book data, messages, pictures, videos and so on. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The power supply component 806 supplies power to the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signal can be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a loudspeaker for outputting audio signals.

I/O interface 812 provides interface between processing component 802 and peripheral interface module, and above-mentioned peripheral interface module can To be keyboard, click wheel, button etc..These buttons may include, but are not limited to: home button, volume button, start button and lock Determine button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 can be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for executing the above method.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example the memory 804 including computer program instructions, and the computer program instructions can be executed by the processor 820 of the electronic device 800 to complete the above method.

Fig. 7 is a block diagram of an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 7, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions so as to perform the above method.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example the memory 1932 including computer program instructions, and the computer program instructions can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.

The present disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.

The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the above. The computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described here can be downloaded from a computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device through a network, such as the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in that computing/processing device.

The computer program instructions for executing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA) or a programmable logic array (PLA), is customized using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

Various aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, produce a device that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing device and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions can also be loaded onto a computer, another programmable data processing device or other equipment, so that a series of operating steps is executed on the computer, other programmable data processing device or other equipment to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing device or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the drawings show the possible architecture, functions and operations of systems, methods and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

The embodiments of the present disclosure have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical applications or their technical improvements over technologies in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An audio processing method, characterized by comprising:

performing spectrum conversion on corrupted audio to be processed to obtain a first spectral image of the corrupted audio;

performing spectrum completion on the first spectral image to obtain a completed second spectral image;

completing the corrupted audio according to the second spectral image to obtain first audio after completion.

2. The method according to claim 1, characterized in that performing spectrum completion on the first spectral image to obtain the completed second spectral image comprises:

performing feature extraction on the first spectral image to obtain a first spectral feature;

performing spectrum reconstruction on the first spectral feature to obtain the second spectral image.

3. The method according to claim 1, characterized in that performing spectrum completion on the first spectral image to obtain the completed second spectral image comprises:

performing feature extraction on the first spectral image to obtain a second spectral feature;

performing feature extraction on relevant information of the corrupted audio to obtain a supervision feature;

aligning the second spectral feature and the supervision feature;

performing spectrum reconstruction on the first spectral feature according to the aligned supervision feature to obtain the second spectral image,

wherein the relevant information includes at least one of video information and optical flow information corresponding to the corrupted audio.

4. The method according to any one of claims 1 to 3, characterized in that the corrupted audio comprises a corrupted audio segment;

completing the corrupted audio according to the second spectral image to obtain the completed first audio comprises:

performing spectrum-to-audio conversion on the spectral image corresponding to the corrupted audio segment in the second spectral image, to obtain a completed audio segment; and

completing the corrupted audio using the completed audio segment, to obtain the completed first audio.
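
By way of non-limiting illustration, the two steps of claim 4 might be sketched as follows, assuming NumPy and a one-dimensional sample array. The function names are hypothetical, and `decoder` is a hypothetical callable that converts a cropped spectral image into audio samples; it stands in for whatever spectrum-to-audio conversion is actually used.

```python
import numpy as np


def splice_segment(corrupted, completed_segment, start):
    """Write the completed segment over the corrupted span starting at `start`."""
    end = start + len(completed_segment)
    if start < 0 or end > len(corrupted):
        raise ValueError("completed segment does not fit inside the audio")
    first_audio = np.array(corrupted, dtype=float, copy=True)
    first_audio[start:end] = completed_segment
    return first_audio


def complete_from_image(corrupted, second_image, cols, start, decoder):
    """Crop the columns covering the corrupted segment, decode them to audio,
    and splice the result back into the corrupted audio."""
    segment_image = second_image[:, cols]
    completed_segment = decoder(segment_image)
    return splice_segment(corrupted, completed_segment, start)
```

The input array is copied rather than modified in place, so the corrupted audio remains available for comparison.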

5. The method according to any one of claims 1 to 3, characterized in that the corrupted audio comprises a corrupted audio segment and an undamaged audio segment;

predicting the completed audio segment according to the undamaged audio segment and the spectral image corresponding to the corrupted audio segment in the second spectral image;

6. The method according to any one of claims 1 to 5, characterized in that:

the operation of completing the corrupted audio according to the second spectral image to obtain the completed first audio is implemented by a WaveNet decoding network.
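
By way of non-limiting illustration, the basic building block of WaveNet-style networks in general, a dilated causal convolution, might be sketched as follows using NumPy. This is a sketch of the general technique only; the function names are hypothetical, and the structure of the decoding network actually used is not given in this section.

```python
import numpy as np


def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], treating samples
    before t = 0 as zero, so each output depends only on past and present."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w_k in enumerate(weights):
        shift = k * dilation
        if shift < len(x):
            y[shift:] += w_k * x[:len(x) - shift]
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Doubling the dilation from layer to layer lets the receptive field grow exponentially with depth, which is why such networks can model long stretches of raw audio with few layers.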

7. The method according to any one of claims 1 to 6, characterized in that:

the first spectral image and the second spectral image comprise Mel spectrogram images or Mel cepstrum images.
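
By way of non-limiting illustration, a Mel spectrogram can be obtained by applying a bank of triangular filters, spaced evenly on the Mel scale, to a linear-frequency spectrum. The sketch below assumes NumPy and the common conversion mel = 2595 * log10(1 + f / 700); the function names and parameter values are hypothetical and are not taken from the disclosure.

```python
import numpy as np


def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters whose
    centers are evenly spaced on the Mel scale between 0 Hz and Nyquist."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        low, center, high = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank
```

Multiplying this matrix by a (frequency x time) power spectrogram yields a Mel spectrogram image; taking a discrete cosine transform of its logarithm along the Mel axis would then give Mel cepstral coefficients.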

8. An audio processing apparatus, characterized by comprising:

a spectral conversion module, configured to perform spectral conversion on corrupted audio to be processed, to obtain a first spectral image of the corrupted audio;

a spectral completion module, configured to perform spectral completion on the first spectral image, to obtain a completed second spectral image; and

an audio completion module, configured to complete the corrupted audio according to the second spectral image, to obtain completed first audio.

9. An electronic device, characterized by comprising:

a processor; and

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 7.