CN109887515B - Audio processing method and device, electronic equipment and storage medium


Info

Publication number: CN109887515B (application CN201910086763.5A)
Authority: CN (China)
Prior art keywords: audio, damaged, spectrum, image, frequency spectrum
Legal status: Active
Application number: CN201910086763.5A
Other languages: Chinese (zh)
Other versions: CN109887515A (en)
Inventors: 周航 (Hang Zhou), 刘子纬 (Ziwei Liu), 徐旭东 (Xudong Xu), 罗平 (Ping Luo), 王晓刚 (Xiaogang Wang)
Current Assignee: Beijing Sensetime Technology Development Co Ltd
Original Assignee: Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201910086763.5A
Publication of CN109887515A
Application granted
Publication of CN109887515B


Abstract

The present disclosure relates to an audio processing method and apparatus, an electronic device, and a storage medium. The method includes: performing spectrum conversion on damaged audio to be processed to obtain a first spectrum image of the damaged audio; performing spectrum completion on the first spectrum image to obtain a completed second spectrum image; and completing the damaged audio according to the second spectrum image to obtain a completed first audio, so that the completed first audio can present a good auditory effect.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of signal processing technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
Audio completion means that, when a segment of audio is lost due to noise interference or accident, the lost part of the audio is regenerated and filled in naturally. This technique has many applications in audio information repair and noise reduction. The related art mainly relies on traditional audio processing and uses sparse audio representation methods to find, around the missing segment, a part similar to a known segment and fill it in.
Disclosure of Invention
The present disclosure proposes an audio processing technical solution.
According to an aspect of the present disclosure, there is provided an audio processing method including: carrying out spectrum conversion on damaged audio to be processed to obtain a first spectrum image of the damaged audio; performing spectrum completion on the first spectrum image to obtain a completed second spectrum image; and completing the damaged audio according to the second frequency spectrum image to obtain a completed first audio.
In a possible implementation manner, performing spectrum completion on the first spectrum image to obtain a completed second spectrum image includes: performing feature extraction on the first spectrum image to obtain a first spectral feature; and performing spectrum reconstruction on the first spectral feature to obtain the second spectrum image.
In a possible implementation manner, performing spectrum completion on the first spectrum image to obtain a completed second spectrum image includes: performing feature extraction on the first spectrum image to obtain a second spectral feature; performing feature extraction on related information of the damaged audio to obtain a supervision feature; aligning the second spectral feature and the supervision feature; and performing spectrum reconstruction on the second spectral feature according to the aligned supervision feature to obtain the second spectrum image, wherein the related information includes at least one of video information and optical flow information corresponding to the damaged audio.
In one possible implementation, the damaged audio includes a damaged audio segment; completing the damaged audio according to the second spectrum image to obtain a completed first audio includes: performing spectrum frequency conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a completion audio segment; and completing the damaged audio by using the completion audio segment to obtain the completed first audio.
In one possible implementation, the damaged audio includes damaged audio segments and undamaged audio segments; completing the damaged audio according to the second spectrum image to obtain a completed first audio includes: predicting the completion audio segment according to the spectrum image corresponding to the damaged audio segment in the second spectrum image and the undamaged audio segments; and completing the damaged audio by using the completion audio segment to obtain the completed first audio.
In a possible implementation manner, the operation of completing the damaged audio according to the second spectral image to obtain a completed first audio is implemented by a WaveNet decoding network.
In one possible implementation, the first and second spectral images comprise mel-frequency spectral images or mel-frequency cepstral images.
According to an aspect of the present disclosure, there is provided an audio processing apparatus including: a spectrum conversion module, configured to perform spectrum conversion on damaged audio to be processed to obtain a first spectrum image of the damaged audio; a spectrum completion module, configured to perform spectrum completion on the first spectrum image to obtain a completed second spectrum image; and an audio completion module, configured to complete the damaged audio according to the second spectrum image to obtain a completed first audio.
In one possible implementation, the spectrum completion module includes: a first feature extraction submodule, configured to perform feature extraction on the first spectrum image to obtain a first spectral feature; and a first spectrum reconstruction submodule, configured to perform spectrum reconstruction on the first spectral feature to obtain the second spectrum image.
In one possible implementation, the spectrum completion module includes: a second feature extraction submodule, configured to perform feature extraction on the first spectrum image to obtain a second spectral feature; a third feature extraction submodule, configured to perform feature extraction on related information of the damaged audio to obtain a supervision feature; an alignment submodule, configured to align the second spectral feature and the supervision feature; and a second spectrum reconstruction submodule, configured to perform spectrum reconstruction on the second spectral feature according to the aligned supervision feature to obtain the second spectrum image, wherein the related information includes at least one of video information and optical flow information corresponding to the damaged audio.
In one possible implementation, the damaged audio includes a damaged audio segment; the audio completion module includes: a first spectrum frequency conversion submodule, configured to perform spectrum frequency conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a completion audio segment; and a first audio completion submodule, configured to complete the damaged audio by using the completion audio segment to obtain the completed first audio.
In one possible implementation, the damaged audio includes damaged audio segments and undamaged audio segments; the audio completion module includes: a prediction submodule, configured to predict the completion audio segment according to the spectrum image corresponding to the damaged audio segment in the second spectrum image and the undamaged audio segments; and a second audio completion submodule, configured to complete the damaged audio by using the completion audio segment to obtain the completed first audio.
In one possible implementation, the audio completion module is implemented by a WaveNet decoding network.
In one possible implementation, the first and second spectral images comprise mel-frequency spectral images or mel-frequency cepstral images.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the above audio processing method.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described audio processing method.
In the embodiments of the present disclosure, a first spectrum image of damaged audio is obtained by performing spectrum conversion on the damaged audio to be processed; spectrum completion is performed on the first spectrum image to obtain a completed second spectrum image; and the damaged audio is completed according to the second spectrum image to obtain a completed first audio. The problem of audio completion is thereby converted into a problem of spectrum completion, reducing excessive dependence on the audio information. Damaged audio with local distortion, such as noise interference, segments with abrupt drops in sound quality, or partially erased segments, can be completed through this audio processing method, so that the completed audio presents a good auditory effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of an audio processing method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of audio information and a spectral image in an audio processing method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure.
Fig. 4 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of an audio processing device according to an embodiment of the present disclosure.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow diagram of an audio processing method according to an embodiment of the present disclosure. The audio processing method may be performed by a terminal device or other processing device, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. The other processing devices may be servers or cloud servers, etc. In some possible implementations, the audio processing method may be implemented by a processor calling computer readable instructions stored in a memory.
As shown in fig. 1, the method includes:
step S11, performing spectrum conversion on the damaged audio to be processed to obtain a first spectrum image of the damaged audio.
In one possible implementation, the damaged audio to be processed may originate from the audio information of a complete song (i.e., complete audio without any damage); for example, the damage may be caused by information loss while the audio information is transmitted, by a virus carried in the audio information, or by accidentally deleting part of the information while the audio information is edited.
In the implementations of the present disclosure, the number of audio frames of the damaged audio to be processed is not limited. The damaged audio may include all audio frames (for example, 1000 frames) of the audio information of a complete song, in which case the damaged audio segments may be, for example, the audio frames located at frames 8 to 10; or it may include only some of the audio frames (for example, 10 frames) of the audio information of the complete song, in which case the damaged audio segment may be, for example, the audio frame located at frame 7.
In one possible implementation, the damaged audio to be processed may be represented by an audio signal of any format. As an example, as shown in fig. 2, the damaged audio may be represented by a spectrogram 12; compared with the spectrogram 11 representing intact audio, the spectrogram 12 has an obvious missing region (the rectangular blank portion in the spectrogram 12), and the missing region may represent a damaged segment of the damaged audio to be processed.
In one possible implementation, the audio signal may represent the audio information in the time domain, and the spectral image (signal) may represent the audio information in the frequency domain, so that in this implementation the first spectrum image and the damaged audio may be different representations of the same information.
In one possible implementation, the first spectral image and the second spectral image comprise mel-frequency spectral images or mel-frequency cepstral images.
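As an illustration of the spectrum conversion above, the mel spectrogram of an audio clip can be computed with a standard audio library. The sketch below is a minimal example; the sample rate, FFT size, hop length, and number of mel bands are assumed values for the sketch, not parameters prescribed by this disclosure.

```python
# Sketch: converting (possibly damaged) audio into a mel spectrum image.
# Parameter values (sample rate, FFT size, hop length, mel bands) are
# illustrative assumptions, not values fixed by this disclosure.
import librosa
import numpy as np

def to_mel_spectrum_image(wav_path, sr=16000, n_fft=1024, hop=256, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)          # time-domain audio signal
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    # Log compression makes the spectrum behave more like a natural image.
    return librosa.power_to_db(mel, ref=np.max)    # shape: (n_mels, n_frames)
```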
Step S12, performing spectrum completion on the first spectrum image to obtain a completed second spectrum image.
The first spectrum image includes at least one missing region, that is, a region that needs to be completed, and the missing region represents a missing segment of the damaged audio. As an example, the first spectrum image may include a plurality of consecutive missing regions, or a plurality of spaced missing regions. In one possible implementation, the larger the area of a missing region in the first spectrum image, the larger the damaged segment in the damaged audio; the smaller the area of the missing region, the smaller the damaged segment.
In a possible implementation manner, for spectrum completion, the missing region of the first spectrum image may be filled in using the pixel points surrounding the missing region in the first spectrum image, thereby achieving spectrum completion; alternatively, deep learning may be used to predict (or associate) each pixel point in the missing region from the pixel points in the regions of the first spectrum image other than the missing region, likewise achieving spectrum completion.
In one possible implementation, the gap between the complemented second spectral image and the spectral image of intact audio is small. The spectral complementing operation is used to complement the missing part in the first spectral image so that the complemented second spectral image is approximately the same as the intact spectral image.
As an example, as shown in fig. 2, the spectrum image 13 is a first spectrum image that needs to be complemented, the darker rectangular region in the spectrum image 13 is a missing region of the first spectrum image, and the spectrum image 14 is a complemented second spectrum image.
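As a minimal sketch of the first, classical strategy above (filling the missing region from surrounding pixel points), an off-the-shelf image inpainting routine can be applied to the spectrum image. OpenCV's inpaint is used here purely as a stand-in illustration; it is not the completion network of this disclosure.

```python
# Sketch: filling the missing region of a spectrum image from the pixel
# points around it, using classical image inpainting as a stand-in for the
# completion step. Not the learned completion network of this disclosure.
import cv2
import numpy as np

def fill_missing_region(spec_img: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # spec_img: 8-bit spectrum image; mask: 8-bit, 255 where data is missing.
    return cv2.inpaint(spec_img, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```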
Step S13, completing the damaged audio according to the second spectrum image to obtain a completed first audio.
The gap between the completed first audio and the intact audio is small. The completion operation is used to complete the missing segments in the damaged audio so that the completed first audio is approximately the same as the intact audio.
The damaged audio includes at least one damaged audio segment, and a damaged audio segment is an audio segment that needs to be completed. As an example, the damaged audio may include a plurality of consecutive damaged audio segments, or a plurality of spaced damaged audio segments; the larger the number of damaged audio segments in the damaged audio, the more serious the damage, and the smaller the number of damaged audio segments, the slighter the damage.
In a possible implementation manner, a completion audio segment can be obtained by performing spectrum frequency conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image, and the damaged audio is completed using the completion audio segment, thereby achieving audio completion. Alternatively, deep learning may be used with the second spectrum image as the learning target, and the information of the damaged segment may be predicted (or associated) from the audio segments of the damaged audio other than the missing segment, likewise achieving audio completion.
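For the first option above, the spectrum frequency conversion can be illustrated with a Griffin-Lim based mel-spectrogram inversion. This is a simple stand-in sketch only; the disclosure performs this step with a WaveNet decoding network, as described next.

```python
# Sketch: a simple spectrum frequency conversion (mel spectrum image back to
# a waveform) via Griffin-Lim. A stand-in illustration only; the disclosure
# performs this step with a WaveNet decoding network.
import librosa

def mel_image_to_audio(mel_db, sr=16000, n_fft=1024, hop=256):
    mel_power = librosa.db_to_power(mel_db)   # undo the log compression
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop)
```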
In a possible implementation manner, the operation of completing the damaged audio according to the second spectrum image to obtain the completed first audio is implemented by a WaveNet decoding network. For example, spectrum frequency conversion may be performed on the spectrum image corresponding to the damaged audio segment in the second spectrum image through the convolutional layers in the WaveNet decoding network, and the damaged audio is completed with the completion audio segment, achieving audio completion; for another example, the information of the damaged audio segments in the damaged audio can be predicted using the dilated causal convolution layers in the WaveNet decoding network, likewise achieving audio completion.
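The dilated causal convolution layers named above can be sketched as follows. This bare-bones stack only illustrates the layer type; it is not the full conditioned WaveNet decoding network, and the channel count and dilation schedule are assumptions.

```python
# Sketch: dilated causal convolution layers, the building block of a
# WaveNet-style decoder. A bare-bones illustration of the layer type only,
# not the full conditioned WaveNet decoding network of the disclosure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation                      # left-padding hides the future
        self.conv = nn.Conv1d(channels, channels, kernel_size=2,
                              dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking with doubling dilations grows the receptive field exponentially.
stack = nn.Sequential(*[CausalConv1d(32, d) for d in (1, 2, 4, 8, 16)])
```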
In the embodiments of the present disclosure, a first spectrum image of damaged audio is obtained by performing spectrum conversion on the damaged audio to be processed; spectrum completion is performed on the first spectrum image to obtain a completed second spectrum image; and the damaged audio is completed according to the second spectrum image to obtain a completed first audio. The problem of audio completion is thereby converted into a problem of spectrum completion, reducing excessive dependence on the audio information. Damaged audio with local distortion, such as noise interference, segments with abrupt drops in sound quality, or partially erased segments, can be completed through this audio processing method, so that the completed audio presents a good auditory effect.
In a possible implementation manner, in step S12, performing spectrum completion on the first spectrum image to obtain a completed second spectrum image includes: performing feature extraction on the first spectrum image to obtain a first spectral feature; and performing spectrum reconstruction on the first spectral feature to obtain the second spectrum image.
The spectrum reconstruction can be understood as filling the missing region of the first spectrum image using the pixel points surrounding the missing region, thereby achieving spectrum completion; it can also be understood as predicting (or associating) each pixel point in the missing region from the pixel points in the regions of the first spectrum image other than the missing region, likewise achieving spectrum completion.
In one possible implementation, the first spectral image may be subjected to feature extraction and spectral reconstruction operations using a convolutional neural network. The convolutional neural network may include at least one convolutional layer, where the convolutional layer is configured to perform convolution processing on the first spectral image to extract features of the first spectral image.
In a possible implementation manner, for feature extraction, feature extraction methods such as short-time Fourier transform (STFT) and mel filter may be used to perform feature extraction on the first spectrum image to obtain the first spectrum feature.
In a possible implementation manner, when the number of damaged audio segments in the damaged audio is too large, the damaged audio may be divided into several pieces of damaged audio that each include only a small number of audio segments; each piece of damaged audio obtained by the division is then separately converted into a first spectrum image, and feature extraction and spectrum reconstruction are performed on each first spectrum image separately, so as to obtain the second spectrum image corresponding to each piece of damaged audio.
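A minimal sketch of such a convolutional feature-extraction and spectrum-reconstruction pipeline is shown below. The layer counts and channel widths are assumptions for illustration only; the disclosure does not specify the first generation network at this level of detail.

```python
# Sketch: a minimal convolutional encoder-decoder for spectrum completion.
# Layer counts and channel widths are illustrative assumptions; the actual
# network of the disclosure is not specified at this level.
import torch
import torch.nn as nn

class SpectrumCompletionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(               # feature extraction
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(               # spectrum reconstruction
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, first_spectrum_image):        # (B, 1, n_mels, n_frames)
        feat = self.encoder(first_spectrum_image)   # first spectral feature
        return self.decoder(feat)                   # completed second image
```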
In a possible implementation manner, in step S12, performing spectrum completion on the first spectrum image to obtain a completed second spectrum image includes: performing feature extraction on the first spectrum image to obtain a second spectral feature; performing feature extraction on related information of the damaged audio to obtain a supervision feature; aligning the second spectral feature and the supervision feature; and performing spectrum reconstruction on the second spectral feature according to the aligned supervision feature to obtain the second spectrum image, wherein the related information includes at least one of video information and optical flow information corresponding to the damaged audio.
In this implementation, the damaged audio may be derived from a segment of video with audio information, where each video frame (segment) has a corresponding audio segment and the content of the video frame matches the content of the audio segment; this implementation can therefore complete the damaged audio with the help of video information that is naturally aligned with it.
As an example, the damaged audio may come from a recorded video of a viola performance. In the video, when the performer plays with a large amplitude and the relative distance between the strings and the body is large in a video frame, the loudness expressed by the corresponding audio segment is greater; when the playing frequency in the video frames is faster, the tempo expressed by the corresponding audio segment is sharper. Conversely, when the performer plays with a small amplitude and the relative distance between the strings and the body is small, the loudness expressed by the corresponding audio segment is smaller; when the playing frequency is slower, the tempo expressed by the corresponding audio segment is slower.
In one possible implementation, the video information and optical flow information corresponding to the damaged audio are complete information that is not disturbed by noise.
The video information includes the video frames corresponding to the audio information, and the optical flow information represents the changes of pixels of the image sequence of the video information in the time domain, the correlation between adjacent video frames, and the relative motion of objects across adjacent video frames. In this implementation, both the video information and the optical flow information can serve as references for the damaged audio, so that the completed first audio is more complete.
In a possible implementation manner, for feature extraction, feature extraction methods such as short-time Fourier transform (STFT) and mel filter may be used to perform feature extraction on the relevant information of the damaged audio to obtain the supervision feature of the damaged audio.
In one possible implementation, the supervision features may be features extracted from the video information and/or optical flow information corresponding to the damaged audio, for example, edge features, texture features, and style features of the video information and/or optical flow information.
In one possible implementation, feature extraction on the damaged audio and on the related information of the damaged audio can be performed using deep learning; for example, the damaged audio and its related information may be convolved (for feature extraction) by the convolutional layers of a convolutional neural network to extract the second spectral feature and the supervision feature.
In a possible implementation, the second spectral feature and the supervision feature are aligned to reduce the distance between them as much as possible, so that the two features lie in the same space.
In one possible implementation, aligning the second spectral feature and the supervision feature may be done by fusing (e.g., splicing) them. In this implementation, the width of the second spectral feature may be the same as that of the supervision feature, with no restriction on whether their heights are the same, so that the corresponding second spectral feature and supervision feature can be spliced along the height direction; alternatively, the height of the second spectral feature may be the same as that of the supervision feature, with no restriction on whether their widths are the same, so that the two can be spliced along the width direction.
For example, the dimension of the second spectral feature is 1 × 4 × 1 and the dimension of the supervision feature is 1 × 4 × 1; in the feature fusion process, the corresponding second spectral feature and supervision feature may be spliced along the height to obtain a feature of dimension (1+1) × 4 × 1.
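The splicing in this example corresponds directly to tensor concatenation. A tiny sketch, assuming a (height, width, channel) layout:

```python
# Sketch: fusing the second spectral feature with the supervision feature by
# splicing along the height dimension, matching the 1 x 4 x 1 example above.
import torch

spectral_feature = torch.zeros(1, 4, 1)    # (height, width, channels)
supervision_feature = torch.ones(1, 4, 1)  # same width, so splice along height
fused = torch.cat([spectral_feature, supervision_feature], dim=0)
print(fused.shape)                         # torch.Size([2, 4, 1]) = (1+1) x 4 x 1
```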
In one possible implementation, the spectrum reconstruction operation on the first spectral feature may be performed through deep learning. In the process of spectrum reconstruction, the self-supervision provided by the natural alignment between the damaged audio and its related information (such as video information and optical flow information) can be utilized, and each pixel point is reconstructed by prediction (or association) from the pixel points in the regions of the first spectrum image other than the missing region, so that the second spectrum image obtained after spectrum reconstruction can better complete the damaged audio.
In this implementation manner, the related information of the damaged audio (for example, at least one of the video information and the optical flow information corresponding to the damaged audio) may be used as supervision information to guide the completion of the damaged audio, thereby improving the integrity of the completed first audio and optimizing the presentation effect of the first audio.
In one possible implementation, the damaged audio includes a damaged audio segment; step S13, completing the damaged audio according to the second spectrum image to obtain a completed first audio, includes: performing spectrum frequency conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a completion audio segment; and completing the damaged audio by using the completion audio segment to obtain the completed first audio.
The damaged audio may be composed of damaged audio segments and undamaged audio segments, and the second spectrum image includes image regions corresponding to the respective segments of the damaged audio (including both the damaged and the undamaged audio segments). The completion audio segment may be used to replace the damaged audio segment in the damaged audio. The spectrum frequency conversion may be a process of converting a spectrum image into audio, which can be understood as the inverse of the spectrum conversion in step S11.
In this implementation manner, the spectrum image in the image region corresponding to the damaged audio segment in the second spectrum image may be converted into a completion audio segment, and the damaged audio segment in the damaged audio is replaced with the completion audio segment, thereby achieving audio completion.
As an example, the damaged audio consists of 5 audio segments, where the 1st, 2nd, 3rd, and 4th segments are intact audio segments and the 5th audio segment is a damaged audio segment. Through this implementation, the spectrum image in the image region corresponding to the 5th audio segment in the second spectrum image can be subjected to spectrum frequency conversion to obtain a completion audio segment, and the 5th audio segment in the damaged audio is replaced with the completion audio segment, yielding a completed first audio composed of the 1st, 2nd, 3rd, and 4th undamaged segments and the completion audio segment.
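A sketch of this final replacement step, assuming the start and end sample indices of the damaged segment are known bookkeeping (they are not specified by the example above):

```python
# Sketch: replacing a damaged segment of the waveform with the completion
# segment recovered from the second spectrum image. Segment boundaries are
# assumed to be known sample indices.
import numpy as np

def splice_completion(damaged: np.ndarray, completion: np.ndarray,
                      start: int, end: int) -> np.ndarray:
    repaired = damaged.copy()
    repaired[start:end] = completion[:end - start]  # swap in the new segment
    return repaired
```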
In one possible implementation, the damaged audio includes damaged audio segments and undamaged audio segments; step S13, completing the damaged audio according to the second spectrum image to obtain a completed first audio, includes: predicting the completion audio segment according to the spectrum image corresponding to the damaged audio segment in the second spectrum image and the undamaged audio segments; and completing the damaged audio by using the completion audio segment to obtain the completed first audio.
In this implementation, in the process of determining the completion audio segment used to replace the damaged audio segment, the undamaged audio segments in the damaged audio may be used as references. For example (continuing the example above), when determining the completion audio segment, this implementation may predict (or associate) the content of the completion audio segment from the undamaged audio segments located at the 1st, 2nd, 3rd, and 4th positions, with the spectrum image corresponding to the damaged audio segment in the second spectrum image serving as a guide for generating the completion audio segment, so that a more accurate completion audio segment is obtained; the 5th audio segment of the damaged audio is then replaced with the completion audio segment, yielding a completed first audio composed of the 1st, 2nd, 3rd, and 4th undamaged segments and the completion audio segment.
In a possible implementation manner, steps S12 and S13 may be performed by a neural network. In step S12, the operations of performing feature extraction on the first spectrum image to obtain a first spectral feature and performing spectrum reconstruction on the first spectral feature to obtain the second spectrum image can be implemented by a first generation network consisting of a first encoding network and a first decoding network; in step S13, the operation of completing the damaged audio according to the second spectrum image to obtain the completed first audio may be implemented by a second decoding network.
Fig. 3 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3, the damaged audio 201 is subjected to spectrum conversion to obtain a spectrum conversion result (a first spectral image 202); the first encoding network Ea compresses the first spectral image 202 into a feature vector fa, and the feature vector fa representing the first spectral image is sent to the first decoding network Ga; the first decoding network Ga can perform the spectrum completion operation based on the feature vector, and the completed second spectral image 203 is sent to the second decoding network 204; the second decoding network 204 completes the damaged audio according to the completed second spectral image 203 and the damaged audio 201, finally obtaining the completed first audio 205.
In one possible implementation, steps S12 and S13 may be performed by a neural network. In step S12, the operations of performing feature extraction on the first spectrum image to obtain a second spectral feature, performing feature extraction on the related information of the damaged audio to obtain a supervision feature, aligning the second spectral feature and the supervision feature, and performing spectrum reconstruction on the second spectral feature according to the aligned supervision feature to obtain the second spectrum image can be implemented by a second generation network consisting of a first encoding network, a first decoding network, and a second encoding network; in step S13, completing the damaged audio according to the second spectrum image to obtain the completed first audio can be implemented by a second decoding network.
Fig. 4 shows a schematic structural diagram of a neural network used in an audio processing method according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 4, the audio processing method may also be applied to the case where the damaged audio is completed under the self-supervision provided by the natural alignment between the damaged audio and its related information (e.g., video information and optical flow information).
The specific process includes the following steps: first, the damaged audio 301 is subjected to spectrum conversion to obtain a first spectral image 302, and the first spectral image 302 is sent to the first encoding network Ea; the first encoding network Ea compresses the first spectral image 302 into a feature vector fa; at the same time, the video information 303 and the optical flow information (not shown) of the damaged audio may also be sent to the second encoding network Ev to obtain a supervision feature fv; the feature vector fa representing the first spectral image and the supervision feature fv are fused, and the fusion result is sent to the first decoding network Gav; the first decoding network Gav may perform the spectrum completion operation based on the feature vector and the supervision feature, and the completed second spectral image 304 is sent to the second decoding network 305; the second decoding network 305 completes the damaged audio according to the completed second spectral image 304 and the damaged audio 301, finally obtaining the completed first audio 306.
In one possible implementation, the first generation network and the second decoding network may be trained separately, wherein the first generation network may be trained by an adversarial training method and the second decoding network may be trained with a discretized mixture loss function.
The network loss function used to train the first generation network may be expressed as:

$$L_{total}(A) = L_{E_a}(A) + \beta \, L_{E_a,G_a}(A)$$

where $L_{E_a}$ denotes the network loss of the first encoding network, $L_{E_a,G_a}$ denotes the network loss of the first encoding network and the first decoding network, $L_{total}$ denotes the sum of $L_{E_a}$ and $L_{E_a,G_a}$, $A$ denotes the damaged audio input to the respective networks, and $\beta$ denotes the weight of the network loss of the first decoding network.
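A hedged sketch of this loss as code: the text does not give the concrete form of either term, so an L1 reconstruction term and an adversarial term are assumed here purely for illustration, consistent with the stated adversarial training.

```python
# Sketch: the total loss L_total(A) = L_Ea(A) + beta * L_EaGa(A) above.
# The concrete forms of the two terms are not given in the text; an L1
# reconstruction term and an adversarial term are assumptions.
import torch
import torch.nn.functional as F

def first_generation_loss(pred_spec, target_spec, disc_logits, beta=0.1):
    l_ea = F.l1_loss(pred_spec, target_spec)        # assumed L_Ea term
    l_ea_ga = F.binary_cross_entropy_with_logits(   # assumed adversarial L_EaGa
        disc_logits, torch.ones_like(disc_logits))
    return l_ea + beta * l_ea_ga                    # weighted sum L_total
```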
In this implementation, the first generation network and the second decoding network may be trained through the following process:
step 401, obtaining, from a training set, a plurality of training samples (damaged audio) and the annotation information corresponding to each training sample (the undamaged audio and the spectral image of the undamaged audio);
step 402, for each training sample, performing spectrum conversion on the training sample to obtain a first spectral image of the training sample;
step 403, inputting the first spectral image into the first generation network, and performing spectrum completion on the first spectral image of the training sample based on the first generation network to obtain the completed second spectral image of the training sample;
step 404, determining $L_{E_a}$ and $L_{E_a,G_a}$ from the spectral image of the undamaged audio in the annotation information and the completed second spectral image of the training sample, and adjusting the network parameters of each network in the first generation network according to the sum $L_{total}$ of $L_{E_a}$ and $L_{E_a,G_a}$;
step 405, inputting the training sample and the completed second spectral image of the training sample into the second decoding network to obtain the completed audio of the training sample;
step 406, determining the network loss of the second decoding network according to the undamaged audio in the annotation information and the completed audio of the training sample, and adjusting the network parameters of the second decoding network according to the network loss of the second decoding network.
The execution order of step 405 and step 406 is not limited in this implementation manner.
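The control flow of steps 401-406 can be sketched as follows. The networks, data loader, and to_spectrum transform are placeholders, and simple stand-in losses are used; only the ordering of the steps mirrors the text, and the discriminator's own update is omitted for brevity.

```python
# Sketch: the training flow of steps 401-406 in PyTorch-style pseudocode.
# gen_net, wavenet_decoder, discriminator, loader, and to_spectrum are
# assumed placeholders; simple L1 losses stand in for unspecified terms.
import torch
import torch.nn.functional as F

def train_first_stage(gen_net, wavenet_decoder, discriminator, loader,
                      to_spectrum, opt_gen, opt_dec, beta=0.1):
    for damaged_audio, clean_audio, clean_spec in loader:    # step 401
        first_spec = to_spectrum(damaged_audio)              # step 402
        second_spec = gen_net(first_spec)                    # step 403
        # step 404: losses against the annotated clean spectrum
        logits = discriminator(second_spec)
        adv = F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
        loss_gen = F.l1_loss(second_spec, clean_spec) + beta * adv
        opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()
        # step 405: complete the audio from the completed spectrum
        first_audio = wavenet_decoder(damaged_audio, second_spec.detach())
        # step 406: update the second decoding network on its own loss
        loss_dec = F.l1_loss(first_audio, clean_audio)
        opt_dec.zero_grad(); loss_dec.backward(); opt_dec.step()
```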
In one possible implementation, the second generation network and the second decoding network may be trained separately, wherein the second generation network may be trained by an adversarial training method and the second decoding network may be trained with a discretized mixture loss function.
The loss function used to train the second generation network may be expressed as:

$$L_{total} = \eta_2(t) \, L_{E_a,G_{av}} + L_{E_a,G_{av},E_v} + L_{Sync}$$

where $L_{E_a,G_{av}}$ denotes the network loss of the first encoding network and the first decoding network, $\eta_2$ is the weight of that network loss, $t$ denotes time, and $\eta_2$ may be attenuated as time increases; $L_{E_a,G_{av},E_v}$ denotes the network loss of the first encoding network, the first decoding network, and the second encoding network; $L_{Sync}$ denotes the network loss of the first encoding network and the second encoding network; and $L_{total}$ denotes the sum of the three network losses.
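A sketch of this weighted sum as code. The text only states that the weight attenuates over time t; the exponential decay schedule below is an assumption for illustration.

```python
# Sketch: the second-stage loss
#   L_total(t) = eta2(t) * L_EaGav + L_EaGavEv + L_Sync,
# with eta2 decaying as training proceeds. The exponential decay is an
# assumption; the text only says eta2 attenuates as time t increases.
def second_generation_loss(l_ea_gav, l_ea_gav_ev, l_sync,
                           step, eta2_init=1.0, decay=0.999):
    eta2 = eta2_init * (decay ** step)   # assumed decay schedule
    return eta2 * l_ea_gav + l_ea_gav_ev + l_sync
```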
In this implementation, the second generation network and the second decoding network may be trained through the following process:
step 411, obtaining, from a training set, a plurality of training samples (the damaged audio, the spectral image of the damaged audio, and the video information and optical flow information of the damaged audio) and the annotation information corresponding to each training sample (the undamaged audio and the spectral image of the undamaged audio);
step 412, for each training sample, performing spectrum conversion on the damaged audio of the training sample to obtain a first spectral image of the training sample;
step 413, inputting the first spectrum image into a first coding network to obtain a second spectrum characteristic;
step 414, inputting the video information and optical flow information of the damaged audio into the second encoding network to obtain a supervision feature;
step 415, aligning the second spectral feature and the supervision feature, and determining $L_{Sync}$ according to the aligned second spectral feature and supervision feature (one possible form of $L_{Sync}$ is sketched after this list);
step 416, inputting the aligned second spectral feature and supervision feature into the first decoding network, and performing spectrum completion on the first spectral image of the training sample based on the second generation network to obtain the completed second spectral image of the training sample;
step 417, determining $L_{E_a,G_{av}}$ and $L_{E_a,G_{av},E_v}$ according to the completed second spectral image of the training sample and the undamaged audio and the spectral image of the undamaged audio in the annotation information, and adjusting the network parameters of each network in the second generation network according to the sum $L_{total}$ of $L_{Sync}$, $L_{E_a,G_{av}}$, and $L_{E_a,G_{av},E_v}$;
step 418, inputting the training sample and the completed second spectral image of the training sample into the second decoding network to obtain the completed audio of the training sample;
step 419, calculating the network loss of the second decoding network according to the undamaged audio in the annotation information and the completed audio of the training sample, and adjusting the network parameters of the second decoding network according to the network loss of the second decoding network.
The execution order of step 418 and step 419 is not limited in this implementation manner.
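As referenced in step 415, one plausible form of $L_{Sync}$ is sketched below; the disclosure does not specify its exact form, so the mean squared distance between the aligned features is an assumption.

```python
# Sketch: one assumed form of the synchronization loss L_Sync between the
# aligned second spectral feature and the supervision feature. Illustrative
# only; the exact form is not specified in the text.
import torch.nn.functional as F

def sync_loss(aligned_spectral_feature, aligned_supervision_feature):
    return F.mse_loss(aligned_spectral_feature, aligned_supervision_feature)
```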
In one possible implementation, the audio processing method may be applied to a video-audio repair tool: audio information containing damaged audio is input into the tool, and the tool outputs the repaired complete audio.
It is understood that the above method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; due to space limitations, details are not repeated in the present disclosure.
In addition, the present disclosure also provides an audio processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the audio processing methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
It will be understood by those skilled in the art that, in the above methods, the order in which the steps are written does not imply a strict execution order or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible inherent logic.
Fig. 5 shows a block diagram of an audio processing device according to an embodiment of the present disclosure. As shown in fig. 5, the audio processing apparatus includes a spectrum conversion module 501, a spectrum completion module 502, and an audio completion module 503.
The frequency spectrum conversion module 501 is configured to perform frequency spectrum conversion on a damaged audio to be processed to obtain a first frequency spectrum image of the damaged audio;
a spectrum complementing module 502, configured to perform spectrum complementing on the first spectrum image to obtain a complemented second spectrum image;
and an audio complementing module 503, configured to complement the damaged audio according to the second spectral image, to obtain a complemented first audio.
In one possible implementation, the spectrum completion module includes: a first feature extraction submodule, configured to perform feature extraction on the first spectrum image to obtain a first spectral feature; and a first spectrum reconstruction submodule, configured to perform spectrum reconstruction on the first spectral feature to obtain the second spectrum image.
In one possible implementation, the spectrum completion module includes: a second feature extraction submodule, configured to perform feature extraction on the first spectrum image to obtain a second spectral feature; a third feature extraction submodule, configured to perform feature extraction on related information of the damaged audio to obtain a supervision feature; an alignment submodule, configured to align the second spectral feature and the supervision feature; and a second spectrum reconstruction submodule, configured to perform spectrum reconstruction on the second spectral feature according to the aligned supervision feature to obtain the second spectrum image, wherein the related information includes at least one of video information and optical flow information corresponding to the damaged audio.
In one possible implementation, the damaged audio includes a damaged audio segment; the audio completion module includes: a first spectrum frequency conversion submodule, configured to perform spectrum frequency conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a completion audio segment; and a first audio completion submodule, configured to complete the damaged audio by using the completion audio segment to obtain the completed first audio.
In one possible implementation, the damaged audio includes damaged audio segments and undamaged audio segments; the audio completion module includes: a prediction submodule, configured to predict the completion audio segment according to the spectrum image corresponding to the damaged audio segment in the second spectrum image and the undamaged audio segments; and a second audio completion submodule, configured to complete the damaged audio by using the completion audio segment to obtain the completed first audio.
In one possible implementation, the audio completion module is implemented by a WaveNet decoding network.
In one possible implementation, the first and second spectral images comprise mel-frequency spectral images or mel-frequency cepstral images.
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementations, refer to the descriptions of the above method embodiments, which are not repeated here for brevity.
The disclosed embodiments also provide a computer-readable storage medium, on which computer program instructions are stored, which when executed by a processor implement the above-mentioned audio processing method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured as the above audio processing method.
Fig. 6 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 6, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 7 is a block diagram illustrating an electronic device 1900 according to an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 7, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by a memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described methods.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. An audio processing method, comprising:
carrying out spectrum conversion on damaged audio to be processed to obtain a first spectrum image of the damaged audio;
performing spectrum completion on the first spectrum image to obtain a completed second spectrum image;
completing the damaged audio according to the second spectrum image to obtain a completed first audio;
wherein the performing spectrum completion on the first spectrum image to obtain a completed second spectrum image comprises:
performing feature extraction on the first spectrum image to obtain a first spectrum feature; and
performing spectrum reconstruction on the first spectrum feature to obtain the second spectrum image;
wherein the spectrum reconstruction comprises: predicting each pixel point in the missing region according to pixel points of regions other than the missing region in the first spectrum image, using related information that is naturally aligned with the damaged audio as supervision information, wherein the related information comprises at least one of video information and optical flow information corresponding to the damaged audio.
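As a non-authoritative illustration of the spectrum conversion recited in claim 1, the sketch below turns a damaged waveform into a mel-spectrum image with librosa; the file name, sample rate, and mel parameters are assumptions, not values taken from the patent.

```python
import numpy as np
import librosa

# Load the damaged audio to be processed (path and sample rate are assumed).
wav, sr = librosa.load("damaged.wav", sr=16000)

# First spectrum image: a mel spectrogram treated as a 2-D image
# of shape (mel frequency bins, time frames).
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, image-like dynamics

print(mel_db.shape)  # e.g. (80, n_frames)
```

The completion step can then be viewed as image inpainting on this representation: the columns covering the lost segment form the missing region whose pixels are predicted from the surviving regions.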
2. The method of claim 1, wherein performing spectrum completion on the first spectrum image to obtain a completed second spectrum image comprises:
performing feature extraction on the first spectrum image to obtain a second spectrum feature;
performing feature extraction on the related information of the damaged audio to obtain a supervision feature;
aligning the second spectrum feature and the supervision feature; and
performing spectrum reconstruction on the first spectrum feature according to the aligned supervision feature to obtain the second spectrum image.
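A minimal sketch of the alignment step in claim 2, assuming PyTorch. The single-layer encoders, the 512-dimensional per-frame video embeddings, and all names are illustrative assumptions; the point is that supervision features are resampled onto the spectrum time axis before fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for real feature extractors; names and sizes are assumptions.
spec_encoder = nn.Conv2d(1, 32, kernel_size=3, padding=1)
video_encoder = nn.Conv1d(512, 32, kernel_size=1)

def align_features(spec_img, video_emb):
    """Align supervision features to the time axis of the spectrum features.

    spec_img:  (N, 1, n_mels, T_spec) first spectrum image
    video_emb: (N, 512, T_video) per-frame embeddings of the accompanying video
    """
    spec_feat = spec_encoder(spec_img)             # (N, 32, n_mels, T_spec)
    sup_feat = video_encoder(video_emb)            # (N, 32, T_video)
    # Video frames are far sparser than spectrum frames, so resample the
    # supervision features onto the spectrum time axis.
    sup_feat = F.interpolate(sup_feat, size=spec_feat.size(-1),
                             mode='linear', align_corners=False)
    # Broadcast over the frequency axis and fuse with the spectrum features.
    sup_feat = sup_feat.unsqueeze(2).expand(-1, -1, spec_feat.size(2), -1)
    return torch.cat([spec_feat, sup_feat], dim=1)  # (N, 64, n_mels, T_spec)
```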
3. The method of any one of claims 1-2, wherein the damaged audio comprises a damaged audio segment;
wherein completing the damaged audio according to the second spectrum image to obtain the completed first audio comprises:
performing inverse spectrum conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a complementary audio segment; and
completing the damaged audio by utilizing the complementary audio segment to obtain the completed first audio.
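For the inverse conversion in claim 3, the disclosure ultimately uses a learned decoder (see claim 5); as a classical stand-in, Griffin-Lim via librosa can turn the completed mel-spectrum patch back into samples. Function and variable names here are assumptions, and `mel_patch` is assumed to be on the power scale.

```python
import librosa

def complete_segment(wav, mel_patch, start_sample, sr=16000,
                     n_fft=1024, hop_length=256):
    """Griffin-Lim stand-in for the learned decoder: convert the completed
    mel-spectrum patch back into samples and splice them over the damaged
    segment of the waveform."""
    patch = librosa.feature.inverse.mel_to_audio(
        mel_patch, sr=sr, n_fft=n_fft, hop_length=hop_length)
    out = wav.copy()
    end = min(start_sample + len(patch), len(out))
    out[start_sample:end] = patch[:end - start_sample]
    return out
```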
4. The method of any one of claims 1-2, wherein the damaged audio comprises a damaged audio segment and an undamaged audio segment;
wherein completing the damaged audio according to the second spectrum image to obtain the completed first audio comprises:
predicting a complementary audio segment according to the undamaged audio segment and the spectrum image corresponding to the damaged audio segment in the second spectrum image; and
completing the damaged audio by utilizing the complementary audio segment to obtain the completed first audio.
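When splicing the predicted complementary segment back into the audio as in claim 4, a short cross-fade at each seam avoids audible clicks. This numpy sketch is an illustration only; the function name, the 256-sample fade length, and the overlap assumption are all assumptions, not taken from the patent.

```python
import numpy as np

def splice_with_crossfade(original, patch, start, fade=256):
    """Splice the predicted complementary segment into the damaged audio.

    Assumes `patch` overlaps undamaged audio by `fade` samples at each end,
    so both seams blend real signal with predicted signal.
    """
    out = original.copy()
    end = start + len(patch)
    ramp = np.linspace(0.0, 1.0, fade)
    out[start:end] = patch
    # Leading seam: fade out the surviving audio, fade in the prediction.
    out[start:start + fade] = ((1.0 - ramp) * original[start:start + fade]
                               + ramp * patch[:fade])
    # Trailing seam: fade the prediction back out into the surviving audio.
    out[end - fade:end] = (ramp[::-1] * patch[-fade:]
                           + (1.0 - ramp[::-1]) * original[end - fade:end])
    return out
```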
5. The method according to any one of claims 1-2, wherein:
the completing of the damaged audio according to the second spectrum image to obtain the completed first audio is implemented through a WaveNet decoding network.
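Claim 5 names a WaveNet decoding network. The sketch below is a toy gated, dilated, causal convolution stack in the WaveNet style, conditioned on a mel spectrum assumed to be pre-upsampled to one column per audio sample; it illustrates the general architecture only and is not the patented network.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d padded on the left only, so each output sees no future samples."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size,
                         padding=(kernel_size - 1) * dilation, dilation=dilation)

    def forward(self, x):
        return super().forward(x)[:, :, :x.size(2)]  # trim right-side overhang

class ToyWaveNetDecoder(nn.Module):
    """Gated dilated causal stack conditioned on an upsampled mel spectrum."""
    def __init__(self, channels=64, n_mels=80, layers=8):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, 1)
        self.cond = nn.Conv1d(n_mels, 2 * channels, 1)
        self.convs = nn.ModuleList(
            [CausalConv1d(channels, 2 * channels, 2, dilation=2 ** i)
             for i in range(layers)])
        self.out = nn.Conv1d(channels, 256, 1)  # logits over 8-bit mu-law values

    def forward(self, wav, mel):
        # wav: (N, 1, T) previous samples; mel: (N, n_mels, T) conditioning.
        h = self.inp(wav)
        c = self.cond(mel)
        for conv in self.convs:
            a, b = (conv(h) + c).chunk(2, dim=1)
            h = h + torch.tanh(a) * torch.sigmoid(b)  # gated residual update
        return self.out(h)
```

At inference such a decoder is run autoregressively, sampling one mu-law value at a time and feeding it back in; training teacher-forces the ground-truth waveform.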
6. The method according to any one of claims 1-2, wherein:
the first and second spectral images include mel-frequency spectral images or mel-frequency cepstrum images.
7. An audio processing apparatus, comprising:
the system comprises a frequency spectrum conversion module, a frequency spectrum conversion module and a frequency spectrum conversion module, wherein the frequency spectrum conversion module is used for carrying out frequency spectrum conversion on damaged audio to be processed to obtain a first frequency spectrum image of the damaged audio;
the frequency spectrum complementing module is used for carrying out frequency spectrum complementing on the first frequency spectrum image to obtain a complemented second frequency spectrum image;
the audio complementing module is used for complementing the damaged audio according to the second frequency spectrum image to obtain a complemented first audio;
the spectrum completion module comprises:
the first feature extraction submodule is used for performing feature extraction on the first frequency spectrum image to obtain a first frequency spectrum feature;
the first frequency spectrum reconstruction submodule is used for carrying out frequency spectrum reconstruction on the first frequency spectrum characteristic to obtain a second frequency spectrum image;
the first spectrum reconstruction submodule is specifically configured to predict each pixel point in the missing region according to pixel points of other regions except the missing region in the first spectrum image by using, as supervision information, related information of the damaged audio that is naturally aligned with the damaged audio, where the related information includes at least one of video information and optical flow information corresponding to the damaged audio.
8. The apparatus of claim 7, wherein the spectrum completion module comprises:
the second feature extraction submodule is used for extracting features of the first frequency spectrum image to obtain second frequency spectrum features;
the third feature extraction submodule is used for performing feature extraction on the relevant information of the damaged audio to obtain supervision features;
an alignment sub-module for aligning the second spectral feature and the supervisory feature;
and the second frequency spectrum reconstruction submodule is used for carrying out frequency spectrum reconstruction on the first frequency spectrum characteristic according to the aligned supervision characteristic to obtain a second frequency spectrum image.
9. The apparatus of any one of claims 7-8, wherein the damaged audio comprises a damaged audio segment; and the audio completion module comprises:
a first inverse spectrum conversion submodule configured to perform inverse spectrum conversion on the spectrum image corresponding to the damaged audio segment in the second spectrum image to obtain a complementary audio segment; and
a first audio completion submodule configured to complete the damaged audio by utilizing the complementary audio segment to obtain a completed first audio.
10. The apparatus of any one of claims 7-8, wherein the damaged audio comprises a damaged audio segment and an undamaged audio segment; and the audio completion module comprises:
a prediction submodule configured to predict a complementary audio segment according to the undamaged audio segment and the spectrum image corresponding to the damaged audio segment in the second spectrum image; and
a second audio completion submodule configured to complete the damaged audio by utilizing the complementary audio segment to obtain the completed first audio.
11. The apparatus according to any one of claims 7-8, wherein:
the audio completion module is realized through a WaveNet decoding network.
12. The apparatus according to any one of claims 7-8, wherein:
the first and second spectral images include mel-frequency spectral images or mel-frequency cepstrum images.
13. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the method of any one of claims 1 to 6.
14. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 6.
CN201910086763.5A 2019-01-29 2019-01-29 Audio processing method and device, electronic equipment and storage medium Active CN109887515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910086763.5A CN109887515B (en) 2019-01-29 2019-01-29 Audio processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910086763.5A CN109887515B (en) 2019-01-29 2019-01-29 Audio processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109887515A CN109887515A (en) 2019-06-14
CN109887515B (en) 2021-07-09

Family

ID=66927191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910086763.5A Active CN109887515B (en) 2019-01-29 2019-01-29 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109887515B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379414B (en) * 2019-07-22 2021-12-03 出门问问(苏州)信息科技有限公司 Acoustic model enhancement training method and device, readable storage medium and computing equipment
CN110378860B (en) * 2019-07-30 2023-08-18 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for repairing video
CN110781223A (en) * 2019-10-16 2020-02-11 深圳市商汤科技有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111145778B (en) * 2019-11-28 2023-04-04 科大讯飞股份有限公司 Audio data processing method and device, electronic equipment and computer storage medium
CN111556254B (en) * 2020-04-10 2021-04-02 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111798866A (en) * 2020-07-13 2020-10-20 商汤集团有限公司 Method and device for training audio processing network and reconstructing stereo
CN112071331B (en) * 2020-09-18 2023-05-30 平安科技(深圳)有限公司 Voice file restoration method and device, computer equipment and storage medium

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3603381B2 (en) * 1995-04-07 2004-12-22 ソニー株式会社 Compressed data editing device and compressed data editing method
CN1684371A (en) * 2004-02-27 2005-10-19 三星电子株式会社 Lossless audio decoding/encoding method and apparatus
US7035700B2 (en) * 2002-03-13 2006-04-25 The United States Of America As Represented By The Secretary Of The Air Force Method and apparatus for embedding data in audio signals
EP1688916A3 (en) * 2005-02-05 2007-05-09 Samsung Electronics Co., Ltd. Method and apparatus for recovering line spectrum pair parameter and speech decoding apparatus using same
CN101432610A (en) * 2006-05-05 2009-05-13 汤姆森许可贸易公司 Method and apparatus for lossless encoding of a source signal using a lossy encoded data stream and a lossless extension data stream
CN101437009A (en) * 2007-11-15 2009-05-20 华为技术有限公司 Method for hiding loss package and system thereof
CN101474104A (en) * 2009-01-14 2009-07-08 西安交通大学 Self-adjusting pharyngeal cavity electronic larynx voice communication system and method
CN102324236A (en) * 2006-07-31 2012-01-18 高通股份有限公司 Be used for valid frame is carried out system, the method and apparatus of wideband encoding and decoding
CN103377655A (en) * 2012-04-16 2013-10-30 三星电子株式会社 Apparatus and method with enhancement of sound quality
CN103984315A (en) * 2014-05-15 2014-08-13 成都百威讯科技有限责任公司 Domestic multifunctional intelligent robot
CN104011735A (en) * 2011-12-26 2014-08-27 英特尔公司 Vehicle Based Determination Of Occupant Audio And Visual Input
WO2016024853A1 (en) * 2014-08-15 2016-02-18 삼성전자 주식회사 Sound quality improving method and device, sound decoding method and device, and multimedia device employing same
CN105843474A (en) * 2016-03-23 2016-08-10 努比亚技术有限公司 Volume adjustment system and method
CN107039042A (en) * 2016-12-09 2017-08-11 电子科技大学 A kind of audio restorative procedure and system based on low uniformity dictionary and rarefaction representation
CN107077849A (en) * 2014-11-07 2017-08-18 三星电子株式会社 Method and apparatus for recovering audio signal
CN107564533A (en) * 2017-07-12 2018-01-09 同济大学 Speech frame restorative procedure and device based on information source prior information
CN107749302A (en) * 2017-10-27 2018-03-02 广州酷狗计算机科技有限公司 Audio-frequency processing method, device, storage medium and terminal
CN108831490A (en) * 2013-02-05 2018-11-16 瑞典爱立信有限公司 Method and apparatus for being controlled audio frame loss concealment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0795121A (en) * 1993-09-20 1995-04-07 Takayama:Kk Method and device for compressing voice signal
JP2003157097A (en) * 2001-11-20 2003-05-30 Hitachi Ltd Coded voice decoder
US7019749B2 (en) * 2001-12-28 2006-03-28 Microsoft Corporation Conversational interface agent
CN102522082B (en) * 2011-12-27 2013-07-10 重庆大学 Recognizing and locating method for abnormal sound in public places
US9459768B2 (en) * 2012-12-12 2016-10-04 Smule, Inc. Audiovisual capture and sharing framework with coordinated user-selectable audio and video effects filters


Also Published As

Publication number Publication date
CN109887515A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109887515B (en) Audio processing method and device, electronic equipment and storage medium
CN110659640B (en) Text sequence recognition method and device, electronic equipment and storage medium
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN110287874B (en) Target tracking method and device, electronic equipment and storage medium
CN110889469B (en) Image processing method and device, electronic equipment and storage medium
CN110837761B (en) Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN107944409B (en) Video analysis method and device capable of distinguishing key actions
CN110909815B (en) Neural network training method, neural network training device, neural network processing device, neural network training device, image processing device and electronic equipment
JP2021519051A (en) Video repair methods and equipment, electronics, and storage media
CN109165738B (en) Neural network model optimization method and device, electronic device and storage medium
CN110858924B (en) Video background music generation method and device and storage medium
CN111242303B (en) Network training method and device, and image processing method and device
CN110659690B (en) Neural network construction method and device, electronic equipment and storage medium
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN110991329A (en) Semantic analysis method and device, electronic equipment and storage medium
CN112001364A (en) Image recognition method and device, electronic equipment and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN111583142B (en) Image noise reduction method and device, electronic equipment and storage medium
CN110633715B (en) Image processing method, network training method and device and electronic equipment
CN116741191A (en) Audio signal processing method, device, electronic equipment and storage medium
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN111488964A (en) Image processing method and device and neural network training method and device
CN110019928B (en) Video title optimization method and device
CN109189822A (en) Data processing method and device
CN111582265A (en) Text detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant