WO2019233364A1 - 基于深度学习的音频音质增强 (Audio sound quality enhancement based on deep learning) - Google Patents

基于深度学习的音频音质增强 (Audio sound quality enhancement based on deep learning)

Info

Publication number
WO2019233364A1
WO2019233364A1 (application PCT/CN2019/089763)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
lossy
lossless
samples
neural network
Prior art date
Application number
PCT/CN2019/089763
Other languages
English (en)
French (fr)
Inventor
秦宇
姚青山
喻浩文
卢峰
Original Assignee
安克创新科技股份有限公司 (Anker Innovations Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安克创新科技股份有限公司 (Anker Innovations Technology Co., Ltd.)
Publication of WO2019233364A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to the technical field of sound quality optimization, and more particularly to a method, device, system, storage medium, and computer program for audio sound quality enhancement based on deep learning.
  • Sound quality usually refers to a listener's subjective evaluation of audio quality. Many factors affect audio quality; for coded audio, a decisive factor is the degree of audio compression.
  • the original sound may be recorded into a linear pulse code modulation (LPCM) format file, i.e. a pulse sequence produced by analog-to-digital conversion. In audio theory, this represents digital audio of the highest sound quality: a lossless, high-bit-rate format.
  • Lossless compression includes lossless audio coding formats such as FLAC and APE.
  • Lossy compression formats such as MP3 and Advanced Audio Coding (AAC) have been widely adopted because they greatly reduce the bit rate, saving transmission and storage resources.
  • reconstructing lossy audio through digital signal processing methods so that its sound quality approaches that of the lossless audio before encoding is therefore a valuable research direction.
  • if lower-bit-rate audio can be reconstructed by an algorithm so that its sound quality approaches the level of lossless audio, this is also of great significance for saving bandwidth resources.
  • for the reconstruction of lossy audio using software methods, filling or interpolating data is usually adopted, but this approach is too rough to restore the sound quality of the lossless audio.
  • the present invention has been made to solve at least one of the problems described above.
  • the present invention proposes a deep-learning-based solution for audio sound quality enhancement, in which a deep neural network reconstructs lossy audio so that its sound quality approaches that of lossless audio, achieving a sound quality improvement that traditional methods cannot.
  • a method for enhancing audio sound quality based on deep learning includes: obtaining lossy audio data and performing feature extraction on it to obtain features of the lossy audio data; and, based on those features, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to lossless audio.
  • the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, where the lossy audio samples are obtained by transforming the lossless audio samples; performing feature extraction on the lossy and lossless audio samples to obtain their respective features; and using the features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network and the features of the lossless audio samples as the target of its output layer to train the audio reconstruction neural network.
  • the lossy audio sample is obtained by format conversion of the lossless audio sample.
  • the sampling frequency and the number of quantization bits of the lossless audio sample and the lossy audio sample are the same.
  • the features obtained by the feature extraction include frequency domain amplitude and / or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing each of them separately, with the feature extraction then performed frame by frame on the framed audio samples.
  • the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding each of them into time-domain waveform data, with the framing then performed on the decoded time-domain waveform data.
  • reconstructing the lossy audio data into the output audio data using the trained audio reconstruction neural network includes: using the features of the lossy audio data as the input of the trained network and obtaining reconstructed audio features from its output; and generating a time-domain audio waveform from the reconstructed audio features as the output audio data.
  • a deep-learning-based audio sound quality enhancement device includes: a feature extraction module for acquiring lossy audio data and performing feature extraction on it to obtain features of the lossy audio data; and an audio reconstruction module configured to use a trained audio reconstruction neural network, based on the features extracted by the feature extraction module, to reconstruct the lossy audio data into output audio data with sound quality close to lossless audio.
  • the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, where the lossy audio samples are obtained by transforming the lossless audio samples; performing feature extraction on the lossy and lossless audio samples to obtain their respective features; and using the features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network and the features of the lossless audio samples as the target of its output layer to train the audio reconstruction neural network.
  • the lossy audio sample is obtained by format conversion of the lossless audio sample.
  • the sampling frequency and the number of quantization bits of the lossless audio sample and the lossy audio sample are the same.
  • the features obtained by the feature extraction include frequency domain amplitude and / or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing each of them separately, with the feature extraction then performed frame by frame on the framed audio samples.
  • the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding each of them into time-domain waveform data, with the framing then performed on the decoded time-domain waveform data.
  • the audio reconstruction module further includes: a reconstruction module, configured to use the features of the lossy audio data as the input of the trained audio reconstruction neural network and to obtain reconstructed audio features from its output; and a generating module for generating, based on the reconstructed audio features output by the reconstruction module, a time-domain audio waveform as the output audio data.
  • a deep learning-based audio sound quality enhancement system includes a storage device and a processor.
  • the storage device stores a computer program run by the processor.
  • the computer program, when executed by the processor, executes the deep-learning-based audio sound quality enhancement method according to any one of the above.
  • a storage medium stores a computer program, and the computer program executes any one of the above-mentioned deep learning-based audio sound quality enhancement methods when running.
  • a computer program is provided, which is used by a computer or a processor to execute the deep-learning-based audio sound quality enhancement method according to any one of the above, and which is further used to implement each module in the deep-learning-based audio sound quality enhancement device according to any one of the above.
  • the method, device, system, storage medium, and computer program for enhancing audio sound quality based on deep learning enhance lossy audio sound quality through deep neural network reconstruction, bringing it close to lossless audio quality and achieving a sound quality improvement that traditional methods cannot.
  • the method, device, system, storage medium, and computer program for enhancing audio sound quality based on deep learning according to the embodiments of the present invention can be conveniently deployed on a server or a user end, and can effectively enhance audio sound quality.
  • FIG. 1 shows a schematic block diagram of an example electronic device for implementing a deep learning-based audio sound quality enhancement method, apparatus, system, storage medium, and computer program according to an embodiment of the present invention
  • FIG. 2 shows a schematic flowchart of a method for enhancing audio sound quality based on deep learning according to an embodiment of the present invention
  • FIG. 3 shows a training schematic diagram of an audio reconstruction neural network according to an embodiment of the present invention
  • FIG. 4 shows a schematic block diagram of a deep learning-based audio sound quality enhancement device according to an embodiment of the present invention.
  • FIG. 5 shows a schematic block diagram of a deep learning-based audio sound quality enhancement system according to an embodiment of the present invention.
  • an example electronic device 100 for implementing a deep learning-based audio sound quality enhancement method, apparatus, system, storage medium, and computer program according to an embodiment of the present invention is described with reference to FIG. 1.
  • the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, and an output device 108, interconnected through a bus system 110 and/or other forms of connection mechanism (not shown). It should be noted that the components and structures of the electronic device 100 shown in FIG. 1 are only exemplary, not restrictive, and the electronic device may have other components and structures as needed.
  • the processor 102 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 100 to perform a desired function.
  • CPU central processing unit
  • the storage device 104 may include one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and / or a cache memory.
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • one or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described below, and/or other desired functions.
  • various application programs and various data, such as data used and/or generated by the application programs, can also be stored in the computer-readable storage medium.
  • the input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. In addition, the input device 106 may be any interface for receiving information.
  • the output device 108 may output various information (such as images or sounds) to the outside (such as a user), and may include one or more of a display, a speaker, and the like. In addition, the output device 108 may be any other device having an output function.
  • an example electronic device for implementing a method, an apparatus, a system, a storage medium, and a computer program for deep learning-based audio sound quality enhancement may be implemented as a terminal such as a smartphone, a tablet computer, or the like.
  • the deep learning-based audio sound quality enhancement method 200 may include the following steps:
  • the audio data obtained in step S210 may be lossy audio data, received, stored, or played in an audio storage/playback device, whose sound quality needs to be enhanced. Such data includes, but is not limited to, audio being played, audio in a list, or audio files stored in the cloud, on a client device, and so on.
  • the lossy audio data may include, but is not limited to, audio data such as music in formats such as MP3, AAC, and OGG.
  • the audio data obtained in step S210 may also be any data that requires sound quality enhancement, such as audio data included in video data.
  • the audio data acquired in step S210 may come from a file stored offline, or from a file played online.
  • a manner of performing feature extraction on the acquired lossy audio data may include, but is not limited to, a short-time Fourier transform (STFT).
  • the features of the lossy audio data obtained by performing feature extraction on the obtained lossy audio data may include frequency domain amplitude and / or energy information.
  • the features of the lossy audio data obtained by performing feature extraction on the lossy audio data may further include spectral phase information.
  • the features of the lossy audio data obtained by performing feature extraction on the lossy audio data may also be time-domain features.
  • the features of the lossy audio data obtained by performing feature extraction on the obtained lossy audio data may also include any other features that can characterize the lossy audio data.
  • before performing feature extraction on the lossy audio data, frame processing may be performed on it, and the aforementioned feature extraction may then be performed frame by frame on the framed audio data. This may be applicable when the lossy audio data obtained in step S210 is from a file stored offline or a complete file from any source.
  • when the lossy audio data obtained in step S210 comes from a file played online, one or more frames of lossy audio data may be buffered before feature extraction.
  • part of the data can be selected for feature extraction from each frame of the lossy audio data obtained after framing or caching, which can effectively reduce the amount of data and improve processing efficiency.
  • before performing the aforementioned frame processing on the lossy audio data, the lossy audio data may first be decoded, and the frame processing may then be performed on the time-domain waveform data obtained after decoding.
  • the acquired lossy audio data is generally in an encoded form; to obtain its complete audio time-domain information, it may be decoded first. Therefore, in one example, the acquired lossy audio data can be sequentially decoded, framed, and feature-extracted to efficiently obtain well-representative features.
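The decode, frame, and extract steps described above can be sketched as follows. This is a minimal NumPy illustration only: the frame length, hop size, and Hann window are illustrative assumptions (the publication does not fix these values), and the input is assumed to be already-decoded time-domain waveform data.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D time-domain waveform into overlapping frames,
    zero-padding the tail so the last frame is complete."""
    n_frames = 1 + int(np.ceil(max(len(x) - frame_len, 0) / hop))
    padded = np.pad(x, (0, n_frames * hop + frame_len - len(x)))
    return np.stack([padded[i * hop : i * hop + frame_len] for i in range(n_frames)])

def stft_features(x, frame_len=1024, hop=512):
    """Frame-by-frame short-time Fourier transform: returns per-frame
    magnitude (amplitude) and phase of the one-sided spectrum."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)
```

The magnitude array corresponds to the frequency-domain amplitude information mentioned above, and the phase array to the spectral phase information.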
  • step S220 based on the characteristics of the lossy audio data, the trained audio reconstruction neural network is used to reconstruct the lossy audio data into output audio data whose sound quality is close to the lossless audio.
  • the features of the lossy audio data extracted in step S210 are input to a trained audio reconstruction neural network, and the audio reconstruction neural network reconstructs the input features to obtain reconstructed audio.
  • the reconstructed audio features can be used to generate output audio data whose sound quality is closer to lossless audio than that of the acquired lossy audio data. The sound quality enhancement method of the present invention can therefore accurately supplement, based on deep learning, the audio information lost in the lossy audio. This not only greatly improves the lossy audio sound quality but also does not burden the communication bandwidth: what is transmitted is still lossy audio data with a small amount of data, which can be reconstructed at the receiving end into data with sound quality close to lossless audio.
  • the training process of the audio reconstruction neural network according to the embodiment of the present invention is described below with reference to FIG. 3.
  • the training of the audio reconstruction neural network according to the embodiment of the present invention may include the following process:
  • a lossless audio sample and a lossy audio sample are obtained, wherein the lossy audio sample is obtained by transforming the lossless audio sample.
  • the lossless audio samples may be lossless audio data in pulse code modulation (PCM) format, WAV format, FLAC format, or other lossless audio format.
  • the lossless audio samples can be format-converted to obtain the lossy audio samples.
  • for example, lossless audio samples are lossy-encoded and then decoded to obtain lossy audio samples.
  • the bit rate may include, but is not limited to, 128 kbps, 192 kbps, and the like, and the encoding format may include, but is not limited to, OGG, MP3, AAC, and the like.
  • lossless audio samples can be converted to lossy audio samples while maintaining the same sampling frequency and number of quantization bits.
  • the sampling frequency and the number of quantization bits of both the lossless audio sample and the transformed lossy audio sample can be the same.
  • a typical scenario in which lossy audio samples are converted from lossless audio samples may include, but is not limited to, transcoding music in FLAC format with a sampling frequency of 44.1 kHz into music in MP3 format with a bit rate of 128 kbps and a sampling frequency of 44.1 kHz.
  • lossy audio samples can also be obtained from lossless audio samples in other ways, which can be adapted to the actual application scenario.
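As one concrete (hypothetical) way to carry out the transcoding just described, a training-data script might call an external encoder such as ffmpeg, keeping the sampling frequency unchanged while imposing a lossy bit rate. The file paths below are placeholders, and the presence of ffmpeg on the system is an assumption, not something the publication specifies.

```python
import subprocess

def build_transcode_cmd(lossless_path, lossy_path, bitrate="128k", sample_rate=44100):
    """Assemble an ffmpeg command that lossy-encodes a lossless file
    while keeping the sampling frequency of the source unchanged."""
    return ["ffmpeg", "-y", "-i", lossless_path,
            "-ar", str(sample_rate),  # keep the sampling frequency, e.g. 44.1 kHz
            "-b:a", bitrate,          # target lossy bit rate, e.g. 128 kbps
            lossy_path]

def transcode(lossless_path, lossy_path, **kwargs):
    """Run the transcode; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_transcode_cmd(lossless_path, lossy_path, **kwargs), check=True)
```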
  • feature extraction is performed on the lossy audio sample and the lossless audio sample to obtain the features of the lossy audio sample and the features of the lossless audio sample, respectively.
  • the feature extraction method for each of the lossless audio sample and the lossy audio sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the lossless audio samples and the lossy audio samples may include their respective frequency domain amplitude and / or energy information.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may also include any other features that can characterize their respective features.
  • the lossless audio sample and the lossy audio sample may each be subjected to frame processing, and the aforementioned feature extraction may be performed frame by frame on the respective audio samples obtained after framing.
  • part of the data can be selected for feature extraction for each frame of lossy / lossless audio samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the lossless audio samples and the lossy audio samples may each be decoded, and the aforementioned framing may be performed on their respective decoded time-domain waveform data. The lossless and lossy audio samples can thus be sequentially decoded, framed, and feature-extracted to efficiently obtain their respective representative features.
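Putting the decode, framing, and feature-extraction steps together, the decoded lossless and lossy waveforms can be framed identically so that the i-th lossy feature vector lines up with the i-th lossless feature vector. A sketch using magnitude features only (frame length, hop size, and window are illustrative assumptions):

```python
import numpy as np

def make_training_pairs(lossless_wave, lossy_wave, frame_len=1024, hop=512):
    """Frame both decoded waveforms identically and return aligned
    (lossy_features, lossless_features) per-frame magnitude spectra."""
    n = min(len(lossless_wave), len(lossy_wave))  # align the two signals in length
    win = np.hanning(frame_len)
    feats = []
    for x in (lossy_wave[:n], lossless_wave[:n]):
        starts = range(0, n - frame_len + 1, hop)
        frames = np.stack([x[s:s + frame_len] * win for s in starts])
        feats.append(np.abs(np.fft.rfft(frames, axis=1)))
    return feats[0], feats[1]  # network inputs, network targets
```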
  • the obtained features of the lossy audio samples are used as the input of the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples are used as the target of its output layer, to train the audio reconstruction neural network.
  • the features of one or more frames of lossy audio samples can be used as the input of the input layer, and the features of one or more frames of lossless audio samples as the target of the output layer, to train a neural network regressor as the audio reconstruction neural network used in step S220.
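A minimal sketch of such a regressor, trained on aligned per-frame features as described above. The architecture (a single ReLU hidden layer trained by full-batch gradient descent on mean squared error) and all hyperparameters are illustrative assumptions; the publication does not fix a network topology.

```python
import numpy as np

def train_regressor(lossy_feats, lossless_feats, hidden=64, lr=0.01, epochs=500, seed=0):
    """Train a one-hidden-layer MLP that maps lossy-frame features to
    lossless-frame features by minimising mean squared error."""
    rng = np.random.default_rng(seed)
    d_in, d_out = lossy_feats.shape[1], lossless_feats.shape[1]
    W1 = rng.normal(0.0, 0.1, (d_in, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d_out)); b2 = np.zeros(d_out)
    n = len(lossy_feats)
    for _ in range(epochs):
        h = np.maximum(lossy_feats @ W1 + b1, 0.0)   # ReLU hidden activations
        err = (h @ W2 + b2) - lossless_feats          # prediction error
        gW2 = h.T @ err / n
        gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (h > 0)                   # back-propagate through ReLU
        gW1 = lossy_feats.T @ dh / n
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    def predict(x):
        return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
    return predict
```

In step S220, the returned predict function would play the role of the trained audio reconstruction neural network, mapping extracted lossy-audio features to reconstructed audio features.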
  • in step S220, based on the trained audio reconstruction neural network, the features of the lossy audio data can be reconstructed into reconstructed audio features. Since these are frequency-domain features, a time-domain audio waveform output can be generated from the reconstructed audio features.
  • a time-domain audio waveform may be obtained by applying an inverse Fourier transform to the reconstructed audio features.
  • the output audio waveform can be stored or buffered for playback, providing users with a better enhanced sound quality experience.
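The inverse transform mentioned above can be sketched as a windowed overlap-add inverse FFT. This assumes the forward transform used a Hann window with 50% overlap (illustrative values) and that phase is available alongside the reconstructed magnitudes, e.g. carried over from the lossy input, which is a common simplification rather than something the publication mandates.

```python
import numpy as np

def istft_overlap_add(mag, phase, frame_len=1024, hop=512):
    """Invert per-frame one-sided spectra to a time-domain waveform by
    windowed overlap-add, normalising by the accumulated squared window."""
    frames = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len, axis=1)
    win = np.hanning(frame_len)
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f * win
        norm[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)  # avoid dividing by ~0 at the signal edges
```

Away from the signal edges this inversion is exact for an unmodified spectrum, so any change in sound quality comes only from the network's modification of the features.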
  • the deep-learning-based audio sound quality enhancement method enhances lossy audio sound quality through deep neural network reconstruction, bringing it close to lossless audio sound quality and thereby achieving a sound quality improvement that traditional methods cannot.
  • a deep learning-based audio sound quality enhancement method may be implemented in a device, an apparatus, or a system having a memory and a processor.
  • the method for enhancing audio sound quality based on deep learning according to an embodiment of the present invention can be conveniently deployed to mobile devices such as smart phones, tablet computers, personal computers, headphones, and speakers.
  • the method for enhancing audio sound quality based on deep learning according to an embodiment of the present invention may also be deployed on a server side (or cloud).
  • the method for enhancing audio sound quality based on deep learning according to an embodiment of the present invention may also be deployed on the server (or cloud) and personal terminals in a distributed manner.
  • a typical application scenario of an audio sound quality enhancement method based on deep learning may include, but is not limited to, taking MP3 format music with a code rate of 128 kbps and a sampling frequency of 44.1 kHz as input.
  • the audio reconstruction neural network reconstructs the MP3-format music into music with quality close to FLAC format at a sampling frequency of 44.1 kHz.
  • this is only an exemplary typical application scenario, and the deep learning-based audio sound quality enhancement method according to an embodiment of the present invention can also be applied to any scene where sound quality enhancement is required.
  • FIG. 4 shows a schematic block diagram of an audio sound quality enhancement apparatus 400 based on deep learning according to an embodiment of the present invention.
  • a deep learning-based audio sound quality enhancement device 400 includes a feature extraction module 410 and an audio reconstruction module 420.
  • Each of the modules may perform each step / function of the method for enhancing audio sound quality based on deep learning described above in conjunction with FIG. 2.
  • the main functions of the modules of the audio sound quality enhancement device 400 based on deep learning are described, and the details that have been described above are omitted.
  • the feature extraction module 410 is configured to obtain lossy audio data, and perform feature extraction on the lossy audio data to obtain features of the lossy audio data.
  • the audio reconstruction module 420 is configured to reconstruct the lossy audio data into an output with a sound quality close to that of the lossless audio based on the features of the lossy audio data extracted by the feature extraction module using a trained audio reconstruction neural network. Audio data. Both the feature extraction module 410 and the audio reconstruction module 420 may be implemented by the processor 102 in the electronic device shown in FIG. 1 running program instructions stored in the storage device 104.
  • the audio data acquired by the feature extraction module 410 may be lossy audio data, received, stored, or played in an audio storage/playback device, whose sound quality needs to be enhanced. Such data includes, but is not limited to, audio being played, audio in a list, or audio files stored in the cloud, on a client device, and so on. Exemplarily, the lossy audio data may include, but is not limited to, audio data such as music in formats such as MP3, AAC, and OGG. In other examples, the audio data acquired by the feature extraction module 410 may also be any data that requires sound quality enhancement, such as audio data included in video data. In addition, the audio data obtained by the feature extraction module 410 may come from files stored offline, or from files played online.
  • the manner in which the feature extraction module 410 performs feature extraction on the acquired lossy audio data may include, but is not limited to, a short-time Fourier transform (STFT).
  • the feature extraction module 410 performs feature extraction on the acquired lossy audio data; the resulting features of the lossy audio data may include frequency-domain amplitude and/or energy information.
  • the features of the lossy audio data obtained by the feature extraction of the lossy audio data by the feature extraction module 410 may further include spectrum phase information.
  • in another example, the features of the lossy audio data obtained by feature extraction may also be time-domain features.
  • the feature extraction module 410 performs feature extraction on the acquired lossy audio data, and the features of the lossy audio data may further include any other features that may characterize the lossy audio data.
  • before feature extraction, the feature extraction module 410 may perform frame processing on the lossy audio data, and the aforementioned feature extraction may then be performed frame by frame on the framed audio data. This may be applicable when the lossy audio data obtained by the feature extraction module 410 is from a file stored offline or a complete file from any source.
  • when the lossy audio data obtained by the feature extraction module 410 comes from a file played online, one or more frames of the lossy audio data may be buffered before feature extraction.
  • the feature extraction module 410 may select part of the data from each frame of the lossy audio data obtained after framing or caching for feature extraction, which can effectively reduce the amount of data and improve processing efficiency.
  • before the aforementioned frame processing, the lossy audio data may be decoded, for example by a decoding module (not shown in FIG. 4) included in the feature extraction module, and the frame processing may then be performed on the time-domain waveform data obtained after decoding. This is because the lossy audio data obtained by the feature extraction module 410 is generally in an encoded form; to obtain its complete audio time-domain information, it may be decoded first. Therefore, in one example, the feature extraction module 410 may sequentially decode, frame, and feature-extract the acquired lossy audio data to efficiently obtain well-representative features.
  • the audio reconstruction module 420 may use the trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of the lossless audio.
  • the audio reconstruction module 420 may further include a reconstruction module (not shown in FIG. 4) and a generation module (not shown in FIG. 4).
  • the reconstruction module may include a trained audio reconstruction neural network that takes as input the features of the lossy audio data extracted by the feature extraction module 410 and reconstructs the input features to obtain reconstructed audio features.
  • the generating module generates, based on the reconstructed audio features output by the reconstruction module, output audio data whose sound quality is closer to lossless audio than that of the acquired lossy audio data.
  • the sound quality enhancement device of the present invention can accurately supplement the audio information lost in the lossy audio based on deep learning, which not only effectively achieves a great improvement of the lossy audio sound quality, but also does not compromise communication bandwidth (because what is transmitted is still lossy audio data with a small data volume, and the lossy audio data can be reconstructed at the receiving end into data with sound quality close to that of lossless audio).
  • the training of the audio reconstruction neural network used by the audio reconstruction module 420 may include: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples to obtain the features of the lossy audio samples and the features of the lossless audio samples respectively; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
  • the training process of the audio reconstruction neural network utilized by the audio reconstruction module 420 of the deep-learning-based audio sound quality enhancement device 400 can be understood with reference to FIG. 3 and the description of FIG. 3 above. For brevity, excessive detail is not repeated here.
  • the lossless audio samples may be lossless audio data in pulse code modulation (PCM) format, WAV format, FLAC format, or other lossless audio format.
  • the lossless audio samples can undergo format conversion to obtain the lossy audio samples.
  • for example, the lossless audio samples may be lossily encoded and decoded to obtain the lossy audio samples, where the bit rate may include, but is not limited to, 128 kbps, 192 kbps, and the like, and the encoding method may include, but is not limited to, OGG, MP3, AAC, and the like.
  • the sampling frequency and the number of quantization bits can be kept unchanged when the lossless audio samples are converted into the lossy audio samples.
  • the sampling frequency and the number of quantization bits of both the lossless audio sample and the transformed lossy audio sample can be the same.
  • a typical scenario in which lossy audio samples are obtained from lossless audio samples may include, but is not limited to, transcoding FLAC-format music with a sampling frequency of 44.1 kHz into MP3-format music with a bit rate of 128 kbps and a sampling frequency of 44.1 kHz.
  • of course, this is merely exemplary; the transformation of lossless audio samples into lossy audio samples may also take other forms, which can be adapted to the actual application scenario.
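In practice the training pairs above would be produced with a real codec (for example, transcoding FLAC to MP3 with a tool such as ffmpeg). The self-contained sketch below instead simulates the lossy transform by discarding high-frequency spectral content, a crude stand-in that is not the application's method but preserves the key property the text requires: the sampling frequency and quantization are left unchanged. The 16 kHz cutoff is an assumption.

```python
import numpy as np

def simulate_lossy(lossless, sr=44100, cutoff_hz=16000):
    """Crude stand-in for lossy transcoding: zero all spectral content
    above cutoff_hz, keeping sample rate and quantization unchanged.
    A real pipeline would use a codec such as MP3 instead."""
    spectrum = np.fft.rfft(lossless)
    freqs = np.fft.rfftfreq(len(lossless), d=1 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(lossless))

sr = 44100
t = np.arange(sr) / sr
# "Lossless" sample: a 1 kHz tone plus an 18 kHz component that the
# simulated codec will discard.
lossless = np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 18000 * t)
lossy = simulate_lossy(lossless, sr)
```

The (lossy, lossless) pair then plays the role of one training sample pair described above.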
  • the manner of performing feature extraction on each of the lossless audio sample and the lossy audio sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the lossless audio samples and the lossy audio samples may include their respective frequency-domain amplitude and/or energy information.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the lossless audio samples and the lossy audio samples may also include any other features that can characterize their respective features.
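The short-time-Fourier-transform features named above (frequency-domain magnitude, optionally spectral phase) can be sketched per frame as follows. The Hann window is an assumption of this sketch; the application only names the short-time Fourier transform itself.

```python
import numpy as np

def stft_features(frames):
    """Per-frame STFT features: frequency-domain magnitude and, here,
    spectral phase as well."""
    window = np.hanning(frames.shape[1])   # window choice is an assumption
    spectra = np.fft.rfft(frames * window, axis=1)
    magnitude = np.abs(spectra)            # frequency-domain amplitude
    phase = np.angle(spectra)              # spectral phase information
    return magnitude, phase

# Random frames stand in for framed audio samples.
frames = np.random.default_rng(0).standard_normal((4, 1024))
mag, phase = stft_features(frames)
```

Applied to both the lossy and the lossless samples, this yields the paired feature sets used for training.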
  • framing may be performed on each of the lossless audio samples and the lossy audio samples, and the foregoing feature extraction may be performed frame by frame on the respective audio samples obtained after framing.
  • part of the data can be selected for feature extraction for each frame of lossy / lossless audio samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before the aforementioned framing, the lossless audio samples and the lossy audio samples may each be decoded, and the framing may be performed on the respective time-domain waveform data obtained after decoding. Therefore, the lossless audio samples and the lossy audio samples can each be sequentially decoded, framed, and feature-extracted in order to efficiently extract their respective representative features.
  • the features of one or more frames of lossy audio samples can be used as the input of the input layer of the audio reconstruction neural network, and the features of one or more frames of lossless audio samples can be used as the target of the output layer of the audio reconstruction neural network, so as to train a neural network regressor as the audio reconstruction neural network used by the audio reconstruction module 420.
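The regressor training described above (lossy-frame features in, lossless-frame features as the output-layer target) can be sketched with a minimal one-hidden-layer network in numpy. This is a stand-in for the deep reconstruction network, not the application's architecture: the synthetic feature pairs, layer sizes, learning rate, and iteration count are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy feature pairs standing in for per-frame magnitude spectra:
# X are "lossy" features, Y the corresponding "lossless" targets.
# The linear relation below is synthetic, purely for demonstration.
n_bins = 32
X = rng.standard_normal((256, n_bins))
true_map = np.eye(n_bins) + 0.1 * rng.standard_normal((n_bins, n_bins))
Y = X @ true_map

# One-hidden-layer regressor (layer sizes and learning rate assumed).
hidden = 64
W1 = rng.standard_normal((n_bins, hidden)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, n_bins)) * 0.1
b2 = np.zeros(n_bins)
lr = 0.05

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)    # ReLU hidden layer
    return h, h @ W2 + b2

losses = []
for _ in range(300):
    h, pred = forward(X)
    err = pred - Y                      # output-layer target: lossless features
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the squared-error loss (per-sample scaling).
    g_out = 2 * err / len(X)
    gW2 = h.T @ g_out
    gb2 = g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * (h > 0)
    gW1 = X.T @ g_h
    gb1 = g_h.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

In a real system the same input/target pairing would drive a much deeper network trained with a standard deep learning framework.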
  • the reconstruction module of the audio reconstruction module 420 may reconstruct the features of the lossy audio data into reconstructed audio features. Since the reconstructed audio features are frequency-domain features, the generation module of the audio reconstruction module 420 may generate a time-domain audio waveform output based on the reconstructed audio features. Exemplarily, the generation module may transform the reconstructed audio features into a time-domain audio waveform by an inverse Fourier transform. The output audio waveform can be stored or buffered for playback, providing users with a better, enhanced sound quality experience.
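The generation step (inverse Fourier transform of the frequency-domain features back to a time-domain waveform) can be sketched with per-frame inverse FFT plus overlap-add. The Hann analysis window with 50% overlap, which sums approximately to unity, is an assumption of this sketch.

```python
import numpy as np

def generate_waveform(spectra, hop=512):
    """Generation-module sketch: inverse FFT per frame, then overlap-add
    into a time-domain waveform."""
    frames = np.fft.irfft(spectra, axis=1)
    frame_len = frames.shape[1]
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out

# Round trip: window -> FFT (playing the role of the "reconstructed
# audio features") -> inverse FFT and overlap-add.
frame_len, hop = 1024, 512
signal = np.sin(2 * np.pi * 440 * np.arange(4 * frame_len) / 44100)
window = np.hanning(frame_len)
starts = range(0, len(signal) - frame_len + 1, hop)
frames = np.stack([signal[s:s + frame_len] * window for s in starts])
wave = generate_waveform(np.fft.rfft(frames, axis=1), hop)
```

Away from the edges, the overlap-added output closely matches the original signal, which is what makes this synthesis path usable after the network has modified the magnitudes.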
  • the deep-learning-based audio sound quality enhancement device enhances lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain.
  • the deep-learning-based device can be conveniently deployed on a server side or a client side, and can efficiently enhance audio sound quality.
  • FIG. 5 shows a schematic block diagram of an audio sound quality enhancement system 500 based on deep learning according to an embodiment of the present invention.
  • the deep learning-based audio sound quality enhancement system 500 includes a storage device 510 and a processor 520.
  • the storage device 510 stores a program for implementing corresponding steps in the method for enhancing audio sound quality based on deep learning according to an embodiment of the present invention.
  • the processor 520 is configured to run the program stored in the storage device 510 to execute the corresponding steps of the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention, and to implement the corresponding modules in the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention.
  • when the program is run by the processor 520, the deep-learning-based audio sound quality enhancement system 500 is caused to perform the following steps: acquiring lossy audio data, and performing feature extraction on the lossy audio data to obtain the features of the lossy audio data; and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data with a sound quality close to that of lossless audio.
  • the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples to obtain the features of the lossy audio samples and the features of the lossless audio samples respectively; and using the obtained features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
  • the lossless audio samples undergo format conversion to obtain the lossy audio samples.
  • the sampling frequency and the number of quantization bits of the lossless audio sample and the lossy audio sample are the same.
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
  • the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
  • the reconstruction of the lossy audio data into the output audio data using the trained audio reconstruction neural network includes: taking the features of the lossy audio data as the input of the trained audio reconstruction neural network, and outputting reconstructed audio features from the trained audio reconstruction neural network; and generating a time-domain audio waveform as the output audio data based on the reconstructed audio features.
  • a storage medium is provided, on which program instructions are stored; when the program instructions are run by a computer or a processor, they are used to execute the corresponding steps of the deep-learning-based audio sound quality enhancement method of the embodiments of the present invention, and to implement the corresponding modules in the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention.
  • the storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the above storage media.
  • the computer-readable storage medium may be any combination of one or more computer-readable storage media.
  • the computer program instructions, when run by a computer, may implement each functional module of the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention, and/or may execute the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention.
  • the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the following steps: acquiring lossy audio data, and performing feature extraction on the lossy audio data to obtain the features of the lossy audio data; and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data with a sound quality close to that of lossless audio.
  • the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples to obtain the features of the lossy audio samples and the features of the lossless audio samples respectively; and using the obtained features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
  • the lossless audio samples undergo format conversion to obtain the lossy audio samples.
  • the sampling frequency and the number of quantization bits of the lossless audio sample and the lossy audio sample are the same.
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
  • the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
  • the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the reconstruction of the lossy audio data into the output audio data using the trained audio reconstruction neural network, which includes: taking the features of the lossy audio data as an input to the trained audio reconstruction neural network, and outputting reconstructed audio features from the trained audio reconstruction neural network; and generating a time-domain audio waveform as the output audio data based on the reconstructed audio features.
  • each module in the deep-learning-based audio sound quality enhancement device may be implemented by a processor of an electronic device for deep-learning-based audio sound quality enhancement running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
  • a computer program is also provided, which may be stored on a cloud or local storage medium; when run by a computer or processor, the computer program is used to execute the corresponding steps of the deep-learning-based audio sound quality enhancement method of the embodiments of the present invention, and to implement the corresponding modules in the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention.
  • the method, device, system, storage medium, and computer program for enhancing audio sound quality based on deep learning enhance lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain.
  • the method, device, system, storage medium, and computer program for enhancing audio sound quality based on deep learning according to the embodiments of the present invention can be conveniently deployed on a server side or a client side, and can efficiently enhance audio sound quality.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some modules according to embodiments of the present invention.
  • the invention may also be implemented as a device program (e.g., a computer program and a computer program product) for performing part or all of the method described herein.
  • a program that implements the present invention may be stored on a computer-readable medium or may have the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A deep-learning-based audio sound quality enhancement method, comprising: acquiring lossy audio data and performing feature extraction on the lossy audio data to obtain features of the lossy audio data (S210); and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio (S220). A deep-learning-based audio sound quality enhancement device and system are also provided.

Description

Deep-Learning-Based Audio Sound Quality Enhancement
Specification
Technical Field
The present invention relates to the technical field of sound quality optimization, and more particularly to a deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program.
Background Art
Sound quality generally refers to a person's subjective evaluation of audio quality. Many factors affect audio sound quality; for encoded audio, a decisive factor is the degree of encoding compression. Recorded raw sound may be stored as a linear pulse code modulation (LPCM) file, a pulse sequence obtained by analog-to-digital conversion, which in audio theory is the digital audio that most faithfully reproduces the original scene and is a high-bit-rate lossless format. However, due to limitations of communication bandwidth and storage space, audio stored on digital devices is generally encoded and compressed. Lossless compression, such as lossless audio coding formats like FLAC and APE, preserves the original lossless file information. Lossy compression, such as MP3 and Advanced Audio Coding (AAC), has been more widely applied because it greatly reduces the bit rate and saves transmission and storage resources. However, although lossy compression methods preserve the basic sound quality of audio to a certain extent, the sound quality is still inferior to that of lossless audio.
As people's demands on sound quality grow, the sound quality of lossy audio formats can no longer satisfy them. Therefore, under limited storage and bandwidth resources, reconstructing lossy audio through digital signal processing so that its sound quality approaches that of the pre-encoding lossless audio is a valuable research direction. On the other hand, in audio transmission and communication scenarios, if lower-bit-rate audio can be reconstructed by an algorithm so that its sound quality approaches the level of lossless audio, this is also of great significance for saving bandwidth resources. However, there is currently no feasible software-based scheme for lossy audio reconstruction; lossy audio reconstruction usually adopts data padding or interpolation, but such methods are too coarse and essentially cannot restore the sound quality of lossless audio.
Summary of the Invention
The present invention is proposed to solve at least one of the above problems. The present invention proposes a scheme for deep-learning-based audio sound quality enhancement, which enhances lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain. The scheme for deep-learning-based audio sound quality enhancement proposed by the present invention is briefly described below; more details will be described in the detailed description with reference to the drawings.
According to one aspect of the present invention, a deep-learning-based audio sound quality enhancement method is provided, the method comprising: acquiring lossy audio data and performing feature extraction on the lossy audio data to obtain features of the lossy audio data; and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In an embodiment of the present invention, the training of the audio reconstruction neural network comprises: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
In an embodiment of the present invention, the lossless audio samples undergo format conversion to obtain the lossy audio samples.
In an embodiment of the present invention, the sampling frequency and the number of quantization bits of the lossless audio samples and the lossy audio samples are the same.
In an embodiment of the present invention, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In an embodiment of the present invention, the features obtained by the feature extraction further include spectral phase information.
In an embodiment of the present invention, the manner of feature extraction includes a short-time Fourier transform.
In an embodiment of the present invention, the training of the audio reconstruction neural network further comprises: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
In an embodiment of the present invention, the training of the audio reconstruction neural network further comprises: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
In an embodiment of the present invention, reconstructing the lossy audio data into the output audio data using the trained audio reconstruction neural network comprises: taking the features of the lossy audio data as input to the trained audio reconstruction neural network, and outputting reconstructed audio features from the trained audio reconstruction neural network; and generating a time-domain audio waveform as the output audio data based on the reconstructed audio features.
According to another aspect of the present invention, a deep-learning-based audio sound quality enhancement device is provided, the device comprising: a feature extraction module configured to acquire lossy audio data and perform feature extraction on the lossy audio data to obtain features of the lossy audio data; and an audio reconstruction module configured to, based on the features of the lossy audio data extracted by the feature extraction module, use a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In an embodiment of the present invention, the training of the audio reconstruction neural network comprises: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
In an embodiment of the present invention, the lossless audio samples undergo format conversion to obtain the lossy audio samples.
In an embodiment of the present invention, the sampling frequency and the number of quantization bits of the lossless audio samples and the lossy audio samples are the same.
In an embodiment of the present invention, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In an embodiment of the present invention, the features obtained by the feature extraction further include spectral phase information.
In an embodiment of the present invention, the manner of feature extraction includes a short-time Fourier transform.
In an embodiment of the present invention, the training of the audio reconstruction neural network further comprises: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
In an embodiment of the present invention, the training of the audio reconstruction neural network further comprises: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
In an embodiment of the present invention, the audio reconstruction module further comprises: a reconstruction module configured to take the features of the lossy audio data as input to the trained audio reconstruction neural network and to output reconstructed audio features from the trained audio reconstruction neural network; and a generation module configured to generate a time-domain audio waveform as the output audio data based on the reconstructed audio features output by the reconstruction module.
According to yet another aspect of the present invention, a deep-learning-based audio sound quality enhancement system is provided, the system comprising a storage device and a processor, the storage device storing a computer program to be run by the processor, the computer program, when run by the processor, executing the deep-learning-based audio sound quality enhancement method of any one of the above.
According to still another aspect of the present invention, a storage medium is provided, on which a computer program is stored, the computer program, when run, executing the deep-learning-based audio sound quality enhancement method of any one of the above.
According to yet another aspect of the present invention, a computer program is provided, which, when run by a computer or processor, is used to execute the deep-learning-based audio sound quality enhancement method of any one of the above, and is further used to implement the modules in the deep-learning-based audio sound quality enhancement device of any one of the above.
The deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program according to the embodiments of the present invention enhance lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain. In addition, the deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program according to the embodiments of the present invention can be conveniently deployed on a server side or a client side, and can efficiently enhance audio sound quality.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present invention will become more apparent through a more detailed description of embodiments of the present invention in conjunction with the drawings. The drawings are used to provide further understanding of the embodiments of the present invention, constitute part of the specification, and together with the embodiments serve to explain the present invention without limiting it. In the drawings, the same reference numerals generally represent the same components or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing the deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program according to embodiments of the present invention;
FIG. 2 shows a schematic flowchart of a deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of the training of an audio reconstruction neural network according to an embodiment of the present invention;
FIG. 4 shows a schematic block diagram of a deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention; and
FIG. 5 shows a schematic block diagram of a deep-learning-based audio sound quality enhancement system according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, example embodiments according to the present invention will be described in detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the present invention described herein, all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present invention.
First, an example electronic device 100 for implementing the deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program of embodiments of the present invention is described with reference to FIG. 1.
As shown in FIG. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, and an output device 108, which are interconnected through a bus system 110 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structure of the electronic device 100 shown in FIG. 1 are merely exemplary rather than limiting; the electronic device may have other components and structures as needed.
The processor 102 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) and/or other desired functions in the embodiments of the present invention described below. Various applications and various data, such as data used and/or produced by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. In addition, the input device 106 may also be any interface that receives information.
The output device 108 may output various information (such as images or sounds) to the outside (for example, a user) and may include one or more of a display, a speaker, and the like. In addition, the output device 108 may also be any other device with an output function.
Exemplarily, the example electronic device for implementing the deep-learning-based audio sound quality enhancement method, device, system, storage medium, and computer program according to embodiments of the present invention may be implemented as a terminal such as a smartphone or a tablet computer.
Below, a deep-learning-based audio sound quality enhancement method 200 according to an embodiment of the present invention is described with reference to FIG. 2. As shown in FIG. 2, the deep-learning-based audio sound quality enhancement method 200 may include the following steps:
In step S210, lossy audio data is acquired, and feature extraction is performed on the lossy audio data to obtain features of the lossy audio data.
In one embodiment, the audio data acquired in step S210 may be lossy audio data received, stored, or played in an audio storage/playback device that requires sound quality enhancement, including but not limited to: audio the user is playing, audio in a playlist, or audio files stored in the cloud or on a client. Exemplarily, the lossy audio data may include, but is not limited to, audio data such as music in MP3, AAC, OGG, and other formats. In other examples, the audio data acquired in step S210 may also be any data requiring sound quality enhancement, such as audio data included in video data. In addition, the audio data acquired in step S210 may come from an offline-stored file or from an online-played file.
In one embodiment, the manner of performing feature extraction on the acquired lossy audio data may include, but is not limited to, the short-time Fourier transform (STFT). Exemplarily, the features of the lossy audio data obtained by the feature extraction may include frequency-domain amplitude and/or energy information. Exemplarily, the features may further include spectral phase information. Exemplarily, the features may also be time-domain features. In other examples, the features may further include any other features that can characterize the lossy audio data.
In one embodiment, before feature extraction is performed on the lossy audio data, framing may first be performed on it, and the aforementioned feature extraction may be performed frame by frame on the audio data obtained after framing; this situation may apply when the lossy audio data acquired in step S210 comes from an offline-stored file or from a complete file from any source. In another embodiment, if the lossy audio data acquired in step S210 comes from an online-played file, one or more frames of lossy audio data may be buffered before feature extraction. Exemplarily, for each frame of lossy audio data obtained after framing or buffering, part of the data may be selected for feature extraction, which can effectively reduce the data amount and improve processing efficiency.
In yet another embodiment, before the aforementioned framing of the lossy audio data, the lossy audio data may first be decoded, and the aforementioned framing may be performed on the time-domain waveform data obtained after decoding. This is because the acquired lossy audio data is generally in encoded form; to obtain its complete time-domain audio information, it may be decoded first. Therefore, in one example, the acquired lossy audio data may be sequentially decoded, framed, and feature-extracted, so as to efficiently extract well-representative features.
Now continuing with FIG. 2, the subsequent steps of the deep-learning-based audio sound quality enhancement method 200 according to an embodiment of the present invention are described.
In step S220, based on the features of the lossy audio data, a trained audio reconstruction neural network is used to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In an embodiment of the present invention, the features of the lossy audio data extracted in step S210 are input to the trained audio reconstruction neural network, which reconstructs the input features into reconstructed audio features; the reconstructed audio features may be used to generate output audio data with sound quality better than that of the acquired lossy audio data and close to lossless audio quality. Therefore, the sound quality enhancement method of the present invention can, based on deep learning, accurately supplement the audio information lost in lossy audio, which not only efficiently achieves a great improvement of lossy audio sound quality, but also does not compromise communication bandwidth (because what is transmitted is still lossy audio data with a small data amount, and the lossy audio data can be reconstructed at the receiving end into data whose sound quality is close to that of lossless audio).
It should be noted here that describing the output audio data as having sound quality close to that of lossless audio is a rigorous formulation. Those skilled in the art can understand that the gist of the present invention is to reconstruct lossy audio into lossless audio based on a deep learning method; however, due to practical implementation issues, the sound quality of lossless audio may not be fully achieved, and it is therefore described as sound quality close to that of lossless audio. Such a description is clear to those skilled in the art. In addition, the training of the audio reconstruction neural network described below and its application further help those skilled in the art understand the sound quality of the final output audio data.
The training process of the above audio reconstruction neural network according to an embodiment of the present invention is described below with reference to FIG. 3. As shown in FIG. 3, the training of the audio reconstruction neural network according to an embodiment of the present invention may include the following process:
In S310, lossless audio samples and lossy audio samples are obtained, wherein the lossy audio samples are obtained from the lossless audio samples by transformation.
In one example, the lossless audio samples may be lossless audio data in pulse code modulation (PCM) format, WAV format, FLAC format, or another lossless audio format. In one example, the lossless audio samples may undergo format conversion to obtain the lossy audio samples. For example, the lossless audio samples may be lossily encoded and decoded to obtain the lossy audio samples, where the bit rate may include, but is not limited to, 128 kbps, 192 kbps, and the like, and the encoding method may include, but is not limited to, OGG, MP3, AAC, and the like. In one example, the sampling frequency and the number of quantization bits may remain unchanged when the lossless audio samples are transformed into the lossy audio samples; that is, the sampling frequency and the number of quantization bits of the lossless audio samples and the transformed lossy audio samples may be the same. Exemplarily, a typical scenario of transforming lossless audio samples into lossy audio samples may include, but is not limited to, transcoding FLAC-format music with a sampling frequency of 44.1 kHz into MP3-format music with a bit rate of 128 kbps and a sampling frequency of 44.1 kHz. Of course, this is merely exemplary; the transformation may also take other forms, which can be adapted to the actual application scenario.
Continuing with FIG. 3, in S320, feature extraction is performed on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples.
Similar to what was described above for step S210, in one embodiment, the manner of performing feature extraction on each of the lossless and lossy audio samples may include, but is not limited to, the short-time Fourier transform. Exemplarily, the features obtained may include their respective frequency-domain amplitude and/or energy information. Exemplarily, the features may further include their respective spectral phase information. Exemplarily, the features may also be their respective time-domain features. In other examples, the features may further include any other features that can characterize them.
Still similar to what was described above for step S210, in one embodiment, before feature extraction is performed on the lossless and lossy audio samples, framing may first be performed on each of them, and the aforementioned feature extraction may be performed frame by frame on the respective audio samples obtained after framing. Exemplarily, for each frame of lossy/lossless audio samples, part of the data may be selected for feature extraction, which can effectively reduce the data amount and improve processing efficiency.
In yet another embodiment, before the aforementioned framing of the lossless and lossy audio samples, each of them may first be decoded, and the framing may be performed on the respective time-domain waveform data obtained after decoding. Therefore, the lossless and lossy audio samples may each be sequentially decoded, framed, and feature-extracted, so as to efficiently extract their respective well-representative features.
In S330, the obtained features of the lossy audio samples are used as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples are used as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
In one embodiment, the features of one or more frames of lossy audio samples may be used as input to the input layer of the audio reconstruction neural network, and the features of one or more frames of lossless audio samples may be used as the target of the output layer, thereby training a neural network regressor as the audio reconstruction neural network used in step S220.
The training process of the audio reconstruction neural network according to an embodiment of the present invention has been exemplarily described above with reference to FIG. 3. Now continuing with FIG. 2: as described above, in step S220, based on the trained audio reconstruction neural network, the features of the lossy audio data can be reconstructed into reconstructed audio features; since the reconstructed audio features are frequency-domain features, a time-domain audio waveform output can be generated based on them. Exemplarily, the reconstructed audio features may be transformed into a time-domain audio waveform by an inverse Fourier transform. The output audio waveform may be stored or buffered for playback, providing the user with a better, enhanced sound quality experience.
Based on the above description, the deep-learning-based audio sound quality enhancement method according to embodiments of the present invention enhances lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain.
The deep-learning-based audio sound quality enhancement method according to embodiments of the present invention has been exemplarily described above. Exemplarily, the method may be implemented in a device, apparatus, or system having a memory and a processor.
In addition, the deep-learning-based audio sound quality enhancement method according to embodiments of the present invention can be conveniently deployed on mobile devices such as smartphones, tablet computers, personal computers, earphones, and speakers. Alternatively, the method may be deployed on a server side (or cloud), or deployed in a distributed manner across a server side (or cloud) and a personal terminal.
A typical application scenario of the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention may include, but is not limited to, taking MP3-format music with a bit rate of 128 kbps and a sampling frequency of 44.1 kHz as input, and reconstructing it via the audio reconstruction neural network according to an embodiment of the present invention into music of quality close to that of FLAC-format music at a 44.1 kHz sampling frequency. Of course, this is merely an exemplary typical application scenario; the deep-learning-based audio sound quality enhancement method according to embodiments of the present invention may also be applied to any scenario requiring sound quality enhancement.
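An inference pipeline of the kind just described (decoded lossy waveform in, enhanced waveform out) can be sketched end to end as follows. The trained network is replaced here by an identity placeholder, and the frame size, window, and reuse of the lossy phase for synthesis are all assumptions of this sketch, not details fixed by the application.

```python
import numpy as np

def enhance(waveform, model, frame_len=1024, hop=512):
    """End-to-end inference sketch: frame the decoded lossy waveform,
    take per-frame STFT magnitudes, map them through the reconstruction
    model, and synthesize a time-domain waveform by inverse FFT and
    overlap-add. The lossy phase is reused for synthesis (an assumption)."""
    window = np.hanning(frame_len)
    starts = range(0, len(waveform) - frame_len + 1, hop)
    frames = np.stack([waveform[s:s + frame_len] * window for s in starts])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)
    enhanced = model(magnitude) * np.exp(1j * phase)
    out_frames = np.fft.irfft(enhanced, axis=1)
    out = np.zeros((len(out_frames) - 1) * hop + frame_len)
    for i, f in enumerate(out_frames):
        out[i * hop : i * hop + frame_len] += f
    return out

# With an identity "model" the pipeline is a plain analysis/synthesis
# round trip; a trained reconstruction network would be substituted here.
x = np.sin(2 * np.pi * 440 * np.arange(8192) / 44100)
y = enhance(x, model=lambda m: m)
```

This mirrors the deployment scenario above: the only component that changes between server-side and client-side deployment is where the `model` callable is evaluated.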
A deep-learning-based audio sound quality enhancement device provided by another aspect of the present invention is described below with reference to FIG. 4. FIG. 4 shows a schematic block diagram of a deep-learning-based audio sound quality enhancement device 400 according to an embodiment of the present invention.
As shown in FIG. 4, the deep-learning-based audio sound quality enhancement device 400 according to an embodiment of the present invention includes a feature extraction module 410 and an audio reconstruction module 420. The modules may respectively perform the steps/functions of the deep-learning-based audio sound quality enhancement method described above with reference to FIG. 2. Only the main functions of the modules of the device 400 are described below; details already described above are omitted.
The feature extraction module 410 is configured to acquire lossy audio data and perform feature extraction on the lossy audio data to obtain features of the lossy audio data. The audio reconstruction module 420 is configured to, based on the features of the lossy audio data extracted by the feature extraction module, use a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio. Both the feature extraction module 410 and the audio reconstruction module 420 may be implemented by the processor 102 of the electronic device shown in FIG. 1 running program instructions stored in the storage device 104.
In one embodiment, the audio data acquired by the feature extraction module 410 may be lossy audio data received, stored, or played in an audio storage/playback device that requires sound quality enhancement, including but not limited to: audio the user is playing, audio in a playlist, or audio files stored in the cloud or on a client. Exemplarily, the lossy audio data may include, but is not limited to, audio data such as music in MP3, AAC, OGG, and other formats. In other examples, the audio data acquired by the feature extraction module 410 may also be any data requiring sound quality enhancement, such as audio data included in video data. In addition, the audio data acquired by the feature extraction module 410 may come from an offline-stored file or from an online-played file.
In one embodiment, the manner in which the feature extraction module 410 performs feature extraction on the acquired lossy audio data may include, but is not limited to, the short-time Fourier transform (STFT). Exemplarily, the features obtained may include frequency-domain amplitude and/or energy information. Exemplarily, the features may further include spectral phase information. Exemplarily, the features may also be time-domain features. In other examples, the features may further include any other features that can characterize the lossy audio data.
In one embodiment, before the feature extraction module 410 performs feature extraction on the lossy audio data, framing may first be performed on it, and the aforementioned feature extraction may be performed frame by frame on the audio data obtained after framing; this situation may apply when the lossy audio data acquired by the feature extraction module 410 comes from an offline-stored file or from a complete file from any source. In another embodiment, if the lossy audio data acquired by the feature extraction module 410 comes from an online-played file, one or more frames of lossy audio data may be buffered before feature extraction. Exemplarily, the feature extraction module 410 may, for each frame of lossy audio data obtained after framing or buffering, select part of the data for feature extraction, which can effectively reduce the data amount and improve processing efficiency.
In yet another embodiment, before the feature extraction module 410 performs the aforementioned framing on the lossy audio data, the lossy audio data may first be decoded, for example by a decoding module (not shown in FIG. 4) included therein, and the aforementioned framing may be performed on the time-domain waveform data obtained after decoding. This is because the lossy audio data acquired by the feature extraction module 410 is generally in encoded form; to obtain its complete time-domain audio information, it may be decoded first. Therefore, in one example, the feature extraction module 410 may sequentially decode, frame, and feature-extract the acquired lossy audio data, so as to efficiently extract well-representative features.
Based on the features of the lossy audio data extracted by the feature extraction module 410, the audio reconstruction module 420 may use the trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In an embodiment of the present invention, the audio reconstruction module 420 may further include a reconstruction module (not shown in FIG. 4) and a generation module (not shown in FIG. 4). The reconstruction module may include a trained audio reconstruction neural network, which takes the features of the lossy audio data extracted by the feature extraction module 410 as input and reconstructs the input features to obtain reconstructed audio features. The generation module generates, based on the reconstructed audio features output by the reconstruction module, output audio data with sound quality better than that of the acquired lossy audio data and close to lossless audio quality. Therefore, the sound quality enhancement device of the present invention can, based on deep learning, accurately supplement the audio information lost in lossy audio, which not only efficiently achieves a great improvement of lossy audio sound quality, but also does not compromise communication bandwidth (because what is transmitted is still lossy audio data with a small data amount, and the lossy audio data can be reconstructed at the receiving end into data whose sound quality is close to that of lossless audio).
In an embodiment of the present invention, the training of the audio reconstruction neural network used by the audio reconstruction module 420 may include: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer, to train the audio reconstruction neural network. The training process of the audio reconstruction neural network used by the audio reconstruction module 420 of the deep-learning-based audio sound quality enhancement device 400 can be understood with reference to FIG. 3 and the description of FIG. 3 above; for brevity, excessive detail is not repeated here.
In one example, the lossless audio samples may be lossless audio data in pulse code modulation (PCM) format, WAV format, FLAC format, or another lossless audio format. In one example, the lossless audio samples may undergo format conversion to obtain the lossy audio samples. For example, the lossless audio samples may be lossily encoded and decoded to obtain the lossy audio samples, where the bit rate may include, but is not limited to, 128 kbps, 192 kbps, and the like, and the encoding method may include, but is not limited to, OGG, MP3, AAC, and the like. In one example, the sampling frequency and the number of quantization bits may remain unchanged when the lossless audio samples are transformed into the lossy audio samples; that is, the sampling frequency and the number of quantization bits of the lossless audio samples and the transformed lossy audio samples may be the same. Exemplarily, a typical scenario of transforming lossless audio samples into lossy audio samples may include, but is not limited to, transcoding FLAC-format music with a sampling frequency of 44.1 kHz into MP3-format music with a bit rate of 128 kbps and a sampling frequency of 44.1 kHz. Of course, this is merely exemplary; the transformation may also take other forms, which can be adapted to the actual application scenario.
In one embodiment, the manner of performing feature extraction on each of the lossless and lossy audio samples may include, but is not limited to, the short-time Fourier transform. Exemplarily, the features obtained may include their respective frequency-domain amplitude and/or energy information. Exemplarily, the features may further include their respective spectral phase information. Exemplarily, the features may also be their respective time-domain features. In other examples, the features may further include any other features that can characterize them.
In one embodiment, before feature extraction is performed on the lossless and lossy audio samples, framing may first be performed on each of them, and the feature extraction may be performed frame by frame on the respective audio samples obtained after framing. Exemplarily, for each frame of lossy/lossless audio samples, part of the data may be selected for feature extraction, which can effectively reduce the data amount and improve processing efficiency.
In yet another embodiment, before the aforementioned framing of the lossless and lossy audio samples, each of them may first be decoded, and the framing may be performed on the respective time-domain waveform data obtained after decoding. Therefore, the lossless and lossy audio samples may each be sequentially decoded, framed, and feature-extracted, so as to efficiently extract their respective well-representative features.
In one embodiment, the features of one or more frames of lossy audio samples may be used as input to the input layer of the audio reconstruction neural network, and the features of one or more frames of lossless audio samples may be used as the target of the output layer, thereby training a neural network regressor as the audio reconstruction neural network used by the audio reconstruction module 420.
Based on the trained audio reconstruction neural network, the reconstruction module of the audio reconstruction module 420 may reconstruct the features of the lossy audio data into reconstructed audio features. Since the reconstructed audio features are frequency-domain features, the generation module of the audio reconstruction module 420 may generate a time-domain audio waveform output based on them. Exemplarily, the generation module may transform the reconstructed audio features into a time-domain audio waveform by an inverse Fourier transform. The output audio waveform may be stored or buffered for playback, providing the user with a better, enhanced sound quality experience.
Based on the above description, the deep-learning-based audio sound quality enhancement device according to embodiments of the present invention enhances lossy audio sound quality based on a deep learning method, so that the lossy audio is reconstructed by a deep neural network to a sound quality approaching that of lossless audio, thereby achieving a sound quality improvement that traditional methods cannot attain. In addition, the deep-learning-based device according to embodiments of the present invention can be conveniently deployed on a server side or a client side, and can efficiently enhance audio sound quality.
FIG. 5 shows a schematic block diagram of a deep-learning-based audio sound quality enhancement system 500 according to an embodiment of the present invention. The deep-learning-based audio sound quality enhancement system 500 includes a storage device 510 and a processor 520.
The storage device 510 stores a program for implementing the corresponding steps of the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention. The processor 520 is configured to run the program stored in the storage device 510 to execute the corresponding steps of the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention, and to implement the corresponding modules in the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention.
In one embodiment, when the program is run by the processor 520, the deep-learning-based audio sound quality enhancement system 500 is caused to perform the following steps: acquiring lossy audio data and performing feature extraction on the lossy audio data to obtain features of the lossy audio data; and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In one embodiment, the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
In one embodiment, the lossless audio samples undergo format conversion to obtain the lossy audio samples.
In one embodiment, the sampling frequency and the number of quantization bits of the lossless audio samples and the lossy audio samples are the same.
In one embodiment, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In one embodiment, the features obtained by the feature extraction further include spectral phase information.
In one embodiment, the manner of feature extraction includes a short-time Fourier transform.
In one embodiment, the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
In one embodiment, the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
In one embodiment, the reconstruction of the lossy audio data into the output audio data using the trained audio reconstruction neural network, performed by the deep-learning-based audio sound quality enhancement system 500 when the program is run by the processor 520, includes: taking the features of the lossy audio data as input to the trained audio reconstruction neural network, and outputting reconstructed audio features from the trained audio reconstruction neural network; and generating a time-domain audio waveform as the output audio data based on the reconstructed audio features.
In addition, according to an embodiment of the present invention, a storage medium is provided, on which program instructions are stored; when the program instructions are run by a computer or processor, they are used to execute the corresponding steps of the deep-learning-based audio sound quality enhancement method of the embodiments of the present invention, and to implement the corresponding modules in the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions, when run by a computer, may implement the functional modules of the deep-learning-based audio sound quality enhancement device according to an embodiment of the present invention, and/or may execute the deep-learning-based audio sound quality enhancement method according to an embodiment of the present invention.
In one embodiment, the computer program instructions, when run by a computer or processor, cause the computer or processor to perform the following steps: acquiring lossy audio data and performing feature extraction on the lossy audio data to obtain features of the lossy audio data; and, based on the features of the lossy audio data, using a trained audio reconstruction neural network to reconstruct the lossy audio data into output audio data whose sound quality is close to that of lossless audio.
In one embodiment, the training of the audio reconstruction neural network includes: obtaining lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by transformation; performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and using the obtained features of the lossy audio samples as input to the input layer of the audio reconstruction neural network, and the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, to train the audio reconstruction neural network.
In one embodiment, the lossless audio samples undergo format conversion to obtain the lossy audio samples.
In one embodiment, the sampling frequency and the number of quantization bits of the lossless audio samples and the lossy audio samples are the same.
In one embodiment, the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
In one embodiment, the features obtained by the feature extraction further include spectral phase information.
In one embodiment, the manner of feature extraction includes a short-time Fourier transform.
In one embodiment, the training of the audio reconstruction neural network further includes: before performing feature extraction on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, and the feature extraction is performed frame by frame on the audio samples obtained after framing.
In one embodiment, the training of the audio reconstruction neural network further includes: before framing the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, and the framing is performed on the time-domain waveform data obtained after decoding.
In one embodiment, the reconstruction of the lossy audio data into the output audio data using the trained audio reconstruction neural network, performed by the computer or processor when the computer program instructions are run, includes: taking the features of the lossy audio data as input to the trained audio reconstruction neural network, and outputting reconstructed audio features from the trained audio reconstruction neural network; and generating a time-domain audio waveform as the output audio data based on the reconstructed audio features.
The modules of the deep-learning-based audio sound quality enhancement apparatus according to the embodiment of the present invention may be implemented by the processor of an electronic device for deep-learning-based audio sound quality enhancement according to the embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in the computer-readable storage medium of a computer program product according to the embodiment of the present invention are run by a computer.
In addition, according to an embodiment of the present invention, a computer program is further provided, which may be stored on a storage medium in the cloud or locally. When the computer program is run by a computer or a processor, it is used to perform the corresponding steps of the deep-learning-based audio sound quality enhancement method of the embodiment of the present invention, and to implement the corresponding modules of the deep-learning-based audio sound quality enhancement apparatus according to the embodiment of the present invention.
The deep-learning-based audio sound quality enhancement method, apparatus, system, storage medium, and computer program according to the embodiments of the present invention enhance the sound quality of lossy audio on the basis of a deep learning method, so that the sound quality of the lossy audio, reconstructed by a deep neural network, approaches that of lossless audio, thereby achieving a sound quality improvement that conventional methods cannot attain. Moreover, the deep-learning-based audio sound quality enhancement method, apparatus, system, storage medium, and computer program according to the embodiments of the present invention can be conveniently deployed on the server side or the client side, and can implement audio sound quality enhancement efficiently.
Although exemplary embodiments have been described herein with reference to the accompanying drawings, it should be understood that the above exemplary embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. A person of ordinary skill in the art may make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.
A person of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another device, or some features may be omitted or not executed.
In the specification provided here, numerous specific details are set forth. However, it can be understood that the embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the present disclosure and to aid in the understanding of one or more of the various inventive aspects, in the description of the exemplary embodiments of the present invention, the various features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the method of the present disclosure should not be interpreted as reflecting the intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point lies in that the corresponding technical problem may be solved with fewer than all of the features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.
A person skilled in the art can understand that, except where such features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, a person skilled in the art can understand that, although some embodiments described herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. A person skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules according to the embodiments of the present invention. The present invention may also be implemented as an apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that a person skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not indicate any order; these words may be interpreted as names.
The above description is merely the specific embodiments of the present invention or an explanation thereof, and the protection scope of the present invention is not limited thereto. Any variation or substitution readily conceivable by a person familiar with this technical field within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A deep-learning-based audio sound quality enhancement method, characterized in that the method comprises:
    acquiring lossy audio data, and performing feature extraction on the lossy audio data to obtain features of the lossy audio data; and
    based on the features of the lossy audio data, reconstructing the lossy audio data, by means of a trained audio reconstruction neural network, into output audio data whose sound quality approaches that of lossless audio.
  2. The method according to claim 1, characterized in that the training of the audio reconstruction neural network comprises:
    acquiring lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by means of a transformation;
    performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and
    using the obtained features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network, and using the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, so as to train the audio reconstruction neural network.
  3. The method according to claim 2, characterized in that the lossy audio samples are obtained from the lossless audio samples through format conversion.
  4. The method according to claim 3, characterized in that the lossless audio samples and the lossy audio samples have the same sampling frequency and the same quantization bit depth.
  5. The method according to claim 1 or 2, characterized in that the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  6. The method according to claim 5, characterized in that the features obtained by the feature extraction further include spectral phase information.
  7. The method according to claim 6, characterized in that the feature extraction is performed by means including a short-time Fourier transform.
  8. The method according to claim 2, characterized in that the training of the audio reconstruction neural network further comprises:
    before the feature extraction is performed on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, wherein the feature extraction is performed frame by frame on the framed audio samples.
  9. The method according to claim 8, characterized in that the training of the audio reconstruction neural network further comprises:
    before the framing of the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, wherein the framing is performed on the decoded time-domain waveform data.
  10. The method according to claim 1, characterized in that the reconstructing of the lossy audio data into the output audio data by means of the trained audio reconstruction neural network comprises:
    using the features of the lossy audio data as the input of the trained audio reconstruction neural network, and outputting reconstructed audio features by the trained audio reconstruction neural network; and
    generating a time-domain audio waveform based on the reconstructed audio features as the output audio data.
  11. A deep-learning-based audio sound quality enhancement apparatus, characterized in that the apparatus comprises:
    a feature extraction module configured to acquire lossy audio data and perform feature extraction on the lossy audio data to obtain features of the lossy audio data; and
    an audio reconstruction module configured to, based on the features of the lossy audio data extracted by the feature extraction module, reconstruct the lossy audio data, by means of a trained audio reconstruction neural network, into output audio data whose sound quality approaches that of lossless audio.
  12. The apparatus according to claim 11, characterized in that the training of the audio reconstruction neural network comprises:
    acquiring lossless audio samples and lossy audio samples, wherein the lossy audio samples are obtained from the lossless audio samples by means of a transformation;
    performing feature extraction on the lossy audio samples and the lossless audio samples respectively to obtain features of the lossy audio samples and features of the lossless audio samples; and
    using the obtained features of the lossy audio samples as the input of the input layer of the audio reconstruction neural network, and using the obtained features of the lossless audio samples as the target of the output layer of the audio reconstruction neural network, so as to train the audio reconstruction neural network.
  13. The apparatus according to claim 12, characterized in that the lossy audio samples are obtained from the lossless audio samples through format conversion.
  14. The apparatus according to claim 13, characterized in that the lossless audio samples and the lossy audio samples have the same sampling frequency and the same quantization bit depth.
  15. The apparatus according to claim 11 or 12, characterized in that the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  16. The apparatus according to claim 15, characterized in that the features obtained by the feature extraction further include spectral phase information.
  17. The apparatus according to claim 16, characterized in that the feature extraction is performed by means including a short-time Fourier transform.
  18. The apparatus according to claim 12, characterized in that the training of the audio reconstruction neural network further comprises:
    before the feature extraction is performed on the lossy audio samples and the lossless audio samples, framing the lossy audio samples and the lossless audio samples respectively, wherein the feature extraction is performed frame by frame on the framed audio samples.
  19. The apparatus according to claim 18, characterized in that the training of the audio reconstruction neural network further comprises:
    before the framing of the lossy audio samples and the lossless audio samples, decoding the lossy audio samples and the lossless audio samples respectively into time-domain waveform data, wherein the framing is performed on the decoded time-domain waveform data.
  20. The apparatus according to claim 11, characterized in that the audio reconstruction module further comprises:
    a reconstruction module configured to use the features of the lossy audio data as the input of the trained audio reconstruction neural network, the trained audio reconstruction neural network outputting reconstructed audio features; and
    a generation module configured to generate, based on the reconstructed audio features output by the reconstruction module, a time-domain audio waveform as the output audio data.
  21. A deep-learning-based audio sound quality enhancement system, characterized in that the system comprises a storage device and a processor, the storage device storing a computer program to be run by the processor, wherein the computer program, when run by the processor, performs the deep-learning-based audio sound quality enhancement method according to any one of claims 1-10.
PCT/CN2019/089763 2018-06-05 2019-06-03 Audio sound quality enhancement based on deep learning WO2019233364A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810583122.6 2018-06-05
CN201810583122.6A CN109147805B (zh) 2018-06-05 2018-06-05 Audio sound quality enhancement based on deep learning

Publications (1)

Publication Number Publication Date
WO2019233364A1 true WO2019233364A1 (zh) 2019-12-12

Family

ID=64802016

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089763 WO2019233364A1 (zh) 2018-06-05 2019-06-03 Audio sound quality enhancement based on deep learning

Country Status (2)

Country Link
CN (1) CN109147805B (zh)
WO (1) WO2019233364A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147805B (zh) * 2018-06-05 2021-03-02 Anker Innovations Technology Co., Ltd. Audio sound quality enhancement based on deep learning
CN110797038B (zh) * 2019-10-30 2020-11-06 Tencent Technology (Shenzhen) Co., Ltd. Audio processing method and apparatus, computer device, and storage medium
CN111508509A (zh) * 2020-04-02 2020-08-07 Guangdong Jiulian Technology Co., Ltd. Deep-learning-based sound quality processing system and method
CN112820315B (zh) * 2020-07-13 2023-01-06 Tencent Technology (Shenzhen) Co., Ltd. Audio signal processing method and apparatus, computer device, and storage medium
CN111899729B (zh) * 2020-08-17 2023-11-21 Guangzhou Baiguoyuan Information Technology Co., Ltd. Speech model training method and apparatus, server, and storage medium
CN113555034B (zh) * 2021-08-03 2024-03-01 JD Technology Information Technology Co., Ltd. Compressed audio recognition method and apparatus, and storage medium
CN114863942B (zh) * 2022-07-05 2022-10-21 Beijing Bairui Internet Technology Co., Ltd. Model training method for sound quality conversion, and method and apparatus for improving speech sound quality

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101944362A (zh) * 2010-09-14 2011-01-12 Peking University Audio lossless compression encoding and decoding method based on integer wavelet transform
CN107077849A (zh) * 2014-11-07 2017-08-18 Samsung Electronics Co., Ltd. Method and apparatus for restoring an audio signal
CN107112025A (zh) * 2014-09-12 2017-08-29 Knowles Electronics, LLC Systems and methods for restoring speech components
CN109147805A (zh) * 2018-06-05 2019-01-04 Anker Innovations Technology Co., Ltd. Audio sound quality enhancement based on deep learning
CN109147804A (zh) * 2018-06-05 2019-01-04 Anker Innovations Technology Co., Ltd. Deep-learning-based sound quality characteristic processing method and system

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
JP5275102B2 (ja) * 2009-03-25 2013-08-28 Toshiba Corporation Speech synthesis apparatus and speech synthesis method
CN104810022B (zh) * 2015-05-11 2018-06-15 Northeast Normal University Time-domain digital audio watermarking method based on audio breakpoints
CN107895571A (zh) * 2016-09-29 2018-04-10 Yilan Online Network Technology (Beijing) Co., Ltd. Lossless audio file identification method and apparatus


Also Published As

Publication number Publication date
CN109147805B (zh) 2021-03-02
CN109147805A (zh) 2019-01-04

Similar Documents

Publication Publication Date Title
WO2019233364A1 (zh) Audio sound quality enhancement based on deep learning
CN109147806B (zh) Deep-learning-based speech sound quality enhancement method, apparatus, and system
JP7427723B2 (ja) Text-to-speech synthesis in a target speaker's voice using a neural network
TWI480856B (zh) Noise generation technique in an audio codec
WO2021258940A1 (zh) Audio encoding/decoding method, apparatus, medium, and electronic device
WO2018223727A1 (zh) Voiceprint recognition method, apparatus, device, and medium
US20220180881A1 Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
CN102165699A (zh) Method and apparatus for signal processing using transform-domain log companding
JP7123910B2 (ja) Quantizer with index coding and bit scheduling
CN113724683B (zh) Audio generation method, computer device, and computer-readable storage medium
WO2023241240A1 (zh) Audio processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
JP6573887B2 (ja) Audio signal encoding method, decoding method, and apparatus therefor
JP2023548707A (ja) Speech enhancement method, apparatus, device, and computer program
CN111816197B (zh) Audio encoding method and apparatus, electronic device, and storage medium
WO2023241205A1 (zh) Audio processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
WO2012075476A2 Warped spectral and fine estimate audio encoding
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
US20230186927A1 Compressing audio waveforms using neural networks and vector quantizers
WO2022166738A1 (zh) Speech enhancement method, apparatus, device, and storage medium
CN113744715A (zh) Vocoder speech synthesis method, apparatus, computer device, and storage medium
Jose AMRConvNet: AMR-coded speech enhancement using convolutional neural networks
US9413323B2 System and method of filtering an audio signal prior to conversion to an MU-LAW format
CN113470616B (zh) Speech processing method and apparatus, and vocoder and vocoder training method
CN113724716B (zh) Speech processing method and speech processing apparatus
US20240105203A1 Enhanced audio file generator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815853

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815853

Country of ref document: EP

Kind code of ref document: A1