WO2019233362A1 - Deep learning-based speech sound quality enhancement method, device and system - Google Patents

Deep learning-based speech sound quality enhancement method, device and system Download PDF

Info

Publication number
WO2019233362A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
sample
data
neural network
Prior art date
Application number
PCT/CN2019/089759
Other languages
English (en)
Chinese (zh)
Inventor
秦宇
姚青山
喻浩文
卢峰
Original Assignee
安克创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安克创新科技股份有限公司 filed Critical 安克创新科技股份有限公司
Publication of WO2019233362A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Definitions

  • the present invention relates to the technical field of sound quality optimization, and more particularly, to a method, a device, and a system for enhancing sound quality based on deep learning.
  • Enhanced Voice Services (EVS) coding technology may reach a 48 kHz sampling frequency and a 128 kbps code rate.
  • this does not mean that all users can enjoy the experience of high-definition voice communication.
  • For example, if the operator of the calling user supports a 4G network while the operator of the receiving user only supports a 3G network, the two parties may only be able to select an adaptive multi-rate narrowband (amr-nb) coding method for speech coding, rather than an adaptive multi-rate wideband (amr-wb) coding method with, for example, a 16 kHz sampling frequency. Because of such scenarios, in which low-quality voice has to be adopted due to hardware conditions, not everyone can enjoy the benefits of high-definition voice communications.
  • the present invention has been made to solve at least one of the problems described above.
  • the present invention proposes a solution for enhancing the sound quality of speech based on deep learning, which enhances the sound quality of low-quality speech based on a deep learning method, so that low-quality speech is reconstructed by a deep neural network to achieve the sound quality of high-quality speech, thereby realizing a sound quality improvement effect that cannot be achieved by traditional methods.
  • a method for enhancing the sound quality of speech based on deep learning includes: acquiring to-be-processed voice data and performing feature extraction on the to-be-processed voice data to obtain its characteristics; and, based on the characteristics of the to-be-processed voice data, using a trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the voice quality of the second speech sample is lower than the voice quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the characteristics of the first speech sample and of the second speech sample, respectively; and using the obtained characteristics of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained characteristics of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, with the feature extraction performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, with the framing performed on the time-domain waveform data obtained after decoding.
  • using the trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data includes: using the characteristics of the to-be-processed voice data as the input of the trained speech reconstruction neural network and outputting reconstructed speech features from the trained speech reconstruction neural network; and generating a time-domain speech waveform based on the reconstructed speech features as the output voice data.
  • a deep learning-based voice sound quality enhancement device includes: a feature extraction module for obtaining to-be-processed voice data and performing feature extraction on the to-be-processed voice data to obtain its characteristics; and a speech reconstruction module configured to reconstruct the to-be-processed voice data into output voice data using the trained speech reconstruction neural network, based on the characteristics of the to-be-processed voice data extracted by the feature extraction module, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the voice quality of the second speech sample is lower than the voice quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the characteristics of the first speech sample and of the second speech sample, respectively; and using the obtained characteristics of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained characteristics of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, with the feature extraction performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, with the framing performed on the time-domain waveform data obtained after decoding.
  • the speech reconstruction module further includes: a reconstruction module configured to use the characteristics of the to-be-processed voice data as the input of the trained speech reconstruction neural network, from which reconstructed speech features are output; and a generation module for generating a time-domain speech waveform, based on the reconstructed speech features output by the reconstruction module, as the output voice data.
  • a deep learning-based voice sound quality enhancement system includes a storage device and a processor.
  • the storage device stores a computer program to be run by the processor; the computer program, when executed by the processor, performs the deep learning-based speech sound quality enhancement method according to any one of the above.
  • a storage medium stores a computer program, and the computer program, when run, executes the deep learning-based voice sound quality enhancement method according to any one of the foregoing.
  • a computer program is provided, which is used by a computer or a processor to execute the deep learning-based voice sound quality enhancement method according to any one of the foregoing, and which is further used to implement each module in the deep learning-based voice sound quality enhancement device according to any one of the above.
  • the method, device, and system for enhancing speech sound quality based on deep learning according to the embodiments of the present invention enhance low-quality speech sound quality based on the deep learning method, so that low-quality speech is reconstructed by a deep neural network to achieve the sound quality of high-quality speech, thereby realizing a sound quality improvement effect that cannot be achieved by traditional methods.
  • the method, device, and system for enhancing the voice quality based on deep learning according to the embodiments of the present invention can be conveniently deployed on the server or user end, and can effectively enhance the voice quality.
  • FIG. 1 shows a schematic block diagram of an example electronic device for implementing a deep learning-based method, apparatus, and system for voice sound quality enhancement according to an embodiment of the present invention
  • FIG. 2 shows a schematic flowchart of a deep learning-based voice sound quality enhancement method according to an embodiment of the present invention
  • FIG. 3 shows a training schematic diagram of a speech reconstruction neural network according to an embodiment of the present invention
  • FIGS. 4A, 4B, and 4C respectively show the spectrograms of high-quality speech, of low-quality speech, and of speech obtained by reconstructing the low-quality speech using a deep learning-based speech sound quality enhancement method according to an embodiment of the present invention
  • FIG. 5 shows a schematic block diagram of a deep learning-based voice sound quality enhancement device according to an embodiment of the present invention.
  • FIG. 6 shows a schematic block diagram of a deep learning-based speech sound quality enhancement system according to an embodiment of the present invention.
  • an example electronic device 100 for implementing a method, an apparatus, and a system for improving the sound quality of a voice based on deep learning according to an embodiment of the present invention is described with reference to FIG. 1.
  • the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, and an output device 108, which are interconnected through a bus system 110 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in FIG. 1 are only exemplary and not restrictive, and the electronic device may have other components and structures as needed.
  • the processor 102 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform a desired function.
  • the storage device 104 may include one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and/or a cache memory.
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement the client functions (implemented by the processor) in the embodiments of the present invention described below and/or other desired functions. Various application programs and various data, such as data used and/or generated by the application programs, can also be stored in the computer-readable storage medium.
  • the input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. In addition, the input device 106 may also be any interface for receiving information.
  • the output device 108 may output various information (such as images or sounds) to the outside (such as a user), and may include one or more of a display, a speaker, and the like. In addition, the output device 108 may be any other device having an output function.
  • an example electronic device for implementing a method, a device, and a system for enhancing the sound quality of a voice based on deep learning may be implemented as a terminal such as a smart phone, a tablet computer, or the like.
  • a deep learning-based speech sound quality enhancement method 200 may include the following steps:
  • in step S210, the speech data to be processed is acquired, and feature extraction is performed on the speech data to be processed to obtain the characteristics of the speech data to be processed.
  • the to-be-processed voice data obtained in step S210 may be low-quality voice data that is received, stored, or played in a voice communication terminal or voice storage/playback device and requires sound quality enhancement, such as speech data with a low bit rate or a low sampling frequency.
  • the to-be-processed voice data may include, but is not limited to, a data stream of a wireless voice call, a voice in a list being played by a user, or a voice file stored in the cloud or on a client.
  • the to-be-processed voice data obtained in step S210 may also be any data that requires sound quality enhancement, such as voice data included in video data.
  • the to-be-processed voice data obtained in step S210 may come from a file stored offline, or from a file played online.
  • a manner of performing feature extraction on the acquired to-be-processed voice data may include, but is not limited to, a short-time Fourier transform (STFT).
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may include frequency-domain amplitude and/or energy information.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may further include spectrum phase information.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may also be time-domain features.
  • the features of the to-be-processed voice data obtained by performing feature extraction on the obtained to-be-processed voice data may further include any other feature that can characterize the to-be-processed voice data.
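  • As a concrete illustration of the feature extraction described above, the following Python sketch computes STFT magnitude and phase features per frame. It is a minimal sketch under assumed parameters (16 kHz input, 512-sample frames with 50% overlap, scipy as the signal-processing library); the patent does not prescribe a specific toolkit or frame size.

```python
import numpy as np
from scipy.signal import stft

def extract_features(waveform, fs=16000, frame_len=512, hop=256):
    """Return frequency-domain magnitude and phase features per frame."""
    # Zxx is complex with shape (n_freq_bins, n_frames); the STFT windowing
    # itself performs the frame-by-frame processing described above.
    _, _, Zxx = stft(waveform, fs=fs, nperseg=frame_len,
                     noverlap=frame_len - hop)
    magnitude = np.abs(Zxx)   # frequency-domain amplitude information
    phase = np.angle(Zxx)     # spectrum phase information
    return magnitude, phase
```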
  • before performing feature extraction on the speech data to be processed, frame processing may be performed on it, and the aforementioned feature extraction is then performed frame by frame on the speech data obtained after framing.
  • This situation may be applicable when the to-be-processed voice data obtained in step S210 is from a file stored offline or a complete file from any source.
  • when the to-be-processed voice data obtained in step S210 comes from a file played online, one or more frames of to-be-processed voice data may be buffered before feature extraction.
  • part of the data can be selected from each frame of to-be-processed voice data obtained after framing or buffering to perform feature extraction, which can effectively reduce the amount of data and improve processing efficiency.
  • the voice data to be processed may be decoded, and the aforementioned frame processing may be performed on the time-domain waveform data obtained after decoding. This is because the acquired speech data to be processed is generally in an encoded form, and in order to obtain its complete speech time domain information, it may be decoded first.
  • before performing feature extraction on the voice data to be processed, it may be pre-processed, and the aforementioned feature extraction may be performed on the voice data obtained after the pre-processing.
  • the pre-processing of the speech data to be processed may include, but is not limited to, denoising, echo suppression, automatic gain control, and the like.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the acquired to-be-processed voice data may be sequentially decoded, pre-processed, framed, and feature extracted in order to efficiently extract well-represented features.
  • the aforementioned pre-processing operation may also be performed before the feature extraction after framing.
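  • To make the framing step concrete, the sketch below splits decoded time-domain waveform data into overlapping frames. It assumes the encoded input has already been decoded to a float array (decoding and any pre-processing such as denoising happen upstream, per the order described above); the frame length and hop size are illustrative values, not ones fixed by the patent.

```python
import numpy as np

def frame_signal(waveform, frame_len=512, hop=256):
    """Split a decoded time-domain waveform into overlapping frames."""
    if len(waveform) < frame_len:  # pad signals shorter than one frame
        waveform = np.pad(waveform, (0, frame_len - len(waveform)))
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Build an (n_frames, frame_len) index matrix and gather the samples.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return waveform[idx]
```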
  • in step S220, based on the characteristics of the voice data to be processed, the trained voice reconstruction neural network is used to reconstruct the voice data to be processed into output voice data, where the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the features of the speech data to be processed extracted in step S210 are input to the trained speech reconstruction neural network, which reconstructs the input features to obtain reconstructed speech features, and the reconstructed speech features can be used to generate output voice data with higher voice quality than the acquired to-be-processed voice data. Therefore, the speech sound quality enhancement method of the present invention can accurately supplement, based on deep learning, the speech information lost in low-quality speech, which not only effectively achieves a great improvement in the sound quality of low-quality speech, but also does not affect communication bandwidth considerations (because what is transmitted is still low-quality voice data with a small amount of data, while the low-quality voice data can be reconstructed into high-quality voice data at the receiving end).
  • training of the speech reconstruction neural network according to the embodiment of the present invention is described below with reference to FIG. 3.
  • training of a speech reconstruction neural network according to an embodiment of the present invention may include the following process:
  • a first voice sample and a second voice sample are obtained, wherein the voice quality of the second voice sample is lower than the voice quality of the first voice sample, and the second voice sample is obtained by transcoding the first voice sample.
  • the first speech sample may be a high-quality speech sample and the second speech sample may be a low-quality speech sample.
  • the first speech sample may be a set of speech samples with a high bit rate and a high sampling frequency, including but not limited to speech data with a sampling frequency of 16 kHz, 24 kHz, and 32 kHz.
  • a first speech sample may be transcoded to obtain a second speech sample.
  • the amr-wb speech sample with a sampling frequency of 16 kHz and a code rate of 23.85 kbps can be used as the first speech sample, and the second speech sample can be obtained by transcoding it into amr-nb speech with a sampling frequency of 8 kHz and a code rate of 12.2 kbps.
  • the second speech sample can be obtained by converting the first speech sample in the FLAC format to the MP3 format without reducing the bit rate and the sampling frequency. That is, the code rate of the first voice sample may be higher than or equal to the code rate of the second voice sample; the sampling frequency of the first voice sample may be higher than or equal to the sampling frequency of the second voice sample.
  • the transcoding of the first speech sample (that is, the high-quality speech sample) to obtain the second speech sample can also take other forms, which can be adaptively adjusted based on the actual application scenario.
  • the first voice sample and the second voice sample to be selected can be determined based on the reconstruction requirements for the to-be-processed voice data obtained in step S210; that is, based on those requirements, one can determine both the first voice sample to be selected and the transcoding method used to transcode it into the second voice sample, as in the sketch below.
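  • The following sketch generates a low-quality training counterpart by transcoding, following the amr-wb to amr-nb example above (8 kHz, 12.2 kbps). Invoking ffmpeg with the libopencore_amrnb encoder is an assumption about available tooling, not something the patent specifies; any transcoder producing the target code rate and sampling frequency would serve.

```python
import subprocess

def make_low_quality_copy(src_path, dst_path):
    """Transcode a high-quality sample into an 8 kHz, 12.2 kbps amr-nb copy."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src_path,
        "-ar", "8000",                # resample to an 8 kHz sampling frequency
        "-c:a", "libopencore_amrnb",  # amr-nb encoder (assumed available)
        "-b:a", "12.2k",              # 12.2 kbps code rate
        dst_path,
    ], check=True)

# Example pairing: the original file is the first (high-quality) sample,
# and the transcoded copy is the second (low-quality) sample.
# make_low_quality_copy("sample_16k.wav", "sample_8k.amr")
```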
  • feature extraction is performed on the first voice sample and the second voice sample to obtain the features of the first voice sample and the features of the second voice sample, respectively.
  • the manner of performing feature extraction on each of the first speech sample and the second speech sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the first speech sample and the second speech sample may include their respective frequency-domain amplitude and/or energy information.
  • the features obtained by performing feature extraction on the first speech sample and the second speech sample may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may further include any other features that can characterize their respective features.
  • before feature extraction, frame processing may be performed on the first voice sample and the second voice sample separately, and the aforementioned feature extraction may be performed frame by frame on the respective speech samples obtained after the first and second speech samples are framed.
  • part of the data can be selected for feature extraction for each frame of voice samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the first speech sample and the second speech sample may be respectively decoded, and the foregoing frame processing may be performed on the respective time-domain waveform data obtained after the first and second speech samples are decoded.
  • each of the first voice sample and the second voice sample may be pre-processed, and the aforementioned feature extraction may be performed on the speech samples obtained after the pre-processing.
  • the pre-processing performed on each of the first speech sample and the second speech sample may include, but is not limited to, denoising, echo suppression, and automatic gain control.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the first speech sample and the second speech sample may be sequentially decoded, preprocessed, framed, and feature extracted in order to efficiently extract features with good representativeness.
  • the foregoing pre-processing operation may also be performed before the feature extraction is performed after the first speech sample and the second speech sample are respectively framed.
  • the obtained features of the second speech sample are used as the input of the input layer of the speech reconstruction neural network, and the obtained features of the first speech sample are used as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the features of one or more frames of the second speech sample may be used as the input of the input layer of the speech reconstruction neural network, and the features of the corresponding one or more frames of the first speech sample may be used as the target of the output layer, thereby training a neural network regressor as the speech reconstruction neural network employed in step S220, as in the sketch below.
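  • The sketch below trains such a regressor on paired frame features. PyTorch, the fully connected layer sizes, and the mean-squared-error objective are all assumptions made for illustration; the patent does not fix an architecture or loss, only the input/target pairing of low-quality and high-quality features.

```python
import torch
import torch.nn as nn

N_BINS = 257  # STFT bins for a 512-point frame (assumed, matching the sketches above)

# Regressor: low-quality frame features in, high-quality frame features out.
model = nn.Sequential(
    nn.Linear(N_BINS, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, N_BINS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(low_feats, high_feats):
    """One update; low_feats/high_feats: (batch, N_BINS) magnitude features."""
    optimizer.zero_grad()
    loss = loss_fn(model(low_feats), high_feats)  # target = high-quality features
    loss.backward()
    optimizer.step()
    return loss.item()
```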
  • in step S220, based on the trained speech reconstruction neural network, the features of the speech data to be processed can be reconstructed into reconstructed speech features; since the reconstructed speech features are frequency-domain features, a time-domain voice waveform output can be generated based on the reconstructed voice features.
  • the time-domain speech waveform can be obtained by transforming the reconstructed speech feature by inverse Fourier transform.
  • the output voice waveform can be stored or buffered for playback, providing users with a better improved voice sound quality experience.
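  • Tying the pieces together, the sketch below runs the trained regressor on the extracted magnitudes and inverts the result back to a time-domain waveform with an inverse STFT, reusing the helpers defined in the sketches above. Reusing the phase of the input speech is an assumption for illustration; as noted earlier, phase information may also be handled as a feature.

```python
import numpy as np
import torch
from scipy.signal import istft

def enhance(waveform, fs=16000, frame_len=512, hop=256):
    """Reconstruct low-quality speech into a higher-quality time-domain waveform."""
    magnitude, phase = extract_features(waveform, fs, frame_len, hop)
    with torch.no_grad():
        feats = torch.from_numpy(magnitude.T).float()  # (n_frames, N_BINS)
        recon = model(feats).numpy().T                 # reconstructed magnitudes
    Zxx = recon * np.exp(1j * phase)                   # reassemble the complex spectrum
    _, out = istft(Zxx, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    return out  # time-domain speech waveform, ready to store or buffer for playback
```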
  • FIGS. 4A to 4C illustrate the voice sound quality enhancement effect achievable by the deep learning-based voice sound quality enhancement method according to the embodiment.
  • FIGS. 4A, 4B, and 4C respectively show the spectrograms of high-quality speech, of low-quality speech, and of speech obtained by reconstructing the low-quality speech using a deep learning-based speech sound quality enhancement method according to an embodiment of the present invention.
  • FIG. 4A shows the spectrogram 400 of high-quality speech, using the PCM format, a 16 kHz sampling frequency, and 16 quantization bits as an example; FIG. 4B shows the spectrogram of the low-quality speech in MP3 format with an 8 kHz sampling frequency obtained by transcoding the high-quality speech; and FIG. 4C shows the spectrogram 402 of the reconstructed speech at a 16 kHz sampling frequency obtained by reconstructing the low-quality speech using the deep learning-based speech quality enhancement method according to an embodiment of the present invention. It is obvious from FIGS. 4A to 4C that, compared with the high-quality speech spectrogram shown in FIG. 4A, the low-quality speech spectrogram shown in FIG. 4B lacks many high-frequency components. After reconstruction by the method for enhancing the sound quality of speech based on deep learning of the embodiment of the present invention, the reconstructed speech spectrogram shown in FIG. 4C restores these high-frequency components, achieving super-resolution of narrow-band speech, so that the sound quality of the low-quality speech is improved.
  • the deep learning-based voice sound quality enhancement method enhances low-quality voice sound quality based on the deep learning method, so that low-quality voice is reconstructed by the deep neural network to achieve high-quality voice sound quality, thereby achieving a sound quality improvement effect that cannot be achieved by traditional methods.
  • the deep learning-based voice sound quality enhancement method may be implemented in a device, an apparatus, or a system having a memory and a processor.
  • the method for enhancing the sound quality of a voice based on deep learning can be conveniently deployed on a mobile device such as a smart phone, a tablet computer, a personal computer, a headset, or a speaker.
  • the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention may also be deployed on a server side (or cloud).
  • the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention may also be deployed on the server (or cloud) and personal terminals in a distributed manner.
  • FIG. 5 shows a schematic block diagram of a deep learning-based voice sound quality enhancement apparatus 500 according to an embodiment of the present invention.
  • a deep learning-based voice sound quality enhancement device 500 includes a feature extraction module 510 and a voice reconstruction module 520.
  • Each of the modules may perform each step / function of the deep learning-based speech sound quality enhancement method described above in conjunction with FIG. 2.
  • only the main functions of each module of the deep learning-based voice sound quality enhancement device 500 are described, and details that have been described above are omitted.
  • the feature extraction module 510 is configured to obtain voice data to be processed, and perform feature extraction on the voice data to be processed to obtain characteristics of the voice data to be processed.
  • the speech reconstruction module 520 is configured to reconstruct the speech data to be processed into output speech data using the trained speech reconstruction neural network, based on the features of the speech data to be processed extracted by the feature extraction module, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • Both the feature extraction module 510 and the voice reconstruction module 520 can be implemented by the processor 102 in the electronic device shown in FIG. 1 running program instructions stored in the storage device 104.
  • the to-be-processed voice data obtained by the feature extraction module 510 may be low-quality voice data that is received, stored, or played in a voice communication terminal or voice storage/playback device and requires sound quality enhancement, such as speech data with a low bit rate or a low sampling frequency.
  • the to-be-processed voice data may include, but is not limited to, a data stream of a wireless voice call, a voice in a list being played by a user, or a voice file stored in the cloud or on a client.
  • the to-be-processed voice data obtained by the feature extraction module 510 may also be any data that requires sound quality enhancement, such as voice data included in video data.
  • the to-be-processed voice data obtained by the feature extraction module 510 may come from files stored offline, or from files played online.
  • the manner in which the feature extraction module 510 performs feature extraction on the acquired speech data to be processed may include, but is not limited to, a short-time Fourier transform (STFT).
  • the features of the to-be-processed voice data obtained when the feature extraction module 510 performs feature extraction on the acquired to-be-processed voice data may include frequency-domain amplitude and/or energy information.
  • the features may further include spectrum phase information.
  • the features may also be time-domain features. In other examples, they may further include any other feature that can characterize the to-be-processed voice data.
  • before performing feature extraction on the to-be-processed voice data, the feature extraction module 510 may perform frame processing on it, and the aforementioned feature extraction is then performed frame by frame on the speech data obtained after framing.
  • This situation may be applicable when the to-be-processed voice data obtained by the feature extraction module 510 is from a file stored offline or a complete file from any source.
  • when the to-be-processed voice data obtained by the feature extraction module 510 comes from a file played online, one or more frames of to-be-processed voice data may be buffered before feature extraction.
  • the feature extraction module 510 may select part of the data from each frame of to-be-processed voice data obtained after framing or buffering to perform feature extraction, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the voice data to be processed may be decoded, and the aforementioned frame processing may be performed on the time-domain waveform data obtained after decoding. This is because the acquired speech data to be processed is generally in an encoded form, and in order to obtain its complete speech time-domain information, it may be decoded first.
  • the voice data to be processed may be pre-processed, and the aforementioned feature extraction may be performed on the voice data obtained after pre-processing.
  • the pre-processing of the speech data to be processed by the feature extraction module 510 may include, but is not limited to, denoising, echo suppression, automatic gain control, and the like.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the feature extraction module 510 may sequentially decode, pre-process, frame, and feature extract the acquired speech data to be processed in order to efficiently extract features with good representativeness.
  • the aforementioned pre-processing operation may also be performed before the feature extraction after framing.
  • the speech reconstruction module 520 may use the trained speech reconstruction neural network to reconstruct the speech data to be processed into output speech data.
  • the voice reconstruction module 520 may further include a reconstruction module (not shown in FIG. 5) and a generation module (not shown in FIG. 5).
  • the reconstruction module may include a trained speech reconstruction neural network that takes as input the features of the speech data to be processed extracted by the feature extraction module 510, and reconstructs the input features to obtain reconstructed speech features.
  • the generating module generates output voice data with higher voice quality than the acquired to-be-processed voice data based on the reconstructed voice features output by the reconstruction module.
  • the voice sound quality enhancement device of the present invention can accurately supplement, based on deep learning, the voice information lost in low-quality speech, which not only effectively achieves a great improvement in the sound quality of low-quality speech, but also does not affect communication bandwidth considerations (because what is transmitted is still low-quality voice data with a small amount of data, while the low-quality voice data can be reconstructed into high-quality voice data at the receiving end).
  • the training of the speech reconstruction neural network used by the speech reconstruction module 520 may include: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample to obtain the features of the first speech sample and of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample may be a high-quality speech sample and the second speech sample may be a low-quality speech sample.
  • the first speech sample may be a set of speech samples with a high bit rate and a high sampling frequency, including but not limited to speech data with a sampling frequency of 16 kHz, 24 kHz, and 32 kHz.
  • a first speech sample may be transcoded to obtain a second speech sample.
  • an amr-wb voice sample with a sampling frequency of 16 kHz and a code rate of 23.85 kbps can be used as the first voice sample, and the second voice sample can be obtained by transcoding it into amr-nb voice with a sampling frequency of 8 kHz and a code rate of 12.2 kbps.
  • the second speech sample can be obtained by converting the first speech sample in the FLAC format to the MP3 format without reducing the bit rate and the sampling frequency. That is, the code rate of the first voice sample may be higher than or equal to the code rate of the second voice sample; the sampling frequency of the first voice sample may be higher than or equal to the sampling frequency of the second voice sample.
  • the transcoding of the first speech sample (that is, the high-quality speech sample) to obtain the second speech sample (that is, the low-quality speech sample) can also take other forms, which can be adaptively adjusted based on the actual application scenario.
  • the first voice sample and the second voice sample to be selected can be determined based on the reconstruction needs of the to-be-processed voice data obtained by the feature extraction module 510; that is, based on those needs, one can determine both the first voice sample to be selected and the transcoding method used to transcode it into the second voice sample.
  • a manner of performing feature extraction on each of the first speech sample and the second speech sample may include, but is not limited to, a short-time Fourier transform.
  • the features obtained by performing feature extraction on each of the first speech sample and the second speech sample may include their respective frequency-domain amplitude and/or energy information.
  • the features obtained by performing feature extraction on the first speech sample and the second speech sample may further include their respective spectral phase information.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may also be their respective time-domain features.
  • the features obtained by performing feature extraction on the first voice sample and the second voice sample may further include any other features that can characterize their respective features.
  • frame processing may be performed on each of the first voice sample and the second voice sample, and the aforementioned feature extraction may be performed frame by frame on the respective speech samples obtained after the first and second speech samples are framed.
  • part of the data can be selected for feature extraction for each frame of voice samples, which can effectively reduce the amount of data and improve processing efficiency.
  • before framing, the first speech sample and the second speech sample may be respectively decoded, and the foregoing frame processing may be performed on the respective time-domain waveform data obtained after the first and second speech samples are decoded.
  • each of the first voice sample and the second voice sample may be pre-processed, and the aforementioned feature extraction may be performed on the speech samples obtained after the pre-processing.
  • the pre-processing performed on each of the first speech sample and the second speech sample may include, but is not limited to, denoising, echo suppression, and automatic gain control.
  • the pre-processing may be performed after the aforementioned decoding process. Therefore, in one example, the first speech sample and the second speech sample may be sequentially decoded, preprocessed, framed, and feature extracted in order to efficiently extract features with good representativeness.
  • the foregoing pre-processing operation may also be performed before the feature extraction is performed after the first speech sample and the second speech sample are respectively framed.
  • the features of one or more frames of the second speech sample may be used as the input of the input layer of the speech reconstruction neural network, and the features of the corresponding one or more frames of the first speech sample may be used as the target of the output layer, to train a neural network regressor as the speech reconstruction neural network used in the speech reconstruction module 520.
  • the reconstruction module of the speech reconstruction module 520 can reconstruct the features of the speech data to be processed into reconstructed speech features; since the reconstructed speech features are frequency-domain features, the generation module of the speech reconstruction module 520 may generate a time-domain speech waveform output based on them. Exemplarily, the generation module may transform the reconstructed speech features into a time-domain speech waveform by inverse Fourier transform.
  • the output voice waveform can be stored or buffered for playback, providing users with a better improved voice sound quality experience.
  • the voice sound quality enhancement effect achievable by the deep learning-based voice sound quality enhancement device according to the embodiment can be understood with reference to the foregoing description of FIGS. 4A to 4C; for brevity, it is not repeated here.
  • the deep learning-based voice sound quality enhancement device enhances low-quality voice sound quality based on the deep learning method, so that low-quality voice is reconstructed by the deep neural network to achieve high-quality voice sound quality, thereby achieving a sound quality improvement effect that cannot be achieved by traditional methods.
  • the deep learning-based voice sound quality enhancement device can be conveniently deployed on a server or a user end, and can effectively enhance the voice quality.
  • FIG. 6 shows a schematic block diagram of a deep learning-based speech sound quality enhancement system 600 according to an embodiment of the present invention.
  • the deep learning-based speech sound quality enhancement system 600 includes a storage device 610 and a processor 620.
  • the storage device 610 stores a program for implementing the corresponding steps in the method for enhancing the sound quality of a voice based on deep learning according to an embodiment of the present invention.
  • the processor 620 is configured to run the program stored in the storage device 610 to execute the corresponding steps of the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention, and to implement the corresponding modules in the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention.
  • when the program is executed by the processor 620, the deep learning-based voice sound quality enhancement system 600 performs the following steps: obtaining to-be-processed voice data and performing feature extraction on the to-be-processed voice data to obtain its characteristics; and, based on the characteristics of the to-be-processed voice data, using the trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, wherein the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample respectively to obtain the features of the first speech sample and of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, with the feature extraction performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, with the framing performed on the time-domain waveform data obtained after decoding.
  • when the program is run by the processor 620, the deep learning-based speech quality enhancement system 600 is caused to use the trained speech reconstruction neural network to reconstruct the to-be-processed speech data into output speech data, which includes: using the features of the speech data to be processed as the input of the trained speech reconstruction neural network and outputting reconstructed speech features from the trained speech reconstruction neural network; and generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  • a storage medium is provided on which program instructions are stored; when the program instructions are run by a computer or a processor, they are used to execute the deep learning-based voice sound quality enhancement method according to an embodiment of the present invention.
  • the storage medium may include, for example, a memory card of a smart phone, a storage part of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), USB memory, or any combination of the above storage media.
  • the computer-readable storage medium may be any combination of one or more computer-readable storage media.
  • the computer program instructions, when run by a computer, may implement the various functional modules of the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention, and/or may execute the deep learning-based method for improving the sound quality of speech.
  • the computer program instructions, when executed by a computer or a processor, cause the computer or the processor to perform the following steps: obtaining to-be-processed voice data and performing feature extraction on the to-be-processed voice data to obtain its characteristics; and, based on the characteristics of the to-be-processed voice data, using the trained speech reconstruction neural network to reconstruct the to-be-processed voice data into output voice data, where the voice quality of the output voice data is higher than the voice quality of the to-be-processed voice data.
  • the training of the speech reconstruction neural network includes: obtaining a first speech sample and a second speech sample, wherein the speech quality of the second speech sample is lower than the speech quality of the first speech sample, and the second speech sample is obtained by transcoding the first speech sample; performing feature extraction on the first speech sample and the second speech sample respectively to obtain the features of the first speech sample and of the second speech sample; and using the obtained features of the second speech sample as the input of the input layer of the speech reconstruction neural network and the obtained features of the first speech sample as the target of the output layer of the speech reconstruction neural network, to train the speech reconstruction neural network.
  • the first speech sample has a first bit rate
  • the second speech sample has a second bit rate
  • the first bit rate is higher than or equal to the second bit rate
  • the first speech sample has a first sampling frequency
  • the second speech sample has a second sampling frequency
  • the first sampling frequency is higher than or equal to the second sampling frequency
  • the features obtained by the feature extraction include frequency-domain amplitude and/or energy information.
  • the features obtained by the feature extraction further include spectrum phase information.
  • the feature extraction manner includes a short-time Fourier transform.
  • the training of the speech reconstruction neural network further includes: before performing feature extraction on the first speech sample and the second speech sample, framing the first speech sample and the second speech sample separately, with the feature extraction performed frame by frame on the speech samples obtained after framing.
  • the training of the speech reconstruction neural network further includes: before framing the first speech sample and the second speech sample, decoding the first speech sample and the second speech sample into time-domain waveform data, with the framing performed on the time-domain waveform data obtained after decoding.
  • the computer program instructions, when executed by a computer or processor, cause the computer or processor to reconstruct the to-be-processed speech data into the output speech data using the trained speech reconstruction neural network, including: using the features of the speech data to be processed as the input of the trained speech reconstruction neural network and outputting reconstructed speech features from the trained speech reconstruction neural network; and generating a time-domain speech waveform based on the reconstructed speech features as the output speech data.
  • Each module in the deep learning-based voice sound quality enhancement device according to an embodiment of the present invention may be implemented by a processor of an electronic device for deep learning-based voice sound quality enhancement running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to an embodiment of the present invention are run by a computer.
  • a computer program is also provided, and the computer program may be stored on a cloud or a local storage medium. When run by a computer or a processor, it is used to execute the corresponding steps of the deep learning-based voice sound quality enhancement method of the embodiment of the present invention, and to implement the corresponding modules in the deep learning-based voice sound quality enhancement device according to the embodiment of the present invention.
  • a method, device, system, storage medium, and computer program for deep learning-based voice sound quality enhancement enhance low-quality voice sound quality based on the deep learning method, so that low-quality voice is reconstructed by a deep neural network to achieve the sound quality of high-quality speech, and thus a sound quality improvement effect that cannot be achieved by traditional methods can be achieved.
  • the method, device, system, storage medium, and computer program for deep learning-based voice sound quality enhancement according to the embodiments of the present invention can be conveniently deployed on the server or user side, and can effectively enhance the voice sound quality.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division, and there may be other division manners in actual implementation.
  • multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some modules according to embodiments of the present invention.
  • the invention may also be implemented as a device program (e.g., a computer program or a computer program product) for performing part or all of the method described herein.
  • a program that implements the present invention may be stored on a computer-readable medium or may have the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
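
To make the sample-preparation steps above concrete, the following is a minimal Python sketch of how a paired training sample might be produced and framed. It assumes the second (lower-quality) sample is derived from the first by downsampling; the function names, sampling rates, frame length, and hop size are illustrative assumptions, not details taken from the patent.

```python
# Illustrative sketch only: derive a low-sampling-frequency counterpart of a
# high-quality sample and split decoded time-domain waveforms into frames.
import numpy as np
from scipy.signal import resample_poly

def make_pair(high_quality, first_rate=48000, second_rate=16000):
    # Downsample the first speech sample to obtain a second sample whose
    # sampling frequency is lower than the first (one way to build a pair).
    low_quality = resample_poly(high_quality, up=second_rate, down=first_rate)
    return high_quality, low_quality

def frame_signal(waveform, frame_len=512, hop=256):
    # Framing is performed on the time-domain waveform data obtained after
    # decoding; frames overlap by frame_len - hop samples.
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop)
    return np.stack([waveform[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```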
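A minimal sketch of frame-by-frame feature extraction with a short-time Fourier transform, yielding frequency-domain amplitude and/or energy information plus spectrum phase information, follows; the window and hop parameters are assumptions for illustration.

```python
# Illustrative sketch only: STFT-based feature extraction.
import numpy as np
from scipy.signal import stft

def extract_features(waveform, rate=16000, frame_len=512, hop=256):
    _, _, spectrum = stft(waveform, fs=rate, nperseg=frame_len,
                          noverlap=frame_len - hop)
    amplitude = np.abs(spectrum)   # frequency-domain amplitude information
    energy = amplitude ** 2        # energy information
    phase = np.angle(spectrum)     # spectrum phase information
    return amplitude, energy, phase
```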
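One possible shape for training the speech reconstruction neural network on such features is sketched below, mapping low-quality frame features to the corresponding high-quality ones. The network depth, layer widths, loss, and optimizer are assumptions for illustration; the patent does not fix a specific architecture here.

```python
# Illustrative sketch only: train a network to map low-quality speech
# features to high-quality speech features (PyTorch assumed available).
import torch
import torch.nn as nn

N_BINS = 257  # frequency bins of a 512-point STFT (assumed)
model = nn.Sequential(
    nn.Linear(N_BINS, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, N_BINS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(low_feats, high_feats):
    # low_feats / high_feats: (batch, N_BINS) amplitude features of paired frames.
    optimizer.zero_grad()
    loss = loss_fn(model(low_feats), high_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```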
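Finally, a sketch of the inference path described above: the features of the speech data to be processed go into the trained network, and a time-domain waveform is generated from the reconstructed speech features. It reuses extract_features from the sketch above; carrying the input phase into the inverse transform is one common choice and an assumption here, not a detail stated in the patent.

```python
# Illustrative sketch only: enhance speech with the trained reconstruction
# network and synthesize the output waveform via the inverse STFT.
import numpy as np
import torch
from scipy.signal import istft

def enhance(waveform, model, rate=16000, frame_len=512, hop=256):
    amplitude, _, phase = extract_features(waveform, rate, frame_len, hop)
    feats = torch.from_numpy(amplitude.T.astype(np.float32))
    with torch.no_grad():
        reconstructed = model(feats).numpy().T  # reconstructed speech features
    spectrum = reconstructed * np.exp(1j * phase)  # assumed: reuse input phase
    _, output = istft(spectrum, fs=rate, nperseg=frame_len,
                      noverlap=frame_len - hop)
    return output  # time-domain speech waveform as the output speech data
```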

Abstract

The invention relates to a deep learning-based speech quality enhancement method, device, and system. The method comprises the steps of: acquiring speech data to be processed, and performing feature extraction on the speech data so as to obtain a feature thereof (S210); and reconstructing, on the basis of the feature of the speech data and by means of a trained speech reconstruction neural network, the speech data into output speech data, the output speech data having a voice quality higher than that of the speech data to be processed (S220). The invention enhances the voice quality of low-quality speech data on the basis of a deep learning method, such that the voice quality of low-quality speech data is raised, through deep-neural-network-based reconstruction, to that of high-quality speech data, thereby achieving a voice quality enhancement effect that conventional methods cannot accomplish.
PCT/CN2019/089759 2018-06-05 2019-06-03 Deep learning-based speech quality enhancement method, device and system WO2019233362A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810583123.0 2018-06-05
CN201810583123.0A CN109147806B Deep learning-based voice sound quality enhancement method, device and system

Publications (1)

Publication Number Publication Date
WO2019233362A1 true WO2019233362A1 (fr) 2019-12-12

Family

ID=64801980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089759 WO2019233362A1 (fr) 2018-06-05 2019-06-03 Procédé d'amélioration de la qualité de la parole basés sur un apprentissage profond, dispositif et système

Country Status (2)

Country Link
CN (2) CN113870872A (fr)
WO (1) WO2019233362A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870872A Deep learning-based voice sound quality enhancement method, device and system
CN110022400A Voice call output method, terminal and computer-readable storage medium
CN111833892A Audio and video data processing method and apparatus
CN113748460A Bandwidth extension of incoming data using neural networks
CN111429930B Noise reduction model processing method and system based on an adaptive sampling rate
US20220365799A1 (en) * 2021-05-17 2022-11-17 Iyo Inc. Using machine learning models to simulate performance of vacuum tube audio hardware

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5230103B2 Method and system for generating training data for an automatic speech recognizer
WO2014039828A2 (fr) * 2012-09-06 2014-03-13 Simmons Aaron M Procédé et système d'apprentissage de la fluidité de lecture
CN103236262B Transcoding method for speech coder bitstreams
CN103531205B Asymmetric voice conversion method based on deep neural network feature mapping
CN104318927A Noise-resistant low-rate speech encoding method and decoding method
CN104464744A Clustered voice conversion method and system based on Gaussian mixture random processes
CN107622777B High-bit-rate signal acquisition method based on overcomplete dictionary pairs
CN106997767A Artificial intelligence-based speech processing method and apparatus
CN107274883B Speech signal reconstruction method and apparatus
CN107564538A Intelligibility enhancement method and system for real-time voice communication

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170162194A1 (en) * 2015-12-04 2017-06-08 Conexant Systems, Inc. Semi-supervised system for multichannel source enhancement through configurable adaptive transformations and deep neural network
CN107516527A Speech coding and decoding method and terminal
CN107358966A No-reference objective speech quality assessment method based on deep-learning speech enhancement
CN107845389A Speech enhancement method based on multi-resolution auditory cepstral coefficients and a deep convolutional neural network
CN109147806A Deep learning-based voice sound quality enhancement method, device and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681669A Neural network-based voice data recognition method and device
CN114863940A Model training method for sound quality conversion, and method, device and medium for improving sound quality
CN114863942A Model training method for sound quality conversion, and method and device for improving voice sound quality
CN114863940B Model training method for sound quality conversion, and method, device and medium for improving sound quality
CN114863942B Model training method for sound quality conversion, and method and device for improving voice sound quality

Also Published As

Publication number Publication date
CN109147806B (zh) 2021-11-12
CN113870872A (zh) 2021-12-31
CN109147806A (zh) 2019-01-04

Similar Documents

Publication Publication Date Title
WO2019233362A1 Deep learning-based speech quality enhancement method, device and system
WO2019233364A1 Deep learning-based audio quality enhancement
US9536540B2 (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
JP6599362B2 High-band excitation signal generation
US8554550B2 (en) Systems, methods, and apparatus for context processing using multi resolution analysis
WO2015154397A1 Noise signal processing and generation method, encoder/decoder and encoding/decoding system
TW201214419A (en) Systems, methods, apparatus, and computer program products for wideband speech coding
WO2017206432A1 Voice signal processing method, and related device and system
WO2021179788A1 Voice signal encoding and decoding methods, apparatuses, electronic device, and storage medium
US20130332171A1 (en) Bandwidth Extension via Constrained Synthesis
WO2023241240A1 Audio processing method and apparatus, electronic device, computer-readable storage medium and computer program product
JP6573887B2 Audio signal encoding method, decoding method and apparatus therefor
JP6258522B2 Apparatus and method for switching coding technologies in a device
WO2023241205A1 Image processing method and apparatus, electronic device, computer-readable storage medium and computer program product
Romano et al. A real-time audio compression technique based on fast wavelet filtering and encoding
CN115116451A Audio decoding and encoding method and apparatus, electronic device, and storage medium
Jose Amrconvnet: Amr-coded speech enhancement using convolutional neural networks
Xu et al. A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement
Maes et al. Conversational networking: conversational protocols for transport, coding, and control.
WO2023069805A1 Audio signal reconstruction
KR20220050924A (ko) 오디오 코딩을 위한 다중 래그 형식
CN117219095A Audio encoding method, audio decoding method, apparatus, device, and storage medium
CN117594035A Multi-modal speech separation and recognition method and apparatus, refrigerator, and storage medium
Li et al. A device of speech signal coding/decoding upon sieve of eratosthenes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815463

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815463

Country of ref document: EP

Kind code of ref document: A1