WO2021052287A1 - Frequency band extension method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Frequency band extension method and apparatus, electronic device, and computer-readable storage medium

Info

Publication number
WO2021052287A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency, spectrum, low, sub, initial
Prior art date
Application number
PCT/CN2020/115052
Other languages
English (en)
French (fr)
Inventor
Xiao Wei (肖玮)
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP20864964.0A (EP3920182A4)
Priority to JP2021558882A (JP7297368B2)
Publication of WO2021052287A1
Priority to US17/468,662 (US11763829B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 ... using subband decomposition
    • G10L19/0208 Subband vocoders
    • G10L19/0212 ... using orthogonal transformation
    • G10L19/0216 ... using wavelet decomposition
    • G10L19/032 Quantisation or dequantisation of spectral components
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/038 ... using band spreading techniques
    • G10L21/0388 Details of processing therefor
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 ... characterised by the analysis technique
    • G10L25/30 ... using neural networks

Definitions

  • The embodiments of the present application relate to the field of audio processing technology; specifically, the present application relates to a frequency band extension method and apparatus, an electronic device, and a computer-readable storage medium.
  • Band extension, also called band replication, is a classic technique in the field of audio coding.
  • Band extension is a parametric coding technique. Through band extension, the effective bandwidth can be expanded at the receiving end to improve the quality of audio signals, so that users perceive a brighter tone, a louder volume, and better intelligibility.
  • A classic implementation of frequency band extension exploits the correlation between the high-frequency and low-frequency parts of the speech signal to perform band extension.
  • The above correlation is used as side information: at the encoding end, the side information is merged into the bitstream and transmitted, and the decoding end first restores the low-frequency spectrum through decoding and then performs a band extension operation to restore the high-frequency spectrum.
  • However, this method requires the system to consume corresponding bits (for example, on top of encoding the low-frequency information, an additional 10% of the bits are used to encode the above side information); that is, additional bits are needed for encoding, and there is a forward-compatibility problem.
  • Another commonly used frequency band extension method is a blind scheme based on data analysis. This scheme is based on neural networks or deep learning: the input is low-frequency coefficients and the output is high-frequency coefficients.
  • This coefficient-to-coefficient mapping places high demands on the generalization ability of the network; to guarantee the effect, the network depth and volume must be large, so the complexity is high, and in practice the performance of this method is mediocre in scenarios beyond the patterns contained in the training library.
  • An embodiment of the present application provides a frequency band extension method, which is executed by an electronic device and includes: performing time-frequency transformation on the narrowband signal to be processed to obtain the corresponding initial low-frequency spectrum;
  • based on the initial low-frequency spectrum, obtaining correlation parameters between the high-frequency part and the low-frequency part of the target wideband spectrum through a neural network model, where the correlation parameters include at least one of a high-frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes the correlation between the spectral flatness of the high-frequency part of the target wideband spectrum and the spectral flatness of the low-frequency part;
  • obtaining an initial high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum;
  • obtaining a band-extended wideband signal according to a target low-frequency spectrum and a target high-frequency spectrum, where the target low-frequency spectrum is the initial low-frequency spectrum or the spectrum obtained by filtering the initial low-frequency spectrum, and the target high-frequency spectrum is the initial high-frequency spectrum or the spectrum obtained by filtering the initial high-frequency spectrum.
  • An embodiment of the present application also provides a frequency band extension device, including:
  • a low-frequency spectrum determination module, configured to perform time-frequency transformation on the narrowband signal to be processed to obtain the corresponding initial low-frequency spectrum;
  • a correlation parameter determination module, configured to obtain, based on the initial low-frequency spectrum and through the neural network model, correlation parameters between the high-frequency part and the low-frequency part of the target wideband spectrum, where the correlation parameters include at least one of a high-frequency spectrum envelope and relative flatness information, and the relative flatness information represents the correlation between the spectral flatness of the high-frequency part of the target wideband spectrum and the spectral flatness of the low-frequency part;
  • a high-frequency spectrum determination module, configured to obtain the initial high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum;
  • a wideband signal determination module, configured to obtain the band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum, where the target low-frequency spectrum is the initial low-frequency spectrum or the spectrum obtained by filtering the initial low-frequency spectrum, and the target high-frequency spectrum is the initial high-frequency spectrum or the spectrum obtained by filtering the initial high-frequency spectrum.
  • an electronic device which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the above-mentioned frequency band extension method when the program is executed.
  • a computer-readable storage medium is provided, and a computer program is stored on the computer-readable storage medium, and when the program is executed by a processor, the above-mentioned frequency band extension method is realized.
  • Fig. 1A shows a scene diagram of a frequency band extension method provided in an embodiment of the present application.
  • FIG. 1B is a schematic flowchart of a frequency band extension method according to an embodiment of the application.
  • FIG. 2 is a schematic diagram of the network structure of the neural network model according to an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a frequency band extension method in a first example of an embodiment of the application.
  • FIG. 4 is a schematic flowchart of a frequency band extension method in a second example of an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a frequency band extension apparatus according to an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the application.
  • BWE: Band Width Extension, i.e. frequency band extension.
  • Spectrum envelope: the energy representation of the spectral coefficients of a signal along the frequency axis; for a sub-band, it is the energy representation of the spectral coefficients corresponding to that sub-band, for example the average energy of the corresponding spectral coefficients.
  • Spectral Flatness (SF): characterizes how flat the power of the signal under test is within its channel.
  • Neural Network: an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes.
  • Deep learning: a type of machine learning; deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data.
  • PSTN: Public Switched Telephone Network.
  • VoIP: Voice over Internet Protocol, i.e. Internet telephony, a voice-call technology that carries voice calls and multimedia conferences over the Internet Protocol, that is, communication via the Internet.
  • 3GPP: 3rd Generation Partnership Project.
  • EVS: Enhanced Voice Services, a new-generation speech and audio codec that not only provides very high audio quality for speech and music signals, but also has strong robustness against frame loss and delay jitter, bringing a new experience to users.
  • Opus: a lossy audio coding format developed by the Internet Engineering Task Force (IETF).
  • SILK: an audio codec used for Skype VoIP, licensed royalty-free to third-party developers and hardware manufacturers.
  • frequency band extension is a classic technology in the audio coding field.
  • frequency band extension can be achieved in the following ways:
  • In the first method, for a narrowband signal at a low sampling rate, the low-frequency part of the spectrum of the narrowband signal is copied to the high frequency; then, according to side information recorded in advance (information describing the energy correlation between the high-frequency and low-frequency parts), the narrowband signal is expanded into a wideband signal.
  • The second method does not require additional bits and completes the band extension directly: the input to the neural network or deep-learning model is the low-frequency spectrum of the narrowband signal, the output is the high-frequency spectrum, and the narrowband signal is expanded into a wideband signal based on that high-frequency spectrum.
  • In the first method, the side information consumes corresponding bits and raises a forward-compatibility problem, for example in the scenario where PSTN narrowband voice is extended to VoIP wideband voice (PSTN-VoIP).
  • In the second method, where the input is the low-frequency spectrum and the output is the high-frequency spectrum, no additional bits are consumed, but the requirement on the generalization ability of the network is high: the network needs a large depth and volume, its complexity is high, and its performance is poor outside the training data. Therefore, neither of the above two band extension methods can meet the performance requirements of practical band extension.
  • In view of this, the embodiments of the present application provide a frequency band extension method that does not require additional bits while reducing the depth, volume, and complexity of the network.
  • In the embodiments of the present application, the PSTN (narrowband voice) and VoIP (wideband voice) intercommunication scenario is taken as an example to describe the solution of the present application, that is, in the transmission direction from PSTN to VoIP (PSTN-VoIP), narrowband voice is extended to wideband voice.
  • However, the present application does not limit the above application scenario; it is also applicable to other coding systems, including but not limited to mainstream audio codecs such as 3GPP EVS, IETF Opus, and SILK.
  • In the embodiments of the present application, the sampling rate is 8000 Hz and the frame length of one voice frame is 10 ms (equivalent to 80 sample points per frame).
  • Since the PSTN frame length is 20 ms, only two such operations are required for each PSTN frame.
  • The description takes a data frame length fixed at 10 ms as an example; if the frame length takes other values, such as 20 ms (equivalent to 160 sample points per frame), the present application still applies, and the frame length is not limited here.
  • Likewise, the sampling rate of 8000 Hz in the embodiments of the present application is only an example and is not intended to limit the scope of the frequency band extension provided by the embodiments of the present application.
  • The main embodiment of the present application extends a signal with a sampling rate of 8000 Hz to a signal with a sampling rate of 16000 Hz; the present application can also be applied to other sampling-rate scenarios, such as extending a signal with a sampling rate of 16000 Hz to a signal with a sampling rate of 32000 Hz, or a signal with a sampling rate of 8000 Hz to a signal with a sampling rate of 12000 Hz, and so on.
  • the solutions of the embodiments of the present application can be applied to any scenario where signal frequency band expansion is required.
  • Fig. 1A shows an application scenario diagram of a frequency band extension method provided in an embodiment of the present application.
  • The electronic device may include a mobile phone 110 or a notebook computer 112, but is not limited thereto; the electronic device being the mobile phone 110 is taken as an example below, and the other cases are similar.
  • the mobile phone 110 communicates with the server device 13 through the network 12.
  • the server device 13 includes a neural network model.
  • the mobile phone 110 inputs the narrowband signal to be processed into the neural network model in the server device 13, and obtains and outputs a wideband signal with an expanded frequency band through the method shown in FIG. 1B.
  • the neural network model is located in the server device 13, in another implementation manner, the neural network model may be located in an electronic device (not shown in the figure).
  • An example of the present application provides a frequency band extension method, which is executed by an electronic device as shown in FIG. 6, and the electronic device may be a terminal or a server.
  • the terminal can be a desktop device or a mobile terminal.
  • the server can be an independent physical server, a cluster of physical servers, or a virtual server. As shown in Figure 1B, the method includes:
  • Step S110 time-frequency transformation is performed on the narrowband signal to be processed to obtain the corresponding initial low frequency spectrum.
  • the initial low-frequency spectrum is obtained by performing time-frequency transformation on the narrowband signal.
  • the time-frequency transformation includes but is not limited to Fourier transform, discrete cosine transform, discrete sine transform, wavelet transform, and the like.
  • the narrowband signal to be processed may be a voice frame signal that requires frequency band expansion.
  • the narrowband signal to be processed may be a PSTN narrowband voice signal.
  • When the narrowband signal to be processed is the signal of a speech frame, it may be all or part of the speech signal of that speech frame.
  • The signal can be taken as a whole as the narrowband signal to be processed and band-extended in one pass, or it can be divided into multiple sub-signals that are band-extended separately. For example, when the frame length of the above PSTN frame is 20 ms, the signal of the 20 ms speech frame can be band-extended once, or the 20 ms speech frame can be divided into two 10 ms speech frames and the two 10 ms speech frames band-extended respectively.
  • Step S120 based on the initial low frequency spectrum, obtain correlation parameters between the high frequency part and the low frequency part of the target broadband spectrum through the neural network model, where the correlation parameter includes at least one of the high frequency spectrum envelope and relative flatness information,
  • the relative flatness information represents the correlation between the spectral flatness of the high-frequency part of the target broadband spectrum and the spectral flatness of the low-frequency part.
  • the neural network model may be a model trained in advance based on the low-frequency spectrum of the signal, and the model is used to predict the correlation parameter of the signal.
  • The target wideband spectrum refers to the spectrum corresponding to the wideband signal to which the narrowband signal is to be extended; it is obtained based on the low-frequency spectrum of the voice signal to be processed, for example by copying that low-frequency spectrum.
  • Step S130 Obtain an initial high frequency spectrum based on the correlation parameter and the initial low frequency spectrum.
  • Since the correlation parameters characterize the correlation between the high-frequency and low-frequency parts, based on the correlation parameters and the initial low-frequency spectrum (the parameters corresponding to the low-frequency part), the initial high-frequency spectrum of the wideband signal to be extended (that is, the parameters corresponding to the high-frequency part of the wideband signal) can be predicted.
  • Step S140: obtain a band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum, where the target low-frequency spectrum is the initial low-frequency spectrum or the spectrum obtained by filtering the initial low-frequency spectrum, and the target high-frequency spectrum is the initial high-frequency spectrum or the spectrum obtained by filtering the initial high-frequency spectrum.
  • the narrowband signal usually needs to be quantized, and quantization noise is generally introduced during the quantization process.
  • Therefore, the initial low-frequency spectrum can be filtered to obtain the corresponding target low-frequency spectrum, so as to filter out the quantization noise in the initial low-frequency spectrum; the band-extended wideband signal is then obtained based on the target low-frequency spectrum, which prevents the quantization noise from being extended into the wideband signal.
  • Similarly, the initial high-frequency spectrum can first be filtered to obtain the corresponding target high-frequency spectrum, so as to effectively filter out the noise that may exist in the initial high-frequency spectrum; the band-extended wideband signal is then obtained based on the target high-frequency spectrum, which enhances the signal quality of the wideband signal and further improves the user's listening experience.
  • Obtaining the band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum includes any of the following situations:
  • When only the initial low-frequency spectrum is filtered, the band-extended wideband signal is obtained according to the initial high-frequency spectrum (without filtering) and the target low-frequency spectrum: the initial high-frequency spectrum is first combined with the target low-frequency spectrum, and a time-frequency inverse transformation (that is, a frequency-to-time transformation) is then performed on the combined spectrum to obtain a new wideband signal, thereby realizing the band extension of the narrowband signal to be processed.
  • When only the initial high-frequency spectrum is filtered, the band-extended wideband signal is obtained according to the initial low-frequency spectrum (without filtering) and the target high-frequency spectrum: the initial low-frequency spectrum is first combined with the target high-frequency spectrum, and a time-frequency inverse transformation (that is, a frequency-to-time transformation) is then performed on the combined spectrum to obtain a new wideband signal, thereby realizing the band extension of the narrowband signal to be processed.
  • In other words, when the target high-frequency spectrum is the spectrum obtained by filtering the initial high-frequency spectrum and the target low-frequency spectrum is the initial low-frequency spectrum, the band-extended wideband signal is obtained according to the target low-frequency spectrum and the target high-frequency spectrum by first combining the target low-frequency spectrum with the target high-frequency spectrum and then performing a time-frequency inverse transformation (that is, a frequency-to-time transformation) on the combined spectrum.
  • Since the bandwidth of the extended wideband signal is greater than the bandwidth of the narrowband signal to be processed, a speech frame with a brighter tone and louder volume can be obtained based on the wideband signal, giving the user a better listening experience.
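  • As a non-normative illustration of this merge-and-inverse-transform step for the STFT case, the following sketch combines a target low-frequency spectrum and a target high-frequency spectrum into one one-sided spectrum and applies the inverse transform; the bin counts follow the 320-point example used later in this description, and the windowing and overlap-add details of a real implementation are omitted.
```python
import numpy as np

def synthesize_wideband(target_low: np.ndarray, target_high: np.ndarray) -> np.ndarray:
    """Combine the target low-frequency and high-frequency spectra (complex bins, DC..Nyquist)
    and apply the frequency-to-time (inverse) transform to obtain a wideband time-domain frame."""
    merged = np.concatenate([target_low, target_high])   # merged one-sided spectrum
    return np.fft.irfft(merged)                          # real wideband time-domain signal

low = np.fft.rfft(np.random.randn(320))[:161]            # stand-in target low-frequency spectrum (161 bins)
high = np.random.randn(160) + 1j * np.random.randn(160)  # stand-in target high-frequency spectrum (160 bins)
wideband = synthesize_wideband(low, high)                # 640 time-domain samples
```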
  • In the frequency band extension method, in the process of obtaining the band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum, at least one of the initial low-frequency spectrum and the initial high-frequency spectrum is filtered. Before the wideband signal is obtained, the initial low-frequency spectrum can be filtered, which effectively filters out the quantization noise that may be introduced when the narrowband signal is quantized; the initial high-frequency spectrum can also be filtered, which effectively filters out the noise introduced when the initial low-frequency spectrum is band-extended, enhancing the signal quality of the wideband signal and further improving the user's listening experience.
  • Moreover, band extension with this scheme does not require side information to be recorded in advance, that is, no additional bandwidth is required.
  • The target wideband spectrum refers to the spectrum corresponding to the wideband signal (the target wideband signal) to which the narrowband signal is intended to be extended, and it is obtained based on the low-frequency spectrum of the voice signal to be processed; for example, the target wideband spectrum may be obtained by copying the low-frequency spectrum of the voice signal to be processed.
  • the neural network model may be a model trained in advance based on sample data.
  • Each sample data includes a sample narrowband signal and a sample wideband signal corresponding to the sample narrowband signal.
  • For each piece of sample data, the correlation parameter between the high-frequency part and the low-frequency part of the spectrum of its sample wideband signal can be determined (this parameter can be understood as the label information of the sample data, that is, the sample label, referred to as the labeling result).
  • The correlation parameter includes the high-frequency spectrum envelope and may also include the relative flatness information of the high-frequency part and the low-frequency part of the spectrum of the sample wideband signal.
  • During training, the input of the initial neural network model is the low-frequency spectrum of the sample narrowband signal, and the output is the predicted correlation parameter (referred to as the prediction result).
  • Whether the model training is finished can be determined from the similarity between the prediction result and the labeling result of each piece of sample data, for example by checking whether the model's loss function has converged; the loss function characterizes the degree of difference between the prediction results and the labeling results of the sample data. The model at the end of training is used as the neural network model applied in the embodiments of the present application.
  • the low frequency spectrum of the narrowband signal can be input into the trained neural network model to obtain the correlation parameter corresponding to the narrowband signal.
  • Since the model is trained on sample data whose sample labels are the correlation parameters between the high-frequency part and the low-frequency part of the sample wideband signals, the correlation parameter of the narrowband signal obtained from the output of the neural network model can well characterize the correlation between the high-frequency part and the low-frequency part of the spectrum of the target wideband signal.
  • Because the correlation parameter characterizes the correlation between the high-frequency part and the low-frequency part of the target wideband spectrum, the initial high-frequency spectrum of the wideband signal to be extended (that is, the parameters corresponding to the high-frequency part of the wideband signal) can be predicted based on the correlation parameter and the initial low-frequency spectrum (the parameters corresponding to the low-frequency part).
  • In the solution of the embodiments of the present application, the correlation parameters between the high-frequency part and the low-frequency part of the target wideband spectrum are obtained through the neural network model. Since prediction is performed by the neural network model, no additional bits need to be encoded; the scheme is a blind-analysis method with good forward compatibility. Moreover, because the output of the model is a parameter reflecting the correlation between the high-frequency and low-frequency parts of the target wideband spectrum, the scheme realizes a mapping from spectral parameters to correlation parameters, which has better generalization ability than the existing coefficient-to-coefficient mapping and yields a signal with a brighter tone and louder volume, giving the user a better listening experience.
  • the initial low-frequency spectrum is obtained by time-frequency transformation of the narrowband signal to be processed.
  • the time-frequency transformation includes, but is not limited to, Fourier transform, discrete cosine transform, discrete sine transform, and wavelet. Transformation and so on.
  • Determining the initial low-frequency spectrum of the narrowband signal to be processed may include: up-sampling the narrowband signal, performing time-frequency transformation on the up-sampled signal to obtain low-frequency frequency-domain coefficients, and determining the low-frequency frequency-domain coefficients as the initial low-frequency spectrum.
  • The following description takes a voice signal with a sampling rate of 8000 Hz and a frame length of 10 ms per voice frame as an example.
  • The PSTN signal sampling rate is 8000 Hz, so the effective bandwidth of the narrowband signal is 4000 Hz, and the purpose of this example is to extend the narrowband signal to obtain a signal with a bandwidth of 8000 Hz, that is, the bandwidth of the wideband signal is 8000 Hz.
  • In practice, the upper bound of the general effective bandwidth is 3500 Hz, so the effective bandwidth of the wideband signal actually obtained in this solution is 7000 Hz; in other words, the purpose of this example is to extend a narrowband signal with an effective bandwidth of 3500 Hz to obtain a wideband signal with an effective bandwidth of 7000 Hz, that is, to extend a signal with a sampling rate of 8000 Hz to a signal with a sampling rate of 16000 Hz.
  • The sampling factor is 2: up-sampling with a factor of 2 is performed on the narrowband signal to obtain an up-sampled signal with a sampling rate of 16000 Hz. Since the sampling rate of the narrowband signal is 8000 Hz and the frame length is 10 ms, the up-sampled signal corresponds to 160 sample points.
  • Time-frequency transformation is then performed on the up-sampled signal to obtain the initial low-frequency frequency-domain coefficients.
  • These initial low-frequency frequency-domain coefficients can be used as the initial low-frequency spectrum for the subsequent calculation of the low-frequency spectrum envelope, the low-frequency amplitude spectrum, and so on.
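  • As an illustrative sketch of this up-sampling step (not a normative implementation; the function name is ours), the following Python snippet up-samples one 10 ms narrowband frame of 80 samples at 8000 Hz by a factor of 2 to obtain 160 samples at 16000 Hz.
```python
import numpy as np
from scipy.signal import resample_poly  # polyphase resampler

def upsample_frame(narrowband_frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Up-sample one 10 ms narrowband frame (80 samples at 8 kHz) by the given factor."""
    return resample_poly(narrowband_frame, up=factor, down=1)

frame_8k = np.random.randn(80)        # stand-in for a real 10 ms PSTN frame
frame_16k = upsample_frame(frame_8k)  # up-sampled signal: 160 sample points at 16 kHz
assert frame_16k.shape[0] == 160
```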
  • The aforementioned Fourier transform may be a Short-Time Fourier Transform (STFT), and the aforementioned discrete cosine transform may be a Modified Discrete Cosine Transform (MDCT).
  • When time-frequency transformation is performed on the up-sampled signal, in order to eliminate the discontinuity of the data between frames, the sample points corresponding to the previous speech frame and the sample points corresponding to the current speech frame (the narrowband signal to be processed) can be combined into one array, and the points in the array are then windowed to obtain the windowed signal.
  • A Hanning window can be used for the windowing. After the Hanning-window processing, an STFT can be performed on the windowed signal to obtain the corresponding low-frequency frequency-domain coefficients. Taking into account the conjugate symmetry of the Fourier transform and the fact that the first coefficient is the DC component, if M low-frequency frequency-domain coefficients are obtained, then (1 + M/2) low-frequency frequency-domain coefficients can be selected for subsequent processing.
  • Specifically, the STFT of the above up-sampled signal containing 160 sample points is performed as follows: the 160 sample points corresponding to the previous speech frame and the 160 sample points corresponding to the current speech frame (the narrowband signal to be processed) form an array of 320 sample points.
  • Hanning-window processing is performed on the sample points in the array to obtain the windowed signal s_Low(i, j), and a Fourier transform is then performed on s_Low(i, j) to obtain 320 low-frequency frequency-domain coefficients S_Low(i, j), where i is the frame index of the speech frame and j is the index of the coefficient within the frame.
  • Because of the conjugate symmetry and because the first coefficient is the DC component, only the first 161 low-frequency frequency-domain coefficients need to be considered, and the 2nd to the 161st of these coefficients are used as the above-mentioned initial low-frequency spectrum.
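  • The windowing-and-STFT step just described can be sketched as follows (an illustrative, non-normative snippet; the variable names are ours):
```python
import numpy as np

def stft_low_band(prev_frame_16k: np.ndarray, cur_frame_16k: np.ndarray) -> np.ndarray:
    """Concatenate two 160-sample frames, apply a Hanning window, and return the
    first 161 FFT coefficients (DC component plus 160 non-redundant bins)."""
    buf = np.concatenate([prev_frame_16k, cur_frame_16k])  # 320 sample points
    windowed = buf * np.hanning(len(buf))                  # windowed signal s_Low(i, j)
    spectrum = np.fft.fft(windowed)                        # 320 coefficients S_Low(i, j)
    return spectrum[:161]                                  # keep DC + 160 coefficients

coeffs = stft_low_band(np.zeros(160), np.random.randn(160))
initial_low_spectrum = coeffs[1:161]  # 2nd..161st coefficients as the initial low-frequency spectrum
```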
  • a cosine window can be used for windowing processing. After performing the windowing processing of the cosine window, MDCT can be performed on the windowed signal to obtain the corresponding low-frequency frequency domain coefficients, and subsequent processing is performed based on the low-frequency frequency domain coefficients.
  • the time-frequency transform includes Fourier transform or discrete cosine transform.
  • When the time-frequency transform is a Fourier transform (such as the STFT), the initial low-frequency spectrum is in complex form, so a real-valued low-frequency amplitude spectrum can be obtained from the complex-form initial low-frequency spectrum, and subsequent processing is then performed based on the low-frequency amplitude spectrum; that is, based on the initial low-frequency spectrum, the correlation parameters between the high-frequency part and the low-frequency part of the target wideband spectrum are obtained through the neural network model.
  • When the time-frequency transform is a discrete cosine transform (such as the MDCT), the initial low-frequency spectrum is in real form, so subsequent processing can be performed directly on the real-form initial low-frequency spectrum; that is, the initial low-frequency spectrum can be input into the neural network model, and the correlation parameters between the high-frequency part and the low-frequency part of the target wideband spectrum are obtained based on the output of the neural network model.
  • When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the correlation parameters of the high-frequency part and the low-frequency part can be obtained in a similar manner and are not repeated here.
  • In some embodiments, the input of the neural network model also includes the low-frequency spectrum envelope.
  • When the time-frequency transform is a Fourier transform, the low-frequency amplitude spectrum of the narrowband signal can be obtained according to the initial low-frequency spectrum, and the low-frequency spectrum envelope of the narrowband signal is then determined according to the low-frequency amplitude spectrum; that is, the low-frequency spectrum envelope of the narrowband signal is determined based on the initial low-frequency spectrum.
  • When the time-frequency transform is a discrete cosine transform (such as the MDCT), the low-frequency spectrum envelope of the narrowband signal can be obtained directly from the initial low-frequency spectrum; that is, the low-frequency spectrum envelope of the narrowband signal is determined based on the initial low-frequency spectrum.
  • In either case, the low-frequency spectrum envelope can be used as an input of the neural network model; that is, the input of the neural network model also includes the low-frequency spectrum envelope.
  • The low-frequency spectrum envelope of the narrowband signal is information related to the signal's spectrum. Using the low-frequency spectrum envelope as an input of the neural network model allows more accurate correlation parameters to be obtained: when the time-frequency transform is the MDCT, the low-frequency spectrum envelope and the initial low-frequency spectrum are input into the neural network model to obtain the correlation parameters; when the time-frequency transform is the STFT, the low-frequency spectrum envelope and the low-frequency amplitude spectrum are input into the neural network model to obtain the correlation parameters.
  • When the time-frequency transform is a Fourier transform (such as the STFT), after the initial low-frequency spectrum is obtained, the low-frequency amplitude spectrum of the narrowband signal can be determined based on the initial low-frequency spectrum. Specifically, the low-frequency amplitude spectrum is calculated by formula (1):
  • P_Low(i, j) = SQRT( Real(S_Low(i, j))^2 + Imag(S_Low(i, j))^2 )    (1)
  • where P_Low(i, j) represents the low-frequency amplitude spectrum, S_Low(i, j) is the initial low-frequency spectrum, Real and Imag are the real and imaginary parts of the initial low-frequency spectrum respectively, and SQRT denotes the square-root operation.
  • The 70 low-frequency amplitude spectrum coefficients so calculated can be used directly as the low-frequency amplitude spectrum of the narrowband signal. Further, for convenience of calculation, the low-frequency amplitude spectrum can also be converted to the logarithmic domain, that is, a logarithmic operation is applied to the amplitude spectrum calculated by formula (1), and the amplitude spectrum after the logarithmic operation is used as the low-frequency amplitude spectrum for subsequent processing.
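  • Formula (1) and the optional conversion to the logarithmic domain can be sketched as follows; this is a hedged illustration, and the logarithm base and the guard against zero-valued bins are our assumptions rather than details specified in the text.
```python
import numpy as np

def low_freq_amplitude_spectrum(S_low: np.ndarray, log_domain: bool = True) -> np.ndarray:
    """Formula (1): P_Low = sqrt(Real^2 + Imag^2), optionally converted to the log domain."""
    p_low = np.sqrt(S_low.real ** 2 + S_low.imag ** 2)  # magnitude of each complex bin
    if log_domain:
        p_low = np.log10(np.maximum(p_low, 1e-12))      # log-domain amplitude spectrum (base 10 assumed)
    return p_low

S_low = np.fft.fft(np.random.randn(320))[1:71]           # stand-in: 70 complex low-frequency bins
P_low = low_freq_amplitude_spectrum(S_low)               # 70 low-frequency amplitude coefficients
```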
  • the low-frequency spectrum envelope of the narrowband signal can be determined based on the low-frequency amplitude spectrum.
  • Determining the low-frequency spectrum envelope may further include: dividing the spectral coefficients of the low-frequency amplitude spectrum into a fourth number of sub-amplitude spectra, and respectively determining the sub-spectrum envelope corresponding to each sub-amplitude spectrum, so that the low-frequency spectrum envelope includes the determined fourth number of sub-spectrum envelopes.
  • One achievable way of dividing the spectral coefficients of the low-frequency amplitude spectrum into a fourth number (denoted M) of sub-amplitude spectra is to perform sub-band division on the narrowband signal to obtain M sub-amplitude spectra; each sub-band may correspond to the same or a different number of spectral coefficients of its sub-amplitude spectrum, and the total number of spectral coefficients over all sub-bands equals the number of spectral coefficients of the low-frequency amplitude spectrum.
  • After the division, the sub-spectrum envelope corresponding to each sub-amplitude spectrum can be determined based on that sub-amplitude spectrum. One possible way to achieve this is to determine the sub-spectrum envelope of each sub-band, that is, the sub-spectrum envelope corresponding to each sub-amplitude spectrum, based on the spectral coefficients of the low-frequency amplitude spectrum belonging to that sub-amplitude spectrum. The M sub-amplitude spectra correspondingly determine M sub-spectrum envelopes, and the low-frequency spectrum envelope then includes the determined M sub-spectrum envelopes.
  • Determining the sub-spectrum envelope corresponding to each sub-amplitude spectrum may include: directly averaging the logarithmic representations of the spectral coefficients contained in each sub-amplitude spectrum, whereby the sub-spectrum envelope corresponding to each sub-amplitude spectrum is obtained.
  • That is, the sub-spectrum envelope corresponding to each sub-amplitude spectrum is determined by formula (2):
  • e_Low(i, k) = (1 / N_k) * SUM_{j in sub-band k} log( P_Low(i, j) )    (2)
  • where e_Low(i, k) represents the sub-spectrum envelope, i is the frame index of the speech frame, k represents the index number of the sub-band, and N_k is the number of spectral coefficients in sub-band k (if the amplitude spectrum has already been converted to the logarithmic domain, the averaging is applied to it directly); the low-frequency spectrum envelope includes the M sub-spectrum envelopes so obtained.
  • In the classic definition, the spectral envelope of a sub-band is the average energy of adjacent coefficients (possibly further converted to a logarithmic representation), but that definition may prevent coefficients with smaller amplitudes from playing a substantial role.
  • In contrast, the embodiment of the present application directly averages the logarithmic representations of the spectral coefficients contained in each sub-amplitude spectrum to obtain the sub-spectrum envelope corresponding to that sub-amplitude spectrum. Compared with the commonly used envelope determination scheme, this better protects coefficients with smaller amplitudes during the distortion control of the neural network model training, so that more signal parameters can play their corresponding role in the band extension.
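  • A minimal sketch of the sub-band envelope of formula (2), assuming 70 log-domain amplitude coefficients split evenly into 14 sub-bands of 5 coefficients each (the equal split follows the MDCT example below; the scheme also allows sub-bands of different sizes):
```python
import numpy as np

def subband_log_envelope(log_amplitude: np.ndarray, num_subbands: int = 14) -> np.ndarray:
    """Formula (2): e_Low(i, k) = mean of the log-amplitude coefficients in sub-band k."""
    bands = np.array_split(log_amplitude, num_subbands)   # 14 sub-amplitude spectra
    return np.array([band.mean() for band in bands])      # one sub-spectrum envelope per sub-band

log_amp = np.log10(np.abs(np.random.randn(70)) + 1e-12)   # stand-in 70-dim log-domain amplitude spectrum
e_low = subband_log_envelope(log_amp)                     # 14-dim low-frequency spectrum envelope
```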
  • the neural network model in this solution is small in size and low in complexity.
  • When the time-frequency transform is a discrete cosine transform (such as the MDCT), the low-frequency spectrum envelope of the narrowband signal can be obtained from the initial low-frequency spectrum; that is, the low-frequency spectrum envelope of the narrowband signal is determined based on the initial low-frequency spectrum.
  • Specifically, the narrowband signal can be divided into sub-bands: the frequency band corresponding to every 5 adjacent low-frequency frequency-domain coefficients is taken as one sub-band, giving 14 sub-bands in total, each corresponding to 5 low-frequency frequency-domain coefficients.
  • The low-frequency spectrum envelope of a sub-band is defined as the average energy of the adjacent low-frequency frequency-domain coefficients. Specifically, it can be calculated by formula (3):
  • e_Low(i, k) = (1/5) * SUM_{j = 5k}^{5k + 4} S_Low(i, j)^2    (3)
  • where e_Low(i, k) represents the sub-spectrum envelope (the low-frequency spectrum envelope of each sub-band), S_Low(i, j) is the initial low-frequency spectrum, i is the frame index of the speech frame, and k is the index of the sub-band. The low-frequency spectrum envelope accordingly includes 14 sub-spectrum envelopes.
  • The 70-dimensional low-frequency frequency-domain coefficients S_Low_rev(i, j) and the 14-dimensional low-frequency spectrum envelope e_Low(i, k) can then be used as the input of the neural network model; that is, the input of the neural network model is 84-dimensional data.
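  • For the MDCT case, formula (3) and the assembly of the 84-dimensional model input can be sketched as follows (illustrative only; the array S_low_rev is simply a placeholder for 70 low-frequency MDCT coefficients):
```python
import numpy as np

def subband_energy_envelope(mdct_coeffs: np.ndarray, coeffs_per_band: int = 5) -> np.ndarray:
    """Formula (3): e_Low(i, k) = average energy of the 5 adjacent coefficients of sub-band k."""
    bands = mdct_coeffs.reshape(-1, coeffs_per_band)   # 14 x 5 for 70 coefficients
    return (bands ** 2).mean(axis=1)                   # 14 average-energy values

S_low_rev = np.random.randn(70)                        # stand-in 70-dim low-frequency MDCT coefficients
e_low = subband_energy_envelope(S_low_rev)             # 14-dim low-frequency spectrum envelope
model_input = np.concatenate([S_low_rev, e_low])       # 84-dimensional neural network input
assert model_input.shape == (84,)
```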
  • When the time-frequency transform is a Fourier transform, the process of obtaining the target high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum may include: generating an initial high-frequency amplitude spectrum based on the low-frequency amplitude spectrum, obtaining the low-frequency spectrum envelope of the narrowband signal to be processed, and adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, whereby the target high-frequency spectrum is obtained.
  • When the time-frequency transform is a discrete cosine transform, the process of obtaining the initial high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum may include: generating a first high-frequency spectrum based on the initial low-frequency spectrum, obtaining the low-frequency spectrum envelope of the narrowband signal, and adjusting the first high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain the initial high-frequency spectrum.
  • The aforementioned method of generating the corresponding high-frequency phase spectrum based on the low-frequency phase spectrum of the narrowband signal may include, but is not limited to, any of the following:
  • The first method is to obtain the corresponding high-frequency phase spectrum by copying the low-frequency phase spectrum.
  • The second method is to flip (fold) the low-frequency phase spectrum so as to obtain a phase spectrum mirroring the low-frequency phase spectrum, and then map the two low-frequency phase spectra onto the corresponding high-frequency frequency points to obtain the corresponding high-frequency phase spectrum.
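  • Both phase-generation options can be sketched as follows; this is one possible reading of the text (in particular, how the copied or folded phase is mapped onto the high-frequency points is an assumption), and the names are illustrative.
```python
import numpy as np

def high_freq_phase(low_phase: np.ndarray, num_high_points: int, method: str = "copy") -> np.ndarray:
    """Generate a high-frequency phase spectrum from the low-frequency phase spectrum."""
    if method == "copy":
        src = low_phase                                      # option 1: copy the low-frequency phase
    else:
        src = np.concatenate([low_phase, low_phase[::-1]])   # option 2: original plus flipped (folded) copy
    reps = int(np.ceil(num_high_points / len(src)))
    return np.tile(src, reps)[:num_high_points]              # map onto the required high-frequency points

phi_low = np.angle(np.fft.fft(np.random.randn(320))[1:71])   # stand-in 70-point low-frequency phase spectrum
phi_high = high_freq_phase(phi_low, num_high_points=70, method="flip")
```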
  • the initial high-frequency amplitude spectrum may be obtained by copying the low-frequency amplitude spectrum.
  • The specific way of copying the low-frequency amplitude spectrum depends on the bandwidth of the wideband signal that finally needs to be obtained and on the bandwidth of the portion of the low-frequency amplitude spectrum selected for copying; with different bandwidths, the copying method also differs. For example, assuming that the bandwidth of the wideband signal is twice that of the narrowband signal, and that the entire low-frequency amplitude spectrum of the narrowband signal is selected for copying, only one copy needs to be made.
  • For example, if the bandwidth of the extended wideband signal is 7 kHz and the bandwidth corresponding to the low-frequency amplitude spectrum selected for copying is 1.75 kHz, that portion of the low-frequency amplitude spectrum can be copied 3 times to obtain the bandwidth corresponding to the initial high-frequency amplitude spectrum (5.25 kHz). If the bandwidth corresponding to the low-frequency amplitude spectrum selected for copying is 3.5 kHz and the bandwidth of the extended wideband signal is 7 kHz, that portion can be copied once to obtain the bandwidth corresponding to the initial high-frequency amplitude spectrum (3.5 kHz).
  • the initial low-frequency spectrum may be copied to obtain the first high-frequency spectrum.
  • the process of copying the initial low-frequency spectrum is similar to the process of copying the low-frequency amplitude spectrum under Fourier transform to obtain the initial high-frequency amplitude spectrum, and will not be repeated here.
  • When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the above generation process of the first high-frequency spectrum for the discrete cosine transform can also be referred to as needed, and is not repeated here.
  • an implementation manner of generating the initial high-frequency amplitude spectrum may be: copying the amplitude spectrum of the high-frequency part of the low-frequency amplitude spectrum to obtain the initial high-frequency amplitude spectrum;
  • An implementation manner of generating the first high-frequency spectrum for the initial low-frequency spectrum may be: copying the spectrum of the high-frequency part in the initial low-frequency spectrum to obtain the first high-frequency spectrum.
  • For example, the amplitude spectrum of the high-frequency part of the low-frequency amplitude spectrum can be selected and copied to obtain the initial high-frequency amplitude spectrum.
  • The low-frequency amplitude spectrum corresponds to 70 frequency points in total. If frequency points 35-69 of the low-frequency amplitude spectrum (the amplitude spectrum of the higher frequency range within it) are selected as the "master" frequency points to be copied, and the bandwidth of the extended wideband signal is 7000 Hz, then the selected frequency points need to be copied so as to obtain an initial high-frequency amplitude spectrum containing 70 frequency points; to this end, frequency points 35-69 of the low-frequency amplitude spectrum can be copied twice to generate the initial high-frequency amplitude spectrum.
  • If instead frequency points 0-69 of the low-frequency amplitude spectrum (70 points in total) are selected as the frequency points to be copied and the bandwidth of the extended wideband signal is 7000 Hz, these 70 frequency points can be copied once to generate the initial high-frequency amplitude spectrum, which also contains 70 frequency points.
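  • The 35-69 "master" copy described above can be sketched as follows (illustrative; the indices follow the example in the text):
```python
import numpy as np

def initial_high_amplitude(p_low: np.ndarray, src_lo: int = 35, src_hi: int = 70,
                           num_high_points: int = 70) -> np.ndarray:
    """Copy the selected high-frequency portion of the low-frequency amplitude spectrum
    (frequency points src_lo..src_hi-1) enough times to fill num_high_points bins."""
    master = p_low[src_lo:src_hi]                      # the 35 "master" frequency points
    reps = int(np.ceil(num_high_points / len(master)))
    return np.tile(master, reps)[:num_high_points]

p_low = np.abs(np.random.randn(70))                    # stand-in 70-point low-frequency amplitude spectrum
p_high_init = initial_high_amplitude(p_low)            # 70-point initial high-frequency amplitude spectrum
```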
  • Because the signal corresponding to the low-frequency amplitude spectrum may contain a large number of harmonics, the signal corresponding to an initial high-frequency amplitude spectrum obtained purely by copying will also contain a large number of harmonics.
  • Therefore, the initial high-frequency amplitude spectrum can be adjusted using the difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope, and the adjusted initial high-frequency amplitude spectrum is used as the target high-frequency amplitude spectrum, which reduces the harmonics in the finally extended wideband signal.
  • Similarly, when the time-frequency transform is a discrete cosine transform, the spectrum of the high-frequency part of the initial low-frequency spectrum can be selected and copied to obtain the first high-frequency spectrum; the process is similar to copying the amplitude spectrum of the high-frequency part of the low-frequency amplitude spectrum in the Fourier-transform case to obtain the initial high-frequency amplitude spectrum, and is not repeated here.
  • When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the above generation process of the first high-frequency spectrum for the discrete cosine transform can also be referred to as needed, and is not repeated here.
  • The high-frequency spectrum envelope and the low-frequency spectrum envelope are both spectrum envelopes in the logarithmic domain.
  • When the time-frequency transform is a Fourier transform, adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain the target high-frequency amplitude spectrum may include: determining a first difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope, and adjusting the initial high-frequency amplitude spectrum based on the first difference to obtain the target high-frequency amplitude spectrum.
  • When the time-frequency transform is a discrete cosine transform, adjusting the first high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope includes: determining a second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope, and adjusting the first high-frequency spectrum based on the second difference to obtain the initial high-frequency spectrum.
  • In both cases, the high-frequency spectrum envelope and the low-frequency spectrum envelope can be represented by spectrum envelopes in the logarithmic domain for ease of calculation: when the time-frequency transform is a Fourier transform, the first difference determined from the log-domain spectrum envelopes is used to adjust the initial high-frequency amplitude spectrum to obtain the target high-frequency amplitude spectrum; when the time-frequency transform is a discrete cosine transform, the first high-frequency spectrum is adjusted based on the second difference determined from the log-domain spectrum envelopes to obtain the initial high-frequency spectrum.
  • When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the above generation process of the initial high-frequency spectrum for the discrete cosine transform can also be referred to as needed, and is not repeated here.
  • When the time-frequency transform is a Fourier transform, the high-frequency spectrum envelope includes a second number of first sub-spectrum envelopes, and the initial high-frequency amplitude spectrum includes a second number of first sub-amplitude spectra, where each first sub-spectrum envelope is determined based on the corresponding first sub-amplitude spectrum in the initial high-frequency amplitude spectrum.
  • When the time-frequency transform is a discrete cosine transform, the high-frequency spectrum envelope includes a third number of second sub-spectrum envelopes, and the first high-frequency spectrum includes a third number of first sub-spectra, where each second sub-spectrum envelope is determined based on the corresponding first sub-spectrum in the first high-frequency spectrum.
  • In other words, in the Fourier-transform case a sub-spectrum envelope is determined based on the corresponding sub-amplitude spectrum in the corresponding amplitude spectrum, and a first sub-spectrum envelope can be determined based on the corresponding sub-amplitude spectrum in the initial high-frequency amplitude spectrum. The number of spectral coefficients corresponding to each sub-amplitude spectrum can be the same or different, so the numbers of spectral coefficients of the sub-amplitude spectra corresponding to different sub-spectrum envelopes may also differ.
  • Likewise, in the discrete-cosine-transform case a sub-spectrum envelope is determined based on the corresponding sub-spectrum in the corresponding spectrum, and a second sub-spectrum envelope can be determined based on the corresponding sub-spectrum in the first high-frequency spectrum. When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the same applies and is not repeated here.
  • For example, when the time-frequency transform is a Fourier transform, the output of the neural network model is a 14-dimensional high-frequency spectrum envelope (the second number is 14), and the input of the neural network model includes the low-frequency amplitude spectrum and the low-frequency spectrum envelope, where the low-frequency amplitude spectrum contains 70 low-frequency frequency-domain coefficients and the low-frequency spectrum envelope contains 14 sub-spectrum envelopes.
  • The input of the neural network model is therefore 84-dimensional data, and the output dimension is much smaller than the input dimension, which reduces the volume and depth of the neural network model and, at the same time, its complexity.
  • the time-frequency transform is a discrete cosine transform
  • the input and output of the neural network model are similar to the neural network model under the Fourier transform described above, and will not be repeated here.
  • When the time-frequency transform is a Fourier transform, the first difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope is determined, and the initial high-frequency amplitude spectrum is adjusted based on the first difference to obtain the target high-frequency amplitude spectrum.
  • When the time-frequency transform is a discrete cosine transform, the second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope is determined, and the first high-frequency spectrum is adjusted based on the second difference to obtain the initial high-frequency spectrum.
  • The high-frequency spectrum envelope obtained through the neural network model may include a second number of first sub-spectral envelopes, and the low-frequency spectrum envelope includes a second number of third sub-spectral envelopes, each of which is determined based on a corresponding sub-amplitude spectrum in the low-frequency amplitude spectrum, that is, one sub-spectral envelope is determined from one corresponding sub-amplitude spectrum. Based on the foregoing scenario as an example, the description will continue.
  • The first difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope is the set of differences between each first sub-spectral envelope and the corresponding third sub-spectral envelope, and adjusting the initial high-frequency amplitude spectrum based on the first difference means adjusting each corresponding first sub-amplitude spectrum based on the first difference between each first sub-spectral envelope and the corresponding third sub-spectral envelope.
  • In the foregoing example, the high-frequency spectrum envelope includes 14 first sub-spectral envelopes and the low-frequency spectrum envelope includes 14 corresponding sub-spectral envelopes; 14 first difference values can be determined from the 14 sub-spectral envelopes of the low-frequency spectrum envelope and the corresponding 14 first sub-spectral envelopes, and based on the 14 first difference values, the first sub-amplitude spectrum corresponding to each sub-band is adjusted.
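  • As an illustration of the adjustment described above, the following Python sketch applies a per-sub-band first difference between the predicted high-frequency sub-spectral envelopes and the low-frequency sub-spectral envelopes to the copied amplitude spectrum. The function name, the use of natural-logarithm envelopes, and the 14-sub-band-by-5-coefficient layout are assumptions for illustration, not the patent's exact notation.
```python
import numpy as np

def adjust_initial_high_amplitude(init_high_amp, high_env, low_env, coeffs_per_band=5):
    """Apply the per-subband first difference (high-frequency envelope minus
    low-frequency envelope, both assumed to be in the natural-log domain)
    to the copied initial high-frequency amplitude spectrum."""
    first_diff = high_env - low_env              # one first difference per sub-band
    gains = np.exp(first_diff)                   # back to the linear domain (log base is an assumption)
    target_high_amp = init_high_amp.copy()
    for k, g in enumerate(gains):
        sl = slice(k * coeffs_per_band, (k + 1) * coeffs_per_band)
        target_high_amp[sl] *= g                 # adjust the corresponding first sub-amplitude spectrum
    return target_high_amp

# Example: 70 copied coefficients, 14 sub-bands of 5 coefficients each
amp = np.abs(np.random.randn(70))
out = adjust_initial_high_amplitude(amp, np.random.randn(14), np.random.randn(14))
```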
  • When the time-frequency transform is a discrete cosine transform, the high-frequency spectrum envelope obtained through the neural network model may include a third number of second sub-spectral envelopes, and the second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope is the set of differences between each second sub-spectral envelope and the corresponding fourth sub-spectral envelope.
  • The process of adjustment based on the second difference is similar to the process of adjustment based on the first difference when the time-frequency transform is a Fourier transform, and will not be repeated here.
  • When the time-frequency transform is a discrete sine transform, wavelet transform, or the like, the corresponding adjustment can refer to the adjustment process under the discrete cosine transform as needed, which will not be repeated here.
  • the correlation parameter also includes relative flatness information, and the relative flatness information characterizes the correlation between the spectral flatness of the high-frequency part of the target broadband spectrum and the spectral flatness of the low-frequency part;
  • the adjustment of the high frequency spectrum information can include:
  • the high-frequency spectrum information is adjusted based on the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, where the high-frequency spectrum information includes the initial high-frequency amplitude spectrum or the first high-frequency spectrum.
  • Specifically, the difference between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope (the first difference or the second difference) can be determined, and then the initial high-frequency amplitude spectrum is adjusted according to the first difference to obtain the target high-frequency amplitude spectrum, or the first high-frequency spectrum is adjusted according to the second difference to obtain the initial high-frequency spectrum.
  • the labeling result may include relative flatness information, that is, the sample label of the sample data includes the relative flatness information of the high-frequency part and the low-frequency part of the sample broadband signal.
  • The relative flatness information is determined based on the high-frequency part and the low-frequency part of the frequency spectrum of the sample broadband signal. Therefore, when the neural network model is applied and the input of the model is the low-frequency spectrum of the narrowband signal, the relative flatness information of the high-frequency part and the low-frequency part of the target broadband spectrum can be predicted based on the output of the neural network model.
  • The relative flatness information reflects how flat the high-frequency part of the target broadband spectrum is relative to its low-frequency part. If the correlation parameter also includes the relative flatness information, the high-frequency spectrum envelope can first be adjusted based on the relative flatness information and the energy information of the low-frequency spectrum, and the high-frequency spectrum information can then be adjusted based on the difference between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, so that the high-frequency part of the resulting broadband signal contains fewer spurious harmonics.
  • the energy information of the low-frequency spectrum can be determined based on the spectral coefficients of the low-frequency amplitude spectrum, and the energy information of the low-frequency spectrum can indicate the flatness of the spectrum.
  • the above-mentioned correlation parameters may include high-frequency spectrum envelope and relative flatness information
  • The neural network model includes at least an input layer and an output layer. The input layer takes a feature vector of low-frequency spectrum parameters (the feature vector includes the 70-dimensional low-frequency amplitude spectrum and the 14-dimensional low-frequency spectrum envelope).
  • The output layer includes at least a unidirectional Long Short-Term Memory (LSTM) layer and two fully connected network layers connected to the LSTM layer, and each fully connected network layer may include at least one fully connected layer. The LSTM layer converts the feature vector processed by the input layer; one fully connected network layer performs a first classification process on the vector value converted by the LSTM layer and outputs the high-frequency spectrum envelope (14-dimensional), and the other fully connected network layer performs a second classification process on the vector value converted by the LSTM layer and outputs the relative flatness information (4-dimensional).
  • FIG. 2 shows a schematic structural diagram of a neural network model provided by an embodiment of the present application.
  • As shown in FIG. 2, the neural network model mainly includes two parts: a unidirectional LSTM layer and two fully connected layers, that is, each fully connected network layer in this example includes one fully connected layer, where the output of one fully connected layer is the high-frequency spectrum envelope and the output of the other fully connected layer is the relative flatness information.
  • The LSTM layer is a kind of recurrent neural network. Its input is the feature vector of the above-mentioned low-frequency spectrum parameters (referred to as the input vector for short). The input vector is processed by the LSTM to obtain a hidden vector of a certain dimension, which serves as the input of the two fully connected layers, and the two fully connected layers then perform classification and prediction respectively. One fully connected layer predicts and outputs a 14-dimensional column vector corresponding to the high-frequency spectrum envelope, and the other fully connected layer predicts and outputs a 4-dimensional column vector whose four values are the four probability values described above, respectively representing the probability that the relative flatness information corresponds to each of the four arrays.
  • In one implementation, the filtered 70-dimensional low-frequency spectrum S Low_rev (i,j) can first be used to obtain the 70-dimensional low-frequency amplitude spectrum P Low (i,j) of the narrowband signal; P Low (i,j) is then used as one input of the neural network model, and the 14-dimensional low-frequency spectrum envelope e Low (i,k) calculated from P Low (i,j) is used as another input, that is, the input of the neural network model is an 84-dimensional feature vector.
  • The neural network model transforms the 84-dimensional feature vector through the LSTM layer (for example, including 256 parameters) to obtain a converted vector value; one fully connected network layer connected to the LSTM layer (for example, including 512 parameters) classifies the converted vector value (the first classification process) and outputs the 14-dimensional high-frequency spectrum envelope e High (i,k), while another fully connected network layer connected to the LSTM layer (for example, including 512 parameters) performs classification processing on the converted vector value (the second classification process) and outputs 4 pieces of relative flatness information.
  • In another implementation, the filtered 70-dimensional low-frequency spectrum S Low_rev (i,j) can be used directly as one input of the neural network model, and the 14-dimensional low-frequency spectrum envelope e Low (i,k) obtained from S Low_rev (i,j) is used as another input, that is, the input of the neural network model is an 84-dimensional feature vector.
  • The neural network model then transforms the 84-dimensional feature vector through the LSTM layer (for example, including 256 parameters) to obtain a converted vector value; one fully connected network layer connected to the LSTM layer (for example, including 512 parameters) classifies the converted vector value (the first classification process) and outputs the 14-dimensional high-frequency spectrum envelope e High (i,k), while another fully connected network layer connected to the LSTM layer (for example, including 512 parameters) performs classification processing on the converted vector value (the second classification process) and outputs 4 pieces of relative flatness information.
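  • For illustration only, the following PyTorch sketch mirrors the topology described above: an 84-dimensional input, a unidirectional LSTM layer, and two fully connected heads producing the 14-dimensional high-frequency spectrum envelope and 4 relative-flatness probabilities. The class name, the reading of "256 parameters"/"512 parameters" as layer sizes, and the use of a softmax on the flatness head are assumptions, not the patent's exact specification.
```python
import torch
import torch.nn as nn

class BandExtensionNet(nn.Module):
    """Sketch of the described model: 84-dim features in, 14-dim high-frequency
    spectrum envelope and 4 relative-flatness probabilities out."""

    def __init__(self, in_dim=84, lstm_hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=lstm_hidden, batch_first=True)
        self.env_head = nn.Linear(lstm_hidden, 14)   # outputs the high-frequency spectrum envelope
        self.flat_head = nn.Linear(lstm_hidden, 4)   # outputs scores over the 4 flatness arrays

    def forward(self, x):
        # x: (batch, time, 84) sequence of per-frame feature vectors
        h, _ = self.lstm(x)
        high_env = self.env_head(h)                            # (batch, time, 14)
        flat_prob = torch.softmax(self.flat_head(h), dim=-1)   # (batch, time, 4) probabilities
        return high_env, flat_prob

model = BandExtensionNet()
env, flat = model(torch.randn(1, 1, 84))   # one frame of dummy features
```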
  • In some embodiments, the relative flatness information includes the relative flatness information of at least two sub-band regions corresponding to the high-frequency part, and the relative flatness information corresponding to one sub-band region characterizes the correlation between the spectral flatness of that sub-band region of the high-frequency part and the spectral flatness of the low-frequency part.
  • The relative flatness information is determined based on the high-frequency part and the low-frequency part of the frequency spectrum of the sample broadband signal. Since the lower band of the low-frequency part of the sample narrowband signal contains more harmonics, the high-frequency band of the low-frequency part of the sample narrowband signal can be selected as the reference for determining the relative flatness information.
  • With the high-frequency band of the low-frequency part as the master, the high-frequency part of the sample broadband signal is divided into at least two sub-band regions, and the relative flatness information of each sub-band region is determined based on the frequency spectrum of the corresponding sub-band region and the frequency spectrum of the low-frequency part.
  • The labeling result can include the relative flatness information of each sub-band region, that is, the sample label of the sample data can include the relative flatness information between each sub-band region of the high-frequency part of the sample broadband signal and the low-frequency part.
  • Since the relative flatness information is determined based on the frequency spectrum of the high-frequency part and the frequency spectrum of the low-frequency part of the sample broadband signal, when the neural network model is applied and the input of the model is the low-frequency spectrum of the narrowband signal, the relative flatness information between each sub-band region of the high-frequency part of the target broadband spectrum and its low-frequency part can be predicted based on the output of the neural network model.
  • The spectral parameters of each sub-band region are determined with reference to the spectral parameters of the high-frequency band of the low-frequency part, and correspondingly, the relative flatness information may include the relative flatness information of each sub-band region.
  • When the time-frequency transform is a Fourier transform, the spectrum parameter is the amplitude spectrum; when the time-frequency transform is a discrete cosine transform, the spectrum parameter is the frequency spectrum.
  • The number of spectral coefficients of the high-frequency part of the target broadband spectrum can be the same as or different from that of the low-frequency part, and the number of spectral coefficients corresponding to each sub-band region may be the same or different, as long as the total number of spectral coefficients corresponding to the at least two sub-band regions is consistent with the number of spectral coefficients corresponding to the initial high-frequency amplitude spectrum.
  • As an example, the high-frequency part corresponds to two sub-band regions, namely a first sub-band region and a second sub-band region, and the high-frequency band of the low-frequency part is the frequency band corresponding to the 35th to 69th frequency points.
  • If the number of spectral coefficients corresponding to the first sub-band region is the same as that corresponding to the second sub-band region, and the total number of spectral coefficients corresponding to the two sub-band regions is the same as the number of spectral coefficients corresponding to the low-frequency part, then the frequency band corresponding to the first sub-band region is the band corresponding to the 70th to 104th frequency points, and the frequency band corresponding to the second sub-band region is the band corresponding to the 105th to 139th frequency points.
  • The number of spectral coefficients of the amplitude spectrum of each sub-band region is then 35, the same as the number of spectral coefficients of the amplitude spectrum of the high-frequency band of the low-frequency part.
  • the high frequency part can be divided into 5 subband regions, and each subband region corresponds to 14 spectral coefficients.
  • When the time-frequency transform is a discrete cosine transform, the high-frequency part includes the frequency spectra corresponding to at least two sub-band regions; when the time-frequency transform is a Fourier transform, the high-frequency part includes the amplitude spectra corresponding to at least two sub-band regions. The processing is similar and will not be repeated here.
  • Determining the gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the initial low-frequency spectrum may include: determining a gain adjustment value for each corresponding spectrum envelope part; and adjusting the high-frequency spectrum envelope based on the gain adjustment values may include: adjusting each corresponding spectrum envelope part according to its gain adjustment value.
  • If the high-frequency part includes at least two sub-band regions, the gain adjustment value of the corresponding spectrum envelope part of the high-frequency spectrum envelope can be determined based on the relative flatness information corresponding to each sub-band region and the spectral energy information corresponding to each sub-band region in the low-frequency spectrum, and the corresponding spectrum envelope part is then adjusted based on the determined gain adjustment value.
  • As an example in which the time-frequency transform is a Fourier transform, the at least two sub-band regions are two sub-band regions, namely a first sub-band region and a second sub-band region. The relative flatness information between the first sub-band region and the high-frequency band of the low-frequency part is the first relative flatness information, and the relative flatness information between the second sub-band region and the high-frequency band of the low-frequency part is the second relative flatness information.
  • The envelope part of the high-frequency spectrum envelope corresponding to the first sub-band region can be adjusted based on the gain adjustment value determined from the first relative flatness information and the spectral energy information corresponding to the first sub-band region, and the envelope part corresponding to the second sub-band region can be adjusted based on the gain adjustment value determined from the second relative flatness information and the spectral energy information corresponding to the second sub-band region.
  • When the time-frequency transform is a discrete cosine transform, the determination of the relative flatness information and the gain adjustment values is similar to the determination process when the time-frequency transform is a Fourier transform, and will not be repeated here.
  • During training, the high-frequency band of the low-frequency part of the sample narrowband signal can be selected as the reference for determining the relative flatness information.
  • With the high-frequency band of the low-frequency part as the master, the high-frequency part of the sample broadband signal is divided into at least two sub-band regions, and the relative flatness information of each sub-band region is determined based on the frequency spectrum of that sub-band region of the high-frequency part and the frequency spectrum of the low-frequency part.
  • The sample data includes the sample narrowband signal and the corresponding sample broadband signal, and the relative flatness information of each sub-band region of the high-frequency part of the sample broadband spectrum can be determined from the sample data through variance analysis.
  • Continuing the example with two sub-band regions, the relative flatness information of the high-frequency part and the low-frequency part of the sample broadband signal consists of the first relative flatness information between the first sub-band region and the high-frequency band of the low-frequency part of the sample broadband signal, and the second relative flatness information between the second sub-band region and the high-frequency band of the low-frequency part of the sample broadband signal.
  • Taking the case where the time-frequency transform is the Fourier transform as an example,
  • the process of determining the first relative flatness information and the second relative flatness information is introduced below:
  • the specific determination method of the first relative flatness information and the second relative flatness information may be:
  • formula (4) is the variance of the amplitude spectrum of the low frequency part of the sample narrowband signal
  • formula (5) is the variance of the amplitude spectrum of the first subband region
  • formula (6) is the variance of the amplitude spectrum of the second subband region.
  • In the above formulas, var() represents the variance, and the variance of the spectrum can be expressed based on the corresponding frequency-domain coefficients.
  • S Low,sample (i,j) represents the frequency domain coefficients of the sample narrowband signal.
  • fc(0) represents the first relative flatness information between the amplitude spectrum of the first sub-band region and the amplitude spectrum of the high-frequency band of the low-frequency part, and fc(1) represents the second relative flatness information between the amplitude spectrum of the second sub-band region and the amplitude spectrum of the high-frequency band of the low-frequency part.
  • fc(0) and fc(1) can be classified according to whether they are greater than or equal to 0, and fc(0) and fc(1) can be defined as a two-category array, so the array contains 4 permutations and combinations: {0,0}, {0,1}, {1,0}, {1,1}.
  • the relative flatness information output by the model may be 4 probability values, and the probability values are used to identify the probability that the relative flatness information belongs to the aforementioned 4 arrays.
  • one of the permutations and combinations of the four arrays can be selected as the predicted relative flatness information of the amplitude spectrum of the extended region of the two subbands and the amplitude spectrum of the low frequency part of the high frequency band.
  • Specifically, this can be expressed by formula (9):
  • When the low-frequency spectrum of the narrowband signal is input to the trained neural network model, the relative flatness information of the high-frequency part of the target broadband spectrum can be predicted through the neural network model. If the frequency spectrum corresponding to the high-frequency band of the low-frequency part of the narrowband signal is included in the input of the neural network model, the relative flatness information of at least two sub-band regions of the high-frequency part of the target broadband spectrum can be predicted based on the output of the trained neural network model.
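  • The following sketch illustrates how training labels of this kind could be constructed with the variance analysis described above. Formulas (4)-(9) are not reproduced in the text, so the plain variance comparison and its sign convention are assumptions; only the division into the 35-point reference band and two 35-point sub-band regions follows the example.
```python
import numpy as np

def relative_flatness_labels(low_amp, high_amp):
    """Illustrative label construction: compare the variance of each high-frequency
    sub-band amplitude spectrum with the variance of the high-frequency band of the
    low-frequency part, then binarise the signed results into one of 4 classes."""
    ref = low_amp[35:70]            # high-frequency band of the low-frequency part (points 35-69)
    sub1 = high_amp[0:35]           # first sub-band region (points 70-104 of the wideband spectrum)
    sub2 = high_amp[35:70]          # second sub-band region (points 105-139)
    fc0 = np.var(ref) - np.var(sub1)   # assumed form of the signed comparison
    fc1 = np.var(ref) - np.var(sub2)
    bits = (int(fc0 >= 0), int(fc1 >= 0))
    classes = [(0, 0), (0, 1), (1, 0), (1, 1)]   # the 4 permutations described above
    return bits, classes.index(bits)
```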
  • The high-frequency spectrum envelope includes a first predetermined number of high-frequency sub-spectral envelopes: if the initial low-frequency spectrum is obtained through a Fourier transform, the first predetermined number is the above-mentioned second number; if the initial low-frequency spectrum is obtained through a discrete cosine transform, the first predetermined number is the above-mentioned third number.
  • Determining the gain adjustment value of the corresponding spectrum envelope part in the high-frequency spectrum envelope includes: for each high-frequency sub-spectral envelope, determining the gain adjustment value of that high-frequency sub-spectral envelope according to the spectral energy information corresponding to the spectrum envelope in the low-frequency spectrum envelope that corresponds to the high-frequency sub-spectral envelope, the relative flatness information corresponding to the sub-band region to which the high-frequency sub-spectral envelope belongs, and the spectral energy information corresponding to that sub-band region.
  • Adjusting each corresponding spectrum envelope part in the high-frequency spectrum envelope includes: adjusting the corresponding high-frequency sub-spectral envelope based on its gain adjustment value.
  • The following takes the case where the initial low-frequency spectrum is obtained through a Fourier transform and the first predetermined number is the second number as an example.
  • Each high-frequency sub-spectral envelope of the high-frequency spectrum envelope corresponds to one gain adjustment value, and the gain adjustment value is determined based on the spectral energy information corresponding to the low-frequency sub-spectral envelope that corresponds to the high-frequency sub-spectral envelope, the relative flatness information corresponding to the sub-band region corresponding to that low-frequency sub-spectral envelope, and the spectral energy information corresponding to that sub-band region.
  • The high-frequency spectrum envelope includes the second number of high-frequency sub-spectral envelopes, so the second number of gain adjustment values are determined correspondingly, and the first sub-spectral envelope of each sub-band can be adjusted using the gain adjustment value corresponding to that first sub-spectral envelope.
  • The following takes the 35 frequency points of the first sub-band region as an example. Based on the spectral energy information corresponding to each second sub-spectral envelope, the relative flatness information corresponding to the sub-band region to which the second sub-spectral envelope corresponds, and the spectral energy information corresponding to that sub-band region, the gain adjustment value of the first sub-spectral envelope corresponding to the second sub-spectral envelope can be determined as follows:
  • the 35 frequency points in the first subband area are divided into 7 subbands, and each subband corresponds to a first subspectral envelope.
  • Calculate the average energy pow_env of each sub-band (the spectral energy information corresponding to the second sub-spectral envelope), and calculate the average value Mpow_env of the above 7 average energies (the spectral energy information corresponding to the sub-band region to which the second sub-spectral envelopes correspond).
  • The average energy of each sub-band is determined based on the corresponding low-frequency amplitude spectrum: for example, the square of the absolute value of each spectral coefficient of the low-frequency amplitude spectrum is taken as the energy at that coefficient, and since one sub-band corresponds to the spectral coefficients of five low-frequency amplitude values, the average of these five energies can be used as the average energy of the sub-band.
  • where a1 = 0.875, b1 = 0.125, a0 = 0.925, b0 = 0.075.
  • G(j) is the gain adjustment value.
  • the gain adjustment value is 1, that is, there is no need to perform a flattening operation (adjustment) on the high-frequency spectrum envelope.
  • the gain adjustment values of the 7 first sub-spectral envelopes in the high-frequency spectrum envelope can be determined, and the corresponding first sub-spectral envelopes are adjusted based on the gain adjustment values of the 7 first sub-spectrum envelopes
  • the foregoing operation can narrow the average energy difference of different subbands, and perform different degrees of flattening processing on the frequency spectrum corresponding to the first subband region.
  • the corresponding high-frequency spectrum envelope of the second subband region can be adjusted in the same manner as described above, which will not be repeated here.
  • the high-frequency spectrum envelope includes a total of 14 sub-bands, and 14 gain adjustment values can be correspondingly determined, and the corresponding sub-spectrum envelopes are adjusted based on the 14 gain adjustment values.
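  • The following sketch illustrates the per-sub-band flattening described above for one sub-band region (7 sub-bands of 5 coefficients). Because the gain formula itself is not reproduced in the text, the blend of each sub-band's average energy toward the regional mean using the tabulated constants, and the mapping of (a1, b1) and (a0, b0) to the flat and non-flat cases, are assumptions; the sub-spectral envelopes are assumed to be in the linear domain here.
```python
import numpy as np

def subband_gain_adjustments(low_amp_highband, is_flat, coeffs_per_band=5):
    """Sketch of the gain adjustment values G(j) for the 7 first sub-spectral envelopes
    of one sub-band region. low_amp_highband holds the 35 reference low-frequency
    amplitude coefficients. The blend toward the regional mean is an assumed reading
    of the omitted formula, not the patent's exact expression."""
    energies = low_amp_highband ** 2                                   # energy per frequency point
    pow_env = energies.reshape(7, coeffs_per_band).mean(axis=1)        # average energy per sub-band
    mpow_env = pow_env.mean()                                          # mean of the 7 averages
    a, b = (0.875, 0.125) if is_flat else (0.925, 0.075)               # assumed case mapping
    target = a * mpow_env + b * pow_env                                # pull each sub-band toward the mean
    return np.sqrt(target / np.maximum(pow_env, 1e-12))                # G(j), j = 0..6

def apply_gains(first_sub_envelopes, gains):
    # adjust the 7 first sub-spectral envelopes of this sub-band region (linear domain assumed)
    return first_sub_envelopes * gains
```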
  • The broadband signal includes the signal of the low-frequency part of the narrowband signal and the signal of the extended high-frequency part, so the initial low-frequency spectrum corresponding to the low-frequency part and the initial high-frequency spectrum corresponding to the high-frequency part are obtained.
  • the initial low-frequency spectrum and the initial high-frequency spectrum can be combined to obtain a broadband spectrum, and then frequency-time transformation of the broadband spectrum (inverse transformation of time-frequency transformation, transforming frequency domain signals into time domain signals), you can Obtain the target voice signal after the frequency band is expanded.
  • At least one of the initial low-frequency spectrum or the initial high-frequency spectrum may be filtered first, and the wideband signal after band extension is then obtained based on the filtered spectrum.
  • Specifically, only the initial low-frequency spectrum may be filtered to obtain the filtered initial low-frequency spectrum (denoted as the target low-frequency spectrum), which is then combined with the initial high-frequency spectrum; or only the initial high-frequency spectrum may be filtered to obtain the filtered initial high-frequency spectrum (denoted as the target high-frequency spectrum), which is then combined with the initial low-frequency spectrum; or the initial low-frequency spectrum and the initial high-frequency spectrum may each be filtered to obtain the corresponding target low-frequency spectrum and target high-frequency spectrum, which are then combined.
  • the filtering process of the initial low frequency spectrum is basically the same as the filtering process of the initial high frequency spectrum.
  • the filtering process of the initial low frequency spectrum is taken as an example to introduce the filtering process in detail, as follows:
  • filter processing is performed on each corresponding sub-spectrum respectively.
  • The process of filtering the initial low-frequency spectrum may first determine the filter gain of the initial low-frequency spectrum (hereinafter referred to as the first filter gain) based on the spectrum energy of the initial low-frequency spectrum, and then filter the initial low-frequency spectrum based on the first filter gain to obtain the filtered low-frequency spectrum, where the first filter gain includes the filter gain corresponding to each sub-spectrum (hereinafter referred to as the second filter gain).
  • the initial low-frequency spectrum is usually represented by the initial low-frequency frequency domain coefficients
  • the low-frequency spectrum is represented by the low-frequency frequency domain coefficients
  • The initial low-frequency frequency-domain coefficients can be filtered by multiplying them by the first filter gain to obtain the filtered low-frequency frequency-domain coefficients, where the initial low-frequency frequency-domain coefficients are S Low (i,j) and the filtered low-frequency frequency-domain coefficients are S Low_rev (i,j). If the determined first filter gain is G Low_post_filt (j), the initial low-frequency frequency-domain coefficients can be filtered according to the following formula (10): S Low_rev (i,j) = G Low_post_filt (j) · S Low (i,j).
  • i is the frame index of the speech frame
  • To determine the first filter gain, the initial low-frequency frequency-domain coefficients are first divided into a first number of sub-spectra and the first spectrum energy corresponding to each sub-spectrum is determined; the second filter gain corresponding to each sub-spectrum is then determined based on the first spectrum energy corresponding to that sub-spectrum, where the first filter gain includes the first number of second filter gains. When the initial spectrum is filtered according to the first filter gain, each corresponding sub-spectrum can be filtered according to its second filter gain.
  • each sub-band corresponds to N initial low-frequency frequency domain coefficients
  • N*L is equal to the total number of initial low-frequency frequency-domain coefficients, where L ≥ 2, N ≥ 1, and L is the first number of sub-spectra.
  • each sub-band corresponds to 5 initial low-frequency frequency domain coefficients.
  • One possible implementation manner for determining the first spectrum energy corresponding to each sub-spectrum is to determine the sum of the spectrum energy of the N initial low-frequency frequency domain coefficients corresponding to each sub-spectrum as the first spectrum energy corresponding to each sub-spectrum.
  • the spectral energy of each initial low-frequency frequency domain coefficient is defined as the sum of the square of the real part and the square of the imaginary part of the initial low-frequency frequency domain coefficient.
  • i is the frame index of the speech frame
  • Pe( k) represents the first spectrum energy corresponding to the k-th sub-spectrum
  • S Low (i, j) is the low-frequency frequency domain coefficient obtained from the time-frequency transform (ie, the initial low-frequency frequency domain coefficient)
  • Real and Imag denote taking the real part and the imaginary part, respectively.
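  • A minimal sketch of the first spectrum energy Pe(k) described by formula (11): the sum of real² + imag² over the N = 5 coefficients of each sub-spectrum. The function name is illustrative.
```python
import numpy as np

def sub_spectrum_energies(low_coeffs, n_per_band=5):
    """First spectrum energy Pe(k) of each sub-spectrum: the sum of real^2 + imag^2
    over the N initial low-frequency frequency-domain coefficients of that sub-spectrum
    (70 coefficients -> 14 sub-spectra in this example)."""
    coeffs = np.asarray(low_coeffs).reshape(-1, n_per_band)        # (14, 5) complex coefficients
    return np.sum(coeffs.real ** 2 + coeffs.imag ** 2, axis=1)     # Pe(k), k = 0..13

# Example with 70 complex STFT coefficients
pe = sub_spectrum_energies(np.random.randn(70) + 1j * np.random.randn(70))
```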
  • the second filter gain corresponding to each sub-spectrum may be determined based on the first spectrum energy corresponding to each sub-spectrum.
  • Specifically, the frequency band corresponding to the initial spectrum can be divided into a first sub-band and a second sub-band; the first sub-band energy of the first sub-band is determined according to the first spectrum energies of all sub-spectra corresponding to the first sub-band, and the second sub-band energy of the second sub-band is determined according to the first spectrum energies of all sub-spectra corresponding to the second sub-band; the spectral tilt coefficient of the initial spectrum is then determined according to the first sub-band energy and the second sub-band energy; finally, the second filter gain corresponding to each sub-spectrum is determined according to the spectral tilt coefficient and the first spectrum energy corresponding to that sub-spectrum.
  • the frequency band corresponding to the initial frequency spectrum is the sum of the frequency bands corresponding to the initial low frequency frequency domain coefficients (for example, 70).
  • the sum of the frequency bands corresponding to the first to the 35th initial low-frequency frequency domain coefficients can be regarded as the first subband, and the sum of the frequency bands corresponding to the 36th to the 70th initial low-frequency frequency domain coefficients can be regarded as the second sub-band.
  • the first subband corresponds to the first to the 35th initial low frequency frequency domain coefficients in the initial spectrum
  • the second subband corresponds to the 36th to 70th initial low frequency frequency domain coefficients in the initial spectrum.
  • the first sub-band includes 7 sub-spectrums
  • the second sub-band also includes 7 sub-spectrums.
  • The first sub-band energy of the first sub-band can be determined as the sum of the first spectrum energies of its 7 sub-spectra, and the second sub-band energy of the second sub-band can likewise be determined as the sum of the first spectrum energies of the 7 sub-spectra included in the second sub-band.
  • the narrowband signal is the speech signal of the current speech frame
  • One possible way to determine the corresponding first spectrum energy is as follows: according to the above formula (11), the first initial spectrum energy Pe(k) corresponding to each sub-spectrum is determined. If the current speech frame is the first speech frame, the first initial spectrum energy Pe(k) of each sub-spectrum can be taken as its first spectrum energy, denoted Fe(k), that is, Fe(k) = Pe(k).
  • If the current speech frame is not the first speech frame, the second initial spectrum energy of the sub-spectrum of the associated speech frame corresponding to the k-th sub-spectrum can be obtained and denoted Pe pre (k), where the associated speech frame is at least one speech frame located before and adjacent to the current speech frame; the first spectrum energy of the sub-spectrum is then obtained based on the first initial spectrum energy and the second initial spectrum energy.
  • the first spectrum energy of the k-th sub-spectrum can be determined according to the following formula (12):
  • Pe(k) is the first initial spectral energy of the k-th sub-spectrum
  • Pe pre (k) is the second initial spectral energy of the sub-spectrum corresponding to the k-th sub-spectrum of the associated speech frame
  • Fe(k) is the first spectrum energy of the k-th sub-spectrum.
  • the associated speech frame in the above formula (11) is a speech frame located before and adjacent to the current speech frame.
  • When the associated speech frames are the two speech frames located before and adjacent to the current speech frame, the above formula (12) can be adjusted accordingly, where Pe pre1 (k) is the first initial spectrum energy of the first speech frame that is located before and adjacent to the current speech frame, and Pe pre2 (k) is the first initial spectrum energy of the speech frame that is located before and adjacent to that first speech frame.
  • Optionally, the first spectrum energy of the k-th sub-spectrum can be smoothed, and after the smoothed first spectrum energy Fe_sm(k) is determined, Fe_sm(k) is taken as the first spectrum energy of the k-th sub-spectrum.
  • the first spectrum energy can be smoothed according to the following formula (13):
  • Fe_sm(k) = (Fe(k) + Fe pre (k))/2    (13)
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • Fe pre (k) is the first spectral energy of the sub-spectrum corresponding to the k-th sub-spectrum of the associated speech frame
  • Fe_sm(k) is the smoothed first spectrum energy. After the smoothed first spectrum energy Fe_sm(k) is determined, Fe_sm(k) may be taken as the first spectrum energy of the k-th sub-spectrum.
  • the associated speech frame in the above formula (13) is a speech frame located before and adjacent to the current speech frame.
  • When the associated speech frames are the two speech frames located before and adjacent to the current speech frame, the above formula (13) can be adjusted as: Fe_sm(k) = (Fe(k) + Fe pre1 (k) + Fe pre2 (k))/3, where Fe pre1 (k) is the first spectrum energy of the first speech frame immediately before the current speech frame, and Fe pre2 (k) is the first spectrum energy of the speech frame that is located before and immediately adjacent to that first speech frame.
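  • A minimal sketch of the smoothing in formula (13), with the three-frame variant noted as a comment; names are illustrative.
```python
def smooth_first_energy(fe_cur, fe_prev):
    """Formula (13): Fe_sm(k) = (Fe(k) + Fe_pre(k)) / 2, computed per sub-spectrum."""
    return [(c + p) / 2.0 for c, p in zip(fe_cur, fe_prev)]

# With two associated frames the text gives the three-term average instead:
# Fe_sm(k) = (Fe(k) + Fe_pre1(k) + Fe_pre2(k)) / 3
```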
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband.
  • the first sub-band energy of the first sub-band and the second sub-band energy of the second sub-band can be determined according to the following formula (15):
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband.
  • the spectral tilt coefficient of the initial spectrum can be determined according to the energy of the first subband and the energy of the second subband.
  • the spectrum tilt coefficient of the initial spectrum can be determined according to the following logic:
  • In one case, the initial spectral tilt coefficient is determined to be 0; otherwise, the initial spectral tilt coefficient can be determined according to the following expression:
  • T_para_0 = 8*f_cont_low*SQRT((e1-e2)/(e1+e2));
  • T_para_0 is the initial spectral tilt coefficient
  • f_cont_low is the preset filter coefficient
  • SQRT is the square root operation
  • e1 is the energy of the first subband
  • e2 is the energy of the second subband.
  • The aforementioned initial spectral tilt coefficient can be used directly as the spectral tilt coefficient of the initial spectrum, or the obtained initial spectral tilt coefficient can be further optimized according to the following method, with the optimized initial spectral tilt coefficient used as the spectral tilt coefficient of the initial spectrum.
  • the optimized expression is:
  • T_para_1 = min(1.0, T_para_0);
  • T_para_2 = T_para_1/7;
  • min represents the minimum value
  • T_para_1 is the initial spectrum tilt coefficient after the initial optimization
  • T_para_2 is the initial spectrum tilt coefficient after the final optimization, that is, the aforementioned spectrum tilt coefficient of the initial spectrum.
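  • A minimal sketch of the spectral tilt computation described above; the condition under which the initial tilt is set to 0 is not fully stated in the text, so e1 <= e2 is taken as that condition here as an assumption.
```python
from math import sqrt

def spectral_tilt(e1, e2, f_cont_low=0.035):
    """Spectral tilt coefficient of the initial spectrum, following the expressions
    T_para_0 = 8*f_cont_low*sqrt((e1-e2)/(e1+e2)), T_para_1 = min(1, T_para_0),
    T_para_2 = T_para_1 / 7."""
    if e1 <= e2:                      # assumed condition for a zero initial tilt
        t0 = 0.0
    else:
        t0 = 8.0 * f_cont_low * sqrt((e1 - e2) / (e1 + e2))
    t1 = min(1.0, t0)                 # initial optimisation step
    return t1 / 7.0                   # final spectral tilt coefficient T_para_2
```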
  • the second filter gain corresponding to each sub-spectrum can be determined according to the spectral tilt coefficient and the first spectral energy corresponding to each sub-spectrum.
  • the second filter gain corresponding to the k-th sub-spectrum can be determined according to the following formula (16):
  • gain f0 (k) is the second filter gain corresponding to the k-th sub-spectrum
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • f_cont_low is the preset filter coefficient.
  • f_cont_low = 0.035.
  • k = 0, 1, ..., 13 is the sub-band index, indicating 14 sub-bands.
  • After the second filter gain gain f0 (k) corresponding to the k-th sub-spectrum is determined, if the spectral tilt coefficient of the initial spectrum is not positive, gain f0 (k) can be used directly as the second filter gain corresponding to the k-th sub-spectrum; if the spectral tilt coefficient of the initial spectrum is positive, gain f0 (k) can be adjusted according to the spectral tilt coefficient of the initial spectrum, and the adjusted gain is used as the second filter gain corresponding to the k-th sub-spectrum.
  • the second filter gain gain f0 (k) can be adjusted according to the following formula (17):
  • gain f1 (k) is the adjusted second filter gain
  • gain f0 (k) is the second filter gain corresponding to the k-th sub-spectrum
  • T para is the spectral tilt coefficient of the initial spectrum
  • The second filter gain gain f1 (k) can be further optimized, and the optimized gain is used as the final second filter gain corresponding to the k-th sub-spectrum.
  • the second filter gain gain f1 (k) can be adjusted according to the following formula (18):
  • gain Low_post_filt (k) is the second filter gain corresponding to the k-th sub-spectrum finally obtained
  • gain f1 (k) is the second filter gain adjusted according to formula (17)
  • k = 0, 1, ..., 13.
  • The above description takes as an example dividing the 70 initial low-frequency frequency-domain coefficients into 14 sub-bands of 5 coefficients each, and introduces how the first filter gain of the initial low-frequency frequency-domain coefficients is obtained.
  • The second filter gain obtained for each sub-band is the filter gain of the 5 initial low-frequency frequency-domain coefficients corresponding to that sub-band, so that the first filter gain corresponding to the 70 initial low-frequency frequency-domain coefficients can be obtained from the second filter gains of the 14 sub-bands as [gain Low_post_filt (0), gain Low_post_filt (1), ..., gain Low_post_filt (13)].
  • That is, the second filter gain gain Low_post_filt (k) is the filter gain of the N spectral coefficients corresponding to the k-th sub-spectrum.
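  • A minimal sketch of applying the first filter gain as in formula (10): the 14 per-sub-spectrum gains are expanded to the 70 coefficients and applied by element-wise multiplication, which is stated explicitly in the description; the helper name is illustrative.
```python
import numpy as np

def apply_low_band_post_filter(s_low, band_gains, n_per_band=5):
    """Expand the 14 per-sub-spectrum gains gain_Low_post_filt(k) to the 70 initial
    low-frequency frequency-domain coefficients and apply them:
    S_Low_rev(i, j) = G_Low_post_filt(j) * S_Low(i, j)."""
    g = np.repeat(np.asarray(band_gains), n_per_band)   # (14,) -> (70,) first filter gain
    return g * np.asarray(s_low)                        # filtered low-frequency coefficients
```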
  • the method may further include:
  • The signal to be processed may be a multi-channel associated signal, that is, it may contain at least two associated signals.
  • In one approach, the at least two associated signals can be merged into one signal, which is regarded as the narrowband signal, and this narrowband signal is then extended by the frequency band extension method of this application to obtain a wideband signal.
  • each of the at least two associated signals may be used as a narrowband signal, and the narrowband signal may be expanded by the frequency band expansion method in the embodiment of the present application to obtain at least two corresponding wideband signals.
  • the wideband signals can be combined into one signal for output, or can be output separately, which is not limited in the embodiment of the present application.
  • In one application scenario, PSTN (narrowband voice) and VoIP (wideband voice) intercommunicate: the narrowband voice corresponding to the PSTN phone is used as the narrowband signal to be processed, and its bandwidth is extended so that the voice frames received by the VoIP receiving end are wideband voice, thereby improving the listening experience of the receiving end.
  • The narrowband signal to be processed is a signal with a sampling rate of 8000 Hz and a frame length of 10 ms.
  • The theoretical bandwidth of the narrowband signal to be processed is 4000 Hz, but the upper bound of its effective bandwidth is generally 3500 Hz; therefore, this example is described with the bandwidth of the extended wideband signal being 7000 Hz.
  • the time-frequency transform is a Fourier transform (such as STFT).
  • Step S1 front-end signal processing:
  • the narrowband signal to be processed is subjected to an up-sampling process with a factor of 2, and an up-sampling signal with a sampling rate of 16000 Hz is output.
  • the up-sampled signal corresponds to 160 sample points (frequency points).
  • STFT Short-time Fourier Transform
  • For the STFT, the 160 sample points corresponding to the previous speech frame and the 160 sample points corresponding to the current speech frame (the up-sampled narrowband signal to be processed) form an array of 320 sample points.
  • the sample points in the array are windowed (that is, the windowing of the Hanning window), and the windowed signal is obtained as s Low (i, j).
  • fast Fourier transform is performed on s Low (i,j) to obtain 320 low-frequency frequency domain coefficients S Low (i,j).
  • i is the frame index of the speech frame
  • Due to the symmetry of the Fourier transform of a real signal, only the first 161 low-frequency frequency-domain coefficients need to be considered, of which the first coefficient is the DC component.
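  • A minimal sketch of the front-end processing in Step S1; resample_poly is used here as one convenient upsampling filter (the text does not mandate a particular one), and the variable names are illustrative.
```python
import numpy as np
from scipy.signal import resample_poly

def front_end(frame_8k, prev_upsampled_160):
    """Step S1 sketch: upsample the 10 ms / 8 kHz frame (80 samples) by a factor of 2,
    concatenate it with the previous frame's 160 upsampled samples, apply a Hanning
    window and a 320-point FFT, then keep the DC bin plus the first 160 bins."""
    up = resample_poly(frame_8k, 2, 1)                    # 80 -> 160 samples at 16 kHz
    block = np.concatenate([prev_upsampled_160, up])      # 320-sample analysis array
    windowed = block * np.hanning(320)                    # windowed signal s_Low(i, j)
    spec = np.fft.fft(windowed)                           # 320 low-frequency frequency-domain coefficients
    return spec[:161], up                                 # coefficients to use; 'up' is kept for the next frame
```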
  • Step S2 feature extraction:
  • P Low (i, j) represents the low-frequency amplitude spectrum
  • S Low (i, j) is the low-frequency frequency domain coefficient
  • Real and Imag are the real and imaginary parts of the low-frequency frequency domain coefficient, respectively
  • SQRT denotes the square-root operation.
  • the calculated 70 low-frequency amplitude spectrum coefficients can be directly used as the low-frequency amplitude spectrum of the narrowband signal to be processed. Further, for the convenience of calculation, the low-frequency amplitude spectrum can also be further converted to the logarithmic domain.
  • the low-frequency spectrum envelope of the narrowband signal to be processed can be determined based on the low-frequency amplitude spectrum.
  • the low-frequency spectrum envelope can also be determined based on the low-frequency amplitude spectrum in the following manner:
  • the narrowband signal to be processed is divided into bands.
  • Specifically, the band corresponding to every 5 adjacent spectral coefficients of the low-frequency amplitude spectrum can be divided into one sub-band, giving 14 sub-bands in total.
  • the low frequency spectral envelope of the subband is defined as the average energy of adjacent spectral coefficients. Specifically, it can be calculated by formula (20):
  • the low frequency spectrum envelope includes 14 sub-spectral envelopes.
  • In the commonly used scheme, the spectral envelope of a sub-band is defined as the average energy of adjacent coefficients (or further converted to a logarithmic representation), but this may cause coefficients with smaller amplitudes to play no substantial role.
  • In contrast, the solution provided by the embodiment of the present application directly averages the logarithms of the spectral coefficients included in each sub-amplitude spectrum to obtain the sub-spectral envelope corresponding to that sub-amplitude spectrum; compared with the commonly used envelope determination solution, this better protects the coefficients with smaller amplitudes during the distortion control of the neural network model training process, so that more signal parameters can play a corresponding role in the frequency band extension.
  • the 70-dimensional low-frequency amplitude spectrum and the 14-dimensional low-frequency spectrum envelope can be used as the input of the neural network model.
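  • A minimal sketch of the feature extraction in Step S2, using the log-domain envelope definition described above (averaging the log amplitudes of each group of 5 coefficients). Which 70 of the 161 bins are used, and whether the amplitude spectrum is fed to the model in the linear or logarithmic domain, are indexing and representation assumptions.
```python
import numpy as np

def extract_features(low_coeffs_70):
    """Step S2 sketch: 70-dim low-frequency amplitude spectrum plus a 14-dim
    log-domain low-frequency spectrum envelope, concatenated into the 84-dim
    feature vector used as the neural network input."""
    amp = np.sqrt(low_coeffs_70.real ** 2 + low_coeffs_70.imag ** 2)   # P_Low(i, j)
    log_amp = np.log10(np.maximum(amp, 1e-12))                         # log-domain amplitude spectrum
    env = log_amp.reshape(14, 5).mean(axis=1)                          # e_Low(i, k), 14 sub-spectral envelopes
    return np.concatenate([log_amp, env])                              # 84-dim feature vector
```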
  • Step S3 input the neural network model:
  • Input layer: the neural network model takes the above-mentioned 84-dimensional feature vector as input.
  • Output layer: considering that the target bandwidth of the frequency band extension in this embodiment is 7000 Hz, the high-frequency spectrum envelopes of the 14 sub-bands corresponding to the 3500-7000 Hz band need to be predicted to complete the basic frequency band extension function.
  • The low-frequency part of a speech frame contains rich harmonic structure such as the fundamental tone and formants, while the frequency spectrum of the high-frequency part is flatter. If the low-frequency spectrum is simply copied to the high band to obtain the initial high-frequency amplitude spectrum and only sub-band-based gain control is performed on it, the reconstructed high-frequency part will contain too much harmonic-like structure, causing distortion and degrading the listening experience. Therefore, in this example the relative flatness information predicted by the neural network model is used to describe the relative flatness of the low-frequency part and the high-frequency part, and the initial high-frequency amplitude spectrum is adjusted so that the adjusted high-frequency part is flatter and harmonic interference is reduced.
  • the initial high-frequency amplitude spectrum is generated by duplicating the amplitude spectrum of the high-frequency part of the low-frequency amplitude spectrum, and the frequency band of the high-frequency part is equally divided into two sub-band regions, respectively, the first sub-band region And the second subband area, the high frequency part corresponds to 70 spectral coefficients, and each subband area corresponds to 35 spectral coefficients. Therefore, the high frequency part will be subjected to two flatness analysis, that is, a flatness analysis is performed for each subband area.
  • The spectral coefficients corresponding to the 35th to 69th frequency points are selected as the "master"; the frequency band corresponding to the first sub-band region is the band corresponding to the 70th to 104th frequency points, and the frequency band corresponding to the second sub-band region is the band corresponding to the 105th to 139th frequency points.
  • Flatness analysis can use the variance analysis method defined in classical statistics.
  • the degree of oscillation of the spectrum can be described by the method of variance analysis. The higher the value, the richer the harmonic components.
  • Specifically, the high-frequency band of the low-frequency part of the sample narrowband signal can be selected as the reference for determining the relative flatness information, that is, the high-frequency band of the low-frequency part (the band corresponding to the 35th to 69th frequency points) is used as the master, the high-frequency part of the sample broadband signal is divided into at least two sub-band regions, and the relative flatness information of each sub-band region is determined based on the frequency spectrum of that sub-band region of the high-frequency part and the frequency spectrum of the low-frequency part.
  • The relative flatness information of each sub-band region of the high-frequency part of the sample broadband spectrum can be determined by the variance analysis method.
  • In this example, the relative flatness information of the high-frequency part and the low-frequency part of the sample broadband signal consists of the first relative flatness information between the first sub-band region and the high-frequency band of the low-frequency part of the sample broadband signal, and the second relative flatness information between the second sub-band region and the high-frequency band of the low-frequency part of the sample broadband signal.
  • the specific determination method of the first relative flatness information and the second relative flatness information may be:
  • formula (21) is the variance of the amplitude spectrum of the low frequency part of the sample narrowband signal
  • formula (22) is the variance of the amplitude spectrum of the first subband region
  • formula (23) is the variance of the amplitude spectrum of the second subband region.
  • the variance of the amplitude spectrum, var() represents the variance, and the variance of the spectrum can be represented based on the corresponding frequency domain coefficients.
  • S Low,sample (i,j) represents the frequency domain coefficients of the sample narrowband signal.
  • formula (24) and formula (25) are used to determine the relative flatness information of the amplitude spectrum of each subband region and the amplitude spectrum of the low frequency part of the high frequency band:
  • fc(0) represents the first relative flatness information between the amplitude spectrum of the first sub-band region and the amplitude spectrum of the high-frequency band of the low-frequency part, and fc(1) represents the second relative flatness information between the amplitude spectrum of the second sub-band region and the amplitude spectrum of the high-frequency band of the low-frequency part.
  • fc(0) and fc(1) can be classified according to whether they are greater than or equal to 0, and fc(0) and fc(1) can be defined as a two-category array, so the array contains 4 permutations and combinations: {0,0}, {0,1}, {1,0}, {1,1}.
  • the relative flatness information output by the model may be 4 probability values, and the probability values are used to identify the probability that the relative flatness information belongs to the aforementioned 4 arrays.
  • one of the permutations and combinations of the four arrays can be selected as the predicted relative flatness information of the amplitude spectrum of the extended region of the two subbands and the amplitude spectrum of the low frequency part of the high frequency band.
  • Specifically, this can be expressed by formula (26):
  • Step S4 generate high frequency amplitude spectrum:
  • Based on the trained neural network model, the relative flatness information of at least two sub-band regions of the high-frequency part of the target broadband spectrum can be predicted, that is, the high-frequency part of the target broadband spectrum is divided into at least two sub-band regions. In this example, with 2 sub-band regions, the output of the neural network model is the relative flatness information of the two sub-band regions.
  • the main steps include:
  • each sub-band can correspond to a first sub-spectral envelope.
  • Calculate the average energy pow_env of each sub-band (the spectral energy information corresponding to the second sub-spectral envelope), and calculate the average value Mpow_env of the above 7 average energies (the spectral energy information corresponding to the sub-band region to which the second sub-spectral envelopes correspond).
  • the average energy of each sub-band is determined based on the corresponding low-frequency amplitude spectrum.
  • The square of the absolute value of each spectral coefficient of the low-frequency amplitude spectrum is taken as the energy at that coefficient; since one sub-band corresponds to 5 low-frequency amplitude spectral coefficients, the average of these energies can be used as the average energy of the sub-band.
  • where a1 = 0.875, b1 = 0.125, a0 = 0.925, b0 = 0.075.
  • G(j) is the gain adjustment value.
  • when the gain adjustment value is 1, there is no need to perform a flattening operation (adjustment) on the high-frequency spectrum envelope.
  • the gain adjustment value corresponding to each first sub-spectral envelope in the high-frequency spectrum envelope e_High(i,k) can be determined, and each first sub-spectral envelope can then be adjusted according to its gain adjustment value; this operation narrows the average-energy difference between different sub-bands and applies different degrees of flattening to the spectrum corresponding to the first sub-band region.
  • the corresponding high-frequency spectrum envelope of the second subband region can be adjusted in the same manner as described above, which will not be repeated here.
  • the high-frequency spectrum envelope includes a total of 14 sub-bands, and 14 gain adjustment values can be correspondingly determined, and the corresponding sub-spectrum envelopes are adjusted based on the 14 gain adjustment values.
  • the difference between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope is determined, and the initial high-frequency amplitude spectrum is adjusted based on the difference to obtain the target high-frequency amplitude spectrum P_High(i,j), as sketched below.
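  • One way to read this adjustment step, assuming log-domain envelopes with 5 amplitude-spectrum coefficients per sub-envelope (a sketch under stated assumptions, not the patent's exact formula):

    import numpy as np

    def adjust_high_amplitude(init_high_amp, adj_high_env, low_env, coeffs_per_band=5):
        """Adjust the initial high-frequency amplitude spectrum sub-band by sub-band.

        init_high_amp: initial high-frequency amplitude spectrum (log domain assumed).
        adj_high_env:  adjusted high-frequency spectrum envelope, one value per sub-band.
        low_env:       low-frequency spectrum envelope of the copied "master" bands.
        """
        target = np.array(init_high_amp, dtype=float)
        diffs = np.asarray(adj_high_env, dtype=float) - np.asarray(low_env, dtype=float)
        for k, diff in enumerate(diffs):
            sl = slice(k * coeffs_per_band, (k + 1) * coeffs_per_band)
            # Shift each copied sub-band by the envelope difference to obtain
            # the target high-frequency amplitude spectrum.
            target[sl] = target[sl] + diff
        return target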
  • Step S5 generate high frequency spectrum:
  • the corresponding high-frequency phase spectrum Ph_High(i,j) is generated based on the low-frequency phase spectrum Ph_Low(i,j), which may be done in any of the following ways:
  • the first method is to obtain the corresponding high-frequency phase spectrum by copying the low-frequency phase spectrum.
  • the second method is to flip (fold) the low-frequency phase spectrum, obtaining a phase spectrum identical to the low-frequency phase spectrum, and map the two low-frequency phase spectra to the corresponding high-frequency points to obtain the corresponding high-frequency phase spectrum.
  • based on the target high-frequency amplitude spectrum and the high-frequency phase spectrum, the high-frequency frequency-domain coefficients S_High(i,j) are generated; based on the low-frequency frequency-domain coefficients and the high-frequency frequency-domain coefficients, the high-frequency spectrum is generated (see the sketch below).
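  • A compact sketch of the two phase-generation options and the recombination into high-frequency frequency-domain coefficients; the mapping details are simplified and the amplitude is assumed to be in the linear domain, so this is an illustration rather than the embodiment's exact formula:

    import numpy as np

    def make_high_coeffs(target_high_amp, low_phase, method="copy"):
        """Build high-frequency frequency-domain coefficients from amplitude and phase."""
        n = len(target_high_amp)
        if method == "copy":
            # Method 1: copy the low-frequency phase spectrum to the high band.
            high_phase = np.resize(np.asarray(low_phase), n)
        else:
            # Method 2: flip (fold) the low-frequency phase spectrum and map both
            # copies to the corresponding high-frequency points.
            folded = np.concatenate([np.asarray(low_phase), np.asarray(low_phase)[::-1]])
            high_phase = np.resize(folded, n)
        # Amplitude combined with phase gives S_High(i, j).
        return np.asarray(target_high_amp) * np.exp(1j * high_phase)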
  • Step S6 high-frequency post-filtering:
  • High-frequency post-filtering filters the obtained initial high-frequency frequency-domain coefficients; the filtered coefficients are recorded as the high-frequency frequency-domain coefficients.
  • the initial high-frequency frequency-domain coefficients are filtered through the filter gain determined based on the high-frequency frequency-domain coefficients, as shown in the following formula (27):
  • G_High_post_filt(j) is the filter gain calculated according to the high-frequency frequency-domain coefficients
  • S_High(i,j) is the initial high-frequency frequency-domain coefficients
  • S_High_rev(i,j) is the high-frequency frequency-domain coefficients obtained by filtering (see the sketch below).
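  • Formula (27) is not reproduced in this excerpt; its role, as described, is to scale the initial high-frequency coefficients by the per-point filter gain. A minimal sketch under that reading:

    def high_post_filter(s_high, g_high_post_filt):
        # Assumed relation: S_High_rev(i, j) = G_High_post_filt(j) * S_High(i, j)
        return [g * s for g, s in zip(g_high_post_filt, s_high)]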
  • the initial high-frequency frequency-domain coefficients are divided into bands; for example, five adjacent initial high-frequency frequency-domain coefficients are combined into one sub-spectrum, which in this example corresponds to 14 sub-bands, and the average energy is calculated for each sub-band.
  • the energy of each frequency point (i.e., each of the aforementioned initial high-frequency frequency-domain coefficients) is computed as follows:
  • the energy values of five adjacent frequency points are calculated by the following formula (28), and the sum of the energy values of the five frequency points is the first spectral energy of the current sub-spectrum:
  • S High (i, j) is the initial high-frequency frequency domain coefficient
  • Real and Imag are the real and imaginary parts of the initial high-frequency frequency domain coefficient respectively
  • Pe(k) is the first spectral energy
  • Fe_sm(k) = (Fe(k) + Fe_pre(k)) / 2       (30)
  • Fe(k) is the smoothing term of the first spectral energy of the current sub-spectrum
  • Pe(k) is the first spectral energy of the current sub-spectrum of the current speech frame
  • Pe_pre(k) is the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame
  • Fe_sm(k) is the accumulated-and-averaged smoothing term of the first spectral energy
  • Fe_pre(k) is the smoothing term of the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame.
  • the associated speech frame is at least one speech frame located before and adjacent to the current speech frame, so the short-term and long-term correlation between speech signal frames is fully considered (see the sketch below).
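  • Formula (28) (per-point energies summed over five points) and the inter-frame smoothing of formula (30) can be sketched as follows; the averaging form used for Fe(k) is an assumption made for illustration, consistent with formula (30) and the surrounding definitions:

    import numpy as np

    def first_spectrum_energy(coeffs, k, per_band=5):
        """Formula (28): sum of the energies of the 5 frequency points of sub-spectrum k."""
        band = np.asarray(coeffs[k * per_band:(k + 1) * per_band])
        return float(np.sum(band.real ** 2 + band.imag ** 2))

    def smooth_energy(pe_k, pe_prev_k, fe_prev_k):
        """Smoothing of the first spectral energy of sub-spectrum k across frames."""
        fe_k = (pe_k + pe_prev_k) / 2.0     # assumption: current/associated-frame average
        fe_sm_k = (fe_k + fe_prev_k) / 2.0  # formula (30)
        return fe_k, fe_sm_k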
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband.
  • the spectral tilt coefficient of the initial spectrum is determined based on the following logic:
  • T_para is the spectrum tilt coefficient
  • SQRT is the square-root operation
  • f_cont_low = 0.07 is the preset filter coefficient
  • 7 is half of the total number of sub-spectra (14).
  • gain_f0(k) is the second filter gain of the k-th sub-spectrum
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • gain_f1(k) is the second filter gain adjusted according to formula (33)
  • gain_High_post_filt(k) is the filter gain, finally obtained according to gain_f1(k), that is applied to the 5 high-frequency frequency-domain coefficients corresponding to the k-th sub-spectrum (see the sketch below).
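  • The last step of this post-filtering, as described, applies the final gain of each sub-spectrum to its 5 frequency-domain coefficients; a brief sketch with the band layout assumed from the text above:

    def apply_band_gains(s_high, gains, per_band=5):
        """Scale the 5 coefficients of each sub-spectrum k by gain_High_post_filt(k)."""
        out = list(s_high)
        for k, g in enumerate(gains):
            for j in range(k * per_band, (k + 1) * per_band):
                out[j] = g * out[j]
        return out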
  • Step S7 low frequency post filtering
  • the low-frequency post-filtering filters the initial low-frequency frequency-domain coefficients obtained by the STFT of the narrowband signal to be processed, to obtain the low-frequency frequency-domain coefficients.
  • the initial low-frequency frequency-domain coefficients are filtered through the filter gain determined based on the initial low-frequency frequency-domain coefficients, as shown in the following formula (35):
  • G_Low_post_filt(j) is the filter gain calculated according to the initial low-frequency frequency-domain coefficients
  • S_Low(i,j) is the initial low-frequency frequency-domain coefficients
  • S_Low_rev(i,j) is the low-frequency frequency-domain coefficients obtained by filtering.
  • the initial low-frequency frequency domain coefficients are divided into bands. For example, five adjacent initial low-frequency frequency domain coefficients are combined into one sub-frequency spectrum. This example corresponds to 14 sub-bands. Calculate the average energy for each subband.
  • the energy of each frequency point (i.e., each of the aforementioned initial low-frequency frequency-domain coefficients) is computed as follows:
  • the energy values of five adjacent frequency points are calculated by the following formula (36), and the sum of the energy values of the five frequency points is the first spectral energy of the current sub-spectrum:
  • S Low (i, j) is the initial low-frequency frequency domain coefficient
  • Real and Imag are the real and imaginary parts of the initial low-frequency frequency domain coefficient respectively
  • Pe(k) is the first spectral energy
  • Fe_sm(k) = (Fe(k) + Fe_pre(k)) / 2       (38)
  • Fe(k) is the smoothing term of the first spectral energy of the current sub-spectrum
  • Pe(k) is the first spectral energy of the current sub-spectrum of the current speech frame
  • Pe_pre(k) is the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame
  • Fe_sm(k) is the accumulated-and-averaged smoothing term of the first spectral energy
  • Fe_pre(k) is the smoothing term of the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame; the associated speech frame is at least one speech frame located before and adjacent to the current speech frame.
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband
  • the spectral tilt coefficient of the initial spectrum is determined based on the following logic:
  • T_para is the spectrum tilt coefficient
  • SQRT is the square-root operation
  • f_cont_low = 0.035 is the preset filter coefficient
  • 7 is half of the total number of sub-spectra (14).
  • gain_f0(k) is the second filter gain of the k-th sub-spectrum
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • gain_f1(k) is the second filter gain adjusted according to the spectral tilt coefficient T_para, i.e., adjusted according to formula (41)
  • gain_Low_post_filt(k) is the filter gain, finally obtained according to gain_f1(k), that is applied to the 5 low-frequency frequency-domain coefficients corresponding to the k-th sub-spectrum
  • the gain here refers to the second filter gain; gain_f1(k) is the adjusted second filter gain.
  • Step S8 frequency-time transformation, that is, inverse short-time Fourier transform ISTFT:
  • the low-frequency frequency-domain coefficients S_Low_rev(i,j) and the high-frequency frequency-domain coefficients S_High_rev(i,j) are combined to generate the band-extended (wideband) spectrum.
  • the inverse time-frequency transform, i.e., the ISTFT (Inverse Short-Time Fourier Transform), is then applied.
  • after reconstruction, Rec(i,j), the effective spectrum of the narrowband signal to be processed has been expanded to 7000 Hz (see the sketch below).
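  • A sketch of the frequency-time step for one frame, assuming a 161-bin half spectrum (DC through Nyquist of a 320-point frame) assembled from the 70 low- and 70 high-frequency coefficients; windowed overlap-add across frames is only indicated in the comments:

    import numpy as np

    def frame_to_time(s_low_rev, s_high_rev, dc_bin=0.0 + 0.0j, frame_len=320):
        """Inverse STFT of one reconstructed frame (bins above 7000 Hz assumed zero)."""
        half = np.zeros(frame_len // 2 + 1, dtype=complex)   # 161 bins for a 320-point frame
        half[0] = dc_bin
        coeffs = np.concatenate([np.asarray(s_low_rev), np.asarray(s_high_rev)])
        half[1:1 + len(coeffs)] = coeffs                      # band-extended spectrum
        frame = np.fft.irfft(half, n=frame_len)               # real 320-sample frame
        # In the full pipeline this frame is windowed and overlap-added with the
        # previous frame to obtain the 16000 Hz wideband output samples.
        return frame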
  • in the second example, the time-frequency transform is the MDCT.
  • in the first example, the time-frequency transform of the narrowband signal to be processed is based on the STFT.
  • with the STFT, each signal frequency point contains amplitude information and phase information.
  • the phase of the high-frequency part is directly mapped from the low-frequency part, so there is a certain error; therefore, the MDCT is used in the second example.
  • the MDCT still uses windowing and overlap processing similar to the first example, but the generated MDCT coefficients are real numbers and carry more information; only the correlation between the high-frequency MDCT coefficients and the low-frequency MDCT coefficients is used, and, in the same way as in the first example, the neural network model can complete the frequency band expansion.
  • the specific process includes the following steps:
  • Step T1 front-end signal processing:
  • the narrowband signal to be processed is subjected to an up-sampling process with a factor of 2, and an up-sampling signal with a sampling rate of 16000 Hz is output.
  • the up-sampled signal corresponds to 160 sample points (frequency points).
  • the up-sampled signal is transformed by the modified discrete cosine transform (MDCT), specifically:
  • the 160 sample points corresponding to the previous speech frame and the 160 sample points corresponding to the current speech frame (the narrowband signal to be processed) form an array, and the array includes 320 sample points.
  • the sample points in the array are windowed with a cosine window, and the windowed signal s_Low(i,j) is processed by the MDCT to obtain 160 low-frequency frequency-domain coefficients S_Low(i,j).
  • i is the frame index of the speech frame and j is the intra-frame sample index (j = 0, 1, ..., 159), as in the sketch below.
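  • A direct-form sketch of a 320-point MDCT producing 160 coefficients; the window here is a generic sine-shaped window for illustration, not necessarily the cosine window used by the embodiment:

    import numpy as np

    def mdct(x):
        """Direct-form MDCT: 2N windowed samples -> N coefficients (N = len(x) // 2)."""
        x = np.asarray(x, dtype=float)
        n2 = len(x)                  # 320 samples: previous frame + current frame
        n = n2 // 2                  # 160 MDCT coefficients
        win = np.sin(np.pi / n2 * (np.arange(n2) + 0.5))   # illustrative window
        xw = x * win
        ks = np.arange(n)
        ns = np.arange(n2)
        basis = np.cos(np.pi / n * (ns[None, :] + 0.5 + n / 2) * (ks[:, None] + 0.5))
        return basis @ xw            # S_Low(i, j), j = 0..159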
  • Step T2 feature extraction:
  • the narrowband signal is a signal with a sampling rate of 16000 Hz and a bandwidth of 0 to 3500 Hz
  • the low-frequency spectrum envelope of the narrowband signal to be processed can be determined based on the 70 low-frequency frequency domain coefficients.
  • the low-frequency spectrum envelope can be determined based on the low-frequency frequency domain coefficients in the following manner:
  • the narrowband signal to be processed is divided into bands.
  • the frequency band corresponding to every 5 adjacent low-frequency frequency domain coefficients can be divided into one sub-band, which is divided into 14 sub-bands.
  • Each sub-band corresponds to 5 low frequency frequency domain coefficients.
  • the low-frequency spectrum envelope of the sub-band is defined as the average energy of adjacent low-frequency frequency-domain coefficients; specifically, it can be calculated by formula (43) (a sketch is given below):
  • the low frequency spectrum envelope includes 14 sub-spectral envelopes.
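  • Formula (43), as described above (the average energy of the 5 adjacent low-frequency frequency-domain coefficients of each sub-band), can be sketched as:

    import numpy as np

    def low_spectrum_envelope(s_low, n_subbands=14, per_band=5):
        """e_Low(i, k): average energy of the 5 MDCT coefficients of sub-band k."""
        coeffs = np.asarray(s_low[: n_subbands * per_band], dtype=float)
        return (coeffs.reshape(n_subbands, per_band) ** 2).mean(axis=1)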
  • the 70-dimensional low-frequency frequency domain coefficient S Low (i, j) and the 14-dimensional low-frequency spectrum envelope e Low (i, k) can be used as the input of the neural network model.
  • Input layer: the neural network model takes the above 84-dimensional feature vector as input.
  • Output layer: considering that the target bandwidth of the frequency band extension in this embodiment is 7000 Hz, it is necessary to predict the high-frequency spectrum envelope e_High(i,k) of the 14 subbands corresponding to the 3500–7000 Hz band; in addition, 4 probability values fc related to the relative flatness information can be output at the same time, so the output is 18-dimensional.
  • the processing of the neural network model in the second example is the same as that of the neural network model in the first example described above, and will not be repeated here; a structural sketch is given below.
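  • A hedged PyTorch-style sketch of the described topology (84-dimensional input, a unilateral LSTM layer, and two fully connected branches outputting the 14-dimensional envelope and 4 flatness probabilities); the hidden sizes and layer arrangement are illustrative, not the embodiment's exact configuration:

    import torch
    import torch.nn as nn

    class BweModel(nn.Module):
        """Sketch of the described network: LSTM trunk plus two fully connected heads."""

        def __init__(self, in_dim=84, hidden=256, fc_dim=512):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)  # unilateral (causal) LSTM
            self.env_head = nn.Sequential(nn.Linear(hidden, fc_dim), nn.ReLU(),
                                          nn.Linear(fc_dim, 14))   # high-frequency envelope
            self.flat_head = nn.Sequential(nn.Linear(hidden, fc_dim), nn.ReLU(),
                                           nn.Linear(fc_dim, 4))   # flatness probabilities

        def forward(self, feats):
            # feats: (batch, time, 84) = 70-dim spectrum/amplitude + 14-dim envelope
            h, _ = self.lstm(feats)
            env = self.env_head(h)                                  # (batch, time, 14)
            flat = torch.softmax(self.flat_head(h), dim=-1)         # (batch, time, 4)
            return env, flat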
  • Step T4 generate high frequency amplitude spectrum:
  • Step T5 high-frequency post-filtering:
  • High-frequency post-filtering filters the obtained initial high-frequency frequency-domain coefficients; the filtered coefficients are recorded as the high-frequency frequency-domain coefficients.
  • the initial high-frequency frequency-domain coefficients are filtered through the filter gain determined based on the high-frequency frequency-domain coefficients, as shown in the following formula (44):
  • G_High_post_filt(j) is the filter gain calculated according to the high-frequency frequency-domain coefficients
  • S_High(i,j) is the initial high-frequency frequency-domain coefficients
  • S_High_rev(i,j) is the high-frequency frequency-domain coefficients obtained by filtering.
  • the specific processing of this high-frequency post-filtering is similar to that of the high-frequency post-filtering in the first example, as follows:
  • the initial high-frequency frequency-domain coefficients are divided into bands; for example, five adjacent initial high-frequency frequency-domain coefficients are combined into one sub-spectrum, which in this example corresponds to 14 sub-bands, and the average energy is calculated for each sub-band.
  • the energy of each frequency point (i.e., each of the aforementioned initial high-frequency frequency-domain coefficients) is computed as follows:
  • the energy values of five adjacent frequency points are calculated by the following formula (45), and the sum of the energy values of the five frequency points is the first spectral energy of the current sub-spectrum:
  • S High (i, j) is the initial high frequency frequency domain coefficient
  • Pe(k) is the first spectrum energy
  • k 0, 1,...13, indicating the index number of the subband, there are 14 subbands in total.
  • Fe_sm(k) = (Fe(k) + Fe_pre(k)) / 2       (47)
  • Fe(k) is the smoothing term of the first spectral energy of the current sub-spectrum
  • Pe(k) is the first spectral energy of the current sub-spectrum of the current speech frame
  • Pe_pre(k) is the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame
  • Fe_sm(k) is the accumulated-and-averaged smoothing term of the first spectral energy
  • Fe_pre(k) is the smoothing term of the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame.
  • the associated speech frame is at least one speech frame located before and adjacent to the current speech frame, so the short-term and long-term correlation between speech signal frames is fully considered.
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband.
  • the spectral tilt coefficient of the initial spectrum is determined based on the following logic:
  • T_para is the spectrum tilt coefficient
  • SQRT is the square-root operation
  • f_cont_low = 0.07 is the preset filter coefficient
  • 7 is half of the total number of sub-spectra (14).
  • gain f0 (k) is the second filter gain of the k-th sub-spectrum
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • the second filter gain gain_f0(k) needs to be further adjusted according to the following formula (50):
  • gain_f1(k) is the second filter gain adjusted according to formula (50)
  • gain_High_post_filt(k) is the filter gain, finally obtained according to gain_f1(k), that is applied to the 5 high-frequency frequency-domain coefficients corresponding to the k-th sub-spectrum
  • the gain here refers to the second filter gain.
  • Step T6 low frequency post filtering
  • Low-frequency post-filtering filters the initial low-frequency frequency-domain coefficients obtained by the MDCT of the narrowband signal to be processed, to obtain the low-frequency frequency-domain coefficients.
  • the initial low-frequency frequency-domain coefficients are filtered through the filter gain determined based on the initial low-frequency frequency-domain coefficients, as shown in the following formula (52):
  • G_Low_post_filt(j) is the filter gain calculated according to the initial low-frequency frequency-domain coefficients
  • S_Low(i,j) is the initial low-frequency frequency-domain coefficients
  • S_Low_rev(i,j) is the low-frequency frequency-domain coefficients obtained by filtering.
  • the initial low-frequency frequency domain coefficients are divided into bands. For example, five adjacent initial low-frequency frequency domain coefficients are combined into one sub-frequency spectrum. This example corresponds to 14 sub-bands. Calculate the average energy for each subband.
  • the energy of each frequency point (i.e., each of the aforementioned initial low-frequency frequency-domain coefficients) is computed as follows:
  • the energy values of five adjacent frequency points are calculated by the following formula (53), and the sum of the energy values of the five frequency points is the first spectrum energy of the current sub-spectrum:
  • S Low (i, j) is the initial low-frequency frequency domain coefficient
  • Real and Imag are the real and imaginary parts of the initial low-frequency frequency domain coefficient respectively
  • Pe(k) is the first spectral energy
  • Fe_sm(k) = (Fe(k) + Fe_pre(k)) / 2       (55)
  • Fe(k) is the smoothing term of the first spectral energy of the current sub-spectrum
  • Pe(k) is the first spectral energy of the current sub-spectrum of the current speech frame
  • Pe_pre(k) is the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame
  • Fe_sm(k) is the accumulated-and-averaged smoothing term of the first spectral energy
  • Fe_pre(k) is the smoothing term of the first spectral energy of the corresponding sub-spectrum of the associated speech frame of the current speech frame; the associated speech frame is at least one speech frame located before and adjacent to the current speech frame.
  • e1 is the first subband energy of the first subband
  • e2 is the second subband energy of the second subband.
  • the spectral tilt coefficient of the initial spectrum is determined based on the following logic:
  • T_para is the spectral tilt coefficient
  • SQRT is the square-root operation
  • f_cont_low = 0.035 is the preset filter coefficient
  • 7 is half of the total number of sub-spectra (14).
  • gain f0 (k) is the second filter gain of the k-th sub-spectrum
  • Fe(k) is the first spectral energy of the k-th sub-spectrum
  • the second filter gain gain_f0(k) needs to be further adjusted according to the following formula (58):
  • gain_f1(k) is the second filter gain adjusted according to the spectral tilt coefficient T_para, i.e., adjusted according to formula (58)
  • gain_Low_post_filt(k) is the filter gain, finally obtained according to gain_f1(k), that is applied to the 5 low-frequency frequency-domain coefficients corresponding to the k-th sub-spectrum
  • the gain here refers to the second filter gain.
  • Step T7 frequency-time transform, that is, the inverse modified discrete cosine transform (IMDCT):
  • the low-frequency frequency-domain coefficients S_Low_rev(i,j) and the high-frequency frequency-domain coefficients S_High_rev(i,j) are combined to generate the band-extended (wideband) spectrum.
  • the inverse time-frequency transform, i.e., the IMDCT (Inverse Modified Discrete Cosine Transform), is then applied to obtain the time-domain wideband signal.
  • the VoIP side can only receive narrowband voices from the PSTN (the sampling rate is 8kHz, and the effective bandwidth is generally 3.5kHz).
  • the user's intuitive feeling is that the sound is not bright enough, the volume is not loud enough, and the intelligibility is only average.
  • with the frequency band expanded based on the technical solution disclosed in this application, and without additional bits, the effective bandwidth can be extended to 7 kHz at the receiving end of the VoIP side; users can intuitively perceive a brighter timbre, a louder volume and better intelligibility.
  • there is no forward-compatibility problem; that is, without modifying the protocol, the solution is fully compatible with the PSTN.
  • the method of the embodiment of this application can be applied to the downstream side of the PSTN-VoIP channel.
  • the functional modules of the solution provided by the embodiments of this application can be integrated in a client terminal equipped with the conference system, and the frequency band of the narrowband signal can be expanded on the client terminal to obtain a wideband signal.
  • the signal processing in this scenario is a signal post-processing technology.
  • for the PSTN, the encoding system can be ITU-T G.711
  • the voice is restored after G.711 decoding is completed.
  • the post-processing technology involved in the embodiments of this application is performed on the speech frames, so that VoIP users can receive wideband signals even if the sending end sends narrowband signals.
  • the method of the embodiment of the present application can also be applied in the mixing server of the PSTN-VoIP channel.
  • the expanded wideband signal is sent to the VoIP client; the VoIP client receives the VoIP code stream corresponding to the wideband signal and, by decoding the VoIP code stream, recovers the band-extended wideband voice output.
  • a typical function of the audio mixing server is to perform transcoding, for example, transcoding the bit stream of the PSTN link (such as using G.711 encoding) to the bit stream commonly used in VoIP (such as OPUS or SILK, etc.).
  • the G.711 decoded speech frame can be up-sampled to 16000 Hz, and then the solution provided in the embodiment of this application can be used to complete the frequency band expansion; then, it can be transcoded into a common stream for VoIP.
  • when the VoIP client receives one or more VoIP streams, it can recover the band-extended wideband voice output through decoding (the server-side path is outlined below).
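  • The mixing-server flow described above can be outlined as below; the four callables are placeholders for the real G.711 decoder, up-sampler, band-extension module and VoIP encoder (e.g. OPUS/SILK), not real APIs:

    def transcode_pstn_to_voip(g711_payload, g711_decode, upsample_by_2, band_extend, opus_encode):
        """Outline of the described mixing-server path (placeholder helpers)."""
        narrow_8k = g711_decode(g711_payload)   # PSTN leg: decode the G.711 speech frame
        wide_16k = upsample_by_2(narrow_8k)     # up-sample to a 16000 Hz sampling rate
        extended = band_extend(wide_16k)        # frequency-band extension as described here
        return opus_encode(extended)            # transcode into a common VoIP code stream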
  • FIG. 5 is a schematic structural diagram of a frequency band extension device provided by another embodiment of this application.
  • the device 50 may include a low-frequency spectrum determination module 51, a correlation parameter determination module 52, a high-frequency spectrum determination module 53 and a broadband signal determination module 54, where:
  • the low-frequency spectrum determination module 51 is configured to perform time-frequency transformation on the narrowband signal to be processed to obtain the corresponding initial low-frequency spectrum
  • the correlation parameter determination module 52 is used to obtain, based on the initial low-frequency spectrum and through the neural network model, the correlation parameters between the high-frequency part and the low-frequency part of the target broadband spectrum, where the correlation parameters include at least one of a high-frequency spectrum envelope and relative flatness information, and the relative flatness information represents the correlation between the spectral flatness of the high-frequency part of the target broadband spectrum and the spectral flatness of the low-frequency part;
  • the high-frequency spectrum determination module 53 is used to obtain the initial high-frequency spectrum based on the correlation parameter and the initial low-frequency spectrum;
  • the broadband signal determination module 54 is used to obtain a band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum; the target low-frequency spectrum is the initial low-frequency spectrum or the spectrum obtained by filtering the initial low-frequency spectrum, and the target high-frequency spectrum is the initial high-frequency spectrum or the spectrum obtained by filtering the initial high-frequency spectrum.
  • when the broadband signal determination module performs filtering processing on the initial low-frequency spectrum or the initial high-frequency spectrum, it is specifically used to:
  • filter processing is performed on each corresponding sub-spectrum respectively.
  • when the broadband signal determination module determines the filter gain corresponding to each sub-spectrum based on the first spectral energy corresponding to each sub-spectrum, it is specifically used to:
  • the filter gain corresponding to each sub-spectrum is determined.
  • the narrowband signal is the speech signal of the current speech frame
  • the wideband signal determination module, when determining the first spectral energy of a sub-spectrum, is specifically used to:
  • if the current speech frame is the first speech frame, take the first initial spectral energy of the sub-spectrum as the first spectral energy;
  • if the current speech frame is not the first speech frame, obtain the second initial spectral energy of the corresponding sub-spectrum of the associated speech frame,
  • where the associated speech frame is at least one speech frame located before and adjacent to the current speech frame;
  • based on the first initial spectral energy and the second initial spectral energy, obtain the first spectral energy of the sub-spectrum.
  • the correlation parameters include high-frequency spectrum envelope and relative flatness information;
  • the neural network model includes at least an input layer and an output layer; the input layer receives the feature vector of the low-frequency spectrum, and the output layer includes at least a unilateral long short-term memory (LSTM) layer and two fully connected network layers respectively connected to the LSTM layer.
  • each fully connected network layer includes at least one fully connected layer.
  • the LSTM layer transforms the feature vector processed by the input layer; one of the fully connected network layers performs the first classification process on the vector values transformed by the LSTM layer and outputs the high-frequency spectrum envelope, and the other fully connected network layer performs the second classification process on the vector values transformed by the LSTM layer and outputs the relative flatness information.
  • it also includes a processing module
  • the processing module is specifically configured to determine the low frequency spectrum envelope of the narrowband signal to be processed based on the initial low frequency spectrum
  • the input of the neural network model also includes the low-frequency spectrum envelope.
  • the time-frequency transform includes Fourier transform or discrete cosine transform
  • when the high-frequency spectrum determination module uses the neural network model to obtain, based on the initial low-frequency spectrum, the correlation parameters between the high-frequency part and the low-frequency part of the target broadband spectrum, it is specifically used to:
  • if the time-frequency transform is the Fourier transform, obtain the low-frequency amplitude spectrum of the narrowband signal to be processed, input the low-frequency amplitude spectrum into the neural network model, and obtain the correlation parameters based on the output of the neural network model;
  • if the time-frequency transform is the discrete cosine transform, input the initial low-frequency spectrum into the neural network model, and obtain the correlation parameters based on the output of the neural network model.
  • the time-frequency transform includes Fourier transform or discrete cosine transform
  • when the high-frequency spectrum determination module obtains the initial high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum, it is specifically used to:
  • if the time-frequency transform is the Fourier transform, obtain the low-frequency spectrum envelope of the narrowband signal to be processed, and obtain the initial high-frequency spectrum accordingly;
  • if the time-frequency transform is the discrete cosine transform, obtain the low-frequency spectrum envelope of the narrowband signal to be processed, generate the first high-frequency spectrum, and adjust the first high-frequency spectrum to obtain the initial high-frequency spectrum.
  • the correlation parameter further includes relative flatness information, and the relative flatness information represents the correlation between the spectral flatness of the high-frequency part of the target broadband spectrum and the spectral flatness of the low-frequency part;
  • when the high-frequency spectrum determination module adjusts the high-frequency spectrum information based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, it is specifically used to:
  • adjust the high-frequency spectrum information, where the high-frequency spectrum information includes the initial high-frequency amplitude spectrum or the first high-frequency spectrum.
  • the relative flatness information includes the relative flatness information of at least two sub-band regions corresponding to the high-frequency part, and the relative flatness information corresponding to one sub-band region represents the correlation between the spectral flatness of that sub-band region of the high-frequency part and the spectral flatness of the high-frequency band of the low-frequency part.
  • the spectral parameters of each sub-band region are obtained based on the spectral parameters of the high-frequency band of the low-frequency part, and the relative flatness information includes the relative flatness information between the spectral parameters of each sub-band region and the spectral parameters of the high-frequency band; if the time-frequency transform is the Fourier transform, the spectral parameter is the amplitude spectrum, and if the time-frequency transform is the discrete cosine transform, the spectral parameter is the frequency spectrum;
  • when the high-frequency spectrum determination module determines the gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the initial low-frequency spectrum, it is specifically used to:
  • when the high-frequency spectrum determination module adjusts the high-frequency spectrum envelope based on the gain adjustment value, it is specifically used to:
  • the corresponding spectrum envelope part is adjusted.
  • the high-frequency spectrum envelope includes a first predetermined number of high-frequency sub-spectrum envelopes
  • the high-frequency spectrum determination module determines the gain adjustment value of the corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band area and the spectrum energy information corresponding to each sub-band area in the initial low-frequency spectrum.
  • for each high-frequency sub-spectral envelope, the gain adjustment value of that high-frequency sub-spectral envelope is determined according to the relative flatness information corresponding to the sub-band region to which the high-frequency sub-spectral envelope corresponds, and the spectral energy information corresponding to that sub-band region in the initial low-frequency spectrum;
  • when the high-frequency spectrum determination module adjusts the corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope, it is specifically used to:
  • the corresponding high-frequency sub-spectrum envelope is adjusted.
  • with the frequency band extension method and device, in the process of obtaining the band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum, filtering processing is performed on at least one of the initial low-frequency spectrum or the initial high-frequency spectrum, so that before the wideband signal is obtained the initial low-frequency spectrum can be filtered to effectively filter out the quantization noise that may be introduced by the narrowband signal during quantization, and the initial high-frequency spectrum can also be filtered to effectively filter out the noise introduced during the band-extension process based on the initial low-frequency spectrum, which enhances the signal quality of the wideband signal and further improves the user's hearing experience.
  • the frequency band expansion is carried out by the method of this scheme, without the need to record side information in advance, that is, no additional bandwidth is required.
  • this embodiment is a device item embodiment corresponding to the foregoing method item embodiment, and this embodiment can be implemented in cooperation with the foregoing method item embodiment.
  • the related technical details mentioned in the above method item embodiments are still valid in this embodiment, and in order to reduce repetition, they will not be repeated here.
  • the related technical details mentioned in this embodiment can also be applied to the above method item embodiment.
  • the electronic device 600 shown in FIG. 6 includes a processor 601 and a memory 603. Wherein, the processor 601 and the memory 603 are connected, for example, connected through a bus 602. Furthermore, the electronic device 600 may further include a transceiver 604. It should be noted that in actual applications, the transceiver 604 is not limited to one, and the structure of the electronic device 600 does not constitute a limitation to the embodiment of the present application.
  • the processor 601 is used in the embodiments of the present application to implement the functions of the low-frequency spectrum determination module, the correlation parameter determination module, the high-frequency amplitude spectrum determination module, the high-frequency phase spectrum generation module, the high-frequency spectrum determination module, and the broadband signal determination module shown in FIG. 5.
  • the processor 601 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processor 601 may also be a combination that implements computing functions, for example, including a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the bus 602 may include a path for transferring information between the above-mentioned components.
  • the bus 602 may be a PCI bus, an EISA bus, or the like.
  • the bus 602 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, but it does not mean that there is only one bus or one type of bus.
  • the memory 603 can be a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 603 is used to store application program codes for executing the solutions of the present application, and the processor 601 controls the execution.
  • the processor 601 is configured to execute application program codes stored in the memory 603 to implement the actions of the frequency band extension apparatus provided in the embodiment shown in FIG. 5.
  • the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, it can realize that, in the process of obtaining the band-extended wideband signal according to the target low-frequency spectrum and the target high-frequency spectrum, the initial low-frequency spectrum can be filtered to effectively filter out the quantization noise that may be introduced by the narrowband signal during quantization, and the initial high-frequency spectrum can also be filtered to effectively filter out the noise introduced during the band-extension process based on the initial low-frequency spectrum, enhancing the signal quality of the wideband signal and further improving the user's hearing experience.
  • frequency band expansion by this method does not require side information to be recorded in advance, that is, no additional bandwidth is required.
  • the embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the electronic device executes the above-mentioned frequency band extension method.
  • the embodiments of the present application provide a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the program is executed by a processor, the method shown in the foregoing embodiment is implemented.
  • filtering is performed on at least one of the initial low-frequency spectrum or the initial high-frequency spectrum, so that before the wideband signal is obtained, the initial low-frequency spectrum can be filtered to effectively filter out the quantization noise that may be introduced by the narrowband signal during quantization, and the initial high-frequency spectrum can also be filtered to effectively filter out the noise introduced during the band-expansion process based on the initial low-frequency spectrum, enhancing the signal quality of the wideband signal and further improving the user's hearing experience.
  • frequency band expansion by this method does not require side information to be recorded in advance, that is, no additional bandwidth is required.
  • the computer-readable storage medium provided in the embodiment of the present application is applicable to any embodiment of the foregoing method.

Abstract

A frequency band extension method, apparatus, electronic device and computer-readable storage medium. The method is executed by an electronic device and includes: performing time-frequency transformation on a narrowband signal to be processed to obtain a corresponding initial low-frequency spectrum; obtaining, based on the initial low-frequency spectrum and through a neural network model, correlation parameters between the high-frequency part and the low-frequency part of a target wideband spectrum, the correlation parameters including at least one of a high-frequency spectrum envelope and relative flatness information, the relative flatness information characterizing the correlation between the spectral flatness of the high-frequency part of the target wideband spectrum and the spectral flatness of the low-frequency part; obtaining an initial high-frequency spectrum based on the correlation parameters and the initial low-frequency spectrum; and obtaining a band-extended wideband signal according to a target low-frequency spectrum and a target high-frequency spectrum, where the target low-frequency spectrum is the initial low-frequency spectrum or a spectrum obtained by filtering the initial low-frequency spectrum, and the target high-frequency spectrum is the initial high-frequency spectrum or a spectrum obtained by filtering the initial high-frequency spectrum.

Description

频带扩展方法、装置、电子设备及计算机可读存储介质
本申请要求于2019年9月18日提交中国专利局、申请号为201910882478.4、发明名称为“频带扩展方法、装置、电子设备及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及音频处理技术领域,具体而言,本申请涉及一种频带扩展方法、装置、电子设备及计算机可读存储介质。
发明背景
频带扩展,也可称为频带复制,是音频编码领域的一项经典技术。频带扩展技术是一种参数编码技术,通过频带扩展可以在接收端实现有效带宽的扩展,以提高音频信号的质量,使用户可以直观感受到更亮的音色、更大的音量和更好的可懂度。
在现有技术中,一种频带扩展的经典实现方法是利用语音信号中高频与低频的相关性进行频带扩展,在音频编码系统中,上述相关性作为边信息(side information),在编码端,将上述边信息合并到码流并传输出去,解码端通过解码,顺序恢复低频频谱,并进行频带扩展操作恢复高频频谱。但是该方法需要系统消耗相应的比特(例如:在编码低频部分信息的基础上,额外花费10%的比特编码上述边信息),即需要额外的比特进行编码,且存在前向兼容的问题。
另一种常用的频带扩展方法是基于数据分析的盲式方案,该方案基于神经网络或者深度学习,输入是低频系数、输出是高频系数。这种系数-系数的映射方式,对网络的泛化能力要求很高;为了保证效果,网络深度和体积较大,复杂度高;在实际过程中,在超出训练库所包含的模式外的场景,该方法的性能一般。
发明内容
本申请实施例的目的旨在至少能解决上述的技术缺陷之一,特提出以下技术方案:
一方面,提供了一种频带扩展方法,由电子设备执行,包括:
对待处理窄带信号进行时频变换得到对应的初始低频频谱;
基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,其中,相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
基于相关性参数和初始低频频谱,得到初始高频频谱;
根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;
其中,目标低频频谱为初始低频频谱或对初始低频频谱进行滤波处理后的频谱,目标高频频谱为初始高频频谱或对初始高频频谱进行滤波处理后的频谱。
一方面,提供了一种频带扩展装置,包括:
低频频谱确定模块,用于对待处理窄带信号进行时频变换得到对应的初始低频频谱;
相关性参数确定模块,用于基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高 频部分与低频部分的相关性参数,其中,相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
高频频谱确定模块,用于基于相关性参数和初始低频频谱,得到初始高频频谱;
宽带信号确定模块,用于根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;其中,目标低频频谱为初始低频频谱或对初始低频频谱进行滤波处理后的频谱,目标高频频谱为初始高频频谱或对初始高频频谱进行滤波处理后的频谱。
一方面,提供了一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现上述的频带扩展方法。
一方面,提供了一种计算机可读存储介质,计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现上述的频带扩展方法。
本申请实施例附加的方面和优点将在下面的描述中部分给出,这些将从下面的描述中变得明显,或通过本申请的实践了解到。
附图简要说明
本申请实施例上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:
图1A示出了本申请实施例中提供的一种频带扩展方法的场景图。
图1B为本申请实施例的频带扩展方法的流程示意图;
图2为本申请实施例的神经网络模型的网络结构示意图;
图3为本申请实施例的第一示例中频带扩展方法的流程示意图;
图4为本申请实施例的第二示例中频带扩展方法的流程示意图;
图5为本申请实施例的频带扩展装置的结构示意图;
图6为本申请实施例的电子设备的结构示意图。
实施方式
下面详细描述本申请的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,仅用于解释本申请,而不能解释为对本申请的限制。
本技术领域技术人员可以理解,除非特意声明,这里使用的单数形式“一”、“一个”、“所述”和“该”也可包括复数形式。应该进一步理解的是,本申请的说明书中使用的措辞“包括”是指存在所述特征、整数、步骤、操作、元件和/或组件,但是并不排除存在或添加一个或多个其他特征、整数、步骤、操作、元件、组件和/或它们的组。应该理解,当我们称元件被“连接”或“耦接”到另一元件时,它可以直接连接或耦接到其他元件,或者也可以存在中间元件。此外,这里使用的“连接”或“耦接”可以包括无线连接或无线耦接。这里使用的措辞“和/或”包括一个或更多个相关联的列出项的全部或任一单元和全部组合。本实施例中,所述“多个”指两个或两个以上。
为了更好的理解及说明本申请实施例的方案,下面对本申请实施例中所涉及到的一些技术用 语进行简单说明。
频带扩展(Band Width Extension,BWE):是音频编码领域中的一项将窄频带信号扩展为宽带信号的技术。
频谱:是频率谱密度的简称,是频率的分布曲线。
频谱包络(Spectrum Envelope,SE):是信号对应的频率轴上,信号所对应的谱系数的能量表示,对于子带而言,是子带所对应的谱系数的能量表示,如子带所对应的谱系数的平均能量。
频谱平坦度(Spectrum Flatness,SF):表征待测信号在其所在信道内,功率平坦的程度。
神经网络(Neural Network,NN):是一种模仿动物神经网络行为特征,进行分布式并行信息处理的算法数学模型。这种网络依靠系统的复杂程度,通过调整内部大量节点之间相互连接的关系,从而达到处理信息的目的。
深度学习(Deep Learning,DL):深度学习是机器学习的一种,深度学习通过组合低层特征形成更加抽象的高层表示属性类别或特征,以发现数据的分布式特征表示。
PSTN(Public Switched Telephone Network,公共交换电话网络):一种常用旧式电话系统,即我们日常生活中常用的电话网。
VoIP(Voice over Internet Protocol,网络电话):是一种语音通话技术,经由网际协议来达成语音通话与多媒体会议,也就是经由互联网来进行通信。
3GPP EVS:3GPP(3rd Generation Partnership Project,第三代合作伙伴计划)主要是制订以全球移动通信系统为基础,为无线接口的第三代技术规范;EVS(Enhance Voice Services,增强型话音业务)编码器是新一代的语音频编码器,不仅对于语音和音乐信号都能够提供非常高的音频质量,而且还具有很强的抗丢帧和抗延时抖动的能力,可以为用户带来全新的体验。
IEFT OPUS:Opus是一个有损声音编码格式,由互联网工程任务组(IETF,The Internet Engineering Task Force)开发。
SILK:Silk音频编码器是Skype网络电话向第三方开发人员和硬件制造商提供免版税认证的Silk宽带。
具体地,频带扩展是音频编码领域的一项经典技术,在现有技术中,频带扩展可通过以下方式实现:
第一种方式,在低采样率下的窄频带信号,选择窄频带信号中的低频部分的频谱复制到高频;根据提前记录的边界信息(描述高频与低频的能量相关性的信息)将窄频带信号(即窄带信号)扩展为宽频带信号(即宽带信号)。
第二种方式,盲式频带扩展,无需额外比特,直接完成频带扩展,在低采样率下的窄频带信号,利用神经网络或深度学习等技术,神经网络或深度学习的输入为窄频带信号的低频频谱,输出为高频频谱,基于高频频谱将窄频带信号扩展为宽频带信号。
但是,通过第一种方式进行频带扩展,其中的边信息需要消耗相应的比特,且存在前向兼容的问题,比如,一个典型PSTN(窄带语音)和VoIP(宽带语音)互通场景。在PSTN-VoIP的传输方向,如果不修改传输协议(添加对应的频带扩展码流),无法完成PSTN-VoIP的传输方向输出宽带语音的目的。通过第二种方式进行频带扩展,输入是低频频谱,输出是高频频谱。这种方式不需要消耗额外的比特,但是对网络的泛化能力要求很高,为了保证网络输出的准确性,网络的深度和体积较大,复杂度较高,性能较差。因此,基于上述两种频带扩展方式均不能满足实际频带扩展的性能要求。
针对现有技术存在的问题,以及为了更好的满足实际应用需求,本申请实施例提供了一种频带扩展方法,通过该方法不但不需要额外的比特,减少了网络的深度和体积,还降低了网络复杂度。
在本申请的实施例中,以PSTN(窄带语音)和VoIP(宽带语音)互通场景为例,对本申请的方案进行描述,即在PSTN至VoIP(简写为PSTN-VoIP)的传输方向,将窄带语音扩展为宽带语音。在实际应用中,本申请并不限定上述应用场景,也适用于其它编码系统,包括但不限于:3GPP EVS、IEFT OPUS、SILK等主流音频编码器。
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对本申请的实施例进行描述。
需要说明的是,在下面以PSTN和VoIP互通的语音场景为例对本申请实施例的方案进行描述的过程中,采样率为8000Hz、一帧语音帧的帧长为10ms(相当于80个样本点/帧)。在实际应用中,考虑到PSTN帧长为20ms,因此,只需要对每一个PSTN帧进行两次操作。本申请实施例的描述过程中,将以数据帧长固定为10ms为例,然而,对于本领域技术人员来说清楚的是,帧长为其它值的场景,如20ms(相当于160个样本点/帧)的场景,本申请依然适用,在此不做限定。
同样的,本申请实施例中以采样率为8000Hz为例,并不是用于限定本申请实施例所提供的频带扩展的作用范围。比如,虽然本申请主要实施例是将采样率为8000Hz的信号频带扩展到16000Hz采样率的信号,但是,本申请也可以适用于其它采样率场景,如将16000Hz采样率的信号扩展为32000Hz采样率的信号、将8000Hz采样率的信号扩展为12000Hz采样率的信号等。本申请实施例的方案可以应用于任意的需要进行信号频带扩展的场景中。
图1A示出了本申请实施例中提供的一种频带扩展方法的应用场景图。如图1A所示,电子设备可以包括手机110、或者笔记本电脑112,但不限于此。以电子设备为手机110为例,其余情况类似。手机110通过网络12与服务器设备13通信。其中,在该示例中,服务器设备13包括神经网络模型。手机110将待处理的窄带信号输入至服务器设备13中的神经网络模型,通过图1B所示的方法得到并输出频带扩展后的宽带信号。
虽然在图1A的示例中,神经网络模型位于服务器设备13中,但是在另外一种实现方式中,神经网络模型可以位于电子设备中(图中未示出)。
本申请一个示例提供了一种频带扩展方法,该方法由如图6所示的电子设备执行,该电子设备可以是终端或者服务器。终端可以是台式设备或者移动终端。服务器可以是独立的物理服务器、物理服务器集群或者虚拟服务器。如图1B所示,该方法包括:
步骤S110,对待处理窄带信号进行时频变换得到对应的初始低频频谱。
具体地,初始低频频谱是通过对窄带信号进行时频变换得到的,该时频变换包括但不限于傅里叶变换、离散余弦变换、离散正弦变换及小波变换等。待处理的窄带信号可以是需要进行频带扩展的语音帧信号,比如,在PSTN-VoIP通路中,需要将PSTN窄带语音信号扩展为VoIP宽带语音信号,则待处理的窄带信号可以是PSTN窄带语音信号。如果待处理的窄带信号是语音帧的信号,则该待处理的窄带信号可以是一帧语音帧的全部或部分语音信号。
其中,在实际的应用场景中,对于需要处理的信号,可以将该信号作为待处理的窄带信号一次完成频带扩展,亦可以将该信号划分为多个子信号,对多个子信号分别进行处理,如上述PSTN帧的帧长为20ms,可以将该20ms语音帧的信号进行一次频带扩展,也可以将该20ms的语音帧划 分为两个10ms的语音帧,分别对两个10ms的语音帧进行频带扩展。
步骤S120,基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,其中,相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性。
具体地,神经网络模型可以是预先基于信号的低频频谱训练得到的模型,该模型用于预测信号的相关性参数。目标宽频频谱指的是对窄带信号的带宽进行扩展后所对应的频谱,目标宽频频谱是基于待处理语音信号的低频频谱得到的,比如,目标宽频频谱可以是将待处理语音信号的低频频谱进行复制得到的。
步骤S130,基于相关性参数和初始低频频谱,得到初始高频频谱。
具体地,基于初始低频频谱(低频部分对应的参数),可以预测出需要扩展到的宽带信号的初始高频频谱(即宽带信号的高频部分对应的参数)。
步骤S140,根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;其中,目标低频频谱为初始低频频谱或对初始低频频谱进行滤波处理后的频谱,目标高频频谱为初始高频频谱或对初始高频频谱进行滤波处理后的频谱。
具体地,由于在确定待处理的窄带信号的初始低频频谱的过程中,通常需要对窄带信号进行量化处理,而在量化处理时一般会引入量化噪声,因此,在得到频带扩展后的宽带信号的过程中,可以对该初始低频频谱进行滤波处理,得到相应的目标低频频谱,来滤除初始低频频谱中的量化噪声,再基于目标低频频谱,得到频带扩展后的宽带信号,以防止将量化噪声扩展到宽带信号中。
具体地,在得到频带扩展后的宽带信号的过程中,可以先对该初始高频频谱进行滤波处理,得到相应的目标高频频谱,从而有效滤除初始高频频谱中可能存在的噪声,再基于目标高频频谱,得到频带扩展后的宽带信号,增强宽带信号的信号质量,进一步提升用户的听觉体验。
换言之,根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号,包括以下任一种情况:
一种情况:若仅对初始低频频谱进行滤波处理,即目标低频频谱为对初始低频频谱进行滤波处理后的频谱,目标高频频谱为初始高频频谱,则根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号,可以为:根据初始高频频谱(未进行滤波处理)与目标低频频谱,得到频带扩展后的宽带信号。其中,根据初始高频频谱与目标低频频谱,得到频带扩展后的宽带信号的具体过程,可以为:先将初始高频频谱与目标低频频谱合并,再对合并后的频谱进行时频反变换(即频时变换),得到新的宽带信号,实现待处理的窄带信号的频带扩展。
另一种情况:若仅对初始高频频谱进行滤波处理,即目标高频频谱为对初始高频频谱进行滤波处理后的频谱,目标低频频谱为初始低频频谱,则根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号,可以为:根据初始低频频谱(未进行滤波处理)与目标高频频谱,得到频带扩展后的宽带信号。其中,基于初始低频频谱与目标高频频谱,得到频带扩展后的宽带信号的具体过程,可以为:先将初始低频频谱与目标高频频谱合并,再对合并后的频谱进行时频反变换(即频时变换),得到新的宽带信号,实现待处理的窄带信号的频带扩展。
再一种情况:若既对初始低频频谱进行滤波处理,又对初始高频频谱进行滤波处理,即目标高频频谱为对初始高频频谱进行滤波处理后的频谱,目标低频频谱为对初始低频频谱进行滤波处理后的频谱,则根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号,可以为:先将目标低频频谱与目标高频频谱合并,再对合并后的频谱进行时频反变换(即频时变换),得到新 的宽带信号,实现待处理的窄带信号的频带扩展。
其中,由于扩展后的宽带信号的带宽大于待处理的窄带信号的带宽,因此,基于该宽带信号,可以得到音色洪亮、音量较大的语音帧,使得用户可以有更好的听觉体验。
本申请实施例所提供的频带扩展方法,在根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号的过程中,通过对初始低频频谱或初始高频频谱中的至少一项进行滤波处理,使得在得到宽带信号之前,可以对初始低频频谱进行滤波处理,从而有效滤除窄带信号在量化过程中可能引入的量化噪声;也可以对初始高频频谱进行滤波处理,从而有效滤除基于初始低频频谱进行频带扩展的过程中引入的噪声,增强宽带信号的信号质量,进一步提升用户的听觉体验。此外,通过本方案的方法进行频带扩展,无需提前记录边信息,即无需额外的带宽。
在本申请实施例的一种实现方式中,目标宽频频谱指的是与窄带信号想要扩展到的宽带信号(目标宽带信号)所对应的频谱,目标宽频频谱是基于待处理语音信号的低频频谱得到的,比如,目标宽频频谱可以是将待处理语音信号的低频频谱进行复制得到的。
具体地,神经网络模型可以是预先基于样本数据训练得到的模型,每个样本数据包括样本窄带信号和该样本窄带信号所对应的样本宽带信号,对于每个样本数据,可以确定出其样本宽带信号的频谱的高频部分与低频部分的相关性参数(该参数可以理解为样本数据的标注信息,即样本标签,简称为标注结果),该相关性参数包括高频频谱包络,还可以包括样本宽带信号的频谱的高频部分与低频部分的相对平坦度信息,在基于样本数据对神经网络模型进行训练时,初始的神经网络模型的输入为样本窄带信号的低频频谱,输出为预测出的相关性参数(简称为预测结果),可以基于各样本数据所对应的预测结果和标注结果的相似程度来判断模型训练是否结束,如通过模型的损失函数是否收敛来判断模型训练是否结束,该损失函数表征了各样本数据的预测结果和标注结果的差异程度,将训练结束时的模型作为本申请实施例应用时的神经网络模型。
在神经网络模型的应用阶段,对于上述窄带信号,则可以将该窄带信号的低频频谱输入至训练好的神经网络模型中,得到该窄带信号所对应的相关性参数。由于在基于样本数据对模型进行训练时,样本数据的样本标签为样本宽带信号的高频部分与低频部分的相关性参数,因此,基于该神经网络模型的输出得到的该窄带信号的相关性参数,则该相关性参数可以很好的表征出目标宽带信号的频谱的高频部分与低频部分的相关性。
具体地,由于相关性参数可以表征目标宽频频谱的高频部分与低频部分的相关性,则基于相关性参数和初始低频频谱(低频部分对应的参数),可以预测出需要扩展得到的宽带信号的初始高频频谱(即宽带信号的高频部分对应的参数)。
在本实现方式中,可以基于待处理的窄带信号的初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,由于是采用神经网络模型进行预测,因此,无需对额外的比特进行编码,是一种盲式分析方法,具有较好的前向兼容性,且由于模型的输出是能够反映出目标宽频频谱的高频部分与低频部分的相关性的参数,实现了频谱参数到相关性参数的映射,与现有的系数至系数的映射方式相比,具有更好的泛化能力,并且可以得到音色洪亮、音量较大的信号,使得用户有更好的听觉体验。
在本申请实施例的一种实现方式中,初始低频频谱是通过对待处理窄带信号进行时频变换得到的,该时频变换包括但不限于傅里叶变换、离散余弦变换、离散正弦变换及小波变换等。
其中,确定待处理窄带信号的初始低频频谱,可以包括:
对窄带信号进行采样因子为第一设定值的上采样处理,得到上采样信号;
对上采样信号进行时频变换,得到低频频域系数;
将该低频频域系数确定为初始低频频谱。
下面结合一个示例对确定初始低频频谱的方式进行进一步详细的说明。该示例中以前文描述的PSTN和VoIP互通的语音场景、语音信号的采样率为8000Hz、一帧语音帧的帧长为10ms为例进行描述。
该示例中,PSTN信号采样率为8000Hz,根据Nyquist(奈奎斯特)采样定理,窄带信号的有效带宽为4000Hz。本示例的目的是将该窄带信号进行频带扩展后,得到带宽为8000Hz的信号,即宽带信号的带宽为8000Hz。考虑到在实际的语音通信场景中,有效带宽为4000Hz的信号,其一般有效带宽的上界为3500Hz。因此,在本方案中,实际得到的宽带信号的有效带宽为7000Hz,则本示例的目的是将带宽为3500Hz的窄带信号进行频带扩展,得到带宽为7000Hz的宽带信号,即将采样率为8000Hz信号频带扩展到采样率为16000Hz的信号。
本示例中,采样因子为2,对窄带信号进行采样因子为2的上采样处理,得到采样率为16000Hz的上采样信号。由于窄带信号的采样率为8000Hz,帧长为10ms,则该上采样信号对应160个样本点。
之后,对上采样信号进行时频变换,得到初始低频频域系数,在得到初始低频频域系数后,可以将该初始低频频域系数作为初始低频频谱,以用于后续的低频频谱包络、低频幅度谱等的计算。
具体地,上述的傅里叶变换可以为短时傅立叶变换STFT(Short-Time Fourier Transform),上述的离散余弦变换可以为改进离散余弦变换MDCT((Modified Discrete Cosine Transform)。在对上采样信号进行时频变换的过程中,考虑到消除帧间数据的不连续性,可采用将上一帧语音帧对应的频点和当前语音帧(待处理的窄带信号)对应的频点组合成一个数组,然后对该数组中的频点进行加窗处理,得到加窗处理后的信号。
具体地,当时频变换采用STFT时,可以采用汉宁窗进行加窗处理。在进行汉宁窗的加窗处理之后,可以对加窗处理后的信号进行STFT,得到相应的低频频域系数。考虑到傅立叶变换的共轭对称关系,第一个系数为直流分量,如果得到的低频频域系数为M个,则可选择(1+M/2)个低频频域系数进行后续的处理。
作为一个示例,对上述包含160个样本点的上采样信号进行STFT的具体过程为:将上一语音帧对应的160个样本点与当前语音帧(待处理的窄带信号)对应的160个样本点组成一个数组,该数组包括320个样本点。接着对该数组中的样本点进行汉宁窗的加窗处理,得到加窗处理后的信号s Low(i,j),接着对s ow(i,j)进行傅立叶变换,得到320个低频频域系数S Low(i,j)。其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,319)。考虑到傅立叶变换的共扼对称关系,第一个系数为直流分量,因此可以只考虑前161个低频频域系数,即将该161个低频频域系数中的第2个至第161个低频频域系数作为上述的初始低频频谱。
具体地,当时频变换采用MDCT时,可以采用余弦窗进行加窗处理。在进行余弦窗的加窗处理之后,可以对加窗处理后的信号进行MDCT,得到相应的低频频域系数,并基于该低频频域系数进行后续的处理。假定加窗处理后的信号为s Low(i,j),其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,319),则:可以对s Low(i,j)进行320点的MDCT,得到160点的MDCT系数S Low(i,j),其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,159),并将该160点的MDCT系数作为低频频域系数。
需要说明的是,当窄带信号为采样率为8000Hz,带宽为0~3500Hz的信号时,基于窄带信号的采样率和帧长,可以确定出具有有效数据的低频频域系数实际上为70个,即初始低频频谱S Low(i,j)的有效系数个数为70个,即j=0,1,…,69,下面也将以该70个初始低频频谱为例,对后续处理过程进行具体介绍。
在本申请实施例的一种实现方式中,时频变换包括傅里叶变换或离散余弦变换。其中,在通过对待处理的窄带信号进行时频变换,得到初始低频频谱后,若时频变换为傅里叶变换(比如STFT),此时的初始低频频谱是复数形式的,故可以先根据该复数形式的初始低频频谱得到实数形式的低频幅度谱,再基于低频幅度谱进行后续处理,即在基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数的过程中,可以先根据初始低频频谱,得到窄带信号的低频幅度谱;再将低频幅度谱输入至神经网络模型,基于神经网络模型的输出得到目标宽频频谱的高频部分与低频部分的相关性参数。若时频变换为离散余弦变换(比如MDCT),此时的初始低频频谱是实数形式的,故可以直接根据该实数形式的初始低频频谱进行后续处理,即在基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数的过程中,可以将初始低频频谱输入至神经网络模型,基于神经网络模型的输出得到目标宽频频谱的高频部分与低频部分的相关性参数。
具体地,当时频变换为离散正弦变换、小波变换等时,可以根据需要参考上述的傅里叶变换或者离散余弦变换的处理过程,来基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,在此不再赘述。
在本申请实施例的一种实现方式中,还包括如下操作步骤:
基于初始低频频谱,确定待处理窄带信号的低频频谱包络;
其中,神经网络模型的输入还包括低频频谱包络。
具体地,当时频变换为傅里叶变换(比如STFT)时,在得到初始低频频谱后,可以再根据初始低频频谱得到窄带信号的低频幅度谱,在得到低频幅度谱之后,可以再根据低频幅度谱,确定窄带信号的低频频谱包络,即基于初始低频频谱,确定窄带信号的低频频谱包络。当时频变换为离散余弦变换(比如MDCT)时,在得到初始低频频谱后,可以再根据初始低频频谱得到窄带信号的低频频谱包络,即基于初始低频频谱,确定窄带信号的低频频谱包络。其中,在确定出窄带信号的低频频谱包络之后,可以将该低频频谱包络作为神经网络模型的输入,即神经网络模型的输入还包括低频频谱包络。
具体地,为了使输入神经网络模型的数据更丰富,还可以选择与低频部分的频谱相关的参数作为神经网络模型的输入,窄带信号的低频频谱包络是与信号的频谱相关的信息,则可以将低频频谱包络作为神经网络模型的输入,从而可以基于低频频谱包络和低频频谱得到更加准确的相关性参数(时频变换为MDCT的情况),即将低频频谱包络和初始低频频谱输入至神经网络模型,可以得到相关性参数,或者基于低频频谱包络和低频幅度谱得到更加准确的相关性参数(时频变换为STFT的情况),从而将低频频谱包络和低频幅度谱输入至神经网络模型,可以得到相关性参数。
一种情况,当时频变换为傅里叶变换(比如STFT)时,在得到初始低频频谱之后,即可基于初始低频频谱,确定窄带信号的低频幅度谱,具体的,可以通过以下公式(1)计算得到低频幅度谱:
P_Low(i,j) = SQRT(Real(S_Low(i,j))^2 + Imag(S_Low(i,j))^2)       (1)
其中,P Low(i,j)表示低频幅度谱,S Low(i,j)为初始低频频谱,Real和Imag分别为初始低频频谱的实部和虚部,SQRT为开根号操作。若窄带信号为采样率为8000Hz,带宽为0~3500Hz的信号,则可以基于窄带信号的采样率和帧长,通过低频频域系数确定70个低频幅度谱的谱系数(低频幅度谱系数)P Low(i,j),j=0,1,…69。在实际应用中,可以直接将计算出的70个低频幅度谱系数作为窄带信号的低频幅度谱,进一步地,为了计算方便,也可以进一步将低频幅度谱转换到对数域,即对通过公式(1)计算得到的幅度谱进行对数运算,将对数运算后的幅度谱作为后续处理时的低频幅度谱。
其中,在根据公式(1)得到包含70个系数的低频幅度谱之后,即可基于低频幅度谱确定出窄带信号的低频频谱包络。
在本申请实施例的方案中,该方法还可以包括:
将低频幅度谱划分为第四数量的子幅度谱;
分别确定每个子幅度谱对应的子频谱包络,低频频谱包络包括确定出的第四数量的子频谱包络。
具体地,将低频幅度谱的谱系数划分为第四数量(记作M个)的子幅度谱的一种可实现方式为:对窄带信号进行分带处理,得到M个子幅度谱,每个子带可以对应相同或不同数量的子幅度谱的谱系数,所有子带对应的谱系数的总数量等于低频幅度谱的谱系数的个数。
在划分为M个子幅度谱后,可以基于每个子幅度谱,确定每个子幅度谱对应的子频谱包络,其中,一种可实现方式为:基于每个子幅度谱对应的低频幅度谱的谱系数,可以确定每个子带的子频谱包络,即每个子幅度谱对应的子频谱包络,M个子幅度谱可以对应确定出M个子频谱包络,则低频频谱包络包括确定出的M个子频谱包络。
作为一个示例,比如,对于上述70个低频幅度谱的谱系数(可以是基于公式(1)计算出的系数,也可以是基于公式(1)计算出之后再转换到对数域的系数),如果每个子带包含相同数量的谱系数,比如5个,记作N=5,则每5个子幅度谱的谱系数对应的频带可以划分为一个子带,此时共划分为14(M=14)个子带,每个子带对应有5个谱系数。则在划分14个子幅度谱之后,可基于该14个子幅度谱对应确定出14个子频谱包络。
其中,确定每个子幅度谱对应的子频谱包络,可以包括:
基于每个子幅度谱所包括的谱系数的对数取值,得到每个子幅度谱对应的子频谱包络。
具体的,基于每个子幅度谱的谱系数,通过公式(2)确定每个子幅度谱对应的子频谱包络。
其中,公式(2)为:
e_Low(i,k) = (1/N) · Σ_{j=N·k}^{N·k+N−1} log(P_Low(i,j))       (2)
其中,e Low(i,k)表示子频谱包络,i为语音帧的帧索引,k表示子带的索引号,共M个子带,k=0,1,2……M,则低频频谱包络中包括M个子频谱包络。
一般地,子带的谱包络定义为相邻系数的平均能量(或者进一步转换成对数表示),但是该方式,有可能会导致幅值较小的系数不能够起到实质性的作用,本申请实施例提供的该种将每个子幅度谱所包括的谱系数的对数标识直接求平均,得到子幅度谱对应的子频谱包络的方案,与现有常用的包络确定方案相比,可以更好的在神经网络模型训练过程的失真控制中保护好幅值较小的系数,从而使更多的信号参数能够在频带扩展中起到相应的作用。
由此,如果将低频幅度谱和低频频谱包络作为神经网络模型的输入,低频幅度谱为70维的 数据,低频频谱包络为14维的数据,则模型的输入为84维的数据,由此,本方案中的神经网络模型的体积小,复杂度低。
另一种情况,当时频变换为离散余弦变换(比如MDCT)时,在得到初始低频频谱之后,即可基于初始低频频谱,确定窄带信号的低频频谱包络。具体的,可以通过对窄带信号进行分带,针对70个低频频域系数,可以将每5个相邻的低频频域系数对应的频带划分为一个子带,共划分为14个子带,每个子带对应有5个低频频域系数。对于每个子带,该子带的低频频谱包络定义为相邻低频频域系数的平均能量。具体可通过公式(3)计算得到:
e_Low(i,k) = (1/5) · Σ_{j=5k}^{5k+4} (S_Low(i,j))^2       (3)
其中,e Low(i,k)表示子频谱包络(每个子带的低频频谱包络),S Low(i,j)为初始低频频谱,k表示子带的索引号,共14个子带,k=0,1,2……13,则低频频谱包络中包括14个子频谱包络。
由此,可以将70维的低频频域系数S Low_rev(i,j)和14维的低频频谱包络e Low(i,k)作为神经网络模型的输入,即神经网络模型的输入为84维的数据。
在本申请实施例的方案中,若时频变换为傅里叶变换,在基于相关性参数和初始低频频谱,得到目标高频频谱的过程中,可以包括:
基于初始低频频谱,得到待处理窄带信号的低频频谱包络;
基于低频幅度谱,生成初始高频幅度谱;
基于高频频谱包络和低频频谱包络,对初始高频幅度谱进行调整,得到目标高频幅度谱;
基于窄带信号的低频相位谱,生成相应的高频相位谱;
根据目标高频幅度谱和高频相位谱,得到目标高频频谱;
若时频变换为离散余弦变换,在基于相关性参数和初始低频频谱,得到初始高频频谱的过程中,可以包括:
根据初始低频频谱,得到窄带信号的低频频谱包络;
基于初始低频频谱,生成第一高频频谱;
基于高频频谱包络和低频频谱包络,对第一高频频谱进行调整,得到初始高频频谱。
具体地,当时频变换为傅里叶变换时,上述基于窄带信号的低频相位谱生成相应的高频相位谱的方式可以包括但不限于以下任一种:
第一种:通过复制低频相位谱,得到相应的高频相位谱。
第二种:对低频相位谱进行翻折,翻折后得到一个与低频相位谱相同的相位谱,将这两个低频相位谱映射到相应的高频频点,得到相应的高频相位谱。
具体地,当时频变换为傅里叶变换时,在基于低频幅度谱,生成初始高频幅度谱的过程中,可以是通过对低频幅度谱进行复制得到初始高频幅度谱。可以理解的是,在实际应用中,对低频幅度谱进行复制的具体方式,根据最后需要得到的宽带信号的频带宽度、进行复制的所选择的低频幅度谱部分的频带宽度的不同,复制方式也会不同。例如,假设宽带信号的频带宽度为窄带信号的2倍,且选择对窄带信号全部的低频幅度谱进行复制,则只需进行一次复制,如果选择对窄带信号部分的低频幅度谱进行复制,则需要根据所选择的部分对应的频带宽度,进行相应次数的复制,如选择窄带信号1/2的低频幅度谱进行复制,则需要复制2次,如果选择窄带信号1/4的低频幅度谱进行复制,则需要复制4次。
作为一个示例,比如,扩展后的宽带信号的带宽为7kHz,所选择进行复制的低频幅度谱对应 的带宽为1.75kHz,则基于低频幅度谱对应的带宽和扩展后的宽带信号的带宽,可以将低频幅度谱对应的带宽复制3次,得到初始高频幅度谱对应的带宽(5.25kHz)。如果所选择进行复制的低频幅度谱对应的带宽为3.5kHz,扩展后的宽带信号的带宽为7kHz,则将低频幅度谱对应的带宽复制1次即可得到初始高频幅度谱对应的带宽(3.5kHz)。
具体地,当时频变换为离散余弦变换时,在基于初始低频频谱,生成第一高频频谱的过程中,可以对初始低频频谱进行复制得到第一高频频谱。其中,对初始低频频谱进行复制的过程,与傅里叶变换下的对低频幅度谱进行复制得到初始高频幅度谱的过程类似,在此不再赘述。
需要说明的是,当时频变换为离散正弦变换、小波变换等时,在生成初始高频幅度谱的过程中,可以根据需要参考上述的傅里叶变换的初始高频幅度谱的生成过程;当然,在生成第一高频频谱的过程中,也可以根据需要参考上述的离散余弦变换的第一高频频谱的生成过程,在此不再赘述。
本申请实施例的实施方式中,基于低频幅度谱,生成初始高频幅度谱的一种实现方式可以为:对低频幅度谱中高频段部分的幅度谱进行复制,得到初始高频幅度谱;基于初始低频频谱,生成第一高频频谱的一种实现方式可以为:对初始低频频谱中高频段部分的频谱进行复制,得到第一高频频谱。
具体地,当时频变换为傅里叶变换时,由于得到的低频幅度谱的低频段部分包含大量谐波,影响扩展后宽带信号的信号质量,因此,可以选择低频幅度谱中高频段部分的幅度谱进行复制,以得到初始高频幅度谱。
作为一个示例,如前述场景为例,继续进行说明,低频幅度谱共对应70个频点,如果选择低频幅度谱对应的35-69个频点(频幅度谱中高频段部分的幅度谱)作为待复制的频点,即“母板”,且扩展后的宽带信号的带宽为7000Hz,则需要对所选择的低频幅度谱对应的频点进行复制得到包含70个频点的初始高频幅度谱,为了得到该包含70个频点的初始高频幅度谱,可以将低频幅度谱对应的35-69,共计35个频点复制两次,生成初始高频幅度谱。同样的,如果选择低频幅度谱对应的0-69个频点作为待复制的频点,且扩展后的宽带信号的带宽为7000Hz,则可将低频幅度谱对应的0-69,共计70个频点复制一次,生成初始高频幅度谱,该初始高频幅度谱共包括70个频点。
由于低频幅度谱对应的信号中可能包含大量的谐波,仅通过复制得到的初始高频幅度谱对应的信号中同样会包含大量的谐波,则为了减少频带扩展后的宽带信号中的谐波,可以通过高频频谱包络和低频频谱包络的差值对初始高频幅度谱进行调整,将调整后的初始高频幅度谱作为目标高频幅度谱,可以减少最终频点扩展后得到的宽带信号中的谐波。
具体地,当时频变换为离散余弦变换时,同样由于初始低频频谱的低频段部分包含大量谐波,影响扩展后宽带信号的信号质量,因此,可以选择初始低频频谱中高频段部分的频谱进行复制,以得到第一高频频谱,这与傅里叶变换情况下的对低频幅度谱中高频段部分的幅度谱进行复制,得到初始高频幅度谱的过程类似,在此不再赘述。
需要说明的是,当时频变换为离散正弦变换、小波变换等时,在生成初始高频幅度谱的过程中,可以根据需要参考上述的傅里叶变换的初始高频幅度谱的生成过程;当然,在生成第一高频频谱的过程中,也可以根据需要参考上述的离散余弦变换的第一高频频谱的生成过程,在此不再赘述。
本申请实施例的方案中,高频频谱包络和低频频谱包络均为对数域的频谱包络;
基于高频频谱包络和低频频谱包络,对初始高频幅度谱进行调整,得到目标高频幅度谱,可以包括:
确定高频频谱包络和低频频谱包络的第一差值;
基于第一差值对初始高频幅度谱进行调整,得到目标高频幅度谱;
基于高频频谱包络和低频频谱包络,对第一高频频谱进行调整,包括:
确定高频频谱包络和低频频谱包络的第二差值;
基于第二差值对第一高频频谱进行调整,得到初始高频频谱。
具体地,可以将高频频谱包络和低频频谱包络通过对数域的频谱包络表示,当时频变换为傅里叶变换时,可基于对数域的频谱包络确定出的第一差值对初始高频幅度谱进行调整,得到目标高频幅度谱;当时频变换为离散余弦变换时,可基于对数域的频谱包络确定出的第二差值对第一高频频谱进行调整,得到初始高频频谱。其中,可以通过对数域的频谱包络来表示高频频谱包络和低频频谱包络,以便于计算。
需要说明的是,当时频变换为离散正弦变换、小波变换等时,在确定目标高频幅度谱的过程中,可以根据需要参考上述的傅里叶变换的目标高频幅度谱的生成过程;当然,在确定初始高频频谱的过程中,也可以根据需要参考上述的离散余弦变换的初始高频频谱的生成过程,在此不再赘述。
本申请实施例的方案中,若初始低频频谱是通过傅里叶变换得到的,高频频谱包络包括第二数量的第一子频谱包络,初始高频幅度谱包括第二数量的第一子幅度谱,其中,每个第一子频谱包络是基于初始高频幅度谱中对应的第一子幅度谱确定的。若初始低频频谱是通过离散余弦变换得到的,高频频谱包络包括第三数量的第二子频谱包络,第一高频频谱包括第三数量的第一子频谱,其中,每个第二子频谱包络是基于第一高频频谱中对应的第一子频谱确定的。
具体地,(1)当时频变换是傅里叶变换时,子频谱包络是基于相对应的幅度谱中对应的子幅度谱确定的,一个第一子频谱包络可以基于相对应的初始高频幅度谱中对应的子幅度谱确定。每个子幅度谱对应的谱系数的数量可以是相同的,也可以是不同的,如果每个第一子频谱包络是基于相对应的幅度谱中对应的子幅度谱确定,则每个第一子频谱包络对应的幅度谱中的子幅度谱的谱系数的数量也可以是不同的。(2)当时频变换为离散余弦变换时,子频谱包络是基于相对应的频谱中对应的子频谱确定的,一个第二子频谱包络可以基于相对应的第一高频频谱中对应的子频谱确定。
需要说明的是,当时频变换为离散正弦变换、小波变换等时,可以根据需要参考上述的傅里叶变换的子频谱包络的确定方式,来得到子频谱包络,当然,也可以根据需要参考上述的离散余弦变换的子频谱包络的确定方式,来得到子频谱包络,在此不再赘述。
基于前述场景为例,继续进行说明,若时频变换是傅里叶变换,神经网络模型的输出为14维的高频频谱包络(第二数量为14),神经网络模型的输入包括低频幅度谱和低频频谱包络,其中,低频幅度谱包含70维低频频域系数,低频频谱包络包含14维子频谱包络,则神经网络模型的输入为84维的数据,输出维度远小于输入维度,可以减小神经网络模型的体积和深度,同时降低模型的复杂度。若时频变换是离散余弦变换,神经网络模型的输入、输出,与上述傅里叶变换下的神经网络模型类似,在此不再赘述。
进一步地,若时频变换为傅里叶变换,确定高频频谱包络和低频频谱包络的第一差值,基于第一差值对初始高频幅度谱进行调整,得到目标高频幅度谱,可以包括:
确定每个第一子频谱包络与低频频谱包络中对应的频谱包络(下文将该低频频谱包络中对应的频谱包络记作第三子频谱包络)的第一差值;
基于每个第一子频谱包络所对应的第一差值,对相应的第一子幅度谱进行调整,得到第二数量的调整后的第一子幅度谱;
基于第二数量的调整后的第一子幅度谱,得到目标高频幅度谱。
进一步地,若时频变换为离散余弦变换,确定高频频谱包络和低频频谱包络的第二差值,基于第二差值对第一高频频谱进行调整,得到初始高频频谱,包括:
确定每个第二子频谱包络与低频频谱包络中对应的频谱包络(下文将该低频频谱包络中对应的频谱包络记为第四子频谱包络)的第二差值;
基于每个第二子频谱包络所对应的第二差值,对相应的第一子频谱进行调整,得到第三数量的调整后的第一子频谱;
基于第三数量的调整后的第一子频谱,得到初始高频频谱。
具体地,当时频变换是傅里叶变换时,通过神经网络模型得到的高频频谱包络可以包括第二数量的第一子频谱包络,通过前文描述可知,这第二数量的第一子频谱包络是基于低频幅度谱中对应的子幅度谱确定的,即一个子频谱包络是基于低频幅度谱中对应的一个子幅度谱确定的。基于前述场景为例,继续进行说明,低频幅度谱中的子幅度谱为14个,则高频频谱包络包括14个子频谱包络。
其中,高频频谱包络和低频频谱包络的第一差值即为每一个第一子频谱包络与对应的第三子频谱包络的差值,基于第一差值对高频频谱包络进行调整则是基于每个第一子频谱包络与对应的第三子频谱包络的第一差值对相应的第一子幅度谱进行调整。基于前述场景为例,继续进行说明,如果高频频谱包络包括14个第一子频谱包络,低频频谱包络包括14个第二子频谱包络,则可以基于确定出的14个第二子频谱包络与对应的14个第一子频谱包络,确定出14个第一差值,基于这14个第一差值,对相应的子带对应的第一子幅度谱进行调整。
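下面给出上述基于第一差值进行调整的一种可能实现的简化示意（其中假设高、低频频谱包络均为以10为底的对数域均值，每个子带对应5个频点，并按包络差值对相应子幅度谱做整体缩放；这些实现细节为示例性假设，并非本申请限定的唯一方式）：

```python
import numpy as np

def adjust_high_band(p_high_init, e_high, e_low):
    """p_high_init：70维初始高频幅度谱；e_high、e_low：各14维的高频/低频对数域频谱包络（假设为log10均值）。"""
    p_target = np.empty_like(p_high_init)
    for k in range(14):
        diff = e_high[k] - e_low[k]                 # 第一差值（对数域）
        gain = 10.0 ** diff                         # 对应的线性域缩放因子（示例性假设）
        p_target[5 * k: 5 * k + 5] = p_high_init[5 * k: 5 * k + 5] * gain
    return p_target                                 # 调整后的目标高频幅度谱
```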
具体地,当时频变换是离散余弦变换时,通过神经网络模型得到的高频频谱包络可以包括第三数量的第二子频谱包络,高频频谱包络和低频频谱包络的第二差值即为每一个第二子频谱包络与对应的第四子频谱包络的差值。基于第二差值对高频频谱包络进行调整的过程中,与时频变换是傅里叶变换的情况下,基于第一差值对高频频谱包络进行调整的过程类似,在此不再赘述。
需要说明的是,当时频变换为离散正弦变换、小波变换等时,可以根据需要参考上述的傅里叶变换的高频频谱包络的调整过程,来调整相应的高频频谱包络,当然,也可以根据需要参考上述的离散余弦变换的高频频谱包络的调整过程,来调整相应的高频频谱包络,在此不再赘述。
在本申请实施例的方案中,相关性参数还包括相对平坦度信息,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
基于高频频谱包络和低频频谱包络,对高频谱信息进行调整,可以包括:
基于相对平坦度信息以及初始低频频谱的能量信息,确定高频频谱包络的增益调整值;
基于增益调整值对高频频谱包络进行调整,得到调整后的高频频谱包络;
基于调整后的高频频谱包络和低频频谱包络,对高频谱信息进行调整,其中,高频谱信息包括初始高频幅度谱或第一高频频谱。
具体地,基于前文的描述,在基于调整后的高频频谱包络和低频频谱包络,对高频谱信息进行调整的过程中,可以确定调整后的高频频谱包络和低频频谱包络的第一差值或第二差值,接着 根据第一差值对初始高频幅度谱进行调整,得到目标高频幅度谱,或者根据第二差值对第一高频频谱进行调整,得到初始高频频谱。
具体地,基于前文的描述,在神经网络模型训练的过程中,标注结果可以包括相对平坦度信息,即样本数据的样本标签包括样本宽带信号的高频部分与低频部分的相对平坦度信息,该相对平坦度信息是基于样本宽带信号的频谱的高频部分与低频部分确定的,因此,在神经网络模型应用时,在模型的输入为窄带信号的低频频谱时,可以基于该神经网络模型的输出预测出目标宽频频谱的高频部分与低频部分的相对平坦度信息。其中,相对平坦度信息可以反应出目标宽频频谱的高频部分与低频部分的相对频谱平坦度,即高频部分相对于低频部分的频谱是否是平坦的,如果相关性参数中还包括相对平坦度信息,则可以先基于相对平坦度信息和低频频谱的能量信息对高频频谱包络进行调整,再基于调整后的高频频谱包络和低频频谱包络的差值对目标宽频频谱进行调整,使得最终得到的宽带信号中的谐波更少。其中,低频频谱的能量信息可以基于低频幅度谱的谱系数确定得到,低频频谱的能量信息可以表示频谱平坦度。
本申请的实施例中,上述相关性参数可以包括高频频谱包络和相对平坦度信息,神经网络模型至少包括输入层和输出层,输入层输入低频频谱参数的特征向量(该特征向量包括70维低频幅度谱和14维低频频谱包络),输出层至少包括单边长短期记忆网络(LSTM,Long Short-Term Memory)层以及分别连接LSTM层的两个全连接网络层,每个全连接网络层可以包括至少一个全连接层,其中,LSTM层将输入层处理后的特征向量进行转换,其中一个全连接网络层根据LSTM层转换后的向量值进行第一分类处理,并输出高频频谱包络(14维),另一个全连接网络层根据LSTM层转换后的向量值进行第二分类处理,并输出相对平坦度信息(4维)。
作为一个示例,图2中示出了本申请实施例提供的一种神经网络模型的结构示意图,如图中所示,该神经网络模型主要可以包括两个部分:单边LSTM层和两个全连接层,即该示例中的每个全连接网络层包括一个全连接层,其中,一个全连接层的输出为高频频谱包络,另一个全连接层的输出为相对平坦度信息。
其中,LSTM层是一种循环神经网络,其输入为上述低频频谱参数的特征向量(可以简称为输入向量),通过LSTM将输入向量进行处理,得到一定维度的隐向量,该隐向量分别作为两个全连接层的输入,由两个全连接层分别进行分类预测处理,由一个全连接层预测输出一个14维的列向量,该输出即对应为高频频谱包络,由另一个全连接层预测输出一个4维的列向量,该向量的4个维度的值即为前文所描述的4个概率值,4个概率值分别表征了相对平坦度信息为上述4个数组的概率。
在一个示例中,当时频变换为傅里叶变换(比如STFT)时,可以先根据滤波处理后的70维的低频频谱S Low_rev(i,j),得到70维的窄带信号的低频幅度谱P Low(i,j)这一特征向量,接着将P Low(i,j)作为神经网络模型的一个输入,同时将根据P Low(i,j)计算得到的14维的低频频谱包络e Low(i,k)这一特征向量,作为神经网络模型的另一个输入,即神经网络模型的输入层为84维的特征向量。神经网络模型通过LSTM层(比如包括256个参数)对该84维的特征向量进行转换处理,得到转换处理后的向量值,并通过与LSTM层连接的一个全连接网络层(比如包括512个参数),对转换处理后的向量值进行分类处理(即第一分类处理),输出14维的高频频谱包络e High(i,k),同时通过LSTM层连接的另一个全连接网络层(比如包括512个参数),对转换处理后的向量值进行分类处理(即第二分类处理),输出4个相对平坦度信息。
在另一个示例中,当时频变换为离散余弦变换(比如MDCT)时,可以将滤波处理后的70维 的低频频谱S Low_rev(i,j)这一特征向量作为神经网络模型的一个输入,同时将根据S Low_rev(i,j)得到的14维的低频频谱包络e Low(i,k)这一特征向量,作为神经网络模型的另一个输入,即神经网络模型的输入层为84维的特征向量。神经网络模型通过LSTM层(比如包括256个参数)对该84维的特征向量进行转换处理,得到转换处理后的向量值,并通过与LSTM层连接的一个全连接网络层(比如包括512个参数),对转换处理后的向量值进行分类处理(即第一分类处理),输出14维的高频频谱包络e High(i,k),同时通过LSTM层连接的另一个全连接网络层(比如包括512个参数),对转换处理后的向量值进行分类处理(即第二分类处理),输出4个相对平坦度信息。
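下面给出与上述结构对应的一个简化模型示意（基于PyTorch；LSTM隐藏单元数、输出头是否使用softmax等细节为示例性假设，并非本申请限定的网络实现）：

```python
import torch
import torch.nn as nn

class BweNet(nn.Module):
    """单边LSTM + 两个全连接输出头：14维高频频谱包络与4维相对平坦度概率。"""
    def __init__(self, in_dim=84, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=hidden, batch_first=True)
        self.fc_env = nn.Linear(hidden, 14)     # 第一分类处理：输出高频频谱包络 e_High(i,k)
        self.fc_flat = nn.Linear(hidden, 4)     # 第二分类处理：输出4个相对平坦度概率

    def forward(self, x):                       # x: (batch, time, 84)，即70维特征+14维低频频谱包络
        h, _ = self.lstm(x)
        return self.fc_env(h), torch.softmax(self.fc_flat(h), dim=-1)
```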
在本申请实施例的方案中,相对平坦度信息包括对应于高频部分的至少两个子带区域的相对平坦度信息,一个子带区域所对应的相对平坦度信息,表征了高频部分的一个子带区域的频谱平坦度与低频部分的高频频段的频谱平坦度的相关性。
其中,相对平坦度信息是基于样本宽带信号的频谱的高频部分与低频部分确定的,由于样本窄带信号的低频部分的低频频段包含的谐波更为丰富,因此,可以选择样本窄带信号的低频部分的高频频段作为确定相对平坦度信息的参考,将该低频部分的高频频段作为母版,将样本宽带信号的高频部分划分为至少两个子带区域,每个子带区域的相对平坦度信息是基于相对应的子带区域的频谱和低频部分的频谱确定的。
基于前文的描述,在神经网络模型训练的过程中,标注结果可以包括每个子带区域的相对平坦度信息,即样本数据的样本标签可以包括样本宽带信号的高频部分的各个子带区域与低频部分的相对平坦度信息,该相对平坦度信息是基于样本宽带信号的高频部分的子带区域的频谱与低频部分的频谱确定的,因此,在神经网络模型应用时,在模型的输入为窄带信号的低频频谱时,可以基于该神经网络模型的输出预测出目标宽频频谱的高频部分的子带区域与低频部分的相对平坦度信息。
具体地,若高频部分包括对应于至少两个子带区域的谱参数,每个子带区域的谱参数是基于低频部分的高频频段的谱参数确定的,相应的,相对平坦度信息可以包括每个子带区域的谱参数与低频部分的高频频段的谱参数的相对平坦度信息,其中,谱参数为幅度谱或所述频谱。其中,当时频变换是傅里叶变换时,谱参数为幅度谱,当时频变换是离散余弦变换时,谱参数为频谱。
其中,为了达到频带扩展的目的,目标宽频频谱的低频部分的幅度谱的谱系数的个数可以与高频部分的幅度谱的谱系数的个数相同,也可以不同,每个子带区域对应的谱系数的数量可以相同,也可以不同,只要至少两个子带区域对应的谱系数的总数量与初始高频幅度谱对应的谱系数的数量一致即可。
作为一个示例,当时频变换是傅里叶变换时,比如,高频部分包括对应的至少两个子带区域为2个子带区域,分别为第一子带区域和第二子带区域,低频部分的高频频段为第35个至第69个频点所对应的频段,第一子带区域对应谱系数的数量与第二子带区域对应的谱系数的数量相同,第一子带区域和第二子带区域对应的谱系数的总数量与低频部分对应的谱系数的数量一致,则第一子带区域对应的频段是第70个至第104个频点对应的频段,第二子带区域对应的频段是第105个至第139个频点对应的频段,每个子带区域的幅度谱的谱系数的个数为35个,与低频部分的高频频段的幅度谱的谱系数的个数相同。如果选择的低频部分的高频频段为第56个至第69个频点所对应的频段,则可以将高频部分划分为5个子带区域,每个子带区域对应14个谱系数。需要说明的是,当时频变换是离散余弦变换时,高频部分包括对应于至少两个子带区域的频谱的情况,与本示例中时频变换是傅里叶变换下,高频部分包括对应于至少两个子带区域的幅度谱的情 况类似,在此不再赘述。
具体地,无论时频变换是傅里叶变换还是离散余弦变换,基于相对平坦度信息以及初始低频频谱的能量信息,确定高频频谱包络的增益调整值,可以包括:
基于每个子带区域所对应的相对平坦度信息、以及低频频谱中每个子带区域所对应的频谱能量信息,确定高频频谱包络中对应频谱包络部分的增益调整值;
其中,基于增益调整值对高频频谱包络进行调整,可以包括:
基于高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整。
具体地,如果高频部分包括至少两个子带区域,则可以基于每个子带区域所对应的相对平坦度信息和低频频谱中每个子带区域所对应的频谱能量信息,确定每个子带区域对应的高频频谱包络中对应频谱包络部分的增益调整值,然后基于确定得到的增益调整值,对相应的频谱包络部分进行调整。
作为一个示例,如前文所描述的时频变换是傅里叶变换时,至少两个子带区域为两个子带区域,分别为第一子带区域和第二子带区域,第一子带区域与低频部分的高频频段的相对平坦度信息为第一相对平坦度信息,第二子带区域与低频部分的高频频段的相对平坦度信息为第二相对平坦度信息,基于第一相对平坦度信息和第一子带区域对应的频谱能量信息确定出的增益调整值,可以对第一子带区域对应的高频频谱包络的包络部分进行调整,基于第二相对平坦度信息和第二子带区域对应的频谱能量信息确定出的增益调整值,可以对第二子带区域对应的高频频谱包络的包络部分进行调整。需要说明的是,当时频变换是离散余弦变换时,相对平坦度信息、增益调整值的确定过程,与本示例中的时频变换是傅里叶变换时,平坦度信息、增益调整值的确定过程类似,在此不再赘述。
在本申请实施例的方案中,由于样本窄带信号的低频部分的低频频段包含的谐波更为丰富,因此,可以选择样本窄带信号的低频部分的高频频段作为确定相对平坦度信息的参考,将该低频部分的高频频段作为母版,将样本宽带信号的高频部分划分为至少两个子带区域,基于高频部分的每个子带区域的频谱和低频部分的频谱来确定每个子带区域的相对平坦度信息。
基于前文的描述,在神经网络的训练阶段,可以基于样本数据(样本数据中包括样本窄带信号和对应的样本宽带信号),通过方差分析法来确定样本宽带信号的频谱的高频部分的每个子带区域的相对平坦度信息。作为一个示例,如果样本宽带信号的高频部分划分为两个子带区域,分别为第一子带区域和第二子带区域,则样本宽带信号的高频部分与低频部分的相对平坦度信息可以为,第一子带区域与样本宽带信号的低频部分的高频频段的第一相对平坦度信息,以及第二子带区域与样本宽带信号的低频部分的高频频段的第二相对平坦度信息。
下面以时频变换是傅里叶变换的情况为例,对第一相对平坦度信息和第二相对平坦度信息的确定过程进行介绍:
其中,第一相对平坦度信息和第二相对平坦度信息的具体确定方式可以为:
基于样本数据中窄带信号的频域系数S Low,sample(i,j)和样本数据中宽带信号的高频部分的频域系数S High,sample(i,j),通过公式(4)至公式(6)计算如下三个方差:
var L(S Low,sample(i,j)),j=35,36,…,69      (4)
var H1(S High,sample(i,j)),j=70,71,…,104     (5)
var H2(S High,sample(i,j)),j=105,106,…,139      (6)
其中,公式(4)为样本窄带信号的低频部分的高频频段的幅度谱的方差,公式(5)为第一子带区域的幅度谱的方差,公式(6)为第二子带区域的幅度谱的方差,var()表示求方差,频谱的方差可基于对应的频域系数表示,S Low,sample(i,j)表示样本窄带信号的频域系数。
基于上述三个方差,通过公式(7)和公式(8)确定每个子带区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息:
fc(0)由var L(S Low,sample(i,j))与var H1(S High,sample(i,j))计算得到       (7)
fc(1)由var L(S Low,sample(i,j))与var H2(S High,sample(i,j))计算得到       (8)
其中,fc(0)表示第一子带区域的幅度谱与低频部分的高频频段的幅度谱的第一相对平坦度信息,fc(1)表示第二子带区域的幅度谱与低频部分的高频频段的幅度谱的第二相对平坦度信息。
其中,可以将上述两个值fc(0)和fc(1)以是否大于等于0分类,将fc(0)和fc(1)定义为一个二分类数组,因此该数组包含4种排列组合:{0,0}、{0,1}、{1,0}、{1,1}。
由此,模型输出的相对平坦度信息可以为4个概率值,该概率值用于标识相对平坦度信息属于上述4个数组的概率。
通过概率最大原则,可以选择出4个数组的排列组合中其中一个,作为预测出的两个子带区域扩展区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息。具体的可以通过公式(9)表示:
v(i,k)=0 or 1,k=0,1       (9)
其中,v(i,k)表示两个子带区域扩展区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息,k表示不同子带区域的索引,则每个子带区域可以对应一个相对平坦度信息,例如,k=0时,v(i,k)=0表示第一子带区域相对于低频部分较为振荡,即平坦度较差,v(i,k)=1则表示第一子带区域相对于低频部分较为平坦,即平坦度较好。
在本申请实施例中，将第二窄带信号的低频频谱输入至训练好的神经网络模型，可以通过神经网络模型预测得到目标宽频频谱的高频部分的相对平坦度信息。如果选择窄带信号的低频部分的高频频段对应的频谱作为神经网络模型的输入，则基于该训练好的神经网络模型可以预测得到目标宽频频谱的高频部分的至少两个子带区域的相对平坦度信息。
在本申请实施例的方案中,高频频谱包络包括第一预定数量的高频子频谱包络,若初始低频频谱是通过傅里叶变换得到的,则第一预定数量为上述第二数量,若初始低频频谱是通过离散余弦变换得到的,则第一预定数量为上述第三数量;
其中,基于每个子带区域所对应的相对平坦度信息,以及初始低频频谱中每个子带区域对应的频谱能量信息,确定高频频谱包络中对应频谱包络部分的增益调整值,包括:
对于每一个高频子频谱包络,根据低频频谱包络中与高频子频谱包络对应的频谱包络所对应的频谱能量信息、低频频谱包络中与高频子频谱包络对应的频谱包络所对应的子带区域所对应的相对平坦度信息、低频频谱包络中与高频子频谱包络对应的频谱包络所对应的子带区域对应的频谱能量信息,确定高频子频谱包络的增益调整值;
根据高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整,包括:
根据高频频谱包络中每个高频子频谱包络的增益调整值,对相应的高频子频谱包络进行调整。
具体地,下面以初始低频频谱是通过傅里叶变换得到的,第一预定数量为第二数量为例,进行具体介绍:
具体地,高频频谱包络的每个高频子频谱包络对应一个增益调整值,该增益调整值是基于低频子频谱包络所对应的频谱能量信息、低频子频谱包络所对应的子带区域所对应的相对平坦度信息、低频子频谱包络所对应的子带区域对应的频谱能量信息确定的,且该低频子频谱包络是与该高频子频谱包络对应的,高频频谱包络包括第二数量的高频子频谱包络,则高频频谱包络包括对应的第二数量的增益调整值。
可以理解的是,如果高频部分包括对应于至少两个子带区域,对于至少两个子带区域对应的高频频谱包络,可基于每个子带区域对应的第一子频谱包络对应的增益调整值对相应子带区域的第一子频谱包络进行调整。
作为一个示例,下面以第一子带区域中包括35个频点为例,基于第二子频谱包络所对应的频谱能量信息、第二子频谱包络所对应的子带区域所对应的相对平坦度信息、第二子频谱包络所对应的子带区域对应的频谱能量信息,确定第二子频谱包络对应的第一子频谱包络的增益调整值的一种可实现方案为:
(1)、解析v(i,k),如果为1,表示高频部分非常平坦,如果为0,表示高频部分振荡。
(2)、对于第一子带区域中的35个频点，分成7个子带，每个子带对应一个第一子频谱包络。分别计算每个子带的平均能量pow_env（第二子频谱包络所对应的频谱能量信息），并计算上述7个平均能量的平均值Mpow_env（第二子频谱包络所对应的子带区域对应的频谱能量信息）。其中，每个子带的平均能量是基于对应的低频幅度谱确定的，比如，将每个低频幅度谱的谱系数的绝对值的平方作为一个低频幅度谱的能量，一个子带对应5个低频幅度谱的谱系数，则可将一个子带对应的低频幅度谱的能量的平均值作为该子带的平均能量。
(3)、基于解析的第一子带区域对应的相对平坦度信息、平均能量pow_env和平均值Mpow_env,计算每个第一子频谱包络的增益调整值,具体包括:
当v(i,k)=1,G(j)=a 1+b 1*SQRT(Mpow_env/pow_env(j)),j=0,1,…,6;
当v(i,k)=0,G(j)=a 0+b 0*SQRT(Mpow_env/pow_env(j)),j=0,1,…,6;
其中,作为一方案,a 1=0.875,b 1=0.125,a 0=0.925,b 0=0.075,G(j)为增益调整值。
其中,对于v(i,k)=0的情况,增益调整值为1,即无需对高频频谱包络进行平坦化操作(调整)。
基于上述方式可确定出高频频谱包络中7个第一子频谱包络的增益调整值,基于7个第一子频谱包络的增益调整值,对相应的第一子频谱包络进行调整,上述操作可以拉近不同子带的平均能量差异,对第一子带区域对应的频谱进行不同程度的平坦化处理。
可以理解的是，可以通过上述相同的方式对第二子带区域对应的高频频谱包络进行调整，在此不再赘述。高频频谱包络一共包括14个子频带，则可以对应确定出14个增益调整值，基于该14个增益调整值对相应的子频谱包络进行调整。
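下面给出上述增益调整值计算的一个简化示意（以第一子带区域对应的35个“母板”低频幅度谱系数为输入，子带划分与变量命名为示例性假设）：

```python
import numpy as np

def envelope_gain_adjust(p_low_ref, v):
    """p_low_ref：作为母板的35个低频幅度谱系数；v：该子带区域的相对平坦度信息（0或1）。
    返回7个子带各自的增益调整值G(j)。"""
    energies = np.abs(p_low_ref.reshape(7, 5)) ** 2     # 每个谱系数的能量：绝对值的平方
    pow_env = energies.mean(axis=1)                     # 每个子带（5个频点）的平均能量 pow_env
    m_pow_env = pow_env.mean()                          # 7个平均能量的平均值 Mpow_env
    a, b = (0.875, 0.125) if v == 1 else (0.925, 0.075)
    return a + b * np.sqrt(m_pow_env / pow_env)         # G(j)，j=0,1,…,6
```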
在本申请实施例的方案中,宽带信号中包括窄带信号中的低频部分的信号以及扩展后的高频部分的信号,则在得到低频部分对应的初始低频频谱和高频部分对应的初始高频频谱后,可以将初始低频频谱和初始高频频谱合并,得到宽频带频谱,进而对宽频带频谱进行频时变换(时频变换的反变换,将频域信号变换为时域信号),就可以得到频带扩展后的目标语音信号。
具体地,在将初始低频频谱和初始高频频谱合并之前,可以先对初始低频频谱或初始高频频 谱中的至少一项进行滤波处理,再基于滤波处理后的频谱,得到频带扩展后的宽带信号,换言之,可以只对初始低频频谱进行滤波处理,得到滤波处理后的初始低频频谱(记作目标低频频谱),再将目标低频频谱与初始高频频谱进行合并,也可以只对初始高频频谱进行滤波处理,得到滤波处理后的初始高频频谱(记作目标高频频谱),再将初始低频频谱与目标高频频谱进行合并,还可以对初始低频频谱与初始高频频谱分别进行滤波处理,得到相应的目标低频频谱与目标高频频谱,再将目标低频频谱与目标高频频谱进行合并。
具体地,初始低频频谱的滤波处理过程与初始高频频谱的滤波处理过程基本一致,下面以对初始低频频谱进行滤波处理为例,具体介绍滤波处理过程,如下所示:
在对初始低频频谱进行滤波处理的过程中,可以执行如下操作:
将初始低频频谱划分为第一数量的子频谱,并确定每个子频谱对应的第一频谱能量;
基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益;
根据每个子频谱对应的滤波增益,对相应的每个子频谱分别进行滤波处理。
具体地,上述对初始低频频谱进行滤波处理的过程,也可以先基于初始低频频谱的频谱能量,确定初始低频频谱的滤波增益(下文记作第一滤波增益),再根据第一滤波增益对初始低频频谱进行滤波处理,得到低频频谱,其中,第一滤波增益包括每个子频谱对应的滤波增益(下文记作第二滤波增益)。在实际应用中,由于初始低频频谱通常是使用初始低频频域系数表示的、低频频谱是使用低频频域系数表示,因此,在根据第一滤波增益对初始低频频谱进行滤波处理的过程中,可以描述为:先基于初始低频频域系数确定第一滤波增益,再根据第一滤波增益对初始低频频域系数进行滤波处理,得到低频频域系数。
具体地,可以通过对第一滤波增益与初始低频频域系数进行乘积运算,来对初始低频频域系数进行滤波处理,得到低频频域系数,其中,初始低频频域系数为S Low(i,j),低频频域系数为S Low_rev(i,j)。假如确定出的第一滤波增益为G Low_post_filt(j),则可以根据如下公式(10)对初始低频频域系数进行滤波处理:
S Low_rev(i,j)=G Low_post_filt(j)*S Low(i,j)       (10)
其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,69)。
具体地,在基于初始低频频域系数确定第一滤波增益的过程中,首先将初始低频频域系数划分为第一数量的子频谱,并确定每个子频谱对应的第一频谱能量,接着基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的第二滤波增益,其中,第一滤波增益包括第一数量的第二滤波增益;在根据第一滤波增益对初始频谱进行滤波处理时,可以根据每个子频谱对应的第二滤波增益,对相应的每个子频谱分别进行滤波处理。
为便于描述,将上述的第一数量记作L,其中,将初始低频频域系数划分为L个子频谱的一种可能的实现方式为:对初始低频频域系数进行分带处理,得到第一数量的子频谱,每个子带对应N个初始低频频域系数,N*L等于初始低频频域系数的总个数,L≥2,N≥1。作为一个示例,比如,初始低频频域系数有70个,则可以将每5(N=5)个初始低频频域系数对应的频带划分为一个子带,共划分为14(L=14)个子带,每个子带对应有5个初始低频频域系数。
确定每个子频谱对应的第一频谱能量一种可能的实现方式为:将每个子频谱分别对应的N个初始低频频域系数的频谱能量的和,确定为每个子频谱对应的第一频谱能量。每个初始低频频域系数的频谱能量定义为初始低频频域系数的实部平方与虚部平方的和。作为一个示例,比如初始低频频域系数有70个频谱系数、N=5、L=14,则每个子频谱各自对应的第一频谱能量可以通过如 下公式(11)计算得到:
Pe(k)=Σ_{j=5k}^{5k+4}(Real(S Low(i,j))^2+Imag(S Low(i,j))^2)，k=0,1,…,13       (11)
其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,69),k=0,1,…,13,为子带索引,表示14个子带,Pe(k)表示第k个子频谱对应的第一频谱能量,S Low(i,j)为根据时频变换得到的低频频域系数(即初始低频频域系数),Real和Imag分别为实部和虚部。
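以公式(11)为例，下面给出每个子频谱第一频谱能量计算的一个简化示意（假设初始低频频域系数以70维复数数组表示，函数名为示例性假设）：

```python
import numpy as np

def sub_spectrum_energy(s_low, n=5, l=14):
    """s_low：70个复数初始低频频域系数；返回14个子频谱的第一频谱能量Pe(k)。"""
    coeffs = s_low[: n * l].reshape(l, n)
    # 每个频点的能量为实部平方与虚部平方之和，子频谱的第一频谱能量为5个频点能量之和
    return (coeffs.real ** 2 + coeffs.imag ** 2).sum(axis=1)
```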
具体地,在得到每个子频谱各自对应的第一频谱能量后,可以基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的第二滤波增益。在确定每个子频谱对应的第二滤波增益的过程中,可以先将初始频谱对应的频带划分为第一子带和第二子带;接着根据第一子带所对应的所有子频谱的第一频谱能量,确定出第一子带的第一子带能量,根据第二子带所对应的所有子频谱的第一频谱能量,确定出第二子带的第二子带能量;接着根据第一子带能量与第二子带能量,确定初始频谱的频谱倾斜系数;接着根据频谱倾斜系数及每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的第二滤波增益。
其中,初始频谱对应的频带即为初始低频频域系数(比如70个)分别对应的频带的和,在将初始低频频域系数对应的频带划分为第一子带和第二子带的过程中,可以将第1个至第35个初始低频频域系数分别对应的频带的和作为第一子带,将第36个至第70个初始低频频域系数分别对应的频带的和作为第二子带,即第一子带对应着初始频谱中的第1个至第35个初始低频频域系数,第二子带对应着初始频谱中的第36个至第70个初始低频频域系数。假如N=5,即将每5个初始低频频域系数划分为一个子频谱,则第一子带包括7个子频谱,第二子带也包括7个子频谱,于是,可以根据第一子带包括的7个子频谱的第一频谱能量的和,确定出第一子带的第一子带能量,也可以根据第二子带包括的7个子频谱的第一频谱能量的和,确定出第二子带的第二子带能量。
具体地,当窄带信号为当前语音帧的语音信号时,对于每一个子频谱,确定其对应的第一频谱能量的一种可能的方式为:根据上述公式(11)确定每个子频谱分别对应的第一初始频谱能量Pe(k)。若当前语音帧为第一个语音帧,则可以将每个子频谱的第一初始频谱能量Pe(k)确定为该每个子频谱的第一频谱能量,可以将第一频谱能量记作Fe(k),即Fe(k)=Pe(k)。若当前语音帧不是第一个语音帧,在确定第k个子频谱的第一频谱能量的过程中,则可以获取关联语音帧的与该第k个子频谱相对应的子频谱的第二初始频谱能量,将第二初始频谱能量记作Pe pre(k),其中,该关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧。在获取到第二初始频谱能量之后,可以基于第一初始频谱能量与第二初始频谱能量,得到该子频谱的第一频谱能量。
在一个示例中,可以根据如下公式(12)确定第k个子频谱的第一频谱能量:
Fe(k)=1.0+Pe(k)+Pe pre(k)       (12)
其中,Pe(k)为第k个子频谱的第一初始频谱能量,Pe pre(k)为关联语音帧的与第k个子频谱对应的子频谱的第二初始频谱能量,Fe(k)为第k个子频谱的第一频谱能量。
需要说明的是，在上述公式(12)中的关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的一个语音帧。当关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的两个或多个语音帧时，可以根据需要对上述公式(12)进行适当调整，比如当关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的两个语音帧时，上述公式(12)可相应调整为：
Fe(k)=1.0+Pe(k)+Pe pre1(k)+Pe pre2(k)
其中，Pe pre1(k)是位于当前语音帧之前、且与当前语音帧紧邻的第一个语音帧的第一初始频谱能量，Pe pre2(k)是位于该第一个语音帧之前、且与该第一个语音帧紧邻的语音帧的第一初始频谱能量。
在另一个示例中,在根据上述公式得到第k个子频谱的第一频谱能量之后,可以对该第一频谱能量进行平滑,在确定出平滑后的第一频谱能量Fe_sm(k)之后,可以将Fe_sm(k)确定为第k个子频谱的第一频谱能量。其中,可以根据如下公式(13)对该第一频谱能量进行平滑:
Fe_sm(k)=(Fe(k)+Fe pre(k))/2       (13)
其中,Fe(k)为第k个子频谱的第一频谱能量,Fe pre(k)为关联语音帧的与第k个子频谱对应的子频谱的第一频谱能量,Fe_sm(k)为平滑后的第一频谱能量。在确定出平滑后的第一频谱能量Fe_sm(k)之后,可以将Fe_sm(k)确定为第k个子频谱的第一频谱能量。
需要说明的是，在上述公式(13)中的关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的一个语音帧。当关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的两个或多个语音帧时，可以根据需要对上述公式(13)进行适当调整，比如当关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的两个语音帧时，上述公式(13)则可以相应调整为：Fe_sm(k)=(Fe(k)+Fe pre1(k)+Fe pre2(k))/3，该Fe pre1(k)是位于当前语音帧之前、且与当前语音帧紧邻的第一个语音帧的第一频谱能量，Fe pre2(k)是位于该第一个语音帧之前、且与该第一个语音帧紧邻的语音帧的第一频谱能量。
具体地,在根据上述过程确定出每个子频谱的第一频谱能量Fe(k)或Fe_sm(k)之后,当每个子频谱的第一频谱能量为Fe(k)时,可以根据如下公式(14)确定出第一子带的第一子带能量与第二子带的第二子带能量:
e1=Σ_{k=0}^{6}Fe(k)，e2=Σ_{k=7}^{13}Fe(k)       (14)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量。
当每个子频谱的第一频谱能量为Fe_sm(k)时,可以根据如下公式(15)确定出第一子带的第一子带能量与第二子带的第二子带能量:
e1=Σ_{k=0}^{6}Fe_sm(k)，e2=Σ_{k=7}^{13}Fe_sm(k)       (15)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量。
具体地,在确定出第一子带能量与第二子带能量后,可以根据第一子带能量与第二子带能量,确定初始频谱的频谱倾斜系数。在实际应用中,可以根据如下逻辑来确定初始频谱的频谱倾斜系数:
当第二子带能量大于或等于第一子带能量时,将初始频谱倾斜系数确定为0,当第二子带能量小于第一子带能量时,可以根据下述表达式确定初始频谱倾斜系数:
T_para_0=8*f_cont_low*SQRT((e1-e2)/(e1+e2));
其中,T_para_0为初始频谱倾斜系数,f_cont_low为预先设定的滤波系数,作为一个方案,f_cont_low=0.035,SQRT为开根号操作,e1为第一子带能量,e2为第二子带能量。
具体地，在根据上述方式得到初始频谱倾斜系数T_para_0后，可以将上述的初始频谱倾斜系数作为初始频谱的频谱倾斜系数，也可以进一步根据以下方式对得到的初始频谱倾斜系数进行优化，并将优化后的初始频谱倾斜系数作为初始频谱的频谱倾斜系数，在一示例中，优化的表达式为：
T_para_1=min(1.0,T_para_0);
T_para_2=T_para_1/7;
其中,min表示取最小值,T_para_1为初始优化后的初始频谱倾斜系数,T_para_2为最终优化后的初始频谱倾斜系数,即为上述的初始频谱的频谱倾斜系数。
具体地,在确定出初始频谱的频谱倾斜系数后,可以根据频谱倾斜系数及每个子频谱各自对应的第一频谱能量,确定每个子频谱分别对应的第二滤波增益。在一示例中,可以根据如下公式(16),确定第k个子频谱对应的第二滤波增益:
gain f0(k)=Fe(k) f_cont_low         (16)
其中,gain f0(k)为第k个子频谱对应的第二滤波增益,Fe(k)为第k个子频谱的第一频谱能量,f_cont_low为预先设定的滤波系数,作为一个方案,f_cont_low=0.035,k=0,1,…,13,为子带索引,表示14个子带。
在确定出第k个子频谱对应的第二滤波增益gain f0(k)之后,如果上述的初始频谱的频谱倾斜系数不为正,则可以直接将gain f0(k)作为第k个子频谱对应的第二滤波增益,如果上述的初始频谱的频谱倾斜系数为正,则可以根据初始频谱的频谱倾斜系数,对该第二滤波增益gain f0(k)进行调整,并将调整后的第二滤波增益gain f0(k)作为第k个子频谱对应的第二滤波增益。在一示例中,可以根据如下公式(17)对第二滤波增益gain f0(k)进行调整:
gain f1(k)=gain f0(k)*(1+k*T para)        (17)
其中,gain f1(k)为调整后的第二滤波增益,gain f0(k)为第k个子频谱对应的第二滤波增益,T para为初始频谱的频谱倾斜系数,k=0,1,…,13,为子带索引,表示14个子带。
具体地,在确定出第k个子频谱对应的第二滤波增益gain f1(k)后,可以对gain f1(k)进一步优化,并将优化后的gain f1(k)作为最终的第k个子频谱对应的第二滤波增益。在一示例中,可以根据如下公式(18)对第二滤波增益gain f1(k)进行调整:
gain Low_post_filt(k)=(1+gain f1(k))/2          (18)
其中,gain Low_post_filt(k)为最终得到的第k个子频谱对应的第二滤波增益,gain f1(k)为根据公式(17)调整后的第二滤波增益,k=0,1,…,13,为子带索引,表示14个子带,从而得到14个子带分别对应的滤波增益(即上述的第二滤波增益)。
具体地，上述是以将5个初始低频频域系数划分为一个子带，即将70个初始低频频域系数划分为14个子带，每个子带包括5个初始低频频域系数为例，对计算初始低频频域系数的第一滤波增益进行介绍。上述得到的每个子带对应的第二滤波增益，即为该每个子带对应的5个初始低频频域系数的滤波增益，从而可以根据14个子带的第二滤波增益，得到70个初始低频频域系数对应的第一滤波增益为[gain Low_post_filt(0),gain Low_post_filt(1),…,gain Low_post_filt(13)]。换言之，在确定出第k个子频谱对应的第二滤波增益gain Low_post_filt(k)后，可以得到前述的第一滤波增益，其中，第一滤波增益包括第一数量（比如L=14个）的第二滤波增益gain Low_post_filt(k)，第二滤波增益gain Low_post_filt(k)为第k个子频谱对应的N个频谱系数的滤波增益。
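综合上述公式(11)至公式(18)，下面给出低频后置滤波第一滤波增益计算的一个简化示意（为便于阅读，此处取单帧情形、忽略关联语音帧项与平滑项，f_cont_low取0.035；帧间项可按公式(12)、(13)补充，属示例性简化）：

```python
import numpy as np

def low_band_postfilter_gain(s_low, f_cont_low=0.035):
    """s_low：70个初始低频频域系数（复数）。返回70维的第一滤波增益（每5个系数共用一个增益）。"""
    sub = s_low.reshape(14, 5)
    pe = (sub.real ** 2 + sub.imag ** 2).sum(axis=1)          # 公式(11)：各子频谱的第一频谱能量
    fe = 1.0 + pe                                             # 单帧近似，忽略Pe_pre(k)项
    e1, e2 = fe[:7].sum(), fe[7:].sum()                       # 公式(14)：第一/第二子带能量
    if e2 >= e1:
        t_para = 0.0
    else:
        t_para = min(1.0, 8 * f_cont_low * np.sqrt((e1 - e2) / (e1 + e2))) / 7
    gain_f0 = fe ** f_cont_low                                # 公式(16)
    gain_f1 = gain_f0 * (1 + np.arange(14) * t_para) if t_para > 0 else gain_f0   # 公式(17)
    return np.repeat((1 + gain_f1) / 2, 5)                    # 公式(18)，扩展到70个频点
```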
在本申请实施例的方案中,若窄带信号包括至少两路关联的信号,该方法还可以包括:
将至少两路关联的信号进行融合,得到窄带信号;
或者,
将至少两路关联的信号中的每一路信号分别作为窄带信号。
具体地,窄带信号可以为多路关联的信号,比如,相邻的语音帧,则可以将至少两路关联的信号进行融合,得到一路信号,将该一路信号作为窄带信号,然后通过本申请中的频带扩展方法对该窄带信号进行扩展,得到宽带信号。
或者,也可以将至少两路关联的信号中的每一路信号作为窄带信号,通过本申请实施例中的频带扩展方法对该窄带信号进行扩展,得到对应的至少两路宽带信号,该至少两路宽带信号可以合并成一路信号输出,也可以分别输出,本申请实施例中不作限定。
为了更好的理解本申请实施例所提供的方法,下面结合具体应用场景的示例对本申请实施例的方案进行进一步详细说明。
作为一个示例,应用场景为PSTN(窄带语音)和VoIP(宽带语音)互通场景,即将PSTN电话机对应的窄带语音作为待处理的窄带信号,对该待处理的窄带信号进行频带扩展,使得VoIP接收端接收到的语音帧为宽带语音,从而提高接收端的听觉体验。
在本示例中，待处理的窄带信号为采样率为8000Hz，帧长为10ms的信号，根据Nyquist采样定理，待处理的窄带信号的有效带宽为4000Hz。在实际的语音通信场景，一般有效带宽的上界为3500Hz。因此，在本示例中，以扩展后的宽带信号的带宽为7000Hz为例进行说明。
在如图3所示的第一示例中,时频变换为傅里叶变换(比如STFT),具体流程可由图6所示的电子设备执行,包括以下步骤:
步骤S1,前端信号处理:
对待处理的窄带信号进行因子为2的上采样处理,输出采样率为16000Hz的上采样信号。
由于待处理的窄带信号的采样率为8000Hz，帧长为10ms，则该上采样信号对应160个样本点（频点），对上采样信号进行短时傅立叶变换（STFT），具体为：将上一语音帧对应的160个样本点与当前语音帧（待处理的窄带信号）对应的160个样本点组成一个数组，该数组包括320个样本点。接着对该数组中的样本点进行加窗处理（即汉宁窗的加窗处理），得到加窗处理后的信号为s Low(i,j)。之后，对s Low(i,j)进行快速傅立叶变换，得到320个低频频域系数S Low(i,j)。其中，i为语音帧的帧索引，j为帧内样本索引（j=0,1,…,319）。考虑到快速傅立叶变换的共轭对称关系，第一个系数为直流分量，因此可以只考虑前161个低频频域系数。
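下面给出步骤S1前端信号处理的一个简化示意（基于numpy/scipy；上采样与加窗的具体实现方式为示例性假设）：

```python
import numpy as np
from scipy.signal import resample_poly

def front_end(prev_up, cur_narrow):
    """prev_up：上一语音帧上采样后的160个样本；cur_narrow：当前帧8kHz下的80个样本。
    返回当前帧的上采样信号与前161个低频频域系数。"""
    cur_up = resample_poly(cur_narrow, up=2, down=1)   # 因子为2的上采样，得到160个16kHz样本
    frame = np.concatenate([prev_up, cur_up])          # 组成包括320个样本点的数组
    windowed = frame * np.hanning(320)                 # 汉宁窗加窗处理
    s_low = np.fft.fft(windowed)[:161]                 # 快速傅里叶变换，只考虑前161个低频频域系数
    return cur_up, s_low
```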
步骤S2,特征提取:
a)、基于低频频域系数,通过公式(19)计算低频幅度谱:
P Low(i,j)=SQRT(Real(S Low(i,j)) 2+Imag(S Low(i,j)) 2)       (19)
其中,P Low(i,j)表示低频幅度谱,S Low(i,j)为低频频域系数,Real和Imag分别为低频频域系数的实部和虚部,SQRT为开根号操作。若待处理的窄带信号为采样率为16000Hz,带宽为0~3500Hz的信号,则可以基于待处理的窄带信号的采样率和帧长,通过低频频域系数确定出70个低频幅度谱的谱系数(低频幅度谱系数)P Low(i,j),j=0,1,…69。在实际应用中,可以直接将计算出的70个低频幅度谱系数作为待处理的窄带信号的低频幅度谱,进一步的,为了计算方便,也可以进一步将低频幅度谱转换到对数域。
在得到包含70个系数的低频幅度谱之后,即可基于低频幅度谱确定出待处理的窄带信号的低频频谱包络。
b)、进一步地,还可以通过以下方式基于低频幅度谱,确定低频频谱包络:
对待处理的窄带信号进行分带,针对70个低频幅度谱的谱系数,可以将每5个相邻的子幅度谱的谱系数对应的频带划分为一个子带,共划分为14个子带,每个子带对应有5个谱系数。对于每个子带,该子带的低频频谱包络定义为相邻谱系数的平均能量。具体可通过公式(20)计算得到:
e Low(i,k)为第k个子带内相邻5个低频幅度谱系数的平均能量（对数域表示），k=0,1,…,13       (20)
其中,e Low(i,k)表示子频谱包络(每个子带的低频频谱包络),k表示子带的索引号,共14个子带,k=0,1,2……13,则低频频谱包络中包括14个子频谱包络。
一般地，子带的谱包络定义为相邻系数的平均能量（或者进一步转换成对数表示），但是该方式有可能会导致幅值较小的系数不能够起到实质性的作用，本申请实施例所提供的该种将每个子幅度谱所包括的谱系数的对数表示直接求平均，得到子幅度谱对应的子频谱包络的方案，与现有常用的包络确定方案相比，可以更好地在神经网络模型训练过程的失真控制中保护好幅值较小的系数，从而使更多的信号参数能够在频带扩展中起到相应的作用。
由此,可以将70维的低频幅度谱和14维的低频频谱包络作为神经网络模型的输入。
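下面给出低频频谱包络计算的一个简化示意（此处假设每个子带的包络取子带内5个幅度谱系数对数表示的平均值，对数底与是否取能量为示例性假设）：

```python
import numpy as np

def low_band_envelope(p_low, eps=1e-12):
    """p_low：70维低频幅度谱；返回14维低频频谱包络（对数域）。"""
    log_mag = np.log10(p_low.reshape(14, 5) + eps)   # 每个子带内5个谱系数的对数表示
    return log_mag.mean(axis=1)                      # 子带内直接求平均，作为子频谱包络
```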
步骤S3,输入神经网络模型:
输入层:神经网络模型输入上述84维特征向量。
输出层:考虑到本实施例中频带扩展的目标宽带是7000Hz,因此,需要预测14个对应于3500-7000Hz频段的子带的高频频谱包络,即可完成基本的频带扩展功能。通常,语音帧的低频部分包含大量的基音和共振峰等类谐波结构;高频部分的频谱更为平坦;如果仅是简单地将低频频谱复制到高频,得到初始高频幅度谱,并对初始高频幅度谱进行基于子带的增益控制,重建的高频部分将产生过多的类谐波结构,会引起失真,影响听感;因此,本示例中基于神经网络模型预测出的相对平坦度信息,描述低频部分和高频部分的相对平坦度,对初始高频幅度谱进行调整,使得调整后的高频部分更为平坦,减少谐波的干扰。
在本示例中,通过对低频幅度谱中高频段部分的幅度谱进行两次复制,生成初始高频幅度谱,同时将高频部分的频段平均分成两个子带区域,分别为第一子带区域和第二子带区域,高频部分对应70个谱系数,每个子带区域对应35个谱系数,因此,高频部分将做两次平坦度分析,即对每个子带区域进行一次平坦度分析,由于低频部分特别是1000Hz以下对应的频段,谐波成分更为丰富;因此,本实施例中选择35-69的频点对应的谱系数作为“母板”,则第一子带区域对应的频段是第70个至第104个频点对应的频段,第二子带区域对应的频段是第105个至第139个频点对应的频段。
平坦度分析可以使用经典统计学中定义的方差(Variance)分析方法。通过方差分析方法可以描述出频谱的振荡程度,值越高说明谐波成份更丰富。
基于前文的描述,由于样本窄带信号的低频部分的低频频段包含的谐波更为丰富,因此,可以选择样本窄带信号的低频部分的高频频段作为确定相对平坦度信息的参考,即将该低频部分的高频频段(35-69的频点所对应的频段)作为母版,对应将样本宽带信号的高频部分划分为至少两个子带区域,基于高频部分的每个子带区域的频谱和低频部分的频谱来确定出每个子带区域的 相对平坦度信息。
在神经网络模型的训练阶段,可以基于样本数据(样本数据中包括样本窄带信号和对应的样本宽带信号),通过方差分析法来确定样本宽带信号的频谱的高频部分的每个子带区域的相对平坦度信息。
作为一个示例,如果样本宽带信号的高频部分划分为两个子带区域,分别为第一子带区域和第二子带区域,则样本宽带信号的高频部分与低频部分的相对平坦度信息可以为,第一子带区域与样本宽带信号的低频部分的高频频段的第一相对平坦度信息,以及第二子带区域与样本宽带信号的低频部分的高频频段的第二相对平坦度信息。
其中,时频变换是傅里叶变换时,第一相对平坦度信息和第二相对平坦度信息的具体确定方式可以为:
基于样本数据中窄带信号的频域系数S Low,sample(i,j)和样本数据中宽带信号的高频部分的频域系数S High,sample(i,j),通过公式(21)至公式(23)计算如下三个方差:
var L(S Low,sample(i,j)),j=35,36,…,69        (21)
var H1(S High,sample(i,j)),j=70,71,…,104      (22)
var H2(S High,sample(i,j)),j=105,106,…,139      (23)
其中,公式(21)为样本窄带信号的低频部分的高频频段的幅度谱的方差,公式(22)为第一子带区域的幅度谱的方差,公式(23)为第二子带区域的幅度谱的方差,var()表示求方差,频谱的方差可基于对应的频域系数表示,S Low,sample(i,j)表示样本窄带信号的频域系数。
基于上述三个方差,通过公式(24)和公式(25)确定每个子带区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息:
fc(0)由var L(S Low,sample(i,j))与var H1(S High,sample(i,j))计算得到       (24)
fc(1)由var L(S Low,sample(i,j))与var H2(S High,sample(i,j))计算得到       (25)
其中,fc(0)表示第一子带区域的幅度谱与低频部分的高频频段的幅度谱的第一相对平坦度信息,fc(1)表示第二子带区域的幅度谱与低频部分的高频频段的幅度谱的第二相对平坦度信息。
其中,可以将上述两个值fc(0)和fc(1)以是否大于等于0分类,将fc(0)和fc(1)定义为一个二分类数组,因此该数组包含4种排列组合:{0,0}、{0,1}、{1,0}、{1,1}。
由此,模型输出的相对平坦度信息可以为4个概率值,该概率值用于标识相对平坦度信息属于上述4个数组的概率。
通过概率最大原则,可以选择出4个数组的排列组合中其中一个,作为预测出的两个子带区域扩展区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息。具体的可以通过公式(26)表示:
v(i,k)=0 or 1,k=0,1           (26)
其中,v(i,k)表示两个子带区域扩展区域的幅度谱与低频部分的高频频段的幅度谱的相对平坦度信息,k表示不同子带区域的索引,则每个子带区域可以对应一个相对平坦度信息,例如,k=0时,v(i,k)=0表示第一子带区域相对于低频部分较为振荡,即平坦度较差,v(i,k)=1则表示第一子带区域相对于低频部分较为平坦,即平坦度较好。
步骤S4,生成高频幅度谱:
如前文,将低频幅度谱(35-69共计35个频点)复制两次,生成高频的幅度谱(共70个频点),基于窄带信号对应的初始低频频域系数或者经过滤波处理后的低频频域系数,通过训练好的神经网络模型,可以得到预测的目标宽频频谱的高频部分的相对平坦度信息。由于在本示例中选择的是35-69对应的第一低频频谱的频域系数,则通过该训练好的神经网络模型可以预测得到目标宽频频谱的高频部分的至少两个子带区域的相对平坦度信息,即目标宽频频谱的高频部分被划分为至少两个子带区域,在本示例中,以2个子带区域,则神经网络模型的输出为该2个子带区域的相对平坦度信息。
根据预测出的2个频带扩展区域对应的相对平坦度信息，对重建的高频幅度谱进行后滤波。以其中的第一子带区域为例，主要步骤包括：
(1)、解析v(i,k),如果为1,表示高频部分非常平坦,如果为0,表示高频部分振荡。
(2)、对于第一子带区域中的35个频点,分成7个子带,高频频谱包络包括14个第一子频谱包络,低频频谱包络包括14个第二子频谱包络,则每个子带可以对应一个第一子频谱包络。分别计算每个子带的平均能量pow_env(第二子频谱包络所对应的频谱能量信息),并计算上述7个子带平均能量的平均值Mpow_env(第二子频谱包络所对应的子带区域对应的频谱能量信息)。其中,每个子带的平均能量是基于对应的低频幅度谱确定的,比如,将每个低频幅度谱的谱系数的绝对值的平方作为一个低频幅度谱的能量,一个子带对应5个低频幅度谱的谱系数,则可将一个子带对应的低频幅度谱的能量的平均值作为该子带的平均能量。
(3)、基于解析的第一子带区域对应的相对平坦度信息、平均能量pow_env和平均值Mpow_env,计算每个第一子频谱包络的增益调整值,具体包括:
当v(i,k)=1,G(j)=a 1+b 1*SQRT(Mpow_env/pow_env(j)),j=0,1,…,6;
当v(i,k)=0,G(j)=a 0+b 0*SQRT(Mpow_env/pow_env(j)),j=0,1,…,6;
其中,在本示例中,a 1=0.875,b 1=0.125,a 0=0.925,b 0=0.075,G(j)为增益调整值。
其中,对于v(i,k)=0的情况,增益调整值为1,即无需对高频频谱包络进行平坦化操作(调整)。
(4)、基于上述方式可确定出高频频谱包络e High(i,k)中每个第一子频谱包络对应的增益调整值,基于每个第一子频谱包络对应的增益调整值,对相应的第一子频谱包络进行调整,上述操作可以拉近不同子带的平均能量差异,对第一子带区域对应的频谱进行不同程度的平坦化处理。
可以理解的是，可以通过上述相同的方式对第二子带区域对应的高频频谱包络进行调整，在此不再赘述。高频频谱包络一共包括14个子频带，则可以对应确定出14个增益调整值，基于该14个增益调整值对相应的子频谱包络进行调整。
进一步地,基于调整后的高频频谱包络,确定调整后的高频频谱包络和低频频谱包络的差值,基于差值对初始高频幅度谱进行调整,得到目标高频幅度谱P High(i,j)。
步骤S5,生成高频频谱:
基于低频相位谱Ph Low(i,j)生成相应的高频相位谱Ph High(i,j),可以包括以下任一种:
第一种:通过复制低频相位谱,得到相应的高频相位谱。
第二种:对低频相位谱进行翻折,翻折后得到一个与低频相位谱相同的相位谱,将这两个低频相位谱映射到相应的高频频点,得到相应的高频相位谱。
根据高频幅度谱和高频相位谱,生成高频频域系数S High(i,j);基于低频频域系数和高频频域系数,生成高频频谱。
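下面给出步骤S5中由目标高频幅度谱与高频相位谱合成高频频域系数的一个简化示意（采用第一种“复制低频相位谱”的方式，数组维度与函数名为示例性假设）：

```python
import numpy as np

def build_high_band(p_high_target, ph_low):
    """p_high_target：70维目标高频幅度谱；ph_low：70维低频相位谱。返回70个高频频域系数S_High(i,j)。"""
    ph_high = ph_low.copy()                      # 第一种方式：直接复制低频相位谱得到高频相位谱
    # 第二种“翻折”方式可将翻折后的相位谱按映射规则映射到相应高频频点，此处从略
    return p_high_target * np.exp(1j * ph_high)  # 幅度谱与相位谱合成复数高频频域系数
```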
步骤S6,高频后置滤波:
高频后置滤波是对得到的初始高频频域系数进行滤波处理,得到滤波后的初始高频频域系数,记作高频频域系数。在该滤波处理过程中,通过基于高频频域系数确定的滤波增益,对高频频域系数进行滤波处理,具体如下公式(27)所示:
S High_rev(i,j)=G High_post_filt(j)*S High(i,j)        (27)
其中,G High_post_filt(j)为根据高频频域系数计算得到的滤波增益,S High(i,j)为初始高频频域系数,S High_rev(i,j)为经滤波处理得到的高频频域系数。
本示例中,假定同一个子带内每5个初始频域系数共用一个滤波增益,其中,滤波增益的计算过程具体如下所示:
(1)将初始高频频域系数进行分带，例如，相邻5个初始高频频域系数合并成一个子频谱，本示例对应于14个子带。对每个子带计算平均能量。特别地，每一个频点（即上述的初始高频频域系数）的能量定义为实部平方与虚部平方的和。通过如下公式(28)计算相邻5个频点的能量值，该5个频点的能量值的和即为当前子频谱的第一频谱能量：
Pe(k)=Σ_{j=5k}^{5k+4}(Real(S High(i,j))^2+Imag(S High(i,j))^2)，k=0,1,…,13       (28)
其中,S High(i,j)为初始高频频域系数,Real和Imag分别为初始高频频域系数的实部和虚部,Pe(k)为第一频谱能量,k=0,1,…13,表示子带索引,为14个子带。
(2)基于帧间相关性,通过公式(29)与公式(30)中的至少一项,计算当前子频谱的第一频谱能量:
Fe(k)=1.0+Pe(k)+Pe pre(k)        (29)
Fe_sm(k)=(Fe(k)+Fe pre(k))/2           (30)
其中,Fe(k)是当前子频谱的第一频谱能量的平滑项,Pe(k)是当前语音帧的当前子频谱的第一频谱能量,Pe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的子频谱的第二初始频谱能量,Fe_sm(k)是累加平均后的第一频谱能量的平滑项,Fe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的第一频谱能量的平滑项,关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧,从而充分考虑了语音信号帧之间的短时相关性和长时相关性。
(3)计算初始频谱的频谱倾斜系数,将初始频谱对应的频带均分为第一子带和第二子带,分别计算第一子带的第一子带能量与第二子带的第二子带能量,计算公式(31)如下所示:
e1=Σ_{k=0}^{6}Fe(k)，e2=Σ_{k=7}^{13}Fe(k)       (31)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量。
接着,根据e1与e2,基于以下逻辑来确定初始频谱的频谱倾斜系数:
当e2≥e1时，T_para=0；
当e2&lt;e1时，T_para=min(1.0,8*f_cont_low*SQRT((e1-e2)/(e1+e2)))/7
其中,T_para为频谱倾斜系数,SQRT为开根号操作,f_cont_low=0.07,为预先设定的滤波系数,7为子频谱总数量的一半。
(4)计算每个子频谱的第二滤波增益,可以根据如下公式(32)计算:
gain f0(k)=Fe(k) f_cont_low          (32)
其中,gain f0(k)为第k个子频谱的第二滤波增益,f_cont_low为预先设定的滤波系数,在一种方案中,f_cont_low=0.07,Fe(k)为第k个子频谱的第一频谱能量的平滑项,k=0,1,…,13,表示子带索引,为14个子带。
接着,如果频谱倾斜系数T_para为正,还需要根据如下公式(33)对第二滤波增益gain f0(k)进一步调整:
If(T_para>0):
gain f1(k)=gain f0(k)*(1+k*T para)         (33)
(5)根据如下公式(34),得到高频后置滤波的滤波增益值:
gain High_post_filt(k)=(1+gain f1(k))/2       (34)
其中,gain f1(k)为根据公式(33)调整后的第二滤波增益,gain High_post_filt(k)为根据gain f1(k)最终得到的第k个子频谱对应的5个高频频域系数的滤波增益(即第二滤波增益),gain f1(k)为调整后的第二滤波增益,k=0,1,…,13,表示14个子带。
具体地,在确定出第k个子频谱对应的第二滤波增益gain High_post_filt(k)后,由于第一滤波增益包括第二数量(比如L=14个)的第二滤波增益gain High_post_filt(k),且第二滤波增益gain High_post_filt(k)为第k个子频谱对应的N个频谱系数的滤波增益,从而可以得到第一滤波增益G High_post_filt(j)。
步骤S7,低频后置滤波:
低频后置滤波是对待处理的窄带信号经STFT得到的初始低频频域系数进行滤波处理,得到低频频域系数。在该滤波处理过程中,通过基于初始低频频域系数确定的滤波增益,对初始低频频域系数进行滤波处理,具体如下公式(35)所示:
S Low_rev(i,j)=G Low_post_filt(j)*S Low(i,j)         (35)
其中,G Low_post_filt(j)为根据初始低频频域系数计算得到的滤波增益,S Low(i,j)为初始低频频域系数,S Low_rev(i,j)为经滤波处理得到的低频频域系数。
本示例中,假定同一个子带内每5个初始低频频域系数共用一个滤波增益,其中,滤波增益的计算过程具体如下所示:
(1)将初始低频频域系数进行分带,例如,相邻5个初始低频频域系数合并成一个子频谱,本示例对应于14个子带。对每个子带计算平均能量。特别地,每一个频点(即上述的初始低频频域系数)的能量定义为实部平方与虚部平方的和。通过如下公式(36)计算相邻5个频点的能量值,该5个频点的能量值的和即为当前子频谱的第一频谱能量:
Pe(k)=Σ_{j=5k}^{5k+4}(Real(S Low(i,j))^2+Imag(S Low(i,j))^2)，k=0,1,…,13       (36)
其中,S Low(i,j)为初始低频频域系数,Real和Imag分别为初始低频频域系数的实部和虚部,Pe(k)为第一频谱能量,k=0,1,…13,表示子带索引,为14个子带。
(2)基于帧间相关性,通过公式(37)与公式(38)中的至少一项,计算当前子频谱的第一频谱能量:
Fe(k)=1.0+Pe(k)+Pe pre(k)         (37)
Fe_sm(k)=(Fe(k)+Fe pre(k))/2          (38)
其中,Fe(k)是当前子频谱的第一频谱能量的平滑项,Pe(k)是当前语音帧的当前子频谱的第一频谱能量,Pe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的子频谱的第二初始频谱能量,Fe_sm(k)是累加平均后的第一频谱能量的平滑项,Fe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的第一频谱能量的平滑项,关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧。
(3)计算初始频谱的频谱倾斜系数,将初始频谱对应的频带均分为第一子带和第二子带,分别计算第一子带的第一子带能量与第二子带的第二子带能量,计算公式(39)如下所示:
e1=Σ_{k=0}^{6}Fe(k)，e2=Σ_{k=7}^{13}Fe(k)       (39)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量
接着,根据e1与e2,基于以下逻辑来确定初始频谱的频谱倾斜系数:
当e2≥e1时，T_para=0；
当e2&lt;e1时，T_para=min(1.0,8*f_cont_low*SQRT((e1-e2)/(e1+e2)))/7
其中,T_para为频谱倾斜系数,SQRT为开根号操作,f_cont_low=0.035,为预先设定的滤波系数,7为子频谱总数量的一半。
(4)计算每个子频谱的第二滤波增益,可以根据如下公式(40)计算:
gain f0(k)=Fe(k) f_cont_low        (40)
其中,gain f0(k)为第k个子频谱的第二滤波增益,f_cont_low为预先设定的滤波系数,在一方案中,f_cont_low=0.035,Fe(k)为第k个子频谱的第一频谱能量的平滑项,k=0,1,…,13,表示子带索引,为14个子带。
接着,如果频谱倾斜系数T_para为正,还需要根据如下公式(41)对第二滤波增益gain f0(k)进一步调整:
If(T_para>0):
gain f1(k)=gain f0(k)*(1+k*T para)       (41)
其中,gain f1(k)为根据频谱倾斜系数T_para调整后的第二滤波增益。
(5)根据如下公式(42),得到低频后置滤波的滤波增益值:
gain Low_post_filt(k)=(1+gain f1(k))/2        (42)
其中,gain f1(k)为根据公式(41)调整后的第二滤波增益,gain Low_post_filt(k)为根据gain f1(k)最终得到的第k个子频谱对应的5个低频频域系数的滤波增益(即第二滤波增益),gain f1(k)为调整后的第二滤波增益,k=0,1,…,13,为子带索引,表示14个子带。
具体地,在确定出第k个子频谱对应的第二滤波增益gain Low_post_filt(k)后,由于第一滤波增益包括第二数量(比如L=14个)的第二滤波增益gain Low_post_filt(k),且第二滤波增益gain Low_post_filt(k)为第k个子频谱对应的N个频谱系数的滤波增益,从而可以得到第一滤波增益G Low_post_filt(j)。
步骤S8,频时变换,即逆短时傅里叶变换ISTFT:
基于低频频谱和高频频谱,得到频带扩展后的宽带信号。
具体的，将低频频域系数S Low_rev(i,j)和高频频域系数S High_rev(i,j)合并，得到宽频带频谱，对宽频带频谱进行时频变换的反变换（即ISTFT（逆短时傅里叶变换）），可以生成新的语音帧s Rec(i,j)，即宽带信号。此时，待处理的窄带信号的有效带宽已经扩展为7000Hz。
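下面给出步骤S8频时变换（ISTFT）的一个简化示意（将低频与高频频域系数拼接为宽频带频谱后做逆FFT并重叠相加，合成窗与重叠方式为示例性假设）：

```python
import numpy as np

def synthesize_frame(s_low_rev, s_high_rev, prev_tail):
    """s_low_rev：70个低频频域系数；s_high_rev：70个高频频域系数；prev_tail：上一帧的重叠缓存（160个样本）。
    返回当前帧重建的160个宽带样本以及新的重叠缓存。"""
    half = np.zeros(161, dtype=complex)
    half[:70] = s_low_rev
    half[70:140] = s_high_rev                                # 拼接得到宽频带频谱（前161个频点）
    full = np.concatenate([half, np.conj(half[-2:0:-1])])    # 按共轭对称关系补全320点频谱
    frame = np.real(np.fft.ifft(full)) * np.hanning(320)     # 逆变换并加合成窗（示例性假设）
    out = frame[:160] + prev_tail                            # 与上一帧重叠部分相加
    return out, frame[160:]
```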
在如图4所示的第二示例中,时频变换为MDCT。在上述第一示例中,待处理的窄带信号的时频变换是基于STFT的,按照经典信号理论,每一个信号频点包含幅度信息和相位信息。在第一示例中,高频部分的相位是直接从低频部分映射过来,存在一定的误差,因此,在第二示例中采用MDCT。MDCT依然是类似第一示例的加窗、交叠处理,但是生成的MDCT系数是实数,信息量更大,只需利用高频MDCT系数与低频MDCT系数的相关性,采用与第一示例类似的神经网络模型即可完成频带扩展。具体流程包括以下步骤:
步骤T1,前端信号处理:
对待处理的窄带信号进行因子为2的上采样处理,输出采样率为16000Hz的上采样信号。
由于待处理的窄带信号的采样率为8000Hz,帧长为10ms,则该上采样信号对应160个样本点(频点),对上采样信号进行改进离散余弦变换MDCT变换,具体为:将上一语音帧对应的160个样本点与当前语音帧(待处理的窄带信号)对应的160个样本点组成一个数组,该数组包括320个样本点。接着对该数组中的样本点进行余弦窗的加窗处理,对加窗处理后得到的信号s Low(i,j)进行MDCT,得到160个低频频域系数S Low(i,j)。其中,i为语音帧的帧索引,j为帧内样本索引(j=0,1,…,159)。
步骤T2,特征提取:
a)得到低频频域系数S Low(i,j)。
若窄带信号为采样率为16000Hz,带宽为0~3500Hz的信号,则可以基于待处理的窄带信号的采样率和帧长,从S Low(i,j)中确定出70个低频频域系数j=0,1,…69。
在得到包含70个低频频域系数之后,即可基于该70个低频频域系数,确定出待处理的窄带信号的低频频谱包络。其中,可以通过以下方式基于低频频域系数,确定低频频谱包络:
对待处理的窄带信号进行分带,针对70个低频频域系数,可以将每5个相邻的低频频域系数对应的频带划分为一个子带,共划分为14个子带,每个子带对应有5个低频频域系数。对于每个子带,该子带的低频频谱包络定义为相邻低频频域系数的平均能量。具体可通过公式(43)计算得到:
e Low(i,k)为第k个子带内相邻5个低频频域系数的平均能量，k=0,1,…,13       (43)
其中,e Low(i,k)表示子频谱包络(每个子带的低频频谱包络),k表示子带的索引号,共14个子带,k=0,1,2……13,则低频频谱包络中包括14个子频谱包络。
由此,可以将70维的低频频域系数S Low(i,j)和14维的低频频谱包络e Low(i,k)作为神经网络模型的输入。
步骤T3,神经网络模型:
输入层:神经网络模型输入上述84维特征向量,
输出层:考虑到本实施例中频带扩展的目标宽带是7000Hz,因此,需要预测14个对应于3500-7000Hz频段的子带的高频频谱包络e High(i,k)。此外,还可以同时输出4个与平坦度信息相关的概率密度fc,即输出结果为18维。
其中,本第二示例中的神经网络模型与上述第一示例中的神经网络模型的处理过程相同,在此不再赘述。
步骤T4,生成高频幅度谱:
与上述第一示例类似,基于平坦度信息,使用与第一示例类似的平坦度分析,生成高频的两个子带区域与低频部分的平坦度关系v(i,k),然后结合高频频谱包络e High(i,k),使用与第一示例类似的流程,可以生成高频MDCT系数S High(i,j)。
步骤T5,高频后置滤波:
高频后置滤波是对得到的初始高频频域系数进行滤波处理,得到滤波后的初始高频频域系数,记作高频频域系数。在该滤波处理过程中,通过基于高频频域系数确定的滤波增益,对高频频域系数进行滤波处理,具体如下公式(44)所示:
S High_rev(i,j)=G High_post_filt(j)*S High(i,j)        (44)
其中,G High_post_filt(j)为根据高频频域系数计算得到的滤波增益,S High(i,j)为初始高频频域系数,S High_rev(i,j)为经滤波处理得到的高频频域系数。
高频后置滤波的具体处理过程，与前述第一示例中的高频后置滤波的具体处理过程类似，具体如下：
本示例中,假定同一个子带内每5个初始频域系数共用一个滤波增益,其中,滤波增益的计算过程具体如下所示:
(1)将初始高频频域系数进行分带，例如，相邻5个初始高频频域系数合并成一个子频谱，本示例对应于14个子带。对每个子带计算平均能量。特别地，每一个频点（即上述的初始高频频域系数）的能量定义为实部平方与虚部平方的和。通过如下公式(45)计算相邻5个频点的能量值，该5个频点的能量值的和即为当前子频谱的第一频谱能量：
Pe(k)=Σ_{j=5k}^{5k+4}(Real(S High(i,j))^2+Imag(S High(i,j))^2)，k=0,1,…,13       (45)
其中,S High(i,j)为初始高频频域系数,Pe(k)为第一频谱能量,k=0,1,…13,表示子带的索引号,共14个子带。
(2)基于帧间相关性,通过公式(46)与公式(47)中的至少一项,计算当前子频谱的第一频谱能量:
Fe(k)=1.0+Pe(k)+Pe pre(k)         (46)
Fe_sm(k)=(Fe(k)+Fe pre(k))/2         (47)
其中,Fe(k)是当前子频谱的第一频谱能量的平滑项,Pe(k)是当前语音帧的当前子频谱的第 一频谱能量,Pe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的子频谱的第二初始频谱能量,Fe_sm(k)是累加平均后的第一频谱能量的平滑项,Fe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的第一频谱能量的平滑项,关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧,从而充分考虑了语音信号帧之间的短时相关性和长时相关性。
(3)计算初始频谱的频谱倾斜系数,将初始频谱对应的频带均分为第一子带和第二子带,分别计算第一子带的第一子带能量与第二子带的第二子带能量,计算公式(48)如下所示:
e1=Σ_{k=0}^{6}Fe(k)，e2=Σ_{k=7}^{13}Fe(k)       (48)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量。
接着,根据e1与e2,基于以下逻辑来确定初始频谱的频谱倾斜系数:
当e2≥e1时，T_para=0；
当e2&lt;e1时，T_para=min(1.0,8*f_cont_low*SQRT((e1-e2)/(e1+e2)))/7
其中,T_para为频谱倾斜系数,SQRT为开根号操作,f_cont_low=0.07,为预先设定的滤波系数,7为子频谱总数量的一半。
(4)计算每个子频谱的第二滤波增益,可以根据如下公式(49)计算:
gain f0(k)=Fe(k) f_cont_low           (49)
其中,gain f0(k)为第k个子频谱的第二滤波增益,f_cont_low为预先设定的滤波系数,在一方案中,f_cont_low=0.07,Fe(k)为第k个子频谱的第一频谱能量的平滑项,k=0,1,…,13,表示子带的索引号,共14个子带。
接着,如果频谱倾斜系数T_para为正,还需要根据如下公式(50)对第二滤波增益gain f0(k)进一步调整:
If(T_para>0):
gain f1(k)=gain f0(k)*(1+k*T para)        (50)
(5)根据如下公式(51),得到高频后置滤波的滤波增益值:
gain High_post_filt(k)=(1+gain f1(k))/2         (51)
其中，gain f1(k)为根据公式(50)调整后的第二滤波增益，gain High_post_filt(k)为根据gain f1(k)最终得到的第k个子频谱对应的5个高频频域系数的滤波增益（即第二滤波增益），gain f1(k)为调整后的第二滤波增益，k=0,1,…,13，表示子带的索引号，共14个子带。
具体地,在确定出第k个子频谱对应的第二滤波增益gain High_post_filt(k)后,由于第一滤波增益包括第二数量(比如L=14个)的第二滤波增益gain High_post_filt(k),且第二滤波增益gain High_post_filt(k)为第k个子频谱对应的N个频谱系数的滤波增益,从而可以得到第一滤波增益G High_post_filt(j)。
步骤T6,低频后置滤波:
低频后置滤波是对待处理的窄带信号经MDCT得到的初始低频频域系数进行滤波处理,得到低频频域系数。在该滤波处理过程中,通过基于初始低频频域系数确定的滤波增益,对初始低频频域系数进行滤波处理,具体如下公式(52)所示:
S Low_rev(i,j)=G Low_post_filt(j)*S Low(i,j)        (52)
其中,G Low_post_filt(j)为根据初始低频频域系数计算得到的滤波增益,S Low(i,j)为初始低频频域系数,S Low_rev(i,j)为经滤波处理得到的低频频域系数。
本示例中,假定同一个子带内每5个初始低频频域系数共用一个滤波增益,其中,滤波增益的计算过程具体如下所示:
(1)将初始低频频域系数进行分带,例如,相邻5个初始低频频域系数合并成一个子频谱,本示例对应于14个子带。对每个子带计算平均能量。特别地,每一个频点(即上述的初始低频频域系数)的能量定义为实部平方与虚部平方的和。通过如下公式(53)计算相邻5个频点的能量值,该5个频点的能量值的和即为当前子频谱的第一频谱能量:
Pe(k)=Σ_{j=5k}^{5k+4}(Real(S Low(i,j))^2+Imag(S Low(i,j))^2)，k=0,1,…,13       (53)
其中,S Low(i,j)为初始低频频域系数,Real和Imag分别为初始低频频域系数的实部和虚部,Pe(k)为第一频谱能量,k=0,1,…13,表示子带索引,为14个子带。
(2)基于帧间相关性,通过公式(54)与公式(55)中的至少一项,计算当前子频谱的第一频谱能量:
Fe(k)=1.0+Pe(k)+Pe pre(k)          (54)
Fe_sm(k)=(Fe(k)+Fe pre(k))/2         (55)
其中,Fe(k)是当前子频谱的第一频谱能量的平滑项,Pe(k)是当前语音帧的当前子频谱的第一频谱能量,Pe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的子频谱的第二初始频谱能量,Fe_sm(k)是累加平均后的第一频谱能量的平滑项,Fe pre(k)是当前语音帧的关联语音帧的与当前子频谱对应的第一频谱能量的平滑项,关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧。
(3)计算初始频谱的频谱倾斜系数,将初始频谱对应的频带均分为第一子带和第二子带,分别计算第一子带的第一子带能量与第二子带的第二子带能量,计算公式(56)如下所示:
e1=Σ_{k=0}^{6}Fe(k)，e2=Σ_{k=7}^{13}Fe(k)       (56)
其中,e1为第一子带的第一子带能量,e2为第二子带的第二子带能量。
接着,根据e1与e2,基于以下逻辑来确定初始频谱的频谱倾斜系数:
当e2≥e1时，T_para=0；
当e2&lt;e1时，T_para=min(1.0,8*f_cont_low*SQRT((e1-e2)/(e1+e2)))/7
其中,T_para为频谱倾斜系数,SQRT为开根号操作,f_cont_low为预先设定的滤波系数,在一方案中,f_cont_low=0.035,7为子频谱总数量的一半。
(4)计算每个子频谱的第二滤波增益,可以根据如下公式(57)计算:
gain f0(k)=Fe(k) f_cont_low          (57)
其中,gain f0(k)为第k个子频谱的第二滤波增益,f_cont_low为预先设定的滤波系数,在一方案中,f_cont_low=0.035,Fe(k)为第k个子频谱的第一频谱能量的平滑项,k=0,1,…,13,表示子带索引,为14个子带。
接着,如果频谱倾斜系数T_para为正,还需要根据如下公式(58)对第二滤波增益gain f0(k)进一步调整:
If(T_para>0):
gain f1(k)=gain f0(k)*(1+k*T para)         (58)
其中,gain f1(k)为根据频谱倾斜系数T_para调整后的第二滤波增益。
(5)根据如下公式(59),得到低频后置滤波的滤波增益值:
gain Low_post_filt(k)=(1+gain f1(k))/2          (59)
其中,gain f1(k)为根据公式(58)调整后的第二滤波增益,gain Low_post_filt(k)为根据gain f1(k)最终得到的第k个子频谱对应的5个低频频域系数的滤波增益(即第二滤波增益),gain f1(k)为调整后的第二滤波增益,k=0,1,…,13,表示子带索引,表示14个子带。
具体地,在确定出第k个子频谱对应的第二滤波增益gain Low_post_filt(k)后,由于第一滤波增益包括第二数量(比如L=14个)的第二滤波增益gain Low_post_filt(k),且第二滤波增益gain Low_post_filt(k)为第k个子频谱对应的N个频谱系数的滤波增益,从而可以得到第一滤波增益G Low_post_filt(j)。
步骤T7，频时变换，即逆改进离散余弦变换IMDCT：
基于低频频谱和高频频谱,得到频带扩展后的宽带信号。
具体的，将低频频域系数S Low_rev(i,j)和高频频域系数S High_rev(i,j)合并，得到宽频带频谱，对宽频带频谱进行时频变换的反变换（即IMDCT（逆改进离散余弦变换）），可以生成新的语音帧s Rec(i,j)，即宽带信号。此时，待处理的窄带信号的有效带宽已经扩展为7000Hz。
通过本方案的方法,在PSTN与VoIP互通的语音通信场景,VoIP侧只能收到来自于PSTN的窄带话音(采样率为8kHz,有效带宽一般是3.5kHz)。用户的直观感受是声音不够亮、音量不够大、可懂度一般。基于本申请公开的技术方案进行频带扩展,无需额外比特,可以在VoIP侧接收端将有效带宽扩展到7kHz。用户可以直观感受到更亮的音色、更大的音量和更好的可懂度。此外,基于本方案没有前向兼容的问题,即无需修改协议,可以完美兼容PSTN。
本申请实施例的方法可以应用在PSTN-VoIP通路的下行侧,比如,可以在装有会议系统的客户端集成本申请实施例所提供的方案的功能模块,则可以在客户端实现对窄频带信号的频带扩展,得到宽带信号。具体,该场景中的信号处理为一种信号后处理技术,以PSTN(编码系统可以是ITU-T G.711)为例,在会议系统客户端内部,当完成G.711解码后恢复出语音帧;对语音帧进行本申请实施涉及的后处理技术,可以让VoIP用户接收到宽带信号,即使发送端是窄带信号。
本申请实施例的方法也可以应用在PSTN-VoIP通路的混音服务器内,在通过该混音服务器进行频带扩展后,将频带扩展后的宽带信号发送给VoIP客户端,VoIP客户端在收到宽带信号对应的VoIP码流后,通过解码VoIP码流,可以恢复出经过频带扩展输出的宽带语音。混音服务器中一个典型功能是进行转码,例如,将PSTN链路的码流(如使用G.711编码)转码中VoIP常用的码流(如OPUS或者SILK等)。在混音服务器中,可以将G.711解码后的语音帧上采样到16000Hz,然后使用本申请实施例所提供的方案,完成频带扩展;然后,转码成VoIP常用的码流。VoIP客户端在收到一路或者多路的VoIP码流,通过解码,可以恢复出经过频带扩展输出的宽带语音。
图5为本申请又一实施例提供的一种频带扩展装置的结构示意图，如图5所示，该装置50可以包括低频频谱确定模块51、相关性参数确定模块52、高频频谱确定模块53及宽带信号确定模块54，其中：
低频频谱确定模块51,用于对待处理窄带信号进行时频变换得到对应的初始低频频谱;
相关性参数确定模块52,用于基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,其中,相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
高频频谱确定模块53,用于基于相关性参数和初始低频频谱,得到初始高频频谱;
宽带信号确定模块54,用于根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;其中,目标低频频谱为初始低频频谱或对初始低频频谱进行滤波处理后的频谱,目标高频频谱为初始高频频谱或对初始高频频谱进行滤波处理后的频谱。
在一种可能的实现方式中,宽带信号确定模块在对初始低频频谱或初始高频频谱进行滤波处理时,具体用于:
将初始频谱划分为第一数量的子频谱,并确定每个子频谱对应的第一频谱能量,初始频谱包括初始低频频谱或初始高频频谱;
基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益;
根据每个子频谱对应的滤波增益,对相应的每个子频谱分别进行滤波处理。
在一种可能的实现方式中,宽带信号确定模块在基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益时,具体用于:
将初始频谱对应的频带划分为第一子带和第二子带;
根据第一子带所对应的所有子频谱的第一频谱能量,确定出第一子带的第一子带能量,根据第二子带所对应的所有子频谱的第一频谱能量,确定出第二子带的第二子带能量;
根据第一子带能量与第二子带能量,确定初始频谱的频谱倾斜系数;
根据频谱倾斜系数及每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益。
在一种可能的实现方式中,窄带信号为当前语音帧的语音信号,宽带信号确定模块在确定一个子频谱的第一频谱能量时,具体用于:
确定一个子频谱的第一初始频谱能量;
若当前语音帧为第一个语音帧,则第一初始频谱能量为第一频谱能量;
若当前语音帧不是第一个语音帧,则获取关联语音帧的与一个子频谱对应的子频谱的第二初始频谱能量,关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧;
基于第一初始频谱能量和第二初始频谱能量,得到一个子频谱的第一频谱能量。
在一种可能的实现方式中,相关性参数包括高频频谱包络和相对平坦度信息;神经网络模型至少包括输入层和输出层,输入层输入低频频谱的特征向量,输出层至少包括单边长短期记忆网络LSTM层以及分别连接LSTM层的两个全连接网络层,每个全连接网络层包括至少一个全连接层,其中,LSTM层将输入层处理后的特征向量进行转换,其中一个全连接网络层根据LSTM层转换后的向量值进行第一分类处理,并输出高频频谱包络,另一个全连接网络层根据LSTM层转换后的向量值进行第二分类处理,并输出相对平坦度信息。
在一种可能的实现方式中,还包括处理模块;
处理模块具体用于基于初始低频频谱,确定待处理窄带信号的低频频谱包络;
其中,神经网络模型的输入还包括低频频谱包络。
在一种可能的实现方式中,时频变换包括傅里叶变换或离散余弦变换;
若时频变换为傅里叶变换,高频频谱确定模块在基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数时,具体用于:
根据初始低频频谱,得到待处理窄带信号的低频幅度谱;
将低频幅度谱输入至神经网络模型,基于神经网络模型的输出得到相关性参数;
若时频变换为离散余弦变换,高频频谱确定模块在基于初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数时,具体用于:
将初始低频频谱输入至神经网络模型,基于神经网络模型的输出得到相关性参数。
在一种可能的实现方式中,时频变换包括傅里叶变换或离散余弦变换;
若时频变换为傅里叶变换,高频频谱确定模块在基于相关性参数和初始低频频谱,得到初始高频频谱时,具体用于:
根据初始低频频谱,得到待处理窄带信号的低频频谱包络;
对低频幅度谱中高频段部分的幅度谱进行复制,生成初始高频幅度谱;
基于高频频谱包络和低频频谱包络,对初始高频幅度谱进行调整,得到目标高频幅度谱;
基于窄带信号的低频相位谱,生成相应的高频相位谱;
根据目标高频幅度谱和高频相位谱,得到初始高频频谱;
若时频变换为离散余弦变换,高频频谱确定模块在基于相关性参数和初始低频频谱,得到初始高频频谱时,具体用于:
根据初始低频频谱,得到待处理窄带信号的低频频谱包络;
对初始低频频谱中高频段部分的频谱进行复制,生成第一高频频谱;
基于高频频谱包络和低频频谱包络,对第一高频频谱进行调整,得到初始高频频谱。
在一种可能的实现方式中,相关性参数还包括相对平坦度信息,相对平坦度信息表征了目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
高频频谱确定模块在基于高频频谱包络和低频频谱包络,对高频谱信息进行调整时,具体用于:
基于相对平坦度信息以及初始低频频谱的能量信息,确定高频频谱包络的增益调整值;
基于增益调整值对高频频谱包络进行调整,得到调整后的高频频谱包络;
基于调整后的高频频谱包络和低频频谱包络,对高频谱信息进行调整,高频谱信息包括初始高频幅度谱或第一高频频谱。
在一种可能的实现方式中,相对平坦度信息包括对应于高频部分的至少两个子带区域的相对 平坦度信息,一个子带区域所对应的相对平坦度信息,表征了高频部分的一个子带区域的频谱平坦度与低频部分的高频频段的频谱平坦度的相关性;
若高频部分包括对应于至少两个子带区域的谱参数,每个子带区域的谱参数是基于低频部分的高频频段的谱参数得到的,相对平坦度信息包括每个子带区域的谱参数与高频频段的谱参数的相对平坦度信息,其中,若时频变换为傅里叶变换,谱参数为幅度谱,若时频变换为离散余弦变换,谱参数为频谱;
高频频谱确定模块在基于相对平坦度信息以及初始低频频谱的能量信息,确定高频频谱包络的增益调整值时,具体用于:
基于每个子带区域所对应的相对平坦度信息、以及低频频谱中每个子带区域所对应的频谱能量信息,确定高频频谱包络中对应频谱包络部分的增益调整值;
高频频谱确定模块在基于增益调整值对高频频谱包络进行调整时,具体用于:
根据高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整。
在一种可能的实现方式中,若高频频谱包络包括第一预定数量的高频子频谱包络;
高频频谱确定模块在基于每个子带区域所对应的相对平坦度信息,以及初始低频频谱中每个子带区域对应的频谱能量信息,确定高频频谱包络中对应频谱包络部分的增益调整值时,具体用于:
对于每一个高频子频谱包络,根据低频频谱包络中与高频子频谱包络对应的频谱包络所对应的频谱能量信息、低频频谱包络中与高频子频谱包络对应的频谱包络所对应的子带区域所对应的相对平坦度信息、低频频谱包络中与高频子频谱包络对应的频谱包络所对应的子带区域对应的频谱能量信息,确定高频子频谱包络的增益调整值;
高频频谱确定模块在根据高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整时,具体用于:
根据高频频谱包络中每个高频子频谱包络的增益调整值,对相应的高频子频谱包络进行调整。
本申请实施例提供的频带扩展方法和装置,在根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号的过程中,通过对初始低频频谱或初始高频频谱中的至少一项进行滤波处理,使得在得到宽带信号之前,可以对初始低频频谱进行滤波处理,从而有效滤除窄带信号在量化过程中可能引入的量化噪声;也可以对初始高频频谱进行滤波处理,从而有效滤除基于初始低频频谱进行频带扩展的过程中引入的噪声,增强宽带信号的信号质量,进一步提升用户的听觉体验。此外,通过本方案的方法进行频带扩展,无需提前记录边信息,即无需额外的带宽。
需要说明的是,本实施例为与上述的方法项实施例相对应的装置项实施例,本实施例可与上述方法项实施例互相配合实施。上述方法项实施例中提到的相关技术细节在本实施例中依然有效,为了减少重复,这里不再赘述。相应地,本实施例中提到的相关技术细节也可应用在上述方法项实施例中。
本申请另一实施例提供了一种电子设备,如图6所示,图6所示的电子设备600包括:处理器601和存储器603。其中,处理器601和存储器603相连,如通过总线602相连。进一步地,电子设备600还可以包括收发器604。需要说明的是,实际应用中收发器604不限于一个,该电子设备600的结构并不构成对本申请实施例的限定。
其中，处理器601应用于本申请实施例中，用于实现图5所示的低频频谱确定模块、相关性参数确定模块、高频频谱确定模块及宽带信号确定模块的功能。
处理器601可以是CPU,通用处理器,DSP,ASIC,FPGA或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器601也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等。
总线602可包括一通路,在上述组件之间传送信息。总线602可以是PCI总线或EISA总线等。总线602可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储器603可以是ROM或可存储静态信息和指令的其他类型的静态存储设备,RAM或者可存储信息和指令的其他类型的动态存储设备,也可以是EEPROM、CD-ROM或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
存储器603用于存储执行本申请方案的应用程序代码,并由处理器601来控制执行。处理器601用于执行存储器603中存储的应用程序代码,以实现图5所示实施例提供的频带扩展装置的动作。
本申请实施例提供的电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时,可实现:在根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号的过程中,通过对初始低频频谱或初始高频频谱中的至少一项进行滤波处理,使得在得到宽带信号之前,可以对初始低频频谱进行滤波处理,从而有效滤除窄带信号在量化过程中可能引入的量化噪声;也可以对初始高频频谱进行滤波处理,从而有效滤除基于初始低频频谱进行频带扩展的过程中引入的噪声,增强宽带信号的信号质量,进一步提升用户的听觉体验。此外,通过本方案的方法进行频带扩展,无需提前记录边信息,即无需额外的带宽。
本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该电子设备执行上述频带扩展方法。
本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现上述实施例所示的方法。其中:在根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号的过程中,通过对初始低频频谱或初始高频频谱中的至少一项进行滤波处理,使得在得到宽带信号之前,可以对初始低频频谱进行滤波处理,从而有效滤除窄带信号在量化过程中可能引入的量化噪声;也可以对初始高频频谱进行滤波处理,从而有效滤除基于初始低频频谱进行频带扩展的过程中引入的噪声,增强宽带信号的信号质量,进一步提升用户的听觉体验。此外,通过本方案的方法进行频带扩展,无需提前记录边信息,即无需额外的带宽。
本申请实施例提供的计算机可读存储介质适用于上述方法的任一实施例。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶 段的至少一部分轮流或者交替地执行。
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。

Claims (20)

  1. 一种频带扩展方法,由电子设备执行,包括:
    对待处理窄带信号进行时频变换得到对应的初始低频频谱;
    基于所述初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,其中,所述相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,所述相对平坦度信息表征了所述目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
    基于所述相关性参数和所述初始低频频谱,得到初始高频频谱;
    根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;其中,所述目标低频频谱为所述初始低频频谱或对所述初始低频频谱进行滤波处理后的频谱,所述目标高频频谱为所述初始高频频谱或对所述初始高频频谱进行滤波处理后的频谱。
  2. 根据权利要求1所述的方法,其中,对初始低频频谱或初始高频频谱进行滤波处理,包括:
    将初始频谱划分为第一数量的子频谱,并确定每个子频谱对应的第一频谱能量,初始频谱包括初始低频频谱或初始高频频谱;
    基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益;
    根据每个子频谱对应的滤波增益,对相应的每个子频谱分别进行滤波处理。
  3. 根据权利要求2所述的方法,其中,基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益,包括:
    将所述初始频谱对应的频带划分为第一子带和第二子带;
    根据第一子带所对应的所有子频谱的第一频谱能量,确定出第一子带的第一子带能量,根据第二子带所对应的所有子频谱的第一频谱能量,确定出第二子带的第二子带能量;
    根据所述第一子带能量与第二子带能量,确定所述初始频谱的频谱倾斜系数;
    根据所述频谱倾斜系数及每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益。
  4. 根据权利要求3所述的方法,其中,所述窄带信号为当前语音帧的语音信号,确定一个子频谱的第一频谱能量,包括:
    确定所述一个子频谱的第一初始频谱能量;
    若所述当前语音帧为第一个语音帧,则所述第一初始频谱能量为所述第一频谱能量;
    若所述当前语音帧不是第一个语音帧,则获取关联语音帧的与所述一个子频谱对应的子频谱的第二初始频谱能量,所述关联语音帧是位于所述当前语音帧之前、且与所述当前语音帧相邻的至少一个语音帧;
    基于所述第一初始频谱能量和所述第二初始频谱能量,得到所述一个子频谱的第一频谱能量。
  5. 根据权利要求1~4任一项所述的方法,其中,所述相关性参数包括高频频谱包络和相对平坦度信息;
    所述神经网络模型至少包括输入层和输出层,所述输入层输入低频频谱的特征向量,所述输出层至少包括单边长短期记忆网络LSTM层以及分别连接所述LSTM层的两个全连接网络层,每个 全连接网络层包括至少一个全连接层,其中,所述LSTM层将输入层处理后的特征向量进行转换,其中一个全连接网络层根据LSTM层转换后的向量值进行第一分类处理,并输出所述高频频谱包络,另一个所述全连接网络层根据LSTM层转换后的向量值进行第二分类处理,并输出所述相对平坦度信息。
  6. 根据权利要求1~4任一项所述的方法,其中,所述方法还包括:
    基于所述初始低频频谱,确定所述待处理窄带信号的低频频谱包络;
    其中,所述神经网络模型的输入还包括所述低频频谱包络。
  7. 根据权利要求1~4任一项所述的方法,其中,所述方法还包括:
    若所述时频变换为傅里叶变换,所述基于所述初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,包括:
    根据所述初始低频频谱,得到所述待处理窄带信号的低频幅度谱;
    将所述低频幅度谱输入至所述神经网络模型,基于所述神经网络模型的输出得到所述相关性参数。
  8. 根据权利要求1~4任一项所述的方法,其中,所述方法还包括:
    若所述时频变换为离散余弦变换,所述基于所述初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,包括:
    将所述初始低频频谱输入至所述神经网络模型,基于所述神经网络模型的输出得到所述相关性参数。
  9. 根据权利要求1~4任一项所述的方法,其中,所述方法还包括:
    若时频变换为傅里叶变换,基于所述相关性参数和所述初始低频频谱,得到初始高频频谱,包括:
    根据所述初始低频频谱,得到所述待处理窄带信号的低频频谱包络;
    对所述低频幅度谱中高频段部分的幅度谱进行复制,生成高频谱信息;
    基于所述高频频谱包络和所述低频频谱包络,对所述高频谱信息进行调整,得到所述目标高频幅度谱,其中,所述高频谱信息包括初始高频幅度谱;
    基于所述窄带信号的低频相位谱,生成相应的高频相位谱;
    根据所述目标高频幅度谱和所述高频相位谱,得到初始高频频谱。
  10. 根据权利要求1~4任一项所述的方法,其中,所述方法还包括:
    若时频变换为离散余弦变换,基于所述相关性参数和所述初始低频频谱,得到初始高频频谱,包括:
    根据所述初始低频频谱,得到所述待处理窄带信号的低频频谱包络;
    对所述初始低频频谱中高频段部分的频谱进行复制,生成高频谱信息;
    基于所述高频频谱包络和所述低频频谱包络,对所述高频谱信息进行调整,得到初始高频频谱,其中,所述高频谱信息包括第一高频频谱。
  11. 根据权利要求9或10所述的方法,其特征在于,所述基于所述高频频谱包络和所述低频频谱包络,对高频谱信息进行调整,包括:
    基于所述相对平坦度信息以及所述初始低频频谱的能量信息,确定所述高频频谱包络的增益调整值;
    基于所述增益调整值对所述高频频谱包络进行调整,得到调整后的高频频谱包络;
    基于所述调整后的高频频谱包络和所述低频频谱包络,对所述高频谱信息进行调整。
  12. 根据权利要求11所述的方法,其中,所述相对平坦度信息包括对应于所述高频部分的至少两个子带区域的相对平坦度信息,一个子带区域所对应的相对平坦度信息,表征了所述高频部分的一个子带区域的频谱平坦度与所述低频部分的高频频段的频谱平坦度的相关性;
    若所述高频部分包括对应于至少两个子带区域的谱参数,每个子带区域的谱参数是基于所述低频部分的高频频段的谱参数得到的,所述相对平坦度信息包括每个子带区域的谱参数与所述高频频段的谱参数的相对平坦度信息,其中,若时频变换为傅里叶变换,所述谱参数为所述幅度谱,若时频变换为离散余弦变换,所述谱参数为频谱;
    所述基于所述相对平坦度信息以及所述初始低频频谱的能量信息,确定所述高频频谱包络的增益调整值,包括:
    基于每个子带区域所对应的相对平坦度信息、以及所述低频频谱中每个子带区域所对应的频谱能量信息,确定所述高频频谱包络中对应频谱包络部分的增益调整值;
    所述基于所述增益调整值对所述高频频谱包络进行调整,包括:
    根据所述高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整。
  13. 根据权利要求12所述的方法,其中,所述高频频谱包络包括第一预定数量的高频子频谱包络;
    所述基于每个子带区域所对应的相对平坦度信息,以及所述初始低频频谱中每个子带区域对应的频谱能量信息,确定所述高频频谱包络中对应频谱包络部分的增益调整值,包括:
    对于每一个高频子频谱包络,根据所述低频频谱包络中与所述高频子频谱包络对应的频谱包络所对应的频谱能量信息、所述低频频谱包络中与所述高频子频谱包络对应的频谱包络所对应的子带区域所对应的相对平坦度信息、所述低频频谱包络中与所述高频子频谱包络对应的频谱包络所对应的子带区域对应的频谱能量信息,确定所述高频子频谱包络的增益调整值;
    所述根据所述高频频谱包络中每个对应频谱包络部分的增益调整值,对相应的频谱包络部分进行调整,包括:
    根据所述高频频谱包络中每个高频子频谱包络的增益调整值,对相应的高频子频谱包络进行调整。
  14. 一种频带扩展装置,包括:
    低频频谱确定模块,用于对待处理窄带信号进行时频变换得到对应的初始低频频谱;
    相关性参数确定模块,用于基于所述初始低频频谱,通过神经网络模型,得到目标宽频频谱的高频部分与低频部分的相关性参数,其中,所述相关性参数包括高频频谱包络和相对平坦度信息至少其中之一,所述相对平坦度信息表征了所述目标宽频频谱的高频部分的频谱平坦度与低频部分的频谱平坦度的相关性;
    高频频谱确定模块，用于基于所述相关性参数和所述初始低频频谱，得到初始高频频谱；
    宽带信号确定模块,用于根据目标低频频谱和目标高频频谱,得到频带扩展后的宽带信号;其中,所述目标低频频谱为所述初始低频频谱或对所述初始低频频谱进行滤波处理后的频谱,所述目标高频频谱为所述初始高频频谱或对所述初始高频频谱进行滤波处理后的频谱。
  15. 根据权利要求14所述的装置,其中,所述宽带信号确定模块进一步用于:
    将初始频谱划分为第一数量的子频谱,并确定每个子频谱对应的第一频谱能量,初始频谱包 括初始低频频谱或初始高频频谱;
    基于每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益;
    根据每个子频谱对应的滤波增益,对相应的每个子频谱分别进行滤波处理。
  16. 根据权利要求15所述的装置,其中,所述宽带信号确定模块进一步用于:
    将初始频谱对应的频带划分为第一子带和第二子带;
    根据第一子带所对应的所有子频谱的第一频谱能量,确定出第一子带的第一子带能量,根据第二子带所对应的所有子频谱的第一频谱能量,确定出第二子带的第二子带能量;
    根据第一子带能量与第二子带能量,确定初始频谱的频谱倾斜系数;
    根据频谱倾斜系数及每个子频谱各自对应的第一频谱能量,确定每个子频谱对应的滤波增益。
  17. 根据权利要求16所述的装置,其中,所述宽带信号确定模块进一步用于:
    确定一个子频谱的第一初始频谱能量;
    若当前语音帧为第一个语音帧,则第一初始频谱能量为第一频谱能量;
    若当前语音帧不是第一个语音帧,则获取关联语音帧的与一个子频谱对应的子频谱的第二初始频谱能量,其中,所述关联语音帧是位于当前语音帧之前、且与当前语音帧相邻的至少一个语音帧;
    基于第一初始频谱能量和第二初始频谱能量,得到一个子频谱的第一频谱能量。
  18. 根据权利要求14所述的装置,其中,所述相关性参数包括高频频谱包络和相对平坦度信息;所述神经网络模型至少包括输入层和输出层,所述输入层输入低频频谱的特征向量,输出层至少包括单边长短期记忆网络LSTM层以及分别连接LSTM层的两个全连接网络层,每个全连接网络层包括至少一个全连接层,其中,所述LSTM层将输入层处理后的特征向量进行转换,其中一个全连接网络层根据所述LSTM层转换后的向量值进行第一分类处理,并输出所述高频频谱包络,另一个全连接网络层根据所述LSTM层转换后的向量值进行第二分类处理,并输出所述相对平坦度信息。
  19. 一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,所述处理器执行所述程序时实现权利要求1-13任一项所述的频带扩展方法。
  20. 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现权利要求1-13任一项所述的频带扩展方法。