CN112530446B - Band expansion method, device, electronic equipment and computer readable storage medium - Google Patents

Publication number
CN112530446B
Authority
CN
China
Prior art keywords
spectrum
frequency
sub
low
envelope
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910955743.7A
Other languages
Chinese (zh)
Other versions
CN112530446A (en)
Inventor
肖玮
商世东
吴祖榕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Publication of CN112530446A
Application granted
Publication of CN112530446B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 Quantisation or dequantisation of spectral components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application relates to the technical field of audio processing, and discloses a frequency band expansion method, a device, electronic equipment and a computer readable storage medium, wherein the frequency band expansion method comprises the following steps: performing time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum; based on the low frequency spectrum, obtaining a correlation parameter of a high frequency part and a low frequency part of the target broadband spectrum through a neural network model, wherein the correlation parameter comprises at least one of a high frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes the correlation of the spectrum flatness of the high frequency part and the spectrum flatness of the low frequency part of the target broadband spectrum; obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum; based on the low-frequency spectrum and the target high-frequency spectrum, obtaining a broadband signal with the expanded frequency band; wherein at least one of the low frequency spectrum or the target high frequency spectrum is a spectrum obtained by filtering the corresponding initial spectrum.

Description

Band expansion method, device, electronic equipment and computer readable storage medium
Technical Field
The embodiment of the application relates to the technical field of audio processing, in particular to a frequency band expansion method, a device, electronic equipment and a computer readable storage medium.
Background
Band extension, also known as band replication, is a classical technique in the field of audio coding. It is a parametric coding technique that extends the effective bandwidth at the receiving end and thereby improves the quality of the audio signal, so that the user perceives a brighter timbre, greater loudness, and better intelligibility.
In the prior art, a classical implementation of band extension exploits the correlation between the high-frequency and low-frequency parts of a speech signal. In an audio coding system, this correlation is treated as side information: the encoding end multiplexes the side information into the code stream and transmits it, and the decoding end first restores the low-frequency spectrum by decoding and then performs a band extension operation to restore the high-frequency spectrum. However, this method requires the system to spend additional bits on the side information (for example, about 10% more bits on top of those used to encode the low-frequency part), and it therefore has a forward compatibility problem.
Another common band extension method is a blind, data-driven scheme based on neural networks or deep learning, in which the input is the low-frequency coefficients and the output is the high-frequency coefficients. This coefficient-to-coefficient mapping places high demands on the generalization capability of the network: to ensure good results, the network must be deep and large, so its complexity is high, and in practice the method performs only moderately in scenarios beyond the patterns contained in the training set.
Disclosure of Invention
The aim of the embodiment of the application is to at least solve one of the technical defects, and the following technical scheme is specifically provided:
in one aspect, a method for expanding a frequency band is provided, including:
performing time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum;
based on the low frequency spectrum, obtaining a correlation parameter of a high frequency part and a low frequency part of the target broadband spectrum through a neural network model, wherein the correlation parameter comprises at least one of a high frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes the correlation of the spectrum flatness of the high frequency part and the spectrum flatness of the low frequency part of the target broadband spectrum;
obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum;
based on the low-frequency spectrum and the target high-frequency spectrum, obtaining a broadband signal with a spread frequency band;
wherein at least one of the low frequency spectrum and the target high frequency spectrum is a spectrum obtained by filtering the corresponding initial spectrum.
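As a rough illustration of the first step above (time-frequency conversion), the following Python sketch computes a low-frequency spectrum with a naive discrete Fourier transform and splits it into amplitude and phase. The function names, the absence of windowing, and the choice to keep only the first N/2 bins are assumptions made for illustration; the text above does not specify them.

```python
import cmath
import math

def low_frequency_spectrum(frame):
    """Naive O(N^2) DFT of one real-valued frame; returns bins 0 .. N/2 - 1.

    Assumption: no analysis window and no overlap, purely for illustration.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * cmath.pi * k * t / n)
        spectrum.append(acc)
    return spectrum

def amplitude_and_phase(spectrum):
    """Split a complex spectrum into its amplitude and phase spectra."""
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase

# Usage: an 8-sample frame holding exactly one cosine cycle.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spec = low_frequency_spectrum(frame)
amp, ph = amplitude_and_phase(spec)
```

In this toy frame all of the energy falls into bin 1, which makes the result easy to check by hand; a practical system would use an FFT and a windowed, overlapping framing scheme.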
In one aspect, there is provided a band expanding apparatus including:
the low-frequency spectrum determining module is used for carrying out time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum;
The correlation parameter determining module is used for obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum through a neural network model based on the low-frequency spectrum, wherein the correlation parameters comprise at least one of a high-frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes correlation between the spectral flatness of the high-frequency part and the spectral flatness of the low-frequency part of the target broadband spectrum;
the high-frequency spectrum determining module is used for obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum;
the broadband signal determining module is used for obtaining broadband signals with expanded frequency bands based on the low-frequency spectrum and the target high-frequency spectrum;
wherein at least one of the low frequency spectrum or the target high frequency spectrum is a spectrum obtained by filtering the corresponding initial spectrum.
In one possible implementation, the low frequency spectrum determining module is configured to:
determining a first filtering gain of the initial spectrum based on the spectral energy of the initial spectrum;
and performing filtering processing on the initial frequency spectrum according to the first filtering gain.
In one possible implementation, the low frequency spectrum determining module is configured to:
dividing the initial frequency spectrum into a first number of sub-frequency spectrums, and determining first frequency spectrum energy corresponding to each sub-frequency spectrum;
determining a second filtering gain corresponding to each sub-spectrum based on the first spectral energy corresponding to that sub-spectrum, wherein the first filtering gain comprises the first number of second filtering gains;
the low-frequency spectrum determining module is used for respectively carrying out filtering processing on each corresponding sub-spectrum according to the second filtering gain corresponding to each sub-spectrum when carrying out filtering processing on the initial spectrum according to the first filtering gain.
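The sub-spectrum processing described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: equal-width sub-spectra, mean squared magnitude as the "first spectral energy", and gains supplied by the caller. The function names are hypothetical and the passage does not give the actual gain formula.

```python
def sub_spectrum_energies(spectrum, num_sub):
    """Split `spectrum` into `num_sub` equal-width sub-spectra and return
    the mean energy of each (assumed definition of sub-spectrum energy)."""
    width = len(spectrum) // num_sub
    energies = []
    for k in range(num_sub):
        band = spectrum[k * width:(k + 1) * width]
        energies.append(sum(abs(c) ** 2 for c in band) / width)
    return energies

def apply_sub_spectrum_gains(spectrum, gains):
    """Scale every coefficient of sub-spectrum k by gains[k].

    Coefficients beyond the last full sub-spectrum are left unchanged.
    """
    width = len(spectrum) // len(gains)
    out = list(spectrum)
    for k, g in enumerate(gains):
        for i in range(k * width, (k + 1) * width):
            out[i] = spectrum[i] * g
    return out

# Usage with a toy 4-coefficient spectrum and two sub-spectra.
energies = sub_spectrum_energies([1.0, 1.0, 2.0, 2.0], 2)
filtered = apply_sub_spectrum_gains([1.0, 1.0, 2.0, 2.0], [0.5, 2.0])
```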
In one possible implementation, the low frequency spectrum determining module is configured to:
dividing a frequency band corresponding to the initial frequency spectrum into a first sub-band and a second sub-band;
determining first sub-band energy of the first sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the first sub-band, and determining second sub-band energy of the second sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the second sub-band;
determining a spectrum inclination coefficient of the initial spectrum according to the first sub-band energy and the second sub-band energy;
and determining a second filtering gain corresponding to each sub-spectrum according to the spectrum inclination coefficient and the first spectrum energy corresponding to each sub-spectrum.
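A hedged Python sketch of the tilt-based gain computation follows. The passage above names the inputs (two sub-band energies, a spectrum inclination coefficient, per-sub-spectrum energies) but not the formulas, so both the log-ratio tilt and the power-law gain rule below are assumptions chosen only to make the data flow concrete; `alpha` and the function names are hypothetical.

```python
import math

def spectral_tilt(sub_energies, split):
    """Assumed tilt measure: log10 ratio of first to second sub-band energy.

    Sub-spectra [0, split) form the first sub-band, the rest the second.
    """
    eps = 1e-12  # guards against division by zero and log of zero
    first_band = sum(sub_energies[:split])
    second_band = sum(sub_energies[split:])
    return math.log10((first_band + eps) / (second_band + eps))

def gains_from_tilt(sub_energies, tilt, alpha=0.1):
    """Hypothetical rule: the steeper the tilt, the more the weaker
    sub-spectra are attenuated relative to the strongest one."""
    peak = max(sub_energies)
    if peak <= 0.0:
        return [1.0] * len(sub_energies)
    return [(e / peak) ** (alpha * abs(tilt)) for e in sub_energies]

# Usage: four sub-spectra, the first two forming the first sub-band.
energies = [4.0, 4.0, 1.0, 1.0]
tilt = spectral_tilt(energies, 2)
gains = gains_from_tilt(energies, tilt)
```

With this rule the strongest sub-spectra keep a gain of 1 and weaker ones receive a gain slightly below 1; any monotone mapping with the same inputs would fit the description equally well.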
In one possible implementation, the narrowband signal is a speech signal of a current speech frame, and the low frequency spectrum determining module is configured to:
Determining a first initial spectral energy of a sub-spectrum of the current speech frame;
if the current voice frame is the first voice frame, determining the first spectrum energy of the sub-spectrum based on the first initial spectrum energy of the sub-spectrum;
if the current voice frame is not the first voice frame, acquiring first initial spectrum energy of a sub-spectrum corresponding to one sub-spectrum of an associated voice frame, wherein the associated voice frame is at least one voice frame positioned before the current voice frame and adjacent to the current voice frame;
the first spectral energy of the one sub-spectrum is obtained based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the one sub-spectrum.
In one possible implementation, the associated speech frame is the speech frame immediately preceding the current speech frame. When the current speech frame is the first speech frame, the low frequency spectrum determining module, in determining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum, is configured for:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the initialized first initial spectral energy;
determining a first spectral energy of the one sub-spectrum based on the second spectral energy of the one sub-spectrum and the initialized first spectral energy;
when the current speech frame is not the first speech frame, the low frequency spectrum determining module, in obtaining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the corresponding sub-spectrum of the associated speech frame, is configured for:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum;
determining the first spectral energy of the one sub-spectrum from the second spectral energy of the one sub-spectrum and the first spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum.
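The two-stage, frame-to-frame energy computation described above can be sketched as a small stateful smoother. This is only a sketch: the weighted-average form, the weights `a` and `b`, the initial value `init_energy`, and the class name are all assumptions, since the passage states which quantities are combined but not how.

```python
class EnergySmoother:
    """Tracks per-sub-spectrum energies across speech frames."""

    def __init__(self, num_sub, init_energy=0.0, a=0.5, b=0.5):
        # Initialized values stand in for the "previous frame" on frame one.
        self.prev_initial = [init_energy] * num_sub
        self.prev_first = [init_energy] * num_sub
        self.a = a  # assumed weight of the current initial energy
        self.b = b  # assumed weight of the second spectral energy

    def update(self, initial):
        """Return the first spectral energy of each sub-spectrum."""
        # Stage 1: combine current and previous initial energies.
        second = [self.a * cur + (1.0 - self.a) * prev
                  for cur, prev in zip(initial, self.prev_initial)]
        # Stage 2: combine with the previous frame's first spectral energy.
        first = [self.b * s + (1.0 - self.b) * p
                 for s, p in zip(second, self.prev_first)]
        self.prev_initial = list(initial)
        self.prev_first = first
        return first

# Usage: one sub-spectrum observed over two frames.
smoother = EnergySmoother(num_sub=1)
frame1 = smoother.update([4.0])
frame2 = smoother.update([4.0])
```

Because the first frame is handled by seeding the state with initialized values, the same `update` path serves both the first-frame and later-frame cases.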
In one possible implementation, the correlation parameters include a high frequency spectral envelope and relative flatness information. The neural network model comprises at least an input layer and an output layer. The input layer receives the feature vector of the low-frequency spectrum. The output layer comprises at least a unidirectional long short-term memory (LSTM) layer and two fully-connected network layers, each connected to the LSTM layer and each comprising at least one fully-connected layer. The LSTM layer transforms the feature vector processed by the input layer; one fully-connected network layer performs first classification processing on the vector output by the LSTM layer and outputs the high-frequency spectrum envelope, and the other fully-connected network layer performs second classification processing on the vector output by the LSTM layer and outputs the relative flatness information.
In one possible implementation, the time-frequency transform comprises a Fourier transform or a discrete cosine transform;
if the time-frequency transformation is Fourier transformation, the correlation parameter determining module, when obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum through a neural network model based on the low-frequency spectrum, is configured for:
obtaining a low-frequency amplitude spectrum of the narrowband signal according to the low-frequency spectrum;
inputting the low-frequency amplitude spectrum into a neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum based on the output of the neural network model;
if the time-frequency transformation is discrete cosine transformation, the correlation parameter determining module, when obtaining the correlation parameters of the high-frequency part and the low-frequency part of the target broadband spectrum through a neural network model based on the low-frequency spectrum, is configured for:
inputting the low-frequency spectrum into a neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum based on the output of the neural network model.
In one possible implementation, the time-frequency transform comprises a Fourier transform or a discrete cosine transform;
if the time-frequency transformation is Fourier transformation, the high-frequency spectrum determining module, when obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum, is configured for:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency amplitude spectrum based on the low frequency amplitude spectrum;
based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency amplitude spectrum to obtain a target high-frequency amplitude spectrum;
generating a corresponding high-frequency phase spectrum based on the low-frequency phase spectrum of the narrowband signal;
obtaining a target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
if the time-frequency transformation is discrete cosine transformation, the high-frequency spectrum determining module is used for obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency spectrum based on the low frequency spectrum;
and adjusting the initial high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain a target high-frequency spectrum.
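For the Fourier case above, the last two steps (generating a high-frequency phase spectrum from the low-frequency phase spectrum, then combining it with the target high-frequency amplitude spectrum) might look like the sketch below. Reusing the low-frequency phase by tiling is an assumption, as are the function names; the passage only says the high-frequency phase is generated "based on" the low-frequency phase.

```python
import cmath
import math

def high_frequency_phase(low_phase, num_high_bins):
    """Assumed rule: reuse (tile) the low-frequency phase for the high band."""
    return [low_phase[i % len(low_phase)] for i in range(num_high_bins)]

def combine(amplitude, phase):
    """Build complex spectral coefficients from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]

# Usage: two low-frequency phase values tiled over four high-frequency bins.
phase = high_frequency_phase([0.0, math.pi / 2], 4)
target_high = combine([2.0, 1.0, 2.0, 1.0], phase)
```

In the discrete cosine transform case the coefficients are real, so no separate phase handling of this kind is needed.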
In one possible implementation, the high frequency spectrum determination module, when generating the initial high frequency amplitude spectrum based on the low frequency amplitude spectrum, is configured to:
copying the amplitude spectrum of the high-frequency band part in the low-frequency amplitude spectrum;
the high-frequency spectrum determining module is used for generating an initial high-frequency spectrum based on the low-frequency spectrum:
The spectrum of the high band portion of the low frequency spectrum is replicated.
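The copy operation described above can be illustrated with a few lines of Python. The parameter `copy_start` (the bin where the "high band portion" of the low-frequency spectrum begins) and the choice to repeat the copied segment until the high band is filled are assumptions for illustration only.

```python
def initial_high_band(low_spectrum, copy_start, num_high_bins):
    """Fill the high band by copying the upper part of the low-frequency
    (amplitude) spectrum, repeating it if the high band is wider."""
    source = low_spectrum[copy_start:]
    return [source[i % len(source)] for i in range(num_high_bins)]

# Usage: copy bins 2..3 of a 4-bin low band into a 4-bin high band.
initial_high = initial_high_band([1.0, 2.0, 3.0, 4.0], 2, 4)
```

The same helper works for amplitude values (Fourier case) and for real spectral coefficients (discrete cosine transform case), since it only copies list elements.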
In one possible implementation, the high frequency spectral envelope and the low frequency spectral envelope are both spectral envelopes in the logarithmic domain;
the high-frequency spectrum determining module is used for adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope:
determining a first difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
adjusting the initial high-frequency amplitude spectrum based on the first difference value to obtain a target high-frequency amplitude spectrum;
the high-frequency spectrum determining module is used for adjusting the initial high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope:
determining a second difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
and adjusting the initial high-frequency spectrum based on the second difference value to obtain a target high-frequency spectrum.
In one possible implementation, if the time-frequency transform is a Fourier transform, the high-frequency spectral envelope comprises a second number of first sub-spectral envelopes, the initial high-frequency amplitude spectrum comprises a second number of first sub-amplitude spectra, wherein each first sub-spectral envelope is determined based on a corresponding first sub-amplitude spectrum in the initial high-frequency amplitude spectrum;
the high-frequency spectrum determining module, when determining the first difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency amplitude spectrum based on the first difference to obtain the target high-frequency amplitude spectrum, is configured for:
determining a first difference value of each first sub-spectrum envelope and a corresponding spectrum envelope in the low-frequency spectrum envelopes;
based on the first difference value corresponding to each first sub-spectrum envelope, the corresponding first sub-amplitude spectrum is adjusted to obtain a second number of adjusted first sub-amplitude spectrums;
obtaining a target high-frequency amplitude spectrum based on the second number of adjusted first sub-amplitude spectrums;
if the time-frequency transformation is discrete cosine transformation, the high-frequency spectrum envelope comprises a third number of second sub-spectrum envelopes, the initial high-frequency spectrum comprises the third number of first sub-spectrums, and each second sub-spectrum envelope is determined based on the corresponding first sub-spectrum in the initial high-frequency spectrum;
the high-frequency spectrum determining module, when determining the second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency spectrum based on the second difference to obtain the target high-frequency spectrum, is configured for:
determining a second difference value of each second sub-spectral envelope and a corresponding spectral envelope in the low-frequency spectral envelopes;
based on the second difference value corresponding to each second sub-spectrum envelope, the corresponding first sub-spectrum is adjusted to obtain a third number of adjusted first sub-spectrums;
And obtaining a target high-frequency spectrum based on the third number of adjusted first sub-spectrums.
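The per-sub-band, log-domain adjustment described above can be sketched as follows. The sketch assumes that each envelope value is the natural logarithm of the mean energy of its sub-band, so that an envelope difference d translates into an amplitude gain of exp(d / 2); the passage states only that the envelopes are in the logarithmic domain, and the function names are hypothetical.

```python
import math

def log_envelope(band):
    """Assumed envelope: natural log of the mean energy of one sub-band."""
    mean_energy = sum(a * a for a in band) / len(band)
    return math.log(mean_energy + 1e-12)

def adjust_high_band(initial_high, low_env, high_env):
    """Scale each sub-band of the initial high-frequency spectrum by the
    difference between the predicted high-frequency envelope and the
    corresponding low-frequency envelope."""
    num_sub = len(high_env)
    width = len(initial_high) // num_sub
    adjusted = []
    for k in range(num_sub):
        diff = high_env[k] - low_env[k]   # log-energy difference
        gain = math.exp(diff / 2.0)       # energy ratio -> amplitude ratio
        adjusted.extend(a * gain for a in initial_high[k * width:(k + 1) * width])
    return adjusted

# Usage: two sub-bands copied from the low band, with target mean
# energies 1.0 and 4.0 standing in for the neural network output.
initial_high = [2.0, 2.0, 1.0, 1.0]
low_env = [log_envelope(initial_high[:2]), log_envelope(initial_high[2:])]
high_env = [math.log(1.0), math.log(4.0)]
target = adjust_high_band(initial_high, low_env, high_env)
```

Working with differences in the log domain keeps the adjustment a simple per-sub-band scale factor, which is why the same routine applies to both the amplitude spectrum (Fourier case) and the real spectrum (discrete cosine transform case).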
In one possible implementation, the high frequency spectrum determining module is configured to, when determining the first difference or the second difference of the high frequency spectrum envelope and the low frequency spectrum envelope:
determining a gain adjustment value of the high frequency spectrum envelope based on the relative flatness information and the energy information of the low frequency spectrum;
adjusting the high-frequency spectrum envelope based on the gain adjustment value to obtain an adjusted high-frequency spectrum envelope;
a first difference or a second difference of the adjusted high frequency spectral envelope and the low frequency spectral envelope is determined.
In one possible implementation, the relative flatness information includes relative flatness information of at least two sub-band areas corresponding to the high frequency part, the relative flatness information corresponding to one sub-band area characterizing a correlation of a spectral flatness of one sub-band area of the high frequency part and a spectral flatness of a high frequency band of the low frequency part;
if the high-frequency part comprises spectrum parameters corresponding to at least two sub-band regions, the spectrum parameters of each sub-band region are obtained based on the spectrum parameters of the high-frequency band of the low-frequency part, and the relative flatness information comprises the relative flatness information between the spectrum parameters of each sub-band region and the spectrum parameters of the high-frequency band, wherein the spectrum parameters are the amplitude spectrum if the time-frequency transformation is Fourier transformation, and the frequency spectrum if the time-frequency transformation is discrete cosine transformation;
The high-frequency spectrum determining module is used for determining a gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the low-frequency spectrum:
determining a gain adjustment value of a corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum;
the high-frequency spectrum determining module is used for adjusting the high-frequency spectrum envelope based on the gain adjustment value:
and adjusting the corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope.
In one possible implementation, if the high frequency spectrum envelope includes a first predetermined number of high frequency sub-spectrum envelopes, the first predetermined number is a second number when the low frequency spectrum is obtained by fourier transform, and the first predetermined number is a third number when the low frequency spectrum is obtained by discrete cosine transform;
the high-frequency spectrum determining module is used for determining a gain adjustment value of a corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum:
For each high-frequency sub-spectrum envelope, determining a gain adjustment value of the high-frequency sub-spectrum envelope according to spectrum energy information corresponding to a spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, relative flatness information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, and spectrum energy information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope;
adjusting each corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope, including:
and adjusting the corresponding high-frequency sub-spectrum envelopes according to the gain adjustment value of each high-frequency sub-spectrum envelope in the high-frequency spectrum envelopes.
In one possible implementation, if the narrowband signal includes at least two associated signals, the apparatus further includes:
the narrowband signal determining module is used for fusing at least two paths of related signals to obtain narrowband signals; or, each of the at least two associated signals is used as a narrowband signal.
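As one possible reading of the fusion option above, the sketch below averages the associated signals sample by sample to form a single narrowband signal. Plain averaging, truncation to the shortest signal, and the function name are assumptions; the passage does not say how the signals are fused.

```python
def fuse_signals(signals):
    """Average several associated signals sample by sample.

    Signals are truncated to the length of the shortest one.
    """
    length = min(len(s) for s in signals)
    count = len(signals)
    return [sum(s[i] for s in signals) / count for i in range(length)]

# Usage: fuse two short signals of unequal length.
narrowband = fuse_signals([[1.0, 3.0, 5.0], [3.0, 1.0]])
```

Under the alternative option, each associated signal would simply be passed through the band extension method separately, with no fusion step.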
In one aspect, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the band extension method described above when executing the program.
In one aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the above-described band extension method.
According to the frequency band expansion method provided by the embodiment of the application, the low-frequency spectrum can be the frequency spectrum obtained by filtering the corresponding initial frequency spectrum, so that quantization noise possibly introduced in the quantization process of a narrow-band signal is effectively filtered, and the quantization noise is prevented from being expanded to the high-frequency spectrum in the process of carrying out frequency band expansion based on the low-frequency spectrum; the target high-frequency spectrum can also be a spectrum obtained by filtering the corresponding initial spectrum, so that noise possibly existing in the target high-frequency spectrum can be effectively filtered, the signal quality of the broadband signal is enhanced, and the hearing experience of a user is further improved.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of embodiments of the application will become apparent and may be better understood from the following description of embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a band expanding method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a neural network model according to an embodiment of the present application;
Fig. 3 is a flowchart of a band expanding method in the first example of the embodiment of the present application;
Fig. 4 is a flowchart of a band expanding method in a second example of the embodiment of the present application;
Fig. 5 is a schematic structural diagram of a band expanding device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to better understand and describe the schemes of the embodiments of the present application, some technical terms related to the embodiments of the present application are briefly described below.
Band extension (Bandwidth Extension, BWE): a technology in the field of audio coding that extends a narrowband signal into a wideband signal.
Frequency spectrum: short for frequency spectral density, i.e. the distribution curve of a signal over frequency.
Spectral envelope (Spectrum Envelope, SE): the energy representation, on the frequency axis corresponding to the signal, of the spectral coefficients corresponding to the signal; for a subband, it is the energy representation of the spectral coefficients corresponding to that subband, e.g. the average energy of those coefficients.
Spectral flatness (Spectrum Flatness, SF): characterizes how flat the power of the signal under measurement is within the channel in which that signal lies.
Neural Networks (NN): an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes.
Deep Learning (DL): deep learning is a type of machine learning that combines low-level features to form more abstract high-level representation attribute categories or features to discover a distributed feature representation of data.
PSTN (Public Switched Telephone Network): the traditional telephone system, i.e. the telephone network commonly used in daily life.
VoIP (Voice over Internet Protocol, internet telephony): a technology for carrying voice calls and multimedia conferences over the Internet Protocol, i.e. communicating over the internet.
3GPP EVS: 3GPP (3rd Generation Partnership Project) mainly formulates third-generation specifications for wireless interfaces based on the Global System for Mobile Communications; the EVS (Enhanced Voice Services) codec is a new-generation speech and audio codec that not only provides very high audio quality for both speech and music signals, but is also highly robust to frame loss and delay jitter, giving users a new experience.
IETF OPUS: Opus is a lossy audio coding format developed by the Internet Engineering Task Force (IETF).
SILK: the SILK audio codec is a wideband codec that Skype makes available to third-party developers and hardware manufacturers under a royalty-free license.
In particular, band extension is a classical technique in the field of audio coding, and it can be achieved in the following ways:
In the first way, for a narrowband signal at a low sampling rate, the spectrum of the low-frequency part of the narrowband signal is copied to the high-frequency part; the narrowband signal is then expanded into a wideband signal based on side information recorded in advance (information describing the energy correlation between the high and low frequencies).
In the second way, blind band extension is performed directly, without extra bits: a neural network or deep learning technique takes the low-frequency spectrum of the narrowband signal at a low sampling rate as input and outputs a high-frequency spectrum, and the narrowband signal is expanded into a wideband signal based on that high-frequency spectrum.
However, in the first way the side information consumes corresponding bits, and there is a forward-compatibility problem, as in the typical scenario in which PSTN (narrowband voice) and VoIP (wideband voice) interwork. In the PSTN-to-VoIP transmission direction, wideband speech cannot be output unless the transmission protocol is modified (i.e., a corresponding band-extension bitstream is added). In the second way, the input is the low-frequency spectrum and the output is the high-frequency spectrum. This approach consumes no extra bits, but it demands very high generalization capability from the network; to ensure accurate output, the network must be deep and large, so its complexity is high and its performance is poor. Therefore, neither of the above band extension methods meets the performance requirements of practical band extension.
Aiming at the problems existing in the prior art and better meeting the actual application demands, the embodiment of the application provides a frequency band expansion method, by which extra bits are not needed, the depth and the volume of a network are reduced, and the complexity of the network is reduced.
In the embodiment of the present application, the scheme is described by taking a PSTN (narrowband voice) and VoIP (wideband voice) interworking scenario as an example, that is, narrowband voice is extended to wideband voice in the transmission direction from PSTN to VoIP (abbreviated as PSTN-VoIP). In practical applications, the present application is not limited to this scenario and is also applicable to other coding systems, including but not limited to mainstream audio codecs such as 3GPP EVS, IETF OPUS, and SILK.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In the following description of the scheme of the embodiment of the present application, a voice scenario of PSTN and VoIP interworking is taken as an example, in which the sampling rate is 8000Hz and the frame length of one voice frame is 10ms (equivalent to 80 sample points/frame). In practical applications, since the PSTN frame length is 20ms, only two operations are required for each PSTN frame. In the description of the embodiment of the present application, the frame length is fixed at 10ms as an example; however, it will be clear to those skilled in the art that the present application is equally applicable to other frame lengths, such as 20ms (equivalent to 160 sample points/frame), and no limitation is imposed here.
Likewise, the sampling rate of 8000Hz is taken as an example in the embodiment of the present application, and is not used to limit the range of the band extension provided by the embodiment of the present application. For example, although the main embodiment of the present application is to extend the frequency band of a signal with a sampling rate of 8000Hz to a signal with a sampling rate of 16000Hz, the present application can also be applied to other sampling rate scenarios, such as extending a signal with a sampling rate of 16000Hz to a signal with a sampling rate of 32000Hz, extending a signal with a sampling rate of 8000Hz to a signal with a sampling rate of 12000Hz, etc. The scheme of the embodiment of the application can be applied to any scene needing signal band expansion.
An embodiment of the present application provides a band extension method, which is performed by a computer device, which may be a terminal or a server. The terminal may be a desktop device or a mobile terminal. The servers may be separate physical servers, clusters of physical servers, or virtual servers. As shown in fig. 1, the method includes:
step S110, performing time-frequency transformation on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum.
Specifically, the low frequency spectrum is obtained by performing a time-frequency transform on the narrowband signal, including but not limited to fourier transform, discrete cosine transform, discrete sine transform, wavelet transform, and the like.
The narrowband signal to be processed may be a voice frame signal requiring band extension, for example, in a PSTN-VoIP path, the PSTN narrowband voice signal needs to be extended to a VoIP wideband voice signal, and the narrowband signal to be processed may be a PSTN narrowband voice signal. If the narrowband signal to be processed is a signal of speech frames, the narrowband signal to be processed may be all or part of a speech signal of a frame of speech frames.
In an actual application scenario, the signal to be processed may be used as a narrowband signal to be processed to complete band extension at one time, or the signal may be divided into a plurality of sub-signals, and the plurality of sub-signals may be processed respectively, for example, the frame length of the PSTN frame is 20ms, the signal of the 20ms voice frame may be subjected to band extension once, or the 20ms voice frame may be divided into two 10ms voice frames, and the two 10ms voice frames may be subjected to band extension respectively.
Step S120, obtaining a correlation parameter of a high-frequency part and a low-frequency part of the target broadband spectrum through a neural network model based on the low-frequency spectrum, wherein the correlation parameter comprises at least one of a high-frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes the correlation of the spectrum flatness of the high-frequency part and the spectrum flatness of the low-frequency part of the target broadband spectrum.
In particular, the neural network model may be a model trained in advance based on a low frequency spectrum of the signal, the model being used to predict correlation parameters of the signal. The target wideband spectrum refers to a spectrum corresponding to the bandwidth of the narrowband signal after being expanded, and the target wideband spectrum is obtained based on the low frequency spectrum of the to-be-processed voice signal, for example, the target wideband spectrum may be obtained by copying the low frequency spectrum of the to-be-processed voice signal.
Step S130, obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum.
Specifically, based on the correlation parameter and the low frequency spectrum (parameter corresponding to the low frequency portion), a target high frequency spectrum of the wideband signal (i.e., parameter corresponding to the high frequency portion of the wideband signal) to be spread can be predicted.
Step S140, obtaining a wideband signal with a spread frequency band based on the low frequency spectrum and the target high frequency spectrum, wherein at least one of the low frequency spectrum or the target high frequency spectrum is a spectrum obtained by filtering the corresponding initial spectrum.
Specifically, after the target high-frequency spectrum is obtained, the low-frequency spectrum and the target high-frequency spectrum can be combined, and an inverse time-frequency transform (i.e., a frequency-time transform) is applied to the combined spectrum to obtain a new wideband signal, thereby realizing band extension of the narrowband signal. Because the bandwidth of the expanded wideband signal is larger than that of the narrowband signal, voice frames with a fuller timbre and greater loudness can be obtained from the wideband signal, giving the user a better listening experience.
Specifically, the low frequency spectrum in the step S110 and the step S130 may be a spectrum obtained by performing a filtering process on the corresponding initial low frequency spectrum, that is, the low frequency spectrum is a spectrum obtained by performing a filtering process on the initial low frequency spectrum obtained by performing a time-frequency transformation on the narrowband signal. Since the narrowband signal is usually quantized before the time-frequency transformation of the narrowband signal, and quantization noise is generally introduced during the quantization process, after the time-frequency transformation of the narrowband signal, the quantization noise in the initial low-frequency spectrum can be filtered by performing a filtering process on the initial low-frequency spectrum after the time-frequency transformation, so as to obtain the low-frequency spectrum, thereby preventing the quantization noise from being introduced into the high-frequency spectrum during the subsequent frequency band expansion based on the low-frequency spectrum.
Specifically, the target high frequency spectrum in the step S140 may be a spectrum obtained by performing filtering processing on a corresponding initial high frequency spectrum, that is, the target high frequency spectrum in the step S140 is obtained by performing filtering processing on an initial high frequency spectrum obtained based on a low frequency spectrum, so that noise possibly existing in the target high frequency spectrum is effectively filtered, the signal quality of the wideband signal is enhanced, and the hearing experience of the user is further improved.
According to the frequency band expansion method provided by the embodiment of the application, the low-frequency spectrum can be the frequency spectrum obtained by filtering the corresponding initial frequency spectrum, so that quantization noise possibly introduced in the quantization process of a narrowband signal is effectively filtered, and the quantization noise is prevented from being expanded to a target high-frequency spectrum in the process of frequency band expansion based on the low-frequency spectrum; the target high-frequency spectrum can also be a spectrum obtained by filtering the corresponding initial spectrum, so that noise possibly existing in the target high-frequency spectrum can be effectively filtered, the signal quality of the broadband signal is enhanced, and the hearing experience of a user is further improved.
In one implementation manner of the embodiment of the present application, obtaining the target high frequency spectrum based on the low frequency spectrum may include:
based on the low frequency spectrum, obtaining a correlation parameter of a high frequency part and a low frequency part of a target broadband spectrum through a neural network model, wherein the correlation parameter comprises a high frequency spectrum envelope;
and obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum.
The target wideband spectrum refers to a spectrum corresponding to a wideband signal to which the narrowband signal wants to spread, where the target wideband spectrum is obtained based on a low frequency spectrum of the to-be-processed voice signal, for example, the target wideband spectrum may be obtained by copying the low frequency spectrum of the to-be-processed voice signal.
Specifically, the neural network model may be a model trained in advance on sample data, where each piece of sample data includes a sample narrowband signal and the sample wideband signal corresponding to it. For each piece of sample data, the correlation parameter of the high-frequency part and the low-frequency part of the spectrum of the sample wideband signal may be determined; this parameter can be understood as the labeling information of the sample data, i.e. the sample label (referred to below as the labeling result). The correlation parameter includes a high-frequency spectral envelope, and may further include relative flatness information of the high-frequency part and the low-frequency part of the spectrum of the sample wideband signal. When the neural network model is trained on the sample data, the input of the initial neural network model is the low-frequency spectrum of the sample narrowband signal, and the output is the predicted correlation parameter (referred to below as the prediction result). Whether training is finished may be determined from the degree of similarity between the prediction result and the labeling result of each piece of sample data, for example by checking whether the loss function of the model, which characterizes the degree of difference between the prediction result and the labeling result of each piece of sample data, has converged. The model at the end of training is used as the neural network model applied in the embodiments of the present application.
In the application stage of the neural network model, for the narrowband signal, the low-frequency spectrum of the narrowband signal can be input into the trained neural network model to obtain the correlation parameter corresponding to the narrowband signal. Because the sample label of the sample data is the correlation parameter of the high-frequency part and the low-frequency part of the sample broadband signal when the model is trained based on the sample data, the correlation parameter of the narrowband signal obtained based on the output of the neural network model can well represent the correlation of the high-frequency part and the low-frequency part of the frequency spectrum of the target broadband signal.
Specifically, since the correlation parameter may characterize the correlation of the high frequency portion and the low frequency portion of the target wideband spectrum, the target high frequency spectrum (i.e., the parameter corresponding to the high frequency portion of the wideband signal) of the wideband signal to be spread may be predicted based on the correlation parameter and the low frequency spectrum (the parameter corresponding to the low frequency portion).
In this implementation, the correlation parameter of the high-frequency part and the low-frequency part of the target wideband spectrum can be obtained through the neural network model based on the low-frequency spectrum of the narrowband signal to be processed. Because the prediction is made by the neural network model, no extra bits need to be encoded; the method is a blind analysis method and has good forward compatibility. Moreover, because the output of the model is a parameter reflecting the correlation between the high-frequency part and the low-frequency part of the target wideband spectrum, the model maps spectral parameters to correlation parameters. Compared with the existing coefficient-to-coefficient mapping, this has better generalization capability and yields signals with a fuller timbre and greater loudness, giving the user a better listening experience.
In an implementation manner of the embodiment of the present application, performing time-frequency conversion on a narrowband signal to be processed to obtain a corresponding low-frequency spectrum may include:
carrying out up-sampling processing on the narrowband signal with a sampling factor of a first set value to obtain an up-sampling signal;
performing time-frequency conversion on the up-sampling signal to obtain a low-frequency domain coefficient;
the low frequency domain coefficients are determined to be a low frequency spectrum.
The manner in which the low frequency spectral parameters are determined is described in further detail below in connection with one example. In this example, the foregoing description is given by taking as an example a voice scenario of PSTN and VoIP interworking, a sampling rate of a voice signal of 8000Hz, and a frame length of one voice frame of 10 ms.
In this example, the PSTN signal sampling rate is 8000Hz, so according to the Nyquist sampling theorem the bandwidth of the narrowband signal is 4000Hz. The purpose of this example is to obtain a signal with a bandwidth of 8000Hz after band extension of the narrowband signal, i.e. the bandwidth of the wideband signal is 8000Hz. In an actual voice communication scenario, however, although the nominal bandwidth is 4000Hz, the upper bound of the effective bandwidth is typically 3500Hz. Therefore, in this scheme the effective bandwidth of the wideband signal actually obtained is 7000Hz; that is, the purpose of this example is to band-extend a signal with a bandwidth of 3500Hz into a wideband signal with a bandwidth of 7000Hz, i.e. to extend a signal with a sampling rate of 8000Hz to a signal with a sampling rate of 16000Hz.
In this example, the sampling factor is 2, and up-sampling processing with the sampling factor of 2 is performed on the narrowband signal, so as to obtain an up-sampled signal with the sampling rate of 16000 Hz. Since the sampling rate of the narrowband signal is 8000Hz and the frame length is 10ms, the up-sampled signal corresponds to 160 sample points.
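The factor-2 up-sampling step can be illustrated with a minimal sketch. The source does not specify an interpolation filter; the zero-stuffing followed by a Hamming-windowed sinc half-band low-pass used here, the function name `upsample_by_2`, and the parameter `half_len` are all assumptions chosen for illustration only.

```python
import math

def upsample_by_2(x, half_len=15):
    """Up-sample x by a factor of 2 (hypothetical helper, not from the source)."""
    # Zero-stuffing: place each input sample at an even index.
    z = [0.0] * (2 * len(x))
    z[::2] = [float(v) for v in x]
    # Half-band low-pass (cutoff at a quarter of the new sampling rate),
    # Hamming-windowed, with gain 2 to compensate for the inserted zeros.
    taps = []
    for k in range(-half_len, half_len + 1):
        t = 0.5 * k
        sinc = 1.0 if k == 0 else math.sin(math.pi * t) / (math.pi * t)
        window = 0.54 + 0.46 * math.cos(math.pi * k / half_len)
        taps.append(sinc * window)
    # Centered convolution; samples outside the frame are treated as zero.
    y = []
    for n in range(len(z)):
        acc = 0.0
        for idx, h in enumerate(taps):
            m = n - (idx - half_len)
            if 0 <= m < len(z):
                acc += h * z[m]
        y.append(acc)
    return y

# One 10 ms frame at 8000 Hz (80 samples) becomes 160 samples at 16000 Hz.
frame_8k = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(80)]
frame_16k = upsample_by_2(frame_8k)
```

With this particular filter the original samples are preserved at the even output positions and only the odd positions are interpolated; a production system would typically carry filter state across frames rather than zero-padding each frame.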
Then, performing time-frequency conversion on the up-sampled signal to obtain an initial low-frequency domain coefficient, and after obtaining the initial low-frequency domain coefficient, performing filtering processing on the initial low-frequency domain coefficient to obtain a low-frequency domain coefficient, and determining the low-frequency domain coefficient as a low-frequency spectrum for subsequent calculation of a low-frequency spectrum envelope, a low-frequency amplitude spectrum and the like; of course, the initial low frequency domain coefficient may be directly used as the low frequency spectrum without performing the filtering process on the initial low frequency domain coefficient.
Specifically, the Fourier transform may be a short-time Fourier transform (STFT), and the discrete cosine transform may be a modified discrete cosine transform (MDCT). When performing the time-frequency transform on the up-sampled signal, to eliminate discontinuities in the data between frames, the sample points corresponding to the previous voice frame and the sample points corresponding to the current voice frame (the narrowband signal to be processed) may be combined into one array, and the sample points in the array are then windowed to obtain a windowed signal.
Specifically, when the time-frequency transform adopts the STFT, a Hanning window may be used for the windowing. After the Hanning windowing, the STFT may be applied to the windowed signal to obtain the corresponding low-frequency domain coefficients. Considering the conjugate symmetry of the Fourier transform, and that the first coefficient is the direct-current component, if M low-frequency domain coefficients are obtained, then (1 + M/2) of them can be selected for subsequent processing.
As an example, the specific process of applying the STFT to the up-sampled signal containing 160 sample points is as follows. The 160 sample points corresponding to the previous voice frame and the 160 sample points corresponding to the current voice frame (the narrowband signal to be processed) form an array of 320 sample points. The sample points in the array are then windowed with a Hanning window to obtain the windowed signal $s_{Low}(i,j)$, and a Fourier transform is applied to $s_{Low}(i,j)$ to obtain 320 low-frequency domain coefficients $S_{Low}(i,j)$, where i is the frame index of the voice frame and j is the intra-frame sample index (j = 0, 1, …, 319). Considering the conjugate symmetry of the Fourier transform, only the first 161 low-frequency domain coefficients need to be considered; since the first coefficient is the direct-current component, the 2nd to 161st of these 161 low-frequency domain coefficients may be taken as the above-mentioned initial low-frequency spectrum.
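The STFT step for one frame can be sketched as follows. This is only an illustration under stated assumptions: the function names `hann` and `stft_frame` are hypothetical, the periodic form of the Hanning window is assumed (the source does not say which form is used), and a direct DFT is used for clarity instead of an FFT.

```python
import cmath
import math

def hann(n_len):
    # Periodic Hanning window (assumed form; the source only says "Hanning window").
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]

def stft_frame(prev_frame, cur_frame):
    """Window previous+current frame and return the first 1 + M/2 DFT coefficients."""
    block = list(prev_frame) + list(cur_frame)   # 160 + 160 = 320 samples
    n_len = len(block)
    windowed = [w * s for w, s in zip(hann(n_len), block)]
    n_keep = n_len // 2 + 1                      # conjugate symmetry: keep 161
    coeffs = []
    for k in range(n_keep):
        acc = 0j
        for n, v in enumerate(windowed):
            acc += v * cmath.exp(-2j * math.pi * k * n / n_len)
        coeffs.append(acc)
    return coeffs

# A 1000 Hz tone at 16000 Hz falls exactly on bin 20 (bin spacing 50 Hz).
tone = [math.cos(2 * math.pi * 20 * n / 320) for n in range(320)]
spectrum = stft_frame(tone[:160], tone[160:])
peak_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
```

Index 0 of the returned list is the direct-current component, so the coefficients at indices 1 to 160 correspond to the 2nd to 161st coefficients mentioned above.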
Specifically, when the time-frequency transform adopts the MDCT, a cosine window may be used for the windowing. After the cosine windowing, the MDCT may be applied to the windowed signal to obtain the corresponding low-frequency domain coefficients, and subsequent processing is performed based on these coefficients. Assume the windowed signal is $s_{Low}(i,j)$, where i is the frame index of the voice frame and j is the intra-frame sample index (j = 0, 1, …, 319). A 320-point MDCT may then be applied to $s_{Low}(i,j)$ to obtain 160 MDCT coefficients $S_{Low}(i,j)$, where i is the frame index of the voice frame and j is the intra-frame coefficient index (j = 0, 1, …, 159); these 160 MDCT coefficients are taken as the low-frequency domain coefficients.
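A minimal sketch of the MDCT variant is given below. The source does not state the exact window or MDCT definition; the sine-shaped window sin(π(n + 0.5)/2N) and the common MDCT formula with phase offset N/2 are assumptions, and `mdct_frame` is a hypothetical name.

```python
import math

def mdct_frame(prev_frame, cur_frame):
    """Window a 2N-sample block and return N real MDCT coefficients."""
    block = list(prev_frame) + list(cur_frame)   # 2N = 320 samples
    two_n = len(block)
    n_half = two_n // 2                          # N = 160 coefficients
    # Assumed window: half-period sine over the 2N-sample block.
    window = [math.sin(math.pi * (n + 0.5) / two_n) for n in range(two_n)]
    windowed = [w * s for w, s in zip(window, block)]
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, v in enumerate(windowed):
            acc += v * math.cos(
                math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            )
        coeffs.append(acc)
    return coeffs

mdct_coeffs = mdct_frame([0.0] * 160, [1.0] * 160)
```

Unlike the STFT case, the coefficients are real, so no amplitude computation is needed before they are passed on.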
When the narrowband signal is a signal with a sampling rate of 8000Hz and an effective bandwidth of 0-3500Hz, it can be determined from the sampling rate and frame length of the narrowband signal that the number of low-frequency domain coefficients corresponding to its effective bandwidth is actually 70, i.e. the number of effective coefficients of the initial low-frequency spectrum $S_{Low}(i,j)$ is 70 (j = 0, 1, …, 69). The subsequent processing is described below using these 70 low-frequency domain coefficients as an example.
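The count of 70 follows from the frequency resolution of the transform: a 320-sample block at 16000Hz gives a spacing of 16000 / 320 = 50Hz per coefficient, and 3500 / 50 = 70. A short sketch of this arithmetic (the function name is hypothetical):

```python
def effective_coeff_count(sample_rate_hz, transform_len, bandwidth_hz):
    """Number of coefficients that fall inside the effective bandwidth."""
    bin_hz = sample_rate_hz / transform_len   # frequency spacing per coefficient
    return int(bandwidth_hz // bin_hz)

n_effective = effective_coeff_count(16000, 320, 3500)
```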
In one possible implementation of an embodiment of the application, the time-frequency transform comprises a fourier transform or a discrete cosine transform. After obtaining a low-frequency spectrum by performing time-frequency transformation on a narrowband signal to be processed, if the time-frequency transformation is fourier transformation (such as STFT), the low-frequency spectrum at this time is in a complex form, so that a real low-frequency amplitude spectrum can be obtained according to the complex low-frequency spectrum, and then subsequent processing can be performed based on the low-frequency amplitude spectrum, that is, in the process of obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum based on the low-frequency spectrum through a neural network model, the low-frequency amplitude spectrum of the narrowband signal can be obtained according to the low-frequency spectrum; and inputting the low-frequency amplitude spectrum into a neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum based on the output of the neural network model. If the time-frequency transform is discrete cosine transform (such as MDCT), the low-frequency spectrum at this time is in a real form, so that the subsequent processing can be directly performed according to the low-frequency spectrum in the real form, that is, in the process of obtaining the correlation parameters of the high-frequency part and the low-frequency part of the target broadband spectrum through the neural network model based on the low-frequency spectrum, the low-frequency spectrum can be input into the neural network model, and the correlation parameters of the high-frequency part and the low-frequency part of the target broadband spectrum can be obtained based on the output of the neural network model.
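The two cases can be sketched in a few lines: complex STFT coefficients are reduced to their amplitudes, while real MDCT coefficients are passed through unchanged. The helper name `model_input` is hypothetical, and the neural network itself is outside the scope of this sketch.

```python
def model_input(low_spectrum):
    """Return the real-valued vector that would be fed to the model."""
    if any(isinstance(c, complex) for c in low_spectrum):
        # Fourier case: amplitude of each complex coefficient.
        return [abs(c) for c in low_spectrum]
    # Discrete cosine case: coefficients are already real.
    return [float(c) for c in low_spectrum]

amplitude = model_input([3 + 4j, 1j])
passthrough = model_input([1.5, -2.0])
```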
Specifically, when the time-frequency transformation is discrete sine transformation, wavelet transformation, or the like, the processing procedure of fourier transformation or discrete cosine transformation can be referred to as needed to obtain the correlation parameters of the high-frequency part and the low-frequency part of the target broadband spectrum through the neural network model based on the low-frequency spectrum, which is not described herein.
In one possible implementation of the embodiment of the present application, the STFT or MDCT may be applied to the windowed signal $s_{Low}(i,j)$, and the resulting low-frequency domain coefficients are referred to as the initial low-frequency domain coefficients. Once the initial low-frequency domain coefficients are obtained, they may be filtered; the filtered coefficients are referred to as the low-frequency domain coefficients, and the low-frequency amplitude spectrum of the narrowband signal is then determined from them. For convenience, in the following description the initial low-frequency domain coefficients are denoted $S_{Low}(i,j)$ and the low-frequency domain coefficients obtained by filtering are denoted $S_{Low\_rev}(i,j)$, where i is the frame index of the voice frame and j is the intra-frame sample index (j = 0, 1, …, 69).
Specifically, when filtering the initial low-frequency domain coefficients, a first filter gain may be determined based on the initial low-frequency domain coefficients, and the initial low-frequency domain coefficients are then filtered according to the first filter gain to obtain the filtered low-frequency domain coefficients. Similarly, when filtering the initial high-frequency domain coefficients, a first filter gain may be determined based on the initial high-frequency domain coefficients, and the initial high-frequency domain coefficients are then filtered according to that gain. The two filtering procedures follow the same principle and differ only in whether they operate on the initial low-frequency or the initial high-frequency domain coefficients. The filtering is explained below using the initial low-frequency domain coefficients as an example; to filter the initial high-frequency domain coefficients, the parameters related to the initial low-frequency domain coefficients are simply replaced by the corresponding parameters of the initial high-frequency domain coefficients.
Taking the initial low-frequency domain coefficients as an example, in practice the filtering according to the first filter gain can be performed by multiplying the initial low-frequency domain coefficients by the first filter gain. If the determined first filter gain is $G_{pre\_filt}(j)$, the initial low-frequency domain coefficients may be filtered according to the following formula (1):
$$S_{Low\_rev}(i,j) = G_{pre\_filt}(j) \cdot S_{Low}(i,j) \tag{1}$$
where i is the frame index of the voice frame and j is the intra-frame sample index (j = 0, 1, …, 69).
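The coefficient-wise product can be sketched as below. The function name `apply_filter_gain` and the example gain values are purely illustrative; how the gains themselves are derived is not covered by this sketch.

```python
def apply_filter_gain(initial_coeffs, gains):
    """Multiply each initial coefficient by the gain with the same index j."""
    if len(initial_coeffs) != len(gains):
        raise ValueError("one gain is required per coefficient")
    return [g * s for g, s in zip(gains, initial_coeffs)]

# Works for complex (STFT) and real (MDCT) coefficients alike.
filtered = apply_filter_gain([1 + 1j, 2.0], [0.5, 2.0])
```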
Specifically, when determining the first filter gain based on the initial low-frequency domain coefficients, the initial low-frequency domain coefficients may first be divided into a first number of sub-spectrums and the first spectral energy corresponding to each sub-spectrum determined; a second filter gain corresponding to each sub-spectrum is then determined based on the first spectral energy of that sub-spectrum, so that the first filter gain comprises the first number of second filter gains.
For ease of description, the first number is denoted L. One possible way of dividing the initial low-frequency domain coefficients into L sub-spectrums is to perform band division on the initial low-frequency domain coefficients to obtain L sub-spectrums, where each subband corresponds to N initial low-frequency domain coefficients, N × L equals the total number of initial low-frequency domain coefficients, L ≥ 2, and N ≥ 1. For example, with 70 initial low-frequency domain coefficients, the frequency band corresponding to every 5 (N = 5) initial low-frequency domain coefficients may be taken as one subband, giving 14 (L = 14) subbands of 5 initial low-frequency domain coefficients each.
In an alternative, the first initial spectral energy of each sub-spectrum may be used as the first spectral energy of that sub-spectrum. One possible implementation of determining the first initial spectral energy corresponding to each sub-spectrum is to take the average of the spectral energies of the N initial low-frequency domain coefficients corresponding to the sub-spectrum, where the spectral energy of an initial low-frequency domain coefficient is defined as the sum of the squares of its real part and its imaginary part. As an example, if there are 70 initial low-frequency domain coefficients, N = 5, and L = 14, the first initial spectral energy corresponding to each sub-spectrum may be calculated by the following formula (2):
$$Pe(k) = \frac{1}{5}\sum_{j=5k}^{5k+4}\left(\mathrm{Real}(S_{Low}(i,j))^2 + \mathrm{Imag}(S_{Low}(i,j))^2\right) \tag{2}$$
where i is the frame index of the speech frame, j is the sample index within the frame (j = 0, 1, …, 69), k = 0, 1, …, 13 is the sub-band index denoting the 14 sub-bands, Pe(k) is the first initial spectral energy corresponding to the k-th sub-band, and $S_{Low}(i,j)$ is the low-frequency domain coefficient obtained from the time-frequency transform (i.e., the initial low-frequency domain coefficient).
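The per-sub-spectrum averaging described above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function name `sub_spectrum_energies` and the test frame are hypothetical, and the coefficients are assumed to be available as Python complex numbers.

```python
def sub_spectrum_energies(coeffs, n_per_band=5):
    """Return the mean spectral energy Pe(k) of each group of n_per_band coefficients."""
    if len(coeffs) % n_per_band != 0:
        raise ValueError("coefficient count must be a multiple of n_per_band")
    energies = []
    for start in range(0, len(coeffs), n_per_band):
        band = coeffs[start:start + n_per_band]
        # Spectral energy of one coefficient: real part squared plus imaginary part squared.
        total = sum(c.real ** 2 + c.imag ** 2 for c in band)
        energies.append(total / n_per_band)
    return energies


# Hypothetical frame: 70 identical coefficients 3+4j, each with spectral energy 25.
frame = [complex(3.0, 4.0)] * 70
pe = sub_spectrum_energies(frame)
```

With 70 coefficients and N = 5, `pe` holds 14 values, one per sub-spectrum.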
Specifically, after the first spectral energy corresponding to each sub-spectrum is obtained, the second filter gain corresponding to each sub-spectrum may be determined from it as follows. First, the frequency band corresponding to the initial spectrum is divided into a first sub-band and a second sub-band. Next, the first sub-band energy of the first sub-band is determined from the first spectral energies of all sub-spectrums corresponding to the first sub-band, and the second sub-band energy of the second sub-band is determined from the first spectral energies of all sub-spectrums corresponding to the second sub-band. The spectral tilt coefficient of the initial spectrum is then determined from the first sub-band energy and the second sub-band energy. Finally, the second filter gain corresponding to each sub-spectrum is determined from the spectral tilt coefficient and the first spectral energy corresponding to that sub-spectrum.
For the initial low-frequency spectrum (that is, when the initial spectrum is the initial low-frequency spectrum), the frequency band corresponding to the initial spectrum is the combination of the frequency bands corresponding to the individual initial low-frequency domain coefficients (for example, 70 of them). When this band is divided into the first sub-band and the second sub-band, the bands corresponding to the 1st to 35th initial low-frequency domain coefficients (j = 0 to 34) may together be taken as the first sub-band, and the bands corresponding to the 36th to 70th initial low-frequency domain coefficients (j = 35 to 69) may together be taken as the second sub-band. If N = 5, i.e., every 5 initial low-frequency domain coefficients form one sub-spectrum, the first sub-band and the second sub-band each comprise 7 sub-spectrums, so the first sub-band energy can be determined as the sum of the first spectral energies of the 7 sub-spectrums in the first sub-band, and the second sub-band energy as the sum of the first spectral energies of the 7 sub-spectrums in the second sub-band.
In particular, when the narrowband signal is the speech signal of the current speech frame, one possible way to determine the first spectral energy of each sub-spectrum is as follows. The first initial spectral energy Pe(k) corresponding to each sub-spectrum is determined according to formula (2) above. If the current speech frame is the first speech frame, the first spectral energy of each sub-spectrum, denoted Fe(k), may be determined based on the first initial spectral energy Pe(k) of that sub-spectrum. Optionally, for the first speech frame, the first initial spectral energy may be used directly as the first spectral energy, i.e., Fe(k) = Pe(k), or the first spectral energy may be determined from the first initial spectral energy and a preset calculation rule, for example according to formula (3) or formula (4) below. If the current speech frame is not the first speech frame, then to determine the first spectral energy of the k-th sub-spectrum, the first initial spectral energy of the sub-spectrum of the associated speech frame that corresponds to the k-th sub-spectrum is obtained and denoted $Pe_{pre}(k)$, where the associated speech frame is at least one speech frame (e.g., 1 or 2 frames) located before and adjacent to the current speech frame. The first spectral energy of the k-th sub-spectrum of the current speech frame may then be obtained from the first initial spectral energy of that sub-spectrum and the first initial spectral energy of the corresponding sub-spectrum of the associated speech frame.
In one example, the first spectral energy of the kth sub-spectrum may be determined according to the following equation (3):
$$Fe(k) = 1.0 + Pe(k) + Pe_{pre}(k) \tag{3}$$
where Pe(k) is the first initial spectral energy of the k-th sub-spectrum, $Pe_{pre}(k)$ is the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the k-th sub-spectrum, and Fe(k) is the first spectral energy of the k-th sub-spectrum.
It will be appreciated that the initial speech frame, i.e., the first speech frame, has no associated speech frame. In practical applications, an energy value may therefore be initialized and used as the first initial spectral energy of the associated speech frame of the first speech frame; that is, $Pe_{pre}(k)$ for the first speech frame may be the initialized first initial spectral energy (an initialized energy value).
It should be noted that the associated speech frame in formula (3) above is the single speech frame located before and adjacent to the current speech frame. When the associated speech frames are two or more speech frames preceding the current speech frame, formula (3) can be adjusted as needed. For example, when the associated speech frames are the two speech frames preceding the current speech frame, formula (3) can be adjusted to $Fe(k) = 1.0 + Pe(k) + Pe_{pre1}(k) + Pe_{pre2}(k)$, where $Pe_{pre1}(k)$ is the first initial spectral energy of the speech frame immediately preceding the current speech frame, and $Pe_{pre2}(k)$ is the first initial spectral energy of the speech frame immediately preceding that frame.
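The short-time combination of formula (3), including its extension to more than one associated speech frame, can be sketched as below. The helper name `short_time_energy` and the initial energy value of 0.0 used for the first speech frame are assumptions made for illustration; the text does not specify the initialized value.

```python
def short_time_energy(pe_current, pe_history):
    """Fe(k) = 1.0 + Pe(k) + sum of Pe_pre(k) over the associated speech frames.

    pe_current: list of Pe(k) for the current frame.
    pe_history: list of Pe lists, one per associated (preceding) frame.
    """
    fe = []
    for k, pe_k in enumerate(pe_current):
        fe.append(1.0 + pe_k + sum(prev[k] for prev in pe_history))
    return fe


INIT_ENERGY = 0.0  # assumed initial value; not specified in the text

pe_now = [4.0, 2.0]
# First speech frame: the associated frame is replaced by the initialized energy.
fe_first = short_time_energy(pe_now, [[INIT_ENERGY, INIT_ENERGY]])
# One associated frame, as in formula (3).
fe_one = short_time_energy(pe_now, [[1.0, 3.0]])
# Two associated frames, as in the adjusted form of formula (3).
fe_two = short_time_energy(pe_now, [[1.0, 3.0], [0.5, 0.5]])
```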
In an alternative embodiment of the present application, the associated speech frame is a speech frame preceding the current speech frame, and if the current speech frame is the first speech frame, determining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum includes:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the initialized first initial spectral energy;
determining a first spectral energy of the one sub-spectrum based on the second spectral energy of the one sub-spectrum and the initialized first spectral energy;
if the current speech frame is not the first speech frame, obtaining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the one sub-spectrum, including:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum;
the first spectral energy of the one sub-spectrum is determined from the second spectral energy of the one sub-spectrum and the first spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum.
In this alternative, the associated speech frame may specifically be the speech frame immediately preceding the current speech frame. After the first spectral energy of the k-th sub-spectrum is obtained according to the scheme corresponding to formula (3) above, it may be smoothed, and the smoothed value $Fe\_sm(k)$ may then be taken as the first spectral energy of the k-th sub-spectrum. To distinguish the two quantities, in this alternative the spectral energy determined from the first initial spectral energy by the scheme of formula (3) is referred to as the second spectral energy, and the spectral energy obtained by further smoothing is referred to as the first spectral energy. This alternative smooths the spectral energy of the current speech frame using the continuously accumulated smoothing results of the historical speech frames.
As an example, the second spectral energy (e.g., the first spectral energy determined using equation (3)) may be smoothed according to equation (4) as follows to obtain the first spectral energy:
$$Fe\_sm(k) = \frac{Fe(k) + Fe_{pre}\_sm(k)}{2} \tag{4}$$
where Fe(k) is the second spectral energy of the k-th sub-spectrum, $Fe_{pre}\_sm(k)$ is the first spectral energy of the sub-spectrum of the associated speech frame (the previous speech frame in this example) corresponding to the k-th sub-spectrum, and $Fe\_sm(k)$ is the smoothed spectral energy, i.e., the first spectral energy in this example.
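Because each smoothed value feeds into the next frame, formula (4) behaves as a recursive average over the frame history. The sketch below illustrates this under the assumption that the smoothed energy is initialized to 0.0 before the first frame; the function name `smooth_energy` and the input values are hypothetical.

```python
def smooth_energy(fe_current, fe_sm_previous):
    """Formula (4): average the current Fe(k) with the previous frame's smoothed value."""
    return [(fe + prev) / 2.0 for fe, prev in zip(fe_current, fe_sm_previous)]


fe_sm = [0.0, 0.0]  # assumed initial state before the first speech frame
for fe_frame in ([8.0, 4.0], [8.0, 4.0], [8.0, 4.0]):
    # The result of each frame becomes the "previous" value for the next frame.
    fe_sm = smooth_energy(fe_frame, fe_sm)
```

For a constant input the smoothed value moves halfway toward the input on every frame (here 4.0, 6.0, 7.0 for the first sub-spectrum), which is the accumulated long-time smoothing effect described above.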
It can be seen that, as alternative embodiments, the first spectral energy of a sub-spectrum, for example the k-th sub-spectrum, may be the first initial spectral energy of the sub-spectrum (i.e., Pe(k) above), the spectral energy obtained by short-time smoothing of the initial spectral energy (i.e., Fe(k) above), or the spectral energy obtained by long-time smoothing (i.e., $Fe\_sm(k)$ above).
It will be appreciated that in the following description of the first spectral energy, the first spectral energy may be in any of the three forms described above, and in some examples only one of the forms may be described, but it will be apparent to those skilled in the art that, as a different alternative, the first spectral energy in one of the forms may be replaced with the first spectral energy in another form.
Specifically, after the first spectral energy Fe(k) or $Fe\_sm(k)$ of each sub-spectrum is determined according to the above procedure, when the first spectral energy of each sub-spectrum is Fe(k), the first sub-band energy of the first sub-band and the second sub-band energy of the second sub-band may be determined according to the following formula (5):
$$e1 = \sum_{k=0}^{6} Fe(k), \qquad e2 = \sum_{k=7}^{13} Fe(k) \tag{5}$$
where e1 is the first sub-band energy of the first sub-band and e2 is the second sub-band energy of the second sub-band.
When the first spectral energy of each sub-spectrum is $Fe\_sm(k)$, the first sub-band energy of the first sub-band and the second sub-band energy of the second sub-band can be determined according to the following formula (6):
$$e1 = \sum_{k=0}^{6} Fe\_sm(k), \qquad e2 = \sum_{k=7}^{13} Fe\_sm(k) \tag{6}$$
where e1 is the first sub-band energy of the first sub-band and e2 is the second sub-band energy of the second sub-band.
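As a minimal sketch of the sub-band energy computation, assuming (as in the 70-coefficient example) that the 14 sub-spectrum energies are split into a lower and an upper half of 7 each; the function name `subband_energies` and the input values are hypothetical, and the input list may hold either Fe(k) or the smoothed values.

```python
def subband_energies(fe):
    """Sum the per-sub-spectrum energies of the lower and upper halves (e1, e2)."""
    half = len(fe) // 2
    e1 = sum(fe[:half])   # first sub-band: sub-spectrums 0..6 when len(fe) == 14
    e2 = sum(fe[half:])   # second sub-band: sub-spectrums 7..13 when len(fe) == 14
    return e1, e2


# Hypothetical energies: the low half carries twice the energy of the high half.
fe = [2.0] * 7 + [1.0] * 7
e1, e2 = subband_energies(fe)
```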
Specifically, after determining the first subband energy and the second subband energy, a spectral tilt coefficient of the initial spectrum may be determined according to the first subband energy and the second subband energy. In practical applications, as an alternative, the spectral tilt coefficients of the initial spectrum may be determined according to the following logic:
The initial spectral tilt coefficient is determined to be 0 when the second sub-band energy is greater than or equal to the first sub-band energy; when the second sub-band energy is less than the first sub-band energy, the initial spectral tilt coefficient may be determined according to the following expression:
$$T\_para\_0 = 8 \cdot f\_cont\_low \cdot \mathrm{SQRT}\left(\frac{e1 - e2}{e1 + e2}\right)$$
where $T\_para\_0$ is the initial spectral tilt coefficient, $f\_cont\_low$ is a preset filter coefficient (as an alternative, $f\_cont\_low = 0.035$), SQRT is the square-root operation, e1 is the first sub-band energy, and e2 is the second sub-band energy.
Specifically, after the initial spectral tilt coefficient $T\_para\_0$ is obtained in the above manner, it may be used directly as the spectral tilt coefficient of the initial spectrum, or it may be further optimized and the optimized value used as the spectral tilt coefficient of the initial spectrum. In one example, the optimization is expressed as:
$$T\_para\_1 = \min(1.0,\ T\_para\_0)$$
$$T\_para\_2 = \frac{T\_para\_1}{7}$$
where min denotes taking the minimum, $T\_para\_1$ is the initially optimized spectral tilt coefficient, and $T\_para\_2$ is the finally optimized spectral tilt coefficient, which can be used as the spectral tilt coefficient of the initial spectrum.
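The tilt logic above, including the optional clamping and scaling, can be sketched as a single function. The function name `spectral_tilt` is hypothetical; the default of 0.035 for `f_cont_low` is the optional value given in the text.

```python
import math

F_CONT_LOW = 0.035  # preset filter coefficient (optional value from the text)


def spectral_tilt(e1, e2, f_cont_low=F_CONT_LOW):
    """Return T_para_2 computed from the two sub-band energies."""
    if e2 >= e1:
        # No downward tilt: the initial coefficient is 0.
        t_para_0 = 0.0
    else:
        t_para_0 = 8 * f_cont_low * math.sqrt((e1 - e2) / (e1 + e2))
    t_para_1 = min(1.0, t_para_0)  # clamp to at most 1.0
    return t_para_1 / 7            # final optimized coefficient T_para_2


tilt = spectral_tilt(5.0, 3.0)  # sqrt(2/8) = 0.5, so T_para_0 = 0.14
```

For e1 = 5 and e2 = 3 the result is 0.14 / 7 = 0.02, up to floating-point rounding.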
Specifically, after determining the spectral tilt coefficient of the initial spectrum, the second filtering gain corresponding to each sub-spectrum may be determined according to the spectral tilt coefficient and the first spectral energy corresponding to each sub-spectrum. In one example, the second filter gain corresponding to the kth sub-spectrum may be determined according to the following equation (7-1) or equation (7-2):
$$gain_{f0}(k) = Fe(k)^{f\_cont\_low} \tag{7-1}$$
$$gain_{f0}(k) = Fe\_sm(k)^{f\_cont\_low} \tag{7-2}$$
where $gain_{f0}(k)$ is the second filter gain (initial filter gain) corresponding to the k-th sub-spectrum; Fe(k) and $Fe\_sm(k)$ are the first spectral energy of the k-th sub-spectrum; $f\_cont\_low$ is a preset filter coefficient, with $f\_cont\_low = 0.035$ as a possible choice; and k = 0, 1, …, 13 is the sub-band index denoting the 14 sub-bands. Formula (7-1) uses the spectral energy determined in the example of formula (3) as the first spectral energy of the sub-spectrum, whereas formula (7-2) uses the spectral energy determined in the example of formula (4).
After the second filter gain $gain_{f0}(k)$ corresponding to the k-th sub-spectrum is determined, if the spectral tilt coefficient of the initial spectrum is not positive, $gain_{f0}(k)$ may be used directly as the second filter gain corresponding to the k-th sub-spectrum; if the spectral tilt coefficient of the initial spectrum is positive, $gain_{f0}(k)$ may be adjusted according to the spectral tilt coefficient, and the adjusted value used as the second filter gain corresponding to the k-th sub-spectrum. In one example, $gain_{f0}(k)$ may be adjusted according to the following formula (8):
$$gain_{f1}(k) = gain_{f0}(k) \cdot (1 + k \cdot T_{para}) \tag{8}$$
where $gain_{f1}(k)$ is the adjusted second filter gain corresponding to the k-th sub-spectrum, $gain_{f0}(k)$ is the second filter gain before adjustment, i.e., the initial filter gain, $T_{para}$ is the spectral tilt coefficient, and k = 0, 1, …, 13 is the sub-band index denoting the 14 sub-bands.
Specifically, after the second filter gain $gain_{f1}(k)$ corresponding to the k-th sub-spectrum is determined, $gain_{f1}(k)$ may be further adjusted, and the adjusted value used as the final second filter gain corresponding to the k-th sub-spectrum. In one example, $gain_{f1}(k)$ may be adjusted according to the following formula (9):
$$gain_{pre\_filt}(k) = \frac{1 + gain_{f1}(k)}{2} \tag{9}$$
where $gain_{pre\_filt}(k)$ is the finally obtained second filter gain corresponding to the k-th sub-spectrum, $gain_{f1}(k)$ is the second filter gain adjusted according to formula (8), and k = 0, 1, …, 13 is the sub-band index denoting the 14 sub-bands, so that the filter gain (i.e., the second filter gain) corresponding to each of the 14 sub-bands is obtained.
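The chain from formula (7-1)/(7-2) through formula (9) can be sketched as below. The function name `pre_filter_gains` is hypothetical. One assumption is made explicit: formula (9) is applied whether or not the tilt adjustment of formula (8) was applied, which is equivalent to evaluating formula (8) with a tilt coefficient of 0 when the tilt is not positive.

```python
def pre_filter_gains(fe, t_para, f_cont_low=0.035):
    """Return gain_pre_filt(k) for each sub-spectrum energy in fe."""
    gains = []
    for k, fe_k in enumerate(fe):
        gain_f0 = fe_k ** f_cont_low               # formula (7-1) or (7-2)
        if t_para > 0:
            gain_f1 = gain_f0 * (1 + k * t_para)   # formula (8): tilt adjustment
        else:
            gain_f1 = gain_f0                      # non-positive tilt: keep gain_f0
        gains.append((1 + gain_f1) / 2)            # formula (9): average with unity gain
    return gains


# Hypothetical flat energies of 1.0, so gain_f0 is exactly 1.0 for every sub-spectrum.
gains = pre_filter_gains([1.0] * 14, t_para=0.02)
```

With these inputs the gain rises linearly with the sub-band index, from 1.0 at k = 0 to about 1.13 at k = 13.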
Specifically, the calculation of the first filter gain for the initial low-frequency domain coefficients has been described above by taking as an example the case in which every 5 initial low-frequency domain coefficients form one sub-band, that is, the 70 initial low-frequency domain coefficients are divided into 14 sub-bands of 5 coefficients each. The second filter gain obtained for each sub-band is the filter gain of the 5 initial low-frequency domain coefficients corresponding to that sub-band, so the first filter gain for the 70 initial low-frequency domain coefficients can be obtained from the second filter gains of the 14 sub-bands as $[gain_{pre\_filt}(0), gain_{pre\_filt}(1), …, gain_{pre\_filt}(13)]$. In other words, once the second filter gain $gain_{pre\_filt}(k)$ corresponding to each sub-spectrum is determined, the first filter gain is obtained: it comprises a first number (L, e.g., 14) of second filter gains $gain_{pre\_filt}(k)$, each of which is the filter gain of the N spectral coefficients corresponding to the k-th sub-spectrum.
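To connect the per-sub-band gains back to the per-coefficient filtering of equation (1), each sub-band gain can be repeated for the N coefficients of its sub-band and then multiplied coefficient by coefficient. The following is a sketch under that reading; the names `expand_gains` and `apply_pre_filter` and the sample values are hypothetical.

```python
def expand_gains(band_gains, n_per_band=5):
    """Repeat each sub-band gain for the n_per_band coefficients of that sub-band."""
    return [g for g in band_gains for _ in range(n_per_band)]


def apply_pre_filter(coeffs, band_gains, n_per_band=5):
    """Equation (1): multiply every coefficient by the gain of its sub-band."""
    per_coeff = expand_gains(band_gains, n_per_band)
    if len(per_coeff) != len(coeffs):
        raise ValueError("gain count times n_per_band must equal coefficient count")
    return [g * c for g, c in zip(per_coeff, coeffs)]


# Hypothetical gains rising from 1.00 to 1.13 and 70 identical coefficients.
band_gains = [1.0 + 0.01 * k for k in range(14)]
coeffs = [complex(1.0, -1.0)] * 70
filtered = apply_pre_filter(coeffs, band_gains)
```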
In an alternative of the embodiment of the present application, the correlation parameters include a high-frequency spectral envelope and relative flatness information. The neural network model comprises at least an input layer and an output layer. The input layer receives the feature vector of the low-frequency spectrum. The output layer comprises at least a unidirectional long short-term memory (LSTM) layer and two fully connected network layers, each connected to the LSTM layer and each comprising at least one fully connected layer. The LSTM layer transforms the feature vector processed by the input layer; one fully connected network layer performs a first classification process on the vector values output by the LSTM layer and outputs the high-frequency spectral envelope, and the other fully connected network layer performs a second classification process on the vector values output by the LSTM layer and outputs the relative flatness information.
Specifically, when the time-frequency transform is a Fourier transform (such as the STFT), after the initial spectrum is filtered to obtain the low-frequency spectrum, the low-frequency amplitude spectrum of the narrowband signal may be obtained from the low-frequency spectrum, and the low-frequency spectral envelope of the narrowband signal may then be determined from the low-frequency amplitude spectrum; that is, the low-frequency spectral envelope of the narrowband signal is determined based on the low-frequency spectrum. When the time-frequency transform is a discrete cosine transform (such as the MDCT), after the initial spectrum is filtered to obtain the low-frequency spectrum, the low-frequency spectral envelope of the narrowband signal may be obtained directly from the low-frequency spectrum. After the low-frequency spectral envelope of the narrowband signal is determined, it may be used as an input of the neural network model; that is, the input of the neural network model further includes the low-frequency spectral envelope.
Specifically, to enrich the data input to the neural network model, parameters related to the spectrum of the low-frequency part may also be selected as model inputs. Since the low-frequency spectral envelope of the narrowband signal is information related to the spectrum of the signal, it may be input to the neural network model so that more accurate correlation parameters can be obtained. When the time-frequency transform is the MDCT, the low-frequency spectral envelope and the low-frequency spectrum are input to the neural network model to obtain the correlation parameters; when the time-frequency transform is the STFT, the low-frequency spectral envelope and the low-frequency amplitude spectrum are input to the neural network model to obtain the correlation parameters.
When the time-frequency transform is a Fourier transform (such as the STFT), after the low-frequency spectrum is obtained, the low-frequency amplitude spectrum of the narrowband signal may be determined based on the low-frequency spectrum. Specifically, the low-frequency amplitude spectrum may be calculated by the following formula (10):
$$P_{Low}(i,j) = \mathrm{SQRT}\left(\mathrm{Real}(S_{Low\_rev}(i,j))^2 + \mathrm{Imag}(S_{Low\_rev}(i,j))^2\right) \tag{10}$$
where $P_{Low}(i,j)$ is the low-frequency amplitude spectrum, $S_{Low\_rev}(i,j)$ is the low-frequency spectrum, Real and Imag denote the real part and the imaginary part of the low-frequency spectrum, respectively, and SQRT is the square-root operation. If the narrowband signal has a sampling rate of 8000 Hz and an effective bandwidth of 0 to 3500 Hz, 70 spectral coefficients of the low-frequency amplitude spectrum (low-frequency amplitude spectral coefficients) $P_{Low}(i,j)$, j = 0, 1, …, 69, can be determined from the low-frequency domain coefficients based on the sampling rate and the frame length of the narrowband signal. In practical applications, the 70 calculated low-frequency amplitude spectral coefficients may be used directly as the low-frequency amplitude spectrum of the narrowband signal. Alternatively, for convenience of calculation, the low-frequency amplitude spectrum may be converted to the logarithmic domain, i.e., a logarithm is taken of the amplitude spectrum calculated by formula (10), and the result is used as the low-frequency amplitude spectrum in subsequent processing.
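Formula (10) and the optional conversion to the logarithmic domain can be sketched as follows. The function name `amplitude_spectrum` is hypothetical. Two details are assumptions not stated in the text: the natural logarithm is used, and a small offset `eps` is added before taking the logarithm so that a zero amplitude does not cause an error.

```python
import math


def amplitude_spectrum(coeffs, log_domain=False, eps=1e-12):
    """Return P_Low(i, j) for one frame of complex low-frequency coefficients."""
    # Formula (10): square root of real part squared plus imaginary part squared.
    amps = [math.sqrt(c.real ** 2 + c.imag ** 2) for c in coeffs]
    if log_domain:
        # Assumed: natural logarithm with a small offset to avoid log(0).
        amps = [math.log(a + eps) for a in amps]
    return amps


linear = amplitude_spectrum([complex(3.0, 4.0), complex(0.0, 0.0)])
logged = amplitude_spectrum([complex(1.0, 0.0)], log_domain=True)
```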
Wherein, after obtaining the low frequency amplitude spectrum comprising 70 coefficients according to formula (10), the low frequency spectrum envelope of the narrowband signal can be determined based on the low frequency amplitude spectrum.
In an alternative of the embodiment of the present application, the method may further include:
dividing the low frequency amplitude spectrum into a fourth number of sub-amplitude spectrums;
and respectively determining the sub-spectrum envelopes corresponding to each sub-amplitude spectrum, wherein the low-frequency spectrum envelopes comprise the determined fourth number of sub-spectrum envelopes.
Specifically, one implementation of dividing the spectral coefficients of the low-frequency amplitude spectrum into a fourth number (denoted as M) of sub-amplitude spectrums is to perform band division on the narrowband signal to obtain M sub-amplitude spectrums, where each sub-band may correspond to the same or a different number of spectral coefficients, and the total number of spectral coefficients corresponding to all sub-bands equals the number of spectral coefficients of the low-frequency amplitude spectrum.
After the division into M sub-amplitude spectrums, the sub-spectral envelope corresponding to each sub-amplitude spectrum may be determined from that sub-amplitude spectrum. In one implementation, the sub-spectral envelope of each sub-band, i.e., the sub-spectral envelope corresponding to each sub-amplitude spectrum, is determined from the spectral coefficients that belong to the sub-amplitude spectrum. The M sub-amplitude spectrums thus yield M sub-spectral envelopes, and the low-frequency spectral envelope includes these M sub-spectral envelopes.
As an example, for the 70 spectral coefficients of the low-frequency amplitude spectrum described above (either the coefficients calculated by formula (10) or those coefficients after conversion to the logarithmic domain), if each sub-band contains the same number of spectral coefficients, for example 5 (N = 5), the frequency band corresponding to every 5 spectral coefficients may be treated as one sub-band, giving 14 (M = 14) sub-bands of 5 spectral coefficients each. After the division into 14 sub-amplitude spectrums, 14 sub-spectral envelopes can be determined from them correspondingly.
Wherein determining the sub-spectrum envelope corresponding to each sub-magnitude spectrum may include:
obtaining the sub-spectral envelope corresponding to each sub-amplitude spectrum based on the logarithmic values of the spectral coefficients included in that sub-amplitude spectrum.
Specifically, based on the spectral coefficient of each sub-amplitude spectrum, the sub-spectrum envelope corresponding to each sub-amplitude spectrum is determined by formula (11).
Wherein, formula (11) is:
where $e_{Low}(i,k)$ denotes a sub-spectral envelope, i is the frame index of the speech frame, and k is the sub-band index; there are M sub-bands in total, k = 0, 1, 2, …, M − 1, so the low-frequency spectral envelope includes M sub-spectral envelopes.
In general, the spectral envelope of a sub-band is defined as the average energy of adjacent coefficients (or that average further converted to a logarithmic representation), but with this definition coefficients of small amplitude may fail to play a substantive role.
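Formula (11) itself did not survive in this text. The sketch below therefore only illustrates one reading that is consistent with the statement that the sub-spectral envelope is obtained from the logarithmic values of the spectral coefficients: the envelope of a sub-band is assumed to be the mean of the logarithms of its amplitude coefficients. The function name `sub_spectral_envelopes`, the natural logarithm, and the requirement that amplitudes be strictly positive are all assumptions.

```python
import math


def sub_spectral_envelopes(amps, n_per_band=5, already_log=False):
    """Assumed envelope: mean of the log amplitudes within each sub-band."""
    envelopes = []
    for start in range(0, len(amps), n_per_band):
        band = amps[start:start + n_per_band]
        # If the amplitude spectrum is already in the log domain, average it directly.
        logs = band if already_log else [math.log(a) for a in band]
        envelopes.append(sum(logs) / len(logs))
    return envelopes


# Hypothetical amplitude spectrum: 70 coefficients equal to e, so each log value is 1.
env = sub_spectral_envelopes([math.e] * 70)
```

Averaging logarithms rather than energies lets small-amplitude coefficients contribute on the same footing as large ones, which addresses the concern raised above about averaging energies.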
Therefore, if the low-frequency amplitude spectrum and the low-frequency spectrum envelope are used as the input of the neural network model, the low-frequency amplitude spectrum is 70-dimensional data, the low-frequency spectrum envelope is 14-dimensional data, and the input of the model is 84-dimensional data, so that the neural network model in the scheme is small in size and low in complexity.
In another case, when the time-frequency transform is a discrete cosine transform (such as the MDCT), after the low-frequency spectrum is obtained, the low-frequency spectral envelope of the narrowband signal may be determined based on the low-frequency spectrum. Specifically, the narrowband signal is divided into bands: for 70 low-frequency domain coefficients, the frequency band corresponding to every 5 adjacent low-frequency domain coefficients may be treated as one sub-band, giving 14 sub-bands of 5 low-frequency domain coefficients each. For each sub-band, the low-frequency spectral envelope of the sub-band is defined as the average energy of the adjacent low-frequency domain coefficients, which can be calculated by formula (12):
where $e_{Low}(i,k)$ denotes a sub-spectral envelope (the low-frequency spectral envelope of each sub-band), $S_{Low\_rev}(i,j)$ is the low-frequency spectrum, and k is the sub-band index; there are 14 sub-bands in total, k = 0, 1, 2, …, 13, so the low-frequency spectral envelope includes 14 sub-spectral envelopes.
Thus, the 70-dimensional low-frequency domain coefficients $S_{Low\_rev}(i,j)$ and the 14-dimensional low-frequency spectral envelope $e_{Low}(i,k)$ can be used as the input of the neural network model, i.e., the input of the neural network model is 84-dimensional data.
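The assembly of the 84-dimensional input in the MDCT case can be sketched as below. Formula (12) is not reproduced in this text, so the envelope here is assumed to be the plain mean of the squared (real-valued) coefficients of each sub-band, without any logarithmic conversion; the function name `mdct_features` and the sample frame are hypothetical.

```python
def mdct_features(coeffs, n_per_band=5):
    """Concatenate the coefficients with their assumed per-sub-band average energies."""
    envelopes = []
    for start in range(0, len(coeffs), n_per_band):
        band = coeffs[start:start + n_per_band]
        # Assumed envelope: average energy (mean square) of the adjacent coefficients.
        envelopes.append(sum(c * c for c in band) / len(band))
    return list(coeffs) + envelopes


# Hypothetical frame of 70 real MDCT coefficients.
features = mdct_features([2.0] * 70)
```

The resulting list has 70 coefficient entries followed by 14 envelope entries, matching the 84 dimensions stated above.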
In an alternative solution of the embodiment of the present application, if the time-frequency transformation is fourier transformation, in the process of obtaining the target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum, the method may include:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency amplitude spectrum based on the low frequency amplitude spectrum;
adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain a target high-frequency amplitude spectrum;
generating a corresponding high-frequency phase spectrum based on the low-frequency phase spectrum of the narrowband signal;
obtaining a target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
if the time-frequency transformation is discrete cosine transformation, the process of obtaining the target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum may include:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency spectrum based on the low frequency spectrum;
and adjusting the initial high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain a target high-frequency spectrum.
Specifically, when the time-frequency transform is a fourier transform, the above manner of generating the corresponding high-frequency phase spectrum based on the low-frequency phase spectrum of the narrowband signal may include, but is not limited to, any of the following:
First: the corresponding high-frequency phase spectrum is obtained by copying the low-frequency phase spectrum.
Second: the low-frequency phase spectrum is flipped to obtain a phase spectrum identical to the low-frequency phase spectrum, and the two low-frequency phase spectrums are mapped to the corresponding high-frequency bins to obtain the corresponding high-frequency phase spectrum.
Specifically, when the time-frequency transform is a Fourier transform, the initial high-frequency amplitude spectrum may be generated from the low-frequency amplitude spectrum by copying the low-frequency amplitude spectrum. It will be appreciated that, in practical applications, the way the low-frequency amplitude spectrum is copied depends on the bandwidth of the wideband signal to be obtained and on the bandwidth of the portion of the low-frequency amplitude spectrum selected for copying. For example, assume the bandwidth of the wideband signal is 2 times that of the narrowband signal. If the entire low-frequency amplitude spectrum of the narrowband signal is selected for copying, only one copy is needed. If only part of the low-frequency amplitude spectrum is selected, the number of copies depends on the bandwidth of the selected part: if 1/2 of the low-frequency amplitude spectrum is selected, it must be copied 2 times, and if 1/4 is selected, it must be copied 4 times.
As an example, for example, the bandwidth of the extended wideband signal is 7kHz, and the bandwidth corresponding to the low-frequency amplitude spectrum selected for copying is 1.75kHz, then based on the bandwidth corresponding to the low-frequency amplitude spectrum and the bandwidth of the extended wideband signal, the bandwidth corresponding to the low-frequency amplitude spectrum may be copied 3 times, to obtain the bandwidth (5.25 kHz) corresponding to the initial high-frequency amplitude spectrum. If the bandwidth corresponding to the low-frequency amplitude spectrum selected for copying is 3.5kHz and the bandwidth of the expanded broadband signal is 7kHz, the bandwidth corresponding to the low-frequency amplitude spectrum is copied for 1 time, and the bandwidth (3.5 kHz) corresponding to the initial high-frequency amplitude spectrum can be obtained.
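The copy counts in the two examples above can be checked with a short sketch. It assumes that the band to be filled is the wideband bandwidth minus the narrowband bandwidth, and that in the 1.75 kHz example the narrowband signal itself is 1.75 kHz wide (the text does not state this explicitly). The function name copies_needed and its parameters are hypothetical.

```python
from fractions import Fraction


def copies_needed(wideband_hz, narrowband_hz, template_hz):
    """Return how many times the selected band must be copied to fill the high band.

    Assumption (not stated in the text): the high band to fill spans
    wideband_hz - narrowband_hz, and the selected band must tile it exactly.
    """
    high_band_hz = Fraction(wideband_hz) - Fraction(narrowband_hz)
    copies = high_band_hz / Fraction(template_hz)
    if copies.denominator != 1:
        raise ValueError("selected band does not tile the high band evenly")
    return int(copies)
```

Under these assumptions, copies_needed(7000, 1750, 1750) gives 3 and copies_needed(7000, 3500, 3500) gives 1, matching the two cases described above.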
Specifically, when the time-frequency transform is a discrete cosine transform, the low-frequency spectrum may be copied to obtain an initial high-frequency spectrum in the process of generating the initial high-frequency spectrum based on the low-frequency spectrum. The process of copying the low-frequency spectrum is similar to the process of copying the low-frequency amplitude spectrum to obtain the initial high-frequency amplitude spectrum under the fourier transform, and will not be repeated here.
In the process of generating the initial high-frequency spectrum, when the time-frequency transform is a discrete sine transform, a wavelet transform, or the like, reference may be made to the above-described process of generating the initial high-frequency spectrum by fourier transform as needed; of course, in the process of generating the initial high frequency spectrum, reference may be made to the above-mentioned process of generating the initial high frequency spectrum of discrete cosine transform as required, which is not described herein.
In an alternative embodiment of the present application, one implementation of generating the initial high frequency amplitude spectrum based on the low frequency amplitude spectrum may be: copying the amplitude spectrum of the high-frequency band part in the low-frequency amplitude spectrum to obtain an initial high-frequency amplitude spectrum; based on the low frequency spectrum, one implementation of generating the initial high frequency spectrum may be: and copying the frequency spectrum of the high-frequency part in the low-frequency spectrum to obtain an initial high-frequency spectrum.
Specifically, when the time-frequency transform is a Fourier transform, the low-frequency band portion of the low-frequency amplitude spectrum contains a large number of harmonics, which affect the signal quality of the extended wideband signal. The amplitude spectrum of the high-frequency band portion of the low-frequency amplitude spectrum can therefore be selected for copying to obtain the initial high-frequency amplitude spectrum.
As an example, continuing with the foregoing scenario, the low-frequency amplitude spectrum corresponds to 70 frequency points in total. If frequency points 35 to 69 of the low-frequency amplitude spectrum (the amplitude spectrum of the high-frequency band portion of the low-frequency amplitude spectrum) are selected as the frequency points to be copied, that is, as the "master", and the bandwidth of the extended wideband signal is 7000 Hz, the selected frequency points must be copied to obtain an initial high-frequency amplitude spectrum containing 70 frequency points. To this end, frequency points 35 to 69 of the low-frequency amplitude spectrum, 35 frequency points in total, can be copied twice to generate the initial high-frequency amplitude spectrum. Similarly, if frequency points 0 to 69 of the low-frequency amplitude spectrum are selected as the frequency points to be copied and the bandwidth of the extended wideband signal is 7000 Hz, frequency points 0 to 69, 70 frequency points in total, can be copied once to generate the initial high-frequency amplitude spectrum, which contains 70 frequency points.
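The copying in this example can be sketched as follows. This is a minimal illustration, assuming the amplitude spectrum is held as a plain list indexed by frequency point; the function name build_initial_high_band and its parameters are hypothetical and do not appear in the application.

```python
def build_initial_high_band(low_mag, first, last, n_high):
    """Tile low_mag[first..last] (inclusive) until n_high frequency points exist.

    Returns the tiled spectrum together with the number of copies used.
    """
    template = list(low_mag[first:last + 1])
    if not template:
        raise ValueError("empty template band")
    copies, remainder = divmod(n_high, len(template))
    if remainder:
        raise ValueError("template does not tile the high band evenly")
    # List repetition places the copies back to back, lowest band first.
    return template * copies, copies


# Stand-in amplitude spectrum with 70 frequency points (values are arbitrary).
low_mag = [float(j) for j in range(70)]
high_a, copies_a = build_initial_high_band(low_mag, 35, 69, 70)  # two copies
high_b, copies_b = build_initial_high_band(low_mag, 0, 69, 70)   # one copy
```

Selecting frequency points 35 to 69 requires two copies, and selecting 0 to 69 requires one, in line with the counts given above.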
Since the signal corresponding to the low-frequency amplitude spectrum may contain a large number of harmonics, the signal corresponding to the copied initial high-frequency amplitude spectrum also contains a large number of harmonics. To reduce the harmonics in the wideband signal after band expansion, the initial high-frequency amplitude spectrum may be adjusted by the difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope, and the adjusted initial high-frequency amplitude spectrum is used as the target high-frequency amplitude spectrum. This reduces the harmonics in the wideband signal finally obtained after band expansion.
Specifically, when the time-frequency transform is discrete cosine transform, the low-frequency part of the low-frequency spectrum contains a large number of harmonics, which affects the signal quality of the extended wideband signal, so that the spectrum of the high-frequency part in the low-frequency spectrum can be selected for replication to obtain an initial high-frequency spectrum, which is similar to the process of replicating the amplitude spectrum of the high-frequency part in the low-frequency amplitude spectrum under the condition of fourier transform to obtain the initial high-frequency amplitude spectrum, and is not repeated here.
In the process of generating the initial high-frequency spectrum, when the time-frequency transform is a discrete sine transform, a wavelet transform, or the like, reference may be made to the above-described process of generating the initial high-frequency spectrum by fourier transform as needed; of course, in the process of generating the initial high frequency spectrum, reference may be made to the above-mentioned process of generating the initial high frequency spectrum of discrete cosine transform as required, which is not described herein.
In an alternative scheme of the embodiment of the application, the high-frequency spectrum envelope and the low-frequency spectrum envelope are both spectrum envelopes in a logarithmic domain;
adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain a target high-frequency amplitude spectrum may include:
determining a first difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
adjusting the initial high-frequency amplitude spectrum based on the first difference value to obtain a target high-frequency amplitude spectrum;
adjusting the initial high frequency spectrum based on the high frequency spectrum envelope and the low frequency spectrum envelope, comprising:
determining a second difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
and adjusting the initial high-frequency spectrum based on the second difference value to obtain a target high-frequency spectrum.
Specifically, for ease of calculation, the high-frequency spectrum envelope and the low-frequency spectrum envelope can both be represented as spectrum envelopes in the logarithmic domain. When the time-frequency transform is a Fourier transform, the initial high-frequency amplitude spectrum can be adjusted based on the first difference determined from the logarithmic-domain spectrum envelopes to obtain the target high-frequency amplitude spectrum; when the time-frequency transform is a discrete cosine transform, the initial high-frequency spectrum can be adjusted based on the second difference determined from the logarithmic-domain spectrum envelopes to obtain the target high-frequency spectrum.
It should be noted that when the time-frequency transform is a discrete sine transform, a wavelet transform, or the like, the target high-frequency spectrum may be determined, as needed, by referring to the above process of generating the target high-frequency amplitude spectrum under the Fourier transform or to the above process of generating the target high-frequency spectrum under the discrete cosine transform, which is not described again here.
In an alternative of the embodiment of the present application, if the low frequency spectrum is obtained by fourier transform, the high frequency spectrum envelope includes a second number of first sub-spectrum envelopes, and the initial high frequency amplitude spectrum includes a second number of first sub-amplitude spectrums, where each first sub-spectrum envelope is determined based on a corresponding first sub-amplitude spectrum in the initial high frequency amplitude spectrum. If the low frequency spectrum is obtained by discrete cosine transformation, the high frequency spectrum envelope comprises a third number of second sub-spectrum envelopes, and the initial high frequency spectrum comprises a third number of first sub-spectrums, wherein each second sub-spectrum envelope is determined based on the corresponding first sub-spectrum in the initial high frequency spectrum.
Specifically, (1) when the time-frequency transform is a Fourier transform, each sub-spectrum envelope is determined based on the corresponding sub-amplitude spectrum of the corresponding amplitude spectrum, so one first sub-spectrum envelope may be determined based on the corresponding sub-amplitude spectrum in the initial high-frequency amplitude spectrum. The number of spectral coefficients in each sub-amplitude spectrum may be the same or different; accordingly, the sub-amplitude spectra on which the individual first sub-spectrum envelopes are based may contain different numbers of spectral coefficients. (2) When the time-frequency transform is a discrete cosine transform, each sub-spectrum envelope is determined based on the corresponding sub-spectrum of the corresponding spectrum, so one second sub-spectrum envelope may be determined based on the corresponding first sub-spectrum in the initial high-frequency spectrum.
Note that, when the time-frequency transform is a discrete sine transform, a wavelet transform, or the like, the sub-spectrum envelope may be obtained by referring to the above-mentioned determination method of the sub-spectrum envelope of the fourier transform as needed, or of course, the sub-spectrum envelope may be obtained by referring to the above-mentioned determination method of the sub-spectrum envelope of the discrete cosine transform as needed, which is not described herein.
Continuing with the foregoing scenario as an example, if the time-frequency transform is a Fourier transform, the output of the neural network model is a 14-dimensional high-frequency spectrum envelope (the second number is 14), and the input of the neural network model includes the low-frequency amplitude spectrum and the low-frequency spectrum envelope. The low-frequency amplitude spectrum contains 70-dimensional low-frequency frequency-domain coefficients and the low-frequency spectrum envelope contains 14 sub-spectrum envelopes, so the input of the neural network model is 84-dimensional data. Because the output dimension is far smaller than the input dimension, the size and depth of the neural network model can be reduced, which lowers the complexity of the model. If the time-frequency transform is a discrete cosine transform, the input and output of the neural network model are similar to those under the Fourier transform and are not described again here.
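The 84-dimensional input described above can be assembled as in the following sketch. The envelope definition used here, the mean natural logarithm of the amplitudes in each group of 5 adjacent frequency points, is an assumption made for illustration; this passage only states that the envelopes are in the logarithmic domain. The names sub_band_envelopes and model_input are hypothetical.

```python
import math


def sub_band_envelopes(mag, n_sub=14, eps=1e-12):
    """Split mag into n_sub equal sub-bands and return one log-domain value each.

    Assumption: the envelope of a sub-band is the mean natural log of its
    amplitudes; eps only guards against log(0).
    """
    if len(mag) % n_sub:
        raise ValueError("spectrum length must be a multiple of n_sub")
    width = len(mag) // n_sub
    return [
        sum(math.log(m + eps) for m in mag[k * width:(k + 1) * width]) / width
        for k in range(n_sub)
    ]


def model_input(low_mag):
    """Concatenate 70 amplitudes and 14 envelope values into an 84-dim vector."""
    return list(low_mag) + sub_band_envelopes(low_mag)
```

With 70 frequency points and 14 sub-bands, each envelope value summarizes 5 adjacent frequency points, and the concatenated vector has 70 + 14 = 84 entries.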
Further, if the time-frequency transform is a Fourier transform, determining the first difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency amplitude spectrum based on the first difference to obtain the target high-frequency amplitude spectrum may include:
determining a first difference between each first sub-spectral envelope and the corresponding spectral envelope in the low-frequency spectral envelope (hereinafter, the corresponding spectral envelope in the low-frequency spectral envelope is denoted as a third sub-spectral envelope);
adjusting the corresponding first sub-amplitude spectrum based on the first difference corresponding to each first sub-spectral envelope, to obtain a second number of adjusted first sub-amplitude spectra;
and obtaining the target high-frequency amplitude spectrum based on the second number of adjusted first sub-amplitude spectra.
Further, if the time-frequency transform is a discrete cosine transform, determining the second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency spectrum based on the second difference to obtain the target high-frequency spectrum may include:
determining a second difference between each second sub-spectral envelope and the corresponding spectral envelope in the low-frequency spectral envelope (hereinafter, the corresponding spectral envelope in the low-frequency spectral envelope is denoted as a fourth sub-spectral envelope);
adjusting the corresponding first sub-spectrum based on the second difference corresponding to each second sub-spectral envelope, to obtain a third number of adjusted first sub-spectra;
and obtaining the target high-frequency spectrum based on the third number of adjusted first sub-spectra.
Specifically, when the time-frequency transform is a Fourier transform, the high-frequency spectral envelope obtained by the neural network model may comprise a second number of first sub-spectral envelopes. As can be seen from the foregoing description, each of these first sub-spectral envelopes is determined based on a corresponding sub-amplitude spectrum in the low-frequency amplitude spectrum, that is, one sub-spectral envelope is determined based on one corresponding sub-amplitude spectrum in the low-frequency amplitude spectrum. Continuing with the foregoing scenario as an example, if the low-frequency amplitude spectrum contains 14 sub-amplitude spectra, the high-frequency spectral envelope includes 14 sub-spectral envelopes.
The first difference between the high-frequency spectral envelope and the low-frequency spectral envelope consists of the difference between each first sub-spectral envelope and the corresponding third sub-spectral envelope. Adjusting the initial high-frequency amplitude spectrum based on the first difference means adjusting the corresponding first sub-amplitude spectrum based on the first difference between each first sub-spectral envelope and the corresponding third sub-spectral envelope. Continuing with the foregoing scenario as an example, if the high-frequency spectral envelope includes 14 first sub-spectral envelopes and the low-frequency spectral envelope includes 14 third sub-spectral envelopes, 14 first differences may be determined from the 14 third sub-spectral envelopes and the corresponding 14 first sub-spectral envelopes, and the first sub-amplitude spectrum of each corresponding sub-band may be adjusted based on these 14 first differences.
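One way to read the per-sub-band adjustment is sketched below. It assumes that the envelopes are natural-log values and that a log-domain difference is applied to the amplitudes as a multiplicative gain exp(difference); the passage does not give the exact formula, so this is only an illustration. The names adjust_high_band, e_high and e_low_ref are hypothetical, with e_low_ref[k] standing for the low-frequency sub-envelope paired with sub-band k.

```python
import math


def adjust_high_band(init_high, e_high, e_low_ref):
    """Scale each sub-band of init_high by the gain implied by a log-domain difference.

    Assumption: a difference d_k = e_high[k] - e_low_ref[k] in the natural-log
    domain corresponds to a linear gain of exp(d_k) on the amplitudes.
    """
    n_sub = len(e_high)
    if n_sub == 0 or len(e_low_ref) != n_sub or len(init_high) % n_sub:
        raise ValueError("inconsistent sub-band layout")
    width = len(init_high) // n_sub
    adjusted = []
    for k in range(n_sub):
        gain = math.exp(e_high[k] - e_low_ref[k])
        adjusted.extend(x * gain for x in init_high[k * width:(k + 1) * width])
    return adjusted
```

With 70 frequency points and 14 differences, each gain acts on 5 adjacent frequency points; when a predicted envelope equals its low-frequency counterpart, the gain is 1 and that sub-band is left unchanged.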
Specifically, when the time-frequency transform is a discrete cosine transform, the high-frequency spectrum envelope obtained by the neural network model may include a third number of second sub-spectrum envelopes, and the second difference between the high-frequency spectrum envelope and the low-frequency spectrum envelope consists of the difference between each second sub-spectrum envelope and the corresponding fourth sub-spectrum envelope. The process of adjusting the initial high-frequency spectrum based on the second difference is similar to the process of adjusting the initial high-frequency amplitude spectrum based on the first difference when the time-frequency transform is a Fourier transform, and is not described again here.
It should be noted that, when the time-frequency transformation is a discrete sine transformation, a wavelet transformation, or the like, the corresponding high-frequency spectrum envelope may be adjusted according to the need by referring to the adjustment process of the high-frequency spectrum envelope of the fourier transformation, and of course, the corresponding high-frequency spectrum envelope may also be adjusted according to the need by referring to the adjustment process of the high-frequency spectrum envelope of the discrete cosine transformation, which is not described herein.
In an alternative of the embodiment of the present application, the correlation parameter further includes relative flatness information, the relative flatness information characterizing a correlation of a spectral flatness of a high frequency part and a spectral flatness of a low frequency part of the target broadband spectrum;
Determining the first difference or the second difference of the high frequency spectral envelope and the low frequency spectral envelope may comprise:
determining a gain adjustment value of the high frequency spectrum envelope based on the relative flatness information and the energy information of the low frequency spectrum;
adjusting the high-frequency spectrum envelope based on the gain adjustment value to obtain an adjusted high-frequency spectrum envelope;
a first difference or a second difference of the adjusted high frequency spectral envelope and the low frequency spectral envelope is determined.
Specifically, based on the foregoing description, in the training process of the neural network model the labeling result may include relative flatness information. That is, the sample label of the sample data includes relative flatness information of the high-frequency part and the low-frequency part of the sample broadband signal, and this relative flatness information is determined based on the high-frequency part and the low-frequency part of the spectrum of the sample broadband signal. Therefore, when the neural network model is applied and the input of the model is the low-frequency spectrum of the narrowband signal, the relative flatness information of the high-frequency part and the low-frequency part of the target broadband spectrum can be predicted based on the output of the neural network model.
The relative flatness information reflects the relative spectral flatness of the high-frequency part and the low-frequency part of the target broadband spectrum, that is, whether the spectrum of the high-frequency part is flat relative to that of the low-frequency part. If the correlation parameter further includes the relative flatness information, the high-frequency spectrum envelope may first be adjusted based on the relative flatness information and the energy information of the low-frequency spectrum, and the initial high-frequency spectrum may then be adjusted based on the difference between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, so that the finally obtained broadband signal contains fewer harmonics. The energy information of the low-frequency spectrum may be determined based on the spectral coefficients of the low-frequency amplitude spectrum, and may represent the spectral flatness.
In an alternative embodiment of the present application, the correlation parameter may include a high-frequency spectrum envelope and relative flatness information, and the neural network model includes at least an input layer and an output layer. The input layer receives a feature vector of the low-frequency spectrum parameter (the feature vector includes the 70-dimensional low-frequency amplitude spectrum and the 14-dimensional low-frequency spectrum envelope). The output layer includes at least a unidirectional Long Short-Term Memory (LSTM) layer and two fully connected network layers, each connected to the LSTM layer, and each fully connected network layer may include at least one fully connected layer. The LSTM layer converts the feature vector processed by the input layer; one fully connected network layer performs a first classification process on the vector values converted by the LSTM layer and outputs the high-frequency spectrum envelope (14-dimensional), and the other fully connected network layer performs a second classification process on the vector values converted by the LSTM layer and outputs the relative flatness information (4-dimensional).
As an example, Fig. 2 shows a schematic structural diagram of a neural network model provided by an embodiment of the present application. As shown in the figure, the neural network model may mainly include two parts: a unidirectional LSTM layer and two fully connected layers, that is, each fully connected network layer in this example includes one fully connected layer. The output of one fully connected layer is the high-frequency spectrum envelope, and the output of the other fully connected layer is the relative flatness information.
The LSTM layer is a recurrent neural network whose input is the feature vector of the low-frequency spectrum parameter (simply referred to as the input vector). The LSTM processes the input vector to obtain a hidden vector of a certain dimension, and the hidden vector is used as the input of each of the two fully connected layers, which perform classification prediction separately. One fully connected layer predicts and outputs a 14-dimensional column vector, which corresponds to the high-frequency spectrum envelope. The other fully connected layer predicts and outputs a 4-dimensional column vector whose 4 values are the 4 probability values described above, which respectively represent the probabilities that the relative flatness information is each of the 4 arrays.
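The data flow through this structure, an 84-dimensional input passing through one LSTM layer into two fully connected heads, can be sketched in plain Python as below. This is only a shape-level illustration with random, untrained weights: the class name TwoHeadLSTM, the hidden size of 16, the omission of bias terms, and the softmax on the 4-dimensional head (chosen because the text describes its outputs as probability values) are all assumptions, not details taken from the application.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def _rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TwoHeadLSTM:
    """One unidirectional LSTM cell followed by two linear heads (biases omitted)."""

    def __init__(self, n_in=84, n_hidden=16, n_env=14, n_flat=4, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate, applied to the concatenation [x, h].
        self.gates = {g: _rand_matrix(rng, n_hidden, n_in + n_hidden) for g in "ifog"}
        self.w_env = _rand_matrix(rng, n_env, n_hidden)    # envelope head
        self.w_flat = _rand_matrix(rng, n_flat, n_hidden)  # flatness head
        self.h = [0.0] * n_hidden  # hidden state carried across frames
        self.c = [0.0] * n_hidden  # cell state carried across frames

    def step(self, x):
        xh = list(x) + self.h
        i = [_sigmoid(v) for v in _matvec(self.gates["i"], xh)]   # input gate
        f = [_sigmoid(v) for v in _matvec(self.gates["f"], xh)]   # forget gate
        o = [_sigmoid(v) for v in _matvec(self.gates["o"], xh)]   # output gate
        g = [math.tanh(v) for v in _matvec(self.gates["g"], xh)]  # candidate
        self.c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, self.c, i, g)]
        self.h = [ok * math.tanh(ck) for ok, ck in zip(o, self.c)]
        envelope = _matvec(self.w_env, self.h)
        logits = _matvec(self.w_flat, self.h)
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        flatness_probs = [e / total for e in exps]  # softmax over 4 classes
        return envelope, flatness_probs
```

Calling step once per frame with an 84-dimensional vector returns a 14-dimensional envelope estimate and 4 values that sum to 1; the hidden and cell states are kept between calls, which is what makes the layer recurrent.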
In one example, when the time-frequency transform is a Fourier transform (such as STFT), the 70-dimensional low-frequency amplitude spectrum P_Low(i, j) of the narrowband signal may be obtained from the filtered 70-dimensional low-frequency spectrum S_Low_rev(i, j). The feature vector P_Low(i, j) is taken as one input of the neural network model, and the 14-dimensional low-frequency spectral envelope e_Low(i, k) calculated from P_Low(i, j) is taken as another input, so that the input layer of the neural network model receives an 84-dimensional feature vector. The neural network model converts the 84-dimensional feature vector through the LSTM layer (including, for example, 256 parameters) to obtain converted vector values. One fully connected network layer (including, for example, 512 parameters) connected to the LSTM layer performs classification (the first classification process) on the converted vector values and outputs the 14-dimensional high-frequency spectrum envelope e_High(i, k), and the other fully connected network layer (including, for example, 512 parameters) connected to the LSTM layer performs classification (the second classification process) on the converted vector values and outputs the 4 pieces of relative flatness information.
In another example, when the time-frequency transform is a discrete cosine transform (such as MDCT), the filtered 70-dimensional low-frequency spectrum S_Low_rev(i, j) is taken as one input of the neural network model, and the 14-dimensional low-frequency spectral envelope e_Low(i, k) obtained from S_Low_rev(i, j) is taken as another input, so that the input layer of the neural network model receives an 84-dimensional feature vector. The neural network model converts the 84-dimensional feature vector through the LSTM layer (including, for example, 256 parameters) to obtain converted vector values. One fully connected network layer (including, for example, 512 parameters) connected to the LSTM layer performs classification (the first classification process) on the converted vector values and outputs the 14-dimensional high-frequency spectrum envelope e_High(i, k), and the other fully connected network layer (including, for example, 512 parameters) connected to the LSTM layer performs classification (the second classification process) on the converted vector values and outputs the 4 pieces of relative flatness information.
In an alternative of the embodiment of the present application, the relative flatness information includes relative flatness information of at least two sub-band areas corresponding to the high frequency part, and the relative flatness information corresponding to one sub-band area characterizes a correlation between a spectral flatness of one sub-band area of the high frequency part and a spectral flatness of a high frequency band of the low frequency part.
The relative flatness information is determined based on the high-frequency part and the low-frequency part of the spectrum of the sample broadband signal. Because the low frequency band of the low-frequency part of the sample narrowband signal contains more abundant harmonics, the high frequency band of the low-frequency part of the sample narrowband signal can be selected as the reference for determining the relative flatness information. With the high frequency band of the low-frequency part used as the master, the high-frequency part of the sample broadband signal is divided into at least two sub-band regions, and the relative flatness information of each sub-band region is determined based on the spectrum of that sub-band region and the spectrum of the low-frequency part.
Based on the foregoing description, in the training process of the neural network model the labeling result may include the relative flatness information of each sub-band region. That is, the sample label of the sample data may include the relative flatness information between each sub-band region of the high-frequency part of the sample broadband signal and the low-frequency part, and this relative flatness information is determined based on the spectrum of each sub-band region of the high-frequency part of the sample broadband signal and the spectrum of the low-frequency part. Therefore, when the neural network model is applied and the input of the model is the low-frequency spectrum of the narrowband signal, the relative flatness information between each sub-band region of the high-frequency part of the target broadband spectrum and the low-frequency part can be predicted based on the output of the neural network model.
Specifically, if the high-frequency part includes spectrum parameters corresponding to at least two sub-band regions, the spectrum parameter of each sub-band region is determined based on the spectrum parameter of the high frequency band of the low-frequency part. Accordingly, the relative flatness information may include the relative flatness information between the spectrum parameter of each sub-band region and the spectrum parameter of the high frequency band of the low-frequency part. The spectrum parameter is the amplitude spectrum when the time-frequency transform is a Fourier transform, and the spectrum when the time-frequency transform is a discrete cosine transform.
The number of spectral coefficients of the amplitude spectrum of the low frequency portion of the target broadband spectrum may be the same as or different from the number of spectral coefficients of the amplitude spectrum of the high frequency portion, and the number of spectral coefficients corresponding to each sub-band region may be the same or different, so long as the total number of spectral coefficients corresponding to at least two sub-band regions is consistent with the number of spectral coefficients corresponding to the initial high frequency amplitude spectrum.
As an example, when the time-frequency transform is a Fourier transform, suppose the high-frequency part includes two sub-band regions, a first sub-band region and a second sub-band region, and the high frequency band of the low-frequency part is the band corresponding to the 35th to 69th frequency points. The number of spectral coefficients corresponding to the first sub-band region is the same as that corresponding to the second sub-band region, and the total number of spectral coefficients corresponding to the two sub-band regions is the same as the number of spectral coefficients corresponding to the low-frequency part. The first sub-band region is then the band corresponding to the 70th to 104th frequency points, and the second sub-band region is the band corresponding to the 105th to 139th frequency points. The amplitude spectrum of each sub-band region has 35 spectral coefficients, the same number as the amplitude spectrum of the high frequency band of the low-frequency part. If the selected high frequency band of the low-frequency part is the band corresponding to the 56th to 69th frequency points, the high-frequency part can be divided into 5 sub-band regions, each corresponding to 14 spectral coefficients. It should be noted that when the time-frequency transform is a discrete cosine transform, the case in which the high-frequency part includes spectra corresponding to at least two sub-band regions is similar to the case in this example in which the high-frequency part includes amplitude spectra corresponding to at least two sub-band regions, and is not described again here.
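The division of the high-frequency part in this example can be reproduced with a short sketch. It assumes that every sub-band region has the same width as the selected high frequency band of the low-frequency part, as in the two cases described above; the function name high_band_regions and its parameters are hypothetical.

```python
def high_band_regions(master_first, master_last, high_first, high_last):
    """Split the high band into regions as wide as the selected master band.

    All arguments are inclusive frequency-point indices.
    """
    width = master_last - master_first + 1
    n_high = high_last - high_first + 1
    if width <= 0 or n_high <= 0 or n_high % width:
        raise ValueError("master band does not tile the high band evenly")
    return [(start, start + width - 1)
            for start in range(high_first, high_last + 1, width)]


two_regions = high_band_regions(35, 69, 70, 139)   # master: points 35 to 69
five_regions = high_band_regions(56, 69, 70, 139)  # master: points 56 to 69
```

A master band of frequency points 35 to 69 gives the two regions 70 to 104 and 105 to 139, while a master band of frequency points 56 to 69 gives 5 regions of 14 frequency points each.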
Specifically, whether the time-frequency transform is a fourier transform or a discrete cosine transform, determining the gain adjustment value of the high-frequency spectral envelope based on the relative flatness information and the energy information of the low-frequency spectrum may include:
determining a gain adjustment value of a corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum;
wherein adjusting the high frequency spectral envelope based on the gain adjustment value may include:
the respective spectral envelope portions are adjusted based on gain adjustment values for each corresponding spectral envelope portion in the high frequency spectral envelope.
Specifically, if the high frequency portion includes at least two sub-band regions, a gain adjustment value of a corresponding spectral envelope portion in a high frequency spectral envelope corresponding to each sub-band region may be determined based on the relative flatness information corresponding to the sub-band region and the spectral energy information corresponding to each sub-band region in the low frequency spectrum, and then the corresponding spectral envelope portion may be adjusted based on the determined gain adjustment value.
As an example, when the time-frequency transform is a Fourier transform as described above and the at least two sub-band regions are a first sub-band region and a second sub-band region, the relative flatness information between the first sub-band region and the high-frequency band of the low-frequency portion is the first relative flatness information, and the relative flatness information between the second sub-band region and the high-frequency band of the low-frequency portion is the second relative flatness information. The envelope portion of the high-frequency spectral envelope corresponding to the first sub-band region may be adjusted based on the gain adjustment value determined from the first relative flatness information and the spectral energy information corresponding to the first sub-band region, and the envelope portion corresponding to the second sub-band region may be adjusted based on the gain adjustment value determined from the second relative flatness information and the spectral energy information corresponding to the second sub-band region. When the time-frequency transform is a discrete cosine transform, the relative flatness information and the gain adjustment values are determined in a similar way and are not described again here.
In an alternative of the embodiment of the present application, since the low frequency band of the low frequency portion of the sample narrowband signal contains more abundant harmonics, the high frequency band of the low frequency portion of the sample narrowband signal may be selected as a reference for determining the relative flatness information, the high frequency band of the low frequency portion may be used as a master, the high frequency portion of the sample wideband signal may be divided into at least two sub-band areas, and the relative flatness information of each sub-band area may be determined based on the frequency spectrum of each sub-band area of the high frequency portion and the frequency spectrum of the low frequency portion.
Based on the foregoing description, in the training phase of the neural network, the relative flatness information of each sub-band region of the high-frequency part of the spectrum of the sample wideband signal may be determined by analysis of variance based on the sample data (the sample data include the sample narrowband signal and the corresponding sample wideband signal). As an example, if the high-frequency part of the sample wideband signal is divided into two sub-band regions, a first sub-band region and a second sub-band region, the relative flatness information of the high-frequency part and the low-frequency part of the sample wideband signal may consist of first relative flatness information, between the first sub-band region and the high-frequency band of the low-frequency part of the sample wideband signal, and second relative flatness information, between the second sub-band region and the high-frequency band of the low-frequency part of the sample wideband signal.
The determination process of the first relative flatness information and the second relative flatness information will be described below taking the case where the time-frequency transform is fourier transform as an example:
the specific determination manner of the first relative flatness information and the second relative flatness information may be:
based on the frequency domain coefficients $S_{Low,sample}(i,j)$ of the narrowband signal in the sample data and the frequency domain coefficients $S_{High,sample}(i,j)$ of the high-frequency portion of the wideband signal in the sample data, the following three variances are calculated by formulas (13) to (15):
$$\mathrm{var}_L(S_{Low,sample}(i,j)),\quad j=35,36,\ldots,69 \tag{13}$$
$$\mathrm{var}_{H1}(S_{High,sample}(i,j)),\quad j=70,71,\ldots,104 \tag{14}$$
$$\mathrm{var}_{H2}(S_{High,sample}(i,j)),\quad j=105,106,\ldots,139 \tag{15}$$
where formula (13) is the variance of the amplitude spectrum of the high-frequency band of the low-frequency portion of the sample narrowband signal, formula (14) is the variance of the amplitude spectrum of the first sub-band region, formula (15) is the variance of the amplitude spectrum of the second sub-band region, var() denotes the variance, and the variance of a spectrum can be expressed in terms of the corresponding frequency domain coefficients. $S_{Low,sample}(i,j)$ denotes the frequency domain coefficients of the sample narrowband signal; the low-frequency domain coefficients of the sample narrowband signal may also be the filtered frequency domain coefficients $S_{Low,sample\_rev}(i,j)$, i.e. $S_{Low,sample}(i,j)$ in the above formulas is replaced by $S_{Low,sample\_rev}(i,j)$.
Based on the above three variances, the relative flatness information of the amplitude spectrum of each subband region and the amplitude spectrum of the high frequency band of the low frequency part is determined by the formula (16) and the formula (17):
Where fc (0) represents first relative flatness information of the amplitude spectrum of the first sub-band region and the amplitude spectrum of the high frequency band of the low frequency part, and fc (1) represents second relative flatness information of the amplitude spectrum of the second sub-band region and the amplitude spectrum of the high frequency band of the low frequency part.
The two values fc(0) and fc(1) can each be classified according to whether they are greater than or equal to 0 (in the embodiment of the present application, 1 denotes a value greater than or equal to 0 and 0 denotes a value less than 0). Treating fc(0) and fc(1) as a binary array, the array has 4 possible combinations: {0,0}, {0,1}, {1,0}, and {1,1}.
Thus, the relative flatness information output by the model may be 4 probability values for identifying the probability that the relative flatness information belongs to the 4 arrays described above.
According to the principle of maximum probability, one of the 4 arrays can be selected as the predicted relative flatness information between the amplitude spectra of the two sub-band regions of the extended region and the amplitude spectrum of the high-frequency band of the low-frequency part. Specifically, it can be expressed by formula (18):
$$v(i,k)=0 \text{ or } 1,\quad k=0,1 \tag{18}$$
where v(i,k) denotes the relative flatness information between the amplitude spectrum of each of the two sub-band regions of the extended region and the amplitude spectrum of the high-frequency band of the low-frequency part, and k is the index of the sub-band region, each sub-band region corresponding to one piece of relative flatness information. For example, when k=0, v(i,k)=0 indicates that the first sub-band region oscillates relative to the low-frequency part, i.e. has poor flatness, and v(i,k)=1 indicates that the first sub-band region is flat relative to the low-frequency part, i.e. has good flatness.
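The maximum-probability selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and the assumption that the four model outputs are ordered exactly as the arrays are listed in the text ({0,0}, {0,1}, {1,0}, {1,1}) are hypothetical.

```python
# Assumed class order, following the listing in the text: {0,0}, {0,1}, {1,0}, {1,1}.
FLATNESS_CLASSES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def pick_relative_flatness(probabilities):
    """Return (v(i,0), v(i,1)) for the most probable of the 4 classes."""
    if len(probabilities) != len(FLATNESS_CLASSES):
        raise ValueError("expected one probability per class")
    # Index of the largest probability (first one wins on ties).
    best = max(range(len(probabilities)), key=lambda n: probabilities[n])
    return FLATNESS_CLASSES[best]


# Example: the second class is the most probable, so the first sub-band
# region is marked as oscillating (0) and the second as flat (1).
v = pick_relative_flatness([0.1, 0.6, 0.2, 0.1])
```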
In the embodiment of the application, the low-frequency spectrum of the second narrowband signal is input into the trained neural network model, and the relative flatness information of the high-frequency part of the target broadband frequency spectrum can be obtained through prediction of the neural network model. If the frequency spectrum corresponding to the high frequency band of the low frequency part of the narrow-band signal is selected as the input of the neural network model, the relative flatness information of at least two sub-band areas of the high frequency part of the target wide-band frequency spectrum can be predicted based on the trained neural network model.
In an alternative of the embodiment of the present application, if the high frequency spectrum envelope includes a first predetermined number of high frequency sub-spectrum envelopes, if the low frequency spectrum is obtained by fourier transform, the first predetermined number is the second number, and if the low frequency spectrum is obtained by discrete cosine transform, the first predetermined number is the third number;
wherein determining a gain adjustment value for a corresponding spectral envelope portion in the high frequency spectral envelope based on the relative flatness information for each subband region and the spectral energy information for each subband region in the low frequency spectrum comprises:
for each high-frequency sub-spectrum envelope, determining a gain adjustment value of the high-frequency sub-spectrum envelope according to spectrum energy information corresponding to a spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, relative flatness information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, and spectrum energy information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope;
Adjusting each corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope, including:
and adjusting the corresponding high-frequency sub-spectrum envelopes according to the gain adjustment value of each high-frequency sub-spectrum envelope in the high-frequency spectrum envelopes.
Specifically, the following description will be given by taking the case that the low frequency spectrum is obtained by fourier transform and the first predetermined number is the second number as an example:
specifically, each high-frequency sub-spectrum envelope of the high-frequency spectrum envelope corresponds to a gain adjustment value, the gain adjustment value is determined based on the spectral energy information corresponding to the low-frequency sub-spectrum envelope, the relative flatness information corresponding to the sub-band region corresponding to the low-frequency sub-spectrum envelope, and the spectral energy information corresponding to the sub-band region corresponding to the low-frequency sub-spectrum envelope, and the low-frequency sub-spectrum envelope corresponds to the high-frequency sub-spectrum envelope, the high-frequency spectrum envelope includes a second number of high-frequency sub-spectrum envelopes, and the high-frequency spectrum envelope includes a corresponding second number of gain adjustment values.
It will be appreciated that if the high frequency portion comprises corresponding at least two sub-band regions, for the high frequency spectral envelope corresponding to the at least two sub-band regions, the high frequency sub-spectral envelope of the corresponding sub-band region may be adjusted based on the gain adjustment value corresponding to the high frequency sub-spectral envelope corresponding to each sub-band region.
Taking the case where the first sub-band region includes 35 frequency points as an example, one implementation of determining the gain adjustment value of the high-frequency sub-spectrum envelope corresponding to the low-frequency sub-spectrum envelope, based on the spectrum energy information corresponding to the low-frequency sub-spectrum envelope, the relative flatness information corresponding to the sub-band region corresponding to the low-frequency sub-spectrum envelope, and the spectrum energy information corresponding to the sub-band region corresponding to the low-frequency sub-spectrum envelope, is as follows:
(1) Parse v(i,k): a value of 1 indicates that the high-frequency part is very flat, and a value of 0 indicates that the high-frequency part oscillates.
(2) The 35 frequency points in the first sub-band region are divided into 7 sub-bands, each sub-band corresponding to one high-frequency sub-spectral envelope. The average energy pow_env of each sub-band (the spectral energy information corresponding to the second sub-spectral envelope) is calculated, and the mean Mpow_env of the 7 average energies (the spectral energy information corresponding to the sub-band region corresponding to the low-frequency sub-spectral envelope) is calculated.
(3) A gain adjustment value of each high-frequency sub-spectrum envelope is calculated based on the parsed relative flatness information corresponding to the first sub-band region, the average energy pow_env, and the mean Mpow_env, specifically:
when v(i,k) = 1, G(j) = a1 + b1 * SQRT(Mpow_env / pow_env(j)), j = 0, 1, …, 6;
when v(i,k) = 0, G(j) = a0 + b0 * SQRT(Mpow_env / pow_env(j)), j = 0, 1, …, 6;
where, as an alternative, a1 = 0.875, b1 = 0.125, a0 = 0.925, b0 = 0.075, and G(j) is the gain adjustment value.
In this case, for v (i, k) =0, the gain adjustment value is 1, i.e., no flattening operation (adjustment) of the high-frequency spectral envelope is required.
Based on the above mode, gain adjustment values of 7 high-frequency sub-spectrum envelopes in the high-frequency spectrum envelopes can be determined, corresponding high-frequency sub-spectrum envelopes are adjusted based on the gain adjustment values of 7 high-frequency sub-spectrum envelopes, the average energy difference of different sub-bands can be reduced, and flattening processing of different degrees is carried out on the frequency spectrum corresponding to the first sub-band region.
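As a rough sketch of steps (2) and (3), the following Python applies the two gain formulas literally with the optional coefficient values quoted above. The function name is hypothetical, the per-sub-band average energies pow_env(j) are taken as given inputs, and strictly positive energies are assumed so that the division is defined.

```python
import math

# Optional coefficient values quoted in the text.
A1, B1 = 0.875, 0.125  # used when v(i, k) == 1
A0, B0 = 0.925, 0.075  # used when v(i, k) == 0


def gain_adjustment_values(pow_env, v):
    """Return G(j) for each sub-band of one sub-band region.

    pow_env: average energy of each sub-band (7 values in the example).
    v:       relative flatness flag v(i, k) of the region, 0 or 1.
    """
    if any(p <= 0 for p in pow_env):
        raise ValueError("sub-band energies must be positive (assumption)")
    mpow_env = sum(pow_env) / len(pow_env)  # mean of the average energies
    a, b = (A1, B1) if v == 1 else (A0, B0)
    return [a + b * math.sqrt(mpow_env / p) for p in pow_env]


# One sub-band is much stronger than the rest: its gain drops below 1,
# while the weaker sub-bands receive gains slightly above 1.
gains = gain_adjustment_values([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 8.0], v=1)
```

Because a1 + b1 = 1 and a0 + b0 = 1, a region whose sub-bands all have the same average energy receives a gain of 1 everywhere under either setting.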
It will be appreciated that the high-frequency spectral envelope corresponding to the second sub-band region may be adjusted in the same manner as described above, which is not described in detail here. The high-frequency spectral envelope comprises 14 sub-bands in total, so 14 gain adjustment values can be determined correspondingly, and the corresponding sub-spectral envelopes are adjusted based on these 14 gain adjustment values.
In an alternative solution of the embodiment of the present application, the low-frequency domain parameters further include low-frequency domain coefficients, and obtaining a target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum may include:
generating a target high-frequency domain coefficient according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
a target high frequency spectrum is generated based on the low frequency domain coefficients and the target high frequency domain coefficients.
Specifically, in one implementation, after the target high-frequency domain coefficients are generated according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum, the target high-frequency domain coefficients are filtered to obtain filtered target high-frequency domain coefficients, and the target high-frequency spectrum is generated based on the low-frequency domain coefficients and the filtered target high-frequency domain coefficients. The filtering process is essentially the same as the process of filtering the low-frequency domain coefficients and is not described again here.
In an alternative of the embodiment of the present application, in step S130, obtaining a wideband signal with a spread frequency band based on the low frequency spectrum and the target high frequency spectrum may include:
combining the low-frequency spectrum with the target high-frequency spectrum to obtain a broadband spectrum;
And performing frequency-time conversion on the broadband frequency spectrum to obtain a broadband signal with the expanded frequency band.
Specifically, the wideband signal includes a signal of a low frequency portion and a signal of an extended high frequency portion in the narrowband signal, and after obtaining a low frequency spectrum corresponding to the low frequency portion and a high frequency spectrum corresponding to the high frequency portion, the low frequency spectrum and the high frequency spectrum may be combined to obtain a wideband spectrum, and then frequency-time conversion (inverse conversion of time-frequency conversion, conversion of a frequency domain signal into a time domain signal) is performed on the wideband spectrum, so that a target speech signal after frequency band extension may be obtained.
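A minimal sketch of the combination and frequency-time conversion, under several assumptions that are not fixed by the passage above: a 320-point Fourier transform as in the Fourier example of this description, low-frequency coefficients occupying frequency points 0 to 69 and target high-frequency coefficients occupying frequency points 70 to 139, and a naive inverse DFT used purely for illustration (a practical implementation would use an FFT routine and overlap-add synthesis). All function names are hypothetical.

```python
import cmath


def combine_spectra(low_coeffs, high_coeffs, n_fft=320):
    """Place the low and high bands in a one-sided spectrum of n_fft//2 + 1 bins."""
    one_sided = [0j] * (n_fft // 2 + 1)
    merged = list(low_coeffs) + list(high_coeffs)
    if len(merged) > len(one_sided):
        raise ValueError("too many coefficients for the transform size")
    one_sided[:len(merged)] = merged  # remaining bins stay zero
    return one_sided


def inverse_real_dft(one_sided, n_fft=320):
    """Naive O(N^2) inverse DFT of a conjugate-symmetric spectrum."""
    # Rebuild bins N/2+1 .. N-1 from X[N - k] = conj(X[k]).
    mirrored = [one_sided[k].conjugate() for k in range(n_fft // 2 - 1, 0, -1)]
    full = list(one_sided) + mirrored
    return [
        sum(full[k] * cmath.exp(2j * cmath.pi * k * n / n_fft)
            for k in range(n_fft)).real / n_fft
        for n in range(n_fft)
    ]


# A single non-zero coefficient at frequency point 5 yields a cosine.
low = [0j] * 70
low[5] = 160 + 0j
wideband_spectrum = combine_spectra(low, [0j] * 70)
time_signal = inverse_real_dft(wideband_spectrum)
```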
In an alternative aspect of the embodiment of the present application, if the narrowband signal includes at least two associated signals, the method may further include:
fusing at least two paths of associated signals to obtain a narrowband signal;
or,
and taking each of at least two paths of associated signals as a narrow-band signal respectively.
Specifically, the narrowband signal may be a multipath associated signal, for example, adjacent voice frames, at least two paths of associated signals may be fused to obtain a path of signal, the path of signal is used as a narrowband signal, and then the narrowband signal is expanded by the band expansion method in the embodiment of the present application to obtain a wideband signal.
Or, each of the at least two paths of related signals may be used as a narrowband signal, and the narrowband signal is expanded by the band expansion method in the present application to obtain at least two paths of corresponding wideband signals, where the at least two paths of wideband signals may be combined into one path of signal output, or may be output separately.
In order to better understand the method provided by the embodiment of the present application, the scheme is described below in further detail with examples of specific application scenarios, taking the cases where the time-frequency transform is a Fourier transform and a discrete cosine transform, respectively.
As an example, the application scenario is a PSTN (narrowband speech) and VoIP (wideband speech) interworking scenario, that is, a narrowband speech corresponding to a PSTN telephone is used as a narrowband signal to be processed, and the narrowband signal to be processed is subjected to band expansion, so that a speech frame received by a VoIP receiving end is wideband speech, thereby improving hearing experience of the receiving end.
In this example, the narrowband signal to be processed is a signal with a sampling rate of 8000 Hz and a frame length of 10 ms, so according to the Nyquist sampling theorem its effective bandwidth is 4000 Hz. In an actual voice communication scenario, the upper bound of the effective bandwidth is typically 3500 Hz. Therefore, in this example, the bandwidth of the extended wideband signal is taken to be 7000 Hz.
In a first example, as shown in fig. 3, the time-frequency transform is a fourier transform (such as STFT), and the specific procedure includes the following steps:
Step S1, front-end signal processing:
The narrowband signal to be processed is up-sampled by a factor of 2, and an up-sampled signal with a sampling rate of 16000 Hz is output.
Since the sampling rate of the narrowband signal to be processed is 8000 Hz and the frame length is 10 ms, the up-sampled signal corresponds to 160 sample points per frame. A short-time Fourier transform (STFT) is performed on the up-sampled signal as follows: the 160 sample points of the previous speech frame and the 160 sample points of the current speech frame (the narrowband signal to be processed) form an array of 320 sample points. The sample points in the array are then windowed with a Hanning window to obtain the windowed signal $s_{Low}(i,j)$, and a fast Fourier transform is performed to obtain 320 low-frequency domain coefficients $S_{Low}(i,j)$, where i is the frame index of the speech frame and j is the intra-frame sample index (j = 0, 1, …, 319). Given the conjugate symmetry of the Fourier transform of a real signal, and since the first coefficient is the direct current component, only the first 161 low-frequency domain coefficients need to be considered.
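The framing, windowing, and transform of step S1 can be sketched in plain Python as below. This is an illustration under assumptions: the input frames are taken to be already up-sampled to 16000 Hz, the periodic form of the Hanning window is chosen because the text does not say which variant is used, and a naive DFT stands in for the fast Fourier transform. The function names are hypothetical.

```python
import cmath
import math

FRAME = 160        # samples per 10 ms frame at 16000 Hz
N_FFT = 2 * FRAME  # previous frame + current frame = 320 samples


def hann(n):
    """Periodic Hanning window of length n (window variant is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def analyse_frame(prev_frame, cur_frame):
    """Return the first 161 frequency-domain coefficients S_Low(i, j)."""
    block = list(prev_frame) + list(cur_frame)
    if len(block) != N_FFT:
        raise ValueError("each frame must contain 160 samples")
    windowed = [w * x for w, x in zip(hann(N_FFT), block)]
    # Naive DFT; bins 161..319 are conjugates of bins 159..1 and are skipped.
    return [
        sum(windowed[n] * cmath.exp(-2j * cmath.pi * k * n / N_FFT)
            for n in range(N_FFT))
        for k in range(N_FFT // 2 + 1)
    ]


# A cosine centred on frequency point 10 (500 Hz at 50 Hz per frequency point).
tone = [math.cos(2 * math.pi * 10 * n / N_FFT) for n in range(N_FFT)]
coeffs = analyse_frame(tone[:FRAME], tone[FRAME:])
```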
Step S2, low frequency pre-filtering (the initial spectrum in the step is the initial low frequency spectrum):
the low-frequency pre-filtering is to perform filtering processing on an initial low-frequency domain coefficient obtained by STFT on a narrowband signal to be processed, so as to obtain the low-frequency domain coefficient. In the filtering process, the initial low frequency domain coefficient is subjected to filtering processing by a filter gain determined based on the initial low frequency domain coefficient, specifically as shown in the following formula (19):
$$S_{Low\_rev}(i,j)=G_{pre\_filt}(j)\cdot S_{Low}(i,j) \tag{19}$$
where i is the frame index of the speech frame, j is the intra-frame sample index (j = 0, 1, …, 69), $G_{pre\_filt}(j)$ is the first filter gain calculated from the initial low-frequency domain coefficients, $S_{Low}(i,j)$ is the initial low-frequency domain coefficient, and $S_{Low\_rev}(i,j)$ is the low-frequency domain coefficient obtained by the filtering process.
In this example, it is assumed that a filtering gain is shared by every 5 initial low frequency domain coefficients in the same subband, where the filtering gain is calculated as follows:
(1) The initial low-frequency domain coefficients are divided into bands: for example, every 5 adjacent initial low-frequency domain coefficients are combined into one sub-spectrum, which in this example gives 14 sub-bands. The average energy is calculated for each sub-band. In particular, the energy of each frequency point (i.e., each initial low-frequency domain coefficient described above) is defined as the sum of the square of its real part and the square of its imaginary part. The energy values of 5 adjacent frequency points are calculated according to the following formula (20), and the average of these 5 energy values is the first spectral energy of the current sub-spectrum:
$$Pe(k)=\frac{1}{5}\sum_{j=0}^{4}\left(\mathrm{Real}(S_{Low}(i,5k+j))^2+\mathrm{Imag}(S_{Low}(i,5k+j))^2\right) \tag{20}$$
where $S_{Low}(i,j)$ is the low-frequency domain coefficient obtained from the time-frequency transform (i.e., the initial low-frequency domain coefficient), Real and Imag are its real part and imaginary part, respectively, k = 0, 1, …, 13 is the sub-band index over the 14 sub-bands, and Pe(k) is the first spectral energy (initial spectral energy) corresponding to the k-th sub-band.
(2) Based on the inter-frame correlation, a first spectral energy of the current sub-spectrum is calculated by at least one of equation (21) and equation (22):
$$Fe(k)=1.0+Pe(k)+Pe_{pre}(k) \tag{21}$$
$$Fe\_sm(k)=(Fe(k)+Fe_{pre}\_sm(k))/2 \tag{22}$$
where Fe(k) is a smoothed term of the first spectral energy of the current sub-spectrum, Pe(k) is the initial spectral energy of the current sub-spectrum of the current speech frame, $Pe_{pre}(k)$ is the initial spectral energy of the sub-spectrum, corresponding to the current sub-spectrum, of the associated speech frame of the current speech frame, Fe_sm(k) is a smoothed term of the accumulated and averaged first spectral energy, and $Fe_{pre}\_sm(k)$ is the smoothed term of the first spectral energy, corresponding to the current sub-spectrum, of the associated speech frame of the current speech frame. The associated speech frame is at least one speech frame preceding and adjacent to the current speech frame. As an alternative, the associated speech frame is the speech frame immediately preceding the current speech frame.
In this example, the first spectral energy calculated by the scheme of formula (22) is taken as the first spectral energy of the sub-spectrum.
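Steps (1) and (2) can be sketched as follows. The sketch follows the verbal definition of the sub-band energy (mean of real-part squared plus imaginary-part squared over 5 adjacent frequency points) and formulas (21) and (22). The function names are hypothetical, and the state carried over from the previous frame is passed in explicitly because the text does not say how it is initialised for the first frame.

```python
def sub_band_energy(coeffs, n_sub=14, width=5):
    """Pe(k): mean of |S_Low(i, j)|^2 over the `width` bins of sub-spectrum k."""
    return [
        sum(c.real ** 2 + c.imag ** 2
            for c in coeffs[k * width:(k + 1) * width]) / width
        for k in range(n_sub)
    ]


def smooth_energy(pe, pe_prev, fe_sm_prev):
    """Apply formulas (21) and (22) to every sub-spectrum.

    pe:         Pe(k) of the current frame.
    pe_prev:    Pe(k) of the associated (previous) frame.
    fe_sm_prev: Fe_sm(k) of the associated (previous) frame.
    """
    fe = [1.0 + p + q for p, q in zip(pe, pe_prev)]         # (21)
    fe_sm = [(f + g) / 2 for f, g in zip(fe, fe_sm_prev)]   # (22)
    return fe, fe_sm


# 70 identical coefficients 3 + 4j give Pe(k) = 25 for all 14 sub-spectra.
pe = sub_band_energy([complex(3, 4)] * 70)
fe, fe_sm = smooth_energy(pe, pe_prev=[0.0] * 14, fe_sm_prev=[26.0] * 14)
```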
(3) Calculating a spectrum inclination coefficient of the initial spectrum, equally dividing a frequency band corresponding to the initial spectrum into a first sub-band and a second sub-band, and respectively calculating first sub-band energy of the first sub-band and second sub-band energy of the second sub-band, wherein a calculation formula (23) is as follows:
wherein e1 is the first sub-band energy of the first sub-band and e2 is the second sub-band energy of the second sub-band
Next, from e1 and e2, the spectral tilt coefficients of the initial spectrum are determined based on the following logic:
If (e2 >= e1):
    T_para = 0;
Else:
    T_para = 8 * f_cont_low * SQRT((e1 - e2) / (e1 + e2));
    T_para = min(1.0, T_para);
    T_para = T_para / 7;
where T_para is the spectral tilt coefficient, SQRT is the square-root operation, f_cont_low = 0.035 is a preset filter coefficient, and 7 is half of the total number of sub-spectra.
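The tilt logic above translates directly into a small Python function. The function name is hypothetical, and the two sub-band energies e1 and e2 are taken as inputs, since their defining formula (23) is not reproduced in this text; they are assumed to be non-negative.

```python
import math

F_CONT_LOW = 0.035    # preset filter coefficient from the text
HALF_SUB_SPECTRA = 7  # half of the 14 sub-spectra


def spectral_tilt(e1, e2):
    """Return T_para from the first and second sub-band energies."""
    if e2 >= e1:
        return 0.0
    # e1 > e2 >= 0 here, so e1 + e2 is strictly positive.
    t_para = 8 * F_CONT_LOW * math.sqrt((e1 - e2) / (e1 + e2))
    t_para = min(1.0, t_para)
    return t_para / HALF_SUB_SPECTRA


# e1 = 5, e2 = 3: sqrt(2 / 8) = 0.5, so T_para = 8 * 0.035 * 0.5 / 7 = 0.02.
t_para = spectral_tilt(5.0, 3.0)
```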
(4) The second filter gain for each sub-spectrum is calculated, and may be calculated according to the following equation (24):
$$gain_{f0}(k)=Fe\_sm(k)^{f\_cont\_low} \tag{24}$$
where $gain_{f0}(k)$ is the second filter gain of the k-th sub-spectrum, f_cont_low is a preset filter coefficient, for example f_cont_low = 0.035, and Fe_sm(k) is the smoothed term of the first spectral energy of the k-th sub-spectrum calculated according to formula (22), k = 0, 1, …, 13.
Then, if the spectral tilt coefficient T_para is positive, the second filter gain $gain_{f0}(k)$ is further adjusted according to the following formula (25):
If (T_para > 0):
$$gain_{f1}(k)=gain_{f0}(k)\cdot(1+k\cdot T\_para) \tag{25}$$
where $gain_{f1}(k)$ is the second filter gain adjusted according to the spectral tilt coefficient T_para.
(5) The filter gain value of the low frequency pre-filter is obtained according to the following formula (26):
$$G_{pre\_filt}(k)=(1+gain_{f1}(k))/2 \tag{26}$$
where $gain_{f1}(k)$ is the second filter gain adjusted according to formula (25), and $G_{pre\_filt}(k)$ is the filter gain (i.e., the second filter gain) finally obtained from $gain_{f1}(k)$ for the 5 low-frequency domain coefficients corresponding to the k-th sub-spectrum.
Specifically, after the second filter gain $G_{pre\_filt}(k)$ corresponding to the k-th sub-spectrum is determined, since the first filter gain includes a second number (e.g., L = 14) of second filter gains $G_{pre\_filt}(k)$, and each second filter gain $G_{pre\_filt}(k)$ is the filter gain of the N spectral coefficients corresponding to the k-th sub-spectrum, the first filter gain $G_{pre\_filt}(j)$ can be obtained.
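Steps (4) and (5), together with the filtering of formula (19), can be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: formula (24) is read as Fe_sm(k) raised to the power f_cont_low, and when T_para is not positive the gain of formula (24) is passed through unchanged. The function names are hypothetical.

```python
F_CONT_LOW = 0.035  # preset filter coefficient from the text


def pre_filter_gains(fe_sm, t_para, bins_per_sub=5):
    """Return the per-frequency-point gain G_pre_filt(j) (70 values for 14 sub-spectra)."""
    per_bin = []
    for k, energy in enumerate(fe_sm):
        gain_f0 = energy ** F_CONT_LOW            # (24), read as a power
        if t_para > 0:
            gain_f1 = gain_f0 * (1 + k * t_para)  # (25)
        else:
            gain_f1 = gain_f0                     # no tilt adjustment (assumption)
        g = (1 + gain_f1) / 2                     # (26)
        per_bin.extend([g] * bins_per_sub)        # shared by the sub-spectrum's bins
    return per_bin


def apply_pre_filter(coeffs, gains):
    """Formula (19): scale each initial low-frequency coefficient by its gain."""
    return [g * c for g, c in zip(gains, coeffs)]


# With Fe_sm(k) = 1 the gain of (24) is 1, so only the tilt term matters.
gains = pre_filter_gains([1.0] * 14, t_para=0.02)
filtered = apply_pre_filter([complex(1, 1)] * 70, gains)
```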
Step S3, feature extraction:
a) A low frequency amplitude spectrum is calculated by the formula (27) based on the low frequency domain coefficients:
$$P_{Low}(i,j)=\mathrm{SQRT}\left(\mathrm{Real}(S_{Low}(i,j))^2+\mathrm{Imag}(S_{Low}(i,j))^2\right) \tag{27}$$
where $P_{Low}(i,j)$ denotes the low-frequency amplitude spectrum, $S_{Low}(i,j)$ is the initial low-frequency domain coefficient obtained by the STFT, Real and Imag are the real part and imaginary part of the low-frequency domain coefficient, respectively, and SQRT is the square-root operation. When $S_{Low}(i,j)$ has been filtered by the above formula (19), formula (27) takes the following form:
$$P_{Low}(i,j)=\mathrm{SQRT}\left(\mathrm{Real}(S_{Low\_rev}(i,j))^2+\mathrm{Imag}(S_{Low\_rev}(i,j))^2\right)$$
If the narrowband signal is a signal with a sampling rate of 8000 Hz and a bandwidth of 0 to 3500 Hz, then, based on the sampling rate and the frame length of the narrowband signal, 70 low-frequency amplitude spectral coefficients $P_{Low}(i,j)$, j = 0, 1, …, 69, can be determined from the low-frequency domain coefficients (the 320-point transform at 16000 Hz has a spacing of 50 Hz between frequency points, so 0 to 3500 Hz covers 70 frequency points). In practical application, the 70 calculated low-frequency amplitude spectral coefficients can be used directly as the low-frequency amplitude spectrum of the narrowband signal; for convenience of calculation, the low-frequency amplitude spectrum can also be converted into the logarithmic domain.
After obtaining a low frequency amplitude spectrum comprising 70 coefficients, the low frequency envelope of the narrowband signal may be determined based on the low frequency amplitude spectrum.
b) Further, the low frequency spectral envelope may also be determined based on the low frequency amplitude spectrum by:
the narrowband signal is divided into bands: for the 70 low-frequency amplitude spectral coefficients, the frequency band corresponding to every 5 adjacent spectral coefficients is taken as one sub-band, giving 14 sub-bands of 5 spectral coefficients each. For each sub-band, the low-frequency spectral envelope of the sub-band is defined as the average energy of the adjacent spectral coefficients. Specifically, it can be calculated by formula (28):
where $e_{Low}(i,k)$ denotes a sub-spectrum envelope (the low-frequency spectral envelope of each sub-band) and k is the index of the sub-band; there are 14 sub-bands in total, k = 0, 1, 2, …, 13, so the low-frequency spectral envelope includes 14 sub-spectrum envelopes.
In general, the spectrum envelope of the sub-band is defined as the average energy (or further converted into logarithmic representation) of the adjacent coefficients, but this approach may possibly result in that the coefficient with smaller amplitude cannot play a substantial role.
Thus, a 70-dimensional low-frequency amplitude spectrum and a 14-dimensional low-frequency spectrum envelope can be used as inputs to the neural network model.
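The feature extraction of step S3 can be sketched as below. The amplitude follows formula (27). For the envelope, the sketch takes "average energy" to mean the mean of the squared amplitudes of the 5 coefficients in each sub-band; this is an assumption, because the exact form of formula (28) is not reproduced in this text, and the optional conversion to the logarithmic domain is omitted. The function names are hypothetical.

```python
import math


def amplitude_spectrum(coeffs, n_bins=70):
    """P_Low(i, j) of formula (27) for the first n_bins frequency points."""
    return [math.hypot(c.real, c.imag) for c in coeffs[:n_bins]]


def low_band_envelope(amplitudes, width=5):
    """One envelope value per sub-band of `width` adjacent coefficients.

    Assumed definition: mean of the squared amplitudes in the sub-band.
    """
    return [
        sum(a * a for a in amplitudes[k:k + width]) / width
        for k in range(0, len(amplitudes), width)
    ]


def feature_vector(coeffs):
    """Concatenate the 70-dim amplitude spectrum and the 14-dim envelope."""
    amp = amplitude_spectrum(coeffs)
    return amp + low_band_envelope(amp)


# 161 identical coefficients 3 + 4j: amplitude 5, assumed envelope value 25.
features = feature_vector([complex(3, 4)] * 161)
```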
Step S4, input to the neural network model:
input layer: the input of the neural network model is the 84-dimensional feature vector described above;
output layer: considering that the target wideband for band extension in this embodiment is 7000Hz, it is necessary to predict the high-frequency spectral envelope of 14 sub-bands corresponding to the 3500-7000Hz band, so that the basic band extension function can be completed. Typically, the low frequency portion of a speech frame contains a large number of harmonic-like structures, such as pitch and formants; the spectrum of the high frequency part is flatter; if the low-frequency spectrum is simply copied to the high frequency to obtain an initial high-frequency amplitude spectrum, and the gain control based on the sub-band is carried out on the initial high-frequency amplitude spectrum, the reconstructed high-frequency part generates excessive harmonic-like structures, so that distortion is caused, and the hearing is influenced; therefore, in this example, based on the relative flatness information predicted by the neural network model, the relative flatness of the low-frequency part and the high-frequency part is described, and the initial high-frequency amplitude spectrum is adjusted, so that the adjusted high-frequency part is flatter, and the interference of the harmonic wave is reduced.
In this example, the initial high-frequency amplitude spectrum is generated by copying the amplitude spectrum of the high-frequency band of the low-frequency amplitude spectrum twice. At the same time, the frequency band of the high-frequency part is divided equally into two sub-band regions, a first sub-band region and a second sub-band region. The high-frequency part corresponds to 70 spectral coefficients and each sub-band region to 35 spectral coefficients, so two flatness analyses are performed on the high-frequency part, one for each sub-band region. Because harmonic components are richer in the low-frequency part, particularly in the band below 1000 Hz, the spectral coefficients corresponding to frequency points 35 to 69 are selected as the "master" in this embodiment. The frequency band corresponding to the first sub-band region is then the band corresponding to the 70th to 104th frequency points, and the frequency band corresponding to the second sub-band region is the band corresponding to the 105th to 139th frequency points.
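The copy operation and the split into two sub-band regions can be sketched in a few lines of Python. The function names are hypothetical; the index ranges follow the frequency points given in the text.

```python
def initial_high_band(low_amplitudes, start=35, stop=70, copies=2):
    """Copy the master band (frequency points 35..69) `copies` times."""
    master = list(low_amplitudes[start:stop])
    return master * copies  # 70 coefficients for frequency points 70..139


def split_sub_band_regions(high_amplitudes, n_regions=2):
    """Split the high band into equally sized sub-band regions."""
    size = len(high_amplitudes) // n_regions
    return [high_amplitudes[r * size:(r + 1) * size] for r in range(n_regions)]


# Use the frequency-point index as a stand-in amplitude to make the copy visible.
low_amp = [float(j) for j in range(70)]
high_amp = initial_high_band(low_amp)
first_region, second_region = split_sub_band_regions(high_amp)
```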
Flatness analysis may use Variance (Variance) analysis methods defined in classical statistics. The oscillation degree of the frequency spectrum can be described by a variance analysis method, and the higher the value is, the richer harmonic components are indicated.
Based on the foregoing description, since the low frequency band of the low frequency portion of the sample narrowband signal contains more abundant harmonics, the high frequency band of the low frequency portion of the sample narrowband signal may be selected as a reference for determining the relative flatness information, that is, the high frequency band of the low frequency portion (the frequency band corresponding to the frequency points 35-69) is used as a master, the high frequency portion of the sample wideband signal is correspondingly divided into at least two sub-band regions, and the relative flatness information of each sub-band region is determined based on the frequency spectrum of each sub-band region of the high frequency portion and the frequency spectrum of the low frequency portion.
In the training phase of the neural network model, the relative flatness information of each sub-band region of the high frequency part of the spectrum of the sample broadband signal may be determined by an analysis of variance based on sample data (including the sample narrowband signal and the corresponding sample broadband signal in the sample data).
As an example, if the high-frequency part of the sample broadband signal is divided into two sub-band regions, namely a first sub-band region and a second sub-band region, the relative flatness information of the high-frequency part and the low-frequency part of the sample broadband signal may consist of first relative flatness information, between the first sub-band region and the high frequency band of the low-frequency part of the sample broadband signal, and second relative flatness information, between the second sub-band region and the high frequency band of the low-frequency part of the sample broadband signal.
When the time-frequency transform is a Fourier transform, the first relative flatness information and the second relative flatness information may be determined as follows:
Based on the frequency domain coefficients $S_{Low,sample}(i,j)$ of the narrowband signal in the sample data and the frequency domain coefficients $S_{High,sample}(i,j)$ of the high-frequency part of the wideband signal in the sample data, the following three variances are calculated by formulas (29) to (31):
$\mathrm{var}_L(S_{Low,sample}(i,j)),\quad j = 35, 36, \ldots, 69$ (29)
$\mathrm{var}_{H1}(S_{High,sample}(i,j)),\quad j = 70, 71, \ldots, 104$ (30)
$\mathrm{var}_{H2}(S_{High,sample}(i,j)),\quad j = 105, 106, \ldots, 139$ (31)
where formula (29) is the variance of the amplitude spectrum of the high frequency band of the low-frequency part of the sample narrowband signal, formula (30) is the variance of the amplitude spectrum of the first sub-band region, and formula (31) is the variance of the amplitude spectrum of the second sub-band region. var() denotes the variance, and the variance of a spectrum can be calculated from the corresponding frequency domain coefficients. $S_{Low,sample}(i,j)$ denotes the frequency domain coefficients of the sample narrowband signal. The low-frequency domain coefficients of the sample narrowband signal may also be the filtered coefficients $S_{Low,sample\_rev}(i,j)$, i.e. $S_{Low,sample}(i,j)$ in formulas (29) to (31) is replaced by $S_{Low,sample\_rev}(i,j)$.
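As a minimal sketch of formulas (29) to (31), the three variances could be computed as below. The function name `band_variances`, the use of the population variance, and the layout of the two input lists are assumptions made for illustration; the text does not fix them.

```python
from statistics import pvariance


def band_variances(low_amp, high_amp):
    """Return (var_L, var_H1, var_H2) for one frame.

    low_amp:  70 amplitude values for frequency points 0..69 (hypothetical layout)
    high_amp: 70 amplitude values for frequency points 70..139 (hypothetical layout)
    """
    if len(low_amp) != 70 or len(high_amp) != 70:
        raise ValueError("expected 70 low-band and 70 high-band values")
    var_l = pvariance(low_amp[35:70])     # formula (29): frequency points 35..69
    var_h1 = pvariance(high_amp[0:35])    # formula (30): frequency points 70..104
    var_h2 = pvariance(high_amp[35:70])   # formula (31): frequency points 105..139
    return var_l, var_h1, var_h2
```

A perfectly flat sub-band region yields a variance of 0, while a strongly oscillating one yields a large value.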
Based on the above three variances, the relative flatness information of the amplitude spectrum of each subband region and the amplitude spectrum of the high frequency band of the low frequency part is determined by the formula (32) and the formula (33):
where fc (0) represents first relative flatness information of the amplitude spectrum of the first sub-band region and the amplitude spectrum of the high frequency band of the low frequency part, and fc (1) represents second relative flatness information of the amplitude spectrum of the second sub-band region and the amplitude spectrum of the high frequency band of the low frequency part.
The two values fc(0) and fc(1) can each be classified as greater than or equal to 0, or less than 0 (in the embodiments of the present application, 1 denotes greater than or equal to 0 and 0 denotes less than 0). fc(0) and fc(1) are thus defined as a binary array, and the array has 4 possible combinations: {0,0}, {0,1}, {1,0}, and {1,1}.
Thus, the relative flatness information output by the model may be 4 probability values for identifying the probability that the relative flatness information belongs to the 4 arrays described above.
According to the maximum-probability principle, one of the 4 arrays can be selected as the predicted relative flatness information between the amplitude spectra of the two sub-band regions (the extension regions) and the amplitude spectrum of the high frequency band of the low-frequency part. Specifically, this can be expressed by formula (34):
$v(i,k) = 0 \text{ or } 1,\quad k = 0, 1$ (34)
where v(i,k) denotes the relative flatness information between the amplitude spectra of the two sub-band regions (the extension regions) and the amplitude spectrum of the high frequency band of the low-frequency part, and k is the index of the sub-band region. Each sub-band region may correspond to one piece of relative flatness information. For example, when k = 0, v(i,k) = 0 indicates that the first sub-band region oscillates relative to the low-frequency part, i.e. has poor flatness, and v(i,k) = 1 indicates that the first sub-band region is flat relative to the low-frequency part, i.e. has good flatness.
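The maximum-probability selection can be sketched as follows. The ordering of the four combinations in `COMBINATIONS` and the function name `decode_flatness` are assumptions; the text only lists the four arrays and does not state which model output corresponds to which array.

```python
# ASSUMPTION: the four model outputs are ordered like the arrays listed in the text.
COMBINATIONS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def decode_flatness(probs):
    """Pick (v(i,0), v(i,1)) from 4 probabilities by maximum probability."""
    if len(probs) != 4:
        raise ValueError("expected 4 probability values")
    best = max(range(4), key=lambda n: probs[n])
    return COMBINATIONS[best]
```

For example, under this assumed ordering, the probabilities `[0.1, 0.2, 0.6, 0.1]` select the array {1,0}: the first sub-band region is treated as flat and the second as oscillating.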
Step S5, generating a high-frequency amplitude spectrum:
As described above, the low-frequency amplitude spectrum (frequency points 35-69, 35 points in total) is copied twice to generate the initial high-frequency amplitude spectrum (70 frequency points in total). Based on the initial low-frequency domain coefficients corresponding to the narrowband signal, or on the filtered low-frequency domain coefficients, the trained neural network model yields the predicted relative flatness information of the high-frequency part of the target broadband spectrum. Since the frequency domain coefficients of the first low-frequency spectrum corresponding to frequency points 35-69 are selected in this example, the trained neural network model can predict the relative flatness information of at least two sub-band regions of the high-frequency part of the target broadband spectrum; that is, the high-frequency part of the target broadband spectrum is divided into at least two sub-band regions. Taking 2 sub-band regions as an example, the output of the neural network model is the relative flatness information for these 2 sub-band regions.
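The copy operation described above can be illustrated with a short sketch. The function name `initial_high_amplitude` and the list-based representation of the amplitude spectrum are assumptions for illustration only.

```python
def initial_high_amplitude(low_amp):
    """Build 70 initial high-frequency amplitudes from frequency points 35..69.

    low_amp: 70 low-frequency amplitude values, index = frequency point (hypothetical layout).
    """
    if len(low_amp) != 70:
        raise ValueError("expected 70 low-frequency amplitude values")
    master = list(low_amp[35:70])   # the 35-point "master" band
    return master + master          # first copy -> points 70..104, second copy -> 105..139
```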
The reconstructed high-frequency amplitude spectrum is post-filtered according to the predicted relative flatness information corresponding to the 2 sub-band regions (band extension regions). Taking the first sub-band region as an example, the main steps include:
(1) Parse v(i,k): a value of 1 indicates that the high-frequency part is very flat, and a value of 0 indicates that the high-frequency part oscillates.
(2) The 35 frequency points in the first sub-band region are divided into 7 sub-bands. The high-frequency spectral envelope comprises 14 first sub-spectral envelopes and the low-frequency spectral envelope comprises 14 second sub-spectral envelopes, and each sub-band may correspond to one first sub-spectral envelope. The average energy pow_env of each sub-band (the spectral energy information corresponding to the second sub-spectral envelope) is calculated, and the mean Mpow_env of the 7 average energies (the spectral energy information of the sub-band region corresponding to the second sub-spectral envelope) is calculated. The average energy of each sub-band is determined from the corresponding low-frequency amplitude spectrum: for example, the square of the absolute value of each spectral coefficient of the low-frequency amplitude spectrum is taken as the energy of that coefficient, and if one sub-band corresponds to 5 spectral coefficients of the low-frequency amplitude spectrum, the mean of their energies is taken as the average energy of the sub-band.
(3) Calculate the gain adjustment value of each first sub-spectral envelope based on the parsed relative flatness information corresponding to the first sub-band region, the average energy pow_env, and the mean Mpow_env, specifically:
when v(i,k) = 1, $G(j) = a_1 + b_1 \cdot \mathrm{SQRT}(Mpow\_env / pow\_env(j)),\quad j = 0, 1, \ldots, 6$;
when v(i,k) = 0, $G(j) = a_0 + b_0 \cdot \mathrm{SQRT}(Mpow\_env / pow\_env(j)),\quad j = 0, 1, \ldots, 6$;
where, in this example, $a_1 = 0.875$, $b_1 = 0.125$, $a_0 = 0.925$, $b_0 = 0.075$, and G(j) is the gain adjustment value.
In this case, for v (i, k) =0, the gain adjustment value is 1, i.e., no flattening operation (adjustment) of the high-frequency spectral envelope is required.
(4) In this way, the gain adjustment value corresponding to each first sub-spectral envelope in the high-frequency spectral envelope $e_{High}(i,k)$ can be determined. Each first sub-spectral envelope is then adjusted by its gain adjustment value, which reduces the average energy difference between sub-bands and flattens the spectrum corresponding to the first sub-band region to varying degrees.
It will be appreciated that the high-frequency spectral envelope corresponding to the second sub-band region may be adjusted in the same manner, which is not described again here. The high-frequency spectral envelope covers 14 sub-bands in total, so 14 gain adjustment values can be determined, and the corresponding sub-spectral envelopes are adjusted based on these 14 gain adjustment values.
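Steps (2) and (3) can be sketched for one sub-band region as follows. This is an illustrative sketch, not the patented implementation: the function name `gain_adjustments`, the input layout, and the small floor `EPS` that guards against division by zero are assumptions, and the two gain formulas are applied exactly as written with the example constants.

```python
import math

A1, B1 = 0.875, 0.125   # constants used when v(i,k) = 1
A0, B0 = 0.925, 0.075   # constants used when v(i,k) = 0
EPS = 1e-12             # ASSUMPTION: floor to avoid dividing by a zero energy


def gain_adjustments(low_amp_region, v):
    """Return the 7 gain adjustment values G(j) for one sub-band region.

    low_amp_region: the 35 low-frequency amplitude values associated with the
                    region (hypothetical layout, 5 values per sub-band).
    v:              relative flatness flag of the region, 0 or 1.
    """
    if len(low_amp_region) != 35 or v not in (0, 1):
        raise ValueError("expected 35 amplitudes and v in {0, 1}")
    # average energy of each of the 7 sub-bands (5 coefficients each)
    pow_env = [sum(x * x for x in low_amp_region[5 * j:5 * j + 5]) / 5
               for j in range(7)]
    mpow_env = sum(pow_env) / 7
    a, b = (A1, B1) if v == 1 else (A0, B0)
    return [a + b * math.sqrt(mpow_env / max(p, EPS)) for p in pow_env]
```

Sub-bands whose average energy is above the mean receive a gain below 1 and those below the mean receive a gain above 1, which is what narrows the energy differences between sub-bands.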
Further, the difference between the adjusted high-frequency spectral envelope and the low-frequency spectral envelope is determined, and the initial high-frequency amplitude spectrum is adjusted based on this difference to obtain the target high-frequency amplitude spectrum $P_{High}(i,j)$.
Step S6, generating a high-frequency spectrum:
Generating the corresponding high-frequency phase spectrum $Ph_{High}(i,j)$ based on the low-frequency phase spectrum $Ph_{Low}(i,j)$ may use either of the following methods:
First method: the low-frequency phase spectrum is copied to obtain the corresponding high-frequency phase spectrum.
Second method: the low-frequency phase spectrum is flipped to obtain a phase spectrum identical to the low-frequency phase spectrum, and the two low-frequency phase spectra are mapped to the corresponding high-frequency points to obtain the corresponding high-frequency phase spectrum.
The high-frequency domain coefficients $S_{High}(i,j)$ are generated from the high-frequency amplitude spectrum and the high-frequency phase spectrum; a high-frequency spectrum is generated based on the low-frequency domain coefficients and the high-frequency domain coefficients.
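A minimal sketch of the first method, followed by the combination of amplitude and phase into complex coefficients, is given below. It assumes that the amplitude spectrum holds linear magnitudes, that phases are in radians, and that both lists have the same length; the function name `high_coefficients` is hypothetical.

```python
import cmath


def high_coefficients(high_amp, low_phase):
    """Combine a high-frequency amplitude spectrum with a copied low-frequency phase.

    ASSUMPTION: high_amp holds linear magnitudes and low_phase holds radians.
    """
    if len(high_amp) != len(low_phase):
        raise ValueError("amplitude and phase spectra must have equal length")
    high_phase = list(low_phase)   # first method: copy the low-frequency phase
    # magnitude * exp(1j * phase) for every frequency point
    return [cmath.rect(a, p) for a, p in zip(high_amp, high_phase)]
```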
Step S7, high-frequency post-filtering (the initial spectrum in the step is the initial high-frequency spectrum):
High-frequency post-filtering filters the obtained initial high-frequency domain coefficients; the filtered initial high-frequency domain coefficients are recorded as the high-frequency domain coefficients. In the filtering process, the initial high-frequency domain coefficients are filtered by a filter gain determined from the high-frequency domain coefficients, as shown in formula (35):
$S_{High\_rev}(i,j) = G_{post\_filt}(j) \cdot S_{High}(i,j)$ (35)
where $G_{post\_filt}(j)$ is the filter gain calculated from the high-frequency domain coefficients, $S_{High}(i,j)$ is the initial high-frequency domain coefficient, and $S_{High\_rev}(i,j)$ is the high-frequency domain coefficient obtained by filtering.
In this example, it is assumed that a filtering gain is shared by every 5 initial high frequency domain coefficients in the same subband, and the calculating process of the filtering gain is specifically as follows:
(1) The initial high-frequency domain coefficients are divided into bands, e.g. every 5 adjacent initial high-frequency domain coefficients are combined into one sub-spectrum, which gives 14 sub-bands in this example. The average energy is calculated for each sub-band. In particular, the energy of each frequency point (i.e., each initial high-frequency domain coefficient described above) is defined as the sum of the square of its real part and the square of its imaginary part. The energy values of 5 adjacent frequency points are calculated by formula (36), and the mean of these 5 energy values is the first initial spectral energy of the current sub-spectrum:
where $S_{High}(i,j)$ is the initial high-frequency domain coefficient, Real and Imag are the real part and the imaginary part of the initial high-frequency domain coefficient respectively, Pe(k) is the first initial spectral energy, and k=0, 1, ….
(2) Based on the inter-frame correlation, a first spectral energy of the current sub-spectrum is calculated by at least one of equation (37) and equation (38):
$Fe(k) = 1.0 + Pe(k) + Pe_{pre}(k)$ (37)
$Fe\_sm(k) = (Fe(k) + Fe_{pre}\_sm(k)) / 2$ (38)
where Fe(k) is the second spectral energy of the current sub-spectrum, Pe(k) is the first initial spectral energy of the current sub-spectrum of the current speech frame, $Pe_{pre}(k)$ is the first initial spectral energy of the sub-spectrum, in the associated speech frame of the current speech frame, that corresponds to the current sub-spectrum, Fe_sm(k) is the first spectral energy after cumulative averaging, i.e. the finally determined first spectral energy, and $Fe_{pre}\_sm(k)$ is the first spectral energy of the sub-spectrum of the associated speech frame that corresponds to the current sub-spectrum. The associated speech frame is at least one speech frame preceding and adjacent to the current speech frame, so that short-term and long-term correlations between speech signal frames are fully taken into account.
(3) Calculating a spectrum inclination coefficient of the initial spectrum, equally dividing a frequency band corresponding to the initial spectrum into a first sub-band and a second sub-band, and respectively calculating first sub-band energy of the first sub-band and second sub-band energy of the second sub-band, wherein a calculation formula (39) is as follows:
where e1 is the first sub-band energy of the first sub-band and e2 is the second sub-band energy of the second sub-band. Formula (39) is described with Fe_sm(k) as the first spectral energy of the sub-spectrum. It will be understood that when Fe(k) is used as the first spectral energy of the sub-spectrum, the sub-band energy can be calculated by the scheme shown in formula (5) (where Fe(k) in formula (5) is taken for the initial high-frequency spectrum), i.e. Fe_sm(k) in formula (39) can be replaced by Fe(k).
Next, from e1 and e2, the spectral tilt coefficients of the initial spectrum are determined based on the following logic:
If(e2>=e1):
T_para=0;
Else:
T_para=8*f_cont_high*SQRT((e1-e2)/(e1+e2));
T_para=min(1.0,T_para);
T_para=T_para/7;
where T_para is the spectral tilt coefficient, SQRT is the square-root operation, f_cont_high=0.07 is a preset filter coefficient, and 7 is half of the total number of sub-spectra.
(4) The second filter gain for each sub-spectrum is calculated according to the following equation (40):
$gain_{f0}(k) = Fe\_sm(k)^{f\_cont\_high}$ (40)
where $gain_{f0}(k)$ is the second filter gain of the k-th sub-spectrum, f_cont_high=0.07 is a preset filter coefficient, and Fe_sm(k) is the smoothed first spectral energy of the k-th sub-spectrum calculated according to formula (38), k=0, 1, …, 13. Similarly, when Fe(k) is used as the first spectral energy of the sub-spectrum, Fe_sm(k) in formula (40) may be replaced by Fe(k).
Then, if the spectral tilt coefficient T_para is positive, the second filter gain $gain_{f0}(k)$ also needs to be further adjusted according to formula (41):
If(T_para>0):
$gain_{f1}(k) = gain_{f0}(k) \cdot (1 + k \cdot T\_para)$ (41)
where $gain_{f1}(k)$ is the second filter gain adjusted according to the spectral tilt coefficient T_para.
(5) The filter gain value of the high frequency post-filter is obtained according to the following formula (42):
$G_{post\_filt}(k) = (1 + gain_{f1}(k)) / 2$ (42)
where $gain_{f1}(k)$ is the filter gain adjusted according to formula (41), and $G_{post\_filt}(k)$ is the filter gain (i.e., the second filter gain), finally obtained from $gain_{f1}(k)$, of the 5 high-frequency domain coefficients corresponding to the k-th sub-spectrum.
Specifically, after the second filter gain $G_{post\_filt}(k)$ corresponding to the k-th sub-spectrum is determined, since the first filter gain comprises a first number (e.g., L=14) of second filter gains $G_{post\_filt}(k)$, and each second filter gain $G_{post\_filt}(k)$ is the filter gain of the N spectral coefficients corresponding to the k-th sub-spectrum, the first filter gain $G_{post\_filt}(j)$ can be obtained.
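Steps (1) to (5) above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: formulas (36) and (39) are not reproduced in this text, so the band energy is taken as the mean of Real² + Imag² over 5 points (as the prose describes) and e1, e2 are assumed to be the sums of Fe_sm(k) over the lower and upper 7 sub-spectra. The function name `post_filter_gains` and the way previous-frame state is passed in are also hypothetical.

```python
import math

F_CONT_HIGH = 0.07   # preset filter coefficient from the text
NUM_SUB = 14         # number of sub-spectra
BINS = 5             # coefficients sharing one filter gain


def post_filter_gains(coeffs, pe_prev, fe_sm_prev, f_cont=F_CONT_HIGH):
    """Return (per-point gains, pe, fe_sm) for 70 complex coefficients of one frame."""
    # (1) first initial spectral energy of each sub-spectrum
    pe = [sum(c.real ** 2 + c.imag ** 2 for c in coeffs[BINS * k:BINS * (k + 1)]) / BINS
          for k in range(NUM_SUB)]
    # (2) inter-frame terms, formulas (37) and (38)
    fe = [1.0 + pe[k] + pe_prev[k] for k in range(NUM_SUB)]
    fe_sm = [(fe[k] + fe_sm_prev[k]) / 2 for k in range(NUM_SUB)]
    # (3) spectral tilt; ASSUMPTION: e1 and e2 are sums over the two halves
    half = NUM_SUB // 2
    e1, e2 = sum(fe_sm[:half]), sum(fe_sm[half:])
    if e2 >= e1:
        t_para = 0.0
    else:
        t_para = min(1.0, 8 * f_cont * math.sqrt((e1 - e2) / (e1 + e2))) / half
    # (4) and (5) gains per sub-spectrum, formulas (40) to (42), expanded per point
    gains = []
    for k in range(NUM_SUB):
        gain = fe_sm[k] ** f_cont              # formula (40)
        if t_para > 0:
            gain *= 1 + k * t_para             # formula (41)
        gains.extend([(1 + gain) / 2] * BINS)  # formula (42), shared by 5 points
    return gains, pe, fe_sm
```

Formula (35) then amounts to `[g * c for g, c in zip(gains, coeffs)]`, and the returned `pe` and `fe_sm` would be carried over as the previous-frame values for the next frame.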
Step S8, frequency-time transform, i.e. inverse short-time Fourier transform (iSTFT):
The band-extended wideband signal is obtained based on the low-frequency spectrum and the high-frequency spectrum.
Specifically, the low-frequency domain coefficients $S_{Low\_rev}(i,j)$ and the high-frequency domain coefficients $S_{High\_rev}(i,j)$ are combined to generate a high-frequency spectrum, and an inverse time-frequency transform is performed based on the low-frequency spectrum and the high-frequency spectrum to generate a new speech frame $s_{Rec}(i,j)$, i.e. the wideband signal. At this point, the effective spectrum of the narrowband signal has been extended to 7000 Hz.
In the second example, shown in fig. 4, the time-frequency transform is the MDCT. In the first example, the time-frequency transform of the narrowband signal is based on the STFT, and according to classical signal theory each frequency point contains amplitude information and phase information. In the first example, the phase of the high-frequency part is mapped directly from the low-frequency part, which introduces some error; the second example therefore uses the MDCT. The MDCT still uses windowing and overlap processing similar to the first example, but the generated MDCT coefficients are real numbers that carry more information, so band extension can be completed using only the correlation between the high-frequency and low-frequency MDCT coefficients, with a neural network model similar to that of the first example. The specific process comprises the following steps:
Step T1, front-end signal processing:
The narrowband signal to be processed is up-sampled by a factor of 2, and an up-sampled signal with a sampling rate of 16000 Hz is output.
Since the sampling rate of the narrowband signal to be processed is 8000 Hz and the frame length is 10 ms, the up-sampled signal corresponds to 160 sample points (frequency points). The modified discrete cosine transform (MDCT) is applied to the up-sampled signal as follows: the 160 sample points of the previous speech frame and the 160 sample points of the current speech frame (the narrowband signal to be processed) form an array of 320 sample points. The sample points in the array are then windowed with a cosine window, and the MDCT is applied to the windowed signal $s_{Low}(i,j)$ to obtain 160 low-frequency domain coefficients $S_{Low}(i,j)$, where i is the frame index of the speech frame and j is the intra-frame sample index (j=0, 1, …, 159).
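The framing, windowing and transform just described can be sketched with a direct (non-fast) MDCT. The window shape `sin(pi * (n + 0.5) / (2N))`, the unnormalized MDCT definition, and the function name `mdct_frame` are assumptions; the text only states that a cosine window is applied before the MDCT.

```python
import math

N = 160  # coefficients per frame (10 ms at 16000 Hz)


def mdct_frame(prev_frame, cur_frame):
    """Return 160 real MDCT coefficients from two consecutive 160-sample frames."""
    if len(prev_frame) != N or len(cur_frame) != N:
        raise ValueError("each frame must contain 160 samples")
    x = list(prev_frame) + list(cur_frame)            # array of 320 sample points
    # ASSUMPTION: sine-shaped window, a common choice for the MDCT
    w = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]
    xw = [a * b for a, b in zip(x, w)]                # windowed signal
    # direct O(N^2) evaluation of the standard MDCT sum
    return [sum(xw[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]
```

A production implementation would normally use an FFT-based fast MDCT; the direct sum is used here only to keep the sketch short.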
Step T2, low frequency pre-filtering (the initial spectrum in this step is the initial low frequency spectrum):
Low-frequency pre-filtering filters the initial low-frequency domain coefficients obtained from the narrowband signal by the MDCT, to obtain the low-frequency domain coefficients. In the filtering process, the initial low-frequency domain coefficients are filtered by a filter gain determined from the initial low-frequency domain coefficients, as shown in formula (43):
$S_{Low\_rev}(i,j) = G_{pre\_filt}(j) \cdot S_{Low}(i,j)$ (43)
where $G_{pre\_filt}(j)$ is the filter gain calculated from the initial low-frequency domain coefficients, $S_{Low}(i,j)$ is the initial low-frequency domain coefficient, and $S_{Low\_rev}(i,j)$ is the low-frequency domain coefficient obtained by filtering.
In this example, it is assumed that a filtering gain is shared by every 5 initial low frequency domain coefficients in the same subband, where the filtering gain is calculated as follows:
(1) The initial low-frequency domain coefficients are divided into bands, e.g. every 5 adjacent initial low-frequency domain coefficients are combined into one sub-spectrum, which gives 14 sub-bands in this example. The average energy is calculated for each sub-band. In particular, the energy of each frequency point (i.e., each initial low-frequency domain coefficient described above) is defined as the sum of the square of its real part and the square of its imaginary part. The energy values of 5 adjacent frequency points are calculated by formula (44), and the mean of these 5 energy values is the first initial spectral energy of the current sub-spectrum:
where $S_{Low}(i,j)$ is the low-frequency domain coefficient obtained by the time-frequency transform (i.e., the initial low-frequency domain coefficient), Pe(k) is the first initial spectral energy, and k=0, 1, ….
(2) Based on the inter-frame correlation, a first spectral energy of the current sub-spectrum is calculated by at least one of equation (45) and equation (46):
$Fe(k) = 1.0 + Pe(k) + Pe_{pre}(k)$ (45)
$Fe\_sm(k) = (Fe(k) + Fe_{pre}\_sm(k)) / 2$ (46)
where Fe(k) is the second spectral energy of the current sub-spectrum, Pe(k) is the first initial spectral energy of the current sub-spectrum of the current speech frame, $Pe_{pre}(k)$ is the first initial spectral energy of the sub-spectrum, in the associated speech frame of the current speech frame, that corresponds to the current sub-spectrum, Fe_sm(k) is the first spectral energy after cumulative averaging, i.e. the finally determined first spectral energy, and $Fe_{pre}\_sm(k)$ is the first spectral energy of the sub-spectrum of the associated speech frame that corresponds to the current sub-spectrum. The associated speech frame is at least one speech frame preceding and adjacent to the current speech frame. Optionally, the associated speech frame is the speech frame immediately preceding the current speech frame.
(3) Calculating a spectrum inclination coefficient of the initial spectrum, equally dividing a frequency band corresponding to the initial spectrum into a first sub-band and a second sub-band, respectively calculating first sub-band energy of the first sub-band and second sub-band energy of the second sub-band, taking Fe_sm (k) as a first spectrum energy of a kth sub-spectrum as an example, and calculating a formula (47) as follows:
where e1 is the first subband energy of the first subband and e2 is the second subband energy of the second subband.
Next, from e1 and e2, the spectral tilt coefficients of the initial spectrum are determined based on the following logic:
If(e2>=e1):
T_para=0;
Else:
T_para=8*f_cont_low*SQRT((e1-e2)/(e1+e2));
T_para=min(1.0,T_para);
T_para=T_para/7;
where T_para is the spectral tilt coefficient, SQRT is the square-root operation, f_cont_low=0.035 is a preset filter coefficient, and 7 is half of the total number of sub-spectra.
(4) The second filter gain for each sub-spectrum is calculated, and likewise, taking fe_sm (k) as an example of the first spectral energy of the kth sub-spectrum, can be calculated according to the following formula (48):
$gain_{f0}(k) = Fe\_sm(k)^{f\_cont\_low}$ (48)
where $gain_{f0}(k)$ is the second filter gain of the k-th sub-spectrum, f_cont_low=0.035 is a preset filter coefficient, and Fe_sm(k) is the smoothed first spectral energy of the k-th sub-spectrum calculated according to formula (46), k=0, 1, …, 13.
Then, if the spectral tilt coefficient T_para is positive, $gain_{f0}(k)$ calculated according to formula (48) also needs to be further processed according to formula (49):
If(T_para>0):
$gain_{f1}(k) = gain_{f0}(k) \cdot (1 + k \cdot T\_para)$ (49)
where $gain_{f1}(k)$ is the second filter gain adjusted according to the spectral tilt coefficient T_para.
(5) The filter gain value of the low frequency pre-filter is calculated according to the following formula (50):
$G_{pre\_filt}(k) = (1 + gain_{f1}(k)) / 2$ (50)
where $gain_{f1}(k)$ is the second filter gain adjusted according to formula (49), and $G_{pre\_filt}(k)$ is the filter gain (i.e., the second filter gain), finally obtained from $gain_{f1}(k)$, of the 5 low-frequency domain coefficients corresponding to the k-th sub-spectrum.
Specifically, after the second filter gain $G_{pre\_filt}(k)$ corresponding to the k-th sub-spectrum is determined, since the first filter gain comprises a first number (e.g., L=14) of second filter gains $G_{pre\_filt}(k)$, and each second filter gain $G_{pre\_filt}(k)$ is the filter gain of the N spectral coefficients corresponding to the k-th sub-spectrum, the first filter gain $G_{pre\_filt}(j)$ can be obtained.
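As a worked illustration of the tilt logic and of formulas (49) and (50), consider the hypothetical values e1 = 21, e2 = 7 and $gain_{f0}(13) = 1$ (these numbers are chosen only for the arithmetic and do not come from the text). With f_cont_low = 0.035, the last sub-spectrum (k = 13) is then treated as follows:

```latex
\begin{aligned}
T\_para &= 8 \times 0.035 \times \sqrt{\frac{21 - 7}{21 + 7}} = 0.28 \times \sqrt{0.5} \approx 0.198 \\
T\_para &= \min(1.0,\ 0.198) = 0.198 \\
T\_para &= 0.198 / 7 \approx 0.0283 \\
gain_{f1}(13) &= 1 \times (1 + 13 \times 0.0283) \approx 1.368 \\
G_{pre\_filt}(13) &= (1 + 1.368) / 2 \approx 1.184
\end{aligned}
```

Under these assumed inputs, the highest sub-spectrum is boosted by roughly 18%, while for k = 0 the tilt term contributes nothing, so the adjustment grows linearly with the sub-spectrum index.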
Step T3, feature extraction:
a) Obtain the pre-filtered low-frequency domain coefficients $S_{Low\_rev}(i,j)$.
If the narrowband signal is a signal with a sampling rate of 16000 Hz and a bandwidth of 0-3500 Hz, then 70 low-frequency domain coefficients, j=0, 1, …, 69, can be determined from $S_{Low\_rev}(i,j)$ based on the sampling rate and frame length of the narrowband signal.
After the 70 low-frequency domain coefficients are obtained, the low-frequency spectral envelope of the narrowband signal may be determined based on them. The low-frequency spectral envelope may be determined from the low-frequency domain coefficients as follows:
The narrowband signal is divided into bands: for the 70 low-frequency domain coefficients, the frequency band corresponding to every 5 adjacent low-frequency domain coefficients is taken as one sub-band, giving 14 sub-bands in total, each corresponding to 5 low-frequency domain coefficients. For each sub-band, the low-frequency spectral envelope of the sub-band is defined as the average energy of the adjacent low-frequency domain coefficients. Specifically, it can be calculated by formula (51):
where $e_{Low}(i,k)$ denotes a sub-spectral envelope (the low-frequency spectral envelope of each sub-band) and k denotes the index of the sub-band; there are 14 sub-bands in total, k=0, 1, 2, …, 13, so the low-frequency spectral envelope contains 14 sub-spectral envelopes.
Thus, the 70-dimensional low-frequency domain coefficients $S_{Low\_rev}(i,j)$ and the 14-dimensional low-frequency spectral envelope $e_{Low}(i,k)$ can be used as the input of the neural network model.
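The assembly of the 84-dimensional input can be sketched as below. Formula (51) is not reproduced in this text, so the envelope is assumed here to be the plain mean of the squared coefficients in each sub-band (the actual formula may, for instance, use a logarithmic scale); the function name `build_features` is hypothetical.

```python
def build_features(low_coeffs):
    """Return the 84-dimensional feature vector for one frame.

    low_coeffs: 70 real, pre-filtered low-frequency MDCT coefficients.
    """
    if len(low_coeffs) != 70:
        raise ValueError("expected 70 low-frequency domain coefficients")
    # ASSUMPTION: sub-band envelope = mean energy of its 5 coefficients
    envelope = [sum(c * c for c in low_coeffs[5 * k:5 * k + 5]) / 5
                for k in range(14)]
    return list(low_coeffs) + envelope   # 70 coefficients + 14 envelope values
```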
Step T4, inputting a neural network model:
Input layer: the neural network model takes the 84-dimensional feature vector as input.
Output layer: since the target bandwidth of the band extension in this embodiment is 7000 Hz, the high-frequency spectral envelope $e_{High}(i,k)$ of the 14 sub-bands corresponding to the 3500-7000 Hz band needs to be predicted. In addition, 4 probability densities fc related to the flatness information may be output at the same time, i.e. the output is 18-dimensional.
The processing procedure of the neural network model in the second example is the same as that in the first example and is not described again here.
Step T5, generating a high-frequency amplitude spectrum:
Similar to the first example, the flatness relation v(i,k) between the two high-frequency sub-band regions and the low-frequency part is generated from the flatness information using a flatness analysis similar to that of the first example. Then, combined with the high-frequency spectral envelope $e_{High}(i,k)$, the high-frequency MDCT coefficients $S_{High}(i,j)$ can be generated using a procedure similar to that of the first example.
Step T6, high frequency post-filtering (the initial spectrum in the step is the initial high frequency spectrum):
High-frequency post-filtering filters the obtained initial high-frequency domain coefficients; the filtered initial high-frequency domain coefficients are recorded as the high-frequency domain coefficients. In the filtering process, the initial high-frequency domain coefficients are filtered by a filter gain determined from the high-frequency domain coefficients, as shown in formula (52):
$S_{High\_rev}(i,j) = G_{post\_filt}(j) \cdot S_{High}(i,j)$ (52)
where $G_{post\_filt}(j)$ is the filter gain calculated from the high-frequency domain coefficients, $S_{High}(i,j)$ is the initial high-frequency domain coefficient, and $S_{High\_rev}(i,j)$ is the high-frequency domain coefficient obtained by filtering.
The specific processing procedure of the high-frequency post-filtering is similar to that of the low-frequency pre-filtering, and is as follows:
In this example, it is assumed that every 5 initial high-frequency domain coefficients in the same sub-band share one filter gain. The filter gain $G_{post\_filt}(j)$ is calculated as follows:
(1) The initial high-frequency domain coefficients are divided into bands, e.g. every 5 adjacent initial high-frequency domain coefficients are combined into one sub-spectrum, which gives 14 sub-bands in this example. The average energy is calculated for each sub-band. In particular, the energy of each frequency point (i.e., each initial high-frequency domain coefficient described above) is defined as the sum of the square of its real part and the square of its imaginary part. The energy values of 5 adjacent frequency points are calculated by formula (53), and the mean of these 5 energy values is the first initial spectral energy of the current sub-spectrum:
where $S_{High}(i,j)$ is the initial high-frequency domain coefficient, Pe(k) is the first initial spectral energy, and k=0, 1, ….
(2) Based on the inter-frame correlation, a first spectral energy of the current sub-spectrum is calculated by at least one of equation (54) and equation (55):
$Fe(k) = 1.0 + Pe(k) + Pe_{pre}(k)$ (54)
$Fe\_sm(k) = (Fe(k) + Fe_{pre}\_sm(k)) / 2$ (55)
where Fe(k) is the second spectral energy of the current sub-spectrum, Pe(k) is the first initial spectral energy of the current sub-spectrum of the current speech frame, $Pe_{pre}(k)$ is the first initial spectral energy of the sub-spectrum, in the associated speech frame of the current speech frame, that corresponds to the current sub-spectrum, Fe_sm(k) is the first spectral energy after cumulative averaging, and $Fe_{pre}\_sm(k)$ is the first spectral energy of the sub-spectrum of the associated speech frame that corresponds to the current sub-spectrum. The associated speech frame is at least one speech frame preceding and adjacent to the current speech frame, so that short-term and long-term correlations between speech signal frames are fully taken into account.
(3) Calculating a spectrum inclination coefficient of the initial spectrum, equally dividing a frequency band corresponding to the initial spectrum into a first sub-band and a second sub-band, respectively calculating first sub-band energy of the first sub-band and second sub-band energy of the second sub-band, taking Fe_sm (k) as a first spectrum energy of a kth sub-spectrum as an example, and calculating a formula (56) as follows:
Where e1 is the first subband energy of the first subband and e2 is the second subband energy of the second subband.
Next, from e1 and e2, the spectral tilt coefficients of the initial spectrum are determined based on the following logic:
If(e2>=e1):
T_para=0;
Else:
T_para=8*f_cont_high*SQRT((e1-e2)/(e1+e2));
T_para=min(1.0,T_para);
T_para=T_para/7;
where T_para is the spectral tilt coefficient, SQRT is the square-root operation, f_cont_high=0.07 is a preset filter coefficient, and 7 is half of the total number of sub-spectra.
(4) The second filter gain for each sub-spectrum is calculated, taking fe_sm (k) as an example of the first spectral energy of the kth sub-spectrum, and can be calculated according to the following formula (57):
gain f0 (k)=Fe_sm(k) f_cont_high (57)
wherein, gain f0 (k) For the second filter gain of the kth sub-spectrum, f_cont_high=0.07, for a preset filter coefficient, fe_sm (k) is the first spectral energy smoothing term of the kth sub-spectrum calculated according to equation (55), i.e. the second spectral energy described above, k=0, 1, …,13.
Then, if the spectral tilt coefficient T_para is positive, the second filter gain gain_f0(k) is further adjusted according to the following formula (58):
If (T_para > 0):
    gain_f1(k) = gain_f0(k) * (1 + k * T_para) (58)
where gain_f1(k) is the second filter gain adjusted according to the spectral tilt coefficient T_para.
(5) The filter gain value of the high-frequency post-filter is obtained according to the following formula (59):
G_post_filt(k) = (1 + gain_f1(k)) / 2 (59)
where gain_f1(k) is the second filter gain adjusted according to formula (58), and G_post_filt(k) is the filter gain (i.e., the second filter gain) finally obtained from gain_f1(k) for the 5 high-frequency domain coefficients corresponding to the kth sub-spectrum.
Specifically, after the second filter gain G_post_filt(k) corresponding to the kth sub-spectrum is determined, since the first filter gain includes a first number (e.g., L = 14) of second filter gains G_post_filt(k), and each second filter gain G_post_filt(k) is the filter gain of the N spectral coefficients corresponding to the kth sub-spectrum, the first filter gain G_post_filt(j) can be obtained.
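Steps (4) and (5) can be sketched together as follows. The sketch assumes that formula (57) raises Fe_sm(k) to the power f_cont_high and that each sub-spectrum covers 5 coefficients, as stated for the high-frequency case. The function names `post_filter_gains` and `expand_gains` are hypothetical.

```python
F_CONT_HIGH = 0.07            # preset filter coefficient stated in the text
COEFFS_PER_SUBSPECTRUM = 5    # coefficients per sub-spectrum (high-frequency case)


def post_filter_gains(fe_sm, t_para):
    """Return G_post_filt(k) for each sub-spectrum energy Fe_sm(k)."""
    gains = []
    for k, energy in enumerate(fe_sm):
        gain_f0 = energy ** F_CONT_HIGH            # formula (57), assumed to be a power
        if t_para > 0:
            gain_f1 = gain_f0 * (1 + k * t_para)   # formula (58)
        else:
            gain_f1 = gain_f0
        gains.append((1 + gain_f1) / 2)            # formula (59)
    return gains


def expand_gains(gains, coeffs_per_subspectrum=COEFFS_PER_SUBSPECTRUM):
    """Repeat each sub-spectrum gain once per spectral coefficient, giving G_post_filt(j)."""
    return [g for g in gains for _ in range(coeffs_per_subspectrum)]
```

With 14 sub-spectra, `expand_gains` returns 70 per-coefficient gains that can be multiplied element-wise with the 70 high-frequency domain coefficients.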
Step T7, frequency-time transform, i.e., inverse modified discrete cosine transform (iMDCT):
A wideband signal with an expanded frequency band is obtained based on the low-frequency spectrum and the high-frequency spectrum.
Specifically, the low-frequency domain coefficients S_Low_rev(i, j) and the high-frequency domain coefficients S_High_rev(i, j) are combined, and an inverse time-frequency transform is performed based on the low-frequency spectrum and the high-frequency spectrum to generate a new speech frame s_Rec(i, j), i.e., the wideband signal. At this point, the effective spectrum of the narrowband signal has been extended to 7000 Hz.
In a voice communication scenario where PSTN and VoIP interoperate, the VoIP side can only receive narrowband speech from the PSTN (sampling rate 8 kHz, effective bandwidth 3.5 kHz). The user's subjective impression is that the sound is not bright enough, the volume is not loud enough, and intelligibility is mediocre. With the technical solution disclosed in the embodiments of the present application, band expansion requires no extra bits, and the effective bandwidth can be extended to 7 kHz at the receiving end on the VoIP side. The user perceives a brighter timbre, greater loudness, and better intelligibility. In addition, the solution raises no forward-compatibility problem: the protocol does not need to be modified, and the solution remains fully compatible with the PSTN.
The method of the embodiments of the present application can be applied on the downlink side of a PSTN-VoIP channel. For example, a functional module implementing the solution can be integrated into a client on which a conference system is installed, so that band expansion of the narrowband signal is performed at the client to obtain a wideband signal. Specifically, the signal processing in this scenario is a signal post-processing technique. Taking the PSTN (whose coding scheme may be ITU-T G.711) as an example, the conference system client recovers the speech frame after completing G.711 decoding; by applying the post-processing technique of the present application to that speech frame, a VoIP user can receive a wideband signal even though the transmitting end sends a narrowband signal.
The method of the embodiments of the present application can also be applied in a mixing server of a PSTN-VoIP channel. After the mixing server performs band expansion, it sends the band-expanded wideband signal to the VoIP client; after receiving the VoIP code stream corresponding to the wideband signal, the VoIP client can recover the band-expanded wideband speech by decoding that code stream. A typical function of a mixing server is transcoding, for example, transcoding a code stream of the PSTN link (e.g., encoded with G.711) into a code stream commonly used for VoIP (such as OPUS or SILK). In the mixing server, the speech frame obtained by G.711 decoding can be up-sampled to 16000 Hz, band expansion is then completed using the solution provided by the embodiments of the present application, and the result is transcoded into a code stream commonly used for VoIP. The VoIP client receives one or more VoIP code streams and can recover the band-expanded wideband speech by decoding.
Fig. 5 is a schematic structural diagram of a band expansion device according to another embodiment of the present application. As shown in Fig. 5, the device 50 may include a low-frequency spectrum determining module 51, a correlation parameter determining module 52, a high-frequency spectrum determining module 53, and a wideband signal determining module 54, where:
the low-frequency spectrum determining module 51 is configured to perform time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum;
a correlation parameter determining module 52, configured to obtain, based on the low frequency spectrum, a correlation parameter of a high frequency portion and a low frequency portion of the target broadband spectrum through a neural network model, where the correlation parameter includes at least one of a high frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes a correlation between a spectral flatness of the high frequency portion and a spectral flatness of the low frequency portion of the target broadband spectrum;
a high-frequency spectrum determining module 53, configured to obtain a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum;
a wideband signal determining module 54, configured to obtain a wideband signal with a spread frequency band based on the low frequency spectrum and the target high frequency spectrum;
wherein at least one of the low frequency spectrum or the target high frequency spectrum is a spectrum obtained by filtering the corresponding initial spectrum.
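The division of the device into four modules can be sketched as a simple pipeline. This is only a structural illustration: the class name `BandExtender`, its field names, and the use of plain callables are assumptions, and the actual transform, neural network model, and synthesis are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BandExtender:
    # Each field stands in for one module of the device 50 (hypothetical names).
    to_low_spectrum: Callable[[Sequence[float]], Sequence[float]]              # module 51
    predict_parameters: Callable[[Sequence[float]], dict]                      # module 52
    build_high_spectrum: Callable[[dict, Sequence[float]], Sequence[float]]    # module 53
    synthesize: Callable[[Sequence[float], Sequence[float]], Sequence[float]]  # module 54

    def process(self, narrowband: Sequence[float]) -> Sequence[float]:
        """Run the four modules in the order described for the device."""
        low = self.to_low_spectrum(narrowband)
        params = self.predict_parameters(low)
        high = self.build_high_spectrum(params, low)
        return self.synthesize(low, high)
```

Keeping each module behind a callable makes it possible to swap a Fourier-based front end for a discrete-cosine-based one without changing the surrounding flow.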
In one possible implementation, the low frequency spectrum determining module is configured to:
determining a first filtering gain of the initial spectrum based on the spectral energy of the initial spectrum;
and performing filtering processing on the initial frequency spectrum according to the first filtering gain.
In one possible implementation, the low frequency spectrum determining module is configured to:
dividing the initial frequency spectrum into a first number of sub-frequency spectrums, and determining first frequency spectrum energy corresponding to each sub-frequency spectrum;
determining a second filter gain corresponding to each sub-spectrum based on the respective corresponding first spectral energy of each sub-spectrum, wherein the first filter gain value comprises a first number of second filter gains;
when the low-frequency spectrum determining module carries out filtering processing on the initial frequency spectrum according to the first filtering gain, each corresponding sub-frequency spectrum is respectively subjected to filtering processing according to the second filtering gain corresponding to each sub-frequency spectrum.
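The sub-spectrum division and per-sub-spectrum filtering described above can be sketched as follows. The sketch assumes equal-width sub-spectra and takes the energy of a sub-spectrum to be the sum of squared coefficients; both are illustrative assumptions, and all function names are hypothetical.

```python
def split_into_subspectra(spectrum, count):
    """Split a list of spectral coefficients into `count` equal-width sub-spectra."""
    if count <= 0 or len(spectrum) % count != 0:
        raise ValueError("spectrum length must be a positive multiple of count")
    size = len(spectrum) // count
    return [spectrum[i * size:(i + 1) * size] for i in range(count)]


def subspectrum_energy(sub):
    """Energy of one sub-spectrum, assumed here to be the sum of squared coefficients."""
    return sum(c * c for c in sub)


def apply_subspectrum_gains(spectrum, gains):
    """Scale every coefficient by the gain of the sub-spectrum it belongs to."""
    subs = split_into_subspectra(spectrum, len(gains))
    return [c * g for sub, g in zip(subs, gains) for c in sub]
```

The list of gains passed to `apply_subspectrum_gains` plays the role of the first number of second filter gains, one per sub-spectrum.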
In one possible implementation, the low frequency spectrum determining module is configured to:
dividing a frequency band corresponding to the initial frequency spectrum into a first sub-band and a second sub-band;
determining first sub-band energy of the first sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the first sub-band, and determining second sub-band energy of the second sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the second sub-band;
determining a spectral tilt coefficient of the initial spectrum according to the first sub-band energy and the second sub-band energy;
and determining a second filtering gain corresponding to each sub-spectrum according to the spectral tilt coefficient and the first spectrum energy corresponding to each sub-spectrum.
In one possible implementation, the narrowband signal is a speech signal of a current speech frame, and the low frequency spectrum determining module is configured to:
determining a first initial spectral energy of a sub-spectrum;
if the current voice frame is the first voice frame, determining first spectrum energy based on the first initial spectrum energy of the sub-spectrum;
if the current voice frame is not the first voice frame, acquiring first initial spectrum energy of a sub-spectrum corresponding to the sub-spectrum of the associated voice frame, wherein the associated voice frame is at least one voice frame positioned before the current voice frame and adjacent to the current voice frame;
the first spectral energy of the one sub-spectrum is obtained based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the one sub-spectrum.
In one possible implementation, the associated speech frame is the speech frame immediately preceding the current speech frame, and the low-frequency spectrum determining module, when the current speech frame is the first speech frame and the module determines the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum, is configured to:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the initialized first initial spectral energy;
determining a first spectral energy of the one sub-spectrum based on the second spectral energy of the one sub-spectrum and the initialized first spectral energy;
when the current speech frame is not the first speech frame and the module obtains the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the one sub-spectrum, the module is configured to:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum;
the first spectral energy of the one sub-spectrum is determined from the second spectral energy of the one sub-spectrum and the first spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum.
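The two-stage, frame-to-frame smoothing described above can be sketched as a small stateful helper. The exact weighting of the smoothing formula is not given in this passage, so the equal-weight averages below are placeholders chosen only for illustration, as are the class name `EnergySmoother` and the zero initial values.

```python
class EnergySmoother:
    """Tracks per-sub-spectrum energies across speech frames (illustrative sketch)."""

    def __init__(self, num_subspectra, init_pe=0.0, init_fe_sm=0.0):
        # Initialized values used when the current frame is the first speech frame.
        self.pe_pre = [init_pe] * num_subspectra
        self.fe_sm_pre = [init_fe_sm] * num_subspectra

    def update(self, pe):
        """Take the first initial energies Pe(k) of the current frame, return Fe_sm(k)."""
        # Second spectral energy Fe(k): combines current and previous initial energies.
        # Equal weighting is an assumption, not the document's formula.
        fe = [0.5 * (cur + prev) for cur, prev in zip(pe, self.pe_pre)]
        # First spectral energy Fe_sm(k): combines Fe(k) with the previous Fe_sm(k).
        fe_sm = [0.5 * (f, prev)[0] + 0.5 * prev for f, prev in zip(fe, self.fe_sm_pre)]
        self.pe_pre = list(pe)
        self.fe_sm_pre = fe_sm
        return fe_sm
```

Because the previous frame's values are kept as state, the first frame and later frames go through the same code path; only the stored values differ.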
In one possible implementation, the correlation parameters include a high-frequency spectrum envelope and relative flatness information. The neural network model comprises at least an input layer and an output layer, where the input layer receives the feature vector of the low-frequency spectrum. The output layer comprises at least a unidirectional long short-term memory (LSTM) layer and two fully-connected network layers, each connected to the LSTM layer and each comprising at least one fully-connected layer. The LSTM layer transforms the feature vector processed by the input layer; one fully-connected network layer performs a first classification process on the vector values produced by the LSTM layer and outputs the high-frequency spectrum envelope, and the other fully-connected network layer performs a second classification process on the vector values produced by the LSTM layer and outputs the relative flatness information.
In one possible implementation, the time-frequency transform comprises a Fourier transform or a discrete cosine transform; if the time-frequency transform is a Fourier transform, the correlation parameter determining module, when obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum through a neural network model based on the low-frequency spectrum, is configured to:
obtaining a low-frequency amplitude spectrum of the narrowband signal according to the low-frequency spectrum;
inputting the low-frequency amplitude spectrum into a neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum based on the output of the neural network model;
if the time-frequency transform is a discrete cosine transform, the correlation parameter determining module, when obtaining the correlation parameters of the high-frequency part and the low-frequency part of the target broadband spectrum through a neural network model based on the low-frequency spectrum, is configured to:
and inputting the low-frequency spectrum into a neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum based on the output of the neural network model.
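The difference between the two cases is whether the model input is an amplitude spectrum derived from complex coefficients or the real-valued spectrum itself. The sketch below illustrates the Fourier case with a naive discrete Fourier transform written for clarity rather than speed; the function names are hypothetical, and windowing and framing are omitted.

```python
import cmath


def half_dft(frame):
    """Naive DFT of a real frame, returning bins 0..N/2 as complex numbers."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def amplitude_spectrum(spectrum):
    """Magnitude of each complex coefficient: the low-frequency amplitude spectrum."""
    return [abs(c) for c in spectrum]


def phase_spectrum(spectrum):
    """Phase of each complex coefficient, kept for later high-frequency phase generation."""
    return [cmath.phase(c) for c in spectrum]
```

In the discrete cosine case the coefficients are already real, so no amplitude/phase split is needed and the spectrum can be passed to the model directly.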
In one possible implementation, the time-frequency transform comprises a fourier transform or a discrete cosine transform;
if the time-frequency transform is a Fourier transform, the high-frequency spectrum determining module, when obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum, is configured to:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency amplitude spectrum based on the low frequency amplitude spectrum;
based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency amplitude spectrum to obtain a target high-frequency amplitude spectrum;
generating a corresponding high-frequency phase spectrum based on the low-frequency phase spectrum of the narrowband signal;
obtaining a target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
if the time-frequency transform is a discrete cosine transform, the high-frequency spectrum determining module, when obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum, is configured to:
obtaining a low-frequency spectrum envelope of the narrowband signal according to the low-frequency spectrum;
generating an initial high frequency spectrum based on the low frequency spectrum;
and adjusting the initial high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope to obtain a target high-frequency spectrum.
In one possible implementation, the high frequency spectrum determination module, when generating the initial high frequency amplitude spectrum based on the low frequency amplitude spectrum, is configured to:
copying the amplitude spectrum of the high-frequency band part in the low-frequency amplitude spectrum;
the high-frequency spectrum determining module, when generating an initial high-frequency spectrum based on the low-frequency spectrum, is configured to:
copying the spectrum of the high-frequency band part in the low-frequency spectrum.
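Copying the high-band part of the low-frequency spectrum can be sketched as a slice that is repeated until the high band is filled. The function name `initial_high_band` and its parameters (start index, segment length, number of repeats) are hypothetical; this passage does not fix which bins are copied or how many times.

```python
def initial_high_band(low_spectrum, copy_start, copy_len, repeats):
    """Build an initial high-band spectrum by repeating a slice of the low band."""
    segment = list(low_spectrum[copy_start:copy_start + copy_len])
    if len(segment) != copy_len:
        raise ValueError("copy range exceeds the low-frequency spectrum")
    # The same slice is laid end to end `repeats` times.
    return segment * repeats
```

The same helper applies to an amplitude spectrum in the Fourier case and to the real-valued spectrum in the discrete cosine case.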
In one possible implementation, the high frequency spectral envelope and the low frequency spectral envelope are both spectral envelopes in the logarithmic domain;
the high-frequency spectrum determining module, when adjusting the initial high-frequency amplitude spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, is configured to:
determining a first difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
adjusting the initial high-frequency amplitude spectrum based on the first difference value to obtain a target high-frequency amplitude spectrum;
the high-frequency spectrum determining module, when adjusting the initial high-frequency spectrum based on the high-frequency spectrum envelope and the low-frequency spectrum envelope, is configured to:
determining a second difference of the high frequency spectrum envelope and the low frequency spectrum envelope;
and adjusting the initial high-frequency spectrum based on the second difference value to obtain a target high-frequency spectrum.
In one possible implementation, if the time-frequency transform is a fourier transform, the high-frequency spectral envelope comprises a second number of first sub-spectral envelopes, the initial high-frequency amplitude spectrum comprises a second number of first sub-amplitude spectra, wherein each first sub-spectral envelope is determined based on a corresponding first sub-amplitude spectrum in the initial high-frequency amplitude spectrum;
the high-frequency spectrum determining module, when determining the first difference of the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency amplitude spectrum based on the first difference to obtain the target high-frequency amplitude spectrum, is configured to:
determining a first difference value of each first sub-spectrum envelope and a corresponding spectrum envelope in the low-frequency spectrum envelopes;
based on the first difference value corresponding to each first sub-spectrum envelope, the corresponding first sub-amplitude spectrum is adjusted to obtain a second number of adjusted first sub-amplitude spectrums;
obtaining a target high-frequency amplitude spectrum based on the second number of adjusted first sub-amplitude spectrums;
if the time-frequency transformation is discrete cosine transformation, the high-frequency spectrum envelope comprises a third number of second sub-spectrum envelopes, the initial high-frequency spectrum comprises the third number of first sub-spectrums, and each second sub-spectrum envelope is determined based on the corresponding first sub-spectrum in the initial high-frequency spectrum;
the high-frequency spectrum determining module, when determining the second difference of the high-frequency spectrum envelope and the low-frequency spectrum envelope and adjusting the initial high-frequency spectrum based on the second difference to obtain the target high-frequency spectrum, is configured to:
determining a second difference value of each second sub-spectral envelope and a corresponding spectral envelope in the low-frequency spectral envelopes;
based on the second difference value corresponding to each second sub-spectrum envelope, the corresponding first sub-spectrum is adjusted to obtain a third number of adjusted first sub-spectrums;
and obtaining a target high-frequency spectrum based on the third number of adjusted first sub-spectrums.
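Because both envelopes are in the logarithmic domain, a difference of envelopes corresponds to a multiplicative gain on the coefficients. The sketch below assumes natural-logarithm envelopes of amplitude, so that the gain is exp(difference); the base and scaling of the logarithm used in the application are not stated in this passage, and the function name is hypothetical.

```python
import math


def adjust_sub_spectra(sub_spectra, high_env, low_env):
    """Scale each sub-spectrum by the gain implied by its log-domain envelope difference."""
    adjusted = []
    for sub, e_high, e_low in zip(sub_spectra, high_env, low_env):
        diff = e_high - e_low    # first or second difference, per sub-spectrum
        gain = math.exp(diff)    # assumes natural-log amplitude envelopes
        adjusted.append([c * gain for c in sub])
    return adjusted
```

Concatenating the adjusted sub-spectra in order then gives the target high-frequency amplitude spectrum (Fourier case) or the target high-frequency spectrum (discrete cosine case).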
In one possible implementation, the high frequency spectrum determining module is configured to, when determining the first difference or the second difference of the high frequency spectrum envelope and the low frequency spectrum envelope:
determining a gain adjustment value of the high frequency spectrum envelope based on the relative flatness information and the energy information of the low frequency spectrum;
adjusting the high-frequency spectrum envelope based on the gain adjustment value to obtain an adjusted high-frequency spectrum envelope;
a first difference or a second difference of the adjusted high frequency spectral envelope and the low frequency spectral envelope is determined.
In one possible implementation, the relative flatness information includes relative flatness information of at least two sub-band areas corresponding to the high frequency part, the relative flatness information corresponding to one sub-band area characterizing a correlation of a spectral flatness of one sub-band area of the high frequency part and a spectral flatness of a high frequency band of the low frequency part;
if the high-frequency part comprises spectrum parameters corresponding to at least two sub-band regions, the spectrum parameters of each sub-band region are obtained based on the spectrum parameters of the high-frequency band of the low-frequency part, and the relative flatness information comprises the relative flatness information between the spectrum parameters of each sub-band region and the spectrum parameters of the high-frequency band, wherein the spectrum parameters are the amplitude spectrum if the time-frequency transform is a Fourier transform, and the spectrum parameters are the spectrum if the time-frequency transform is a discrete cosine transform;
the high-frequency spectrum determining module, when determining a gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the low-frequency spectrum, is configured to:
determining a gain adjustment value of a corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum;
the high-frequency spectrum determining module, when adjusting the high-frequency spectrum envelope based on the gain adjustment value, is configured to:
and adjusting the corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope.
In one possible implementation, if the high frequency spectrum envelope includes a first predetermined number of high frequency sub-spectrum envelopes, the first predetermined number is a second number when the low frequency spectrum is obtained by fourier transform, and the first predetermined number is a third number when the low frequency spectrum is obtained by discrete cosine transform;
the high-frequency spectrum determining module, when determining a gain adjustment value of a corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum, is configured to:
for each high-frequency sub-spectrum envelope, determining a gain adjustment value of the high-frequency sub-spectrum envelope according to the spectrum energy information of the spectrum envelope in the low-frequency spectrum envelope that corresponds to the high-frequency sub-spectrum envelope, the relative flatness information of the sub-band region that corresponds to that spectrum envelope, and the spectrum energy information of the sub-band region that corresponds to that spectrum envelope;
and, when adjusting each corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope, is configured to:
and adjusting the corresponding high-frequency sub-spectrum envelopes according to the gain adjustment value of each high-frequency sub-spectrum envelope in the high-frequency spectrum envelopes.
In one possible implementation, if the narrowband signal includes at least two associated signals, the apparatus further includes:
the narrowband signal determining module is used for fusing the at least two associated signals to obtain the narrowband signal; or for using each of the at least two associated signals as a narrowband signal.
The device provided by the embodiment of the application can effectively filter quantization noise possibly introduced in the quantization process of the narrowband signal by filtering the corresponding initial frequency spectrum, so as to prevent the quantization noise from being spread to the target high-frequency spectrum in the process of carrying out frequency band expansion based on the low-frequency spectrum; the target high-frequency spectrum can also be a spectrum obtained by filtering the corresponding initial spectrum, so that noise possibly existing in the target high-frequency spectrum can be effectively filtered, the signal quality of the broadband signal is enhanced, and the hearing experience of a user is further improved.
It should be noted that this embodiment is a device embodiment corresponding to the above method embodiment, and it may be implemented in cooperation with the above method embodiment. The related technical details mentioned in the above method embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Accordingly, the related technical details mentioned in this embodiment may also be applied in the above method embodiment.
Another embodiment of the present application provides an electronic device. As shown in Fig. 6, the electronic device 600 includes a processor 601 and a memory 603. The processor 601 is coupled to the memory 603, for example via a bus 602. Further, the electronic device 600 may also include a transceiver 604. It should be noted that, in practical applications, the transceiver 604 is not limited to one, and the structure of the electronic device 600 does not limit the embodiments of the present application.
In the embodiment of the present application, the processor 601 is configured to implement the functions of the low-frequency spectrum determining module, the high-frequency spectrum determining module, and the wideband signal determining module shown in Fig. 5.
The processor 601 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 601 may also be a combination that performs computing functions, such as a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 602 may include a path for transferring information between the above components. The bus 602 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory 603 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 603 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 601. The processor 601 is configured to execute the application program code stored in the memory 603 to implement the operations of the band expansion device provided by the embodiment shown in Fig. 5.
The electronic device provided by the embodiment of the present application comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the following can be achieved: the low-frequency spectrum may be a spectrum obtained by filtering the corresponding initial spectrum, so that quantization noise possibly introduced when the narrowband signal was quantized is effectively filtered out and is prevented from spreading to the target high-frequency spectrum during band expansion based on the low-frequency spectrum; the target high-frequency spectrum may also be a spectrum obtained by filtering the corresponding initial spectrum, so that noise possibly present in the target high-frequency spectrum is effectively filtered out, the signal quality of the wideband signal is enhanced, and the listening experience of the user is further improved.
The embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method shown in the above embodiment. Wherein: the low-frequency spectrum can be a spectrum obtained by filtering the corresponding initial spectrum, so that quantization noise possibly introduced in the quantization process of the narrowband signal is effectively filtered, and the quantization noise is prevented from being spread to a target high-frequency spectrum in the process of performing frequency band spreading based on the low-frequency spectrum; the target high-frequency spectrum can also be a spectrum obtained by filtering the corresponding initial spectrum, so that noise possibly existing in the target high-frequency spectrum can be effectively filtered, the signal quality of the broadband signal is enhanced, and the hearing experience of a user is further improved.
The computer readable storage medium provided by the embodiments of the present application is applicable to any one of the embodiments of the above method.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of protection of the present application.

Claims (12)

1. A band extension method, comprising:
performing time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum;
based on the low frequency spectrum, obtaining a correlation parameter of a high frequency part and a low frequency part of a target broadband spectrum through a neural network model, wherein the correlation parameter comprises a high frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes the correlation of the spectrum flatness of the high frequency part and the spectrum flatness of the low frequency part of the target broadband spectrum;
obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum;
obtaining a broadband signal with the expanded frequency band based on the low-frequency spectrum and the target high-frequency spectrum;
wherein at least one of the low frequency spectrum and the target high frequency spectrum is a spectrum obtained by filtering a corresponding initial spectrum;
wherein the target high frequency spectrum is obtained by:
obtaining a low-frequency spectrum envelope of the narrowband signal based on the low-frequency spectrum; wherein the high frequency spectral envelope and the low frequency spectral envelope are both logarithmic domain spectral envelopes;
determining a gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the low-frequency spectrum, and adjusting the high-frequency spectrum envelope based on the gain adjustment value to obtain an adjusted high-frequency spectrum envelope;
if the time-frequency transformation is Fourier transformation, obtaining a low-frequency amplitude spectrum of the narrowband signal according to the low-frequency spectrum, generating an initial high-frequency amplitude spectrum based on the low-frequency amplitude spectrum, determining a first difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency amplitude spectrum based on the first difference value to obtain a target high-frequency amplitude spectrum, generating a corresponding high-frequency phase spectrum based on a low-frequency phase spectrum of the narrowband signal, and obtaining the target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
if the time-frequency transformation is discrete cosine transformation, obtaining a low-frequency spectrum envelope of the narrowband signal based on the low-frequency spectrum, generating an initial high-frequency spectrum based on the low-frequency spectrum, determining a second difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, and adjusting the initial high-frequency spectrum based on the second difference value to obtain the target high-frequency spectrum.
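The discrete cosine branch of claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names `log_envelope` and `extend_dct_spectrum`, the envelope definition (mean log-magnitude per sub-band), and the rule of copying the whole low-frequency spectrum as the initial high-frequency spectrum are all assumptions made for illustration. The sketch only shows how a difference between two logarithmic-domain envelopes turns into a multiplicative gain on the copied coefficients.

```python
import numpy as np

def log_envelope(spectrum, n_bands):
    """Mean log-magnitude per sub-band (an assumed envelope definition)."""
    bands = np.array_split(np.abs(spectrum), n_bands)
    return np.array([np.mean(np.log(b + 1e-12)) for b in bands])

def extend_dct_spectrum(low_spec, high_env_adj, n_bands):
    """Build a wideband spectrum from real-valued low-frequency coefficients.

    low_spec     : low-frequency coefficients (for example from a DCT)
    high_env_adj : adjusted high-frequency log envelope, one value per band
    n_bands      : number of sub-bands (hypothetical parameter)
    """
    # Initial high-frequency spectrum: here simply a copy of the low band
    # (an assumption; the claim only says it is generated from the low band).
    init_high = np.asarray(low_spec, dtype=float).copy()
    low_env = log_envelope(init_high, n_bands)
    # "Second difference value": adjusted high envelope minus low envelope.
    diff = np.asarray(high_env_adj, dtype=float) - low_env
    # A difference in the log domain becomes a multiplicative gain.
    bands = np.array_split(init_high, n_bands)
    target_high = np.concatenate([b * np.exp(d) for b, d in zip(bands, diff)])
    return np.concatenate([low_spec, target_high])
```

With this envelope definition, the envelope of the generated high band equals the adjusted envelope supplied as input, which is the purpose of the difference-based adjustment. In the Fourier branch the same scaling would be applied to the amplitude spectrum, with the phase spectrum generated separately.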
2. The method of claim 1, wherein filtering the initial spectrum comprises:
dividing the initial frequency spectrum into a first number of sub-frequency spectrums, and determining first frequency spectrum energy corresponding to each sub-frequency spectrum;
determining a second filtering gain corresponding to each sub-spectrum based on the first spectrum energy corresponding to each sub-spectrum;
and respectively carrying out filtering processing on each corresponding sub-spectrum according to the second filtering gain corresponding to each sub-spectrum.
3. The method of claim 2, wherein determining the second filter gain for each sub-spectrum based on the respective first spectral energy for each sub-spectrum comprises:
dividing a frequency band corresponding to the initial frequency spectrum into a first sub-band and a second sub-band;
determining first sub-band energy of the first sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the first sub-band, and determining second sub-band energy of the second sub-band according to first frequency spectrum energy of all sub-spectrums corresponding to the second sub-band;
determining a spectrum inclination coefficient of the initial spectrum according to the first sub-band energy and the second sub-band energy;
and determining a second filtering gain corresponding to each sub-spectrum according to the spectrum inclination coefficient and the first spectrum energy corresponding to each sub-spectrum.
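The filtering steps of claims 2 and 3 can be sketched as follows. The claims do not fix any formula, so everything numeric here is an assumption: the function name `subband_filter`, the log-energy definition, the `tanh` mapping used for the spectrum inclination (tilt) coefficient, and the gain rule with its `max_boost_db` limit are hypothetical choices. The sketch only shows the data flow: sub-spectrum energies, two sub-band energies, a tilt coefficient, then one gain per sub-spectrum.

```python
import numpy as np

def subband_filter(spec, n_sub=4, max_boost_db=3.0):
    """Filter a spectrum with one gain per sub-spectrum (illustrative only)."""
    spec = np.asarray(spec, dtype=float)
    idx = np.array_split(np.arange(spec.size), n_sub)
    # First spectrum energy of each sub-spectrum (assumed: log of summed power).
    energy = np.array([np.log(np.sum(spec[i] ** 2) + 1e-12) for i in idx])
    # Split the band into a first and a second sub-band and sum their energies.
    half = n_sub // 2
    e_first = energy[:half].sum()
    e_second = energy[half:].sum()
    # Tilt coefficient squashed into [-1, 1] (assumed mapping).
    tilt = np.tanh((e_first - e_second) / n_sub)
    # Gain per sub-spectrum from the tilt and that sub-spectrum's energy:
    # the stronger the tilt, the more each sub-spectrum is pulled to the mean.
    deviation = energy.mean() - energy
    gain_db = np.clip(abs(tilt) * deviation, -max_boost_db, max_boost_db)
    gains = 10.0 ** (gain_db / 20.0)
    out = spec.copy()
    for i, g in zip(idx, gains):
        out[i] *= g
    return out, tilt, gains
```

Under these assumptions a flat spectrum has zero tilt and passes through unchanged, while a spectrum that decays with frequency has its weaker sub-spectra raised and its stronger ones lowered, within the stated limit.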
4. The method of claim 2, wherein the narrowband signal is a speech signal of a current speech frame, and wherein determining the first spectral energy of one of the sub-spectrums comprises:
determining a first initial spectral energy of the one sub-spectrum of the current speech frame;
if the current speech frame is a first speech frame, determining a first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum;
if the current voice frame is not the first voice frame, acquiring first initial spectrum energy of a sub-spectrum corresponding to the sub-spectrum of an associated voice frame, wherein the associated voice frame is at least one voice frame positioned before the current voice frame and adjacent to the current voice frame;
and obtaining the first spectrum energy of the one sub-spectrum based on the first initial spectrum energy of the one sub-spectrum and the first initial spectrum energy of the sub-spectrum corresponding to the one sub-spectrum of the associated voice frame.
5. The method of claim 4, wherein the associated speech frame is a preceding speech frame to the current speech frame, wherein,
if the current speech frame is a first speech frame, the determining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum includes:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the initialized first initial spectral energy;
determining a first spectral energy of the one sub-spectrum based on the second spectral energy of the one sub-spectrum and the initialized first spectral energy;
if the current speech frame is not the first speech frame, the obtaining the first spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the associated speech frame corresponding to the one sub-spectrum includes:
determining a second spectral energy of the one sub-spectrum based on the first initial spectral energy of the one sub-spectrum and the first initial spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum;
the first spectral energy of the one sub-spectrum is determined from the second spectral energy of the one sub-spectrum and the first spectral energy of the sub-spectrum of the previous speech frame corresponding to the one sub-spectrum.
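The two-stage recursion of claims 4 and 5 can be sketched as a small stateful smoother. The claims state which quantities are combined but not how, so the weighted averages and the weights `alpha` and `beta` below are assumptions, as are the class name `EnergySmoother` and the initial value `init_energy` used for the first speech frame.

```python
import numpy as np

class EnergySmoother:
    """Smooths per-sub-spectrum energies across speech frames (illustrative)."""

    def __init__(self, n_sub, init_energy=0.0, alpha=0.5, beta=0.5):
        self.alpha = alpha  # weight of the current initial energy (assumed)
        self.beta = beta    # weight of the second spectral energy (assumed)
        # Initialized values that stand in for a previous frame on frame one.
        self.prev_initial = np.full(n_sub, float(init_energy))
        self.prev_first = np.full(n_sub, float(init_energy))

    def update(self, initial_energy):
        initial_energy = np.asarray(initial_energy, dtype=float)
        # Second spectral energy: current and previous initial energies.
        second = (self.alpha * initial_energy
                  + (1.0 - self.alpha) * self.prev_initial)
        # First spectral energy: second energy and previous first energy.
        first = self.beta * second + (1.0 - self.beta) * self.prev_first
        self.prev_initial = initial_energy
        self.prev_first = first
        return first
```

Because the initialized values play the role of the previous frame, the first frame and later frames go through the same code path, which mirrors the two cases in claim 5.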
6. The method according to any one of claims 1 to 5, wherein,
the neural network model comprises at least an input layer and an output layer, wherein the feature vector of the low-frequency spectrum is input to the input layer; the output layer comprises at least a unidirectional long short-term memory (LSTM) layer and two fully connected network layers, each connected to the LSTM layer and each comprising at least one fully connected layer; the LSTM layer transforms the feature vector processed by the input layer; one fully connected network layer performs first classification processing on the vector values produced by the LSTM layer and outputs the high-frequency spectrum envelope; and the other fully connected network layer performs second classification processing on the vector values produced by the LSTM layer and outputs the relative flatness information.
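The topology described in claim 6, one LSTM layer feeding two separate fully connected heads, can be sketched in plain NumPy. This is a structural illustration only: the class name `TwoHeadLSTM`, all layer sizes, the random untrained weights, and the use of a single linear layer per head are assumptions, and no training procedure is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoHeadLSTM:
    """One unidirectional LSTM cell with two linear output heads."""

    def __init__(self, n_in, n_hidden, n_env, n_flat, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell, output gates.
        self.W = rng.normal(0.0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        # Head 1: high-frequency spectrum envelope (hypothetical size n_env).
        self.W_env = rng.normal(0.0, 0.1, (n_env, n_hidden))
        self.b_env = np.zeros(n_env)
        # Head 2: relative flatness information (hypothetical size n_flat).
        self.W_flat = rng.normal(0.0, 0.1, (n_flat, n_hidden))
        self.b_flat = np.zeros(n_flat)
        # Recurrent state carried from frame to frame (past frames only).
        self.h = np.zeros(n_hidden)
        self.c = np.zeros(n_hidden)

    def step(self, x):
        """Process one frame's feature vector and return both outputs."""
        z = self.W @ np.concatenate([x, self.h]) + self.b
        i, f, g, o = np.split(z, 4)
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        envelope = self.W_env @ self.h + self.b_env
        flatness = self.W_flat @ self.h + self.b_flat
        return envelope, flatness
```

Both heads read the same hidden state, so the envelope and the flatness estimate share one temporal model of the low-frequency features while keeping separate output weights.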
7. The method according to any one of claims 1 to 5, wherein the time-frequency transform comprises a fourier transform or a discrete cosine transform;
if the time-frequency transformation is fourier transformation, the obtaining, based on the low-frequency spectrum, correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum through a neural network model includes:
obtaining a low-frequency amplitude spectrum of the narrowband signal according to the low-frequency spectrum;
inputting the low-frequency amplitude spectrum into the neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum based on the output of the neural network model;
if the time-frequency transformation is discrete cosine transformation, the obtaining, based on the low-frequency spectrum, correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum through a neural network model includes:
and inputting the low-frequency spectrum into the neural network model, and obtaining correlation parameters of a high-frequency part and a low-frequency part of the target broadband spectrum based on the output of the neural network model.
8. The method of claim 1, wherein if the time-frequency transform is a fourier transform, the high-frequency spectral envelope comprises a second number of first sub-spectral envelopes, the initial high-frequency amplitude spectrum comprises the second number of first sub-amplitude spectra, wherein each of the first sub-spectral envelopes is determined based on a corresponding first sub-amplitude spectrum in the initial high-frequency amplitude spectrum;
the determining a first difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency amplitude spectrum based on the first difference value, and obtaining the target high-frequency amplitude spectrum includes:
determining a first difference value of each first sub-spectral envelope and a corresponding spectral envelope in the low-frequency spectral envelopes;
based on the first difference value corresponding to each first sub-spectrum envelope, adjusting the corresponding first sub-amplitude spectrum to obtain the second number of adjusted first sub-amplitude spectrums;
obtaining the target high-frequency amplitude spectrum based on the second number of adjusted first sub-amplitude spectrums;
if the time-frequency transformation is discrete cosine transformation, the high-frequency spectrum envelope comprises a third number of second sub-spectrum envelopes, the initial high-frequency spectrum comprises the third number of first sub-spectrums, and each second sub-spectrum envelope is determined based on the corresponding first sub-spectrum in the initial high-frequency spectrum;
the determining a second difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency spectrum based on the second difference value, and obtaining the target high-frequency spectrum includes:
determining a second difference value of each second sub-spectral envelope and a corresponding spectral envelope in the low-frequency spectral envelopes;
based on the second difference value corresponding to each second sub-spectrum envelope, adjusting the corresponding first sub-spectrum to obtain the third number of adjusted first sub-spectrums;
and obtaining the target high-frequency spectrum based on the third number of adjusted first sub-spectrums.
9. The method according to claim 1, wherein the relative flatness information includes relative flatness information corresponding to at least two sub-band areas of the high frequency part, the relative flatness information corresponding to one sub-band area characterizing a correlation of a spectral flatness of one sub-band area of the high frequency part and a spectral flatness of a high frequency band of the low frequency part;
if the high-frequency part comprises spectrum parameters corresponding to at least two sub-band regions, the spectrum parameters of each sub-band region are obtained based on the spectrum parameters of a high-frequency band of the low-frequency part, and the relative flatness information comprises the spectrum parameters of each sub-band region and the relative flatness information of the spectrum parameters of the high-frequency band, wherein if time-frequency conversion is Fourier conversion, the spectrum parameters are the amplitude spectrum, and if time-frequency conversion is discrete cosine conversion, the spectrum parameters are the frequency spectrum;
the determining a gain adjustment value of the high frequency spectrum envelope based on the relative flatness information and the energy information of the low frequency spectrum includes:
determining a gain adjustment value of a corresponding spectral envelope portion in the high-frequency spectral envelope based on the relative flatness information corresponding to each sub-band region and the spectral energy information corresponding to each sub-band region in the low-frequency spectrum;
the adjusting the high frequency spectral envelope based on the gain adjustment value includes:
and adjusting the corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope.
10. The method of claim 8, wherein if the high frequency spectral envelope comprises a first predetermined number of high frequency sub-spectral envelopes, the first predetermined number is the second number when the low frequency spectrum is obtained by fourier transform, and the first predetermined number is the third number when the low frequency spectrum is obtained by discrete cosine transform;
the determining the gain adjustment value of the corresponding spectrum envelope part in the high-frequency spectrum envelope based on the relative flatness information corresponding to each sub-band region and the spectrum energy information corresponding to each sub-band region in the low-frequency spectrum comprises:
for each high-frequency sub-spectrum envelope, determining a gain adjustment value of the high-frequency sub-spectrum envelope according to spectrum energy information corresponding to a spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, relative flatness information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope, and spectrum energy information corresponding to a sub-band region corresponding to the spectrum envelope corresponding to the high-frequency sub-spectrum envelope in the low-frequency spectrum envelope;
the adjusting the corresponding spectrum envelope part according to the gain adjustment value of each corresponding spectrum envelope part in the high-frequency spectrum envelope comprises the following steps:
and adjusting the corresponding high-frequency sub-spectrum envelopes according to the gain adjustment value of each high-frequency sub-spectrum envelope in the high-frequency spectrum envelopes.
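The per-region gain adjustment of claims 9 and 10 can be illustrated with a short sketch. The claims name the inputs (relative flatness information per sub-band region and spectrum energy information from the low-frequency spectrum) but give no formula, so the rule below is purely an assumption: relative flatness is treated as one boolean flag per region, and a flagged region has a fraction `strength` of the low-band energy fluctuation removed from the corresponding part of the logarithmic-domain envelope. The function name `adjust_high_envelope` is hypothetical.

```python
import numpy as np

def adjust_high_envelope(high_env, low_band_energy, flat_flags, strength=0.5):
    """Adjust a log-domain high-frequency envelope region by region.

    high_env        : high-frequency log envelope, one value per sub-envelope
    low_band_energy : log energy of the low-frequency sub-band that each
                      high-frequency sub-envelope corresponds to
    flat_flags      : one flag per sub-band region; True means the region is
                      predicted to be much flatter than the low-frequency band
    """
    adjusted = np.asarray(high_env, dtype=float).copy()
    low_band_energy = np.asarray(low_band_energy, dtype=float)
    regions = np.array_split(np.arange(adjusted.size), len(flat_flags))
    for idx, is_flat in zip(regions, flat_flags):
        if not is_flat:
            continue  # leave the envelope of this region as predicted
        e = low_band_energy[idx]
        # Gain adjustment value per sub-envelope: counteract part of the
        # fluctuation that would be inherited from the copied low band.
        gain_adjust = -strength * (e - e.mean())
        # In the log domain the adjustment is additive.
        adjusted[idx] += gain_adjust
    return adjusted
```

Only regions flagged as flat are touched, which matches the idea that the flatness information decides where the copied low-frequency structure needs to be smoothed.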
11. A band expansion apparatus, comprising:
the low-frequency spectrum determining module is used for carrying out time-frequency conversion on the narrowband signal to be processed to obtain a corresponding low-frequency spectrum;
the correlation parameter determining module is used for obtaining correlation parameters of a high-frequency part and a low-frequency part of a target broadband spectrum through a neural network model based on the low-frequency spectrum, wherein the correlation parameters comprise a high-frequency spectrum envelope and relative flatness information, and the relative flatness information characterizes correlation of the spectrum flatness of the high-frequency part and the spectrum flatness of the low-frequency part of the target broadband spectrum;
the high-frequency spectrum determining module is used for obtaining a target high-frequency spectrum based on the correlation parameter and the low-frequency spectrum;
the broadband signal determining module is used for obtaining broadband signals with expanded frequency bands based on the low-frequency spectrum and the target high-frequency spectrum;
wherein at least one of the low frequency spectrum or the target high frequency spectrum is a spectrum obtained by filtering a corresponding initial spectrum;
wherein the target high frequency spectrum is obtained by:
obtaining a low-frequency spectrum envelope of the narrowband signal based on the low-frequency spectrum; wherein the high frequency spectral envelope and the low frequency spectral envelope are both logarithmic domain spectral envelopes;
determining a gain adjustment value of the high-frequency spectrum envelope based on the relative flatness information and the energy information of the low-frequency spectrum, and adjusting the high-frequency spectrum envelope based on the gain adjustment value to obtain an adjusted high-frequency spectrum envelope;
if the time-frequency transformation is Fourier transformation, obtaining a low-frequency amplitude spectrum of the narrowband signal according to the low-frequency spectrum, generating an initial high-frequency amplitude spectrum based on the low-frequency amplitude spectrum, determining a first difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, adjusting the initial high-frequency amplitude spectrum based on the first difference value to obtain a target high-frequency amplitude spectrum, generating a corresponding high-frequency phase spectrum based on a low-frequency phase spectrum of the narrowband signal, and obtaining the target high-frequency spectrum according to the target high-frequency amplitude spectrum and the high-frequency phase spectrum;
if the time-frequency transformation is discrete cosine transformation, obtaining a low-frequency spectrum envelope of the narrowband signal based on the low-frequency spectrum, generating an initial high-frequency spectrum based on the low-frequency spectrum, determining a second difference value between the adjusted high-frequency spectrum envelope and the low-frequency spectrum envelope, and adjusting the initial high-frequency spectrum based on the second difference value to obtain the target high-frequency spectrum.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the band extension method of any of claims 1-10 when executing the program.
CN201910955743.7A 2019-09-18 2019-10-09 Band expansion method, device, electronic equipment and computer readable storage medium Active CN112530446B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910882477 2019-09-18
CN201910882477X 2019-09-18

Publications (2)

Publication Number Publication Date
CN112530446A (en) 2021-03-19
CN112530446B (en) 2023-10-20

Family

ID=74974456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910955743.7A Active CN112530446B (en) 2019-09-18 2019-10-09 Band expansion method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112530446B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115148217A (en) * 2022-06-15 2022-10-04 腾讯科技(深圳)有限公司 Audio processing method, device, electronic equipment, storage medium and program product

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090002842A (en) * 2007-07-04 2009-01-09 삼성전자주식회사 Method and apparatus for encoding and decoding audio signal
JP2010020251A (en) * 2008-07-14 2010-01-28 Ntt Docomo Inc Speech coder and method, speech decoder and method, speech band spreading apparatus and method
CN101996640A (en) * 2009-08-31 2011-03-30 华为技术有限公司 Frequency band expansion method and device
CN102169694A (en) * 2010-02-26 2011-08-31 华为技术有限公司 Method and device for generating psychoacoustic model
CN103026407A (en) * 2010-05-25 2013-04-03 诺基亚公司 A bandwidth extender
CN107705801A (en) * 2016-08-05 2018-02-16 中国科学院自动化研究所 The training method and Speech bandwidth extension method of Speech bandwidth extension model
WO2019081070A1 (en) * 2017-10-27 2019-05-02 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method or computer program for generating a bandwidth-enhanced audio signal using a neural network processor
CN110246515A (en) * 2019-07-19 2019-09-17 腾讯科技(深圳)有限公司 Removing method, device, storage medium and the electronic device of echo

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515747B2 (en) * 2008-09-06 2013-08-20 Huawei Technologies Co., Ltd. Spectrum harmonic/noise sharpness control

Also Published As

Publication number Publication date
CN112530446A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN110556122B (en) Band expansion method, device, electronic equipment and computer readable storage medium
CN110556123B (en) Band expansion method, device, electronic equipment and computer readable storage medium
CN110556121B (en) Band expansion method, device, electronic equipment and computer readable storage medium
AU763471B2 (en) A method and device for adaptive bandwidth pitch search in coding wideband signals
US9251800B2 (en) Generation of a high band extension of a bandwidth extended audio signal
US8639500B2 (en) Method, medium, and apparatus with bandwidth extension encoding and/or decoding
US9280978B2 (en) Packet loss concealment for bandwidth extension of speech signals
TW201140563A (en) Determining an upperband signal from a narrowband signal
WO2005111568A1 (en) Encoding device, decoding device, and method thereof
US20220180881A1 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
JP2010521012A (en) Speech coding system and method
JP6289507B2 (en) Apparatus and method for generating a frequency enhancement signal using an energy limiting operation
WO2011062538A9 (en) Bandwidth extension of a low band audio signal
WO2013066244A1 (en) Bandwidth extension of audio signals
CN112530446B (en) Band expansion method, device, electronic equipment and computer readable storage medium
Bhatt et al. A novel approach for artificial bandwidth extension of speech signals by LPC technique over proposed GSM FR NB coder using high band feature extraction and various extension of excitation methods
Prasad et al. Speech bandwidth extension aided by magnitude spectrum data hiding
Choo et al. Blind bandwidth extension system utilizing advanced spectral envelope predictor
WO2023198925A1 (en) High frequency reconstruction using neural network system

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40038380; Country of ref document: HK)

SE01 Entry into force of request for substantive examination
GR01 Patent grant