CN114758672A - Audio generation method and device and electronic equipment - Google Patents

Audio generation method and device and electronic equipment

Info

Publication number
CN114758672A
Authority
CN
China
Prior art keywords
audio
signal
spectrum
sampling
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210329574.8A
Other languages
Chinese (zh)
Inventor
李彤
杨张辉
高可攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Grandstream Networks Technologies Co ltd
Original Assignee
Shenzhen Grandstream Networks Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Grandstream Networks Technologies Co ltd filed Critical Shenzhen Grandstream Networks Technologies Co ltd
Priority to CN202210329574.8A
Publication of CN114758672A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides an audio generation method, an audio generation device, and electronic equipment. After a first amplitude spectrum, a direct-current (DC) component amplitude spectrum, and a phase spectrum of the to-be-predicted sampling signal of each audio frame in an audio signal to be expanded are obtained, the first amplitude spectrum is input into a trained audio prediction model to obtain a second amplitude spectrum corresponding to the to-be-predicted sampling signal, where the audio prediction model is a neural network model constructed from a convolutional network and a convolutional long short-term memory network. A target frequency spectrum is then obtained by combining the DC component amplitude spectrum, the second amplitude spectrum, and the phase spectrum, and finally the target audio signal is generated from the target frequency spectrum. By using the trained audio prediction model to predict the spectral features of the target audio signal from the spectral features of the audio signal to be expanded, the method and device overcome the inability of existing methods to operate on frequency-domain features and improve the quality of super-resolution audio generation.

Description

Audio generation method and device and electronic equipment
Technical Field
The present application relates to the field of speech signal processing, and in particular, to an audio generation method and apparatus, and an electronic device.
Background
With the development and maturity of mobile communication technology, users place ever higher demands on speech quality in communication. Audio super-resolution technology has therefore been developed to restore the high-frequency components missing from narrowband speech in traditional narrowband communication.
However, traditional audio super-resolution techniques mainly exploit the correlation between the high and low frequency bands of a speech signal to perform band extension; because these techniques are limited, the extension effect is often unsatisfactory and cannot match a true wideband signal. Audio super-resolution using only a convolutional neural network (CNN) is also limited in effect, because a CNN can only extract the spatial features of the signal and cannot exploit the temporal characteristics of speech. Methods that combine a CNN with a long short-term memory network (LSTM) mainly use time-domain sampling points of the signal as features; because the input of an LSTM is one-dimensional and unsuited to spatial sequence data, such methods cannot be applied to frequency-domain features. Since the difference between a narrowband and a wideband speech signal lies mainly in the frequency bands, a network trained on time-domain sampling points has difficulty learning the relationship between the low and high frequency bands of the signal, so the quality of the generated super-resolution audio is not high.
Therefore, it is desirable to provide an audio generation method that overcomes the inability of current methods to operate on frequency-domain features and improves the quality of super-resolution audio generation.
Disclosure of Invention
The application provides an audio generation method, an audio generation device and electronic equipment, which are used for overcoming the defect that the current method cannot be applied to frequency domain characteristics and improving the quality of super-resolution audio generation.
In order to solve the technical problem, the present application provides the following technical solutions:
the application provides an audio generation method, comprising the following steps:
acquiring a first amplitude spectrum, a direct current component amplitude spectrum and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be expanded;
inputting the first amplitude spectrum into a trained audio prediction model, and outputting a second amplitude spectrum corresponding to the to-be-predicted sampling signal, wherein the audio prediction model is a neural network model constructed from a convolutional network and a convolutional long short-term memory network;
combining the direct-current component magnitude spectrum, the second magnitude spectrum and the phase spectrum to obtain a target frequency spectrum;
and generating a target audio signal according to the target frequency spectrum.
Correspondingly, the present application also provides an audio generating apparatus, including:
The first acquisition unit is used for acquiring a first amplitude spectrum, a direct current component amplitude spectrum and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be expanded;
the prediction unit is used for inputting the first amplitude spectrum into a trained audio prediction model and outputting a second amplitude spectrum corresponding to the sampling signal to be predicted, and the audio prediction model is a neural network model constructed by a convolution network and a convolution long-time and short-time memory network;
the target frequency spectrum determining unit is used for combining the direct-current component amplitude spectrum, the second amplitude spectrum and the phase spectrum to obtain a target frequency spectrum;
and the generating unit is used for generating a target audio signal according to the target frequency spectrum.
Meanwhile, the present application provides an electronic device, which includes a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for operating the computer program in the memory to execute the steps in the audio generation method.
Furthermore, the present application provides a computer-readable storage medium, which stores a plurality of instructions, where the instructions are suitable for being loaded by a processor to execute the steps in the audio generation method.
Furthermore, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the steps of the audio generation method.
Beneficial effects: compared with a traditional long short-term memory network, the convolutional long short-term memory network adds a convolution operation, so that it not only has temporal modeling capability but can also capture basic spatial features by performing convolution over multidimensional data. The neural network model provided in this application can therefore extract both the temporal and the spatial features of the audio signal to be expanded for training, yielding a trained audio prediction model. The trained audio prediction model then predicts the spectral features of the target audio signal from the spectral features of the audio signal to be expanded, and the target signal corresponding to the audio signal to be expanded is obtained. This overcomes the inability of current methods to operate on frequency-domain features and improves the quality of super-resolution audio generation.
Drawings
The technical solution and other advantages of the present application will become apparent from the detailed description of the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an audio generation method provided in an embodiment of the present application.
Fig. 2 is a schematic diagram of a framework of an application phase of an audio prediction model according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating an effect of a low-resolution audio signal becoming a super-resolution audio signal according to an embodiment of the present application.
Fig. 4 is a schematic network structure diagram of an audio prediction model provided in an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a long-term and short-term memory network according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a framework of an audio prediction model training phase provided in an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an audio generating apparatus according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be implemented in other sequences than those illustrated or described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions.
In the present application, the audio signal to be extended may be a low resolution audio signal, i.e. a narrowband signal, with a sampling rate of 8kHz and a bandwidth of typically 300Hz-3.4 kHz.
In the present application, an audio frame is a signal obtained by performing overlap framing and windowing on an audio signal to be extended, and a signal of each frame of the audio signal to be extended is an audio frame.
In the present application, a to-be-predicted sampling signal is a signal obtained by performing time domain interpolation upsampling processing on each audio frame in an audio signal to be extended.
In the present application, the direct current (DC) component amplitude spectrum is the modulus of the first point obtained after time-frequency domain conversion. Specifically, assuming the sampling frequency is fs and the number of sampling points is N, after the time-domain signal is converted to the frequency domain, the first point corresponds to the direct-current component, and the modulus of this first point constitutes the DC component amplitude spectrum.
In this application, a phase spectrum refers to a phase versus frequency curve representing the phase that each frequency component has at the time origin.
In the present application, the target audio signal may be a signal obtained by performing super-resolution processing on the audio signal to be extended; the target audio signal restores the high-frequency components missing from the audio signal to be extended.
The application provides an audio generation method and device and electronic equipment.
Referring to fig. 1, fig. 1 is a schematic flowchart of an audio generation method according to an embodiment of the present disclosure. The method is suitable for scenarios such as voice calls, voice over Internet protocol (VoIP) calls, and network conferences. Most of these scenarios use traditional narrowband communication and can only transmit narrowband signals; the transmitted narrowband signal sounds unnatural and muffled because of the missing high-frequency components, which degrades the timbre and the listening experience. To improve sound quality, audio super-resolution technology (i.e., audio bandwidth extension technology) performs super-resolution processing on audio data with a low sampling rate and compensates for the missing high-frequency components as accurately as possible, so that the audio sounds fuller and clearer and the listening experience is better.
It should be noted that the present application does not limit the sampling rate of the low-resolution speech signal and the high-resolution speech signal, for example, the present application may be applied to the task where the sampling rate of the audio signal to be extended is 8kHz and the target sampling rates are 16kHz, 24kHz, and 48kHz, and may also be applied to the task where the sampling rate of the audio signal to be extended is 16kHz and the target sampling rate is 48 kHz.
The following describes in detail an audio generation method in the present application, which includes at least the following steps:
s101: and acquiring a first amplitude spectrum, a direct current component amplitude spectrum and a phase spectrum of the to-be-predicted sampling signal of each audio frame in the to-be-extended audio signal.
In one embodiment, before the spectral features of the audio signal to be extended are predicted, the audio signal to be extended and its corresponding information need to be acquired. The specific steps include: acquiring the audio signal to be expanded and a target sampling rate; preprocessing the audio signal to be expanded to obtain each audio frame of the audio signal to be expanded; resampling each audio frame according to the target sampling rate to obtain the sampling signal to be predicted; and performing feature extraction on the sampling signal to be predicted to obtain its first amplitude spectrum, DC component amplitude spectrum, and phase spectrum. The audio signal to be extended may be a low-resolution audio signal; the target sampling rate is the sampling rate of the target audio signal (i.e., the high-resolution audio signal corresponding to the low-resolution audio signal), specified manually or by system default; the preprocessing includes overlapping framing and windowing.
Specifically, as shown in fig. 2, fig. 2 is a schematic diagram of the framework of the application stage of the audio prediction model according to an embodiment of the present application. After the audio signal to be expanded (the low-resolution audio signal) is obtained, overlapping framing and windowing are performed on it to obtain each audio frame x(t). Time-domain interpolation up-sampling is then performed on each frame signal x(t) according to the target sampling rate, so that its sampling rate is consistent with that of the target audio signal, giving the sampling signal to be predicted x_i(t). Finally, feature extraction is performed on x_i(t) to obtain its amplitude spectrum and phase spectrum, where the amplitude spectrum comprises the first amplitude spectrum and the DC component amplitude spectrum.
In one embodiment, the specific steps of obtaining the phase spectrum and the amplitude spectrum include: converting the sampling signal to be predicted to the frequency domain according to a preset time-frequency domain conversion condition to obtain the frequency spectrum of the sampling signal to be predicted; determining the phase spectrum of the sampling signal to be predicted according to that frequency spectrum; and performing amplitude spectrum extraction on the sampling signal to be predicted to obtain its first amplitude spectrum and DC component amplitude spectrum. The preset time-frequency domain conversion condition may be a short-time Fourier transform. Specifically, the time-domain sampling signal to be predicted x_i(t) is converted to the frequency domain by the short-time Fourier transform to obtain its frequency spectrum X_i(w); the phase spectrum is then obtained from this frequency spectrum; finally, amplitude spectrum extraction is performed on x_i(t) to obtain the first amplitude spectrum and the DC component amplitude spectrum of the sampling signal to be predicted. The detailed procedure is described below.
In one embodiment, the specific steps of performing amplitude spectrum extraction on the sampling signal to be predicted to obtain its first amplitude spectrum and DC component amplitude spectrum include: filtering the sampling signal to be predicted to obtain a first filtered signal; converting the first filtered signal to the frequency domain according to the preset time-frequency domain conversion condition to obtain the frequency spectrum of the first filtered signal; and determining the first amplitude spectrum and the DC component amplitude spectrum of the sampling signal to be predicted from the frequency spectrum of the first filtered signal. The filtering may be low-pass filtering, which removes the image spectrum generated during interpolation resampling and yields the first filtered signal x_1(t). The first filtered signal x_1(t) is then converted to the frequency domain by the short-time Fourier transform to obtain its frequency spectrum X_1(w), and taking the absolute value of X_1(w) gives the amplitude spectrum of the first filtered signal. Because the amplitude spectrum is symmetric about the vertical axis, one symmetric half is kept, yielding the first amplitude spectrum of the sampling signal to be predicted x_i(t) (the symmetric half of the curve formed by the amplitudes of all points of X_1(w) other than the first point) and the DC component amplitude spectrum (the modulus of the first point of X_1(w)).
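As an illustration of the feature-extraction steps above, the following is a minimal per-frame sketch, assuming NumPy/SciPy, a single FFT per frame in place of a full short-time Fourier transform, and arbitrarily chosen FFT size, filter order, and cutoff; none of these values are specified in the description.

```python
import numpy as np
from scipy.signal import resample_poly, firwin, lfilter

def extract_features(frame, src_rate=8000, target_rate=16000, n_fft=512):
    """Return the first amplitude spectrum, DC component amplitude spectrum and
    phase spectrum of the sampling signal to be predicted for one audio frame."""
    # Time-domain interpolation up-sampling so the frame matches the target rate.
    x_i = resample_poly(frame, target_rate, src_rate)

    # Phase spectrum taken from the spectrum X_i(w) of the up-sampled signal.
    spec_i = np.fft.rfft(x_i, n=n_fft)
    phase = np.angle(spec_i)

    # Low-pass filtering removes the image spectrum left by interpolation,
    # giving the first filtered signal x_1(t).
    lp = firwin(101, src_rate / 2, fs=target_rate)
    x_1 = lfilter(lp, 1.0, x_i)

    # Amplitude spectrum of X_1(w): the first bin is the DC component,
    # the remaining half-spectrum is the first amplitude spectrum.
    spec_1 = np.fft.rfft(x_1, n=n_fft)
    dc_magnitude = np.abs(spec_1[0])
    first_magnitude = np.abs(spec_1[1:])
    return first_magnitude, dc_magnitude, phase
```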
It should be noted that, due to the short-time stationarity of the speech signal, the short-time Fourier transform is adopted to convert the time-domain signal into the frequency domain.
S102: inputting the first amplitude spectrum into the trained audio prediction model, and outputting the second amplitude spectrum corresponding to the sampling signal to be predicted, where the audio prediction model is a neural network model constructed from a convolutional network and a convolutional long short-term memory network.
Traditional audio super-resolution techniques mainly exploit the correlation between the high and low frequency bands of the speech signal to perform band extension. These extension methods mainly include codebook mapping, band replication, linear prediction analysis, Gaussian mixture model (GMM) based methods, hidden Markov model (HMM) based methods, and the like. Because these methods are limited, the extension effect is not ideal and cannot reach the effect of a true wideband signal.
With the continuous deepening of research in recent years, many audio super-resolution algorithms based on neural networks have appeared. Their main idea, borrowed from image super-resolution algorithms, is to use a deep neural network to predict the high-band signal corresponding to a low-band signal; the effect is shown in fig. 3. However, unlike images, speech signals are sequential: the speech over a period of time is strongly correlated with what precedes it, i.e., the speech signal at the current moment is correlated with the signal over the preceding period. The convolutional neural network (CNN) used in image algorithms can only extract spatial features of the signal and cannot exploit the temporal characteristics of speech, so audio super-resolution using only a CNN remains limited in effect. To exploit the temporal characteristics of speech, methods combining a CNN with a long short-term memory network (LSTM) have appeared. The LSTM is a variant of the recurrent neural network (RNN); an RNN is a neural network structure capable of processing sequence data, and using it allows the information of previous frames to be considered when processing the current speech frame, so the current output is related not only to the current input but also to the input over the preceding period. As a variant of the RNN, the LSTM has a long-term memory function and avoids the vanishing-gradient problem. However, current combined CNN and LSTM algorithms mainly use time-domain sampling points of the signal as features; because the input of the LSTM is one-dimensional and unsuited to spatial sequence data, such methods cannot be applied to frequency-domain features, and their algorithmic delay and computational cost are large.
Based on the above problems, the present application provides an audio prediction model constructed from a convolutional network (CNN) and a convolutional long short-term memory network (ConvLSTM), and trains the audio prediction model to obtain a trained audio prediction model. The specific steps include: acquiring a training set and an initial audio prediction model, where the initial audio prediction model comprises a down-sampling module, a bottleneck layer module, an up-sampling module, and an output module; the down-sampling module consists of N down-sampling layers with the same structure but different parameters, and the up-sampling module consists of N up-sampling layers with the same structure but different parameters; and training the initial audio prediction model according to the training set to obtain the trained audio prediction model.
Fig. 4 shows the network model proposed in the present application; fig. 4 is a schematic diagram of the network structure of the audio prediction model provided in an embodiment of the present application. The audio prediction model mainly comprises four modules: a down-sampling module, a bottleneck layer module, an up-sampling module, and an output module. The down-sampling module consists of N down-sampling layers with the same structure but different parameters, where each down-sampling layer comprises a convolution layer, a pooling layer, and a convolutional long short-term memory (ConvLSTM) layer; the up-sampling module consists of N up-sampling layers with the same structure but different parameters, where each up-sampling layer comprises a convolution layer, a deconvolution layer, and a ConvLSTM layer. Furthermore, each up-sampling and down-sampling layer contains a 1 × 1 convolution, which reduces the number of feature channels, and hence the number of network parameters, before the ConvLSTM is applied. The bottleneck layer comprises a convolution layer, a pooling layer, and a ConvLSTM layer. The output module comprises a convolution layer and a deconvolution layer.
In one embodiment, the output of each down-sampling layer in the down-sampling module is connected to the input of the corresponding up-sampling layer in the up-sampling module. Specifically, the audio prediction model introduces residual (skip) connections, i.e., the output of each down-sampling layer in the down-sampling module is fed to the input of the up-sampling layer in the corresponding position of the up-sampling module. The model does not limit the number of layers N of the down-sampling and up-sampling modules, but the two modules must have the same number of layers and a symmetrical structure.
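Purely as an illustration of this wiring, the following is a structural sketch of one down-sampling layer and one up-sampling layer. The layer count N, channel widths, kernel sizes, and activations are not given in the description and are assumptions here; `conv_lstm` stands for any convolutional LSTM layer (for example, one built from the cell sketched after equations (6)-(10) below) that maps a feature map to a feature map.

```python
import torch
import torch.nn as nn

class DownLayer(nn.Module):
    """Down-sampling layer: convolution -> pooling -> 1x1 conv -> ConvLSTM."""
    def __init__(self, in_ch: int, out_ch: int, conv_lstm: nn.Module):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        # 1x1 convolution trims feature channels (and thus parameters) before the ConvLSTM.
        self.reduce = nn.Conv2d(out_ch, out_ch // 2, kernel_size=1)
        self.conv_lstm = conv_lstm

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.conv_lstm(self.reduce(x))

class UpLayer(nn.Module):
    """Up-sampling layer: convolution -> deconvolution -> 1x1 conv -> ConvLSTM."""
    def __init__(self, ch: int, conv_lstm: nn.Module):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose2d(ch, ch // 2, kernel_size=2, stride=2)
        self.reduce = nn.Conv2d(ch // 2, ch // 4, kernel_size=1)
        self.conv_lstm = conv_lstm

    def forward(self, x, skip):
        # Skip (residual) connection: the matching down-sampling layer's output
        # is fed to this up-sampling layer's input.
        x = x + skip
        x = self.deconv(torch.relu(self.conv(x)))
        return self.conv_lstm(self.reduce(x))
```

A full model would stack N such layers, place the bottleneck layer between the two stacks, and wire each down-sampling output to the input of its mirror-image up-sampling layer.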
The convolutional long short-term memory network (ConvLSTM) used in the model is an improvement of the conventional long short-term memory network (LSTM); the structure of a conventional LSTM unit is shown in fig. 5. The LSTM is a variant of the recurrent neural network (RNN) and is essentially fully connected, hence also called FC-LSTM. Its role is to provide a "memory" function: the LSTM adds state preservation to the computing unit to improve the memory of the network, i.e., the network takes past information into account, so its output depends not only on the current input but also on previous information, and the current output is determined jointly by the previous information and the input at that moment.
The LSTM is described by equations 1 to 5 below, where σ is the gate activation function (a sigmoid), i_t is the input gate, f_t the forget gate, c_t the memory cell, o_t the output gate, and h_t the hidden state; W and b are the weights to be learned by the network, x_t is the input sequence data, and ∘ denotes the Hadamard (element-wise) product. The weights in the LSTM computing unit are shared: each LSTM layer shares one set of weights.
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \circ c_{t-1} + b_i)    (1)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \circ c_{t-1} + b_f)    (2)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \circ c_t + b_o)    (4)

h_t = o_t \circ \tanh(c_t)    (5)
The ConvLSTM is a convolutional long short-term memory network. Compared with the traditional LSTM network, the ConvLSTM adds a convolution operation: the LSTM can only extract temporal features, whereas after convolution operations replace the fully connected layers, the ConvLSTM can extract spatio-temporal features simultaneously. It therefore not only has temporal modeling capability but can also extract spatial features, capturing basic spatial features by performing convolution over multidimensional data. The ConvLSTM is described by equations 6 to 10 below, where * denotes the convolution operation.
i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i)    (6)

f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f)    (7)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c)    (8)

o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o)    (9)

h_t = o_t \circ \tanh(c_t)    (10)
It can be seen that, compared with the LSTM, the ConvLSTM uses convolution instead of matrix multiplication; after convolution is added, not only can the temporal relationship of the signals be captured, but spatial features can also be extracted as in a convolutional layer.
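The following is a minimal ConvLSTM cell written directly from equations (6)-(10), assuming PyTorch; the kernel size is an assumption, and the peephole weights W_ci, W_cf, W_co are implemented per-channel rather than as full element-wise maps for brevity.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        # One convolution over x_t and one over h_{t-1} produce all four gate pre-activations.
        self.conv_x = nn.Conv2d(in_ch, 4 * hid_ch, kernel, padding=pad)
        self.conv_h = nn.Conv2d(hid_ch, 4 * hid_ch, kernel, padding=pad, bias=False)
        # Peephole weights enter element-wise (Hadamard), so they are plain tensors
        # (per-channel here, a simplification of full element-wise maps).
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv_x(x_t) + self.conv_h(h_prev)
        gi, gf, gc, go = torch.chunk(gates, 4, dim=1)
        i_t = torch.sigmoid(gi + self.w_ci * c_prev)          # eq. (6)
        f_t = torch.sigmoid(gf + self.w_cf * c_prev)          # eq. (7)
        c_t = f_t * c_prev + i_t * torch.tanh(gc)             # eq. (8)
        o_t = torch.sigmoid(go + self.w_co * c_t)             # eq. (9)
        h_t = o_t * torch.tanh(c_t)                           # eq. (10)
        return h_t, c_t
```

Iterating this cell over the time dimension gives a ConvLSTM layer that can be plugged into the down-sampling and up-sampling layers sketched earlier.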
In an embodiment, based on the above analysis, the audio prediction model provided in the present application can extract the temporal and spatial features of the audio signal simultaneously for training. Therefore, before the audio prediction model is trained, a training set containing temporal and spatial features needs to be obtained. The specific steps include: acquiring a first voice signal, a first sampling rate, and a second sampling rate; preprocessing the first voice signal to obtain each first audio frame of the first voice signal; resampling each first audio frame according to the first sampling rate to obtain a first sampling signal; resampling the first sampling signal according to the second sampling rate to obtain a second sampling signal; filtering the second sampling signal to obtain a second filtered signal; converting the first voice signal and the second filtered signal to the frequency domain according to preset time-frequency domain conversion conditions to obtain a first training spectrum corresponding to the first voice signal and a second training spectrum corresponding to the second filtered signal; and determining the training set from the first training spectrum and the second training spectrum. The first voice signal may be the original high-resolution voice signal used for training; the preprocessing includes overlapping framing and windowing; the preset time-frequency domain conversion condition may be a short-time Fourier transform.
As shown in fig. 6, fig. 6 is a schematic diagram of the framework of the audio prediction model training stage according to an embodiment of the present application. Because the voice signal is short-time stationary and subsequent computation is based on the frequency-domain features of the signal, the acquired first voice signal (i.e., the original high-resolution voice signal) is first subjected to overlapping framing and windowing to obtain the first audio frames, where the purpose of windowing is to prevent spectral leakage. Each frame of the high-resolution voice signal (i.e., each first audio frame) is then down-sampled according to a first sampling rate M (M is an integer greater than 1) to obtain the first sampling signal. Time-domain interpolation up-sampling is then performed according to a second sampling rate P (P is an integer greater than 1) to obtain the second sampling signal. The purpose of this up-sampling is that, in the application stage, the phase spectrum of the high-frequency components is obtained from the mirrored high-frequency components produced by up-sampling; so although the phase spectrum is not needed in the training stage, the same up-sampling is applied to the training data to keep the data in the same form during training and actual application. The second sampling signal is then low-pass filtered to remove the image high-frequency components produced by the interpolation up-sampling, yielding the second filtered signal (i.e., a low-resolution voice signal) that contains only low-frequency components and lacks the high-frequency components. Short-time Fourier transforms are then applied to the first voice signal (the original high-resolution voice signal) and the second filtered signal (the low-resolution voice signal) respectively to obtain their spectral features, i.e., the first training spectrum and the second training spectrum. Because the spectrum is symmetric, only half of it is kept: the absolute value of half of the first training spectrum gives the amplitude spectrum of the first voice signal (the original high-resolution voice signal), recorded as label, and the absolute value of half of the second training spectrum gives the amplitude spectrum of the second filtered signal (the low-resolution voice signal), recorded as data. The operations of sampling, filtering, time-frequency domain transformation, and amplitude spectrum extraction are repeated for each first audio frame to obtain the training set used for model training. Finally, the initial audio prediction model is trained with this training set, and the weight parameters obtained from training are saved to obtain the trained audio prediction model.
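As a concrete illustration of this training-data pipeline, the sketch below builds one (data, label) pair from a single high-resolution frame, assuming NumPy/SciPy; M, P, the window, the FFT size, and the low-pass design are all assumptions, since the description fixes none of them.

```python
import numpy as np
from scipy.signal import resample_poly, firwin, lfilter

def make_training_pair(frame_hr, fs_hr=16000, M=2, P=2, n_fft=512):
    """frame_hr: one frame of the original high-resolution (first) voice signal."""
    # Windowing against spectral leakage.
    frame_hr = frame_hr * np.hanning(len(frame_hr))

    # Down-sample by M, then interpolation up-sample by P, so the training data
    # has the same form as the signals seen in the application stage.
    first_sampling = resample_poly(frame_hr, 1, M)
    second_sampling = resample_poly(first_sampling, P, 1)

    # Low-pass filtering removes the image high-frequency components left by
    # up-sampling, giving the second filtered (low-resolution) signal.
    lp = firwin(101, (fs_hr / M) / 2, fs=fs_hr)
    second_filtered = lfilter(lp, 1.0, second_sampling)

    # Keep half of each (symmetric) magnitude spectrum: label from the
    # high-resolution frame, data from the low-resolution signal.
    label = np.abs(np.fft.rfft(frame_hr, n=n_fft))
    data = np.abs(np.fft.rfft(second_filtered, n=n_fft))
    return data, label
```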
The first amplitude spectrum is then input into the trained audio prediction model, and the weight parameters of the trained model are used to predict the second amplitude spectrum corresponding to the current sampling signal to be predicted (i.e., the amplitude spectrum of the high-resolution signal corresponding to it).
S103: and combining the direct-current component amplitude spectrum, the second amplitude spectrum and the phase spectrum to obtain a target frequency spectrum.
Since the frequency spectrum reflects the distribution of the amplitude and phase of the signal over frequency, the DC component amplitude spectrum, the second amplitude spectrum, and the phase spectrum of the sampling signal to be predicted obtained in the preceding steps are combined to obtain the target frequency spectrum Y(w).
S104: and generating a target audio signal according to the target frequency spectrum.
In one embodiment, the specific step of generating the target audio signal comprises: converting the target frequency spectrum into a time domain according to a preset time-frequency domain inverse conversion condition to obtain a target audio frame corresponding to the target frequency spectrum; and performing time domain synthesis processing on the target audio frame to obtain a target audio signal. The preset inverse time-frequency domain conversion condition may be inverse short-time fourier transform.
Specifically, after the target frequency spectrum Y(w) is obtained, an inverse short-time Fourier transform is applied to Y(w) to obtain the time-domain target audio frame signal y(t). Time-domain synthesis is then performed on the target audio frames: each target audio frame signal y(t) is windowed and overlap-added with the previous audio frame signal y(t-1) to obtain the target audio signal (i.e., the target high-resolution audio signal after the audio signal to be expanded has been extended). Because the information of previous frames is used in synthesizing the target audio signal, the quality of super-resolution audio generation is improved.
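A minimal sketch of this recombination and synthesis follows, assuming NumPy, the one-sided spectrum layout used in the feature-extraction sketch above, and a caller-supplied hop length; the actual window, hop, and frame bookkeeping are not specified in the description.

```python
import numpy as np

def reconstruct_frame(dc_magnitude, predicted_magnitude, phase, n_fft=512):
    """Combine the DC component amplitude spectrum, the second amplitude spectrum
    predicted by the model (S102), and the phase spectrum, then return to the
    time domain (S103/S104)."""
    magnitude = np.concatenate(([dc_magnitude], predicted_magnitude))
    target_spectrum = magnitude * np.exp(1j * phase)   # target frequency spectrum Y(w)
    return np.fft.irfft(target_spectrum, n=n_fft)      # target audio frame y(t)

def overlap_add(frames, hop):
    """Time-domain synthesis: window each target audio frame and overlap-add it
    with the preceding frames to form the target audio signal."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    win = np.hanning(frame_len)
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + frame_len] += frame * win
    return out
```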
The method combines convolution with a convolutional long short-term memory network (ConvLSTM), can extract the temporal and spatial features of the audio signal simultaneously for training, and then predicts the missing high-frequency components from the spectral features of the low-resolution speech signal, improving speech quality under limited bandwidth and addressing the low resolution and poor timbre of narrowband signals.
Based on the content of the foregoing embodiments, an audio generating apparatus is provided in the embodiments of the present application, and the apparatus may be disposed in an audio acquiring terminal. The audio generating apparatus is configured to execute the audio generating method provided in the foregoing method embodiment, and specifically, referring to fig. 7, the apparatus includes:
a first obtaining unit 701, configured to obtain a first amplitude spectrum, a dc component amplitude spectrum, and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be extended;
the prediction unit 702 is configured to input the first magnitude spectrum to a trained audio prediction model, and output a second magnitude spectrum corresponding to the to-be-predicted sampling signal, where the audio prediction model is a neural network model constructed by a convolution network and a convolution long-and-short-term memory network;
A target frequency spectrum determining unit 703, configured to combine the dc component magnitude spectrum, the second magnitude spectrum, and the phase spectrum to obtain a target frequency spectrum;
a generating unit 704, configured to generate a target audio signal according to the target frequency spectrum.
In one embodiment, the first obtaining unit 701 includes:
the second acquisition unit is used for acquiring the audio signal to be expanded and a target sampling rate;
the first preprocessing unit is used for preprocessing the audio signal to be expanded to obtain each audio frame of the audio signal to be expanded;
the first sampling unit is used for resampling each audio frame according to the target sampling rate to obtain a to-be-predicted sampling signal;
and the characteristic extraction unit is used for carrying out characteristic extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum, a direct-current component amplitude spectrum and a phase spectrum of the to-be-predicted sampling signal.
In one embodiment, the feature extraction unit includes:
the first time-frequency domain conversion unit is used for converting the to-be-predicted sampling signal to a frequency domain according to a preset time-frequency domain conversion condition to obtain a frequency spectrum of the to-be-predicted sampling signal;
the phase spectrum determining unit is used for determining the phase spectrum of the sampling signal to be predicted according to the frequency spectrum of the sampling signal to be predicted;
And the amplitude spectrum extraction unit is used for carrying out amplitude spectrum extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum and a direct-current component amplitude spectrum of the to-be-predicted sampling signal.
In one embodiment, the magnitude spectrum extraction unit includes:
the first filtering unit is used for filtering the to-be-predicted sampling signal to obtain a first filtering signal;
a second time-frequency domain converting unit, configured to convert the first filtered signal into a frequency domain according to the preset time-frequency domain converting condition, so as to obtain a frequency spectrum of the first filtered signal;
and the amplitude spectrum determining subunit is used for determining a first amplitude spectrum and a direct-current component amplitude spectrum of the to-be-predicted sampling signal according to the frequency spectrum of the first filtering signal.
In one embodiment, the audio generating apparatus further comprises:
the third acquisition unit is used for acquiring a training set and an initial audio prediction model, wherein the initial audio prediction model comprises a down-sampling module, a bottleneck layer module, an up-sampling module and an output module; the down-sampling module consists of N down-sampling layers with the same structure and different parameters, and the up-sampling module consists of N up-sampling layers with the same structure and different parameters;
And the model training unit is used for training the initial audio prediction model according to the training set to obtain a trained audio prediction model.
In one embodiment, the third obtaining unit includes:
a fourth obtaining unit, configured to obtain the first voice signal, the first sampling rate, and the second sampling rate;
the second preprocessing unit is used for preprocessing the first voice signal to obtain each first audio frame of the first voice signal;
the second sampling unit is used for resampling each first audio frame according to the first sampling rate to obtain a first sampling signal;
the third sampling unit is used for resampling the first sampling signal according to the second sampling rate to obtain a second sampling signal;
the second filtering unit is used for carrying out filtering processing on the second sampling signal to obtain a second filtering signal;
a third time-frequency domain converting unit, configured to convert the first voice signal and the second filtered signal into frequency domains according to preset time-frequency domain conversion conditions, respectively, so as to obtain a first training frequency spectrum corresponding to the first voice signal and a second training frequency spectrum corresponding to the second filtered signal;
And the training set determining unit is used for determining a training set according to the first training frequency spectrum and the second training frequency spectrum.
In one embodiment, the output of each downsampling layer in the downsampling module of the initial audio prediction model is connected to the input of each upsampling layer in the upsampling module.
In one embodiment, the generating unit 704 includes:
the frequency-time domain conversion unit is used for converting the target frequency spectrum into a time domain according to a preset time-frequency domain inverse conversion condition to obtain a target audio frame corresponding to the target frequency spectrum;
and the audio frame synthesis unit is used for carrying out time domain synthesis processing on the target audio frame to obtain a target audio signal.
The audio generating apparatus of the embodiment of the present application may be configured to execute the technical solutions of the foregoing method embodiments, and the implementation principles and technical effects thereof are similar and will not be described herein again.
Different from the current technology, the audio generation device provided by the application is provided with the prediction unit, and the prediction unit is used for predicting the frequency spectrum characteristic of the target audio signal based on the frequency spectrum characteristic of the audio signal to be expanded so as to obtain the target signal corresponding to the audio signal to be expanded, so that the defect that the current method cannot be applied to frequency domain characteristics is overcome, and the quality of super-resolution audio generation is improved.
Correspondingly, the embodiment of the application also provides the electronic equipment, and the electronic equipment comprises a server or a terminal and the like.
As shown in fig. 8, the electronic device may include a processor 801 having one or more processing cores, a Wireless Fidelity (WiFi) module 802, a memory 803 having one or more computer-readable storage media, an audio circuit 804, a display unit 805, an input unit 806, a sensor 807, a power supply 808, and a Radio Frequency (RF) circuit 809. Those skilled in the art will appreciate that the configuration of the electronic device shown in fig. 8 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 801 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 803 and calling data stored in the memory 803, thereby monitoring the electronic device as a whole. In one embodiment, processor 801 may include one or more processing cores; preferably, the processor 801 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 801.
WiFi belongs to short-range wireless transmission technology, and electronic devices can help users send and receive e-mails, browse web pages, access streaming media, etc. through the wireless module 802, which provides wireless broadband internet access for users. Although fig. 8 shows the wireless module 802, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The memory 803 may be used to store software programs and modules, and the processor 801 executes various functional applications and data processes by operating the computer programs and modules stored in the memory 803. The memory 803 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 803 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 803 may also include a memory controller to provide the processor 801 and the input unit 806 with access to the memory 803.
The audio circuitry 804 includes a speaker that can provide an audio interface between a user and the electronic device. The audio circuit 804 can transmit the electrical signal converted from the received audio data to the speaker, and the electrical signal is converted into a sound signal by the speaker and output; on the other hand, the speaker converts the collected sound signal into an electrical signal, which is received by the audio circuit 804 and converted into audio data, and the audio data is processed by the audio data output processor 801 and then transmitted to another electronic device via the radio frequency circuit 809, or the audio data is output to the memory 803 for further processing. The audio circuit 804 may also include an earbud jack to provide communication of a peripheral headset with the electronic device.
The display unit 805 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 805 may include a Display panel, and in one embodiment, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 801 to determine the type of touch event, and then the processor 801 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 8 the touch sensitive surface and the display panel are two separate components to implement the input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement the input and output functions.
The input unit 806 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 806 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (such as operations by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. In one embodiment, the touch sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 801, and can receive and execute commands sent by the processor 801. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 806 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The electronic device may also include at least one sensor 807, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust brightness of the display panel according to brightness of ambient light, and the proximity sensor may turn off the display panel and/or the backlight when the terminal moves to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
The electronic device also includes a power supply 808 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 801 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The power supply 808 may also include any component including one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The rf circuit 809 may be used for receiving and sending signals during a message transmission or communication process, and in particular, receives downlink information of a base station and then sends the received downlink information to the one or more processors 801 for processing; in addition, data relating to uplink is transmitted to the base station. Generally, the radio frequency circuitry 809 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. The radio frequency circuit 809 can also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
Although not shown, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 801 in the electronic device loads an executable file corresponding to a process of one or more application programs into the memory 803 according to the following instructions, and the processor 801 runs the application programs stored in the memory 803, thereby implementing the following functions:
Acquiring a first amplitude spectrum, a direct-current component amplitude spectrum and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be expanded;
inputting the first amplitude spectrum into a trained audio prediction model, and outputting a second amplitude spectrum corresponding to the sampling signal to be predicted, wherein the audio prediction model is a neural network model constructed by a convolution network and a convolution long-time and short-time memory network;
combining the direct-current component amplitude spectrum, the second amplitude spectrum and the phase spectrum to obtain a target frequency spectrum;
and generating a target audio signal according to the target frequency spectrum.
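The four steps above can be illustrated with a minimal sketch (Python/NumPy). The frame length, hop size, analysis window, and the `predict_magnitude` callable standing in for the trained audio prediction model are assumptions used only for illustration; the disclosure does not fix these values.

```python
# Minimal sketch of the four steps listed above; all numeric settings are assumptions.
import numpy as np

FRAME_LEN = 512   # assumed samples per audio frame
HOP_LEN = 256     # assumed hop size (50 % overlap)

def extract_features(frame: np.ndarray):
    """Return the first amplitude spectrum (non-DC bins), the DC amplitude, and the phase."""
    spectrum = np.fft.rfft(frame * np.hanning(FRAME_LEN))    # time-frequency conversion
    magnitude, phase = np.abs(spectrum), np.angle(spectrum)
    return magnitude[1:], magnitude[0], phase                 # first spectrum, DC bin, phase

def generate_audio(frames, predict_magnitude):
    """`predict_magnitude` stands in for the trained audio prediction model (assumed callable)."""
    output = np.zeros(HOP_LEN * (len(frames) - 1) + FRAME_LEN)
    for i, frame in enumerate(frames):
        first_mag, dc_mag, phase = extract_features(frame)
        second_mag = predict_magnitude(first_mag)                # model inference
        target_mag = np.concatenate(([dc_mag], second_mag))      # re-attach the DC bin
        target_spectrum = target_mag * np.exp(1j * phase)        # combine with the phase spectrum
        target_frame = np.fft.irfft(target_spectrum, n=FRAME_LEN)
        output[i * HOP_LEN : i * HOP_LEN + FRAME_LEN] += target_frame  # overlap-add synthesis
    return output
```

With a Hann window and 50 % overlap, the overlap-add step approximately reconstructs the time-domain signal from the modified frames.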
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the detailed description above, which is not repeated here.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be performed by instructions, or by instructions controlling associated hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to implement the functions of the above audio generation method.
The above operations can be implemented as described in the foregoing embodiments and are not repeated here.
The computer-readable storage medium may include a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps of any method provided in the embodiments of the present application, they can achieve the beneficial effects of any such method; for details, see the foregoing embodiments, which are not repeated here.
In addition, an embodiment of the present application provides a computer program product or a computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations described above.
The audio generation method, the audio generation apparatus, and the electronic device provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of audio generation, comprising:
acquiring a first amplitude spectrum, a direct-current component amplitude spectrum, and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be expanded;
inputting the first amplitude spectrum into a trained audio prediction model, and outputting a second amplitude spectrum corresponding to the sampling signal to be predicted, wherein the audio prediction model is a neural network model constructed from a convolutional network and a convolutional long short-term memory network;
combining the direct-current component amplitude spectrum, the second amplitude spectrum and the phase spectrum to obtain a target frequency spectrum;
and generating a target audio signal according to the target frequency spectrum.
2. The audio generation method according to claim 1, wherein the step of acquiring the first amplitude spectrum, the direct-current component amplitude spectrum, and the phase spectrum of the to-be-predicted sampling signal of each audio frame in the audio signal to be expanded comprises:
acquiring an audio signal to be expanded and a target sampling rate;
preprocessing the audio signal to be expanded to obtain each audio frame of the audio signal to be expanded;
resampling each audio frame according to the target sampling rate to obtain a to-be-predicted sampling signal;
and performing feature extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum, a direct-current component amplitude spectrum, and a phase spectrum of the to-be-predicted sampling signal.
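By way of illustration only, a small sketch of the framing and per-frame resampling recited in claim 2 follows; the frame length, the overlap, and the use of SciPy's polyphase resampler are assumptions rather than requirements of the claim.

```python
# Illustrative framing and resampling for claim 2; frame size and resampler are assumed.
import numpy as np
from scipy.signal import resample_poly

def frame_signal(signal: np.ndarray, frame_len: int = 512, hop_len: int = 256):
    """Split the audio signal to be expanded into overlapping audio frames (preprocessing)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return [signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]

def resample_frame(frame: np.ndarray, source_rate: int, target_rate: int) -> np.ndarray:
    """Resample one audio frame to the target sampling rate, giving the to-be-predicted sampling signal."""
    return resample_poly(frame, up=target_rate, down=source_rate)
```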
3. The audio generation method according to claim 2, wherein the step of performing feature extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum, a direct-current component amplitude spectrum, and a phase spectrum of the to-be-predicted sampling signal comprises:
converting the sampling signal to be predicted to a frequency domain according to a preset time-frequency domain conversion condition to obtain a frequency spectrum of the sampling signal to be predicted;
determining a phase spectrum of the sampling signal to be predicted according to the frequency spectrum of the sampling signal to be predicted;
and carrying out amplitude spectrum extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum and a direct-current component amplitude spectrum of the to-be-predicted sampling signal.
4. The audio generation method according to claim 3, wherein the step of performing amplitude spectrum extraction processing on the to-be-predicted sampling signal to obtain a first amplitude spectrum and a direct-current component amplitude spectrum of the to-be-predicted sampling signal comprises:
filtering the to-be-predicted sampling signal to obtain a first filtered signal;
converting the first filtered signal into the frequency domain according to the preset time-frequency domain conversion condition to obtain a frequency spectrum of the first filtered signal;
and determining a first amplitude spectrum and a direct-current component amplitude spectrum of the to-be-predicted sampling signal according to the frequency spectrum of the first filtered signal.
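A non-limiting sketch of the feature extraction in claims 3 and 4 is given below. The low-pass Butterworth filter, its order and cutoff, and the FFT as the preset time-frequency domain conversion are assumptions; the claims only recite filtering and a preset conversion condition.

```python
# Sketch of claims 3 and 4: phase from the unfiltered signal, amplitude spectra from the filtered one.
import numpy as np
from scipy.signal import butter, filtfilt

def extract_spectra(frame: np.ndarray, sample_rate: int, cutoff_hz: float = 3400.0):
    # Claim 3: convert the to-be-predicted sampling signal to the frequency domain and take its phase.
    phase = np.angle(np.fft.rfft(frame))

    # Claim 4: filter the signal, then convert the first filtered signal to the frequency domain.
    b, a = butter(4, cutoff_hz / (sample_rate / 2), btype="low")   # filter type/cutoff are assumptions
    first_filtered = filtfilt(b, a, frame)
    filtered_magnitude = np.abs(np.fft.rfft(first_filtered))

    dc_magnitude = filtered_magnitude[0]       # direct-current component amplitude spectrum
    first_magnitude = filtered_magnitude[1:]   # first amplitude spectrum (non-DC bins)
    return first_magnitude, dc_magnitude, phase
```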
5. The audio generation method according to claim 1, wherein, before the step of inputting the first amplitude spectrum into a trained audio prediction model and outputting a second amplitude spectrum corresponding to the sampling signal to be predicted, the audio prediction model being a neural network model constructed from a convolutional network and a convolutional long short-term memory network, the audio generation method further comprises:
acquiring a training set and an initial audio prediction model, wherein the initial audio prediction model comprises a down-sampling module, a bottleneck layer module, an up-sampling module and an output module; the down-sampling module consists of N down-sampling layers with the same structure and different parameters, and the up-sampling module consists of N up-sampling layers with the same structure and different parameters;
and training the initial audio prediction model according to the training set to obtain a trained audio prediction model.
6. The audio generation method of claim 5, wherein the step of obtaining the training set and the initial audio prediction model comprises:
acquiring a first voice signal, a first sampling rate, and a second sampling rate;
preprocessing the first voice signal to obtain each first audio frame of the first voice signal;
resampling each first audio frame according to the first sampling rate to obtain a first sampling signal;
resampling the first sampling signal according to the second sampling rate to obtain a second sampling signal;
filtering the second sampling signal to obtain a second filtered signal;
respectively converting the first voice signal and the second filtered signal into the frequency domain according to preset time-frequency domain conversion conditions, to obtain a first training frequency spectrum corresponding to the first voice signal and a second training frequency spectrum corresponding to the second filtered signal;
and determining the training set according to the first training frequency spectrum and the second training frequency spectrum.
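The construction of one training pair in claim 6 might look like the sketch below; the concrete sampling rates, the low-pass filter, and the zero-padding used to align bin counts are assumptions made only for illustration.

```python
# Sketch of one training pair for claim 6; sampling rates and filter settings are assumed.
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def build_training_pair(first_audio_frame: np.ndarray, native_rate: int = 48000,
                        first_rate: int = 16000, second_rate: int = 8000):
    # First sampling signal: the first audio frame resampled to the first sampling rate.
    first_sampled = resample_poly(first_audio_frame, first_rate, native_rate)
    # Second sampling signal: the first sampling signal resampled to the second (lower) rate.
    second_sampled = resample_poly(first_sampled, second_rate, first_rate)
    # Second filtered signal: low-pass filtering of the second sampling signal (filter assumed).
    b, a = butter(4, 0.9, btype="low")
    second_filtered = filtfilt(b, a, second_sampled)
    # First and second training frequency spectra; zero-padding aligns the bin counts.
    first_training_spectrum = np.abs(np.fft.rfft(first_sampled))
    second_training_spectrum = np.abs(np.fft.rfft(second_filtered, n=len(first_sampled)))
    return second_training_spectrum, first_training_spectrum   # (model input, training target)
```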
7. The audio generation method of claim 5, wherein an output of each down-sampling layer in the down-sampling module is coupled to an input of each up-sampling layer in the up-sampling module.
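Claims 5 and 7 together suggest a U-Net-like layout. The sketch below (PyTorch) is one possible reading, with the convolutional long short-term memory cell placed in the bottleneck, one-to-one skip connections between down-sampling and up-sampling layers, and N = 3; the bottleneck placement, channel counts, kernel sizes, and N are all assumptions not fixed by the claims.

```python
# One possible reading of claims 5 and 7; layer sizes and the ConvLSTM placement are assumptions.
import torch
import torch.nn as nn

class ConvLSTM1dCell(nn.Module):
    """Minimal 1-D convolutional LSTM cell; all gates share one Conv1d over [input, hidden]."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.gates = nn.Conv1d(2 * channels, 4 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SpectrumPredictor(nn.Module):
    """Down-sampling module, bottleneck layer module, up-sampling module, output module."""
    def __init__(self, n_layers: int = 3, base_channels: int = 16):
        super().__init__()
        chs = [1] + [base_channels * 2 ** k for k in range(n_layers)]
        # N down-sampling layers with the same structure (kernel 5, stride 2) but different parameters.
        self.down = nn.ModuleList(
            [nn.Conv1d(chs[k], chs[k + 1], 5, stride=2, padding=2) for k in range(n_layers)])
        self.bottleneck = ConvLSTM1dCell(chs[-1])
        # N up-sampling layers, each taking a skip connection from a down-sampling layer (claim 7).
        self.up = nn.ModuleList(
            [nn.ConvTranspose1d(2 * chs[k + 1], chs[k], 5, stride=2, padding=2, output_padding=1)
             for k in reversed(range(n_layers))])
        self.output = nn.Conv1d(1, 1, 1)   # output module

    def forward(self, x, state=None):
        # x: (batch, 1, bins) with bins divisible by 2 ** n_layers, e.g. 256 spectral bins.
        skips = []
        for layer in self.down:                            # down-sampling module
            x = torch.relu(layer(x))
            skips.append(x)
        if state is None:                                  # zero-initialise the ConvLSTM state
            state = (torch.zeros_like(x), torch.zeros_like(x))
        h, c = self.bottleneck(x, state)                   # bottleneck layer module
        x = h
        for layer, skip in zip(self.up, reversed(skips)):  # up-sampling module with skip connections
            x = torch.relu(layer(torch.cat([x, skip], dim=1)))
        return self.output(x), (h, c)
```

In this reading, the requirement that the N down-sampling layers share the same structure but differ in parameters is met by reusing the same kernel size and stride while varying the channel counts, and the recurrent bottleneck state lets consecutive audio frames share temporal context.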
8. The audio generation method according to claim 1, wherein the step of generating the target audio signal according to the target frequency spectrum comprises:
converting the target frequency spectrum into the time domain according to a preset time-frequency domain inverse conversion condition to obtain a target audio frame corresponding to the target frequency spectrum;
and performing time domain synthesis processing on the target audio frame to obtain a target audio signal.
9. An audio generation apparatus, comprising:
a first acquisition unit, used for acquiring a first amplitude spectrum, a direct-current component amplitude spectrum, and a phase spectrum of a to-be-predicted sampling signal of each audio frame in an audio signal to be expanded;
a prediction unit, used for inputting the first amplitude spectrum into a trained audio prediction model and outputting a second amplitude spectrum corresponding to the to-be-predicted sampling signal, wherein the audio prediction model is a neural network model constructed from a convolutional network and a convolutional long short-term memory network;
a target frequency spectrum determining unit, used for combining the direct-current component amplitude spectrum, the second amplitude spectrum, and the phase spectrum to obtain a target frequency spectrum;
and a generating unit, used for generating a target audio signal according to the target frequency spectrum.
10. An electronic device, comprising a processor and a memory, the memory being configured to store a computer program, the processor being configured to execute the computer program in the memory to perform the steps of the audio generation method of any of claims 1 to 8.
CN202210329574.8A 2022-03-30 2022-03-30 Audio generation method and device and electronic equipment Pending CN114758672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210329574.8A CN114758672A (en) 2022-03-30 2022-03-30 Audio generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210329574.8A CN114758672A (en) 2022-03-30 2022-03-30 Audio generation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114758672A true CN114758672A (en) 2022-07-15

Family

ID=82329866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210329574.8A Pending CN114758672A (en) 2022-03-30 2022-03-30 Audio generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114758672A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116366169A (en) * 2023-06-01 2023-06-30 浙江大学 Ultrasonic channel modeling method, electronic device and storage medium
CN116366169B (en) * 2023-06-01 2023-10-24 浙江大学 Ultrasonic channel modeling method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN110335620B (en) Noise suppression method and device and mobile terminal
CN109671433B (en) Keyword detection method and related device
CN111554321B (en) Noise reduction model training method and device, electronic equipment and storage medium
CN110176226B (en) Speech recognition and speech recognition model training method and device
CN111883091B (en) Audio noise reduction method and training method of audio noise reduction model
CN111210021B (en) Audio signal processing method, model training method and related device
CN109087669B (en) Audio similarity detection method and device, storage medium and computer equipment
CN111816162B (en) Voice change information detection method, model training method and related device
CN111477243B (en) Audio signal processing method and electronic equipment
CN107993672B (en) Frequency band expanding method and device
CN111722696B (en) Voice data processing method and device for low-power-consumption equipment
CN110505332A (en) A kind of noise-reduction method, device, mobile terminal and storage medium
KR20200094732A (en) Method and system for classifying time series data
CN109756818B (en) Dual-microphone noise reduction method and device, storage medium and electronic equipment
CN111696532A (en) Speech recognition method, speech recognition device, electronic device and storage medium
CN106024002A (en) Time zero convergence single microphone noise reduction
CN110827808A (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
CN109302528A (en) A kind of photographic method, mobile terminal and computer readable storage medium
CN109686359B (en) Voice output method, terminal and computer readable storage medium
CN114758672A (en) Audio generation method and device and electronic equipment
CN113225624B (en) Method and device for determining time consumption of voice recognition
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
CN109889665A (en) A kind of volume adjusting method, mobile terminal and storage medium
CN115602183A (en) Audio enhancement method and device, electronic equipment and storage medium
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination