CN117544603A - Voice communication system and method - Google Patents
- Publication number
- CN117544603A (application CN202311498981.2A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- phase
- waveform
- loss function
- electronic device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/60—Network streaming of media packets
- H04L65/70—Media network packetisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/80—Responding to QoS
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The application discloses a voice communication system and method, relating to the technical field of voice communication. The system comprises a first electronic device and a second electronic device. The first electronic device is configured to acquire the waveform of a voice signal; extract a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generate a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; discretize the continuous code to obtain index vectors; and transmit the index vectors to the second electronic device. The second electronic device is configured to generate a quantization code according to the index vectors; generate a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and restore the second logarithmic magnitude spectrum and the second phase spectrum to the waveform of the voice signal through an inverse short-time Fourier transform. The voice signal can thus be stored or transmitted at a low coding bit rate while the decoded voice signal retains good quality, improving both the efficiency and the fidelity of voice communication.
Description
Technical Field
The present disclosure relates to the field of voice communication technologies, and in particular, to a voice communication system and method.
Background
At present, voice coding and decoding technology is widely applied in voice communication and related technical fields. Speech codec technology compresses a voice signal into discrete codes and then reconstructs the voice signal from those discrete codes; a speech codec is the component that implements this coding and decoding.
In the related art, voice codecs mainly fall into parameter codecs and waveform codecs. A parameter codec typically encodes and decodes characteristic parameters of the voice signal. Because the voice signal is short-time stationary, parameter codecs have the advantage of a low coding bit rate. However, parameter codecs also have the disadvantage that the decoded voice signal is of poor quality and sensitive to noise. A waveform codec encodes and decodes the waveform of the voice signal directly, so the decoded voice signal has high fidelity; but this requires a higher coding bit rate, which increases the storage and transmission costs of the voice signal.
Therefore, how to store or transmit a voice signal with as few bits as possible (i.e., at a low coding bit rate) while ensuring good quality of the decoded voice signal is a technical problem to be solved.
Disclosure of Invention
The application provides a voice communication system and method that can store or transmit voice signals at a low coding bit rate while ensuring good quality of the decoded voice signal.
The application discloses the following technical scheme:
in a first aspect, the present application provides a voice communication system, the system comprising: a first electronic device and a second electronic device;
the first electronic device is used for acquiring the waveform of the voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; discretizing the continuous code to obtain index vectors; and transmitting the index vectors to the second electronic device;
the second electronic device is configured to generate a quantization code according to the index vectors; generate a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and recover the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
Optionally, the first electronic device includes: an encoder module and a quantizer module;
the encoder module is used for acquiring the waveform of the voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; and transmitting the continuous code to the quantizer module;
the quantizer module is used for performing discretization on the continuous codes to obtain index vectors; and sending the index vector to the second electronic device.
Optionally, the encoder module is specifically configured to: encode the first logarithmic magnitude spectrum to obtain an amplitude code; encode the first phase spectrum to obtain a phase code; and concatenate the amplitude code and the phase code to generate the continuous code.
Optionally, the encoder module includes: an amplitude sub-encoder and a phase sub-encoder;
the amplitude subcode is used for encoding the first pair of logarithmic amplitude spectrums to obtain amplitude codes;
the phase sub-encoder is configured to encode the first phase spectrum to obtain a phase code.
Optionally, the second electronic device includes: a decoder module;
the decoder module is used for generating a quantization code according to the index vectors; generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and recovering the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
Optionally, the second electronic device is further configured to:
generating a waveform recovery model according to the quantized code, the second logarithmic magnitude spectrum, the second phase spectrum, and the waveform of the voice signal, wherein the waveform recovery model is a neural network model that generates the waveform of the voice signal from the quantized code.
Optionally, the second electronic device is further configured to: and updating the waveform recovery model according to the value of one or more of the amplitude spectrum loss function, the phase spectrum loss function, the short-time spectrum loss function and the waveform loss function.
Optionally, the phase spectrum loss function is a linear combination of an instantaneous phase loss function, a group delay loss function, and an instantaneous angular frequency loss function; the short-time spectrum loss function is a linear combination of a real part loss function, an imaginary part loss function, and a short-time spectrum consistency loss function; and the waveform loss function is a linear combination of a generative adversarial network loss function, a feature matching loss function, and a mel spectrum loss function.
In a second aspect, the present application provides a voice communication method, applied to a first electronic device, the method including:
acquiring the waveform of a voice signal;
extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through short-time Fourier transform;
generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum;
discretizing the continuous code to obtain index vectors;
and sending the index vector to a second electronic device.
In a third aspect, the present application provides a voice communication method applied to a second electronic device, where the method includes:
receiving an index vector sent by first electronic equipment;
generating a quantization code according to the index vector;
generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code;
and recovering the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
Compared with the prior art, the application has the following beneficial effects:
the application provides a voice communication system and a method, wherein the system comprises the following steps: a first electronic device and a second electronic device; the first electronic equipment is used for acquiring the waveform of the voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through short-time Fourier transform; generating a continuous code according to the first pair of magnitude spectrums and the first phase spectrum; discretizing the continuous codes to obtain index vectors; transmitting the index vector to the second electronic device; the second electronic device is used for generating a quantization code according to the index vector; generating a second pair of magnitude spectra and a second phase spectrum according to the quantization code; the second pair-wise magnitude spectrum and the second phase spectrum are restored to the waveform of the speech signal by an inverse short-time fourier transform. Thus, the waveform of the speech signal with high reduction degree is converted into a speech amplitude spectrum and a phase spectrum by short-time Fourier transform, and then the speech amplitude spectrum and the phase spectrum are used as speech parameter characteristics to be coded in parallel. Similarly, the encoded quantized codes can be decoded in parallel, and the decoded voice amplitude spectrum and the decoded phase spectrum are restored into waveforms of voice signals through inverse short time Fourier transform, so that the voice signals are stored or transmitted at a low encoding bit rate, and meanwhile, the decoded voice signals are guaranteed to have better performance, and the efficiency and the restoration degree of voice communication are improved.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a voice communication system according to an embodiment of the present application;
fig. 2 is a signaling diagram of a voice communication system according to an embodiment of the present application;
fig. 3 is a flowchart of a voice communication method according to an embodiment of the present application;
fig. 4 is a flowchart of another voice communication method according to an embodiment of the present application.
Detailed Description
As described above, speech codec technology has been widely applied in voice communication, signal processing, and related technical fields. Speech codec technology compresses a voice signal into discrete codes and then reconstructs the voice signal from those discrete codes; a speech codec is the component that implements this coding and decoding.
In the related art, speech codecs are generally classified into three types: parameter codecs, waveform codecs, and hybrid codecs.
A parameter codec typically encodes and decodes characteristic parameters of the voice signal, for example via linear predictive coding (Linear Predictive Coding, LPC). Because the voice signal is short-time stationary, parameter codecs have the advantage of a low coding bit rate. However, they also have the disadvantage that the decoded voice signal is of poor quality and sensitive to noise. With the development of deep learning, and to further improve decoded voice quality, a neural network vocoder may be used to convert the encoded characteristic parameters into a voice waveform through a neural network model, thereby implementing decoding. Even so, the quality of the voice signal decoded by a parameter codec still leaves room for improvement.
A waveform codec encodes and decodes the waveform of the voice signal directly, so the decoded voice signal has high fidelity; but this requires a higher coding bit rate, which increases the storage and transmission costs of the voice signal.
With the development of deep learning, some end-to-end neural network codecs have been proposed that can effectively balance decoded speech quality against coding bit rate. Illustratively, the SoundStream codec employs a residual vector quantization mechanism to reduce the coding bit rate, while adopting the generative adversarial loss proposed for the HiFi-GAN vocoder to preserve the quality of the decoded voice signal. However, because these end-to-end neural network codecs encode and decode waveforms directly, they require upsampling and downsampling by factors of several hundred, which reduces codec efficiency and makes the neural network model extremely difficult to train.
Hybrid codecs, such as the regular-pulse excited linear prediction codec (Regular-Pulse Excited Linear Prediction Codec), combine the advantages of parameter codecs and waveform codecs. However, implementations of hybrid codecs based on neural network models remain scarce.
Therefore, how to store or transmit a voice signal with as few bits as possible (i.e., at a low coding bit rate) while ensuring that the quality of the decoded voice signal does not degrade significantly is a technical problem to be solved.
In view of the foregoing, the present application provides a voice communication system and method. The system comprises a first electronic device and a second electronic device. The first electronic device is configured to acquire the waveform of a voice signal; extract a first logarithmic magnitude spectrum and a first phase spectrum from the waveform through a short-time Fourier transform; generate a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; discretize the continuous code to obtain index vectors; and transmit the index vectors to the second electronic device. The second electronic device is configured to generate a quantization code according to the index vectors; generate a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and restore them to the waveform of the voice signal through an inverse short-time Fourier transform. In this way, the high-fidelity waveform of the voice signal is converted by the short-time Fourier transform into a magnitude spectrum and a phase spectrum, which are then encoded in parallel as parametric features of the voice. Likewise, the quantized codes can be decoded in parallel, and the decoded magnitude spectrum and phase spectrum are restored to the waveform of the voice signal through the inverse short-time Fourier transform. The voice signal can thus be stored or transmitted at a low coding bit rate while the decoded voice signal retains good quality, improving both the efficiency and the fidelity of voice communication.
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, a schematic diagram of a voice communication system according to an embodiment of the present application is shown. As can be seen from fig. 1, a voice communication system 10 includes a first electronic device 11 and a second electronic device 12. Wherein the first electronic device 11 comprises an encoder module 101 and a quantizer module 102; the second electronic device 12 includes a decoder module 103 therein.
The encoder module 101 is composed of an amplitude sub-encoder and a phase sub-encoder. Specifically, the amplitude sub-encoder and the phase sub-encoder each consist of an input convolution layer, a ConvNeXt network, and a downsampling convolution layer. The ConvNeXt network is a feature-processing structure built from convolution layers, feed-forward layers, Gaussian error linear units, residual connections, and other elements; the downsampling convolution layer further enlarges the frame shift of the encoded features so as to reduce the coding bit rate.
The quantizer module 102 adopts a residual vector quantization strategy: it is composed of multiple vector quantizers connected in a residual manner, which discretize the continuous code and generate the quantized code. It should be noted that the present application does not limit the specific number of quantizers.
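As a minimal illustrative sketch of the residual vector quantization idea described above (the function names and the brute-force nearest-neighbour search are this sketch's assumptions, not the patent's implementation): each quantizer codes the residual left over by the previous one, and decoding sums the selected codewords.

```python
def rvq_encode(code, codebooks):
    """Quantize a continuous-code vector with Q residual vector quantizers.

    codebooks: list of Q codebooks, each a list of M candidate vectors.
    Returns the Q codeword indices (the index vector m_1, ..., m_Q).
    """
    residual = list(code)
    indices = []
    for book in codebooks:
        # pick the codeword nearest (squared Euclidean distance) to the residual
        m = min(range(len(book)),
                key=lambda i: sum((r - b) ** 2
                                  for r, b in zip(residual, book[i])))
        indices.append(m)
        # subtract the chosen codeword; the next quantizer codes what remains
        residual = [r - b for r, b in zip(residual, book[m])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the quantized code as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for m, book in zip(indices, codebooks):
        out = [o + b for o, b in zip(out, book[m])]
    return out
```

Each additional quantizer refines the previous approximation, which is why the bit budget grows linearly with the number of quantizers Q.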
The decoder module 103 is composed of an amplitude sub-decoder and a phase sub-decoder. Specifically, the amplitude sub-decoder consists of an upsampling deconvolution layer, a ConvNeXt network, and an output convolution layer. The phase sub-decoder differs from the amplitude sub-decoder in that the output convolution layer is replaced by a phase parallel estimation architecture composed of two parallel output convolution layers and a phase calculation formula. This architecture simulates the computation from the real and imaginary parts of the short-time complex spectrum to the phase spectrum, strictly restricts predicted phase values to the principal value interval, and enables direct prediction of the wrapped phase.
Referring to fig. 2, a signaling diagram of a voice communication system according to an embodiment of the present application is shown.
S201: the encoder module obtains a waveform of the speech signal.
First, the encoder module 101 of the first electronic device 11 needs to acquire the waveform of the voice signal x ∈ ℝ^T, where x is the waveform, ℝ denotes the set of real numbers, and T is the number of waveform samples.
S202: the encoder module extracts a first log magnitude spectrum and a first phase spectrum from a waveform of the speech signal by a short-time fourier transform.
The short-time Fourier transform (STFT) is a mathematical tool that determines the frequency and phase content of local sections of a time-varying signal. Through the STFT, a corresponding first logarithmic magnitude spectrum A ∈ ℝ^{F×N} and first phase spectrum P ∈ ℝ^{F×N} are extracted from the waveform x of the voice signal, where A is the first logarithmic magnitude spectrum, F is the number of frames, N is the number of frequency bins, and P is the first phase spectrum. It will be appreciated that A and P are F×N matrices whose elements are real numbers.
In some examples, the waveform of a voice signal sampled at 16 kHz (kilohertz) may be subjected to a short-time Fourier transform to obtain a first logarithmic magnitude spectrum and a first phase spectrum each at a frame rate of 400 Hz. Note that the present application is not limited to a specific sampling rate.
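As an illustrative sketch of this extraction step (not the patent's implementation), the pure-Python fragment below computes a log-magnitude row and a wrapped-phase row per analysis frame with a naive DFT; a practical system would use a windowed FFT with overlapping frames.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one analysis frame."""
    n_pts = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def log_magnitude_and_phase(frames, eps=1e-9):
    """Return a log-magnitude spectrum A and a wrapped phase spectrum P,
    each an F x N list of lists (F frames, N frequency bins)."""
    A, P = [], []
    for frame in frames:
        spectrum = dft(frame)
        A.append([math.log(abs(c) + eps) for c in spectrum])
        P.append([cmath.phase(c) for c in spectrum])  # phase in (-pi, pi]
    return A, P
```

`cmath.phase` already returns values in the principal interval, which is the same wrapped-phase convention the decoder's phase estimation architecture targets.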
S203: the encoder module generates a continuous code from the first pair of magnitude spectra and the first phase spectrum.
The encoder module 101 comprises an amplitude sub-encoder and a phase sub-encoder. The amplitude sub-encoder and the phase sub-encoder are parallel and identical in structure.
Specifically, the amplitude sub-encoder and the phase sub-encoder take the first logarithmic magnitude spectrum A and the first phase spectrum P as inputs, respectively, and encode them in parallel to generate an amplitude code and a phase code. The amplitude code and the phase code are then concatenated and fused, and a dimension-reducing convolution layer generates a long-frame-shift, low-dimensional continuous code C ∈ ℝ^{F_c×N_c}, where C is the continuous code, ℝ denotes the set of real numbers, F_c is the number of frames of the continuous code, and N_c is the dimension of the continuous code.
In some examples, the continuous code generated from a log-magnitude spectrum and a phase spectrum each at a frame rate of 400 Hz may have a frame rate of 50 Hz.
It should be noted that the amplitude sub-encoder and the phase sub-encoder each consist of an input convolution layer, a ConvNeXt network, and a downsampling convolution layer. The input first logarithmic magnitude spectrum A (or first phase spectrum P) first undergoes a normalization operation in an input convolution layer with C channels, is then processed in depth by the ConvNeXt network, after which a downsampling convolution layer with C/2 channels and stride D reduces the dimension of the ConvNeXt output to half of the original, finally generating the amplitude continuous code (or phase continuous code).

More specifically, the ConvNeXt network consists of a cascade of multiple identical ConvNeXt blocks. In each ConvNeXt block, the normalized log-magnitude spectrum or phase spectrum output by the input convolution layer passes sequentially through a linear convolution layer with C channels; a feed-forward layer with C_H nodes that maps features to a higher dimension (C_H > C); a Gaussian error linear unit (Gaussian Error Linear Unit, GELU); and a feed-forward layer with C nodes that maps features back to the original low dimension. The resulting output is added to the block input (i.e., a residual connection) to form the final output of the ConvNeXt block.
S204: the encoder module sends the continuous code to the quantizer module.
S205: the quantizer module performs discretization on the continuous codes to obtain index vectors.
The quantizer module 102 discretizes the continuous code C generated by the encoder module 101 to obtain the index vector m_1, m_2, …, m_Q.
Specifically, the quantizer module 102 employs a residual vector quantization (Residual Vector Quantization, RVQ) strategy consisting of Q vector quantizers (Vector Quantization, VQ), each with a trainable codebook B_q ∈ ℝ^{M×N_c}, where B_q is the codebook of the q-th quantizer, N_c is the dimension of the continuous code, and M is the number of vectors in the codebook. The index vector m_1, m_2, …, m_Q is then stored or transmitted in binary form. Thus, the coding bit rate (in kilobits per second, i.e., kbps) of the codec proposed in the present application can be calculated by the following equation (1):

Bitrate = f_s / (w_s · D) · Q · log2(M) / 1000 (1)
wherein Bitrate is the coding bit rate, f_s is the sampling rate of the waveform of the voice signal, w_s is the frame shift used when extracting the first logarithmic magnitude spectrum and the first phase spectrum from the waveform of the voice signal, D is the downsampling factor of the downsampling convolution layer, Q is the number of vector quantizers, and M is the number of vectors in each codebook.
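Under these variable definitions, the bit rate relation can be checked numerically. The concrete parameter values below (16 kHz sampling, a frame shift of 40 samples, 8× downsampling, 1024-entry codebooks) are illustrative assumptions consistent with the 400 Hz and 50 Hz frame rates mentioned earlier, not values fixed by the patent.

```python
import math

def coding_bitrate_kbps(f_s, w_s, D, Q, M):
    """Coding bit rate: frame rate of the continuous code times
    Q quantizers times log2(M) bits per codeword index, in kbps."""
    frame_rate = f_s / (w_s * D)  # continuous-code frames per second
    return frame_rate * Q * math.log2(M) / 1000.0
```

For example, with f_s = 16000 and w_s = 40 the spectra arrive at 400 Hz; D = 8 brings the continuous code to 50 Hz, so four quantizers with M = 1024 give 50 × 4 × 10 / 1000 = 2 kbps.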
S206: the quantizer module sends the index vector to the decoder module.
S207: the decoder module generates a quantization code from the index vector.
The decoder module 103 of the second electronic device 12 may generate a quantized code Ĉ from the index vector m_1, m_2, …, m_Q.
S208: the decoder module generates a second pair of magnitude spectra and a second phase spectrum from the quantized code.
The quantized code Ĉ first passes through a dimension-raising convolution layer with C/2 channels to restore the original dimension, and is then fed into the amplitude sub-decoder and the phase sub-decoder, which decode a second logarithmic magnitude spectrum Â and a second phase spectrum P̂ in parallel.
Specifically, in the amplitude sub-decoder, the input quantized code at the original dimension is first upsampled by a factor of D through an upsampling deconvolution layer with C channels and stride D, then processed in depth by a ConvNeXt network, and finally a second logarithmic magnitude spectrum Â is predicted through an output convolution layer with N channels.
The only difference between the phase sub-decoder and the amplitude sub-decoder is that the output of the phase sub-decoder is a phase parallel estimation architecture. This architecture guarantees direct prediction of the wrapped phase spectrum and consists of two parallel output linear convolution layers, each with N channels, and a phase calculation formula Φ. It will be appreciated that, assuming the outputs of the two output convolution layers are R and I respectively, the second phase spectrum P̂ is computed as P̂ = Φ(R, I).
It should be noted that the function Φ is computed element-wise. By way of example, the function Φ may be given by the following formulas (2) and (3):

Φ(R, I) = arctan(I / R) − (π / 2) · sgn*(I) · [sgn*(R) − 1] (2)

sgn*(x) = 1 when x ≥ 0; sgn*(x) = −1 when x < 0 (3)

wherein R is the first output, from the first output convolution layer, and I is the second output, from the second output convolution layer.
S209: the decoder module restores the second pair of magnitude spectra and the second phase spectrum to waveforms of the speech signal by an inverse short time fourier transform.
After obtaining the second logarithmic magnitude spectrum Â and the second phase spectrum P̂, the waveform x̂ of the speech signal can be reconstructed by an inverse short-time Fourier transform (ISTFT). The speech signal waveform x̂ obtained here has extremely high similarity to, or is identical to, the original speech signal waveform x.
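The reconstruction step can be sketched with SciPy's ISTFT. The sampling rate and frame parameters below are illustrative assumptions, not values from this application:

```python
import numpy as np
from scipy.signal import istft

def reconstruct_waveform(A_hat, P_hat, fs=16000, nperseg=320, noverlap=160):
    # Rebuild the complex short-time spectrum from the decoded log
    # magnitude spectrum A_hat and phase spectrum P_hat, then invert it
    # with the ISTFT. fs, nperseg and noverlap are assumed values.
    S_hat = np.exp(A_hat) * np.exp(1j * P_hat)
    _, x_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_hat
```

Provided the analysis parameters match those of the forward STFT, feeding the natural log magnitude and phase back through this function returns the original waveform up to numerical precision.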
In other specific implementations, in addition to restoring the waveform x of the speech signal to x̂ through steps S202-S209 described above, a waveform recovery model may be generated from the quantized code, the first logarithmic magnitude spectrum, the second logarithmic magnitude spectrum, the first phase spectrum, the second phase spectrum, and the speech signal waveforms x and x̂. The waveform recovery model is a neural network model for recovering the waveform of the speech signal, so that once it has been constructed, the waveform of the corresponding speech signal can be obtained by directly inputting the quantized code into it. Moreover, loss functions can be constructed from the second logarithmic magnitude spectrum, the second phase spectrum, and the waveform of the speech signal, respectively, so as to optimize and update the waveform recovery model. The waveform recovery model may be included in the second electronic device 12, or it may be included in the first electronic device 11; the specific location of the waveform recovery model is not limited in this application.
In some examples, the loss function may include an amplitude spectrum loss function L_A. The amplitude spectrum loss function L_A is defined as the mean square error between the second logarithmic magnitude spectrum Â and the first logarithmic magnitude spectrum A, and is used to reduce the gap between the decoded second logarithmic magnitude spectrum Â and the first logarithmic magnitude spectrum A. Specifically, the amplitude spectrum loss function L_A can be given by the following formula (4):

L_A = E[(Â − A)²]   (4)

where L_A is the amplitude spectrum loss function, E[·] denotes the mean, Â is the second logarithmic magnitude spectrum, and A is the first logarithmic magnitude spectrum.
It should be noted that, when calculating the first logarithmic magnitude spectrum A, the natural short-time complex spectrum S is first extracted from the original speech signal waveform x by a short-time Fourier transform, and A is then calculated by the following formula (5):

A = log √(Re(S)² + Im(S)²)   (5)

where A is the first logarithmic magnitude spectrum, Re and Im denote the real and imaginary parts respectively, and S is the short-time complex spectrum.
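A minimal sketch of formulas (4) and (5). The small numerical floor added before the logarithm avoids log(0) and is an implementation choice, not part of the source formula:

```python
import numpy as np

def log_magnitude(S):
    # Formula (5): A = log sqrt(Re(S)^2 + Im(S)^2), element-wise on the
    # short-time complex spectrum S. eps is an assumed numerical floor.
    eps = 1e-12
    return np.log(np.sqrt(S.real ** 2 + S.imag ** 2) + eps)

def amplitude_loss(A_hat, A):
    # Formula (4): mean squared error between the decoded log magnitude
    # spectrum A_hat and the natural log magnitude spectrum A.
    return np.mean((A_hat - A) ** 2)
```

For example, a bin with complex value 2+0j has log magnitude log 2, and two identical spectra yield a loss of exactly zero.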
In other examples, the loss function may include a phase spectrum loss function L_P. The phase spectrum loss function L_P is defined as the loss between the second phase spectrum P̂ and the first phase spectrum P, and is used to narrow the gap between the second phase spectrum P̂ and the first phase spectrum P. The phase spectrum loss function L_P may comprise three sub-loss functions: an instantaneous phase loss function L_IP, a group delay loss function L_GD, and an instantaneous angular frequency loss function L_IAF.
Specifically, the instantaneous phase loss function L_IP is defined as the anti-wrapping loss between the second phase spectrum P̂ and the first phase spectrum P, and can be expressed as the following formula (6):

L_IP = E[f_AW(P̂ − P)]   (6)

where L_IP is the instantaneous phase loss function, E[·] denotes the mean, f_AW is the linear anti-wrapping function, P̂ is the second phase spectrum, and P is the first phase spectrum.
Specifically, the group delay loss function L_GD is defined as the anti-wrapping loss between the decoded group delay Δ_DF P̂ and the natural group delay Δ_DF P, and can be expressed as the following formula (7):

L_GD = E[f_AW(Δ_DF P̂ − Δ_DF P)]   (7)

where L_GD is the group delay loss function, E[·] denotes the mean, Δ_DF P̂ is the decoded group delay, Δ_DF P is the natural group delay, Δ_DF denotes differentiation along the frequency axis, and f_AW is the linear anti-wrapping function.
Specifically, the instantaneous angular frequency loss function L_IAF is defined as the anti-wrapping loss between the decoded instantaneous angular frequency Δ_DT P̂ and the natural instantaneous angular frequency Δ_DT P, and can be expressed as the following formula (8):

L_IAF = E[f_AW(Δ_DT P̂ − Δ_DT P)]   (8)

where L_IAF is the instantaneous angular frequency loss function, E[·] denotes the mean, Δ_DT P̂ is the decoded instantaneous angular frequency, Δ_DT P is the natural instantaneous angular frequency, Δ_DT denotes differentiation along the time axis, and f_AW is the linear anti-wrapping function.
It should be noted that formulas (6) to (8) all involve the linear anti-wrapping function f_AW, which avoids the inflation of training errors caused by phase wrapping, as shown in the following formula (9):

f_AW(x) = |x − 2π · round(x / 2π)|, x ∈ ℝ   (9)

where f_AW is the linear anti-wrapping function and ℝ is the set of real numbers.
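The anti-wrapping function and the three phase sub-losses of formulas (6)-(9) can be sketched as follows. Treating axis 0 as the frequency axis and axis 1 as the time axis is an assumption about the spectrogram layout:

```python
import numpy as np

def f_aw(x):
    # Linear anti-wrapping function of formula (9), element-wise:
    # f_AW(x) = |x - 2*pi*round(x / (2*pi))|.
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def phase_loss(P_hat, P):
    # Formulas (6)-(8) and their sum. Axis 0 is treated as the frequency
    # axis (Delta_DF) and axis 1 as the time axis (Delta_DT); this layout
    # is an assumption.
    l_ip = np.mean(f_aw(P_hat - P))                                     # (6)
    l_gd = np.mean(f_aw(np.diff(P_hat, axis=0) - np.diff(P, axis=0)))   # (7)
    l_iaf = np.mean(f_aw(np.diff(P_hat, axis=1) - np.diff(P, axis=1)))  # (8)
    return l_ip + l_gd + l_iaf
```

Note that f_AW maps any error that is a multiple of 2π to zero, which is exactly why a predicted phase spectrum shifted by a full turn incurs no penalty.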
It should be noted that the first phase spectrum P referred to in formulas (6) to (8) is given by the following formula (10):

P = Φ(Re(S), Im(S))   (10)

where P is the first phase spectrum, S is the short-time complex spectrum, and Re and Im denote the real and imaginary parts respectively.
Finally, the phase spectrum loss function L_P can be expressed as the sum of the instantaneous phase loss function L_IP, the group delay loss function L_GD, and the instantaneous angular frequency loss function L_IAF, as shown in the following formula (11):

L_P = L_IP + L_GD + L_IAF   (11)

where L_P is the phase spectrum loss function, L_IP is the instantaneous phase loss function, L_GD is the group delay loss function, and L_IAF is the instantaneous angular frequency loss function.
In some examples, a reconstructed short-time spectrum Ŝ and the natural short-time complex spectrum S are also obtained in the above process, and the waveform recovery model can further be updated according to a short-time spectrum loss function L_S. The short-time spectrum loss function L_S is used to improve the degree of matching between the decoded second logarithmic magnitude spectrum Â and the second phase spectrum P̂, and to ensure the consistency of the short-time spectrum Ŝ they reconstruct. The short-time spectrum loss function L_S comprises three sub-loss functions: a real part loss function L_Re, an imaginary part loss function L_Im, and a short-time spectrum consistency loss function L_C.
Specifically, the real part loss function L_Re is defined as the absolute error loss between the reconstructed real part Re(Ŝ) and the natural real part Re(S), and the imaginary part loss function L_Im is defined as the absolute error loss between the reconstructed imaginary part Im(Ŝ) and the natural imaginary part Im(S). They can be expressed as the following formulas (12) and (13), respectively:

L_Re = E[|Re(Ŝ) − Re(S)|]   (12)

L_Im = E[|Im(Ŝ) − Im(S)|]   (13)

where L_Re is the real part loss function, L_Im is the imaginary part loss function, E[·] denotes the mean, Ŝ is the reconstructed short-time spectrum, and S is the natural short-time complex spectrum.
Since the second logarithmic magnitude spectrum and the second phase spectrum are predicted values, and the short-time spectrum domain is only a subset of the complex domain, the short-time spectrum Ŝ they reconstruct is not necessarily a truly existing short-time spectrum. Denote the truly existing short-time spectrum by S̄. The short-time spectrum consistency loss function L_C is used to reduce the gap between the reconstructed short-time spectrum Ŝ and the truly existing short-time spectrum S̄. The truly existing short-time spectrum S̄ is obtained by first applying an inverse short-time Fourier transform to the reconstructed short-time spectrum Ŝ to obtain a waveform, and then applying a short-time Fourier transform to that waveform, as expressed in the following formula (14):

S̄ = STFT(ISTFT(Ŝ))   (14)

where S̄ is the truly existing short-time spectrum, STFT is the short-time Fourier transform, ISTFT is the inverse short-time Fourier transform, and Ŝ is the reconstructed short-time spectrum.
The short-time spectrum consistency loss function L_C is defined as the two-norm between the reconstructed short-time spectrum Ŝ and the truly existing short-time spectrum S̄, written in terms of real and imaginary parts as the following formula (15):

L_C = E[(Re(Ŝ) − Re(S̄))² + (Im(Ŝ) − Im(S̄))²]   (15)

where L_C is the short-time spectrum consistency loss function, E[·] denotes the mean, Re(Ŝ) and Re(S̄) are the reconstructed and truly existing real parts, and Im(Ŝ) and Im(S̄) are the reconstructed and truly existing imaginary parts.
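Formulas (14) and (15) can be sketched together: project the predicted spectrum through ISTFT and then STFT, and measure the real/imaginary gap. The STFT parameters are illustrative assumptions, not values from this application:

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_loss(S_hat, fs=16000, nperseg=320, noverlap=160):
    # Formulas (14)-(15): project the predicted short-time spectrum onto
    # the set of truly realizable spectra via STFT(ISTFT(.)), then take
    # the mean squared gap of the real and imaginary parts.
    _, y = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, S_bar = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    S_bar = S_bar[:, :S_hat.shape[1]]  # guard against off-by-one frame counts
    d = S_hat - S_bar
    return np.mean(d.real ** 2 + d.imag ** 2)
```

A spectrum that actually came from an STFT is already consistent, so its loss is numerically zero, while an arbitrary complex array generally is not.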
Finally, the short-time spectrum loss function L_S can be expressed as a proportional linear combination of the real part loss function L_Re, the imaginary part loss function L_Im, and the short-time spectrum consistency loss function L_C, as shown in the following formula (16):

L_S = λ_RI · (L_Re + L_Im) + L_C   (16)

where L_S is the short-time spectrum loss function, λ_RI is a hyperparameter, L_Re is the real part loss function, L_Im is the imaginary part loss function, and L_C is the short-time spectrum consistency loss function.

In some examples, the loss function may include a waveform loss function L_W. The waveform loss function L_W is used to shrink the gap between the decoded waveform x̂ and the natural waveform x (i.e., the waveform input to the encoder module 101 and the waveform output from the decoder module 103), and includes a generator loss function L_G of a generative adversarial network, a feature matching loss function L_FM, and a mel-spectrogram loss function L_Mel. The waveform loss function L_W is a proportional linear combination of these sub-loss functions, as shown in the following formula (17):

L_W = L_G + L_FM + λ_Mel · L_Mel   (17)

where L_W is the waveform loss function, L_G is the generator loss function of the generative adversarial network, L_FM is the feature matching loss function, λ_Mel is a hyperparameter, and L_Mel is the mel-spectrogram loss function.

The waveform recovery model adopts a standard generative adversarial training scheme, and the final generator loss L_Gen is a proportional linear combination of all the above losses, as shown in the following formula (18):

L_Gen = λ_A · L_A + λ_P · L_P + λ_S · L_S + L_W   (18)

where L_Gen is the generator loss, L_A is the amplitude spectrum loss function, L_P is the phase spectrum loss function, L_S is the short-time spectrum loss function, L_W is the waveform loss function, and λ_A, λ_P, and λ_S are hyperparameters.
During training, the generator loss L_Gen and the discriminator loss function of the generative adversarial network are used to train the generator and the discriminator alternately until the model converges.
In some specific scenarios, taking a practical voice communication scenario as an example, the first electronic device (the transmitting end) encodes a voice signal and transmits the encoded voice signal to the second electronic device (the receiving end), and the receiving end then decodes it to recover the voice signal.
In some specific implementations, the encoder of the first electronic device first extracts the logarithmic magnitude spectrum A and the phase spectrum P from the waveform of the speech signal by a short-time Fourier transform, and generates the continuous code C from the logarithmic magnitude spectrum A and the phase spectrum P. The quantizer of the first electronic device then discretizes the continuous code C and generates an index vector m_1, m_2, ..., m_Q. Finally, the index vector m_1, m_2, ..., m_Q is converted into binary form and transmitted to the second electronic device.
After receiving the index vector, the decoder of the second electronic device first generates the quantized code Ĉ from the codebooks B_1, B_2, ..., B_Q, then generates the logarithmic magnitude spectrum Â and the phase spectrum P̂ from Ĉ, and finally restores the speech signal waveform x̂ from the logarithmic magnitude spectrum Â and the phase spectrum P̂ by an inverse short-time Fourier transform, thereby completing a full voice communication process.
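The quantizer/codebook exchange can be sketched with a single codebook. The application uses Q codebooks B_1, ..., B_Q; collapsing to one codebook is a simplification, and all sizes below are assumed:

```python
import numpy as np

def quantize(C, codebook):
    # Transmitter side: nearest-neighbour vector quantization mapping each
    # continuous code vector (row of C) to the index of its closest
    # codebook entry. Only the indices need to be transmitted.
    d = np.linalg.norm(C[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(d, axis=1)

def dequantize(m, codebook):
    # Receiver side: rebuild the quantized code by codebook lookup.
    return codebook[m]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 8))  # 16 entries of dimension 8 (assumed sizes)
C = codebook[[3, 7, 7, 0]] + 0.01 * rng.standard_normal((4, 8))
m = quantize(C, codebook)
print(m.tolist())  # [3, 7, 7, 0]
```

This is why the scheme is bandwidth-efficient: each 8-dimensional continuous vector is replaced by a single 4-bit index into the shared codebook.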
In summary, the present application discloses a voice communication system comprising a first electronic device and a second electronic device. The first electronic device is used for acquiring a waveform of a voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; discretizing the continuous code to obtain an index vector; and transmitting the index vector to the second electronic device. The second electronic device is used for generating a quantization code according to the index vector; generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and restoring the second logarithmic magnitude spectrum and the second phase spectrum to the waveform of the voice signal through an inverse short-time Fourier transform. In this way, the short-time Fourier transform and its inverse convert the speech waveform to and from amplitude and phase spectra with high fidelity, and these spectra are encoded and decoded in parallel as speech parameter features, so that good decoded speech quality is ensured even when the voice signal is stored or transmitted at a low coding bit rate, improving both the efficiency and the fidelity of voice communication.
Referring to fig. 3, a flowchart of a voice communication method according to an embodiment of the present application is shown. The method is applied to a first electronic device 11, comprising:
S301: A waveform of a speech signal is acquired.
S302: a first logarithmic magnitude spectrum and a first phase spectrum are extracted from the waveform of the speech signal by short-time Fourier transform.
S303: generating a continuous code according to the first pair of magnitude spectrums and the first phase spectrum.
S304: Discretizing the continuous code to obtain an index vector.

S305: Sending the index vector to a second electronic device.
In summary, the present application provides a voice communication method that can store or transmit a voice signal at a low coding bit rate while maintaining good decoded speech quality, thereby improving the efficiency and fidelity of voice communication.
Referring to fig. 4, a flowchart of another voice communication method according to an embodiment of the present application is shown. The method is applied to a second electronic device 12, comprising:
S401: Receiving an index vector sent by the first electronic device.

S402: Generating a quantization code according to the index vector.

S403: Generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code.

S404: Recovering the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
In summary, the present application provides a voice communication method that can store or transmit a voice signal at a low coding bit rate while maintaining good decoded speech quality, thereby improving the efficiency and fidelity of voice communication.
The "first" and "second" in the names of "first", "second" (where present) and the like in the embodiments of the present application are used for name identification only, and do not represent the first and second in sequence.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the methods described in the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner: identical and similar parts of the embodiments refer to each other, and each embodiment mainly describes its differences from the others. In particular, the apparatus and media embodiments are described relatively simply because they are substantially similar to the system and method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and media embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without undue effort.
The foregoing is merely one specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A voice communication system, the system comprising: a first electronic device and a second electronic device;
the first electronic device is used for acquiring a waveform of a voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; discretizing the continuous code to obtain an index vector; and transmitting the index vector to the second electronic device;
the second electronic device is configured to generate a quantization code according to the index vector; generate a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and recover the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
2. The system of claim 1, wherein the first electronic device comprises: an encoder module and a quantizer module;
the encoder module is used for acquiring the waveform of the voice signal; extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through a short-time Fourier transform; generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum; and transmitting the continuous code to the quantizer module;
the quantizer module is used for performing discretization on the continuous codes to obtain index vectors; and sending the index vector to the second electronic device.
3. The system according to claim 2, wherein the encoder module is specifically configured to: encode the first logarithmic magnitude spectrum to obtain an amplitude code; encode the first phase spectrum to obtain a phase code; and splice the amplitude code and the phase code to generate the continuous code.
4. The system of claim 3, wherein the encoder module comprises: an amplitude sub-encoder and a phase sub-encoder;
the amplitude sub-encoder is used for encoding the first logarithmic magnitude spectrum to obtain the amplitude code;
the phase sub-encoder is configured to encode the first phase spectrum to obtain a phase code.
5. The system of claim 1, wherein the second electronic device comprises: a decoder module;
the decoder module is used for generating a quantization code according to the index vector; generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code; and recovering the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
6. The system of claim 1, wherein the second electronic device is further configured to:
generating a waveform recovery model according to the quantization code, the second logarithmic magnitude spectrum, the second phase spectrum, and the waveform of the voice signal, wherein the waveform recovery model is a neural network model for generating the waveform of the voice signal from the quantization code.
7. The system of claim 6, wherein the second electronic device is further configured to: update the waveform recovery model according to the value of one or more of an amplitude spectrum loss function, a phase spectrum loss function, a short-time spectrum loss function, and a waveform loss function.
8. The system of claim 7, wherein the phase spectrum loss function is a linear combination of an instantaneous phase loss function, a group delay loss function, and an instantaneous angular frequency loss function; the short-time spectrum loss function is a linear combination of a real part loss function, an imaginary part loss function, and a short-time spectrum consistency loss function; and the waveform loss function is a linear combination of a generator loss function of a generative adversarial network, a feature matching loss function, and a mel-spectrogram loss function.
9. A method of voice communication, applied to a first electronic device, the method comprising:
acquiring the waveform of a voice signal;
extracting a first logarithmic magnitude spectrum and a first phase spectrum from the waveform of the voice signal through short-time Fourier transform;
generating a continuous code according to the first logarithmic magnitude spectrum and the first phase spectrum;
discretizing the continuous codes to obtain index vectors;
and sending the index vector to a second electronic device.
10. A method of voice communication, for use with a second electronic device, the method comprising:
receiving an index vector sent by first electronic equipment;
generating a quantization code according to the index vector;
generating a second logarithmic magnitude spectrum and a second phase spectrum according to the quantization code;
and recovering the second logarithmic magnitude spectrum and the second phase spectrum into the waveform of the voice signal through an inverse short-time Fourier transform.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311498981.2A CN117544603A (en) | 2023-11-08 | 2023-11-08 | Voice communication system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117544603A true CN117544603A (en) | 2024-02-09 |
Family
ID=89793167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311498981.2A Pending CN117544603A (en) | 2023-11-08 | 2023-11-08 | Voice communication system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117544603A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||