WO2023221674A1 - Audio encoding and decoding method and related products - Google Patents

Audio encoding and decoding method and related products (音频编解码方法及相关产品)

Info

Publication number
WO2023221674A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio, audio frame, vector, encoding, codebook
Prior art date
Application number
PCT/CN2023/085872
Other languages
English (en)
French (fr)
Inventor
华超
黄飞
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202210546928.4A (CN115050378B)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2023221674A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 ... using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • G10L19/22 Mode decision, i.e. based on audio signal content versus external parameters
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 ... characterised by the type of extracted parameters
    • G10L25/24 ... the extracted parameters being the cepstrum
    • G10L25/27 ... characterised by the analysis technique
    • G10L25/30 ... using neural networks

Definitions

  • This application belongs to the field of audio and video technology, and specifically relates to an audio encoding and decoding method, an audio encoding and decoding device, a computer-readable medium, an electronic device and a computer program product.
  • Encoding and decoding media data such as audio and video can realize compression and transmission of media data, thereby reducing the network transmission cost of media data and improving network transmission efficiency.
  • However, during network transmission, media data may be lost, resulting in poor media data quality.
  • This application provides an audio encoding and decoding method, an audio encoding and decoding device, a computer-readable medium, an electronic device, and a computer program product, aiming to improve the quality of media data.
  • an audio decoding method is provided.
  • the method is executed by a computer device.
  • the method includes:
  • the encoding vector of each audio frame in the audio frame sequence is obtained; when decoding the current audio frame in the audio frame sequence, the encoding vector of the historical audio frame is upsampled to obtain the upsampled feature value.
  • the historical audio frame is before the current audio frame in the audio frame sequence.
  • the upsampling feature value is a feature vector obtained during the upsampling process and used to describe the historical audio frame;
  • the encoding vector of the current audio frame is upsampled according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • an audio encoding method is provided.
  • the method is executed by a computer device.
  • the method includes:
  • the audio data of each audio frame in the audio frame sequence is obtained; when encoding the current audio frame in the audio frame sequence, the audio data of the historical audio frame is down-sampled to obtain the down-sampled feature value.
  • the historical audio frame is before the current audio frame in the audio frame sequence.
  • the down-sampling feature value is a feature vector obtained during the down-sampling process and used to describe the historical audio frame;
  • the audio data of the current audio frame is down-sampled according to the down-sampling feature value to obtain the encoding vector of the current audio frame.
  • an audio decoding device is provided, the device is deployed on a computer device, and the device includes:
  • An acquisition module configured to acquire the encoding vector of each audio frame in the audio frame sequence
  • a first upsampling module configured to, when decoding the current audio frame in the audio frame sequence, upsample the encoding vector of the historical audio frame to obtain the upsampling feature value, the historical audio frame being an audio frame before the current audio frame in the audio frame sequence;
  • the upsampling feature value is a feature vector obtained during the upsampling process and used to describe historical audio frames;
  • the second upsampling module is configured to upsample the encoding vector of the current audio frame according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • an audio encoding device is provided, the device is deployed on a computer device, and the device includes:
  • An acquisition module configured to acquire the audio data of each audio frame in the audio frame sequence
  • a first down-sampling module configured to down-sample the audio data of historical audio frames to obtain down-sampling feature values when encoding the current audio frame in the audio frame sequence, the historical audio frames being audio frames before the current audio frame in the audio frame sequence;
  • the down-sampling feature value is a feature vector obtained during the down-sampling process and used to describe historical audio frames;
  • the second down-sampling module is configured to down-sample the audio data of the current audio frame according to the down-sampling feature value to obtain the encoding vector of the current audio frame.
  • a computer-readable medium is provided.
  • a computer program is stored on the computer-readable medium.
  • when the computer program is executed by a processor, the audio encoding and decoding method in the above technical solutions is implemented.
  • an electronic device including: a processor; and a memory for storing a computer program executable by the processor; wherein the processor is configured to execute the computer program, causing the electronic device to execute the audio encoding and decoding method in the above technical solutions.
  • a computer program product including a computer program that implements the audio encoding and decoding method in the above technical solution when executed by a processor.
  • the encoding vector of each audio frame in the audio frame sequence is obtained.
  • the encoding vector of the current audio frame can be upsampled.
  • the upsampling feature value obtained by upsampling the encoding vector of the historical audio frame can be introduced, so that the encoding vector of the current audio frame is upsampled according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • Figure 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiment of the present application can be applied;
  • Figure 2 illustrates the placement of an audio and video encoding device and an audio and video decoding device in a streaming environment;
  • Figure 3 shows the network structure block diagram of a codec based on a convolutional neural network
  • Figure 4 shows a flow chart of an audio decoding method in one embodiment of the present application
  • Figure 5 shows a flow chart of the method steps for audio decoding based on a convolutional neural network including multiple upsampling layers in one embodiment of the present application
  • Figure 6 shows a schematic diagram of a network module that implements data encoding and decoding processing in an embodiment of the present application
  • Figure 7 shows a schematic diagram of the principle of normalizing the channel feature values output by multiple sampling channels in one embodiment of the present application
  • Figure 8 shows a flow chart of the steps of audio frame decoding based on the query codebook in one embodiment of the present application
  • Figure 9 shows a schematic diagram of the principle of determining encoding vectors based on data mapping in one embodiment of the present application.
  • Figure 10 shows a flow chart of steps for training the quantizer in one embodiment of the present application
  • Figure 11 shows a step flow chart of an audio encoding method in one embodiment of the present application
  • Figure 12 shows a flow chart of the method steps for audio encoding based on a convolutional neural network including multiple downsampling layers in one embodiment of the present application
  • Figure 13 shows a flow chart of steps for model training of the encoder and decoder in one embodiment of the present application
  • Figure 14 shows a schematic diagram of the principle of encoding and decoding model training based on a generative adversarial network in one embodiment of the present application
  • Figure 15 shows a structural block diagram of an audio decoding device in an embodiment of the present application
  • Figure 16 shows a structural block diagram of an audio encoding device in one embodiment of the present application.
  • Figure 17 shows a block diagram of a computer system structure of an electronic device in an embodiment of the present application.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments may, however, be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art.
  • It should be noted that this application involves user-related data such as audio frames.
  • Convolutional neural network: In the field of multimedia data processing such as text, images, audio and video, the convolutional neural network is the most successfully applied deep learning structure. A convolutional neural network is composed of multiple network layers, generally including convolutional layers (Convolutional Layer), downsampling layers (Pooling Layer), activation function layers (Activation Layer), normalization layers (Normalization Layer), fully connected layers (Fully Connected Layer), etc.
  • Audio coding and decoding: The audio encoding process compresses the audio into smaller data, and the decoding process restores the smaller data to audio.
  • The encoded smaller data is used for network transmission and takes up less bandwidth.
  • Audio sampling rate: Describes the number of sampling points contained in a unit of time (1 second). For example, a 16k sampling rate contains 16,000 sampling points per second, and each sampling point corresponds to a short integer.
  • Codebook: A collection of multiple vectors. The encoder and decoder save exactly the same codebook.
  • Quantization: Find the vector in the codebook closest to the input vector, return it as a replacement for the input vector, and return the corresponding codebook index position.
  • Quantizer: The quantizer is responsible for quantization and for updating the vectors in the codebook.
  • Weak network environment: An environment with poor network transmission quality, such as a bandwidth below 3kbps.
  • Audio frame: Indicates the minimum voice duration for a single transmission over the network.
  • STFT: Short-Time Fourier Transform.
  • Figure 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied.
  • As shown in Figure 1, the system architecture 100 includes a plurality of terminal devices that can communicate with each other through, for example, a network 150.
  • For example, the system architecture 100 may include a first terminal device 110 and a second terminal device 120 interconnected through the network 150.
  • the first terminal device 110 and the second terminal device 120 perform one-way data transmission.
  • For example, the first terminal device 110 may encode audio and video data (such as an audio and video data stream collected by the terminal device 110) for transmission to the second terminal device 120 through the network 150; the encoded audio and video data is transmitted as one or more encoded audio and video streams.
  • The second terminal device 120 can receive the encoded audio and video data from the network 150, decode it to restore the audio and video data, and play or display content based on the restored audio and video data.
  • the system architecture 100 may include a third terminal device 130 and a fourth terminal device 140 that perform bidirectional transmission of encoded audio and video data.
  • the bidirectional transmission may occur, for example, during an audio and video conference.
  • each of the third terminal device 130 and the fourth terminal device 140 may encode audio and video data (such as an audio and video data stream collected by that terminal device) for transmission through the network 150 to the other of the third terminal device 130 and the fourth terminal device 140.
  • each of the third terminal device 130 and the fourth terminal device 140 may also receive the encoded audio and video data transmitted by the other terminal device, decode it to restore the audio and video data, and play or display content based on the restored audio and video data.
  • the first terminal device 110, the second terminal device 120, the third terminal device 130 and the fourth terminal device 140 may be personal computers and smart phones, but the principles disclosed in this application are not limited thereto. Embodiments disclosed herein are suitable for use with laptops, tablets, media players, and/or dedicated audio and video conferencing equipment.
  • the network 150 represents any number of networks that transmit encoded audio and video data between the first terminal device 110, the second terminal device 120, the third terminal device 130 and the fourth terminal device 140, including, for example, wired and/or wireless communication networks.
  • Communication network 150 may exchange data in circuit-switched and/or packet-switched channels.
  • the network may include telecommunications networks, local area networks, wide area networks, and/or the Internet. For purposes of this application, unless explained below, the architecture and topology of network 150 may be immaterial to the operations disclosed herein.
  • Figure 2 schematically shows the placement of the audio and video encoding device and the audio and video decoding device in a streaming environment.
  • the subject matter disclosed in this application can be equally applied to other applications that support audio and video, including, for example, audio and video conferencing, digital TV (television), and storage of compressed audio and video on digital media including CDs, DVDs, storage sticks, and so on.
  • the streaming system may include a collection subsystem 213.
  • the collection subsystem 213 may include audio and video sources 201 such as microphones and cameras.
  • the audio and video sources create uncompressed audio and video data streams 202.
  • the audio and video data stream 202 is depicted as a thick line to emphasize the high data volume of the audio and video data stream.
  • the audio and video data stream 202 can be processed by the electronic device 220, and the electronic device 220 includes an audio and video encoding device 203 coupled to the audio and video source 201.
  • the audio and video encoding device 203 may include hardware, software, or a combination of hardware and software to implement aspects of the disclosed subject matter as described in greater detail below.
  • the encoded audio and video data 204 (or the encoded audio and video code stream 204) is depicted as a thin line to emphasize the lower data amount of the encoded audio and video data 204 (or the encoded audio and video code stream 204).
  • The encoded audio and video data 204 (or encoded audio and video code stream 204) can be stored on the streaming server 205 for future use.
  • One or more streaming client subsystems, such as the client subsystem 206 and the client subsystem 208 in Figure 2, may access the streaming server 205 to retrieve copies 207 and 209 of the encoded audio and video data 204.
  • the client subsystem 206 may include, for example, the audio and video decoding device 210 in the electronic device 230.
  • the audio and video decoding device 210 decodes the incoming copy 207 of the encoded audio and video data and generates an output audio and video data stream 211 that can be presented on an output terminal 212 (eg, speaker, display) or another presentation device.
  • the encoded audio and video data 204, audio and video data 207 and audio and video data 209 can be encoded according to certain audio and video coding/compression standards.
  • electronic device 220 and electronic device 230 may include other components not shown in the figures.
  • the electronic device 220 may also include an audio and video decoding device, and the electronic device 230 may also include an audio and video encoding device.
  • Figure 3 shows a network structure block diagram of a codec built based on a convolutional neural network in one embodiment of the present application.
  • the network structure of the codec includes an encoder 310 and a decoder 320.
  • the encoder 310 can be implemented in software as the audio and video encoding device 203 shown in Figure 2, and the decoder 320 can be implemented in software as the audio and video decoding device 210 shown in Figure 2.
  • the audio data can be encoded and compressed through the encoder 310 .
  • the encoder 310 may include an input layer 311, one or more downsampling layers 312, and an output layer 313.
  • the input layer 311 and the output layer 313 may be convolutional layers built based on a one-dimensional convolution kernel, and four downsampling layers 312 are sequentially connected between the input layer 311 and the output layer 313.
  • the functions of each network layer are explained as follows.
  • the encoder 310 may encode a batch of B audio vectors at the same time.
  • the first downsampling layer reduces the vector dimension to 1/2, obtaining a feature vector with a channel number of 64 and a dimension of 8000; the second downsampling layer reduces the vector dimension to 1/4, obtaining a feature vector with a channel number of 128 and a dimension of 2000; the third downsampling layer reduces the vector dimension to 1/5, obtaining a feature vector with a channel number of 256 and a dimension of 400; the fourth downsampling layer reduces the vector dimension to 1/8, obtaining a feature vector with a channel number of 512 and a dimension of 50.
  • the output layer 313 performs convolution processing on the feature vector to obtain a coding vector with a channel number of vq_dim and a dimension of 25.
  • vq_dim is the preset vector quantization dimension, which can take a value of 32, for example.
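  • As an illustration of the encoder structure just described, the following is a minimal PyTorch sketch, not the patent's actual implementation: an input layer assumed to produce 32 channels at dimension 16000, four downsampling layers with reduction factors 2, 4, 5 and 8, and an output layer mapping to vq_dim channels at dimension 25. The kernel sizes and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the Figure 3 encoder: input layer, four strided
    downsampling layers (factors 2, 4, 5, 8), output layer."""
    def __init__(self, vq_dim=32):
        super().__init__()
        self.input_layer = nn.Conv1d(1, 32, kernel_size=7, padding=3)
        channels, strides = [32, 64, 128, 256, 512], [2, 4, 5, 8]
        # kernel_size == stride keeps each output dimension exactly input/stride
        self.down = nn.ModuleList(
            nn.Conv1d(channels[i], channels[i + 1], kernel_size=s, stride=s)
            for i, s in enumerate(strides))
        self.output_layer = nn.Conv1d(512, vq_dim, kernel_size=2, stride=2)

    def forward(self, x):              # x: (B, 1, 16000), a batch of B audio vectors
        h = self.input_layer(x)        # (B, 32, 16000)
        for layer in self.down:        # (B, 64, 8000) ... (B, 512, 50)
            h = torch.relu(layer(h))
        return self.output_layer(h)    # (B, vq_dim, 25)
```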
  • the encoding vectors are input to the quantizer 330, and the vector index corresponding to each encoding vector can be obtained by querying the codebook.
  • the vector index can then be transmitted to the data receiving end, and the data receiving end decodes the vector index through the decoder 320 to obtain restored audio data.
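  • The quantization step just described, finding the closest codebook vector and returning its index for transmission, can be sketched as follows; the squared-Euclidean distance metric is an assumption, since this passage does not fix it.

```python
import torch

def quantize(encoding_vectors, codebook):
    """encoding_vectors: (N, vq_dim); codebook: (M, vq_dim).
    Returns the vector indices to transmit and the replacement vectors."""
    dists = torch.cdist(encoding_vectors, codebook)  # pairwise distances (N, M)
    indices = dists.argmin(dim=1)                    # closest codebook entry per vector
    return indices, codebook[indices]
```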
  • the decoder 320 may include an input layer 321, one or more upsampling layers 322, and an output layer 323.
  • the data receiving end can first query the codebook vector corresponding to the vector index through the quantizer 330.
  • the codebook vector can be, for example, a vector with a channel number of vq_dim and a dimension of 25. Among them, vq_dim is the preset vector quantization dimension, which can take a value of 32, for example.
  • in order to improve decoding efficiency, the data receiving end can simultaneously decode a batch of B codebook vectors.
  • the codebook vector to be decoded is input to the input layer 321. After convolution processing, a feature vector with a channel number of 512 and a dimension of 50 can be obtained.
  • the first upsampling layer increases the vector dimension to 8 times (for example, 8×), obtaining a feature vector with a channel number of 256 and a dimension of 400; the second upsampling layer increases the vector dimension to 5 times (for example, 5×), obtaining a feature vector with a channel number of 128 and a dimension of 2000; the third upsampling layer increases the vector dimension to 4 times (for example, 4×), obtaining a feature vector with a channel number of 64 and a dimension of 8000.
  • the fourth upsampling layer increases the vector dimension to 2 times (for example, 2 ⁇ ), resulting in a feature vector with a channel number of 32 and a dimension of 16000.
  • the output layer 323 performs convolution processing on the feature vector and restores the decoded audio data with a channel number of 1 and a dimension of 16,000.
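  • Mirroring the encoder, a minimal sketch of the decoder described above uses transposed convolutions whose factors 8, 5, 4 and 2 reverse the encoder's downsampling; again, the kernel sizes and activations are assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the Figure 3 decoder: input layer, four upsampling
    layers (factors 8, 5, 4, 2), output layer restoring 1 x 16000 audio."""
    def __init__(self, vq_dim=32):
        super().__init__()
        self.input_layer = nn.ConvTranspose1d(vq_dim, 512, kernel_size=2, stride=2)
        channels, factors = [512, 256, 128, 64, 32], [8, 5, 4, 2]
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(channels[i], channels[i + 1], kernel_size=f, stride=f)
            for i, f in enumerate(factors))
        self.output_layer = nn.Conv1d(32, 1, kernel_size=7, padding=3)

    def forward(self, z):              # z: (B, vq_dim, 25), queried codebook vectors
        h = self.input_layer(z)        # (B, 512, 50)
        for layer in self.up:          # (B, 256, 400) ... (B, 32, 16000)
            h = torch.relu(layer(h))
        return self.output_layer(h)    # (B, 1, 16000) decoded audio
```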
  • the codec as a whole can be regarded as a speech-to-speech model.
  • the embodiment of the present application can extract the Mel spectrum of the input and output audio respectively as the input of the loss function, so that the two are close to each other on the Mel spectrum.
  • The Mel spectrum can be computed with different sampling window sizes.
  • embodiments of the present application can use multi-scale Mel spectrum constraints as the reconstruction loss function.
  • The Mel spectrum is a spectrogram distributed on the mel scale.
  • the sound signal is originally a one-dimensional time domain signal, and it is difficult to see the frequency change pattern intuitively. If it is transformed into the frequency domain through Fourier transform, although the frequency distribution of the signal can be seen, the time domain information is lost and the change of the frequency distribution over time cannot be seen.
  • Embodiments of the present application may use time-frequency domain analysis methods such as short-time Fourier transform, wavelet transform, and Wigner distribution to solve this problem.
  • Short-time Fourier transform performs Fourier transform (FFT) on the short-time signals obtained by framing. Specifically, it divides a long signal into frames, applies windows, performs Fourier transform on each frame, and finally stacks the results of each frame along another dimension to obtain a two-dimensional signal form similar to a picture.
  • when the original signal is an audio signal, the two-dimensional signal obtained by STFT expansion is the spectrogram.
  • the spectrogram is filtered and transformed through mel-scale filter banks to obtain the Mel spectrum.
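  • A sketch of the multi-scale Mel spectrum reconstruction loss described above, using torchaudio; the window sizes (256, 512, 1024), the 64 mel bins, and the L1 distance are illustrative assumptions.

```python
import torch
import torchaudio

def multiscale_mel_loss(x, y, sample_rate=16000, win_sizes=(256, 512, 1024)):
    """Compare input audio x and reconstructed audio y on Mel spectra
    computed at several window sizes, and average the L1 distances."""
    loss = 0.0
    for n_fft in win_sizes:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=64)
        loss = loss + torch.mean(torch.abs(mel(x) - mel(y)))
    return loss / len(win_sizes)
```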
  • The audio encoding method, audio decoding method, audio encoding device, audio decoding device, computer-readable medium, electronic device, and computer program product provided by this application are described in detail below in conjunction with specific embodiments, from two aspects: the decoding side as the data receiving end and the encoding side as the data sending end.
  • Figure 4 shows a step flow chart of an audio decoding method in an embodiment of the present application.
  • This method can be executed by a terminal device or server that receives encoded data; that is, the electronic device that executes the audio decoding method provided by the embodiments of the present application may be a terminal device or a server.
  • This embodiment of the present application uses the audio decoding method executed by the terminal device as an example.
  • the terminal device may be, for example, the audio and video decoding device 210 shown in Figure 2 or the decoder 320 shown in Figure 3 .
  • the audio decoding method in the embodiment of the present application may include the following S410 to S430.
  • S410 Obtain the encoding vector of each audio frame in the audio frame sequence. The audio frame is a data segment with a specified time length obtained by framing and windowing the original audio data.
  • the encoding vector is a data compression vector obtained by downsampling the audio frame multiple times.
  • an encoder based on a convolutional neural network as shown in Figure 3 can be used to encode the audio frame to obtain a coding vector.
  • the overall characteristics of the original audio data and the parameters that characterize its essential features change over time, so it is a non-stationary process and cannot be analyzed and processed with digital signal processing techniques designed for stationary signals.
  • on the other hand, different speech sounds are responses produced by human oral muscle movements that form a certain vocal tract shape, and this oral muscle movement is very slow relative to the frequency of speech. Therefore, although the audio signal has time-varying characteristics, its characteristics remain basically unchanged within a short time range (for example, 10-30 ms), that is, relatively stable, so it can be regarded as a quasi-steady-state process; in other words, the audio signal has short-term stationarity.
  • embodiments of the present application can divide the original audio data into segments to analyze their characteristic parameters, where each segment is called an audio frame.
  • the frame length of the audio frame may be in the range of 10-30 ms, for example.
  • Frames can be divided into continuous segments or overlapping segments. Overlapping segments can make a smooth transition between frames and maintain their continuity.
  • the overlapping part between the previous frame and the next frame is called frame shift.
  • the ratio of frame shift to frame length can range from 0 to 1/2.
  • Windowing refers to using a window function to perform a function mapping on the framed audio signal, so that two adjacent audio data frames transition smoothly, reducing signal discontinuity at the beginning and end of a data frame, giving the overall signal higher continuity and avoiding the Gibbs effect.
  • audio signals that are not originally periodic can also show some characteristics of periodic functions, which is beneficial to signal analysis and processing.
  • the slopes at both ends of the time window should be minimized so that the edges of the window transition smoothly to zero without sharp changes; this makes the intercepted signal waveform drop slowly to zero and reduces the truncation effect on audio data frames.
  • the window length should be moderate. If the window length is too large, it is equivalent to a very narrow low-pass filter: when the audio signal passes through, the high-frequency part that reflects waveform details is blocked, the short-term energy changes very little with time, and the amplitude changes of the audio signal cannot be realistically reflected. On the contrary, if the window length is too short, the passband of the filter becomes wider, the short-term energy changes sharply with time, and a smooth energy function cannot be obtained.
  • a Hamming window can be selected as the window function.
  • the Hamming window has smooth low-pass characteristics and can reflect the frequency characteristics of short-term signals to a high extent.
  • other types of window functions, such as the rectangular window and the Hanning window, can also be used.
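  • A minimal sketch of the framing and windowing just described, assuming a 16k sampling rate, a 20 ms frame length within the 10-30 ms range, and a 10 ms frame shift (a shift-to-length ratio of 1/2):

```python
import numpy as np

def frame_and_window(audio, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split audio into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 320 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples
    window = np.hamming(frame_len)                    # smooth low-pass window
    n_frames = 1 + (len(audio) - frame_len) // frame_shift
    return np.stack([
        audio[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)])                    # (n_frames, frame_len)
```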
  • S420 When decoding the current audio frame in the audio frame sequence, upsample the encoding vector of the historical audio frame to obtain the upsampling feature value. The upsampling feature value is a feature vector obtained during the upsampling process to describe historical audio frames.
  • the historical audio frame is one or more audio frames that are temporally continuous with the current audio frame in the audio frame sequence.
  • for example, if the current audio frame being decoded is the Nth audio frame in the audio frame sequence, the corresponding historical audio frame may be the (N-1)th audio frame in the audio frame sequence.
  • Upsampling is an operation that maps encoding vectors from low dimensions to high dimensions.
  • upsampling methods such as linear interpolation, deconvolution, or unpooling can be used.
  • linear interpolation is a method of inserting new elements into low-dimensional vectors based on linear interpolation functions to obtain high-dimensional vectors. It can include nearest neighbor interpolation algorithm, bilinear interpolation algorithm, bicubic interpolation algorithm, etc.
  • Deconvolution, also known as transposed convolution, is a special convolution operation: for example, zeros can first be inserted into the low-dimensional vector to expand its dimension, and a forward convolution is then performed with the convolution kernel to obtain the high-dimensional vector. Unpooling is the reverse operation of pooling.
  • the upsampled process data can be retained by configuring a buffer area.
  • the feature vector used to describe the audio frame obtained during the upsampling process can be cached, such as the upsampled feature values of historical audio frames.
  • S430 Upsample the encoding vector of the current audio frame according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • the upsampling feature value of the historical audio frame and the encoding vector of the current audio frame can be input to the decoder as input data, so that the decoder can use the feature vector of the historical audio frame to upsample the current audio frame.
  • the original audio data will lose some information during the encoding process, and the decoding process based on upsampling is usually difficult to restore the original audio data.
  • the embodiments of the present application guide the upsampling process of the current audio frame by caching the upsampling features of previously decoded historical audio frames, which improves the data restoration effect of audio decoding and thus the audio encoding and decoding quality.
  • Figure 5 shows a flowchart of method steps for audio decoding based on a convolutional neural network including multiple upsampling layers in one embodiment of the present application.
  • the audio decoding method may include the following S510 to S540.
  • S510 Obtain the encoding vector of each audio frame in the audio frame sequence. The audio frame is a data segment with a specified time length obtained by framing and windowing the original audio data.
  • the encoding vector is a data compression vector obtained by downsampling the audio frame multiple times.
  • an encoder based on a convolutional neural network as shown in Figure 3 can be used to encode the audio frame to obtain a coding vector.
  • S520 Obtain a decoder including multiple upsampling layers, and perform upsampling processing on the encoding vector of the historical audio frame through the multiple upsampling layers to obtain multiple feature vectors. The historical audio frame is one or more audio frames decoded before the current audio frame.
  • Embodiments of the present application may use a decoder based on a convolutional neural network as shown in Figure 3 to decode the encoding vector of the audio frame.
  • the decoder includes multiple sequentially connected upsampling layers, and each upsampling layer can implement upsampling processing by performing a convolution operation on the input vector.
  • after the decoder performs upsampling processing on the encoding vector of the historical audio frame, multiple feature vectors equal in number to the upsampling layers can be obtained; at this time, the multiple feature vectors can be used as the upsampling feature values.
  • the decoder shown in Figure 3 includes four upsampling layers, and each upsampling layer outputs a feature vector. Four feature vectors can be obtained by upsampling a historical audio frame.
  • alternatively, a number of feature vectors smaller than the number of upsampling layers can be obtained.
  • for example, the decoder shown in Figure 3 includes four upsampling layers; each upsampling layer outputs a feature vector, and part of these feature vectors are then extracted. That is, upsampling a historical audio frame can yield fewer than four feature vectors.
  • S530 Input the encoding vector of the current audio frame into the decoder, and input multiple feature vectors into multiple upsampling layers.
  • the encoding vector of the current audio frame is upsampled multiple times through the multiple upsampling layers of the decoder in sequence.
  • at the same time, the multiple feature vectors obtained by upsampling the historical audio frames are synchronously input to the upsampling layers. That is, the input data of an upsampling layer in the decoder includes, in addition to the output data of the previous upsampling layer, the feature vectors obtained by upsampling historical audio frames.
  • S540 Upsample the encoding vector and multiple feature vectors of the current audio frame through multiple upsampling layers to obtain the decoded data of the current audio frame.
  • FIG. 6 shows a schematic diagram of a network module that implements data encoding and decoding processing in an embodiment of the present application.
  • the network module shown in Figure 6 is the basic functional module that constitutes the encoder or decoder shown in Figure 3.
  • each downsampling layer in the encoder or each upsampling layer in the decoder can include one or more of the network modules shown in Figure 6.
  • the network module that implements data encoding and decoding processing includes multiple residual blocks Res Block.
  • the input data of the network module includes two parts, namely the current input feature (In feature) and the first historical feature (Last feature).
  • the current input feature can be the output feature obtained by the previous network module's convolution processing of the current audio frame.
  • the first historical feature can be the output feature obtained by the current network module's convolution processing of the previous audio frame; for example, it may be an upsampled feature value obtained by upsampling the encoding vector of the historical audio frame through the upsampling layer in the above embodiments of the present application.
  • the output data of the network module also includes two parts, namely the current output feature (Out feature) and the second historical feature (Last feature).
  • the current output feature can be used by the next network module as the input feature for convolution processing of the current audio frame, and the second historical feature can be used as an input feature of the current network module for convolution processing of the subsequent audio frame.
  • through the network module shown in Figure 6, the embodiments of the present application can jointly decode the upsampling feature value obtained in the upsampling process of the historical audio frame and the encoding vector of the current audio frame, thereby expanding the input receptive field of the current audio frame and improving the accuracy of audio encoding and decoding.
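  • A minimal sketch of the data flow of the Figure 6 network module, assuming the historical feature is fused with the current input by channel concatenation and a residual convolution (the fusion operation and layer sizes are assumptions, since this passage does not specify them):

```python
import torch
import torch.nn as nn

class CachedResBlock(nn.Module):
    """Consumes the current input feature and the cached feature of the
    previous audio frame; emits the current output feature and a new cache."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, in_feature, last_feature):
        # Concatenating the historical feature along the channel axis enlarges
        # the receptive field of the current frame without waiting for future frames.
        h = torch.cat([in_feature, last_feature], dim=1)
        out_feature = in_feature + torch.relu(self.conv(h))  # residual block
        return out_feature, out_feature.detach()  # second output is cached
```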
  • the upsampling layer of the decoder includes at least two sampling channels.
  • the method of upsampling the encoding vector and multiple feature vectors of the current audio frame through multiple upsampling layers in S540 may include: performing feature extraction on the encoding vector and multiple feature vectors of the current audio frame through at least two sampling channels in the upsampling layer to obtain at least two channel feature values; obtaining the mean and variance of the at least two channel feature values; and normalizing the at least two channel feature values based on the mean and variance.
  • Different sampling channels can convolve the input data based on convolution kernels of different sizes or different parameters to obtain multiple channel feature values under different representation dimensions, which can improve the comprehensiveness and reliability of feature extraction from audio frames.
  • embodiments of the present application can normalize channel feature values collected on different sampling channels of the same audio frame.
  • Figure 7 shows a schematic diagram of the principle of normalizing channel feature values output by multiple sampling channels in an embodiment of the present application.
  • Each square in Figure 7 represents a data sampling point; a row of squares distributed along the horizontal direction represents an audio frame; multiple rows of squares distributed along the vertical direction represent multiple audio frames that are encoded and decoded simultaneously in a batch; and a multi-row grid distributed in the depth direction represents multiple sampling channels that sample the same audio frame.
  • one audio frame serves as a normalization unit, and each audio frame is independent of each other.
  • the mean and variance can be calculated for multiple channel feature values sampled from different sampling channels in the same audio frame, and then the mean value is subtracted from each channel feature value and divided by the variance to obtain the normalized channel feature value.
  • each sampling channel can share the same mean and variance, ensuring the comprehensiveness of data sampling while reducing the amount of data calculation.
  • a weighted smoothing process can be performed on the mean and variance between each audio frame.
  • the method of normalizing the at least two channel feature values may be to normalize them according to the mean and variance after the weighted smoothing processing, to further reduce the amount of data calculation.
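  • A sketch of the per-frame normalization described above, where all sampling channels of one audio frame share a single mean and variance and each frame in the batch is an independent normalization unit; dividing by the square root of the variance (the standard deviation) is the usual form assumed here:

```python
import torch

def normalize_per_frame(channel_features, eps=1e-5):
    """channel_features: (B, C, T) - B frames in a batch, C sampling
    channels, T feature values per channel."""
    mean = channel_features.mean(dim=(1, 2), keepdim=True)  # shared by all channels of a frame
    var = channel_features.var(dim=(1, 2), keepdim=True)
    return (channel_features - mean) / torch.sqrt(var + eps)
```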
  • When transmitting audio data, real-time segmented transmission may be used.
  • the characteristics of real-time segmented transmission determine that users can obtain media data in real time without downloading the complete media file, but it also places high demands on the user's device performance and network conditions.
  • the audio frame can be compressed and quantized to obtain an index value; in this way, the quantized index value is transmitted during transmission, reducing the amount of data transmitted and thereby improving data transmission efficiency.
  • the corresponding encoding vector can be found from the codebook through the index value, and then the decoding is completed.
  • Figure 8 shows a flow chart of the steps of audio frame decoding based on the query codebook in one embodiment of the present application.
  • the encoding vector of the audio frame can be located by querying the codebook, reducing the amount of data transmitted between the encoding and decoding sides.
  • the method of decoding the encoding vector of the audio frame based on the query codebook may include the following S810 to S840.
  • S810 For each audio frame in the audio frame sequence, obtain the coding index value of the audio frame.
  • the coding index value is used to indicate the codebook vector in the codebook.
  • the codebook is used to save the mapping relationship between the coding index value and the codebook vector.
  • the sender of audio data can transfer the coding index value of each audio frame to the receiver through network transmission, which can greatly reduce the amount of data transmission and significantly improve audio data transmission efficiency.
  • S820 Query the codebook vector associated with the encoding index value in the codebook, and determine the encoding vector of the audio frame based on the codebook vector.
  • the quantizer can be used to query the codebook vector associated with the encoding index value in the codebook, and further determine the encoding vector of the audio frame based on the codebook vector.
  • the decoder can directly use the codebook vector queried in the codebook as the encoding vector of the audio frame, or can perform data mapping on the queried codebook vector according to preset mapping rules to determine the encoding vector of the audio frame.
  • the preset mapping rules can be pre-agreed rules between the sender and receiver of the audio data. Using data mapping to determine the encoding vector can improve the security of data transmission while sharing the codebook.
  • data mapping is performed using dimensionality-reduction and dimensionality-raising projections, which can reduce the vector dimension in the codebook, compress the codebook, and reduce the amount of data maintained in the codebook.
  • Figure 9 shows a schematic diagram of the principle of determining encoding vectors based on data mapping in one embodiment of the present application.
  • an encoding vector can be obtained, whose vector dimension is, for example, N.
  • on the encoding side, the encoding vector is first dimensionally reduced and projected, compressing it to a vector with a dimension of N/Q.
  • the codebook includes M codebook vectors, where the vector dimension of each codebook vector is N/Q.
  • the codebook vector corresponding to it can first be queried in the codebook.
  • the vector dimension of the codebook vector is N/Q. After performing dimension-raising projection on the codebook vector, the encoding vector with vector dimension N can be restored.
  • the encoding vector can be dimensionally reduced or raised based on a linear transformation, or network layers of neural networks such as convolutional layers and fully connected layers can be used for the data mapping.
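  • A sketch of the Figure 9 mapping under stated assumptions: linear projections for the dimension reduction and raising, and illustrative values N=32, Q=4, M=1024 (Q and M are not fixed by this passage).

```python
import torch
import torch.nn as nn

class ProjectedCodebook(nn.Module):
    """Encoding vectors of dimension N are projected down to N/Q before the
    codebook query; codebook vectors are projected back up to N on decoding."""
    def __init__(self, n=32, q=4, m=1024):
        super().__init__()
        self.down_proj = nn.Linear(n, n // q, bias=False)   # N -> N/Q
        self.up_proj = nn.Linear(n // q, n, bias=False)     # N/Q -> N
        self.codebook = nn.Parameter(torch.randn(m, n // q))

    def encode(self, vectors):                   # vectors: (B, N)
        compressed = self.down_proj(vectors)     # (B, N/Q)
        return torch.cdist(compressed, self.codebook).argmin(dim=1)

    def decode(self, indices):                   # indices received from the sender
        return self.up_proj(self.codebook[indices])  # restored (B, N)
```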
  • S830 Upsample the encoding vector of the historical audio frame to obtain the upsampling feature value.
  • the historical audio frame is one or more audio frames decoded before the current audio frame in the audio frame sequence.
  • the upsampling feature value is obtained during the upsampling process.
  • a historical audio frame is one or more audio frames that are temporally continuous with the current audio frame in the audio frame sequence.
  • for example, if the current audio frame being decoded is the Nth audio frame in the audio frame sequence, the corresponding historical audio frame may be the (N-1)th audio frame in the audio frame sequence.
  • Upsampling is an operation that maps encoding vectors from low dimensions to high dimensions; for example, upsampling methods such as linear interpolation, deconvolution, or unpooling can be used. In the embodiments of this application, the upsampled process data can be retained by configuring a buffer area. When upsampling an audio frame, the feature vectors obtained during the upsampling process that describe the audio frame can be cached.
  • S840 Upsample the encoding vector of the current audio frame according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • the upsampling feature values of historical audio frames and the encoding vector of the current audio frame can be input into the decoder as input data, so that the decoder can use the upsampling feature values of historical audio frames to upsample the current audio frame.
  • the original audio data will lose some information during the encoding process, and the decoding process based on upsampling is usually difficult to restore the original audio data.
  • the embodiments of the present application guide the upsampling process of the current audio frame by caching the upsampling features of previously decoded historical audio frames, which improves the data restoration effect of audio decoding and thus the audio encoding and decoding quality.
  • the codebook can be queried through the quantizer in the encoding and decoding model, and the codebook can be updated based on the sample data.
  • the quantizer in the embodiment of the present application may be a model built based on a convolutional neural network, and the quantizer may be trained based on sample data to improve its coding and quantization effect on audio frames.
  • a method for training a quantizer may include: obtaining a codebook and a quantizer used to maintain the codebook.
  • the codebook is used to represent the mapping relationship between encoding index values and codebook vectors; obtain the coding vector sample obtained by encoding the audio frame sample with the encoder; predict, through the quantizer, the codebook vector sample that matches the coding vector sample; and update the network parameters of the quantizer according to the loss error between the coding vector sample and the codebook vector sample, so as to train the quantizer.
  • the codebook vector associated with the encoding index value can be queried in the codebook through the trained quantizer.
  • a method for maintaining and updating the codebook based on the quantizer may include: obtaining statistical parameters of the coding vector samples that match the codebook vector samples; and updating the codebook according to the statistical parameters, where the updated codebook is used to predict the codebook vector sample that matches the encoding vector sample next time.
  • the statistical parameters of the coding vector samples include at least one of a vector sum and a hit number. The vector sum represents an average vector obtained after weighted average processing of the coding vector samples, and the hit number represents the number of coding vector samples that match the codebook vector sample.
  • the method of updating the codebook based on statistical parameters may include: performing exponential weighted smoothing on the codebook based on the vector sum; performing Laplacian smoothing on the codebook based on the number of hits.
  • Figure 10 shows a flow chart of steps for training a quantizer in one embodiment of the present application. As shown in Figure 10, this embodiment of the present application can realize the construction and maintenance of codebooks based on the training quantizer.
  • the training process includes the following S1001 to S1008.
  • the input data is a coding vector sample obtained by coding audio data (such as audio data of audio frame samples).
  • S1002 Determine whether the input data is the first input data of the quantizer. If the input data is input to the quantizer for the first time, S1003 is executed; if the input data is not input to the quantizer for the first time, S1004 is executed.
  • S1003 Perform clustering processing on the input data to obtain M clusters, each cluster corresponding to a codebook vector.
  • M codebook vectors can form a codebook for data quantization, and the codebook stores the encoding index value corresponding to each codebook vector.
  • the embodiment of the present application can cluster the input data based on K-means clustering, and each cluster cluster corresponds to a codebook vector and a coding index value. At the same time, the vector sum of each vector in each cluster and the number of hits for vector query for each cluster can be counted.
  • S1004 Query the category to which the input data belongs. The query may include predicting the similarity between the input data and the cluster center of each cluster, and taking the cluster with the highest similarity as the category to which the input data belongs.
  • S1005 Determine the corresponding coding index value and the quantized codebook vector according to the category of the input data.
  • the loss error of the codebook vector can be, for example, the mean square error loss MSE Loss.
  • the mean square error refers to the expected value of the square of the difference between the parameter estimate and the parameter value.
  • the mean square error loss can evaluate the degree of change in the data. The smaller the mean square error loss, the better the accuracy of the quantizer in quantizing the input data.
  • S1008 Perform Laplacian smoothing on the codebook based on the number of hits.
  • the zero probability problem that occurs in the vector prediction of the codebook can be solved by Laplacian smoothing.
  • the embodiment of the present application can continuously update the codebook by performing weighted smoothing on the codebook, so that the vectors generated by the encoder are closer to the vectors in the codebook, and the prediction accuracy of the quantizer for the vectors in the codebook is improved.
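  • A sketch of the Figure 10 codebook maintenance under stated assumptions: the per-entry vector sum and hit number are tracked with exponential weighted smoothing, and Laplace smoothing is applied to the hit counts before the codebook vectors are re-estimated; decay and eps are assumed hyperparameters.

```python
import torch

def update_codebook(codebook, cluster_sum, cluster_count,
                    encodings, assignments, decay=0.99, eps=1e-5):
    """codebook, cluster_sum: (M, D); cluster_count: (M,);
    encodings: (N, D); assignments: (N,) codebook index per encoding."""
    m = codebook.size(0)
    one_hot = torch.zeros(assignments.size(0), m)
    one_hot.scatter_(1, assignments.unsqueeze(1), 1.0)
    # Exponential weighted smoothing of the vector sum and the hit number.
    cluster_sum.mul_(decay).add_(one_hot.t() @ encodings, alpha=1 - decay)
    cluster_count.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    # Laplace smoothing so rarely hit entries avoid the zero-probability problem.
    n = cluster_count.sum()
    smoothed = (cluster_count + eps) / (n + m * eps) * n
    codebook.copy_(cluster_sum / smoothed.unsqueeze(1))
    return codebook
```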
  • Figure 11 shows a step flow chart of an audio encoding method in an embodiment of the present application.
  • This method can be executed by a terminal device or a server that sends audio data.
  • This embodiment of the present application takes the audio encoding method executed by a terminal device as an example for explanation.
  • the terminal device may be, for example, the audio and video encoding device 203 shown in FIG. 2 or the encoder 310 shown in FIG. 3 .
  • the audio encoding method in the embodiment of the present application may include the following S1110 to S1130.
  • S1110 Obtain the audio data of each audio frame in the audio frame sequence. An audio frame is a data segment with a specified time length obtained by framing and windowing the original audio data.
  • the overall characteristics of the original audio data and the parameters that characterize its essential features change over time, so it is a non-stationary process and cannot be analyzed and processed with digital signal processing techniques designed for stationary signals.
  • on the other hand, different speech sounds are responses produced by human oral muscle movements that form a certain vocal tract shape, and this oral muscle movement is very slow relative to the frequency of speech. Therefore, although the audio signal has time-varying characteristics, its characteristics remain basically unchanged within a short time range (for example, 10-30 ms), that is, relatively stable, so it can be regarded as a quasi-steady-state process; in other words, the audio signal has short-term stationarity.
  • embodiments of the present application can divide the original audio data into segments to analyze their characteristic parameters.
  • Each segment is called an audio frame.
  • the frame length of the audio frame may be in the range of 10-30 ms, for example.
  • Frames can be divided into continuous segments or overlapping segments. Overlapping segments can make a smooth transition between frames and maintain their continuity.
  • the overlapping part between the previous frame and the next frame is called frame shift.
  • the ratio of frame shift to frame length can range from 0 to 1/2.
  • Windowing refers to using a window function to perform a function mapping on the framed audio signal, so that two adjacent audio data frames transition smoothly, reducing signal discontinuity at the beginning and end of a data frame, giving the overall signal higher continuity and avoiding the Gibbs effect.
  • audio signals that are not originally periodic can also show some characteristics of periodic functions, which is beneficial to signal analysis and processing.
  • S1120 When encoding the current audio frame in the audio frame sequence, downsample the audio data of the historical audio frame to obtain the downsampling feature value. The historical audio frame is one or more audio frames encoded before the current audio frame in the audio frame sequence, and the downsampling feature value is a feature vector obtained during the downsampling process to describe the historical audio frame.
  • the historical audio frame is one or more audio frames that are temporally continuous with the current audio frame in the audio frame sequence.
  • for example, if the current audio frame being encoded is the Nth audio frame in the audio frame sequence, the corresponding historical audio frame may be the (N-1)th audio frame in the audio frame sequence.
  • Downsampling is an operation that maps encoding vectors from high dimensions to low dimensions. For example, convolution operations or pooling operations can be used for downsampling.
  • the downsampled process data can be retained by configuring a buffer area.
  • the feature vectors obtained during the downsampling process that describe the audio frame can be cached.
  • S1130 Downsample the audio data of the current audio frame according to the downsampling feature value to obtain the encoding vector of the current audio frame.
  • the down-sampling feature values of historical audio frames and the audio data of the current audio frame can be input to the encoder as input data, so that the encoder can use the features of the historical audio frames to downsample the current audio frame.
  • the original audio data will lose some information during the encoding process.
  • by caching the downsampling features of previously encoded historical audio frames, the embodiments of the present application guide the downsampling process of the current audio frame, improving the data relevance of audio encoding and the audio encoding and decoding quality.
  • Figure 12 shows a flow chart of the method steps for audio encoding based on a convolutional neural network including multiple downsampling layers in one embodiment of the present application.
  • the audio encoding method may include the following S1210 to S1240.
  • S1210 Obtain the audio data of each audio frame in the audio frame sequence. The audio frame is a data segment with a specified time length obtained by framing and windowing the original audio data.
  • the encoding vector is a data compression vector obtained by downsampling the audio frame multiple times.
  • an encoder based on a convolutional neural network as shown in Figure 3 can be used to encode the audio frame to obtain a coding vector.
  • S1220 Obtain an encoder including multiple down-sampling layers, and perform down-sampling processing on the audio data of historical audio frames through the multiple down-sampling layers to obtain multiple feature vectors. The historical audio frames are one or more audio frames encoded before the current audio frame.
  • Embodiments of the present application may use an encoder based on a convolutional neural network as shown in Figure 3 to encode the audio data of the audio frame.
  • the encoder includes multiple sequentially connected down-sampling layers, and each down-sampling layer can implement down-sampling processing by performing a convolution operation on the input vector.
  • multiple feature vectors equal to the number of down-sampling layers can be obtained.
  • the encoder shown in Figure 3 includes four downsampling layers, and each downsampling layer outputs a feature vector. Then four feature vectors can be obtained by downsampling a historical audio frame.
  • Alternatively, fewer feature vectors than downsampling layers can be obtained. For example, with the four downsampling layers of the encoder in Figure 3 each outputting one feature vector, only some of those feature vectors may be extracted; that is, downsampling one historical audio frame can yield fewer than four feature vectors.
  • S1230: Input the audio data of the current audio frame into the encoder, and input the multiple feature vectors into the corresponding downsampling layers.
  • The audio data of the current audio frame is downsampled multiple times in sequence by the encoder's multiple downsampling layers. During this process, the multiple feature vectors obtained by downsampling the historical audio frame are synchronously input into the corresponding downsampling layers; that is, besides the output of the preceding downsampling layer, the input of a downsampling layer in the encoder also includes the feature vector obtained by downsampling the historical audio frame.
  • S1240: Downsample the audio data of the current audio frame together with the multiple feature vectors through the multiple downsampling layers to obtain the encoding vector of the current audio frame.
  • By retaining the output features of the previous audio frame, the embodiments of this application can jointly encode the feature vectors obtained while downsampling the historical audio frame and the audio data of the current audio frame, which enlarges the input receptive field for the current audio frame and improves the accuracy of audio encoding and decoding.
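  • A minimal sketch of this history-conditioned encoding is given below. It assumes each layer fuses the cached feature by channel-wise concatenation; this application requires only that the cached feature vectors of the historical frame are additional inputs to the downsampling layers, so the fusion rule and channel sizes here are illustrative:

```python
import torch
import torch.nn as nn

class HistoryAwareDown(nn.Module):
    """One downsampling layer that also consumes the feature vector cached
    while the same layer processed the previous (historical) audio frame."""

    def __init__(self, ch_in, ch_out):
        super().__init__()
        # current and cached features are concatenated on the channel axis
        self.conv = nn.Conv1d(2 * ch_in, ch_out,
                              kernel_size=4, stride=2, padding=1)

    def forward(self, x, last_feature):
        if last_feature is None:            # first frame: no history yet
            last_feature = torch.zeros_like(x)
        y = torch.relu(self.conv(torch.cat([x, last_feature], dim=1)))
        # the layer's input (the preceding layer's output for this frame)
        # is cached as a downsampling feature of the "historical" frame
        return y, x.detach()

class HistoryAwareEncoder(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256, 512)):
        super().__init__()
        self.layers = nn.ModuleList(
            HistoryAwareDown(a, b)
            for a, b in zip(channels[:-1], channels[1:]))
        self.cache = [None] * len(self.layers)   # configured buffer area

    def forward(self, x):                        # x: (batch, 32, samples)
        for i, layer in enumerate(self.layers):
            x, self.cache[i] = layer(x, self.cache[i])
        return x                                 # encoding vector
```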
  • In one embodiment, each downsampling layer of the encoder includes at least two sampling channels.
  • On this basis, downsampling the audio data of the current audio frame and the multiple feature vectors through the multiple downsampling layers in S1240 may include: performing feature extraction on the audio data of the current audio frame and the multiple feature vectors through at least two sampling channels in the downsampling layer to obtain at least two channel feature values; obtaining the mean and variance of the at least two channel feature values; and normalizing the at least two channel feature values according to the mean and variance.
  • Different sampling channels can convolve the input data with convolution kernels of different sizes or parameters to obtain multiple channel feature values in different representation dimensions, which improves the comprehensiveness and reliability of feature extraction from audio frames.
  • To reduce the model's computational load, the embodiments of this application can normalize the channel feature values collected on the different sampling channels of the same audio frame. For a scheme for normalizing these channel feature values, reference can be made to the embodiment shown in Figure 7, which is not repeated here.
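  • A sketch of this per-frame normalization is shown below; the mean and variance are shared by all sampling channels of one frame, as in Figure 7, and the tensor layout is an assumption:

```python
import torch

def normalize_channels(features, eps=1e-5):
    """Normalize the channel feature values of each audio frame.

    features: (batch, channels, time). One mean/variance pair is computed
    per frame across its sampling channels, so all channels of the same
    frame share the same statistics while frames stay independent.
    """
    mean = features.mean(dim=(1, 2), keepdim=True)
    var = features.var(dim=(1, 2), keepdim=True, unbiased=False)
    # optionally, mean/var could first be weight-smoothed across frames
    return (features - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 8, 100)    # 4 frames, 8 sampling channels
y = normalize_channels(x)
```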
  • In one embodiment, audio frames can be encoded based on codebook lookup.
  • By configuring the same codebook on the encoder and the decoder, the encoding vector of an audio frame can be located by querying the codebook, which reduces the amount of data transmitted between the encoding and decoding sides.
  • After the encoding vector is obtained, the codebook can be queried with the encoding vector to find a matching codebook vector, and the encoding index value associated with that codebook vector can be obtained.
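  • A minimal sketch of the codebook query follows; the codebook size is an illustrative assumption, while vq_dim = 32 matches the example given for Figure 3:

```python
import torch

def quantize(encoding_vecs, codebook):
    """Replace each encoding vector by its nearest codebook vector.

    encoding_vecs: (n, d); codebook: (m, d). Returns the encoding index
    values (what is actually transmitted) and the matched codebook vectors.
    """
    dists = torch.cdist(encoding_vecs, codebook)   # (n, m) pairwise L2
    indices = dists.argmin(dim=1)                  # encoding index values
    return indices, codebook[indices]

codebook = torch.randn(1024, 32)    # m = 1024 entries, vq_dim = 32
z = torch.randn(25, 32)             # 25 encoding vectors for one frame
idx, z_q = quantize(z, codebook)
```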
  • Figure 13 shows a flowchart of model training for the encoder and decoder in one embodiment of this application. As shown in Figure 13, this embodiment trains the encoder and decoder by constructing a generative adversarial network. The training method may include the following S1310 to S1350.
  • S1310: Obtain an encoder that includes multiple downsampling layers and a decoder that includes multiple upsampling layers.
  • The encoder and decoder in this embodiment can form a codec model based on a convolutional neural network, as shown in Figure 3, in which each upsampling or downsampling layer can use a convolution operation or a causal convolution operation for feature mapping.
  • S1320: Encode and decode audio input samples through the encoder and the decoder to obtain audio output samples.
  • The encoder encodes the audio input samples to obtain corresponding encoding vector samples, and the decoder then decodes the encoding vector samples to obtain the audio output samples. For the encoding and decoding procedures of the encoder and decoder, refer to the above embodiments; they are not repeated here.
  • S1330: Determine the first loss error of the encoder and the decoder according to the audio input samples and the audio output samples.
  • In one embodiment, spectral features are extracted from the audio input samples and the audio output samples respectively; that is, spectral feature extraction on the audio input samples yields a first mel spectrum, and spectral feature extraction on the audio output samples yields a second mel spectrum. The first loss error of the encoder and the decoder is then determined according to the degree of difference between the first mel spectrum and the second mel spectrum.
  • The first and second mel spectra may be obtained as follows: obtain sampling windows of at least two sample scales; through these sampling windows, extract spectral features from the audio input samples at the different sample scales to obtain a multi-scale first mel spectrum, and extract spectral features from the audio output samples to obtain a multi-scale second mel spectrum.
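  • A minimal sketch of such a multi-scale mel reconstruction loss is given below, assuming the torchaudio package; the FFT sizes, hop lengths, and mel-bin count are illustrative choices, not values fixed by this application:

```python
import torch
import torchaudio

def multiscale_mel_loss(x, y, sr=16000, fft_sizes=(512, 1024, 2048)):
    """L1 distance between mel spectra of the input and output waveforms,
    averaged over several sampling-window (FFT) sizes."""
    loss = 0.0
    for n_fft in fft_sizes:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=64)
        loss = loss + torch.mean(torch.abs(mel(x) - mel(y)))
    return loss / len(fft_sizes)
```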
  • S1340: Perform type discrimination on the audio input samples and the audio output samples using a sample discriminator, and determine the second loss error of the sample discriminator according to the discrimination results.
  • S1350: Perform generative adversarial training on the encoder, the decoder, and the sample discriminator according to the first loss error and the second loss error, so as to update the network parameters of the encoder, the decoder, and the sample discriminator.
  • In one embodiment, the sample discriminator may include an original-sample discriminator and a sample-feature discriminator. Type discrimination through the sample discriminator then includes: inputting the audio input samples and the audio output samples into the original-sample discriminator to obtain a first-type discrimination result output by the original-sample discriminator; performing spectral feature extraction on the audio input samples to obtain a first mel spectrum and on the audio output samples to obtain a second mel spectrum; and inputting the first and second mel spectra into the sample-feature discriminator to obtain a second-type discrimination result output by the sample-feature discriminator. The discrimination results include the first-type discrimination result and the second-type discrimination result.
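  • The two discriminators could look like the following sketch; only their inputs (raw waveform versus mel spectrum) are prescribed above, so the layer configurations are assumptions in the spirit of common patch discriminators:

```python
import torch.nn as nn

class WaveDiscriminator(nn.Module):
    """Original-sample discriminator: judges raw waveforms (B, 1, T)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 3, padding=1))    # per-position score

    def forward(self, wav):
        return self.net(wav)

class MelDiscriminator(nn.Module):
    """Sample-feature discriminator: judges mel spectra (B, 1, M, T)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, mel):
        return self.net(mel)
```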
  • Figure 14 shows a schematic diagram of codec model training based on a generative adversarial network in one embodiment of this application.
  • As shown in Figure 14, the codec as a whole can be regarded as a speech-to-speech model. To make the speech generated by the model better match the human auditory curve, mel spectra are extracted from the input audio and the output audio respectively and used as inputs of the loss function, so that the two become close in the mel-spectrum domain.
  • The mel spectrum can be computed with different sampling window sizes; to bring the quality of the generated speech closer to the input speech, the embodiments of this application use multi-scale mel-spectrum constraints as the reconstruction loss.
  • The embodiments of this application use a generative adversarial network (GAN) for model training, with the codec acting as the generator, and design two discriminators: one that takes the original speech as input (for example, the first discriminator in Figure 14) and one that takes the mel spectrum as input (for example, the second discriminator in Figure 14). Discriminating the data from the two perspectives of audio samples and mel-spectrum samples strengthens the discrimination and thereby improves the codec model's encoding and decoding quality for audio data.
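  • Putting the pieces together, one adversarial training step could be sketched as follows; the least-squares GAN objective is an assumption (the embodiments above fix the generator and discriminator roles, not the exact adversarial loss), and multiscale_mel_loss refers to the earlier sketch:

```python
import torch
import torch.nn.functional as F

def train_step(codec, d_wave, d_mel, mel_fn, x, opt_g, opt_d):
    y = codec(x)                        # audio output samples (generator)

    # --- update the discriminators (second loss error) ---
    opt_d.zero_grad()
    d_loss = 0.0
    for d, to_in in ((d_wave, lambda w: w), (d_mel, mel_fn)):
        real, fake = d(to_in(x)), d(to_in(y.detach()))
        d_loss = d_loss + F.mse_loss(real, torch.ones_like(real)) \
                        + F.mse_loss(fake, torch.zeros_like(fake))
    d_loss.backward()
    opt_d.step()

    # --- update the codec (reconstruction + adversarial terms) ---
    opt_g.zero_grad()
    g_loss = multiscale_mel_loss(x, y)              # first loss error
    for d, to_in in ((d_wave, lambda w: w), (d_mel, mel_fn)):
        fake = d(to_in(y))
        g_loss = g_loss + F.mse_loss(fake, torch.ones_like(fake))
    g_loss.backward()
    opt_g.step()
    return g_loss.item(), d_loss.item()
```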
  • Using the codec model provided by the above embodiments of this application to encode or decode audio data can significantly improve the encoding and decoding quality of the audio data, and in particular improves the call quality of voice calls and video calls in weak-network environments such as inside elevators, under tall buildings, or in mountainous areas.
  • Table 1 shows a call-quality comparison between the embodiments of this application and codec models in the related art. The PESQ and STOI metrics both measure speech quality, and larger values are better; a minimal sketch of computing these two metrics is given below. The comparison in Table 1 shows that the codec model provided by the embodiments of this application supports smooth voice calls at a bandwidth of 3 kbps, with call quality higher than that of the open-source codec Opus at 6 kbps.
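  • A sketch of measuring these two metrics, assuming the third-party pesq and pystoi packages (which are not part of this application):

```python
from pesq import pesq      # ITU-T P.862; "wb" = wideband mode for 16 kHz
from pystoi import stoi    # short-time objective intelligibility

def call_quality(reference, decoded, sr=16000):
    """Score a decoded signal against its reference; larger is better."""
    return {
        "PESQ": pesq(sr, reference, decoded, "wb"),
        "STOI": stoi(reference, decoded, sr, extended=False),
    }
```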
  • It should be noted that although the steps of the methods of this application are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that order, or that all of the illustrated steps must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one, and/or one step may be decomposed into multiple steps.
  • The following introduces apparatus embodiments of this application, which can be used to perform the audio encoding and decoding methods in the above embodiments.
  • Figure 15 shows a structural block diagram of an audio decoding device in an embodiment of this application.
  • the audio decoding device 1500 includes:
  • the acquisition module 1510 is configured to acquire the encoding vector of each audio frame in the audio frame sequence
  • the first upsampling module 1520 is configured to, when decoding reaches the current audio frame in the audio frame sequence, upsample the encoding vector of a historical audio frame to obtain upsampling feature values, where the historical audio frame is one or more audio frames decoded before the current audio frame in the audio frame sequence, and an upsampling feature value is a feature vector obtained during the upsampling process that describes the historical audio frame;
  • the second upsampling module 1530 is configured to upsample the encoding vector of the current audio frame according to the upsampling feature value to obtain the decoded data of the current audio frame.
  • the second upsampling module 1530 may further include:
  • a decoder acquisition module, configured to acquire a decoder that includes multiple upsampling layers, where the upsampling feature values include multiple feature vectors obtained by the multiple upsampling layers respectively upsampling the encoding vectors of the historical audio frames;
  • a data input module configured to input the encoding vector of the current audio frame into the decoder, and input the multiple feature vectors into the multiple upsampling layers correspondingly;
  • the upsampling processing module is configured to perform upsampling processing on the encoding vector of the current audio frame and the plurality of feature vectors through the plurality of upsampling layers to obtain decoded data of the current audio frame.
  • the second upsampling module 1530 may further include:
  • an encoder acquisition module configured to acquire an encoder including multiple downsampling layers
  • a codec processing module, configured to encode and decode audio input samples through the encoder and the decoder to obtain audio output samples;
  • a first error determination module configured to determine a first loss error of the encoder and the decoder based on the audio input samples and the audio output samples;
  • a second error determination module configured to perform type discrimination on the audio input sample and the audio output sample through a sample discriminator, and determine a second loss error of the sample discriminator based on the discrimination result;
  • a generative adversarial training module, configured to perform generative adversarial training on the encoder, the decoder, and the sample discriminator according to the first loss error and the second loss error, so as to update the network parameters of the encoder, the decoder, and the sample discriminator.
  • the sample discriminator includes an original sample discriminator and a sample feature discriminator;
  • the second error determination module includes:
  • a discriminator input module configured to input the audio input sample and the audio output sample to the original sample discriminator to obtain a first type of discrimination result output by the original sample discriminator;
  • a spectral feature extraction module configured to perform spectral feature extraction on the audio input sample to obtain a first Mel spectrum, and perform spectral feature extraction on the audio output sample to obtain a second Mel spectrum;
  • a spectral feature input module, configured to input the first mel spectrum and the second mel spectrum into the sample feature discriminator to obtain a second-type discrimination result output by the sample feature discriminator, where the discrimination results include the first-type discrimination result and the second-type discrimination result.
  • the first error determination module may be further configured to: perform spectral feature extraction on the audio input sample to obtain a first mel spectrum, and perform spectral feature extraction on the audio output sample to obtain a second mel spectrum; and determine the first loss error of the encoder and the decoder according to the degree of difference between the first mel spectrum and the second mel spectrum.
  • the first error determination module may be further configured to: obtain sampling windows of at least two sample scales; perform spectral feature extraction on the audio input sample at different sample scales through the sampling windows to obtain a multi-scale first mel spectrum; and perform spectral feature extraction on the audio output sample to obtain a multi-scale second mel spectrum.
  • the upsampling layer includes at least two sampling channels; the upsampling processing module includes:
  • a channel feature extraction module configured to perform feature extraction on the encoding vector of the current audio frame and the plurality of feature vectors through at least two sampling channels in the upsampling layer to obtain at least two channel feature values
  • a mean and variance acquisition module configured to acquire the mean and variance of the at least two channel feature values
  • a normalization processing module configured to normalize the at least two channel feature values according to the mean and variance.
  • the upsampling processing module further includes:
  • a weighted smoothing module, configured to perform weighted smoothing on the means and variances across audio frames;
  • the normalization processing module is configured to normalize the at least two channel feature values according to the mean and variance after weighted smoothing processing.
  • the acquisition module 1510 may further include:
  • a coding index value acquisition module configured to obtain, for each audio frame in the audio frame sequence, a coding index value of the audio frame, where the coding index value is used to indicate a codebook vector in the codebook;
  • a coding vector determination module configured to query the codebook for a codebook vector associated with the coding index value, and determine a coding vector of the audio frame according to the codebook vector.
  • In one embodiment, the dimension of the codebook vector is lower than the dimension of the encoding vector, and the encoding vector determination module may be further configured to perform an ascending-dimension projection on the codebook vector to obtain the encoding vector of the audio frame.
  • the acquisition module 1510 may further include:
  • a quantizer acquisition module configured to obtain the codebook and a quantizer used to maintain the codebook, where the codebook is used to represent the mapping relationship between the encoding index value and the codebook vector;
  • a coding vector sample acquisition module configured to acquire coding vector samples obtained by encoding the audio frame samples by the encoder
  • a quantizer prediction module configured to predict, by the quantizer, codebook vector samples that match the encoding vector samples
  • a quantizer update module configured to update the network parameters of the quantizer according to the loss error between the encoding vector sample and the codebook vector sample;
  • a coding vector determination module configured to query the codebook for a codebook vector associated with the coding index value through the trained quantizer.
  • the acquisition module 1510 may further include:
  • a statistical parameter acquisition module configured to acquire statistical parameters of coding vector samples that match the codebook vector samples
  • a codebook update module is configured to update the codebook according to the statistical parameters, and the updated codebook is used for the next prediction of codebook vector samples that match the encoding vector samples.
  • In one embodiment, the statistical parameters include at least one of a vector sum and a hit count. The vector sum represents an average vector obtained by weighted averaging of the matched encoding vector samples, and the hit count represents the number of encoding vector samples matched to the codebook vector sample. The codebook update module may be further configured to: perform exponentially weighted smoothing on the codebook according to the vector sum, and perform Laplace smoothing on the codebook according to the hit count; a minimal sketch of such an update follows this module list.
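  • The sketch below shows one way to combine the exponentially weighted smoothing and Laplace smoothing described above; the decay factor and epsilon are illustrative hyperparameters:

```python
import torch

def update_codebook(codebook, ema_counts, ema_sums, z, indices,
                    decay=0.99, eps=1e-5):
    """EMA codebook maintenance with Laplace smoothing.

    z: (n, d) encoding vector samples; indices: (n,) matched entries;
    ema_counts: (m,) smoothed hit counts; ema_sums: (m, d) smoothed
    vector sums. Each entry moves toward the mean of its matched samples.
    """
    m, _ = codebook.shape
    one_hot = torch.zeros(z.size(0), m).scatter_(1, indices.unsqueeze(1), 1.0)

    # exponentially weighted smoothing of hit counts and vector sums
    ema_counts.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)
    ema_sums.mul_(decay).add_(one_hot.t() @ z, alpha=1 - decay)

    # Laplace smoothing avoids the zero-probability (zero hit count) issue
    n = ema_counts.sum()
    counts = (ema_counts + eps) / (n + m * eps) * n

    codebook.copy_(ema_sums / counts.unsqueeze(1))
    return codebook
```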
  • FIG 16 shows a structural block diagram of an audio encoding device in an embodiment of the present application.
  • the audio encoding device 1600 includes:
  • the acquisition module 1610 is configured to acquire the audio data of each audio frame in the audio frame sequence
  • the first downsampling module 1620 is configured to, when encoding reaches the current audio frame in the audio frame sequence, downsample the audio data of a historical audio frame to obtain downsampling feature values, where the historical audio frame is one or more audio frames encoded before the current audio frame in the audio frame sequence, and a downsampling feature value is a feature vector obtained during the downsampling process that describes the historical audio frame;
  • the second downsampling module 1630 is configured to downsample the audio data of the current audio frame according to the downsampling feature values to obtain the encoding vector of the current audio frame.
  • The specific details of the audio encoding and decoding devices provided in the embodiments of this application have been described in detail in the corresponding method embodiments and are not repeated here.
  • Figure 17 schematically shows a block diagram of a computer system used to implement an electronic device according to an embodiment of this application.
  • It should be noted that the computer system 1700 of the electronic device shown in Figure 17 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
  • As shown in Figure 17, the computer system 1700 includes a central processing unit (CPU) 1701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1702 or a program loaded from a storage section 1708 into a random access memory (RAM) 1703. The RAM 1703 also stores the various programs and data required for system operation.
  • The central processing unit 1701, the read-only memory 1702, and the random access memory 1703 are connected to one another through a bus 1704.
  • An input/output interface 1705 (I/O interface) is also connected to the bus 1704.
  • The following components are connected to the input/output interface 1705: an input section 1706 including a keyboard, a mouse, and the like; an output section 1707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1708 including a hard disk and the like; and a communication section 1709 including a network interface card such as a LAN card or a modem.
  • The communication section 1709 performs communication processing via a network such as the Internet.
  • A drive 1710 is also connected to the input/output interface 1705 as needed. A removable medium 1711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1710 as needed, so that a computer program read from it can be installed into the storage section 1708 as needed.
  • the processes described in the respective method flow charts may be implemented as computer software programs.
  • embodiments of the present application include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1709, and/or installed from the removable medium 1711. When the computer program is executed by the central processing unit 1701, the various functions defined in the system of this application are executed.
  • the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried.
  • Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than computer-readable storage media that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams or flowcharts, and combinations of blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • It should further be noted that although several modules or units of a device for action execution are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more of the modules or units described above may be embodied in one module or unit, and conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units.
  • Through the above description of the embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of this application can therefore be embodied as a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a portable hard disk, and so on) or on a network, and which includes several instructions that cause a computing device (which can be a personal computer, a server, a touch terminal, a network device, and so on) to execute the method according to the embodiments of this application.
  • Other embodiments of this application will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed here. This application is intended to cover any variations, uses, or adaptations that follow the general principles of this application and include common knowledge or customary technical means in the art that are not disclosed in this application.
  • It should be understood that this application is not limited to the precise structures that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio encoding/decoding method, an audio encoding/decoding apparatus, a computer-readable medium, an electronic device, and a computer program product, belonging to the field of audio and video technology. The audio decoding method includes: obtaining the encoding vector of each audio frame in an audio frame sequence (S410); when decoding reaches the current audio frame in the audio frame sequence, upsampling the encoding vector of a historical audio frame to obtain upsampling feature values, where the historical audio frame is one or more audio frames decoded before the current audio frame in the audio frame sequence, and an upsampling feature value is a feature vector obtained during the upsampling process that describes the historical audio frame (S420); and upsampling the encoding vector of the current audio frame according to the upsampling feature values to obtain the decoded data of the current audio frame (S430). The encoding and decoding quality of audio data can be improved.

Description

Audio encoding and decoding method and related products
This application claims priority to the Chinese patent application No. 202210546928.4, entitled "Audio encoding and decoding method and related products" and filed with the China National Intellectual Property Administration on May 19, 2022, the entire contents of which are incorporated herein by reference.

Claims (19)

  1. An audio decoding method, the method being performed by a computer device and comprising:
    obtaining an encoding vector of each audio frame in an audio frame sequence;
    when decoding reaches a current audio frame in the audio frame sequence, upsampling an encoding vector of a historical audio frame to obtain upsampling feature values, wherein the historical audio frame is one or more audio frames decoded before the current audio frame in the audio frame sequence, and the upsampling feature values are feature vectors obtained during the upsampling process that describe the historical audio frame; and
    upsampling the encoding vector of the current audio frame according to the upsampling feature values to obtain decoded data of the current audio frame.
  2. The audio decoding method according to claim 1, wherein upsampling the encoding vector of the current audio frame according to the upsampling feature values to obtain the decoded data of the current audio frame comprises:
    obtaining a decoder comprising multiple upsampling layers, wherein the upsampling feature values comprise multiple feature vectors obtained by the multiple upsampling layers respectively upsampling the encoding vector of the historical audio frame;
    inputting the encoding vector of the current audio frame into the decoder, and correspondingly inputting the multiple feature vectors into the multiple upsampling layers; and
    upsampling the encoding vector of the current audio frame and the multiple feature vectors through the multiple upsampling layers to obtain the decoded data of the current audio frame.
  3. The audio decoding method according to claim 2, wherein before inputting the encoding vector of the current audio frame into the decoder, the method further comprises:
    obtaining an encoder comprising multiple downsampling layers;
    encoding and decoding audio input samples through the encoder and the decoder to obtain audio output samples;
    determining a first loss error of the encoder and the decoder according to the audio input samples and the audio output samples;
    performing type discrimination on the audio input samples and the audio output samples through a sample discriminator, and determining a second loss error of the sample discriminator according to discrimination results; and
    performing generative adversarial training on the encoder, the decoder, and the sample discriminator according to the first loss error and the second loss error, so as to update network parameters of the encoder, the decoder, and the sample discriminator.
  4. The audio decoding method according to claim 3, wherein the sample discriminator comprises an original-sample discriminator and a sample-feature discriminator, and performing type discrimination on the audio input samples and the audio output samples through the sample discriminator comprises:
    inputting the audio input samples and the audio output samples into the original-sample discriminator to obtain a first-type discrimination result output by the original-sample discriminator;
    performing spectral feature extraction on the audio input samples to obtain a first mel spectrum, and performing spectral feature extraction on the audio output samples to obtain a second mel spectrum; and
    inputting the first mel spectrum and the second mel spectrum into the sample-feature discriminator to obtain a second-type discrimination result output by the sample-feature discriminator, wherein the discrimination results comprise the first-type discrimination result and the second-type discrimination result.
  5. The audio decoding method according to claim 3, wherein determining the first loss error of the encoder and the decoder according to the audio input samples and the audio output samples comprises:
    performing spectral feature extraction on the audio input samples to obtain a first mel spectrum, and performing spectral feature extraction on the audio output samples to obtain a second mel spectrum; and
    determining the first loss error of the encoder and the decoder according to a degree of difference between the first mel spectrum and the second mel spectrum.
  6. The audio decoding method according to claim 5, wherein performing spectral feature extraction on the audio input samples to obtain the first mel spectrum and performing spectral feature extraction on the audio output samples to obtain the second mel spectrum comprises:
    obtaining sampling windows of at least two sample scales; and
    performing spectral feature extraction on the audio input samples at different sample scales through the sampling windows to obtain a multi-scale first mel spectrum, and performing spectral feature extraction on the audio output samples to obtain a multi-scale second mel spectrum.
  7. The audio decoding method according to claim 2, wherein the upsampling layer comprises at least two sampling channels, and upsampling the encoding vector of the current audio frame and the multiple feature vectors through the multiple upsampling layers comprises:
    performing feature extraction on the encoding vector of the current audio frame and the multiple feature vectors through the at least two sampling channels in the upsampling layer to obtain at least two channel feature values;
    obtaining a mean and a variance of the at least two channel feature values; and
    normalizing the at least two channel feature values according to the mean and the variance.
  8. The audio decoding method according to claim 7, wherein before normalizing the at least two channel feature values according to the mean and the variance, the method further comprises:
    performing weighted smoothing on the means and variances across audio frames;
    wherein normalizing the at least two channel feature values according to the mean and the variance comprises:
    normalizing the at least two channel feature values according to the mean and the variance after weighted smoothing.
  9. The audio decoding method according to any one of claims 1 to 8, wherein obtaining the encoding vector of each audio frame in the audio frame sequence comprises:
    for each audio frame in the audio frame sequence, obtaining an encoding index value of the audio frame, the encoding index value indicating a codebook vector in a codebook; and
    querying the codebook for the codebook vector associated with the encoding index value, and determining the encoding vector of the audio frame according to the codebook vector.
  10. The audio decoding method according to claim 9, wherein a dimension of the codebook vector is lower than a dimension of the encoding vector, and determining the encoding vector of the audio frame according to the codebook vector comprises:
    performing an ascending-dimension projection on the codebook vector to obtain the encoding vector of the audio frame.
  11. The audio decoding method according to claim 9, wherein before querying the codebook for the codebook vector associated with the encoding index value, the method further comprises:
    obtaining the codebook and a quantizer used to maintain the codebook, the codebook representing a mapping relationship between encoding index values and codebook vectors;
    obtaining encoding vector samples obtained by an encoder encoding audio frame samples;
    predicting, through the quantizer, codebook vector samples that match the encoding vector samples; and
    updating network parameters of the quantizer according to a loss error between the encoding vector samples and the codebook vector samples;
    wherein querying the codebook for the codebook vector associated with the encoding index value comprises:
    querying the codebook for the codebook vector associated with the encoding index value through the trained quantizer.
  12. The audio decoding method according to claim 11, wherein after predicting, through the quantizer, the codebook vector samples that match the encoding vector samples, the method further comprises:
    obtaining statistical parameters of the encoding vector samples that match the codebook vector samples; and
    updating the codebook according to the statistical parameters, the updated codebook being used for the next prediction of codebook vector samples that match encoding vector samples.
  13. The audio decoding method according to claim 12, wherein the statistical parameters comprise at least one of a vector sum and a hit count, the vector sum representing an average vector obtained by weighted averaging of the encoding vector samples, and the hit count representing the number of encoding vector samples that match the codebook vector sample; and updating the codebook according to the statistical parameters comprises:
    performing exponentially weighted smoothing on the codebook according to the vector sum; and
    performing Laplace smoothing on the codebook according to the hit count.
  14. An audio encoding method, the method being performed by a computer device and comprising:
    obtaining audio data of each audio frame in an audio frame sequence;
    when encoding reaches a current audio frame in the audio frame sequence, downsampling audio data of a historical audio frame to obtain downsampling feature values, wherein the historical audio frame is one or more audio frames encoded before the current audio frame in the audio frame sequence, and the downsampling feature values are feature vectors obtained during the downsampling process that describe the historical audio frame; and
    downsampling the audio data of the current audio frame according to the downsampling feature values to obtain an encoding vector of the current audio frame.
  15. An audio decoding apparatus, deployed on a computer device and comprising:
    an acquisition module, configured to obtain an encoding vector of each audio frame in an audio frame sequence;
    a first upsampling module, configured to, when decoding reaches a current audio frame in the audio frame sequence, upsample an encoding vector of a historical audio frame to obtain upsampling feature values, wherein the historical audio frame is one or more audio frames decoded before the current audio frame in the audio frame sequence, and the upsampling feature values are feature vectors obtained during the upsampling process that describe the historical audio frame; and
    a second upsampling module, configured to upsample the encoding vector of the current audio frame according to the upsampling feature values to obtain decoded data of the current audio frame.
  16. An audio encoding apparatus, deployed on a computer device and comprising:
    an acquisition module, configured to obtain audio data of each audio frame in an audio frame sequence;
    a first downsampling module, configured to, when encoding reaches a current audio frame in the audio frame sequence, downsample audio data of a historical audio frame to obtain downsampling feature values, wherein the historical audio frame is one or more audio frames encoded before the current audio frame in the audio frame sequence, and the downsampling feature values are feature vectors obtained during the downsampling process that describe the historical audio frame; and
    a second downsampling module, configured to downsample the audio data of the current audio frame according to the downsampling feature values to obtain an encoding vector of the current audio frame.
  17. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 14.
  18. An electronic device, comprising:
    a processor; and
    a memory for storing a computer program executable by the processor;
    wherein the processor is configured to, by executing the executable computer program, cause the electronic device to perform the method according to any one of claims 1 to 14.
  19. A computer program product, comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 14.
PCT/CN2023/085872 2022-05-19 2023-04-03 音频编解码方法及相关产品 WO2023221674A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210546928.4 2022-05-19
CN202210546928.4A CN115050378B (zh) 2022-05-19 音频编解码方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023221674A1 (zh)

Family

ID=83160045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085872 WO2023221674A1 (zh) 2022-05-19 2023-04-03 音频编解码方法及相关产品

Country Status (1)

Country Link
WO (1) WO2023221674A1 (zh)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004090864A2 (en) * 2003-03-12 2004-10-21 The Indian Institute Of Technology, Bombay Method and apparatus for the encoding and decoding of speech
CN102436819A (zh) * 2011-10-25 2012-05-02 杭州微纳科技有限公司 无线音频压缩、解压缩方法及音频编码器和音频解码器
US20200294518A1 (en) * 2017-11-10 2020-09-17 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for encoding and decoding an audio signal using downsampling or interpolation of scale parameters
CN113129911A (zh) * 2021-03-19 2021-07-16 江门市华恩电子研究院有限公司 一种音频信号编码压缩和传输的方法及电子设备
CN113903345A (zh) * 2021-09-29 2022-01-07 北京字节跳动网络技术有限公司 音频处理方法、设备及电子设备
CN115050378A (zh) * 2022-05-19 2022-09-13 腾讯科技(深圳)有限公司 音频编解码方法及相关产品

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592384A (zh) * 2024-01-19 2024-02-23 广州市车厘子电子科技有限公司 一种基于生成对抗网络的主动声浪生成方法
CN117592384B (zh) * 2024-01-19 2024-05-03 广州市车厘子电子科技有限公司 一种基于生成对抗网络的主动声浪生成方法

Also Published As

Publication number Publication date
CN115050378A (zh) 2022-09-13

Similar Documents

Publication Publication Date Title
US10810993B2 (en) Sample-efficient adaptive text-to-speech
CN111508508A (zh) 一种超分辨率音频生成方法及设备
TW201236444A (en) Video transmission and sharing over ultra-low bitrate wireless communication channel
CN109785847B (zh) 基于动态残差网络的音频压缩算法
WO2023221674A1 (zh) 音频编解码方法及相关产品
WO2021179788A1 (zh) 语音信号的编解码方法、装置、电子设备及存储介质
Wu et al. Audiodec: An open-source streaming high-fidelity neural audio codec
CN112116903A (zh) 语音合成模型的生成方法、装置、存储介质及电子设备
WO2023142454A1 (zh) 语音翻译和模型训练方法、装置、电子设备以及存储介质
US11990148B2 (en) Compressing audio waveforms using neural networks and vector quantizers
WO2023241240A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
CN113488063A (zh) 一种基于混合特征及编码解码的音频分离方法
JP2023548707A (ja) 音声強調方法、装置、機器及びコンピュータプログラム
US20230016637A1 (en) Apparatus and Method for End-to-End Adversarial Blind Bandwidth Extension with one or more Convolutional and/or Recurrent Networks
CN111816197B (zh) 音频编码方法、装置、电子设备和存储介质
CN113903345A (zh) 音频处理方法、设备及电子设备
WO2023241222A1 (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
WO2023241254A1 (zh) 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2023241205A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2022213825A1 (zh) 基于神经网络的端到端语音增强方法、装置
CN115050378B (zh) 音频编解码方法及相关产品
CN115206321A (zh) 语音关键词的识别方法、装置和电子设备
US20210287038A1 (en) Identifying salient features for generative networks
CN113571079A (zh) 语音增强方法、装置、设备及存储介质
Xu et al. A Multi-Scale Feature Aggregation Based Lightweight Network for Audio-Visual Speech Enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806631

Country of ref document: EP

Kind code of ref document: A1