WO2022213787A1 - Audio encoding method, audio decoding method, apparatus, computer device, storage medium and computer program product


Info

Publication number
WO2022213787A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sample
encoding
coding
rate
Prior art date
Application number
PCT/CN2022/081414
Other languages
English (en)
French (fr)
Inventor
梁俊斌
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to JP2023538141A, published as JP2024501933A
Priority to EP22783856.2A, published as EP4239630A1
Publication of WO2022213787A1
Priority to US17/978,905, published as US20230046509A1


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/24Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/002Dynamic bit allocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/167Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Definitions

  • the present application relates to the technical field of audio and video, and in particular, to an audio encoding method, an audio decoding method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product.
  • Voice coding technology analyzes and compresses the redundancy of the originally collected lossless audio signal in the time domain and frequency domain through an audio model, thereby reducing the voice transmission bandwidth and storage space while maintaining good audio quality.
  • The input parameters of a general speech encoder include the sampling rate, the number of channels, and the encoding bit rate. The larger the encoding bit rate, the more bandwidth the encoded bit stream occupies, the larger the storage space occupied by the encoded file, and the higher the speech encoding quality.
  • In the related art, the encoding bit rate is generally set based on empirical values obtained through experiments: a Perceptual Evaluation of Speech Quality (PESQ) score is matched against the target voice quality requirement to determine the required voice encoding bit rate, and that bit rate is then used in the actual service. In the whole process of voice encoding and compression, the encoding bit rate is usually fixed.
  • Embodiments of the present application provide an audio encoding method, an audio decoding method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product, which can improve the quality of audio encoding.
  • the technical solution includes the following aspects.
  • The embodiment of the present application provides an audio encoding method, and the method includes:
  • acquiring sample audio feature parameters corresponding to each sample audio frame in the first sample audio;
  • performing encoding rate prediction processing on the sample audio feature parameters by using an encoding rate prediction model to obtain a sample encoding rate of the sample audio frame;
  • performing audio encoding on the sample audio frame based on the sample encoding rate, and generating sample audio data based on the encoding result corresponding to each sample audio frame;
  • performing audio decoding on the sample audio data to obtain a second sample audio corresponding to the sample audio data;
  • training the encoding rate prediction model based on the first sample audio and the second sample audio, and ending the training when the sample encoding quality score reaches the target encoding quality score, wherein the sample encoding quality score is determined by the first sample audio and the second sample audio.
  • The embodiment of the present application provides an audio encoding method, and the method includes:
  • acquiring audio feature parameters corresponding to each audio frame in the original audio;
  • performing encoding rate prediction processing on the audio feature parameters by using an encoding rate prediction model to obtain the audio encoding rate of the audio frame, wherein the encoding rate prediction model is used to predict the audio encoding rate corresponding to each audio frame when the target encoding quality score is reached;
  • performing audio encoding on the audio frame based on the audio encoding rate, and generating target audio data based on the encoding result corresponding to each audio frame.
  • The embodiment of the present application provides an audio decoding method, and the method includes:
  • acquiring encoded target audio data;
  • performing audio decoding on the encoded target audio data by using an audio decoding rate corresponding to the audio encoding rate to obtain decoded target audio data.
  • An embodiment of the present application provides an audio encoding device, the device comprising:
  • a first acquisition module configured to acquire sample audio feature parameters corresponding to each sample audio frame in the first sample audio
  • a first processing module configured to perform an encoding rate prediction process on the sample audio feature parameter by using an encoding rate prediction model to obtain a sample encoding rate of the sample audio frame
  • a first encoding module configured to perform audio encoding on the sample audio frame based on the sample encoding code rate, and generate sample audio data based on the encoding result corresponding to each frame of the sample audio frame;
  • an audio decoding module configured to perform audio decoding on the sample audio data to obtain a second sample audio corresponding to the sample audio data
  • a training module configured to train the encoding rate prediction model based on the first sample audio and the second sample audio, and end the training when the sample encoding quality score reaches the target encoding quality score, wherein the sample encoding quality score is determined by the first sample audio and the second sample audio.
  • An embodiment of the present application provides an audio encoding device, the device comprising:
  • a fourth acquisition module configured to acquire audio feature parameters corresponding to each audio frame in the original audio
  • a second processing module configured to perform encoding rate prediction processing on the audio feature parameters by using an encoding rate prediction model to obtain the audio encoding rate of the audio frame, wherein the encoding rate prediction model is used to predict the audio encoding rate corresponding to each audio frame when the target encoding quality score is reached;
  • the second encoding module is configured to perform audio encoding on the audio frame based on the audio encoding code rate, and generate target audio data based on the encoding result corresponding to each audio frame.
  • An embodiment of the present application provides an audio decoding device, and the device includes:
  • a fifth acquisition module configured to acquire the encoded target audio data
  • the decoding module is configured to perform audio decoding on the encoded target audio data by using an audio decoding code rate corresponding to the audio encoding code rate to obtain the decoded target audio data.
  • An embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the audio encoding method or the audio decoding method according to the above aspects.
  • An embodiment of the present application provides a computer-readable storage medium, where at least one program is stored in the storage medium, and the at least one program is loaded and executed by a processor to implement the audio encoding method or the audio decoding method according to the above aspects.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio encoding method or the audio decoding method provided in the foregoing optional implementation manner.
  • In the embodiments of the present application, the dynamic encoding rate is used for audio encoding, which can reduce the audio encoding rate as much as possible while the audio encoding quality meets the target requirements; a smaller audio encoding rate in turn reduces the storage space of audio data and reduces bandwidth consumption in the process of transmitting audio data.
  • FIG. 1 shows a schematic diagram of an audio coding process in the related art;
  • FIG. 2 shows a schematic diagram of an implementation environment provided by an embodiment of the present application;
  • FIG. 3 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 4 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 5 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 6 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 7 shows a schematic diagram of a complete model training process shown in an embodiment of the present application;
  • FIG. 8 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 9 shows a flowchart of an audio coding method shown in an embodiment of the present application;
  • FIG. 10 shows a schematic diagram of an audio encoding process shown in an embodiment of the present application;
  • FIG. 11 shows a structural block diagram of an audio coding apparatus shown in an embodiment of the present application;
  • FIG. 12 shows a structural block diagram of an audio coding apparatus shown in an embodiment of the present application;
  • FIG. 13 shows a structural block diagram of a computer device provided by an embodiment of the present application.
  • Audio coding analyzes and compresses the redundancy of the originally collected lossless audio signal in the time domain and frequency domain through an audio model, thereby reducing the voice transmission bandwidth and storage space while maintaining good audio quality.
  • The input parameters of the audio encoder include the sampling rate, the number of channels, and the encoding bit rate. The larger the encoding bit rate used in audio encoding, the better the speech encoding quality, but the more bandwidth the encoded bit stream occupies and the larger the storage space occupied by the encoded audio file.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the embodiments of the present application mainly relate to the field of machine learning technology in the field of artificial intelligence technology.
  • FIG. 1 shows a schematic diagram of an audio coding process in the related art.
  • Based on the coding parameters 104, the sending end performs voice coding and channel coding on the collected original voice 103, and the coding result is transmitted to the receiving end 102 via the Internet; the receiving end 102 performs channel decoding and voice decoding on the coding result and generates a corresponding sound signal 105.
  • the encoding parameters (encoding bit rate) are generally fixed, and are only appropriately adjusted according to the packet loss state 106 .
  • In the related art, the audio signal is encoded with a fixed encoding rate. Since the voice signal itself is a time-varying signal, the compression processes of different voice signals in the voice encoder at different times differ considerably, so that under the same encoding rate the encoding quality of different voice signals varies greatly and the quality of the speech encoding cannot be guaranteed.
  • an embodiment of the present application provides a method for dynamically adjusting the audio coding rate based on audio feature parameters (ie, the audio coding method and the audio decoding method).
  • FIG. 2 shows a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment may include: a first terminal 210 , a server 220 and a second terminal 230 .
  • the first terminal 210 has an application program supporting the network call technology installed and running. It may be an electronic device such as a smartphone, desktop computer, tablet computer, multimedia playback device, smart watch, smart speaker, laptop portable computer, and the like.
  • the application program may be a social program, a live broadcast program, a shopping program, a game program, a video program, an audio program, an instant messaging program, or the like.
  • The first terminal 210 stores an encoding rate prediction model, which can dynamically adjust the audio encoding rate based on the audio feature parameters corresponding to the audio signal; audio encoding is performed based on the predicted audio encoding rate, and the encoded audio data stream is pushed to the second terminal 230 through the server 220.
  • In some embodiments, when predicting the encoding rate, the sending terminal may additionally use the network state parameter fed back by the receiving terminal (for example, the second terminal 230).
  • In a scenario where the encoded audio data needs to be transmitted to the receiving end through the network, the network state parameter needs to be considered; in a scenario where the encoded audio data does not need to be transmitted through the network and only needs to be stored in a local or other storage medium, the network state parameter does not need to be considered.
  • In some embodiments, the encoding rate prediction model pre-stored in the first terminal 210 may be trained by other computer equipment (not shown in the figure) and pushed to the first terminal 210, which enables the first terminal 210 to dynamically adjust the audio encoding rate based on the encoding rate prediction model in the actual application process.
  • the computer device may be a background server corresponding to the application program in the first terminal 210 .
  • the first terminal 210 and the server 220 may be connected through a wireless network or a wired network.
  • the server 220 is configured to provide background services for applications in the first terminal 210 or the second terminal 230 (eg, applications capable of making network calls).
  • server 220 may be a background server for the above-mentioned application.
  • the server 220 may be a server, or a server cluster composed of multiple servers, wherein the multiple servers may form a blockchain, and the server is a node on the blockchain, or a cloud computing service center.
  • the server 220 may receive the audio data stream from the first terminal 210, and push the audio data stream to the indicated second terminal 230.
  • the server 220 may receive the network state parameter fed back by the second terminal 230, and feed back the network state parameter to the first terminal 210, so that the first terminal 210 adjusts the audio coding rate based on the network state parameter.
  • the second terminal 230 and the server 220 may be connected through a wireless network or a wired network.
  • the second terminal 230 has an application program supporting the network call technology installed and running. It may be an electronic device such as a smartphone, desktop computer, tablet computer, multimedia playback device, smart watch, smart speaker, laptop portable computer, and the like.
  • the application may be a social program, a live broadcast application, a shopping program, a game program, a video program, an audio program, an instant messaging program, or the like.
  • the second terminal 230 may receive the audio data stream sent by the first terminal 210, decode the audio data stream, and present the transmitted audio. For example, the second terminal 230 can feed back the network state parameter to the first terminal 210, so that the first terminal 210 can dynamically adjust the audio coding rate based on the network state parameter.
  • In a scenario where the encoded audio data needs to be transmitted to the receiving end through the network, the network state parameter needs to be considered; in a scenario where the encoded audio data does not need to be transmitted through the network and only needs to be stored in a local or other storage medium, the network state parameter does not need to be considered.
  • the audio in the embodiment of the present application is not limited to the audio of the call, and may also be audio recording, live audio, and the like.
  • the above-mentioned terminal may include various types of applications, for example, instant messaging applications, video playback applications, recording applications, live broadcast applications, and the like.
  • The above audio encoding method and audio decoding method can be applied to, but are not limited to, scenarios such as cloud games, voice calls, and live video broadcasts.
  • FIG. 3 shows a flowchart of the audio coding method shown in the embodiment of the present application.
  • The embodiment of the present application is described by taking the method applied to the first terminal 210 shown in FIG. 2 as an example, and the method includes the following steps.
  • Step 301 Acquire audio feature parameters corresponding to each audio frame in the original audio.
  • the original audio may be the voice collected by the terminal.
  • For example, the original audio may be the sound signal collected in a network voice call scene or a video call scene, the sound signal collected in a live broadcast scene, the sound signal collected in an online karaoke scene, or the sound signal collected in a voice broadcasting scene; the original audio may also be audio obtained in an audio storage scene, such as music or the audio of a video. The embodiments of the present application do not limit the form of the original audio.
  • In the related art, the audio encoding rate applicable to different application scenarios is obtained through preliminary measurement, and in the actual application process this rate is used to encode the obtained original audio; that is, all audio in a given application scenario uses a fixed encoding rate.
  • Taking the voice signal as an example, the voice signal itself is a time-varying signal. If a fixed encoding rate is used to encode different voice signals, the compression quality of different voice signals at different times within the audio encoder obviously varies greatly, and the speech encoding quality cannot be guaranteed.
  • Therefore, in the embodiments of the present application, prediction is performed on the audio feature parameters corresponding to each audio frame in the same original audio to obtain the audio encoding rate corresponding to each audio frame, so that the audio encoding rate can be dynamically adjusted based on different audio feature parameters, each audio frame can meet the encoding quality requirements, and the encoding quality of the original audio is thereby improved.
  • When the original audio is divided into audio frames, the division may be performed according to a set duration; for example, every 20 ms of audio constitutes one audio frame.
  • the audio feature parameters may include fixed gain, adaptive gain, pitch period, line spectrum pair parameters, etc.
  • the embodiments of the present application are not limited to fixed gain, adaptive gain, pitch period, and line spectrum pair parameters.
  • the pitch period is used to reflect the time interval or the frequency of the opening and closing of the glottis.
  • When a person speaks, the vocal cords vibrate to produce voiced sounds (unvoiced sounds are produced by air friction).
  • The pronunciation process of a voiced sound is as follows: the airflow from the lungs impacts the glottis, causing the glottis to open and close and forming a series of quasi-periodic airflow pulses. The voiced waveform therefore presents a certain quasi-periodicity, and the pitch period refers to the period of this quasi-periodicity.
  • When extracting the pitch period corresponding to the audio signal, an autocorrelation method, a cepstrum method, an average magnitude difference function method, a linear prediction method, a wavelet-autocorrelation function method, a spectral subtraction-autocorrelation function method, or the like can be used.
  • Generally, voiced sounds require a higher encoding rate (an encoding rate greater than a voiced rate threshold), while unvoiced sounds require a lower encoding rate (an encoding rate less than an unvoiced rate threshold). Therefore, the encoding rates that different speech signals need to use to reach the preset encoding quality are also different.
  • Correspondingly, in this embodiment of the present application, the pitch period is extracted so as to further analyze the encoding rate that the corresponding audio frame needs to use.
  • In the process of audio encoding, it is necessary to apply positive or negative adjustment to the input sound so that the output sound suits the subjective perception of the human ear. This process is the gain control process of the original audio, and the adaptive gain corresponding to the speech signal differs at different times due to differences in loudness.
  • As the gain increases, the noise in the audio signal also increases, and the essence of audio encoding is to reduce the redundancy (i.e., the noise signal) in the audio. Obviously, different gains affect the encoding rate of the audio signal; therefore, the corresponding encoding rate needs to be determined based on the gains corresponding to different audio frames.
  • The line spectrum pair parameters are used to reflect the spectral characteristics of the audio signal and have relative independence of error; that is, a line spectrum pair parameter deviation at a certain frequency point only affects the speech spectrum near that frequency and has little effect on the speech spectrum at other frequencies. This is beneficial to the quantization and interpolation of the line spectrum pair parameters, and encoded audio of the same quality can be achieved with a relatively small encoding rate. It can be seen that the line spectrum pair parameters corresponding to the audio signal are helpful for determining the encoding rate.
  • a corresponding audio feature extraction model may be set, the original audio is input into the audio feature extraction model, and audio feature extraction is performed on each audio frame contained in the original audio, thereby outputting audio feature parameters corresponding to each audio frame.
  • In some embodiments, N audio feature dimensions that have a greater impact on the encoding result can be selected, and correspondingly it is only necessary to extract the audio feature parameters in these N audio feature dimensions, where N is a positive integer.
  • different audio feature extraction dimensions can be set for different audio types.
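  • To make the per-frame feature extraction above concrete, the following is a minimal sketch, not the patent's actual extractor: it frames a PCM signal into 20 ms frames and computes two illustrative features, frame energy (a stand-in for gain) and the pitch period estimated by the autocorrelation method mentioned above. Function and parameter names are placeholders.

```python
import numpy as np

def extract_frame_features(signal, sample_rate=16000, frame_ms=20):
    """Frame a float mono signal and estimate illustrative per-frame features."""
    frame_len = sample_rate * frame_ms // 1000               # 320 samples at 16 kHz
    min_lag, max_lag = sample_rate // 400, sample_rate // 50  # 50-400 Hz pitch search
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))                  # crude gain proxy
        # Autocorrelation pitch estimate: the lag with the strongest self-similarity.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        pitch_period = min_lag + int(np.argmax(ac[min_lag:max_lag]))
        features.append({"energy": energy, "pitch_period": pitch_period})
    return features
```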
  • Step 302 Perform encoding rate prediction processing on the audio feature parameters by using the encoding rate prediction model to obtain the audio encoding rate of the audio frame.
  • The encoding rate prediction model is trained with the target encoding quality score as the target. Therefore, in the process of applying the encoding rate prediction model, the audio encoding rate corresponding to each audio frame, at which the audio encoding quality of the original audio reaches the target encoding quality score, can be predicted based on the audio feature parameters corresponding to each audio frame.
  • different audio feature parameters correspond to different audio coding bit rates.
  • the terminal is provided with a coding rate prediction model, and the coding rate prediction model can dynamically adjust the audio coding rate corresponding to each audio frame based on the audio feature parameters corresponding to each audio frame.
  • the audio feature parameter corresponding to each audio frame is input into the coding rate prediction model, so that the audio coding rate corresponding to the audio frame can be obtained, so that the audio frame can be subsequently encoded based on the audio coding rate.
  • Step 303 Perform audio encoding on the audio frame based on the audio encoding bit rate, and generate target audio data based on the encoding result corresponding to each audio frame.
  • After the audio encoding rate corresponding to each audio frame is obtained, each audio frame can be encoded based on its audio encoding rate, and the encoding results corresponding to the audio frames are combined to generate the target audio data corresponding to the original audio.
  • Illustratively, if the original audio contains 50 audio frames whose audio feature parameters are audio feature parameter 1 to audio feature parameter 50, the audio feature parameters corresponding to each audio frame are respectively input into the encoding rate prediction model to obtain the encoding rate corresponding to each audio frame (i.e., encoding rate 1 to encoding rate 50); each audio frame is then encoded based on its corresponding audio encoding rate, and the audio encoding results corresponding to the audio frames (i.e., audio encoding result 1 to audio encoding result 50) are obtained.
  • the audio encoding method in the embodiment of the present application may be pulse code modulation (PCM, Pulse Code Modulation) encoding, waveform sound file (WAV) encoding, MP3 encoding, and the like.
  • the target audio data can be stored in the terminal, and can also be transmitted to other devices through the network.
  • In a scenario where the target audio data is transmitted through the network, the encoded target audio data is transmitted to the receiving end, and the receiving end performs audio decoding using the audio decoding rate corresponding to the audio encoding rate to obtain the decoded target audio data and play it back losslessly.
  • Since the difference in audio features between several consecutive audio frames is generally small, the difference between their corresponding audio encoding rates is also small, or they generally correspond to the same audio encoding rate. In some embodiments, the audio encoding rates obtained for the audio frames can therefore be smoothed to reduce the influence of prediction errors on the audio encoding quality, as illustrated by the sketch below.
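  • The patent does not specify the smoothing method; the following is a minimal sketch, assuming a simple moving average over the per-frame predicted encoding rates (the window size is an assumption).

```python
import numpy as np

def smooth_bitrates(bitrates, window=5):
    """Moving-average smoothing of per-frame predicted encoding rates."""
    kernel = np.ones(window) / window
    padded = np.pad(np.asarray(bitrates, dtype=float),
                    (window // 2, window - 1 - window // 2), mode="edge")
    return np.convolve(padded, kernel, mode="valid").tolist()
```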
  • To sum up, in the embodiment of the present application, an audio encoding rate matched to the audio feature parameters can be determined for each audio frame, thereby improving the encoding quality of the entire audio. Compared with the fixed encoding rate used in the related art, the dynamic encoding rate used for audio encoding in the embodiment of the present application can reduce the audio encoding rate as much as possible while the audio encoding quality meets the target requirement, thereby reducing the storage space of audio data and reducing bandwidth consumption in the process of transmitting audio data.
  • In order to enable the encoding rate prediction model to dynamically adjust the audio encoding rate, it is necessary to train the encoding rate prediction model in advance with a large number of sample audios, so that the encoding rate prediction model can learn the audio encoding rates applicable to different audio feature parameters and the audio encoding rate can be dynamically adjusted based on the encoding rate prediction model during application.
  • FIG. 4 shows a flowchart of an audio coding method according to an embodiment of the present application.
  • the embodiment of the present application takes a computer device as an example for illustrative description, and the method includes the following steps.
  • Step 401 Obtain sample audio feature parameters corresponding to each sample audio frame in the first sample audio.
  • Since the encoding rate prediction model is used to match the audio encoding rates corresponding to different audio feature parameters, in the process of model training, a large number of sample audios need to be obtained, and the sample audio feature parameters corresponding to each sample audio frame are extracted to train the encoding rate prediction model.
  • sample audio feature parameters can be extracted by an audio feature extraction model.
  • different types of audio may be acquired, such as voice, music, audio in audio and video, and the like.
  • In some embodiments, sample audios with different audio contents and different audio durations can also be selected; for the same first sample audio, the sample audio is divided into different audio frames for subsequent extraction of audio feature parameters.
  • Step 402 Perform encoding rate prediction processing on the sample audio feature parameters by using the encoding rate prediction model to obtain the sample encoding rate of the sample audio frame.
  • the sample audio feature parameters corresponding to each sample audio frame are input into the encoding rate prediction model, and the sample encoding rate corresponding to each sample audio frame output by the encoding rate prediction model can be obtained.
  • The encoding rate prediction model can use a fully connected network as its main network, or can use a deep neural network (DNN, Deep Neural Networks), a convolutional neural network (CNN, Convolutional Neural Networks), a recurrent neural network (RNN, Recurrent Neural Networks), or another neural network built by developers based on actual needs; the embodiments of the present application do not limit the structure of the encoding rate prediction model. Different sample audio feature parameters correspond to different sample encoding rates.
  • Step 403 Perform audio coding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the coding result corresponding to each frame of the sample audio frame.
  • Since the sample encoding rate or audio encoding rate output by the encoding rate prediction model corresponds to an audio encoding scene, when evaluating whether the encoding rate output by the encoding rate prediction model matches the audio frame, it is necessary to perform audio encoding on the sample audio frame using the sample encoding rate, and the audio encoding result is then used as one of the bases for training the encoding rate prediction model.
  • In this embodiment of the present application, the sample encoding rate corresponding to each sample audio frame in the first sample audio is obtained, and audio encoding is performed on each sample audio frame based on its corresponding sample encoding rate, so that sample audio data is generated based on the encoding results corresponding to the sample audio frames and can subsequently be used to evaluate the current speech encoding quality of the first sample audio.
  • Step 404 Perform audio decoding on the sample audio data to obtain a second sample audio corresponding to the sample audio data.
  • In this embodiment of the present application, audio decoding is performed on the sample audio data to obtain the second sample audio generated based on the sample audio data, so that the audio encoding quality of the first sample audio can be determined by comparing the second sample audio with the original first sample audio.
  • Step 405 Train the encoding rate prediction model based on the first sample audio and the second sample audio, and end the training when the sample encoding quality score reaches the target encoding quality score.
  • the sample coding quality score is determined by the first sample audio and the second sample audio.
  • In this embodiment of the present application, the encoding quality corresponding to the current encoding parameters is determined by comparing the original audio (the first sample audio) with the audio after audio encoding and decoding (the second sample audio), so as to adjust each parameter of the encoding rate prediction model based on the encoding quality; the training process of the encoding rate prediction model is then completed through several training cycles. When the sample encoding quality score of the sample audio reaches the target encoding quality score, it is determined that the training of the encoding rate prediction model is completed.
  • the target coding quality score may be 5 points.
  • the target coding quality score corresponding to the coding rate prediction model may also be set based on the actual application scenario requirements.
  • In some embodiments, the Perceptual Evaluation of Speech Quality (PESQ) test method can be used to calculate the difference value between the first sample audio and the second sample audio, which is then mapped to a Mean Opinion Score (MOS): the greater the difference between the first sample audio and the second sample audio, the worse the corresponding speech encoding quality and the lower the MOS value.
  • the coding rate prediction model can dynamically control the audio coding rate based on the sample audio feature parameters corresponding to the sample audio frames.
  • the audio coding rate predicted by the coding rate prediction model is more in line with the characteristics of the audio signal, which can reduce the audio coding rate as much as possible while the audio coding quality meets the target requirements, thereby reducing the storage space of audio data, and Reduce bandwidth consumption during the transmission of audio data.
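  • For reference, the PESQ comparison between the first sample audio and the decoded second sample audio can be computed with the third-party pesq package (pip install pesq); this is an illustrative use of that library, not the patent's test harness.

```python
from pesq import pesq  # ITU-T P.862 implementation from the `pesq` package

def sample_coding_quality_score(ref_audio, decoded_audio, sample_rate=16000):
    """Return a MOS-like score comparing original and encoded-then-decoded audio."""
    # 'wb' selects wideband PESQ; inputs are 1-D numpy arrays at 8 or 16 kHz.
    return pesq(sample_rate, ref_audio, decoded_audio, 'wb')
```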
  • In the process of audio encoding, the difference between consecutive audio frames is small; that is, the audio feature parameter difference between adjacent audio frames is small, so when predicting the audio encoding rate corresponding to the current audio frame, the audio encoding rate corresponding to the previous audio frame has a certain reference value for the current audio frame. Therefore, in some embodiments, the audio encoding rate corresponding to the previous audio frame can be fed back into the encoding rate prediction process of the next audio frame.
  • FIG. 5 shows a flowchart of an audio coding method according to an embodiment of the present application.
  • the embodiment of the present application takes a computer device as an example for illustrative description, and the method includes the following steps.
  • Step 501 Obtain sample audio feature parameters corresponding to each sample audio frame in the first sample audio.
  • For the implementation of step 501, reference may be made to step 401, which is not repeated in this embodiment of the present application.
  • the sample audio feature parameters may include at least one of fixed gain, adaptive gain, pitch period, pitch frequency, and line spectrum pair parameters.
  • Step 502 Obtain the i-1 th sample coding rate corresponding to the i-1 th sample audio frame.
  • wherein i is an increasing integer with a value range of 1 < i ≤ N, N is the number of sample audio frames, and N is an integer greater than 1.
  • In this embodiment of the present application, the sample encoding rate corresponding to the previous sample audio frame is fed back into the encoding rate prediction model, so that when predicting the sample encoding rate corresponding to the next sample audio frame, reference can be made to the sample encoding rate of the previous frame, which helps avoid large fluctuations in the sample encoding rate.
  • Step 503 Perform encoding rate prediction processing on the i-th sample audio feature parameter and the (i-1)-th sample encoding rate by using the encoding rate prediction model to obtain the i-th sample encoding rate corresponding to the i-th sample audio frame.
  • In some embodiments, the obtained (i-1)-th sample encoding rate and the i-th sample audio feature parameter may be input into the encoding rate prediction model together, so that a prediction basis is provided for the i-th sample encoding rate, which can further improve the prediction accuracy of the encoding rate.
  • Illustratively, after the encoding rate prediction model outputs the 10th sample encoding rate corresponding to the 10th sample audio frame, when predicting the 11th sample encoding rate corresponding to the 11th sample audio frame, the 10th sample encoding rate and the 11th sample audio feature parameter can be input into the encoding rate prediction model together to obtain the 11th sample encoding rate.
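  • The feedback of the previous frame's rate can be sketched as the loop below; model here is any callable taking (features, previous_bitrate), and the names and the initial rate are assumptions, not the patent's API.

```python
def predict_sample_bitrates(model, frame_features, initial_bitrate=16000.0):
    """Autoregressive prediction: frame i sees the rate predicted for frame i-1."""
    bitrates, prev = [], initial_bitrate
    for feats in frame_features:
        prev = model(feats, prev)   # i-th rate conditioned on the (i-1)-th rate
        bitrates.append(prev)
    return bitrates
```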
  • Step 504 Perform audio coding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the coding result corresponding to each frame of the sample audio frame.
  • Step 505 Perform audio decoding on the sample audio data to obtain a second sample audio corresponding to the sample audio data.
  • For the implementation of step 504 and step 505, reference may be made to the foregoing embodiments, which will not be described in detail in this embodiment of the present application.
  • Step 506 Determine a sample coding quality score corresponding to the first sample audio based on the first sample audio and the second sample audio.
  • the MOS value is determined as a sample encoding quality score corresponding to the first sample audio.
  • the value range of the MOS value may be from 0 to 5, wherein the higher the MOS score, the better the audio coding quality.
  • Step 507 Train a coding rate prediction model based on the sample coding quality score and the target coding quality score.
  • The target encoding quality score indicates the expected target of audio encoding and is set by the developer; different target encoding quality scores can be set based on the application scenario of the encoding rate prediction model. Illustratively, if the encoding rate prediction model is applied to voice call scenarios, the target encoding quality score can be set to 4; if the encoding rate prediction model is applied to audio storage scenarios, the target encoding quality score can be set to 5.
  • different coding rate prediction models can also be trained for different target coding quality scores, so that in the actual application process, the corresponding coding rate prediction model can be selected based on the requirements of the target coding quality score in the actual application scenario .
  • the difference between the current encoding result and the expected target is determined by comparing the sample encoding quality score and the target encoding quality score, and then the encoding rate prediction model is trained based on the audio difference, thereby updating the encoding rate Predict individual parameters in the model.
  • the selection of the coding rate should also be used as one of the indicators for evaluating the coding quality.
  • Illustratively, if encoding rate A and encoding rate B can achieve the same encoding quality but encoding rate A is smaller than encoding rate B, then, since a larger encoding rate consumes more storage space and traffic bandwidth, the smaller encoding rate A should be selected from encoding rate A and encoding rate B.
  • the coding rate is also used as one of the loss parameters of the coding rate prediction model.
  • the process of training the coding rate prediction model may further include the following steps.
  • Since a corresponding sample encoding rate is predicted for each sample audio frame, the sample encoding rates corresponding to the sample audio frames can be averaged to obtain the average encoding rate, which is then determined as one of the parameters for evaluating the audio encoding quality.
  • In this embodiment of the present application, the coding loss corresponding to the first sample audio is jointly evaluated by combining the two parameter dimensions of encoding rate and encoding quality score; that is, the first coding loss corresponding to the first sample audio is calculated based on the average encoding rate, the sample encoding quality score, and the target encoding quality score.
  • developers can adjust the weights of the two parameter dimensions by themselves based on the needs of the application scenario.
  • Illustratively, for scenarios sensitive to bandwidth (such as voice calls), a larger weight can be set for the encoding rate; for the audio storage scenario, a larger weight can be set for the encoding quality score.
  • the process of constructing the first encoding loss may further include the following steps.
  • In some embodiments, a coding quality term is determined by the sample encoding quality score and the target encoding quality score, the loss weights corresponding to the average encoding rate and to the coding quality term are obtained respectively, and the first coding loss is then calculated based on the loss weight corresponding to each parameter.
  • the first loss weight and the second loss weight are set by the developer. Different first loss weights and second loss weights may be set respectively based on different application scenarios of the coding rate prediction model, so that the coding rate prediction model obtained by training is more suitable for the requirements of the application scenario.
  • different coding rate prediction models can also be trained for combinations of different loss weights, and then in the actual application process, corresponding coding rate prediction models can be selected according to the requirements of different application scenarios.
  • In some embodiments, the formula for calculating the first coding loss can be expressed as follows:
  • loss = a * average(bitrate) + (1 - a) * power(MOS_SET - mos, 2)
  • where a represents the weighting coefficient (i.e., the loss weight) with a value of 0 to 1; average(·) represents the averaging function; bitrate represents the encoding rate; power(·) represents the power function; MOS_SET represents the preset target value of the objective voice quality MOS score (i.e., the target encoding quality score); and mos represents the sample encoding quality score.
  • The average encoding rate, the first loss weight, the sample encoding quality score, the target encoding quality score, and the second loss weight are substituted into the above formula, and the first coding loss corresponding to the first sample audio can be calculated.
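  • A minimal sketch of the loss above, mirroring the symbol descriptions (a is the loss weight, MOS_SET the target encoding quality score); it is a direct transcription for illustration, not a verified training implementation.

```python
import numpy as np

def first_coding_loss(frame_bitrates, mos, mos_set=4.5, a=0.5):
    """loss = a * average(bitrate) + (1 - a) * (MOS_SET - mos) ** 2"""
    return a * float(np.mean(frame_bitrates)) + (1 - a) * (mos_set - mos) ** 2
```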
  • In some embodiments, the cross-entropy criterion is used in the process of training the encoding rate prediction model; that is, a coding loss is preset, and only when the first coding loss is infinitely close to the preset coding loss can it be determined that the training of the encoding rate prediction model is completed.
  • In this embodiment of the present application, the encoding rate prediction model is trained with the goals of a small encoding rate and good encoding quality, so that during application the encoding rate prediction model can control the speech encoding rate such that the encoding rate is as small as possible while the audio encoding quality is as good as possible.
  • In some scenarios, the audio data after audio encoding needs to be transmitted to other terminals through the network. For example, in a network call scenario, the encoded voice data needs to be transmitted to other clients, and whether the receiving end can obtain a good audio signal depends not only on the encoding rate but also on the network environment state during network transmission. Therefore, in order to enable the receiving end to obtain a high-quality audio signal in this specific scenario, the current network state parameters need to be considered in the process of predicting the audio encoding rate; correspondingly, in the process of model training, the network state parameters are also required to participate in the training.
  • step 402 may be replaced by step 601 and step 602 .
  • Step 601 Obtain sample network state parameters of the first sample audio.
  • the network state parameters may also be added to the training samples of the training coding rate prediction model.
  • the sample network state parameter may be a packet loss rate, a network transmission rate, or the like.
  • desired sample network state parameters can be simulated randomly.
  • different sample network state parameters may be generated for different sample audios, or corresponding sample network state parameters may be generated for different sample audio frames, or corresponding sample network state parameters may be generated every preset time period.
  • the sample network state parameter and the sample audio feature parameter corresponding to the sample audio frame can be jointly input into the coding rate prediction model for coding rate prediction.
  • Step 602 Perform encoding rate prediction processing on the sample network state parameters and the sample audio feature parameters by using the encoding rate prediction model to obtain the sample encoding rate of the sample audio frame.
  • In the process of model training, in addition to obtaining the sample audio feature parameter corresponding to the sample audio frame, it is also necessary to obtain the sample network state parameter used for this prediction; the sample network state parameters and the sample audio feature parameters are jointly input into the encoding rate prediction model to obtain the sample encoding rate output by the encoding rate prediction model.
  • In some embodiments, the sample encoding rate corresponding to the previous sample audio frame can also be fed back into the encoding rate prediction model to provide a prediction reference for the sample encoding rate corresponding to the next sample audio frame.
  • Illustratively, the sample network state parameter, the (i-1)-th sample encoding rate (the encoding rate corresponding to the (i-1)-th sample audio frame), and the i-th sample audio feature parameter may be input into the encoding rate prediction model, where the sample network state parameter provides the current network state reference and the (i-1)-th sample encoding rate provides the encoding rate prediction reference, and the i-th sample encoding rate corresponding to the i-th sample audio frame is then generated.
  • In this way, the encoding rate prediction model can take into account the influence of the network state on the encoding rate when predicting the encoding rate, which further improves the audio encoding quality in network transmission scenarios (for example, the call scenario).
  • FIG. 7 shows a schematic diagram of a complete model training process shown in an embodiment of the present application.
  • As shown in FIG. 7, the first sample voice 701 is divided into several sample audio frames, and the sample audio feature parameters 704 corresponding to each sample audio frame and the network packet loss flag 703 are input into the encoding rate prediction model 702 to obtain the current frame encoding rate 705 output by the encoding rate prediction model 702. The current frame encoding rate 705 is not only used for speech encoding but is also returned to the encoding rate prediction model 702 to predict the encoding rate of the next frame. Audio encoding is performed based on the encoding rate corresponding to each sample audio frame to obtain the audio encoding result, which is then audio-decoded to generate the second sample voice 706; a PESQ test is performed on the first sample voice 701 and the second sample voice 706, and the encoding rate prediction model 702 is trained based on the test result.
  • Illustratively, the encoding rate prediction model 702 includes fully connected layers (DENSE) and gated recurrent units (GRU), where the number of neurons in each DENSE and GRU layer is 256 and the number of neurons in DENSE3 is 1. The network packet loss flag 703 is input into DENSE1 to extract network state features; the sample audio feature parameters 704 are input into DENSE2 to extract audio features, which then pass through GRU2 and GRU3 for feature fusion and are input into DENSE3. DENSE3 outputs the probability of each preset encoding rate, and the preset encoding rate with the highest probability is determined as the current frame encoding rate corresponding to the current sample audio frame.
  • the coding rate prediction model 702 may also adopt other network structures, for example, the coding rate prediction model 702 only includes a fully connected layer.
  • the coding rate of the previous frame is returned to the network model as the basis for predicting the coding rate of the next frame.
  • the audio coding rate output from the coding rate prediction model of each frame is returned to the model to provide a reference for the coding rate prediction of the next frame.
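  • The DENSE + GRU structure described for FIG. 7 can be sketched in PyTorch as follows. Layer widths, the number of preset encoding rates, and the activation choices are assumptions for illustration; the patent only fixes the DENSE1/DENSE2/GRU2/GRU3/DENSE3 topology.

```python
import torch
import torch.nn as nn

class RatePredictionModel(nn.Module):
    """Hedged sketch of the FIG. 7 encoding rate prediction network."""

    def __init__(self, n_audio_feats=4, n_preset_rates=8, hidden=256):
        super().__init__()
        self.dense1 = nn.Linear(1, hidden)               # packet-loss flag branch
        self.dense2 = nn.Linear(n_audio_feats, hidden)   # audio-feature branch
        self.gru2 = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.gru3 = nn.GRU(hidden, hidden, batch_first=True)
        self.dense3 = nn.Linear(hidden, n_preset_rates)  # scores preset rates

    def forward(self, loss_flag, audio_feats):
        # loss_flag: (batch, frames, 1); audio_feats: (batch, frames, n_audio_feats)
        net = torch.relu(self.dense1(loss_flag))
        aud = torch.relu(self.dense2(audio_feats))
        fused, _ = self.gru2(torch.cat([net, aud], dim=-1))
        fused, _ = self.gru3(fused)
        # Per-frame probability over preset encoding rates; highest wins.
        return torch.softmax(self.dense3(fused), dim=-1)
```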
  • step 302 can be replaced by step 801 and step 802 .
  • Step 801 Obtain the j-1 th audio coding bit rate corresponding to the j-1 th audio frame.
  • wherein j is an increasing integer with a value range of 1 < j ≤ M, M is the number of audio frames, and M is an integer greater than 1.
  • After the encoding rate prediction model predicts the (j-1)-th audio encoding rate corresponding to the (j-1)-th audio frame, the (j-1)-th audio encoding rate is not only applied to the subsequent audio encoding; it can also be re-input into the encoding rate prediction model to provide a reference for predicting the j-th audio encoding rate corresponding to the j-th audio frame.
  • Step 802 Perform encoding rate prediction processing on the (j-1)-th audio encoding rate and the j-th audio feature parameter corresponding to the j-th audio frame by using the encoding rate prediction model to obtain the j-th audio encoding rate corresponding to the j-th audio frame.
  • In the process of predicting the j-th audio encoding rate, the (j-1)-th audio encoding rate corresponding to the (j-1)-th audio frame may be obtained, and the (j-1)-th audio encoding rate and the j-th audio feature parameter are jointly input into the encoding rate prediction model; the (j-1)-th audio encoding rate provides the prediction basis for the j-th audio encoding rate, and the j-th audio encoding rate output by the encoding rate prediction model is thereby obtained.
  • In this way, the audio coding rate of the previous frame serves as a reference for predicting the audio coding rate of the next frame, which avoids large fluctuations of the audio coding rate during coding rate prediction and thereby improves the prediction accuracy of the audio coding rate, as the sketch below illustrates.
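The following is a minimal sketch of a frame-by-frame prediction loop in which the previous frame's chosen rate is fed back as an extra model input. The preset rate set, the starting rate for the first frame, and the `predict_rate` callable (a stand-in for the trained coding rate prediction model) are all assumptions for illustration, not values from this application.

```python
import numpy as np

PRESET_RATES = [6000, 8000, 12000, 16000, 24000, 32000]  # bps; assumed set

def choose_frame_rates(frame_features, predict_rate):
    """predict_rate(prev_rate, feats) -> probabilities over PRESET_RATES;
    a hypothetical wrapper around the trained prediction model."""
    rates = []
    prev_rate = PRESET_RATES[0]          # assumed initial value for frame 1
    for feats in frame_features:
        probs = predict_rate(prev_rate, feats)
        prev_rate = PRESET_RATES[int(np.argmax(probs))]
        rates.append(prev_rate)          # fed back on the next iteration
    return rates
```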
  • In certain application scenarios that require online transmission of audio data, such as voice call and live streaming scenarios, the network state affects the voice quality received by the receiving end. Therefore, in such scenarios, to avoid the influence of the network state on voice quality, the influence of the current network state needs to be considered when generating the audio coding rate.
  • On the basis of FIG. 3, as shown in FIG. 9, step 302 can be replaced by step 901 and step 902.
  • Step 901: Obtain the current network state parameter fed back by the receiving end, where the receiving end is configured to receive the target audio data transmitted over the network.
  • In one possible application scenario, the target audio data that has undergone audio encoding needs to be transmitted over the network to other terminals (that is, the receiving end), and the network state also has a certain influence on the audio encoding process: if the network state is poor, a smaller coding rate is used; if the network state is good, a larger coding rate is used. Therefore, for audio data intended for network transmission, the current network state parameter fed back by the receiving end also needs to be considered in the process of predicting the coding rate.
  • The network state parameter can be returned by the receiving end. Taking the packet loss rate as an example, the receiving end counts the network packet loss rate within a certain period and returns it to the sending end; the packet loss rate can then be used as the network state parameter and input into the coding rate prediction model, so that the current network state can be considered when predicting the audio coding rate. A sketch of such receiver-side counting follows.
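A minimal sketch of how a receiving end might compute the packet loss rate over one statistics window from packet sequence numbers; the RTP-style per-packet numbering is an assumption for illustration.

```python
def packet_loss_rate(received_seq_nums):
    """Loss rate over one statistics window, assuming sequence numbers
    increase by one per sent packet within the window."""
    if not received_seq_nums:
        return 1.0
    expected = max(received_seq_nums) - min(received_seq_nums) + 1
    return 1.0 - len(set(received_seq_nums)) / expected

# e.g. packets 100..109 sent, three of them lost in transit
print(packet_loss_rate([100, 101, 103, 104, 107, 108, 109]))  # -> 0.3
```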
  • The sending terminal may acquire the network state parameter from the receiving end every set time, or the receiving end may feed back the network state parameter to the sending terminal every predetermined time; the set time may be, for example, 30 minutes (min).
  • Step 902: Perform coding rate prediction processing on the current network state parameter and the audio feature parameters through the coding rate prediction model, to obtain the audio coding rate of the audio frame.
  • The acquired current network state parameter and the audio feature parameters corresponding to the audio frame may be input into the coding rate prediction model, so that the influence of the current network state is taken into account when predicting the audio coding rate, and the audio coding rate output by the coding rate prediction model is obtained.
  • After the sending end encodes the audio based on the audio coding rate and transmits the encoding result to the receiving end over the network, since the audio coding rate used in the audio encoding process has already taken the current network state into account, the receiving end is ensured to receive a good audio signal.
  • To further improve prediction accuracy in such scenarios, the audio coding rate corresponding to the previous audio frame can also be fed back into the coding rate prediction model to provide a prediction reference for the audio coding rate of the next audio frame.
  • In some embodiments, the network state parameter, the (j-1)-th audio coding rate (that is, the audio coding rate corresponding to the (j-1)-th audio frame), and the j-th audio feature parameter may be input into the coding rate prediction model together; the network state parameter provides a network state reference for the j-th audio coding rate, and the (j-1)-th audio coding rate provides a coding rate prediction reference, so that the coding rate prediction model outputs the j-th audio coding rate corresponding to the j-th audio frame.
  • In this way, the coding rate prediction model can take into account the influence of the network state on the coding rate when predicting the coding rate, further improving the audio coding quality in specific scenarios (for example, a voice call scenario).
  • FIG. 10 shows a schematic diagram of an audio encoding process shown in an embodiment of the present application.
  • The network packet loss flag 1001 (that is, the network state parameter) and the audio feature parameter 1002 can be input into the coding rate prediction model 1003, which outputs the current frame coding rate 1004; for example, the current frame coding rate 1004 can also be fed back into the coding rate prediction model to provide a reference basis for predicting the coding rate of the next frame. Audio encoding is then performed based on the audio coding rate corresponding to each audio frame, and the audio encoded data corresponding to the original audio is generated based on the encoding result of each frame.
  • FIG. 11 shows a structural block diagram of the audio coding apparatus shown in the embodiment of the present application.
  • the audio encoding apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of the two.
  • the audio encoding apparatus may include:
  • the first obtaining module 1101 is configured to obtain sample audio feature parameters corresponding to each sample audio frame in the first sample audio; the first processing module 1102 is configured to perform coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain the sample coding rate of the sample audio frame; the first encoding module 1103 is configured to perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame;
  • the audio decoding module 1104 is configured to perform audio decoding on the sample audio data, to obtain the second sample audio corresponding to the sample audio data;
  • the training module 1105 is configured to train the coding rate prediction model based on the first sample audio and the second sample audio, and end the training when the sample coding quality score reaches the target coding quality score, where the sample coding quality score is determined through the first sample audio and the second sample audio.
  • the apparatus further includes: a second obtaining module, configured to obtain a sample network state parameter of the first sample audio; the first processing module 1102 includes: a first processing unit, configured to perform coding rate prediction processing on the sample network state parameter and the sample audio feature parameters through the coding rate prediction model, to obtain the sample coding rate of the sample audio frame.
  • the apparatus further includes: a third obtaining module, configured to obtain the (i-1)-th sample coding rate corresponding to the (i-1)-th sample audio frame;
  • the first processing module 1102 includes: a second processing unit, configured to perform coding rate prediction processing on the i-th sample audio feature parameter and the (i-1)-th sample coding rate through the coding rate prediction model, to obtain the i-th sample coding rate corresponding to the i-th sample audio frame, where i is an increasing integer in the range 1 < i ≤ N, N is the number of the sample audio frames, and N is an integer greater than 1.
  • the training module 1105 includes: a determining unit, configured to determine the sample coding quality score corresponding to the first sample audio based on the first sample audio and the second sample audio; a training unit, configured to train the coding rate prediction model based on the sample coding quality score and the target coding quality score.
  • the training unit is further configured to: determine an average coding rate corresponding to the first sample audio, where the average coding rate is determined through the sample coding rates corresponding to the sample audio frames; construct a first coding loss corresponding to the first sample audio based on the average coding rate, the sample coding quality score, and the target coding quality score; and train the coding rate prediction model based on the first coding loss and a preset coding loss.
  • the training unit is further configured to: obtain a first loss weight corresponding to the average coding rate and a second loss weight corresponding to a coding quality score, the coding quality score being determined through the sample coding quality score and the target coding quality score; and construct the first coding loss corresponding to the first sample audio based on the average coding rate, the first loss weight, the coding quality score, and the second loss weight.
  • the type of the sample audio feature parameter includes at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameter.
  • In summary, in the process of training the coding rate prediction model, the sample audio feature parameters corresponding to each sample audio frame in the sample audio are analyzed, so that the sample coding rate corresponding to each sample audio frame is predicted based on the sample audio feature parameters, and audio encoding is then performed on the sample audio frames based on the per-frame sample coding rates. After the audio encoding result is audio-decoded, the coding rate prediction model is trained by comparing the decoded audio with the original audio, so that in actual application the coding rate prediction model can dynamically adjust the audio coding rate based on the audio feature parameters; this reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
  • FIG. 12 shows a structural block diagram of the audio coding apparatus shown in the embodiment of the present application.
  • the audio encoding apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of the two.
  • the audio encoding apparatus may include:
  • the fourth acquisition module 1201 is configured to acquire audio feature parameters corresponding to each audio frame in the original audio
  • the second processing module 1202 is configured to perform coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain the audio coding rate of the audio frame, where the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when the target coding quality score is reached;
  • the second encoding module 1203 is configured to perform audio encoding on the audio frame based on the audio encoding code rate, and generate target audio data based on the encoding result corresponding to each audio frame.
  • the target audio data is used for network transmission;
  • the device also includes:
  • a fifth obtaining module, configured to obtain the current network state parameter fed back by the receiving end, where the receiving end is configured to receive the target audio data transmitted over the network;
  • the second processing module 1202 includes: a third processing unit, configured to perform coding rate prediction processing on the current network state parameter and the audio feature parameters through the coding rate prediction model, to obtain the audio coding rate of the audio frame.
  • the apparatus further includes:
  • a sixth obtaining module, configured to obtain the (j-1)-th audio coding rate corresponding to the (j-1)-th audio frame;
  • the second processing module 1202 includes: a fourth processing unit, configured to perform coding rate prediction processing on the (j-1)-th audio coding rate and the j-th audio feature parameter corresponding to the j-th audio frame through the coding rate prediction model, to obtain the j-th audio coding rate corresponding to the j-th audio frame, where j is an increasing integer in the range 1 < j ≤ M, M is the number of the audio frames, and M is an integer greater than 1.
  • the type of the audio feature parameter includes at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameter.
  • In summary, by analyzing the audio feature parameters corresponding to each audio frame in the original audio, an audio coding rate matching the audio feature parameters can be determined for each audio frame, thereby improving the coding quality of the entire audio. Compared with the fixed coding rate used in the related art, this embodiment performs audio encoding with a dynamic coding rate, which reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
  • Embodiments of the present application also provide an audio decoding apparatus, which can be implemented as all or a part of a computer device through software, hardware, or a combination of the two.
  • the audio decoding apparatus may include:
  • the fifth obtaining module is configured to obtain the encoded target audio data; the decoding module is configured to perform audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
  • FIG. 13 shows a structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be used to implement the audio encoding method or the audio decoding method provided in the above embodiments. Specifically:
  • the computer device 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a Random Access Memory (RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 that connects the system memory 1304 and the central processing unit 1301.
  • the computer device 1300 also includes a basic Input/Output system (I/O system) 1306 that helps transfer information between the components within the computer device, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
  • the basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse or keyboard, for user input of information. Both the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input/output controller 1310 connected to the system bus 1305.
  • the basic input/output system 1306 may also include the input/output controller 1310 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1310 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305 .
  • the mass storage device 1307 and its associated computer-readable storage media provide non-volatile storage for the computer device 1300 . That is, the mass storage device 1307 may include a computer-readable storage medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • the computer-readable storage medium can include both computer storage medium and communication medium.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable storage instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • the system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
  • the memory stores one or more programs configured to be executed by one or more central processing units 1301; the one or more programs contain instructions for implementing the above method embodiments, and the central processing unit 1301 executes the one or more programs to implement the methods provided by the above respective method embodiments.
  • the computer device 1300 may also operate by connecting to a remote server on a network through a network such as the Internet. That is, the computer device 1300 can be connected to the network 1312 through the network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 can be used to connect to other types of networks or remote server systems (not shown).
  • the memory further includes one or more programs, the one or more programs being stored in the memory and containing the steps performed by the computer device in the methods provided in the embodiments of the present application.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the audio encoding method or the audio decoding method according to the above embodiments.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio encoding method or the audio decoding method provided in the foregoing optional implementation manner.

Abstract

An audio encoding method, an audio decoding method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product. The method includes: obtaining sample audio feature parameters corresponding to each sample audio frame in a first sample audio (401); performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain a sample coding rate of the sample audio frame (402); performing audio encoding on the sample audio frame based on the sample coding rate, and generating sample audio data based on the encoding result corresponding to each sample audio frame (403); performing audio decoding on the sample audio data, to obtain a second sample audio corresponding to the sample audio data (404); and training the coding rate prediction model based on the first sample audio and the second sample audio, and ending the training when a sample coding quality score reaches a target coding quality score (405); where the sample coding quality score is determined through the first sample audio and the second sample audio.

Description

Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product
CROSS-REFERENCE TO RELATED APPLICATIONS
The embodiments of this application are based on the Chinese patent application No. 202110380547.9 filed on April 9, 2021, and claim priority to that Chinese patent application, the entire contents of which are incorporated into the embodiments of this application by reference.
TECHNICAL FIELD
This application relates to the field of audio and video technologies, and in particular, to an audio encoding method, an audio decoding method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product.
BACKGROUND
Speech coding technology compresses the collected original lossless audio signal by analyzing and compressing time-domain and frequency-domain redundancy through an audio model, thereby reducing speech transmission bandwidth and storage space while maintaining good audio quality. The input parameters of a typical speech encoder include sampling rate, number of channels, and coding rate, where a larger coding rate means the coded bitstream occupies more bandwidth and the coded file occupies more storage space, and the speech coding quality is higher.
In the related art, the coding rate is generally set based on empirical experimental values. For example, in a laboratory environment, the Perceptual Evaluation of Speech Quality (PESQ) method is used to measure the PESQ values corresponding to different encoding parameters, which are then matched against the speech quality target requirement to determine the required speech coding rate. That speech coding rate is used in the actual service, and throughout the speech coding and compression process the coding rate is usually fixed.
Obviously, with the fixed-coding-rate speech coding method in the related art, since the speech signal itself is a time-varying signal, the compression process inside the speech encoder differs considerably for different speech signals at different times, so that under the same coding rate the coding quality of different speech signals varies greatly, and the quality of speech coding cannot be guaranteed.
SUMMARY
The embodiments of this application provide an audio encoding method, an audio decoding method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product, which can improve the quality of audio encoding. The technical solution includes the following aspects.
An embodiment of this application provides an audio encoding method, the method including:
obtaining sample audio feature parameters corresponding to each sample audio frame in a first sample audio;
performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain a sample coding rate of the sample audio frame;
performing audio encoding on the sample audio frame based on the sample coding rate, and generating sample audio data based on the encoding result corresponding to each sample audio frame;
performing audio decoding on the sample audio data, to obtain a second sample audio corresponding to the sample audio data;
training the coding rate prediction model based on the first sample audio and the second sample audio, and ending the training when a sample coding quality score reaches a target coding quality score;
where the sample coding quality score is determined through the first sample audio and the second sample audio.
An embodiment of this application provides an audio encoding method, the method including:
obtaining audio feature parameters corresponding to each audio frame in an original audio;
performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain an audio coding rate of the audio frame, where the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when a target coding quality score is reached;
performing audio encoding on the audio frame based on the audio coding rate, and generating target audio data based on the encoding result corresponding to each audio frame.
An embodiment of this application provides an audio decoding method, the method including:
obtaining the encoded target audio data;
performing audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
An embodiment of this application provides an audio encoding apparatus, the apparatus including:
a first obtaining module, configured to obtain sample audio feature parameters corresponding to each sample audio frame in a first sample audio;
a first processing module, configured to perform coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain a sample coding rate of the sample audio frame;
a first encoding module, configured to perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame;
an audio decoding module, configured to perform audio decoding on the sample audio data, to obtain a second sample audio corresponding to the sample audio data;
a training module, configured to train the coding rate prediction model based on the first sample audio and the second sample audio, and end the training when a sample coding quality score reaches a target coding quality score; where the sample coding quality score is determined through the first sample audio and the second sample audio.
An embodiment of this application provides an audio encoding apparatus, the apparatus including:
a fourth obtaining module, configured to obtain audio feature parameters corresponding to each audio frame in an original audio;
a second processing module, configured to perform coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain an audio coding rate of the audio frame, where the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when a target coding quality score is reached;
a second encoding module, configured to perform audio encoding on the audio frame based on the audio coding rate, and generate target audio data based on the encoding result corresponding to each audio frame.
An embodiment of this application provides an audio decoding apparatus, the apparatus including:
a fifth obtaining module, configured to obtain the encoded target audio data;
a decoding module, configured to perform audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
An embodiment of this application provides a computer device, the computer device including a processor and a memory, the memory storing at least one program, the at least one program being loaded and executed by the processor to implement the audio encoding method or the audio decoding method described in the above aspects.
An embodiment of this application provides a computer-readable storage medium, the storage medium storing at least one program, the at least one program being loaded and executed by a processor to implement the audio encoding method or the audio decoding method described in the above aspects.
An embodiment of this application provides a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the audio encoding method or the audio decoding method provided in the above optional implementations.
The technical solutions provided in the embodiments of this application may include the following beneficial effects:
In an audio encoding scenario, by analyzing the audio feature parameters corresponding to each audio frame in the original audio, the audio coding rate corresponding to each audio frame is dynamically adjusted based on the audio feature parameters, and an audio coding rate matching the audio feature parameters can be determined for each audio frame, thereby improving the coding quality of the entire audio. Compared with the fixed coding rate used in the related art, performing audio encoding with a dynamic coding rate reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, which in turn reduces the storage space occupied by the audio data and the bandwidth consumed in transmitting the audio data.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings herein are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application, and together with the specification serve to explain the principles of this application.
FIG. 1 shows a schematic diagram of an audio encoding process in the related art;
FIG. 2 shows a schematic diagram of an implementation environment provided by an embodiment of this application;
FIG. 3 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 4 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 5 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 6 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 7 shows a schematic diagram of a complete model training process shown in an embodiment of this application;
FIG. 8 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 9 shows a flowchart of an audio encoding method shown in an embodiment of this application;
FIG. 10 shows a schematic diagram of an audio encoding process shown in an embodiment of this application;
FIG. 11 shows a structural block diagram of an audio encoding apparatus shown in an embodiment of this application;
FIG. 12 shows a structural block diagram of an audio encoding apparatus shown in an embodiment of this application;
FIG. 13 shows a structural block diagram of a computer device provided by an embodiment of this application.
DETAILED DESCRIPTION
Exemplary embodiments are described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
For ease of understanding, the terms involved in the embodiments of this application are explained below.
1) Audio coding: audio coding compresses the originally collected lossless audio signal by analyzing and compressing time-domain and frequency-domain redundancy through an audio model, thereby reducing speech transmission bandwidth and storage space while maintaining good audio quality. The input parameters of an audio encoder include sampling rate, number of channels, coding rate, and so on; the larger the coding rate used in audio encoding, the better the speech coding quality, but the more bandwidth the coded bitstream occupies and the more storage space the encoded audio file occupies.
2) Artificial Intelligence (AI): a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
It should be noted that the embodiments of this application mainly relate to the field of machine learning within the field of artificial intelligence.
Referring to FIG. 1, it shows a schematic diagram of an audio encoding process in the related art. Taking audio encoding in a voice call scenario as an example, fixed encoding parameters are often set for the audio encoder in advance based on empirical experimental values before the voice call is started. When the sending end 101 starts a voice call, the collected original speech 103 is speech-encoded and channel-encoded based on the encoding parameters 104 configured for the current voice call scenario, and the encoding result is transmitted to the receiving end 102 over the Internet; the receiving end 102 performs channel decoding and speech decoding on the encoding result to generate the corresponding sound signal 105. During the entire voice call, the encoding parameters (coding rate) are generally fixed and are only adjusted appropriately according to the packet loss state 106.
Obviously, when the audio signal is encoded at a fixed coding rate, since the speech signal itself is a time-varying signal, the compression process inside the speech encoder differs considerably for different speech signals at different times, so that under the same coding rate the coding quality of different speech signals varies greatly, and the quality of speech coding cannot be guaranteed.
To address the problems in the related art, the embodiments of this application provide a method for dynamically adjusting the audio coding rate based on audio feature parameters (that is, an audio encoding method and an audio decoding method). Referring to FIG. 2, it shows a schematic diagram of the implementation environment provided by an embodiment of this application. The implementation environment may include: a first terminal 210, a server 220, and a second terminal 230.
An application supporting network call technology is installed and run in the first terminal 210. The first terminal 210 may be an electronic device such as a smartphone, desktop computer, tablet computer, multimedia playback device, smartwatch, smart speaker, or laptop computer. The application may be a social program, live streaming program, shopping program, game program, video program, audio program, instant messaging program, or the like.
In some embodiments, a coding rate prediction model is stored in the first terminal 210. The coding rate prediction model can dynamically adjust the audio coding rate based on the audio feature parameters corresponding to the audio signal, perform audio encoding based on the predicted audio coding rate, and push the encoded audio data stream to the second terminal 230 through the server 220. For example, when the encoded audio data needs to be transmitted over a network, in order for the audio data to be transmitted to the receiving end (for example, the second terminal 230) with better quality, a network state parameter fed back by the receiving end may be added when predicting the coding rate. For example, apart from specific scenarios (such as audio/video call scenarios and live streaming scenarios) in which the encoded audio data needs to be transmitted to the receiving end over a network, in other possible application scenarios the encoded audio data does not need to be transmitted over a network and only needs to be stored locally or in another storage medium; correspondingly, the network state parameter does not need to be considered when predicting the audio coding rate.
It should be noted that the coding rate prediction model pre-stored in the first terminal 210 may be trained by another computer device (not shown in the figure) and pushed to the first terminal 210, so that in actual application the first terminal 210 can dynamically adjust the audio coding rate based on the coding rate prediction model. For example, the computer device may be a backend server corresponding to the application in the first terminal 210.
The first terminal 210 and the server 220 may be connected through a wireless network or a wired network.
The server 220 is used to provide backend services for the applications (such as applications capable of network calls) in the first terminal 210 or the second terminal 230. For example, the server 220 may be the backend server of the above applications. The server 220 may be one server or a server cluster composed of multiple servers, where multiple servers may form a blockchain with the servers as nodes on the blockchain, or a cloud computing service center. In the embodiments of this application, the server 220 may receive the audio data stream from the first terminal 210 and push the audio data stream to the indicated second terminal 230. For example, the server 220 may receive the network state parameter fed back by the second terminal 230 and feed it back to the first terminal 210, so that the first terminal 210 can adjust the audio coding rate based on the network state parameter.
The second terminal 230 and the server 220 may be connected through a wireless network or a wired network.
An application supporting network call technology is installed and run in the second terminal 230. The second terminal 230 may be an electronic device such as a smartphone, desktop computer, tablet computer, multimedia playback device, smartwatch, smart speaker, or laptop computer. The application may be a social program, live streaming application, shopping program, game program, video program, audio program, instant messaging program, or the like. In this embodiment, the second terminal 230 may receive the audio data stream sent by the first terminal 210, decode the audio data stream, and present the transmitted audio. For example, the second terminal 230 may feed back the network state parameter to the first terminal 210, so that the first terminal 210 can dynamically adjust the audio coding rate based on the network state parameter. For example, apart from specific scenarios (such as audio/video call scenarios and live streaming scenarios) in which the encoded audio data needs to be transmitted to the receiving end over a network, in other possible application scenarios the encoded audio data does not need to be transmitted over a network and only needs to be stored locally or in another storage medium; correspondingly, the network state parameter does not need to be considered when predicting the audio coding rate.
It should be noted that the audio in the embodiments of this application is not limited to call audio, and may also be recordings, live streaming audio, and the like. The above terminals may include various types of applications, for example, instant messaging applications, video playback applications, recording applications, live streaming applications, and so on.
In some embodiments, the above audio encoding method and audio decoding method are applicable to, without being limited to, scenarios such as cloud gaming, voice calls, and live video streaming.
Referring to FIG. 3, it shows a flowchart of an audio encoding method shown in an embodiment of this application. The embodiment of this application is described by taking as an example the method being applied to the first terminal 210 shown in FIG. 2. The method includes the following steps.
Step 301: Obtain audio feature parameters corresponding to each audio frame in the original audio.
The original audio may be speech collected by the terminal. Illustratively, the original audio may be a sound signal collected in a network voice call scenario or a video call scenario, a sound signal collected in a live streaming scenario, a sound signal collected in an online karaoke scenario, or a sound signal collected in a voice broadcasting scenario. For example, the original audio may also be audio obtained in a speech storage scenario; illustratively, the original audio may be speech, music, video, and so on. The embodiments of this application are not limited to any particular form of the original audio.
To make audio easier to store and transmit over long distances, the obtained original audio usually needs to be audio-encoded, to reduce the storage space of the audio or the traffic bandwidth consumed by long-distance transmission. In the related art, during audio encoding, the audio coding rate applicable to different application scenarios is generally obtained through prior measurement, and in actual application the obtained original audio is encoded at that audio coding rate; that is, a fixed coding rate is used for all audio in a given application scenario. Taking speech signals as an example, the speech signal itself is a time-varying signal; if a fixed coding rate is used to encode different speech signals, the compression quality inside the audio encoder obviously differs considerably for different speech signals at different times, and the speech coding quality may not be guaranteed.
In the embodiments of this application, considering the characteristics (variability) of the audio signal, in order to improve the audio coding quality, in one possible implementation the audio feature parameters corresponding to each audio frame in the same original audio are analyzed, so that the audio coding rate corresponding to each audio frame is predicted based on the audio feature parameters. The audio coding rate can thus be dynamically adjusted based on different audio feature parameters, so that every audio frame can meet the coding quality requirement, thereby improving the coding quality of the original audio.
For example, when dividing the original audio into audio frames, the division may be performed according to a set duration; illustratively, 20 ms constitutes one audio frame, as in the sketch below.
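A minimal sketch of this framing for mono PCM; the 16 kHz sampling rate is an example value for illustration, not a requirement of this application.

```python
import numpy as np

def split_into_frames(pcm, sample_rate=16000, frame_ms=20):
    """Split a 1-D PCM signal into fixed-duration frames (20 ms by default)."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    n_frames = len(pcm) // frame_len             # the ragged tail is dropped
    return pcm[:n_frames * frame_len].reshape(n_frames, frame_len)
```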
For example, the audio feature parameters may include fixed gain, adaptive gain, pitch period, line spectrum pair parameters, and so on; the embodiments of this application are not limited to fixed gain, adaptive gain, pitch period, and line spectrum pair parameters.
The pitch period reflects the time interval between two adjacent openings and closings of the glottis, or the frequency of opening and closing. Illustratively, when a person vocalizes, the vibration of the vocal cords produces voiced sounds (unvoiced sounds are produced by air friction). The production of a voiced sound proceeds as follows: airflow from the lungs impinges on the glottis, causing it to open and close repeatedly and forming a series of quasi-periodic airflow pulses, which, after resonance in the vocal tract (including the oral and nasal cavities) and radiation from the lips and teeth, finally form the speech signal. The voiced waveform therefore exhibits a certain quasi-periodicity, and the pitch period is defined with respect to this quasi-periodicity. For example, when extracting the pitch period corresponding to an audio signal, the autocorrelation method, the cepstrum method, the average magnitude difference function method, the linear prediction method, the wavelet-autocorrelation function method, the spectral subtraction-autocorrelation function method, and the like may be used. Illustratively, voiced sounds generally require a higher coding rate (a coding rate greater than a voiced rate threshold), while unvoiced sounds require a lower coding rate; therefore, different speech signals require different coding rates to reach a preset coding quality. Correspondingly, in the process of training the coding rate prediction model, the pitch period corresponding to an audio frame is extracted, and the coding rate that the audio frame corresponding to that pitch period needs to adopt is further analyzed. The autocorrelation variant is sketched below.
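Of the extraction methods listed above, the autocorrelation method is the simplest to sketch. The 50-400 Hz search range is an assumption covering typical speech pitch, not a parameter of this application.

```python
import numpy as np

def pitch_period_autocorr(frame, sample_rate=16000, f_min=50, f_max=400):
    """Estimate the pitch period (in samples) of one frame by locating the
    autocorrelation peak inside a plausible speech pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = sample_rate // f_max                 # shortest period considered
    lag_max = min(sample_rate // f_min, len(ac) - 1)
    return lag_min + int(np.argmax(ac[lag_min:lag_max]))
```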
Since the original audio collected by devices such as mobile phones often has loudness that is sometimes too low and sometimes too high, the sound fluctuates in volume and affects the listener's subjective experience. Therefore, during audio encoding, the input sound needs to be adjusted positively or negatively so that the output sound suits the subjective perception of the human ear. This process is the gain adjustment process for the original audio. Speech signals at different times have different adaptive gains due to differences in loudness, and applying gain to an audio frame likewise amplifies the noise signal in the audio signal, while the essence of audio encoding is to reduce redundancy (that is, the noise signal) in the audio. Obviously, different gains affect the coding rate of the audio signal; therefore, the coding rate corresponding to each audio frame needs to be determined based on the gain corresponding to that frame.
The line spectrum pair parameters reflect the spectral characteristics of the audio signal, and their errors are relatively independent; that is, a deviation in the line spectrum pair parameter at a certain frequency point only affects the speech spectrum near that frequency and has little influence on the speech spectrum at other frequencies. This facilitates quantization and interpolation of the line spectrum pair parameters, achieving coded audio of the same quality at a relatively low coding rate. It can be seen that the line spectrum pair parameters corresponding to the audio signal help determine the coding rate.
For example, a corresponding audio feature extraction model may be configured; the original audio is input into the audio feature extraction model, audio features are extracted for each audio frame contained in the original audio, and the audio feature parameters corresponding to each audio frame are output.
For example, since the audio feature parameters contain many feature dimensions, to improve the efficiency of audio feature extraction, the feature parameters in the N audio feature dimensions that have a large influence on the encoding result (an influence greater than an influence threshold) may be selected, and correspondingly only the audio feature parameters in those N audio feature dimensions need to be extracted, where N is a positive integer. For example, different audio feature extraction dimensions may be set for different audio types.
Step 302: Perform coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain the audio coding rate of the audio frame.
The coding rate prediction model is trained with the target coding quality score as the objective. Therefore, when the coding rate prediction model is applied to coding rate prediction, the audio coding rate corresponding to each audio frame when the audio coding quality of the original audio reaches the target coding quality score can be predicted based on the audio feature parameters corresponding to each audio frame, where different audio feature parameters correspond to different audio coding rates.
A coding rate prediction model is provided in the terminal, and it can dynamically adjust the audio coding rate corresponding to each audio frame based on the audio feature parameters corresponding to that frame. The audio feature parameters corresponding to each audio frame are input into the coding rate prediction model, so that the audio coding rate corresponding to that frame can be obtained and the audio frame can subsequently be audio-encoded based on that audio coding rate.
Illustratively, for the training process of the coding rate prediction model, reference may be made to the embodiments below, which is not repeated here.
Step 303: Perform audio encoding on the audio frame based on the audio coding rate, and generate target audio data based on the encoding result corresponding to each audio frame.
In some embodiments, after the audio coding rates corresponding to different audio frames are obtained, each audio frame can be encoded based on its audio coding rate, and the encoding results corresponding to the audio frames are then combined to generate the target audio data corresponding to the original audio.
Illustratively, if the original audio is divided into audio frame 1 to audio frame 50, the audio frames correspond to audio feature parameter 1 to audio feature parameter 50. The audio feature parameters corresponding to each audio frame are input into the coding rate prediction model to obtain the coding rate corresponding to each audio frame (that is, coding rate 1 to coding rate 50); each audio frame is then audio-encoded based on its audio coding rate to obtain the audio encoding result corresponding to each frame (that is, audio encoding result 1 to audio encoding result 50), and audio encoding results 1 to 50 are combined to obtain the target audio data corresponding to the original audio. A sketch of this per-frame pipeline follows.
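A minimal sketch of the per-frame pipeline in the example above; `extract_features`, `predict_rate`, and `encode_frame` are hypothetical stand-ins for the feature extraction model, the coding rate prediction model, and any audio encoder whose bitrate can be set per frame.

```python
def encode_original_audio(frames, extract_features, predict_rate, encode_frame):
    """frames: iterable of PCM frames (e.g. frame 1 .. frame 50).
    Returns the concatenated target audio data."""
    chunks = []
    for frame in frames:
        feats = extract_features(frame)      # per-frame feature parameters
        rate = predict_rate(feats)           # dynamic per-frame coding rate
        chunks.append(encode_frame(frame, bitrate=rate))
    return b"".join(chunks)                  # combine per-frame results
```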
It should be noted that the audio encoding method in the embodiments of this application may be Pulse Code Modulation (PCM) encoding, waveform audio file (WAV) encoding, MP3 encoding, and so on.
For example, the target audio data may be stored in the terminal or transmitted to other devices over a network. For example, in specific scenarios (such as audio/video call scenarios and live streaming scenarios), the encoded target audio data needs to be transmitted to the receiving end over a network, and the receiving end performs audio decoding on the target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data and play it back losslessly.
For example, within the same original audio, the audio feature difference between several consecutive audio frames is generally small, and the difference in the corresponding audio coding rates is also small, or the frames generally correspond to the same audio coding rate. To prevent occasional errors of the coding rate prediction model from affecting the audio encoding result, the obtained audio coding rates corresponding to the audio frames may be smoothed, to reduce the influence of prediction errors on the audio coding quality; a minimal smoothing sketch follows.
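One simple way to realize the smoothing mentioned above is a moving median over the per-frame rates; the window length of 5 is an assumption for illustration.

```python
import numpy as np

def smooth_rates(rates, window=5):
    """Median-smooth per-frame coding rates to damp occasional outlier
    predictions while leaving stable runs unchanged."""
    rates = np.asarray(rates, dtype=float)
    pad = window // 2
    padded = np.pad(rates, pad, mode="edge")
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(rates))])
```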
In summary, in the embodiments of this application, by analyzing the audio feature parameters corresponding to each audio frame in the original audio, the audio coding rate corresponding to each audio frame is dynamically adjusted based on the audio feature parameters, and an audio coding rate matching the audio feature parameters can be determined for each audio frame, thereby improving the coding quality of the entire audio. Compared with the fixed coding rate used in the related art, the embodiments of this application perform audio encoding with a dynamic coding rate, which reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
For the coding rate prediction model to achieve the goal of dynamically adjusting the audio coding rate, the model needs to be trained in advance on a large number of sample audios, so that it can learn the audio coding rates applicable to audio with different audio feature parameters, and the audio coding rate can then be dynamically adjusted based on the model during application.
Referring to FIG. 4, it shows a flowchart of an audio encoding method shown in an embodiment of this application. The embodiment of this application is exemplarily described by taking a computer device as the execution subject. The method includes the following steps.
Step 401: Obtain sample audio feature parameters corresponding to each sample audio frame in the first sample audio.
It should be noted that the coding rate prediction model is used to match the audio coding rates corresponding to different audio feature parameters. In the training process of the coding rate prediction model, a large number of sample audios, and the sample audio feature parameters corresponding to each sample audio frame in the sample audios, need to be obtained for training the coding rate prediction model.
For example, the sample audio feature parameters may be extracted by an audio feature extraction model.
For example, to make the coding rate prediction model applicable to more application scenarios, different types of audio may be obtained as the first sample audio, for example, speech, music, and the audio in videos.
Illustratively, the more first sample audios there are, the higher the prediction accuracy of the coding rate prediction model; the richer the types of the first sample audios, the wider the prediction range and the higher the prediction accuracy of the model.
For example, when selecting the first sample audio, in addition to selecting different types of audio, sample audios with different audio content and different durations may be selected for the same type; for the same sample audio, the first sample audio may also be divided into different audio frames for subsequent extraction of audio feature parameters.
Step 402: Perform coding rate prediction processing on the sample audio feature parameters through the coding rate prediction model, to obtain the sample coding rate of the sample audio frame.
In some embodiments, the sample audio feature parameters corresponding to each sample audio frame are input into the coding rate prediction model, and the sample coding rate corresponding to each sample audio frame output by the model can be obtained.
For example, the coding rate prediction model may use a fully connected network as its main network, or use neural networks such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), or a neural network built by developers based on actual needs; the embodiments of this application are not limited to any particular structure of the coding rate prediction model. Different sample audio feature parameters correspond to different sample coding rates.
Step 403: Perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame.
Since the sample coding rate or audio coding rate output by the coding rate prediction model corresponds to an audio encoding scenario, when evaluating whether the coding rate output by the model matches the audio frame, the sample audio frame needs to be audio-encoded at the sample coding rate, and the audio encoding result then serves as one of the bases for training the coding rate prediction model.
In some embodiments, for the first sample audio, the sample coding rate corresponding to each sample audio frame in the first sample audio is obtained, and each sample audio frame is audio-encoded based on its sample coding rate, so that sample audio data is generated based on the encoding result corresponding to each sample audio frame, for use in subsequently evaluating the speech coding quality of the first sample audio.
Step 404: Perform audio decoding on the sample audio data, to obtain the second sample audio corresponding to the sample audio data.
To evaluate the speech coding quality, the sample audio data is audio-decoded to obtain the second sample audio generated based on the sample audio data, so that the audio coding quality of the first sample audio can be determined by comparing the second sample audio with the original sample audio.
Step 405: Train the coding rate prediction model based on the first sample audio and the second sample audio, and end the training when the sample coding quality score reaches the target coding quality score.
Here, the sample coding quality score is determined through the first sample audio and the second sample audio.
In some embodiments, the coding quality corresponding to the current encoding parameters is determined by comparing the original audio (the first sample audio) with the audio that has undergone audio encoding and decoding (the second sample audio), the parameters of the coding rate prediction model are adjusted based on that coding quality, and the training process of the model is then completed over several training cycles.
In the process of training the coding rate prediction model, when encoding the sample audio at the coding rate output by the model can make the sample coding quality score of the sample audio reach the target coding quality score, it is determined that training of the coding rate prediction model is complete. Illustratively, the target coding quality score may be 5. For example, the target coding quality score corresponding to the coding rate prediction model may also be set based on the needs of the actual application scenario.
As for the way of determining the sample coding quality, the Perceptual Evaluation of Speech Quality (PESQ) test method may be used: the difference value between the first sample audio and the second sample audio is calculated and mapped to a Mean Opinion Score (MOS). The greater the difference between the first sample audio and the second sample audio, the worse the corresponding speech coding quality and the lower the MOS value. A sketch using an off-the-shelf PESQ implementation follows.
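Off-the-shelf PESQ implementations exist; the sketch below scores the decoded second sample audio against the first sample audio with the open-source `pesq` package, assuming 16 kHz wideband audio (both the package choice and the 'wb' mode are assumptions of this sketch, not requirements of this application).

```python
import numpy as np
from pesq import pesq  # pip install pesq

def sample_coding_quality_score(first_sample, second_sample, sample_rate=16000):
    """PESQ-based MOS of the decoded audio against the original;
    'wb' selects wideband mode for 16 kHz input."""
    ref = np.asarray(first_sample, dtype=np.float32)
    deg = np.asarray(second_sample, dtype=np.float32)
    return pesq(sample_rate, ref, deg, "wb")   # higher MOS = better quality
```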
In summary, in the embodiments of this application, by training the coding rate prediction model, the model can dynamically adjust the audio coding rate based on the sample audio feature parameters corresponding to the sample audio frames. In actual application, the audio coding rate predicted by the model better fits the characteristics of the audio signal, which reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
Although a piece of audio changes from moment to moment, the difference between consecutive audio frames is small; that is, the difference in audio feature parameters between adjacent audio frames is small. When predicting the audio coding rate corresponding to the current audio frame, the audio coding rate corresponding to the previous audio frame therefore has certain reference value for the current frame. To further improve the prediction accuracy of the audio coding rate, the audio coding rate corresponding to the previous audio frame can be fed back into the coding rate prediction process of the next audio frame.
Referring to FIG. 5, it shows a flowchart of an audio encoding method shown in an embodiment of this application. The embodiment of this application is exemplarily described by taking a computer device as the execution subject. The method includes the following steps.
Step 501: Obtain sample audio feature parameters corresponding to each sample audio frame in the first sample audio.
For the implementation of step 501, reference may be made to step 401, which is not repeated here.
For example, the sample audio feature parameters may include at least one of fixed gain, adaptive gain, pitch period, pitch frequency, and line spectrum pair parameters.
Step 502: Obtain the (i-1)-th sample coding rate corresponding to the (i-1)-th sample audio frame.
Here, i is an increasing integer in the range 1 < i ≤ N, N is the number of sample audio frames, and N is an integer greater than 1.
In some embodiments, by feeding the sample coding rate corresponding to the previous sample audio frame back into the coding rate prediction model, the sample coding rate of the previous frame can be referenced when predicting the sample coding rate corresponding to the next sample audio frame, which helps avoid large fluctuations in the sample coding rate.
Step 503: Perform coding rate prediction processing on the i-th sample audio feature parameter and the (i-1)-th sample coding rate through the coding rate prediction model, to obtain the i-th sample coding rate corresponding to the i-th sample audio frame.
In some embodiments, when predicting the i-th sample coding rate corresponding to the i-th sample audio frame, the obtained (i-1)-th sample coding rate and the i-th sample audio feature parameter may be input into the coding rate prediction model together, providing a prediction basis for the i-th sample coding rate, which can further improve the prediction accuracy of the coding rate.
Illustratively, if the first sample audio is divided into sample audio frame 1 to sample audio frame 60, then in the coding rate prediction process, when the coding rate prediction model has output the 10th sample coding rate corresponding to the 10th sample audio frame and the 11th sample coding rate corresponding to the 11th sample audio frame is to be predicted, the 10th sample coding rate and the 11th sample audio feature parameter may be input into the coding rate prediction model together, to obtain the 11th sample coding rate.
Step 504: Perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame.
Step 505: Perform audio decoding on the sample audio data, to obtain the second sample audio corresponding to the sample audio data.
For the implementation of steps 504 and 505, reference may be made to the above embodiments, which is not repeated here.
Step 506: Determine the sample coding quality score corresponding to the first sample audio based on the first sample audio and the second sample audio.
In some embodiments, a PESQ test is performed on the first sample audio and the second sample audio, the measurement result is mapped to a MOS value, and the MOS value is determined as the sample coding quality score corresponding to the first sample audio.
Illustratively, the MOS value may range from 0 to 5, where a higher MOS score indicates better audio coding quality.
Step 507: Train the coding rate prediction model based on the sample coding quality score and the target coding quality score.
Here, the target coding quality score indicates the expected goal of audio encoding and is set by developers; different target coding quality scores can be set based on the application scenario of the coding rate prediction model. Illustratively, if the coding rate prediction model is intended for voice call scenarios, the target coding quality score may be set to 4; if the model is intended for audio storage scenarios, the target coding quality score may be set to 5.
For example, different coding rate prediction models may also be trained for different target coding quality scores, so that in actual application the corresponding coding rate prediction model can be selected based on the target coding quality score required by the actual application scenario.
In some embodiments, the gap between the current encoding result and the expected goal is determined by comparing the sample coding quality score with the target coding quality score, and the coding rate prediction model is trained based on that gap, thereby updating the parameters of the coding rate prediction model.
In the audio encoding process, in addition to the target coding quality score, the choice of coding rate should also serve as one of the indicators for evaluating coding quality. Illustratively, for the same audio signal, coding rate A and coding rate B may both achieve the same coding quality, but coding rate A is smaller than coding rate B; since a larger coding rate may consume more storage space and traffic bandwidth, the smaller of coding rate A and coding rate B also needs to be selected. Correspondingly, in the model training process, the coding rate also serves as one of the loss parameters of the coding rate prediction model.
Exemplarily, the process of training the coding rate prediction model may further include the following steps.
I. Determine the average coding rate corresponding to the first sample audio, where the average coding rate is determined through the sample coding rates corresponding to the sample audio frames.
In the audio encoding process of the embodiments of this application, a corresponding sample coding rate is predicted for every sample audio frame. When evaluating whether a smaller sample coding rate can be achieved, the sample coding rates corresponding to the sample audio frames may be averaged to obtain the average coding rate, which is then determined as one of the parameters for evaluating the audio coding quality.
II. Construct the first coding loss corresponding to the first sample audio based on the average coding rate, the sample coding quality score, and the target coding quality score.
In some embodiments, the coding loss corresponding to the first sample audio is jointly evaluated through the two parameter dimensions of coding rate and coding quality score; that is, the first coding loss corresponding to the first sample audio is calculated based on the average coding rate, the sample coding quality score, and the target coding quality score.
For example, developers can adjust the weights of the two parameter dimensions based on the needs of the application scenario. Illustratively, for voice call scenarios, a larger weight can be set for the coding rate; for audio storage scenarios, a larger weight can be set for the coding quality score.
Exemplarily, the process of constructing the first coding loss may further include the following steps.
1. Obtain the first loss weight corresponding to the average coding rate and the second loss weight corresponding to the coding quality score, where the coding quality score is determined through the sample coding quality score and the target coding quality score.
In some embodiments, when calculating the coding loss, the loss weights corresponding to the average coding rate and to the coding quality score may be obtained respectively, and the first coding loss is then calculated based on the loss weight corresponding to each parameter.
For example, the first loss weight and the second loss weight are set by developers. Different first and second loss weights may be set depending on the application scenario of the coding rate prediction model, making the trained coding rate prediction model better suited to the needs of that application scenario.
For example, different coding rate prediction models may also be trained for different combinations of loss weights, so that in actual application the corresponding coding rate prediction model can be selected for the needs of different application scenarios.
2. Construct the first coding loss corresponding to the first sample audio based on the average coding rate, the first loss weight, the coding quality score, and the second loss weight.
Exemplarily, the formula for calculating the first coding loss can be expressed as follows:
loss = a * average(bitrate) + (1 - a) * power(f(MOS_SET - mos), 3)
where a is a weighting coefficient with a value from 0 to 1 (that is, the loss weight); average(.) is the averaging function; bitrate is the coding rate; power(.) is the power function (here, the cube); MOS_SET is the preset target value of the objective speech quality MOS score (that is, the target coding quality score); mos is the sample coding quality score; and the function f(x) is defined as f(x) = 0 when x ≤ 0, and f(x) = x when x > 0.
In some embodiments, the first coding loss corresponding to the first sample audio can be calculated by substituting the average coding rate, the first loss weight, the sample coding quality score, the target coding quality score, and the second loss weight into the above formula, as in the sketch below.
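A direct numpy transcription of the formula above. The default weight a=0.5 is only a placeholder; in practice the bitrate term would presumably be normalized so that the two terms are commensurate, which the text does not specify.

```python
import numpy as np

def first_coding_loss(frame_rates, mos, mos_set, a=0.5):
    """loss = a*average(bitrate) + (1-a)*power(f(MOS_SET - mos), 3),
    with f(x) = x for x > 0 and f(x) = 0 otherwise."""
    avg_rate = float(np.mean(frame_rates))   # average coding rate
    shortfall = max(mos_set - mos, 0.0)      # f(MOS_SET - mos)
    return a * avg_rate + (1.0 - a) * shortfall ** 3
```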
III. Train the coding rate prediction model based on the first coding loss and the preset coding loss.
In some embodiments, the cross-entropy criterion is used in the process of training the coding rate prediction model; that is, a preset coding loss is set in advance, and only when the first coding loss is infinitely close to the preset coding loss can it be determined that training of the coding rate prediction model is complete.
In some embodiments, feeding the sample coding rate of the previous frame back into the coding rate prediction model can provide reference value for predicting the sample coding rate of the next frame, avoiding large fluctuations of the coding rate during prediction and thereby improving the prediction accuracy of the coding rate. In addition, training the coding rate prediction model with the goals of a small coding rate and good coding quality allows the model, when controlling the speech coding rate in application, to minimize the coding rate while the speech coding quality meets the target requirement; correspondingly, under the same bandwidth or storage space conditions, the audio coding quality can be optimal.
In specific application scenarios, audio data that has undergone audio encoding needs to be transmitted to other terminals over a network. For example, in a voice call scenario, the encoded speech data needs to be transmitted to other clients, and whether the receiving end can obtain a good audio signal depends not only on the coding rate but also on the state of the network environment during network transmission. Therefore, in order for the receiving end to obtain a good-quality audio signal in such specific scenarios, the current network state parameter also needs to be considered in the process of predicting the audio coding rate; correspondingly, the network state parameter also needs to participate in model training.
Exemplarily, on the basis of FIG. 4, as shown in FIG. 6, step 402 can be replaced by step 601 and step 602.
Step 601: Obtain a sample network state parameter of the first sample audio.
In training the coding rate prediction model, in order to make the predicted audio coding rate suitable for the current network state, the network state parameter can also be added to the training samples of the coding rate prediction model. Illustratively, the sample network state parameter may be the packet loss rate, the network transmission rate, and so on.
For example, the required sample network state parameters can be simulated randomly. Illustratively, different sample network state parameters can be generated for different sample audios, or corresponding sample network state parameters can be generated for different sample audio frames, or a corresponding sample network state parameter can be generated every preset period, as in the sketch below.
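A minimal sketch of such random simulation, drawing one packet loss rate per preset span of frames; the uniform distribution, the 0-30% range, and the span length are assumptions for illustration.

```python
import random

def simulate_sample_loss_rates(n_frames, span=50, max_loss=0.3):
    """One simulated packet loss rate per span of sample audio frames,
    mimicking 'generate a network state parameter every preset period'."""
    flags = []
    while len(flags) < n_frames:
        rate = random.uniform(0.0, max_loss)
        flags.extend([rate] * min(span, n_frames - len(flags)))
    return flags
```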
Correspondingly, when predicting the sample coding rate corresponding to a sample audio frame, the sample network state parameter and the sample audio feature parameter corresponding to that sample audio frame can be input into the coding rate prediction model together for coding rate prediction.
Step 602: Perform coding rate prediction processing on the sample network state parameter and the sample audio feature parameters through the coding rate prediction model, to obtain the sample coding rate of the sample audio frame.
In some embodiments, when predicting the sample coding rate corresponding to a sample audio frame, in addition to the sample audio feature parameter corresponding to that frame, the sample network state parameter used for the current prediction also needs to be obtained, and the sample network state parameter and the sample audio feature parameter are input into the coding rate prediction model together, to obtain the sample coding rate output by the model.
For example, to further improve coding prediction accuracy in specific application scenarios, during coding rate prediction the sample coding rate corresponding to the previous sample audio frame can also be fed back into the coding rate prediction model, to provide a prediction reference for the sample coding rate of the next sample audio frame.
In some embodiments, the sample network state parameter, the (i-1)-th sample coding rate (the coding rate corresponding to the (i-1)-th sample audio frame), and the i-th sample audio feature parameter may be input into the coding rate prediction model, where the sample network state parameter provides a current network state reference and the (i-1)-th sample coding rate provides a coding rate prediction reference, and the i-th sample coding rate corresponding to the i-th sample audio frame is then generated.
In some embodiments, by adding the network state parameter in the training process, the coding rate prediction model can take into account the influence of the network state on the coding rate when predicting the coding rate, further improving the audio coding quality in specific scenarios (for example, voice call scenarios).
Referring to FIG. 7, it shows a schematic diagram of a complete model training process shown in an embodiment of this application. In the process of training the coding rate prediction model 702 based on the first sample speech 701, the first sample speech 701 is divided into several sample audio frames, and the sample audio feature parameters 704 and network packet loss flag 703 corresponding to each sample audio frame are input into the coding rate prediction model 702, to obtain the current frame coding rate 705 output by the model. The current frame coding rate 705 is not only used for speech encoding but can also be fed back into the coding rate prediction model 702 to predict the coding rate of the next frame. Audio encoding is performed based on the coding rate corresponding to each sample audio frame to obtain the audio encoding result; the speech encoding result is then audio-decoded to generate the second sample speech 706, so that a PESQ test is performed on the first sample speech 701 and the second sample speech 706, and the coding rate prediction model 702 is then trained based on the test result.
Exemplarily, the coding rate prediction model 702 includes fully connected layers (DENSE) and gated recurrent units (GRU). Illustratively, GRU1 has 24 neurons, DENSE2 has 96 neurons, GRU2 and GRU3 each have 256 neurons, and DENSE3 has 1 neuron. The network packet loss flag 703 is input into DENSE1 to extract network state features; at the same time, the sample audio feature parameters 704 are input into DENSE2 to extract audio features; feature fusion is then performed through GRU2 and GRU3, and the result is input into DENSE3, which outputs the probability of each preset coding rate, and the preset coding rate with the highest probability is determined as the current frame coding rate corresponding to the current sample audio frame.
For example, the coding rate prediction model 702 may also adopt other network structures; for example, the coding rate prediction model 702 may include only fully connected layers.
In the model training process, the coding rate of the previous frame is fed back into the network model as a basis for predicting the coding rate of the next frame; correspondingly, in actual application, to further improve the audio coding quality, the audio coding rate output by the coding rate prediction model for each frame can also be fed back into the model, to provide a reference for the coding rate prediction of the next frame.
On the basis of FIG. 3, as shown in FIG. 8, step 302 can be replaced by step 801 and step 802.
Step 801: Obtain the (j-1)-th audio coding rate corresponding to the (j-1)-th audio frame.
Here, j is an increasing integer in the range 1 < j ≤ M, M is the number of audio frames, and M is an integer greater than 1.
In some embodiments, after the coding rate prediction model predicts the (j-1)-th audio coding rate corresponding to the (j-1)-th audio frame, in addition to being used subsequently to audio-encode the (j-1)-th audio frame based on that rate, the (j-1)-th audio coding rate can also be re-input into the coding rate prediction model, to provide a reference basis for predicting the j-th audio coding rate corresponding to the j-th audio frame.
Step 802: Perform coding rate prediction processing on the (j-1)-th audio coding rate and the j-th audio feature parameter corresponding to the j-th audio frame through the coding rate prediction model, to obtain the j-th audio coding rate corresponding to the j-th audio frame.
In some embodiments, when predicting the j-th audio coding rate corresponding to the j-th audio frame, the (j-1)-th audio coding rate corresponding to the (j-1)-th audio frame may be obtained, so that the (j-1)-th audio coding rate and the j-th audio feature parameter are input into the coding rate prediction model together; the (j-1)-th audio coding rate provides a prediction basis for the j-th audio coding rate, and the j-th audio coding rate output by the coding rate prediction model is then obtained.
In some embodiments, feeding the audio coding rate of the previous frame back into the coding rate prediction model serves as a reference for predicting the audio coding rate of the next frame, which avoids large fluctuations of the audio coding rate during coding rate prediction and thereby improves the prediction accuracy of the audio coding rate.
In certain specific application scenarios that require online transmission of audio data, such as voice call scenarios and live streaming scenarios, the network state affects the speech quality received by the receiving end. Therefore, in such scenarios, to avoid the influence of the network state on speech quality, the influence of the current network state needs to be considered when generating the audio coding rate.
On the basis of FIG. 3, as shown in FIG. 9, step 302 can be replaced by step 901 and step 902.
Step 901: Obtain the current network state parameter fed back by the receiving end, where the receiving end is configured to receive the target audio data transmitted over the network.
In one possible application scenario, the target audio data that has undergone audio encoding needs to be transmitted to other terminals (that is, the receiving end) over a network, and the network state also has a certain influence on the audio encoding process. Illustratively, if the network state is poor, a smaller coding rate is used accordingly; if the network state is good, a larger coding rate is used. Therefore, for audio data intended for network transmission, the current network state parameter fed back by the receiving end also needs to be considered in the coding rate prediction process.
The network state parameter may be returned by the receiving end. Taking the packet loss rate as the network state parameter as an example, the receiving end counts the network packet loss rate within a certain period and returns it to the sending end; when the sending end receives the packet loss rate, it can use it as the network state parameter and input it into the coding rate prediction model, so that the current network state can be considered when predicting the audio coding rate.
Illustratively, the sending terminal may acquire the network state parameter from the receiving end every set time, or the receiving end may feed back the network state parameter to the sending terminal every predetermined time, where the set time may be 30 minutes (min).
Step 902: Perform coding rate prediction processing on the current network state parameter and the audio feature parameters through the coding rate prediction model, to obtain the audio coding rate of the audio frame.
In some embodiments, when predicting the audio coding rate corresponding to an audio frame, considering the influence of the current network state, the obtained current network state parameter and the audio feature parameters corresponding to the audio frame can be input into the coding rate prediction model, so that the current network state is taken into account as an influencing factor when predicting the audio coding rate, and the audio coding rate output by the coding rate prediction model is obtained.
After the sending end encodes the audio based on the audio coding rate and transmits the encoding result to the receiving end over the network, since the audio coding rate used in the audio encoding process has already taken the current network state into account, the receiving end can be guaranteed to receive a good audio signal.
For example, to further improve coding prediction accuracy in specific application scenarios, during coding rate prediction the audio coding rate corresponding to the previous audio frame can also be fed back into the coding rate prediction model, to provide a prediction reference for the audio coding rate of the next audio frame.
In some embodiments, the network state parameter, the (j-1)-th audio coding rate (that is, the audio coding rate corresponding to the (j-1)-th audio frame), and the j-th audio feature parameter may be input into the coding rate prediction model; the network state parameter provides a network state reference for the j-th audio coding rate, and the (j-1)-th audio coding rate provides a coding rate prediction reference for the j-th audio coding rate, and the coding rate prediction model then outputs the j-th audio coding rate corresponding to the j-th audio frame, where j is an integer greater than 1.
In some embodiments, by adding the network state parameter in the process of predicting the audio coding rate, the coding rate prediction model can take into account the influence of the network state on the coding rate when predicting the coding rate, further improving the audio coding quality in specific scenarios (for example, voice call scenarios).
Referring to FIG. 10, it shows a schematic diagram of an audio encoding process shown in an embodiment of this application. In model application, the network packet loss flag 1001 (that is, the network state parameter) and the audio feature parameters 1002 can be input into the coding rate prediction model 1003, which outputs the current frame coding rate 1004; for example, the current frame coding rate 1004 can also be input into the coding rate prediction model, to provide a reference basis for predicting the coding rate of the next frame. Audio encoding is then performed based on the audio coding rate corresponding to each audio frame, and the audio encoded data corresponding to the original audio is generated based on the encoding result corresponding to each audio frame.
Referring to FIG. 11, it shows a structural block diagram of an audio encoding apparatus shown in an embodiment of this application. The audio encoding apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of the two. The audio encoding apparatus may include:
a first obtaining module 1101, configured to obtain sample audio feature parameters corresponding to each sample audio frame in a first sample audio; a first processing module 1102, configured to perform coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain the sample coding rate of the sample audio frame; a first encoding module 1103, configured to perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame; an audio decoding module 1104, configured to perform audio decoding on the sample audio data, to obtain the second sample audio corresponding to the sample audio data; a training module 1105, configured to train the coding rate prediction model based on the first sample audio and the second sample audio, and end the training when the sample coding quality score reaches the target coding quality score; where the sample coding quality score is determined through the first sample audio and the second sample audio.
In some embodiments, the apparatus further includes: a second obtaining module, configured to obtain a sample network state parameter of the first sample audio; the first processing module 1102 includes: a first processing unit, configured to perform coding rate prediction processing on the sample network state parameter and the sample audio feature parameters through the coding rate prediction model, to obtain the sample coding rate of the sample audio frame.
In some embodiments, the apparatus further includes: a third obtaining module, configured to obtain the (i-1)-th sample coding rate corresponding to the (i-1)-th sample audio frame;
the first processing module 1102 includes: a second processing unit, configured to perform coding rate prediction processing on the i-th sample audio feature parameter and the (i-1)-th sample coding rate through the coding rate prediction model, to obtain the i-th sample coding rate corresponding to the i-th sample audio frame; where i is an increasing integer in the range 1 < i ≤ N, N is the number of the sample audio frames, and N is an integer greater than 1.
In some embodiments, the training module 1105 includes: a determining unit, configured to determine the sample coding quality score corresponding to the first sample audio based on the first sample audio and the second sample audio; a training unit, configured to train the coding rate prediction model based on the sample coding quality score and the target coding quality score.
In some embodiments, the training unit is further configured to: determine the average coding rate corresponding to the first sample audio, where the average coding rate is determined through the sample coding rates corresponding to the sample audio frames; construct the first coding loss corresponding to the first sample audio based on the average coding rate, the sample coding quality score, and the target coding quality score; and train the coding rate prediction model based on the first coding loss and a preset coding loss.
In some embodiments, the training unit is further configured to: obtain a first loss weight corresponding to the average coding rate and a second loss weight corresponding to a coding quality score, where the coding quality score is determined through the sample coding quality score and the target coding quality score; and construct the first coding loss corresponding to the first sample audio based on the average coding rate, the first loss weight, the coding quality score, and the second loss weight.
In some embodiments, the type of the sample audio feature parameters includes at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameters.
In summary, in the embodiments of this application, in the process of training the coding rate prediction model, the sample audio feature parameters corresponding to each sample audio frame in the sample audio are analyzed, so that the sample coding rate corresponding to each sample audio frame is predicted based on the sample audio feature parameters, and the sample audio frames are then audio-encoded based on the per-frame sample coding rates. After the audio encoding result is audio-decoded, the coding rate prediction model is trained by comparing the relationship between the decoded audio and the original audio, so that in actual application the coding rate prediction model can dynamically adjust the audio coding rate based on the audio feature parameters; this reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
Referring to FIG. 12, it shows a structural block diagram of an audio encoding apparatus shown in an embodiment of this application. The audio encoding apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of the two. The audio encoding apparatus may include:
a fourth obtaining module 1201, configured to obtain audio feature parameters corresponding to each audio frame in an original audio;
a second processing module 1202, configured to perform coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain the audio coding rate of the audio frame, where the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when the target coding quality score is reached;
a second encoding module 1203, configured to perform audio encoding on the audio frame based on the audio coding rate, and generate target audio data based on the encoding result corresponding to each audio frame.
In some embodiments, the target audio data is used for network transmission;
the apparatus further includes:
a fifth obtaining module, configured to obtain the current network state parameter fed back by the receiving end, where the receiving end is configured to receive the target audio data transmitted over the network; the second processing module 1202 includes: a third processing unit, configured to perform coding rate prediction processing on the current network state parameter and the audio feature parameters through the coding rate prediction model, to obtain the audio coding rate of the audio frame.
In some embodiments, the apparatus further includes:
a sixth obtaining module, configured to obtain the (j-1)-th audio coding rate corresponding to the (j-1)-th audio frame; the second processing module 1202 includes: a fourth processing unit, configured to perform coding rate prediction processing on the (j-1)-th audio coding rate and the j-th audio feature parameter corresponding to the j-th audio frame through the coding rate prediction model, to obtain the j-th audio coding rate corresponding to the j-th audio frame; where j is an increasing integer in the range 1 < j ≤ M, M is the number of the audio frames, and M is an integer greater than 1.
In some embodiments, the type of the audio feature parameters includes at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameters.
In summary, in the embodiments of this application, by analyzing the audio feature parameters corresponding to each audio frame in the original audio, the audio coding rate corresponding to each audio frame is dynamically adjusted based on the audio feature parameters, and an audio coding rate matching the audio feature parameters can be determined for each audio frame, thereby improving the coding quality of the entire audio. Compared with the fixed coding rate used in the related art, this embodiment performs audio encoding with a dynamic coding rate, which reduces the audio coding rate as much as possible while the audio coding quality meets the target requirement, thereby reducing the storage space of the audio data and the bandwidth consumed in transmitting the audio data.
An embodiment of this application further provides an audio decoding apparatus, which can be implemented as all or part of a computer device through software, hardware, or a combination of the two. The audio decoding apparatus may include:
a fifth obtaining module, configured to obtain the encoded target audio data; a decoding module, configured to perform audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
Referring to FIG. 13, it shows a structural block diagram of a computer device provided by an embodiment of this application. The computer device can be used to implement the audio encoding method or the audio decoding method provided in the above embodiments. Specifically:
The computer device 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a Random Access Memory (RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic Input/Output system (I/O system) 1306 that helps transfer information between the components within the computer device, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse or keyboard, for user input of information. The display 1308 and the input device 1309 are both connected to the central processing unit 1301 through an input/output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include the input/output controller 1310 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1310 also provides output to a display screen, printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable storage medium provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable storage medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable storage medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable storage instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will know that the computer storage medium is not limited to the above. The system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by one or more central processing units 1301; the one or more programs contain instructions for implementing the above method embodiments, and the central processing unit 1301 executes the one or more programs to implement the methods provided by the above respective method embodiments.
According to various embodiments of this application, the computer device 1300 may also operate by connecting to a remote server on a network through a network such as the Internet. That is, the computer device 1300 can be connected to the network 1312 through a network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 can be used to connect to other types of networks or remote server systems (not shown).
The memory further includes one or more programs stored in the memory, and the one or more programs contain the steps performed by the computer device in the methods provided in the embodiments of this application.
An embodiment of this application further provides a computer-readable storage medium storing at least one instruction, the at least one instruction being loaded and executed by the processor to implement the audio encoding method or the audio decoding method described in the above embodiments.
An embodiment of this application provides a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the audio encoding method or the audio decoding method provided in the above optional implementations.
Those skilled in the art will easily conceive of other implementations of this application after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary technical means in the art not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application indicated by the following claims.
It should be understood that this application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims (18)

  1. An audio encoding method, executed by a computer device, the method comprising:
    obtaining sample audio feature parameters corresponding to each sample audio frame in a first sample audio;
    performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain a sample coding rate of the sample audio frame;
    performing audio encoding on the sample audio frame based on the sample coding rate, and generating sample audio data based on the encoding result corresponding to each sample audio frame;
    performing audio decoding on the sample audio data, to obtain a second sample audio corresponding to the sample audio data;
    training the coding rate prediction model based on the first sample audio and the second sample audio, and ending the training when a sample coding quality score reaches a target coding quality score;
    wherein the sample coding quality score is determined through the first sample audio and the second sample audio.
  2. The method according to claim 1, wherein before the performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model to obtain a sample coding rate of the sample audio frame, the method further comprises:
    obtaining a sample network state parameter of the first sample audio;
    and the performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model to obtain a sample coding rate of the sample audio frame comprises:
    performing coding rate prediction processing on the sample network state parameter and the sample audio feature parameters through the coding rate prediction model, to obtain the sample coding rate of the sample audio frame.
  3. The method according to claim 1, wherein before the performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model to obtain a sample coding rate of the sample audio frame, the method further comprises:
    obtaining an (i-1)-th sample coding rate corresponding to an (i-1)-th sample audio frame;
    and the performing coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model to obtain a sample coding rate of the sample audio frame comprises:
    performing coding rate prediction processing on the i-th sample audio feature parameter and the (i-1)-th sample coding rate through the coding rate prediction model, to obtain an i-th sample coding rate corresponding to an i-th sample audio frame;
    wherein i is an increasing integer in the range 1 < i ≤ N, N is the number of the sample audio frames, and N is an integer greater than 1.
  4. The method according to any one of claims 1 to 3, wherein the training the coding rate prediction model based on the first sample audio and the second sample audio comprises:
    determining the sample coding quality score corresponding to the first sample audio based on the first sample audio and the second sample audio;
    training the coding rate prediction model based on the sample coding quality score and the target coding quality score.
  5. The method according to claim 4, wherein the training the coding rate prediction model based on the sample coding quality score and the target coding quality score comprises:
    determining an average coding rate corresponding to the first sample audio, wherein the average coding rate is determined through the sample coding rates corresponding to the sample audio frames;
    constructing a first coding loss corresponding to the first sample audio based on the average coding rate, the sample coding quality score, and the target coding quality score;
    training the coding rate prediction model based on the first coding loss and a preset coding loss.
  6. The method according to claim 5, wherein the constructing a first coding loss corresponding to the first sample audio based on the average coding rate, the sample coding quality score, and the target coding quality score comprises:
    obtaining a first loss weight corresponding to the average coding rate and a second loss weight corresponding to a coding quality score, the coding quality score being determined through the sample coding quality score and the target coding quality score;
    constructing the first coding loss corresponding to the first sample audio based on the average coding rate, the first loss weight, the coding quality score, and the second loss weight.
  7. The method according to any one of claims 1 to 3, wherein the type of the sample audio feature parameters comprises at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameters.
  8. An audio encoding method, executed by a computer device, the method comprising:
    obtaining audio feature parameters corresponding to each audio frame in an original audio;
    performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain an audio coding rate of the audio frame, wherein the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when a target coding quality score is reached;
    performing audio encoding on the audio frame based on the audio coding rate, and generating target audio data based on the encoding result corresponding to each audio frame.
  9. The method according to claim 8, wherein
    the target audio data is used for network transmission;
    before the performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model to obtain an audio coding rate of the audio frame, the method further comprises:
    obtaining a current network state parameter fed back by a receiving end, the receiving end being configured to receive the target audio data transmitted over the network;
    and the performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model to obtain an audio coding rate of the audio frame comprises:
    performing coding rate prediction processing on the current network state parameter and the audio feature parameters through the coding rate prediction model, to obtain the audio coding rate of the audio frame.
  10. The method according to claim 8, wherein
    before the performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model to obtain an audio coding rate of the audio frame, the method further comprises:
    obtaining a (j-1)-th audio coding rate corresponding to a (j-1)-th audio frame;
    and the performing coding rate prediction processing on the audio feature parameters through a coding rate prediction model to obtain an audio coding rate of the audio frame comprises:
    performing coding rate prediction processing on the (j-1)-th audio coding rate and the j-th audio feature parameter corresponding to the j-th audio frame through the coding rate prediction model, to obtain a j-th audio coding rate corresponding to the j-th audio frame;
    wherein j is an increasing integer in the range 1 < j ≤ M, M is the number of the audio frames, and M is an integer greater than 1.
  11. The method according to any one of claims 8 to 10, wherein the type of the audio feature parameters comprises at least one of the following: fixed gain, adaptive gain, pitch period, pitch frequency, line spectrum pair parameters.
  12. An audio decoding method, executed by a computer device and applied to target audio data encoded by the audio encoding method according to any one of claims 8 to 11;
    the method comprising:
    obtaining the encoded target audio data;
    performing audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
  13. An audio encoding apparatus, the apparatus comprising:
    a first obtaining module, configured to obtain sample audio feature parameters corresponding to each sample audio frame in a first sample audio;
    a first processing module, configured to perform coding rate prediction processing on the sample audio feature parameters through a coding rate prediction model, to obtain a sample coding rate of the sample audio frame;
    a first encoding module, configured to perform audio encoding on the sample audio frame based on the sample coding rate, and generate sample audio data based on the encoding result corresponding to each sample audio frame;
    an audio decoding module, configured to perform audio decoding on the sample audio data, to obtain a second sample audio corresponding to the sample audio data;
    a training module, configured to train the coding rate prediction model based on the first sample audio and the second sample audio, and end the training when a sample coding quality score reaches a target coding quality score; wherein the sample coding quality score is determined through the first sample audio and the second sample audio.
  14. An audio encoding apparatus, the apparatus comprising:
    a fourth obtaining module, configured to obtain audio feature parameters corresponding to each audio frame in an original audio;
    a second processing module, configured to perform coding rate prediction processing on the audio feature parameters through a coding rate prediction model, to obtain an audio coding rate of the audio frame, wherein the coding rate prediction model is used to predict the audio coding rate corresponding to each audio frame when a target coding quality score is reached;
    a second encoding module, configured to perform audio encoding on the audio frame based on the audio coding rate, and generate target audio data based on the encoding result corresponding to each audio frame.
  15. An audio decoding apparatus, the apparatus comprising:
    a fifth obtaining module, configured to obtain the encoded target audio data;
    a decoding module, configured to perform audio decoding on the encoded target audio data through an audio decoding rate corresponding to the audio coding rate, to obtain the decoded target audio data.
  16. A computer device, comprising a processor and a memory, the memory storing at least one program, the at least one program being loaded and executed by the processor to implement the audio encoding method according to any one of claims 1 to 7, or the audio encoding method according to any one of claims 8 to 11, or the audio decoding method according to claim 12.
  17. A computer-readable storage medium, the storage medium storing at least one program, the at least one program being loaded and executed by a processor to implement the audio encoding method according to any one of claims 1 to 7, or the audio encoding method according to any one of claims 8 to 11, or the audio decoding method according to claim 12.
  18. A computer program product, comprising computer instructions that, when executed by a computer, implement the audio encoding method according to any one of claims 1 to 7, or the audio encoding method according to any one of claims 8 to 11, or the audio decoding method according to claim 12.
PCT/CN2022/081414 2021-04-09 2022-03-17 Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product WO2022213787A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023538141A JP2024501933A (ja) 2021-04-09 2022-03-17 オーディオ符号化方法、オーディオ復号化方法、装置、コンピューター機器及びコンピュータープログラム
EP22783856.2A EP4239630A1 (en) 2021-04-09 2022-03-17 Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product
US17/978,905 US20230046509A1 (en) 2021-04-09 2022-11-01 Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110380547.9A CN112767956B (zh) 2021-04-09 2021-04-09 音频编码方法、装置、计算机设备及介质
CN202110380547.9 2021-04-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/978,905 Continuation US20230046509A1 (en) 2021-04-09 2022-11-01 Audio encoding method, audio decoding method, apparatus, computer device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
WO2022213787A1 true WO2022213787A1 (zh) 2022-10-13

Family

ID=75691260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081414 WO2022213787A1 (zh) 2021-04-09 2022-03-17 音频编码方法、音频解码方法、装置、计算机设备、存储介质及计算机程序产品

Country Status (5)

Country Link
US (1) US20230046509A1 (zh)
EP (1) EP4239630A1 (zh)
JP (1) JP2024501933A (zh)
CN (1) CN112767956B (zh)
WO (1) WO2022213787A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518250B (zh) * 2020-08-07 2022-08-02 腾讯科技(深圳)有限公司 一种多媒体数据处理方法、装置、设备及可读存储介质
CN112767956B (zh) * 2021-04-09 2021-07-16 腾讯科技(深圳)有限公司 音频编码方法、装置、计算机设备及介质
CN113192520B (zh) * 2021-07-01 2021-09-24 腾讯科技(深圳)有限公司 一种音频信息处理方法、装置、电子设备及存储介质
CN117813652A (zh) * 2022-05-10 2024-04-02 北京小米移动软件有限公司 音频信号编码方法、装置、电子设备和存储介质
CN115334349B (zh) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 音频处理方法、装置、电子设备及存储介质
CN117793078A (zh) * 2024-02-27 2024-03-29 腾讯科技(深圳)有限公司 一种音频数据的处理方法、装置、电子设备和存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7080009B2 (en) * 2000-05-01 2006-07-18 Motorola, Inc. Method and apparatus for reducing rate determination errors and their artifacts
ATE479182T1 (de) * 2007-07-30 2010-09-15 Global Ip Solutions Gips Ab Audiodekoder mit geringer verzögerung
CN104517612B (zh) * 2013-09-30 2018-10-12 上海爱聊信息科技有限公司 基于amr-nb语音信号的可变码率编码器和解码器及其编码和解码方法
CN105610635B (zh) * 2016-02-29 2018-12-07 腾讯科技(深圳)有限公司 语音编码发送方法和装置
US10587880B2 (en) * 2017-03-30 2020-03-10 Qualcomm Incorporated Zero block detection using adaptive rate model
CN110767243A (zh) * 2019-11-04 2020-02-07 重庆百瑞互联电子技术有限公司 一种音频编码方法、装置及设备
CN111429926B (zh) * 2020-03-24 2022-04-15 北京百瑞互联技术有限公司 一种优化音频编码速度的方法和装置
CN112289328A (zh) * 2020-10-28 2021-01-29 北京百瑞互联技术有限公司 一种确定音频编码码率的方法及系统

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8897370B1 (en) * 2009-11-30 2014-11-25 Google Inc. Bitrate video transcoding based on video coding complexity estimation
CN104143335A (zh) * 2014-07-28 2014-11-12 华为技术有限公司 音频编码方法及相关装置
CN109495660A (zh) * 2018-11-29 2019-03-19 广州市百果园信息技术有限公司 一种音频数据的编码方法、装置、设备和存储介质
CN110992963A (zh) * 2019-12-10 2020-04-10 腾讯科技(深圳)有限公司 网络通话方法、装置、计算机设备及存储介质
CN111243608A (zh) * 2020-01-17 2020-06-05 中国人民解放军国防科技大学 一种基于深度自编码机低速率语音编码方法
CN111370032A (zh) * 2020-02-20 2020-07-03 厦门快商通科技股份有限公司 语音分离方法、系统、移动终端及存储介质
CN111862995A (zh) * 2020-06-22 2020-10-30 北京达佳互联信息技术有限公司 一种码率确定模型训练方法、码率确定方法及装置
CN112767956A (zh) * 2021-04-09 2021-05-07 腾讯科技(深圳)有限公司 音频编码方法、装置、计算机设备及介质

Also Published As

Publication number Publication date
JP2024501933A (ja) 2024-01-17
EP4239630A1 (en) 2023-09-06
US20230046509A1 (en) 2023-02-16
CN112767956B (zh) 2021-07-16
CN112767956A (zh) 2021-05-07

Similar Documents

Publication Publication Date Title
WO2022213787A1 (zh) 音频编码方法、音频解码方法、装置、计算机设备、存储介质及计算机程序产品
CN110223705B (zh) 语音转换方法、装置、设备及可读存储介质
CN108922538B (zh) 会议信息记录方法、装置、计算机设备及存储介质
CN108900725B (zh) 一种声纹识别方法、装置、终端设备及存储介质
CN111862934B (zh) 语音合成模型的改进方法和语音合成方法及装置
WO2022227935A1 (zh) 语音识别方法、装置、设备、存储介质及程序产品
CN112185363B (zh) 音频处理方法及装置
CN111951823A (zh) 一种音频处理方法、装置、设备及介质
CN114338623B (zh) 音频的处理方法、装置、设备及介质
CN111863033A (zh) 音频质量识别模型的训练方法、装置、服务器和存储介质
CN112908293B (zh) 一种基于语义注意力机制的多音字发音纠错方法及装置
CN112767955B (zh) 音频编码方法及装置、存储介质、电子设备
CN113823303A (zh) 音频降噪方法、装置及计算机可读存储介质
CN115713939A (zh) 语音识别方法、装置及电子设备
US20180082703A1 (en) Suitability score based on attribute scores
EP4040436A1 (en) Speech encoding method and apparatus, computer device, and storage medium
US20080059161A1 (en) Adaptive Comfort Noise Generation
Baskaran et al. Dominant speaker detection in multipoint video communication using Markov chain with non-linear weights and dynamic transition window
CN117854509B (zh) 一种耳语说话人识别模型训练方法和装置
CN117373465B (zh) 一种语音频信号切换系统
WO2024056078A1 (zh) 视频生成方法、装置和计算机可读存储介质
Kaledibi et al. Quality of Experience Prediction for VoIP Calls Using Audio MFCCs and Multilayer Perceptron
US11011174B2 (en) Method and system for determining speaker-user of voice-controllable device
CN114783410A (zh) 语音合成方法、系统、电子设备和存储介质
CN117854509A (zh) 一种耳语说话人识别模型训练方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22783856

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022783856

Country of ref document: EP

Effective date: 20230531

WWE Wipo information: entry into national phase

Ref document number: 2023538141

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE