WO2022179406A1 - Audio transcoding method and apparatus, audio transcoder, device, and storage medium

Audio transcoding method and apparatus, audio transcoder, device, and storage medium

Info

Publication number
WO2022179406A1
WO2022179406A1 (PCT/CN2022/076144)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
signal
target
excitation signal
parameter
Prior art date
Application number
PCT/CN2022/076144
Other languages
English (en)
French (fr)
Inventor
黄庆博
王蒙
肖玮
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111619099.XA (published as CN115050377A)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Published as WO2022179406A1
Related US application US18/046,708, published as US20230075562A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0017 Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • G10L19/032 Quantisation or dequantisation of spectral components
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/083 Determination or coding of the excitation function; the excitation function being an excitation gain
    • G10L19/173 Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • G10L19/24 Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Definitions

  • the present application relates to the field of audio processing, and in particular, to an audio transcoding method, apparatus, audio transcoder, device, and storage medium.
  • Embodiments of the present application provide an audio transcoding method, apparatus, audio transcoder, device, and storage medium, which can improve the speed and efficiency of audio transcoding.
  • the technical solution is as follows:
  • an audio transcoding method comprising:
  • Entropy decoding is performed on the first audio stream of the first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, where the excitation signal is a quantized audio signal;
  • a time-domain audio signal corresponding to the excitation signal is acquired based on the audio feature parameters and the excitation signal;
  • the excitation signal and the audio feature parameters are requantized based on the time-domain audio signal and the target transcoding rate to obtain a target excitation signal and target audio feature parameters;
  • Entropy coding is performed on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
  • an audio transcoder includes: a first processing unit, a second processing unit, a quantization unit, and a third processing unit, wherein the first processing unit is connected to the second processing unit and the quantization unit respectively, the second processing unit is connected to the quantization unit, and the quantization unit is connected to the third processing unit;
  • the first processing unit is used to perform entropy decoding on the first audio stream of the first code rate, and obtain the audio characteristic parameter and the excitation signal of the first audio stream, and the excitation signal is the quantized audio signal;
  • the second processing unit is configured to obtain a time-domain audio signal corresponding to the excitation signal based on the audio feature parameter and the excitation signal;
  • the quantization unit is used to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the target transcoding rate to obtain the target excitation signal and the target audio feature parameter;
  • the third processing unit is configured to perform entropy coding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
  • an audio transcoding device comprising:
  • a decoding module configured to perform entropy decoding on the first audio stream of the first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, where the excitation signal is a quantized audio signal;
  • a time-domain audio signal acquisition module configured to acquire a time-domain audio signal corresponding to the excitation signal based on the audio feature parameter and the excitation signal;
  • a quantization module for re-quantizing the excitation signal and the audio feature parameter based on the time domain audio signal and the target transcoding rate to obtain the target excitation signal and the target audio feature parameter;
  • An encoding module configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
  • a computer device comprising one or more processors and one or more memories, wherein the one or more memories store at least one computer program, and the computer program is loaded and executed by the one or more processors to implement the audio transcoding method.
  • a computer-readable storage medium is provided, and at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the audio transcoding method.
  • a computer program product or computer program comprising program code stored in a computer-readable storage medium, from which a processor of a computer device reads the program code; the processor executes the program code so that the computer device performs the above-mentioned audio transcoding method.
  • FIG. 1 is a schematic structural diagram of an encoder provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of an audio transcoding method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an audio transcoder provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a method for forward error correction coding provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an audio transcoding apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • Cloud Technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are based on the cloud computing business model; it can form a pool of resources that are used on demand, flexibly and conveniently. Cloud computing technology will become an important support, since the background services of technical network systems, such as video websites, picture websites and other portal websites, require a large amount of computing and storage resources.
  • Cloud computing is a computing mode that distributes computing tasks on a resource pool composed of a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed.
  • the network that provides the resources is called the “cloud”.
  • to users, the resources in the "cloud" appear infinitely expandable and can be obtained at any time, used on demand, expanded at any time, and paid for according to usage.
  • as a basic capability provider of cloud computing, a cloud platform establishes a cloud computing resource pool (generally referred to as an IaaS (Infrastructure as a Service) platform) and deploys multiple types of virtual resources in the resource pool for external customers to choose and use.
  • the cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices, and network devices.
  • Cloud conference is an efficient, convenient and low-cost conference form based on cloud computing technology. Users only need to perform simple, easy-to-use operations through an Internet interface to quickly and efficiently share voice, data files and video with teams and customers around the world, while complex technologies such as data transmission and processing within the conference are handled for the user by the cloud conference service provider.
  • the cloud conference system supports multi-server dynamic cluster deployment and provides multiple high-performance servers, which greatly improves the stability, security and availability of conferences.
  • video conferencing has been welcomed by many users because it can greatly improve communication efficiency, continuously reduce communication costs, and upgrade internal management levels; it has been widely used in transportation, finance, operators, education, enterprises and other fields. There is no doubt that video conferencing based on cloud computing is more attractive in terms of convenience, speed and ease of use, and will surely stimulate a new upsurge in video conferencing applications.
  • Entropy coding is coding that loses no information, following the principle of entropy in the coding process; information entropy is the average amount of information of the source.
  • Quantization refers to the process of approximating a continuous value of a signal (or a large number of possible discrete values) into a finite number (or less) of discrete values.
  • In-band forward error correction is also called Forward Error Correction (FEC).
  • Audio coding is divided into two types: Multi-rate Coding and Scalable Coding.
  • the scalable coding stream has the following characteristics:
  • the low-rate code stream is a subset of the high-rate code stream.
  • when the network is congested, only the low-rate core code stream needs to be transmitted, which is more flexible; a multi-rate code stream does not have this feature.
  • the decoding result of a multi-rate code stream, however, is better than the decoding result of a scalable code stream.
  • OPUS is one of the most widely used audio encoders.
  • the OPUS encoder is a multi-rate encoder and cannot generate a cuttable code stream the way a scalable encoder can.
  • Figure 1 provides a schematic diagram of the structure of an OPUS encoder.
  • as shown in Figure 1, the OPUS encoder involves voice activity detection (VAD, Voice Activity Detection), long-term prediction (LTP, Long-Term Prediction), gain processing, LSF (Line Spectral Frequency) quantization, prediction, pre-filtering, noise shaping quantization, interval coding, and the like.
  • to transcode with this encoder, the decoder first decodes the encoded audio, and the OPUS encoder then re-encodes the decoded audio to change its bit rate. Since encoding with the OPUS encoder involves many steps, the encoding complexity is relatively high.
  • the computer device may be provided as a terminal or a server, and an implementation environment consisting of a terminal and a server will be introduced below.
  • FIG. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided by an embodiment of the present application.
  • the implementation environment may include a terminal 210 and a server 240 .
  • the terminal 210 is connected to the server 240 through a wireless network or a wired network.
  • the terminal 210 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal 210 has social applications installed and running.
  • the server 240 is an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
  • the server 240 can serve as the execution body of the audio transcoding method provided by the embodiments of the present application; that is, the terminal 210 can collect the audio signal and send it to the server 240, and the server 240 transcodes the audio signal and sends the transcoded audio to other terminals.
  • the terminal 210 generally refers to one of multiple terminals, and only the terminal 210 is used as an example in this embodiment of the present application.
  • the number of the above-mentioned terminals may be more or less.
  • there may be only one terminal, or tens or hundreds of terminals, or more; in this case, the above-mentioned implementation environment also includes other terminals.
  • the embodiments of the present application do not limit the number of terminals and device types.
  • the terminal is the terminal 210 in the above-mentioned implementation environment, and the server is the server 240 in the above-mentioned implementation environment.
  • the embodiments of the present application can be applied to various social applications, for example, to online conference applications, or to instant messaging applications, or to live broadcast applications, which are not limited in the embodiments of the present application.
  • an online conference application program is installed on the multiple terminals, and a user of each terminal is a participant of an online conference.
  • Multiple terminals are connected to the server through the network.
  • the server can transcode the audio signal uploaded by each terminal and then send the transcoded audio signal to the multiple terminals so that they can play it, thereby realizing the online meeting. Since the network environments of the multiple terminals may differ, in the process of transcoding the audio signal, the server can use the technical solutions provided by the embodiments of the present application to convert the audio signal to a bit rate that matches the network bandwidth of each terminal.
  • for a terminal with a larger network bandwidth, the server can transcode the audio signal at a higher bit rate; a higher bit rate means higher voice quality, which makes full use of the larger bandwidth and improves the quality of online conferences.
  • for a terminal with a smaller network bandwidth, the server can transcode the audio signal at a lower bit rate; a lower bit rate means less bandwidth occupation, so that the audio signal can be sent to the terminal in real time, ensuring the terminal's normal access to the online conference.
  • for any terminal, the network bandwidth may be larger at one time and smaller at another; the server can then also adjust the transcoding rate according to the network fluctuations, so as to ensure the normal progress of the online conference.
  • online meetings are also referred to as cloud meetings.
  • the user can conduct voice chat by installing the instant messaging application on the terminal.
  • the instant messaging application can obtain the audio signals of the two users during the chat through the two users' terminals and send the audio signals to the server; the server sends the audio signals to the two terminals respectively, and the instant messaging application plays the audio signal through the terminal, so that the voice chat between the two users is realized.
  • the network environment of the two parties in the voice chat may also be different, that is, one party has a larger network bandwidth and the other party has a smaller network bandwidth.
  • the server can use the technical solutions provided in the embodiments of the present application to transcode the audio signal, convert it to an appropriate bit rate, and then send it to the two terminals, so as to ensure that the two users can chat normally.
  • the terminal used by the anchor can collect the anchor's live audio signal and send it to the live server, and the live server sends the live audio signal to the viewer terminals used by different viewers. After receiving the live audio signal, a viewer terminal plays it, and the viewer can hear the anchor's voice during the live broadcast.
  • the server can use the technical solutions provided by the embodiments of the present application to transcode the live audio signals according to the network environments where different viewers are located, that is, to convert the live audio signals to different bit rates according to the viewers' different network bandwidths and send them to different viewers, so as to ensure that different viewers can play the live audio normally.
  • for viewers with a larger network bandwidth, the server can transcode the live audio signal at a higher bit rate; a higher bit rate means higher voice quality, which makes full use of the larger bandwidth and improves the quality of the live broadcast.
  • for viewers with a smaller network bandwidth, the server can transcode the live audio signal at a lower bit rate; a lower bit rate means less bandwidth occupation, which ensures that the live audio signal is sent to the viewers in real time so that they can watch the live broadcast normally.
  • the network bandwidth may be larger at one time and smaller at another; the server can then also adjust the transcoding rate according to fluctuations in the network bandwidth to ensure the normal progress of the live broadcast.
  • the audio transcoding method provided in the embodiments of the present application can be applied to the terminal as well as to the server as a cloud service, allowing the terminal to transcode audio quickly.
  • the embodiments of the present application do not limit the execution subject.
  • the server performs entropy decoding on the first audio stream of the first bit rate, and obtains audio characteristic parameters and an excitation signal of the first audio stream, where the excitation signal is a quantized audio signal.
  • the first audio stream is a high-bit-rate audio stream
  • the audio characteristic parameters include signal gain, LSF (Line Spectral Frequency) parameters, LTP (Long-Term Prediction) parameters, pitch delay, and the like.
  • Quantization refers to the process of approximating the continuous value of a signal to a finite number (or less) of discrete values.
  • the audio signal is a continuous signal, and the excitation signal obtained after quantization is a discrete signal; the discrete signal is convenient for the server's subsequent processing.
  • the high bit rate refers to the bit rate of the audio stream uploaded by the terminal to the server.
  • the high bit rate may also be a bit rate higher than a certain bit rate threshold, for example, the bit rate threshold is 1Mbps, then a bit rate higher than 1Mbps is also called a high bit rate.
  • the definition of high bit rate may be different, which is not limited in this embodiment of the present application.
  • the audio information is a speech signal.
  • the server acquires a time-domain audio signal corresponding to the excitation signal based on the audio feature parameter and the excitation signal.
  • the excitation signal is a discrete signal
  • the server can restore the excitation signal to a time-domain audio signal based on the audio characteristic parameters for subsequent audio transcoding.
  • the server re-quantizes the excitation signal and the audio feature parameter based on the time-domain audio signal and the target transcoding rate to obtain the target excitation signal and the target audio feature parameter.
  • re-quantization may also be called Noise Shaping Quantization (NSQ).
  • the requantization process is also a compression process; the server requantizing the excitation signal and the audio feature parameters is thus the process of recompressing the excitation signal and the audio feature parameters.
  • the server performs entropy coding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
  • after the audio feature parameters and the excitation signal are requantized, they are recompressed; entropy coding is then performed on the requantized audio feature parameters and excitation signal, and a second audio stream with a lower bit rate can be obtained directly.
  • entropy decoding is used to obtain audio feature parameters and excitation signals.
  • requantization is likewise performed on the excitation signal and the audio feature parameters, and does not involve processing of the time-domain signal.
  • finally, entropy coding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a smaller code rate. Since the computational complexity of entropy decoding and entropy coding is small and the time-domain signal does not need to be re-encoded, the computational complexity can be greatly reduced, thereby improving the overall speed and efficiency of audio transcoding while ensuring sound quality.
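  • As an illustration only, the following Python sketch outlines the four stages just described; all function names (entropy_decode, synthesize_time_domain, requantize, entropy_encode) are hypothetical placeholders standing in for the steps of the embodiment, not a real codec API.

    # Schematic sketch of the transcoding pipeline described above.
    # Every helper here is an illustrative placeholder, not a real API.
    def transcode(first_audio_stream: bytes, target_rate: int) -> bytes:
        """Lower the bit rate of an audio stream without full re-encoding."""
        # 1. Entropy decoding: recover the audio feature parameters and the
        #    already-quantized excitation signal from the high-rate stream.
        feature_params, excitation = entropy_decode(first_audio_stream)
        # 2. Reconstruct the time-domain audio signal from the excitation
        #    signal and the feature parameters (LTP/LPC synthesis, resampling).
        time_domain = synthesize_time_domain(feature_params, excitation)
        # 3. Requantize the excitation signal and the feature parameters,
        #    guided by the time-domain signal and the target transcoding rate.
        target_excitation, target_params = requantize(
            excitation, feature_params, time_domain, target_rate)
        # 4. Entropy encoding: produce the second, lower-rate audio stream.
        return entropy_encode(target_params, target_excitation)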
  • the method includes:
  • the server performs entropy decoding on the first audio stream at the first bit rate, and obtains audio characteristic parameters and an excitation signal of the first audio stream, where the excitation signal is a quantized audio signal.
  • the server obtains the occurrence probability of multiple coding units in the first audio stream.
  • the server decodes the first audio stream based on the occurrence probability, and obtains a plurality of decoding units corresponding to the plurality of coding units respectively.
  • the server combines multiple decoding units to obtain audio characteristic parameters and excitation signals of the first audio stream.
  • a coding unit is the smallest unit used when encoding the audio stream.
  • the above embodiment is a possible implementation of entropy decoding.
  • an entropy encoding method corresponding to the above embodiment is first described below.
  • the server obtains the occurrence probabilities of multiple coding units in the audio feature parameters and the excitation signal of the first audio stream.
  • the server determines the initial interval corresponding to the first audio stream.
  • the server divides the initial interval into a plurality of first-level sub-intervals based on the occurrence probability of the plurality of coding units.
  • the ratio between the lengths of the first-level sub-intervals is the same as the ratio between the occurrence probabilities of the plurality of coding units.
  • the server divides the first-level sub-interval into a plurality of second-level sub-intervals based on the occurrence probability of the plurality of coding units.
  • the second-level sub-intervals respectively correspond to combinations of the first coding unit of the plurality of coding units with each coding unit of the plurality of coding units.
  • the server determines a target second-level sub-interval from the plurality of second-level sub-intervals based on the order in which the plurality of coding units appear in the first audio stream, and continues the division based on the target second-level sub-interval.
  • the server repeats the above steps until a K-level sub-interval is obtained; the K-level sub-interval is the sub-interval corresponding to the combination of the plurality of coding units, where K is a positive integer equal to the number of the plurality of coding units.
  • the server can use any numerical value in the K-level sub-interval to represent the first audio stream, and the numerical value is also an encoded value of entropy encoding of the first audio stream.
  • each letter in “MNOOP” is a coding unit, and the "MNOOP” can represent the audio characteristic parameter and excitation signal of the first audio stream.
  • in "MNOOP", the letter "M" occurs once, "N" once, "O" twice, and "P" once. Since "MNOOP" consists of 5 letters, the occurrence probabilities of "M", "N", "O" and "P" in "MNOOP" are 0.2, 0.2, 0.4 and 0.2, respectively.
  • the initial interval corresponding to "MNOOP" is [0, 100000].
  • the server divides the interval [0, 100000] into four sub-intervals: M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000], where the ratio between the lengths of each subinterval is the same as the ratio between the corresponding occurrence probabilities. Since the first letter in "MNOOP" is "M”, the server selects the first sub-interval M: [0, 20000] as the basic interval for subsequent entropy coding.
  • the server divides the interval M: [0, 20000] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since the first two letters in "MNOOP" are "MN", the server selects the second sub-interval MN: [4000, 8000] as the basic interval for subsequent entropy coding.
  • the server divides the interval MN: [4000, 8000] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the first three letters in "MNOOP" are "MNO", the server uses the third sub-interval MNO: [5600, 7200] as the base interval for subsequent entropy coding.
  • the server divides the interval MNO: [5600, 7200] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]. Since the first four letters in "MNOOP" are "MNOO", the server uses the third sub-interval MNOO: [6240, 6880] as the base interval for subsequent entropy coding.
  • the server divides the interval MNOO: [6240, 6880] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. The entropy encoding interval for "MNOOP" is therefore [6752, 6880], and the server can use any numerical value in the interval [6752, 6880] to represent the encoding result of "MNOOP", for example, 6800. In the above embodiment, 6800 is also the entropy-encoded first audio stream.
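  • To make this walkthrough concrete, the following minimal Python sketch reproduces the "MNOOP" example; the symbols, probabilities and initial interval [0, 100000] are taken from the example itself, while the function is illustrative only and omits the finite-precision renormalization a production range coder needs.

    # Minimal interval (range) encoder reproducing the "MNOOP" example.
    def interval_encode(message, probs, low=0.0, high=100000.0):
        """Narrow [low, high] once per symbol; any value in the final
        interval represents the whole message."""
        # Cumulative ranges, e.g. M -> [0.0, 0.2), N -> [0.2, 0.4), ...
        cum, start = {}, 0.0
        for sym, p in probs.items():
            cum[sym] = (start, start + p)
            start += p
        for sym in message:
            span = high - low
            lo_frac, hi_frac = cum[sym]
            low, high = low + lo_frac * span, low + hi_frac * span
        return low, high

    probs = {"M": 0.2, "N": 0.2, "O": 0.4, "P": 0.2}
    print(interval_encode("MNOOP", probs))  # (6752.0, 6880.0), as in the text
    # Any value in [6752, 6880], e.g. 6800, represents "MNOOP".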
  • the server obtains the occurrence probability of the plurality of coding units in the first audio stream.
  • the server determines an initial interval corresponding to the first audio stream, and the initial interval is the same initial interval as the entropy encoding process.
  • the server divides the initial interval into a plurality of first-level sub-intervals based on the occurrence probability of the plurality of coding units.
  • the ratio between the lengths of the first-level sub-intervals is the same as the ratio between the occurrence probabilities of the plurality of coding units.
  • the server compares the encoded value of the first audio stream with the plurality of first-level sub-intervals and determines the first-level sub-interval to which the encoded value belongs as the target first-level sub-interval; the coding unit corresponding to the target first-level sub-interval is the first coding unit of the first audio stream.
  • the server divides the target first-level sub-interval into a plurality of second-level sub-intervals based on the occurrence probability of the plurality of coding units.
  • the server determines a target second-level sub-interval from the plurality of second-level sub-intervals; the two coding units corresponding to the target second-level sub-interval are the first two coding units corresponding to the first audio stream.
  • the server performs subsequent decoding based on the target second-level sub-interval until the target K-level sub-interval is obtained. The K coding units corresponding to the target K-level sub-interval are all the coding units corresponding to the first audio stream, where K is a positive integer equal to the number of the plurality of coding units.
  • the server obtains the occurrence probabilities of the multiple coding units in the first audio stream, that is, the occurrence probabilities of "M", "N", "O" and "P", which are 0.2, 0.2, 0.4 and 0.2, respectively.
  • the server constructs the same initial interval [0, 100000] as in the entropy coding process and divides it into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000]. Since the encoded value 6800 of the first audio stream is in the first sub-interval M: [0, 20000], the server uses the interval [0, 20000] as the basic interval for subsequent entropy decoding, and M as the first decoded unit.
  • the server divides the interval M: [0, 20000] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since the encoded value 6800 is in the second sub-interval MN: [4000, 8000], the server uses the sub-interval [4000, 8000] as the basic interval for subsequent entropy decoding, and N as the second decoded unit.
  • the server divides the interval MN: [4000, 8000] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the encoded value 6800 is in the third sub-interval MNO: [5600, 7200], the server uses the sub-interval [5600, 7200] as the basic interval for subsequent entropy decoding, and O as the third decoded unit.
  • the server divides the interval MNO: [5600, 7200] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]. Since the encoded value 6800 is in the third sub-interval MNOO: [6240, 6880], the server uses the sub-interval [6240, 6880] as the basic interval for subsequent entropy decoding, and O as the fourth decoded unit.
  • the server divides the interval MNOO: [6240, 6880] into four sub-intervals according to the occurrence probabilities of "M", "N", "O" and "P": MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. Since the encoded value 6800 is in the fourth sub-interval MNOOP: [6752, 6880], the server takes P as the fifth decoded unit. The server then combines the five decoded units "M", "N", "O", "O" and "P" to obtain "MNOOP", which is the audio characteristic parameters and excitation signal of the first audio stream.
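  • The matching decoder can be sketched the same way: at each step it finds the sub-interval containing the encoded value 6800, emits that sub-interval's symbol, and narrows the interval. Like the encoder sketch above, this is illustrative only.

    # Minimal interval decoder for the walkthrough above.
    def interval_decode(value, probs, length, low=0.0, high=100000.0):
        cum, start = {}, 0.0
        for sym, p in probs.items():
            cum[sym] = (start, start + p)
            start += p
        out = []
        for _ in range(length):
            span = high - low
            for sym, (lo_frac, hi_frac) in cum.items():
                sub_low, sub_high = low + lo_frac * span, low + hi_frac * span
                if sub_low <= value < sub_high:  # value falls in this sub-interval
                    out.append(sym)
                    low, high = sub_low, sub_high
                    break
        return "".join(out)

    probs = {"M": 0.2, "N": 0.2, "O": 0.4, "P": 0.2}
    print(interval_decode(6800, probs, length=5))  # MNOOP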
  • the server inputs the first audio stream to the interval decoder 501 to perform entropy decoding on the first audio stream.
  • the entropy decoding process can be referred to the above example, and details are not repeated here.
  • an entropy-decoded audio stream is obtained.
  • the server inputs the entropy-decoded audio stream into the parameter decoder 502, and the parameter decoder 502 outputs the flag bit pulse, the signal gain and the audio characteristic parameter.
  • the server inputs the flag pulse and the signal gain to the excitation signal generator 503 to obtain the excitation signal.
  • the server obtains, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.
  • the server processes the excitation signal based on the audio characteristic parameter to obtain a time-domain audio signal corresponding to the excitation signal.
  • the server inputs the audio characteristic parameters and the excitation signal into the frame reconstruction module 504, and the frame reconstruction module 504 outputs the audio signal after frame reconstruction.
  • the server inputs the audio signal after frame reconstruction into the sampling rate conversion filter 505, and performs resampling and encoding through the sampling rate conversion filter 505 to obtain a time-domain audio signal corresponding to the excitation signal.
  • the server can also input the frame-reconstructed audio signal into the stereo separation module 506, which divides the frame-reconstructed audio signal into mono audio signals.
  • the server inputs the monaural audio signal into the sampling rate conversion filter 505 for resampling coding, and obtains the time domain audio signal corresponding to the excitation signal.
  • the audio characteristic parameters include signal gain, LSF (Line Spectral Frequency) coefficients, LTP (Long-Term Prediction) coefficients, pitch delay, and the like.
  • the frame reconstruction module includes an LTP synthesis filter and an LPC (Linear Predictive Coding) synthesis filter.
  • the server inputs the excitation signal, together with the pitch delay and the LTP coefficients from the audio characteristic parameters, into the LTP synthesis filter, and the LTP synthesis filter performs a first frame reconstruction on the excitation signal to obtain a first filtered audio signal.
  • the server inputs the first filtered audio signal, the LSF coefficient and the signal gain into the LPC synthesis filter, and the LPC synthesis filter performs the second frame reconstruction on the first filtered audio signal to obtain the second filtered audio signal.
  • the server fuses the first filtered audio signal and the second filtered audio signal to obtain a frame-reconstructed audio signal.
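  • A simplified numpy sketch of this two-stage reconstruction follows. The pitch delay, LTP gain, and LPC coefficients are toy values chosen for illustration (a real codec varies them per subframe), and the sketch simply chains the two filters rather than reproducing the exact fusion step of the embodiment.

    import numpy as np

    def ltp_synthesis(excitation, pitch_delay, ltp_gain):
        """First frame reconstruction: y[n] = e[n] + g * y[n - pitch_delay]."""
        y = np.zeros(len(excitation))
        for n in range(len(excitation)):
            past = y[n - pitch_delay] if n >= pitch_delay else 0.0
            y[n] = excitation[n] + ltp_gain * past
        return y

    def lpc_synthesis(signal, lpc_coeffs, gain):
        """Second frame reconstruction: y[n] = gain * x[n] + sum_k a_k * y[n - k]."""
        y = np.zeros(len(signal))
        for n in range(len(signal)):
            acc = gain * signal[n]
            for k, a in enumerate(lpc_coeffs, start=1):
                if n >= k:
                    acc += a * y[n - k]
            y[n] = acc
        return y

    excitation = np.random.default_rng(0).standard_normal(160)  # one 20 ms frame at 8 kHz
    first_filtered = ltp_synthesis(excitation, pitch_delay=40, ltp_gain=0.5)
    reconstructed = lpc_synthesis(first_filtered, lpc_coeffs=[1.2, -0.5], gain=0.8)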
  • the server obtains a first quantization parameter through at least one iteration process based on the target transcoding rate, where the first quantization parameter is used to adjust the first rate of the first audio stream to the target transcoding rate.
  • the server obtains the first quantization parameter through at least one iterative process, and in any iterative process, the server determines the first candidate quantization parameter based on the target transcoding rate.
  • the server simulates the requantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter, and obtains a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter.
  • the server simulates the entropy encoding process of the first signal and the first parameter to obtain an analog audio stream.
  • in a case where the analog audio stream meets the first target condition and the second target condition, the first candidate quantization parameter is determined as the first quantization parameter.
  • the processing process includes four parts; that is, the server first determines a candidate quantization parameter, requantizes the excitation signal and the audio feature parameters according to the candidate quantization parameter, and obtains the first signal and the first parameter.
  • the server can simulate the entropy encoding process of the first signal and the first parameter to obtain an analog audio stream.
  • the server evaluates the analog audio stream to determine whether it meets the requirements, and this determination is performed based on the first target condition and the second target condition. When both the first target condition and the second target condition are satisfied, the server can end the iteration and output the first quantization parameter. When either of the two conditions is not satisfied, the server can iterate again.
  • the server determines the first candidate quantization parameter based on the target transcoding rate.
  • the target transcoding rate can be determined by the server according to the actual situation, such as determining the target transcoding rate according to the network bandwidth, so that the target transcoding rate matches the network bandwidth.
  • the first candidate quantization parameter represents a quantization step size, and the larger the quantization step size, the larger the compression ratio, and the smaller the amount of quantized data.
  • the target transcoding code rate is lower than the first code rate of the first audio stream, so the audio transcoding process is a process of reducing the code rate of the audio stream.
  • the server can generate a first candidate quantization parameter based on the target transcoding rate; after the first candidate quantization parameter is used to requantize the excitation signal and the audio feature parameters, an audio stream with a lower code rate, close to the target transcoding rate, can be obtained.
  • the server simulates the requantization process of the excitation signal and the audio characteristic parameter based on the first candidate quantization parameter, and obtains a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter.
  • the above simulation means that the server does not requantize the excitation signal and the audio feature parameters themselves, but simulates the requantization process based on the first candidate quantization parameter, so as to subsequently determine the first quantization parameter used in the actual quantization process. Through this simulation process, the server can determine the most suitable first quantization parameter.
  • the server simulates the discrete cosine transformation process of the excitation signal and the discrete cosine transformation process of the audio feature parameters, respectively, to obtain the second signal corresponding to the excitation signal and the second parameter corresponding to the audio feature parameter.
  • the server divides the second signal and the second parameter by the first candidate quantization parameter, respectively, and performs rounding to obtain the first signal and the first parameter.
  • the server performs discrete cosine transform on the excitation signal to obtain the second signal.
  • the server requantizes the second signal using the quantization step size corresponding to the first candidate quantization parameter; that is, the second signal is divided by the quantization step size represented by the first candidate quantization parameter and then rounded, to obtain the first signal.
  • in some embodiments, the excitation signal is a matrix.
  • the server can perform a discrete cosine transform on the excitation signal, that is, use the following formula (1) to perform a discrete cosine transform on the excitation signal to obtain the second signal.
  • F(u) = c(u) · Σ_{i=0}^{N-1} f(i) · cos( (2i+1)·u·π / (2N) ), u = 0, 1, 2, ..., N-1    (1)
  • where F(u) is the second signal, u is the generalized frequency variable, f(i) is the excitation signal, N is the number of values in the excitation signal, i is the index of a value in the excitation signal, and c(u) is the normalization coefficient, with c(0) = √(1/N) and c(u) = √(2/N) for u ≥ 1.
  • requantization of the second signal is described next, taking a quantization step size of 28 as an example.
  • the server can re-quantize the second signal through the following formula (2) to obtain the first signal.
  • Q(m) = round( m / S )    (2)
  • where Q(·) is the quantization function, m is a value in the second signal, round(·) is the rounding function, and S is the quantization step size.
  • after requantizing the second signal with formula (2), the server obtains the first signal.
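  • As a numeric illustration of formulas (1) and (2), the sketch below applies a discrete cosine transform to a signal and then requantizes it with step size S = 28, the step size used in the example; the input vector is invented for illustration, since the original matrices are not reproduced in the text.

    import numpy as np

    def dct(f):
        """Formula (1): F(u) = c(u) * sum_i f(i) * cos((2i + 1) * u * pi / (2N))."""
        N = len(f)
        F = np.zeros(N)
        for u in range(N):
            c = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
            F[u] = c * sum(f[i] * np.cos((2 * i + 1) * u * np.pi / (2 * N))
                           for i in range(N))
        return F

    def quantize(F, S=28):
        """Formula (2): Q(m) = round(m / S) for each value m of the second signal."""
        return np.round(F / S).astype(int)

    f = np.array([100.0, 80.0, -40.0, 10.0])   # illustrative excitation values
    second_signal = dct(f)                      # formula (1)
    first_signal = quantize(second_signal)      # formula (2), step size 28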
  • the server simulates the entropy encoding process of the first signal and the first parameter to obtain an analog audio stream.
  • the server can divide the first signal into four vectors: (7, -1, 0, 0)T, (0, -1, 0, 0)T, (0, 0, 0, 0)T and (0, 0, 0, 0)T.
  • the server denotes the vector (7, -1, 0, 0)T as A, the vector (0, -1, 0, 0)T as B, and the vector (0, 0, 0, 0)T as C.
  • the first signal can therefore be simplified to (ABCC).
  • the probability of occurrence of coding units "A", “B” and “C” in (ABCC) is 0.25, 0.25 and 0.5, respectively, and the server generates an initial interval [0, 100000].
  • the server divides the initial interval [0, 100000] into three sub-intervals according to the occurrence probabilities of the coding units "A", "B" and "C": A: [0, 25000], B: [25000, 50000] and C: [50000, 100000]. Since the first letter in the first signal (ABCC) is "A", the server selects the first sub-interval A: [0, 25000] as the basic interval for subsequent entropy coding.
  • the server divides the interval A: [0, 25000] into three sub-intervals according to the occurrence probabilities of the coding units "A", "B" and "C": AA: [0, 6250], AB: [6250, 12500] and AC: [12500, 25000]. Since the second letter in the first signal (ABCC) is "B", the server selects the second sub-interval AB: [6250, 12500] as the basic interval for subsequent entropy coding.
  • the server divides the interval AB: [6250, 12500] into three sub-intervals according to the occurrence probabilities of the coding units "A", "B" and "C": ABA: [6250, 7812.5], ABB: [7812.5, 9375] and ABC: [9375, 12500]. Since the third letter in the first signal (ABCC) is "C", the server selects the third sub-interval ABC: [9375, 12500] as the basic interval for subsequent entropy coding.
  • the server divides the interval ABC: [9375, 12500] into three sub-intervals according to the occurrence probabilities of the coding units "A", "B" and "C": ABCA: [9375, 10156.25], ABCB: [10156.25, 10937.5] and ABCC: [10937.5, 12500]. The interval for entropy encoding of the first signal (ABCC) is therefore ABCC: [10937.5, 12500], and the server can use any value in this interval to represent the first signal (ABCC), for example, 12000.
  • the server can use any value in the interval [100, 130] to represent the analog audio stream, for example, 120.
  • the fourth part describes the first target condition and the second target condition.
  • that the analog audio stream complies with the first target condition refers to at least one of the following:
  • the bit rate of the analog audio stream is less than or equal to the target transcoding rate;
  • the audio stream quality parameter of the analog audio stream is greater than or equal to the quality parameter threshold.
  • the audio stream quality parameters include signal-to-noise ratio, PESQ (Perceptual Evaluation of Speech Quality) and POLQA (Perceptual Objective Listening Quality Analysis), etc.
  • the quality parameter threshold is set according to the actual situation, for example, according to the quality requirements of voice calls: when the quality requirements of voice calls are high, the quality parameter threshold can be set higher, and when the quality requirements are low, the quality parameter threshold can be set lower, which is not limited in this embodiment of the present application.
  • the second target condition is met when at least one of the following holds for the time-domain audio signal and the first signal, the target transcoding rate and the code rate of the analog audio stream, or the number of iterations:
  • the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold.
  • the difference between the target transcoding rate and the bit rate of the analog audio stream is less than or equal to the difference threshold.
  • the number of iterations is equal to the iteration count threshold. That is, in the iterative process, the similarity between the time-domain audio signal and the first signal is the first factor affecting the termination of the iteration, the difference between the target transcoding code rate and the code rate of the analog audio stream is the second factor, and the number of iterations is the third factor.
  • the server uses three factors to determine when to end an iteration.
  • the server can terminate the iteration and use the candidate quantization parameter corresponding to the current iteration as the first quantization parameter.
  • the server can obtain the first quantization parameter with fewer iterations, so that in the scenario of a real-time voice call, the transcoding can be completed at a faster speed.
  • the server does not perform a complete iterative process.
  • the above iterative process is also a noise shaping quantization (NSQ) loop iteration.
  • the limitation of the above-mentioned second target condition can also be called a greedy algorithm.
  • the use of the greedy algorithm can greatly improve the speed of audio transcoding.
  • the first is that other candidate quantization parameters are searched for directly around the quantization parameter of the first audio stream.
  • the second is that, when comparing the excitation signal with the time-domain audio signal, the number of iterations can be greatly reduced according to the above three factors.
  • the decoder may also be omitted and audio transcoding performed directly, which is not limited in this embodiment of the present application.
  • in a case where the analog audio stream does not meet the requirements, the server uses a second candidate quantization parameter determined based on the target transcoding rate as the input of the next iterative process. That is, when the iteration count threshold is greater than 1, if the first target condition and the second target condition are not both met, the server can re-determine a second candidate quantization parameter based on the target transcoding rate and perform the next iteration based on it.
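  • The search for the first quantization parameter can be sketched as the loop below. All helpers (initial_candidate, next_candidate, simulate_requantize, simulate_entropy_encode, bitrate, quality, similarity) and the threshold values are hypothetical placeholders for the simulation steps described above, not part of the embodiment.

    QUALITY_THRESHOLD = 3.5       # e.g. a PESQ-like score; assumed value
    SIMILARITY_THRESHOLD = 0.95   # assumed value
    RATE_DIFF_THRESHOLD = 2000    # in bits per second; assumed value

    def find_first_quantization_parameter(excitation, params, time_domain,
                                          target_rate, max_iterations):
        candidate = initial_candidate(target_rate)
        for iteration in range(1, max_iterations + 1):
            # Simulate requantization and entropy encoding with the candidate.
            first_signal, first_param = simulate_requantize(excitation, params, candidate)
            simulated_stream = simulate_entropy_encode(first_signal, first_param)
            # First target condition: at least one of rate and quality.
            cond1 = (bitrate(simulated_stream) <= target_rate
                     or quality(simulated_stream) >= QUALITY_THRESHOLD)
            # Second target condition: at least one of the three factors.
            cond2 = (similarity(time_domain, first_signal) >= SIMILARITY_THRESHOLD
                     or abs(target_rate - bitrate(simulated_stream)) <= RATE_DIFF_THRESHOLD
                     or iteration == max_iterations)
            if cond1 and cond2:
                return candidate  # becomes the first quantization parameter
            # Otherwise determine a second candidate and iterate again.
            candidate = next_candidate(target_rate)
        return candidate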
  • the server re-quantizes the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.
  • the target excitation signal is the re-quantized excitation signal
  • the target audio feature parameter is the re-quantized audio feature parameter
  • the server performs discrete cosine transform on the excitation signal and the audio characteristic parameter respectively, to obtain a third signal corresponding to the excitation signal and a third parameter corresponding to the audio characteristic parameter.
  • the server divides the third signal and the third parameter by the first quantization parameter and rounds it to obtain the target excitation signal and the target audio feature parameter.
  • the server performs entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
  • the server obtains the occurrence probabilities of multiple coding units in the target audio feature parameters and the target excitation signal.
  • the server encodes the plurality of coding units based on the occurrence probability to obtain a second audio stream.
  • the target audio feature parameters and the target excitation signal are "DEFFG", where each letter is a coding unit; the occurrence probabilities of "D", "E", "F" and "G" in "DEFFG" are 0.2, 0.2, 0.4 and 0.2, respectively, and the initial interval corresponding to "DEFFG" is [0, 100000].
  • the server divides the interval [0, 100000] into four sub-intervals: D: [0, 20000], E: [20000, 40000], F: [40000, 80000] and G: [80000, 100000], where the ratio between the lengths of each subinterval is the same as the ratio of the corresponding occurrence probability. Since the first letter in "DEFFG” is "D”, the server selects the first sub-interval D: [0, 20000] as the basic interval for subsequent entropy coding.
  • the server divides the interval D: [0, 20000] into four sub-intervals according to the occurrence probabilities of "D", "E", "F" and "G": DD: [0, 4000], DE: [4000, 8000], DF: [8000, 16000] and DG: [16000, 20000]. Since the first two letters in "DEFFG" are "DE", the server selects the second sub-interval DE: [4000, 8000] as the basic interval for subsequent entropy coding.
  • the server divides the interval DE: [4000, 8000] into four sub-intervals according to the occurrence probabilities of "D", "E", "F" and "G": DED: [4000, 4800], DEE: [4800, 5600], DEF: [5600, 7200] and DEG: [7200, 8000]. Since the first three letters in "DEFFG" are "DEF", the server uses the third sub-interval DEF: [5600, 7200] as the base interval for subsequent entropy coding.
  • the server divides the interval DEF: [5600, 7200] into four sub-intervals according to the occurrence probabilities of "D", "E", "F" and "G": DEFD: [5600, 5920], DEFE: [5920, 6240], DEFF: [6240, 6880] and DEFG: [6880, 7200]. Since the first four letters in "DEFFG" are "DEFF", the server uses the third sub-interval DEFF: [6240, 6880] as the base interval for subsequent entropy coding.
  • the server divides the interval DEFF: [6240, 6880] into four sub-intervals according to the occurrence probabilities of "D", "E", "F" and "G": DEFFD: [6240, 6368], DEFFE: [6368, 6496], DEFFF: [6496, 6752] and DEFFG: [6752, 6880]. The interval for entropy encoding "DEFFG" is therefore [6752, 6880], and the server can use any numerical value in the interval [6752, 6880] to represent the encoding result of "DEFFG", for example, 6800. In the above embodiment, 6800 is also the second audio stream.
  • the audio transcoding method provided in this embodiment of the present application can also be combined with other audio processing methods to improve the quality of audio transcoding.
  • the audio transcoding method provided by the embodiment of the present application can be combined with the forward error correction (FEC) encoding method.
  • forward error correction methods can be used to encode the audio. The essence of forward error correction is to add redundant information so that errors can be corrected when they occur; the redundant information is information related to the previous N frames of the current audio frame, where N is a positive integer.
  • the server performs forward error correction encoding on the subsequently received audio stream based on the second audio stream.
  • for example, if one audio stream is one audio frame, the second audio stream is denoted as frame T-1, and the audio stream subsequently received from the terminal is denoted as frame T. When the server encodes frame T, it can add frame T-1, that is, the second audio stream, as redundant information.
  • this improves the network robustness during audio stream transmission, where network robustness refers to the performance against network fluctuations.
  • the server can use the audio transcoding method provided in the embodiments of this application to adjust the bit rates of frame T-1 and frame T-2 so as to reduce their code rates, and then use the in-band forward error correction method to encode the adjusted frame T-1, the adjusted frame T-2 and frame T to obtain the encoded FEC code stream. Since the code rates of frame T-1 and frame T-2 are reduced, the overall code rate of the encoded FEC code stream is also reduced, thereby improving the network robustness during audio stream transmission while ensuring audio quality.
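  • A minimal sketch of this in-band FEC packing is shown below; FecPacket and transcode_to_lower_rate are hypothetical names, with the transcoding function standing in for the method of this application.

    from dataclasses import dataclass, field

    @dataclass
    class FecPacket:
        frame_t: bytes                                   # current frame at its normal rate
        redundancy: list = field(default_factory=list)   # reduced-rate frames T-1, T-2

    def pack_with_fec(frame_t: bytes, prev_frames: list, target_rate: int) -> FecPacket:
        # Reduce the bit rate of the previous frames before embedding them,
        # so the overall FEC stream rate stays low while quality is preserved.
        reduced = [transcode_to_lower_rate(f, target_rate) for f in prev_frames]
        return FecPacket(frame_t=frame_t, redundancy=reduced)

    # On the receiving side, if frame T-1 is lost, the decoder can fall back
    # to the reduced-rate copy of T-1 carried in the packet for frame T.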
In addition, an embodiment of the present application also provides an audio transcoder; the structure of the audio transcoder is shown in FIG. 7. The audio transcoder includes: a first processing unit 701, a second processing unit 702, a quantization unit 703 and a third processing unit 704, where the first processing unit 701 is connected to the second processing unit 702 and the quantization unit 703 respectively, the second processing unit 702 is connected to the quantization unit 703, and the quantization unit 703 is connected to the third processing unit 704. In some embodiments, the audio transcoder provided by the embodiments of the present application is also referred to as a downlink transcoder.
The first processing unit 701 is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

The second processing unit 702 is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

The quantization unit 703 is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter. In some embodiments, the quantization unit 703 is also referred to as a fast noise shaping quantization unit.

The third processing unit 704 is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
In some embodiments, during transcoding, the first processing unit 701 can send the audio feature parameter and the excitation signal to the second processing unit 702 and the quantization unit 703 respectively. The second processing unit 702 can obtain the audio feature parameter and the excitation signal from the first processing unit and, based on them, obtain the time-domain audio signal corresponding to the excitation signal; it can then send the time-domain audio signal to the quantization unit 703. The quantization unit 703 can receive the target transcoding bit rate, the audio feature parameter, the excitation signal and the time-domain audio signal, and re-quantize the excitation signal and the audio feature parameter. The quantization unit 703 can send the target audio feature parameter and the target excitation signal to the third processing unit 704, and the third processing unit 704 performs entropy encoding on them, thereby obtaining the second audio stream with the second bit rate.
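Read as a data flow, the four units form a linear pipeline. The following is a minimal sketch of that flow under stated assumptions: the four stage functions are hypothetical stand-ins with toy bodies so the sketch runs end to end, and only the order of operations mirrors the description above.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the four stages (toy bodies, illustrative only).
def entropy_decode(stream):               # first processing unit
    return {"gain": 1.0}, [7, -1, 0, 0]

def synthesize(params, excitation):       # second processing unit
    return [params["gain"] * x for x in excitation]

def requantize(excitation, params, time_domain, target_rate):  # quantization unit
    return excitation, params

def entropy_encode(params, excitation):   # third processing unit
    return bytes(len(excitation))

@dataclass
class DownlinkTranscoder:
    target_rate: int  # target transcoding bit rate, e.g. in bits per second

    def transcode(self, first_stream: bytes) -> bytes:
        params, excitation = entropy_decode(first_stream)
        time_domain = synthesize(params, excitation)
        tgt_exc, tgt_params = requantize(excitation, params,
                                         time_domain, self.target_rate)
        return entropy_encode(tgt_params, tgt_exc)

second_stream = DownlinkTranscoder(target_rate=16000).transcode(b"\x01\x02")
```

The key design point visible in the sketch is that no stage re-encodes the time-domain signal itself; the time-domain signal is only an input to the re-quantization decision.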
In one possible implementation, the quantization unit 703 is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate, and to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.

In one possible implementation, the quantization unit 703 is configured to, in any iteration process, determine a first candidate quantization parameter based on the target transcoding bit rate; simulate the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter; simulate the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream; and, in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, determine the first candidate quantization parameter as the first quantization parameter.
That the simulated audio stream meets the first target condition refers to at least one of the following: the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate; the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.

That at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to at least one of the following: the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold; the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold; the number of completed iterations is equal to an iteration count threshold.
In one possible implementation, the quantization unit 703 is configured to: simulate the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio feature parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter; and divide the second signal and the second parameter by the first candidate quantization parameter respectively and then round the results, to obtain the first signal and the first parameter.
In one possible implementation, the quantization unit 703 is further configured to: in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process.
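Taken together, the implementations above describe a bounded search over candidate step sizes. A minimal sketch of that loop follows, assuming illustrative helpers (propose_step, simulate_requantize, simulate_entropy_encode, bitrate, quality, similarity) and illustrative threshold values; none of these names or numbers come from the disclosure.

```python
# Illustrative helpers with toy bodies so the sketch runs; real versions are
# the simulated re-quantization, simulated entropy encoding and quality
# measures described above.
def propose_step(target_rate):              return 28
def simulate_requantize(exc, params, step): return [round(x / step + 0.5) for x in exc], params
def simulate_entropy_encode(sig, par):      return bytes(8)
def bitrate(stream):                        return 8 * len(stream) * 50  # 50 frames/s
def quality(stream):                        return 4.0                   # e.g. a PESQ-like score
def similarity(a, b):                       return 0.95

def find_first_quantization_parameter(exc, params, time_domain, target_rate,
                                      max_iters=3, quality_thresh=3.5,
                                      sim_thresh=0.9, diff_thresh=1000):
    step = propose_step(target_rate)                      # first candidate parameter
    for it in range(1, max_iters + 1):
        first_sig, first_par = simulate_requantize(exc, params, step)
        stream = simulate_entropy_encode(first_sig, first_par)
        rate = bitrate(stream)
        # first target condition: at least one of bit rate / quality holds
        cond1 = rate <= target_rate or quality(stream) >= quality_thresh
        # second target condition: at least one of similarity / rate gap /
        # iteration count holds
        cond2 = (similarity(time_domain, first_sig) >= sim_thresh
                 or abs(target_rate - rate) <= diff_thresh
                 or it == max_iters)
        if cond1 and cond2:
            return step                                   # first quantization parameter
        step = propose_step(target_rate)                  # second candidate, next pass
    return step

step = find_first_quantization_parameter([195, -40, -8, -6], {"gain": 1.0},
                                         [1.0, 0.5], target_rate=16000)
```

The iteration cap is what makes the search greedy: even if neither the similarity nor the rate-gap test passes, the loop terminates once the count threshold is reached.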
In one possible implementation, the first processing unit 701 is configured to: obtain the occurrence probabilities of multiple coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units; and combine the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream.

In one possible implementation, the third processing unit 704 is configured to: obtain the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal; and encode the multiple coding units based on the occurrence probabilities to obtain the second audio stream.
In one possible implementation, the audio transcoder further includes a forward error correction unit, connected to the third processing unit 704 and configured to perform forward error correction encoding on the subsequently received audio stream based on the second audio stream.
It should be noted that when the audio transcoder provided by the above embodiment performs audio transcoding, the division into the above functional units is merely used as an example for illustration. In practical applications, the above functions may be allocated to different functional units as required, that is, the internal structure of the audio transcoder may be divided into different functional units to complete all or part of the functions described above. In addition, the audio transcoder provided by the above embodiment belongs to the same concept as the audio transcoding method embodiments; for its specific implementation process, refer to the method embodiments, which are not repeated here.
Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; instead, entropy decoding is used to obtain the audio feature parameters and the excitation signal. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the computational cost of entropy decoding and entropy encoding is small, and the time-domain signal is not processed, the amount of computation can be greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring sound quality.
FIG. 8 is a schematic structural diagram of an audio transcoding apparatus provided by an embodiment of the present application. Referring to FIG. 8, the apparatus includes: a decoding module 801, a time-domain audio signal acquisition module 802, a quantization module 803 and an encoding module 804.

The decoding module 801 is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

The time-domain audio signal acquisition module 802 is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

The quantization module 803 is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter.

The encoding module 804 is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
In one possible implementation, the quantization module is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate, and to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.

In one possible implementation, the quantization module is configured to, in any iteration process, determine a first candidate quantization parameter based on the target transcoding bit rate; simulate the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter; simulate the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream; and, in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, determine the first candidate quantization parameter as the first quantization parameter.
That the simulated audio stream meets the first target condition refers to at least one of the following: the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate; the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.

That at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to at least one of the following: the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold; the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold; the number of completed iterations is equal to an iteration count threshold.
In one possible implementation, the quantization module is configured to simulate the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio feature parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter, and to divide the second signal and the second parameter by the first candidate quantization parameter respectively and then round the results, to obtain the first signal and the first parameter.
In one possible implementation, the quantization module is further configured to: in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process.
In one possible implementation, the decoding module is configured to: obtain the occurrence probabilities of multiple coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units; and combine the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream.

In one possible implementation, the encoding module is configured to obtain the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal, and encode the multiple coding units based on the occurrence probabilities to obtain the second audio stream.

In one possible implementation, the apparatus further includes a forward error correction module, configured to perform forward error correction encoding on the subsequently received audio stream based on the second audio stream.
It should be noted that when the audio transcoding apparatus provided in the above embodiments performs audio transcoding, the division into the above functional modules is merely used as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the audio transcoding apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio transcoding apparatus provided in the above embodiments belongs to the same concept as the audio transcoding method embodiments; for its specific implementation process, refer to the method embodiments, which are not repeated here.
Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; instead, entropy decoding is used to obtain the audio feature parameters and the excitation signal. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the computational cost of entropy decoding and entropy encoding is small, and the time-domain signal is not processed, the amount of computation can be greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring sound quality.
An embodiment of the present application provides a computer device configured to execute the above method. The computer device can be implemented as a terminal or a server, and the structure of the terminal is introduced first below.

FIG. 9 is a schematic structural diagram of a terminal provided by an embodiment of the present application. The terminal 900 can be a smartphone, a tablet computer, a notebook computer or a desktop computer. The terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal or other names. Generally, the terminal 900 includes one or more processors 901 and one or more memories 902.
The processor 901 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.

The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 is used to store at least one computer program, and the at least one computer program is executed by the processor 901 to implement the audio transcoding method provided by the method embodiments in this application.
Those skilled in the art can understand that the structure shown in FIG. 9 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown in the figure, or combine some components, or adopt a different component arrangement.
The above computer device can also be implemented as a server, and the structure of the server is introduced below. The server 1000 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1001 and one or more memories 1002, where at least one computer program is stored in the one or more memories 1002 and is loaded and executed by the one or more processors 1001 to implement the methods provided by the above method embodiments. Certainly, the server 1000 may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and the server 1000 may also include other components for implementing device functions, which are not repeated here.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a computer program, is also provided, and the above computer program can be executed by a processor to complete the audio transcoding method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device or the like.

In an exemplary embodiment, a computer program product or computer program is also provided. The computer program product or computer program includes program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes the program code, so that the computer device executes the above audio transcoding method.

Abstract

An audio transcoding method, a transcoding apparatus, a transcoder, and corresponding computer device and computer-readable storage medium. The method includes: performing entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal (401); obtaining, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal (402); obtaining a first quantization parameter through at least one iteration process based on a target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate (403); re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain a target excitation signal and a target audio feature parameter (404); and performing entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate (405).

Description

Audio transcoding method and apparatus, audio transcoder, device, and storage medium

This application claims priority to Chinese Patent Application No. 202110218868.9, entitled "Audio transcoding method and apparatus, audio transcoder, device, and storage medium", filed on February 26, 2021, and to Chinese Patent Application No. 202111619099.X, entitled "Audio transcoding method and apparatus, audio transcoder, device, and storage medium", filed on December 27, 2021, the entire contents of both of which are incorporated herein by reference.

Technical Field

This application relates to the field of audio processing, and in particular to an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium.

Background

With the development of network technology, more and more users conduct voice chats through social applications.

Summary

The embodiments of this application provide an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium, which can improve the speed and efficiency of audio transcoding. The technical solutions are as follows:
In one aspect, an audio transcoding method is provided, the method including:

performing entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;

obtaining, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;

re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;

performing entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

In one aspect, an audio transcoder is provided, the audio transcoder including: a first processing unit, a second processing unit, a quantization unit and a third processing unit, where the first processing unit is connected to the second processing unit and the quantization unit respectively, the second processing unit is connected to the quantization unit, and the quantization unit is connected to the third processing unit;

the first processing unit is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;

the second processing unit is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;

the quantization unit is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;

the third processing unit is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

In one aspect, an audio transcoding apparatus is provided, the apparatus including:

a decoding module, configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;

a time-domain audio signal acquisition module, configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;

a quantization module, configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;

an encoding module, configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, the one or more memories storing at least one computer program, the computer program being loaded and executed by the one or more processors to implement the audio transcoding method.

In one aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing at least one computer program, the computer program being loaded and executed by a processor to implement the audio transcoding method.

In one aspect, a computer program product or computer program is provided, the computer program product or computer program including program code, the program code being stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes the program code, so that the computer device executes the above audio transcoding method.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.

FIG. 1 is a schematic structural diagram of an encoder provided by an embodiment of this application;

FIG. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided by an embodiment of this application;

FIG. 3 is a flowchart of an audio transcoding method provided by an embodiment of this application;

FIG. 4 is a flowchart of an audio transcoding method provided by an embodiment of this application;

FIG. 5 is a schematic structural diagram of a decoder provided by an embodiment of this application;

FIG. 6 is a schematic diagram of a forward error correction encoding method provided by an embodiment of this application;

FIG. 7 is a schematic structural diagram of an audio transcoder provided by an embodiment of this application;

FIG. 8 is a schematic structural diagram of an audio transcoding apparatus provided by an embodiment of this application;

FIG. 9 is a schematic structural diagram of a terminal provided by an embodiment of this application;

FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.

In this application, the terms "first", "second" and the like are used to distinguish identical or similar items whose roles and functions are basically the same. It should be understood that there is no logical or temporal dependency among "first", "second" and "nth", and neither the quantity nor the execution order is limited.

In this application, the term "at least one" means one or more, and "multiple" means two or more.

In the related art, since different users have different network bandwidths, a social application needs to transcode the transmitted audio while users are voice chatting. For example, if a user's network bandwidth is low, the audio needs to be transcoded, that is, the bit rate of the audio needs to be reduced, to ensure that the user can voice chat normally.

However, in the process of audio transcoding, the complexity of transcoding is high, resulting in slow and inefficient audio transcoding.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like applied based on the cloud computing business model. Resources can be pooled and used on demand, which is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, for example video websites, image websites and more portal websites.

Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is called the "cloud". To users, the resources in the "cloud" appear to be infinitely expandable, and can be obtained at any time, used on demand, expanded at any time and paid for by use.

As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as a cloud platform, generally called an IaaS (Infrastructure as a Service) platform) is established, and multiple types of virtual resources are deployed in the resource pool for external customers to choose and use. The cloud computing resource pool mainly includes: computing devices (virtualized machines, including operating systems), storage devices and network devices.

A cloud conference is an efficient, convenient and low-cost form of conference based on cloud computing technology. Users only need to perform simple and easy operations through an Internet interface to quickly and efficiently share voice, data files and video with teams and customers around the world, while complex technologies such as the transmission and processing of conference data are handled by the cloud conference service provider.

At present, domestic cloud conferences mainly focus on service content with the SaaS (Software as a Service) model as the main body, including service forms such as telephony, network and video; a video conference based on cloud computing is called a cloud conference.

In the cloud conference era, data transmission, processing and storage are all handled by the computing resources of the video conference vendor. Users no longer need to purchase expensive hardware or install cumbersome software; they only need to open a browser and log in to the corresponding interface to hold an efficient remote conference.

The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security and availability. In recent years, video conferencing has been welcomed by many users because it can greatly improve communication efficiency, continuously reduce communication costs and upgrade internal management, and it has been widely used in fields such as transportation, finance, operators, education and enterprises. There is no doubt that after video conferencing adopts cloud computing, it is more attractive in terms of convenience, speed and ease of use, which will surely stimulate a new wave of video conferencing applications.
Entropy coding: entropy coding is coding that, according to the entropy principle, loses no information during the coding process; the information entropy is the average amount of information of the source.

Quantization: quantization is the process of approximating the continuous values of a signal (or a large number of possible discrete values) by a finite number of (or fewer) discrete values.

In-band forward error correction: in-band forward error correction, also called forward error correction (FEC), is a method of increasing the reliability of data communication. In a one-way communication channel, once an error is discovered, the receiver has no right to request retransmission. FEC is a method of transmitting redundant information along with the data, allowing the receiver to reconstruct the data when an error occurs during transmission.

Audio coding is divided into multi-rate coding and scalable coding. A scalable coded bitstream has the following characteristic: the low-bit-rate bitstream is a subset of the high-bit-rate bitstream, so when the network is congested only the low-bit-rate core bitstream can be transmitted, which is relatively flexible; a multi-rate coded bitstream does not have this characteristic. Generally speaking, however, at the same bit rate the decoding result of a multi-rate coded bitstream is better than that of a scalable coded bitstream.

OPUS is one of the most widely used audio encoders. The OPUS encoder is a multi-rate encoder and cannot generate a splittable bitstream the way a scalable encoder can. FIG. 1 provides a schematic structural diagram of an OPUS encoder. As can be seen from FIG. 1, when the OPUS encoder is used to encode audio, it needs to perform steps such as voice activity detection (VAD), pitch processing, noise shaping processing, LTP (Long-Term Prediction) scaling control, gain processing, LSF (Line Spectral Frequency) quantization, prediction, pre-filtering, noise shaping quantization and range encoding. When audio transcoding is required, an OPUS decoder must first decode the encoded audio, and the decoded audio is then re-encoded by the OPUS encoder to change the bit rate of the audio. Since encoding with the OPUS encoder involves many steps, the encoding complexity is high.
In the embodiments of this application, the computer device may be provided as a terminal or a server. The implementation environment composed of a terminal and a server is introduced below.

FIG. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided by an embodiment of this application. Referring to FIG. 2, the implementation environment may include a terminal 210 and a server 240.

The terminal 210 is connected to the server 240 through a wireless network or a wired network. Optionally, the terminal 210 is a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch or the like, but is not limited thereto. A social application is installed and running on the terminal 210.

Optionally, the server 240 is an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. In some embodiments, the server 240 can serve as the execution body of the audio transcoding method provided by the embodiments of this application: the terminal 210 collects an audio signal and sends it to the server 240, and the server 240 transcodes the audio signal and sends the transcoded audio to other terminals.

Optionally, the terminal 210 generally refers to one of multiple terminals; the embodiments of this application are described with the terminal 210 as an example only.

Those skilled in the art will appreciate that the number of the above terminals may be larger or smaller. For example, there may be only one terminal, or dozens or hundreds of terminals, or more, in which case the implementation environment also includes other terminals. The embodiments of this application do not limit the number or device types of the terminals.

All the above optional technical solutions can be combined arbitrarily to form optional embodiments of this application, which are not described one by one here.
After introducing the implementation environment of the embodiments of this application, the application scenarios of the embodiments of this application are introduced below with reference to the above implementation environment. In the following description, the terminal is the terminal 210 in the above implementation environment, and the server is the server 240. The embodiments of this application can be applied to various social applications, for example online conference applications, instant messaging applications, or live streaming applications, which is not limited in the embodiments of this application.

In an online conference application, there are often multiple terminals on which the online conference application is installed, and the user of each terminal is a participant of an online conference. The multiple terminals are all connected to the server through the network. During an online conference, the server can transcode the audio signals uploaded by each terminal and send the transcoded audio signals to the multiple terminals, so that all the terminals can play the audio signals, thereby implementing the online conference. Since the network environments of the terminals may differ, when transcoding the audio signals the server can use the technical solutions provided by the embodiments of this application to convert the audio signals into different bit rates according to the network bandwidths of different terminals, and send audio signals of different bit rates to different terminals, thereby ensuring that all terminals can participate normally in the online conference. That is, for a terminal with a large network bandwidth, the server can transcode the audio signal at a higher bit rate; a higher bit rate means higher speech quality, which makes full use of the larger bandwidth and improves the quality of the online conference. For a terminal with a small network bandwidth, the server can transcode the audio signal at a lower bit rate; a lower bit rate means less bandwidth occupation, so the audio signal can be sent to the terminal in real time, ensuring normal access to the online conference. In addition, networks fluctuate: for the same terminal, the available network bandwidth may be larger at one moment and smaller at another. The server can therefore also adjust the transcoding bit rate according to network fluctuations to ensure that the online conference proceeds normally. In some embodiments, an online conference is also called a cloud conference.

In an instant messaging application, users can voice chat by installing the instant messaging application on their terminals. Taking two users voice chatting through an instant messaging application as an example: the instant messaging application obtains, through the two users' terminals, their audio signals during the chat and sends the audio signals to the server, and the server sends the audio signals to the two terminals respectively; the instant messaging application plays the audio signals through the terminals, thereby implementing a voice chat between the two users. As in the online conference scenario, the network environments of the two parties may differ, that is, one party's network bandwidth is large and the other's is small. In this case, the server can use the technical solutions provided by the embodiments of this application to transcode the audio signals into suitable bit rates before sending them to the two terminals, thereby ensuring that the two users can voice chat normally.

In a live streaming application, the streamer client collects the streamer's live audio signal and sends it to the live streaming server, and the live streaming server sends the live audio signal to the viewer clients used by different viewers. After receiving the live audio signal, a viewer client plays it, and the viewer can hear the streamer's voice during the live stream. Since different viewer clients may be in different network environments, the server can use the technical solutions provided by the embodiments of this application to transcode the live audio signal according to the network environments of the viewer clients, that is, convert the live audio signal into different bit rates according to the viewer clients' network bandwidths and send audio signals of different bit rates to different viewer clients, thereby ensuring that all viewer clients can play the live audio normally. That is, for a viewer client with a large network bandwidth, the server can transcode the live audio signal at a higher bit rate; a higher bit rate means higher speech quality, which makes full use of the larger bandwidth and improves the quality of the live stream. For a viewer client with a small network bandwidth, the server can transcode the live audio signal at a lower bit rate; a lower bit rate means less bandwidth occupation, which ensures that the live audio signal is sent to the viewer client in real time and the viewer can watch the live stream normally. In addition, networks fluctuate: for the same viewer client, the available bandwidth may be larger at one moment and smaller at another. The server can therefore also adjust the transcoding bit rate according to bandwidth fluctuations to ensure that the live stream proceeds normally.

In addition to the above three application scenarios, the technical solutions provided by the embodiments of this application can also be applied to other audio transmission scenarios, for example radio and television transmission or satellite communication, which is not limited in the embodiments of this application.

Of course, besides being applied on a server as a cloud service, the audio transcoding method provided by the embodiments of this application can also be applied on a terminal, with the terminal quickly transcoding the audio; the embodiments of this application do not limit the execution body.
After introducing the implementation environment and application scenarios of the embodiments of this application, the technical solutions provided by the embodiments of this application are described below, taking a server as the body executing the audio transcoding method as an example. Referring to FIG. 3, the method includes:

301. The server performs entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

In some embodiments, the first audio stream is a high-bit-rate audio stream, and the audio feature parameters include signal gain, LSF (Line Spectral Frequency) parameters, LTP (Long-Term Prediction) parameters, pitch lag and the like. Quantization refers to the process of approximating the continuous values of a signal by a finite number of (or fewer) discrete values; the audio signal is a continuous signal, and the excitation signal obtained after quantization is a discrete signal, which is convenient for the server's subsequent processing. In some embodiments, the high bit rate refers to the bit rate of the audio stream uploaded by the terminal to the server; in other embodiments, the high bit rate may also be a bit rate higher than a certain bit-rate threshold. For example, if the bit-rate threshold is 1 Mbps, a bit rate higher than 1 Mbps is also called a high bit rate. Of course, the definition of high bit rate may differ among coding standards, which is not limited in the embodiments of this application. In some scenarios, the audio signal is a speech signal.

302. The server obtains, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

In some embodiments, the excitation signal is a discrete signal, and the server can restore the excitation signal to a time-domain audio signal based on the audio feature parameter, for subsequent audio transcoding.

303. The server re-quantizes the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter.

In some embodiments, re-quantization may also be called noise shaping quantization (NSQ). The re-quantization process is a compression process, that is, the server's re-quantization of the excitation signal and the audio feature parameter is a process of re-compressing them.

304. The server performs entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

After the audio feature parameter and the excitation signal are re-quantized, they have been re-compressed; performing entropy encoding on the re-quantized audio feature parameter and excitation signal directly yields the second audio stream with a lower bit rate.

Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; instead, entropy decoding is used to obtain the audio feature parameters and the excitation signal. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the computational cost of entropy decoding and entropy encoding is small, and the time-domain signal is not processed, the amount of computation can be greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring sound quality.

The above steps 301-304 are a brief introduction to the embodiments of this application. The technical solutions provided by the embodiments of this application are described more clearly below with some examples. Referring to FIG. 4, the method includes:
401. The server performs entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

In one possible implementation, the server obtains the occurrence probabilities of multiple coding units in the first audio stream, decodes the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units, and combines the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream. In some embodiments, a coding unit is the smallest unit used when encoding the audio stream.

The above is one possible implementation of entropy decoding. To explain it more clearly, an entropy encoding method corresponding to it is described first.

For example, the server obtains the occurrence probabilities of multiple coding units in the audio feature parameter and the excitation signal of the first audio stream, and determines an initial interval corresponding to the first audio stream. Based on the occurrence probabilities of the multiple coding units, the server divides the initial interval into multiple first-level sub-intervals, the multiple first-level sub-intervals being in one-to-one correspondence with the multiple coding units, and the ratios between the lengths of the first-level sub-intervals being the same as the ratios between the occurrence probabilities of the coding units. For the first-level sub-interval corresponding to the first coding unit, the server divides it, based on the occurrence probabilities of the multiple coding units, into multiple second-level sub-intervals, which respectively correspond to combinations of the first coding unit with each of the multiple coding units. Based on the order in which the multiple coding units appear in the first audio stream, the server determines a target second-level sub-interval from the multiple second-level sub-intervals and continues dividing based on it. The server repeats these steps until a K-level sub-interval is obtained, which is the sub-interval corresponding to the combination of the multiple coding units, where K is a positive integer equal to the number of the coding units. The server can use any value in the K-level sub-interval to represent the first audio stream; this value is the encoded value obtained by entropy encoding the first audio stream.

For example, to simplify the process, take the first audio stream being "MNOOP" as an example, where each letter is one coding unit and "MNOOP" represents the audio feature parameter and the excitation signal of the first audio stream. In "MNOOP", the letter "M" appears once, "N" once, "O" twice and "P" once. Since "MNOOP" includes five letters, the occurrence probabilities of "M", "N", "O" and "P" in "MNOOP" are 0.2, 0.2, 0.4 and 0.2 respectively. In some embodiments, the initial interval corresponding to "MNOOP" is [0, 100000]. Based on those probabilities, the server divides the interval [0, 100000] into four sub-intervals M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000], where the ratios between the sub-interval lengths are the same as the ratios between the corresponding occurrence probabilities. Since the first letter of "MNOOP" is "M", the server takes the first sub-interval M: [0, 20000] as the base interval for subsequent entropy encoding. It divides M: [0, 20000] into MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]; since the first two letters are "MN", it takes the second sub-interval MN: [4000, 8000] as the base interval. It divides MN: [4000, 8000] into MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]; since the first three letters are "MNO", it takes the third sub-interval MNO: [5600, 7200] as the base interval. It divides MNO: [5600, 7200] into MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]; since the first four letters are "MNOO", it takes the third sub-interval MNOO: [6240, 6880] as the base interval. Finally it divides MNOO: [6240, 6880] into MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. The interval for entropy encoding "MNOOP" is thus [6752, 6880], and the server can use any value in [6752, 6880] to represent the encoding result of "MNOOP", for example 6800. In the above implementation, 6800 is the first audio stream.
On the basis of the above entropy encoding, the entropy decoding implementation is described.

For example, the server obtains the occurrence probabilities of multiple coding units in the first audio stream and determines the initial interval corresponding to the first audio stream, which is the same initial interval as in the entropy encoding process. Based on the occurrence probabilities of the multiple coding units, the server divides the initial interval into multiple first-level sub-intervals in one-to-one correspondence with the multiple coding units, the ratios between the sub-interval lengths being the same as the ratios between the occurrence probabilities. The server compares the encoded value of the first audio stream with the multiple first-level sub-intervals and determines the first-level sub-interval to which the encoded value belongs as the target first-level sub-interval; the coding unit corresponding to the target first-level sub-interval is the first coding unit of the first audio stream. Based on the occurrence probabilities, the server divides the target first-level sub-interval into multiple second-level sub-intervals and, based on the encoded value, determines the target second-level sub-interval, whose two corresponding coding units are the first two coding units of the first audio stream. The server continues decoding based on the target second-level sub-interval until a target K-level sub-interval is obtained, whose K corresponding coding units are all the coding units of the first audio stream, where K is a positive integer equal to the number of the coding units.

For example, taking the first audio stream being 6800 as an example: the server obtains the occurrence probabilities of the multiple coding units in the first audio stream, that is, the occurrence probabilities of "M", "N", "O" and "P" are 0.2, 0.2, 0.4 and 0.2 respectively. The server constructs the same initial interval [0, 100000] as in the entropy encoding process and divides it, according to those probabilities, into four sub-intervals M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000]. Since 6800 falls in the first sub-interval M: [0, 20000], the server takes [0, 20000] as the base interval for subsequent entropy decoding, and "M" is the first decoded unit. It divides M: [0, 20000] into MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]; since 6800 falls in the second sub-interval MN: [4000, 8000], that becomes the base interval and "N" is the second decoded unit. It divides MN: [4000, 8000] into MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]; since 6800 falls in the third sub-interval MNO: [5600, 7200], that becomes the base interval and "O" is the third decoded unit. It divides MNO: [5600, 7200] into MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]; since 6800 falls in the third sub-interval MNOO: [6240, 6880], that becomes the base interval and "O" is the fourth decoded unit. Finally it divides MNOO: [6240, 6880] into MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]; since 6800 falls in the fourth sub-interval MNOOP: [6752, 6880], "P" is the fifth decoded unit. The server combines the five decoded units "M", "N", "O", "O" and "P" to obtain "MNOOP", that is, the audio feature parameter and the excitation signal of the first audio stream.
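The decoding walk-through above is the inverse of the interval-coding sketch shown earlier: repeatedly subdivide the current interval and pick the sub-interval that contains the encoded value. A minimal sketch, assuming the same fixed frequency table; the function name and structure are illustrative:

```python
from itertools import accumulate

def interval_decode(value, counts, length, low=0, high=100000):
    """Recover the symbol sequence from one encoded value, e.g. 6800 -> "MNOOP"."""
    alphabet = list(counts)
    total = sum(counts.values())
    cum = [0] + list(accumulate(counts[s] for s in alphabet))
    out = []
    for _ in range(length):
        width = high - low
        for i, sym in enumerate(alphabet):
            lo = low + width * cum[i] // total
            hi = low + width * cum[i + 1] // total
            if lo <= value < hi:        # the encoded value picks this sub-interval
                out.append(sym)
                low, high = lo, hi
                break
    return "".join(out)

print(interval_decode(6800, {"M": 1, "N": 1, "O": 2, "P": 1}, 5))   # MNOOP
```

Note that the decoder needs the occurrence probabilities and the number of units, which is why the implementation above first obtains the occurrence probabilities of the coding units.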
To describe the technical solution provided by the embodiments of this application more clearly, the above implementation is described below on the basis of the entropy decoding in the above example.

In one possible implementation, referring to FIG. 5, the server inputs the first audio stream into a range decoder 501, which performs entropy decoding on the first audio stream; for the entropy decoding process, refer to the above example, which is not repeated here. After the range decoder 501 entropy-decodes the first audio stream, an entropy-decoded audio stream is obtained. The server inputs the entropy-decoded audio stream into a parameter decoder 502, which outputs flag-bit pulses, signal gain and audio feature parameters. The server inputs the flag-bit pulses and the signal gain into an excitation signal generator 503 to obtain the excitation signal.

402. The server obtains, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

In one possible implementation, the server processes the excitation signal based on the audio feature parameter to obtain the time-domain audio signal corresponding to the excitation signal.

For example, referring to FIG. 5, the server inputs the audio feature parameter and the excitation signal into a frame reconstruction module 504, which outputs a frame-reconstructed audio signal. The server inputs the frame-reconstructed audio signal into a sample-rate conversion filter 505, which performs resampling encoding to obtain the time-domain audio signal corresponding to the excitation signal. Optionally, if the frame-reconstructed audio signal is a stereo audio signal, then before inputting it into the sample-rate conversion filter the server can input it into a stereo separation module 506, which separates the frame-reconstructed audio signal into mono audio signals; the server then inputs the mono audio signals into the sample-rate conversion filter 505 for resampling encoding to obtain the time-domain audio signal corresponding to the excitation signal.
The method by which the frame reconstruction module performs frame reconstruction on the excitation signal is described below.

In one possible implementation, the audio feature parameters include signal gain, LSF (Line Spectral Frequency) coefficients, LTP (Long-Term Prediction) coefficients, pitch lag and the like. The frame reconstruction module includes an LTP synthesis filter and an LPC (Linear Predictive Coding) synthesis filter. The server inputs the excitation signal, together with the pitch lag and the LTP coefficients among the audio feature parameters, into the LTP synthesis filter, which performs the first frame reconstruction on the excitation signal to obtain a first filtered audio signal. The server inputs the first filtered audio signal, the LSF coefficients and the signal gain into the LPC synthesis filter, which performs the second frame reconstruction on the first filtered audio signal to obtain a second filtered audio signal. The server fuses the first filtered audio signal and the second filtered audio signal to obtain the frame-reconstructed audio signal.
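As a rough illustration of the two synthesis stages just described, the sketch below applies a long-term (pitch) predictor followed by a short-term LPC predictor to an excitation sequence. The filter forms are simplified assumptions for illustration, not the exact filters of the embodiment, and the final fusion step is omitted:

```python
def ltp_synthesize(excitation, pitch_lag, ltp_gain):
    """Long-term prediction synthesis: add back the pitch-periodic component."""
    out = list(excitation)
    for n in range(pitch_lag, len(out)):
        out[n] += ltp_gain * out[n - pitch_lag]
    return out

def lpc_synthesize(signal, lpc_coeffs, gain):
    """Short-term LPC synthesis: all-pole filtering scaled by the signal gain."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = gain * signal[n]
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out[n] = acc
    return out

excitation = [1.0, 0.0, 0.5, 0.0, -0.5, 0.0, 0.25, 0.0]
first_filtered = ltp_synthesize(excitation, pitch_lag=4, ltp_gain=0.5)
second_filtered = lpc_synthesize(first_filtered, lpc_coeffs=[0.9, -0.2], gain=1.0)
```

The order matches the text: the LTP stage restores the pitch structure, and the LPC stage restores the spectral envelope described by the LSF coefficients.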
403. The server obtains a first quantization parameter through at least one iteration process based on a target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate.

In one possible implementation, the server obtains the first quantization parameter through at least one iteration process. In any iteration process, the server determines a first candidate quantization parameter based on the target transcoding bit rate; simulates the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter; and simulates the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream. In response to the simulated audio stream meeting a first target condition, and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, the server determines the first candidate quantization parameter as the first quantization parameter.

The above implementation includes four parts of processing: the server first determines a candidate quantization parameter and re-quantizes the excitation signal and the audio feature parameter according to it, obtaining the first signal and the first parameter; the server simulates the entropy encoding process of the first signal and the first parameter, obtaining a simulated audio stream; the server then judges whether the simulated audio stream meets the requirements, the judgment being made based on the first target condition and the second target condition. When both conditions are met, the server can end the iteration and output the first quantization parameter; when either condition is not met, the server can iterate again.

To describe the above implementation more clearly, it is explained below in four parts.

Part 1: the server determines the first candidate quantization parameter based on the target transcoding bit rate.

The target transcoding bit rate can be determined by the server according to the actual situation, for example according to the network bandwidth, so that the target transcoding bit rate matches the network bandwidth.

In some embodiments, the first candidate quantization parameter represents a quantization step size. The larger the step size, the greater the compression ratio and the smaller the amount of quantized data; the smaller the step size, the smaller the compression ratio and the larger the amount of quantized data. In some embodiments, the target transcoding bit rate is lower than the first bit rate of the first audio stream, so the audio transcoding process is a process of reducing the bit rate of the audio stream. In this process, the server can generate a first candidate quantization parameter based on the target transcoding bit rate; after the excitation signal and the audio feature parameter are re-quantized with the first candidate quantization parameter, an audio stream with a lower bit rate, close to the target transcoding bit rate, can be obtained.

Part 2: the server simulates the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain the first signal corresponding to the excitation signal and the first parameter corresponding to the audio feature parameter.

"Simulation" here means that the server does not re-quantize the excitation signal and the audio feature parameter themselves, but simulates the re-quantization process based on the first candidate quantization parameter so as to subsequently determine the first quantization parameter to be used in the actual quantization process. Through this simulation, the server can determine the most suitable first quantization parameter.
In one possible implementation, the server simulates the discrete cosine transform process of the excitation signal and that of the audio feature parameter respectively, obtaining a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter. The server divides the second signal and the second parameter by the first candidate quantization parameter respectively and rounds the results, obtaining the first signal and the first parameter.

Taking the server re-quantizing the excitation signal as an example: in the simulation process, the server performs a discrete cosine transform on the excitation signal to obtain the second signal, then re-quantizes the second signal with the quantization step size corresponding to the first candidate quantization parameter, that is, divides the second signal by the step size represented by the first candidate parameter and rounds the result, obtaining the first signal.

For example, if the excitation signal is a matrix of sample values (rendered as an image in the original publication), the server can perform a discrete cosine transform on the excitation signal through the following formula (1) to obtain the second signal; the formula image in the original corresponds to a discrete cosine transform of the form:

F(u) = Σ_{i=0}^{N-1} f(i)·cos((2i+1)uπ/(2N))   (1)

where F(u) is the second signal, u is the generalized frequency variable, u = 1, 2, 3, ..., N-1, f(i) is the excitation signal, N is the number of values in the excitation signal, and i indexes the values in the excitation signal.

For convenience of description, the following takes as an example a second signal given as a 4×4 coefficient matrix (also rendered as an image in the original; one of its entries is 195) and a quantization step size of 28. In some embodiments, the server can re-quantize the second signal through the following formula (2) to obtain the first signal.

Q(m) = round(m/S + 0.5)   (2)

where Q() is the quantization function, m is a value in the second signal, round() is the rounding function, and S is the quantization step size.

Taking the value 195 in the second signal as an example, the server substitutes 195 into formula (2): Q(195) = round(195/28 + 0.5) = round(7.464) = 7, and 7 is the result of quantizing 195. After re-quantizing the second signal with formula (2), the server obtains the first signal; as used in the entropy encoding simulation below, the first signal is a 4×4 matrix whose columns are (7, -1, 0, 0)ᵀ, (0, -1, 0, 0)ᵀ, (0, 0, 0, 0)ᵀ and (0, 0, 0, 0)ᵀ.
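A minimal sketch of formula (2) applied to a coefficient block, reproducing the worked value Q(195) = 7 with step size 28. Since the original renders the input matrix as an image, all entries other than 195 are illustrative choices (picked so that the result matches the first-signal matrix used in the entropy encoding simulation below):

```python
def quantize(block, step):
    """Formula (2): Q(m) = round(m / S + 0.5), applied elementwise."""
    return [[round(m / step + 0.5) for m in row] for row in block]

second_signal = [
    [195,  -5, -10,  -3],   # 195 is the entry from the worked example;
    [-40, -35,  -6,  -2],   # the remaining entries are illustrative values
    [ -8,  -4,  -1,  -2],   # chosen to quantize to the first-signal matrix
    [ -6,  -3,  -1,  -1],
]
first_signal = quantize(second_signal, 28)
# first_signal == [[7, 0, 0, 0], [-1, -1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
# i.e. columns (7,-1,0,0), (0,-1,0,0), (0,0,0,0), (0,0,0,0)
print(first_signal[0][0])   # 7, matching Q(195) in the text
```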
Part 3: the server simulates the entropy encoding process of the first signal and the first parameter to obtain the simulated audio stream.

Taking simulating entropy encoding of the first signal as an example, the server can divide the first signal into four vectors: (7, -1, 0, 0)ᵀ, (0, -1, 0, 0)ᵀ, (0, 0, 0, 0)ᵀ and (0, 0, 0, 0)ᵀ. The server denotes the vector (7, -1, 0, 0)ᵀ as A, the vector (0, -1, 0, 0)ᵀ as B, and the vector (0, 0, 0, 0)ᵀ as C; the first signal can thus be simplified to (ABCC). In the first signal (ABCC), the occurrence probabilities of the coding units "A", "B" and "C" are 0.25, 0.25 and 0.5 respectively, and the server generates an initial interval [0, 100000]. Based on those probabilities, the server divides the initial interval [0, 100000] into three sub-intervals A: [0, 25000], B: [25000, 50000] and C: [50000, 100000]. Since the first letter of the first signal (ABCC) is "A", the server takes the first sub-interval A: [0, 25000] as the base interval for subsequent entropy encoding. It divides A: [0, 25000] into three sub-intervals AA: [0, 6250], AB: [6250, 12500] and AC: [12500, 25000]; since the second letter is "B", it takes the second sub-interval AB: [6250, 12500] as the base interval. It divides AB: [6250, 12500] into ABA: [6250, 7812.5], ABB: [7812.5, 9375] and ABC: [9375, 12500]; since the third letter is "C", it takes the third sub-interval ABC: [9375, 12500] as the base interval. It divides ABC: [9375, 12500] into ABCA: [9375, 10156.25], ABCB: [10156.25, 10937.5] and ABCC: [10937.5, 12500]. The interval for entropy encoding the first signal (ABCC) is therefore ABCC: [10937.5, 12500], and the server can use any value in this interval to represent the first signal (ABCC), for example 12000.

If the interval obtained by simulating the entropy encoding process of the first signal and the first parameter is [100, 130], the server can use any value in the interval [100, 130] to represent the simulated audio stream, for example 120.
Part 4: the first target condition and the second target condition are described.

In one possible implementation, that the simulated audio stream meets the first target condition refers to at least one of the following: the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate; and the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold. The audio stream quality parameters include signal-to-noise ratio, PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Analysis) and the like. The quality parameter threshold is set according to the actual situation, for example according to the quality requirements of the voice call: when the quality requirement of the voice call is high, the quality parameter threshold can be set higher; when it is low, the threshold can be set lower. This is not limited in the embodiments of this application.

In one possible implementation, that at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to: the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold; the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold; or the number of completed iterations is equal to an iteration count threshold. That is, during iteration, the similarity between the time-domain audio signal and the first signal is the first factor affecting termination, the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is the second factor, and the number of completed iterations is the third factor; through these three factors the server can determine when to end the iteration. In some embodiments, if the iteration count threshold is 3, the current iteration count is 3, the similarity between the time-domain audio signal and the first signal obtained by iteration is less than the similarity threshold, and the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is greater than the difference threshold, then since the number of completed iterations equals the count threshold the server can terminate the iteration and take the candidate quantization parameter of the current iteration as the first quantization parameter. Through the restriction of the second target condition, the server can obtain the first quantization parameter with fewer iterations, so that in real-time voice call scenarios transcoding can be completed faster.

Under the restriction of the above second target condition, the server does not execute the complete iteration procedure; in some embodiments, the above iteration procedure is the noise shaping quantization (NSQ) loop. The restriction of the second target condition can also be called a greedy algorithm, and the greedy algorithm can greatly increase the speed of audio transcoding, for the following reasons. First, since the first audio stream is the optimal quantization result at the high bit rate, the server can search for other candidate quantization parameters directly near the quantization parameter of the first audio stream. Second, when comparing the excitation signal with the time-domain audio signal, the above three factors can greatly reduce the number of iterations. Of course, in more aggressive cases, for example with only one iteration, the decoder can even be removed and audio transcoding performed directly; this is not limited in the embodiments of this application.

In addition, during iteration, in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, the server uses a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process. That is, when the iteration count threshold is greater than 1, if neither the first target condition nor the second target condition is met, the server can re-determine a second candidate quantization parameter based on the target transcoding bit rate and perform the next iteration based on it.
404. The server re-quantizes the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.

The target excitation signal is the re-quantized excitation signal, and the target audio feature parameter is the re-quantized audio feature parameter.

In one possible implementation, the server performs discrete cosine transforms on the excitation signal and the audio feature parameter respectively, obtaining a third signal corresponding to the excitation signal and a third parameter corresponding to the audio feature parameter. The server divides the third signal and the third parameter by the first quantization parameter respectively and rounds the results, obtaining the target excitation signal and the target audio feature parameter. This implementation belongs to the same inventive concept as Part 2 in step 403 above; for the implementation process, refer to the above description, which is not repeated here.

405. The server performs entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

In one possible implementation, the server obtains the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal, and encodes the multiple coding units based on the occurrence probabilities to obtain the second audio stream.

For example, to simplify the process, assume the target audio feature parameter and the target excitation signal are "DEFFG", where each letter is one coding unit, the occurrence probabilities of "D", "E", "F" and "G" in "DEFFG" are 0.2, 0.2, 0.4 and 0.2 respectively, and the initial interval corresponding to "DEFFG" is [0, 100000]. Based on those probabilities, the server divides the interval [0, 100000] into four sub-intervals D: [0, 20000], E: [20000, 40000], F: [40000, 80000] and G: [80000, 100000], where the ratios between the sub-interval lengths are the same as the ratios between the corresponding occurrence probabilities. Since the first letter of "DEFFG" is "D", the server takes the first sub-interval D: [0, 20000] as the base interval for subsequent entropy encoding. It divides D: [0, 20000] into DD: [0, 4000], DE: [4000, 8000], DF: [8000, 16000] and DG: [16000, 20000]; since the first two letters are "DE", it takes the second sub-interval DE: [4000, 8000] as the base interval. It divides DE: [4000, 8000] into DED: [4000, 4800], DEE: [4800, 5600], DEF: [5600, 7200] and DEG: [7200, 8000]; since the first three letters are "DEF", it takes the third sub-interval DEF: [5600, 7200] as the base interval. It divides DEF: [5600, 7200] into DEFD: [5600, 5920], DEFE: [5920, 6240], DEFF: [6240, 6880] and DEFG: [6880, 7200]; since the first four letters are "DEFF", it takes the third sub-interval DEFF: [6240, 6880] as the base interval. Finally it divides DEFF: [6240, 6880] into DEFFD: [6240, 6368], DEFFE: [6368, 6496], DEFFF: [6496, 6752] and DEFFG: [6752, 6880]. The interval for entropy encoding "DEFFG" is thus [6752, 6880], and the server can use any value in [6752, 6880] to represent the encoding result of "DEFFG", for example 6800. In the above implementation, 6800 is the second audio stream.
Optionally, after step 405, the audio transcoding method provided by the embodiments of this application can also be combined with other audio processing methods to improve the quality of audio transcoding. For example, it can be combined with a forward error correction (FEC) encoding method. During transmission of an audio stream, bit errors and jitter may occur and degrade the quality of audio transmission. Based on this, forward error correction can be used to encode the audio. The essence of forward error correction is to add redundant information to the audio so that errors can be corrected promptly when they occur; the redundant information is information related to the previous N frames of the current audio frame, where N is a positive integer.

In one possible implementation, the server performs forward error correction encoding on the subsequently received audio stream based on the second audio stream.

For example, assume that one audio stream is one audio frame, the second audio stream is denoted as frame T-1, and the audio stream subsequently received from the terminal is denoted as frame T. When encoding frame T, the server can encode frame T-1, that is, the second audio stream, as the redundant information in the forward error correction encoding of frame T, obtaining an encoded FEC stream, where T is a positive integer. Since the bit rate of frame T-1 has been reduced by the audio transcoding method provided by the embodiments of this application, the overall bit rate of the encoded FEC stream can also be reduced, thereby improving network robustness during audio stream transmission while ensuring audio quality, where network robustness is the ability to withstand network fluctuations.

The above takes encoding one audio frame as the redundant information in forward error correction encoding as an example. In other possible implementations, referring to FIG. 6, if the server is currently encoding frame T, then for frame T-1 and frame T-2 the server can use the audio transcoding method provided by the embodiments of this application to adjust, that is, reduce, the bit rates of frame T-1 and frame T-2, and use the in-band forward error correction method to encode the adjusted frame T-1, the adjusted frame T-2 and frame T, obtaining an encoded FEC stream. Since the bit rates of frame T-1 and frame T-2 are reduced, the overall bit rate of the encoded FEC stream can also be reduced, thereby improving network robustness during audio stream transmission while ensuring audio quality.

Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; entropy decoding is used to obtain the audio feature parameters and the excitation signal, that is, a more aggressive greedy algorithm is adopted. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the complexity of entropy decoding and entropy encoding is almost negligible, their computational cost is small, and not processing the time-domain signal further reduces the amount of computation greatly, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring audio quality.
In addition, an embodiment of this application also provides an audio transcoder, the structure of which is shown in FIG. 7. The audio transcoder includes: a first processing unit 701, a second processing unit 702, a quantization unit 703 and a third processing unit 704, where the first processing unit 701 is connected to the second processing unit 702 and the quantization unit 703 respectively, the second processing unit 702 is connected to the quantization unit 703, and the quantization unit 703 is connected to the third processing unit 704. In some embodiments, the audio transcoder provided by the embodiments of this application is also referred to as a downlink transcoder.

The first processing unit 701 is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

The second processing unit 702 is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

The quantization unit 703 is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter. In some embodiments, the quantization unit 703 is also referred to as a fast noise shaping quantization unit.

The third processing unit 704 is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.

In some embodiments, during transcoding, the first processing unit 701 can send the audio feature parameter and the excitation signal to the second processing unit 702 and the quantization unit 703 respectively; the second processing unit 702 can obtain the audio feature parameter and the excitation signal from the first processing unit and, based on them, obtain the time-domain audio signal corresponding to the excitation signal, and can send the time-domain audio signal to the quantization unit 703. The quantization unit 703 can receive the target transcoding bit rate, the audio feature parameter, the excitation signal and the time-domain audio signal, and re-quantize the excitation signal and the audio feature parameter; it can then send the target audio feature parameter and the target excitation signal to the third processing unit 704, which performs entropy encoding on them to obtain the second audio stream with the second bit rate.

In one possible implementation, the quantization unit 703 is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate, and to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.

In one possible implementation, the quantization unit 703 is configured to, in any iteration process, determine a first candidate quantization parameter based on the target transcoding bit rate; simulate the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter; simulate the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream; and, in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, determine the first candidate quantization parameter as the first quantization parameter.
In one possible implementation, that the simulated audio stream meets the first target condition refers to at least one of the following:

the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate;

the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.

In one possible implementation, that at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to:

the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;

the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold;

the number of completed iterations is equal to an iteration count threshold.

In one possible implementation, the quantization unit 703 is configured to:

simulate the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio feature parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter;

divide the second signal and the second parameter by the first candidate quantization parameter respectively and then round the results, to obtain the first signal and the first parameter.

In one possible implementation, the quantization unit 703 is further configured to: in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process.
In one possible implementation, the first processing unit 701 is configured to: obtain the occurrence probabilities of multiple coding units in the first audio stream; decode the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units; and combine the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream.

In one possible implementation, the third processing unit 704 is configured to:

obtain the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal;

encode the multiple coding units based on the occurrence probabilities to obtain the second audio stream.

In one possible implementation, the audio transcoder further includes a forward error correction unit, connected to the third processing unit 704 and configured to perform forward error correction encoding on the subsequently received audio stream based on the second audio stream.

It should be noted that when the audio transcoder provided by the above embodiment performs audio transcoding, the division into the above functional units is merely used as an example for illustration. In practical applications, the above functions may be allocated to different functional units as required, that is, the internal structure of the audio transcoder may be divided into different functional units to complete all or part of the functions described above. In addition, the audio transcoder provided by the above embodiment belongs to the same concept as the audio transcoding method embodiments; for its specific implementation process, refer to the method embodiments, which are not repeated here.

Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; instead, entropy decoding is used to obtain the audio feature parameters and the excitation signal. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the computational cost of entropy decoding and entropy encoding is small, and the time-domain signal is not processed, the amount of computation can be greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring sound quality.
FIG. 8 is a schematic structural diagram of an audio transcoding apparatus provided by an embodiment of this application. Referring to FIG. 8, the apparatus includes: a decoding module 801, a time-domain audio signal acquisition module 802, a quantization module 803 and an encoding module 804.

The decoding module 801 is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal.

The time-domain audio signal acquisition module 802 is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal.

The quantization module 803 is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter.

The encoding module 804 is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
In one possible implementation, the quantization module is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate, and to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.

In one possible implementation, the quantization module is configured to, in any iteration process, determine a first candidate quantization parameter based on the target transcoding bit rate; simulate the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter; simulate the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream; and, in response to the simulated audio stream meeting a first target condition and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, determine the first candidate quantization parameter as the first quantization parameter.

In one possible implementation, that the simulated audio stream meets the first target condition refers to at least one of the following:

the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate;

the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.

In one possible implementation, that at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to:

the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;

the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold;

the number of completed iterations is equal to an iteration count threshold.
In one possible implementation, the quantization module is configured to simulate the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio feature parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter, and to divide the second signal and the second parameter by the first candidate quantization parameter respectively and then round the results, to obtain the first signal and the first parameter.

In one possible implementation, the quantization module is further configured to, in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process.

In one possible implementation, the decoding module is configured to obtain the occurrence probabilities of multiple coding units in the first audio stream, decode the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units, and combine the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream.

In one possible implementation, the encoding module is configured to obtain the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal, and encode the multiple coding units based on the occurrence probabilities to obtain the second audio stream.

In one possible implementation, the apparatus further includes a forward error correction module, configured to perform forward error correction encoding on the subsequently received audio stream based on the second audio stream.

It should be noted that when the audio transcoding apparatus provided by the above embodiments performs audio transcoding, the division into the above functional modules is merely used as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the audio transcoding apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio transcoding apparatus provided by the above embodiments belongs to the same concept as the audio transcoding method embodiments; for its specific implementation process, refer to the method embodiments, which are not repeated here.

Through the technical solution provided by the embodiments of this application, a complete parameter extraction process does not need to be performed when transcoding an audio stream; instead, entropy decoding is used to obtain the audio feature parameters and the excitation signal. Re-quantization is likewise performed on the excitation signal and the audio feature parameters and does not involve processing of the time-domain signal. Finally, entropy encoding is performed on the excitation signal and the audio feature parameters to obtain a second audio stream with a lower bit rate. Since the computational cost of entropy decoding and entropy encoding is small, and the time-domain signal is not processed, the amount of computation can be greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole while ensuring sound quality.
An embodiment of this application provides a computer device configured to execute the above method. The computer device can be implemented as a terminal or a server, and the structure of the terminal is introduced first below.

FIG. 9 is a schematic structural diagram of a terminal provided by an embodiment of this application. The terminal 900 may be a smartphone, a tablet computer, a notebook computer or a desktop computer. The terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal or other names.

Generally, the terminal 900 includes one or more processors 901 and one or more memories 902.

The processor 901 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.

The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 is used to store at least one computer program, and the at least one computer program is executed by the processor 901 to implement the audio transcoding method provided by the method embodiments in this application.

Those skilled in the art can understand that the structure shown in FIG. 9 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown, or combine some components, or adopt a different component arrangement.

The above computer device can also be implemented as a server, and the structure of the server is introduced below.

FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application. The server 1000 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1001 and one or more memories 1002, where at least one computer program is stored in the one or more memories 1002 and is loaded and executed by the one or more processors 1001 to implement the methods provided by the above method embodiments. Certainly, the server 1000 may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may also include other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a computer program, is also provided, and the above computer program can be executed by a processor to complete the audio transcoding method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device or the like.

In an exemplary embodiment, a computer program product or computer program is also provided. The computer program product or computer program includes program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes the program code, so that the computer device executes the above audio transcoding method.

The above are only optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (15)

  1. An audio transcoding method, executed by a computer device, the method comprising:
    performing entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;
    obtaining, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;
    re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;
    performing entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
  2. The method according to claim 1, wherein re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the target transcoding bit rate to obtain the target excitation signal and the target audio feature parameter comprises:
    obtaining a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate;
    re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter to obtain the target excitation signal and the target audio feature parameter.
  3. The method according to claim 2, wherein obtaining the first quantization parameter through at least one iteration process based on the target transcoding bit rate comprises:
    in any iteration process, determining a first candidate quantization parameter based on the target transcoding bit rate;
    simulating the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter;
    simulating the entropy encoding process of the first signal and the first parameter to obtain a simulated audio stream;
    in response to the simulated audio stream meeting a first target condition, and at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting a second target condition, determining the first candidate quantization parameter as the first quantization parameter.
  4. The method according to claim 3, wherein that the simulated audio stream meets the first target condition refers to at least one of the following:
    the bit rate of the simulated audio stream is less than or equal to the target transcoding bit rate;
    the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
  5. The method according to claim 3, wherein that at least one of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meets the second target condition refers to:
    the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
    the difference between the target transcoding bit rate and the bit rate of the simulated audio stream is less than or equal to a difference threshold;
    the number of completed iterations is equal to an iteration count threshold.
  6. The method according to claim 3, wherein simulating the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain the first signal corresponding to the excitation signal and the first parameter corresponding to the audio feature parameter comprises:
    simulating the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio feature parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio feature parameter;
    dividing the second signal and the second parameter by the first candidate quantization parameter respectively and then rounding the results, to obtain the first signal and the first parameter.
  7. The method according to claim 3, wherein the method further comprises:
    in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding bit rate and the bit rate of the simulated audio stream, and the number of completed iterations meeting the second target condition, using a second candidate quantization parameter determined based on the target transcoding bit rate as the input of the next iteration process.
  8. The method according to claim 1, wherein performing entropy decoding on the first audio stream with the first bit rate to obtain the audio feature parameter and the excitation signal of the first audio stream comprises:
    obtaining the occurrence probabilities of multiple coding units in the first audio stream;
    decoding the first audio stream based on the occurrence probabilities to obtain multiple decoding units respectively corresponding to the multiple coding units;
    combining the multiple decoding units to obtain the audio feature parameter and the excitation signal of the first audio stream.
  9. The method according to claim 1, wherein performing entropy encoding on the target audio feature parameter and the target excitation signal to obtain the second audio stream with the second bit rate comprises:
    obtaining the occurrence probabilities of multiple coding units in the target audio feature parameter and the target excitation signal;
    encoding the multiple coding units based on the occurrence probabilities to obtain the second audio stream.
  10. The method according to claim 1, wherein after performing entropy encoding on the target audio feature parameter and the target excitation signal to obtain the second audio stream with the second bit rate, the method further comprises:
    performing forward error correction encoding on the subsequently received audio stream based on the second audio stream.
  11. An audio transcoder, the audio transcoder comprising: a first processing unit, a second processing unit, a quantization unit and a third processing unit, wherein the first processing unit is connected to the second processing unit and the quantization unit respectively, the second processing unit is connected to the quantization unit, and the quantization unit is connected to the third processing unit;
    the first processing unit is configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;
    the second processing unit is configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;
    the quantization unit is configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;
    the third processing unit is configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
  12. The audio transcoder according to claim 11, wherein the quantization unit is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding bit rate, the first quantization parameter being used to adjust the first bit rate of the first audio stream to the target transcoding bit rate, and to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter.
  13. An audio transcoding apparatus, the apparatus comprising:
    a decoding module, configured to perform entropy decoding on a first audio stream with a first bit rate to obtain an audio feature parameter and an excitation signal of the first audio stream, the excitation signal being a quantized audio signal;
    a time-domain audio signal acquisition module, configured to obtain, based on the audio feature parameter and the excitation signal, a time-domain audio signal corresponding to the excitation signal;
    a quantization module, configured to re-quantize the excitation signal and the audio feature parameter based on the time-domain audio signal and a target transcoding bit rate to obtain a target excitation signal and a target audio feature parameter;
    an encoding module, configured to perform entropy encoding on the target audio feature parameter and the target excitation signal to obtain a second audio stream with a second bit rate, the second bit rate being lower than the first bit rate.
  14. A computer device, the computer device comprising one or more processors and one or more memories, the one or more memories storing at least one computer program, the computer program being loaded and executed by the one or more processors to implement the audio transcoding method according to any one of claims 1 to 10.
  15. A computer-readable storage medium, the computer-readable storage medium storing at least one computer program, the computer program being loaded and executed by a processor to implement the audio transcoding method according to any one of claims 1 to 10.
PCT/CN2022/076144 2021-02-26 2022-02-14 Audio transcoding method and apparatus, audio transcoder, device, and storage medium WO2022179406A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/046,708 US20230075562A1 (en) 2021-02-26 2022-10-14 Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110218868 2021-02-26
CN202110218868.9 2021-02-26
CN202111619099.X 2021-12-27
CN202111619099.XA CN115050377A (zh) Audio transcoding method and apparatus, audio transcoder, device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/046,708 Continuation US20230075562A1 (en) 2021-02-26 2022-10-14 Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium

Publications (1)

Publication Number Publication Date
WO2022179406A1 true WO2022179406A1 (zh) 2022-09-01

Family

ID=83048655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076144 WO2022179406A1 (zh) Audio transcoding method and apparatus, audio transcoder, device, and storage medium 2021-02-26 2022-02-14

Country Status (2)

Country Link
US (1) US20230075562A1 (zh)
WO (1) WO2022179406A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1669071A (zh) * 2002-05-22 2005-09-14 NEC Corporation (日本电气株式会社) Method and apparatus for code conversion between encoding/decoding processes of audio codes, and storage medium using the method and apparatus
CN1954366A (zh) * 2004-05-11 2007-04-25 Dilithium Networks (达丽星网络有限公司) Method and apparatus for voice rate conversion in a telecommunication multi-rate speech coder
CN101563726A (zh) * 2006-09-20 2009-10-21 Thomson Licensing (汤姆森许可贸易公司) Method and device for transcoding audio signals
CN101617361A (zh) * 2006-09-28 2009-12-30 Nortel Networks (北方电讯网络有限公司) Method and device for rate reduction of coded voice traffic
CN103457703A (zh) * 2013-08-27 2013-12-18 Dalian University of Technology (大连理工大学) A transcoding method from G.729 to the AMR 12.2 rate
CN104658539A (zh) * 2013-11-20 2015-05-27 大连佑嘉软件科技有限公司 A transcoding method for speech coder bitstreams
CN107112024A (zh) * 2014-10-24 2017-08-29 Dolby International (杜比国际公司) Encoding and decoding of audio signals
CN107659603A (zh) * 2016-09-22 2018-02-02 Tencent Technology (Beijing) Co., Ltd. (腾讯科技(北京)有限公司) Method and apparatus for user interaction with push information


Also Published As

Publication number Publication date
US20230075562A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US11227612B2 (en) Audio frame loss and recovery with redundant frames
Cummiskey et al. Adaptive quantization in differential PCM coding of speech
US5835495A (en) System and method for scaleable streamed audio transmission over a network
EP3992964B1 (en) Voice signal processing method and apparatus, and electronic device and storage medium
WO2023221674A1 (zh) Audio encoding and decoding method and related products
US20230005487A1 (en) Autocorrection of pronunciations of keywords in audio/videoconferences
CN111464262A (zh) Data processing method, apparatus, medium, and electronic device
WO2022179406A1 (zh) Audio transcoding method and apparatus, audio transcoder, device, and storage medium
WO2023241254A1 (zh) Audio encoding and decoding method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN112767955A (zh) Audio encoding method and apparatus, storage medium, and electronic device
CN111816197A (zh) Audio encoding method and apparatus, electronic device, and storage medium
CN111951821B (zh) Call method and apparatus
US20180337964A1 (en) Selectively transforming audio streams based on audio energy estimate
CN114842857A (zh) Speech processing method, apparatus, system, device, and storage medium
CN115050377A (zh) Audio transcoding method and apparatus, audio transcoder, device, and storage medium
US20230238009A1 (en) Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium
US11855775B2 (en) Transcoding method and apparatus, medium, and electronic device
WO2022037444A1 (zh) Encoding and decoding methods, apparatus, medium, and electronic device
WO2022252957A1 (zh) Audio data encoding and decoding method, related apparatus, and computer-readable storage medium
WO2024018525A1 (ja) Video processing device, method, and program
US20230131141A1 (en) Method, system, and computer program product for streaming
US20230262267A1 (en) Entropy coding for neural-based media compression
WO2022242534A1 (zh) Encoding and decoding method, apparatus, device, storage medium, and computer program
WO2022258036A1 (zh) Encoding and decoding method, apparatus, device, storage medium, and computer program
CN115631758B (zh) Audio signal processing method, apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22758771

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.01.2024)