CN115050377A - Audio transcoding method and device, audio transcoder, equipment and storage medium - Google Patents


Info

Publication number
CN115050377A
CN115050377A (application CN202111619099.XA)
Authority
CN
China
Prior art keywords
audio
signal
code rate
parameter
audio stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111619099.XA
Other languages
Chinese (zh)
Other versions
CN115050377B (en)
Inventor
黄庆博
王蒙
肖玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to PCT/CN2022/076144 priority Critical patent/WO2022179406A1/en
Publication of CN115050377A publication Critical patent/CN115050377A/en
Priority to US18/046,708 priority patent/US20230075562A1/en
Application granted granted Critical
Publication of CN115050377B publication Critical patent/CN115050377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/173 Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium, belonging to the field of audio processing. With the technical solution provided by the embodiments of this application, when an audio stream is transcoded, the audio characteristic parameters and the excitation signal are obtained by entropy decoding rather than by full parameter extraction. Re-quantization is then applied directly to the excitation signal and the audio characteristic parameters, without the time-domain signal processing of a full re-encode. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a lower code rate. Because entropy decoding and entropy coding are computationally cheap, and no time-domain signal reprocessing is required, the amount of computation is greatly reduced, improving the overall speed and efficiency of audio transcoding while preserving sound quality.

Description

Audio transcoding method and device, audio transcoder, equipment and storage medium
The present application claims priority to Chinese Patent Application No. 202110218868.9, entitled "Audio transcoding method, apparatus, audio transcoder, device, and storage medium," filed on February 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of audio processing, and in particular, to an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium.
Background
With the development of network technology, more and more users conduct voice chat through social application programs.
In the related art, the network bandwidths of different users differ, so during a voice chat the social application needs to transcode the transmitted audio. For example, if one user's network bandwidth is low, the audio must be transcoded, that is, its code rate must be reduced, so that the user can continue the voice chat normally.
However, audio transcoding in the related art has high computational complexity, which makes it slow and inefficient.
Disclosure of Invention
The embodiments of this application provide an audio transcoding method and apparatus, an audio transcoder, a device, and a storage medium, which can improve the speed and efficiency of audio transcoding. The technical solution is as follows:
in one aspect, an audio transcoding method is provided, where the method includes:
entropy decoding a first audio stream with a first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized voice signal;
acquiring a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
re-quantizing the excitation signal and the audio characteristic parameters based on the time-domain audio signal and a target transcoding code rate;
performing entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
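The four steps above can be sketched end to end. The sketch below is a minimal illustration only: `entropy_decode`, `synthesize`, `requantize`, and `entropy_encode` are hypothetical placeholder helpers (the "stream" here is just a dict of coded fields), not the actual codec API described in this application.

```python
import numpy as np

def entropy_decode(stream):
    # Placeholder: a real decoder would range-decode the bitstream.
    # Here the "stream" is already a dict holding the coded fields.
    return stream["features"], stream["excitation"]

def synthesize(features, excitation):
    # Placeholder synthesis filter: reconstruct a time-domain signal
    # by applying the (gain) feature parameter to the excitation.
    return features["gain"] * excitation

def requantize(excitation, features, step):
    # Coarser step size -> fewer distinct values -> lower code rate.
    q_exc = np.round(excitation / step)
    q_gain = round(features["gain"] / step)
    return {"gain": q_gain}, q_exc

def entropy_encode(features, excitation):
    # Placeholder: pack the re-quantized fields back into a "stream".
    return {"features": features, "excitation": excitation}

def transcode(first_stream, step=2.0):
    features, excitation = entropy_decode(first_stream)
    time_domain = synthesize(features, excitation)  # used to steer quantization
    new_features, new_excitation = requantize(excitation, features, step)
    return entropy_encode(new_features, new_excitation)
```

Note that the time-domain signal is reconstructed only to guide the re-quantization; it is never itself re-encoded, which is the source of the computational saving claimed above.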
In one aspect, an audio transcoder is provided, comprising an entropy decoding unit, a time domain decoding unit, a quantization unit, and an entropy coding unit, wherein the entropy decoding unit is connected to the time domain decoding unit and to the quantization unit respectively, the time domain decoding unit is connected to the quantization unit, and the quantization unit is connected to the entropy coding unit;
the entropy decoding unit is configured to perform entropy decoding on a first audio stream with a first code rate to obtain an audio characteristic parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal;
the time domain decoding unit is configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
the quantization unit is configured to re-quantize the excitation signal and the audio characteristic parameter based on the time-domain audio signal and a target transcoding code rate;
the entropy coding unit is configured to perform entropy coding on the re-quantized audio feature parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
In a possible implementation manner, the quantization unit is configured to: in any one iteration, determine a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization of the excitation signal and the audio characteristic parameter based on the first candidate quantization parameter, to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter; simulate the entropy coding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of (i) the time-domain audio signal and the first signal, (ii) the target transcoding code rate and the code rate of the simulated audio stream, or (iii) the number of iterations meeting a second target condition.
In a possible embodiment, the simulated audio stream meets the first target condition when at least one of the following holds:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
In a possible embodiment, the second target condition is met when at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
In a possible implementation, the quantization unit is configured to: simulate the discrete cosine transform of the excitation signal and of the audio characteristic parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter; and divide the second signal and the second parameter by the first candidate quantization parameter and round the results, to obtain the first signal and the first parameter.
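A minimal sketch of this "transform, divide by the candidate quantization parameter, then round" step, assuming an orthonormal DCT-II and a scalar step size (`dct_ii` and `simulate_quantization` are illustrative names, not from the embodiment):

```python
import numpy as np

def dct_ii(x):
    # Orthonormal DCT-II, written out directly (no SciPy dependency).
    n = len(x)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    scale = np.where(k == 0, np.sqrt(1 / n), np.sqrt(2 / n))
    return (scale * basis) @ x

def simulate_quantization(signal, step):
    # "Divide by the candidate quantization parameter, then round":
    # a larger step discards more precision and lowers the code rate.
    return np.round(dct_ii(np.asarray(signal, dtype=float)) / step)
```

For a constant signal, all energy lands in the first transform coefficient, so a modest step size already zeroes every other value, which is exactly what makes the subsequent entropy coding cheaper.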
In a possible implementation, the quantization unit is further configured to: in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting the second target condition, take a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration.
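The iterative search described above can be sketched as follows. This is a simplified illustration under stated assumptions: the simulated entropy-coding step is replaced by an ideal-entropy rate estimate, only the code-rate condition and the iteration-count budget are checked, and the next candidate step is simply double the previous one (the embodiment does not specify how the next candidate is chosen).

```python
import numpy as np

def estimate_rate(symbols):
    # Crude stand-in for the simulated entropy-coding step: an ideal
    # entropy coder spends about -log2(p) bits per symbol.
    vals, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(counts.sum() * -(p * np.log2(p)).sum())  # total bits

def search_step(excitation, target_bits, max_iters=32):
    # Iterate candidate quantization steps until the simulated stream
    # meets the rate target or the iteration budget is exhausted.
    step = 1.0
    for _ in range(max_iters):
        candidate = np.round(np.asarray(excitation, float) / step)
        if estimate_rate(candidate) <= target_bits:  # first target condition
            return step
        step *= 2.0  # coarser candidate becomes the next iteration's input
    return step      # iteration-count condition reached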
In one possible implementation, the entropy decoding unit is configured to: acquiring the occurrence probability of a plurality of coding units in the first audio stream; decoding the first audio stream based on the occurrence probability to obtain a plurality of decoding units respectively corresponding to the plurality of coding units; and combining the plurality of decoding units to obtain the audio characteristic parameters and the excitation signals of the first audio stream.
In a possible implementation, the entropy encoding unit is configured to:
obtaining the audio characteristic parameters after the re-quantization and the occurrence probabilities of a plurality of coding units in the excitation signal after the re-quantization;
and coding the plurality of coding units based on the occurrence probability to obtain the second audio stream.
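As one concrete instance of entropy coding driven by occurrence probabilities, the sketch below builds a Huffman code, in which frequent coding units receive short codewords. This illustrates the principle only; the embodiment does not specify Huffman coding, and practical audio codecs typically use range or arithmetic coding instead.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    # Build a prefix-free code from the occurrence counts of the units.
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate one-symbol stream
        return {next(iter(freq)): "0"}
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def huffman_encode(symbols):
    # Encode the unit sequence with the code built from its probabilities.
    code = huffman_code(symbols)
    return "".join(code[s] for s in symbols), code
```

Decoding mirrors this process: given the same probability table, the receiver rebuilds the identical code and walks the bitstream back into the original units, which is why both L-side steps of the scheme hinge on the occurrence probabilities.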
In a possible implementation, the audio transcoder further includes a forward error correction module, connected to the entropy coding unit, for performing forward error correction coding on a subsequently received audio stream based on the second audio stream.
In one aspect, an audio transcoding apparatus is provided, the apparatus including:
the decoding module is used for performing entropy decoding on a first audio stream with a first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized voice signal;
a time domain audio signal obtaining module, configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
a quantization module, configured to re-quantize the excitation signal and the audio characteristic parameter based on the time-domain audio signal and a target transcoding code rate;
and the coding module is used for performing entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
In a possible implementation manner, the quantization module is configured to obtain a first quantization parameter through at least one iterative process based on the target transcoding code rate, where the first quantization parameter is used to adjust the first code rate of the first audio stream to the target transcoding code rate; re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter.
In a possible implementation manner, the quantization module is configured to: in any one iteration, determine a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization of the excitation signal and the audio characteristic parameter based on the first candidate quantization parameter, to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter; simulate the entropy coding of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and at least one of (i) the time-domain audio signal and the first signal, (ii) the target transcoding code rate and the code rate of the simulated audio stream, or (iii) the number of iterations meeting a second target condition.
In a possible embodiment, the simulated audio stream meets the first target condition when at least one of the following holds:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
In a possible embodiment, the second target condition is met when at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
In a possible implementation manner, the quantization module is configured to simulate the discrete cosine transform of the excitation signal and of the audio characteristic parameter respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter; and to divide the second signal and the second parameter by the first candidate quantization parameter and round the results, to obtain the first signal and the first parameter.
In a possible implementation manner, the quantization module is further configured to, in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration.
In a possible implementation, the decoding module is configured to obtain a probability of occurrence of a plurality of coding units in the first audio stream; decoding the first audio stream based on the occurrence probability to obtain a plurality of decoding units respectively corresponding to the plurality of coding units; and combining the plurality of decoding units to obtain the audio characteristic parameters and the excitation signals of the first audio stream.
In a possible implementation manner, the encoding module is configured to obtain the re-quantized audio feature parameters and occurrence probabilities of a plurality of encoding units in the re-quantized excitation signal; and coding the plurality of coding units based on the occurrence probability to obtain the second audio stream.
In a possible implementation, the apparatus further includes a forward error correction module configured to forward error correction encode a subsequently received audio stream based on the second audio stream.
In one aspect, a computer device is provided, the computer device comprising one or more processors and one or more memories having at least one computer program stored therein, the computer program being loaded and executed by the one or more processors to implement the audio transcoding method.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the computer program being loaded and executed by a processor to implement the audio transcoding method.
In one aspect, a computer program product or computer program is provided, comprising program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the storage medium and executes it, causing the computer device to perform the audio transcoding method described above.
With the technical solution provided by the embodiments of this application, when an audio stream is transcoded, the audio characteristic parameters and the excitation signal are obtained by entropy decoding rather than by full parameter extraction. Re-quantization is then applied directly to the excitation signal and the audio characteristic parameters, without the time-domain signal processing of a full re-encode. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a lower code rate. Because entropy decoding and entropy coding are computationally cheap, and no time-domain signal reprocessing is required, the amount of computation is greatly reduced, improving the overall speed and efficiency of audio transcoding while preserving sound quality.
Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings required for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of an encoder according to an embodiment of the present application;
fig. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided in an embodiment of the present application;
fig. 3 is a flowchart of an audio transcoding method provided in an embodiment of the present application;
fig. 4 is a flowchart of an audio transcoding method provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a decoder according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an audio transcoder according to an embodiment of the present application;
fig. 7 is a schematic diagram of a method of forward error correction coding according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an audio transcoding apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first", "second", and the like in this application are used to distinguish between identical or similar items whose functions are substantially the same. It should be understood that "first", "second", and "nth" imply no logical or temporal dependency and no limitation on number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more.
Cloud Technology is a general term for the network, information, integration, management-platform, and application technologies applied under the cloud computing business model. It can form a pool of resources that are used on demand, flexibly and conveniently, and cloud computing will become an important support for it. The background services of a technical network system require large amounts of computing and storage resources, for example video websites, image websites, and portal websites. As the internet industry develops, each item may come to carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data need strong backend system support, which can only be provided through cloud computing.
Cloud computing (Cloud computing) is a computing model that distributes computing tasks over a pool of resources made up of a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is called the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
As a basic capability provider of cloud computing, a cloud computing resource pool (an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the pool for external clients to use as needed.
In terms of logical function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer deployed on the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS covers various kinds of business software, such as web portals and bulk SMS services. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The cloud conference is an efficient, convenient, and low-cost conference form based on cloud computing technology. Users need only perform simple, easy operations through an internet interface to share voice, data files, and video with teams and clients around the world quickly and efficiently, while the cloud conference service provider handles complex tasks such as the transmission and processing of conference data.
At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network, and video; video conferences based on cloud computing are called cloud conferences.
In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, users do not need to purchase expensive hardware and install complicated software, and efficient teleconferencing can be performed only by opening a browser and logging in a corresponding interface.
The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security, and usability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication cost, and upgrades internal management, and it has been widely applied in fields such as government, military, transportation, finance, operators, education, and enterprises. After video conferencing adopts cloud computing, its convenience, speed, and ease of use become even more attractive, which will surely stimulate a new wave of video conference applications.
Entropy coding: coding that, in accordance with the entropy principle, loses no information during the coding process. Information entropy is the average amount of information of a source.
Quantization: the process of approximating the continuous values of a signal (or a large number of possible discrete values) by a finite (or smaller) number of discrete values.
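For example, rounding to the nearest multiple of a step size is the simplest form of quantization:

```python
def quantize(values, step):
    # Approximate continuous values by the nearest multiple of `step`:
    # the larger the step, the fewer distinct levels survive.
    return [round(v / step) * v_unit for v, v_unit in zip(values, [step] * len(values))]

def quantize(values, step):
    # Same idea, written plainly.
    return [round(v / step) * step for v in values]
```

With a step of 0.5, every input collapses onto the grid {…, 0.0, 0.5, 1.0, …}, trading precision for fewer values to encode.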
In-band forward error correction: in-band Forward Error Correction (FEC) is a method for increasing the reliability of data communication. In a one-way communication channel, once an error is detected, the receiver has no way to request a retransmission. FEC transmits redundant information along with the data, allowing the receiver to reconstruct the data when a transmission error occurs.
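A minimal sketch of the redundancy idea, using a single XOR parity packet so the receiver can rebuild one lost packet without any retransmission. This only illustrates the principle; the in-band FEC used by speech codecs typically embeds a low-bit-rate copy of the previous frame instead.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(packets):
    # Sender side: append one redundancy packet, the XOR of all
    # equal-length data packets.
    return packets + [reduce(xor_bytes, packets)]

def recover(received, lost_index):
    # Receiver side: rebuild the lost packet from the survivors alone.
    survivors = [p for i, p in enumerate(received) if i != lost_index]
    return reduce(xor_bytes, survivors)
```

Any single missing packet (data or parity) is recoverable, at the cost of one packet's worth of extra bandwidth, which is the same rate-versus-reliability trade-off the transcoder must respect when choosing a target code rate.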
Audio coding is divided into multi-rate coding and scalable coding. A scalable coded stream has the following property: the low-code-rate stream is a subset of the high-code-rate stream, so only the low-rate core stream need be transmitted when the network is congested, which is flexible; a multi-rate coded stream does not have this property. Generally, however, at the same code rate the decoding result of a multi-rate coded stream is better than that of a scalable coded stream.
Fig. 1 shows a schematic structural diagram of an OPUS encoder. As can be seen from Fig. 1, when an OPUS encoder encodes audio, it needs to perform Voice Activity Detection (VAD), pitch analysis, noise shaping analysis, LTP (Long-Term Prediction) scaling control, gain processing, LSF (Line Spectral Frequency) quantization, prediction, pre-filtering, noise shaping quantization, range coding, and so on. When audio transcoding is needed, an OPUS decoder must first decode the coded audio, and the decoded audio is then re-encoded by the OPUS encoder to change its code rate. The number of steps involved makes encoding with the OPUS encoder computationally complex.
In the embodiment of the present application, the computer device may be provided as a terminal or a server, and an implementation environment including the terminal and the server is described below.
Fig. 2 is a schematic diagram of an implementation environment of an audio transcoding method provided in an embodiment of the present application, and referring to fig. 2, the implementation environment may include a terminal 210 and a server 240.
The terminal 210 is connected to the server 240 through a wireless network or a wired network. Optionally, the terminal 210 is a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto. The terminal 210 is installed and operated with a social application.
Optionally, the server 240 is an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. In some embodiments, the server 240 can be the execution subject of the audio transcoding method provided in the embodiments of this application: the terminal 210 collects an audio signal and sends it to the server 240, and the server 240 transcodes the audio signal and sends the transcoded audio to other terminals.
Optionally, the terminal 210 generally refers to one of a plurality of terminals, and the embodiment of the present application is illustrated by the terminal 210.
Those skilled in the art will appreciate that the number of terminals may be greater or smaller. For example, there may be only one terminal, or tens or hundreds of terminals or more, in which case the implementation environment also includes those other terminals. The embodiments of this application do not limit the number or device type of the terminals.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
After the implementation environment of the embodiment of the present application is introduced, an application scenario of the embodiment of the present application will be introduced below with reference to the implementation environment. In the following introduction, the terminal is the terminal 210 in the implementation environment, and the server is the server 240 in the implementation environment. The embodiment of the present application can be applied to various social applications, such as an online conference application, an instant messaging application, or a live broadcast application, and the embodiment of the present application is not limited thereto.
In the online conference application, there are often a plurality of terminals, the plurality of terminals are installed with an online conference application program, and a user of each terminal is a participant of an online conference. The plurality of terminals are all connected with the server through the network. In the process of carrying out the online conference, the server can transcode the voice signals uploaded by each terminal and then send the transcoded voice signals to the plurality of terminals, so that the plurality of terminals can play the voice signals, and the online conference is realized. Because the network environments of the terminals may be different, in the process of transcoding the voice signal by the server, the server can adopt the technical scheme provided by the embodiment of the application, convert the voice signal into different code rates according to the network bandwidths of different terminals, and send the voice signals with different code rates to different terminals, so that different terminals can normally perform an online conference, that is, for a terminal with a larger network bandwidth, the server can transcode the voice signal with a higher code rate, and a higher code rate means higher voice quality, so that the larger bandwidth can be fully utilized, and the quality of the online conference is improved. For a terminal with a smaller network bandwidth, the server can transcode the voice signal at a lower code rate, and the lower code rate means smaller bandwidth occupation, so that the voice signal can be sent to the terminal in real time, and the normal online conference access of the terminal is ensured. In addition, because of network fluctuation, the network bandwidth of the same terminal may be larger at one time and smaller at another time. The server can also adjust the transcoding rate according to the fluctuation condition of the network, so as to ensure the normal operation of the online conference. 
In some embodiments, online conferences are also referred to as cloud conferences.
In the instant messaging application, a user can carry out voice chat by installing the instant messaging application on the terminal. Taking the example that two users carry out voice chat through the instant messaging application, the instant messaging application can acquire voice signals of the two users in the chat process through terminals of the two users, send the voice signals to the server, the server sends the voice signals to the two terminals respectively, and the instant messaging application plays the voice signals through the terminals, so that the voice chat between the two users can be realized. Similar to the online conference scenario, the network environments of the two parties in voice chat may be different, that is, the network bandwidth of one party is larger, and the network bandwidth of the other party is smaller. Under the condition, the server can transcode the voice signal by adopting the technical scheme provided by the embodiment of the application, and the voice signal is converted into the proper code rate and then is sent to the two terminals, so that the two users can be ensured to normally carry out voice chat.
In the application of the live broadcast type, the live broadcast end used by the anchor can collect the live broadcast voice signal of the anchor, the live broadcast voice signal is sent to the live broadcast server, the live broadcast server sends the live broadcast voice signal to audience ends used by different audiences, after receiving the live broadcast voice signal, the audience ends play the live broadcast voice signal, and the audiences can hear the voice of the anchor in the live broadcast. Because different audience terminals may be in different network environments, the server can transcode the live broadcast voice signal according to the network environment in which the different audience terminals are located, that is, the live broadcast voice signal is converted into different code rates according to different network bandwidths of the audience terminals, and the voice signals with the different code rates are sent to the different audience terminals, so that the different audience terminals can normally play the live broadcast voice. That is, for the audience with a large network bandwidth, the server can transcode the live broadcast voice signal with a high code rate, and the high code rate means high voice quality, so that the large bandwidth can be fully utilized, and the live broadcast quality can be improved. For the audience with smaller network bandwidth, the server can transcode the live voice signal with lower code rate, and the lower code rate means smaller bandwidth occupation, so that the live voice signal can be ensured to be sent to the audience in real time, and the audience can be ensured to normally watch the live voice. In addition, because of network fluctuation, the network bandwidth may be larger at one time and smaller at another time for the same viewer. The server can also adjust the transcoding rate according to the fluctuation condition of the network bandwidth to ensure the normal operation of live broadcasting.
In addition to the three application scenarios, the technical solution provided in the embodiment of the present application can also be applied to other audio transmission scenarios, such as a broadcast television transmission scenario or a satellite communication scenario, which is not limited in the embodiment of the present application.
Of course, the audio transcoding method provided in the embodiment of the present application can be applied to a server as a cloud service, and can also be applied to a terminal, and the terminal performs fast transcoding on audio.
After the implementation environment and the application scenario of the embodiment of the present application are introduced, the technical solution provided by the embodiment of the present application is described below. In the following description, the execution subject of the audio transcoding method is taken to be the server as an example. Referring to fig. 3, the method includes:
301. the server carries out entropy decoding on the first audio stream with the first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized voice signal.
In some embodiments, the first audio stream is a high-bitrate audio stream, and the audio characteristic parameters include the signal gain, LSF (Line Spectral Frequency) parameters, LTP (Long-Term Prediction) parameters, and the pitch delay. Quantization refers to the process of approximating the continuous values of a signal by a finite number of discrete values; a voice signal is a continuous signal, the excitation signal obtained after quantization is a discrete signal, and the discrete signal is convenient for the server to process subsequently. In some embodiments, the high code rate refers to the code rate of the audio stream uploaded to the server by the terminal; in other embodiments, the high code rate may also be a code rate higher than a certain code rate threshold. For example, if the code rate threshold is 1 Mbps, a code rate higher than 1 Mbps is referred to as a high code rate. Of course, the definition of the high code rate may differ across coding standards, and this is not limited in the embodiment of the present application.
302. And the server acquires a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
In some embodiments, the excitation signal is a discrete signal, and the server is capable of restoring the excitation signal to a time-domain audio signal for subsequent audio transcoding based on the audio characteristic parameters.
303. And the server re-quantizes the excitation signal and the audio characteristic parameters based on the time domain audio signal and the target transcoding code rate.
In some embodiments, the re-Quantization may also be referred to as Noise Shaping Quantization (NSQ), and the re-Quantization process is a compression process, and the re-Quantization of the excitation signal and the audio characteristic parameter by the server is a re-compression process of the excitation signal and the audio characteristic parameter.
304. And the server carries out entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
After the audio characteristic parameters and the excitation signal are re-quantized, that is, re-compressed, entropy coding is performed on the re-quantized audio characteristic parameters and the re-quantized excitation signal, so that the second audio stream with the lower code rate can be directly obtained.
By the technical scheme provided by the embodiment of the present application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed; the audio characteristic parameters and the excitation signal are obtained by entropy decoding. When re-quantization is performed, it is performed directly on the excitation signal and the audio characteristic parameters, without involving processing of the time-domain signal. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain the second audio stream with the smaller code rate. Because the computation amount of entropy decoding and entropy coding is small, and no time-domain signal processing is required, the computation amount can be greatly reduced, and the speed and the efficiency of audio transcoding are improved as a whole on the premise of ensuring the sound quality.
The above steps 301 to 304 briefly introduce the embodiment of the present application; the technical solution provided by the embodiment of the present application will be described more clearly below with reference to some examples. Referring to fig. 4, the method includes:
401. the server carries out entropy decoding on the first audio stream with the first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized voice signal.
In one possible embodiment, the server obtains probabilities of occurrence of a plurality of coding units in the first audio stream. The server decodes the first audio stream based on the occurrence probability to obtain a plurality of decoding units respectively corresponding to the plurality of encoding units. The server combines the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
The above embodiment is one possible embodiment of entropy decoding, and in order to more clearly describe the above embodiment, a description will be given below of one entropy encoding method corresponding to the above embodiment.
For example, to simplify the process, it is assumed that the audio characteristic parameter and the excitation signal of the first audio stream are "MNOOP", and each letter is a coding unit, where the probabilities of occurrence of "M", "N", "O", and "P" in "MNOOP" are 0.2, 0.2, 0.4, and 0.2, respectively, and the initial interval corresponding to "MNOOP" is [0, 100000]. The server divides the interval [0, 100000] into four subintervals according to the probabilities of occurrence of "M", "N", "O" and "P": M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000], where the ratio between the subinterval lengths is the same as the ratio between the corresponding probabilities of occurrence. Since the first letter in "MNOOP" is "M", the server selects the first subinterval M: [0, 20000] as the base interval for subsequent entropy coding. The server, according to the probabilities of occurrence of "M", "N", "O", and "P", divides the interval M: [0, 20000] into four subintervals: MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since the first two letters in "MNOOP" are "MN", the server selects the second subinterval MN: [4000, 8000] as the base interval for subsequent entropy coding. The server divides the interval MN: [4000, 8000] into four subintervals: MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the first three letters in "MNOOP" are "MNO", the server selects the third subinterval MNO: [5600, 7200] as the base interval for subsequent entropy coding. The server, according to the probabilities of occurrence of "M", "N", "O", and "P", divides the interval MNO: [5600, 7200] into four subintervals: MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]. Since the first four letters in "MNOOP" are "MNOO", the server selects the third subinterval MNOO: [6240, 6880] as the base interval for subsequent entropy coding.
The server, according to the probabilities of occurrence of "M", "N", "O", and "P", divides the interval MNOO: [6240, 6880] into four subintervals: MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. The interval for entropy coding "MNOOP" is therefore [6752, 6880], and the server can use any value in the interval [6752, 6880] to represent the result of coding "MNOOP"; for example, 6800 is used to represent "MNOOP", and 6800 is the first audio stream in the above example.
The above embodiment will be described based on the above entropy coding.
Taking the first audio stream as 6800 for example, the server obtains the probabilities of occurrence of the multiple coding units in the first audio stream, that is, the probabilities of occurrence of "M", "N", "O", and "P" are 0.2, 0.2, 0.4, and 0.2, respectively. The server constructs the same initial interval [0, 100000] as in the entropy coding process, and divides the interval [0, 100000] into four subintervals according to the probabilities of occurrence of "M", "N", "O" and "P": M: [0, 20000], N: [20000, 40000], O: [40000, 80000] and P: [80000, 100000]. Since the first audio stream 6800 is in the first subinterval M: [0, 20000], the server uses the interval [0, 20000] as the base interval for subsequent entropy decoding, and uses M as the first decoded unit. The server, according to the probabilities of occurrence of "M", "N", "O", and "P", divides the interval M: [0, 20000] into four subintervals: MM: [0, 4000], MN: [4000, 8000], MO: [8000, 16000] and MP: [16000, 20000]. Since the first audio stream 6800 is in the second subinterval MN: [4000, 8000], the server uses the subinterval [4000, 8000] as the base interval for subsequent entropy decoding, and uses N as the second decoded unit. The server divides the interval MN: [4000, 8000] into four subintervals: MNM: [4000, 4800], MNN: [4800, 5600], MNO: [5600, 7200] and MNP: [7200, 8000]. Since the first audio stream 6800 is in the third subinterval MNO: [5600, 7200], the subinterval [5600, 7200] is used as the base interval for subsequent entropy decoding, and O is used as the third decoded unit.
The server, according to the probabilities of occurrence of "M", "N", "O", and "P", divides the interval MNO: [5600, 7200] into four subintervals: MNOM: [5600, 5920], MNON: [5920, 6240], MNOO: [6240, 6880] and MNOP: [6880, 7200]. Since the first audio stream 6800 is in the third subinterval MNOO: [6240, 6880], the server uses the subinterval [6240, 6880] as the base interval for subsequent entropy decoding, and uses O as the fourth decoded unit. The server then divides the interval MNOO: [6240, 6880] into four subintervals: MNOOM: [6240, 6368], MNOON: [6368, 6496], MNOOO: [6496, 6752] and MNOOP: [6752, 6880]. Since the first audio stream 6800 is in the fourth subinterval MNOOP: [6752, 6880], the server uses P as the fifth decoded unit. The server combines the five decoded units "M", "N", "O", "O" and "P" to obtain "MNOOP", that is, the audio characteristic parameters and the excitation signal of the first audio stream.
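The interval-subdivision procedure in the example above can be sketched in code. The following is a minimal toy model of interval (arithmetic) entropy coding, assuming the symbol probabilities from the "MNOOP" example; it is not the range coder of any particular codec. Exact fractions are used so that the interval endpoints come out as whole numbers, matching the worked example.

```python
from fractions import Fraction

# Symbol model assumed from the example: occurrence probabilities of the coding units.
PROBS = {"M": Fraction(1, 5), "N": Fraction(1, 5),
         "O": Fraction(2, 5), "P": Fraction(1, 5)}

def cumulative(probs):
    """Cumulative [low, high) probability band for each symbol."""
    bands, low = {}, Fraction(0)
    for sym, p in probs.items():
        bands[sym] = (low, low + p)
        low += p
    return bands

def encode(message, total=100000):
    """Narrow [0, total) once per symbol; any value in the final interval encodes the message."""
    bands = cumulative(PROBS)
    low, high = Fraction(0), Fraction(total)
    for sym in message:
        width = high - low
        b_lo, b_hi = bands[sym]
        low, high = low + b_lo * width, low + b_hi * width
    return low, high

def decode(value, length, total=100000):
    """Replay the same subdivision, at each step picking the band containing `value`."""
    bands = cumulative(PROBS)
    low, high = Fraction(0), Fraction(total)
    out = []
    for _ in range(length):
        frac = (Fraction(value) - low) / (high - low)
        for sym, (b_lo, b_hi) in bands.items():
            if b_lo <= frac < b_hi:
                width = high - low
                low, high = low + b_lo * width, low + b_hi * width
                out.append(sym)
                break
    return "".join(out)
```

Encoding "MNOOP" narrows [0, 100000] to [6752, 6880], and decoding 6800 over five symbols recovers "MNOOP", matching the example above.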
In order to more clearly explain the technical solutions provided in the embodiments of the present application, the above embodiments are explained below on the basis of the entropy decoding in the above example.
In a possible embodiment, referring to fig. 5, the server inputs the first audio stream into the interval decoder 501 to perform entropy decoding on the first audio stream; the entropy decoding process refers to the above example and is not described here again. After the first audio stream is entropy decoded by the interval decoder 501, an entropy-decoded audio stream is obtained. The server inputs the entropy-decoded audio stream into the parameter decoder 502, and the parameter decoder 502 outputs the flag bits and pulses, the signal gain, and the audio characteristic parameters. The server inputs the flag bits, pulses, and signal gain into the excitation signal generator 503 to obtain the excitation signal.
402. And the server acquires a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
In a possible implementation manner, the server processes the excitation signal based on the audio characteristic parameter to obtain a time-domain audio signal corresponding to the excitation signal.
For example, referring to fig. 5, the server inputs the audio characteristic parameters and the excitation signal into the frame reconstruction module 504, and the frame reconstruction module 504 outputs the frame-reconstructed audio signal. The server inputs the audio signal after frame reconstruction into a sampling rate conversion filter 505, and performs resampling coding through the sampling rate conversion filter 505 to obtain a time domain audio signal corresponding to the excitation signal. Alternatively, if the frame-reconstructed audio signal is a stereo audio signal, the server can input the frame-reconstructed audio signal to the stereo separation module 506 before inputting the frame-reconstructed audio signal to the sampling rate conversion filter, so as to separate the frame-reconstructed audio signal into a mono audio signal. The server inputs the mono audio signal into the sampling rate conversion filter 505 to perform resampling coding, and obtains a time domain audio signal corresponding to the excitation signal.
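As a rough illustration of the stereo separation and sampling-rate conversion steps above, the following sketch downmixes a stereo pair to mono by channel averaging and resamples by linear interpolation. Both are simplified stand-ins: the actual stereo separation module 506 and sampling rate conversion filter 505 are not specified in the text, and the function names are hypothetical.

```python
def stereo_to_mono(left, right):
    """Separate a stereo pair into one mono signal by averaging the channels."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def resample_linear(signal, src_rate, dst_rate):
    """Naive linear-interpolation resampler standing in for the conversion filter."""
    n_out = int(len(signal) * dst_rate / src_rate)
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate      # position of output sample j in input time
        i = int(pos)
        frac = pos - i
        nxt = signal[i + 1] if i + 1 < len(signal) else signal[i]
        out.append(signal[i] * (1 - frac) + frac * nxt)
    return out
```

A production codec would use a polyphase filter for sampling-rate conversion; linear interpolation is used here only to make the data flow of step 402 concrete.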
The following describes a method for reconstructing a frame of an excitation signal by a frame reconstruction module:
in one possible implementation, the audio characteristic parameters include the signal gain, LSF (Line Spectral Frequency) coefficients, LTP (Long-Term Prediction) coefficients, and the pitch delay. The frame reconstruction module comprises an LTP synthesis filter and an LPC (Linear Predictive Coding) synthesis filter. The server inputs the excitation signal together with the pitch delay and the LTP coefficients in the audio characteristic parameters into the LTP synthesis filter, and the LTP synthesis filter performs the first frame reconstruction on the excitation signal to obtain a first filtered audio signal. The server inputs the first filtered audio signal, the LSF coefficients, and the signal gain into the LPC synthesis filter, and the LPC synthesis filter performs the second frame reconstruction on the first filtered audio signal to obtain a second filtered audio signal. The server fuses the first filtered audio signal and the second filtered audio signal to obtain the frame-reconstructed audio signal.
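The two-stage frame reconstruction described above can be sketched as a pair of recursive filters. This is a deliberately simplified model, assuming a single-tap LTP predictor and assuming the LSF coefficients have already been converted to direct-form LPC coefficients; real codecs use multi-tap LTP filters and higher-order LPC, and the function names here are illustrative.

```python
def ltp_synthesis(excitation, pitch_lag, ltp_coeff):
    """Long-term prediction synthesis: add back the pitch-lagged contribution."""
    out = []
    for n, e in enumerate(excitation):
        pred = ltp_coeff * out[n - pitch_lag] if n >= pitch_lag else 0.0
        out.append(e + pred)
    return out

def lpc_synthesis(signal, lpc_coeffs, gain):
    """Short-term (LPC) synthesis: all-pole filter driven by the LTP-filtered signal."""
    out = []
    for n, x in enumerate(signal):
        pred = sum(a * out[n - k - 1]
                   for k, a in enumerate(lpc_coeffs) if n - k - 1 >= 0)
        out.append(gain * x + pred)
    return out
```

Feeding the excitation through `ltp_synthesis` and its output through `lpc_synthesis` mirrors the first and second frame reconstructions performed by the two filters.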
403. The server obtains a first quantization parameter through at least one iteration process based on the target transcoding code rate, wherein the first quantization parameter is used for adjusting the first code rate of the first audio stream to the target transcoding code rate.
In one possible implementation, the server obtains the first quantization parameter through at least one iteration process. In any iteration process, the server determines a first alternative quantization parameter based on the target transcoding code rate. The server simulates the re-quantization process of the excitation signal and the audio characteristic parameter based on the first alternative quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter. The server simulates the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream. In response to the simulated audio stream meeting a first target condition, and at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting a second target condition, the server determines the first alternative quantization parameter as the first quantization parameter.
In the above embodiment, the processing includes four parts: the server first determines an alternative quantization parameter, and simulates re-quantization of the excitation signal and the audio characteristic parameter according to the alternative quantization parameter to obtain the first signal and the first parameter. The server then simulates the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream. The server judges whether the simulated audio stream meets the requirement, where the judgment is performed based on the first target condition and the second target condition. When the first target condition and the second target condition are both met, the server can end the iteration and output the first quantization parameter. When either of the first target condition and the second target condition is not satisfied, the server iterates again.
In order to more clearly explain the above embodiment, the following description will be divided into four parts to explain the above embodiment.
The first part explains a mode that a server determines a first alternative quantization parameter based on a target transcoding code rate.
The target transcoding code rate can be determined by the server according to actual conditions, for example, the target transcoding code rate is determined according to the network bandwidth, so that the target transcoding code rate is matched with the network bandwidth.
In some embodiments, the first alternative quantization parameter represents a quantization step size, and the larger the quantization step size is, the larger the compression ratio is, and the smaller the quantized data amount is. The smaller the quantization step size, the smaller the compression ratio, and the larger the amount of quantized data. In some embodiments, the target transcoding bitrate is lower than the first bitrate of the first audio stream, and then a bitrate reduction process is performed in the audio transcoding process. In the process, the server can generate a first alternative quantization parameter based on the target transcoding code rate, and after the excitation signal and the audio characteristic parameter are re-quantized by using the first alternative quantization parameter, an audio stream with a lower code rate can be obtained, wherein the code rate of the audio stream is close to the target transcoding code rate.
And a second part is used for explaining a mode that the re-quantization process of the excitation signal and the audio characteristic parameter is simulated by the server based on the first alternative quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter.
The simulation means that the server does not re-quantize the excitation signal and the audio characteristic parameter, but performs a simulation of a re-quantization process based on the first candidate quantization parameter, thereby subsequently determining the first quantization parameter used in the actual quantization process.
In a possible implementation manner, the server respectively simulates a discrete cosine transform process of the excitation signal and a discrete cosine transform process of the audio characteristic parameter to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter. And the server divides the second signal and the second parameter with the first alternative quantization parameter respectively and then performs rounding to obtain a first signal and a first parameter.
The description is given by taking an example that the server performs re-quantization on the excitation signal, and in the simulation process, the server performs discrete cosine transform on the excitation signal to obtain a second signal. The server performs requantization on the second signal by using the quantization step corresponding to the first candidate quantization parameter, that is, the second signal is divided by the quantization step represented by the first candidate parameter and then rounded to obtain the first signal.
For example, the excitation signal is a matrix (shown as an image in the original publication). The server can perform a discrete cosine transform on the excitation signal, that is, apply the following formula (1) to the excitation signal to obtain the second signal:

F(u) = c(u) · Σ_{i=0}^{N−1} f(i) · cos[(2i + 1)uπ / (2N)]  (1)

where F(u) is the second signal, u is the generalized frequency variable, u = 0, 1, 2, …, N−1, f(i) is the excitation signal, N is the number of values in the excitation signal, i is the index of a value in the excitation signal, and c(u) is the normalization coefficient of the transform.

For convenience of explanation, the quantization step size is illustrated as 28. In some embodiments, the server can re-quantize the second signal by the following formula (2) to obtain the first signal:

Q(m) = round(m/S + 0.5)  (2)

where Q() is the quantization function, m is a value in the second signal, round() is the rounding function, and S is the quantization step size.

Taking the value 195 in the second signal as an example, the server can substitute 195 into formula (2), that is, Q(195) = round(195/28 + 0.5) = round(7.464) = 7, which is the result of quantizing 195. After the server re-quantizes the second signal by formula (2), the first signal can be obtained, namely the matrix whose columns are (7, −1, 0, 0)^T, (0, −1, 0, 0)^T, (0, 0, 0, 0)^T and (0, 0, 0, 0)^T.
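The transform-then-quantize step of formulas (1) and (2) can be sketched as follows. The DCT is written out directly from the formula for a 1-D sequence; the c(u) normalization (orthonormal DCT-II) is an assumption, since the patent text does not spell it out.

```python
import math

def dct(signal):
    """Discrete cosine transform per formula (1), orthonormal DCT-II normalization assumed."""
    n = len(signal)
    out = []
    for u in range(n):
        c = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        out.append(c * sum(f * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                           for i, f in enumerate(signal)))
    return out

def quantize(value, step):
    """Formula (2): Q(m) = round(m / S + 0.5); a larger step compresses harder."""
    return round(value / step + 0.5)
```

With a quantization step of 28, `quantize(195, 28)` reproduces the Q(195) = 7 computation from the example.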
And a third part is used for explaining a mode that the server simulates the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream.
Taking the simulation of entropy coding of the first signal as an example, the server can divide the first signal into four column vectors for entropy coding: (7, −1, 0, 0)^T, (0, −1, 0, 0)^T, (0, 0, 0, 0)^T and (0, 0, 0, 0)^T. The server records the vector (7, −1, 0, 0)^T as A, the vector (0, −1, 0, 0)^T as B, and the vector (0, 0, 0, 0)^T as C, so that the first signal can be simplified to (ABCC). In the first signal (ABCC), the probabilities of occurrence of the coding units "A", "B" and "C" are 0.25, 0.25 and 0.5, respectively, and the server generates an initial interval [0, 100000]. The server divides the initial interval [0, 100000], according to the probabilities of occurrence of the coding units "A", "B" and "C", into three subintervals A: [0, 25000], B: [25000, 50000] and C: [50000, 100000]. Since the first letter in the first signal (ABCC) is "A", the server selects the first subinterval A: [0, 25000] as the base interval for subsequent entropy coding. The server divides the interval A: [0, 25000] into three subintervals AA: [0, 6250], AB: [6250, 12500] and AC: [12500, 25000]. Since the second letter in the first signal (ABCC) is "B", the server selects the second subinterval AB: [6250, 12500] as the base interval for subsequent entropy coding. The server divides the interval AB: [6250, 12500] into three subintervals ABA: [6250, 7812.5], ABB: [7812.5, 9375] and ABC: [9375, 12500]. Since the third letter in the first signal (ABCC) is "C", the server selects the third subinterval ABC: [9375, 12500] as the base interval for subsequent entropy coding. The server divides the interval ABC: [9375, 12500] into three subintervals ABCA: [9375, 10156.25], ABCB: [10156.25, 10937.5] and ABCC: [10937.5, 12500]. The interval for entropy coding the first signal (ABCC) is thus ABCC: [10937.5, 12500], and the server can use any value in the interval [10937.5, 12500] to represent the first signal (ABCC), such as 12000.
If the entropy coding process of the first signal and the first parameter is simulated to obtain an interval [100, 130], the server can represent the simulated audio stream by any value in the interval [100, 130], for example, 120.
The fourth section explains the first target condition and the second target condition.
In a possible embodiment, the compliance of the analog audio stream with the first target condition is at least one of:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate, and the audio stream quality parameter of the simulated audio stream is greater than or equal to the quality parameter threshold. The audio stream quality parameter includes the signal-to-noise ratio, PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Analysis), and the like, and the quality parameter threshold is set according to the actual situation, for example, according to the quality requirement of the voice call: when the quality requirement of the voice call is higher, the quality parameter threshold can be set higher, and when the quality requirement of the voice call is lower, the quality parameter threshold can be set lower.
In a possible embodiment, that at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meets the second target condition includes at least one of the following:
the similarity between the time domain audio signal and the first signal is greater than or equal to a similarity threshold. The difference between the target transcoding code rate and the code rate of the analog audio stream is less than or equal to a difference threshold. The number of iterations is equal to the iteration threshold. That is, in the iteration process, the similarity between the time-domain audio signal and the first signal is used as a first factor influencing the termination of the iteration, the difference between the target transcoding code rate and the code rate of the analog audio stream is used as a second factor influencing the termination of the iteration, the number of times of the iteration is used as a third factor influencing the termination of the iteration, and the server determines the moment for terminating the iteration through three phonemes. In some embodiments, if the threshold of the iteration number is 3, the current iteration number is 3, the similarity between the time-domain audio signal and the first signal obtained by iteration is smaller than the similarity threshold, and the difference between the target transcoding code rate and the code rate of the analog audio stream is larger than the difference threshold, because the iteration number is the same as the threshold of the number of times, the server may terminate the iteration and use the alternative quantization parameter corresponding to the current iteration as the first quantization parameter. Through the limitation of the second target condition, the server can obtain the first quantization parameter with fewer iteration times, so that transcoding can be completed at a higher speed under the scene of real-time voice call.
Under the second target condition, the server does not perform a complete iteration process, which in some embodiments is a noise shaping quantization (NSQ) loop. The limitation of the second target condition may also be regarded as a greedy algorithm, and the greedy algorithm can greatly increase the audio transcoding speed for the following reasons: first, since the first audio stream is the optimal quantization result at the high code rate, the server can directly search for other candidate quantization parameters near the quantization parameter of the first audio stream; second, when the excitation signal is compared with the time-domain audio signal, the three factors above can greatly reduce the number of iterations. Of course, in a more aggressive configuration, for example when only one iteration is performed, the decoder may even be removed and audio transcoding performed directly, which is not limited in the embodiments of the present application.
In addition, in the iteration process, in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations not meeting the second target condition, the server takes a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration. That is, when the iteration threshold is greater than 1, if neither the first target condition nor the second target condition is met, the server can re-determine a second candidate quantization parameter based on the target transcoding code rate and perform the next iteration based on it.
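The greedy iteration described in this section can be sketched as follows. This is a deliberately simplified toy: the rate model (rate inversely proportional to the quantization step) and the candidate-update rule are illustrative assumptions standing in for the simulated re-quantization and entropy coding steps, not the patent's actual NSQ loop:

```python
def simulate_stream_rate(qp):
    """Toy rate model: a coarser quantization step (larger qp) yields a
    lower simulated code rate. Purely illustrative."""
    return 64000 / qp

def search_quantization_param(target_rate, init_qp, max_iters=3):
    """Greedily search for the first quantization parameter: stop as soon
    as the simulated rate meets the target, or the iteration budget is hit."""
    qp = init_qp
    for i in range(1, max_iters + 1):
        rate = simulate_stream_rate(qp)      # simulated entropy-coding rate
        if rate <= target_rate:              # first target condition (rate part)
            return qp, i
        if i == max_iters:                   # second target condition: budget reached
            return qp, i
        qp = qp * rate / target_rate         # second candidate, from the rate gap
    return qp, max_iters

qp, iters = search_quantization_param(target_rate=16000.0, init_qp=2.0)
print(qp, iters)  # 4.0 2
```

Starting near the quantization parameter of the high-rate stream lets the search converge in very few iterations, which is the point of the greedy limitation.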
404. The server re-quantizes the excitation signal and the audio characteristic parameter based on the time-domain audio signal and the first quantization parameter.
In a possible implementation manner, the server performs discrete cosine transform on the excitation signal and the audio characteristic parameter respectively to obtain a third signal corresponding to the excitation signal and a third parameter corresponding to the audio characteristic parameter. The server divides the third signal and the third parameter by the first quantization parameter respectively and then rounds the results to obtain the re-quantized excitation signal and the re-quantized audio characteristic parameter. This embodiment and the second part of step 503 belong to the same inventive concept; the implementation process is described above and is not repeated here.
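The transform-divide-round scheme of step 404 can be illustrated with a naive DCT-II. This is a sketch only: the actual transform size, windowing, and quantizer details of the codec are not specified here:

```python
import math

def dct_ii(x):
    """Naive O(n^2) DCT-II, standing in for the discrete cosine transform."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n))
            for k in range(n)]

def requantize(signal, qp):
    """Transform the signal, then divide each coefficient by the
    quantization parameter and round to the nearest integer."""
    return [round(c / qp) for c in dct_ii(signal)]

excitation = [1.0, 2.0, 3.0, 4.0]   # toy stand-in for the excitation signal
print(requantize(excitation, 0.5))  # [20, -6, 0, 0]
```

A larger quantization parameter maps more coefficients to zero, which is how the re-quantization trades quality for a lower code rate.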
405. The server performs entropy coding on the re-quantized audio characteristic parameters and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
In one possible embodiment, the server obtains the probabilities of occurrence of the plurality of coding units in the re-quantized audio characteristic parameters and the re-quantized excitation signal, and encodes the plurality of coding units based on those probabilities to obtain the second audio stream.
For example, to simplify the process, assume that the re-quantized audio feature parameters and the re-quantized excitation signal form the sequence "DEFFG", each letter being one coding unit, where the probabilities of occurrence of "D", "E", "F" and "G" in "DEFFG" are 0.2, 0.2, 0.4 and 0.2, respectively, and "DEFFG" corresponds to an initial interval [0, 100000]. The server divides the interval [0, 100000] into four subintervals according to these probabilities: D: [0, 20000], E: [20000, 40000], F: [40000, 80000] and G: [80000, 100000], where the ratio between the subinterval lengths is the same as the ratio of the corresponding probabilities. Since the first letter of "DEFFG" is "D", the server selects the first subinterval D: [0, 20000] as the base interval for subsequent entropy coding. According to the probabilities of "D", "E", "F" and "G", the server divides the interval D: [0, 20000] into four subintervals: DD: [0, 4000], DE: [4000, 8000], DF: [8000, 16000] and DG: [16000, 20000]. Since the first two letters of "DEFFG" are "DE", the server selects the second subinterval DE: [4000, 8000] as the base interval. The server divides the interval DE: [4000, 8000] into four subintervals: DED: [4000, 4800], DEE: [4800, 5600], DEF: [5600, 7200] and DEG: [7200, 8000]. Since the first three letters of "DEFFG" are "DEF", the server selects the third subinterval DEF: [5600, 7200] as the base interval and divides it into four subintervals: DEFD: [5600, 5920], DEFE: [5920, 6240], DEFF: [6240, 6880] and DEFG: [6880, 7200].
Since the first four letters of "DEFFG" are "DEFF", the server selects the subinterval DEFF: [6240, 6880] as the base interval for subsequent entropy coding. According to the probabilities of "D", "E", "F" and "G", the server divides the interval DEFF: [6240, 6880] into four subintervals: DEFFD: [6240, 6368], DEFFE: [6368, 6496], DEFFF: [6496, 6752] and DEFFG: [6752, 6880]. This yields [6752, 6880] as the interval in which "DEFFG" is entropy-encoded, and the server can represent the encoding result of "DEFFG" using any value in the interval [6752, 6880], for example 6800; this value 6800 is the second audio stream in the above embodiment.
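The "DEFFG" walk can also be inverted: given the codeword and the same probability table, a decoder repeats the subdivision and reads off one symbol per step. A toy sketch follows; in a real coder the number of symbols to decode would come from side information rather than being passed explicitly:

```python
def decode(value, probs, n_symbols, interval=(0.0, 100000.0)):
    """Recover symbols from a single codeword by re-running the interval
    subdivision and picking, at each step, the sub-interval containing it."""
    out = []
    low, high = interval
    for _ in range(n_symbols):
        width = high - low
        start = 0.0
        for sym, p in probs.items():
            lo, hi = low + start * width, low + (start + p) * width
            if lo <= value < hi:
                out.append(sym)
                low, high = lo, hi     # narrow to the matched sub-interval
                break
            start += p
    return "".join(out)

probs = {"D": 0.2, "E": 0.2, "F": 0.4, "G": 0.2}
print(decode(6800.0, probs, 5))  # DEFFG
```

Because 6800 lies inside DEFFG: [6752, 6880], five subdivision steps recover exactly the encoded sequence.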
Optionally, after step 505, the audio transcoding method provided by the embodiment of the present application can also be combined with other audio processing methods to improve the quality of audio transcoding. For example, the audio transcoding method provided by the embodiment of the present application can be combined with a Forward Error Correction (FEC) encoding method. During transmission of the audio stream, errors and jitter may occur, reducing the quality of audio transmission. Based on this, the audio can be encoded using a forward error correction method. The essence of forward error correction is to add redundant information to the audio so that errors that occur can be corrected; the redundant information is information related to the N frames preceding the current audio frame, where N is a positive integer.
In one possible embodiment, the server forward error correction encodes the subsequently received audio stream based on the second audio stream.
For example, assuming that a segment of audio stream is one audio frame, the second audio stream is denoted as the T-1-th frame, and the audio stream subsequently received from the terminal is denoted as the T-th frame, then when the server encodes the T-th frame, it can use the T-1-th frame, that is, the second audio stream, as redundant information in the forward error correction coding of the T-th frame, thereby obtaining an encoded FEC code stream, where T is a positive integer. Since the code rate of the T-1-th frame has been reduced by the audio transcoding method provided by the embodiment of the present application, the overall code rate of the encoded FEC code stream is also reduced, so that, on the premise of ensuring audio quality, the resistance of the audio stream transmission to network fluctuation is improved.
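The frame-packaging idea above can be sketched as follows. This is a toy framing only: the packet fields and the `transcode` hook are assumptions for illustration, not the patent's bitstream format:

```python
def make_fec_stream(frames, transcode):
    """Attach a transcoded (lower-bitrate) copy of frame T-1 to frame T
    as FEC redundancy."""
    stream = []
    for t, frame in enumerate(frames):
        red = transcode(frames[t - 1]) if t > 0 else None
        stream.append({"t": t, "payload": frame, "red": red})
    return stream

def recover(stream, lost_t):
    """If packet lost_t is lost, recover its frame (at reduced quality)
    from the redundant copy carried by the next packet."""
    for pkt in stream:
        if pkt["t"] == lost_t + 1:
            return pkt["red"]
    return None

frames = ["f0", "f1", "f2"]
# the transcode hook stands in for this application's bitrate reduction
stream = make_fec_stream(frames, transcode=lambda f: f + "_lo")
print(recover(stream, 1))  # f1_lo
```

Because the redundant copies are low-bitrate transcodes rather than full-rate duplicates, the FEC overhead stays small while lost frames remain recoverable.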
The above description takes one audio frame as the redundant information in forward error correction coding as an example. In other possible embodiments, referring to fig. 6, if the server is currently encoding the T-th frame, then for the T-1-th and T-2-th frames the server can adopt the audio transcoding method provided by the embodiment of the present application to reduce their code rates, and adopt an in-band forward error correction method to encode the adjusted T-1-th frame, the adjusted T-2-th frame and the T-th frame to obtain an encoded FEC code stream. Because the code rates of the T-1-th and T-2-th frames are reduced, the overall code rate of the encoded FEC code stream is also reduced, so that, on the premise of ensuring audio quality, the resistance of the audio stream transmission to network fluctuation is improved.
With the technical solution provided by the embodiment of the present application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed: the audio characteristic parameters and the excitation signal are obtained by entropy decoding, and a more aggressive greedy algorithm is adopted. During re-quantization, only the excitation signal and the audio characteristic parameters are re-quantized, and no processing of the time-domain signal is involved. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a smaller code rate. The complexity of entropy decoding and entropy coding is almost negligible, so their computation cost is small, and omitting the processing of time-domain signals greatly reduces the amount of computation, so that the speed and efficiency of audio transcoding are improved overall on the premise of ensuring audio quality.
In addition, the present application provides an audio transcoder, whose structure is shown in fig. 7, comprising: an entropy decoding unit 701, a time domain decoding unit 702, a quantization unit 703 and an entropy coding unit 704. The entropy decoding unit 701 is connected to the time domain decoding unit 702 and the quantization unit 703, respectively; the time domain decoding unit 702 is connected to the quantization unit 703; and the quantization unit 703 is connected to the entropy coding unit 704. In some embodiments, the audio transcoder provided by the embodiments of the present application is also referred to as a downstream transcoder.
The entropy decoding unit 701 is configured to perform entropy decoding on the first audio stream with the first code rate to obtain an audio feature parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal.
A time domain decoding unit 702, configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
A quantization unit 703, configured to re-quantize the excitation signal and the audio characteristic parameter based on the time-domain audio signal and the target transcoding code rate. In some embodiments, the quantization unit 703 is also referred to as a fast noise shaping quantization unit.
An entropy encoding unit 704, configured to perform entropy encoding on the re-quantized audio feature parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
In some embodiments, in the transcoding process, the entropy decoding unit 701 can send the audio characteristic parameter and the excitation signal to the time domain decoding unit 702 and the quantization unit 703, respectively, and the time domain decoding unit 702 can obtain the audio characteristic parameter and the excitation signal from the entropy decoding unit and obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal. The time domain decoding unit 702 can send the time domain audio signal to the quantization unit 703. The quantization unit 703 is capable of receiving the target transcoding rate, the audio characteristic parameter, the excitation signal, and the time-domain audio signal, and re-quantizing the excitation signal and the audio characteristic parameter. The quantization unit 703 is capable of sending the re-quantized audio feature parameters and the re-quantized excitation signal to the entropy encoding unit 704, and the entropy encoding unit 704 performs entropy encoding on the re-quantized audio feature parameters and the re-quantized excitation signal, so as to obtain a second audio stream with a second code rate.
In a possible implementation manner, the quantization unit is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding code rate, where the first quantization parameter is used to adjust the first code rate of the first audio stream to the target transcoding code rate. The excitation signal and the audio feature parameter are re-quantized based on the time-domain audio signal and the first quantization parameter.
In a possible implementation manner, the quantization unit is configured to determine, in any iteration process, a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization process of the excitation signal and the audio characteristic parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter; simulate the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting a second target condition.
In a possible embodiment, the simulated audio stream meets the first target condition when at least one of the following holds:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to the quality parameter threshold.
In a possible embodiment, the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meet the second target condition when at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to the iteration threshold.
In one possible embodiment, the quantization unit is configured to:
simulate the discrete cosine transform process of the excitation signal and that of the audio characteristic parameter, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter; and
divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and then round the results to obtain the first signal and the first parameter.
In a possible implementation, the quantization unit is further configured to: in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations not meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration process.
In one possible implementation, the entropy decoding unit is configured to: obtain the probabilities of occurrence of a plurality of coding units in the first audio stream; decode the first audio stream based on those probabilities to obtain a plurality of decoding units corresponding to the plurality of coding units, respectively; and combine the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
In one possible implementation, the entropy coding unit is configured to:
and acquiring the re-quantized audio characteristic parameters and the occurrence probability of a plurality of coding units in the re-quantized excitation signal.
And coding the plurality of coding units based on the occurrence probability to obtain a second audio stream.
In a possible embodiment, the audio transcoder further comprises a forward error correction unit, which is connected to the entropy coding unit and configured to perform forward error correction coding on a subsequently received audio stream based on the second audio stream.
It should be noted that: in the audio transcoder provided in the foregoing embodiment, only the division of each functional unit is illustrated in the foregoing description, and in practical applications, the function allocation may be completed by different functional units according to needs, that is, the internal structure of the audio transcoder is divided into different functional units to complete all or part of the functions described above. In addition, the audio transcoder and the audio transcoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
With the technical solution provided by the embodiment of the present application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed, and the audio characteristic parameters and the excitation signal are obtained by entropy decoding. During re-quantization, only the excitation signal and the audio characteristic parameters are re-quantized, and no processing of the time-domain signal is involved. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a smaller code rate. Because the computation cost of entropy decoding and entropy coding is small, and no processing of time-domain signals is required, the amount of computation is greatly reduced, thereby improving the speed and efficiency of audio transcoding as a whole on the premise of ensuring the sound quality.
Fig. 8 is a schematic structural diagram of an audio transcoding apparatus provided in an embodiment of the present application, and referring to fig. 8, the apparatus includes: a decoding module 801, a time domain audio signal acquisition module 802, a quantization module 803 and an encoding module 804.
The decoding module 801 is configured to perform entropy decoding on the first audio stream with the first code rate to obtain an audio characteristic parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal.
A time domain audio signal obtaining module 802, configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal.
And a quantizing module 803, configured to re-quantize the excitation signal and the audio characteristic parameter based on the time-domain audio signal and the target transcoding code rate.
And the encoding module 804 is configured to perform entropy encoding on the re-quantized audio characteristic parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
In a possible implementation manner, the quantization module is configured to obtain a first quantization parameter through at least one iteration process based on the target transcoding code rate, where the first quantization parameter is used to adjust the first code rate of the first audio stream to the target transcoding code rate. The excitation signal and the audio characteristic parameter are re-quantized based on the time-domain audio signal and the first quantization parameter.
In a possible implementation manner, the quantization module is configured to determine, in any iteration process, a first candidate quantization parameter based on the target transcoding code rate; simulate the re-quantization process of the excitation signal and the audio characteristic parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter; simulate the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream; and determine the first candidate quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition and the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting a second target condition.
In a possible embodiment, the simulated audio stream meets the first target condition when at least one of the following holds:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to the quality parameter threshold.
In a possible embodiment, the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meet the second target condition when at least one of the following holds:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to the iteration threshold.
In a possible implementation manner, the quantization module is configured to simulate the discrete cosine transform process of the excitation signal and that of the audio characteristic parameter, respectively, to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter, and to divide the second signal and the second parameter by the first candidate quantization parameter, respectively, and then round the results to obtain the first signal and the first parameter.
In a possible implementation manner, the quantization module is further configured to, in response to the simulated audio stream not meeting the first target condition, or the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations not meeting the second target condition, use a second candidate quantization parameter determined based on the target transcoding code rate as the input of the next iteration process.
In a possible embodiment, the decoding module is configured to obtain the probabilities of occurrence of a plurality of coding units in the first audio stream, decode the first audio stream based on those probabilities to obtain a plurality of decoding units corresponding to the plurality of coding units, respectively, and combine the plurality of decoding units to obtain the audio characteristic parameters and the excitation signal of the first audio stream.
In a possible embodiment, the encoding module is configured to obtain the probabilities of occurrence of a plurality of coding units in the re-quantized audio feature parameters and the re-quantized excitation signal, and encode the plurality of coding units based on those probabilities to obtain the second audio stream.
In a possible implementation, the apparatus further includes a forward error correction module configured to forward error correction encode a subsequently received audio stream based on the second audio stream.
It should be noted that: in the audio transcoding device provided in the above embodiment, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the audio transcoding device is divided into different functional modules to complete all or part of the functions described above. In addition, the audio transcoding device and the audio transcoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
With the technical solution provided by the embodiment of the present application, when the audio stream is transcoded, a complete parameter extraction process does not need to be executed, and the audio characteristic parameters and the excitation signal are obtained by entropy decoding. During re-quantization, only the excitation signal and the audio characteristic parameters are re-quantized, and no processing of the time-domain signal is involved. Finally, entropy coding is performed on the excitation signal and the audio characteristic parameters to obtain a second audio stream with a smaller code rate. Because the computation cost of entropy decoding and entropy coding is small, and no processing of time-domain signals is required, the amount of computation is greatly reduced, so that the speed and efficiency of audio transcoding are improved as a whole on the premise of ensuring the sound quality.
An embodiment of the present application provides a computer device, configured to execute the method described above, where the computer device may be implemented as a terminal or a server, and a structure of the terminal is introduced below:
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 900 may be: a smartphone, a tablet computer, a laptop computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, etc.
In general, terminal 900 includes: one or more processors 901 and one or more memories 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used to store at least one computer program for execution by the processor 901 to implement the audio transcoding method provided by the method embodiments in the present application.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
The computer device may also be implemented as a server, and the following describes a structure of the server:
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the one or more memories 1002 store at least one computer program that is loaded and executed by the one or more processors 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server 1000 may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a computer program, is also provided, the computer program being executable by a processor to perform the audio transcoding method in the above embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, comprising program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the computer-readable storage medium and executes it, so that the computer device performs the audio transcoding method described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of audio transcoding, the method comprising:
entropy decoding a first audio stream with a first code rate to obtain an audio characteristic parameter and an excitation signal of the first audio stream, wherein the excitation signal is a quantized speech signal;
acquiring a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
re-quantizing the excitation signal and the audio characteristic parameters based on the time-domain audio signal and a target transcoding code rate;
entropy encoding the re-quantized audio characteristic parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
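Leaving the entropy coding stages aside, the core of claim 1 is decoding the excitation back to a time-domain reference and then re-quantizing it for a lower target code rate. A minimal sketch, assuming a simple scalar quantizer (the patent does not fix the quantizer design):

```python
def transcode(quantized_excitation, old_step, new_step):
    # Undo the original quantization to recover a time-domain reference,
    # then re-quantize with a coarser step for the lower target code rate.
    # Entropy decoding/encoding (the first and last steps of claim 1) are omitted.
    reference = [q * old_step for q in quantized_excitation]
    return [round(x / new_step) for x in reference]
```

With `new_step` larger than `old_step`, the re-quantized values span fewer distinct symbols, which is what lets the entropy coder emit the second audio stream at a lower code rate than the first.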
2. The method of claim 1, wherein the re-quantizing the excitation signal and the audio feature parameters based on the time-domain audio signal and a target transcoding code rate comprises:
based on the target transcoding code rate, obtaining a first quantization parameter through at least one iteration process, wherein the first quantization parameter is used for adjusting the first code rate of the first audio stream to the target transcoding code rate;
re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter.
3. The method of claim 2, wherein obtaining the first quantization parameter through at least one iterative process based on the target transcoding code rate comprises:
in any iteration process, determining a first alternative quantization parameter based on the target transcoding code rate;
simulating the re-quantization process of the excitation signal and the audio characteristic parameter based on the first alternative quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio characteristic parameter;
simulating the entropy coding process of the first signal and the first parameter to obtain a simulated audio stream;
determining the first alternative quantization parameter as the first quantization parameter in response to the simulated audio stream meeting a first target condition, and at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting a second target condition.
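The iteration of claims 2 and 3 can be sketched as a search over candidate quantization parameters, each candidate scored by simulating the re-quantization and the entropy coding. The step-doubling schedule and the Shannon bit estimate below are illustrative assumptions, not details taken from the patent:

```python
import math

def estimate_bits(symbols):
    # Shannon estimate of the entropy-coded size from symbol frequencies,
    # standing in for the simulated entropy coding of claim 3.
    n = len(symbols)
    freq = {}
    for s in symbols:
        freq[s] = freq.get(s, 0) + 1
    return sum(-c * math.log2(c / n) for c in freq.values())

def find_quant_step(signal, target_bits, max_iters=16):
    # Grow the candidate quantization parameter until the simulated
    # stream fits the target code rate or the iteration budget runs out.
    step = 1.0
    for _ in range(max_iters):
        quantized = [round(x / step) for x in signal]  # simulated re-quantization
        if estimate_bits(quantized) <= target_bits:    # first target condition
            return step, quantized
        step *= 2  # next candidate parameter for the next iteration (claim 7)
    return step, quantized
```

A coarser step collapses more samples onto the same quantized value, so the entropy estimate falls monotonically as the search proceeds.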
4. The method of claim 3, wherein the simulated audio stream meeting the first target condition means at least one of the following:
the code rate of the simulated audio stream is less than or equal to the target transcoding code rate;
the audio stream quality parameter of the simulated audio stream is greater than or equal to a quality parameter threshold.
5. The method of claim 3, wherein at least one of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, or the number of iterations meeting the second target condition means at least one of the following:
the similarity between the time-domain audio signal and the first signal is greater than or equal to a similarity threshold;
the difference between the target transcoding code rate and the code rate of the simulated audio stream is less than or equal to a difference threshold;
the number of iterations is equal to an iteration count threshold.
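Taken together, claims 4 and 5 amount to an acceptance test for a candidate quantization parameter: the simulated stream must fit the rate or quality budget, and at least one convergence signal must fire. A sketch with illustrative threshold values (none of the thresholds are fixed by the patent):

```python
def accept(sim_rate, target_rate, similarity, iterations,
           sim_threshold=0.9, diff_threshold=0.5, max_iterations=16):
    # First target condition (claim 4): the simulated stream fits the budget.
    first = sim_rate <= target_rate
    # Second target condition (claim 5): any one convergence signal suffices.
    second = (similarity >= sim_threshold
              or abs(target_rate - sim_rate) <= diff_threshold
              or iterations >= max_iterations)
    return first and second
```

The quality-parameter branch of claim 4 is folded into `sim_rate <= target_rate` here for brevity; a fuller sketch would accept either condition.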
6. The method of claim 3, wherein the simulating the re-quantization process of the excitation signal and the audio feature parameter based on the first candidate quantization parameter to obtain a first signal corresponding to the excitation signal and a first parameter corresponding to the audio feature parameter comprises:
respectively simulating the discrete cosine transform process of the excitation signal and the discrete cosine transform process of the audio characteristic parameter to obtain a second signal corresponding to the excitation signal and a second parameter corresponding to the audio characteristic parameter;
and dividing the second signal and the second parameter by the first alternative quantization parameter respectively and then rounding to obtain the first signal and the first parameter.
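Claim 6's quantization step — transform, divide by the candidate parameter, round — can be reproduced directly. The naive DCT-II below is a stand-in for whichever discrete cosine transform variant the codec actually uses:

```python
import math

def dct_ii(x):
    # Naive O(n^2) DCT-II; production codecs use a fast factorization.
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * (k + 0.5) * m / n) for k in range(n))
            for m in range(n)]

def requantize(values, quant_param):
    # Transform, then divide by the candidate quantization parameter and round,
    # as in claim 6. The same recipe applies to the excitation signal and to
    # the audio characteristic parameters.
    return [round(c / quant_param) for c in dct_ii(values)]
```

A larger `quant_param` maps more coefficients to the same integer, trading fidelity for a smaller entropy-coded stream.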
7. The method of claim 3, further comprising:
and in response to the simulated audio stream not meeting the first target condition, or none of the time-domain audio signal and the first signal, the target transcoding code rate and the code rate of the simulated audio stream, and the number of iterations meeting the second target condition, taking a second alternative quantization parameter determined based on the target transcoding code rate as the input of the next iteration process.
8. The method of claim 1, wherein entropy decoding the first audio stream with the first code rate to obtain the audio feature parameters and the excitation signal of the first audio stream comprises:
acquiring the occurrence probability of a plurality of coding units in the first audio stream;
decoding the first audio stream based on the occurrence probability to obtain a plurality of decoding units respectively corresponding to the plurality of coding units;
and combining the plurality of decoding units to obtain the audio characteristic parameters and the excitation signals of the first audio stream.
9. The method of claim 1, wherein entropy encoding the re-quantized audio feature parameters and the re-quantized excitation signal to obtain a second audio stream at a second code rate comprises:
obtaining the audio characteristic parameters after the re-quantization and the occurrence probability of a plurality of coding units in the excitation signal after the re-quantization;
and coding the plurality of coding units based on the occurrence probability to obtain the second audio stream.
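Claims 8 and 9 describe entropy coding and decoding driven by the occurrence probabilities of the coding units. As an illustration, a Huffman construction derives per-unit code lengths from those occurrence counts; the patent does not name a specific entropy code, so Huffman here is an assumption:

```python
import heapq
from collections import Counter

def huffman_lengths(units):
    # Derive per-symbol code lengths from occurrence counts: frequent
    # coding units get short codes, which is what shrinks the stream.
    freq = Counter(units)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        c1, _, m1 = heapq.heappop(heap)
        c2, _, m2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**m1, **m2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]
```

Decoding (claim 8) walks the same code in reverse: with the shared occurrence probabilities, both ends rebuild identical code tables, so no table needs to travel in the stream.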
10. The method of claim 1, wherein after entropy encoding the re-quantized audio feature parameters and the re-quantized excitation signal to obtain a second audio stream at a second code rate, the method further comprises:
forward error correction encoding a subsequently received audio stream based on the second audio stream.
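Claim 10 layers forward error correction on top of the transcoded stream. As a toy illustration of the FEC idea (the patent does not specify the scheme), a single XOR parity packet over a group of equal-length packets lets the receiver rebuild any one lost packet:

```python
def xor_parity(packets):
    # Redundancy packet: byte-wise XOR of all payloads (equal lengths assumed).
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return parity

def recover_lost(received, parity):
    # 'received' holds None at the single lost position; XORing the parity
    # with every surviving packet reproduces the missing payload.
    rebuilt = parity
    for p in received:
        if p is not None:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, p))
    return rebuilt
```

The bitrate saved by transcoding down to the second code rate is what creates headroom for redundancy packets like this without exceeding the original bandwidth.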
11. An audio transcoder, characterized in that the audio transcoder comprises: the entropy coding device comprises an entropy decoding unit, a time domain decoding unit, a quantization unit and an entropy coding unit, wherein the entropy decoding unit is respectively connected with the time domain decoding unit and the quantization unit, the time domain decoding unit is connected with the quantization unit, and the quantization unit is connected with the entropy coding unit;
the entropy decoding unit is configured to perform entropy decoding on a first audio stream with a first code rate to obtain an audio characteristic parameter and an excitation signal of the first audio stream, where the excitation signal is a quantized speech signal;
the time domain decoding unit is configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
the quantization unit is used for re-quantizing the excitation signal and the audio characteristic parameters based on the time domain audio signal and a target transcoding code rate;
the entropy coding unit is configured to perform entropy coding on the re-quantized audio feature parameter and the re-quantized excitation signal to obtain a second audio stream with a second code rate, where the second code rate is lower than the first code rate.
12. The audio transcoder of claim 11, wherein the quantization unit is configured to obtain a first quantization parameter through at least one iterative process based on the target transcoding code rate, and the first quantization parameter is configured to adjust the first code rate of the first audio stream to the target transcoding code rate; re-quantizing the excitation signal and the audio feature parameter based on the time-domain audio signal and the first quantization parameter.
13. An audio transcoding apparatus, characterized in that the apparatus comprises:
the decoding module is used for performing entropy decoding on a first audio stream with a first code rate to obtain audio characteristic parameters and an excitation signal of the first audio stream, wherein the excitation signal is a quantized speech signal;
a time domain audio signal obtaining module, configured to obtain a time domain audio signal corresponding to the excitation signal based on the audio characteristic parameter and the excitation signal;
the quantization module is used for re-quantizing the excitation signal and the audio characteristic parameters based on the time domain audio signal and a target transcoding code rate;
and the coding module is used for entropy coding the re-quantized audio characteristic parameters and the re-quantized excitation signals to obtain a second audio stream with a second code rate, wherein the second code rate is lower than the first code rate.
14. A computer device, characterized in that the computer device comprises one or more processors and one or more memories, in which at least one computer program is stored, which is loaded and executed by the one or more processors to implement the audio transcoding method of any of claims 1 to 10.
15. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to implement the audio transcoding method as claimed in any one of claims 1 to 10.
CN202111619099.XA 2021-02-26 2021-12-27 Audio transcoding method, device, audio transcoder, equipment and storage medium Active CN115050377B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/076144 WO2022179406A1 (en) 2021-02-26 2022-02-14 Audio transcoding method and apparatus, audio transcoder, device, and storage medium
US18/046,708 US20230075562A1 (en) 2021-02-26 2022-10-14 Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218868 2021-02-26
CN2021102188689 2021-02-26

Publications (2)

Publication Number Publication Date
CN115050377A true CN115050377A (en) 2022-09-13
CN115050377B CN115050377B (en) 2024-09-27

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1312974A (en) * 1998-05-27 2001-09-12 微软公司 System and method for entropy encoding quantized transform coefficients of a signal
CN1669071A (en) * 2002-05-22 2005-09-14 日本电气株式会社 Method and device for code conversion between audio encoding/decoding methods and storage medium thereof
CN101086845A (en) * 2006-06-08 2007-12-12 北京天籁传音数字技术有限公司 Sound coding device and method and sound decoding device and method
WO2011048117A1 (en) * 2009-10-20 2011-04-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio signal encoder, audio signal decoder, method for encoding or decoding an audio signal using an aliasing-cancellation
CN102436819A (en) * 2011-10-25 2012-05-02 杭州微纳科技有限公司 Wireless audio compression and decompression methods, audio coder and audio decoder
WO2013185857A1 (en) * 2012-06-14 2013-12-19 Telefonaktiebolaget L M Ericsson (Publ) Method and arrangement for scalable low-complexity coding/decoding
CN104392725A (en) * 2014-12-02 2015-03-04 中科开元信息技术(北京)有限公司 Method and device for hybrid coding/decoding of multi-channel lossless audios
CN108231083A (en) * 2018-01-16 2018-06-29 重庆邮电大学 A kind of speech coder code efficiency based on SILK improves method
CN108370378A (en) * 2016-01-07 2018-08-03 微软技术许可有限责任公司 Audio stream is encoded

Similar Documents

Publication Publication Date Title
US11227612B2 (en) Audio frame loss and recovery with redundant frames
KR102471288B1 (en) Method and apparatus for transmitting and receaving
CN115050378B (en) Audio encoding and decoding method and related products
WO2023241254A1 (en) Audio encoding and decoding method and apparatus, electronic device, computer readable storage medium, and computer program product
JP2019529979A (en) Quantizer with index coding and bit scheduling
CN111816197A (en) Audio encoding method, audio encoding device, electronic equipment and storage medium
CN112767955A (en) Audio encoding method and device, storage medium and electronic equipment
US20220180881A1 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
CN111951821B (en) Communication method and device
CN114842857A (en) Voice processing method, device, system, equipment and storage medium
CN115050377A (en) Audio transcoding method and device, audio transcoder, equipment and storage medium
US20230075562A1 (en) Audio Transcoding Method and Apparatus, Audio Transcoder, Device, and Storage Medium
US10375131B2 (en) Selectively transforming audio streams based on audio energy estimate
US11978464B2 (en) Trained generative model speech coding
CN114283837A (en) Audio processing method, device, equipment and storage medium
WO2011090434A1 (en) Method and device for determining a number of bits for encoding an audio signal
WO2024160281A1 (en) Audio encoding and decoding method and apparatus, and electronic device
US20220386055A1 (en) Apparatus and method for processing multi-channel audio signal
EP4336498A1 (en) Audio data encoding method and related apparatus, audio data decoding method and related apparatus, and computer-readable storage medium
CN116723333B (en) Layered video coding method, device and product based on semantic information
US20240087585A1 (en) Encoding method and apparatus, decoding method and apparatus, device, storage medium, and computer program
CN118609581A (en) Audio encoding and decoding methods, apparatuses, devices, storage medium, and products
CN116011556A (en) System and method for training audio codec
CN117059105A (en) Audio data processing method, device, equipment and medium
CN115985330A (en) System and method for audio encoding and decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40073688

Country of ref document: HK