CN116978389A - Audio decoding method, audio encoding method, apparatus and storage medium - Google Patents

Audio decoding method, audio encoding method, apparatus and storage medium

Info

Publication number
CN116978389A
Authority
CN
China
Prior art keywords
audio
data
decoding
network
target
Prior art date
Legal status
Pending
Application number
CN202211186521.1A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211186521.1A
Publication of CN116978389A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 — using predictive techniques
    • G10L 19/16 — Vocoder architecture
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Abstract

The present application relates to an audio decoding method, apparatus, computer device, storage medium and computer program product. The method comprises: acquiring the encoded encapsulated data corresponding to each of a plurality of parties in a multiparty call; decapsulating each piece of encoded encapsulated data to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece, where the auditory perception intensity represents the degree to which the human ear perceives the audio; determining the decoding network corresponding to each piece of audio encoded data according to its auditory perception intensity, the network complexity of each decoding network being related to the auditory perception intensity of the corresponding audio encoded data; and decoding each piece of audio encoded data through the corresponding decoding network to obtain the decoded audio of each of the plurality of parties, the decoded audio being used for mixing. The method improves the diversity of decoding.

Description

Audio decoding method, audio encoding method, apparatus and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to an audio decoding method, an audio encoding method, an apparatus, a computer device, and a storage medium.
Background
With the development of communication technology, audio encoding and decoding have become important technical means for audio communication. For example, in a multiparty call scenario, the audio of each party needs to be encoded to obtain audio encoded data, the audio encoded data is sent to the terminals of the parties in the call, and each terminal decodes the received audio encoded data to obtain a plurality of decoded audio streams, which are then mixed and played.
At present, every piece of audio encoded data is mainly decoded through the same single decoding network, which limits the diversity of audio decoding.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio decoding method, apparatus, computer device, computer-readable storage medium, and computer program product capable of improving the quality of mixing.
In a first aspect, the present application provides an audio decoding method, the method comprising:
acquiring the encoded encapsulated data corresponding to each of a plurality of parties in the multiparty call;
decapsulating each piece of encoded encapsulated data to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece; the auditory perception intensity is determined from the loudness of the audio corresponding to the audio encoded data and represents the degree to which the human ear perceives the audio;
determining the decoding network corresponding to each piece of audio encoded data according to its auditory perception intensity, where the network complexity of each decoding network is related to the auditory perception intensity of the corresponding audio encoded data;
decoding each piece of audio encoded data through the corresponding decoding network to obtain the decoded audio corresponding to each of the plurality of parties; the decoded audio is used for mixing.
In a second aspect, the present application also provides an audio decoding apparatus, the apparatus comprising:
an encapsulated data acquisition module, configured to acquire the encoded encapsulated data corresponding to each of a plurality of parties in the multiparty call;
a decoding network determining module, configured to decapsulate each piece of encoded encapsulated data to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece, the auditory perception intensity being determined from the loudness of the corresponding audio and representing the degree to which the human ear perceives the audio; and to determine the decoding network corresponding to each piece of audio encoded data according to its auditory perception intensity, where the network complexity of each decoding network is related to the auditory perception intensity of the corresponding audio encoded data;
a decoding module, configured to decode each piece of audio encoded data through the corresponding decoding network to obtain the decoded audio corresponding to each of the plurality of parties; the decoded audio is used for mixing.
In one embodiment, the decoding network determining module is further configured to sort the pieces of audio encoded data in descending order of auditory perception intensity to obtain an audio encoded data sequence; take a preset number of pieces in the header area of the sequence as target audio encoded data, and take a first decoding network with a first network complexity as the decoding network corresponding to the target audio encoded data; and determine the remaining audio encoded data in the sequence other than the target audio encoded data, and take a second decoding network with a second network complexity as the decoding network corresponding to the remaining audio encoded data; the first network complexity is higher than the second network complexity.
In one embodiment, the decoding network determining module is further configured to determine a perception intensity threshold and screen, from the plurality of pieces of audio encoded data, target audio encoded data whose auditory perception intensity is greater than or equal to the threshold; take a first decoding network with a first network complexity as the decoding network corresponding to the target audio encoded data; and determine the remaining audio encoded data other than the target audio encoded data, and take a second decoding network with a second network complexity as the decoding network corresponding to the remaining audio encoded data; the first network complexity is higher than the second network complexity.
In one embodiment, the audio decoding apparatus is further configured to determine, for each party in the multiparty call, the decoded audio of the remaining parties other than the current party; mix the decoded audio of the remaining parties to obtain mixed audio, and encode the mixed audio to obtain mixed-audio encoded data; and transmit the mixed-audio encoded data to the current party, where the transmitted data triggers the terminal of the current party to decode it, obtain the decoded mix, and play the decoded mix.
In one embodiment, the multiparty call includes at least one of a multiparty online conference, multiparty game voice, multiparty audio or video chat, and multiparty live webcast.
In a third aspect, the present application also provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the processor implements steps in any one of the audio decoding methods provided by the embodiments of the present application when the computer program is executed.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the audio decoding methods provided by the embodiments of the present application.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the audio decoding methods provided by the embodiments of the present application.
According to the above audio decoding method, apparatus, computer device, storage medium and computer program product, the encoded encapsulated data corresponding to each of a plurality of parties in the multiparty call is acquired, and each piece is decapsulated to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece. With the auditory perception intensities determined, the network complexity for each piece of audio encoded data can be determined from its auditory perception intensity, and the decoding network for each piece can then be determined from that complexity. Each piece of audio encoded data is decoded by its own decoding network to obtain decoded audio for mixing. Because the decoding networks are chosen according to the auditory perception intensity of the audio, pieces of audio encoded data with different auditory perception intensities can be decoded through decoding networks of different network complexities; compared with the traditional approach of decoding all audio encoded data through the same decoding network, this greatly improves the diversity of audio decoding.
In a first aspect, the present application provides an audio encoding method, the method comprising:
acquiring the audio of a target party in the multiparty call, and determining a target audio frame in the audio;
converting the target audio frame from the time domain to the frequency domain to obtain a plurality of frequency points;
determining the power value and the loudness value corresponding to each frequency point, and determining the auditory perception weight corresponding to each frequency point according to its loudness value;
fusing the power value and the auditory perception weight corresponding to each frequency point to obtain the auditory perception intensity of the target audio frame;
acquiring the audio encoded data obtained by encoding the target audio frame, and encapsulating the auditory perception intensity together with the audio encoded data to obtain encoded encapsulated data; the auditory perception intensity is used for decoding and mixing the audio encoded data corresponding to each party in the multiparty call.
In a second aspect, the present application also provides an audio encoding apparatus, the apparatus comprising:
a frequency point determining module, configured to acquire the audio of a target party in the multiparty call, determine a target audio frame in the audio, and convert the target audio frame from the time domain to the frequency domain to obtain a plurality of frequency points;
an auditory perception intensity determining module, configured to determine the power value and the loudness value corresponding to each frequency point, determine the auditory perception weight corresponding to each frequency point according to its loudness value, and fuse the power value and the auditory perception weight corresponding to each frequency point to obtain the auditory perception intensity of the target audio frame;
an encoding encapsulation module, configured to acquire the audio encoded data obtained by encoding the target audio frame, and encapsulate the auditory perception intensity together with the audio encoded data to obtain encoded encapsulated data; the auditory perception intensity is used for decoding and mixing the audio encoded data corresponding to each party in the multiparty call.
In one embodiment, the frequency point determining module is further configured to obtain a preset framing window when the audio of the target party is acquired, slide the framing window over the audio with a preset sliding step, and take each audio segment framed by the window as a target audio frame.
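As an illustration of the framing just described, the following sketch (in Python, which this application does not itself use) slides a framing window over an audio signal with a fixed sliding step; the 20 ms frame length and 10 ms step at a 16 kHz sampling rate are assumed values for illustration only.

    import numpy as np

    def frame_audio(samples: np.ndarray, frame_len: int, hop: int) -> list:
        """Slide a framing window of frame_len samples over the audio in
        steps of hop samples; each framed segment is one target audio frame."""
        return [samples[start:start + frame_len]
                for start in range(0, len(samples) - frame_len + 1, hop)]

    audio = np.zeros(16000)                             # one second of placeholder audio
    frames = frame_audio(audio, frame_len=320, hop=160) # 20 ms frames, 10 ms hop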
In one embodiment, the auditory perception intensity determining module is further configured to determine, from the Fourier transform result of the target audio frame, the real part and the imaginary part corresponding to each frequency point; and, for each of the plurality of frequency points, multiply the real part of the current frequency point by itself to obtain a real-part value, multiply the imaginary part of the current frequency point by itself to obtain an imaginary-part value, and add the two values to obtain the power value of the current frequency point.
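The power computation described above amounts to, per frequency point, power = real × real + imag × imag. A minimal sketch, assuming NumPy's real FFT stands in for the Fourier transform of this embodiment:

    import numpy as np

    def bin_powers(frame: np.ndarray) -> np.ndarray:
        """FFT the frame, then add the squared real part and the squared
        imaginary part of each frequency point to get its power value."""
        spectrum = np.fft.rfft(frame)
        return spectrum.real ** 2 + spectrum.imag ** 2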
In one embodiment, the auditory perception intensity determining module is further configured to obtain acoustic equal-loudness curve data and linearly interpolate it to obtain the loudness value corresponding to each frequency point; the acoustic equal-loudness curves describe the correspondence between sound pressure level and sound frequency under equal-loudness conditions.
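A sketch of the linear interpolation step; the (frequency, loudness) pairs below are invented placeholders and do not come from any published equal-loudness standard:

    import numpy as np

    curve_freqs = np.array([20.0, 100.0, 500.0, 1000.0, 4000.0, 8000.0, 16000.0])
    curve_loudness = np.array([0.1, 0.4, 0.8, 1.0, 1.2, 0.9, 0.5])  # hypothetical values

    def loudness_for_bins(bin_freqs: np.ndarray) -> np.ndarray:
        """Linearly interpolate the equal-loudness curve data at each
        frequency point's centre frequency."""
        return np.interp(bin_freqs, curve_freqs, curve_loudness)

    bin_freqs = np.fft.rfftfreq(320, d=1.0 / 16000)  # assumed 320-sample frame at 16 kHz
    loudness_values = loudness_for_bins(bin_freqs)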
In one embodiment, when the loudness value is greater than zero, the greater the loudness value of a frequency point, the greater its auditory perception weight; within a first preset frequency band, the greater the frequency value of a frequency point, the greater its auditory perception weight; within a second preset frequency band, the greater the frequency value, the smaller the auditory perception weight; the frequency values in the first preset frequency band are smaller than those in the second preset frequency band.
In one embodiment, the auditory perception intensity determining module is further configured to multiply, for each frequency point, its power value by its auditory perception weight to obtain the auditory perception power value of that frequency point, and accumulate the auditory perception power values to obtain the auditory perception intensity of the target audio frame.
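The fusion step above is a weighted sum over the K frequency points; a minimal sketch:

    import numpy as np

    def auditory_perception_intensity(powers: np.ndarray, weights: np.ndarray) -> float:
        """Multiply each frequency point's power value by its auditory
        perception weight, then accumulate over all points."""
        return float(np.sum(powers * weights))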
In one embodiment, the auditory perception intensity determining module is further configured to obtain the historical auditory perception intensity of a historical audio frame, the historical audio frame being the audio frame immediately preceding the target audio frame in the audio of the target party, and the historical auditory perception intensity being that frame's auditory perception intensity after smoothing; and to obtain a preset smoothing coefficient and smooth the auditory perception intensity of the target audio frame based on the smoothing coefficient and the historical auditory perception intensity, obtaining the smoothed auditory perception intensity.
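The application does not give the smoothing formula; the sketch below assumes first-order exponential smoothing, with the smoothing coefficient alpha as an illustrative value:

    def smooth_intensity(current: float, previous_smoothed: float,
                         alpha: float = 0.9) -> float:
        """Blend the target frame's auditory perception intensity with the
        immediately preceding frame's smoothed intensity (alpha is assumed)."""
        return alpha * previous_smoothed + (1.0 - alpha) * current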
In a third aspect, the present application also provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the processor implements steps in any one of the audio coding methods provided by the embodiments of the present application when the computer program is executed.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the audio coding methods provided by the embodiments of the present application.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the audio coding methods provided by the embodiments of the present application.
According to the above audio encoding method, apparatus, computer device, storage medium and computer program product, a target audio frame can be determined by acquiring the audio of the target party in the multiparty call. The target audio frame is converted from the time domain to the frequency domain to obtain a plurality of frequency points. Because the human ear's auditory perception of audio is related to loudness, the power value and loudness value corresponding to each frequency point are determined, the auditory perception weight of each frequency point is determined from its loudness value, and a quantized auditory perception intensity is obtained from the power values and auditory perception weights. The audio encoded data obtained by encoding the target audio frame is then encapsulated together with the quantized auditory perception intensity to obtain encoded encapsulated data, yielding a new audio encoding method. In addition, the auditory perception intensity carried in the encoded encapsulated data can be used for decoding and mixing the audio encoded data of each party in the multiparty call, which provides a new audio decoding method and improves the diversity of audio decoding. Moreover, because the auditory perception intensity is packaged into the encoded encapsulated data, the decoding network corresponding to each piece of audio encoded data can be determined directly from the auditory perception intensity during subsequent decoding, which improves the efficiency of determining the decoding network and hence the decoding efficiency.
Drawings
FIG. 1 is a diagram of an application environment of an audio decoding method in one embodiment;
FIG. 2 is a flow chart of an audio decoding method according to an embodiment;
FIG. 3A is a schematic diagram of a multi-person online meeting scenario in one embodiment;
FIG. 3B is a schematic diagram of a multi-host co-streaming live broadcast scene in one embodiment;
FIG. 3C is a schematic diagram of a multi-person online chat scenario, in one embodiment;
FIG. 4 is a schematic diagram of server mixing in one embodiment;
FIG. 5 is a schematic diagram of terminal mixing in one embodiment;
FIG. 6 is a schematic diagram of decoding audio encoded data in one embodiment;
FIG. 7 is a flow chart of an audio encoding method according to an embodiment;
FIG. 8 is a diagram of generation of encoded encapsulated data in one embodiment;
FIG. 9 is a schematic diagram of audio codec in one embodiment;
FIG. 10 is a diagram of acoustic equal-loudness contours from an international acoustic measurement standard, according to an embodiment;
FIG. 11 is a graph showing the relationship between frequency values corresponding to frequency points and auditory perception weights in one embodiment;
FIG. 12 is a flow chart of an audio encoding method according to an embodiment;
FIG. 13 is a flow chart of an audio decoding method according to an embodiment;
FIG. 14 is a block diagram of an audio decoding apparatus in one embodiment;
FIG. 15 is a block diagram of an audio encoding apparatus in one embodiment;
FIG. 16 is an internal block diagram of a computer device in one embodiment;
FIG. 17 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The audio decoding method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a plurality of terminals 102 each communicate with a server 104 over a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or placed on the cloud or another server. Each terminal 102 is the terminal of one party in a multiparty call. In one embodiment, a terminal 102 may capture the audio of its party, encode and encapsulate the audio to obtain encoded encapsulated data, and send the encoded encapsulated data to the server 104; each of the plurality of terminals 102 may do so. For each terminal 102, the server 104 is triggered to decode and mix the encoded encapsulated data of the other terminals 102, obtain mixed audio, and return the mixed audio to that terminal for playback. In another embodiment, a terminal 102 may likewise capture its party's audio, encode and encapsulate it, and send the encoded encapsulated data to the server 104; for each terminal 102, the server 104 returns the encoded encapsulated data of the other terminals, so that the terminal itself decodes and mixes the received data to obtain mixed audio and plays it.
The terminal 102 may be, but is not limited to, a desktop computer, notebook computer, smartphone, tablet computer, Internet-of-Things device, or portable wearable device; the Internet-of-Things device may be a smart speaker, smart television, smart air conditioner, smart in-vehicle device, or the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In order to better understand the audio decoding and audio encoding methods in the embodiments of the present application, the overall concept of the present application is described below:
In the multiparty call scenario, since every party may speak, the audio of each party is encoded to obtain audio encoded data. The human ear has limited ability to resolve individual sound sources in a mix: generally, at most about four sounds in a mix can be distinguished, and the remaining sounds are ignored. In particular, the human ear tends to recognize audio with strong auditory perception in a mix and to ignore audio with weak auditory perception. Because weakly perceived audio tends to be ignored, its audio encoded data does not need high-precision decoding; it suffices to decode it through a decoding network of low network complexity. The limited decoding resources can instead be devoted to audio encoded data with strong auditory perception, which is decoded with high precision to obtain high-quality decoded audio. A party can then easily identify the strongly perceived decoded audio in the mixed audio, the probability of producing an unclear-sounding mix from the multiple decoded audio streams is reduced, and the user experience is greatly improved.
In one embodiment, as shown in FIG. 2, an audio decoding method is provided and applied to a computer device, which may be, for example, the server or a terminal in FIG. 1; the audio decoding method includes the following steps:
step 202, obtaining the coding encapsulation data corresponding to each of a plurality of call parties in the multiparty call.
A call refers to voice interaction between at least two parties through their respective call terminals. A party may be a natural person, a robot, a virtual character, or the like. For example, referring to FIG. 3A, in a multi-person online conference scenario, each participant is a party. For another example, referring to FIG. 3B, in a co-streaming live broadcast scene where multiple hosts stream together, each connected host is also a party. For another example, referring to FIG. 3C, in a multi-person chat scenario, each chat member participating through an instant messaging application may be a party. FIG. 3A illustrates a multi-person online conference scenario in one embodiment. FIG. 3B illustrates a multi-host co-streaming live broadcast scene in one embodiment. FIG. 3C illustrates a multi-person online chat scenario in one embodiment.
Specifically, during the multiparty call, the computer device may obtain the encoded encapsulated data corresponding to each of the plurality of parties. For example, when the computer device is a server, it may receive the encoded encapsulated data sent by the terminal of each party in the call. When the computer device is the terminal of one party, it may receive the encoded encapsulated data sent by the other parties' terminals. Encoded encapsulated data refers to data obtained by encoding and encapsulating a party's audio.
In one embodiment, encoding audio refers to converting an analog audio signal into a digital signal for transmission over a channel. The aim of encoding is to occupy as little channel capacity as possible while transmitting speech of the highest possible quality, under constraints on algorithm complexity and communication delay. Basic speech coding methods can be classified into waveform coding, parametric coding, and hybrid coding.
Step 204, decapsulating each piece of encoded encapsulated data to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece; the auditory perception intensity is determined from the loudness of the audio corresponding to the audio encoded data and represents the degree to which the human ear perceives the audio.
Audio encoded data refers to data obtained by encoding audio through an encoding network; the encoding network may be a machine learning model with encoding capability obtained by learning from samples. The auditory perception intensity represents the degree to which the human ear perceives the audio and is determined mainly from loudness. Loudness, also called volume, reflects the perceived strength of a sound and is a subjective perception of sound magnitude by the human ear. Loudness varies with sound intensity but is also affected by frequency: sounds of the same intensity but different frequencies have different auditory perception intensities for the human ear. In one embodiment, the auditory perception intensity characterizes the degree to which the human ear perceives sound loudness; it can be determined from the auditory perception weights, and the greater the loudness of a sound, the greater its auditory perception weight.
Specifically, because each piece of encoded encapsulated data includes audio encoded data and the corresponding auditory perception intensity, when a plurality of pieces of encoded encapsulated data are obtained, the computer device can decapsulate each piece to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece.
In one embodiment, the process of decapsulating the encoded encapsulated data may be the inverse of the encapsulation process. When the encoded encapsulated data is obtained, the computer device may decapsulate it through a preset data encapsulation protocol, for example the GFP protocol (Generic Framing Procedure) or the LAPS protocol (Link Access Procedure - SDH).
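The application does not specify a packet layout, and the sketch below is not GFP or LAPS; purely as an illustration, it assumes a hypothetical layout of a 4-byte float carrying the auditory perception intensity, a 4-byte payload length, and the audio encoded data as payload:

    import struct

    HEADER = struct.Struct("!fI")  # hypothetical: intensity (float32) + payload length

    def encapsulate(intensity: float, audio_encoded: bytes) -> bytes:
        return HEADER.pack(intensity, len(audio_encoded)) + audio_encoded

    def decapsulate(packet: bytes) -> tuple:
        """Inverse of encapsulate: recover the auditory perception intensity
        and the audio encoded data from one packet."""
        intensity, length = HEADER.unpack_from(packet)
        payload = packet[HEADER.size:HEADER.size + length]
        return intensity, payload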
In one embodiment, the computer device may decapsulate the pieces of encoded encapsulated data sequentially, or decapsulate multiple pieces simultaneously via multithreading.
In one embodiment, for each piece of encoded encapsulated data, when the computer device decapsulates it to obtain the audio encoded data and the auditory perception intensity, it stores the identifier of the audio encoded data together with the auditory perception intensity, obtaining a correspondence between identifiers of audio encoded data and auditory perception intensities, so that the decoding network corresponding to each piece of audio encoded data can later be determined based on this correspondence.
Step 206, determining the decoding network corresponding to each piece of audio encoded data according to its auditory perception intensity, where the network complexity of each decoding network is related to the auditory perception intensity of the corresponding audio encoded data.
Specifically, having determined the auditory perception intensity of each piece of audio encoded data, the computer device determines the network complexity corresponding to each auditory perception intensity, and from that determines the network complexity corresponding to each piece of audio encoded data. For example, for each of the plurality of auditory perception intensities, having determined the network complexity corresponding to the current intensity, the computer device treats it as the network complexity of the audio encoded data with that intensity.
Further, having determined the network complexity of each piece of audio encoded data, the computer device takes a decoding network of that complexity as the decoding network for that piece. For example, for each of the plurality of pieces of audio encoded data, the computer device may use a decoding network whose complexity matches that of the current piece as its decoding network.
In one embodiment, the computer device obtains a correspondence between auditory perception intensity and network complexity, and determines the network complexity corresponding to each piece of audio encoded data obtained by decapsulation based on this correspondence.
In one embodiment, audio encoded data with high auditory perception intensity corresponds to a decoding network of high network complexity, and audio encoded data with low auditory perception intensity corresponds to a decoding network of low network complexity. Note that high and low auditory perception intensities are relative terms: a high intensity is stronger, and a low intensity weaker, than the remaining intensities obtained by decapsulation. Likewise, high and low network complexities are relative to the remaining decoding networks among the plurality of decoding networks.
In one embodiment, having determined the network complexity corresponding to each auditory perception intensity, the computer device determines, from the correspondence between identifiers of audio encoded data and auditory perception intensities, the network complexity corresponding to each identifier, and uses it as the network complexity of the audio encoded data with that identifier.
In one embodiment, the network complexity characterizes the model complexity of the decoding network. Under different network complexities, the scale of the network model parameters of the decoding network also differs. In general, the larger the model parameters of the decoding network, the higher its network complexity, the stronger its signal reconstruction capability, and the higher the quality of the audio reconstructed by decoding, but also the higher the computational complexity (greater computational overhead). Conversely, a decoding network with smaller model parameters has lower network complexity and weaker reconstruction capability, yields lower-quality decoded audio, but also has lower computational complexity (smaller computational overhead).
In one embodiment, the network complexity may be determined by at least one of spatial complexity and temporal complexity: by spatial complexity alone, by temporal complexity alone, or by a combination of the two. The temporal complexity characterizes the number of operations the network performs and determines its inference time. The spatial complexity reflects the total amount of memory access that occurs when a single sample is input and the network completes one forward propagation; the more parameters the network has, the higher its spatial complexity.
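As a rough illustration of how parameter scale drives spatial complexity, the sketch below counts the parameters of two hypothetical fully connected decoders; the layer sizes are invented for illustration and are not taken from this application:

    def mlp_param_count(layer_sizes: list) -> int:
        """Weights plus biases of a fully connected network, a simple
        proxy for the spatial complexity discussed above."""
        return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

    high = mlp_param_count([80, 1024, 1024, 1024, 320])  # hypothetical high-complexity decoder
    low = mlp_param_count([80, 256, 256, 320])           # hypothetical low-complexity decoder
    print(high, low)  # the larger model stores and multiplies far more parameters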
Step 208, decoding each piece of audio encoded data through the corresponding decoding network to obtain the decoded audio corresponding to each of the plurality of parties; the decoded audio is used for mixing.
Specifically, having determined the decoding network corresponding to each piece of audio encoded data, the computer device may decode each piece through its decoding network to obtain the decoded audio corresponding to each of the plurality of parties. Further, referring to FIG. 4, when the computer device is a server, for a party i among the plurality of parties, the computer device mixes the decoded audio of the other parties (excluding party i) to obtain mixed audio and sends it to the terminal of party i for playback, where i is a positive integer no greater than the number of parties in the multiparty call. Referring to FIG. 5, when the computer device is the terminal of a party, it mixes the decoded audio of the remaining parties to obtain mixed audio and plays it. FIG. 4 shows a schematic diagram of server mixing in one embodiment. FIG. 5 shows a schematic diagram of terminal mixing in one embodiment.
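A minimal sketch of the mix-minus step above, assuming each party's decoded audio is a NumPy array of equal length and that mixing is a simple additive sum (the application does not specify the mixing operation):

    import numpy as np

    def mix_for_party(decoded: dict, current_party: str) -> np.ndarray:
        """Sum the decoded audio of every party except the current one."""
        others = [audio for party, audio in decoded.items() if party != current_party]
        return np.sum(others, axis=0)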
In one embodiment, multiple types of decoding networks may be deployed in the computer device, each type with a different network complexity. Decoding networks of different complexities may share a similar network structure and the same definition of input features but differ in the number of units, for example the same structure with different numbers of neurons, or a similar structure with different numbers of processing layers.
Alternatively, decoding networks of different complexities may have entirely different network structures while keeping consistent definitions of input and output types.
In one embodiment, the audio encoded data may be the mel-spectrum features of one audio frame; the computer device upsamples the mel-spectrum features several times to obtain the frame's audio vector and performs speech recovery on it to obtain the decoded audio.
In one embodiment, the decoding network may be a machine learning network with audio decoding capabilities that is pre-trained with samples, which may reconstruct audio features (e.g., mel-frequency cepstral features) into an audio signal.
In one embodiment, referring to FIG. 6, the computer device may route each piece of audio encoded data to a decoding network of the corresponding network complexity based on its auditory perception intensity: audio encoded data with high auditory perception intensity is input into a decoding network of high complexity, and audio encoded data with low auditory perception intensity into a decoding network of low complexity. The stronger the auditory perception intensity, the stronger the auditory perception, and vice versa. FIG. 6 shows a schematic diagram of decoding audio encoded data in one embodiment.
In one embodiment, the auditory perception intensity may specifically characterize the degree to which the human ear perceives audio loudness. Not every party is actively speaking at all times during a multiparty call: a party may be speaking aloud, typing while others speak, or merely listening. Decoding the audio encoded data of every call terminal with the same decoding network therefore wastes decoding resources and limits the improvement of decoding quality. In view of this, the embodiments of the present application propose dynamically configuring the decoding network for each piece of audio encoded data, so that, through dynamic changes in complexity, limited decoding resources lean toward decoding the audio of parties who are effectively speaking, achieving a dynamic balance between decoding quality and decoding efficiency.
As is readily understood, the loudness of sounds that are not effective speech, such as whispering or typing, is generally lower than the loudness of effective speech. Accordingly, to decode effective speech at high quality, the computer device may input audio encoded data with high auditory perception intensity (data encoded from effective speech) into a decoding network of high network complexity for high-precision decoding, obtaining high-quality decoded audio of the effective speech, while audio encoded data with low auditory perception intensity (data encoded from sounds that are not effective speech) is input into a decoding network of low network complexity and decoded accordingly. In this way the limited decoding resources lean toward the more important, effectively spoken audio encoded data; a dynamic balance is achieved between decoding quality and decoding efficiency; and the probability of computational overload during multi-channel audio decoding, or of stuttering caused by an excessive decoding load affecting the smooth operation of the overall function, is reduced.
In the above audio decoding method, the encoded encapsulated data corresponding to each of a plurality of parties in the multiparty call is acquired, and each piece is decapsulated to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each piece. The network complexity of each piece of audio encoded data can then be determined based on its auditory perception intensity, and its decoding network determined from that complexity. Each piece is decoded by its own decoding network to obtain decoded audio for mixing. Because the decoding networks are chosen according to the auditory perception intensity of the audio, pieces with different auditory perception intensities can be decoded through decoding networks of different complexities, which, compared with the traditional approach of decoding all audio encoded data through the same decoding network, greatly improves the diversity of audio decoding.
In one embodiment, the pieces of audio encoded data are sorted in descending order of auditory perception intensity to obtain an audio encoded data sequence; a preset number of pieces in the header area of the sequence are taken as target audio encoded data, and a first decoding network with a first network complexity is taken as the decoding network corresponding to the target audio encoded data; the remaining audio encoded data in the sequence other than the target audio encoded data is determined, and a second decoding network with a second network complexity is taken as the decoding network corresponding to the remaining audio encoded data; the first network complexity is higher than the second network complexity.
The header area refers to the span of the audio encoded data sequence that starts at its head.
Specifically, having obtained the auditory perception intensities of the plurality of pieces of audio encoded data, the computer device may sort the pieces in descending order of intensity to obtain the audio encoded data sequence. For example, if audio encoded data A has intensity a, audio encoded data B has intensity b, audio encoded data C has intensity c, and a > b > c, the resulting sequence is [A, B, C]. The computer device then extracts a preset number of pieces from the head of the sequence and takes all of them as target audio encoded data. The number extracted may be set as required; for example, it may be determined by, and be no greater than, the number of first decoding networks with the first network complexity deployed in the computer device. Having determined the target audio encoded data, the computer device uses a first decoding network with the first network complexity as the decoding network for each target piece; in the example above, if A and B are both target audio encoded data, first decoding network D may serve as the decoding network for A and first decoding network E as the decoding network for B.
Further, the computer device determines the remaining audio encoded data in the sequence other than the target audio encoded data and uses a second decoding network with the second network complexity as its decoding network; in the example above, C is the remaining piece, so the second decoding network F with the second network complexity serves as its decoding network. The first network complexity is higher than the second network complexity.
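The header-area selection above can be sketched as follows; the stream identifiers and the network labels are placeholders:

    def assign_by_rank(intensities: dict, top_n: int) -> dict:
        """Sort streams in descending order of auditory perception intensity;
        the first top_n (the header area) get the first, high-complexity
        decoding network, the rest get the second, low-complexity one."""
        ranked = sorted(intensities, key=intensities.get, reverse=True)
        return {s: ("first_network" if i < top_n else "second_network")
                for i, s in enumerate(ranked)}

    print(assign_by_rank({"A": 0.9, "B": 0.5, "C": 0.2}, top_n=2))
    # {'A': 'first_network', 'B': 'first_network', 'C': 'second_network'}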
In this embodiment, using the first decoding network with the first network complexity for the target audio encoded data gives the pieces with high auditory perception intensity a decoding network of high complexity, and using the second decoding network with the second network complexity for the remaining pieces gives the pieces with low intensity a decoding network of low complexity. Audio encoded data with high auditory perception intensity is thus decoded through a high-complexity network and data with low intensity through a low-complexity network, improving the diversity of audio decoding.
In one embodiment, having obtained the auditory perception intensities, the computer device may sort the pieces of audio encoded data in descending order of intensity, divide the resulting sequence into a plurality of audio encoded data segments, determine the network complexity of each segment according to its position in the sequence, and determine each piece's decoding network from that complexity. For example, a preset number of segments may be configured; the sequence is divided accordingly, and network complexities are assigned from high to low to the segments in order from the front of the sequence to the back. All pieces within one segment share the same network complexity, and once the complexity is determined, a decoding network of that complexity is used as the corresponding decoding network.
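A sketch of this segment-based variant, under the assumption that segments are of equal length and complexity tiers are assigned from highest to lowest along the sequence:

    def assign_by_segment(intensities: dict, tiers: list) -> dict:
        """Divide the descending-ordered sequence into len(tiers) segments
        and give every piece in a segment that segment's complexity tier."""
        ranked = sorted(intensities, key=intensities.get, reverse=True)
        seg_len = -(-len(ranked) // len(tiers))  # ceiling division
        return {s: tiers[i // seg_len] for i, s in enumerate(ranked)}

    print(assign_by_segment({"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.1},
                            tiers=["high", "medium", "low"]))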
In one embodiment, determining the decoding network corresponding to each of the plurality of pieces of audio encoded data according to their auditory perception intensities includes: determining a perception intensity threshold, and screening out, from the plurality of pieces, target audio encoded data whose auditory perception intensity is greater than or equal to the threshold; using a first decoding network with a first network complexity as the decoding network for the target audio encoded data; determining the remaining audio encoded data other than the target audio encoded data, and using a second decoding network with a second network complexity as its decoding network; the first network complexity is higher than the second network complexity.
Specifically, the computer device determines the perception intensity threshold, which may be a preset fixed value or a dynamic value derived from the auditory perception intensities of the pieces themselves; for example, the median of the intensities may serve as the threshold. The computer device then screens, from the plurality of pieces, the target audio encoded data whose intensity is at or above the threshold, and uses the first decoding network with the first network complexity as its decoding network.
Further, the computer device determines the remaining audio encoded data other than the target audio encoded data and uses the second decoding network with the second network complexity as its decoding network; the first network complexity is higher than the second, i.e. the network complexity of the first decoding network is greater than that of the second.
In one embodiment, the computer device may determine the perception intensity threshold from the ratio of deployed first decoding networks to second decoding networks together with the decapsulated auditory perception intensities. For example, when the deployed ratio is 1:1, the median of the intensities may be used as the threshold, so that every piece of audio encoded data has a corresponding decoding network.
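A sketch of the threshold variant, using the median of the decapsulated intensities as the dynamic perception intensity threshold, as in the 1:1 deployment example above:

    import statistics

    def assign_by_threshold(intensities: dict) -> dict:
        """Streams whose auditory perception intensity is at or above the
        median go to the first (higher-complexity) decoding network."""
        threshold = statistics.median(intensities.values())
        return {s: ("first_network" if v >= threshold else "second_network")
                for s, v in intensities.items()}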
In the above embodiment, by setting the perception intensity threshold dynamically, the decoding network corresponding to each piece of audio encoded data can be adjusted at every decoding pass, making audio decoding more flexible.
In one embodiment, after decoding each piece of audio encoded data through its decoding network to obtain the decoded audio of each party, the method further includes: for each party in the multiparty call, determining the decoded audio of the remaining parties other than the current party and mixing it to obtain mixed audio; encoding the mixed audio to obtain mixed-audio encoded data; and transmitting the mixed-audio encoded data to the current party, where the transmitted data triggers the terminal of the current party to decode it, obtain the decoded mix, and play the decoded mix.
Specifically, for each party in the multiparty call, the computer device determines the decoded audio of the remaining parties other than the current party and mixes it to obtain the mixed audio; referring to FIG. 4, when the computer device is a server, it can determine the corresponding mixed audio in this way. The computer device then encodes the mixed audio and sends the mixed-audio encoded data to the terminal of the current party, which decodes it to obtain the decoded mix and plays it. The current party may be the party currently being processed; since the computer device can process several parties simultaneously, there may be several current parties at once, and the current party may be any party in the multiparty call.
In one embodiment, the computer device may also mix the decoded audio corresponding to all received audio encoded data to obtain mixed audio, encode it to obtain mixed-audio encoded data, and send the mixed-audio encoded data to every party in the multiparty call.
In one embodiment, referring to FIG. 5, when the computer device is the terminal of a party, it may mix the decoded audio corresponding to the received audio encoded data to obtain mixed audio and play it.
In the above embodiment, by mixing the corresponding decoded audio, each party in the multiparty call can hear the voices of the other parties through its terminal.
In one embodiment, the multiparty call includes at least one of a multiparty online conference, multiparty game voice, multiparty chat voice, and multiparty live webcast.
Specifically, the multiparty call may be a multiparty online conference in a conference scenario, such as a group voice conference or group video conference; multiparty game voice in a gaming scenario; multiparty voice chat or video chat in a communication scenario, such as group chat or group video; or a multi-host co-streaming live broadcast in a co-streaming scenario, such as several hosts streaming online together, or an online class between teachers and students.
In one embodiment, as shown in fig. 7, an audio encoding method is provided, and the method is applied to a computer device, which may be a server or a terminal in fig. 1, for example, and the audio encoding method includes the following steps:
Step 702, obtain the audio of the target call party in the multiparty call, and determine a target audio frame in the audio.
In particular, the computer device may obtain a target audio frame of a target call party in the multiparty call. The target call party may be any party in the multiparty call. For example, in the case where the computer device is a server, the computer device may obtain the audio of every party in the multiparty call, and each party is then a target call party. In the case where the computer device is the terminal of a party, the computer device may obtain the audio of the party to which it corresponds, and that party is the target call party. Further, having obtained the audio of the target call party, the computer device may frame the audio to obtain target audio frames. It is easy to understand that the obtained audio may be an audio stream, so the computer device may frame the audio stream in real time to obtain the target audio frame. The target audio frame may be any audio frame in the audio stream that has not yet been audio-encoded.
Step 704, converting the target audio frame from the time domain to the frequency domain, to obtain a plurality of frequency points.
Step 706, determining the power value and the loudness value corresponding to each frequency point, and determining the auditory perception weight corresponding to each frequency point according to the loudness value corresponding to each frequency point.
Specifically, the computer device performs a Fourier transform on the target audio frame to convert it from the time domain to the frequency domain, obtains a plurality of frequency points, and determines the power value and loudness value corresponding to each frequency point. The loudness value characterizes the loudness of the audio at that frequency point. Each frequency point may correspond to a single frequency value or to a frequency band. For example, a target audio frame may include K frequency points 0, 1, 2, …, K-1, where K is a positive integer denoting the total number of frequency points. The number of frequency points in the target audio frame, and the frequency value or band corresponding to each, can be set according to actual needs; for example, frequency bands to which the human ear is more sensitive may be selected. Further, having determined the power value and loudness value of each frequency point, the computer device determines the auditory perception weight of each frequency point according to them. For example, the computer device determines the auditory perception weight of frequency point i according to the power value and loudness value of frequency point i. The auditory perception weight is the weighting coefficient used to compute the auditory perception intensity. i is a positive integer no greater than the total number of frequency points in the target audio frame.
In one embodiment, the computer device may perform a fast Fourier transform on the target audio frame to convert it from the time domain to the frequency domain, obtain the spectrum of the target audio frame, determine the amplitude of each frequency point from that spectrum, and obtain the power value of each frequency point from its amplitude.
In one embodiment, the power values of the frequency points in the i-th audio frame may be denoted p(i, j), j = 0 to K-1, where K is the total number of frequency points in the i-th audio frame.
Step 708, fusing the power value and the auditory perception weight corresponding to each frequency point to obtain the auditory perception intensity of the target audio frame.
Specifically, having determined the power value and auditory perception weight of each frequency point, the computer device may fuse them to obtain the auditory perception intensity of the target audio frame. For example, the computer device may compute a weighted sum of the power values, weighted by the auditory perception weights, to obtain the auditory perception intensity.
In one embodiment, the computer device may determine the auditory perception intensity by the following formula:

EP(i) = cof(0)*p(i,0) + cof(1)*p(i,1) + … + cof(J-1)*p(i,J-1)

wherein J is the total number of frequency points in the i-th audio frame, i denotes the i-th audio frame, cof(j) is the auditory perception weight of frequency point j in the i-th frame, and p(i, j) is the power value of frequency point j in the i-th frame.
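A minimal Python sketch of this fusion, assuming one power value and one weight per frequency point (array names are illustrative):

    import numpy as np

    def auditory_perception_intensity(power, cof):
        # EP(i) = sum over j of cof(j) * p(i, j), with one power value
        # and one weight per frequency point of the frame.
        return float(np.dot(power, cof))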
Step 710, obtaining audio coding data obtained by coding the target audio frame, and packaging the hearing perception intensity and the audio coding data to obtain coding packaging data; and the hearing perception intensity is used for decoding and mixing the audio coding data corresponding to each calling party in the multiparty call.
Specifically, while determining the auditory perception intensity of the target audio frame, the computer device may also encode the target audio frame through the encoding network to obtain the audio encoded data. The encoding network may be a machine learning model with encoding capability obtained through sample training. Further, the computer device encapsulates the auditory perception intensity and the audio encoded data through a preset encapsulation protocol to obtain the encoded package data, and transmits the encoded package data to the corresponding server or terminal. For example, when the computer device performing audio encoding is the terminal of a party, it may transmit the encoded package data to a server for mixing. When the computer device performing audio encoding is a server, it may transmit the encoded package data to the terminals of the parties in the multiparty call, so that each terminal decodes and mixes the plurality of pieces of encoded package data it receives.
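Since the encapsulation protocol itself is not specified here, the following Python sketch assumes a simple illustrative layout (a 4-byte float intensity and a 4-byte length prefix before the encoded bytes); it is not the patent's actual protocol:

    import struct

    def encapsulate(intensity, payload):
        # Illustrative layout: 4-byte little-endian float intensity,
        # 4-byte payload length, then the encoded audio bytes.
        return struct.pack("<fI", intensity, len(payload)) + payload

    def decapsulate(packet):
        # Mirror of encapsulate(), as performed at the decoding side.
        intensity, length = struct.unpack_from("<fI", packet, 0)
        return intensity, packet[8:8 + length]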
In one embodiment, referring to fig. 8, an encoding network and an auditory perception intensity computation network are deployed in the computer device in advance; the audio encoded data of the target audio frame can be obtained through the encoding network, and the auditory perception intensity of the target audio frame can be obtained through the auditory perception intensity computation network, after which the audio encoded data and the auditory perception intensity are encapsulated to obtain the encoded package data. FIG. 8 illustrates the generation of encoded package data in one embodiment.
In one embodiment, referring to fig. 9, fig. 9 shows a schematic diagram of audio encoding and decoding in one embodiment. In the audio encoding stage, for each of a plurality of audio frames, the computer device may determine the audio encoded data and auditory perception intensity of the audio frame and encapsulate them to obtain encoded package data. In the audio decoding stage, after obtaining a plurality of pieces of encoded package data, the computer device can decapsulate them to obtain a plurality of pieces of audio encoded data and their auditory perception intensities, determine the decoding network for each piece of audio encoded data according to the magnitude of its auditory perception intensity, and decode through that network to obtain a plurality of decoded audio streams. The computer device can then mix the decoded audio to obtain mixed audio, and output and play the mixed audio.
In the above audio encoding method, the target audio frame is determined from the obtained audio of the target call party in the multiparty call, and is converted from the time domain to the frequency domain to obtain a plurality of frequency points. Because the human ear's auditory perception of audio is related to loudness, the power value and loudness value of each frequency point are determined, the auditory perception weight of each frequency point is derived from its loudness value, and a quantified auditory perception intensity is obtained from the power values and auditory perception weights. Encapsulating this quantified auditory perception intensity together with the audio encoded data of the target audio frame yields the encoded package data, giving a new audio encoding method. Moreover, the auditory perception intensity carried in the encoded package data can be used when decoding and mixing the audio encoded data of each party in the multiparty call, providing a new audio decoding method and improving the diversity of audio decoding. Because the auditory perception intensity is packaged into the encoded package data, the decoding network for each piece of audio encoded data can be determined directly from it during subsequent decoding, improving the efficiency of determining the decoding network and hence the decoding efficiency.
In one embodiment, determining a target audio frame in audio includes: when obtaining the audio of the target calling party, obtaining a preset framing window; and triggering the framing window to slide on the audio of the target calling party by a preset sliding step length to obtain the audio fragment framed by the framing window, and taking the audio fragment as a target audio frame.
In particular, in a multiparty call the audio of a party may arrive as an audio stream. Speech is non-stationary over long spans but approximately stationary over short spans; based on this short-time stationarity, when the audio of the target call party is obtained, the computer device may divide it into small segments for processing, each small segment being one target audio frame. When a target audio frame needs to be determined in the audio of the target call party, the computer device may obtain the framing window and trigger it to slide over the audio, obtain the audio segment framed by the window, and take that segment as the target audio frame. In one embodiment, the framing window may be a Hanning window or a Hamming window. The length of the framing window may match the encoder's definition of a frame, e.g. 20 ms. To improve the smoothness and continuity of the transition between two adjacent frames, overlap between audio frames must also be kept; that is, the sliding step of the framing window may be smaller than the length of one audio frame.
As will be readily appreciated, the computer device may obtain each audio frame in turn through the framing window, so that each audio frame is processed in order of its generation time through the methods described in steps 702 to 710, yielding the encoded package data of each audio frame. The currently processed audio frame may be the target audio frame.
In this embodiment, short-time stationary target audio frames are obtained by framing the audio, so that the accuracy of audio encoding is improved on the basis of these short-time stationary frames.
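A minimal Python sketch of such framing, assuming a 16 kHz sample rate, 20 ms frames, a 10 ms hop, and a Hanning window (all illustrative values):

    import numpy as np

    def frames(audio, sample_rate=16000, frame_ms=20, hop_ms=10):
        # Slide a Hanning window over the signal; a hop shorter than
        # the frame keeps adjacent frames overlapping for a smooth
        # transition between them.
        frame_len = int(sample_rate * frame_ms / 1000)
        hop = int(sample_rate * hop_ms / 1000)
        window = np.hanning(frame_len)
        for start in range(0, len(audio) - frame_len + 1, hop):
            yield audio[start:start + frame_len] * window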
In one embodiment, the frequency points are obtained by performing a Fourier transform on the target audio frame, and each frequency point has a real part and an imaginary part. Determining the power value of each frequency point includes: determining, from the Fourier transform result of the target audio frame, the real part and imaginary part corresponding to each frequency point in the target audio frame; for each of the plurality of frequency points, multiplying the real part of the current frequency point by itself to obtain a real part value; multiplying the imaginary part of the current frequency point by itself to obtain an imaginary part value; and adding the real part value and the imaginary part value to obtain the power value of the current frequency point.
Specifically, performing the Fourier transform on the target audio frame yields a Fourier transform result containing the real part and imaginary part of each frequency point. For each of the plurality of frequency points, the computer device multiplies the real part of the current frequency point by itself to obtain a real part value, multiplies the imaginary part of the current frequency point by itself to obtain an imaginary part value, and adds the two to obtain the power value of the current frequency point. The current frequency point is the frequency point currently being processed by the computer device; the computer device may compute the power values of the frequency points in sequence or simultaneously, and the current frequency point may also be any frequency point in the target audio frame.
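A minimal Python sketch of this per-bin power computation, using a one-sided FFT (an assumption; the transform variant is not fixed here):

    import numpy as np

    def bin_powers(frame):
        # The power of each frequency point is re*re + im*im of its
        # complex spectrum value.
        spectrum = np.fft.rfft(frame)
        return spectrum.real ** 2 + spectrum.imag ** 2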
In one embodiment, the step of determining the loudness value of each frequency point includes: acquiring acoustic equal-loudness curve data, and performing linear interpolation on the acoustic equal-loudness curve data to obtain a loudness value corresponding to each frequency point; the acoustic equal-loudness curve is used to describe the correspondence between sound pressure intensity and sound wave frequency under equal-loudness conditions.
The acoustic equal-loudness curve describes the relationship between sound pressure level and sound wave frequency under equal-loudness conditions, and is one of the important auditory characteristics: it indicates what sound pressure level pure tones at different frequencies must reach for a listener to perceive the same loudness. As shown in fig. 10, at mid-low frequencies (below 1 kHz), the lower the frequency, the greater the sound pressure level (energy) required for equal loudness; in other words, more sound energy is needed for the human ear to perceive the same loudness. At mid-high frequencies (above 1 kHz), different frequency bands likewise differ in acoustic auditory perception. Fig. 10 is an acoustic equal-loudness contour plot from the international acoustic standards organization.
Specifically, the computer device may interpolate the acoustic equal-loudness curve data using linear interpolation, thereby obtaining the loudness value of each frequency point. For example, the loudness value may be calculated based on the acoustic equal-loudness curve data of the BS 3383 standard ("Specification for normal equal-loudness level contours for pure tones under free-field listening conditions"). By linearly interpolating the acoustic equal-loudness curve data of the BS 3383 standard, target acoustic equal-loudness curve data can be obtained, and the loudness value of each frequency point can be read from it.
In one embodiment, the computer device may derive the loudness value of the bins by the following formula, which is derived from chapter four of BS 3383:
afy=af(j-1)+(freq-ff(j-1))*(af(j)-af(j-1))/(ff(j)-ff(j-1));
bfy=bf(j-1)+(freq-ff(j-1))*(bf(j)-bf(j-1))/(ff(j)-ff(j-1));
cfy=cf(j-1)+(freq-ff(j-1))*(cf(j)-cf(j-1))/(ff(j)-ff(j-1));
loud=4.2+afy*(dB-cfy)/(1+bfy*(dB-cfy));
wherein loud is the loudness value and freq is the frequency value of the frequency point whose loudness value is to be calculated; j is a frequency index (i.e. a frequency point value) in the acoustic equal-loudness curve data, each index corresponding to one frequency value; freq is no greater than the frequency value at index j and no less than the frequency value at index j-1; and ff, af, bf, cf are the data published in the BS 3383 acoustic equal-loudness curve tables.
In this embodiment, since the equal loudness curve data records the correspondence between the frequency values and the loudness values, the loudness values corresponding to the frequency points in the target audio frame can be obtained by linearly interpolating the equal loudness curve data.
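A minimal Python sketch of this interpolation, following the afy/bfy/cfy/loud formulas above; the tabulated arrays ff, af, bf, cf are the BS 3383 data (values not reproduced here), and dB is the sound pressure level at which the loudness is evaluated, both supplied by the caller as assumptions:

    import numpy as np

    def loudness(freq, dB, ff, af, bf, cf):
        # Assumes ff is ascending and ff[0] < freq <= ff[-1], so that
        # ff[j-1] <= freq <= ff[j] as in the formulas above.
        j = max(1, int(np.searchsorted(ff, freq)))
        t = (freq - ff[j - 1]) / (ff[j] - ff[j - 1])
        afy = af[j - 1] + t * (af[j] - af[j - 1])
        bfy = bf[j - 1] + t * (bf[j] - bf[j - 1])
        cfy = cf[j - 1] + t * (cf[j] - cf[j - 1])
        return 4.2 + afy * (dB - cfy) / (1 + bfy * (dB - cfy))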
In one embodiment, when the loudness value is greater than zero, the greater the loudness value of the frequency point, the greater the auditory perception weight corresponding to the frequency point; in the first preset frequency band, the larger the frequency value of the frequency point is, the larger the hearing perception weight corresponding to the frequency point is; in the second preset frequency band, the larger the frequency value of the frequency point is, the smaller the hearing perception weight corresponding to the frequency point is; the frequency value in the first preset frequency band is smaller than the frequency value in the second preset frequency band.
In particular, the computer device may calculate the auditory perception weight through a preset formula in which cof denotes the auditory perception weight, freq is the frequency value of the frequency point whose auditory perception weight is to be calculated, and loud is the loudness value of that frequency point. It follows from the formula that the larger the loudness value of a frequency point, the larger the auditory perception weight of that frequency point.
Further, from the calculation formula of the auditory perception weight, a plot of the relationship between the frequency value of a frequency point and its auditory perception weight can be obtained. For example, referring to fig. 11, which shows such a plot: in the first preset frequency band (for example, 1 kHz to 4 kHz), the higher the frequency value of a frequency point, the larger its auditory perception weight; in the second preset frequency band (for example, 4 kHz to 8 kHz), the higher the frequency value, the smaller the auditory perception weight. The frequency values in the first preset frequency band are smaller than those in the second preset frequency band.
In this embodiment, since loudness is the most intuitive perception the human ear has of sound, the auditory perception weight is determined from the loudness value. The auditory perception intensity determined from these weights therefore reflects the ear's auditory response to loudness, which guarantees that the network complexity chosen on the basis of the auditory perception intensity matches the degree to which the human ear perceives the sound.
In one embodiment, determining the auditory perception intensity of the target audio frame according to the power value and the auditory perception weight corresponding to each frequency point includes: for each frequency point, multiplying the corresponding power value by the corresponding auditory perception weight to obtain the corresponding auditory perception power value of each frequency point; and accumulating the hearing perception power values to obtain the hearing perception intensity of the target audio frame.
Specifically, the computer device multiplies the power value of frequency point i by the auditory perception weight of frequency point i to obtain the auditory perception power value of frequency point i, where i is a positive integer no greater than the total number of frequency points in the target audio frame. Further, having obtained the auditory perception power values of all frequency points, the computer device accumulates them to obtain the auditory perception intensity of the target audio frame.
In one embodiment, the method further comprises: acquiring the historical auditory perception intensity of a historical audio frame, the historical audio frame being the immediately preceding audio frame before the target audio frame in the audio of the target call party, and the historical auditory perception intensity being the smoothed auditory perception intensity of that historical audio frame; and acquiring a preset smoothing coefficient, and smoothing the auditory perception intensity of the target audio frame based on the smoothing coefficient and the historical auditory perception intensity to obtain the smoothed auditory perception intensity.
Specifically, in order to avoid frequent switching of the decoding network corresponding to each audio frame in a section of audio, the initial auditory perception intensity of the target audio frame may be smoothed based on the historical auditory perception intensity of the historical audio frame, so as to obtain the auditory perception intensity of the smoothed target audio frame. More specifically, the computer device obtains an immediately preceding audio frame preceding the target audio frame in the audio of the target party, and takes the audio frame as a history audio frame. The computer device determines a historical auditory perception intensity of the historical audio frame. The historical auditory perception intensity is smoothed auditory perception intensity. Further, the computer equipment obtains a preset smoothing coefficient, performs smoothing treatment on the auditory perception intensity of the target audio frame based on the smoothing coefficient and the historical auditory perception intensity to obtain smoothed auditory perception intensity, and encapsulates the smoothed auditory perception intensity and audio encoding data of the target audio frame to obtain encoding encapsulation data.
In one embodiment, the computer device may smooth the auditory perception intensity by the following formula:
EPsm(i)=a*EPsm(i-1)+(1-a)*EP(i)
wherein EPsm (i) is auditory perception intensity after the i-th frame audio frame (target audio frame) is smoothed; EPsm (i-1) is the hearing perception intensity of the i-1 th frame of audio frame (historical audio frame) after smoothing; a is a smoothing coefficient; EP (i) is the auditory perception intensity of the i-th frame audio frame (target audio frame) before smoothing.
In the above embodiment, because the smoothed auditory perception intensity of the target audio frame is determined from the historical auditory perception intensity of the historical audio frame, the final auditory perception intensity of the target audio frame is correlated with that history, which reduces the probability of an abrupt change between the two. Since abrupt changes are less likely, the decoding network assigned to successive audio frames in a section of audio is switched less frequently.
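A minimal Python sketch of this smoothing; the value 0.9 for the smoothing coefficient is an assumed example, not a value given here:

    def smooth_intensity(ep_current, ep_prev_smoothed, a=0.9):
        # EPsm(i) = a * EPsm(i-1) + (1 - a) * EP(i); the history term
        # damps sudden changes in the per-frame intensity.
        return a * ep_prev_smoothed + (1 - a) * ep_current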
In one embodiment, referring to fig. 12, fig. 12 shows a schematic diagram of an audio encoding method in one specific embodiment:
S1202, when the audio of the target call party in the multiparty call is obtained, the computer device obtains a preset framing window, triggers the framing window to slide over the audio of the target call party with a preset sliding step, obtains the audio segment framed by the framing window, and takes the audio segment as the target audio frame.

S1204, the computer device converts the target audio frame from the time domain to the frequency domain to obtain a plurality of frequency points, and determines the real part and imaginary part corresponding to each frequency point in the target audio frame according to the Fourier transform result obtained by performing a Fourier transform on the target audio frame.

S1206, for each of the plurality of frequency points, the computer device multiplies the real part of the current frequency point by itself to obtain a real part value, multiplies the imaginary part of the current frequency point by itself to obtain an imaginary part value, and adds the real part value and the imaginary part value to obtain the power value of the current frequency point.

S1208, the computer device acquires acoustic equal-loudness curve data and performs linear interpolation on it to obtain the loudness value corresponding to each frequency point; the acoustic equal-loudness curve describes the correspondence between sound pressure level and sound wave frequency under equal-loudness conditions.

S1210, the computer device determines the auditory perception weight corresponding to each frequency point according to the loudness value corresponding to each frequency point, and fuses the power value and auditory perception weight of each frequency point to obtain the auditory perception intensity of the target audio frame.

S1212, the computer device acquires the historical auditory perception intensity of the historical audio frame, the historical audio frame being the immediately preceding audio frame before the target audio frame in the audio of the target call party and the historical auditory perception intensity being the smoothed auditory perception intensity of that frame; the computer device then obtains a preset smoothing coefficient and smooths the auditory perception intensity of the target audio frame based on the smoothing coefficient and the historical auditory perception intensity to obtain the smoothed auditory perception intensity.

S1214, the computer device obtains the audio encoded data produced by encoding the target audio frame, and encapsulates the smoothed auditory perception intensity together with the audio encoded data to obtain the encoded package data.
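Tying the helper sketches above together, a compact per-frame encoding pipeline might look as follows; the encoder callable and the per-bin weight array (whose length must match the number of FFT bins) are assumptions supplied by the caller, and the zero initialization of the smoothed intensity is likewise assumed:

    def encode_party_audio(audio, encoder, weights, sample_rate=16000):
        # Sketch of the per-frame pipeline of fig. 12, reusing the
        # frames / bin_powers / auditory_perception_intensity /
        # smooth_intensity / encapsulate sketches above.
        packets, ep_prev = [], 0.0
        for frame in frames(audio, sample_rate):
            power = bin_powers(frame)
            ep = auditory_perception_intensity(power, weights)
            ep_prev = smooth_intensity(ep, ep_prev)
            packets.append(encapsulate(ep_prev, encoder(frame)))
        return packets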
In one embodiment, referring to fig. 13, fig. 13 shows a schematic diagram of an audio decoding method in a specific embodiment:
S1302, the computer device acquires the encoded package data corresponding to each of the plurality of parties in the multiparty call.

S1304, the computer device decapsulates each piece of encoded package data to obtain a plurality of pieces of audio encoded data and the auditory perception intensity corresponding to each; the auditory perception intensity is determined from the loudness of the audio corresponding to the audio encoded data and represents the degree to which the human ear perceives that audio.

S1306, the computer device sorts the audio encoded data in descending order of auditory perception intensity to obtain an audio encoded data sequence.

S1308, the computer device takes a preset number of pieces of audio encoded data at the head of the audio encoded data sequence as target audio encoded data, and takes a first decoding network with a first network complexity as the decoding network corresponding to the target audio encoded data.

S1310, the computer device determines the remaining audio encoded data in the sequence other than the target audio encoded data, and takes a second decoding network with a second network complexity as the decoding network corresponding to the remaining audio encoded data; the first network complexity is higher than the second network complexity.

S1312, the computer device decodes each piece of audio encoded data through its corresponding decoding network to obtain the decoded audio of each of the plurality of parties; the decoded audio is used for mixing.

S1314, for each party in the multiparty call, the computer device determines the decoded audio of the remaining parties in the multiparty call other than the current party.

S1316, the computer device mixes the decoded audio of the other parties to obtain mixed audio, encodes the mixed audio to obtain mixed encoded data, and transmits the mixed encoded data to the current party; the transmitted mixed encoded data triggers the terminal of the current party to decode it, obtain the decoded mix, and play the decoded mix.
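A compact sketch of the corresponding decoding side, reusing the decapsulate helper above; the two decoder callables and the choice of top_n are assumptions, not the actual decoding networks, and the decoded arrays are assumed to be of equal length:

    import numpy as np

    def decode_and_mix(packets, strong_decoder, light_decoder, top_n=2):
        # Rank the streams by auditory perception intensity, decode the
        # top_n with the high-complexity network and the rest with the
        # low-complexity one, then mix.
        unpacked = [decapsulate(p) for p in packets]
        order = sorted(range(len(unpacked)),
                       key=lambda k: unpacked[k][0], reverse=True)
        decoded = []
        for rank, k in enumerate(order):
            net = strong_decoder if rank < top_n else light_decoder
            decoded.append(net(unpacked[k][1]))
        mixed = np.sum(decoded, axis=0)
        return np.clip(mixed, -1.0, 1.0)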
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which need not be performed at the same moment but may be performed at different moments, and whose order of execution need not be sequential; they may instead be performed in turn or alternately with at least part of the other steps, sub-steps, or stages.
The application also provides an application scene, which applies the audio decoding method. Specifically, the application of the audio decoding method in the application scene is as follows:
in a multi-person conference scenario, the terminal of a target participant can obtain the encoded package data of the participants other than itself, decapsulate each piece of encoded package data to obtain the audio encoded data of each other participant, and obtain the auditory perception intensity corresponding to each piece of audio encoded data. The terminal of the target participant determines the decoding network for each piece of audio encoded data according to each auditory perception intensity, and decodes the corresponding audio encoded data through that decoding network to obtain a plurality of decoded audio streams. The terminal of the target participant then mixes and plays the decoded audio, so that the target participant can hear the sounds of the other participants.
The application further provides an application scene, and the application scene applies the audio decoding method. Specifically, the application of the audio decoding method in the application scene is as follows:
in a multi-anchor co-streaming live broadcast, the terminal of the target anchor can obtain the encoded package data of the anchors other than itself and, following the audio decoding method above, decapsulate each piece of encoded package data to obtain a plurality of decoded audio streams. The terminal of the target anchor mixes and plays the decoded audio, so that the target anchor can hear the sounds of the other anchors.
The above application scenarios are only illustrative. It is to be understood that the application of the audio decoding method provided by the embodiments of the present application is not limited to the above scenarios.
Based on the same inventive concept, the embodiment of the application also provides an audio decoding device for realizing the above-mentioned audio decoding method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of one or more audio decoding apparatus provided below may be referred to the limitation of the audio decoding method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 14, there is provided an audio decoding apparatus 1400 comprising: an encapsulated data acquisition module 1402, an encoding network determination module 1404, and a decoding module 1406, wherein:
the encapsulated data obtaining module 1402 is configured to obtain encoded encapsulated data corresponding to each of a plurality of parties in the multiparty call.
The encoding network determining module 1404 is configured to unpack each encoded package data to obtain a plurality of audio encoded data and respective corresponding auditory perception intensities of each audio encoded data; the hearing perception intensity is obtained by determining the loudness of the audio corresponding to the audio coding data, and represents the perception degree of human ears on the audio; and determining decoding networks corresponding to the audio coding data according to the hearing perception intensities corresponding to the audio coding data, wherein the network complexity of each decoding network is related to the hearing perception intensity of the corresponding audio coding data.
A decoding module 1406, configured to decode each audio encoded data through a corresponding decoding network, to obtain decoded audio corresponding to each of the multiple parties; the decoded audio is used for mixing.
In one embodiment, the encoding network determining module 1404 is further configured to sort the audio encoded data according to the order from the high auditory perception intensity to the low auditory perception intensity, so as to obtain an audio encoded data sequence; taking a preset number of audio coding data in a header area of an audio coding data sequence as target audio coding data, and taking a first decoding network with first network complexity as a decoding network corresponding to the target audio coding data; determining the rest of audio coding data except the target coding data in the audio coding data sequence, and taking a second decoding network with second network complexity as a decoding network corresponding to the rest of audio coding data; wherein the first network complexity is higher than the second network complexity.
In one embodiment, the encoding network determining module 1404 is further configured to determine a perception intensity threshold, and screen target audio encoded data from the plurality of audio encoded data that has an auditory perception intensity greater than or equal to the perception intensity threshold; a first decoding network with first network complexity is used as a decoding network corresponding to the target audio coding data; determining the rest of the audio coding data except the target coding data in the plurality of audio coding data, and taking a second decoding network with second network complexity as a decoding network corresponding to the rest of the audio coding data; wherein the first network complexity is higher than the second network complexity.
In one embodiment, the audio decoding apparatus 1400 is further configured to determine, for each party in the multiparty call, decoded audio of the remaining parties other than the current party in the multiparty call; mixing the decoded audio of each other calling party to obtain mixed audio, and coding the mixed audio to obtain mixed coding data; transmitting the audio mixing coding data to the current calling party; the sent mixed sound coding data are used for triggering the terminal of the current calling party to decode the mixed sound coding data, so as to obtain decoded mixed sound, and playing the decoded mixed sound.
In one embodiment, the multiparty call includes at least one of multiparty online conferences, multiparty gaming voices, multiparty audio-video chats and multiparty live webcasts.
Based on the same inventive concept, the embodiment of the application also provides an audio encoding device for realizing the above-mentioned audio encoding method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the audio encoding device provided below may be referred to the limitation of the audio encoding method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 15, there is provided an audio encoding apparatus 1500 comprising: a frequency point determination module 1502, an auditory perception intensity determination module 1504, and an encoding encapsulation module 1506, wherein:
a frequency point determining module 1502, configured to obtain an audio of a target call party in a multiparty call, and determine a target audio frame in the audio; and converting the target audio frame from the time domain to the frequency domain to obtain a plurality of frequency points.
The auditory perception intensity determining module 1504 is configured to determine a power value and a loudness value corresponding to each frequency point, and determine an auditory perception weight corresponding to each frequency point according to the loudness value corresponding to each frequency point; and fusing the power value and the auditory perception weight corresponding to each frequency point to obtain the auditory perception intensity of the target audio frame.
The encoding encapsulation module 1506 is configured to obtain audio encoded data obtained by encoding the target audio frame, and encapsulate the auditory perception intensity and the audio encoded data to obtain encoded encapsulated data; and the hearing perception intensity is used for decoding and mixing the audio coding data corresponding to each calling party in the multiparty call.
In one embodiment, the frequency point determining module 1502 is further configured to obtain a preset framing window when obtaining the audio of the target calling party; and triggering the framing window to slide on the audio of the target calling party by a preset sliding step length to obtain the audio fragment framed by the framing window, and taking the audio fragment as a target audio frame.
In one embodiment, the auditory perception intensity determining module 1504 is further configured to determine, according to the Fourier transform result obtained by performing a Fourier transform on the target audio frame, the real part and imaginary part corresponding to each frequency point in the target audio frame; for each of the plurality of frequency points, multiply the real part of the current frequency point by itself to obtain a real part value; multiply the imaginary part of the current frequency point by itself to obtain an imaginary part value; and add the real part value and the imaginary part value to obtain the power value of the current frequency point.
In one embodiment, the auditory perception intensity determining module 1504 is further configured to obtain acoustic equal loudness curve data, and perform linear interpolation on the acoustic equal loudness curve data to obtain a loudness value corresponding to each frequency point; the acoustic equal-loudness curve is used to describe the correspondence between sound pressure intensity and sound wave frequency under equal-loudness conditions.
In one embodiment, when the loudness value is greater than zero, the greater the loudness value of the frequency point, the greater the auditory perception weight corresponding to the frequency point; in the first preset frequency band, the larger the frequency value of the frequency point is, the larger the hearing perception weight corresponding to the frequency point is; in the second preset frequency band, the larger the frequency value of the frequency point is, the smaller the hearing perception weight corresponding to the frequency point is; the frequency value in the first preset frequency band is smaller than the frequency value in the second preset frequency band.
In one embodiment, the auditory sense intensity determining module 1504 is further configured to multiply, for each frequency point, a corresponding power value with a corresponding auditory sense weight to obtain a corresponding auditory sense power value of each frequency point; and accumulating the hearing perception power values to obtain the hearing perception intensity of the target audio frame.
In one embodiment, the auditory sense intensity determination module 1504 is further configured to obtain a historical auditory sense intensity of the historical audio frame; the historical audio frame is the immediately preceding audio frame positioned before the target audio frame in the audio of the target talking party; the historical auditory perception intensity is the auditory perception intensity of the historical audio frame after smoothing; and obtaining a preset smoothing coefficient, and smoothing the auditory perception intensity of the target audio frame based on the smoothing coefficient and the historical auditory perception intensity to obtain the smoothed auditory perception intensity.
The above-described audio decoding apparatus, and each module in the audio encoding apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing audio decoding data and audio encoding data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio decoding method, an audio encoding method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input means are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements an audio decoding method and an audio encoding method. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display or an electronic ink display. The input means of the computer device may be a touch layer covering the display screen, a key, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by persons skilled in the art that the structures shown in FIGS. 16-17 are block diagrams of only portions of structures associated with the present inventive arrangements and are not limiting of the computer device to which the present inventive arrangements may be implemented, and that a particular computer device may include more or fewer components than shown, or may be combined with certain components, or have different arrangements of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (17)

1. A method of audio decoding, the method comprising:
acquiring coding encapsulation data corresponding to each of a plurality of call parties in the multiparty call;
unpacking each piece of encoded package data to obtain a plurality of pieces of audio encoded data and corresponding hearing perception intensity of each piece of audio encoded data; the hearing perception intensity is obtained by determining the loudness of the audio corresponding to the audio coding data, and represents the perception degree of human ears on the audio;
Determining decoding networks corresponding to the audio coding data according to the hearing perception intensities corresponding to the audio coding data, wherein the network complexity of each decoding network is related to the hearing perception intensity of the corresponding audio coding data;
decoding each audio coding data through the corresponding decoding network to obtain the corresponding decoding audio of each of the plurality of communication parties; the decoded audio is used for mixing.
2. The method of claim 1, wherein said determining a decoding network for each of a plurality of said audio encoded data based on the respective auditory perception strengths of said plurality of said audio encoded data comprises:
sorting a plurality of said audio encoded data in descending order of auditory perception intensity to obtain an audio encoded data sequence;
taking a preset number of audio coding data in a header area of the audio coding data sequence as target audio coding data, and taking a first decoding network with first network complexity as a decoding network corresponding to the target audio coding data;
Determining the rest of audio coding data except the target coding data in the audio coding data sequence, and taking a second decoding network with second network complexity as a decoding network corresponding to the rest of audio coding data; wherein the first network complexity is higher than the second network complexity.
3. The method of claim 1, wherein said determining a decoding network for each of a plurality of said audio encoded data based on the respective auditory perception strengths of said plurality of said audio encoded data comprises:
determining a perception intensity threshold value, and screening target audio coding data with the hearing perception intensity being greater than or equal to the perception intensity threshold value from the plurality of audio coding data;
a first decoding network having a first network complexity as a decoding network corresponding to the target audio encoding data;
determining remaining audio encoded data of the plurality of audio encoded data other than the target encoded data, and using a second decoding network having a second network complexity as a decoding network corresponding to the remaining audio encoded data; wherein the first network complexity is higher than the second network complexity.
4. The method of claim 1, wherein after said decoding each of said audio encoded data through a corresponding one of said decoding networks to obtain a corresponding one of said plurality of parties, said method further comprises:
for each calling party in the multiparty call, determining decoding audios of other calling parties except the current calling party in the multiparty call;
mixing the decoded audio of each other calling party to obtain mixed audio, and coding the mixed audio to obtain mixed coding data;
transmitting the audio mixing coding data to the current calling party; the sent mixed sound coding data are used for triggering the terminal of the current calling party to decode the mixed sound coding data, so as to obtain decoding mixed sound, and playing the decoding mixed sound.
5. The method of any of claims 1 to 4, wherein the multiparty call includes at least one of multiparty online conferences, multiparty gaming voices, multiparty audio video chats and multiparty live webcasts.
6. A method of audio encoding, the method comprising:
Acquiring the audio of a target call party in the multiparty call, and determining a target audio frame in the audio;
converting the target audio frame from a time domain to a frequency domain to obtain a plurality of frequency points;
determining the power value and the loudness value corresponding to each frequency point, and determining the hearing perception weight corresponding to each frequency point according to the loudness value corresponding to each frequency point;
fusing the power value and the hearing perception weight corresponding to each frequency point to obtain hearing perception intensity of the target audio frame;
acquiring audio coding data obtained by coding the target audio frame, and packaging the hearing perception intensity and the audio coding data to obtain coding packaging data; the hearing perception intensity is used for decoding and mixing audio coding data corresponding to each calling party in the multiparty call.
7. The method of claim 6, wherein the determining the target audio frame in the audio comprises:
when obtaining the audio of the target calling party, obtaining a preset framing window;
and triggering the framing window to slide on the audio of the target calling party by a preset sliding step length to obtain the audio fragment framed by the framing window, and taking the audio fragment as a target audio frame.
8. The method of claim 6, wherein the frequency bins are obtained by Fourier transforming the target audio frame, and the determining the power value corresponding to each frequency bin comprises:
determining, according to the Fourier transform result of the target audio frame, the real part and the imaginary part corresponding to each frequency bin in the target audio frame;
for each of the plurality of frequency bins, multiplying the real part of the current frequency bin by itself to obtain a real part value;
multiplying the imaginary part of the current frequency bin by itself to obtain an imaginary part value;
and adding the real part value and the imaginary part value to obtain the power value of the current frequency bin.
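This is the standard power spectrum, |X(k)|² = Re(X(k))² + Im(X(k))², computed per bin; a one-function sketch:

```python
import numpy as np

def bin_powers(frame):
    """Per-bin power values from the frame's Fourier transform (claim 8)."""
    spectrum = np.fft.rfft(frame)
    # Square the real part, square the imaginary part, and add them.
    return spectrum.real**2 + spectrum.imag**2
```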
9. The method of claim 6, wherein the determining the loudness value corresponding to each frequency bin comprises:
acquiring acoustic equal-loudness curve data, and performing linear interpolation on the acoustic equal-loudness curve data to obtain the loudness value corresponding to each frequency bin; the acoustic equal-loudness curve describes the correspondence between sound pressure level and sound frequency under equal-loudness conditions.
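Equal-loudness contours (e.g. ISO 226) are tabulated at a fixed set of frequencies, so per-bin loudness values come from interpolation. A sketch with a tiny made-up table; a real implementation would load the full contour data:

```python
import numpy as np

# Illustrative (frequency Hz, level dB) points, only loosely shaped like a
# real equal-loudness contour; stand-ins for actual ISO 226 table data.
CURVE_FREQS = np.array([20.0, 100.0, 500.0, 1000.0, 4000.0, 8000.0, 16000.0])
CURVE_LEVELS = np.array([99.0, 62.0, 44.0, 40.0, 36.0, 45.0, 70.0])

def loudness_values(bin_freqs):
    """Linearly interpolate the equal-loudness curve at each frequency bin."""
    return np.interp(bin_freqs, CURVE_FREQS, CURVE_LEVELS)
```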
10. The method of any one of claims 6 to 9, wherein, when the loudness value is greater than zero, a greater loudness value of a frequency bin corresponds to a greater hearing perception weight; within a first preset frequency band, a greater frequency value of a frequency bin corresponds to a greater hearing perception weight; within a second preset frequency band, a greater frequency value of a frequency bin corresponds to a smaller hearing perception weight; and the frequency values in the first preset frequency band are smaller than the frequency values in the second preset frequency band.
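Claim 10 constrains only the shape of the weight curve: monotone in the (positive) loudness value, rising with frequency over a lower band and falling over a higher one. One illustrative function satisfying those constraints; the 3 kHz band boundary and the functional form are assumptions:

```python
import numpy as np

def perception_weights(bin_freqs, loudness):
    """An illustrative weight curve consistent with claim 10's constraints."""
    PEAK_HZ = 3000.0  # assumed boundary between the two preset bands
    shape = np.where(
        bin_freqs <= PEAK_HZ,
        bin_freqs / PEAK_HZ,                    # first band: rises with frequency
        PEAK_HZ / np.maximum(bin_freqs, 1e-9),  # second band: falls with frequency
    )
    return shape * np.maximum(loudness, 0.0)    # grows with positive loudness
```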
11. The method of claim 6, wherein the fusing the power value and the hearing perception weight corresponding to each frequency bin to obtain the hearing perception intensity of the target audio frame comprises:
for each frequency bin, multiplying the corresponding power value by the corresponding hearing perception weight to obtain a hearing perception power value for each frequency bin;
and accumulating the hearing perception power values to obtain the hearing perception intensity of the target audio frame.
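In formula form the fusion is simply intensity = sum over k of power(k) × weight(k); a direct sketch:

```python
def fuse_intensity(powers, weights):
    """Multiply each bin's power by its perception weight, then accumulate."""
    return sum(p * w for p, w in zip(powers, weights))
```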
12. The method of claim 11, wherein the method further comprises:
acquiring the historical hearing perception intensity of a historical audio frame; the historical audio frame is the audio frame immediately preceding the target audio frame in the audio of the target call party; and the historical hearing perception intensity is the smoothed hearing perception intensity of the historical audio frame;
and acquiring a preset smoothing coefficient, and smoothing the hearing perception intensity of the target audio frame based on the smoothing coefficient and the historical hearing perception intensity to obtain a smoothed hearing perception intensity.
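This recursion is first-order (exponential) smoothing over frames; the coefficient value below is an assumed example:

```python
def smooth_intensity(current, previous_smoothed, alpha=0.9):
    """Smooth the target frame's perception intensity (claim 12).

    `alpha` is the preset smoothing coefficient (0.9 assumed here);
    `previous_smoothed` is the already-smoothed intensity of the
    immediately preceding frame, making this a one-pole IIR filter.
    """
    return alpha * previous_smoothed + (1.0 - alpha) * current
```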
13. An audio decoding apparatus, the apparatus comprising:
the encapsulated data acquisition module is used for acquiring the encoded encapsulated data corresponding to each of a plurality of call parties in the multiparty call;
the decoding network determining module is used for decapsulating each piece of encoded encapsulated data to obtain a plurality of pieces of audio encoded data and the hearing perception intensity corresponding to each piece of audio encoded data, wherein the hearing perception intensity is determined from the loudness of the audio corresponding to the audio encoded data and represents the degree to which the human ear perceives the audio; and for determining the decoding network corresponding to each piece of audio encoded data according to the hearing perception intensity corresponding to each piece of audio encoded data, wherein the network complexity of each decoding network is related to the hearing perception intensity of the corresponding audio encoded data;
the decoding module is used for decoding each piece of audio encoded data through the corresponding decoding network to obtain the decoded audio corresponding to each of the plurality of call parties; the decoded audio is used for mixing.
14. An audio encoding apparatus, the apparatus comprising:
the frequency bin determining module is used for acquiring the audio of a target call party in the multiparty call, determining a target audio frame in the audio, and converting the target audio frame from a time domain to a frequency domain to obtain a plurality of frequency bins;
the hearing perception intensity determining module is used for determining the power value and the loudness value corresponding to each frequency bin, determining the hearing perception weight corresponding to each frequency bin according to the loudness value corresponding to each frequency bin, and fusing the power value and the hearing perception weight corresponding to each frequency bin to obtain the hearing perception intensity of the target audio frame;
the encoding encapsulation module is used for acquiring the audio encoded data obtained by encoding the target audio frame, and encapsulating the hearing perception intensity and the audio encoded data to obtain encoded encapsulated data; the hearing perception intensity is used for decoding and mixing the audio encoded data corresponding to each call party in the multiparty call.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when executing the computer program.
16. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 12.
17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 12.
CN202211186521.1A 2022-09-27 2022-09-27 Audio decoding method, audio encoding method, apparatus and storage medium Pending CN116978389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211186521.1A CN116978389A (en) 2022-09-27 2022-09-27 Audio decoding method, audio encoding method, apparatus and storage medium

Publications (1)

Publication Number Publication Date
CN116978389A true CN116978389A (en) 2023-10-31

Family

ID=88477204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211186521.1A Pending CN116978389A (en) 2022-09-27 2022-09-27 Audio decoding method, audio encoding method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN116978389A (en)

Similar Documents

Publication Publication Date Title
US20080004729A1 (en) Direct encoding into a directional audio coding format
US20090116652A1 (en) Focusing on a Portion of an Audio Scene for an Audio Signal
TW200933608A (en) Systems, methods, and apparatus for context descriptor transmission
WO2010125228A1 (en) Encoding of multiview audio signals
CN112750444B (en) Sound mixing method and device and electronic equipment
WO2022022293A1 (en) Audio signal rendering method and apparatus
CN110299144B (en) Audio mixing method, server and client
US20030023428A1 (en) Method and apparatus of mixing audios
US20230298600A1 (en) Audio encoding and decoding method and apparatus
US20230298601A1 (en) Audio encoding and decoding method and apparatus
WO2022262576A1 (en) Three-dimensional audio signal encoding method and apparatus, encoder, and system
WO2019105436A1 (en) Audio encoding and decoding method and related product
CN116978389A (en) Audio decoding method, audio encoding method, apparatus and storage medium
WO2022242483A1 (en) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2022242481A1 (en) Three-dimensional audio signal encoding method and apparatus, and encoder
TWI834163B (en) Three-dimensional audio signal encoding method, apparatus and encoder
US20230412981A1 (en) Method and apparatus for determining virtual speaker set
EP4354430A1 (en) Three-dimensional audio signal processing method and apparatus
WO2022242479A1 (en) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2022242480A1 (en) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2022262758A1 (en) Audio rendering system and method and electronic device
WO2022267754A1 (en) Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium
JP2023523081A (en) Bit allocation method and apparatus for audio signal
CN116110424A (en) Voice bandwidth expansion method and related device
CN115346537A (en) Audio coding and decoding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination