WO2023005414A1 - Method and apparatus for encoding and decoding audio signals - Google Patents

Method and apparatus for encoding and decoding audio signals

Info

Publication number
WO2023005414A1
Authority
WO
WIPO (PCT)
Prior art keywords
blocks
transient
spectrum
block
grouping
Prior art date
Application number
PCT/CN2022/096593
Other languages
English (en)
French (fr)
Inventor
夏丙寅
李佳蔚
王喆
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to KR1020247006252A (publication KR20240038770A)
Publication of WO2023005414A1
Priority to US18/423,083 (publication US20240177721A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signal analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Definitions

  • The present application relates to the technical field of audio processing, and in particular to a method and apparatus for encoding and decoding audio signals.
  • Compression of audio data is an indispensable link in media applications such as media communication and media broadcasting.
  • With the development of the high-definition audio and three-dimensional audio industries, people's demands on audio quality keep rising, and the volume of audio data in media applications is growing rapidly as a result.
  • Current audio data compression technology is based on basic signal-processing principles: it uses the correlation of the signal in time and space to compress the original audio signal and reduce the amount of data, thereby facilitating the transmission or storage of audio data.
  • Embodiments of the present application provide an audio signal encoding and decoding method and apparatus, which improve the encoding quality and the reconstruction effect of audio signals.
  • An embodiment of the present application provides an audio signal encoding method, including: obtaining M transient identifiers of M blocks according to the spectra of the M blocks of the current frame of the audio signal to be encoded, where the M blocks include a first block whose transient identifier indicates that the first block is a transient block or a non-transient block; obtaining grouping information of the M blocks according to the M transient identifiers; grouping and arranging the spectra of the M blocks according to the grouping information, so as to obtain the spectrum to be encoded of the current frame; encoding the spectrum to be encoded with an encoding neural network to obtain a spectrum encoding result; and writing the spectrum encoding result into a code stream.
  • In this way, after the M transient identifiers are obtained from the spectra of the M blocks and the grouping information is obtained from the M transient identifiers, the spectra of the M blocks in the current frame are grouped and arranged according to that grouping information; grouping and arranging the spectra adjusts their order within the current frame and yields the spectrum to be encoded of the current frame.
  • The spectrum to be encoded is then encoded with the encoding neural network to obtain the spectrum encoding result, which can be carried in the code stream.
  • Since the spectra of the M blocks are grouped and arranged according to the M transient identifiers of the current frame, blocks with different transient identifiers are grouped and encoded separately, which improves the encoding quality of the audio signal.
  • The method may further include: encoding the grouping information of the M blocks to obtain a grouping information encoding result, and writing the grouping information encoding result into the code stream.
  • That is, after the encoding end obtains the grouping information of the M blocks, it can carry the grouping information in the code stream by first encoding it; the encoding method adopted for the grouping information is not limited here.
  • The grouping information encoding result obtained in this way is written into the code stream, so that the code stream carries the grouping information encoding result.
  • The grouping information of the M blocks includes a grouping quantity of the M blocks, or a grouping quantity identifier used to indicate the grouping quantity.
  • When the grouping quantity is greater than 1, the grouping information of the M blocks further includes the M transient identifiers of the M blocks; alternatively, the grouping information of the M blocks consists of the M transient identifiers of the M blocks.
  • The above grouping information can indicate how the M blocks are grouped, so that the encoding end can use it to arrange the spectra of the M blocks in groups.
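  • As a rough sketch of how such grouping information could be carried, the fragment below packs a 1-bit grouping-quantity identifier followed, only when the grouping quantity is greater than 1, by the M transient identifiers. The function name, the field widths, and the convention that two groups exist exactly when both transient and non-transient blocks are present are assumptions for illustration, not details taken from the application.

```python
def pack_grouping_info(transient_flags):
    """Hypothetical packing of the grouping information of M blocks.

    transient_flags: list of M booleans (True = transient block).
    Returns a list of bits: a 1-bit grouping-quantity identifier
    (0 = one group, 1 = two groups; an assumed width), followed by
    the M transient identifiers only when there is more than one group.
    """
    # Two groups exist only when both transient and non-transient
    # blocks are present in the current frame.
    num_groups = 2 if (any(transient_flags) and not all(transient_flags)) else 1
    bits = [num_groups - 1]  # grouping-quantity identifier
    if num_groups > 1:
        bits += [int(f) for f in transient_flags]  # M transient identifiers
    return bits
```

With this assumed layout, a frame whose blocks are all non-transient (or all transient) costs a single bit, while a mixed frame spends one bit per block on top of the identifier.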
  • Grouping and arranging the spectra of the M blocks according to the grouping information, so as to obtain the spectrum to be encoded of the current frame, includes: dividing the spectra of the blocks indicated as transient blocks by the M transient identifiers into a transient group, dividing the spectra of the blocks indicated as non-transient blocks into a non-transient group, and arranging the spectra of the blocks in the transient group before the spectra of the blocks in the non-transient group.
  • That is, after the encoding end obtains the grouping information of the M blocks, it groups the M blocks according to their transient identifiers to obtain the transient group and the non-transient group, and then arranges the positions of the spectra within the current frame so that the spectra of the blocks in the transient group come before those of the blocks in the non-transient group, thereby obtaining the spectrum to be encoded.
  • Since the spectra of all transient blocks in the spectrum to be encoded are located before the spectra of the non-transient blocks, the spectra of the transient blocks are moved to positions of higher coding importance, so that the audio signal reconstructed after encoding and decoding with the neural network better preserves the transient characteristics.
  • Alternatively, grouping and arranging the spectra of the M blocks may simply consist in arranging the spectra of the blocks indicated as transient blocks by the M transient identifiers before the spectra of the blocks indicated as non-transient blocks, so as to obtain the spectrum to be encoded of the current frame.
  • The result is the same: the spectra of all transient blocks are located before those of the non-transient blocks, the spectra of the transient blocks occupy positions of higher coding importance, and the reconstructed audio signal better preserves the transient characteristics.
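  • The transient-first arrangement described above can be sketched as follows. This is an illustrative reading of the scheme (transient group first, original block order kept within each group); the names `group_and_arrange`, `block_spectra`, and `transient_flags` are chosen here for illustration.

```python
import numpy as np

def group_and_arrange(block_spectra, transient_flags):
    """Sketch of grouping and arranging the M block spectra: the spectra
    of the transient blocks are placed before the spectra of the
    non-transient blocks, preserving the original block order within
    each group, to form the spectrum to be encoded."""
    transient = [s for s, f in zip(block_spectra, transient_flags) if f]
    non_transient = [s for s, f in zip(block_spectra, transient_flags) if not f]
    # Transient group first, then non-transient group.
    return np.concatenate(transient + non_transient)
```

For example, with transient flags `[False, True, False]`, the spectrum of block 1 is moved to the front of the spectrum to be encoded.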
  • Before the spectrum to be encoded is encoded with the encoding neural network, the method may further include performing intra-group interleaving on the spectrum to be encoded, so as to obtain the intra-group interleaved spectra of the M blocks; encoding the spectrum to be encoded with the encoding neural network then consists in encoding the intra-group interleaved spectra of the M blocks.
  • That is, the encoding end may first perform interleaving within each group according to the grouping of the M blocks to obtain the intra-group interleaved spectra of the M blocks, which then serve as the input data of the encoding neural network.
  • This also reduces the coding side information and improves the coding efficiency.
  • The intra-group interleaving of the spectrum to be encoded includes: interleaving the spectra of the P blocks to obtain the interleaved spectra of the P blocks, and interleaving the spectra of the Q blocks to obtain the interleaved spectra of the Q blocks.
  • Encoding the spectrum to be encoded then includes encoding the interleaved spectra of the P blocks and the interleaved spectra of the Q blocks with the encoding neural network.
  • Interleaving the spectra of the P blocks means interleaving them as a whole; similarly, interleaving the spectra of the Q blocks means interleaving them as a whole.
  • In other words, the encoding end performs the interleaving separately for the transient group and the non-transient group, so as to obtain the interleaved spectra of the P blocks and the interleaved spectra of the Q blocks.
  • The interleaved spectra of the P blocks and of the Q blocks can then be used as the input data of the encoding neural network.
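  • A minimal sketch of the intra-group interleaving, assuming a coefficient-wise interleaving order (coefficient 0 of every block in the group, then coefficient 1, and so on). The application does not pin down the exact interleaving order, so this ordering, and the function names, are assumptions for illustration.

```python
import numpy as np

def interleave_group(group_spectra):
    """Interleave the spectra of one group of blocks as a whole.

    group_spectra has shape (num_blocks, num_coeffs); the assumed
    order takes coefficient 0 of every block, then coefficient 1,
    and so on, yielding one flat vector for the group."""
    return np.asarray(group_spectra).T.reshape(-1)

def interleave_to_be_encoded(p_group, q_group):
    # The transient group (P blocks) and the non-transient group
    # (Q blocks) are interleaved separately, then concatenated to
    # form the input of the encoding neural network.
    return np.concatenate([interleave_group(p_group), interleave_group(q_group)])
```

Interleaving each group separately keeps coefficients of the same frequency adjacent within a group, while the group boundary between the P and Q blocks is preserved.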
  • Before obtaining the M transient identifiers of the M blocks according to the spectra of the M blocks of the current frame, the method may further include: obtaining the window type of the current frame, the window type being a short window type or a non-short window type; when the window type is the short window type, the step of obtaining the M transient identifiers of the M blocks according to the spectra of the M blocks of the current frame is performed.
  • In the embodiment of the present application, the foregoing encoding scheme is thus carried out only when the window type of the current frame is the short window type, i.e., when the audio signal is a transient signal.
  • The method may further include: encoding the window type to obtain a window type encoding result, and writing the window type encoding result into the code stream.
  • That is, the encoding end can carry the window type in the code stream by first encoding it.
  • The encoding method used for the window type is not limited here.
  • Obtaining the M transient identifiers of the M blocks according to the spectra of the M blocks includes: obtaining M spectral energies of the M blocks according to their spectra; obtaining the average spectral energy of the M blocks according to the M spectral energies; and obtaining the M transient identifiers of the M blocks according to the M spectral energies and the average spectral energy.
  • After the encoding end obtains the M spectral energies, it can average them to obtain the average spectral energy; alternatively, it can first remove the maximum value, or the several largest values, among the M spectral energies and then average the remainder to obtain the average spectral energy.
  • The transient identifier of a block can be used to represent the transient characteristics of that block.
  • The transient identifier of each block can thus be determined from the spectral energy of that block and the average spectral energy, and the transient identifiers in turn determine the grouping information of the blocks.
  • When the spectral energy of the first block is greater than K times the average spectral energy, the transient identifier of the first block indicates that the first block is a transient block; when the spectral energy of the first block is less than or equal to K times the average spectral energy, the transient identifier of the first block indicates that the first block is a non-transient block, where K is a real number greater than or equal to 1.
  • That is, when the spectral energy of the first block is greater than K times the average, the spectrum of the first block changes markedly compared with the other blocks of the M blocks, and the first block is flagged as transient; when the spectral energy is less than or equal to K times the average, the spectrum of the first block changes little compared with the other blocks, and the first block is flagged as non-transient.
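  • The energy-threshold rule above can be sketched as follows. The spectral-energy definition (sum of squared spectral coefficients) and the example value `k = 2.0` are assumptions for illustration; the application only requires that K be a real number greater than or equal to 1.

```python
import numpy as np

def compute_transient_flags(block_spectra, k=2.0):
    """Sketch of the transient-identification rule described above.

    block_spectra: array of shape (M, N) holding the spectrum of each
    of the M blocks of the current frame. A block is flagged transient
    when its spectral energy exceeds k times the average spectral
    energy of the M blocks."""
    spectra = np.asarray(block_spectra, dtype=float)
    energies = np.sum(spectra ** 2, axis=1)   # M spectral energies
    avg_energy = np.mean(energies)            # average spectral energy
    return energies > k * avg_energy          # M transient identifiers
```

In this sketch the plain mean is used; as noted above, the largest value(s) could instead be removed before averaging, which makes the threshold less sensitive to a single dominant block.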
  • An embodiment of the present application also provides an audio signal decoding method, including: obtaining from the code stream the grouping information of the M blocks of the current frame of the audio signal, the grouping information being used to indicate the M transient identifiers of the M blocks; decoding the code stream with a decoding neural network to obtain the decoded spectra of the M blocks; performing inverse grouping and arrangement on the decoded spectra of the M blocks according to the grouping information, so as to obtain the inversely grouped and arranged spectra of the M blocks; and obtaining the reconstructed audio signal of the current frame according to the inversely grouped and arranged spectra of the M blocks.
  • In this way, the grouping information indicating the M transient identifiers is obtained from the code stream, the code stream is decoded by the decoding neural network to obtain the decoded spectra of the M blocks, the decoded spectra are inversely grouped and arranged according to the grouping information, and the reconstructed audio signal of the current frame is obtained from the resulting spectra.
  • Thus, the decoded spectra of the M blocks are obtained when decoding the code stream, the inversely grouped and arranged spectra of the M blocks are obtained through the inverse grouping and arrangement processing, and the reconstructed audio signal of the current frame is then obtained.
  • Since the inverse grouping, arrangement, and decoding can be performed according to blocks with different transient identifiers in the audio signal, the reconstruction effect of the audio signal is improved.
  • Before the inverse grouping and arrangement of the decoded spectra of the M blocks, the method may further include performing intra-group deinterleaving on the decoded spectra of the M blocks to obtain the intra-group deinterleaved spectra of the M blocks; the inverse grouping and arrangement is then performed on the intra-group deinterleaved spectra of the M blocks according to the grouping information of the M blocks.
  • The intra-group deinterleaving of the decoded spectra of the M blocks includes: deinterleaving the decoded spectra of the P blocks, and deinterleaving the decoded spectra of the Q blocks.
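  • Assuming the encoding end interleaved each group coefficient-wise (coefficient 0 of every block in the group, then coefficient 1, and so on), the intra-group deinterleaving is simply its inverse; that ordering, and the reshape-based implementation below, are assumptions for illustration.

```python
import numpy as np

def deinterleave_group(interleaved, num_blocks):
    """Sketch of intra-group deinterleaving at the decoding end: the
    interleaved vector of one group is reshaped back into num_blocks
    block spectra, undoing the assumed coefficient-wise interleaving."""
    coeffs_per_block = len(interleaved) // num_blocks
    return np.asarray(interleaved).reshape(coeffs_per_block, num_blocks).T
```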
  • The inverse grouping and arrangement of the decoded spectra of the M blocks according to the grouping information includes: obtaining the indexes of the P blocks according to the grouping information of the M blocks; obtaining the indexes of the Q blocks according to the grouping information of the M blocks; and performing the inverse grouping and arrangement on the decoded spectra of the M blocks according to these indexes.
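  • The decoder-side inverse grouping and arrangement can be sketched as follows: given the M transient identifiers recovered from the grouping information, the indexes of the P transient blocks and of the Q non-transient blocks are derived, and each decoded spectrum is returned to its original block position. The function and variable names are illustrative.

```python
def inverse_group_arrange(grouped_spectra, transient_flags):
    """Sketch of the inverse grouping and arrangement: grouped_spectra
    holds the M decoded block spectra in grouped order (transient
    blocks first); transient_flags are the M transient identifiers
    recovered from the grouping information."""
    # Indexes of the P transient blocks and the Q non-transient
    # blocks, in the original block order of the current frame.
    p_indexes = [i for i, f in enumerate(transient_flags) if f]
    q_indexes = [i for i, f in enumerate(transient_flags) if not f]
    restored = [None] * len(transient_flags)
    for pos, block_index in enumerate(p_indexes + q_indexes):
        restored[block_index] = grouped_spectra[pos]
    return restored
```

Under these assumptions this function exactly undoes a transient-first arrangement, restoring the original time order of the block spectra.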
  • The method may further include: obtaining the window type of the current frame from the code stream, the window type being a short window type or a non-short window type; when the window type of the current frame is the short window type, the step of obtaining the grouping information of the M blocks of the current frame from the code stream is performed.
  • The grouping information of the M blocks includes a grouping quantity of the M blocks, or a grouping quantity identifier used to indicate the grouping quantity.
  • An embodiment of the present application also provides an audio signal encoding apparatus, including:
  • a transient identifier obtaining module, configured to obtain M transient identifiers of M blocks according to the spectra of the M blocks of the current frame of the audio signal to be encoded, where the M blocks include a first block whose transient identifier indicates that the first block is a transient block or a non-transient block;
  • a grouping information obtaining module, configured to obtain the grouping information of the M blocks according to the M transient identifiers of the M blocks;
  • a grouping and arranging module, configured to group and arrange the spectra of the M blocks according to the grouping information of the M blocks, so as to obtain the spectrum to be encoded;
  • an encoding module, configured to encode the spectrum to be encoded with an encoding neural network to obtain a spectrum encoding result, and to write the spectrum encoding result into a code stream.
  • The constituent modules of the audio signal encoding apparatus can also perform the steps described in the aforementioned first aspect and its various possible implementations; see the description of the method in the aforementioned first aspect and its various possible implementations for details.
  • An embodiment of the present application further provides an audio signal decoding apparatus, including:
  • a grouping information obtaining module, configured to obtain the grouping information of the M blocks of the current frame of the audio signal from the code stream, the grouping information being used to indicate the M transient identifiers of the M blocks;
  • a decoding module, configured to decode the code stream with a decoding neural network to obtain the decoded spectra of the M blocks;
  • an inverse grouping and arranging module, configured to perform inverse grouping and arrangement on the decoded spectra of the M blocks according to the grouping information of the M blocks, so as to obtain the inversely grouped and arranged spectra of the M blocks;
  • an audio signal obtaining module, configured to obtain the reconstructed audio signal according to the inversely grouped and arranged spectra of the M blocks.
  • The constituent modules of the audio signal decoding apparatus can also perform the steps described in the aforementioned second aspect and its various possible implementations; see the description of the method in the aforementioned second aspect and its various possible implementations for details.
  • An embodiment of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect above.
  • An embodiment of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect.
  • An embodiment of the present application provides a computer-readable storage medium including the code stream generated by the method described in the foregoing first aspect.
  • An embodiment of the present application provides a communication apparatus, which may include entities such as a terminal device or a chip; the communication apparatus includes a processor and a memory, the memory is used to store instructions, and the processor is used to execute the instructions in the memory so that the communication apparatus executes the method described in any one of the aforementioned first aspect or second aspect.
  • The present application provides a chip system including a processor, configured to support an audio encoder or an audio decoder in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • The chip system may further include a memory, and the memory is used to store the program instructions and data necessary for the audio encoder or audio decoder.
  • The chip system may consist of chips, or may include chips and other discrete devices.
  • It can be seen from the above that M transient identifiers of the M blocks are obtained according to the spectra of the M blocks of the current frame of the audio signal to be encoded; after the grouping information of the M blocks is obtained from the M transient identifiers, the spectra of the M blocks in the current frame are grouped and arranged according to that grouping information, which adjusts the order of the spectra within the current frame and yields the spectrum to be encoded of the current frame.
  • The spectrum to be encoded is encoded with the encoding neural network to obtain the spectrum encoding result, which can be carried in the code stream.
  • In this way, the spectra of the M blocks are grouped and arranged according to the M transient identifiers of the current frame, so that blocks with different transient identifiers are grouped and encoded separately, which improves the encoding quality of the audio signal.
  • At the decoding end, the grouping information of the M blocks of the current frame is obtained from the code stream and indicates the M transient identifiers of the M blocks; the code stream is decoded by the decoding neural network to obtain the decoded spectra of the M blocks; the decoded spectra are inversely grouped and arranged according to the grouping information, and the reconstructed audio signal of the current frame is obtained from the resulting spectra.
  • Thus, the decoded spectra of the M blocks are obtained when decoding the code stream, the inversely grouped and arranged spectra of the M blocks are obtained through the inverse grouping and arrangement processing, and the reconstructed audio signal of the current frame is then obtained.
  • Since the inverse grouping, arrangement, and decoding can be performed according to blocks with different transient identifiers in the audio signal, the reconstruction effect of the audio signal is improved.
  • FIG. 1 is a schematic diagram of the composition and structure of an audio processing system provided by an embodiment of the present application;
  • FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided by an embodiment of the present application applied to a terminal device;
  • FIG. 2b is a schematic diagram of an audio encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 2c is a schematic diagram of an audio decoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 3 is a schematic diagram of an audio signal encoding method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an audio signal decoding method provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of an audio signal encoding and decoding system provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of an audio signal encoding method provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an audio signal decoding method provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of an audio signal encoding method provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of an audio signal decoding method provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of the composition and structure of an audio encoding device provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of the composition and structure of an audio decoding device provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of the composition and structure of another audio encoding device provided by an embodiment of the present application;
  • FIG. 13 is a schematic diagram of the composition and structure of another audio decoding device provided by an embodiment of the present application.
  • Sound is a continuous wave produced by the vibration of an object; an object that vibrates and emits sound waves is called a sound source. When sound waves propagate through a medium (such as air, a solid, or a liquid), the auditory organs of humans or animals can perceive the sound.
  • The characteristics of sound waves include pitch, intensity, and timbre.
  • Pitch indicates how high or low a sound is.
  • Sound intensity indicates the volume of a sound; sound intensity can also be called loudness or volume.
  • The unit of sound intensity is the decibel (dB).
  • Timbre is also called tone quality.
  • The frequency of sound waves determines the pitch of a sound: the higher the frequency, the higher the pitch.
  • The number of times an object vibrates within one second is called the frequency, and the unit of frequency is the hertz (Hz).
  • The frequency of sound that can be recognized by the human ear is between 20 Hz and 20,000 Hz.
  • The amplitude of the sound wave determines the intensity of the sound: the greater the amplitude, the greater the intensity; and the closer the listener is to the sound source, the greater the perceived intensity.
  • The waveform of the sound wave determines the timbre.
  • The waveforms of sound waves include square waves, sawtooth waves, sine waves, and pulse waves.
  • Sounds can be divided into regular sounds and irregular sounds.
  • An irregular sound is a sound produced by a sound source vibrating irregularly, for example, noise that disturbs people's work, study, and rest.
  • A regular sound is a sound produced by a sound source vibrating regularly; regular sounds include speech and musical tones.
  • In terms of representation, a regular sound is an analog signal that changes continuously in the time-frequency domain; such an analog signal may be referred to as an audio signal (acoustic signal).
  • An audio signal is an information carrier that carries speech, music and sound effects.
  • Since human hearing can distinguish the location and distribution of sound sources in space, a listener hearing a sound in the space can perceive not only the pitch, intensity, and timbre of the sound but also the direction it comes from.
  • Sound can also be divided into monophonic and stereophonic.
  • Mono has one sound channel, using a microphone to pick up the sound and using a speaker for playback.
  • Stereo has multiple sound channels, and different sound channels transmit different sound waveforms.
  • When the audio signal is a transient signal, current encoders do not extract the transient feature and transmit it in the code stream.
  • The transient feature represents the change of the spectra of adjacent blocks within a transient frame of the audio signal; since it is not transmitted, when the signal is reconstructed at the decoding end, the transient characteristics of the reconstructed audio signal cannot be obtained from the code stream, and the reconstruction effect of the audio signal is poor.
  • Therefore, the embodiment of the present application provides an audio processing technology, and in particular an audio coding technology oriented to audio signals, so as to improve the traditional audio coding system.
  • Audio processing includes two parts: audio encoding and audio decoding. Audio encoding is performed on the source side and involves encoding (e.g., compressing) the raw audio to reduce the amount of data required to represent it, for more efficient storage and/or transmission. Audio decoding is performed on the destination side and involves processing inverse to that of the encoder so as to reconstruct the original audio. The encoding part and the decoding part are also collectively referred to as coding.
  • the technical solution of the embodiment of the present application can be applied to various audio processing systems, as shown in FIG. 1 , which is a schematic diagram of the composition and structure of the audio processing system provided by the embodiment of the present application.
  • the audio processing system 100 may include: an audio encoding device 101 and an audio decoding device 102 .
  • the audio coding device 101, which can also be called an audio signal coding device, can be used to generate a code stream; the code stream can then be transmitted to the audio decoding device 102 through an audio transmission channel.
  • the audio decoding device 102, which can also be called an audio signal decoding device, can receive the code stream, execute its audio decoding function, and finally obtain the reconstructed signal.
  • the audio coding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices.
  • the audio coding device can be an audio encoder of the above-mentioned terminal device, wireless device, or core network device.
  • the audio decoding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices; for example, it can be an audio decoder of such a device.
  • the audio encoder may be applied to a radio access network, a media gateway of the core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, etc.; the audio encoder may also be an audio encoder in virtual reality (VR) streaming services.
  • the end-to-end encoding and decoding process for audio signals includes: audio signal A is collected by an acquisition module, and then a preprocessing operation (audio preprocessing) is performed.
  • the preprocessing operation includes filtering out the low-frequency part of the signal.
  • the rendered signal is mapped to the listener's headphones (headphones), which may be independent headphones or headphones on a glasses device.
  • FIG. 2a it is a schematic diagram of an audio encoder and an audio decoder provided in the embodiment of the present application applied to a terminal device.
  • Each terminal device may include: an audio encoder, a channel encoder, an audio decoder, and a channel decoder.
  • the channel encoder is used for channel coding the audio signal
  • the channel decoder is used for channel decoding the audio signal.
  • the first terminal device 20 may include: a first audio encoder 201 , a first channel encoder 202 , a first audio decoder 203 , and a first channel decoder 204 .
  • the second terminal device 21 may include: a second audio decoder 211 , a second channel decoder 212 , a second audio encoder 213 , and a second channel encoder 214 .
  • the first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 through a digital channel, and the second terminal device 21 is connected to the wireless or wired second network communication device 23.
  • the second network communication device 23 may generally refer to signal transmission equipment, such as communication base stations, data exchange equipment, and the like.
  • the terminal device as the sending end first collects audio, performs audio coding on the collected audio signal, and then performs channel coding, and then transmits in a digital channel through a wireless network or a core network.
  • the terminal device as the receiving end performs channel decoding according to the received signal to obtain the code stream, and then recovers the audio signal through audio decoding, and the terminal device at the receiving end enters the audio playback.
  • the wireless device or the core network device 25 includes: a channel decoder 251, other audio decoders 252, the audio encoder 253 provided in the embodiment of the present application, and a channel encoder 254, where the other audio decoders 252 refer to audio decoders other than the audio decoder provided in the embodiment of the present application.
  • the channel decoder 251 is first used to perform channel decoding on the signal entering the device, then the other audio decoders 252 are used for audio decoding, and then the audio encoder 253 provided by the embodiment of the present application is used for audio encoding.
  • the channel coder 254 is used to perform channel coding on the audio signal, and the channel coding is completed before transmission.
  • the other audio decoder 252 performs audio decoding on the code stream decoded by the channel decoder 251 .
  • FIG. 2c it is a schematic diagram of an audio decoder provided by the embodiment of the present application being applied to a wireless device or a core network device.
  • the wireless device or the core network device 25 includes: a channel decoder 251, an audio decoder 255 provided in the embodiment of the present application, other audio encoders 256, and a channel encoder 254, where the other audio encoders 256 refer to audio encoders other than the audio encoder provided in the embodiment of the present application.
  • the signal entering the device is first channel-decoded by the channel decoder 251; the received audio coded stream is then decoded using the audio decoder 255; other audio encoders 256 are then used to perform audio encoding; finally, the channel encoder 254 performs channel encoding on the audio signal, which is transmitted after the channel encoding is completed.
  • the wireless device refers to equipment related to radio frequency in communication
  • the core network device refers to equipment related to core network in communication.
  • the audio coding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices.
  • the audio coding device can be a multi-channel encoder of the above-mentioned terminal device, wireless device, or core network device.
  • the audio decoding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices.
  • the audio decoding device can be a multi-channel decoder of the above-mentioned terminal device, wireless device, or core network device.
  • an audio signal encoding method provided by the embodiment of the present application is introduced.
  • This method can be executed by a terminal device, which can be an audio signal encoding device (hereinafter referred to as the encoding end or encoder; for example, the encoding end can be an artificial intelligence (AI) encoder).
  • Figure 3 the encoding process performed by the encoding end in the embodiment of the present application is described:
  • the M blocks include the first block, and the transient identifier of the first block is used to indicate that the first block is a transient block, or to indicate that the first block is a non-transient block.
  • the encoding end first obtains the audio signal to be encoded, and performs frame division processing on the audio signal to be encoded to obtain the current frame of the audio signal to be encoded.
  • the encoding of the current frame is taken as an example for description, and the encoding of other frames of the audio signal to be encoded is similar to the encoding of the current frame.
  • after the encoder determines the current frame, it performs windowing processing on the current frame and performs time-frequency transformation; if the current frame includes M blocks, the spectra of the M blocks in the current frame can be obtained, where M represents the number of blocks included in the current frame.
  • the encoding end performs time-frequency transformation on the M blocks of the current frame to obtain the modified discrete cosine transform (modified discrete cosine transform, MDCT) spectrum of the M blocks.
  • the following description takes the case where the spectrum of the M blocks is the MDCT spectrum as an example; the spectrum of the M blocks may also be another type of spectrum.
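As an illustration of the time-frequency transformation above, a direct-form MDCT of one block can be sketched as follows (the O(N²) matrix form and the function name are illustrative only; production codecs use windowed, FFT-based fast MDCT implementations):

```python
import numpy as np

def mdct(block):
    """Direct-form MDCT of a 2N-sample block, yielding N coefficients.

    O(N^2) matrix form for illustration; real codecs use windowed,
    FFT-based fast MDCT implementations."""
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    # Standard MDCT cosine basis: cos(pi/N * (n + 0.5 + N/2) * (k + 0.5))
    basis = np.cos(np.pi / n_half * (n[None, :] + 0.5 + n_half / 2) * (k[:, None] + 0.5))
    return basis @ block

spectrum = mdct(np.sin(2 * np.pi * 3 * np.arange(8) / 8))
print(spectrum.shape)  # (4,)
```

An encoder would apply such a transform per windowed block; the decoding end inverts it with an inverse MDCT and overlap-add.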
  • after obtaining the frequency spectra of the M blocks, the encoding end obtains the M transient identifiers of the M blocks according to the frequency spectra of the M blocks.
  • the frequency spectrum of each block is used to determine that block's transient identifier; each block corresponds to one transient identifier, which indicates the spectrum change of the block among the M blocks. For example, if one of the M blocks is the first block, the first block corresponds to a transient identifier.
  • the transient flag may indicate that the first block is a transient block, or the transient flag may indicate that the first block is a non-transient block.
  • a block's transient identifier being transient means that the spectrum of the block changes greatly compared with the spectra of the other blocks among the M blocks; a block's transient identifier being non-transient means that the spectrum of the block does not change much compared with the spectra of the other blocks among the M blocks.
  • the transient identifier occupies 1 bit; for example, a value of 0 may indicate transient and a value of 1 non-transient, or a value of 1 may indicate transient and a value of 0 non-transient, which is not limited here.
  • the M transient identifiers of the M blocks are used to group the M blocks, and the grouping information of the M blocks is obtained according to the M transient identifiers of the M blocks.
  • the grouping information of the M blocks can indicate the grouping method of the M blocks
  • the M transient identifiers of the M blocks are the basis for grouping the M blocks; for example, blocks with the same transient identifier can be classified into one group, and blocks with different transient identifiers are classified into different groups.
  • the grouping information of the M blocks can be implemented in multiple ways. For example, the grouping information includes the number of groups of the M blocks, or a group-number identifier used to indicate the number of groups; when the number of groups is greater than 1, the grouping information also includes the M transient identifiers of the M blocks. Alternatively, the grouping information of the M blocks includes the M transient identifiers of the M blocks.
  • the above grouping information of the M blocks can indicate the grouping of the M blocks, so that the coding end can use the grouping information to arrange the spectrums of the M blocks in groups.
  • the grouping information of M blocks includes: the number of groups of M blocks and the transient identifiers of M blocks.
  • the transient identifiers of the M blocks can also be called grouping flag information, so the grouping information in the embodiment of the present application can include the number of groups and the grouping flag information.
  • the value of the number of groups may be 1 or 2.
  • the group flag information is used to indicate the transient identity of the M blocks.
  • the grouping information of M blocks includes: the transient identifiers of the M blocks, and the transient identifiers of the M blocks may also be called grouping flag information, so the grouping information in this embodiment of the application may include grouping flag information.
  • the group flag information is used to indicate the transient identity of the M blocks.
  • when the number of groups of the M blocks is 1, the grouping information of the M blocks does not include the M transient identifiers; when the number of groups is greater than 1, the grouping information of the M blocks also includes the M transient identifiers of the M blocks.
  • the number of groups in the grouping information of the M blocks can also be replaced by a group-number identifier that indicates the number of groups; for example, when the group-number identifier is 0, it indicates that the number of groups is 1, and when the group-number identifier is 1, it indicates that the number of groups is 2.
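One plausible reading of the grouping-information variants above can be sketched as follows (the function name, return format, and 0/1 identifier convention are assumptions for illustration, not the patent's normative bitstream syntax):

```python
def grouping_info(transient_flags):
    """Derive grouping information from per-block transient flags.

    Returns (num_groups_id, flags): num_groups_id 0 means one group
    (all flags equal, so the flags need not be transmitted), 1 means
    two groups (the flags must be transmitted). Illustrative only."""
    num_groups = 1 if len(set(transient_flags)) == 1 else 2
    num_groups_id = 0 if num_groups == 1 else 1
    flags = list(transient_flags) if num_groups > 1 else None
    return num_groups_id, flags

print(grouping_info([0, 0, 0, 0]))  # (0, None)
print(grouping_info([0, 1, 0, 0]))  # (1, [0, 1, 0, 0])
```

When all blocks share one transient identifier, only the group-number identifier needs to be written to the code stream, which matches the rule that the flags are carried only when the number of groups is greater than 1.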
  • the method executed by the encoding end further includes:
  • the encoding end may carry the grouping information in the code stream, and first encode the grouping information, and the encoding method adopted for the grouping information is not limited here.
  • the group information coding result can be obtained, and the group information coding result can be written into the code stream, so that the code stream can carry the group information coding result.
  • step 305 can be executed first, and then step A2 can be executed, or step A2 can be executed first, and then step 305 can be executed, or step A2 and step 305 can be executed at the same time. There is no limit.
  • group and arrange the frequency spectra of the M blocks according to the grouping information of the M blocks, so as to obtain the frequency spectrum to be encoded of the current frame.
  • the frequency spectrum to be encoded may also be referred to as the frequency spectrum of the M blocks arranged in groups.
  • after the encoder obtains the grouping information of the M blocks, it can use the grouping information to group and arrange the frequency spectra of the M blocks in the current frame; by grouping and arranging the spectra of the M blocks, the order of the spectra of the M blocks in the current frame can be adjusted.
  • the above grouping arrangement is carried out according to the grouping information of M blocks, and the grouping information of M blocks is obtained according to M transient identifiers of M blocks.
  • the spectra of the M blocks after the grouping arrangement are obtained; the grouped arrangement of the spectra of the M blocks is based on the M transient identifiers of the M blocks, and the coding order of the spectra of the M blocks can be changed through the group sorting.
  • step 303 arranges the spectrums of the M blocks in groups according to the grouping information of the M blocks, so as to obtain the spectrum to be coded, including:
  • after the encoder obtains the grouping information of the M blocks, it groups the M blocks based on their transient identifiers, so that a transient group and a non-transient group can be obtained; then, in the frequency spectrum of the current frame, the spectra of the blocks in the transient group are arranged before the spectra of the blocks in the non-transient group to obtain the spectrum to be encoded. That is, the spectra of all transient blocks in the spectrum to be encoded are located before the spectra of the non-transient blocks, so the spectra of the transient blocks are moved to positions of higher coding importance, and the audio signal reconstructed after encoding and decoding with the neural network can better preserve the transient characteristics.
  • step 303 groups and arranges the spectrums of the M blocks according to the grouping information of the M blocks, so as to obtain the spectrum to be encoded of the current frame, including:
  • the spectra of the blocks among the M blocks indicated as transient blocks by the M transient identifiers are arranged before the spectra of the blocks indicated as non-transient blocks, so as to obtain the spectrum to be encoded of the current frame.
  • the spectra of all transient blocks in the spectrum to be encoded are located before the spectra of the non-transient blocks, so the spectra of the transient blocks are moved to positions of higher coding importance, and the audio signal reconstructed after encoding and decoding with the neural network can better preserve the transient characteristics.
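The "transient blocks first" arrangement described above can be sketched as follows (names and the 1 = transient flag convention are illustrative; the returned permutation is what the decoding end later inverts):

```python
import numpy as np

def group_arrange(block_spectra, transient_flags, transient_value=1):
    """Reorder per-block spectra so transient blocks come first.

    block_spectra: list of per-block spectrum arrays; transient_flags:
    one flag per block; transient_value marks a transient block (the
    bit convention is illustrative). Returns the reordered spectra and
    the permutation used."""
    order = [i for i, f in enumerate(transient_flags) if f == transient_value]
    order += [i for i, f in enumerate(transient_flags) if f != transient_value]
    return [block_spectra[i] for i in order], order

# Four blocks; blocks 1 and 3 are transient and move to the front.
spectra = [np.full(4, i, dtype=float) for i in range(4)]
arranged, order = group_arrange(spectra, [0, 1, 0, 1])
print(order)  # [1, 3, 0, 2]
```

The relative order within each group is preserved, so the decoder can reconstruct the same permutation from the transmitted transient identifiers alone.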
  • after the encoding end obtains the spectrum to be encoded of the current frame, it can use the encoding neural network to generate the spectral encoding result, write the spectral encoding result into the code stream, and send the code stream to the decoding end.
  • latent variables can be generated, and the latent variables represent the characteristics of the spectrum of the M blocks arranged in groups.
  • before step 304 uses the encoding neural network to encode the frequency spectrum to be encoded, the method performed by the encoding end further includes: D1. performing intra-group interleaving processing on the frequency spectrum to be encoded, to obtain the spectra of the M blocks interleaved within each group.
  • in this case, step 304 of using the encoding neural network to encode the frequency spectrum to be encoded includes: E1. using the encoding neural network to encode the spectra of the M blocks interleaved within each group.
  • the encoding end may first perform intra-group interleaving processing according to the grouping of the M blocks, so as to obtain the spectra of the M blocks interleaved within each group; the intra-group-interleaved spectra of the M blocks can then be the input data of the encoding neural network.
  • the coding side information can also be reduced and the coding efficiency can be improved.
  • among the M blocks, the number of blocks indicated as transient blocks by the M transient identifiers is P, and the number of blocks indicated as non-transient blocks is Q.
  • step D1 performs intra-group interleaving processing on the frequency spectrum to be coded, including:
  • performing interleaving processing on the frequency spectrum of P blocks includes performing interleaving processing on the frequency spectrum of the P blocks as a whole; similarly, performing interleaving processing on the frequency spectrum of Q blocks includes taking the frequency spectrum of the Q blocks as a whole A whole for interleaving processing.
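A minimal sketch of treating each group's spectra "as a whole" during interleaving, under the assumption that interleaving means reading same-frequency coefficients across all blocks of a group (the patent does not fix the exact interleaving pattern):

```python
import numpy as np

def interleave_group(group_spectra):
    """Interleave the spectra of one group coefficient by coefficient.

    Stacking the per-block spectra as rows and reading them out
    column-wise places same-frequency coefficients of all blocks of
    the group next to each other; this is one plausible reading of
    interleaving the group 'as a whole'."""
    return np.stack(group_spectra).T.reshape(-1)

# Two blocks of a (transient) group, each with 3 coefficients.
p_group = [np.array([0., 1., 2.]), np.array([10., 11., 12.])]
interleaved = interleave_group(p_group)  # 0, 10, 1, 11, 2, 12
```

The same routine would be applied separately to the P transient blocks and the Q non-transient blocks, and the two interleaved spectra then feed the encoding neural network.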
  • step E1 uses an encoding neural network to encode the frequency spectra of the M blocks interleaved within the group, including:
  • the interleaved frequency spectrum of P blocks and the interleaved frequency spectrum of Q blocks are encoded by using the encoding neural network.
  • the encoding end can perform interleaving processing according to the transient group and the non-transient group, so as to obtain the interleaved frequency spectrum of P blocks and the interleaved frequency spectrum of Q blocks.
  • the frequency spectrum of the interleaved processing of P blocks and the frequency spectrum of the interleaving processing of Q blocks can be used as input data of the encoding neural network.
  • before step 301 obtains the M transient identifiers of the M blocks according to the spectra of the M blocks of the current frame of the audio signal to be encoded, the method performed by the encoder further includes:
  • F1. determining the window type of the current frame, where the window type is a short window type or a non-short window type;
  • F2. when the window type is the short window type, performing the step of obtaining the M transient identifiers of the M blocks according to the spectra of the M blocks of the current frame of the audio signal to be encoded.
  • the encoding end may first determine the window type of the current frame.
  • the window type may be a short window type or a non-short window type.
  • the encoding end determines the window type according to the current frame of the audio signal to be encoded.
  • the short window may also be called a short frame
  • the non-short window may also be called a non-short frame.
  • when the window type is the short window type, the execution of the aforementioned step 301 is triggered.
  • the aforementioned encoding scheme can be implemented, so as to implement encoding when the audio signal is a transient signal.
  • when the encoding end performs the aforementioned steps F1 and F2, the method performed by the encoding end further includes:
  • the encoding end may carry the window type in the code stream, and first encode the window type, and the encoding method adopted for the window type is not limited here.
  • the window type encoding result can be obtained, and the window type encoding result can be written into the code stream, so that the code stream can carry the window type encoding result.
  • step 301 obtains M transient identifiers of M blocks according to the spectrum of M blocks of the current frame of the audio signal to be encoded, including:
  • the M spectral energies can be averaged to obtain the average value of the spectral energy, or the maximum value or several maximum values of the M spectral energies can be removed and then averaged, to obtain the spectral energy average.
  • by comparing the spectral energy of each block among the M spectral energies with the average value of the spectral energy, the change of each block's spectrum relative to the spectra of the other blocks among the M blocks is determined, and the M transient identifiers of the M blocks are then obtained, where the transient identifier of a block can be used to represent the transient characteristics of the block.
  • the transient identifier of each block can be determined through the spectral energy and the average value of the spectral energy of each block, so that the transient identifier of a block can determine the grouping information of the block.
  • the transient identifier of the first block indicates that the first block is a transient block
  • the transient flag of the first block indicates that the first block is a non-transient block
  • K is a real number greater than or equal to 1.
  • K may take various values, which are not limited here.
  • taking the determination of the transient identifier of the first block among the M blocks as an example: when the spectral energy of the first block is greater than K times the average value of the spectral energy, the spectrum of the first block changes greatly compared with the other blocks among the M blocks, and the transient identifier of the first block indicates that the first block is a transient block.
  • when the spectral energy of the first block is less than or equal to K times the average value of the spectral energy, the spectrum of the first block changes little compared with the other blocks among the M blocks, and the transient identifier of the first block indicates that the first block is a non-transient block.
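The energy-threshold rule above can be sketched as follows (the helper name, the example value K = 2, and the 1 = transient bit convention are assumptions for illustration):

```python
import numpy as np

def transient_flags(block_spectra, k=2.0):
    """Per-block transient flags from spectral energy vs. K * average.

    A block whose spectral energy exceeds k times the average energy
    over all blocks is flagged transient (1); otherwise non-transient
    (0). k and the bit convention are illustrative."""
    energies = np.array([np.sum(s ** 2) for s in block_spectra])
    avg = energies.mean()
    return [1 if e > k * avg else 0 for e in energies]

# Block 2 has much higher energy than the others and is flagged transient.
blocks = [np.ones(4), np.ones(4), 5 * np.ones(4), np.ones(4)]
print(transient_flags(blocks, k=2.0))  # [0, 0, 1, 0]
```

The averaging could also drop the largest one or several energies before dividing, as the preceding bullet allows; this sketch uses the plain mean.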
  • the encoder can also obtain the M transient identifiers of the M blocks in other ways; for example, it can obtain the difference or ratio between the spectral energy of the first block and the average value of the spectral energy, and determine the M transient identifiers of the M blocks according to the obtained difference or ratio.
  • in summary, the M transient identifiers of the M blocks are obtained according to the spectra of the M blocks of the current frame of the audio signal to be encoded, and the grouping information of the M blocks is then obtained according to the M transient identifiers.
  • the grouping information of the M blocks can be used to group and arrange the spectra of the M blocks in the current frame; by grouping and arranging the spectra of the M blocks, the arrangement order of the spectra of the M blocks in the current frame can be adjusted.
  • after obtaining the spectrum to be encoded, the encoding neural network is used to encode it, and the spectral encoding result is obtained and carried in the code stream.
  • the frequency spectra of the M blocks can be arranged in groups according to the M transient identifiers in the current frame of the audio signal, so that blocks with different transient identifiers can be grouped and encoded, and the encoding quality of the audio signal can be improved.
  • the embodiment of the present application also provides an audio signal decoding method, which can be executed by a terminal device.
  • the terminal device can be an audio signal decoding device (hereinafter referred to as the decoding end or decoder; for example, the decoding end can be an AI decoder).
  • the method performed on the decoding end in the embodiment of the present application mainly includes:
  • the decoding end receives the code stream sent by the encoding end, the encoding end writes the coding result of group information in the code stream, and the decoding end analyzes the code stream to obtain the group information of M blocks of the current frame of the audio signal.
  • the decoding end can determine M transient identifiers of the M blocks according to the grouping information of the M blocks.
  • the group information may include: group quantity and group flag information.
  • the grouping information may include grouping flag information. For details, refer to the description of the foregoing embodiments at the encoding end.
  • the decoding end uses the decoding neural network to decode the code stream to obtain the decoded spectrum of M blocks.
  • the decoded spectrum of the M blocks corresponds to the spectrum of the M blocks arranged in groups at the encoding end.
  • the decoding neural network performs the inverse of the process executed by the encoding neural network at the encoding end; through decoding, the reconstructed spectra of the M blocks arranged in groups are obtained.
  • the decoding end obtains the grouping information of the M blocks and also obtains the decoded spectra of the M blocks from the code stream. Since the encoding end arranges the spectra of the M blocks in groups, the decoding end needs to perform the opposite process: according to the grouping information of the M blocks, it inversely groups and arranges the decoded spectra of the M blocks to obtain the spectra of the M blocks after the inverse grouping arrangement.
  • the decoding end can perform a frequency-domain to time-domain transformation on the spectra of the M blocks after the inverse grouping arrangement processing, so as to obtain the reconstructed audio signal of the current frame.
  • before step 403 performs inverse grouping processing on the decoded spectra of the M blocks according to the grouping information of the M blocks, the method performed by the decoding end further includes:
  • I1. Perform intra-group de-interleaving processing on the decoded spectrum of M blocks, to obtain the frequency spectrum of the intra-group de-interleaving processing of M blocks;
  • Step 403 performs inverse grouping processing on the decoded spectrum of M blocks according to the grouping information of M blocks, including:
  • the intra-group de-interleaving performed by the decoding end is the inverse process of the intra-group interleaving at the encoding end, which will not be described in detail here.
  • Step I1 of performing intra-group deinterleaving processing on the decoded spectra of the M blocks includes:
  • I11. Performing deinterleaving processing on the decoded spectra of the P blocks;
  • I12. Performing deinterleaving processing on the decoded spectra of the Q blocks.
  • performing deinterleaving processing on the frequency spectrum of P blocks includes performing deinterleaving processing on the frequency spectrum of the P blocks as a whole; similarly, performing deinterleaving processing on the frequency spectrum of Q blocks includes deinterleaving the frequency spectrum of the Q blocks The frequency spectrum is deinterleaved as a whole.
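Assuming the encoder interleaved each group coefficient-wise, the decoder-side inverse can be sketched as follows (illustrative only; it mirrors one possible encoder interleave, not a normative one):

```python
import numpy as np

def deinterleave_group(interleaved, num_blocks):
    """Inverse of coefficient-wise interleaving within one group.

    Splits an interleaved group spectrum back into num_blocks
    per-block spectra, mirroring an illustrative column-wise
    interleave at the encoder."""
    return list(interleaved.reshape(-1, num_blocks).T)

# Two blocks were interleaved coefficient by coefficient at the encoder.
interleaved = np.array([0., 10., 1., 11., 2., 12.])
parts = deinterleave_group(interleaved, 2)
# parts[0] -> [0, 1, 2], parts[1] -> [10, 11, 12]
```

Applying this separately to the P-block and Q-block portions of the decoded spectrum restores the per-block spectra before the inverse grouping arrangement.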
  • the encoder can perform interleaving processing according to the transient group and the non-transient group respectively, so as to obtain the interleaved frequency spectrum of P blocks and the interleaved frequency spectrum of Q blocks.
  • the frequency spectrum of the interleaved processing of P blocks and the frequency spectrum of the interleaving processing of Q blocks can be used as input data of the encoding neural network.
  • side information of coding can also be reduced, and coding efficiency can be improved. Since the encoding end performs intra-group interleaving, the decoding end needs to perform a corresponding inverse process, that is, the decoding end can perform deinterleaving processing.
  • among the M blocks, the number of blocks indicated as transient blocks by the M transient identifiers is P, and the number of blocks indicated as non-transient blocks is Q.
  • Step 403 performs inverse grouping processing on the decoded spectrum of M blocks according to the grouping information of M blocks, including:
  • the decoded frequency spectrums of the M blocks are reversely grouped and arranged.
  • the indexes of the M blocks are originally continuous, for example, from 0 to M-1; after the encoding end performs the group arrangement, the indexes of the M blocks are no longer continuous. According to the grouping information of the M blocks, the decoder can obtain the indexes of the P blocks and of the Q blocks among the reconstructed grouped M blocks; after the inverse grouping arrangement processing, the indexes of the M blocks are recovered and are again continuous.
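A sketch of recovering contiguous block indexes from the transmitted transient flags (the flag convention and names are illustrative):

```python
def inverse_group_arrange(decoded_spectra, transient_flags, transient_value=1):
    """Restore the original block order from decoded grouped spectra.

    Recomputes the encoder-side permutation (transient blocks first)
    from the transmitted flags, then inverts it so that block indexes
    0..M-1 are contiguous again. Conventions are illustrative."""
    order = [i for i, f in enumerate(transient_flags) if f == transient_value]
    order += [i for i, f in enumerate(transient_flags) if f != transient_value]
    restored = [None] * len(order)
    for pos, original_index in enumerate(order):
        restored[original_index] = decoded_spectra[pos]
    return restored

# For flags [0, 1, 0, 1] the encoder-side order was [1, 3, 0, 2].
print(inverse_group_arrange(['s1', 's3', 's0', 's2'], [0, 1, 0, 1]))
# ['s0', 's1', 's2', 's3']
```

Because the permutation is fully determined by the grouping information, no extra side information beyond the transient identifiers is needed to undo the arrangement.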
  • the method performed by the decoding end further includes: determining the window type of the current frame, where the window type is a short window type or a non-short window type; when the window type of the current frame is the short window type, the step of obtaining the grouping information of the M blocks of the current frame from the code stream is executed.
  • the aforementioned decoding scheme is implemented only when the window type of the current frame is the short window type, so as to implement decoding when the audio signal is a transient signal.
  • the decoding end performs the reverse process of the encoding end, so the decoding end can also first determine the window type of the current frame, which can be a short window type or a non-short window type; for example, the decoding end obtains the window type of the current frame from the code stream.
  • the short window may also be called a short frame
  • the non-short window may also be called a non-short frame.
  • the grouping information of M blocks includes: the grouping quantity or grouping quantity identification of M blocks, the grouping quantity identification is used to indicate the grouping quantity, when the grouping quantity is greater than 1, the grouping information of M blocks It also includes: M transient identifiers of M blocks;
  • the grouping information of the M blocks includes: M transient identifiers of the M blocks.
  • the grouping information of the M blocks of the current frame of the audio signal is obtained from the code stream, and the grouping information is used to indicate the M transient identifiers of the M blocks; the code stream is decoded to obtain the decoded spectra of the M blocks; according to the grouping information of the M blocks, the decoded spectra of the M blocks are inversely grouped and arranged to obtain the spectra of the M blocks after the inverse grouping arrangement; and the reconstructed audio signal of the current frame is obtained from the spectra after the inverse grouping arrangement.
  • the decoded spectra of the M blocks can be obtained when decoding the code stream; the spectra of the M blocks after the inverse grouping arrangement can then be obtained through the inverse grouping arrangement processing, and finally the reconstructed audio signal of the current frame is obtained.
  • inverse group arrangement and decoding can be performed according to blocks with different transient identifiers in the audio signal, so the audio signal reconstruction effect can be improved.
• FIG. 5 is a schematic diagram of the system architecture for the radio and television field provided by the embodiment of this application, including a 3D sound codec.
• the 3D sound signal produced in the 3D sound production of a live broadcast program is encoded by the 3D sound encoding of the embodiment of this application to obtain a code stream, which is transmitted to the user side through the radio and television network and decoded by the 3D sound decoder in the set-top box to reconstruct the 3D sound signal, which is played back by the loudspeaker group.
• the 3D sound signal produced in program post-production is encoded by the 3D sound encoding of the embodiment of this application to obtain a code stream, which is transmitted to the user side through the broadcasting network or the Internet; the 3D sound decoder in the network receiver or mobile terminal decodes and reconstructs the 3D sound signal, which is played back by the speaker group or earphones.
• the embodiment of the present application provides an audio codec, which can be applied to a wireless access network, a media gateway of a core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, and the like; it can also be applied to audio codecs in broadcast TV or terminal media playback, and to VR streaming services.
  • the following audio signal encoding method is implemented by applying the encoder proposed in the embodiment of the present application, including:
  • a specific implementation includes the following three steps:
  • the audio signal of the current frame is a time-domain signal of L points.
  • Transient detection is performed according to the audio signal of the current frame to determine the transient information of the current frame.
  • the transient information of the current frame may include one or more of an identifier of whether the current frame is a transient signal, a location where the transient occurs in the current frame, and a parameter characterizing the degree of the transient.
  • the transient degree may be the level of the transient energy, or the ratio of the signal energy at the position where the transient occurs to the signal energy at the adjacent non-transient position.
  • the window type of the current frame is a short window.
  • the window type of the current frame is other window types excluding the short window.
  • the embodiment of the present application does not limit other window types, for example, other window types may include: long windows, cut-in windows, cut-out windows, and the like.
  • window type of the current frame is a short window
  • the audio signal of the current frame is subjected to short-window windowing processing and time-frequency transformation to obtain MDCT spectra of M blocks.
  • the window type of the current frame is a short window
  • M overlapping short window window functions are used for windowing processing to obtain audio signals of M blocks after windowing, where M is a positive integer greater than or equal to 2.
  • the window length of the short window window function is 2L/M, where L is the frame length of the current frame, and the splicing length is L/M.
  • M is equal to 8
  • L is equal to 1024
  • the window length of the short window function is 256 samples
  • the splicing length is 128 samples.
  • the audio signals of the M blocks after windowing are respectively subjected to time-frequency transformation to obtain the MDCT spectrum of the M blocks of the current frame.
  • the length of the windowed audio signal of the current block is 256 samples.
  • 128 MDCT coefficients are obtained, which is the MDCT spectrum of the current block.
• step S13 obtains the group number and grouping flag information of the current frame. In one implementation: first, the MDCT spectra of the M blocks are interleaved to obtain the interleaved MDCT spectrum of the M blocks; next, encoding preprocessing is performed on the interleaved MDCT spectrum of the M blocks to obtain the preprocessed MDCT spectrum; then the preprocessed MDCT spectrum is deinterleaved to obtain the deinterleaved MDCT spectra of the M blocks; finally, the group number and grouping flag information of the current frame are determined according to the deinterleaved MDCT spectra of the M blocks.
  • Interleaving the MDCT spectrum of M blocks is to interleave the M MDCT spectrum with length L/M into MDCT spectrum with length L.
• for each frequency index i, the spectral coefficients are arranged in block order from block 0 to block M-1, with i running from 0 to L/M-1.
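• as an illustrative sketch (not part of the claimed method), the interleaving described above and its inverse can be expressed in a few lines of Python; the function names and the numpy representation of the M block spectra as an array of shape (M, L/M) are assumptions for illustration:

```python
import numpy as np

def interleave_blocks(blocks):
    """Interleave M spectra of length L/M into one spectrum of length L.

    The output order is: block 0 bin 0, block 1 bin 0, ..., block M-1 bin 0,
    block 0 bin 1, block 1 bin 1, ... as described in the text.
    """
    blocks = np.asarray(blocks)
    # Transposing to shape (L/M, M) and flattening row-major yields,
    # for each frequency bin i, the coefficients of blocks 0..M-1 in order.
    return blocks.T.reshape(-1)

def deinterleave_blocks(spectrum, m):
    """Inverse of interleave_blocks: split a length-L spectrum back into M blocks."""
    spectrum = np.asarray(spectrum)
    return spectrum.reshape(-1, m).T
```

The deinterleaving used later at steps S36 and on the decoding side is exactly the inverse reshape, so a round trip recovers the original block spectra.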
  • the encoding preprocessing operation may include: frequency domain noise shaping (frequency domain noise shaping, FDNS), time domain noise shaping (temporal noise shaping, TNS) and bandwidth extension (bandwidth extension, BWE) and other processing, which is not limited here.
  • the deinterleaving process is the inverse process of the interleaving process.
  • the length of the preprocessed MDCT spectrum is L
• the preprocessed MDCT spectrum of length L is divided into M MDCT spectra of length L/M, and the MDCT spectrum within each block is arranged from low frequency bins to high, yielding the deinterleaved MDCT spectra of the M blocks.
• preprocessing the interleaved spectrum can reduce coding side information, thereby reducing the bit occupation of the side information and improving coding efficiency.
  • the specific method includes the following three steps:
  • the MDCT spectral energy of each block is calculated, which is denoted as enerMdct[8].
  • 8 is the value of M
  • 128 represents the number of MDCT coefficients in one block.
• Method 1: directly calculate the average of the MDCT spectral energies of the M blocks, that is, the average of enerMdct[8], and use it as the MDCT spectral energy average avgEner.
• Method 2: determine the block with the largest MDCT spectral energy among the M blocks; calculate the average of the MDCT spectral energies of the other M-1 blocks, excluding the block with the largest energy, and use it as the MDCT spectral energy average avgEner. Alternatively, calculate the average over the blocks remaining after excluding several blocks with the largest energy, and use it as avgEner.
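• a minimal sketch of the two averaging methods above; the function names are illustrative, and the block spectra are assumed to be held in an array of shape (M, L/M):

```python
import numpy as np

def block_energies(mdct_spectrum):
    """MDCT spectral energy of each block (sum of squared coefficients)."""
    return np.sum(np.asarray(mdct_spectrum, dtype=float) ** 2, axis=1)

def average_energy(ener, method=1, drop_largest=1):
    """Average MDCT spectral energy avgEner per the two methods above.

    method 1: plain mean over all M block energies.
    method 2: mean excluding the drop_largest highest-energy blocks.
    """
    ener = np.sort(np.asarray(ener, dtype=float))
    if method == 2:
        ener = ener[:-drop_largest]  # drop the largest-energy block(s)
    return ener.mean()
```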
• the group number and grouping flag information of the current frame are determined according to the MDCT spectral energies of the M blocks and the average MDCT spectral energy, and written into the code stream.
• determining whether the current block is a transient block may be done by comparing the MDCT spectral energy of each block with the average MDCT spectral energy: if the MDCT spectral energy of the current block is greater than K times the average, the current block is a transient block and its transient flag is 0; otherwise, the current block is a non-transient block and its transient flag is 1.
• the M blocks are grouped and the group number and grouping flag information are determined: blocks with the same transient identifier value form one group, the M blocks are divided into N groups, and N is the group number.
  • the group flag information is information composed of the transient flag value of each block in the M blocks.
  • transient blocks form transient groups and non-transient blocks form non-transient groups.
  • the number of groups numGroups of the current frame is 2 if the transient identifiers of the blocks are not completely the same, otherwise it is 1.
• the group quantity can be indicated by a group quantity identifier. For example, a group quantity identifier of 1 indicates that the group number of the current frame is 2; a group quantity identifier of 0 indicates that the group number of the current frame is 1.
  • the group indicator information groupIndicator of the current frame is formed by sequentially arranging the transient identifiers of M blocks.
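• the classification and grouping steps above can be sketched as follows; the threshold K and the function name are illustrative assumptions (the embodiment does not fix a value of K):

```python
def grouping_info(ener, avg_ener, k=2.0):
    """Derive per-block transient flags, group number, and group indicator.

    A block is transient (flag 0) when its MDCT spectral energy exceeds
    k times the average; otherwise it is non-transient (flag 1).
    k = 2.0 is an assumed example threshold.
    """
    flags = [0 if e > k * avg_ener else 1 for e in ener]
    # numGroups is 2 when the transient flags are not all identical, else 1.
    num_groups = 2 if len(set(flags)) > 1 else 1
    group_indicator = flags  # transient identifiers in block order
    return num_groups, group_indicator
```

For the worked example later in this description, block energies of roughly [1, 1, 1, 10, 10, 10, 10, 1] with avgEner near 1 would yield numGroups = 2 and groupIndicator = 1 1 1 0 0 0 0 1.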
  • step S13 obtains the number of groups and grouping flag information
• another implementation is: do not perform interleaving and deinterleaving on the MDCT spectra of the M blocks, but directly determine the group number and grouping flag information of the current frame according to the MDCT spectra of the M blocks, encode them, and write the encoding result into the code stream.
  • Determining the number of groups and group flag information of the current frame according to the MDCT spectrum of M blocks is similar to determining the number of groups and group flag information of the current frame according to the MDCT spectrum of M blocks after deinterleaving, and will not be repeated here.
  • non-transient group may be further divided into two or more other groups, which is not limited in this embodiment of the present application.
  • a non-transient group can be divided into a harmonic group and a non-harmonic group.
  • the MDCT spectrum arranged in groups is the spectrum to be encoded of the current frame.
• the encoding neural network of the encoder has a better encoding effect on the spectrum placed at the front, so moving the transient blocks to the front ensures the encoding effect of the transient blocks, thereby retaining more spectral details of the transient blocks and improving encoding quality.
  • the MDCT spectrum arranged in groups is first interleaved within the group to obtain the MDCT spectrum interleaved within the group. Then, the encoding neural network is used to encode the interleaved MDCT spectrum within the group.
  • the intra-group interleaving process is similar to the aforementioned interleaving process performed on the MDCT spectrum of M blocks before obtaining the group number and group flag information, except that the object of interleaving is the MDCT spectrum belonging to the same group. For example, the interleaving process is performed on the MDCT spectrum blocks belonging to the transient group.
  • the MDCT spectrum blocks belonging to the non-transient group are interleaved.
  • the encoding neural network processing is pre-trained, and the embodiment of the present application does not limit the specific network structure and training method of the encoding neural network.
  • the encoding neural network can choose fully connected network or convolutional neural network (convolutional neural networks, CNN).
  • the decoding process corresponding to the encoding end includes:
• if the window type of the current frame is a short window, decode the received code stream to obtain the group number and grouping flag information.
• the group quantity identifier in the code stream can be parsed, and the group number of the current frame determined according to it. For example, a group quantity identifier of 1 indicates that the group number of the current frame is 2; a group quantity identifier of 0 indicates that the group number of the current frame is 1.
  • Decoding the received code stream to obtain group flag information may be: reading M-bit group flag information from the code stream. Whether the i-th block is a transient block can be determined according to the value of the i-th bit of the group flag information. If the value of the i-th bit is 0, it means that the i-th block is a transient block; if the value of the i-th bit is 1, it means that the i-th block is a non-transient block.
  • the decoding process at the decoding end corresponds to the encoding process at the encoding end. Specific steps include:
  • the decoded MDCT spectrum is obtained by using the decoding neural network.
  • the decoded MDCT spectrum belonging to the same group can be determined.
  • Intra-group deinterleaving processing is performed on the MDCT spectrum belonging to the same group to obtain the MDCT spectrum processed by intragroup deinterleaving.
  • the de-interleaving process within the group is the same as the de-interleaving process of the MDCT spectrum of the interleaved M blocks before the coder obtains the group number and group flag information.
• the inverse grouping arrangement at the decoding end is the inverse process of the grouping arrangement at the encoding end.
  • the MDCT spectrum processed by intra-group deinterleaving is composed of M MDCT spectrum blocks of L/M points.
  • the block index idx0(i) of the i-th transient block is the block index corresponding to the block whose i-th flag value is 0 in the group flag information, and i starts from 0.
• the number of transient blocks is the number of bits whose flag value is 0 in the grouping flag information, denoted num0.
• the non-transient blocks then need to be processed similarly.
• the block index idx1(j) of the j-th non-transient block is the block index corresponding to the block whose j-th flag value is 1 in the grouping flag information, and j starts from 0.
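• the inverse grouping arrangement using idx0(i) and idx1(j) can be sketched as below; it assumes the decoded blocks arrive transient group first, then non-transient group, as produced by the encoder's arrangement (function name is illustrative):

```python
import numpy as np

def inverse_group_arrange(decoded_blocks, group_indicator):
    """Restore the time order of M decoded block spectra.

    decoded_blocks: shape (M, L/M), transient blocks first, then
    non-transient blocks. group_indicator: M transient flags (0/1).
    """
    decoded_blocks = np.asarray(decoded_blocks)
    # idx0(i): time position of the i-th transient block (flag value 0)
    idx0 = [i for i, f in enumerate(group_indicator) if f == 0]
    # idx1(j): time position of the j-th non-transient block (flag value 1)
    idx1 = [i for i, f in enumerate(group_indicator) if f == 1]
    out = np.empty_like(decoded_blocks)
    for i, pos in enumerate(idx0):
        out[pos] = decoded_blocks[i]
    for j, pos in enumerate(idx1):
        out[pos] = decoded_blocks[len(idx0) + j]
    return out
```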
• a specific implementation is: first, interleave the inverse-grouping-arranged MDCT spectra of the M blocks to obtain the interleaved MDCT spectrum of the M blocks; next, perform post-decoding processing on the interleaved MDCT spectrum of the M blocks.
• post-decoding processing can include inverse TNS, inverse FDNS, BWE processing, etc., and mirrors, step by step, the encoding preprocessing performed at the encoding end.
• the post-decoding-processed MDCT spectrum is obtained; then it is deinterleaved to obtain the deinterleaved MDCT spectra of the M blocks; finally, the deinterleaved MDCT spectra of the M blocks are each transformed from the frequency domain to the time domain, and after de-windowing and overlap-add processing, the reconstructed audio signal is obtained.
• another specific implementation for obtaining the reconstructed audio signal is: transform the MDCT spectra of the M blocks from the frequency domain to the time domain respectively, and obtain the reconstructed audio signal after de-windowing and overlap-add processing.
  • the encoding method of the audio signal performed by the encoding end includes:
  • the frame length is 1024
  • the input signal of the current frame is an audio signal of 1024 points.
• the input signal of the current frame is divided into several blocks, and the signal energy in each block is calculated; if the signal energy changes abruptly between adjacent blocks, the current frame is considered a transient signal.
• if the current frame is a transient signal, the window type of the current frame is a short window; otherwise it is a long window.
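• a hedged sketch of this transient detection; the block count and the energy-jump ratio are illustrative assumptions, since the text only states that a sudden energy change between adjacent blocks marks a transient:

```python
import numpy as np

def is_transient(frame, num_blocks=8, ratio=8.0):
    """Flag a frame as transient if energy jumps between adjacent blocks.

    num_blocks=8 and ratio=8.0 are assumed example values.
    """
    blocks = np.asarray(frame, dtype=float).reshape(num_blocks, -1)
    ener = np.sum(blocks ** 2, axis=1) + 1e-12  # guard against division by zero
    jumps = ener[1:] / ener[:-1]
    return bool(np.any(jumps > ratio))
```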
  • the window type of the current frame can also add a cut-in window and a cut-out window.
• let the frame number of the current frame be i; determine the window type of the current frame according to the transient detection results of frames i-1 and i-2 and the transient detection result of the current frame.
  • the window type of frame i is long window.
  • the window type of frame i is cut-in window.
  • the window type of the i-th frame is a cut-out window.
  • the window type of frame i is short window.
• windowing and MDCT transformation are performed respectively: for the long window, cut-in window, and cut-out window, the windowed signal length is 2048 and 1024 MDCT coefficients are obtained; for the short window, 8 overlapping short windows of length 256 are applied, and each short window yields 128 MDCT coefficients. The 128 MDCT coefficients of each short window are called a block, giving 1024 MDCT coefficients in total.
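• a minimal sketch of the per-block MDCT described above, using a direct O(N²) formulation; the sine window is an assumed example shape, since the embodiment does not specify the window function:

```python
import numpy as np

def mdct(x):
    """Direct MDCT of a 2N-sample windowed block, yielding N coefficients.

    For a 256-sample short window this gives the 128 coefficients per block.
    """
    n2 = len(x)
    n = n2 // 2
    k = np.arange(n)
    t = np.arange(n2)
    # Standard MDCT kernel: cos(pi/N * (t + 0.5 + N/2) * (k + 0.5))
    phase = np.pi / n * (t[None, :] + 0.5 + n / 2) * (k[:, None] + 0.5)
    return np.cos(phase) @ x

def short_window(n2):
    """Sine window of length 2N (assumed example window function)."""
    t = np.arange(n2)
    return np.sin(np.pi / n2 * (t + 0.5))
```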
  • the window type of the current frame is a short window, perform interleaving processing on the MDCT spectrum of the current frame to obtain an interleaved MDCT spectrum.
  • the MDCT spectrum of eight blocks is interleaved, that is, eight 128-dimensional MDCT spectrums are interleaved into an MDCT spectrum with a length of 1024.
• the spectrum form after interleaving can be: block 0 bin 0, block 1 bin 0, block 2 bin 0, ..., block 7 bin 0, block 0 bin 1, block 1 bin 1, block 2 bin 1, ..., block 7 bin 1, ....
• block 0 bin 0 represents the 0th frequency bin of the 0th block.
  • Preprocessing may include FDNS, TNS, BWE and other processing.
• perform deinterleaving in the opposite manner to step S35 to obtain 8 blocks of MDCT spectrum, each with 128 points.
  • the information may include the number of groups numGroups and group indicator information groupIndicator.
• the specific scheme for determining the grouping information may be any of the implementations of step S13 performed by the encoding end. For example, if the MDCT spectral coefficients of the 8 blocks in a short frame are mdctSpectrum[8][128], the MDCT spectral energy of each block is calculated and recorded as enerMdct[8], and the average MDCT spectral energy of the 8 blocks is recorded as avgEner. There are two methods for calculating the average MDCT spectral energy:
• Method 1: directly calculate the average of the MDCT spectral energies of the 8 blocks, that is, the average of enerMdct[8].
• Method 2: to reduce the influence of the highest-energy block among the 8 blocks on the average, the energy of the largest block can be removed before calculating the average.
• the current block is considered a transient block (flag 0); otherwise the current block is considered a non-transient block (flag 1). All transient blocks form a transient group, and all non-transient blocks form a non-transient group.
  • the grouping information obtained from the preliminary judgment can be:
• Block index: 0 1 2 3 4 5 6 7.
• Group indicator information groupIndicator: 1 1 1 0 0 0 0 1.
  • the number of groups and group flag information need to be written into the code stream and transmitted to the decoding end.
  • the specific scheme of grouping and arranging the MDCT spectrums of the M blocks according to the grouping information may be any one of the aforementioned steps S14 performed by the coding end.
• in step S38, if the grouping information is:
• Block index: 0 1 2 3 4 5 6 7.
• Group indicator information groupIndicator: 1 1 1 0 0 0 0 1.
• then the block index order after arrangement is: 3 4 5 6 0 1 2 7.
• the spectrum of the 0th block after arrangement is the spectrum of the 3rd block before arrangement;
• the spectrum of the 1st block after arrangement is the spectrum of the 4th block before arrangement;
• the spectrum of the 2nd block after arrangement is the spectrum of the 5th block before arrangement;
• the spectrum of the 3rd block after arrangement is the spectrum of the 6th block before arrangement;
• the spectrum of the 4th block after arrangement is the spectrum of the 0th block before arrangement;
• the spectrum of the 5th block after arrangement is the spectrum of the 1st block before arrangement;
• the spectrum of the 6th block after arrangement is the spectrum of the 2nd block before arrangement;
• the spectrum of the 7th block after arrangement is the spectrum of the 7th block before arrangement.
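• the grouping arrangement of this worked example reduces to a permutation derived from groupIndicator: transient blocks (flag 0) first, then non-transient blocks (flag 1), each kept in time order. A sketch (function name is illustrative):

```python
def group_arrange_order(group_indicator):
    """Block order after grouping arrangement.

    Transient blocks (flag value 0) come first, then non-transient
    blocks (flag value 1), each group preserving time order.
    """
    idx0 = [i for i, f in enumerate(group_indicator) if f == 0]
    idx1 = [i for i, f in enumerate(group_indicator) if f == 1]
    return idx0 + idx1
```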
  • S310 Perform intra-group spectrum interleaving processing on the group-arranged MDCT spectrum to obtain the intra-group interleaved MDCT spectrum.
  • interleave processing within the group is performed for each group, and the processing method is similar to step S35, except that the interleaving processing is limited to processing the MDCT spectrum belonging to the same group.
  • interleave the transient groups (blocks 3, 4, 5, and 6 before the arrangement, that is, blocks 0, 1, 2, and 3 after the arrangement), and interleave the other Groups (blocks 0, 1, 2, and 7 before the arrangement, that is, blocks 4, 5, 6, and 7 after the arrangement) are interleaved.
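• the intra-group interleaving of step S310 can be sketched as below: the same column-major interleave as step S35, applied separately to the consecutive rows of each group. The array layout and function name are illustrative assumptions:

```python
import numpy as np

def intra_group_interleave(arranged_blocks, group_sizes):
    """Interleave the block spectra within each group separately.

    arranged_blocks: shape (M, L/M), already group-arranged so each group
    occupies consecutive rows; group_sizes lists the block count of each
    group in order (e.g. [4, 4] for the example above).
    """
    arranged_blocks = np.asarray(arranged_blocks)
    out, start = [], 0
    for size in group_sizes:
        group = arranged_blocks[start:start + size]
        out.append(group.T.reshape(-1))  # per-group column-major interleave
        start += size
    return np.concatenate(out)
```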
  • the embodiment of the present application does not limit the specific method of encoding the MDCT spectrum after intra-group interleaving by using the encoding neural network.
• the MDCT spectrum after intra-group interleaving is processed by the encoding neural network to generate latent variables. The latent variables are quantized to obtain quantized latent variables. Arithmetic encoding is performed on the quantized latent variables, and the arithmetic encoding result is written into the code stream.
  • the MDCT spectrum of the current frame obtained in step S34 is directly encoded by using an encoding neural network.
• determine the window function corresponding to the window type, and perform windowing processing on the audio signal of the current frame to obtain the windowed signal; when the windows of adjacent frames overlap, a time-frequency forward transform such as the MDCT transform is performed on the windowed signal to obtain the MDCT spectrum of the current frame; the MDCT spectrum of the current frame is then encoded.
  • the decoding method of the audio signal performed by the decoder includes:
  • the decoding neural network corresponds to the encoding neural network.
• the specific method of decoding using the decoding neural network: perform arithmetic decoding on the received code stream to obtain quantized latent variables; dequantize the quantized latent variables to obtain dequantized latent variables; take the dequantized latent variables as input to the decoding neural network to generate the decoded MDCT spectrum.
  • the MDCT spectrum blocks belonging to the same group are determined according to the number of groups and group flag information.
  • the decoded MDCT spectrum is divided into 8 blocks.
  • the number of groups is equal to 2, and the group indicator information groupIndicator is 1 1 1 0 0 0 0 1.
• the number of bits with flag value 0 in the grouping flag information is 4, so the MDCT spectra of the first 4 blocks of the decoded MDCT spectrum form one group, belonging to the transient group, and need intra-group deinterleaving; the number of bits with flag value 1 is also 4, so the MDCT spectra of the last 4 blocks form another group, belonging to the non-transient group, and also need intra-group deinterleaving.
• the eight blocks of MDCT spectrum obtained in this way constitute the intra-group deinterleaved MDCT spectra of the eight blocks.
  • the MDCT spectrums processed by deinterleaving in the group are arranged into M block spectrums sorted by time.
• the MDCT spectrum of the 0th block obtained by intra-group deinterleaving is moved to the position of the 3rd block (the element position index corresponding to the first bit with flag value 0 in the grouping flag information is 3);
• the MDCT spectrum of the 1st block is moved to the position of the 4th block (the element position index corresponding to the second bit with flag value 0 is 4);
• the MDCT spectrum of the 2nd block is moved to the position of the 5th block (the element position index corresponding to the third bit with flag value 0 is 5);
• the MDCT spectrum of the 3rd block is moved to the position of the 6th block (the element position index corresponding to the fourth bit with flag value 0 is 6).
  • the short-frame spectrum form after spectrum grouping is as follows: Block index 3 4 5 6 0 1 2 7.
  • the window type of the current frame is a short window
• the inverse-grouping-arranged MDCT spectrum is interleaved, in the same manner as before.
  • Post-decoding processing may include BWE inverse processing, TNS inverse processing, FDNS inverse processing and so on.
• the reconstructed MDCT spectrum includes the MDCT spectra of M blocks, and the inverse MDCT transform is performed on each block's MDCT spectrum respectively. After de-windowing and overlap-add are performed on the inversely transformed signal, the reconstructed audio signal of the short frame is obtained.
• if the window type of the current frame is another window type, decode according to the decoding method corresponding to that frame type to obtain the reconstructed audio signal.
  • the reconstructed MDCT spectrum is obtained by using the decoding neural network.
  • the window type of the current frame is a short window
• the group number and grouping flag information of the current frame are obtained; according to the group number and grouping flag information of the current frame, the frequency spectra of the M blocks of the current frame are grouped and arranged to obtain the group-arranged spectrum; the group-arranged spectrum is encoded using an encoding neural network.
• the MDCT spectrum containing the transient features can be moved to a position of higher coding importance, so that after encoding and decoding with the neural network the reconstructed audio signal better preserves the transient features.
• the embodiment of the present application can also be used for stereo coding, the difference being: first, according to steps S31-S310 of the encoding end in the previous embodiment, the left and right channels of the stereo signal are processed respectively to obtain the intra-group interleaved MDCT spectrum of the left channel and the intra-group interleaved MDCT spectrum of the right channel. Then step S311 becomes: use the encoding neural network to encode the intra-group interleaved MDCT spectrum of the left channel and the intra-group interleaved MDCT spectrum of the right channel.
• the input of the encoding neural network is no longer the intra-group interleaved MDCT spectrum of a mono channel, but the intra-group interleaved MDCT spectra of the left channel and the right channel.
  • the coding neural network may be a CNN network, and the MDCT spectrum after intra-group interleaving of the left channel and the MDCT spectrum after intra-group interleaving of the right channel are used as the input of the two channels of the CNN network.
  • the process performed by the decoder includes:
  • the window type of the left channel of the current frame, the number of groups and the group flag information are obtained.
• the window type of the right channel of the current frame, the group number, and the grouping flag information are obtained.
  • the decoding neural network is used to obtain the decoded stereo MDCT spectrum.
• the process is performed according to the steps of monophonic decoding on the decoding side of Embodiment 1, and the reconstructed left channel signal is obtained.
• the process is performed according to the steps of monophonic decoding on the decoding side of Embodiment 1, and the reconstructed right channel signal is obtained.
  • an audio encoding device 1000 provided by the embodiment of the present application may include: a transient identification obtaining module 1001, a grouping information obtaining module 1002, a grouping arrangement module 1003 and an encoding module 1004, wherein,
  • a transient identification obtaining module configured to obtain M transient identifications of the M blocks according to the spectrum of the M blocks of the current frame of the audio signal to be encoded; the M blocks include a first block, and the first block The transient identifier of is used to indicate that the first block is a transient block, or indicate that the first block is a non-transient block;
  • a grouping information obtaining module configured to obtain the grouping information of the M blocks according to the M transient identifiers of the M blocks;
  • a grouping and arranging module configured to group and arrange the frequency spectra of the M blocks according to the grouping information of the M blocks, so as to obtain the frequency spectrum to be encoded of the current frame;
  • An encoding module configured to encode the frequency spectrum to be encoded by using an encoding neural network to obtain a frequency spectrum encoding result; and write the frequency spectrum encoding result into a code stream.
  • an audio decoding device 1100 may include: a grouping information obtaining module 1101, a decoding module 1102, an inverse grouping arrangement module 1103, and an audio signal obtaining module 1104, wherein,
  • the grouping information obtaining module is used to obtain the grouping information of M blocks of the current frame of the audio signal from the code stream, and the grouping information is used to indicate the M transient identifiers of the M blocks;
  • a decoding module configured to use a decoding neural network to decode the code stream to obtain decoded spectrum of M blocks;
  • An inverse grouping and arranging module configured to perform inverse grouping and arranging processing on the decoded spectrum of the M blocks according to the grouping information of the M blocks, so as to obtain the spectrum of the inverse grouping processing of the M blocks;
• the audio signal obtaining module is configured to obtain the reconstructed audio signal of the current frame according to the inverse-grouping-arranged spectra of the M blocks.
  • the embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.
  • the audio coding device 1200 includes:
  • a receiver 1201 , a transmitter 1202 , a processor 1203 and a memory 1204 (the number of processors 1203 in the audio encoding device 1200 can be one or more, one processor is taken as an example in FIG. 12 ).
  • the receiver 1201 , the transmitter 1202 , the processor 1203 and the memory 1204 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 12 .
  • the memory 1204 may include read-only memory and random-access memory, and provides instructions and data to the processor 1203 .
  • a part of the memory 1204 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM).
  • the memory 1204 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
  • the processor 1203 controls the operation of the audio encoding device, and the processor 1203 may also be called a central processing unit (central processing unit, CPU).
  • various components of the audio encoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1203 or implemented by the processor 1203 .
  • the processor 1203 may be an integrated circuit chip, which has a signal processing capability.
  • each step of the above-mentioned method may be implemented by an integrated logic circuit of hardware in the processor 1203 or instructions in the form of software.
  • the above-mentioned processor 1203 may be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1204, and the processor 1203 reads the information in the memory 1204, and completes the steps of the above method in combination with its hardware.
  • the receiver 1201 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the audio encoding device.
  • the transmitter 1202 may include a display device such as a display screen, and the transmitter 1202 may be used to output numeric or character information through an external interface.
  • the processor 1203 is configured to execute the methods performed by the audio encoding device shown in FIG. 3 , FIG. 6 , and FIG. 8 in the foregoing embodiments.
  • the audio decoding device 1300 includes:
  • a receiver 1301 , a transmitter 1302 , a processor 1303 and a memory 1304 (the number of processors 1303 in the audio decoding device 1300 can be one or more, one processor is taken as an example in FIG. 13 ).
  • the receiver 1301 , the transmitter 1302 , the processor 1303 and the memory 1304 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 13 .
  • the memory 1304 may include read-only memory and random-access memory, and provides instructions and data to the processor 1303 . A portion of memory 1304 may also include NVRAM.
  • the memory 1304 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
  • the processor 1303 controls the operation of the audio decoding device, and the processor 1303 may also be called a CPU.
  • various components of the audio decoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1303 or implemented by the processor 1303 .
  • the processor 1303 may be an integrated circuit chip and has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1303 or instructions in the form of software.
  • the aforementioned processor 1303 may be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304, and completes the steps of the above method in combination with its hardware.
  • the processor 1303 is configured to execute the methods performed by the audio decoding device shown in FIG. 4 , FIG. 7 , and FIG. 9 in the foregoing embodiments.
  • the chip when the audio encoding device or the audio decoding device is a chip in the terminal, the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example Input/output interface, pin or circuit, etc.
  • the processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the terminal executes the audio encoding method of any one of the above-mentioned first aspect, or the audio decoding method of any one of the second aspect.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (read-only memory, ROM) or another type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM), etc.
  • the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling program execution of the method of the first aspect or the second aspect.
  • the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and a component shown as a unit may or may not be a physical unit, that is, it may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of this application.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when software is used for implementation, the implementation may be entirely or partially in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.


Abstract

Embodiments of this application disclose an audio signal encoding/decoding method and apparatus for improving coding quality and the reconstruction quality of audio signals. An embodiment of this application provides an audio signal encoding method, including: obtaining M transient flags of M blocks of a current frame of a to-be-encoded audio signal according to spectra of the M blocks, where the M blocks include a first block, and the transient flag of the first block indicates that the first block is a transient block or a non-transient block; obtaining grouping information of the M blocks according to the M transient flags; performing grouping and arranging on the spectra of the M blocks according to the grouping information, to obtain a to-be-encoded spectrum of the current frame; encoding the to-be-encoded spectrum by using an encoding neural network, to obtain a spectrum encoding result; and writing the spectrum encoding result into a bitstream.

Description

Audio signal encoding and decoding method and apparatus

This application claims priority to Chinese Patent Application No. 202110865328.X, entitled "Audio signal encoding and decoding method and apparatus" and filed with the China National Intellectual Property Administration on July 29, 2021, which is incorporated herein by reference in its entirety.
Technical Field

This application relates to the field of audio processing technologies, and in particular, to an audio signal encoding and decoding method and apparatus.

Background

Compression of audio data is an indispensable part of media applications such as media communication and media broadcasting. With the development of the high-definition audio and three-dimensional audio industries, the demand for audio quality keeps rising, accompanied by rapid growth in the amount of audio data in media applications.

Current audio data compression techniques are based on the basic principles of signal processing: they exploit the correlation of the signal in time and space to compress the original audio signal, reducing the data amount and thereby facilitating the transmission or storage of the audio data.

In current audio coding schemes, when the audio signal is a transient signal, coding quality is low; when the signal is reconstructed at the decoder side, the reconstruction quality of the audio signal is also poor.
Summary

Embodiments of this application provide an audio signal encoding/decoding method and apparatus, to improve coding quality and the reconstruction quality of audio signals.

To solve the foregoing technical problem, the embodiments of this application provide the following technical solutions:

According to a first aspect, an embodiment of this application provides an audio signal encoding method, including: obtaining M transient flags of M blocks of a current frame of a to-be-encoded audio signal according to spectra of the M blocks, where the M blocks include a first block, and the transient flag of the first block indicates that the first block is a transient block or a non-transient block; obtaining grouping information of the M blocks according to the M transient flags; performing grouping and arranging on the spectra of the M blocks according to the grouping information, to obtain a to-be-encoded spectrum of the current frame; encoding the to-be-encoded spectrum by using an encoding neural network, to obtain a spectrum encoding result; and writing the spectrum encoding result into a bitstream.

In the foregoing solution, after the M transient flags of the M blocks are obtained from the spectra of the M blocks of the current frame and the grouping information is derived from the M transient flags, the grouping information is used to group and arrange the spectra of the M blocks, which adjusts the order of the block spectra within the current frame. After the to-be-encoded spectrum of the current frame is obtained, it is encoded by the encoding neural network and the resulting spectrum encoding result is carried in the bitstream. The spectra of the M blocks can therefore be grouped and arranged according to the M transient flags of the current frame, so that blocks with different transient flags can be grouped, arranged, and encoded accordingly, improving the coding quality of the audio signal.
In a possible implementation, the method further includes: encoding the grouping information of the M blocks to obtain a grouping information encoding result, and writing the grouping information encoding result into the bitstream. The encoding scheme used for the grouping information is not limited here; through the encoding, the bitstream can carry the grouping information encoding result.

In a possible implementation, the grouping information of the M blocks includes the number of groups of the M blocks or a group-number flag indicating that number and, when the number of groups is greater than 1, further includes the M transient flags of the M blocks; alternatively, the grouping information of the M blocks includes the M transient flags. The grouping information thus indicates how the M blocks are grouped, so that the encoder can use it to group and arrange the spectra of the M blocks.

In a possible implementation, the grouping and arranging includes: assigning the spectra of the blocks indicated as transient by the M transient flags to a transient group and the spectra of the blocks indicated as non-transient to a non-transient group; and arranging the spectra of the blocks in the transient group before the spectra of the blocks in the non-transient group, to obtain the to-be-encoded spectrum of the current frame. In the to-be-encoded spectrum, all transient-block spectra precede the non-transient-block spectra, which moves the transient-block spectra to positions of higher coding importance, so that the audio signal reconstructed after neural-network encoding and decoding better preserves the transient characteristics.

In a possible implementation, the grouping and arranging includes: arranging the spectra of the blocks indicated as transient by the M transient flags before the spectra of the blocks indicated as non-transient, to obtain the to-be-encoded spectrum of the current frame. The encoder first finds P transient blocks and Q non-transient blocks among the M blocks, with M = P + Q, and places the P transient-block spectra first, again moving the transient spectra to positions of higher coding importance so that the reconstructed signal better preserves the transient characteristics.

In a possible implementation, before the encoding, the method further includes: performing intra-group interleaving on the to-be-encoded spectrum to obtain intra-group interleaved spectra of the M blocks; the encoding then includes encoding the intra-group interleaved spectra of the M blocks by using the encoding neural network. The intra-group interleaved spectra may be the input data of the encoding neural network; intra-group interleaving also reduces the side information to be encoded and improves coding efficiency.

In a possible implementation, P of the M blocks are indicated as transient by the M transient flags and Q as non-transient, M = P + Q; the intra-group interleaving includes: interleaving the spectra of the P blocks as one whole, to obtain the interleaved spectrum of the P blocks; and interleaving the spectra of the Q blocks as one whole, to obtain the interleaved spectrum of the Q blocks. The encoding neural network then encodes the interleaved spectrum of the P blocks and the interleaved spectrum of the Q blocks.

In a possible implementation, before the M transient flags are obtained, the method further includes: obtaining a window type of the current frame, the window type being a short-window type or a non-short-window type; the step of obtaining the M transient flags is performed only when the window type is the short-window type, so that the foregoing encoding scheme is applied when the audio signal is a transient signal.

In a possible implementation, the method further includes: encoding the window type to obtain a window type encoding result, and writing the window type encoding result into the bitstream. The encoding scheme used for the window type is not limited here.

In a possible implementation, obtaining the M transient flags includes: obtaining M spectral energies of the M blocks from their spectra; obtaining an average spectral energy of the M blocks from the M spectral energies; and obtaining the M transient flags from the M spectral energies and the average spectral energy. The average may be taken over all M energies, or over the energies remaining after the largest value (or several largest values) is removed. Comparing each block's spectral energy with the average determines how much that block's spectrum changes relative to the other blocks, yielding the M transient flags; a block's transient flag can represent that block's transient characteristic and determines its grouping information.

In a possible implementation, when the spectral energy of the first block is greater than K times the average spectral energy, the transient flag of the first block indicates that the first block is a transient block, meaning its spectrum changes strongly relative to the other blocks; when the spectral energy of the first block is less than or equal to K times the average, the transient flag indicates a non-transient block; K is a real number greater than or equal to 1.
According to a second aspect, an embodiment of this application further provides an audio signal decoding method, including: obtaining, from a bitstream, grouping information of M blocks of a current frame of an audio signal, the grouping information indicating M transient flags of the M blocks; decoding the bitstream by using a decoding neural network, to obtain decoded spectra of the M blocks; performing inverse grouping and arranging processing on the decoded spectra of the M blocks according to the grouping information, to obtain inversely grouped and arranged spectra of the M blocks; and obtaining a reconstructed audio signal of the current frame according to the inversely grouped and arranged spectra of the M blocks.

In the foregoing solution, because the spectrum encoding result carried in the bitstream was grouped and arranged at the encoder, decoding the bitstream yields the decoded spectra of the M blocks; the inverse grouping and arranging then restores their original order, and the current frame is reconstructed from the resulting spectra. During signal reconstruction, blocks with different transient flags can be inversely arranged and decoded accordingly, which improves the reconstruction quality of the audio signal.

In a possible implementation, before the inverse grouping and arranging, the method further includes: performing intra-group de-interleaving on the decoded spectra of the M blocks, to obtain intra-group de-interleaved spectra of the M blocks; the inverse grouping and arranging is then performed on the intra-group de-interleaved spectra according to the grouping information.

In a possible implementation, P of the M blocks are indicated as transient by the M transient flags and Q as non-transient, M = P + Q; the intra-group de-interleaving includes: de-interleaving the decoded spectra of the P blocks; and de-interleaving the decoded spectra of the Q blocks.

In a possible implementation, P of the M blocks are indicated as transient by the M transient flags and Q as non-transient, M = P + Q; the inverse grouping and arranging includes: obtaining the indices of the P blocks from the grouping information; obtaining the indices of the Q blocks from the grouping information; and performing the inverse grouping and arranging on the decoded spectra of the M blocks according to the indices of the P blocks and the indices of the Q blocks.

In a possible implementation, the method further includes: obtaining the window type of the current frame from the bitstream, the window type being a short-window type or a non-short-window type; the step of obtaining the grouping information from the bitstream is performed only when the window type of the current frame is the short-window type.

In a possible implementation, the grouping information of the M blocks includes the number of groups of the M blocks or a group-number flag indicating that number and, when the number of groups is greater than 1, further includes the M transient flags; alternatively, the grouping information includes the M transient flags.
According to a third aspect, an embodiment of this application further provides an audio signal encoding apparatus, including:

a transient flag obtaining module, configured to obtain M transient flags of M blocks of a current frame of a to-be-encoded audio signal according to spectra of the M blocks, where the M blocks include a first block whose transient flag indicates that the first block is a transient block or a non-transient block;

a grouping information obtaining module, configured to obtain grouping information of the M blocks according to the M transient flags;

a grouping and arranging module, configured to group and arrange the spectra of the M blocks according to the grouping information, to obtain a to-be-encoded spectrum;

an encoding module, configured to encode the to-be-encoded spectrum by using an encoding neural network to obtain a spectrum encoding result, and write the spectrum encoding result into a bitstream.
In the third aspect of this application, the constituent modules of the audio signal encoding apparatus may further perform the steps described in the first aspect and its various possible implementations; see the foregoing description of the first aspect for details.

According to a fourth aspect, an embodiment of this application further provides an audio signal decoding apparatus, including:

a grouping information obtaining module, configured to obtain, from a bitstream, grouping information of M blocks of a current frame of an audio signal, the grouping information indicating M transient flags of the M blocks;

a decoding module, configured to decode the bitstream by using a decoding neural network, to obtain decoded spectra of the M blocks;

an inverse grouping and arranging module, configured to perform inverse grouping and arranging processing on the decoded spectra of the M blocks according to the grouping information, to obtain inversely grouped and arranged spectra of the M blocks;

an audio signal obtaining module, configured to obtain a reconstructed audio signal according to the inversely grouped and arranged spectra of the M blocks.

In the fourth aspect of this application, the constituent modules of the audio signal decoding apparatus may further perform the steps described in the second aspect and its various possible implementations; see the foregoing description of the second aspect for details.

According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method of the first or second aspect.

According to a sixth aspect, an embodiment of this application provides a computer program product including instructions that, when run on a computer, cause the computer to perform the method of the first or second aspect.

According to a seventh aspect, an embodiment of this application provides a computer-readable storage medium including a bitstream generated by the method of the first aspect.

According to an eighth aspect, an embodiment of this application provides a communication apparatus, which may include an entity such as a terminal device or a chip; the communication apparatus includes a processor and a memory, where the memory is configured to store instructions and the processor is configured to execute the instructions in the memory, so that the communication apparatus performs the method of any one of the first or second aspect.

According to a ninth aspect, this application provides a chip system including a processor, configured to support an audio encoder or an audio decoder in implementing the functions involved in the foregoing aspects, for example, sending or processing the data and/or information involved in the foregoing methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the audio encoder or the audio decoder. The chip system may consist of a chip, or may include a chip and other discrete devices.

It can be seen from the foregoing technical solutions that the embodiments of this application have the following advantages:

In one embodiment of this application, the spectra of the M blocks of the current frame are grouped and arranged according to the M transient flags derived from those spectra, adjusting the order of the block spectra within the current frame; the resulting to-be-encoded spectrum is encoded by the encoding neural network, and the spectrum encoding result is carried in the bitstream. Blocks with different transient flags can therefore be grouped, arranged, and encoded accordingly, improving the coding quality of the audio signal.

In another embodiment of this application, the grouping information of the M blocks of the current frame, which indicates the M transient flags, is obtained from the bitstream; the bitstream is decoded with a decoding neural network to obtain the decoded spectra of the M blocks; the decoded spectra are inversely grouped and arranged according to the grouping information; and the reconstructed audio signal of the current frame is obtained from the resulting spectra. Because the spectrum encoding result in the bitstream was grouped and arranged, the decoder can restore the original order through the inverse arrangement, so that blocks with different transient flags are inversely arranged and decoded accordingly, improving the reconstruction quality of the audio signal.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the composition of an audio processing system according to an embodiment of this application;

FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to a terminal device according to an embodiment of this application;

FIG. 2b is a schematic diagram of an audio encoder applied to a wireless device or a core network device according to an embodiment of this application;

FIG. 2c is a schematic diagram of an audio decoder applied to a wireless device or a core network device according to an embodiment of this application;

FIG. 3 is a schematic diagram of an audio signal encoding method according to an embodiment of this application;

FIG. 4 is a schematic diagram of an audio signal decoding method according to an embodiment of this application;

FIG. 5 is a schematic diagram of an audio signal encoding/decoding system according to an embodiment of this application;

FIG. 6 is a schematic diagram of an audio signal encoding method according to an embodiment of this application;

FIG. 7 is a schematic diagram of an audio signal decoding method according to an embodiment of this application;

FIG. 8 is a schematic diagram of an audio signal encoding method according to an embodiment of this application;

FIG. 9 is a schematic diagram of an audio signal decoding method according to an embodiment of this application;

FIG. 10 is a schematic diagram of the composition of an audio encoding apparatus according to an embodiment of this application;

FIG. 11 is a schematic diagram of the composition of an audio decoding apparatus according to an embodiment of this application;

FIG. 12 is a schematic diagram of the composition of another audio encoding apparatus according to an embodiment of this application;

FIG. 13 is a schematic diagram of the composition of another audio decoding apparatus according to an embodiment of this application.
Description of Embodiments

The following describes the embodiments of this application with reference to the accompanying drawings.

The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same attribute are distinguished in the description of the embodiments. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units not expressly listed or inherent to the process, method, product, or device.
Sound is a continuous wave produced by the vibration of an object. An object that vibrates and emits sound waves is called a sound source. When a sound wave propagates through a medium (such as air, a solid, or a liquid), the auditory organs of humans or animals can perceive the sound.

The characteristics of a sound wave include pitch, intensity, and timbre. Pitch indicates how high or low a sound is. Intensity indicates the loudness of a sound and may also be called loudness or volume; its unit is the decibel (dB). Timbre is also called tone quality.

The frequency of a sound wave determines the pitch: the higher the frequency, the higher the pitch. The number of times an object vibrates within one second is its frequency, measured in hertz (Hz). The human ear can recognize sound frequencies between 20 Hz and 20000 Hz.

The amplitude of a sound wave determines the intensity: the larger the amplitude, the higher the intensity. The closer to the sound source, the higher the intensity.

The waveform of a sound wave determines the timbre. Waveforms include square waves, sawtooth waves, sine waves, pulse waves, and the like.

According to these characteristics, sound can be divided into regular sound and irregular sound. Irregular sound is emitted by a source vibrating irregularly, for example the noise that disturbs work, study, and rest. Regular sound, which includes speech and music, is emitted by a source vibrating regularly. Represented electrically, regular sound is an analog signal varying continuously in the time-frequency domain. This analog signal may be called an audio signal (acoustic signal), an information carrier bearing speech, music, and sound effects.

Because human hearing can discern the spatial distribution of sound sources, a listener hearing sound in space perceives not only its pitch, intensity, and timbre but also its direction.

Sound can also be divided into mono and stereo. Mono has one sound channel, picked up by one microphone and played back by one loudspeaker. Stereo has multiple sound channels, and different sound channels carry different sound waveforms.
When the audio signal is a transient signal, current encoders do not extract the transient characteristics and transmit them in the bitstream; the transient characteristics describe how the spectra of adjacent blocks change within a transient frame of the audio signal. Consequently, when the decoder reconstructs the signal, it cannot recover the transient characteristics of the reconstructed audio signal from the bitstream, and the reconstruction quality of the audio signal is poor.

The embodiments of this application provide an audio processing technique, and in particular an audio coding technique for audio signals, to improve conventional audio coding systems. Audio processing consists of two parts, audio encoding and audio decoding. Audio encoding is performed at the source side and includes encoding (for example, compressing) the original audio to reduce the amount of data needed to represent it, for more efficient storage and/or transmission. Audio decoding is performed at the destination side and includes processing inverse to the encoder to reconstruct the original audio. The encoding part and the decoding part together are also called coding. The following describes the implementation of the embodiments of this application in detail with reference to the drawings.

The technical solutions of the embodiments of this application are applicable to various audio processing systems. FIG. 1 is a schematic diagram of the composition of an audio processing system according to an embodiment of this application. The audio processing system 100 may include an audio encoding apparatus 101 and an audio decoding apparatus 102. The audio encoding apparatus 101, which may also be called an audio signal encoding apparatus, can generate a bitstream that is transmitted to the audio decoding apparatus 102 over an audio transmission channel; the audio decoding apparatus 102, which may also be called an audio signal decoding apparatus, receives the bitstream, performs its audio decoding function, and finally obtains the reconstructed signal.

In the embodiments of this application, the audio encoding apparatus can be applied to various terminal devices requiring audio communication and to wireless devices and core network devices requiring transcoding; for example, the audio encoding apparatus may be the audio encoder of such a terminal device, wireless device, or core network device. Likewise, the audio decoding apparatus may be the audio decoder of such a device. Audio encoders include, for example, media gateways of radio access networks and core networks, transcoding devices, media resource servers, mobile terminals, and fixed-network terminals, and may also be audio encoders used in virtual reality (VR) streaming services.

Taking the audio encoding and decoding modules (audio encoding and audio decoding) of a VR streaming service as an example, the end-to-end coding of the audio signal proceeds as follows: audio signal A passes through an acquisition module and then a pre-processing operation (audio preprocessing), which filters out the low-frequency part of the signal, with 20 Hz or 50 Hz as the cut-off point, and extracts directional information from the signal; the signal is then encoded (audio encoding) and packed (file/segment encapsulation), and delivered to the decoder side; the decoder side first unpacks (file/segment decapsulation), then decodes (audio decoding), performs binaural rendering (audio rendering) on the decoded signal, and maps the rendered signal to the listener's headphones, which may be standalone headphones or headphones on a glasses device.
FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to terminal devices according to an embodiment of this application. Each terminal device may include an audio encoder, a channel encoder, an audio decoder, and a channel decoder; the channel encoder performs channel encoding on the audio signal, and the channel decoder performs channel decoding. For example, the first terminal device 20 may include a first audio encoder 201, a first channel encoder 202, a first audio decoder 203, and a first channel decoder 204; the second terminal device 21 may include a second audio decoder 211, a second channel decoder 212, a second audio encoder 213, and a second channel encoder 214. The first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 via a digital channel, and the second terminal device 21 is connected to the second network communication device 23. The wireless or wired network communication devices generally refer to signal transmission devices, such as communication base stations and data switching devices.

In audio communication, the terminal device acting as the transmitting end first captures audio, applies audio encoding and then channel encoding to the captured audio signal, and transmits it over the digital channel via a wireless network or the core network. The terminal device acting as the receiving end performs channel decoding on the received signal to obtain the bitstream, recovers the audio signal through audio decoding, and plays it back.

FIG. 2b is a schematic diagram of the audio encoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless or core network device 25 includes a channel decoder 251, another audio decoder 252 (an audio decoder other than the audio decoder of this application), the audio encoder 253 provided by an embodiment of this application, and a channel encoder 254. Within the device 25, the incoming signal is first channel-decoded by the channel decoder 251, then audio-decoded by the other audio decoder 252 (which decodes the bitstream output by the channel decoder 251), then audio-encoded by the audio encoder 253 provided by this application, and finally channel-encoded by the channel encoder 254 before being transmitted onward.

FIG. 2c is a schematic diagram of the audio decoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless or core network device 25 includes a channel decoder 251, the audio decoder 255 provided by an embodiment of this application, another audio encoder 256 (an audio encoder other than the audio encoder of this application), and a channel encoder 254. Within the device 25, the incoming signal is first channel-decoded by the channel decoder 251, the received audio bitstream is decoded by the audio decoder 255, the signal is then audio-encoded by the other audio encoder 256, and finally channel-encoded by the channel encoder 254 before being transmitted onward. In a wireless device or core network device, if transcoding is required, the corresponding audio coding processing is needed. A wireless device refers to radio-frequency-related equipment in communication, and a core network device refers to core-network-related equipment in communication.

In some embodiments of this application, the audio encoding apparatus may be the multi-channel encoder, and the audio decoding apparatus the multi-channel decoder, of the foregoing terminal devices, wireless devices, or core network devices.
The following first describes an audio signal encoding method according to an embodiment of this application, which may be performed by a terminal device, for example an audio signal encoding apparatus (hereinafter referred to as the encoder side or the encoder; for example, the encoder side may be an artificial intelligence (AI) encoder). As shown in FIG. 3, the encoding procedure performed by the encoder side is as follows:

301. Obtain M transient flags of M blocks of a current frame of a to-be-encoded audio signal according to spectra of the M blocks; the M blocks include a first block, whose transient flag indicates that the first block is a transient block or a non-transient block.

The encoder first obtains the to-be-encoded audio signal and splits it into frames to obtain the current frame. The following embodiments take the encoding of the current frame as an example; other frames of the to-be-encoded audio signal are encoded similarly.

After determining the current frame, the encoder applies windowing and a time-frequency transform to it. If the current frame includes M blocks, the spectra of the M blocks of the current frame are obtained; M is the number of blocks in the current frame, and its value is not limited in the embodiments of this application. For example, the encoder applies a time-frequency transform to the M blocks to obtain their modified discrete cosine transform (MDCT) spectra; the following embodiments use MDCT spectra as an example, but without limitation, the spectra of the M blocks may also be other spectra.

After obtaining the spectra of the M blocks, the encoder obtains the M transient flags of the M blocks from them. The spectrum of each block is used to determine that block's transient flag; each block corresponds to one transient flag, which indicates how that block's spectrum changes among the M blocks. For example, if one of the M blocks is the first block, the first block corresponds to one transient flag.

In some embodiments of this application, the value of a transient flag can be realized in several ways; for example, the transient flag may indicate that the first block is a transient block, or that the first block is a non-transient block. A transient flag means the block's spectrum changes strongly relative to the spectra of the other blocks; a non-transient flag means it does not. For example, the transient flag may occupy 1 bit: the value 0 may denote transient and the value 1 non-transient, or the value 1 may denote transient and the value 0 non-transient; this is not limited here.
302. Obtain grouping information of the M blocks according to the M transient flags.

After the encoder obtains the M transient flags, they serve as the basis for grouping the M blocks: for example, blocks with the same transient flag may be assigned to one group, and blocks with different transient flags to different groups. The grouping information of the M blocks, obtained from the M transient flags, expresses the resulting grouping of the M blocks.

In some embodiments of this application, the grouping information has several possible forms: it includes the number of groups of the M blocks or a group-number flag indicating that number and, when the number of groups is greater than 1, further includes the M transient flags; alternatively, it includes the M transient flags. The grouping information thus indicates how the M blocks are grouped, so that the encoder can use it to group and arrange the spectra of the M blocks.

For example, the grouping information of the M blocks may include the number of groups and the M transient flags; the M transient flags may also be called grouping flag information, so the grouping information may include the number of groups and the grouping flag information. The number of groups may, for example, take the value 1 or 2, and the grouping flag information indicates the M transient flags.

For example, the grouping information of the M blocks may include only the grouping flag information, which indicates the M transient flags.

For example, when the number of groups equals 1, the grouping information does not include the M transient flags; when the number of groups is greater than 1, it further includes the M transient flags.

As another example, the number of groups in the grouping information may be replaced by a group-number flag: the flag value 0 indicates that the number of groups is 1, and the value 1 indicates that the number of groups is 2.

In some embodiments of this application, the method performed by the encoder further includes:

A1. Encode the grouping information of the M blocks to obtain a grouping information encoding result;

A2. Write the grouping information encoding result into the bitstream.

The encoding scheme used for the grouping information is not limited here; through the encoding, the bitstream can carry the grouping information encoding result. Note that there is no required order between step A2 and the later step 305: step 305 may be performed first and then step A2, or step A2 first and then step 305, or both simultaneously; this is not limited here.
303. Group and arrange the spectra of the M blocks according to the grouping information, to obtain the to-be-encoded spectrum of the current frame.

The to-be-encoded spectrum may also be called the grouped-and-arranged spectra of the M blocks. After obtaining the grouping information, the encoder uses it to group and arrange the spectra of the M blocks of the current frame, thereby adjusting the order of the block spectra within the frame. The arranging is based on the grouping information, which is in turn derived from the M transient flags; the grouped-and-arranged spectra therefore use the M transient flags as the sorting criterion, and the sorting changes the coding order of the block spectra.

In some embodiments of this application, step 303 includes:

B1. Assign the spectra of the blocks indicated as transient by the M transient flags to a transient group, and the spectra of the blocks indicated as non-transient to a non-transient group;

B2. Arrange the spectra of the blocks in the transient group before the spectra of the blocks in the non-transient group, to obtain the to-be-encoded spectrum.

In some embodiments of this application, step 303 includes:

C1. Arrange the spectra of the blocks indicated as transient by the M transient flags before the spectra of the blocks indicated as non-transient, to obtain the to-be-encoded spectrum of the current frame.

In either case, the encoder first finds the P transient blocks and the Q non-transient blocks among the M blocks, with M = P + Q. In the to-be-encoded spectrum, all transient-block spectra precede the non-transient-block spectra, which moves the transient-block spectra to positions of higher coding importance, so that the audio signal reconstructed after neural-network encoding and decoding better preserves the transient characteristics.
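The arrangement in steps B1/B2 and C1 can be sketched as follows. This is a minimal illustration, not the application's implementation; the function and variable names are ours, and `True` is used here to mark a transient block (the 0/1 bitstream encoding of the flag is a separate matter).

```python
def group_arrange(block_spectra, transient_flags):
    """Move the P transient-block spectra ahead of the Q non-transient ones,
    preserving the relative order inside each group. Also returns the block
    order, which the decoder-side inverse arrangement needs."""
    transient = [i for i, f in enumerate(transient_flags) if f]
    non_transient = [i for i, f in enumerate(transient_flags) if not f]
    order = transient + non_transient  # P transient indices, then Q non-transient
    return [block_spectra[i] for i in order], order
```

For example, with flags [False, True, False, True], blocks 1 and 3 are moved to the front and the block order becomes [1, 3, 0, 2].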
304. Encode the to-be-encoded spectrum by using an encoding neural network, to obtain a spectrum encoding result.

305. Write the spectrum encoding result into the bitstream.

In this embodiment, after obtaining the to-be-encoded spectrum of the current frame, the encoder encodes it with the encoding neural network to generate the spectrum encoding result, writes the result into the bitstream, and may send the bitstream to the decoder.

In one implementation, the encoder uses the to-be-encoded spectrum directly as the input data of the encoding neural network, or applies further processing to the to-be-encoded spectrum first. The encoding neural network produces latent variables, which represent the features of the grouped-and-arranged spectra of the M blocks.

In some embodiments of this application, before step 304, the method performed by the encoder further includes:

D1. Perform intra-group interleaving on the to-be-encoded spectrum, to obtain intra-group interleaved spectra of the M blocks.

In this implementation scenario, step 304 includes:

E1. Encode the intra-group interleaved spectra of the M blocks by using the encoding neural network.

That is, after obtaining the to-be-encoded spectrum of the current frame, the encoder may first interleave within each group according to the grouping of the M blocks; the intra-group interleaved spectra of the M blocks may then be the input data of the encoding neural network. Intra-group interleaving also reduces the side information to be encoded and improves coding efficiency.

In some embodiments of this application, P of the M blocks are indicated as transient by the M transient flags and Q as non-transient, with M = P + Q; the values of P and Q are not limited here. Specifically, step D1 includes:

D11. Interleave the spectra of the P blocks, to obtain the interleaved spectrum of the P blocks;

D12. Interleave the spectra of the Q blocks, to obtain the interleaved spectrum of the Q blocks.

Interleaving the spectra of the P blocks means interleaving them as one whole; likewise, interleaving the spectra of the Q blocks means interleaving them as one whole.

When steps D11 and D12 are performed, step E1 encodes the interleaved spectrum of the P blocks and the interleaved spectrum of the Q blocks by using the encoding neural network. In D11 and D12, the encoder interleaves the transient group and the non-transient group separately, obtaining the interleaved spectrum of the P blocks and the interleaved spectrum of the Q blocks, which may serve as the input data of the encoding neural network; intra-group interleaving also reduces the side information to be encoded and improves coding efficiency.
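Steps D11/D12 can be sketched as follows (a hedged illustration with our own names): the first P arranged blocks are interleaved as one whole, the remaining Q blocks as another, and the two results are concatenated as the network input.

```python
def intra_group_interleave(arranged_blocks, p):
    """Interleave the first P (transient) blocks as one whole and the
    remaining Q (non-transient) blocks as another, then concatenate.
    Interleaving emits coefficient i of every block in the group before
    moving on to coefficient i + 1."""
    def ilv(blocks):
        if not blocks:
            return []
        m, n = len(blocks), len(blocks[0])
        return [blocks[b][i] for i in range(n) for b in range(m)]
    return ilv(arranged_blocks[:p]) + ilv(arranged_blocks[p:])
```

For example, with three arranged blocks [1, 2], [3, 4], [5, 6] and P = 2, the first two blocks interleave to [1, 3, 2, 4] and the third group yields [5, 6].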
In some embodiments of this application, before step 301, the method performed by the encoder further includes:

F1. Obtain the window type of the current frame, the window type being a short-window type or a non-short-window type;

F2. Perform the step of obtaining the M transient flags only when the window type is the short-window type.

Before performing step 301, the encoder may first determine the window type of the current frame, for example from the current frame of the to-be-encoded audio signal. A short window may also be called a short frame, and a non-short window a non-short frame. When the window type is the short-window type, step 301 is triggered. In the embodiments of this application, the foregoing encoding scheme is performed only when the window type of the current frame is the short-window type, realizing encoding when the audio signal is a transient signal.

In some embodiments of this application, when steps F1 and F2 are performed, the method performed by the encoder further includes:

G1. Encode the window type to obtain a window type encoding result;

G2. Write the window type encoding result into the bitstream.

The encoding scheme used for the window type is not limited here; through the encoding, the bitstream can carry the window type encoding result.
In some embodiments of this application, step 301 includes:

H1. Obtain M spectral energies of the M blocks from their spectra;

H2. Obtain the average spectral energy of the M blocks from the M spectral energies;

H3. Obtain the M transient flags from the M spectral energies and the average spectral energy.

After obtaining the M spectral energies, the encoder may average all of them, or remove the largest value (or several largest values) before averaging, to obtain the average spectral energy. Comparing each block's spectral energy with the average determines how much that block's spectrum changes relative to the spectra of the other blocks, yielding the M transient flags; a block's transient flag can represent that block's transient characteristic and determines its grouping information.

Further, in some embodiments of this application: when the spectral energy of the first block is greater than K times the average spectral energy, the transient flag of the first block indicates that the first block is a transient block, meaning its spectrum changes strongly relative to the other blocks; when the spectral energy of the first block is less than or equal to K times the average, the transient flag indicates a non-transient block, meaning its spectrum does not change much relative to the other blocks. K is a real number greater than or equal to 1; its value may be chosen in many ways and is not limited here.

Without limitation, the encoder may also obtain the M transient flags in other ways, for example by obtaining the difference or the ratio between the first block's spectral energy and the average spectral energy and determining the M transient flags from that difference or ratio.
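Steps H1-H3 with the K-times threshold can be sketched as follows. This is a minimal illustration with our own names; the sum-of-squares energy and the plain average over all M blocks are simplifying choices the text permits (the largest blocks could also be excluded from the average).

```python
def transient_flags(block_spectra, k=2.0):
    """Flag each block whose spectral energy exceeds K times the average
    energy over all M blocks (K >= 1) as transient."""
    energies = [sum(c * c for c in spec) for spec in block_spectra]
    avg = sum(energies) / len(energies)
    return [e > k * avg for e in energies]
```

For example, four blocks with energies [1, 1, 16, 1] give an average of 4.75; with K = 2 only the third block exceeds 9.5 and is flagged transient.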
As the foregoing encoder-side examples show, the M transient flags are obtained from the spectra of the M blocks of the current frame, the grouping information is derived from them, and the spectra of the M blocks are grouped and arranged accordingly before being encoded by the encoding neural network; the spectrum encoding result is carried in the bitstream. Blocks with different transient flags can therefore be grouped, arranged, and encoded accordingly, improving the coding quality of the audio signal.
An embodiment of this application further provides an audio signal decoding method, which may be performed by a terminal device, for example an audio signal decoding apparatus (hereinafter referred to as the decoder side or the decoder; for example, the decoder side may be an AI decoder). As shown in FIG. 4, the method performed by the decoder side mainly includes:

401. Obtain, from a bitstream, grouping information of M blocks of a current frame of an audio signal, the grouping information indicating M transient flags of the M blocks.

The decoder receives the bitstream sent by the encoder, into which the encoder has written the grouping information encoding result; parsing the bitstream yields the grouping information of the M blocks of the current frame, from which the decoder can determine the M transient flags. For example, the grouping information may include the number of groups and the grouping flag information, or only the grouping flag information; see the foregoing encoder-side embodiments for details.

402. Decode the bitstream by using a decoding neural network, to obtain decoded spectra of the M blocks.

After obtaining the bitstream, the decoder decodes it with the decoding neural network to obtain the decoded spectra of the M blocks. Because the encoder grouped and arranged the spectra of the M blocks before encoding and carried the spectrum encoding result in the bitstream, the decoded spectra correspond to the encoder's grouped-and-arranged spectra. The decoding neural network performs the inverse of the encoder's encoding neural network; through decoding, the reconstructed grouped-and-arranged spectra of the M blocks are obtained.

403. Perform inverse grouping and arranging processing on the decoded spectra of the M blocks according to the grouping information, to obtain the inversely grouped and arranged spectra of the M blocks.

The decoder has obtained the grouping information and, from the bitstream, the decoded spectra of the M blocks. Since the encoder grouped and arranged the spectra, the decoder must perform the inverse procedure: according to the grouping information, it applies inverse grouping and arranging processing, which is the inverse of the encoder's grouping and arranging, to obtain the inversely grouped and arranged spectra of the M blocks.

404. Obtain the reconstructed audio signal of the current frame according to the inversely grouped and arranged spectra of the M blocks.

After obtaining the inversely grouped and arranged spectra, the decoder applies a frequency-domain-to-time-domain transform to them, thereby obtaining the reconstructed audio signal of the current frame.
In some embodiments of this application, before step 403, the method performed by the decoder further includes:

I1. Perform intra-group de-interleaving on the decoded spectra of the M blocks, to obtain intra-group de-interleaved spectra of the M blocks;

step 403 then includes:

J1. Perform the inverse grouping and arranging on the intra-group de-interleaved spectra of the M blocks according to the grouping information.

The intra-group de-interleaving performed by the decoder is the inverse of the encoder's intra-group interleaving and is not described in detail again here.

In some embodiments of this application, P of the M blocks are indicated as transient by the M transient flags and Q as non-transient, with M = P + Q; step I1 includes:

I11. De-interleave the decoded spectra of the P blocks; and

I12. De-interleave the decoded spectra of the Q blocks.

De-interleaving the spectra of the P blocks means de-interleaving them as one whole; likewise, de-interleaving the spectra of the Q blocks means de-interleaving them as one whole. The encoder interleaved the transient group and the non-transient group separately, obtaining the interleaved spectrum of the P blocks and the interleaved spectrum of the Q blocks as the input data of the encoding neural network; intra-group interleaving reduces the side information to be encoded and improves coding efficiency. Because the encoder performed intra-group interleaving, the decoder must perform the corresponding inverse, that is, the de-interleaving processing.
In some embodiments of this application, among the reconstructed grouped-and-arranged M blocks, P blocks are indicated as transient by the M transient flags and Q as non-transient, with M = P + Q; step 403 includes:

K1. Obtain the indices of the P blocks from the grouping information;

K2. Obtain the indices of the Q blocks from the grouping information;

K3. Perform the inverse grouping and arranging on the decoded spectra of the M blocks according to the indices of the P blocks and the indices of the Q blocks.

Before the encoder grouped and arranged the spectra of the M blocks, the block indices were consecutive, for example from 0 to M-1; after the arrangement they are no longer consecutive. From the grouping information, the decoder obtains the indices of the P blocks and of the Q blocks within the reconstructed grouped-and-arranged M blocks, and through the inverse grouping and arranging restores the consecutive block indices.
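Steps K1-K3 can be sketched as follows (a minimal illustration, names ours): the indices of the P transient and Q non-transient blocks are recovered from the transient flags (here `True` marks transient), and each decoded spectrum is written back to its original consecutive position.

```python
def inverse_group_arrange(decoded_spectra, transient_flags):
    """Undo the encoder-side arrangement: position j of the decoded spectra
    holds the block whose original index is order[j], where order lists the
    P transient indices followed by the Q non-transient indices."""
    transient = [i for i, f in enumerate(transient_flags) if f]
    non_transient = [i for i, f in enumerate(transient_flags) if not f]
    order = transient + non_transient
    restored = [None] * len(decoded_spectra)
    for pos, original_index in enumerate(order):
        restored[original_index] = decoded_spectra[pos]
    return restored
```

For example, decoded spectra arranged as blocks [1, 3, 0, 2] with flags [False, True, False, True] are restored to block order [0, 1, 2, 3].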
In some embodiments of this application, the method performed by the decoder further includes:

L1. Obtain the window type of the current frame from the bitstream, the window type being a short-window type or a non-short-window type;

L2. Perform the step of obtaining the grouping information of the M blocks from the bitstream only when the window type of the current frame is the short-window type.

In the embodiments of this application, the foregoing coding scheme is performed only when the window type of the current frame is the short-window type, realizing coding when the audio signal is a transient signal. The decoder performs the inverse of the encoder's procedure, so it likewise first determines the window type of the current frame, for example by obtaining it from the bitstream. A short window may also be called a short frame, and a non-short window a non-short frame. When the window type is the short-window type, step 401 is triggered.

In some embodiments of this application, the grouping information of the M blocks includes the number of groups of the M blocks or a group-number flag indicating that number and, when the number of groups is greater than 1, further includes the M transient flags; alternatively, the grouping information includes the M transient flags.

As the foregoing decoder-side examples show, the grouping information indicating the M transient flags is obtained from the bitstream, the bitstream is decoded with the decoding neural network, the decoded spectra of the M blocks are inversely grouped and arranged according to the grouping information, and the reconstructed audio signal of the current frame is obtained from the resulting spectra. During signal reconstruction, blocks with different transient flags can be inversely arranged and decoded accordingly, which improves the reconstruction quality of the audio signal.
To facilitate better understanding and implementation of the foregoing solutions of the embodiments of this application, corresponding application scenarios are described below as examples.

FIG. 5 is a schematic diagram of a system architecture applied in the broadcast and television field according to an embodiment of this application. The embodiments of this application may also be applied to live and post-production broadcast scenarios, or to a three-dimensional audio codec in terminal media playback.

In a live scenario, the three-dimensional audio signal produced for a live program is encoded with the three-dimensional audio coding of the embodiments of this application to obtain a bitstream, which is transmitted over the broadcast network to the user side; the three-dimensional audio decoder in a set-top box decodes and reconstructs the three-dimensional audio signal, which is played back by a loudspeaker set. In a post-production scenario, the three-dimensional audio signal produced for a post-production program is encoded with the three-dimensional audio coding of the embodiments of this application to obtain a bitstream, which is transmitted over the broadcast network or the Internet to the user side, decoded and reconstructed by the three-dimensional audio decoder in a network receiver or mobile terminal, and played back by a loudspeaker set or headphones.

The embodiments of this application provide audio codecs, which may specifically include media gateways of radio access networks and core networks, transcoding devices, media resource servers, mobile terminals, fixed-network terminals, and the like, and may also be applied to audio codecs in broadcast and television, terminal media playback, and VR streaming services.

The following separately describes the application scenarios of the encoder side and the decoder side in the embodiments of this application.
As shown in FIG. 6, an encoder to which an embodiment of this application is applied performs the following audio signal encoding method:

S11. Determine the window type of the current frame.

Obtain the audio signal of the current frame, determine the window type of the current frame from it, and write the window type into the bitstream. One specific implementation includes the following three steps:

1) Split the to-be-encoded audio signal into frames to obtain the audio signal of the current frame. For example, if the frame length of the current frame is L samples, the audio signal of the current frame is an L-sample time-domain signal.

2) Perform transient detection on the audio signal of the current frame to determine the transient information of the current frame. There are many transient detection methods, which are not limited in the embodiments of this application. The transient information of the current frame may include one or more of: a flag indicating whether the current frame is a transient signal, the position at which the transient occurs in the current frame, and a parameter characterizing the transient degree. The transient degree may be the transient energy level, or the ratio of the signal energy at the transient position to the signal energy at an adjacent non-transient position.

3) Determine the window type of the current frame from its transient information, encode the window type, and write the encoding result into the bitstream. If the transient information indicates that the current frame is a transient signal, the window type of the current frame is the short window. If the transient information indicates that the current frame is a non-transient signal, the window type is one of the other window types, excluding the short window; the other window types are not limited in the embodiments of this application and may include, for example, a long window, a transition-in window, and a transition-out window.

S12. If the window type of the current frame is the short window, apply short-window windowing and a time-frequency transform to the audio signal of the current frame, to obtain the MDCT spectra of the M blocks of the current frame.

For example, if the window type of the current frame is the short window, M overlapping short-window functions are used for windowing, yielding the windowed audio signals of M blocks, where M is a positive integer greater than or equal to 2. For example, the window length of the short-window function is 2L/M, where L is the frame length of the current frame, and the overlap length is L/M. For example, with M equal to 8 and L equal to 1024, the window length of the short-window function is 256 samples and the overlap length is 128 samples.

A time-frequency transform is applied to each of the M windowed blocks to obtain the MDCT spectra of the M blocks of the current frame. For example, the windowed audio signal of the current block has a length of 256 samples; after the MDCT, 128 MDCT coefficients are obtained, which form the MDCT spectrum of the current block.
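The per-block transform can be illustrated with a direct MDCT below; this is the O(N²) textbook definition, not the application's implementation (a real codec would apply the window function and use a fast, FFT-based MDCT), and the names are ours.

```python
import math

def mdct(windowed_block):
    """Direct MDCT: 2N windowed time samples -> N spectral coefficients,
    e.g. a 256-sample windowed block yields 128 MDCT coefficients."""
    two_n = len(windowed_block)
    n = two_n // 2
    return [
        sum(windowed_block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (j + 0.5))
            for t in range(two_n))
        for j in range(n)
    ]
```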
S13. Obtain the number of groups and the grouping flag information of the current frame from the MDCT spectra of the M blocks, encode the number of groups and the grouping flag information, and write the encoding result into the bitstream.

In one implementation, before the number of groups and the grouping flag information are obtained in step S13: first, the MDCT spectra of the M blocks are interleaved to obtain the interleaved MDCT spectra of the M blocks; next, encoding pre-processing is applied to the interleaved spectra to obtain the pre-processed MDCT spectrum; then the pre-processed MDCT spectrum is de-interleaved to obtain the de-interleaved MDCT spectra of the M blocks; finally, the number of groups and the grouping flag information of the current frame are determined from the de-interleaved MDCT spectra.

Interleaving the MDCT spectra of the M blocks turns M spectra of length L/M into one MDCT spectrum of length L: the M spectral coefficients at frequency position i of the M blocks are placed together in block order 0 to M-1, then the M coefficients at frequency position i+1 are placed together in block order 0 to M-1, with i running from 0 to L/M-1.

The encoding pre-processing may include frequency domain noise shaping (FDNS), temporal noise shaping (TNS), bandwidth extension (BWE), and the like, which are not limited here.

De-interleaving is the inverse of interleaving. The pre-processed MDCT spectrum has length L; it is split into M MDCT spectra of length L/M, with the coefficients within each block ordered from low to high frequency, yielding the de-interleaved MDCT spectra of the M blocks. Pre-processing the interleaved spectrum reduces the side information to be encoded, thereby reducing the bits the side information occupies and improving coding efficiency.
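The interleave/de-interleave pair described above can be sketched as follows (names ours; list-based for clarity).

```python
def interleave_blocks(blocks):
    """M blocks of L/M coefficients -> one length-L spectrum: the M
    coefficients at position i, in block order 0..M-1, then position i+1."""
    m, n = len(blocks), len(blocks[0])
    return [blocks[b][i] for i in range(n) for b in range(m)]

def deinterleave_blocks(spectrum, m):
    """Inverse of interleave_blocks: split a length-L spectrum back into
    M blocks whose coefficients run from low to high frequency."""
    n = len(spectrum) // m
    return [[spectrum[i * m + b] for i in range(n)] for b in range(m)]
```

For example, two blocks [1, 2, 3] and [4, 5, 6] interleave to [1, 4, 2, 5, 3, 6], and de-interleaving that spectrum restores the original blocks.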
The grouping count and grouping flag information of the current frame are determined from the de-interleaved MDCT spectra of the M blocks in the following three steps:
a) Compute the MDCT spectral energy of each of the M blocks.
Suppose the de-interleaved MDCT coefficients of the M blocks are mdctSpectrum[8][128]; the MDCT spectral energy of each block is computed and recorded as enerMdct[8], where 8 is the value of M and 128 is the number of MDCT coefficients in a block.
b) Compute the average MDCT spectral energy from the M block energies. There are two main methods:
Method one: directly average the MDCT spectral energies of the M blocks, i.e., the mean of enerMdct[8], as the average MDCT spectral energy avgEner.
Method two: identify the block with the largest MDCT spectral energy among the M blocks, and average the MDCT spectral energies of the other M-1 blocks as the average MDCT spectral energy avgEner. Alternatively, exclude the several highest-energy blocks and average the MDCT spectral energies of the remaining blocks as avgEner.
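Method two above can be sketched as follows; `avg_energy_excluding_max` is an illustrative name, and the `drop` parameter generalizes to excluding several highest-energy blocks as the text allows:

```python
def avg_energy_excluding_max(energies, drop=1):
    # Average the block energies that remain after removing the `drop`
    # highest-energy blocks; drop=1 corresponds to excluding only the
    # single block with the largest MDCT spectral energy.
    kept = sorted(energies)[:len(energies) - drop]
    return sum(kept) / len(kept)
```

Excluding the maximum keeps one strong transient block from inflating avgEner and masking itself in the comparison of step c).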
c) Determine the grouping count and grouping flag information of the current frame by comparing the MDCT spectral energies of the M blocks with the average MDCT spectral energy, and write them into the bitstream.
Specifically, the MDCT spectral energy of each block is compared with the average MDCT spectral energy. If the MDCT spectral energy of the current block is greater than K times the average, the current block is a transient block and its transient flag is 0; otherwise, the current block is a non-transient block and its flag is 1, where K is greater than or equal to 1, for example K = 2. The M blocks are then grouped by their transient flags to determine the grouping count and the grouping flag information: blocks with the same transient flag value form one group, the M blocks are divided into N groups, and N is the grouping count. The grouping flag information consists of the transient flag values of all M blocks.
For example, the transient blocks form a transient group and the non-transient blocks form a non-transient group. Specifically, if the transient flags of the blocks are not all identical, the grouping count numGroups of the current frame is 2; otherwise it is 1. The grouping count may be represented by a grouping-count flag: a flag of 1 indicates that the grouping count of the current frame is 2, and a flag of 0 indicates that it is 1. The grouping flag information groupIndicator of the current frame is determined from the transient flags of the M blocks, for example by concatenating the M transient flags in block order.
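The per-block flag decision of step c) can be sketched as follows; `transient_flags` is an illustrative name, and K = 2 is the example value given above:

```python
def transient_flags(block_energies, K=2.0):
    # A block whose energy exceeds K times the average energy of all
    # blocks is flagged transient (0); otherwise non-transient (1).
    avg = sum(block_energies) / len(block_energies)
    return [0 if e > K * avg else 1 for e in block_energies]
```

The resulting list, read in block order, is the groupIndicator; numGroups is 2 whenever the list contains both flag values.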
Before the grouping count and grouping flag information are obtained in step S13, another implementation is: skip the interleaving and de-interleaving of the MDCT spectra of the M blocks, determine the grouping count and grouping flag information of the current frame directly from the MDCT spectra of the M blocks, encode them, and write the encoding result into the bitstream.
Determining the grouping count and grouping flag information from the MDCT spectra of the M blocks is similar to determining them from the de-interleaved MDCT spectra of the M blocks, and is not repeated here.
Write the grouping count and grouping flag information of the current frame into the bitstream.
In addition, the non-transient group may be further divided into two or more other groups, which the embodiments of this application do not limit. For example, the non-transient group may be divided into a harmonic group and a non-harmonic group.
S14. Rearrange the MDCT spectra of the M blocks by group according to the grouping count and grouping flag information of the current frame, to obtain the group-rearranged MDCT spectrum. The group-rearranged MDCT spectrum is the spectrum to be encoded for the current frame.
If the grouping count of the current frame is 2, the spectra of the M blocks of the current frame must be rearranged by group. The arrangement is: the blocks among the M blocks belonging to the transient group are moved to the front, and the blocks belonging to the non-transient group to the back. The encoding neural network of the encoder codes the leading portion of the spectrum more faithfully, so moving the transient blocks to the front secures their coding quality, preserving more of the transient blocks' spectral detail and improving overall coding quality.
The group rearrangement according to the grouping count and grouping flag information of the current frame may be applied to the MDCT spectra of the M blocks of the current frame, or to the de-interleaved MDCT spectra of the M blocks of the current frame.
S15. Encode the group-rearranged MDCT spectrum with the encoding neural network, and write the result into the bitstream.
The group-rearranged MDCT spectrum first undergoes intra-group interleaving to obtain the intra-group-interleaved MDCT spectrum, which is then encoded by the encoding neural network. Intra-group interleaving is similar to the interleaving applied to the MDCT spectra of the M blocks before the grouping count and grouping flag information were obtained, except that it operates only on the MDCT spectra belonging to the same group: for example, the MDCT spectrum blocks of the transient group are interleaved together, and the MDCT spectrum blocks of the non-transient group are interleaved together.
The encoding neural network is trained in advance; the embodiments of this application do not limit its specific network structure or training method. For example, the encoding neural network may be a fully connected network or a convolutional neural network (CNN).
As shown in FIG. 7, the decoding procedure corresponding to the encoder side includes:
S21. Decode the received bitstream to obtain the window type of the current frame.
S22. If the window type of the current frame is a short window, decode the received bitstream to obtain the grouping count and grouping flag information.
The grouping-count flag in the bitstream may be parsed, and the grouping count of the current frame determined from it: a flag of 1 indicates that the grouping count of the current frame is 2, and a flag of 0 indicates that it is 1.
If the grouping count of the current frame is greater than 1, the grouping flag information may be obtained by decoding the received bitstream.
Obtaining the grouping flag information by decoding the received bitstream may consist of reading M bits of grouping flag information from the bitstream. The value of the i-th bit of the grouping flag information determines whether the i-th block is a transient block: a value of 0 indicates that the i-th block is a transient block, and a value of 1 indicates that it is a non-transient block.
S23. Obtain the decoded MDCT spectrum from the received bitstream using the decoding neural network.
The decoding procedure at the decoder mirrors the encoding procedure at the encoder. The specific steps are:
First, decode the received bitstream with the decoding neural network to obtain the decoded MDCT spectrum.
Then, the decoded MDCT spectra belonging to the same group can be identified from the grouping count and grouping flag information. Intra-group de-interleaving is applied to the MDCT spectra of each group to obtain the intra-group-de-interleaved MDCT spectrum. This intra-group de-interleaving is the same process as the de-interleaving that the encoder applied to the interleaved MDCT spectra of the M blocks before obtaining the grouping count and grouping flag information.
S24. Apply inverse group rearrangement to the intra-group-de-interleaved MDCT spectrum according to the grouping count and grouping flag information, to obtain the inversely rearranged MDCT spectrum.
If the grouping count of the current frame is greater than 1, inverse group rearrangement must be applied to the intra-group-de-interleaved MDCT spectrum according to the grouping flag information. The inverse group rearrangement at the decoder is the inverse of the group rearrangement at the encoder.
For example, suppose the intra-group-de-interleaved MDCT spectrum consists of M blocks of L/M-point MDCT spectra. The block index idx0(i) of the i-th transient block is determined from the grouping flag information, and the MDCT spectrum of the i-th block of the intra-group-de-interleaved spectrum becomes the idx0(i)-th block of the inversely rearranged spectrum. idx0(i) is the block index corresponding to the i-th flag with value 0 in the grouping flag information, with i starting from 0. The number of transient blocks is the number of bits with value 0 in the grouping flag information, denoted num0. After the transient blocks are placed, the non-transient blocks are processed: the block index idx1(j) of the j-th non-transient block is determined from the grouping flag information, and the MDCT spectrum of the (num0+j)-th block of the intra-group-de-interleaved spectrum becomes the idx1(j)-th block of the inversely rearranged spectrum. idx1(j) is the block index corresponding to the j-th flag with value 1 in the grouping flag information, with j starting from 0.
S25. Obtain the reconstructed audio signal of the current frame from the inversely rearranged MDCT spectrum.
One specific implementation of obtaining the reconstructed audio signal from the inversely rearranged MDCT spectrum is: first, interleave the M blocks of the inversely rearranged MDCT spectrum to obtain the interleaved MDCT spectra of the M blocks; next, apply decoding post-processing to the interleaved spectra, e.g., inverse TNS, inverse FDNS, and BWE processing, each matched one-to-one with the encoding pre-processing at the encoder, to obtain the post-processed MDCT spectrum; then de-interleave the post-processed MDCT spectrum to obtain the de-interleaved MDCT spectra of the M blocks; finally, transform each of the M blocks' de-interleaved MDCT spectra from the frequency domain to the time domain and apply de-windowing and overlap-add to obtain the reconstructed audio signal.
Another specific implementation is: transform the MDCT spectra of the M blocks from the frequency domain to the time domain directly, and apply de-windowing and overlap-add to obtain the reconstructed audio signal.
As shown in FIG. 8, the audio signal encoding method performed by the encoder includes:
S31. Split the input signal into frames to obtain the input signal of the current frame.
For example, with a frame length of 1024, the input signal of the current frame is a 1024-point audio signal.
S32. Perform transient detection on the input signal of the current frame to obtain the transient detection result.
For example, split the input signal of the current frame into L blocks and compute the signal energy of each block; if the signal energy jumps abruptly between adjacent blocks, i.e., the difference between the signal energies of adjacent blocks exceeds a preset threshold, the current frame is classified as a transient signal; otherwise, the current frame is classified as a non-transient signal. L is an integer greater than 2, for example L = 8.
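The block-energy transient check of step S32 can be sketched as follows. This is a minimal illustration: the name `is_transient`, the ratio form of the test, and the threshold value are assumptions, since the text only says the energy difference must exceed a preset threshold:

```python
def is_transient(frame, n_blocks=8, ratio_thresh=8.0):
    # Split the frame into n_blocks equal blocks, compute each block's
    # energy, and flag a transient when a block's energy exceeds
    # ratio_thresh times the previous block's energy.
    n = len(frame) // n_blocks
    e = [sum(x * x for x in frame[i * n:(i + 1) * n]) for i in range(n_blocks)]
    return any(e[i] > ratio_thresh * e[i - 1] for i in range(1, n_blocks))
```

A frame with a sharp energy jump in its last block is flagged transient, while a steady frame is not.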
S33. Determine the window type of the current frame from the transient detection result.
If the transient detection result of the current frame is a transient signal, the window type of the current frame is a short window; otherwise it is a long window.
In addition to the short window and the long window, a transition-in window and a transition-out window may be added. Let the frame index of the current frame be i; the window type of frame i is determined from the transient detection results of frames i-1 and i-2 together with that of the current frame:
If the transient detection results of frame i, frame i-1, and frame i-2 are all non-transient, the window type of frame i is a long window.
If the transient detection result of frame i is transient while those of frames i-1 and i-2 are non-transient, the window type of frame i is a transition-in window.
If the transient detection results of frames i and i-1 are non-transient while that of frame i-2 is transient, the window type of frame i is a transition-out window.
In all cases other than the three above, the window type of frame i is a short window.
S34. Apply windowing and the time-frequency transform according to the window type of the current frame, to obtain the MDCT spectrum of the current frame.
Windowing and the MDCT transform are applied according to the long-window, transition-in, transition-out, or short-window type: for the long, transition-in, and transition-out windows, a windowed signal of length 2048 yields 1024 MDCT coefficients; for the short window, 8 overlapped short windows of length 256 are applied, each yielding 128 MDCT coefficients, and the 128-point MDCT coefficients of each short window are called a block, for a total of 1024 MDCT coefficients.
Determine whether the window type of the current frame is a short window; if so, perform step S35 below; if not, perform step S312 below.
S35. If the window type of the current frame is a short window, interleave the MDCT spectrum of the current frame to obtain the interleaved MDCT spectrum.
If the window type of the current frame is a short window, the MDCT spectra of the 8 blocks are interleaved, i.e., the eight 128-dimensional MDCT spectra are woven into an MDCT spectrum of length 1024.
The interleaved spectrum may take the form: block 0 bin 0, block 1 bin 0, block 2 bin 0, …, block 7 bin 0, block 0 bin 1, block 1 bin 1, block 2 bin 1, …, block 7 bin 1, ….
Here, block 0 bin 0 denotes bin 0 of block 0.
S36. Apply encoding pre-processing to the interleaved MDCT spectrum to obtain the pre-processed MDCT spectrum.
The pre-processing may include FDNS, TNS, BWE, and other processing.
S37. De-interleave the pre-processed MDCT spectrum to obtain the MDCT spectra of the M blocks.
De-interleave in the manner inverse to step S35 to obtain the MDCT spectra of the 8 blocks, each of 128 points.
S38. Determine the grouping information from the MDCT spectra of the M blocks.
The grouping information may include the grouping count numGroups and the grouping flag information groupIndicator. The specific scheme for determining the grouping information from the MDCT spectra of the M blocks may be any of those in step S13 performed by the encoder. For example, let the MDCT coefficients of the 8 blocks of the short frame be mdctSpectrum[8][128]; compute the MDCT spectral energy of each block, recorded as enerMdct[8], and then the average of the 8 block energies, recorded as avgEner. There are two methods of computing the average MDCT spectral energy:
Method 1: directly average the MDCT spectral energies of the 8 blocks, i.e., the mean of enerMdct[8].
Method 2: to reduce the influence of the highest-energy block among the 8 blocks on the average, remove the maximum block energy before computing the average.
Compare the MDCT spectral energy of each block with the average energy; if it exceeds the average by a set factor, the current block is considered a transient block (marked 0), otherwise a non-transient block (marked 1). All transient blocks form the transient group, and all non-transient blocks form the non-transient group.
For example, when the window type of the current frame is a short window, the grouping information from this preliminary decision may be:
Grouping count numGroups: 2.
Block index: 0 1 2 3 4 5 6 7.
Grouping flag information groupIndicator: 1 1 1 0 0 0 0 1.
The grouping count and grouping flag information must be written into the bitstream and transmitted to the decoder.
S39. Rearrange the MDCT spectra of the M blocks by group according to the grouping information, to obtain the group-rearranged MDCT spectrum.
The specific scheme for the group rearrangement of the MDCT spectra of the M blocks according to the grouping information may be any of those in step S14 performed by the encoder.
For example, among the 8 blocks of the short frame, the blocks belonging to the transient group are placed at the front, and the blocks belonging to the other group at the back.
Continuing the example of step S38, if the grouping information is:
Block index: 0 1 2 3 4 5 6 7.
Grouping flag information groupIndicator: 1 1 1 0 0 0 0 1.
Then the spectrum after rearrangement takes the form:
Block index: 3 4 5 6 0 1 2 7.
That is, block 0 after rearrangement is block 3 before rearrangement, block 1 after rearrangement is block 4 before, block 2 after is block 5 before, block 3 after is block 6 before, block 4 after is block 0 before, block 5 after is block 1 before, block 6 after is block 2 before, and block 7 after is block 7 before.
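The rearrangement order above follows directly from the grouping flag information: the indices of 0-valued flags come first, then the indices of 1-valued flags, each group keeping its time order. A minimal sketch (the name `group_rearrange_order` is illustrative):

```python
def group_rearrange_order(flags):
    # Transient blocks (flag 0) come first, then non-transient blocks
    # (flag 1); original time order is preserved inside each group.
    return [i for i, f in enumerate(flags) if f == 0] + \
           [i for i, f in enumerate(flags) if f == 1]
```

For groupIndicator 1 1 1 0 0 0 0 1 this yields the block order 3 4 5 6 0 1 2 7 given above.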
S310. Apply intra-group spectrum interleaving to the group-rearranged MDCT spectrum, to obtain the intra-group-interleaved MDCT spectrum.
Each group of the group-rearranged MDCT spectrum is interleaved within the group, in the same way as step S35, except that the interleaving is restricted to the MDCT spectra belonging to the same group.
Continuing the example above, in the rearranged spectrum, the transient group (blocks 3, 4, 5, and 6 before rearrangement, i.e., blocks 0, 1, 2, and 3 after rearrangement) is interleaved, and the other group (blocks 0, 1, 2, and 7 before rearrangement, i.e., blocks 4, 5, 6, and 7 after rearrangement) is interleaved.
S311. Encode the intra-group-interleaved MDCT spectrum with the encoding neural network.
The embodiments of this application do not limit the specific method of encoding the intra-group-interleaved MDCT spectrum with the encoding neural network. For example: the intra-group-interleaved MDCT spectrum is processed by the encoding neural network to generate latent variables; the latent variables are quantized to obtain quantized latent variables; the quantized latent variables are arithmetic-coded, and the arithmetic coding result is written into the bitstream.
S312. If the current frame is not a short frame, encode the MDCT spectrum of the current frame with the encoding method corresponding to the other frame types.
For the encoding of other frame types, the grouping, rearrangement, and intra-group interleaving may be omitted. For example, the MDCT spectrum of the current frame obtained in step S34 is encoded directly with the encoding neural network.
For example, determine the window function corresponding to the window type and window the audio signal of the current frame to obtain the windowed signal; where the windows of adjacent frames overlap, apply a forward time-frequency transform, such as the MDCT transform, to the windowed signal to obtain the MDCT spectrum of the current frame; then encode the MDCT spectrum of the current frame.
As shown in FIG. 9, the audio signal decoding method performed by the decoder includes:
S41. Decode the received bitstream to obtain the window type of the current frame.
Determine whether the window type of the current frame is a short window; if so, perform step S42 below; if not, perform step S410 below.
S42. If the window type of the current frame is a short window, decode the received bitstream to obtain the grouping count and grouping flag information.
S43. Obtain the decoded MDCT spectrum from the received bitstream using the decoding neural network.
The decoding neural network corresponds to the encoding neural network. For example, a specific decoding method using the decoding neural network is: perform arithmetic decoding on the received bitstream to obtain the quantized latent variables; dequantize the quantized latent variables to obtain the dequantized latent variables; feed the dequantized latent variables as input to the decoding neural network to generate the decoded MDCT spectrum.
S44. Apply intra-group de-interleaving to the decoded MDCT spectrum according to the grouping count and grouping flag information, to obtain the intra-group-de-interleaved MDCT spectrum.
Identify the MDCT spectrum blocks belonging to the same group from the grouping count and grouping flag information. For example, the decoded MDCT spectrum is divided into 8 blocks, the grouping count is 2, and the grouping flag information groupIndicator is 1 1 1 0 0 0 0 1. The number of bits with value 0 in the grouping flag information is 4, so the MDCT spectra of the first 4 blocks of the decoded spectrum form one group, the transient group, which undergoes intra-group de-interleaving; the number of bits with value 1 is 4, so the MDCT spectra of the last 4 blocks form the other group, the non-transient group, which also undergoes intra-group de-interleaving. The 8 blocks of MDCT spectra obtained by the intra-group de-interleaving are the intra-group-de-interleaved MDCT spectra of the 8 blocks.
S45. Apply inverse group rearrangement to the intra-group-de-interleaved MDCT spectrum according to the grouping count and grouping flag information, to obtain the inversely rearranged MDCT spectrum.
According to the grouping flag information groupIndicator, the intra-group-de-interleaved MDCT spectrum is rearranged into the M block spectra in time order.
For example, with a grouping count of 2 and groupIndicator 1 1 1 0 0 0 0 1: the MDCT spectrum of block 0 of the intra-group-de-interleaved spectrum is moved to block 3 (the element index of the first 0-valued bit in the grouping flag information is 3); block 1 is moved to block 4 (the second 0-valued bit is at index 4); block 2 is moved to block 5 (the third 0-valued bit is at index 5); block 3 is moved to block 6 (the fourth 0-valued bit is at index 6); block 4 is moved to block 0 (the first 1-valued bit is at index 0); block 5 is moved to block 1 (the second 1-valued bit is at index 1); block 6 is moved to block 2 (the third 1-valued bit is at index 2); and block 7 is left unchanged as block 7.
At the encoder, the group-rearranged short-frame spectrum takes the form: block index 3 4 5 6 0 1 2 7.
At the decoder, the inversely rearranged short-frame spectrum is restored to the 8 block spectra of the short frame in time order: block index 0 1 2 3 4 5 6 7.
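The inverse rearrangement of step S45 can be sketched as follows; `inverse_group_rearrange` is an illustrative name, and `blocks` stands for the intra-group-de-interleaved block spectra in their rearranged order:

```python
def inverse_group_rearrange(flags, blocks):
    # blocks[i] is the i-th block of the group-rearranged spectrum.
    # The first num0 blocks go back to the positions of the 0-valued
    # flags (transient), the rest to the positions of the 1-valued flags.
    idx0 = [i for i, f in enumerate(flags) if f == 0]  # transient positions
    idx1 = [i for i, f in enumerate(flags) if f == 1]  # non-transient positions
    out = [None] * len(flags)
    for i, dst in enumerate(idx0 + idx1):
        out[dst] = blocks[i]
    return out
```

Applied to the worked example, blocks received in order 3 4 5 6 0 1 2 7 are restored to time order 0 1 2 3 4 5 6 7.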
S46. Interleave the inversely rearranged MDCT spectrum to obtain the interleaved MDCT spectrum.
If the window type of the current frame is a short window, the inversely rearranged MDCT spectrum is interleaved in the same manner as before.
S47. Apply decoding post-processing to the interleaved MDCT spectrum to obtain the post-processed MDCT spectrum.
The decoding post-processing may include inverse BWE processing, inverse TNS processing, inverse FDNS processing, and other processing.
S48. De-interleave the post-processed MDCT spectrum to obtain the reconstructed MDCT spectrum.
S49. Apply the inverse MDCT transform and windowing to the reconstructed MDCT spectrum to obtain the reconstructed audio signal.
The reconstructed MDCT spectrum comprises the MDCT spectra of M blocks; the inverse MDCT transform is applied to the MDCT spectrum of each block. After windowing and overlap-add of the inverse-transformed signals, the reconstructed audio signal of the short frame is obtained.
S410. If the window type of the current frame is another window type, decode with the decoding method corresponding to that frame type to obtain the reconstructed audio signal.
For example, decode the received bitstream with the decoding neural network to obtain the reconstructed MDCT spectrum, then apply the inverse transform and overlap-add (OLA) according to the window type (long window, transition-in window, or transition-out window) to obtain the reconstructed audio signal.
With the method proposed in the embodiments of this application, if the window type of the current frame is a short window, the grouping count and grouping flag information of the current frame are obtained from the spectra of the M blocks of the current frame; the spectra of the M blocks are rearranged by group according to the grouping count and grouping flag information to obtain the group-rearranged spectrum; and the group-rearranged spectrum is encoded with the encoding neural network. This ensures that, when the audio signal of the current frame is a transient signal, the MDCT spectra carrying the transient features are moved to positions of higher coding importance, so that the audio signal reconstructed after neural-network encoding and decoding better preserves the transient features.
The embodiments of this application may also be applied to stereo coding, with the following differences. First, the left and right channels of the stereo signal are processed separately according to steps S31-S310 of the foregoing encoder-side embodiment, to obtain the intra-group-interleaved MDCT spectrum of the left channel and the intra-group-interleaved MDCT spectrum of the right channel. Then step S311 becomes: encode the intra-group-interleaved MDCT spectra of the left and right channels with the encoding neural network.
The input of the encoding neural network is no longer the intra-group-interleaved MDCT spectrum of a single channel, but the intra-group-interleaved MDCT spectra of the left and right channels obtained by processing the two stereo channels separately according to steps S31-S310.
The encoding neural network may be a CNN, with the intra-group-interleaved MDCT spectra of the left and right channels fed as the two input channels of the CNN.
Correspondingly, the procedure performed by the decoder includes:
Decode the received bitstream to obtain the window type, grouping count, and grouping flag information of the left channel of the current frame.
Decode the received bitstream to obtain the window type, grouping count, and grouping flag information of the right channel of the current frame.
Decode the received bitstream with the decoding neural network to obtain the decoded stereo MDCT spectrum.
Process the decoded left-channel MDCT spectrum according to the single-channel decoding steps of the first embodiment, using the left channel's window type, grouping count, and grouping flag information, to obtain the reconstructed left-channel signal.
Process the decoded right-channel MDCT spectrum according to the single-channel decoding steps of the first embodiment, using the right channel's window type, grouping count, and grouping flag information, to obtain the reconstructed right-channel signal.
It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as series of action combinations; however, those skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
To facilitate implementation of the foregoing solutions of the embodiments of this application, related apparatuses for implementing them are provided below.
Referring to FIG. 10, an audio encoding apparatus 1000 provided by an embodiment of this application may include: a transient-flag obtaining module 1001, a grouping-information obtaining module 1002, a group rearrangement module 1003, and an encoding module 1004, where:
the transient-flag obtaining module is configured to obtain M transient flags of M blocks from the spectra of the M blocks of the current frame of the audio signal to be encoded; the M blocks include a first block, and the transient flag of the first block indicates that the first block is a transient block or that the first block is a non-transient block;
the grouping-information obtaining module is configured to obtain grouping information of the M blocks from the M transient flags of the M blocks;
the group rearrangement module is configured to rearrange the spectra of the M blocks by group according to the grouping information of the M blocks, to obtain the spectrum to be encoded of the current frame;
the encoding module is configured to encode the spectrum to be encoded with an encoding neural network to obtain a spectrum encoding result, and to write the spectrum encoding result into a bitstream.
Referring to FIG. 11, an audio decoding apparatus 1100 provided by an embodiment of this application may include: a grouping-information obtaining module 1101, a decoding module 1102, an inverse group rearrangement module 1103, and an audio-signal obtaining module 1104, where:
the grouping-information obtaining module is configured to obtain, from a bitstream, grouping information of M blocks of the current frame of an audio signal, the grouping information indicating M transient flags of the M blocks;
the decoding module is configured to decode the bitstream with a decoding neural network, to obtain decoded spectra of the M blocks;
the inverse group rearrangement module is configured to apply inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks, to obtain inversely group-rearranged spectra of the M blocks;
the audio-signal obtaining module is configured to obtain the reconstructed audio signal of the current frame from the inversely group-rearranged spectra of the M blocks.
It should be noted that, because the information exchange and execution between the modules/units of the foregoing apparatuses are based on the same conception as the method embodiments of this application, their technical effects are the same as those of the method embodiments; for details, refer to the descriptions in the foregoing method embodiments of this application, which are not repeated here.
An embodiment of this application further provides a computer storage medium storing a program that, when executed, performs some or all of the steps recorded in the foregoing method embodiments.
Another audio encoding apparatus provided by an embodiment of this application is introduced next. Referring to FIG. 12, the audio encoding apparatus 1200 includes:
a receiver 1201, a transmitter 1202, a processor 1203, and a memory 1204 (the audio encoding apparatus 1200 may include one or more processors 1203; one processor is taken as an example in FIG. 12). In some embodiments of this application, the receiver 1201, the transmitter 1202, the processor 1203, and the memory 1204 may be connected by a bus or by other means; FIG. 12 takes connection by a bus as an example.
The memory 1204 may include read-only memory and random access memory, and provides instructions and data to the processor 1203. A portion of the memory 1204 may further include non-volatile random access memory (NVRAM). The memory 1204 stores an operating system and operation instructions, executable modules or data structures, or subsets or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and handling hardware-based tasks.
The processor 1203 controls the operation of the audio encoding apparatus; the processor 1203 may also be called a central processing unit (CPU). In specific applications, the components of the audio encoding apparatus are coupled together through a bus system, which, besides a data bus, may include a power bus, a control bus, a status signal bus, and the like. For clarity, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of this application may be applied to, or implemented by, the processor 1203. The processor 1203 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by integrated logic circuits of hardware in the processor 1203 or by instructions in software form. The processor 1203 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 1204; the processor 1203 reads the information in the memory 1204 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 1201 may be configured to receive input digital or character information and to generate signal input related to the settings and function control of the audio encoding apparatus; the transmitter 1202 may include a display device such as a display screen, and may be configured to output digital or character information through an external interface.
In the embodiments of this application, the processor 1203 is configured to perform the methods performed by the audio encoding apparatus as shown in FIG. 3, FIG. 6, and FIG. 8 of the foregoing embodiments.
Another audio decoding apparatus provided by an embodiment of this application is introduced next. Referring to FIG. 13, the audio decoding apparatus 1300 includes:
a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (the audio decoding apparatus 1300 may include one or more processors 1303; one processor is taken as an example in FIG. 13). In some embodiments of this application, the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 may be connected by a bus or by other means; FIG. 13 takes connection by a bus as an example.
The memory 1304 may include read-only memory and random access memory, and provides instructions and data to the processor 1303. A portion of the memory 1304 may further include NVRAM. The memory 1304 stores an operating system and operation instructions, executable modules or data structures, or subsets or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and handling hardware-based tasks.
The processor 1303 controls the operation of the audio decoding apparatus; the processor 1303 may also be called a CPU. In specific applications, the components of the audio decoding apparatus are coupled together through a bus system, which, besides a data bus, may include a power bus, a control bus, a status signal bus, and the like. For clarity, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of this application may be applied to, or implemented by, the processor 1303. The processor 1303 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by integrated logic circuits of hardware in the processor 1303 or by instructions in software form. The processor 1303 may be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 1304; the processor 1303 reads the information in the memory 1304 and completes the steps of the foregoing methods in combination with its hardware.
In the embodiments of this application, the processor 1303 is configured to perform the methods performed by the audio decoding apparatus as shown in FIG. 4, FIG. 7, and FIG. 9 of the foregoing embodiments.
In another possible design, when the audio encoding apparatus or the audio decoding apparatus is a chip within a terminal, the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip within the terminal performs the audio encoding method of any one of the first aspect or the audio decoding method of any one of the second aspect. Optionally, the storage unit is a storage unit within the chip, such as a register or a cache; the storage unit may also be a storage unit within the terminal but outside the chip, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
Any processor mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the methods of the first or second aspect.
It should further be noted that the apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided by this application, the connection relationships between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, and certainly also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, for example analog circuits, digital circuits, or dedicated circuits. For this application, however, software program implementation is in most cases the better embodiment. Based on such an understanding, the technical solutions of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
In the foregoing embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it may be realized in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid-state drive (SSD)), or the like.

Claims (25)

  1. An audio signal encoding method, comprising:
    obtaining M transient flags of M blocks from spectra of the M blocks of a current frame of an audio signal to be encoded; wherein the M blocks comprise a first block, and the transient flag of the first block indicates that the first block is a transient block, or indicates that the first block is a non-transient block;
    obtaining grouping information of the M blocks from the M transient flags of the M blocks;
    rearranging the spectra of the M blocks by group according to the grouping information of the M blocks, to obtain a spectrum to be encoded of the current frame;
    encoding the spectrum to be encoded with an encoding neural network, to obtain a spectrum encoding result;
    writing the spectrum encoding result into a bitstream.
  2. The method according to claim 1, further comprising:
    encoding the grouping information of the M blocks, to obtain a grouping information encoding result;
    writing the grouping information encoding result into the bitstream.
  3. The method according to claim 1 or 2, wherein the grouping information of the M blocks comprises: a grouping count of the M blocks or a grouping-count flag, the grouping-count flag indicating the grouping count, and when the grouping count is greater than 1, the grouping information of the M blocks further comprises the M transient flags of the M blocks; or, the grouping information of the M blocks comprises the M transient flags of the M blocks.
  4. The method according to any one of claims 1 to 3, wherein the rearranging the spectra of the M blocks by group according to the grouping information of the M blocks to obtain the spectrum to be encoded of the current frame comprises:
    assigning, among the M blocks, the spectra of the blocks indicated as transient blocks by the M transient flags to a transient group, and assigning the spectra of the blocks indicated as non-transient blocks by the M transient flags to a non-transient group;
    arranging the spectra of the blocks in the transient group before the spectra of the blocks in the non-transient group, to obtain the spectrum to be encoded of the current frame.
  5. The method according to any one of claims 1 to 3, wherein the rearranging the spectra of the M blocks by group according to the grouping information of the M blocks to obtain the spectrum to be encoded of the current frame comprises:
    arranging, among the M blocks, the spectra of the blocks indicated as transient blocks by the M transient flags before the spectra of the blocks indicated as non-transient blocks by the M transient flags, to obtain the spectrum to be encoded of the current frame.
  6. The method according to any one of claims 1 to 5, wherein before the encoding the spectrum to be encoded with the encoding neural network, the method further comprises:
    applying intra-group interleaving to the spectrum to be encoded, to obtain intra-group-interleaved spectra of the M blocks;
    wherein the encoding the spectrum to be encoded with the encoding neural network comprises:
    encoding the intra-group-interleaved spectra of the M blocks with the encoding neural network.
  7. The method according to claim 6, wherein the number of blocks among the M blocks indicated as transient blocks by the M transient flags is P, the number of blocks indicated as non-transient blocks by the M transient flags is Q, and M = P + Q;
    the applying intra-group interleaving to the spectrum to be encoded comprises:
    interleaving the spectra of the P blocks, to obtain interleaved spectra of the P blocks;
    interleaving the spectra of the Q blocks, to obtain interleaved spectra of the Q blocks;
    and the encoding the intra-group-interleaved spectra of the M blocks with the encoding neural network comprises:
    encoding the interleaved spectra of the P blocks and the interleaved spectra of the Q blocks with the encoding neural network.
  8. The method according to any one of claims 1 to 7, wherein before the obtaining the M transient flags of the M blocks from the spectra of the M blocks of the current frame of the audio signal to be encoded, the method further comprises:
    obtaining a window type of the current frame, the window type being a short-window type or a non-short-window type;
    performing the step of obtaining the M transient flags of the M blocks from the spectra of the M blocks of the current frame of the audio signal to be encoded only when the window type is the short-window type.
  9. The method according to claim 8, further comprising:
    encoding the window type, to obtain a window-type encoding result;
    writing the window-type encoding result into the bitstream.
  10. The method according to any one of claims 1 to 9, wherein the obtaining the M transient flags of the M blocks from the spectra of the M blocks of the current frame of the audio signal to be encoded comprises:
    obtaining M spectral energies of the M blocks from the spectra of the M blocks;
    obtaining an average spectral energy of the M blocks from the M spectral energies;
    obtaining the M transient flags of the M blocks from the M spectral energies and the average spectral energy.
  11. The method according to claim 10, wherein when the spectral energy of the first block is greater than K times the average spectral energy, the transient flag of the first block indicates that the first block is a transient block; or,
    when the spectral energy of the first block is less than or equal to K times the average spectral energy, the transient flag of the first block indicates that the first block is a non-transient block;
    wherein K is a real number greater than or equal to 1.
  12. An audio signal decoding method, comprising:
    obtaining, from a bitstream, grouping information of M blocks of a current frame of an audio signal, the grouping information indicating M transient flags of the M blocks;
    decoding the bitstream with a decoding neural network, to obtain decoded spectra of the M blocks;
    applying inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks, to obtain inversely group-rearranged spectra of the M blocks;
    obtaining a reconstructed audio signal of the current frame from the inversely group-rearranged spectra of the M blocks.
  13. The method according to claim 12, wherein before the applying inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks, the method further comprises:
    applying intra-group de-interleaving to the decoded spectra of the M blocks, to obtain intra-group-de-interleaved spectra of the M blocks;
    wherein the applying inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks comprises:
    applying the inverse group rearrangement to the intra-group-de-interleaved spectra of the M blocks according to the grouping information of the M blocks.
  14. The method according to claim 13, wherein the number of blocks among the M blocks indicated as transient blocks by the M transient flags is P, the number of blocks indicated as non-transient blocks by the M transient flags is Q, and M = P + Q;
    the applying intra-group de-interleaving to the decoded spectra of the M blocks comprises:
    de-interleaving the decoded spectra of the P blocks; and,
    de-interleaving the decoded spectra of the Q blocks.
  15. The method according to any one of claims 12 to 14, wherein the number of blocks among the M blocks indicated as transient blocks by the M transient flags is P, the number of blocks indicated as non-transient blocks by the M transient flags is Q, and M = P + Q;
    the applying inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks comprises:
    obtaining indices of the P blocks from the grouping information of the M blocks;
    obtaining indices of the Q blocks from the grouping information of the M blocks;
    applying the inverse group rearrangement to the decoded spectra of the M blocks according to the indices of the P blocks and the indices of the Q blocks.
  16. The method according to any one of claims 12 to 15, further comprising:
    obtaining a window type of the current frame from the bitstream, the window type being a short-window type or a non-short-window type;
    performing the step of obtaining the grouping information of the M blocks of the current frame from the bitstream only when the window type of the current frame is the short-window type.
  17. The method according to any one of claims 12 to 16, wherein the grouping information of the M blocks comprises: a grouping count of the M blocks or a grouping-count flag, the grouping-count flag indicating the grouping count, and when the grouping count is greater than 1, the grouping information of the M blocks further comprises the M transient flags of the M blocks;
    or,
    the grouping information of the M blocks comprises the M transient flags of the M blocks.
  18. An audio signal encoding apparatus, comprising:
    a transient-flag obtaining module, configured to obtain M transient flags of M blocks from spectra of the M blocks of a current frame of an audio signal to be encoded; wherein the M blocks comprise a first block, and the transient flag of the first block indicates that the first block is a transient block, or indicates that the first block is a non-transient block;
    a grouping-information obtaining module, configured to obtain grouping information of the M blocks from the M transient flags of the M blocks;
    a group rearrangement module, configured to rearrange the spectra of the M blocks by group according to the grouping information of the M blocks, to obtain a spectrum to be encoded;
    an encoding module, configured to encode the spectrum to be encoded with an encoding neural network to obtain a spectrum encoding result, and to write the spectrum encoding result into a bitstream.
  19. An audio signal decoding apparatus, comprising:
    a grouping-information obtaining module, configured to obtain, from a bitstream, grouping information of M blocks of a current frame of an audio signal, the grouping information indicating M transient flags of the M blocks;
    a decoding module, configured to decode the bitstream with a decoding neural network, to obtain decoded spectra of the M blocks;
    an inverse group rearrangement module, configured to apply inverse group rearrangement to the decoded spectra of the M blocks according to the grouping information of the M blocks, to obtain inversely group-rearranged spectra of the M blocks;
    an audio-signal obtaining module, configured to obtain a reconstructed audio signal from the inversely group-rearranged spectra of the M blocks.
  20. An audio signal encoding apparatus, comprising at least one processor, the at least one processor being configured to be coupled with a memory, and to read and execute instructions in the memory, to implement the method according to any one of claims 1 to 11.
  21. The audio signal encoding apparatus according to claim 20, further comprising: the memory.
  22. An audio signal decoding apparatus, comprising at least one processor, the at least one processor being configured to be coupled with a memory, and to read and execute instructions in the memory, to implement the method according to any one of claims 12 to 17.
  23. The audio signal decoding apparatus according to claim 22, further comprising: the memory.
  24. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 11 or 12 to 17.
  25. A computer-readable storage medium comprising a bitstream generated by the method according to any one of claims 1 to 11.
PCT/CN2022/096593 2021-07-29 2022-06-01 Audio signal encoding and decoding method and apparatus WO2023005414A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020247006252A KR20240038770A (ko) 2021-07-29 2022-06-01 Audio signal encoding method and apparatus, and audio signal decoding method and apparatus
US18/423,083 US20240177721A1 (en) 2021-07-29 2024-01-25 Audio signal encoding and decoding method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110865328.XA CN115691521A (zh) 2021-07-29 2021-07-29 Audio signal encoding and decoding method and apparatus
CN202110865328.X 2021-07-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/423,083 Continuation US20240177721A1 (en) 2021-07-29 2024-01-25 Audio signal encoding and decoding method and apparatus

Publications (1)

Publication Number Publication Date
WO2023005414A1 true WO2023005414A1 (zh) 2023-02-02

Family

ID=85058542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/096593 WO2023005414A1 (zh) 2021-07-29 2022-06-01 一种音频信号的编解码方法和装置

Country Status (4)

Country Link
US (1) US20240177721A1 (zh)
KR (1) KR20240038770A (zh)
CN (1) CN115691521A (zh)
WO (1) WO2023005414A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101046963A (zh) * 2004-09-17 2007-10-03 Guangzhou Guangsheng Digital Technology Co., Ltd. Multi-channel digital audio encoding device and method
CN101694773A (zh) * 2009-10-29 2010-04-14 Beijing Institute of Technology Adaptive window switching method based on the TDA domain
CN102222505A (zh) * 2010-04-13 2011-10-19 ZTE Corporation Scalable audio codec method and system, and scalable codec method for transient signals
CN105247608A (zh) * 2013-04-09 2016-01-13 Score Music Interactive System and method for generating audio files
CN112037803A (zh) * 2020-05-08 2020-12-04 Zhuhai Jieli Technology Co., Ltd. Audio encoding method and apparatus, electronic device, and storage medium
CN112767954A (zh) * 2020-06-24 2021-05-07 Tencent Technology (Shenzhen) Co., Ltd. Audio encoding and decoding method, apparatus, medium, and electronic device

Also Published As

Publication number Publication date
CN115691521A (zh) 2023-02-03
KR20240038770A (ko) 2024-03-25
US20240177721A1 (en) 2024-05-30

Similar Documents

Publication Publication Date Title
CA3200632A1 (en) Audio encoding and decoding method and apparatus
WO2023005414A1 (zh) Audio signal encoding and decoding method and apparatus
TWI834163B (zh) Three-dimensional audio signal encoding method, apparatus, and encoder
WO2023005415A1 (zh) Multi-channel signal encoding and decoding method and apparatus
WO2022262576A1 (zh) Three-dimensional audio signal encoding method, apparatus, encoder, and system
WO2023173941A1 (zh) Multi-channel signal encoding and decoding method, encoding and decoding device, and terminal device
WO2024146408A1 (zh) Scene audio decoding method and electronic device
WO2022253187A1 (zh) Three-dimensional audio signal processing method and apparatus
WO2022257824A1 (zh) Three-dimensional audio signal processing method and apparatus
WO2023142783A1 (zh) Audio processing method and terminal
WO2022237851A1 (zh) Audio encoding and decoding method and apparatus
WO2022242479A1 (zh) Three-dimensional audio signal encoding method, apparatus, and encoder
US20240087578A1 (en) Three-dimensional audio signal coding method and apparatus, and encoder
WO2022242483A1 (zh) Three-dimensional audio signal encoding method, apparatus, and encoder
WO2023051370A1 (zh) Encoding and decoding methods, apparatus, device, storage medium, and computer program
CN116798438A Multi-channel signal encoding and decoding method, encoding and decoding device, and terminal device
TW202422537A Audio encoding and decoding method, apparatus, storage medium, and computer program product
KR20230035373A Audio encoding method, audio decoding method, related apparatus, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22848024

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20247006252

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 1020247006252

Country of ref document: KR

NENP Non-entry into the national phase

Ref country code: DE