
CN112037803A - Audio encoding method and device, electronic equipment and storage medium

Info

Publication number: CN112037803A
Application number: CN202010383119.7A
Authority: CN
Inventors: 闫玉凤; 肖全之; 黄荣均; 方桂萍
Original assignee: Zhuhai Jieli Technology Co Ltd
Current assignee: Zhuhai Jieli Technology Co Ltd
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2020-12-04
Anticipated expiration: 2040-05-08
Also published as: CN112037803B

Abstract

The invention provides an audio encoding method and apparatus, an electronic device, and a storage medium. The method comprises: performing voice endpoint detection on the audio data to be encoded so as to segment it into active audio segments and inactive audio segments; for each active audio segment, calculating the granularity average energy of the segment from the energy value of each sub-band in each granularity; determining the coding rate of each active audio segment according to its granularity average energy, the coding rate of an active audio segment being positively correlated with its granularity average energy; encoding each active audio segment at its coding rate; and encoding the inactive audio segments obtained by the segmentation, wherein the coding rate of each active audio segment is greater than that of each inactive audio segment. The invention helps improve coding quality and reduce audio distortion after encoding.

Description

Audio encoding method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of audio coding technologies, and in particular, to an audio coding method and apparatus, an electronic device, and a storage medium.

Background

At present, to facilitate network transmission and storage of audio, an audio coding technique is usually used to convert original audio data into compressed data. Because the compressed data is smaller, it saves storage space and reduces the network bandwidth required for transmission.

Disclosure of Invention

Based on the above situation, it is a primary objective of the present invention to provide an audio encoding method and apparatus, an electronic device, and a storage medium, which are beneficial to reducing audio distortion after encoding.

In order to achieve the above object, an embodiment of the present invention provides an audio encoding method, including:

step S1: carrying out voice endpoint detection processing on audio data to be coded so as to segment active audio segments and inactive audio segments in the audio data to be coded to obtain a plurality of audio segments;

step S2: partitioning each active audio segment to obtain a plurality of granularities, performing sub-band decomposition on each granularity, calculating the energy value of each sub-band in each granularity, and calculating the granularity average energy of each active audio segment by using the energy value of each sub-band in each granularity;

step S3: determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;

step S4: for each active audio segment, carrying out audio coding on the active audio segment according to the coding rate of the active audio segment;

step S5: and coding the inactive audio segments obtained by dividing the audio data to be coded, wherein the coding rate of each active audio segment is greater than that of each inactive audio segment.

Further, step S2 includes:

step S21: performing block processing on the k-th active audio segment obtained by dividing the audio data to be coded to obtain a plurality of granularities, wherein k = 1, 2, 3, …, L, and L is the number of active audio segments obtained by dividing the audio data to be coded;

step S22: performing a subband decomposition operation on each granularity of the kth active audio segment, and then calculating an energy value of each subband of each granularity of the kth active audio segment;

wherein W_(k,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the k-th active audio segment; SP_(k,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the k-th active audio segment; sb is the sub-band number, sb = 1, 2, 3, …, N, where N is the number of sub-bands in each granularity; j is the frequency line number; Z is the number of frequency lines in each sub-band; and a is a preset value greater than 1;

step S23: calculating an energy distribution value of the kth active audio segment on each sub-band;

wherein D_k[sb] is the energy distribution value of the k-th active audio segment on the sb-th sub-band, and grs_k is the number of granularities obtained after the k-th active audio segment is divided into blocks;

step S24: determining the granularity average energy EDS_k of the k-th active audio segment.
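The formulas that accompany steps S22 to S24 are not reproduced in this text. As a hedged summary, the relations below are reconstructed from the MP2 worked example later in the description (the sum over 36 granularities, the expression EDF₁[sb] = D₁[sb]/grs_1, and the normalization EDF_k[sb]/EDS_k). The form of EDS_k as the sum of EDF_k[sb] over all sub-bands is an assumption inferred from the phrase "calculating an expectation and summing"; the definition of W_(k,i)[sb] in terms of the spectral values and the base a is not recoverable here and is left out.

```latex
D_k[\mathrm{sb}] = \sum_{i=1}^{grs_k} W_{k,i}[\mathrm{sb}], \qquad
EDF_k[\mathrm{sb}] = \frac{D_k[\mathrm{sb}]}{grs_k}, \qquad
EDS_k = \sum_{\mathrm{sb}=1}^{N} EDF_k[\mathrm{sb}] \quad \text{(assumed form)}
```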

Further, the determining a coding rate for each of the active audio segments based on the granular average energy of each of the active audio segments comprises:

acquiring the total target coding rate of the audio data to be coded;

and calculating the coding rate of each active audio segment according to the overall target coding rate of the audio data to be coded, the coding rate of each inactive audio segment and the granularity average energy of each active audio segment.

Further, the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment, including:

the ratio between the coding rates of the active audio segments is consistent with the ratio between the granularity average energy of the active audio segments.

Further, the coding rate of the inactive audio segment is the lowest coding rate supported by the coding format corresponding to the audio coding method.

In order to achieve the above object, an embodiment of the present invention further provides an audio encoding apparatus, including:

the voice endpoint detection processing module is used for carrying out voice endpoint detection processing on the audio data to be coded so as to divide an active audio segment and an inactive audio segment in the audio data to be coded to obtain a plurality of audio segments;

the calculating module is used for carrying out block processing on each active audio segment to obtain a plurality of granularities, carrying out sub-band decomposition on each granularity and calculating the energy value of each sub-band in each granularity, and then calculating the granularity average energy of each active audio segment by using the energy value of each sub-band in each granularity;

the code rate determining module is used for determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;

the first coding processing module is used for carrying out audio coding on each active audio segment according to the coding rate of the active audio segment;

and the second coding processing module is used for coding the inactive audio segments obtained by dividing the audio data to be coded, and the coding rate of each active audio segment is greater than that of each inactive audio segment.

In order to achieve the above object, the present invention further provides an audio encoding apparatus, including a processor and a memory coupled to the processor, where the memory stores instructions for the processor to execute, and when the processor executes the instructions, the audio encoding method according to the above can be implemented.

In order to achieve the above object, the present invention further provides an electronic device, including the audio encoding apparatus.

Furthermore, the electronic device is a sound box, a recording pen, a mobile phone, an intelligent tablet, a notebook computer, a desktop computer or an electronic toy.

To achieve the above object, the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the above audio encoding method.

The audio coding method provided by the invention has the advantages that the voice endpoint detection processing is carried out on the audio data to be coded, the active audio segment and the inactive audio segment are divided, the active audio segment with larger granularity average energy is set with larger coding rate, the active audio segment with smaller granularity average energy is set with smaller coding rate, and meanwhile, the coding rate of each active audio segment is larger than that of each inactive audio segment, so that the coding quality can be effectively improved, and the audio distortion after coding is reduced.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:

fig. 1 is a flowchart of an audio encoding method according to an embodiment of the present invention.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. In order to avoid obscuring the nature of the present invention, well-known methods, procedures, and components have not been described in detail.

Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Referring to fig. 1, fig. 1 is a flowchart of an audio encoding method according to an embodiment of the present invention, where the audio encoding method includes:

the beginning and the end of the speech in the audio data to be encoded can be detected through a voice endpoint detection (VAD) process, for example, in this step, a VAD algorithm based on a threshold, a statistical model or machine learning (such as a neural network) can be adopted to perform voice endpoint detection process on the audio data to be encoded;

for example, the active audio segment may be an audio segment whose audio characteristics satisfy a preset condition, and the inactive audio segment is an audio segment whose audio characteristics do not satisfy the preset condition, where the preset condition may be set according to a specific requirement, and the audio characteristics may include one or more of energy characteristics, spectral characteristics, harmonic characteristics, sub-band signal-to-noise ratio, and zero-crossing rate, that is, the audio characteristics extracted from the audio data to be encoded are analyzed by using a VAD method, so as to implement the decision of the active audio segment and the inactive audio segment;
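A threshold-based decision of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the patent's VAD: it uses only a mean-energy feature, and the function names, the `threshold` parameter and the default frame length of 384 samples (borrowed from the MP2 example later in the description) are assumptions made for the sketch.

```python
def frame_energies(samples, frame_len=384):
    """Mean squared amplitude of each full frame of `frame_len` samples."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[f * frame_len:(f + 1) * frame_len]) / frame_len
        for f in range(n_frames)
    ]


def segment_by_energy(samples, threshold, frame_len=384):
    """Group consecutive frames into (is_active, start_sample, end_sample) runs.

    A frame counts as active when its mean energy exceeds `threshold`
    (a hypothetical preset condition). Trailing samples that do not fill
    a whole frame are ignored in this sketch.
    """
    segments = []
    for f, energy in enumerate(frame_energies(samples, frame_len)):
        active = energy > threshold
        start, end = f * frame_len, (f + 1) * frame_len
        if segments and segments[-1][0] == active:
            # Same decision as the previous frame: extend the current run.
            segments[-1] = (active, segments[-1][1], end)
        else:
            segments.append((active, start, end))
    return segments
```

Because runs are built from whole frames, every segment length is a multiple of `frame_len`, which matches the 384-sample constraint stated for the MP2 embodiment.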

wherein, for each active audio segment, the granularity average energy is the average of the energy of each granularity;

that is, for any two of the active audio segments, the coding rate of the active audio segment with larger granularity average energy is greater than the coding rate of the active audio segment with smaller granularity average energy;

According to the audio coding method provided by the embodiment of the invention, voice endpoint detection processing is carried out on audio data to be coded, an active audio segment and an inactive audio segment are segmented, a larger coding rate is set for the active audio segment with larger granularity average energy, a smaller coding rate is set for the active audio segment with smaller granularity average energy, and meanwhile, the coding rate of each active audio segment is greater than that of each inactive audio segment, so that the coding quality can be effectively improved, and the audio distortion after coding is reduced.

For example, in an embodiment, the step S2 may specifically include:

wherein W_(k,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the k-th active audio segment; SP_(k,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the k-th active audio segment; sb is the sub-band number, sb = 1, 2, 3, …, N, where N is the number of sub-bands in each granularity; j is the frequency line number; Z is the number of frequency lines in each sub-band; and a is a preset value greater than 1, e.g., a is 2, e, or 10;

The audio coding method in the embodiment of the present invention may be applied to an MP1 (MPEG-1/2/2.5 Audio Layer-1) coding process, an MP2 (MPEG-1/2/2.5 Audio Layer-2) coding process, an MP3 (MPEG-1/2/2.5 Audio Layer-3) coding process, or a coding process of another transform coding format;

preferably, in an embodiment, the coding rate of the inactive audio segment may be set to the lowest coding rate supported by the coding format corresponding to the audio coding method; for example, in an embodiment in which the audio coding method uses the MP2 coding format, if the sampling rate of the audio data to be coded is greater than or equal to 32 kHz, the coding rate of each inactive audio segment is 32 kbps, and if the sampling rate of the audio data to be coded is less than 32 kHz, the coding rate of each inactive audio segment is 8 kbps.
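The MP2 rule just stated reduces to a single comparison. The sketch below simply restates it in code; the function name is hypothetical.

```python
def inactive_bitrate_kbps(sample_rate_hz):
    """Coding rate (kbps) for inactive segments in the MP2 embodiment above.

    32 kbps when the sampling rate is at least 32 kHz, otherwise 8 kbps.
    """
    return 32 if sample_rate_hz >= 32000 else 8
```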

In this embodiment, a larger coding rate is set for an active audio segment with larger granularity average energy, and a smaller coding rate is set for an active audio segment with smaller granularity average energy. That is, more bits are allocated to the active audio segment with larger granularity average energy and fewer bits to the active audio segment with smaller granularity average energy, while the coding rate of each active audio segment is kept greater than that of each inactive audio segment. As a result, coding quality can be effectively improved at the same compression rate, and a higher compression rate can be obtained at the same coding quality.

For example, in an embodiment, determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment may specifically include:

acquiring the total target coding rate of the audio data to be coded;

For example, in one embodiment, the coding rate of the inactive audio segment is set to the lowest coding rate that can be supported by the currently encoded coding format, and then the coding rate of each active audio segment is calculated according to the overall target coding rate of the audio data to be encoded and the ratio between the granularity average energy of each active audio segment, the ratio between the coding rates of each active audio segment is consistent with the ratio between the granularity average energy of each active audio segment, and the sum of the number of bits of each audio segment is equal to the number of bits of the audio data to be encoded (the product of the overall target coding rate of the audio data to be encoded and the audio duration of the audio data to be encoded).
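The two constraints in this paragraph (rates proportional to granularity average energy, and total bits equal to the overall target rate times the audio duration) determine the active rates up to one scale factor. The sketch below solves for that factor. It is an illustration under stated assumptions: all names are hypothetical, the granularity average energies are assumed positive, and the result is not rounded to the discrete rates a real coding format supports, nor checked against the requirement that each active rate exceed the inactive rate.

```python
def active_segment_rates(total_rate_kbps, active_durations_s, active_eds,
                         inactive_duration_s, inactive_rate_kbps):
    """Coding rate (kbps) of each active segment.

    Solves  rate_k = scale * EDS_k  subject to
    sum(rate_k * duration_k) + inactive_rate * inactive_duration
        = total_rate * total_duration.
    """
    total_duration = sum(active_durations_s) + inactive_duration_s
    total_kbits = total_rate_kbps * total_duration
    active_kbits = total_kbits - inactive_rate_kbps * inactive_duration_s
    if active_kbits <= 0:
        raise ValueError("no bit budget left for the active segments")
    weighted = sum(e * d for e, d in zip(active_eds, active_durations_s))
    scale = active_kbits / weighted
    return [scale * e for e in active_eds]
```

For instance, with a 128 kbps overall target, two 10 s active segments whose energies are in the ratio 3:1, and 10 s of inactive audio at 32 kbps, the sketch yields 264 kbps and 88 kbps, and the three segments together consume exactly 128 kbps × 30 s.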

For example, the audio encoding method provided by the embodiment of the present invention may be applied to encoding in an MP2 encoding format, and the specific process steps include the following steps:

step A: performing voice endpoint detection (VAD) processing on the audio data to be coded, and segmenting it into active audio segments and inactive audio segments (approximately silent segments with lower energy), wherein the signal length of each active audio segment is an integral multiple of 384 sampling points; for example, two active audio segments and one inactive audio segment are obtained after the audio data to be coded is segmented;

the 1st active audio segment is divided into blocks to obtain a plurality of granularities; each block of 384 sampling points is used as one granularity for the sub-band decomposition operation of MP2 coding, yielding 32 sub-bands of 12 frequency lines each, i.e., N = 32 and Z = 12;
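The blocking step can be sketched as a plain slice into 384-sample granularities. The names below are hypothetical, and the sub-band decomposition itself (the MP2 analysis filter bank that turns each granularity into 32 sub-bands of 12 lines) is deliberately not shown.

```python
GRANULE_LEN = 384  # sampling points per granularity in the MP2 example


def split_into_granules(segment, granule_len=GRANULE_LEN):
    """Split an active segment into consecutive granularities.

    The segment length must be an integral multiple of `granule_len`,
    as required of active segments in step A.
    """
    if len(segment) % granule_len != 0:
        raise ValueError("segment length must be a multiple of the granule length")
    return [segment[i:i + granule_len] for i in range(0, len(segment), granule_len)]
```

A segment of 36 × 384 samples therefore gives grs = 36 granularities, the count used in the worked example for the 1st active audio segment.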

step B: performing a sub-band decomposition operation on each granularity of the 1st active audio segment, and then calculating the energy value of each sub-band of each granularity of the 1st active audio segment;

wherein W_(1,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the 1st active audio segment, and SP_(1,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the 1st active audio segment;

step C: calculating an energy distribution value of the 1st active audio segment on each sub-band;

for example, the 1st active audio segment has a total of 36 granularities (i.e., grs_1 = 36), so D₁[1] is the sum of the energy value W_(1,1)[1] of the 1st sub-band of the 1st granularity, the energy value W_(1,2)[1] of the 1st sub-band of the 2nd granularity, …, and the energy value W_(1,36)[1] of the 1st sub-band of the 36th granularity; D₁[1], …, D₁[32] are calculated in turn in the same way, giving 32 energy distribution values in total;

step D: calculating the expectation of the energy distribution value of the 1st active audio segment on each sub-band, and summing;

EDF₁[sb]＝D₁[sb]/grs_1；

and obtaining an information quantity distribution table of the 1st active audio segment, denoted EDF_1: EDF₁[sb], sb = 1 to 32, and the granularity average energy EDS₁ of the 1st active audio segment;

wherein EDF_k[sb] represents the expected energy distribution of the k-th active audio segment on the sb-th sub-band;
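Steps C and D can be sketched on a table of already computed sub-band energies. The function name and zero-based indexing are choices made for this sketch, and taking EDS as the sum of EDF over the sub-bands is an assumption inferred from the wording of step D, since the EDS formula itself is not reproduced in this text.

```python
def segment_energy_stats(w):
    """Energy statistics of one active segment.

    `w[i][sb]` is the energy value of sub-band `sb` in granularity `i`
    (zero-based here). Returns (d, edf, eds) where
      d[sb]   = sum of w[i][sb] over all granularities   (step C),
      edf[sb] = d[sb] / grs                              (step D),
      eds     = sum of edf[sb] over all sub-bands        (assumed form).
    """
    grs = len(w)
    n_subbands = len(w[0])
    d = [sum(w[i][sb] for i in range(grs)) for sb in range(n_subbands)]
    edf = [value / grs for value in d]
    eds = sum(edf)
    return d, edf, eds
```

In the MP2 example, `w` would have 36 rows of 32 values each, so `d` and `edf` each hold 32 entries.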

step E: processing the 2nd active audio segment in the same way to obtain the information quantity distribution table EDF_2 and the granularity average energy EDS₂ of the 2nd active audio segment;

setting the coding rate of the inactive audio segment to the lowest coding rate bitrate_min that MP2 coding can support, and then calculating the coding rate of the 1st active audio segment and the coding rate of the 2nd active audio segment according to the ratio between EDS₁ and EDS₂ and the overall target coding rate of the audio data to be coded;

in this way, the bits consumed by inactive audio segments can be reduced, the compression ratio is improved, bits are distributed reasonably among the active audio segments, and the coding quality of the active audio segments is effectively improved;

step F: for each audio segment in the plurality of audio segments, performing MP2 audio coding on the audio segment according to the coding rate of the audio segment, for example, each audio segment may be coded by using a psychoacoustic model;

at the same coding rate, MP2 audio encoded by the audio encoding method provided by this embodiment has smaller distortion, so the audio signal can be recovered better.

In addition, a fixed psychoacoustic model usually allocates bits preferentially to low frequencies, whereas real sounds are of different types and the information distribution across sub-bands differs between types of sound signal. If a fixed psychoacoustic model is used, audio signals with rich frequency content or substantial high-frequency content are therefore prone to severe high-frequency loss. Preferably, in an embodiment, the encoding process of the k-th active audio segment includes:

step S41: obtaining a signal masking ratio of each subband in each granularity of the kth active audio segment, and then calculating a bit distribution weight value of each subband in each granularity of the kth active audio segment according to the signal masking ratio of each subband in each granularity of the kth active audio segment and an energy distribution value of the kth active audio segment on each subband;

P_(k,i)[sb]＝C*SMR_(k,i)[sb]*D_k[sb]；

wherein P_(k,i)[sb] is the bit allocation weight value of the sb-th sub-band in the i-th granularity of the k-th active audio segment, SMR_(k,i)[sb] is the signal masking ratio of the sb-th sub-band in the i-th granularity of the k-th active audio segment, and C is a preset positive coefficient;

for each granularity, analyzing and processing a frame corresponding to the granularity by adopting a psychoacoustic model to obtain a signal masking table, wherein the signal masking table comprises a signal masking ratio of each sub-band in the granularity;

step S42: for each granularity of the k-th active audio segment, allocating bits to each sub-band according to the bit allocation weight value of each sub-band, wherein, for any two sub-bands in the same granularity, the sub-band with the larger bit allocation weight value is allocated more bits than the sub-band with the smaller bit allocation weight value;

for example, for each granularity of the k-th active audio segment, the ratio between the numbers of bits allocated to the sub-bands is consistent with the ratio between the bit allocation weight values of the sub-bands.
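A proportional split of a bit budget across sub-bands can be sketched as follows. This is only an illustration: the function name and the `bit_budget` parameter are hypothetical, the weights are assumed non-negative, and the largest-remainder rounding is a choice made here so that whole bits add up to the budget. It is not the patent's procedure, and an actual MP2 encoder is further restricted to the quantization levels its allocation tables permit.

```python
def allocate_bits(weights, bit_budget):
    """Split `bit_budget` whole bits across sub-bands in proportion to `weights`.

    Each sub-band first gets the floor of its exact proportional share;
    the bits left over go to the sub-bands with the largest fractional parts.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    exact = [bit_budget * w / total for w in weights]
    bits = [int(x) for x in exact]  # floor, since shares are non-negative
    leftover = bit_budget - sum(bits)
    by_remainder = sorted(range(len(weights)),
                          key=lambda sb: exact[sb] - bits[sb],
                          reverse=True)
    for sb in by_remainder[:leftover]:
        bits[sb] += 1
    return bits
```

With weights in the ratio 1:2:1 and a budget of 8 bits, the sub-bands receive 2, 4 and 2 bits, so the bit ratio matches the weight ratio exactly whenever the shares are whole numbers.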

Step S43: quantizing each sub-band in each granularity of the kth active audio segment according to the number of bits allocated to each sub-band, and performing bit stream encapsulation after quantization;

for example, in step S41, the information amount distribution table of the k-th active audio segment may be normalized as EDF_k[sb]/EDS_k, and then multiplied in turn by N and by the corresponding signal masking ratio;

P_(k,i)[sb]＝SMR_(k,i)[sb]*D_k[sb]*N/(grs_k*EDS_k)；

i.e., for the k-th active audio segment, C = N/(grs_k*EDS_k);
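The weight formula with this choice of C can be sketched directly. The function and parameter names are hypothetical; `smr` is assumed to hold the signal masking ratios of one granularity as produced by a psychoacoustic model, which is not implemented here.

```python
def bit_allocation_weights(smr, d, grs, eds):
    """Bit allocation weights P[sb] = C * SMR[sb] * D[sb] for one granularity.

    `smr[sb]` : signal masking ratio of sub-band sb in this granularity
    `d[sb]`   : energy distribution value of the segment on sub-band sb
    `grs`     : number of granularities in the segment
    `eds`     : granularity average energy of the segment
    C is taken as N / (grs * eds), with N the number of sub-bands.
    """
    n_subbands = len(d)
    c = n_subbands / (grs * eds)
    return [c * smr[sb] * d[sb] for sb in range(n_subbands)]
```

Since D[sb]/grs equals EDF[sb], each weight is the signal masking ratio scaled by N times the normalized share EDF[sb]/EDS of the segment's energy that falls in that sub-band.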

For example, in an embodiment, each active audio segment obtained by dividing the audio data to be encoded is encoded by using the above steps S41-S43, that is, k is an integer between 1 and L.

In this way, different types of sound signals are accommodated, which facilitates signal recovery for different types of sound signals after coding and helps maintain a stable quantization level and timely spectrum tracking.

An embodiment of the present invention further provides an audio encoding apparatus, including:

An embodiment of the present invention further provides an audio encoding apparatus, including a processor and a memory coupled to the processor, where the memory stores instructions for the processor to execute, and when the processor executes the instructions, the audio encoding method can be implemented.

The audio coding device provided by the embodiment of the invention has the advantages that the voice endpoint detection processing is carried out on the audio data to be coded, the active audio segment and the inactive audio segment are divided, the active audio segment with larger granularity average energy is set with larger coding code rate, the active audio segment with smaller granularity average energy is set with smaller coding code rate, and meanwhile, the code rate of each active audio segment is larger than that of each inactive audio segment, so that the coding quality can be effectively improved, and the audio distortion after coding is reduced.

An embodiment of the present invention further provides an electronic device, which includes the audio encoding apparatus. For example, the electronic device may be a sound box, a recording pen, a mobile phone, a smart tablet, a notebook computer, a desktop computer, or an electronic toy.

An embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the audio encoding method.

It will be appreciated by those skilled in the art that the above-described preferred embodiments may be freely combined and superimposed, provided there is no conflict.

It will be understood that the embodiments described above are illustrative only and not restrictive, and that various obvious and equivalent modifications and substitutions for details described herein may be made by those skilled in the art without departing from the basic principles of the invention.

Claims

1. An audio encoding method, comprising:

2. The method according to claim 1, wherein step S2 includes:

3. The method of claim 1, wherein determining a coding rate for each of the active audio segments based on a granular average energy of each of the active audio segments comprises:

acquiring the total target coding rate of the audio data to be coded;

4. The method of claim 1, wherein the coding rate of the active audio segments positively correlates with the granular mean energy of the active audio segments, comprising:

5. The method according to any of claims 1-4, wherein the coding rate of the inactive audio segment is the lowest coding rate supported by the coding format corresponding to the audio coding method.

6. An audio encoding apparatus, comprising:

7. An audio encoding apparatus, comprising a processor and a memory coupled to the processor, wherein the memory stores instructions for execution by the processor, and the instructions, when executed by the processor, implement the method according to any one of claims 1 to 5.

8. An electronic device, characterized in that it comprises an audio coding apparatus as claimed in claim 6 or an audio coding apparatus as claimed in claim 7.

9. The electronic device of claim 8, wherein the electronic device is a sound box, a voice pen, a mobile phone, a smart tablet, a laptop computer, a desktop computer, or an electronic toy.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 5.