CN112037803B - Audio encoding method and device, electronic equipment and storage medium - Google Patents

Audio encoding method and device, electronic equipment and storage medium

Info

Publication number
CN112037803B
CN112037803B CN202010383119.7A
Authority
CN
China
Prior art keywords
audio segment
audio
coding
granularity
active audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010383119.7A
Other languages
Chinese (zh)
Other versions
CN112037803A (en)
Inventor
闫玉凤
肖全之
黄荣均
方桂萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd filed Critical Zhuhai Jieli Technology Co Ltd
Priority to CN202010383119.7A priority Critical patent/CN112037803B/en
Publication of CN112037803A publication Critical patent/CN112037803A/en
Application granted granted Critical
Publication of CN112037803B publication Critical patent/CN112037803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides an audio coding method and apparatus, an electronic device and a storage medium. The method comprises the following steps: performing voice endpoint detection processing on the audio data to be encoded so as to divide the audio data to be encoded into active audio segments and inactive audio segments; for each active audio segment, calculating its granularity average energy by using the energy value of each sub-band in each granularity; determining the coding rate of each active audio segment according to its granularity average energy, wherein the coding rate of an active audio segment is positively correlated with its granularity average energy; for each active audio segment, performing audio coding on the active audio segment according to its coding rate; and coding the inactive audio segments obtained by dividing the audio data to be coded, wherein the coding rate of each active audio segment is larger than that of each inactive audio segment. The invention helps improve the coding quality and reduce audio distortion after coding.

Description

Audio encoding method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of audio encoding technologies, and in particular, to an audio encoding method and apparatus, an electronic device, and a storage medium.
Background
At present, in order to facilitate network transmission and storage of audio, audio encoding technology is generally required to convert original audio data into compressed data. The compressed data volume is smaller, which saves storage space and reduces the network bandwidth required for transmission; however, encoding in general easily introduces audio distortion.
Disclosure of Invention
Based on the above-mentioned situation, a main object of the present invention is to provide an audio encoding method and apparatus, an electronic device, and a storage medium, which are beneficial to reducing audio distortion after encoding.
In order to achieve the above object, the present invention provides an audio encoding method, including:
step S1: performing voice endpoint detection processing on audio data to be encoded so as to divide active audio segments and inactive audio segments in the audio data to be encoded to obtain a plurality of audio segments;
step S2: performing block processing on each active audio segment to obtain a plurality of granularities, performing sub-band decomposition on each granularity, calculating the energy value of each sub-band in each granularity, and calculating the granularity average energy of each active audio segment by using the energy value of each sub-band in each granularity;
step S3: determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
step S4: for each active audio segment, carrying out audio coding on the active audio segment according to the coding code rate;
step S5: and coding the inactive audio segments obtained by dividing the audio data to be coded, wherein the coding code rate of each active audio segment is larger than that of each inactive audio segment.
Further, step S2 includes:
step S21: the k-th active audio segment obtained by dividing the audio data to be encoded is subjected to block processing to obtain a plurality of granularities, wherein k=1, 2,3, …, L is the number of active audio segments obtained by dividing the audio data to be encoded;
step S22: sub-band decomposition is carried out on each granularity of the kth active audio segment, and then the energy value of each sub-band of each granularity of the kth active audio segment is calculated;
where W_(k,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the k-th active audio segment, SP_(k,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the k-th active audio segment, sb denotes the sub-band index, sb=1, 2, 3, …, N, N is the number of sub-bands in each granularity, j denotes the frequency line index, Z is the number of frequency lines in each sub-band, and a is a preset value greater than 1;
step S23: calculating an energy distribution value of the kth active audio segment on each sub-band;
wherein D_k[sb] is the energy distribution value of the k-th active audio segment on the sb-th sub-band, and grs_k is the number of granularities obtained after the block processing of the k-th active audio segment;
step S24: determining the granularity average energy EDS_k of the k-th active audio segment.
Further, the determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment includes:
acquiring an overall target coding rate of the audio data to be coded;
and calculating the coding rate of each active audio segment according to the overall target coding rate of the audio data to be coded, the coding rate of each inactive audio segment and the granularity average energy of each active audio segment.
Further, the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment, and the method comprises the following steps:
the ratio between the coding rates of the active audio segments is consistent with the ratio between the granularity average energy of the active audio segments.
Further, the coding rate of the inactive audio segment is the lowest coding rate supported by the coding format corresponding to the audio coding method.
In order to achieve the above object, the present invention further provides an audio encoding apparatus, including:
the voice endpoint detection processing module is used for carrying out voice endpoint detection processing on the audio data to be encoded so as to divide the active audio segment and the inactive audio segment in the audio data to be encoded to obtain a plurality of audio segments;
the computing module is used for carrying out block processing on each active audio segment to obtain a plurality of granularities, carrying out sub-band decomposition on each granularity, computing the energy value of each sub-band in each granularity, and then computing the granularity average energy of each active audio segment by utilizing the energy value of each sub-band in each granularity;
the code rate determining module is used for determining the coding code rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding code rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
the first coding processing module is used for carrying out audio coding on each active audio segment according to the coding code rate of the active audio segment;
and the second coding processing module is used for coding the inactive audio segments obtained by dividing the audio data to be coded, and the coding rate of each active audio segment is larger than that of each inactive audio segment.
In order to achieve the above object, the present invention further provides an audio encoding apparatus, which includes a processor and a memory coupled to the processor, wherein instructions are stored in the memory for the processor to execute, and when the processor executes the instructions, the audio encoding method can be implemented.
In order to achieve the above object, the present invention further provides an electronic device, which includes the above audio encoding apparatus.
Further, the electronic device is a speaker, a recording pen, a mobile phone, a smart tablet, a notebook computer, a desktop computer, or an electronic toy.
To achieve the above object, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above audio encoding method.
According to the audio coding method provided by the invention, voice endpoint detection processing is performed on the audio data to be coded to divide it into active audio segments and inactive audio segments; a larger coding rate is assigned to active audio segments with larger granularity average energy and a smaller coding rate to active audio segments with smaller granularity average energy, while the coding rate of each active audio segment is kept larger than the coding rate of each inactive audio segment. This effectively improves the coding quality and reduces audio distortion after coding.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an audio encoding method according to an embodiment of the present invention.
Detailed Description
The present invention is described below based on examples, but it is not limited to these examples. In the following detailed description of the present invention, certain specific details are set forth, while well-known methods, procedures, flows, and components are not described in detail in order to avoid obscuring the present invention.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to".
In the description of the present invention, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
Referring to fig. 1, fig. 1 is a flowchart of an audio encoding method according to an embodiment of the present invention, the audio encoding method includes:
step S1: performing voice endpoint detection processing on audio data to be encoded so as to divide active audio segments and inactive audio segments in the audio data to be encoded to obtain a plurality of audio segments;
the start and end of speech in the audio data to be encoded may be detected by a speech end-point detection (VAD) process, for example, in this step, the VAD algorithm based on threshold, statistical model or machine learning (e.g. neural network) may be used to perform the speech end-point detection process on the audio data to be encoded;
for example, the active audio segment may be an audio segment whose audio feature meets a preset condition, and the inactive audio segment is an audio segment whose audio feature does not meet the preset condition, where the preset condition may be set according to specific requirements, and the audio feature may include one or more of an energy feature, a spectrum feature, a harmonic feature, a subband signal-to-noise ratio, and a zero-crossing rate, that is, the audio feature extracted from the audio data to be encoded is analyzed and processed by using a VAD method, so as to implement a decision of the active audio segment and the inactive audio segment;
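For illustration only (not the patented VAD itself), a minimal threshold-based segmentation of the kind mentioned above might look like the following Python sketch; the frame length, the mean-square-energy criterion, and the threshold value are assumptions made for the sketch, not details taken from this disclosure.

import numpy as np

def split_active_inactive(samples, frame_len=384, energy_thresh=1e-4):
    """Minimal energy-threshold VAD sketch: mark each frame of frame_len
    samples as active when its mean-square energy exceeds energy_thresh,
    then merge consecutive frames with the same label into segments."""
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame_len
    flags = [float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)) > energy_thresh
             for i in range(n_frames)]
    segments = []  # list of (is_active, start_sample, end_sample)
    start = 0
    for i in range(1, n_frames + 1):
        if i == n_frames or flags[i] != flags[start]:
            segments.append((flags[start], start * frame_len, i * frame_len))
            start = i
    return segments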
step S2: performing block processing on each active audio segment to obtain a plurality of granularities, performing sub-band decomposition on each granularity, calculating the energy value of each sub-band in each granularity, and calculating the granularity average energy of each active audio segment by using the energy value of each sub-band in each granularity;
wherein, for each active audio segment, the granularity average energy is the average of the energies of its granularities;
step S3: determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
that is, for any two active audio segments, the coding rate of the active audio segment with larger granularity average energy is larger than that of the active audio segment with smaller granularity average energy;
step S4: for each active audio segment, carrying out audio coding on the active audio segment according to the coding code rate;
step S5: and coding the inactive audio segments obtained by dividing the audio data to be coded, wherein the coding code rate of each active audio segment is larger than that of each inactive audio segment.
According to the audio coding method provided by the embodiment of the invention, voice endpoint detection processing is performed on the audio data to be coded to divide it into active audio segments and inactive audio segments; a larger coding rate is assigned to active audio segments with larger granularity average energy and a smaller coding rate to active audio segments with smaller granularity average energy, while the coding rate of each active audio segment is kept larger than the coding rate of each inactive audio segment. This effectively improves the coding quality and reduces audio distortion after coding.
For example, in an embodiment, the step S2 may specifically include:
step S21: the k-th active audio segment obtained by dividing the audio data to be encoded is subjected to block processing to obtain a plurality of granularities, wherein k=1, 2,3, …, L is the number of active audio segments obtained by dividing the audio data to be encoded;
step S22: sub-band decomposition is carried out on each granularity of the kth active audio segment, and then the energy value of each sub-band of each granularity of the kth active audio segment is calculated;
where W_(k,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the k-th active audio segment, SP_(k,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the k-th active audio segment, sb denotes the sub-band index, sb=1, 2, 3, …, N, N is the number of sub-bands in each granularity, j denotes the frequency line index, Z is the number of frequency lines in each sub-band, and a is a preset value greater than 1, for example 2, e, or 10;
step S23: calculating an energy distribution value of the kth active audio segment on each sub-band;
wherein D_k[sb] is the energy distribution value of the k-th active audio segment on the sb-th sub-band, and grs_k is the number of granularities obtained after the block processing of the k-th active audio segment;
step S24: determining the granularity average energy EDS_k of the k-th active audio segment.
The audio coding method in the embodiment of the invention can be applied to the MP1 (MPEG-1/2/2.5 Audio Layer-1), MP2 (MPEG-1/2/2.5 Audio Layer-2) or MP3 (MPEG-1/2/2.5 Audio Layer-3) coding process, and can also be applied to the coding process of other transform coding formats;
preferably, in an embodiment, the coding rate of the inactive audio segment may be set to be the lowest coding rate that can be supported by the coding format corresponding to the audio coding method, for example, in an embodiment, the audio coding method is an MP2 coding format coding method, and if the sampling rate of the audio data to be coded is greater than or equal to 32kHz, the coding rate of each inactive audio segment is 32kbps; if the sampling rate of the audio data to be encoded is smaller than 32kHz, the encoding code rate of each inactive audio segment is 8kbps.
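The MP2 example just given reduces to a one-line rule; a minimal Python sketch (the function name is illustrative):

def mp2_inactive_rate_kbps(sample_rate_hz):
    """Lowest MP2 rate for inactive segments per the example above:
    32 kbps when the sampling rate is at least 32 kHz, otherwise 8 kbps."""
    return 32 if sample_rate_hz >= 32000 else 8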
In this embodiment, by setting a larger coding rate for the active audio segments with larger granularity average energy, setting a smaller coding rate for the active audio segments with smaller granularity average energy, that is, allocating more bitstreams to the active audio segments with larger granularity average energy, allocating less bitstreams to the active audio segments with smaller granularity average energy, and making the coding rate of each active audio segment greater than the coding rate of each inactive audio segment, the coding quality can be effectively improved under the same compression rate, and a higher compression rate can be obtained under the same coding quality.
For example, in an embodiment, determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment may specifically include:
acquiring an overall target coding rate of the audio data to be coded;
and calculating the coding rate of each active audio segment according to the overall target coding rate of the audio data to be coded, the coding rate of each inactive audio segment and the granularity average energy of each active audio segment.
For example, in one embodiment, the coding rate of the inactive audio segment is set to be the lowest coding rate that can be supported by the current coding format, and then the coding rate of each active audio segment is calculated according to the overall target coding rate of the audio data to be coded and the ratio between the granularity average energy of each active audio segment, where the ratio between the coding rates of each active audio segment is consistent with the ratio between the granularity average energy of each active audio segment, and the sum of the number of bits of each audio segment is equal to the number of bits of the audio data to be coded (the product of the overall target coding rate of the audio data to be coded and the audio duration of the audio data to be coded).
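As a sketch of this allocation, and only under the assumption that the active-segment rates are split in proportion to their granularity average energies after the inactive segments' bits have been reserved from the overall budget, the following Python function illustrates the bookkeeping; segment durations and the lowest inactive rate are treated as inputs, and all names are illustrative.

def allocate_bitrates(overall_rate_bps, segments, inactive_rate_bps):
    """Sketch of the rate allocation described above.

    segments is a list of dicts like {"active": bool, "duration_s": float,
    "eds": float}, where "eds" is the granularity average energy of an active
    segment (ignored for inactive segments). The total bit budget is the
    overall target rate times the total duration; inactive segments get the
    lowest supported rate, and the remaining bits are split among the active
    segments in proportion to their EDS values, so the rate ratios match the
    EDS ratios and the per-segment bit counts sum to the budget."""
    total_duration = sum(s["duration_s"] for s in segments)
    total_bits = overall_rate_bps * total_duration
    inactive_bits = sum(inactive_rate_bps * s["duration_s"]
                        for s in segments if not s["active"])
    active_bits = total_bits - inactive_bits
    eds_dur = sum(s["eds"] * s["duration_s"] for s in segments if s["active"])
    rates = []
    for s in segments:
        if s["active"]:
            # rate_k = active_bits * eds_k / sum(eds_j * dur_j), so that
            # sum(rate_k * dur_k) over active segments equals active_bits.
            rates.append(active_bits * s["eds"] / eds_dur)
        else:
            rates.append(inactive_rate_bps)
    return rates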
For example, the audio encoding method provided by the embodiment of the invention can be applied to encoding of an MP2 encoding format, and the specific flow steps include the following:
step A: performing voice endpoint detection (VAD) processing on the audio data to be encoded, and dividing an active audio segment and an inactive audio segment (approximate silence segment with lower energy) in the audio data to be encoded, wherein the signal length of each active audio segment is an integer multiple of 384 sampling points, for example, two active audio segments and one inactive audio segment are obtained after the audio data to be encoded is divided;
the 1 st active audio segment is subjected to block processing to obtain a plurality of granularities, 384 sampling points are used as a granularity to carry out MP2 coding sub-band decomposition operation, 32 sub-bands are obtained, and each sub-band comprises 12 frequency lines, namely N=32 and Z=12;
Step B: sub-band decomposition is performed on each granularity of the 1st active audio segment, and then the energy values of each sub-band of each granularity of the 1st active audio segment are calculated;
where W_(1,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the 1st active audio segment, and SP_(1,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the 1st active audio segment;
step C: calculating the energy distribution value of the 1 st active audio segment on each sub-band;
for example, if the 1st active audio segment has 36 granularities (i.e., grs_1 = 36), then D_1[1] is the sum of the energy value W_(1,1)[1] of the 1st sub-band of the 1st granularity, the energy value W_(1,2)[1] of the 1st sub-band of the 2nd granularity, …, and the energy value W_(1,36)[1] of the 1st sub-band of the 36th granularity; similarly, D_1[1], …, D_1[32] are calculated in turn, giving 32 energy distribution values;
Step D: calculating the expectation of, and summing, the energy distribution values of the sub-bands of the 1st active audio segment:
EDF_1[sb] = D_1[sb] / grs_1;
this yields the information quantity distribution table of the 1st active audio segment, denoted EDF_1: EDF_1[sb: 1~32], and the granularity average energy EDS_1 of the 1st active audio segment,
where EDF_k[sb] denotes the energy distribution expectation of the k-th active audio segment on the sb-th sub-band;
Step E: processing the 2nd active audio segment in the same way to obtain its information quantity distribution table EDF_2 and its granularity average energy EDS_2;
setting the coding rate of the inactive audio segment to the lowest coding rate bit_min supported by MP2 coding, and then calculating the coding rate of the 1st active audio segment and the coding rate of the 2nd active audio segment according to the ratio between EDS_1 and EDS_2 and the overall target coding rate of the audio data to be coded;
by the method, bit streams consumed by the inactive audio segments can be reduced, the compression ratio is improved, the bit streams are reasonably distributed among the active audio segments, and the coding quality of the active audio segments is effectively improved;
Step F: for each of the plurality of audio segments, MP2 audio encoding is performed according to its coding rate; for example, each audio segment may be encoded using a psychoacoustic model;
with this audio coding method, the MP2 distortion at the same coding rate is smaller, and recovery of the audio signal is well guaranteed.
Furthermore, a fixed psychoacoustic model preferentially allocates the bit stream to low frequencies, while actual sounds are of many kinds and different kinds of sound signals distribute their information differently across the sub-bands; if a fixed psychoacoustic model is used, audio signals with rich spectra or more high-frequency content easily suffer from serious high-frequency loss. Preferably, therefore, in an embodiment, the encoding process of the k-th active audio segment includes:
step S41: acquiring a signal masking ratio of each sub-band in each granularity of the kth active audio segment, and then calculating a bit allocation weight value of each sub-band in each granularity of the kth active audio segment according to the signal masking ratio of each sub-band in each granularity of the kth active audio segment and an energy distribution value of the kth active audio segment on each sub-band;
P_(k,i)[sb] = C * SMR_(k,i)[sb] * D_k[sb];
where P_(k,i)[sb] is the bit allocation weight value of the sb-th sub-band in the i-th granularity of the k-th active audio segment, SMR_(k,i)[sb] is the signal masking ratio of the sb-th sub-band in the i-th granularity of the k-th active audio segment, and C is a preset positive coefficient;
for each granularity, a psychoacoustic model can be adopted to analyze and process frames corresponding to the granularity to obtain a signal masking table, wherein the signal masking table comprises the signal masking ratio of each sub-band in the granularity;
step S42: for each granularity of the k-th active audio segment, carrying out bit allocation on each sub-band according to the bit allocation weight value of each sub-band, wherein, for any two sub-bands in the same granularity, the sub-band with the larger bit allocation weight value is allocated more bits than the sub-band with the smaller bit allocation weight value;
for example, for each granularity of the kth active audio segment, the ratio between the number of bits allocated for each sub-band is consistent with the ratio between the bit allocation weight values for each sub-band.
Step S43: for each sub-band in each granularity of the kth active audio segment, quantizing the frequency line according to the number of bits allocated to the sub-band, and performing bit stream encapsulation after quantization;
For example, in step S41, the information quantity distribution table of the k-th active audio segment may be normalized, i.e., EDF_k[sb] / EDS_k, and then multiplied in turn by N and by the corresponding signal masking ratio:
P_(k,i)[sb] = SMR_(k,i)[sb] * D_k[sb] * N / (grs_k * EDS_k);
i.e., for the k-th active audio segment: C = N / (grs_k * EDS_k);
For example, in one embodiment, each active audio segment obtained by dividing the audio data to be encoded is encoded in the steps S41-S43, i.e., k is each integer between 1 and L in turn.
In this way, different kinds of sound signals are taken into account, which facilitates recovery of the different kinds of sound signals after coding and also contributes to stable quantization levels and timely spectrum tracking.
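The following Python sketch illustrates the weighting of steps S41 to S43 using the normalized form C = N/(grs_k * EDS_k) given above; the SMR values are assumed to come from an external psychoacoustic model, and the proportional, floor-rounded bit split per granularity is an illustrative assumption rather than the exact MP2 quantization and bit-stream packing.

import numpy as np

def bit_allocation_weights(smr, D, grs_k, EDS_k):
    """P_(k,i)[sb] = C * SMR_(k,i)[sb] * D_k[sb] with C = N / (grs_k * EDS_k).

    smr has shape (grs_k, N): signal masking ratio per granularity and
    sub-band (e.g. from a psychoacoustic model); D has shape (N,): energy
    distribution values of the k-th active audio segment."""
    smr = np.asarray(smr, dtype=float)
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    C = N / (grs_k * EDS_k)
    return C * smr * D[np.newaxis, :]  # shape (grs_k, N)

def allocate_bits_per_granularity(weights, bits_per_granularity):
    """Distribute one granularity's bit budget across sub-bands in proportion
    to the bit allocation weights (the ratio-consistency example in the text);
    rounding down to whole bits is an implementation assumption."""
    weights = np.asarray(weights, dtype=float)
    share = weights / np.sum(weights)
    return np.floor(share * bits_per_granularity).astype(int)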
The embodiment of the invention also provides an audio coding device, which comprises:
the voice endpoint detection processing module is used for carrying out voice endpoint detection processing on the audio data to be encoded so as to divide the active audio segment and the inactive audio segment in the audio data to be encoded to obtain a plurality of audio segments;
the computing module is used for carrying out block processing on each active audio segment to obtain a plurality of granularities, carrying out sub-band decomposition on each granularity, computing the energy value of each sub-band in each granularity, and then computing the granularity average energy of each active audio segment by utilizing the energy value of each sub-band in each granularity;
the code rate determining module is used for determining the coding code rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding code rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
the first coding processing module is used for carrying out audio coding on each active audio segment according to the coding code rate of the active audio segment;
and the second coding processing module is used for coding the inactive audio segments obtained by dividing the audio data to be coded, and the coding rate of each active audio segment is larger than that of each inactive audio segment.
The embodiment of the invention also provides an audio coding device which comprises a processor and a memory coupled with the processor, wherein the memory stores instructions for the processor to execute, and when the processor executes the instructions, the audio coding method can be realized.
According to the audio coding device provided by the embodiment of the invention, the voice endpoint detection processing is carried out on the audio data to be coded, the active audio segments and the inactive audio segments are segmented, the active audio segments with larger granularity average energy are provided with larger coding code rates, the active audio segments with smaller granularity average energy are provided with smaller coding code rates, meanwhile, the code rate of each active audio segment is larger than the code rate of each inactive audio segment, the coding quality can be effectively improved, and the audio distortion after coding is reduced.
The embodiment of the invention also provides an electronic device comprising the above audio encoding apparatus. For example, the electronic device may be a speaker, a recording pen, a mobile phone, a smart tablet, a notebook computer, a desktop computer, or an electronic toy.
The embodiment of the invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements the audio encoding method described above.
Those skilled in the art will appreciate that the above-described preferred embodiments can be freely combined and stacked without conflict.
It will be understood that the above-described embodiments are merely illustrative and not restrictive, and that all obvious or equivalent modifications and substitutions of the details given above that may be made by those skilled in the art without departing from the underlying principles of the invention are intended to be included within the scope of the appended claims.

Claims (10)

1. An audio encoding method, comprising:
step S1: performing voice endpoint detection processing on audio data to be encoded so as to divide active audio segments and inactive audio segments in the audio data to be encoded to obtain a plurality of audio segments;
step S2: performing block processing on each active audio segment to obtain a plurality of granularities, performing sub-band decomposition on each granularity, calculating the energy value of each sub-band in each granularity, and calculating the granularity average energy of each active audio segment by using the energy value of each sub-band in each granularity;
step S3: determining the coding rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
step S4: for each active audio segment, carrying out audio coding on the active audio segment according to the coding code rate;
step S5: and coding the inactive audio segments obtained by dividing the audio data to be coded, wherein the coding code rate of each active audio segment is larger than that of each inactive audio segment.
2. The method according to claim 1, wherein step S2 comprises:
step S21: the k-th active audio segment obtained by dividing the audio data to be encoded is subjected to block processing to obtain a plurality of granularities, wherein k=1, 2,3, …, L is the number of active audio segments obtained by dividing the audio data to be encoded;
step S22: sub-band decomposition is carried out on each granularity of the kth active audio segment, and then the energy value of each sub-band of each granularity of the kth active audio segment is calculated;
wherein W_(k,i)[sb] is the energy value of the sb-th sub-band in the i-th granularity of the k-th active audio segment, SP_(k,i)[sb][j] is the spectral value of the j-th frequency line of the sb-th sub-band in the i-th granularity of the k-th active audio segment, sb denotes the sub-band index, sb=1, 2, 3, …, N, N is the number of sub-bands in each granularity, j denotes the frequency line index, Z is the number of frequency lines in each sub-band, and a is a preset value greater than 1;
step S23: calculating an energy distribution value of the kth active audio segment on each sub-band;
wherein D_k[sb] is the energy distribution value of the k-th active audio segment on the sb-th sub-band, and grs_k is the number of granularities obtained after the block processing of the k-th active audio segment;
step S24: determining the granularity average energy EDS_k of the k-th active audio segment.
3. The method of claim 1, wherein said determining the coding rate of each of said active audio segments based on the granularity average energy of each of said active audio segments comprises:
acquiring an overall target coding rate of the audio data to be coded;
and calculating the coding rate of each active audio segment according to the overall target coding rate of the audio data to be coded, the coding rate of each inactive audio segment and the granularity average energy of each active audio segment.
4. The method of claim 1, wherein the coding rate of the active audio segment is positively correlated with a granularity average energy of the active audio segment, comprising:
the ratio between the coding rates of the active audio segments is consistent with the ratio between the granularity average energy of the active audio segments.
5. The method according to any one of claims 1-4, wherein the coding rate of the inactive audio segment is a lowest coding rate supported by a coding format corresponding to the audio coding method.
6. An audio encoding apparatus, comprising:
the voice endpoint detection processing module is used for carrying out voice endpoint detection processing on the audio data to be encoded so as to divide the active audio segment and the inactive audio segment in the audio data to be encoded to obtain a plurality of audio segments;
the computing module is used for carrying out block processing on each active audio segment to obtain a plurality of granularities, carrying out sub-band decomposition on each granularity, computing the energy value of each sub-band in each granularity, and then computing the granularity average energy of each active audio segment by utilizing the energy value of each sub-band in each granularity;
the code rate determining module is used for determining the coding code rate of each active audio segment according to the granularity average energy of each active audio segment, wherein the coding code rate of the active audio segment is positively correlated with the granularity average energy of the active audio segment;
the first coding processing module is used for carrying out audio coding on each active audio segment according to the coding code rate of the active audio segment;
and the second coding processing module is used for coding the inactive audio segments obtained by dividing the audio data to be coded, and the coding rate of each active audio segment is larger than that of each inactive audio segment.
7. An audio encoding device comprising a processor and a memory coupled to the processor, wherein the memory has instructions stored therein for execution by the processor, which when executed by the processor, is capable of implementing the method according to any of claims 1-5.
8. An electronic device comprising the audio encoding apparatus according to claim 6 or the audio encoding apparatus according to claim 7.
9. The electronic device of claim 8, wherein the electronic device is a speaker, a recording pen, a mobile phone, a smart tablet, a notebook computer, a desktop computer, or an electronic toy.
10. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-5.
CN202010383119.7A 2020-05-08 2020-05-08 Audio encoding method and device, electronic equipment and storage medium Active CN112037803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010383119.7A CN112037803B (en) 2020-05-08 2020-05-08 Audio encoding method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010383119.7A CN112037803B (en) 2020-05-08 2020-05-08 Audio encoding method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112037803A CN112037803A (en) 2020-12-04
CN112037803B true CN112037803B (en) 2023-09-29

Family

ID=73579419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010383119.7A Active CN112037803B (en) 2020-05-08 2020-05-08 Audio encoding method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112037803B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691521A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Audio signal coding and decoding method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101025919A (en) * 2006-02-22 2007-08-29 上海奇码数字信息有限公司 Synthetic sub-band filtering method for audio decoding and synthetic sub-band filter
CN101320563A (en) * 2007-06-05 2008-12-10 华为技术有限公司 Background noise encoding/decoding device, method and communication equipment
CN101335000A (en) * 2008-03-26 2008-12-31 华为技术有限公司 Method and apparatus for encoding and decoding
CN101552007A (en) * 2004-03-01 2009-10-07 杜比实验室特许公司 Multiple channel audio code
CN101800050A (en) * 2010-02-03 2010-08-11 武汉大学 Audio fine scalable coding method and system based on perception self-adaption bit allocation
CN102177543A (en) * 2008-10-08 2011-09-07 弗朗霍夫应用科学研究促进协会 Audio decoder, audio encoder, method for decoding an audio signal, method for encoding an audio signal, computer program and audio signal
CN102543090A (en) * 2011-12-31 2012-07-04 深圳市茂碧信息科技有限公司 Code rate automatic control system applicable to variable bit rate voice and audio coding
CN103247293A (en) * 2013-05-14 2013-08-14 中国科学院自动化研究所 Coding method and decoding method for voice data
US10043527B1 (en) * 2015-07-17 2018-08-07 Digimarc Corporation Human auditory system modeling with masking energy adaptation
CN109600700A (en) * 2018-11-16 2019-04-09 珠海市杰理科技股份有限公司 Audio data processing method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090299756A1 (en) * 2004-03-01 2009-12-03 Dolby Laboratories Licensing Corporation Ratio of speech to non-speech audio such as for elderly or hearing-impaired listeners


Also Published As

Publication number Publication date
CN112037803A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
US10546592B2 (en) Audio signal coding and decoding method and device
US9972326B2 (en) Method and apparatus for allocating bits of audio signal
CN103544957B (en) Method and device for bit distribution of sound signal
US10789964B2 (en) Dynamic bit allocation methods and devices for audio signal
JP2002196792A (en) Audio coding system, audio coding method, audio coder using the method, recording medium, and music distribution system
CN112037803B (en) Audio encoding method and device, electronic equipment and storage medium
US20130346088A1 (en) Audio coding method and apparatus
CN112037802B (en) Audio coding method and device based on voice endpoint detection, equipment and medium
US7650277B2 (en) System, method, and apparatus for fast quantization in perceptual audio coders
CN106409303A (en) Method and device for processing signal
JP3466507B2 (en) Audio coding method, audio coding device, and data recording medium
CN116018642A (en) Maintaining invariance of perceptual dissonance and sound localization cues in an audio codec
JPH10232695A (en) Method of encoding speech compression and device therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 519075 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province
Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.
Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province
Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.
GR01 Patent grant