WO2024000534A1 - Audio signal encoding method and apparatus, and electronic device and storage medium - Google Patents


Info

Publication number
WO2024000534A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
encoding
channels
downmix
rate
Prior art date
Application number
PCT/CN2022/103170
Other languages
French (fr)
Chinese (zh)
Inventor
高硕
Original Assignee
北京小米移动软件有限公司
Priority date
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority to PCT/CN2022/103170
Priority to CN202280002189.0A
Publication of WO2024000534A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source filter models or psychoacoustic analysis
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 19/18: Vocoders using multiple modes
    • G10L 19/20: Vocoders using sound class specific coding, hybrid encoders or object based coding
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 3/00: Systems employing more than two channels, e.g. quadraphonic

Definitions

  • The present disclosure relates to the field of communication technology, and in particular to an audio signal encoding method, apparatus, electronic device and storage medium.
  • In the related art, the audio signal is uniformly encoded.
  • Because the number of bits available to each channel differs at different encoding rates, uniform encoding can leave a channel with more bits than encoding requires, wasting bits, or with fewer bits than encoding requires, making it impossible to provide remote users with audio services that match the encoding rate. This is a problem that urgently needs to be solved.
  • Embodiments of the present disclosure provide an audio signal encoding method, device, electronic equipment and storage medium.
  • the audio signal is encoded according to the number of channels and the encoding rate.
  • In this way, the number of bits that can be used can be fully utilized, avoiding the waste of bits and providing remote users with audio services that match the encoding rate.
  • an embodiment of the present disclosure provides a method for encoding audio signals.
  • The method includes: acquiring a scene-based audio signal; determining the number of channels and the encoding rate of the audio signal; and encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
  • a scene-based audio signal is obtained; the number of channels and the encoding rate of the audio signal are determined; and the audio signal is encoded according to the number of channels and the encoding rate to generate a coded code stream.
  • the audio signal is encoded according to the number of channels and the encoding rate.
  • In some embodiments, encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream includes: performing downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal;
  • performing encoding processing on the downmix channel signal to generate encoding parameters;
  • and performing code stream multiplexing on the downmix parameters and the encoding parameters to generate the encoded code stream.
  • In some embodiments, performing downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal includes: determining a target control parameter for the audio signal according to the number of channels and the encoding rate; determining a downmix processing algorithm according to the target control parameter; and performing downmix processing on the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
  • In some embodiments, determining a target control parameter for the audio signal according to the number of channels and the encoding rate includes: calculating an initial average rate of each channel according to the number of channels and the encoding rate; determining a target average rate based on the initial average rate and a preset average rate threshold; and determining the target control parameter for the audio signal based on the initial average rate and the target average rate.
  • In some embodiments, before encoding the audio signal, the method further includes: preprocessing the audio signal with pre-emphasis and/or high-pass filtering.
  • an embodiment of the present disclosure provides an audio signal encoding device.
  • The audio signal encoding device includes: a signal acquisition unit configured to acquire a scene-based audio signal; an information determination unit configured to determine the number of channels and the encoding rate of the audio signal; and an encoding processing unit configured to encode the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
  • In some embodiments, the encoding processing unit includes: a downmix processing module configured to perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal; a parameter generation module configured to perform encoding processing on the downmix channel signal to generate encoding parameters; and a code stream generation module configured to perform code stream multiplexing on the downmix parameters and the encoding parameters to generate the encoded code stream.
  • In some embodiments, the downmix processing module includes: a parameter determination submodule configured to determine target control parameters for the audio signal according to the number of channels and the encoding rate; an algorithm determination submodule configured to determine the downmix processing algorithm according to the target control parameters; and a downmix processing submodule configured to perform downmix processing on the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
  • In some embodiments, the parameter determination submodule is further configured to: calculate an initial average rate of each channel based on the number of channels and the encoding rate; determine a target average rate based on the initial average rate and a preset average rate threshold; and determine the target control parameters for the audio signal based on the initial average rate and the target average rate.
  • In some embodiments, the device further includes: a preprocessing unit configured to perform pre-emphasis and/or high-pass filtering preprocessing on the audio signal.
  • Embodiments of the present disclosure further provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described in the first aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.
  • embodiments of the present disclosure provide a computer program product, including computer instructions, characterized in that, when executed by a processor, the computer instructions implement the method described in the first aspect.
  • Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic coordinate diagram of an audio signal in FOA format provided by an embodiment of the present disclosure
  • Figure 3 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of an audio signal encoding method in the related art, provided by an embodiment of the present disclosure
  • Figure 5 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 6 is a flowchart of the sub-steps of S30 in the audio signal encoding method provided by an embodiment of the present disclosure
  • Figure 7 is a flowchart of the sub-steps of S301 in the audio signal encoding method provided by an embodiment of the present disclosure
  • Figure 8 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • Figure 9 is a structural diagram of the encoding processing unit in the audio signal encoding device provided by an embodiment of the present disclosure.
  • Figure 10 is a structural diagram of the downmix processing module in the audio signal encoding device provided by an embodiment of the present disclosure
  • Figure 11 is a structural diagram of another audio signal encoding device provided by an embodiment of the present disclosure.
  • FIG. 12 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
  • In the present disclosure, "at least one" can also be described as one or more, and "a plurality" can be two, three, four or more; the present disclosure does not limit this.
  • Technical features may be distinguished by "first", "second", "third", "A", "B", "C", "D", etc.
  • The technical features described with "first", "second", "third", "A", "B", "C" and "D" imply no particular order or precedence.
  • each table in this disclosure can be configured or predefined.
  • the values of the information in each table are only examples and can be configured as other values, which is not limited by this disclosure.
  • it is not necessarily required to configure all the correspondences shown in each table.
  • the corresponding relationships shown in some rows may not be configured.
  • Appropriate adjustments, such as splitting or merging, can be made based on the above tables.
  • the names of the parameters shown in the titles of the above tables may also be other names understandable by the communication device, and the values or expressions of the parameters may also be other values or expressions understandable by the communication device.
  • Other data structures can also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, hash tables, etc.
  • the first generation of mobile communication technology is the first generation of wireless cellular technology and is an analog mobile communication network.
  • When upgrading from 1G to 2G, mobile phones moved from analog to digital communication, using the GSM (Global System for Mobile Communications) network standard; the speech coders AMR (Adaptive Multi-Rate narrowband speech codec), EFR (Enhanced Full Rate), FR (Full Rate) and HR (Half Rate) provided single-channel narrowband voice services. The 3G mobile communication system was proposed by the ITU (International Telecommunication Union) as IMT-2000 (International Mobile Telecommunications-2000); it can use TD-SCDMA, CDMA2000 or WCDMA, and its voice coder uses AMR-WB to provide single-channel wideband voice services.
  • 4G is a further improvement on 3G technology.
  • In 4G, data and voice are all IP-based, providing real-time HD (High Definition) Voice services for voice and audio; the EVS (Enhanced Voice Services) codec used is capable of high-quality compression of both speech and audio.
  • The voice and audio communication services described above have expanded from narrowband signals to ultra-wideband and even full-band services, but they are still monophonic. People's demand for high-quality audio continues to increase; compared with monophonic audio, stereo audio gives each sound source a sense of orientation and distribution and improves clarity.
  • Three signal formats, namely channel-based multi-channel audio signals, object-based audio signals and scene-based audio signals, can provide three-dimensional audio services.
  • The IVAS (Immersive Voice and Audio Services) codec being standardized by 3GPP (3rd Generation Partnership Project) SA4 can support the coding and decoding requirements of the above three signal formats.
  • Terminal devices that can support 3D audio services include mobile phones, computers, tablets, conference system equipment, AR (augmented reality)/VR (virtual reality) equipment, cars, etc.
  • FOA (First-Order Ambisonics) and HOA (Higher-Order Ambisonics) audio information is an immersive audio format in which the audio quality gradually increases as the order increases.
  • Different Ambisonics orders correspond to different numbers of audio signal components; that is, for an N-order Ambisonics signal, the number of Ambisonics coefficients (channels) is (N+1)*(N+1).
  • the number of Ambisonics channels increases rapidly with the increase of order, and the amount of encoded data also increases rapidly.
  • the complexity of encoding also increases significantly.
  • the encoding performance also decreases.
  • Therefore, the input initial channels need to be downmixed. After downmixing, the number of channels becomes smaller and the complexity of encoding decreases, thereby achieving a balanced compromise between encoding complexity and encoding performance.
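The quadratic channel growth noted above can be illustrated with a small sketch (ours, not from the patent) computing the (N+1)*(N+1) coefficient count:

```python
def ambisonics_channel_count(order: int) -> int:
    """Number of Ambisonics coefficients (channels) for an N-order signal: (N+1)^2."""
    return (order + 1) ** 2

# FOA (order 1) has 4 channels (W, X, Y, Z); the count grows quadratically
# with the order, which is why higher orders are downmixed before encoding.
counts = {order: ambisonics_channel_count(order) for order in range(1, 5)}
print(counts)  # {1: 4, 2: 9, 3: 16, 4: 25}
```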
  • Accordingly, embodiments of the present disclosure provide an audio signal encoding method and device to at least solve the above problems in the related art, in order to make full use of the available bits, provide remote users with audio services that match the encoding rate, and improve the user experience.
  • FIG. 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • When a local user establishes voice communication with any remote user, the local user can establish voice communication with the remote user's terminal device through the local user's own terminal device. The local user's terminal device can obtain the sound information of the local user's environment in real time, so as to obtain scene-based audio signals.
  • the sound information of the environment where the local user is located includes the sound information emitted by the local user, the sound information of surrounding things, etc.
  • Sound information of surrounding things such as: sound information of driving vehicles, bird calls, wind sound information, sound information of other users around the local user, etc.
  • the terminal device is an entity on the user side that is used to receive or transmit signals, such as mobile phones, computers, tablets, watches, walkie-talkies, conference system equipment, augmented reality AR/virtual reality VR equipment, cars, etc.
  • Terminal equipment can also be called user equipment (user equipment, UE), mobile station (mobile station, MS), mobile terminal equipment (mobile terminal, MT), etc.
  • For example, the terminal device can be a car with communication functions, a smart car, a mobile phone, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, or a wireless terminal device in industrial control, self-driving, remote medical surgery, smart grid, transportation safety, smart city, smart home, etc.
  • the embodiments of the present disclosure do not limit the specific technology and specific equipment form used by the terminal equipment.
  • The local user's terminal device obtains the sound information of the environment where the local user is located through a recording device, such as a microphone, that is provided in the terminal device or cooperates with it, and further generates and obtains the scene-based audio signal.
  • the scene-based audio signal may be an audio signal in FOA format or an audio signal in HOA format.
  • the number of channels and the encoding rate of the scene-based audio signal may be determined.
  • For an audio signal in FOA format, the number of channels is 4: W, X, Y, and Z.
  • W represents a component that includes all sounds in all directions in the sound field superimposed with the same gain and phase
  • X represents the front-to-back direction component in the sound field
  • Y represents the left-right direction component in the sound field
  • Z represents the up-down direction component in the sound field.
  • S3: Encode the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
  • a scene-based audio signal is obtained, the number of channels and the encoding rate of the audio signal are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded code stream.
  • the audio signal is encoded according to the number of channels and the encoding rate.
  • The encoding rate of each channel can be determined according to the number of channels and the encoding rate: for example, the average encoding rate of each channel, the maximum encoding rate of each channel, or the encoding rate of each channel. The average encoding rate of each channel is the encoding rate divided by the number of channels; the maximum encoding rate of each channel is equal to the encoding rate; and the encoding rate of each channel is the encoding rate itself.
  • the number of bits that can be used by each channel under different coding rates can be considered according to the coding rate of each channel, and then during the encoding process , can make full use of the number of bits that can be used, avoid the waste of bits, and provide remote users with audio services that match the encoding rate.
  • The generated encoded code stream can ensure clear, stable and intelligible audio services when the encoding rate is low, and can ensure high-definition, stable and immersive audio services when the encoding rate is high, thereby providing remote users with audio services that match the encoding rate and improving the user experience.
  • In some embodiments, before encoding the audio signal, the method further includes: preprocessing the audio signal with pre-emphasis and/or high-pass filtering.
  • When the scene-based audio signal is obtained and the number of channels and encoding rate of the audio signal are determined, pre-emphasis preprocessing can be performed on the audio signal. Pre-emphasis emphasizes the high-frequency part of the audio information, increasing the high-frequency resolution of the audio information.
  • Alternatively, after obtaining a scene-based audio signal and determining its number of channels and encoding rate, the audio signal can be preprocessed by high-pass filtering to filter out signal components below a certain frequency threshold.
  • The cutoff frequency of the high-pass filter can be set as needed; for example, it can be set to 20 Hz.
  • In this way, the audio signal components of the frequency band to be encoded can be obtained.
  • This prevents ultra-low-frequency signals from affecting the encoding process.
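The pre-emphasis and high-pass preprocessing described above can be sketched as follows. This is a generic illustration, not the patent's filter design: the 0.97 pre-emphasis coefficient, the first-order filter form, and the 48 kHz sample rate are common assumed values.

```python
import math

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis y[n] = x[n] - alpha * x[n-1]: boosts the high-frequency part."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def high_pass(x, cutoff_hz=20.0, fs=48000.0):
    """First-order high-pass filter removing components below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / fs)  # filter coefficient from cutoff and sample rate
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(a * (y[-1] + x[n] - x[n - 1]))
    return y

# A constant (0 Hz) input is progressively attenuated by the high-pass filter.
dc = [1.0] * 1000
print(high_pass(dc)[-1] < 0.1)  # True
```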
  • a scene-based audio signal is obtained, the number of channels and the encoding rate of the audio signal are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded code stream.
  • the audio signal is encoded according to the number of channels and the encoding rate.
  • In this way, the number of bits that can be used can be fully utilized, avoiding the waste of bits and providing remote users with audio services that match the encoding rate.
  • FIG. 3 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • a scene-based audio signal is obtained, the number of channels and the encoding rate of the audio signal are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded code stream.
  • The audio signal is encoded according to the number of channels and the encoding rate: the audio signal can be downmixed according to the number of channels and the encoding rate to generate the downmix parameters and the downmix channel signal; the downmix channel signal is then encoded to generate encoding parameters; and the code stream is multiplexed based on the downmix parameters and the encoding parameters to generate the encoded code stream.
  • In the related art, the audio signal is uniformly downmixed.
  • The number of channels is reduced compared with the original number of channels, and all of the reduced channels are encoded using the core encoder.
  • the downmix parameters and core encoder output parameters generated by the downmix processing are multiplexed and the encoded code stream is output.
  • In the related art, the audio signal is uniformly downmixed without taking into account that the number of bits available to each channel differs at different encoding rates.
  • The number of channels after downmixing does not match the number of channels the core encoder can encode, resulting in the following: when the number of channels after downmixing is much smaller than the number of input channels, it is impossible to provide better audio services to remote users at high encoding rates (because the number of bits available to each channel exceeds the number of bits necessary for encoding, resulting in wasted bits); and when the number of channels after downmixing is not much different from the number of input channels, it is impossible at a low encoding rate to provide remote users with an audio service that matches the rate value (because the number of bits available to each channel is far lower than the number of bits necessary for encoding, resulting in poor encoding quality for each channel).
  • In the embodiment of the present disclosure, the scene-based audio signal (an audio signal in FOA format or in HOA format) is input to the encoder, and the encoder can determine the number of channels and the encoding rate of the audio signal. The encoding rate, the number of channels and the audio signal are input to the mode analysis module; alternatively, the audio signal can first be preprocessed with high-pass filtering, and the preprocessed audio signal then input to the mode analysis module.
  • the mode analysis module can output control parameters according to the selected encoding rate and number of channels, and use the control parameters to guide the downmix processing module to select the corresponding downmix processing algorithm.
  • After the downmix processing module processes the audio signal, it outputs the downmix parameters and the downmix channel signal.
  • After the downmix channel signal is encoded by the core encoder, the core encoder outputs the encoding parameters; the encoding parameters and the downmix parameters are input to the code stream multiplexer, which outputs the encoded code stream.
  • In this way, a matching downmix processing algorithm is automatically selected based on the number of channels of the input audio signal and the number of bits that can be used, so that the number of channels after downmixing matches the number of channels the core encoder can encode at this encoding rate, achieving full (optimal) utilization of the available bits. That is, at low rates clear, stable and intelligible audio services can be ensured, and at high rates high-definition, stable and immersive audio services can be ensured, improving the user experience.
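The encoder flow just described (mode analysis producing control parameters, control parameters selecting the downmix algorithm, core encoding, then bitstream multiplexing) might be organised as below. All function names, bodies, and the 32 kbps switch point are placeholders invented for illustration; the patent does not specify them.

```python
def mode_analysis(num_channels, rate_bps):
    """Derive a control parameter from the encoding rate and channel count."""
    return {"avg_rate_bps": rate_bps / num_channels}

def downmix(signal, control):
    """Pick a downmix algorithm from the control parameter; return downmix
    parameters and the (here unchanged) downmix channel signal."""
    algorithm = "aggressive" if control["avg_rate_bps"] < 32000 else "light"
    return {"algorithm": algorithm}, signal

def core_encode(dm_signal):
    """Stand-in for the core encoder; returns encoding parameters."""
    return {"num_frames": len(dm_signal)}

def encode(signal, num_channels, rate_bps):
    """Mode analysis -> downmix -> core encoding -> bitstream multiplexing."""
    control = mode_analysis(num_channels, rate_bps)
    dm_params, dm_signal = downmix(signal, control)
    enc_params = core_encode(dm_signal)
    return {**dm_params, **enc_params}  # multiplexed "code stream"

stream = encode([0.0] * 960, num_channels=4, rate_bps=96000)
print(stream)  # {'algorithm': 'aggressive', 'num_frames': 960}
```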
  • After the encoder outputs the encoded code stream, the code stream can be sent to the decoder for decoding, so that the remote terminal can obtain the sound information transmitted by the local terminal.
  • S30 perform downmix processing on the audio signal according to the number of channels and coding rate to generate downmix parameters and downmix channel signals, including:
  • S301 Determine the target control parameters for the audio signal according to the number of channels and the encoding rate.
  • When the audio signal is to be downmixed according to the number of channels and the encoding rate, the target control parameters of the audio signal can first be determined based on the number of channels and the encoding rate.
  • the target control parameters for the audio signal are determined based on the number of channels and the coding rate.
  • When determining the target control parameters, the encoding rate of each channel can first be determined based on the number of channels and the encoding rate: for example, the average encoding rate of each channel, the maximum encoding rate of each channel, or the encoding rate of each channel. The average encoding rate of each channel is the encoding rate divided by the number of channels; the maximum encoding rate of each channel is equal to the encoding rate; and the encoding rate of each channel is the encoding rate itself.
  • the target control parameters for the audio signal can be determined according to the coding rate of each channel.
  • The target control parameters for the audio signal can also be determined by presetting a correspondence between control parameters and combinations of channel number and encoding rate; given the number of channels and the encoding rate of the audio signal, the target control parameters can then be determined from this correspondence.
  • the target number of channels can also be determined based on the number of channels and the encoding rate, and then the target control parameters for the audio signal can be determined based on the target number of channels, and so on.
  • the target number of channels is determined based on the number of channels and the coding rate.
  • For example, N thresholds of the average encoding rate are preset, where N is a positive integer; the N thresholds determine N+1 threshold intervals, and different threshold intervals are set to correspond to different numbers of channels after downmix processing.
  • The initial average encoding rate is calculated according to the number of channels and the encoding rate; based on the threshold interval into which it falls, the target number of channels can be determined, and the target control parameters of the audio signal can then be determined based on the target number of channels.
  • In addition, the average rate that can be allocated to each channel after downmix processing can be obtained, and the target control parameters of the audio signal can be determined based on the target number of channels and/or this post-downmix average rate.
  • A correspondence between control parameters and the target number of channels and/or the post-downmix average rate can be set in advance, from which the target control parameters of the audio signal can be determined.
  • the downmix processing algorithm may be determined based on the target control parameters.
  • the downmix processing algorithm may be determined by determining the downmix processing algorithm corresponding to each channel, and the downmix processing algorithms determined for different channels may be the same or different.
  • S303 Perform downmix processing on the audio signal according to the downmix processing algorithm to generate downmix parameters and downmix channel signals.
  • When the downmix processing algorithm corresponding to each channel is determined, the audio signal can be downmixed according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
  • S301 determine the target control parameters for the audio signal according to the number of channels and the encoding rate, including:
  • S3011 Calculate the initial average rate of each channel according to the number of channels and encoding rate.
  • S3012 Determine the target average rate based on the initial average rate and the preset average rate threshold.
  • S3013 Determine the target control parameters for the audio signal based on the initial average rate and the target average rate.
  • The initial average rate of each channel can be determined as the encoding rate divided by the number of channels. For example, when the number of channels is 4 and the encoding rate is 96 kbps, the initial average rate of each channel is 96 kbps / 4 = 24 kbps.
  • When the initial average rate of each channel has been calculated, the target average rate may be determined based on the initial average rate and a preset average rate threshold.
  • The preset average rate thresholds can be set according to the scene-based audio signal. For example, set the first average rate threshold Thres1 to 13.2 kbps and the second average rate threshold Thres2 to 32 kbps. These two average rate thresholds divide the range of the average rate into 3 intervals, as follows:
  • Average rate interval one: less than or equal to 13.2 kbps;
  • Average rate interval two: greater than 13.2 kbps and less than 32 kbps;
  • Average rate interval three: greater than or equal to 32 kbps.
  • the target average rate is determined based on the initial average rate and the preset average rate threshold.
  • Average rate intervals are determined from the average rate thresholds, and a corresponding number of output channels is set for each interval. Thus, the target number of output channels can be determined from the average rate interval to which the initial average rate belongs.
  • The target average rate can then be calculated from the target number of output channels and the encoding rate.
  • the number of output channels corresponding to average rate interval one is 2
  • the number of output channels corresponding to average rate interval two is 3
  • the number of output channels corresponding to average rate interval three is 4.
  • For example, when the initial average rate is 24 kbps, it falls into average rate interval two, so the target number of output channels is 3 and the target average rate is 96/3 = 32 kbps.
  • The target average rate in average rate interval two is thus increased compared with the initial average rate, so that when the target control parameters of the audio signal are subsequently determined, appropriate target control parameters can be chosen, and the downmix processing algorithm can be determined based on those target control parameters.
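The interval logic above can be sketched in a few lines. This is an illustrative sketch, not code from the disclosure: the thresholds (13.2 kbps, 32 kbps) and the interval-to-channel mapping follow the example in the text, while the function name and structure are assumptions.

```python
THRES1_KBPS = 13.2  # first average rate threshold (example value from the text)
THRES2_KBPS = 32.0  # second average rate threshold (example value from the text)

def target_channels_and_rate(num_channels, encoding_rate_kbps):
    """Return (target number of output channels, target average rate in kbps)."""
    initial_avg = encoding_rate_kbps / num_channels
    if initial_avg <= THRES1_KBPS:      # average rate interval one
        target_out = 2
    elif initial_avg < THRES2_KBPS:     # average rate interval two
        target_out = 3
    else:                               # average rate interval three
        target_out = 4
    return target_out, encoding_rate_kbps / target_out

# Example from the text: 4 channels at 96 kbps -> initial average 24 kbps,
# which falls into interval two -> 3 output channels, target average 32 kbps.
```

Note that the target average rate (encoding rate divided by the reduced channel count) is never lower than the initial average rate, which is the property the text relies on.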
  • three different types of downmix processing algorithms can be selected for three average rate intervals for scene-based audio signals.
  • Downmix processing is applied for average rate interval one and average rate interval two.
  • After downmix processing, the average rate that can be used by each channel increases.
  • For some scene-based audio signals, Table 2 lists the initial average rate (the initial average rate that can be allocated to each channel), the preset average rate thresholds, the corresponding number of output channels (the number of channels after downmix processing), and the determined target average rate (the average rate that can be allocated to each channel after downmix processing).
  • The average rate that can be allocated to each channel after downmix processing is greater than or equal to the initial average rate, so the available bits can be fully utilized, the waste of bits is avoided, and remote users are provided with an audio service that matches the encoding rate.
  • Each element in Table 2 exists independently; the elements are listed in the same table only by way of example, and this does not mean that all elements in the table must exist at the same time as shown.
  • The value of each element does not depend on the value of any other element in Table 2. Therefore, those skilled in the art can understand that the value of each element in Table 2 constitutes an independent embodiment.
  • the target average rate is determined based on the initial average rate and a preset average rate threshold.
  • Alternatively, the average rate threshold closest to the initial average rate may be determined as the target average rate; or the initial average rate may be directly used as the target average rate; or, among the average rate thresholds greater than the initial average rate, the one closest to the initial average rate may be determined as the target average rate; and so on. The embodiments of the present disclosure place no specific restrictions on this.
  • the target control parameters for the audio signal are determined based on the initial average rate and the target average rate.
  • The correspondence among the initial average rate, the target average rate, and the control parameters can be preset. For example: set a correspondence between the pair of initial average rate and target average rate and the control parameters; or set a correspondence between the difference between the initial average rate and the target average rate and the control parameters; and so on.
  • One downmix processing algorithm is to design a downmix conversion matrix based on the target number of output channels and the number of channels of the acquired scene-based audio signal. For example, if the number of channels is N and the target number of output channels is M, the conversion matrix is of size M*N, where N and M are positive integers and M is less than or equal to N.
  • The downmix can be written as [M*1] = [M*N] × [N*1], where [M*1] represents a matrix of M rows by 1 column, [M*N] represents a matrix of M rows by N columns, and [N*1] represents a matrix of N rows by 1 column.
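The matrix form of downmixing can be sketched in plain Python. This is only an illustration of the shape relationship between an M*N conversion matrix, an N*1 input frame, and an M*1 output frame; the gain values in the example matrix are hypothetical, not values given by the disclosure.

```python
# Sketch of downmixing as a matrix product: an assumed [2*3] conversion
# matrix maps a [3*1] input frame to a [2*1] downmix frame (M = 2 <= N = 3).
def downmix(conversion_matrix, channel_frame):
    """Apply an [M*N] downmix conversion matrix to an [N*1] channel frame."""
    m = len(conversion_matrix)
    n = len(channel_frame)
    assert all(len(row) == n for row in conversion_matrix) and m <= n
    # Each output channel is a weighted sum of the input channels.
    return [sum(conversion_matrix[i][j] * channel_frame[j] for j in range(n))
            for i in range(m)]

A = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]        # hypothetical downmix gains ([M*N])
x = [0.2, 0.4, 0.6]          # one sample per input channel ([N*1])
y = downmix(A, x)            # [M*1] downmix channel signals
```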
  • the embodiment of the present disclosure provides an exemplary embodiment.
  • The number of channels is 4, namely W, X, Y, Z, and the selected encoding rate is 96 kbps.
  • The target number of output channels after downmixing is 3, where W represents a component containing all sounds in all directions of the sound field superimposed with the same gain and phase, X represents the front-rear component of the sound field, Y represents the left-right component of the sound field, and Z represents the up-down component of the sound field; the coordinate diagram is shown in Figure 2.
  • The Z component in the up-down direction is discarded, and only the 3 channel components W, X, and Y are retained.
  • This strategy is motivated by two observations: first, when the sound field is reconstructed, the listener at the playback end is more sensitive to components in the front-rear and left-right directions and less sensitive to components in the up-down direction; second, in the sound field of general audio scenes there are fewer sound sources in the up-down direction; the sound after downmix processing
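The FOA example above (keep W, X, Y; drop Z) can be expressed as a [3*4] conversion matrix. The selection matrix below is one plausible realization with unit gains; the disclosure does not fix the matrix values, and the sample values are invented for illustration.

```python
# One plausible [3*4] conversion matrix for the FOA example: keep W, X, Y
# with unit gain and discard the up-down component Z.
DROP_Z = [
    [1.0, 0.0, 0.0, 0.0],  # keep W (omnidirectional component)
    [0.0, 1.0, 0.0, 0.0],  # keep X (front-rear component)
    [0.0, 0.0, 1.0, 0.0],  # keep Y (left-right component)
]                          # Z (up-down component) is dropped

def apply_matrix(matrix, frame):
    # Weighted sum of the input channels for each output channel.
    return [sum(g * s for g, s in zip(row, frame)) for row in matrix]

foa_frame = [0.5, 0.2, -0.1, 0.3]            # one sample each of W, X, Y, Z
downmixed = apply_matrix(DROP_Z, foa_frame)  # -> [0.5, 0.2, -0.1]
```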
  • FIG. 8 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • the audio signal encoding device 1 includes: a signal acquisition unit 11 , an information determination unit 12 and an encoding processing unit 13 .
  • the signal acquisition unit 11 is configured to acquire scene-based audio signals.
  • the information determining unit 12 is configured to determine the number of channels and the encoding rate of the audio signal.
  • the encoding processing unit 13 is configured to encode the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
  • The signal acquisition unit 11 acquires a scene-based audio signal.
  • The information determination unit 12 determines the number of channels and the encoding rate of the audio signal.
  • The encoding processing unit 13 encodes the audio signal according to the number of channels and the encoding rate to generate an encoded code stream; the audio signal is thus encoded according to the number of channels and the encoding rate.
  • In the encoding process, the available bits can be fully utilized, the waste of bits is avoided, and remote users are provided with an audio service that matches the encoding rate.
  • the encoding processing unit 13 includes: a downmix processing module 131, a parameter generation module 132, and a code stream generation module 133.
  • the downmix processing module 131 is configured to perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and downmix channel signals.
  • the parameter generation module 132 is configured to perform encoding processing on the downmix channel signal and generate encoding parameters.
  • the code stream generation module 133 is configured to perform code stream multiplexing on downmix parameters and encoding parameters to generate an encoded code stream.
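The code-stream multiplexing step performed by module 133 might be sketched as follows. The length-prefixed byte layout, function names, and field order are all assumptions made for illustration; the disclosure does not define a byte format.

```python
import struct

# Hypothetical sketch: serialize the downmix parameters and the encoding
# parameters into one encoded code stream, length-prefixing each field so
# the decoder can split them again.
def mux_code_stream(downmix_params: bytes, encoding_params: bytes) -> bytes:
    return (struct.pack(">I", len(downmix_params)) + downmix_params +
            struct.pack(">I", len(encoding_params)) + encoding_params)

def demux_code_stream(stream: bytes):
    n = struct.unpack_from(">I", stream, 0)[0]
    downmix = stream[4:4 + n]
    m = struct.unpack_from(">I", stream, 4 + n)[0]
    encoding = stream[8 + n:8 + n + m]
    return downmix, encoding
```

A round trip through `mux_code_stream` and `demux_code_stream` recovers both fields unchanged, which is the only property the sketch is meant to show.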
  • the downmix processing module 131 includes: a parameter determination sub-module 1311, an algorithm determination sub-module 1312 and a downmix processing sub-module 1313.
  • the parameter determination sub-module 1311 is configured to determine target control parameters for the audio signal according to the number of channels and the encoding rate.
  • the algorithm determination sub-module 1312 is configured to determine the downmix processing algorithm according to the target control parameters.
  • the downmix processing sub-module 1313 is configured to perform downmix processing on the audio signal according to the downmix processing algorithm to generate downmix parameters and downmix channel signals.
  • The parameter determination sub-module 1311 is also configured to: calculate the initial average rate of each channel according to the number of channels and the encoding rate; determine the target average rate according to the initial average rate and a preset average rate threshold; and determine the target control parameters for the audio signal according to the initial average rate and the target average rate.
  • the audio signal encoding device 1 further includes: a pre-processing unit 14.
  • the preprocessing unit 14 is configured to perform pre-emphasis and/or high-pass filtering preprocessing on the audio signal.
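The pre-emphasis preprocessing mentioned above is commonly realized as a first-order filter y[n] = x[n] - a * x[n-1]. The sketch below uses this common form with a typical coefficient of 0.97; neither the filter structure nor the coefficient is specified by the disclosure.

```python
# Sketch of pre-emphasis preprocessing: y[n] = x[n] - a * x[n-1].
# The coefficient 0.97 is a conventional choice, assumed here for illustration.
def pre_emphasis(samples, a=0.97):
    out = []
    prev = 0.0  # x[-1] taken as zero
    for x in samples:
        out.append(x - a * prev)
        prev = x
    return out
```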
  • the audio signal encoding device provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments. Its beneficial effects are the same as those of the audio signal encoding method described above, and will not be described again here.
  • FIG. 12 is a structural diagram of an electronic device 100 for performing an audio signal encoding method according to an exemplary embodiment.
  • the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • The electronic device 100 may include one or more of the following components: a processing component 101, a memory 102, a power supply component 103, a multimedia component 104, an audio component 105, an input/output (I/O) interface 106, a sensor component 107, and a communication component 108.
  • the processing component 101 generally controls the overall operations of the electronic device 100, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 101 may include one or more processors 1011 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 101 may include one or more modules that facilitate interaction between processing component 101 and other components. For example, processing component 101 may include a multimedia module to facilitate interaction between multimedia component 104 and processing component 101 .
  • Memory 102 is configured to store various types of data to support operations at electronic device 100 . Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, etc.
  • The memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power supply component 103 provides power to various components of electronic device 100 .
  • Power supply components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 100 .
  • Multimedia component 104 includes a touch-sensitive display screen that provides an output interface between the electronic device 100 and the user.
  • the touch display screen may include LCD (Liquid Crystal Display) and TP (Touch Panel).
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • multimedia component 104 includes a front-facing camera and/or a rear-facing camera. When the electronic device 100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data.
  • Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 105 is configured to output and/or input audio signals.
  • the audio component 105 includes a MIC (Microphone), and when the electronic device 100 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signals may be further stored in memory 102 or sent via communications component 108 .
  • audio component 105 also includes a speaker for outputting audio signals.
  • The I/O interface 106 provides an interface between the processing component 101 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • Sensor component 107 includes one or more sensors for providing various aspects of status assessment for electronic device 100 .
  • The sensor component 107 can detect the open/closed state of the electronic device 100 and the relative positioning of components, such as the display and keypad of the electronic device 100.
  • The sensor component 107 can also detect a change in position of the electronic device 100 or a component of the electronic device 100, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and a change in temperature of the electronic device 100.
  • Sensor assembly 107 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or a CCD (Charge-coupled Device) image sensor for use in imaging applications.
  • the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 108 is configured to facilitate wired or wireless communication between electronic device 100 and other devices.
  • the electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 108 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication.
  • The NFC module can be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra Wide Band) technology, BT (Bluetooth) technology, and other technologies.
  • The electronic device 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components for executing the above audio signal encoding method.
  • the electronic device 100 provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
  • the present disclosure also proposes a storage medium.
  • The storage medium can be a ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, etc.
  • the present disclosure also provides a computer program product.
  • When the computer program is executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.

Abstract

Disclosed in the embodiments of the present disclosure are an audio signal encoding method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a scene-based audio signal; determining the number of channels and an encoding rate of the audio signal; and encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream. By encoding the audio signal according to the number of channels and the encoding rate, the available bits can be fully utilized during encoding, the waste of bits is avoided, and a remote user is provided with an audio service that matches the encoding rate.

Description

Audio signal encoding method, device, electronic equipment and storage medium
Technical field
The present disclosure relates to the field of communication technology, and in particular to an audio signal encoding method, device, electronic equipment and storage medium.
Background
In the related art, after an audio signal is acquired, unified encoding processing is applied to it. This unified encoding does not take into account that the number of bits available to each channel differs at different encoding rates, which can cause the number of bits used by each channel to exceed or fall below the number of bits required for encoding, resulting in wasted bits or in the inability to provide remote users with an audio service that matches the encoding rate. This is a problem that urgently needs to be solved.
Summary
Embodiments of the present disclosure provide an audio signal encoding method, device, electronic equipment and storage medium. The audio signal is encoded according to the number of channels and the encoding rate; during encoding, the available bits can be fully utilized, the waste of bits is avoided, and remote users are provided with an audio service that matches the encoding rate.
In a first aspect, an embodiment of the present disclosure provides an audio signal encoding method. The method includes: acquiring a scene-based audio signal; determining the number of channels and the encoding rate of the audio signal; and encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
In this technical solution, a scene-based audio signal is acquired; the number of channels and the encoding rate of the audio signal are determined; and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded code stream. The audio signal is thus encoded according to the number of channels and the encoding rate; during encoding, the available bits can be fully utilized, the waste of bits is avoided, and remote users are provided with an audio service that matches the encoding rate.
In some embodiments, encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream includes: performing downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal; encoding the downmix channel signal to generate encoding parameters; and multiplexing the downmix parameters and the encoding parameters into a code stream to generate the encoded code stream.
In some embodiments, performing downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal includes: determining target control parameters for the audio signal according to the number of channels and the encoding rate; determining a downmix processing algorithm according to the target control parameters; and performing downmix processing on the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
In some embodiments, determining the target control parameters for the audio signal according to the number of channels and the encoding rate includes: calculating the initial average rate of each channel according to the number of channels and the encoding rate; determining a target average rate according to the initial average rate and a preset average rate threshold; and determining the target control parameters for the audio signal according to the initial average rate and the target average rate.
In some embodiments, before encoding the audio signal, the method further includes: pre-processing the audio signal with pre-emphasis and/or high-pass filtering.
In a second aspect, an embodiment of the present disclosure provides an audio signal encoding device. The audio signal encoding device includes: a signal acquisition unit configured to acquire a scene-based audio signal; an information determination unit configured to determine the number of channels and the encoding rate of the audio signal; and an encoding processing unit configured to encode the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
In some embodiments, the encoding processing unit includes: a downmix processing module configured to perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal; a parameter generation module configured to encode the downmix channel signal and generate encoding parameters; and a code stream generation module configured to multiplex the downmix parameters and the encoding parameters into a code stream to generate the encoded code stream.
In some embodiments, the downmix processing module includes: a parameter determination sub-module configured to determine target control parameters for the audio signal according to the number of channels and the encoding rate; an algorithm determination sub-module configured to determine a downmix processing algorithm according to the target control parameters; and a downmix processing sub-module configured to perform downmix processing on the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
In some embodiments, the parameter determination sub-module is further configured to: calculate the initial average rate of each channel according to the number of channels and the encoding rate; determine a target average rate according to the initial average rate and a preset average rate threshold; and determine the target control parameters for the audio signal according to the initial average rate and the target average rate.
In some embodiments, the device further includes: a preprocessing unit configured to pre-process the audio signal with pre-emphasis and/or high-pass filtering.
In a third aspect, embodiments of the present disclosure provide an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method described in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method described in the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product including computer instructions, wherein the computer instructions, when executed by a processor, implement the method described in the first aspect.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Description of drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the background art, the drawings required for the embodiments or the background art are described below.
Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure;
Figure 2 is a schematic coordinate diagram of an audio signal in FOA format provided by an embodiment of the present disclosure;
Figure 3 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 4 is a flow chart of an audio signal encoding method in the related art;
Figure 5 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 6 is a flow chart of the sub-steps of S30 in the audio signal encoding method provided by an embodiment of the present disclosure;
Figure 7 is a flow chart of the sub-steps of S301 in the audio signal encoding method provided by an embodiment of the present disclosure;
Figure 8 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure;
Figure 9 is a structural diagram of the encoding processing unit in the audio signal encoding device provided by an embodiment of the present disclosure;
Figure 10 is a structural diagram of the downmix processing module in the audio signal encoding device provided by an embodiment of the present disclosure;
Figure 11 is a structural diagram of another audio signal encoding device provided by an embodiment of the present disclosure;
Figure 12 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
具体实施方式Detailed ways
为了使本领域普通人员更好地理解本公开的技术方案,下面将结合附图,对本公开实施例中的技术方案进行清楚、完整地描述。In order to allow ordinary people in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings.
除非上下文另有要求,否则,在整个说明书和权利要求书中,术语“包括”被解释为开放、包含的意思,即为“包含,但不限于”。在说明书的描述中,术语“一些实施例”等旨在表明与该实施例或示例相关的特定特征、结构、材料或特性包括在本公开的至少一个实施例或示例中。上述术语的示意性表示不一定是指同一实施例或示例。此外,所述的特定特征、结构、材料或特点可以以任何适当方式包括在任何一个或多个实施例或示例中。Unless the context requires otherwise, throughout the specification and claims, the term "including" is to be interpreted in an open, inclusive sense, that is, "including, but not limited to." In the description of this specification, the terms "some embodiments" and the like are intended to indicate that a particular feature, structure, material, or characteristic associated with the embodiment or example is included in at least one embodiment or example of the present disclosure. The schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be included in any suitable manner in any one or more embodiments or examples.
需要说明的是,本公开的说明书和权利要求书及附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本公开的实施例能够以除了在这里图示或描述的那些以外的顺序实施。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。It should be noted that the terms "first", "second", etc. in the description, claims and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. The terms “first” and “second” are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more of these features. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of the disclosure as detailed in the appended claims.
"At least one" in the present disclosure may also be described as one or more, and "a plurality" may be two, three, four, or more, which the present disclosure does not limit. In the embodiments of the present disclosure, instances of a technical feature are distinguished by "first", "second", "third", "A", "B", "C", "D", and the like; the technical features so labeled carry no order of precedence or magnitude.
The correspondences shown in the tables of this disclosure may be configured or may be predefined. The values of the information in the tables are merely examples and may be configured as other values, which the present disclosure does not limit. When configuring the correspondence between information and parameters, it is not necessarily required that all correspondences shown in the tables be configured. For example, the correspondences shown in some rows of a table in this disclosure may be left unconfigured. As another example, the tables above may be suitably adapted, for example by splitting or merging. The parameter names shown in the table headings may be replaced by other names understandable to the communication apparatus, and the values or representations of the parameters may likewise be other values or representations understandable to the communication apparatus. When implemented, the tables above may also adopt other data structures, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, or hash tables.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of this disclosure.
The first generation of mobile communication technology (1G) was the first generation of wireless cellular technology and used an analog mobile communication network. The upgrade from 1G to 2G moved handsets from analog to digital communication, adopting the GSM (Global System for Mobile Communications) network standard; its speech coders included AMR (Adaptive Multi-Rate narrowband speech codec), EFR (Enhanced Full Rate), FR (Full Rate), and HR (Half Rate), providing single-channel narrowband voice services. The 3G mobile communication system was proposed by the ITU (International Telecommunication Union) for International Mobile Telecommunications-2000 and may use TD-SCDMA, CDMA2000, or WCDMA; its speech coder uses AMR-WB to provide single-channel wideband voice services. 4G is a further improvement on 3G technology: both data and voice are carried over all-IP, providing real-time HD (High Definition) Voice services, and the EVS (Enhanced Voice Services) codec it adopts delivers high-quality compression of both speech and audio.
The voice and audio communication services described above have expanded from narrowband signals to super-wideband and even full-band services, but they are still monophonic. Demand for high-quality audio keeps growing: compared with mono audio, stereo audio conveys a sense of direction and distribution for each sound source and can improve intelligibility.
With increases in transmission bandwidth, upgrades to the signal acquisition hardware of terminal devices, improvements in signal processor performance, and upgrades to terminal playback devices, three signal formats, namely channel-based multi-channel audio signals, object-based audio signals, and scene-based audio signals, can provide three-dimensional audio services. The IVAS (Immersive Voice and Audio Services) codec being standardized by 3GPP (3rd Generation Partnership Project) SA4 can support the coding and decoding requirements of all three of these signal formats. Terminal devices that can support 3D audio services include mobile phones, computers, tablets, conference system equipment, AR (augmented reality)/VR (virtual reality) devices, cars, and the like.
The FOA (First-Order Ambisonics)/HOA (Higher-Order Ambisonics) signal is a principal scene-based audio signal format. It represents the audio information captured at a certain position in an audio scene, and it is an immersive audio format whose quality increases gradually with the order. Different Ambisonics orders correspond to different numbers of audio signal components; that is, for an N-th order Ambisonics signal, the number of Ambisonics coefficients is (N+1)*(N+1):
The relationship between the Ambisonics signal order and the number of Ambisonics coefficients is shown in Table 1 below:
Ambisonics order    Ambisonics coefficients / number of channels
0                   1
1                   4
2                   9
3                   16
4                   25
5                   36
6                   49

Table 1
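The (N+1)*(N+1) relationship underlying Table 1 can be checked with a short sketch (the function name is illustrative, not from the disclosure):

```python
# Channel count for an N-th order Ambisonics signal, per the
# (N+1)*(N+1) coefficient formula given above.
def ambisonics_channels(order: int) -> int:
    return (order + 1) ** 2

# Reproduce Table 1: orders 0..6 map to 1, 4, 9, 16, 25, 36, 49 channels.
table_1 = {n: ambisonics_channels(n) for n in range(7)}
```

For example, a 4th-order HOA signal carries 25 channels, consistent with the row for order 4 in Table 1.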
As Table 1 shows, the number of Ambisonics channels grows rapidly with the order, and the amount of data to encode grows rapidly as well; correspondingly, the encoding complexity also increases sharply, while the encoding performance drops significantly under the constraint of the encoding rate. To reduce encoding complexity, the initial input channels need to be downmixed: after downmixing, the number of channels is smaller and the encoding complexity decreases, reaching a balance between encoding complexity and encoding performance.
In view of the problems in the related art that bits are wasted or that remote users cannot be provided with audio services matching the encoding rate, embodiments of the present disclosure provide an audio signal encoding method and apparatus so as to at least solve these problems, making full use of the available bits, providing remote users with audio services matching the encoding rate, and improving user experience.
Please refer to FIG. 1, which is a flowchart of an audio signal encoding method provided by an embodiment of the present disclosure.
As shown in FIG. 1, the method may include, but is not limited to, the following steps:
S1: obtain a scene-based audio signal.
It can be understood that when the local user establishes voice communication with any remote user, the local user's terminal device can establish voice communication with the remote user's terminal device; the local user's terminal device can acquire in real time the sound information of the environment in which the local user is located, thereby obtaining a scene-based audio signal.
The sound information of the local user's environment includes the sound produced by the local user as well as the sounds of surrounding things, for example the sound of passing vehicles, birdsong, wind, or the voices of other users near the local user.
It should be noted that a terminal device is a user-side entity for receiving or transmitting signals, such as a mobile phone, computer, tablet, watch, walkie-talkie, conference system device, augmented reality (AR)/virtual reality (VR) device, or car. A terminal device may also be called user equipment (UE), a mobile station (MS), a mobile terminal (MT), and so on. The terminal device may be a car with communication functions, a smart car, a mobile phone, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal device in industrial control, a wireless terminal device in self-driving, a wireless terminal device in remote medical surgery, a wireless terminal device in a smart grid, a wireless terminal device in transportation safety, a wireless terminal device in a smart city, a wireless terminal device in a smart home, and so on. The embodiments of the present disclosure do not limit the specific technology or device form adopted by the terminal device.
In the embodiments of the present disclosure, the local user's terminal device may obtain the scene-based audio signal through a recording apparatus, such as a microphone, built into or cooperating with the terminal device: the recording apparatus captures the sound information of the local user's environment, from which the scene-based audio signal is generated and obtained.
In the embodiments of the present disclosure, the scene-based audio signal may be an audio signal in FOA format or an audio signal in HOA format.
S2: determine the number of channels and the encoding rate of the audio signal.
In the embodiments of the present disclosure, after the scene-based audio signal is obtained, the number of channels and the encoding rate of the scene-based audio signal may be determined.
For example, as shown in FIG. 2, when the obtained scene-based audio signal is in FOA format, the number of channels of the audio signal is determined to be 4, namely W, X, Y, and Z, where W is the component containing all sounds from every direction in the sound field superimposed with equal gain and phase, X is the front-back component of the sound field, Y is the left-right component, and Z is the up-down component. In addition, the selected encoding rate may be determined to be 96 kbps.
S3: encode the audio signal according to the number of channels and the encoding rate to generate an encoded bitstream.
In the embodiments of the present disclosure, a scene-based audio signal is obtained, the number of channels and the encoding rate of the audio signal are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded bitstream.
When encoding the audio signal according to the number of channels and the encoding rate, the per-channel rate situation can be determined from the number of channels and the encoding rate; for example, the average encoding rate of each channel, or the maximum encoding rate of each channel, or the encoding rate of each channel may be determined. The average encoding rate of each channel can be obtained by dividing the encoding rate by the number of channels; the maximum encoding rate of each channel is equal to the encoding rate; and the encoding rate of each channel is the encoding rate.
On the basis of the per-channel encoding rate, the number of bits available to each channel at different encoding rates can be taken into account, so that during the encoding process the available bits are fully utilized, waste of bits is avoided, and remote users are provided with audio services matching the encoding rate. The generated encoded bitstream can guarantee a clear, stable, and intelligible audio service when the encoding rate is low, and a high-definition, stable, and immersive audio service when the encoding rate is high, providing remote users with audio services matching the encoding rate and improving user experience.
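The three per-channel rate figures described above can be sketched as follows (rates in kbps; the function and variable names are illustrative, not from the disclosure):

```python
# Per-channel rate figures derived from the total encoding rate and
# the number of channels, as described above.
def per_channel_rates(total_rate_kbps: float, num_channels: int):
    average = total_rate_kbps / num_channels  # average encoding rate per channel
    maximum = total_rate_kbps                 # maximum rate any one channel could use
    return average, maximum

# FOA example from the disclosure: 96 kbps shared over 4 channels.
avg, mx = per_channel_rates(96, 4)
```

With the 4-channel FOA signal and 96 kbps rate of the example above, the average per-channel rate is 24 kbps.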
In some embodiments, before the audio signal is encoded, the method further includes: preprocessing the audio signal with pre-emphasis and/or high-pass filtering.
In the embodiments of the present disclosure, once the scene-based audio signal is obtained and its number of channels and encoding rate are determined, pre-emphasis preprocessing may be applied to the audio signal; pre-emphasis boosts the high-frequency part of the audio information and increases its high-frequency resolution.
In the embodiments of the present disclosure, once the scene-based audio signal is obtained and its number of channels and encoding rate are determined, high-pass filtering may also be applied as preprocessing, to filter out signal components below a certain frequency threshold. The cutoff frequency of the high-pass filter can be set as needed; for example, it can be set to 20 Hz.
After the high-pass filtering preprocessing, the audio signal components of the frequency band to be encoded are obtained, so that ultra-low-frequency signals do not degrade the effect of the encoding process.
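The two preprocessing steps can be sketched as below. The coefficients here (a pre-emphasis factor of 0.97, a first-order RC high-pass at 20 Hz, 48 kHz sampling) are typical illustrative values, not values specified by the disclosure:

```python
import math

def pre_emphasis(x, a=0.97):
    """y[n] = x[n] - a*x[n-1]: boosts the high-frequency part of the signal."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def high_pass(x, cutoff_hz=20.0, fs=48000.0):
    """One-pole RC high-pass: attenuates components below cutoff_hz."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / fs)
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(alpha * (y[-1] + x[n] - x[n - 1]))
    return y
```

A DC (0 Hz) input decays toward zero at the high-pass output, which is the intended behavior: content below the cutoff is removed before encoding.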
By implementing the embodiments of the present disclosure, a scene-based audio signal is obtained, the number of channels and the encoding rate of the audio signal are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded bitstream. Encoding the audio signal according to the number of channels and the encoding rate allows the available bits to be fully utilized during the encoding process, avoids wasting bits, and provides remote users with audio services matching the encoding rate.
Please refer to FIG. 3, which is a flowchart of an audio signal encoding method provided by an embodiment of the present disclosure.
As shown in FIG. 3, the method may include, but is not limited to, the following steps:
S10: obtain a scene-based audio signal.
S20: determine the number of channels and the encoding rate of the audio signal.
In the embodiments of the present disclosure, for the descriptions of S10 and S20, reference may be made to the corresponding descriptions in the foregoing embodiments; identical content is not repeated here.
S30: downmix the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal.
S40: encode the downmix channel signal to generate encoding parameters.
S50: multiplex the downmix parameters and the encoding parameters into a bitstream to generate an encoded bitstream.
In the embodiments of the present disclosure, a scene-based audio signal is obtained, its number of channels and encoding rate are determined, and the audio signal is encoded according to the number of channels and the encoding rate to generate an encoded bitstream. Specifically, encoding the audio signal according to the number of channels and the encoding rate may comprise: downmixing the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal; then encoding the downmix channel signal to generate encoding parameters; and then multiplexing the downmix parameters and the encoding parameters into the encoded bitstream.
As shown in FIG. 4, in the related art, after the audio signal (in FOA or HOA format) is obtained, a uniform downmix is applied to it. The number of channels after downmixing is smaller than the original number, all of the remaining channels are encoded with the core encoder, and the downmix parameters produced by the downmix and the core encoder output parameters are multiplexed to output the encoded bitstream.
This uniform downmix does not take into account that the number of bits available to each channel differs at different encoding rates, so the number of channels after downmixing does not match the number of channels the core encoder can encode. As a result: when the number of channels after downmixing is far smaller than the number of input channels, a better audio service cannot be provided to remote users at a high encoding rate (because the number of bits available to each channel exceeds the number necessary for encoding, wasting bits); and when the number of channels after downmixing differs little from the number of input channels, an audio service matching the rate cannot be provided to remote users at a low encoding rate (because the number of bits available to each channel is far below the number necessary for encoding, so each channel is encoded with poor quality).
However, as shown in FIG. 5, in the embodiments of the present disclosure, the scene-based audio signal (in FOA or HOA format) is input to the encoder side, which can determine the number of channels and the encoding rate of the audio signal. The encoding rate, the number of channels, and the audio signal are input to the mode analysis module; alternatively, the audio signal may first be preprocessed with high-pass filtering, and the preprocessed audio signal then input to the mode analysis module.
The mode analysis module can output control parameters according to the selected encoding rate and the number of channels, and these control parameters guide the downmix processing module in selecting the corresponding downmix processing algorithm. After processing the audio signal, the downmix processing module outputs downmix parameters and a downmix channel signal; the downmix channel signal is encoded by the core encoder, which outputs encoding parameters; and the encoding parameters and the downmix parameters are input to the bitstream multiplexer, which outputs the encoded bitstream.
In the embodiments of the present disclosure, when the input scene-based audio signal is an FOA-format or HOA-format audio signal, a matching downmix processing algorithm is selected adaptively according to the number of channels of the input audio signal and the number of bits available, so that the number of channels after downmixing matches the number of channels the core encoder can encode at the given encoding rate. This achieves full (optimal) utilization of the available bits: at low rates a clear, stable, and intelligible audio service is guaranteed, and at high rates a high-definition, stable, and immersive audio service is guaranteed, improving the user experience.
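The encoder-side module flow of FIG. 5 can be traced end to end with the following minimal sketch. Every function body here is a trivial placeholder so the data flow can be executed; none of them is the actual mode analysis, downmix, or core coding algorithm of the disclosure:

```python
# Placeholder encoder-side pipeline: mode analysis -> downmix -> core
# encoder -> bitstream multiplexer, mirroring the module flow of FIG. 5.

def mode_analysis(num_channels, rate_kbps):
    # One plausible control parameter: average rate per input channel.
    return {"avg_rate_kbps": rate_kbps / num_channels}

def downmix(channels, control):
    # Placeholder downmix: keep only the first channel, record how many kept.
    kept = channels[:1]
    return {"kept_channels": len(kept)}, kept

def core_encode(downmix_signal):
    # Placeholder core encoder output.
    return {"frames": len(downmix_signal)}

def multiplex(downmix_params, coding_params):
    # Bitstream multiplexer: combine both parameter sets into one payload.
    return {**downmix_params, **coding_params}

control = mode_analysis(4, 96)                      # FOA example: 4 channels, 96 kbps
dm_params, dm_signal = downmix([[0.0]] * 4, control)
bitstream = multiplex(dm_params, core_encode(dm_signal))
```

The point of the sketch is the data flow: the downmix parameters and the core encoder's output parameters both reach the multiplexer, matching the description of the encoded bitstream above.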
In the embodiments of the present disclosure, after the encoder side outputs the encoded bitstream, the bitstream can be sent to the decoder side for decoding, so that the remote terminal can obtain the sound information transmitted by the local terminal.
As shown in FIG. 6, in some embodiments, S30, downmixing the audio signal according to the number of channels and the encoding rate to generate downmix parameters and a downmix channel signal, includes:
S301: determine target control parameters for the audio signal according to the number of channels and the encoding rate.
In the embodiments of the present disclosure, when downmixing the audio signal according to the number of channels and the encoding rate, the target control parameters for the audio signal may be determined from the number of channels and the encoding rate.
When determining the target control parameters for the audio signal from the number of channels and the encoding rate, the per-channel rate situation can be determined from the number of channels and the encoding rate; for example, the average encoding rate of each channel, or the maximum encoding rate of each channel, or the encoding rate of each channel may be determined. The average encoding rate of each channel can be obtained by dividing the encoding rate by the number of channels; the maximum encoding rate of each channel is equal to the encoding rate; and the encoding rate of each channel is the encoding rate.
In the embodiments of the present disclosure, on the basis of the per-channel encoding rate determined from the number of channels and the encoding rate, the target control parameters for the audio signal can be determined according to the per-channel encoding rate.
Of course, when determining the target control parameters for the audio signal from the number of channels and the encoding rate, a correspondence between the number of channels plus the encoding rate on the one hand and the control parameters on the other may also be set in advance, so that once the number of channels and the encoding rate of the audio signal are determined, the target control parameters for the audio signal can be determined.
Alternatively, a target number of channels may be determined from the number of channels and the encoding rate, and the target control parameters for the audio signal then determined from the target number of channels, and so on.
The target number of channels may be determined from the number of channels and the encoding rate as follows: N thresholds of the average encoding rate are preset, where N is a positive integer; the N thresholds define N+1 threshold intervals, and different threshold intervals are set to correspond to different numbers of channels after downmixing. On this basis, the initial average encoding rate is computed from the number of channels and the encoding rate, the target number of channels is determined from the threshold interval to which the initial average rate belongs, and the target control parameters for the audio signal are then determined from the target number of channels.
It can be understood that, once the encoding rate and the number of channels after downmixing are known, the average rate that can be allocated to each channel after downmixing can also be obtained, and the target control parameters for the audio signal can be determined from the target number of channels and/or the average rate allocable to each channel after downmixing.
When determining the target control parameters for the audio signal from the target number of channels and/or the average rate allocable to each channel after downmixing, a correspondence between these quantities and the control parameters may be set in advance; the target control parameters for the audio signal can then be determined from the target number of channels and/or the average per-channel rate after downmixing, together with the preset correspondence.
S302: determine a downmix processing algorithm according to the target control parameters.
In the embodiments of the present disclosure, once the target control parameters for the audio signal have been determined from the number of channels and the encoding rate, the downmix processing algorithm can be determined from the target control parameters. Determining the downmix processing algorithm may mean determining the downmix processing algorithm corresponding to each channel, and the algorithms determined for different channels may be the same or different.
S303: downmix the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
In the embodiments of the present disclosure, once the downmix processing algorithm corresponding to each channel is determined, the audio signal can be downmixed according to that algorithm to generate the downmix parameters and the downmix channel signal.
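As one concrete illustration of a downmix step, a 4-channel FOA frame (W, X, Y, Z) can be reduced to two channels by a fixed matrix. The matrix below is only a simple example; it is not the algorithm selected by the control parameters in the disclosure:

```python
# Illustrative matrix downmix of one FOA frame [W, X, Y, Z] to 2 channels.
DOWNMIX_MATRIX = [
    [0.5, 0.0,  0.5, 0.0],   # left  = 0.5*W + 0.5*Y
    [0.5, 0.0, -0.5, 0.0],   # right = 0.5*W - 0.5*Y
]

def matrix_downmix(frame_wxyz, matrix=DOWNMIX_MATRIX):
    return [sum(g * s for g, s in zip(row, frame_wxyz)) for row in matrix]

left, right = matrix_downmix([1.0, 0.2, 0.4, 0.1])
```

Here the downmix matrix itself would be part of the downmix parameters carried in the bitstream, while the two output channels form the downmix channel signal passed to the core encoder.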
As shown in FIG. 7, in some embodiments, S301, determining the target control parameters for the audio signal according to the number of channels and the encoding rate, includes:
S3011: calculate the initial average rate of each channel according to the number of channels and the encoding rate.
S3012: determine a target average rate according to the initial average rate and a preset average rate threshold.
S3013: determine the target control parameters for the audio signal according to the initial average rate and the target average rate.
其中，根据声道数和编码速率，计算每个声道的初始平均速率，可以根据编码速率除以声道数进行确定。示例性地，在声道数为4个，编码速率为96kbps的情况下，根据声道数和编码速率，计算每个声道的初始平均速率为24kbps。Here, the initial average rate of each channel is computed from the number of channels and the encoding rate, and may be determined by dividing the encoding rate by the number of channels. For example, when the number of channels is 4 and the encoding rate is 96 kbps, the initial average rate of each channel is calculated as 96 kbps / 4 = 24 kbps.
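As an illustrative sketch (not part of the disclosed embodiment), the computation of S3011 can be expressed as follows; the function name is hypothetical:

```python
def initial_average_rate(encoding_rate_kbps: float, num_channels: int) -> float:
    # S3011: initial per-channel average rate = encoding rate / number of channels
    return encoding_rate_kbps / num_channels

# Example from the text: 4 channels at 96 kbps
print(initial_average_rate(96.0, 4))  # 24.0 kbps per channel
```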
本公开实施例中,在计算得到每个声道的初始平均速率的情况下,可以根据初始平均速率和预先设置的平均速率阈值确定目标平均速率。In the embodiment of the present disclosure, when the initial average rate of each channel is calculated, the target average rate may be determined based on the initial average rate and a preset average rate threshold.
其中，预先设置的平均速率阈值，可以根据基于场景的音频信号进行设置，例如，设置第一个平均速率阈值Thres1为13.2kbps，第二个平均速率阈值Thres2为32kbps，根据上述两个平均速率阈值将平均速率对应的区间划分为3个平均速率区间，依次如下：Here, the preset average rate thresholds may be set for the scene-based audio signal. For example, the first average rate threshold Thres1 is set to 13.2 kbps and the second average rate threshold Thres2 is set to 32 kbps, and according to these two average rate thresholds, the range of average rates is divided into 3 average rate intervals, as follows:
平均速率区间一:小于等于13.2kbps;Average rate interval one: less than or equal to 13.2kbps;
平均速率区间二:大于13.2kbps小于32kbps;Average rate interval two: greater than 13.2kbps and less than 32kbps;
平均速率区间三:大于等于32kbps。Average rate interval three: greater than or equal to 32kbps.
本公开实施例中，根据初始平均速率和预先设置的平均速率阈值确定目标平均速率，在根据平均速率阈值确定平均速率阈值区间的情况下，针对不同平均速率阈值区间分别设置对应的输出声道数，从而，可以根据初始平均速率所属的平均速率阈值区间，确定对应的目标输出声道数。In the embodiment of the present disclosure, the target average rate is determined based on the initial average rate and the preset average rate thresholds. With the average rate intervals determined by the thresholds, a corresponding number of output channels is set for each interval; thus, the corresponding target number of output channels can be determined from the average rate interval to which the initial average rate belongs.
基于此，在确定目标输出声道数的情况下，可以根据目标输出声道数和编码速率，计算得到目标平均速率。Based on this, once the target number of output channels has been determined, the target average rate can be calculated from the target number of output channels and the encoding rate.
示例性地，平均速率区间一对应的输出声道数为2，平均速率区间二对应的输出声道数为3，平均速率区间三对应的输出声道数为4，在初始平均速率为24kbps的情况下，确定属于平均速率区间二，可以确定目标输出声道数为3，则可以计算得到目标平均速率为96kbps/3=32kbps。可见，平均速率区间二目标平均速率相比于初始平均速率有所上升，以在后续确定对音频信号的目标控制参数时，能够确定合适的目标控制参数，并根据目标控制参数确定下混处理算法，从而使得下混处理后的输出声道数与此编码速率下核心编码器所能编码的声道数相匹配，达成对所能使用比特的最优利用，即低速率时能够保证提供清晰稳定可懂的音频服务，高速率时能够保证提供高清稳定沉浸式的音频服务。For example, suppose interval one corresponds to 2 output channels, interval two to 3, and interval three to 4. With an initial average rate of 24 kbps, the rate falls in interval two, so the target number of output channels is determined to be 3, and the target average rate is calculated as 96 kbps / 3 = 32 kbps. In interval two the target average rate is thus higher than the initial average rate, so that when the target control parameters for the audio signal are subsequently determined, suitable parameters can be chosen and the downmix processing algorithm determined from them. The number of output channels after downmix processing then matches the number of channels that the core encoder can encode at this encoding rate, achieving optimal utilization of the available bits: at low rates a clear, stable, and intelligible audio service can be guaranteed, while at high rates a high-definition, stable, and immersive audio service can be guaranteed.
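A minimal sketch of this interval-to-channel-count mapping, using the example thresholds (13.2 kbps, 32 kbps) and the example output channel counts (2/3/4) from the text; all names are hypothetical and the values are only the worked example, not a normative table:

```python
THRES1_KBPS = 13.2  # first average rate threshold from the example
THRES2_KBPS = 32.0  # second average rate threshold from the example

def target_output_channels(initial_avg_kbps: float) -> int:
    """Map the initial per-channel average rate to a target output channel count."""
    if initial_avg_kbps <= THRES1_KBPS:   # interval one
        return 2
    elif initial_avg_kbps < THRES2_KBPS:  # interval two
        return 3
    else:                                 # interval three
        return 4

def target_average_rate(encoding_rate_kbps: float, initial_avg_kbps: float) -> float:
    """Target average rate = total encoding rate / target number of output channels."""
    return encoding_rate_kbps / target_output_channels(initial_avg_kbps)

# Worked example from the text: 96 kbps over 4 channels -> 24 kbps initial rate,
# which falls in interval two -> 3 output channels -> 96/3 = 32 kbps target rate.
print(target_average_rate(96.0, 24.0))  # 32.0
```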
本公开实施例中，针对3个平均速率区间对基于场景的音频信号可以选择使用3种不同类型的下混处理算法，经过选择的下混处理后，平均速率区间一和平均速率区间二每个声道所能使用的平均速率上升，平均速率区间三因为编码速率足够丰富而选择不进行下混处理，即将输入信号直接作为下混处理的输出信号，即下混处理后每个声道所能使用的平均速率保持不变。In the embodiment of the present disclosure, three different types of downmix processing algorithms may be selected for the three average rate intervals of the scene-based audio signal. After the selected downmix processing, the average rate usable by each channel rises for average rate intervals one and two; for average rate interval three, because the encoding rate is already sufficient, no downmixing is performed, i.e., the input signal is used directly as the output of the downmix processing, so the average rate usable by each channel remains unchanged.
示例性地，如下表2所示，一些基于场景的音频信号，初始平均速率（初始每个声道所能分配平均速率）和预先设置的平均速率阈值，以及对应的输出声道数（下混处理后声道数），和确定的目标平均速率（下混处理后每个声道能分配的平均速率）的情况。Exemplarily, Table 2 below shows, for some scene-based audio signals, the initial average rate (the average rate initially allocatable to each channel), the preset average rate thresholds, the corresponding number of output channels (the number of channels after downmix processing), and the determined target average rate (the average rate allocatable to each channel after downmix processing).
由下表2可以看出，下混处理后每个声道能分配的平均速率大于或者等于获取的每个声道所能用的平均比特数，能够对所能使用的比特数进行充分利用，避免比特数的浪费，为远端用户提供与编码速率相匹配的音频服务。As can be seen from Table 2 below, the average rate allocatable to each channel after downmix processing is greater than or equal to the average number of bits available per channel, so the available bits are fully utilized, waste of bits is avoided, and the far-end user is provided with an audio service that matches the encoding rate.
Figure PCTCN2022103170-appb-000001
表2Table 2
可以理解的是，表2中的每一个元素都是独立存在的，这些元素被示例性地列在同一张表格中，但是并不代表表格中的所有元素必须如表格所示同时存在。其中每一个元素的值，不依赖于表2中任何其他元素的值。因此本领域内技术人员可以理解，该表2中的每一个元素的取值都是一个独立的实施例。It should be understood that each element in Table 2 exists independently; these elements are exemplarily listed in the same table, but this does not mean that all elements in the table must coexist as shown. The value of each element does not depend on the value of any other element in Table 2. Therefore, those skilled in the art will appreciate that the value of each element in Table 2 constitutes an independent embodiment.
本公开实施例中，根据初始平均速率和预先设置的平均速率阈值确定目标平均速率，除上述示例的方法外，还可以为确定与初始平均速率最接近的平均速率阈值为目标平均速率，或者，还可以为直接确定初始平均速率为目标平均速率，或者，还可以为确定大于初始平均速率的平均速率阈值中，与初始平均速率最接近的平均速率阈值为目标平均速率，等等，本公开实施例对此不作具体限制。In the embodiment of the present disclosure, besides the method exemplified above, determining the target average rate based on the initial average rate and the preset average rate thresholds may also be: taking the threshold closest to the initial average rate as the target average rate; or directly taking the initial average rate as the target average rate; or, among the thresholds greater than the initial average rate, taking the one closest to the initial average rate as the target average rate; and so on. The embodiments of the present disclosure impose no specific limitation on this.
本公开实施例中，在确定目标平均速率之后，根据初始平均速率和目标平均速率，确定对音频信号的目标控制参数，可以为预先设置初始平均速率和目标平均速率与控制参数的对应关系，例如：设置初始平均速率和目标平均速率与控制参数的对应关系，或者设置初始平均速率和目标平均速率之间的差值与控制参数的对应关系，或者设置初始平均速率和目标平均速率之间的差值绝对值与控制参数的对应关系，或者设置初始平均速率和目标平均速率的和与控制参数的对应关系，等等，本公开实施例对此不作具体限制。In the embodiment of the present disclosure, after the target average rate is determined, the target control parameters for the audio signal are determined from the initial average rate and the target average rate. A correspondence between the initial and target average rates and the control parameters may be preset, for example: a correspondence between the pair (initial average rate, target average rate) and the control parameters; or between the difference of the initial and target average rates and the control parameters; or between the absolute value of that difference and the control parameters; or between the sum of the initial and target average rates and the control parameters; and so on. The embodiments of the present disclosure impose no specific limitation on this.
一种下混处理算法是根据目标输出声道数和获取基于场景的音频信号的声道数设计下混转换矩阵，例如声道数为N，目标输出声道数为M，则转换矩阵为M*N，N和M均为正整数，M小于或等于N。One downmix processing algorithm designs a downmix conversion matrix from the target number of output channels and the number of channels of the acquired scene-based audio signal. For example, if the number of channels is N and the target number of output channels is M, the conversion matrix is M*N, where N and M are both positive integers and M is less than or equal to N.
转换矩阵M*N满足如下关系:The transformation matrix M*N satisfies the following relationship:
[M*1]=[M*N]*[N*1][M*1]=[M*N]*[N*1]
其中,[M*1]表示M乘1的矩阵;[M*N]表示M乘N的矩阵;[N*1]表示N乘1的矩阵。Among them, [M*1] represents a matrix of M by 1; [M*N] represents a matrix of M by N; [N*1] represents a matrix of N by 1.
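The relation [M*1] = [M*N] * [N*1] can be sketched, for a single sample frame, as a plain matrix-vector product; this is an illustrative implementation only, and the coefficients in the example are hypothetical:

```python
def apply_downmix_matrix(matrix, sample):
    # matrix: M rows of N coefficients; sample: the N input channel values.
    # Returns the M downmixed channel values: [M*1] = [M*N] * [N*1].
    assert all(len(row) == len(sample) for row in matrix)
    return [sum(c * x for c, x in zip(row, sample)) for row in matrix]

# Hypothetical 2x2 example: mid channel (average) and side channel (difference).
m = [[0.5, 0.5],
     [1.0, -1.0]]
print(apply_downmix_matrix(m, [8, 4]))  # [6.0, 4.0]
```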
为了方便理解,本公开实施例提供一示例性实施例。For ease of understanding, the embodiment of the present disclosure provides an exemplary embodiment.
示例性实施例中，获取基于场景的音频信号为FOA格式的音频信号，则其声道数为4个，即：W，X，Y，Z，选择的编码速率为96kbps。下混处理后目标输出声道数为3个声道，其中W表示一个包含了声场中各个方向所有的声音以相同的增益和相位叠加后的分量，X表示声场中前后方向的分量，Y表示声场中左右方向的分量，Z表示声场中上下方向的分量，坐标示意图如图2所示。In an exemplary embodiment, the acquired scene-based audio signal is in FOA format, so its number of channels is 4, namely W, X, Y, and Z, and the selected encoding rate is 96 kbps. The target number of output channels after downmix processing is 3, where W denotes a component containing all sounds from every direction in the sound field superimposed with identical gain and phase, X denotes the front-back component of the sound field, Y denotes the left-right component, and Z denotes the up-down component; a coordinate diagram is shown in Figure 2.
当下混处理后目标声道数为3个时，采用忽略掉上下方向的Z分量，只保留W，X，Y共3个声道分量，这种策略的考虑点有两个：第一，重建声场中，回放端听者对前后和左右方向的分量比较敏感，对上下方向的分量敏感度较低；第二，一般音频场景的声场中上下分量的声源较少。下混处理后的声道数为3个，每个声道所能分配的平均编码速率为96kbps/3=32kbps，编码核在此平均编码速率下能够编码重建质量很高的音频信号，从而达到给远端用户提供高清稳定沉浸式的音频服务。When the target number of channels after downmix processing is 3, the up-down Z component is discarded and only the 3 channel components W, X, and Y are retained. This strategy is based on two considerations: first, in the reconstructed sound field, the listener at the playback end is more sensitive to the front-back and left-right components and less sensitive to the up-down component; second, in the sound field of typical audio scenes there are fewer sound sources in the up-down component. The number of channels after downmixing is 3, so the average encoding rate allocatable to each channel is 96 kbps / 3 = 32 kbps; at this average encoding rate the encoding core can encode audio signals with very high reconstruction quality, thereby providing the far-end user with a high-definition, stable, and immersive audio service.
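Under the strategy described above, the 3*4 conversion matrix simply selects W, X, and Y and discards Z. A sketch follows; the identity-style matrix values are one obvious realization of "keep W, X, Y", not coefficients taken from the disclosure:

```python
# Hypothetical 3x4 selection matrix: keep W, X, Y; drop the vertical Z
# component, to which playback-side listeners are less sensitive.
FOA_TO_3CH = [
    [1.0, 0.0, 0.0, 0.0],  # W
    [0.0, 1.0, 0.0, 0.0],  # X
    [0.0, 0.0, 1.0, 0.0],  # Y
]

def downmix_foa_sample(wxyz):
    """Downmix one FOA sample [W, X, Y, Z] to [W, X, Y] via [M*1]=[M*N]*[N*1]."""
    return [sum(c * x for c, x in zip(row, wxyz)) for row in FOA_TO_3CH]

print(downmix_foa_sample([0.5, 0.1, -0.2, 0.3]))  # [0.5, 0.1, -0.2]
```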
图8是本公开实施例提供的一种音频信号的编码装置的结构图。FIG. 8 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
如图8所示,音频信号的编码装置1,包括:信号获取单元11、信息确定单元12和编码处理单元13。As shown in FIG. 8 , the audio signal encoding device 1 includes: a signal acquisition unit 11 , an information determination unit 12 and an encoding processing unit 13 .
信号获取单元11,被配置为获取基于场景的音频信号。The signal acquisition unit 11 is configured to acquire scene-based audio signals.
信息确定单元12,被配置为确定音频信号的声道数和编码速率。The information determining unit 12 is configured to determine the number of channels and the encoding rate of the audio signal.
编码处理单元13,被配置为根据声道数和编码速率,对音频信号进行编码处理,以生成编码码流。The encoding processing unit 13 is configured to encode the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
通过实施本公开实施例，信号获取单元11获取基于场景的音频信号，信息确定单元12确定音频信号的声道数和编码速率，编码处理单元13根据声道数和编码速率，对音频信号进行编码处理，以生成编码码流，由此，根据声道数和编码速率对音频信号进行编码处理，在编码过程中，能够对所能使用的比特数进行充分利用，避免比特数的浪费，为远端用户提供与编码速率相匹配的音频服务。By implementing the embodiments of the present disclosure, the signal acquisition unit 11 acquires a scene-based audio signal, the information determination unit 12 determines the number of channels and the encoding rate of the audio signal, and the encoding processing unit 13 encodes the audio signal according to the number of channels and the encoding rate to generate an encoded code stream. The audio signal is thus encoded according to the number of channels and the encoding rate; during encoding, the available bits can be fully utilized, waste of bits is avoided, and the far-end user is provided with an audio service that matches the encoding rate.
如图9所示,在一些实施例中,编码处理单元13,包括:下混处理模块131、参数生成模块132和码流生成模块133。As shown in Figure 9, in some embodiments, the encoding processing unit 13 includes: a downmix processing module 131, a parameter generation module 132, and a code stream generation module 133.
下混处理模块131,被配置为根据声道数和编码速率,对音频信号进行下混处理,以生成下混参数和下混声道信号。The downmix processing module 131 is configured to perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and downmix channel signals.
参数生成模块132,被配置为对下混声道信号进行编码处理,生成编码参数。The parameter generation module 132 is configured to perform encoding processing on the downmix channel signal and generate encoding parameters.
码流生成模块133,被配置为将下混参数和编码参数进行码流复用,以生成编码码流。The code stream generation module 133 is configured to perform code stream multiplexing on downmix parameters and encoding parameters to generate an encoded code stream.
如图10所示,在一些实施例中,下混处理模块131,包括:参数确定子模块1311、算法确定子模块1312和下混处理子模块1313。As shown in Figure 10, in some embodiments, the downmix processing module 131 includes: a parameter determination sub-module 1311, an algorithm determination sub-module 1312 and a downmix processing sub-module 1313.
参数确定子模块1311,被配置为根据声道数和编码速率,确定对音频信号的目标控制参数。The parameter determination sub-module 1311 is configured to determine target control parameters for the audio signal according to the number of channels and the encoding rate.
算法确定子模块1312,被配置为根据目标控制参数,确定下混处理算法。The algorithm determination sub-module 1312 is configured to determine the downmix processing algorithm according to the target control parameters.
下混处理子模块1313,被配置为根据下混处理算法,对音频信号进行下混处理,以生成下混参数和下混声道信号。The downmix processing sub-module 1313 is configured to perform downmix processing on the audio signal according to the downmix processing algorithm to generate downmix parameters and downmix channel signals.
在一些实施例中,参数确定子模块1311,还被配置为:In some embodiments, the parameter determination sub-module 1311 is also configured to:
根据声道数和编码速率，计算每个声道的初始平均速率；Calculate the initial average rate of each channel according to the number of channels and the encoding rate;
根据初始平均速率和预先设置的平均速率阈值确定目标平均速率;Determine the target average rate based on the initial average rate and a preset average rate threshold;
根据初始平均速率和目标平均速率，确定对音频信号的目标控制参数。According to the initial average rate and the target average rate, the target control parameters for the audio signal are determined.
如图11所示,在一些实施例中,音频信号的编码装置1,还包括:预处理单元14。As shown in Figure 11, in some embodiments, the audio signal encoding device 1 further includes: a pre-processing unit 14.
预处理单元14,被配置为对音频信号进行预加重和/或高通滤波的预处理。The preprocessing unit 14 is configured to perform pre-emphasis and/or high-pass filtering preprocessing on the audio signal.
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。Regarding the devices in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
本公开实施例提供的音频信号的编码装置,可以执行如上面一些实施例所述的音频信号的编码方法,其有益效果与上述的音频信号的编码方法的有益效果相同,此处不再赘述。The audio signal encoding device provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments. Its beneficial effects are the same as those of the audio signal encoding method described above, and will not be described again here.
图12是根据一示例性实施例示出的一种用于执行音频信号的编码方法的电子设备100的结构图。FIG. 12 is a structural diagram of an electronic device 100 for performing an audio signal encoding method according to an exemplary embodiment.
示例性地,电子设备100可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Illustratively, the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
如图12所示,电子设备100可以包括以下一个或多个组件:处理组件101,存储器102,电源组件103,多媒体组件104,音频组件105,输入/输出(I/O)的接口106,传感器组件107,以及通信组件108。As shown in FIG. 12 , the electronic device 100 may include one or more of the following components: a processing component 101 , a memory 102 , a power supply component 103 , a multimedia component 104 , an audio component 105 , an input/output (I/O) interface 106 , and a sensor. component 107, and communications component 108.
处理组件101通常控制电子设备100的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件101可以包括一个或多个处理器1011来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件101可以包括一个或多个模块,便于处理组件101和其他组件之间的交互。例如,处理组件101可以包括多媒体模块,以方便多媒体组件104和处理组件101之间的交互。The processing component 101 generally controls the overall operations of the electronic device 100, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 101 may include one or more processors 1011 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 101 may include one or more modules that facilitate interaction between processing component 101 and other components. For example, processing component 101 may include a multimedia module to facilitate interaction between multimedia component 104 and processing component 101 .
存储器102被配置为存储各种类型的数据以支持在电子设备100的操作。这些数据的示例包括用于在电子设备100上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。存储器102可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如SRAM(Static Random-Access Memory，静态随机存取存储器)，EEPROM(Electrically Erasable Programmable Read-Only Memory，带电可擦可编程只读存储器)，EPROM(Erasable Programmable Read-Only Memory，可擦除可编程只读存储器)，PROM(Programmable Read-Only Memory，可编程只读存储器)，ROM(Read-Only Memory，只读存储器)，磁存储器，快闪存储器，磁盘或光盘。The memory 102 is configured to store various types of data to support operation of the electronic device 100. Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, and the like. The memory 102 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disk.
电源组件103为电子设备100的各种组件提供电力。电源组件103可以包括电源管理系统,一个或多个电源,及其他与为电子设备100生成、管理和分配电力相关联的组件。 Power supply component 103 provides power to various components of electronic device 100 . Power supply components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 100 .
多媒体组件104包括在所述电子设备100和用户之间的提供一个输出接口的触控显示屏。在一些实施例中,触控显示屏可以包括LCD(Liquid Crystal Display,液晶显示器)和TP(Touch Panel,触摸面板)。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件104包括一个前置摄像头和/或后置摄像头。当电子设备100处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。 Multimedia component 104 includes a touch-sensitive display screen that provides an output interface between the electronic device 100 and the user. In some embodiments, the touch display screen may include LCD (Liquid Crystal Display) and TP (Touch Panel). The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action. In some embodiments, multimedia component 104 includes a front-facing camera and/or a rear-facing camera. When the electronic device 100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.
音频组件105被配置为输出和/或输入音频信号。例如,音频组件105包括一个MIC(Microphone,麦克风),当电子设备100处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器102或经由通信组件108发送。在一些实施例中,音频组件105还包括一个扬声器,用于输出音频信号。 Audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a MIC (Microphone), and when the electronic device 100 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signals may be further stored in memory 102 or sent via communications component 108 . In some embodiments, audio component 105 also includes a speaker for outputting audio signals.
I/O接口106为处理组件101和外围接口模块之间提供接口，上述外围接口模块可以是键盘，点击轮，按钮等。这些按钮可包括但不限于：主页按钮、音量按钮、启动按钮和锁定按钮。The I/O interface 106 provides an interface between the processing component 101 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
传感器组件107包括一个或多个传感器，用于为电子设备100提供各个方面的状态评估。例如，传感器组件107可以检测到电子设备100的打开/关闭状态，组件的相对定位，例如所述组件为电子设备100的显示器和小键盘，传感器组件107还可以检测电子设备100或电子设备100一个组件的位置改变，用户与电子设备100接触的存在或不存在，电子设备100方位或加速/减速和电子设备100的温度变化。传感器组件107可以包括接近传感器，被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件107还可以包括光传感器，如CMOS(Complementary Metal Oxide Semiconductor，互补金属氧化物半导体)或CCD(Charge-coupled Device，电荷耦合元件)图像传感器，用于在成像应用中使用。在一些实施例中，该传感器组件107还可以包括加速度传感器，陀螺仪传感器，磁传感器，压力传感器或温度传感器。The sensor component 107 includes one or more sensors for providing status assessments of various aspects of the electronic device 100. For example, the sensor component 107 can detect the open/closed state of the electronic device 100 and the relative positioning of components, for example when the components are the display and keypad of the electronic device 100; the sensor component 107 can also detect a change in position of the electronic device 100 or of one of its components, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and changes in the temperature of the electronic device 100. The sensor component 107 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-coupled Device) image sensor, for use in imaging applications. In some embodiments, the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
通信组件108被配置为便于电子设备100和其他设备之间有线或无线方式的通信。电子设备100可以接入基于通信标准的无线网络，如WiFi，2G或3G，或它们的组合。在一个示例性实施例中，通信组件108经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中，所述通信组件108还包括NFC(Near Field Communication，近场通信)模块，以促进短程通信。例如，NFC模块可基于RFID(Radio Frequency Identification，射频识别)技术，IrDA(Infrared Data Association，红外数据协会)技术，UWB(Ultra Wide Band，超宽带)技术，BT(Bluetooth，蓝牙)技术和其他技术来实现。The communication component 108 is configured to facilitate wired or wireless communication between the electronic device 100 and other devices. The electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 108 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra Wide Band) technology, BT (Bluetooth) technology, and other technologies.
在示例性实施例中，电子设备100可以被一个或多个ASIC(Application Specific Integrated Circuit，专用集成电路)、DSP(Digital Signal Processor，数字信号处理器)、数字信号处理设备(DSPD)、PLD(Programmable Logic Device，可编程逻辑器件)、FPGA(Field Programmable Gate Array，现场可编程逻辑门阵列)、控制器、微控制器、微处理器或其他电子元件实现，用于执行上述音频信号的编码方法。需要说明的是，本实施例的电子设备的实施过程和技术原理参见前述对本公开实施例的音频信号的编码方法的解释说明，此处不再赘述。In an exemplary embodiment, the electronic device 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), digital signal processing devices (DSPDs), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components for performing the above audio signal encoding method. It should be noted that, for the implementation process and technical principles of the electronic device of this embodiment, reference may be made to the foregoing explanation of the audio signal encoding method of the embodiments of the present disclosure, which will not be repeated here.
本公开实施例提供的电子设备100,可以执行如上面一些实施例所述的音频信号的编码方法,其有益效果与上述的音频信号的编码方法的有益效果相同,此处不再赘述。The electronic device 100 provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
为了实现上述实施例,本公开还提出一种存储介质。In order to implement the above embodiments, the present disclosure also proposes a storage medium.
其中，该存储介质中的指令由电子设备的处理器执行时，使得电子设备能够执行如前所述的音频信号的编码方法。例如，所述存储介质可以是ROM(Read-Only Memory，只读存储器)、RAM(Random Access Memory，随机存取存储器)、CD-ROM(Compact Disc Read-Only Memory，紧凑型光盘只读存储器)、磁带、软盘和光数据存储设备等。When the instructions in the storage medium are executed by a processor of the electronic device, the electronic device is enabled to perform the audio signal encoding method described above. For example, the storage medium may be a ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, or the like.
为了实现上述实施例,本公开还提供一种计算机程序产品,该计算机程序由电子设备的处理器执行时,使得电子设备能够执行如前所述的音频信号的编码方法。In order to implement the above embodiments, the present disclosure also provides a computer program product. When the computer program is executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.
本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本公开的其它实施方案。本申请旨在涵盖本公开的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the systems, devices and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be described again here.
以上所述，仅为本公开的具体实施方式，但本公开的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本公开揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本公开的保护范围之内。因此，本公开的保护范围应以所述权利要求的保护范围为准。The above are only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

  1. 一种音频信号的编码方法,其特征在于,包括:An audio signal encoding method, characterized by including:
    获取基于场景的音频信号;Obtain scene-based audio signals;
    确定所述音频信号的声道数和编码速率;Determine the number of channels and coding rate of the audio signal;
    根据所述声道数和所述编码速率,对所述音频信号进行编码处理,以生成编码码流。The audio signal is encoded according to the number of channels and the encoding rate to generate an encoded code stream.
  2. 如权利要求1所述的方法,其特征在于,所述根据所述声道数和所述编码速率,对所述音频信号进行编码处理,以生成编码码流,包括:The method of claim 1, wherein encoding the audio signal according to the number of channels and the encoding rate to generate an encoded code stream includes:
    根据所述声道数和所述编码速率,对所述音频信号进行下混处理,以生成下混参数和下混声道信号;Perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and downmix channel signals;
    对所述下混声道信号进行编码处理,生成编码参数;Perform encoding processing on the downmix channel signal to generate encoding parameters;
    将所述下混参数和所述编码参数进行码流复用,以生成所述编码码流。The downmix parameters and the encoding parameters are code stream multiplexed to generate the encoded code stream.
  3. 如权利要求2所述的方法,其特征在于,所述根据所述声道数和所述编码速率,对所述音频信号进行下混处理,以生成下混参数和下混声道信号,包括:The method of claim 2, wherein performing downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and downmix channel signals includes:
    根据所述声道数和所述编码速率,确定对所述音频信号的目标控制参数;Determine target control parameters for the audio signal according to the number of channels and the encoding rate;
    根据所述目标控制参数,确定下混处理算法;Determine the downmix processing algorithm according to the target control parameters;
    根据所述下混处理算法,对所述音频信号进行下混处理,以生成所述下混参数和所述下混声道信号。According to the downmix processing algorithm, the audio signal is downmixed to generate the downmix parameters and the downmix channel signal.
  4. 如权利要求3所述的方法,其特征在于,所述根据所述声道数和所述编码速率,确定对所述音频信号的目标控制参数,包括:The method of claim 3, wherein determining target control parameters for the audio signal based on the number of channels and the encoding rate includes:
    根据所述声道数和所述编码速率,计算每个声道的初始平均速率;Calculate the initial average rate of each channel according to the number of channels and the encoding rate;
    根据所述初始平均速率和预先设置的平均速率阈值确定目标平均速率;Determine a target average rate according to the initial average rate and a preset average rate threshold;
    根据所述初始平均速率和所述目标平均速率,确定对所述音频信号的所述目标控制参数。The target control parameter for the audio signal is determined based on the initial average rate and the target average rate.
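The per-channel rate derivation recited in claim 4 can be sketched in a few lines. The claim only fixes the inputs (number of channels, encoding rate, preset threshold) and outputs (initial average rate, target average rate, target control parameter); the even split across channels, the clamping rule, and the `low_rate_mode` flag below are hypothetical assumptions added purely for illustration.

```python
def derive_rate_params(total_bitrate_bps: float, num_channels: int,
                       rate_threshold_bps: float) -> dict:
    """Illustrative sketch of the rate derivation in claim 4.

    The clamping rule and the control-parameter mapping are assumptions;
    the claim leaves both open.
    """
    # Initial average rate: the total encoding rate split evenly
    # across the channels of the scene-based signal.
    initial_avg_rate = total_bitrate_bps / num_channels

    # Target average rate: one plausible reading is to clamp the initial
    # average rate against the preset threshold.
    target_avg_rate = min(initial_avg_rate, rate_threshold_bps)

    # Target control parameter: here a single hypothetical flag derived
    # from comparing the two rates.
    return {
        "initial_avg_rate": initial_avg_rate,
        "target_avg_rate": target_avg_rate,
        "low_rate_mode": target_avg_rate < initial_avg_rate,
    }
```

For example, 64 kbit/s over 4 channels with a 24 kbit/s threshold yields a 16 kbit/s initial and target average rate, while 256 kbit/s over the same 4 channels is clamped to the 24 kbit/s threshold.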
5. 如权利要求1至4中任一项所述的方法，其特征在于，在对所述音频信号进行编码处理之前，还包括：The method according to any one of claims 1 to 4, further comprising, before encoding the audio signal:
    对所述音频信号进行预加重和/或高通滤波的预处理。The audio signal is preprocessed by pre-emphasis and/or high-pass filtering.
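The preprocessing in claim 5 (pre-emphasis and/or high-pass filtering) corresponds to standard first-order filters; a minimal sketch is shown below. The coefficients 0.97 and 0.995 are common textbook values, not taken from the application.

```python
def preprocess(signal, pre_emphasis=0.97, hp_coeff=0.995):
    """Sketch of claim 5's preprocessing with illustrative coefficients."""
    # Pre-emphasis: first-order FIR, y[n] = x[n] - a * x[n-1].
    emphasized = [signal[0]] + [
        x - pre_emphasis * x_prev
        for x_prev, x in zip(signal, signal[1:])
    ]
    # High-pass: first-order DC-blocking IIR,
    # y[n] = x[n] - x[n-1] + r * y[n-1].
    out = []
    prev_x = prev_y = 0.0
    for x in emphasized:
        y = x - prev_x + hp_coeff * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

Feeding a constant (DC) signal through this chain decays toward zero, which is the expected behavior of a high-pass/DC-blocking stage.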
  6. 一种音频信号的编码装置,其特征在于,包括:An audio signal encoding device, characterized by including:
    信号获取单元,被配置为获取基于场景的音频信号;a signal acquisition unit configured to acquire scene-based audio signals;
    信息确定单元,被配置为确定所述音频信号的声道数和编码速率;an information determination unit configured to determine the number of channels and the encoding rate of the audio signal;
    编码处理单元,被配置为根据所述声道数和所述编码速率,对所述音频信号进行编码处理,以生成编码码流。An encoding processing unit is configured to perform encoding processing on the audio signal according to the number of channels and the encoding rate to generate an encoded code stream.
  7. 如权利要求6所述的装置,其特征在于,所述编码处理单元,包括:The device of claim 6, wherein the encoding processing unit includes:
下混处理模块，被配置为根据所述声道数和所述编码速率，对所述音频信号进行下混处理，以生成下混参数和下混声道信号；A downmix processing module configured to perform downmix processing on the audio signal according to the number of channels and the encoding rate to generate downmix parameters and downmix channel signals;
    参数生成模块,被配置为对所述下混声道信号进行编码处理,生成编码参数;A parameter generation module configured to perform encoding processing on the downmix channel signal and generate encoding parameters;
码流生成模块，被配置为将所述下混参数和所述编码参数进行码流复用，以生成所述编码码流。A code stream generation module configured to perform code stream multiplexing on the downmix parameters and the encoding parameters to generate the encoded code stream.
  8. 如权利要求7所述的装置,其特征在于,所述下混处理模块,包括:The device according to claim 7, wherein the downmix processing module includes:
    参数确定子模块,被配置为根据所述声道数和所述编码速率,确定对所述音频信号的目标控制参数;A parameter determination submodule configured to determine target control parameters for the audio signal according to the number of channels and the encoding rate;
    算法确定子模块,被配置为根据所述目标控制参数,确定下混处理算法;The algorithm determination submodule is configured to determine the downmix processing algorithm according to the target control parameter;
    下混处理子模块,被配置为根据所述下混处理算法,对所述音频信号进行下混处理,以生成所述下混参数和所述下混声道信号。The downmix processing submodule is configured to perform downmix processing on the audio signal according to the downmix processing algorithm to generate the downmix parameters and the downmix channel signal.
  9. 如权利要求8所述的装置,其特征在于,所述参数确定子模块,还被配置为:The device of claim 8, wherein the parameter determination sub-module is further configured to:
根据所述声道数和所述编码速率，计算每个声道的初始平均速率；Calculate the initial average rate of each channel according to the number of channels and the encoding rate;
    根据所述初始平均速率和预先设置的平均速率阈值确定目标平均速率;Determine a target average rate according to the initial average rate and a preset average rate threshold;
根据所述初始平均速率和所述目标平均速率，确定对所述音频信号的所述目标控制参数。The target control parameter for the audio signal is determined based on the initial average rate and the target average rate.
  10. 如权利要求6至9中任一项所述的装置,其特征在于,还包括:The device according to any one of claims 6 to 9, further comprising:
    预处理单元,被配置为对所述音频信号进行预加重和/或高通滤波的预处理。A preprocessing unit configured to perform pre-emphasis and/or high-pass filtering preprocessing on the audio signal.
  11. 一种电子设备,其特征在于,包括:An electronic device, characterized by including:
    至少一个处理器;以及at least one processor; and
    与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein,
所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行权利要求1至5中任一项所述的方法。The memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1 to 5.
  12. 一种存储有计算机指令的非瞬时计算机可读存储介质,其特征在于,所述计算机指令用于使所述计算机执行权利要求1至5中任一项所述的方法。A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause the computer to execute the method described in any one of claims 1 to 5.
  13. 一种计算机程序产品,包括计算机指令,其特征在于,所述计算机指令在被处理器执行时实现权利要求1至5中任一项所述的方法。A computer program product comprising computer instructions, characterized in that, when executed by a processor, the computer instructions implement the method of any one of claims 1 to 5.
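The overall flow of claims 1 to 3 (acquire a scene-based signal, downmix according to channel count and encoding rate, encode the downmix channels, then multiplex downmix parameters and encoding parameters into one bitstream) can be sketched end to end. The trivial passive downmix, the placeholder quantizer, and the dict standing in for the bitstream syntax are all hypothetical; the claims do not fix those algorithms.

```python
def encode_audio(channels, bitrate_bps):
    """End-to-end sketch of claims 1-3: downmix -> encode -> multiplex.

    `channels` is a list of per-channel sample lists from a scene-based
    (e.g. Ambisonics) signal; every concrete step below is a placeholder.
    """
    num_channels = len(channels)
    # Step 1: downmix according to channel count and encoding rate.
    # A trivial equal-weight downmix to one channel stands in for the
    # rate-dependent algorithm of claim 3.
    downmix_params = {"num_channels": num_channels, "bitrate": bitrate_bps}
    downmix = [sum(frame) / num_channels for frame in zip(*channels)]
    # Step 2: encode the downmix channel signal (placeholder 8-bit
    # quantizer standing in for the core codec).
    encoding_params = [round(s * 127) for s in downmix]
    # Step 3: multiplex downmix parameters and encoding parameters into
    # a single "bitstream" (a dict standing in for the real syntax).
    return {"downmix_params": downmix_params, "payload": encoding_params}
```

A caller would pass the captured scene-based channels and the negotiated encoding rate, and transmit or store the multiplexed result.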
PCT/CN2022/103170 2022-06-30 2022-06-30 Audio signal encoding method and apparatus, and electronic device and storage medium WO2024000534A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/103170 WO2024000534A1 (en) 2022-06-30 2022-06-30 Audio signal encoding method and apparatus, and electronic device and storage medium
CN202280002189.0A CN117643073A (en) 2022-06-30 2022-06-30 Audio signal encoding method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/103170 WO2024000534A1 (en) 2022-06-30 2022-06-30 Audio signal encoding method and apparatus, and electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2024000534A1 true WO2024000534A1 (en) 2024-01-04

Family

ID=89383874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/103170 WO2024000534A1 (en) 2022-06-30 2022-06-30 Audio signal encoding method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN117643073A (en)
WO (1) WO2024000534A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110002393A1 (en) * 2009-07-03 2011-01-06 Fujitsu Limited Audio encoding device, audio encoding method, and video transmission device
CN109243488A (en) * 2018-10-30 2019-01-18 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device and storage medium
CN110335615A (en) * 2019-05-05 2019-10-15 北京字节跳动网络技术有限公司 Processing method, device, electronic equipment and the storage medium of audio data
CN114582357A (en) * 2020-11-30 2022-06-03 华为技术有限公司 Audio coding and decoding method and device

Also Published As

Publication number Publication date
CN117643073A (en) 2024-03-01

Similar Documents

Publication Publication Date Title
EP3139640A2 (en) Method and device for achieving object audio recording and electronic apparatus
CN106412772B (en) Camera driven audio spatialization
KR101431934B1 (en) An apparatus and a method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
US20140341280A1 (en) Multiple region video conference encoding
US20150034643A1 (en) Sealing disk for induction sealing a container
CN112673649B (en) Spatial audio enhancement
WO2009051857A2 (en) System and method for video coding using variable compression and object motion tracking
US11870941B2 (en) Audio processing method and electronic device
KR20210072736A (en) Converting audio signals captured in different formats to a reduced number of formats to simplify encoding and decoding operations.
CN114600188A (en) Apparatus and method for audio coding
CN113810589A (en) Electronic device, video shooting method and medium thereof
EP4138381A1 (en) Method and device for video playback
WO2023216119A1 (en) Audio signal encoding method and apparatus, electronic device and storage medium
WO2024000534A1 (en) Audio signal encoding method and apparatus, and electronic device and storage medium
US9930467B2 (en) Sound recording method and device
CN116368460A (en) Audio processing method and device
CN115550559B (en) Video picture display method, device, equipment and storage medium
EP3923280A1 (en) Adapting multi-source inputs for constant rate encoding
CN110166797B (en) Video transcoding method and device, electronic equipment and storage medium
CN116830193A (en) Audio code stream signal processing method, device, electronic equipment and storage medium
CN114631332A (en) Signaling of audio effect metadata in a bitstream
US20240114310A1 (en) Method and System For Efficiently Encoding Scene Positions
EP4280211A1 (en) Sound signal processing method and electronic device
WO2023240653A1 (en) Audio signal format determination method and apparatus
EP4152770A1 (en) A method and apparatus for communication audio handling in immersive audio scene rendering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22948636

Country of ref document: EP

Kind code of ref document: A1