CN117998274A - Audio processing method, device and storage medium - Google Patents


Info

Publication number
CN117998274A
CN117998274A
Authority
CN
China
Prior art keywords
value
sound source
target
audio
angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410405808.1A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410405808.1A
Publication of CN117998274A

Landscapes

  • Stereophonic System (AREA)

Abstract

The application discloses an audio processing method, an audio processing device, and a storage medium. The method includes: acquiring smooth energy values of audio signals emitted by a plurality of sound sources and a first spatial angle value of each sound source relative to a target object; determining the sound source whose audio signal has the largest smooth energy value as a target sound source; determining an azimuth resolution precision for each sound source according to the smooth energy value of each audio signal other than the target audio signal and the smooth energy value of the target audio signal; updating the first spatial angle value according to the azimuth resolution precision to obtain a second spatial angle value; mixing and merging the audio signals of the sound sources that share the same second spatial angle value to obtain a plurality of merged audio signals; performing stereo mixing processing on the target audio signal and the plurality of merged audio signals to obtain a stereo signal; and playing the stereo signal to the target object. The embodiment of the application can effectively reduce the computational load of the processor while providing the user with a good hearing experience.

Description

Audio processing method, device and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio processing method, an audio processing device, and a storage medium.
Background
In current multi-sound-source stereo audio processing technology, in order to let a user hear sound with a stronger stereoscopic sense and spatial hierarchy, and thereby enjoy an immersive hearing experience as if in the actual environment, HRTF (Head-Related Transfer Function) processing is performed separately on the audio signal of each sound source, and stereo mixing processing is then performed on all the HRTF-processed audio signals to obtain a stereo signal.
However, performing HRTF processing on the audio signals of all sound sources requires a large number of processor operations, and as the number of sound sources increases, the amount of HRTF computation performed by the processor increases sharply, which greatly increases its computational load. Therefore, how to reduce the computational load of the processor while providing the user with a good hearing experience is a technical problem to be solved.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the application provides an audio processing method, an audio processing device and a storage medium, which can effectively reduce the calculated amount of a processor under the condition of providing good hearing experience for users.
In one aspect, an embodiment of the present application provides an audio processing method, including the steps of:
acquiring smooth energy values of audio signals emitted by a plurality of sound sources, and a first spatial angle value of each sound source relative to a target object;
determining, among the audio signals emitted by the plurality of sound sources, the one with the largest smooth energy value as a target audio signal, and determining the sound source emitting the target audio signal as a target sound source;
determining an azimuth resolution precision of each sound source other than the target sound source according to the smooth energy value of each audio signal other than the target audio signal and the smooth energy value of the target audio signal;
updating, for each sound source other than the target sound source, the first spatial angle value according to the azimuth resolution precision to obtain a second spatial angle value of that sound source relative to the target object;
mixing and merging the audio signals of a plurality of sound sources having the same second spatial angle value to obtain a plurality of merged audio signals;
performing stereo mixing processing on the target audio signal and the plurality of merged audio signals to obtain a stereo signal;
and playing the stereo signal to the target object.
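The steps above can be sketched as a short Python function. This is an illustrative sketch only, not the claimed implementation: the helper names, the rounding-based angle update, and the sample-wise summation for merging are all assumptions made for clarity.

```python
def process_audio(sources, precision_for_ratio, step_for_precision):
    """Hypothetical sketch of the claimed steps.

    sources: list of dicts with keys 'smooth_energy', 'angle' (first spatial
    angle value, degrees) and 'signal' (mono sample list).
    precision_for_ratio / step_for_precision: assumed mappings from energy
    ratio to azimuth resolution precision, and from precision to angle step.
    """
    # The source whose signal has the largest smooth energy value is the target.
    target = max(sources, key=lambda s: s["smooth_energy"])
    groups = {}
    for s in sources:
        if s is target:
            continue
        # Azimuth resolution precision from the energy ratio to the target.
        ratio = s["smooth_energy"] / target["smooth_energy"]
        step = step_for_precision(precision_for_ratio(ratio))
        # Update the first spatial angle value to a coarser second one.
        second_angle = round(s["angle"] / step) * step
        groups.setdefault(second_angle, []).append(s["signal"])
    # Mix and merge signals sharing the same second spatial angle value.
    merged = {a: [sum(x) for x in zip(*sigs)] for a, sigs in groups.items()}
    return target, merged
```

With the three-source example from the description below (energies 10/20/30, angles 43°/38°/10° and a 10° step), the two non-target sources both land on 40° and are merged into one signal.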
On the other hand, the embodiment of the application also provides an audio processing device, which comprises:
a signal acquisition unit, configured to acquire smooth energy values of audio signals sent by a plurality of sound sources, and a first spatial angle value of each sound source relative to a target object;
a signal determination unit configured to determine, as a target audio signal, one of the audio signals emitted from the plurality of sound sources, which has the largest smooth energy value, and determine, as a target sound source, the sound source from which the target audio signal is emitted;
A precision determination unit, configured to determine an azimuth resolution precision of each sound source other than the target sound source according to the smooth energy value of each audio signal other than the target audio signal and the smooth energy value of the target audio signal;
An angle updating unit, configured to update, for each of the sound sources except for the target sound source, the first spatial angle value according to the azimuth resolution precision, to obtain a second spatial angle value of the sound source except for the target sound source with respect to the target object;
A mixing and merging unit, configured to mix and merge the audio signals of the plurality of sound sources with the same second spatial angle value to obtain a plurality of merged audio signals;
A stereo mixing unit, configured to perform stereo mixing processing on the target audio signal and the plurality of combined audio signals, to obtain a stereo signal;
and the signal playing unit is used for playing the stereo signal to the target object.
Optionally, the angle updating unit is further configured to:
Determining a plurality of space angle intervals according to the azimuth resolution precision;
Determining a target space angle interval comprising the first space angle value in a plurality of space angle intervals, wherein the target space angle interval comprises a space angle lower limit value and a space angle upper limit value;
And updating the first space angle value according to the first space angle value, the lower space angle limit value and the upper space angle limit value to obtain a second space angle value of the sound source except the target sound source relative to the target object.
Optionally, the plurality of sound sources are in the same spatial range, and the position of the spatial range where the target object is located is taken as a spatial coordinate system origin; the angle updating unit is further configured to:
Determining a space angle segmentation coefficient according to the azimuth resolution precision;
and carrying out space angle segmentation on the space range according to the space angle segmentation coefficient by taking the space coordinate system origin as a reference to obtain a plurality of space angle sections.
Optionally, the angle updating unit is further configured to:
Calculating a first angle difference between the first spatial angle value and the lower spatial angle limit value and a second angle difference between the first spatial angle value and the upper spatial angle limit value;
And if the first angle difference value is larger than the second angle difference value, updating the first spatial angle value to be the upper spatial angle limit value to obtain a second spatial angle value of the sound source except the target sound source relative to the target object, or if the first angle difference value is smaller than or equal to the second angle difference value, updating the first spatial angle value to be the lower spatial angle limit value to obtain a second spatial angle value of the sound source except the target sound source relative to the target object.
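The interval lookup and nearer-boundary update described above can be sketched as a small helper. The assumption that interval boundaries are consecutive multiples of one step size derived from the azimuth resolution precision is illustrative:

```python
def snap_to_interval(angle, step):
    """Snap a first spatial angle value to the nearer boundary of its
    spatial angle interval.

    Boundaries are assumed to be multiples of `step`. Per the description,
    if the difference to the lower limit exceeds the difference to the
    upper limit, the upper limit is used; otherwise (including ties) the
    lower limit is used.
    """
    lower = (angle // step) * step      # spatial angle lower limit value
    upper = lower + step                # spatial angle upper limit value
    first_diff = angle - lower          # first angle difference
    second_diff = upper - angle         # second angle difference
    return upper if first_diff > second_diff else lower
```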
Optionally, the audio signal emitted by each sound source comprises a plurality of audio frames; the signal acquisition unit is further configured to:
for the audio signal emitted by each sound source, calculating a first audio energy value of the previous audio frame, and calculating a second audio energy value of the current audio frame according to the first audio energy value;
for the audio signal emitted by each sound source, taking the second audio energy value as the smooth energy value.
Optionally, the signal acquisition unit is further configured to:
calculating a relative energy value of the current audio frame with respect to the target object;
and performing weighted summation on the relative energy value and the first audio energy value to obtain the second audio energy value of the current audio frame.
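The weighted summation above amounts to a one-line recursive smoother. The weighting coefficient `alpha` is an assumed parameter; the specification does not fix a value:

```python
def smooth_energy(prev_smooth, relative_energy, alpha=0.9):
    """One smoothing step: the second audio energy value is a weighted sum
    of the previous frame's (first) audio energy value and the current
    frame's relative energy value. `alpha` is an assumption."""
    return alpha * prev_smooth + (1.0 - alpha) * relative_energy
```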
Optionally, the signal acquisition unit is further configured to:
Calculating a distance value between the sound source and the target object;
Calculating the instant energy value of the current audio frame;
and calculating the relative energy value of the current audio frame with respect to the target object according to the distance value and the instant energy value.
Optionally, the signal acquisition unit is further configured to:
Determining a reference distance value;
Calculating the square of the ratio of the reference distance value to the distance value to obtain an energy value scale factor;
multiplying the instant energy value by the energy value scale factor to obtain the relative energy value of the current audio frame with respect to the target object.
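The scale factor computation follows directly from the description: the instant energy is attenuated by the square of (reference distance / distance), i.e. inverse-square attenuation. The default reference distance here is an assumption:

```python
def relative_energy(instant_energy, distance, ref_distance=1.0):
    """Relative energy of the current audio frame at the target object.
    `ref_distance` default is an assumed value."""
    scale = (ref_distance / distance) ** 2  # energy value scale factor
    return instant_energy * scale
```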
Optionally, the accuracy determining unit is further configured to:
for each of the audio signals except the target audio signal, calculating a ratio between the smoothed energy value of the audio signal and the smoothed energy value of the target audio signal to obtain an energy ratio corresponding to the audio signal except the target audio signal;
and determining the azimuth resolution precision of each sound source except the target sound source according to the energy ratio corresponding to each audio signal except the target audio signal.
Optionally, the accuracy determining unit is further configured to:
acquiring an azimuth resolution precision lookup table, wherein the lookup table records the azimuth resolution precision corresponding to each energy ratio;
and looking up, in the azimuth resolution precision lookup table, the azimuth resolution precision of each sound source other than the target sound source according to the energy ratio corresponding to each audio signal other than the target audio signal.
Optionally, the accuracy determining unit is further configured to:
and invoking an azimuth resolution precision mapping function to perform mapping calculation on the energy ratio corresponding to each audio signal other than the target audio signal, so as to obtain the azimuth resolution precision of each sound source other than the target sound source.
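Both the lookup-table variant and the mapping-function variant reduce to mapping an energy ratio to an azimuth resolution precision. A hypothetical table-based sketch follows; the table entries are illustrative, not taken from the specification:

```python
def precision_from_ratio(ratio,
                         table=((0.25, 0.2), (0.5, 0.4),
                                (0.75, 0.6), (1.0, 0.8))):
    """Map an energy ratio (this source's smooth energy / target's smooth
    energy) to an azimuth resolution precision via a lookup table of
    (upper_ratio_bound, precision) pairs. Entries are assumptions."""
    for upper, precision in table:
        if ratio <= upper:
            return precision
    return table[-1][1]  # ratios above 1.0 clamp to the finest precision
```

Note the monotonicity: the closer a source's smooth energy is to the target's, the higher the precision, consistent with stronger sources needing finer spatial resolution.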
In another aspect, an embodiment of the present application further provides an electronic device, including:
At least one processor;
At least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, implements the audio processing method described above.
In another aspect, embodiments of the present application also provide a computer-readable storage medium in which a computer program executable by a processor is stored, the computer program executable by the processor being configured to implement an audio processing method as above.
In another aspect, embodiments of the present application also provide a computer program product, including a computer program or computer instructions, where the computer program or computer instructions are stored in a computer readable storage medium, and where a processor of an electronic device reads the computer program or computer instructions from the computer readable storage medium, and where the processor executes the computer program or computer instructions, so that the electronic device performs the audio processing method as above.
The embodiment of the application has at least the following beneficial effects. First, smooth energy values of audio signals emitted by a plurality of sound sources and a first spatial angle value of each sound source relative to a target object are obtained. Then, among the audio signals emitted by the sound sources, the one with the largest smooth energy value is determined as the target audio signal, and the sound source emitting it is determined as the target sound source. Because the target sound source corresponds to the audio signal with the largest smooth energy value, it can be regarded as the sound source the user perceives most strongly, which helps identify, among the remaining sound sources, weakly perceived sound sources that can be masked without affecting the user's hearing perception. Next, the azimuth resolution precision of each sound source other than the target sound source is determined according to the smooth energy value of each audio signal other than the target audio signal and that of the target audio signal; for each such sound source, the first spatial angle value is updated according to the azimuth resolution precision to obtain a second spatial angle value relative to the target object; and the audio signals of the sound sources having the same second spatial angle value are mixed and merged into a plurality of merged audio signals. Since each sound source other than the target sound source can be regarded as weakly perceived by the user, updating its first spatial angle value to a second spatial angle value via the azimuth resolution precision allows the weakly perceived sound sources to be grouped by second spatial angle value and their audio signals merged, reducing the number of weakly perceived sound sources and achieving the purpose of masking them. Because human hearing naturally tends to ignore weakly perceived sound sources in the presence of a strongly perceived one, merging them does not noticeably change what the user hears. Moreover, since the number of weakly perceived sound sources is reduced, when the target audio signal and the plurality of merged audio signals undergo stereo mixing processing to obtain a stereo signal, the computational load of the processor can be effectively reduced while still providing the user with a good hearing experience.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and do not limit the application.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart of an audio processing method according to an embodiment of the present application;
FIG. 3 is a schematic illustration of multiple sound sources provided by one embodiment of the present application;
FIG. 4 is a schematic diagram of a first horizontal spatial angle value provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a first vertical spatial angle value provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of angle updating at 10° provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of angle updating at 20° provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of angle updating at 30° provided by an embodiment of the present application;
fig. 9 is a schematic diagram of a stereo mixing process provided by a specific example of the present application;
FIG. 10 is a schematic illustration of sub-horizontal spatial ranges divided by spatial ranges provided by one embodiment of the present application;
Fig. 11 is a schematic diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The application will be further described with reference to the drawings and specific examples. The described embodiments should not be taken as limitations of the present application, and all other embodiments that would be obvious to one of ordinary skill in the art without making any inventive effort are intended to be within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technologies, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Sound source: refers to an object or device that produces sound. The sound source may be an emitter of sound waves, and may be a human being, an animal, a musical instrument, a speaker, etc. The sound source may be a solid, liquid or gas that when vibrated produces sound waves that propagate into the surrounding environment.
Audio signal: refers to a signal representing sound on a time axis. In the digital field, audio signals are typically stored and processed in digital form, which can be analyzed, processed, and transmitted using digital signal processing techniques. The audio signal may be a digital signal obtained by sampling and quantizing sound waves from a sound source, or may be a digital audio file obtained by encoding and compressing the sound waves.
Smoothing energy value: is a value obtained by smoothing the signal energy. In signal processing, smoothing energy values are typically used to reduce noise or fluctuations in the signal, making the signal more stable and easier to analyze. Smoothing energy values are widely used in the fields of audio processing, image processing, signal processing, and the like.
Orientation: in acoustics, azimuth refers to the position or direction of a sound source relative to a listener. Orientation is generally described in terms of a horizontal direction (left-right) and a vertical direction (up-down). In a stereo sound system, the azimuth refers to the left and right positions of a sound source with respect to a listener, and can be simulated by the sound pressure level difference and the phase difference of sound in left and right speakers. The azimuth information is very important for audio localization and creation of surround sound effects.
Azimuth resolution accuracy: refers to the degree of finesse used to distinguish the orientation of different sound sources in acoustic or audio processing. Higher position resolution accuracy means that the position or direction of the sound source relative to the listener can be more accurately determined, thereby providing a more realistic audio experience.
Mixing and merging: refers to a process of mixing multiple audio signals together to form a single audio signal. In audio processing, mixing and combining is typically used to mix together audio signals of different sound sources to create a richer audio effect or mixing effect. Mixing and combining can be achieved by adjusting parameters such as volume, balance, delay, phase, etc. of the audio signal.
Stereo mixing processing: refers to a process of mixing a plurality of audio signals together in audio processing to create an audio effect in a stereo environment. The stereo mixing process typically involves adjusting parameters of the sound signal such as balance, channel localization, reverberation, delay, phase, etc., so that the mixed audio exhibits a stereo and spatial sense in the left and right channels. The processing can make the listener feel that the sound is transmitted from different directions, and the stereo perception and the realism of the audio are enhanced. Stereo mixing processing is widely used in the fields of music production, movie production, game development, and the like.
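As a much-simplified stand-in for the HRTF-based rendering discussed in this application, the following sketch pans each mono signal by its horizontal angle using constant-power panning and sums the channels. The function and its gain law are assumptions for illustration, not the patented method:

```python
import math

def mix_to_stereo(signals_with_angles):
    """Toy stereo mix: pan each mono signal by its horizontal angle in
    degrees (-90 = hard left, +90 = hard right) with constant-power
    panning, then sum per channel. Illustrative only."""
    n = max(len(sig) for sig, _ in signals_with_angles)
    left = [0.0] * n
    right = [0.0] * n
    for sig, angle_deg in signals_with_angles:
        # Map [-90, 90] degrees onto [0, pi/2] for the pan law.
        pan = math.radians(angle_deg + 90.0) / 2.0
        gain_l, gain_r = math.cos(pan), math.sin(pan)
        for i, x in enumerate(sig):
            left[i] += gain_l * x
            right[i] += gain_r * x
    return left, right
```

Constant-power panning keeps the summed left/right energy of each source constant regardless of angle, which is why it is the usual cheap approximation when full HRTF filtering is too expensive.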
In current multi-sound-source stereo application scenarios, such as metaverse or simulated-reality applications, it is often necessary to generate a stereo signal from the audio signals emitted by multiple sound sources and then play it to the user. For example, in a simulated-reality subway game, different users participate in the game as virtual avatars and generate their own sound signals, while the simulated facilities of the game also construct changing environmental background sound sources according to the current scene, such as subway car sounds, broadcast announcements, and various noisy sounds; the audio signals of these many sound sources are mixed and ultimately reach each user's ears. When stereo audio processing is performed on these sound sources, HRTF processing must be performed separately on the audio signal of each sound source, and stereo mixing processing then performed on all the HRTF-processed audio signals to obtain a stereo signal. However, performing HRTF processing on the audio signals of all sound sources requires a large amount of computation from the processor of a terminal or server, and as the number of sound sources increases, the required computing power increases sharply; a high computational load may cause the game to stutter, the audio and video to become choppy, and in-game actions to respond with delay. Therefore, in applications with many participating users, such as multiplayer games, computational overhead is a problem that must be addressed with emphasis, and how to reduce the computational load of the processor while providing the user with a good hearing experience is a technical problem to be solved.
In order to effectively reduce the computational load of the processor while providing the user with a good hearing experience, embodiments of the present application provide an audio processing method, an audio processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product. First, smooth energy values of audio signals emitted by a plurality of sound sources and a first spatial angle value of each sound source relative to a target object are obtained. Then, among these audio signals, the one with the largest smooth energy value is determined as the target audio signal, and the sound source emitting it as the target sound source; the target sound source can be regarded as the sound source the user perceives strongly, and every other sound source as one the user perceives weakly. Next, the azimuth resolution precision of each sound source other than the target sound source is determined according to the smooth energy value of each audio signal other than the target audio signal and that of the target audio signal; for each such sound source, the first spatial angle value is updated according to the azimuth resolution precision to obtain a second spatial angle value relative to the target object; and the audio signals of the sound sources having the same second spatial angle value are mixed and merged into a plurality of merged audio signals. In other words, the weakly perceived sound sources are grouped by their second spatial angle values, reducing their number; because human ears naturally tend to ignore weakly perceived sound sources, merging the audio signals of sound sources with the same second spatial angle value does not affect the user's perception of the overall sound field. Moreover, since the number of weakly perceived sound sources is reduced, when the target audio signal and the plurality of merged audio signals undergo stereo mixing processing to obtain a stereo signal, the computational load of the processor can be effectively reduced while still providing the user with a good hearing experience.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment provided in an embodiment of the present application, where the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 can be directly or indirectly connected by wired or wireless communication. The terminal 101 and the server 102 may be nodes in a blockchain, which is not specifically limited in this embodiment.
In some embodiments, the terminal 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, a VR (Virtual Reality) device, an AR (Augmented Reality) device, and the like. Optionally, the terminal 101 may acquire audio signals emitted from a plurality of sound sources, generate an audio processing request from the plurality of audio signals, and transmit the audio processing request to the server 102.
In some embodiments, the server 102 may be a stand-alone server, a server cluster or distributed system formed by a plurality of servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. In some embodiments, the server 102 primarily takes on computing work and the terminal 101 takes on secondary computing work; or the server 102 takes on secondary computing work and the terminal 101 takes on the primary computing work; or the server 102 and the terminal 101 perform cooperative computing using a distributed computing architecture. Optionally, the server 102 can provide an audio processing service according to a received audio processing request, that is, process a plurality of audio signals to obtain a stereo signal.
Referring to fig. 1, in an application scenario, it is assumed that the terminal 101 is a computer. While the target object is playing audio from a plurality of sound sources through the terminal 101, the terminal 101 acquires the smooth energy value of the audio signal emitted by each sound source and the first spatial angle value of each sound source relative to the target object. For example, suppose there are 3 sound sources in total: the smooth energy value of the audio signal emitted by sound source No. 1 is 10, that of sound source No. 2 is 20, and that of sound source No. 3 is 30; the first spatial angle value of sound source No. 1 is 43°, that of sound source No. 2 is 38°, and that of sound source No. 3 is 10°. The terminal 101 determines, among the audio signals emitted by the plurality of sound sources, the one with the largest smooth energy value as the target audio signal, and the sound source emitting it as the target sound source; here, sound source No. 3 is determined as the target sound source. The terminal 101 then determines the azimuth resolution precision of each sound source other than the target sound source according to the smooth energy value of each audio signal other than the target audio signal and that of the target audio signal. The closer a sound source's smooth energy value is to that of the target sound source, the greater its azimuth resolution precision. For example, the azimuth resolution precision of sound source No. 1 may be determined to be 0.3, and that of sound source No. 2 to be 0.6.
For each sound source other than the target sound source, the terminal 101 updates the first spatial angle value according to the azimuth resolution precision, obtaining a second spatial angle value of that sound source relative to the target object. For example, for sound source No. 1, the angle is determined from the azimuth resolution precision 0.3 and the first spatial angle value 43° (by table look-up, function mapping, or the like), giving a second spatial angle value of 40°. For sound source No. 2, the angle is determined from the azimuth resolution precision 0.6 and the first spatial angle value 38° (again by table look-up, function mapping, or the like), giving a second spatial angle value of 40°. The terminal 101 mixes and combines the audio signals of the plurality of sound sources having the same second spatial angle value to obtain a plurality of combined audio signals. For example, the audio signals of sound source No. 1 and sound source No. 2, whose second spatial angle values are both 40°, are mixed and combined into a combined audio signal corresponding to sound sources No. 1 and No. 2. The terminal 101 performs stereo mixing processing on the target audio signal and the plurality of combined audio signals to obtain a stereo signal. For example, the terminal 101 performs stereo mixing on the audio signal emitted by sound source No. 3 and the combined audio signal corresponding to sound sources No. 1 and No. 2, obtaining a stereo signal corresponding to sound sources No. 1 to No. 3. The terminal 101 then plays the stereo signal to the target object.
In one embodiment, the audio processing services may be provided by the server 102. Specifically, the terminal 101, after acquiring audio signals emitted from a plurality of sound sources, transmits the plurality of audio signals and an audio processing request for the plurality of audio signals to the server 102; the server 102 processes the plurality of audio signals according to the received audio processing request to obtain a stereo signal, and sends the stereo signal to the terminal 101; the terminal 101 receives the stereo signal transmitted from the server 102 and plays the stereo signal to the target object.
In the embodiments of the present application, when related processing is required to be performed according to data related to characteristics of a target object (for example, current location information of the target object) such as attribute information or attribute information set of the target object, permission or consent of the target object is obtained first, and related laws and regulations and standards are complied with for collection, use, processing, and the like of the data. In addition, when the embodiment of the application needs to acquire the attribute information of the target object, the independent permission or independent consent of the target object is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the target object is explicitly acquired, the related data of the necessary target object for enabling the embodiment of the application to normally operate is acquired.
Fig. 2 is a flowchart of an audio processing method provided in an embodiment of the present application, which is performed by a terminal or a server alone or in combination. In this embodiment, the method is described as an example performed by the terminal. Referring to fig. 2, the audio processing method includes, but is not limited to, steps 210 to 270.
Step 210: acquiring smooth energy values of audio signals sent by a plurality of sound sources and a first space angle value of each sound source relative to a target object;
Step 220: among the audio signals emitted by the plurality of sound sources, determining the one with the largest smoothed energy value as the target audio signal, and determining the sound source emitting the target audio signal as the target sound source;
Step 230: determining azimuth resolution accuracy of each sound source other than the target sound source based on the smoothed energy value of each audio signal other than the target audio signal and the smoothed energy value of the target audio signal;
Step 240: updating the first space angle value according to azimuth resolution precision for each sound source except the target sound source to obtain a second space angle value of the sound source except the target sound source relative to the target object;
Step 250: mixing and merging the audio signals of a plurality of sound sources with the same second spatial angle value to obtain a plurality of merged audio signals;
step 260: carrying out stereo mixing processing on the target audio signal and the plurality of combined audio signals to obtain a stereo signal;
step 270: and playing the stereo signal to the target object.
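Steps 210 to 270 can be sketched end to end in Python. The sketch below is illustrative only: the energy-ratio mapping and the bin widths (10° for stronger sources, 20° for weaker ones) are assumptions standing in for the look-up tables and mapping functions described in later embodiments, and `mix` is a hypothetical merging callback.

```python
def process_audio(sources, mix):
    """Illustrative sketch of steps 210-270.

    `sources` maps a source id to (smoothed_energy, first_angle_deg).
    `mix` merges a list of source signals; here signals are just ids.
    """
    # Step 220: the source with the largest smoothed energy is the target.
    target = max(sources, key=lambda s: sources[s][0])
    target_energy = sources[target][0]

    groups = {}
    for s, (energy, angle) in sources.items():
        if s == target:
            continue
        # Step 230: azimuth resolution precision from the energy ratio
        # (illustrative mapping; the embodiments also allow look-up tables).
        acc = energy / target_energy
        # Step 240: coarser angle quantization for weaker sources
        # (assumed bin widths: 10 deg if acc >= 0.5, else 20 deg).
        bin_deg = 10 if acc >= 0.5 else 20
        second_angle = round(angle / bin_deg) * bin_deg
        groups.setdefault(second_angle, []).append(s)

    # Step 250: merge sources that share the same second spatial angle value.
    merged = {angle: mix(ids) for angle, ids in groups.items()}
    # Steps 260-270 (stereo mixing and playback) are omitted here.
    return target, merged
```

Running this on the example above (energies 10/20/30, angles 43°/38°/10°) selects sound source No. 3 as the target and groups sound sources No. 1 and No. 2 at the shared second spatial angle value of 40°.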
In step 210 of an embodiment, the target object may be a person or an object, for example, the target object may be a user using the terminal, or may be an apparatus or a device that receives a stereo signal sent by the terminal, or the like. In the embodiment of the present application, a user using a terminal as a target object is described as an example. The smoothed energy value is used to characterize the energy of the audio signal emitted by the sound source relative to the target object. The first spatial angle value is used to characterize the position of the sound source relative to the target object. For a target object, the position of each sound source relative to the target object is typically different. Fig. 3 is a schematic diagram of a multi-sound source environment provided by an embodiment of the present application. Referring to fig. 3, in a game scene of subway simulated reality, a plurality of sound sources include sound sources 1 to 11. Different users participate in the game in the form of virtual human images, and become sound sources such as the sound source 1 and the sound source 2 in fig. 3. During the game, the related facilities and items can also construct changeable environmental background sound sources, such as subway car sound, broadcast sound, various noisy sounds, etc., according to the current game scene, such as the number 4 sound source in fig. 3. Assuming that the target object corresponds to sound source No. 1, a plurality of audio signals emitted from sound sources No. 2 to No. 11 are processed into one stereo signal, which is then played to the target object in the game. In processing a plurality of audio signals into stereo signals, it is necessary to acquire the positions of the respective sound sources with respect to the target object. 
With the target object as a reference, there are sound sources on the left hand side of the target object, there are sound sources on the right hand side of the target object, there are sound sources right in front of the target object, and so on. A first spatial angle value for each sound source may be determined based on the position of each sound source relative to the target object.
The audio signal may comprise one audio frame or a plurality of audio frames. Each audio frame may be 20 to 30 milliseconds long. In an embodiment, for each audio frame in the audio signal, feature extraction may be performed on the audio frame to obtain audio features such as the zero-crossing rate, volume, pitch, MFCC (Mel-Frequency Cepstral Coefficient) parameters, LPC (Linear Predictive Coding) parameters, and the like; then, endpoint detection (Endpoint Detection) is performed according to features such as the zero-crossing rate, volume, and pitch to obtain sampling values, and the smoothed energy value of the audio frame is calculated from the obtained sampling values.
In another embodiment, if the audio signal emitted by each sound source includes a plurality of audio frames, the smoothed energy value of the previous audio frame can be inherited when calculating the smoothed energy value of the current audio frame, which improves calculation efficiency. Specifically, the process of acquiring the smoothed energy values of the audio signals emitted by the plurality of sound sources in step 210 may include: for the audio signal emitted by each sound source, calculating a first audio energy value of the previous audio frame, and calculating a second audio energy value of the current audio frame from the first audio energy value; the second audio energy value is then taken as the smoothed energy value of the audio signal emitted by that sound source.
In this embodiment, the first audio energy value characterizes the smoothed energy value with which the previous audio frame arrives at the target object. The first audio frame in the audio signal has no previous audio frame; thus, if the current audio frame is the first audio frame in the audio signal, the first audio energy value of the previous audio frame may be taken as zero. The second audio energy value, calculated from the first audio energy value, characterizes the energy of the current audio frame relative to the target object and can serve as the smoothed energy value with which the current audio frame reaches the target object. By exploiting the close correlation between the energies of two audio frames at adjacent moments, this embodiment computes the second audio energy value of the current audio frame from the first audio energy value of the previous audio frame, improving calculation efficiency while ensuring calculation accuracy.
In an embodiment, the process of calculating the second audio energy value of the current audio frame according to the first audio energy value may include: firstly, calculating the relative energy value of the current audio frame relative to a target object; and then, the relative energy value and the first audio energy value are weighted and summed to obtain a second audio energy value of the current audio frame.
The relative energy value is used to characterize the energy of the audio frame after transmission of the relative distance (distance between the sound source and the target object). In one embodiment, the process of calculating the relative energy value of the current audio frame with respect to the target object may include: firstly, calculating a distance value between a sound source and a target object, and calculating an instant energy value of a current audio frame; and then, according to the distance value and the instant energy value, calculating to obtain the relative energy value of the current audio frame relative to the target object.
In an embodiment, the distance value is used to characterize the distance between the sound source and the target object. The distance value can be calculated by the formula (1):
d_i = √((x₀ − x_i)² + (y₀ − y_i)²)    (1)

In formula (1), d_i represents the distance value between the i-th sound source and the target object, (x₀, y₀) is the coordinate value of the target object in the plane rectangular coordinate system, and (x_i, y_i) is the coordinate value of the i-th sound source in the plane rectangular coordinate system; i is the number of the sound source. In an embodiment, the plane rectangular coordinate system lies in the plane where the target object and the plurality of sound sources are located, and its origin may be the coordinates of the target object or the coordinates of any one of the sound sources, which is not specifically limited here. Therefore, for the i-th sound source, the coordinate values of the target object and of the i-th sound source in the plane rectangular coordinate system can be obtained first and then substituted into formula (1) to obtain the distance value between the sound source and the target object.
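Formula (1) is the ordinary Euclidean distance; a minimal sketch, assuming plane coordinates are given as (x, y) pairs:

```python
import math

def distance_value(target_xy, source_xy):
    """Formula (1): distance between the i-th sound source and the target
    object in the plane rectangular coordinate system."""
    x0, y0 = target_xy
    xi, yi = source_xy
    return math.sqrt((x0 - xi) ** 2 + (y0 - yi) ** 2)
```

For example, a target at the origin and a sound source at (3, 4) yield a distance value of 5.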
The instant energy value refers to the energy of an audio frame of the audio signal emitted by a sound source. The audio signal may be high-pass filtered (e.g., filtering out components below 250 Hz) before the instant energy value is calculated; the instant energy value of each audio frame (e.g., a 20 ms frame) is then calculated on the filtered audio signal.
In one embodiment, the instant energy value may be calculated by equation (2):
E_i = Σ_{k=1}^{K} s_i(k)²    (2)

In formula (2), E_i represents the instant energy value of an audio frame of the i-th sound source, k represents the sequence number of a sampling point, K represents the number of sampling points, and s_i(k) represents the sample value of the k-th sampling point of the audio frame of the i-th sound source. Therefore, for the i-th sound source, the sample values of the K sampling points can be acquired first and then substituted into formula (2) to obtain the instant energy value of the sound source.
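Formula (2) sums the squared sample values of one frame; a direct sketch:

```python
def instant_energy(samples):
    """Formula (2): instant energy of one audio frame, i.e. the sum of
    squared sample values over the K sampling points of the frame.
    `samples` is assumed to hold the (high-pass filtered) frame."""
    return sum(s * s for s in samples)
```

For example, a toy frame [1, -2, 2] has instant energy 1 + 4 + 4 = 9.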
In one embodiment, the sound intensity is inversely proportional to the square of the distance. Therefore, after obtaining the distance value and the instant energy value, the process of calculating the relative energy value of the current audio frame with respect to the target object according to the distance value and the instant energy value may include: firstly, determining a reference distance value; then, calculating the square of the ratio of the reference distance value to the distance value to obtain an energy value scale factor; and then multiplying the instant energy value by an energy value scaling factor to obtain the relative energy value of the current audio frame relative to the target object.
The reference distance value is a reference distance constant, and the relative energy value of each sound source is calculated by referring to the reference distance value. The energy value scaling factor is used to characterize the magnitude of the ratio between the reference distance value and the distance value. The smaller the distance value, the larger the energy value scale factor; the larger the distance value, the smaller the energy value scale factor. The larger the energy scaling factor, the smaller the energy loss of the instant energy of the audio frames of the audio signal emitted by the sound source when transmitted to the target object. The result of the multiplication between the instantaneous energy value and the energy value scaling factor is taken as the relative energy value. In one embodiment, the relative energy value may be calculated by equation (3):
E_rel,i = E_i · (d_ref / d_i)²    (3)

In formula (3), E_rel,i represents the relative energy value of the audio frame of the i-th sound source, d_ref is the reference distance value (a reference distance constant), d_i is the distance value between the i-th sound source and the target object, and E_i is the instant energy value of the audio frame of the i-th sound source. Therefore, for the i-th sound source, after the instant energy value and the distance value between the sound source and the target object are calculated, they are substituted into formula (3) to obtain the relative energy value of the audio frame of that sound source.
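Formula (3) applies the inverse-square law via the energy value scale factor; a minimal sketch, with the reference distance constant defaulting to 1.0 as an assumption:

```python
def relative_energy(instant, dist, ref_dist=1.0):
    """Formula (3): scale the instant energy by the squared ratio of the
    reference distance to the source-target distance (inverse-square law).
    `ref_dist` stands in for the reference distance constant."""
    scale = (ref_dist / dist) ** 2  # energy value scale factor
    return instant * scale
```

For example, an instant energy of 100 at twice the reference distance gives a relative energy of 25: doubling the distance quarters the energy.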
Calculating the relative energy value from the distance value and the instant energy value in this way provides a simple and fast calculation mode, improving calculation efficiency while ensuring calculation accuracy.
In one embodiment, after the first audio energy value and the relative energy value are obtained, the second audio energy value may be calculated by equation (4):
Esm_i(j) = α · Esm_i(j−1) + (1 − α) · E_rel,i(j)    (4)

In formula (4), Esm_i(j) is the second audio energy value of the j-th audio frame of the i-th sound source, which also serves as its smoothed energy value; Esm_i(j−1) is the first audio energy value of the (j−1)-th audio frame of the i-th sound source; E_rel,i(j) is the relative energy value of the j-th audio frame of the i-th sound source; and α is a constant that can be set according to actual requirements. Therefore, for the i-th sound source, after the first audio energy value of the (j−1)-th audio frame and the relative energy value of the j-th audio frame are calculated, they are substituted into formula (4) to obtain the second audio energy value of the j-th audio frame of that sound source.

It should be noted that, for the 1st audio frame of each sound source, since there is no 0th audio frame, the first audio energy value of the 0th audio frame may be taken as 0, and the smoothed energy value of the 1st audio frame is then calculated from the relative energy value alone. Specifically, if j = 1, then Esm_i(0) = 0, so Esm_i(1) = (1 − α) · E_rel,i(1), where Esm_i(1) is the smoothed energy value of the 1st audio frame of the i-th sound source and E_rel,i(1) is the relative energy value of that frame. When calculating the smoothed energy value of each audio frame after the 1st, the smoothed energy value of the previous audio frame is used as the first audio energy value.
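The recursion of formula (4) can be sketched as a first-order smoother over a sequence of relative energy values. The symbol α and its default value are assumptions; the patent only specifies a settable constant for the weighted sum.

```python
def smoothed_energies(relative_energies, alpha=0.9):
    """Formula (4): first-order recursive smoothing across audio frames.
    `alpha` is the smoothing constant; Esm(0) is taken as 0, so the first
    frame's smoothed value is (1 - alpha) * Erel(1)."""
    esm = 0.0  # first audio energy value of the (nonexistent) 0th frame
    out = []
    for erel in relative_energies:
        esm = alpha * esm + (1 - alpha) * erel
        out.append(esm)
    return out
```

For example, with alpha = 0.5 and two frames of relative energy 10, the smoothed values are 5.0 and 7.5: each frame inherits half of the previous frame's smoothed energy.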
It should be noted that, the above formula (1), formula (2), formula (3) and formula (4) are not unique, and the above formula may be changed according to actual needs, which is not limited in this embodiment.
The above is a description of the smooth energy values in step 210, and the first spatial angle values in step 210 are described below.
The first spatial angle value may be a first horizontal angle value projected on a horizontal plane, a first vertical angle value projected on a vertical plane, or both. The first horizontal angle value is the included angle between the horizontal reference line and the projection, onto the horizontal plane, of the line connecting the sound source and the target object. In an embodiment, the horizontal reference line is the straight line in the horizontal plane toward which the target object faces. The first horizontal angle value may take values in the range [0°, 360°]. Referring to the plan view shown in fig. 4, the included angle between sound source 1 and the horizontal reference line is 90°, i.e., the first horizontal angle value of sound source 1 is 90°. The included angle between sound source 2 and the horizontal reference line is 30°, i.e., the first horizontal angle value of sound source 2 is 30°. The included angle between sound source 3 and the horizontal reference line is 0°, i.e., the first horizontal angle value of sound source 3 is 0°. The first vertical angle value is the included angle between the vertical reference line and the projection, onto the vertical plane, of the line connecting the sound source and the target object. In an embodiment, the vertical reference line is a straight line perpendicular to the horizontal reference line, with the target object at the intersection of the two. The first vertical angle value may take values in the range [0°, 90°]. Referring to the side view shown in fig. 5, the included angle between sound source 1 and the vertical reference line is 20°, i.e., the first vertical angle value of sound source 1 is 20°. The included angle between sound source 2 and the vertical reference line is 45°, i.e., the first vertical angle value of sound source 2 is 45°. The included angle between sound source 3 and the vertical reference line is 80°, i.e., the first vertical angle value of sound source 3 is 80°.
In one embodiment, there are three cases for the first spatial angle value of a sound source relative to the target object. In the first case, the first spatial angle value includes only the first horizontal angle value. For example, the first horizontal angle value of sound source 1 relative to the target object is 90°, that of sound source 2 is 30°, and that of sound source 3 is 0°. In the second case, the first spatial angle value includes only the first vertical angle value. For example, the first vertical angle value of sound source 1 relative to the target object is 20°, that of sound source 2 is 45°, and that of sound source 3 is 80°. In the third case, the first spatial angle value includes both the first horizontal angle value and the first vertical angle value. For example, sound source 1 has a first horizontal angle value of 90° and a first vertical angle value of 20°; sound source 2 has a first horizontal angle value of 30° and a first vertical angle value of 45°; sound source 3 has a first horizontal angle value of 0° and a first vertical angle value of 80°.
In step 220 of an embodiment, the target sound source is a sound source corresponding to the audio signal with the largest smoothed energy value among the audio signals emitted by the plurality of sound sources. The target object has different degrees of perception for audio signals of different smoothing energy values. The target sound source may be regarded as a strongly perceived sound source of the plurality of sound sources. The remaining sound sources other than the target sound source may be regarded as weakly perceived sound sources. In the presence of a strongly perceived sound source, the target object is likely to be indistinguishable from a weakly perceived sound source. Therefore, in steps 230 to 240 of an embodiment, the first spatial angle value of the weak perception sound source is updated to obtain the second spatial angle value, so that the weak perception sound sources at different positions are categorized based on the second spatial angle value, and the hearing experience of the target object is not affected after the audio signals sent by the plurality of weak perception sound sources with the same second spatial angle value are combined.
In step 230 of an embodiment, the azimuth resolution precision refers to the resolution precision of the azimuth of the sound source with respect to the target object. The larger the smooth energy value of the audio signal emitted by the sound source, the more the target object can perceive the sound source, and the greater the azimuth resolution precision of the sound source. In one embodiment, step 230 may include: for each audio signal except the target audio signal, calculating the ratio between the smooth energy value of the audio signal and the smooth energy value of the target audio signal to obtain the corresponding energy ratio of the audio signal except the target audio signal; the azimuth resolution precision of each sound source other than the target sound source is then determined from the energy ratio corresponding to each audio signal other than the target audio signal.
The energy ratio is used to characterize the difference in energy of the audio signal other than the target audio signal from the target audio signal. The azimuth resolution accuracy is used to characterize the ability to accurately distinguish between different sound source azimuths. The larger the energy ratio, the smaller the difference in energy between the audio signal other than the target audio signal and the target audio signal. The smaller the gap in energy, the greater the impact of the audio signal other than the target audio signal on the audio experience of the target object, the more accurate the position of the sound source other than the target sound source should be relative to the target object. Therefore, the greater the energy ratio, the greater the azimuth resolution accuracy, the finer the updating of the first spatial angle value, so that the position or direction of the sound source other than the target sound source relative to the target object can be more accurately determined, thereby being beneficial to providing more realistic audio experience.
The embodiment for determining the azimuth resolution precision by using the energy ratio has the advantage that the audio influence of each sound source except the target sound source relative to the target sound source can be well evaluated by using the energy ratio, so that the simplicity and universality of determining the azimuth resolution precision are improved.
The above-mentioned process of determining the azimuth resolution precision according to the energy ratio may include at least the following three cases.
In the first case, a look-up table of azimuth resolution precision is firstly obtained; and then searching the azimuth resolution precision of each sound source except the target sound source in an azimuth resolution precision lookup table according to the energy ratio corresponding to each audio signal except the target audio signal.
The azimuth resolution precision lookup table records azimuth resolution precision corresponding to the energy ratio. The azimuth resolution precision lookup table is stored in the terminal or the server and is used for indicating the corresponding relation between the energy ratio and the azimuth resolution precision. Table 1 below is an example of an azimuth resolution precision look-up table:
TABLE 1
For example, if the energy ratio of a certain sound source is 0.8, the corresponding azimuth resolution precision found in Table 1 is 18.
The method of searching the azimuth resolution precision look-up table has the advantages of being simple to implement and having low processing cost.
In the second case, the azimuth resolution precision mapping function is called to perform a mapping calculation on the energy ratio corresponding to each audio signal other than the target audio signal, obtaining the azimuth resolution precision of each sound source other than the target sound source.
In one embodiment, in the azimuth resolution precision mapping function, the azimuth resolution precision is proportional to the energy ratio. The azimuth resolution accuracy mapping function is an increasing function, and as the energy ratio increases, the azimuth resolution accuracy increases. For example, the azimuth resolution accuracy can be calculated by the azimuth resolution accuracy mapping function of formula (5):
Acc = f(Re)    (5)
In the formula (5), acc represents azimuth resolution accuracy, re represents an energy ratio, and f (·) represents an azimuth resolution accuracy mapping function. f (·) is a monotonically increasing function. For example, the energy ratio Re of the target sound source may be obtained by dividing Esm by Esm0 with a smoothing energy value Esm0 for the sound source and a smoothing energy value Esm for one of the other sound sources. At this time, the energy ratio Re of the sound source is input to the formula (5) and calculated, so that the azimuth resolution accuracy of the sound source can be obtained. It should be noted that the azimuth resolution precision mapping function may be set according to actual requirements, which is not specifically limited in this embodiment.
The method for determining the azimuth resolution precision through the azimuth resolution precision mapping function has the advantages of high precision, capability of adjusting the mapping function according to the requirement and high flexibility.
In the third case, an azimuth resolution precision look-up table is first acquired; then, according to the energy ratio corresponding to each audio signal other than the target audio signal, the azimuth resolution precision of each sound source other than the target sound source is searched for in the look-up table; if the energy ratio is not found, the azimuth resolution precision mapping function is called to perform a mapping calculation on the energy ratio corresponding to each audio signal other than the target audio signal, obtaining the azimuth resolution precision of each sound source other than the target sound source.
The method for determining the azimuth resolution precision through the combination of the azimuth resolution precision lookup table and the azimuth resolution precision mapping function has the advantages of high precision, high flexibility and high applicability.
In step 240 of an embodiment, for each sound source other than the target sound source, the first spatial angle value may be updated according to the azimuth resolution precision to obtain a second spatial angle value of that sound source relative to the target object. As described above, the greater the azimuth resolution precision, the finer the update of the first spatial angle value, so that the resulting second spatial angle value is closer to the first spatial angle value. This updates the azimuth of the sound source relative to the target object without discarding the contribution of that sound source's audio signal to the audio experience.
In one embodiment, step 240 may include: firstly, determining a plurality of space angle intervals according to azimuth resolution precision; secondly, determining a target space angle interval comprising a first space angle value in a plurality of space angle intervals, wherein the target space angle interval comprises a space angle lower limit value and a space angle upper limit value; and then, updating the first space angle value according to the first space angle value, the lower space angle limit value and the upper space angle limit value to obtain a second space angle value of the sound source except the target sound source relative to the target object.
The plurality of sound sources are in the same spatial range. The spatial angle section is a section divided by angles included in a spatial range in which the target object is located. Referring to the above, the first spatial angle value may include a first horizontal angle value projected on a horizontal plane and a first vertical angle value projected on a vertical plane. Similarly, the spatial angular range may also include a horizontal spatial angular range projected on a horizontal plane and a vertical spatial angular range projected on a vertical plane.
In the horizontal space, the first spatial angle value is a first horizontal angle value, the spatial angle range is a horizontal spatial angle range, and the spatial angle interval is a horizontal spatial angle interval. In the vertical space, the first spatial angle value is a first vertical angle value, and the spatial angle range is a vertical spatial angle range, and the spatial angle interval is a vertical spatial angle interval. In the stereoscopic space, the first spatial angle value includes a first horizontal angle value and a first vertical angle value at the same time, and the spatial angle range includes a horizontal spatial angle range and a vertical spatial angle range at the same time, and then the spatial angle section includes a horizontal spatial angle section and a vertical spatial angle section at the same time.
In one embodiment, the spatial range uses the position of the target object as the origin of the spatial coordinate system. The process of determining a plurality of spatial angle intervals according to the azimuth resolution precision may include: firstly, determining a space angle segmentation coefficient according to azimuth resolution precision; and then taking the origin of the space coordinate system as a reference, and performing space angle segmentation on the space range according to the space angle segmentation coefficient to obtain a plurality of space angle sections.
The spatial angle division coefficient refers to a coefficient for dividing a spatial angle. The greater the azimuth resolution precision, the smaller the spatial angle division coefficient, so that the spatial angle division of the spatial range is finer. For example, if the azimuth resolution precision is 36, the spatial angle division coefficient is 10°; if the azimuth resolution precision is 18, the spatial angle division coefficient is 20°.
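The two numeric examples above are consistent with the coefficient spanning a full 360° range; the sketch below assumes coefficient = 360° / precision, which the text states only through its examples, not as a general rule.

```python
def division_coefficient(azimuth_resolution: int) -> float:
    # Assumption: coefficient * precision = 360 degrees, matching the
    # examples above (precision 36 -> 10 degrees, precision 18 -> 20 degrees).
    return 360.0 / azimuth_resolution

print(division_coefficient(36))  # 10.0
print(division_coefficient(18))  # 20.0
```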
The above-mentioned process of determining the spatial angle division coefficient according to the azimuth resolution precision may include at least the following three cases.
In the first case, a space angle segmentation coefficient lookup table is acquired; and then searching the space angle segmentation coefficients in a space angle segmentation coefficient lookup table according to the azimuth resolution precision.
The space angle division coefficient lookup table records the space angle division coefficient corresponding to the azimuth resolution precision. The space angle segmentation coefficient lookup table is stored in the terminal or the server and is used for indicating the corresponding relation between the azimuth resolution precision and the space angle segmentation coefficient. Table 2 below is an example of a spatial angle division coefficient lookup table:
TABLE 2
Azimuth resolution precision | Spatial angle division coefficient
36 | 10°
18 | 20°
For example, the azimuth resolution of a certain sound source is 18, and the corresponding spatial angle division coefficient is 20 ° by searching in table 2.
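A minimal sketch of the first case; the dictionary stands in for Table 2 and holds only the two mappings stated in the text (all names are illustrative):

```python
# Hypothetical contents of a spatial angle division coefficient lookup table;
# only the 36 -> 10 and 18 -> 20 entries are stated in the text.
ANGLE_DIVISION_TABLE = {36: 10.0, 18: 20.0}

def lookup_division_coefficient(azimuth_resolution: int):
    # Returns None when the precision has no table entry, so a fallback
    # (such as the mapping function of the second case) can take over.
    return ANGLE_DIVISION_TABLE.get(azimuth_resolution)

print(lookup_division_coefficient(18))  # 20.0
print(lookup_division_coefficient(7))   # None
```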
The method of looking up the spatial angle division coefficient in a lookup table has the advantages of simplicity and low processing cost.
In the second case, a spatial angle division coefficient mapping function is called, and a mapping calculation is performed on the azimuth resolution precision to obtain the spatial angle division coefficient of each sound source other than the target sound source.
In one embodiment, in the spatial angle segmentation coefficient mapping function, the spatial angle segmentation coefficient is inversely proportional to the azimuth resolution precision. The spatial angle division coefficient mapping function is a decreasing function, and as the azimuth resolution precision increases, the spatial angle division coefficient decreases. It should be noted that the mapping function of the spatial angle partition coefficient may be set according to actual requirements, which is not specifically limited in this embodiment.
The method for determining the space angle segmentation coefficients through the space angle segmentation coefficient mapping function has the advantages of high accuracy, capability of adjusting the mapping function according to the needs and high flexibility.
In the third case, a spatial angle division coefficient lookup table is first obtained; then the spatial angle division coefficient is looked up in the table according to the azimuth resolution precision; if it is not found, a spatial angle division coefficient mapping function is called, and a mapping calculation is performed on the azimuth resolution precision to obtain the spatial angle division coefficient of each sound source other than the target sound source.
The method for determining the space angle segmentation coefficients through the space angle segmentation coefficient lookup table and the space angle segmentation coefficient mapping function has the advantages of high accuracy, high flexibility and high applicability.
The spatial angle division coefficients may include a horizontal angle division coefficient and a vertical angle division coefficient. The horizontal angle division coefficient is used for performing spatial angle division on the horizontal spatial range, and the vertical angle division coefficient is used for performing spatial angle division on the vertical spatial range. For example, if the horizontal spatial angle division coefficient is 60° and the horizontal spatial angle range is [0°,360°], 6 horizontal spatial angle intervals are obtained after division, namely [0°,60°], [60°,120°], [120°,180°], [180°,240°], [240°,300°], and [300°,360°]. Likewise, if the vertical spatial angle division coefficient is 30° and the vertical spatial angle range is [0°,90°], 3 vertical spatial angle intervals are obtained after division, namely [0°,30°], [30°,60°], and [60°,90°].
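The interval generation described above can be sketched as follows (the function name is illustrative):

```python
def split_angle_range(lower, upper, coeff):
    """Divide [lower, upper] into consecutive intervals of width coeff."""
    intervals = []
    start = lower
    while start < upper:
        intervals.append((start, min(start + coeff, upper)))
        start += coeff
    return intervals

# 60-degree coefficient over [0, 360] -> 6 horizontal intervals
print(split_angle_range(0.0, 360.0, 60.0))
# 30-degree coefficient over [0, 90] -> 3 vertical intervals
print(split_angle_range(0.0, 90.0, 30.0))
```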
The embodiment for performing angle division on the spatial range through the spatial angle division coefficient has the advantages of simplicity, convenience and high division efficiency.
After the plurality of spatial angle intervals are obtained, a target spatial angle interval containing the first spatial angle value is determined among them, the target spatial angle interval having a spatial angle lower limit value and a spatial angle upper limit value. For example, suppose the first spatial angle value is a first horizontal spatial angle value of 53°. Since 53° falls within [0°,60°] of the 6 horizontal spatial angle intervals of the above example, the target spatial angle interval is [0°,60°], its spatial angle lower limit value is 0°, and its spatial angle upper limit value is 60°.
After the lower spatial angle limit value and the upper spatial angle limit value of the target spatial angle interval are obtained, updating the first spatial angle value according to the first spatial angle value, the lower spatial angle limit value and the upper spatial angle limit value, and obtaining a second spatial angle value of the sound source except the target sound source relative to the target object.
In an embodiment, the process of updating the first spatial angle value to obtain the second spatial angle value of a sound source other than the target sound source relative to the target object may include: first, calculating a first angle difference between the first spatial angle value and the spatial angle lower limit value, and a second angle difference between the first spatial angle value and the spatial angle upper limit value; then, if the first angle difference is larger than the second angle difference, updating the first spatial angle value to the spatial angle upper limit value to obtain the second spatial angle value of the sound source relative to the target object, or, if the first angle difference is smaller than or equal to the second angle difference, updating the first spatial angle value to the spatial angle lower limit value to obtain the second spatial angle value of the sound source relative to the target object.
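Under this rule, the update is a snap to the nearer interval bound, with ties going to the lower limit value; a sketch:

```python
def snap_to_interval_bound(angle, lower, upper):
    # First angle difference: distance to the spatial angle lower limit value.
    first_diff = angle - lower
    # Second angle difference: distance to the spatial angle upper limit value.
    second_diff = upper - angle
    # Larger first difference -> upper limit; otherwise (including ties) lower.
    return upper if first_diff > second_diff else lower

print(snap_to_interval_bound(41.0, 40.0, 50.0))  # 40.0
print(snap_to_interval_bound(53.0, 50.0, 60.0))  # 50.0
print(snap_to_interval_bound(58.0, 50.0, 60.0))  # 60.0
```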
The second spatial angle value described above may have the following three cases:
The first case is that the second spatial angle value includes a second horizontal angle value, the spatial angle lower limit value includes a horizontal angle lower limit value, and the spatial angle upper limit value includes a horizontal angle upper limit value. For example, the first horizontal angle value of a sound source other than the target sound source with respect to the target object is 11°. Assume that the horizontal angle lower limit value is 10° and the horizontal angle upper limit value is 20°. Then, after the first spatial angle value is updated, a second horizontal angle value of 10° of the sound source with respect to the target object is obtained.
The second case is that the second spatial angle value includes a second vertical angle value, and the spatial angle lower limit value includes a vertical angle lower limit value, and the spatial angle upper limit value includes a vertical angle upper limit value. For example, the first vertical angle value of the sound sources other than the target sound source with respect to the target object is 34 °. Assume that the lower vertical angle limit is 30 ° and the upper vertical angle limit is 60 °. Then after updating the first spatial angle value, a second vertical angle value of 30 ° of the sound source with respect to the target object can be obtained.
The third case is that the second spatial angle value includes both the second horizontal angle value and the second vertical angle value, and the spatial angle lower limit value includes both the horizontal angle lower limit value and the vertical angle lower limit value, and the spatial angle upper limit value includes both the horizontal angle upper limit value and the vertical angle upper limit value. In the third case, when the first spatial angle value is updated, the second horizontal angle value may be updated first, and then the second vertical angle value may be updated based on the updated second horizontal angle value; or the second vertical angle value is updated firstly, and then the second horizontal angle value is updated on the basis of the updated second vertical angle value. For example, the first horizontal angle value of the sound sources other than the target sound source with respect to the target object is 11 ° and the first vertical angle value is 34 °. Assume that the lower limit of the horizontal angle is 10 °, the upper limit of the horizontal angle is 20 °, and the lower limit of the vertical angle is 30 °, and the upper limit of the vertical angle is 60 °. Then after updating the first spatial angle value, a second horizontal angle value of 10 ° and a second vertical angle value of 30 ° of the sound source with respect to the target object can be obtained.
In an example, if the horizontal spatial angle division coefficient is 10°, the division positions for the horizontal space may include 0°, 10°, 20°, 30°, 40°, 50°, and so on up to 350°, 36 division positions in total, thereby obtaining 36 spatial angle intervals. If the horizontal spatial angle division coefficient is 20°, the division positions may include 0°, 20°, 40°, 60°, and so on up to 340°, 18 division positions in total, thereby obtaining 18 spatial angle intervals. If the horizontal spatial angle division coefficient is 30°, the division positions may include 0°, 30°, 60°, 90°, 120°, and so on up to 330°, 12 division positions in total, thereby obtaining 12 spatial angle intervals. The division of the horizontal space with other horizontal spatial angle division coefficients is similar and is not repeated. In addition, the division of the vertical space with a vertical spatial angle division coefficient is similar to the division of the horizontal space and is not repeated here.
For example, assume that the first horizontal spatial angle values of sound source 1, sound source 2, and sound source 3 are 41°, 53°, and 58°, respectively. Referring to fig. 6, in the case where the horizontal spatial angle division coefficient is 10°: 41° lies between 40° and 50° and is closer to 40°, so 40° is taken as the second spatial angle value of sound source 1; 53° lies between 50° and 60° and is closer to 50°, so 50° is taken as the second spatial angle value of sound source 2; 58° lies between 50° and 60° and is closer to 60°, so 60° is taken as the second spatial angle value of sound source 3. Referring to fig. 7, in the case where the horizontal spatial angle division coefficient is 20°: 41° lies between 40° and 60° and is closer to 40°, so 40° is taken as the second spatial angle value of sound source 1; 53° lies between 40° and 60° and is closer to 60°, so 60° is taken as the second spatial angle value of sound source 2; 58° lies between 40° and 60° and is closer to 60°, so 60° is taken as the second spatial angle value of sound source 3. Referring to fig. 8, in the case where the horizontal spatial angle division coefficient is 30°: 41° lies between 30° and 60° and is closer to 30°, so 30° is taken as the second spatial angle value of sound source 1; 53° lies between 30° and 60° and is closer to 60°, so 60° is taken as the second spatial angle value of sound source 2; 58° lies between 30° and 60° and is closer to 60°, so 60° is taken as the second spatial angle value of sound source 3.
It should be noted that the final angle expression value (i.e., the second spatial angle value) of each sound source other than the target sound source is different at different spatial angle division coefficients. And when the space angle division coefficient is larger (i.e., the azimuth resolution precision is smaller), the total number of the second space angle values becomes smaller, and the probability that a plurality of sound sources exist under the same second space angle value becomes higher.
The above method of determining the second spatial angle value through the first angle difference and the second angle difference updates the first spatial angle value of each sound source other than the target sound source relative to the target object with improved accuracy and low complexity.
In step 250 of one embodiment, audio signals of a plurality of sound sources with the same second spatial angle value may be mixed and combined to obtain a plurality of combined audio signals. If only one sound source exists at a certain second spatial angle value, the audio signal of the sound source can be directly used as the combined audio signal.
In an embodiment, the methods of mixing and merging may include a direct addition method, an averaging method, a clamping method, a normalization method, an adaptive mixing weighting method, an automatic alignment algorithm, etc.; a suitable method may be selected according to actual needs, which is not limited in this embodiment. Direct addition adds the audio signals together directly. Averaging averages the audio signals to obtain a composite signal. Clamping limits the amplitude of the audio signals, typically to a maximum or minimum value. Normalization maps the amplitude range of an audio signal into a standard range, typically [0,1] or [-1,1]. Adaptive mixing weighting dynamically adjusts the weights of different audio signals according to their characteristics to achieve the best mixing effect. An automatic alignment algorithm automatically aligns the time axes of multiple audio signals for subsequent processing or analysis.
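Two of the listed methods, direct addition and averaging, can be sketched as follows (numpy assumed available; function and parameter names are illustrative):

```python
import numpy as np

def mix_and_merge(signals, method="average"):
    """Merge equal-length audio frames; a sketch covering only two of the
    listed methods: direct addition and averaging."""
    stacked = np.stack(signals)
    if method == "sum":        # direct addition method
        return stacked.sum(axis=0)
    if method == "average":    # averaging method
        return stacked.mean(axis=0)
    raise ValueError(f"unsupported method: {method}")

a = np.array([0.2, -0.4, 0.6])
b = np.array([0.4,  0.0, 0.2])
summed = mix_and_merge([a, b], "sum")
averaged = mix_and_merge([a, b], "average")
```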
In an embodiment, if the second spatial angle value is a second horizontal spatial angle value, audio signals of a plurality of sound sources with the same second horizontal spatial angle value may be mixed and combined to obtain a combined audio signal. For example, in one example, the second horizontal spatial angle values of the sound source No.1, the sound source No. 2, and the sound source No. 3 are all 30 °, and then audio signals of the sound source No.1, the sound source No. 2, and the sound source No. 3 are mixed and combined to obtain a combined audio signal.
In an embodiment, if the second spatial angle value is a second vertical spatial angle value, audio signals of a plurality of sound sources with the same second vertical spatial angle value may be mixed and combined to obtain a combined audio signal. For example, in one example, the second vertical spatial angle values of the sound source No. 1 and the sound source No. 2 are both 30 °, and the second vertical spatial angle value of the sound source No. 3 is 20 °, then the audio signals of the sound source No. 1 and the sound source No. 2 are mixed and combined to obtain one combined audio signal, and the audio signal of the sound source No. 3 is used as the other combined audio signal.
In an embodiment, if the second spatial angle value includes both the second horizontal spatial angle value and the second vertical spatial angle value, audio signals of a plurality of sound sources having the same second horizontal spatial angle value and the same second vertical spatial angle value may be mixed and combined to obtain a combined audio signal. For example, in one example, the second horizontal spatial angle values of the sound source No. 1, the sound source No. 2, and the sound source No. 3 are each 30 °, and the second vertical spatial angle values of the sound source No. 1, the sound source No. 2 are each 30 °, and the sound source No. 3 is each 20 °. Then, the audio signals of the sound source 1 and the sound source 2 are mixed and combined to obtain one combined audio signal, and the audio signal of the sound source 3 is taken as the other combined audio signal.
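The third case above amounts to grouping sources by their pair of second angle values and merging each group; direct addition is chosen arbitrarily here as the merge method, and all names are illustrative:

```python
from collections import defaultdict
import numpy as np

def merge_by_second_angles(sources):
    """sources: list of (h_angle, v_angle, signal) triples. Sources sharing
    both second angle values are merged into one combined audio signal."""
    groups = defaultdict(list)
    for h_angle, v_angle, signal in sources:
        groups[(h_angle, v_angle)].append(signal)
    return {angles: np.sum(sigs, axis=0) for angles, sigs in groups.items()}

merged = merge_by_second_angles([
    (30, 30, np.array([0.1, 0.1])),   # sound source No. 1
    (30, 30, np.array([0.2, 0.2])),   # sound source No. 2
    (30, 20, np.array([0.5, 0.5])),   # sound source No. 3
])
print(len(merged))  # 2 combined audio signals
```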
For the target object, the auditory perceptual resolution of the human ear for multiple sound sources is limited: when both strongly perceived and weakly perceived sound sources are present, the human ear naturally ignores the weakly perceived ones. Based on this, merging the audio signals of some weakly perceived sound sources effectively reduces the number of audio signals to be processed, and thus effectively reduces the computation load of the processor.
In step 260 in an embodiment, a stereo mixing process may be performed on the target audio signal and the plurality of combined audio signals to obtain a stereo signal. The stereo mixing process refers to a process of mixing sound for each of the left channel and the right channel. The stereo signals include a left channel stereo signal and a right channel stereo signal. For each of the target audio signal and the plurality of combined audio signals, the audio signal may be processed using a head related transfer function to obtain a plurality of stereo sub-signals; and then mixing the plurality of stereo sub-signals according to the left channel to obtain a left channel stereo signal, mixing the plurality of stereo sub-signals according to the right channel to obtain a right channel stereo signal, and at the moment, obtaining the stereo signal for playing to the target object.
The head related transfer function (Head Related Transfer Function, HRTF), also known as ATF (anatomical transfer function), is an audio localization algorithm. The HRTF is a set of filters that generate stereo sound effects using ITD (Interaural Time Difference), IAD (Interaural Amplitude Difference), and the frequency response of the pinna, so that when sound reaches the pinna, auditory canal, and tympanic membrane in the human ear, the listener perceives a surround sound effect; through digital signal processing, the HRTF can process sound sources of the virtual world in real time. The HRTF takes an audio signal and azimuth information as input, and outputs a stereo signal. HRTF-based stereo generation convolves the original audio signal u(n) with the HRIR data h(n) of the corresponding azimuth, so the binaural stereo signal y(n) is calculated as y(n) = u(n) * h(n), where * denotes convolution.
After the plurality of stereo sub-signals are obtained, the mixing of the left channel and the mixing of the right channel can be performed separately. Referring to fig. 9, a left channel stereo mixing process is performed on the plurality of stereo sub-signals to obtain a left channel stereo signal, and a right channel stereo mixing process is performed on the plurality of stereo sub-signals to obtain a right channel stereo signal. The mixing may adopt a direct addition method, an averaging method, a clamping method, normalization, adaptive mixing weighting, an automatic alignment algorithm, or the like. Taking the averaging method as an example, the left channel stereo signal is Lout(n) = (1/M)·Σ_{m=1..M} L_m(n), and the right channel stereo signal is Rout(n) = (1/M)·Σ_{m=1..M} R_m(n), where M is the number of audio signals, m is the sequence number of an audio signal, L_m(n) and R_m(n) are the left-channel and right-channel stereo sub-signals of the m-th audio signal, Lout is the left channel stereo signal in the stereo signal, and Rout is the right channel stereo signal in the stereo signal.
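A sketch of the per-source HRIR convolution followed by average mixing of each channel; the single-tap identity HRIRs in the usage are placeholders, since real HRIRs are measured data:

```python
import numpy as np

def binaural_mix(signals, left_hrirs, right_hrirs):
    """Convolve each source signal with its azimuth's HRIR pair, then
    average-mix the resulting stereo sub-signals per channel."""
    left_subs = [np.convolve(u, h) for u, h in zip(signals, left_hrirs)]
    right_subs = [np.convolve(u, h) for u, h in zip(signals, right_hrirs)]
    lout = np.mean(left_subs, axis=0)   # left channel stereo signal
    rout = np.mean(right_subs, axis=0)  # right channel stereo signal
    return lout, rout

identity = np.array([1.0])  # placeholder HRIR: passes the signal through
u1, u2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lout, rout = binaural_mix([u1, u2], [identity, identity], [identity, identity])
```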
In step 270 in an embodiment, a stereo signal may be played to the target object. Specifically, a left channel stereo signal is played to the left channel of the target object, and a right channel stereo signal is played to the right channel of the target object.
In the embodiments of steps 210 to 270, there is only one target sound source, which is equivalent to dividing all sound sources in the whole large spatial range into one strongly perceived sound source and a plurality of weakly perceived sound sources. In practice, however, among the plurality of sound sources there may be sound sources whose smooth energy values differ only slightly from that of the target sound source, and updating the first spatial angle values of these sound sources may degrade the audio experience of the stereo signal. Therefore, in an embodiment, the spatial range in which the target object is located may be divided into a plurality of subspace ranges; a target audio signal and a plurality of combined audio signals are determined in each subspace range according to the audio signals emitted by the sound sources in that subspace range; stereo mixing processing is then performed on the target audio signals and the combined audio signals of all subspace ranges to obtain a stereo signal; the stereo signal is then played to the target object.
In an embodiment, the spatial range in which the target object is located may be divided evenly to obtain a plurality of subspace ranges of equal size. If the spatial range is a horizontal spatial range, the subspace range is a sub-horizontal spatial range. Referring to fig. 10, there are a sound source 1, a sound source 2, and a sound source 3 in the horizontal spatial range in which the target object is located; sound source 1, sound source 2, and sound source 3 all fall within a sub-horizontal spatial range whose horizontal spatial angle interval is 2V°. Similarly, if the spatial range is a vertical spatial range, the subspace range is a sub-vertical spatial range. If the spatial range is a stereoscopic spatial range, the subspace range includes both a sub-horizontal spatial range and a sub-vertical spatial range.
It should be noted that, the specific process of determining the target audio signal and the plurality of combined audio signals in each subspace range may refer to the descriptions of step 210 to step 250 above, and will not be repeated here.
The above embodiment of dividing the whole spatial range into a plurality of subspace ranges and then performing audio processing has the advantage that a plurality of strong perceptual sound sources can be reserved for the target object, so that the computing amount of a processor can be reduced and the audio effect of a stereo signal can be improved.
Referring to fig. 11, the embodiment of the present application also discloses an audio processing apparatus 1100 capable of implementing the audio processing method in the previous embodiment, the audio processing apparatus 1100 includes:
A signal acquisition unit 1110 for acquiring smooth energy values of audio signals emitted from a plurality of sound sources, and a first spatial angle value of each sound source with respect to a target object;
A signal determining unit 1120 for determining, as a target audio signal, one having the largest smooth energy value among the audio signals emitted from the plurality of sound sources, and determining, as a target sound source, a sound source from which the target audio signal is emitted;
A precision determination unit 1130, configured to determine the azimuth resolution precision of each sound source other than the target sound source according to the smooth energy value of each audio signal other than the target audio signal and the smooth energy value of the target audio signal;
An angle updating unit 1140, configured to update, for each sound source except the target sound source, the first spatial angle value according to the azimuth resolution precision, to obtain a second spatial angle value of the sound source except the target sound source relative to the target object;
A mixing and combining unit 1150, configured to mix and combine the audio signals of the plurality of sound sources with the same second spatial angle value to obtain a plurality of combined audio signals;
a stereo mixing unit 1160, configured to perform stereo mixing processing on the target audio signal and the plurality of combined audio signals, to obtain a stereo signal;
the signal playing unit 1170 is configured to play the stereo signal to the target object.
In an embodiment, the angle updating unit 1140 is further configured to:
determining a plurality of space angle intervals according to azimuth resolution precision;
determining a target space angle interval comprising a first space angle value in a plurality of space angle intervals, wherein the target space angle interval comprises a space angle lower limit value and a space angle upper limit value;
and updating the first space angle value according to the first space angle value, the lower space angle limit value and the upper space angle limit value to obtain a second space angle value of the sound source except the target sound source relative to the target object.
In an embodiment, the plurality of sound sources are in the same spatial range, and the spatial range takes the position of the target object as the origin of a spatial coordinate system; the angle update unit 1140 is further configured to:
determining a space angle segmentation coefficient according to the azimuth resolution precision;
And performing space angle segmentation on the space range according to the space angle segmentation coefficient by taking the origin of the space coordinate system as a reference to obtain a plurality of space angle sections.
In an embodiment, the angle updating unit 1140 is further configured to:
Calculating a first angle difference between the first spatial angle value and a lower spatial angle limit value and a second angle difference between the first spatial angle value and an upper spatial angle limit value;
If the first angle difference value is larger than the second angle difference value, the first spatial angle value is updated to be a spatial angle upper bound value, so that a second spatial angle value of the sound source except the target sound source relative to the target object is obtained, or if the first angle difference value is smaller than or equal to the second angle difference value, the first spatial angle value is updated to be a spatial angle lower bound value, so that a second spatial angle value of the sound source except the target sound source relative to the target object is obtained.
In one embodiment, the audio signal emitted by each sound source comprises a plurality of audio frames; the signal acquisition unit 1110 is further configured to:
for the audio signal sent by each sound source, calculating a first audio energy value of the last audio frame, and calculating a second audio energy value of the current audio frame according to the first audio energy value;
For each audio signal emitted by a sound source, the second audio energy value is taken as a smoothing energy value.
In an embodiment, the signal acquisition unit 1110 is further configured to:
Calculating the relative energy value of the current audio frame relative to the target object;
and carrying out weighted summation on the relative energy value and the first audio energy value to obtain a second audio energy value of the current audio frame.
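The weighted summation of the previous frame's audio energy value and the current frame's relative energy value amounts to an exponential smoothing; in the sketch below the weight alpha is an assumed value, not taken from the text:

```python
def second_audio_energy(first_audio_energy, relative_energy, alpha=0.9):
    # Weighted summation of the previous frame's (first) audio energy value
    # and the current frame's relative energy value; alpha is assumed.
    return alpha * first_audio_energy + (1.0 - alpha) * relative_energy

print(second_audio_energy(1.0, 2.0))  # ~1.1
```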
In an embodiment, the signal acquisition unit 1110 is further configured to:
calculating a distance value between the sound source and the target object;
calculating an instant energy value of a current audio frame;
And calculating the relative energy value of the current audio frame relative to the target object according to the distance value and the instant energy value.
In an embodiment, the signal acquisition unit 1110 is further configured to:
Determining a reference distance value;
Calculating the square of the ratio of the reference distance value to the distance value to obtain an energy value scale factor;
and multiplying the instant energy value by an energy value scaling factor to obtain the relative energy value of the current audio frame relative to the target object.
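The scale factor described above is the square of the ratio of the reference distance value to the distance value, i.e. an inverse-square attenuation with range; a sketch:

```python
def relative_energy(instant_energy, distance, reference_distance=1.0):
    # Energy value scale factor: square of (reference distance / distance).
    scale_factor = (reference_distance / distance) ** 2
    return instant_energy * scale_factor

print(relative_energy(8.0, 2.0))  # 2.0: a source twice as far keeps 1/4 energy
```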
In an embodiment, the precision determination unit 1130 is further configured to:
For each audio signal except the target audio signal, calculating the ratio between the smooth energy value of the audio signal and the smooth energy value of the target audio signal to obtain the corresponding energy ratio of the audio signal except the target audio signal;
And determining the azimuth resolution precision of each sound source except the target sound source according to the energy ratio corresponding to each audio signal except the target audio signal.
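Neither the thresholds nor the returned precisions below come from the text; they only illustrate a monotone mapping in which sources whose smooth energy is closer to the target audio signal's are granted finer azimuth resolution:

```python
def azimuth_resolution_precision(energy_ratio):
    # Hypothetical thresholds: energy ratios near 1 (comparable to the
    # target audio signal) get finer resolution, weak sources get coarser.
    if energy_ratio >= 0.5:
        return 36   # e.g. a 10-degree division coefficient
    if energy_ratio >= 0.25:
        return 18   # e.g. a 20-degree division coefficient
    return 12       # e.g. a 30-degree division coefficient

print(azimuth_resolution_precision(0.8))   # 36
print(azimuth_resolution_precision(0.3))   # 18
print(azimuth_resolution_precision(0.1))   # 12
```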
In an embodiment, the precision determination unit 1130 is further configured to:
Acquiring an azimuth resolution precision lookup table, wherein the azimuth resolution precision lookup table records the azimuth resolution precision corresponding to the energy ratio;
and according to the energy ratio corresponding to each audio signal except the target audio signal, searching the azimuth resolution precision of each sound source except the target sound source in an azimuth resolution precision lookup table.
In an embodiment, the precision determination unit 1130 is further configured to:
and calling an azimuth resolution precision mapping function, and performing a mapping calculation on the energy ratio corresponding to each audio signal other than the target audio signal to obtain the azimuth resolution precision of each sound source other than the target sound source.
Referring to fig. 12, the embodiment of the application also discloses an electronic device, and the electronic device 1200 includes:
at least one processor 1210;
at least one memory 1220 for storing at least one program;
When the at least one program is executed by the at least one processor 1210, the audio processing method as before is implemented.
The embodiment of the application also discloses a computer readable storage medium, in which a computer program executable by a processor is stored, which is used for realizing the audio processing method as before when the computer program executable by the processor is executed by the processor.
The embodiment of the application also discloses a computer program product, which comprises a computer program or computer instructions, wherein the computer program or the computer instructions are stored in a computer readable storage medium, the computer program or the computer instructions are read from the computer readable storage medium by a processor of the electronic device, and the processor executes the computer program or the computer instructions, so that the electronic device executes the audio processing method.
It will be appreciated that, although the steps in the flowcharts described above are shown in succession in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated in this embodiment, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented, for example, in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, and both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" or similar expressions means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed between components may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In the present embodiment, the term "module" or "unit" refers to a computer program, or a part of a computer program, that has a predetermined function and works together with other relevant parts to achieve a predetermined objective; it may be implemented in whole or in part by software, hardware (such as a processing circuit or a memory), or a combination thereof. Likewise, one processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of that module or unit.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
The step numbers in the above method embodiments are set for convenience of illustration, and the order of steps is not limited in any way, and the execution order of each step in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.

Claims (15)

1. An audio processing method, comprising the steps of:
Acquiring smooth energy values of audio signals sent by a plurality of sound sources, and a first space angle value of each sound source relative to a target object;
Among the audio signals emitted by the plurality of sound sources, determining the one with the largest smoothed energy value as a target audio signal, and determining the sound source emitting the target audio signal as a target sound source;
Determining an azimuth resolution precision of each of the sound sources other than the target sound source based on the smoothed energy value of each of the audio signals other than the target audio signal and the smoothed energy value of the target audio signal;
updating the first spatial angle value according to the azimuth resolution precision for each sound source except the target sound source to obtain a second spatial angle value of the sound source except the target sound source relative to the target object;
Mixing and merging the audio signals of a plurality of sound sources with the same second spatial angle value to obtain a plurality of merged audio signals;
Carrying out stereo mixing processing on the target audio signal and the plurality of combined audio signals to obtain a stereo signal;
and playing the stereo signal to the target object.
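The method of claim 1, up to the mixing and merging step, can be sketched as follows. The dictionary-based layout, the two-level precision mapping, and the rounding-based quantization are hypothetical simplifications for illustration, not the disclosed implementation:

```python
from collections import defaultdict

def mix_sources(sources):
    """sources: list of dicts with 'signal' (list of float samples),
    'smoothed_energy', and 'angle' (first spatial angle value, degrees).
    Returns (target_signal, merged_signals keyed by second angle value).
    """
    # 1. The source with the largest smoothed energy value is the target.
    target = max(sources, key=lambda s: s["smoothed_energy"])
    groups = defaultdict(list)
    for s in sources:
        if s is target:
            continue
        # 2. Azimuth resolution precision from the energy ratio
        #    (hypothetical two-level mapping).
        ratio = s["smoothed_energy"] / target["smoothed_energy"]
        precision = 5.0 if ratio >= 0.5 else 45.0
        # 3. Quantize the first spatial angle to a second spatial angle.
        angle2 = round(s["angle"] / precision) * precision
        groups[angle2].append(s["signal"])
    # 4. Mix (sum sample-wise) all sources sharing a second angle value.
    merged = {a: [sum(x) for x in zip(*sigs)] for a, sigs in groups.items()}
    return target["signal"], merged
```

Because nearby quiet sources collapse onto the same second spatial angle value, the subsequent stereo mixing has to spatialize far fewer distinct positions, which is where the reduction in processor load comes from.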
2. The method according to claim 1, wherein updating the first spatial angle value according to the azimuth resolution precision to obtain a second spatial angle value of the sound source other than the target sound source with respect to the target object includes:
Determining a plurality of space angle intervals according to the azimuth resolution precision;
Determining, among the plurality of space angle intervals, a target space angle interval comprising the first space angle value, wherein the target space angle interval comprises a space angle lower limit value and a space angle upper limit value;
And updating the first space angle value according to the first space angle value, the lower space angle limit value and the upper space angle limit value to obtain a second space angle value of the sound source except the target sound source relative to the target object.
3. The method of claim 2, wherein a plurality of the sound sources are in the same spatial range, the spatial range having a position where the target object is located as a spatial coordinate system origin;
the determining a plurality of spatial angle intervals according to the azimuth resolution precision comprises:
Determining a space angle segmentation coefficient according to the azimuth resolution precision;
and carrying out space angle segmentation on the space range according to the space angle segmentation coefficient by taking the space coordinate system origin as a reference to obtain a plurality of space angle sections.
4. The method according to claim 2, wherein updating the first spatial angle value according to the first spatial angle value, the lower spatial angle limit value, and the upper spatial angle limit value to obtain a second spatial angle value of the sound source other than the target sound source with respect to the target object includes:
Calculating a first angle difference between the first spatial angle value and the lower spatial angle limit value and a second angle difference between the first spatial angle value and the upper spatial angle limit value;
And if the first angle difference value is larger than the second angle difference value, updating the first spatial angle value to be the upper spatial angle limit value to obtain a second spatial angle value of the sound source except the target sound source relative to the target object, or if the first angle difference value is smaller than or equal to the second angle difference value, updating the first spatial angle value to be the lower spatial angle limit value to obtain a second spatial angle value of the sound source except the target sound source relative to the target object.
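The comparison in claim 4 amounts to snapping the angle to the nearer interval bound; a minimal sketch with illustrative names:

```python
def snap_angle(first_angle: float, lower: float, upper: float) -> float:
    """Replace the first spatial angle value by the nearer bound of its
    target space angle interval; on a tie the lower bound wins, matching
    the strict 'greater than' test of claim 4."""
    first_diff = first_angle - lower    # distance to the lower limit
    second_diff = upper - first_angle   # distance to the upper limit
    return upper if first_diff > second_diff else lower
```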
5. The method of claim 1, wherein the audio signal emitted by each of the sound sources comprises a plurality of audio frames; the obtaining the smooth energy values of the audio signals sent by the sound sources comprises the following steps:
For the audio signal sent by each sound source, calculating a first audio energy value of the previous audio frame, and calculating a second audio energy value of the current audio frame according to the first audio energy value;
for each of the audio signals emitted by the sound sources, the second audio energy value is taken as the smoothing energy value.
6. The method of claim 5, wherein calculating a second audio energy value for the current audio frame from the first audio energy value comprises:
Calculating a relative energy value of the current audio frame relative to the target object;
And carrying out weighted summation on the relative energy value and the first audio energy value to obtain a second audio energy value of the current audio frame.
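The weighted summation of claim 6 is, in effect, exponential smoothing across frames; a sketch with a hypothetical smoothing weight, since the disclosure fixes no particular value:

```python
def smoothed_energy(relative_energy: float,
                    previous_smoothed: float,
                    alpha: float = 0.9) -> float:
    """Weighted sum of the previous frame's smoothed energy (weight
    alpha) and the current frame's relative energy (weight 1 - alpha).
    The weight alpha = 0.9 is an illustrative assumption."""
    return alpha * previous_smoothed + (1.0 - alpha) * relative_energy
```

The smoothing keeps the target-source selection stable from frame to frame, so the rendering layout does not flicker on brief energy spikes.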
7. The method of claim 6, wherein said calculating a relative energy value of the current audio frame with respect to the target object comprises:
Calculating a distance value between the sound source and the target object;
Calculating the instant energy value of the current audio frame;
And calculating the relative energy value of the current audio frame relative to the target object according to the distance value and the instant energy value.
8. The method of claim 7, wherein calculating the relative energy value of the current audio frame relative to the target object based on the distance value and the instant energy value comprises:
Determining a reference distance value;
Calculating the square of the ratio of the reference distance value to the distance value to obtain an energy value scale factor;
Multiplying the instant energy value by the energy value scale factor to obtain the relative energy value of the current audio frame relative to the target object.
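Claims 7 and 8 together describe inverse-square distance attenuation; a minimal sketch with hypothetical names and a hypothetical default reference distance:

```python
def relative_energy(instant_energy: float,
                    distance: float,
                    reference_distance: float = 1.0) -> float:
    """Scale the current frame's instant energy by the square of
    (reference distance / source-to-listener distance), i.e. the
    inverse-square attenuation of claims 7 and 8."""
    # energy value scale factor per claim 8
    scale_factor = (reference_distance / distance) ** 2
    return instant_energy * scale_factor
```

A source twice the reference distance away thus contributes a quarter of its instant energy.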
9. The method of claim 1, wherein said determining the azimuth resolution precision of each of said sound sources other than said target sound source from said smoothed energy value of each of said audio signals other than said target audio signal and said smoothed energy value of said target audio signal comprises:
for each of the audio signals except the target audio signal, calculating a ratio between the smoothed energy value of the audio signal and the smoothed energy value of the target audio signal to obtain an energy ratio corresponding to the audio signal except the target audio signal;
and determining the azimuth resolution precision of each sound source except the target sound source according to the energy ratio corresponding to each audio signal except the target audio signal.
10. The method of claim 9, wherein said determining the azimuth resolution precision of each of said sound sources other than said target sound source from said energy ratio corresponding to each of said audio signals other than said target audio signal comprises:
acquiring an azimuth resolution precision lookup table, wherein the azimuth resolution precision lookup table records the azimuth resolution precision corresponding to each energy ratio;
and, according to the energy ratio corresponding to each audio signal except the target audio signal, searching the azimuth resolution precision lookup table for the azimuth resolution precision of each sound source except the target sound source.
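A minimal sketch of the lookup-table variant of claim 10; the table contents (ratio thresholds and resolutions in degrees) are hypothetical values, since the disclosure does not specify them:

```python
# Hypothetical azimuth resolution precision lookup table: rows are
# (minimum energy ratio, resolution in degrees), ordered from loud to
# quiet so that louder sources receive finer angular resolution.
PRECISION_TABLE = [
    (0.5, 5.0),   # near-target loudness: fine resolution
    (0.1, 15.0),  # clearly quieter: medium resolution
    (0.0, 45.0),  # much quieter: coarse resolution
]

def lookup_precision(energy_ratio: float) -> float:
    """Return the resolution of the first row whose threshold the
    energy ratio meets."""
    for min_ratio, precision in PRECISION_TABLE:
        if energy_ratio >= min_ratio:
            return precision
    return PRECISION_TABLE[-1][1]  # ratios below all thresholds
```

A table lookup of this kind trades a small amount of memory for avoiding any per-source function evaluation, which fits the stated goal of reducing processor load.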
11. The method of claim 9, wherein said determining the azimuth resolution precision of each of said sound sources other than said target sound source from said energy ratio corresponding to each of said audio signals other than said target audio signal comprises:
And calling an azimuth resolution precision mapping function, and performing a mapping calculation on the energy ratio corresponding to each audio signal except the target audio signal to obtain the azimuth resolution precision of each sound source except the target sound source.
12. An audio processing apparatus, comprising:
a signal acquisition unit, configured to acquire smooth energy values of audio signals sent by a plurality of sound sources, and a first spatial angle value of each sound source relative to a target object;
a signal determination unit configured to determine, as a target audio signal, one of the audio signals emitted from the plurality of sound sources, which has the largest smooth energy value, and determine, as a target sound source, the sound source from which the target audio signal is emitted;
A precision determination unit configured to determine the azimuth resolution precision of each of the sound sources other than the target sound source based on the smoothed energy value of each of the audio signals other than the target audio signal and the smoothed energy value of the target audio signal;
An angle updating unit, configured to update, for each of the sound sources except for the target sound source, the first spatial angle value according to the azimuth resolution precision, to obtain a second spatial angle value of the sound source except for the target sound source with respect to the target object;
A mixing and merging unit, configured to mix and merge the audio signals of the plurality of sound sources with the same second spatial angle value to obtain a plurality of merged audio signals;
A stereo mixing unit, configured to perform stereo mixing processing on the target audio signal and the plurality of combined audio signals, to obtain a stereo signal;
and the signal playing unit is used for playing the stereo signal to the target object.
13. An electronic device, comprising:
At least one processor;
At least one memory for storing at least one program;
wherein, when the at least one program is executed by the at least one processor, the audio processing method according to any one of claims 1 to 11 is implemented.
14. A computer-readable storage medium, in which a computer program executable by a processor is stored, which computer program, when being executed by a processor, is adapted to carry out the audio processing method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or computer instructions, characterized in that the computer program or computer instructions are stored in a computer readable storage medium, from which a processor of an electronic device reads the computer program or the computer instructions, which processor executes the computer program or the computer instructions, so that the electronic device performs the audio processing method according to any one of claims 1 to 11.
CN202410405808.1A 2024-04-07 2024-04-07 Audio processing method, device and storage medium Pending CN117998274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410405808.1A CN117998274A (en) 2024-04-07 2024-04-07 Audio processing method, device and storage medium


Publications (1)

Publication Number Publication Date
CN117998274A true CN117998274A (en) 2024-05-07

Family

ID=90901053


Country Status (1)

Country Link
CN (1) CN117998274A (en)

Similar Documents

Publication Publication Date Title
CN109076305B (en) Augmented reality headset environment rendering
CN112567767B (en) Spatial audio for interactive audio environments
Lentz et al. Virtual reality system with integrated sound field simulation and reproduction
JP2018509864A (en) Reverberation generation for headphone virtualization
GB2543275A (en) Distributed audio capture and mixing
US20230100071A1 (en) Rendering reverberation
CN113889125B (en) Audio generation method and device, computer equipment and storage medium
CN110677802B (en) Method and apparatus for processing audio
Murphy et al. Spatial sound for computer games and virtual reality
CN110751956A (en) Immersive audio rendering method and system
CN112534498A (en) Reverberation gain normalization
WO2012104297A1 (en) Generation of user-adapted signal processing parameters
CN113347551B (en) Method and device for processing single-sound-channel audio signal and readable storage medium
CN117998274A (en) Audio processing method, device and storage medium
Yuan et al. Sound image externalization for headphone based real-time 3D audio
CN117082435B (en) Virtual audio interaction method and device, storage medium and electronic equipment
CN112470218A (en) Low frequency inter-channel coherence control
CN116709162B (en) Audio processing method and related equipment
Dziwis et al. Machine Learning-Based Room Classification for Selecting Binaural Room Impulse Responses in Augmented Reality Applications
CN117528392A (en) Audio processing method, device, equipment and storage medium
Urbanietz Advances in binaural technology for dynamic virtual environments
Stewart Spatial auditory display for acoustics and music collections
O’Dwyer Sound Source Localization and Virtual Testing of Binaural Audio
Gutiérrez A et al. Audition
CN116939473A (en) Audio generation method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination