CN116939473A - Audio generation method and related device - Google Patents

Audio generation method and related device

Publication number
CN116939473A
CN116939473A
Authority
CN
China
Prior art keywords
sound
audio signal
sound source
perception
audio signals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210326525.9A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210326525.9A
Publication of CN116939473A
Legal status: Pending

Abstract

The application discloses an audio generation method and a related device, applicable to various voice-processing scenarios such as cloud technology, artificial intelligence, intelligent transportation, assisted driving, and in-vehicle scenarios. The method determines the relative positional relationship between a sound source and a target object according to the sound source's position information, and performs sound perception processing on the sound source's original audio signal based on the distance value to obtain a target auditory perception audio signal; based on the sound energy of the target auditory perception audio signals, it screens out, from N sound sources, M sound sources whose sound energy satisfies a preset condition; it then performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source to obtain M stereo audio signals, and mixes the left-channel and right-channel audio signals of the M stereo audio signals to obtain a stereo mixed audio signal. In this way, while reducing the computing resources required, the accuracy of the screening result is improved, making the method better suited to virtual-space stereo-mixing application scenarios.

Description

Audio generation method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio generating method and a related device.
Background
With the development of computer technology and internet technology, implementing the metaverse requires integrating artificial-intelligence virtual-perception technologies in many aspects such as audio, video, and perception, constructing a computer virtual space that approaches real-world perception, so that an experiencer can, by means of hardware devices (such as headphones, glasses, and somatosensory equipment), have a sensory experience indistinguishable from the real world.
However, in current audio generation methods, when virtual-space sound effects are reconstructed in metaverse-like applications, the number of sound sources is relatively large (there may be tens or even hundreds of sound sources), and the audio signal of every sound source must undergo stereo reconstruction separately. Since the computational cost of stereo reconstruction is relatively high, the cost of generating the stereo audio signals becomes very high, so that ordinary devices cannot meet the requirement of real-time operation; even if the computing module is moved to a server, generating the stereo audio signals still consumes considerable computing resources.
Disclosure of Invention
In order to solve the above technical problems, the application provides an audio generation method and a related device, which not only reduce the consumption of computing resources but also, on that premise, realize route selection based on the real sound effect heard by the target object; the screening result is therefore more accurate and better suited to virtual-space stereo-mixing application scenarios.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides an audio generating method, including:
determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
determining a sound energy of the target auditory perception audio signal;
based on the sound energy of the target auditory perception audio signal, M sound sources with sound energy meeting preset conditions are screened out from N sound sources, M and N are positive integers, and M is smaller than N;
respectively carrying out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source in the M sound sources to obtain M stereo audio signals;
and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
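Read together, the steps of this aspect form a simple pipeline. The following is a minimal, hypothetical Python sketch (all function and field names are illustrative, the inverse-square attenuation model is an assumption, and real stereo reconstruction, e.g. via HRTFs, would produce distinct left and right channels rather than the placeholder used here):

```python
import math

def generate_stereo_mix(sources, target_pos, m):
    """Hypothetical sketch of the claimed steps: distance-based sound
    perception, energy screening of M out of N sources, then
    reconstruction and mixing. Not the patent's actual implementation."""
    perceived = []
    for src in sources:
        # distance value between the sound source and the target object
        d = math.dist(src["pos"], target_pos)
        # sound perception processing: inverse-square attenuation (assumed model,
        # clamped so sources closer than 1 m are not boosted)
        signal = [s / max(d, 1.0) ** 2 for s in src["signal"]]
        energy = sum(s * s for s in signal)
        perceived.append({"pos": src["pos"], "signal": signal, "energy": energy})
    # screen out the M sources whose perceived energy is largest (M < N)
    selected = sorted(perceived, key=lambda p: p["energy"], reverse=True)[:m]
    # stereo reconstruction + mixing; placeholder assumes equal-length signals
    # and simply sums them into identical channels
    n = len(selected[0]["signal"])
    left = [sum(p["signal"][k] for p in selected) for k in range(n)]
    right = list(left)  # a real HRTF reconstruction would differ per channel
    return left, right
```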
In one aspect, an embodiment of the present application provides another audio generation method, including:
Determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
determining a sound energy of the target auditory perception audio signal;
based on the sound energy of the target auditory perception audio signal, M sound sources with sound energy meeting preset conditions are screened out from N sound sources, M and N are positive integers, and M is smaller than N;
transmitting the position information of the M sound sources and the corresponding target auditory perception audio signals to a terminal corresponding to the target object, so that the terminal respectively performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals; and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
In one aspect, an embodiment of the present application provides an audio generating apparatus, where the apparatus includes a determining unit, a perception processing unit, a screening unit, a reconstruction unit, and a mixing unit:
The determining unit is used for determining the relative position relation between the sound source and the target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
the perception processing unit is used for carrying out sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
the determining unit is further configured to determine a sound energy of the target auditory perception audio signal;
the screening unit is used for screening M sound sources with sound energy meeting preset conditions from N sound sources based on the sound energy of the target auditory perception audio signal, wherein M and N are positive integers, and M is smaller than N;
the reconstruction unit is used for respectively carrying out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source in the M sound sources to obtain M stereo audio signals;
and the audio mixing unit is used for respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
In one aspect, an embodiment of the present application provides another audio generating apparatus, where the apparatus includes a determining unit, a perception processing unit, a screening unit, and a transmitting unit:
The determining unit is used for determining the relative position relation between the sound source and the target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
the perception processing unit is used for carrying out sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
the determining unit is further configured to determine a sound energy of the target auditory perception audio signal;
the screening unit is used for screening M sound sources with sound energy meeting preset conditions from N sound sources based on the sound energy of the target auditory perception audio signal, wherein M and N are positive integers, and M is smaller than N;
the sending unit is used for sending the position information of the M sound sources and the corresponding target auditory perception audio signals to the terminal corresponding to the target object, so that the terminal respectively carries out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals; and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
In one aspect, an embodiment of the present application provides a computer device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the audio generation method according to any one of the preceding aspects according to instructions in the program code.
In one aspect, embodiments of the present application provide a computer-readable storage medium for storing program code for performing the audio generation method of any one of the preceding aspects.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the audio generation method of any of the preceding aspects.
According to the technical scheme, when audio is generated for the virtual space, the relative positional relationship between a sound source and the target object can be determined from the sound source's position information, the relative positional relationship including the distance value between the sound source and the target object. Because each sound source sits at some distance from the target object, and those distances differ from source to source, the sound effect the target object hears when a sound source's audio signal propagates to it may also differ. Sound perception processing can therefore be applied to the original audio signals of the sound sources based on the distance values, yielding target auditory perception audio signals that represent the real sound effect heard by the target object. Based on the sound energy of these target auditory perception audio signals, a limited number M of sound sources whose sound energy satisfies a preset condition can then be screened out of the N sound sources more accurately; these M sources are treated as the sound sources the target object can actually hear, and performing stereo reconstruction only for this limited set reduces the consumption of computing resources. Stereo reconstruction is then performed on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources, yielding M stereo audio signals, and the left-channel and right-channel audio signals of the M stereo audio signals are mixed to obtain a stereo mixed audio signal, which is played to the target object.
By combining the relative positional relationship between the sound source and the target object when performing route selection, the application not only reduces the consumption of computing resources but also, on that premise, realizes route selection based on the real sound effect heard by the target object; the screening result is therefore more accurate and better suited to virtual-space stereo-mixing application scenarios.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of an audio generating method according to the related art;
fig. 2 is an application scenario architecture diagram of an audio generation method according to an embodiment of the present application;
fig. 3 is a flowchart of an audio generating method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a relative positional relationship between a target object and a sound source according to an embodiment of the present application;
fig. 5 is a schematic diagram of an HRTF stereo reconstruction process according to an embodiment of the present application;
fig. 6 is a signaling interaction diagram of an audio generation method according to an embodiment of the present application;
FIG. 7 is a flowchart of another audio generation method according to an embodiment of the present application;
fig. 8 is a schematic view of an application scenario for restoring a stereo sound effect in a real world by using a virtual space sound effect according to an embodiment of the present application;
Fig. 9 is a block diagram of an audio generating apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of another audio generating apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a terminal according to an embodiment of the present application;
fig. 12 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
The implementation of the metaverse requires integrating artificial-intelligence virtual-perception technologies spanning audio, video, and other sensory channels to construct a computer virtual space that approximates real-world perception, so that an experiencer (such as a target object) can, by means of hardware devices (headphones, glasses, somatosensory equipment), have a sensory experience indistinguishable from the real world. The virtual-space sound effect is a very important part of this: by restoring the binaural sound signals of the real world, it lets the experiencer perceive the stereophonic sound experience of the real world (for example, the speech, laughter, and footsteps of different people in different directions, the engine sound of a car approaching from afar, sidewalk prompt tones, and the sound of wind and rain). Because different objects are active in the virtual space, each object can act as a sound source. When the target object acts as a listener, it receives the audio signals of a large number of sound sources; the audio signals of some sound sources are acquired in real time by the recording equipment of real objects, while other sound sources are constructed by the system according to the virtual scene. The audio signals of the different sound sources can be transmitted over the network to the target object's terminal for stereo mixing and then played through the target object's headphones or loudspeaker.
The general flow of the multi-party stereo mixing scheme provided by the related art is as follows: the audio signals and position information of different sound sources are sent to a server; the server forwards them to the terminals corresponding to the other sound sources (for example, users); each terminal finally generates multi-party stereo signals based on the position information, performs stereo mixing, and plays the mixed signal through stereo headphones or a loudspeaker.
As shown in fig. 1, if the virtual space includes N sound sources, namely sound source 1, sound source 2, sound source 3, ..., and sound source N, then in the related art the terminal corresponding to each sound source receives the audio signals and position information of the N-1 other sound sources forwarded by the server. The data received by each terminal grows linearly as the number of users participating in audio interaction in the virtual space increases, and the bandwidth consumption of the server grows on the order of O(N²). Moreover, each terminal must perform stereo generation processing on the N-1 audio signals separately, so the computational cost is high. For large-scale virtual-space social application scenarios, the number of users (i.e., the number of sound sources) is very large, possibly hundreds of thousands, and the related art faces serious challenges in both bandwidth and computing resources, so that ordinary devices cannot meet the requirement of real-time operation. Even if the computing module is moved to a server, generating the stereo audio signals is still very computationally intensive.
In order to solve the technical problems, the embodiment of the application provides an audio generation method, which combines the relative position relation of a sound source and a target object to perform route selection, so that the consumption of computing resources can be reduced, and on the premise of reducing the computing resources, the route selection based on the real sound effect heard by the representative target object is realized, and the screening result is more accurate and is more suitable for application scenes of virtual space stereo mixing.
As shown in fig. 2, fig. 2 shows an application scene architecture diagram of an audio generation method. A server 201 and a terminal 202 may be included in the application scenario. The server 201 may provide services for the terminal 202, and the server 201 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Terminal 202 may be used to implement a meta space, and terminal 202 may be, for example, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc., but is not limited thereto. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, auxiliary driving, vehicle-mounted scenes and the like. The terminal 202 and the server 201 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In this application scenario, a plurality of sound sources, for example, N sound sources, may be, for example, a user's speaking sound, laughing sound, footstep sound, an automobile engine sound, a sidewalk prompt sound, and a weather sound. The terminal 202 may be a terminal to which a target object corresponds, and the target object may be an object in the virtual space as a listener, for example, a user, and the user may also be a sound source in the virtual space.
When the virtual space generates audio, the server 201 may determine, according to the position information of a sound source, the relative positional relationship between the sound source and the target object, including the distance value between them. Because different sound sources lie at different distances from the target object, the sound effect the target object hears when a sound source's audio signal propagates to it may differ from source to source. The server 201 can perform sound perception processing on the original audio signal of a sound source based on the distance value; sound perception processing here means predicting the sound effect the target object would perceive after the original audio signal propagates to it, thereby obtaining a target auditory perception audio signal that represents the real sound effect heard by the target object. On this basis, the limited number M of sound sources whose sound energy satisfies a preset condition can be screened out of the N sound sources more accurately according to the sound energy of the target auditory perception audio signals, and stereo reconstruction is performed only for these sources, which the target object can actually hear, reducing the consumption of computing resources.
It should be noted that the above process may be referred to as a routing process: the server 201 screens a limited number M of sound sources out of the N sound sources, where M is smaller than N, so that the computing resources consumed by subsequent stereo reconstruction are reduced.
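The routing process itself amounts to a top-M selection by perceived sound energy. A minimal sketch in pure Python (the function name is an illustrative assumption, not the patent's terminology):

```python
def route_top_m(energies, m):
    """Return the indices of the M sound sources with the largest perceived
    sound energy; M is assumed smaller than the number of sources N."""
    # sort source indices by descending energy, keep the first M
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return sorted(order[:m])  # return indices in their original order
```

Only these M indices then need stereo reconstruction, which is where the computing-resource saving comes from.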
A stereo reconstruction and mixing process may then be performed to generate a stereo mixed audio signal. In some cases, the stereo reconstruction and mixing process may continue to be completed by the server 201. The server 201 performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources to obtain M stereo audio signals, and performs audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals, so that the server 201 sends the obtained stereo mixed audio signals to the terminal 202 corresponding to the target object, so that the terminal 202 plays the stereo mixed audio signals to the target object.
In other cases, the server 201 may perform only the routing process, and the terminal 202 may carry out the subsequent stereo reconstruction and stereo mixing; that is, the server 201 and the terminal 202 together perform the audio generation method provided by the embodiment of the present application. Specifically, after screening out of the N sound sources the limited number M of sound sources whose sound energy satisfies the preset condition, the server 201 forwards the position information of the M sound sources and the corresponding target auditory perception audio signals to the terminal 202 corresponding to the target object. The terminal 202 then performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals, and mixes the left-channel and right-channel audio signals of the M stereo audio signals to obtain a stereo mixed audio signal. In fig. 2, the audio generation method provided by the embodiment of the present application is illustrated with the server 201 performing it alone, but the present application is not limited thereto.
It will be appreciated that the methods provided by embodiments of the present application may involve artificial intelligence (Artificial Intelligence, AI), which is a theory, method, technique, and application system that simulates, extends, and extends human intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, obtains knowledge, and uses knowledge to obtain optimal results. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
The method provided by the embodiment of the application may particularly involve speech processing technology. Key technologies of Speech Technology include automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most favored modes of human-computer interaction.
Next, an audio generation method provided by an embodiment of the present application will be described in detail with reference to the accompanying drawings, taking a server performing an audio generation method (i.e., a server performing a routing process, a stereo reconstruction, and a stereo mixing process) as an example.
Referring to fig. 3, fig. 3 shows a flow chart of an audio generation method, the method comprising:
s301, determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object.
Sound is a mechanical wave, and a sound source is an object capable of generating such a wave; in physics, an object that is producing sound is called a sound source. For example, vocal cords, tuning forks, and drums are sound sources. However, a sound source cannot be considered apart from the surrounding elastic medium: without the elastic medium, no sound wave can be produced, and the object is not a sound source. The sound sources in the embodiment of the application may include multiple sound sources of different types; a sound source may be acquired in a real environment or constructed virtually, and each sound source carries information such as its own monaural signal and its own position information. For example, a sound source carrying an audio signal may be acquired by a user's recording equipment, or constructed by the virtual-space sound-effect system, i.e., the server. It can be understood that the embodiment of the application involves at least two sound sources, and the audio signal carried by a sound source is a monaural original audio signal. The position information can be represented by three-dimensional spatial coordinates or two-dimensional coordinates; it may be sent to the server by the terminal corresponding to each sound source (for example, the terminal may locate the position of the sound source), or it may be stored in the server (for example, when the sound source is configured by the server).
The target object refers to a listener receiving an original audio signal generated by a sound source, and for example, the target object may include a listener in the real world or a listener constructed in a virtual space.
The relative positional relationship may refer to the position of the sound source relative to the target object. After the original audio signal of a sound source propagates to the target object, the sound effect the target object perceives is related to the distance the signal propagates, i.e., the distance from the sound source to the target object, so the relative positional relationship may include a distance value. The perceived sound effect is also related to the propagation direction of the original audio signal, so the relative positional relationship may further include an azimuth value. The distance value refers to the distance between the sound source and the target object; the distance values between different sound sources and the target object may be the same or different. For example, the distance values between sound source 1 and the target object and between sound source 2 and the target object may both be 2 m.
When the position information is represented by two-dimensional coordinates, the distance value between each sound source and the target object can be calculated by the following formula:

d(i) = sqrt( (x(i) - x0)^2 + (y(i) - y0)^2 )

where (x0, y0) represents the position information of the target object, (x(i), y(i)) represents the position information of the i-th sound source, and i = 1, 2, 3, ..., N.
When the position information is represented by three-dimensional spatial coordinates and the coordinate system takes the target object as the origin, the distance value between each sound source and the target object can be calculated by the following formula:

d(i) = sqrt( x_i^2 + y_i^2 + z_i^2 )

where (x_i, y_i, z_i) represents the position information of the i-th sound source, and i = 1, 2, 3, ..., N.
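Both distance formulas translate directly into code. A small sketch (function names are illustrative):

```python
import math

def distance_2d(source_xy, target_xy):
    """d(i) = sqrt((x(i) - x0)^2 + (y(i) - y0)^2)."""
    (x, y), (x0, y0) = source_xy, target_xy
    return math.sqrt((x - x0) ** 2 + (y - y0) ** 2)

def distance_3d_origin(source_xyz):
    """3D case with the target object at the origin:
    d(i) = sqrt(x_i^2 + y_i^2 + z_i^2)."""
    x, y, z = source_xyz
    return math.sqrt(x * x + y * y + z * z)
```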
The azimuth value refers to the azimuth of the sound source relative to the target object; the azimuth values of different sound sources may be the same or different. For example, fig. 4 shows a schematic diagram of the relative positional relationships between the target object and different sound sources: sound source 1 is located 45 degrees to the front-right of the target object (its azimuth value) at a distance d1, and sound source 2 is located 60 degrees to the front-left of the target object at a distance d2.
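The patent does not give a formula for the azimuth value; one common way to compute a horizontal azimuth from 2D coordinates is with atan2, sketched here purely as an assumption (straight ahead taken as the +y axis, positive angles to the right of the target object):

```python
import math

def azimuth_degrees(source_xy, target_xy):
    """Horizontal azimuth of a sound source relative to the target object,
    in degrees; 0 = straight ahead (+y), positive = to the right.
    This convention is an assumption, not specified by the patent."""
    dx = source_xy[0] - target_xy[0]
    dy = source_xy[1] - target_xy[1]
    return math.degrees(math.atan2(dx, dy))
```

With this convention, a source at 45 degrees front-right of the listener (as sound source 1 in fig. 4) yields +45.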
S302, performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target hearing perception audio signal.
Since the relative positional relationships between different sound sources and the target object may differ, the sound effects the target object hears from different sound sources may also differ after their original audio signals propagate to it. The server can therefore perform sound perception processing on the original audio signal of each sound source based on the relative positional relationship, obtaining a target auditory perception audio signal that represents the real sound effect heard by the target object. When the relative positional relationship includes a distance value, the sound perception processing may be distance perception processing, i.e. monaural signal attenuation or amplification based on the distance value between the sound source and the target object.
In some possible implementations, according to the inverse square law the sound energy is inversely proportional to the square of the distance, so the formula for distance perception processing of the original audio signal can be expressed as:

E′(i) = E(i) · (d0 / d(i))²   (3)

wherein E′(i) represents the sound energy of the target auditory perception audio signal, E(i) represents the sound energy of the original audio signal, and d0 is a reference distance constant that can be set according to actual requirements, for example 1 meter; the sound energy of each sound source's target auditory perception audio signal is thus calculated with reference to the distance d0.
The original audio signal of each sound source is first high-pass filtered (for example, filtering out frequencies below 250 Hz), and the instant sound energy of the filtered signal is then calculated, so the sound energy of the original audio signal can be computed as:

E(i) = Σk s(i, k)²   (4)

where E(i) represents the sound energy of the i-th sound source and s(i, k) represents the vibration amplitude of the k-th sample point of the i-th sound source after high-pass filtering.
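A minimal sketch of this energy calculation follows; the one-pole high-pass filter and the 16 kHz sampling rate are illustrative assumptions standing in for the patent's 250 Hz filter, and the energy is the sum of squared filtered sample amplitudes:

```python
import numpy as np

def frame_energy(samples, fs=16000, cutoff=250.0):
    """Instant sound energy of one frame after a simple high-pass filter.

    The single-pole filter here is an assumed stand-in for the patent's
    high-pass stage; the energy is E(i) = sum over k of s(i, k)^2.
    """
    rc = 1.0 / (2.0 * np.pi * cutoff)
    dt = 1.0 / fs
    a = rc / (rc + dt)
    y = np.empty(len(samples), dtype=float)
    y[0] = samples[0]
    for k in range(1, len(samples)):
        # one-pole high-pass: y[k] = a * (y[k-1] + x[k] - x[k-1])
        y[k] = a * (y[k - 1] + samples[k] - samples[k - 1])
    return float(np.sum(y * y))
```

A constant (DC) frame yields almost no energy after filtering, while a rapidly alternating frame passes through nearly intact.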
In another possible implementation, the distance perception processing of the original audio signal based on the distance value may be performed by first determining a gain value corresponding to the sound source from the distance value, and then attenuating or amplifying the original audio signal of the sound source by that gain value. In this case, the product of the original audio signal and the distance gain may be regarded as the target auditory perception audio signal.
In one possible implementation, the gain value corresponding to a sound source may be determined from the distance value by obtaining a reference distance value and a gain upper limit value, and then determining the distance gain value of the sound source from the reference distance value, the gain upper limit value and the distance value. The gain upper limit value is the maximum allowed gain, used to avoid signal overflow caused by an excessively large gain; it may be a maximum gain value set empirically in advance.
According to the inverse square law the sound energy is inversely proportional to the square of the distance, so the distance gain can be calculated by the following formula:

g(i) = min(Gmax, (d0 / d(i))²)   (5)

where Gmax is the gain upper limit value, min() is the function that takes the minimum, and d0 is a reference distance constant that can be set according to actual requirements.
By setting the gain upper limit value and further combining the gain value determined based on the gain upper limit value to perform distance sensing processing on the original audio signal, the situation that the signal overflows due to overlarge gain can be avoided.
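A small sketch of this capped distance gain, assuming d0 = 1 m and an illustrative Gmax of 4.0 (both values are examples, not taken from the patent):

```python
def distance_gain(d, d0=1.0, g_max=4.0):
    # g(i) = min(Gmax, (d0 / d)^2): inverse-square gain, capped at Gmax
    # so that a very close source cannot cause signal overflow.
    return min(g_max, (d0 / d) ** 2)

def apply_distance_perception(frame, d, d0=1.0, g_max=4.0):
    # Attenuate (d > d0) or amplify (d < d0) the mono frame by the gain.
    g = distance_gain(d, d0, g_max)
    return [g * s for s in frame]
```

A source 2 m away is attenuated to a quarter of its energy; a source 0.1 m away would get gain 100 but is clamped to Gmax.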
It should be noted that in some cases the relative positional relationship may include an azimuth value in addition to the distance value, and the factors affecting the sound effect heard by the target object then include not only distance but also azimuth. In the real world a user can judge the approximate distance and approximate direction of a sound source from the sound alone; to simulate a real scene in the virtual space and give the user a more realistic experience, the sound effect heard by the target object should likewise be affected by both distance and azimuth. In this case the sound perception processing may include both distance perception processing and azimuth perception processing: the original audio signal of the sound source is first subjected to distance perception processing based on the distance value to obtain a distance-perception audio signal, and that signal is then subjected to azimuth perception processing based on the azimuth value to obtain the target auditory perception audio signal. The distance perception processing is as shown in the foregoing formulas (3) to (5) and is not repeated here. The azimuth perception processing is similar to the distance perception processing: corresponding gain values can be calculated for different azimuth values, and the target auditory perception audio signal determined from those gains and the distance-perception audio signal.
In the embodiment of the present application the order of the distance perception processing and the azimuth perception processing is not limited; performing azimuth perception processing first and distance perception processing afterwards is merely one possible implementation.
In this way, the server can perform sound perception processing on the mono original audio signal by combining the distance value and the azimuth value, reducing the computing resources the server consumes in the route selection process.
In another possible implementation, since the method provided by the embodiment of the present application may be applied to a stereo audio generation scene, the sound masking effect is actually the result of binaural masking, and the relative positional relationships between different sound sources and the left and right ears differ. Therefore, to better fit the stereo audio generation scene and ensure the accuracy of the route selection, when the relative positional relationship also includes an azimuth value, corresponding auditory perception signals can be obtained for the left ear and the right ear separately, and route selection then performed on the left-ear and right-ear auditory perception signals respectively. In this case the target auditory perception audio signal includes a left-ear auditory perception audio signal (i.e. the auditory perception signal corresponding to the left ear) and a right-ear auditory perception audio signal (i.e. the auditory perception signal corresponding to the right ear), and the sound perception processing includes distance perception processing. One possible implementation of S302 is then to calculate a first distance from the sound source to the left ear and a second distance from the sound source to the right ear according to the distance value, the azimuth value and the binaural position information of the target object, and to perform distance perception processing on the original audio signal of the sound source based on the first distance to obtain the left-ear auditory perception audio signal and based on the second distance to obtain the right-ear auditory perception audio signal.
By considering the azimuth values of different sound sources, corresponding auditory perception signals are obtained for the left ear and the right ear respectively, so that a more accurate binaural masking result can be obtained, and a more accurate route selection is realized.
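The per-ear distances can be sketched with simple plane geometry. The head-centre coordinate frame, the 0.09 m half head width, and the sign convention (azimuth clockwise from straight ahead) are all illustrative assumptions, not values from the patent:

```python
import math

def ear_distances(d, azimuth_deg, half_head=0.09):
    """First and second distances from a source to the left and right ears.

    Sketch: head centre at the origin facing +y, ears offset by
    +/- half_head metres on the x axis (0.09 m is an assumed half head
    width), azimuth measured clockwise from straight ahead.
    """
    theta = math.radians(azimuth_deg)
    sx, sy = d * math.sin(theta), d * math.cos(theta)  # source position
    d_left = math.hypot(sx + half_head, sy)
    d_right = math.hypot(sx - half_head, sy)
    return d_left, d_right
```

A source straight ahead is equidistant from both ears; a source to the right is closer to the right ear, so its right-ear perception signal attenuates less.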
In the related art, a route selection strategy is provided. For example, in a multi-person conversation scenario the strategy is generally deployed on a server: the server analyses the sound energy intensity of the collected audio signals of all participants (i.e. sound sources), and based on the sound masking principle, i.e. the criterion that low-intensity signals are masked by high-intensity signals, filters out a limited number of audio signals (usually the paths with the largest sound energy) for each participant. For example, taking a certain participant as the target object with 10 other participants, the server selects the 3 paths with the largest sound energy in the energy ranking and forwards these 3 mono audio signals to the terminal corresponding to the target object for mono audio mixing. This route selection scheme effectively controls the computing overhead of the terminal and the bandwidth consumption of the server, and is widely applied to scenarios such as multi-party voice conferences.
However, in a multi-source stereo application scene of the virtual space, relative distances and relative azimuths exist between sound sources: the audio signals of different sound sources attenuate to different degrees because of the distance differences, and the left- and right-ear perception may also differ because of the azimuth differences. In other words, before distance perception processing and azimuth perception processing are performed, the signals cannot represent the final sound effect heard by the target object. The related art therefore proposes converting to stereo before route selection: the server performs stereo reconstruction according to the relative positional relationship between the target object and the other sound sources in the virtual space, and then determines from the left-channel and right-channel audio signals which audio signals have high audibility and which have low audibility and need not participate in the final stereo mixing, thereby reducing the bandwidth load of the terminal and the server. However, because the relative positional relationships between the different sound sources and the target object differ, this scheme requires the server to perform 2×N² HRIR convolution processes, which incurs a large computational overhead.
In contrast, the mode provided by the embodiment of the present application can distinguish the left-ear auditory perception audio signal and the right-ear auditory perception audio signal through a simple position and orientation calculation based on the distance value, the azimuth value and the binaural position information of the target object, and route-select the left-ear and right-ear auditory perception audio signals separately. This eliminates the need for the server to execute 2×N² HRIR convolution processes and greatly reduces its computational cost. Meanwhile, stereo reconstruction can be performed on the left-ear and right-ear auditory perception audio signals remaining after route selection, so the number of sound sources targeted by the reconstruction is greatly reduced; even if the stereo reconstruction is performed on the server, the computing resources it consumes are reduced.
S303, determining the sound energy of the target hearing perception audio signal.
The route selection process is actually based on the sound masking effect, i.e. an audio signal with small sound energy is masked by an audio signal with large sound energy, so that a limited number of sound sources that will not be masked (e.g. M sound sources, M smaller than N) are screened out from a large number of sound sources (e.g. N sound sources). To enable this subsequent route selection, the sound energy of the target auditory perception audio signal may be determined first.
In the embodiment of the present application, if the target auditory sense audio signal is the jth frame of auditory sense audio signal, the sound energy of the jth frame of auditory sense audio signal may be directly used as the sound energy of the target auditory sense audio signal.
In some cases, however, the audio signal emitted by a sound source may fluctuate: although most frames of the signal may carry large sound energy, fluctuation can cause the sound energy of a particular frame, e.g. the j-th frame, of the auditory perception audio signal to be relatively small. When determining whether the audio signal of a sound source is masked, in order to avoid an accidental drop in the j-th frame's sound energy causing the sound source to be judged a low-audibility one, and to improve the accuracy of the route selection, one possible implementation of S303 is to take a weighted sum of the sound energy of the j-th frame auditory perception audio signal and that of the (j-1)-th frame, obtain a relative energy, and use this relative energy as the sound energy of the target auditory perception audio signal.
In one possible implementation, the formula for calculating the relative energy may be as follows:
Esm′(i,j)=a*Esm′(i,j-1)+(1-a)*E′(i,j) (6)
where Esm′(i, j) represents the relative energy of the j-th frame of the auditory perception audio signal of the i-th sound source, Esm′(i, j-1) represents the relative energy of the (j-1)-th frame, E′(i, j) represents the sound energy of the j-th frame, and a is a constant that can be set according to actual requirements, for example a = 0.93.
When calculating the sound energy of each frame of the auditory perception audio signal, the embodiment of the present application considers the influence of the sound energy of historical frames on the current frame (e.g. the j-th frame) and the influence of sound overlap on the sound effect, obtaining a relative energy that better represents the real sound effect of the sound source and thereby enabling more accurate sound source route selection.
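Formula (6) is a one-line recursive smoothing, sketched here with the example value a = 0.93:

```python
def relative_energy(prev_esm, e_cur, a=0.93):
    # Esm'(i, j) = a * Esm'(i, j-1) + (1 - a) * E'(i, j)   -- formula (6)
    # prev_esm: relative energy of frame j-1; e_cur: energy of frame j.
    return a * prev_esm + (1 - a) * e_cur
```

With a = 0.93, a single-frame dip from energy 100 to 10 only lowers the smoothed value to 93.7, so one quiet frame does not get the source misjudged as low-audibility.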
S304, based on the sound energy of the target auditory perception audio signal, M sound sources with sound energy meeting preset conditions are screened out from N sound sources, M and N are positive integers, and M is smaller than N.
Based on the sound energy of the target auditory perception audio signal, the server can screen out from the N sound sources a limited number of sound sources (e.g. M sound sources, M smaller than N) whose sound energy satisfies a preset condition, greatly reducing the number of sound sources for subsequent stereo reconstruction and stereo mixing and thus the consumption of computing resources. The preset condition may be any condition under which the audio signal can be heard by the target object: for example, the sound energy reaches a certain hearing threshold, or the sound energies of the N sound sources are sorted and the sound sources ranked in the first M positions are selected, or the two are combined; the embodiment of the present application does not limit this.
In the embodiment of the present application, a corresponding screening manner is provided for each implementation of S302. When the target auditory perception audio signal includes a left-ear auditory perception audio signal and a right-ear auditory perception audio signal, the M sound sources whose sound energy satisfies the preset condition may be screened out of the N sound sources as follows: based on the sound energy of the left-ear auditory perception audio signals, screen out a first sound source set whose sound energy satisfies the preset condition; based on the sound energy of the right-ear auditory perception audio signals, screen out a second sound source set whose sound energy satisfies the preset condition; then take the sound sources included in the union of the first sound source set and the second sound source set as the M sound sources. The sound sources in the first set and those in the second set may be identical or different.
For example, suppose the N sound sources are sound source 1, sound source 2, sound source 3, … , sound source N. Based on the sound energy of the left-ear auditory perception audio signals, the first sound source set screened from the N sound sources may include sound source 1, sound source 2, sound source 4 and sound source N; based on the sound energy of the right-ear auditory perception audio signals, the second sound source set may include sound source 1, sound source 2, sound source 3 and sound source N. The M sound sources finally screened out are then sound source 1, sound source 2, sound source 3, sound source 4 and sound source N.
In one possible implementation combining the hearing-threshold criterion with sorting by sound energy, the M sound sources may be screened from the N sound sources as follows. First, perform first-stage filtering on the N sound sources based on the sound energy of the target auditory perception audio signal to obtain a first filtering result, in which the sound energy of each included sound source reaches a first hearing threshold, denoted thrd1. Then sort the sound energies of the target auditory perception audio signals of the sound sources in the first filtering result in descending order, and perform second-stage filtering on them according to the sorting result to obtain a second filtering result, in which the included sound sources rank in the first K positions. When K equals M, the M sound sources have thereby been screened out of the sorting result, and the sound sources in the second filtering result are determined as the M sound sources.
It should be noted that the first hearing threshold may take different values in the mono and binaural (left channel plus right channel) cases. In the mono case, if the sound energy of the target auditory perception audio signal of one sound source, e.g. sound source 1, reaches 5 times that of another, e.g. sound source 2, then sound source 1 can mask sound source 2, and the first hearing threshold may be set around 5 times the lower sound energy. In the binaural case, since two ears usually recognise signals more sensitively than one, masking one sound source with another is harder than in the mono case, i.e. the sound energies of the two target auditory perception audio signals must differ by more. Generally, when the sound energy of sound source 1's target auditory perception audio signal reaches 8 times that of sound source 2's, sound source 1 can mask sound source 2 in the binaural case, and the first hearing threshold may be set around 8 times the lower sound energy. These settings of the first hearing threshold are merely examples and do not limit the present application; the threshold may be adjusted according to actual requirements.
It will be appreciated that in the real world some audio signals with relatively small sound energy may be hard for the target object to hear, yet are not completely masked; for the auditory perception of the target object such audio signals remain audible and affect the sound effect. Therefore, after K sound sources are selected according to the sorting result, sound sources whose sound energy is greater than a second hearing threshold may additionally be acquired from the first filtering result as supplementary sound sources, and the sound sources in the second filtering result together with the supplementary sound sources determined as the M sound sources. In this case K may be smaller than M. The second hearing threshold may be smaller than the first hearing threshold, and both may be set according to actual requirements. The sound sources included in the second filtering result are referred to here as alternative sound sources, and the second hearing threshold may be determined from a sound energy reference value (Esm_base) and a third hearing threshold (thrd2), for example second hearing threshold = Esm_base / thrd2.
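The two-stage screening can be sketched as below. This is one plausible reading of the steps above, not the patent's definitive algorithm: stage 1 is interpreted as a masking test against the loudest source (a source survives if the largest energy is less than thrd1 times its own), stage 2 keeps the top-K survivors, and sources above the lower threshold Esm_base / thrd2 are added back; all parameter values are illustrative.

```python
def route_select(energies, thrd1=8.0, k=3, esm_base=None, thrd2=None):
    """Two-stage route-selection sketch (one plausible reading of S304).

    energies: {source_id: relative energy of its perception signal}.
    thrd1 ~ 8 for the binaural case, ~ 5 for mono (example values).
    """
    if not energies:
        return set()
    e_max = max(energies.values())
    # Stage 1: drop sources masked by the loudest one.
    stage1 = {s: e for s, e in energies.items() if e * thrd1 >= e_max}
    # Stage 2: keep the top-K survivors by energy.
    ranked = sorted(stage1, key=stage1.get, reverse=True)
    selected = set(ranked[:k])
    # Optional supplement: still-audible sources above the second threshold.
    if esm_base is not None and thrd2 is not None:
        selected |= {s for s, e in stage1.items() if e > esm_base / thrd2}
    return selected
```

For energies {1: 100, 2: 50, 3: 20, 4: 5} with thrd1 = 8 and K = 2, source 4 is masked in stage 1 and stage 2 keeps sources 1 and 2.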
S305, respectively carrying out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source in the M sound sources to obtain M stereo audio signals.
Through the above steps S301 to S304 the route selection process of the server is completed. The server can then perform stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources, obtaining M stereo audio signals. By screening a small number of highly audible sound sources out of a large number of sound sources before reconstructing stereo, the computing resources consumed on the server can be greatly reduced.
In one possible implementation manner, stereo reconstruction is performed on the corresponding target auditory sense audio signals according to the position information of each of the M sound sources, and the manner of obtaining the M stereo audio signals may be to find excitation signals corresponding to the M sound sources according to the position information, and then perform convolution processing based on the excitation signals and the target auditory sense audio signals of the corresponding sound sources, so as to construct the stereo audio signals.
The above manner of constructing a stereo audio signal may be referred to as head related transfer function (Head Related Transfer Functions, HRTF) stereo reconstruction, where the excitation signal may be an HRIR (head related impulse response) excitation signal obtained by looking up an HRIR table. The HRIR table records the correspondence between different position information and HRIR excitation signals, so the HRIR excitation signal is looked up according to the positional relationship.
The formula for constructing the stereo audio signal can be as follows:

y(n) = u(n) ⊗ h(n)   (7)

wherein y(n) is the stereo audio signal of a sound source, u(n) is the target auditory perception audio signal of the sound source, h(n) is the corresponding HRIR excitation signal, and ⊗ represents convolution.
As shown in fig. 5, fig. 5 is a schematic diagram of HRTF stereo reconstruction processing. Since the HRIR excitation signal h (n) includes the HRIR data of the left channel and the HRIR data of the right channel, the generated y (n) also includes the left channel (left ear) audio signal and the right channel (right ear) audio signal.
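A minimal sketch of the HRTF reconstruction step, taking the left- and right-channel HRIR data as already looked up from the table (the function name is illustrative):

```python
import numpy as np

def hrtf_reconstruct(u, hrir_left, hrir_right):
    # y(n) = u(n) (*) h(n): convolve the mono auditory-perception signal
    # with the left- and right-channel HRIR excitation signals looked up
    # for the source position, yielding a two-channel stereo signal.
    left = np.convolve(u, hrir_left)
    right = np.convolve(u, hrir_right)
    return left, right
```

With a unit-impulse left HRIR the left channel reproduces the input unchanged, while a scaled impulse on the right simply attenuates it, which is a quick sanity check on the convolution.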
S306, respectively performing audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
After the server performs stereo reconstruction to obtain a stereo audio signal, stereo mixing can be performed, so that a stereo mixed audio signal is obtained, and the stereo mixed audio signal is played to a target object through headphones or a sound box.
Stereo mixing is obtained by mixing the left channel audio signals and the right channel audio signals separately. There are many mixing methods, for example: direct addition, averaging, clamping, normalization, adaptive mix weighting and auto-alignment algorithms. Taking the averaging method as an example, i.e. summing the left channel audio signals of all sound sources and the right channel audio signals of all sound sources respectively and then averaging, the formulas can be:

lout = (1/M) · Σm l(m)   (8)
rout = (1/M) · Σm r(m)   (9)

where lout denotes the left channel of the stereo mixed audio signal, rout denotes the right channel, l(m) denotes the left channel audio signal of the m-th sound source, r(m) denotes the right channel audio signal of the m-th sound source, and M denotes the number of sound sources obtained by the screening.
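The averaging mix can be sketched as follows, with each screened source represented as a (left, right) pair of equal-length sample arrays:

```python
import numpy as np

def mix_average(stereo_signals):
    """Averaging mix: sum the M left channels and the M right channels
    separately, then divide by M (the lout / rout formulas above)."""
    lefts = np.stack([l for l, _ in stereo_signals])
    rights = np.stack([r for _, r in stereo_signals])
    return lefts.mean(axis=0), rights.mean(axis=0)
```

Averaging rather than direct addition keeps the mixed amplitude in range regardless of how many sources survived the screening, at the cost of lowering each source's level.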
According to the technical scheme, when audio is generated for the virtual space, the relative positional relationship between a sound source and the target object can be determined from the position information of the sound source, the relationship including the distance value between them. Because each sound source is located at some distance from the target object and these distance values differ between sound sources, the sound effects heard by the target object when the audio signals of different sound sources reach it may differ. The original audio signals of the sound sources can therefore be subjected to sound perception processing based on the distance values, yielding target auditory perception audio signals that represent the real sound effects heard by the target object. Based on the sound energy of these signals, a limited number M of sound sources whose sound energy satisfies the preset condition can be screened more accurately out of the N sound sources as the sound sources audible to the target object, and performing stereo reconstruction for only this limited number reduces the consumption of computing resources. Stereo reconstruction is then performed on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources to obtain M stereo audio signals, and the left channel audio signals and right channel audio signals of the M stereo audio signals are mixed respectively to obtain the stereo mixed audio signal played to the target object.
The present application performs route selection in combination with the relative positional relationship between the sound sources and the target object. This not only reduces the consumption of computing resources but also, on that premise, realises route selection based on the real sound effect heard by the target object; the screening result is more accurate and better suits the application scene of virtual-space stereo mixing.
And because the route selection process greatly reduces the number of sound sources targeted by the stereo mixing, problems such as sound breaking and noise can be effectively alleviated.
The following describes in detail the audio generation method provided by the embodiment of the present application with reference to the accompanying drawings, taking a server and a terminal cooperatively executing an audio generation method (i.e. the server executes a routing process, the terminal corresponding to the target object executes a stereo reconstruction and a stereo mixing process) as an example.
Referring to fig. 6, fig. 6 shows a signaling interaction diagram of an audio generation method, the method comprising:
s601, the server determines the relative position relation between the sound source and the target object according to the position information of the sound source.
Wherein the relative positional relationship may include a distance value between the sound source and the target object.
S602, performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal.
S603, determining the sound energy of the target hearing perception audio signal.
S604, based on the sound energy of the target hearing perception audio signal, M sound sources with the sound energy meeting the preset conditions are screened out of N sound sources.
Wherein M and N are positive integers, M is smaller than N.
Referring to fig. 7, each sound source transmits an original audio signal and position information to a server (see S701 in fig. 7) through a corresponding terminal, and the server completes a routing process (see S702-S706 in fig. 7) by executing S601-S604, and the implementation manner of S601-S604 may be referred to the implementation manner of S301-S304 in the corresponding embodiment of fig. 3, which is not described herein again.
And S605, transmitting the position information of the M sound sources and the corresponding target auditory perception audio signals to the terminal corresponding to the target object.
The server mainly plays roles in routing and forwarding in the embodiment of the application, and the server sends the position information of the M sound sources and the corresponding target auditory perception audio signals to the terminal corresponding to the target object so that the terminal can complete subsequent stereo reconstruction and stereo mixing.
S606, the terminal corresponding to the target object respectively performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals.
S607, the terminal corresponding to the target object respectively mixes the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
After the terminal corresponding to the target object performs stereo reconstruction to obtain a stereo audio signal, stereo mixing can be performed, so that a stereo mixed audio signal is obtained, and the stereo mixed audio signal is played to the target object through headphones or a sound box.
The terminal corresponding to the target object performs S606-S607 to complete stereo reconstruction and stereo mixing (see S707-S708 in fig. 7), and the implementation manner of S606-S607 may be referred to the implementation manner of S305-S306 in the corresponding embodiment of fig. 3, which is not described herein.
In this embodiment of the application, the route-selection process is executed at the server, while the stereo reconstruction and stereo mixing are executed at the terminal corresponding to the target object, which reduces the server's computing-resource consumption. Moreover, because the server executes route selection, only the target auditory perception audio signals of a limited number of sound sources need to be sent to the terminal corresponding to the target object, which greatly reduces bandwidth consumption.
Because the stereo reconstruction executed at the terminal targets only the target auditory perception audio signals of a limited number of sound sources, the terminal's computing-resource consumption is greatly reduced and the requirement of real-time operation is met. Likewise, because the route-selection process has already taken place, the number of sound sources involved in stereo mixing is greatly reduced, which effectively mitigates problems such as clipping and noise.
The application further provides an application scenario to which the audio generation method is applied. Specifically, the audio generation method is applied in this scenario as follows. The concept of the metaverse derives from a writer's 1992 work, which describes the metaverse: "Wearing headphones and an eyepiece, and finding a connection terminal, one can enter, as a virtual avatar, a computer-simulated virtual space parallel to the real world."
Implementing the metaverse requires integrating artificial-intelligence virtual-perception technologies across audio, video, and other sensory channels to construct a computer virtual space that approaches real-world perception, so that an experiencer can, with only some hardware devices such as headphones, glasses, and somatosensory equipment, enjoy a sensory experience indistinguishable from the real world. Virtual spatial sound is an important part of metaverse-like applications: it restores the binaural sound signals of a real environment, so that an experiencer wearing headphones can perceive the stereo sound experience of that environment.
As shown in fig. 8, fig. 8 shows an application scenario in which virtual spatial sound effects (spatial sound for short) restore real-world stereo sound. Spatial sound means that, through audio-processing techniques, a user hears sound with a stronger sense of space and layering; by playing through headphones or a combination of two or more loudspeakers, the real-world auditory scene is restored, so that the listener (for example, a target object such as a user) can clearly identify the directions, distances, and movement trajectories of different sound sources, and can also feel wrapped by sound from all directions, producing an immersive listening experience as if in the real world. For example, the spatial sound shown in fig. 8 includes the speech and laughter of different people in different surrounding directions, footsteps, the engine sound of a car travelling from far to near, sidewalk prompt tones, and the sound of wind and rain.
When a user uses a metaverse-like application, or needs to experience virtual spatial sound, the audio generation method of the application can be adopted: the real-world binaural audio signals are restored through virtual spatial sound, and the experiencer, that is, the target object (for example, the user), can perceive real-world stereo sound by wearing headphones. For example, the virtual spatial sound shown in fig. 8 may include the speech and laughter of surrounding people in different directions, footsteps, engine sound, sidewalk prompt tones, wind and rain, and so on. The server may determine the relative positional relationship between a sound source and the target object according to the sound source's position information, where the relative positional relationship includes the distance value between them. Because different sound sources lie at different distances from the target object, the sound effects the target object hears when the sound sources' audio signals reach it may differ.
Therefore, the original audio signal of each sound source can be subjected to sound perception processing based on the distance value, yielding a target auditory perception audio signal that represents the real sound effect heard by the target object. Based on the sound energy of these target auditory perception audio signals, a limited number M of sound sources whose sound energy satisfies the preset condition can then be screened out of the N sound sources more accurately; these M sound sources serve as the sound sources audible to the target object, and performing stereo reconstruction only for this limited number of sources reduces computing-resource consumption. Next, stereo reconstruction is performed on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources to obtain M stereo audio signals, and the left channel audio signals and the right channel audio signals of the M stereo audio signals are respectively mixed to obtain a stereo mixed audio signal, which is played to the target object. Because the application performs route selection in combination with the relative positional relationship between the sound sources and the target object, it not only reduces computing-resource consumption but also bases the route selection on the real sound effect heard by the target object; the screening result is therefore more accurate and better suited to virtual-space stereo-mixing application scenarios.
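The screening step described above can be sketched as follows: given each source's target auditory perception audio signal (the signal after sound perception processing), compute its energy and keep the M highest-energy sources. The energy measure (sum of squared samples) is an illustrative assumption, not the formula claimed in the application.

```python
import numpy as np

def screen_sources(perceived_signals, m):
    # perceived_signals: one target auditory perception audio signal
    # (a 1-D array of samples) per sound source, N in total.
    energies = [float(np.sum(s ** 2)) for s in perceived_signals]
    # Keep the indices of the M sources with the largest perceived energy.
    order = np.argsort(energies)[::-1][:m]
    return sorted(int(i) for i in order)

signals = [np.array([0.9, 0.8]),    # loud source
           np.array([0.05, 0.02]),  # barely audible
           np.array([0.5, 0.4])]    # moderate
print(screen_sources(signals, m=2))  # [0, 2]
```

In this sketch the quiet middle source is dropped, so later stereo reconstruction runs for only M = 2 of the N = 3 sources.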
The method provided by the embodiment of the application can be applied to the scenes of virtual space sound effect generation or virtual stereo generation.
It should be noted that, based on the implementation manner provided in the above aspects, further combinations may be further performed to provide further implementation manners.
Based on the audio generation method provided in the corresponding embodiment of fig. 3, the embodiment of the application further provides an audio generation device 900. Referring to fig. 9, the audio generating apparatus 900 includes a determining unit 901, a perception processing unit 902, a screening unit 903, a reconstructing unit 904, and a mixing unit 905:
the determining unit 901 is configured to determine, according to position information of a sound source, a relative positional relationship between the sound source and a target object, where the relative positional relationship includes a distance value between the sound source and the target object;
the perception processing unit 902 is configured to perform a sound perception process on an original audio signal of the sound source based on the distance value, so as to obtain a target auditory perception audio signal;
the determining unit 901 is further configured to determine a sound energy of the target auditory perception audio signal;
the screening unit 903 is configured to screen, based on the sound energy of the target auditory perception audio signal, M sound sources from N sound sources, where the sound energy satisfies a preset condition, M and N are positive integers, and M is smaller than N;
the reconstruction unit 904 is configured to perform stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources, so as to obtain M stereo audio signals;
the mixing unit 905 is configured to mix the left channel audio signals and the right channel audio signals of the M stereo audio signals, so as to obtain stereo mixed audio signals.
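To make the mixing unit's operation concrete, the sketch below mixes the left channels of several reconstructed stereo signals together and the right channels together. The hard clip to [-1, 1] is an assumed overflow guard, not a technique stated in the application.

```python
import numpy as np

def mix_channels(stereo_signals):
    # stereo_signals: list of (left, right) numpy arrays of equal length,
    # one pair per screened sound source.
    # Left channels are summed together and right channels separately;
    # a hard clip to [-1, 1] prevents overflow (illustrative choice).
    left = np.clip(np.sum([s[0] for s in stereo_signals], axis=0), -1.0, 1.0)
    right = np.clip(np.sum([s[1] for s in stereo_signals], axis=0), -1.0, 1.0)
    return left, right

a = (np.array([0.2, 0.9]), np.array([0.1, -0.3]))
b = (np.array([0.3, 0.5]), np.array([0.2, -0.9]))
l, r = mix_channels([a, b])
print(l, r)  # second left sample and second right sample are clipped
```

A production mixer would more likely normalize or soft-limit than hard-clip, but the per-channel separation shown here matches the step the unit performs.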
In a possible implementation manner, the relative position relationship further includes an azimuth value, the sound sensing process includes a distance sensing process and an azimuth sensing process, and the sensing processing unit 902 is specifically configured to:
performing distance sensing processing on the original audio signal of the sound source based on the distance value to obtain a distance sensing audio signal;
and performing azimuth sensing processing on the distance sensing audio signal based on the azimuth value to obtain the target auditory sensing audio signal.
In a possible implementation manner, the sensing processing unit 902 is specifically configured to:
determining a gain value corresponding to the sound source based on the distance value;
and carrying out attenuation processing or amplification processing on the original audio signal of the sound source based on the gain value to obtain a distance sensing audio signal of the sound source.
In a possible implementation manner, the sensing processing unit 902 is specifically configured to:
acquiring a reference distance value and a gain upper limit value;
and determining a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value and the distance value.
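A minimal sketch of the gain determination and application described above, assuming a simple inverse-distance law: within the reference distance the gain is held at the upper limit, and beyond it the signal attenuates with distance. The specific law is an illustrative assumption; the application only specifies that the gain is determined from the reference distance value, the gain upper limit value, and the distance value.

```python
def distance_gain(d, d_ref, g_max):
    # Inside the reference distance the gain is capped at the upper
    # limit g_max; beyond it the signal attenuates inversely with
    # distance (assumed law).
    if d <= d_ref:
        return g_max
    return min(g_max, d_ref / d)

def apply_gain(samples, g):
    # g < 1 attenuates the original audio signal, g > 1 amplifies it.
    return [s * g for s in samples]

print(distance_gain(0.5, 1.0, 2.0))   # 2.0 (capped at the upper limit)
print(distance_gain(4.0, 1.0, 2.0))   # 0.25 (attenuation)
print(apply_gain([0.4, -0.8], 0.25))  # attenuated samples
```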
In a possible implementation manner, the relative position relationship further includes an azimuth value, the target auditory sense audio signal includes a left auditory sense audio signal and a right auditory sense audio signal, the sound sense processing includes a distance sense processing, and the sense processing unit 902 is specifically configured to:
calculating a first distance from the sound source to the left ear and a second distance from the sound source to the right ear according to the distance value, the azimuth value and the binaural position information on the target object;
performing distance sensing processing on the original audio signal of the sound source based on the first distance to obtain a left ear hearing sensing audio signal;
and performing distance sensing processing on the original audio signal of the sound source based on the second distance to obtain a right ear hearing sensing audio signal.
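The per-ear distance computation can be sketched as follows, assuming a simple head model in which the two ears sit a fixed offset either side of the head center; the offset value and the coordinate convention are assumptions, since the application only states that the first and second distances are computed from the distance value, the azimuth value, and the binaural position information.

```python
import math

def ear_distances(d, azimuth_deg, ear_offset=0.09):
    # Head at the origin, ears on the x-axis at +/- ear_offset metres
    # (an assumed head model). The source lies at distance d, with the
    # azimuth measured from straight ahead, positive to the right.
    theta = math.radians(azimuth_deg)
    sx, sy = d * math.sin(theta), d * math.cos(theta)
    left = math.hypot(sx + ear_offset, sy)   # first distance (left ear)
    right = math.hypot(sx - ear_offset, sy)  # second distance (right ear)
    return left, right

l, r = ear_distances(2.0, 90.0)  # source directly to the right
print(l, r)  # the left ear is farther from the source than the right ear
```

Feeding these two distances into separate distance-sensing passes yields the left-ear and right-ear auditory perception audio signals.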
In one possible implementation manner, the screening unit 903 is specifically configured to:
screening a first sound source set with sound energy meeting preset conditions from N sound sources based on the sound energy of the left ear auditory perception audio signal;
screening a second sound source set with sound energy meeting preset conditions from N sound sources based on the sound energy of the right ear auditory perception audio signal;
and taking the sound sources included in the combined set of the first sound source set and the second sound source set as the M sound sources.
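The union of the two screened sets can be sketched as below; the "preset condition" is assumed here to be a simple energy threshold, which the application does not fix.

```python
def union_selection(left_energy, right_energy, threshold):
    # A source is kept if either its left-ear or right-ear perceived
    # energy passes the threshold (assumed preset condition), i.e. the
    # M sources are the union of the first and second sound source sets.
    first = {i for i, e in enumerate(left_energy) if e > threshold}
    second = {i for i, e in enumerate(right_energy) if e > threshold}
    return sorted(first | second)

picked = union_selection([5.0, 0.2, 3.0], [0.1, 4.0, 3.5], threshold=1.0)
print(picked)  # [0, 1, 2]: source 0 via left ear, 1 via right, 2 via both
```

Taking the union rather than the intersection ensures that a source audible to even one ear is still reconstructed.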
In one possible implementation manner, the screening unit 903 is specifically configured to:
performing first-stage filtering processing on N sound sources based on the sound energy of the target auditory perception audio signals to obtain a first filtering result, wherein the sound energy of the target auditory perception audio signals of the sound sources included in the first filtering result reaches a first auditory threshold;
sorting the sound energy of the target auditory perception audio signals of the sound sources included in the first filtering result in descending order to obtain a sorting result;
performing second-stage filtering processing on the sound sources included in the first filtering result according to the sorting result to obtain a second filtering result, wherein the sound energy of the target auditory perception audio signals of the sound sources included in the second filtering result is ranked in the top K positions;
and when K is equal to M, determining sound sources included in the second filtering result as M sound sources.
In a possible implementation manner, the screening unit 903 is further configured to:
when K is smaller than M, acquiring a sound source with sound energy larger than a second hearing threshold from the first filtering result as a post-compensation sound source;
and determining the sound sources included in the second filtering result and the post-compensation sound source as M sound sources.
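The two-stage filtering with post-compensation described above can be sketched as follows; the threshold values and the exact rule for choosing compensation sources are illustrative assumptions.

```python
def two_stage_filter(energies, first_threshold, second_threshold, k, m):
    # Stage 1: keep sources whose perceived energy reaches the
    # first auditory threshold.
    stage1 = [i for i, e in enumerate(energies) if e >= first_threshold]
    # Stage 2: rank the survivors by energy, descending, keep the top K.
    ranked = sorted(stage1, key=lambda i: energies[i], reverse=True)
    selected = ranked[:k]
    # Post-compensation: if fewer than M sources were selected, pull in
    # further stage-1 sources whose energy exceeds the second auditory
    # threshold, up to M sources in total.
    if len(selected) < m:
        extra = [i for i in ranked[k:] if energies[i] > second_threshold]
        selected += extra[: m - len(selected)]
    return selected

energies = [9.0, 1.0, 7.0, 3.0, 5.0, 0.5]
print(two_stage_filter(energies, first_threshold=2.0,
                       second_threshold=4.0, k=2, m=3))  # [0, 2, 4]
```

Here source 4 passes neither of the top-K slots but exceeds the second threshold, so it is added as a post-compensation source to reach M = 3.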
In a possible implementation manner, the determining unit 901 is specifically configured to:
if the target auditory perception audio signal is a j-th frame auditory perception audio signal, weighting and summing the sound energy of the j-th frame auditory perception audio signal and the sound energy of the j-1 th frame auditory perception audio signal to obtain relative energy;
the relative energy is taken as the sound energy of the target auditory perception audio signal.
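The frame-energy weighting can be sketched as below, assuming a fixed weight alpha for the current (j-th) frame and (1 - alpha) for the previous (j-1-th) frame; the application does not specify the weights, so they are illustrative.

```python
def relative_energy(frame_energies, alpha=0.7):
    # Weighted sum of each frame's energy with the previous frame's
    # energy, smoothing out momentary dips. The first frame has no
    # predecessor, so the previous energy is taken as 0.
    smoothed = []
    prev = 0.0
    for e in frame_energies:
        smoothed.append(alpha * e + (1.0 - alpha) * prev)
        prev = e
    return smoothed

print(relative_energy([1.0, 0.0, 1.0]))  # approximately [0.7, 0.3, 0.7]
```

The smoothed value is then used as the sound energy of the target auditory perception audio signal, so a single silent frame does not immediately eject a source from the screening result.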
In one possible implementation manner, the reconstruction unit 904 is specifically configured to:
looking up excitation signals respectively corresponding to the M sound sources according to the positional relationship;
and carrying out convolution processing on the excitation signal and the target auditory perception audio signal of the corresponding sound source to construct a stereo audio signal.
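The reconstruction step, looking up a position-dependent excitation signal and convolving it with the perceived signal, might be sketched as follows. The excitation table, its azimuth keys, and the two-tap responses are hypothetical; a real system would look up measured head-related impulse responses per position.

```python
import numpy as np

# Hypothetical excitation-signal table keyed by azimuth (degrees):
# one short (left, right) impulse response per position.
EXCITATIONS = {
    0:  (np.array([1.0, 0.5]), np.array([1.0, 0.5])),   # straight ahead
    90: (np.array([0.3, 0.2]), np.array([1.0, 0.6])),   # to the right
}

def reconstruct_stereo(signal, azimuth):
    h_left, h_right = EXCITATIONS[azimuth]
    # Convolve the perceived mono signal with each ear's excitation
    # signal to construct the left and right channels.
    return np.convolve(signal, h_left), np.convolve(signal, h_right)

left, right = reconstruct_stereo(np.array([1.0, 0.0, -1.0]), 90)
print(left, right)  # the right channel carries more energy for azimuth 90
```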
Based on the audio generation method provided in the embodiment corresponding to fig. 6, an embodiment of the present application further provides another audio generation apparatus 1000. Referring to fig. 10, the audio generation apparatus 1000 includes a determining unit 1001, a perception processing unit 1002, a filtering unit 1003, and a sending unit 1004:
the determining unit 1001 is configured to determine, according to position information of a sound source, a relative positional relationship between the sound source and a target object, where the relative positional relationship includes a distance value between the sound source and the target object;
the sensing processing unit 1002 is configured to perform a sound sensing process on an original audio signal of the sound source based on the distance value, so as to obtain a target auditory sensing audio signal;
the determining unit 1001 is further configured to determine a sound energy of the target auditory perception audio signal;
the filtering unit 1003 is configured to filter, based on the sound energy of the target auditory perception audio signal, M sound sources from N sound sources, where M and N are positive integers, and M is smaller than N, where the sound energy satisfies a preset condition;
the sending unit 1004 is configured to send the position information of the M sound sources and the corresponding target auditory perception audio signals to a terminal corresponding to the target object, so that the terminal performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources, to obtain M stereo audio signals; and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
The embodiment of the application also provides computer equipment which can execute the audio generation method. The computer device may be, for example, a terminal, taking a smart phone as an example:
fig. 11 is a block diagram illustrating a part of a structure of a smart phone according to an embodiment of the present application. Referring to fig. 11, the smart phone includes: radio Frequency (RF) circuit 1111, memory 1120, input unit 1130, display unit 1140, sensor 1150, audio circuit 1160, wireless fidelity (WiFi) module 1170, processor 1180, and power supply 1190. The input unit 1130 may include a touch panel 1131 and other input devices 1132, the display unit 1140 may include a display panel 1141, and the audio circuit 1160 may include a speaker 1161 and a microphone 1162. It will be appreciated that the smartphone structure shown in fig. 11 is not limiting of the smartphone, and may include more or fewer components than shown, or may combine certain components, or may be arranged in a different arrangement of components.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes various functional applications and data processing of the smartphone by executing the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 1180 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, performs various functions of the smart phone and processes data by running or executing software programs and/or modules stored in the memory 1120, and invoking data stored in the memory 1120. In the alternative, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1180.
In this embodiment, the steps required to be performed by the processor 1180 in the smart phone may be implemented based on the terminal structure shown in fig. 11.
The computer device may also be a server. As shown in fig. 12, fig. 12 is a block diagram of a server 1200 provided in an embodiment of the present application. The server 1200 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1222 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing application programs 1242 or data 1244. The memory 1232 and the storage medium 1230 may be transitory or persistent storage. The program stored on the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1222 may be configured to communicate with the storage medium 1230 and execute, on the server 1200, the series of instruction operations in the storage medium 1230.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the central processor 1222 in the server 1200 may perform the following steps:
determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target hearing perception audio signal;
determining a sound energy of the target auditory perception audio signal;
based on the sound energy of the target auditory perception audio signal, screening out, from N sound sources, M sound sources whose sound energy satisfies a preset condition, where M and N are positive integers and M is smaller than N;
respectively carrying out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source in the M sound sources to obtain M stereo audio signals;
And respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
Or performs:
determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target hearing perception audio signal;
determining a sound energy of the target auditory perception audio signal;
based on the sound energy of the target auditory perception audio signal, screening out, from N sound sources, M sound sources whose sound energy satisfies a preset condition, where M and N are positive integers and M is smaller than N;
transmitting the position information of the M sound sources and the corresponding target auditory perception audio signals to a terminal corresponding to the target object, so that the terminal respectively performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals; and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing program code for executing the audio generation method according to the foregoing embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The description of each process or structure corresponding to the drawings has its own emphasis; for any part of a process or structure that is not described in detail, refer to the descriptions of the other processes or structures.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (16)

1. A method of audio generation, the method comprising:
determining a relative position relation between a sound source and a target object according to the position information of the sound source, wherein the relative position relation comprises a distance value between the sound source and the target object;
performing sound perception processing on the original audio signal of the sound source based on the distance value to obtain a target hearing perception audio signal;
determining a sound energy of the target auditory perception audio signal;
based on the sound energy of the target auditory perception audio signal, screening out, from N sound sources, M sound sources whose sound energy satisfies a preset condition, where M and N are positive integers and M is smaller than N;
respectively carrying out stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each sound source in the M sound sources to obtain M stereo audio signals;
and respectively carrying out audio mixing processing on the left channel audio signals and the right channel audio signals of the M stereo audio signals to obtain stereo mixed audio signals.
2. The method according to claim 1, wherein the relative positional relationship further includes an azimuth value, the sound perception process includes a distance perception process and an azimuth perception process, the sound perception process is performed on the original audio signal of the sound source based on the distance value, and a target auditory perception audio signal is obtained, including:
performing distance sensing processing on the original audio signal of the sound source based on the distance value to obtain a distance sensing audio signal;
and performing azimuth perception processing on the distance sensing audio signal based on the azimuth value to obtain the target auditory perception audio signal.
3. The method according to claim 2, wherein the performing distance sensing processing on the original audio signal of the sound source based on the distance value to obtain a distance-sensing audio signal comprises:
determining a gain value corresponding to the sound source based on the distance value;
and carrying out attenuation processing or amplification processing on the original audio signal of the sound source based on the gain value to obtain a distance sensing audio signal of the sound source.
4. A method according to claim 3, wherein said determining a gain value for the sound source based on the distance value comprises:
acquiring a reference distance value and a gain upper limit value;
and determining a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value and the distance value.
5. The method of claim 1, wherein the relative positional relationship further comprises an azimuth value, the target auditory sense audio signal comprises a left auditory sense audio signal and a right auditory sense audio signal, the sound sense processing comprises a distance sense processing, and the performing the sound sense processing on the original audio signal of the sound source based on the distance value to obtain the target auditory sense audio signal comprises:
calculating a first distance from the sound source to the left ear and a second distance from the sound source to the right ear according to the distance value, the azimuth value and the binaural position information on the target object;
performing distance sensing processing on the original audio signal of the sound source based on the first distance to obtain a left ear hearing sensing audio signal;
and performing distance sensing processing on the original audio signal of the sound source based on the second distance to obtain a right ear hearing sensing audio signal.
6. The method according to claim 5, wherein the screening out M sound sources whose sound energy satisfies a preset condition from among the N sound sources based on the sound energy of the target auditory perception audio signal includes:
screening a first sound source set with sound energy meeting preset conditions from N sound sources based on the sound energy of the left ear auditory perception audio signal;
screening a second sound source set with sound energy meeting preset conditions from N sound sources based on the sound energy of the right ear auditory perception audio signal;
and taking the sound sources included in the combined set of the first sound source set and the second sound source set as the M sound sources.
7. The method according to claim 1, wherein the screening out, from the N sound sources, M sound sources whose sound energy satisfies a preset condition based on the sound energy of the target auditory perception audio signal comprises:
performing first-stage filtering on the N sound sources based on the sound energy of the target auditory perception audio signals to obtain a first filtering result, wherein the sound energy of the target auditory perception audio signal of each sound source included in the first filtering result reaches a first hearing threshold;
sorting, in descending order, the sound energy of the target auditory perception audio signals of the sound sources included in the first filtering result to obtain a sorting result;
performing second-stage filtering on the sound sources included in the first filtering result according to the sorting result to obtain a second filtering result, wherein the sound energy of the target auditory perception audio signals of the sound sources included in the second filtering result ranks in the top K positions; and
when K is equal to M, determining the sound sources included in the second filtering result as the M sound sources.
8. The method of claim 7, wherein the method further comprises:
when K is smaller than M, acquiring, from the first filtering result, a sound source whose sound energy is larger than a second hearing threshold as a compensation sound source; and
determining the sound sources included in the second filtering result and the compensation sound source as the M sound sources.
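The two-stage filtering of claims 7 and 8 can be sketched as a threshold pass followed by a top-K selection, topped up from the remaining candidates when K falls short of M. The dictionary representation and the exact top-up rule are assumptions; the claims leave those details open:

```python
def two_stage_filter(energies, first_threshold, k, m, second_threshold):
    """energies: dict mapping source id -> sound energy.

    Stage 1: keep sources whose energy reaches first_threshold.
    Stage 2: keep the K highest-energy sources of stage 1.
    Compensation (claim 8): when K < M, add remaining stage-1 sources
    whose energy exceeds second_threshold, up to M in total.
    """
    stage1 = {s: e for s, e in energies.items() if e > first_threshold}
    ranked = sorted(stage1, key=stage1.get, reverse=True)  # descending
    selected = ranked[:k]
    if k < m:
        compensation = [s for s in ranked[k:] if stage1[s] > second_threshold]
        selected += compensation[: m - len(selected)]
    return selected
```

The cheap threshold pass discards clearly inaudible sources before the more selective ranking step, which is consistent with the claim's aim of saving computing resources.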
9. The method of any one of claims 1-8, wherein the determining the sound energy of the target auditory perception audio signal comprises:
if the target auditory perception audio signal is a j-th frame auditory perception audio signal, performing a weighted summation of the sound energy of the j-th frame auditory perception audio signal and the sound energy of the (j-1)-th frame auditory perception audio signal to obtain a relative energy; and
taking the relative energy as the sound energy of the target auditory perception audio signal.
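The inter-frame weighted summation of claim 9 is effectively a one-pole smoothing of per-frame energy. A sketch, where the mean-square energy measure and the weight `alpha` are assumptions (the claim does not specify the weights):

```python
def smoothed_energy(frame, prev_energy, alpha=0.8):
    """Relative energy of the j-th frame: a weighted sum of the current
    frame's energy and the (j-1)-th frame's smoothed energy.

    frame: list of samples for frame j; prev_energy: smoothed energy of
    frame j-1; alpha: assumed weight on the current frame.
    """
    e = sum(x * x for x in frame) / len(frame)  # mean-square frame energy
    return alpha * e + (1.0 - alpha) * prev_energy
```

Carrying a fraction of the previous frame's energy forward keeps the screening decision from flickering on short silences within an otherwise active source.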
10. The method according to any one of claims 1-8, wherein the respectively performing stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources to obtain M stereo audio signals comprises:
looking up the excitation signals respectively corresponding to the M sound sources according to the relative positional relationship; and
performing convolution of each excitation signal with the target auditory perception audio signal of the corresponding sound source to construct a stereo audio signal.
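The stereo reconstruction of claim 10 convolves a source's mono signal with a position-dependent excitation-signal pair (in practice typically a head-related impulse response pair, though the claim only says "excitation signal"). A direct-form sketch, with naive O(n·m) convolution for clarity:

```python
def render_stereo(signal, excitation_left, excitation_right):
    """Convolve a mono target auditory perception audio signal with the
    left/right excitation signals looked up for the source's position,
    yielding the left- and right-channel signals of one stereo source."""
    def conv(x, h):
        y = [0.0] * (len(x) + len(h) - 1)
        for i, xi in enumerate(x):
            for j, hj in enumerate(h):
                y[i + j] += xi * hj  # accumulate each shifted product
        return y
    return conv(signal, excitation_left), conv(signal, excitation_right)
```

Mixing then reduces to summing the M left-channel outputs into one left channel and the M right-channel outputs into one right channel, as the independent claims describe.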
11. A method of audio generation, the method comprising:
determining a relative positional relationship between a sound source and a target object according to position information of the sound source, wherein the relative positional relationship comprises a distance value between the sound source and the target object;
performing sound perception processing on an original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
determining a sound energy of the target auditory perception audio signal;
screening out, from N sound sources, M sound sources whose sound energy satisfies a preset condition based on the sound energy of the target auditory perception audio signal, wherein M and N are positive integers, and M is smaller than N; and
transmitting the position information of the M sound sources and the corresponding target auditory perception audio signals to a terminal corresponding to the target object, so that the terminal respectively performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals, and respectively performs audio mixing on the left-channel audio signals and the right-channel audio signals of the M stereo audio signals to obtain a stereo mixed audio signal.
12. An audio generating apparatus, characterized in that the apparatus comprises a determining unit, a perception processing unit, a screening unit, a reconstruction unit and a mixing unit:
the determining unit is configured to determine a relative positional relationship between a sound source and a target object according to position information of the sound source, wherein the relative positional relationship comprises a distance value between the sound source and the target object;
the perception processing unit is configured to perform sound perception processing on an original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
the determining unit is further configured to determine a sound energy of the target auditory perception audio signal;
the screening unit is configured to screen out, from N sound sources, M sound sources whose sound energy satisfies a preset condition based on the sound energy of the target auditory perception audio signal, wherein M and N are positive integers, and M is smaller than N;
the reconstruction unit is configured to respectively perform stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of each of the M sound sources to obtain M stereo audio signals; and
the mixing unit is configured to respectively perform audio mixing on the left-channel audio signals and the right-channel audio signals of the M stereo audio signals to obtain a stereo mixed audio signal.
13. An audio generating apparatus, characterized in that the apparatus comprises a determining unit, a perception processing unit, a screening unit and a sending unit:
the determining unit is configured to determine a relative positional relationship between a sound source and a target object according to position information of the sound source, wherein the relative positional relationship comprises a distance value between the sound source and the target object;
the perception processing unit is configured to perform sound perception processing on an original audio signal of the sound source based on the distance value to obtain a target auditory perception audio signal;
the determining unit is further configured to determine a sound energy of the target auditory perception audio signal;
the screening unit is configured to screen out, from N sound sources, M sound sources whose sound energy satisfies a preset condition based on the sound energy of the target auditory perception audio signal, wherein M and N are positive integers, and M is smaller than N; and
the sending unit is configured to send the position information of the M sound sources and the corresponding target auditory perception audio signals to a terminal corresponding to the target object, so that the terminal respectively performs stereo reconstruction on the corresponding target auditory perception audio signals according to the position information of the M sound sources to obtain M stereo audio signals, and respectively performs audio mixing on the left-channel audio signals and the right-channel audio signals of the M stereo audio signals to obtain a stereo mixed audio signal.
14. A computer device, the computer device comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to perform the method of any one of claims 1-11 according to instructions in the program code.
15. A computer-readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the method of any one of claims 1-11.
16. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1-11.
CN202210326525.9A 2022-03-30 2022-03-30 Audio generation method and related device Pending CN116939473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210326525.9A CN116939473A (en) 2022-03-30 2022-03-30 Audio generation method and related device

Publications (1)

Publication Number Publication Date
CN116939473A true CN116939473A (en) 2023-10-24

Family

ID=88392789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210326525.9A Pending CN116939473A (en) 2022-03-30 2022-03-30 Audio generation method and related device

Country Status (1)

Country Link
CN (1) CN116939473A (en)

Similar Documents

Publication Publication Date Title
US8073125B2 (en) Spatial audio conferencing
CN101384105B (en) Three dimensional sound reproducing method, device and system
CN113889125B (en) Audio generation method and device, computer equipment and storage medium
Harma et al. Techniques and applications of wearable augmented reality audio
US20130093837A1 (en) Method and apparatus for processing audio in video communication
US11425524B2 (en) Method and device for processing audio signal
CN113192486B (en) Chorus audio processing method, chorus audio processing equipment and storage medium
Geronazzo et al. Applying a single-notch metric to image-guided head-related transfer function selection for improved vertical localization
CN105723459B (en) For improving the device and method of the perception of sound signal
US20210219089A1 (en) Spatial repositioning of multiple audio streams
CN114051736A (en) Timer-based access for audio streaming and rendering
Gupta et al. Augmented/mixed reality audio for hearables: Sensing, control, and rendering
US20230247384A1 (en) Information processing device, output control method, and program
CN116939473A (en) Audio generation method and related device
CN113301294B (en) Call control method and device and intelligent terminal
CN109923877B (en) Apparatus and method for weighting stereo audio signal
CN115705839A (en) Voice playing method and device, computer equipment and storage medium
Chabot et al. An immersive virtual environment for congruent audio-visual spatialized data sonifications
CN117082435B (en) Virtual audio interaction method and device, storage medium and electronic equipment
Pörschmann et al. 3-D audio in mobile communication devices: effects of self-created and external sounds on presence in auditory virtual environments
Farag et al. Psychoacoustic investigations on sound-source occlusion
De Sena Analysis, design and implementation of multichannel audio systems
O'Dwyer et al. A machine learning approach to detecting sound-source elevation in adverse environments
WO2023061130A1 (en) Earphone, user device and signal processing method
CN113873420B (en) Audio data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination