CN113889125B - Audio generation method and device, computer equipment and storage medium - Google Patents

Audio generation method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN113889125B
CN113889125B CN202111474567.9A
Authority
CN
China
Prior art keywords
audio
sound source
sound
audio signal
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111474567.9A
Other languages
Chinese (zh)
Other versions
CN113889125A (en)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111474567.9A priority Critical patent/CN113889125B/en
Publication of CN113889125A publication Critical patent/CN113889125A/en
Application granted granted Critical
Publication of CN113889125B publication Critical patent/CN113889125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 — Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The application relates to an audio generation method and apparatus, a computer device, and a storage medium. The method can be applied, for example, to generating audio during map navigation, and comprises the following steps: performing distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and a target object to obtain a distance-sensing audio signal; performing mixing processing on the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal corresponding to each grid area; determining a sound image center based on the position information and the audio weight value of each sound source in the grid area; performing stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and performing mixing processing on the left-channel audio signals and the right-channel audio signals of all the stereo audio signals, respectively, to obtain a stereo mixed audio signal. With this method, the computational overhead of generating stereo audio signals can be effectively reduced.

Description

Audio generation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio generation method and apparatus, a computer device, and a storage medium.
Background
With the development of computer and internet technology, realizing the metaverse requires integrating artificial-intelligence virtual perception technologies across audio, video, haptics, and other modalities to construct a computer-generated virtual space that approaches real-world perception, so that with the help of hardware devices such as headphones, glasses, and motion-sensing equipment, an experiencer can have sensory experiences indistinguishable from the real world.
However, with current audio generation methods, reconstructing virtual spatial sound effects in metaverse-like applications is expensive: the number of sound sources is large, possibly dozens or even hundreds, and the audio signal of every sound source must be stereo-reconstructed individually. Since a single stereo reconstruction is already computationally costly, the total overhead of generating the stereo audio signal is very high; ordinary devices cannot meet real-time requirements, and even when the computation is offloaded to a server, generating the stereo audio signal consumes substantial computing resources.
Disclosure of Invention
In view of the above, there is a need for an audio generation method and apparatus, a computer device, and a storage medium capable of effectively reducing the computational overhead of generating a stereo audio signal.
A method of audio generation, the method comprising: performing distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and a target object to obtain a distance-sensing audio signal; performing mixing processing on the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal corresponding to each grid area; determining a sound image center based on the position information and the audio weight value of each sound source in the grid area; performing stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and performing mixing processing on the left-channel audio signals and the right-channel audio signals of all the stereo audio signals, respectively, to obtain a stereo mixed audio signal.
An audio generation apparatus, the apparatus comprising: a distance-sensing processing module configured to perform distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and a target object to obtain a distance-sensing audio signal; a mixing processing module configured to perform mixing processing on the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal corresponding to each grid area; a determining module configured to determine a sound image center based on the position information and the audio weight value of each sound source in the grid area; and a stereo reconstruction module configured to perform stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; the mixing processing module is further configured to perform mixing processing on all the left-channel audio signals and all the right-channel audio signals of the stereo audio signals, respectively, to obtain a stereo mixed audio signal.
In one embodiment, the apparatus further comprises: the system comprises a construction module and a grid division module, wherein the construction module is used for constructing a target space network by taking the target object as a center; and the grid division module is used for carrying out uniform grid division or non-uniform grid division on the target space network to obtain the target space network comprising the grid area.
In one embodiment, the apparatus further comprises an acquisition module configured to acquire the coordinate information of each sound source, and a conversion module configured to convert the coordinate information of the sound source into longitude and latitude information on the spherical network when the coordinate information is three-dimensional rectangular coordinate information; the determining module is further configured to determine, in the spherical network, the grid area to which the sound source belongs based on the longitude and latitude information.
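The rectangular-to-spherical conversion described in this embodiment can be sketched as follows. This is a minimal illustration, not the patent's implementation; the axis convention (z pointing up, longitude measured from the x-axis) is an assumption, since the application does not specify one.

```python
import math

def cart_to_lonlat(x, y, z):
    """Convert listener-relative 3-D rectangular coordinates into
    longitude/latitude (degrees) on the spherical network.
    Axis convention (z up, longitude from the x-axis) is assumed."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0  # source coincides with the listener
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.asin(z / r))
    return lon, lat
```

For instance, a source straight ahead on the x-axis maps to (0°, 0°), while a source directly overhead maps to latitude 90°.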
In one embodiment, the grid division module is further configured to perform uniform grid division on the target spatial network to obtain a target spatial network comprising the grid areas. Alternatively, the determining module is further configured to determine the target object's audio perception accuracy values in different directions, which differ by direction, and the grid division module is further configured to divide the target spatial network according to these audio perception accuracy values to obtain a target spatial network comprising the grid areas.
In one embodiment, the determining module is further configured to determine a gain value for each sound source based on the distance value between the sound source and the target object, and the distance-sensing processing module is further configured to attenuate or amplify the audio signal of the sound source based on the gain value to obtain the distance-sensing audio signal of the sound source.
In one embodiment, the acquisition module is further configured to acquire a reference distance value and a gain upper limit; the determining module is further configured to determine the distance gain value for each sound source based on the reference distance value, the gain upper limit, and the distance value.
In one embodiment, the acquisition module is further configured to acquire the coordinate value and the audio weight value of each sound source in the grid area; the determining module is further configured to determine a center coordinate value for the grid area based on the coordinate values and audio weight values, and to take the position corresponding to the center coordinate value as the sound image center of the grid area.
In one embodiment, the acquisition module is further configured to acquire the audio energy value of each sound source in the grid area; the determining module is further configured to determine the total audio energy value of all the sound sources in the grid area from the individual audio energy values, and to determine the audio weight value of each sound source in the grid area based on its audio energy value and the total audio energy value.
In one embodiment, the apparatus further comprises a search module configured to look up the excitation signal corresponding to the sound image center; the construction module is further configured to construct the stereo audio signal based on the excitation signal and the mixed audio signal of the grid area.
In one embodiment, the construction module further comprises a convolution processing unit configured to convolve the mixed audio signal of the grid area with the left-channel excitation signal to obtain a left-channel audio signal, and to convolve the mixed audio signal of the grid area with the right-channel excitation signal to obtain a right-channel audio signal.
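The convolution step in this embodiment can be sketched as below. This is a simplified illustration assuming the excitation signals are short impulse responses (e.g. HRIRs) looked up for the grid area's sound image center; the function names and the truncation to the input length are assumptions, not the patent's implementation.

```python
def convolve(signal, excitation):
    """Convolve a mono signal with an excitation (impulse-response)
    sequence, truncated to the signal length."""
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(excitation):
            if i - j >= 0:
                acc += h * signal[i - j]
        out[i] = acc
    return out

def stereo_reconstruct(mixed, left_excitation, right_excitation):
    """Build the two-channel stereo signal for one grid area by
    convolving its mixed signal with the left and right excitations."""
    return convolve(mixed, left_excitation), convolve(mixed, right_excitation)
```

With a unit impulse as the left excitation the left channel reproduces the mixed signal unchanged, which is a convenient sanity check.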
In one embodiment, the determining module is further configured to determine which grid areas within the target spatial network contain sound sources and to count the number of such grid areas; the mixing processing module is further configured to mix the left-channel audio signals of all the stereo audio signals based on the number of grid areas to obtain a left-channel mixed audio signal, and to mix the right-channel audio signals of all the stereo audio signals based on the number of grid areas to obtain a right-channel mixed audio signal.
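The final per-channel mixing in this embodiment might look like the following sketch. The 1/N scaling by the number of occupied grid areas is one plausible reading of "based on the number of the grid areas" — an assumed normalization to keep the mix within range, not a detail stated by the application.

```python
def final_mix(stereo_signals, num_regions):
    """Mix the left channels of all per-region stereo signals into one
    left channel (and likewise for the right channels), dividing by the
    number of grid areas that contain sound sources. `stereo_signals`
    is a list of (left, right) sample lists, one pair per grid area."""
    n = len(stereo_signals[0][0])
    left = [sum(s[0][i] for s in stereo_signals) / num_regions for i in range(n)]
    right = [sum(s[1][i] for s in stereo_signals) / num_regions for i in range(n)]
    return left, right
```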
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps: performing distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and a target object to obtain a distance-sensing audio signal; performing mixing processing on the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal corresponding to each grid area; determining a sound image center based on the position information and the audio weight value of each sound source in the grid area; performing stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and performing mixing processing on the left-channel audio signals and the right-channel audio signals of all the stereo audio signals, respectively, to obtain a stereo mixed audio signal.
A computer-readable storage medium, on which is stored a computer program which, when executed by a processor, carries out the steps of: performing distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and a target object to obtain a distance-sensing audio signal; performing mixing processing on the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal corresponding to each grid area; determining a sound image center based on the position information and the audio weight value of each sound source in the grid area; performing stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and performing mixing processing on the left-channel audio signals and the right-channel audio signals of all the stereo audio signals, respectively, to obtain a stereo mixed audio signal.
According to the above audio generation method and apparatus, computer device, and storage medium, distance-sensing processing is performed on the audio signal of each sound source based on the distance value between the sound source and the target object to obtain a distance-sensing audio signal; the distance-sensing audio signals of the sound sources in the same grid area are mixed to obtain a mixed audio signal for each grid area; a sound image center is determined from the position information and the audio weight value of each sound source in the grid area; stereo reconstruction is performed on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and the left-channel and right-channel audio signals of all the stereo audio signals are mixed, respectively, to obtain a stereo mixed audio signal. The multiple sound sources in the same grid area are treated as a whole: after distance-sensing correction they are mixed into a single new sound source, the sound image center of each grid area is computed as a weighted combination of its sound sources, and stereo reconstruction is performed with the sound image center as the coordinate position of the new source. Processing the sources in each grid area jointly in this way greatly reduces the number of stereo reconstruction operations and therefore the computational cost, without affecting the user experience; ordinary devices can thus process in real time, the computational overhead of generating stereo audio signals is effectively reduced, and the user's perceptual experience is preserved.
Drawings
FIG. 1 is a diagram of an application environment of a method of audio generation in one embodiment;
FIG. 2 is a schematic flow chart diagram of a method for audio generation in one embodiment;
FIG. 3 is a diagram illustrating the construction of a spherical spatial network centered on the target object in one embodiment;
FIG. 4 is a diagram illustrating grid-based management of multiple sound sources in one embodiment;
FIG. 5 is a flowchart illustrating the step of determining the mesh region to which the sound source belongs based on latitude and longitude information according to an embodiment;
FIG. 6 is a diagram illustrating rectangular coordinate information converted to latitude and longitude information in one embodiment;
fig. 7 is a schematic diagram of an HRTF stereo reconstruction process in one embodiment;
FIG. 8 is a diagram illustrating an exemplary application scenario of virtual spatial audio effects for restoring a stereo audio signal in a real environment;
FIG. 9 is a flow diagram of virtual stereo generation in a conventional manner in one embodiment;
FIG. 10 is a schematic flow chart illustrating grid-based generation of a virtual stereo audio signal in one embodiment;
FIG. 11 is a block diagram showing the structure of an audio generating apparatus according to an embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The audio generation method provided by the application can be applied in the application environment shown in fig. 1. As shown in fig. 1, the application environment includes a terminal 102 and a server 104, which interact over a network. The server 104 may receive the audio signals of multiple sound sources sent by the terminal 102 and then: perform distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and the target object to obtain a distance-sensing audio signal; mix the distance-sensing audio signals of the sound sources in the same grid area to obtain a mixed audio signal for each grid area; determine a sound image center based on the position information and the audio weight value of each sound source in the grid area; perform stereo reconstruction on the mixed audio signal of each grid area based on its sound image center to obtain a stereo audio signal; and mix the left-channel and right-channel audio signals of all the stereo audio signals, respectively, to obtain a stereo mixed audio signal. After obtaining the stereo mixed audio signal, the server 104 may return it to the corresponding terminal 102, so that the user obtains the stereo mixed audio signal associated with the audio signals of the multiple sound sources.
The server 104 may also store the stereo mixed audio signal in association with the multiple original sound sources; for example, the stereo mixed audio signal may serve as the virtual spatial sound effect of the audio signals of those original sound sources.
The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, portable wearable device, smart voice interaction device, smart appliance, vehicle-mounted terminal, or other connected-vehicle device.
The server 104 may be an independent server or a server cluster composed of multiple servers. It is understood that the server 104 provided in this embodiment of the present application may also be a service node in a blockchain system, where the service nodes form a Peer-to-Peer (P2P) network; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP).
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers in place of human eyes to identify, track, and measure targets, and further processing the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of its most promising modes.
In one embodiment, as shown in fig. 2, an audio generation method is provided. The method is described here as applied to the server in fig. 1, and includes the following steps:
Step 202: perform distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and the target object, to obtain a distance-sensing audio signal.
Sound is a mechanical wave, and a sound source is an object that produces such a wave; in physics, any object that is producing sound is called a sound source — for example, vocal cords, tuning forks, and drums. However, a sound source cannot be considered apart from the elastic medium surrounding it: if separated from the elastic medium, the object cannot generate sound waves and is then not a sound source. The sound sources in the embodiments of the present application may be of multiple different types; a sound source may be captured in a real environment or configured virtually, and each sound source carries information such as its monaural signal and its coordinates. For example, a sound source may carry an audio signal captured by a user's recording device, or one configured by the virtual spatial sound effect system, i.e., the server. It can be understood that the embodiments of the present application involve at least two sound sources, and that the audio signal carried by each sound source is a mono audio signal.
The target object is the listener receiving the audio signals generated by the sound sources; for example, the target object may be a listener in a real environment or a listener constructed in a virtual space.
The distance value is the distance between a sound source and the target object; the distance values between different sound sources and the target object may be the same or different. For example, the distance values between sound source 1 and target object A and between sound source 2 and target object A may both be 2 m. Distance-sensing processing attenuates or amplifies the monaural signal based on the distance value between the sound source and the target object.
The distance-sensing audio signal is the signal obtained after distance-sensing processing, and may be an attenuated or an amplified audio signal. For example, following the principle that loudness attenuates with distance, the monaural signal is attenuated according to the actual distance between the sound source and the listener, and the resulting attenuated audio signal is the distance-sensing audio signal.
Specifically, the server may perform distance-sensing processing on the audio signal of each sound source based on the distance value between the sound source and the target object to obtain the distance-sensing audio signal. Distance-sensing processing includes monaural signal attenuation and monaural signal amplification, and there may be multiple sound sources.
For example, assume the current target object is listener A, who can hear N sound sources in total. Each sound source carries its own monaural signal and coordinate information, and these are input to the server, i.e., the virtual spatial sound effect system. The monaural signal may be captured by the user's recording device or constructed by the virtual spatial sound effect system, and the coordinate information may be measured by the user's wearable device or computed from the user's virtual motion trajectory in the virtual space system. The server may also compute the distance value between each sound source and the target object using a preset function.
Further, the server may attenuate or amplify the audio signal of each sound source based on the distance value between that sound source and listener A, to obtain an attenuated or amplified audio signal for each sound source.
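The distance-sensing step above can be sketched as follows. The inverse-distance (1/r) gain law is an assumption for illustration — the application does not fix a formula — while the reference distance and gain upper limit correspond to the values acquired in a later embodiment; all default values here are hypothetical.

```python
def distance_gain(distance, ref_distance=1.0, max_gain=2.0):
    """Gain for a source at `distance` from the listener. An inverse-
    distance (1/r) law is assumed; the reference distance sets the
    unity-gain distance, and the gain upper limit caps amplification
    for very close sources."""
    if distance <= 0.0:
        return max_gain
    return min(ref_distance / distance, max_gain)

def distance_sense(signal, distance, ref_distance=1.0, max_gain=2.0):
    """Attenuate (gain < 1) or amplify (gain > 1) a mono signal
    according to the source's distance from the target object."""
    g = distance_gain(distance, ref_distance, max_gain)
    return [g * x for x in signal]
```

A source 2 m away is attenuated to half amplitude under these assumed defaults, while a source closer than 0.5 m is capped at the gain upper limit.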
Step 204: perform mixing processing on the distance-sensing audio signals of the sound sources in the same grid area, to obtain a mixed audio signal corresponding to each grid area.
A grid area is the spatial region corresponding to one cell of the grid, which is the spatial mesh obtained by dividing the spatial network into multiple grid areas. The server may construct a spatial network centered on the target object; for example, it may construct a spherical space centered on the target object and divide it into a spherical spatial grid containing multiple grid areas. For instance, the spherical grid may contain grid 1, defined as the region from 20° to 30° east longitude and 0° to 20° south latitude, and grid 2, defined as the region from 20° to 30° east longitude and 20° to 40° south latitude.
Sound sources are in the same grid area when their positions fall within the same defined region. For example, if sound source 1 is at 20° east longitude, 15° south latitude and sound source 2 is at 20° east longitude, 18° south latitude, both positions fall within the region defined by grid 1, so sound source 1 and sound source 2 can be determined to be sound sources in the same grid area.
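Assigning a source to a grid cell from its longitude and latitude can be sketched as below. The 10°-longitude by 20°-latitude uniform cell size matches the example regions in the surrounding text but is otherwise an assumption (south latitudes are represented as negative degrees here).

```python
import math

def grid_index(lon_deg, lat_deg, lon_step=10.0, lat_step=20.0):
    """Map a source's longitude/latitude on the listener-centered
    sphere to a (row, col) grid cell. Cell boundaries fall on
    multiples of the step sizes, with latitude rows counted from
    the equator (negative rows are southern-hemisphere cells)."""
    col = int(math.floor((lon_deg % 360.0) / lon_step))
    row = int(math.floor(lat_deg / lat_step))
    return row, col
```

With these assumed steps, sources at (20° E, 15° S) and (20° E, 18° S) land in the same cell, while a source at (20° E, 25° S) lands in the neighboring cell, mirroring the grid 1 / grid 2 example.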
Mixing processing combines the audio signals carried by all the sound sources in the same grid area into a new sound source. For example, mixing the audio signal sigA carried by sound source 1 and the audio signal sigB carried by sound source 2 in the area of grid 1 yields the mixed audio signal sigC, which is the audio signal of a new sound source 3.
The mixed audio signal is the audio signal of the new sound source obtained after mixing processing. For example, the audio signal of the new sound source 3 is obtained by mixing the audio signals of sound sources 1 and 2, so the audio signal of sound source 3 is a mixed audio signal.
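The per-grid mixing step can be sketched as simple additive mixing of sample-aligned signals; the application does not specify the mixing formula, so plain summation is an assumption here.

```python
def mix(signals):
    """Additively mix the sample-aligned distance-sensing signals of
    all sources in one grid area into a single new mono source.
    Shorter signals are treated as zero-padded to the longest."""
    length = max(len(s) for s in signals)
    out = [0.0] * length
    for s in signals:
        for i, v in enumerate(s):
            out[i] += v
    return out
```

In the sigA/sigB example above, `mix([sigA, sigB])` would produce the grid's sigC.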
Specifically, after obtaining the distance-sensing audio signals, the server may mix the distance-sensing audio signals of the sound sources in each grid area to obtain a mixed audio signal for that grid area. Different grid division strategies can be preset, and the spatial network containing the target object is divided according to the preset strategy to obtain a spatial network containing multiple grid areas. The server may construct the corresponding spatial network centered on the target object, for example, a spherical spatial network. It is understood that the spatial network in the embodiments of the present application is not limited to a spherical network; it may also be, for example, a hemispherical network or a network of another, irregular geometry.
And step 206, determining the center of the sound image based on the position information and the audio weight value of each sound source in the grid area.
The audio weight value is the proportion of the energy value of the distance sensing audio signal of each sound source in an audio frame to the total energy value of the distance sensing audio signals of all sound sources in that audio frame, and may also be referred to as an energy weighting coefficient. The audio signal is a one-dimensional signal, and the energy value is the sum of the squares of all samples within a certain time window length. For example, if the 2 sound sources in the same grid area are sound source 1 and sound source 2, the energy value of the distance sensing audio signal of the current frame of sound source 1 is e1, and the energy value of the distance sensing audio signal of the current frame of sound source 2 is e2, then the audio weight value of sound source 1 for the current frame is e1 ÷ (e1 + e2), and the audio weight value of sound source 2 for the current frame is e2 ÷ (e1 + e2).
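The per-frame weight computation can be sketched as below, assuming each source contributes one list of samples for the current frame; `audio_weights` is a hypothetical helper, not a name from the patent:

```python
def audio_weights(frames):
    """Audio weight value of each sound source in a grid area for one frame:
    the source's frame energy divided by the total frame energy of all sources."""
    # energy of a one-dimensional signal: sum of squared samples in the window
    energies = [sum(s * s for s in f) for f in frames]
    total = sum(energies)
    if total == 0:
        # silent frame: split weights evenly (an assumption, not from the patent)
        return [1.0 / len(frames)] * len(frames)
    return [e / total for e in energies]

# sound source 1 with energy e1 = 2, sound source 2 with energy e2 = 8
w1, w2 = audio_weights([[1.0, -1.0], [2.0, 2.0]])  # 0.2 and 0.8
```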
The sound image center refers to a position corresponding to the coordinate value of the center calculated by weighting the coordinate information of all the sound sources in each grid area and the audio weight value. The center of the sound image corresponding to each grid region may be the same or different. The sound image center in the embodiment of the present application may be a point in the spatial network, that is, a position corresponding to a central coordinate value is a coordinate point, and the coordinate point is a sound image center point.
Specifically, after the server performs audio mixing processing on the distance-sensitive audio signals of each sound source in the same grid area to obtain the audio-mixed audio signal corresponding to each grid area, the server may determine the sound image center based on the position information and the audio weight value of each sound source in the grid area. Since there are multiple mesh areas, the server may determine the sound image center corresponding to each mesh area based on the position information and the audio weight value of all the sound sources in each mesh area. It is understood that the position information of the sound source in the present embodiment may be three-dimensional rectangular coordinate information of the sound source with respect to the target object.
And step 208, performing stereo reconstruction on the audio-mixed audio signal of the grid area based on the sound image center to obtain a stereo audio signal.
The stereo reconstruction refers to performing stereo reconstruction processing by using a stereo reconstruction technique, and the stereo reconstruction technique may include a plurality of virtual stereo reconstruction techniques, for example, in this embodiment, a Head Related Transfer Function (HRTF) virtual stereo reconstruction technique may be used to perform stereo reconstruction processing.
The stereo audio signal is the signal obtained by performing stereo reconstruction on the mixed audio signal of each grid region. The stereo audio signal in the embodiment of the present application may be a binaural stereo audio signal; for example, a signal a2 is obtained by performing HRTF stereo reconstruction on the mixed audio signal a1 of grid 1, and the signal a2 is a binaural stereo audio signal.
Specifically, after the server determines the sound image center based on the position information and the audio weight value of each sound source in the mesh area, the server may perform stereo reconstruction on the mixed audio signal of each mesh area based on the sound image center corresponding to each mesh area to obtain a stereo audio signal corresponding to each mesh area. The stereo reconstruction may adopt an HRTF processing method, that is, the server may perform HRTF processing on the mixed audio signal of each mesh region based on the sound image center of each mesh region, so as to obtain a stereo audio signal corresponding to each mesh region.
For example, assuming that after the space where the target object is located is meshed based on a preset meshing strategy, the grid areas included in the spatial grid are grid 1 and grid 2, the server may determine, based on the three-dimensional rectangular coordinate information and the audio weight value of each sound source in grid 1, that the sound image center corresponding to the spatial area of grid 1 is point A1; and the server determines, based on the three-dimensional rectangular coordinate information and the audio weight value of each sound source in the area of grid 2, that the sound image center corresponding to the spatial area of grid 2 is point A2. Further, the server may perform stereo reconstruction on the mixed audio signal of grid 1 and the mixed audio signal of grid 2 based on the sound image center of each grid region, respectively, to obtain the stereo audio signal of grid 1 and the stereo audio signal of grid 2. The server can look up the HRTF excitation signal corresponding to the position information of the sound image center of each grid area in the related HRTF table, and convolve the mixed audio signal of each grid area with the looked-up HRTF excitation signal to obtain the stereo audio signal of each grid area. For example, the server may look up the HRTF excitation signal h(1) corresponding to the position information of point A1 in the HRTF table, and perform convolution processing on the mixed audio signal of grid 1 with h(1), so as to obtain the stereo audio signal of grid 1.
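The lookup-and-convolve step can be sketched as follows. The impulse responses here stand in for the HRTF excitation signals looked up from the table at the sound image center's position; the tiny filters and the function names are illustrative only:

```python
def convolve(sig, h):
    """Direct-form convolution of a mono signal with an impulse response."""
    out = [0.0] * (len(sig) + len(h) - 1)
    for i, s in enumerate(sig):
        for j, hv in enumerate(h):
            out[i + j] += s * hv
    return out

def reconstruct_stereo(mix, hrir_left, hrir_right):
    """Binaural reconstruction of one grid area's mixed signal: convolve the
    mono mix with the left-ear and right-ear impulse responses found in the
    HRTF table for the grid's sound image center direction (assumed given)."""
    return convolve(mix, hrir_left), convolve(mix, hrir_right)

# mixed audio signal a1 of grid 1 -> two-channel stereo signal a2
left, right = reconstruct_stereo([1.0, 2.0], hrir_left=[1.0], hrir_right=[0.5])
```

A real implementation would use measured HRIR pairs many samples long and an FFT-based convolution for speed; the direct loop keeps the sketch self-contained.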
Step 210, performing audio mixing processing on the left channel audio signal and the right channel audio signal of all the stereo audio signals respectively to obtain stereo audio mixing audio signals.
The stereo mixed audio signal is the audio signal obtained by mixing the stereo audio signals of different grid areas. The stereo mixed audio signal may include a left channel mixed audio signal and a right channel mixed audio signal. For example, if the grid regions included in the spatial grid are grid 1, grid 2, and grid 3, a sound source exists in grid 1 and grid 2, and no sound source exists in grid 3, the server may mix the left channel signals of the stereo audio signals of the spatial regions of grid 1 and grid 2, thereby obtaining the left channel mixed audio signal of the stereo mixed audio signal of the multi-grid spatial region.
Specifically, after the server performs stereo reconstruction on the audio-mixed audio signal of each grid area based on the sound image center corresponding to each grid area to obtain the stereo audio signal corresponding to each grid area, the server may perform audio-mixing processing on the left-channel audio signal and the right-channel audio signal of the stereo audio signal corresponding to each grid area, respectively, so as to obtain stereo audio-mixed audio signals corresponding to a plurality of grid areas. The mixing processing may adopt various mixing processing methods, for example, processing methods such as direct addition, averaging, clamping, normalization, adaptive mixing weighting, and auto-alignment algorithm. In the embodiment of the application, an averaging method can be adopted to perform sound mixing processing on the left channel audio signals and the right channel audio signals of all stereo audio signals respectively, that is, the left channel signals and the right channel signals of all grid areas with sound source signals are subjected to superposition and summation and then averaged to obtain the left channel sound mixing audio signals and the right channel sound mixing audio signals of the stereo sound mixing audio signals which are finally output.
Taking the averaging method as an example: assuming the grid regions included in the spatial grid are grid 1, grid 2, and grid 3, that a sound source exists in grid 1 and grid 2, and that no sound source exists in grid 3, the server may perform stereo reconstruction on the mixed audio signal of grid 1 and the mixed audio signal of grid 2 based on the sound image center corresponding to each grid region, respectively, to obtain the stereo audio signal S1 of grid 1 and the stereo audio signal S2 of grid 2. Since no sound source exists in grid 3, the server mixes only the stereo audio signals of grid 1 and grid 2 to obtain the stereo mixed audio signal of the multi-grid spatial region. Using the averaging method, the server superposes and sums the left channel signals of grid 1 and grid 2 (the grids where sound source signals exist) and then averages them, and does the same for the right channel signals, to obtain the left channel mixed audio signal and the right channel mixed audio signal of the finally output stereo mixed audio signal. That is, the server calculates the sum of the left channel audio signals of all grid areas with sound source signals and takes the average as the left channel mixed audio signal of the final output; meanwhile, the server calculates the sum of the right channel audio signals of all grid areas with sound source signals and takes the average as the right channel mixed audio signal of the final output.
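The averaging mixdown described above can be sketched as follows; `mix_grids` is a hypothetical helper, and only the grids that actually contain sound sources are passed in:

```python
def mix_grids(stereo_signals):
    """Average-method mixdown: sum and average the left channels of all
    grid areas that contain a sound source, and likewise the right channels.
    stereo_signals: list of (left, right) sample-list pairs, one per grid."""
    n = len(stereo_signals)
    length = len(stereo_signals[0][0])
    left = [sum(pair[0][i] for pair in stereo_signals) / n for i in range(length)]
    right = [sum(pair[1][i] for pair in stereo_signals) / n for i in range(length)]
    return left, right

# grid 1's stereo signal S1 and grid 2's stereo signal S2 (grid 3 is empty)
s1 = ([1.0, 3.0], [0.0, 2.0])
s2 = ([3.0, 5.0], [2.0, 4.0])
out_left, out_right = mix_grids([s1, s2])  # ([2.0, 4.0], [1.0, 3.0])
```

Averaging rather than plain summing keeps the output level bounded regardless of how many grid areas contribute, which is why the text singles it out.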
In the audio generation method, distance sensing processing is performed on the audio signal of each sound source based on the distance value between the sound source and the target object to obtain a distance sensing audio signal; the distance sensing audio signals of the sound sources in the same grid area are mixed to obtain a mixed audio signal corresponding to each grid area; the sound image center is determined based on the position information and the audio weight value of each sound source in the grid area; stereo reconstruction is performed on the mixed audio signal of the grid area based on the sound image center to obtain a stereo audio signal; and the left channel audio signals and the right channel audio signals of all the stereo audio signals are respectively mixed to obtain the stereo mixed audio signal. Because the multiple sound sources in the same grid area are regarded as a whole, that is, they are mixed after distance sensing correction into a new single sound source, the sound image center of each grid area is calculated by weighting over the multiple sound sources, and stereo reconstruction is carried out with the sound image center as the coordinate position of the new sound source. This integrated processing of multiple sound sources in the same grid area greatly reduces the number of stereo reconstruction operations, so the calculation overhead can be greatly reduced without affecting the user experience; ordinary equipment can therefore perform the processing in real time, effectively reducing the calculation overhead of generating stereo audio signals while ensuring the user's perception experience.
In one embodiment, the method further comprises:
constructing a target space network by taking a target object as a center;
and carrying out uniform meshing or non-uniform meshing on the target space network to obtain the target space network comprising the mesh area.
The spatial network refers to a spatial region centered on the target object, and the spatial network may include multiple types of spatial networks, for example, a spherical spatial network centered on the target object.
Mesh division divides the spatial network into a number of small cells. Meshing is an important factor in the preprocessing stage of finite element analysis: how well the mesh matches the computation target, together with the quality of the mesh, determines the quality of the subsequent finite element computation. For example, mesh division may include structured meshing and unstructured meshing, and the meshing may be carried out in various ways, such as directly using an open-source meshing tool or a custom meshing method.
Uniform meshing means that all areas of the spatial network are divided using grids of the same size, so the density of the resulting grids is uniform; non-uniform meshing means that grids of different sizes are used in different areas of the spatial network, so the density of the resulting grids is non-uniform. For example, in a non-uniform grid, some regions are sparsely meshed and some regions are densely meshed. Grid density means using grids of different sizes in different areas in order to adapt to the distribution characteristics of the computed data. In regions where the computed data changes with a large gradient (for example, the target object's sound source localization accuracy is high in the horizontal direction and directly in front), a relatively dense grid is required to better reflect how the data changes. In regions where the computed data changes with a small gradient (for example, the target object's localization accuracy is low for directions above and behind), a relatively sparse grid should be used to reduce the model size. The structure of the whole grid therefore shows different mesh division forms.
Specifically, the server may construct a target spatial network with the target object as the center, and perform uniform meshing or non-uniform meshing on the target spatial network to obtain a target spatial network including a plurality of grid areas. Because the human ear has high sound source localization accuracy in the horizontal direction, with high accuracy in front, low accuracy behind, low accuracy above, and the lowest accuracy below, the server can, based on these characteristics of sound source localization, finely mesh the region horizontally in front of the target object (assuming the horizontal front direction covers latitude -45 to +45 degrees and longitude -90 to +90 degrees) to obtain more grid areas, while the grid regions in other directions can be divided sparsely, yielding fewer grid areas.
For example, as shown in fig. 3, an effect diagram of constructing a spherical spatial network with the target object as the sphere center is shown. The server can construct a spherical spatial network with the target object as the sphere center, and perform uniform meshing or non-uniform meshing on the spherical spatial network according to the longitude and latitude information to obtain a spherical spatial network including different grid areas. Fig. 4 is a schematic diagram of the grid management of multiple sound sources. In fig. 4, a spherical space is constructed with the listener as the sphere center, and grid division is performed according to longitude and latitude; for example, the spherical spatial network includes grid 1 and grid 2, where grid 1 is defined as the area of 20-30 degrees east longitude and 0-20 degrees south latitude, and grid 2 is defined as the area of 20-30 degrees east longitude and 20-40 degrees south latitude. The server can subsequently calculate longitude and latitude information from the three-dimensional rectangular coordinate information of different sound sources relative to the target object, and thus determine the grid area to which each sound source belongs, realizing the integrated processing of multiple sound sources in the same grid area and paving the way for reducing the number of stereo reconstruction operations, which greatly reduces the calculation overhead without affecting the user experience.
In one embodiment, as shown in fig. 5, the target spatial network is a spherical network, and the method further includes a step of determining a mesh area to which the sound source belongs based on latitude and longitude information, and specifically includes:
step 502, obtaining coordinate information of a sound source.
And step 504, when the coordinate information is three-dimensional rectangular coordinate information, converting the coordinate information of the sound source into longitude and latitude information of a spherical network.
Step 506, in the spherical network, determining the grid area to which the sound source belongs based on the longitude and latitude information.
Here, the coordinate information of the sound source refers to coordinate information of the sound source with respect to the target object.
Specifically, when the server constructs a spherical spatial network with the target object as the center of the sphere, and performs uniform meshing or non-uniform meshing on the spherical spatial network according to the longitude and latitude information to obtain a spherical spatial network including different mesh regions, the server can acquire the coordinate information of each sound source. When the coordinate information of a sound source is three-dimensional rectangular coordinate information, the server converts the coordinate information of the sound source into the longitude and latitude information of the spherical network. As shown in fig. 6, a schematic diagram of converting rectangular coordinate information into longitude and latitude information is shown. In the spherical network, the server may determine the mesh region to which the sound source belongs based on the latitude and longitude information. Assuming that the three-dimensional rectangular coordinate value of a certain sound source is [x, y, z], the conversion formulas for the longitude and latitude information are shown in the following formulas (1) and (2):
la = arctan(y / x)  (1)
lo = arctan(z / √(x² + y²))  (2)
in the above equations (1) and (2), la represents longitude, lo represents latitude, arctan represents arctan function, x represents an x-axis coordinate value of the sound source with respect to the target object, y represents a y-axis coordinate value of the sound source with respect to the target object, and z represents a z-axis coordinate value of the sound source with respect to the target object.
For example, assume that the server acquires the coordinate information of a sound source a as (xa, ya, za). The server may convert the three-dimensional rectangular coordinate information of sound source a into the longitude and latitude information la and lo of the spherical network according to formulas (1) and (2) above. Assume the server thereby obtains the longitude and latitude information of sound source a as la = 20 degrees east longitude and lo = 15 degrees south latitude. In the spherical network, the grid area corresponding to grid 1 is defined as the area of 20-30 degrees east longitude and 0-20 degrees south latitude, and the grid area corresponding to grid 2 is defined as the area of 20-30 degrees east longitude and 20-40 degrees south latitude. The server may then determine, based on the longitude and latitude information (east longitude 20 degrees, south latitude 15 degrees), that the grid area to which sound source a belongs is grid 1.
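The conversion and grid lookup can be sketched as follows, following formulas (1) and (2). `atan2` is used instead of a plain arctangent so that all quadrants are handled, and the grid ranges are the illustrative ones from the example, with negative latitude standing for south latitude; both function names are hypothetical:

```python
import math

def to_lon_lat(x, y, z):
    """Formulas (1) and (2): longitude la = arctan(y / x),
    latitude lo = arctan(z / sqrt(x^2 + y^2)), returned in degrees."""
    la = math.degrees(math.atan2(y, x))
    lo = math.degrees(math.atan2(z, math.hypot(x, y)))
    return la, lo

def find_grid(la, lo, grids):
    """grids maps a grid name to (lon_min, lon_max, lat_min, lat_max) in
    degrees; returns the grid whose range contains (la, lo), else None."""
    for name, (la0, la1, lo0, lo1) in grids.items():
        if la0 <= la < la1 and lo0 <= lo < lo1:
            return name
    return None

# grid 1: 20-30 degrees east longitude, 0-20 degrees south latitude (lat -20..0)
grids = {"grid 1": (20, 30, -20, 0), "grid 2": (20, 30, -40, -20)}
find_grid(20, -15, grids)  # → "grid 1"
```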
In this embodiment, different grid regions are defined based on the differences in the human ear's perception accuracy in different directions, so the grid region to which each sound source belongs can be determined from the longitude and latitude coordinates corresponding to that sound source. That is, through the grid management of sound sources in different directions, multiple sound sources in the same grid can be processed as a whole, and the calculation overhead can be greatly reduced without affecting the user experience.
In one embodiment, the step of performing uniform meshing or non-uniform meshing on the target spatial network to obtain the target spatial network including the mesh region includes:
carrying out uniform grid division on the target space network to obtain a target space network comprising grid areas; alternatively, the first and second electrodes may be,
determining audio perception accuracy values of the target object to different directions, wherein the audio perception accuracy values of the target object to the different directions are different; and carrying out grid division on the target space network according to the audio perception precision value to obtain the target space network comprising grid areas.
The audio perception accuracy value refers to how precisely the human ear can identify each direction in space. The human ear identifies the direction of a sound based on sound pressure differences, phase differences, and the differences produced when sound reflects off the listener's own body; these differences are very slight in the signal, and only a sufficiently complex ear structure, or long-term high-intensity training, can improve the accuracy of direction identification. In addition, from the perspective of evolution, humans rely more on vision to distinguish direction, and hearing has weakened in this ability, so the human ear's ability to identify direction angles is relatively limited. Based on this characteristic, namely that the human ear's audio perception accuracy value differs for different directions, the embodiments of the present application adopt a grid processing approach and reduce the cost through the integrated processing of multiple sound sources in the same grid area.
Specifically, after the server constructs a target space network with a target object as a center, the server can perform uniform grid division on the target space network to obtain a target space network including a grid area; or the server can determine the audio perception accuracy values of the target object to different directions, wherein the audio perception accuracy values of the different directions are different; the server can perform grid division on the target space network according to the audio perception precision values of different directions to obtain the target space network comprising a plurality of grid areas.
For example, assuming that human ears have high sound source localization accuracy in the horizontal direction, with high accuracy in front, low accuracy behind, low accuracy above, and the lowest accuracy below, the server may determine the audio perception accuracy values of the target object for different directions and mesh the target spatial network according to those values. As shown in fig. 4, assuming the horizontal front region of the target object is the region from latitude -45 to +45 degrees and longitude -90 to +90 degrees, the server may finely mesh that region to obtain more grid areas. For example, the server may perform grid division at a preset grid spacing of 30 degrees and divide the region of latitude -45 to +45 degrees and longitude -90 to +90 degrees into 15 grids, that is, 15 grid areas may be obtained; for other directions, the server can perform grid division at a preset grid spacing of 60 degrees, so the other latitude and longitude areas are divided into sparse grids and fewer grid areas are obtained. For example, dividing at a preset grid spacing of 60 degrees, the server divides an area of latitude 45 to 105 degrees and longitude 90 to 150 degrees into 4 grids, so as to obtain 4 grid areas.
Therefore, based on the characteristics of human ear resolution capability, different strategies are adopted for grid division, namely the grids of the areas with high human ear perception accuracy can be divided more finely, the grids of the areas with weaker human ear perception accuracy can be divided more sparsely, so that the integrated processing of multiple sound sources in the same grid is realized on the premise of not influencing user experience, the calculation cost is greatly reduced, and the requirement that common equipment can process and generate stereo audio signals in real time can be met.
In one embodiment, the step of performing distance sensing processing on an audio signal of a sound source based on a distance value between the sound source and a target object to obtain a distance sensing audio signal includes:
determining a gain value corresponding to the sound source based on the distance value between the sound source and the target object;
and based on the gain value, carrying out attenuation processing or amplification processing on the audio signal of the sound source to obtain a distance sensing audio signal of the sound source.
Gain refers to the degree to which a current, voltage, or power is increased by a component, circuit, device, or system. The gain value is an amplification factor, i.e., the gain value indicates the degree to which the signal is amplified, and a larger gain value means stronger amplification of the signal. For example, the gain value may be set to 2 to the power 0, 2 to the power 2/3, 2 to the power 4/3, 2 to the power 2, and the like. The gain value in the embodiment of the present application may be a preset fixed gain value, or a method of calculating the gain value may be preset so that the server automatically generates the corresponding gain value according to that method; the calculation method and generation method of the gain value are not limited here.
Specifically, when the server performs distance sensing processing on the audio signal of each sound source based on the distance value between the sound source and the target object, the server may determine a gain value corresponding to each sound source based on the distance value between the sound source and the target object, and perform attenuation processing or amplification processing on the audio signal of each sound source based on the gain value, so as to obtain the distance sensing audio signal of each sound source. The distance value and the gain value between the sound source and the target object have a mapping relationship, for example, a preset gain value function includes a corresponding relationship between the distance value and the gain value. When the target space network constructed with the target object as the center is a spherical space network, the distance value between each sound source and the target object can be obtained by using the following formula:
d = √(x² + y² + z²)  (3)
in the above formula (3), d represents a distance value between the sound source and the target object, and x, y, and z represent coordinate information of the sound source with respect to the target object.
For example, assuming that the server calculates the distance value d between the sound source a and the target object a based on the above formula (3) in the globalized space network, the server may input the distance value d into a preset gain value function or determine the gain value g1 corresponding to the sound source a based on a preset gain value determination method. Further, the server may perform attenuation processing or amplification processing on the audio signal sigA1 carried by the sound source a according to the principle that the sound volume attenuates with distance based on the obtained gain value g1, so as to obtain the distance-sensitive audio signal sigA2 of the sound source a.
In this embodiment, after the distance sensing correction is performed on the plurality of sound sources in the same grid, a more accurate distance-sensing corrected audio signal can be obtained, so that audio mixing processing can be subsequently performed on the distance-sensing corrected audio signal in the same grid area, a more accurate new sound source signal can be obtained, and the accuracy of the audio signal after audio mixing processing is effectively ensured.
In one embodiment, the gain value is a distance gain value, and the step of determining the gain value corresponding to the sound source based on the distance value between the sound source and the target object includes:
acquiring a reference distance value and a gain upper limit value;
and determining a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value and the distance value.
The reference distance value refers to a reference distance set in advance according to empirical values.
The gain upper limit value is the maximum gain value and is used to prevent the signal from overflowing due to an excessively large gain. The gain upper limit value may be a maximum gain value set in advance according to experience.
Specifically, when the server performs distance sensing processing on the audio signal of each sound source based on the distance value between the sound source and the target object, the server may obtain a preset reference distance value and a preset gain upper limit value, and determine the distance gain value corresponding to each sound source based on the reference distance value, the preset gain upper limit value, and the distance value between each sound source and the target object. When the target space network constructed by taking the target object as the center is a spherical space network, determining the distance gain value corresponding to each sound source can utilize the following formula:
g = min(Gmax, (d0 / d)²)  (4)
where g in the above equation (4) represents a distance gain value, min represents a minimum value, Gmax represents an upper limit value of the gain, d0 represents a reference distance value, and d represents a distance value between the sound source and the target object.
That is, when the server performs distance sensing processing on the audio signal of each sound source, the server can obtain the distance sensing audio signal of each sound source by using the following formula, in view of the fact that the sound volume is inversely proportional to the square of the distance value:
Sig0 × g = Sig1  (5)
in the above equation (5), g represents a distance gain value, Sig0 represents an audio signal of each sound source, and Sig1 represents a distance-sensitive audio signal of each sound source.
For example, assuming that the audio signal of the sound source a is Sig0, and the distance gain value of the sound source a is g1 obtained by the server according to the above equation (4), the server may multiply the audio signal Sig0 of the sound source a by the distance gain value g1 based on the above equation (5), so as to obtain the distance-sensitive audio signal Sig1 of the sound source a. Therefore, after the distance sensing correction is carried out on the plurality of sound sources in the same grid, more accurate audio signals after the distance sensing correction can be obtained, the audio signals after the distance sensing correction in the same grid area can be subjected to audio mixing processing subsequently, more accurate new audio mixing signals can be obtained, and the accuracy of the audio signals after the audio mixing processing is effectively guaranteed.
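Formulas (4) and (5) can be sketched as below. The squared distance ratio follows the statement that volume falls off with the square of the distance (the exact exponent is an assumption here), and the default values for d0 and Gmax are illustrative, not values fixed by the patent:

```python
def distance_gain(d, d0, g_max):
    """Formula (4): the gain grows as the source comes closer than the
    reference distance d0, and is clamped at Gmax to avoid signal overflow."""
    return min(g_max, (d0 / d) ** 2)

def apply_distance_sense(sig, d, d0=1.0, g_max=4.0):
    """Formula (5): Sig1 = Sig0 x g, i.e. attenuate or amplify the source's
    audio signal according to its distance gain."""
    g = distance_gain(d, d0, g_max)
    return [s * g for s in sig]

apply_distance_sense([1.0, 2.0], d=2.0)   # attenuated: [0.25, 0.5]
apply_distance_sense([1.0], d=0.1)        # clamped at Gmax: [4.0]
```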
In one embodiment, the step of determining the center of the sound image based on the position information and the audio weight value of each sound source in the mesh area includes:
obtaining coordinate values and audio weight values of all sound sources in the grid area;
determining a central coordinate value of the grid region based on the coordinate value and the audio weight value;
and taking the position corresponding to the central coordinate value as the sound image center of the grid area.
The central coordinate value is the coordinate value corresponding to the sound image center of each grid region.
Specifically, after the server performs audio mixing processing on the distance-sensitive audio signals of the sound sources in the same grid area to obtain the audio-mixed audio signal corresponding to each grid area, the server may obtain coordinate values and audio weight values of each sound source in each grid area, determine central coordinate values of each grid area based on the coordinate values and the audio weight values, and use a position corresponding to the central coordinate values as a sound image center of each grid area.
For example, assuming that the position information of sound source A and sound source B are both in the spatial area corresponding to grid 1, the server may determine the sound image center of grid 1 as C(xc, yc, zc) based on the three-dimensional rectangular coordinate information A(xa, ya, za) and audio weight value a1 of sound source A, and the three-dimensional rectangular coordinate information B(xb, yb, zb) and audio weight value a2 of sound source B. Therefore, a plurality of sound sources in the same grid area can be regarded as a whole: the sound sources in the same grid area are mixed after distance sensing correction to obtain a new single sound source, and the sound image center of each grid area is determined based on the coordinate information and audio weights of the sound sources in that area. Performing stereo reconstruction with the sound image center as the coordinate position of the new sound source realizes integrated processing of the multiple sound sources in the same grid area and greatly reduces the number of stereo reconstruction operations, so the calculation overhead can be greatly reduced without affecting the user experience. This allows common equipment to meet the requirement of generating the stereo audio signal in real time, and the perception experience of the user can be guaranteed while the calculation overhead of generating the stereo audio signal is effectively reduced.
In one embodiment, the coordinate values of each sound source include a first direction coordinate value, a second direction coordinate value and a third direction coordinate value, and the step of determining the center coordinate value of the grid region based on the coordinate values and the audio weight value includes:
determining a first direction coordinate value of a sound image center of the grid area based on the first direction coordinate value and the audio weight value of each sound source in the grid area;
determining a second direction coordinate value of the sound image center of the grid area based on the second direction coordinate value and the audio weight value of each sound source in the grid area;
and determining the third direction coordinate value of the sound image center of the grid area based on the third direction coordinate value and the audio weight value of each sound source in the grid area.
The first direction coordinate value, the second direction coordinate value and the third direction coordinate value may be coordinate values corresponding to three different directions. For example, if the first direction is an x-axis direction, the second direction is a y-axis direction, and the third direction is a z-axis direction, the first direction coordinate value is an x-axis coordinate value, the second direction coordinate value is a y-axis coordinate value, and the third direction coordinate value is a z-axis coordinate value.
Specifically, after the server performs audio mixing processing on the distance-sensitive audio signals of the sound sources in the same grid area to obtain the audio-mixed audio signal corresponding to each grid area, the server may obtain coordinate values of each sound source in each grid area relative to the target object and an audio weight value of each sound source, determine a center coordinate value of each grid area based on the coordinate values and the audio weight values, and use a position corresponding to the center coordinate value as a sound image center of each grid area. That is, the server may determine the first direction coordinate value of the sound image center of each mesh region based on the first direction coordinate value and the audio weight value of each sound source in each mesh region; the server determines a second direction coordinate value of the sound image center of each grid area based on the second direction coordinate value and the audio weight value of each sound source in each grid area; and the server determines the third direction coordinate value of the sound image center of each grid area based on the third direction coordinate value and the audio weight value of each sound source in each grid area.
Assuming that M sound sources are in the same grid area, the energy value of the distance sensing audio signal of each sound source in the current frame is e (i), and i is the serial number of each sound source in the current grid area, the coordinate value of the position corresponding to the sound image center of the grid is:
xc = a(1)×x(1) + a(2)×x(2) + … + a(M)×x(M)
yc = a(1)×y(1) + a(2)×y(2) + … + a(M)×y(M)
zc = a(1)×z(1) + a(2)×z(2) + … + a(M)×z(M) (6)
wherein a(i) in the above formula (6) is the audio weight value, x(i), y(i), and z(i) are the three-dimensional rectangular coordinate values of the i-th sound source in the grid relative to the target object, and xc, yc, zc are the three-dimensional rectangular coordinate values of the sound image center of the grid.
For example, assuming that the 2 sound sources in the same grid area are respectively sound source 1 and sound source 2, the first direction coordinate value of sound source 1 is x(1), the second direction coordinate value is y(1), the third direction coordinate value is z(1), and the audio weight value of the current frame is a(1); the first direction coordinate value of sound source 2 is x(2), the second direction coordinate value is y(2), the third direction coordinate value is z(2), and the audio weight value of the current frame is a(2); then the coordinate values of the sound image center of the grid are: xc = a(1)×x(1) + a(2)×x(2), yc = a(1)×y(1) + a(2)×y(2), and zc = a(1)×z(1) + a(2)×z(2).
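The weighted sums of formula (6) can be sketched as follows; the two sources and their weight values are hypothetical illustration values:

```python
def sound_image_center(sources):
    """Weighted sound image center of one grid per formula (6).

    `sources` is a list of (a_i, (x, y, z)) pairs, where a_i is the audio
    weight value of the i-th source; the weights are assumed to sum to 1.
    """
    xc = sum(a * p[0] for a, p in sources)
    yc = sum(a * p[1] for a, p in sources)
    zc = sum(a * p[2] for a, p in sources)
    return (xc, yc, zc)

# Two sources in the same grid with weights 0.75 and 0.25 (made-up values).
center = sound_image_center([(0.75, (1.0, 2.0, 0.0)),
                             (0.25, (3.0, -2.0, 4.0))])
# center == (1.5, 1.0, 1.0): pulled toward the higher-energy source.
```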
In this embodiment, the sound image center of each mesh area may be determined based on the coordinate information and the audio weights of the multiple sound sources in each mesh area, and stereo reconstruction is performed with the sound image center as the coordinate position of the new sound source. Integrated processing of the multiple sound sources in the same mesh area is thereby realized, and the number of stereo reconstruction operations is greatly reduced, so the calculation overhead can be greatly reduced without affecting the user experience. This allows common devices to generate a stereo audio signal in real time, and the perception experience of the user can be guaranteed while the calculation overhead of generating the stereo audio signal is effectively reduced.
In one embodiment, the method further comprises:
acquiring audio energy values of all sound sources in a grid area;
determining a total audio energy value of all sound sources in the grid area based on the audio energy values;
and determining the audio weight value of each sound source in the grid area based on the audio energy value and the total audio energy value.
The audio energy value refers to an energy value of a distance-sensitive audio signal of a current frame of each sound source, for example, the energy value of the distance-sensitive audio signal of the current frame of the sound source 1 is e (1), and the energy value of the distance-sensitive audio signal of the current frame of the sound source 2 is e (2).
The total audio energy value refers to the total audio energy value of the distance-sensitive audio signals of all the sound sources in the grid area. For example, if the grid 1 includes the sound source 1 and the sound source 2, the total audio energy value of the grid 1 is the sum of the energy value e (1) of the distance sensing audio signal of the current frame of the sound source 1 and the energy value e (2) of the distance sensing audio signal of the current frame of the sound source 2, that is, the total audio energy value is e (1) + e (2).
Specifically, the server may obtain the audio energy value of the distance-sensing audio signal of each sound source in each mesh area, and determine the total audio energy value of the distance-sensing audio signals of all the sound sources in each mesh area based on the audio energy value of the distance-sensing audio signal of each sound source. Further, the server may determine an audio weight value of the distance-sensitive audio signal of each sound source in each mesh area based on the audio energy value of the distance-sensitive audio signal of each sound source in each mesh area and the total audio energy value of the distance-sensitive audio signal of each mesh area.
Assuming that M sound sources are in the same grid area, the energy value of the distance sensing audio signal of each sound source in the current frame is e (i), i is the serial number of each sound source in the current grid area, and the audio weight value a (i) of the distance sensing audio signal of each sound source can be obtained by using the following formula:
a(i) = e(i) / (e(1) + e(2) + … + e(M)) (7)
wherein a(i) in the above formula (7) is the audio weight value of the distance-sensitive audio signal of the i-th sound source, e(i) is the energy value of the distance-sensitive audio signal of the current frame of each sound source, and the denominator e(1) + e(2) + … + e(M) is the total audio energy value of the distance-sensitive audio signals of all sound sources within the grid area.
For example, if the 2 sound sources in the same grid area are respectively sound source 1 and sound source 2, the energy value of the distance-sensitive audio signal of the current frame of sound source 1 is e(1), and the energy value of the distance-sensitive audio signal of the current frame of sound source 2 is e(2), then the audio weight value of the distance-sensitive audio signal of sound source 1 is a(1) = e(1) / (e(1) + e(2)), and the audio weight value of the distance-sensitive audio signal of sound source 2 is a(2) = e(2) / (e(1) + e(2)).
Therefore, the total audio energy value of the distance-sensitive audio signals of all sound sources in each grid area can be determined from the audio energy value of each sound source, and the audio weight value of each sound source in the grid area can be dynamically determined from the audio energy value and the total audio energy value. The coordinate position of the sound image center of the current frame of each grid area can then be dynamically determined from the audio weight values and coordinate values of the sound sources in that area, and stereo reconstruction is performed with the sound image center as the coordinate position of the new sound source. In this way, integrated processing of the multiple sound sources in the same grid area is realized and the number of stereo reconstruction operations is greatly reduced, so the calculation overhead can be greatly reduced without affecting the user experience, common equipment can meet the requirement of real-time processing, and the perception experience of the user can be guaranteed while the calculation overhead of stereo audio signal generation is effectively reduced.
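The weight computation of formula (7) can be sketched as follows; the energy values are hypothetical illustration values:

```python
def audio_weights(energies):
    """Per-source weights a(i) = e(i) / total energy, per formula (7).

    `energies` holds the current-frame energy e(i) of each source's
    distance-sensitive signal within one grid area.
    """
    total = sum(energies)
    return [e / total for e in energies]

# Two sources in one grid: source 1 carries 3x the energy of source 2.
w = audio_weights([3.0, 1.0])
# w == [0.75, 0.25]; the weights always sum to 1.
```

Because the weights are recomputed per frame, a source that falls silent stops pulling the sound image center toward itself.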
In one embodiment, the step of performing stereo reconstruction on the mixed audio signal of the grid region based on the center of the sound image to obtain a stereo audio signal includes:
searching an excitation signal corresponding to the center of the sound image;
and constructing a stereo audio signal based on the excitation signal and the mixed audio signal of the grid area.
The excitation signal in this application may be an HRIR (Head-Related Impulse Response) excitation signal, which may be obtained by looking up an associated HRIR table.
Specifically, after the server determines the sound image centers of the grid regions based on the position information and the audio weight values of the sound sources in the grid regions, the server may search the HRIR excitation signals corresponding to the sound image centers, and construct the stereo audio signals corresponding to the grid regions based on the HRIR excitation signals and the audio mixing signals of the grid regions. The server may search, according to the position information of each sound image center, an HRIR excitation signal that matches the position information of the sound image center from the relevant HRIR table, that is, in the embodiment of the present application, there is a mapping relationship between the HRIR excitation signal and the position information of the sound image center.
For example, assuming that 2 sound sources are respectively a sound source 1 and a sound source 2 in the same grid area, after the server determines that the position information of the sound image center S1 of the grid 1 is (longitude 60, latitude 0) based on the position information and the audio weight value of the sound source 1 and the sound source 2 in the grid 1, the server may search for the HRIR excitation signal corresponding to the position information of the sound image center S1, and assuming that the HRIR excitation signal corresponding to (longitude 60, latitude 0) stored in the related HRIR table is h (1), the server may construct the stereo audio signal sig3 corresponding to the grid 1 based on the HRIR excitation signal h (1) and the mixed audio signal sig2 in the grid area. Therefore, the calculation cost for generating the stereo audio signal can be effectively reduced based on the web grid processing method, and real-time processing is realized.
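The table lookup described above can be sketched as follows. The table contents, the 15-degree quantization step, and the two-tap impulse responses are all made up for illustration; real HRIR tables are measured datasets with much longer responses:

```python
# Hypothetical HRIR table keyed by (longitude, latitude) of the sound image
# center, quantized to the table's angular resolution. Each entry holds a
# (left IR, right IR) pair of made-up filter taps.
HRIR_TABLE = {
    (60, 0): ([0.9, 0.1], [0.3, 0.2]),
    (0, 0):  ([0.7, 0.2], [0.7, 0.2]),
}

def lookup_hrir(longitude, latitude, step=15):
    """Snap the queried direction to the nearest table entry, return (hL, hR)."""
    key = (round(longitude / step) * step, round(latitude / step) * step)
    return HRIR_TABLE[key]

# A sound image center at (longitude 58, latitude 3) snaps to the (60, 0) entry.
h_left, h_right = lookup_hrir(58, 3)
```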
In one embodiment, the excitation signals include a left channel excitation signal and a right channel excitation signal; the step of constructing a stereo audio signal based on the excitation signal and the mixed audio signal of the mesh region includes:
performing convolution processing on the audio mixing audio signal of the grid area based on the left channel excitation signal to obtain a left channel audio signal;
and performing convolution processing on the audio mixing signal of the grid area based on the right channel excitation signal to obtain a right channel audio signal.
Here, the left channel excitation signal refers to HRIR data of a left channel, and the right channel excitation signal refers to HRIR data of a right channel.
Specifically, after the server determines the sound image centers of the grid regions based on the position information and the audio weight values of the sound sources in the grid regions, the server may search the HRIR excitation signals corresponding to the sound image centers, and since the HRIR excitation signals include the HRIR data of the left channel and the HRIR data of the right channel, the server may construct the stereo audio signals corresponding to the grid regions based on the HRIR excitation signals and the audio mixing audio signals of the grid regions. The server performs stereo reconstruction on the mixed audio signals of each grid region based on an HRTF stereo reconstruction technology, and performs convolution operation on the mono mixed audio signals u (n) of each grid region and the HRIR excitation signals h (n) of the corresponding position of the sound image center of each grid region to obtain the two-channel stereo signals y (n) output after the convolution operation. The stereo audio signal y (n) can be obtained by using the following formula:
y(n) = u(n) ⨂ h(n) (8)
where y (n) in the above formula (8) is the output binaural stereo audio signal, u (n) is the mono mixed audio signal of each grid region, h (n) is the HRIR excitation signal corresponding to the position information of the sound image center of each grid region, and ⨂ represents the convolution process.
As shown in fig. 7, a schematic diagram of the HRTF stereo reconstruction process is shown. Since the HRIR excitation signal h (n) comprises HRIR data for the left channel and HRIR data for the right channel, the generated y (n) also comprises left channel signal results and right channel signal results.
For example, assuming that the server determines the sound image center S1 of the grid 1 region based on the position information and the audio weight value of each sound source in the grid 1 region, the server may find that the HRIR excitation signal corresponding to the sound image center S1 is h(1). Since the HRIR excitation signal includes the HRIR data of the left channel and the HRIR data of the right channel, the server may construct the stereo audio signal sig3 corresponding to the grid 1 region based on the HRIR excitation signal h(1) and the mixed audio signal sig2 of the grid 1 region. That is, the server may take the mixed audio signal sig2 of the grid 1 region as the input signal u(n) and convolve it with the HRIR excitation signal h(1) in the direction corresponding to the sound image center S1, that is, y(1) = sig2 ⨂ h(1), thereby obtaining the binaural signal y(1) output after the convolution operation. Therefore, the calculation cost for generating the stereo audio signal can be effectively reduced based on the web grid processing method, and real-time processing is realized.
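A minimal sketch of the convolution in formula (8), using a direct-form FIR loop rather than an optimized convolution routine; the mix frame and the two-tap HRIRs are toy values:

```python
def convolve(u, h):
    """Direct-form FIR convolution y(n) = u(n) ⊗ h(n), per formula (8)."""
    y = [0.0] * (len(u) + len(h) - 1)
    for n, un in enumerate(u):
        for k, hk in enumerate(h):
            y[n + k] += un * hk
    return y

def reconstruct_stereo(mix, h_left, h_right):
    """Convolve one grid's mono mix with the left and right HRIR taps."""
    return convolve(mix, h_left), convolve(mix, h_right)

# Toy 3-sample mix and 2-tap HRIRs; output frames have len(u)+len(h)-1 samples.
left, right = reconstruct_stereo([1.0, 0.0, 0.5], [0.9, 0.1], [0.3, 0.2])
```

The interaural differences encoded in the left/right taps are what give the output its sense of direction.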
In one embodiment, the stereo mix audio signal includes a left mix audio signal and a right mix audio signal;
determining a mesh region having a sound source within a target spatial network;
determining a number of mesh regions having sound sources;
performing audio mixing on the left channel audio signals and the right channel audio signals of all the stereo audio signals respectively to obtain the stereo mixed audio signal includes:
based on the number of the grid areas, carrying out audio mixing processing on left channel audio signals of all the stereo audio signals to obtain left channel audio mixing audio signals;
and performing audio mixing processing on the right channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain right channel audio mixing signals.
Specifically, the server may construct a target space network with the target object as a center, and perform uniform meshing or non-uniform meshing on the target space network to obtain a target space network including a plurality of mesh areas. Further, the server may determine mesh regions with sound sources and determine the number of mesh regions with sound sources within the constructed target spatial network. The server can perform audio mixing processing on the left channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain left channel audio mixing audio signals; meanwhile, the server may perform audio mixing processing on the right channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain right channel audio mixing audio signals, and the left channel audio mixing audio signals and the right channel audio mixing audio signals obtained by the server are the finally obtained dual-channel stereo audio mixing audio signals. The method of mixing processing may include various mixing processing modes, for example: direct addition processing, an averaging method, a clamping method, normalization processing, adaptive mixing weighting processing, an automatic alignment algorithm and the like.
In the embodiment of the present application, the averaging method is taken as an example: the left channel signals output by all the grid regions having active sound sources are summed and averaged, and the right channel signals are summed and averaged in the same way.
The stereo mixed audio signal can be obtained by using the following formulas:
Lout(j) = (1/K) × [ l(1, j) + l(2, j) + … + l(K, j) ] (9)
Rout(j) = (1/K) × [ r(1, j) + r(2, j) + … + r(K, j) ] (10)
in the above formulas (9) and (10), lout (j) is a left mixed audio signal of the stereo mixed audio signal, rout (j) is a right mixed audio signal of the stereo mixed audio signal, K is the number of grid areas having a sound source in the current frame, j is a sample number of the current frame, l (K, j) is a left channel signal of the stereo audio signal output by each grid area having a sound source, and r (K, j) is a right channel signal of the stereo audio signal output by each grid area having a sound source. It is to be understood that, since the position of the sound source may change during the real-time processing, the number of the mesh regions having the sound source may be different for each frame, that is, dynamically changed, for example, the number of the mesh regions having the sound source is 5 for the first frame, and the number of the mesh regions having the sound source is 6 for the twentieth frame, and the server in the embodiment of the present application may dynamically calculate the left mix audio signal and the right mix audio signal of the stereo mix audio signal corresponding to each frame based on the above formulas (9) (10).
For example, assuming that the sampling rate of the audio signal is 48000hz, and when 20ms is one frame, there are 960 samples in one frame, that is, 960 left channel data outputs and 960 right channel data outputs in one frame, j = 1 to 960 in the above equation (9) (10).
Assuming that the total number of the current actual sound images is 50 sound sources, after the server performs meshing on the spatial network where the target object is located based on a preset meshing strategy, the server may acquire that the number of the mesh areas where the current frame has the sound sources is 10, that is, the 50 sound sources are classified into 10 mesh areas according to the preset meshing strategy, that is, K = 10, j = 1-960, and then the server may perform sound mixing calculation on the left channel audio signals of all the stereo audio signals in the 10 mesh areas of the current frame based on the above formula (9) to obtain left channel sound mixing audio signals; the server may perform a mixing calculation on the right channel audio signals of all the stereo audio signals in the 10 grid areas of the current frame based on the above formula (10), so as to obtain a right channel mixing audio signal. Therefore, the calculation cost for generating the stereo audio signal can be effectively reduced based on the web grid processing method, and real-time processing is realized.
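Formulas (9) and (10) (the averaging method) can be sketched as follows; the frame length and sample values are toy illustration values:

```python
def average_mix(grid_outputs):
    """Average the K per-grid stereo outputs sample by sample (formulas 9, 10).

    `grid_outputs` is a list of (left_frame, right_frame) pairs, one per grid
    region that has a sound source in the current frame; K = len(grid_outputs).
    """
    K = len(grid_outputs)
    n = len(grid_outputs[0][0])
    lout = [sum(l[j] for l, _ in grid_outputs) / K for j in range(n)]
    rout = [sum(r[j] for _, r in grid_outputs) / K for j in range(n)]
    return lout, rout

# Two active grids (K = 2), 3-sample frames for illustration.
lout, rout = average_mix([([0.4, 0.2, 0.0], [0.1, 0.1, 0.1]),
                          ([0.0, 0.2, 0.4], [0.3, 0.1, 0.1])])
```

Dividing by K keeps the mix level bounded regardless of how many grid regions are active in a given frame.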
The application also provides an application scene, and the application scene applies the audio generation method. Specifically, the audio generation method is applied to the application scenario as follows:
the concept of the metaverse stems from science fiction. In his 1992 novel "Snow Crash", Neal Stephenson describes the metaverse as follows: when the earphone and the eyepiece are worn and the connecting terminal is found, one can enter, as a virtual avatar, a virtual space which is simulated by the computer and is parallel to the real world. The realization of the metaverse needs to integrate artificial intelligence virtual perception technologies of multiple aspects such as audio, video and perception to construct a computer virtual space approaching real-world perception, and an experiencer can obtain sensory experience which is no different from the real world by means of hardware devices such as earphones, glasses and body sensing equipment. The virtual space sound effect is an important part in metaverse-like applications: the binaural sound signals of the real environment are restored through the virtual space sound effect, and an experiencer can perceive a stereo sound effect experience of the real environment by wearing earphones.
Fig. 8 is a schematic view of an application scenario of a virtual spatial sound effect for restoring a stereo audio signal in a real environment. The spatial sound effect means that a user can hear sound with more stereoscopic impression and spatial hierarchy sense through certain audio technology processing, the auditory scene of an actual scene is played and restored through an earphone or the combination of more than two loudspeakers, the listener can clearly recognize the direction, the distance sense and the moving track of different acoustic objects, the listener can feel the feeling of being wrapped by sound in all directions, and the listener can feel the immersive auditory experience in the actual environment. For example, the spatial sound effect shown in fig. 8 includes the voice of a person who is speaking, laughing, footstep, engine sound of a vehicle coming from far to near, sidewalk warning sound, wind and rain sound, and the like.
When a user uses an application similar to the metaverse or needs to experience a virtual space sound effect, the audio generation method can be adopted, that is, the binaural sound signal of the real environment is restored by the virtual space sound effect, and the experiencer, that is, the user, can perceive the stereo sound effect of the real environment by wearing earphones. For example, the virtual space sound effect shown in fig. 8 may include speech sounds, laughter, and footsteps of different people in different directions around the user, the engine sound of an automobile coming from far to near, sidewalk prompt sounds, wind and rain sounds, and the like. The server can perform distance sensing processing on the audio signals of the sound sources based on the distance value between each sound source and the experiencer to obtain distance-sensitive audio signals, and perform audio mixing processing on the distance-sensitive audio signals of the sound sources in the same grid area to obtain the mixed audio signal corresponding to each grid area; the server can determine a sound image center based on the audio signal and the audio weight value of each sound source in the grid area, and performs stereo reconstruction on the mixed audio signal of the grid area based on the sound image center to obtain a stereo audio signal; and the server respectively performs audio mixing processing on the left channel audio signals and the right channel audio signals of all the stereo audio signals to obtain the stereo mixed audio signal.
Because a plurality of sound sources in the same grid area are regarded as a whole, namely the plurality of sound sources in the same grid area are subjected to sound mixing after distance sensing correction to obtain a new single sound source, the sound image center of each grid area is calculated by weighting the plurality of sound sources, and the sound image center is taken as the coordinate position of the new sound source to carry out stereo reconstruction, so that the integrated processing of the plurality of sound sources in the same grid area can be realized, the times of stereo reconstruction processing are greatly reduced, the calculation cost can be greatly reduced on the premise of not influencing the user experience, the requirement that the common equipment can realize real-time processing can be met, the calculation cost for generating stereo audio signals is effectively reduced, and the perception experience of a user can be ensured.
The method provided by the embodiment of the application can be applied to a scene generated by a virtual space sound effect or a virtual stereo. The following description is given of an audio generation method provided by the embodiment of the present application, taking a scene generated by a virtual spatial sound effect as an example, and includes the following steps:
in a traditional audio generation mode, the generation of virtual stereo consumes a great deal of computing resources. In order to restore the experience of the real world, a virtual spatial sound effect needs to perform stereo reconstruction separately for the different sound sources in different directions through an HRTF (Head Related Transfer Function) virtual stereo reconstruction technology, and the results are mixed and delivered to the ears of the experiencer. Because a large number of sound sources need HRTF stereo reconstruction simultaneously, the huge calculation overhead poses a great challenge to real-time audio experience: common equipment cannot meet the requirement of real-time operation, and even if the calculation module is transferred to a server, generating the stereo audio signals still consumes a great deal of computing resources.
Fig. 9 shows a flowchart of virtual stereo generation in a conventional manner. The traditional virtual spatial sound effect is realized based on a virtual stereo reconstruction technology, as shown in fig. 9, it is assumed that a listener can listen to N sound sources together, each sound source carries its own monophonic signal and its own coordinate information, the coordinate information includes information that each sound source comes from different distances and different azimuth angles, and the user terminal can input the information into a virtual spatial sound effect system, i.e., a server. As shown in fig. 9, the server performs the correlation processing in the conventional manner by the following steps:
1) the server calculates the distance and the space angle between the current coordinate of each sound source and a listener, wherein the space angle comprises a horizontal angle and a vertical angle;
2) the server performs distance sensing processing according to the actual distance between a sound source and a listener according to the principle that the sound volume attenuates along with the distance, wherein the distance sensing processing can be single-channel signal attenuation processing;
3) the server processes HRIR impulse response signals after distance sensing processing of sound source signals, namely the server matches HRIR impulse response signals according to horizontal angle values and vertical angle values of the sound sources and performs convolution processing by using the matched HRIR impulse response signals to obtain stereo signals with azimuth sensing;
4) the server performs sound mixing processing on the left channel and the right channel of the stereo signals of each sound source respectively to finally obtain the virtual stereo sound received by a listener.
In the traditional technical scheme, every sound source undergoes HRTF processing independently, and the convolution in HRTF processing is computationally expensive. In applications similar to the metaverse, the number of sound sources is large, possibly dozens or hundreds, so the calculation overhead of virtual stereo generation is very high: common equipment cannot complete the operation in real time, and even if the calculation module is transferred to a server, the operation cost is very high.
Therefore, in order to solve the above problem, that is, the high overhead of real-time multi-sound-source HRTF operation, the embodiment of the present application proposes spherical-grid management, in which multiple sound sources in the same spherical grid are treated as one entity. Meanwhile, considering the resolving capability of the human ear, the grid division may be non-equidistant: regions where the ear perceives direction accurately may be divided more densely, and regions where the ear perceives direction less accurately may be divided more sparsely. For example, 100 sound source objects would originally require 100 HRIR convolutions; if the 100 sound sources are distributed over 10 grid areas, only 10 HRIR convolutions are needed, greatly reducing the calculation overhead. Even when the number of sound sources is large, for example dozens or hundreds, the calculation overhead of generating the stereo audio signal can be effectively reduced, the requirement of real-time processing can be met, and the perception experience of the user is preserved.
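The saving can be illustrated with a minimal grouping sketch; the grid assignment below is hypothetical, and only the count of distinct grids matters:

```python
from collections import defaultdict

def group_by_grid(source_grid_ids):
    """Group source indices by the spherical grid they fall into; each
    distinct grid then needs exactly one HRIR convolution, however many
    sources it contains."""
    groups = defaultdict(list)
    for i, gid in enumerate(source_grid_ids):
        groups[gid].append(i)
    return dict(groups)

# Hypothetical assignment: 100 sources spread over 10 grid areas.
grids = group_by_grid([i % 10 for i in range(100)])
print(len(grids))  # 10 HRIR convolutions instead of 100
```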
The embodiment provides a method for generating a multi-source virtual stereo audio signal based on spherical gridding, which specifically comprises the following steps:
On the technical side, fig. 10 illustrates a flowchart of the method for generating a multi-source virtual stereo audio signal based on spherical gridding; the sound source in fig. 10 is a plurality of sound sources. A large number of experiments show that the accuracy of sound direction identification by an ordinary person is not high; as long as the grid division is reasonable, the human ear cannot accurately distinguish the specific positions of different sound sources within the same grid. The grid processing method is designed based on this phenomenon, so the calculation cost of virtual stereo generation can be effectively reduced and real-time processing can be realized.
Because the resolving power of the human ear differs across spatial angles (it is relatively accurate directly ahead in the horizontal plane, where the accuracy may be within 15 degrees, and relatively weak behind, where it may be 30 degrees or more), the meshing strategy in the embodiment of the present application may be set as follows: as shown in fig. 4, in the horizontal direction, the region in front of the target object's body is gridded with a horizontal angle of 15 degrees as the grid interval, and the region behind the target object's body with a horizontal angle of 30 degrees as the grid interval; in the vertical direction, regions above the horizontal plane are gridded at a grid interval of 25 degrees, and regions below the horizontal plane at a grid interval of 30 degrees.
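A minimal sketch of this non-uniform division follows, assuming 0 degrees means directly ahead; only the 15/30-degree horizontal and 25/30-degree vertical intervals come from the text, while the way cells are indexed is an illustrative choice:

```python
def horizontal_bin(azimuth_deg):
    """Map a horizontal angle (0..360, 0 = straight ahead) to a grid column.
    The front hemisphere (within 90 degrees of straight ahead) uses
    15-degree cells; the rear hemisphere uses 30-degree cells."""
    a = azimuth_deg % 360.0
    if a < 90.0 or a >= 270.0:
        front = (a + 90.0) % 360.0       # shift so the front spans [0, 180)
        return int(front // 15.0)        # 12 front cells: bins 0..11
    return 12 + int((a - 90.0) // 30.0)  # 6 rear cells: bins 12..17

def vertical_bin(elevation_deg):
    """Above the horizontal plane: 25-degree cells; below: 30-degree cells."""
    if elevation_deg >= 0.0:
        return int(elevation_deg // 25.0)
    return -1 - int((-elevation_deg) // 30.0)
```

A (horizontal_bin, vertical_bin) pair then identifies the spherical grid cell a source falls into.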
As shown in fig. 10, the server first obtains the distance between the ith sound source and the listener and performs distance sensing processing on the original monaural signal Sig0(i) carried by the ith sound source, obtaining a distance-sensitive audio signal Sig1(i). Meanwhile, the server converts the three-dimensional rectangular coordinates of the ith sound source into longitude and latitude coordinates and matches them against the defined spherical grid; in this matching, the server matches the corresponding grid based on the longitude and latitude information of each sound source relative to the listener. Further, the server mixes the distance-sensitive audio signals Sig1 of the multiple sound sources in the same grid, obtaining a mixed audio signal Sig2 corresponding to each grid. In order to make the sound image orientation more accurate, in the embodiment of the present application the server may calculate the coordinate values of the grid sound image center of each grid in a weighted manner, based on the three-dimensional rectangular coordinate values of each sound source relative to the listener and the energy value of the distance-sensitive audio signal Sig1 of each sound source, and match the corresponding HRIR impulse response parameters according to the position information of each grid sound image center.
Further, the server may convolve the mixed audio signal Sig2 of each grid with the matched HRIR impulse response parameters to obtain the virtual stereo signal Sig3 of each grid; finally, the server may perform stereo mixing on the Sig3 signals of the multiple grids to obtain the final virtual stereo signal Sig4 heard by the listener.
After the server determines the sound image center point of each grid, the HRIR impulse response parameters are looked up; for example, assuming that the longitude and latitude coordinates of the sound image center point S1 of grid 1 are longitude 60 and latitude 0, the server may search the relevant HRIR table to obtain the HRIR impulse response parameters corresponding to those coordinates. The HRIR impulse response parameters are HRIR excitation signals, and the HRIR table stores the HRIR excitation signals corresponding to different longitude and latitude information.
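The table search can be sketched as a nearest-neighbour lookup; the text only states that the table stores HRIR excitation signals per longitude/latitude pair, so the table layout and the nearest matching below are assumptions:

```python
def lookup_hrir(hrir_table, lon, lat):
    """Return the HRIR excitation signals stored for the (longitude,
    latitude) key closest to the grid's sound image center. A real table
    would hold (left, right) impulse-response arrays per key."""
    key = min(hrir_table, key=lambda k: (k[0] - lon) ** 2 + (k[1] - lat) ** 2)
    return hrir_table[key]

# Toy table keyed by (longitude, latitude); values stand in for HRIR pairs.
table = {(60, 0): "hrir_at_60_0", (0, 0): "hrir_at_0_0"}
print(lookup_hrir(table, 58, 1))  # hrir_at_60_0
```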
The key techniques are as follows:
1. Conversion of three-dimensional rectangular coordinates to longitude and latitude coordinates
Assuming that the three-dimensional coordinate value of a certain sound source is [x, y, z], its longitude and latitude information is obtained by the conversion formulas (1) and (2) given above.
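Formulas (1) and (2) appear earlier in the document and are not reproduced in this excerpt; the sketch below uses the standard rectangular-to-spherical conversion as an assumption (axis convention z up, x forward):

```python
import math

def to_lon_lat(x, y, z):
    """Convert listener-centred rectangular coordinates [x, y, z] to
    (longitude, latitude) in degrees. The exact axis convention of
    formulas (1)-(2) is not reproduced here; this is the common one."""
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.asin(z / r)) if r > 0 else 0.0
    return lon, lat
```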
2. Distance calculation and distance sensing processing
The distance value between each sound source and the listener can be obtained using formula (3) given above.
When performing the distance sensing process, since the sound volume is inversely proportional to the square of the distance, the server can calculate the distance-sensitive audio signal Sig1 of each sound source using equation (5) above. That is, the original monaural signal Sig0 carried by each sound source is multiplied by the distance gain value g to obtain the distance-sensitive audio signal Sig1 of that sound source, and the distance gain value g can be calculated by equation (4) above.
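Equations (4) and (5) are not reproduced in this excerpt; the sketch below assumes the inverse-square relation stated above, clamped by the gain upper limit, with `d_ref` and `g_max` as illustrative parameter names:

```python
def distance_gain(d, d_ref=1.0, g_max=1.0):
    """Distance gain value g: inverse-square falloff relative to a
    reference distance d_ref, clamped to the gain upper limit g_max
    so nearby sources are not amplified without bound."""
    if d <= 0.0:
        return g_max
    return min(g_max, (d_ref / d) ** 2)

def distance_sense(sig0, d, d_ref=1.0, g_max=1.0):
    """Sig1 = g * Sig0: scale the source's mono signal by its distance gain."""
    g = distance_gain(d, d_ref, g_max)
    return [g * s for s in sig0]
```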
3. HRTF processing
Based on the HRTF stereo reconstruction technique, the mono mixed signal u(n) of each grid area is convolved with the HRIR data h(n) corresponding to the sound image center point of that grid area, outputting a two-channel stereo signal y(n). The two-channel stereo signal y(n) can be obtained using formula (8) given above.
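The convolution behind formula (8) can be sketched directly; a real-time implementation would typically use an FFT-based method instead of this direct form:

```python
def convolve(u, h):
    """Direct-form convolution y(n) = sum_k h(k) * u(n - k)."""
    y = [0.0] * (len(u) + len(h) - 1)
    for n, un in enumerate(u):
        for k, hk in enumerate(h):
            y[n + k] += un * hk
    return y

def hrtf_reconstruct(u, h_left, h_right):
    """Convolve the grid's mono mix u(n) with the left/right HRIRs matched
    at the grid's sound image center, yielding a two-channel signal."""
    return convolve(u, h_left), convolve(u, h_right)
```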
4. Calculating the grid sound image center
Assuming that M sound sources lie in the same grid, that the energy value of the distance-sensitive audio signal of each sound source in the current frame is e(i), and that i is the serial number of each sound source within the current grid area, the coordinate values of the sound image center point of the grid can be calculated by formula (6), and the energy weighting coefficient of each sound source by formula (7) given above.
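Following equations (6) and (7), the grid sound image center is the energy-weighted average of the source coordinates; the zero-energy fallback below is an assumption not specified in the text:

```python
def grid_sound_image_center(coords, energies):
    """Energy-weighted grid sound image center:
    w(i) = e(i) / sum_j e(j); center = sum_i w(i) * coord(i)."""
    total = sum(energies)
    if total == 0:
        # degenerate case: fall back to the plain average (an assumption)
        weights = [1.0 / len(coords)] * len(coords)
    else:
        weights = [e / total for e in energies]
    return tuple(
        sum(w * c[axis] for w, c in zip(weights, coords))
        for axis in range(3)
    )
```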
5. Stereo sound mixing processing
The stereo mixed signal is obtained by mixing the left channel signals and the right channel signals of the virtual stereo signals Sig3 output for each grid. There are many mixing methods, for example: direct addition, averaging, clamping, normalization, adaptive mixing weighting, automatic alignment algorithms, and the like.
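Two of the listed mixing methods, averaging and clamping, can be sketched as:

```python
def mix_average(channels):
    """Averaging method: sum the per-grid channel signals sample by sample
    and divide by the number of grids, avoiding overflow at some loudness cost."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]

def mix_clamp(channels, limit=1.0):
    """Clamping method: direct addition, then hard-limit each sample to
    [-limit, limit]."""
    return [max(-limit, min(limit, sum(frame))) for frame in zip(*channels)]
```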
In this embodiment, sound sources in different directions are placed under spherical-grid management: multiple sound sources in the same grid are subjected to distance-sensing correction and then mixed into a new single sound source; meanwhile, the grid sound image center is calculated by multi-factor weighting and used as the coordinate direction of the new sound source for HRTF virtual stereo reconstruction; finally, the virtual stereo sound of every grid containing a sound source is mixed to obtain the final overall virtual stereo signal. In the embodiment of the application, different grid regions are defined based on the different perception accuracy of the human ear in different directions, so integrated processing of multiple sound sources in the same grid can be realized and the number of HRTF processing passes is greatly reduced. The calculation overhead can thus be greatly reduced without affecting user experience, meeting the real-time processing requirement achievable by ordinary devices while preserving the perception experience of the user.
It should be understood that although the various steps in the flowcharts of figs. 1-10 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1-10 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, there is provided an audio generating apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, specifically comprising: a distance sensing processing module 1102, an audio mixing processing module 1104, a determining module 1106 and a stereo reconstruction module 1108, wherein:
the distance sensing processing module 1102 is configured to perform distance sensing processing on an audio signal of a sound source based on a distance value between the sound source and a target object, so as to obtain a distance sensing audio signal;
the audio mixing processing module 1104 is configured to perform audio mixing processing on the distance sensing audio signals of the sound sources in the same grid area to obtain an audio mixing audio signal corresponding to each grid area.
A determining module 1106, configured to determine a sound image center based on the position information and the audio weight value of each sound source in the grid area.
The stereo reconstruction module 1108 is configured to perform stereo reconstruction on the audio-mixed audio signal in the grid area based on the center of the sound image, so as to obtain a stereo audio signal.
The audio mixing processing module 1104 is further configured to perform audio mixing processing on the left channel audio signal and the right channel audio signal of all the stereo audio signals respectively to obtain stereo audio mixing audio signals.
In one embodiment, the apparatus further comprises: the system comprises a construction module and a grid division module, wherein the construction module is used for constructing a target space network by taking the target object as a center; and the grid division module is used for carrying out uniform grid division or non-uniform grid division on the target space network to obtain the target space network comprising the grid area.
In one embodiment, the apparatus further comprises: the acquisition module is used for acquiring the coordinate information of the sound source; the conversion module is used for converting the coordinate information of the sound source into longitude and latitude information of the spherical network when the coordinate information is three-dimensional rectangular coordinate information; the determining module is further configured to determine, in the spherical network, a mesh area to which the sound source belongs based on the longitude and latitude information.
In one embodiment, the meshing module is further configured to perform uniform meshing on the target space network to obtain a target space network including the mesh region; the determining module is further used for determining audio perception accuracy values of the target object to different directions, wherein the audio perception accuracy values of the different directions are different; the grid division module is further used for carrying out grid division on the target space network according to the audio perception precision value to obtain the target space network comprising the grid area.
In one embodiment, the determining module is further configured to determine a gain value corresponding to the sound source based on a distance value between the sound source and the target object; and the distance sensing processing module is also used for carrying out attenuation processing or amplification processing on the audio signal of the sound source based on the gain value to obtain the distance sensing audio signal of the sound source.
In one embodiment, the obtaining module is further configured to obtain a reference distance value and a gain upper limit value; the determining module is further configured to determine a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value, and the distance value.
In one embodiment, the obtaining module is further configured to obtain coordinate values and audio weight values of each sound source in the grid region; the determining module is further configured to determine a center coordinate value of the grid region based on the coordinate value and the audio weight value; and taking the position corresponding to the central coordinate value as the sound image center of the grid area.
In one embodiment, the obtaining module is further configured to obtain an audio energy value of each sound source in the mesh area; the determining module is further used for determining a total audio energy value of all the sound sources in the grid area based on the audio energy value; determining an audio weight value for each of the sound sources within the mesh area based on the audio energy value and the total audio energy value.
In one embodiment, the apparatus further comprises: the searching module is used for searching the excitation signal corresponding to the sound image center; the construction module is further configured to construct a stereo audio signal based on the excitation signal and the mixed audio signal of the mesh region.
In one embodiment, the construction module further comprises: the convolution processing unit is used for carrying out convolution processing on the audio mixing signal of the grid area based on the left channel excitation signal to obtain a left channel audio signal; and performing convolution processing on the audio mixing audio signal of the grid area based on the right channel excitation signal to obtain a right channel audio signal.
In one embodiment, the determination module is further configured to determine a mesh region having the sound source within a target spatial network; determining a number of mesh regions having the sound source; the audio mixing processing module is further configured to perform audio mixing processing on the left channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain left channel audio mixing audio signals; and performing sound mixing processing on the right channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain the right channel sound mixing audio signals.
For specific limitations of the audio generating apparatus, reference may be made to the above limitations of the audio generating method, which are not described herein again. The modules in the audio generating device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store audio generation data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio generation method.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (24)

1. A method of audio generation, the method comprising:
performing distance sensing processing on an audio signal of a sound source based on a distance value between the sound source and a target object to obtain a distance sensing audio signal;
carrying out sound mixing processing on the distance sensing audio signals of the sound sources in the same grid area to obtain sound mixing audio signals corresponding to each grid area;
determining a sound image center based on the position information and the audio weight value of each sound source in the grid area;
performing stereo reconstruction on the audio-mixed audio signal of the grid area based on the acoustic image center to obtain a stereo audio signal;
and respectively carrying out audio mixing processing on the left channel audio signal and the right channel audio signal of the stereo audio signal to obtain a stereo audio mixing audio signal.
2. The method of claim 1, further comprising:
constructing a target space network by taking the target object as a center;
and carrying out uniform meshing or non-uniform meshing on the target space network to obtain the target space network comprising the mesh area.
3. The method of claim 2, wherein the target spatial network is a sphering network, the method further comprising:
acquiring coordinate information of the sound source;
when the coordinate information is three-dimensional rectangular coordinate information, converting the coordinate information of the sound source into longitude and latitude information of the spherical network;
and in the spherical network, determining the grid area to which the sound source belongs based on the longitude and latitude information.
4. The method of claim 2, wherein non-uniformly meshing the target spatial network to obtain the target spatial network including the mesh region comprises:
determining audio perception accuracy values of the target object to different directions, wherein the audio perception accuracy values of the different directions are different; and carrying out grid division on the target space network according to the audio perception precision value to obtain the target space network comprising the grid area.
5. The method according to claim 1, wherein the distance sensing processing is performed on the audio signal of the sound source based on the distance value between the sound source and the target object to obtain a distance sensing audio signal, and comprises:
determining a gain value corresponding to a sound source based on a distance value between the sound source and a target object;
and based on the gain value, carrying out attenuation processing or amplification processing on the audio signal of the sound source to obtain the distance sensing audio signal of the sound source.
6. The method of claim 5, wherein the gain value is a distance gain value; the determining, based on the distance value between the sound source and the target object, a gain value corresponding to the sound source includes:
acquiring a reference distance value and a gain upper limit value;
and determining a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value and the distance value.
7. The method according to claim 1, wherein the determining a sound image center based on the position information and the audio weight value of each sound source in the mesh area comprises:
obtaining coordinate values and audio weight values of the sound sources in the grid area;
determining a central coordinate value of the grid region based on the coordinate value and the audio weight value;
and taking the position corresponding to the central coordinate value as the sound image center of the grid area.
8. The method of claim 1, further comprising:
acquiring audio energy values of the sound sources in the grid area;
determining a total audio energy value for all of the sound sources within the mesh area based on the audio energy values;
determining an audio weight value for each of the sound sources within the mesh area based on the audio energy value and the total audio energy value.
9. The method of claim 1, wherein the performing stereo reconstruction on the mixed audio signal of the grid region based on the sound image center to obtain a stereo audio signal comprises:
searching an excitation signal corresponding to the sound image center;
and constructing a stereo audio signal based on the excitation signal and the mixed audio signal of the grid area.
10. The method of claim 9, wherein the excitation signal comprises a left channel excitation signal and a right channel excitation signal;
constructing a stereo audio signal based on the excitation signal and the mixed audio signal of the mesh region, including:
performing convolution processing on the audio mixing audio signal of the grid area based on the left channel excitation signal to obtain a left channel audio signal;
and performing convolution processing on the audio mixing audio signal of the grid area based on the right channel excitation signal to obtain a right channel audio signal.
11. The method of claim 1, wherein the stereo mix audio signal comprises a left mix audio signal and a right mix audio signal;
determining a mesh region having the sound source within a target spatial network;
determining a number of mesh regions having the sound source;
and respectively mixing the left channel audio signal and the right channel audio signal of the stereo audio signal to obtain a stereo mixed audio signal, wherein the method comprises the following steps:
performing audio mixing processing on the left channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain left channel audio mixing audio signals;
and performing sound mixing processing on the right channel audio signals of all the stereo audio signals based on the number of the grid areas to obtain the right channel sound mixing audio signals.
12. An apparatus for audio generation, the apparatus comprising:
the distance sensing processing module is used for carrying out distance sensing processing on the audio signal of the sound source based on the distance value between the sound source and the target object to obtain a distance sensing audio signal;
the sound mixing processing module is used for carrying out sound mixing processing on the distance sensing audio signals of the sound sources in the same grid area to obtain sound mixing audio signals corresponding to each grid area;
the determining module is used for determining a sound image center based on the position information and the audio weight value of each sound source in the grid area;
the stereo reconstruction module is used for carrying out stereo reconstruction on the audio-mixed audio signal of the grid area based on the acoustic image center to obtain a stereo audio signal;
the audio mixing processing module is further configured to perform audio mixing processing on all the left channel audio signals and all the right channel audio signals of the stereo audio signals respectively to obtain stereo audio mixing audio signals.
13. The audio generation apparatus of claim 12, wherein the apparatus further comprises: the system comprises a construction module and a grid division module, wherein the construction module is used for constructing a target space network by taking the target object as a center; and the grid division module is used for carrying out uniform grid division or non-uniform grid division on the target space network to obtain the target space network comprising the grid area.
14. The audio generation apparatus of claim 13, wherein the apparatus further comprises: the acquisition module is used for acquiring the coordinate information of the sound source; the conversion module is used for converting the coordinate information of the sound source into longitude and latitude information of a spherical network when the coordinate information is three-dimensional rectangular coordinate information; the determining module is further configured to determine, in the spherical network, a mesh area to which the sound source belongs based on the longitude and latitude information.
15. The audio generation apparatus according to claim 13, wherein the determining module is further configured to determine audio perception accuracy values of the target object for different orientations, the audio perception accuracy values being different for the different orientations; the grid division module is further used for carrying out grid division on the target space network according to the audio perception precision value to obtain the target space network comprising the grid area.
16. The audio generating apparatus according to claim 12, wherein the determining module is further configured to determine a gain value corresponding to the sound source based on a distance value between the sound source and a target object; and the distance sensing processing module is also used for carrying out attenuation processing or amplification processing on the audio signal of the sound source based on the gain value to obtain the distance sensing audio signal of the sound source.
17. The audio generating apparatus according to claim 16, wherein the obtaining module is further configured to obtain a reference distance value and a gain upper limit value; the determining module is further configured to determine a distance gain value corresponding to the sound source based on the reference distance value, the gain upper limit value, and the distance value.
18. The audio generating apparatus according to claim 12, wherein the obtaining module is further configured to obtain coordinate values and audio weight values of each sound source in the mesh region; the determining module is further configured to determine a center coordinate value of the grid region based on the coordinate value and the audio weight value; and taking the position corresponding to the central coordinate value as the sound image center of the grid area.
19. The audio generating apparatus according to claim 12, wherein the obtaining module is further configured to obtain audio energy values of the sound sources in the mesh area; the determining module is further used for determining a total audio energy value of all the sound sources in the grid area based on the audio energy value; determining an audio weight value for each of the sound sources within the mesh area based on the audio energy value and the total audio energy value.
20. The audio generation apparatus of claim 12, wherein the apparatus further comprises: the searching module is used for searching the excitation signal corresponding to the sound image center; the construction module is further configured to construct a stereo audio signal based on the excitation signal and the mixed audio signal of the mesh region.
21. The audio generation apparatus of claim 20, wherein the construction module further comprises: the convolution processing unit is used for carrying out convolution processing on the audio mixing signal of the grid area based on the left channel excitation signal to obtain a left channel audio signal; and performing convolution processing on the audio mixing signal of the grid area based on the right channel excitation signal to obtain a right channel audio signal.
22. The audio generation apparatus of claim 12, wherein the determining module is further configured to determine grid regions having the sound source within a target spatial grid, and to determine the number of grid regions having the sound source; the audio mixing processing module is further configured to perform audio mixing processing on the left channel audio signals of all the stereo audio signals based on the number of grid regions to obtain a left channel mixed audio signal, and to perform audio mixing processing on the right channel audio signals of all the stereo audio signals based on the number of grid regions to obtain a right channel mixed audio signal.
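Claim 22 mixes the per-region stereo signals scaled by the number of occupied grid regions — i.e., an average, which keeps the sum from clipping as regions are added. A sketch of one plausible reading (assumes each left/right pair is a list of samples; padding behaviour is illustrative):

```python
def mix_stereo(stereo_signals):
    """Average left and right channels over all grid regions that
    contain at least one sound source."""
    n = len(stereo_signals)
    length = max(len(left) for left, _ in stereo_signals)
    left_mix = [0.0] * length
    right_mix = [0.0] * length
    for left, right in stereo_signals:
        for i, v in enumerate(left):
            left_mix[i] += v / n
        for i, v in enumerate(right):
            right_mix[i] += v / n
    return left_mix, right_mix
```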
23. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
24. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202111474567.9A 2021-12-02 2021-12-02 Audio generation method and device, computer equipment and storage medium Active CN113889125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111474567.9A CN113889125B (en) 2021-12-02 2021-12-02 Audio generation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113889125A CN113889125A (en) 2022-01-04
CN113889125B 2022-03-04

Family

ID=79015632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111474567.9A Active CN113889125B (en) 2021-12-02 2021-12-02 Audio generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113889125B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116668966A (en) * 2022-02-21 2023-08-29 华为技术有限公司 Audio playing method
CN114630145A (en) * 2022-03-17 2022-06-14 腾讯音乐娱乐科技(深圳)有限公司 Multimedia data synthesis method, equipment and storage medium
CN114501297B (en) * 2022-04-02 2022-09-02 北京荣耀终端有限公司 Audio processing method and electronic equipment
CN114866948A (en) * 2022-04-26 2022-08-05 北京奇艺世纪科技有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN116347320B (en) * 2022-09-07 2024-05-07 荣耀终端有限公司 Audio playing method and electronic equipment
CN115297423B (en) * 2022-09-30 2023-02-07 中国人民解放军空军特色医学中心 Sound source space layout method for real person HRTF measurement
CN117082435B (en) * 2023-10-12 2024-02-09 腾讯科技(深圳)有限公司 Virtual audio interaction method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103379424A (en) * 2012-04-24 2013-10-30 华为技术有限公司 Sound mixing method and multi-point control server
CN110493703A (en) * 2019-07-24 2019-11-22 天脉聚源(杭州)传媒科技有限公司 Stereo audio processing method, system and the storage medium of virtual spectators
US20210352425A1 (en) * 2019-01-25 2021-11-11 Huawei Technologies Co., Ltd. Method and apparatus for processing a stereo signal

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2154911A1 (en) * 2008-08-13 2010-02-17 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. An apparatus for determining a spatial output multi-channel audio signal
JP6169718B2 (en) * 2012-12-04 2017-07-26 サムスン エレクトロニクス カンパニー リミテッド Audio providing apparatus and audio providing method
EP2790419A1 (en) * 2013-04-12 2014-10-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for center signal scaling and stereophonic enhancement based on a signal-to-downmix ratio
DE102013105375A1 (en) * 2013-05-24 2014-11-27 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. A sound signal generator, method and computer program for providing a sound signal
KR20160020377A (en) * 2014-08-13 2016-02-23 삼성전자주식회사 Method and apparatus for generating and reproducing audio signal
CN107710790B (en) * 2015-06-24 2021-06-22 索尼公司 Apparatus, method and program for processing sound
CN111316353B (en) * 2017-11-10 2023-11-17 诺基亚技术有限公司 Determining spatial audio parameter coding and associated decoding
CN112750444B (en) * 2020-06-30 2023-12-12 腾讯科技(深圳)有限公司 Sound mixing method and device and electronic equipment


Also Published As

Publication number Publication date
CN113889125A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN113889125B (en) Audio generation method and device, computer equipment and storage medium
US10820097B2 (en) Method, systems and apparatus for determining audio representation(s) of one or more audio sources
JP4718559B2 (en) Method and apparatus for individualizing HRTFs by modeling
Spagnol et al. Current use and future perspectives of spatial audio technologies in electronic travel aids
US11425524B2 (en) Method and device for processing audio signal
US20150189455A1 (en) Transformation of multiple sound fields to generate a transformed reproduced sound field including modified reproductions of the multiple sound fields
Zhong et al. Head-related transfer functions and virtual auditory display
EP2920979B1 (en) Acquisition of spatialised sound data
Bujacz et al. Sound of Vision-Spatial audio output and sonification approaches
CN113849767B (en) Personalized HRTF (head related transfer function) generation method and system based on physiological parameters and artificial head data
US11678111B1 (en) Deep-learning based beam forming synthesis for spatial audio
CN114049871A (en) Audio processing method and device based on virtual space and computer equipment
Yuan et al. Sound image externalization for headphone based real-time 3D audio
CN111246345B (en) Method and device for real-time virtual reproduction of remote sound field
CN117082435B (en) Virtual audio interaction method and device, storage medium and electronic equipment
CN116939473A (en) Audio generation method and related device
O'Dwyer et al. A machine learning approach to detecting sound-source elevation in adverse environments
CN109168125A (en) A kind of 3D sound effect system
Kim et al. Cross‐talk Cancellation Algorithm for 3D Sound Reproduction
CN110166927B (en) Virtual sound image reconstruction method based on positioning correction
WO2020008655A1 (en) Device for generating head-related transfer function, method for generating head-related transfer function, and program
JP2022131067A (en) Audio signal processing device, stereophonic sound system and audio signal processing method
Qi et al. Parameter-Transfer Learning for Low-Resource Individualization of Head-Related Transfer Functions.
WO2023026530A1 (en) Signal processing device, signal processing method, and program
Duraiswami et al. Capturing and recreating auditory virtual reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40065468
Country of ref document: HK