CN117082435B - Virtual audio interaction method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN117082435B
Authority
CN
China
Prior art keywords
audio
interactive
target
objects
interaction
Prior art date
Legal status
Active
Application number
CN202311321281.6A
Other languages
Chinese (zh)
Other versions
CN117082435A (en)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from application CN202311321281.6A
Publication of application CN117082435A
Application granted
Publication of granted patent CN117082435B


Classifications

    • H — ELECTRICITY
        • H04 — ELECTRIC COMMUNICATION TECHNIQUE
            • H04S — STEREOPHONIC SYSTEMS
                • H04S7/00 — Indicating arrangements; Control arrangements, e.g. balance control
                    • H04S7/30 — Control circuits for electronic adaptation of the sound field
                        • H04S7/305 — Electronic adaptation of stereophonic audio signals to reverberation of the listening space
                • H04S2400/00 — Details of stereophonic systems covered by H04S but not provided for in its groups
                    • H04S2400/05 — Generation or adaptation of centre channel in multi-channel audio systems
                    • H04S2400/11 — Positioning of individual sound objects, e.g. moving airplane, within a sound field

Abstract

The application discloses a virtual audio interaction method and device, a storage medium, and an electronic device. The method comprises the following steps: acquiring an audio transmission request; in response to the request, respectively acquiring the effective perception parameters of a first audio interaction object when different numbers of second audio interaction objects sound simultaneously; and determining a target number of second audio interaction objects from at least two second audio interaction objects and sending the audio of the target number of second audio interaction objects sounding simultaneously to a target client as target interaction audio, wherein the effective perception parameter of the first audio interaction object is highest when the target number of second audio interaction objects sound simultaneously. The application can be applied to the fields of artificial intelligence and natural language processing. The method and device solve the technical problem of low virtual audio interaction efficiency.

Description

Virtual audio interaction method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular to a virtual audio interaction method and device, a storage medium, and an electronic device.
Background
The existing mixing scheme for virtual audio (such as multi-party stereo) works as follows: the sound signals and user position (virtual position) information of different users are sent to a server; the server forwards all sound signals and position information to the relevant user terminals; each user terminal then generates multi-party stereo signals based on the relative position information, performs stereo mixing, and plays the mixed signal through stereo headphones or loudspeakers.
However, in a large-scale virtual-space social application scenario the number of users is very large: the audio data received by each user terminal grows linearly with the number of users, while the bandwidth consumption of the server grows quadratically. This brings heavy computation overhead and network bandwidth consumption, causing data packet loss, network delay jitter, and processing that cannot keep up in real time, which leads to the technical problem of low virtual audio interaction efficiency.
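To make the quadratic growth concrete, the following sketch (with hypothetical user counts) compares the number of streams the server forwards under the existing full-forwarding scheme with a selective scheme that sends each user only the k most perceptible sources:

```python
# Sketch: why full forwarding scales quadratically. Numbers are hypothetical.
def full_forwarding_streams(n_users: int) -> int:
    """Streams the server forwards when every user hears every other user:
    each of N users receives the other N - 1 streams."""
    return n_users * (n_users - 1)

def selective_forwarding_streams(n_users: int, k: int) -> int:
    """Streams forwarded when each user receives only the k best-perceived
    sources, as in the selection scheme of this application."""
    return n_users * min(k, n_users - 1)

# 100 users: 9,900 forwarded streams vs. 800 when only 8 sources are sent each.
```

The gap widens with scale: at 10,000 users full forwarding needs roughly 10⁸ stream forwards, while a fixed per-listener budget keeps growth linear.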
Therefore, the related art has a technical problem of low interaction efficiency of the virtual audio.
Disclosure of Invention
The embodiment of the application provides a virtual audio interaction method, a virtual audio interaction device, a storage medium and electronic equipment, and aims to at least solve the technical problem that virtual audio interaction efficiency is low in the related technology.
According to an aspect of an embodiment of the present application, there is provided a virtual audio interaction method, including: acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interaction audio to a target client matched with a first audio interaction object, the first audio interaction object is an audio interaction object located in a virtual space, the first audio interaction object performs audio interaction with at least two second audio interaction objects in the virtual space, and the target interaction audio is the interaction audio received by the first audio interaction object in the virtual space; in response to the audio transmission request, respectively acquiring effective perception parameters of the first audio interaction object when different numbers of second audio interaction objects sound simultaneously, wherein the effective perception parameters are used for measuring the perception quality of the interaction audio received by the first audio interaction object; and determining a target number of second audio interaction objects from the at least two second audio interaction objects, and sending the audio of the target number of second audio interaction objects sounding simultaneously to the target client as the target interaction audio, wherein the effective perception parameter of the first audio interaction object is highest when the target number of second audio interaction objects sound simultaneously.
According to another aspect of the embodiments of the present application, there is also provided a virtual audio interaction device, including: a first obtaining unit, configured to obtain an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interaction audio to a target client matched with a first audio interaction object, the first audio interaction object is an audio interaction object located in a virtual space, the first audio interaction object performs audio interaction with at least two second audio interaction objects in the virtual space, and the target interaction audio is the interaction audio received by the first audio interaction object in the virtual space; a second obtaining unit, configured to respectively obtain, in response to the audio transmission request, the effective perception parameters of the first audio interaction object when different numbers of second audio interaction objects sound simultaneously, wherein the effective perception parameters are used for measuring the perception quality of the interaction audio received by the first audio interaction object; and a sending unit, configured to determine a target number of second audio interaction objects from the at least two second audio interaction objects and send the audio of the target number of second audio interaction objects sounding simultaneously to the target client as the target interaction audio, wherein the effective perception parameter of the first audio interaction object is highest when the target number of second audio interaction objects sound simultaneously.
As an alternative, the second obtaining unit includes: a first determining module, configured to determine a circular space region with the first position of the first audio interaction object in the virtual space as the center and a target distance as the radius, wherein the target distance is larger than the distance between the first position and any second position, and a second position is the position of a second audio interaction object in the virtual space; a first obtaining module, configured to obtain, based on the circular space region, the azimuth sparsity of the sound sources when different numbers of second audio interaction objects sound simultaneously as sound sources, wherein the azimuth sparsity represents how widely the sound sources are distributed in azimuth, and the azimuth is the direction of a sound source in the virtual space relative to the first audio interaction object; and a second determining module, configured to determine the target range in which the azimuth sparsity falls and the effective perception parameter corresponding to that target range.
As an alternative, the apparatus further includes: a second obtaining module, configured to obtain, before the circular space region is determined, the first sound source relative energies of the at least two second audio interaction objects, wherein the first sound source relative energy indicates the energy of a second audio interaction object's interaction audio relative to the first audio interaction object; and a sorting module, configured to sort, before the circular space region is determined, the at least two second audio interaction objects in descending order of first sound source relative energy to obtain a plurality of sorted second audio interaction objects. The first obtaining module includes: a dividing sub-module, configured to divide the circular space region into a first number of space sub-regions according to the orientation of the first audio interaction object, wherein the region angle of a space sub-region in the forward direction of the first audio interaction object is smaller than the region angle of a space sub-region in the reverse direction, and the first number is an integer greater than or equal to 2; a first determining sub-module, configured to determine, in sorted order and from the plurality of second audio interaction objects, the at least one second audio interaction object contained in each audio object set of the different numbers of second audio interaction objects, wherein a second audio interaction object is not repeated within a single set but is allowed to appear in more than one of the sets; a first obtaining sub-module, configured to obtain, for each audio object set, the second number of distinct space sub-regions of the circular space region occupied by the second audio interaction objects in that set; and a second determining sub-module, configured to determine the ratio of the second number to the first number and take that ratio as the azimuth sparsity of each audio object set.
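The sector-based sparsity computation described above can be sketched as follows; the concrete sector edges and coordinates are illustrative assumptions, the patent fixing only that forward-facing sectors are narrower than rearward ones:

```python
import math

def azimuth_sparsity(listener_xy, listener_heading, source_positions, sector_edges):
    """Fraction of angular sectors occupied by at least one sound source.

    sector_edges: sorted sector boundary angles in degrees, measured relative
    to the listener's forward direction. Front sectors are narrower than rear
    ones, matching the finer angular resolution of human hearing to the front.
    All concrete values here are illustrative, not from the patent.
    """
    occupied = set()
    for sx, sy in source_positions:
        # Azimuth of the source relative to the listener's heading, in [0, 360).
        azimuth = (math.degrees(math.atan2(sy - listener_xy[1], sx - listener_xy[0]))
                   - listener_heading) % 360.0
        # Record which sector ("space sub-region") the azimuth falls into.
        for i in range(len(sector_edges) - 1):
            if sector_edges[i] <= azimuth < sector_edges[i + 1]:
                occupied.add(i)
                break
    # Ratio of occupied sectors (second number) to total sectors (first number).
    return len(occupied) / (len(sector_edges) - 1)
```

With six sectors and two sources straight ahead and to the left, two of six sectors are occupied, giving a sparsity of 1/3.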
As an alternative, the apparatus further includes: a third determining module, configured to determine, after the effective perception parameters are respectively obtained, according to the effective perception parameter corresponding to each audio object set, that the perception quality of the interaction audio received by the first audio interaction object is highest when the second audio interaction objects in a target set sound simultaneously, wherein the audio object sets include the target set, the number of second audio interaction objects contained in the target set is the target number, and the average energy of the second audio interaction objects contained in the target set is greater than or equal to the average energy of the second audio interaction objects contained in any candidate set that also contains the target number of second audio interaction objects.
As an alternative, the second determining module includes: a third determining sub-module, configured to determine, when the target range is determined, the effective perception parameter corresponding to the target range according to a target mapping relationship, wherein the target range indicates the interval in which the azimuth sparsity falls, and the target mapping relationship specifies that: when the target range is between 0 and a first azimuth sparsity value, the effective perception parameter is a first effective perception value; when the target range is between the first azimuth sparsity value and a second azimuth sparsity value, the effective perception parameter is between the first effective perception value and a second effective perception value and increases as the azimuth sparsity increases, wherein the second azimuth sparsity value is larger than the first azimuth sparsity value, and the second effective perception value is larger than the first effective perception value; when the target range is between the second azimuth sparsity value and a third azimuth sparsity value, the effective perception parameter is a third effective perception value, wherein the third azimuth sparsity value is larger than the second azimuth sparsity value, and the third effective perception value is larger than the second effective perception value; when the target range is between the third azimuth sparsity value and a fourth azimuth sparsity value, the effective perception parameter is between a fourth effective perception value and the third effective perception value and decreases as the azimuth sparsity increases, wherein the fourth azimuth sparsity value is larger than the third azimuth sparsity value, and the fourth effective perception value is smaller than the third effective perception value; and when the target range is greater than the fourth azimuth sparsity value, the effective perception parameter is the fourth effective perception value.
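The piecewise mapping can be sketched as a function. The threshold values s1..s4 and perception values p1..p4 below are placeholders, since the patent specifies only their ordering (s1 < s2 < s3 < s4; p1 < p2 < p3; p4 < p3):

```python
def effective_perception(sparsity, s1=0.1, s2=0.3, s3=0.6, s4=0.8,
                         p1=0.2, p2=0.6, p3=1.0, p4=0.4):
    """Piecewise mapping from azimuth sparsity to the effective perception
    parameter: flat at p1, rising to p2, a plateau at the peak value p3,
    then falling to p4. All numeric defaults are illustrative assumptions.
    """
    if sparsity <= s1:
        return p1
    if sparsity <= s2:
        # Rises (here linearly, as one possible realization) from p1 to p2.
        return p1 + (p2 - p1) * (sparsity - s1) / (s2 - s1)
    if sparsity <= s3:
        return p3  # peak plateau
    if sparsity <= s4:
        # Falls (here linearly) from p3 to p4.
        return p3 - (p3 - p4) * (sparsity - s3) / (s4 - s3)
    return p4
```

Intuitively, perception quality peaks at a moderate sparsity: too few directions sound flat, while too many simultaneous directions become hard to separate.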
As an alternative, the apparatus further includes: a third obtaining module, configured to obtain, after the audio transmission request is obtained, N candidate audio interaction objects, wherein the candidate audio interaction objects are audio interaction objects located in the virtual space and N is a positive integer greater than or equal to 2; and a fourth determining module, configured to screen out, after the audio transmission request is obtained and based on the second sound source relative energies of the N candidate audio interaction objects, the M audio interaction objects with the highest second sound source relative energies, and determine the M audio interaction objects as the at least two second audio interaction objects, wherein the second sound source relative energy indicates the energy of a candidate audio interaction object's interaction audio relative to the first audio interaction object, and M is a positive integer greater than or equal to 2 and less than or equal to N.
As an alternative, the fourth determining module includes: a second obtaining sub-module, configured to obtain the relative distance between each of the N candidate audio interaction objects and the first audio interaction object, and the audio intensity between each candidate audio interaction object and the first audio interaction object; a fourth determining sub-module, configured to determine the second sound source relative energy of each candidate audio interaction object with respect to the first audio interaction object from the relative distance and the audio intensity, wherein the second sound source relative energy is proportional to the audio intensity and inversely proportional to the square of the relative distance; a fifth determining sub-module, configured to determine, from the N candidate audio interaction objects, the target audio interaction object with the largest second sound source relative energy, and determine the product of that object's second sound source relative energy and a preset parameter as a target masking energy threshold, wherein the preset parameter is between 0 and 1; and a sixth determining sub-module, configured to determine, from the N candidate audio interaction objects, the M audio interaction objects whose second sound source relative energies are greater than the target masking energy threshold.
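A minimal sketch of the energy model and masking-threshold screening described above (the function names and the value of the preset parameter `alpha` are assumptions):

```python
def relative_energy(intensity: float, distance: float) -> float:
    """Sound source energy at the listener: proportional to the audio
    intensity, inversely proportional to the square of the relative distance
    (a free-field attenuation model)."""
    return intensity / (distance ** 2)

def select_audible_sources(candidates, alpha=0.5):
    """Keep candidates whose relative energy exceeds alpha times the maximum
    relative energy (the target masking energy threshold).

    candidates: list of (name, intensity, distance) tuples.
    alpha: the preset parameter in (0, 1); its actual value is not given
    in the patent, so 0.5 here is purely illustrative.
    """
    energies = {name: relative_energy(i, d) for name, i, d in candidates}
    threshold = alpha * max(energies.values())
    return [name for name, e in energies.items() if e > threshold]
```

Sources far below the loudest one are assumed perceptually masked and dropped before mixing, which is what bounds M well below N in a crowded scene.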
As an alternative, the transmitting unit includes: a fourth obtaining module, configured to obtain the target audio information of the interaction audio of each of the target number of second audio interaction objects, wherein the target audio information includes the spatial distance information, spatial azimuth information, and audio source signal of the second audio interaction object; and a forwarding module, configured to send the target audio information of each second audio interaction object to the target client over the corresponding routing channel, wherein the target client determines the associated transmission data matched with each second audio interaction object according to the spatial distance information and the spatial azimuth information, convolves and mixes the audio source signals using the associated transmission data to obtain the mixed target virtual audio, and the target virtual audio is played to the first audio interaction object on the target client.
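The client-side convolution mixing step might be sketched as follows, with per-source left/right impulse responses standing in for the "associated transmission data" (an HRIR lookup keyed by distance and azimuth is assumed but omitted here):

```python
import numpy as np

def convolve_and_mix(sources):
    """Binaural mixing sketch: convolve each mono source signal with a
    per-source left and right impulse response, then sum the results into
    one two-channel mix.

    sources: list of (signal, h_left, h_right), each a 1-D numpy array.
    The filters are placeholders; a real client would select them from the
    source's spatial distance and azimuth information.
    """
    length = max(len(s) + max(len(hl), len(hr)) - 1 for s, hl, hr in sources)
    left = np.zeros(length)
    right = np.zeros(length)
    for signal, h_left, h_right in sources:
        l = np.convolve(signal, h_left)   # full linear convolution
        r = np.convolve(signal, h_right)
        left[:len(l)] += l
        right[:len(r)] += r
    return np.stack([left, right])  # shape (2, length): stereo mix
```

Because only the target number of sources ever reaches this loop, the per-frame convolution cost on the client stays bounded regardless of how many users occupy the virtual space.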
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the interactive method of virtual audio as above.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the virtual audio interaction method described above through the computer program.
In this embodiment of the present application, for a first audio interaction object, when a plurality of (at least two) other second audio interaction objects in the virtual space perform audio interaction with it, the perception quality indicated by the effective perception parameters when different numbers of second audio interaction objects sound simultaneously is used to determine the second audio interaction objects with the highest perception quality and the audio they produce when sounding simultaneously. The method and device select, from the sound sources corresponding to the plurality of second audio interaction objects, the audio data that the first audio interaction object perceives best and transmit it to the client. This guarantees the quality of the mixed virtual audio signal while reducing the number of audio streams used in subsequent mixing, thereby reducing the computation load on the server and the network transmission bandwidth, achieving the technical effect of improving the interaction efficiency of virtual audio, and solving the technical problem of low virtual audio interaction efficiency in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 2 is a schematic illustration of a flow of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an alternative virtual audio interaction method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative virtual audio interaction device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate a clearer understanding of the embodiments related to the present application, the following concepts will be explained.
Virtual stereo: the virtual stereo is a method for converting sound signals of sound sources with different directions in a real sound environment into a double-channel sound signal by adopting a double-channel processing technology, and finally playing the double-channel sound signal through two channels or two loudspeakers, so that sound field reproduction is realized, and the virtual stereo is a method for generating a stereo effect by simulating sound delay and sound intensity difference from a sound source to left and right ears through an algorithm. The virtual stereo compresses multichannel information to double channels in the original real environment, so that the cost of transmission and playing equipment is effectively reduced, the cost performance of the stereo effect is best, and the input of the virtual stereo algorithm is usually mono sound signals of sound sources and azimuth coordinate information of the sound sources, and the input is left and right double-channel sound signals.
Meta universe: the meta universe is a virtual reality concept, which is a digital world based on blockchain and encryption technology, and can interact with the real world. The meta-universe provides an open platform that allows users to freely explore, create, and experience a variety of content and applications in the virtual world, including games, social networks, virtual reality, and the like. For example, the meta universe may find a virtual space parallel to the real world, simulated by a computer, entered in a virtual fit, for a user to find a connection terminal by wearing headphones and an eyepiece.
Spatial sound effects: spatial audio processing lets the user hear sound with a stronger sense of stereo imaging and spatial layering. By playing back through headphones or two or more loudspeakers, the auditory scene of a real environment is reproduced, so that the listener can clearly recognize the azimuth, distance, and motion trajectory of different acoustic objects, feel enveloped by sound from all directions, and enjoy an immersive listening experience as if present in the real environment.
Artificial intelligence (Artificial Intelligence, AI for short) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or foundation model, can, after fine-tuning, be widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing concerns natural language, the language people use in daily life, and is therefore closely related to linguistics as well as to computer science and mathematics. An important technique for model training in the artificial intelligence domain, the pre-training model, was developed from large language models (Large Language Models) in the NLP field; through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
According to an aspect of the embodiments of the present application, there is provided a virtual audio interaction method. Optionally, as an optional implementation, the virtual audio interaction method may be, but is not limited to being, applied to the environment shown in fig. 1. The environment includes, but is not limited to, a client 102 and a server 112; the client 102 may include, but is not limited to, a display 104, a processor 106, and a memory 108, and the server 112 includes a database 114 and a processing engine 116.
The specific process comprises the following steps:
step S102, a user controls a virtual character to perform a task (such as a game task) in the virtual space through the client 102, and triggers an audio transmission request to acquire the interaction audio generated by the surrounding environment or by virtual characters on other clients;
steps S104-S106, the client 102 initiates an audio transmission request to the server 112;
step S108, the server 112 receives the audio transmission request and responds to it through the processing engine 116;
step S110, the server 112 respectively obtains the effective perception parameters of the first audio interaction object when different numbers of second audio interaction objects sound simultaneously;
step S112, the server 112 determines a target number of second audio interactive objects and corresponding target interactive audio;
steps S114-S116, the target interaction audio is sent to the client 102 through the network 110, where the processor 106 in the client 102 receives the target interaction audio, performs the corresponding convolution mixing, and plays the mixed audio, displaying corresponding text information on the display 104 during playback and storing the mixed audio in the memory 108.
In addition to the example shown in fig. 1, the above steps may be performed by the user device or the server independently, or by the user device and the server cooperatively, such as by the client 102 performing the steps of S108 to S112 described above, thereby relieving the processing pressure of the server 112. The client 102 includes, but is not limited to, a notebook computer, a tablet computer, a desktop computer, a smart television, etc., and the application is not limited to a specific implementation of the client 102. The server 112 may be a single server or a server cluster composed of a plurality of servers, or may be a cloud server.
Alternatively, as an optional implementation manner, as shown in fig. 2, the interaction method of the virtual audio may be performed by an electronic device, such as a client or a server shown in fig. 1, and specific steps include:
s202, acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interactive audio to a target client matched with a first audio interactive object, the first audio interactive object is an audio interactive object positioned in a virtual space, the first audio interactive object performs audio interaction with at least two second audio interactive objects in the virtual space, and the target interactive audio is interactive audio received by the first audio interactive object in the virtual space;
S204, responding to the audio sending request, and respectively acquiring effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects simultaneously sound, wherein the effective perception parameters are used for measuring perception quality corresponding to interactive audio received by the first audio interactive object;
S206, determining a target number of second audio interactive objects from the at least two second audio interactive objects, and sending the audio produced when the target number of second audio interactive objects sound simultaneously to the target client as the target interactive audio, wherein the effective perception parameter corresponding to the first audio interactive object when the target number of second audio interactive objects sound simultaneously is highest.
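The selection in steps S204-S206 can be sketched as a simple maximization over candidate counts of simultaneously sounding second audio interactive objects. A minimal illustration, assuming a hypothetical `perception` lookup that stands in for the azimuth-sparsity-based effective perception parameter described later in this document:

```python
# Hypothetical sketch of steps S204-S206: among candidate counts of
# simultaneously sounding second audio interactive objects, pick the count
# whose effective perception parameter is highest. `perception` is a
# stand-in for the lookup described later in this document.

def choose_target_number(candidate_counts, perception):
    """Return (count, parameter) for the highest effective perception."""
    best_n, best_p = None, float("-inf")
    for n in candidate_counts:
        p = perception(n)
        if p > best_p:          # keep the first count reaching the maximum
            best_n, best_p = n, p
    return best_n, best_p

# Toy curve: perceived quality peaks at 3 concurrent sources.
quality = {1: 0.4, 2: 0.7, 3: 0.9, 4: 0.6}
print(choose_target_number([1, 2, 3, 4], quality.get))
```

Only the audio of the chosen count of objects is then forwarded to the target client.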
Alternatively, in this embodiment, the above virtual audio interaction method may be, but is not limited to being, applied to a virtual space sound effect scene, such as a multi-party stereo mixing scene in the metaverse. Realizing the metaverse requires integrating artificial intelligence and virtual perception technologies across audio, video and other senses to construct a computer virtual space approaching real-world perception, in which a user, by means of some hardware devices (headphones, glasses and somatosensory equipment), can have an experience indistinguishable from the real world. The virtual space sound effect is a very important part: it restores the binaural sound signal of a real environment, so that a user wearing headphones perceives the stereophonic effects of a real environment (such as the speech, laughter and footsteps of different people in different surrounding directions, the engine sound of a car approaching from afar, sidewalk prompt tones, wind and rain, and the like). Because many users are active in the virtual space, each user receives a large amount of sound source information; part of the sound sources are collected in real time by the recording equipment of real users, and part are constructed by the system according to the virtual scene. The different sound sources are transmitted over the network to the terminal user for stereo mixing, and the mix is then played through the user's headphones or loudspeakers.
The flow of the existing multi-party stereo mixing scheme is as follows: the sound signals and user position (virtual position) information of the different users are sent to a server; the server forwards all the sound signals and position information to the related user terminals; each user terminal finally generates multi-party stereo signals based on the relative position information, performs stereo mixing, and plays the mixed audio signal through stereo headphones or a loudspeaker.
Further illustratively, as shown in fig. 3, the server 302 is responsible for audio data forwarding; for example, the server 302 forwards the audio of the sound source 304 to the other sound sources (the sound source 306, the sound source 308 and the sound source 310), and forwards the audio of the other sound sources (the sound source 306, the sound source 308 and the sound source 310) to the client where the sound source 304 is located.
It will be appreciated that each user receives N-1 channels of data (mono sound data and position data) forwarded by the server. As the number of users participating in virtual space audio interaction increases, the data received by each user terminal grows linearly, while the bandwidth consumption of the server grows quadratically. Since each receiving client must perform stereo generation processing on the multiple incoming sound signals, for example real-time convolution based on HRIR (head-related impulse response) data, before finally mixing them, the computational cost is high. For large-scale virtual space social application scenes, the number of users is very large, with possibly hundreds of thousands of participants, which brings large calculation overhead and network bandwidth consumption, causes data packet loss, network delay jitter, non-real-time processing and the like, and results in the technical problem of low virtual audio interaction efficiency.
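To make the per-client cost concrete: in the existing scheme each receiving client convolves every one of the N-1 incoming mono streams with an HRIR before mixing, so the multiply count per output frame grows with the number of sources. A minimal pure-Python sketch of this per-ear processing (all signals and HRIR taps are made-up toy values):

```python
# Illustrative cost of the existing scheme: the client convolves each
# incoming mono stream with a head-related impulse response (HRIR), then
# sums the rendered streams sample-wise. One ear only; toy data.

def fir_convolve(signal, hrir):
    """Time-domain FIR convolution of a mono signal with HRIR taps."""
    out = [0.0] * (len(signal) + len(hrir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(hrir):
            out[i + j] += s * h
    return out

def mix_binaural(sources, hrirs):
    """Render each source with its HRIR, then mix by sample-wise sum."""
    rendered = [fir_convolve(s, h) for s, h in zip(sources, hrirs)]
    length = max(len(r) for r in rendered)
    return [sum(r[i] for r in rendered if i < len(r)) for i in range(length)]

# Two toy sources: cost is one full convolution per source, per frame.
mono = [[1.0, 0.5], [0.25, 0.25]]
hrirs = [[1.0, 0.0], [0.5, 0.5]]
print(mix_binaural(mono, hrirs))  # [1.125, 0.75, 0.125]
```

With N-1 incoming streams, the client performs N-1 such convolutions per frame, which is what the method below reduces.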
For the above problems, the virtual audio interaction method provided in this embodiment selects, from the multiple sound sources corresponding to the current multiple second audio interaction objects, a limited number (the target number) of audio streams with the best perception effect for the first audio interaction object, and transmits only these to the client. This achieves the purpose of guaranteeing the quality of the mixed virtual audio signal while reducing the number of audio streams used for subsequent mixing, reducing the calculation load of the server and the network transmission bandwidth, achieving the technical effect of improving the interaction efficiency of virtual audio, and solving the technical problem of low virtual audio interaction efficiency in the prior art.
Optionally, in this embodiment, the first audio interactive object may be, but is not limited to, the current audio interactive object located in the virtual space, and the second audio interactive objects may be, but are not limited to, the other audio interactive objects located in the virtual space. The first audio interactive object may, but is not limited to, perform audio interaction with at least two second audio interactive objects in the virtual space; in that case the (matched) target client where the first audio interactive object is located receives the virtual audio (target interactive audio) generated when at least two second audio interactive objects sound simultaneously at a certain moment or during a certain period, performs mixing processing on it, and plays it to the first audio interactive object.
Alternatively, in this embodiment, the target interactive audio may be, but is not limited to, multi-stereo within the virtual space, and may be, but is not limited to, interactive audio received by a client matching the first audio interactive object.
It should be noted that, for the different second audio interaction objects, the relative distance from the current first audio interaction object may be different, and the energy of the virtual audio generated by the different second audio interaction objects may also be different. For multi-stereo sound in the virtual space, each audio interaction object is taken as a sound source and has definite coordinate position information, the auditory perception of a listener is related to the relative distance between the sound source and the relative direction of the sound source, the closer the sound source distance of the same sound intensity is, the stronger the auditory perception is, and the auditory perception corresponding to the distribution situation of different numbers of sound source directions is also different.
It can be appreciated that in the case that the first audio interactive object in the multi-stereo application scene of the virtual space receives interactive audio of other second audio interactive objects with different numbers and different directions, the corresponding interactive audio perception quality is different.
In a stereo application scenario, in addition to the energy of a sound source's interactive audio itself, stereo azimuth auditory masking (i.e., the number of sound sources and their azimuth information) is another important factor affecting the listener's auditory perception. When the number of sound sources is large and they are relatively concentrated, or when the number of sound sources is small and they are relatively scattered, the corresponding listener's auditory perception is often not of optimal quality. For example, if the sound sources are numerous and relatively concentrated, the perceived quality of hearing is reduced by interference between the multiple sound sources.
Further by way of example, the above virtual audio interaction method is applied to the interaction scenario shown in fig. 3, so that the server 302 can receive the interactive audio of each sound source, filter and combine it, and transmit it to the clients where the other sound sources are located. Specifically, with the sound source 304 as the first audio interactive object and the sound sources 306, 308 and 310 as the second audio interactive objects, the perceived quality when the sound source 304 receives from the server 302 the interactive audio of two of them (for example the sound sources 306 and 308) may differ from the perceived quality when the sound source 304 receives the interactive audio of all three sound sources 306, 308 and 310.
It can be understood that, in response to the audio transmission request, effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects simultaneously sound are respectively obtained, so as to measure the perception quality of the corresponding interactive audio, and determine the second audio interactive object with the highest effective perception parameters/the highest perception quality.
For further illustration, as shown in fig. 4, the sound source 402 is the first audio interactive object, and the sound sources 404, 406, 408 and 410 are the 4 second audio interactive objects. Respectively obtaining, in response to the audio sending request, the effective perception parameters of the first audio interactive object when different numbers of the second audio interactive objects sound simultaneously includes: acquiring a first effective perception parameter 412 when 2 of them (e.g., the sound sources 404 and 406) sound simultaneously, a second effective perception parameter 414 when 3 of them (e.g., the sound sources 404, 406 and 408) sound simultaneously, and a third effective perception parameter 416 when all 4 sound simultaneously. Further, the largest effective perception parameter is determined from the first effective perception parameter 412, the second effective perception parameter 414 and the third effective perception parameter 416. Taking the second effective perception parameter 414 as the largest as an example, it is determined that the target number is 3, that the perceived quality of the interactive audio received by the sound source 402 is highest when the sound sources 404, 406 and 408 sound simultaneously, and the audio produced when the sound sources 404, 406 and 408 sound simultaneously is sent as the target interactive audio to the target client where the sound source 402 is located.
Through the embodiment provided by the application, for the first audio interactive object, in the case that a plurality of (at least two) other second audio interactive objects in the virtual space perform audio interaction with it, the number of second audio interactive objects with the highest perceived quality, and their audio when sounding simultaneously, are determined according to the interactive audio perceived quality indicated by the effective perception parameters when different numbers of second audio interactive objects sound simultaneously. The method selects, from the multiple sound sources corresponding to the current multiple second audio interactive objects, the audio data with the best perception effect for the first audio interactive object to transmit and forward to the client, thereby not only guaranteeing the quality of the mixed virtual audio signal, but also reducing the number of audio streams used for subsequent mixing, reducing the calculation load of the server and the network transmission bandwidth, and realizing the technical effect of improving the interaction efficiency of virtual audio.
As an alternative, respectively obtaining effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects sound simultaneously, including:
s1, determining a circular space area by taking a first position of a first audio interactive object in a virtual space as a circle center and a target distance as a radius, wherein the target distance is larger than a distance between the first position and any second position, and the second position is the position of a second audio interactive object in the virtual space;
S2, acquiring azimuth sparsity of the sound source when different numbers of second audio interactive objects are used as the sound source to simultaneously sound according to the circular space region, wherein the azimuth sparsity is used for representing the distribution degree of the sound source in azimuth, and the azimuth is the position of the sound source in the virtual space relative to the first audio interactive object;
s3, determining a target range where the azimuth sparsity is located and an effective perception parameter corresponding to the target range.
Optionally, in this embodiment, a circular space area is determined with a first position of the first audio interaction object in the virtual space as a center and a target distance as a radius, so that each of the second audio interaction objects is located in the circular space area.
Optionally, in this embodiment, according to the 360-degree azimuth range of the horizontal plane, the areas in the different azimuths of the first audio interaction object are divided according to a target included angle, and the circular space area is thereby divided into Q sub-areas.
Optionally, in this embodiment, acquiring the azimuth sparsity of the sound source when the first number of second audio interaction objects are simultaneously sounded as the sound source includes: determining the number U of the sub-areas occupied by the first number of second audio interactive objects altogether, wherein U is more than or equal to 1 and less than or equal to Q; and determining the U/Q as the azimuth sparsity, wherein the azimuth sparsity is used for representing the distribution degree of the sound source in azimuth, and the smaller the azimuth sparsity is, the more concentrated the distribution degree of the sound source in azimuth is, the larger the azimuth sparsity is, and the more dispersed the distribution degree of the sound source in azimuth is.
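The U/Q definition can be illustrated directly. A minimal sketch, assuming Q equal-width sectors (the embodiment also allows non-uniform division, shown later) and source positions given as azimuth angles relative to the first audio interactive object:

```python
# Azimuth sparsity U/Q: the circular area around the listener is divided
# into Q sectors; U is how many distinct sectors the simultaneously
# sounding sources occupy. Equal 360/Q-degree sectors assumed here.

def azimuth_sparsity(source_angles_deg, q=12):
    """Fraction of the Q azimuth sectors containing at least one source."""
    width = 360 / q
    occupied = {int((a % 360) // width) for a in source_angles_deg}
    return len(occupied) / q

# Three clustered sources vs. three spread-out sources.
print(azimuth_sparsity([10, 20, 25]))   # concentrated: low sparsity
print(azimuth_sparsity([0, 120, 240]))  # dispersed: higher sparsity
```

A low value means the sources crowd few sectors; a high value means they are spread around the listener.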
It should be noted that, under the condition that the azimuth sparsity corresponding to the current number of second audio interactive objects is determined, determining a corresponding effective perception parameter according to a target range where the azimuth sparsity is located, where the target range is used for indicating a range of intervals where the azimuth sparsity is located.
It should be noted that, each azimuth sparsity corresponds to a determined effective sensing parameter, but one effective sensing parameter may correspond to a plurality of azimuth sparsities.
As an alternative, before determining the circular space region by taking the first position of the first audio interactive object in the virtual space as the center and taking the target distance as the radius, the method further comprises:
s1, acquiring first sound source relative energy of at least two second audio interactive objects, wherein the first sound source relative energy is used for indicating the energy of the interactive audio of the second audio interactive objects relative to the energy of the first audio interactive objects;
s2, ordering at least two second audio interactive objects from large to small according to the relative energy of the first sound source to obtain a plurality of ordered second audio interactive objects;
according to the circular space region, when the second audio interactive objects with different numbers are obtained as sound sources to simultaneously sound, the azimuth sparseness of the sound sources comprises:
S3, dividing the circular space region into a first number of space subregions according to the orientation of the first audio interaction object, wherein the region angle of the corresponding space subregion in the circular space region, which is oriented by the positive direction of the first audio interaction object, is smaller than the region angle of the corresponding space subregion in the circular space region, which is oriented by the negative direction of the first audio interaction object, and the first number is an integer greater than or equal to 2;
s4, determining at least one second audio interactive object contained in an audio object set where different numbers of second audio interactive objects are located from a plurality of second audio interactive objects in sequence, wherein the second audio interactive objects in the at least one second audio interactive object are not repeated, and the second audio interactive objects in the different numbers of audio object sets are allowed to be repeated;
s5, respectively acquiring second numbers which are occupied by the second audio interactive objects in the circular space area and correspond to different space subareas;
s6, respectively determining the proportional relation between each second quantity and each first quantity, and taking the proportional relation as the azimuth sparsity corresponding to each audio object set.
Optionally, in this embodiment, acquiring the first sound source relative energy of the at least two second audio interactive objects includes: acquiring the relative distance between the second audio interactive object and the first audio interactive object, and the audio intensity of the second audio interactive object; and determining the first sound source relative energy of the second audio interactive object relative to the first audio interactive object according to the relative distance and the audio intensity, wherein the first sound source relative energy is proportional to the audio intensity and inversely proportional to the square of the relative distance.
By way of further example, suppose there are P virtual interactive objects in the current virtual space (the current first virtual interactive object and P-1 second virtual interactive objects), corresponding to P sound source signals and their azimuth information, where the azimuth information of the first virtual interactive object is (x0, y0), the azimuth information of the ith second virtual interactive object is (xi, yi), and i is the sequence number of each second virtual interactive object, i = 1~P-1. The relative distance D(i) between the ith second virtual interactive object and the first virtual interactive object can be, but is not limited to be, obtained by formula (1):
D(i) = sqrt((xi - x0)^2 + (yi - y0)^2)    (1)
And obtaining the audio intensity E(i) between the second audio interactive object and the first audio interactive object may include, but is not limited to: performing high-pass filtering on the interactive audio (sound source signal) of the second audio interactive object (for example, filtering out frequencies below 250 Hz), and calculating the instantaneous energy of the current frame (for example, 20 ms per frame) of the filtered signal, as shown in formula (2):
E(i) = sum over k = 1~K of s(k)^2    (2)
wherein s(k) is the value of the kth sample of the current frame of the ith sound source after high-pass filtering, k = 1~K, and K is the number of samples in one frame.
It should be noted that the interactive audio (sound source signal) of the second audio interactive object takes the form of a discrete array collected by the microphone device and corresponds to a plurality of samples; for example, with 1 frame per 20 ms and one sample collected every 2 ms, K = 10 and k = 1~10.
Alternatively, in the present embodiment, the first sound source relative energy E1(i) of the second audio interactive object with respect to the first audio interactive object is determined according to the above relative distance D(i) and the above audio intensity E(i), as shown in formula (3):
E1(i) = E(i) * d0^2 / D(i)^2    (3)
wherein d0 is a reference distance constant, for example 1 meter; the first sound source relative energy of each second audio interactive object is calculated with reference to the distance d0.
In the calculation, after obtaining the first sound source relative energy E1(i) of the current frame, and considering that the energy information of historical frames needs to be combined to measure the perceived sound intensity, a historically weighted relative energy value may also, but is not limited to, be calculated by weighted smoothing and taken as the updated first sound source relative energy, as shown in formula (4):
E1'_j(i) = a * E1'_(j-1)(i) + (1 - a) * E1_j(i)    (4)
where E1'_j(i) is the smoothed first sound source relative energy of the ith sound source at frame j, a is a constant, e.g., a = 0.9, and j is the frame number.
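Formulas (1)-(4) can be combined into a short worked sketch. The exact forms of (3) and (4) below are assumptions consistent with the surrounding text (relative energy proportional to frame energy and to d0^2 / D(i)^2, exponential smoothing over frames with weight a = 0.9); the high-pass filtering step is omitted:

```python
# Worked sketch of formulas (1)-(4): Euclidean distance, frame energy,
# distance-normalized relative energy with reference distance d0, and
# exponential smoothing over frames. Forms of (3)/(4) are assumed.
import math

def relative_distance(p0, pi):                  # formula (1)
    return math.sqrt((pi[0] - p0[0]) ** 2 + (pi[1] - p0[1]) ** 2)

def frame_energy(samples):                      # formula (2)
    return sum(s * s for s in samples)

def relative_energy(energy, distance, d0=1.0):  # formula (3), assumed form
    return energy * (d0 / distance) ** 2

def smoothed_energy(prev, current, a=0.9):      # formula (4), assumed form
    return a * prev + (1 - a) * current

d = relative_distance((0, 0), (3, 4))   # 5.0
e = frame_energy([0.5, -0.5, 1.0])      # 1.5
rel = relative_energy(e, d)             # about 0.06
print(d, e, rel, smoothed_energy(0.1, rel))
```

The smoothed value is what gets compared when sorting the second audio interactive objects by energy.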
Optionally, in this embodiment, after the first sound source relative energy of each second audio interactive object is acquired, the second audio interactive objects are sorted from large to small according to the first sound source relative energy to obtain the ordered plurality of second audio interactive objects, where the first second audio interactive object after sorting has the largest first sound source relative energy and corresponds to the effective perception parameter when the number of simultaneously sounding second audio interactive objects is 1.
It should be noted that the above respectively obtaining the effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects sound simultaneously includes: obtaining the effective perception parameter when the one second audio interactive object (A1, ranked first) with the largest first sound source relative energy sounds; obtaining the effective perception parameter when the two second audio interactive objects (A1 and A2, ranked first and second) with the largest first sound source relative energy sound simultaneously; obtaining the effective perception parameter when the three second audio interactive objects (A1, A2 and A3, ranked first, second and third) with the largest first sound source relative energy sound simultaneously; and so on.
Optionally, in this embodiment, according to the 360-degree azimuth area of the horizontal plane, each area in different azimuth of the first audio interaction object is divided according to the target included angle, and the circular space area is divided into a first number of space sub-areas (i.e., the Q sub-areas).
Further by way of example, as shown in fig. 5, the circular area is divided into Q space sub-areas according to the 360-degree azimuth of the horizontal plane, the areas in different azimuths relative to the facing direction of the first audio interaction object being divided with included angles of 15-45 degrees: the front area is divided with smaller angles because the human ear's resolution there is stronger, and the rear area with larger angles because the resolution there is weaker. After the Q space sub-areas are divided, with the R sound sources included in the audio object set of a given number of second audio interactive objects occupying U space sub-areas in total, the azimuth sparsity is determined as U/Q.
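The non-uniform division of fig. 5 can be sketched as follows, with hypothetical sector widths of 15 degrees in front, 30 degrees at the sides and 45 degrees behind (the concrete widths and sector counts are assumptions within the stated 15-45 degree range, not taken from fig. 5):

```python
# Hypothetical non-uniform sector layout: narrower sectors in front (finer
# human-ear resolution), wider ones behind. Angles are azimuths relative
# to the listener's facing direction, counted from the front edge.

def build_sectors(front_deg=15, side_deg=30, back_deg=45):
    """Return Q sector widths covering 360 degrees, front sectors first."""
    widths = ([front_deg] * 4      # frontal region, finest resolution
              + [side_deg] * 4     # the two side regions
              + [back_deg] * 4)    # rear region, coarsest resolution
    assert sum(widths) == 360
    return widths

def sector_index(angle_deg, widths):
    """Map a relative azimuth in [0, 360) to its sector number."""
    a, edge = angle_deg % 360, 0
    for idx, w in enumerate(widths):
        edge += w
        if a < edge:
            return idx
    return len(widths) - 1

widths = build_sectors()
print(len(widths), sector_index(5, widths), sector_index(200, widths))
```

Counting the distinct `sector_index` values of the R sound sources gives U, and U divided by `len(widths)` gives the azimuth sparsity of the previous paragraph.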
According to the embodiment provided by the application, a definition of azimuth sparsity is provided and the effective perception parameter is judged on its basis. By fully using the effective perception parameter corresponding to the azimuth sparsity, optimal audio auditory perception can be achieved while the smallest number of interactive audio streams is determined, which greatly reduces the network bandwidth and computing cost of the server for transmitting audio while guaranteeing the mixing quality, achieving the technical effect of improving the interaction efficiency of virtual audio.
As an optional solution, after respectively acquiring the effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects sound simultaneously, the method further includes:
S1, determining, according to the effective perception parameter corresponding to each audio object set, that the perceived quality of the received interactive audio is highest when the second audio interactive objects in the target set sound simultaneously, wherein the audio object sets include the target set; the number of second audio interactive objects in the at least one second audio interactive object contained in the target set is the target number; the average energy corresponding to the second audio interactive objects in the at least one second audio interactive object contained in the target set is greater than or equal to the average energy corresponding to the second audio interactive objects in the at least one second audio interactive object contained in any candidate set among the audio object sets; and the number of second audio interactive objects in the at least one second audio interactive object contained in the candidate set is the target number.
Optionally, in this embodiment, at least one audio object set with the largest effective perception parameter is determined; from the at least one audio object set, the target set with the largest average energy (i.e., the smallest number of second audio interactive objects) is determined; and the target number and the target interactive audio are determined based on the target set.
Further by way of example, in the case of obtaining the above sorted M second audio interactive objects, the azimuth sparsity corresponding to the first J second audio interactive objects and the effective perception parameter corresponding to that azimuth sparsity are obtained by traversing in turn, where J = 1~M. Once the largest (target) effective perception parameter is determined, the at least one J value corresponding to it is further determined, and the smallest of these J values (Jmin) is determined as the target number, the average energy corresponding to the first Jmin second audio interactive objects being the largest among the candidate sets.
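The Jmin selection can be sketched as follows, assuming the effective perception parameters for the energy-sorted prefixes J = 1~M have already been computed:

```python
# Sketch of the target-set selection: traverse the prefix sizes J = 1..M
# of the energy-sorted second audio interactive objects, find the maximum
# effective perception parameter, and among ties take the smallest J
# (Jmin), i.e., the prefix with the highest average energy.

def pick_target_count(perception_by_j):
    """perception_by_j[J-1] is the effective perception parameter for the
    top-J energy-sorted second audio interactive objects."""
    best = max(perception_by_j)
    jmin = perception_by_j.index(best) + 1  # first (smallest) J hitting best
    return jmin, best

# Toy curve over M = 5 prefixes; J = 3 and J = 4 tie at the maximum.
print(pick_target_count([0.3, 0.6, 0.9, 0.9, 0.7]))  # (3, 0.9)
```

Because the objects are sorted by energy, taking the smallest tying J automatically yields the prefix with the largest average energy.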
According to the embodiment provided by the application, a definition of azimuth sparsity is provided and the effective perception parameter is judged on its basis. By fully using the effective perception parameter corresponding to the azimuth sparsity, optimal audio auditory perception can be achieved while the smallest number of interactive audio streams is determined, which greatly reduces the network bandwidth and computing cost of the server for transmitting audio while guaranteeing the mixing quality, achieving the technical effect of improving the interaction efficiency of virtual audio.
As an optional solution, determining a target range where the azimuth sparsity is located and an effective perception parameter corresponding to the target range includes:
S1, under the condition that a target range is determined, determining an effective perception parameter corresponding to the target range according to a target mapping relation, wherein the target range is used for indicating a range of a section where azimuth sparsity is located, and the target mapping relation is used for indicating:
under the condition that the target range is between 0 and a first azimuth sparse value, the effective perception parameter is a first effective perception value;
under the condition that the target range is between the first azimuth sparse value and a second azimuth sparse value, the effective perception parameter is between the first effective perception value and a second effective perception value, and the effective perception parameter increases as the azimuth sparsity increases, wherein the second azimuth sparse value is larger than the first azimuth sparse value, and the second effective perception value is larger than the first effective perception value;
under the condition that the target range is between the second azimuth sparse value and a third azimuth sparse value, the effective perception parameter is a third effective perception value, wherein the third azimuth sparse value is larger than the second azimuth sparse value, and the third effective perception value is larger than the second effective perception value;
under the condition that the target range is between the third azimuth sparse value and a fourth azimuth sparse value, the effective perception parameter is between a fourth effective perception value and the third effective perception value, and the effective perception parameter decreases as the azimuth sparsity increases, wherein the fourth azimuth sparse value is larger than the third azimuth sparse value, and the fourth effective perception value is smaller than the third effective perception value;
and under the condition that the target range is larger than the fourth azimuth sparse value, the effective perception parameter is the fourth effective perception value.
Optionally, in this embodiment, the target mapping relationship is used to indicate that, in a case where the target range is between 0 and the first azimuth sparse value (real number greater than 0), the effective sensing parameter is a first effective sensing value, in a case where the target range is between the first azimuth sparse value and the second azimuth sparse value, the effective sensing parameter is between the first effective sensing value and the second effective sensing value, and the effective sensing parameter increases with increasing azimuth sparse, in a case where the target range is between the second azimuth sparse value and the third azimuth sparse value, the effective sensing parameter is a third effective sensing value, in a case where the target range is between the third azimuth sparse value and the fourth azimuth sparse value, the effective sensing parameter is between the fourth effective sensing value and the third effective sensing value, and the effective sensing parameter decreases with increasing azimuth sparse, and in a case where the target range is greater than the fourth azimuth sparse value, the effective sensing parameter is a fourth effective sensing value.
It should be noted that the effective perception parameter may be, but is not limited to, used to indicate the number of sound sources that the first interactive audio object can simultaneously perceive. Further by way of example, as shown in fig. 6, when the azimuth sparse value is lower than s0, representing that the sound source azimuths are relatively concentrated, the number of sound source azimuths that the first interactive audio object can simultaneously perceive is n0, that is, the effective perception parameter is n0; when the azimuth sparse value is between s0 and s1, representing that the sound source azimuths go from concentrated to dispersed, the number of sound source azimuths that the first interactive audio object can simultaneously perceive gradually increases with the azimuth dispersion, up to a maximum of n1, that is, the effective perception parameter is between n0 and n1; when the azimuth sparse value is between s1 and s2, representing that the sound source azimuths are relatively dispersed, the number of sound source azimuths that the first interactive audio object can simultaneously perceive is n1, that is, the effective perception parameter is n1; when the azimuth sparse value is between s2 and s3, representing that the sound source azimuths are excessively dispersed, the number of sound source azimuths that the first interactive audio object can simultaneously perceive gradually decreases due to interference among the sound sources, that is, the effective perception parameter is between n1 and n2; and when the azimuth sparse value exceeds s3, the number of sound source azimuths that the first interactive audio object can simultaneously perceive falls to n2, that is, the effective perception parameter is n2.
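The five-segment relation above can be sketched as a piecewise function. This is a minimal illustration only: the breakpoints s0–s3 and the counts n0, n1, n2 are placeholder values, since the patent leaves them unspecified.

```python
def effective_perception(sparsity, s0=0.1, s1=0.3, s2=0.6, s3=0.8,
                         n0=4, n1=8, n2=2):
    """Map an azimuth sparsity value to the number of sound sources the
    listener can perceive at once (the fig. 6 shape described above)."""
    if sparsity <= s0:                                   # concentrated azimuths
        return float(n0)
    if sparsity <= s1:                                   # rising ramp n0 -> n1
        return n0 + (n1 - n0) * (sparsity - s0) / (s1 - s0)
    if sparsity <= s2:                                   # dispersed: plateau at n1
        return float(n1)
    if sparsity <= s3:                                   # over-dispersed: falls n1 -> n2
        return n1 - (n1 - n2) * (sparsity - s2) / (s3 - s2)
    return float(n2)                                     # beyond s3: floor at n2
```

With these placeholder breakpoints, a sparsity in the plateau region maps to the maximum perceivable count, and values on the ramps interpolate linearly.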
As an alternative, after acquiring the audio transmission request, the method further includes:
s1, obtaining N candidate audio interaction objects, wherein the candidate audio interaction objects are audio interaction objects located in the virtual space, and N is a positive integer greater than or equal to 2;
s2, based on the second sound source relative energy of each of the N candidate audio interaction objects, screening out, from the N candidate audio interaction objects, the M audio interaction objects with the highest second sound source relative energy, and determining the M audio interaction objects as the at least two second audio interaction objects, wherein the second sound source relative energy is used to indicate the energy of the interactive audio of a candidate audio interaction object relative to the first audio interaction object, and M is a positive integer greater than or equal to 2 and less than or equal to N.
Optionally, in this embodiment, the N candidate audio interaction objects may initially be, but are not limited to, all audio interaction objects in the virtual space other than the first audio interaction object.
Optionally, in this embodiment, the second sound source relative energy of each of the N candidate audio interaction objects is obtained, the first M audio interaction objects with the highest second sound source relative energy are determined from them, and the M audio interaction objects are determined as the at least two second audio interaction objects.
According to the embodiment provided by the application, screening is performed through the second sound source relative energy: at least two second audio interaction objects with higher second sound source relative energy are initially determined from all candidate audio interaction objects in the virtual space, the effective perception parameter corresponding to the azimuth sparsity is fully utilized, and a smaller number of target interactive audio streams with optimal quality are determined from those at least two second audio interaction objects. The aim of optimal auditory perception is achieved while the number of forwarded interactive audio streams is minimized, so that the network bandwidth and computing cost of audio transmitted by the server are greatly reduced while the mixing quality is guaranteed, achieving the technical effect of improving the interaction efficiency of virtual audio.
As an alternative, screening out, from the N candidate audio interaction objects, the M audio interaction objects with the highest second sound source relative energy includes:
s1, acquiring a relative distance between each audio interaction object in N alternative audio interaction objects and a first audio interaction object, and acquiring audio intensity between each audio interaction object and the first audio interaction object;
S2, determining the second sound source relative energy of each audio interaction object relative to the first audio interaction object according to the relative distance and the audio intensity, wherein the second sound source relative energy is directly proportional to the audio intensity and inversely proportional to the square of the relative distance;
s3, determining a target audio interaction object with the largest second sound source relative energy from the N candidate audio interaction objects, and determining the product of the second sound source relative energy of the target audio interaction object and a preset parameter as the target masking energy threshold, wherein the preset parameter is between 0 and 1;
s4, determining, from the N candidate audio interaction objects, the M audio interaction objects whose second sound source relative energy is greater than the target masking energy threshold.
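A minimal sketch of steps S1–S4 above, under stated assumptions: planar coordinates, precomputed per-source intensities, a clamp at the reference distance d0 (so a source closer than d0 is not amplified), and an illustrative preset parameter cof are all choices of this sketch, not prescribed by the text.

```python
import math

def screen_top_m(listener_pos, candidates, cof=0.005, d0=1.0):
    """candidates: list of ((x, y), intensity) tuples for the N candidates.
    Returns indices whose second-sound-source relative energy exceeds the
    target masking energy threshold (max relative energy times cof)."""
    energies = []
    for (x, y), e in candidates:
        d = math.dist(listener_pos, (x, y))      # relative distance to listener
        em = e * (d0 / max(d, d0)) ** 2          # inverse-square relative energy
        energies.append(em)
    threshold = max(energies) * cof              # product of max energy and cof
    return [i for i, em in enumerate(energies) if em > threshold]
```

For a listener at the origin and equal-intensity sources at distances 1, 10 and 100, only the two nearest survive this threshold.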
Optionally, in this embodiment, acquiring the second sound source relative energy of each of the at least two second audio interaction objects includes: acquiring the relative distance between each second audio interaction object and the first audio interaction object, and the audio intensity between each second audio interaction object and the first audio interaction object; and determining the second sound source relative energy of each second audio interaction object relative to the first audio interaction object according to the relative distance and the audio intensity, wherein the second sound source relative energy is directly proportional to the audio intensity and inversely proportional to the square of the relative distance.
By way of further example, suppose that there are P virtual interactive objects in the current virtual space (including the current first virtual interactive object and P-1 second virtual interactive objects), corresponding to P sound source signals and their azimuth information, where the azimuth information of the first virtual interactive object is (x0, y0), the azimuth information of the i-th second virtual interactive object is (xi, yi), and i is the sequence number of each second virtual interactive object, i = 1 to P-1. The relative distance D(i) between the i-th second virtual interactive object and the first virtual interactive object can be, but is not limited to, obtained by formula (1):
D(i) = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}    (1)
And obtaining the audio intensity E(i) between each second audio interaction object and the first audio interaction object may include, but is not limited to: performing high-pass filtering on the interactive audio (sound source signal) of each second audio interaction object (for example, filtering out low frequencies below 250 Hz), and computing the instant energy of the current frame (for example, one frame is 20 ms) of the filtered signal, as shown in formula (2):
E(i) = \sum_{k=1}^{K} s(k)^2    (2)
wherein s(k) is the sample value of the k-th sample of the i-th sound source in the current frame after high-pass filtering, k = 1 to K, and K is the number of samples in one frame.
It should be noted that the interactive audio (sound source signal) of each second audio interaction object is a discrete array collected by the microphone device and corresponds to a plurality of samples; for example, if one frame is 20 ms and one sample is collected every 2 ms, then K = 10 and k ranges from 1 to 10.
Optionally, in this embodiment, according to the above relative distance D(i) and the above audio intensity E(i), the second sound source relative energy Em(i) of each second audio interaction object relative to the first audio interaction object is determined as shown in formula (3):
Em(i) = E(i) \cdot \left( \frac{d_0}{D(i)} \right)^2    (3)
wherein d0 is a reference distance constant, for example 1 meter, and the second sound source relative energy of each second audio interaction object is calculated with reference to the distance d0.
When calculating the above second sound source relative energy Em(i) of the current frame, considering that the energy information of historical frames needs to be combined to measure the perceived intensity of sound, a historically weighted relative energy value may also be, but is not limited to being, obtained through weighted smoothing and used as the updated second sound source relative energy, as shown in formula (4):
Esm(i, j) = a \cdot Esm(i, j-1) + (1 - a) \cdot Em(i, j)    (4)
where a is a constant, e.g., a=0.9, and j is a frame number.
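Formulas (2) and (4) can be sketched directly; the high-pass filtering step is omitted here and the frame is assumed to be already filtered:

```python
def frame_energy(samples):
    """Formula (2): instant energy as the sum of squared frame samples."""
    return sum(s * s for s in samples)

def smoothed_energy(prev_esm, current_em, a=0.9):
    """Formula (4): Esm(j) = a * Esm(j-1) + (1 - a) * Em(j),
    weighting the historical value by the constant a (e.g. 0.9)."""
    return a * prev_esm + (1 - a) * current_em
```

Applying `smoothed_energy` frame after frame yields an exponentially smoothed energy track that reacts slowly to single-frame spikes.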
It should be noted that a target audio interaction object with the largest second sound source relative energy is determined from the N candidate audio interaction objects, the product of the second sound source relative energy of the target audio interaction object and a preset parameter is determined as the target masking energy threshold, and the M audio interaction objects whose second sound source relative energy is greater than the target masking energy threshold are determined from the N candidate audio interaction objects.
As an alternative, sending, to the target client as the target interactive audio, the audio produced when the target number of second audio interaction objects sound simultaneously includes:
s1, acquiring target audio information of the interactive audio of each of the target number of second audio interaction objects, wherein the target audio information includes spatial distance information, spatial azimuth information and an audio source signal of each second audio interaction object;
s2, sending the target audio information of each second audio interaction object to the target client through a corresponding routing channel, wherein the target client determines associated transfer data matched with each second audio interaction object according to the spatial distance information and the spatial azimuth information, and performs convolution mixing on the audio source signals using the associated transfer data to obtain mixed target virtual audio, the target virtual audio being used for playback to the first audio interaction object on the target client.
Optionally, in this embodiment, when the target audio information corresponding to the target number of target interactive audio streams is determined, the server forwards the target audio information to the corresponding client through the target number of routing channels; after receiving the target number of sound signals (i.e., the audio source signals) and position information (i.e., the spatial distance information and the spatial azimuth information), the client performs stereo mixing processing (i.e., convolution mixing processing), including HRIR convolution processing and left-right channel mixing processing.
Optionally, in this embodiment, the HRIR convolution processing is used to indicate that the original mono input signal u(n) is convolved with the HRIR data h(n) of the corresponding azimuth and output as the binaural signal y(n), as shown in formula (5):
y(n) = u(n) * h(n)    (5)

where * denotes linear convolution.
as shown in fig. 7, the h (n) is internally divided into HRIR data of the left channel and the right channel, and thus y (n) is generated corresponding to the left channel and the right channel signal results.
Optionally, in this embodiment, the left-right channel mixing processing mixes the left channel signals and the right channel signals separately. The mixing method may include, but is not limited to, direct addition, the averaging method, the clamping method, normalization, adaptive mixing weighting, and an automatic alignment algorithm. Taking the averaging method as an example, as shown in formulas (6) and (7), the output left and right channel signals of all sound source signals are respectively summed and then averaged.
lout(n) = \frac{1}{J} \sum_{i=1}^{J} l_i(n)    (6)

rout(n) = \frac{1}{J} \sum_{i=1}^{J} r_i(n)    (7)

where l_i(n) and r_i(n) are the left and right channel outputs of the i-th of the J mixed sound sources.
Where lout and rout are the left and right channels of the stereo output.
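The averaging mix of formulas (6)–(7) can be sketched as a per-sample mean over the per-source binaural outputs; equal-length signals are assumed:

```python
def mix_average(left_signals, right_signals):
    """Formulas (6)-(7): per-sample mean over the J sources, per channel.
    left_signals/right_signals: lists of equal-length sample lists."""
    j = len(left_signals)
    n = len(left_signals[0])
    lout = [sum(sig[t] for sig in left_signals) / j for t in range(n)]
    rout = [sum(sig[t] for sig in right_signals) / j for t in range(n)]
    return lout, rout
```

Averaging rather than summing keeps the output amplitude bounded by the loudest source, which is why it resists clipping better than direct addition.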
As an alternative scheme, the above virtual audio interaction method is applied in a virtual multi-party stereo routing and mixing scene based on azimuth sparsity detection. In this scene, existing stereo mixing schemes have an obvious technical bottleneck: as the number of participating users grows, computing cost and network bandwidth consumption become enormous. For example, with N participants, the server has N×(N-1) channels of signals to forward, and each user client receives N-1 channels of signals and performs, on the client, stereo reconstruction processing with 2×(N-1) HRIR convolution computations and N-1 channels of stereo mixing. Since the virtual space application simulates the audio experience of a real scene, the value of N in practical applications is relatively large, causing heavy network bandwidth consumption and huge client computing cost. These lead to data packet loss, network delay jitter, and non-real-time computation, and thus to terminal audio stuttering and poor experience. Moreover, when the number of users grows beyond a certain point, the existing mixing schemes, which linearly superpose all sound source signals, easily produce clipping and a noisy jumble of sources, causing uncomfortable listening; this series of problems cannot be well solved by existing schemes.
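A quick back-of-envelope for the full-forwarding cost just described, reading the text's forwarding counts with the participant count N (the text writes N×(P-1); equating P with N is an assumption of this sketch):

```python
def full_forwarding_cost(n):
    """Under full forwarding with n participants: streams the server forwards,
    and HRIR convolutions each client performs per frame (2 per received stream)."""
    return {"server_streams": n * (n - 1), "client_convolutions": 2 * (n - 1)}
```

At n = 100 participants the server already forwards 9,900 streams per frame interval, which illustrates the quadratic growth the routing scheme is designed to avoid.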
By using the above virtual audio interaction method, based on the binaural stereo masking effect of the human ear, a definition of azimuth sparsity is provided, the upper limit of the number of sound sources perceivable by the human ear at the same time is determined based on the azimuth sparsity, and this is then applied to a multi-party stereo routing scheme: a limited number of sound sources perceivable by the human ear are selected from the current plurality of sound sources for transmission and forwarding, and the receiver finally hears the mixed signal of the selected sources. This embodiment is an effective method for multi-party stereo signal routing and mixing; its computing load and network transmission bandwidth are only a hundredth or even a thousandth of those of the traditional full-forwarding scheme, greatly reducing the operating cost of multi-party stereo service applications, and a user can also participate in large-scale stereo interactive applications with an ordinary terminal (a low-end device).
Further illustrating, a flow chart of the multi-party stereo mixing scheme based on the above virtual audio interaction method, as shown in fig. 8, includes: the client executes step S802, the server executes steps S804 to S814, and the client executes steps S816 to S818, as follows:
step S802, (client) transmits sound source sound signal and coordinate information;
Step S804, (server) calculates the sound source to listener distance;
step S806, (server) calculates the relative energy of each sound source based on the distance;
step S808, (server) calculates the relative energy of each sound source history weight;
step S810, (server) energy masking threshold calculation and TopM sound source filtering;
step S812, (server) calculates the azimuth sparsity of the Top M sound sources and the matched maximum selected azimuth number;
step S814, (server) selects the Top N sound sources;
step S816, (client) client stereo generation;
in step S818, the (client) client stereo mix.
Optionally, in this embodiment, assuming that there are P participants in the current virtual space, corresponding to P sound source signals and their azimuth information, the current user (e.g. listener A) and the P-1 other users have different relative distances in the virtual space, which can be calculated by the geometric formula $D(i) = \sqrt{(x(i) - x_0)^2 + (y(i) - y_0)^2}$, where (x0, y0) are the user's plane rectangular coordinates, (x(i), y(i)) are the plane rectangular coordinates of the other users, and i is the serial number of the other users, i = 1 to P-1. Each sound source signal is processed by high-pass filtering (such as filtering out low frequencies below 250 Hz), and the instant energy of the current frame (for example, one frame is 20 ms) of the filtered signal is calculated as $E(i) = \sum_{k=1}^{K} s(k)^2$, where s(k) is the sample value of the k-th sample of the i-th sound source of the current frame after high-pass filtering, k = 1 to K, and K is the number of samples of one frame.
According to the inverse square law, sound intensity is inversely proportional to the square of distance, so the relative energy value of each sound source is calculated from the distance result as $Em(i) = E(i) \cdot (d_0 / D(i))^2$, where d0 is a reference distance constant, e.g., 1 meter, and the relative energy values of the respective sound sources are calculated with reference to the distance d0.
The relative energy values of the current frame of the P-1 sound sources heard by each listener are obtained through this calculation, and the energy information of historical frames is combined to measure the perceived intensity of sound, so historically weighted relative energy values are obtained through weighted smoothing: $Esm(i, j) = a \cdot Esm(i, j-1) + (1 - a) \cdot Em(i, j)$, where a is a constant, e.g., a = 0.9, and j is the frame number.
For example, after calculating the P-1 corresponding historically weighted relative energy values of the P-1 sound sources heard by client A, this embodiment filters out a limited number of sound sources for server forwarding based on the acoustic binaural masking effect, as described in detail below.
Masking is a common psycho-acoustic phenomenon determined by the frequency- and time-resolving mechanisms of the human ear: relatively weak sounds in the vicinity of a stronger sound are not perceived by the human ear, i.e., they are masked. The strong sound is called the masker, and the weak sound the maskee. The present invention identifies a limited number of masking sound sources from the P-1 sound sources for server forwarding, while the masked sound sources are ignored.
Therefore, in this embodiment, the P-1 historically weighted relative energy values are ranked from large to small to obtain the maximum energy value. The first masking energy threshold ThresEsm is obtained by dividing this value by a constant coefficient cof0 (whose value is set empirically). The ordered Top M sources whose historically weighted relative energy values are greater than ThresEsm are filtered out of the P-1 sources, where M = min(P-1, M0), M0 being a constant such as M0 = 15.
In a stereo application scene, besides the first masking threshold, stereo azimuth auditory masking is another factor to consider. Because practical engineering scenes are complex, the masking value cannot be described precisely, so this embodiment gives a solution from the viewpoint of the number of selected channels. The number of sounds the human ear can perceive simultaneously is limited, but because the signal-to-mask ratio (SMR) between sound sources at different azimuths is reduced, the human ear can simultaneously perceive more sound sources than in a mono scene without azimuth. This embodiment therefore defines the azimuth sparsity parameter SparsePos of the sound sources and gives an upper limit Nout of the number of selected stereo channels corresponding to different SparsePos values; their relation can be as shown in fig. 9. When SparsePos is lower than s0, representing that the sound source azimuths are concentrated, the number of sound source azimuths perceivable by the human ear at the same time is n0; when SparsePos is between s0 and s1, representing that the sound source azimuths go from concentrated to dispersed, the number of perceivable sound source azimuths increases with the azimuth dispersion, up to a maximum of n1; when SparsePos is between s1 and s2, representing that the sound source azimuths are relatively dispersed, the number of perceivable sound source azimuths is n1; when SparsePos is between s2 and s3, representing that the sound source azimuths are excessively dispersed, the number of perceivable sound source azimuths gradually decreases due to interference among the sources; and when SparsePos exceeds s3, the number of perceivable sound source azimuths falls to n2.
This embodiment carries out the subsequent selection strategy processing based on the relation between azimuth sparsity and the maximum selected azimuth number.
Azimuth sparsity is defined as follows: the 360-degree horizontal azimuth plane is divided into Q regions, with each region at a different azimuth relative to the human body spanning an included angle of 15 to 45 degrees; normally, the front regions are divided at smaller angles because the human ear's resolution there is strong, and the rear regions at larger angles because the resolution there is weak, as shown in the figure below. After the Q regions are divided, if there are R sound sources occupying a total of U azimuth regions, the azimuth sparsity is SparsePos = U/Q.
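The SparsePos = U/Q computation can be sketched as follows. For simplicity this sketch uses Q uniform regions of 360/Q degrees each, whereas the text prescribes finer regions in front of the listener and coarser ones behind; the uniform division is an assumption of this sketch.

```python
import math

def azimuth_sparsity(listener, sources, q=12):
    """SparsePos = U / Q: fraction of the Q azimuth regions around the
    listener occupied by at least one sound source."""
    lx, ly = listener
    occupied = set()
    for x, y in sources:
        angle = math.degrees(math.atan2(y - ly, x - lx)) % 360.0
        occupied.add(int(angle // (360.0 / q)))  # region index of this source
    return len(occupied) / q
```

Two sources at the same azimuth count once, so clustered sources yield a low SparsePos and widely spread sources a high one.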
In order to further filter, through the azimuth masking effect, the ordered Top M sound sources obtained from the first masking energy threshold, the following pseudocode is used:
For j = 1 to M
    calculate the SparsePos(j) value of the ordered Top j sound sources and obtain the corresponding selected azimuth number Nout(j)
End
From these, the maximum selected number Nmax = max(Nout(j)) is found, together with the minimum value Jmin among the possibly multiple j values attaining it, where j = 1 to M.
Note that Nout(j) is determined based on SparsePos(j) and the corresponding mapping relation, where SparsePos(j) is computed over the first j ordered sound sources with the largest (historically weighted) relative energy values among the ordered Top M sound sources.
The final routing output is the top Jmin sound sources with the historical weighted relative energy values ordered from large to small.
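The loop and the Jmin rule above can be made runnable as follows. The sparsity values and the sparsity-to-Nout mapping are passed in as stand-ins for the SparsePos computation and fig. 9 mapping defined earlier:

```python
def select_jmin(prefix_sparsities, nout_of):
    """prefix_sparsities[j-1] is SparsePos(j) of the top-j energy-ranked
    sources; nout_of maps a sparsity value to a selected azimuth number Nout.
    Returns Jmin, the smallest j attaining Nmax = max of Nout(j)."""
    nouts = [nout_of(sp) for sp in prefix_sparsities]
    return nouts.index(max(nouts)) + 1   # list.index returns the first maximum
```

Because `list.index` returns the first occurrence of the maximum, ties automatically resolve to the smallest j, matching the Jmin rule.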
The Jmin sound sources to be forwarded, obtained by the above method, take part in server forwarding: the related information, including the distance and azimuth of each sound source's spatial position and the sample information of its sound signal, is forwarded by the server to the corresponding client, such as client A. After receiving the Jmin sound signals and position information, client A performs stereo mixing processing, including HRTF-based generation (namely HRIR convolution) and left-right channel mixing processing; other listeners likewise screen sound sources according to the above flow for routed mixing.
The HRTF-based stereo generation is to convolve the original mono input signal u (n) with the HRIR data h (n) of the corresponding azimuth, which is output as a binaural stereo signal y (n). The formula is as follows:
y(n) = u(n) * h(n)    (5)

where * denotes linear convolution.
Note that, since h(n) is internally divided into the HRIR data of the left channel and the right channel, the generated y(n) also corresponds to left channel and right channel signal results, as shown in fig. 7:
The stereo mixing processing mixes the left channel signals and the right channel signals separately. The mixing method may include, but is not limited to, direct addition, averaging, clamping, normalization, adaptive mixing weighting, and an automatic alignment algorithm. Taking the averaging method as an example: the output left and right channel signals of all sound source signals are respectively summed and then averaged.
lout(n) = \frac{1}{J} \sum_{i=1}^{J} l_i(n)    (6)

rout(n) = \frac{1}{J} \sum_{i=1}^{J} r_i(n)    (7)

where lout and rout are the left and right channels of the stereo output, and l_i(n), r_i(n) are the left and right channel outputs of the i-th of the J mixed sound sources.
According to the embodiment provided by the application, a stereo routing and mixing solution based on azimuth sparsity detection is provided, which effectively solves the computing overhead and bandwidth consumption problems of existing stereo mixing solutions; and because the number of sound sources in the stereo mix is greatly reduced under the routing strategy, problems such as clipping and noise are also effectively alleviated.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is referred to, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
According to another aspect of the embodiments of the present application, there is also provided a virtual audio interaction device for implementing the virtual audio interaction method. As shown in fig. 9, the apparatus includes:
a first obtaining unit 902, configured to obtain an audio sending request, where the audio sending request is used to request to send a target interaction audio to a target client that matches a first audio interaction object, where the first audio interaction object is an audio interaction object located in a virtual space, the first audio interaction object performs audio interaction with at least two second audio interaction objects in the virtual space, and the target interaction audio is interaction audio received by the first audio interaction object in the virtual space;
the second obtaining unit 904 is configured to respond to the audio sending request, and respectively obtain effective perception parameters of the first audio interaction object when different numbers of second audio interaction objects simultaneously sound, where the effective perception parameters are used to measure perception quality corresponding to the interaction audio received by the first audio interaction object;
and a sending unit 906, configured to determine a target number of second audio interaction objects from the at least two second audio interaction objects, and send, as target interaction audio, audio when the target number of second audio interaction objects simultaneously sound, to the target client, where the effective perception parameter corresponding to the first audio interaction object when the target number of second audio interaction objects simultaneously sound is highest.
Specific embodiments may refer to the examples shown in the above virtual audio interaction method, which are not described herein again.
As an alternative, the second obtaining unit 904 includes:
the first determining module is used for determining a circular space area by taking a first position of the first audio interactive object in the virtual space as a circle center and a target distance as a radius, wherein the target distance is larger than the distance between the first position and any second position, and the second position is the position of the second audio interactive object in the virtual space;
the first acquisition module is used for acquiring azimuth sparsity of the sound source when different numbers of second audio interactive objects are used as the sound source to simultaneously sound according to the circular space region, wherein the azimuth sparsity is used for representing the distribution degree of the sound source in azimuth, and the azimuth is the position of the sound source in the virtual space relative to the first audio interactive object;
the second determining module is used for determining a target range in which the azimuth sparsity is located and an effective perception parameter corresponding to the target range.
Specific embodiments may refer to examples shown in the above-mentioned virtual audio interaction method, and this example is not described herein.
As an alternative, the apparatus further includes:
The second acquisition module is used for acquiring the first sound source relative energy of at least two second audio interactive objects before determining a circular space area by taking a first position of the first audio interactive object in the virtual space as a circle center and a target distance as a radius, wherein the first sound source relative energy is used for indicating the energy of the interactive audio of the second audio interactive object relative to the first audio interactive object;
the ordering module is used for ordering at least two second audio interactive objects from large to small according to the relative energy of a first sound source before determining a circular space area by taking a first position of the first audio interactive object in the virtual space as a circle center and a target distance as a radius, so as to obtain a plurality of ordered second audio interactive objects;
the first acquisition module comprises:
The dividing submodule is used for dividing the circular space region into a first number of space subregions according to the orientation of the first audio interaction object, wherein the region angle of the positive direction of the first audio interaction object in the circular space region is smaller than the region angle of the negative direction of the first audio interaction object in the circular space region, and the first number is an integer greater than or equal to 2;
The first determining submodule is used for determining at least one second audio interaction object contained in an audio object set where different numbers of second audio interaction objects are located from a plurality of second audio interaction objects in sequence, wherein the second audio interaction objects in the at least one second audio interaction object are not repeated, and the second audio interaction objects in the different numbers of audio object sets are allowed to be repeated;
the first acquisition submodule is used for respectively acquiring second numbers which are occupied by the second audio interactive objects in the round space area and correspond to different space subareas;
and the second determining submodule is used for respectively determining the proportional relation between each second quantity and each first quantity and taking the proportional relation as the azimuth sparsity corresponding to each audio object set.
Specific embodiments may refer to examples shown in the above-mentioned virtual audio interaction method, and this example is not described herein.
As an alternative, the apparatus further includes:
and the third determining module is used for determining that the perception quality corresponding to the received interactive audio is highest when the first audio interactive object simultaneously sounds in the second audio interactive objects in the target set according to the effective perception parameters corresponding to the audio object sets, wherein the audio object sets comprise the target set, the number of the second audio interactive objects in at least one second audio interactive object contained in the target set is the target number, and the average energy corresponding to the second audio interactive object in at least one second audio interactive object contained in the target set is larger than or equal to the average energy corresponding to the second audio interactive object in at least one second audio interactive object contained in the candidate set in each audio object set, and the number of the second audio interactive objects in at least one second audio interactive object contained in the candidate set is the target number.
For specific implementations, reference may be made to the examples shown in the above virtual audio interaction method; details are not repeated here.
As an alternative, the second determining module includes:
the third determining submodule is used for determining effective perception parameters corresponding to a target range according to a target mapping relation under the condition that the target range is determined, wherein the target range is used for indicating the interval range where the azimuth sparsity is located, and the target mapping relation is used for indicating:
under the condition that the target range is between 0 and a first azimuth sparse value, the effective perception parameter is a first effective perception value;
under the condition that the target range is between a first azimuth sparse value and a second azimuth sparse value, the effective perception parameter is between the first effective perception value and the second effective perception value, and the effective perception parameter is increased along with the increase of the azimuth sparsity, wherein the second azimuth sparse value is larger than the first azimuth sparse value, and the second effective perception value is larger than the first effective perception value;
under the condition that the target range is between a second azimuth sparse value and a third azimuth sparse value, the effective perception parameter is a third effective perception value, wherein the third azimuth sparse value is larger than the second azimuth sparse value, and the third effective perception value is larger than the second effective perception value;
Under the condition that the target range is between the third azimuth sparse value and a fourth azimuth sparse value, the effective perception parameter is between a fourth effective perception value and the third effective perception value, and the effective perception parameter is reduced along with the increase of the azimuth sparsity, wherein the fourth azimuth sparse value is larger than the third azimuth sparse value, and the fourth effective perception value is smaller than the third effective perception value;
and under the condition that the target range is larger than the fourth azimuth sparse value, the effective perception parameter is the fourth effective perception value.
For specific implementations, reference may be made to the examples shown in the above virtual audio interaction method; details are not repeated here.
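The target mapping relation of the third determining submodule is a piecewise function of azimuth sparsity. The sketch below uses placeholder breakpoints `s1..s4` and perception values `p1..p4`; the patent fixes only their ordering (s1 < s2 < s3 < s4, p1 < p2 < p3, p4 < p3), not their magnitudes, so these numbers are assumptions for illustration.

```python
def effective_perception(sparsity,
                         s1=0.25, s2=0.5, s3=0.75, s4=0.9,
                         p1=0.2, p2=0.6, p3=0.8, p4=0.5):
    """Map azimuth sparsity to an effective perception parameter.

    Breakpoints and values are illustrative placeholders. As literally
    described, the mapping may jump from p2 to p3 at s2 unless p2 == p3.
    """
    if sparsity <= s1:
        return p1                                        # flat floor
    if sparsity <= s2:                                   # rising: p1 -> p2
        return p1 + (p2 - p1) * (sparsity - s1) / (s2 - s1)
    if sparsity <= s3:
        return p3                                        # flat plateau
    if sparsity <= s4:                                   # falling: p3 -> p4
        return p3 - (p3 - p4) * (sparsity - s3) / (s4 - s3)
    return p4                                            # flat tail
```

The plateau between `s2` and `s3` captures the idea that perception quality peaks at a moderate spread of sound sources and degrades when sources are either tightly clustered or scattered across nearly every direction.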
As an alternative, the apparatus further includes:
the third acquisition module is used for acquiring N alternative audio interaction objects after acquiring the audio transmission request, wherein the alternative audio interaction objects are audio interaction objects positioned in the virtual space, and N is a positive integer greater than or equal to 2;
and the fourth determining module is used for screening, after the audio transmission request is acquired, M audio interaction objects with the highest second sound source relative energy from the N alternative audio interaction objects based on the second sound source relative energy of the N alternative audio interaction objects, and determining the M audio interaction objects as the at least two second audio interaction objects, wherein the second sound source relative energy is used for indicating the energy of the interaction audio of the alternative audio interaction objects relative to the first audio interaction object, and M is a positive integer greater than or equal to 2 and less than or equal to N.
For specific implementations, reference may be made to the examples shown in the above virtual audio interaction method; details are not repeated here.
As an alternative, the fourth determining module includes:
a second obtaining sub-module, configured to obtain a relative distance between each of the N alternative audio interaction objects and the first audio interaction object, and obtain an audio intensity between each of the N alternative audio interaction objects and the first audio interaction object;
a fourth determining sub-module, configured to determine a second sound source relative energy of each audio interaction object relative to the first audio interaction object according to the relative distance and the audio intensity, wherein the second sound source relative energy is proportional to the audio intensity and inversely proportional to the square of the relative distance;
a fifth determining submodule, configured to determine a target audio interaction object with the largest second sound source relative energy from the N alternative audio interaction objects, and determine a product of the second sound source relative energy of the target audio interaction object and a preset parameter as a target masking energy threshold, where the preset parameter is between 0 and 1;
and the sixth determining submodule is used for determining M audio interactive objects with the relative energy of the second sound source larger than the target masking energy threshold value from N alternative audio interactive objects.
For specific implementations, reference may be made to the examples shown in the above virtual audio interaction method; details are not repeated here.
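The second-obtaining through sixth-determining submodules amount to an inverse-square energy estimate followed by a masking threshold. A minimal sketch, assuming point sources with free-field inverse-square attenuation and an example preset parameter `alpha = 0.05` (the patent only constrains it to lie between 0 and 1):

```python
def second_source_relative_energy(intensity, distance, eps=1e-9):
    """Relative energy proportional to intensity / distance**2 (inverse-square law)."""
    return intensity / max(distance * distance, eps)

def screen_by_masking(candidates, alpha=0.05):
    """candidates: list of (object_id, audio_intensity, relative_distance).

    Computes each object's relative energy, sets the masking threshold to
    alpha times the maximum energy, and keeps the objects above it.
    alpha is an assumed example value of the preset parameter in (0, 1).
    """
    energies = {oid: second_source_relative_energy(i, d)
                for oid, i, d in candidates}
    threshold = alpha * max(energies.values())
    return [oid for oid, e in energies.items() if e > threshold]
```

Sources whose energy falls below the threshold are treated as masked by the loudest source and excluded from the at least two second audio interaction objects.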
As an alternative, the transmitting unit 906 includes:
a fourth acquisition module, configured to acquire target audio information of interaction audio of each of the second audio interaction objects in the target number, where the target audio information includes spatial distance information, spatial azimuth information, and audio source signals of the second audio interaction objects;
the forwarding module is used for sending the target audio information of each second audio interactive object to the target client according to the corresponding route selection channel, wherein the target client determines associated transfer data matched with each second audio interactive object according to the space distance information and the space azimuth information, convolutionally mixes audio source signals by utilizing the associated transfer data to obtain mixed target virtual audio, and the target virtual audio is used for being played to the first audio interactive object on the target client.
For specific implementations, reference may be made to the examples shown in the above virtual audio interaction method; details are not repeated here.
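The client-side convolution mixing that the forwarding module delegates to the target client can be sketched as below. `hrir_lookup` is a hypothetical lookup of per-ear impulse responses keyed by spatial distance and azimuth, standing in for the "associated transfer data"; the patent does not specify the transfer-data format.

```python
import numpy as np

def mix_virtual_audio(sources, hrir_lookup):
    """sources: list of (mono_signal, distance, azimuth_deg).
    hrir_lookup(distance, azimuth_deg) -> (left_ir, right_ir): hypothetical
    table of head-related impulse responses chosen from the spatial info.
    Convolves each audio source signal with its transfer data and sums the
    results into one (n_samples, 2) stereo target virtual audio buffer."""
    mixed = None
    for signal, dist, az in sources:
        left_ir, right_ir = hrir_lookup(dist, az)
        out = np.stack([np.convolve(signal, left_ir),
                        np.convolve(signal, right_ir)], axis=1)
        if mixed is None:
            mixed = out
        else:
            # Zero-pad to a common length before accumulating.
            n = max(len(mixed), len(out))
            mixed = np.pad(mixed, ((0, n - len(mixed)), (0, 0)))
            out = np.pad(out, ((0, n - len(out)), (0, 0)))
            mixed = mixed + out
    return mixed
```

In practice the impulse responses would be interpolated from an HRTF database; here the lookup is left abstract because only the convolution-and-sum structure is described in the text.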
According to yet another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the above virtual audio interaction method. The electronic device may be, but is not limited to, the client 102 or the server 112 shown in fig. 1; this embodiment is illustrated with the electronic device as the client 102. As shown in fig. 10, the electronic device includes a memory 1002 and a processor 1004, where the memory 1002 stores a computer program, and the processor 1004 is configured to execute the steps in any of the above method embodiments by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interactive audio to a target client matched with a first audio interaction object, the first audio interaction object is an audio interaction object positioned in a virtual space, the first audio interaction object performs audio interaction with at least two second audio interaction objects in the virtual space, and the target interactive audio is interactive audio received by the first audio interaction object in the virtual space;
S2, responding to the audio sending request, and respectively acquiring effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects simultaneously sound, wherein the effective perception parameters are used for measuring the perception quality corresponding to the interactive audio received by the first audio interactive object;
S3, determining a target number of second audio interactive objects from the at least two second audio interactive objects, and sending the audio of the target number of second audio interactive objects when they sound simultaneously to the target client as the target interactive audio, wherein the effective perception parameter corresponding to the first audio interactive object when the target number of second audio interactive objects sound simultaneously is highest.
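Steps S1-S3 reduce to choosing, among candidate counts of simultaneously sounding second audio interaction objects, the count whose effective perception parameter is highest. A minimal sketch, assuming the objects are already ranked by relative energy and the per-count parameters have been computed (e.g. via the sparsity mapping above):

```python
def choose_target_count(ranked_objects, perception_for_count):
    """ranked_objects: second audio interaction objects sorted by relative
    energy, descending. perception_for_count: dict mapping a candidate
    count k to the effective perception parameter when the top-k objects
    sound simultaneously. Returns (target number, objects to forward)."""
    best_k = max(perception_for_count, key=perception_for_count.get)
    return best_k, ranked_objects[:best_k]
```

Only the audio of the selected top-k objects is then forwarded to the target client as the target interactive audio, which is what lets the server drop the remaining streams without perceptible loss.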
Alternatively, it will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 10 is merely illustrative, and that fig. 10 is not intended to limit the configuration of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 10, or have a different configuration than shown in FIG. 10.
The memory 1002 may be configured to store software programs and modules, such as program instructions/modules corresponding to the virtual audio interaction method and apparatus in the embodiments of the present application, and the processor 1004 executes the software programs and modules stored in the memory 1002 to perform various functional applications and data processing, that is, to implement the virtual audio interaction method described above. The memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1002 may further include memory remotely located relative to the processor 1004, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1002 may be used, but is not limited to being used, for storing information such as effective perception parameters. As an example, as shown in fig. 10, the memory 1002 may include, but is not limited to, the first acquiring unit 902, the second acquiring unit 904, and the transmitting unit 906 in the above interactive apparatus for virtual audio. In addition, other module units in the above interactive apparatus for virtual audio may be further included, which is not described in detail in this example.
Optionally, the transmission device 1006 is configured to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 1006 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1006 is a Radio Frequency (RF) module for communicating with the internet wirelessly.
In addition, the electronic device further includes: a display 1008 for displaying information such as effective perception parameters; and a connection bus 1010 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the user device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. The nodes may form a peer-to-peer network, and any type of computing device, such as a server, a user device, etc., may become a node in the blockchain system by joining the peer-to-peer network.
According to one aspect of the present application, a computer program product is provided, comprising a computer program/instructions containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. When executed by a central processing unit, the computer program performs the various functions provided by the embodiments of the present application.
The foregoing embodiment numbers of the present application are merely for description and do not represent advantages or disadvantages of the embodiments.
It should be noted that the computer system of the electronic device is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
The computer system includes a central processing unit (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) or a program loaded from a storage section into a random access Memory (Random Access Memory, RAM). In the random access memory, various programs and data required for the system operation are also stored. The CPU, the ROM and the RAM are connected to each other by bus. An Input/Output interface (i.e., I/O interface) is also connected to the bus.
The following components are connected to the input/output interface: an input section including a keyboard, a mouse, etc.; an output section including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and the like, and a speaker, and the like; a storage section including a hard disk or the like; and a communication section including a network interface card such as a local area network card, a modem, and the like. The communication section performs communication processing via a network such as the internet. The drive is also connected to the input/output interface as needed. Removable media such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, and the like are mounted on the drive as needed so that a computer program read therefrom is mounted into the storage section as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. The computer program, when executed by a central processing unit, performs the various functions defined in the system of the present application.
According to one aspect of the present application, there is provided a computer-readable storage medium, from which a processor of a computer device reads the computer instructions, the processor executing the computer instructions, causing the computer device to perform the methods provided in the various alternative implementations described above.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interactive audio to a target client matched with a first audio interaction object, the first audio interaction object is an audio interaction object positioned in a virtual space, the first audio interaction object performs audio interaction with at least two second audio interaction objects in the virtual space, and the target interactive audio is interactive audio received by the first audio interaction object in the virtual space;
S2, responding to the audio sending request, and respectively acquiring effective perception parameters of the first audio interactive object when different numbers of second audio interactive objects simultaneously sound, wherein the effective perception parameters are used for measuring the perception quality corresponding to the interactive audio received by the first audio interactive object;
S3, determining a target number of second audio interactive objects from the at least two second audio interactive objects, and sending the audio of the target number of second audio interactive objects when they sound simultaneously to the target client as the target interactive audio, wherein the effective perception parameter corresponding to the first audio interactive object when the target number of second audio interactive objects sound simultaneously is highest.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program instructing relevant hardware of an electronic device, and the program may be stored in a computer-readable storage medium, where the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for description and do not represent advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed user equipment may be implemented in other manners. The above-described apparatus embodiments are merely exemplary; the division of the units is merely a logical function division, and another division manner may be used in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (10)

1. A method of virtual audio interaction, comprising:
acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interactive audio to a target client matched with a first audio interactive object, the first audio interactive object is an audio interactive object positioned in a virtual space, the first audio interactive object performs audio interaction with at least two second audio interactive objects in the virtual space, and the target interactive audio is interactive audio received by the first audio interactive object in the virtual space;
Responding to the audio sending request, and acquiring azimuth sparsity of sound sources when different numbers of second audio interaction objects are used as sound sources to simultaneously sound according to a circular space area, wherein the azimuth sparsity is used for representing the distribution degree of the sound sources in azimuth, the azimuth is the position of the sound sources in the virtual space relative to the first audio interaction object, the center of the circular space area is a first position of the first audio interaction object in the virtual space, the radius is a target distance, the target distance is larger than the distance between the first position and any second position, and the second position is the position of the second audio interaction object in the virtual space;
determining a target range in which the azimuth sparsity is located and effective perception parameters, corresponding to the target range, of the first audio interactive object when the second audio interactive objects with different numbers simultaneously sound, wherein the effective perception parameters are used for measuring perception quality, corresponding to interactive audio received by the first audio interactive object;
determining a target number of second audio interactive objects from the at least two second audio interactive objects, and sending the audio of the target number of second audio interactive objects when the second audio interactive objects sound simultaneously to the target client as the target interactive audio, wherein the effective perception parameter corresponding to the first audio interactive objects when the second audio interactive objects sound simultaneously is the highest.
2. The method according to claim 1, wherein,
before the azimuth sparsity of the sound sources is obtained when the second audio interactive objects with different numbers sound simultaneously as sound sources according to the circular space region, the method further comprises:
acquiring first sound source relative energies of the at least two second audio interactive objects, wherein the first sound source relative energies are used for indicating the energy of the interactive audio of the second audio interactive objects relative to the first audio interactive objects;
sequencing the at least two second audio interactive objects from large to small according to the relative energy of the first sound source to obtain a plurality of ordered second audio interactive objects;
the obtaining, according to the circular space region, the azimuth sparsity of the sound sources when the second audio interactive objects with different numbers sound simultaneously as sound sources comprises:
dividing the circular space region into a first number of space subregions according to the orientation of the first audio interaction object, wherein the region angle, in the circular space region, of the space subregion corresponding to the positive direction of the first audio interaction object is smaller than the region angle of the space subregion corresponding to the negative direction of the first audio interaction object, and the first number is an integer greater than or equal to 2;
Sequentially determining at least one second audio interaction object contained in an audio object set where the second audio interaction objects with different numbers are located from the plurality of second audio interaction objects, wherein the second audio interaction objects in the at least one second audio interaction object are not repeated, and the second audio interaction objects in the audio object set with different numbers are allowed to be repeated;
respectively acquiring a second number of different space subareas occupied by each second audio interactive object in the at least one second audio interactive object contained in each audio object set on the circular space area;
and respectively determining the proportional relation between the second quantity and the first quantity, and taking the proportional relation as the azimuth sparsity corresponding to each audio object set.
3. The method of claim 2, wherein after the separately obtaining effective perception parameters of the first audio interaction object when a different number of second audio interaction objects sound simultaneously, the method further comprises:
according to the effective perception parameters corresponding to the audio object sets, determining that the perception quality corresponding to the received interactive audio is highest when the first audio interactive object simultaneously sounds in the second audio interactive objects in the target set, wherein the audio object set comprises the target set, the number of the second audio interactive objects in the at least one second audio interactive object contained in the target set is the target number, and the average energy corresponding to the second audio interactive object in the at least one second audio interactive object contained in the target set is greater than or equal to the average energy corresponding to the second audio interactive object in the at least one second audio interactive object contained in the candidate set in each audio object set, and the number of the second audio interactive objects in the at least one second audio interactive object contained in the candidate set is the target number.
4. The method according to claim 1, wherein determining the target range in which the azimuth sparsity is located and the effective perception parameter corresponding to the target range include:
under the condition that the target range is determined, determining the effective perception parameter corresponding to the target range according to a target mapping relation, wherein the target range is used for indicating a range of a section where the azimuth sparsity is located, and the target mapping relation is used for indicating:
the effective perception parameter is a first effective perception value under the condition that the target range is between 0 and a first azimuth sparse value;
when the target range is between the first azimuth sparse value and a second azimuth sparse value, the effective perception parameter is between the first effective perception value and a second effective perception value, and the effective perception parameter is increased along with the increase of the azimuth sparse degree, wherein the second azimuth sparse value is larger than the first azimuth sparse value, and the second effective perception value is larger than the first effective perception value;
the effective perception parameter is a third effective perception value when the target range is between the second azimuth sparse value and a third azimuth sparse value, wherein the third azimuth sparse value is larger than the second azimuth sparse value, and the third effective perception value is larger than the second effective perception value;
When the target range is between the third azimuth sparse value and a fourth azimuth sparse value, the effective perception parameter is between a fourth effective perception value and the third effective perception value, and the effective perception parameter is reduced along with the increase of the azimuth sparsity, wherein the fourth azimuth sparse value is larger than the third azimuth sparse value, and the fourth effective perception value is smaller than the third effective perception value;
and in the case that the target range is greater than the fourth azimuth sparse value, the effective perception parameter is the fourth effective perception value.
5. The method of claim 1, wherein after the obtaining the audio transmission request, the method further comprises:
acquiring N alternative audio interaction objects, wherein the alternative audio interaction objects are audio interaction objects positioned in the virtual space, and N is a positive integer greater than or equal to 2;
and screening, from the N alternative audio interaction objects, M audio interaction objects with the highest second sound source relative energy based on the second sound source relative energy of the N alternative audio interaction objects, and determining the M audio interaction objects as the at least two second audio interaction objects, wherein the second sound source relative energy is used for indicating the energy of the interaction audio of the alternative audio interaction objects relative to the first audio interaction object, and M is a positive integer greater than or equal to 2 and less than or equal to N.
6. The method according to claim 5, wherein the screening, from the N alternative audio interaction objects, the M audio interaction objects with the highest second sound source relative energy comprises:
acquiring the relative distance between each audio interaction object in the N alternative audio interaction objects and the first audio interaction object, and acquiring the audio intensity between each audio interaction object and the first audio interaction object;
determining the second sound source relative energy of each audio interaction object relative to the first audio interaction object according to the relative distance and the audio intensity, wherein the second sound source relative energy is in a proportional relation with the audio intensity and in an inverse relation with the square of the relative distance;
determining a target audio interactive object with the maximum second sound source relative energy from the N alternative audio interactive objects, and determining the product of the second sound source relative energy of the target audio interactive object and a preset parameter as a target masking energy threshold value, wherein the preset parameter is between 0 and 1;
and determining the M audio interactive objects of which the second sound source relative energy is larger than the target masking energy threshold value from the N alternative audio interactive objects.
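The selection rule of claim 6 can be sketched in a few lines: energy is taken as intensity over squared distance, the loudest candidate sets a masking threshold via a preset parameter, and everything above the threshold is kept. The function name, tuple layout, and the value of `alpha` below are illustrative assumptions, not part of the claims.

```python
def select_audible_objects(candidates, alpha=0.5):
    """Select the M loudest candidates using a relative masking threshold.

    candidates: list of (object_id, audio_intensity, relative_distance) tuples.
    alpha: hypothetical preset parameter in (0, 1); the claims do not fix a value.
    Returns the ids whose second sound source relative energy exceeds
    alpha times the maximum relative energy (the target masking energy threshold).
    """
    # Claim 6: relative energy is proportional to intensity and
    # inversely proportional to the square of the relative distance.
    energies = {
        obj_id: intensity / (distance ** 2)
        for obj_id, intensity, distance in candidates
    }
    threshold = alpha * max(energies.values())  # target masking energy threshold
    return [obj_id for obj_id, e in energies.items() if e > threshold]
```

Because `alpha` is strictly less than 1, the loudest candidate always survives its own threshold, so at least one object is returned.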
7. The method according to any one of claims 1 to 6, wherein the sending, as the target interactive audio, the audio produced when the target number of second audio interaction objects sound simultaneously to the target client comprises:
acquiring target audio information of the interactive audio of each second audio interactive object in the target number, wherein the target audio information comprises spatial distance information, spatial azimuth information and audio source signals of the second audio interactive objects;
and sending the target audio information of each second audio interactive object to the target client through a corresponding routing channel, wherein the target client determines associated transfer data matched with each second audio interactive object according to the spatial distance information and the spatial azimuth information, and performs convolution mixing processing on the audio source signals by using the associated transfer data to obtain mixed target virtual audio, and the target virtual audio is used for being played to the first audio interactive object on the target client.
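The client-side convolution mixing of claim 7 amounts to filtering each source signal with direction-matched transfer data and summing the results per ear. The sketch below assumes a hypothetical `hrtf_lookup` callable standing in for the claim's "associated transfer data" (the patent does not specify how that data is stored or indexed):

```python
import numpy as np

def render_binaural_mix(sources, hrtf_lookup):
    """Convolve each source with azimuth/distance-matched impulse responses and mix.

    sources: list of dicts with keys 'signal' (1-D np.ndarray), 'azimuth',
             'distance' -- stand-ins for the claim's target audio information.
    hrtf_lookup: callable (azimuth, distance) -> (left_ir, right_ir);
             a hypothetical interface to the associated transfer data.
    Returns the mixed (left, right) target virtual audio.
    """
    mix_l = mix_r = None
    for src in sources:
        left_ir, right_ir = hrtf_lookup(src['azimuth'], src['distance'])
        l = np.convolve(src['signal'], left_ir)
        r = np.convolve(src['signal'], right_ir)
        if mix_l is None:
            mix_l, mix_r = l, r
        else:
            # Pad to a common length before summing the convolved channels.
            n = max(len(mix_l), len(l))
            mix_l = np.pad(mix_l, (0, n - len(mix_l))) + np.pad(l, (0, n - len(l)))
            mix_r = np.pad(mix_r, (0, n - len(mix_r))) + np.pad(r, (0, n - len(r)))
    return mix_l, mix_r
```

In practice the per-source impulse responses would come from an HRTF database indexed by azimuth, with distance handled by attenuation or near-field compensation; here a plain callable keeps the sketch self-contained.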
8. An interactive apparatus for virtual audio, comprising:
The first acquisition unit is used for acquiring an audio transmission request, wherein the audio transmission request is used for requesting to transmit target interactive audio to a target client matched with a first audio interactive object, the first audio interactive object is an audio interactive object positioned in a virtual space, the first audio interactive object performs audio interaction with at least two second audio interactive objects in the virtual space, and the target interactive audio is interactive audio received by the first audio interactive object in the virtual space;
the second obtaining unit is used for, in response to the audio sending request, obtaining the azimuth sparsity of the sound sources when different numbers of second audio interactive objects sound simultaneously as sound sources within a circular space area, wherein the azimuth sparsity is used for representing the degree of distribution of the sound sources in azimuth, the azimuth being the position of a sound source in the virtual space relative to the first audio interactive object; the center of the circular space area is a first position of the first audio interactive object in the virtual space, the radius is a target distance, and the target distance is larger than the distance between the first position and any second position, a second position being the position of a second audio interactive object in the virtual space; and for determining the target range in which the azimuth sparsity falls and the effective perception parameter of the first audio interactive object, corresponding to the target range, when the different numbers of second audio interactive objects sound simultaneously, wherein the effective perception parameter is used for measuring the perception quality of the interactive audio received by the first audio interactive object;
And the sending unit is used for determining a target number of second audio interactive objects from the at least two second audio interactive objects, and sending the audio produced when the target number of second audio interactive objects sound simultaneously to the target client as the target interactive audio, wherein the effective perception parameter corresponding to the first audio interactive object is highest when the target number of second audio interactive objects sound simultaneously.
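The claims describe azimuth sparsity only as "the degree of distribution of the sound sources in azimuth" around the listener; no formula is given in this excerpt. One plausible proxy, shown purely as an illustrative assumption, is the circular variance of the source bearings: 0 when every source sits at the same bearing, approaching 1 when the sources are evenly spread around the listener.

```python
import math

def azimuth_sparsity(listener_pos, source_positions):
    """Illustrative proxy for azimuth sparsity via circular variance.

    listener_pos: (x, y) first position (center of the circular space area).
    source_positions: list of (x, y) second positions of the sounding objects.
    Returns a value in [0, 1]; higher means the sources are more spread
    out in azimuth around the listener.
    """
    lx, ly = listener_pos
    # Bearing of each sound source relative to the first audio interactive object.
    angles = [math.atan2(sy - ly, sx - lx) for sx, sy in source_positions]
    n = len(angles)
    # Mean resultant length R of the unit bearing vectors.
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    r = math.hypot(c, s)
    return 1.0 - r  # circular variance: 0 = one bearing, 1 = evenly spread
```

A server following the claims would bucket this value into a target range and look up the effective perception parameter tabulated for that range and source count.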
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run by an electronic device, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202311321281.6A 2023-10-12 2023-10-12 Virtual audio interaction method and device, storage medium and electronic equipment Active CN117082435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311321281.6A CN117082435B (en) 2023-10-12 2023-10-12 Virtual audio interaction method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117082435A CN117082435A (en) 2023-11-17
CN117082435B true CN117082435B (en) 2024-02-09

Family

ID=88711957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311321281.6A Active CN117082435B (en) 2023-10-12 2023-10-12 Virtual audio interaction method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117082435B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104885151A (en) * 2012-12-21 2015-09-02 杜比实验室特许公司 Object clustering for rendering object-based audio content based on perceptual criteria
KR20190060628A (en) * 2017-11-24 2019-06-03 한국전자통신연구원 Method and apparatus of audio signal encoding using weighted error function based on psychoacoustics, and audio signal decoding using weighted error function based on psychoacoustics
CN113889125A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Audio generation method and device, computer equipment and storage medium
CN114067822A (en) * 2020-08-07 2022-02-18 腾讯科技(深圳)有限公司 Call audio processing method and device, computer equipment and storage medium
CN114500130A (en) * 2021-12-30 2022-05-13 北京字节跳动网络技术有限公司 Audio data pushing method, device and system, electronic equipment and storage medium
CN115442556A (en) * 2021-06-04 2022-12-06 苹果公司 Spatial audio controller
WO2023025294A1 (en) * 2021-08-27 2023-03-02 北京字跳网络技术有限公司 Signal processing method and apparatus for audio rendering, and electronic device
CN116390016A (en) * 2023-03-15 2023-07-04 网易(杭州)网络有限公司 Sound effect control method and device for virtual scene, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014184618A1 (en) * 2013-05-17 2014-11-20 Nokia Corporation Spatial object oriented audio apparatus
WO2023288034A1 (en) * 2021-07-15 2023-01-19 Roblox Corporation Spatialized audio chat in a virtual metaverse

Similar Documents

Publication Publication Date Title
Pulkki et al. Analyzing virtual sound source attributes using a binaural auditory model
US5742689A (en) Method and device for processing a multichannel signal for use with a headphone
CN113889125B (en) Audio generation method and device, computer equipment and storage medium
US8238563B2 (en) System, devices and methods for predicting the perceived spatial quality of sound processing and reproducing equipment
US20090238371A1 (en) System, devices and methods for predicting the perceived spatial quality of sound processing and reproducing equipment
Spagnol et al. Current use and future perspectives of spatial audio technologies in electronic travel aids
US11425524B2 (en) Method and device for processing audio signal
Simon Galvez et al. Loudspeaker arrays for transaural reproduction
George et al. Development and validation of an unintrusive model for predicting the sensation of envelopment arising from surround sound recordings
Villegas Locating virtual sound sources at arbitrary distances in real-time binaural reproduction
Silzle et al. IKA-SIM: A system to generate auditory virtual environments
Poirier-Quinot et al. On the improvement of accommodation to non-individual HRTFs via VR active learning and inclusion of a 3D room response
CN117082435B (en) Virtual audio interaction method and device, storage medium and electronic equipment
US10659903B2 (en) Apparatus and method for weighting stereo audio signals
US20230199421A1 (en) Audio processing method and apparatus, and storage medium
Zhang et al. Three-dimensional sound synthesis based on head-related transfer functions
Li et al. Towards Mobile 3D HRTF Measurement
CN111246345B (en) Method and device for real-time virtual reproduction of remote sound field
Cecchi et al. An efficient implementation of acoustic crosstalk cancellation for 3D audio rendering
Finnegan Compensating for distance compression in virtual audiovisual environments
Farag et al. Psychoacoustic investigations on sound-source occlusion
CN117998274A (en) Audio processing method, device and storage medium
Hacıhabiboğlu et al. Perceptual simplification for model-based binaural room auralisation
CN116939473A (en) Audio generation method and related device
De Sena Analysis, design and implementation of multichannel audio systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant