CN116437284A - Spatial audio synthesis method, electronic device and computer readable storage medium - Google Patents

Spatial audio synthesis method, electronic device and computer readable storage medium

Info

Publication number
CN116437284A
Authority
CN
China
Prior art keywords
audio
target object
information
spatial audio
electronic device
Legal status
Pending
Application number
CN202310691824.7A
Other languages
Chinese (zh)
Inventor
曾青林
魏彤
张海宏
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202310691824.7A
Publication of CN116437284A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field

Abstract

The embodiments of this application disclose a spatial audio synthesis method, an electronic device, and a computer-readable storage medium. The method includes: determining a first set of relative positions based on a first position where a target object is located in a target space and the absolute position of each microphone in the target space, the first set of relative positions including the relative position of each microphone with respect to the first position; determining first head movement information based on head movement information of the target object at the first position; and synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and a first audio set, where the first audio set includes environmental audio collected by each microphone while the target object is at the first position. The method described in this application can effectively improve the efficiency of synthesizing spatial audio.

Description

Spatial audio synthesis method, electronic device and computer readable storage medium
Technical Field
The embodiment of the application relates to the technical field of audio processing, in particular to a spatial audio synthesis method, electronic equipment and a computer readable storage medium.
Background
Spatial audio goes far beyond stereo and surround sound and can improve a user's sense of immersion and presence. At present, little spatial audio content is available; when existing non-spatial audio is played back, it can be spatially expanded through a spatial audio expansion technique to obtain spatial audio. Such spatial audio can recreate a spatial sound field and simulate the listening experience a user would have in that sound field. However, the efficiency of obtaining spatial audio through spatial audio expansion techniques is low.
Disclosure of Invention
This application provides a spatial audio synthesis method, an electronic device, and a computer readable storage medium, which can effectively improve the efficiency of synthesizing spatial audio.
In a first aspect, an embodiment of the present application provides a spatial audio synthesis method, including: determining a first set of relative positions based on a first position where a target object is located in a target space and the absolute position of each microphone in the target space, the first set of relative positions including the relative position of each microphone with respect to the first position; determining first head movement information based on head movement information of the target object at the first position; and synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and a first audio set; the first audio set includes environmental audio collected by each microphone while the target object is at the first position.
In the above embodiment, since the environmental audio collected by each microphone can reflect the spatial sound field at that microphone's position, the spatial audio of the first position, obtained from the environmental audio collected by the microphones, the first set of relative positions, and the first head movement information, can reproduce the spatial sound field perceived by the target object at the first position. Obtaining spatial audio from collected environmental audio in this way can effectively improve the efficiency of synthesizing spatial audio.
In one possible embodiment, the method further comprises: receiving a position setting request for a target object; the first location is determined in response to the location setting request.
Because the position setting request can be set by the user according to their own needs, the first position determined based on the position setting request can flexibly adapt to those needs, and the resulting spatial audio of the first position can provide the user with a more realistic and immersive sound field experience.
In one possible embodiment, the method further comprises: acquiring first sensor information of the target object, the first sensor information including walking information of the target object; and determining the first position based on the first sensor information.
Because the first sensor information includes walking information that is actually sensed, a more accurate first position can be determined based on the first sensor information, and the resulting spatial audio can more accurately recreate the spatial sound field at the first position.
In one possible embodiment, the method further comprises: receiving a head motion setting request for a target object; in response to the head movement setting request, head movement information of the target object in the first position is determined.
Because the head movement setting request can be set by the user according to their own needs, the head movement information at the first position determined based on the request can flexibly adapt to those needs, and the resulting spatial audio of the first position can provide the user with a more realistic and immersive sound field experience.
In one possible embodiment, the method further comprises: acquiring second sensor information of a target object; the second sensor information includes head motion information of the target object in the first position.
Because the second sensor information includes head movement information at the first position that is actually sensed, the head movement information at the first position can be obtained more accurately, and the resulting spatial audio can more accurately recreate the spatial sound field at the first position.
In one possible implementation, synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and the first audio set specifically includes: acquiring the head related transfer function corresponding to each microphone from a database based on the first head movement information and the first set of relative positions; obtaining mapped environmental audio of each microphone based on the head related transfer function of each microphone and the environmental audio collected by that microphone; and obtaining the spatial audio of the first position based on the mapped environmental audio of each microphone.
In one possible embodiment, the method further comprises at least one of: transmitting the spatial audio of the first position; playing the spatial audio of the first position; or saving the spatial audio of the first position.
By transmitting, playing, or saving the spatial audio of the first position, later playback or real-time playback of the spatial sound field of the first position can be achieved.
In one possible embodiment, the method further comprises: in response to the target object moving from the first position to a second position in the target space, determining a second set of relative positions based on the second position and the absolute positions of the microphones, the second set of relative positions including the relative position of each microphone with respect to the second position; determining second head movement information based on head movement information of the target object at the second position; and synthesizing spatial audio of the second position based on the second set of relative positions, the second head movement information, and a second audio set; the second audio set includes environmental audio collected by each microphone when the target object is at the second position.
By synthesizing spatial audio of the second position in response to the movement of the target object, and because that spatial audio can recreate the spatial sound field at the second position, the reproduced sound field can be updated as the object moves.
In a second aspect, embodiments of the present application provide an electronic device comprising a memory and one or more processors; the memory is coupled to the one or more processors and is configured to store a computer program comprising program instructions; the one or more processors invoke the program instructions to cause the electronic device to perform: determining a first set of relative positions based on a first position where a target object is located in a target space and the absolute position of each microphone in the target space, the first set of relative positions including the relative position of each microphone with respect to the first position; determining first head movement information based on head movement information of the target object at the first position; and synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and a first audio set; the first audio set includes environmental audio collected by each microphone while the target object is at the first position.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform: receiving a position setting request for a target object; the first location is determined in response to the location setting request.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform: acquiring first sensor information of the target object, the first sensor information including walking information of the target object; and determining the first position based on the first sensor information.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform: receiving a head motion setting request for a target object; in response to the head movement setting request, head movement information of the target object in the first position is determined.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform: acquiring second sensor information of a target object; the second sensor information includes head motion information of the target object in the first position.
In one possible implementation, the one or more processors, when invoking the program instructions, cause the electronic device to perform synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and the first audio set, which specifically includes: acquiring the head related transfer function corresponding to each microphone from a database based on the first head movement information and the first set of relative positions; obtaining mapped environmental audio of each microphone based on the head related transfer function of each microphone and the environmental audio collected by that microphone; and obtaining the spatial audio of the first position based on the mapped environmental audio of each microphone.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform at least one of: transmitting the spatial audio of the first position; playing the spatial audio of the first position; or saving the spatial audio of the first position.
In one possible implementation, the one or more processors further invoke the program instructions to cause the electronic device to perform: in response to the target object moving from the first position to a second position in the target space, determining a second set of relative positions based on the second position and the absolute positions of the microphones, the second set of relative positions including the relative position of each microphone with respect to the second position; determining second head movement information based on head movement information of the target object at the second position; and synthesizing spatial audio of the second position based on the second set of relative positions, the second head movement information, and a second audio set; the second audio set includes environmental audio collected by each microphone when the target object is at the second position.
In a third aspect, embodiments of the present application provide a chip system for application to an electronic device, the chip system comprising one or more processors configured to invoke program instructions to cause the electronic device to perform a method as described in the first aspect or any of the possible implementations of the first aspect.
In a fourth aspect, the present embodiments provide a computer program product comprising a computer program comprising program instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium comprising a computer program comprising program instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect or any one of the possible implementations of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a 6DOF provided by embodiments of the present application;
fig. 2 is a schematic hardware structure of an electronic device according to an embodiment of the present application;
fig. 3A is a schematic diagram of a home environment provided in an embodiment of the present application;
FIG. 3B is a schematic illustration of an office environment provided by embodiments of the present application;
fig. 4 is a schematic flow chart of a spatial audio synthesis method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of two-dimensional coordinates provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an interface of an application provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of an interface of another application provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of an interface of yet another application provided in an embodiment of the present application;
FIG. 9 is a schematic flow chart of another spatial audio synthesis method according to an embodiment of the present application;
fig. 10 is a schematic software structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the embodiments of the present application, "plural" means two or more.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
The term "User Interface (UI)" in the following embodiments of the present application is a media interface for interaction and information exchange between an application program or an operating system and a user, which enables conversion between an internal form of information and an acceptable form of the user. The user interface is a source code written in a specific computer language such as java, extensible markup language (extensible markup language, XML) and the like, and the interface source code is analyzed and rendered on the electronic equipment to finally be presented as content which can be identified by a user. A commonly used presentation form of the user interface is a graphical user interface (graphic user interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be a visual interface element of text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, etc., displayed in a display of the electronic device.
Concepts or terms referred to in the present application are explained in the following to facilitate understanding by those skilled in the art.
1. Spatial audio
For a sound source located in a certain space, the sound emitted by the sound source propagates through the space and is reflected at the listener's two ears before being heard. From the characteristics of the received sound, the listener can perceive the distance between the sound source and himself, the direction of the sound source, and so on, and thus judge the position of the sound source. In other words, the listener can perceive rich spatial information (i.e., a spatial sound field) through the sound heard. Sound containing such spatial information is spatial audio.
Currently, spatial audio can be obtained by spatially expanding non-spatial audio through a spatial audio expansion technique. For example, multiple speakers at different locations are used to play the same audio simultaneously to simulate sounds from different directions. Alternatively, sound emitted from a sound source is processed based on the position information of the sound source and the listener, thereby obtaining spatial audio. However, the efficiency of obtaining spatial audio based on spatial audio expansion techniques is low.
2. Factors influencing spatial audio
Spatial audio is mainly shaped by environmental effects and binaural effects.
The environmental effect means that, as sound propagates from the sound source to the listener, it is attenuated and distorted according to the propagation distance and is reflected when it encounters obstacles in the space, which changes the characteristics of the sound. The binaural effect means that sound emitted from the sound source follows different reflection paths when reflected by the ears, so the characteristics of the sound heard by different listeners differ.
3. Six degrees of freedom (six degrees of freedom, 6DOF)
The 6DOF may include a 3DOF of head movement and a 3DOF of walking movement.
Referring to fig. 1, fig. 1 is a schematic diagram of 6DOF according to an embodiment of the present application. Taking a person as an example, the center point of the torso is taken as point O, and a three-dimensional coordinate system is established with point O as the origin: the x-axis, the y-axis, and the z-axis. The walking 3DOF refers to translational movement along these three coordinate axes. The head-movement 3DOF refers to rotational movement around the three coordinate axes, which can be represented by the yaw angle Yaw, the pitch angle Pitch, and the roll angle Roll in the figure. The yaw angle Yaw represents the angle of rotation around the y-axis, the pitch angle Pitch represents the angle of rotation around the x-axis, and the roll angle Roll represents the angle of rotation around the z-axis.
The walking 3DOF changes the absolute position of the object, and the head-movement 3DOF changes the object's head movement information.
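To make the 6DOF description concrete, the following is a minimal sketch of how a 6DOF state could be represented in code; the class and field names are illustrative assumptions and are not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Pose6DOF:
    """Illustrative 6DOF state: 3DOF walking (translation) plus 3DOF head movement (rotation)."""
    x: float = 0.0      # translation along the x-axis
    y: float = 0.0      # translation along the y-axis
    z: float = 0.0      # translation along the z-axis
    yaw: float = 0.0    # rotation around the y-axis, in degrees
    pitch: float = 0.0  # rotation around the x-axis, in degrees
    roll: float = 0.0   # rotation around the z-axis, in degrees

# Example: a target object at the first position, head level and facing forward.
first_pose = Pose6DOF(x=2.0, y=0.0, z=1.5, yaw=0.0, pitch=0.0, roll=0.0)
```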
In order to improve the efficiency of synthesizing spatial audio, the present application provides a spatial audio synthesis method, an electronic device, and a computer readable storage medium. The electronic device may be a terminal device or a server with a data processing function. For example, the terminal device may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart watch, or the like; the server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), big data, and artificial intelligence platforms.
The hardware configuration of the electronic device is exemplified below, respectively.
Referring to fig. 2, fig. 2 is a schematic hardware structure of an electronic device according to the present application. By way of example, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a user identification module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, for example: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may hold instructions or data that the processor 110 has just used or used cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from the memory. Repeated accesses are avoided and the waiting time of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The digital signal processor is used to process digital signals, and can process other digital signals in addition to digital image signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, and so on.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: dynamic picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. For example, in the present application, the electronic device 100 may play spatial audio.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 can be used to listen to music or to make a hands-free call through the speaker 170A.
A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. When electronic device 100 is answering a telephone call or voice message, voice may be received by placing receiver 170B in close proximity to the human ear.
The microphone 170C, also referred to as a "mic" or "mike," is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak close to the microphone 170C, inputting a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The earphone interface 170D is used to connect a wired earphone.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like.
The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B.
The air pressure sensor 180C is used to measure air pressure.
The magnetic sensor 180D includes a hall sensor.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity may be detected. The sensor may also be used to recognize the posture of the electronic device 100.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser.
The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode.
The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level.
The fingerprint sensor 180H is used to collect a fingerprint.
The temperature sensor 180J is for detecting temperature. In some embodiments, the electronic device 100 performs a temperature processing strategy using the temperature detected by the temperature sensor 180J.
The touch sensor 180K, also referred to as a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, bone conduction sensor 180M may acquire a vibration signal of a human vocal tract vibrating bone pieces.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100.
The following describes an application scenario of the spatial audio synthesis method provided in the present application:
the spatial audio synthesis method provided by the present application can be applied to replaying spatial audio of various spatial environments, and the spatial environments may be a home environment, an office environment, a karaoke (K song) environment, and the like.
Taking a home environment as an example, as shown in fig. 3A, when a user is not in the home environment, spatial audio at any position in the home environment can be obtained by the method provided in this application, and the acoustic experience at that position can be obtained by listening to the spatial audio. The home environment includes a living room area and a bedroom area, and a number of smart devices located in the home environment are shown: the living room area includes a smart speaker, a smart TV, and a smart monitoring device, and the bedroom area includes a smart computer and a smartphone. For example, the user may assume that he or she is at the sofa in the living room area of the home environment, and that his or her head movement information is as follows: the head faces the smart TV, and the yaw angle, pitch angle, and roll angle of the head are all zero. The home environment audio collected by the microphones is then mapped to the sofa through the relative position of each microphone carried in the smart devices with respect to the sofa and the head movement information of the user, so as to obtain the spatial audio at the sofa. By listening to the spatial audio at the sofa, the user can get the sound experience at the sofa.
Taking an office environment as an example, as shown in fig. 3B, when a user is not in the office environment, spatial audio at any position in the office environment can be obtained by the method provided in this application, and the acoustic experience at that position can be obtained by listening to the spatial audio. A number of smart devices located in the office environment are shown: a smart printer, a smart projector, a smart speaker, a smart computer, and a smartphone. For example, the user may assume that he or she is at a desk in the office environment, and that his or her head movement information is as follows: the head faces the smart computer, and the yaw angle, pitch angle, and roll angle of the head are all zero. The office environment audio collected by the microphones is then mapped to the desk through the relative position of each microphone carried in the smart devices with respect to the desk and the head movement information of the user, so as to obtain the spatial audio at the desk. By listening to the spatial audio at the desk, the user can get the sound experience at the desk.
Taking a karaoke environment as an example, after the user records singing audio, different spatial audio can be added to the singing audio in a vocal-tuning app to simulate the effect of the user singing in different spaces. For example, the user may add the spatial audio of a KTV room, the spatial audio of a bathroom, the spatial audio of a concert hall, and the like, each of which can be obtained by the method proposed in this application. The spatial audio of a KTV room may be the spatial audio of a central position of the KTV environment (the central position may be preset or selected by the user): the KTV environment audio is collected by the microphones of the smart devices in a real KTV environment, and the audio collected by each microphone is then mapped to the central position based on the relative position of each microphone with respect to the central position and assumed head movement information (which may be preset or selected by the user), so as to obtain the spatial audio of the central position. By adding the spatial audio of the central position, the user's singing audio can carry the spatial information of that position.
Alternatively, when the spatial audio of each spatial environment is played back, a real-time playback mode or a post-playback mode may be employed.
Alternatively, the method can be used in combination with a video playing device, a sensing device, etc. to enhance the sense of realism of the user when playing back the spatial audio. For example, in a VR/AR scene, position information and head movement information of a user are obtained through a sensing device, and then environmental audio collected by each microphone in a space environment is mapped according to the position information and the head movement information to obtain the spatial audio of the user. Finally, by playing the spatial audio and synchronously playing the corresponding spatial video, the visual experience and the auditory experience of the user in the spatial environment can be simulated.
The spatial audio synthesis method provided by the embodiment of the application can also be applied to more scenes, and is not limited herein.
The spatial audio synthesis method provided in the embodiment of the present application is specifically described below by means of fig. 4:
referring to fig. 4, fig. 4 is a flow chart of a spatial audio synthesis method according to an embodiment of the present application, and the method includes steps 401 to 403. The method shown in fig. 4 may be performed by an electronic device or by a chip in an electronic device. The following description takes an electronic device as the body that performs the method as an example, and the electronic device may be the electronic device 100 described above. Wherein:
Step 401, determining a first set of relative positions based on a first position where a target object in a target space is located and an absolute position of each microphone in the target space.
Wherein the target space is a closed space environment. For example, the target space may be a home environment, an office environment, a K song environment, or the like as listed above.
The target object is located in a target space, which may be a virtual object (e.g. a virtual person) or a real object (e.g. a robot) for simulating a real user in the target space. For example, when the real user is not in the target space, a virtual object (virtual person) may be taken as the target object; alternatively, if a robot, a living body, or the like is present in the target space and is movable in the target space, the robot, the living body, or the like may be the target object when the real user is not in the target space.
The first position where the target object is located refers to the absolute position of the target object in the target space. Illustratively, taking the home environment in fig. 3A as an example of the target space, the first location where the target object is located may be a sofa of a living room area in the home environment. Taking the office environment in fig. 3B as an example of the target space, the first location where the target object is located may be a desk in the office environment.
Each microphone is a microphone carried by each intelligent device in the target space. By way of example, taking the target space as the home environment in fig. 3A as an example, various smart devices carrying microphones may include: intelligent computer, intelligent audio amplifier, intelligent control, smart mobile phone, smart television etc.. Taking the target space as an example of the office environment in fig. 3B, various smart devices carrying microphones may include: intelligent printer, intelligent projector, intelligent stereo set, intelligent computer and smart mobile phone.
Alternatively, these smart devices carrying microphones may be distributed at different locations in the target space and connected in the same communication network.
The absolute position of each microphone refers to the absolute position of each microphone in the target space when the smart device in which each microphone is located is kept stationary.
The first set of relative positions refers to the relative positions of the respective microphones with respect to the first position. Specifically, the absolute positions of the microphones may be determined first, and then the relative positions of the microphones with respect to the first position may be determined according to the first position and the absolute positions of the microphones.
Alternatively, the absolute position of each microphone, the first position, and the relative position of each microphone with respect to the first position mentioned above may be represented in the form of two-dimensional coordinates, or in the form of three-dimensional coordinates, which is not limited in this application.
For example, as shown in fig. 5, taking the two-dimensional coordinate form as an example, a two-dimensional coordinate system may be established with the head center of the target object as the origin, and the coordinates of microphones 1-5 in this coordinate system are then obtained from the absolute positions of microphones 1-5 and the first position: (x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5). These 5 coordinates constitute the first set of relative positions.
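As an illustration of how the first set of relative positions could be computed from the absolute microphone positions and the first position, here is a minimal sketch assuming two-dimensional coordinates as in fig. 5; the function and variable names are illustrative assumptions.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float]  # two-dimensional (x, y) coordinates in the target space

def relative_position_set(first_position: Coord,
                          mic_absolute_positions: Dict[str, Coord]) -> Dict[str, Coord]:
    """Return each microphone's position relative to the first position.

    The first position (e.g. the head center of the target object) becomes the
    origin of the relative coordinate system.
    """
    x0, y0 = first_position
    return {mic_id: (x - x0, y - y0)
            for mic_id, (x, y) in mic_absolute_positions.items()}

# Example: five microphones at fixed absolute positions, target object at (3.0, 2.0).
mics = {"mic1": (0.0, 0.0), "mic2": (5.0, 0.0), "mic3": (5.0, 4.0),
        "mic4": (0.0, 4.0), "mic5": (2.5, 2.0)}
first_set = relative_position_set((3.0, 2.0), mics)
```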
The following describes the manner in which the absolute positions of the respective microphones and the first position are determined, respectively:
1. determination of the absolute position of the individual microphones
Alternatively, the absolute position of each microphone itself in the target space may be used as the absolute position of that microphone; or the absolute position in the target space of the electronic device associated with each microphone may be taken as the absolute position of that microphone.
Wherein the absolute position of the respective microphone in the target space, or of the respective associated electronic device in the target space, can be determined and updated by means of an existing positioning algorithm. Illustratively, the positioning algorithm includes, but is not limited to, an indoor WIFI positioning algorithm, an ultrasonic positioning algorithm, a visual positioning algorithm, and the like.
2. Determination of the first position
In one possible implementation, an electronic device may receive a location setting request for a target object; the first location is determined in response to the location setting request.
For example, an Application (APP) for listening to spatial audio is installed in the electronic device, and a user may trigger the electronic device to receive a location setting request by operating the APP, and cause the electronic device to further determine the first location.
For example, the APP may support the user in customizing the position at which the spatial audio is heard. As shown in fig. 6, fig. 6 is a schematic interface of the application provided in the present application; the interface displays a schematic view of the target space (taking the home environment in fig. 3A as the target space as an example). The user can designate a position in the schematic view of the target space as the first position, thereby simulating a scene in which the target object (virtual object) is located at the first position, and triggering the electronic device to receive the position setting request. For example, the user may select point A in the target space as the customized first position of the target object through operation 601, and trigger the electronic device to receive a position setting request through operation 602 on the "determine" control, where the position setting request includes the position of point A.
Alternatively, the APP may support the user to select a location for listening to the spatial audio, as shown in fig. 7, and fig. 7 is a schematic view of another interface of the application provided in the present application, where the interface displays a target spatial schematic and a location point to be selected. The position points to be selected are located in the target space and comprise a point B located beside a living room sofa, a point C located on a bed and a point D located beside a desk. The user can select any position point from the position points to be selected so as to simulate the scene that the target object is positioned at the selected position point, and the electronic equipment is triggered to receive the position setting request. Illustratively, the user selects point B as the first location of the target object via operation 701 and triggers the electronic device to receive a location setting request including the location of point B via operation 702 for the "determine" control.
It will be appreciated that the above-described manner of triggering the electronic device to receive a location setting request via the APP is merely exemplary. In a specific implementation, the electronic device may be triggered to receive the location setting request in other ways, which is not limited herein.
In this way, the user can trigger the electronic device to receive the position setting request according to their own needs, so that the determined first position can flexibly adapt to those needs.
In another possible implementation, the electronic device may obtain first sensor information of the target object; the first sensor information includes walking information of the target object; and the first position is determined based on the first sensor information.
Alternatively, the first sensor information may be acquired in real time or acquired over a historical period.
For example, in a real-time playback scenario, the first sensor information may include real-time walking information of the user (e.g., walking information acquired in real time while the user wears an AR/VR device).
As another example, in an after-the-fact playback scenario, the first sensor information may include historical walking information of a robot or any living being at a particular moment within a historical period (e.g., during a meeting, the robot walks around the meeting site to record it, and spatial audio at a particular moment can be generated afterwards). The particular moment may be selected by the user or automatically. For example, during the meeting the robot walked around the meeting site for 3 minutes and was located at the center of the meeting site at the 1st minute, so the user can select the historical walking information corresponding to the 1st minute as the first sensor information. Alternatively, for the historical walking information covering the 3 minutes, an interval of 30 s may be used: the historical walking information corresponding to the 30th second is first taken as the first sensor information to generate the spatial audio of the position at the 30th second; then the historical walking information corresponding to the 1st minute is taken as the first sensor information to generate the spatial audio of the position at the 1st minute, and so on.
Alternatively, the first sensor information may include, but is not limited to: image information, pose information, communication information, and the like.
For example, when the target object walks, the electronic device may extract environmental features of the collected image information and match the environmental features with a stored environmental feature library, so as to obtain a first position where the target object is located.
For another example, the pose information includes acceleration in three directions acquired by the inertial navigation sensor, and the electronic device can deduce the first position of the target object according to the acceleration in three directions and the initial position of the target object. The initial position of the target object may be set by the user, or may be obtained according to sensor information of the initial position (e.g., the initial position of the target object is identified according to initial image information, etc.).
For another example, the communication information includes Wi-Fi information obtained by the electronic device through a wireless fidelity (wireless fidelity, wi-Fi) signal, and the electronic device can obtain the first position of the target object according to the Wi-Fi information obtained by the target object and the Wi-Fi fingerprint library.
In this way, since the walking information included in the first sensor information is actually sensed real-time/historical walking information, a more accurate first position can be determined based on the first sensor information.
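As an illustration of the acceleration-based example above, the following sketch estimates the first position by twice integrating the three-axis accelerations starting from a known initial position; the sampling interval, function name, and the simple integration scheme are assumptions for illustration, not the patent's prescribed algorithm.

```python
import numpy as np

def dead_reckon_position(initial_position: np.ndarray,
                         accelerations: np.ndarray,
                         dt: float) -> np.ndarray:
    """Estimate the current position from an initial position and sampled accelerations.

    initial_position: shape (3,), starting (x, y, z) of the target object
    accelerations:    shape (N, 3), acceleration samples along the x, y, z axes
    dt:               sampling interval in seconds
    """
    velocity = np.zeros(3)
    position = initial_position.astype(float)
    for a in accelerations:
        velocity += a * dt          # integrate acceleration into velocity
        position += velocity * dt   # integrate velocity into position
    return position

# Example: the object starts at the origin and accelerates briefly along the x-axis.
samples = np.array([[0.5, 0.0, 0.0]] * 20 + [[0.0, 0.0, 0.0]] * 80)
first_position = dead_reckon_position(np.zeros(3), samples, dt=0.01)
```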
Step 402, determining first head movement information based on the head movement information of the target object at the first position.
The head movement information of the target object includes, but is not limited to: the orientation of the target object, the rotation angles of the target object around the three axes x, y, and z (i.e., the yaw angle Yaw, the pitch angle Pitch, and the roll angle Roll), and so on.
In one possible implementation, an electronic device may receive a head motion setting request for a target object; in response to the head movement setting request, head movement information of the target object in the first position is determined.
For example, for the APP for listening to spatial audio, after the electronic device displays the interface shown in fig. 6 or fig. 7, it may display the interface shown in fig. 8. As shown in fig. 8, the interface displays input boxes 801-803 and a "determine" control 804. The user may first enter the pitch angle Pitch of the head (rotation angle around the x-axis) in input box 801, the yaw angle Yaw of the head (rotation angle around the y-axis) in input box 802, and the roll angle Roll of the head (rotation angle around the z-axis) in input box 803, and then click the "determine" control 804, triggering the electronic device to receive a head movement setting request containing the Pitch, Yaw, and Roll entered by the user.
Or, the electronic device may display a head image of the target object, and the user may simulate the head movement of the target object by sliding the screen in a preset direction, thereby triggering the electronic device to receive the head movement setting request. For example, the user may slide the head avatar displayed in the screen up and down to simulate the rotation angle of the head of the target object around the y-axis.
The triggering condition for receiving the head movement setting request is not limited in the present application.
In this way, the user can trigger the electronic device to receive the head movement setting request according to their own needs, so that the determined head movement information at the first position can flexibly adapt to those needs.
In another possible implementation, the electronic device may obtain second sensor information of the target object; the second sensor information includes head motion information of the target object in the first position.
Alternatively, the second sensor information may be acquired in real time or acquired over a historical period.
For example, in a real-time playback scenario, the second sensor information may include real-time head movement information of the user, such as head movement information acquired in real time while the user wears an AR/VR device.
As another example, in a post-hoc playback scenario, the second sensor information may include historical head movement information of the robot or any living being at a particular time within the historical period. The specific time may be determined according to the above description, which is not described herein.
Alternatively, the second sensor information may include, but is not limited to: pose information, image information, and the like. For example, the pose information includes the rotation angles around three axes acquired by an inertial navigation sensor. As another example, the image information includes image information acquired by a camera as the target object walks; the electronic device may extract environmental features from the acquired image information and match them against a stored environmental feature library, so as to obtain the head movement information of the target object at the first position.
In this way, since the head movement information included in the second sensor information is real-time/historic head movement information that is actually perceived, more accurate head movement information of the first position can be determined based on the second sensor information.
Step 403, synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and the first audio set.
Wherein the first audio set includes environmental audio collected by each microphone when the target object is in the first position.
Alternatively, collection of the first audio set may be triggered by the current synthesis of spatial audio for the first position. For example, after the electronic device receives the above-mentioned position setting request or head movement setting request, it may notify each microphone to start collecting environmental audio, thereby obtaining the first audio set.
Alternatively, collection of the first audio set may be triggered by movement of the microphones. For example, when each microphone moves from the position where it was previously at rest to its absolute position, environmental audio may be collected, resulting in the first audio set.
Alternatively, collection of the first audio set may be triggered by a change in the state of the target space. For example, a change in the state of the target space may be caused by weather changes, changes in the layout of the target space, changes in the external environment of the target space, and the like. When the state of the target space changes, the propagation of sound in the target space also changes; in this case, the microphones need to be triggered to collect the changed environmental audio, so as to obtain the first audio set.
The manner of triggering the microphones to collect the first audio set in the present application is not limited to the manners listed above.
In one possible implementation, the electronic device may obtain, from a database, the head-related transfer function corresponding to each microphone based on the first head movement information and the first set of relative positions; obtain the mapped environmental audio of each microphone based on the head-related transfer function of each microphone and the environmental audio collected by that microphone; and obtain the spatial audio of the first position based on the mapped environmental audio of all microphones.
The head-related transfer function corresponding to each microphone indicates how environmental effects and binaural effects change the spatial sound field as the sound picked up by that microphone travels from the microphone into the ears of the target object. The database stores the correspondence between the first head movement information, the first set of relative positions, and the head-related transfer functions of the respective microphones.
Optionally, the head-related transfer function corresponding to each microphone includes a left-ear head-related transfer function and a right-ear head-related transfer function, and the mapped environmental audio of each microphone includes left-ear mapped environmental audio and right-ear mapped environmental audio. The spatial audio of the first position includes spatial audio corresponding to the left ear and spatial audio corresponding to the right ear when the target object is in the first position.
For example, the spatial audio y_L corresponding to the left ear when the target object is in the first position may be obtained based on the following equation 1, and the spatial audio y_R corresponding to the right ear may be obtained based on the following equation 2 (illustrated here for two microphones):

y_L = x_1 * h_{L,1} + x_2 * h_{L,2}    (equation 1)

y_R = x_1 * h_{R,1} + x_2 * h_{R,2}    (equation 2)

where x_1 denotes the environmental audio collected by microphone 1, and x_2 denotes the environmental audio collected by microphone 2; h_{L,1} and h_{R,1} denote the left-ear and right-ear head-related transfer functions of microphone 1; h_{L,2} and h_{R,2} denote the left-ear and right-ear head-related transfer functions of microphone 2; * denotes convolution; x_1 * h_{L,1} denotes the left-ear mapped environmental audio of microphone 1; x_1 * h_{R,1} denotes the right-ear mapped environmental audio of microphone 1; x_2 * h_{L,2} denotes the left-ear mapped environmental audio of microphone 2; and x_2 * h_{R,2} denotes the right-ear mapped environmental audio of microphone 2.
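For illustration only, the following Python sketch applies equations 1 and 2 for an arbitrary number of microphones; the function name, the use of numpy convolution, and the toy signals are assumptions for this example, and in practice the head-related transfer functions would be retrieved from the database described above.

    import numpy as np

    def synthesize_binaural(ambient_audios, hrtfs_left, hrtfs_right):
        """Equations 1 and 2: for each microphone i, convolve its environmental audio
        x_i with its left/right head-related transfer functions h_{L,i} / h_{R,i}
        (giving the left/right-ear mapped environmental audio), then sum the mapped
        signals over all microphones to obtain the left and right channels."""
        length = max(len(x) + max(len(hl), len(hr)) - 1
                     for x, hl, hr in zip(ambient_audios, hrtfs_left, hrtfs_right))
        left, right = np.zeros(length), np.zeros(length)
        for x, h_l, h_r in zip(ambient_audios, hrtfs_left, hrtfs_right):
            y_l = np.convolve(x, h_l)      # left-ear mapped environmental audio
            y_r = np.convolve(x, h_r)      # right-ear mapped environmental audio
            left[:len(y_l)] += y_l         # equation 1: sum over microphones
            right[:len(y_r)] += y_r        # equation 2: sum over microphones
        return left, right

    # Toy example with two microphones and pass-through impulse responses
    x1, x2 = np.random.randn(480), np.random.randn(480)
    h = np.r_[1.0, np.zeros(31)]           # placeholder HRTF; real HRTFs come from the database
    left_ch, right_ch = synthesize_binaural([x1, x2], [h, h], [h, h])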
In one possible implementation, after obtaining the spatial audio of the first position, the electronic device may further perform one or more of the following: transmitting the spatial audio of the first position; playing the spatial audio of the first position; or saving the spatial audio of the first position. Based on this approach, both post-hoc playback and real-time playback of the spatial audio of the first position can be achieved.
In the above-described embodiments, the environmental audio collected by each microphone reflects the spatial sound field at that microphone's location, and the head-related transfer function corresponding to each microphone indicates how environmental effects and binaural effects affect the spatial sound field as the sound picked up by that microphone travels from the microphone into the ears of the target object. Therefore, the spatial audio of the first position, obtained based on the environmental audio collected by the microphones, the first set of relative positions, and the first head movement information, can reproduce the spatial sound field perceived by the target object at the first position. Obtaining spatial audio from collected environmental audio in this way can effectively improve the efficiency of synthesizing spatial audio. In addition, the collected first audio set can be reused, which simplifies the processing flow for synthesizing spatial audio and reduces processing complexity.
Through the embodiment shown in fig. 4, spatial audio of a first position may be obtained. A method for obtaining spatial audio of a plurality of positions is described below through the flow shown in fig. 9; the method includes steps 901 to 905:
Step 901, determining a set of relative positions based on the current position of the target object in the target space and the absolute positions of the microphones in the target space.
Step 902, determining head movement information based on the head movement information of the target object at the current position.
Step 903, synthesizing spatial audio of the current position of the target object based on the relative position set, the head motion information and the audio set.
Step 904, determining whether the current position of the target object has been updated.
If the electronic device determines that there is an update in the location where the target object is currently located, step 905 is performed.
Step 905, updating the current position of the target object.
After the electronic device performs step 905, the step 901 may be performed again.
For example, when the current position of the target object is the first position, the electronic device may determine the spatial audio of the target object at the first position by performing steps 901 to 904. Then, when the target object moves from the first position to the second position, the electronic device may update the current position of the target object to the second position by performing step 905. Next, the electronic device determines the spatial audio of the target object at the second position by performing steps 901 to 904 again. For the specific implementation of determining the spatial audio of the target object at the first position by performing steps 901 to 904, reference may be made to the embodiment shown in fig. 4, which is not repeated here.
The following describes a specific implementation in which the electronic device determines the spatial audio of the target object at the second position by performing steps 901 to 904 again:
In one possible implementation, in response to the target object moving from the first position to a second position in the target space, a second set of relative positions is determined based on the second position and the absolute positions of the respective microphones; second head movement information is determined based on the head movement information of the target object at the second position; and spatial audio of the second position is synthesized based on the second set of relative positions, the second head movement information, and the second audio set.
Wherein the manner of determining the second location may be determined with reference to the manner of determining the first location. The manner in which the second set of relative positions is determined may be determined with reference to the manner in which the first set of relative positions is determined. The second head movement information may be determined with reference to the manner of determination of the first head movement information.
The second audio set includes environmental audio collected by each microphone when the target object is in the second position. The second audio set is the same as or different from the first audio set.
For example, if, after the first audio set is collected, the absolute position of each microphone is unchanged and the state of the target space is unchanged, the first audio set and the second audio set are the same. Otherwise, if the absolute position of any microphone has changed and/or the state of the target space has changed after the first audio set is collected, the microphones can be controlled to collect environmental audio again, and when the spatial audio of the second position needs to be determined, the environmental audio newly collected by each microphone is used as the second audio set.
For a specific implementation of synthesizing the spatial audio of the second position based on the second set of relative positions, the second head movement information, and the second audio set, reference may be made to the implementation of step 403 described above.
Based on the manner described in fig. 9, if the current position of the target object keeps changing, spatial audio of a plurality of different positions can be obtained as the current position of the target object changes.
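For illustration only, the following Python sketch outlines the flow of steps 901 to 905; the helper callables, the stopping policy when no position update is detected, and the tuple-based positions are assumptions for this example rather than the disclosed implementation.

    def spatial_audio_loop(get_position, mic_positions, get_head_movement,
                           get_audio_set, synthesize):
        """Sketch of steps 901-905: synthesize spatial audio for the current position,
        check whether the position has been updated, and if so repeat from step 901."""
        position = get_position()
        results = []
        while True:
            # step 901: relative position of each microphone with respect to the current position
            relative_set = [tuple(m - p for m, p in zip(mic, position))
                            for mic in mic_positions]
            head_movement = get_head_movement(position)            # step 902
            audio_set = get_audio_set(position)                    # ambient audio for this position
            results.append(synthesize(relative_set, head_movement, audio_set))  # step 903
            new_position = get_position()                          # step 904: position updated?
            if new_position == position:
                return results                                     # no update: stop (one possible policy)
            position = new_position                                # step 905: update current position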
In one possible implementation, when the electronic device obtains spatial audio for a plurality of different positions, the plurality of spatial audios can be played in a preset order, thereby improving the user's immersion. The preset order may be the time order corresponding to the plurality of different positions. For example, if the user selects the first position and then selects the second position through position setting operations, the first position precedes the second position in time. Alternatively, if 3 minutes of sensor information of the target object is received, the first position is determined from the sensor information of the 1st minute, and the second position is determined from the sensor information of the 2nd minute, then the first position precedes the second position in time. In this case, the electronic device may play the spatial audio of the first position first and then play the spatial audio of the second position.
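For illustration only, the following Python sketch plays a set of spatial audios in the preset chronological order described above; the timestamped pairs and the play callback are assumptions for this example.

    def play_in_time_order(spatial_audios, play):
        """spatial_audios: list of (timestamp, audio) pairs, one per position.
        Plays the spatial audio of earlier-determined positions first."""
        for _, audio in sorted(spatial_audios, key=lambda item: item[0]):
            play(audio)

    # Example: first position determined at minute 1, second position at minute 2
    play_in_time_order([(2, "audio_of_second_position"), (1, "audio_of_first_position")],
                       play=print)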
The following describes the software structure of the electronic device of the present application:
Referring to fig. 10, fig. 10 is a schematic diagram of the software structure of an electronic device according to the present application. The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, a hardware abstraction layer (HAL), and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 10, the application package may include applications (also referred to as applications) such as cameras, gallery, calendar, phone calls, maps, navigation, WLAN, bluetooth, music, video, short messages, etc.
The application framework layer provides an application programming interface (API) and a programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in fig. 10, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. Such data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager enables applications to display notification information in the status bar and can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, to give message reminders, and the like. The notification manager may also present notifications in the top status bar of the system in the form of a chart or scroll-bar text, such as notifications of applications running in the background, or present notifications on the screen in the form of a dialog interface. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.
The hardware abstraction layer may include a plurality of functional modules. Such as a location processing module, a header information processing module, an audio processing module, etc.
The position processing module is used for determining a first set of relative positions based on a first position of the target object in the target space and the absolute position of each microphone in the target space, where the first set of relative positions includes the relative position of each microphone with respect to the first position.
The head information processing module is used for determining first head movement information based on the head movement information of the target object at the first position.
The audio processing module is used for synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and the first audio set; the first audio set includes environmental audio acquired by the respective microphones while the target object is in the first position.
The functions of the position processing module, the head information processing module, and the audio processing module may be implemented with reference to the methods described in the above method embodiments.
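For illustration only, the following Python sketch shows one way the three hardware-abstraction-layer modules could be organized; the class names, method names, and the HRTF lookup callable are assumptions for this example, not the actual module interfaces.

    import numpy as np

    class PositionProcessingModule:
        def first_relative_set(self, first_position, mic_positions):
            # relative position of each microphone with respect to the first position
            return [tuple(m - p for m, p in zip(mic, first_position))
                    for mic in mic_positions]

    class HeadInfoProcessingModule:
        def first_head_movement(self, measured_head_movement):
            # in the simplest case, pass through the head movement measured at the first position
            return measured_head_movement

    class AudioProcessingModule:
        def synthesize(self, relative_set, head_movement, audio_set, hrtf_lookup):
            # assumes equal-length signals and HRTFs so per-microphone results can be summed
            left = right = 0
            for rel, x in zip(relative_set, audio_set):
                h_l, h_r = hrtf_lookup(rel, head_movement)   # HRTF pair from the database
                left = left + np.convolve(x, h_l)
                right = right + np.convolve(x, h_r)
            return left, right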
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
It should be understood that each step in the above-described method embodiments provided in the present application may be implemented by an integrated logic circuit of hardware in a processor or an instruction in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor or in a combination of hardware and software modules in a processor.
The application also provides an electronic device, which may include: memory and a processor. Wherein the memory is operable to store a computer program comprising program instructions; the processor may be configured to invoke the program instructions in the memory to cause the electronic device to perform the method of any of the embodiments described above.
The present application also provides a chip system including at least one processor for implementing the functions involved in the method performed by the electronic device in any of the above embodiments.
In one possible design, the system on a chip also includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
Alternatively, the processor in the system-on-chip may be one or more. The processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general purpose processor, implemented by reading software code stored in a memory.
Alternatively, the memory in the system-on-chip may be one or more. The memory may be integrated with the processor or may be separate from the processor, and embodiments of the present application are not limited. For example, the memory may be a non-transitory processor, such as a ROM, which may be integrated on the same chip as the processor, or may be separately disposed on different chips, and the type of memory and the manner of disposing the memory and the processor in the embodiments of the present application are not specifically limited.
Illustratively, the system-on-chip may be a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller (MCU), a programmable logic device (PLD), or another integrated chip.
The present application also provides a computer-readable storage medium storing a computer program comprising program instructions. The program instructions, when executed, cause a computer to perform the method performed by the electronic device in any of the embodiments described above.
The embodiments of the present application may be arbitrarily combined to achieve different technical effects.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form of a computer program product in whole or in part. The computer program product includes one or more computer programs, which include program instructions. When the program instructions are loaded and executed on a computer, the procedures or functions described in the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., Solid State Disks (SSDs)), among others.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: ROM or random access memory RAM, magnetic or optical disk, etc.
In summary, the foregoing description covers only exemplary embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made according to the disclosure of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of spatial audio synthesis, the method comprising:
determining a first set of relative positions based on a first position in a target space where a target object is located and an absolute position of each microphone in the target space, the first set of relative positions including a relative position of each microphone with respect to the first position;
determining first head movement information based on head movement information of the target object at the first position;
synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and a first set of audio; the first set of audio includes environmental audio acquired by the respective microphones while the target object is in the first position.
2. The method according to claim 1, wherein the method further comprises:
receiving a position setting request for the target object;
the first location is determined in response to the location setting request.
3. The method according to claim 1, wherein the method further comprises:
acquiring first sensor information of the target object; the first sensor information includes ambulatory information of the target object;
the first location is determined based on the first sensor information.
4. The method according to claim 1, wherein the method further comprises:
receiving a head motion setting request for the target object;
in response to the head movement setting request, head movement information of the target object at the first position is determined.
5. The method according to claim 1, wherein the method further comprises:
Acquiring second sensor information of the target object; the second sensor information includes head motion information of the target object in the first position.
6. The method of any of claims 1-5, wherein the synthesizing spatial audio of the first position based on the first set of relative positions, the first head movement information, and the first set of audio comprises:
acquiring head-related transfer functions corresponding to the microphones from a database based on the first head movement information and the first set of relative positions;
obtaining mapped environmental audio of each microphone based on the head-related transfer function of each microphone and the environmental audio collected by each microphone;
and obtaining the spatial audio of the first position based on the mapped environmental audio of each microphone.
7. The method according to any one of claims 1-5, further comprising at least one of:
transmitting the spatial audio of the first position;
playing the spatial audio of the first position; or
saving the spatial audio of the first position.
8. The method according to any one of claims 1-5, further comprising:
Responsive to the target object moving from the first position to a second position of the target space, determining a second set of relative positions based on the second position and the absolute positions of the respective microphones, the second set of relative positions including the relative positions of the respective microphones with respect to the second position;
determining second head movement information based on the head movement information of the target object at the second position;
synthesizing spatial audio of the second position based on the second set of relative positions, the second head movement information, and a second set of audio; the second set of audio includes environmental audio collected by the respective microphones while the target object is in the second position.
9. An electronic device comprising a memory and one or more processors; the memory is coupled to the one or more processors for storing a computer program comprising program instructions; the one or more processors invoking the program instructions to cause the electronic device to perform the method of any of claims 1-8.
10. A computer readable storage medium comprising a computer program comprising program instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-8.
CN202310691824.7A 2023-06-13 2023-06-13 Spatial audio synthesis method, electronic device and computer readable storage medium Pending CN116437284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310691824.7A CN116437284A (en) 2023-06-13 2023-06-13 Spatial audio synthesis method, electronic device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310691824.7A CN116437284A (en) 2023-06-13 2023-06-13 Spatial audio synthesis method, electronic device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116437284A true CN116437284A (en) 2023-07-14

Family

ID=87080059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310691824.7A Pending CN116437284A (en) 2023-06-13 2023-06-13 Spatial audio synthesis method, electronic device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116437284A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019057530A1 (en) * 2017-09-20 2019-03-28 Nokia Technologies Oy An apparatus and associated methods for audio presented as spatial audio
CN109891503A (en) * 2016-10-25 2019-06-14 华为技术有限公司 Acoustics scene back method and device
CN109983786A (en) * 2016-11-25 2019-07-05 索尼公司 Transcriber, reproducting method, information processing unit, information processing method and program
CN114208209A (en) * 2019-07-30 2022-03-18 杜比实验室特许公司 Adaptive spatial audio playback
CN114520950A (en) * 2022-01-06 2022-05-20 维沃移动通信有限公司 Audio output method and device, electronic equipment and readable storage medium
CN114598984A (en) * 2022-01-11 2022-06-07 华为技术有限公司 Stereo synthesis method and system
CN115176486A (en) * 2020-02-26 2022-10-11 诺基亚技术有限公司 Audio rendering with spatial metadata interpolation
CN115955622A (en) * 2021-10-08 2023-04-11 诺基亚技术有限公司 6DOF rendering of audio captured by a microphone array for locations outside of the microphone array

Similar Documents

Publication Publication Date Title
CN110503959B (en) Voice recognition data distribution method and device, computer equipment and storage medium
CN109327608B (en) Song sharing method, terminal, server and system
CN108694073B (en) Control method, device and equipment of virtual scene and storage medium
WO2022143322A1 (en) Augmented reality interaction method and electronic device
CN111276122B (en) Audio generation method and device and storage medium
CN111142838A (en) Audio playing method and device, computer equipment and storage medium
CN110989961A (en) Sound processing method and device
CN113409427A (en) Animation playing method and device, electronic equipment and computer readable storage medium
CN116703693A (en) Image rendering method and electronic equipment
KR102226817B1 (en) Method for reproducing contents and an electronic device thereof
EP4203447A1 (en) Sound processing method and apparatus thereof
WO2021254113A1 (en) Control method for three-dimensional interface and terminal
WO2022115743A1 (en) Real world beacons indicating virtual locations
CN112770177B (en) Multimedia file generation method, multimedia file release method and device
CN114360546A (en) Electronic equipment and awakening method thereof
CN113409805A (en) Man-machine interaction method and device, storage medium and terminal equipment
CN116437284A (en) Spatial audio synthesis method, electronic device and computer readable storage medium
CN110708582B (en) Synchronous playing method, device, electronic equipment and medium
CN114329001B (en) Display method and device of dynamic picture, electronic equipment and storage medium
US20230027060A1 (en) Display system and method
CN111613252B (en) Audio recording method, device, system, equipment and storage medium
CN110989963A (en) Awakening word recommendation method and device and storage medium
CN117311510A (en) Control method and device based on augmented reality, electronic equipment and storage medium
KR20240049565A (en) Audio adjustments based on user electrical signals
CN112380380A (en) Method, device and equipment for displaying lyrics and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination