CN116684777A - Audio processing and model training method, device, equipment and storage medium


Info

Publication number: CN116684777A
Application number: CN202310454751.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘静林; 陈谦; 王雯
Applicant and current assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Legal status: Pending

Classifications

    • H04R 1/10 - Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R 1/1041 - Mechanical or electronic switches, or control elements
    • H04R 2201/10 - Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R 1/10 but not provided for in any of its subgroups


Abstract

The present disclosure relates to an audio processing and model training method, apparatus, device and storage medium. The method comprises obtaining mono audio corresponding to a first target object in motion and calculating the radial velocity of the first target object relative to a second target object. Spatial audio is then generated based on the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object. The generated spatial audio contains the motion information of the first target object, the difference between the phase of the audio signal in the spatial audio and the phase of the actual binaural audio is small, the accuracy of the phases calculated for the left and right ear channels in the spatial audio is improved, and the accuracy of the generated spatial audio is therefore greatly improved. When the user hears the spatial audio, the user can perceive the motion of the first target object from the phase difference between the left- and right-ear audio signals in the spatial audio.

Description

Audio processing and model training method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of information technology, and in particular relates to an audio processing and model training method, device, equipment and storage medium.
Background
Because humans hear with two ears, the specific position of a sound source can be determined from the phase difference between the sounds heard by the left and right ears. Providing users with immersive spatial audio is therefore important in speech technology. Spatial audio, also called binaural audio or stereo, comprises two audio signals, one provided to the left ear and the other to the right ear.
Spatial audio can currently be generated by a machine learning model, but the accuracy of the spatial audio generated by such a model is limited, so the user cannot perceive the motion of the sound source.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problems, the present disclosure provides an audio processing method, a model training method, an apparatus, a device, and a storage medium, so that when a user hears spatial audio, the user can accurately determine the position of a first target object, for example a sound source, from the phase difference between the left- and right-ear audio signals in the spatial audio, and can perceive the motion of the first target object, thereby providing an immersive experience for the user and greatly improving the user experience.
In a first aspect, an embodiment of the present disclosure provides an audio processing method, including:
acquiring mono audio corresponding to a first target object in motion;
calculating a radial velocity of the first target object relative to a second target object;
generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio and the radial velocity of the first target object relative to the second target object.
In a second aspect, an embodiment of the present disclosure provides an audio processing method, including:
calculating the radial velocity of a first target object relative to a second target object in a virtual space according to the position information of the first target object relative to the second target object in the virtual space;
generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio pre-configured for the first target object and the radial velocity of the first target object relative to the second target object;
and playing the spatial audio.
In a third aspect, an embodiment of the present disclosure provides an audio processing method, including:
calculating the radial velocity of a movable target relative to a target user according to the position information of the movable target relative to the target user, which is displayed in a virtual reality display device, wherein the target user wears the virtual reality display device;
generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio configured for the movable target in advance and the radial velocity of the movable target relative to the target user;
and playing the spatial audio through the virtual reality display device.
In a fourth aspect, an embodiment of the present disclosure provides an audio processing method, including:
collecting mono audio sent by a movable target;
calculating the radial velocity of the movable target relative to a target user according to the position information of the movable target relative to the target user, wherein the target user is provided with an augmented reality device;
generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio and the radial velocity of the movable target relative to the target user;
and playing the spatial audio through the augmented reality device.
In a fifth aspect, embodiments of the present disclosure provide an audio processing method, including:
acquiring mono audio sent by a first user in motion;
calculating the radial velocity of the first user relative to a second user according to the position information of the first user relative to the second user in a preset space;
generating spatial audio according to the position information of the first user, the pose information of the second user, the mono audio and the radial velocity of the first user relative to the second user;
and sending the spatial audio to audio playing equipment worn by the second user.
In a sixth aspect, an embodiment of the present disclosure provides an audio processing method, including:
acquiring mono audio sent by a first user in motion;
calculating the radial velocity of the first user relative to a target position in a preset space according to the position information of the first user relative to the target position in the preset space;
generating spatial audio according to the position information of the first user, the pose information of a second user, the mono audio and the radial velocity of the first user relative to the target position;
and sending the spatial audio to an audio playing device worn by the second user, wherein the second user and the first user are located in different spaces.
In a seventh aspect, embodiments of the present disclosure provide a model training method, the method comprising:
acquiring mono audio sent by a movable sound source;
acquiring binaural audio acquired by a first pickup unit and a second pickup unit of a target object;
inputting the position information of the movable sound source, the pose information of the target object, the mono audio and the radial velocity of the movable sound source relative to the target object into a machine learning model to be trained, so that the machine learning model outputs spatial audio;
training the machine learning model according to the spatial audio and the binaural audio, wherein the trained machine learning model is used for executing the audio processing method.
In an eighth aspect, an embodiment of the present disclosure provides an audio processing apparatus, including:
the acquisition module is used for acquiring mono audio corresponding to the first target object in motion;
a calculation module for calculating a radial velocity of the first target object relative to a second target object;
and the generating module is used for generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio and the radial velocity of the first target object relative to the second target object.
In a ninth aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method according to the first to seventh aspects.
In a tenth aspect, embodiments of the present disclosure provide a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the methods of the first to seventh aspects.
The audio processing and model training method, device, equipment and storage medium provided by the embodiments of the disclosure obtain mono audio corresponding to a first target object in motion and calculate the radial velocity of the first target object relative to a second target object. Spatial audio is then generated based on the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object. Because the generated spatial audio contains the motion information of the first target object, it matches the binaural audio the second target object would actually perceive when the mono audio corresponding to the first target object reaches the second target object; the difference between the phase of the audio signal in the spatial audio and the phase of the actual binaural audio is small, the accuracy of the phases calculated for the left and right ear channels in the spatial audio is improved, and the accuracy of the generated spatial audio is therefore greatly improved. When a user hears the spatial audio, the user can accurately judge the position of the first target object, such as a sound source, from the phase difference between the left- and right-ear audio signals, and can perceive the motion of the first target object, thereby providing an immersive experience and greatly improving the user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a model training method provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
FIG. 3 is a schematic illustration of radial velocities provided by an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a warp network according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a binaural gradient network according to another embodiment of the present disclosure;
FIG. 6 is a flow chart of an audio processing method according to another embodiment of the present disclosure;
FIG. 7 is a flow chart of an audio processing method according to another embodiment of the present disclosure;
FIG. 8 is a flow chart of an audio processing method according to another embodiment of the present disclosure;
FIG. 9 is a flowchart of an audio processing method according to another embodiment of the present disclosure;
fig. 10 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 11 is a flowchart of an audio processing method according to another embodiment of the present disclosure;
fig. 12 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 13 is a flowchart of an audio processing method according to another embodiment of the present disclosure;
fig. 14 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 15 is a flowchart of an audio processing method according to another embodiment of the present disclosure;
fig. 16 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
fig. 17 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure;
fig. 18 is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be noted that the location information (including the location information of the moving object, etc.), the pose information (such as the pose information of the user) and the mono audio (including but not limited to the sound made by a moving user, etc.) involved in the present application are all information and data authorized by the user or fully authorized by all parties; the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and a corresponding operation entrance is provided for the user to choose to authorize or refuse.
In addition, the audio processing method provided by the application can relate to the following terms for explanation, and the details are as follows:
WaveNet (Wave Net): a common waveform-synthesis neural network.
Stereophonic sound: also called spatial audio or binaural audio; the two channels have different waveforms due to the differences in sound-source position and distance between the left and right ears. For example, the sound emitted by a sound source is mono audio and, after propagation, reaches the two ears as binaural audio.
Warp network (Warp Net): a neural network for synthesizing binaural audio.
Binaural gradient network (Binaural Grad): a neural network for synthesizing binaural audio.
Because humans hear with two ears, the specific position of a sound source can be determined from the phase difference between the sounds heard by the left and right ears. Providing users with immersive spatial audio is therefore important in speech technology. Spatial audio, also called binaural audio or stereo, comprises two audio signals, one provided to the left ear and the other to the right ear. Spatial audio can currently be generated by a machine learning model, but the accuracy of the spatial audio generated by such a model is limited, so the user cannot perceive the motion of the sound source. In view of this problem, embodiments of the present disclosure provide a model training method, which is described below in connection with specific embodiments.
Fig. 1 is a flowchart of a model training method provided by an embodiment of the present disclosure. The method can be performed by a model training apparatus, which can be implemented in software and/or hardware and can be configured in an electronic device such as a server or a terminal, where the terminal may specifically be a mobile phone, a computer, a tablet computer, or the like. The server may specifically be a cloud server; in this case the model training method is executed in the cloud, where a plurality of computing nodes (cloud servers) may be deployed and each computing node has processing resources such as computation and storage. In the cloud, a service may be provided jointly by multiple computing nodes, although one computing node may also provide one or more services. The cloud may provide the service by exposing a service interface, and the user invokes the service interface to use the corresponding service. The service interface includes a software development kit (Software Development Kit, abbreviated as SDK), an application programming interface (Application Programming Interface, abbreviated as API), and the like. In addition, the model training method described in this embodiment may be applied to an application scenario as shown in fig. 2. As shown in fig. 2, the application scenario includes a terminal 21 and a server 22, where the server 22 may perform the model training method; for example, the server 22 may train a machine learning model to be trained according to the method, and when training is completed, the trained machine learning model may remain local to the server 22 or may be deployed to the terminal 21 or other servers, so that the server 22, the terminal 21, or the other servers can convert mono audio into accurate spatial audio using the trained machine learning model. The method is described in detail below with reference to fig. 2; as shown in fig. 1, the method specifically includes the following steps:
S101, acquiring mono audio sent by a movable sound source.
As shown in fig. 3, the sound source 31 is movable, and 32 represents the monaural audio emitted by the sound source 31. In this embodiment, a sound source that can move is referred to as a movable sound source. In addition, this embodiment does not limit the specific form of the sound source 31; for example, the sound source 31 may be a user who emits sound, such as a user who is speaking or clapping. Alternatively, the sound source 31 may be a speaker, a sound box, a robot, a car model, or the like, where the robot may be a humanoid robot, a sweeping robot, etc. It will be appreciated that the sound source 31 is not limited to the forms described above and may be any other object that can move and generate sound or produce an audio signal.
Specifically, the sound source 31 may be equipped with an audio collecting device, which may collect, in real time or periodically, the monaural audio emitted by the sound source 31, and send the monaural audio emitted by the sound source 31 to the server 22. Alternatively, the sound source 31 may transmit the monaural audio it produces to the server 22.
S102, acquiring binaural audio acquired by a first pickup unit and a second pickup unit of a target object.
As shown in fig. 3, there is also a target object 33 around the sound emission source 31, and the target object 33 may be a listener, a viewer, a sound pickup device, a human body model, or the like. Specifically, the target object 33 includes a first sound pickup unit 34 and a second sound pickup unit 35. Specifically, when the target object 33 is a listener, a spectator, that is, the target object 33 is a person, the first sound pickup unit 34 may be an audio pickup device worn on the right ear of the target object 33, and the second sound pickup unit 35 may be an audio pickup device worn on the left ear of the target object 33.
When the target object 33 is a sound pickup apparatus, the first sound pickup unit 34 and the second sound pickup unit 35 are two sound pickup units in the sound pickup apparatus, and assuming that the center point of the sound pickup apparatus is the origin 36 shown in fig. 3, the distance of the first sound pickup unit 34 with respect to the origin 36 may be the distance of the right ear of the person with respect to the center of the head of the person, or the distance of the first sound pickup unit 34 with respect to the origin 36 may be determined according to the distance of the right ear of the person with respect to the center of the head of the person. Similarly, the distance of the second sound pickup unit 35 with respect to the origin 36 may be the distance of the left ear of the person with respect to the center of the head of the person, or the distance of the second sound pickup unit 35 with respect to the origin 36 may be determined based on the distance of the left ear of the person with respect to the center of the head of the person.
When the target object 33 is a human body model having a part or component simulating a human body organ, for example, the human body model has a head model, a right ear model, a left ear model, and the distance of the right ear model with respect to the center of the head model may be the distance of the right ear of the person with respect to the center of the head of the person, or the distance of the right ear model with respect to the center of the head model may be determined according to the distance of the right ear of the person with respect to the center of the head of the person. Similarly, the distance of the left ear model relative to the center of the head model may be the distance of the left ear of the person relative to the center of the head of the person, or the distance of the left ear model relative to the center of the head model may be determined based on the distance of the left ear of the person relative to the center of the head of the person. In addition, the right ear model and the left ear model of the human body model can be respectively provided with an audio acquisition device. For example, an audio pickup device worn on the right ear model is denoted as a first sound pickup unit 34, and an audio pickup device worn on the left ear model is denoted as a second sound pickup unit 35.
It is to be understood that the implementation form of the target object 33 is not limited to the above-described ones, but may be other objects capable of realizing the sound pickup function.
When the monaural audio emitted from the sound source 31 reaches the target object 33 through transmission of a transmission medium, for example, through propagation of air, the first sound pickup unit 34 and the second sound pickup unit 35 may respectively pick up one audio signal. The audio signals collected by the first sound pickup unit 34 and the second sound pickup unit 35, respectively, are recorded as binaural audio.
For example, the audio signal collected by the first sound pickup unit 34 is referred to as an audio signal a, and the audio signal collected by the second sound pickup unit 35 is referred to as an audio signal B. Since the position of the first sound pickup unit 34 with respect to the sound source 31 and the position of the second sound pickup unit 35 with respect to the sound source 31 are different, the direction of the first sound pickup unit 34 with respect to the sound source 31 and the direction of the second sound pickup unit 35 with respect to the sound source 31 are also different, and thus the audio signal a and the audio signal B are not identical, for example, the phase of the audio signal a and the phase of the audio signal B are different. Or the time when the audio signal a arrives at the first sound pickup unit 34 and the time when the audio signal B arrives at the second sound pickup unit 35 are different.
When the target object 33 is a listener or a viewer, if the audio signal a is received by the right ear and the audio signal B is received by the left ear, the brain of the target object 33 may determine the position and direction of the sound source 31 with respect to the target object 33 based on the phase difference between the audio signal a and the audio signal B or based on the time difference between the audio signal a and the audio signal B. It will be appreciated that the audio signal a and the audio signal B, respectively, may change in real time as the target object 33 is moving, as well as the phase difference and the time difference. At this time, the target object 33 can sense the motion of the sound source 31 based on the phase difference and the time difference that change in real time. For example, the target object 33 may not only sense that the sound source 31 is moving, but also sense how the sound source 31 is moving, e.g., the sound source 31 is revolving around the target object 33, or the sound source 31 is constantly approaching the target object 33, etc. In addition, in the present embodiment, the target object 33 may be stationary.
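To make the role of this inter-ear time difference concrete, the following sketch estimates the delay between two captured signals by cross-correlation. It is only an illustration using NumPy, not part of the disclosed method; the sampling rate and signal layout are assumptions.

```python
import numpy as np

def estimate_itd(audio_left, audio_right, sample_rate=48_000):
    """Estimate the inter-ear time difference by cross-correlation.
    A positive result means the left-ear signal lags, i.e. the sound
    reached the right-ear pickup unit first."""
    corr = np.correlate(audio_left, audio_right, mode="full")
    lag = np.argmax(corr) - (len(audio_right) - 1)  # lag in samples
    return lag / sample_rate                        # delay in seconds

# Toy usage: a broadband burst that reaches the left ear 0.5 ms later.
fs = 48_000
burst = np.random.default_rng(0).standard_normal(2048)
delay = int(0.0005 * fs)
right = np.concatenate([burst, np.zeros(delay)])
left = np.concatenate([np.zeros(delay), burst])
print(f"estimated ITD: {estimate_itd(left, right, fs) * 1000:.2f} ms")  # about 0.50 ms
```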
S103, inputting the position information of the movable sound source, the pose information of the target object, the mono audio and the radial velocity of the movable sound source relative to the target object into a machine learning model to be trained, so that the machine learning model outputs spatial audio.
For example, fig. 3 shows a top view of a Cartesian coordinate system established with the center of the target object 33 as the origin of coordinates and spanned by the x-axis and the y-axis, i.e., the height dimension of the Cartesian coordinate system is omitted. Specifically, the sound source 31 moves at a velocity $v_{xy}$ in the plane formed by the x-axis and the y-axis. From $v_{xy}$, the radial velocity of the sound source 31 relative to the target object 33 can be obtained; for example, the radial velocity of the sound source 31 relative to the target object 33 includes the radial velocity of the sound source 31 relative to the first sound pickup unit 34 and the radial velocity of the sound source 31 relative to the second sound pickup unit 35. For example, if the direction of the line from the sound source 31 to the first sound pickup unit 34 is $r$, the velocity $v_{xy}$ can be decomposed into a component $v_r$ along the $r$ direction and a component perpendicular to the $r$ direction; the component $v_r$ is denoted as the radial velocity of the sound source 31 relative to the first sound pickup unit 34. Similarly, the radial velocity of the sound source 31 relative to the second sound pickup unit 35 can be calculated.
In addition, in the coordinate system established by the x-axis and the y-axis shown in fig. 3, the position information of the sound source 31 and the pose information of the target object 33 may also be determined. For example, the position information of the sound source 31 may be the coordinates of the sound source 31 in this coordinate system, and the pose information of the target object 33 may be the head direction of the target object 33 in this coordinate system.
Further, the server 22 may input the mono audio emitted by the sound source 31, the position information of the sound source 31, the pose information of the target object 33, and the radial velocity of the sound source 31 relative to the target object 33 into the machine learning model to be trained, so that the machine learning model outputs spatial audio. Specifically, given the mono audio from the sound source 31, the position information of the sound source 31, the pose information of the target object 33, and the radial velocity of the sound source 31 relative to the target object 33, the spatial audio output by the machine learning model is the binaural audio that the model to be trained predicts will be heard when the mono audio propagates to the target object 33.
And S104, training the machine learning model according to the spatial audio and the binaural audio, wherein the trained machine learning model is used for executing an audio processing method.
The binaural audio collected by the first sound pickup unit 34 and the second sound pickup unit 35 is the actual binaural audio heard when the mono audio propagates to the target object 33, and there is a difference between this actual binaural audio and the binaural audio predicted by the machine learning model to be trained. The machine learning model is therefore trained according to the spatial audio output by the machine learning model to be trained and the actual binaural audio.
For example, parameters in the machine learning model are adjusted according to the similarity or difference between the spatial audio output by the machine learning model to be trained and the actual binaural audio, so that in subsequent iterative training the spatial audio output by the machine learning model gradually approaches the actual binaural audio. For example, the parameter adjustment may be guided by gradient information computed from the difference between the spatial audio output by the machine learning model and the actual binaural audio. A gradient is a vector along which the directional derivative of a function at a point is maximal, i.e., the function changes fastest along the direction of the gradient at that point. Based on this principle, the direction in which the parameters are adjusted can be guided so that the spatial audio output by the machine learning model approaches the actual binaural audio, thereby obtaining a trained machine learning model. The trained machine learning model may be used to perform the audio processing method described below.
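As an illustration of this gradient-guided adjustment only, the sketch below shows one training step for such a binaural synthesis model; the model class, the optimizer choice and the simple L2 waveform loss are assumptions made for the example (PyTorch is used purely for convenience) and are not prescribed by this disclosure.

```python
import torch

def train_step(model, optimizer, mono, condition, target_binaural):
    """One gradient step: predict spatial audio from mono audio plus the
    conditioning (position, pose, radial velocities), then pull the prediction
    toward the binaural audio actually collected at the two pickup units."""
    optimizer.zero_grad()
    predicted = model(mono, condition)                      # (batch, 2, T)
    loss = torch.mean((predicted - target_binaural) ** 2)   # L2 waveform loss
    loss.backward()                                         # gradients w.r.t. parameters
    optimizer.step()                                        # adjust parameters along the gradient
    return loss.item()

# Assumed usage: `model` is e.g. a WarpNet- or BinauralGrad-style network and
# `loader` yields time-aligned (mono, condition, binaural) triples.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for mono, condition, binaural in loader:
#     train_step(model, optimizer, mono, condition, binaural)
```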
It will be appreciated that, since the sound source 31 is moving, the mono audio emitted by the sound source 31, the position information of the sound source 31, and the radial velocity of the sound source 31 relative to the target object 33 vary over time. Accordingly, the mono audio, the position information of the sound source 31, the radial velocity of the sound source 31 relative to the target object 33, the binaural audio collected by the first sound pickup unit 34 and the second sound pickup unit 35, and the spatial audio output by the machine learning model to be trained can be aligned in time. For example, the position information of the sound source 31 indicates the position of the sound source 31 at the moment it emits the mono audio, and the radial velocity of the sound source 31 relative to the target object 33 is the radial velocity at that same moment. The binaural audio collected by the first sound pickup unit 34 and the second sound pickup unit 35 is the audio they collect when that mono audio propagates to the target object 33, and the spatial audio output by the machine learning model to be trained is the spatial audio the model predicts from that mono audio. During training, the machine learning model is trained according to the spatial audio and the actual binaural audio corresponding to the same mono audio.
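As a purely illustrative sketch of this time alignment, a position track sampled at a low tracker rate can be interpolated onto the audio frame grid before the radial velocities are computed; the rates and array names below are assumptions.

```python
import numpy as np

def align_track_to_audio(track_times, track_positions, num_frames, frame_rate):
    """Interpolate a (T_track, 3) position track of the movable sound source
    onto the audio frame grid, so that every audio frame has an associated
    position (and hence a radial velocity)."""
    frame_times = np.arange(num_frames) / frame_rate
    aligned = np.stack(
        [np.interp(frame_times, track_times, track_positions[:, d]) for d in range(3)],
        axis=1,
    )
    return frame_times, aligned  # shapes: (num_frames,), (num_frames, 3)

# Assumed usage: a 120 Hz tracker aligned to audio split into 10 ms frames (100 frames/s).
# frame_times, positions = align_track_to_audio(t_tracker, p_tracker, num_frames=500, frame_rate=100)
```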
According to the embodiment of the disclosure, when the mono audio sent by the movable sound source reaches the target object, the binaural audio is collected through the first pickup unit and the second pickup unit of the target object. In addition, the position information of the movable sound source, the gesture information of the target object, the mono audio and the radial speed of the movable sound source relative to the target object are input into a machine learning model to be trained, so that the machine learning model can learn the motion information of the movable sound source according to the radial speed of the movable sound source relative to the target object, and the spatial audio output by the machine learning model can contain the motion information of the movable sound source. Further, according to the spatial audio and the binaural audio, the machine learning model is trained, so that the motion information of the movable sound source contained in the spatial audio is gradually close to the actual motion information of the movable sound source, and the trained machine learning model can output accurate spatial audio, namely spatial audio infinitely close to the actual binaural audio. The user can feel the motion of the movable sound source when hearing the space audio, thereby providing an immersive experience for the user and greatly improving the user experience.
Specifically, this embodiment does not limit the structure of the machine learning model; for example, the machine learning model may be the warp network (Warp Net) or the binaural gradient network (Binaural Grad) described above.
As shown in fig. 4, the warp network (Warp Net) includes a neural time warping module (Neural Time Warping) and a time convolution network (Temporal Conv Net). The neural time warping module includes warping layers (Warp), warp activation functions (Warp activation), neural warping layers (Neural Warp), and geometric warping layers (Geometric Warp). The time convolution network includes N hyper-convolution layers (Hyper Conv layers). $C_0$ represents the position information of the sound source 31 and the pose information of the target object 33 described above. $v_{r\text{-}left}$ represents the radial velocity of the sound source 31 relative to the second sound pickup unit 35, and $v_{r\text{-}right}$ represents the radial velocity of the sound source 31 relative to the first sound pickup unit 34. $x_{1:T}$ represents the mono audio from the sound source 31, which comprises T frames of audio.
When training the warp network shown in fig. 4, $x_{1:T}$, $C_0$, $v_{r\text{-}left}$ and $v_{r\text{-}right}$ can be used as inputs of the warp network. $\rho_{1:T}$ represents the output of the warp activation function in the neural time warping module, or the output of another intermediate layer above the warp activation function. The output of the neural time warping module is a frequency-domain audio signal, and the time convolution network outputs audio signals in two time-domain channels, i.e., the spatial audio output by the warp network to be trained. Further, the warp network is trained based on the actual binaural audio collected by the first sound pickup unit 34 and the second sound pickup unit 35 and the spatial audio output by the warp network.
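For intuition only, the geometric warp layer mentioned above can be thought of as delaying (time-warping) the mono signal by the propagation time from the source to each ear. The following NumPy sketch implements that idea under the assumption of a constant speed of sound; it is not the network defined in this disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant

def geometric_warp(mono, src_positions, ear_position, sample_rate=48_000):
    """Delay each output sample by the source-to-ear propagation time.

    mono:          (T,) mono waveform
    src_positions: (T, 3) source position for every sample, already time-aligned
    ear_position:  (3,) position of one pickup unit (ear)
    Returns the warped (delayed) signal for that ear."""
    distances = np.linalg.norm(src_positions - ear_position, axis=1)  # metres
    delays = distances / SPEED_OF_SOUND * sample_rate                 # in samples
    # Output sample t reads the mono signal at the (fractional) time t - delay[t].
    read_positions = np.arange(len(mono)) - delays
    return np.interp(read_positions, np.arange(len(mono)), mono, left=0.0)

# Assumed usage: warp the same mono signal toward each pickup unit separately
# to obtain a crude two-channel signal.
# left = geometric_warp(mono, positions, np.array([0.0, 0.09, 0.0]))
# right = geometric_warp(mono, positions, np.array([0.0, -0.09, 0.0]))
```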
As shown in fig. 5, the binaural gradient network (Binaural Grad) includes M residual blocks (Residual Block) and a conditional network. The M residual blocks are connected in sequence, i.e., the output of one residual block serves as the input of the next residual block. Each residual block includes a fully connected layer (Fully Connected Layer, FC), a dilated convolution layer (Dilated Conv), a convolution layer (Conv), and the like. The conditional network comprises a plurality of convolution layers (Conv).
When training the binaural gradient network, $C_0$, $v_{r\text{-}left}$ and $v_{r\text{-}right}$ can be used as inputs to the binaural gradient network, so that the output of the binaural gradient network is the spatial audio predicted by the binaural gradient network. Further, the binaural gradient network is trained based on the actual binaural audio collected by the first sound pickup unit 34 and the second sound pickup unit 35 and the spatial audio output by the binaural gradient network.
It will be appreciated that the machine learning model as described above, after being trained, may remain in the server 22, or may be deployed to the terminal 21 or other server. So that the server 22, terminal 21 or other server can implement the audio processing method according to the trained machine learning model as described in the following embodiments.
Fig. 6 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the audio processing method may be performed by a cloud server, such as server 22. In this embodiment, the method specifically includes the following steps:
S601, acquiring mono audio corresponding to a first target object in motion.
Specifically, the first target object is a movable sounding source, and the audio acquisition device can acquire monophonic audio sent by the sounding source. Alternatively, in some embodiments, the system controlling the first target object may pre-configure the first target object with a mono audio and treat the mono audio as audio emitted by the first target object.
For example, when the first target object is a user speaking, the user may be wearing an audio collection device that collects mono audio from the user and sends the mono audio to the server 22.
When the first target object is a game piece displayed by a game application installed in an electronic device, the electronic device is considered to be the control system for the game piece, or the control system for the game piece may be a service platform established by a developer of the game application. The control system may configure the game piece with mono audio and treat the mono audio as audio emanating from the first target object. Further, the control system may send the mono audio to the server 22.
S602, calculating the radial velocity of the first target object relative to the second target object.
For example, the second target object is located around the first target object, and the second target object may be stationary. For example, the second target object may be a listener, a viewer, a sound pickup apparatus, a human model, or the like. The server 22 may calculate the radial velocity of the first target object relative to the second target object based on the position information of the first target object relative to the second target object.
S603, generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio and the radial velocity of the first target object relative to the second target object.
For example, the server 22 may generate spatial audio based on the position information of the first target object, the pose information of the second target object, the mono audio, the radial velocity of the first target object relative to the second target object.
Optionally, the position information of the first target object is position information of the first target object in a first coordinate system, where the first coordinate system is a coordinate system with the second target object as the coordinate origin; the pose information of the second target object is direction information of the second target object.
For example, in this embodiment a Cartesian coordinate system may be established with the center of the second target object as the origin of coordinates; this Cartesian coordinate system is denoted as the first coordinate system, and the position information of the first target object may be its position information in this Cartesian coordinate system. The pose information of the second target object is the direction information of the second target object in the Cartesian coordinate system, such as the head direction.
Optionally, generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object includes: inputting the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object into a pre-trained machine learning model, such that the machine learning model outputs the spatial audio.
For example, the server 22 stores the trained machine learning model described above. The server 22 may input the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object into the trained machine learning model, so that the machine learning model outputs spatial audio. Because the trained machine learning model is accurate, the spatial audio it outputs is accurate spatial audio, i.e., spatial audio that is very close to the actual binaural audio.
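Purely as an illustration of this step, the sketch below assembles the conditioning inputs and queries an already trained model; the tensor shapes, the 9-dimensional condition layout and the model interface are assumptions made for the example, not an interface defined by this disclosure.

```python
import numpy as np
import torch

def synthesize_spatial_audio(model, mono, src_position, listener_pose,
                             v_r_left, v_r_right):
    """Run a trained binaural synthesis model.

    mono:           (T,) mono waveform of the first target object
    src_position:   (T, 3) position of the first target object per frame
    listener_pose:  (T, 4) orientation quaternion of the second target object
    v_r_left/right: (T,) radial velocities toward the left/right pickup unit
    Returns a (2, T) array: left- and right-ear channels of the spatial audio."""
    condition = np.concatenate(
        [src_position, listener_pose, v_r_left[:, None], v_r_right[:, None]], axis=1
    )  # (T, 9) conditioning sequence
    with torch.no_grad():
        out = model(
            torch.from_numpy(mono).float()[None],        # (1, T)
            torch.from_numpy(condition).float()[None],   # (1, T, 9)
        )
    return out.squeeze(0).numpy()
```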
In this embodiment, mono audio corresponding to a first target object in motion is obtained, and the radial velocity of the first target object relative to a second target object is calculated. Spatial audio is then generated based on the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object. Because the generated spatial audio contains the motion information of the first target object, it matches the binaural audio the second target object would actually perceive when the mono audio corresponding to the first target object reaches the second target object; the difference between the phase of the audio signal in the spatial audio and the phase of the actual binaural audio is small, the accuracy of the phases calculated for the left and right ear channels is improved, and the accuracy of the generated spatial audio is therefore greatly improved. When a user hears the spatial audio, the user can accurately judge the position of the first target object, such as a sound source, from the phase difference between the left- and right-ear audio signals, and can perceive the motion of the first target object, thereby providing an immersive experience and greatly improving the user experience.
In some cases, the method described in this embodiment may further enable the user to achieve consistency in the sense of hearing and vision of the first target object. In addition, the method described in this embodiment does not require introducing additional hyper-parameters into the machine learning model, nor modifying the loss function. But rather can be well extended to different types of machine learning models, such as the warped network shown in fig. 4 and the binaural gradient network shown in fig. 5. Therefore, the method described in this embodiment can achieve the plug-and-play effect.
Optionally, the radial velocity of the first target object relative to the second target object includes the radial velocity of the first target object relative to a first pickup unit of the second target object and the radial velocity of the first target object relative to a second pickup unit of the second target object.
For example, the first pickup unit of the second target object may be the right ear of the second target object, or an audio collection device worn on the right ear, and the second pickup unit of the second target object may be the left ear of the second target object, or an audio collection device worn on the left ear. The radial velocity of the first target object relative to the second target object therefore includes the radial velocity of the first target object relative to the first pickup unit of the second target object and the radial velocity of the first target object relative to the second pickup unit of the second target object.
Optionally, calculating the radial velocity of the first target object relative to the second target object includes the following steps:
S701, calculating a position vector of the first target object with respect to the first pickup unit of the second target object in a first coordinate system using the second target object as a coordinate origin.
For example, after the Cartesian coordinate system, i.e., the first coordinate system, is established with the head center of the second target object as the origin of coordinates, the three-dimensional (3D) position of the first target object in this coordinate system is denoted $\vec{p}_{\mathrm{src}}$, and the three-dimensional position of the first pickup unit of the second target object, for example the right ear, is denoted $\vec{p}_{\mathrm{right}}$. The position vector of the first target object relative to the right ear is denoted $\vec{P}$ and is calculated by the following formula (1):

$$\vec{P} = \vec{p}_{\mathrm{src}} - \vec{p}_{\mathrm{right}} = (P_x,\ P_y,\ P_z) \qquad (1)$$

where $\vec{p}_{\mathrm{src}}$ and $\vec{p}_{\mathrm{right}}$ are three-dimensional coordinates, $P_x$ is the difference between the x-axis coordinates of $\vec{p}_{\mathrm{src}}$ and $\vec{p}_{\mathrm{right}}$, $P_y$ is the difference between their y-axis coordinates, and $P_z$ is the difference between their z-axis coordinates.
S702, calculating the moving speed of the first target object according to the position vector.
For example, the moving speed of the first target object, denoted $\vec{v}$, is calculated from the position vector $\vec{P}$ by the following formula (2):

$$\vec{v} = \left(\dot{P}_x,\ \dot{P}_y,\ \dot{P}_z\right) \qquad (2)$$

where $\dot{P}_x$, $\dot{P}_y$ and $\dot{P}_z$ denote the derivatives of $P_x$, $P_y$ and $P_z$ with respect to time.
S703, decomposing a moving speed of the first target object into a radial speed of the first target object with respect to the first sound pickup unit in a second coordinate system using the first sound pickup unit as a coordinate origin.
For example, a spherical coordinate system is established with the right ear as the origin of coordinates and is denoted as the second coordinate system. In this second coordinate system, $\vec{v}$ is decomposed into the radial velocity by the following formula (3):

$$v_{r\text{-}right} = \vec{v} \cdot \hat{e}_r \qquad (3)$$

where $\hat{e}_r$ denotes the radial unit vector of the spherical coordinate system, and $v_{r\text{-}right}$ is the radial velocity of the first target object relative to the first pickup unit (the right ear) described above. Similarly, $v_{r\text{-}left}$ can be calculated.
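To make formulas (1) to (3) concrete, the following NumPy sketch computes the radial velocity toward one pickup unit from a sampled position track, using finite differences in place of the analytic time derivative; the ear offsets and sampling interval are assumptions for the example.

```python
import numpy as np

def radial_velocity(src_positions, ear_position, dt):
    """Formulas (1)-(3): position vector -> velocity -> radial velocity.

    src_positions: (T, 3) positions of the first target object in the
                   listener-centred (first) coordinate system
    ear_position:  (3,) position of one pickup unit, e.g. the right ear
    dt:            time step between consecutive position samples
    Returns a (T,) array of radial velocities toward that pickup unit."""
    P = src_positions - ear_position                       # formula (1)
    v = np.gradient(P, dt, axis=0)                         # formula (2), finite difference
    e_r = P / np.linalg.norm(P, axis=1, keepdims=True)     # radial unit vector
    return np.sum(v * e_r, axis=1)                         # formula (3), projection onto e_r

# Assumed usage with a 100 Hz position track and ears offset about 9 cm from the head centre:
# v_r_right = radial_velocity(track, np.array([0.0, -0.09, 0.0]), dt=0.01)
# v_r_left  = radial_velocity(track, np.array([0.0,  0.09, 0.0]), dt=0.01)
```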
In the embodiments shown in fig. 4 and fig. 5, $C_0 \in \mathbb{R}^7$, $C_0 = (x, y, z, qx, qy, qz, qw)$, where $(x, y, z)$ represents the position of the first target object in the Cartesian coordinate system and $(qx, qy, qz, qw)$ represents the head direction of the second target object in the Cartesian coordinate system, i.e., the head direction is represented by a quaternion. Since this embodiment introduces $v_{r\text{-}left}$ and $v_{r\text{-}right}$, the condition employed by the machine learning model to generate spatial audio may be a new condition $C$ that augments $C_0$ with $v_{r\text{-}left}$ and $v_{r\text{-}right}$.
Since this embodiment can calculate the radial velocity of the first target object relative to the first pickup unit of the second target object and the radial velocity of the first target object relative to the second pickup unit of the second target object, and provide these two radial velocities as conditions to the machine learning model, the spatial audio generated by the machine learning model can contain the motion information of the first target object; that is, the machine learning model can perceive the motion information of the first target object when generating the spatial audio. Experiments have verified that the method described in this embodiment brings improvements on several accuracy metrics, in particular the phase accuracy (Phase L2) of the audio signals in the spatial audio generated by the machine learning model; Phase L2 reaches 0.780, which is among the better spatial-audio phase losses in the industry. This loss can be understood as the difference between the phase of the audio signal in the spatial audio and the phase of the actual binaural audio.
It can be appreciated that the audio processing method provided in this embodiment may be applicable to different application scenarios, for example, virtual conference, virtual assistant experience, augmented reality, virtual reality, and so on. Different application scenarios are described below in connection with specific embodiments.
Fig. 8 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the method can be applied to game scenes and meta-universe scenes, and the method can be specifically executed by a cloud server. In this embodiment, the method specifically includes the following steps:
S801, calculating the radial velocity of a first target object relative to a second target object in a virtual space according to the position information of the first target object relative to the second target object in the virtual space.
For example, the first target object and the second target object are located in the same virtual space, e.g., the virtual space of a game scene or of a metaverse scene. In this virtual space, it is assumed that the first target object is a movable sound source and the second target object is stationary. Since the positions of the first target object and the second target object in the virtual space are known, the cloud server may establish a Cartesian coordinate system with the second target object as the coordinate origin and map the positions of the first target object and the second target object in the virtual space into this Cartesian coordinate system, so as to obtain their respective positions in the Cartesian coordinate system. It can then calculate a position vector of the first target object relative to the second target object from these positions, and calculate from the position vector the radial velocity of the first target object relative to the second target object, for example the radial velocities of the first target object relative to the left ear and the right ear of the second target object, respectively. The specific calculation process is described above and is not repeated here.
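Illustratively, mapping world-space positions into a coordinate system centred on the second target object can be done by subtracting the listener position and rotating by the inverse of the listener orientation; the quaternion convention (qx, qy, qz, qw) and the names below are assumptions for the sketch.

```python
import numpy as np

def quat_rotate_inverse(q, v):
    """Rotate vector v by the inverse of the unit quaternion q = (qx, qy, qz, qw).
    Uses the identity v' = v + 2*u x (u x v + w*v) with u set to the negated
    vector part, which applies the conjugate (inverse) rotation."""
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    u, w = -q[:3], q[3]
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def to_listener_frame(src_world, listener_pos, listener_quat):
    """Express a source position in the Cartesian system whose origin is the
    second target object (the listener) and whose axes follow its orientation."""
    return quat_rotate_inverse(listener_quat,
                               np.asarray(src_world) - np.asarray(listener_pos))

# Assumed usage with positions coming from the virtual-space engine:
# p_rel = to_listener_frame([2.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```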
S802, generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio pre-configured for the first target object, and the radial velocity of the first target object relative to the second target object.
For example, after the cloud server maps the positions of the first target object and the second target object in the virtual space to the cartesian coordinate system, the position information of the first target object in the cartesian coordinate system and the head direction of the second target object in the cartesian coordinate system may be obtained. Further, the position information, the head direction, the monaural audio configured for the first target object in advance, and the radial velocity of the first target object relative to the second target object are input into the trained machine learning model, so that the machine learning model outputs spatial audio.
S803, playing the spatial audio.
For example, the cloud server may send the spatial audio to a terminal that presents the game scene or the metaverse scene, and the terminal may play the spatial audio so that a user of the terminal perceives the motion of the first target object in the game scene or the metaverse scene from the spatial audio. This improves the user's game experience or metaverse experience.
Fig. 9 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the method can be applied to virtual reality, and the method can be specifically executed by a cloud server. In this embodiment, the method specifically includes the following steps:
S901, calculating the radial velocity of a movable target relative to a target user according to the position information, displayed in a virtual reality display device, of the movable target relative to the target user, wherein the target user wears the virtual reality display device.
For example, as shown in fig. 10, a target user 100 wears a Virtual Reality (VR) display device 101, such as VR glasses. The VR glasses include a display screen 102, and a screen of a virtual reality scene, for example, including a movable target 103, is displayed on the display screen 102. For example, a Cartesian coordinate system is established with the target user 100 as the origin of coordinates. The x-axis as shown in fig. 10 may be the x-axis of the cartesian coordinate system and the y-axis as shown in fig. 10 may be the y-axis of the cartesian coordinate system. In this cartesian coordinate system, a position vector of the movable target 103 with respect to the target user 100 can be calculated. Further, from the position vector, the radial velocity of the movable target 103 with respect to the target user 100, for example, the radial velocities of the movable target 103 with respect to the left ear and the right ear of the target user 100, respectively, can be calculated.
S902, generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio pre-configured for the movable target, and the radial velocity of the movable target relative to the target user.
In addition, as shown in fig. 10, the position information of the movable target 103 in the cartesian coordinate system and the head direction of the target user 100 in the cartesian coordinate system may also be calculated. The cloud server may input the position information, the head direction, the monaural audio configured in advance for the movable target 103, and the radial speeds of the movable target 103 with respect to the left ear and the right ear of the target user 100, respectively, into the trained machine learning model, so that the machine learning model outputs the spatial audio.
S903, playing the spatial audio through the virtual reality display device.
For example, the cloud server may send the spatial audio to the virtual reality display device 101, and the virtual reality display device 101 may play the spatial audio, so that the target user 100 perceives the motion of the movable target 103 from the spatial audio while viewing the virtual reality scene, providing the target user with a more immersive experience.
Fig. 11 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the method can be applied to an augmented reality scene, and the method can be specifically executed by a cloud server. In this embodiment, the method specifically includes the following steps:
S1101, collecting mono audio emitted by a movable target.
For example, as shown in fig. 12, the movable target 121 may be provided with an audio collection device, which may collect the mono audio emitted by the movable target 121 during movement. The audio collection device can send the collected mono audio to the cloud server.
S1102, calculating the radial velocity of the movable target relative to a target user according to position information of the movable target relative to the target user, wherein the target user is provided with an augmented reality device.
For example, the target user 122 wears an augmented reality (AR) device 123, e.g., AR glasses. The AR glasses may measure position information of the movable target 121 relative to the target user 122, convert the position information into a position vector in a Cartesian coordinate system having the target user 122 as the coordinate origin, and calculate the radial velocity of the movable target 121 relative to the target user 122 from the position vector, for example, the radial velocities of the movable target 121 relative to the left ear and the right ear of the target user 122, respectively.
S1103, generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio, and the radial velocity of the movable target relative to the target user.
As shown in fig. 12, a sensor such as a gyroscope is installed in the AR glasses, so that the AR glasses can also sense the head direction of the target user 122 in the Cartesian coordinate system. Further, the AR glasses transmit the position information of the movable target 121 in the Cartesian coordinate system, the head direction of the target user 122 in the Cartesian coordinate system, and the radial velocities of the movable target 121 relative to the left ear and the right ear of the target user 122, respectively, to the cloud server, so that the cloud server inputs the position information, the head direction, the mono audio, and the radial velocities of the movable target 121 relative to the left ear and the right ear of the target user 122 into the trained machine learning model, and the machine learning model outputs the spatial audio.
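How the head direction is combined with the position information before it reaches the model is not detailed here. One plausible pre-processing step, sketched below under the assumption that the gyroscope-derived head direction is expressed as a yaw angle about the vertical axis, is to rotate the source position into the listener's head frame:

```python
import numpy as np

def to_head_frame(source_position, head_yaw_rad):
    """Rotate a source position from the world-aligned listener coordinate system
    into the listener's head frame, assuming the head direction is given as a yaw
    angle about the z (vertical) axis."""
    c, s = np.cos(head_yaw_rad), np.sin(head_yaw_rad)
    # Inverse (transpose) of the head's yaw rotation, applied to the source position.
    rotation = np.array([[  c,   s, 0.0],
                         [ -s,   c, 0.0],
                         [0.0, 0.0, 1.0]])
    return rotation @ source_position

# Source 2 m ahead and 1 m to the left while the head is turned 30 degrees to the left.
local_position = to_head_frame(np.array([2.0, 1.0, 0.0]), np.deg2rad(30.0))
```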
S1104, playing the spatial audio through the augmented reality device.
For example, the cloud server may send the spatial audio to the AR glasses, and the AR glasses may play the spatial audio, so that the target user 122 can perceive the movement of the movable target 121 relative to the target user 122 from the spatial audio while watching the game, thereby improving the augmented reality experience.
Fig. 13 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the method can be applied to conference scenes in the same space, and the method can be specifically executed by a cloud server. In this embodiment, the method specifically includes the following steps:
S1301, acquiring mono audio uttered by a first user in motion.
For example, as shown in fig. 14, the first user 141 and the second user 142 are located in the same space, which is denoted as a preset space 143. The first user 141 is in a motion state and the second user 142 is in a stationary state. The first user 141 may wear an audio collection device, which may collect the mono audio uttered by the first user and send the mono audio to the cloud server.
S1302, calculating the radial velocity of the first user relative to a second user according to position information of the first user relative to the second user in the preset space.
For example, a photographing device, such as a camera, is provided in the preset space 143 and photographs the first user 141 and the second user 142 in real time. The photographing device may send the captured image to the cloud server, so that the cloud server may determine the position information of the first user 141 relative to the second user 142 according to the first user 141 and the second user 142 in the image. Further, the cloud server may establish a Cartesian coordinate system with the second user 142 as the coordinate origin, convert the position information of the first user 141 relative to the second user 142 into a position vector in the Cartesian coordinate system, and calculate the radial velocity of the first user 141 relative to the second user 142 based on the position vector, for example, the radial velocities of the first user 141 relative to the left ear and the right ear of the second user 142, respectively.
S1303, generating spatial audio according to the position information of the first user, the pose information of the second user, the mono audio, and the radial velocity of the first user relative to the second user.
For example, the cloud server may input the position information of the first user 141 in the Cartesian coordinate system, the head direction of the second user 142 in the Cartesian coordinate system, the mono audio, and the radial velocities of the first user 141 relative to the left ear and the right ear of the second user 142, respectively, into the trained machine learning model, so that the machine learning model outputs the spatial audio.
S1304, sending the spatial audio to an audio playback device worn by the second user.
For example, the cloud server may send the spatial audio to an audio playback device, such as a headset, worn by the second user 142, and the audio playback device plays the spatial audio, so that the second user 142 can perceive the motion of the first user 141, improving the user experience in the conference scene.
Fig. 15 is a flowchart of an audio processing method according to another embodiment of the present disclosure. For example, the method can be applied to a teleconference scene, and the method can be specifically executed by a cloud server. In this embodiment, the method specifically includes the following steps:
S1501, acquiring mono audio uttered by a first user in motion.
As shown in fig. 16, the first user 161 is located in a first space 162 and the second user 163 is located in a second space 164, i.e., the first user 161 and the second user 163 are located in different spaces. The first user 161 is in motion in the first space 162, and the second user 163 is in a stationary state in the second space 164. The first user 161 may wear an audio collection device, which may collect the mono audio uttered by the first user and send the mono audio to the cloud server.
S1502, calculating the radial velocity of the first user relative to a target position in a preset space according to position information of the first user relative to the target position in the preset space.
For example, the first space 162 is denoted as a preset space in which a target position 165 exists. The coordinates of the target position 165 in the preset space may be pre-stored in the cloud server. In addition, a photographing device, such as a camera, is disposed in the preset space; the photographing device can photograph the first user 161 in real time and send the captured picture to the cloud server, so that the cloud server can determine the position information of the first user 161 relative to the target position 165 according to the first user 161 in the picture. Further, the cloud server may establish a Cartesian coordinate system with the target position 165 as the coordinate origin, convert the position information of the first user 161 relative to the target position 165 into a position vector in the Cartesian coordinate system, and calculate the radial velocity of the first user 161 relative to the target position 165 based on the position vector.
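When the reference is a single fixed target position rather than the listener's two ears, the radial velocity reduces to the rate of change of the user-to-target distance. A minimal sketch, assuming two consecutive position samples (the sample values and time step are illustrative assumptions), follows:

```python
import numpy as np

def radial_velocity_to_point(pos_prev, pos_curr, target, dt):
    """Radial velocity of a moving user relative to a fixed target position,
    computed as the rate of change of the user-to-target distance between two
    consecutive position samples (a finite-difference approximation)."""
    d_prev = np.linalg.norm(pos_prev - target)
    d_curr = np.linalg.norm(pos_curr - target)
    # Positive: moving away from the target position; negative: approaching it.
    return (d_curr - d_prev) / dt

target = np.zeros(3)  # target position 165 taken as the coordinate origin
v_radial = radial_velocity_to_point(np.array([3.0, 0.0, 0.0]),
                                    np.array([2.9, 0.2, 0.0]),
                                    target, dt=0.04)
```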
S1503, generating spatial audio according to the position information of the first user, the pose information of the second user, the mono audio, and the radial velocity of the first user relative to the target position.
For example, in fig. 16, a photographing device may also be arranged in the second space 164 to photograph the second user 163 and transmit the captured picture to the cloud server, so that the cloud server may determine the head direction of the second user 163 from the picture. Alternatively, a gyroscope may be installed in the headset worn by the second user 163, so that the headset can detect the head direction of the second user 163 and send it to the cloud server. Further, the cloud server may input the position information of the first user 161 in the Cartesian coordinate system, the head direction of the second user 163, the mono audio, and the radial velocity of the first user 161 relative to the target position 165 into the trained machine learning model, so that the machine learning model outputs the spatial audio.
S1504, sending the spatial audio to an audio playback device worn by the second user, wherein the second user and the first user are located in different spaces.
For example, the cloud server may send the spatial audio to an audio playback device, such as a headset, worn by the second user 163, thereby enabling a teleconference in which a remote participant, such as the second user 163, perceives the movement of a speaker, such as the first user 161, through the spatial audio, giving the second user an immersive sensation as if the second user and the first user were in the same space.
Fig. 17 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure. The audio processing apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiments of the audio processing method. As shown in fig. 17, the audio processing apparatus 170 includes:
an acquiring module 171, configured to acquire monaural audio corresponding to a first target object in motion;
a calculation module 172 for calculating a radial velocity of the first target object relative to a second target object;
a generating module 173, configured to generate spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object.
Optionally, when generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object, the generating module 173 is specifically configured to:
input the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object into a pre-trained machine learning model, so that the machine learning model outputs the spatial audio.
Optionally, the position information of the first target object is position information of the first target object in a first coordinate system, and the first coordinate system is a coordinate system with the second target object as a coordinate origin;
the pose information of the second target object is direction information of the second target object.
Optionally, the radial velocity of the first target object relative to the second target object includes a radial velocity of the first target object relative to a first pickup unit of the second target object and a radial velocity of the first target object relative to a second pickup unit of the second target object.
Optionally, when calculating the radial velocity of the first target object relative to the second target object, the calculation module 172 is specifically configured to:
calculate, in a first coordinate system with the second target object as the coordinate origin, a position vector of the first target object relative to a first pickup unit of the second target object;
calculate the moving velocity of the first target object according to the position vector; and
decompose, in a second coordinate system with the first pickup unit as the coordinate origin, the moving velocity of the first target object into a radial velocity of the first target object relative to the first pickup unit.
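As an illustration of how the modules described above could fit together, the following Python sketch mirrors the acquiring, calculation and generating modules; all class, method and attribute names are assumptions made for illustration, not the actual implementation of this disclosure:

```python
import numpy as np

class AudioProcessingApparatus:
    """Sketch of the apparatus 170: acquiring, calculation and generating modules."""

    def __init__(self, model):
        self.model = model  # trained spatial-audio machine learning model (assumed interface)

    def acquire_mono_audio(self, source):
        """Acquiring module 171: obtain mono audio of the moving first target object."""
        return source.read_mono_frame()

    def calculate_radial_velocity(self, pos_prev, pos_curr, pickup_offset, dt):
        """Calculation module 172: position vector in the first coordinate system
        (second target object at the origin), moving velocity by finite difference,
        then decomposition onto the radial direction of the first pickup unit."""
        moving_velocity = (pos_curr - pos_prev) / dt
        direction = pos_curr - pickup_offset          # first target object seen from the pickup unit
        unit_radial = direction / np.linalg.norm(direction)
        return float(np.dot(moving_velocity, unit_radial))

    def generate_spatial_audio(self, position, pose, mono_audio, radial_velocity):
        """Generating module 173: feed the four inputs to the trained model."""
        return self.model.predict(mono_audio, position, pose, radial_velocity)
```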
The audio processing device of the embodiment shown in fig. 17 may be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.
The internal functions and structure of the audio processing apparatus are described above; the apparatus may be implemented as an electronic device. Fig. 18 is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the disclosure. As shown in fig. 18, the electronic device includes a memory 181 and a processor 182.
The memory 181 is used for storing programs. In addition to the programs described above, the memory 181 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 181 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
Processor 182 is coupled to memory 181, executing programs stored in memory 181 for:
acquiring mono audio corresponding to a first target object in motion;
calculating a radial velocity of the first target object relative to a second target object;
generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object.
Further, as shown in fig. 18, the electronic device may further include: communication component 183, power component 184, audio component 185, display 186, and other components. Only some of the components are schematically shown in fig. 18, which does not mean that the electronic device only comprises the components shown in fig. 18.
The communication component 183 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 183 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 183 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power component 184 provides power to the various components of the electronic device. The power component 184 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 185 is configured to output and/or input audio signals. For example, the audio component 185 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 181 or transmitted via the communication component 183. In some embodiments, audio assembly 185 further includes a speaker for outputting audio signals.
The display 186 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
In addition, the embodiment of the present disclosure also provides a computer readable storage medium having stored thereon a computer program that is executed by a processor to implement the audio processing method and the model training method described in the above embodiments.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of audio processing, wherein the method comprises:
acquiring mono audio corresponding to a first target object in motion;
calculating a radial velocity of the first target object relative to a second target object;
generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object.
2. The method of claim 1, wherein generating spatial audio from the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object, comprises:
the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object are input into a pre-trained machine learning model, such that the machine learning model outputs the spatial audio.
3. The method of claim 1, wherein the position information of the first target object is position information of the first target object in a first coordinate system, the first coordinate system being a coordinate system having the second target object as a coordinate origin;
the pose information of the second target object is direction information of the second target object.
4. The method of claim 1, wherein the radial velocity of the first target object relative to the second target object comprises a radial velocity of the first target object relative to a first pickup unit of the second target object and a radial velocity of the first target object relative to a second pickup unit of the second target object.
5. The method of claim 1, wherein calculating the radial velocity of the first target object relative to the second target object comprises:
calculating, in a first coordinate system with the second target object as the coordinate origin, a position vector of the first target object relative to a first pickup unit of the second target object;
calculating the moving velocity of the first target object according to the position vector; and
decomposing, in a second coordinate system with the first pickup unit as the coordinate origin, the moving velocity of the first target object into a radial velocity of the first target object relative to the first pickup unit.
6. A method of audio processing, wherein the method comprises:
calculating the radial velocity of a first target object relative to a second target object in a virtual space according to position information of the first target object relative to the second target object in the virtual space;
generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio pre-configured for the first target object, and the radial velocity of the first target object relative to the second target object;
and playing the spatial audio.
7. A method of audio processing, wherein the method comprises:
calculating the radial velocity of a movable target relative to a target user according to position information, relative to the target user, of the movable target displayed in a virtual reality display device, wherein the target user wears the virtual reality display device;
generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio pre-configured for the movable target, and the radial velocity of the movable target relative to the target user;
and playing the spatial audio through the virtual reality display device.
8. A method of audio processing, wherein the method comprises:
collecting mono audio emitted by a movable target;
calculating the radial velocity of the movable target relative to a target user according to position information of the movable target relative to the target user, wherein the target user is provided with an augmented reality device;
generating spatial audio according to the position information of the movable target, the pose information of the target user, the mono audio, and the radial velocity of the movable target relative to the target user;
and playing the spatial audio through the augmented reality device.
9. A method of audio processing, wherein the method comprises:
acquiring mono audio uttered by a first user in motion;
calculating the radial velocity of the first user relative to a second user according to position information of the first user relative to the second user in a preset space;
generating spatial audio according to the position information of the first user, the pose information of the second user, the mono audio, and the radial velocity of the first user relative to the second user;
and sending the spatial audio to an audio playback device worn by the second user.
10. A method of audio processing, wherein the method comprises:
acquiring mono audio uttered by a first user in motion;
calculating the radial velocity of the first user relative to a target position in a preset space according to position information of the first user relative to the target position in the preset space;
generating spatial audio according to the position information of the first user, the pose information of a second user, the mono audio, and the radial velocity of the first user relative to the target position;
and sending the spatial audio to an audio playback device worn by the second user, wherein the second user and the first user are located in different spaces.
11. A model training method, wherein the method comprises:
acquiring mono audio emitted by a movable sound source;
acquiring binaural audio acquired by a first pickup unit and a second pickup unit of a target object;
inputting the position information of the movable sound source, the pose information of the target object, the mono audio, and the radial velocity of the movable sound source relative to the target object into a machine learning model to be trained, so that the machine learning model outputs spatial audio;
training the machine learning model based on the spatial audio and the binaural audio, the trained machine learning model being used to perform the audio processing method according to any one of claims 1-10.
12. An audio processing apparatus, comprising:
an acquisition module, used for acquiring mono audio corresponding to a first target object in motion;
a calculation module for calculating a radial velocity of the first target object relative to a second target object;
and a generating module, used for generating spatial audio according to the position information of the first target object, the pose information of the second target object, the mono audio, and the radial velocity of the first target object relative to the second target object.
13. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-11.
14. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of any of claims 1-11.
CN202310454751.XA 2023-04-21 2023-04-21 Audio processing and model training method, device, equipment and storage medium Pending CN116684777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454751.XA CN116684777A (en) 2023-04-21 2023-04-21 Audio processing and model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454751.XA CN116684777A (en) 2023-04-21 2023-04-21 Audio processing and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116684777A true CN116684777A (en) 2023-09-01

Family

ID=87779872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454751.XA Pending CN116684777A (en) 2023-04-21 2023-04-21 Audio processing and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116684777A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination