CN114422935B - Audio processing method, terminal and computer readable storage medium - Google Patents

Audio processing method, terminal and computer readable storage medium

Info

Publication number
CN114422935B
CN114422935B
Authority
CN
China
Prior art keywords
sound source
virtual sound
user
application
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210258905.3A
Other languages
Chinese (zh)
Other versions
CN114422935A (en)
Inventor
吴黄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202210258905.3A
Publication of CN114422935A
Application granted
Publication of CN114422935B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H04S 7/304 For headphones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/012 Head tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Stereophonic System (AREA)

Abstract

The application discloses an audio processing method, a terminal and a computer readable storage medium, and belongs to the technical field of spatial audio. The method comprises the following steps: receiving an application operation on a target application; presenting an application interface of the target application according to the application operation; determining the position of a virtual sound source of the spatial audio to be output according to the interface presentation form or the interface content of the application interface; determining the relative orientation of the head of the user wearing the earphone with respect to the virtual sound source according to the position of the virtual sound source; and adjusting the spatial audio output to the headphones according to the relative orientation, so that the spatial audio output to the headphones is perceived by the user as originating from the virtual sound source. In this way, the position of the virtual sound source of the spatial audio output to the earphone can be adjusted along with the interface presentation form or the interface content of the application interface, which expands the ways of setting the virtual sound source of the spatial audio and improves the spatial sense of the spatial audio and the auditory experience of the user.

Description

Audio processing method, terminal and computer readable storage medium
Technical Field
The present application relates to the field of spatial audio technologies, and in particular, to an audio processing method, a terminal, and a computer-readable storage medium.
Background
Most headphones output audio that is perceived by the user as coming from directly in front of the user and cannot restore the true direction of the sound. For example, when a user wears a headset, the user always feels that the sound is coming from the front regardless of the direction in which the user equipment connected to the headset is located. Currently, in order to improve the realism and sense of presence when a user listens to sound, spatial audio technology has been proposed. It can simulate a specific direction and position of sound so that the spatial audio output by headphones is perceived by the user as originating from that specific direction; such a simulated specific direction (sound source) is referred to as a virtual sound source.
In the related art, when a user wears earphones to listen to music or watch video, the screen center of the user equipment is generally set as the virtual sound source. When the head position or head orientation of the user changes, the change in the position of the user's head relative to the virtual sound source can be detected by a motion sensor, such as a gyroscope, built into the earphones, and the spatial audio output to the earphones is adjusted according to that change, so that the spatial audio output by the earphones is always perceived by the user as originating from the screen center of the user equipment. For example, when the user is facing the screen of the user device, the spatial audio output by the headphones is perceived by the user as originating from the front; when the user's head is turned to the left, the spatial audio output by the headphones is perceived by the user as originating from the right.
However, fixing the screen center of the user equipment as the virtual sound source of spatial audio has certain limitations: it is inflexible and may not meet the auditory requirements of the user.
Disclosure of Invention
The application provides an audio processing method, a terminal and a computer readable storage medium, which can address the limitations and low flexibility of fixing the screen center of the user equipment as the virtual sound source of spatial audio. The technical scheme is as follows:
in a first aspect, an audio processing method is provided, which is applied in a terminal, and the method includes:
An application operation performed by a user on the target application is received, and an application interface of the target application is presented according to the application operation. The position of the virtual sound source of the spatial audio to be output is determined according to the interface presentation form or the interface content of the application interface. The relative orientation of the head of the user wearing the headset with respect to the virtual sound source is determined according to the determined position of the virtual sound source. The spatial audio output to the headphones is adjusted according to the relative orientation of the user's head with respect to the virtual sound source, so that the spatial audio output to the headphones is perceived by the user as originating from the virtual sound source.
The interface presentation form of the application interface refers to the form in which the application interface is presented to the user. An application interface can have various presentation forms, which may be classified according to the proportion of the screen occupied by the interface display area, whether the application interface is minimized to icon form, whether the application interface is displayed in the foreground or in the background, and so on. It is to be understood that classification may also be made according to other circumstances.
For example, the interface presentation form of the application interface may include: the occupation ratio of the interface display area of the application interface on the screen is larger than the proportion threshold, the occupation ratio of the interface display area of the application interface on the screen is smaller than or equal to the proportion threshold, the application interface is minimized into an icon form, the application interface is switched to a background, the application interface is switched to a foreground and the like.
The proportion threshold may be preset according to needs, for example, may be set to 25% or 50%, and the like, which is not limited in this embodiment of the application.
In the embodiment of the application, the position of the virtual sound source of the spatial audio is determined according to the interface presentation form or the interface content of the application interface, and the spatial audio output to the earphone is then adjusted according to the relative orientation of the head of the user wearing the earphone with respect to the virtual sound source. In this way, the position of the virtual sound source of the spatial audio output to the earphone can be adjusted along with the interface presentation form or the interface content of the application interface, which expands the ways of setting the virtual sound source of the spatial audio and improves the spatial sense of the spatial audio and the auditory experience of the user.
As an example, in the case that the position of the virtual sound source of the spatial audio to be output is determined according to the interface presentation form of the application interface, the positions of the virtual sound sources corresponding to different interface presentation forms of the application interface are different. For example, the first interface presentation format corresponds to a first virtual sound source, the second interface presentation format corresponds to a second virtual sound source, and the first virtual sound source and the second virtual sound source are located at different positions.
As an example, determining the position of the virtual sound source of the spatial audio to be output according to the interface presentation form of the application interface includes one or more of the following ways:
1) If the ratio of the interface display area of the application interface on the screen is greater than the ratio threshold, the preset position is determined as the position of the virtual sound source of the spatial audio to be output.
The preset position may be set in advance; for example, it may be the position directly in front of the user's head, the center of the screen, or the interface center of the application interface. Alternatively, the preset position may be determined according to the interface content of the application interface, which is not limited in this embodiment of the application.
2) If the ratio of the interface display area of the application interface on the screen is less than or equal to the ratio threshold, the window position of the application interface is determined as the position of the virtual sound source of the spatial audio to be output.
The window position may be the window center position or another position of the window.
3) If the application interface is minimized to icon form, the position of the minimized icon of the application interface is determined as the position of the virtual sound source of the spatial audio to be output.
The minimized icon of the application interface may be an application icon displayed on the main interface or an application icon displayed in the taskbar, and the icon form and the display position of the minimized icon of the application interface are not limited in the embodiment of the application.
In addition, the position of the minimized icon may be the icon center position of the minimized icon, and may also be other positions of the minimized icon, which is not limited in this embodiment of the application.
4) If the application interface is switched to the background, the position obtained by moving the virtual sound source used before the switch a specified distance in the direction opposite to the screen's light-emitting direction is determined as the position of the virtual sound source of the spatial audio to be output.
That is, if the application interface is switched to the background, the virtual sound source is moved a specified distance towards the back of the screen.
The specified distance may be preset, for example, set to 10 cm, 20 cm, or 30 cm, which is not limited in this embodiment of the application.
5) If the application interface is switched to the foreground, the position obtained by moving the virtual sound source used before the switch a specified distance in the screen's light-emitting direction is determined as the position of the virtual sound source of the spatial audio to be output.
That is, if the application interface is switched back to the foreground, the virtual sound source is moved a specified distance towards the front of the screen.
It should be understood that the positions of the virtual sound sources corresponding to different interface presentation forms may also be set to other correspondences, which is not limited in this embodiment of the application. In addition, different window presentation forms of a specific window in the application interface may be set to correspond to the positions of different virtual sound sources. For example, different window presentation forms of a video playing window in a video application may be set to correspond to the positions of different virtual sound sources.
For example, the positions of the virtual sound sources corresponding to the different window presentations of the specific window in the application interface may include:
If the ratio of the specific window on the screen is greater than the ratio threshold, the preset position is determined as the position of the virtual sound source of the spatial audio to be output. If the ratio of the interface display area of the specific window on the screen is less than or equal to the ratio threshold, the window position of the specific window is determined as the position of the virtual sound source of the spatial audio to be output. If the specific window is minimized to icon form, the position of the minimized icon of the specific window is determined as the position of the virtual sound source of the spatial audio to be output. If the specific window is switched to the background, the position obtained by moving the virtual sound source used before the switch a specified distance in the direction opposite to the screen's light-emitting direction is determined as the position of the virtual sound source of the spatial audio to be output. If the specific window is switched to the foreground, the position obtained by moving the virtual sound source used before the switch a specified distance in the screen's light-emitting direction is determined as the position of the virtual sound source of the spatial audio to be output.
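As an illustration only, the correspondence in 1) to 5) above and its window-level variant can be expressed as a small lookup function. The following Python sketch is not part of the original disclosure; the class names, the coordinate convention (z along the screen's light-emitting direction) and the example threshold and distance values are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto

RATIO_THRESHOLD = 0.25      # assumed proportion threshold, e.g. 25%
BACKGROUND_OFFSET_M = 0.2   # assumed "specified distance", e.g. 20 cm

class PresentationForm(Enum):
    LARGE = auto()          # display-area ratio on the screen > threshold
    SMALL = auto()          # display-area ratio on the screen <= threshold
    MINIMIZED = auto()      # interface (or window) minimized to an icon
    TO_BACKGROUND = auto()
    TO_FOREGROUND = auto()

@dataclass
class Vec3:
    x: float  # horizontal, in the screen plane
    y: float  # vertical, in the screen plane
    z: float  # along the screen's light-emitting direction (towards the user)

def virtual_source_position(form: PresentationForm,
                            preset_pos: Vec3,
                            window_center: Vec3,
                            icon_center: Vec3,
                            previous_source: Vec3) -> Vec3:
    """Return the virtual sound source position for one presentation form."""
    if form is PresentationForm.LARGE:
        return preset_pos                       # e.g. screen or interface center
    if form is PresentationForm.SMALL:
        return window_center                    # window position of the shrunk interface
    if form is PresentationForm.MINIMIZED:
        return icon_center                      # center of the minimized icon
    if form is PresentationForm.TO_BACKGROUND:  # move away from the viewer
        return Vec3(previous_source.x, previous_source.y,
                    previous_source.z - BACKGROUND_OFFSET_M)
    # TO_FOREGROUND: move back towards the viewer
    return Vec3(previous_source.x, previous_source.y,
                previous_source.z + BACKGROUND_OFFSET_M)
```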
As an example, in the case that the position of the virtual sound source of the spatial audio to be output is determined according to the interface content of the application interface, the sound production position in the application interface may be determined, and the sound production position in the application interface may be determined as the position of the virtual sound source of the spatial audio to be output.
The interface content of the application interface may include a sound production position in the application interface, where the sound production position refers to a sound source position within the displayed picture of the application interface, such as the position of a speaker who is speaking in the application interface, or the position of a musical instrument or device that is producing sound. The instrument that is producing sound may be, for example, a drum being struck or a piano being played.
The sounding position in the application interface is set to be the position of the virtual sound source of the spatial audio, so that the spatial audio sensed by the user can come from the sounding position in the application interface, and the virtual sound source of the spatial audio sensed by the user can be changed along with the change of the sounding position, thereby further improving the spatial sense of the spatial audio and the auditory experience of the user.
For example, the position of a specified part of the speaker in the application interface may be determined, and the position of that specified part may be used as the sound production position. The specified part may be the mouth or the head. That is, the mouth position or the head position of the speaker in the application interface may be determined as the position of the virtual sound source of the spatial audio to be output.
As one example, determining the speaker speaking in the application interface may include the following implementations:
the first implementation mode comprises the following steps: the speaker who is speaking in the application interface is determined by means of image recognition.
For example, portrait recognition is performed on the application interface to determine the portraits in the application interface. The talking action of each portrait in the application interface is then recognized, and the portrait performing a talking action is determined as the speaker who is speaking.
The second implementation mode comprises the following steps: in the case that the application interface is a video call interface or a video conference interface, a speaker speaking in the application interface can be determined according to the intensity of the audio signal of each user who is in a video call or a video conference.
For example, the target user who is speaking may be determined from the plurality of users based on the strength of the audio signals of the plurality of users who are in a video call or a video conference. And determining the user image of the target user in the application interface as the speaker who is speaking.
For example, for each user who is in a video call or a video conference, it is determined whether the intensity of the audio signal of that user is greater than an intensity threshold. If so, the user is determined to be a speaking user; if not, the user is determined to be a user who is not speaking.
The sound production position in the application interface may comprise a plurality of positions. In that case, the plurality of sound production positions in the application interface may serve as the positions of a plurality of virtual sound sources of the spatial audio to be output, so that the spatial audio output to the headphones is perceived by the user as originating from the plurality of virtual sound sources.
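As an illustration of the second implementation and of the case with a plurality of sound production positions, the following sketch marks every participant whose audio signal intensity exceeds a threshold as speaking and maps each speaker to the mouth position of that user's image, yielding one virtual sound source per speaker. The function names, data structures and threshold value are assumptions made for illustration.

```python
INTENSITY_THRESHOLD = 0.1   # assumed threshold on the per-user audio signal intensity

def speaking_users(audio_intensity: dict[str, float]) -> list[str]:
    """Users whose audio signal intensity exceeds the threshold are treated as speaking."""
    return [user for user, level in audio_intensity.items()
            if level > INTENSITY_THRESHOLD]

def virtual_sources(audio_intensity: dict[str, float],
                    mouth_positions: dict[str, tuple[float, float]]) -> list[tuple[float, float]]:
    """Map each speaking user to the on-screen mouth position of that user's image,
    giving one virtual sound source per speaker."""
    return [mouth_positions[user] for user in speaking_users(audio_intensity)
            if user in mouth_positions]

# Example: A and B are on a call with C; only A's signal is above the threshold.
sources = virtual_sources({"A": 0.4, "B": 0.02}, {"A": (120, 300), "B": (560, 300)})
```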
The relative orientation of the user's head with respect to the virtual sound source may include the relative position of the user's head with respect to the virtual sound source, and the relative direction of the head orientation with respect to a specific direction. The specific direction may be preset; for example, it may be the direction from the user's head towards the virtual sound source. It should be understood that the specific direction may also be set to other directions, which is not limited in the embodiments of the present application.
As an example, a user image of the user wearing the headphones may be captured by a camera, and the relative orientation of the user's head with respect to the virtual sound source may be determined from the captured image. For example, from the captured user image, the relative position of the user's head with respect to the virtual sound source and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source can be determined.
It will be appreciated that the relative orientation of the user's head with respect to the virtual sound source may also be determined in other ways. For example, the headset may be provided with motion sensors such as a gyroscope and/or an acceleration sensor. The headset measures the motion information of the user's head through the motion sensors and sends the detected motion information to the terminal, and the terminal determines the relative orientation of the user's head with respect to the virtual sound source by combining the motion information with the user image collected by the camera. In this way, the detection accuracy of the position and orientation of the user's head can be further improved.
In addition, the terminal can also identify the relative orientation of the head of the user relative to the virtual sound source through the face ID technology. For example, the terminal is configured with a sensor set and a dot-matrix projector, a 3D face model of the user is constructed by the sensor set and the dot-matrix projector to recognize face information of the user, and the relative position and the relative direction of the head of the user with respect to the virtual sound source are analyzed according to the recognized face information. Wherein the sensor set may include an ambient light sensor, a distance sensor, an infrared lens, a floodlight sensing element, and the like.
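One possible way to express the relative orientation described above is as the vector from the head to the virtual sound source together with the angle between the head orientation and that vector. The sketch below is only such an illustration and is not taken from the original disclosure; the head pose is assumed to have been obtained from the camera image and/or the headset's motion sensors.

```python
import math

def relative_orientation(head_pos, head_dir, source_pos):
    """Relative position of the head with respect to the virtual sound source, and the
    angle between the head orientation and the head-to-source direction.

    head_pos, source_pos: (x, y, z) in a shared coordinate frame
    head_dir: unit vector of the head orientation
    """
    rel = tuple(s - h for s, h in zip(source_pos, head_pos))   # head -> source vector
    dist = math.sqrt(sum(c * c for c in rel))
    if dist == 0:
        return rel, 0.0
    to_source = tuple(c / dist for c in rel)
    cos_a = max(-1.0, min(1.0, sum(a * b for a, b in zip(head_dir, to_source))))
    return rel, math.degrees(math.acos(cos_a))                 # relative direction in degrees
```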
In a second aspect, an audio processing apparatus is provided, the audio processing apparatus having a function of implementing the behavior of the audio processing method in the first aspect. The audio processing apparatus includes at least one module, where the at least one module is configured to implement the audio processing method provided by the first aspect.
In a third aspect, an audio processing apparatus is provided, where the audio processing apparatus includes a processor and a memory, and the memory is used for storing a program that supports the audio processing apparatus to execute the audio processing method provided in the first aspect, and storing data used for implementing the audio processing method in the first aspect. The processor is configured to execute programs stored in the memory. The audio processing apparatus may further comprise a communication bus for establishing a connection between the processor and the memory.
In a fourth aspect, a computer-readable storage medium is provided, having stored therein instructions, which, when run on a computer, cause the computer to perform the audio processing method of the first aspect described above.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the audio processing method of the first aspect described above.
The technical effects obtained by the second, third, fourth and fifth aspects are similar to the technical effects obtained by the corresponding technical means in the first aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic diagram of a spatial orientation of spatial audio provided by an embodiment of the present application;
fig. 2 is a schematic view of a scene in which a user wears an earphone to watch a video according to an embodiment of the present application;
fig. 3 is a schematic view of a video watched by a user wearing a headset according to another embodiment of the present application;
fig. 4 is a schematic diagram illustrating changes of an application interface of a video application and corresponding changes of a position of a virtual sound source according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a change in an application interface and a corresponding change in a position of a virtual sound source of another video application according to an embodiment of the present application;
fig. 6 is a schematic view of a scene of a multi-person video call provided in an embodiment of the present application;
fig. 7 is a schematic view of a scene of a multi-person video call provided in an embodiment of the present application;
fig. 8 is a block diagram of a software system of a terminal according to an embodiment of the present disclosure;
fig. 9 is a flowchart of an audio processing method provided in an embodiment of the present application;
FIG. 10 is a flowchart of another audio processing method provided in an embodiment of the present application;
fig. 11 is a flowchart of another audio processing method provided by an embodiment of the present application;
FIG. 12 is a flowchart of another audio processing method provided by an embodiment of the present application;
FIG. 13 is a flowchart of another audio processing method provided by an embodiment of the present application;
fig. 14 is a flowchart of another audio processing method provided by an embodiment of the present application;
FIG. 15 is a flow chart of another audio processing method provided by an embodiment of the present application;
fig. 16 is a flowchart of another video processing method provided in the embodiment of the present application;
fig. 17 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that reference to "a plurality" in this application means two or more. In the description of the present application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" herein only describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, for the convenience of clearly describing the technical solutions of the present application, the words "first", "second", and the like are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the words "first", "second", etc. do not limit quantity or order of execution, nor do they denote relative importance.
First, terms related to the embodiments of the present application are explained for convenience of understanding.
Spatial audio techniques
Spatial audio technology refers to simulating sound as emanating from a particular direction and location. To some extent, spatial audio technology "localizes" sound, such that the spatial audio output by a user device is perceived by the user as originating from a simulated particular orientation. This simulated particular orientation (i.e., the simulated source of the sound) is often referred to as the virtual sound source.
Spatial audio technology places the surround sound channels precisely in appropriate orientations, so that the user experiences immersive surround sound when turning the head or moving the device. Such simulation is not just a traditional surround sound effect; rather, the user device is simulated as a sound device at a fixed position in space, i.e., as a virtual sound source.
Spatial audio is audio generated by spatial audio techniques that can be perceived by a user as originating from a virtual sound source, i.e. audio that originates from a virtual sound source in the user's perception.
The spatial audio technology can simulate the spatial sense of sound by directional audio filtering and adjusting the sound frequency received by the ears of a user, so as to realize the simulation of the sound in a specific direction and generate the spatial audio.
Referring to fig. 1, fig. 1 is a schematic view of a spatial orientation of spatial audio according to an embodiment of the present disclosure. As shown in fig. 1, the original audio may be processed by spatial audio techniques into spatial audio that originates from the front, rear, left, right, up, down, etc. dimensions in the user's perception of hearing, such that the spatial audio may be perceived by the user as coming from a particular front, rear, left, right, up, down, etc. orientation after being played. These particular orientations are not the actual sources of the original audio, but rather virtual sources that are modeled by spatial audio techniques.
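The original text does not disclose the filtering details. Purely as a rough illustration of how audio can be made to be perceived from a particular direction, the sketch below applies an equal-power interaural level difference and a small interaural time difference to a mono signal for a given azimuth; real spatial audio rendering (for example HRTF-based filtering) is considerably more elaborate, and nothing here is taken from the original disclosure.

```python
import numpy as np

def render_azimuth(mono: np.ndarray, azimuth_deg: float, sr: int = 48000) -> np.ndarray:
    """Crude spatialization of a mono signal: pan by an interaural level difference
    and delay the far ear by an interaural time difference.
    Positive azimuth = source to the listener's right."""
    az = np.radians(azimuth_deg)
    itd = 0.00066 * np.sin(az)                 # ~0.66 ms maximum ear-to-ear delay (approximation)
    delay = int(round(abs(itd) * sr))
    gain_r = np.sqrt(0.5 * (1 + np.sin(az)))   # equal-power panning
    gain_l = np.sqrt(0.5 * (1 - np.sin(az)))
    left, right = gain_l * mono, gain_r * mono
    pad = np.zeros(delay)
    if itd > 0:                                # source on the right: delay the left ear
        left = np.concatenate([pad, left])[: len(mono)]
    elif itd < 0:                              # source on the left: delay the right ear
        right = np.concatenate([pad, right])[: len(mono)]
    return np.stack([left, right])             # 2 x N stereo buffer for the headphones
```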
It should be noted that the audio processing method according to the present application is applicable to any terminal having a display function and an audio processing function, such as a mobile phone, a tablet computer, a smart television, a VR device, or a smart wearable device, and the present application is not limited thereto. The terminal is connected with an audio playing device supporting a spatial audio playing function, and the audio playing device is used for receiving the spatial audio output by the terminal and playing the received spatial audio. The audio playing device may be, for example, an earphone supporting spatial audio, connected to the terminal via Bluetooth or another connection mode. Of course, the terminal may also integrate an audio playing device, and after the terminal generates the spatial audio, the integrated audio playing device plays it. For convenience of description, the following takes as an example a terminal that is a tablet computer and an audio playing device that is an earphone connected to the tablet computer.
Referring to fig. 2, fig. 2 is a schematic view of a scene in which a user wears an earphone to watch a video according to an embodiment of the present application. As shown in fig. 2, a tablet pc 10 is placed in front of the user, the user wears an earphone 20 on the head, and the earphone 20 is connected to the tablet pc 10 through Bluetooth. The tablet computer 10 has a video application installed in it, and the user can open the video application, watch the video it plays, and listen to the audio it outputs through the headset 20. It should be understood that fig. 2 only takes a video application outputting audio as an example; the application may instead be another application capable of outputting audio, such as a video call application, a video conference application, or a music application.
In the related art, spatial audio technology typically sets the screen of the tablet computer 10 as a fixed virtual sound source, for example, setting the center of the screen as the virtual sound source. That is, while the user operates the application interface of the video application, no matter how the interface content or the interface presentation form of the application interface changes, the virtual sound source of the spatial audio output by the video application to the earphone 20 remains unchanged: it is always the screen center of the tablet computer, so the spatial audio played by the earphone 20 is always perceived by the user as coming from the screen center of the tablet computer 10.
In addition, when the position of the virtual sound source is fixed, if the head position or head orientation of the user changes while the user is listening through the headset 20, the position of the ears relative to the virtual sound source also changes. To enable the user to perceive this change, the headphones 20 are provided with motion sensors such as an acceleration sensor and a gyroscope. The motion sensors track the head movements of the user; the change in the relative position and orientation of the user's head with respect to the screen center of the tablet computer 10 is determined from these movements and sent to the video application, and the video application adjusts the spatial audio output to the headphones 20 accordingly, so that the spatial audio output to the headphones 20 is still perceived by the user as originating from the virtual sound source after being played. That is, when the relative position and head orientation of the user's head with respect to the virtual sound source change, the spatial audio can be adjusted to simulate the corresponding change in the spatial perception of the sound, such as a change in sound intensity, so that the user perceives the change in the head's orientation relative to the virtual sound source.
As shown in fig. 2, when the user is facing the tablet pc 10, the headset 20 can track the user's head movement through its motion sensor and, from the head movement information, determine relative position 1 of the head with respect to the screen center of the tablet pc 10 and head orientation 1 of the head facing the screen center. The audio parameters of the video application are processed according to relative position 1 and head orientation 1 to generate spatial audio 1, which is perceived by the user as originating from the front, and spatial audio 1 is output to the headset 20 for playing.
As shown in fig. 3, after the user's head twists to the left, the headset 20 can track this head movement through its motion sensor and, from the head movement information, determine relative position 2 of the head with respect to the screen center of the tablet pc 10 and head orientation 2 of the head facing left. The audio parameters of the video application are processed according to relative position 2 and head orientation 2 to generate spatial audio 2, which is perceived by the user as originating from the right, and spatial audio 2 is output to the headset 20 for playing.
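The behavior shown in fig. 2 and fig. 3 amounts to tracking the azimuth of the fixed virtual sound source in the listener's head frame: turning the head to the left moves the perceived source to the right by the same angle. A minimal sketch of this bookkeeping follows; the function name and sign conventions are assumptions made for illustration.

```python
def source_azimuth_in_head_frame(source_bearing_deg: float, head_yaw_left_deg: float) -> float:
    """Azimuth of the virtual sound source relative to where the head is now facing.
    source_bearing_deg: bearing of the source in the world frame, 0 = straight ahead of the
                        user's initial pose, positive = to the right.
    head_yaw_left_deg:  how far the head has turned to the left since that initial pose.
    Returns an angle in [-180, 180); positive = the source is to the listener's right."""
    az = source_bearing_deg + head_yaw_left_deg
    return (az + 180) % 360 - 180

# Fig. 2: user faces the tablet (bearing 0, no turn) -> audio perceived from the front (0 deg).
# Fig. 3: head twists 90 deg to the left -> the same source is now perceived at +90 deg (the right).
assert source_azimuth_in_head_frame(0, 0) == 0
assert source_azimuth_in_head_frame(0, 90) == 90
```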
However, the above audio processing method, which uses the screen position of the tablet pc 10 as the fixed virtual sound source of the spatial audio, has certain limitations: it may not meet the listening requirements of the user and has low flexibility. For example, if the virtual sound source of the output spatial audio is fixed at the screen center, the output spatial audio cannot reflect changes in the application interface. In order to improve the spatial sense and flexibility of spatial audio and meet the auditory requirements of the user, the embodiments of the present application provide an audio processing method that adjusts the position of the virtual sound source of the spatial audio according to changes in the interface content or interface presentation form of the application interface. In this way, the virtual sound source of the spatial audio output by the user equipment changes correspondingly as the interface content or interface presentation form of the application interface changes; that is, the spatial audio perceived by the user can originate from different positions as the application interface changes, so that the spatial audio reflects the interface changes, which improves the spatial sense of the spatial audio and the auditory experience of the user.
Next, taking a video application as an example, the position of the virtual sound source of the spatial audio that is adjusted and output according to the change of the interface presentation form of the application interface will be described. It should be understood that adjusting the position of the virtual sound source corresponding to the output spatial audio according to the change of the interface presentation form of the application interface may also be applied to other applications, for example, to a music playing application, a video call application, a video conference application, and the like, which is not limited in this embodiment of the application.
Referring to fig. 4-5, fig. 4-5 are schematic diagrams illustrating changes of an application interface of a video application and corresponding changes of the position of the virtual sound source according to an embodiment of the present application. Based on the application scenario shown in fig. 2, the user opens the video application of the tablet pc 10 and wears the headset 20 supporting the spatial audio playing function, and the headset 20 is connected to the tablet pc 10 via Bluetooth. The user performs a video playing operation in the application interface of the video application, and in response to the user's operation, the tablet pc 10 displays a user interface 401 as shown in (a) of fig. 4. The user interface 401 is the application interface 1 of the video application; that is, the application interface 1 is displayed in full-screen form (the interface display area of the application interface 1 occupies the entire area of the screen), and the application interface 1 includes a video playing window maximally displayed in the interface, i.e., the video playing window occupies the entire area of the application interface 1. When the application interface 1 is displayed in full screen, the video application may use the position of the application interface 1 as the virtual sound source (virtual sound source 1) of the spatial audio to be output, generate spatial audio whose virtual sound source is the virtual sound source 1 according to the audio parameters of the video, and output the generated spatial audio to the headphones 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the virtual sound source 1.
It should be understood that, in the diagram (a) in fig. 4, the position of the application interface 1 is only taken as an example of the interface center of the application interface 1, and the position of the application interface 1 may also be other positions of the application interface 1, which is not limited in this embodiment of the application. In addition, the video application may further use the virtual sound source 1 as a virtual sound source of the spatial audio to be output when the ratio of the interface display area of the application interface 1 on the screen is greater than a certain ratio threshold. The proportion threshold may be preset, for example, may be set to 25%, etc.
If the user performs a window zoom-out operation on the application interface 1, in response to the user operation, the tablet pc 10 displays a user interface 402 as shown in fig. 4 (b), where the user interface 402 includes an application interface 2 of a video application after the window zoom-out and a main interface that is hidden by the application interface 2, the main interface displays application icons of applications such as a memo, music, video, and a gallery, and a ratio of an interface display area of the application interface 2 on the screen is less than or equal to a ratio threshold. When the window of the application interface of the video application is reduced to a value smaller than or equal to the ratio threshold, for example, to 50% of the screen, the video application may use the position of the application interface 2 after the window is reduced as a virtual sound source (virtual sound source 2), generate a spatial audio with the virtual sound source being the virtual sound source 2 according to the audio parameters of the video, and output the generated spatial audio to the headphones 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the virtual sound source 2. It should be understood that, in the diagram (b) in fig. 4, the position of the application interface 2 is only taken as an example of the interface center of the application interface 2, and the position of the application interface 2 may also be other positions of the application interface 2, which is not limited in this embodiment of the application.
If the user performs a window minimization operation on the application interface 2, for example, clicks a minimization button in the upper right corner of the application interface 2, in response to the user operation, the tablet pc 10 minimizes the video application, and displays the user interface 403 as shown in (c) of fig. 4, where the user interface 403 displays a minimization icon of the video application, and the minimization icon is an application icon of the video application displayed on the main interface. In the case of minimizing the video application, the video application may use the position of the minimization icon of the video application as a virtual sound source (virtual sound source 3), generate a spatial audio with the virtual sound source being the virtual sound source 3 according to the audio parameters of the video, and output the generated spatial audio to the headphones 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the virtual sound source 3. Wherein, the position of the minimized icon can be the icon center of the minimized icon, etc.
Referring to fig. 5 (a), if the user performs an operation of switching the video application to the background on the application interface 2, for example by clicking the application icon of the memo application to open the memo application and bring it to the foreground, then in response to the user's operation the tablet computer 10 switches the memo application to run in the foreground and the video application to run in the background, and displays the user interface 404 shown in fig. 5 (b). The user interface 404 includes the application interface 3 of the memo application and the application interface 2 of the video application, which is hidden by the application interface 3. In the case where the video application is switched to the background, the video application may take as the virtual sound source (virtual sound source 4) the position obtained by moving the virtual sound source used before the switch (virtual sound source 2) a specified distance in the direction opposite to the light-emitting direction of the screen (a direction away from the screen), generate spatial audio whose virtual sound source is the virtual sound source 4 according to the audio parameters of the video, and output the generated spatial audio to the headphones 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the virtual sound source 4. In this way, the user perceives that the spatial audio output by the earphone has moved towards the rear of the screen, and the effect of the video application being switched from the foreground to the background is simulated by moving the virtual sound source of the spatial audio perceived by the user towards the back of the screen.
If the user performs an operation of switching the video application back to the foreground on the application interface 2, for example by clicking the close button in the upper right corner of the application interface 3 of the memo application, then in response to the user's operation the tablet 10 closes the memo application, switches the video application to the foreground again, and displays the user interface 405 shown in fig. 5 (c), which includes the application interface 2 of the video application displayed in the foreground again. In the case where the video application is switched from the background back to the foreground, the video application may take as the virtual sound source (virtual sound source 5) the position obtained by moving the virtual sound source used before the switch (virtual sound source 4) a specified distance in the light-emitting direction of the screen (towards the front of the screen), generate spatial audio whose virtual sound source is the virtual sound source 5 according to the audio parameters of the video, and output the generated spatial audio to the headphones 20 for playing, so that the played spatial audio is perceived by the user as originating from the virtual sound source 5. In this way, the user perceives that the spatial audio output by the earphone has moved towards the front of the screen, and the effect of the video application being switched from the background to the foreground is simulated by moving the virtual sound source of the spatial audio perceived by the user in the light-emitting direction of the screen.
In addition, after determining the location of the virtual sound source, the video application may generate spatial audio from the location of the virtual sound source and the audio parameters of the video. In the process of generating the spatial audio according to the position of the virtual sound source and the audio parameter of the video, the video application may call the camera of the tablet computer 10, acquire a user image through the camera, analyze the head position and the head orientation of the user according to the acquired user image, further determine the relative position of the head position of the user with respect to the position of the virtual sound source, and then process the audio parameter of the video according to the relative position and the head orientation to generate the spatial audio of the virtual sound source, which is the determined virtual sound source. For example, after determining the virtual sound source 1, the video application may process the audio parameters of the video according to the relative position and head orientation of the head position of the user with respect to the position of the virtual sound source 1, and generate the spatial audio with the virtual sound source being the virtual sound source 1.
In addition, after the head position and the head orientation of the user are analyzed according to the acquired user image, the relative direction of the head orientation of the user relative to the direction of the head towards the virtual sound source can be determined, and the audio parameters of the video are processed according to the relative position and the relative direction of the head of the user to generate the spatial audio of the virtual sound source 1. For example, after determining the virtual sound source 1, the video application may process the audio parameters of the video according to the relative position of the head position of the user with respect to the position of the virtual sound source 1 and the relative direction of the head of the user with respect to the virtual sound source 1, so as to generate the spatial audio of the virtual sound source 1.
It should be understood that, in the embodiment of the present application, only the positions of the virtual sound sources corresponding to the several interface presentation forms shown in fig. 4 and fig. 5 are taken as examples for description, in other embodiments, different interface presentation forms and the positions of the corresponding virtual sound sources may also be set to other corresponding relationships, which is not limited in the embodiment of the present application.
In addition, fig. 4 and fig. 5 are only used for explaining that the different interface presentation forms of the application interface correspond to the positions of the different virtual sound sources, but in other embodiments, the different window presentation forms of the specific window in the application interface may also be set to correspond to the positions of the different virtual sound sources. For example, taking a video playing window in an application interface of a video application as an example, the correspondence between different window presentation forms of the video playing window and the position of the virtual sound source of the spatial audio may be shown in table 1:
Table 1
Window presentation form of the video playing window | Position of the virtual sound source of the spatial audio
Ratio of the video playing window on the screen greater than the ratio threshold | Preset position (for example, the interface center of the application interface)
Ratio of the video playing window on the screen less than or equal to the ratio threshold | Window position of the video playing window (for example, the window center)
Video playing window minimized to an icon | Position of the minimized icon
Video playing window switched to the background | Position of the virtual sound source before the switch, moved a specified distance opposite to the screen's light-emitting direction
Video playing window switched to the foreground | Position of the virtual sound source before the switch, moved a specified distance in the screen's light-emitting direction
it should be understood that the corresponding relationship between the different window presentation forms of the video playing window and the position of the virtual sound source of the spatial audio may also be set as other corresponding relationships, which is not limited in this embodiment of the application.
Next, taking a video call application as an example, a position of a virtual sound source of spatial audio that is adjusted and output according to a change in interface content of an application interface will be described. It should be understood that adjusting the position of the virtual sound source corresponding to the output spatial audio according to the change of the interface content of the application interface may also be applied to other applications, for example, to a video conference application or a video application, which is not limited in this embodiment of the present application.
Referring to fig. 6, fig. 6 is a schematic view of a scene of a multi-person video call according to an embodiment of the present disclosure. As shown in fig. 6, three users A, B and C use their respective terminals to conduct a video call. C uses the tablet computer 10 to make the video call with A and B, and wears the headset 20 supporting spatial audio to listen to the audio output by the tablet computer 10.
The tablet computer 10 is installed with a video call application, and C conducts a multi-person video call with A and B through the video call application. During the multi-person video call, the tablet computer 10 displays the video call interface 601, and the video call interface 601 includes the user images of A, B and C. When it is detected that A is speaking, the video call application may use the mouth position of A in the video call interface 601 as the virtual sound source 1, analyze the head position and head orientation of C according to the user image of C, and determine the relative orientation of C's head with respect to the virtual sound source 1. The audio parameters of the video call application are processed according to the relative orientation to generate spatial audio whose virtual sound source is the virtual sound source 1, and the generated spatial audio is output to the headphones 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the mouth position of A in the video call interface 601. The relative orientation may comprise the relative position of the user's head with respect to the virtual sound source 1 and the relative direction of the user's head orientation with respect to the direction from the head towards the virtual sound source 1.
Referring to fig. 7, after the speaker switches to B, the tablet pc 10 displays the video call interface 701 shown in fig. 7, and the video call interface 701 includes the user images of A, B and C. When the video call application detects that the speaker has switched to B, the mouth position of B in the video call interface 701 is used as the virtual sound source 2. At the same time, the head position and head orientation of the user are analyzed according to the user image of C, and the relative orientation of the user's head with respect to the virtual sound source 2 is determined. The audio parameters of the video call application are processed according to this relative orientation to generate spatial audio whose virtual sound source is the virtual sound source 2, and the generated spatial audio is output to the earphone 20 for playing, so that the spatial audio played by the earphone 20 is perceived by the user as originating from the mouth position of B in the video call interface 701.
In addition, if it is detected that A and B are speaking simultaneously, the video call application may use the mouth position of A and the mouth position of B in the video call interface as the virtual sound source 3 and the virtual sound source 4, respectively. The video call application analyzes the user's head position and head orientation from the user image of C and determines the relative orientation of the user's head with respect to the virtual sound source 3 and the virtual sound source 4, respectively. The audio parameters of the video call application are processed accordingly to generate spatial audio whose virtual sound sources are the virtual sound source 3 and the virtual sound source 4, and the generated spatial audio is output to the earphone 20 for playing, so that the spatial audio played by the headphones 20 is perceived by the user as originating from the mouth position of A and the mouth position of B in the video call interface, i.e. from two virtual sound sources.
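When A and B speak at the same time, the audio can be rendered once per virtual sound source and the resulting binaural streams summed for the headphones. The following sketch assumes the per-source renderings already exist (for example, produced along the lines of the earlier panning sketch) and only illustrates the mixing step; the function name and the clipping guard are assumptions, not part of the original disclosure.

```python
import numpy as np

def mix_virtual_sources(rendered_streams: list) -> np.ndarray:
    """Sum per-source binaural renderings (each a 2 x N array) into one stereo
    stream for the headphones, attenuating if the sum would clip."""
    out = np.sum(np.stack(rendered_streams), axis=0)
    peak = float(np.max(np.abs(out)))
    return out / peak if peak > 1.0 else out
```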
It should be understood that, in the embodiment of the present application, the position of the mouth of the speaker in the video call interface is set as the virtual sound source of the spatial audio to be output, but in other embodiments, other positions such as the position of the head of the speaker in the video call interface may also be set as the virtual sound source, which is not limited in the embodiment of the present application.
In addition, in the embodiment of the present application, different virtual sound sources are only set corresponding to different interface contents of a video call interface of a video call application, and in other embodiments, different virtual sound sources may also be set corresponding to different interface contents of an application interface of other applications, for example, other applications may be a video playing interface of a video application, a video conference interface of a video conference application, a game interface of a game application, and the like, which is not limited in this embodiment of the present application.
In addition to setting the position of the speaker in the application interface as a virtual sound source, the position of another target object that is producing sound in the application interface may also be set as the virtual sound source, such as a musical instrument or device that is producing sound, for example a drum being struck or a piano being played. Setting the sound production position in the application interface as the virtual sound source of the spatial audio makes the spatial audio perceived by the user come from the sound production position in the application interface, and the virtual sound source of the spatial audio perceived by the user changes as the sound production position changes, thereby further improving the spatial sense of the spatial audio and the auditory experience of the user.
Next, a software system of a terminal according to an embodiment of the present application will be described.
The software system of the terminal can adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture or a cloud architecture. The embodiment of the application takes an Android (Android) system with a layered architecture as an example to exemplarily explain a software system of a terminal.
Fig. 8 is a block diagram of a software system of a terminal according to an embodiment of the present disclosure. Referring to fig. 8, the layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided, from top to bottom, into four layers: an application layer, an application framework layer, an Android runtime (Android runtime) and system library layer, and a kernel layer.
The application layer may include a series of application packages. As shown in fig. 8, the application package may include a target application, bluetooth, etc. application. The target application can be an application program which needs to output audio, such as video, video call, video conference, music, call and the like. The target application may process the audio parameters to be output as spatial audio for output. As shown in fig. 8, a target application such as a video Application (APP) may include an application interface recognition module, a user head recognition module, and a spatial audio processing module.
The application interface identification module is used for identifying interface content or an interface presentation form of the target application. For example, the application interface identification module comprises an interface content identification module and an interface presentation form identification module. The interface content identification module is used for identifying the sound production position corresponding to the audio parameters of the target application in the application interface, such as identifying the position of the mouth or the head of a speaker. The interface presentation form recognition module is used for recognizing the interface presentation form of the application interface, such as the occupation ratio of the application interface on a screen, whether the application interface is minimized into an icon, whether the application interface is switched to a background, whether the application interface is switched to a foreground, and the like.
The user head identification module is used for identifying the head position and the head orientation of the user wearing the headset. For example, the user head identification module may identify the head position and head orientation of the user through a user image of the user wearing the headset captured by the camera. Alternatively, it may comprehensively determine the head position and head orientation of the user according to the user image of the user wearing the headset and sensor information of the headset, which is detected by a motion sensor of the headset, such as an acceleration sensor and/or a gyroscope, and sent by the headset. Of course, the user head identification module may also identify the head position and head orientation of the user wearing the headset through other technologies, for example, through the face identity (face ID) technology, which is not limited in the embodiment of the present application.
The spatial audio processing module is used for determining the position of the virtual sound source of the spatial audio to be output according to the interface content or the interface presentation form of the target application identified by the application interface identification module, and then determining, according to the head position and head orientation of the user identified by the user head identification module, the relative position of the user's head position with respect to the virtual sound source and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source. The audio parameters of the target application are processed according to the relative position and relative direction to generate spatial audio that is perceived by the user as originating from the determined virtual sound source.
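For illustration only, the following Kotlin sketch shows one way the three modules described above could be wired together. The interface names, data types and the per-frame processing entry point are assumptions introduced for this example, not components defined by the embodiment.

```kotlin
// Illustrative sketch of how the three modules described above could cooperate;
// all class, interface and function names are assumptions made for this example.
data class Vec3(val x: Float, val y: Float, val z: Float)
data class HeadPose(val position: Vec3, val forward: Vec3)

interface ApplicationInterfaceRecognizer { fun virtualSourcePosition(): Vec3 }
interface UserHeadRecognizer { fun headPose(): HeadPose }
interface SpatialAudioRenderer {
    fun render(mono: FloatArray, source: Vec3, head: HeadPose): Pair<FloatArray, FloatArray>
}

class SpatialAudioPipeline(
    private val ui: ApplicationInterfaceRecognizer,
    private val headTracker: UserHeadRecognizer,
    private val renderer: SpatialAudioRenderer
) {
    // Called for every audio frame of the target application.
    fun processFrame(mono: FloatArray): Pair<FloatArray, FloatArray> {
        val source = ui.virtualSourcePosition()    // e.g. speaker's mouth, window centre, icon position
        val head = headTracker.headPose()          // from camera image and/or headset motion sensors
        return renderer.render(mono, source, head) // left/right channels carrying direction cues
    }
}
```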
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 8, the application framework layer may include a window manager, a camera API, a view system, and the like. The window manager is used for managing window programs. For example, the window manager may obtain window information, such as a display size and a window size of the application interface, determine whether the application interface is minimized to an icon, displayed in the foreground, displayed in the background, and the like. The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The camera API is a calling interface of the camera and is used for calling the camera to shoot images. The view system may be used to build a display interface for an application, which may be composed of one or more views, such as views that include display pictures or videos.
For example, the interface identification module may obtain window information detected by the window manager, and determine the interface presentation form of the target application according to the window information. Or the window manager detects the window information, determines the interface presentation form of the target application according to the detected window information, and sends the interface presentation form of the target application to the interface identification module.
For example, the user head identification module may call the camera API, start the camera through the camera API, acquire a user image of the user wearing the headset, and perform image recognition on the acquired user image to determine the head position and head orientation of the user.
It should be understood that fig. 8 is only illustrated by taking the example of integrating the application interface recognition module, the user head recognition module, and the spatial audio processing module in a target application such as a video APP. In other implementations, the application framework layer may also include related modules such as the spatial audio processing module, the application interface recognition module, and/or the user head recognition module. The video APP may call the relevant modules of the application framework layer to realize the relevant functions, for example, call an application interface recognition module in the application framework layer to recognize the interface content or interface presentation form of the application interface of the target application, call a user head recognition module in the application framework layer to recognize the head position and head orientation of the user, or call a spatial audio processing module to generate the spatial audio, and the like.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing the Android system. The core library comprises two parts: one part is the functions that the java language needs to call, and the other part is the core library of Android. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, such as: surface managers (surface managers), Media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like. The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications. The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver and an audio driver. The display driver is used to drive the display. The camera drive is used for driving the camera. The audio driver is used for driving the audio module.
Next, the audio processing method proposed in the embodiment of the present application is described in detail by taking a video APP as an example with reference to the software system shown in fig. 8. Fig. 9 is a flowchart of an audio processing method provided in an embodiment of the present application, and as shown in fig. 9, the method includes the following steps:
step 901: the user opens the video APP to play the video, and wears the earphones supporting the spatial audio to listen to the audio.
For example, the user clicks an icon of a video APP displayed on the screen of the terminal 100 to open the video APP. In response to the operation of the user, the terminal 100 starts the video APP and displays an application interface of the video APP. The user can execute a video playing operation on an application interface of the video APP to play the video.
Step 902: and the video APP displays a video playing window 1 according to the video playing operation of the user.
For example, the application interface of the video APP includes video covers of a plurality of videos. The user clicks a video cover of the video 1, the video APP plays the video 1 in response to the operation of the user, and a video playing window of the video 1 is displayed on the application interface.
Wherein, the video playing window 1 is displayed in a full screen form.
Step 903: the video APP calls the camera API, and the camera is started through the camera API.
Step 904: the camera collects user images of a user wearing the headset.
Wherein the user image includes the user's head or the user's face, so that the position and orientation of the user's head can be identified from the user image.
Step 905: the camera sends the collected user image to the video APP.
The camera may be a monocular camera or a binocular camera. For a binocular camera, the binocular camera may send the user image collected by each of its cameras to the video APP.
Step 906: the window manager detects that the video playback window 1 accounts for 100% of the screen.
When detecting the video playing window 1, the window manager may detect the window size of the video playing window 1 and the screen size, and then determine the occupation ratio of the video playing window 1 on the screen according to the window size and the screen size.
It should be understood that the embodiment of the present application only takes, as an example, the case where the video playing window 1 is displayed in full screen and its occupation ratio on the screen is 100%; the occupation ratio of the video playing window 1 on the screen may also be another percentage, such as 80%, which is not limited in the embodiment of the present application.
Step 907: the window manager sends the occupation ratio of the video playing window 1 on the screen to the video APP as a detection result 1.
Step 908: The video APP receives the detection result 1 sent by the window manager, determines, according to the detection result 1, that the occupation ratio of the video playing window 1 on the screen is greater than 25%, and then uses the center position of the screen as the virtual sound source 1.
In the embodiment of the present application, different window presentation forms of the video playing window correspond to different positions of the virtual sound source, and the different window presentation forms may include that the ratio of the window on the screen is greater than 25%, the ratio of the window on the screen is less than or equal to 25%, the window is minimized to be an icon, the window is switched to the background, the window is switched to the foreground, and the like. The correspondence between the window presentation form and the position of the virtual sound source may be preset. The corresponding relationship may be set by default or by user definition, which is not limited in the embodiment of the present application.
The video APP receives the detection result 1 sent by the window manager, and may judge, according to the detection result 1, whether the occupation ratio of the video playing window 1 on the screen is greater than 25%. If it is determined that the occupation ratio of the video playing window 1 on the screen is greater than 25%, the position of the virtual sound source corresponding to the window presentation form in which the occupation ratio on the screen is greater than 25% is determined from the preset correspondence and used as the virtual sound source 1. For example, the position of the virtual sound source corresponding to the window presentation form in which the video playing window 1 accounts for more than 25% of the screen is the center position of the screen. It should be understood that the position of the virtual sound source corresponding to this window presentation form may also be another position, such as the window center position of the video playing window 1.
It should be understood that the embodiment of the present application is only described by taking a ratio threshold of 25% as a ratio of the video playing window 1 on the screen as an example, and in other embodiments, the ratio threshold may also be other ratios such as 50%, 75%, and the like, which is not limited by the embodiment of the present application.
In addition, the embodiment of the present application is described by taking the example in which the window manager detects the occupation ratio of the video playing window 1 on the screen and sends the occupation ratio to the video APP. In other embodiments, after detecting the occupation ratio of the video playing window 1 on the screen, the window manager may further judge whether the occupation ratio is greater than the ratio threshold and send the judgment result to the video APP. Alternatively, the window manager may detect the window size of the video playing window 1 and the screen size and send them to the video APP, and the video APP determines the occupation ratio of the video playing window 1 on the screen according to the window size and the screen size, so as to judge whether the occupation ratio is greater than the ratio threshold.
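For illustration, the following Kotlin sketch shows how the occupation ratio could be computed from the window size and the screen size and compared with the 25% ratio threshold to select the virtual sound source position. The types are assumptions, and the threshold and positions simply mirror the example above; the minimized, background and foreground cases are handled later in this description.

```kotlin
// Sketch of selecting the virtual sound source position from the window's
// occupation ratio on the screen; the Rect type and the chosen positions are
// assumptions mirroring the 25% example above.
data class Rect(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val width get() = right - left
    val height get() = bottom - top
    val centerX get() = (left + right) / 2f
    val centerY get() = (top + bottom) / 2f
}

// Occupation ratio of the window area relative to the screen area.
fun screenRatio(window: Rect, screen: Rect): Float =
    (window.width.toFloat() * window.height) / (screen.width.toFloat() * screen.height)

// Ratio > threshold: centre of the screen; ratio <= threshold: centre of the window.
fun virtualSourceFor(window: Rect, screen: Rect, ratioThreshold: Float = 0.25f): Pair<Float, Float> =
    if (screenRatio(window, screen) > ratioThreshold)
        screen.centerX to screen.centerY
    else
        window.centerX to window.centerY
```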
Step 909: the video APP receives the user image sent by the camera, and determines the relative position of the head position of the user relative to the virtual sound source 1 and the relative direction of the head orientation relative to the direction of the head facing the virtual sound source 1 according to the user image and the position of the virtual sound source 1.
The video APP may determine the head position and head orientation of the user from the user image sent by the camera, and then determine, according to the head position and head orientation of the user and the position of the virtual sound source 1, the relative position of the user's head position with respect to the virtual sound source 1 and the relative direction of the user's head orientation with respect to the direction from the head towards the virtual sound source 1.
As an example, if the camera is a binocular camera, the video APP may determine, from the user images captured by the binocular camera, a relative position of the head position of the user with respect to the virtual sound source, and a relative direction of the head orientation with respect to a direction of the head towards the virtual sound source. If the camera is a monocular camera, the video APP can determine the relative position of the head position of the user relative to the virtual sound source and the relative direction of the head orientation relative to the direction of the head facing the virtual sound source according to the multi-frame user images collected by the monocular camera.
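As a minimal sketch of the geometry involved, the following Kotlin code computes the two quantities used above, assuming the head position, head orientation and virtual sound source position have already been expressed in a common coordinate system; the vector type and function names are illustrative assumptions.

```kotlin
import kotlin.math.acos
import kotlin.math.sqrt

// Sketch of the two quantities used above: the position of the virtual sound source
// relative to the head, and the angle between the head orientation and the direction
// from the head towards the source. The Vec3 type is an illustrative assumption.
data class Vec3(val x: Float, val y: Float, val z: Float) {
    operator fun minus(o: Vec3) = Vec3(x - o.x, y - o.y, z - o.z)
    fun dot(o: Vec3) = x * o.x + y * o.y + z * o.z
    fun length() = sqrt(dot(this))
    fun normalized() = length().let { Vec3(x / it, y / it, z / it) }
}

// Relative position of the virtual sound source with respect to the head position.
fun relativePosition(source: Vec3, headPosition: Vec3): Vec3 = source - headPosition

// Angle (in radians) by which the head orientation deviates from facing the source.
fun relativeDirection(source: Vec3, headPosition: Vec3, headForward: Vec3): Float {
    val toSource = (source - headPosition).normalized()
    return acos(headForward.normalized().dot(toSource).coerceIn(-1f, 1f))
}
```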
It should be understood that the embodiment of the present application is only described as an example of determining the relative position and the relative direction of the head of the user with respect to the virtual sound source through the image captured by the camera, and in other embodiments, the relative position and the relative direction of the head of the user with respect to the virtual sound source may also be determined in other manners.
For example, the headset is provided with a motion sensor such as an acceleration sensor and/or a gyroscope, the headset can detect the motion information of the head of the user through the motion sensor, the detected motion information of the head of the user is sent to the video APP, and the video APP combines the motion information of the head of the user and the user image collected by the camera to comprehensively analyze the relative position and the relative direction of the head of the user relative to the virtual sound source. In this way, the accuracy of detection of the position and orientation of the user's head can be further improved.
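A minimal sketch of such a combination is shown below, assuming the headset gyroscope reports a yaw rate and the camera yields an absolute yaw estimate in the same frame; the complementary-filter form and the 0.98 weight are illustrative choices, not values specified by the embodiment.

```kotlin
// Minimal sketch of combining the yaw integrated from the headset gyroscope with the
// yaw estimated from the camera image; the filter form and weight are assumptions.
class HeadYawFusion(private val gyroWeight: Float = 0.98f) {
    private var fusedYaw = 0f

    // gyroYawRate: rad/s from the headset; dt: seconds since the last update;
    // cameraYaw: absolute yaw (rad) estimated from the latest user image.
    fun update(gyroYawRate: Float, dt: Float, cameraYaw: Float): Float {
        val integrated = fusedYaw + gyroYawRate * dt                       // responsive but drifts
        fusedYaw = gyroWeight * integrated + (1 - gyroWeight) * cameraYaw  // camera corrects the drift
        return fusedYaw
    }
}
```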
In addition, the video APP may also analyze the position and direction of the user's head with respect to the virtual sound source through the face ID technology. For example, the terminal 100 is configured with a sensor set and a dot matrix projector; a 3D face model of the user is constructed by means of the sensor set and the dot matrix projected by the dot matrix projector to recognize the face information of the user, and the relative position and relative direction of the user's head with respect to the virtual sound source are analyzed based on the recognized face information. The sensor set may include an ambient light sensor, a distance sensor, an infrared lens, a floodlight sensing element, and the like.
As an example, the face ID technique may be used to analyze the position and direction of the user's head with respect to the virtual sound source when the terminal 100 is not equipped with a camera or the camera cannot be used. Or, on the basis of adopting the face ID technology, the relative position and the relative direction of the head of the user relative to the virtual sound source are comprehensively analyzed by combining the motion information of the head of the user, which can be detected by the earphone through the motion sensor. The embodiment of the present application does not limit this.
Step 910: The video APP processes the audio parameters of the video APP according to the determined relative position and relative direction to generate spatial audio 1 with the virtual sound source 1 as the virtual sound source.
That is, the video APP processes the audio parameters of the video APP according to the determined relative position and relative direction, and generates the spatial audio 1 carrying orientation information, where the orientation information is used to indicate the position and direction of the virtual sound source 1 of the spatial audio 1, so that the spatial audio 1 is perceived by the user as originating from the virtual sound source 1.
For example, the video APP may perform directional audio filtering on audio parameters according to the determined relative position and relative direction, and adjust the sound frequency received by the ears of the user to generate the spatial audio 1 with the virtual sound source 1 as the virtual sound source.
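As a greatly simplified stand-in for such directional filtering, the following Kotlin sketch derives an inter-aural level difference and a small inter-aural time difference from the azimuth of the virtual sound source relative to the head orientation. A production implementation would typically use HRTF filtering; all constants here are assumptions.

```kotlin
import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.sin

// Simplified stand-in for directional filtering: the azimuth of the virtual sound
// source relative to the head is turned into a level difference and a small time
// difference between the ears. All constants are illustrative assumptions.
fun renderSpatial(mono: FloatArray, azimuthRad: Double, sampleRate: Int = 48_000): Pair<FloatArray, FloatArray> {
    // Constant-power pan between left (-90 degrees) and right (+90 degrees).
    val pan = (azimuthRad.coerceIn(-PI / 2, PI / 2) + PI / 2) / PI   // 0.0 = left, 1.0 = right
    val leftGain = cos(pan * PI / 2).toFloat()
    val rightGain = sin(pan * PI / 2).toFloat()

    // Inter-aural time difference: delay the ear farther from the source (max ~0.6 ms).
    val itdSamples = (sin(azimuthRad) * 0.0006 * sampleRate).toInt()
    val leftDelay = if (itdSamples > 0) itdSamples else 0
    val rightDelay = if (itdSamples < 0) -itdSamples else 0

    val left = FloatArray(mono.size)
    val right = FloatArray(mono.size)
    for (i in mono.indices) {
        if (i >= leftDelay) left[i] = mono[i - leftDelay] * leftGain
        if (i >= rightDelay) right[i] = mono[i - rightDelay] * rightGain
    }
    return left to right
}
```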
Step 911: the video APP sends spatial audio 1 to the headphones.
For example, the video APP may send spatial audio 1 to the headphones in the form of an audio signal.
Step 912: the headphones play spatial audio 1.
The played spatial audio 1 can be perceived as originating from the virtual sound source 1 after being listened to by the user, i.e. the user perceives auditorily that the heard sound is emitted from the center of the screen, thus realizing that the audio parameters of the video APP are simulated as the spatial audio originating from the center of the screen.
Then, if the head position or orientation of the user changes, the relative position of the head position of the user with respect to the virtual sound source 1 and the relative direction of the head orientation with respect to the direction of the head toward the virtual sound source 1 may be continuously determined, and the spatial audio to be output to the headphones is adjusted according to the change of the relative position and the relative direction, so that the spatial audio output to the headphones is perceived by the user as originating from the virtual sound source 1.
In addition, if the window presentation form of the video playback window 1 is changed, the position of the virtual sound source of the spatial audio output to the headphones is also changed accordingly. Next, referring to fig. 10, a process of correspondingly adjusting the spatial audio output to the headphones when the video playback window 1 is switched to a window presentation form in which the percentage of the window on the screen is less than 25% will be described.
Step 913: the user performs a window reduction operation for the video playback window 1.
Step 914: and the video APP reduces the window of the video playing window 1, and the video playing window 2 after the window reduction is displayed on the application interface.
The window size of the video playing window 2 is smaller than that of the video playing window 1.
Step 915: The window manager detects that the occupation ratio of the video playing window 2 on the screen is 25%.
Step 916: the window manager sends the occupation ratio of the video playing window 2 on the screen to the video APP as a detection result 2.
Step 917: The video APP receives the detection result 2 sent by the window manager, determines, according to the detection result 2, that the occupation ratio of the video playing window 2 on the screen is less than or equal to 25%, and uses the window center position of the video playing window 2 as the virtual sound source 2.
It should be understood that the position of the virtual sound source corresponding to the window presentation form in which the occupation ratio of the video playing window 2 on the screen is less than or equal to 25% may also be set to another position, which is not limited in the embodiment of the present application.
Step 918: the camera sends the collected user image to the video APP.
Step 919: the video APP receives the user image sent by the camera, and determines the relative position of the head position of the user relative to the virtual sound source 2 and the relative direction of the head orientation relative to the direction of the head orientation to the virtual sound source 2 according to the user image and the position of the virtual sound source 2.
Step 920: The video APP processes the audio parameters of the video APP according to the determined relative position and relative direction to generate spatial audio 2 with the virtual sound source 2 as the virtual sound source.
Step 921: the video APP sends spatial audio 2 to the headphones.
For example, the video APP may send the spatial audio 2 to the headphones in the form of an audio signal.
Step 922: the headphones play the spatial audio 2.
The played spatial audio 2 can be perceived as originating from the virtual sound source 2 after being listened to by the user, i.e. the user audibly perceives that the heard sound is emitted from the window center of the video playing window 2, thus realizing that the audio parameters of the video APP are simulated as the spatial audio originating from the window center of the video playing window.
In addition, when the video playing window 2 changes, the position of the virtual sound source 2 changes, or the head position or orientation of the user changes, the relative position of the user's head with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2 also change. Therefore, after outputting the spatial audio 2, the video APP continues to determine the relative position of the user's head with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2, and continues to adjust the spatial audio to be output to the headphones according to the changes in the relative position and relative direction, so that the spatial audio output to the headphones is perceived by the user as originating from the window center of the video playing window 2.
If the video playing window of the video APP is minimized to be in the icon form, the position of the virtual sound source of the spatial audio can be switched to the position of the minimized icon of the video playing window. Next, referring to fig. 11, a process of correspondingly adjusting spatial audio output to headphones when the video playback window 2 is minimized to an icon will be described.
Step 923: the user performs a minimizing operation for the video play window 2.
For example, the minimizing operation may be an operation of clicking a minimizing button in the upper right corner of the video playing window 2, and may also be other minimizing operations.
Step 924: the video APP minimizes the video playing window 2, and the minimized icon of the video playing window 2 is displayed on the screen.
Wherein, the minimized icon may be an application icon of the video APP. Of course, the minimize icon may also be in other forms, such as a minimize icon displayed in a taskbar, and the like.
Step 925: The window manager detects that the video playing window 2 is minimized to an icon for display.
Step 926: The window manager sends, to the video APP as the detection result 3, an indication that the video playing window 2 is minimized to an icon for display.
Step 927: The video APP receives the detection result 3 sent by the window manager, and uses the position of the minimized icon of the video playing window 2 as the virtual sound source 3 according to the detection result 3.
For example, the icon center position of the minimized icon is set as the virtual sound source 3.
Step 928: the camera sends the collected user image to the video APP.
Step 929: the video APP receives the user image sent by the camera, and determines the relative position of the head position of the user relative to the virtual sound source 3 and the relative direction of the head orientation relative to the direction of the head orientation to the virtual sound source 3 according to the user image and the position of the virtual sound source 3.
Step 930: The video APP processes the audio parameters of the video APP according to the determined relative position and relative direction to generate spatial audio 3 with the virtual sound source 3 as the virtual sound source.
Step 931: the video APP sends spatial audio 3 to the headphones.
For example, the video APP may send the spatial audio 3 to the headphones in the form of an audio signal.
Step 932: the headphones play the spatial audio 3.
The played spatial audio 3 can be perceived as originating from the virtual sound source 3 after being listened to by the user, i.e. the user perceives aurally that the heard sound is emitted from the minimized icon of the video playing window 2, thus realizing that the audio parameters of the video APP are simulated as the spatial audio originating from the minimized icon of the video playing window.
If the video playing window of the video APP is switched to the background, the position of the virtual sound source of the spatial audio can be switched to the position which is moved by the specified distance in the direction back to the screen. Next, referring to fig. 12, a process of correspondingly adjusting the spatial audio output to the headphone when the video playing window 2 is switched to the background will be described.
Step 933: the user performs an operation of switching the video playback window 2 from the foreground to the background.
Step 934: and the video APP switches the video playing window 2 to background display.
For example, if the user opens the application interfaces of other applications to switch the video playback window 2 to the background, the terminal 100 displays the application interfaces of the other applications and the video playback window 2 blocked by the application interfaces of the other applications on the screen.
Step 935: the window manager detects that video playback window 2 is switched to the background display.
Step 936: The window manager sends, to the video APP as the detection result 4, an indication that the video playing window 2 has been switched to background display.
Step 937: The video APP receives the detection result 4 sent by the window manager, and, according to the detection result 4, moves the position of the virtual sound source 2 before switching by 20cm in the direction opposite to the light emitting direction of the screen to obtain the virtual sound source 4.
For example, a spatial coordinate system is established in which the horizontal direction of the screen is the X axis, the direction perpendicular to the X axis in the plane of the screen is the Y axis, and the direction perpendicular to both the X axis and the Y axis is the Z axis. Based on the detection result 4, the virtual sound source before switching can be moved 20cm in the Z-axis direction toward the back of the screen without changing the position of the virtual sound source on the X axis and the Y axis.
It should be understood that, in this embodiment, the position of the virtual sound source is only described as moving by 20cm, and the moving distance of the virtual sound source may also be other distances, for example, 10cm or 30cm, which is not limited in this embodiment.
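For illustration, the following Kotlin sketch expresses the movement of the virtual sound source along the Z axis when the window is switched to the background, and the corresponding reverse movement when it returns to the foreground. The coordinate convention (+Z toward the viewer) and the 0.20 m default are assumptions mirroring the 20cm example.

```kotlin
// Sketch of shifting the virtual sound source along the Z axis only; the
// coordinate convention and the default distance are assumptions.
data class Vec3(val x: Float, val y: Float, val z: Float)

// Window switched to the background: move opposite to the light-emitting direction.
fun moveBehindScreen(source: Vec3, distanceMeters: Float = 0.20f): Vec3 =
    source.copy(z = source.z - distanceMeters)

// Window switched back to the foreground: move back toward the viewer.
fun moveInFrontOfScreen(source: Vec3, distanceMeters: Float = 0.20f): Vec3 =
    source.copy(z = source.z + distanceMeters)
```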
Step 938: the camera sends the collected user image to the video APP.
Step 939: the video APP receives the user image sent by the camera, and determines the relative position of the head position of the user relative to the virtual sound source 4 and the relative direction of the head orientation relative to the direction of the head orientation to the virtual sound source 4 according to the user image and the position of the virtual sound source 4.
Step 940: The video APP processes the audio parameters of the video APP according to the determined relative position and relative direction to generate spatial audio 4 with the virtual sound source 4 as the virtual sound source.
Step 941: the video APP sends spatial audio 4 to the headphones.
For example, the video APP may send the spatial audio 4 to the headphones in the form of an audio signal.
Step 942: the headphones play the spatial audio 4.
The played spatial audio 4 can be perceived as originating from the virtual sound source 4 after being listened to by the user, so that the user aurally perceives that the sound source of the heard sound has moved toward the back of the screen. In this way, when the window is switched to the background, the audio parameters of the video APP are simulated as spatial audio whose virtual sound source moves toward the back of the screen.
If the video playing window of the video APP is switched to the foreground again, the position of the virtual sound source of the spatial audio can be switched to the position which is moved towards the front of the screen by the specified distance. Next, referring to fig. 13, a process of correspondingly adjusting the spatial audio output to the headphones when the video playback window 2 is switched to the foreground again will be described.
Step 943: the user performs an operation of switching the video playback window 2 from the background to the foreground.
Step 944: and the video APP switches the video playing window 2 to the foreground again for displaying.
For example, if the user switches the video playing window 2 from the background back to the foreground, the terminal 100 displays the video playing window 2 on the screen without it being blocked by the application interfaces of other applications.
Step 945: the window manager detects that video playback window 2 is switched to the foreground display.
Step 946: The window manager sends, to the video APP as the detection result 5, an indication that the video playing window 2 has been switched to foreground display.
Step 947: The video APP receives the detection result 5 sent by the window manager, and, according to the detection result 5, moves the position of the virtual sound source 4 before switching by 20cm in the light emitting direction of the screen to obtain the virtual sound source 5.
For example, in the above-described spatial coordinate system, based on the detection result 5, the position of the virtual sound source before switching is moved 20cm in the Z-axis direction toward the front of the screen without changing its position on the X axis and the Y axis.
Here, the position of the virtual sound source 5 is the same as the position of the virtual sound source 2.
Step 948: the camera sends the collected user image to the video APP.
Step 949: the video APP receives the user image sent by the camera, and determines the relative position of the head position of the user relative to the virtual sound source 5 and the relative direction of the head orientation relative to the direction of the head facing the virtual sound source 5 according to the user image.
Step 950: The video APP processes the audio parameters of the video APP according to the determined relative position and relative direction to generate spatial audio 5 with the virtual sound source 5 as the virtual sound source.
Step 951: the video APP sends spatial audio 5 to the headphones.
For example, the video APP may send the spatial audio 5 to the headphones in the form of an audio signal.
Step 952: the headphones play the spatial audio 5.
The played spatial audio 5 can be perceived as originating from the virtual sound source 5 after being listened to by the user, so that the user aurally perceives that the sound source of the heard sound has moved toward the front of the screen. In this way, when the window is switched back to the foreground, the audio parameters of the video APP are simulated as spatial audio whose virtual sound source moves toward the front of the screen.
Fig. 14 is a flowchart of another audio processing method provided in an embodiment of the present application, where an execution subject of the method is a video APP installed in a terminal 100, and the terminal 100 is connected to an earphone, as shown in fig. 14, the method includes the following steps:
step 1401: the video APP plays videos according to video playing operation of a user and outputs spatial audio to the earphone.
The video APP can display a video playing window in the process of playing the video. The video APP can determine the position of a virtual sound source of the spatial audio to be output according to the window presentation form of the video playing window, the audio parameters of the video are processed according to the position of the virtual sound source, the spatial audio which can be perceived by a user as originating from the virtual sound source is generated, and the generated spatial audio is output to the earphone.
In the process of outputting the spatial audio to the earphone, if the window presentation mode of the video playing window changes, the position of the virtual sound source of the spatial audio can be switched according to the window presentation mode of the video playing window, and then the spatial audio output to the earphone is adjusted according to the position of the switched virtual sound source, so that the spatial audio output to the earphone is perceived by a user as originating from the switched virtual sound source.
Next, the following steps 1402 to 1414 illustrate a process of switching the position of the virtual sound source of the spatial audio according to the change of the window presentation mode of the video playback window.
Step 1402: and receiving window operation of a user on a video playing window of the video APP in the process of playing the video.
Step 1403: and responding to the operation of the user, and adjusting the window presentation mode of the video playing window by the video APP.
Step 1404: the video APP detects whether the ratio of the adjusted video playing window on the screen is greater than a ratio threshold, if so, step 1405 is executed, and if not, step 1407 is executed.
Step 1405: the position of a virtual sound source of spatial audio is switched to the center position of the screen.
Step 1406: the video APP adjusts the spatial audio output to the earphones based on the position of the switched virtual sound source, so that the spatial audio output to the earphones is perceived by the user as originating from the switched virtual sound source.
Step 1407: the video APP detects whether the ratio of the adjusted video playing window on the screen is smaller than or equal to a ratio threshold, if so, step 1408 is executed, and if not, step 1409 is executed.
Step 1408: the video APP switches the position of the virtual sound source of the spatial audio to the window center position of the adjusted video playing window, and then step 1406 is executed.
Step 1409: the video APP detects whether the adjusted video playing window is minimized to be a minimized icon, if so, step 1410 is executed, and if not, step 1411 is executed.
Step 1410: and the video APP switches the position of the virtual sound source to the position of the minimized icon, and then the step 1406 is executed.
Step 1411: the video APP detects whether the adjusted video playing window is switched to the background, if so, step 1412 is executed, and if not, step 1413 is executed.
Step 1412: the video APP switches the position of the virtual sound source to a position shifted by a predetermined distance in a direction opposite to the light emitting direction of the screen, and then performs step 1406.
Step 1413: and the video APP detects whether the adjusted video playing window is switched to the foreground, if so, the step 1414 is executed, if not, the position of the virtual sound source is not switched, the step 1402 is returned, the operation of the user on the video playing window is continuously received, and whether the position of the virtual sound source needs to be switched is judged according to the change of the window presenting mode of the video playing window.
Step 1414: the video APP switches the position of the virtual sound source to a position moved by a predetermined distance in the light emitting direction of the screen, and then performs step 1406.
It should be understood that the embodiment of the present application only takes, as an example, detecting the changed window presentation form in the above detection order when the window presentation form of the video playing window changes; this does not constitute a limitation on the detection order. In other embodiments, detection may also be performed in other orders, which is not limited in the embodiment of the present application. In addition, the embodiment of the present application only takes a video playing window as an example; in other embodiments, the window may also be an application interface or another window of an application interface, which is not limited in the embodiment of the present application.
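For illustration, the decision flow of steps 1402 to 1414 can be summarized by the following Kotlin sketch, which maps the detected change in the window presentation form to a new virtual sound source position. The types, the notion of a previous source position, and the default values are assumptions introduced for this example.

```kotlin
// Sketch of the decision flow of steps 1402 to 1414; all types and defaults are
// illustrative assumptions.
data class Vec3(val x: Float, val y: Float, val z: Float)

sealed class WindowChange {
    data class Resized(val screenRatio: Float, val windowCenter: Vec3) : WindowChange()
    data class Minimized(val iconPosition: Vec3) : WindowChange()
    object ToBackground : WindowChange()
    object ToForeground : WindowChange()
    object None : WindowChange()
}

fun switchVirtualSource(
    change: WindowChange,
    screenCenter: Vec3,
    previousSource: Vec3,
    ratioThreshold: Float = 0.25f,
    zShift: Float = 0.20f
): Vec3 = when (change) {
    is WindowChange.Resized ->
        if (change.screenRatio > ratioThreshold) screenCenter           // steps 1404-1405
        else change.windowCenter                                        // steps 1407-1408
    is WindowChange.Minimized -> change.iconPosition                    // steps 1409-1410
    WindowChange.ToBackground -> previousSource.copy(z = previousSource.z - zShift)  // steps 1411-1412
    WindowChange.ToForeground -> previousSource.copy(z = previousSource.z + zShift)  // steps 1413-1414
    WindowChange.None -> previousSource                                 // nothing to switch
}
```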
Next, the audio processing method proposed in the embodiment of the present application is described in detail by taking the video call APP as an example with reference to the software system shown in fig. 8. Fig. 15 is a flowchart of another audio processing method provided in an embodiment of the present application, and as shown in fig. 15, the method includes the following steps:
step 1501: C opens the video call APP, conducts a three-party video call with A and B through the video call APP, and wears headphones supporting spatial audio to listen to the audio.
Step 1502: the video call APP displays a video call interface that includes A, B, C user images of three people.
It should be understood that the video call interface may not include the user image of C, which is not limited in the embodiment of the present application.
Step 1503: The video call APP calls the camera API, and starts the camera through the camera API.
Step 1504: the camera collects the user image of the C wearing the earphone.
Wherein the user image is typically an image comprising the user's head.
Step 1505: The camera sends the collected user image of C to the video call APP.
It should be noted that, in the embodiment of the present application, the user image of C acquired by the camera is merely taken as an example for description, and in other embodiments, if the video call interface includes the user image of C, the video call APP may also directly acquire the user image of C from the video call interface, which is not limited in the embodiment of the present application.
Step 1506: if the video call APP detects that A is speaking, the mouth position of A in the video call interface is determined, and the mouth position of A is used as a virtual sound source 1.
In one embodiment, the video call APP may perform image recognition on the user images of the individual users to detect the user who is speaking. For example, the video call APP may recognize the facial movement or mouth movement of A from the user image of A, and determine whether A is speaking based on the facial movement or mouth movement of A.
In another embodiment, the video call APP may also detect the speaking user according to the strength of the received audio signals from the respective users. For example, it is determined whether the intensity of the received audio signal from A is greater than an intensity threshold, and if so, it is determined that A is speaking.
It should be understood that the video call APP may also use other manners to detect the user who is speaking, which is not limited in this embodiment of the present application.
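As a minimal sketch of the audio-signal-strength approach mentioned above, the following Kotlin code judges whether a participant is speaking by comparing the short-term energy of that participant's audio stream with an intensity threshold; the RMS measure and the threshold value are illustrative assumptions.

```kotlin
import kotlin.math.sqrt

// Sketch of speaker detection by audio signal strength; the RMS measure and the
// threshold are assumptions made for this example.
fun isSpeaking(pcm: FloatArray, rmsThreshold: Float = 0.02f): Boolean {
    if (pcm.isEmpty()) return false
    var sum = 0.0
    for (s in pcm) sum += s * s
    return sqrt(sum / pcm.size) > rmsThreshold
}

// Returns the identifiers of the participants currently judged to be speaking,
// so each of them can be assigned a virtual sound source.
fun speakingUsers(streams: Map<String, FloatArray>): List<String> =
    streams.filterValues { isSpeaking(it) }.keys.toList()
```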
Step 1507: The video call APP receives the user image of C sent by the camera, and determines, according to the user image of C and the position of the virtual sound source 1, the relative position of the current head position of C with respect to the virtual sound source 1 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 1.
Step 1508: The video call APP processes the audio parameters of the video call APP according to the determined relative position and relative direction to generate spatial audio 1 with the virtual sound source 1 as the virtual sound source.
Step 1509: the video call APP sends spatial audio 1 to the headphones.
For example, the video call APP may send the spatial audio 1 to the headphones in the form of an audio signal.
Step 1510: the headphones play spatial audio 1.
The spatial audio 1 played by the headphones, after being listened to by the user, can be perceived by the user as originating from the mouth position of A, who is speaking, in the video call interface.
Step 1511: If the video call APP detects that the speaker switches from A to B, the mouth position of B in the video call interface is determined, and the mouth position of B is used as the virtual sound source 2.
Step 1512: The camera sends the collected user image of C to the video call APP.
Step 1513: The video call APP receives the user image of C sent by the camera, and determines, according to the user image of C and the position of the virtual sound source 2, the relative position of the current head position of C with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2.
Step 1514: The video call APP processes the audio parameters of the video call APP according to the determined relative position and relative direction to generate spatial audio 2 with the virtual sound source 2 as the virtual sound source.
Step 1515: the video call APP sends spatial audio 2 to the headphones.
For example, the video call APP may send the spatial audio 2 to the headphones in the form of an audio signal.
Step 1516: the headphones play the spatial audio 2.
The spatial audio 2 played by the headphones, after being listened to by the user, can be perceived by the user as originating from the mouth position of B, who is speaking, in the video call interface.
Therefore, the position of the virtual sound source can be changed according to the position of the mouth of the speaker in the video call interface, and the output of the spatial audio is adjusted based on the adjusted position of the virtual sound source, so that the user can perceive that the audio of the video call originates from the mouth of the speaker, and the audio effect of the spatial audio and the auditory experience of the user are improved.
Fig. 16 is a flowchart of another audio processing method provided in an embodiment of the present application. The execution subject of the method is a video call APP installed in the terminal 100, and the terminal 100 is connected to headphones. As shown in fig. 16, the method includes the following steps:
step 1601: A, B and C are engaged in a video call, and C wears headphones that support spatial audio to listen to the audio.
Step 1602: the video call APP displays a video call interface that includes A, B, C user images of three people.
Wherein, this video conversation APP is the video conversation APP installed on the terminal 100 that C used. Of course, the video call APP may also be a video call APP installed in a terminal used by another user in the video call process, which is not limited in this embodiment of the present application.
Step 1603: if the video call APP detects that A is speaking in the video call process, the mouth position of A in the video call interface is determined, and the mouth position of A is used as a virtual sound source 1.
Step 1604: The video call APP acquires a user image of C through the camera, determines, according to the user image of C acquired by the camera and the position of the virtual sound source 1, the relative position of the head position of C with respect to the virtual sound source 1 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 1, and adjusts the spatial audio output to the headphones according to the determined relative position and relative direction.
In addition, after the mouth position of A is used as the virtual sound source 1 and the spatial audio is output to the headphones, it may further be detected whether the position of A in the video call interface has changed. If so, the process returns to step 1603 to continue determining the mouth position of A in the video call interface and using it as the virtual sound source 1. If not, whether the head position or head orientation of C has changed is judged according to the user image of C collected by the camera. If the head position or head orientation of C has changed, the process returns to step 1604 to continue determining, according to the user image of C collected by the camera and the position of the virtual sound source 1, the relative position of the head position of C with respect to the virtual sound source 1 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 1. If the head position and head orientation of C have not changed, the spatial audio continues to be output to the headphones.
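For illustration, the re-check logic described above might be organized as in the following Kotlin sketch, which re-renders the spatial audio only when the speaker's position in the call interface or the wearer's head pose has changed; all types and callbacks are assumptions introduced for this example.

```kotlin
// Sketch of the change-driven update loop described above; all names are assumptions.
data class Vec3(val x: Float, val y: Float, val z: Float)
data class HeadPose(val position: Vec3, val forward: Vec3)

class CallSpatialAudioUpdater(
    private val speakerMouthPosition: () -> Vec3,   // from the video call interface
    private val currentHeadPose: () -> HeadPose,    // from the camera image of C
    private val reRender: (source: Vec3, head: HeadPose) -> Unit
) {
    private var lastSource: Vec3? = null
    private var lastHead: HeadPose? = null

    fun onTick() {
        val source = speakerMouthPosition()
        val head = currentHeadPose()
        // Re-render only if either the virtual sound source or the head pose changed.
        if (source != lastSource || head != lastHead) {
            reRender(source, head)
            lastSource = source
            lastHead = head
        }
    }
}
```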
Step 1605: If the video call APP detects, during the video call, that the speaker switches to B, the mouth position of B in the video call interface is determined, and the mouth position of B is used as the virtual sound source 2.
Step 1606: The video call APP acquires a user image of C through the camera, determines, according to the user image of C acquired by the camera and the position of the virtual sound source 2, the relative position of the head position of C with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2, and adjusts the spatial audio output to the headphones according to the determined relative position and relative direction.
Step 1607: the video call APP detects whether the position of B in the video call interface changes. If yes, the process returns to step 1605 to continue to determine the mouth position of B in the video call interface, and the newly determined mouth position of B is used as the virtual sound source 2, otherwise, step 1608 is executed.
Step 1608: The video call APP detects, according to the user image of C collected by the camera, whether the head position or head orientation of C has changed. If so, the process returns to step 1606 to continue determining, according to the user image of C collected by the camera and the position of the virtual sound source 2, the relative position of the head position of C with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2, so as to adjust the spatial audio output to the headphones according to the re-determined relative position and relative direction. If not, the spatial audio continues to be output to the headphones.
In addition, if it is detected during the video call that A and B speak simultaneously, the video call APP may further determine the mouth position of A and the mouth position of B in the video call interface, and use the mouth position of A and the mouth position of B as the virtual sound source 1 and the virtual sound source 2, respectively. Then, from the user image of C collected by the camera, the relative position of the head position of C with respect to the virtual sound source 1 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 1, and the relative position of the head position of C with respect to the virtual sound source 2 and the relative direction of the head orientation with respect to the direction from the head towards the virtual sound source 2, are determined respectively. The spatial audio output to the headphones is adjusted according to the determined relative positions and relative directions, so that the spatial audio played by the headphones is perceived by the user as originating from the two virtual sound sources, i.e. the user perceives that the sound of A is emitted from the mouth position of A on the screen and the sound of B is emitted from the mouth position of B on the screen.
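As a minimal sketch of handling two simultaneous virtual sound sources, the following Kotlin code renders each speaker's audio against its own source and sums the two binaural results; the renderSpatial callback stands for the per-source processing described earlier and is an assumption of this example.

```kotlin
// Sketch of mixing two simultaneously rendered virtual sound sources; the
// renderSpatial callback is an assumption standing for the per-source processing.
fun mixTwoSources(
    audioA: FloatArray, azimuthA: Double,
    audioB: FloatArray, azimuthB: Double,
    renderSpatial: (FloatArray, Double) -> Pair<FloatArray, FloatArray>
): Pair<FloatArray, FloatArray> {
    val (leftA, rightA) = renderSpatial(audioA, azimuthA)
    val (leftB, rightB) = renderSpatial(audioB, azimuthB)
    val n = maxOf(leftA.size, leftB.size)
    // Sum the two binaural streams and halve the result to avoid clipping.
    val left = FloatArray(n) { i -> (leftA.getOrElse(i) { 0f } + leftB.getOrElse(i) { 0f }) * 0.5f }
    val right = FloatArray(n) { i -> (rightA.getOrElse(i) { 0f } + rightB.getOrElse(i) { 0f }) * 0.5f }
    return left to right
}
```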
Fig. 17 is a schematic structural diagram of a terminal according to an embodiment of the present application. Referring to fig. 17, the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation to the terminal 100. In other embodiments of the present application, terminal 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be, among other things, a neural center and a command center of the terminal 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
The charging management module 140 is configured to receive charging input from a charger. The charging management module 140 may also supply power to the terminal 100 through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the terminal 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in terminal 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. Such as: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a wireless communication solution applied to the terminal 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive an electromagnetic wave through the antenna 1, perform processing such as filtering and amplification on the received electromagnetic wave, and transmit the processed electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify a signal modulated by the modem processor, and convert the amplified signal into an electromagnetic wave through the antenna 1 for radiation.
The modem processor may include a modulator and a demodulator. The modulator is configured to modulate a to-be-transmitted low-frequency baseband signal into a medium- or high-frequency signal. The demodulator is configured to demodulate a received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transfers the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, and the like), or displays an image or a video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or another functional module.
The wireless communication module 160 may provide solutions for wireless communication applied to the terminal 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves. For example, the terminal 100 may be connected to a headset supporting spatial audio through the wireless communication module 160, and the spatial audio may be transmitted to the headset through the wireless communication module 160 and played by the headset.
In some embodiments, the antenna 1 of the terminal 100 is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160 so that the terminal 100 can communicate with a network and other devices through a wireless communication technology.
The terminal 100 implements a display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal 100 may include 1 or N display screens 194, where N is an integer greater than 1.
The terminal 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, and the application processor, etc.
The ISP is used to process data fed back by the camera 193. For example, when a photograph is taken, the shutter is opened, and light is transmitted to the photosensitive element of the camera through the lens. The photosensitive element converts the optical signal into an electrical signal and transmits the electrical signal to the ISP for processing, so that it is converted into an image visible to the naked eye. The ISP may also perform algorithm optimization on the noise, brightness, and skin color of the image. The ISP may further optimize parameters such as the exposure and color temperature of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or videos. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then transfers the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the terminal 100 may include 1 or N cameras 193, where N is an integer greater than 1.
The digital signal processor is used to process digital signals; in addition to digital image signals, it may also process other digital signals. For example, when the terminal 100 selects a frequency point, the digital signal processor is configured to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal 100 may support one or more video codecs. In this way, the terminal 100 can play or record video in a plurality of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent recognition of the terminal 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU. For example, the NPU may recognize the content of the application interface, such as the sound-emitting position of a speaker who is speaking in the application interface, or the head motion of the user in a user image.
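As an illustrative aid only (not part of the claimed method), the following minimal sketch shows how a recognized speaker region in an application interface could be mapped to a virtual sound source position. The detector output format, the coordinate convention, and the default depth are assumptions made purely for illustration.

    from dataclasses import dataclass

    @dataclass
    class SpeakerBox:
        # Bounding box of the recognized speaker, in screen pixel coordinates.
        left: float
        top: float
        width: float
        height: float

    def virtual_source_from_speaker(box: SpeakerBox, screen_w: int, screen_h: int,
                                    depth_m: float = 0.5) -> tuple:
        """Map the center of the recognized speaker region to a 3D position
        relative to the screen center (x to the right, y up, z toward the user)."""
        cx = box.left + box.width / 2.0
        cy = box.top + box.height / 2.0
        x = (cx / screen_w) * 2.0 - 1.0      # normalize to [-1, 1]
        y = 1.0 - (cy / screen_h) * 2.0      # invert: pixel y grows downward
        return (x, y, depth_m)

    # Example: a speaker recognized near the top-right corner of a 2400x1080 screen.
    print(virtual_source_from_speaker(SpeakerBox(1800, 100, 300, 300), 2400, 1080))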
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, to save files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, and the executable program code includes instructions. By executing the instructions stored in the internal memory 121, the processor 110 performs the various functional applications and data processing of the terminal 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data (such as audio data and a phone book) created during use of the terminal 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
The terminal 100 can implement audio functions, such as music playing, recording, etc., through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110. For example, the audio module 170 may convert the spatial audio information of an application into an analog audio signal and output the analog audio signal to a headphone.
The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal. The terminal 100 can be used to listen to music or a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the terminal 100 receives a call or voice information, the voice can be heard by bringing the receiver 170B close to the ear.
The microphone 170C, also called a "mic", is used to convert a sound signal into an electrical signal. When making a call or sending voice information, the user can input a sound signal into the microphone 170C by speaking with the mouth close to the microphone 170C. The terminal 100 may be provided with at least one microphone 170C. In other embodiments, the terminal 100 may be provided with two microphones 170C to collect sound signals and also implement a noise reduction function. In other embodiments, the terminal 100 may further be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and so on.
The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface. For example, the headphone interface 170D is used to connect a wired headphone supporting spatial audio, and the spatial audio is output to the headphone through the headphone interface 170D.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The terminal 100 may receive a key input, and generate a key signal input related to user setting and function control of the terminal 100. The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration prompts as well as for touch vibration feedback. Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc. The SIM card interface 195 is used to connect a SIM card.
In one embodiment, the terminal 100 is connected to an earphone that supports playing spatial audio. The earphone may be a head-mounted (over-ear) earphone or an in-ear earphone; the type of the earphone is not limited in the embodiments of the present application.
As one example, the earphone is provided with motion sensors such as a gyroscope sensor and an acceleration sensor. The earphone may detect, through the motion sensors, head motion information of the user wearing the earphone, and send the detected head motion information to the terminal 100.
The gyroscope sensor may be used to determine the motion posture of the user's head. In some embodiments, the angular velocities of the user's head about three axes (that is, the x, y, and z axes) may be determined by the gyroscope sensor. The acceleration sensor may detect the magnitude of the acceleration of the user's head in various directions (generally along three axes), and may detect the magnitude and direction of gravity when the head is stationary.
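For illustration only, the following sketch shows one common way such sensor readings can be fused into a head orientation estimate: the gyroscope angular velocities are integrated over time, and the accelerometer's gravity reading is used to correct pitch and roll drift (yaw has no gravity reference and is tracked by integration alone). The axis conventions and the blending coefficient are assumptions, not values taken from this application.

    import math

    def update_head_orientation(yaw, pitch, roll, gyro, accel, dt, alpha=0.98):
        """Complementary-filter sketch. gyro = (gx, gy, gz) in rad/s,
        accel = (ax, ay, az) in m/s^2, dt in seconds; angles in radians."""
        gx, gy, gz = gyro
        ax, ay, az = accel
        # Short-term estimate: integrate the gyroscope's angular velocities.
        yaw += gz * dt                      # yaw from gyro integration only
        pitch_gyro = pitch + gx * dt
        roll_gyro = roll + gy * dt
        # Long-term estimate: derive pitch and roll from the gravity vector.
        pitch_acc = math.atan2(ay, math.sqrt(ax * ax + az * az))
        roll_acc = math.atan2(-ax, az)
        # Blend: trust the gyro for fast motion, the accelerometer against drift.
        pitch = alpha * pitch_gyro + (1.0 - alpha) * pitch_acc
        roll = alpha * roll_gyro + (1.0 - alpha) * roll_acc
        return yaw, pitch, roll

    # Example: head initially level, a slight nod sampled over 10 ms.
    print(update_head_orientation(0.0, 0.0, 0.0, (0.3, 0.0, 0.0), (0.0, 0.29, 9.76), 0.01))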
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When software is used for implementation, the implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a Digital Versatile Disk (DVD)), a semiconductor medium (for example, a Solid State Disk (SSD)), or the like.
The above description is not intended to limit the present application to the particular embodiments disclosed, but rather, the present application is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present application.

Claims (8)

1. An audio processing method, applied to a terminal, the method comprising:
receiving an application operation on a target application;
presenting an application interface of the target application according to the application operation;
determining an interface presentation form of the application interface, and determining the position of a virtual sound source of the spatial audio to be output according to the interface presentation form of the application interface;
determining a relative orientation of the head of a user wearing an earphone with respect to the virtual sound source according to the position of the virtual sound source, wherein the earphone is connected to the terminal;
adjusting the spatial audio output to the earphone according to the relative orientation, so that the spatial audio output to the earphone is perceived by the user as originating from the virtual sound source;
wherein the determining the position of the virtual sound source of the spatial audio to be output according to the interface presentation form of the application interface comprises at least two of the following modes:
if the ratio of the interface display area of the application interface on the screen is larger than a ratio threshold, determining a preset position as the position of a virtual sound source of the spatial audio to be output;
if the ratio of the interface display area of the application interface on the screen is smaller than or equal to a ratio threshold value, determining the window position of the application interface as the position of a virtual sound source of the spatial audio to be output;
if the application interface is minimized into an icon form, determining the position of the minimized icon of the application interface as the position of a virtual sound source of the spatial audio to be output;
if the application interface is switched to the background, determining, as the position of the virtual sound source of the spatial audio to be output, a position obtained after the position of the virtual sound source corresponding to the application interface before the switch is moved by a specified distance in the direction opposite to the light-emitting direction of the screen;
and if the application interface is switched to the foreground, determining, as the position of the virtual sound source of the spatial audio to be output, a position obtained after the position of the virtual sound source corresponding to the application interface before the switch is moved by a specified distance in the light-emitting direction of the screen.
2. The method of claim 1, wherein in a case where a preset position is determined as a position of a virtual sound source of spatial audio to be output, the preset position is a center position of the screen, a center position of the application interface, or a sound emission position in the application interface.
3. The method of claim 1, wherein the interface presentation of the application interface comprises a window presentation of a specified window in the application interface.
4. The method of claim 3, wherein the target application is a video application and the designated window is a video playback window.
5. The method of any of claims 1-4, wherein determining the relative orientation of the head of the user wearing the earphone with respect to the virtual sound source according to the position of the virtual sound source comprises:
acquiring a user image of the user through a camera;
determining, according to the acquired user image, a relative position of the user's head with respect to the virtual sound source and a relative direction of the head orientation with respect to the direction from the head toward the virtual sound source.
6. The method of any of claims 1-4, wherein the adjusting the spatial audio output to the earphone according to the relative orientation comprises:
processing audio parameters of the target application according to the relative orientation to generate spatial audio perceived as originating from the virtual sound source;
transmitting the generated spatial audio to the earphone.
7. A terminal, characterized in that the terminal comprises a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the audio processing method according to any of claims 1-6 when executing the computer program.
8. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to execute the audio processing method of any one of claims 1-6.
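
For readability only, the following non-authoritative sketch restates the branching of claim 1 as code. The state names, the ratio threshold value, and the move distance are illustrative assumptions and do not limit the claims.

    from enum import Enum, auto

    class InterfaceState(Enum):
        WINDOWED = auto()       # application interface displayed on screen
        MINIMIZED = auto()      # interface minimized into an icon
        TO_BACKGROUND = auto()  # interface just switched to the background
        TO_FOREGROUND = auto()  # interface just switched back to the foreground

    def virtual_source_position(state, area_ratio, window_pos, icon_pos,
                                preset_pos, previous_pos,
                                ratio_threshold=0.8, move_distance=0.3):
        """Positions are (x, y, z) tuples; +z is the light-emitting direction of the screen."""
        if state is InterfaceState.WINDOWED:
            # Near-full-screen display uses a preset position; a smaller
            # window uses the window position of the application interface.
            return preset_pos if area_ratio > ratio_threshold else window_pos
        if state is InterfaceState.MINIMIZED:
            return icon_pos
        if state is InterfaceState.TO_BACKGROUND:
            # Move opposite to the light-emitting direction by a specified distance.
            x, y, z = previous_pos
            return (x, y, z - move_distance)
        if state is InterfaceState.TO_FOREGROUND:
            # Move along the light-emitting direction by a specified distance.
            x, y, z = previous_pos
            return (x, y, z + move_distance)
        raise ValueError(f"unknown interface state: {state}")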
CN202210258905.3A 2022-03-16 2022-03-16 Audio processing method, terminal and computer readable storage medium Active CN114422935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210258905.3A CN114422935B (en) 2022-03-16 2022-03-16 Audio processing method, terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114422935A CN114422935A (en) 2022-04-29
CN114422935B true CN114422935B (en) 2022-09-23

Family

ID=81264048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210258905.3A Active CN114422935B (en) 2022-03-16 2022-03-16 Audio processing method, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114422935B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048448B (en) * 2022-07-26 2024-05-24 荣耀终端有限公司 Audio playing method and electronic equipment
CN118276812A (en) * 2022-09-02 2024-07-02 荣耀终端有限公司 Interface interaction method and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5796843A (en) * 1994-02-14 1998-08-18 Sony Corporation Video signal and audio signal reproducing apparatus
CN1997161A (en) * 2006-12-30 2007-07-11 华为技术有限公司 A video terminal and audio code stream processing method
CN102075832A (en) * 2009-11-24 2011-05-25 夏普株式会社 Method and apparatus for dynamic spatial audio zones configuration
CN103218198A (en) * 2011-08-12 2013-07-24 索尼电脑娱乐公司 Sound localization for user in motion
CN109151393A (en) * 2018-10-09 2019-01-04 深圳市亿联智能有限公司 A kind of sound fixation and recognition method for detecting
CN109804644A (en) * 2016-10-13 2019-05-24 连株式会社 Method of outputting acoustic sound and device for multiple displays
CN110300279A (en) * 2019-06-26 2019-10-01 视联动力信息技术股份有限公司 A kind of method for tracing and device of conference speech people

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293461A1 (en) * 2016-04-07 2017-10-12 VideoStitch Inc. Graphical placement of immersive audio sources
CN110035250A (en) * 2019-03-29 2019-07-19 维沃移动通信有限公司 Audio-frequency processing method, processing equipment, terminal and computer readable storage medium
CN112367426B (en) * 2020-11-09 2021-06-04 Oppo广东移动通信有限公司 Virtual object display method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN114422935A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
EP4099688A1 (en) Audio processing method and device
CN113873378B (en) Earphone noise processing method and device and earphone
EP4080859A1 (en) Method for implementing stereo output and terminal
CN114422935B (en) Audio processing method, terminal and computer readable storage medium
CN113726950A (en) Image processing method and electronic equipment
WO2022262313A1 (en) Picture-in-picture-based image processing method, device, storage medium, and program product
CN111103975B (en) Display method, electronic equipment and system
CN113837920B (en) Image rendering method and electronic equipment
CN113935898A (en) Image processing method, system, electronic device and computer readable storage medium
US20210377642A1 (en) Method and Apparatus for Implementing Automatic Translation by Using a Plurality of TWS Headsets Connected in Forwarding Mode
WO2021143574A1 (en) Augmented reality glasses, augmented reality glasses-based ktv implementation method and medium
CN114040242A (en) Screen projection method and electronic equipment
CN115016869A (en) Frame rate adjusting method, terminal equipment and frame rate adjusting system
CN114338965A (en) Audio processing method and electronic equipment
EP4138381A1 (en) Method and device for video playback
CN117133306B (en) Stereo noise reduction method, apparatus and storage medium
CN113810589A (en) Electronic device, video shooting method and medium thereof
CN111065020B (en) Method and device for processing audio data
CN113593567B (en) Method for converting video and sound into text and related equipment
CN116347320B (en) Audio playing method and electronic equipment
WO2022089563A1 (en) Sound enhancement method, earphone control method and apparatus, and earphone
CN115641867A (en) Voice processing method and terminal equipment
WO2024046182A1 (en) Audio playback method and system, and related apparatus
EP4422215A1 (en) Audio playing method and related apparatus
CN113709652B (en) Audio play control method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant