CN111091848B

CN111091848B - Method and device for predicting head posture

Info

Publication number: CN111091848B
Application number: CN201911166426.3A
Authority: CN
Inventors: 史明
Original assignee: Chongqing IQIYI Intelligent Technology Co Ltd
Current assignee: Beijing Dream Bloom Technology Co ltd; Qingdao Dream Blossom Technology Co ltd; Beijing IQIYi Intelligent Entertainment Technology Co Ltd
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2022-09-30
Anticipated expiration: 2039-11-25
Also published as: CN111091848A

Abstract

The application aims at providing a method and a device for predicting head gestures, wherein audio is decoded, decoded audio data are obtained, the decoded audio data are analyzed, the sound source position at the next moment is determined, and the head gestures of a user at the next moment are predicted according to the sound source position at the next moment.

Description

Method and device for predicting head posture

Technical Field

The present application relates to the field of computer technology, and more particularly, to a technique for predicting head pose.

Background

In the existing VR (Virtual Reality) panoramic video, only the video content in the field angle fov (field of view) is presented to the user when viewing, and the video content in other parts is effectively invalid. Therefore, many VR panoramic video schemes currently use an asymmetric coding scheme, that is, only the video portion seen by the user is of higher definition, and the invisible portion is of lower definition. In order to reduce the time delay, the head movement position of the user is predicted, and content which is high definition in the FOV range is downloaded in advance for decoding and displaying.

Existing schemes for predicting head pose typically include traditional machine learning methods or deep learning methods based on video content, etc. according to user habits. The prediction effect of the method based on the user habit is poor mainly because the head movement of the user is strongly related to the content of the film, that is, the user can move according to different film plots. The video content-based method, whether a traditional machine learning method or a deep learning method, has very high computational complexity, especially the deep learning-based method has higher computational complexity, and the computational-intensive prediction method is difficult to meet the real-time requirement and has very large power consumption.

Therefore, how to predict the head pose of the user in real time, efficiently and accurately becomes one of the problems that the skilled person needs to solve urgently.

Disclosure of Invention

The application aims to provide a method and a device for predicting head gestures.

According to an aspect of the present application, there is provided a method for predicting a head pose, wherein the method comprises:

decoding the audio to obtain decoded audio data;

analyzing the decoded audio data and determining the position of a sound source at the next moment;

and predicting the head posture of the user at the next moment according to the sound source position at the next moment.

According to another aspect of the present application, there is also provided an apparatus for predicting a head pose, wherein the apparatus comprises:

decoding means for decoding the audio to obtain decoded audio data;

analyzing means for analyzing the decoded audio data and determining a sound source position at a next time;

and the predicting device is used for predicting the head posture of the user at the next moment according to the sound source position at the next moment.

According to yet another aspect of the application, there is also provided a computer readable storage medium storing computer code which, when executed, performs a method as in any one of the preceding claims.

According to yet another aspect of the application, there is also provided a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.

According to yet another aspect of the present application, there is also provided a computer apparatus, including:

one or more processors;

a memory for storing one or more computer programs;

the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.

The application provides a method for predicting the head posture of a user, the audio is decoded, the decoded audio data is obtained, the decoded audio data is analyzed, the sound source position at the next moment is determined, the head posture of the user at the next moment is predicted according to the sound source position at the next moment, and due to the fact that the image-based prediction method is large in calculated amount and high in power consumption, the method based on the sound prediction is adopted, when the VR panoramic video is watched, head movement of the user in the film watching process is predicted, and the data volume of the audio is small and good in processing, and besides, the method based on the sound prediction also accords with human behavior habits. The method based on the sound prediction has the characteristics of small calculated amount, low power consumption and the like, and is suitable for being applied to a real-time system.

Furthermore, according to the predicted head posture of the user at the next moment, the field angle FOV at the next moment is determined, high-definition video content in the field angle FOV range at the next moment is downloaded in advance, and the definition switching time delay caused by head motion when the user watches VR panoramic video is reduced. The method and the device can play the high-image-quality VR panoramic video in a lower bandwidth environment, not only can save bandwidth cost, but also reduce the requirements of a terminal user on the network and equipment, thereby reducing the access threshold of the user.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application;

FIG. 2 illustrates a schematic diagram of an apparatus for predicting head pose according to an aspect of the present application;

FIG. 3 illustrates a schematic diagram for predicting a head pose according to a preferred embodiment of the present application;

FIG. 4 illustrates a flow diagram of a method for predicting head pose according to another aspect of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

The term "computer device" or "computer" in this context refers to an intelligent electronic device that can execute predetermined processes such as numerical calculation and/or logic calculation by running predetermined programs or instructions, and may include a processor and a memory, wherein the processor executes a pre-stored instruction stored in the memory to execute the predetermined processes, or the predetermined processes are executed by hardware such as ASIC, FPGA, DSP, or a combination thereof. Computer devices include, but are not limited to, servers, personal computers, laptops, tablets, smart phones, and the like.

The computer equipment comprises user equipment and network equipment. Wherein the user equipment includes but is not limited to computers, smart phones, PDAs, etc.; the network device includes, but is not limited to, a single network server, a server group consisting of a plurality of network servers, or a Cloud Computing (Cloud Computing) based Cloud consisting of a large number of computers or network servers, wherein Cloud Computing is one of distributed Computing, a super virtual computer consisting of a collection of loosely coupled computers. The computer equipment can be independently operated to realize the application, and can also be accessed into a network to realize the application through the interactive operation with other computer equipment in the network. The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.

It should be noted that the user equipment, the network device, the network, etc. are only examples, and other existing or future computer devices or networks may also be included in the scope of the present application, if applicable, and are included by reference.

The methods discussed below, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.

Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent to", etc.) should be interpreted in a similar manner.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

The present application is described in further detail below with reference to the attached drawing figures.

FIG. 1 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application. The computer system/server 12 shown in FIG. 1 is only one example and should not be taken to limit the scope of use or the functionality of embodiments of the present application.

As shown in FIG. 1, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, micro-channel architecture (MAC) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 1, and commonly referred to as a "hard drive"). Although not shown in FIG. 1, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.

The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in FIG. 1, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.

The processing unit 16 executes various functional applications and data processing by executing programs stored in the memory 28.

For example, the memory 28 stores therein a computer program for executing the functions and processes of the present application, and the present application predicts the head pose when the processing unit 16 executes the corresponding computer program.

The specific means/steps of the present application for predicting head pose will be described in detail below.

FIG. 2 illustrates a schematic diagram of an apparatus for predicting head pose in accordance with an aspect of the subject application.

The apparatus 1 comprises decoding means 201, analyzing means 202 and predicting means 203.

The decoding apparatus 201 decodes the audio to obtain decoded audio data.

Specifically, VR panorama video is usually audio and video packaged together, and the decoding apparatus 201 decodes this, and splits the audio and video, thereby obtaining decoded audio data. Here, the encoding of audio is in a variety of formats, including but not limited to stereo, panned sound, and other audio that can determine where objects occur in a scene.

It will be understood by those skilled in the art that the above-described audio format is merely exemplary and should not be considered as limiting the present application, and that other existing or future audio formats, as may be suitable for use in the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.

The analysis means 202 analyzes the decoded audio data and determines the sound source position at the next time.

Specifically, the analysis device 202 analyzes the audio data obtained by decoding by the decoding device 201, and determines the sound source position at the next time by, for example, acquiring the coordinates of the sound source, calculating the sound time difference between the left and right channels, and the like.

In a preferred embodiment, if the decoded audio data is stereo, the analyzing device 202 analyzes the sound time difference of the left and right channels of the stereo; and determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user.

For example, if the decoded audio data is stereo, and the stereo sound is divided into two channels, i.e., left and right channels, and the two channels are not synchronized with each other and have a certain sound time difference, as shown in fig. 3, the user can see the top view of the user, the left and right sides are the ears of the user, and the distances from the sound source to the left and right ears of the user are different, so that the analysis device 202 analyzes the sound time difference of the left and right channels of the stereo sound, for example, the sound time difference is 0.001s, and then calculates and determines the sound source position at the next time based on the time difference from the sound source to the left and right ears of the user in combination with the sound speed.

In another preferred embodiment, if the decoded audio data is a panoramic sound, the analysis device 202 determines the coordinates of a sound source in the panoramic sound at the next time as the sound source position at the next time.

For example, if the decoded audio data is a panoramic sound, and each sound source in the panoramic sound has a specific corresponding coordinate, the analysis device 202 directly analyzes and specifies the coordinate of the sound source in the panoramic sound at the next time as the sound source position at the next time. For example, if a bird is called as a bird on a tree at the next time in the VR panoramic video, analyzer 202 directly determines the coordinates of the bird at that time as the sound source position at the next time.

It will be understood by those skilled in the art that the above-described manner of determining the location of a sound source is by way of example only and should not be construed as limiting the present application, and that other existing or future manners of determining the location of a sound source, as may be applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.

The prediction means 203 predicts the head posture of the user at the next time from the sound source position at the next time.

Specifically, the prediction means 203 can predict the head posture of the user at the next time by predicting that the user will face the sound source position at the next time based on the sound source position at the next time determined by the analysis means 202.

For example, in a VR panoramic video, the next time is a bird that is in the tree, and usually the user faces the sound source during the viewing process, for example, the user raises his head to see the bird when the bird is in the bird, so that when the analyzing device 202 determines the location of the bird call at the next time, the predicting device 203 can predict that the user will raise his head to see the bird according to the location of the bird call, so as to predict the head posture of the user at the next time.

Here, the apparatus 1 decodes audio, obtains decoded audio data, analyzes the decoded audio data, determines a sound source position at a next time, and predicts a head posture of a user at the next time from the sound source position at the next time, and since a prediction method based on an image has a relatively large amount of calculation and high power consumption, the apparatus 1 predicts a head movement during a viewing process of the user when viewing a VR panorama video by predicting based on sound, because the data amount of audio is very small and is relatively easy to handle, and furthermore, a method based on sound prediction is relatively in line with human behavior habits. The mode of the device 1 based on the sound prediction has the characteristics of small calculated amount, low power consumption and the like, and is suitable for being applied to a real-time system.

In a preferred embodiment, the device 1 further comprises downloading means (not shown). The downloading device determines the field angle FOV at the next moment according to the predicted head posture of the user at the next moment; and downloading the video content with high definition in the field angle FOV range at the next moment in advance.

Specifically, since the field angle FOV of the user can be known from the head posture after the head posture of the user is determined, the downloading device determines the field angle FOV of the user at the next time point from the head posture of the user at the next time point predicted by the prediction device 203, and downloads the video content having high definition in the field angle FOV range at the next time point in advance from the field angle FOV. For VR panoramic video, only the video content in the FOV will be presented to the user when viewing, and the video content in other parts is effectively invalid. Thus, the downloading device may use an asymmetric encoding scheme, i.e. only the video portions that are visible to the user are of higher definition and the portions that are not visible are of lower definition. And the downloading device determines the field angle FOV of the user at the next moment according to the predicted head posture of the user at the next moment, so that the content with high definition in the FOV range is downloaded in advance for decoding and displaying according to the field angle FOV, and the time delay is reduced.

It should be understood by those skilled in the art that the above-described manner of downloading video is merely exemplary and should not be construed as limiting the present application, and that other existing or future manners of downloading video, as applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.

The device 1 determines the field angle FOV at the next moment according to the predicted head posture of the user at the next moment, downloads the high-definition video content within the field angle FOV at the next moment in advance, and reduces the resolution switching delay caused by head motion when the user watches the VR panoramic video. The device 1 can play the high-image quality VR panoramic video in a lower bandwidth environment, so that the bandwidth cost can be saved, the requirements of a terminal user on a network and equipment are reduced, and the access threshold of the user is reduced.

FIG. 4 illustrates a flow diagram of a method for predicting a head pose according to another aspect of the present application.

In step S401, the apparatus 1 decodes audio, obtaining decoded audio data.

Specifically, VR panoramic video is typically audio and video packaged together, and in step S401, the apparatus 1 decodes this, and splits the audio and video, so as to obtain decoded audio data. Here, the encoding of audio is in a variety of formats, including but not limited to stereo, panned sound, and other audio that can determine where objects occur in a scene.

In step S402, the device 1 analyzes the decoded audio data and determines the sound source position at the next time.

Specifically, in step S402, the device 1 analyzes the audio data obtained after decoding in step S401, and determines the sound source position at the next time, for example, by acquiring the coordinates of the sound source, calculating the sound time difference between the left and right channels, and the like.

In a preferred embodiment, if the decoded audio data is stereo, in step S402, the apparatus 1 analyzes the sound time difference of the left and right channels of the stereo; and determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user.

For example, if the decoded audio data is stereo, the stereo sound is divided into two channels, i.e., left and right channels, and the two channels are not synchronized with each other and have a certain sound time difference, as shown in fig. 3, the user can see the top view of the user, the left and right sides are the ears of the user, and the distances from the sound source to the left and right ears of the user are different, so in step S402, the apparatus 1 analyzes the sound time difference of the left and right channels of the stereo sound, for example, the sound time difference is 0.001S, and then calculates and determines the sound source position at the next time according to the time difference from the sound source to the left and right ears of the user and the sound velocity.

In another preferred embodiment, if the decoded audio data is a panoramic sound, in step S402, the apparatus 1 determines the coordinates of the sound source in the panoramic sound at the next time as the sound source position at the next time.

For example, if the decoded audio data is a panoramic sound, and each sound source in the panoramic sound has specific corresponding coordinates, in step S402, the device 1 directly analyzes and determines the coordinates of the sound source in the panoramic sound at the next time as the sound source position at the next time. For example, if a bird is called as a bird on a tree at the next time in the VR panoramic image, the device 1 directly specifies the coordinates of the bird at that time as the sound source position at the next time in step S402.

In step S403, the device 1 predicts the head posture of the user at the next time from the sound source position at the next time.

Specifically, in step S403, the apparatus 1 can predict the head posture of the user at the next time point by predicting that the user will face the sound source position at the next time point based on the sound source position at the next time point determined in step S402.

For example, in a VR panoramic video, the next time is a bird on the tree, and the user usually faces the sound source during the viewing process, for example, the user may raise his head to see the bird when the bird is called, so that when the apparatus 1 determines the position of the bird call at the next time in step S402, the apparatus 1 may predict that the user will raise his head to see the bird according to the position of the bird call in step S403, and may predict the head posture of the user at the next time.

Here, the apparatus 1 decodes audio, obtains decoded audio data, analyzes the decoded audio data, determines a sound source position at the next time, and predicts a head posture of a user at the next time according to the sound source position at the next time, and since a prediction method based on an image has a relatively large amount of calculation and high power consumption, the apparatus 1 predicts head movement during a viewing process of the user when viewing a VR panorama video, because the data amount of the audio is very small and is relatively easy to handle, and further, the method based on the sound prediction also relatively conforms to human behavior habits. The mode of the device 1 based on the sound prediction has the characteristics of small calculated amount, low power consumption and the like, and is suitable for being applied to a real-time system.

In a preferred embodiment, the method further comprises step S404 (not shown). In step S404, the apparatus 1 determines a field angle FOV at the next time point, based on the predicted head posture of the user at the next time point; and downloading the video content with high definition in the field angle FOV range at the next moment in advance.

Specifically, after the head posture of the user is determined, the field angle FOV of the user can be known according to the head posture, and therefore, in step S404, the apparatus 1 determines the field angle FOV of the user at the next time according to the head posture of the user at the next time predicted in step S403, and downloads the video content which is high definition within the field angle FOV range at the next time in advance according to the field angle FOV. For VR panoramic video, only the video content in the FOV will be presented to the user when viewing, and the video content in other parts is effectively invalid. Therefore, in step S404, the apparatus 1 may adopt an asymmetric coding scheme, i.e. only the video portion that is visible to the user is of higher definition, and the invisible portion is of lower definition. In step S404, the apparatus 1 determines the field angle FOV of the user at the next time according to the predicted head pose of the user at the next time, so as to download the content with high definition in the FOV range in advance for decoding and displaying according to the field angle FOV, thereby reducing the time delay.

The device 1 determines the field angle FOV at the next moment according to the predicted head posture of the user at the next moment, downloads the high-definition video content within the field angle FOV at the next moment in advance, and reduces the resolution switching delay caused by head motion when the user watches the VR panoramic video. The device 1 can play the high-definition VR panoramic video in a lower bandwidth environment, so that the bandwidth cost can be saved, the requirements of a terminal user on a network and equipment are reduced, and the access threshold of the user is reduced.

The present application also provides a computer readable storage medium having stored thereon computer code which, when executed, performs a method as in any one of the preceding.

The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.

The present application further provides a computer device, comprising:

one or more processors;

a memory for storing one or more computer programs;

It is noted that the present application may be implemented in software and/or a combination of software and hardware, for example, the various means of the present application may be implemented using Application Specific Integrated Circuits (ASICs) or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method for predicting head pose, wherein the method comprises:

decoding the audio to obtain decoded audio data;

wherein the content of the first and second substances,

if the decoded audio data is stereo, analyzing the sound time difference of left and right channels of the stereo; determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user;

if the decoded audio data is panoramic sound, determining the coordinates of a sound source in the panoramic sound at the next moment as the position of the sound source at the next moment;

2. The method of claim 1, wherein the method further comprises:

determining a field angle FOV at the next moment according to the predicted head posture of the user at the next moment;

and downloading the video content with high definition in the field angle FOV range at the next moment in advance.

3. An apparatus for predicting head pose, wherein the apparatus comprises:

decoding means for decoding the audio to obtain decoded audio data;

analyzing means for analyzing the decoded audio data and determining a sound source position at a next time; wherein, the first and the second end of the pipe are connected with each other,

if the decoded audio data is panoramic sound, determining the coordinates of a sound source in the panoramic sound at the next moment as the sound source position at the next moment;

4. The apparatus of claim 3, wherein the apparatus further comprises a downloading means for:

5. A computer readable storage medium storing computer code which, when executed, performs the method of any of claims 1-2.

6. A computer device, the computer device comprising:

one or more processors;

a memory for storing one or more computer programs;

the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.