CN111091848B - Method and device for predicting head posture


Info

Publication number
CN111091848B
CN111091848B (application CN201911166426.3A)
Authority
CN
China
Prior art keywords: sound source, next moment, user, sound, audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911166426.3A
Other languages
Chinese (zh)
Other versions
CN111091848A (en)
Inventor
史明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dream Bloom Technology Co ltd
Qingdao Dream Blossom Technology Co ltd
Beijing IQIYi Intelligent Entertainment Technology Co Ltd
Original Assignee
Chongqing IQIYI Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing IQIYI Intelligent Technology Co Ltd
Priority to CN201911166426.3A
Publication of CN111091848A
Application granted
Publication of CN111091848B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/012: Head tracking input arrangements
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/57: Speech or voice analysis techniques specially adapted for processing of video signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; multi-view video systems; details thereof
    • H04N 13/10: Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/194: Transmission of image signals
    • H04N 13/30: Image reproducers
    • H04N 13/332: Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/366: Image reproducers using viewer tracking


Abstract

The application aims to provide a method and a device for predicting head posture: audio is decoded to obtain decoded audio data, the decoded audio data is analyzed to determine the sound source position at the next moment, and the head posture of the user at the next moment is predicted from the sound source position at the next moment.

Description

Method and device for predicting head posture
Technical Field
The present application relates to the field of computer technology, and more particularly, to a technique for predicting head pose.
Background
In existing VR (Virtual Reality) panoramic video, only the video content within the field angle FOV (field of view) is presented to the user during viewing; the video content elsewhere is effectively wasted. Many current VR panoramic video schemes therefore use an asymmetric coding scheme, in which only the video portion the user sees is of higher definition and the invisible portion is of lower definition. To reduce the time delay, the user's head movement is predicted, and the high-definition content within the upcoming FOV range is downloaded in advance for decoding and display.
Existing schemes for predicting head pose typically rely either on user habits, using traditional machine learning, or on the video content, using deep learning. Methods based on user habits predict poorly, mainly because the user's head movement is strongly correlated with the content of the film: users move differently depending on the plot. Methods based on video content, whether traditional machine learning or deep learning, have very high computational complexity, deep learning especially so, and such computation-intensive prediction is difficult to run in real time and consumes a great deal of power.
How to predict the head pose of the user in real time, efficiently and accurately has therefore become one of the problems that the skilled person urgently needs to solve.
Disclosure of Invention
The application aims to provide a method and a device for predicting head posture.
According to an aspect of the present application, there is provided a method for predicting a head pose, wherein the method comprises:
decoding the audio to obtain decoded audio data;
analyzing the decoded audio data and determining the position of a sound source at the next moment;
and predicting the head posture of the user at the next moment according to the sound source position at the next moment.
According to another aspect of the present application, there is also provided an apparatus for predicting a head pose, wherein the apparatus comprises:
decoding means for decoding the audio to obtain decoded audio data;
analyzing means for analyzing the decoded audio data and determining a sound source position at a next time;
and the predicting device is used for predicting the head posture of the user at the next moment according to the sound source position at the next moment.
According to yet another aspect of the application, there is also provided a computer readable storage medium storing computer code which, when executed, performs a method as in any one of the preceding claims.
According to yet another aspect of the application, there is also provided a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
According to yet another aspect of the present application, there is also provided a computer apparatus, including:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
The application provides a method for predicting the head posture of a user: audio is decoded to obtain decoded audio data, the decoded audio data is analyzed to determine the sound source position at the next moment, and the head posture of the user at the next moment is predicted from that position. Because image-based prediction methods involve a large amount of computation and high power consumption, a sound-based prediction method is adopted to predict the user's head movement while a VR panoramic video is being watched: the data volume of audio is small and easy to process, and prediction based on sound also accords with human behavior habits, since viewers tend to turn toward what they hear. The sound-based prediction method therefore has a small computational load and low power consumption, and is well suited to real-time systems.
Furthermore, the field angle FOV at the next moment is determined from the predicted head posture of the user at the next moment, and the high-definition video content within that FOV range is downloaded in advance, reducing the definition-switching delay caused by head motion when the user watches a VR panoramic video. The method and device can thus play high-quality VR panoramic video in a lower-bandwidth environment, which not only saves bandwidth cost but also lowers the network and device requirements on the end user, thereby lowering the user's barrier to entry.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application;
FIG. 2 illustrates a schematic diagram of an apparatus for predicting head pose according to an aspect of the present application;
FIG. 3 illustrates a schematic diagram for predicting a head pose according to a preferred embodiment of the present application;
FIG. 4 illustrates a flow diagram of a method for predicting head pose according to another aspect of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "computer device" or "computer" in this context refers to an intelligent electronic device that can execute predetermined processes such as numerical and/or logical computation by running predetermined programs or instructions. It may comprise a processor and a memory, with the processor executing instructions pre-stored in the memory to carry out the predetermined process, or the process may be carried out by hardware such as an ASIC, FPGA or DSP, or a combination thereof. Computer devices include, but are not limited to, servers, personal computers, laptops, tablets, smart phones, and the like.
Computer devices include user equipment and network devices. User equipment includes but is not limited to computers, smart phones, PDAs, and the like; network devices include but are not limited to a single network server, a server group consisting of multiple network servers, or a Cloud Computing based cloud consisting of a large number of computers or network servers, cloud computing being a form of distributed computing in which a super virtual computer is composed of a collection of loosely coupled computers. A computer device may run on its own to realize the application, or may access a network and realize the application through interaction with other computer devices in the network. The network in which the computer device resides includes, but is not limited to, the internet, wide area networks, metropolitan area networks, local area networks, VPN networks, and the like.
It should be noted that the user equipment, the network device, the network, etc. are only examples, and other existing or future computer devices or networks may also be included in the scope of the present application, if applicable, and are included by reference.
The methods discussed below, some of which are illustrated by flow diagrams, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. The processor(s) may perform the necessary tasks.
Specific structural and functional details disclosed herein are merely representative and are provided for purposes of describing example embodiments of the present application. This application may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element may be termed a second element, and, similarly, a second element may be termed a first element, without departing from the scope of example embodiments. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between elements (e.g., "between" versus "directly between", "adjacent" versus "directly adjacent to", etc.) should be interpreted in a similar manner.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
The present application is described in further detail below with reference to the attached drawing figures.
FIG. 1 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application. The computer system/server 12 shown in FIG. 1 is only one example and should not be taken to limit the scope of use or the functionality of embodiments of the present application.
As shown in FIG. 1, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 1, and commonly referred to as a "hard drive"). Although not shown in FIG. 1, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be appreciated that although not shown in FIG. 1, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the memory 28.
For example, the memory 28 stores therein a computer program for executing the functions and processes of the present application, and the present application predicts the head pose when the processing unit 16 executes the corresponding computer program.
The specific means/steps of the present application for predicting head pose will be described in detail below.
FIG. 2 illustrates a schematic diagram of an apparatus for predicting head pose in accordance with an aspect of the subject application.
The apparatus 1 comprises decoding means 201, analyzing means 202 and predicting means 203.
The decoding apparatus 201 decodes the audio to obtain decoded audio data.
Specifically, a VR panoramic video usually has audio and video packaged together; the decoding apparatus 201 decodes the package and splits the audio from the video, thereby obtaining decoded audio data. The audio may be encoded in a variety of formats, including but not limited to stereo, panoramic sound, and other audio formats from which the position at which an object occurs in the scene can be determined.
It will be understood by those skilled in the art that the above-described audio format is merely exemplary and should not be considered as limiting the present application, and that other existing or future audio formats, as may be suitable for use in the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
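As a minimal sketch of this demultiplexing and decoding step, the following Python fragment pulls the audio track out of a packaged panoramic video and decodes it to PCM frames. It assumes the PyAV bindings for FFmpeg and a hypothetical file name; the application itself does not prescribe any particular decoder.

```python
import av  # PyAV: Python bindings for FFmpeg

# "panorama.mp4" is a hypothetical VR file with audio and video packaged together.
container = av.open("panorama.mp4")

decoded_frames = []
# decode(audio=0) pulls only the first audio stream out of the multiplexed file.
for frame in container.decode(audio=0):
    # to_ndarray() exposes the PCM samples; for planar sample formats this
    # gives one row per channel (two rows for stereo).
    decoded_frames.append(frame.to_ndarray())
```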
The analysis means 202 analyzes the decoded audio data and determines the sound source position at the next time.
Specifically, the analysis device 202 analyzes the audio data decoded by the decoding device 201 and determines the sound source position at the next moment, for example by acquiring the coordinates of the sound source or by calculating the sound time difference between the left and right channels.
In a preferred embodiment, if the decoded audio data is stereo, the analyzing device 202 analyzes the sound time difference of the left and right channels of the stereo; and determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user.
For example, if the decoded audio data is stereo, the sound is divided into left and right channels that are not synchronized with each other but carry a certain sound time difference. As shown in fig. 3, a top view of the user with an ear on each side, the distances from the sound source to the user's left and right ears differ. The analysis device 202 therefore measures the sound time difference between the left and right channels of the stereo, for example 0.001 s, and then calculates the sound source position at the next moment from the time difference between the sound source's arrival at the user's two ears, combined with the speed of sound.
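The time-difference analysis can be sketched as follows: cross-correlate the two channels to find the lag with the strongest match, convert that lag to an interaural time difference (ITD), and map the ITD to a horizontal angle under a far-field model. The function name, the 0.18 m ear spacing, and the far-field approximation are illustrative assumptions rather than details fixed by the application.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C
EAR_DISTANCE = 0.18     # m, assumed spacing between the user's ears

def estimate_azimuth(left: np.ndarray, right: np.ndarray, sample_rate: int) -> float:
    """Estimate the horizontal source angle (degrees, positive to the
    user's right under these conventions) from the ITD of one stereo frame."""
    corr = np.correlate(left, right, mode="full")
    # Lag in samples; lag > 0 means the left channel lags the right one,
    # i.e. the sound reached the right ear first (source to the user's right).
    lag = int(np.argmax(corr)) - (len(right) - 1)
    itd = lag / sample_rate  # interaural time difference in seconds
    # Far-field model: path difference between the ears = d * sin(theta).
    sin_theta = np.clip(itd * SPEED_OF_SOUND / EAR_DISTANCE, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

Note that with the 0.001 s figure quoted above, 0.001 s × 343 m/s already exceeds the assumed 0.18 m ear spacing, so this simple model saturates at a source directly to one side; a practical implementation would also fuse level differences or smooth estimates over time.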
In another preferred embodiment, if the decoded audio data is a panoramic sound, the analysis device 202 determines the coordinates of a sound source in the panoramic sound at the next time as the sound source position at the next time.
For example, if the decoded audio data is panoramic sound, each sound source in the panoramic sound carries its own specific coordinates, so the analysis device 202 directly determines the coordinates of the sound source at the next moment as the sound source position at the next moment. For instance, if a bird on a tree chirps at the next moment in the VR panoramic video, the analysis device 202 directly takes the coordinates of the bird at that moment as the sound source position at the next moment.
It will be understood by those skilled in the art that the above-described manner of determining the location of a sound source is by way of example only and should not be construed as limiting the present application, and that other existing or future manners of determining the location of a sound source, as may be applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
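For the panoramic-sound case, the lookup can be sketched with a small illustrative data structure. The AudioObject fields and the selection rule below are assumptions made for the sketch; the application does not specify how object coordinates are encoded in the audio metadata.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One sound object in a panoramic (object-based) audio track."""
    label: str    # e.g. "bird"
    x: float      # meters to the viewer's right
    y: float      # meters above the viewer
    z: float      # meters in front of the viewer
    start: float  # seconds at which the object begins to sound

def source_position_at(objects: list[AudioObject], t_next: float):
    """Return the coordinates of the most recently started object
    sounding at time t_next, or None if nothing is sounding."""
    active = [o for o in objects if o.start <= t_next]
    if not active:
        return None
    latest = max(active, key=lambda o: o.start)
    return (latest.x, latest.y, latest.z)
```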
The prediction means 203 predicts the head posture of the user at the next time from the sound source position at the next time.
Specifically, based on the sound source position at the next moment determined by the analysis means 202, the prediction means 203 predicts that the user will turn to face that position, and thereby predicts the head posture of the user at the next moment.
For example, suppose that in a VR panoramic video a bird in a tree chirps at the next moment. A user usually faces the sound source during viewing, for example raising the head to look at the bird when it chirps. So once the analyzing device 202 has determined the location of the bird call at the next moment, the predicting device 203 can predict from that location that the user will raise his head to look at the bird, and thus predict the head posture of the user at the next moment.
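The "user faces the source" step then reduces to a small geometric conversion from source coordinates to a head pose. The viewer-centred axes follow the AudioObject sketch above and are likewise an assumption of this illustration:

```python
import math

def head_pose_toward(x: float, y: float, z: float) -> tuple[float, float]:
    """Yaw and pitch (degrees) that point the head at a source located
    at (x, y, z): x to the right, y up, z straight ahead."""
    yaw = math.degrees(math.atan2(x, z))                   # turn left/right
    pitch = math.degrees(math.atan2(y, math.hypot(x, z)))  # tilt up/down
    return yaw, pitch
```

For the bird example, a source above and ahead of the viewer such as head_pose_toward(0.0, 3.0, 5.0) gives a yaw of 0 and a pitch of about 31 degrees: the predicted pose has the user raising the head.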
Here, the apparatus 1 decodes audio to obtain decoded audio data, analyzes the decoded audio data to determine the sound source position at the next moment, and predicts the head posture of the user at the next moment from that position. Because image-based prediction methods involve a relatively large amount of computation and high power consumption, the apparatus 1 instead predicts the user's head movement during viewing of a VR panoramic video from sound: the data volume of audio is very small and easy to process, and sound-based prediction also matches human viewing behavior. The sound-based approach of the apparatus 1 therefore has a small computational load and low power consumption, and is well suited to real-time systems.
In a preferred embodiment, the device 1 further comprises a downloading device (not shown). The downloading device determines the field angle FOV at the next moment according to the predicted head posture of the user at the next moment, and downloads in advance the high-definition video content within the FOV range at the next moment.
Specifically, once the head posture of the user is determined, the user's field angle FOV can be derived from it. The downloading device therefore determines the user's field angle FOV at the next moment from the head posture predicted by the prediction device 203, and downloads in advance the high-definition video content within that FOV range. For a VR panoramic video, only the video content in the FOV is presented to the user during viewing; the video content elsewhere is effectively wasted. The downloading device may thus use an asymmetric encoding scheme, in which only the video portion visible to the user is high definition and the invisible portion is lower definition. By determining the user's FOV at the next moment from the predicted head posture, the downloading device can download the high-definition content within the FOV range in advance for decoding and display, thereby reducing the time delay.
It should be understood by those skilled in the art that the above-described manner of downloading video is merely exemplary and should not be construed as limiting the present application, and that other existing or future manners of downloading video, as applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
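One way the prefetch might look is sketched below for a tiled equirectangular layout. The 30-degree tile grid, the 90-degree FOV defaults, and the segment naming are all hypothetical; the application only requires that the high-definition content inside the predicted FOV be requested ahead of time.

```python
def fov_tiles(yaw: float, pitch: float, h_fov: float = 90.0,
              v_fov: float = 90.0, tile_deg: float = 30.0) -> list[tuple[int, int]]:
    """Tiles of a 360x180 equirectangular grid whose centres fall
    inside the FOV centred on the predicted head pose."""
    tiles = []
    for u in range(int(360 / tile_deg)):
        for v in range(int(180 / tile_deg)):
            tile_yaw = u * tile_deg + tile_deg / 2 - 180
            tile_pitch = v * tile_deg + tile_deg / 2 - 90
            d_yaw = (tile_yaw - yaw + 180) % 360 - 180  # wrap-around distance
            if abs(d_yaw) <= h_fov / 2 and abs(tile_pitch - pitch) <= v_fov / 2:
                tiles.append((u, v))
    return tiles

def prefetch_hd(tiles: list[tuple[int, int]], fetch) -> None:
    """Request the high-definition variant of each FOV tile in advance;
    `fetch` stands in for whatever downloader the player provides."""
    for u, v in tiles:
        fetch(f"tile_{u}_{v}_hd.m4s")  # hypothetical segment naming
```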
The device 1 determines the field angle FOV at the next moment from the predicted head posture of the user at the next moment and downloads the high-definition video content within that FOV in advance, reducing the definition-switching delay caused by head motion when the user watches a VR panoramic video. The device 1 can therefore play high-quality VR panoramic video in a lower-bandwidth environment, saving bandwidth cost, lowering the network and device requirements on the end user, and thereby lowering the user's barrier to entry.
FIG. 4 illustrates a flow diagram of a method for predicting a head pose according to another aspect of the present application.
In step S401, the apparatus 1 decodes audio, obtaining decoded audio data.
Specifically, a VR panoramic video usually has audio and video packaged together; in step S401, the apparatus 1 decodes the package and splits the audio from the video, thereby obtaining decoded audio data. The audio may be encoded in a variety of formats, including but not limited to stereo, panoramic sound, and other audio formats from which the position at which an object occurs in the scene can be determined.
It will be understood by those skilled in the art that the above-described audio format is merely exemplary and should not be considered as limiting the present application, and that other existing or future audio formats, as may be suitable for use in the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
In step S402, the device 1 analyzes the decoded audio data and determines the sound source position at the next time.
Specifically, in step S402, the device 1 analyzes the audio data decoded in step S401 and determines the sound source position at the next moment, for example by acquiring the coordinates of the sound source or by calculating the sound time difference between the left and right channels.
In a preferred embodiment, if the decoded audio data is stereo, in step S402, the apparatus 1 analyzes the sound time difference of the left and right channels of the stereo; and determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user.
For example, if the decoded audio data is stereo, the sound is divided into left and right channels that are not synchronized with each other but carry a certain sound time difference. As shown in fig. 3, a top view of the user with an ear on each side, the distances from the sound source to the user's left and right ears differ. In step S402, the apparatus 1 therefore measures the sound time difference between the left and right channels of the stereo, for example 0.001 s, and then calculates the sound source position at the next moment from the time difference between the sound source's arrival at the user's two ears, combined with the speed of sound.
In another preferred embodiment, if the decoded audio data is a panoramic sound, in step S402, the apparatus 1 determines the coordinates of the sound source in the panoramic sound at the next time as the sound source position at the next time.
For example, if the decoded audio data is panoramic sound, each sound source in the panoramic sound carries its own specific coordinates, so in step S402 the device 1 directly determines the coordinates of the sound source at the next moment as the sound source position at the next moment. For instance, if a bird on a tree chirps at the next moment in the VR panoramic video, the device 1 directly takes the coordinates of the bird at that moment as the sound source position at the next moment in step S402.
It will be understood by those skilled in the art that the above-described manner of determining the location of a sound source is by way of example only and should not be construed as limiting the present application, and that other existing or future manners of determining the location of a sound source, as may be applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
In step S403, the device 1 predicts the head posture of the user at the next time from the sound source position at the next time.
Specifically, in step S403, based on the sound source position at the next moment determined in step S402, the apparatus 1 predicts that the user will turn to face that position, and thereby predicts the head posture of the user at the next moment.
For example, suppose that in a VR panoramic video a bird in a tree chirps at the next moment. The user usually faces the sound source during viewing, for example raising the head to look at the bird when it chirps. So once the apparatus 1 has determined the location of the bird call at the next moment in step S402, it can predict in step S403 from that location that the user will raise his head to look at the bird, and thus predict the head posture of the user at the next moment.
Here, the apparatus 1 decodes audio to obtain decoded audio data, analyzes the decoded audio data to determine the sound source position at the next moment, and predicts the head posture of the user at the next moment from that position. Because image-based prediction methods involve a relatively large amount of computation and high power consumption, the apparatus 1 instead predicts the user's head movement during viewing of a VR panoramic video from sound: the data volume of audio is very small and easy to process, and sound-based prediction also matches human viewing behavior. The sound-based approach of the apparatus 1 therefore has a small computational load and low power consumption, and is well suited to real-time systems.
In a preferred embodiment, the method further comprises step S404 (not shown). In step S404, the apparatus 1 determines the field angle FOV at the next moment based on the predicted head posture of the user at the next moment, and downloads in advance the high-definition video content within the FOV range at the next moment.
Specifically, once the head posture of the user is determined, the user's field angle FOV can be derived from it. In step S404, the apparatus 1 therefore determines the user's field angle FOV at the next moment from the head posture predicted in step S403, and downloads in advance the high-definition video content within that FOV range. For a VR panoramic video, only the video content in the FOV is presented to the user during viewing; the video content elsewhere is effectively wasted. In step S404, the apparatus 1 may thus use an asymmetric coding scheme, in which only the video portion visible to the user is high definition and the invisible portion is lower definition. By determining the user's FOV at the next moment from the predicted head posture, the apparatus 1 can download the high-definition content within the FOV range in advance for decoding and display, thereby reducing the time delay.
It should be understood by those skilled in the art that the above-described manner of downloading video is merely exemplary and should not be construed as limiting the present application, and that other existing or future manners of downloading video, as applicable to the present application, are intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
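Putting the steps together for the stereo case, and reusing the helpers sketched earlier (estimate_azimuth, fov_tiles, prefetch_hd), the whole of steps S401 to S404 might read:

```python
import numpy as np

def predict_and_prefetch(left: np.ndarray, right: np.ndarray,
                         sample_rate: int, fetch) -> None:
    """End-to-end sketch of steps S401-S404 for decoded stereo audio."""
    yaw = estimate_azimuth(left, right, sample_rate)  # S402: locate the source
    pitch = 0.0  # S403: a stereo ITD carries no elevation cue; assume a level gaze
    tiles = fov_tiles(yaw, pitch)  # S404: FOV at the next moment
    prefetch_hd(tiles, fetch)      # S404: download the HD tiles ahead of time
```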
The device 1 determines the field angle FOV at the next moment from the predicted head posture of the user at the next moment and downloads the high-definition video content within that FOV in advance, reducing the definition-switching delay caused by head motion when the user watches a VR panoramic video. The device 1 can therefore play high-definition VR panoramic video in a lower-bandwidth environment, saving bandwidth cost, lowering the network and device requirements on the end user, and thereby lowering the user's barrier to entry.
The present application also provides a computer readable storage medium having stored thereon computer code which, when executed, performs a method as in any one of the preceding.
The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
It is noted that the present application may be implemented in software and/or a combination of software and hardware, for example, the various means of the present application may be implemented using Application Specific Integrated Circuits (ASICs) or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (6)

1. A method for predicting head pose, wherein the method comprises:
decoding the audio to obtain decoded audio data;
analyzing the decoded audio data and determining the position of a sound source at the next moment;
wherein:
if the decoded audio data is stereo, analyzing the sound time difference of left and right channels of the stereo; determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user;
if the decoded audio data is panoramic sound, determining the coordinates of a sound source in the panoramic sound at the next moment as the position of the sound source at the next moment;
and predicting the head posture of the user at the next moment according to the sound source position at the next moment.
2. The method of claim 1, wherein the method further comprises:
determining a field angle FOV at the next moment according to the predicted head posture of the user at the next moment;
and downloading the video content with high definition in the field angle FOV range at the next moment in advance.
3. An apparatus for predicting head pose, wherein the apparatus comprises:
decoding means for decoding the audio to obtain decoded audio data;
analyzing means for analyzing the decoded audio data and determining a sound source position at a next time; wherein:
if the decoded audio data is stereo, analyzing the sound time difference of left and right channels of the stereo; determining the sound source position at the next moment according to the time difference from the sound source to the left ear and the right ear of the user;
if the decoded audio data is panoramic sound, determining the coordinates of a sound source in the panoramic sound at the next moment as the sound source position at the next moment;
and the predicting device is used for predicting the head posture of the user at the next moment according to the sound source position at the next moment.
4. The apparatus of claim 3, wherein the apparatus further comprises a downloading means for:
determining a field angle FOV at the next moment according to the predicted head posture of the user at the next moment;
and downloading the video content with high definition in the field angle FOV range at the next moment in advance.
5. A computer readable storage medium storing computer code which, when executed, performs the method of any of claims 1-2.
6. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.
CN201911166426.3A (priority and filing date 2019-11-25): Method and device for predicting head posture. Active. Granted as CN111091848B (en).

Priority Applications (1)

Application Number: CN201911166426.3A | Priority Date: 2019-11-25 | Filing Date: 2019-11-25 | Title: Method and device for predicting head posture

Applications Claiming Priority (1)

Application Number: CN201911166426.3A | Priority Date: 2019-11-25 | Filing Date: 2019-11-25 | Title: Method and device for predicting head posture

Publications (2)

Publication Number | Publication Date
CN111091848A (en) | 2020-05-01
CN111091848B (en) | 2022-09-30

Family

ID=70393959

Family Applications (1)

Application Number: CN201911166426.3A (Active, granted as CN111091848B) | Title: Method and device for predicting head posture

Country Status (1)

Country Link
CN (1) CN111091848B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102543099A (en) * | 2010-12-24 | 2012-07-04 | Sony Corporation | Sound information display device, sound information display method, and program
CN108141696A (en) * | 2016-03-03 | 2018-06-08 | Google LLC | Systems and methods for spatial audio adjustment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US11164606B2 (en) * | 2017-06-30 | 2021-11-02 | Qualcomm Incorporated | Audio-driven viewport selection


Also Published As

Publication Number | Publication Date
CN111091848A (en) | 2020-05-01


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CP01: Change in the name or title of a patent holder
  Address after: 102600 305-9, Floor 3, Building 6, Yard 10, Kegu 1st Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)
  Patentee after: Beijing dream bloom Technology Co.,Ltd.
  Address before: 102600 305-9, Floor 3, Building 6, Yard 10, Kegu 1st Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)
  Patentee before: Beijing iqiyi Intelligent Technology Co.,Ltd.
CP03: Change of name, title or address
  Address after: 102600 305-9, Floor 3, Building 6, Yard 10, Kegu 1st Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)
  Patentee after: Beijing iqiyi Intelligent Technology Co.,Ltd.
  Address before: 401133 room 208, 2/F, 39 Yonghe Road, Yuzui Town, Jiangbei District, Chongqing
  Patentee before: CHONGQING IQIYI INTELLIGENT TECHNOLOGY Co.,Ltd.
  Address after: 266400 Room 302, building 3, Office No. 77, Lingyan Road, Huangdao District, Qingdao, Shandong Province
  Patentee after: Qingdao Dream Blossom Technology Co.,Ltd.
  Address before: 102600 305-9, Floor 3, Building 6, Yard 10, Kegu 1st Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing (Yizhuang Cluster, High end Industrial Zone, Beijing Pilot Free Trade Zone)
  Patentee before: Beijing dream bloom Technology Co.,Ltd.
PP01: Preservation of patent right
  Effective date of registration: 2023-10-09
  Granted publication date: 2022-09-30
PD01: Discharge of preservation of patent
  Date of cancellation: 2023-11-29
  Granted publication date: 2022-09-30