CN116246043B - Method, device, equipment and storage medium for presenting augmented reality audiovisual content - Google Patents

Method, device, equipment and storage medium for presenting augmented reality audiovisual content

Info

Publication number
CN116246043B
Authority
CN
China
Prior art keywords
user
target
scene
gaze direction
current position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310094822.XA
Other languages
Chinese (zh)
Other versions
CN116246043A (en)
Inventor
莫建清
何汉武
刘聪
洪杨
蔡武鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202310094822.XA
Publication of CN116246043A
Application granted
Publication of CN116246043B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 - Manipulating 3D models or images for computer graphics
    • G06T 19/006 - Mixed reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 - Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 - Querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 - Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/487 - Retrieval characterised by using metadata, using geographical or spatial information, e.g. location
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Graphics (AREA)
  • Library & Information Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of computers and discloses a method for presenting augmented reality audiovisual content, which comprises the following steps: modeling a target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where a user is located; tracking and positioning the user with the visual sensing devices to obtain the current position and gaze direction of the user; determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction; searching a preset audiovisual content library based on the scene object set to obtain target audiovisual content; determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction; and presenting the target audiovisual content to the user with the target display terminal and the target audio player. A novel method of presenting audiovisual content in augmented reality is thereby provided.

Description

Method, device, equipment and storage medium for presenting augmented reality audiovisual content
Technical Field
The present invention relates to the field of augmented reality technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for presenting augmented reality audiovisual content.
Background
Augmented reality overlays virtual information on the real world in real time, presenting augmented content through head-mounted displays, handheld smart display devices, and similar means to achieve virtual-real fusion; it is commonly used in applications such as scene-based teaching, information guidance, and entertainment. In current applications, both the content of the augmenting information and the time at which it is presented are mostly preset manually, which biases the technology toward static scenes and fixed-procedure operation guidance and makes it difficult to handle multiple tasks, multiple user demands, or dynamic scenes; moreover, in some settings wearable or handheld devices are unsuitable or not recommended. To adapt to dynamic scenes, some current research generates augmenting information by using machine vision to recognize text and other information in the scene. Augmented reality nevertheless still has many directions worth exploring, such as how to perform 3D reconstruction of a scene, how to interact with the user, how to detect the user's gaze point, how to identify the scene objects in the user's current field of view, and how to generate audiovisual content associated with those scene objects.
Disclosure of Invention
To solve at least one of the above technical problems, a first aspect of the present invention discloses a method for presenting augmented reality audiovisual content, the method comprising:
modeling a target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where a user is located;
tracking and positioning the user by utilizing the vision sensing equipment to obtain the current position and the gaze direction of the user;
determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
searching a preset audiovisual content library based on the scene object set to obtain target audiovisual content;
determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction;
and presenting the target audiovisual content to the user by using the target display terminal and the target audio player.
In an optional embodiment, the modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located includes:
performing dense point cloud reconstruction on a plurality of target video streams acquired by a plurality of visual sensing devices in a target scene where a user is located;
performing instance segmentation on point cloud data, creating bounding boxes of each scene object in the target scene, and organizing bounding boxes of all scene objects in a hierarchical bounding box mode;
creating a scene object database, wherein the scene object database is used for storing information of each scene object, and the information of the scene objects in the scene object database is updated in real time.
In an optional embodiment, the tracking and positioning the user by using the vision sensing device to obtain the current position and the gaze direction of the user includes:
tracking the user based on the measurable ranges of the visual sensing devices mapped into a 2D map;
when the position of the user is detected to be fixed, determining the current position of the user, and scheduling the vision sensing equipment to acquire the face image of the user;
and determining the gaze direction of the user based on the face image.
In an optional embodiment, when the number of the visual sensing devices acquiring the face image is greater than or equal to two, the face image is analyzed by using a deep neural network to obtain the gaze direction of the user;
and when the number of the visual sensing devices acquiring the face image is fewer than two, a bounding box model of the head of the user is created based on the face image, and the normal direction of the front end face of the bounding box model is taken as the gaze direction of the user.
In an optional embodiment, the determining a set of scene objects in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction includes:
determining a view cone representing the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
determining a set of scene objects in the field of view of the user from the scene objects that are within the range of the view cone.
In an alternative embodiment, the audiovisual content library comprises a video library, an audio library, and a text library, and the searching of the preset audiovisual content library based on the scene object set to obtain target audiovisual content comprises the following steps:
generating target retrieval conditions based on the scene object set, the user model of the user and interaction requirements;
searching the video library and the audio library based on the target retrieval conditions;
if matched video and/or audio is found, using the found video and/or audio as the target audiovisual content;
and if no matched video or audio is found, searching the text library based on the target retrieval conditions and converting the retrieved text information into audio to serve as the target audiovisual content.
In an optional embodiment, the determining, according to the current position and the gaze direction, a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene includes:
determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene based on preset pose conditions and/or visibility conditions and/or distance conditions, wherein the pose conditions, the visibility conditions, and the distance conditions are all related to the current position and the gaze direction.
A second aspect of the invention discloses a presentation device for augmented reality audiovisual content, the device comprising:
the modeling module is used for modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located;
The positioning module is used for tracking and positioning the user by utilizing the vision sensing equipment to obtain the current position and the gaze direction of the user;
a scene object set determining module, configured to determine a scene object set in a field of view of the user from modeling of the target scene according to the current position and the gaze direction;
the retrieval module is used for retrieving a preset audiovisual content library based on the scene object set to obtain target audiovisual content;
the equipment determining module is used for determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction;
and the presentation module is used for presenting the target audiovisual content to the user by utilizing the target display terminal and the target audio player.
A third aspect of the invention discloses a computer device comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to perform some or all of the steps in the method for presenting augmented reality audiovisual content disclosed in the first aspect of the invention.
A fourth aspect of the invention discloses a computer storage medium storing computer instructions which, when invoked, are adapted to perform some or all of the steps of the method for presenting augmented reality audiovisual content disclosed in the first aspect of the invention.
In the embodiment of the invention, the target scene is first modeled based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located. The visual sensing devices are used to track and position the user to obtain the user's current position and gaze direction. A scene object set in the user's field of view is determined from the modeling of the target scene according to the current position and the gaze direction, and a preset audiovisual content library is searched based on this scene object set to obtain target audiovisual content. A target display terminal and a target audio player are then determined from the plurality of display terminals and audio players in the target scene according to the current position and the gaze direction, and finally the target audiovisual content is presented to the user through them. Interaction can thus be driven by the user's position and gaze direction, the audiovisual content to be presented can be retrieved accordingly, and a suitable presentation terminal can be selected, providing a novel method of presenting augmented reality audiovisual content and opening more possibilities for realizing augmented reality technology.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow diagram of a method of presenting augmented reality audiovisual content in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating one hardware device configuration for implementing the presentation method disclosed in FIG. 1;
FIG. 3 is a schematic diagram illustrating a retrieval process in the presentation method disclosed in FIG. 1;
FIG. 4 is a schematic diagram illustrating a process of determining a display terminal and an audio player in the presentation method disclosed in FIG. 1;
FIG. 5 is a schematic diagram illustrating a presentation process of audiovisual content in the presentation method disclosed in FIG. 1;
fig. 6 is a schematic structural diagram of a presentation device for augmented reality audiovisual content according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish different objects and are not necessarily used to describe a particular sequence or chronological order. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, product, or device that comprises a list of steps or elements is not limited to those listed, but may optionally include other steps or elements not listed or inherent thereto.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The embodiments of the application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The invention discloses a method, an apparatus, a computer device, and a storage medium for presenting augmented reality audiovisual content. The target scene is first modeled based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located; the visual sensing devices are used to track and position the user to obtain the current position and gaze direction; a scene object set in the user's field of view is determined from the modeling of the target scene according to the current position and the gaze direction; a preset audiovisual content library is searched based on the scene object set to obtain target audiovisual content; a target display terminal and a target audio player are determined from the plurality of display terminals and audio players in the target scene according to the current position and the gaze direction; and finally the target audiovisual content is presented to the user through the target display terminal and the target audio player. Interaction can thus be based on the user's position and gaze direction, with the audiovisual content retrieved and an appropriate presentation terminal selected accordingly, providing a novel method of presenting augmented reality audiovisual content. The method is described in detail below.
Example 1
Referring to fig. 1, fig. 1 is a flow chart illustrating a method for presenting augmented reality audiovisual content according to an embodiment of the present invention. As shown in fig. 1, the method of presenting augmented reality audiovisual content may include the operations of:
101. Modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located.
In an optional embodiment, the modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located includes:
performing dense point cloud reconstruction on a plurality of target video streams acquired by a plurality of visual sensing devices in a target scene where a user is located;
performing instance segmentation on point cloud data, creating bounding boxes of each scene object in the target scene, and organizing bounding boxes of all scene objects in a hierarchical bounding box mode;
creating a scene object database, wherein the scene object database is used for storing information of each scene object, and the information of the scene objects in the scene object database is updated in real time.
As shown in fig. 2, fig. 2 illustrates one hardware device configuration for implementing the presentation method disclosed in fig. 1. The configuration may mainly include: a server for storing and processing the data involved in implementing the technical solution of the application; a wireless network for information transmission; visual sensing devices for acquiring video streams of the scene where the user is located, which may include a plurality of cameras installed at fixed positions as well as handheld devices, wearable devices, or visual sensors mounted on mobile robots or mobile vehicles; one or more display terminals operable to receive video signals; and one or more audio players operable to receive audio signals. According to a sensing device scheduling algorithm, the server sends scheduling instructions (which may include pose adjustment instructions and image acquisition instructions) to one or more designated visual sensing devices through the wireless network. After acquiring image signals, a visual sensing device sends its video stream data to the server. After generating video and/or audio, the server transmits it to one or more display terminals and audio players according to an audiovisual device scheduling algorithm. The server can also query the status of the visual sensing devices, display terminals, and audio players as required.
In the embodiment of the invention, a plurality of video streams of the scene can be acquired with a plurality of fixedly installed cameras; dense point cloud reconstruction is then performed, the point cloud data is instance-segmented, a bounding box is created for each scene object (i.e., each object in the scene, such as a table), and the bounding boxes of all scene objects are organized as a hierarchical bounding box structure. A scene object database may also be created to manage the detected scene objects, storing each object's name, purpose, attributes, bounding box center coordinates, and so on. A scene reconstruction and processing program can run in a loop in the background of the server to continuously update the scene information. Specifically, if a new scene object is detected, a record representing that object is added to the database. If a detected scene object already exists in the database but its attributes, such as pose or state, have changed, the relevant record is updated. If a scene object in the database can no longer be found by traversing the entire scene, it is deleted from the database.
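A minimal sketch of one pass of this maintenance loop is given below (Python; the `SceneObject` fields and the dict keyed by object ID are illustrative assumptions rather than the patent's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    object_id: str
    name: str
    purpose: str = ""
    attributes: dict = field(default_factory=dict)    # e.g., pose, state
    bbox_center: tuple = (0.0, 0.0, 0.0)

def update_scene_database(db: dict, detected: list) -> None:
    """One pass of the background maintenance loop: add new objects, update
    changed ones, and delete objects no longer found in the scene."""
    seen = set()
    for obj in detected:
        seen.add(obj.object_id)
        stored = db.get(obj.object_id)
        if stored is None:
            db[obj.object_id] = obj                   # new scene object: add a record
        elif (stored.attributes != obj.attributes
              or stored.bbox_center != obj.bbox_center):
            db[obj.object_id] = obj                   # pose/state changed: update
    for object_id in list(db):
        if object_id not in seen:
            del db[object_id]                         # not found in the scene: delete
```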
102. Tracking and positioning the user by using the visual sensing devices to obtain the current position and the gaze direction of the user.
In an optional embodiment, the tracking and positioning the user by using the vision sensing device to obtain the current position and the gaze direction of the user includes:
tracking the user based on the measurable ranges of the visual sensing devices mapped into a 2D map;
when the position of the user is detected to be fixed, determining the current position of the user, and scheduling the vision sensing equipment to acquire the face image of the user;
and determining the gaze direction of the user based on the face image.
In an optional embodiment, when the number of the visual sensing devices acquiring the face image is greater than or equal to two, the face image is analyzed by using a deep neural network to obtain the gaze direction of the user;
and when the number of the visual sensing devices acquiring the face image is fewer than two, a bounding box model of the head of the user is created based on the face image, and the normal direction of the front end face of the bounding box model is taken as the gaze direction of the user.
In the embodiment of the invention, no wearable or handheld positioning and display device is required; an outside-in tracking and positioning mode is adopted. Specifically, a plurality of cameras can be arranged in the scene, with more than one camera collecting video streams of the user from different angles, identifying the user, and recording the user's motion trajectory. It may be required that at least two cameras capture the user at any given time. For better camera scheduling, the current measurable range of each camera can be mapped into a 2D map, with the observable state represented by 0 or 1 and stored in a three-dimensional array R = {r(c, i, j)}; for example, r(2, 6, 10) = 1 denotes that the 2D grid cell with coordinates (6, 10) lies within the observable range of camera No. 2. If the number of cameras that can observe the user's current position is found to be less than two, or the number of cameras that will be able to observe the user's position at the next moment is predicted to be less than two, an idle camera can be scheduled: a rotation transformation matrix is calculated from the positional relation between the camera center and the user to obtain the camera's pose adjustment parameters, and the camera's pose is adjusted so that it captures the user, thereby maintaining good tracking.
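The observability bookkeeping can be sketched as follows (the two-camera threshold mirrors the description above; the number of cameras and the grid resolution are assumed example values):

```python
import numpy as np

n_cams, grid_h, grid_w = 8, 64, 64                  # assumed example sizes
r = np.zeros((n_cams, grid_h, grid_w), dtype=np.uint8)
r[2, 6, 10] = 1      # camera No. 2 can observe the 2D grid cell (6, 10)

def observing_cameras(r: np.ndarray, cell: tuple) -> np.ndarray:
    """IDs of all cameras whose measurable range covers the given grid cell."""
    i, j = cell
    return np.flatnonzero(r[:, i, j])

def needs_rescheduling(r: np.ndarray, cell: tuple, min_cams: int = 2) -> bool:
    """True when fewer than `min_cams` cameras can observe the user's cell, in
    which case an idle camera should be re-aimed (its pose adjustment computed
    from the camera-user geometry, as described above)."""
    return observing_cameras(r, cell).size < min_cams
```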
When the user's position on the map stops changing, at least two cameras are scheduled from the camera list and their poses are adjusted to aim at the user's face, acquiring a set of face images containing the user's eyes. This image set is input into a pre-trained deep neural network, which regresses a gaze vector. The gaze vector is then intersected with the previously reconstructed 3D scene to obtain the gaze point coordinates. If fewer than two cameras are aimed at the user's face, point cloud information of the user's head is obtained by a multi-view reconstruction method, a bounding box model of the head is created, and the normal direction of the front face of the bounding box (the facing direction) is taken as the gaze direction. In this way the system can adapt flexibly and still determine the gaze direction accurately when the number of cameras aimed at the user's face is uncertain.
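A sketch of the dispatch between the two gaze-estimation strategies (the regression network itself is left as a stub, since the patent does not specify its architecture):

```python
import numpy as np

def gaze_from_dnn(face_images):
    """Stand-in for the pre-trained deep network that regresses a gaze vector
    from a set of eye-containing face images (not specified by the patent)."""
    raise NotImplementedError("plug in the trained gaze-regression model here")

def gaze_from_head_box(front_face_normal):
    """Fallback: use the (unit-normalized) normal of the front face of the
    head bounding box as the gaze direction."""
    n = np.asarray(front_face_normal, dtype=float)
    return n / np.linalg.norm(n)

def estimate_gaze(face_cam_count, face_images=None, front_face_normal=None):
    """Dispatch between the two strategies according to how many cameras
    currently have the user's face in view."""
    if face_cam_count >= 2:
        return gaze_from_dnn(face_images)
    return gaze_from_head_box(front_face_normal)
```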
103. Determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction.
In an optional embodiment, the determining a set of scene objects in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction includes:
determining a view cone representing the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
determining a set of scene objects in the field of view of the user from the scene objects that are within the range of the view cone.
In the embodiment of the invention, the view cone representing the user's visual range can first be defined; the hierarchical bounding box structure of the scene objects is then traversed, and the scene objects entering the user's view cone are placed into the scene object set. A visibility test can then be performed according to the user's gaze direction, so that scene objects invisible to the user are removed from the scene object set.
The view cone can take the midpoint of the line connecting the centers of the two pupils as its apex and the gaze direction as its centerline; based on the limiting angle of the human visual field and the viewing angle under focused attention, the default horizontal view angle is set to +/-36 degrees and the vertical view angle to +/-20 degrees. These parameters may be user-defined or determined from feedback on the user's preferences during use.
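A possible implementation of the view-cone membership test with these default angles (a sketch assuming the gaze is not parallel to the chosen `up` vector):

```python
import numpy as np

def in_view_cone(apex, gaze, point, h_half_deg=36.0, v_half_deg=20.0,
                 up=(0.0, 0.0, 1.0)):
    """Test whether a point (e.g., a bounding box center or corner) lies inside
    the user's view cone: apex at the midpoint between the pupils, axis along
    the gaze, half-angles of 36 deg horizontally and 20 deg vertically."""
    gaze = np.asarray(gaze, float)
    gaze = gaze / np.linalg.norm(gaze)
    right = np.cross(gaze, up)                  # assumes gaze not parallel to `up`
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, gaze)
    d = np.asarray(point, float) - np.asarray(apex, float)
    x, y, z = d @ right, d @ true_up, d @ gaze  # view-space coordinates
    if z <= 0.0:
        return False                            # behind the user
    return (np.degrees(np.arctan2(abs(x), z)) <= h_half_deg
            and np.degrees(np.arctan2(abs(y), z)) <= v_half_deg)
```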
Further, the objects in the scene object set may be ordered and placed into a priority queue. The object the user is gazing at may be given the highest priority, and the remaining objects may be prioritized as follows (one reading of this ordering is sketched after the list):
1) By the distance between the user and the scene object: the smaller the distance, the higher the priority. The distance is the straight-line distance from the apex of the user's view cone to the nearest point of the scene object's bounding box.
2) By the angle: the smaller the angle, the higher the priority. The angle is the included angle between the line connecting the view cone apex to the center of the scene object and the centerline of the user's view cone.
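The patent lists the two criteria without fixing how they combine; the sketch below is one plausible reading that orders lexicographically, with the gazed-at object first, then distance, then the view-axis angle as tie-breaker (the dict fields "id", "center", and "nearest_dist" are assumed):

```python
import numpy as np

def prioritize(objects, apex, gaze, gazed_id=None):
    """Order the in-view objects for the priority queue: the gazed-at object
    first, then by apex-to-bounding-box distance, then by angle to the
    view-cone centerline."""
    apex = np.asarray(apex, float)
    axis = np.asarray(gaze, float)
    axis = axis / np.linalg.norm(axis)

    def key(obj):
        if obj["id"] == gazed_id:
            return (0, 0.0, 0.0)                # gazed-at target: highest priority
        to_center = np.asarray(obj["center"], float) - apex
        cos_a = to_center @ axis / np.linalg.norm(to_center)
        angle = np.arccos(np.clip(cos_a, -1.0, 1.0))
        return (1, obj["nearest_dist"], angle)

    return sorted(objects, key=key)
```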
104. Searching a preset audiovisual content library based on the scene object set to obtain target audiovisual content.
In an alternative embodiment, the audiovisual content library comprises a video library, an audio library, and a text library, and the searching of the preset audiovisual content library based on the scene object set to obtain target audiovisual content comprises the following steps:
generating target retrieval conditions based on the scene object set, the user model of the user, and interaction requirements;
searching the video library and the audio library based on the target retrieval conditions;
if matched video and/or audio is found, using the found video and/or audio as the target audiovisual content;
and if no matched video or audio is found, searching the text library based on the target retrieval conditions and converting the retrieved text information into audio to serve as the target audiovisual content.
As shown in fig. 3, fig. 3 is a schematic diagram of the retrieval process in the presentation method disclosed in fig. 1. Audiovisual content can be generated both by database retrieval and by text-to-audio conversion. The video library, audio library, and text library are stored on the server side or in the cloud. After an audiovisual content generation instruction is received, the scene object queue, the current user model, and the interaction requirements are read to generate the retrieval conditions; video and audio are retrieved preferentially from the video and audio libraries, and if no video or audio can be matched, text information meeting the conditions is retrieved from the text library and converted into an audio file by the audio conversion module.
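The retrieval cascade can be sketched as follows (the flat dictionary matcher is a deliberately naive stand-in for the semantic retrieval described below, and `text_to_audio` represents the audio conversion module):

```python
def matches(condition: dict, item: dict) -> bool:
    """Naive metadata matcher; a stand-in for the semantic retrieval over the
    structured records described below."""
    return all(item.get(k) == v for k, v in condition.items())

def retrieve_audiovisual(condition, video_lib, audio_lib, text_lib, text_to_audio):
    """Search video and audio first; fall back to text plus audio conversion."""
    hits = [f for lib in (video_lib, audio_lib) for f in lib if matches(condition, f)]
    if hits:
        return hits                                  # matched video and/or audio
    texts = [t for t in text_lib if matches(condition, t)]
    return [text_to_audio(t) for t in texts]         # convert retrieved text to audio
```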
The generated audiovisual content may relate to the scene objects in the user's field of view, and the audio/video files entering the recommendation list may also take into account the current user's model (profile) and interaction requirements. For example, for the scene object "refrigerator", the associated audio and video may include an introductory animation about refrigerator knowledge, the pinyin and associated words for "refrigerator", its English word and related sentences, refrigerator maintenance guidelines, a display of the refrigerator's current status, and so on. If the current user is a preschool child and the interaction requirement is early education, introductory animations, videos, or audio about refrigerator knowledge are recommended preferentially.
To enable complex semantic retrieval, the scene objects, the video/audio/text information, the user models, and the interaction requirement information can be further structured, as follows (a sketch of such records appears after the list):
1) The data of a scene object may include an object ID, an object name, aliases, an object description, and the like.
2) The data of the video/audio/text information may include an ID, a file name, a URL, a category, a topic, content, applicable scenes, a user matching degree, and the like. The ID is the number of the video/audio/text file; the category indicates whether the file is video, audio, or text; the topic and content summarize the file; the applicable scenes include learning, operation guidance, entertainment, sports, and the like; and the user matching degree reflects the closeness (a value between 0 and 1) of the file's content to different types of users, obtained through a fully connected neural network. The training data set consists of labeled video/audio/text files, and the fully connected neural network outputs, for a test file, a score against each of user model 1 through user model n; these scores are the closeness values.
3) The data of a user model may include gender, age, occupation, ability level, health, personality, preferences, and the like. By storing templates of multiple user models, a user may directly select a template or configure a personalized model on the basis of one.
4) The data of the interaction requirements may include an interaction category and an application scenario. The interaction categories include video and voice. Application scenarios include learning, operation guidance, entertainment, and sports; learning is subdivided into early-childhood education, professional course study, and vocational training; operation guidance includes equipment operation guidance, process operation guidance, and assistance for the disabled; entertainment includes music, dance, games, and the like.
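Such structured records might look like the following dataclasses (the field names are illustrative; the patent fixes only the kinds of information, not a concrete schema):

```python
from dataclasses import dataclass, field

@dataclass
class MediaRecord:                 # one video/audio/text metadata record
    media_id: str
    file_name: str
    url: str
    category: str                  # "video", "audio", or "text"
    topic: str = ""
    content_summary: str = ""
    applicable_scenes: list = field(default_factory=list)  # e.g., ["learning"]
    user_match: dict = field(default_factory=dict)  # user-model id -> closeness in [0, 1]

@dataclass
class UserModel:
    gender: str = ""
    age: int = 0
    occupation: str = ""
    ability_level: str = ""
    preferences: list = field(default_factory=list)

@dataclass
class InteractionRequirement:
    category: str = "video"        # interaction category: "video" or "voice"
    scenario: str = "learning"     # e.g., "learning", "operation guidance"
```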
In the embodiment of the invention, the audiovisual content is retrieved along multiple dimensions, so that the content finally presented is more suitable and accurate, improving the user experience of the augmented reality technology.
105. Determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction.
In an optional embodiment, the determining, according to the current position and the gaze direction, a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene includes:
determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene based on preset pose conditions and/or visibility conditions and/or distance conditions, wherein the pose conditions, the visibility conditions, and the distance conditions are all related to the current position and the gaze direction.
As shown in fig. 4, fig. 4 is a schematic diagram of the process of determining the display terminal and the audio player in the presentation method disclosed in fig. 1. The IDs of all display terminals whose state is available may be put into a candidate list L = {v_i | i = 1, 2, …, n}, n being a positive integer. First, the cosine of the angle between the display plane of each display terminal in the candidate list and the user's gaze direction is calculated. Let the user's current gaze direction vector be V, and for each display terminal v_i in the list let the normal vector of its display plane be N_i. If the pose of the i-th display terminal v_i is not adjustable, the cosine is calculated directly as c_i = -V·N_i, where V·N_i is the dot product of the current gaze direction vector V and the display plane normal N_i. If the pose of the i-th display terminal v_i is adjustable, a normal N_i can be sought within the adjustable range R of the pose that makes c_i largest, i.e. c_i = max(-V·N_i) s.t. N_i ∈ R, and the adjusted pose p_i of the i-th display terminal v_i is temporarily stored. If the value of c_i is smaller than a given threshold T (default value 0), v_i is deleted from the candidate list. Otherwise, it is tested whether the display plane is visible to the user, i.e. whether an occluding object is present between the display plane and the user; if there is occlusion between the user and the i-th display terminal v_i, v_i is deleted from the candidate list. The display terminals remaining in the candidate list L are then sorted in descending order of c_i, so that the most directly facing terminals come first.
The display terminals in the candidate list L are examined in turn. For the current display terminal v_i, its recommended distance range [d_min, d_max] is queried from the device information table; if the distance d_i between display terminal v_i and the user satisfies d_i ∈ [d_min, d_max], display terminal v_i is selected as the final display terminal. If the temporarily stored pose p_i of the display terminal differs from its current pose, a signal is sent to adjust the current pose to p_i.
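One pass of this display-selection procedure might look like the sketch below. Each terminal is a dict with assumed fields ("adjustable", "normal", "candidate_normals", "position", "d_min", "d_max"); a continuous adjustable range R is approximated by the discrete set `candidate_normals`, `occluded` stands in for the unspecified visibility test, and the descending sort follows the reading described above:

```python
import numpy as np

def select_display(terminals, gaze, user_pos, occluded, T=0.0):
    """Score each candidate by c = -V . N, drop back-facing or occluded screens,
    then pick the best-scored terminal within its recommended viewing range."""
    V = np.asarray(gaze, float)
    V = V / np.linalg.norm(V)
    user_pos = np.asarray(user_pos, float)
    scored = []
    for t in terminals:
        if t.get("adjustable"):
            # choose the admissible normal that best faces the user: max of -V.N
            c, best = max(((-(V @ np.asarray(N, float)), N)
                           for N in t["candidate_normals"]), key=lambda p: p[0])
            t["target_normal"] = best        # pose to command if this terminal wins
        else:
            c = -(V @ np.asarray(t["normal"], float))
        if c < T or occluded(user_pos, t):   # back-facing or blocked: discard
            continue
        scored.append((c, t))
    for c, t in sorted(scored, key=lambda p: -p[0]):          # most head-on first
        d = float(np.linalg.norm(np.asarray(t["position"], float) - user_pos))
        if t["d_min"] <= d <= t["d_max"]:    # within the recommended viewing range
            return t
    return None
```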
For the selection of the audio player, the distance between each audio player and the user may be calculated, and the players stored in a sound source queue ordered from nearest to farthest. The audio players are then taken from the queue one by one, and the current audio player is enabled if it is in the same enclosed space as the user.
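The corresponding nearest-first audio-player scan (`same_enclosed_space` is an assumed predicate, e.g., a room-membership lookup):

```python
import numpy as np

def select_audio_player(players, user_pos, same_enclosed_space):
    """Enable the closest player that shares an enclosed space (e.g., the same
    room) with the user."""
    user_pos = np.asarray(user_pos, float)
    by_distance = sorted(
        players,
        key=lambda p: np.linalg.norm(np.asarray(p["position"], float) - user_pos))
    for player in by_distance:
        if same_enclosed_space(player, user_pos):
            return player
    return None
```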
106. Presenting the target audiovisual content to the user by using the target display terminal and the target audio player.
After the display terminal and the audio player have been selected, the retrieved audiovisual content can be presented to the user through them, improving the presentation of the audiovisual content, the realization of augmented reality, and the user experience. As shown in fig. 5, fig. 5 is a schematic diagram of the presentation process of the audiovisual content in the presentation method disclosed in fig. 1. The audiovisual content may be presented by one or more display terminals and/or audio players. Display terminals include handheld display devices, various display screens, projectors, and the like. Some display terminals can adjust their pose in response to signals so that users at different positions can view them at a better angle. The data related to each audiovisual presentation device is stored in a device information table whose fields include device ID, name, category, port, IP address, state, current pose, whether the pose is adjustable, and so on. The presentation of audiovisual content may depend on the user's current position and gaze direction, and audiovisual device scheduling is an important link whose tasks mainly include determining the candidate device list, visibility calculation, pose adjustment, and device selection.
Optionally, the method for presenting augmented reality audiovisual content disclosed in the embodiment of the invention can also be applied to multi-user scenes. For a multi-user scene, the scheme can be further optimized as follows:
the camera scheduling scheme may be modified when performing user tracking positioning and gaze point estimation. One is to use redundant cameras so that each user at any time enters the acquisition range of at least two cameras. Secondly, a time-sharing method is adopted, namely a discontinuous mode is adopted to collect the video stream of the user, camera resources are fully utilized, but discontinuous user tracks can be caused, and the video stream can be corrected through a track interpolation algorithm.
When determining the scene object sets in the users' fields of view, a scene object set can be created for each user and the scene objects in each user's field of view obtained separately; if the fields of view of different users overlap, the scene objects in the overlapping area need only be identified once.
When generating the audiovisual content recommendation lists, a recommendation list may be created for each user. Each user's scene object list, user model, and interaction requirements are input separately, and the server-side processing program generates the audiovisual content recommendation tables for the different users.
When presenting the audiovisual content, the available audiovisual presentation devices can be queried from the device information table stored on the server according to each user's current position and gaze point, and presentation devices can be selected separately for the different users. If interference or conflict between the presentation devices of different users is detected, the audiovisual content of the user who arrived at the current position earlier is presented preferentially, on a first-come-first-served basis.
In the embodiment of the invention, the target scene is first modeled based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located; the visual sensing devices are used to track and position the user to obtain the current position and gaze direction; the scene object set in the user's field of view is determined from the modeling of the target scene according to the current position and the gaze direction; a preset audiovisual content library is searched based on the scene object set to obtain the target audiovisual content; a target display terminal and a target audio player are determined from the plurality of display terminals and audio players in the target scene according to the current position and the gaze direction; and finally the target audiovisual content is presented to the user through the target display terminal and the target audio player. Interaction can thus be performed based on the user's position and gaze direction, the audiovisual content to be presented retrieved accordingly, and a suitable presentation terminal selected, providing a novel method of presenting augmented reality audiovisual content and more possibilities for realizing augmented reality technology.
Example 2
Referring to fig. 6, fig. 6 is a schematic structural diagram of a presentation device for augmented reality audiovisual content according to an embodiment of the present invention. As shown in fig. 6, the presentation apparatus of the augmented reality audiovisual content may include:
the modeling module 601 is configured to model a target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where a user is located;
a positioning module 602, configured to track and position the user by using the vision sensing device, so as to obtain a current position and a gaze direction of the user;
a scene object set determining module 603, configured to determine a scene object set in a field of view of the user from modeling of the target scene according to the current position and the gaze direction;
the retrieval module 604 is configured to retrieve a preset audiovisual content library based on the scene object set to obtain target audiovisual content;
a device determining module 605, configured to determine a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction;
a presentation module 606 for presenting the target audiovisual content to the user using the target display terminal and the target audio player.
For a detailed description of the apparatus for presenting augmented reality audiovisual content, reference may be made to the detailed description of the method for presenting augmented reality audiovisual content above; to avoid repetition, it is not repeated here.
Example 3
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the invention. As shown in fig. 7, the computer device may include:
a memory 701 storing executable program code;
a processor 702 connected to the memory 701;
the processor 702 invokes the executable program code stored in the memory 701 to perform the steps in the method for presenting augmented reality audiovisual content disclosed in the first embodiment of the present invention.
Example 4
Referring to fig. 8, an embodiment of the present invention discloses a computer storage medium 801, where the computer storage medium 801 stores computer instructions that, when called, are used to execute the steps in the method for presenting augmented reality audiovisual content disclosed in the embodiments of the present invention.
The apparatus embodiments described above are merely illustrative. Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the foregoing technical solutions may be embodied essentially in the form of a software product that may be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, tape storage, or any other computer-readable medium that can carry or store data.
Finally, it should be noted that the method, apparatus, computer device, and storage medium for presenting augmented reality audiovisual content disclosed in the embodiments of the present invention are preferred embodiments, used only to illustrate the technical solutions of the invention rather than to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the various embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (8)

1. A method of rendering augmented reality audiovisual content, the method comprising:
modeling a target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where a user is located;
tracking and positioning the user by utilizing the vision sensing equipment to obtain the current position and the gaze direction of the user;
determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
searching a preset audiovisual content library based on the scene object set to obtain target audiovisual content;
determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction;
presenting the target audiovisual content to the user using the target display terminal and the target audio player;
the step of tracking and positioning the user by using the vision sensing device to obtain the current position and the gaze direction of the user comprises the following steps:
tracking the user based on the measurable ranges of the visual sensing devices mapped into a 2D map;
when the position of the user is detected to be fixed, determining the current position of the user, and scheduling the vision sensing equipment to acquire the face image of the user;
determining a gaze direction of the user based on the face image;
the determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction includes:
determining a view cone representing the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
determining a set of scene objects in the field of view of the user from the scene objects that are within the range of the view cone.
2. The method for presenting augmented reality audiovisual content according to claim 1, wherein the modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene in which the user is located comprises:
performing dense point cloud reconstruction on a plurality of target video streams acquired by a plurality of visual sensing devices in a target scene where a user is located;
performing instance segmentation on point cloud data, creating bounding boxes of each scene object in the target scene, and organizing bounding boxes of all scene objects in a hierarchical bounding box mode;
creating a scene object database, wherein the scene object database is used for storing information of each scene object, and the information of the scene objects in the scene object database is updated in real time.
3. The method of claim 1, wherein the determining the gaze direction of the user based on the face image comprises:
when the number of the visual sensing devices for acquiring the face images is greater than or equal to two, analyzing the face images by using a deep neural network to obtain the gaze direction of the user;
and when the number of the visual sensing devices acquiring the face image is fewer than two, creating a bounding box model of the head of the user based on the face image, and taking the normal direction of the front end face of the bounding box model as the gaze direction of the user.
4. The method of claim 1, wherein the audiovisual content library comprises a video library, an audio library, and a text library, and wherein the searching of the preset audiovisual content library based on the scene object set to obtain target audiovisual content comprises:
generating target retrieval conditions based on the scene object set, the user model of the user, and interaction requirements;
searching the video library and the audio library based on the target retrieval conditions;
if matched video and/or audio is found, using the found video and/or audio as the target audiovisual content;
and if no matched video or audio is found, searching the text library based on the target retrieval conditions and converting the retrieved text information into audio to serve as the target audiovisual content.
5. The method for presenting augmented reality audiovisual content according to claim 1, wherein the determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction comprises:
determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene based on preset pose conditions and/or visibility conditions and/or distance conditions, wherein the pose conditions, the visibility conditions, and the distance conditions are all related to the current position and the gaze direction.
6. A presentation device for augmented reality audiovisual content, the device comprising:
the modeling module is used for modeling the target scene based on a plurality of target video streams acquired by a plurality of visual sensing devices in the target scene where the user is located;
the positioning module is used for tracking and positioning the user by utilizing the vision sensing equipment to obtain the current position and the gaze direction of the user;
a scene object set determining module, configured to determine a scene object set in a field of view of the user from modeling of the target scene according to the current position and the gaze direction;
the retrieval module is used for retrieving a preset audiovisual content library based on the scene object set to obtain target audiovisual content;
the equipment determining module is used for determining a target display terminal and a target audio player from a plurality of display terminals and a plurality of audio players in the target scene according to the current position and the gaze direction;
A presentation module for presenting the target audiovisual content to the user using the target display terminal and the target audio player;
the step of tracking and positioning the user by using the vision sensing device to obtain the current position and the gaze direction of the user comprises the following steps:
tracking the user based on the measurable ranges of the visual sensing devices mapped into a 2D map;
when the position of the user is detected to be fixed, determining the current position of the user, and scheduling the vision sensing equipment to acquire the face image of the user;
determining a gaze direction of the user based on the face image;
the determining a scene object set in the field of view of the user from the modeling of the target scene according to the current position and the gaze direction includes:
determining a view cone representing the field of view of the user from the modeling of the target scene according to the current position and the gaze direction;
determining a set of scene objects in the field of view of the user from the scene objects that are within the range of the view cone.
7. A computer device, the computer device comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to perform the method of presentation of augmented reality audiovisual content as claimed in any one of claims 1 to 5.
8. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a method of presentation of augmented reality audiovisual content as claimed in any one of claims 1 to 5.
CN202310094822.XA 2023-02-07 2023-02-07 Method, device, equipment and storage medium for presenting augmented reality audiovisual content Active CN116246043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310094822.XA CN116246043B (en) 2023-02-07 2023-02-07 Method, device, equipment and storage medium for presenting augmented reality audiovisual content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310094822.XA CN116246043B (en) 2023-02-07 2023-02-07 Method, device, equipment and storage medium for presenting augmented reality audiovisual content

Publications (2)

Publication Number | Publication Date
CN116246043A (en) | 2023-06-09
CN116246043B (en) | 2023-09-29

Family

ID=86623617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310094822.XA Active CN116246043B (en) 2023-02-07 2023-02-07 Method, device, equipment and storage medium for presenting augmented reality audiovisual content

Country Status (1)

Country Link
CN (1) CN116246043B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270736A (en) * 2020-11-16 2021-01-26 Oppo广东移动通信有限公司 Augmented reality processing method and device, storage medium and electronic equipment
CN114357135A (en) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Interaction method, interaction device, electronic equipment and storage medium
CN114371779A (en) * 2021-12-31 2022-04-19 北京航空航天大学 Visual enhancement method for sight depth guidance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105812778B (en) * 2015-01-21 2018-02-02 成都理想境界科技有限公司 Binocular AR wears display device and its method for information display
GB2539009A (en) * 2015-06-03 2016-12-07 Tobii Ab Gaze detection method and apparatus
CN112150575B (en) * 2020-10-30 2023-09-01 深圳市优必选科技股份有限公司 Scene data acquisition method, model training method and device and computer equipment
WO2022150978A1 (en) * 2021-01-12 2022-07-21 Nvidia Corporation Neighboring bounding box aggregation for neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270736A (en) * 2020-11-16 2021-01-26 Oppo广东移动通信有限公司 Augmented reality processing method and device, storage medium and electronic equipment
CN114357135A (en) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 Interaction method, interaction device, electronic equipment and storage medium
CN114371779A (en) * 2021-12-31 2022-04-19 北京航空航天大学 Visual enhancement method for sight depth guidance

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Hybrid 3D Registration Method of Augmented Reality for Intelligent Manufacturing; XIAN YANG et al.; IEEE Access; Vol. 7; 181867-181883 *
Prospects of New Virtual Reality Audiovisual Technologies in the Internet+ Era; Hou Ming et al.; Modern Film Technology (《现代电影技术》); No. 5; 29-35 *
Vision-Based Fusion of Virtual Reality and Augmented Reality; Ning Ruixin et al.; Science & Technology Review (《科技导报》); Vol. 36, No. 9; 25-31 *

Also Published As

Publication number Publication date
CN116246043A (en) 2023-06-09

Similar Documents

Publication Publication Date Title
US10769438B2 (en) Augmented reality
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN113593351B (en) Working method of three-dimensional comprehensive teaching field system
US11410570B1 (en) Comprehensive three-dimensional teaching field system and method for operating same
US20200005772A1 (en) Interactive method and device of robot, and device
US20230247282A1 (en) Interactive application adapted for use by multiple users via a distributed computer-based system
US20230041730A1 (en) Sound effect adjustment
DeCamp et al. An immersive system for browsing and visualizing surveillance video
WO2017124116A1 (en) Searching, supplementing and navigating media
CN114144790A (en) Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures
JP2021168139A (en) Method, device, apparatus and medium for man-machine interactions
Ochoa et al. Multimodal learning analytics-Rationale, process, examples, and direction
Christensen et al. An interactive computer vision system dypers: Dynamic personal enhanced reality system
US20210166461A1 (en) Avatar animation
CN111414506A (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
CN115933930A (en) Method, terminal and device for analyzing attention of learning object in education meta universe
CN108614872A (en) Course content methods of exhibiting and device
CN116246043B (en) Method, device, equipment and storage medium for presenting augmented reality audiovisual content
Nagao Virtual reality campuses as new educational metaverses
CN111258409B (en) Feature point identification method and device for man-machine interaction
Sparacino Natural interaction in intelligent spaces: Designing for architecture and entertainment
Kanade Immersion into visual media: new applications of image understanding
KR102622163B1 (en) Online music teaching method and apparatus based on virtual reality
Afzal et al. Incremental reconstruction of moving object trajectory
CN117519466A (en) Control method of augmented reality device, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant