CN116805350A - Virtual-real fusion explanation method and device, electronic equipment and storage medium - Google Patents

Virtual-real fusion explanation method and device, electronic equipment and storage medium

Info

Publication number
CN116805350A
Authority
CN
China
Prior art keywords
virtual
pose
scene
explanation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310544910.5A
Other languages
Chinese (zh)
Inventor
齐越
王宇泽
王君义
段宛彤
曲延松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202310544910.5A
Publication of CN116805350A


Abstract

The embodiment of the application provides a virtual-real fusion explanation method and device, electronic equipment and a storage medium. Information about a target scene is collected in advance and used to train a scene regression model, into which the depth information of objects in the target scene is fused. After an RGB image collected by a user is input into the model, the corresponding three-dimensional scene point cloud can be predicted and the camera pose obtained from it, which reduces the computing-power demand on the equipment, speeds up the solving of the camera pose, and makes the obtained camera pose more accurate. Because information about every point of the target scene is collected in advance, the constructed model does not contain large numbers of holes, and the known-scene information is also used when determining the pose of the virtual explanation character, so a realistic virtual-real occlusion effect is achieved and the user experience is enhanced.

Description

Virtual-real fusion explanation method and device, electronic equipment and storage medium
Technical Field
The application relates to the field of augmented reality, in particular to a virtual-real fusion explanation method and device, electronic equipment and a storage medium.
Background
At present, most augmented reality systems construct a map of the target scene and compute the camera pose of every frame with real-time localization-and-reconstruction techniques. However, most current AR (Augmented Reality) devices are not equipped with depth cameras, because a depth camera places very high computing-power demands on the AR device and easily exhausts computing resources and crashes the system. Real-time localization and reconstruction therefore has to perform three-dimensional reconstruction and camera pose computation from RGB (Red Green Blue) images alone, which is inaccurate and gives unsatisfactory results. Moreover, objects in a scene usually occlude one another, so in practice it is not feasible to guarantee that the camera scans every point in the scene, and models reconstructed by existing three-dimensional reconstruction methods contain large numbers of holes. In addition, when the acquisition device moves quickly, regions that have not been reconstructed are inevitably observed, and the correct virtual-real occlusion relationship cannot then be rendered in the augmented reality application.
Disclosure of Invention
The embodiment of the application mainly aims to provide a virtual-real fusion explanation method and device, electronic equipment and storage medium, which are used for enhancing the virtual-real fusion effect and improving the user experience.
In a first aspect, an embodiment of the present application provides a virtual-real fusion explanation method, including:
in a scene construction stage, a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera are acquired, a semantic map is constructed for the target scene through a semantic SLAM system, the corresponding relation between each pixel point in each RGB image in the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map is determined, and a plurality of three-dimensional scene point coordinates corresponding to the same RGB image form a three-dimensional scene point cloud corresponding to the RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene;
training a scene regression model according to the plurality of continuous RGB images and the three-dimensional scene point clouds, wherein the scene regression model is used for predicting the corresponding three-dimensional scene point clouds according to the input RGB images and obtaining the corresponding relation between each pixel point in the input RGB images and the coordinates of the three-dimensional scene point;
in the explanation stage, responding to an explanation request for a target scene, and acquiring an original RGB image of the target scene through an RGB camera;
inputting the acquired original RGB images into a trained scene regression model to obtain three-dimensional scene point clouds and corresponding relations corresponding to each original RGB image, calculating camera pose according to the three-dimensional scene point clouds and the corresponding relations, and determining pose information of virtual explanation roles set in the target scene according to the semantic map and the camera pose;
And obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and displaying the virtual-real fusion image to a user.
Optionally, the scene regression model includes:
the encoder is used for regressing the sparse point cloud of the three-dimensional scene point by extracting the characteristics of the original RGB image;
and the decoder is used for complementing the sparse point cloud through a point cloud complement model to obtain dense point cloud of the three-dimensional scene point.
Optionally, obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and displaying the virtual-real fusion image to the user, including:
rendering a virtual image according to the pose information of the virtual explanation character and the original RGB image; the rendering process is carried out on the GPU by utilizing a compute shader;
and fusing the original RGB image and the virtual image to obtain a virtual-real fused image, and displaying the virtual-real fused image to a user.
Optionally, determining pose information of the virtual explanation roles set in the target scene according to the semantic map and the camera pose includes:
determining the position of a key target according to the semantic map, and determining a plurality of candidate poses for the virtual explanation character; the key targets are the objects, among the objects in the target scene, that the virtual explanation character explains;
For any candidate pose, determining the visualization degree of the key target when the virtual explanation character is in the candidate pose according to the semantic map, the position of the key target, the attention degree of the user to different positions of the key target and the camera pose, wherein the visualization degree depends on the shielding degree of the key target; determining the rationalization degree of the distance and the relative orientation between the virtual explanation character and the user when the virtual explanation character is in the candidate pose according to the semantic map and the camera pose;
and determining pose information of the virtual explanation roles from the plurality of candidate poses according to the visualization degree and the rationalization degree.
Optionally, the virtual explanation roles are used for explaining for a plurality of users; the method further comprises the steps of:
distributing interaction weights for all users according to the interaction times and duration of the virtual explanation roles and all users;
correspondingly, determining pose information of the virtual interpretation role from the plurality of candidate poses according to the visualization degree and the rationalization degree comprises the following steps:
determining experience indexes of each user according to the corresponding visualization degree and rationalization degree of each user in the plurality of users aiming at any candidate pose in the plurality of candidate poses;
And determining pose information of the virtual explanation roles from the plurality of candidate poses according to experience indexes and interaction weights of each user.
Optionally, determining pose information of the virtual explanation roles set in the target scene according to the semantic map and the camera pose includes:
randomly determining the initial pose of the virtual explanation character as the current pose;
and according to the current pose, performing multiple iterations of the Metropolis-Hastings algorithm to obtain the pose information of the virtual explanation character.
In a second aspect, an embodiment of the present application provides a virtual-real fusion explanation device, including:
the scene construction module is used for acquiring a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera in a scene construction stage, constructing a semantic map for the target scene through a semantic SLAM system, determining the corresponding relation between each pixel point in each RGB image in the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map, and forming a three-dimensional scene point cloud corresponding to the RGB image by corresponding to the plurality of three-dimensional scene point coordinates of the same RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene;
The model training module is used for training a scene regression model according to the plurality of continuous RGB images and the three-dimensional scene point clouds, wherein the scene regression model is used for predicting the corresponding three-dimensional scene point clouds according to the input RGB images and obtaining the corresponding relation between each pixel point in the input RGB images and the coordinates of the three-dimensional scene point;
the scene acquisition module is used for responding to the explanation request aiming at the target scene in the explanation stage, and acquiring an original RGB image of the target scene through an RGB camera;
the pose determining module is used for inputting the acquired original RGB images into a trained scene regression model to obtain three-dimensional scene point clouds corresponding to each original RGB image and the corresponding relation, calculating the pose of a camera according to the three-dimensional scene point clouds and the corresponding relation, and determining pose information of a virtual explanation role set in the target scene according to the semantic map and the camera pose;
and the virtual-real fusion module is used for obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image and displaying the virtual-real fusion image to a user.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of the above aspects.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, where computer executable instructions are stored, and when executed by a processor, implement a method according to any one of the above aspects.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above aspects.
According to the virtual-real fusion explanation method and device, electronic equipment and storage medium provided by the embodiments of the application, information about the target scene is acquired in advance and used to train a scene regression model into which the depth information of objects in the target scene is fused. After an RGB image acquired by the user is input into the model, the corresponding three-dimensional scene point cloud can be predicted and the camera pose obtained from it, which reduces the computing-power demand on the equipment, speeds up the solving of the camera pose, and at the same time makes the obtained camera pose more accurate. Because information about every point of the target scene is acquired in advance, the constructed model does not contain large numbers of holes, and the known-scene information is also used when determining the pose of the virtual explanation character, so a realistic virtual-real occlusion effect can be achieved and the user experience is enhanced.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is an application scenario diagram provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a virtual-real fusion explanation method according to an embodiment of the present application;
FIG. 3 is a convolutional neural network of an encoder-decoder architecture according to an embodiment of the present application;
FIG. 4 is a relationship diagram of four phases of an interaction process provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of a virtual-real fusion explanation device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
In the technical field of augmented reality, the accuracy of camera pose calculation has a great influence on the final virtual-real fusion effect.
In some technologies, the camera pose of every frame is computed from video frames acquired by the user in real time. However, most AR devices worn by users carry only RGB cameras and no depth camera, and three-dimensional reconstruction from RGB images alone gives unsatisfactory results. Pose estimation based on depth cameras is accurate but has low time and space efficiency, and for a computationally heavy AR-based virtual-explanation-character system, it is difficult to achieve a real-time, highly immersive effect when multiple threads of the system run concurrently (such as a camera-pose localization thread, a scene-understanding thread, a speech-recognition thread, a virtual-explanation-character rendering thread and a scene virtual-real fusion thread). In addition, the user moves quickly through the target scene and may observe regions that have not been reconstructed, so the correct virtual-real occlusion relationship cannot be rendered; objects in the scene also usually occlude one another, so it is not feasible in practice to guarantee that the camera scans every point in the scene, and models reconstructed by existing three-dimensional reconstruction methods contain large numbers of holes.
In view of this, the application provides a virtual-real fusion explanation method. Before a user enters the target scene, the target scene is captured in advance with an RGBD (Red Green Blue Depth) camera that contains a depth sensor, so that rich information about the known scene is obtained. When the user later enters the target scene, the scene information is already known, so images of the target scene captured with an ordinary RGB camera are sufficient to compute an accurate camera pose quickly, which makes the rendered virtual image more accurate; and because the known-scene information contains the depth of objects, a realistic virtual-real occlusion effect can be achieved and the user experience is enhanced.
Fig. 1 is an application scenario diagram provided in an embodiment of the present application. The target scene is captured in advance to obtain a plurality of RGBD video frames; a semantic map is constructed for the target scene from these RGBD video frames, and at the same time the correspondence between the RGB images and the three-dimensional scene point clouds is obtained. The RGB images and the three-dimensional scene point clouds form a training set used to train the scene regression model, and the trained scene regression model can predict the corresponding three-dimensional scene point cloud from an input RGB image. When a user enters the target scene, the RGB images acquired by the worn RGB camera are input into the trained scene regression model to obtain the predicted three-dimensional scene point cloud and the correspondence, the camera pose is computed from the three-dimensional scene point cloud and the correspondence, the pose of the virtual explanation character is determined from the camera pose and the semantic map, and finally a virtual-real fusion image is obtained from the pose of the virtual explanation character and the RGB image acquired by the user.
The application provides a virtual-real fusion explanation method. Information about the target scene is acquired in advance, a semantic map is constructed for the target scene, the three-dimensional scene point cloud corresponding to each RGB image is determined, and a scene regression model is trained from the RGB images and their corresponding three-dimensional scene point clouds, so that the trained model contains the depth information of the pixel points in the RGB images. When a user enters the target scene, the RGB images acquired by the AR equipment worn by the user are input into the trained model, which regresses the corresponding three-dimensional scene point clouds, and an accurate camera pose is recovered with the PnP (Perspective-n-Point) algorithm from the correspondence associated with the regressed point clouds. Because only RGB images from the user's AR equipment are needed, the computing-power demand on the equipment is low and the cost of the equipment is reduced. Moreover, the pose of the virtual explanation character can be optimized according to the information of the semantic map and the camera pose, so that the virtual explanation character occludes the key target as little as possible and keeps a reasonable distance from and orientation towards the user; when several users with different points of interest interact with it, the character tends to face the user it interacts with most, which makes the interaction more natural and improves the real-time user experience.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a schematic flow chart of a virtual-real fusion explanation method according to an embodiment of the present application. The execution body of the present application may be any device with a data processing function, as shown in fig. 2, and the virtual-real fusion explanation method provided by the embodiment of the present application may include:
step 201, in a scene construction stage, a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera are acquired, a semantic map is constructed for the target scene through a semantic SLAM (Simultaneous Localization and Mapping) system, the corresponding relation between each pixel point in each RGB image in the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map is determined, and a plurality of three-dimensional scene point coordinates corresponding to the same RGB image form a three-dimensional scene point cloud corresponding to the RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene.
The RGBD camera is a camera equipped with a depth camera, each video frame image of a target scene acquired by the RGBD camera contains an RGB image and a depth image, and the semantic SLAM system can construct a map containing semantic information, that is, a semantic map, for the target scene.
Specifically, a target scene such as a museum, a botanical garden or a factory workshop is selected and then shot with the RGBD camera. The camera should be moved as slowly and carefully as possible during shooting, so that every point of every area in the scene is captured as far as possible; a semantic map constructed from video frames shot in this way is more accurate, has the correct virtual-real occlusion relationships, and avoids large numbers of holes.
The received video frames of the target scene shot by the RGBD camera are input into the semantic SLAM system to construct the semantic map of the target scene, and at the same time the three-dimensional scene point coordinates in the semantic map corresponding to each pixel point of each RGB image are obtained. Each RGB image contains a plurality of pixel points and each pixel point corresponds to one three-dimensional scene point coordinate, so each RGB image corresponds to a set of three-dimensional scene point coordinates, that is, a three-dimensional scene point cloud, and each RGB image corresponds to one three-dimensional scene point cloud.
Optionally, the two processes of constructing the semantic map of the scene and obtaining the correspondence between each RGB image and its three-dimensional scene point cloud can be carried out separately. The slow and detailed shooting method is still used to construct the semantic map of the target scene, but when obtaining the correspondence between each RGB image and the three-dimensional scene point cloud, the target scene can be shot with a Kinect: 7 key explanation areas of the target scene are selected, each area contributes 1000 video frames (7000 video frames in total), and each video frame contains one RGB image and one depth image. The 7000 video frames are input into the KinectFusion system to obtain the three-dimensional scene point cloud corresponding to each video frame. The 7000 video frames and their corresponding three-dimensional scene point clouds serve as the data set for training the subsequent scene regression model, and at the same time the ground-truth pose corresponding to each video frame is obtained to evaluate the accuracy of the subsequent camera-pose computation.
The Kinect is likewise a camera equipped with a depth sensor, and KinectFusion is a real-time dense three-dimensional reconstruction system that computes the three-dimensional scene point coordinates from the depth of a pixel point in a video frame and the camera intrinsic parameters.
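As an illustration of the back-projection just described, the following Python sketch (not part of the patented embodiment; the intrinsic values are illustrative assumptions) computes a three-dimensional scene point from a pixel's depth and the camera intrinsic matrix under a pinhole model:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) with depth (in metres) to a camera-space 3D
    point using the pinhole model: X = depth * K^-1 * [u, v, 1]^T."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Illustrative Kinect-like intrinsics (assumed values, not from the patent).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])
point = backproject(100, 200, 1.5, K)   # 3D scene point for pixel (100, 200) at 1.5 m
```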
Step 202, training a scene regression model according to the plurality of continuous RGB images and the three-dimensional scene point cloud, wherein the scene regression model is used for predicting the corresponding three-dimensional scene point cloud according to the input RGB images and obtaining the corresponding relation between each pixel point in the input RGB images and the coordinates of the three-dimensional scene point.
Specifically, let the RGB image be I, the coordinates of a pixel point in the RGB image be p_i, the predicted three-dimensional scene point coordinate corresponding to pixel p_i be ŷ_i, the predicted three-dimensional scene point cloud corresponding to the RGB image be Ŷ, the learnable parameters be w, and the scene regression model be f; then Ŷ = f(I; w).
The model is trained as follows: an RGB image is input into the scene regression model, which outputs a predicted three-dimensional scene point cloud; the L1 loss is used as the loss function,
L = Σ_i ‖ y_i − ŷ_i ‖_1,
where y_i is the ground-truth three-dimensional scene point coordinate corresponding to pixel p_i, ŷ_i is the predicted three-dimensional scene point coordinate corresponding to pixel p_i, and Ŷ is the predicted three-dimensional scene point cloud corresponding to the RGB image. The parameters of the scene regression model are adjusted according to the loss value, and this process is repeated until a preset maximum number of iterations is reached or the computed loss falls below a preset threshold.
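For illustration only, a minimal PyTorch sketch of this first-stage training step is given below; the small convolutional network is a placeholder for the scene regression model f (it is not the patented encoder-decoder network), and the 10⁻⁶ learning rate follows the value reported later in the description:

```python
import torch
import torch.nn as nn

# Placeholder regressor: any network mapping an RGB image (B,3,H,W) to a
# per-pixel 3D scene-point map (B,3,H,W) could stand in for f(I; w).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
l1 = nn.L1Loss()

def train_step(rgb, scene_points):
    """One first-stage step: regress the scene-point map and minimise the L1 error."""
    pred = model(rgb)
    loss = l1(pred, scene_points)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: one 160x120 frame and its ground-truth scene-point map.
rgb = torch.rand(1, 3, 120, 160)
scene_points = torch.rand(1, 3, 120, 160)
train_step(rgb, scene_points)
```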
Optionally, the scene regression model includes:
the encoder is used for regressing the sparse point cloud of the three-dimensional scene point by extracting the characteristics of the original RGB image;
and the decoder is used for complementing the sparse point cloud through a point cloud complement model to obtain dense point cloud of the three-dimensional scene point.
Specifically, the scene regression model in the present application adopts a convolutional neural network with an Encoder-Decoder structure; fig. 3 shows a convolutional neural network with an encoder-decoder structure according to an embodiment of the present application. As shown in fig. 3, the network input is an RGB image and the network output is a three-dimensional scene point cloud. The network is purely convolutional, with no fully connected layer: the medium-gray cuboids represent 3×3 convolution layers, the dark-gray cuboids represent 1×1 convolution layers, and the light-gray cuboid represents the point cloud completion model; the dark-gray and medium-gray cuboids together form the encoder, and the light-gray cuboid is the decoder. The encoder extracts image features by downsampling the RGB image and regresses a sparse point cloud of three-dimensional scene points. The point cloud completion model in the decoder completes the sparse point cloud to obtain a dense point cloud of three-dimensional scene points; the point cloud completion model adopts PCN (Point Completion Network) as its Backbone network.
The convolutional neural network used in the application first downsamples the whole scene and then completes it. For example, if the scene-point map before downsampling has size 3 × 160 × 120, downsampling produces a quarter-size scene-point map of 3 × 80 × 60, and after upsampling by the point cloud completion model the size is restored to 3 × 160 × 120. The point cloud completion operation fills in information that a single picture cannot capture, as well as points whose depth is invalid, when the scene points are generated from a single picture.
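A minimal PyTorch sketch of such a downsample-then-complete fully convolutional network is shown below. The layer widths and the transposed-convolution decoder are simplified assumptions; in the described embodiment the decoder is a PCN-style point cloud completion module:

```python
import torch
import torch.nn as nn

class SceneRegressor(nn.Module):
    """Fully convolutional encoder-decoder: the encoder downsamples the RGB image
    and regresses a quarter-resolution (sparse) scene-point map; the decoder
    (a stand-in for the PCN-style completion module) upsamples it back to a
    dense full-resolution point map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),   # 160x120 -> 80x60
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3, 1),                                   # sparse map, 3 x 80 x 60
        )
        self.decoder = nn.Sequential(                               # completion / upsampling
            nn.ConvTranspose2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 1),                                    # dense map, 3 x 160 x 120
        )

    def forward(self, rgb):
        sparse = self.encoder(rgb)
        return self.decoder(sparse)

dense = SceneRegressor()(torch.rand(1, 3, 120, 160))   # -> torch.Size([1, 3, 120, 160])
```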
Step 203, in the explanation stage, responding to an explanation request for a target scene, and collecting an original RGB image of the target scene through an RGB camera;
specifically, a specific button or a sensor is arranged on the AR equipment worn by the user, and after the user presses the button or touches the sensor, the AR equipment receives an explanation request for the target scene, and an RGB camera on the AR equipment is used for collecting an original image of the target scene.
Step 204, inputting the acquired original RGB images into a trained scene regression model to obtain three-dimensional scene point clouds corresponding to each original RGB image and the corresponding relation, calculating the pose of a camera according to the three-dimensional scene point clouds and the corresponding relation, and determining pose information of a virtual explanation role set in the target scene according to the semantic map and the camera pose;
Specifically, each acquired original RGB image is input into the trained scene regression model, and the three-dimensional scene point cloud corresponding to the RGB image is obtained, together with the correspondence between each pixel point in the RGB image and the three-dimensional scene point coordinates.
Considering that the predictions of the scene regression model contain errors, a more robust solver is used to reduce the error when solving the camera pose parameter matrix h from the three-dimensional scene point cloud and the correspondence between pixel points and three-dimensional scene points. The application adopts RANSAC (Random Sample Consensus), with the following steps:
1. Compute a plurality of camera pose parameter hypotheses:
h_j = g(Y_j),
where Y_j is a set of sampling points, i.e. some three-dimensional scene points randomly selected from the three-dimensional scene point cloud corresponding to an RGB image, for which the mapping between these sampling points and the two-dimensional pixel points in the RGB image is stored. g(·) is a camera pose parameter solver and can be any method that computes a pose from a point cloud; the application uses the PnP algorithm.
Different sampling points are randomly selected to form a plurality of sampling-point sets, and the PnP algorithm yields a plurality of camera pose estimates.
2. Score all hypotheses to obtain a coarse camera pose estimate:
Each hypothesis h_j is scored by a scoring function
s(h_j) = Σ_i 1[ e_j(y_i, h_j) < τ ],
where 1[·] is the indicator function, which returns 1 only when the condition in brackets is true and 0 otherwise, and τ is a given threshold. Only when e_j(y_i, h_j) is smaller than τ is the three-dimensional coordinate point y_i regarded as an inlier and the indicator equal to 1; otherwise y_i is not an inlier and the indicator equals 0. The highest-scoring hypothesis ĥ is the predicted coarse camera pose parameter.
In the embodiment of the application, τ=10px is selected as a threshold value of pose error.
e_j(·) is computed as the reprojection residual r(y_i, h_j), i.e. the difference between the real pixel coordinates p_i corresponding to the three-dimensional scene point y_i and the reprojected pixel coordinates K·h_j⁻¹·y_i:
r(y_i, h_j) = ‖ p_i − K·h_j⁻¹·y_i ‖_2,
where K is the camera intrinsic matrix and h_j⁻¹ is the inverse of the camera pose parameter matrix h_j.
The scoring function therefore counts the number of inliers produced by each sampling-point set; the more inliers a set produces, the higher its score, and the predicted coarse camera pose parameter ĥ is obtained from the highest-scoring hypothesis.
3. Refine the predicted coarse camera pose parameters:
The camera pose parameters are recomputed using all correctly predicted (inlier) three-dimensional scene points:
h̃ = g(Y_in),
where Y_in is the set of correctly predicted three-dimensional scene points and h̃ is the final predicted camera pose parameter.
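For illustration, the sketch below uses OpenCV's solvePnPRansac, which bundles the same three ingredients as steps 1–3 above (PnP on random minimal sets, inlier counting against a reprojection threshold, and refinement on the inliers). It is a stand-in rather than the exact solver of the embodiment, with τ = 10 px passed as the reprojection threshold:

```python
import cv2
import numpy as np

def estimate_pose(scene_points, pixels, K, reproj_thresh=10.0):
    """Recover the camera pose from predicted 2D-3D correspondences.
    scene_points: Nx3 predicted scene points y_i; pixels: Nx2 pixel coords p_i;
    K: 3x3 intrinsic matrix. Returns (R, t, inliers) or None on failure."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_points.astype(np.float64),
        pixels.astype(np.float64),
        np.asarray(K, dtype=np.float64), None,   # intrinsics, no distortion
        reprojectionError=reproj_thresh,         # inlier threshold (pixels)
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                   # rotation matrix from Rodrigues vector
    return R, tvec, inliers
```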
The application carries out two-stage training, wherein the first stage adopts a single RGB image as input, adopts a real scene point cloud as output and adopts L1 loss regularization as a loss function.
In the second stage, a single RGB image is used as input, inliers are selected through RANSAC, and the reprojection residual r(·) is used as the loss function. Because the second stage needs to perform RANSAC sampling on each RGB image and select a subset of points for gradient descent, the batch size of the second-stage training, i.e. the number of samples passed to the program in a single training pass, is 1.
The application adopts Adam (adaptive moment estimation) as the optimization algorithm, with a learning rate of 10⁻⁶, and training is performed on an Nvidia P6000 GPU (Graphics Processing Unit).
After obtaining the camera pose, optionally, determining pose information of the virtual explanation roles set in the target scene according to the semantic map and the camera pose, including:
determining the position of a key target according to the semantic map, and determining a plurality of candidate poses for the virtual roles; the key targets are objects for the virtual explanation roles to explain in the objects in the target scene;
for any candidate pose, determining the visualization degree of the key target when the virtual explanation character is in the candidate pose according to the semantic map, the position of the key target, the attention degree of the user to different positions of the key target and the camera pose, wherein the visualization degree depends on the shielding degree of the key target; determining the rationalization degree of the distance and the relative orientation between the virtual explanation character and the user when the virtual explanation character is in the candidate pose according to the semantic map and the camera pose;
and determining pose information of the virtual explanation roles from the plurality of candidate poses according to the visualization degree and the rationalization degree.
Specifically, the key target may be any real object in the target scene, or may be a virtual object placed in the target scene. The key target may be one or more, and the location of the key target may be determined by a semantic map.
A reasonable position can be arbitrarily selected from the target scene as a candidate position of the virtual explanation character, the reasonable position is a point on the ground, the point is not occupied by a user, a real object and a virtual object in the target scene, the virtual explanation character can have multiple candidate gestures at each candidate position, and one candidate position and one candidate gesture at the candidate position jointly form one candidate pose of the virtual explanation character.
For example, 1000 virtual positions are selected for the virtual lecture character in the target scene, and 10 candidate poses are selected for the virtual lecture character at each virtual position, so that the virtual lecture character has 1000×10=10000 candidate poses in total.
A candidate pose of the virtual explanation character may be expressed as a = (l_x, l_y, l_z, o_x, o_y, o_z), where (l_x, l_y, l_z) is the candidate position of the virtual explanation character and (o_x, o_y, o_z) is the candidate posture, which in the present application is the orientation of the virtual explanation character.
For any candidate pose, the visualization degree of the key target when the virtual explanation character is in that candidate pose is denoted C_V(a). The purpose of designing C_V(a) is to optimize the visibility of the key target: the pose is penalized if the key target is found to be occluded from the view of user u, and because the user's degree of interest differs across different locations of the key target, an attention matrix is introduced that assigns different weights to different locations.
In general, the user is more interested in the central region of the key target and less interested in its edge region, so a higher weight can be assigned to positions in the central region and a lower weight to the edge region.
C_V(a) is computed from the occlusion indicator weighted by the attention matrix over the pixels of the key targets, where (i, j) are the coordinates of a position on a key target and I(·) is a binary matrix indicating whether that position is occluded, with I(i, j) = 1 meaning occluded and I(i, j) = 0 meaning not occluded. Each key target contains a plurality of pixel points; I_max is the set of non-occluded pixels of all key targets in the field of view of user u, E is the set of all pixels of all key targets, and E_k is the set of non-occluded pixels of the k-th key target in the field of view of user u. ω(i, j) is the attention matrix, a two-dimensional matrix in which the value of each element is the attention weight of the position with coordinates (i, j), computed as
ω(i, j) = Σ_k ω_k(i, j).
When pixel (i, j) does not belong to any key target in the field of view, ω(i, j) = 0; d_k(i, j) is the distance from the pixel in row i and column j of the k-th key target to the centre of that key target. Under this definition, a larger penalty is introduced when the centre of the key target is occluded by the virtual explanation character over a large area, so that optimizing the position of the virtual explanation character keeps it from occluding the key target as far as possible.
For any candidate pose, the rationalization degree of the distance and relative orientation between the virtual explanation character and the user when the character is in that candidate pose is denoted C_S(a). The purpose of designing C_S(a) is to optimize the position and orientation of the virtual explanation character: the user feels uncomfortable when the character is too close, the user's attention to the character suffers when it is too far away, and the user also feels uncomfortable when the character does not face the user during a conversation.
C_S(a) models the distribution of the user's position and orientation with a Gaussian mixture of K Gaussian kernels, where d is the distance between the virtual explanation character and the user and θ is the orientation of the character relative to the user; d is computed as the Euclidean distance, and θ is computed as the angle between the user and the z-axis of the virtual explanation character's local coordinate frame. G_k(·) is the probability density function of the k-th Gaussian distribution, with mean μ_k and covariance Σ_k.
According to the visualization degree C_V(a) and the rationalization degree C_S(a), the experience index designed for a candidate pose of the virtual explanation character with respect to the current user u is the value of a cost function:
C_u(a) = λ_V·C_V(a) + λ_S·C_S(a),
where λ_V and λ_S are hyperparameters satisfying λ_V + λ_S = 1.
And calculating the value of the cost function of each candidate pose, and taking the candidate pose with the minimum corresponding cost function value as the pose of the virtual interpretation role.
Introducing C_V(a) optimizes the visibility of the key target so that its central part is not occluded by the virtual explanation character over a large area, while introducing C_S(a) optimizes the position and orientation of the character so that it keeps an appropriate distance from the user and faces the user as much as possible during a conversation, which improves the user experience. The cost function designed for the candidate poses therefore considers C_V(a) and C_S(a) together to determine the pose information of the virtual explanation character.
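As a sketch of this selection step (assuming the visibility term C_V and the rationality term C_S are supplied as callables computed elsewhere from the semantic map, the occlusion test and the user pose), the candidate pose minimising the weighted cost can be picked as follows:

```python
import numpy as np

def select_pose(candidates, visual_cost, social_cost, lam_v=0.5, lam_s=0.5):
    """Pick the candidate pose minimising C_u(a) = lam_v*C_V(a) + lam_s*C_S(a).
    visual_cost and social_cost are caller-supplied callables standing in for the
    occlusion-based and distance/orientation-based terms described above."""
    assert abs(lam_v + lam_s - 1.0) < 1e-9        # lambda_V + lambda_S = 1
    costs = [lam_v * visual_cost(a) + lam_s * social_cost(a) for a in candidates]
    return candidates[int(np.argmin(costs))]

# Toy usage with 6-DoF candidates (l_x, l_y, l_z, o_x, o_y, o_z) and placeholder costs.
cands = [np.array([x, 0.0, 1.0, 0.0, 0.0, 1.0]) for x in np.linspace(-2, 2, 10)]
best = select_pose(cands,
                   visual_cost=lambda a: abs(a[0]),          # placeholder C_V
                   social_cost=lambda a: (a[0] - 1.5) ** 2)  # placeholder C_S
```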
Optionally, the virtual explanation roles explain for a plurality of users; the method further comprises the steps of:
distributing interaction weights for all users according to the interaction times and duration of the virtual explanation roles and all users;
correspondingly, determining pose information of the virtual interpretation role from the plurality of candidate poses according to the visualization degree and the rationalization degree comprises the following steps:
determining experience indexes of each user according to the corresponding visualization degree and rationalization degree of each user in the plurality of users aiming at any candidate pose in the plurality of candidate poses;
And determining pose information of the virtual explanation roles from the plurality of candidate poses according to experience indexes and interaction weights of each user.
Specifically, when the virtual explanation character explains for a plurality of users, an interaction weight ω_u is allocated to each user according to the number and duration of that user's interactions with the character. The weight depends on weighting coefficients λ_N and λ_T satisfying λ_N + λ_T = 1, a time decay coefficient α, the number of interactions N_u of user u within the time window ΔT, and the time interval T_u since user u's last interaction. In particular, when user u has had no interaction within the time window ΔT, N_u = 0 and T_u = ΔT.
For any candidate pose, the experience index of each user is the value of the cost function C_u(a) = λ_V·C_V(a) + λ_S·C_S(a).
According to each user's cost value C_u(a) and interaction weight ω_u, the cost function of a candidate pose in the multi-user setting is obtained as the weighted sum
C(a) = Σ_{u∈U} ω_u·C_u(a),
where U is the set of all users.
And calculating the value of the cost function of each candidate pose under the multi-user scene, and taking the candidate pose with the minimum corresponding cost function value as the pose of the virtual interpretation role.
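A small sketch of this multi-user aggregation is given below; the per-user cost functions and the interaction weights are assumed to be computed beforehand as described above:

```python
def multi_user_cost(candidate, user_costs, weights):
    """Multi-user objective C(a) = sum_u omega_u * C_u(a): user_costs maps each
    user id to its cost callable C_u, weights maps it to its interaction weight."""
    return sum(weights[u] * user_costs[u](candidate) for u in user_costs)

def select_pose_multi(candidates, user_costs, weights):
    """Return the candidate pose minimising the weighted multi-user cost."""
    return min(candidates, key=lambda a: multi_user_cost(a, user_costs, weights))
```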
In practical application, in the process of interaction of multiple users with the virtual explanation character, the pose of the virtual explanation character can be changed continuously. In order to clarify a specific change process, the application defines a round of interaction as multiple interactions between the virtual explanation character and the user, wherein the interaction mode can be that the virtual explanation character answers a question of the user or the virtual explanation character presents some content to the user. In one round of interaction, the interaction times between the virtual explanation roles and the users are N, and the interaction time is T. Defining a time window as delta T, namely, if no interaction exists between the virtual interpretation character and the user in delta T time after the last dialogue, and considering that one round of interaction is ended. The set of users is defined as U, i.e. the set of all users participating in the interaction in this round of interaction. The process of defining interactions in the embodiment of the application is divided into four phases:
1. standby stage: when no interaction exists between the virtual explanation character and the user and the virtual explanation character has no other tasks, the virtual explanation character is in a standby state and waits for interaction of the first user.
2. The starting phase is as follows: after the first interaction between the virtual lecture character and the user is started, the start phase is entered.
3. And (3) interaction stage: in this stage, the virtual lecture character receives interaction from other users U, and adds the corresponding user U to the user set U.
4. Ending: when the virtual explanation character is in the interaction stage and no other tasks exist, if the last interaction time between the virtual explanation character and the user exceeds delta T, the round of interaction is considered to be ended, and the ending stage is entered.
Fig. 4 is a relationship diagram of four stages of an interaction process provided in an embodiment of the present application, as shown in fig. 4, each stage can only be transferred from the previous stage, after the end of the stage, a standby stage is entered, waiting for the first interaction between the virtual explanation character and the user, and the four stages are repeated to form a loop structure.
The specific behavior of each phase is defined as follows:
1. standby stage:
In the standby stage, the virtual explanation character is in a standby state and waits for user interaction. In this stage the character accepts no interactive behaviour from users and presents no content. The pose of the virtual explanation character is determined by the cost function: it is the candidate pose with the minimum cost value and does not change. In this stage every user has the same weight, namely
ω_u = 0.
At the same time λ_V is set smaller than λ_S, so that the weight of the pose rationalization degree C_S(a) is increased and the optimization target of the virtual explanation character's pose treats the positions of all users as equally as possible.
2. The starting phase is as follows:
In the starting stage, after the first interaction between the virtual explanation character and a user begins, the current user is recorded as u_0 and added to the user set U. In this stage the pose of the virtual explanation character is re-determined by optimizing the pose for the current user u_0 only, i.e. by minimizing C_{u_0}(a) over the candidate poses.
After the virtual explanation character moves to the new location, the interaction stage is entered.
3. Interaction stage: in the interaction stage, the virtual explanation character receives interaction behaviour from other users; each corresponding user u_i is added to the user set U, and the interaction weights of all users are computed at that moment from their interaction counts and interaction times as described above.
Denote the current pose as a = (l_x, l_y, l_z, o_x, o_y, o_z) and the newly optimized pose as a′ = (l_x′, l_y′, l_z′, o_x′, o_y′, o_z′), and compare the distance between the two positions. If the distance is smaller than a threshold η, the position of the virtual explanation character is considered unchanged: only its orientation is updated and it does not move; otherwise it moves and the current pose is updated. The position component of a pose is
p(a) = (l_x, l_y, l_z)ᵀ.
This stage continues; a user is removed from the user set U when the time since that user's last interaction exceeds ΔT. When the user set U becomes empty, or there is no further user interaction within ΔT, the ending stage is entered.
4. Ending:
In this stage, the virtual explanation character optimizes its pose again in the same way as in the standby stage (with ω_u = 0 for every user); after moving to the optimal position, it enters the standby stage.
In this way, in the interaction process of the virtual explanation character and a plurality of users, different interaction weights can be distributed to the users in different stages of the interaction process, so that the virtual explanation character tends to be oriented to the user with the highest interaction weight, and the user experience is improved.
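The following Python sketch outlines the four-stage loop as a simple state machine; the time window value and the exact transition conditions are simplified assumptions:

```python
import time

class GuideStateMachine:
    """Minimal sketch of the four-stage loop: STANDBY -> START -> INTERACTION
    -> END -> STANDBY. It only tracks the user set U and the time of the last
    interaction; pose re-optimisation would be hooked into the transitions."""

    def __init__(self, delta_t=30.0):
        self.delta_t = delta_t          # time window (seconds), an assumed value
        self.state = "STANDBY"
        self.users = set()              # user set U of the current round
        self.last_interaction = None

    def on_interaction(self, user_id):
        """Record an interaction from user_id and advance the stage."""
        if self.state == "STANDBY":
            self.state = "START"        # first interaction opens a new round
        elif self.state == "START":
            self.state = "INTERACTION"  # further interactions join the round
        self.users.add(user_id)
        self.last_interaction = time.time()

    def tick(self):
        """Call periodically: end the round if no interaction for delta_t."""
        active = self.state in ("START", "INTERACTION")
        if active and self.last_interaction is not None:
            if time.time() - self.last_interaction > self.delta_t:
                self.state = "END"      # round over
                self.users.clear()
                self.state = "STANDBY"  # wait for the next first interaction
```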
Optionally, determining pose information of the virtual explanation roles set in the target scene according to the semantic map and the camera pose includes:
step a, randomly determining an initial pose of a virtual explanation character as a current pose;
specifically, a midpoint of a connecting line between a user with highest interaction weight and a current key target center point is taken as a sphere center, one point is randomly selected as an initial position of a virtual explanation character in a sphere with half of the connecting line length as a radius, the direction of the connecting line is taken as an initial direction of the virtual explanation character, and the initial position and the initial direction together form an initial pose of the virtual explanation character and serve as the current pose.
Step b, iterating a number of times with the Metropolis-Hastings algorithm according to the current pose to obtain the pose information of the virtual explanation character;
specifically, the Mei Teluo Bolisi-Black-Tinsted algorithm simulates an annealing process, and the algorithm is correspondingly executed as follows:
input: virtual explaining of a character's current pose
On put: virtual interpretation character pose information
In the proposal Q(a′ | a_t) = (l_x′, l_y′, l_z′, o_x, o_y + δ, o_z), the position (l_x′, l_y′, l_z′) ∈ Ω_t(a_t), where Ω_t(a_t) is a spherical region of radius r centred at the coordinates of the current pose, and δ ∈ (−δ_max, δ_max); a_t is the current pose, a′ is the new pose, (l_x′, l_y′, l_z′) are the coordinates of the new pose, (o_x, o_y + δ, o_z) is the orientation of the new pose, (o_x, o_y, o_z) is the orientation of the current pose, and δ_max is a preset threshold.
Specifically, the new pose is obtained from the current pose: with the coordinates (l_x, l_y, l_z) of the current pose as the centre, a point is chosen arbitrarily within the spherical region of radius r as the coordinates of the new pose. The orientation of the current pose is (o_x, o_y, o_z), where o_x, o_y and o_z are the angles between the viewing direction of the virtual explanation character and the x-, y- and z-axes of a given three-dimensional coordinate system; o_x and o_z remain unchanged while o_y becomes o_y + δ, with δ varying within the set range (−δ_max, δ_max), giving the orientation of the new pose. The coordinates and orientation together form the new pose.
In the acceptance criterion, C(a_t) is the cost value of the current pose, C(a′) is the cost value of the new pose, T_0 is the initial temperature of the simulated annealing (Simulated Annealing), β is the temperature decay coefficient, and t is the number of iterations. In practice, the initial temperature is typically 300° and the decay coefficient is typically 0.5, i.e. the temperature is reduced by 0.5° per iteration until it reaches 0.
In this way, the current pose of the virtual explanation character is determined from the information of the target scene, and the pose information of the virtual explanation character is then computed by iterating the Metropolis-Hastings algorithm a number of times.
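A sketch of this annealed pose search is shown below. The exact acceptance formula and temperature schedule of the embodiment are not reproduced; a standard Metropolis acceptance test with the linear cooling described above is assumed, and a box-shaped position perturbation stands in for the spherical region Ω_t(a_t):

```python
import math
import random

def optimize_pose(initial_pose, cost, r=0.5, delta_max=0.3,
                  t0=300.0, beta=0.5, iters=600):
    """Annealed Metropolis-style search over poses a = (l_x, l_y, l_z, o_x, o_y, o_z).
    A proposal perturbs the position and the o_y component by delta in
    (-delta_max, delta_max); it is accepted if it lowers the cost, otherwise with
    probability exp(-(C(a') - C(a_t)) / T), while the temperature decays by beta per step."""
    current = list(initial_pose)
    temp = t0
    for _ in range(iters):
        if temp <= 0:
            break
        proposal = list(current)
        for i in range(3):                                    # perturb the position
            proposal[i] += random.uniform(-r, r)
        proposal[4] += random.uniform(-delta_max, delta_max)  # rotate o_y by delta
        dc = cost(proposal) - cost(current)
        if dc < 0 or random.random() < math.exp(-dc / temp):
            current = proposal                                # accept the new pose
        temp -= beta                                          # linear cooling
    return tuple(current)

# Toy usage: a cost that prefers poses near the origin with o_y close to 0.
best = optimize_pose((1.0, 0.0, 1.0, 0.0, 0.0, 1.0),
                     cost=lambda a: a[0] ** 2 + a[1] ** 2 + a[4] ** 2)
```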
Step 205, obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and displaying the virtual-real fusion image to the user.
Optionally, obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and displaying the virtual-real fusion image to the user, including:
Rendering a virtual image according to the pose information of the virtual explanation character and the original RGB image; the rendering process is carried out on the GPU by utilizing a compute shader;
and fusing the original RGB image and the virtual image to obtain a virtual-real fused image, and displaying the virtual-real fused image to a user.
A compute shader is a technique that allows the GPU to be used directly as a parallel processor, so that the GPU offers not only 3D rendering capability but also general-purpose computing capability.
Specifically, the rendering process is performed on the GPU by using the compute shader, and the rendering process of the virtual image of each frame is as follows:
1. the camera pose, the pose of the virtual explanation character and the pose of the virtual object obtained in the above steps are subjected to MVP (Model View Projection) matrix transformation to obtain coordinates in the world coordinate system.
2. And reading the material of the image of the current camera frame, and waiting for all cameras and the GUI (Graphical User Interface ) to be rendered.
3. Three temporary rendering materials are applied for storing intermediate results.
4. And repeating the rendering operation on a group of orthogonal colour-key cameras: when these cameras perform view culling, the mesh material of the world is changed to an unlit material and the same solid colour is returned for any light source. A set of images containing the occlusion relationships is rendered, whose background colours are orthogonal solid colours.
5. And copying this group of images and the image captured by the real camera into a buffer of the compute shader, performing a similar-colour rejection operation on the orthogonal colour-key images, and overwriting similar colours in a GPU kernel.
6. And executing async GPU readback on the resulting texture, initiating a readback request on the GPU, and writing the content in the resulting texture from the GPU memory into the memory.
7. The temporary render textures allocated in step 3 are released.
After the rendered virtual image is obtained, the original RGB image and the virtual image corresponding to the same moment are fused to obtain a virtual-real fused image, which is displayed to the user.
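As a CPU-side illustration of the color-key rejection and fusion described above, a minimal NumPy sketch is given below. In the application this step runs in a compute shader on the GPU; the key color, the threshold and the function names used here are assumptions for the example only.

```python
import numpy as np

def composite(real_rgb, virtual_rgb, key_color=(0, 255, 0), threshold=40.0):
    """Replace key-colored (background) pixels of the rendered virtual image with the
    real camera image and keep the virtual pixels elsewhere, so real objects show
    through wherever the virtual render left only the background key color."""
    real = real_rgb.astype(np.float32)
    virt = virtual_rgb.astype(np.float32)
    key = np.asarray(key_color, dtype=np.float32)
    distance = np.linalg.norm(virt - key, axis=-1)   # per-pixel distance to the key color
    background = distance < threshold                 # True where the virtual image is background
    fused = np.where(background[..., None], real, virt)
    return fused.astype(np.uint8)

# usage with stand-in images of the same resolution
h, w = 480, 640
real = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
virtual = np.full((h, w, 3), (0, 255, 0), dtype=np.uint8)  # an all-background virtual frame
fused = composite(real, virtual)
assert np.array_equal(fused, real)                          # nothing virtual to overlay here
```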
In the application, a CPU-based baseline is compared with a GPU-based baseline; the results show that performance improves by about 10 times after GPU acceleration is introduced, so that the scene can be recorded at 30 fps. The performance improvement is therefore very significant.
In this way, the rendering process runs entirely on the GPU, which reduces the copy cost between CPU memory and GPU memory and exploits the large number of ALUs (Arithmetic Logic Units) on the GPU, so that performance is significantly improved.
In summary, the virtual-real fusion explanation method provided by the embodiment of the application comprises the following steps: in the scene construction stage, a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera are acquired, a semantic map is constructed for the target scene through a semantic SLAM system, and the corresponding relation between each pixel point in each RGB image of the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map is determined, where the plurality of three-dimensional scene point coordinates corresponding to the same RGB image form the three-dimensional scene point cloud corresponding to that RGB image; each RGBD video frame contains one RGB image and one depth image, and the semantic map contains the position, posture and semantic information of objects in the target scene; a scene regression model is trained according to the plurality of continuous RGB images and the three-dimensional scene point clouds, the scene regression model being used to predict the corresponding three-dimensional scene point cloud from an input RGB image and to obtain the corresponding relation between each pixel point in the input RGB image and the three-dimensional scene point coordinates; in the explanation stage, in response to an explanation request for the target scene, original RGB images of the target scene are acquired through an RGB camera; the acquired original RGB images are input into the trained scene regression model to obtain the three-dimensional scene point cloud and the corresponding relation for each original RGB image, the camera pose is calculated according to the three-dimensional scene point cloud and the corresponding relation, and pose information of the virtual explanation character set in the target scene is determined according to the semantic map and the camera pose; finally, a virtual-real fusion image is obtained according to the pose information of the virtual explanation character and the original RGB image and displayed to the user. Because the information of the target scene is collected in advance and used to train the scene regression model, the depth information of objects in the target scene is fused into the model; after the RGB images acquired by a user are input into the model, the corresponding three-dimensional scene point clouds can be predicted and the camera pose obtained, which reduces the computational power requirements on the equipment and yields a more accurate camera pose. Since the information of every point of the target scene is acquired in advance, a large number of holes will not occur in the constructed model, and the information of the known scene is also used when determining the pose of the virtual explanation character, so that a realistic virtual-real fusion occlusion effect can be achieved and the user experience is enhanced.
The virtual-real fusion explanation method provided by the embodiment of the application can be executed by one device or a plurality of devices.
Corresponding to the virtual-real fusion explanation method, the embodiment of the application also provides a virtual-real fusion explanation device. Fig. 5 is a schematic structural diagram of a virtual-real fusion explanation device according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
the scene construction module 501 is configured to obtain a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera in a scene construction stage, construct a semantic map for the target scene through a semantic SLAM system, and determine a correspondence between each pixel point in each RGB image in the plurality of continuous RGBD video frames and a three-dimensional scene point coordinate in the semantic map, where a plurality of three-dimensional scene point coordinates corresponding to the same RGB image form a three-dimensional scene point cloud corresponding to the RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene;
the model training module 502 is configured to train a scene regression model according to the multiple continuous RGB images and the three-dimensional scene point clouds, where the scene regression model is configured to predict a corresponding three-dimensional scene point cloud according to an input RGB image, and obtain a corresponding relationship between each pixel point in the input RGB image and a three-dimensional scene point coordinate;
A scene acquisition module 503, configured to acquire, in response to an interpretation request for a target scene, an original RGB image of the target scene by an RGB camera in an interpretation stage;
the pose determining module 504 is configured to input the collected original RGB images into a trained scene regression model, obtain a three-dimensional scene point cloud corresponding to each original RGB image and the corresponding relationship, calculate a pose of a camera according to the three-dimensional scene point cloud and the corresponding relationship, and determine pose information of a virtual explanation character set in the target scene according to the semantic map and the pose of the camera;
and the virtual-real fusion module 505 is configured to obtain a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image and to display the virtual-real fusion image to a user.
Optionally, the five modules may be deployed on one device or distributed across multiple devices. In the explanation stage, the scene acquisition module 503 and the virtual-real fusion module 505 have strong real-time requirements and may therefore be deployed on a client, while the pose determination module 504 has relatively high computational power requirements and may be deployed on a server.
The scene acquisition module 503 of the client is configured to, in the explanation stage, respond to an explanation request for a target scene, acquire original RGB images of the target scene through an RGB camera, and send the acquired original RGB images to the pose determination module 504 of the server, so that the pose determination module 504 of the server inputs the acquired original RGB images into a trained scene regression model to obtain the three-dimensional scene point cloud and the corresponding relationship for each original RGB image, calculates the camera pose according to the three-dimensional scene point cloud and the corresponding relationship, and determines pose information of the virtual explanation character set in the target scene according to the semantic map and the camera pose. The virtual-real fusion module 505 of the client is configured to receive the pose information of the virtual explanation character and the original RGB image sent by the pose determination module 504 of the server, obtain a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and display the virtual-real fusion image to the user.
Correspondingly, the pose determination module 504 of the server is configured to receive the original RGB images sent by the scene acquisition module 503 of the client, input the collected original RGB images into the trained scene regression model to obtain the three-dimensional scene point cloud and the corresponding relationship for each original RGB image, calculate the camera pose according to the three-dimensional scene point cloud and the corresponding relationship, and determine pose information of the virtual explanation character set in the target scene according to the semantic map and the camera pose; finally, the pose information of the virtual explanation character and the original RGB image are sent to the virtual-real fusion module 505 of the client, so that the virtual-real fusion module obtains a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image and displays the virtual-real fusion image to the user.
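For illustration only, a minimal sketch of the client side of such a deployment is given below, assuming an HTTP interface. The endpoint path, port, field names and the render_and_fuse entry point are hypothetical and only illustrate the data flow between the scene acquisition module 503, the pose determination module 504 and the virtual-real fusion module 505.

```python
import requests  # third-party HTTP client (pip install requests)

SERVER_URL = "http://localhost:8000/pose"  # hypothetical pose-determination endpoint

def request_pose(jpeg_bytes: bytes) -> dict:
    """Client side: upload one captured RGB frame and receive the estimated camera pose
    together with the pose information of the virtual explanation character."""
    response = requests.post(
        SERVER_URL,
        files={"frame": ("frame.jpg", jpeg_bytes, "image/jpeg")},
        timeout=5.0,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"camera_pose": [...], "character_pose": [...]}

# usage (paths and the fusion entry point are placeholders):
# with open("frame_0001.jpg", "rb") as f:
#     poses = request_pose(f.read())
# render_and_fuse(poses, "frame_0001.jpg")  # hypothetical client-side fusion step
```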
Optionally, the scene regression model includes:
the encoder is used for regressing a sparse point cloud of three-dimensional scene points by extracting features of the original RGB image;
and the decoder is used for completing the sparse point cloud through a point cloud completion model to obtain a dense point cloud of three-dimensional scene points.
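For illustration only, a minimal encoder-decoder sketch of this kind is shown below in PyTorch. The use of PyTorch, the layer sizes, the point counts and the toy MLP standing in for the completion stage are assumptions made for the example; the actual network architecture and point cloud completion model of the application are not specified here.

```python
import torch
import torch.nn as nn

class SceneRegressor(nn.Module):
    """Encoder regresses a sparse set of 3D scene points from an RGB image;
    a decoder then densifies (completes) it into a dense point cloud."""
    def __init__(self, sparse_points=1024, dense_points=8192):
        super().__init__()
        self.encoder = nn.Sequential(                        # RGB image -> global feature
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.to_sparse = nn.Linear(128 * 8 * 8, sparse_points * 3)  # sparse scene points
        self.decoder = nn.Sequential(                        # toy MLP standing in for completion
            nn.Linear(sparse_points * 3, 2048), nn.ReLU(),
            nn.Linear(2048, dense_points * 3),
        )
        self.sparse_points, self.dense_points = sparse_points, dense_points

    def forward(self, rgb):                                  # rgb: (B, 3, H, W)
        feature = self.encoder(rgb).flatten(1)
        sparse = self.to_sparse(feature).view(-1, self.sparse_points, 3)
        dense = self.decoder(sparse.flatten(1)).view(-1, self.dense_points, 3)
        return sparse, dense

# usage: one 480x640 frame -> sparse and dense scene point clouds
model = SceneRegressor()
sparse, dense = model(torch.randn(1, 3, 480, 640))
print(sparse.shape, dense.shape)  # torch.Size([1, 1024, 3]) torch.Size([1, 8192, 3])
```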
Optionally, the virtual-real fusion module 505 is specifically configured to:
rendering a virtual image according to the pose information of the virtual explanation character and the original RGB image; the rendering process is carried out on the GPU by using a compute shader;
And fusing the original RGB image and the virtual image to obtain a virtual-real fused image, and displaying the virtual-real fused image to a user.
Optionally, the pose determining module 504 is specifically configured to, when determining pose information of a virtual interpretation role set in the target scene according to the semantic map and the camera pose:
determining the position of a key target according to the semantic map, and determining a plurality of candidate poses for the virtual roles; the key targets are objects for the virtual explanation roles to explain in the objects in the target scene;
for any candidate pose, determining the visualization degree of the key target when the virtual explanation character is in the candidate pose according to the semantic map, the position of the key target, the attention degree of the user to different positions of the key target and the camera pose, wherein the visualization degree depends on the shielding degree of the key target; determining the rationalization degree of the distance and the relative orientation between the virtual explanation character and the user when the virtual explanation character is in the candidate pose according to the semantic map and the camera pose;
and determining pose information of the virtual explanation roles from the plurality of candidate poses according to the visualization degree and the rationalization degree.
Optionally, the virtual explanation roles are used for explaining for a plurality of users; the pose determination module 504 is further configured to:
distributing interaction weights for all users according to the interaction times and duration of the virtual explanation roles and all users;
accordingly, the pose determining module 504 is specifically configured to, when determining pose information of the virtual interpretation role from the plurality of candidate poses according to the visualization degree and the rationalization degree:
determining experience indexes of each user according to the corresponding visualization degree and rationalization degree of each user in the plurality of users aiming at any candidate pose in the plurality of candidate poses;
and determining pose information of the virtual explanation roles from the plurality of candidate poses according to experience indexes and interaction weights of each user.
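For illustration only, the following Python sketch shows one way the interaction weights and the candidate-pose selection could be combined. The way the experience index is formed (here the product of the visualization and rationalization degrees) and the weighting scheme are assumptions for this example; the application does not fix these formulas here.

```python
def interaction_weights(counts, durations):
    """Turn each user's interaction count and total interaction duration into a
    normalized weight (a simple sum-then-normalize scheme, assumed for illustration)."""
    raw = [c + d for c, d in zip(counts, durations)]
    total = sum(raw) or 1.0
    return [value / total for value in raw]

def select_pose(candidates, visualization, rationalization, weights):
    """Pick the candidate pose with the highest weighted experience index.
    visualization[i][u] and rationalization[i][u] are scores in [0, 1] for
    candidate i as seen by user u."""
    best_pose, best_score = None, float("-inf")
    for i, pose in enumerate(candidates):
        experience = [v * r for v, r in zip(visualization[i], rationalization[i])]
        score = sum(w * e for w, e in zip(weights, experience))
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose

# usage: two candidate poses and two users
weights = interaction_weights(counts=[3, 1], durations=[20.0, 5.0])
chosen = select_pose(
    candidates=["pose_A", "pose_B"],
    visualization=[[0.9, 0.4], [0.6, 0.8]],
    rationalization=[[0.7, 0.9], [0.9, 0.5]],
    weights=weights,
)
```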
Optionally, the pose determining module 504 is specifically configured to, when determining pose information of a virtual interpretation role set in the target scene according to the semantic map and the camera pose:
randomly determining the initial pose of the virtual explanation character as the current pose;
and according to the current pose, performing multiple iterations through the Metropolis-Hastings algorithm to obtain the pose information of the virtual explanation character.
The specific implementation principle and effect of the virtual-real fusion explanation device provided by the embodiment of the application can be referred to the foregoing embodiment, and will not be repeated here.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device of the present embodiment may include:
at least one processor 601; and
a memory 602 communicatively coupled to the at least one processor;
wherein the memory 602 stores instructions executable by the at least one processor 601, the instructions being executable by the at least one processor 601 to cause the electronic device to perform the method according to any one of the embodiments described above.
Alternatively, the memory 602 may be separate or integrated with the processor 601.
The implementation principle and technical effects of the electronic device provided in this embodiment may be referred to the foregoing embodiments, and will not be described herein again.
The embodiment of the application also provides a computer readable storage medium, wherein computer executable instructions are stored in the computer readable storage medium, and when a processor executes the computer executable instructions, the method of any of the previous embodiments is realized.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the preceding embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
The integrated modules, which are implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or processor to perform some of the steps of the methods described in the various embodiments of the application.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU for short), other general purpose processors, digital signal processor (Digital Signal Processor, DSP for short), application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution. The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory NVM, such as at least one magnetic disk memory, and may also be a U-disk, a removable hard disk, a read-only memory, a magnetic disk or optical disk, etc.
The storage medium may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short). It is also possible that the processor and the storage medium reside as discrete components in an electronic device or a master device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A virtual-real fusion explanation method is characterized by comprising the following steps:
in a scene construction stage, a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera are acquired, a semantic map is constructed for the target scene through a semantic SLAM system, the corresponding relation between each pixel point in each RGB image in the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map is determined, and a plurality of three-dimensional scene point coordinates corresponding to the same RGB image form a three-dimensional scene point cloud corresponding to the RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene;
training a scene regression model according to the plurality of continuous RGB images and the three-dimensional scene point clouds, wherein the scene regression model is used for predicting the corresponding three-dimensional scene point clouds according to the input RGB images and obtaining the corresponding relation between each pixel point in the input RGB images and the coordinates of the three-dimensional scene point;
in the explanation stage, responding to an explanation request for a target scene, and acquiring an original RGB image of the target scene through an RGB camera;
inputting the acquired original RGB images into a trained scene regression model to obtain three-dimensional scene point clouds and corresponding relations corresponding to each original RGB image, calculating camera pose according to the three-dimensional scene point clouds and the corresponding relations, and determining pose information of virtual explanation roles set in the target scene according to the semantic map and the camera pose;
And obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image, and displaying the virtual-real fusion image to a user.
2. The method of claim 1, wherein the scene regression model comprises:
the encoder is used for regressing a sparse point cloud of three-dimensional scene points by extracting features of the original RGB image;
and the decoder is used for completing the sparse point cloud through a point cloud completion model to obtain a dense point cloud of three-dimensional scene points.
3. The method of claim 1, wherein obtaining a virtual-real fusion image from the pose information of the virtual lecture character and the original RGB image and displaying the virtual-real fusion image to the user, comprising:
rendering a virtual image according to the pose information of the virtual explanation character and the original RGB image; the rendering process is carried out on the GPU by using a compute shader;
and fusing the original RGB image and the virtual image to obtain a virtual-real fused image, and displaying the virtual-real fused image to a user.
4. The method of claim 1, wherein determining pose information of a virtual interpretation character set in the target scene from the semantic map, camera pose, comprises:
determining the position of a key target according to the semantic map, and determining a plurality of candidate poses for the virtual roles; the key targets are objects for the virtual explanation roles to explain in the objects in the target scene;
For any candidate pose, determining the visualization degree of the key target when the virtual explanation character is in the candidate pose according to the semantic map, the position of the key target, the attention degree of the user to different positions of the key target and the camera pose, wherein the visualization degree depends on the shielding degree of the key target; determining the rationalization degree of the distance and the relative orientation between the virtual explanation character and the user when the virtual explanation character is in the candidate pose according to the semantic map and the camera pose;
and determining pose information of the virtual explanation roles from the plurality of candidate poses according to the visualization degree and the rationalization degree.
5. The method of claim 4, wherein the virtual lecture character is used to lecture a plurality of users; the method further comprises the steps of:
distributing interaction weights for all users according to the interaction times and duration of the virtual explanation roles and all users;
correspondingly, determining pose information of the virtual interpretation role from the plurality of candidate poses according to the visualization degree and the rationalization degree comprises the following steps:
determining experience indexes of each user according to the corresponding visualization degree and rationalization degree of each user in the plurality of users aiming at any candidate pose in the plurality of candidate poses;
And determining pose information of the virtual explanation roles from the plurality of candidate poses according to experience indexes and interaction weights of each user.
6. The method of claim 1, wherein determining pose information of a virtual interpretation character set in the target scene from the semantic map, camera pose, comprises:
randomly determining the initial pose of the virtual explanation character as the current pose;
and according to the current pose, performing multiple iterations through the Metropolis-Hastings algorithm to obtain the pose information of the virtual explanation character.
7. A virtual-real fusion explanation device, characterized by comprising:
the scene construction module is used for acquiring a plurality of continuous RGBD video frames of a target scene acquired by an RGBD camera in a scene construction stage, constructing a semantic map for the target scene through a semantic SLAM system, determining the corresponding relation between each pixel point in each RGB image in the plurality of continuous RGBD video frames and three-dimensional scene point coordinates in the semantic map, and forming a three-dimensional scene point cloud corresponding to the RGB image by corresponding to the plurality of three-dimensional scene point coordinates of the same RGB image; wherein each RGBD video frame contains one RGB image and one depth image; the semantic map contains the position, posture and semantic information of objects in the target scene;
The model training module is used for training a scene regression model according to the plurality of continuous RGB images and the three-dimensional scene point clouds, wherein the scene regression model is used for predicting the corresponding three-dimensional scene point clouds according to the input RGB images and obtaining the corresponding relation between each pixel point in the input RGB images and the coordinates of the three-dimensional scene point;
the scene acquisition module is used for responding to the explanation request aiming at the target scene in the explanation stage, and acquiring an original RGB image of the target scene through an RGB camera;
the pose determining module is used for inputting the acquired original RGB images into a trained scene regression model to obtain three-dimensional scene point clouds corresponding to each original RGB image and the corresponding relation, calculating the pose of a camera according to the three-dimensional scene point clouds and the corresponding relation, and determining pose information of a virtual explanation role set in the target scene according to the semantic map and the camera pose;
and the virtual-real fusion module is used for obtaining a virtual-real fusion image according to the pose information of the virtual explanation character and the original RGB image and displaying the virtual-real fusion image to a user.
8. An electronic device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any one of claims 1-6.
9. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202310544910.5A 2023-05-15 2023-05-15 Virtual-real fusion explanation method and device, electronic equipment and storage medium Pending CN116805350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310544910.5A CN116805350A (en) 2023-05-15 2023-05-15 Virtual-real fusion explanation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310544910.5A CN116805350A (en) 2023-05-15 2023-05-15 Virtual-real fusion explanation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116805350A true CN116805350A (en) 2023-09-26

Family

ID=88079191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310544910.5A Pending CN116805350A (en) 2023-05-15 2023-05-15 Virtual-real fusion explanation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116805350A (en)

Similar Documents

Publication Publication Date Title
Kopanas et al. Point‐Based Neural Rendering with Per‐View Optimization
CN111369681B (en) Three-dimensional model reconstruction method, device, equipment and storage medium
US11232286B2 (en) Method and apparatus for generating face rotation image
US20230394743A1 (en) Sub-pixel data simulation system
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
CN114898028A (en) Scene reconstruction and rendering method based on point cloud, storage medium and electronic equipment
CN110648274B (en) Method and device for generating fisheye image
CN115457188A (en) 3D rendering display method and system based on fixation point
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
CN112991537B (en) City scene reconstruction method and device, computer equipment and storage medium
CN113159232A (en) Three-dimensional target classification and segmentation method
CN111161398A (en) Image generation method, device, equipment and storage medium
CN114170290A (en) Image processing method and related equipment
CN117274515A (en) Visual SLAM method and system based on ORB and NeRF mapping
CN116452757B (en) Human body surface reconstruction method and system under complex scene
CN114170231A (en) Image semantic segmentation method and device based on convolutional neural network and electronic equipment
CN116228986A (en) Indoor scene illumination estimation method based on local-global completion strategy
CN116805350A (en) Virtual-real fusion explanation method and device, electronic equipment and storage medium
CN115564639A (en) Background blurring method and device, computer equipment and storage medium
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN113064539A (en) Special effect control method and device, electronic equipment and storage medium
CN116091871B (en) Physical countermeasure sample generation method and device for target detection model
CN116977535B (en) Real-time ray tracing method and device, storage medium and electronic equipment
CN116645468B (en) Human body three-dimensional modeling method, method and device for training human body structure to generate model
CN117853664B (en) Three-dimensional face reconstruction method based on double-branch feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination