Disclosure of Invention
The invention provides a remote live-action augmented reality method based on VR head-mounted display equipment, which overcomes the defects of existing VR live-action software, namely that pushed live-action pictures can only be watched passively, virtual models cannot be superimposed, image quality is low, the sense of immersion is weak, and the effective transmission distance is limited.
The technical scheme adopted by the invention is a remote live-action augmented reality method based on VR head-mounted display equipment, comprising the following steps:
step S100: acquiring a remote live-action video in real time by adopting a camera device at a far end to obtain a remote live-action video of a far-end scene; performing depth calculation on the obtained remote live-action video to obtain depth information of the remote live-action video;
step S200: constructing a scene structure by using the depth information of the remote live-action video obtained in the step S100, and fusing the virtual object and the remote live-action video to obtain virtual-real fused image data;
step S300: intercepting the image data after the virtual-real fusion, and sending the image data after the virtual-real fusion to a client in real time through a network;
step S400: the client receives the virtual-real fused image data transmitted by the network, and displays and plays the image data by using VR head-mounted display equipment;
step S500: the action data of the user are acquired through the VR head-mounted display device, and then the action data are transmitted through a network to control the remote acquisition device to follow up in real time.
In step S100, an external camera robot is used as the camera device at the far end; the external camera robot collects real-time video images from multiple angles and at a wide angle, and depth calculation is then performed on the video images by software to obtain the video data and depth information of the shot real scene. The external camera robot comprises a binocular camera formed by a pair of cameras, a two-degree-of-freedom pan-tilt head, a computing communication module and a trolley; the binocular camera is fixed on the pan-tilt head, and the pan-tilt head and the computing communication module are fixed on the trolley. The computing communication module comprises a computing module responsible for depth calculation, a rendering module for scene construction and virtual-real fusion, and a wireless communication module for wireless data transmission. Data are transmitted between the external camera robot and the client through a cloud server; the client is composed of a data receiving and processing device and a VR head-mounted display device, and the cloud server comprises a Web server and a signaling server.
Preferably, in the above remote live-action augmented reality method based on the VR head-mounted display device, the specific process of step S200 includes:
s201, constructing a 3D scene structure according to the depth information obtained in the step S100; combining the 3D scene structure with the collected 2D video image to render a 3D scene;
s202, importing the geometric virtual object generated by the computer into a 3D scene;
and S203, carrying out occlusion and shadow calculation on the imported geometric virtual object according to the depth information calculated by the calculation module, and realizing virtual-real fusion.
Preferably, in the above remote live-action augmented reality method based on the VR head-mounted display device, the specific process of step S300 includes:
s301, rendering the scene subjected to virtual-real fusion into a Back Buffer of OpenGL by a rendering module according to a specified frame rate;
s302, after the Back Buffer is updated, the rendering module acquires rendered image data from the Back Buffer;
s303, carrying out compression coding on the image data rendered in the step S302;
s304, the wireless communication module sends the coded image data to the client in real time through a high-speed network.
Preferably, in the above remote live-action augmented reality method based on the VR head-mounted display device, the specific process of step S400 includes:
s401, the client receives the transmitted video image data through a WebRTC point-to-point technology;
s402, the client decodes the image data after receiving the image data;
s403, converting the decoded image data into Texture, and using the Texture as a Texture map of a geometric patch in a scene;
and S404, projecting the video image to VR head-mounted display equipment for display.
Preferably, in the above remote live-action augmented reality method based on the VR head-mounted display device, the specific process of step S500 includes:
s501, the client tracks and collects head movements of a user through VR head-mounted display equipment to generate a rotation instruction;
S502, the client acquires instructions, such as moving the external camera robot forward or backward, sent by the user through a handle;
S503, the instruction data are converted into character strings and sent to the external camera robot in real time through the WebRTC connection;
S504, after the external camera robot receives the instruction data, the character strings are converted back into numerical values, and each is converted into an actual instruction for the pan-tilt head or the trolley according to the identifier of the instruction;
and S505, the external camera robot uses the actual instructions to control the pan-tilt head to rotate and the trolley to move forward or backward.
Preferably, in the above remote live-action augmented reality method based on the VR head-mounted display device, the specific process of step S100 includes:
step S101: while the trolley of the external camera robot is moving and the pan-tilt head is rotating, the binocular camera collects the surrounding live-action video in real time and sends the collected left and right video images to the computing module;
step S102: the computing module rectifies the two video streams; after rectification, the pixels corresponding to the same reference point in the two images lie on the same row;
step S103: acquiring disparity maps of a left video image and a right video image by adopting an SGBM algorithm, and selecting one of the disparity maps for calculation;
step S104: disparity-map hole filling: the disparity map selected in step S103 is scanned to find hole regions, which are then filled with the mean of the adjacent reliable disparity values;
step S105: converting the disparity map into a depth map, wherein the conversion formula is as follows:
Depth = (Fx × Baseline) / Disp, where Depth represents the depth value; Fx represents the normalized focal length; Baseline represents the distance between the optical centers of the two cameras, called the baseline distance; and Disp represents the disparity value;
step S106: and traversing pixels of the disparity map to perform depth conversion, so as to obtain a depth map.
In step S201, camera calibration is performed using the collected binocular video images Image1 and Image2 and the calculated depth information to obtain the pose information of the binocular camera in the real scene, and this pose information is assigned to the virtual binocular camera in the virtual scene; two geometric patches are added to the virtual scene as the display screens of the virtual cameras, and the binocular 2D video images Image1 and Image2 are displayed on the two patches respectively;
in step S202, a virtual object is imported into the virtual 3D scene, and the offset between the position information P1 at which the virtual object is to be placed in the real scene and the position information P2 of the binocular camera in the real scene is calculated; the position information P4 of the virtual object in the virtual space is then obtained from the position information P3 of the virtual camera using the formula P4 = P3 + P1 - P2.
Preferably, in step S203, the implementation manner of occlusion in the above-mentioned remote live-action augmented reality method based on the VR head-mounted display device is:
in step S100, depth calculation of the scene is performed on the video images collected by the real binocular camera to obtain a depth map D1; a depth map D2 is generated from the Z-buffer (ZBuff) data of the binocular camera of the virtual scene; the pixels of depth map D1 are traversed and compared with the corresponding pixels of depth map D2, and if the depth value of a pixel in D1 is smaller than that of the corresponding pixel in D2, the corresponding pixels of the video images Image1 and Image2 on the two display patches are shifted so that their positions exceed the positions of the corresponding pixels of depth map D2;
in step S203, the shadow is implemented as follows: the spatial position P5 corresponding to a depth-map pixel can be restored from the depth map D1 of the real binocular camera; the spatial coordinate system matrix M of the light is obtained from the position and rotation information of the light, and the spatial position P6 of the pixel in the light coordinate system is calculated as P6 = M × P5; whether the pixel is occluded by the virtual object in the light coordinate system is then determined, and if so, a shadow is superimposed when the pixel is rendered.
Preferably, in the above VR head-mounted display device-based remote live-action augmented reality method, in step S304, the WebRTC-based point-to-point connection implementation process is as follows:
establishing a connection between the client and the signaling server and between the external camera robot and the signaling server using Socket.IO, for which the IP address and the corresponding communication port of the signaling server must be known;
establishing point-to-point connection between the external camera robot and the client through a signaling server, wherein the specific process comprises the following steps:
the client creates a PeerConnection; a client creates a Data Channel and creates an Offer;
in the callback function for creating the Offer, the Description information of the Offer is obtained and set as the local Description information through the SetLocalDescription() interface, and the Offer is then sent to the signaling server; a regular expression is used to add the maximum-bitrate and average-bitrate settings to the Description information; the signaling server forwards the Offer to the external camera robot;
after receiving the Offer, the external camera robot sends Answer information to the signaling server, and the signaling server forwards the Answer information to the client; the client collects the local ICE Candidates and sends them to the signaling server, which forwards them to the external camera robot;
the external camera robot receives the client's ICE Candidates, collects its own local ICE Candidates, and forwards them to the client through the signaling server; the client receives the remote ICE Candidates and adds them into the PeerConnection; the connection is thus established;
and sending the encoded image data to the client through the established point-to-point connection.
The invention has the advantages that it overcomes the defects of existing VR live-action software, which can only passively display pushed live-action pictures, cannot superimpose virtual models, and suffers from low image quality, weak immersion and limited effective transmission distance. The provided remote live-action augmented reality method based on VR head-mounted display equipment can collect remote live-action data in real time, superimpose virtual digital models and transmit them in real time, connect VR head-mounted display equipment for immersive viewing, and remotely control the video recording equipment to follow the user's head motion, greatly improving real-time performance, realism, immersion and interactivity.
Compared with a common virtual-reality live-broadcast scene, the invention uses professional shooting equipment combined with the related algorithms to process the images, which better fits the actual observation effect of human eyes and improves both the clarity of the video and the comfort of the experiencer. Deep learning and graphics algorithms are applied to achieve intelligent recognition of the remote real scene and timely superposition of virtual objects, producing a real-time virtual-real fusion effect. Combined with 5G transmission technology, the high-definition image quality and fluency at the client are improved; the client generates an immersive scene from ordinary 2D pictures, so compared with a panoramic picture shot by a panoramic camera the transmitted data are reduced by 50% and the image quality is clearer. The head-eye follow-up signals of the VR head-mounted display device are used to control the multi-degree-of-freedom pan-tilt head, realizing real-time interaction; the first-person master-control mode enhances the user's sense of immersion and makes the experience closer to reality. The invention can be widely applied to cultural exhibition, real-estate exhibition, medical rehabilitation, and remote observation and cooperative operation in high-risk industries, and has broad application prospects and high expected social and economic value.
Detailed Description
The technical features of the present invention will be further described with reference to the following embodiments.
Referring to fig. 1, the invention provides a remote live-action augmented reality method based on a VR head-mounted display device. The hardware for implementing the method mainly includes three components, namely a remote end, a client and a cloud server, which are described in turn below.
The remote end uses an external camera robot to perform real-time collection, depth calculation, scene construction, virtual-real fusion and wireless transmission of the live-action video, namely, to complete steps S100, S200 and S300 of the method. The structural schematic diagram of the external camera robot is shown in fig. 2; it is composed of a binocular camera, a two-degree-of-freedom pan-tilt head, a computing communication module and a trolley. The binocular camera is fixed on the pan-tilt head, and the pan-tilt head and the computing communication module are fixed on the trolley. The binocular camera simulates the human eyes, and the two-degree-of-freedom pan-tilt head can rotate by corresponding angles in the horizontal and vertical directions, so it is used to simulate the neck motion of a person; the trolley simulates the locomotion of a person, such as moving forward, moving backward, turning left and turning right. The computing communication module comprises a computing module, a rendering module and a wireless communication module, which are respectively responsible for depth calculation, scene construction and virtual-real fusion, and wireless transmission.
The client is composed of a data receiving and processing device and a VR head-mounted display device. The data receiving and processing device receives and processes the video data, converts them into data that the VR head-mounted display device can recognize, and plays them correctly in the VR head-mounted display device. It is also responsible for acquiring the user's head motion data through the VR head-mounted display device, processing them, and sending them to the remote end.
The Web server and the signaling server are deployed on the cloud server at the same time. The Web server is mainly used for accessing a Web page when the client is a browser, and is not specifically described in this embodiment. The signaling server is responsible for establishing point-to-point communication between the remote end and the client.
Referring to fig. 3, the present embodiment provides a remote live-action augmented reality method based on a VR head-mounted display device, the method including the following steps:
step S100: acquiring a remote live-action video in real time by adopting a camera device at a far end to obtain a remote live-action video of a far-end scene; performing depth calculation on the obtained remote live-action video to obtain depth information of the remote live-action video;
step S200: constructing a scene structure by using the depth information of the remote live-action video obtained in the step S100, and fusing the virtual object and the remote live-action video to obtain virtual-real fused image data;
step S300: intercepting the image data after the virtual-real fusion, and sending the image data after the virtual-real fusion to a client in real time through a network;
step S400: the client receives the image data after the virtual-real fusion transmitted by the network, and uses VR head-mounted display equipment to display and play;
step S500: the action data of the user are acquired through the VR head-mounted display device, and then are transmitted through a network, and the follow-up of the remote acquisition device is controlled in real time.
In the embodiment of the invention, the external camera robot first collects video images in real time from multiple angles and at a wide angle, and depth calculation is performed on the video images by software to obtain the video data and depth information of the shot real scene. A scene structure is then constructed from the obtained depth information and combined with the collected 2D video images to render a complex 3D scene. The computer-generated geometric virtual objects are imported into the scene, and occlusion and shadow calculation is carried out to realize accurate virtual-real fusion.
Then, the virtual-real fused rendered image is acquired from the software-rendered back buffer at the specified frame rate and compression-encoded to reduce the amount of data transmitted over the network; the encoded image data are then transmitted to the client in real time through a high-speed network such as a 5G network. Finally, the client receives the image data transmitted over the network, decodes them, converts the decoded video data into the required image format, and projects the images to the VR head-mounted display device for a highly immersive presentation.
In addition, the client tracks and collects the user's head motion through the VR head-mounted display device to generate rotation instruction data, and collects instruction data such as forward and backward movement sent by the user through a handle. The user instructions are transmitted to the remote robot in real time through the 5G high-speed network, controlling the pan-tilt head to rotate correspondingly while shooting and the trolley to move forward or backward, so that the first-person master-control mode enhances the user's sense of immersion and makes the experience closer to reality.
In an embodiment of the present invention, the step S100 specifically includes:
S101, while the trolley is moving or the pan-tilt head is rotating, the binocular camera collects the surrounding live-action video in real time and sends the collected left and right video images to the computing module;
S102, the computing module rectifies the two video streams, including distortion correction and stereo epipolar rectification; after rectification, the pixels corresponding to the same reference point in the two images lie on the same row;
s103, acquiring disparity maps of the left video image and the right video image by adopting an SGBM algorithm, and calculating a left disparity map and a right disparity map;
and S104, disparity-map hole filling. Owing to occlusion, uneven illumination and similar conditions, some disparities in the disparity map are unreliable, forming disparity-map holes. In this step the disparity map is scanned to find the hole regions, which are then filled with the mean of the adjacent reliable disparity values;
and S105, converting the disparity map into a depth map. The disparity is measured in pixels and the depth in millimeters; the conversion formula is:
Depth=(Fx*Baseline)/Disp
wherein Depth represents a Depth value; fx represents the normalized focal length; baseline represents the distance between the optical centers of two cameras, and is called the Baseline distance; disp denotes a disparity value.
And traversing pixels of the disparity map to perform depth conversion, so as to obtain a depth map.
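By way of illustration only, the following C++ sketch implements steps S103 to S106 with OpenCV's StereoSGBM matcher. The patent names the SGBM algorithm but not a particular library, so the OpenCV API, the matcher parameters and the simple window-mean hole filling below are assumptions.

```cpp
#include <opencv2/opencv.hpp>

// Compute a depth map (millimeters) from a rectified stereo pair.
// fx: normalized focal length in pixels; baseline: camera baseline in mm.
cv::Mat disparityToDepth(const cv::Mat& leftRect, const cv::Mat& rightRect,
                         double fx, double baseline) {
    // S103: semi-global block matching on the rectified pair.
    auto sgbm = cv::StereoSGBM::create(/*minDisparity=*/0,
                                       /*numDisparities=*/128,
                                       /*blockSize=*/5);
    cv::Mat disp16;
    sgbm->compute(leftRect, rightRect, disp16);   // fixed-point, disparity * 16
    cv::Mat disp;
    disp16.convertTo(disp, CV_32F, 1.0 / 16.0);

    // S104: fill holes (disparity <= 0) with the mean of nearby reliable values.
    for (int y = 0; y < disp.rows; ++y)
        for (int x = 0; x < disp.cols; ++x)
            if (disp.at<float>(y, x) <= 0.0f) {
                float sum = 0.0f; int cnt = 0;
                for (int dy = -3; dy <= 3; ++dy)
                    for (int dx = -3; dx <= 3; ++dx) {
                        int ny = y + dy, nx = x + dx;
                        if (ny < 0 || nx < 0 || ny >= disp.rows || nx >= disp.cols)
                            continue;
                        float d = disp.at<float>(ny, nx);
                        if (d > 0.0f) { sum += d; ++cnt; }
                    }
                if (cnt > 0) disp.at<float>(y, x) = sum / cnt;
            }

    // S105/S106: Depth = (Fx * Baseline) / Disp, traversed pixel by pixel.
    cv::Mat depth(disp.size(), CV_32F, cv::Scalar(0));
    for (int y = 0; y < disp.rows; ++y)
        for (int x = 0; x < disp.cols; ++x) {
            float d = disp.at<float>(y, x);
            if (d > 0.0f)
                depth.at<float>(y, x) = static_cast<float>(fx * baseline / d);
        }
    return depth;
}
```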
In the embodiment of the present invention, the step S200 specifically includes:
s201, the rendering module constructs a complex 3D scene by utilizing the collected 2D video image and the depth information obtained by calculation.
S202, the geometric virtual object generated by the computer can be imported into the scene.
S203, since the depth information has been calculated by the computing module, occlusion and shadow calculation can be performed on the imported geometric virtual object, realizing accurate virtual-real fusion.
Specifically, in step S201, camera calibration is performed using the collected binocular video images Image1 and Image2 and the calculated depth information to obtain the pose information of the binocular camera in the real scene, and this pose information is assigned to the virtual binocular camera in the virtual scene. Two geometric patches are added to the virtual scene as the display screens of the virtual cameras, and the binocular 2D video images Image1 and Image2 are displayed on the two patches respectively.
Specifically, in step S202, a virtual object is imported into the virtual 3D scene, and an offset between the position information P1 of the virtual object to be placed in the real scene and the position information P2 of the binocular camera in the real scene is calculated. According to the position information P3 of the virtual camera, position information P4 of the virtual object in the virtual space is obtained through calculation, and the calculation formula is as follows:
P4=P3+P1-P2。
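As a minimal illustration of this placement formula, the following C++ sketch computes P4 from P1, P2 and P3; the Vec3 helper type is hypothetical and not part of the patent.

```cpp
struct Vec3 { double x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// P4 = P3 + (P1 - P2): carry the real-world offset between the desired object
// position P1 and the real binocular camera P2 over to the virtual camera P3.
Vec3 placeVirtualObject(Vec3 p1Real, Vec3 p2RealCam, Vec3 p3VirtualCam) {
    return p3VirtualCam + (p1Real - p2RealCam);
}
```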
specifically, in step S203, the occlusion and the shadow are implemented as follows:
and (3) realizing shielding:
In step S100, depth calculation of the scene has already been performed on the video images captured by the real binocular camera, so a depth map D1 is available, and a depth map D2 can be generated from the Z-buffer (ZBuff) data of the binocular camera of the virtual scene (the virtual scene includes the two display patches and the imported geometric virtual object). The pixels of depth map D1 are traversed and compared with the corresponding pixels of depth map D2; if the depth value of a pixel in D1 is smaller than that of the corresponding pixel in D2, the corresponding pixels of the video images Image1 and Image2 on the two display patches are shifted so that their positions exceed the positions of the corresponding pixels of depth map D2, thereby rendering the effect of the real scene occluding the virtual object.
Shadow realization:
The spatial position P5 corresponding to a depth-map pixel can be restored from the depth map D1 of the real binocular camera; the spatial coordinate system matrix M of the light is obtained from the position and rotation information of the light, and the spatial position P6 of the pixel in the light coordinate system can then be calculated with the formula:
P6=M*P5。
and calculating whether the pixels are shielded from the virtual object under the lamplight coordinate system, and if so, superposing a shadow when the pixels are rendered.
In an embodiment of the present invention, the step S300 specifically includes:
s301, rendering the virtual-real fused scene into a Back Buffer of OpenGL by a rendering module according to a specified frame rate (here, 60FPS is set);
and S302, after the Back Buffer updating is finished, the rendering module acquires rendered image data from the Back Buffer.
S303, in order to reduce the data volume of network transmission, compressing and encoding the image data;
s304, the wireless communication module sends the coded image data to the client in real time through a high-speed network such as a 5G network.
Specifically, regarding step S301, OpenGL (Open Graphics Library) is an application programming interface that can be used to develop interactive three-dimensional computer graphics applications. OpenGL uses a double-buffering technique, i.e., it has two buffers: the Front Buffer and the Back Buffer. The Front Buffer is the screen that is seen; the Back Buffer resides in memory and is invisible. All drawing is done in the Back Buffer, and the finished result is copied to the screen when drawing is complete. Likewise, in the invention the virtual-real fused scene is first rendered into the Back Buffer and then switched to the screen.
60 FPS is a term from the imaging field meaning that the picture is refreshed at 60 frames per second. Although in theory a higher refresh rate gives a smoother picture, an excessively high refresh rate has no practical benefit; in the present invention it would also increase the amount of data to be transmitted and thus reduce the clarity of the picture. Rendering at 60 FPS and reading the image data from the Back Buffer at that rate satisfies the fluency requirement while keeping the image quality from degrading.
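A minimal sketch of the read-back in step S302 is given below. The patent does not name the specific OpenGL call, so the use of glReadPixels (rather than, say, asynchronous pixel buffer objects) is an assumption.

```cpp
#include <GL/gl.h>
#include <vector>

// Copy the frame just rendered into the back buffer into CPU memory so it can
// be handed to the encoder (step S303). Call after rendering, before the swap.
std::vector<unsigned char> grabBackBuffer(int width, int height) {
    std::vector<unsigned char> rgba(static_cast<size_t>(width) * height * 4);
    glReadBuffer(GL_BACK);                  // read from the back buffer
    glPixelStorei(GL_PACK_ALIGNMENT, 1);    // tightly packed rows
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, rgba.data());
    return rgba;                            // note: OpenGL rows are bottom-up
}
```

In the render loop this function would be called once per frame at the configured 60 FPS, after which the buffers are swapped to present the same image on screen.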
Specifically, in step S303, uncompressed image data are large and occupy considerable bandwidth, so to reduce the amount of data transmitted over the network the invention compression-encodes the image data. NVIDIA's hardware encoding is adopted; it performs the encoding on the GPU (graphics processing unit), offering high performance and good quality and effectively improving transmission efficiency.
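The patent only states that NVIDIA's GPU-based hardware encoder is used. As one possible realization, the sketch below opens such an encoder through FFmpeg's libavcodec wrapper; the h264_nvenc codec name and the parameter choices are assumptions.

```cpp
extern "C" {
#include <libavcodec/avcodec.h>
}

// Open an H.264 encoder backed by NVIDIA's GPU encoder via FFmpeg.
AVCodecContext* openNvencEncoder(int width, int height, int fps, int bitrate) {
    const AVCodec* codec = avcodec_find_encoder_by_name("h264_nvenc");
    if (!codec) return nullptr;              // NVENC not available on this host
    AVCodecContext* ctx = avcodec_alloc_context3(codec);
    ctx->width     = width;
    ctx->height    = height;
    ctx->time_base = AVRational{1, fps};     // 60 FPS in this embodiment
    ctx->framerate = AVRational{fps, 1};
    ctx->pix_fmt   = AV_PIX_FMT_YUV420P;     // frames converted from RGBA first
    ctx->bit_rate  = bitrate;                // bounds the transmitted data volume
    ctx->gop_size  = fps;                    // roughly one keyframe per second
    if (avcodec_open2(ctx, codec, nullptr) < 0) {
        avcodec_free_context(&ctx);
        return nullptr;
    }
    return ctx;   // feed frames with avcodec_send_frame()/avcodec_receive_packet()
}
```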
Specifically, in step S304, the WebRTC technology is used to establish point-to-point communication between the remote end and the client, and the encoded image data are then sent to the client in real time. WebRTC is short for Web Real-Time Communication, a real-time communication technology that allows web applications or sites to establish peer-to-peer connections between browsers without an intermediary, enabling the transmission of video streams, audio streams or any other data and supporting real-time voice or video conversation in a web browser. In the present invention, the point-to-point connection is established between two C++ applications rather than between browsers. The WebRTC-based point-to-point connection is implemented as follows:
First, the connection between the client and the signaling server and the connection between the remote end and the signaling server are established using Socket.IO; this requires knowing the IP address and the corresponding communication port of the signaling server;
then, a point-to-point connection between the remote end and the client is established through a signaling server, and the specific flow is as follows:
the client creates a PeerConnection.
The client creates a Data Channel and creates an Offer. In the callback function for creating the Offer, the Description information of the Offer is obtained and set as the local Description information through the SetLocalDescription() interface, after which the Offer is sent to the signaling server. In this step it should be noted that the maximum-bitrate and average-bitrate settings need to be added to the Description information using a regular expression, as shown in the sketch after this flow; otherwise the video quality is degraded by a bitrate that is too low during transmission. The signaling server then forwards the Offer to the remote end.
After receiving the Offer, the remote end sends the Answer information to the signaling server, and the signaling server forwards the Answer information to the client.
The client collects the local ICE Candidate and sends it to the signaling server, which forwards it to the remote end.
The remote end receives the ICE Candidate of the client end, and also collects the local ICE Candidate and forwards the ICE Candidate to the client end through the signaling server.
The client receives the remote ICE Candidate and adds it into the PeerConnection.
The connection is thus established.
And finally, sending the encoded image data to the client through the established point-to-point connection.
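The bitrate modification mentioned in the flow above can be sketched as follows. The patent only states that a regular expression adds maximum and average bitrate settings to the Description information, so the specific SDP fields used here (x-google-max-bitrate, x-google-start-bitrate) are assumptions standing in for those settings.

```cpp
#include <regex>
#include <string>

// Insert bitrate settings into the Offer's SDP Description before calling
// SetLocalDescription(), so the sender does not fall back to a low default
// bitrate during transmission.
std::string addBitrateToSdp(const std::string& sdp, int maxKbps, int startKbps) {
    // Append the bitrate parameters to every fmtp line; a real implementation
    // would restrict this to the video media section.
    std::regex fmtpLine(R"((a=fmtp:\d+ [^\r\n]+))");
    return std::regex_replace(
        sdp, fmtpLine,
        "$1;x-google-max-bitrate=" + std::to_string(maxKbps) +
        ";x-google-start-bitrate=" + std::to_string(startKbps));
}
```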
In the embodiment of the present invention, the step S400 specifically includes:
s401, the client receives the transmitted video image data through a WebRTC point-to-point technology;
s402, decoding the received image data;
s403, converting the decoded image data into Texture, and using the Texture as a Texture map of a geometric patch in a scene;
s404, projecting the video image to VR head-mounted display equipment through a camera for displaying.
Specifically, in step S401, to receive the compressed video image data sent by the remote end through the established WebRTC point-to-point connection, a video rendering class VideoRenderer needs to be created first; this class inherits from WebRTC's VideoFrame sink interface, and a Video Sink is registered via the Video Track to connect to the Video Engine. During operation, the video image data transmitted by the remote end can then be received in the class's overridden OnFrame() interface.
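A minimal sketch of such a VideoRenderer against the WebRTC native API is shown below. The header paths, the rtc::VideoSinkInterface<webrtc::VideoFrame> base class and the AddOrUpdateSink() registration reflect the public WebRTC source tree and are assumptions relative to the patent's wording.

```cpp
#include <api/media_stream_interface.h>
#include <api/video/video_frame.h>

// Receives remote video frames pushed by the WebRTC video engine.
class VideoRenderer : public rtc::VideoSinkInterface<webrtc::VideoFrame> {
 public:
  explicit VideoRenderer(webrtc::VideoTrackInterface* track) : track_(track) {
    // Register this object as a sink on the remote video track.
    track_->AddOrUpdateSink(this, rtc::VideoSinkWants());
  }
  ~VideoRenderer() override { track_->RemoveSink(this); }

  // Called by WebRTC for every incoming frame; hand the frame data to the
  // client's decoding and texture-update path (steps S402/S403).
  void OnFrame(const webrtc::VideoFrame& frame) override {
    // frame.video_frame_buffer() gives access to the image data.
  }

 private:
  webrtc::VideoTrackInterface* track_;
};
```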
Specifically, in step S402, the received video image data were compression-encoded at the remote end and therefore require corresponding decoding. NVIDIA's hardware decoding is again used to obtain the decoded video images. The NVIDIA hardware decoder is based on CUDA and uses the parallel processing capability of the GPU to accelerate decoding; it executes efficiently, and the quality of the decoded images meets the requirements of the invention.
Specifically, in step S403, the client software must ensure that the decoded video image data can be played in real time. The implementation adopted by the invention is as follows: a geometric patch model is created in the scene, and a default Texture is set as its texture map. After each frame of video image data is decoded, it is converted into the Texture's image data and replaces the default Texture's image data, achieving the effect of continuous playback. Compared with replacing the Texture itself, only the memory occupied by the Texture's original image data needs to be rewritten and the replaced Texture memory does not need to be released repeatedly, which improves execution efficiency and reduces memory usage.
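A minimal OpenGL sketch of this in-place texture update is given below; the patent does not name the call used, so glTexSubImage2D (which rewrites the existing texture storage instead of reallocating it) is an assumption.

```cpp
#include <GL/gl.h>

// Overwrite the existing Texture's pixel memory with the newly decoded frame
// instead of creating and destroying a Texture per frame.
void updatePatchTexture(GLuint textureId, int width, int height,
                        const unsigned char* decodedRgba) {
    glBindTexture(GL_TEXTURE_2D, textureId);
    // Rewrites only the image data; the texture object and its storage are reused.
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, decodedRgba);
}
```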
Specifically, in step S404, two head-mounted-display cameras are set up in the client software and aligned with the geometric patch model of step S403. The VR head-mounted display device is associated with the cameras in the software through SteamVR, so the images captured by the cameras can be projected directly to the VR head-mounted display device, achieving the goal of watching the video in real time in the VR head-mounted display device.
In the embodiment of the present invention, the step S500 specifically includes:
s501, tracking and collecting head motions of a user through VR head-mounted display equipment by a client, and generating a rotation instruction;
S502, the client collects instructions, such as moving the robot forward or backward, sent by the user through a handle;
S503, the instruction data are converted into character strings and sent to the remote robot in real time through the WebRTC connection;
and S504, after the remote robot receives the instruction data, the character strings are converted back into numerical values, and each is converted into an actual instruction for the pan-tilt head or the trolley according to the identifier of the instruction.
And S505, the instruction is issued to control the pan-tilt head to rotate or the trolley to move.
Therefore, the remote robot is controlled in a first-person main control mode, the immersion of the user is enhanced, and the user experience is closer to reality.
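For illustration, a possible encoding and dispatch of these instruction strings is sketched below; the wire format "identifier:value" and the identifier names are hypothetical, since the patent does not specify the actual string layout.

```cpp
#include <sstream>
#include <string>

// Hypothetical wire format "<identifier>:<value>", e.g. "YAW:15.0" for the
// pan-tilt head or "MOVE:-1" for the trolley.
std::string encodeCommand(const std::string& identifier, double value) {
    std::ostringstream oss;
    oss << identifier << ':' << value;
    return oss.str();            // sent over the WebRTC data channel (S503)
}

// On the remote robot: convert the string back to a numeric value and dispatch
// it to the pan-tilt head or the trolley according to the identifier (S504/S505).
void dispatchCommand(const std::string& message) {
    auto sep = message.find(':');
    if (sep == std::string::npos) return;
    std::string id = message.substr(0, sep);
    double value = std::stod(message.substr(sep + 1));
    if (id == "YAW" || id == "PITCH") {
        // rotate the pan-tilt head by `value` degrees
    } else if (id == "MOVE") {
        // drive the trolley forward (value > 0) or backward (value < 0)
    }
}
```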
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.