CN115309301A - Android mobile phone end-side AR interaction system based on deep learning - Google Patents
Android mobile phone end-side AR interaction system based on deep learning
- Publication number
- CN115309301A (application number CN202210541388.0A)
- Authority
- CN
- China
- Prior art keywords
- depth
- model
- mobile phone
- image
- phone end
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000003993 interaction Effects 0.000 title claims abstract description 29
- 238000013135 deep learning Methods 0.000 title claims abstract description 17
- 238000000034 method Methods 0.000 claims abstract description 29
- 230000006870 function Effects 0.000 claims abstract description 17
- 238000013528 artificial neural network Methods 0.000 claims abstract description 12
- 238000003062 neural network model Methods 0.000 claims abstract description 11
- 230000000694 effects Effects 0.000 claims description 22
- 238000012549 training Methods 0.000 claims description 13
- 238000009877 rendering Methods 0.000 claims description 9
- 230000004927 fusion Effects 0.000 claims description 5
- 230000002452 interceptive effect Effects 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 4
- 230000009466 transformation Effects 0.000 claims description 4
- 238000012952 Resampling Methods 0.000 claims description 3
- 238000005538 encapsulation Methods 0.000 claims description 2
- 238000011161 development Methods 0.000 abstract description 7
- 230000008569 process Effects 0.000 abstract description 5
- 238000010586 diagram Methods 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 4
- 238000005457 optimization Methods 0.000 description 4
- 238000004364 calculation method Methods 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 230000007547 defect Effects 0.000 description 2
- 238000004519 manufacturing process Methods 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 230000036961 partial effect Effects 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/04815—Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an Android mobile phone end-side AR interaction system based on deep learning, which comprises a mobile phone with a camera. The mobile phone camera collects original color image data and processes the image stream in real time by calling the camera API (application programming interface). An efficient and robust lightweight depth estimation neural network model is trained with the PyTorch Mobile deep learning framework, neural network inference is run on the mobile phone end side using the phone's limited computing power, and a predicted depth map corresponding to the original image data is generated. By combining the original image and the predicted depth map, and using the AR interaction functions of ARCore Depth Lab together with Unity development examples, an Android mobile phone end-side AR interaction system that does not depend on the Depth API is realized.
Description
Technical Field
The invention relates to the field of three-dimensional scene perception, in particular to an Android mobile phone end-side AR interaction system based on deep learning.
Background
In recent years, with the rapid development of deep learning and neural network technology, applications in the field of computer vision have advanced dramatically. At the same time, people increasingly demand entertainment from vision-related mobile phone applications. Users are no longer satisfied with interacting with scenes in simple two-dimensional images and have begun to expect deeper interaction with stereoscopic three-dimensional scenes. In realizing interaction with a three-dimensional scene, depth estimation, as a key link in three-dimensional perception, plays a vital role. Traditional camera equipment captures only limited 2D image information when shooting images and videos and lacks the depth information of the real three-dimensional world, while ranging equipment such as radar and RGB-D cameras suffers from drawbacks including high cost and large size. In addition, current monocular depth estimation algorithms with higher precision generally depend on a high-performance computing environment; a good depth estimation effect is difficult to obtain in a non-ideal experimental environment, and such algorithms cannot be well deployed on a mobile terminal, which limits their popularization and application. Therefore, an interactive system that does not depend on a high-performance computing environment or ranging equipment and can be deployed directly on a mobile terminal to provide real-time 3D scene interaction has great application prospects.
The existing two-dimensional video special effect technology, such as the special effects on short-video editors like Tik Tok, has certain limitations in the effect of secondary video creation. For example, when a user wants to add a special effect of a specific scene to a video (such as falling snow), conventional two-dimensional video technology can only superimpose a static two-dimensional picture onto the person, which looks rigid and weakens the effect of the video. The invention can directly construct a 3D scene from the depth estimation result and add a simulated special effect, thereby better reflecting the changes in depth levels of the environment in the video, making the video more real and vivid and improving the viewing experience.
The invention aims to use a lightweight monocular depth estimation network to calculate the scene depth in real time in AR scenes on the mobile phone end side, under the limited computing power of the mobile phone, and to restore the real scene to the greatest extent. On this basis, special effects are produced with the Unity rendering engine and related tools, and by placing virtual objects in the real environment the invention realizes interaction effects between people and the environment.
Disclosure of Invention
The invention aims to obtain more accurate depth information from simple 2D video input by applying a relatively mature algorithm and training model, to solve the depth estimation problem under a monocular camera system, to overcome the shortcomings in precision and efficiency of monocular depth estimation under traditional methods, to provide a lightweight monocular depth estimation network with good robustness, high precision and high efficiency, to break the dependence of current high-precision monocular depth estimation algorithms on high-performance computing environments, to focus on practical application, and to explore the possibility of applying the method to AR and VR scenes on the mobile phone end. Besides meeting entertainment requirements, the invention has broad application prospects in future autonomous driving, intelligent medical care and military operations.
In order to achieve this purpose, the invention provides the following technical scheme: with the assistance of the depth information, Unity software is used to produce three-dimensional special effects, so that virtual objects are generated at accurate positions and human-computer interaction is realized, oriented toward practical AR/VR application scenes. The method and system deploy the algorithm to the mobile phone side through Android development combined with the PyTorch Mobile framework, realizing real-time interaction on the mobile phone side.
Specifically, the method comprises the following steps:
a) Acquisition of training/test data: large-scale network training is performed with open-source data sets such as NYU-Depth V2; videos are shot indoors with a Kinect DK camera, automatically generating depth maps as supervision information; videos shot by a monocular camera are taken as input test samples;
b) Design of the monocular depth estimation algorithm: the ARCore framework is adopted for construction and application; the parameters returned by ARCore are used as initial values of the camera parameters and adjusted in combination with the network to obtain the camera pose, which serves as the basis of the geometric constraint for inter-frame depth estimation. With the pre-trained lightweight network EfficientNet as the backbone for depth prediction, the loss function of the network is designed and training is performed on the data set;
c) Evaluation of the monocular depth estimation algorithm: the real depth values of the training data set are used as supervision signals for the model and compared with the model's predictions so as to construct and minimize the model's loss function, while retaining the ability to provide reasonable regularization in weakly constrained regions, so that accurate depth information is obtained to achieve the interaction effect;
d) End-side deployment of the algorithm: Unity is used as an auxiliary development tool; after the neural network infers the depth information, the information is imported into the Unity module, the scene is reconstructed by the algorithm, special effects are added with Unity software, and the module is deployed on the mobile phone with PyTorch Mobile.
Preferably, the mobile phone system is an Android system and the version is Android 8 or above.
Preferably, the mobile end-side chip is a Qualcomm Snapdragon 865 or above, and a CPU or a GPU can be used to complete the neural network inference, thereby realizing high-frame-rate operation.
Preferably, the lightweight depth estimation model deployed on the mobile phone end is obtained by creating a serializable and optimizable model from the PyTorch code via TorchScript after training on the server side is completed, followed by model conversion and model optimization; the converted model is in the .ptl format and comprises the model weights and a model interpreter. Through the model optimization of the PyTorch Mobile module, the average inference speed of the optimized model is improved by 60% compared with that before optimization.
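By way of illustration, this server-side conversion and optimization step can be sketched roughly as follows; the placeholder network, input resolution and file name are assumptions of this example and not the exact model of the invention:

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder standing in for the trained lightweight depth estimation network;
# in practice the real model would be built and its trained weights loaded here.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
model.eval()

# Create a serializable, optimizable TorchScript module by tracing with a dummy RGB frame.
example_input = torch.rand(1, 3, 256, 320)
scripted = torch.jit.trace(model, example_input)

# Apply PyTorch Mobile graph optimizations, then save in the .ptl format,
# which bundles the model weights with the lite interpreter for import on Android.
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("depth_net.ptl")
```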
Preferably, the deployment of the lightweight depth estimation model on the mobile phone end side includes the following implementation steps:
S1.1: training the model on a server, with the model weights trained on a depth data set;
S1.3: importing the model inference module into the ARCore module in Android Studio software through Java programming;
S1.4: calling the mobile phone camera API to acquire an image stream I = {I_1, I_2, …, I_n} and extracting the current frame I_n as the RGB image input I_RGB;
S1.6: adding the predicted depth map I_Depth into the data stream, realizing the encapsulation of the module.
Preferably, the lightweight depth estimation neural network model algorithm specifically comprises the following steps:
S2.1: a lightweight depth estimation model predicts the depth map on the mobile phone end side; its inputs are the color RGB image captured by the camera (in YUV420 image format) and the camera pose parameters (the camera pose parameters returned by Google's ARCore framework are used as initial values of the camera parameters), and its outputs are a predicted depth image in RAW format and a predicted confidence image;
S2.2: the depth estimation neural network model is a monocular depth estimation model; a single inference by the model does not depend on information from preceding or following image frames or multiple images, and a single depth estimation can be completed from a single input image;
S2.3: the depth estimation neural network model is a lightweight network model; the model inference module deployed on the mobile phone end is smaller than 150 MB, and depth map prediction at 30 FPS is realized on mobile phone platforms with a Qualcomm Snapdragon 865 or above;
S2.4: with EfficientNet as the backbone network of the depth prediction encoder, the input image I_RGB passes through EfficientNet to extract features at different resolutions (one half, one quarter, one eighth and one sixteenth), constructing the image feature pyramid {S_1/2, S_1/4, S_1/8, S_1/16}; in the present invention, the backbone network of the model can be replaced by a similar lightweight model (e.g., MobileNet);
S2.5: a multi-scale fusion structure is adopted as the decoder of the depth prediction algorithm, as shown in fig. 3. A decoder module receives the feature branch at the current resolution and the feature branch at the upper (coarser) resolution; the upper-resolution features are passed through a residual convolution module and then spliced and fused with the current-resolution features. The residual convolution module is formed by interleaving two ReLU activation layers and two convolution modules with 3x3 convolution kernels in series. The fused features are input into a residual convolution module of the same structure, and the features of the current branch are output through a resampling module and a convolution module with a 1x1 convolution kernel (a structural sketch is given after step S2.8 below);
S2.6: the multi-scale loss is used as the loss function of the neural network model. The formula computes the gradient differences between the predicted depth and the real depth of the data set along the x-axis and y-axis directions respectively, and sums and fuses them across different scale resolutions (an assumed reconstruction of this formula is given after step S2.8 below);
S2.7: for better robustness and generalization of the model across different data sets, the model uses affine-invariant depth prediction, i.e. d* = d·s + μ, where s and μ are the scale and shift of the affine transformation; the affine transformation parameters between the predicted depth and the real depth are obtained by a global least-squares method (an alignment sketch is given after step S2.8 below);
S2.8: the model is trained on multiple public depth data sets such as NYU Depth V2, KITTI, ScanNet and ETH3D, so that the model learns a sufficient data prior and its generalization ability is improved.
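As referenced in step S2.5, the decoder block can be sketched as follows in PyTorch; the channel count, the use of element-wise addition to fuse the two branches, and bilinear resampling are assumptions of this sketch rather than details fixed by the invention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvBlock(nn.Module):
    """ReLU -> 3x3 conv -> ReLU -> 3x3 conv with a skip connection,
    corresponding to the residual convolution module of step S2.5."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv1(F.relu(x))
        out = self.conv2(F.relu(out))
        return x + out

class FusionDecoderBlock(nn.Module):
    """Multi-scale fusion decoder block: refines the coarser-resolution branch,
    fuses it with the current-resolution branch, refines again, then resamples
    and projects with a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.refine_coarse = ResidualConvBlock(channels)
        self.refine_fused = ResidualConvBlock(channels)
        self.project = nn.Conv2d(channels, channels, 1)

    def forward(self, cur_feat, coarse_feat):
        coarse = self.refine_coarse(coarse_feat)
        coarse = F.interpolate(coarse, size=cur_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        fused = self.refine_fused(cur_feat + coarse)
        fused = F.interpolate(fused, scale_factor=2,
                              mode="bilinear", align_corners=False)
        return self.project(fused)

# Usage sketch: fuse a 1/8-resolution branch into the 1/4-resolution branch.
block = FusionDecoderBlock(64)
out = block(torch.rand(1, 64, 60, 80), torch.rand(1, 64, 30, 40))
```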
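The formula referenced in step S2.6 is not reproduced in this text. A form consistent with the description (per-pixel gradient differences along the x and y axes, summed and fused over several scale resolutions), written here as an assumed reconstruction in the style of the standard multi-scale gradient-matching loss, is:

```latex
% Assumed reconstruction of the multi-scale gradient loss of step S2.6.
% R^k is the difference between predicted and real depth at scale k,
% M the number of valid pixels, K the number of scale resolutions.
L_{\mathrm{grad}} = \frac{1}{M}\sum_{k=1}^{K}\sum_{i}
  \left( \left|\nabla_{x} R_{i}^{k}\right| + \left|\nabla_{y} R_{i}^{k}\right| \right),
\qquad
R^{k} = d^{k}_{\mathrm{pred}} - d^{k}_{\mathrm{gt}} .
```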
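As referenced in step S2.7, the global least-squares solution for the scale s and shift μ can be sketched as follows; the function name, the direction of alignment (mapping the prediction onto the real depth) and the use of a validity mask are assumptions of this example:

```python
import torch

def align_scale_shift(pred, gt, mask):
    """Solve min over (s, mu) of sum((s * pred + mu - gt)^2) on valid pixels by
    ordinary least squares and return the affinely aligned prediction,
    following the affine-invariant formulation d* = d*s + mu of step S2.7."""
    p = pred[mask]
    g = gt[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)        # [N, 2]
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution   # [2, 1]: (s, mu)
    s, mu = sol[0, 0], sol[1, 0]
    return s * pred + mu

# Usage sketch with random tensors standing in for predicted and real depth maps.
pred = torch.rand(1, 240, 320)
gt = 2.0 * torch.rand(1, 240, 320) + 0.5
mask = gt > 0
aligned = align_scale_shift(pred, gt, mask)
```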
Preferably, the steps for implementing the AR interaction function with the AR interaction functions of ARCore Depth Lab and Unity are as follows:
S3.1: after the depth information prediction of the neural network is completed, the generated depth prediction map replaces the depth image that ARCore would return from the Depth API, and ARCore is called in Unity;
S3.2: the rendering engine provided by Unity generates the mesh information of the scene from the depth map and renders a pseudo-color map representing the depth information;
S3.3: using part of the functions of ARCore Depth Lab and the special effect components of the Unity scene, the corresponding special effects are added to the depth scene.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention deploys the algorithm directly on the mobile phone end side and uses the mobile phone's computing power for neural network inference, avoiding the heavy dependence of existing monocular depth estimation methods on the computing resources of large-scale servers. Existing depth estimation networks struggle to balance precision and efficiency: more accurate methods usually require a long model inference process and a more complex model structure. Unlike existing large-scale deep learning networks, the method of the invention achieves an effective balance between precision and efficiency. The invention adopts a lightweight network structure to realize frame-by-frame monocular depth estimation inference; the model structure of the network is simpler, the computing power consumed during training is reduced, and the network inference is easy to run and to deploy on the end side;
2. The invention realizes end-side depth estimation application development on the Android platform and differs from existing frameworks for running neural network inference on mobile phone platforms. Existing methods generally train a model with PyTorch on the server side, obtain the model after parameter convergence, convert it to the ONNX format and then to the TensorFlow framework, and finally use the TensorFlow Lite module to complete model inference on the mobile device side. The method of the invention does not depend on the TensorFlow framework: the model is converted directly with PyTorch Mobile and model inference runs on the mobile device directly under the PyTorch framework, which is more convenient and avoids switching the running model between different deep learning frameworks;
3. The depth estimation method of the invention avoids dependence on the Depth API, an interface provided by the Android mobile phone system that is only supported by some high-end phone models. Different from existing software such as Depth Lab, the depth information in this method is obtained by deep learning model inference from RGB images and requires no additional hardware (such as lidar, millimeter-wave radar or other depth sensors) to acquire depth information; Unity is used as the three-dimensional special effect development tool to realize the AR/VR interaction function, which gives the method strong practical application value;
4. The invention performs depth estimation network inference on the PyTorch Mobile framework, which provides an end-to-end workflow that simplifies the path from research to a production environment on the mobile device side while keeping the whole process within one framework. The invention adopts a clear structural framework, which facilitates subsequent modification and upgrading of each part.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a system model flow diagram of the present invention;
FIG. 2 is a diagram illustrating a depth estimation result of the system according to the present invention;
FIG. 3 is a diagram of an algorithm model of the system of the present invention;
FIG. 4 is a diagram illustrating AR interaction of the system of the present invention.
Detailed Description
For further understanding of the present invention, its objects, technical solutions and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the description is illustrative only and not limiting. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Examples
Referring to fig. 1, the present invention provides a technical solution: an Android mobile phone end-side AR interaction system based on deep learning, comprising a mobile phone with a camera. The mobile phone collects original color image data by calling the camera API, obtains the camera parameters, pose and so on, extracts the camera frames and processes the image stream in real time. The server side trains an efficient and robust lightweight depth estimation neural network model with the PyTorch Mobile deep learning framework; after training is finished, a serializable and optimizable model is created from the PyTorch code via TorchScript, model conversion and model optimization are performed, and the model is stored in the .ptl format, including the model weights and a model interpreter. The lightweight depth estimation model imports the model file converted via TorchScript into the ARCore module through the Java language and Android Studio software, runs inference on the mobile phone end side, and replaces the Depth API interface with the depth map obtained by inference and prediction, realizing the input and output of the data streams. Neural network inference runs on the mobile phone side with the phone's limited computing power, generating a predicted depth map corresponding to the original image data. After the depth information prediction of the neural network is completed, the generated depth prediction map replaces the depth image that ARCore would return from the Depth API, and ARCore is called in Unity. First, the rendering engine provided by Unity generates the mesh information of the scene from the depth map and renders a pseudo-color map representing the depth information; then, using part of the functions of ARCore Depth Lab and the special effect components of the Unity scene, the corresponding special effects are added to the depth scene. By combining the original image and the predicted depth map, and using the AR interaction functions of ARCore Depth Lab and Unity development examples, an Android mobile phone end-side AR interaction system independent of the Depth API is realized.
Please refer to fig. 2, which shows the depth map effect tested with the network structure model. Fig. 2 shows the effect of depth map construction for indoor scenes by the depth estimation framework and the lightweight depth estimation network model introduced by the present invention: the first and third rows are the input RGB images, and the second and fourth rows are the corresponding depth maps predicted with the network of the present invention. With the multi-scale fusion decoding framework, the estimation of detail regions in the predicted maps is more accurate, and most of the three-dimensional information of the scene is recovered under limited computing power.
Please refer to fig. 3, which is a schematic diagram of the depth estimation network model structure of the present invention. The network model adopts EfficientNet as the backbone network of the encoder to extract image features; an image pyramid is constructed at different resolutions, a multi-scale fusion decoder is adopted to fuse the image features, and finally the depth map corresponding to the predicted image is decoded through a residual convolution module. The residual convolution module is formed by connecting in series a ReLU activation layer, a convolution module with a 3x3 convolution kernel, another ReLU activation layer and another convolution module with a 3x3 convolution kernel. The multi-scale fusion module receives the feature maps of the current feature branch and the previous feature branch, fuses the features of the previous branch, after they pass through the residual convolution module, with the features of the current branch, and then connects in sequence a residual convolution module of the same structure, a resampling module and a convolution module with a 1x1 convolution kernel to output the decoded features.
Please refer to fig. 4, which shows the AR interaction effect measured on an Android mobile phone with the technical solution of the present invention. After the depth map rendering is run, the model can finish rendering the scene in a short time and generate the corresponding pseudo-color map. According to the depth estimation result, the corresponding object can be aligned on the mobile phone end and a virtual object can be placed. When the mobile phone is moved, the virtual object moves correspondingly with the scene, realizing the interaction with the three-dimensional information.
The above-disclosed preferred embodiments of the present invention are merely illustrative of the technical solutions of the present invention and are not restrictive. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed; obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents; modifications and equivalents may be made thereto without departing from the spirit and scope of the invention.
Claims (7)
1. An Android mobile phone end-side AR interaction system based on deep learning, characterized in that:
firstly, original color image data are acquired by the mobile phone through calling the camera API; using an efficient and robust lightweight depth estimation neural network model trained with the PyTorch Mobile deep learning framework and the limited computing power of the mobile phone, neural network inference is run on the mobile phone end side, the image stream is processed in real time, and a predicted depth map corresponding to the original image data is then generated; finally, the original image and the predicted depth map are combined, and the AR interaction function is realized with the AR interaction functions of ARCore Depth Lab and Unity.
2. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the mobile phone system is an Android system and the version is Android 8 or above.
3. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the mobile phone can use a CPU or a GPU to complete the neural network inference, and a high-performance chip (such as a Qualcomm Snapdragon 865) is recommended to realize high-frame-rate operation.
4. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the lightweight depth estimation model deployed on the mobile phone end is converted and optimized by creating a serializable and optimizable model from the PyTorch code via TorchScript after training on the server side; the stored model suffix is the .ptl format, and the model file information comprises the model weights and the interpreter of the model;
5. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the deployment of the lightweight depth estimation model on the mobile phone end side comprises the following implementation steps:
S1.1: training the model on a server, with the model weights trained on a depth data set;
S1.3: importing the model inference module into the ARCore module in Android Studio software through Java programming;
S1.4: calling the mobile phone camera API to acquire an image stream I = {I_1, I_2, …, I_n} and extracting the current frame I_n as the RGB image input I_RGB;
S1.6: adding the predicted depth map I_Depth into the data stream, realizing the encapsulation of the module;
6. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the lightweight depth estimation neural network model algorithm specifically comprises the following steps:
S2.1: a lightweight depth estimation model predicts the depth map on the mobile phone end side; its inputs are the color RGB image captured by the camera (in YUV420 image format) and the camera pose parameters (the camera pose parameters returned by Google's ARCore framework are used as initial values of the camera parameters), and its outputs are a predicted depth image in RAW format and a predicted confidence image;
S2.2: the depth estimation neural network model is a monocular depth estimation model; a single inference by the model does not depend on information from preceding or following image frames or multiple images, and a single depth estimation can be completed from a single input image;
S2.3: the depth estimation neural network model is a lightweight network model; the model inference module deployed on the mobile phone end is smaller than 150 MB, and depth map prediction at 30 FPS is realized on mobile phone platforms with a Qualcomm Snapdragon 865 or above;
S2.4: with EfficientNet as the backbone network of the depth prediction encoder, the input image I_RGB passes through EfficientNet to extract features at different resolutions (one half, one quarter, one eighth and one sixteenth), constructing the image feature pyramid {S_1/2, S_1/4, S_1/8, S_1/16}; in the present invention, the backbone network of the model can be replaced by a similar lightweight model (such as MobileNet);
S2.5: a multi-scale fusion structure is adopted as the decoder of the depth prediction algorithm, as shown in fig. 3. A decoder module receives the feature branch at the current resolution and the feature branch at the upper (coarser) resolution; the upper-resolution features are passed through a residual convolution module and then spliced and fused with the current-resolution features. The residual convolution module is formed by interleaving two ReLU activation layers and two convolution modules with 3x3 convolution kernels in series. The fused features are input into a residual convolution module of the same structure, and the features of the current branch are output through a resampling module and a convolution module with a 1x1 convolution kernel;
S2.6: the multi-scale loss is used as the loss function of the neural network model. The formula computes the gradient differences between the predicted depth and the real depth of the data set along the x-axis and y-axis directions respectively, and sums and fuses them across different scale resolutions;
S2.7: for better robustness and generalization of the model across different data sets, the model uses affine-invariant depth prediction, i.e. d* = d·s + μ, where s and μ are the scale and shift of the affine transformation; the affine transformation parameters between the predicted depth and the real depth are obtained by a global least-squares method;
S2.8: the model is trained on multiple public depth data sets such as NYU Depth V2, KITTI, ScanNet and ETH3D, so that the model learns a sufficient data prior and its generalization ability is improved.
7. The deep learning-based Android mobile phone end-side AR interaction system of claim 1, characterized in that: the steps for realizing the AR interaction function with the AR interaction functions of ARCore Depth Lab and Unity are as follows:
S3.1: after the depth information prediction of the neural network is completed, the generated depth prediction map replaces the depth image that ARCore would return from the Depth API, and ARCore is called in Unity;
S3.2: the rendering engine provided by Unity generates the mesh information of the scene from the depth map and renders a pseudo-color map representing the depth information;
S3.3: using part of the functions of ARCore Depth Lab and the special effect components of the Unity scene, the corresponding special effects are added to the depth scene.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210541388.0A CN115309301A (en) | 2022-05-17 | 2022-05-17 | Android mobile phone end-side AR interaction system based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210541388.0A CN115309301A (en) | 2022-05-17 | 2022-05-17 | Android mobile phone end-side AR interaction system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115309301A true CN115309301A (en) | 2022-11-08 |
Family
ID=83854804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210541388.0A Pending CN115309301A (en) | 2022-05-17 | 2022-05-17 | Android mobile phone end-side AR interaction system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115309301A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152323A (en) * | 2023-04-18 | 2023-05-23 | 荣耀终端有限公司 | Depth estimation method, monocular depth estimation model generation method and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110221689A (en) * | 2019-05-10 | 2019-09-10 | 杭州趣维科技有限公司 | A kind of space drawing method based on augmented reality |
CN110716641A (en) * | 2019-08-28 | 2020-01-21 | 北京市商汤科技开发有限公司 | Interaction method, device, equipment and storage medium |
CN111465962A (en) * | 2018-10-04 | 2020-07-28 | 谷歌有限责任公司 | Depth of motion for augmented reality of handheld user devices |
CN114332666A (en) * | 2022-03-11 | 2022-04-12 | 齐鲁工业大学 | Image target detection method and system based on lightweight neural network model |
-
2022
- 2022-05-17 CN CN202210541388.0A patent/CN115309301A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111465962A (en) * | 2018-10-04 | 2020-07-28 | 谷歌有限责任公司 | Depth of motion for augmented reality of handheld user devices |
CN110221689A (en) * | 2019-05-10 | 2019-09-10 | 杭州趣维科技有限公司 | A kind of space drawing method based on augmented reality |
CN110716641A (en) * | 2019-08-28 | 2020-01-21 | 北京市商汤科技开发有限公司 | Interaction method, device, equipment and storage medium |
CN114332666A (en) * | 2022-03-11 | 2022-04-12 | 齐鲁工业大学 | Image target detection method and system based on lightweight neural network model |
Non-Patent Citations (3)
Title |
---|
- 余方洁: "Research on Mobile-side Point Cloud Segmentation Methods Based on Depth Maps", China Master's Theses Full-text Database, Information Science and Technology, no. 08, 15 August 2021 (2021-08-15), pages 138 - 273 *
- 刘强: "Building Enterprise-level Recommender Systems: Algorithms, Engineering Implementation and Case Analysis", 13 July 2021, China Machine Press (机械工业出版社), pages: 169 *
- 马榕 et al.: "Low-power Visual Odometry Based on Monocular Depth Estimation", Journal of System Simulation (系统仿真学报), vol. 33, no. 12, 18 December 2021 (2021-12-18), pages 3001 - 3011 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152323A (en) * | 2023-04-18 | 2023-05-23 | 荣耀终端有限公司 | Depth estimation method, monocular depth estimation model generation method and electronic equipment |
CN116152323B (en) * | 2023-04-18 | 2023-09-08 | 荣耀终端有限公司 | Depth estimation method, monocular depth estimation model generation method and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113572962B (en) | Outdoor natural scene illumination estimation method and device | |
CN113706699B (en) | Data processing method and device, electronic equipment and computer readable storage medium | |
CN113077505B (en) | Monocular depth estimation network optimization method based on contrast learning | |
KR20200128378A (en) | Image generation network training and image processing methods, devices, electronic devices, and media | |
CN113284173B (en) | End-to-end scene flow and pose joint learning method based on false laser radar | |
CN113034413B (en) | Low-illumination image enhancement method based on multi-scale fusion residual error coder-decoder | |
CN115690382A (en) | Training method of deep learning model, and method and device for generating panorama | |
CN116721207A (en) | Three-dimensional reconstruction method, device, equipment and storage medium based on transducer model | |
CN116957931A (en) | Method for improving image quality of camera image based on nerve radiation field | |
CN113382275A (en) | Live broadcast data generation method and device, storage medium and electronic equipment | |
CN115309301A (en) | Android mobile phone end-side AR interaction system based on deep learning | |
CN116863003A (en) | Video generation method, method and device for training video generation model | |
CN112200817A (en) | Sky region segmentation and special effect processing method, device and equipment based on image | |
CN109788270A (en) | 3D-360 degree panorama image generation method and device | |
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium | |
CN115100707A (en) | Model training method, video information generation method, device and storage medium | |
Ren et al. | Efficient human pose estimation by maximizing fusion and high-level spatial attention | |
CN114022356A (en) | River course flow water level remote sensing image super-resolution method and system based on wavelet domain | |
CN113793420A (en) | Depth information processing method and device, electronic equipment and storage medium | |
WO2024159553A1 (en) | Decoding method for volumetric video, and storage medium and electronic device | |
CN115035173B (en) | Monocular depth estimation method and system based on inter-frame correlation | |
CN114758205B (en) | Multi-view feature fusion method and system for 3D human body posture estimation | |
CN116258756A (en) | Self-supervision monocular depth estimation method and system | |
CN114926594A (en) | Single-view-angle shielding human body motion reconstruction method based on self-supervision space-time motion prior | |
CN114881849A (en) | Depth image super-resolution reconstruction method combining monocular depth estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |