CN113643357A - AR portrait photographing method and system based on 3D positioning information - Google Patents

AR portrait photographing method and system based on 3D positioning information

Info

Publication number
CN113643357A
CN113643357A (application No. CN202110793636.6A)
Authority
CN
China
Prior art keywords: portrait, reconstruction, photo, semantic, module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110793636.6A
Other languages
Chinese (zh)
Inventor
陈志国
丛林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202110793636.6A priority Critical patent/CN113643357A/en
Publication of CN113643357A publication Critical patent/CN113643357A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/70 (Image analysis): Determining position or orientation of objects or cameras
    • G06N 3/045 (Neural networks): Combinations of networks
    • G06N 3/048 (Neural networks): Activation functions
    • G06N 3/08 (Neural networks): Learning methods
    • G06T 7/10 (Image analysis): Segmentation; edge detection
    • G06T 7/50 (Image analysis): Depth or shape recovery
    • G06T 2207/10028 (Image acquisition modality): Range image; depth image; 3D point clouds
    • G06T 2207/20024 (Special algorithmic details): Filtering details
    • G06T 2207/20036 (Special algorithmic details): Morphological image processing
    • G06T 2207/20081 (Special algorithmic details): Training; learning
    • G06T 2207/20084 (Special algorithmic details): Artificial neural networks [ANN]
    • G06T 2207/30196 (Subject of image): Human being; person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to an AR portrait photographing method and system based on 3D positioning information. The method comprises: acquiring a 3D reconstructed map and performing localization to obtain the camera pose and intrinsic parameters; acquiring a portrait photo containing the 3D reconstructed scene and performing human semantic segmentation on it through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet; projecting the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, where the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, averaging the depths of these points to obtain the portrait depth value, and comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value to obtain their relative occlusion relationship; and applying erosion, dilation and Gaussian filtering to the portrait semantic mask, and fusing the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo. The user experience is thereby improved.

Description

AR portrait photographing method and system based on 3D positioning information
Technical Field
The present application relates to the technical field of augmented reality (AR), and in particular to an AR portrait photographing method and system based on 3D positioning information.
Background
With the rapid development of augmented reality technology in the gaming and entertainment field, photographing applications based on large-space augmented reality are becoming increasingly widespread. In practice, however, the virtual content often covers a large part of the real scene; for example, an entire building may be covered by virtual content, so that when a person stands in front of the building for a group photo, the person is occluded by the virtual content, which degrades the user experience.
In the related art, most portrait photographing applications based on large-scene augmented reality actually rely on small-scale experience content: for example, the user aims the camera at his or her palm, and the mobile terminal automatically summons a virtual character and displays it on the user's hand. In that case the hand is background relative to the virtual object, so the occlusion relationship between the virtual object and the palm does not need to be considered. Other augmented-reality photographing methods either ignore the problem of virtual content occluding people, or generate the augmented-reality picture through manual post-processing, which lacks realism.
At present, no effective solution has been proposed for the problems of picture distortion and poor user experience caused by portraits being occluded by virtual content when photographing people in a virtual scene.
Disclosure of Invention
The embodiments of the present application provide an AR portrait photographing method and system based on 3D positioning information, in order to solve the problems in the related art of picture distortion and poor user experience caused by portraits being occluded by virtual content when photographing people in a virtual scene.
In a first aspect, an embodiment of the present application provides an AR portrait photographing method based on 3D positioning information, the method comprising:
acquiring a 3D reconstructed map, and performing localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters;
acquiring a portrait photo containing the 3D reconstructed scene, and performing human semantic segmentation on the portrait photo through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, wherein the photo contains both the portrait and the 3D reconstructed scene;
projecting the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, wherein the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, averaging the depths of these 3D model points to obtain the portrait depth value, and comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value to obtain the relative occlusion relationship between the portrait and the virtual object;
and applying erosion, dilation and Gaussian filtering to the portrait semantic mask, and fusing the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
In some of these embodiments, before acquiring the 3D reconstructed map, the method comprises:
capturing photos of the experience area with the camera and performing 3D map reconstruction of the experience area, wherein the photos cover every viewing angle of the buildings in the experience area and adjacent photos have cross views whose overlapping area is no less than 50%.
In some of these embodiments, performing 3D map reconstruction of the experience area comprises:
performing 3D reconstruction of the spatial structure of the experience area with COLMAP, completing the ground outside the buildings directly with a plane, and retaining the 2D feature points and descriptors of the captured photos together with the 3D keypoint coordinates corresponding to the 2D feature points in the 3D reconstructed map.
In some embodiments, performing localization using the image information and auxiliary information in the 3D reconstructed map to obtain the camera pose comprises:
selecting the image with the largest number of feature point matches by matching the 2D feature points, and performing a PnP computation on the 2D feature points and the corresponding 3D keypoints to obtain the camera pose.
In some embodiments, performing human semantic segmentation on the portrait photo containing the 3D reconstructed scene through the semantic deep neural network to obtain the portrait semantic mask comprises:
performing convolution calculations on the photo through an encoder module and outputting convolution parameters;
processing the output of each Block layer through a decoder module and outputting a mask reference;
and performing a convolution operation on the mask reference with the convolution parameters to obtain the portrait semantic mask.
In a second aspect, an embodiment of the present application provides an AR portrait photographing system based on 3D positioning information, the system comprising:
a positioning module, configured to acquire a 3D reconstructed map and perform localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters;
a semantic segmentation module, configured to acquire a portrait photo containing the 3D reconstructed scene and perform human semantic segmentation on the portrait photo through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, wherein the photo contains both the portrait and the 3D reconstructed scene;
a depth estimation module, configured to project the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, wherein the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, to average the depths of these 3D model points to obtain the portrait depth value, and to obtain the relative occlusion relationship between the portrait and the virtual object by comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value;
and a photo fusion module, configured to apply erosion, dilation and Gaussian filtering to the portrait semantic mask, and to fuse the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
In some of these embodiments, the system further comprises a reconstruction module. Before the 3D reconstructed map is acquired,
the reconstruction module is configured to capture photos of the experience area with the camera and perform 3D map reconstruction of the experience area, wherein the photos cover every viewing angle of the buildings in the experience area and adjacent photos have cross views whose overlapping area is no less than 50%.
In some embodiments, the reconstruction module is further configured to perform 3D reconstruction of the spatial structure of the experience area with COLMAP, complete the ground outside the buildings directly with a plane, and retain the 2D feature points and descriptors of the captured photos together with the coordinates of the 3D keypoints corresponding to the 2D feature points in the 3D reconstructed map.
In some embodiments, the positioning module is further configured to select the image with the largest number of feature point matches by matching the 2D feature points, and to perform a PnP computation on the 2D feature points and the corresponding 3D keypoints to obtain the camera pose.
In some embodiments, the semantic segmentation module is further configured to perform convolution calculations on the photo through the encoder module and output convolution parameters;
process the output of each Block layer through the decoder module and output a mask reference;
and perform a convolution operation on the mask reference with the convolution parameters to obtain the portrait semantic mask.
Compared with the related art, the AR portrait photographing method based on 3D positioning information provided in the embodiments of the present application acquires a 3D reconstructed map and performs localization using the image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters; then acquires a portrait photo containing the 3D reconstructed scene and performs human semantic segmentation on it through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, the photo containing both the portrait and the 3D reconstructed scene; then projects the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, where the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, averages the depths of these 3D model points to obtain the portrait depth value, and compares the virtual object depth value in the 3D reconstructed scene with the portrait depth value to obtain their relative occlusion relationship; and finally applies erosion, dilation and Gaussian filtering to the portrait semantic mask and fuses the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
In this way, the depth of the portrait in the 3D background is obtained from the 3D positioning information of the spatial scene, and the relative occlusion relationship between the portrait and the virtual content is determined by comparing the depth of the virtual content created in the scene with the depth of the portrait in the 3D scene. This solves the problem in the related art that the portrait is occluded by virtual content when photographing people in a virtual scene, and thereby improves the user experience.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic application environment diagram of an AR portrait photographing method based on 3D positioning information according to an embodiment of the present application;
FIG. 2 is a flowchart of an AR portrait photographing method based on 3D positioning information according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a semantic deep neural network according to an embodiment of the present application;
FIG. 4 is a block diagram of an AR portrait photographing system based on 3D positioning information according to an embodiment of the present application;
FIG. 5 is another block diagram of an AR portrait photographing system based on 3D positioning information according to an embodiment of the present application;
fig. 6 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar terms in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof in this application are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a list of steps or modules (units) is not limited to the listed steps or units, but may include other steps or units not expressly listed or inherent to such process, method, product, or device. The words "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
The AR portrait photographing method based on 3D positioning information provided in the present application may be applied to the application environment shown in fig. 1, which is a schematic diagram of the application environment of the method according to an embodiment of the present application. The terminal device 11 communicates with the server 10 via a network. The server 10 acquires a 3D reconstructed map and performs localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters; then acquires a portrait photo containing the 3D reconstructed scene and performs human semantic segmentation on it through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, the photo containing both the portrait and the 3D reconstructed scene; then projects the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, where the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, averages the depths of these points to obtain the portrait depth value, and compares the virtual object depth value in the 3D reconstructed scene with the portrait depth value to obtain their relative occlusion relationship; and finally applies erosion, dilation and Gaussian filtering to the portrait semantic mask, fuses the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo, and displays the fused photo on the terminal device 11. The terminal device 11 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, a portable wearable device, a camera, and the like, and the server 10 may be implemented as an independent server or as a server cluster composed of multiple servers. In particular, the portrait semantic segmentation and depth estimation in the embodiments of the present application may be processed on the mobile terminal, or the image taken by the camera may be uploaded to a server and the computation performed on the server.
This embodiment provides an AR portrait photographing method based on 3D positioning information. Fig. 2 is a flowchart of the AR portrait photographing method based on 3D positioning information according to an embodiment of the present application; as shown in fig. 2, the flow includes the following steps:
Step S201, acquiring a 3D reconstructed map, and performing localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters;
Preferably, before the 3D reconstructed map is acquired, photos are first taken with a camera in the area the user will experience. The photos cover every viewing angle of the buildings in the experience area, and adjacent photos have a sufficiently large cross view; specifically, the overlapping area of the cross views must be no less than 50%. The spatial structure of the experience area is then reconstructed in 3D with COLMAP to obtain the 3D reconstructed map, and the feature points and descriptors of each captured photo, together with the 3D keypoint coordinates corresponding to the 2D feature points in the 3D reconstructed map, are retained. Preferably, SIFT feature points are used in this embodiment. In addition, so that depth estimation can later be performed for virtual content created in the virtual scene, the ground outside the buildings can be completed directly with a plane. It should be noted that COLMAP is open-source software for three-dimensional reconstruction from images, from which many intermediate results can be obtained as needed. Moreover, because it integrates both SfM and MVS, once images have been fed in, COLMAP can directly carry out image matching, sparse reconstruction, dense reconstruction, mesh reconstruction and related processes, and it provides a graphical visualization interface, which makes it convenient to use.
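For concreteness, the reconstruction step can be driven from a script. The sketch below is only an illustration of how the COLMAP command-line tools might be invoked for the experience area; the directory names are hypothetical, and dense reconstruction, meshing and the plane completion of the ground are omitted.

```python
import subprocess
from pathlib import Path

# Hypothetical layout of the captured experience-area photos.
image_dir = Path("experience_area/images")
workspace = Path("experience_area/colmap")
database = workspace / "database.db"
sparse_dir = workspace / "sparse"
sparse_dir.mkdir(parents=True, exist_ok=True)

# Extract SIFT features (2D feature points + descriptors) for every photo.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", str(database),
                "--image_path", str(image_dir)], check=True)

# Match features between photos whose views overlap by at least 50%.
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", str(database)], check=True)

# Incremental SfM: recovers camera poses and the sparse 3D keypoints that
# the 2D feature points are later matched against during localization.
subprocess.run(["colmap", "mapper",
                "--database_path", str(database),
                "--image_path", str(image_dir),
                "--output_path", str(sparse_dir)], check=True)
```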
Further, in the localization stage, the user first takes a photo at a location in the aforementioned experience area; the SIFT feature points of this photo are extracted, and its 2D feature points are matched against the 2D feature points of the previously captured photos. The captured photo with the largest number of feature point matches is considered to be the one whose viewing angle is closest to the user's current photo. Finally, a PnP computation, i.e. camera pose estimation, is performed on the 2D feature points and the corresponding 3D keypoints of that photo to obtain the current camera pose. It should be noted that a PnP computation determines the camera pose or object pose from n known correspondences between 3D points in space and 2D points in the image;
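As a hedged illustration of this localization step, the PnP computation can be performed with OpenCV once the 2D-3D correspondences have been assembled from the feature matching described above; the helper name and its arguments are illustrative, not part of the patent.

```python
import cv2
import numpy as np

def localize_camera(pts_2d, pts_3d, K, dist_coeffs=None):
    """Estimate the camera pose from matched 2D-3D correspondences (PnP).

    pts_2d: (N, 2) pixel coordinates of the matched SIFT features in the
            user's photo.
    pts_3d: (N, 3) coordinates of the corresponding 3D keypoints stored
            with the 3D reconstructed map.
    K:      (3, 3) camera intrinsic matrix.
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, dist_coeffs)
    if not ok:
        raise RuntimeError("PnP localization failed")
    R, _ = cv2.Rodrigues(rvec)          # world-to-camera rotation matrix
    return R, tvec.reshape(3), inliers
```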
step S202, obtaining a portrait photo with a 3D reconstruction scene, and performing human body semantic segmentation on the portrait photo with the 3D reconstruction scene through a semantic depth neural network to obtain a portrait semantic mask and portrait foot pixel positions, wherein the photo comprises a portrait and the 3D reconstruction scene;
preferably, in this embodiment, the semantic deep neural network for human body semantic segmentation is divided into two modules, which are an encode module and a decode module respectively. Fig. 3 is a schematic structural diagram of a semantic deep neural network according to an embodiment of the present application, and as shown in fig. 3, an encode module includes Block1-4, which adopts a modified version of mobilenetV3 as a backbone.
Optionally, in order to reduce the amount of computation, before performing convolution computation on the input photo by the encode module, the RGB input photo with the size of 1 × 3 × 512 is first changed into the photo with the size of 1 × 48 × 128 by the spacedapth operation, wherein the spacedapth operation also changes the spatial resolution into the number of channels, so that the amount of computation of the network can be reduced and the size of the network re-input photo is ensured to be large enough to achieve the best segmentation effect of the pixel-level photo.
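The sizes quoted above are consistent with a space-to-depth factor of 4 on a square 512 × 512 input: 3 × 4² = 48 channels and 512 / 4 = 128 pixels per side. A minimal PyTorch illustration of the operation follows; the square input size and the factor of 4 are inferred from those numbers, not stated explicitly in the patent.

```python
import torch

# Space-to-depth: trade spatial resolution for channels before the encoder.
space_to_depth = torch.nn.PixelUnshuffle(downscale_factor=4)

x = torch.randn(1, 3, 512, 512)   # RGB input photo
y = space_to_depth(x)
print(y.shape)                    # torch.Size([1, 48, 128, 128])
```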
Further, as shown in fig. 3, the size-reduced image is fed into Block1 of the encoder module for convolution. Block1 uses a 1×1 convolution with 48 input channels and 40 output channels and no downsampling, followed by BatchNorm and an h-swish activation function. The features then pass through Block2, Block3 and Block4 of the MobileNetV3 structure, producing feature maps of sizes 64 × 64, 32 × 32 and 16 × 16 respectively.
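A minimal PyTorch sketch of the Block1 stem just described (whether the convolution carries a bias is an assumption):

```python
import torch.nn as nn

# Block1: 1x1 convolution, 48 -> 40 channels, no downsampling,
# followed by BatchNorm and an h-swish activation.
block1 = nn.Sequential(
    nn.Conv2d(48, 40, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(40),
    nn.Hardswish(),
)
```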
After the outputs of each Block layer have been obtained, they are processed by the decoder module, which outputs a mask reference. Specifically, as shown in fig. 3, the output of Block4 enters decode4 for computation, is concatenated (concat) with the output of Block3 and then processed by decode3, and so on until the computation corresponding to Block2 is finished, after which the result enters the prototype network branch. Preferably, in this embodiment the decoder upsamples the output of each Block layer through an FPN to obtain feature map information, so that multi-scale information can be fully exploited. It should be noted that the semantic deep neural network has two different branches: a weight branch that outputs the parameters of the convolution, and a prototype branch that provides the mask reference.
Finally, the mask reference from the prototype branch is convolved using the convolution parameters output by the weight branch to obtain the final semantic segmentation mask. The weight branch is computed directly from the branch output of Block4: specifically, the output of Block4 is passed through global average pooling and a fully connected layer with an output dimension of 91, yielding 91 convolution parameters. As shown in fig. 3, the prototype branch first applies a convolution with kernel size 3 to the output of decode2, reducing the number of channels from 24 to 6; the 91 convolution parameters output by the weight branch are then arranged into the parameters of three 1×1 convolutions over the 6-channel prototype feature map (six 1×6 kernels plus 6 biases per 6-channel layer), and after these three 1×1 convolutions a feature map with a single channel is output, which is the final semantic mask.
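The 91 parameters are consistent with the weights and biases of three 1×1 convolutions applied to the 6-channel prototype map (6→6, 6→6 and 6→1 channels give 42 + 42 + 7 = 91 parameters). The PyTorch sketch below illustrates such a dynamic mask head; the exact parameter ordering, the intermediate activations and the final sigmoid are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(prototypes, params):
    """Apply three dynamically generated 1x1 convolutions to the prototypes.

    prototypes: (1, 6, H, W) feature map from the prototype branch.
    params:     (91,) vector from the weight branch, split here into the
                weights/biases of 1x1 convs with channels 6->6, 6->6, 6->1.
    """
    w1, b1, w2, b2, w3, b3 = torch.split(params, [36, 6, 36, 6, 6, 1])
    x = F.relu(F.conv2d(prototypes, w1.view(6, 6, 1, 1), b1))
    x = F.relu(F.conv2d(x, w2.view(6, 6, 1, 1), b2))
    x = F.conv2d(x, w3.view(1, 6, 1, 1), b3)
    return torch.sigmoid(x)       # (1, 1, H, W) portrait semantic mask

mask = dynamic_mask_head(torch.randn(1, 6, 128, 128), torch.randn(91))
```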
It should be noted that, in this embodiment, the semantic deep neural network is trained with the loss function shown in equation (1) below:
Loss = 0.1 × DiceLoss + 0.8 × BinaryFocalLoss + 0.1 × JaccardLoss    (1)
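A minimal PyTorch sketch of this weighted combination is given below; the focal parameter gamma = 2 and the smoothing constants are assumptions, since the patent only specifies the three weights.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, gamma=2.0, eps=1e-6):
    """0.1 * Dice + 0.8 * binary focal + 0.1 * Jaccard, as in equation (1).

    pred:   (N, 1, H, W) predicted mask probabilities in [0, 1].
    target: (N, 1, H, W) binary ground-truth masks.
    """
    p, t = pred.flatten(1), target.flatten(1)
    inter = (p * t).sum(dim=1)
    dice = 1 - (2 * inter + eps) / (p.sum(1) + t.sum(1) + eps)
    jaccard = 1 - (inter + eps) / (p.sum(1) + t.sum(1) - inter + eps)
    bce = F.binary_cross_entropy(p, t, reduction="none")
    pt = torch.where(t > 0.5, p, 1 - p)        # probability of the true class
    focal = ((1 - pt) ** gamma * bce).mean(dim=1)
    return (0.1 * dice + 0.8 * focal + 0.1 * jaccard).mean()
```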
through the segmentation of the human body semantics, a human image semantic mask value is obtained, so that the position of a foot pixel point of the human image is obtained, and the depth value of the foot pixel point of the human image is the depth of the human in the virtual space scene;
step S203, projecting the pixels of the foot part of the portrait to a 3D space through the pose and the internal reference of a camera, wherein the intersection point of the projected ray and the 3D model is the 3D model point where the human body is located, averaging the depth of the 3D model point to obtain the depth value of the portrait, and comparing the depth value of the virtual object and the depth value of the portrait in the 3D reconstructed scene to obtain the relative shielding relation between the portrait and the virtual object;
in the embodiment, because a map is built on line in a large-space scene, after a current shot picture is positioned, camera pose and camera internal parameters are obtained through positioning, pixels at the foot part of the picture are projected into a 3D space, because the depth is not known, each pixel point can project a ray, the intersection point of the ray and the 3D model is the 3D model point where the human body is located, the depth of the points is the depth of the human body, and more accurate depth values of the human body can be obtained through averaging the depths of the points.
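The patent does not name a ray-casting implementation; as one possible sketch, trimesh can be used to intersect the foot-pixel rays with the reconstructed model. The helper name and the choice of library are illustrative.

```python
import numpy as np
import trimesh

def portrait_depth(foot_pixels, K, R, t, mesh):
    """Average the camera-frame depth of the 3D model points hit by the rays
    through the portrait's foot pixels.

    foot_pixels: (N, 2) pixel coordinates of the feet.
    K:           (3, 3) camera intrinsics from localization.
    R, t:        world-to-camera rotation (3, 3) and translation (3,).
    mesh:        trimesh.Trimesh of the reconstructed scene (ground plane included).
    """
    cam_center = -R.T @ t                                 # camera centre, world frame
    uv1 = np.hstack([foot_pixels, np.ones((len(foot_pixels), 1))])
    rays_cam = (np.linalg.inv(K) @ uv1.T).T               # viewing rays, camera frame
    rays_world = (R.T @ rays_cam.T).T
    rays_world /= np.linalg.norm(rays_world, axis=1, keepdims=True)
    origins = np.repeat(cam_center[None, :], len(rays_world), axis=0)
    hits, _, _ = mesh.ray.intersects_location(
        ray_origins=origins, ray_directions=rays_world)   # 3D model points hit
    if len(hits) == 0:
        return None
    depths = (R @ hits.T + t[:, None])[2]                 # z in the camera frame
    return float(depths.mean())
```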
In a large-space augmented reality experience the content can be diverse, with different virtual content at different depths. To resolve the occlusion relationship between the virtual content and the person, the depth values of the portrait and of the virtual content must be estimated so as to determine whether the virtual content is in front of or behind the portrait: an object in front of the portrait occludes it, while an object behind the portrait is occluded by it and becomes background. Specifically, once an accurate portrait depth value has been obtained through the above process, the relative occlusion relationship between the portrait and a virtual object in the virtual space scene is obtained by comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value;
step S204, carrying out corrosion expansion and Gaussian filtering processing on the portrait semantic mask, and fusing the portrait with the determined relative shielding relation and the virtual object through guide filtering to obtain a fused photo;
in this embodiment, since the semantic mask is a binary image, for example, the portrait may be set to 1, and the background is set to 0, under the condition that the semantic mask estimation is not very accurate, if the mask value is directly used to fuse the portrait and the background, there is a very obvious split feeling, so that morphological operations such as erosion and expansion need to be performed on the mask, then gaussian filtering processing is performed, and finally, guided filtering is adopted to fuse the portrait and the virtual content, which have determined a relative occlusion relationship, so as to achieve an effect of natural edge transition, and finally, a fine and beautiful synthetic picture is obtained.
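As a rough OpenCV sketch of this refinement-and-fusion step: the kernel size, blur radius and guided-filter parameters are illustrative, cv2.ximgproc requires the opencv-contrib package, and the per-pixel occlusion handling is simplified to a single front/behind flag.

```python
import cv2
import numpy as np

def fuse_portrait(photo_bgr, rendered_bgr, mask, person_in_front):
    """Refine the portrait mask and blend the photo with the AR rendering
    according to the resolved occlusion relationship.

    photo_bgr:       camera photo containing the person (uint8, HxWx3).
    rendered_bgr:    the AR frame, i.e. virtual content already composited
                     over the photo without occlusion handling (uint8, HxWx3).
    mask:            binary portrait semantic mask (uint8, HxW, 0 or 255).
    person_in_front: True if the portrait depth is smaller than the virtual
                     object depth, i.e. the person occludes the virtual object.
    """
    kernel = np.ones((5, 5), np.uint8)
    m = cv2.dilate(cv2.erode(mask, kernel), kernel)   # erosion then dilation
    m = cv2.GaussianBlur(m, (11, 11), 0)
    # Guided filtering with the photo as guide keeps the soft mask edges
    # aligned with the real portrait contour.
    alpha = cv2.ximgproc.guidedFilter(guide=photo_bgr, src=m, radius=8, eps=1e-2)
    alpha = alpha.astype(np.float32)[..., None] / 255.0
    if not person_in_front:
        # The virtual object is closer than the person, so it simply covers
        # the portrait and the portrait becomes background.
        alpha[:] = 0.0
    fused = alpha * photo_bgr + (1.0 - alpha) * rendered_bgr
    return fused.astype(np.uint8)
```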
Through steps S201 to S204, in this embodiment of the present application the spatial structure of the scene is first reconstructed in 3D in advance, yielding the 3D structure and map of the space. Then, after localization based on the image information and other auxiliary information in the 3D reconstructed map, and after the position and pose of the portrait in the current space have been obtained by photographing, human semantic segmentation is performed on the portrait photo containing the 3D reconstructed scene through the semantic deep neural network to obtain the portrait semantic mask and the pixel positions of the portrait's feet. The foot pixels are then projected into 3D space using the camera pose and intrinsic parameters; the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, the depths of these points are averaged to obtain the portrait depth value, and the virtual object depth value in the 3D reconstructed scene is compared with the portrait depth value to obtain their relative occlusion relationship. Finally, the photo is fused, achieving an augmented-reality portrait effect in a large-space scene.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
This embodiment also provides an AR portrait photographing system based on 3D positioning information, which is used to implement the foregoing embodiments and preferred implementations; what has already been described will not be repeated. As used below, the terms "module," "unit," "sub-unit," and the like may denote a combination of software and/or hardware that realizes a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of the AR portrait photographing system based on 3D positioning information according to an embodiment of the present application. As shown in fig. 4, the system includes a positioning module 41, a semantic segmentation module 42, a depth estimation module 43, and a photo fusion module 44:
The positioning module 41 is configured to acquire a 3D reconstructed map and perform localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters. The semantic segmentation module 42 is configured to acquire a portrait photo containing the 3D reconstructed scene and perform human semantic segmentation on it through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, the photo containing both the portrait and the 3D reconstructed scene. The depth estimation module 43 is configured to project the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, where the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, to average the depths of these 3D model points to obtain the portrait depth value, and to obtain the relative occlusion relationship between the portrait and the virtual object by comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value. The photo fusion module 44 is configured to apply erosion, dilation and Gaussian filtering to the portrait semantic mask, and to fuse the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
Through this system, in this embodiment of the present application the positioning module 41 first reconstructs the spatial structure of the scene in 3D in advance, yielding the 3D structure and map of the space. After localization based on the image information and other auxiliary information in the 3D reconstructed map, and after the position and pose of the portrait in the current space have been obtained by photographing, the semantic segmentation module 42 performs human semantic segmentation on the portrait photo containing the 3D reconstructed scene through the semantic deep neural network to obtain the portrait semantic mask and the pixel positions of the portrait's feet. The depth estimation module 43 then projects the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters; the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, the depths of these points are averaged to obtain the portrait depth value, and the virtual object depth value in the 3D reconstructed scene is compared with the portrait depth value to obtain their relative occlusion relationship. Finally, the photo fusion module 44 fuses the photo to achieve an augmented-reality portrait effect in the large-space scene, thereby solving the problem that the portrait is occluded by virtual content when photographing people in a virtual scene and improving the user experience.
In some embodiments, the system further includes a reconstruction module. Fig. 5 is another structural block diagram of the AR portrait photographing system based on 3D positioning information according to an embodiment of the present application; as shown in fig. 5, the system includes a reconstruction module 51, a positioning module 41, a semantic segmentation module 42, a depth estimation module 43, and a photo fusion module 44. In this embodiment, photos are first taken with a camera in the area the user will experience; the photos cover every viewing angle of the buildings in the experience area, and adjacent photos have a sufficiently large cross view, specifically with an overlapping area of no less than 50%. The spatial structure of the experience area is then reconstructed in 3D with COLMAP to obtain the 3D reconstructed map, and the feature points and descriptors of each captured photo, together with the 3D keypoint coordinates corresponding to the 2D feature points in the 3D reconstructed map, are retained. Preferably, SIFT feature points are used in this embodiment. In addition, so that depth estimation can later be performed for virtual content created in the virtual scene, the ground outside the buildings can be completed directly with a plane. As noted above, COLMAP is open-source software for three-dimensional reconstruction from images; because it integrates both SfM and MVS, it can directly carry out image matching, sparse reconstruction, dense reconstruction and mesh reconstruction, and it provides a graphical visualization interface, which makes it convenient to use.
It should be noted that, for specific examples in the other embodiments of the present application, reference may be made to the examples described in the embodiments and optional implementations of the above AR portrait photographing method; details are not repeated here.
Note that each of the above modules may be a functional module or a program module, and may be implemented by software or by hardware. For modules implemented in hardware, the modules may be located in the same processor, or may be distributed among different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In addition, in combination with the AR portrait photographing method based on 3D positioning information in the above embodiments, an embodiment of the present application may provide a storage medium for implementation. A computer program is stored on the storage medium; when executed by a processor, the computer program implements any of the AR portrait photographing methods based on 3D positioning information in the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for photographing an AR portrait based on 3D positioning information. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In an embodiment, an electronic device is provided, which may be a server. Fig. 6 is a schematic diagram of the internal structure of the electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor provides computing and control capabilities, the network interface is used to communicate with an external terminal through a network connection, and the internal memory provides an environment for the operating system and for running the computer program; the computer program is executed by the processor to implement the AR portrait photographing method based on 3D positioning information, and the database is used to store data.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An AR portrait photographing method based on 3D positioning information, characterized by comprising the following steps:
acquiring a 3D reconstructed map, and performing localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters;
acquiring a portrait photo containing the 3D reconstructed scene, and performing human semantic segmentation on the portrait photo through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, wherein the photo contains both the portrait and the 3D reconstructed scene;
projecting the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, wherein the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, averaging the depths of these 3D model points to obtain the portrait depth value, and comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value to obtain the relative occlusion relationship between the portrait and the virtual object;
and applying erosion, dilation and Gaussian filtering to the portrait semantic mask, and fusing the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
2. The method according to claim 1, wherein before acquiring the 3D reconstructed map, the method comprises:
capturing photos of the experience area with the camera and performing 3D map reconstruction of the experience area, wherein the photos cover every viewing angle of the buildings in the experience area and adjacent photos have cross views whose overlapping area is no less than 50%.
3. The method according to claim 2, wherein performing 3D map reconstruction of the experience area comprises:
performing 3D reconstruction of the spatial structure of the experience area with COLMAP, completing the ground outside the buildings directly with a plane, and retaining the 2D feature points and descriptors of the captured photos together with the 3D keypoint coordinates corresponding to the 2D feature points in the 3D reconstructed map.
4. The method according to claim 1, wherein performing localization using the image information and auxiliary information in the 3D reconstructed map to obtain the camera pose comprises:
selecting the image with the largest number of feature point matches by matching the 2D feature points, and performing a PnP computation on the 2D feature points and the corresponding 3D keypoints to obtain the camera pose.
5. The method according to claim 1, wherein performing human semantic segmentation on the portrait photo containing the 3D reconstructed scene through the semantic deep neural network to obtain the portrait semantic mask comprises:
performing convolution calculations on the photo through an encoder module and outputting convolution parameters;
processing the output of each Block layer through a decoder module and outputting a mask reference;
and performing a convolution operation on the mask reference with the convolution parameters to obtain the portrait semantic mask.
6. An AR portrait photographing system based on 3D positioning information, characterized in that the system comprises:
a positioning module, configured to acquire a 3D reconstructed map and perform localization using image information and auxiliary information in the 3D reconstructed map to obtain the camera pose and intrinsic parameters;
a semantic segmentation module, configured to acquire a portrait photo containing the 3D reconstructed scene and perform human semantic segmentation on the portrait photo through a semantic deep neural network to obtain a portrait semantic mask and the pixel positions of the portrait's feet, wherein the photo contains both the portrait and the 3D reconstructed scene;
a depth estimation module, configured to project the foot pixels of the portrait into 3D space using the camera pose and intrinsic parameters, wherein the intersection points of the projected rays with the 3D model are the 3D model points where the person stands, to average the depths of these 3D model points to obtain the portrait depth value, and to obtain the relative occlusion relationship between the portrait and the virtual object by comparing the virtual object depth value in the 3D reconstructed scene with the portrait depth value;
and a photo fusion module, configured to apply erosion, dilation and Gaussian filtering to the portrait semantic mask, and to fuse the portrait and the virtual object, whose relative occlusion relationship has been determined, through guided filtering to obtain a fused photo.
7. The system according to claim 6, further comprising a reconstruction module, wherein before the 3D reconstructed map is acquired,
the reconstruction module is configured to capture photos of the experience area with the camera and perform 3D map reconstruction of the experience area, wherein the photos cover every viewing angle of the buildings in the experience area and adjacent photos have cross views whose overlapping area is no less than 50%.
8. The system according to claim 7, wherein
the reconstruction module is further configured to perform 3D reconstruction of the spatial structure of the experience area with COLMAP, complete the ground outside the buildings directly with a plane, and retain the 2D feature points and descriptors of the captured photos together with the coordinates of the 3D keypoints corresponding to the 2D feature points in the 3D reconstructed map.
9. The system according to claim 6, wherein
the positioning module is further configured to select the image with the largest number of feature point matches by matching the 2D feature points, and to perform a PnP computation on the 2D feature points and the corresponding 3D keypoints to obtain the camera pose.
10. The system according to claim 6, wherein
the semantic segmentation module is further configured to perform convolution calculations on the photo through an encoder module and output convolution parameters;
process the output of each Block layer through a decoder module and output a mask reference;
and perform a convolution operation on the mask reference with the convolution parameters to obtain the portrait semantic mask.
CN202110793636.6A 2021-07-12 2021-07-12 AR portrait photographing method and system based on 3D positioning information Pending CN113643357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793636.6A CN113643357A (en) 2021-07-12 2021-07-12 AR portrait photographing method and system based on 3D positioning information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110793636.6A CN113643357A (en) 2021-07-12 2021-07-12 AR portrait photographing method and system based on 3D positioning information

Publications (1)

Publication Number Publication Date
CN113643357A true CN113643357A (en) 2021-11-12

Family

ID=78417292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793636.6A Pending CN113643357A (en) 2021-07-12 2021-07-12 AR portrait photographing method and system based on 3D positioning information

Country Status (1)

Country Link
CN (1) CN113643357A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103489214A (en) * 2013-09-10 2014-01-01 北京邮电大学 Virtual reality occlusion handling method, based on virtual model pretreatment, in augmented reality system
US20170243352A1 (en) * 2016-02-18 2017-08-24 Intel Corporation 3-dimensional scene analysis for augmented reality operations
CN108510573A (en) * 2018-04-03 2018-09-07 南京大学 A method of the multiple views human face three-dimensional model based on deep learning is rebuild
CN108711144A (en) * 2018-05-16 2018-10-26 上海白泽网络科技有限公司 augmented reality method and device
CN108717709A (en) * 2018-05-24 2018-10-30 东北大学 Image processing system and image processing method
CN109725733A (en) * 2019-01-25 2019-05-07 中国人民解放军国防科技大学 Human-computer interaction method and human-computer interaction equipment based on augmented reality
CN111815755A (en) * 2019-04-12 2020-10-23 Oppo广东移动通信有限公司 Method and device for determining shielded area of virtual object and terminal equipment
WO2021073292A1 (en) * 2019-10-15 2021-04-22 北京市商汤科技开发有限公司 Ar scene image processing method and apparatus, and electronic device and storage medium
CN111583390A (en) * 2020-04-28 2020-08-25 西安交通大学 Three-dimensional semantic graph reconstruction method of convolutional neural network based on deep semantic fusion
CN112365604A (en) * 2020-11-05 2021-02-12 深圳市中科先见医疗科技有限公司 AR equipment depth of field information application method based on semantic segmentation and SLAM

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUMI爱笑笑: "Detailed explanation of the YOLACT model", Retrieved from the Internet <URL:《https://blog.csdn.net/weixin_39326879/article/details/106931707》> *
WATERSINK: "Instance segmentation with YOLACT (You Only Look At Coefficients)", Retrieved from the Internet <URL:《https://blog.csdn.net/qq_14845119/article/details/89792952》> *
云从天上来: "Instance segmentation models YOLACT and YOLACT++", Retrieved from the Internet <URL:《https://blog.csdn.net/xiao_ling_yun/article/details/109782753》> *
卞贤掌; 费海平; 李世强: "Augmented reality image registration technology based on semantic segmentation", Electronic Technology & Software Engineering (电子技术与软件工程), no. 23 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114125310A (en) * 2022-01-26 2022-03-01 荣耀终端有限公司 Photographing method, terminal device and cloud server

Similar Documents

Publication Publication Date Title
CN108875523B (en) Human body joint point detection method, device, system and storage medium
CN107993216B (en) Image fusion method and equipment, storage medium and terminal thereof
CN111787242B (en) Method and apparatus for virtual fitting
US11488293B1 (en) Method for processing images and electronic device
JP7387202B2 (en) 3D face model generation method, apparatus, computer device and computer program
KR101885090B1 (en) Image processing apparatus, apparatus and method for lighting processing
CN110378947B (en) 3D model reconstruction method and device and electronic equipment
CN112308977B (en) Video processing method, video processing device, and storage medium
CN112102198A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN115984447B (en) Image rendering method, device, equipment and medium
CN111275824A (en) Surface reconstruction for interactive augmented reality
CN108921798B (en) Image processing method and device and electronic equipment
CN115496863B (en) Short video generation method and system for scene interaction of movie and television intelligent creation
CN107564085B (en) Image warping processing method and device, computing equipment and computer storage medium
CN113822798B (en) Method and device for training generation countermeasure network, electronic equipment and storage medium
CN115861515A (en) Three-dimensional face reconstruction method, computer program product and electronic device
CN113643357A (en) AR portrait photographing method and system based on 3D positioning information
WO2021109764A1 (en) Image or video generation method and apparatus, computing device and computer-readable medium
Zhang et al. Reconstruction of refocusing and all-in-focus images based on forward simulation model of plenoptic camera
CN109040612B (en) Image processing method, device and equipment of target object and storage medium
CN116977539A (en) Image processing method, apparatus, computer device, storage medium, and program product
US11127218B2 (en) Method and apparatus for creating augmented reality content
CN113610864A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113409231A (en) AR portrait photographing method and system based on deep learning
CN112258435A (en) Image processing method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination