WO2022095757A1 - Image rendering method and apparatus - Google Patents

Image rendering method and apparatus

Info

Publication number
WO2022095757A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
rendered
virtual
rendering
virtual background
Prior art date
Application number
PCT/CN2021/126469
Other languages
English (en)
French (fr)
Inventor
裴仁静
陈艳花
许松岑
刘宏马
梅意城
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022095757A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • G06T15/205Image-based rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/04
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the technical field of image processing, and in particular, to an image rendering method and apparatus.
  • Virtual background replacement refers to replacing the background of the original image with another different background.
  • the foreground and the original background are generally first segmented on the original image to obtain the foreground and the original background; then the foreground and the virtual background are rendered and imaged by image fusion to obtain the image after the background replacement.
  • the embodiments of the present application provide an image rendering method and apparatus, which can improve the virtual background replacement effect.
  • an embodiment of the present application provides an image rendering method.
  • the method includes: a terminal device first obtains an image to be processed, where the image to be processed may refer to an original video image in a video stream; the terminal device then detects whether a target object in the image to be processed performs a preset action and/or whether the target object is occluded by a first object.
  • If it is detected that the target object in the image to be processed performs a preset action and/or is occluded by the first object, the terminal device correspondingly determines the second virtual object to be rendered corresponding to the preset action and/or the first virtual object to be rendered corresponding to the first object, renders the determined virtual object(s) into the target virtual background image to obtain the virtual background image to be rendered, and then performs image rendering according to the virtual background image to be rendered and the subject image to obtain a rendered image.
  • If the target object neither performs the preset action nor is occluded, or if the target object is not occluded and performs the preset action but the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, the target virtual background image itself is taken as the background image to be rendered, and image rendering is then performed according to the virtual background image to be rendered and the subject image to obtain a rendered image.
  • That is, when the target object is occluded, the corresponding virtual object is rendered into the virtual background image to reproduce the occlusion, and when the target object performs an interactive action, the corresponding interactive virtual object is rendered. This avoids, as much as possible, unreasonable or unrealistic phenomena in the rendered image and improves its rationality and authenticity, thereby improving the virtual background replacement effect.
  • For example, when the target object is a person, that is, when the subject image is a portrait subject image, and the person performs the preset action of "sit down", the second virtual object to be rendered corresponding to this preset action is determined to be a "chair".
  • If the depth value of the interactive object (for example, a stool) is greater than the depth value of the person, the "chair" is rendered at the corresponding position in the target virtual background image to obtain the virtual background image to be rendered.
  • Image rendering is then performed on the virtual background image to be rendered and the subject image to obtain a rendered image, where the rendered image is the image after the virtual background is replaced.
  • Since the portrait subject in the image to be processed is sitting on the interactive object, the portrait subject in the rendered image will also sit on the "chair", so that the image after background replacement is consistent with the original video image in terms of interaction, avoiding unreasonable phenomena such as a person sitting in mid-air in the image after background replacement.
  • In one case, when it is detected that the target object performs a preset action and is also occluded by the first object, and the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than the depth value of the target object, both the first virtual object to be rendered and the second virtual object to be rendered are rendered into the target virtual background image to obtain the virtual background image to be rendered.
  • In another case, when the target object performs the preset action and is not occluded, and the depth value of the interactive object is greater than the depth value of the target object, the second virtual object to be rendered is rendered into the target virtual background image to obtain the virtual background image to be rendered.
  • In yet another case, when the target object does not perform the preset action but is occluded by the first object, or when the target object performs the preset action but the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object while the target object is occluded by the first object, the first virtual object to be rendered is rendered into the target virtual background image to obtain the virtual background image to be rendered.
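  • For illustration only, the following Python sketch shows one possible way to organize the case analysis above when building the virtual background image to be rendered; the function, parameter and attribute names (detection, renderer, and so on) are hypothetical and not part of the described method.

        def build_background_to_render(detection, target_bg, renderer):
            """Decide which to-be-rendered virtual objects are drawn into the
            target virtual background image, following the cases described above."""
            bg = target_bg.copy()
            rendered_any = False

            # Target object occluded by a first object: render the first
            # to-be-rendered virtual object so the occlusion is reproduced.
            if detection.occluded:
                bg = renderer.render(bg, detection.first_object_to_render)
                rendered_any = True

            # Target object performs a preset action and the interactive object
            # lies behind it (greater depth value): render the second
            # to-be-rendered virtual object into the background.
            if (detection.preset_action is not None
                    and detection.interactive_depth > detection.target_depth):
                bg = renderer.render(bg, detection.second_object_to_render)
                rendered_any = True

            # Otherwise the target virtual background image itself is used.
            return bg if rendered_any else target_bg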
  • the above process of performing image rendering according to the to-be-rendered virtual background image and the subject image to obtain the rendered image may include: performing bottom-layer consistent rendering based on the to-be-rendered virtual background image to obtain the bottom-layer consistently rendered subject image; and performing image rendering according to the bottom-layer consistently rendered subject image and the to-be-rendered virtual background image to obtain the rendered image.
  • the above-mentioned process of performing bottom-layer consistent rendering based on the virtual background image to be rendered to obtain the bottom-layer consistently rendered subject image may include: inputting the low-frequency image of the virtual background image to be rendered and the image to be processed into the pre-trained first style transfer model to obtain the bottom-layer consistently rendered image to be processed output by the first style transfer model; and extracting the subject image from the bottom-layer consistently rendered image to be processed to obtain the bottom-layer consistently rendered subject image.
  • Since the low-frequency image is used as the input of the model, low-level features such as texture in the image can be ignored.
  • In this way, the virtual background replacement effect can be further improved.
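  • As a hedged illustration, a low-frequency image can be obtained with a simple low-pass filter; a Gaussian blur via OpenCV is assumed here, and the kernel size is an arbitrary example, since the embodiments do not prescribe a specific filter.

        import cv2

        def low_frequency_image(img_bgr, ksize=21, sigma=0):
            # Suppress texture and other high-frequency detail so that only
            # low-level statistics (overall color and brightness layout) remain.
            return cv2.GaussianBlur(img_bgr, (ksize, ksize), sigma)

        # Hypothetical usage: feed the blurred virtual background together with
        # the image to be processed into the first style transfer model.
        # low_bg = low_frequency_image(virtual_background)
        # rendered = first_style_transfer_model(low_bg, image_to_be_processed)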
  • the training process of the style transfer model may include: obtaining a training data set, where the training data set includes a first virtual background image and an original video image;
  • the original video image is input to the pre-built second style transfer model, and the forward training result output by the second style transfer model is obtained; the first loss value between the forward training result and the low-frequency image of the first virtual background image is calculated;
  • the forward training result and the low-frequency image of the original video image are input to the forward-trained second style transfer model, and the reverse training result output by the forward-trained second style transfer model is obtained; the second loss value between the reverse training result and the original video image is calculated, and the third loss value between the reverse training result and the low-frequency image of the original video image is calculated; the network parameters of the second style transfer model are adjusted according to the first loss value, and the network parameters of the forward-trained second style transfer model are adjusted according to the second loss value and the third loss value; the training process is repeated, and the trained first style transfer model is obtained when the predetermined conditions are met.
  • the predetermined condition may be used to characterize that the loss values of the model tend to be stable, which may specifically mean that the first loss value, the second loss value and the third loss value all stabilize around certain values.
  • When the predetermined conditions are met, the model training is considered complete, and the trained first style transfer model is obtained.
  • the first loss value is the loss value of the LAB space, that is, after the forward training result and the low-frequency image of the first virtual background image are transferred to the LAB space, the variance difference and mean difference of the two images in the LAB domain are calculated to constrain the global similarity in color, brightness, saturation, etc.
  • the third loss value is also the loss value of the LAB space.
  • the model obtained through the above training process can not only ensure consistency in style, but also consistency in image content.
  • Using the first style transfer model for bottom-level consistent rendering can further improve the background replacement effect.
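  • As an illustrative sketch of the LAB-space losses described above, the per-channel mean and variance differences of two images can be computed after converting them to LAB; PyTorch with kornia's differentiable RGB-to-LAB conversion is assumed here, and the equal weighting of the mean and variance terms is an assumption.

        import torch
        import kornia.color as KC

        def lab_statistics_loss(pred_rgb, ref_rgb):
            # pred_rgb, ref_rgb: (N, 3, H, W) tensors with values in [0, 1].
            pred_lab = KC.rgb_to_lab(pred_rgb)
            ref_lab = KC.rgb_to_lab(ref_rgb)
            # Differences of per-channel means and variances in the LAB domain
            # constrain global similarity in color, brightness and saturation.
            mean_diff = (pred_lab.mean(dim=(2, 3)) - ref_lab.mean(dim=(2, 3))).abs()
            var_diff = (pred_lab.var(dim=(2, 3)) - ref_lab.var(dim=(2, 3))).abs()
            return mean_diff.mean() + var_diff.mean()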
  • the above-mentioned process of performing bottom-layer consistent rendering based on the virtual background image to be rendered to obtain the bottom-layer consistently rendered subject image may also include: transferring the to-be-rendered virtual background image to the LAB color space to obtain a first image; calculating the first standard deviation and the first mean of the L channel, A channel and B channel of the first image respectively; transferring the subject image to the LAB color space to obtain a second image; correcting the second image according to the first standard deviation and the first mean to obtain a third image, where the difference between the second standard deviation of the L channel, A channel and B channel of the third image and the first standard deviation is within a first preset threshold interval, and the difference between the second mean and the first mean is within a second preset threshold interval; and transferring the third image from the LAB color space to the RGB color space to obtain a fourth image, which is the bottom-layer consistently rendered subject image.
  • Each channel has its corresponding standard deviation and mean, and the standard deviation and mean of each channel in the second image are corrected according to the standard deviation and mean of the corresponding channel in the first image.
  • For example, the first standard deviation and first mean of the L channel of the first image are A1 and B1, those of the A channel are A2 and B2, and those of the B channel are A3 and B3.
  • Both the first preset threshold interval and the second preset threshold interval may be set according to actual needs; for example, both may be set to 0.
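  • A minimal NumPy/OpenCV sketch of the per-channel LAB statistics matching described above, assuming both preset threshold intervals are set to 0 so the subject image's statistics are matched exactly to those of the to-be-rendered virtual background image; variable names follow the first/second/third/fourth image terminology.

        import cv2
        import numpy as np

        def lab_statistics_transfer(subject_bgr, background_bgr):
            # First image: to-be-rendered virtual background transferred to LAB.
            first = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
            # Second image: subject image transferred to LAB.
            second = cv2.cvtColor(subject_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

            first_std, first_mean = first.std(axis=(0, 1)), first.mean(axis=(0, 1))
            second_std, second_mean = second.std(axis=(0, 1)), second.mean(axis=(0, 1))

            # Third image: correct each channel of the second image towards the
            # statistics of the corresponding channel of the first image.
            third = (second - second_mean) / (second_std + 1e-6) * first_std + first_mean
            third = np.clip(third, 0, 255).astype(np.uint8)

            # Fourth image: transfer back to the RGB (BGR in OpenCV) color space.
            return cv2.cvtColor(third, cv2.COLOR_LAB2BGR)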
  • the above-mentioned process of obtaining the rendered image from the bottom-layer consistently rendered subject image and the to-be-rendered virtual background image may include: inputting the bottom-layer consistently rendered subject image into a pre-trained first STN network to obtain a first transformation matrix output by the first STN network; inputting the to-be-rendered virtual background image into a pre-trained second STN network to obtain a second transformation matrix output by the second STN network; performing an image affine transformation on the bottom-layer consistently rendered subject image using the first transformation matrix to obtain a first transformed image; performing an image affine transformation on the to-be-rendered virtual background image using the second transformation matrix to obtain a second transformed image; and performing image synthesis on the first transformed image and the second transformed image to obtain the rendered image.
  • In this way, the subject image is rendered at a more reasonable position through the STN networks, which further improves the rationality and authenticity of the background-replaced image.
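  • A hedged PyTorch sketch of the STN-based placement: two pre-trained spatial transformer networks predict affine matrices, the subject and background are warped accordingly, and the warped images are composited; the network modules and the alpha-mask compositing are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def warp_affine(image, theta):
            # image: (N, C, H, W); theta: (N, 2, 3) affine transformation matrix.
            grid = F.affine_grid(theta, image.size(), align_corners=False)
            return F.grid_sample(image, grid, align_corners=False)

        def compose(subject_rgba, background_rgb, stn_subject, stn_background):
            # stn_subject / stn_background: assumed pre-trained STN modules that
            # output (N, 2, 3) transformation matrices.
            theta_s = stn_subject(subject_rgba)            # first transformation matrix
            theta_b = stn_background(background_rgb)       # second transformation matrix
            warped_subject = warp_affine(subject_rgba, theta_s)        # first transformed image
            warped_background = warp_affine(background_rgb, theta_b)   # second transformed image
            alpha = warped_subject[:, 3:4]                 # subject mask carried as alpha channel
            return alpha * warped_subject[:, :3] + (1 - alpha) * warped_background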
  • the above process of rendering the second virtual object to be rendered into the target virtual background image may include: determining, according to the semantic segmentation result of the image to be processed, a first position of the interactive object corresponding to the preset action in the image to be processed; taking a second position corresponding to the first position in the target virtual background image as the rendering position of the second virtual object to be rendered; determining that the depth value of the interactive object in the image to be processed is greater than the depth value of the target object; and rendering the second virtual object to be rendered at the rendering position in the target virtual background image.
  • the above process of detecting that the target object in the image to be processed is occluded by the first object may include: determining, according to the semantic segmentation result of the image to be processed, the category to which each pixel in the image to be processed belongs; obtaining depth information of the image to be processed; and, when it is determined according to the depth information that there is a target pixel with a depth value smaller than that of the target object within the preset range of the target object, taking the category corresponding to the target pixel as the first object and determining that the target object is occluded by the first object.
  • the preset range of the target object may refer to a preset range around the pixels corresponding to the target object.
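  • A hedged sketch of the occlusion check described above: within a preset pixel neighborhood of the target object, look for pixels whose depth value is smaller than the target object's and take their semantic category as the first object; the neighborhood radius and the use of the median depth are illustrative assumptions.

        import cv2
        import numpy as np

        def detect_occlusion(seg_labels, depth, target_mask, target_label, radius=15):
            # seg_labels: (H, W) per-pixel category ids from semantic segmentation.
            # depth: (H, W) depth map of the image to be processed.
            # target_mask: boolean mask of the target object (e.g. the person).
            target_depth = np.median(depth[target_mask])  # representative depth of the target

            # Preset range: pixels within `radius` of the target object's pixels.
            kernel = np.ones((2 * radius + 1, 2 * radius + 1), np.uint8)
            neighborhood = cv2.dilate(target_mask.astype(np.uint8), kernel).astype(bool)
            neighborhood &= ~target_mask

            # Target pixels: inside the preset range, closer to the camera than the
            # target object, and belonging to a different category.
            candidates = neighborhood & (depth < target_depth) & (seg_labels != target_label)
            if not candidates.any():
                return False, None

            labels, counts = np.unique(seg_labels[candidates], return_counts=True)
            return True, int(labels[np.argmax(counts)])  # dominant category = first object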
  • the method may further include: determining the virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image, and displaying the virtual background image to be recommended.
  • the virtual background image is recommended to the user according to the similarity between the original background image and the virtual background image, so that the virtual background image used for background replacement can be more related to the original background image.
  • the above process of determining the virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image may include: performing foreground-background segmentation on the image to be processed to obtain the original background image; performing multi-category semantic segmentation on the original background image to obtain a second semantic segmentation result; performing multi-category semantic segmentation on each second virtual background image to obtain a third semantic segmentation result of each second virtual background image; calculating the IOU values of the original background image and each second virtual background image according to the second semantic segmentation result and the third semantic segmentation result; respectively calculating a first color distribution curve of the original background image and a second color distribution curve of each second virtual background image; calculating the curve similarity between the first color distribution curve and each second color distribution curve; and determining the virtual background image to be recommended from the second virtual background images according to the curve similarity and the IOU values.
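  • For illustration, the two similarity measures used above can be sketched as follows: a per-class IOU between the multi-category semantic segmentation results, and a color distribution similarity; approximating the color distribution curve with per-channel histograms compared by correlation is an assumption, since the embodiments only refer to a color distribution curve.

        import cv2
        import numpy as np

        def segmentation_iou(seg_a, seg_b, num_classes):
            # Mean per-class IOU between two same-sized semantic segmentation maps,
            # characterizing similarity in structure and content.
            ious = []
            for c in range(num_classes):
                a, b = seg_a == c, seg_b == c
                union = np.logical_or(a, b).sum()
                if union:
                    ious.append(np.logical_and(a, b).sum() / union)
            return float(np.mean(ious)) if ious else 0.0

        def color_curve_similarity(img_a, img_b, bins=32):
            # Color distribution similarity via per-channel histogram correlation.
            sims = []
            for ch in range(3):
                ha = cv2.calcHist([img_a], [ch], None, [bins], [0, 256])
                hb = cv2.calcHist([img_b], [ch], None, [bins], [0, 256])
                cv2.normalize(ha, ha)
                cv2.normalize(hb, hb)
                sims.append(cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL))
            return float(np.mean(sims))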
  • the method may further include: if the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, rendering the first virtual object to be rendered into the target virtual background image to obtain the virtual background image to be rendered, or taking the target virtual background image as the virtual background image to be rendered.
  • the method may further include: rendering the second virtual object to be rendered into the rendered image according to the rendering position of the second virtual object to be rendered, so as to obtain the output image.
  • That is, in this case the second virtual object to be rendered is not rendered into the target virtual background image.
  • Instead, the second virtual object to be rendered and the rendered image are fused to obtain the final output image.
  • In this case, the rendering order between the second virtual object to be rendered and the subject is: render the subject first, and then render the second virtual object to be rendered.
  • Otherwise, when the second virtual object to be rendered is rendered into the target virtual background image, the rendering order is: render the second virtual object to be rendered first, and then render the subject.
  • an embodiment of the present application provides an image rendering apparatus, and the apparatus may include:
  • the image acquisition module is used to acquire the image to be processed; the detection module is used to detect that the target object in the image to be processed performs a preset action and/or that the target object is occluded by a first object; the virtual object determination module is used to determine the second virtual object to be rendered corresponding to the preset action and/or the first virtual object to be rendered corresponding to the first object; the virtual object rendering module is used to render the first virtual object to be rendered and/or the second virtual object to be rendered into the target virtual background image to obtain the virtual background image to be rendered, where the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than the depth value of the target object; and the rendering module is configured to perform image rendering according to the virtual background image to be rendered and the subject image to obtain a rendered image, where the subject image is extracted from the image to be processed and includes the image of the target object.
  • the rendering module is specifically configured to: perform bottom-layer consistent rendering based on the virtual background image to be rendered to obtain the bottom-layer consistently rendered subject image; and perform image rendering according to the bottom-layer consistently rendered subject image and the virtual background image to be rendered to obtain the rendered image.
  • the rendering module is specifically configured to: input the low-frequency image of the virtual background image to be rendered and the image to be processed into the pre-trained first style transfer model to obtain the bottom-layer consistently rendered image to be processed output by the first style transfer model; and extract the subject image from the bottom-layer consistently rendered image to be processed to obtain the bottom-layer consistently rendered subject image.
  • a model training module is further included, for: obtaining a training data set, where the training data set includes a first virtual background image and an original video image;
  • the original video image is input to the pre-built second style transfer model, and the forward training result output by the second style transfer model is obtained; the first loss value between the forward training result and the low-frequency image of the first virtual background image is calculated;
  • the forward training result and the low-frequency image of the original video image are input to the forward-trained second style transfer model, and the reverse training result output by the forward-trained second style transfer model is obtained; the second loss value between the reverse training result and the original video image is calculated, and the third loss value between the reverse training result and the low-frequency image of the original video image is calculated; the network parameters of the second style transfer model are adjusted according to the first loss value, and the network parameters of the forward-trained second style transfer model are adjusted according to the second loss value and the third loss value; the training process is repeated, and the trained first style transfer model is obtained when the predetermined conditions are met.
  • the rendering module is specifically configured to: transfer the virtual background image to be rendered to the LAB color space to obtain a first image; calculate the first standard deviation and first mean of the L channel, A channel and B channel of the first image respectively; transfer the subject image to the LAB color space to obtain a second image; correct the second image according to the first standard deviation and the first mean to obtain a third image, where the difference between the second standard deviation of the L channel, A channel and B channel of the third image and the first standard deviation is within the first preset threshold interval, and the difference between the second mean and the first mean is within the second preset threshold interval; and transfer the third image from the LAB color space to the RGB color space to obtain a fourth image, which is the bottom-layer consistently rendered subject image.
  • the rendering module is specifically configured to: input the bottom-layer consistently rendered subject image into the pre-trained first STN network to obtain the first transformation matrix output by the first STN network; input the virtual background image to be rendered into the pre-trained second STN network to obtain the second transformation matrix output by the second STN network; perform an image affine transformation on the bottom-layer consistently rendered subject image using the first transformation matrix to obtain a first transformed image; perform an image affine transformation on the virtual background image to be rendered using the second transformation matrix to obtain a second transformed image; and perform image synthesis on the first transformed image and the second transformed image to obtain the rendered image.
  • the virtual object rendering module is specifically configured to: determine, according to the semantic segmentation result of the image to be processed, the first position of the interactive object corresponding to the preset action in the image to be processed; take the second position corresponding to the first position in the target virtual background image as the rendering position of the second virtual object to be rendered; determine that the depth value of the interactive object in the image to be processed is greater than the depth value of the target object; and render the second virtual object to be rendered at the rendering position in the target virtual background image.
  • the detection module is specifically configured to: determine the category of each pixel in the image to be processed according to the semantic segmentation result of the image to be processed; obtain depth information of the image to be processed; and, when it is determined according to the depth information that there is a target pixel with a depth value smaller than the depth value of the target object within the preset range of the target object, take the category corresponding to the target pixel as the first object and determine that the target object is occluded by the first object.
  • a background recommendation module is further included, configured to: determine the virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image, and display the virtual background image to be recommended.
  • the background recommendation module is specifically configured to: perform foreground-background segmentation on the image to be processed to obtain the original background image of the image to be processed; perform multi-category semantic segmentation on the original background image to obtain a second semantic segmentation result; perform multi-category semantic segmentation on each second virtual background image to obtain a third semantic segmentation result of each second virtual background image; calculate the IOU values between the original background image and each second virtual background image according to the second semantic segmentation result and the third semantic segmentation result; respectively calculate the first color distribution curve of the original background image and the second color distribution curve of each second virtual background image; calculate the curve similarity between the first color distribution curve and each second color distribution curve; and determine the virtual background image to be recommended from the second virtual background images according to the curve similarity and the IOU values.
  • the virtual object rendering module is further configured to: if the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, render the first virtual object to be rendered into the target virtual background image to obtain the virtual background image to be rendered, or use the target virtual background image as the virtual background image to be rendered; and render the second virtual object to be rendered into the rendered image to obtain an output image.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • When the processor executes the computer program, the method according to any one of the above-mentioned first aspects is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the method according to any one of the foregoing first aspects.
  • an embodiment of the present application provides a chip system, where the chip system includes a processor, the processor is coupled to a memory, and the processor executes a computer program stored in the memory, so as to implement the method according to any one of the above-mentioned first aspects.
  • the chip system may be a single chip, or a chip module composed of multiple chips.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the method described in any one of the above-mentioned first aspects.
  • FIG. 1 is a schematic structural diagram of a terminal device 100 provided by an embodiment of the present application.
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 provided by an embodiment of the present application
  • FIG. 3 is a schematic block diagram of the flow of an image rendering solution provided by an embodiment of the present application.
  • FIG. 4 is a schematic interface diagram of a video call scenario provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a virtual background image provided by an embodiment of the present application.
  • FIG. 6 is a schematic interface diagram of a background recommendation process provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the effect of consistent rendering based on structural semantics provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the effect of consistent rendering based on an interaction relationship provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a model-based underlying consistent rendering provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a training process of a style transfer model provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the effect of bottom-layer consistent rendering provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a bottom-layer consistent rendering process based on an image processing algorithm provided by an embodiment of the present application
  • FIG. 13 is a schematic diagram of the effect of bottom-layer consistent rendering based on an image processing algorithm provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a consistent rendering process based on a positional relationship provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an STN network training process provided by an embodiment of the present application.
  • FIG. 16 is a schematic block diagram of the effect of consistent rendering based on a positional relationship provided by an embodiment of the present application.
  • FIG. 17 is a schematic flowchart of an image rendering process provided by an embodiment of the present application.
  • FIG. 18 is another schematic flowchart of an image rendering process provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram of a video call scene of a large-screen device provided by an embodiment of the present application.
  • FIG. 21 is a schematic diagram of a change in a virtual background replacement image provided by an embodiment of the present application.
  • FIG. 22 is a schematic interface diagram of a virtual background replacement process in a shooting scene provided by an embodiment of the present application.
  • FIG. 23 is a schematic interface diagram of a virtual background replacement process in a video recording scene provided by an embodiment of the present application.
  • the correlation between the original background and the virtual background is usually not considered during image rendering, which leads to poor rationality and authenticity of the image after background replacement, and unreasonable or even unrealistic phenomena occur.
  • the original background refers to the background in the original image.
  • the consistency of hue and brightness is handled in a single way, which makes the hue and brightness of the foreground inconsistent with those of the virtual background, thereby affecting the fusion effect of the foreground and the virtual background.
  • the embodiments of the present application provide an image rendering solution, which considers the correlation between the original background and the virtual background during image rendering, so as to improve the rationality and authenticity of the virtual background replacement, thereby improving the virtual background replacement effect.
  • the embodiments of the present application also render the foreground and the virtual background consistently in terms of hue, brightness, contrast and color, so that the hue, brightness, contrast and color of the foreground are consistent with those of the virtual background, improving the fusion effect of the foreground and the virtual background.
  • the image rendering solution provided by the embodiments of the present application may be applied to a terminal device, and the terminal device may be a portable terminal device such as a mobile phone, a tablet computer, a notebook computer or a wearable device, an augmented reality (AR) device or a virtual reality (VR) device, or a terminal device such as an in-vehicle device, a netbook or a smart screen.
  • the embodiments of the present application do not limit any specific types of terminal devices.
  • FIG. 1 shows a schematic structural diagram of a terminal device 100 .
  • the terminal device 100 may include a processor 110, a memory 120, a camera 130, a display screen 140, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the terminal device 100 .
  • the terminal device 100 may include more or less components than those shown in the drawings, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP) ), controller, video codec, digital signal processor (DSP), and/or neural-network processing unit (NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • the processor 110 may include one or more interfaces.
  • the interface may include a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, and the like.
  • the MIPI interface can be used to connect the processor 110 with the display screen 140, the camera 130 and other peripheral devices. MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 communicates with the camera 130 through a CSI interface, so as to implement the shooting function of the terminal device 100 .
  • the processor 110 communicates with the display screen 140 through the DSI interface to implement the display function of the terminal device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 130, the display screen 140, and the like. It can be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 . In other embodiments of the present application, the terminal device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the terminal device 100 implements a display function through a GPU, a display screen 140, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 140 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 140 is used to display images, videos, and the like.
  • the display screen 140 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLed, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device 100 may include 1 or N display screens 140 , where N is a positive integer greater than 1.
  • the terminal device 100 may implement a shooting function through an ISP, a camera 130, a video codec, a GPU, a display screen 140, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 130 .
  • When the shutter is opened, light is transmitted to the camera photosensitive element through the lens, where the optical signal is converted into an electrical signal; the camera photosensitive element transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 130 .
  • Camera 130 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the terminal device 100 may include 1 or N cameras 130 , where N is a positive integer greater than 1.
  • a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, and the like.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs.
  • the terminal device 100 can play or record videos in various encoding formats, for example, moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • Memory 120 may be used to store computer-executable program code, which includes instructions.
  • the memory 120 may include a stored program area and a stored data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal device 100 and the like.
  • the memory 120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing the instructions stored in the memory 120 and/or the instructions stored in a memory provided in the processor.
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, and smooth call.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the workflow of the software and hardware of the terminal device 100 is exemplarily described below with reference to the shooting scene.
  • When a touch operation is received, the corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • the application framework layer obtains the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the example in which the touch operation is a click operation and the corresponding control is the camera application icon, the camera application calls the interface of the application framework layer to start the camera application, which in turn starts the camera driver by calling the kernel layer.
  • Camera 130 captures still images or video.
  • the electrical signal is converted into a digital image signal through the ISP.
  • the digital image signal is then input to the DSP for processing, and the digital image signal is converted into standard RGB, YUV and other image signals.
  • image display is performed through GPU, display screen and application processor.
  • the CPU executes the image rendering solution provided by the embodiments of the present application to render the image and obtain the image with the background replaced, and the background-replaced image is displayed through the GPU, the display screen, the application processor, and the like.
  • the following exemplarily introduces the image rendering solution provided by the embodiments of the present application according to the terminal device 100 shown in FIG. 1 and FIG. 2 .
  • the image rendering solution may include the following steps:
  • Step S301 the terminal device 100 acquires a video stream.
  • the terminal device 100 can collect video streams in real time through its own integrated camera, which can be a front camera or a rear camera; it can receive video streams from other terminal devices; it can also read pre-recorded video streams stored locally.
  • This embodiment of the present application does not limit the manner in which the terminal device 100 acquires the video stream.
  • the terminal device 100 is specifically a mobile phone, and the main interface of the mobile phone includes a phone 41 and a Changlian call 42, as well as smart life, settings, the application store and other applications.
  • the user can initiate a video call through the phone 41 and the Changlian call 42 .
  • the following is an introduction to the user initiating a video call through the phone 41 .
  • In the Changlian call interface 44, contacts that can be dialed are displayed.
  • a magic pen 47 is included in the call interface 46 .
  • the front camera is called to collect image data, and at this time, the mobile phone can obtain the video stream. Then through DSP, GPU, application processor and display screen, etc., the collected image signal is displayed on the display screen.
  • Step S302 the terminal device 100 extracts the main image from the video image of the video stream.
  • the video stream includes multiple frames of images that are consecutive in time sequence, and the terminal device 100 can extract the main image from each frame of image, or can extract the main image at intervals of a preset number of frames, which can be determined according to actual application requirements and is not limited here.
  • the manner in which the terminal device 100 extracts the subject image from the video image may be arbitrary.
  • subject images can be extracted from video images by means of semantic segmentation and instance segmentation.
  • semantic segmentation can give each pixel in the image a semantic label, and the semantic label is used to identify the category to which the pixel belongs. For example, setting the label of a person as a red pixel means that the pixels of the person in the image are marked as red.
  • the categories contained in the original video images, as well as the location and proportion of each category can be known through the semantic segmentation results.
  • the original video image contains two categories of people and trees, as well as the positions and proportions of the two categories in the original video image.
  • Instance segmentation can distinguish different individuals of the same type on the basis of semantic segmentation. That is, instance segmentation can distinguish different individuals belonging to the same category, for example, to distinguish two individuals belonging to the same person as person 1 and person 2.
  • a video image can usually be divided into a foreground and a background, and the above-mentioned subject image can usually be understood as the foreground in the video image.
  • In the present application, the video image in the video stream from which the main image is extracted is referred to as the original video image.
  • Step S303 the terminal device 100 determines the target virtual background image.
  • the target virtual background image refers to the virtual background used to replace the original background.
  • the terminal device 100 determines the target virtual background image, it can perform an image rendering process according to the target virtual background image and the above-mentioned main image, so as to obtain an output image after background replacement.
  • the target virtual background image may be selected by the user.
  • the terminal device 100 may display a plurality of virtual background images in the virtual background library on the display screen according to a preset display sequence for the user to select. After the user's selection, the terminal device 100 takes the virtual background image selected by the user as the target virtual background image in response to the user's selection operation.
  • the mobile phone displays an answering interface 48 .
  • a magic pen 47 is also included in the answering interface 48 .
  • the mobile phone collects the video stream through the camera, and displays the video stream on the calling interface 46 and the answering interface 48 in real time. At this time, the calling interface 46 and the answering interface 48 display video images without virtual background replacement. The real background of the video image is not shown in FIG. 4 .
  • the user can perform virtual background replacement through the magic pen 47 .
  • the mobile phone displays a window 49 on the answering interface 48 in response to the user's click operation on the magic pen 47, and the window 49 includes two options of skin care and scene.
  • the mobile phone displays virtual background images 411 to 414 on the window 49 in response to the user's click operation.
  • the virtual background images 411 to 414 may be as shown in FIG. 5 .
  • the mobile phone can further perform image rendering according to the target virtual background image and the main image to obtain the output image after background replacement, and display the output image on the display screen to obtain the background replacement interface 410 .
  • image rendering according to the target virtual background image and the subject image will be described in detail below.
  • In this example, the mobile phone does not recommend virtual backgrounds.
  • the mobile phone sequentially displays the virtual background images in the virtual background library in the window 49 according to the default display order.
  • the default display order is virtual background image 411 , virtual background image 412 , virtual background image 413 , and virtual background image 414 .
  • In addition to performing virtual background replacement through the magic pen 47 in the answering interface 48, the virtual background can also be replaced through the magic pen 47 in the calling interface 46.
  • The two processes are the same and will not be repeated here.
  • the terminal device 100 may first determine the virtual background image to be recommended according to the similarity between each virtual background image and the original background image; and then display the virtual background image to be recommended on the display screen, to recommend virtual backgrounds to users.
  • the virtual background recommendation process may be as follows:
  • the terminal device 100 performs multi-class semantic segmentation on the original background image to obtain a first segmentation result; performs multi-class semantic segmentation on each virtual background image in the virtual background gallery to obtain a second segmentation result for each virtual background image.
  • the above-mentioned original background image refers to the background image in the original video image, which may be obtained by dividing the foreground and background of the original video image.
  • an intersection-over-union (IOU) ratio between the first segmentation result and each of the second segmentation results is calculated.
  • the IOU value can be used to characterize the similarity in structure and content between the original background image and the virtual background image.
  • the terminal device 100 performs color distribution curve statistics on the original background image to obtain a first color distribution curve; and performs color distribution curve statistics on each virtual background image in the virtual background library to obtain a second color distribution curve. Then calculate the curve similarity between the first color distribution curve and each of the second color distribution curves.
  • the curve similarity can be used to characterize the color similarity between the original background image and each virtual background image.
  • the terminal device 100 determines the virtual background image to be recommended according to the IOU value and the curve similarity.
  • The first weight of the IOU value and the second weight of the curve similarity are preset. For each virtual background image, the IOU value of the virtual background image is multiplied by the first weight to obtain a first product, and the curve similarity is multiplied by the second weight to obtain a second product; the first product and the second product are added to obtain the recommendation score of the virtual background image. Finally, the virtual background images are sorted by recommendation score, and the first K virtual background images are selected as the virtual background images to be recommended, where K is a positive integer.
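  • A minimal sketch of this recommendation scoring follows, assuming mean per-class IOU between the two segmentation results, cosine similarity between concatenated per-channel histograms as the curve similarity, and equal weights; none of these specific choices are stated in the source.

```python
import numpy as np

def mean_iou(seg_a, seg_b, num_classes):
    """Mean per-class IOU between two label maps of the same size."""
    ious = []
    for c in range(num_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def color_curve(img, bins=32):
    """Per-channel color histograms, concatenated and normalized to sum to 1."""
    hists = [np.histogram(img[..., ch], bins=bins, range=(0, 255))[0]
             for ch in range(img.shape[-1])]
    h = np.concatenate(hists).astype(np.float64)
    return h / (h.sum() + 1e-8)

def curve_similarity(c1, c2):
    """Cosine similarity between two color-distribution curves."""
    return float(np.dot(c1, c2) /
                 (np.linalg.norm(c1) * np.linalg.norm(c2) + 1e-8))

def recommend(original_bg, original_seg, gallery,
              w_iou=0.5, w_curve=0.5, k=3, num_classes=21):
    """gallery: list of (virtual_bg_image, virtual_bg_segmentation).
    Returns the indices of the top-K virtual background images."""
    ref_curve = color_curve(original_bg)
    scores = []
    for vb_img, vb_seg in gallery:
        iou = mean_iou(original_seg, vb_seg, num_classes)
        sim = curve_similarity(ref_curve, color_curve(vb_img))
        scores.append(w_iou * iou + w_curve * sim)   # first product + second product
    order = np.argsort(scores)[::-1]                 # sort by recommendation score
    return order[:k].tolist()
```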
  • the virtual background library includes 4 virtual background images in FIG. 5 .
  • The mobile phone calculates the recommendation score of each virtual background image in FIG. 5; in order from high to low, the scores correspond to the virtual background image 412, the virtual background image 413, the virtual background image 414, and the virtual background image 411.
  • Assume K = 3, that is, the first three virtual background images are selected as the virtual background images to be recommended.
  • After the mobile phone determines the virtual background images to be recommended, the user can invoke them through the magic pen 47.
  • the mobile phone displays a window 49 on the answering interface in response to the user's operation on the magic pen 47 .
  • the mobile phone displays the virtual background image 412 , the virtual background image 413 , the virtual background image 414 , and the virtual background image 411 in the window 49 in order from left to right according to the recommendation score.
  • A box is added to the display positions of the virtual background image 412, the virtual background image 413, and the virtual background image 414 to remind the user, that is, to recommend the virtual background images 412 to 414.
  • After the user clicks the recommended virtual background image 412 in the window 49, the mobile phone, in response to the click operation, performs image rendering according to the virtual background image 412 and the subject image, obtains the output image after background replacement, and displays the output image on the display screen, obtaining the interface 415 after background replacement.
  • the prompting manner of the virtual background recommendation is arbitrary, and is not limited to the form of adding boxes shown in FIG. 6 .
  • different colored boxes may be added to the virtual background image according to the recommendation score, or arrow indicators may be added to each virtual background image to be recommended, so as to guide the user to select the virtual background image to be recommended.
  • The virtual background recommendation may also be performed in the form of a pop-up window, in which the virtual background images to be recommended are displayed.
  • the virtual background recommendation process is not limited to the above-mentioned process.
  • a virtual background image with a recommendation score higher than a preset score threshold may be selected as the virtual background image to be recommended.
  • The virtual background image selected by the user may differ greatly from the original background image in color, structure, and content, which makes the virtual background replacement effect poor.
  • Recommending a virtual background to the user makes the target virtual background image more related to the original background image, that is, the target virtual background image is more similar to the original background image in color, structure, and content, which further improves the virtual background replacement effect and the user experience.
  • The terminal device 100 may also actively determine the target virtual background image without user participation. In this case, the terminal device 100 may randomly select a virtual background image as the target virtual background image, or may calculate the recommendation score of each virtual background image in the virtual background library through the virtual background recommendation process described above and select the virtual background image with the highest recommendation score as the target virtual background image.
  • Step S304 the terminal device 100 performs image rendering according to the subject image and the target virtual background image to obtain an output image.
  • the output image is the image with the background replaced.
  • the image after the background replacement is regarded as the image after image rendering.
  • After determining the target virtual background image, the terminal device 100 performs consistent rendering according to the subject image and the target virtual background image, obtains the image after background replacement, and displays the image on the display screen.
  • the consistent rendering in this embodiment of the present application refers to an image rendering process in which the correlation between the original video image and the target virtual background image is considered.
  • the correlation between the original video image and the target virtual background image may include at least one of the following: content, occlusion, position, and interaction.
  • the content refers to the image content, that is, the image content of the background-replaced image is consistent with the image content of the original video image.
  • the background-replaced image can be made consistent with the original video image in image content through the underlying consistent rendering process.
  • the underlying features may refer to features such as color, hue, brightness, and contrast of the image, and rendering for underlying features may be referred to as underlying consistent rendering.
  • Through underlying consistent rendering, the color, hue, brightness, and contrast of the subject image can be made consistent with those of the target virtual background image, so that the subject and the background in the background-replaced image are consistent in color, hue, brightness, and contrast; in addition, the image content after background replacement is kept consistent with that of the original video image, which further improves the virtual background replacement effect.
  • Occlusion refers to whether there is an occlusion relationship between the subject and an object in the original background image. When there is an occlusion relationship, the corresponding occlusion relationship should also be reflected in the image after background replacement, so that the image after background replacement is consistent with the original video image in terms of the subject's occlusion relationship.
  • a consistent rendering process based on structural semantics can be used to make the subject occlusion relationship between the image after background replacement and the original video image consistent. Whether the subject is occluded can be determined through high-level consistent rendering based on structural semantics, and when the subject is occluded, the virtual object that needs to be rendered can be determined.
  • the position refers to the positional relationship between the subject and the object in the background-replaced image, which can be embodied as the rendering position of the subject in the virtual background image.
  • the pose of the subject image in the target virtual background image can be determined through the consistent rendering based on the positional relationship, and the subject image can be rendered at a reasonable position in the virtual background image.
  • Interaction refers to the interaction relationship between the subject and the object in the original video image, which can be embodied in whether the subject performs a preset interaction action. If the preset interaction action is performed, there is an interaction relationship between the subject and the corresponding object. Correspondingly, the interactive relationship between the subject and the corresponding object should also be reflected in the image after the background replacement, so that the image after the background replacement and the original video image are consistent in the interactive relationship.
  • Corresponding virtual objects can be added or removed when the subject performs a preset action. For example, when the subject is a person and the person in the original video image performs a "sit down" action, a virtual object such as a chair or a stool is rendered at a reasonable position in the image to increase the realism after background replacement and improve the virtual background replacement effect.
  • the terminal device 100 may perform consistent rendering based on at least one of underlying features, structural semantics, location relationships, and interaction relationships.
  • The terminal device 100 can perform at least one of the following consistent rendering processes in the process of obtaining the background-replaced image based on the target virtual background image and the subject image: underlying consistent rendering, consistent rendering based on structural semantics, consistent rendering based on positional relationships, and consistent rendering based on interaction relationships.
  • the terminal device 100 may perform multi-category semantic segmentation on the original video image to obtain multi-category semantic segmentation results.
  • the above-mentioned original video image generally refers to the original video image in the above-mentioned step S302.
  • the terminal device 100 determines whether the subject is blocked according to the depth information of each category in the original video image.
  • The depth information can represent the front-to-back relationship of each category, and according to the depth information it can be determined whether there are other objects in front of the subject. If it is determined that another object occludes the subject from the front, the category of that object can be further determined according to the multi-class semantic segmentation result.
  • For example, assume the subject in the original video image is a person. According to the depth information of the original video image, it is determined that there is another object in front of the person; if the object in front of the person is a table, it can be determined that the person is occluded by the table, that is, there is an occlusion relationship between the person and the table.
  • the category of each pixel in the original video image can be known through the multi-class semantic segmentation result, and the distance (ie depth value) of each pixel from the camera can be known through the depth map of the original video image.
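  • A minimal sketch of this occlusion check follows, assuming an aligned label map and depth map in which smaller depth values are closer to the camera; using median depths and a fixed margin is an assumption, and a practical implementation would also check that the candidate object actually overlaps or borders the subject region.

```python
import numpy as np

def find_occluder(seg, depth, subject_id, margin=0.1):
    """Return the class label of an object that lies in front of the subject
    (closer to the camera), or None if the subject is not occluded.
    seg: HxW integer label map, depth: HxW depth map (smaller = closer)."""
    subject_mask = seg == subject_id
    if not subject_mask.any():
        return None
    subject_depth = np.median(depth[subject_mask])
    best_label, best_depth = None, None
    for label in np.unique(seg):
        if label == subject_id:
            continue
        d = np.median(depth[seg == label])
        # treat a category as an occluder only if it is clearly closer than the subject
        if d < subject_depth * (1.0 - margin):
            if best_depth is None or d < best_depth:
                best_label, best_depth = label, d
    return best_label
```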
  • the terminal device 100 determines the first virtual object to be rendered after determining that the subject in the original video image is occluded and the object type that occludes the subject.
  • the subject in the original video image is a person, and there is a table in front of the person.
  • the terminal device 100 may recommend a virtual object similar to or related to the table, and the virtual object may be, for example, various styles of tables pre-registered in the virtual object library.
  • the user can select one or more of the recommended virtual objects as the first virtual objects to be rendered according to their own needs.
  • the terminal device 100 may not perform the virtual object recommendation process, but directly select a related virtual object from the virtual object library as the first virtual object to be rendered.
  • The first virtual object to be rendered can be used as the foreground and fused with the target virtual background image as the background to obtain a new virtual background image.
  • the first virtual object to be rendered is a table, and the table is used as the foreground to perform image rendering with the target virtual background image to obtain a new virtual background image.
  • the center position of the occluding object in the original background image may be used as the initial rendering position of the first virtual object to be rendered.
  • the user can also independently determine the rendering position of the first virtual object to be rendered. For example, the user can adjust the rendering position of the first virtual object to be rendered by dragging.
  • If the terminal device 100 determines, according to the multi-class semantic segmentation result and the depth information, that the subject in the original video image is not occluded by other objects, it does not need to determine the first virtual object to be rendered, and does not need to render a virtual object in the target virtual background image.
  • the image 74 is the target virtual background image. It is assumed that the virtual object library includes objects 75 to 77, and the object 75 is a virtual object recommended by the system. Specifically, the user may be prompted by adding a box.
  • The image 74 and the object 75 can be rendered first to obtain the virtual background image to be rendered, and then the virtual background image to be rendered and the subject 72 are synthesized to obtain the image after background replacement, such as the image 78 in (c) of FIG. 7.
  • It can be seen from the image 78 in (c) of FIG. 7 that there is an occlusion relationship between the subject 72 and the table 73 in the image 71, and there is an occlusion relationship between the subject 72 and the object 75 in the image 78, so that the image after background replacement is consistent with the original video image in terms of the occlusion relationship.
  • the terminal device 100 may perform motion recognition based on consecutive multiple frames of original video images. If it is recognized that the subject in the original video image has performed a preset action, a second virtual object to be rendered associated with the preset action is determined.
  • the preset actions may be set according to actual applications, for example, the preset actions are "sit down” and "hold an object in hand".
  • the action recognition method may be any existing method, which is not limited here.
  • the terminal device 100 may select an object associated with the preset action from the virtual object library to determine the second virtual object to be rendered.
  • the association relationship between each preset action and the virtual object may be preset, and the virtual object is directly selected through the preset action subsequently.
  • the preset actions include “sit down” and “hold an object in hand", the virtual object corresponding to "sit down” is a chair or a stool, and the virtual object corresponding to "holding an object in hand” is a cup.
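  • As a sketch, the association between preset actions and virtual objects could be stored as a simple lookup table; the action keys and object names below are illustrative only and not taken from the source.

```python
# Hypothetical mapping from recognized preset actions to candidate virtual objects.
ACTION_TO_VIRTUAL_OBJECTS = {
    "sit_down": ["chair", "stool"],
    "hold_object_in_hand": ["cup"],
}

def select_second_virtual_object(action, virtual_object_library):
    """Pick the first candidate associated with the preset action that exists in
    the virtual object library; return None if there is no usable candidate."""
    for name in ACTION_TO_VIRTUAL_OBJECTS.get(action, []):
        if name in virtual_object_library:
            return name
    return None
```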
  • The user may also preset, in an initialization stage, the virtual objects to be used, and after recognizing a preset action, the terminal device 100 automatically selects the corresponding virtual object from the virtual objects set by the user. For example, before the video call starts, the user presets which virtual objects may need to be used during the video call; after the setting is completed, during the video call, the mobile phone selects the corresponding virtual object from the set virtual objects once it recognizes a preset action.
  • The terminal device 100 may also recommend virtual objects to the user and let the user select the desired one. However, it takes the user a certain amount of time to select a virtual object, which causes a lag in rendering the virtual object. Therefore, to ensure that the corresponding virtual object can be rendered at the corresponding position in time after the subject performs the preset action, the terminal device usually determines the virtual object to be rendered autonomously instead of having the user select it.
  • After the terminal device 100 determines the second virtual object to be rendered, it needs to further determine the rendering position of the second virtual object to be rendered. Specifically, the terminal device 100 performs multi-class semantic segmentation on the original video image, and determines the position of the interactive object in the original video image according to the multi-class semantic segmentation result.
  • the interactive object refers to the object corresponding to the preset action.
  • For example, the subject in the original video image is a person, and the person performs a "sit down" action, that is, the person sits on a chair; the chair is the above-mentioned interactive object.
  • the categories contained in the image and the location of each category can be known from the multi-category semantic segmentation results. Therefore, the location of the interactive objects in the original video image can be known from the multi-category semantic segmentation results of the original video image.
  • the rendering position of the second virtual object to be rendered in the virtual background image to be rendered is determined according to the position of the interactive object in the original video image.
  • For example, the interactive object is a chair, the pixel position of the chair in the original video image is a first position, and the position corresponding to the first position in the virtual background image to be rendered is used as the rendering position.
  • The terminal device 100 can also determine, according to the depth information of each category in the original video image, the front-to-back positional relationship between the interactive object and the subject in the original video image, that is, which of the interactive object and the subject is in front and which is behind, and set the rendering order between the second virtual object to be rendered and the subject according to this positional relationship. For example, if the subject is in front and the interactive object is behind, the rendering order is: render the interactive object first, and then render the subject.
  • When the rendering order is to render the interactive object first and then the subject, after the terminal device 100 determines the second virtual object to be rendered, it can use the second virtual object to be rendered as the foreground and fuse it with the target virtual background image to obtain a new virtual background image. If the preset action is not recognized, the second virtual object to be rendered is not rendered in the target virtual background image.
  • When the rendering order is to render the subject first and then the interactive object, after the terminal device 100 determines the second virtual object to be rendered, the subject image and the virtual background image to be rendered are fused first, and then the second virtual object to be rendered is rendered into the fused image to obtain the final output image.
  • Image 81 and image 82 are original video images, and their order in the original video stream is: image 81 first, image 82 later.
  • Image 83 and image 84 are output images obtained after background replacement using the existing virtual background replacement method, where image 83 corresponds to image 81 and image 84 corresponds to image 82.
  • Image 85 and image 86 are images obtained after using the above-mentioned consistent rendering based on the interaction relationship with the background replaced. Among them, the image 85 corresponds to the image 81, and the image 86 corresponds to the image 82.
  • the main character 87 in the image 81 is in a standing state; the main character 87 in the image 82 is in a sitting state, that is, the main character 87 goes from standing to sitting.
  • the low-level consistent rendering can include two different implementations.
  • the two different low-level consistent rendering methods are introduced below.
  • the terminal device 100 inputs the low-frequency image of the virtual background image to be rendered and the original video image into the pre-trained style transfer model, and the output of the style transfer model is the original video image after consistent rendering at the bottom layer.
  • Image style transfer refers to the use of algorithms to learn the style of a certain image, and then apply this style to another image, or transfer the style of one image to another image.
  • the style transfer model refers to a model for implementing image style transfer, that is, the style of the virtual background image to be rendered can be transferred to the original video image through the style transfer model.
  • the process may include the following steps:
  • Step S901 the terminal device 100 acquires the low-frequency image of the virtual background image to be rendered.
  • a low-frequency image of the virtual background image to be rendered may be generated.
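  • The source does not specify how the low-frequency image is generated; one common choice would be a Gaussian low-pass filter, sketched below (the kernel size and sigma are assumptions).

```python
import cv2

def low_frequency_image(vb_image, kernel_size=21, sigma=8.0):
    """Suppress texture and other high-frequency detail of the virtual background
    image to be rendered while keeping its overall color and brightness layout."""
    return cv2.GaussianBlur(vb_image, (kernel_size, kernel_size), sigma)
```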
  • Step S902 the terminal device 100 inputs the low-frequency image and the original video image of the virtual background image to be rendered into the pre-trained style transfer model, and obtains the bottom-layer consistent rendered original video image output by the style transfer model.
  • the training process of the style transfer model can be as follows:
  • the input of the style transfer model is the original video image and the low-frequency image of the virtual background image, and the output is the forward training result.
  • the virtual background image refers to the image used for background replacement in the training dataset.
  • The forward training result is converted to the color-opponent space (Lab color space, LAB), and the first variance and first mean of the forward training result in the LAB domain are calculated; the virtual background image is also converted to the LAB space, and the second variance and second mean of the virtual background image in the LAB domain are calculated.
  • the mean difference and variance difference of the forward training result and the virtual background image in the LAB domain are calculated respectively.
  • the mean difference and variance difference of the LAB domain are used as the loss value (Loss) between the model output and the input, that is, the Loss during forward training is the Loss of the LAB domain.
  • Loss in the LAB domain can constrain the similarity of color, brightness and saturation between the forward training results and the virtual background.
  • In this way, the color, brightness, saturation, and the like of the subject image or the original video image can be made consistent with those of the virtual background image.
  • the low-frequency images of the original video images and the forward training results are input into the style transfer model after forward training, and the output of the model is the reverse training results.
  • the Loss in the LAB domain between the reverse training result and the low-frequency image of the original video image is calculated.
  • The specific calculation process can be as follows: convert the reverse training result and the low-frequency image of the original video image to the LAB space; calculate the variance and mean of the reverse training result in the LAB domain, and the variance and mean of the low-frequency image of the original video image in the LAB domain; then calculate the mean difference and variance difference between the reverse training result and the low-frequency image of the original video image, and take these differences in the LAB domain as the Loss of the LAB domain.
  • In addition, the loss value between the reverse training result and the original video image is calculated. The Loss of the LAB domain and the loss value between the reverse training result and the original video image are weighted to obtain a total loss value, and the network parameters of the forward-trained style transfer model are then adjusted according to this loss value.
  • the content of the main image or the original video image can be consistent with the content of the virtual background image to be rendered.
  • One training iteration includes a forward training process and a reverse training process; that is, in the current iteration, a forward training process is performed first, and a reverse training process is then performed based on the results of the forward training. In the next iteration, a forward training process is again performed first, followed by a reverse training process. Training is iterated multiple times in this way to obtain the trained style transfer model.
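  • A compact sketch of one training iteration follows, assuming the model takes a (content, style) pair, that an L1 term is used for the loss between the reverse result and the original video image, and that kornia.color.rgb_to_lab (or any differentiable RGB-to-LAB conversion) is available; the weights and the exact model interface are not specified in the source.

```python
import torch

def lab_stats_loss(pred_rgb, ref_rgb, rgb_to_lab):
    """The 'Loss of the LAB domain': difference of channel-wise mean and variance
    between two images in the LAB color space."""
    p, r = rgb_to_lab(pred_rgb), rgb_to_lab(ref_rgb)
    mean_diff = (p.mean(dim=(2, 3)) - r.mean(dim=(2, 3))).abs().mean()
    var_diff = (p.var(dim=(2, 3)) - r.var(dim=(2, 3))).abs().mean()
    return mean_diff + var_diff

def train_iteration(model, optimizer, original, vb_image, vb_lowfreq, orig_lowfreq,
                    rgb_to_lab, w_lab=1.0, w_content=1.0):
    """One iteration: forward training constrained by the LAB loss against the
    virtual background image, then reverse training on the SAME model constrained
    by the LAB loss against the low-frequency original image plus a content term."""
    # forward training: transfer the virtual-background style onto the original image
    forward_out = model(original, vb_lowfreq)
    loss_fwd = lab_stats_loss(forward_out, vb_image, rgb_to_lab)
    optimizer.zero_grad()
    loss_fwd.backward()
    optimizer.step()

    # reverse training: based on the forward result and the forward-trained weights
    reverse_out = model(forward_out.detach(), orig_lowfreq)
    loss_rev = (w_lab * lab_stats_loss(reverse_out, orig_lowfreq, rgb_to_lab)
                + w_content * (reverse_out - original).abs().mean())
    optimizer.zero_grad()
    loss_rev.backward()
    optimizer.step()
    return loss_fwd.item(), loss_rev.item()
```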
  • the training process of the style transfer model may be performed on the terminal device 100, or may not be performed on the terminal device 100, but is loaded to the terminal device 100 after the training of other devices is completed.
  • In the prior art, the models used for forward training and reverse training are different and their weights are different, that is, reverse training is not based on the model and results obtained by forward training. As a result, the image content of the style-transferred image is inconsistent with the image content of the original image. In addition, in the prior art, different styles correspond to different style transfer models; for example, if there are three different background images, three different style transfer models are required to transfer the styles of the three background images to the corresponding images.
  • In contrast, in this application the forward training and reverse training models have the same weights, that is, reverse training is based on the forward-trained style transfer model and the forward training results, so the image content remains consistent and the rendering effect is better.
  • the image after style transfer refers to the image obtained by the underlying consistent rendering.
  • Forward training and reverse training in the prior art do not use the same model, with the result that the image content after style transfer is inconsistent with the content of the original video image; in this application, forward training and reverse training use the same model, so the image content after style transfer is consistent with the image content of the original video image.
  • the input of the prior art style transfer model is the virtual background image to be rendered, while the input of the model in the above-mentioned mode 1 is the original video image and the low frequency image of the virtual background image to be rendered.
  • different styles correspond to the same style transfer model.
  • For example, if there are three different background images, only one style transfer model is required, and the styles of the three background images can be transferred to the corresponding images respectively. That is to say, the style transfer model obtained by the model training method of Mode 1 above can transfer a variety of different styles to another image.
  • The model training process and the underlying consistent rendering process described above are all described by taking the original video image as an example.
  • the original video image may also be replaced with a main image, that is, the original video image in FIG. 9 and FIG. 10 may be replaced with a main image.
  • the main image is extracted from the original video image, and then the main image and the low-frequency image of the virtual background image to be rendered are input into the style transfer model for forward training.
  • the subject image and the low-frequency image of the virtual background image to be rendered are input into the style transfer model that has been trained, and the underlying consistent rendered subject image output by the style transfer model is obtained.
  • In Mode 1, the low-frequency image of the virtual background image to be rendered is used. Because the low-frequency image ignores underlying features such as texture, multi-style reverse training can be carried out; in addition, because forward training and reverse training use the same model, distortion is reduced.
  • In the prior art, a fixed color temperature and fixed lighting are generally used for image rendering, that is, virtual lighting with a fixed direction or a fixed color temperature is rendered for the foreground. This may cause the subject and the background in the background-replaced image to be inconsistent in color, color temperature, brightness, contrast, and the like, so the background replacement effect is poor.
  • In this embodiment, the subject and the background in the background-replaced image are consistent in color, color temperature, brightness, contrast, and the like, so the background replacement effect is better.
  • image 111 is the original video image
  • image 112 is the virtual background image
  • Image 113 is the background-replaced image obtained by using Mode 1 above, and image 114 is the background-replaced image obtained by using the existing method.
  • the background color and tone in the image 111 are mainly ocean blue, and the color of the clothes of the person in the image 111 is the color 1, for example, the color 1 is white.
  • the background color and tone in the image 112 are mainly sunset yellow.
  • the main image and the virtual background image can be made consistent in color, hue, contrast, and color temperature.
  • Specifically, the color of the clothes of the person (that is, the subject image) in the image 113 is color 2, for example, yellow; that is, the color, color temperature, and the like of the person in the image 113 are consistent with those of the background, so the consistency between the subject and the background is high.
  • the main image and the virtual background image are quite different in color, hue, contrast, and color temperature.
  • the specific performance is as follows: the color of the clothes of the person in the image 114 is color 1, that is, it is consistent with the main image in the original video image. In this way, the subject image and the background in the image 114 are quite different in color and color temperature, and the consistency between the subject and the background is poor.
  • The terminal device 100 converts the virtual background image to be rendered from the RGB color space to the LAB color space to obtain a first image in the LAB color space, and then calculates the first standard deviation (std) and the third mean (mean) of each of the L, A, and B channels of the first image; each channel has its own corresponding standard deviation and mean.
  • The terminal device 100 converts the subject image or the original video image to the LAB color space to obtain a second image in the LAB color space, and corrects the standard deviation and mean of the second image according to the first standard deviation and the third mean of the first image to obtain a third image.
  • Specifically, the standard deviations of the L, A, and B channels of the second image are respectively set to the first standard deviations of the corresponding channels in the first image, or are set so that the difference between each standard deviation and the first standard deviation of the corresponding channel in the first image falls within a preset threshold interval; similarly, the means of the L, A, and B channels of the second image are respectively set to the third means of the corresponding channels in the first image, or are set so that the difference between each mean and the third mean of the corresponding channel in the first image falls within a preset threshold interval.
  • In this way, the standard deviations of the L, A, and B channels of the third image are equal or close to the first standard deviations of the corresponding channels in the first image, and the means of the L, A, and B channels of the third image are equal or close to the third means of the corresponding channels in the first image.
  • the third image is converted from the LAB color space to the RGB color space to obtain the fourth image, which is the image after consistent rendering of the bottom layer.
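  • A minimal sketch of Mode 2 follows, setting the LAB-channel statistics of the second image equal to those of the first image (the threshold-interval variant is omitted); skimage is used only as one possible RGB/LAB conversion, with float RGB inputs in [0, 1].

```python
import numpy as np
from skimage import color

def underlying_consistent_render_mode2(subject_rgb, vb_rgb):
    """Match the per-channel LAB mean and standard deviation of the subject (or
    original video) image to those of the virtual background image to be rendered."""
    first = color.rgb2lab(vb_rgb)        # first image, in the LAB color space
    second = color.rgb2lab(subject_rgb)  # second image, in the LAB color space
    third = np.empty_like(second)
    for ch in range(3):
        s_mean, s_std = second[..., ch].mean(), second[..., ch].std()
        f_mean, f_std = first[..., ch].mean(), first[..., ch].std()
        # shift and scale so the channel's mean/std equal those of the first image
        third[..., ch] = (second[..., ch] - s_mean) / (s_std + 1e-8) * f_std + f_mean
    fourth = color.lab2rgb(third)        # fourth image: back to RGB
    return np.clip(fourth, 0.0, 1.0)
```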
  • image 131 is a virtual background image
  • image 132 is an original video image
  • Image 133 is the background-replaced image obtained by using Mode 2 above, and image 134 is the background-replaced image obtained by using the prior art.
  • The color, color temperature, brightness, and contrast of the person (that is, the subject image) 135 in the image 133 are consistent with those of the image 131, so the subject and the background are more consistent in color, brightness, contrast, color temperature, and the like, which in turn makes the background replacement effect better.
  • By contrast, the color, color temperature, brightness, and contrast of the person 135 in the image 134 are consistent with those of the image 132 and differ greatly from those of the image 131. As a result, the person and the background in the image 134 differ greatly in color, contrast, color temperature, and the like, that is, the consistency between the person and the background is poor, and the background replacement effect is poor.
  • the process of the consistent rendering based on the positional relationship can be as follows:
  • The subject image is input into a first STN (Spatial Transformer Network) to obtain a first transformation matrix, and the virtual background image to be rendered is input into a second STN to obtain a second transformation matrix.
  • The subject image here may be the subject image after underlying consistent rendering; if the result of the underlying consistent rendering process is the original video image after underlying consistent rendering, the subject image is extracted from that original video image to obtain the subject image after underlying consistent rendering.
  • image affine transformation (Warp) is performed on the subject image using the first transformation matrix to obtain the subject image after Warp.
  • Image affine transformation (Warp) is performed on the virtual background image to be rendered by using the second transformation matrix to obtain the warped virtual background image to be rendered.
  • Through the STN networks, the relative rotation, translation, or scaling between the foreground (that is, the subject image) and the virtual background image to be rendered can be adjusted, and cropping can be performed.
  • The first STN network and the second STN network are pre-trained networks.
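  • A sketch of applying the two predicted transformation matrices and fusing the results follows, assuming 2x3 affine matrices and a subject image carrying an alpha channel that marks the subject region; the source does not fix the matrix form or the fusion details.

```python
import cv2
import numpy as np

def warp_and_compose(subject_rgba, vb_image, H0, H1, out_size):
    """Warp the subject image with the first transformation matrix and the virtual
    background image to be rendered with the second, then fuse foreground/background."""
    w, h = out_size
    warped_fg = cv2.warpAffine(subject_rgba, H0, (w, h))
    warped_bg = cv2.warpAffine(vb_image, H1, (w, h))
    alpha = warped_fg[..., 3:4].astype(np.float32) / 255.0   # subject mask
    fg = warped_fg[..., :3].astype(np.float32)
    composite = alpha * fg + (1.0 - alpha) * warped_bg.astype(np.float32)
    return composite.astype(np.uint8)
```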
  • the training process of the STN network adopts the method of adversarial learning, and the specific process can be as follows:
  • The subject image for training is input into a pre-built third STN network to obtain a transformation matrix H0 output by the third STN network; the virtual background image for training is input into a pre-built fourth STN network to obtain a transformation matrix H1 output by the fourth STN network.
  • the virtual background image for training refers to the image used for background replacement in the training data.
  • the main image may be an image after consistent rendering at the bottom layer, or may not be an image after consistent rendering at the bottom layer.
  • Warp the subject image for training using the transformation matrix H0, and warp the virtual background image for training using the transformation matrix H1; then fuse the warped subject image and the warped virtual background image to obtain a composite image.
  • the synthesized image is input to the discriminator.
  • The discriminator judges the quality of the composite image by judging the difference between the composite image and a real image: the smaller the difference, the better the composite image; the larger the difference, the worse the composite image.
  • When the discriminator considers that the input composite image is the same as a real image, the STN network training is considered complete, and the above-mentioned first STN network and second STN network are obtained.
  • the training process of the STN network may be performed on the terminal device 100, or may be performed on other devices.
  • Image 161, image 162, and image 163 are all images after background replacement. The image rendering processes of image 161 and image 162 do not perform the consistent rendering process based on the positional relationship, while the image rendering process of image 163 performs the above-mentioned consistent rendering process based on the positional relationship. It can be seen from the comparison that in image 161 and image 162, the subject 164 is not properly rendered on the table 165, resulting in unreasonable phenomena such as the subject appearing suspended in the background-replaced image; in image 163, the subject 164 is reasonably rendered on the table 165, and the rationality and authenticity are better.
  • The following describes a first image rendering process.
  • This image rendering process includes consistent rendering based on structural semantics, consistent rendering based on interaction relationships, underlying consistent rendering, and consistent rendering based on positional relationships.
  • the image rendering process may include the following steps:
  • Step S1701 the terminal device 100 acquires the original video image.
  • the original video image is a frame of video image in the video stream.
  • Step S1702 the terminal device 100 detects whether the target object in the original video image has an occlusion relationship.
  • When the target object has an occlusion relationship, step S1703 is entered.
  • the above-mentioned target object refers to the subject in the original video image, and in general, the target object is a human subject.
  • the terminal device 100 may first perform multi-class semantic segmentation on the original video image to obtain multi-class semantic segmentation results, and then determine whether the subject is occluded by other objects according to the multi-class semantic segmentation results and depth information of the original video image. If the subject is occluded by other objects, it is determined that the target object has an occlusion relationship. On the contrary, if the subject is not occluded by other objects, it is determined that the target object does not have an occlusion relationship.
  • Step S1703 The terminal device 100 determines the first virtual object to be rendered corresponding to the occluding object.
  • If the terminal device 100 determines, according to the depth information, that the subject is occluded by another object, it then determines the category of the occluding object and the position of the occluding object in the original video image through the multi-class semantic segmentation result. Then, the first virtual object to be rendered corresponding to the occluding object is determined according to the category of the occluding object, and the rendering position of the first virtual object to be rendered is determined according to the position of the occluding object in the original video image.
  • Step S1704 the terminal device 100 detects whether the target object in the original video image performs a preset action.
  • When the target object performs the preset action, go to step S1705; when the target object does not perform the preset action, go to step S1707.
  • Step S1705 The terminal device 100 determines the second virtual object to be rendered corresponding to the preset action.
  • Step S1706 the terminal device 100 determines the rendering position and rendering order of the second virtual object to be rendered.
  • When the rendering order is to render the interactive object first and then the subject, the second virtual object to be rendered and the target virtual background image are fused.
  • When the rendering order is to render the subject first and then the interactive object, the second virtual object to be rendered and the target virtual background image are not fused; instead, after the composite image is obtained, foreground-background fusion is performed on the second virtual object to be rendered and the composite image.
  • The rendering order of the first type of second virtual object to be rendered is: render the interactive object first, and then render the subject; the first type of second virtual object to be rendered is fused with the target virtual background image.
  • The rendering order of the second type of second virtual object to be rendered is: render the subject first, and then render the interactive object; the second type of second virtual object to be rendered is fused with the composite image.
  • Step S1707 the terminal device 100 determines the virtual background image to be rendered.
  • The target virtual object to be rendered can be used as the foreground and fused with the target virtual background image as the background to obtain the virtual background image to be rendered.
  • the target virtual object to be rendered may include a first virtual object to be rendered and/or a second virtual object to be rendered.
  • When the target virtual object to be rendered includes only the first virtual object to be rendered, the first virtual object to be rendered is used as the foreground according to the position of the occluding object in the original video image, and foreground-background fusion is performed with the target virtual background image to obtain the virtual background image to be rendered.
  • When the rendering order between the second virtual object to be rendered and the subject is to render the interactive object first and then the subject, and the target virtual object to be rendered includes both the first virtual object to be rendered and the second virtual object to be rendered, the first virtual object to be rendered and the second virtual object to be rendered are used as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
  • When the rendering order between the second virtual object to be rendered and the subject is to render the subject first and then the interactive object, and the target virtual object to be rendered includes only the first virtual object to be rendered, the first virtual object to be rendered is used as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
  • When the rendering order between the second virtual object to be rendered and the subject is to render the interactive object first and then the subject, and the target virtual object to be rendered includes only the second virtual object to be rendered, the second virtual object to be rendered is used as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
  • the target virtual background image may also be directly used as the virtual background image to be rendered.
  • For example, when the rendering order between the second virtual object to be rendered and the subject is to render the subject first and then the interactive object, the target virtual background image is directly used as the virtual background image to be rendered.
  • image rendering may be performed according to the virtual background image to be rendered and the subject image, so as to obtain an image with the background replaced.
  • Step S1708 the terminal device 100 performs bottom-level consistent rendering according to the virtual background image to be rendered and the original video image.
  • the terminal device 100 may perform a bottom-level consistent rendering process.
  • the underlying consistent rendering process can be found above, and will not be repeated here.
  • step S1709 the terminal device 100 performs consistent rendering based on the positional relationship according to the main image after consistent rendering of the underlying layer and the virtual background image to be rendered to obtain a composite image.
  • When the rendering order between the second virtual object to be rendered and the subject is to render the interactive object first and then the subject, the composite image is the output image after background replacement.
  • When the rendering order is to render the subject first and then the interactive object, the image rendering process further includes step S1710: performing foreground-background fusion on the composite image and the second virtual object to be rendered to obtain a fused image.
  • steps S1702 to S1703 belong to the consistent rendering process based on structural semantics
  • steps S1704 to S1706 belong to the consistent rendering process based on interaction relationships.
  • the execution order of the two processes from steps S1702 to S1703 and steps S1704 to S1706 is arbitrary, and may be performed simultaneously or sequentially.
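  • To summarize steps S1701 to S1710, a high-level sketch follows; every helper function pulled from the placeholder object h is a hypothetical stand-in for the corresponding operation described above, not an actual API.

```python
def first_image_rendering_process(original_frame, target_vb, h):
    """Compact flow of the first image rendering process (steps S1701 to S1710)."""
    subject = h.extract_subject(original_frame)                  # foreground/background split

    # S1702-S1703: consistent rendering based on structural semantics
    occluder = h.detect_occluder(original_frame)                 # None if not occluded
    first_obj = h.pick_virtual_object(occluder) if occluder else None

    # S1704-S1706: consistent rendering based on the interaction relationship
    action = h.recognize_action(original_frame)
    second_obj, object_in_front = None, False
    if action is not None:
        second_obj = h.pick_virtual_object_for_action(action)
        object_in_front = h.interactive_object_in_front(original_frame)

    # S1707: build the virtual background image to be rendered
    vb_to_render = target_vb
    if first_obj is not None:
        vb_to_render = h.fuse(first_obj, vb_to_render)
    if second_obj is not None and not object_in_front:
        # interactive object behind the subject: render it first, into the background
        vb_to_render = h.fuse(second_obj, vb_to_render)

    # S1708: underlying consistent rendering (Mode 1 or Mode 2)
    subject = h.underlying_consistent_render(subject, vb_to_render)

    # S1709: consistent rendering based on the positional relationship (STN + Warp)
    composite = h.position_consistent_render(subject, vb_to_render)

    # S1710: interactive object in front of the subject: fuse it onto the composite
    if second_obj is not None and object_in_front:
        composite = h.fuse(second_obj, composite)
    return composite
```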
  • the prior art does not consider interaction relationship, structural semantics, underlying consistent rendering, and positional relationship in the image rendering process, resulting in unreasonable rendering of images after background replacement.
  • In this image rendering process, the occlusion relationship of the subject is determined through the consistent rendering process based on structural semantics, which avoids unreasonable rendering of the subject; through the consistent rendering process based on the interaction relationship, the interaction between the subject and the interactive object is considered when rendering the subject, which avoids unreasonable rendering; and through the transformation of the STN networks, the rendering position of the subject image is determined, which avoids rendering the subject at an unreasonable position. This optimizes the realism after background replacement and prevents unreal or even unreasonable background replacement from degrading the background replacement effect.
  • the main image and the background in the image after background replacement are consistent in terms of color, brightness, contrast, hue, etc., which further improves the background replacement effect.
  • the image rendering process includes a consistent rendering process based on structural semantics, a bottom-level consistent rendering process, and a consistent rendering process based on positional relationships.
  • the image rendering process may include the following steps:
  • Step S1801 the terminal device 100 acquires the original video image.
  • Step S1802 the terminal device 100 detects whether the target object in the original video image has an occlusion relationship.
  • When the target object has an occlusion relationship, step S1803 is entered.
  • Step S1803 The terminal device 100 determines the first virtual object to be rendered corresponding to the occluding object.
  • Step S1804 The terminal device 100 fuses the first virtual object to be rendered and the target virtual background image to obtain a virtual background image to be rendered.
  • step S1806 is entered.
  • Step S1805 the terminal device 100 takes the target virtual background image as the virtual background image to be rendered.
  • image rendering may be performed according to the virtual background image to be rendered and the subject image, so as to obtain an image with the background replaced.
  • Step S1806 the terminal device 100 performs bottom-level consistent rendering according to the virtual background image to be rendered and the original video image.
  • Step S1807 the terminal device 100 performs consistent rendering based on the positional relationship according to the main image after consistent rendering of the underlying layer and the virtual background image to be rendered to obtain a composite image.
  • the composite image is the output image after the background is replaced.
  • In this image rendering process, the occlusion relationship of the subject is determined through the consistent rendering process based on structural semantics, which avoids unreasonable rendering of the subject; through the transformation of the STN networks, the rendering position of the subject image is determined, which avoids rendering the subject at an unreasonable position. This optimizes the realism after background replacement and prevents unreal or even unreasonable background replacement from degrading the background replacement effect.
  • the main image and the background in the image after background replacement are consistent in terms of color, brightness, contrast, hue, etc., which further improves the background replacement effect.
  • the image rendering process includes a consistent rendering process based on an interaction relationship, a bottom-level consistent rendering process, and a consistent rendering process based on a position relationship.
  • the image rendering process may include the following steps:
  • Step S1901 the terminal device 100 acquires the original video image.
  • Step S1902 the terminal device 100 detects whether the target object in the original video image performs a preset action. If yes, go to step S1903, if not, go to step S1906.
  • Step S1903 the terminal device 100 determines the second virtual object to be rendered corresponding to the preset action.
  • Step S1904 the terminal device 100 determines the rendering order and rendering position of the second virtual object to be rendered.
  • When the rendering order of the second virtual object to be rendered is to render the interactive object first and then the subject, step S1905 is entered, that is, according to the rendering position of the second virtual object to be rendered, the second virtual object to be rendered is used as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
  • When the rendering order is to render the subject first and then the interactive object, step S1906 is entered, and the image rendering process further includes step S1909.
  • image rendering may be performed according to the subject image and the virtual background image to be rendered, so as to obtain an image with a background replaced.
  • Step S1905 the terminal device 100 fuses the second virtual object to be rendered and the target virtual background image to obtain the virtual background image to be rendered.
  • Step S1906 the terminal device 100 takes the target virtual background image as the virtual background image to be rendered.
  • image rendering may be performed according to the virtual background image to be rendered and the subject image, to obtain an image with a background replaced.
  • Step S1907 the terminal device 100 performs bottom-level consistent rendering on the virtual background image to be rendered and the original video image.
  • step S1908 the terminal device 100 performs consistent rendering based on the positional relationship according to the main image after consistent rendering of the underlying layer and the virtual background image to be rendered to obtain a composite image.
  • step S1909 the terminal device 100 performs foreground and background fusion on the composite image and the second virtual object to be rendered to obtain a fused image.
  • If the rendering order of the second virtual object to be rendered is earlier (render the interactive object first), the composite image is the output image after background replacement; if the rendering order of the second virtual object to be rendered is later (render the subject first), the fused image is the output image after background replacement.
  • In this image rendering process, through the consistent rendering process based on the interaction relationship, the interaction between the subject and the interactive object is considered when rendering the subject, which avoids unreasonable rendering; through the transformation of the STN networks, the rendering position of the subject image is determined, which avoids rendering the subject at an unreasonable position. This optimizes the realism after background replacement and prevents unreal or even unreasonable background replacement from degrading the background replacement effect.
  • the main image and the background in the image after background replacement are consistent in terms of color, brightness, contrast, hue, etc., which further improves the background replacement effect.
  • The image rendering process may also include only the consistent rendering process based on structural semantics and/or the consistent rendering process based on interaction relationships, without the above-mentioned underlying consistent rendering process and positional-relationship-based consistent rendering process.
  • the image rendering process can be as follows:
  • the terminal device 100 first executes the above-mentioned consistent rendering process based on structure and semantics and/or consistent rendering process based on interaction relationship to obtain a virtual background image to be rendered.
  • The terminal device 100 can then process the virtual background image to be rendered by using existing hue and brightness processing methods, for example: detect the brightness of the current scene and the brightness of the virtual background; when the brightness of the scene is greater than the brightness of the virtual background, adjust the exposure time; when the brightness of the scene is less than the brightness of the virtual background, add virtual lighting to the virtual background.
  • the terminal device 100 performs foreground and background fusion on the processed virtual background image to be rendered and the subject image to obtain a final output image.
  • the terminal device 100 may not perform hue or brightness processing on the to-be-rendered virtual background image, but directly perform foreground and background fusion on the to-be-rendered virtual background image and the main image to obtain the final output image.
  • The image rendering process may include, in addition to the consistent rendering process based on structural semantics and/or the consistent rendering process based on interaction relationships, the underlying consistent rendering process or the consistent rendering process based on positional relationships.
  • the image rendering process can be as follows:
  • the terminal device 100 first executes the above-mentioned consistent rendering process based on structure and semantics and/or consistent rendering process based on interaction relationship to obtain a virtual background image to be rendered.
  • If the terminal device 100 performs the underlying consistent rendering process, it performs underlying consistent rendering based on the virtual background image to be rendered and the subject image; finally, foreground-background fusion is performed on the subject image after underlying consistent rendering and the virtual background image to be rendered to obtain the final output image.
  • alternatively, the terminal device 100 performs the consistent rendering process based on the positional relationship, that is, the subject image and the virtual background image to be rendered are input into the trained STN networks, Warp is performed according to the change matrices, and finally the Warped subject image and the Warped virtual background image to be rendered are subjected to foreground and background fusion to obtain the final output image.
  • the image rendering process does not perform the consistent rendering process based on structural semantics and the consistent rendering process based on interaction relationships, but only performs the underlying consistent rendering process and the consistent rendering process based on positional relationships.
  • the image rendering process can be as follows:
  • the terminal device 100 takes the target virtual background image as the background image to be rendered, and performs a bottom-level consistent rendering process based on the background image to be rendered and the subject image; finally, a positional relationship-based consistent rendering process is performed based on the subject image after bottom-level consistent rendering and the virtual background image to be rendered, to obtain the output image.
  • the image rendering process may only include the underlying consistent rendering process, without performing the consistent rendering process based on structural semantics, the consistent rendering process based on interaction relationship, and the consistent rendering process based on position relationship.
  • the image rendering process can be as follows:
  • the terminal device 100 uses the target virtual background image as the background image to be rendered, and performs a bottom-level consistent rendering process based on the background image to be rendered and the subject image. Finally, the subject image after bottom-level consistent rendering and the virtual background image to be rendered are subjected to foreground and background fusion to obtain the output image.
  • the image rendering process does not perform the consistent rendering process based on structural semantics, the consistent rendering process based on the interaction relationship, or the underlying consistent rendering process, but only performs the consistent rendering process based on the positional relationship.
  • the image rendering process can be as follows:
  • the terminal device 100 performs a consistent rendering process based on the positional relationship according to the target virtual background image and the subject image to obtain an output image.
  • although the background replacement effect of the second image rendering process, the third image rendering process, and the other possible image rendering processes is worse than that of the first image rendering process, they can still improve the background replacement effect.
  • the above describes the process of performing virtual background replacement for a certain frame of original video image to obtain an output image based on the original video image or the subject image and the target virtual background image.
  • referring to FIG. 4 , in response to the click operation on the virtual background image 411 , the mobile phone determines that the target virtual background image is the virtual background image 411 , then performs any one of the above-mentioned image rendering processes based on the original video image corresponding to the answering interface 48 and the virtual background image 411 to obtain an output image, and finally displays the output image to obtain an image 410 with the background replaced.
  • referring to FIG. 6 , in response to the click operation on the virtual background image 412 , the mobile phone determines that the target virtual background image is the virtual background image 412 , and then performs any one of the above-mentioned image rendering processes based on the original video image corresponding to the answering interface 48 and the virtual background image 412 to obtain an output image as displayed in the interface 415 .
  • the virtual background replacement process provided by the embodiments of the present application can be applied to a video call scenario of a mobile phone, and can also be applied to a video call scenario of a large-screen device.
  • in the home scenario, the user makes a video call through the large-screen device 201 , on which Changlian Call is installed.
  • the virtual background selection window can be called up through the magic pen 202 in the video call interface, and the user can select a background image for replacement from the window.
  • the large-screen device performs any one of the above-mentioned image rendering processes according to the target virtual background image and the original video image to obtain the image 203 with the background replaced.
  • the terminal device 100 may perform the above-mentioned virtual background replacement process for each frame of image in the video stream, or may perform the above-mentioned virtual background replacement process every 5 frames or every 10 frames, that is, every 5 or 10 frames the original video image is segmented into the main image and the original background image, the target virtual background image is determined, and any one of the above-mentioned image rendering processes is performed based on the main image and the target virtual background image.
  • when the terminal device 100 continues to perform virtual background replacement on the video stream, the terminal device 100 renders the corresponding virtual object in the virtual background image if it recognizes that there is an occlusion relationship and/or interaction relationship between the subject and an object in the original background image. However, if the terminal device 100 recognizes that the interaction relationship ends and/or the occlusion relationship ends, the first virtual object to be rendered and/or the second virtual object to be rendered may be removed.
  • the end of the interaction relationship refers to the end of the interaction between the subject and the interactive object in the original background image.
  • for example, the subject is a person, the interactive object is a chair in the original background image, and the interaction action is "sit down"; when the person stands up from the chair, the interaction action of "sit down" is considered to be over, and the interaction relationship between the person and the chair in the original background image ends.
  • the end of the occlusion relationship refers to the change of the subject from being occluded to not being occluded.
  • for example, the subject is a person; at a certain moment the person is occluded by the table in the original background image, and at the next moment the person is no longer occluded by the table, so the occlusion relationship is considered to end.
  • in a specific implementation, the terminal device 100 continuously performs the virtual background replacement process on the video images in the video stream. If at a certain moment it is determined, through the above-mentioned consistent rendering process based on structure and semantics, that the subject in the original video image is no longer occluded, the occlusion relationship is considered to end and the first virtual object to be rendered is not rendered. Similarly, if it is determined, through the above-mentioned consistent rendering process based on the interaction relationship, that the interaction relationship ends, the second virtual object to be rendered does not need to be rendered; the interaction relationship can be considered to be over when the action corresponding to the interaction action is identified. For example, when the interaction action is "sit down", the corresponding action is "stand up", that is, when the "stand up" action is recognized, the interaction relationship is considered to end.
  • the terminal device 100 performs the subsequent bottom-layer consistent rendering process and the positional relationship-based consistent rendering process, and after the output image of the current virtual background replacement process is obtained, there is no virtual object in the output image. In this way, from the user's visual point of view, when the interaction relationship ends or the occlusion relationship ends, the virtual object in the background-replaced image disappears.
  • for example, after the person stands up from the chair, the chair in the background-replaced image also disappears, and when the person goes from being occluded to not being occluded, the object used for occlusion (such as a table) in the background-replaced image also disappears.
  • the image 211 is the image obtained after the first virtual background replacement process is performed on the original video image 1 .
  • the image 211 is the image 78 in the above-mentioned FIG. 7
  • the original video image 1 is the image 71 in FIG. 7
  • the target virtual background image is the image 74 in FIG. 7 .
  • the image 211 includes a main body 212 and a main body 213 , and a virtual object 214 .
  • the video stream includes an original video image 1, an original video image 2, an original video image 3, and an original video image 4 in sequence.
  • in the second virtual background replacement process, the terminal device 100 performs virtual background replacement on the original video image 2 to obtain a background-replaced image 215 . Specifically, the terminal device 100 determines, based on the original video image 2 , whether the main body 212 and the main body 213 have an occlusion relationship and/or an interaction relationship. At this time, the main body 213 has an occlusion relationship while the main body 212 does not, so the virtual object 214 and the target virtual background image are fused as foreground and background to obtain the virtual background image to be rendered. Then, according to the virtual background image to be rendered and the original video image 2 , the underlying consistent rendering process and the positional relationship-based consistent rendering process are sequentially performed to obtain the image 215 .
  • in the third virtual background replacement process, the terminal device 100 performs virtual background replacement on the original video image 3 to obtain a background-replaced image 216 .
  • the terminal device 100 performs a consistent rendering process based on structure and semantics and a consistent rendering process based on an interaction relationship on the original video image, so as to determine whether the subject has an occlusion relationship and an interaction relationship.
  • the main body 212 has an interactive relationship with the chair in the original background
  • the main body 213 has an occlusion relationship with the table in the original background, so the virtual objects 217 and 214 to be rendered are determined.
  • the virtual object 217 and the virtual object 214 are rendered in the target virtual background image to obtain the virtual background image to be rendered, and then, based on the virtual background image to be rendered and the original video image 3 , the underlying consistent rendering process and the positional relationship-based consistent rendering process are sequentially performed to obtain the image 216 .
  • in the fourth virtual background replacement process, the terminal device 100 performs virtual background replacement on the original video image 4 to obtain a background-replaced image 218 .
  • the terminal device 100 performs a consistent rendering process based on structure and semantics and a consistent rendering process based on an interaction relationship on the original video image, so as to determine whether the subject has an occlusion relationship and an interaction relationship.
  • the main body 213 in the original video image 4 has neither an occlusion relationship nor an interaction relationship, and the main body 212 has an interaction relationship but no occlusion relationship, so it is determined that the virtual object 217 needs to be rendered.
  • the virtual object 217 and the target virtual background image are fused as foreground and background to obtain the virtual background image to be rendered.
  • finally, based on the original video image 4 and the virtual background image to be rendered, the bottom layer consistent rendering process and the positional relationship-based consistent rendering process are sequentially performed to obtain an image 218 .
  • the terminal device sequentially displays the image 211 , the image 215 , the image 216 and the image 218 on the display screen according to the playback sequence of the original video images. From the user's visual point of view, when the main body 212 goes from standing to sitting down, an additional chair correspondingly appears in the virtual background image, and when neither the main body 212 nor the main body 213 is occluded, the previously rendered virtual object 214 disappears as well.
  • the virtual background replacement solution provided by the embodiment of the present application can be applied to the virtual background replacement scenarios such as background crossing, special effect production, video conference, photographing, and video recording, in addition to video call scenarios.
  • the following will exemplarily introduce the shooting scene and the video recording scene.
  • FIG. 22 for a schematic interface diagram of a virtual background replacement process in a shooting scene.
  • the mobile phone displays a preview interface 222 in response to a click operation on the camera 221 .
  • the mobile phone pops up a window 224 in the preview interface 222, and the window 224 displays scenes 225 to 228 in sequence.
  • after the mobile phone receives the click operation on the scene 225 , the mobile phone displays the preview interface 229 .
  • the image corresponding to the preview interface 222 is the original video image
  • the image corresponding to the scene 225 is the target virtual background image.
  • the mobile phone first performs a consistent rendering process based on structural semantics and a consistent rendering process based on interaction relationships to determine whether virtual objects need to be rendered. If the virtual object needs to be rendered, the virtual object to be rendered and the target virtual background image are fused to obtain the virtual background image to be rendered. However, in the current situation, since there is no occlusion relationship and interaction relationship between the subject and the objects in the original background image, it is not necessary to render the virtual object. Then, based on the main image and the virtual background image to be rendered, the underlying consistent rendering process and the positional relationship-based consistent rendering process are performed to obtain an output image after background replacement, which is the image corresponding to the preview interface 229 .
  • after the mobile phone displays the preview interface 229 , the user can click the control 2210 to take a photo. After receiving the click operation on the control 2210 , the mobile phone saves the image corresponding to the preview interface 229 as a picture, and displays the picture in the control 2211 .
  • the user can view the captured picture by clicking 2211.
  • a picture preview interface is displayed, and the captured picture 2212 is displayed in the picture preview interface.
  • Picture 2212 is the image after background replacement.
  • if the mobile phone recognizes that, in the image corresponding to the preview interface 222 , a person is occluded by a certain object and/or the person makes a certain preset action, the captured picture 2212 will also contain the corresponding virtual objects.
  • the preview interface also displays images such as image 211 , image 215 , image 216 and image 218 .
  • the user can also use the magic pen to replace the virtual background again.
  • the mobile phone displays the preview interface 232 .
  • the preview interface 232 includes a magic pen 233, and the preview interface 232 displays the original video image collected by the mobile phone through the camera.
  • after the mobile phone receives the click operation on the control 234 , the mobile phone starts recording and displays the recording interface 235 , which still displays the original video image collected by the mobile phone through the camera. During the recording process, the user can click the magic pen 233 to replace the virtual background.
  • when the mobile phone receives the click operation on the magic pen 233 in the video recording interface 235 , the mobile phone pops up a window 236 in the video recording interface 235 , and the window 236 displays scenes 237 to 2310 that can be used for background replacement. After the user clicks the scene 237 in the window 236 , the mobile phone performs a virtual background replacement process in response to the click operation, obtains a background-replaced image, and displays the background-replaced image in the interface 2311 .
  • the image corresponding to the scene 237 is the target virtual background image
  • the image corresponding to the recording interface 235 is the original video image.
  • based on the target virtual background image and the original video image, the consistent rendering process based on structure and semantics, the consistent rendering process based on the interaction relationship, the underlying consistent rendering process, and the consistent rendering process based on the positional relationship are performed in sequence to obtain the output image after background replacement.
  • the virtual background is replaced after the recording starts.
  • the virtual background can also be replaced before the recording starts, that is, the window 236 is called up through the magic pen in the preview interface 232 , and the corresponding scene is selected.
  • in FIG. 22 and FIG. 23 , the mobile phone can also perform the above virtual background recommendation process. For the parts of FIG. 22 and FIG. 23 that are the same as or similar to the above, reference may be made to the above, and details are not repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • the embodiments of the present application provide a computer program product, when the computer program product runs on a terminal device, so that the terminal device can implement the steps in the foregoing method embodiments when executed.
  • An embodiment of the present application further provides a chip system, where the chip system includes a processor, the processor is coupled to a memory, and the processor executes a computer program stored in the memory, so as to implement the methods described in the foregoing method embodiments. method.
  • the chip system may be a single chip, or a chip module composed of multiple chips.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in some other embodiments,” “in still other embodiments,” etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.

Abstract

本申请实施例公开了一种图像渲染方法和装置,该方法包括:获取待处理图像;检测出待处理图像中的目标对象执行预设动作和/或目标对象被第一物体遮挡;确定预设动作对应的第二待渲染虚拟物体和/或与第一物体对应的第一待渲染虚拟物体;将第一待渲染虚拟物体和/或第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,第二待渲染虚拟物体对应的交互物体的深度值大于目标对象的深度值;根据待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像。本申请实施例可以提高渲染后的图像的合理性和真实性,进而提高虚拟背景替换效果。

Description

图像渲染方法和装置
本申请要求于2020年11月09日提交国家知识产权局、申请号为202011240398.8、申请名称为“图像渲染方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,尤其涉及一种图像渲染方法和装置。
背景技术
随着图像处理技术的不断发展,具备虚拟背景替换功能的产品也越来越多。例如,畅连通话、zoom视频会议以及可立拍等。
虚拟背景替换是指将原图像的背景替换成另一个不同的背景。在虚拟背景替换时,一般先对原图像进行前后景分割,得到前景和原始背景;再将前景和虚拟背景进行图像渲染和图像融合,得到背景替换后的图像。
现有技术中,背景替换后的图像的合理性和真实性等欠佳,导致背景替换效果较差。
发明内容
本申请实施例提供一种图像渲染方法和装置,可以提高虚拟背景替换效果。
第一方面,本申请实施例提供一种图像渲染方法,该方法包括:终端设备首先获取待处理图像,该待处理图像可以是指视频流中的原始视频图像;再检测待处理图像中的目标对象是否执行预设动作和/或目标对象是否被第一物体遮挡;如果检测出待处理图像中的目标对象执行预设动作和/或目标对象被第一物体遮挡,相应地,确定预设动作对应的第二待渲染虚拟物体和/或与第一物体对应的第一待渲染虚拟物体;接着将第一待渲染虚拟物体和/或第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,其中,第二待渲染虚拟物体对应的交互物体的深度值大于目标对象的深度值;最后根据待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像,该主体图像为从待处理图像中提取的,且包括目标对象的图像。
如果检测出目标对象没有执行预设动作,也没有被遮挡;或者,目标对象没有被遮挡,且目标对象执行了预设动作,但第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值,则将目标虚拟背景图像作为待渲染背景图像,再根据待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像。
本申请实施例在检测出目标对象被遮挡和/或执行了预设动作后,则在虚拟背景图像中渲染与之对应的虚拟物体,即当目标对象被遮挡时,则渲染对应的虚拟物体进行遮挡,当目标对象执行了交互动作,则渲染对应的交互虚拟物体,以尽可能地避免渲染后的图像出现不合理和不真实的现象,提高渲染后的图像的合理性和真实性,从而提高虚拟背景替换效果。
示例性地,目标对象为人,即主体图像为人像主体图像时,如果检测到人像主体执行了“坐下”这一预设动作,并确定出“坐下”这一预设动作对应的第二待渲染虚 拟物体为“椅子”。此时,假设待处理图像中“椅子”对应的交互物体(例如为凳子等)的深度值大于目标对象的深度值,则将“椅子”渲染在目标虚拟背景图像中的相应位置,得到待渲染虚拟背景图像。再将待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像,该渲染后的图像为虚拟背景替换后的图像。
从视觉上来说,待处理图像中的人像主体坐在交互物体上,渲染后的图像中人像主体也会坐在“椅子”上,使得背景替换后的图像和原始视频图像在交互关系上相一致,避免了背景替换后的图像中出现人坐在空中等不合理的现象。
需要说明的是,当检测出目标对象执行了预设动作,也被第一物体遮挡,且第二待渲染虚拟物体的交互物体的深度值大于目标对象的深度值,将第一待渲染虚拟物体和第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像。
当检测出目标对象执行了预设动作,但没有被第一物体遮挡,且第二待渲染虚拟物体的交互物体的深度值大于目标对象的深度值,将第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像。
当检测目标对象没有执行预设动作,但被第一物体遮挡;或者,目标对象执行了预设动作,但第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值,目标对象被第一物体遮挡,则将第一待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像。
在第一方面的一些可能的实现方式中,上述根据待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像的过程可以包括:基于待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像;根据底层一致性渲染后的主体图像和待渲染虚拟背景图像进行图像渲染,得到渲染后的图像。
在第一方面的一些可能的实现方式中,上述基于待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像的过程可以包括:将待渲染虚拟背景图像的低频图像以及待处理图像输入至预先训练完成的第一风格迁移模型,获得第一风格迁移模型输出的底层一致性渲染后的待处理图像;从底层一致性渲染后的待处理图像进行主体图像提取,得到底层一致性渲染后的主体图像。
在该实现方式中,将低频图像作为模型的输入,可以忽略掉图像中纹理等低层特征。通过底层一致性渲染过程,可以进一步提高虚拟背景替换效果。
在第一方面的一些可能的实现方式中,风格迁移模型的训练过程可以包括:获得训练数据集,训练数据集包括第一虚拟背景图像和原始视频图像;将第一虚拟背景图像的低频图像和原始视频图像输入至预先构建的第二风格迁移模型,获得第二风格迁移模型输出的正向训练结果;计算正向训练结果和第一虚拟背景图像的低频图像之间的第一损失值;将正向训练结果和原始视频图像的低频图像输入至正向训练后的第二风格迁移模型,获得正向训练后的第二风格迁移模型输出的反向训练结果;计算反向训练结果和原始视频图像之间的第二损失值;计算反向训练结果和原始视频图像的低频图像之间的第三损失值;根据第一损失值调整第二风格迁移模型的网络参数,并根据第二损失值和第三损失值,调整正向训练后的第二风格迁移模型的网络参数;重复进行训练过程,当符合预定条件时,得到训练完成的第一风格迁移模型。
其中,预定条件可以用于表征模型的损失值趋于稳定,具体可以是指第一损失值、 第二损失值和第三损失值均稳定在某一个数值附近。当符合预定条件时,则认为模型训练完成,得到训练完成的第一风格迁移模型。
第一损失值为LAB空间的损失值,即将正向训练结果和第一虚拟背景图像的低频图像均转到LAB空间后,再计算两个图像在LAB域的方差差异和均值差异,以约束全局的色彩、亮度、饱和度等的相似性。第三损失值也是LAB空间的损失值。
在该实现方式中,通过上述训练过程得到的模型不仅可以保证风格上的一致,还可以保证图像内容上的一致。使用该第一风格迁移模型进行底层一致性渲染,可以进一步提高背景替换效果。
在第一方面的一些可能的实现方式中,上述基于待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像的过程也可以包括:将待渲染虚拟背景图像转到LAB色彩空间,得到第一图像;分别计算第一图像的L通道、A通道、B通道的第一标准差和第一均值;将主体图像转到LAB色彩空间,得到第二图像;根据第一标准差和第一均值修正第二图像,得到第三图像,第三图像的L通道、A通道、B通道的第二标准差与第一标准的差值在第一预设阈值区间内,第二均值与第一均值的差值在第二预设阈值区间内;将第三图像从LAB色彩空间转到RGB色彩空间,得到第四图像,第四图像为底层一致性渲染后的主体图像。
可以理解的是,在LAB色彩空间中,每个通道均有其对于的方差和均值,根据第一图像中每个通道的标准差和均值,修正第二图像中对应通道的标准差和均值。例如,第一图像的L通道的第一标准差为A1,第一均值为B1,A通道的第一标准差为A2,第一均值为B2,B通道的第一标准差为A3,第一均值为B3。将第二图像的L通道的标准差设置为A1,均值设置为B1;将第二图像的A通道的标准差设置为A2,均值设置为B2;将第二图像的B通道的标准差设置为A3,均值设置为B3,以根据第一图像的标准差和均值对第二图像进行修正,得到第三图像。
第一预设阈值区间和第二预设阈值区间均可以根据实际需要设定,例如,可以将第一预设阈值区间和第二阈值区间均可以设置为0。
通过该实现方式的底层一致性渲染过程,可以进一步提高虚拟背景替换效果。
在第一方面的一些可能的实现方式中,上述根据底层一致性渲染后的主体图像和待渲染虚拟背景,得到渲染后的图像的过程可以包括:将底层一致性渲染后的主体图像输入至预先训练完成的第一STN网络,得到第一STN网络输出的第一变化矩阵;将待渲染虚拟背景图像输入至预先训练完成的第二STN网络,得到第二STN网络输出的第二变化矩阵;使用第一变化矩阵对底层一致性渲染后的主体图像进行图像仿射变化,得到第一变化图像;使用第二变化矩阵对待渲染虚拟背景图像进行图像仿射变化,得到第二变化图像;将第一变化图像和第二变化图像进行图像合成,得到渲染后的图像。
在该实现方式,通过STN网络让主体图像渲染在更合理的位置,进一步提高了背景替换后的图像的合理性和真实性。
在第一方面的一些可能的实现方式中,上述将第二待渲染虚拟物体在目标虚拟背景图像中的过程可以包括:根据待处理图像的语义分割结果,确定预设动作对应的交互物体在待处理图像中的第一位置;将目标虚拟背景图像中与第一位置对应的第二位 置作为第一待渲染虚拟物体的渲染位置;确定待处理图像中的交互物体的深度值大于目标对象的深度值;在目标虚拟背景图像的渲染位置渲染第二待渲染虚拟物体。
在第一方面的一些可能的实现方式中,上述检测待处理图像中的目标对象被第一物体遮挡的过程可以包括:根据待处理图像的语义分割结果,确定待处理图像中各个像素点的所属类别;获取待处理图像的深度信息;当根据深度信息,确定目标对象的预设范围内存在深度值小于目标对象的深度值的目标像素点,将目标像素点对应的类别作为第一物体,并确定目标对象被第一物体遮挡。
可以理解的是,目标对象的预设范围可以是指目标对象对应的像素点的周边预设范围之内。在确定出目标像素点之后,再将目标像素点映射到语义分割结果,即根据语义分割结果中各个像素点所属类别,确定目标像素点的所属类别,从而确定出遮挡物体的所属类别。反之,如果不存在目标像素点,则认为目标对象没有被遮挡。
在第一方面的一些可能的实现方式中,在将第一待渲染虚拟物体和/或第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像之前,该方法还可以包括:根据待处理图像的原始背景图像和各个第二虚拟背景图像之间的相似性,确定待推荐虚拟背景图像;显示待推荐虚拟背景图像。
在该实现方式中,根据原始背景图像和虚拟背景图像之间的相似性,给用户推荐虚拟背景图像,可以让用于背景替换的虚拟背景图像和原始背景图像的更相关。
在第一方面的一些可能的实现方式中,上述根据待处理图像的原始背景图像和各个第二虚拟背景图像之间的相似性,确定待推荐虚拟背景图像的过程可以包括:对待处理图像进行前后景分割,获得待处理图像的原始背景图像;对原始背景图像进行多类语义分割,得到第二语义分割结果;对各个第二虚拟背景图像进行多类语义分割,得到各个第二虚拟背景图像的第三语义分割结果;根据第二语义分割结果和第三语义分割结果,计算原始背景图像和各个第二虚拟背景图像的IOU值;分别计算原始背景图像的第一颜色分布曲线,以及各个第二虚拟背景图像的第二颜色分布曲线;计算第一颜色分布曲线与各个第二颜色分布曲线之间的曲线相似度;根据曲线相似度和IOU值,从第二虚拟背景图像中确定待推荐虚拟背景图像。
在第一方面的一些可能的实现方式中,该方法还可以包括:若第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值,将第一待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,或者,将目标虚拟背景图像作为待渲染虚拟背景图像;
此时，在根据待渲染虚拟背景图像和主体图像进行图像渲染，得到渲染后的图像之后，该方法还可以包括：根据第二待渲染虚拟物体的渲染位置，将第二待渲染虚拟物体渲染在渲染后的图像中，得到输出图像。
也就是说，当检测出目标对象执行了预设动作，但第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值，则不将第二待渲染虚拟物体渲染在目标虚拟背景图像中，而是得到渲染后的图像之后，再将第二待渲染虚拟物体和渲染后的图像进行融合，得到最终的输出图像。
其中,当第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值时,第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染主体,再渲染第二待渲染虚拟 物体;反之,当第二待渲染虚拟物体对应的交互物体的深度值大于目标对象的深度值时,第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染第二待渲染虚拟物体,再渲染主体。
第二方面,本申请实施例提供一种图像渲染装置,该装置可以包括:
图像获取模块,用于获取待处理图像;检测模块,用于检测出待处理图像中的目标对象执行预设动作和/或目标对象被第一物体遮挡;虚拟物体确定模块,用于确定预设动作对应的第二待渲染虚拟物体和/或与第一物体对应的第一待渲染虚拟物体;虚拟物体渲染模块,用于将第一待渲染虚拟物体和/或第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,其中,第二待渲染虚拟物体对应的交互物体的深度值大于目标对象的深度值;渲染模块,用于根据待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像,该主体图像为从待处理图像中提取的,且包括目标对象的图像。
在第二方面的一些可能的实现方式中,渲染模块具体用于:基于待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像;根据底层一致性渲染后的主体图像和待渲染虚拟背景图像进行图像渲染,得到渲染后的图像。
在第二方面的一些可能的实现方式中,渲染模块具体用于:将待渲染虚拟背景图像的低频图像以及待处理图像输入至预先训练完成的第一风格迁移模型,获得第一风格迁移模型输出的底层一致性渲染后的待处理图像;从底层一致性渲染后的待处理图像进行主体图像提取,得到底层一致性渲染后的主体图像。
在第二方面的一些可能的实现方式中,还包括模型训练模块,用于:获得训练数据集,训练数据集包括第一虚拟背景图像和原始视频图像;将第一虚拟背景图像的低频图像和原始视频图像输入至预先构建的第二风格迁移模型,获得第二风格迁移模型输出的正向训练结果;计算正向训练结果和第一虚拟背景图像的低频图像之间的第一损失值;将正向训练结果和原始视频图像的低频图像输入至正向训练后的第二风格迁移模型,获得正向训练后的第二风格迁移模型输出的反向训练结果;计算反向训练结果和原始视频图像之间的第二损失值;计算反向训练结果和原始视频图像的低频图像之间的第三损失值;根据第一损失值调整第二风格迁移模型的网络参数,并根据第二损失值和第三损失值,调整正向训练后的第二风格迁移模型的网络参数;重复进行训练过程,当符合预定条件时,得到训练完成的第一风格迁移模型。
在第二方面的一些可能的实现方式中,渲染模块具体用于:将待渲染虚拟背景图像转到LAB色彩空间,得到第一图像;分别计算第一图像的L通道、A通道、B通道的第一标准差和第一均值;将主体图像转到LAB色彩空间,得到第二图像;根据第一标准差和第一均值修正第二图像,得到第三图像,第三图像的L通道、A通道、B通道的第二标准差与第一标准的差值在第一预设阈值区间内,第二均值与第一均值的差值在第二预设阈值区间内;将第三图像从LAB色彩空间转到RGB色彩空间,得到第四图像,第四图像为底层一致性渲染后的主体图像。
在第二方面的一些可能的实现方式中,渲染模块具体用于:将底层一致性渲染后的主体图像输入至预先训练完成的第一STN网络,得到第一STN网络输出的第一变化矩阵;将待渲染虚拟背景图像输入至预先训练完成的第二STN网络,得到第二STN 网络输出的第二变化矩阵;使用第一变化矩阵对底层一致性渲染后的主体图像进行图像仿射变化,得到第一变化图像;使用第二变化矩阵对待渲染虚拟背景图像进行图像仿射变化,得到第二变化图像;将第一变化图像和第二变化图像进行图像合成,得到渲染后的图像。
在第二方面的一些可能的实现方式中,虚拟物体渲染模块具体用于:根据待处理图像的语义分割结果,确定预设动作对应的交互物体在待处理图像中的第一位置;将目标虚拟背景图像中与第一位置对应的第二位置作为第一待渲染虚拟物体的渲染位置;确定待处理图像中的交互物体的深度值大于目标对象的深度值;在目标虚拟背景图像的渲染位置渲染第二待渲染虚拟物体。
在第二方面的一些可能的实现方式中,检测模块具体用于:根据待处理图像的语义分割结果,确定待处理图像中各个像素点的所属类别;获取待处理图像的深度信息;当根据深度信息,确定目标对象的预设范围内存在深度值小于目标对象的深度值的目标像素点,将目标像素点对应的类别作为第一物体,并确定目标对象被第一物体遮挡。
在第二方面的一些可能的实现方式中,还包括背景推荐模块,用于:根据待处理图像的原始背景图像和各个第二虚拟背景图像之间的相似性,确定待推荐虚拟背景图像;显示待推荐虚拟背景图像。
在第二方面的一些可能的实现方式中,背景推荐模块具体用于:对待处理图像进行前后景分割,获得待处理图像的原始背景图像;对原始背景图像进行多类语义分割,得到第二语义分割结果;对各个第二虚拟背景图像进行多类语义分割,得到各个第二虚拟背景图像的第三语义分割结果;根据第二语义分割结果和第三语义分割结果,计算原始背景图像和各个第二虚拟背景图像的IOU值;分别计算原始背景图像的第一颜色分布曲线,以及各个第二虚拟背景图像的第二颜色分布曲线;计算第一颜色分布曲线与各个第二颜色分布曲线之间的曲线相似度;根据曲线相似度和IOU值,从第二虚拟背景图像中确定待推荐虚拟背景图像。
在第二方面的一些可能的实现方式中,虚拟物体渲染模块还用于:若第二待渲染虚拟物体对应的交互物体的深度值小于目标对象的深度值,将第一待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,或者,将目标虚拟背景图像作为待渲染虚拟背景图像;将第二待渲染虚拟物体渲染在和渲染后的图像中,得到输出图像。
第三方面,本申请实施例提供一种终端设备,包括存储器、处理器以及存储在存储器中并可在处理器上运行的计算机程序,处理器执行计算机程序时实现如上述第一方面任一项的方法。
第四方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,计算机程序被处理器执行时实现如上述第一方面任一项的方法。
第五方面,本申请实施例提供一种芯片系统,该芯片系统包括处理器,处理器与存储器耦合,处理器执行存储器中存储的计算机程序,以实现如上述第一方面任一项所述的方法。该芯片系统可以为单个芯片,或者多个芯片组成的芯片模组。
第六方面,本申请实施例提供一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行上述第一方面任一项所述的方法。
可以理解的是,上述第二方面至第六方面的有益效果可以参见上述第一方面中的相关描述,在此不再赘述。
附图说明
图1为本申请实施例提供的终端设备100的结构示意图;
图2为本申请实施例提供的终端设备100的软件结构框图;
图3为本申请实施例提供的图像渲染方案的流程示意框图;
图4为本申请实施例提供的视频通话场景的界面示意图;
图5为本申请实施例提供的虚拟背景图像示意图;
图6为本申请实施例提供的背景推荐过程的界面示意图;
图7为本申请实施例提供的基于结构语义的一致性渲染的效果示意图;
图8为本申请实施例提供的基于交互关系的一致性渲染的效果示意图;
图9为本申请实施例提供的基于模型的底层一致性渲染示意图;
图10为本申请实施例提供的风格迁移模型训练过程示意图;
图11为本申请实施例提供的底层一致性渲染的效果示意图;
图12为本申请实施例提供的基于图像处理算法的底层一致性渲染过程示意图;
图13为本申请实施例提供的基于图像处理算法的底层一致性渲染的效果示意图;
图14为本申请实施例提供的基于位置关系的一致性渲染过程示意图;
图15为本申请实施例提供的STN网络训练过程示意图;
图16为本申请实施例提供的基于位置关系的一致性渲染的效果示意框图;
图17为本申请实施例提供的图像渲染过程的一种流程示意图;
图18为本申请实施例提供的图像渲染过程的另一种流程示意图;
图19为本申请实施例提供的图像渲染过程的又一种流程示意图;
图20为本申请实施例提供的大屏设备的视频通话场景示意图;
图21为本申请实施例提供的虚拟背景替换图像的变化示意图;
图22为本申请实施例提供的拍摄场景下的虚拟背景替换过程的界面示意图;
图23为本申请实施例提供的录像场景下的虚拟背景替换过程的界面示意图。
具体实施方式
当前的虚拟背景替换过程中,在图像渲染时通常没有考虑原始背景和虚拟背景之间的相关性,进而导致背景替换后的图像的合理性和真实性等欠佳,即背景替换后的图像中存在不合理甚至不真实的现象。其中,原始背景是指原始图像中的背景。
例如,没有考虑虚拟背景中的内容和位置等,将人渲染在虚拟背景中的桌椅之上,使得背景替换后的图像中出现人悬空于座椅上的不合理现象。
又例如,没有考虑原始图像中的人体是否被物体遮挡,而是直接将人渲染在虚拟背景的相应位置,使得背景替换后的图像中出现人体悬空或者只有一部分人体的不合理现象。
又例如,当原始图像中的人从站立变成坐下时,直接将人体渲染在虚拟背景的相应位置,使得背景替换后的图像中出现人坐在空中,而不是坐在椅子等支撑物体上的不合理现象。
另外,当前的虚拟背景替换过程中,色调和亮度等一致性处理方式单一,使得前 景的色度、亮度等和虚拟背景的色调、亮度等不一致,不统一,从而影响前景和虚拟背景的融合效果。
针对上述问题,本申请实施例提供图像渲染方案,在图像渲染时考虑原始背景和虚拟背景之间的相关性,以提高虚拟背景替换的合理性和真实性,从而提高虚拟背景替换效果。
进一步地,本申请实施例还将前景和虚拟背景的色调、亮度、对比度以及颜色等进行一致性渲染,使得前景的色调、亮度、对比度以及颜色等和虚拟背景的色调、亮度、对比度以及颜色等一致,提高前景和虚拟背景的融合效果。
本申请实施例提供的图像渲染方案可以应用于终端设备,该终端设备可以为手机、平板电脑、笔记本电脑或者可穿戴设备等便携式终端设备,可以为增强现实(augmented reality,AR)设备或虚拟现实(virtual reality,VR)设备,也可以为车载设备、上网本或智慧屏等终端设备。本申请实施例对终端设备的具体类型不作任何限制。
示例性地,图1示出了终端设备100的一种结构示意图。
终端设备100可以包括处理器110,存储器120,摄像头130,显示屏140等。
可以理解的是,本发明实施例示意的结构并不构成对终端设备100的具体限定。在本申请另一些实施例中,终端设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口等。
MIPI接口可以被用于连接处理器110与显示屏140,摄像头130等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头130通过CSI接口通信,实现终端设备100的拍摄功能。处理器110和显示屏140通过DSI接口通信,实现终端设备100的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头130,显示屏140等。可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端设备100的结构限定。在本申请另一些实施例中,终端设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
终端设备100通过GPU,显示屏140,以及应用处理器等实现显示功能。GPU为 图像处理的微处理器,连接显示屏140和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏140用于显示图像,视频等。显示屏140包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端设备100可以包括1个或N个显示屏140,N为大于1的正整数。
终端设备100可以通过ISP,摄像头130,视频编解码器,GPU,显示屏140以及应用处理器等实现拍摄功能。
ISP用于处理摄像头130反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头130中。
摄像头130用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端设备100可以包括1个或N个摄像头130,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。终端设备100可以支持一种或多种视频编解码器。这样,终端设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
存储器120可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。存储器120可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储终端设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,存储器120可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一 个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行终端设备100的各种功能应用以及数据处理。
终端设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构的Android系统为例,示例性说明终端设备100的软件结构。
图2为本申请实施例的终端设备100的一种软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图2所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,畅连通话等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图2所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
下面结合拍摄场景,示例性地说明终端设备100软件以及硬件的工作流程。
当触摸传感器接收到触摸操作,相应的硬件中断被发给内核层。内核层将触摸操作加工成原始输入事件(包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。应用程序框架层从内核层获取原始输入事件,识别该输入事件所对应的控件。以该触摸操作是触摸单击操作,该单击操作所对应的控件为相机应用图标的控件为例,相机应用调用应用框架层的接口,启动相机应用,进而通过调用内核层启动摄像头驱动,通过摄像头130捕获静态图像或视频。
摄像头130捕获到图像或视频之后,通过ISP将电信号转换成数字图像信号。数字图像信号再被输入到DSP进行加工处理,将数字图像信号转换成标准的RGB,YUV等格式的图像信号。最后通过GPU、显示屏和应用处理器等进行图像显示。
在获取到图像信号之后,通过CPU执行本申请实施例提供的图像渲染方案进行图像渲染,获得背景替换后的图像,并通过GUP、显示屏和应用处理器等将背景替换后的图像进行图像显示。
下面根据图1和图2示出的终端设备100示例性地介绍本申请实施例提供的图像渲染方案。
参见图3示出的图像渲染方案的流程示意框图,该图像渲染方案可以包括以下步骤:
步骤S301、终端设备100获取视频流。
可以理解的是,终端设备100可以通过自身集成的摄像头实时采集视频流,该摄像头可以是前置摄像头,也可以是后置摄像头;可以接收其它终端设备传入的视频流;也可以通过读取预先录制好并存储在本地的视频流。本申请实施例对终端设备100获取视频流的方式不作限定。
例如,参见图4示出的本申请实施例提供的视频通话场景的界面示意图,如图4中的(a)~(c)所示,终端设备100具体为手机,手机的主界面上包括电话41和畅连通话42,还包括智慧生活、设置、应用商城等应用程序。用户可以通过电话41和畅连通话42发起视频通话,下面以用户通过电话41发起视频通话进行介绍。
首先,用户点击手机主界面上的电话41,手机响应于用户的点击操作,显示电话41的主界面。接着,用户再点击功能栏中的畅连通话43,手机响应于用户的点击操作,显示畅连通话界面44。在畅连通话界面44中,显示有可拨打的联系人。然后,用户可以点击控件45,手机响应于用户的点击操作,向露西发起网络视频通话,显示呼叫界面46。呼叫界面46中包括魔术笔47。
在该过程中,当手机接收到针对控件45的点击操作之后,则调用前置摄像头采集图像数据,此时,手机则可以获取到视频流。再通过DSP、GPU、应用处理器以及显示屏等,将采集到的图像信号显示在显示屏上。
步骤S302、终端设备100从视频流的视频图像中提取主体图像。
需要说明的是,视频流包括在时序上连续的多帧图像,终端设备100可以对每一帧图像均进行主体图像提取,也可以间隔预设数量帧进行主体图像提取,具体可以根据实际应用需求确定,在此不作限定。
具体应用中,终端设备100从视频图像中提取主体图像的方式可以是任意的。例如,可以通过语义分割和实例分割的方式,从视频图像中提取出主体图像。此时,先对视频图像进行多类语义分割,得到多类语义分割结果;再对多类语义分割结果进行实例分割,得到视频图像中的主体图像。
其中,语义分割(Semantic segmentation)可以给图像中的每个像素一个语义标签,该语义标签用于标识该像素点的所属类别。例如,把人的标签设为红色像素,即把图像中人的像素点均标记为红色。
换句话说,通过语义分割结果可以得知原始视频图像所包含的类别,以及各个类别的所处位置和占比等。例如,通过多类语义分割结果可以得知原视频图像中包含人和树两个类别,以及人和树这两个类别在原始视频图像中的位置和占比等。
实例分割(instance segmentation)可以在语义分割的基础上,区分出同一类不同的个体。即实例分割可以把属于同一类别的不同个体区分出来,例如,把同属于人的两个个体区分成人物1和人物2。
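作为理解上述主体图像提取方式的一个简化示例（Python），下面的代码假设已经得到多类语义分割结果与实例分割结果，仅保留指定“人”实例的像素作为主体图像；其中 PERSON_CLASS、target_instance 等名称均为示意性假设，并非本申请限定的实现：

```python
import numpy as np

PERSON_CLASS = 1  # 假设"人"类别在语义分割结果中的类别 id 为 1

def extract_subject(image: np.ndarray, semantic_map: np.ndarray,
                    instance_map: np.ndarray, target_instance: int) -> np.ndarray:
    """image 为原始视频图像 (H, W, 3)，semantic_map / instance_map 为逐像素的
    类别 id 与实例 id (H, W)。返回只保留目标实例像素的主体图像。"""
    mask = (semantic_map == PERSON_CLASS) & (instance_map == target_instance)
    subject = np.zeros_like(image)
    subject[mask] = image[mask]   # 其余像素置 0，即从视频图像中抠出主体
    return subject
```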
视频图像通常可以分为前景和背景,上述主体图像通常可以理解为视频图像中的前景。为了便于说明,本申请将所述视频流中用于提取主体图像的视频图像作为原始视频图像。
步骤S303、终端设备100确定目标虚拟背景图像。
可以理解的是,目标虚拟背景图像是指用于替换原始背景的虚拟背景。终端设备100确定出目标虚拟背景图像后,可以根据该目标虚拟背景图像和上述主体图像进行图像渲染过程,以得到背景替换后的输出图像。
在一些实施例中,该目标虚拟背景图像可以由用户选择。此时,终端设备100可以将虚拟背景库中的多个虚拟背景图像,按照预先设定的显示顺序显示在显示屏上,以供用户选择。用户选择之后,终端设备100响应于用户的选择操作,将用户选定的虚拟背景图像作为目标虚拟背景图像。
举例来说,参见图4示出的本申请实施例提供的视频通话场景的界面示意图,图4中的(a)~(c)为如何发起畅连通话的过程,在此不再赘述。
如图4中的(d)~(f)所示,当露西接听后,手机显示接听界面48。接听界面48中也包括魔术笔47。
可以理解,手机通过摄像头采集视频流,并将视频流实时显示在呼叫界面46和接听界面48上。此时,呼叫界面46和接听界面48所显示的是未进行虚拟背景替换的视频图像。图4中未示出视频图像的真实背景。
用户可以通过魔术笔47进行虚拟背景替换。参见图4中的(e),当用户点击魔术笔47时,手机响应于用户针对魔术笔47的点击操作,在接听界面48上显示窗口49,窗口49中包括美肤和场景两个选项。用户点击场景选项后,手机响应于用户的点击操作,在窗口49上显示虚拟背景图像411~414。虚拟背景图像411~414可以如图5所示。
用户点击窗口49中的虚拟背景图像411,手机响应于用户的点击操作,将虚拟背景图像411作为目标虚拟背景图像。至此,手机通过用户选择的方式确定出了目标虚拟背景图像。
手机确定出目标虚拟背景图像之后,可以进一步根据目标虚拟背景图像和主体图像进行图像渲染,得到背景替换后的输出图像,并将输出图像显示在显示屏上,得到背景替换后的界面410。根据目标虚拟背景图像和主体图像进行图像渲染的过程在下文再具体介绍。
在这种情况下,手机没有进行虚拟背景推荐,在用户点击窗口49中的场景选项时,手机按照默认显示顺序,将虚拟背景库中的虚拟背景图像依次显示在窗口49中。此时,默认显示顺序为虚拟背景图像411、虚拟背景图像412、虚拟背景图像413和虚拟背景图像414。
需要说明,除了可以通过接听界面48中的魔术笔47进行虚拟背景替换之外,也可以通过呼叫界面47中的魔术笔47进行虚拟背景替换,两者的过程类型相同,在此不再赘述。
在另一些实施例中,终端设备100也可以先根据各个虚拟背景图像和原始背景图像之间的相似性,确定出待推荐虚拟背景图像;然后再将待推荐虚拟背景图像显示在显示屏上,以向用户推荐虚拟背景。
示例性地,虚拟背景推荐过程可以如下:
首先,终端设备100对原始背景图像进行多类语义分割,得到第一分割结果;对虚拟背景图库中的各个虚拟背景图像进行多类语义分割,得到各个虚拟背景图像的第二分割结果。
其中,上述原始背景图像是指对原始视频图像中的背景图像,具体可以通过对原始视频图像进行前景和背景分割得到。
然后,计算第一分割结果和各个第二分割结果之间的交并比(Intersection-over-Union,IOU)。IOU值可以用于表征原始背景图像和虚拟背景图像在结构和内容上的相似性。
接着,终端设备100对原始背景图像进行颜色分布曲线统计,得到第一颜色分布曲线;对虚拟背景库中的各个虚拟背景图像进行颜色分布曲线统计,得到第二颜色分布曲线。再计算第一颜色分布曲线和各个第二颜色分布曲线之间的曲线相似度。该曲线相似度可以用于表征原始背景图像和各个虚拟背景图像之间的颜色相似性。
最后,终端设备100根据IOU值和曲线相似度,确定出待推荐虚拟背景图像。
具体地,预先设置IOU值的第一权重和曲线相似度的第二权重。针对每个虚拟背景图像,将该虚拟背景图像的IOU值和第一权重相乘,得到第一乘积,将曲线相似度和第二权重相乘,得到第二乘积;然后再将各个虚拟背景图像的第一乘积和第二乘积相加,得到每个虚拟背景图像的推荐评分。最后,按照每个虚拟背景图像的推荐评分的高低进行排序,筛选前K个虚拟背景图像作为待推荐虚拟背景图像,K为正整数。
举例来说,假设虚拟背景库中包括图5中的4幅虚拟背景图像。通过上述示例的虚拟背景推荐过程,手机分别计算图5中各个虚拟背景图像的推荐评分,并且,推荐评分由高到底依次为虚拟背景图像412、虚拟背景图像413、虚拟背景图像414以及虚 拟背景图像411。K=3,即筛选前3个虚拟背景图像作为待推荐虚拟背景图像。
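下面给出上述推荐评分计算与前K筛选的一个简化示例（Python）。其中颜色分布曲线用按通道统计的归一化直方图近似、曲线相似度用余弦相似度近似，IOU值假定已按语义分割结果另行计算；这些具体取法均为示意性假设：

```python
import numpy as np

def color_curve(img: np.ndarray, bins: int = 32) -> np.ndarray:
    # 按 R、G、B 三个通道分别统计归一化直方图并拼接，作为颜色分布曲线
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
             for c in range(3)]
    return np.concatenate(hists)

def curve_similarity(c1: np.ndarray, c2: np.ndarray) -> float:
    # 用余弦相似度衡量两条颜色分布曲线之间的相似程度
    return float(np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2) + 1e-8))

def recommend(iou_values, curve_sims, w_iou=0.5, w_curve=0.5, k=3):
    # 推荐评分 = IOU 值 × 第一权重 + 曲线相似度 × 第二权重，按评分从高到低取前 K 个
    scores = [w_iou * i + w_curve * c for i, c in zip(iou_values, curve_sims)]
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return order[:k]
```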
参见图6示出的背景推荐过程的界面示意图,如图6所示,还以图4中示出的视频通话场景为基础,手机确定出待推荐虚拟背景图像之后,当用户需要通过魔术笔47进行背景替换时,手机响应于用户针对魔术笔47的操作,在接听界面上显示窗口49。并且,手机按照推荐评分的高低,在窗口49中从左往右依次显示虚拟背景图像412、虚拟背景图像413、虚拟背景图像414以及虚拟背景图像411。并且,在虚拟背景图像412、虚拟背景图像413、虚拟背景图像414的显示位置均加了一个方框,以起到提示用户的作用,即通过加方框的形式向用户推荐虚拟背景图像412~414。
用户点击窗口49中所推荐的虚拟背景图像412之后,手机响应于用户的点击操作,根据虚拟背景图像412和主体图像进行图像渲染,得到背景替换后的输出图像,并将输出图像显示在显示屏上,得到背景替换后的界面415。
需要说明,虚拟背景推荐的提示方式是任意的,并不限于图6示出的加方框的形式。例如,可以根据推荐评分高低,给虚拟背景图像加上不同颜色的方框,或者给每个待推荐虚拟背景图像增加箭头指示符,以引导用户选择待推荐虚拟背景图像。
还有,除了可以在窗口49中进行虚拟背景推荐之外,还可以通过弹窗等形式进行虚拟背景推荐,例如,手机确定出待推荐虚拟背景图像之后,在接听界面48中主动弹出一个窗口,并在该窗口中显示待推荐虚拟背景图像。
另外,虚拟背景推荐过程并不限于上文提及的过程。例如,在根据IOU值和曲线相似度,确定待推荐虚拟背景图像过程中,可以筛选出推荐评分高于预设分数阈值的虚拟背景图像作为待推荐虚拟背景图像。
相较而言,如果不进行虚拟背景推荐,用户所选择的虚拟背景图像,可能在颜色、结构和内容上与原始背景图像的颜色、结构和内容相差较大,进而使得虚拟背景替换效果较差。而基于原始背景图像和虚拟背景图像之间的相似性,给用户推荐虚拟背景,可以使得目标虚拟背景图像的和原始背景图像更加相关,即目标虚拟背景图像的颜色、结构和内容与原始背景图像的颜色、结构和内容更相近,进一步提高了虚拟背景替换效果,也提高了用户体验。
上文示出的确定目标虚拟背景图像的方式均需要人为选择,而在其它一些实施例中,终端设备100也可以主动确定目标虚拟背景图像,不用人为参与。此时,终端设备100可以随机选择一幅虚拟背景图像作为目标虚拟背景图像,也可以通过上文提及的虚拟背景推荐过程,计算出虚拟背景库中各个虚拟背景图像的推荐评分,选择推荐评分最高的虚拟背景图像作为目标虚拟背景图像。
步骤S304、终端设备100根据主体图像和目标虚拟背景图像,进行图像渲染,得到输出图像。该输出图像为背景替换后的图像。
由于背景替换过程中涉及图像渲染过程,背景替换后的图像看作图像渲染后的图像。
具体地,终端设备100在确定出目标虚拟背景图像之后,根据主体图像和目标虚拟背景图像进行一致性渲染,得到背景替换后的图像,并将该图像显示在显示屏上。
需要说明的是,本申请实施例中的一致性渲染是指在考虑了原始视频图像和目标虚拟背景图像之间的相关性的图像渲染过程。而原始视频图像和目标虚拟背景图像之 间的相关性可以包括以下至少一项:内容、遮挡、位置和交互等。
其中,内容是指图像内容,即背景替换后的图像的图像内容和原始视频图像的图像内容相一致。具体可以通过底层一致性渲染过程让背景替换后的图像在图像内容上与原始视频图像相一致。
底层特征可以是指图像的颜色、色调、亮度、对比度等特征,针对底层特征的渲染可以称为底层一致性渲染。通过底层一致性渲染,可以使得主体图像的颜色、色调、亮度以及对比度等和目标虚拟背景图像颜色、色调、亮度以及对比度等相一致,进而使得背景替换后的图像中的主体图像部分和背景的颜色、色调、亮度以及对比度等相一致,另外,还使得背景替换后的图像和原始视频图像的图像内容相一致,从而进一步提高了虚拟背景替换效果。
遮挡是指原始背景图像中的主体和物体之间是否存在遮挡关系,当存在遮挡关系时,背景替换后的图像中也要体现对应的遮挡关系,以让背景替换后的图像与原始视频图像在主体的遮挡关系上相一致。
具体应用中,可以通过基于结构语义的一致性渲染过程,使得背景替换后的图像和原始视频图像在主体遮挡关系上相一致。通过基于结构语义的高层一致性渲染可以确定出主体是否被遮挡,并且在主体被遮挡的情况下,确定出需要渲染的虚拟物体。
位置是指背景替换后的图像中的主体和物体之间的位置关系，具体可以体现为主体在虚拟背景图像中的渲染位置。通过基于位置关系的一致性渲染可以确定出主体图像在目标虚拟背景图像中的位姿，并将主体图像渲染在待渲染虚拟背景图像中的合理位置。
交互是指原始视频图像中的主体和物体之间的交互关系,具体可以体现在主体是否执行了预设交互动作,如果执行了预设交互动作,主体和对应的物体之间则存在交互关系。相应地,背景替换后的图像中也要体现主体和对应物体的交互关系,以让背景替换后的图像和原始视频图像在交互关系上相一致。通过基于交互关系的一致性渲染可以在主体做出预设动作时,增加或减少相应的虚拟物体。例如,主体为人时,当原始视频图像中的人做出“坐下”动作时,在图像的合理位置渲染椅子或凳子等虚拟物体,以增加背景替换后的真实感,提高虚拟背景替换效果。
在图像渲染过程中,终端设备100可以基于底层特征、结构语义、位置关系以及交互关系中的至少一种进行一致性渲染。
换句话说,终端设备100基于目标虚拟背景图像和主体图像,得到背景替换后的图像的过程中,可以执行以下至少一种一致性渲染过程:底层一致性渲染、基于结构语义的一致性渲染、基于位置关系的一致性渲染以及基于交互关系的一致性渲染。
下面分别对各个一致性渲染过程进行介绍。
基于结构语义的一致性渲染的过程可以如下:
首先,终端设备100可以对原始视频图像进行多类语义分割,得到多类语义分割结果。
可以理解,上述原始视频图像通常是指上述步骤S302中的原始视频图像。
接着,终端设备100再根据原始视频图像中各个类别的深度信息,确定主体是否被遮挡。深度信息可以表征各个类别的前后关系,根据深度信息可以确定主体前面是否有其他物体。如果确定主体前面有其他物体遮挡,可以进一步根据多类语义分割结 果确定出主体前面的其他物体的类别。
例如,原始视频图像中的主体为人,通过原始视频图像的深度信息,确定出原始视频图像中的人前面有其他物体。再通过多类语义分割结果确定出人前面的其他物体为桌子,则可以确定出人被桌子遮挡,即人与桌子存在遮挡关系。
其中,通过多类语义分割结果可以得知原始视频图像中每个像素点的类别,而通过原始视频图像的深度图可以得知每个像素点离摄像头的远近距离(即深度值)。通过多类语义分割结果和深度信息,确定出主体所在的位置以及主体的深度值之后,在主体周边的预设范围内查找是否存在小于主体深度值的像素点,如果存在,将深度值小于主体深度值的像素点映射到多类语义分割结果中,确定出这些像素点的类别,从而确定出遮挡物体的类别。
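下面给出根据深度信息与多类语义分割结果判断主体是否被遮挡、并确定遮挡物体类别的一个简化示例（Python）。其中“主体周边的预设范围”用主体包围盒外扩 margin 像素近似，主体深度值用主体区域深度的中位数近似，这些具体取法均为示意性假设：

```python
import numpy as np

def detect_occlusion(depth: np.ndarray, semantic_map: np.ndarray,
                     subject_mask: np.ndarray, margin: int = 20):
    """depth 为深度图 (H, W)，semantic_map 为逐像素类别 id，subject_mask 为主体的
    布尔掩码。返回遮挡物体的类别 id；若主体未被遮挡则返回 None。"""
    if not subject_mask.any():
        return None
    subject_depth = float(np.median(depth[subject_mask]))   # 近似主体的深度值

    # 用主体包围盒外扩 margin 像素近似"主体周边的预设范围"
    ys, xs = np.where(subject_mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, depth.shape[0] - 1)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, depth.shape[1] - 1)
    region = np.zeros_like(subject_mask)
    region[y0:y1 + 1, x0:x1 + 1] = True
    region &= ~subject_mask                                  # 排除主体自身的像素

    occluder = region & (depth < subject_depth)              # 深度更小即位于主体之前
    if not occluder.any():
        return None                                          # 不存在目标像素点，未被遮挡
    classes, counts = np.unique(semantic_map[occluder], return_counts=True)
    return int(classes[np.argmax(counts)])                   # 取占比最大的类别作为遮挡物体类别
```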
终端设备100确定出原始视频图像中的主体被遮挡,且遮挡主体的物体类别之后,再确定第一待渲染虚拟物体。例如,原始视频图像中的主体为人,人的前面有桌子遮挡。此时,终端设备100可以推荐与桌子类似或相关的虚拟物体,该虚拟物体可以例如为预先录入到虚拟物体库里面的各款式桌子。用户可以根据自己的需要从推荐的虚拟物体中选择其中一个或多个作为第一待渲染虚拟物体。
当然,终端设备100也可以不执行虚拟物体推荐过程,而是直接从虚拟物体库中选择相关的虚拟物体作为第一待渲染虚拟物体。
终端设备100确定出第一待渲染虚拟物体之后,可以将该第一待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到新的虚拟背景图像。例如,第一待渲染虚拟物体为桌子,将桌子作为前景,与目标虚拟背景图像进行图像渲染,得到新的虚拟背景图像。
其中,在将第一待渲染虚拟物体渲染到目标虚拟背景图像的过程中,可以将遮挡物体在原始背景图像中的中心位置作为第一待渲染虚拟物体的渲染初始位置。当然,用户也可以自主确定第一待渲染虚拟物体的渲染位置。例如,用户可以通过拖拽的方式调整第一待渲染虚拟物体的渲染位置。
终端设备100如果根据多类语义分割结果和深度信息,确定原始视频图像中的主体没有被其他物体遮挡,则可以不用确定出第一待渲染虚拟物体,不用在目标虚拟背景图像中渲染第一待渲染虚拟物体。
示例性地,参见图7示出的基于结构语义的一致性渲染的效果示意图,如图7中的(a)所示,图像71为原始视频图像,该图像包含主体72和桌子73。根据图像71的多类语义分割结果,以及深度信息,确定出主体72和桌子73之间存在遮挡关系,即主体72被桌子73遮挡了部分。确定出主体72和桌子73存在遮挡关系之后,则可以进一步确定出第一待渲染虚拟物体。
参见图7中的(b),图像74为目标虚拟背景图像。假设虚拟物体库中包括物体75~77,而物体75为系统推荐的虚拟物体,具体可以通过加方框的形式提示用户。
假设用户选择物体75作为第一待渲染虚拟物体,此时,可以先将图像74和物体75进行渲染,得到待渲染虚拟背景图像,然后将待渲染背景图像和主体72进行合成,得到背景替换后的图像,该图像如图7中的(c)中的图像78。通过对比可知,图像71中的主体72和桌子73之间存在遮挡关系,图像78中的主体72和物体75之间存 在遮挡关系,使得背景替换后的图像与原始视频图像在遮挡关系上相一致。
基于交互关系的一致性渲染的过程可以如下:
终端设备100可以基于连续多帧原始视频图像,进行动作识别。如果识别到原始视频图像中的主体做出了预设动作,则确定出与预设动作关联的第二待渲染虚拟物体。其中,预设动作可以根据实际应用设定,例如,预设动作为“坐下”以及“手拿物体”等。动作识别方式可以是现有任意方式,在此不作限定。
终端设备100在识别出预设动作后,可以从虚拟物体库中选取与预设动作关联的物体,以确定出第二待渲染虚拟物体。其中,可以预先设置每个预设动作和虚拟物体之间的关联关系,后续直接通过预设动作选取虚拟物体。例如,预设动作包括“坐下”以及“手拿物体”,“坐下”对应的虚拟物体为椅子或凳子,“手拿物体”对应的虚拟物体为杯子。
此时,用户也可以在初始化阶段预先设置好需要使用的虚拟物体,终端设备100在识别出预设动作后,自动从用户设置的虚拟物体中选取出对应的虚拟物体。例如,在视频通话开始之前,用户预先设置视频通话过程中可能需要使用哪些虚拟物体;设置完成之后,在视频通话过程中,手机识别出预设动作后,从所设置的虚拟物体中选取对应的虚拟物体。
在其它一些实施例中,终端设备100在识别出预设动作之后,也可以向用户推荐虚拟物体,由用户选择所需要的虚拟物体。但是,由用户选择虚拟物体需要耗费一定的时间,会导致虚拟物体渲染具有一定的滞后性,因此,为了确保主体在做出预设动作后,可以及时在相应位置渲染对应的虚拟物体,通常情况下不用用户选择虚拟物体,而是由终端设备自主确定出需要渲染的虚拟物体。
终端设备100确定出第二待渲染虚拟物体之后,需要进一步确定出该第二待渲染虚拟物体的渲染位置。具体地,终端设备100对原始视频图像进行多类语义分割,根据多类语义分割结果,确定出原始视频图像中的交互物体的位置。交互物体是指与预设动作对应的物体,例如,原始视频图像中的主体为人,人做出了“坐下”的动作,即人坐在了一张椅子上,该椅子则为上述交互物体。
通过多类语义分割结果可以得知图像中包含的类别以及各个类别所处位置,因此通过原始视频图像的多类语义分割结果可以得知交互物体在原始视频图像中的所处位置。
根据交互物体在原始视频图像中的位置,确定第二待渲染虚拟物体在待渲染虚拟背景图像中的渲染位置。例如,交互物体为椅子,椅子在原始视频图像中的像素位置为第一位置;将待渲染虚拟背景图像中与第一位置相对应的位置作为渲染位置。
同时,终端设备100还可以根据原始视频图像中各个类别的深度信息,确定出原始视频图像中的交互物体和主体之间的前后位置关系,即确定交互物体和主体哪个在前,哪个在后。并依据前后位置关系,设置第二待渲染虚拟物体和主体之间的渲染顺序。例如,如果主体在前,交互物体在后,渲染顺序则为:先渲染交互物体,再渲染主体。
如果渲染顺序为:先渲染交互物体,再渲染主体,终端设备100确定出第二待渲染虚拟物体之后,可以将第二待渲染虚拟物体作为前景,与目标虚拟背景图像进行渲 染,得到新的虚拟背景图像。如果没有识别出预设动作,则不用在目标虚拟背景图像中渲染第二待渲染虚拟物体。
如果渲染顺序为:先渲染主体,再渲染交互物体,终端设备100确定出第二待渲染虚拟物体之后,在主体图像和待渲染虚拟背景图像进行融合之后,再将第二待渲染虚拟物体渲染在融合之后的图像中,得到最终的输出图像。
参见图8示出的基于交互关系的一致性渲染的效果示意图,如图8所示,图像81和图像82为原始视频图像,这两个图像在原始视频流中的时序先后顺序为:图像81在先,图像82在后。
图像83和图像84为使用现有的虚拟背景替换方式得到的背景替换后的输出图像,其中,图像83与图像81对应,图像82与图像84对应。
图像85和图像86为使用上述基于交互关系的一致性渲染后得到的背景替换后的图像。其中,图像85和图像81对应,图像86和图像82对应。
针对图像81和图像82来说,图像81中的人物主体87处于站立状态;图像82中的人物主体87处于坐下状态,即人物主体87从站立到坐下。
以图像83和图像84作为第一组图像,图像85和图像86作为第二组图像,对比第一组图像和第二组图像可知,第一组图像中,当人的状态从站立变为坐下后,虚拟背景图像中并没有在相应位置渲染椅子,使得背景替换后的图像中出现人物主体87坐在空中的不合理性现象。而第二组图像中,由于使用上述基于交互关系的一致性渲染之后,当识别到“坐下”这个动作时,在相应位置渲染一张椅子88,使得背景替换后的图像中,人物主体87坐在椅子88上,合理性较好。
底层一致性渲染过程:
底层一致性渲染可以包括两种不同的实现方式,下面分别对两种不同的底层一致性渲染方式进行介绍。
方式一:
终端设备100将待渲染虚拟背景图像的低频图像,以及原始视频图像输入到预先训练好的风格迁移模型,风格迁移模型的输出为底层一致性渲染后的原始视频图像。
图像风格迁移(style transfer)是指利用算法学习某一张图片的风格,然后再把这种风格应用到另外一张图片上,或者说将一张图片的风格迁移到另一张图片。
风格迁移模型是指用于实现图像风格迁移的模型,即通过风格迁移模型可以将待渲染虚拟背景图像的风格迁移至原始视频图像。
参见图9示出的基于模型的底层一致性渲染过程的流程示意图,该过程可以包括以下步骤:
步骤S901、终端设备100获取待渲染虚拟背景图像的低频图像。
具体应用中,在确定出待渲染虚拟背景图像之后,则可以生成该待渲染虚拟背景图像的低频图像。
步骤S902、终端设备100将待渲染虚拟背景图像的低频图像和原始视频图像输入至预先训练完成的风格迁移模型,获得风格迁移模型输出的底层一致性渲染后的原始视频图像。
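本申请未限定低频图像的具体生成方式。作为一种常见做法的示意（假设），可以对待渲染虚拟背景图像做高斯低通滤波，得到忽略纹理等高频细节的低频图像：

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def low_frequency_image(img: np.ndarray, sigma: float = 8.0) -> np.ndarray:
    """对每个颜色通道做高斯低通滤波，得到低频图像；sigma 的取值仅为示意。"""
    out = np.stack([gaussian_filter(img[..., c].astype(np.float32), sigma)
                    for c in range(img.shape[-1])], axis=-1)
    return np.clip(out, 0, 255).astype(np.uint8)
```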
参见图10示出的风格迁移模型训练过程示意图,风格迁移模型的训练过程可以如 下:
首先,构建出风格迁移模型。
接着,对风格迁移模型进行正向训练。
正向训练时,风格迁移模型的输入为原始视频图像和虚拟背景图像的低频图像,输出为正向训练结果。此时,虚拟背景图像是指训练数据集中的用于背景替换的图像。
每次得到正向训练结果之后,将正向训练结果转到颜色-对立空间(Lab color space,LAB)后,再计算正向训练结果在LAB域的第一方差和第一均值;将虚拟背景图像转到LAB空间后,再计算虚拟背景图像在LAB域的第二方差和第二均值。最后,根据第一方差和第二方差,以及第一均值和第二均值,分别计算出正向训练结果和虚拟背景图像在LAB域的均值差异和方差差异。
LAB域的均值差异和方差差异作为模型输出和输入之间的损失值(Loss),即正向训练时的Loss为LAB域的Loss。通过LAB域的Loss可以约束正向训练结果和虚拟背景之间的色彩、亮度和饱和度等的相似性。
计算出损失值之后,根据损失值调整风格迁移模型的网络参数。通过正向训练,可以让主体图像或者原始视频图像的色彩、亮度、饱和度等与虚拟背景图像的色彩、亮度、饱和度等相一致。
然后,基于正向训练的结果,对正向训练后的风格迁移模型进行反向训练,得到训练完成的风格迁移模型。
在反向训练过程时,将原始视频图像的低频图像以及正向训练结果输入到正向训练后的风格迁移模型,模型的输出为反向训练结果。
每次得到反向训练结果之后,计算反向训练结果和原始视频图像的低频图像之间在LAB域的Loss。具体计算过程可以如下:将反向训练结果和原始视频图像的低频图像均转到LAB空间,并计算反向训练结果在LAB域的方差和均值,计算原始视频图像的低频图像在LAB域的方差和均值,根据反向训练结果的方差和均值,以及原始视频图像的低频图像的方差和均值,计算出反向训练结果和原始视频图像的低频图像之间的均值差异和方差差异,将反向训练结果和原始视频图像的低频图像之间在LAB域的均值和方差差异作为LAB域的Loss。同时,计算反向训练结果和原始视频图像之间的损失值。再将LAB域的Loss,以及反向训练结果和原始视频图像之间的损失值进行加权后,得到一个损失值,再根据该损失值调整正向训练后的风格迁移模型的网络参数。
依此迭代训练多次,直到正向训练的损失值和反向训练的损失值趋于稳定,则认为训练完成,得到最终的风格迁移模型。
通过反向训练,可以让主体图像或者原始视频图像的内容和待渲染虚拟背景图像的内容相一致。
本申请实施例中,一次训练包括正向训练过程和反向训练过程,即在当次训练过程中,进行一次正向训练过程后,再基于正向训练的结果进行反向训练过程,同理,在下一次训练过程中,依然是先进行一次正向训练过程,再进行反向训练。依次迭代训练多次,得到训练完成的风格迁移模型。
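下面给出正向训练中LAB域损失（均值差异与方差差异）的一个简化示例（PyTorch）。RGB到LAB的可微转换此处假定由外部函数提供（例如可使用 kornia.color.rgb_to_lab），两项差异直接相加的加权方式也是示意性假设；反向训练时可再与反向训练结果和原始视频图像之间的内容损失加权求和：

```python
import torch

def lab_statistics_loss(pred_lab: torch.Tensor, ref_lab: torch.Tensor) -> torch.Tensor:
    """pred_lab / ref_lab 形状均为 (N, 3, H, W)，为已转换到 LAB 空间的张量。
    分别计算 L、A、B 通道的均值差异与方差差异并求和，用于约束色彩、亮度、饱和度的相似性。"""
    mean_diff = (pred_lab.mean(dim=(2, 3)) - ref_lab.mean(dim=(2, 3))).abs().mean()
    var_diff = (pred_lab.var(dim=(2, 3)) - ref_lab.var(dim=(2, 3))).abs().mean()
    return mean_diff + var_diff
```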
需要说明,风格迁移模型的训练过程可以在终端设备100上进行,也可以不在终 端设备100上进行,而是在其他设备训练完成后再加载到终端设备100。
现有的风格迁移模型训练过程中,正向训练和反向训练的模型不一样,权重不一样,即反向训练不是基于正向训练得到的模型和结果的。这样,使用训练完成的模型进行风格迁移后,风格迁移后的图像中的图像内容与原图像的图像内容不一致。
并且,不同风格对应不同的风格迁移模型,例如,有3幅不同的背景图像,则需要3个不同的风格迁移模型,才能分别将这3幅背景图像的风格迁移至对应的图片上。
而上述方式一,正向训练和反向训练的模型一样,权重一样,即反向训练是基于正向训练后的风格迁移模型和正向训练结果。这样,在风格迁移时,不仅能保证风格迁移后的图像与待渲染虚拟背景图像的色彩、亮度、饱和度等相一致,还可以保证风格迁移后的图像中的图像内容,与原始视频图像的图像内容基于相一致,渲染效果更佳。风格迁移后的图像是指底层一致性渲染得到的图像。
换句话说,现有技术正向训练和反向训练不是同一个模型,导致风格迁移后的图像内容跟原始视频图像的内容不一致。而上述方式一正向训练和反向训练均是同一个模型,所以风格迁移后的图像内容和原始视频图像的图像内容一样。此外,在模型应用阶段,现有技术的风格迁移模型的输入为待渲染虚拟背景图像,而上述方式一的模型输入为原始视频图像和待渲染虚拟背景图像的低频图像。
另外,上述方式一中,不同风格对应同一个风格迁移模型,例如,有3幅不同的背景图像,则需要一个风格迁移模型,则可以分别将这3幅背景图像的风格迁移至对应的图片上。也就是说,通过上述方式一的模型训练方式得到的风格迁移模型,可以将多种不同风格迁移到另一张图片上。
还需要说明,上文介绍的模型训练过程和底层一致性渲染过程,均是以原始视频图像进行介绍的。在另一些实施例中,原始视频图像也可以替换为主体图像,即图9和图10中的原始视频图像可以替换成主体图像。例如,从原始视频图像中提取出主体图像,然后将主体图像和待渲染虚拟背景图像的低频图像输入到风格迁移模型中,进行正向训练。又例如,在模型实际应用阶段,将主体图像和待渲染虚拟背景图像的低频图像输入到已训练完成的风格迁移模型,获得风格迁移模型输出的底层一致性渲染后的主体图像。
还有,在模型应用阶段和模型训练阶段均采用待渲染虚拟背景图像的低频图像,低频图像可以忽略掉纹理等底层特征,可进行多风格的反向训练,再加上正向训练和反向训练为同一模型,降低了失真。
相较而言,现有技术中,一般是采用固定色温、固定光照进行图像渲染,即给前景渲染固定方向的虚拟光照,或者给前景渲染固定的色温,可能会导致背景替换后的图像中的主体和背景的颜色、色温、亮度、对比度等不一致,背景替换效果较差。
而通过上述方式一或方式二的底层一致性渲染,让背景替换后的图像中的主体和背景的颜色、色温、亮度、对比度等相一致,背景替换效果更佳。
参见图11示出的底层一致性渲染的效果示意图,如图11所示,图像111为原始视频图像,图像112为虚拟背景图像,图像113为使用上述方式一得到的背景替换后的图像,图像114为使用现有方式得到的背景替换后的图像。图像111中的背景颜色和色调等主要是以海洋蓝为主,且图像111中的人的衣服颜色为颜色1,例如颜色1 为白色。图像112中的背景颜色和色调等主要以落日黄为主。
使用上述方式一:
可以使得主体图像和虚拟背景图像在颜色、色调、对比度和色温上相一致。具体表现为:图像113中的人(即主体图像)的衣服颜色为颜色2,例如颜色2为黄色,即图像113中人的颜色、色温等和背景的颜色、色温等相一致,主体和背景的一致性较高。
而使用现有方式,主体图像和虚拟背景图像在颜色、色调、对比度和色温等上相差较大。具体表现为:图像114中的人的衣服颜色为颜色1,即与原始视频图像中的主体图像一致。这样,图像114中的主体图像和背景在颜色、色温上相差较大,主体和背景的一致性较差。
方式二:
参见图12示出的基于图像处理算法的底层一致性渲染过程示意图,如图12所示,首先,终端设备100将待渲染虚拟背景图像从RGB色彩空间转到LAB色彩空间,得到LAB色彩空间下的第一图像,再计算第一图像的L、A、B三个通道的第一标准差(std)和第三均值(mean)。每个通道均有各自对应的标准差和均值。
然后,终端设备100将主体图像或者原始视频图像转到LAB色彩空间,得到LAB色彩空间下的第二图像,并根据第一图像的第一标准差和第三均值,修正第二图像的标准差和均值,得到第三图像。
具体地,将第二图像的L、A、B三个通道的标准差分别设置为第一图像中的对应通道的第一标准差,或者让第二图像的L、A、B三个通道的标准差与第一图像中的对应通道的第一标准差之间的差值在预设阈值区间内;将第二图像的L、A、B三个通道的均值分别设置为第一图像中的对应通道的第三均值,或者让第二图像的L、A、B三个通道的均值和第一图像中的对应通道的第三均值之间的差值在预设阈值区间内。
也就是说，第三图像的L、A、B三个通道标准差等于第一图像中对应通道的第一标准差或者接近于第一标准差，第三图像的L、A、B三个通道均值等于第一图像中对应通道的第三均值或者接近于第三均值。
最后,将第三图像从LAB色彩空间转到RGB色彩空间,得到第四图像,该第四图像即为底层一致性渲染后的图像。
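方式二与经典的按通道对齐LAB统计量的颜色迁移做法类似。下面给出阈值区间取0（即标准差、均值完全对齐）情形的一个简化示例（Python + OpenCV）；示例对整幅主体图像统计均值与标准差，实际实现中也可以只在主体掩码内统计，此处均为示意性假设：

```python
import cv2
import numpy as np

def lab_stat_transfer(subject_rgb: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """subject_rgb 为主体图像、bg_rgb 为待渲染虚拟背景图像，均为 uint8 RGB。
    把主体图像各 LAB 通道的均值/标准差修正为背景对应通道的均值/标准差。"""
    bg_lab = cv2.cvtColor(bg_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)         # 第一图像
    subj_lab = cv2.cvtColor(subject_rgb, cv2.COLOR_RGB2LAB).astype(np.float32)  # 第二图像

    out = np.empty_like(subj_lab)
    for c in range(3):  # 分别处理 L、A、B 三个通道
        s_mean, s_std = subj_lab[..., c].mean(), subj_lab[..., c].std() + 1e-6
        b_mean, b_std = bg_lab[..., c].mean(), bg_lab[..., c].std()
        out[..., c] = (subj_lab[..., c] - s_mean) / s_std * b_std + b_mean      # 第三图像

    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2RGB)                                  # 第四图像
```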
参见图13示出的基于图像处理算法的底层一致性渲染的效果示意图,如图13所示,图像131为虚拟背景图像,图像132为原始视频图像,图像133为使用上述方式二得到的背景替换后的图像,图像134为使用现有技术得到的背景替换后的图像。
通过对比图像133和图像134可知,图像133的人(即主体图像)135的颜色、色温、亮度和对比度等与图像131的颜色、色温、亮度和对比度等相一致,即图像133中的人和背景在颜色、亮度、对比度、色温等的一致性更佳,进而使得背景替换效果更佳。而图像134中的人135的颜色、色温、亮度和对比度与图像132中的色调、色温、亮度和对比度相一致,与图像131的颜色、色温、亮度和对比度等相差较大,进而导致图像134中的人和背景在颜色、对比度、色温等上相差较大,即人和背景的一致性较差,进而使得背景替换效果较差。
基于位置关系的一致性渲染过程:
参见图14示出的基于位置关系的一致性渲染过程示意图,该基于位置关系的一致性渲染的过程可以如下:
首先,将主体图像输入到第一空间变化网络(Spatial Transformer Network,STN),获得第一STN网络输出的第一变化矩阵;将待渲染虚拟背景图像输入第二STN网络,获得第二STN网络输出的第二变化矩阵。
其中,如果进行底层一致性渲染过程,主体图像可以是底层一致渲染后的主体图像,而如果底层一致性渲染过程的结果是底层一致性渲染后的原始视频图像,则从该原始视频图像中提取出主体图像,以得到底层一致性渲染后的主体图像。
然后,使用第一变化矩阵对主体图像进行图像仿射变换(Warp),得到Warp后的主体图像。使用第二变化矩阵对待渲染虚拟背景图像进行Warp,得到Warp后的待渲染虚拟背景图像。
最后,将Warp后的主体图像和Warp后的待渲染虚拟背景图像进行前后景融合,得到合成图像。
通过STN网络，调整前景(即主体图像)和待渲染虚拟背景图像的相互旋转、平移或缩放，并进行裁剪。
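下面给出基于位置关系的一致性渲染的一个简化示例（PyTorch）：两个已训练完成的 STN 网络分别输出 2×3 仿射变化矩阵，对主体图像和待渲染虚拟背景图像分别做 Warp，再用主体掩码做前后景融合。其中 stn_fg / stn_bg 的接口形式与用掩码做 alpha 融合的方式均为示意性假设：

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # 使用 2x3 仿射变化矩阵 theta 对图像做 Warp，image 形状为 (N, C, H, W)
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

def position_consistent_render(subject, subject_mask, virtual_bg, stn_fg, stn_bg):
    theta_fg = stn_fg(subject)        # 第一变化矩阵，形状 (N, 2, 3)
    theta_bg = stn_bg(virtual_bg)     # 第二变化矩阵
    warped_fg = warp(subject, theta_fg)
    warped_mask = warp(subject_mask, theta_fg)   # 主体掩码随主体一起 Warp
    warped_bg = warp(virtual_bg, theta_bg)
    # 以 Warp 后的主体掩码做简单 alpha 融合，得到合成图像
    return warped_mask * warped_fg + (1 - warped_mask) * warped_bg
```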
可以理解的是,上述第一STN网络和第二STN网络均是预先训练完成的网络。其中,STN网络的训练过程采用对抗学习的方式,具体过程可以如下:
参见图15示出的STN网络训练过程示意图,如图15所示,将训练用的主体图像输入至预先构建的第三STN网络,得到第三STN网络输出的变化矩阵H0;将训练用的虚拟背景图像输入至预先构建的第四STN网络,得到第四STN网络输出的变化矩阵H1。此时,训练用的虚拟背景图像是指训练数据中用于背景替换的图像。其中,该主体图像可以是底层一致性渲染后的图像,也可以不是底层一致性渲染后的图像。
使用变化矩阵H0对训练用的主体图像进行Warp,使用变化矩阵H1对训练用的背景图像进行Warp;再将Warp后的主体图像和Warp后的虚拟背景图像进行前后景融合,得到合成图像。
最后,将合成图像输入到判别器。判别器通过判别该合成图像和真实图像之间的差异大小,以判定该合成图像的好坏。合成图像和真实图像的差异越小,合成图像越好,反之,差异越大,合成图像越坏。
当判别器认为输入的合成图像和真实图像相同时,则认为STN网络训练完成,得到上述第一STN网络和第二STN网络。
可以理解,STN网络的训练过程可以在终端设备100上进行,也可以是在其它设备上进行。
参见图16示出的基于位置关系的一致性渲染的效果示意框图,如图16所示,图像161、图像162和图像163均为背景替换后的图像,其中,图像161和图像162的图像渲染过程没有进行基于位置关系的一致性渲染过程,图像163的图像渲染过程进行了上述基于位置关系的一致性渲染过程。通过对比可知,图像161和图像162中,没有将主体164合理地渲染在桌子165上,导致背景替换后的图像中出现主体悬空等不合理现象。而图像163中,主体164合理地渲染在桌子165上,合理性和真实性更佳。
在分别介绍完各种一致性渲染过程之后,下面对可能的图像渲染过程进行示例性说明。
第一种图像渲染过程:
该图像渲染过程包括基于结构语义的一致性渲染、基于交互关系的一致性渲染、底层一致性渲染渲染以及基于位置关系的一致性渲染。
参见图17示出的图像渲染过程的一种流程示意图,该图像渲染过程可以包括如下步骤:
步骤S1701、终端设备100获取原始视频图像。
可以理解的是,该原始视频图像是视频流中的一帧视频图像。
步骤S1702、终端设备100检测原始视频图像中的目标对象是否存在遮挡关系。当目标对象存在遮挡关系时,进入步骤S1703;当目标对象不存在遮挡关系时,进入步骤S1707。
需要说明的是,上述目标对象是指原始视频图像中的主体,通常情况下,该目标对象为人物主体。
具体应用中,终端设备100可以先对原始视频图像进行多类语义分割,得到多类语义分割结果,再根据多类语义分割结果和原始视频图像的深度信息,确定主体是否被其他物体遮挡。如果主体被其他物体遮挡,则确定目标对象存在遮挡关系,反之,如果主体没有被其他物体遮挡,则确定目标对象不存在遮挡关系。
步骤S1703、终端设备100确定出遮挡物体对应的第一待渲染虚拟物体。
具体应用中,终端设备100根据深度信息确定出主体被其他物体遮挡时,再通过多类语义分割结果确定出遮挡物体的类别,以及遮挡物体在原始视频图像中的所处位置。然后,根据遮挡物体的类别,确定出遮挡物体对应的第一待渲染虚拟物体,再根据遮挡物体在原始视频图像中的所处位置,确定出第一待渲染虚拟物体的渲染位置。
步骤S1704、终端设备100检测原始视频图像中的目标对象是否执行预设动作。当目标对象执行预设动作,则进入步骤S1705;当目标对象没有执行预设动作,则进入步骤S1707。
步骤S1705、终端设备100确定预设动作对应的第二待渲染虚拟物体。
步骤S1706、终端设备100确定第二待渲染虚拟物体的渲染位置和渲染顺序。
需要说明的是,当渲染顺序为:先渲染交互物体,再渲染主体时,则将第二待渲染虚拟物体和目标虚拟背景图像进行融合。
当渲染顺序为:先渲染主体,再渲染交互物体时,则不将第二待渲染虚拟物体和目标虚拟背景图像进行融合,而是在得到合成图像之后,再对第二待渲染虚拟物体和合成图像进行前后景融合。
如果同时存在两类第二待渲染虚拟物体,第一类第二待渲染虚拟物体的渲染顺序为:先渲染交互物体,再渲染主体,第二类第二待渲染虚拟物体的渲染顺序为:先渲染主体,再渲染交互物体。此时,将第一类第二待渲染虚拟物体和目标虚拟背景图像进行融合,将第二类第二待渲染虚拟物体和合成图像进行融合。
步骤S1707、终端设备100确定待渲染虚拟背景图像。
在一些实施例中,可以将目标待渲染虚拟物体作为前景,与目标虚拟背景图像进 行前后景融合,得到待渲染虚拟背景图像。
该目标待渲染虚拟物体可以包括第一待渲染虚拟物体和/或第二待渲染虚拟物体。
当检测出目标对象存在遮挡关系,且检测出目标对象没有执行预设动作,目标待渲染虚拟物体只包括第一待渲染虚拟物体,此时,根据遮挡物体在原始视频图像中的所处位置,将第一待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到待渲染虚拟背景图像。
当检测出目标对象存在遮挡关系,且检测出目标对象执行预设动作时,如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染交互物体,再渲染主体,目标待渲染虚拟物体则包括第一待渲染虚拟物体和第二待渲染虚拟物体。此时,根据第二待渲染虚拟物体的渲染位置,以及遮挡物体在原始视频图像中的所处位置,将第一待渲染虚拟物体和第二待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到待渲染虚拟背景图像。
当检测出目标对象存在遮挡关系,且检测出目标对象执行预设动作时,如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染主体,再渲染交互物体,目标待渲染虚拟物体则包括第一待渲染虚拟物体。此时,根据遮挡物体在原始视频图像中的所处位置,将第一待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到待渲染虚拟背景图像。
当检测出目标对象不存在遮挡关系,且检测出目标对象执行预设动作时,如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染交互物体,再渲染主体,目标待渲染虚拟物体则包括第二待渲染虚拟物体。此时,根据第二待渲染虚拟物体的渲染位置,将第二待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到待渲染虚拟背景图像。
在另一些实施例中,也可以直接将目标虚拟背景图像作为待渲染虚拟背景图像。
当检测出目标对象不存在遮挡关系,且检测目标对象执行预设动作时,如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染主体,再渲染交互物体,此时,直接将目标虚拟背景图像作为待渲染虚拟背景图像。
当检测出目标对象不存在遮挡关系,且检测出目标对象没有执行预设动作时,直接将目标虚拟背景图像作为待渲染虚拟背景图像。
在确定出待渲染虚拟背景图像之后,则可以根据待渲染虚拟背景图像和主体图像,进行图像渲染,以得到背景替换后的图像。
步骤S1708、终端设备100根据待渲染虚拟背景图像和原始视频图像,进行底层一致性渲染。
终端设备100在确定出待渲染虚拟背景图像之后,可以进行底层一致性渲染过程。底层一致性渲染过程可以参见上文,在此不再赘述。
步骤S1709、终端设备100根据底层一致性渲染后的主体图像和待渲染虚拟背景图像,进行基于位置关系的一致性渲染,得到合成图像。
需要说明的是,基于位置关系的一致性渲染可以参见上文,在此不再赘述。
当检测出目标对象没有执行预设动作,或者目标对象执行了预设动作,但第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染交互物体,再渲染主体时,该合成图 像则为背景替换后的输出图像。
当检测出目标对象执行了预设动作,且第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染主体,再渲染交互物体,该合成图像不是背景替换后的输出图像。此时,在获得合成图像之后,还需要根据第二待渲染虚拟物体的渲染位置,将第二待渲染虚拟物体作为前景,合成图像作为后景,进行前后景融合,得到融合后的图像,该融合后的图像为背景替换后的输出图像。因此,在这种情况下,该图像渲染过程还可以包括步骤S1710。
可选地,还可以包括步骤S1710、将合成图像和第二待渲染虚拟物体进行前后景融合,得到融合后的图像。
需要说明的是,步骤S1702~步骤S1703属于基于结构语义的一致性渲染过程,步骤S1704~步骤S1706属于基于交互关系的一致性渲染过程。步骤S1702~步骤S1703和步骤S1704~步骤S1706两个过程的执行顺序是任意的,可以同时执行,也可以先后执行。
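为便于整体把握第一种图像渲染过程，下面用一段示意代码概括步骤S1702～S1710的分支逻辑；helpers 中的各函数仅代表上文对应步骤的接口，均为示意性假设：

```python
def render_frame(original_image, subject_image, target_virtual_bg, helpers):
    fuse_first = []   # 渲染顺序为"先虚拟物体、再主体"的待渲染虚拟物体
    fuse_last = []    # 渲染顺序为"先主体、再交互物体"的待渲染虚拟物体

    occluder = helpers.detect_occlusion(original_image)               # 步骤S1702
    if occluder is not None:
        fuse_first.append(helpers.first_virtual_object(occluder))     # 步骤S1703

    action = helpers.detect_preset_action(original_image)             # 步骤S1704
    if action is not None:
        obj, object_first = helpers.second_virtual_object(action)     # 步骤S1705/S1706
        (fuse_first if object_first else fuse_last).append(obj)

    bg_to_render = helpers.fuse(target_virtual_bg, fuse_first)        # 步骤S1707
    subject = helpers.low_level_render(subject_image, bg_to_render)   # 步骤S1708
    output = helpers.position_render(subject, bg_to_render)           # 步骤S1709
    for obj in fuse_last:                                              # 步骤S1710(可选)
        output = helpers.fuse(output, [obj])
    return output
```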
相较而言,现有技术在图像渲染过程中并不会考虑交互关系、结构语义、底层一致性渲染以及位置关系,从而导致背景替换后的图像出现渲染不合理的现象。而本申请实施例中,通过基于结构语义的一致性渲染过程,确定了主体的遮挡关系,避免了主体渲染不合理的情况;通过基于交互关系的一致性渲染过程,在渲染主体时,考虑了主体和交互物体的交互关系,避免出现渲染不合理的现象;通过STN网络变化,确定出主体图像的渲染位置,避免了出现将主体渲染在不合理的位置,从而优化了背景替换后的真实感,进而避免了背景替换不真实甚至不合理影响到背景替换效果。
另外,通过底层一致性渲染过程,让背景替换后的图像中的主体图像和背景,在颜色、亮度、对比度、色调等方面相一致,进一步提高了背景替换效果。
第二种图像渲染过程:
该图像渲染过程包括基于结构语义的一致性渲染过程、底层一致性渲染过程和基于位置关系的一致性渲染过程。
参见图18示出的图像渲染过程的另一种流程示意图,该图像渲染过程可以包括以下步骤:
步骤S1801、终端设备100获取原始视频图像。
步骤S1802、终端设备100检测原始视频图像中的目标对象是否存在遮挡关系。当目标对象存在遮挡关系时,进入步骤S1803;当目标对象不存在遮挡关系时,进入步骤S1805。
步骤S1803、终端设备100确定遮挡物体对应的第一待渲染虚拟物体。
步骤S1804、终端设备100将第一待渲染虚拟物体和目标虚拟背景图像进行融合,得到待渲染虚拟背景图像。
得到待渲染虚拟背景图像之后,进入步骤S1806。
步骤S1805、终端设备100将目标虚拟背景图像作为待渲染虚拟背景图像。
在确定出待渲染虚拟背景图像之后,则可以根据待渲染虚拟背景图像和主体图像,进行图像渲染,以得到背景替换后的图像。
步骤S1806、终端设备100根据待渲染虚拟背景图像和原始视频图像,进行底层 一致性渲染。
步骤S1807、终端设备100根据底层一致性渲染后的主体图像和待渲染虚拟背景图像,进行基于位置关系的一致性渲染,得到合成图像。
在该图像渲染过程中,该合成图像为背景替换后的输出图像。
在该图像渲染过程中,通过基于结构语义的一致性渲染过程,确定了主体的遮挡关系,避免了主体渲染不合理的情况;通过STN网络变化,确定出主体图像的渲染位置,避免了出现将主体渲染在不合理的位置,从而优化了背景替换后的真实感,进而避免了背景替换不真实甚至不合理影响到背景替换效果。另外,通过底层一致性渲染过程,让背景替换后的图像中的主体图像和背景,在颜色、亮度、对比度、色调等方面相一致,进一步提高了背景替换效果。
第三种图像渲染过程:
该图像渲染过程包括基于交互关系的一致性渲染过程、底层一致性渲染过程和基于位置关系的一致性渲染过程。
参见图19示出的图像渲染过程的又一种流程示意图,该图像渲染过程可以包括以下步骤:
步骤S1901、终端设备100获取原始视频图像。
步骤S1902、终端设备100检测原始视频图像中的目标对象是否执行预设动作。如果是,则进入步骤S1903,如果否,则进入步骤S1906。
步骤S1903、终端设备100确定预设动作对应的第二待渲染虚拟物体。
步骤S1904、终端设备100确定第二待渲染虚拟物体的渲染顺序和渲染位置。
其中,如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染交互物体,再渲染主体,则进入步骤S1905,即根据第二待渲染虚拟物体的渲染位置,将第二待渲染虚拟物体作为前景,与目标虚拟背景图像进行前后景融合,得到待渲染虚拟背景图像。
如果第二待渲染虚拟物体和主体之间的渲染顺序为:先渲染主体,再渲染交互物体,则进入步骤S1906,此时,该图像渲染过程还包括步骤S1909。
在确定出待渲染虚拟背景图像之后,可以根据主体图像和待渲染虚拟背景图像,进行图像渲染,以得到背景替换后的图像。
步骤S1905、终端设备100将第二待渲染虚拟物体和目标虚拟背景图像进行融合,得到待渲染虚拟背景图像。
步骤S1906、终端设备100将目标虚拟背景图像作为待渲染虚拟背景图像。
在确定出待渲染虚拟背景图像之后,可以根据待渲染虚拟背景图像和主体图像进行图像渲染,得到背景替换后的图像。
步骤S1907、终端设备100将待渲染虚拟背景图像和原始视频图像,进行底层一致性渲染。
步骤S1908、终端设备100根据底层一致性渲染后的主体图像和待渲染虚拟背景图像,进行基于位置关系的一致性渲染,得到合成图像。
步骤S1909、终端设备100将合成图像和第二待渲染虚拟物体进行前后景融合,得到融合后的图像。
在该图像渲染过程中,如果第二待渲染虚拟物体的渲染顺序在先,该合成图像为背景替换后的输出图像;如果第二待渲染虚拟物体的渲染顺序在后,该融合后的图像为背景替换后的输出图像。
在该图像渲染过程中,通过基于交互关系的一致性渲染过程,在渲染主体时,考虑了主体和交互物体的交互关系,避免出现渲染不合理的现象;通过STN网络变化,确定出主体图像的渲染位置,避免了出现将主体渲染在不合理的位置,从而优化了背景替换后的真实感,进而避免了背景替换不真实甚至不合理影响到背景替换效果。另外,通过底层一致性渲染过程,让背景替换后的图像中的主体图像和背景,在颜色、亮度、对比度、色调等方面相一致,进一步提高了背景替换效果。
其它可能的图像渲染过程:
在一些实施例中,该图像渲染过程可以只包括基于结构语义的一致性渲染过程和/或基于交互关系的一致性渲染过程,不包括上述底层一致性渲染过程和基于位置关系的一致性渲染过程。该图像渲染过程可以如下:
终端设备100先执行上述基于结构语义的一致性渲染过程和/或基于交互关系的一致性渲染过程,得到待渲染虚拟背景图像。具体过程可以参见上文示出图像渲染过程,在此不再赘述。
然后,终端设备100可以采用现有的色调、亮度处理方式对待渲染虚拟背景图像进行处理,例如,检测当前场景亮度和虚拟背景亮度,当场景亮度大于虚拟背景亮度时,调整曝光时间,而当场景亮度小于虚拟背景亮度时,在虚拟背景中增加虚拟光照。
最后,终端设备100将处理后的待渲染虚拟背景图像和主体图像进行前后景融合,得到最终的输出图像。
或者,终端设备100也可以不对待渲染虚拟背景图像进行色调或亮度上的处理,而是直接将待渲染虚拟背景图像和主体图像进行前后景融合,得到最终的输出图像。
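针对上文提到的检测场景亮度与虚拟背景亮度并据此调整的做法，下面给出一个判断分支的简化示例（Python）；用均值亮度（Rec. 601 加权）近似场景/背景亮度只是示意性假设，实际的曝光调整与虚拟光照渲染不在此示例范围内：

```python
import numpy as np

def brightness_decision(scene_frame: np.ndarray, virtual_bg: np.ndarray) -> str:
    """scene_frame 为当前场景图像、virtual_bg 为虚拟背景图像，均为 uint8 RGB。
    返回应采取的调整方式。"""
    def mean_luma(img: np.ndarray) -> float:
        rgb = img[..., :3].reshape(-1, 3).mean(axis=0)
        return float(0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2])

    if mean_luma(scene_frame) > mean_luma(virtual_bg):
        return "adjust_exposure_time"   # 场景亮度大于虚拟背景亮度：调整曝光时间
    return "add_virtual_lighting"       # 场景亮度小于虚拟背景亮度：在虚拟背景中增加虚拟光照
```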
在一些实施例中,该图像渲染过程除了可以包括基于结构语义的一致性渲染过程和/或基于交互关系的一致性渲染过程,还可以包括底层一致性渲染过程或基于位置关系的一致性渲染过程。此时,该图像渲染过程可以如下:
终端设备100先执行上述基于结构语义的一致性渲染过程和/或基于交互关系的一致性渲染过程,得到待渲染虚拟背景图像。具体过程可以参见上文示出的图像渲染过程,在此不再赘述。
然后,终端设备100进行底层一致性渲染过程,即基于待渲染虚拟背景图像和主体图像,进行底层一致性渲染过程。最后,将底层一致性渲染后的主体图像和待渲染虚拟背景图像进行前后景融合,得到最终的输出图像。
或者,终端设备100进行基于位置关系的一致性渲染过程,即将主体图像和待渲染虚拟背景图像输入到训练好的STN网络,再根据变化矩阵进行Warp,最后将Warp后的主体图像和Warp后的待渲染虚拟背景图像进行前后景融合,得到最终的输出图像。
可以理解的是,除了上文提及的几种图像渲染过程,基于上文提及的几种图像渲染过程,还可以得到其它可能的图像渲染过程。
例如,图像渲染过程不进行基于结构语义的一致性渲染过程以及基于交互关系的 一致性渲染过程,只进行底层一致性渲染过程和基于位置关系的一致性渲染过程。此时,该图像渲染过程可以如下:
终端设备100将目标虚拟背景图像作为待渲染背景图像,基于待渲染背景图像和主体图像进行底层一致性渲染过程;最后,基于底层一致性渲染后的主体图像和待渲染虚拟背景图像进行基于位置关系的一致性渲染过程,得到输出图像。
又例如,图像渲染过程可以只包括底层一致性渲染过程,不进行基于结构语义的一致性渲染过程、基于交互关系的一致性渲染过程以及基于位置关系的一致性渲染过程。此时,该图像渲染过程可以如下:
终端设备100将目标虚拟背景图像作为待渲染背景图像,基于待渲染背景图像和主体图像进行底层一致性渲染过程,最后,将底层一致性渲染后的主体图像和待渲染虚拟背景图像进行前后景融合,得到输出图像。
又例如，图像渲染过程不进行基于结构语义的一致性渲染过程、基于交互关系的一致性渲染过程以及底层一致性渲染过程，只进行基于位置关系的一致性渲染过程。此时，该图像渲染过程可以如下：
终端设备100根据目标虚拟背景图像和主体图像,进行基于位置关系的一致性渲染过程,得到输出图像。
其它的图像渲染过程在此不再一一列举,并且,各个图像渲染过程的相同或相似之处,可以相互参见,在此不再赘述。
相较而言,第二种图像渲染过程、第三种图像渲染过程以及其它可能的图像渲染过程,虽然效果比第一种图像渲染过程的背景替换效果较差,但仍然可以提高背景替换效果。
上文描述了基于原始视频图像或主体图像,以及目标虚拟背景图像,针对某一帧原始视频图像进行虚拟背景替换,得到输出图像的过程。例如,参见图4,手机响应于针对虚拟背景图像411的点击操作,确定出目标虚拟背景图像为虚拟背景图像411,然后通过基于接听界面48对应的原始视频图像,以及虚拟背景图像411,进行上文提及的任一种图像渲染过程,得到输出图像,最后显示该输出图像,得到背景替换后的图像410。又例如,参见图6,手机响应于虚拟背景图像412的点击操作,确定出目标虚拟背景图像为虚拟背景图像412,然后基于接听界面48对应的原始视频图像和虚拟背景图像412,进行上文提及的任一种图像渲染过程,得到如界面415显示的输出图像。
本申请实施例提供的虚拟背景替换过程除了可以应用于手机端的视频通话场景,还可以应用于大屏设备的视频通话场景。例如,参见图20示出的大屏设备的虚拟背景替换场景示意图,如图20所示,在家居场景下,用户通过大屏设备201进行视频通话,该大屏设备上安装有畅连通话。通过视频通话界面中的魔术笔202可以调出虚拟背景选择窗口,用户可以从该窗口中选择用于替换的背景图像。用户选定目标虚拟背景图像之后,大屏设备则根据目标虚拟背景图像和原始视频图像,进行上文提及的任意一种图像渲染过程,得到背景替换后的图像203。
在具体应用中,终端设备100可以针对视频流中每一帧图像均进行上述虚拟背景替换过程,也可以每隔5帧或每隔10帧才进行一次上述虚拟背景替换过程,即每隔5帧或10帧,则对原始视频图像进行前后景分割,得到主体图像和原始背景图像,再确定出目标虚拟背景图像,基于主体图像和目标虚拟背景图像执行上述提及任一种图像渲染过程。
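下面给出每隔若干帧才执行一次虚拟背景替换的一个简化示例（Python）。中间帧如何处理本申请未作限定，示例中直接复用最近一次替换结果，属于示意性假设：

```python
def process_stream(frames, replace_background, interval=5):
    """frames 为原始视频帧序列，replace_background 代表完整的分割 + 渲染流程。"""
    last_output = None
    for index, frame in enumerate(frames):
        if last_output is None or index % interval == 0:
            last_output = replace_background(frame)   # 每隔 interval 帧执行一次完整替换
        yield last_output                             # 其余帧复用最近一次替换结果(假设)
```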
在终端设备100持续对视频流进行虚拟背景替换时,终端设备100如果识别到主体和原始背景图像中的物体存在遮挡关系和/或交互关系,则在虚拟背景图像中渲染相应的虚拟物体。而终端设备100如果识别出交互关系结束和/或遮挡关系结束,可以去除第一待渲染虚拟物体和/或第二待渲染虚拟物体。
其中,交互关系结束是指主体与原始背景图像中的交互物体的交互动作结束,例如,主体为人,交互物体为原始背景图像中的椅子,交互动作为“坐下”,当人从椅子上站起来时,则认为“坐下”这一交互动作结束,人和原始背景图像中的椅子的交互关系结束。遮挡关系结束是指主体从被遮挡到不被遮挡,例如,主体为人,某个时刻,人被原始背景图像中的桌子遮挡,下一个时刻,人没有被桌子遮挡,则认为遮挡关系结束。
从视觉上来看,当存在交互关系或遮挡关系,背景替换后的图像中会存在对应的虚拟物体,而当交互关系或遮挡关系结束,背景替换后的图像中的虚拟物体会随之消失。
从具体实现来说,终端设备100持续对视频流中的视频图像进行虚拟背景替换过程,如果在某个时刻,通过上述基于结构语义的一致性渲染过程,确定出原始视频图像中的主体没有被遮挡,则认为遮挡关系结束,不用渲染第一待渲染虚拟物体。同理,通过上述基于交互关系的一致性渲染过程,确定出交互关系结束,则不用渲染第二待渲染虚拟物体,其中,当识别出与交互动作相应的动作时,则可以认为交互关系结束,例如,当交互动作为“坐下”,与该交互动作对应的动作为“站起”,即当识别出“站起”动作时,则认为交互关系结束。
此时,由于主体没有被遮挡,且主体也没有执行预设交互动作,故待渲染虚拟背景中没有待渲染虚拟物体。终端设备100基于该待渲染虚拟背景图像和主体图像,进行后续的底层一致性渲染过程和基于位置关系的一致性渲染过程,得到当次虚拟背景替换过程的输出图像后,该输出图像中没有虚拟物体。这样,从用户视觉角度来说,当交互关系结束或者遮挡关系结束后,背景替换后的图像中的虚拟物体会消失。例如,人从椅子上站起后,背景替换后的图像中的椅子也会消失,当人从被遮挡到不被遮挡时,背景替换后的图像中用于遮挡的物体(例如桌子)也会消失。
举例来说,参见图21示出的虚拟背景替换图像的变化示意图。如图21所示,图像211为原始视频图像1进行第一次虚拟背景替换过程后得到的图像。其中,图像211为上述图7中的图像78,原始视频图像1为图7中的图像71,目标虚拟背景图像为图7中的图像74。相关介绍请参见上述图7相应内容。图像211中包括了主体212和主体213,以及虚拟物体214。
假设,视频流依次包括原始视频图像1、原始视频图像2、原始视频图像3以及原始视频图像4。
第二次虚拟背景替换过程：终端设备100对原始视频图像2进行虚拟背景替换，得到背景替换后的图像215。具体地，终端设备100基于原始视频图像2，判断主体212和主体213是否存在遮挡关系和/或交互关系。此时，主体213存在遮挡关系，主体212不存在遮挡关系和交互关系，故确定出需要渲染的虚拟物体214，并将虚拟物体214和目标虚拟背景图像进行前后景融合，得到待渲染虚拟背景图像。再根据待渲染虚拟背景图像和原始视频图像2，依次进行底层一致性渲染过程和基于位置关系的一致性渲染过程，得到图像215。
第三次虚拟背景替换过程：终端设备100对原始视频图像3进行虚拟背景替换，得到背景替换后的图像216。具体地，终端设备100对原始视频图像3进行基于结构语义的一致性渲染过程以及基于交互关系的一致性渲染过程，以判断出主体是否存在遮挡关系和交互关系。此时，主体212和原始背景中的椅子存在交互关系，主体213与原始背景中的桌子存在遮挡关系，故确定出需要渲染的虚拟物体217和虚拟物体214。接着，将虚拟物体217和虚拟物体214渲染在目标虚拟背景图像中，得到待渲染虚拟背景图像，再基于待渲染虚拟背景图像和原始视频图像3，依次进行底层一致性渲染过程和基于位置关系的一致性渲染过程，得到图像216。
第四次虚拟背景替换过程：终端设备100对原始视频图像4进行虚拟背景替换，得到背景替换后的图像218。具体地，终端设备100对原始视频图像4进行基于结构语义的一致性渲染过程以及基于交互关系的一致性渲染过程，以判断出主体是否存在遮挡关系和交互关系。此时，原始视频图像4中主体213不存在遮挡关系和交互关系，主体212存在交互关系，不存在遮挡关系，故确定需要渲染虚拟物体217。接着，将虚拟物体217和目标虚拟背景图像进行前后景融合，得到待渲染虚拟背景图像。最后，基于原始视频图像4和待渲染虚拟背景图像，依次进行底层一致性渲染过程和基于位置关系的一致性渲染过程，得到图像218。
终端设备依据原始视频图像的播放顺序,依次在显示屏上显示图像211、图像215、图像216和图像218,从用户视觉角度来说,当主体212从站立到坐下时,虚拟背景图像中会相应地多出一张椅子,而当主体212和主体213均没有被遮挡时,之前所渲染的虚拟物体214也会随之消失。
本申请实施例提供的虚拟背景替换方案除了可以应用于视频通话场景,也可以应用于背景穿越、特效制作、视频会议、拍照、以及录像等虚拟背景替换场景。
下面将示例性地对拍摄场景和录像场景进行介绍。
参见图22示出的拍摄场景下的虚拟背景替换过程的界面示意图。如图22所示，手机响应于针对相机221的点击操作，显示预览界面222。接着，手机响应于针对预览界面222中的魔术笔223的点击操作，在预览界面222中弹出窗口224，该窗口224中依次显示有场景225~228，该场景225~228可以依次对应图5中的虚拟背景图像412、413、414和411。用户可以根据自己的需要从窗口224中选择自己所需要替换的场景。
当手机接收到针对场景225的点击操作之后，手机显示预览界面229。此时，预览界面222对应的图像为原始视频图像，场景225对应的图像为目标虚拟背景图像。手机先基于原始视频图像，进行基于结构语义的一致性渲染过程和基于交互关系的一致性渲染过程，以判断出是否需要渲染虚拟物体。如果需要渲染虚拟物体，则将待渲染虚拟物体和目标虚拟背景图像进行融合，得到待渲染虚拟背景图像。而当前情况下，由于主体和原始背景图像中的物体不存在遮挡关系和交互关系，故不需要渲染虚拟物体，待渲染虚拟背景图像即为目标虚拟背景图像。再基于主体图像和待渲染虚拟背景图像，进行底层一致性渲染过程和基于位置关系的一致性渲染过程，得到背景替换后的输出图像，该输出图像即为预览界面229对应的图像。
手机显示预览界面229之后,用户可以点击控件2210进行拍照。手机接收针对控件2210的点击操作之后,将预览界面229对应的图像保存为一张图片,并将该图片显示在控件2211中。
用户可以通过点击控件2211，查看所拍摄的图片。此时，当手机接收到针对控件2211的点击操作之后，显示图片预览界面，该图片预览界面中显示有所拍摄的图片2212。图片2212为背景替换后的图像。
在图22示出的拍照过程中，手机如果识别出预览界面222对应的图像中人被某个物体遮挡了，和/或人做出了某个预设动作，所拍摄的图片2212中也会存在对应的虚拟物体。例如，参见图21，假设在拍摄过程中，主体212和主体213作出了相应的交互动作，或者改变了遮挡关系，则预览界面也会相应地显示如图像211、图像215、图像216以及图像218所示的图像。
另外,手机显示预览界面229之后,用户也可以通过魔术笔再次进行虚拟背景替换。
参见图23示出的录像场景下的虚拟背景替换过程的界面示意图。如图23所示,手机接收到针对相机231的点击操作之后,显示预览界面232。预览界面232中包括魔术笔233,该预览界面232显示的是手机通过摄像头采集的原始视频图像。
当手机接收到针对控件234的点击操作之后,手机则开始录像,显示录像界面235,录像界面235显示的仍然是手机通过摄像头采集的原始视频图像。在录像过程中,用户可以点击魔术笔233进行虚拟背景替换。
当手机接收到针对录像界面235中的魔术笔233的点击操作之后,手机在录像界面235中弹出窗口236,窗口236中显示有可以用于背景替换的场景237~2310。用户点击窗口236中的场景237之后,手机响应于该点击操作,进行虚拟背景替换过程,得到背景替换后的图像,并将背景替换后的图像显示在界面2311中。
此时，场景237对应的图像为目标虚拟背景图像，录像界面235对应的图像为原始视频图像，基于目标虚拟背景图像和原始视频图像，依次进行基于结构语义的一致性渲染过程、基于交互关系的一致性渲染过程、底层一致性渲染过程以及基于位置关系的一致性渲染过程，得到背景替换后的输出图像。
如果在录像过程中，原始视频图像中的人与背景中的物体存在遮挡关系和/或交互关系，所录制的视频中也会存在对应的虚拟物体。
另外,图23是在录像开始后才进行虚拟背景替换的,在其它实施例中,也可以在录像开始之前进行虚拟背景替换,即通过预览界面232中的魔术笔调出窗口236,并选择对应的场景。
图22和图23中,手机也可以进行上述虚拟背景推荐过程。图22和图23中与上文相同或类似之处,可以参见上文,在此不再赘述。
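对于上述虚拟背景推荐过程，可以结合语义分割的IOU与颜色分布的相似度来对候选虚拟背景排序。以下为一个示意性的 Python 草图（假设原始背景与各候选背景的分割图已缩放到相同尺寸，并以颜色直方图的相关性近似"颜色分布曲线相似度"），属于假设性示例，并非本申请的实际实现：

    import cv2
    import numpy as np

    def recommend_background(orig_bg_bgr, orig_bg_labels, candidates):
        # candidates: [(候选虚拟背景图像, 对应的多类语义分割结果), ...]
        orig_hist = cv2.calcHist([orig_bg_bgr], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        cv2.normalize(orig_hist, orig_hist)
        scores = []
        for bg_bgr, bg_labels in candidates:
            # 按类别计算两幅语义分割图之间的平均 IOU
            classes = np.union1d(np.unique(orig_bg_labels), np.unique(bg_labels))
            ious = []
            for c in classes:
                a, b = (orig_bg_labels == c), (bg_labels == c)
                union = np.logical_or(a, b).sum()
                ious.append(np.logical_and(a, b).sum() / union if union else 0.0)
            iou_score = float(np.mean(ious))
            # 用颜色直方图的相关性近似两幅图像颜色分布曲线的相似度
            hist = cv2.calcHist([bg_bgr], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            cv2.normalize(hist, hist)
            color_sim = cv2.compareHist(orig_hist, hist, cv2.HISTCMP_CORREL)
            scores.append(0.5 * iou_score + 0.5 * color_sim)
        # 返回综合得分最高的候选背景的下标，作为待推荐虚拟背景
        return int(np.argmax(scores))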
本申请实施例还提供了一种计算机可读存储介质，计算机可读存储介质存储有计算机程序，计算机程序被处理器执行时可实现上述各个方法实施例中的步骤。
本申请实施例提供了一种计算机程序产品，当计算机程序产品在终端设备上运行时，使得终端设备执行上述各个方法实施例中的步骤。
本申请实施例还提供一种芯片系统,所述芯片系统包括处理器,所述处理器与存储器耦合,所述处理器执行存储器中存储的计算机程序,以实现如上述各个方法实施例所述的方法。所述芯片系统可以为单个芯片,或者多个芯片组成的芯片模组。
应当理解,当在本申请说明书和所附权利要求书中使用时,术语“包括”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本申请说明书和所附权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。此外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。
最后应说明的是:以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (13)

  1. 一种图像渲染方法,其特征在于,应用于终端设备,所述方法包括:
    获取待处理图像;
    检测出所述待处理图像中的目标对象执行预设动作和/或所述目标对象被第一物体遮挡;
    确定所述预设动作对应的第二待渲染虚拟物体和/或与所述第一物体对应的第一待渲染虚拟物体;
    将所述第一待渲染虚拟物体和/或所述第二待渲染虚拟物体渲染在目标虚拟背景图像中,得到待渲染虚拟背景图像,其中,所述第二待渲染虚拟物体对应的交互物体的深度值大于所述目标对象的深度值;
    根据所述待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像,所述主体图像为从所述待处理图像中提取的,且包括所述目标对象的图像。
  2. 根据权利要求1所述的方法,其特征在于,根据所述待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像,包括:
    基于所述待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像;
    根据所述底层一致性渲染后的主体图像和所述待渲染虚拟背景图像进行图像渲染,得到所述渲染后的图像。
  3. 根据权利要求2所述的方法,其特征在于,基于所述待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像,包括:
    将所述待渲染虚拟背景图像的低频图像以及所述待处理图像输入至预先训练完成的第一风格迁移模型,获得所述第一风格迁移模型输出的底层一致性渲染后的待处理图像;
    从所述底层一致性渲染后的待处理图像进行主体图像提取,得到所述底层一致性渲染后的主体图像。
  4. 根据权利要求3所述的方法,其特征在于,风格迁移模型的训练过程包括:
    获得训练数据集,所述训练数据集包括第一虚拟背景图像和原始视频图像;
    将所述第一虚拟背景图像的低频图像和所述原始视频图像输入至预先构建的第二风格迁移模型,获得所述第二风格迁移模型输出的正向训练结果;
    计算所述正向训练结果和所述第一虚拟背景图像的低频图像之间的第一损失值;
    将所述正向训练结果和所述原始视频图像的低频图像输入至正向训练后的第二风格迁移模型,获得所述正向训练后的第二风格迁移模型输出的反向训练结果;
    计算所述反向训练结果和所述原始视频图像之间的第二损失值;
    计算所述反向训练结果和所述原始视频图像的低频图像之间的第三损失值;
    根据所述第一损失值调整所述第二风格迁移模型的网络参数,并根据所述第二损失值和所述第三损失值,调整所述正向训练后的第二风格迁移模型的网络参数;
    重复进行所述训练过程,当符合预定条件时,得到训练完成的所述第一风格迁移模型。
  5. 根据权利要求2所述的方法,其特征在于,基于所述待渲染虚拟背景图像进行底层一致性渲染,得到底层一致性渲染后的主体图像,包括:
    将所述待渲染虚拟背景图像转到LAB色彩空间,得到第一图像;
    分别计算所述第一图像的L通道、A通道、B通道的第一标准差和第一均值;
    将所述主体图像转到LAB色彩空间,得到第二图像;
    根据所述第一标准差和所述第一均值修正所述第二图像，得到第三图像，所述第三图像的L通道、A通道、B通道的第二标准差与所述第一标准差的差值在第一预设阈值区间内，第二均值与所述第一均值的差值在第二预设阈值区间内；
    将所述第三图像从LAB色彩空间转到RGB色彩空间,得到第四图像,所述第四图像为所述底层一致性渲染后的主体图像。
  6. 根据权利要求2至5任一项所述的方法，其特征在于，根据所述底层一致性渲染后的主体图像和所述待渲染虚拟背景图像进行图像渲染，得到所述渲染后的图像，包括：
    将所述底层一致性渲染后的主体图像输入至预先训练完成的第一STN网络,得到所述第一STN网络输出的第一变化矩阵;
    将所述待渲染虚拟背景图像输入至预先训练完成的第二STN网络,得到所述第二STN网络输出的第二变化矩阵;
    使用所述第一变化矩阵对所述底层一致性渲染后的主体图像进行图像仿射变化,得到第一变化图像;
    使用所述第二变化矩阵对所述待渲染虚拟背景图像进行图像仿射变化,得到第二变化图像;
    将所述第一变化图像和所述第二变化图像进行图像合成,得到所述渲染后的图像。
  7. 根据权利要求1至6任一项所述的方法，其特征在于，将所述第二待渲染虚拟物体渲染在目标虚拟背景图像中，包括：
    根据所述待处理图像的语义分割结果,确定所述预设动作对应的交互物体在所述待处理图像中的第一位置;
    将所述目标虚拟背景图像中与所述第一位置对应的第二位置作为所述第二待渲染虚拟物体的渲染位置；
    确定所述待处理图像中的交互物体的深度值大于所述目标对象的深度值;
    在所述目标虚拟背景图像的所述渲染位置渲染所述第二待渲染虚拟物体。
  8. 根据权利要求1所述的方法,其特征在于,检测所述待处理图像中的目标对象被第一物体遮挡,包括:
    根据所述待处理图像的语义分割结果,确定所述待处理图像中各个像素点的所属类别;
    获取所述待处理图像的深度信息;
    当根据所述深度信息,确定所述目标对象的预设范围内存在深度值小于所述目标对象的深度值的目标像素点,将所述目标像素点对应的类别作为所述第一物体,并确定所述目标对象被所述第一物体遮挡。
  9. 根据权利要求1至8任一项所述的方法，其特征在于，在将所述第一待渲染虚拟物体和/或所述第二待渲染虚拟物体渲染在目标虚拟背景图像中，得到待渲染虚拟背景图像之前，所述方法还包括：
    根据所述待处理图像的原始背景图像和各个第二虚拟背景图像之间的相似性,确定待推荐虚拟背景图像;
    显示所述待推荐虚拟背景图像。
  10. 根据权利要求9所述的方法,其特征在于,根据所述待处理图像的原始背景图像和各个第二虚拟背景图像之间的相似性,确定待推荐虚拟背景图像,包括:
    对所述待处理图像进行前后景分割,获得所述待处理图像的原始背景图像;
    对所述原始背景图像进行多类语义分割,得到第二语义分割结果;
    对各个所述第二虚拟背景图像进行多类语义分割,得到各个所述第二虚拟背景图像的第三语义分割结果;
    根据所述第二语义分割结果和所述第三语义分割结果,计算所述原始背景图像和各个所述第二虚拟背景图像的IOU值;
    分别计算所述原始背景图像的第一颜色分布曲线,以及各个所述第二虚拟背景图像的第二颜色分布曲线;
    计算所述第一颜色分布曲线与各个所述第二颜色分布曲线之间的曲线相似度;
    根据所述曲线相似度和所述IOU值,从所述第二虚拟背景图像中确定所述待推荐虚拟背景图像。
  11. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    若所述第二待渲染虚拟物体对应的交互物体的深度值小于所述目标对象的深度值,将所述第一待渲染虚拟物体渲染在所述目标虚拟背景图像中,得到待渲染虚拟背景图像,或者,将所述目标虚拟背景图像作为所述待渲染虚拟背景图像;
    在根据所述待渲染虚拟背景图像和主体图像进行图像渲染,得到渲染后的图像之后,所述方法还包括:
    将所述第二待渲染虚拟物体渲染在所述渲染后的图像中，得到输出图像。
  12. 一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至11任一项所述的方法。
  13. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至11任一项所述的方法。
PCT/CN2021/126469 2020-11-09 2021-10-26 图像渲染方法和装置 WO2022095757A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011240398.8A CN114494566A (zh) 2020-11-09 2020-11-09 图像渲染方法和装置
CN202011240398.8 2020-11-09

Publications (1)

Publication Number Publication Date
WO2022095757A1 true WO2022095757A1 (zh) 2022-05-12

Family

ID=81457498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126469 WO2022095757A1 (zh) 2020-11-09 2021-10-26 图像渲染方法和装置

Country Status (2)

Country Link
CN (1) CN114494566A (zh)
WO (1) WO2022095757A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830281B (zh) * 2022-11-22 2023-07-25 山东梦幻视界智能科技有限公司 一种基于MiniLED显示屏的裸眼VR沉浸式体验装置
CN115775024B (zh) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 虚拟形象模型训练方法及装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791793A (zh) * 2014-12-17 2016-07-20 光宝电子(广州)有限公司 图像处理方法及其电子装置
US20170282074A1 (en) * 2012-06-20 2017-10-05 Microsoft Technology Licensing, Llc Multiple Frame Distributed Rendering of Interactive Content
CN109461199A (zh) * 2018-11-15 2019-03-12 腾讯科技(深圳)有限公司 画面渲染方法和装置、存储介质及电子装置
CN110062176A (zh) * 2019-04-12 2019-07-26 北京字节跳动网络技术有限公司 生成视频的方法、装置、电子设备和计算机可读存储介质
CN110176054A (zh) * 2018-02-14 2019-08-27 辉达公司 用于训练神经网络模型的合成图像的生成
CN110956654A (zh) * 2019-12-02 2020-04-03 Oppo广东移动通信有限公司 图像处理方法、装置、设备及存储介质
CN111667399A (zh) * 2020-05-14 2020-09-15 华为技术有限公司 风格迁移模型的训练方法、视频风格迁移的方法以及装置
CN111726479A (zh) * 2020-06-01 2020-09-29 北京像素软件科技股份有限公司 图像渲染的方法及装置、终端、可读存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665461A (zh) * 2022-10-13 2023-01-31 聚好看科技股份有限公司 一种视频录制方法及虚拟现实设备
CN115665461B (zh) * 2022-10-13 2024-03-22 聚好看科技股份有限公司 一种视频录制方法及虚拟现实设备
CN115908663A (zh) * 2022-12-19 2023-04-04 支付宝(杭州)信息技术有限公司 一种虚拟形象的衣物渲染方法、装置、设备及介质
CN115908663B (zh) * 2022-12-19 2024-03-12 支付宝(杭州)信息技术有限公司 一种虚拟形象的衣物渲染方法、装置、设备及介质
CN116934936A (zh) * 2023-09-19 2023-10-24 成都索贝数码科技股份有限公司 一种三维场景风格迁移方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN114494566A (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2022095757A1 (zh) 图像渲染方法和装置
JP7110502B2 (ja) 深度を利用した映像背景減算法
US10134364B2 (en) Prioritized display of visual content in computer presentations
WO2020108082A1 (zh) 视频处理方法、装置、电子设备和计算机可读介质
WO2017157272A1 (zh) 一种信息处理方法及终端
WO2020108081A1 (zh) 视频处理方法、装置、电子设备和计算机可读介质
US10609332B1 (en) Video conferencing supporting a composite video stream
US11070717B2 (en) Context-aware image filtering
WO2022068479A1 (zh) 图像处理方法、装置、电子设备及计算机可读存储介质
CN103731742B (zh) 用于视频流放的方法和装置
WO2021254502A1 (zh) 目标对象显示方法、装置及电子设备
US10957108B2 (en) Augmented reality image retrieval systems and methods
CN114096986A (zh) 自动地分割和调整图像
Turban et al. Extrafoveal video extension for an immersive viewing experience
CN113099146A (zh) 一种视频生成方法、装置及相关设备
CN115689963A (zh) 一种图像处理方法及电子设备
CN114926351B (zh) 图像处理方法、电子设备以及计算机存储介质
US20180268049A1 (en) Providing a heat map overlay representative of user preferences relating to rendered content
US10304232B2 (en) Image animation in a presentation document
WO2022088946A1 (zh) 一种弯曲文本的字符选择方法、装置和终端设备
CN114758027A (zh) 图像处理方法、装置、电子设备及存储介质
WO2022206605A1 (zh) 确定目标对象的方法、拍摄方法和装置
CN114443182A (zh) 一种界面切换方法、存储介质及终端设备
WO2022194084A1 (zh) 视频播放方法、终端设备、装置、系统及存储介质
CN116612060B (zh) 视频信息处理方法、装置及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888454

Country of ref document: EP

Kind code of ref document: A1