CN114494566A - Image rendering method and device - Google Patents


Info

Publication number
CN114494566A
Authority
CN
China
Prior art keywords
image, rendered, virtual, rendering, virtual background
Prior art date
Legal status
Pending
Application number
CN202011240398.8A
Other languages
Chinese (zh)
Inventor
裴仁静
陈艳花
许松岑
刘宏马
梅意城
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202011240398.8A
Priority to PCT/CN2021/126469
Publication of CN114494566A
Legal status: Pending

Classifications

    • G06T 15/205: Image-based rendering (under G06T 15/00 3D [Three Dimensional] image rendering; G06T 15/10 Geometric effects; G06T 15/20 Perspective computation)
    • G06T 15/20: Perspective computation (under G06T 15/10 Geometric effects)
    • G06T 3/04: under G06T 3/00 Geometric image transformation in the plane of the image
    • G06T 7/194: Segmentation; Edge detection involving foreground-background segmentation (under G06T 7/00 Image analysis)
    • G06T 7/593: Depth or shape recovery from stereo images (under G06T 7/50 Depth or shape recovery; G06T 7/55 from multiple images)
    • G06T 7/90: Determination of colour characteristics (under G06T 7/00 Image analysis)
    • G06F 18/214: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F 18/00 Pattern recognition)
    • G06F 18/22: Matching criteria, e.g. proximity measures (under G06F 18/00 Pattern recognition)
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces (under G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning)
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting (under G06V 10/77 Processing image or video features in feature spaces)

Abstract

Embodiments of the present application disclose an image rendering method and apparatus. The method includes: acquiring an image to be processed; detecting that a target object in the image to be processed performs a preset action and/or is occluded by a first object; determining a second virtual object to be rendered corresponding to the preset action and/or a first virtual object to be rendered corresponding to the first object; rendering the first virtual object to be rendered and/or the second virtual object to be rendered in a target virtual background image to obtain a virtual background image to be rendered, where the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than that of the target object; and performing image rendering according to the virtual background image to be rendered and the subject image to obtain a rendered image. The embodiments of the present application can improve the rationality and realism of the rendered image, and thus improve the virtual background replacement effect.

Description

Image rendering method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image rendering method and apparatus.
Background
With the continuous development of image processing technology, more and more products provide a virtual background replacement function, such as video calling applications (for example, MeeTime), Zoom video conferencing, and photography applications.
Virtual background replacement replaces the background of the original image with a different background. When the virtual background is replaced, the foreground and background of the original image are generally segmented to obtain the foreground and the original background; the foreground and the virtual background are then rendered and fused to obtain an image with the background replaced.
In the prior art, the background-replaced image is often implausible and unrealistic, so the background replacement effect is poor.
Disclosure of Invention
The embodiment of the application provides an image rendering method and device, which can improve the replacement effect of a virtual background.
In a first aspect, an embodiment of the present application provides an image rendering method. The method includes: a terminal device first acquires an image to be processed, which may be an original video image in a video stream; the terminal device detects whether a target object in the image to be processed performs a preset action and/or whether the target object is occluded by a first object; if it is detected that the target object performs the preset action and/or is occluded by the first object, a second virtual object to be rendered corresponding to the preset action and/or a first virtual object to be rendered corresponding to the first object is determined accordingly; the first virtual object to be rendered and/or the second virtual object to be rendered is then rendered in a target virtual background image to obtain a virtual background image to be rendered, where the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than that of the target object; and finally, image rendering is performed according to the virtual background image to be rendered and the subject image to obtain a rendered image, where the subject image is extracted from the image to be processed and contains the target object.
If it is detected that the target object neither performs the preset action nor is occluded, or that the target object is not occluded and performs the preset action but the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than that of the target object, the target virtual background image is used as the virtual background image to be rendered, and image rendering is then performed according to the virtual background image to be rendered and the subject image to obtain the rendered image.
In the embodiments of the present application, after it is detected that the target object is occluded and/or performs a preset action, the corresponding virtual object is rendered in the virtual background image: when the target object is occluded, a corresponding virtual occluding object is rendered, and when the target object performs an interactive action, a corresponding interactive virtual object is rendered. This avoids implausible and unrealistic artifacts in the rendered image as far as possible, improves the rationality and realism of the rendered image, and thus improves the virtual background replacement effect.
For example, when the target object is a person, that is, the subject image is a portrait subject image, if it is detected that the person performs the preset action of "sitting down", it is determined that the second virtual object to be rendered corresponding to "sitting down" is a "chair". Assuming that the depth value of the interactive object (for example, a stool) corresponding to the "chair" in the image to be processed is greater than the depth value of the target object, the "chair" is rendered at the corresponding position in the target virtual background image to obtain the virtual background image to be rendered. Image rendering is then performed on the virtual background image to be rendered and the subject image to obtain the rendered image, which is the image after virtual background replacement.
Visually, the portrait subject in the image to be processed is seated on the interactive object, and the portrait subject in the rendered image is likewise seated on the chair, so the background-replaced image is consistent with the original video image in terms of the interactive relationship, avoiding implausible artifacts such as a person sitting in mid-air in the background-replaced image.
It should be noted that when it is detected that the target object both performs the preset action and is occluded by the first object, and the depth value of the interactive object of the second virtual object to be rendered is greater than the depth value of the target object, the first virtual object to be rendered and the second virtual object to be rendered are both rendered in the target virtual background image to obtain the virtual background image to be rendered.
When it is detected that the target object performs the preset action but is not occluded by the first object, and the depth value of the interactive object of the second virtual object to be rendered is greater than the depth value of the target object, only the second virtual object to be rendered is rendered in the target virtual background image to obtain the virtual background image to be rendered.
When the target object does not perform the preset action but is occluded by the first object, or when the target object performs the preset action but the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than that of the target object and the target object is occluded by the first object, only the first virtual object to be rendered is rendered in the target virtual background image to obtain the virtual background image to be rendered.
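The case analysis above amounts to a small decision routine. The following Python sketch is purely illustrative: the function name, the renderer helper, and the argument names are hypothetical and are not defined by the present application.

```python
# Hypothetical sketch of the case analysis above; all names are illustrative.
def build_background_to_render(target_bg, occluded_by, action_detected,
                               interactive_obj_depth, target_depth,
                               first_virtual_obj, second_virtual_obj, renderer):
    """Decide which virtual objects to draw into the target virtual background image."""
    bg = target_bg.copy()
    # Occluding object: rendered whenever the target object is occluded.
    if occluded_by is not None:
        bg = renderer.draw(bg, first_virtual_obj)
    # Interactive object: rendered into the background only when it lies behind the
    # subject, i.e. its depth value is greater than the target object's depth value.
    if action_detected and interactive_obj_depth > target_depth:
        bg = renderer.draw(bg, second_virtual_obj)
    # Otherwise the target virtual background image is used as-is; a closer interactive
    # object is composited on top of the subject later (see the last implementation below).
    return bg
```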
In some possible implementations of the first aspect, performing image rendering according to the virtual background image to be rendered and the subject image to obtain the rendered image may include: performing bottom-layer consistent rendering based on the virtual background image to be rendered to obtain a bottom-layer-consistent subject image; and performing image rendering according to the bottom-layer-consistent subject image and the virtual background image to be rendered to obtain the rendered image.
In some possible implementations of the first aspect, performing bottom-layer consistent rendering based on the virtual background image to be rendered to obtain the bottom-layer-consistent subject image may include: inputting a low-frequency image of the virtual background image to be rendered and the image to be processed into a pre-trained first style migration model to obtain the bottom-layer-consistent image to be processed output by the first style migration model; and extracting the subject image from the bottom-layer-consistent image to be processed to obtain the bottom-layer-consistent subject image.
In this implementation, using the low-frequency image as the model input allows low-level features such as texture to be ignored. This bottom-layer consistent rendering process can further improve the virtual background replacement effect.
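As a rough illustration of this implementation, the low-frequency image could be obtained with a heavy Gaussian blur before being fed to the style migration model. The sketch below is only one possible realization under that assumption; style_migration_model, extract_subject, and subject_mask are hypothetical names.

```python
import cv2
import numpy as np

def low_frequency(image: np.ndarray, ksize: int = 21, sigma: float = 10.0) -> np.ndarray:
    """One possible low-frequency image: a strong Gaussian blur that keeps the global
    color and brightness of the background while discarding texture detail."""
    return cv2.GaussianBlur(image, (ksize, ksize), sigma)

# Hypothetical usage with a pre-trained first style migration model:
# low_freq_bg = low_frequency(virtual_background_to_render)
# consistent_image = style_migration_model(low_freq_bg, image_to_process)
# consistent_subject = extract_subject(consistent_image, subject_mask)
```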
In some possible implementations of the first aspect, the training process of the style migration model may include: obtaining a training data set, where the training data set includes first virtual background images and original video images; inputting a low-frequency image of a first virtual background image and an original video image into a pre-constructed second style migration model to obtain a forward training result output by the second style migration model; calculating a first loss value between the forward training result and the low-frequency image of the first virtual background image; inputting the forward training result and a low-frequency image of the original video image into the forward-trained second style migration model to obtain a reverse training result output by the forward-trained second style migration model; calculating a second loss value between the reverse training result and the original video image; calculating a third loss value between the reverse training result and the low-frequency image of the original video image; adjusting the network parameters of the second style migration model according to the first loss value, and adjusting the network parameters of the forward-trained second style migration model according to the second loss value and the third loss value; and repeating this training process until a preset condition is met, thereby obtaining the trained first style migration model.
The preset condition may characterize that the loss values of the model have stabilized; specifically, it may mean that the first loss value, the second loss value, and the third loss value all stabilize around certain values. When the preset condition is met, model training is considered complete and the trained first style migration model is obtained.
The first loss value is a loss in the LAB space: after both the forward training result and the low-frequency image of the first virtual background image are converted to the LAB space, the variance difference and the mean difference of the two images in the LAB space are calculated to constrain the similarity of global color, brightness, saturation, and so on. The third loss value is also a loss in the LAB space.
A model obtained through this training process ensures consistency not only in style but also in image content. Performing bottom-layer consistent rendering with this first style migration model can further improve the background replacement effect.
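A forward/reverse training step with LAB-space statistic losses could look roughly like the following PyTorch-style sketch. It is a minimal sketch under several assumptions: the network architecture, the rgb_to_lab helper, and the use of an L1 distance for the second loss are illustrative choices, not details fixed by the present application.

```python
import torch
import torch.nn.functional as F

def lab_stat_loss(pred_rgb, target_rgb, rgb_to_lab):
    """LAB-space loss (first/third loss): difference of per-channel mean and variance."""
    pred_lab, target_lab = rgb_to_lab(pred_rgb), rgb_to_lab(target_rgb)
    mean_diff = (pred_lab.mean(dim=(2, 3)) - target_lab.mean(dim=(2, 3))).pow(2).mean()
    var_diff = (pred_lab.var(dim=(2, 3)) - target_lab.var(dim=(2, 3))).pow(2).mean()
    return mean_diff + var_diff

def train_step(model, optimizer, low_freq_bg, original_image, low_freq_original, rgb_to_lab):
    # Forward training: restyle the original video image towards the low-frequency
    # look of the first virtual background image.
    forward_result = model(low_freq_bg, original_image)
    loss1 = lab_stat_loss(forward_result, low_freq_bg, rgb_to_lab)        # first loss
    optimizer.zero_grad(); loss1.backward(); optimizer.step()

    # Reverse training: map the forward result back towards the original video image
    # using the forward-trained model.
    reverse_result = model(low_freq_original, forward_result.detach())
    loss2 = F.l1_loss(reverse_result, original_image)                     # second loss (assumed L1)
    loss3 = lab_stat_loss(reverse_result, low_freq_original, rgb_to_lab)  # third loss
    optimizer.zero_grad(); (loss2 + loss3).backward(); optimizer.step()
    return loss1.item(), loss2.item(), loss3.item()
```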
In some possible implementations of the first aspect, performing bottom-layer consistent rendering based on the virtual background image to be rendered to obtain the bottom-layer-consistent subject image may also include: converting the virtual background image to be rendered to the LAB color space to obtain a first image; calculating the first standard deviation and the first mean of the L, A, and B channels of the first image respectively; converting the subject image to the LAB color space to obtain a second image; correcting the second image according to the first standard deviation and the first mean to obtain a third image, where the difference between the second standard deviation of the L, A, and B channels of the third image and the first standard deviation is within a first preset threshold interval, and the difference between the second mean and the first mean is within a second preset threshold interval; and converting the third image from the LAB color space to the RGB color space to obtain a fourth image, where the fourth image is the bottom-layer-consistent subject image.
It can be understood that, in the LAB color space, each channel has its own standard deviation and mean, and the standard deviation and mean of the corresponding channel in the second image are corrected based on the standard deviation and mean of each channel in the first image. For example, suppose the L channel of the first image has first standard deviation A1 and first mean B1, the A channel has first standard deviation A2 and first mean B2, and the B channel has first standard deviation A3 and first mean B3. The standard deviation of the L channel of the second image is then set to A1 and its mean to B1; the standard deviation of the A channel of the second image is set to A2 and its mean to B2; and the standard deviation of the B channel of the second image is set to A3 and its mean to B3. The second image is thus corrected according to the standard deviations and means of the first image to obtain the third image.
The first preset threshold interval and the second preset threshold interval may both be set according to actual needs; for example, both may be set to 0.
This bottom-layer consistent rendering process can further improve the virtual background replacement effect.
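This channel-wise statistic matching is essentially a Reinhard-style color transfer. A minimal OpenCV/NumPy sketch, assuming 8-bit BGR inputs and both preset threshold intervals set to 0 (exact matching):

```python
import cv2
import numpy as np

def match_lab_statistics(subject_bgr: np.ndarray, background_bgr: np.ndarray) -> np.ndarray:
    """Correct the subject's per-channel LAB mean/standard deviation to match the background's."""
    bg_lab = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)   # first image
    subj_lab = cv2.cvtColor(subject_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)    # second image

    corrected = np.empty_like(subj_lab)
    for c in range(3):  # L, A, B channels
        bg_mean, bg_std = bg_lab[..., c].mean(), bg_lab[..., c].std() + 1e-6       # first mean/std
        s_mean, s_std = subj_lab[..., c].mean(), subj_lab[..., c].std() + 1e-6     # second mean/std
        # Shift and scale the subject channel so it takes on the background's statistics.
        corrected[..., c] = (subj_lab[..., c] - s_mean) / s_std * bg_std + bg_mean  # third image

    corrected = np.clip(corrected, 0, 255).astype(np.uint8)
    return cv2.cvtColor(corrected, cv2.COLOR_LAB2BGR)                               # fourth image
```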
In some possible implementations of the first aspect, obtaining the rendered image according to the bottom-layer-consistent subject image and the virtual background image to be rendered may include: inputting the bottom-layer-consistent subject image into a pre-trained first STN (spatial transformer network) to obtain a first transformation matrix output by the first STN; inputting the virtual background image to be rendered into a pre-trained second STN to obtain a second transformation matrix output by the second STN; applying an affine transformation to the bottom-layer-consistent subject image using the first transformation matrix to obtain a first transformed image; applying an affine transformation to the virtual background image to be rendered using the second transformation matrix to obtain a second transformed image; and compositing the first transformed image and the second transformed image to obtain the rendered image.
In this implementation, the STNs place the subject image at a more plausible position, further improving the rationality and realism of the background-replaced image.
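To illustrate the affine step, the sketch below applies the 2x3 affine matrices predicted by two STNs and alpha-composites the results using a subject mask. The STN modules and the mask are assumed inputs; the exact composition is not specified in this form by the present application.

```python
import torch
import torch.nn.functional as F

def compose_with_stn(subject, subject_mask, background, stn_subject, stn_background):
    """subject/background: (N, 3, H, W) tensors; subject_mask: (N, 1, H, W) alpha matte;
    stn_subject/stn_background: modules predicting a (N, 2, 3) affine matrix (theta)."""
    theta_s = stn_subject(subject)        # first transformation matrix
    theta_b = stn_background(background)  # second transformation matrix

    grid_s = F.affine_grid(theta_s, subject.size(), align_corners=False)
    grid_b = F.affine_grid(theta_b, background.size(), align_corners=False)

    warped_subject = F.grid_sample(subject, grid_s, align_corners=False)        # first transformed image
    warped_mask = F.grid_sample(subject_mask, grid_s, align_corners=False)
    warped_background = F.grid_sample(background, grid_b, align_corners=False)  # second transformed image

    # Alpha-composite the warped subject over the warped virtual background.
    return warped_mask * warped_subject + (1 - warped_mask) * warped_background
```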
In some possible implementations of the first aspect, the process of placing the second virtual object to be rendered in the target virtual background image may include: determining, according to a semantic segmentation result of the image to be processed, a first position in the image to be processed of the interactive object corresponding to the preset action; using the second position in the target virtual background image that corresponds to the first position as the rendering position of the second virtual object to be rendered; determining that the depth value of the interactive object in the image to be processed is greater than that of the target object; and rendering the second virtual object to be rendered at this rendering position in the target virtual background image.
In some possible implementations of the first aspect, detecting that the target object in the image to be processed is occluded by the first object may include: determining the category of each pixel in the image to be processed according to the semantic segmentation result of the image to be processed; acquiring depth information of the image to be processed; and when it is determined from the depth information that target pixels whose depth values are smaller than that of the target object exist within a preset range of the target object, taking the category corresponding to those target pixels as the first object and determining that the target object is occluded by the first object.
It can be understood that the preset range of the target object may be a preset neighborhood around the pixels corresponding to the target object. After the target pixels are determined, they are mapped to the semantic segmentation result, that is, their category is determined according to the category of each pixel in the semantic segmentation result, thereby determining the category of the occluding object. Conversely, if no such target pixels exist, the target object is not occluded.
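A simplified sketch of this depth-plus-segmentation occlusion test, assuming a per-pixel depth map, a semantic label map, and a binary subject mask as inputs (all of which are illustrative assumptions):

```python
import cv2
import numpy as np

def detect_occluder(depth: np.ndarray, labels: np.ndarray, subject_mask: np.ndarray,
                    margin_px: int = 15):
    """Return the semantic label of an occluding object, or None if the subject is unoccluded.
    depth: (H, W) depth map; labels: (H, W) integer class map; subject_mask: (H, W) bool mask."""
    subject_depth = np.median(depth[subject_mask])  # representative depth of the target object

    # Preset range: a band of pixels around the subject, obtained by dilating the mask.
    kernel = np.ones((2 * margin_px + 1, 2 * margin_px + 1), np.uint8)
    dilated = cv2.dilate(subject_mask.astype(np.uint8), kernel).astype(bool)
    neighborhood = dilated & ~subject_mask

    # Target pixels: pixels in that band whose depth value is smaller than the subject's.
    occluding = neighborhood & (depth < subject_depth)
    if not occluding.any():
        return None  # the target object is not occluded
    # The occluder's category is the most frequent semantic label among those pixels.
    return int(np.bincount(labels[occluding]).argmax())
```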
In some possible implementations of the first aspect, before rendering the first virtual object to be rendered and/or the second virtual object to be rendered in the target virtual background image, the method may further include: determining a virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image; and displaying the virtual background image to be recommended.
In this implementation, the virtual background image is recommended to the user according to the similarity between the original background image and the virtual background images, so that the virtual background image used for background replacement is more relevant to the original background image.
In some possible implementations of the first aspect, determining the virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image may include: performing foreground-background segmentation on the image to be processed to obtain the original background image of the image to be processed; performing multi-class semantic segmentation on the original background image to obtain a second semantic segmentation result; performing multi-class semantic segmentation on each second virtual background image to obtain a third semantic segmentation result of each second virtual background image; calculating the IoU value between the original background image and each second virtual background image according to the second semantic segmentation result and the third semantic segmentation results; calculating a first color distribution curve of the original background image and a second color distribution curve of each second virtual background image; calculating the curve similarity between the first color distribution curve and each second color distribution curve; and determining the virtual background image to be recommended from the second virtual background images according to the curve similarities and the IoU values.
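One plausible realization of this recommendation combines the per-class IoU with a histogram-based color-curve similarity. The equal weighting and 64-bin histograms below are assumptions made for illustration, not choices specified by the present application.

```python
import cv2
import numpy as np

def class_iou(seg_a: np.ndarray, seg_b: np.ndarray, num_classes: int) -> float:
    """Mean IoU over the semantic classes present in either segmentation map."""
    ious = []
    for c in range(num_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union:
            ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def color_curve_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Compare color distribution curves via per-channel histogram correlation."""
    sims = []
    for c in range(3):
        ha = cv2.calcHist([img_a], [c], None, [64], [0, 256])
        hb = cv2.calcHist([img_b], [c], None, [64], [0, 256])
        sims.append(cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL))
    return float(np.mean(sims))

def recommend_background(original_bg, original_seg, candidates, candidate_segs,
                         num_classes: int, weight: float = 0.5) -> int:
    """Return the index of the second virtual background image with the best combined score."""
    scores = [weight * class_iou(original_seg, seg, num_classes) +
              (1 - weight) * color_curve_similarity(original_bg, img)
              for img, seg in zip(candidates, candidate_segs)]
    return int(np.argmax(scores))
```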
In some possible implementations of the first aspect, the method may further include: if the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, rendering the first virtual object to be rendered in the target virtual background image to obtain a virtual background image to be rendered, or taking the target virtual background image as the virtual background image to be rendered;
in this case, after image rendering is performed according to the virtual background image to be rendered and the subject image to obtain a rendered image, the method may further include: rendering the second virtual object to be rendered in the rendered image, at the rendering position of the second virtual object to be rendered, to obtain an output image.
That is, when it is detected that the target object performs the preset action but the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than that of the target object, the second virtual object to be rendered is not rendered in the target virtual background image; instead, the rendered image is obtained first, and the second virtual object to be rendered is then fused with the rendered image to obtain the final output image.
In other words, when the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than that of the target object, the rendering order is: render the subject first, then the second virtual object to be rendered; conversely, when the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than that of the target object, the rendering order is: render the second virtual object to be rendered first, then the subject.
In a second aspect, an embodiment of the present application provides an image rendering apparatus, which may include:
an image acquisition module, configured to acquire an image to be processed; a detection module, configured to detect that a target object in the image to be processed performs a preset action and/or that the target object is occluded by a first object; a virtual object determination module, configured to determine a second virtual object to be rendered corresponding to the preset action and/or a first virtual object to be rendered corresponding to the first object; a virtual object rendering module, configured to render the first virtual object to be rendered and/or the second virtual object to be rendered in a target virtual background image to obtain a virtual background image to be rendered, where the depth value of the interactive object corresponding to the second virtual object to be rendered is greater than that of the target object; and a rendering module, configured to perform image rendering according to the virtual background image to be rendered and the subject image to obtain the rendered image, where the subject image is extracted from the image to be processed and contains the target object.
In some possible implementations of the second aspect, the rendering module is specifically configured to: perform bottom-layer consistent rendering based on the virtual background image to be rendered to obtain a bottom-layer-consistent subject image; and perform image rendering according to the bottom-layer-consistent subject image and the virtual background image to be rendered to obtain the rendered image.
In some possible implementations of the second aspect, the rendering module is specifically configured to: input a low-frequency image of the virtual background image to be rendered and the image to be processed into a pre-trained first style migration model to obtain the bottom-layer-consistent image to be processed output by the first style migration model; and extract the subject image from the bottom-layer-consistent image to be processed to obtain the bottom-layer-consistent subject image.
In some possible implementations of the second aspect, the apparatus further includes a model training module configured to: obtain a training data set, where the training data set includes first virtual background images and original video images; input a low-frequency image of a first virtual background image and an original video image into a pre-constructed second style migration model to obtain a forward training result output by the second style migration model; calculate a first loss value between the forward training result and the low-frequency image of the first virtual background image; input the forward training result and a low-frequency image of the original video image into the forward-trained second style migration model to obtain a reverse training result output by the forward-trained second style migration model; calculate a second loss value between the reverse training result and the original video image; calculate a third loss value between the reverse training result and the low-frequency image of the original video image; adjust the network parameters of the second style migration model according to the first loss value, and adjust the network parameters of the forward-trained second style migration model according to the second loss value and the third loss value; and repeat this training process until a preset condition is met to obtain the trained first style migration model.
In some possible implementations of the second aspect, the rendering module is specifically configured to: convert the virtual background image to be rendered to the LAB color space to obtain a first image; calculate the first standard deviation and the first mean of the L, A, and B channels of the first image respectively; convert the subject image to the LAB color space to obtain a second image; correct the second image according to the first standard deviation and the first mean to obtain a third image, where the difference between the second standard deviation of the L, A, and B channels of the third image and the first standard deviation is within a first preset threshold interval, and the difference between the second mean and the first mean is within a second preset threshold interval; and convert the third image from the LAB color space to the RGB color space to obtain a fourth image, where the fourth image is the bottom-layer-consistent subject image.
In some possible implementations of the second aspect, the rendering module is specifically configured to: input the bottom-layer-consistent subject image into a pre-trained first STN to obtain a first transformation matrix output by the first STN; input the virtual background image to be rendered into a pre-trained second STN to obtain a second transformation matrix output by the second STN; apply an affine transformation to the bottom-layer-consistent subject image using the first transformation matrix to obtain a first transformed image; apply an affine transformation to the virtual background image to be rendered using the second transformation matrix to obtain a second transformed image; and composite the first transformed image and the second transformed image to obtain the rendered image.
In some possible implementations of the second aspect, the virtual object rendering module is specifically configured to: determine, according to a semantic segmentation result of the image to be processed, a first position in the image to be processed of the interactive object corresponding to the preset action; use the second position in the target virtual background image that corresponds to the first position as the rendering position of the second virtual object to be rendered; determine that the depth value of the interactive object in the image to be processed is greater than that of the target object; and render the second virtual object to be rendered at this rendering position in the target virtual background image.
In some possible implementations of the second aspect, the detection module is specifically configured to: determine the category of each pixel in the image to be processed according to the semantic segmentation result of the image to be processed; acquire depth information of the image to be processed; and when it is determined from the depth information that target pixels whose depth values are smaller than that of the target object exist within a preset range of the target object, take the category corresponding to those target pixels as the first object and determine that the target object is occluded by the first object.
In some possible implementations of the second aspect, the apparatus further includes a background recommendation module configured to: determine a virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image; and display the virtual background image to be recommended.
In some possible implementations of the second aspect, the background recommendation module is specifically configured to: perform foreground-background segmentation on the image to be processed to obtain the original background image of the image to be processed; perform multi-class semantic segmentation on the original background image to obtain a second semantic segmentation result; perform multi-class semantic segmentation on each second virtual background image to obtain a third semantic segmentation result of each second virtual background image; calculate the IoU value between the original background image and each second virtual background image according to the second semantic segmentation result and the third semantic segmentation results; calculate a first color distribution curve of the original background image and a second color distribution curve of each second virtual background image; calculate the curve similarity between the first color distribution curve and each second color distribution curve; and determine the virtual background image to be recommended from the second virtual background images according to the curve similarities and the IoU values.
In some possible implementations of the second aspect, the virtual object rendering module is further configured to: if the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, render the first virtual object to be rendered in the target virtual background image to obtain the virtual background image to be rendered, or use the target virtual background image as the virtual background image to be rendered; and render the second virtual object to be rendered in the rendered image to obtain an output image.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to any one of the first aspect is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a chip system, where the chip system includes a processor, and the processor is coupled with a memory, and executes a computer program stored in the memory to implement the method according to any one of the above first aspects. The chip system can be a single chip or a chip module consisting of a plurality of chips.
In a sixth aspect, embodiments of the present application provide a computer program product, which, when run on a terminal device, causes the terminal device to perform the method of any one of the first aspect.
It is understood that the beneficial effects of the second to sixth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic structural diagram of a terminal device 100 according to an embodiment of the present application;
fig. 2 is a block diagram of a software structure of the terminal device 100 according to an embodiment of the present disclosure;
FIG. 3 is a schematic block diagram of a flow of an image rendering scheme provided by an embodiment of the present application;
fig. 4 is an interface schematic diagram of a video call scene according to an embodiment of the present application;
fig. 5 is a schematic view of a virtual background image provided in an embodiment of the present application;
FIG. 6 is a schematic interface diagram of a background recommendation process provided in an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an effect of consistent rendering based on structural semantics according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an effect of consistent rendering based on an interaction relationship according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a model-based underlying consistent rendering provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a style migration model training process provided in an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating effects of bottom-level consistent rendering according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a bottom-layer consistency rendering process based on an image processing algorithm according to an embodiment of the present application;
fig. 13 is a schematic diagram illustrating an effect of bottom-layer consistent rendering based on an image processing algorithm according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a consistent rendering process based on a position relationship according to an embodiment of the present application;
fig. 15 is a schematic diagram of an STN network training process provided in an embodiment of the present application;
FIG. 16 is a schematic block diagram illustrating an effect of consistent rendering based on a position relationship according to an embodiment of the present application;
fig. 17 is a flowchart illustrating an image rendering process according to an embodiment of the present application;
FIG. 18 is a schematic flow chart of an image rendering process according to an embodiment of the present disclosure;
FIG. 19 is a schematic flowchart of an image rendering process according to an embodiment of the present disclosure;
fig. 20 is a schematic view of a video call scene of a large-screen device according to an embodiment of the present application;
fig. 21 is a schematic diagram illustrating a variation of a virtual background replacement image according to an embodiment of the present disclosure;
fig. 22 is an interface schematic diagram of a virtual background replacement process in a shooting scene according to an embodiment of the present application;
fig. 23 is an interface schematic diagram of a virtual background replacement process in a video recording scene according to an embodiment of the present application.
Detailed Description
In current virtual background replacement, the correlation between the original background and the virtual background is usually not considered during image rendering, so the background-replaced image lacks plausibility and realism; that is, implausible or even unrealistic artifacts appear in the background-replaced image. The original background refers to the background in the original image.
For example, the content and positions of objects in the virtual background are not considered, and a person is rendered onto a table and chair in the virtual background, so the implausible artifact of a person floating above the chair appears in the background-replaced image.
For another example, whether the human body in the original image is occluded by an object is not considered, and the body is rendered directly at the corresponding position in the virtual background, so the implausible artifact of a floating body, or of only part of the body being visible, appears in the background-replaced image.
For another example, when the person in the original image sits down from a standing position, the person is rendered directly at the corresponding position in the virtual background, so the implausible artifact of the person sitting in mid-air rather than on a supporting object such as a chair appears in the background-replaced image.
In addition, in current virtual background replacement, the consistency processing of hue, brightness, and so on is simplistic, so the hue, brightness, and other properties of the foreground are inconsistent with those of the virtual background, which degrades the fusion of the foreground and the virtual background.
In view of the above problems, embodiments of the present application provide an image rendering scheme that considers the correlation between the original background and the virtual background during image rendering to improve the plausibility and realism of virtual background replacement, thereby improving the virtual background replacement effect.
Furthermore, the embodiments of the present application also render the hue, brightness, contrast, color, and so on of the foreground consistently with those of the virtual background, so that these properties of the foreground match those of the virtual background, improving the fusion of the foreground and the virtual background.
The image rendering scheme provided by the embodiments of the present application can be applied to a terminal device, which may be a portable terminal device such as a mobile phone, a tablet computer, a notebook computer, or a wearable device; an augmented reality (AR) or virtual reality (VR) device; or a terminal device such as a vehicle-mounted device, a netbook, or a smart screen. The embodiments of the present application do not impose any limitation on the specific type of terminal device.
Fig. 1 shows an exemplary structural diagram of a terminal device 100.
The terminal device 100 may include a processor 110, a memory 120, a camera 130, a display 140, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the terminal device 100. In other embodiments of the present application, terminal device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), and/or a neural-Network Processing Unit (NPU), etc. Wherein, the different processing units may be independent devices or may be integrated in one or more processors.
The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
In some embodiments, processor 110 may include one or more interfaces. The interface may include a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, and the like.
The MIPI interface may be used to connect the processor 110 with peripheral devices such as the display screen 140, the camera 130, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 130 communicate via a CSI interface to implement the capture function of terminal device 100. The processor 110 and the display screen 140 communicate through the DSI interface to implement the display function of the terminal device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 130, the display screen 140, and the like. It should be understood that the interface connection relationship between the modules illustrated in the embodiment of the present application is only an exemplary illustration, and does not constitute a limitation on the structure of the terminal device 100. In other embodiments of the present application, the terminal device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The terminal device 100 implements a display function by the GPU, the display screen 140, and the application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display screen 140 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 140 is used to display images, video, and the like. The display screen 140 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 140, where N is a positive integer greater than 1.
The terminal device 100 may implement a photographing function through the ISP, the camera 130, the video codec, the GPU, the display screen 140, the application processor, and the like.
The ISP is used to process the data fed back by the camera 130. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in camera 130.
The camera 130 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the terminal device 100 may include 1 or N cameras 130, where N is a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record video in a plurality of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the terminal device 100, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The memory 120 may be used to store computer-executable program code, which includes instructions. The memory 120 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data (such as audio data and a phonebook) created during use of the terminal device 100, and the like. In addition, the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the terminal device 100 by running instructions stored in the memory 120 and/or instructions stored in a memory provided in the processor.
The software system of the terminal device 100 may adopt a hierarchical architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the terminal device 100.
Fig. 2 is a block diagram of a software structure of the terminal device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 2, the application packages may include camera, gallery, calendar, phone, map, navigation, WLAN, bluetooth, music, video, connect-to-talk applications, etc.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the Java language needs to call, and the other part is the Android core library.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, three-dimensional graphics processing libraries (e.g., OpenGL ES), and 2D graphics engines (e.g., SGL).
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The following exemplarily explains the workflow of the software and hardware of the terminal device 100 in conjunction with a shooting scene.
When the touch sensor receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the timestamp of the touch operation). The raw input events are stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking as an example a touch operation that is a tap whose corresponding control is the camera application icon, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 130.
After capturing an image or video, the camera 130 converts the electrical signal into a digital image signal through the ISP. The digital image signal is input into DSP for processing, and converted into image signal in standard RGB, YUV and other formats. And finally, displaying images through the GPU, the display screen, the application processor and the like.
After the image signal is acquired, the CPU executes the image rendering scheme provided by the embodiments of this application to render the image and obtain the image after background replacement, and the image after background replacement is displayed through the GPU, the display screen, the application processor, and the like.
The following exemplarily introduces an image rendering scheme provided by an embodiment of the present application according to the terminal device 100 shown in fig. 1 and 2.
Referring to the flow schematic block diagram of the image rendering scheme shown in fig. 3, the image rendering scheme may include the steps of:
in step S301, the terminal device 100 acquires a video stream.
It can be understood that the terminal device 100 may collect a video stream in real time through an integrated camera, where the camera may be a front-facing camera or a rear-facing camera; it may receive a video stream transmitted by another terminal device; or it may read a video stream that has been recorded in advance and stored locally. The embodiments of this application do not limit the manner in which the terminal device 100 acquires the video stream.
For example, referring to the interface schematic diagram of a video call scene provided in the embodiments of this application shown in fig. 4, as shown in (a) to (c) of fig. 4, the terminal device 100 is specifically a mobile phone, and the main interface of the mobile phone includes the phone 41 and the smooth connection call 42, as well as applications such as smart life, settings, and the application mall. The user may initiate a video call via the phone 41 or the smooth connection call 42; the following is described with the user initiating a video call via the phone 41.
First, the user clicks the phone 41 on the main interface of the mobile phone, and the mobile phone displays the main interface of the phone 41 in response to the user's click operation. Then, the user clicks the open communication session 43 in the function bar, and the mobile phone displays the open communication session interface 44 in response to the user's click operation. Dialable contacts are displayed within the open communication session interface 44. Then, the user may click the control 45, and the mobile phone initiates a network video call to dewy in response to the user's click operation and displays the call interface 46. A magic pen 47 is included in the call interface 46.
In this process, after the mobile phone receives the click operation on the control 45, the front-facing camera is called to acquire image data, and at this time the mobile phone can acquire the video stream. The acquired image signals are then displayed on the display screen through the DSP, the GPU, the application processor, and the like.
In step S302, the terminal device 100 extracts a subject image from the video image of the video stream.
It should be noted that the video stream includes a plurality of frames of images that are consecutive in time sequence, and the terminal device 100 may extract a main image for each frame of image, or extract a main image at intervals of a preset number of frames, which may be determined according to actual application requirements, and is not limited herein.
In specific applications, the manner in which the terminal device 100 extracts the subject image from the video image may be arbitrary. For example, the subject image can be extracted from the video image through semantic segmentation and instance segmentation. In this case, multi-class semantic segmentation is first performed on the video image to obtain a multi-class semantic segmentation result; instance segmentation is then performed on the multi-class semantic segmentation result to obtain the subject image in the video image.
Semantic segmentation may assign each pixel in the image a semantic label, where the semantic label identifies the category to which the pixel belongs. For example, the label of the person category is set to red, that is, all pixels belonging to a person in the image are marked as red.
In other words, the categories contained in the original video image, the positions and ratios of the categories, and the like can be known through the semantic segmentation result. For example, the original video image contains two categories of a person and a tree, and the positions and the ratios of the two categories of the person and the tree in the original video image can be known through the multi-category semantic segmentation result.
Instance segmentation can distinguish different individuals of the same class on the basis of semantic segmentation. That is, instance segmentation can distinguish between different individuals belonging to the same category; for example, two individuals belonging to the category "person" can be distinguished as person 1 and person 2.
A video image can generally be divided into a foreground and a background, and the subject image described above can generally be understood as the foreground in the video image. For convenience of explanation, this application refers to the video image in the video stream from which the subject image is extracted as the original video image.
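For illustration only, the following Python sketch outlines how a subject image might be extracted by combining multi-class semantic segmentation with instance segmentation. The model objects, the label value for the person category, and the helper names are assumptions made for the example and are not defined by this application:

    import numpy as np

    def extract_subject(frame, semantic_model, instance_model):
        # Multi-class semantic segmentation: one class label per pixel.
        # `semantic_model` and `instance_model` are assumed pretrained models.
        class_map = semantic_model(frame)                 # (H, W) integer labels
        # Instance segmentation on top of the semantic result: split the
        # "person" class into person 1, person 2, and so on.
        instance_map = instance_model(frame, class_map)   # (H, W) instance ids

        PERSON = 1                                        # assumed label for "person"
        subject_mask = (class_map == PERSON)
        if subject_mask.any():
            # Keep only the largest person instance as the subject.
            ids, counts = np.unique(instance_map[subject_mask], return_counts=True)
            subject_mask &= (instance_map == ids[np.argmax(counts)])

        # Subject image: original pixels inside the mask, zero elsewhere.
        subject = np.where(subject_mask[..., None], frame, 0)
        return subject, subject_mask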
In step S303, the terminal device 100 determines a target virtual background image.
It is understood that the target virtual background image refers to a virtual background for replacing the original background. After the terminal device 100 determines the target virtual background image, an image rendering process may be performed according to the target virtual background image and the main body image, so as to obtain an output image after background replacement.
In some embodiments, the target virtual background image may be selected by a user. At this time, the terminal device 100 may display the plurality of virtual background images in the virtual background library on the display screen in a preset display order for the user to select. After the user selection, the terminal device 100 takes the virtual background image selected by the user as the target virtual background image in response to the user's selection operation.
For example, referring to an interface schematic diagram of a video call scene provided in the embodiment of the present application shown in fig. 4, (a) to (c) in fig. 4 are processes of how to initiate a smooth connection call, and are not described herein again.
As shown in fig. 4 (d) to (f), when the called party answers, the mobile phone displays the answering interface 48. A magic pen 47 is also included in the answering interface 48.
It will be appreciated that the handset captures the video stream through the camera and displays the video stream in real time on the call interface 46 and the answer interface 48. At this time, the call interface 46 and the listening interface 48 display video images without virtual background replacement. The real background of the video image is not shown in fig. 4.
The user can make virtual background replacement by the magic pen 47. Referring to (e) in fig. 4, when the user clicks the magic pen 47, the mobile phone displays a window 49 on the answering interface 48 in response to the user's clicking operation on the magic pen 47, and the window 49 includes two options of skin care and scene. After the user clicks the scene option, the mobile phone responds to the click operation of the user, and displays the virtual background images 411 to 414 on the window 49. The virtual background images 411-414 can be as shown in FIG. 5.
The user clicks the virtual background image 411 in the window 49, and the mobile phone responds to the clicking operation of the user and takes the virtual background image 411 as a target virtual background image. Therefore, the mobile phone determines the target virtual background image in a mode selected by the user.
After the mobile phone determines the target virtual background image, image rendering may be further performed according to the target virtual background image and the main body image to obtain an output image after background replacement, and the output image is displayed on the display screen to obtain the interface 410 after background replacement. The process of rendering images according to the target virtual background image and the subject image is described in detail below.
In this case, the mobile phone does not recommend the virtual background, and when the user clicks the scene option in the window 49, the mobile phone sequentially displays the virtual background images in the virtual background library in the window 49 according to the default display order. At this time, the default display order is the virtual background image 411, the virtual background image 412, the virtual background image 413, and the virtual background image 414.
It should be noted that, in addition to performing the virtual background replacement through the magic pen 47 in the listening interface 48, the virtual background replacement may also be performed by invoking the magic pen 47 in the call interface 46; the two processes are the same and are not described again here.
In other embodiments, the terminal device 100 may also determine the virtual background image to be recommended according to the similarity between each virtual background image and the original background image; and then displaying the virtual background image to be recommended on a display screen so as to recommend the virtual background to the user.
Illustratively, the virtual context recommendation process may be as follows:
firstly, the terminal device 100 performs multi-class semantic segmentation on an original background image to obtain a first segmentation result; and performing multi-class semantic segmentation on each virtual background image in the virtual background gallery to obtain a second segmentation result of each virtual background image.
The original background image refers to a background image in an original video image, and can be obtained by performing foreground and background segmentation on the original video image.
Then, an Intersection over Union (IOU) value between the first segmentation result and each second segmentation result is calculated. The IOU value may be used to characterize the similarity in structure and content between the original background image and the virtual background image.
Then, the terminal device 100 performs color distribution curve statistics on the original background image to obtain a first color distribution curve; and carrying out color distribution curve statistics on each virtual background image in the virtual background library to obtain a second color distribution curve. And then calculating the curve similarity between the first color distribution curve and each second color distribution curve. The curve similarity may be used to characterize the color similarity between the original background image and the respective virtual background images.
Finally, the terminal device 100 determines the virtual background image to be recommended according to the IOU value and the curve similarity.
Specifically, a first weight of the IOU value and a second weight of the curve similarity are set in advance. For each virtual background image, multiplying the IOU value of the virtual background image by a first weight to obtain a first product, and multiplying the curve similarity by a second weight to obtain a second product; and then adding the first product and the second product of each virtual background image to obtain the recommendation score of each virtual background image. And finally, sorting according to the recommendation scores of the virtual background images, and screening the first K virtual background images as the virtual background images to be recommended, wherein K is a positive integer.
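As a minimal sketch of the weighted scoring described above (the weight values and K below are illustrative assumptions, not values fixed by this application):

    import numpy as np

    def recommend_backgrounds(iou_values, curve_similarities,
                              first_weight=0.5, second_weight=0.5, top_k=3):
        # iou_values[i]: IOU between the segmentation of the original background
        # and of virtual background i; curve_similarities[i]: similarity of
        # their color distribution curves.
        scores = (first_weight * np.asarray(iou_values)
                  + second_weight * np.asarray(curve_similarities))
        order = np.argsort(scores)[::-1]      # indices sorted by score, high to low
        return order[:top_k], scores

    # Example with 4 candidate backgrounds, recommending the top 3.
    top_indices, scores = recommend_backgrounds(
        iou_values=[0.12, 0.61, 0.47, 0.35],
        curve_similarities=[0.20, 0.55, 0.58, 0.44])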
For example, assume that the virtual background library includes the 4 virtual background images in fig. 5. Through the virtual background recommendation process of the above example, the mobile phone calculates the recommendation score of each virtual background image in fig. 5; in descending order of recommendation score, they are the virtual background image 412, the virtual background image 413, the virtual background image 414, and the virtual background image 411. K is 3, that is, the first 3 virtual background images are screened as the virtual background images to be recommended.
Referring to the interface schematic diagram of the background recommendation process shown in fig. 6, as shown in fig. 6, based on the video call scene shown in fig. 4, after the mobile phone determines that a virtual background image is to be recommended, when the user needs to replace the background with the magic pen 47, the mobile phone responds to the operation of the user on the magic pen 47, and displays a window 49 on the listening interface. Then, the mobile phone displays the virtual background image 412, the virtual background image 413, the virtual background image 414, and the virtual background image 411 in the window 49 in this order from left to right according to the level of the recommendation score. In addition, a box is added to the display positions of the virtual background image 412, the virtual background image 413 and the virtual background image 414 to prompt the user, that is, the virtual background images 412 to 414 are recommended to the user in a box adding mode.
After the user clicks the recommended virtual background image 412 in the window 49, the mobile phone performs image rendering according to the virtual background image 412 and the main body image in response to the clicking operation of the user, so as to obtain an output image after background replacement, and displays the output image on the display screen, so as to obtain an interface 415 after background replacement.
The manner of prompting the virtual background recommendation is arbitrary, and is not limited to the boxed form shown in fig. 6. For example, a box with different colors may be added to the virtual background image according to the recommendation score, or an arrow indicator may be added to each virtual background image to be recommended, so as to guide the user to select the virtual background image to be recommended.
In addition to the virtual background recommendation in the window 49, the virtual background recommendation may also be performed in the form of a pop-up window, for example, after the mobile phone determines the virtual background image to be recommended, a window is actively popped up in the listening interface 48, and the virtual background image to be recommended is displayed in the window.
In addition, the virtual context recommendation process is not limited to the above-mentioned process. For example, in the process of determining the virtual background image to be recommended according to the IOU value and the curve similarity, the virtual background image with the recommendation score higher than the preset score threshold may be screened out as the virtual background image to be recommended.
In contrast, if virtual background recommendation is not performed, the color, structure, and content of the virtual background image selected by the user may differ greatly from those of the original background image, so that the virtual background replacement effect is poor. By recommending virtual backgrounds to the user based on the similarity between the original background image and the virtual background images, the target virtual background image becomes more related to the original background image, that is, its color, structure, and content are more similar to those of the original background image, which further improves the virtual background replacement effect and the user experience.
The above-described manners of determining the target virtual background image all need to be selected manually, and in other embodiments, the terminal device 100 may also actively determine the target virtual background image without human intervention. At this time, the terminal device 100 may randomly select one virtual background image as the target virtual background image, or may calculate the recommendation score of each virtual background image in the virtual background library through the above-mentioned virtual background recommendation process, and select the virtual background image with the highest recommendation score as the target virtual background image.
Step S304, the terminal device 100 performs image rendering according to the main body image and the target virtual background image, and obtains an output image. The output image is the image after the background replacement.
Because the background replacement process involves an image rendering process, the image after the background replacement is regarded as the image after the image rendering.
Specifically, after determining the target virtual background image, the terminal device 100 performs consistent rendering according to the main body image and the target virtual background image to obtain an image with a replaced background, and displays the image on a display screen.
It should be noted that consistent rendering in the embodiments of the present application refers to an image rendering process that takes into account a correlation between an original video image and a target virtual background image. And the correlation between the original video image and the target virtual background image may include at least one of: content, occlusion, location, interaction, etc.
The content refers to image content, that is, the image content of the image after the background replacement is consistent with the image content of the original video image. Specifically, the image after background replacement can be made consistent with the original video image in image content through a bottom layer consistency rendering process.
The underlying features may refer to the color, tone, brightness, contrast, etc. of the image, and rendering for the underlying features may be referred to as underlying consistent rendering. Through bottom layer consistency rendering, the color, tone, brightness, contrast and the like of the main image can be consistent with the color, tone, brightness, contrast and the like of the target virtual background image, so that the color, tone, brightness, contrast and the like of the main image part in the image after background replacement are consistent with the color, tone, brightness, contrast and the like of the background, and in addition, the image after background replacement is consistent with the image content of the original video image, so that the virtual background replacement effect is further improved.
The occlusion refers to whether an occlusion relationship exists between a main body and an object in an original background image, and when the occlusion relationship exists, the corresponding occlusion relationship is also reflected in the image after the background replacement, so that the image after the background replacement is consistent with the original video image in the occlusion relationship of the main body.
In specific application, the image after background replacement and the original video image are consistent in the main body occlusion relation through the consistency rendering process based on the structural semantics. Whether the main body is shielded or not can be determined through high-level consistency rendering based on structural semantics, and in the case that the main body is shielded, a virtual object needing to be rendered is determined.
The position refers to a position relationship between the subject and the object in the image after the background replacement, and may be specifically embodied as a rendering position of the subject in the virtual background image. The pose of the main body image in the target virtual background image can be determined through consistent rendering based on the position relation, and the main body image is rendered at a reasonable position in the virtual background image to be rendered.
The interaction refers to an interaction relationship between a subject and an object in an original video image, and may be specifically embodied as whether the subject performs a preset interaction, and if the preset interaction is performed, the interaction relationship exists between the subject and the corresponding object. Correspondingly, the image after the background replacement also shows the interactive relationship between the main body and the corresponding object, so that the image after the background replacement is consistent with the original video image in the interactive relationship. Through the consistent rendering based on the interaction relation, when the main body makes a preset action, the corresponding virtual object can be increased or decreased. For example, when the subject is a person, and the person in the original video image performs a "sitting" action, virtual objects such as chairs or stools are rendered at a reasonable position of the image, so as to increase the sense of reality after background replacement and improve the virtual background replacement effect.
In the image rendering process, the terminal device 100 may perform consistent rendering based on at least one of the underlying features, structural semantics, positional relationships, and interactive relationships.
In other words, in the process of obtaining the image after replacing the background based on the target virtual background image and the subject image, the terminal device 100 may perform at least one of the following consistent rendering processes: the method comprises the following steps of bottom-layer consistent rendering, structural semantic based consistent rendering, position relation based consistent rendering and interactive relation based consistent rendering.
The following describes each of the consistency rendering processes.
The process of consistent rendering based on structural semantics may be as follows:
first, the terminal device 100 may perform multi-class semantic segmentation on the original video image to obtain a multi-class semantic segmentation result.
It is to be understood that the original video image is generally referred to as the original video image in step S302.
Then, the terminal device 100 determines whether the subject is occluded according to the depth information of each category in the original video image. The depth information can represent the front-back relationship of each category, and whether other objects exist in front of the subject can be determined according to the depth information. If it is determined that the subject is occluded by other objects in front of it, the categories of those objects can be further determined according to the multi-class semantic segmentation result.
For example, the main body in the original video image is a person, and it is determined that there are other objects in front of the person in the original video image according to the depth information of the original video image. And then, determining that other objects in front of the person are tables through the multi-class semantic segmentation result, and determining that the person is shielded by the tables, namely that the person and the tables have shielding relation.
The category of each pixel point in the original video image can be known from the multi-class semantic segmentation result, and the distance between each pixel point and the camera (that is, the depth value) can be known from the depth map of the original video image. After the position and depth value of the subject are determined through the multi-class semantic segmentation result and the depth information, a search is made within a preset range around the subject for pixel points whose depth values are smaller than the depth value of the subject. If such pixel points exist, they are mapped to the multi-class semantic segmentation result to determine their categories, and thereby the category of the occluding object is determined.
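A simplified Python sketch of this occlusion check follows; the subject label, the use of the median as the subject depth, and the search margin are assumptions made for the example:

    import numpy as np

    def find_occluder_class(class_map, depth_map, subject_class=1, margin=20):
        # class_map: multi-class semantic segmentation result, shape (H, W).
        # depth_map: per-pixel distance to the camera, shape (H, W).
        subject_mask = (class_map == subject_class)
        if not subject_mask.any():
            return None
        subject_depth = np.median(depth_map[subject_mask])

        # Search a band of `margin` pixels around the subject's bounding box.
        ys, xs = np.nonzero(subject_mask)
        y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, class_map.shape[0])
        x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, class_map.shape[1])
        region = np.zeros_like(subject_mask)
        region[y0:y1, x0:x1] = True
        region &= ~subject_mask

        # Pixels closer to the camera than the subject are candidate occluders.
        occluding = region & (depth_map < subject_depth)
        if not occluding.any():
            return None                        # the subject is not occluded

        # Map the occluding pixels back to the semantic result to get the class.
        classes, counts = np.unique(class_map[occluding], return_counts=True)
        return classes[np.argmax(counts)]      # dominant occluder class, e.g. "table"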
After the terminal device 100 determines that the subject in the original video image is occluded and determines the category of the occluding object, it determines the first virtual object to be rendered. For example, the subject in the original video image is a person, and the front of the person is occluded by a table. At this time, the terminal device 100 may recommend virtual objects similar or related to a table, which may be, for example, various styles of tables previously entered into the virtual object library. The user may select one or more of the recommended virtual objects as the first virtual object to be rendered as needed.
Of course, the terminal device 100 may not perform the virtual object recommendation process, but directly select the relevant virtual object from the virtual object library as the first virtual object to be rendered.
After the terminal device 100 determines the first virtual object to be rendered, the first virtual object to be rendered may be used as a foreground to perform foreground and background fusion with the target virtual background image, so as to obtain a new virtual background image. For example, the first virtual object to be rendered is a table, the table is used as a foreground, and image rendering is performed on the table and the target virtual background image to obtain a new virtual background image.
During the process of rendering the first virtual object to be rendered to the target virtual background image, the center position of the shielding object in the original background image may be used as the rendering initial position of the first virtual object to be rendered. Of course, the user may also autonomously determine the rendering position of the first virtual object to be rendered. For example, the user may adjust the rendering position of the first virtual object to be rendered by means of dragging.
If the terminal device 100 determines that the main body in the original video image is not occluded by other objects according to the multi-class semantic segmentation result and the depth information, it may not be necessary to determine the first virtual object to be rendered, and it is not necessary to render the first virtual object to be rendered in the target virtual background image.
Illustratively, referring to the effect diagram of the structural semantic based consistent rendering shown in fig. 7, as shown in (a) in fig. 7, an image 71 is an original video image, which includes a subject 72 and a table 73. According to the multi-class semantic segmentation result of the image 71 and the depth information, it is determined that an occlusion relationship exists between the main body 72 and the table 73, namely, the main body 72 is occluded by the table 73. After the occlusion relationship between the main body 72 and the table 73 is determined, the first virtual object to be rendered may be further determined.
Referring to (b) in fig. 7, the image 74 is a target virtual background image. Assuming that the virtual object library includes the objects 75-77, and the object 75 is a virtual object recommended by the system, the user may be specifically prompted in a form of adding a box.
Assuming that the user selects the object 75 as the first virtual object to be rendered, the image 74 and the object 75 may be rendered first to obtain a virtual background image to be rendered, and then the background image to be rendered and the main body 72 may be synthesized to obtain an image with replaced background, such as the image 78 in (c) of fig. 7. By contrast, the subject 72 in the image 71 and the table 73 have an occlusion relationship, and the subject 72 in the image 78 and the object 75 have an occlusion relationship, so that the image after the background replacement is consistent with the original video image in the occlusion relationship.
The process of interactive relationship based consistent rendering may be as follows:
the terminal device 100 can perform motion recognition based on a plurality of consecutive frames of original video images. And if the main body in the original video image is recognized to make a preset action, determining a second virtual object to be rendered associated with the preset action. The preset action may be set according to practical applications, for example, the preset action is "sit down" or "hold object by hand" or the like. The motion recognition method may be any conventional method, and is not limited herein.
After recognizing the preset action, the terminal device 100 may select an object associated with the preset action from the virtual object library to determine a second virtual object to be rendered. The association relationship between each preset action and the virtual object can be preset, and the virtual object can be selected directly through the preset action subsequently. For example, the preset actions include "sit down" and "hold the object with the hand," the virtual object corresponding to the "sit down" is a chair or a stool, and the virtual object corresponding to the "hold the object with the hand" is a cup.
At this time, the user may also preset virtual objects to be used in the initialization stage, and after recognizing the preset action, the terminal device 100 automatically selects corresponding virtual objects from the virtual objects set by the user. For example, before a video call starts, a user presets which virtual objects may need to be used in the video call; after the setting is finished, in the video call process, after the mobile phone identifies the preset action, the corresponding virtual object is selected from the set virtual objects.
In some other embodiments, the terminal device 100 may also recommend a virtual object to the user after recognizing the preset action, and the user selects a desired virtual object. However, it takes a certain amount of time for the user to select the virtual object, which may cause the rendering of the virtual object to have a certain hysteresis, and therefore, in order to ensure that the subject can render the corresponding virtual object at the corresponding position in time after making the preset action, the user is not required to select the virtual object in general, but the terminal device autonomously determines the virtual object to be rendered.
After the terminal device 100 determines the second virtual object to be rendered, it needs to further determine a rendering position of the second virtual object to be rendered. Specifically, the terminal device 100 performs multi-class semantic segmentation on the original video image, and determines the position of the interactive object in the original video image according to the multi-class semantic segmentation result. The interactive object is an object corresponding to a preset action, for example, a main body in an original video image is a person, and the person performs a "sitting" action, that is, the person sits on a chair, which is the above-mentioned interactive object.
The classes contained in the image and the positions of the classes can be known through the multi-class semantic segmentation result, so that the positions of the interactive objects in the original video image can be known through the multi-class semantic segmentation result of the original video image.
And determining the rendering position of the second virtual object to be rendered in the virtual background image to be rendered according to the position of the interactive object in the original video image. For example, the interactive object is a chair, and the pixel position of the chair in the original video image is a first position; and taking the position corresponding to the first position in the virtual background image to be rendered as a rendering position.
Meanwhile, the terminal device 100 may determine a front-back position relationship between the interactive object and the main body in the original video image according to the depth information of each category in the original video image, that is, determine which of the interactive object and the main body is in front of the main body and which is behind the main body. And setting a rendering sequence between the second virtual object to be rendered and the main body according to the front-back position relation. For example, if the subject is front and the interactive object is back, the rendering order is: the interactive object is rendered first, and then the main body is rendered.
If the rendering order is: rendering the interactive object first, and then rendering the main body, after the terminal device 100 determines the second virtual object to be rendered, the second virtual object to be rendered may be used as a foreground to be rendered with the target virtual background image, so as to obtain a new virtual background image. And if the preset action is not recognized, not rendering the second virtual object to be rendered in the target virtual background image.
If the rendering order is: rendering the main body first, and then rendering the interactive object, after the terminal device 100 determines the second virtual object to be rendered, and after the main body image and the virtual background image to be rendered are fused, rendering the second virtual object to be rendered in the fused image, and obtaining a final output image.
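For illustration, the following sketch shows how the rendering order determined from the depth information might be applied when composing the output image; the image and mask names are placeholders, not terms defined by this application:

    import numpy as np

    def compose_with_interactive_object(subject, subject_mask, virtual_bg,
                                        virtual_obj, obj_mask, subject_in_front):
        # subject, virtual_bg, virtual_obj: (H, W, 3) images.
        # subject_mask, obj_mask: (H, W) boolean masks.
        # subject_in_front: True if the subject is in front of the interactive
        # object according to the depth information of the original video image.
        out = virtual_bg.copy()
        if subject_in_front:
            # Render the interactive object first, then the subject on top
            # (for example, a chair rendered behind a sitting person).
            out = np.where(obj_mask[..., None], virtual_obj, out)
            out = np.where(subject_mask[..., None], subject, out)
        else:
            # Render the subject first, then the interactive object over it.
            out = np.where(subject_mask[..., None], subject, out)
            out = np.where(obj_mask[..., None], virtual_obj, out)
        return out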
Referring to fig. 8, which is a schematic diagram illustrating an effect of consistent rendering based on an interaction relationship, as shown in fig. 8, an image 81 and an image 82 are original video images, and a time sequence order of the two images in an original video stream is: image 81 precedes image 82.
The image 83 and the image 84 are output images obtained by replacing the background using a conventional virtual background replacement method, where the image 83 corresponds to the image 81 and the image 84 corresponds to the image 82.
The image 85 and the image 86 are images obtained by replacing the background by the above-described consistency rendering based on the interaction relationship. The image 85 corresponds to the image 81, and the image 86 corresponds to the image 82.
With respect to the image 81 and the image 82, the human subject 87 in the image 81 is in a standing state; the human subject 87 in the image 82 is in a sitting state, i.e., the human subject 87 is from standing to sitting.
Taking the images 83 and 84 as the first group of images and the images 85 and 86 as the second group of images and comparing them, it can be seen that in the first group, when the person changes from standing to sitting, no chair is rendered at the corresponding position in the virtual background image, so the unreasonable phenomenon of the person subject 87 sitting in mid-air appears in the image after background replacement. In the second group, because the consistent rendering based on the interaction relationship is used, a chair 88 is rendered at the corresponding position when the "sitting down" action is recognized, so the person subject 87 sits on the chair 88 in the image after background replacement, which is more reasonable.
Bottom-layer consistency rendering process:
the bottom-level consistent rendering may include two different implementations, and the two different bottom-level consistent rendering implementations are described below.
The first method is as follows:
the terminal device 100 inputs the low-frequency image of the virtual background image to be rendered and the original video image into a pre-trained style migration model, and the output of the style migration model is the original video image after the rendering of the bottom layer consistency.
Image style migration (style transfer) refers to learning the style of a certain picture by an algorithm and then applying the style to another picture, or transferring the style of a picture to another picture.
The style migration model is a model for realizing image style migration, namely, the style of the virtual background image to be rendered can be migrated to the original video image through the style migration model.
Referring to FIG. 9, a flow diagram of a model-based underlying consistent rendering process is shown, which may include the steps of:
step S901, the terminal device 100 acquires a low-frequency image of the virtual background image to be rendered.
In a specific application, after the virtual background image to be rendered is determined, a low-frequency image of the virtual background image to be rendered may be generated.
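This application does not fix how the low-frequency image is obtained; one possible sketch is a strong Gaussian blur that keeps color, tone, and brightness while discarding texture detail (the kernel size and sigma below are assumptions made for the example):

    import cv2

    def low_frequency_image(virtual_bg, ksize=21, sigma=10):
        # virtual_bg: the virtual background image to be rendered (H, W, 3).
        return cv2.GaussianBlur(virtual_bg, (ksize, ksize), sigma)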
Step S902, the terminal device 100 inputs the low-frequency image and the original video image of the virtual background image to be rendered to the style migration model trained in advance, and obtains the original video image output by the style migration model and rendered with the consistency of the bottom layer.
Referring to the schematic diagram of the style migration model training process shown in fig. 10, the training process of the style migration model may be as follows:
first, a style migration model is constructed.
Next, the style migration model is forward trained.
During forward training, the input of the style transition model is a low-frequency image of an original video image and a virtual background image, and the output is a forward training result. At this time, the virtual background image refers to an image for background replacement in the training data set.
Each time a forward training result is obtained, the forward training result is converted to the color-opponent (LAB) space, and a first variance and a first mean of the forward training result in the LAB domain are calculated; the virtual background image is also converted to the LAB space, and a second variance and a second mean of the virtual background image in the LAB space are calculated. Finally, the mean difference and the variance difference between the forward training result and the virtual background image in the LAB domain are calculated from the first and second variances and the first and second means.
The mean difference and variance difference of the LAB domain are used as Loss values (Loss) between the model output and the input, i.e. the Loss in the forward training is the Loss in the LAB domain. The similarity of color, brightness, saturation, etc. between the forward training results and the virtual background can be constrained by the Loss of the LAB domain.
And after calculating the loss value, adjusting the network parameters of the style migration model according to the loss value. Through forward training, the color, brightness, saturation and the like of the main image or the original video image can be consistent with the color, brightness, saturation and the like of the virtual background image.
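As an illustrative sketch of the forward-training loss in the LAB domain (the equal weighting of the mean and variance terms is an assumption, not a value fixed by this application):

    import numpy as np
    from skimage import color

    def lab_statistics_loss(output_rgb, virtual_bg_rgb):
        # Both inputs are float RGB images in [0, 1].
        out_lab = color.rgb2lab(output_rgb)        # channels: L, A, B
        bg_lab = color.rgb2lab(virtual_bg_rgb)

        # Per-channel mean difference and variance difference in the LAB domain.
        mean_diff = np.abs(out_lab.mean(axis=(0, 1)) - bg_lab.mean(axis=(0, 1)))
        var_diff = np.abs(out_lab.var(axis=(0, 1)) - bg_lab.var(axis=(0, 1)))
        return float(mean_diff.sum() + var_diff.sum())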
And then, carrying out reverse training on the style migration model after the forward training based on the result of the forward training to obtain the trained style migration model.
And during the reverse training process, inputting the low-frequency image of the original video image and the forward training result into the style migration model after the forward training, wherein the output of the model is the reverse training result.
Each time a reverse training result is obtained, the Loss in the LAB domain between the reverse training result and the low-frequency image of the original video image is calculated. The specific calculation process may be as follows: convert both the reverse training result and the low-frequency image of the original video image to the LAB space; calculate the variance and mean of the reverse training result in the LAB domain and the variance and mean of the low-frequency image of the original video image in the LAB domain; calculate the mean difference and variance difference between the two from these statistics; and take this mean difference and variance difference in the LAB domain as the Loss of the LAB domain. At the same time, a loss value between the reverse training result and the original video image is calculated. The Loss of the LAB domain and the loss value between the reverse training result and the original video image are weighted to obtain a combined loss value, and the network parameters of the style migration model after forward training are adjusted according to this loss value.
And (4) performing iterative training for multiple times until the loss value of the forward training and the loss value of the reverse training tend to be stable, and determining that the training is finished to obtain the final style migration model.
Through reverse training, the content of the main image or the original video image can be consistent with the content of the virtual background image to be rendered.
In the embodiments of this application, one training iteration includes a forward training process and a reverse training process; that is, during training, the reverse training process is performed based on the result of the forward training. Likewise, in the next training iteration, the forward training process is performed first, and then the reverse training is performed. Multiple iterations of training are performed in this way to obtain the trained style migration model.
It should be noted that the training process of the style migration model may be performed on the terminal device 100, or may not be performed on the terminal device 100, and the training process may be loaded to the terminal device 100 after the training of other devices is completed.
In the existing style migration model training process, the forward training model and the reverse training model are different models with different weights, that is, the reverse training is not based on the model and results obtained by the forward training. As a result, after style transfer is performed using a model trained in this way, the image content in the image after style transfer does not match the image content of the original image.
Moreover, different styles correspond to different style migration models, for example, if there are 3 different background images, 3 different style migration models are needed to migrate the styles of the 3 background images to the corresponding pictures respectively.
In the first mode, the forward training and the backward training are the same in model and the same in weight, that is, the backward training is based on the style migration model after the forward training and the forward training result. Therefore, when the style is transferred, the color, the brightness, the saturation and the like of the image after the style transfer and the virtual background image to be rendered are consistent, the image content in the image after the style transfer is consistent with the image content of the original video image, and the rendering effect is better. The image after style migration is an image obtained by bottom layer consistent rendering.
In other words, the forward training and the reverse training in the prior art are not the same model, which results in the content of the image after the style migration being inconsistent with the content of the original video image. In the above mode, forward training and reverse training are the same model, so that the image content after style migration is the same as that of the original video image. In addition, in the model application stage, the input of the style migration model in the prior art is the virtual background image to be rendered, and the model input in the first mode is the original video image and the low-frequency image of the virtual background image to be rendered.
In addition, in the first mode, different styles correspond to the same style migration model, for example, if there are 3 different background images, one style migration model is needed, and the styles of the 3 background images can be migrated to the corresponding pictures respectively. That is, the style migration model obtained by the model training method of the first method can migrate a plurality of different styles to another picture.
It should be further noted that the model training process and the underlying consistency rendering process described above are both described in terms of original video images. In other embodiments, the original video image may be replaced with the subject image, that is, the original video image in fig. 9 and 10 may be replaced with the subject image. For example, a main body image is extracted from an original video image, and then the main body image and a low-frequency image of a virtual background image to be rendered are input into a style transition model for forward training. For another example, in the actual application stage of the model, the low-frequency image of the subject image and the virtual background image to be rendered are input to the trained style migration model, and the subject image after the bottom-layer consistent rendering output by the style migration model is obtained.
In addition, the low-frequency image of the virtual background image to be rendered is used in both the model application stage and the model training stage. The low-frequency image discards bottom-layer features such as texture, which makes multi-style reverse training possible; and because the forward training and the reverse training use the same model, distortion is reduced.
In contrast, the prior art generally renders images with a fixed color temperature and fixed illumination, that is, virtual illumination in a fixed direction, or a fixed color temperature, is rendered for the foreground. This may cause the color, color temperature, brightness, contrast, and the like of the subject and the background in the image after background replacement to be inconsistent, resulting in a poor background replacement effect.
With bottom-layer consistent rendering performed in the first or second mode, the color, color temperature, brightness, contrast, and the like of the subject and the background in the image after background replacement are consistent, and the background replacement effect is better.
Referring to the schematic effect diagram of the underlying consistent rendering shown in fig. 11, as shown in fig. 11, an image 111 is an original video image, an image 112 is a virtual background image, an image 113 is an image obtained by replacing a background in the first manner, and an image 114 is an image obtained by replacing a background in the existing manner. The background color, hue, and the like in the image 111 are mainly marine blue, and the clothing color of the person in the image 111 is color 1, for example, color 1 is white. Background color, hue, and the like in the image 112 are mainly sunset yellow.
The first mode is used:
the subject image and the virtual background image can be made to coincide in color, hue, contrast, and color temperature. The concrete expression is as follows: the clothing color of the person (i.e., the subject image) in the image 113 is color 2, for example, the color 2 is yellow, that is, the color, color temperature, etc. of the person in the image 113 are consistent with the color, color temperature, etc. of the background, and the consistency of the subject and the background is high.
With the existing method, the subject image and the virtual background image differ greatly in color, hue, contrast, color temperature, and the like. The concrete expression is as follows: the person's clothing in image 114 is color 1, i.e., consistent with the subject image in the original video image. Thus, the subject image and the background in the image 114 have a large difference in color and color temperature, and the subject and the background have poor consistency.
The second method comprises the following steps:
referring to the bottom-layer consistent rendering process diagram based on the image processing algorithm shown in fig. 12, as shown in fig. 12, first, the terminal device 100 converts the virtual background image to be rendered from the RGB color space to the LAB color space to obtain a first image in the LAB color space, and then calculates a first standard deviation (std) and a third mean (mean) of L, A, B channels of the first image. Each channel has its own corresponding standard deviation and mean.
Then, the terminal device 100 converts the main image or the original video image into an LAB color space to obtain a second image in the LAB color space, and corrects the standard deviation and the mean of the second image according to the first standard deviation and the third mean of the first image to obtain a third image.
Specifically, the standard deviations of L, A, B channels of the second image are respectively set as the first standard deviations of the corresponding channels in the first image, or the difference values between the standard deviations of L, A, B channels of the second image and the first standard deviations of the corresponding channels in the first image are within a preset threshold interval; setting the mean value of L, A, B channels of the second image as the third mean value of the corresponding channel in the first image, or making the difference value between the mean value of L, A, B channels of the second image and the third mean value of the corresponding channel in the first image within a preset threshold interval.
That is, the L, A, B three-channel standard deviations of the third image are equal to or close to the first standard deviation of the corresponding channel in the first image, and the L, A, B three-channel mean of the third image is equal to or close to the third mean of the corresponding channel in the first image.
And finally, transferring the third image from the LAB color space to the RGB color space to obtain a fourth image, wherein the fourth image is the image after the bottom layer is rendered in a consistent manner.
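A compact Python sketch of mode two follows, using per-channel statistics matching in the LAB space; it illustrates the described correction rather than the exact production implementation:

    import numpy as np
    from skimage import color

    def bottom_layer_consistent_render(second_rgb, first_rgb):
        # first_rgb: the virtual background image to be rendered (float RGB in [0, 1]).
        # second_rgb: the subject image or original video image (float RGB in [0, 1]).
        first_lab = color.rgb2lab(first_rgb)       # first image in LAB
        second_lab = color.rgb2lab(second_rgb)     # second image in LAB

        first_std = first_lab.std(axis=(0, 1))     # first standard deviation
        first_mean = first_lab.mean(axis=(0, 1))   # third mean
        second_std = second_lab.std(axis=(0, 1))
        second_mean = second_lab.mean(axis=(0, 1))

        # Correct each channel of the second image so that its standard deviation
        # and mean match those of the first image, giving the third image.
        third_lab = (second_lab - second_mean) / (second_std + 1e-6) * first_std + first_mean
        # Convert back to RGB to obtain the fourth image.
        return np.clip(color.lab2rgb(third_lab), 0.0, 1.0)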
Referring to fig. 13, which is a schematic diagram illustrating the effect of the bottom-layer consistent rendering based on the image processing algorithm, as shown in fig. 13, an image 131 is a virtual background image, an image 132 is an original video image, an image 133 is an image obtained by replacing the background in the above-mentioned manner two, and an image 134 is an image obtained by replacing the background in the prior art.
By comparing the image 133 with the image 134, it can be seen that the color, color temperature, brightness, contrast, and the like of the person (i.e., subject image) 135 of the image 133 are consistent with the color, color temperature, brightness, contrast, and the like of the image 131, that is, the person and the background in the image 133 are more consistent in color, brightness, contrast, color temperature, and the like, and the background replacement effect is further improved. The color, color temperature, brightness and contrast of the person 135 in the image 134 are consistent with the hue, color temperature, brightness and contrast in the image 132, and have a large difference with the color, color temperature, brightness and contrast of the image 131, which further causes the person and the background in the image 134 to have a large difference in color, contrast, color temperature and the like, that is, the consistency between the person and the background is poor, and further causes the background replacement effect to be poor.
The consistency rendering process based on the position relation comprises the following steps:
referring to fig. 14, a schematic diagram of a consistent rendering process based on a position relationship is shown, where the process of consistent rendering based on a position relationship may be as follows:
firstly, inputting a subject image into a first Spatial Transformer Network (STN) to obtain a first variation matrix output by the first STN; and inputting the virtual background image to be rendered into the second STN network to obtain a second change matrix output by the second STN network.
If the bottom-layer consistent rendering process has been performed, the subject image here may be the subject image after bottom-layer consistent rendering; and if the result of the bottom-layer consistent rendering process is the original video image after bottom-layer consistent rendering, the subject image is extracted from that original video image to obtain the subject image after bottom-layer consistent rendering.
Then, affine transformation (Warp) is performed on the subject image by using the first variation matrix, and the subject image after Warp is obtained. And performing Warp on the virtual background image to be rendered by using the second change matrix to obtain the Warp virtual background image to be rendered.
And finally, performing foreground and background fusion on the main body image subjected to the Warp and the virtual background image to be rendered subjected to the Warp to obtain a composite image.
Through the STN networks, the mutual rotation, translation, or scaling between the foreground (that is, the subject image) and the virtual background image to be rendered is adjusted, and cropping is performed.
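The Warp step can be sketched with standard affine grid sampling; the following minimal Python (PyTorch) example assumes the change matrices are the 2 x 3 affine matrices predicted by the STN networks, which are themselves not shown:

    import torch
    import torch.nn.functional as F

    def warp_with_affine(image, theta):
        # image: float tensor (B, C, H, W); theta: change matrix (B, 2, 3).
        grid = F.affine_grid(theta, image.size(), align_corners=False)
        return F.grid_sample(image, grid, align_corners=False)

    def compose(subject, subject_mask, virtual_bg, theta_fg, theta_bg):
        # Warp the subject (foreground) and the virtual background to be rendered
        # with their own change matrices, then fuse foreground and background.
        # subject, virtual_bg: (B, 3, H, W); subject_mask: (B, 1, H, W) in [0, 1].
        warped_fg = warp_with_affine(subject, theta_fg)
        warped_mask = warp_with_affine(subject_mask, theta_fg)
        warped_bg = warp_with_affine(virtual_bg, theta_bg)
        return warped_mask * warped_fg + (1 - warped_mask) * warped_bg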
It is understood that the first STN network and the second STN network are both pre-trained networks. The training process of the STN network adopts a counterstudy mode, and the specific process can be as follows:
referring to the schematic diagram of the STN network training process shown in fig. 15, as shown in fig. 15, a subject image for training is input to a third STN network constructed in advance, and a variation matrix H0 output by the third STN network is obtained; and inputting the virtual background image for training into a fourth STN network constructed in advance to obtain a change matrix H1 output by the fourth STN network. In this case, the virtual background image for training is an image used for replacing the background in the training data. The main image may be an image after bottom-layer consistent rendering, or may not be an image after bottom-layer consistent rendering.
Performing Warp on the main body image for training by using a change matrix H0, and performing Warp on the background image for training by using a change matrix H1; and performing foreground and background fusion on the main body image subjected to the Warp and the virtual background image subjected to the Warp to obtain a synthetic image.
Finally, the composite image is input to the discriminator. The discriminator judges whether the composite image is good or bad by discriminating the magnitude of the difference between the composite image and the real image. The smaller the difference between the synthesized image and the real image, the better the synthesized image, and conversely, the larger the difference, the worse the synthesized image.
And when the discriminator considers that the input synthetic image is the same as the real image, the training of the STN network is considered to be finished, and the first STN network and the second STN network are obtained.
It is understood that the training process of the STN network may be performed on the terminal device 100, or may be performed on other devices.
Referring to the schematic block diagram of the effect of the consistent rendering based on the position relationship shown in fig. 16, as shown in fig. 16, the image 161, the image 162, and the image 163 are all images after background replacement, wherein the image rendering process of the image 161 and the image 162 is not performed with the consistent rendering process based on the position relationship, and the image rendering process of the image 163 is performed with the consistent rendering process based on the position relationship. As can be seen from the comparison, in the image 161 and the image 162, the main body 164 is not reasonably rendered on the table 165, and thus an unreasonable phenomenon, such as the main body being suspended, occurs in the image after the background replacement. In the image 163, the subject 164 is reasonably rendered on the table 165, and the reasonableness and reality are better.
After the various consistent rendering processes have been separately introduced, the following is an exemplary description of possible image rendering processes.
A first image rendering process:
the image rendering process comprises consistency rendering based on structural semantics, consistency rendering based on the interaction relationship, bottom-layer consistency rendering, and consistency rendering based on the position relationship.
Referring to fig. 17, a flow diagram of an image rendering process is shown, which may include the steps of:
in step S1701, the terminal device 100 acquires an original video image.
It will be appreciated that the original video image is a frame of video image in the video stream.
Step S1702, the terminal device 100 detects whether the target object in the original video image has an occlusion relationship. When the target object has an occlusion relationship, the process proceeds to step S1703; when the target object does not have an occlusion relationship, the process proceeds to step S1707.
The target object refers to a subject in the original video image, and is usually a human subject.
In a specific application, the terminal device 100 may perform multi-class semantic segmentation on the original video image to obtain a multi-class semantic segmentation result, and then determine, according to the multi-class semantic segmentation result and depth information of the original video image, whether the main body is occluded by another object. If the main body is occluded by another object, it is determined that the target object has an occlusion relationship; otherwise, it is determined that the target object does not have an occlusion relationship.
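One plausible way to implement this check, assuming a per-pixel semantic label map and an aligned depth map are already available (the class ids, the neighbourhood margin, and the SciPy usage are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def find_occluding_class(labels, depth, subject_class, margin=10):
    """Return the class id of an object occluding the subject, or None."""
    subject_mask = labels == subject_class
    if not subject_mask.any():
        return None
    subject_depth = np.median(depth[subject_mask])
    # Look at a band of pixels immediately around the subject.
    band = binary_dilation(subject_mask, iterations=margin) & ~subject_mask
    # Pixels in that band closer to the camera than the subject occlude it.
    occluding = band & (depth < subject_depth)
    if not occluding.any():
        return None
    classes, counts = np.unique(labels[occluding], return_counts=True)
    return int(classes[np.argmax(counts)])  # most frequent occluding class
```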
Step S1703, the terminal device 100 determines a first virtual object to be rendered corresponding to the occluding object.
In a specific application, when the terminal device 100 determines according to the depth information that the main body is occluded by another object, the category of the occluding object and its position in the original video image are determined according to the multi-class semantic segmentation result. The first virtual object to be rendered corresponding to the occluding object is then determined according to the category of the occluding object, and the rendering position of the first virtual object to be rendered is determined according to the position of the occluding object in the original video image.
In step S1704, the terminal device 100 detects whether the target object in the original video image performs a preset action. When the target object performs the preset action, the process proceeds to step S1705; when the target object does not perform the preset action, the process proceeds to step S1707.
Step S1705, the terminal device 100 determines a second virtual object to be rendered corresponding to the preset action.
In step S1706, the terminal device 100 determines a rendering position and a rendering order of the second virtual object to be rendered.
It should be noted that, when the rendering order is to render the interactive object first and then the main body, the second virtual object to be rendered is fused with the target virtual background image.
When the rendering order is to render the main body first and then the interactive object, the second virtual object to be rendered is not fused with the target virtual background image; instead, after the composite image is obtained, foreground and background fusion is performed on the second virtual object to be rendered and the composite image.
If both types of second virtual objects to be rendered exist at the same time, that is, the first type is rendered before the main body (interactive object first, main body second) and the second type is rendered after the main body (main body first, interactive object second), then the first type of second virtual object to be rendered is fused with the target virtual background image, and the second type is fused with the composite image, as sketched below.
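The two compositing orders can be captured by a small helper like the following (purely illustrative; `blend` stands for the foreground/background fusion step used throughout, and the intermediate bottom-layer and position-relationship steps are omitted):

```python
def compose(target_background, subject_layer, objects_before, objects_after, blend):
    """Objects rendered before the subject go into the background to be rendered;
    objects rendered after the subject are fused onto the composite image."""
    background_to_render = target_background
    for obj in objects_before:               # interactive object first, subject second
        background_to_render = blend(background_to_render, obj)
    composite = blend(background_to_render, subject_layer)
    for obj in objects_after:                # subject first, interactive object second
        composite = blend(composite, obj)
    return composite
```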
Step S1707, the terminal device 100 determines a virtual background image to be rendered.
In some embodiments, the target virtual object to be rendered may be used as the foreground and fused with the target virtual background image (foreground and background fusion) to obtain the virtual background image to be rendered.
The target virtual object to be rendered may include the first virtual object to be rendered and/or the second virtual object to be rendered.
When it is detected that the target object has an occlusion relationship and does not perform a preset action, the target virtual object to be rendered comprises only the first virtual object to be rendered. In this case, according to the position of the occluding object in the original video image, the first virtual object to be rendered is taken as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
When it is detected that the target object has an occlusion relationship and performs a preset action, and the rendering order between the second virtual object to be rendered and the main body is interactive object first, main body second, the target virtual object to be rendered comprises the first virtual object to be rendered and the second virtual object to be rendered. In this case, according to the rendering position of the second virtual object to be rendered and the position of the occluding object in the original video image, both virtual objects are taken as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
When it is detected that the target object has an occlusion relationship and performs a preset action, and the rendering order between the second virtual object to be rendered and the main body is main body first, interactive object second, the target virtual object to be rendered comprises only the first virtual object to be rendered. In this case, according to the position of the occluding object in the original video image, the first virtual object to be rendered is taken as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
When it is detected that the target object has no occlusion relationship but performs a preset action, and the rendering order between the second virtual object to be rendered and the main body is interactive object first, main body second, the target virtual object to be rendered comprises the second virtual object to be rendered. In this case, according to the rendering position of the second virtual object to be rendered, the second virtual object to be rendered is taken as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
In other embodiments, the target virtual background image may also be directly used as the virtual background image to be rendered.
When it is detected that the target object has no occlusion relationship but performs a preset action, and the rendering order between the second virtual object to be rendered and the main body is main body first, interactive object second, the target virtual background image is directly used as the virtual background image to be rendered.
When it is detected that the target object has no occlusion relationship and does not perform a preset action, the target virtual background image is likewise directly used as the virtual background image to be rendered. These cases are summarised in the sketch below.
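A condensed sketch of the case analysis above, assuming at most one occlusion-related object and one action-related object per frame (names are illustrative, not the patent's code):

```python
def build_background_to_render(target_bg, fuse, first_obj=None, second_obj=None,
                               second_obj_before_subject=True):
    """Return the virtual background image to be rendered plus any virtual
    object whose fusion is deferred until after the composite image exists."""
    bg = target_bg
    deferred = None
    if first_obj is not None:
        # The first virtual object to be rendered (occluding object) is always
        # fused into the background to be rendered.
        bg = fuse(bg, first_obj)
    if second_obj is not None:
        if second_obj_before_subject:
            bg = fuse(bg, second_obj)      # interactive object first, subject second
        else:
            deferred = second_obj          # fused with the composite image later
    return bg, deferred
```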
After the virtual background image to be rendered is determined, image rendering can be performed according to the virtual background image to be rendered and the main body image, so that an image with a replaced background is obtained.
Step S1708, the terminal device 100 performs bottom layer consistency rendering according to the virtual background image to be rendered and the original video image.
After determining the virtual background image to be rendered, the terminal device 100 may perform a bottom-layer consistent rendering process. The bottom-layer consistency rendering process may be referred to above and will not be described herein.
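The bottom-layer consistency step itself is described earlier in the application; one of its variants (see claim 5 below) matches LAB-channel statistics between the subject and the virtual background, which can be sketched roughly as follows (OpenCV-based, illustrative only):

```python
import cv2
import numpy as np

def lab_color_transfer(subject_bgr, background_bgr):
    """Match the subject's LAB statistics to those of the virtual background."""
    subject = cv2.cvtColor(subject_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    background = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    out = np.empty_like(subject)
    for c in range(3):  # L, A, B channels
        s_mean, s_std = subject[..., c].mean(), subject[..., c].std() + 1e-6
        b_mean, b_std = background[..., c].mean(), background[..., c].std() + 1e-6
        out[..., c] = (subject[..., c] - s_mean) / s_std * b_std + b_mean
    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```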
Step S1709, the terminal device 100 performs consistent rendering based on the position relationship according to the main image and the virtual background image to be rendered after the consistent rendering of the bottom layer, to obtain a composite image.
It should be noted that, for the consistent rendering based on the position relationship, reference may be made to the above, and details are not described herein again.
When it is detected that the target object does not perform the preset action, or the target object performs the preset action but the rendering order between the second virtual object to be rendered and the main body is interactive object first, main body second, the composite image is the output image after background replacement.
When it is detected that the target object performs the preset action and the rendering order between the second virtual object to be rendered and the main body is main body first, interactive object second, the composite image is not yet the output image after background replacement. In this case, after the composite image is obtained, the second virtual object to be rendered is used as the foreground and the composite image as the background, and foreground and background fusion is performed according to the rendering position of the second virtual object to be rendered to obtain a fused image, which is the output image after background replacement. Therefore, in this case, the image rendering process may further include step S1710.
Optionally, the method may further include step S1710 of performing foreground and background fusion on the composite image and the second virtual object to be rendered to obtain a fused image.
It should be noted that steps S1702 to S1703 belong to the consistency rendering process based on structural semantics, and steps S1704 to S1706 belong to the consistency rendering process based on the interaction relationship. The two processes may be executed in any order, either simultaneously or sequentially.
In the prior art, the interaction relationship, the structural semantics, bottom-layer consistency, and the position relationship are not considered during image rendering, which causes the background-replaced image to be rendered unreasonably. In the embodiment of the application, the occlusion relationship of the main body is determined through the consistency rendering process based on structural semantics, which avoids unreasonable rendering of the main body; through the consistency rendering process based on the interaction relationship, the interaction between the main body and the interactive object is taken into account when the main body is rendered, which likewise avoids unreasonable rendering; and the rendering position of the main body image is determined through the transformation predicted by the STN networks, which prevents the main body from being rendered at an unreasonable position. The realism after background replacement is thereby improved, and an unrealistic or even unreasonable result is prevented from degrading the background replacement effect.
In addition, through the bottom-layer consistency rendering process, the main body image and the background in the background-replaced image are made consistent in color, brightness, contrast, hue, and the like, which further improves the background replacement effect.
A second image rendering process:
the image rendering process comprises the consistency rendering process based on structural semantics, the bottom-layer consistency rendering process, and the consistency rendering process based on the position relationship.
Referring to another flow diagram of the image rendering process shown in fig. 18, the image rendering process may include the steps of:
step S1801, the terminal device 100 acquires an original video image.
Step S1802, the terminal device 100 detects whether the target object in the original video image has an occlusion relationship. When the target object has an occlusion relationship, the process proceeds to step S1803; when the target object does not have an occlusion relationship, the process proceeds to step S1805.
Step S1803, the terminal device 100 determines a first virtual object to be rendered corresponding to the blocking object.
Step S1804, the terminal device 100 fuses the first virtual object to be rendered and the target virtual background image to obtain a virtual background image to be rendered.
After the virtual background image to be rendered is obtained, the process proceeds to step S1806.
In step S1805, the terminal device 100 takes the target virtual background image as a virtual background image to be rendered.
After the virtual background image to be rendered is determined, image rendering can be performed according to the virtual background image to be rendered and the main body image, so that an image with a replaced background is obtained.
Step S1806, the terminal device 100 performs bottom layer consistency rendering according to the virtual background image to be rendered and the original video image.
Step S1807, the terminal device 100 performs consistent rendering based on the position relationship according to the main body image and the virtual background image to be rendered after the consistent rendering of the bottom layer, so as to obtain a composite image.
In the image rendering process, the composite image is an output image after background replacement.
In this image rendering process, the occlusion relationship of the main body is determined through the consistency rendering process based on structural semantics, which avoids unreasonable rendering of the main body; the rendering position of the main body image is determined through the transformation predicted by the STN networks, which prevents the main body from being rendered at an unreasonable position, so that the realism after background replacement is improved and an unrealistic or even unreasonable result is prevented from degrading the background replacement effect. In addition, through the bottom-layer consistency rendering process, the main body image and the background in the background-replaced image are made consistent in color, brightness, contrast, hue, and the like, further improving the background replacement effect.
A third image rendering process:
the image rendering process comprises an interactive relationship-based consistent rendering process, a bottom layer consistent rendering process and a position relationship-based consistent rendering process.
Referring to fig. 19, a further flowchart of an image rendering process is shown, which may include the steps of:
in step S1901, the terminal device 100 acquires an original video image.
In step S1902, the terminal device 100 detects whether the target object in the original video image performs a preset action. If so, the process proceeds to step S1903; if not, the process proceeds to step S1906.
In step S1903, the terminal device 100 determines a second virtual object to be rendered corresponding to the preset action.
In step S1904, the terminal device 100 determines the rendering order and the rendering position of the second virtual object to be rendered.
If the rendering order between the second virtual object to be rendered and the main body is interactive object first, main body second, the process proceeds to step S1905, that is, according to the rendering position of the second virtual object to be rendered, the second virtual object to be rendered is taken as the foreground and fused with the target virtual background image to obtain the virtual background image to be rendered.
If the rendering order is main body first, interactive object second, the process proceeds to step S1906, and in this case the image rendering process further includes step S1909.
In step S1905, the terminal device 100 fuses the second virtual object to be rendered and the target virtual background image to obtain a virtual background image to be rendered.
In step S1906, the terminal device 100 sets the target virtual background image as a virtual background image to be rendered.
After the virtual background image to be rendered is determined, image rendering can be performed according to the virtual background image to be rendered and the main body image, and an image with a replaced background is obtained.
Step S1907, the terminal device 100 performs bottom-layer consistent rendering on the virtual background image to be rendered and the original video image.
Step S1908, the terminal device 100 performs consistent rendering based on the position relationship according to the main body image and the virtual background image to be rendered after the consistent rendering of the bottom layer, so as to obtain a composite image.
Step S1909, the terminal device 100 performs foreground and background fusion on the composite image and the second virtual object to be rendered to obtain a fused image.
In this image rendering process, if the second virtual object to be rendered is rendered before the main body, the composite image is the output image after background replacement; if it is rendered after the main body, the fused image is the output image after background replacement.
In this image rendering process, through the consistency rendering process based on the interaction relationship, the interaction between the main body and the interactive object is taken into account when the main body is rendered, which avoids unreasonable rendering; the rendering position of the main body image is determined through the transformation predicted by the STN networks, which prevents the main body from being rendered at an unreasonable position, so that the realism after background replacement is improved and an unrealistic or even unreasonable result is prevented from degrading the background replacement effect. In addition, through the bottom-layer consistency rendering process, the main body image and the background in the background-replaced image are made consistent in color, brightness, contrast, hue, and the like, further improving the background replacement effect.
Other possible image rendering processes:
in some embodiments, the image rendering process may include only a structural semantic based consistency rendering process and/or an interactive relationship based consistency rendering process, excluding the underlying consistency rendering process and the positional relationship based consistency rendering process described above. The image rendering process may be as follows:
the terminal device 100 first executes the above-mentioned structure semantic based consistency rendering process and/or interaction relationship based consistency rendering process to obtain a virtual background image to be rendered. The specific process may refer to the image rendering process shown above, and is not described herein again.
Then, the terminal device 100 may process the virtual background image to be rendered using an existing hue and brightness processing approach: for example, detect the current scene brightness and the virtual background brightness, adjust the exposure time when the scene brightness is greater than the virtual background brightness, and add virtual illumination to the virtual background when the scene brightness is less than the virtual background brightness.
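A rough stand-in for such a brightness check is sketched below; it simply scales the virtual background towards the scene's mean brightness, whereas the handling described above (adjusting exposure time or adding virtual illumination) depends on the capture pipeline. All names are illustrative.

```python
import cv2
import numpy as np

def harmonize_brightness(scene_bgr, virtual_bg_bgr, strength=0.5):
    """Scale the virtual background's brightness towards the captured scene's."""
    scene_luma = cv2.cvtColor(scene_bgr, cv2.COLOR_BGR2GRAY).mean()
    bg_luma = cv2.cvtColor(virtual_bg_bgr, cv2.COLOR_BGR2GRAY).mean() + 1e-6
    factor = 1.0 + strength * (scene_luma - bg_luma) / bg_luma
    out = virtual_bg_bgr.astype(np.float32) * factor
    return np.clip(out, 0, 255).astype(np.uint8)
```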
Finally, the terminal device 100 performs foreground and background fusion on the processed virtual background image to be rendered and the main body image to obtain a final output image.
Alternatively, the terminal device 100 may not perform processing on the color tone or brightness on the virtual background image to be rendered, but directly perform foreground and background fusion on the virtual background image to be rendered and the main body image to obtain a final output image.
In some embodiments, the image rendering process may include an underlying consistent rendering process or a positional relationship based consistent rendering process in addition to the structural semantic based consistent rendering process and/or the interactive relationship based consistent rendering process. At this time, the image rendering process may be as follows:
the terminal device 100 first executes the above-mentioned structure semantic based consistency rendering process and/or interaction relationship based consistency rendering process to obtain a virtual background image to be rendered. The specific process may refer to the image rendering process shown above, and is not described herein again.
Then, the terminal device 100 performs a bottom-layer consistent rendering process, that is, performs a bottom-layer consistent rendering process based on the virtual background image to be rendered and the main body image. And finally, performing foreground and background fusion on the main body image subjected to the bottom layer consistent rendering and the virtual background image to be rendered to obtain a final output image.
Or, the terminal device 100 performs a consistent rendering process based on the position relationship, that is, the main body image and the virtual background image to be rendered are input into the trained STN network, then performs Warp according to the change matrix, and finally performs foreground and background fusion on the main body image after Warp and the virtual background image to be rendered after Warp to obtain a final output image.
It will be appreciated that in addition to the several image rendering processes mentioned above, other possible image rendering processes may be derived based on the several image rendering processes mentioned above.
For example, the image rendering process does not perform the structure semantic based consistency rendering process and the interactive relationship based consistency rendering process, and only performs the bottom layer consistency rendering process and the position relationship based consistency rendering process. At this time, the image rendering process may be as follows:
the terminal device 100 takes the target virtual background image as a background image to be rendered, and performs a bottom layer consistency rendering process based on the background image to be rendered and the main body image; and finally, performing a consistent rendering process based on the position relation on the main body image and the virtual background image to be rendered after the consistent rendering based on the bottom layer to obtain an output image.
For another example, the image rendering process may include only the bottom-level consistency rendering process, and the structure-semantic-based consistency rendering process, the interaction-relationship-based consistency rendering process, and the position-relationship-based consistency rendering process are not performed. At this time, the image rendering process may be as follows:
the terminal device 100 uses the target virtual background image as a background image to be rendered, performs a bottom layer consistency rendering process based on the background image to be rendered and the main body image, and finally performs foreground and background fusion on the main body image subjected to bottom layer consistency rendering and the virtual background image to be rendered to obtain an output image.
For another example, the image rendering process does not perform the consistency rendering process based on structural semantics, the consistency rendering process based on the interaction relationship, or the bottom-layer consistency rendering process, and only performs the consistency rendering process based on the position relationship. At this time, the image rendering process may be as follows:
the terminal device 100 performs a consistent rendering process based on the position relationship according to the target virtual background image and the subject image, to obtain an output image.
Other image rendering processes are not listed one by one, and the same or similar parts of the image rendering processes can be referred to one another, and are not described herein again.
By comparison, the second image rendering process, the third image rendering process, and the other possible image rendering processes, although their background replacement is less effective than that of the first image rendering process, can still improve the background replacement effect.
The above describes a process of performing virtual background replacement on a frame of original video image based on the original video image or the subject image and the target virtual background image to obtain an output image. For example, referring to fig. 4, in response to a click operation on the virtual background image 411, the mobile phone determines that the target virtual background image is the virtual background image 411, then performs any one of the above-mentioned image rendering processes based on the original video image corresponding to the listening interface 48 and the virtual background image 411 to obtain an output image, and finally displays the output image to obtain the image 410 after the background replacement. For another example, referring to fig. 6, the mobile phone determines that the target virtual background image is the virtual background image 412 in response to the click operation of the virtual background image 412, and then performs any one of the above-mentioned image rendering processes based on the original video image and the virtual background image 412 corresponding to the listening interface 48 to obtain an output image displayed on the interface 415.
The virtual background replacement process provided by the embodiment of the application can be applied to video call scenarios on mobile phones as well as to video call scenarios on large-screen devices. For example, referring to the schematic diagram of the virtual background replacement scenario on a large-screen device shown in fig. 20, in a home scenario the user makes a video call through the large-screen device 201, on which a video call application is installed. A virtual background selection window can be called up through the magic pen 202 in the video call interface, and the user can select a background image for replacement from this window. After the user selects the target virtual background image, the large-screen device performs any one of the above-mentioned image rendering processes according to the target virtual background image and the original video image to obtain the background-replaced image 203.
In a specific application, the terminal device 100 may perform the virtual background replacement process for every frame of the video stream, or may perform it once every 5 frames or every 10 frames, that is, perform foreground and background segmentation on the original video image every 5 or 10 frames to obtain a main body image and an original background image, determine a target virtual background image, and perform any one of the image rendering processes based on the main body image and the target virtual background image.
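One possible reading of this frame-skipping strategy, sketched as a generator that reruns the full pipeline every `interval` frames and reuses the previous output in between; how intermediate frames are actually handled is an assumption, and the segmentation and rendering functions are placeholders.

```python
def replace_background_stream(frames, target_bg, segment_fn, render_fn, interval=5):
    """Yield a background-replaced frame for each input frame."""
    last_output = None
    for i, frame in enumerate(frames):
        if last_output is None or i % interval == 0:
            subject_img, _original_bg = segment_fn(frame)    # foreground/background split
            last_output = render_fn(subject_img, target_bg)  # any rendering flow above
        yield last_output
```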
When the terminal device 100 continuously performs virtual background replacement on the video stream, if it recognizes that the main body has an occlusion relationship and/or an interaction relationship with an object in the original background image, the corresponding virtual object is rendered in the virtual background image; if it recognizes that the interaction relationship and/or the occlusion relationship has ended, the first virtual object to be rendered and/or the second virtual object to be rendered may be removed.
The end of the interaction relationship means that the interaction between the main body and the interactive object in the original background image has ended. For example, if the main body is a person, the interactive object is a chair in the original background image, and the interaction is "sitting", then when the person stands up from the chair, the "sitting" interaction is considered to have ended. The end of the occlusion relationship means a change from being occluded to not being occluded. For example, if the main body is a person who is occluded by a table in the original background image at one moment and is no longer occluded by the table at the next moment, the occlusion relationship is considered to have ended.
Visually, while an interaction relationship or an occlusion relationship exists, the corresponding virtual object is present in the background-replaced image; when the interaction relationship or the occlusion relationship ends, the virtual object disappears from the background-replaced image.
Specifically, the terminal device 100 continuously performs the virtual background replacement process on the video images in the video stream. If, at some moment, the consistency rendering process based on structural semantics determines that the main body in the original video image is no longer occluded, the occlusion relationship is considered to have ended and the first virtual object to be rendered no longer needs to be rendered. Similarly, if the consistency rendering process based on the interaction relationship determines that the interaction relationship has ended, the second virtual object to be rendered no longer needs to be rendered. The interaction relationship is considered to have ended when the action corresponding to the interactive action is recognized; for example, when the interactive action is "sitting", the corresponding ending action is "standing", so the interaction relationship is considered ended when the "standing" action is recognized.
At this point, since the main body is neither occluded nor performing a preset interaction, no virtual object is present in the virtual background image to be rendered. The terminal device 100 then performs the bottom-layer consistency rendering process and the consistency rendering process based on the position relationship on the virtual background image to be rendered and the main body image, and obtains the output image of the current virtual background replacement process, in which no virtual object exists. In this way, from the user's perspective, when the interaction relationship or the occlusion relationship ends, the virtual object in the background-replaced image disappears. For example, when the person stands up from the chair, the chair in the background-replaced image also disappears; when the person changes from being occluded to not being occluded, the occluding object (for example, a table) in the background-replaced image also disappears. A simple sketch of tracking these relationships per frame follows.
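A minimal sketch of this per-frame bookkeeping, assuming the occlusion and interaction checks each return either the virtual object to keep or None (names are illustrative):

```python
def update_virtual_objects(prev, occluder_obj=None, interactive_obj=None):
    """Keep each virtual object only while its triggering relationship persists."""
    current = {}
    if occluder_obj is not None:        # occlusion relationship still holds
        current["occluder"] = occluder_obj
    if interactive_obj is not None:     # interaction relationship still holds
        current["interactive"] = interactive_obj
    ended = set(prev) - set(current)    # relationships that ended this frame
    # Virtual objects for ended relationships are simply not carried over,
    # so they disappear from the background-replaced output.
    return current, ended
```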
For example, see fig. 21 for a schematic diagram of a change of the virtual background replacement image. As shown in fig. 21, the image 211 is an image obtained by performing a first virtual background replacement process on the original video image 1. The image 211 is the image 78 in fig. 7, the original video image 1 is the image 71 in fig. 7, and the target virtual background image is the image 74 in fig. 7. For a related introduction, reference is made to fig. 7 above. The image 211 includes a subject 212 and a subject 213, and a virtual object 214.
It is assumed that the video stream comprises in sequence an original video image 1, an original video image 2, an original video image 3 and an original video image 4.
The second virtual background replacement process: the terminal device 100 performs virtual background replacement on the original video image 2 to obtain the background-replaced image 215. Specifically, the terminal device 100 determines, based on the original video image 2, whether the main body 212 and the main body 213 have an occlusion relationship and/or an interaction relationship. At this time, the main body 213 has an occlusion relationship while the main body 212 does not, so the virtual object 214 and the target virtual background image are subjected to foreground and background fusion to obtain the virtual background image to be rendered. Then, according to the virtual background image to be rendered and the original video image 2, the bottom-layer consistency rendering process and the consistency rendering process based on the position relationship are performed in sequence to obtain the image 215.
The third virtual background replacement process: the terminal device 100 performs virtual background replacement on the original video image 3 to obtain the background-replaced image 216. Specifically, the terminal device 100 performs the consistency rendering process based on structural semantics and the consistency rendering process based on the interaction relationship on the original video image to determine whether each main body has an occlusion relationship or an interaction relationship. At this time, the main body 212 has an interaction relationship with a chair in the original background, and the main body 213 has an occlusion relationship with a table in the original background, so the virtual object 217 and the virtual object 214 are determined as the objects to be rendered. The virtual object 217 and the virtual object 214 are then rendered in the target virtual background image to obtain the virtual background image to be rendered, after which the bottom-layer consistency rendering process and the consistency rendering process based on the position relationship are performed in sequence based on the virtual background image to be rendered and the original video image 3 to obtain the image 216.
The fourth virtual background replacement process: the terminal device 100 performs virtual background replacement on the original video image 4 to obtain the background-replaced image 218. Specifically, the terminal device 100 performs the consistency rendering process based on structural semantics and the consistency rendering process based on the interaction relationship on the original video image to determine whether each main body has an occlusion relationship or an interaction relationship. At this time, in the original video image 4, the main body 213 has neither an occlusion relationship nor an interaction relationship, while the main body 212 has an interaction relationship but no occlusion relationship, so it is determined that only the virtual object 217 needs to be rendered. The virtual object 217 and the target virtual background image are then subjected to foreground and background fusion to obtain the virtual background image to be rendered. Finally, based on the original video image 4 and the virtual background image to be rendered, the bottom-layer consistency rendering process and the consistency rendering process based on the position relationship are performed in sequence to obtain the image 218.
The terminal device displays the image 211, the image 215, the image 216, and the image 218 on the display screen in sequence according to the playing order of the original video images. From the user's point of view, when the main body 212 changes from standing to sitting, a corresponding chair appears in the virtual background, and when neither the main body 212 nor the main body 213 is occluded any longer, the previously rendered virtual object 214 disappears.
The virtual background replacement scheme provided by the embodiment of the application can be applied to video call scenes and can also be applied to virtual background replacement scenes such as background traversing, special effect making, video conferences, photographing and video recording.
The following describes a shooting scene and a video recording scene by way of example.
See fig. 22 for an interface diagram of the virtual background replacement process in a photographing scenario. As shown in fig. 22, the mobile phone displays a preview interface 222 in response to a click operation on the camera 221. Then, in response to a click operation on the magic pen 223 in the preview interface 222, the mobile phone pops up a window 224 in the preview interface 222, in which scenes 225 to 228 are displayed in sequence; the scenes 225 to 228 may correspond in turn to the virtual background images 412, 413, 414, and 411 in fig. 5. The user can select the desired replacement scene from the window 224.
When the cell phone receives a click operation for the scene 225, the cell phone displays a preview interface 229. At this time, the image corresponding to the preview interface 222 is an original video image, and the image corresponding to the scene 225 is a target virtual background image. The mobile phone firstly carries out a structural semantic based consistency rendering process and an interactive relationship based consistency rendering process based on an original video image so as to judge whether a virtual object needs to be rendered. And if the virtual object needs to be rendered, fusing the virtual object to be rendered and the target virtual background image to obtain the virtual background image to be rendered. In the current situation, because the subject and the object in the original background image have no occlusion relation or interaction relation, the virtual object does not need to be rendered. And then, based on the main body image and the virtual background image to be rendered, performing a bottom layer consistency rendering process and a consistency rendering process based on the position relationship to obtain an output image after background replacement, wherein the output image is an image corresponding to the preview interface 229.
After the cell phone displays preview interface 229, the user can click on control 2210 to take a picture. After the mobile phone receives the click operation for the control 2210, an image corresponding to the preview interface 229 is saved as a picture, and the picture is displayed in the control 2211.
The user can view the captured picture by clicking the control 2211. After the mobile phone receives the click operation on the control 2211, a picture preview interface is displayed, in which the captured picture 2212 is shown. The picture 2212 is an image after background replacement.
In the photographing process shown in fig. 22, if the mobile phone recognizes, in the image corresponding to the preview interface 222, that a person is occluded by an object and/or makes a preset action, a corresponding virtual object may also appear in the captured picture 2212. For example, referring to fig. 21, if the subject 212 and the subject 213 perform the corresponding interactions or their occlusion relationships change during shooting, the preview interface likewise displays images such as the image 211, the image 215, the image 216, and the image 218.
In addition, after the mobile phone displays the preview interface 229, the user may replace the virtual background again by the magic pen.
See fig. 23 for an interface diagram of the virtual background replacement process in the video recording scenario. As shown in fig. 23, after the cell phone receives the click operation for the camera 231, a preview interface 232 is displayed. The preview interface 232 includes a magic pen 233 therein, and the preview interface 232 displays an original video image captured by the mobile phone through the camera.
After the mobile phone receives the click operation for the control 234, the mobile phone starts to record video and displays the video interface 235, and the video interface 235 still displays the original video image acquired by the mobile phone through the camera. During the recording process, the user can click the magic pen 233 to replace the virtual background.
After the mobile phone receives the clicking operation of the magic pen 233 in the video interface 235, the mobile phone pops up the window 236 in the video interface 235, and scenes 237-2310 which can be used for replacing the background are displayed in the window 236. After the user clicks the scene 237 in the window 236, the mobile phone performs a virtual background replacement process in response to the click operation, obtains an image after background replacement, and displays the image after background replacement in the interface 2311.
At this time, the image corresponding to the scene 237 is the target virtual background image, and the image corresponding to the video interface 235 is the original video image; the consistency rendering process based on structural semantics, the consistency rendering process based on the interaction relationship, the bottom-layer consistency rendering process, and the consistency rendering process based on the position relationship are performed in sequence based on the target virtual background image and the original video image to obtain the output image after background replacement.
If in the video recording process, the person in the original video image has an occlusion relationship and/or an interaction relationship with the object in the background, the recorded video also has a corresponding virtual object.
In addition, fig. 23 shows the virtual background being replaced after video recording has started; in other embodiments, the virtual background may instead be replaced before recording starts, that is, the window 236 is called up through the magic pen in the preview interface 232 and the corresponding scene is selected.
In fig. 22 and fig. 23, the mobile phone may also perform the above-described virtual background recommendation process. For parts of fig. 22 and fig. 23 that are the same as or similar to the foregoing, reference may be made to the description above, and details are not repeated here.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps that can be implemented in the above method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
Embodiments of the present application further provide a chip system, where the chip system includes a processor, the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the methods according to the above method embodiments. The chip system can be a single chip or a chip module consisting of a plurality of chips.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment. It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing a relative importance or importance. Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. An image rendering method is applied to a terminal device, and comprises the following steps:
acquiring an image to be processed;
detecting that a target object in the image to be processed executes a preset action and/or the target object is shielded by a first object;
determining a second virtual object to be rendered corresponding to the preset action and/or a first virtual object to be rendered corresponding to the first object;
rendering the first virtual object to be rendered and/or the second virtual object to be rendered in a target virtual background image to obtain a virtual background image to be rendered, wherein a depth value of an interactive object corresponding to the second virtual object to be rendered is larger than a depth value of the target object;
and performing image rendering according to the virtual background image to be rendered and the main body image to obtain a rendered image, wherein the main body image is extracted from the image to be processed and comprises the image of the target object.
2. The method of claim 1, wherein performing image rendering according to the virtual background image to be rendered and the subject image to obtain a rendered image comprises:
performing bottom layer consistency rendering on the basis of the virtual background image to be rendered to obtain a main body image subjected to bottom layer consistency rendering;
and performing image rendering according to the main body image subjected to the bottom layer consistent rendering and the virtual background image to be rendered to obtain the rendered image.
3. The method of claim 2, wherein performing a bottom-layer consistent rendering based on the virtual background image to be rendered to obtain a subject image after the bottom-layer consistent rendering, comprises:
inputting the low-frequency image of the virtual background image to be rendered and the image to be processed into a first style migration model which is trained in advance, and obtaining a bottom-layer consistency rendered image to be processed which is output by the first style migration model;
and extracting a main body image from the image to be processed after the bottom layer is rendered in a consistent manner to obtain the main body image after the bottom layer is rendered in a consistent manner.
4. The method of claim 3, wherein the training process of the style migration model comprises:
obtaining a training data set, wherein the training data set comprises a first virtual background image and an original video image;
inputting the low-frequency image of the first virtual background image and the original video image into a second style migration model which is constructed in advance, and obtaining a forward training result output by the second style migration model;
calculating a first loss value between the forward training result and a low-frequency image of the first virtual background image;
inputting the forward training result and the low-frequency image of the original video image into a second style migration model after forward training to obtain a reverse training result output by the second style migration model after forward training;
calculating a second loss value between the reverse training result and the original video image;
calculating a third loss value between the reverse training result and a low-frequency image of the original video image;
adjusting the network parameters of the second style migration model according to the first loss value, and adjusting the network parameters of the forward-trained second style migration model according to the second loss value and the third loss value;
and repeating the training process, and obtaining the first style migration model after training when the preset conditions are met.
5. The method of claim 2, wherein performing a bottom-level consistent rendering based on the virtual background image to be rendered to obtain a body image after the bottom-level consistent rendering, comprises:
transferring the virtual background image to be rendered to an LAB color space to obtain a first image;
respectively calculating a first standard deviation and a first mean value of an L channel, an A channel and a B channel of the first image;
transferring the main image to an LAB color space to obtain a second image;
correcting the second image according to the first standard deviation and the first mean value to obtain a third image, wherein a difference value between a second standard deviation of the L channel, the A channel and the B channel of the third image and the first standard deviation is within a first preset threshold interval, and a difference value between a second mean value and the first mean value is within a second preset threshold interval;
and transferring the third image from an LAB color space to an RGB color space to obtain a fourth image, wherein the fourth image is a main image after the bottom layer is rendered in a consistent manner.
6. The method according to any one of claims 2 to 5, wherein obtaining the rendered image according to the bottom-layer consistent rendered subject image and the virtual background to be rendered comprises:
inputting the main body image subjected to bottom layer consistent rendering to a first STN network which is trained in advance to obtain a first change matrix output by the first STN network;
inputting the virtual background image to be rendered to a second STN network trained in advance to obtain a second change matrix output by the second STN network;
performing image affine change on the main body image subjected to bottom layer consistency rendering by using the first change matrix to obtain a first change image;
performing image affine change on the virtual background image to be rendered by using the second change matrix to obtain a second change image;
and carrying out image synthesis on the first change image and the second change image to obtain the rendered image.
7. The method according to any one of claims 1 to 6, wherein the step of placing the second virtual object to be rendered in the target virtual background image comprises:
determining a first position of an interactive object corresponding to the preset action in the image to be processed according to a semantic segmentation result of the image to be processed;
taking a second position corresponding to the first position in the target virtual background image as a rendering position of the second virtual object to be rendered;
determining that the depth value of an interactive object in the image to be processed is larger than the depth value of the target object;
rendering the second virtual object to be rendered at the rendering position of the target virtual background image.
8. The method of claim 1, wherein detecting that a target object in the image to be processed is occluded by a first object comprises:
determining the category of each pixel point in the image to be processed according to the semantic segmentation result of the image to be processed;
acquiring depth information of the image to be processed;
when determining that a target pixel point with a depth value smaller than that of the target object exists in the preset range of the target object according to the depth information, taking the category corresponding to the target pixel point as the first object, and determining that the target object is shielded by the first object.
9. The method according to any one of claims 1 to 8, wherein before rendering the first virtual object to be rendered and/or the second virtual object to be rendered in the target virtual background image, resulting in a virtual background image to be rendered, the method further comprises:
determining a virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each second virtual background image;
and displaying the virtual background image to be recommended.
10. The method according to claim 9, wherein determining the virtual background image to be recommended according to the similarity between the original background image of the image to be processed and each of the second virtual background images comprises:
performing foreground and background segmentation on the image to be processed to obtain an original background image of the image to be processed;
performing multi-class semantic segmentation on the original background image to obtain a second semantic segmentation result;
performing multi-class semantic segmentation on each second virtual background image to obtain a third semantic segmentation result of each second virtual background image;
calculating IOU values of the original background image and each second virtual background image according to the second semantic segmentation result and the third semantic segmentation result;
respectively calculating a first color distribution curve of the original background image and a second color distribution curve of each second virtual background image;
calculating curve similarity between the first color distribution curve and each of the second color distribution curves;
and determining the virtual background image to be recommended from the second virtual background images according to the curve similarities and the IoU values.
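Purely as an illustration of combining segmentation overlap with colour-distribution similarity for the recommendation in claim 10: the 150-class assumption (ADE20K-style labels), the histogram-intersection similarity, and the weighted sum are all choices made here, not details from the claim.

```python
import numpy as np

def class_iou(seg_a: np.ndarray, seg_b: np.ndarray, num_classes: int) -> float:
    """Mean per-class IoU between two (H, W) semantic segmentation maps."""
    ious = []
    for c in range(num_classes):
        union = np.logical_or(seg_a == c, seg_b == c).sum()
        if union:
            inter = np.logical_and(seg_a == c, seg_b == c).sum()
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def color_similarity(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """Histogram-intersection similarity between the RGB colour distributions."""
    score = 0.0
    for ch in range(3):
        ha, _ = np.histogram(img_a[..., ch], bins=bins, range=(0, 255))
        hb, _ = np.histogram(img_b[..., ch], bins=bins, range=(0, 255))
        ha, hb = ha / max(ha.sum(), 1), hb / max(hb.sum(), 1)
        score += np.minimum(ha, hb).sum() / 3.0
    return float(score)

def recommend_background(orig_bg, orig_seg, candidates, num_classes=150, alpha=0.5):
    """Rank candidate (image, segmentation) pairs by a weighted IoU + colour score."""
    best_score, best_img = -1.0, None
    for img, seg in candidates:
        score = (alpha * class_iou(orig_seg, seg, num_classes) +
                 (1.0 - alpha) * color_similarity(orig_bg, img))
        if score > best_score:
            best_score, best_img = score, img
    return best_img
```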
11. The method of claim 1, further comprising:
if the depth value of the interactive object corresponding to the second virtual object to be rendered is smaller than the depth value of the target object, rendering the first virtual object to be rendered in the target virtual background image to obtain a virtual background image to be rendered, or taking the target virtual background image as the virtual background image to be rendered;
after performing image rendering according to the virtual background image to be rendered and the main body image to obtain a rendered image, the method further comprises:
rendering the second virtual object to be rendered in the rendered image to obtain an output image.
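The ordering described in claims 7 and 11 can be summarised with a small dispatch sketch. Here `render` and `compose` are placeholder callables assumed for illustration (drawing a virtual object onto an image, and compositing the main body image over a background); neither name comes from the patent.

```python
def render_scene(target_bg, main_body, first_obj, second_obj,
                 interactive_depth, target_depth, render, compose):
    """Illustrative ordering of the rendering steps in claims 7 and 11."""
    bg = render(target_bg, first_obj) if first_obj is not None else target_bg
    if interactive_depth < target_depth:
        # Claim 11: the interactive object is in front of the target, so the second
        # virtual object is drawn last, on top of the composited image.
        rendered = compose(bg, main_body)
        return render(rendered, second_obj) if second_obj is not None else rendered
    # Claim 7: the interactive object is behind the target, so the second virtual
    # object goes into the virtual background before compositing.
    return compose(render(bg, second_obj), main_body)
```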
12. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 11 when executing the computer program.
13. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202011240398.8A 2020-11-09 2020-11-09 Image rendering method and device Pending CN114494566A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011240398.8A CN114494566A (en) 2020-11-09 2020-11-09 Image rendering method and device
PCT/CN2021/126469 WO2022095757A1 (en) 2020-11-09 2021-10-26 Image rendering method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011240398.8A CN114494566A (en) 2020-11-09 2020-11-09 Image rendering method and device

Publications (1)

Publication Number Publication Date
CN114494566A (en) 2022-05-13

Family

ID=81457498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011240398.8A Pending CN114494566A (en) 2020-11-09 2020-11-09 Image rendering method and device

Country Status (2)

Country Link
CN (1) CN114494566A (en)
WO (1) WO2022095757A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115830281A (en) * 2022-11-22 2023-03-21 山东梦幻视界智能科技有限公司 Naked eye VR immersive experience device based on MiniLED display screen

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115665461B (en) * 2022-10-13 2024-03-22 聚好看科技股份有限公司 Video recording method and virtual reality device
CN115908663B (en) * 2022-12-19 2024-03-12 支付宝(杭州)信息技术有限公司 Virtual image clothing rendering method, device, equipment and medium
CN116934936A (en) * 2023-09-19 2023-10-24 成都索贝数码科技股份有限公司 Three-dimensional scene style migration method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9682321B2 (en) * 2012-06-20 2017-06-20 Microsoft Technology Licensing, Llc Multiple frame distributed rendering of interactive content
CN105791793A (en) * 2014-12-17 2016-07-20 光宝电子(广州)有限公司 Image processing method and electronic device
US10867214B2 (en) * 2018-02-14 2020-12-15 Nvidia Corporation Generation of synthetic images for training a neural network model
CN109461199B (en) * 2018-11-15 2022-12-30 腾讯科技(深圳)有限公司 Picture rendering method and device, storage medium and electronic device
CN110062176B (en) * 2019-04-12 2020-10-30 北京字节跳动网络技术有限公司 Method and device for generating video, electronic equipment and computer readable storage medium
CN110956654B (en) * 2019-12-02 2023-09-19 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111667399B (en) * 2020-05-14 2023-08-25 华为技术有限公司 Training method of style migration model, video style migration method and device
CN111726479B (en) * 2020-06-01 2023-05-23 北京像素软件科技股份有限公司 Image rendering method and device, terminal and readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830281A (en) * 2022-11-22 2023-03-21 山东梦幻视界智能科技有限公司 Naked eye VR immersive experience device based on MiniLED display screen
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115775024B (en) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 Virtual image model training method and device

Also Published As

Publication number Publication date
WO2022095757A1 (en) 2022-05-12

Similar Documents

Publication Publication Date Title
WO2022095757A1 (en) Image rendering method and apparatus
US11727577B2 (en) Video background subtraction using depth
CN109691054A (en) Animation user identifier
US11070717B2 (en) Context-aware image filtering
CN111556336B (en) Multimedia file processing method, device, terminal equipment and medium
CN103731742B (en) For the method and apparatus of video streaming
JP2023539620A (en) Facial image processing method, display method, device and computer program
CN113453040B (en) Short video generation method and device, related equipment and medium
CN114096986A (en) Automatically segmenting and adjusting an image
CN114640783B (en) Photographing method and related equipment
CN115689963B (en) Image processing method and electronic equipment
Turban et al. Extrafoveal video extension for an immersive viewing experience
CN114926351B (en) Image processing method, electronic device, and computer storage medium
CN114758027A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113099146A (en) Video generation method and device and related equipment
CN116916151B (en) Shooting method, electronic device and storage medium
CN117061882A (en) Video image processing method, apparatus, device, storage medium, and program product
CN113395441A (en) Image color retention method and device
CN114640798B (en) Image processing method, electronic device, and computer storage medium
KR102360919B1 (en) A host video directing system based on voice dubbing
CN115587938A (en) Video distortion correction method and related equipment
CN114443182A (en) Interface switching method, storage medium and terminal equipment
CN116363017B (en) Image processing method and device
CN116612060B (en) Video information processing method, device and storage medium
CN116091572B (en) Method for acquiring image depth information, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination