CN117576338A - Dynamic image generation method and device, electronic equipment and storage medium


Info

Publication number
CN117576338A
Authority
CN
China
Prior art keywords
image
depth
pixel
processed
edge
Legal status
Pending
Application number
CN202210943580.2A
Other languages
Chinese (zh)
Inventor
孙爽
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210943580.2A
Publication of CN117576338A
Current legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 - Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/40 - Filling a planar surface by adding surface attributes, e.g. colour or texture
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/10 - Geometric effects
    • G06T 15/20 - Perspective computation
    • G06T 15/205 - Image-based rendering
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/13 - Edge detection
    • G06T 7/50 - Depth or shape recovery
    • G06T 7/55 - Depth or shape recovery from multiple images

Abstract

The present invention relates to the field of computer technologies, and in particular to the field of image processing technologies, and provides a dynamic image generation method, apparatus, electronic device, and storage medium for improving the efficiency and universality of dynamic image generation. The method comprises the following steps: acquiring a first depth image corresponding to a static image to be processed; performing edge detection on the first depth image to obtain the edge pixels in the first depth image, and determining a mask region in the image to be processed based on the edge pixels; performing image filling on the mask region to obtain a filling image; and constructing a target three-dimensional grid according to the image to be processed and the filling image, and rendering the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated from the image to be processed. In the method, the mask region is obtained from the depth edges, the mask region is filled, a three-dimensional grid is constructed from the images before and after filling, and the grid is rendered, thereby effectively improving the generation efficiency and universality of dynamic images.

Description

Dynamic image generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to the field of image processing technologies, and provides a dynamic image generating method, a dynamic image generating device, an electronic device, and a storage medium.
Background
A dynamic image is a video in which the scene content of a static image is inferred under different camera angles, and a continuous camera transformation produces a change of viewing angle.
In the related art, approaches to generating a dynamic image from a single image include those based on the layered depth image (LDI) and those based on the multi-plane image (MPI). Both are 2.5D representations of three-dimensional space.
The LDI approach layers the depth of the image according to its edges and fills each edge separately. Its processing time is long, and because a filling area is obtained independently for every edge, it does not meet the requirements of practical scenes.
The MPI approach simulates the three-dimensional representation of an image as a superposition of multiple planes, each plane representing a relative depth, and transforms all planes when the camera changes. It also suffers from long processing time and cannot be truly applied to some special-effect processing scenes.
In summary, how to generate dynamic images efficiently is a problem that needs to be solved.
Disclosure of Invention
The embodiment of the application provides a dynamic image generation method, a dynamic image generation device, an electronic device and a storage medium, which are used for improving the generation efficiency and universality of dynamic images.
The method for generating the dynamic image provided by the embodiment of the application comprises the following steps:
acquiring a first depth image corresponding to an image to be processed, wherein the image to be processed is a static image;
obtaining each edge pixel in the first depth image by carrying out edge detection on the first depth image, and determining a mask region in the image to be processed based on each edge pixel;
image filling is carried out on the mask region, and a filling image corresponding to the image to be processed is obtained;
and constructing a target three-dimensional grid according to the image to be processed and the filling image, and rendering the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
The dynamic image generating apparatus provided in the embodiment of the present application includes:
the depth acquisition unit is used for acquiring a first depth image corresponding to an image to be processed, wherein the image to be processed is a static image;
a mask unit, used for obtaining each edge pixel in the first depth image by performing edge detection on the first depth image, and determining a mask region in the image to be processed based on each edge pixel;
the filling unit is used for carrying out image filling on the mask region to obtain a filling image corresponding to the image to be processed;
the generating unit is used for constructing a target three-dimensional grid according to the image to be processed and the filling image, and rendering the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
Optionally, the depth acquisition unit is specifically configured to:
performing depth estimation on the image to be processed through a depth estimation network to obtain an output image of the depth estimation network, wherein the pixel value of each pixel in the output image represents the depth information of the corresponding pixel in the image to be processed relative to a camera; and performs any one of the following operations:
taking the output image as a first depth image corresponding to the image to be processed;
and performing depth adjustment on the output image to obtain a first depth image corresponding to the image to be processed.
Optionally, the depth acquisition unit is specifically configured to:
sequentially carrying out average filtering and normalization processing on the output image to obtain an intermediate image;
and carrying out at least one median filtering on a non-edge area in the intermediate image to obtain a first depth image corresponding to the image to be processed.
Optionally, the depth acquisition unit is further configured to determine the non-edge area by:
comparing the difference between pixel values of every two adjacent pixels in the intermediate image with a preset depth difference value;
and taking an area formed by pixels with the difference of corresponding pixel values not exceeding the preset depth difference value in the intermediate image as the non-edge area.
Optionally, the depth acquisition unit is specifically configured to:
carrying out weighted median filtering on a non-edge area in the intermediate image at least once to obtain a first depth image corresponding to the image to be processed, wherein in each weighted median filtering process, the following operations are respectively carried out on each pixel in the non-edge area:
determining, for a pixel, the weight of each pixel within a first detection window centered on that pixel, wherein the weight of each pixel is determined based on the difference between its pixel value and the pixel value of the center pixel;
accumulating the weights sequentially in ascending order, and taking the pixel value corresponding to the weight at which the accumulated result exceeds the average value of the weights as a target pixel value;
updating the pixel value of the center pixel based on the target pixel value.
Optionally, the depth acquisition unit is specifically configured to:
obtaining segmentation information of each object in the image to be processed;
and adjusting pixel values of pixels belonging to the same object in the output image based on the segmentation information of each object to obtain a first depth image corresponding to the image to be processed.
Optionally, the masking unit is specifically configured to:
performing edge detection on the first depth image to obtain edge information of each pixel in the first depth image, wherein the edge information of each pixel is a pixel value of a corresponding pixel determined by an edge detection operator;
and taking the pixels, corresponding to the edge information of which is larger than a preset threshold value, in the first depth image as edge pixels.
Optionally, the masking unit is specifically configured to:
comparing, within a second detection window centered on each edge pixel, the edge information of each pixel with an edge average depth, wherein the edge average depth is the average value of the edge information of the edge pixels;
and taking the region formed by the pixels in the image to be processed whose edge information is greater than the edge average depth as the mask region.
Optionally, the generating unit is specifically configured to:
constructing an original three-dimensional grid based on the image to be processed and the first depth image;
constructing an intermediate three-dimensional grid based on the filling image and a second depth image corresponding to the filling image;
and merging the original three-dimensional grid and the intermediate three-dimensional grid to obtain the target three-dimensional grid.
Optionally, the generating unit is specifically configured to:
constructing point information in the original three-dimensional grid based on the two-dimensional coordinates of each pixel in the image to be processed and the pixel value of each pixel in the first depth image;
and constructing the surface patch information in the original three-dimensional grid based on the four-connected information of the image to be processed, wherein the surface patch information represents the link relation between points in the original three-dimensional grid, and the points corresponding to the edge pixels in the original three-dimensional grid are not linked.
Optionally, the generating unit is specifically configured to:
constructing point information in the intermediate three-dimensional grid based on the two-dimensional coordinates of each pixel in the filling image and the pixel value of each pixel in the second depth image;
And constructing the patch information in the middle three-dimensional grid based on the four-connected information of the filling image, wherein the patch information represents the link relation between points in the middle three-dimensional grid.
An electronic device provided in an embodiment of the present application includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, causes the processor to execute any one of the steps of the dynamic image generating method described above.
The embodiment of the application provides a computer readable storage medium, which comprises a computer program, wherein the computer program is used for enabling an electronic device to execute the steps of any one of the dynamic image generating methods when the computer program runs on the electronic device.
Embodiments of the present application provide a computer program product comprising a computer program stored in a computer readable storage medium; when a processor of an electronic device reads the computer program from a computer-readable storage medium, the processor executes the computer program so that the electronic device performs the steps of any one of the moving image generating methods described above.
The beneficial effects of the application are as follows:
The embodiments of the present application provide a dynamic image generation method, a dynamic image generation apparatus, an electronic device, and a storage medium. The dynamic image generation method combines depth estimation, image filling, image rendering and other technologies: the depth information of the image is obtained through depth estimation, a depth edge is obtained from the image depth, and the mask region of the image, i.e. the depth occlusion region, is obtained from the depth edge; the mask region is then filled by an image filling method to obtain the corresponding filling image, i.e. the repaired image; finally, a target three-dimensional grid is constructed from the images before and after filling, and the grid is rendered according to the camera movement to obtain a dynamic image. With this method, there is no need to obtain a filling area for every edge and fill each edge separately; the target three-dimensional grid is constructed only from the image to be processed and the filling image, rather than simulating the three-dimensional representation of the image as a superposition of multiple planes, so no transformation of all planes is required. The processing time is therefore shorter, the method can be applied to various scenes, and the generation efficiency and universality of dynamic images are effectively improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
fig. 2 is a flow chart of a dynamic image generating method in an embodiment of the present application;
FIG. 3 is a schematic illustration of a still image in an embodiment of the present application;
FIG. 4 is a schematic diagram of a depth image according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a depth optimization algorithm according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an averaging filter in an embodiment of the present application;
FIG. 7 is a schematic diagram of a second detection window according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a mask image according to an embodiment of the present application;
FIG. 9 is a schematic illustration of a fill image in an embodiment of the present application;
fig. 10 is a logic diagram of a dynamic image generating method according to an embodiment of the present application;
fig. 11 is a schematic view of a video frame of a moving image generated based on the moving image generating method in the embodiment of the present application;
fig. 12 is a schematic view of a video frame of another moving image generated based on the moving image generating method in the embodiment of the present application;
fig. 13 is a specific flowchart of a dynamic image generating method according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a moving image generating apparatus in the embodiment of the present application;
fig. 15 is a schematic diagram of a hardware composition structure of an electronic device to which the embodiments of the present application are applied;
fig. 16 is a schematic diagram of a hardware composition structure of another electronic device to which the embodiments of the present application are applied.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, which can be made by a person of ordinary skill in the art without any inventive effort, based on the embodiments described in the present application are intended to be within the scope of the technical solutions of the present application.
Some of the concepts involved in the embodiments of the present application are described below.
Static image: refers to an image that does not change state over time.
Dynamic image: by a dynamic image is understood an image that is composed of a plurality of still images, i.e. an image that produces a certain dynamic effect when a particular group of still images is switched at a specified frequency. A common representation on the network is an animation in an image interchange format (Graphics Interchange Format, gif), which is to switch images of multiple layers according to time, so as to achieve the effect of animation.
Depth image: also referred to as range imaging, refers to an image that uses the distance (depth) from an image acquisition device to points in the scene as pixel values, which directly reflect the geometry of the visible surface of the scene. The depth image can be calculated as point cloud data through coordinate conversion, and the point cloud data with regular and necessary information can also be reversely calculated as depth image data.
Edge pixels: refers to pixel points located at the depth edges in the image. An edge is a boundary of different regions, is a collection of pixels with significant variation in surrounding (local) pixels, has both magnitude and direction properties, and can be created by significant variation in local features and surrounding pixels. Based on this, the depth edge refers to a pixel point that can characterize the distinguishing boundary of the front object and the rear region, as determined by analyzing the pixel value of the pixel point.
Image mask and mask area: an image mask refers to a process or area of image processing that is controlled by masking the processed image (either entirely or partially) with a selected image or object. The particular image or object used for overlay is referred to as a mask template. The mask region is also called a depth shielding region, in this embodiment, the region to be shielded is a region to be shielded, in the image to be processed, the front object shields a part of the rear region, the part of the region is the mask region, and in the binarized mask image, the pixel value of the pixel point of the part of the region can be set to be 1.
Filling an image: refers to an image obtained by filling a mask region in an image to be processed by an image filling technique. In the embodiment of the application, image filling refers to estimating pixel information of an unknown region by combining pixel information of a known region in an image, so that the filled pixels are semantically consistent with the known pixels. The unknown region is a mask region in the image, and the known region is other regions except the mask region in the image.
Four pieces of communication information: the four-way connection means the up, down, left and right positions corresponding to the pixel positions, which are immediately adjacent positions. In total, 4 directions, so called four-way. The four-connected information is connected information which can represent the position relation between pixels in the image.
Median filtering: is a nonlinear image smoothing technique that sets the gray value of each pixel to the median of all pixel gray values within a certain neighborhood window of the point. The median filtering can effectively remove salt and pepper noise.
Weighted median filtering: each pixel in the window is multiplied by a corresponding weight, then the values multiplied by the weight are used for sorting, and the median value is taken to replace the gray value of the central element. The median filtering can be regarded as weighted median filtering where the weight of each pixel is 1.
Sobel operator: is one of the most important operators in pixel image edge detection, and plays a role in the information technology fields of machine learning, digital media, computer vision and the like. Technically, it is a discrete first order difference operator that is used to calculate an approximation of the first order gradient of the image brightness function. Using this operator at any point in the image will result in a gradient vector or normal vector for that point.
The following briefly describes the design concept of the embodiment of the present application:
A dynamic image is a special-effect expression that is very appealing to users: a static image is converted into a continuous dynamic video expression, for which there is strong demand in the field of long and short videos. A dynamic photo is obtained by estimating the scene content of a static image under different camera angles and applying a continuous camera transformation to obtain a dynamic video with a specific change of viewing angle.
Research on dynamic photos, i.e. new-view-synthesis algorithms based on a single image, can be roughly divided into methods based on point clouds, on LDI, on MPI, and the like.
In the point-cloud-based approach, the image and its estimated depth values are directly expressed as a point cloud in space; along a preset camera path, the start and end camera positions are filled with image color and depth, and rendering is performed in the corresponding three-dimensional space. Because no patches are constructed, this scheme produces clipping artifacts under large deformations, and it cannot handle the clipping caused by the overlap of the fills at the start and end positions. The latter two, LDI and MPI, are in fact 2.5D expressions of three-dimensional space, and their processing time is long.
In summary, in actual use these methods suffer from poor results in real scenes caused by training on synthetic data, or from excessively long algorithm processing time, and therefore cannot be truly applied to the special-effect processing scenes of the present application.
In view of this, the embodiments of the present application provide a dynamic image generation method, apparatus, electronic device, and storage medium. The dynamic image generation method combines depth estimation, image filling, image rendering and other technologies: the depth information of the image is obtained through depth estimation, a depth edge is obtained from the image depth, and the mask region of the image, i.e. the depth occlusion region, is obtained from the depth edge; the mask region is then filled by an image filling method to obtain the corresponding filling image, i.e. the repaired image; finally, a target three-dimensional grid is constructed from the images before and after filling, and the grid is rendered according to the camera movement to obtain a dynamic image. With this method, there is no need to obtain a filling area for every edge and fill each edge separately; the target three-dimensional grid is constructed only from the image to be processed and the filling image, rather than simulating the three-dimensional representation of the image as a superposition of multiple planes, so no transformation of all planes is required. The processing time is therefore shorter, the method can be applied to various scenes, and the generation efficiency and universality of dynamic images are effectively improved.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario in an embodiment of the present application. The application scenario diagram includes two terminal devices 110 and a server 120.
In the embodiment of the present application, the terminal device 110 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an electronic book reader, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like; the terminal device may be provided with a client related to the dynamic image, where the client may be software (such as a browser, clipping software, etc.), or may be a web page, an applet, etc., and the server 120 may be a background server corresponding to the software or the web page, the applet, etc., or a server specifically used for generating the dynamic image, which is not specifically limited in this application. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform.
Note that the dynamic image generation method in the embodiments of the present application may be performed by an electronic device, which may be the terminal device 110 or the server 120; that is, the method may be performed by the terminal device 110 or the server 120 alone, or by the terminal device 110 and the server 120 together. For example, when the terminal device 110 and the server 120 perform the method together, the terminal device 110 obtains an image to be processed and sends it to the server 120. The server 120 performs depth estimation on the image to be processed to obtain a first depth image, performs edge detection on the first depth image to obtain the edge pixels in the first depth image, and determines a mask region based on the edge pixels. The server 120 then fills the mask region through an image filling technology to obtain a filling image corresponding to the image to be processed. Finally, a target three-dimensional grid is constructed from the image to be processed and the filling image, the target three-dimensional grid is rendered through an image rendering technology to generate the corresponding dynamic image, and the dynamic image is sent to the terminal device 110 for display.
It should be noted that the above process performed jointly by the terminal device 110 and the server 120 is only illustrative; any manner of executing the dynamic image generation method of the embodiments of the present application on an electronic device is applicable and is not described in detail here. In the following, the server is mainly taken as the execution body for illustration.
In an alternative embodiment, the terminal device 110 and the server 120 may communicate via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
It should be noted that, the embodiment shown in fig. 1 is merely an example, and the number of terminal devices and servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
In the embodiment of the present application, when the number of servers is plural, plural servers may be configured as a blockchain, and the servers are nodes on the blockchain; the dynamic image generating method disclosed in the embodiments of the present application, wherein the image data involved may be stored on a blockchain, for example, related data (such as pixel values) of an image to be processed, a depth image, a filling image, etc., three-dimensional grid related data, etc.
In addition, the embodiments of the present application may be applied to various scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like.
The moving image generation method provided by the exemplary embodiments of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenarios described above, and it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in any way in this respect.
Referring to fig. 2, a flowchart of an implementation of a method for generating a dynamic image according to an embodiment of the present application is shown, taking a server as an execution body as an example, and the specific implementation flow of the method is as follows (S21-S24):
s21: the server acquires a first depth image corresponding to the image to be processed.
The image to be processed is a static image, namely, an image which does not change state with time. As shown in fig. 3, which is a schematic view of a still image in an embodiment of the present application. The foreground object in the image is a person and a dog, and the background comprises trees, walls and the like.
It should be noted that the example diagram in fig. 3 is presented as a black-and-white line drawing as required; the actual image is a Red-Green-Blue (RGB) image, and the example in fig. 3 is for reference only (the same applies to the related schematic diagram of the filling image).
In the embodiment of the application, the corresponding depth image can be obtained by performing depth estimation on the image to be processed. Herein, the depth image corresponding to the image to be processed may be simply referred to as a first depth image, and the depth image corresponding to the fill image may be simply referred to as a second depth image.
In the embodiment of the application, for a given image, a depth estimation network may be used to perform depth estimation on the image. Optionally, the specific procedure in implementing step S21 is as follows:
inputting the image to be processed into a depth estimation network, and carrying out depth estimation on the image to be processed through the depth estimation network to obtain an output image of the depth estimation network; further, a first depth image corresponding to the image to be processed is determined according to the output image.
For example, a monocular depth estimation network is used for depth estimation, i.e. its depth information is estimated on the basis of one image, whereby the pixel values of the pixels in the output image characterize the depth information of the corresponding pixels in the image to be processed with respect to the camera.
Specifically, for a given single-view static image, the monocular depth estimation network estimates the depth information of each pixel in the image relative to the camera. The output of the network is a single-channel image of the same size as the input, and the value of each pixel represents the relative depth of the corresponding pixel in the input. The input and output of the algorithm correspond to the input image and the gray-scale map produced by the depth estimation step of the algorithm; accordingly, the pixel values of these pixels may also be referred to as gray values.
It should be noted that the depth estimation network in the embodiments of the present application may be any network structure, which is not specifically limited herein.
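As an illustration only, the following sketch shows how such a relative depth map could be obtained in practice. It assumes the publicly available MiDaS model loaded through torch.hub, whereas the embodiment does not prescribe any particular network:

```python
import cv2
import torch

# Assumption: MiDaS is used purely as an example of a monocular depth network.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transforms.small_transform(img))   # relative depth, shape (1, H', W')
    # Resize back to the input resolution so that every pixel of the image to be
    # processed has a corresponding depth value (single-channel output image).
    output = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze().cpu().numpy()
```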
When determining the first depth image corresponding to the image to be processed according to the output image, one alternative implementation is to use the output image directly as the first depth image, since the pixel values of the output image already represent the depth information.
Fig. 4 is a schematic diagram of a depth image according to an embodiment of the present application. The image on the left side in fig. 4 is an output image obtained by performing depth estimation through the monocular depth estimation network.
It can be seen that the depth image obtained through the monocular depth estimation network is blurred at the edges of the image: as shown on the left side of fig. 4, it is blurred at the edges of the corresponding objects (the person and the dog) and cannot effectively distinguish the depth difference between the objects in front of and behind the boundary. The depth image obtained through the monocular depth estimation network can therefore be further optimized, for example by sharpening the edges with edge-preserving filtering; edge information in the image is effectively retained during filtering, and an effective depth edge can be obtained based on this edge information.
That is, when determining the first depth image corresponding to the image to be processed according to the output image, another alternative implementation manner is to perform depth adjustment on the output image, sharpen the edge, and obtain the first depth image corresponding to the image to be processed.
An alternative depth adjustment (depth optimization) method is as follows:
firstly, sequentially carrying out average filtering and normalization processing on output images of a monocular depth estimation network to obtain an intermediate image; and further, carrying out at least one median filtering on the non-edge area in the intermediate image to obtain a first depth image corresponding to the image to be processed.
In the following, two passes of weighted median filtering are taken as an example; fig. 5 is a schematic flow chart of the depth optimization algorithm in the embodiment of the present application. The relative depth output by the monocular depth estimation network is first subjected to 3×3 average filtering (Blur).
The 3×3 average filtering specifically applies a template, i.e. a 3×3 window, to each target pixel in the image; the template covers the target pixel and its 8 surrounding neighboring pixels, and the original pixel value is replaced by the average value of all the pixels in the template. The process is shown in formula (1):
Blur_{x,y} = (1/9) · Σ_{i=-1..1} Σ_{j=-1..1} Input_{x+i, y+j}    (1)
where (x, y) denotes the position of a pixel in the image, x denotes the row index and y denotes the column index; for 3×3 average filtering, i and j take integer values in [-1, 1]. Blur_{x,y} is the pixel value of the pixel Input_{x,y} after average filtering.
Taking P22 as an example, referring to fig. 6, a schematic diagram of 3×3 average filtering in the embodiment of the present application is shown. The updated pixel value corresponding to P22 is the average value of the current pixel values of nine pixels P11, P12, P13, P21, P22, P23, P31, P32, and P33. For each pixel in the output image, the pixel value may be updated based on equation (1) above as the target pixel.
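A minimal sketch of the 3×3 average filtering of formula (1), using OpenCV; the border handling at the image boundary is OpenCV's default and is an assumption:

```python
import cv2
import numpy as np

def blur3x3(depth: np.ndarray) -> np.ndarray:
    # Formula (1): every pixel is replaced by the mean of the 3x3 window centred on it.
    return cv2.blur(depth.astype(np.float32), (3, 3))
```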
After Blur is applied to the output image, the resulting image is further normalized so that its pixel values lie between 0 and 1, as shown in formula (2):
Norm_{x,y} = (Input_{x,y} - min(Input)) / (max(Input) - min(Input))    (2)
where, for a pixel, Input_{x,y} is its current pixel value (here, the pixel value after average filtering), Norm_{x,y} is its normalized pixel value, max(Input) is the largest pixel value in the image, and min(Input) is the smallest pixel value in the image.
In the embodiment of the present application, the range of the pixel value (also called gray value) of the normalized image is between 0 and 1, and the obtained image is the intermediate image. Further, a non-edge region of the intermediate image is twice weighted median filtered (Weighted Median Filter, WMF).
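For completeness, a sketch of the min-max normalization of formula (2); the small epsilon is an added safeguard against a constant image and is not part of the formula:

```python
import numpy as np

def normalize01(depth: np.ndarray) -> np.ndarray:
    # Formula (2): rescale the blurred depth so that its values lie in [0, 1].
    d = depth.astype(np.float32)
    return (d - d.min()) / (d.max() - d.min() + 1e-8)
```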
The median filtering can be regarded as WMF where the weight of each pixel is 1.
Optionally, the non-edge regions in the intermediate image are determined by:
comparing the difference between pixel values of every two adjacent pixels in the intermediate image with a preset depth difference value; furthermore, the region formed by the pixels in the intermediate image, the difference between the corresponding pixel values of which does not exceed the preset depth difference value, is used as the non-edge region.
For example, with the preset depth difference value set to diff = 0.05, edges can be detected using this difference: pixels whose pixel-value difference (i.e. depth gray difference) from an adjacent pixel is greater than 0.05 are regarded as edge pixels, while pixels whose depth gray difference from their adjacent pixels does not exceed 0.05 are regarded as non-edge pixels, and together the non-edge pixels form the non-edge region.
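A sketch of this non-edge detection, assuming that "adjacent pixels" means the 4-neighbourhood of each pixel (the embodiment does not state the neighbourhood explicitly):

```python
import numpy as np

def non_edge_mask(norm_depth: np.ndarray, diff: float = 0.05) -> np.ndarray:
    # A pixel is non-edge when its normalized-depth difference to every
    # horizontal and vertical neighbour stays within the preset threshold diff.
    edge = np.zeros(norm_depth.shape, dtype=bool)
    dx = np.abs(norm_depth[:, 1:] - norm_depth[:, :-1]) > diff   # horizontal neighbours
    edge[:, 1:] |= dx
    edge[:, :-1] |= dx
    dy = np.abs(norm_depth[1:, :] - norm_depth[:-1, :]) > diff   # vertical neighbours
    edge[1:, :] |= dy
    edge[:-1, :] |= dy
    return ~edge   # True where the pixel belongs to the non-edge region
```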
Next, as shown in fig. 5, by performing WMF twice on the non-edge region in the intermediate image, the obtained result is the first depth image. Optionally, each time a weighted median filter is applied to a non-edge region in the intermediate image, the following operations are performed separately for each pixel in the non-edge region:
for a pixel, first determining the weight of each pixel within a first detection window centered on that pixel, wherein the weight of each pixel is determined based on the difference between its pixel value and the pixel value of the center pixel;
For the non-edge region, WMF is performed in the present application, and the weight used in this process is calculated as shown in formula (3):
where Input_p represents the current pixel value of the pixel point p and Input_q represents the current pixel value of the pixel point q. For a pixel point q in the first detection window centered on the pixel point p, the corresponding weight can be calculated by the above formula (3).
Further, the weights are accumulated sequentially from small to large, and when the accumulated result exceeds the average value of the weights, the pixel value corresponding to the weight is used as a target pixel value; and updating the pixel value of the pixel based on the target pixel value, for example, taking the target pixel value as the updated pixel value of the pixel.
In the present application, the weights are accumulated in order from small to large, and when the average value of the weights is exceeded, the corresponding pixel gray value is used as the output value. Other manners may also be adopted, such as multiplying each pixel in the first detection window by its corresponding weight, sorting the weighted values, taking the median value to replace the pixel value of the central element, and so on.
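A simplified sketch of a single weighted-median-filtering pass over the non-edge region, following the accumulation rule described above. Since the body of formula (3) only states that the weight depends on the gray-value difference to the centre pixel, the exponential weight and the value of sigma below are assumptions:

```python
import numpy as np

def weighted_median_pass(depth, non_edge, win=5, sigma=0.1):
    # One WMF pass: each non-edge pixel is updated according to the weighted
    # median rule; win is the size of the first detection window (assumed 5x5).
    h, w = depth.shape
    r = win // 2
    out = depth.copy()
    padded = np.pad(depth, r, mode="edge")
    for y in range(h):
        for x in range(w):
            if not non_edge[y, x]:
                continue
            window = padded[y:y + win, x:x + win].ravel()
            # Assumed weight: the larger the gray-value difference, the smaller the weight.
            weights = np.exp(-np.abs(window - depth[y, x]) / sigma)
            # Accumulate the weights in ascending order; the pixel value whose
            # weight pushes the running sum past the mean weight becomes the
            # updated value of the centre pixel.
            order = np.argsort(weights)
            cum = np.cumsum(weights[order])
            idx = int(np.searchsorted(cum, weights.mean(), side="right"))
            out[y, x] = window[order[min(idx, len(order) - 1)]]
    return out   # a practical implementation would vectorize these loops
```

Calling the function twice on the intermediate image, with the non-edge map from the earlier sketch, corresponds to the two WMF passes shown in fig. 5.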
In the above embodiment, the median filtering process of the non-edge area is repeated twice, and the result after depth optimization can be seen in the right image in fig. 4. Compared with the relative depth image (namely the output image) output by the original monocular depth estimation network, after optimization, the depth information of the adjacent areas is smoother, the depth information of the edge areas is sharper, the construction of the three-dimensional grid is facilitated, the depth consistency of the continuous areas is ensured, and the depth differences of the areas with different layers are distinguished.
It should be noted that, in addition to the depth optimization method illustrated in fig. 5, information such as segmentation may be optionally combined to ensure depth consistency of the object.
Specifically, the segmentation information of each object in the image to be processed is firstly obtained, the process can be realized based on an image segmentation network, an image segmentation algorithm and the like, and the objects in the image can be characters, animals, buildings, vehicles and the like, and the specific limitation is not provided herein.
Further, based on the segmentation information of each object, the pixel values of the pixels belonging to the same object in the output image are adjusted to obtain the first depth image corresponding to the image to be processed. In this process, when the depth of the output image is adjusted, the depth of each object may be averaged by combining the image depth with the object segmentation result of the image. For example, for the person in fig. 3, the pixels belonging to the person are determined according to the segmentation information of the person, and the average of their pixel values in the output image is used as the updated pixel value, ensuring that the depth of the same object is consistent.
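A sketch of this segmentation-based adjustment; seg_labels is assumed to be a per-pixel object-id map produced by any image segmentation network or algorithm:

```python
import numpy as np

def equalize_depth_per_object(depth: np.ndarray, seg_labels: np.ndarray) -> np.ndarray:
    # Give every segmented object a single, consistent depth: the mean depth
    # of all pixels belonging to that object.
    out = depth.copy()
    for label in np.unique(seg_labels):
        if label == 0:                      # assumption: 0 marks unlabeled background
            continue
        mask = seg_labels == label
        out[mask] = depth[mask].mean()
    return out
```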
In addition, any two depth adjustment methods can be combined, such as first performing primary pixel value update based on image segmentation information average value calculation, and then re-adjusting by combining average filtering, normalization, median filtering and the like; or, the first adjustment is performed based on average filtering, normalization and median filtering, and then the second adjustment is performed in combination with image segmentation information, etc., which are not particularly limited herein.
The depth-edge based filling process is described in detail below:
s22: and the server acquires each edge pixel in the first depth image by carrying out edge detection on the first depth image, and determines a mask region in the image to be processed based on each edge pixel.
The edge pixels refer to pixels located at depth edges in the image, and in this embodiment, specifically refer to pixels representing a distinguishing boundary between a front object and a rear region.
Alternatively, edge pixels in the first depth image may be determined by:
firstly, edge detection is carried out on a first depth image, and edge information of each pixel in the first depth image is obtained, wherein the edge information of each pixel is a pixel value of a corresponding pixel determined by an edge detection operator; and further, taking the corresponding pixels with the edge information larger than the preset threshold value in the first depth image as edge pixels.
In the embodiment of the present application, after the optimization of the depth is completed, the edge of the depth image may be obtained, and in the embodiment of the present application, the edge is obtained using an edge detection operator, unlike the manner of calculating the edge in the above-described determination of the non-edge region. There are many common edge detection operators, such as Sobel operator, roberts operator, canny operator, prewitt operator, laplacian operator, etc.
Taking the Sobel operator as an example, the Sobel operator detects image edges by calculating an approximate gradient of the image. The formula of the Sobel operator is shown in formula (4), where Gx and Gy are the gradients of the image in the horizontal and vertical directions. Taking the 3×3 Sobel operator as an example, Gx and Gy are calculated as shown in formula (5), i.e. by convolving the image with the standard 3×3 Sobel kernels [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]] (horizontal) and [[-1, -2, -1], [0, 0, 0], [+1, +2, +1]] (vertical).
The edge information of the depth, denoted depth_edge, can then be calculated from Gx and Gy using the Sobel operator, as shown in formula (6),
where depth_edge is a continuous value within 0 to 1.
In the embodiment of the present application, edge pixels are determined based on the depth_edge value of each pixel: depth_edge is compared with a preset threshold (e.g., 0.5), and pixels whose depth_edge is greater than 0.5 can be regarded as edge pixels.
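A sketch of the edge-information computation with OpenCV's Sobel operator; rescaling the gradient magnitude by its maximum so that depth_edge falls in [0, 1] is an assumption about the exact form of formula (6):

```python
import cv2
import numpy as np

def depth_edge_map(norm_depth: np.ndarray, threshold: float = 0.5):
    # Horizontal and vertical Sobel gradients of the (normalized) first depth image.
    gx = cv2.Sobel(norm_depth, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(norm_depth, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    depth_edge = mag / (mag.max() + 1e-8)     # continuous edge information in [0, 1]
    edge_pixels = depth_edge > threshold      # boolean map of edge pixels
    return depth_edge, edge_pixels
```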
Optionally, after determining the edge pixels in the first depth image, a depth-based mask may be calculated based on each edge pixel to determine a mask region in the image to be processed, which specifically includes the following steps:
Comparing the edge information of each pixel with the average edge depth in a second detection window which takes each edge pixel as a center; furthermore, an area formed by pixels with corresponding edge information larger than the average edge depth in the image to be processed is used as a mask area. Wherein the edge average depth is the average value of the edge information of each edge pixel.
For example, after the edge information depth_edge of each pixel is obtained based on the above formula (6), the edge pixels can be screened out, and the average of their depth_edge values can be taken as the edge average depth.
Further, the front-side depth region around the edge can be obtained from the edge information. Specifically, for each point on the edge, a window of fixed size (i.e. the second detection window) is taken, and the pixels within the window whose depth is greater than the edge average depth are set as the mask region.
As shown in fig. 7, which is a schematic diagram of the second detection window in this embodiment. In the embodiment of the present application, the mask region is determined by setting the pixel value of each pixel whose depth_edge is greater than the edge average depth to 1 (white) and the pixel values of the remaining pixels to 0 (black). Suppose there are a edge pixels in the first depth image and the mean of their pixel values is b, i.e. the edge average depth is b. For an edge pixel P33, assuming that the size of the second detection window is 5×5, the 24 remaining neighboring pixels around P33 are obtained, and the pixel values of these pixels are updated by checking whether the values of the 25 pixels are greater than b; for example, in fig. 7 the pixel value of a pixel greater than b is set to 1 (white) and that of a pixel not greater than b is set to 0 (black).
A black-and-white binary image, also called a mask image, is obtained through the above, and as shown in fig. 8, it is a schematic diagram of a mask image in the embodiment of the present application, where a white area is a mask area, that is, an area that needs to be filled later.
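A sketch of the mask computation, using the depth_edge map and edge-pixel map from the previous sketch; the 5×5 window size is taken from the example above and is otherwise an assumption:

```python
import numpy as np

def depth_mask(depth_edge: np.ndarray, edge_pixels: np.ndarray, win: int = 5) -> np.ndarray:
    # For every edge pixel, examine the second detection window around it and
    # set to 1 (white) the pixels whose edge information exceeds the edge
    # average depth; everything else stays 0 (black).
    h, w = depth_edge.shape
    r = win // 2
    edge_mean = depth_edge[edge_pixels].mean()        # edge average depth
    mask = np.zeros((h, w), dtype=np.uint8)
    ys, xs = np.nonzero(edge_pixels)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        window = depth_edge[y0:y1, x0:x1]
        mask[y0:y1, x0:x1][window > edge_mean] = 1    # mask region (white)
    return mask
```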
It should be noted that the image size does not change in the embodiment of the present application, i.e. the image to be processed and the first depth image have the same size. The mask region generated from the edge pixels of the first depth image therefore corresponds to the same region in the image to be processed, and the subsequent image filling is performed on the original image to be processed (which may also be referred to as the original image). For this reason, the operation can be described as determining the mask region in the image to be processed and performing image filling on the mask region of the image to be processed.
S23: and the server fills the mask region with the image to obtain a filling image corresponding to the image to be processed.
In the embodiment of the present application, with the mask obtained from the depth calculation and the original image, the information of the portion covered by the mask can be filled in based on an image restoration network. The mask region to be filled in the present application is the rear region of the image occluded by a front object.
For a given RGB image and its corresponding mask (a 0-1 map, where 1 denotes the unknown region), the image restoration network estimates the pixel values of the unknown region so that the filled pixels are semantically consistent with the known pixels.
In the embodiment of the present application, the input of the image restoration network is the image to be processed (an RGB image) and the mask image (a 0-1 image), and the output is the filling image corresponding to the image to be processed. Referring to fig. 9, which is a schematic diagram of a filling image in an embodiment of the present application: the rear region occluded by the front objects (the person and the dog) in the image to be processed is filled through the image restoration network, so that the filled pixels are semantically consistent with the known pixels. For example, after filling, the tree portion occluded by the person is semantically consistent with the tree portion of the original image to be processed, the wall portion occluded by the person and the dog is semantically consistent with the wall portion of the original image to be processed, and so on.
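The embodiment fills the mask region with a learned image restoration network. As a readily available stand-in for illustration only, classical OpenCV inpainting can be used; the radius and algorithm choice are assumptions:

```python
import cv2

def fill_masked_region(image_bgr, mask):
    # mask: 8-bit 0-1 image; non-zero pixels are the region to be filled.
    # cv2.inpaint is a classical substitute for the image restoration network.
    return cv2.inpaint(image_bgr, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
```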
After the filling image is obtained through the depth edge filling, a three-dimensional grid can be constructed by combining the image to be processed and the filling image, and image rendering is carried out, and the method concretely comprises the following steps.
S24: the server constructs a target three-dimensional grid according to the image to be processed and the filling image, and renders the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
By setting different preset camera positions, the constructed grid can be imaged from different angles (such as in the order from far to near to the camera positions) based on the preset camera paths, so that the effect of the dynamic image is obtained.
In carrying out step S24, two major steps are involved: constructing the three-dimensional grid and rendering it. The three-dimensional grid can be constructed based on a mesh function, i.e. a function for drawing a three-dimensional surface. In the embodiment of the present application, the grid construction is divided into two steps, using the original picture information and the repaired image information respectively to enrich the three-dimensional space. On this basis, an alternative embodiment for constructing the target three-dimensional grid is as follows (S241-S243, not shown in fig. 2):
s241: and constructing an original three-dimensional grid based on the image to be processed and the first depth image.
Specifically, the construction grid includes point information, i.e., a spatial three-dimensional coordinate xyz, and patch information, i.e., a link relationship between points.
An alternative implementation manner is that point information in an original three-dimensional grid is constructed based on two-dimensional coordinates of pixels in an image to be processed and pixel values of the pixels in a first depth image; further, patch information in an original three-dimensional grid is constructed based on four-connected information of the image to be processed, wherein the patch information characterizes a link relation between points in the original three-dimensional grid, and the points corresponding to all edge pixels in the original three-dimensional grid are not linked.
In the embodiment of the present application, the image to be processed is a two-dimensional RGB image, and the first depth image is a two-dimensional gray-scale image, wherein the pixel value (i.e., gray-scale value) of the pixel represents the depth information of the corresponding pixel in the image to be processed relative to the camera. Considering the sharper depth information of the edge region in the optimized image illustrated in fig. 4, the original three-dimensional mesh may be constructed using the first depth image obtained through the depth adjustment.
For example, when the original three-dimensional grid is constructed, the coordinates of the original two-dimensional image (i.e. the image to be processed) may be used as x and y, the z coordinate uses the depth value of the corresponding pixel (i.e. the corresponding pixel value in the first depth image), and the color of each point is the RGB color of the corresponding pixel in the image to be processed. After the points are constructed, the patch information needs to be constructed; in the embodiment of the present application it is constructed directly from the four-connected information of the image to be processed, and it is ensured that the points corresponding to the edge pixels are not linked, i.e. that no links are created across a depth edge.
The four-connected information can effectively represent the positional relation between pixels and can indirectly reflect the link relation between points.
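A sketch of this construction. Splitting each four-connected 2×2 block into two triangles is an assumption (the embodiment only specifies that the patch information follows the four-connected information and that no links cross a depth edge), and the depth is used directly as the z coordinate without camera intrinsics:

```python
import numpy as np

def build_mesh(rgb: np.ndarray, depth: np.ndarray, edge_pixels=None):
    # Points: (x, y, depth) for every pixel, coloured by the RGB image.
    # Patches: two triangles per 2x2 block of four-connected pixels; any block
    # containing an edge pixel is skipped so that no patch crosses a depth edge.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    vertices = np.stack([xs.ravel(), ys.ravel(), depth.ravel()], axis=1).astype(np.float32)
    colors = rgb.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            if edge_pixels is not None and edge_pixels[y:y + 2, x:x + 2].any():
                continue                      # do not link points across a depth edge
            a, b = idx[y, x], idx[y, x + 1]
            c, d = idx[y + 1, x], idx[y + 1, x + 1]
            faces.append((a, b, c))
            faces.append((b, d, c))
    return vertices, colors, np.asarray(faces, dtype=np.int64)
```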
S242: and constructing an intermediate three-dimensional grid based on the filling image and the second depth image corresponding to the filling image.
Similar to the above process of constructing the original three-dimensional grid, corresponding point information and patch information need to be determined, and in an alternative embodiment, the point information in the intermediate three-dimensional grid is constructed based on the two-dimensional coordinates of each pixel in the filling image and the pixel values of each pixel in the second depth image; further, patch information in an intermediate three-dimensional mesh is constructed based on four-connected information of the fill image, where the patch information characterizes a link relationship between points in the intermediate three-dimensional mesh.
In the embodiment of the present application, the filling image is also a two-dimensional RGB image, and the second depth image is a two-dimensional gray-scale image in which the pixel value (i.e. gray value) of each pixel represents the depth information of the corresponding pixel in the filling image relative to the camera. The process of constructing the intermediate three-dimensional grid is similar to step S241: the coordinates of the filling image are used as x and y, the z coordinate uses the depth value of the corresponding pixel (i.e. the corresponding pixel value in the second depth image), and the color is the RGB color of the corresponding pixel in the filling image. After the points are constructed, the patch information is constructed directly from the four-connected information of the filling image itself. Since the filling image is obtained by filling along the depth edges, no special handling is needed to avoid links at depth edges.
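Under the same sketch, the intermediate three-dimensional grid follows directly; fill_rgb and fill_depth stand for the filling image and its second depth image (hypothetical variable names):

```python
# No edge map is passed: the filling image is not cut at depth edges.
fill_vertices, fill_colors, fill_faces = build_mesh(fill_rgb, fill_depth, edge_pixels=None)
```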
S243: and merging the original three-dimensional grid and the intermediate three-dimensional grid to obtain the target three-dimensional grid.
In step S243, the original three-dimensional grid and the intermediate three-dimensional grid are merged, and the merged target three-dimensional grid is then used for image rendering.
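Merging the two grids can be understood as concatenating their vertex and color arrays and offsetting the face indices of the second grid by the vertex count of the first; a minimal sketch reusing the (vertices, colors, faces) layout assumed above:

```python
import numpy as np

def merge_meshes(mesh_a, mesh_b):
    """Merge two (vertices, colors, faces) meshes into one.

    Face indices of the second mesh are offset by the vertex count of the
    first so they keep pointing at the correct (now shifted) vertices.
    """
    va, ca, fa = mesh_a
    vb, cb, fb = mesh_b
    vertices = np.concatenate([va, vb], axis=0)
    colors = np.concatenate([ca, cb], axis=0)
    faces = np.concatenate([fa, fb + len(va)], axis=0)
    return vertices, colors, faces
```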
Specifically, rendering converts three-dimensional scene information into two-dimensional images; by setting different camera positions, the constructed grid can be imaged from different angles, which produces the dynamic-image effect.
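A deliberately simplified, non-limiting sketch of this rendering step is given below. It assumes depth values normalized to [0, 1] and uses a point-splat perspective projection with a z-buffer rather than the triangle rasterization a real renderer would perform; the dolly distances and focal length are illustrative assumptions.

```python
import numpy as np

def render_frames(vertices, colors, size, num_frames=30, d0=2.0, d1=1.0):
    """Render the merged mesh from a camera dollying from far (d0) to near (d1).

    vertices carry pixel (x, y) coordinates plus a depth z normalized to [0, 1];
    size is (height, width), typically the input image size.
    """
    h, w = size
    cx, cy = w / 2.0, h / 2.0
    f = d0 + 0.5                       # focal length: roughly unit scale at the first frame
    frames = []
    for d in np.linspace(d0, d1, num_frames):
        z = vertices[:, 2] + d         # depth relative to the moving camera
        scale = f / z                  # nearer points enlarge more (parallax)
        u = ((vertices[:, 0] - cx) * scale + cx).astype(int)
        v = ((vertices[:, 1] - cy) * scale + cy).astype(int)
        frame = np.zeros((h, w, 3), np.uint8)
        zbuf = np.full((h, w), np.inf)
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        for i in np.flatnonzero(inside):
            if z[i] < zbuf[v[i], u[i]]:          # keep the nearest point per pixel
                zbuf[v[i], u[i]] = z[i]
                frame[v[i], u[i]] = colors[i]
        frames.append(frame)
    return frames
```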
The difference between the two three-dimensional grids constructed in step S241 and step S242 is that one is constructed from the original image and its estimated depth, while the other is constructed from the restored (filled) image. The latter fills in the background regions occluded by foreground objects, so that previously unseen content becomes visible when the viewing angle changes and the generated dynamic image is more natural.
In addition to merging two three-dimensional grids as described above, other schemes are possible, for example computing a separate three-dimensional grid for each frame of the camera motion and filling the image frames along that motion, which is not particularly limited herein.
The method in the embodiment of the present application can be applied to various image-processing program products, such as editing software and special-effect software, and can be widely used in image- and video-related special-effect scenarios. Such a product takes a still image as input and outputs a video that makes the image dynamic.
Fig. 10 is a schematic logic diagram of the dynamic image generation method according to an embodiment of the present application, still taking the still image illustrated in fig. 3 as the image to be processed and taking implementation through a program product as an example. The image to be processed is the input image of the program product, and the finally obtained dynamic image is its output video. The processing is as follows: first, depth estimation is performed on the input image (which can be implemented with the monocular depth estimation network) to obtain image a (the output image of the monocular depth estimation network), and depth optimization is then performed on image a to obtain image b (i.e., the first depth image); the original three-dimensional grid can be constructed based on the image to be processed and the first depth image. Edge detection is further performed on image b to obtain the depth edges shown in image c; edge masking is then performed based on image c to obtain the binarized image shown in image d, i.e., the mask image. After the mask image is obtained, the corresponding positions in the image to be processed can be filled based on the determined mask region to obtain the filling image, and the intermediate three-dimensional grid can be constructed based on the filling image and its corresponding second depth image (not shown in fig. 10). Finally, based on the original three-dimensional grid and the intermediate three-dimensional grid, the grids can be rendered along a camera movement determined by different preset camera positions to obtain a dynamic image composed of image frames from different angles. In fig. 10 the rendering moves from far to near, i.e., the foreground and background in the image are gradually enlarged, similar to the effect of a camera gradually approaching the photographed scene.
In the above embodiment, depth estimation, image filling and image rendering are combined: the depth information of the image is obtained by a depth estimation algorithm, the image depth is edge-sharpened by edge-preserving filtering to obtain the depth edges, a filling mask (the depth-occlusion region) is obtained from the depth edges, the mask region is filled by an image filling algorithm, a three-dimensional grid is constructed from the images before and after filling, and the grid is rendered along the camera movement to obtain a video.
The method realizes effects such as large-scale three-dimensional camera movement and specific camera-view special effects such as the Hitchcock (dolly-zoom) effect. In terms of performance, a 1080P image can be processed in 3 seconds on a T4 GPU, which effectively shortens the image processing time; the method can be applied to various scenarios and improves the generation efficiency and universality of dynamic images. In addition, since image rendering is performed through a three-dimensional grid built from patches, no model-clipping (mesh penetration) artifacts are caused.
It should be noted that the dynamic image generation method in the embodiment of the present application produces a dynamic video from a static input; the effect is illustrated by several video frames in fig. 11 and fig. 12, which show video frames of two dynamic images generated by the dynamic image generation method in the embodiment of the present application.
In fig. 11, the leftmost image is the original still image (i.e., the image to be processed), and the three columns on the right are three video frames captured from the dynamic image generated by the dynamic image generation method in the embodiment of the present application; by comparison it can be seen that the relative position between the person and the branch above the head changes.
As shown in fig. 12, the leftmost image is the original still image (i.e., the image to be processed), and the three columns on the right are three video frames captured from the dynamic image generated by the dynamic image generation method in the embodiment of the present application; by comparison it can be seen that the relative positions between the cat, the basket and the plant change.
The following illustrates an implementation flow of the moving image generation method in the embodiment of the present application:
referring to fig. 13, which is a schematic flowchart of a dynamic image generating method according to an embodiment of the present application, taking a server as an execution body as an example, the specific implementation flow of the method is as follows:
step S1301: the server acquires a static image to be processed;
step S1302: the server carries out depth estimation on the image to be processed through a monocular depth estimation network to obtain an output image of the monocular depth estimation network;
step S1303: the server sequentially carries out average filtering and normalization processing on the output image to obtain an intermediate image;
step S1304: the server performs median filtering on the non-edge area in the intermediate image twice to obtain a first depth image corresponding to the image to be processed;
step S1305: the server acquires each edge pixel in the first depth image by carrying out edge detection on the first depth image;
step S1306: the server determines a mask area based on each edge pixel;
step S1307: the server fills the image to be processed based on the mask area to obtain a corresponding filled image;
step S1308: the server constructs an original three-dimensional grid based on the image to be processed and the first depth image;
step S1309: the server constructs an intermediate three-dimensional grid based on the filling image and a second depth image corresponding to the filling image;
step S1310: the server merges the original three-dimensional grid and the middle three-dimensional grid to obtain a target three-dimensional grid;
step S1311: and the server renders the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
It should be noted that the ordering of the steps in the flowchart illustrated in fig. 13 is merely illustrative; for example, step S1308 may be performed immediately after step S1304, or step S1308 and step S1305 may be performed simultaneously, which is not further described herein.
In conclusion, the dynamic image generation method is intuitive and universal, achieves fast processing, realizes effects such as the Hitchcock (dolly-zoom) special effect and three-dimensional camera movement, and supports various camera speed-change schemes. In addition, the method can be applied to various social-entertainment special-effect scenarios to increase entertainment value for users.
Based on the same inventive concept, the embodiment of the application also provides a dynamic image generating device. As shown in fig. 14, which is a schematic structural diagram of the moving image generating apparatus 1400, may include:
a depth acquiring unit 1401, configured to acquire a first depth image corresponding to an image to be processed, where the image to be processed is a still image;
a mask unit 1402, configured to obtain each edge pixel in the first depth image by performing edge detection on the first depth image, and determine a mask region in the image to be processed based on each edge pixel;
a filling unit 1403, configured to perform image filling on the mask area, so as to obtain a filled image corresponding to the image to be processed;
a generating unit 1404, configured to construct a target three-dimensional grid according to the image to be processed and the filling image, and render the target three-dimensional grid based on different preset camera positions, so as to obtain a dynamic image generated based on the image to be processed.
Optionally, the depth acquisition unit 1401 is specifically configured to:
performing depth estimation on the image to be processed through a depth estimation network to obtain an output image of the depth estimation network, where the pixel value of each pixel in the output image represents the depth of the corresponding pixel in the image to be processed relative to the camera; and performing any one of the following two operations (an illustrative sketch follows these options):
taking the output image as a first depth image corresponding to the image to be processed;
and carrying out depth adjustment on the output image to obtain a first depth image corresponding to the image to be processed.
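The application does not name a particular monocular depth estimation network. Purely as an illustrative assumption, the sketch below uses the publicly available MiDaS small model loaded through torch.hub as a stand-in for the depth estimation network; it returns a relative depth map that the subsequent depth adjustment can normalize.

```python
import cv2
import torch

def estimate_depth(image_bgr):
    """Monocular depth estimation with MiDaS (an illustrative stand-in for the
    depth estimation network); returns a relative depth map resized to the input."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    batch = transforms.small_transform(image_rgb).to(device)
    with torch.no_grad():
        prediction = midas(batch)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=image_rgb.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    # MiDaS outputs relative (inverse) depth; the later adjustment step normalizes it.
    return prediction.cpu().numpy()
```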
Optionally, the depth acquisition unit 1401 is specifically configured to:
sequentially carrying out average filtering and normalization processing on the output image to obtain an intermediate image;
and carrying out median filtering on the non-edge area in the intermediate image at least once to obtain a first depth image corresponding to the image to be processed.
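A minimal sketch of the mean filtering and normalization, with an assumed kernel size (the application does not fix one):

```python
import cv2
import numpy as np

def to_intermediate_image(output_image, ksize=5):
    """Mean-filter the depth network's output, then normalize it to [0, 1].

    The kernel size is an illustrative assumption.
    """
    blurred = cv2.blur(output_image.astype(np.float32), (ksize, ksize))
    lo, hi = float(blurred.min()), float(blurred.max())
    return (blurred - lo) / (hi - lo + 1e-8)  # intermediate image in [0, 1]
```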
Optionally, the depth acquisition unit 1401 is further configured to determine the non-edge region by:
comparing the difference between pixel values of every two adjacent pixels in the intermediate image with a preset depth difference value;
and taking a region formed by pixels with the difference of corresponding pixel values not exceeding a preset depth difference value in the intermediate image as a non-edge region.
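An illustrative sketch of this non-edge test, comparing each pixel with its four neighbours against an assumed preset depth difference:

```python
import numpy as np

def non_edge_mask(intermediate, depth_diff=0.04):
    """Return a boolean mask that is True where the depth difference to every
    4-neighbour stays within the preset depth difference (value assumed here)."""
    dx = np.abs(intermediate[:, 1:] - intermediate[:, :-1])   # horizontal neighbours
    dy = np.abs(intermediate[1:, :] - intermediate[:-1, :])   # vertical neighbours
    diff = np.zeros_like(intermediate)
    diff[:, 1:] = np.maximum(diff[:, 1:], dx)
    diff[:, :-1] = np.maximum(diff[:, :-1], dx)
    diff[1:, :] = np.maximum(diff[1:, :], dy)
    diff[:-1, :] = np.maximum(diff[:-1, :], dy)
    return diff <= depth_diff
```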
Optionally, the depth acquisition unit 1401 is specifically configured to:
carrying out weighted median filtering on a non-edge area in the intermediate image at least once to obtain a first depth image corresponding to the image to be processed, wherein in each weighted median filtering process, the following operations are respectively carried out on each pixel in the non-edge area:
determining, for one pixel, the weight of each pixel within a first detection window centered on that pixel, where the weight of each pixel in the window is determined based on the difference between its pixel value and the pixel value of the center pixel;
sequentially accumulating the weights according to the order from small to large, and taking the pixel value of the corresponding weight as a target pixel value when the accumulation result exceeds the average value of the weights;
the pixel value of a pixel is updated based on the target pixel value.
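An illustrative sketch of one weighted-median pass is shown below. The weights come from a Gaussian of the depth difference to the centre pixel, and the stopping rule "accumulated weight exceeds the average of the weights" is interpreted as reaching half of the total weight (the usual weighted-median criterion); both choices are assumptions, and the loop is deliberately unoptimized.

```python
import numpy as np

def weighted_median_pass(depth, non_edge, radius=3, sigma=0.1):
    """One pass of weighted median filtering over the non-edge pixels."""
    h, w = depth.shape
    out = depth.copy()
    for y, x in zip(*np.nonzero(non_edge)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        window = depth[y0:y1, x0:x1].ravel()                   # first detection window
        weights = np.exp(-((window - depth[y, x]) ** 2) / (2.0 * sigma ** 2))
        order = np.argsort(window)                             # sort values small to large
        cumulative = np.cumsum(weights[order])
        k = np.searchsorted(cumulative, cumulative[-1] / 2.0)  # first index past half the weight
        out[y, x] = window[order][k]                           # target pixel value
    return out
```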
Optionally, the depth acquisition unit 1401 is specifically configured to:
obtaining segmentation information of each object in an image to be processed;
and adjusting pixel values of pixels belonging to the same object in the output image based on the segmentation information of the objects to obtain a first depth image corresponding to the image to be processed.
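The application does not state how the per-object adjustment is computed; as one hedged possibility, the sketch below snaps every pixel of a segmented object to that object's median depth so a single object is not split by inconsistent estimates.

```python
import numpy as np

def adjust_depth_by_segmentation(output_image, segmentation):
    """Give pixels of the same segmented object a consistent depth (median here,
    as an assumption); segmentation holds one integer label per pixel."""
    depth = output_image.astype(np.float32).copy()
    for label in np.unique(segmentation):
        mask = segmentation == label
        depth[mask] = np.median(depth[mask])
    return depth
```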
Optionally, the masking unit 1402 is specifically configured to:
edge detection is carried out on the first depth image, edge information of each pixel in the first depth image is obtained, and the edge information of each pixel is a pixel value of a corresponding pixel determined by an edge detection operator;
and taking the pixels in the first depth image whose edge information is greater than a preset threshold as the edge pixels.
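The edge detection operator is not fixed by the application; as an illustrative assumption, the sketch below uses a Sobel gradient magnitude as the per-pixel edge information and thresholds it to obtain the edge pixels.

```python
import cv2
import numpy as np

def detect_edge_pixels(first_depth, threshold=0.1):
    """Return (edge_info, edge_pixels) for a [0, 1] depth image; the Sobel
    operator and the threshold value are illustrative assumptions."""
    gx = cv2.Sobel(first_depth, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(first_depth, cv2.CV_32F, 0, 1, ksize=3)
    edge_info = np.sqrt(gx ** 2 + gy ** 2)   # edge information per pixel
    return edge_info, edge_info > threshold  # boolean map of edge pixels
```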
Optionally, the masking unit 1402 is specifically configured to:
comparing the edge information of each pixel with the average edge depth in a second detection window which takes each edge pixel as the center, wherein the average edge depth is the average value of the edge information of each edge pixel;
and taking the region formed by the pixels in the image to be processed whose corresponding edge information is greater than the edge average depth as the mask region.
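An illustrative sketch of the mask determination: the edge average depth is computed over all edge pixels (a per-window mean is an equally plausible reading of the description), and within a second detection window around each edge pixel every pixel whose edge information exceeds it is marked as masked; the window radius is an assumption.

```python
import numpy as np

def mask_region(edge_info, edge_pixels, radius=7):
    """Return a boolean mask of the occluded region around the depth edges."""
    h, w = edge_info.shape
    edge_average_depth = edge_info[edge_pixels].mean()       # mean edge info of the edge pixels
    mask = np.zeros((h, w), dtype=bool)
    for y, x in zip(*np.nonzero(edge_pixels)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)  # second detection window
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        window = edge_info[y0:y1, x0:x1]
        mask[y0:y1, x0:x1] |= window > edge_average_depth
    return mask
```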
Optionally, the generating unit 1404 is specifically configured to:
constructing an original three-dimensional grid based on the image to be processed and the first depth image;
constructing an intermediate three-dimensional grid based on the filling image and a second depth image corresponding to the filling image;
and merging the original three-dimensional grid and the intermediate three-dimensional grid to obtain the target three-dimensional grid.
Optionally, the generating unit 1404 is specifically configured to:
constructing point information in an original three-dimensional grid based on the two-dimensional coordinates of each pixel in the image to be processed and the pixel value of each pixel in the first depth image;
and constructing the surface patch information in the original three-dimensional grid based on the four-connected information of the image to be processed, wherein the surface patch information represents the link relation between points in the original three-dimensional grid, and the points corresponding to the edge pixels in the original three-dimensional grid are not linked.
Optionally, the generating unit 1404 is specifically configured to:
constructing point information in the intermediate three-dimensional grid based on the two-dimensional coordinates of each pixel in the filling image and the pixel value of each pixel in the second depth image;
and constructing the patch information in the middle three-dimensional grid based on the four-connected information of the filling image, wherein the patch information characterizes the link relation between points in the middle three-dimensional grid.
The dynamic image generation method combines depth estimation, image filling, image rendering and other technologies: the depth information of the image is obtained by depth estimation, the depth edges are obtained from the image depth, a mask region of the image (the depth-occlusion region) is obtained from the depth edges, and the mask region is filled by an image filling method to obtain the corresponding filling image, i.e., the repaired image; finally, a target three-dimensional grid is constructed from the images before and after filling, and the grid is rendered along the camera movement to obtain the dynamic image. The method does not need to obtain a separate filling region for every edge and fill each edge one by one; it only combines the image to be processed and the filling image to construct the target three-dimensional grid, rather than approximating the three-dimensional representation of the image as a stack of multiple planes that would all have to be transformed. The processing time is therefore shorter, the method can be applied to various scenarios, and the generation efficiency and universality of dynamic images are effectively improved.
For convenience of description, the above components are described as being divided into modules (or units) by function. Of course, when implementing the present application, the functions of the modules (or units) may be implemented in one or more pieces of software or hardware.
Having described the moving image generation method and apparatus of the exemplary embodiments of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, a method, or a program product. Accordingly, aspects of the present application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to generally herein as a "circuit," "module," or "system."
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method. In one embodiment, the electronic device may be a server, such as server 120 shown in FIG. 1. In this embodiment, the structure of the electronic device may include a memory 1501, a communication module 1503, and one or more processors 1502 as shown in fig. 15.
A memory 1501 for storing computer programs executed by the processor 1502. The memory 1501 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1501 may be a volatile memory, such as a random-access memory (RAM); the memory 1501 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1501 may be any other medium that can carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1501 may also be a combination of the above memories.
The processor 1502 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. A processor 1502 for implementing the above-described moving image generation method when calling the computer program stored in the memory 1501.
The communication module 1503 is used for communicating with the terminal device and other servers.
The specific connection medium between the memory 1501, the communication module 1503 and the processor 1502 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 1501 and the processor 1502 are connected by the bus 1504 in fig. 15, where the bus 1504 is drawn with a bold line; the connection manner between other components is merely illustrative and not limiting. The bus 1504 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is drawn in fig. 15, but this does not mean that there is only one bus or only one type of bus.
The memory 1501, as a computer storage medium, stores computer-executable instructions for implementing the dynamic image generation method of the embodiments of the present application. The processor 1502 is configured to execute the dynamic image generation method described above and shown in fig. 2.
In another embodiment, the electronic device may also be other electronic devices, such as terminal device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may include, as shown in fig. 16: communication component 1610, memory 1620, display unit 1630, camera 1640, sensor 1650, audio circuitry 1660, bluetooth module 1670, processor 1680, and the like.
The communication component 1610 is used for communicating with the server. In some embodiments, it may include a wireless fidelity (Wireless Fidelity, WiFi) module; WiFi is a short-range wireless transmission technology, and the electronic device can help the user send and receive information through the WiFi module.
Memory 1620 may be used to store software programs and data. The processor 1680 performs various functions of the terminal device 110 and data processing by executing software programs or data stored in the memory 1620. The memory 1620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The memory 1620 stores an operating system that enables the terminal device 110 to operate. The memory 1620 in the present application may store an operating system and various application programs, and may also store a computer program for executing the moving image generating method according to the embodiment of the present application.
The display unit 1630 may also be used to display information input by a user or information provided to the user and a graphical user interface (graphical user interface, GUI) of various menus of the terminal device 110. Specifically, the display unit 1630 may include a display screen 1632 disposed on the front side of the terminal device 110. The display 1632 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 1630 may be used to display still images, moving images, and the like in the embodiments of the present application.
The display unit 1630 may also be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the terminal device 110. Specifically, the display unit 1630 may include a touch screen 1631 disposed on the front of the terminal device 110, which can collect the user's touch operations on or near it, such as clicking buttons and dragging scroll boxes.
The touch screen 1631 may cover the display screen 1632, or the touch screen 1631 and the display screen 1632 may be integrated to implement input and output functions of the terminal device 110, and after integration, the touch screen may be abbreviated as touch screen. The display unit 1630 may display application programs and corresponding operation steps.
The camera 1640 may be used to capture still images, and the user may post images captured by the camera 1640 through an application. The camera 1640 may be a single camera or a plurality of cameras. Light from the photographed object passes through the lens and forms an optical image that is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (charge coupled device, CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 1680 and converted into a digital image signal.
The terminal device may further include at least one sensor 1650, such as an acceleration sensor 1651, a distance sensor 1652, a fingerprint sensor 1653, a temperature sensor 1654. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
Audio circuitry 1660, speakers 1661, and microphone 1662 may provide an audio interface between the user and the terminal device 110. The audio circuit 1660 may transmit the received electrical signal converted from audio data to the speaker 1661, and convert the electrical signal into an audio signal by the speaker 1661 to be output. The terminal device 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1662 converts the collected sound signals into electrical signals, which are received by the audio circuit 1660 and converted into audio data, which are output to the communication component 1610 for transmission to, for example, another terminal device 110, or to the memory 1620 for further processing.
The bluetooth module 1670 is used to exchange information with other bluetooth devices having bluetooth modules through bluetooth protocols. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that also has a bluetooth module through bluetooth module 1670, thereby performing data interaction.
The processor 1680 is a control center of the terminal device, connects various parts of the entire terminal using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1620 and calling data stored in the memory 1620. In some embodiments, the processor 1680 may include one or more processing units; the processor 1680 may also integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., and a baseband processor that primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 1680. The processor 1680 in the present application may run an operating system, an application program, a user interface display, and a touch response, and a dynamic image generation method of the embodiments of the present application. In addition, a processor 1680 is coupled to the display unit 1630.
In some possible embodiments, aspects of the dynamic image generating method provided herein may also be implemented in the form of a program product comprising a computer program for causing an electronic device to perform the steps in the dynamic image generating method according to the various exemplary embodiments of the present application described herein above when the program product is run on the electronic device, e.g. the electronic device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and comprise a computer program and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (15)

1. A dynamic image generating method, characterized in that the method comprises:
Acquiring a first depth image corresponding to an image to be processed, wherein the image to be processed is a static image;
obtaining each edge pixel in the first depth image by carrying out edge detection on the first depth image, and determining a mask region in the image to be processed based on each edge pixel;
image filling is carried out on the mask region, and a filling image corresponding to the image to be processed is obtained;
and constructing a target three-dimensional grid according to the image to be processed and the filling image, and rendering the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
2. The method of claim 1, wherein the acquiring a first depth image corresponding to the image to be processed comprises:
performing depth estimation on the image to be processed through a depth estimation network to obtain an output image of the depth estimation network, wherein the pixel value of each pixel in the output image represents the depth information of the corresponding pixel in the image to be processed relative to a camera; and performs any one of the following operations:
taking the output image as a first depth image corresponding to the image to be processed;
And performing depth adjustment on the output image to obtain a first depth image corresponding to the image to be processed.
3. The method of claim 2, wherein performing depth adjustment on the output image to obtain a first depth image corresponding to the image to be processed comprises:
sequentially carrying out average filtering and normalization processing on the output image to obtain an intermediate image;
and carrying out at least one median filtering on a non-edge area in the intermediate image to obtain a first depth image corresponding to the image to be processed.
4. A method according to claim 3, wherein the non-edge region is determined by:
comparing the difference between pixel values of every two adjacent pixels in the intermediate image with a preset depth difference value;
and taking an area formed by pixels with the difference of corresponding pixel values not exceeding the preset depth difference value in the intermediate image as the non-edge area.
5. A method according to claim 3, wherein the median filtering is performed at least once on the non-edge region in the intermediate image to obtain the first depth image corresponding to the image to be processed, and the method comprises:
Carrying out weighted median filtering on a non-edge area in the intermediate image at least once to obtain a first depth image corresponding to the image to be processed, wherein in each weighted median filtering process, the following operations are respectively carried out on each pixel in the non-edge area:
determining, for a pixel, a weight for each pixel within a first detection window centered on the pixel, wherein the weight for each pixel is determined based on a difference in pixel values of the pixel and the one pixel;
sequentially accumulating the weights according to the order from small to large, and taking the pixel value of the corresponding weight as a target pixel value when the accumulation result exceeds the average value of the weights;
updating the pixel value of the one pixel based on the target pixel value.
6. The method of claim 2, wherein performing depth adjustment on the output image to obtain a first depth image corresponding to the image to be processed comprises:
obtaining segmentation information of each object in the image to be processed;
and adjusting pixel values of pixels belonging to the same object in the output image based on the segmentation information of each object to obtain a first depth image corresponding to the image to be processed.
7. The method according to any one of claims 1 to 6, wherein the obtaining each edge pixel in the first depth image by performing edge detection on the first depth image includes:
performing edge detection on the first depth image to obtain edge information of each pixel in the first depth image, wherein the edge information of each pixel is a pixel value of a corresponding pixel determined by an edge detection operator;
and taking the pixels, corresponding to the edge information of which is larger than a preset threshold value, in the first depth image as edge pixels.
8. The method of claim 7, wherein determining mask regions in the image to be processed based on the respective edge pixels comprises:
comparing the edge information of each pixel with an edge average depth in a second detection window which takes each edge pixel as a center, wherein the edge average depth is the average value of the edge information of each edge pixel;
and taking a region formed by pixels with corresponding edge information larger than the average edge depth in the image to be processed as the mask region.
9. The method of any one of claims 1 to 6, wherein said constructing a target three-dimensional grid from said image to be processed and said fill image comprises:
Constructing an original three-dimensional grid based on the image to be processed and the first depth image;
constructing an intermediate three-dimensional grid based on the filling image and a second depth image corresponding to the filling image;
and merging the original three-dimensional grid and the intermediate three-dimensional grid to obtain the target three-dimensional grid.
10. The method of claim 9, wherein the constructing an original three-dimensional grid based on the image to be processed and the first depth image comprises:
constructing point information in the original three-dimensional grid based on the two-dimensional coordinates of each pixel in the image to be processed and the pixel value of each pixel in the first depth image;
and constructing the surface patch information in the original three-dimensional grid based on the four-connected information of the image to be processed, wherein the surface patch information represents the link relation between points in the original three-dimensional grid, and the points corresponding to the edge pixels in the original three-dimensional grid are not linked.
11. The method of claim 9, wherein constructing an intermediate three-dimensional grid based on the fill image and a second depth image corresponding to the fill image comprises:
Constructing point information in the intermediate three-dimensional grid based on the two-dimensional coordinates of each pixel in the filling image and the pixel value of each pixel in the second depth image;
and constructing the patch information in the middle three-dimensional grid based on the four-connected information of the filling image, wherein the patch information represents the link relation between points in the middle three-dimensional grid.
12. A moving image generating apparatus, comprising:
the depth acquisition unit is used for acquiring a first depth image corresponding to an image to be processed, wherein the image to be processed is a static image;
a mask unit, configured to obtain each edge pixel in the first depth image by performing edge detection on the first depth image, and determine a mask region in the image to be processed based on each edge pixel;
the filling unit is used for carrying out image filling on the mask region to obtain a filling image corresponding to the image to be processed;
the generating unit is used for constructing a target three-dimensional grid according to the image to be processed and the filling image, and rendering the target three-dimensional grid based on different preset camera positions to obtain a dynamic image generated based on the image to be processed.
13. An electronic device comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 11.
14. A computer readable storage medium, characterized in that it comprises a computer program for causing an electronic device to perform the steps of the method according to any one of claims 1-11 when said computer program is run on the electronic device.
15. A computer program product comprising a computer program, the computer program being stored on a computer readable storage medium; when the computer program is read from the computer readable storage medium by a processor of an electronic device, the processor executes the computer program, causing the electronic device to perform the steps of the method of any one of claims 1-11.
CN202210943580.2A 2022-08-08 2022-08-08 Dynamic image generation method and device, electronic equipment and storage medium Pending CN117576338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210943580.2A CN117576338A (en) 2022-08-08 2022-08-08 Dynamic image generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210943580.2A CN117576338A (en) 2022-08-08 2022-08-08 Dynamic image generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117576338A true CN117576338A (en) 2024-02-20

Family

ID=89884994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210943580.2A Pending CN117576338A (en) 2022-08-08 2022-08-08 Dynamic image generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117576338A (en)

Similar Documents

Publication Publication Date Title
KR102574141B1 (en) Image display method and device
CN108898567B (en) Image noise reduction method, device and system
US11526995B2 (en) Robust use of semantic segmentation for depth and disparity estimation
WO2022042049A1 (en) Image fusion method, and training method and apparatus for image fusion model
CA2745380C (en) Devices and methods for processing images using scale space
CN113066017B (en) Image enhancement method, model training method and equipment
US20110148868A1 (en) Apparatus and method for reconstructing three-dimensional face avatar through stereo vision and face detection
CN111402146A (en) Image processing method and image processing apparatus
KR20230084486A (en) Segmentation for Image Effects
Xiao et al. Single image dehazing based on learning of haze layers
KR20200135102A (en) Image processing apparatus and image processing method thereof
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
CN109816694A (en) Method for tracking target, device and electronic equipment
CN109214996A (en) A kind of image processing method and device
CN114627034A (en) Image enhancement method, training method of image enhancement model and related equipment
KR102262671B1 (en) Method and storage medium for applying bokeh effect to video images
WO2022233252A1 (en) Image processing method and apparatus, and computer device and storage medium
BR102020027013A2 (en) METHOD TO GENERATE AN ADAPTIVE MULTIPLANE IMAGE FROM A SINGLE HIGH RESOLUTION IMAGE
CN114170290A (en) Image processing method and related equipment
WO2022021287A1 (en) Data enhancement method and training method for instance segmentation model, and related apparatus
WO2018029399A1 (en) Apparatus, method, and computer program code for producing composite image
US20220398704A1 (en) Intelligent Portrait Photography Enhancement System
US20230131418A1 (en) Two-dimensional (2d) feature database generation
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
CN117576338A (en) Dynamic image generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination