CN117354480A - Video generation method, device, equipment and storage medium - Google Patents

Video generation method, device, equipment and storage medium

Info

Publication number
CN117354480A
CN117354480A (application CN202311245123.7A)
Authority
CN
China
Prior art keywords
pixels
pixel
edge
background
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311245123.7A
Other languages
Chinese (zh)
Inventor
袁苇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311245123.7A priority Critical patent/CN117354480A/en
Publication of CN117354480A publication Critical patent/CN117354480A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/275Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video generation method, device, equipment and storage medium, relating to the technical field of artificial intelligence, in particular to the technical fields of computer vision, virtual reality, deep learning and the like, and applicable to scenes such as artificial-intelligence-based content generation and the metaverse. The specific implementation scheme is as follows: depth estimation is performed on each pixel in an original image to obtain the depth of each pixel, and edge recognition is then performed according to these depths to obtain foreground edge pixels and background edge pixels. From the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels are generated. Three-dimensional modeling is then performed according to the pixels in the original image, the foreground extension pixels and the background extension pixels, and a video is generated based on the three-dimensional model obtained by modeling.

Description

Video generation method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, virtual reality, deep learning and the like, is applicable to scenes such as artificial-intelligence-based content generation and the metaverse, and in particular relates to a video generation method, a video generation device, video generation equipment and a storage medium.
Background
An image-guided video generation method can generate a segment of dynamic video from a static image. The still images used to generate video may cover a wide range of real-scene images as well as stylized generated images.
The video generation method can be used in an image-text-to-video system, for example: an image is generated from text information through an image generation model, and a video corresponding to the image is then produced through three-dimensional camera movement and the like.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a video generating method including:
performing depth estimation on each pixel in an original image to obtain the depth of each pixel in the original image;
performing edge recognition according to the depth of each pixel in the original image to obtain a foreground edge pixel and a background edge pixel;
generating, according to the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels;
performing three-dimensional modeling according to pixels in the original image, the foreground extension pixels and the background extension pixels;
and generating a video based on the three-dimensional model obtained by modeling.
According to another aspect of the present disclosure, there is provided a video generating apparatus including:
the estimating module is used for estimating the depth of each pixel in the original image to obtain the depth of each pixel in the original image;
the identification module is used for carrying out edge identification according to the depth of each pixel in the original image to obtain a foreground edge pixel and a background edge pixel;
an expansion module, configured to generate, based on the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels;
the modeling module is used for carrying out three-dimensional modeling according to the pixels in the original image, the foreground extension pixels and the background extension pixels;
and the generation module is used for generating the video based on the three-dimensional model obtained by modeling.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in embodiments of the first aspect of the present disclosure.
According to the video generation method, apparatus, device and storage medium of the present disclosure, depth estimation is performed on each pixel in the original image to obtain the depth of each pixel, and edge recognition is then performed according to these depths to obtain foreground edge pixels and background edge pixels. From the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels are generated. Three-dimensional modeling is then performed according to the pixels in the original image, the foreground extension pixels and the background extension pixels, and a video is generated based on the three-dimensional model obtained by modeling.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a video generating method according to an embodiment of the disclosure;
fig. 2 is a flowchart of another video generating method according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of foreground-background edge partitioning;
FIG. 4 is a flow chart of one possible video generation method;
fig. 5 is a schematic structural diagram of a video generating apparatus 500 according to an embodiment of the disclosure;
fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, a layered depth image (LDI) representation is used: monocular depth information is estimated for an input image, and pixels are clustered into different levels according to their depth values. For the missing pixels in each level, surrounding pixels are diffused to the missing locations by heuristic methods. A three-dimensional scene is thereby constructed and rendered along the camera movement trajectory to obtain a video.
However, the related art needs to divide a plurality of levels and predict the missing pixels for each level, so the processing speed is low and the calculation amount is large.
To address this low efficiency of the related art, in the present method a single real-scene image or an AI-generated stylized image is taken as the original image input. Monocular depth estimation is first performed on the original image; edge pixels, including foreground edge pixels and background edge pixels, are then extracted based on the estimated depth; corresponding regions, namely a region of foreground extension pixels and a region of background extension pixels, are diffused from the foreground edge pixels and the background edge pixels; RGB values are drawn for the expanded regions; a three-dimensional model is generated based on them; and a video is generated through dynamic camera movement.
In this method, the original image is divided into only a foreground layer and a background layer, and RGB values are drawn only for the pixels expanded from the edges of these two layers, so the whole flow needs only two model predictions, one for drawing (completion) and one for depth estimation. Compared with existing methods, the processing speed is greatly improved and the calculation amount is reduced.
In addition, the foreground edge pixels and the background edge pixels are identified based on depth, so the application range is wider than that of edge pixel identification based on semantic segmentation, and the problem that the precision and recall of a semantic segmentation model are low in untrained scenes is avoided.
Fig. 1 is a flowchart of a video generating method according to an embodiment of the present disclosure; the method provided by this embodiment may be executed by a video generating apparatus. As shown in fig. 1, the method includes:
and step 101, estimating the depth of each pixel in the original image to obtain the depth of each pixel in the original image.
The original image may be machine-generated, or may be manually drawn, which is not limited in this embodiment.
For example: the method is applied to an image-text-to-video system; paragraphs of lyrics or other constraint conditions are taken as input, an image is generated through an image generation model, the generated image is taken as the original image, and a video is then produced through the video generation method in this embodiment. Videos generated from multiple images can further be spliced into a long video such as a music video (MV).
As another example: the method is applied to cover stories; pictures recommended daily from a user album are used as original images, and the video generation method in this embodiment is executed to generate corresponding videos so that the album home page can be displayed dynamically.
As yet another example: the method is applied to the generation of e-commerce live-broadcast posters; a single-frame poster is used as the original image, and a camera movement trajectory can be designed so that the camera coordinates of the last frame are identical to those of the first frame, producing a video that can be looped as a background during an e-commerce live broadcast.
Optionally, the original image presents objects comprising both a foreground and a background, with a certain depth difference between the foreground objects and the background objects.
Under the condition that each pixel point in the original image does not carry a depth value, the depth estimation of the original image can be carried out in a monocular vision mode.
Under the condition that each pixel point in the original image carries a depth value, the depth value carried by each pixel point can be directly adopted. Or fusing the carried depth value with the depth value obtained by carrying out depth estimation on the original image in a monocular vision mode. In this embodiment, a specific depth estimation method for obtaining the depth of each pixel is not limited. Note that, the pixels mentioned in this embodiment and the following embodiments may be single pixel points or pixel units formed by combining a plurality of pixel points.
Step 102, performing edge recognition according to the depth of each pixel in the original image to obtain foreground edge pixels and background edge pixels.
The foreground edge pixels refer to pixels belonging to the edge of a foreground object.
Similarly, a background edge pixel refers to a pixel belonging to an edge of a background object.
Optionally, edge recognition is performed on the original image based on variations in pixel depth, since foreground objects and background objects are typically located on different depth layers and therefore exhibit depth differences. From this, the foreground edge pixels and the background edge pixels can be identified: a foreground edge pixel has a depth difference from its neighboring background edge pixels but no large depth difference from its neighboring foreground pixels; similarly, a background edge pixel has no large depth difference from its neighboring background pixels but a depth difference from its neighboring foreground edge pixels.
Step 103, generating, according to the original image, foreground extension pixels extending outward from the foreground edge pixels, and background extension pixels extending outward from the background edge pixels.
The foreground extension pixels are pixels located at the periphery of the foreground that are invisible in the original image due to the viewing angle. The background extension pixels are pixels located at the periphery of the background that are invisible in the original image due to occlusion under the viewing angle of the original image.
Optionally, the pixel values (specifically, RGB values) of the pixels that are invisible in the original image due to the viewing angle are predicted from the original image, and the background pixels that are invisible due to foreground occlusion are likewise predicted from the original image.
Step 104, performing three-dimensional modeling according to the pixels in the original image, the foreground extension pixels and the background extension pixels.
Optionally, the model used for three-dimensional modeling is a three-dimensional mesh model composed of triangular patches; other models may also be used, which is not limited in this embodiment.
Step 105, generating a video based on the three-dimensional model obtained by modeling.
Optionally, a plurality of camera positions are located in the three-dimensional space where the three-dimensional model is located based on a camera movement trajectory; an imaging map of the three-dimensional model is determined at the viewing angle of each camera position; and the imaging maps are arranged according to the moments at which the camera occupies each camera position along the camera movement trajectory, to obtain the generated video.
In this embodiment, after depth estimation is performed on each pixel in the original image to obtain the depth of each pixel, edge recognition is performed according to these depths to obtain foreground edge pixels and background edge pixels. From the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels are generated. Three-dimensional modeling is then performed according to the pixels in the original image, the foreground extension pixels and the background extension pixels, and a video is generated based on the three-dimensional model obtained by modeling. Since edge recognition only needs to divide the pixels of the original image into two classes based on depth, namely foreground edge pixels and background edge pixels, the number of levels is reduced compared with the related LDI technology. In addition, in the present disclosure, only the foreground extension pixels extending outward from the foreground edge pixels and the background extension pixels extending outward from the background edge pixels need to be completed according to the original image; compared with the related art, which must complete the missing pixels of every level, the calculation amount is reduced and the efficiency of video generation is improved.
Three-dimensional (3D) camera movement, as an image-guided video generation method, can cover a wide range of real-scene images and stylized generated images, converting a static image into a dynamic camera-movement video of arbitrary length. To clearly illustrate the video generation process in the 3D camera movement scene, another video generation method is provided. Fig. 2 is a flowchart of another video generation method provided by an embodiment of the present disclosure; the method provided by this embodiment may be executed by a video generation device. As shown in fig. 2, the method includes:
step 201, an original image is input.
Step 202, performing depth estimation on each pixel in the original image to obtain the depth of each pixel in the original image.
Alternatively, the depth estimation may use a depth model, which may estimate the depth of the pixel based on a monocular depth estimation technique.
For an input original image I, a depth model is adopted to perform depth estimation and generate a single-channel depth map D. The value of each pixel in the depth map D indicates its depth, i.e., a depth value representing the Z-axis position, in the world coordinate system, of the object displayed at that pixel. A larger depth value indicates that the pixel is closer to the camera and more likely to be a foreground pixel, and conversely a smaller value indicates a background pixel.
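A minimal sketch of this step is given below; the `estimate_depth` placeholder stands in for whatever monocular depth network is used (the disclosure does not name a specific model), and only fixes the shape and convention of the depth map D:

```python
import numpy as np

def estimate_depth(image: np.ndarray) -> np.ndarray:
    """Placeholder for a monocular depth model (an assumption; any learned
    depth estimator could be plugged in here).

    Returns a single-channel float map D of the same spatial size as the
    input, where larger values mean the pixel is closer to the camera
    (more likely foreground) under the convention used in this description.
    """
    h, w = image.shape[:2]
    # A real system would run a learned depth network here.
    return np.zeros((h, w), dtype=np.float32)

# Usage: I is the original H x W x 3 image, D the per-pixel depth map.
I = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy original image
D = estimate_depth(I)
```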
Step 203, performing depth edge extraction to extract foreground edge pixels and background edge pixels from the original image based on depth.
Optionally, an edge map is determined according to the depth of each pixel in the original image by adopting an edge detection algorithm with an adaptive threshold, where the pixel value of each pixel in the edge map indicates whether the corresponding pixel in the original image is an edge pixel; for adjacent edge pixels in the edge map, the corresponding pixels in the original image are divided into foreground edge pixels and background edge pixels according to their depth difference. Since foreground-background division is needed only for the edge portion, the efficiency of foreground-background recognition is improved to a certain extent.
As a possible implementation, adaptive-threshold Canny edge detection is performed on the depth map D to extract depth edges and obtain an edge map C. The edge map C is a binary map: a pixel value greater than zero in C identifies a depth discontinuity at that pixel location, possibly an edge between the foreground and the background, from which foreground edge pixels and background edge pixels are extracted; a pixel value equal to zero identifies a location that does not belong to an edge.
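The adaptive-threshold Canny step could look roughly like the sketch below; the median-based threshold heuristic and the OpenCV calls are assumptions, since the text only states that an adaptive-threshold edge detector is applied to D:

```python
import cv2
import numpy as np

def depth_edge_map(D: np.ndarray) -> np.ndarray:
    """Binary edge map C from depth map D via Canny with adaptive thresholds.

    The median-based threshold rule is an assumed heuristic; the disclosure
    only specifies that the thresholds are adaptive.
    """
    d8 = cv2.normalize(D, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    med = float(np.median(d8))
    lo = int(max(0, 0.66 * med))
    hi = int(min(255, 1.33 * med))
    C = cv2.Canny(d8, lo, hi)          # non-zero where depth is discontinuous
    return (C > 0).astype(np.uint8)    # 1 = candidate edge pixel, 0 = non-edge
```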
Further, several possible correction procedures may be performed before and after extracting the foreground edge pixels and the background edge pixels in step 203:
as a first possible modification, after determining an edge map according to the depth of each pixel in the original image by using an edge detection algorithm with an adaptive threshold, before dividing the pixels in the original image into a foreground edge pixel and a background edge pixel, a plurality of connected regions may be determined in the edge map according to the pixel values of each pixel in the edge map, where the pixel values of each pixel in the connected regions indicate pixels belonging to an edge. And removing the connected areas with the number of the pixel points smaller than the number threshold value from the connected areas so as to form the connected areas as large as possible, and avoiding interference caused by false recognition and influencing visual effect.
For example: 8-neighborhood connected regions are identified and the number of pixels in each connected region is counted; for connected regions whose pixel count is smaller than a number threshold a (for example, a is configurable in the range 8-12, with a typical value of 10), the pixel values of those regions in the edge map C are set to 0.
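A sketch of this small-region filtering, assuming OpenCV connected-component analysis (the threshold a = 10 follows the typical value above):

```python
import cv2
import numpy as np

def remove_small_components(C: np.ndarray, a: int = 10) -> np.ndarray:
    """Zero out 8-connected edge regions containing fewer than `a` pixels."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(C, connectivity=8)
    out = C.copy()
    for lab in range(1, n):  # label 0 is the non-edge background
        if stats[lab, cv2.CC_STAT_AREA] < a:
            out[labels == lab] = 0
    return out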
As a second possible modification, after the edge map is determined according to the depth of each pixel in the original image by the adaptive-threshold edge detection algorithm, and before the pixels in the original image are divided into foreground edge pixels and background edge pixels, a plurality of connected regions are determined in the edge map according to the pixel values of its pixels, where the pixel values within a connected region indicate pixels belonging to an edge; two connected regions whose distance is smaller than a distance threshold are then merged, so as to form connected regions as large as possible and avoid visual artifacts caused by false recognition.
For example: a pixel set {p_e} is counted, consisting of pixels for which exactly one pixel with a value greater than 0 lies in their 8-neighborhood; these are the endpoint pixels of all edge lines in the edge map C. If the distance between two endpoint pixels in {p_e} is smaller than a distance threshold b (for example, b is configurable in the range of 15-25 pixel distances, with a typical value of 20) and the two endpoint pixels do not belong to the same connected region, a straight line is drawn in the edge map C with the two points as endpoints, i.e., the two endpoint pixels are connected and their regions are merged into the same connected region, joining adjacent line segments into one. This effectively alleviates the problem of discontinuous depth edges produced by edge detection.
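An illustrative sketch of this endpoint-bridging step is given below; the Euclidean distance measure and the OpenCV line drawing are assumptions about details the text leaves open:

```python
import cv2
import numpy as np

def bridge_endpoints(C: np.ndarray, b: int = 20) -> np.ndarray:
    """Connect nearby line endpoints belonging to different connected regions.

    An endpoint is an edge pixel with exactly one edge pixel among its 8
    neighbours. Endpoints from different regions closer than `b` pixels are
    joined by a straight line, merging the regions.
    """
    out = C.copy()
    _, labels = cv2.connectedComponents(C, connectivity=8)
    ys, xs = np.nonzero(C)
    endpoints = []
    for y, x in zip(ys, xs):
        patch = C[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
        if patch.sum() - 1 == 1:               # exactly one 8-neighbour on the edge
            endpoints.append((y, x))
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            (y1, x1), (y2, x2) = endpoints[i], endpoints[j]
            close = np.hypot(y1 - y2, x1 - x2) < b
            if close and labels[y1, x1] != labels[y2, x2]:
                cv2.line(out, (x1, y1), (x2, y2), 1, thickness=1)
    return out
```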
It should be noted that the first possible modification described above and/or the second possible modification may be performed with respect to the edge map C. The edge map after correction is taken as an edge map C1, and the edge map C1 is also a binary map similar to the edge map C.
The foreground edge pixels and the background edge pixels are divided based on the edge map C1.
Optionally, the division proceeds for each pixel p with a pixel value greater than 0 in the edge map C1. Fig. 3 is a schematic diagram of foreground-background edge partitioning. As shown in fig. 3, taking the coordinates of pixel p as (0, 0), its 4-neighborhood pixels are q ∈ {p+(-1,0), p+(0,+1), p+(+1,0), p+(0,-1)}. On the extension line from p toward q, the pixel adjacent to q is q2 = q + Δi, and the pixel adjacent to p on the opposite side is p2 = p - Δi, where Δi is the offset from p to the corresponding q. If D[p] + D[p2] > D[q] + D[q2], then p is a foreground edge pixel and q is a background edge pixel; p is added to the foreground edge pixel set N and q is added to the background edge pixel set F. Fig. 3 takes one value of Δi as an example.
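The comparison can be sketched as follows, interpreting (p + p2) > (q + q2) as a comparison of summed depth values D[p] + D[p2] versus D[q] + D[q2]:

```python
import numpy as np

def split_edge_pixels(C1: np.ndarray, D: np.ndarray):
    """Split edge pixels of C1 into foreground (N) and background (F) sets.

    For an edge pixel p and each 4-neighbour q, compare the summed depths of
    (p, p2) and (q, q2), where p2 and q2 extend the p->q direction outward on
    each side. Larger depth means closer to the camera here, so the side with
    the larger sum is the foreground edge.
    """
    H, W = C1.shape
    N, F = set(), set()
    offsets = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    ys, xs = np.nonzero(C1)
    for y, x in zip(ys, xs):
        for dy, dx in offsets:
            qy, qx = y + dy, x + dx            # neighbour q
            q2y, q2x = qy + dy, qx + dx        # one step beyond q
            p2y, p2x = y - dy, x - dx          # one step behind p
            if not (0 <= q2y < H and 0 <= q2x < W and 0 <= p2y < H and 0 <= p2x < W):
                continue
            if D[y, x] + D[p2y, p2x] > D[qy, qx] + D[q2y, q2x]:
                N.add((y, x))                  # p is a foreground edge pixel
                F.add((qy, qx))                # q is a background edge pixel
    return N, F
```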
As a third possible modification, first target pixels serving as foreground edge endpoints are identified from the set of foreground edge pixels; paths connecting any two first target pixels through pixels in the set of foreground edge pixels are determined; and pixels in the set of foreground edge pixels through which no path passes are deleted as redundant pixels.
For example: after the foreground edge pixel set N and the background edge pixel set F are determined, a pixel set NE is counted, consisting of pixels in N whose 8-neighborhood in the edge map C1 contains exactly one pixel with a value greater than 0; these are the endpoint pixels of all lines in N. Taking the pixels in NE as starting points, all pixels in N are traversed breadth-first over 4-neighborhoods to obtain all possible paths {P} from each pixel in NE to the other pixels in NE. Only the points of N lying on the paths {P} are kept, and points not on any path are deleted from N, thereby removing redundant foreground edge pixel points.
As a fourth possible modification, for any one of the background edge pixels, a plurality of second target pixels in its neighborhood are queried; the second target pixel of minimum depth is determined from the plurality of second target pixels; and in the case where the second target pixel of minimum depth does not belong to the foreground edge pixels, it is added as a background edge pixel to the set of background edge pixels, to correct disturbances caused by misrecognition.
For example: for the pixels in N, pixels in its 4-prime domain are calculated, from which the pixels that are not in the N set and have the smallest depth (i.e., the depth values recorded in depth map D) are determined and added to the F set.
As a fifth possible modification, for any one of the background edge pixels, it is queried whether there is an adjacent background edge pixel; in the absence of an adjacent background edge pixel, the background edge pixel is deleted, to correct isolated pixels caused by misrecognition. For example: pixels in the F set that are not adjacent to any other pixel in F are filtered out.
As a sixth possible modification, contour pixels on the periphery of the edge pixels are determined in the edge map, and the depth of the surrounded edge pixels is replaced with the depth corresponding to the contour pixels, so as to ensure an obvious depth discontinuity between the foreground edge pixels and the background edge pixels. Optionally, to determine the contour pixels on the periphery of the edge pixels, the edge pixels in the edge map may be dilated to obtain a first expansion map; the edge pixels in the first expansion map are dilated again to obtain a second expansion map; and the edge pixels overlapping the first expansion map are removed from the second expansion map, the remaining edge pixels being taken as the contour pixels.
For example: the edge map C1 is dilated three times to obtain Cd, and Cd is dilated one more time to obtain Cd2; Cc is obtained as Cd2 - Cd, i.e., a binary map of the contour pixels surrounding Cd. For every pixel with a value greater than 0 in Cc, adjacent pixels are traversed depth-first over 4-neighborhoods, and the depth value of the first traversed pixel belonging to the N or F set is replaced with the depth value at the position of the initial pixel (the starting pixel in Cc), thereby updating the depth of the edge pixels.
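A sketch of the dilation-based contour extraction, assuming a 3x3 structuring element and taking the contour band as the second expansion map minus the first, as described in the sixth modification; the depth-replacement traversal is omitted:

```python
import cv2
import numpy as np

def contour_of_edges(C1: np.ndarray):
    """Contour pixels surrounding the dilated edge band.

    Kernel size and iteration counts follow the example above; the subtraction
    (second expansion map minus first) is an interpretation of the sixth
    modification and isolates the outer contour ring.
    """
    kernel = np.ones((3, 3), np.uint8)
    Cd = cv2.dilate(C1, kernel, iterations=3)   # first expansion map
    Cd2 = cv2.dilate(Cd, kernel, iterations=1)  # second expansion map
    Cc = cv2.subtract(Cd2, Cd)                  # keep only the outer contour ring
    return Cd, Cd2, Cc
```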
As a seventh possible modification, the depth of a foreground edge pixel is increased according to the depths of its neighborhood pixels, and the depth of a background edge pixel is reduced according to the depths of its neighborhood pixels, ensuring an obvious depth discontinuity between the foreground edge pixels and the background edge pixels.
For example: for each pixel p in N, the depth value of pixel p is replaced with the maximum depth value of 9 neighborhood pixels including itself.
Also for example: for each pixel p in F, the depth value of pixel p is replaced with the minimum depth value of 9 neighborhood pixels including itself.
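These two replacements amount to a 3x3 maximum filter on the foreground edge pixels and a 3x3 minimum filter on the background edge pixels, which can be sketched with morphological dilation and erosion of the depth map (an implementation assumption):

```python
import cv2
import numpy as np

def sharpen_edge_depths(D: np.ndarray, N: set, F: set) -> np.ndarray:
    """Raise foreground-edge depths and lower background-edge depths.

    Each foreground edge pixel takes the maximum depth of its 3x3 neighbourhood
    (grayscale dilation of D); each background edge pixel takes the minimum
    (grayscale erosion).
    """
    kernel = np.ones((3, 3), np.uint8)
    d_max = cv2.dilate(D, kernel)   # per-pixel 3x3 maximum
    d_min = cv2.erode(D, kernel)    # per-pixel 3x3 minimum
    out = D.copy()
    for (y, x) in N:
        out[y, x] = d_max[y, x]
    for (y, x) in F:
        out[y, x] = d_min[y, x]
    return out
```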
Step 204, based on the depths of the foreground edge pixels, obtaining by expansion the depths of foreground extension pixels at the periphery of the foreground edge pixels, and based on the depths of the background edge pixels, obtaining by expansion the depths of background extension pixels at the periphery of the background edge pixels.
Optionally, the foreground edge pixels and the background edge pixels may be added as elements to a queue, where the attribute information of an element includes: a first coordinate of the pixel to which the element belongs, a foreground-background label indicating whether the pixel belongs to the foreground or the background, a diffusion step number, and a second coordinate of the diffusion start pixel. Elements are taken from the queue one by one; each time a target element is taken out, the pixel corresponding to the first coordinate is set to a non-zero value in the foreground map or the background map according to the foreground-background label of the target element, and if the diffusion step number of the target element is greater than zero, the neighborhood pixels of the target element in the original image are queried and added to the queue as newly added elements. The diffusion step number of a newly added element is the diffusion step number of the target element minus one, the second coordinate of the newly added element is the first coordinate of the target element, and the foreground-background label of the newly added element is the same as that of the target element.
As a possible implementation, the diffusion step number of an element whose pixel is a background edge pixel is a first step number, and the diffusion step number of an element whose pixel is a foreground edge pixel is a second step number, where the second step number is greater than the first step number. Thus the foreground edge pixels diffuse further than the background edge pixels, and more background content is revealed by the foreground diffusion pixels after completion.
For example: a new foreground mask Mn and a new background mask Mf are created; Mn and Mf are binary maps whose pixel values are all 0. A depth map Dm is initialized with all values 0, where the values in Dm are floating-point numbers representing the depth information of the pixels to be completed (inpainted). A weight map Wf is initialized with all values 0, where the values in Wf are floating-point numbers representing the weights for the weighted fusion of the completed background-region pixels and the original background pixels.
A queue Q is constructed so that the expansion process is more orderly. Each element in the queue Q is a tuple (p, is_near, step, root), where p is the pixel coordinate, is_near marks whether it is a foreground or background pixel, step is the number of diffusion steps, and root is the coordinate of the starting pixel of the traversal. For pixels in N, step is initially set to Sn (for example, 200); for pixels in F, step is initially set to Sf (for example, 10).
All pixels in N and F are put into the queue Q in order, and elements are then fetched from the queue in a loop. For each fetched element q = (q[p], q[is_near], q[step], q[root]), the depth map is set as Dm[q[p]] = D[q[root]]. If q[is_near] is true, Mn[q[p]] is set to 1; otherwise Mf[q[p]] is set to 1 and Wf[q[p]] = q[step]/Sf is set at the same time. If q[step] > 0, the 4-neighborhood pixel locations of q[p] are visited: for each neighbor that has not yet been visited, (q[p].neighbor[i], q[is_near], q[step]-1, q[root]) is pushed into Q. The loop repeats until Q is empty, yielding a foreground mask map diffused by Sn steps, a background mask map diffused by Sf steps, and a depth map Dm covering all traversed diffusion pixels. Because the original image has some pixels missing due to viewing-angle occlusion, expanding pixels with depths referenced from the original image makes the subsequent three-dimensional modeling more complete and closer to the real scene.
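A sketch of this diffusion, following the queue element layout (p, is_near, step, root) described above; the root coordinate is propagated unchanged, as in the example:

```python
from collections import deque
import numpy as np

def diffuse_extension_pixels(N, F, D, shape, Sn=200, Sf=10):
    """BFS diffusion from foreground/background edge pixels.

    Produces the foreground mask Mn, background mask Mf, per-pixel diffusion
    depth Dm (copied from the starting edge pixel), and the fusion weight Wf.
    """
    H, W = shape
    Mn = np.zeros((H, W), np.uint8)
    Mf = np.zeros((H, W), np.uint8)
    Dm = np.zeros((H, W), np.float32)
    Wf = np.zeros((H, W), np.float32)
    visited = set()
    Q = deque()
    for p in N:
        Q.append((p, True, Sn, p)); visited.add(p)
    for p in F:
        Q.append((p, False, Sf, p)); visited.add(p)
    while Q:
        p, is_near, step, root = Q.popleft()
        y, x = p
        Dm[y, x] = D[root]                     # depth taken from the start pixel
        if is_near:
            Mn[y, x] = 1
        else:
            Mf[y, x] = 1
            Wf[y, x] = step / Sf
        if step > 0:
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                q = (y + dy, x + dx)
                if 0 <= q[0] < H and 0 <= q[1] < W and q not in visited:
                    visited.add(q)
                    Q.append((q, is_near, step - 1, root))
    return Mn, Mf, Dm, Wf
```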
Step 205, performing pixel value completion according to the pixel values of the pixels in the original image, to determine the pixel values of the foreground extension pixels and the background extension pixels.
Optionally, a target mask map is synthesized from the foreground extension pixels and the background extension pixels. The original image and the target mask map are input into a drawing (inpainting) model to determine the pixel value of each extension pixel in the target mask map according to the depth and pixel value of each pixel in the original image and the depth of each extension pixel in the target mask map, where the extension pixels include the foreground extension pixels and the background extension pixels. In this way, pixels invisible in the original image can be completed to a certain extent with reference to the original image, so that the subsequent three-dimensional modeling is more complete and closer to the real scene.
As a possible implementation, the foreground extension pixels may be used as a foreground mask map, the background extension pixels as a background mask map, and the two mask maps summed to obtain the target mask map. The foreground and the background are thus predicted and completed simultaneously, avoiding a hard seam at the junction of the foreground and the background. For example: Mf and Mn are summed to obtain M; with M as the mask and the original image I as input, the LaMa algorithm is used to complete the RGB channels of the non-zero region in M, obtaining a completed image R. The pixel values of the completed image and the original image are then fused according to the weight map, achieving a gradual transition effect in the picture.
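A sketch of the completion and fusion step; `inpaint_fn` is a stand-in for the LaMa model (its exact interface is an assumption), and the weighted fusion follows the weight map Wf defined earlier:

```python
import numpy as np

def complete_extension_pixels(I, Mn, Mf, Wf, inpaint_fn):
    """Fill RGB values for the extension pixels.

    `inpaint_fn(image, mask)` must return a completed RGB image (assumed
    interface). Background extension pixels are blended with the original
    image according to Wf; foreground extension pixels keep the completed
    values directly.
    """
    M = np.clip(Mn + Mf, 0, 1)                   # target mask: all extension pixels
    R = inpaint_fn(I, M)                         # completed image from the model
    w = Wf[..., None].astype(np.float32)         # per-pixel fusion weight
    fused = R.astype(np.float32) * w + I.astype(np.float32) * (1.0 - w)
    out = np.where(Mf[..., None] > 0, fused, R)  # blend only in the background band
    return M, R, out.astype(np.uint8)
```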
Step 206, performing three-dimensional modeling according to the pixels in the original image, the foreground extension pixels and the background extension pixels to obtain a three-dimensional mesh model.
Optionally, connection edges are established between pixels in the original image and their neighborhood pixels; a connection edge is deleted when the two pixels it connects are a foreground edge pixel and a background edge pixel, respectively; according to the depths of the foreground extension pixels, connection edges are established with depth-matched background edge pixels in the original image; the connection edges are combined to obtain a plurality of triangular patches in three-dimensional space; and each triangular patch is rendered according to the pixel values of the pixels in the original image and the pixel values of the foreground edge pixels and background edge pixels, to obtain a three-dimensional model that closely matches the actual scene.
As a possible implementation, a graph structure model G is constructed, where each node G[w][h][c] represents a pixel: w and h are the coordinate position of the pixel in the image I or the mask M, and c denotes the c-th layer at that pixel position. The layers at each position are ordered so that depth values go from large to small, i.e., the upper layers tend to be foreground.
Each node G[w][h][c] has attributes {v, d, is_near, is_far, edge}, where v is the RGB value of the pixel, d is its depth value, is_near marks whether the pixel is a foreground edge pixel, is_far marks whether it is a background edge pixel, and edge is the list of connection edges between this pixel and other pixels; a connection edge may only exist between pixels whose spatial coordinates are 4-neighbors, while the layer index is not restricted.
First, connection edges are established between pixels in the original image and their adjacent pixels, and a connection edge is deleted when the two pixels it connects are a foreground edge pixel and a background edge pixel, respectively. G is initialized with p[v] = I[p] and p[d] = D[p] for all pixels {p} in the original image I; if p is in N, p[is_near] = True, and if p is in F, p[is_far] = True. For any 4-neighborhood pixel q of p, if p[is_near] = True and q[is_far] = True, or p[is_far] = True and q[is_near] = True, there is no connection edge between p and q; otherwise a connection edge e_pq is generated and added to the edge list p[edge].
Then, according to the depths of the foreground extension pixels, connection edges are established with depth-matched background edge pixels in the original image. For the set {p}n of non-zero pixels in Mn, each pixel is inserted into G[w][h][0] according to its coordinates, i.e., the first layer of that coordinate position, with p[v] = R[p] and p[d] = Dm[p]; its 4-neighborhood pixel positions q are then traversed, and if q is in {p}n or q[is_far] = True, a connection edge e_pq between p and q is constructed and added to p[edge]. For the set {p}f of non-zero pixels in Mf, each pixel is inserted into the last layer of G[w][h] according to its coordinates, with p[v] = R[p]·Wf[p] + I[p]·(1 − Wf[p]) and p[d] = Dm[p]; its 4-neighborhood pixel positions q are traversed, and if q is in {p}f or q[is_far] = True, a connection edge e_pq between p and q is constructed and added to p[edge].
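A simplified sketch of the first connection rule (edges between original-image pixels, with foreground-edge-to-background-edge links suppressed); the insertion of the extension-pixel layers and the subsequent triangulation into patches are omitted for brevity:

```python
import numpy as np

def build_graph_edges(D, N, F):
    """Connection edges between original-image pixels for the mesh graph G.

    Implements only the first rule above: 4-neighbour pixels are connected
    unless one side is a foreground edge pixel and the other a background
    edge pixel. The extension-pixel layers would be inserted on top of this.
    """
    H, W = D.shape
    edges = []
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):        # each undirected edge once
                qy, qx = y + dy, x + dx
                if qy >= H or qx >= W:
                    continue
                p_near, p_far = (y, x) in N, (y, x) in F
                q_near, q_far = (qy, qx) in N, (qy, qx) in F
                if (p_near and q_far) or (p_far and q_near):
                    continue                        # no edge across the fg/bg boundary
                edges.append(((y, x), (qy, qx)))
    return edges
```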
Step 207, obtaining a camera movement trajectory.
Optionally, if f frames of video are to be generated, f camera poses need to be constructed, and the camera movement trajectory {pose} is composed of these f camera poses.
As one possible implementation, the camera movement trajectory is a push/pull (dolly) move, for example: pose = {[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, cos((i/f)·π/4 − π/4)·Z + X], [0, 0, 0, 0]}, where i is the frame number and Z and X are constants. A variable-speed push/pull movement can thus be presented.
As another possible implementation, the camera movement trajectory is a horizontal rotation, for example: pose = {[cos(−θ/180·π), 0, −sin(−θ/180·π), −sin(σ·π/4)·X], [0, 1, 0, −sin(σ·π/4)·Y], [sin(−θ/180·π), 0, cos(−θ/180·π), −sin(σ·π/4)·Z], [0, 0, 0, 0]}, where θ = −Z + 2Z·i/f, and X, Y and Z are constants.
As yet another possible implementation, the camera movement trajectory is a planar turn, for example: pose = {[1, 0, 0, −cos(σ·π)·X], [0, 1, 0, −sin(σ·π)·X], [0, 0, 1, 0], [0, 0, 0, 0]}, where X and Y are constants.
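As an illustration, the push/pull trajectory could be generated as below, under the assumption that each pose is a 4x4 matrix whose only time-varying entry is the translation along the viewing axis; the constants and matrix layout are assumptions consistent with the examples above:

```python
import numpy as np

def push_pull_trajectory(f: int, Z: float = 0.2, X: float = 0.0):
    """Camera poses for a push/pull (dolly) move over f frames.

    Each pose is assumed to be a 4x4 [R|t] matrix; only the translation along
    the viewing axis varies, following cos((i/f)*pi/4 - pi/4)*Z + X.
    """
    poses = []
    for i in range(f):
        pose = np.eye(4, dtype=np.float32)
        pose[2, 3] = np.cos((i / f) * np.pi / 4 - np.pi / 4) * Z + X
        poses.append(pose)
    return poses

# For a loopable video (e.g., a live-broadcast poster background), the
# trajectory can be designed so that the last pose equals the first.
```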
It should be noted that a person skilled in the art may also adjust these to obtain new trajectories or design other trajectories, which is not limited in this embodiment.
Step 208, generating a video based on the camera movement trajectory and the three-dimensional model.
Optionally, a plurality of camera positions are located in the three-dimensional space where the three-dimensional model is located based on the camera movement trajectory; an imaging map of the three-dimensional model is determined at the viewing angle of each camera position; and the imaging maps are arranged according to the moments at which the camera occupies each camera position along the camera movement trajectory, to obtain the generated video. Since a plurality of optional trajectories can be configured, video generation is more flexible and can meet different user demands.
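A sketch of this frame-by-frame rendering loop; `render_fn` is a stand-in for whatever mesh renderer is used (not specified in the disclosure):

```python
import numpy as np

def render_video(poses, render_fn):
    """Render one frame per camera pose and order them along the trajectory.

    `render_fn(pose)` is assumed to return the imaging map (an H x W x 3
    array of a fixed size) of the three-dimensional model at that pose.
    """
    frames = [render_fn(pose) for pose in poses]   # one imaging map per position
    # Frames are already in trajectory order; a muxer such as OpenCV's
    # VideoWriter or ffmpeg would write them out at the chosen frame rate.
    return np.stack(frames)
```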
As shown in fig. 4, the video generation process can thus be divided into: inputting an image, performing depth estimation on the input image, performing depth edge extraction based on the estimated depth, expanding the extracted foreground edge pixels and background edge pixels and then performing pixel completion, constructing a three-dimensional mesh model based on the pixels in the original image and the completed pixels, and rendering the three-dimensional mesh model frame by frame in combination with the camera trajectory to obtain the output video.
In this embodiment, after depth estimation is performed on each pixel in the original image to obtain the depth of each pixel, edge recognition is performed according to these depths to obtain foreground edge pixels and background edge pixels. From the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels are generated. Three-dimensional modeling is then performed according to the pixels in the original image, the foreground extension pixels and the background extension pixels, and a video is generated based on the three-dimensional model obtained by modeling. Since edge recognition only needs to divide the pixels of the original image into two classes based on depth, namely foreground edge pixels and background edge pixels, the number of levels is reduced compared with the related LDI technology. In addition, in the present disclosure, only the foreground extension pixels extending outward from the foreground edge pixels and the background extension pixels extending outward from the background edge pixels need to be completed according to the original image; compared with the related art, which must complete the missing pixels of every level, the calculation amount is reduced and the efficiency of video generation is improved.
Fig. 5 is a schematic structural diagram of a video generating apparatus 500 according to an embodiment of the disclosure, as shown in fig. 5, including: an estimation module 501, an identification module 502, an extension module 503, a modeling module 504, and a generation module 505.
The estimating module 501 is configured to perform depth estimation on each pixel in an original image, so as to obtain a depth of each pixel in the original image.
And the recognition module 502 is configured to perform edge recognition according to the depth of each pixel in the original image, so as to obtain a foreground edge pixel and a background edge pixel.
An expansion module 503, configured to generate, according to the original image, foreground extension pixels extending outward from the foreground edge pixels, and background extension pixels extending outward from the background edge pixels.
A modeling module 504, configured to perform three-dimensional modeling according to pixels in the original image, the foreground extension pixels, and the background extension pixels.
The generating module 505 is configured to generate a video based on the three-dimensional model obtained by modeling.
In some possible embodiments, expansion module 503 includes:
an expansion unit, configured to obtain, by expansion, the depths of foreground extension pixels at the periphery of the foreground edge pixels according to the depths of the foreground edge pixels, and the depths of background extension pixels at the periphery of the background edge pixels according to the depths of the background edge pixels;
and a completion unit, configured to determine the pixel values of the foreground extension pixels and the background extension pixels according to the pixel value of each pixel in the original image.
Optionally, the completion unit is configured to:
synthesizing a target mask map according to the foreground extension pixels and the background extension pixels;
inputting the original image and the mask map into a drawing model to determine the pixel value of each extended pixel in the target mask map according to the depth and the pixel value of each pixel in the original image and the depth of each extended pixel in the target mask map; wherein the extension pixels include the foreground extension pixels and the background extension pixels.
When synthesizing the target mask map according to the foreground extension pixels and the background extension pixels, the completion unit is configured to: take the foreground extension pixels as a foreground mask map and the background extension pixels as a background mask map, and sum the foreground mask map and the background mask map to obtain the target mask map.
Optionally, the expansion unit is configured to:
adding the foreground edge pixels and the background edge pixels as elements into a queue, where the attribute information of an element comprises: a first coordinate of the pixel to which the element belongs, a foreground-background label indicating whether the pixel belongs to the foreground or the background, a diffusion step number, and a second coordinate of the diffusion start pixel;
taking elements from the queue one by one, and each time a target element is taken out, setting the pixel corresponding to the first coordinate to a non-zero value in the foreground map or the background map based on the foreground-background label of the target element; and,
if the diffusion step number of the target element is greater than zero, querying the neighborhood pixels of the target element in the original image, and adding the neighborhood pixels of the target element as newly added elements into the queue;
where the diffusion step number of the newly added element is the diffusion step number of the target element minus one, the second coordinate of the newly added element is the first coordinate of the target element, and the foreground-background label of the newly added element is the same as that of the target element.
Optionally, the diffusion step number of an element whose pixel is a background edge pixel is a first step number, and the diffusion step number of an element whose pixel is a foreground edge pixel is a second step number, where the second step number is greater than the first step number.
In some possible embodiments, the identification module 502 includes:
the edge recognition unit is used for determining an edge map by adopting an edge detection algorithm of an adaptive threshold according to the depth of each pixel in the original image, wherein the pixel value of each pixel in the edge map is used for indicating whether the corresponding pixel in the original image is an edge pixel or not;
a dividing unit, configured to divide, for adjacent edge pixels in the edge map, the corresponding pixels in the original image into foreground edge pixels and background edge pixels according to their depth difference.
Optionally, the identification module 502 further includes:
a first correction unit, configured to determine a plurality of connected regions in the edge map according to pixel values of pixels in the edge map, where the pixel values of the pixels in the connected regions indicate pixels belonging to an edge; and removing the connected areas with the number of the pixel points smaller than the number threshold value from the connected areas.
Optionally, the identification module 502 further includes:
a second correction unit, configured to determine a plurality of connected regions in the edge map according to pixel values of pixels in the edge map, where the pixel values of the pixels in the connected regions indicate pixels belonging to an edge; and merging two connected areas with the distance smaller than a distance threshold value in the plurality of connected areas.
Optionally, the identification module 502 further includes:
a third correction unit, configured to identify, from the set of foreground edge pixels, first target pixels serving as foreground edge endpoints; determine paths connecting any two first target pixels through pixels in the set of foreground edge pixels; and delete, as redundant pixels, pixels in the set of foreground edge pixels through which no path passes.
Optionally, the identification module 502 further includes:
a fourth correction unit, configured to query, for any one of the background edge pixels, a plurality of second target pixels in its neighborhood; determine the second target pixel of minimum depth from the plurality of second target pixels; and, in the case that the second target pixel of minimum depth does not belong to the foreground edge pixels, add it as a background edge pixel to the set of background edge pixels.
Optionally, the identification module 502 further includes:
a fifth correction unit, configured to query, for any one of the background edge pixels, whether there is an adjacent background edge pixel, and, in the absence of an adjacent background edge pixel, delete the background edge pixel.
Optionally, the identification module 502 further includes:
a sixth correction unit configured to determine contour pixels of an outer periphery of an edge pixel in the edge map; and replacing the depth of the surrounded edge pixels with the depth corresponding to the outline pixels.
Wherein optionally, the sixth correction unit determines contour pixels of the periphery of the edge pixels in the edge map, including: expanding the edge pixels in the edge map to obtain a first expansion map; re-expanding the edge pixels in the first expansion map to obtain a second expansion map;
And removing edge pixels overlapped with the first expansion map from the second expansion map to take the reserved edge pixels as the contour pixels.
In some possible embodiments, the depth of the foreground edge pixels is greater than the depth of the background edge pixels; based on this, the identification module 502 further includes:
a seventh correction unit, configured to increase the depth of a foreground edge pixel according to the depths of its neighborhood pixels, and reduce the depth of a background edge pixel according to the depths of its neighborhood pixels.
In some possible embodiments, modeling module 504 is configured to:
establishing a connection edge between pixels in the original image and the neighborhood pixels;
deleting a connection edge in the case that the two pixels it connects are a foreground edge pixel and a background edge pixel, respectively;
establishing a connecting edge between background edge pixels matched with the depth in the original image according to the depth of the foreground extension pixels;
combining the connecting edges to obtain a plurality of triangular patches in a three-dimensional space;
and rendering each triangular patch according to the pixel values of the pixels in the original image and the pixel values of the foreground edge pixels and background edge pixels, to obtain the three-dimensional model.
In some possible embodiments, the generation module 505 is configured to perform:
positioning a plurality of camera positions, based on a camera movement trajectory, in the three-dimensional space where the three-dimensional model is located;
determining an imaging map of the three-dimensional model at the viewing angle of each camera position;
and arranging the corresponding imaging maps according to the moments at which the camera occupies each camera position along the camera movement trajectory, to obtain the generated video.
In this embodiment, after depth estimation is performed on each pixel in the original image to obtain the depth of each pixel, edge recognition is performed according to these depths to obtain foreground edge pixels and background edge pixels. From the original image, foreground extension pixels extending outward from the foreground edge pixels and background extension pixels extending outward from the background edge pixels are generated. Three-dimensional modeling is then performed according to the pixels in the original image, the foreground extension pixels and the background extension pixels, and a video is generated based on the three-dimensional model obtained by modeling. Since edge recognition only needs to divide the pixels of the original image into two classes based on depth, namely foreground edge pixels and background edge pixels, the number of levels is reduced compared with the related LDI technology. In addition, in the present disclosure, only the foreground extension pixels extending outward from the foreground edge pixels and the background extension pixels extending outward from the background edge pixels need to be completed according to the original image; compared with the related art, which must complete the missing pixels of every level, the calculation amount is reduced and the efficiency of video generation is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 602 or a computer program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An I/O (Input/Output) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing units 601 include, but are not limited to, a CPU (Central Processing Unit ), a GPU (Graphic Processing Units, graphics processing unit), various dedicated AI (Artificial Intelligence ) computing chips, various computing units running machine learning model algorithms, DSPs (Digital Signal Processor, digital signal processors), and any suitable processors, controllers, microcontrollers, and the like. The computing unit 601 performs the respective methods and processes described above, such as a video generation method. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When a computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit System, FPGA (Field Programmable Gate Array ), ASIC (Application-Specific Integrated Circuit, application-specific integrated circuit), ASSP (Application Specific Standard Product, special-purpose standard product), SOC (System On Chip ), CPLD (Complex Programmable Logic Device, complex programmable logic device), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Electrically Programmable Read-Only-Memory, erasable programmable read-Only Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display ) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network ), WAN (Wide Area Network, wide area network), internet and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of studying how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (37)

1. A video generation method, comprising:
performing depth estimation on each pixel in an original image to obtain the depth of each pixel in the original image;
performing edge recognition according to the depth of each pixel in the original image to obtain a foreground edge pixel and a background edge pixel;
generating foreground extension pixels extending from the foreground edge pixels to the periphery and generating background extension pixels extending from the background edge pixels to the periphery according to the original image;
performing three-dimensional modeling according to pixels in the original image, the foreground extension pixels and the background extension pixels;
and generating a video based on the three-dimensional model obtained by modeling.
2. The method of claim 1, wherein the generating foreground extension pixels extending peripherally from the foreground edge pixels and generating background extension pixels extending peripherally from the background edge pixels from the original image comprises:
according to the depth of the foreground edge pixel, the depth of a foreground extension pixel at the periphery of the foreground edge pixel is obtained by extension, and according to the depth of the background edge pixel, the depth of a background extension pixel at the periphery of the background edge pixel is obtained by extension;
And determining the pixel value of the foreground extension pixel and the pixel value of the background extension pixel according to the pixel value of each pixel in the original image.
3. The method of claim 2, wherein the determining the pixel value of the foreground extension pixel and the pixel value of the background extension pixel from the pixel values of the pixels in the original image comprises:
synthesizing a target mask map according to the foreground extension pixels and the background extension pixels;
inputting the original image and the target mask map into a drawing model to determine the pixel value of each extension pixel in the target mask map according to the depth and the pixel value of each pixel in the original image and the depth of each extension pixel in the target mask map; wherein the extension pixels include the foreground extension pixels and the background extension pixels.
4. A method according to claim 3, wherein said synthesizing a target mask map from said foreground extension pixels and said background extension pixels comprises:
and taking the foreground extension pixels as a foreground mask map, taking the background extension pixels as a background mask map, and summing the foreground mask map and the background mask map to obtain the target mask map.
5. The method of claim 2, wherein the expanding to obtain the depth of the foreground extension pixels around the foreground edge pixels according to the depth of the foreground edge pixels, and expanding to obtain the depth of the background extension pixels around the background edge pixels according to the depth of the background edge pixels, comprises:
adding the foreground edge pixels and the background edge pixels as elements into a queue, wherein attribute information of an element comprises: a first coordinate of the pixel to which the element belongs, a foreground-background mark for indicating whether the pixel belongs to the foreground or the background, a diffusion step number, and a second coordinate of a diffusion start pixel;
taking elements out of the queue one by one, and each time a target element is taken out, setting, based on the foreground-background mark of the target element, the pixel corresponding to the first coordinate in a foreground map or a background map to a non-zero value; and,
if the diffusion step number of the target element is greater than zero, querying the neighborhood pixels of the target element in the original image, and adding the neighborhood pixels of the target element as new elements into the queue;
wherein the diffusion step number of a newly added element is the diffusion step number of the target element minus one, the second coordinate of the newly added element is the first coordinate of the target element, and the foreground-background mark of the newly added element is the same as that of the target element.
6. The method of claim 5, wherein, in a case where the pixel to which the element belongs is the background edge pixel, the diffusion step number is a first step number; and in a case where the pixel to which the element belongs is the foreground edge pixel, the diffusion step number is a second step number;
wherein the second step number is greater than the first step number.
7. The method according to any one of claims 1-6, wherein the performing edge recognition according to the depth of each pixel in the original image to obtain a foreground edge pixel and a background edge pixel includes:
determining an edge map by adopting an edge detection algorithm of an adaptive threshold according to the depth of each pixel in the original image, wherein the pixel value of each pixel in the edge map is used for indicating whether the corresponding pixel in the original image is an edge pixel or not;
and dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge image.
8. The method of claim 7, wherein after determining an edge map using an edge detection algorithm with an adaptive threshold according to the depth of each pixel in the original image, further comprising:
Determining a plurality of connected areas in the edge graph according to the pixel values of the pixels in the edge graph, wherein the pixel values of the pixels in the connected areas indicate pixels belonging to the edge;
and removing the connected areas with the number of the pixel points smaller than the number threshold value from the connected areas.
9. The method of claim 7, wherein after determining an edge map using an edge detection algorithm with an adaptive threshold according to the depth of each pixel in the original image, further comprising:
determining a plurality of connected areas in the edge graph according to the pixel values of the pixels in the edge graph, wherein the pixel values of the pixels in the connected areas indicate pixels belonging to the edge;
and merging two connected areas with the distance smaller than a distance threshold value in the plurality of connected areas.
10. The method of claim 7, wherein the dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge map further comprises:
identifying first target pixels serving as foreground edge endpoints from the set of foreground edge pixels;
determining a path connecting any two first target pixels through pixels in the set of foreground edge pixels;
and deleting, as redundant pixels, pixels in the set of foreground edge pixels through which no path passes.
11. The method of claim 7, wherein the dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge map further comprises:
querying a plurality of second target pixels of a neighborhood for any of the background edge pixels;
determining a second target pixel of the minimum depth from the plurality of second target pixels;
in the case that the second target pixel of the smallest depth does not belong to the foreground edge pixels, the second target pixel of the smallest depth is added as a background edge pixel to the set of background edge pixels.
12. The method of claim 7, wherein the dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge map further comprises:
Querying whether adjacent background edge pixels exist for any background edge pixel;
in the absence of adjacent background edge pixels, the background edge pixels are deleted.
13. The method of claim 7, wherein after determining an edge map using an edge detection algorithm with an adaptive threshold according to the depth of each pixel in the original image, further comprising:
determining contour pixels of the periphery of the edge pixels in the edge map;
and replacing the depth of the surrounded edge pixels with the depth corresponding to the contour pixels.
14. The method of claim 13, wherein the determining contour pixels around edge pixels in the edge map comprises:
expanding the edge pixels in the edge map to obtain a first expansion map;
re-expanding the edge pixels in the first expansion map to obtain a second expansion map;
and removing edge pixels that overlap the first expansion map from the second expansion map, and taking the remaining edge pixels as the contour pixels.
15. The method of claim 7, wherein the depth of the foreground edge pixels is greater than the depth of the background edge pixels;
the dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge image, further includes:
increasing the depth of the foreground edge pixels according to the depth of the neighborhood pixels of the foreground edge pixels;
and reducing the depth of the background edge pixels according to the depth of the neighborhood pixels of the background edge pixels.
16. The method of claim 2, wherein the three-dimensional modeling from pixels in the original image, the foreground extension pixels, and the background extension pixels comprises:
establishing a connection edge between pixels in the original image and the neighborhood pixels;
deleting the connection edge in the case that the two pixels connected by the connection edge are a foreground edge pixel and a background edge pixel, respectively;
establishing connection edges between depth-matched background edge pixels in the original image according to the depth of the foreground extension pixels;
combining the connection edges to obtain a plurality of triangular patches in a three-dimensional space;
and rendering each triangular patch according to the pixel values of the pixels in the original image and the pixel values of the foreground edge pixels and the background edge pixels to obtain a three-dimensional model.
17. The method of any of claims 1-6, wherein the video generation based on the modeled three-dimensional model comprises:
positioning a plurality of camera positions based on a camera movement track in the three-dimensional space where the three-dimensional model is located;
determining an imaging map of the three-dimensional model from the view angle of each camera position;
and arranging the corresponding imaging maps according to the moments at which the camera is located at each camera position along the camera movement track, so as to obtain the generated video.
18. A video generating apparatus comprising:
the estimating module is used for estimating the depth of each pixel in the original image to obtain the depth of each pixel in the original image;
the identification module is used for carrying out edge identification according to the depth of each pixel in the original image to obtain a foreground edge pixel and a background edge pixel;
an expansion module for generating foreground extension pixels extending from the foreground edge pixels to the periphery and generating background extension pixels extending from the background edge pixels to the periphery, based on the original image;
the modeling module is used for carrying out three-dimensional modeling according to the pixels in the original image, the foreground extension pixels and the background extension pixels;
And the generation module is used for generating the video based on the three-dimensional model obtained by modeling.
19. The apparatus of claim 18, wherein the expansion module comprises:
the expansion unit is used for obtaining, by extension, the depth of the foreground extension pixels at the periphery of the foreground edge pixels according to the depth of the foreground edge pixels, and obtaining, by extension, the depth of the background extension pixels at the periphery of the background edge pixels according to the depth of the background edge pixels;
and the complementing unit is used for determining the pixel value of the foreground extension pixel and the pixel value of the background extension pixel according to the pixel value of each pixel in the original image.
20. The apparatus of claim 19, wherein the completion unit is configured to:
synthesizing a target mask map according to the foreground extension pixels and the background extension pixels;
inputting the original image and the target mask map into a drawing model to determine the pixel value of each extension pixel in the target mask map according to the depth and the pixel value of each pixel in the original image and the depth of each extension pixel in the target mask map; wherein the extension pixels include the foreground extension pixels and the background extension pixels.
21. The apparatus of claim 20, wherein the completion unit is configured to:
and taking the foreground extension pixels as a foreground mask map, taking the background extension pixels as a background mask map, and summing the foreground mask map and the background mask map to obtain the target mask map.
22. The apparatus of claim 19, wherein the expansion unit is configured to:
adding the foreground edge pixels and the background edge pixels as elements into a queue, wherein attribute information of an element comprises: a first coordinate of the pixel to which the element belongs, a foreground-background mark for indicating whether the pixel belongs to the foreground or the background, a diffusion step number, and a second coordinate of a diffusion start pixel;
taking elements out of the queue one by one, and each time a target element is taken out, setting, based on the foreground-background mark of the target element, the pixel corresponding to the first coordinate in a foreground map or a background map to a non-zero value; and,
if the diffusion step number of the target element is greater than zero, querying the neighborhood pixels of the target element in the original image, and adding the neighborhood pixels of the target element as new elements into the queue;
wherein the diffusion step number of a newly added element is the diffusion step number of the target element minus one, the second coordinate of the newly added element is the first coordinate of the target element, and the foreground-background mark of the newly added element is the same as that of the target element.
23. The apparatus of claim 22, wherein, in a case where the pixel to which the element belongs is the background edge pixel, the diffusion step number is a first step number; and in a case where the pixel to which the element belongs is the foreground edge pixel, the diffusion step number is a second step number;
wherein the second step number is greater than the first step number.
24. The apparatus of any of claims 18-23, wherein the identification module comprises:
the edge recognition unit is used for determining an edge map by adopting an edge detection algorithm of an adaptive threshold according to the depth of each pixel in the original image, wherein the pixel value of each pixel in the edge map is used for indicating whether the corresponding pixel in the original image is an edge pixel or not;
the dividing unit is used for dividing the pixels in the original image into foreground edge pixels and background edge pixels according to the depth difference of the corresponding pixels in the original image for the adjacent edge pixels in the edge image.
25. The apparatus of claim 24, wherein the identification module further comprises:
a first correction unit, configured to determine a plurality of connected regions in the edge map according to pixel values of pixels in the edge map, where the pixel values of the pixels in the connected regions indicate pixels belonging to an edge; and removing the connected areas with the number of the pixel points smaller than the number threshold value from the connected areas.
26. The apparatus of claim 24, wherein the identification module further comprises:
a second correction unit, configured to determine a plurality of connected regions in the edge map according to pixel values of pixels in the edge map, where the pixel values of the pixels in the connected regions indicate pixels belonging to an edge; and merging two connected areas with the distance smaller than a distance threshold value in the plurality of connected areas.
27. The apparatus of claim 24, wherein the identification module further comprises:
a third correction unit for identifying first target pixels serving as foreground edge endpoints from the set of foreground edge pixels; determining a path connecting any two first target pixels through pixels in the set of foreground edge pixels; and deleting, as redundant pixels, pixels in the set of foreground edge pixels through which no path passes.
28. The apparatus of claim 24, wherein the identification module further comprises:
a fourth correction unit, configured to query, for any one of the background edge pixels, a plurality of second target pixels in a neighborhood; determine a second target pixel of the minimum depth from the plurality of second target pixels; and, in the case that the second target pixel of the smallest depth does not belong to the foreground edge pixels, add the second target pixel of the smallest depth as a background edge pixel to the set of background edge pixels.
29. The apparatus of claim 24, wherein the identification module further comprises:
a fifth correction unit, configured to query, for any one of the background edge pixels, whether there is an adjacent background edge pixel; in the absence of adjacent background edge pixels, the background edge pixels are deleted.
30. The apparatus of claim 24, wherein the identification module further comprises:
a sixth correction unit configured to determine contour pixels at the periphery of the edge pixels in the edge map; and replace the depth of the surrounded edge pixels with the depth corresponding to the contour pixels.
31. The apparatus of claim 30, wherein the sixth modification unit is further configured to:
Expanding the edge pixels in the edge map to obtain a first expansion map;
re-expanding the edge pixels in the first expansion map to obtain a second expansion map;
and removing edge pixels that overlap the first expansion map from the second expansion map, and taking the remaining edge pixels as the contour pixels.
32. The apparatus of claim 24, wherein the depth of the foreground edge pixels is greater than the depth of the background edge pixels;
the identification module further comprises:
a seventh correction unit, configured to increase the depth of the foreground edge pixel according to the depth of the neighborhood pixels of the foreground edge pixel; and reduce the depth of the background edge pixel according to the depth of the neighborhood pixels of the background edge pixel.
33. The apparatus of claim 19, wherein the modeling module is to:
establishing a connection edge between pixels in the original image and the neighborhood pixels;
deleting the connection edge in the case that the two pixels connected by the connection edge are a foreground edge pixel and a background edge pixel, respectively;
establishing connection edges between depth-matched background edge pixels in the original image according to the depth of the foreground extension pixels;
combining the connection edges to obtain a plurality of triangular patches in a three-dimensional space;
and rendering each triangular patch according to the pixel values of the pixels in the original image and the pixel values of the foreground edge pixels and the background edge pixels to obtain a three-dimensional model.
34. The apparatus of any of claims 18-23, wherein the generating module comprises:
positioning a plurality of camera positions based on a camera movement track in the three-dimensional space where the three-dimensional model is located;
determining an imaging map of the three-dimensional model from the view angle of each camera position;
and arranging the corresponding imaging maps according to the moments at which the camera is located at each camera position along the camera movement track, so as to obtain the generated video.
35. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-17.
36. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-17.
37. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-17.
CN202311245123.7A 2023-09-25 2023-09-25 Video generation method, device, equipment and storage medium Pending CN117354480A (en)

Priority Applications (1)

Application Number: CN202311245123.7A; Priority Date: 2023-09-25; Filing Date: 2023-09-25; Title: Video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202311245123.7A; Priority Date: 2023-09-25; Filing Date: 2023-09-25; Title: Video generation method, device, equipment and storage medium

Publications (1)

Publication Number: CN117354480A; Publication Date: 2024-01-05

Family

ID=89365930

Family Applications (1)

Application Number: CN202311245123.7A; Status: Pending; Priority Date: 2023-09-25; Filing Date: 2023-09-25; Title: Video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117354480A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination