CN112712487A - Scene video fusion method and system, electronic equipment and storage medium - Google Patents

Scene video fusion method and system, electronic equipment and storage medium

Info

Publication number
CN112712487A
CN112712487A (application CN202011536124.3A)
Authority
CN
China
Prior art keywords
image
images
frames
fusion
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011536124.3A
Other languages
Chinese (zh)
Inventor
潘金龙
宋亚连
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Softcom Smart City Technology Co ltd
Original Assignee
Beijing Softcom Smart City Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Softcom Smart City Technology Co ltd
Priority to CN202011536124.3A
Publication of CN112712487A
Legal status: Pending

Classifications

    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N 3/02, G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods
    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 3/02: Geometric image transformations in the plane of the image; affine transformations
    • G06T 5/80: Image enhancement or restoration; geometric correction
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20212, G06T 2207/20221: Special algorithmic details: image combination; image fusion; image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a scene video fusion method, a scene video fusion system, an electronic device and a storage medium. The method comprises: acquiring videos collected by at least two cameras, and determining, in the collected videos, at least two frames of images that contain the same target physical object; fusing the at least two frames into one frame based on the position parameters of the at least two cameras to obtain a fused image; and projecting the target physical object in the fused image to the corresponding position in a virtual scene. The technical scheme of the embodiments of the invention can fuse images quickly, superimpose real images onto the virtual scene, effectively increase the interactivity between the virtual scene and reality, and enhance the visual experience.

Description

Scene video fusion method and system, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of information processing, in particular to a scene video fusion method, a scene video fusion system, electronic equipment and a storage medium.
Background
Scene video fusion is a branch and a development stage of virtual reality technology. It fuses one or more video image sequences of a scene or model, collected by video acquisition devices, with a virtual scene of that same scene, generating a new virtual scene or model. In the construction of smart cities, especially for security and management, scene video fusion is very important.
Traditional monitoring equipment such as ordinary cameras cannot combine a virtual scene with reality. In addition, in the related art, original images are fused directly during scene video fusion, so the fusion is slow and inefficient.
Disclosure of Invention
The embodiments of the invention provide a scene video fusion method, a scene video fusion system, an electronic device and a storage medium, which can fuse images quickly, superimpose real images onto a virtual scene, effectively increase the interaction between the virtual scene and reality, and enhance the visual experience.
In a first aspect, an embodiment of the present invention provides a scene video fusion method, where the method includes:
acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by at least two corresponding cameras;
fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image;
and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
In a second aspect, an embodiment of the present invention further provides a scene video fusion system, where the system includes:
the image acquisition module is used for acquiring videos acquired by at least two cameras and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by at least two corresponding cameras;
the image fusion module is used for fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image;
and the projection module is used for projecting the target real object in the fusion image to the corresponding position in the virtual scene.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the device includes: one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the scene video fusion method according to any one of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the scene video fusion method according to any one of the embodiments of the present invention.
According to the technical scheme provided by the embodiments of the invention, videos collected by at least two cameras are acquired, at least two frames of images containing the same target physical object are determined in the collected videos, the at least two frames are fused into one frame based on the position parameters of the at least two cameras to obtain a fused image, and the target physical object in the fused image is projected to the corresponding position in a virtual scene. Images can thus be fused quickly, real images and the virtual scene can be superimposed on each other, the interactivity between the virtual scene and reality is effectively increased, and the visual experience is enhanced.
Drawings
Fig. 1 is a flowchart of a scene video fusion method according to an embodiment of the present invention;
fig. 2 is a flowchart of another scene video fusion method provided in the embodiment of the present invention;
fig. 3 is a flowchart of another scene video fusion method according to an embodiment of the present invention;
fig. 4 is a flowchart of another scene video fusion method provided in the embodiment of the present invention;
FIG. 5(a) is the original image before affine transformation;
FIG. 5(b) is a diagram of the output effect after affine transformation;
FIG. 6(a) is the original image before perspective transformation;
FIG. 6(b) is a graph of the output effect after perspective transformation;
FIG. 7 is a flow chart of image fusion;
FIG. 8(a) is the original image before radial distortion offset correction;
FIG. 8(b) is an effect diagram of pincushion distortion deviation correction on an original image;
FIG. 8(c) is a diagram showing the effect of barrel distortion aberration correction on the original image;
FIG. 9(a) is the original image before tangential distortion offset correction;
FIG. 9(b) is an effect diagram after tangential distortion deviation correction of the original image;
FIG. 10 is a diagram of an original example in a semantic segmentation process;
FIG. 11 is a semantic segmentation mask map corresponding to the original exemplary map in the semantic segmentation process;
FIG. 12 is a semantic segmentation foreground map in a semantic segmentation process;
FIG. 13 is a semantic segmentation composition graph in the semantic segmentation process;
FIG. 14 is an alpha channel map obtained by AI matting;
FIG. 15 is a composite illustration of AI matting;
FIG. 16 is a diagram illustrating a scene video fusion system architecture according to an embodiment of the present invention;
fig. 17 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a flowchart of a scene video fusion method according to an embodiment of the present invention, where the method may be performed by a scene video fusion system, where the system may be implemented by software and/or hardware, and the system may be configured in an electronic device such as a server. Optionally, the method is applied to a scene in which a virtual scene is fused with a video acquired by a camera. As shown in fig. 1, the technical solution provided by the embodiment of the present invention specifically includes:
s110: acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by the corresponding at least two cameras.
In the embodiment of the present invention, optionally, the positions of the cameras in the real indoor scene are determined in advance based on the field of view and on the distortion produced by image fusion; that is, the chosen camera positions should cover a wide field of view, for example so that the whole real indoor scene can be captured by at least two cameras, while minimizing the distortion produced when images from the videos of different cameras are fused. The video collected by a camera can be in various formats, such as avi, mp4, rtsp or m3u8. At least one frame of image is taken from the video of each camera, and the acquired frames must contain the same target physical object, that is, the frames must have an overlapping region. The target physical object can be a person, an object or another thing in the real indoor scene.
In an implementation manner of the embodiment of the present invention, optionally, after determining at least two frames of images including the same target physical object in the acquired video, the method further includes: and carrying out image transformation on the at least two frames of images to obtain a transformed image.
In the embodiment of the present invention, optionally, the image transformation may be an affine transformation applied to an image taken at an oblique viewing angle, or a perspective transformation that converts an image captured by a short-focus camera into the format of an image captured by a long-focus camera, or another transformation method; which transformation is most suitable is determined by the actual situation.
S120: and fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image.
In the embodiment of the present invention, optionally, each camera position is fixed and corresponds to one position parameter, that is, the transformation matrix of that camera. The position parameter is determined by the camera position, and the transformation matrix is determined with respect to a reference coordinate, which may be a corner of the real indoor scene or another reference point, chosen according to the situation. Based on the position parameter of each camera, the at least two frames of images are fused into one frame by an image fusion algorithm, yielding the fused image.
Therefore, at least two frames of images are fused into one frame of image based on at least two camera position parameters, so that the images can be rapidly fused to obtain a fused image.
S130: and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
In the embodiment of the present invention, optionally, the virtual scene is obtained by 3D modeling or mapping of the real indoor scene, with a 1:1 proportional correspondence between the virtual scene and the real indoor scene. A projection matrix is determined from the position parameters of the cameras in the virtual scene using a video editing tool such as a Video Mixer Editor, so that the target physical object in the fused image can be projected to the corresponding position in the virtual scene; virtual reality may also be implemented in other ways.
In an implementation manner of the embodiment of the present invention, optionally, the projecting the target real object in the fused image to a corresponding position in a virtual scene includes: determining a projection relation based on the position parameters of the camera in the virtual scene and the position parameters of the camera in the real indoor scene; and projecting the target real object in the fusion image to the corresponding position in the virtual scene based on the projection relation.
In the embodiment of the present invention, optionally, the position parameters of the cameras in the virtual scene correspond one to one to the position parameters of the cameras in the real indoor scene; they may be completely identical, or follow another correspondence. A projection relation is determined from this correspondence, and each pixel of the fused image is projection-converted according to the corresponding coordinate transformation formula, which converts between the two three-dimensional coordinate systems, so that the target physical object in the fused image is projected to the corresponding position in the virtual scene.
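A minimal numeric sketch of this coordinate conversion, assuming the correspondence between the real indoor coordinate system and the virtual-scene coordinate system can be written as a rigid transform with rotation R and translation t; the function name and the sample values are illustrative and not taken from the patent.

```python
import numpy as np

def real_to_virtual(points, R, t):
    """Map Nx3 points of the real indoor scene into the virtual-scene coordinate system."""
    return points @ R.T + t

# Identity rotation and zero translation model the 1:1 case in which the two
# coordinate systems coincide, as described in the text above.
R = np.eye(3)
t = np.zeros(3)
virtual_pts = real_to_virtual(np.array([[2.0, 0.5, 1.8]]), R, t)
```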
Therefore, the projection relation is determined based on the position parameter of the camera in the virtual scene and the position parameter of the camera in the real indoor scene, and the target real object in the fusion image is projected to the corresponding position in the virtual scene based on the projection relation, so that the virtual scene can correspond to the real indoor scene, and the reality and the realistic experience can be enhanced.
In the embodiment of the present invention, optionally, while projecting the target physical object in the fused image to the corresponding position in the virtual scene, an editing tool may be used for editing, including cropping (that is, smearing out portions that are not part of the main target or that harm the visual effect because the change of field angle made the projection mismatch), perspective transformation, and color difference matching (that is, removing color casts, eliminating seams, and adjusting the colors at the corner points according to the colors of the video so as to reduce color differences), making the fusion more stable.
According to the technical scheme provided by the embodiment of the invention, videos collected by at least two cameras are acquired, at least two frames of images containing the same target physical object are determined in the collected videos, the at least two frames are fused into one frame based on the position parameters of the at least two cameras to obtain a fused image, and the target physical object in the fused image is projected to the corresponding position in the virtual scene. In other words, the position parameters of the at least two cameras are added when the frames are fused, and after the fusion is completed the target physical object in the fused image is projected into the virtual scene. Images can thus be fused quickly, real images and the virtual scene can be superimposed on each other, the interactivity between the virtual scene and reality is effectively increased, and the visual experience is enhanced.
Fig. 2 is a flowchart of a scene video fusion method provided in an embodiment of the present invention, where in the embodiment of the present invention, optionally, the fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image includes: determining an overlapping region in the at least two frames of images based on the position parameters of the at least two cameras; and fusing the target object in the overlapping area in each frame of image by adopting an image fusion algorithm, and filling the non-overlapping area in the at least two frames of images to obtain the fused image.
As shown in fig. 2, the technical solution provided by the embodiment of the present invention includes:
s210: acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by the corresponding at least two cameras.
S220: determining an overlapping area in the at least two frames of images based on the position parameters of the at least two cameras.
In this embodiment of the present invention, optionally, the overlapping area of the at least two frames of images may be calculated from the position parameters of the at least two cameras. A specific determination method is: divide each of the at least two frames of images into 10 × 10 small units, and compute the world coordinates of the center point of each corresponding small unit in the frames; if the world coordinates coincide, the unit belongs to an overlapping region, and if they do not coincide, it does not.
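A small sketch of the 10 × 10 cell test described above, assuming each camera's position parameter can be expressed as a 3 × 3 homography that maps image pixels onto a common world plane; the function names, the distance tolerance and the grid size are illustrative, not the patent's exact procedure.

```python
import numpy as np

def cell_centers(shape, grid=10):
    """Pixel coordinates of the centers of a grid x grid tiling of an image."""
    h, w = shape[:2]
    ys = (np.arange(grid) + 0.5) * (h / grid)
    xs = (np.arange(grid) + 0.5) * (w / grid)
    return np.array([(x, y) for y in ys for x in xs], dtype=np.float64)

def to_world(points, H):
    """Project pixel points onto the world plane with homography H."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    wpts = pts @ H.T
    return wpts[:, :2] / wpts[:, 2:3]

def overlap_mask(shape_a, H_a, shape_b, H_b, grid=10, tol=0.05):
    """Mark cells of image A whose world coordinates coincide with some cell of B."""
    wa = to_world(cell_centers(shape_a, grid), H_a)
    wb = to_world(cell_centers(shape_b, grid), H_b)
    # A cell of A is counted as overlapping if any cell center of B lands
    # within `tol` world units of it.
    d = np.linalg.norm(wa[:, None, :] - wb[None, :, :], axis=2)
    return (d.min(axis=1) < tol).reshape(grid, grid)
```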
S230: and fusing the target object in the overlapping area in each frame of image by adopting an image fusion algorithm, and filling the non-overlapping area in the at least two frames of images to obtain the fused image.
In this embodiment of the present invention, optionally, the RGB values of the non-overlapping areas of the at least two frames of images are set to 1, and the frames are then input into an image fusion algorithm. The image fusion process may include: feature point extraction (extracting feature points from each input image, each feature point carrying two attributes, a position and a feature vector); image registration (after registering the attributes of the corresponding feature points in the input images, computing, with a specific algorithm, the group of images with the highest confidence and the most feature points); obtaining the projection matrix from one frame to the other (that is, stitching the images by transforming one frame relative to the other); filling the non-overlapping area, namely the area whose RGB value is 1, with the corresponding area of the original image; post stitching, namely removing the black regions that appear during fusion; and finally outputting the fused image.
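The following compressed sketch walks through the same flow with OpenCV primitives (feature point extraction, registration, projection matrix, warping and filling). It is only an illustration of the general pipeline, not the patent's improved algorithm; ORB and RANSAC are assumptions standing in for the unspecified feature and registration steps.

```python
import cv2
import numpy as np

def fuse_pair(img_a, img_b):
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)

    orb = cv2.ORB_create(2000)                        # feature point extraction
    kp_a, des_a = orb.detectAndCompute(gray_a, None)  # position + feature vector
    kp_b, des_b = orb.detectAndCompute(gray_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:200]

    src = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # projection matrix B -> A

    h, w = img_a.shape[:2]
    canvas = cv2.warpPerspective(img_b, H, (w * 2, h))    # re-project B onto a wider canvas
    canvas[0:h, 0:w] = img_a                              # fill with the original image A
    return canvas
```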
In this way, the target object in the overlapping region of each frame is fused with an image fusion algorithm and the non-overlapping regions of the at least two frames are filled to obtain the fused image, so feature points need to be extracted only for the overlapping region; the workload of extracting features from whole frames is greatly reduced, the efficiency of image fusion is improved, and fusion is faster.
S240: and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
Fig. 3 is a flowchart of a scene video fusion method provided in an embodiment of the present invention. In the embodiment of the present invention, optionally, the method further includes: matting out the physical objects other than the target physical object in the fused image, and projecting the matted images of those objects into the virtual scene.
Optionally, before matting out the physical objects other than the target physical object in the fused image, the method further includes: inputting images that contain those other physical objects in the real indoor scene corresponding to the virtual scene, together with the mask files of the corresponding physical objects, into a semantic segmentation model, and training the semantic segmentation model.
As shown in fig. 3, the technical solution provided by the embodiment of the present invention includes:
s310: acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by the corresponding at least two cameras.
S320: and fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image.
S330: and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
S340: and inputting the image containing the other real object in the real indoor scene corresponding to the virtual scene and the corresponding mask file of the real object into a semantic segmentation model, and training the semantic segmentation model.
In the embodiment of the present invention, optionally, the semantic segmentation model is a machine learning model whose structure may be a neural network or another structure. Before the semantic segmentation model is used, it is trained: images in the training set that contain other physical objects in the real indoor scene corresponding to the virtual scene, together with the mask files of the corresponding physical objects, are input into the semantic segmentation model. A mask file records the category to which each pixel of such an image belongs, for example which physical object the pixel belongs to. Specifically, the training images and the mask files are input into the semantic segmentation model to obtain the category of each pixel; the mask files and the predicted categories are fed into a loss function; whether the model needs further optimization is judged from the output of the loss function; and when the output of the loss function meets a preset condition, training stops and the trained semantic segmentation model is obtained.
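A minimal PyTorch-style sketch of that training loop: scene images and their per-pixel mask files go through a segmentation network, the loss compares the predicted class of each pixel with the mask, and training stops once the loss meets a preset condition. The network choice (DeepLabV3), the optimizer and the threshold are placeholders, not components named by the patent.

```python
import torch
import torch.nn as nn
import torchvision

def train_segmentation(dataloader, num_classes, loss_threshold=0.05, epochs=50):
    model = torchvision.models.segmentation.deeplabv3_resnet50(num_classes=num_classes)
    criterion = nn.CrossEntropyLoss()            # per-pixel classification loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        for images, masks in dataloader:         # masks: class index per pixel (N x H x W)
            logits = model(images)["out"]        # N x C x H x W class scores
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:         # the "preset condition" in the text
            break
    return model
```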
In this way, images containing the other physical objects in the real indoor scene corresponding to the virtual scene and the mask files of the corresponding physical objects are used to train the semantic segmentation model, so the trained model can recognize the physical objects in such images and provide accurate, reliable matting content for the subsequent matting process.
S350: and carrying out cutout on other real object except the target real object in the fused image, and projecting the cutout image of the other real object to the virtual scene.
In the embodiment of the invention, optionally, the other physical objects are complex objects. If they were simply projected into the virtual scene, the result would not blend with the surrounding environment of the virtual scene and would not meet the requirement that scene video fusion fit naturally. In this case the images containing the other physical objects need to be matted to identify those objects, and the matted objects are then projected to the corresponding positions of the virtual scene using the projection technique. The matting can be done with the trained semantic segmentation model, with a neural-network AI matting technique trained on specific indoor targets, or in other ways.
In an implementation manner of the embodiment of the present invention, optionally, matting the physical objects in the fused image other than the target physical object includes: matting the physical objects other than the target physical object in the fused image through a semantic segmentation model.
In the embodiment of the present invention, optionally, the trained semantic segmentation model is obtained by inputting images containing the other physical objects in the real indoor scene corresponding to the virtual scene, together with the mask files of the corresponding physical objects, into the semantic segmentation model; it can therefore recognize the other physical objects in such images and mat them out, providing a reliable and accurate image source for better fusion of scene videos.
Therefore, by matting the physical objects other than the target physical object in the fused image and projecting the matted images into the virtual scene, the virtual scene and the picture of the real indoor scene can be kept synchronized, achieving a stable and smooth scene video fusion effect.
Fig. 4 is a flowchart of a scene video fusion method provided in an embodiment of the present invention. In fig. 4, the black points in the real indoor scene are the fusion targets, and the black points in the virtual scene are the fused projections. As shown in fig. 4, the technical solution provided by the embodiment of the present invention includes the following steps:
1. Acquiring a video: video data access in conventional formats is supported, including avi, mp4, rtsp, m3u8 and the like.
2. Acquiring images: a video is essentially a sequence of continuous images. A complete video contains multiple frames together with their motion information, and each video has a frame rate; a 60-frame video, i.e. 60 frames/second, actually plays 60 images every second, which is why video is also called a video stream. Because the temporal resolution of the human eye is limited, when the number of frames seen per unit time exceeds a certain value, the eye perceives the pictures as moving. So a video is, in practice, a set of images, and processing a video means processing each frame. For video fusion, a video is split into the corresponding number of images according to its frame count and processed frame by frame.
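A short sketch of this point with OpenCV: VideoCapture reads a video file or stream (for example an rtsp URL) frame by frame, so the rest of the pipeline can treat every frame as an ordinary image. The generator below is illustrative.

```python
import cv2

def frames(source):
    """Yield successive frames of a video file or stream such as an rtsp URL."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:            # end of file or broken stream
                break
            yield frame           # each frame is processed like a still image
    finally:
        cap.release()
```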
3. Image transformation: the image transformation algorithm based on the OpenCV mainly comprises affine transformation and perspective transformation.
Affine transformation and perspective transformation are significant in image restoration and in processing local changes of an image. Affine transformation is used more for 2D planes, and perspective transformation more for 3D. The principles and results of the two transformations are similar, but the appropriate one should be chosen for each scenario. Mathematically, both are computed as the product of a coordinate vector and a transformation matrix, in other words a matrix operation. At the application level, an affine transformation transforms an image based on 3 fixed vertices, as shown in fig. 5(a) and 5(b): fig. 5(a) is the original image before affine transformation and fig. 5(b) is the output after it; the fixed vertices are the black points 51 in the figures, their pixel values are unchanged by the transformation, and the image is transformed around them according to the transformation rule. Similarly, a perspective transformation transforms an image based on 4 fixed vertices, as shown in fig. 6(a) and 6(b): fig. 6(a) is the original image before perspective transformation and fig. 6(b) is the output after it; the fixed vertices are the black points 61 in the figures, their pixel values are unchanged, and the whole image is transformed according to the transformation rule. In OpenCV, the ready-made functions for affine and perspective transformation are warpAffine and warpPerspective respectively, and the two functions have the same calling form. Based on these image transformation algorithms, the images of multiple cameras are transformed into the same coordinate system.
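A brief sketch of the two OpenCV calls mentioned above: an affine warp defined by 3 point pairs and a perspective warp defined by 4 point pairs. The file name and the sample coordinates are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("frame.jpg")   # placeholder input frame
h, w = img.shape[:2]

# Affine: 3 fixed vertices before and after the transform.
src3 = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
dst3 = np.float32([[0, h * 0.1], [w * 0.9, 0], [w * 0.1, h * 0.9]])
affine = cv2.warpAffine(img, cv2.getAffineTransform(src3, dst3), (w, h))

# Perspective: 4 fixed vertices, e.g. to rectify an obliquely viewed frame.
src4 = np.float32([[50, 60], [w - 40, 80], [w - 30, h - 50], [40, h - 60]])
dst4 = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
perspective = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src4, dst4), (w, h))
```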
4. Image fusion: the image fusion is a method for splicing at least two frames of images with overlapped areas in the same scene into a larger image, and has important significance in the field of digital twins. The union of all input images is the output of image stitching, as shown in fig. 7, image fusion is mainly performed according to the following procedures: 1) inputting an image; 2) extracting characteristics; 3) image registration; 4) random sample consensus (RANSAC); 5) deformation fusion; 6) and outputting the image.
Feature extraction is the detection of feature points in all input images for image registration; a geometric correspondence between the images must be established so that they can be compared, transformed and analysed in a common reference frame. The methods can be roughly divided into: algorithms that directly use the pixel values of the image; algorithms that work in the frequency domain, such as FFT-based methods; algorithms based on low-level features, typically edges and corners, e.g. feature-based methods; and algorithms based on high-level features, such as graph-theoretic methods, which are typically used to overlay parts of the image objects.
The feature point extraction step matches elements of two input frames within image blocks, which are groups of pixels in the images. Because pixel intensities alone are often very similar, they are not enough for accurate feature matching. To obtain better feature matching for an image pair, corner matching is used as a quantitative measure. Corners are good matching features: they are stable when the viewpoint changes, and the intensity near a corner changes abruptly. Corner detection algorithms are used to detect the corners of the images.
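As a small illustration of the corner detection step, the sketch below runs OpenCV's Harris detector on a grayscale frame; the file name and the response threshold are arbitrary choices, not values from the patent.

```python
import cv2
import numpy as np

gray = np.float32(cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY))
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
corners = np.argwhere(response > 0.01 * response.max())   # (row, col) of detected corners
```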
Deformation fusion comprises image deformation and image blending. Image deformation means re-projecting one frame and placing the projected image on a larger canvas. Image blending means changing the gray levels of the images near the boundary and removing the seams to create a blended image, achieving a smooth transition between the images. A blend mode is used to merge the two layers together.
Compared with a general image fusion algorithm, the improved image fusion algorithm adds the position parameters of the cameras, which speeds up fusion, reduces fusion distortion and enlarges the fused field of view; an ordinary graphics card can handle fusion at 1080 × 30.
5. Projection: the main projection transformation for fusion can be calculated from the fused image and the position parameters of the cameras. If the target object is blocked or deformed during actual projection, the deviation can be corrected by manual adjustment.
The projection principle is that each pixel of the image undergoes a projection conversion (a conversion between two three-dimensional coordinate systems) and is then written to the corresponding position of a new image. The actual algorithm computes the size of the output grid, uses the coordinate conversion formula to relate each pixel of the output grid to the corresponding pixel of the source image, and samples the source to produce the output.
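A minimal sketch of that principle: for every pixel of the output grid, an assumed 3 × 3 projection matrix P is inverted to find the corresponding source pixel, and cv2.remap performs the sampling. This is an illustration of the per-pixel mapping idea, not the patent's exact projection algorithm.

```python
import cv2
import numpy as np

def project(src, P, out_size):
    """Resample `src` onto an output grid of size out_size=(w, h) through matrix P."""
    w, h = out_size
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    ones = np.ones_like(xs)
    inv = np.linalg.inv(P).astype(np.float32)   # output pixel -> source pixel
    sx = inv[0, 0] * xs + inv[0, 1] * ys + inv[0, 2] * ones
    sy = inv[1, 0] * xs + inv[1, 1] * ys + inv[1, 2] * ones
    sw = inv[2, 0] * xs + inv[2, 1] * ys + inv[2, 2] * ones
    map_x, map_y = sx / sw, sy / sw
    return cv2.remap(src, map_x, map_y, cv2.INTER_LINEAR)  # sampling output
```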
Deviation correction: in the actual projection process, if the target object is blocked or deformed, the deviation can be corrected by manual adjustment. Image deviation correction is mainly of two types: radial distortion correction and tangential distortion correction.
Radial distortion correction: the distortion is minimal at the very center and increases with the radius; it includes pincushion and barrel distortion correction. The process of radial distortion correction is described with reference to fig. 8(a), 8(b) and 8(c): fig. 8(a) is an original image before radial distortion correction, fig. 8(b) is the effect of pincushion distortion correction on the original image, and fig. 8(c) is the effect of barrel distortion correction on the original image. The formula for radial distortion correction is as follows (the first three terms of the Taylor series expansion):
x_dr = x · (1 + k1·r² + k2·r⁴ + k3·r⁶)
y_dr = y · (1 + k1·r² + k2·r⁴ + k3·r⁶)
where (x, y) are the ideal coordinates, (x_dr, y_dr) are the distorted pixel coordinates, and r² = x² + y².
Tangential distortion deviation correction: a perspective-like transformation occurs when the lens is not parallel to the imaging plane. The tangential distortion deviation correction process is described with reference to fig. 9(a) and 9(b), where fig. 9(a) is an original image before the tangential distortion deviation correction, and fig. 9(b) is an effect diagram after the tangential distortion deviation correction is performed on the original image. The formula for tangential distortion deviation correction is as follows:
x_dt = x + [2·p1·x·y + p2·(r² + 2x²)]
y_dt = y + [p1·(r² + 2y²) + 2·p2·x·y]
where (x_dt, y_dt) are the pixel coordinates after tangential distortion and r² = x² + y².
both distortion correction types are ultimately attributed to five parameters: k is a radical of1,k2,k3,p1,p2(ii) a Knowing these five parameters, the aberration correction of the aberration can be accomplished.
The fused image has no projection parameters corresponding to an actual camera. The video editing tool Video Mixer Editor can pre-compute a projection matrix to project the fused image to the corresponding position of the virtual scene, after which the projected image is edited, including cropping, perspective transformation and color difference matching; the colors at the corner points can be adjusted according to the colors of the video to reduce color differences and make the fusion smoother and more natural.
6. Matting: for physical objects that cannot meet the requirements of fusion with the virtual scene by projection alone, matting can better obtain the outline of a complex target object and give a good projection effect. The matting can be done with the trained semantic segmentation model, or with an AI matting neural network trained for specific indoor targets and serving virtual scene projection.
Semantic segmentation classifies pixels into multiple categories according to their semantics, end to end. Before a target picture is matted with the semantic segmentation model, images containing the other physical objects in the real indoor scene corresponding to the virtual scene, together with the mask files of the corresponding physical objects, must first be input into the semantic segmentation model to train it. The complete semantic segmentation process is described with reference to fig. 10, 11, 12 and 13: fig. 10 is an original example image, fig. 11 is the corresponding semantic segmentation mask, fig. 12 is the semantic segmentation foreground, and fig. 13 is the semantic segmentation composite.
Obtaining the outer contour of a complex target object is a strength of the AI matting technique, and the resulting projection effect is good. Matting divides a picture into a foreground part and a background part and then extracts the foreground. The alpha channel in AI matting can be understood as transparency, and the AI matting model models the matting with the formula: I = αF + (1 - α)B, where I is the observed image, F is the foreground, B is the background, and α is the transparency. The observed image is a linear mixture of foreground and background, and α controls how much each part contributes. F and B can be understood as layers in PhotoShop, with I the superposition of the two layers. The final goal of AI matting is to obtain α. This α is not a simple binary class of 0 and 1, nor a simple multi-class label; in practice α is treated as a channel (the alpha channel) whose values are integers in [0, 255], like the RGB color space. From this perspective, a semantic segmentation task with only two semantics (foreground and background) predicts 0 or 1 for each pixel, i.e. a binary classification, whereas AI matting predicts an integer from 0 to 255 for each pixel, a higher accuracy requirement. AI matting is therefore a higher-grade semantic segmentation task and more difficult.
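A direct numeric illustration of the compositing equation I = αF + (1 - α)B: given a foreground, a new background of the same size and an alpha channel stored as integers in [0, 255], the composite is a per-pixel linear blend. The file names are placeholders.

```python
import cv2
import numpy as np

fg = cv2.imread("foreground.png").astype(np.float32)
bg = cv2.imread("background.png").astype(np.float32)
alpha = cv2.imread("alpha.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
alpha = alpha[:, :, None]                       # broadcast over the 3 color channels

composite = alpha * fg + (1.0 - alpha) * bg     # I = alpha*F + (1 - alpha)*B
cv2.imwrite("composite.png", composite.astype(np.uint8))
```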
The working process of AI matting is explained with reference to fig. 14 and fig. 15: fig. 14 is the alpha channel map obtained by AI matting and fig. 15 is a composite example. As shown in the figures, the AI-matted composite keeps part of the original background as a semi-transparent area, so the result looks more natural. Semantic segmentation and AI matting differ, but they are often handled together. The core of AI matting is computing the alpha channel map with an algorithm.
The technical scheme provided by the embodiment of the invention supports image acquisition, image transformation, image fusion, projection and matting. Camera video stream data and the position parameters of the cameras are acquired, the video streams of at least two cameras are fused, and the fused image is then projected into the virtual scene. Virtual scene video fusion is a graphics and image technology that quickly reproduces a "dynamically enhanced virtual environment" of a real scene at the early stage of digital twin technology; combined with 5G high-speed transmission and the image fusion algorithm, it supports multi-camera, multi-angle image fusion and effectively enhances the sense of presence.
Fig. 16 is a schematic diagram of a scene video fusion system architecture to which an embodiment of the present invention is applicable, where the system includes: an image acquisition module 1610, an image fusion module 1620, and a projection module 1630.
The image obtaining module 1610 is configured to obtain videos collected by at least two cameras, and determine at least two frames of images in the collected videos that include the same target physical object; the at least two frames of images are respectively from videos collected by at least two corresponding cameras; an image fusion module 1620, configured to fuse the at least two frames of images into one frame of image based on the at least two camera position parameters, so as to obtain a fused image; a projection module 1630, configured to project the target real object in the fused image to a corresponding position in a virtual scene.
In an exemplary embodiment, the system further includes a matting module, configured to matte other physical objects in the fused image except the target physical object, and project the matte images of the other physical objects into the virtual scene.
In an exemplary embodiment, the matting the physical objects in the fused image except the target physical object includes: and matting other real object objects except the target real object in the fused image through a semantic segmentation model.
In an exemplary embodiment, the system further includes a model training module, configured to input an image including the other physical objects in a real indoor scene corresponding to the virtual scene and a corresponding mask file of the physical objects into a semantic segmentation model, and train the semantic segmentation model.
In an exemplary embodiment, said fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image includes: determining an overlapping region in the at least two frames of images based on the position parameters of the at least two cameras; and fusing the target object in the overlapping area in each frame of image by adopting an image fusion algorithm, and filling the non-overlapping area in the at least two frames of images to obtain the fused image.
In an exemplary embodiment, the system further includes an image transformation module, configured to perform image transformation on at least two frames of images after determining that the at least two frames of images include the same target physical object in the acquired video, so as to obtain a transformed image.
In an exemplary embodiment, the projecting the target real object in the fused image to the corresponding position in the virtual scene includes: determining a projection relation based on the position parameters of the camera in the virtual scene and the position parameters of the camera in the real indoor scene; and projecting the target real object in the fusion image to the corresponding position in the virtual scene based on the projection relation.
It should be noted that each module in the embodiment of the present invention may be configured in one device, for example, may be configured in a server.
The system provided by the embodiment can execute the scene video fusion method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 17 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 17, the electronic device includes:
one or more processors 1710, one processor 1710 being illustrated in fig. 17;
a memory 1720;
the apparatus may further include: an input device 1730 and an output device 1740.
The processor 1710, the memory 1720, the input device 1730, and the output device 1740 in the apparatus may be connected by a bus or other means, such as being connected by a bus in fig. 17.
The memory 1720, which is a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to a scene video fusion method according to an embodiment of the present invention (for example, the image obtaining module 1610, the image fusion module 1620, and the projection module 1630 shown in fig. 16). The processor 1710 executes various functional applications and data processing of the computer device by running software programs, instructions and modules stored in the memory 1720, so as to implement a scene video fusion method of the above method embodiment, that is:
acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by at least two corresponding cameras;
fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image;
and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
The memory 1720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, memory 1720 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1720 may optionally include memory located remotely from the processor 1710, and such remote memory may be coupled to the terminal device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 1740 may include a display device such as a display screen.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a scene video fusion method according to an embodiment of the present invention, that is:
acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by at least two corresponding cameras;
fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image;
and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A scene video fusion method is characterized by comprising the following steps:
acquiring videos acquired by at least two cameras, and determining at least two frames of images containing the same target object in the acquired videos; the at least two frames of images are respectively from videos collected by at least two corresponding cameras;
fusing the at least two frames of images into one frame of image based on the at least two camera position parameters to obtain a fused image;
and projecting the target real object in the fusion image to a corresponding position in a virtual scene.
2. The method according to claim 1, wherein said fusing the at least two frames of images into one frame of image based on the at least two camera position parameters, resulting in a fused image, comprises:
determining an overlapping region in the at least two frames of images based on the position parameters of the at least two cameras;
and fusing the target object in the overlapping area in each frame of image by adopting an image fusion algorithm, and filling the non-overlapping area in the at least two frames of images to obtain the fused image.
3. The method of claim 1, further comprising:
and matting out the other real objects except the target real object in the fused image, and projecting the matted images of the other real objects into the virtual scene.
4. The method of claim 1, wherein after determining at least two frames of images in the captured video that contain the same target physical object, further comprising:
and carrying out image transformation on the at least two frames of images to obtain a transformed image.
5. The method according to claim 3, wherein the matting of the physical objects other than the target physical object in the fused image comprises:
and matting the other real objects except the target real object in the fused image through a semantic segmentation model.
6. The method according to claim 1, further comprising:
inputting images containing the other physical objects in the real indoor scene corresponding to the virtual scene, together with the corresponding mask files of those physical objects, into a semantic segmentation model, and training the semantic segmentation model.
7. The method according to claim 1, wherein projecting the target physical object in the fused image to the corresponding position in the virtual scene comprises:
determining a projection relation based on the position parameters of the camera in the virtual scene and the position parameters of the camera in the real indoor scene; and
projecting the target physical object in the fused image to the corresponding position in the virtual scene based on the projection relation.
8. A scene video fusion system, comprising:
an image acquisition module, configured to acquire videos captured by at least two cameras and determine at least two frames of images containing the same target physical object in the captured videos, wherein the at least two frames of images respectively come from the videos captured by the corresponding at least two cameras;
an image fusion module, configured to fuse the at least two frames of images into one frame of image based on position parameters of the at least two cameras to obtain a fused image; and
a projection module, configured to project the target physical object in the fused image to a corresponding position in a virtual scene.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
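To make the fusion recited in claims 1, 2 and 4 more concrete, the following is a minimal illustrative sketch in Python, assuming OpenCV and NumPy are available; it is not the claimed implementation. The 3x3 homography stands in for the "position parameters of the at least two cameras" (in practice it could be derived from camera calibration), the function name is hypothetical, and the 50/50 blend in the overlap is only one of many image fusion algorithms the claims leave open.

import cv2
import numpy as np

def fuse_two_frames(img_a, img_b, homography_b_to_a):
    """Warp img_b into img_a's image plane and blend the two frames into one."""
    h, w = img_a.shape[:2]
    # Image transformation of the second frame (cf. claim 4): warp into the first camera's view.
    warped_b = cv2.warpPerspective(img_b, homography_b_to_a, (w, h))

    # Determine the overlapping region (cf. claim 2): pixels covered by both frames.
    mask_a = img_a.sum(axis=2) > 0
    mask_b = warped_b.sum(axis=2) > 0
    overlap = mask_a & mask_b

    fused = np.zeros_like(img_a)
    # Fill the non-overlapping regions directly from whichever frame covers them.
    fused[mask_a & ~overlap] = img_a[mask_a & ~overlap]
    fused[mask_b & ~overlap] = warped_b[mask_b & ~overlap]
    # Blend the overlap 50/50; the claims leave the exact fusion algorithm open.
    fused[overlap] = ((img_a[overlap].astype(np.float32) +
                       warped_b[overlap].astype(np.float32)) / 2).astype(img_a.dtype)
    return fused

# Hypothetical usage: H would normally come from calibrating the two cameras.
# H = np.array([[1.0, 0.02, -40.0], [0.01, 1.0, 5.0], [0.0, 0.0, 1.0]])
# fused_frame = fuse_two_frames(frame_cam1, frame_cam2, H)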
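The matting recited in claims 3 and 5 relies on a semantic segmentation model. The sketch below is a hypothetical illustration only: it uses torchvision's pretrained DeepLabV3 merely as a stand-in (assuming torchvision >= 0.13; older versions use pretrained=True), whereas claim 6 contemplates a model trained on images of the real indoor scene and their mask files. The function name and the class-id convention are assumptions, not part of the disclosure.

import numpy as np
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

def matte_other_objects(frame_rgb, target_class_id, model=None):
    """Return an RGBA cutout in which only objects other than the target class are opaque.

    frame_rgb is an H x W x 3 uint8 RGB array; target_class_id follows the label set
    of whatever segmentation model is supplied (hypothetical here).
    """
    # Stand-in model; a model trained as in claim 6 could be passed in instead.
    model = model or deeplabv3_resnet50(weights="DEFAULT").eval()
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    with torch.no_grad():
        logits = model(preprocess(frame_rgb).unsqueeze(0))["out"][0]
    labels = logits.argmax(0).numpy()                  # per-pixel class ids
    # Keep everything that is neither background (class 0) nor the target object itself.
    other_mask = (labels != 0) & (labels != target_class_id)
    alpha = (other_mask * 255).astype(np.uint8)
    return np.dstack([frame_rgb, alpha])               # alpha channel marks the matted objects

For claim 6, the same kind of model would instead be fine-tuned on indoor-scene images paired with their mask files before being used for matting.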
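Claim 7's "projection relation" between the camera in the real indoor scene and the camera in the virtual scene can, for a planar floor, be illustrated as a homography estimated from corresponding points. The four marker correspondences below are hypothetical calibration data chosen for illustration; a full implementation could equally derive the relation from the cameras' intrinsic and extrinsic parameters.

import cv2
import numpy as np

# Pixel positions of four floor markers in the real camera image (assumed known).
real_pts = np.float32([[210, 640], [1105, 655], [980, 980], [300, 990]])
# The same markers' coordinates on the virtual scene's ground plane (scene units).
virtual_pts = np.float32([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])

# Determine the projection relation from the two cameras' position parameters.
projection_relation = cv2.getPerspectiveTransform(real_pts, virtual_pts)

def project_to_virtual(foot_point_px):
    """Map the target object's foot point (pixels) to its position in the virtual scene."""
    pt = np.float32([[foot_point_px]])                 # shape (1, 1, 2), as OpenCV expects
    return cv2.perspectiveTransform(pt, projection_relation)[0, 0]

# Maps the foot point somewhere inside the 4 x 3 virtual floor area defined above.
print(project_to_virtual((660, 820)))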
CN202011536124.3A 2020-12-23 2020-12-23 Scene video fusion method and system, electronic equipment and storage medium Pending CN112712487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011536124.3A CN112712487A (en) 2020-12-23 2020-12-23 Scene video fusion method and system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011536124.3A CN112712487A (en) 2020-12-23 2020-12-23 Scene video fusion method and system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112712487A true CN112712487A (en) 2021-04-27

Family

ID=75545352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011536124.3A Pending CN112712487A (en) 2020-12-23 2020-12-23 Scene video fusion method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112712487A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016086754A1 (en) * 2014-12-03 2016-06-09 中国矿业大学 Large-scale scene video image stitching method
WO2017020256A1 (en) * 2015-08-04 2017-02-09 深圳迈瑞生物医疗电子股份有限公司 Three-dimensional ultrasonic fluid imaging method and system
US20200387740A1 (en) * 2017-07-04 2020-12-10 Bejing Jingdong Shangke Information Technology Co., Ltd. Method and apparatus for fusing object into panoramic video
CN110147770A (en) * 2019-05-23 2019-08-20 北京七鑫易维信息技术有限公司 A kind of gaze data restoring method and system
CN110211661A (en) * 2019-06-05 2019-09-06 山东大学 Hand functional training system and data processing method based on mixed reality
CN111729283A (en) * 2020-06-19 2020-10-02 杭州赛鲁班网络科技有限公司 Training system and method based on mixed reality technology

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220251A (en) * 2021-05-18 2021-08-06 北京达佳互联信息技术有限公司 Object display method, device, electronic equipment and storage medium
CN113220251B (en) * 2021-05-18 2024-04-09 北京达佳互联信息技术有限公司 Object display method, device, electronic equipment and storage medium
CN114710703A (en) * 2022-03-29 2022-07-05 稿定(厦门)科技有限公司 Live broadcast method and device with variable scenes
WO2023207452A1 (en) * 2022-04-28 2023-11-02 腾讯科技(深圳)有限公司 Virtual reality-based video generation method and apparatus, device, and medium
WO2024041181A1 (en) * 2022-08-26 2024-02-29 杭州群核信息技术有限公司 Image processing method and apparatus, and storage medium
CN117336459A (en) * 2023-10-10 2024-01-02 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11721071B2 (en) Methods and systems for producing content in multiple reality environments
CN112712487A (en) Scene video fusion method and system, electronic equipment and storage medium
He et al. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline
CN111325693B (en) Large-scale panoramic viewpoint synthesis method based on single viewpoint RGB-D image
KR20180087918A (en) Learning service Method of virtual experience for realistic interactive augmented reality
CN111383204A (en) Video image fusion method, fusion device, panoramic monitoring system and storage medium
CN114143528A (en) Multi-video stream fusion method, electronic device and storage medium
CN112734914A (en) Image stereo reconstruction method and device for augmented reality vision
CN112562056A (en) Control method, device, medium and equipment for virtual light in virtual studio
CN108986232A (en) A method of it is shown in VR and AR environment picture is presented in equipment
CN113807451A (en) Panoramic image feature point matching model training method and device and server
CN115270184A (en) Video desensitization method, vehicle video desensitization method and vehicle-mounted processing system
CN113132708B (en) Method and apparatus for acquiring three-dimensional scene image using fisheye camera, device and medium
CN114358112A (en) Video fusion method, computer program product, client and storage medium
Wei et al. Simulating shadow interactions for outdoor augmented reality with RGBD data
WO2023217138A1 (en) Parameter configuration method and apparatus, device, storage medium and product
JP7078564B2 (en) Image processing equipment and programs
US20230131418A1 (en) Two-dimensional (2d) feature database generation
Leung et al. Realistic video avatar
US20240161388A1 (en) Hair rendering system based on deep neural network
Wang et al. Real‐time fusion of multiple videos and 3D real scenes based on optimal viewpoint selection
KR102561903B1 (en) AI-based XR content service method using cloud server
KR102496362B1 (en) System and method for producing video content based on artificial intelligence
EP4303817A1 (en) A method and an apparatus for 360-degree immersive video
CN114818992B (en) Image data analysis method, scene estimation method and 3D fusion method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination