CN116996742A - Video fusion method and system based on three-dimensional scene - Google Patents

Video fusion method and system based on three-dimensional scene

Info

Publication number
CN116996742A
CN116996742A (application CN202310884577.2A)
Authority
CN
China
Prior art keywords
video image
standard
video
coordinates corresponding
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310884577.2A
Other languages
Chinese (zh)
Other versions
CN116996742B (en)
Inventor
石立阳
曹琪
黄星淮
祝昌宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Technology Guangzhou Co ltd
Original Assignee
Digital Technology Guangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Technology Guangzhou Co ltd filed Critical Digital Technology Guangzhou Co ltd
Priority to CN202310884577.2A priority Critical patent/CN116996742B/en
Publication of CN116996742A publication Critical patent/CN116996742A/en
Application granted granted Critical
Publication of CN116996742B publication Critical patent/CN116996742B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a video fusion method based on a three-dimensional scene. It is more efficient than approaches that calibrate the physical camera or manually adjust virtual camera parameters to achieve fusion, requires no carrier, and greatly improves the efficiency of video fusion. The application is a new video fusion technology. Many existing video fusion technologies on the market suffer from complex operation, applicability only to idealized scenes, strong limitations, and poor fusion quality. In contrast, the technology of the application can fuse a video automatically, quickly, and accurately using only four pairs of standard point coordinates (the pixel coordinates on the video image and the corresponding world coordinates on the three-dimensional live-action model), achieves a good fusion effect, and greatly reduces the cost of fusing video onto a three-dimensional live-action model.

Description

Video fusion method and system based on three-dimensional scene
Technical Field
The application relates to the technical field of image processing, in particular to a video fusion method and system based on a three-dimensional scene.
Background
Video fusion technology plays an important role in the digital twin field for smart cities. It meets the need, in smart-city business scenarios, to project real-time surveillance video onto three-dimensional real-scene model data, achieving a virtual-real fusion effect, and is widely used in fields such as security and unmanned inspection. Automatically or semi-automatically projecting the video onto the three-dimensional live-action model data is the first and most critical step in achieving the video fusion effect. Several video fusion technologies already exist on the market. For example, Chinese patent 202211528984.1 discloses a video fusion method, device, electronic apparatus and storage medium, in which a three-dimensional model is loaded into a GIS system to construct a virtual scene resembling reality, the real-time surveillance video is projected into the GIS system, the video is irregularly clipped, and the clipped video is fused into the constructed virtual scene.
However, during fusion the above method is constrained by the shape of the three-dimensional model, and problems such as the video penetrating the model or being duplicated easily occur, resulting in a poor user experience.
Disclosure of Invention
In the video fusion technology of the application, standard points are sampled from a video key frame, the position and pose of the video's virtual camera in the live-action three-dimensional scene are then calculated, and the video stream is projected into the live-action three-dimensional scene according to that position and pose, thereby realizing the video fusion effect.
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application discloses a video fusion method based on a three-dimensional scene, which comprises the following steps:
step 1, acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
step 2, acquiring a three-dimensional real-scene model to be fused with the video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
step 3, drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
step 4, obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
step 5, drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating step 4 to determine the second temporary position of the virtual camera;
and step 6, continuing to reduce the interpolation interval and repeating step 5 until the virtual camera position with the minimum Euclidean distance, i.e. the best fusion effect, is obtained, and projecting the video stream into the three-dimensional live-action model from the resulting optimal virtual camera position and orientation.
Still further, the plurality of standard points are four standard points at determined positions: the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
Still further, in drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point, the first standard point is the center point of the image and the second standard point is the bottom center point of the image.
Further, the first preset length and the second preset length are length values input by a user, the first preset length is initially set to 10 meters, and the second preset length is initially set to 1 meter.
Further, obtaining the similarity between the screen coordinates corresponding to the remaining standard points and the preset coordinate positions of the standard points on the video image by the Euclidean distance method further includes: the similarity is given by the Euclidean distance formula
√[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, and q1, q2, q3 are the corresponding preset coordinate values on the video image.
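As an illustration of this similarity measure, the following minimal Python sketch scores one candidate interpolation point by comparing the projected screen coordinates of the remaining standard points against their preset pixel coordinates. The function name, the flattening of each 2-D coordinate into the distance, and the numeric values in the example are assumptions made for clarity, not part of the disclosed method.

```python
import math
from typing import Sequence, Tuple

Point2D = Tuple[float, float]

def similarity_distance(screen_pts: Sequence[Point2D],
                        preset_pts: Sequence[Point2D]) -> float:
    """Euclidean distance between the screen coordinates of the remaining
    standard points (read back for one candidate camera position) and their
    preset pixel coordinates on the video image.

    A smaller value means a higher similarity, matching the formula
    sqrt((p1-q1)^2 + (p2-q2)^2 + (p3-q3)^2), here applied to every
    coordinate component flattened into one vector (an interpretation,
    since the formula lists a single value per point).
    """
    p = [c for pt in screen_pts for c in pt]   # flatten to (x1, y1, x2, y2, ...)
    q = [c for pt in preset_pts for c in pt]
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Example (invented values for a 1920x1080 frame): screen coordinates of the
# remaining standard points rendered from one candidate position, compared
# with their preset pixel coordinates.
if __name__ == "__main__":
    screen = [(962.0, 1075.0), (14.0, 1068.0), (1915.0, 1080.0)]
    preset = [(960.0, 1080.0), (0.0, 1080.0), (1920.0, 1080.0)]
    print(similarity_distance(screen, preset))  # lower is better
```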
The application also discloses a video fusion system based on the three-dimensional scene, which comprises the following modules:
The coordinate point selection module is used for acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
The coordinate mapping module is used for acquiring a three-dimensional real-scene model to be fused with the video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
The virtual camera initial rendering module is used for drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
The virtual camera positioning module is used for obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
The positioning updating module is used for drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating the operation of the virtual camera positioning module to determine the second temporary position of the virtual camera;
The fusion module is used for continuing to adjust the interpolation interval and repeating the function executed by the positioning updating module until the virtual camera position with the minimum Euclidean distance, i.e. the best fusion effect, is obtained, and projecting the video stream into the three-dimensional real-scene model from the resulting optimal virtual camera position and orientation.
Still further, the plurality of standard points are four standard points at determined positions: the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
Still further, in drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point, the first standard point is the center point of the image and the second standard point is the bottom center point of the image.
Further, the first preset length and the second preset length are length values input by a user, the first preset length is initially set to 10 meters, and the second preset length is initially set to 1 meter.
Further, obtaining the similarity between the screen coordinates corresponding to the remaining standard points and the preset coordinate positions of the standard points on the video image by the Euclidean distance method further includes: the similarity is given by the Euclidean distance formula
√[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, and q1, q2, q3 are the corresponding preset coordinate values on the video image.
Compared with the prior art, the application has the following beneficial effects: compared with approaches that calibrate the physical camera or manually adjust virtual camera parameters to achieve fusion, the video fusion technology provided by the application is more efficient, requires no carrier, and greatly improves video fusion efficiency. The application is a new video fusion technology. Many existing video fusion technologies on the market suffer from complex operation, applicability only to idealized scenes, strong limitations, and poor fusion quality. The video fusion technology of the application can fuse a video automatically, quickly, and accurately using only four pairs of standard point coordinates (the pixel coordinates on the video image and the corresponding world coordinates on the three-dimensional live-action model), achieves a good fusion effect, and greatly reduces the cost of fusing video onto a three-dimensional live-action model.
Drawings
The application will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the figures, like reference numerals designate corresponding parts throughout the different views.
Fig. 1 is a standard point selection diagram of a video image in an embodiment of the application.
FIG. 2 is a flow chart of implementing three-dimensional scene-based video fusion in an embodiment of the application.
FIG. 3 is a schematic diagram of the placement of the positions of virtual cameras fusing video onto these interpolation points in an embodiment of the application.
FIG. 4 is a flow chart of another implementation of three-dimensional scene-based video fusion in an embodiment of the application.
Detailed Description
The technical scheme of the application will be described in more detail below with reference to the accompanying drawings and examples.
A mobile terminal implementing various embodiments of the present application will now be described with reference to the accompanying drawings. In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only to facilitate the description of the application and have no specific meaning in themselves; thus, "module" and "component" may be used interchangeably.
Mobile terminals may be implemented in a variety of forms. For example, the terminals described in the present application may include mobile terminals such as mobile phones, smart phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), navigation devices, and the like, as well as fixed terminals such as digital TVs and desktop computers. In the following, it is assumed that the terminal is a mobile terminal; however, those skilled in the art will understand that the configuration according to the embodiments of the present application can also be applied to fixed terminals, except for elements specifically intended for mobile use.
A video fusion method based on a three-dimensional scene, as shown in figs. 1-4, comprises the following steps:
step 1, acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
step 2, acquiring a three-dimensional real-scene model to be fused with the video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
step 3, drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
step 4, obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
step 5, drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating step 4 to determine the second temporary position of the virtual camera;
and step 6, continuing to reduce the interpolation interval and repeating step 5 until the virtual camera position with the minimum Euclidean distance, i.e. the best fusion effect, is obtained, and projecting the video stream into the three-dimensional live-action model from the resulting optimal virtual camera position and orientation.
Still further, the plurality of standard points are four standard points at determined positions: the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
Still further, in drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point, the first standard point is the center point of the image and the second standard point is the bottom center point of the image.
Further, the first preset length and the second preset length are length values input by a user, the first preset length is initially set to 10 meters, and the second preset length is initially set to 1 meter.
Further, obtaining the similarity between the screen coordinates corresponding to the remaining standard points and the preset coordinate positions of the standard points on the video image by the Euclidean distance method further includes: the similarity is given by the Euclidean distance formula
√[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, and q1, q2, q3 are the corresponding preset coordinate values on the video image.
The application also discloses a video fusion system based on the three-dimensional scene, which comprises the following modules:
The coordinate point selection module is used for acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
The coordinate mapping module is used for acquiring a three-dimensional real-scene model to be fused with the video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
The virtual camera initial rendering module is used for drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
The virtual camera positioning module is used for obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
The positioning updating module is used for drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating the operation of the virtual camera positioning module to determine the second temporary position of the virtual camera;
The fusion module is used for continuing to adjust the interpolation interval and repeating the function executed by the positioning updating module until the virtual camera position with the minimum Euclidean distance, i.e. the best fusion effect, is obtained, and projecting the video stream into the three-dimensional real-scene model from the resulting optimal virtual camera position and orientation.
Still further, the plurality of standard points are four standard points at determined positions: the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
Still further, in drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point, the first standard point is the center point of the image and the second standard point is the bottom center point of the image.
Further, the first preset length and the second preset length are length values input by a user, the first preset length is initially set to 10 meters, and the second preset length is initially set to 1 meter.
Further, obtaining the similarity between the screen coordinates corresponding to the remaining standard points and the preset coordinate positions of the standard points on the video image by the Euclidean distance method further includes: the similarity is given by the Euclidean distance formula
√[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, and q1, q2, q3 are the corresponding preset coordinate values on the video image.
In this embodiment, the implementation includes the following steps:
A frame of video image is captured from the video stream, and four standard points are taken on the image: a1 (bottom-left corner of the image), a2 (bottom middle point of the image), a3 (bottom-right corner of the image) and a4 (center point of the image); see fig. 1. The corresponding world coordinates B1 (world coordinates of a1), B2 (world coordinates of a2), B3 (world coordinates of a3) and B4 (world coordinates of a4) are then taken from the three-dimensional live-action model.
A B4-B2 line is drawn in the three-dimensional live-action model, and 100 interpolation points are generated along the extension of the B4-B2 line, spaced horizontally at 10-meter intervals and also spaced vertically. The virtual camera used for video fusion is placed at each interpolation point, looking at the B4 coordinate point, and the scene is then rendered into a frame buffer; see fig. 3.
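One possible way to generate these 100 candidate camera positions is sketched below in Python. The 10 × 10 grid layout, the choice of the world +Z axis as the vertical direction, starting the extension at B2, and the numpy-based helper are assumptions made for illustration; the patent only specifies that 100 interpolation points are generated on the extension of the B4-B2 line at a 10-meter horizontal interval with vertical spacing.

```python
import numpy as np

def generate_candidates(B4: np.ndarray, B2: np.ndarray,
                        step: float = 10.0,
                        n_horizontal: int = 10,
                        n_vertical: int = 10) -> np.ndarray:
    """Candidate virtual-camera positions on the extension of the B4-B2 line.

    Candidates step outward from B2 along the unit vector B4 -> B2 (i.e. the
    extension of the line beyond B2) and are also offset vertically, so the
    camera can look down over B2 toward B4 with B2 near the bottom of the view.
    """
    direction = (B2 - B4).astype(float)
    direction /= np.linalg.norm(direction)          # unit vector B4 -> B2
    up = np.array([0.0, 0.0, 1.0])                  # assumed world up axis
    candidates = []
    for i in range(1, n_horizontal + 1):            # horizontal steps outward
        for j in range(n_vertical):                 # vertical offsets upward
            candidates.append(B2 + i * step * direction + j * step * up)
    return np.stack(candidates)                     # shape (n_horizontal * n_vertical, 3)

# Example with invented world coordinates (metres):
B4 = np.array([100.0, 50.0, 2.0])   # world point for the image centre
B2 = np.array([100.0, 40.0, 0.0])   # world point for the bottom-centre pixel
print(generate_candidates(B4, B2).shape)  # (100, 3)
```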
Screen coordinates c1, c2, c3 corresponding to B1, B2, B3 are found in the frame buffer. The Euclidean distance algorithm is used to find the degree of similarity between c1, c2, c3 and a1, a2, a3, and the interpolation point with the minimum Euclidean distance, i.e. the highest similarity, is tentatively set as the position P1 of the virtual camera.
The Euclidean distance is calculated as √[(p1-q1)² + (p2-q2)² + … + (pn-qn)²].
With P1 as the center, 100 interpolation points are generated on the extension of the B4-B2 line, along the left-right direction of the B4-B2 straight line, at horizontal intervals of 1 meter and also spaced vertically. The virtual camera is placed at each of these interpolation points, looking at the B4 coordinate point, and the scene is rendered into the frame buffer. The previous step is repeated to obtain a temporary position P2 of the virtual camera.
The interpolation interval is then successively reduced to 0.5, 0.3, 0.1 and 0.01 meters, and the previous step is repeated until the virtual camera position Pn with the minimum Euclidean distance, i.e. the best fusion effect, is obtained.
Finally, the video stream is projected into the three-dimensional real-scene model from the position and orientation of the virtual camera obtained above (note: the virtual camera looks at B4, the world coordinate point of the three-dimensional real-scene model corresponding to the center point of the video image), completing the fusion of the video into the three-dimensional real scene.
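The coarse-to-fine search described above can be summarised in a short loop. The sketch below is an illustration only; `generate_candidates_around`, `render_and_project` and `similarity_distance` are hypothetical helpers standing in for the candidate generation, rendering/read-back and Euclidean-distance scoring described in this document, and the interval schedule is the one given in the embodiment (10 m, 1 m, 0.5 m, 0.3 m, 0.1 m, 0.01 m).

```python
import numpy as np

def refine_camera_position(B4, B2, preset_pts,
                           generate_candidates_around, render_and_project,
                           similarity_distance,
                           intervals=(10.0, 1.0, 0.5, 0.3, 0.1, 0.01)):
    """Coarse-to-fine search for the virtual camera position (P1, P2, ..., Pn).

    At each interval, candidate positions are generated around the current best
    position (initially along the B4-B2 extension), each candidate is rendered
    looking at B4, the screen coordinates of the remaining standard points are
    read back, and the candidate with the smallest Euclidean distance to their
    preset pixel coordinates becomes the centre of the next, finer search.
    """
    best_pos = None
    for step in intervals:
        best_score = np.inf
        for cam_pos in generate_candidates_around(B4, B2, center=best_pos, step=step):
            screen_pts = render_and_project(cam_pos, look_at=B4)   # e.g. c1, c2, c3
            score = similarity_distance(screen_pts, preset_pts)
            if score < best_score:
                best_score, best_pos = score, cam_pos
    return best_pos   # optimal camera position; the orientation is toward B4
```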
The embodiment also discloses the following implementation steps:
Step 1: a frame of video image is captured from the video stream, and the pixel coordinates of 4 standard points are taken: a1 (bottom left), a2 (bottom middle), a3 (bottom right) and a4 (center point) of the image. The corresponding world coordinates B1 (bottom left), B2 (bottom middle), B3 (bottom right) and B4 (center point) are taken in the three-dimensional live-action model.
Step 2: a B4-B2 line is drawn in the three-dimensional live-action model, and 100 interpolation points are generated on the extension of the B4-B2 line at a horizontal interval of 10 meters and a vertical interval. A virtual camera is placed at each of these interpolation points, facing B4, and the scene is rendered to the frame buffer.
Step 3: the screen coordinates c1, c2, c3 corresponding to B1, B2, B3 are found in the frame buffer. The Euclidean distance method is used to find the degree of similarity between c1, c2, c3 and a1, a2, a3. The interpolation point with the highest similarity is tentatively set as the position P1 of the virtual camera.
Step 4: with P1 (which lies on the B4-B2 line) as the center, 100 interpolation points are generated along the left-right direction of the B4-B2 straight line at a horizontal interval of 1 meter and a vertical interval. A virtual camera is placed at each of these interpolation points, facing B4, and the scene is rendered to the frame buffer; step 3 is repeated to obtain the position P2 of the virtual camera.
Step 5: the interpolation interval is successively set to 0.5, 0.3, 0.1 and 0.01 meters, and step 4 is repeated until the optimal virtual camera position for video fusion is found.
Step 6: the video stream is projected onto the three-dimensional live-action model from the position and orientation of the virtual camera obtained in step 5, completing the fusion of the video onto the three-dimensional live-action model.
Frame buffer
A frame buffer (also called a frame cache) is a technique in computer graphics that can be used to accelerate the rendering process. During rendering, graphics data is stored in the frame buffer awaiting output to the display.
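For context, the screen coordinates of the world points B1-B3 for a given candidate camera can be obtained by the usual view/projection transform before (or instead of) reading them back from the frame buffer. The sketch below uses a generic look-at view matrix and perspective projection; the field of view, aspect ratio, image resolution and the example coordinates are placeholder assumptions, not values taken from the patent.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Right-handed look-at view matrix (camera at `eye`, looking at `target`)."""
    f = (target - eye).astype(float); f /= np.linalg.norm(f)   # forward
    s = np.cross(f, up); s /= np.linalg.norm(s)                # right
    u = np.cross(s, f)                                         # true up
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = s, u, -f
    view[:3, 3] = -view[:3, :3] @ eye
    return view

def perspective(fov_y_deg=60.0, aspect=16 / 9, near=0.1, far=5000.0):
    """OpenGL-style perspective projection matrix."""
    t = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    proj = np.zeros((4, 4))
    proj[0, 0], proj[1, 1] = t / aspect, t
    proj[2, 2] = (far + near) / (near - far)
    proj[2, 3] = 2 * far * near / (near - far)
    proj[3, 2] = -1.0
    return proj

def world_to_screen(point, eye, target, width=1920, height=1080):
    """Project a world point to pixel coordinates for a camera at `eye` looking at `target`."""
    clip = perspective() @ look_at(eye, target) @ np.append(point, 1.0)
    ndc = clip[:3] / clip[3]                      # normalised device coordinates
    x = (ndc[0] * 0.5 + 0.5) * width
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height     # flip so y grows downward, like image pixels
    return x, y

# Example with invented coordinates: a candidate camera on the extension of the
# B4-B2 line beyond B2, raised above the ground and looking at B4.
B2 = np.array([100.0, 40.0, 0.0])
B4 = np.array([100.0, 50.0, 2.0])
cam = np.array([100.0, 20.0, 10.0])
print(world_to_screen(B2, cam, B4))  # B2 lands at the horizontal centre, below the vertical centre
```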
Video fusion
Projecting the video onto the solid three-dimensional model so that the video playback effect can be viewed within the three-dimensional scene.
Euclidean distance algorithm
The Euclidean distance algorithm, also known as the Euclidean metric, is a distance measurement method commonly used in the field of machine learning. If two points p and q have coordinates (p1, p2, …, pn) and (q1, q2, …, qn) in n-dimensional space, the Euclidean distance between p and q is defined as:
√[(p1-q1)² + (p2-q2)² + … + (pn-qn)²]
This distance represents the separation of two points in n-dimensional space, i.e. the length of the straight-line segment between the points in Euclidean space. In fields such as machine-learning classification and clustering, the Euclidean distance is often used to compute the similarity or distance between samples. In general, the smaller the Euclidean distance between two points, the higher their similarity; the larger the distance, the lower their similarity.
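A minimal Python illustration of this definition and of selecting the most similar sample by smallest distance follows (the numeric values are invented):

```python
def euclidean_distance(p, q):
    """Euclidean distance between two points in n-dimensional space."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

# The sample with the smallest distance to the query is the most similar one.
query = (1.0, 2.0, 3.0)
samples = {"a": (1.5, 2.0, 2.5), "b": (4.0, 0.0, 3.0)}
most_similar = min(samples, key=lambda k: euclidean_distance(query, samples[k]))
print(most_similar)  # "a"
```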
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While the application has been described above with reference to various embodiments, it should be understood that many changes and modifications can be made without departing from the scope of the application. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this application. The above examples should be understood as illustrative only and not limiting the scope of the application. Various changes and modifications to the present application may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the application as defined in the appended claims.

Claims (10)

1. A video fusion method based on a three-dimensional scene, characterized by comprising the following steps:
step 1, acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
step 2, acquiring a three-dimensional real-scene model to be fused with the preset video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
step 3, drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
step 4, obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
step 5, drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating step 4 to determine the second temporary position of the virtual camera;
and step 6, continuing to adjust the interpolation interval and repeating step 5 until the position with the minimum Euclidean distance is obtained, and projecting the video stream into the three-dimensional real-scene model from the resulting optimal virtual camera position and orientation.
2. The method of claim 1, wherein the plurality of standard points are four standard points at determined positions, located at the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
3. The method of claim 2, wherein drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point further comprises: the first standard point is the center point of the image, and the second standard point is the bottom center point of the image.
4. The video fusion method according to claim 1, wherein the first preset length and the second preset length are length values input by a user, the first preset length is initially set to 10 meters, and the second preset length is initially set to 1 meter.
5. The method of claim 1, wherein in step 4 the similarity obtained by the Euclidean distance method is expressed as d:
d = √[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, q1, q2, q3 are the corresponding preset coordinate values on the video image, and the symbol √ denotes the square root.
6. A video fusion system based on a three-dimensional scene, the video fusion system comprising the following modules:
the coordinate point selection module, used for acquiring a preset video image sequence, capturing a single-frame video image at a preset position from the video image sequence, initializing image coordinates on the single-frame video image, and selecting a plurality of standard points at preset coordinate positions on the video image;
the coordinate mapping module, used for acquiring a three-dimensional real-scene model to be fused with the video image sequence, establishing a coordinate mapping relation between the three-dimensional real-scene model and the video image to be fused, and determining from the three-dimensional real-scene model the world coordinates corresponding to the standard points;
the virtual camera initial rendering module, used for drawing, in the three-dimensional live-action model, a line from the world coordinates corresponding to a first standard point of the video image to the world coordinates corresponding to a second standard point of the video image; generating a preset number of interpolation points on the extension of this line, spaced horizontally at a first preset length and also spaced vertically; placing the virtual camera used for video fusion at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; then executing a rendering operation and storing the rendered data in a frame buffer;
the virtual camera positioning module, used for obtaining from the frame buffer the screen coordinates corresponding to the standard points other than the first standard point, comparing these screen coordinates with the preset coordinate positions of the standard points on the video image by the Euclidean distance method, and taking the interpolation point with the highest similarity as the first temporary position of the virtual camera;
the positioning updating module, used for drawing, in the three-dimensional live-action model, the line from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point; centered on the first temporary position, again generating a preset number of interpolation points along the direction of this line, spaced horizontally at a second preset length and also spaced vertically; placing the virtual camera at each interpolation point, oriented toward the world coordinates corresponding to the first standard point; re-executing the rendering operation; and repeating the operation of the virtual camera positioning module to determine the second temporary position of the virtual camera;
and the fusion module, used for continuing to adjust the interpolation interval and repeating the function executed by the positioning updating module until the virtual camera position with the minimum Euclidean distance, i.e. the best fusion effect, is obtained, and projecting the video stream into the three-dimensional real-scene model from the resulting optimal virtual camera position and orientation.
7. The three-dimensional scene-based video fusion system of claim 6, wherein the plurality of standard points are four standard points at determined positions, located at the center point, the bottom-left corner, the bottom middle point, and the bottom-right corner of the single-frame video image.
8. The three-dimensional scene-based video fusion system of claim 7, wherein drawing the line in the three-dimensional live-action model from the world coordinates corresponding to the first standard point of the video image to the world coordinates corresponding to the second standard point further comprises: the first standard point is the center point of the image, and the second standard point is the bottom center point of the image.
9. The three-dimensional scene-based video fusion system of claim 6, wherein the first preset length and the second preset length are length values entered by a user, the first preset length being initially set to 10 meters and the second preset length being initially set to 1 meter.
10. The video fusion system based on the three-dimensional scene as defined in claim 6, wherein obtaining the screen coordinates corresponding to the remaining standard points by the Euclidean distance method and comparing them with the preset coordinate positions of the standard points on the video image further comprises: the similarity obtained by the Euclidean distance method is expressed as d:
d = √[(p1-q1)² + (p2-q2)² + (p3-q3)²]
where p1, p2, p3 are the screen coordinate values of the standard points in the frame buffer, q1, q2, q3 are the corresponding preset coordinate values on the video image, and the symbol √ denotes the square root.
CN202310884577.2A 2023-07-18 2023-07-18 Video fusion method and system based on three-dimensional scene Active CN116996742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310884577.2A CN116996742B (en) 2023-07-18 2023-07-18 Video fusion method and system based on three-dimensional scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310884577.2A CN116996742B (en) 2023-07-18 2023-07-18 Video fusion method and system based on three-dimensional scene

Publications (2)

Publication Number Publication Date
CN116996742A true CN116996742A (en) 2023-11-03
CN116996742B CN116996742B (en) 2024-08-13

Family

ID=88525844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310884577.2A Active CN116996742B (en) 2023-07-18 2023-07-18 Video fusion method and system based on three-dimensional scene

Country Status (1)

Country Link
CN (1) CN116996742B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110088995A (en) * 2010-01-29 2011-08-04 주식회사 울프슨랩 Method and system to visualize surveillance camera videos within 3d models, and program recording medium
US20180262789A1 (en) * 2016-03-16 2018-09-13 Adcor Magnet Systems, Llc System for georeferenced, geo-oriented realtime video streams
CN113810626A (en) * 2020-06-15 2021-12-17 浙江宇视科技有限公司 Video fusion method, device and equipment based on three-dimensional map and storage medium
CN112053446A (en) * 2020-07-11 2020-12-08 南京国图信息产业有限公司 Real-time monitoring video and three-dimensional scene fusion method based on three-dimensional GIS
CN112184922A (en) * 2020-10-15 2021-01-05 洛阳众智软件科技股份有限公司 Fusion method, device and equipment of two-dimensional video and three-dimensional scene and storage medium
CN112437276A (en) * 2020-11-20 2021-03-02 埃洛克航空科技(北京)有限公司 WebGL-based three-dimensional video fusion method and system
CN113870163A (en) * 2021-09-24 2021-12-31 埃洛克航空科技(北京)有限公司 Video fusion method and device based on three-dimensional scene, storage medium and electronic device
CN115546377A (en) * 2022-12-01 2022-12-30 杭州靖安科技有限公司 Video fusion method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁泽西, 秦绪佳, 陈佳舟: "Video fusion method based on three-dimensional scenes" (基于三维场景的视频融合方法), Computer Science (《计算机科学》), no. 2, pages 281-285 *

Also Published As

Publication number Publication date
CN116996742B (en) 2024-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant