CN114241126A - Method for extracting object position information in monocular video based on live-action model - Google Patents

Method for extracting object position information in monocular video based on live-action model

Info

Publication number
CN114241126A
CN114241126A (Application CN202111451702.8A)
Authority
CN
China
Prior art keywords
video
live-action model
monitoring
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111451702.8A
Other languages
Chinese (zh)
Inventor
吴向阳 (Wu Xiangyang)
周诗洋 (Zhou Shiyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202111451702.8A
Publication of CN114241126A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for extracting object position information from monocular video based on a live-action model, comprising the following steps: (1) constructing a three-dimensional live-action model of the video surveillance area; (2) projecting the surveillance video onto the live-action model; (3) detecting targets in the video frame images projected onto the model with a target detection algorithm; (4) obtaining the geographic positions of the targets in the live-action model from the detection results. Based on the three-dimensional live-action model, the Cesium three-dimensional engine, target detection and related technologies, the invention makes the monocular surveillance video three-dimensional, viewable in perspective and measurable within the live-action model, so that object position information in the surveillance video can be extracted. This facilitates management and control of video-monitored sites and further exploits the value of video surveillance.

Description

Method for extracting object position information in monocular video based on live-action model
Technical Field
The invention relates to a technique for extracting object position information from monocular video based on a live-action model, and belongs to the technical fields of surveying and mapping and computer technology.
Background
Surveillance video contains rich information, such as the positions and categories of objects in the video and their relationships to the surrounding environment. However, the real situation of a monitored site cannot be perceived from a single surveillance video alone. Automatically extracting information such as object positions from surveillance video can deepen the understanding of the on-site situation and help managers locate and manage objects quickly, thereby increasing the application value of video surveillance; at present there is no good means of achieving this. Meanwhile, little research has been done on extracting object position information from surveillance video shot by a monocular camera, but with the rapid development of computer technology and the emergence of new techniques such as target detection, combining multiple technologies offers a new way to solve these problems. Target detection is a computer technology, related to computer vision and image processing, for detecting instances of semantic object classes (such as people, buildings or cars) in digital images and videos. Extracting position information from surveillance video with such new technical means therefore has high application value.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the invention provides a method for extracting object position information from monocular video based on a live-action model.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a method for extracting object position information in a monocular video based on a live-action model comprises the following steps:
(1) acquiring three-dimensional data of the surveillance video area and constructing a three-dimensional live-action model from the acquired data; (2) obtaining the attitude parameters of the surveillance camera from the real-time surveillance video and projecting the surveillance video onto the live-action model; (3) capturing images of the video projected on the live-action model, performing target detection on the images with the constructed small-target detection network, and outputting the detected target categories and their positions on the images;
(4) obtaining the geographic position of each target in the live-action model from the pixel coordinates of the detection results and converting it into geographic coordinates.
Preferably, acquiring the three-dimensional data in step (1) specifically comprises selecting control points in the survey area, photographing the survey area from the air with an unmanned aerial vehicle to obtain oblique photogrammetry imagery, and constructing the three-dimensional live-action model with Context Capture; the three-dimensional live-action model is required to meet 1:500 accuracy and is then refined.
Preferably, the three-dimensional live-action model in step (1) is constructed by first selecting corresponding (homonymous) points across photos of different source data, resolutions and arbitrary data volumes, then obtaining a triangulated network model through multi-view matching and triangulation, and finally producing the three-dimensional live-action model of the monitored area through automatic texture mapping.
Preferably, in step (2) the real-time surveillance video is fed, via H5Stream, into a Web system built on Cesium. To obtain the attitude parameters of the surveillance camera, screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position, and the corresponding Cesium attitude information is recorded; an image matching algorithm matches the most similar video frame and screenshot, the rough attitude of the video projection is obtained from that screenshot, and the surveillance video is then projected onto the surface of the three-dimensional live-action model as a video texture and manually fine-tuned.
Preferably, step (3) uses the YOLOv3 general-purpose target detection algorithm and improves it to detect small targets in the surveillance images: the prior boxes are optimized with a DBSCAN + K-Means clustering algorithm, and the three feature layers with resolutions of 13 × 13, 26 × 26 and 52 × 52 are modified into the two larger-resolution feature layers of 26 × 26 and 52 × 52.
Preferably, step (4) converts the pixel coordinates twice using Cesium interfaces: the pixel coordinates are converted into rectangular (Cartesian) coordinates, the rectangular coordinates are converted into geographic coordinates, and the position of the object in the real world is finally obtained.
The above technical scheme brings the following beneficial effects:
(1) A three-dimensional live-action model of the video surveillance area is constructed, enabling all-around browsing of the monitored area. (2) The surveillance video is fused with the live-action model, so that the relationship between the video content and the surrounding environment can be judged. (3) The position information of targets in the surveillance video is extracted, further mining the use value of surveillance video and making it easier for managers to control the monitored site.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a perspective projection schematic of the video projection of the present invention.
FIG. 3 is a schematic diagram of a rectangular coordinate system in Cesium.
Fig. 4 is a schematic diagram of the geographic coordinate system in Cesium.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for extracting object position information in a monocular video based on a live-action model according to the present invention includes the following steps:
step 1, data acquisition and modeling work: the method comprises the following steps of utilizing an unmanned aerial vehicle as an aerial photography platform to quickly and efficiently obtain multi-view high-resolution images of a video monitoring area, adopting an oblique photography modeling technology to establish a triangulation network model to obtain three-dimensional terrain data, and then manually carrying out fine scene three-dimensional modeling, wherein the method specifically comprises the following steps:
and collecting shape data in the survey area range as basic data for use, and collecting airspace management conditions in the survey area range as unmanned aerial vehicle route planning for use. Before the unmanned aerial vehicle flies, a certain number of control points are selected in the survey area, and then the unmanned aerial vehicle is used for aerial photography of the survey area. The overlapping degree of the photos should meet the following requirements: (1) the course overlapping degree is generally 65-85%, and the minimum value is not less than 60%; (2) the degree of lateral overlap should generally be 45% to 80% and at minimum should not be less than 40%. The image deflection angle is generally not more than 15 degrees, and the individual maximum rotation angle is not more than 25 degrees on the premise of ensuring that the course and the sidewise overlapping degree of the image meet the requirements. The area boundary coverage requirements are as follows: the course covering beyond shot boundary line is not less than two base lines, and the side covering beyond shot boundary line is not less than 50% of the image frame.
The unmanned aerial vehicle oblique photogrammetry model is produced with modeling software. First, corresponding (homonymous) points are selected across photos of different source data, resolutions and arbitrary data volumes; a triangulated network model is then obtained through multi-view matching and triangulation, and the three-dimensional live-action model of the monitored area is produced through automatic texture mapping. Finally, the model is repaired and optimized where occluded areas such as trees and buildings may suffer from missing or confused data.
Step 2, projecting the surveillance video onto the live-action model: the attitude parameters of the surveillance camera are obtained with an image matching algorithm, and the real-time video stream accessed through H5Stream is projected onto the live-action model with Cesium. Specifically:
because the video tag of the HTML5 does not support the real-time data Stream of the RTSP, so that the monitoring video can not be displayed at the webpage end, the invention utilizes H5Stream to configure the video source into the configuration file according to the requirement, starts the H5Stream service, converts the video Stream and can access the monitoring video data into the system. Meanwhile, the video can be accessed only when the cross-domain video tag needs to be configured, and the video tag is specifically configured as follows: crosssort = "anonymous".
Screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position, and the attitude of the Cesium camera is recorded for each screenshot. An image matching algorithm then matches frames of the video stream against the screenshots to find the most similar one, and the rough attitude of the video projection is derived from that screenshot. Cesium uses the principle of perspective projection, as shown in Fig. 2. The surveillance video is projected onto the surface of the three-dimensional live-action model as a video texture and then manually fine-tuned so that the position of the video texture becomes more accurate; one possible Cesium realization is sketched below.
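A minimal sketch of the Cesium side under stated assumptions: the `cesium` npm package is used, a `<div id="cesiumContainer">` exists, the video element from the previous sketch is available, and all coordinates, attitude angles and the polygon footprint are placeholders to be replaced by the values recovered from image matching and manual fine-tuning. Draping a ground polygon with a video material is only one way to approximate the video texture described here; a true projective texture would need a custom shader or post-process, which the patent does not detail.

```ts
import * as Cesium from "cesium";

// Viewer over the 3D live-action model; preserveDrawingBuffer is enabled here because a later
// step reads the canvas back for screenshots.
const viewer = new Cesium.Viewer("cesiumContainer", {
  contextOptions: { webgl: { preserveDrawingBuffer: true } },
});
const videoElement = document.querySelector("video") as HTMLVideoElement; // from the previous sketch

// 1. Put the Cesium camera at the surveillance camera's (roughly matched) pose so that
//    screenshots and the projection share the same attitude.
viewer.camera.setView({
  destination: Cesium.Cartesian3.fromDegrees(118.79, 32.06, 35.0), // placeholder lon, lat, height
  orientation: {
    heading: Cesium.Math.toRadians(45.0), // placeholder attitude from the picture-matching step
    pitch: Cesium.Math.toRadians(-20.0),
    roll: 0.0,
  },
});

// 2. Drape the live video over the model surface as a textured polygon covering the camera footprint.
viewer.entities.add({
  polygon: {
    hierarchy: Cesium.Cartesian3.fromDegreesArray([
      118.7901, 32.0601,
      118.7905, 32.0601,
      118.7905, 32.0605,
      118.7901, 32.0605,
    ]),
    material: new Cesium.ImageMaterialProperty({ image: videoElement }),
    classificationType: Cesium.ClassificationType.BOTH, // drape over both terrain and 3D Tiles
  },
});
```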
Step 3, at any moment a frame image is captured from the video projected onto the model, target detection is performed on the image with the constructed small-target detection network, and the detected target categories and their positions on the image are output. Specifically:
because the monitoring video occupies fewer pixels, the invention uses and improves the mainstream Yolov3 general target detection algorithm to realize the small target detection on the monitoring image. Firstly, clustering optimization is carried out on the prior frames by using a DBSCAN + K-Means clustering algorithm so as to select the prior frames which are more suitable for small target detection. Meanwhile, the Darknet-53 feature extraction network is modified, three feature layers with the resolutions of 13 x 13, 26 x 26 and 52 x 52 are modified into two feature layers with the resolutions of 26 x 26 and 52 x 52 with larger sizes, so that the higher resolution and the larger feature layer receptive field are kept in the deep network, and the detection capability of small targets is enhanced.
After the video has been projected in step 2, a screenshot of the scene is taken with the projection attitude held fixed; the screenshot contains the three-dimensional model and one frame of the projected video texture. The screenshot is fed into the target detection network and the detection result is output. One way to capture the screenshot is sketched below.
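One way to capture the screenshot and hand it to a detection service, sketched under the assumption that the viewer was created with `preserveDrawingBuffer: true` (as in the earlier sketch); the `/api/detect` endpoint and its request/response shape are hypothetical.

```ts
import * as Cesium from "cesium";

declare const viewer: Cesium.Viewer; // the viewer from the earlier sketch (created with preserveDrawingBuffer: true)

// Capture one frame of the scene (3D model plus projected video texture) with the projection attitude held fixed.
function captureSceneFrame(): string {
  viewer.render();                             // force a render so the buffer holds the current frame
  return viewer.canvas.toDataURL("image/png"); // base64-encoded screenshot
}

// Send the screenshot to a (hypothetical) detection service wrapping the small-target network.
async function detectOnFrame(): Promise<unknown> {
  const image = captureSceneFrame();
  const response = await fetch("/api/detect", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ image }),
  });
  return response.json(); // expected to contain class labels and pixel-space bounding boxes
}
```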
Step 4, keeping the projection attitude of step 2, the pixel coordinates output in step 3 are converted with the Cesium coordinate conversion interfaces to obtain the geographic coordinates of the target. Specifically:
and 3, obtaining the pixel position of the object relative to the upper left corner of the picture in the video screenshot after target detection is carried out. And (3) in the three-dimensional live-action model system, keeping the projection posture in the step (2), and acquiring the real position of the object on the live-action model by using a coordinate conversion interface, so as to extract the position information of the object in the monitoring video. Firstly, converting the pixel coordinates into coordinates of a rectangular coordinate system, wherein the rectangular coordinate system is shown in FIG. 3; the rectangular coordinates are then converted to longitude and latitude coordinates of a geographic coordinate system, as shown in fig. 4.
In summary, the method for extracting object position information from monocular video based on a live-action model provided by the invention combines three-dimensional live-action modeling, target detection, Cesium video projection and related technologies to mine the position information implicit in surveillance video. It solves the problem that position information is difficult to extract from monocular surveillance video, makes the surveillance video three-dimensional and viewable in perspective, and makes intelligent management of the monitored site more convenient for managers.
The embodiments described above only illustrate the technical idea of the present invention; the scope of protection of the present invention is not limited thereto, and any modification made to the technical scheme on the basis of this technical idea falls within the scope of protection of the present invention.

Claims (6)

1. A method for extracting object position information from monocular video based on a live-action model, characterized by comprising the following steps: (1) acquiring three-dimensional data of the surveillance video area and constructing a three-dimensional live-action model from the acquired data; (2) obtaining the attitude parameters of the surveillance camera from the real-time surveillance video and projecting the surveillance video onto the live-action model; (3) capturing images of the video projected on the live-action model, performing target detection on the images with the constructed small-target detection network, and outputting the detected target categories and their positions on the images; (4) obtaining the geographic position of each target in the live-action model from the pixel coordinates of the detection results and converting it into geographic coordinates.
2. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that acquiring the three-dimensional data in step (1) specifically comprises selecting control points in the survey area, photographing the survey area from the air with an unmanned aerial vehicle to obtain oblique photogrammetry imagery, and constructing the three-dimensional live-action model with Context Capture, the three-dimensional live-action model being required to meet 1:500 accuracy and being refined afterwards.
3. The method for extracting object position information from monocular video based on a live-action model according to claim 2, characterized in that the three-dimensional live-action model in step (1) is specifically constructed by first selecting corresponding (homonymous) points across photos of different source data, resolutions and arbitrary data volumes, then obtaining a triangulated network model through multi-view matching and triangulation, and finally producing the three-dimensional live-action model of the monitored area through automatic texture mapping.
4. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that in step (2) the real-time surveillance video is fed, via H5Stream, into a Web system built on Cesium; screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position to obtain the attitude parameters of the surveillance camera, and the corresponding Cesium attitude information is recorded; an image matching algorithm matches the video-stream frames against the screenshots to find the most similar one, the rough attitude of the video projection is obtained from that screenshot, and the surveillance video is then projected onto the surface of the three-dimensional live-action model as a video texture and manually fine-tuned.
5. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that step (3) uses the YOLOv3 general-purpose target detection algorithm and improves it to detect small targets in the surveillance images: the prior boxes are optimized with a DBSCAN + K-Means clustering algorithm, and the three feature layers with resolutions of 13 × 13, 26 × 26 and 52 × 52 are modified into the two larger-resolution feature layers of 26 × 26 and 52 × 52.
6. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that step (4) converts the pixel coordinates twice using Cesium interfaces: the pixel coordinates are converted into rectangular coordinates, the rectangular coordinates are converted into geographic coordinates, and the position of the object in the real world is finally obtained.
CN202111451702.8A 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model Pending CN114241126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111451702.8A CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111451702.8A CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Publications (1)

Publication Number Publication Date
CN114241126A (en) 2022-03-25

Family

ID=80752591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111451702.8A Pending CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Country Status (1)

Country Link
CN (1) CN114241126A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination