CN114241126A - Method for extracting object position information in monocular video based on live-action model - Google Patents

Method for extracting object position information in monocular video based on live-action model

Info

Publication number
CN114241126A
CN114241126A (Application CN202111451702.8A)
Authority
CN
China
Prior art keywords
video
live-action model
monitoring
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111451702.8A
Other languages
Chinese (zh)
Inventor
吴向阳 (Wu Xiangyang)
周诗洋 (Zhou Shiyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202111451702.8A
Publication of CN114241126A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for extracting object position information from monocular video based on a live-action model, comprising the following steps: (1) constructing a three-dimensional live-action model of the video surveillance area; (2) projecting the surveillance video onto the live-action model; (3) detecting targets in the video frame images projected onto the model with a target detection algorithm; (4) obtaining the geographic positions of the targets in the live-action model from the detection results. Based on the three-dimensional live-action model, the Cesium three-dimensional engine, target detection and related technologies, the invention makes the monocular surveillance video three-dimensional, viewable in perspective and measurable within the live-action model, so that object position information in the surveillance video can be extracted. This facilitates management and control of video-monitored sites and further exploits the value of video surveillance.

Description

Method for extracting object position information in monocular video based on live-action model
Technical Field
The invention relates to a technique for extracting object position information from monocular video based on a live-action model, and belongs to the technical fields of surveying and mapping and computer technology.
Background
Surveillance video contains rich information, such as the positions and categories of objects in the video and their relationships to the surrounding environment. However, the real situation of a monitored site cannot be perceived from a single surveillance video alone. Automatically extracting information such as object positions from surveillance video can deepen the understanding of the on-site situation and help managers locate and manage objects quickly, thereby increasing the application value of video surveillance; at present there is no good means of achieving this. Meanwhile, little research has been done on extracting object position information from surveillance video shot by a monocular camera, but with the rapid development of computer technology and the emergence of new techniques such as target detection, combining multiple technologies offers a new way to solve these problems. Target detection is a computer technology, related to computer vision and image processing, for detecting instances of semantic object classes (such as people, buildings or cars) in digital images and videos. Extracting position information from surveillance video with such new technical means therefore has high application value.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the invention provides a method for extracting object position information from monocular video based on a live-action model.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a method for extracting object position information in a monocular video based on a live-action model comprises the following steps:
(1) acquiring three-dimensional data of the surveillance video area and constructing a three-dimensional live-action model from the acquired data; (2) obtaining the attitude parameters of the surveillance camera from the real-time surveillance video and projecting the surveillance video onto the live-action model; (3) capturing images of the video projected on the live-action model, performing target detection on the images with the constructed small-target detection network, and outputting the detected target categories and their positions on the images;
(4) obtaining the geographic position of each target in the live-action model from the pixel coordinates of the detection results and converting it into geographic coordinates.
Preferably, acquiring the three-dimensional data in step (1) specifically comprises selecting control points in the survey area, photographing the survey area from the air with an unmanned aerial vehicle to obtain oblique photogrammetry imagery, and constructing the three-dimensional live-action model with Context Capture; the three-dimensional live-action model is required to meet 1:500 accuracy and is then refined.
Preferably, the three-dimensional live-action model in step (1) is constructed by first selecting corresponding (homonymous) points across photos of different source data, resolutions and arbitrary data volumes, then obtaining a triangulated network model through multi-view matching and triangulation, and finally producing the three-dimensional live-action model of the monitored area through automatic texture mapping.
Preferably, in step (2) the real-time surveillance video is fed, via H5Stream, into a Web system built on Cesium. To obtain the attitude parameters of the surveillance camera, screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position, and the corresponding Cesium attitude information is recorded; an image matching algorithm matches the most similar video frame and screenshot, the rough attitude of the video projection is obtained from that screenshot, and the surveillance video is then projected onto the surface of the three-dimensional live-action model as a video texture and manually fine-tuned.
Preferably, step (3) uses the YOLOv3 general-purpose target detection algorithm and improves it to detect small targets in the surveillance images: the prior boxes are optimized with a DBSCAN + K-Means clustering algorithm, and the three feature layers with resolutions of 13 × 13, 26 × 26 and 52 × 52 are modified into the two larger-resolution feature layers of 26 × 26 and 52 × 52.
Preferably, step (4) converts the pixel coordinates twice using Cesium interfaces: the pixel coordinates are converted into rectangular (Cartesian) coordinates, the rectangular coordinates are converted into geographic coordinates, and the position of the object in the real world is finally obtained.
The above technical scheme brings the following beneficial effects:
(1) A three-dimensional live-action model of the video surveillance area is constructed, enabling all-around browsing of the monitored area. (2) The surveillance video is fused with the live-action model, so that the relationship between the video content and the surrounding environment can be judged. (3) The position information of targets in the surveillance video is extracted, further mining the use value of surveillance video and making it easier for managers to control the monitored site.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a perspective projection schematic of the video projection of the present invention.
FIG. 3 is a schematic diagram of a rectangular coordinate system in Cesium.
Fig. 4 is a schematic diagram of the geographic coordinate system in Cesium.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
As shown in fig. 1, the method for extracting object position information in a monocular video based on a live-action model according to the present invention includes the following steps:
step 1, data acquisition and modeling work: the method comprises the following steps of utilizing an unmanned aerial vehicle as an aerial photography platform to quickly and efficiently obtain multi-view high-resolution images of a video monitoring area, adopting an oblique photography modeling technology to establish a triangulation network model to obtain three-dimensional terrain data, and then manually carrying out fine scene three-dimensional modeling, wherein the method specifically comprises the following steps:
and collecting shape data in the survey area range as basic data for use, and collecting airspace management conditions in the survey area range as unmanned aerial vehicle route planning for use. Before the unmanned aerial vehicle flies, a certain number of control points are selected in the survey area, and then the unmanned aerial vehicle is used for aerial photography of the survey area. The overlapping degree of the photos should meet the following requirements: (1) the course overlapping degree is generally 65-85%, and the minimum value is not less than 60%; (2) the degree of lateral overlap should generally be 45% to 80% and at minimum should not be less than 40%. The image deflection angle is generally not more than 15 degrees, and the individual maximum rotation angle is not more than 25 degrees on the premise of ensuring that the course and the sidewise overlapping degree of the image meet the requirements. The area boundary coverage requirements are as follows: the course covering beyond shot boundary line is not less than two base lines, and the side covering beyond shot boundary line is not less than 50% of the image frame.
The unmanned aerial vehicle oblique photogrammetry model is produced with modeling software. First, corresponding (homonymous) points are selected across photos of different source data, resolutions and arbitrary data volumes; a triangulated network model is then obtained through multi-view matching and triangulation, and the three-dimensional live-action model of the monitored area is produced through automatic texture mapping. Finally, the model is repaired and optimized where occluded areas such as trees and buildings may suffer from missing or confused data.
Step 2, projecting the surveillance video onto the live-action model: the attitude parameters of the surveillance camera are obtained with an image matching algorithm, and the real-time video stream accessed through H5Stream is projected onto the live-action model with Cesium. Specifically:
because the video tag of the HTML5 does not support the real-time data Stream of the RTSP, so that the monitoring video can not be displayed at the webpage end, the invention utilizes H5Stream to configure the video source into the configuration file according to the requirement, starts the H5Stream service, converts the video Stream and can access the monitoring video data into the system. Meanwhile, the video can be accessed only when the cross-domain video tag needs to be configured, and the video tag is specifically configured as follows: crosssort = "anonymous".
Screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position, and the attitude of the Cesium camera is recorded for each screenshot. An image matching algorithm then matches frames of the video stream against the screenshots to find the most similar one, and the rough attitude of the video projection is derived from that screenshot. Cesium uses the principle of perspective projection, as shown in Fig. 2. The surveillance video is projected onto the surface of the three-dimensional live-action model as a video texture and then manually fine-tuned so that the position of the video texture becomes more accurate; one possible Cesium realization is sketched below.
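A minimal sketch of the Cesium side under stated assumptions: the `cesium` npm package is used, a `<div id="cesiumContainer">` exists, the video element from the previous sketch is available, and all coordinates, attitude angles and the polygon footprint are placeholders to be replaced by the values recovered from image matching and manual fine-tuning. Draping a ground polygon with a video material is only one way to approximate the video texture described here; a true projective texture would need a custom shader or post-process, which the patent does not detail.

```ts
import * as Cesium from "cesium";

// Viewer over the 3D live-action model; preserveDrawingBuffer is enabled here because a later
// step reads the canvas back for screenshots.
const viewer = new Cesium.Viewer("cesiumContainer", {
  contextOptions: { webgl: { preserveDrawingBuffer: true } },
});
const videoElement = document.querySelector("video") as HTMLVideoElement; // from the previous sketch

// 1. Put the Cesium camera at the surveillance camera's (roughly matched) pose so that
//    screenshots and the projection share the same attitude.
viewer.camera.setView({
  destination: Cesium.Cartesian3.fromDegrees(118.79, 32.06, 35.0), // placeholder lon, lat, height
  orientation: {
    heading: Cesium.Math.toRadians(45.0), // placeholder attitude from the picture-matching step
    pitch: Cesium.Math.toRadians(-20.0),
    roll: 0.0,
  },
});

// 2. Drape the live video over the model surface as a textured polygon covering the camera footprint.
viewer.entities.add({
  polygon: {
    hierarchy: Cesium.Cartesian3.fromDegreesArray([
      118.7901, 32.0601,
      118.7905, 32.0601,
      118.7905, 32.0605,
      118.7901, 32.0605,
    ]),
    material: new Cesium.ImageMaterialProperty({ image: videoElement }),
    classificationType: Cesium.ClassificationType.BOTH, // drape over both terrain and 3D Tiles
  },
});
```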
Step 3, at any moment a frame image is captured from the video projected onto the model, target detection is performed on the image with the constructed small-target detection network, and the detected target categories and their positions on the image are output. Specifically:
because the monitoring video occupies fewer pixels, the invention uses and improves the mainstream Yolov3 general target detection algorithm to realize the small target detection on the monitoring image. Firstly, clustering optimization is carried out on the prior frames by using a DBSCAN + K-Means clustering algorithm so as to select the prior frames which are more suitable for small target detection. Meanwhile, the Darknet-53 feature extraction network is modified, three feature layers with the resolutions of 13 x 13, 26 x 26 and 52 x 52 are modified into two feature layers with the resolutions of 26 x 26 and 52 x 52 with larger sizes, so that the higher resolution and the larger feature layer receptive field are kept in the deep network, and the detection capability of small targets is enhanced.
After the video has been projected in step 2, a screenshot of the scene is taken with the projection attitude held fixed; the screenshot contains the three-dimensional model and one frame of the projected video texture. The screenshot is fed into the target detection network and the detection result is output. One way to capture the screenshot is sketched below.
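One way to capture the screenshot and hand it to a detection service, sketched under the assumption that the viewer was created with `preserveDrawingBuffer: true` (as in the earlier sketch); the `/api/detect` endpoint and its request/response shape are hypothetical.

```ts
import * as Cesium from "cesium";

declare const viewer: Cesium.Viewer; // the viewer from the earlier sketch (created with preserveDrawingBuffer: true)

// Capture one frame of the scene (3D model plus projected video texture) with the projection attitude held fixed.
function captureSceneFrame(): string {
  viewer.render();                             // force a render so the buffer holds the current frame
  return viewer.canvas.toDataURL("image/png"); // base64-encoded screenshot
}

// Send the screenshot to a (hypothetical) detection service wrapping the small-target network.
async function detectOnFrame(): Promise<unknown> {
  const image = captureSceneFrame();
  const response = await fetch("/api/detect", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ image }),
  });
  return response.json(); // expected to contain class labels and pixel-space bounding boxes
}
```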
Step 4, keeping the projection attitude of step 2, the pixel coordinates output in step 3 are converted with the Cesium coordinate conversion interfaces to obtain the geographic coordinates of the target. Specifically:
and 3, obtaining the pixel position of the object relative to the upper left corner of the picture in the video screenshot after target detection is carried out. And (3) in the three-dimensional live-action model system, keeping the projection posture in the step (2), and acquiring the real position of the object on the live-action model by using a coordinate conversion interface, so as to extract the position information of the object in the monitoring video. Firstly, converting the pixel coordinates into coordinates of a rectangular coordinate system, wherein the rectangular coordinate system is shown in FIG. 3; the rectangular coordinates are then converted to longitude and latitude coordinates of a geographic coordinate system, as shown in fig. 4.
In summary, the method for extracting object position information from monocular video based on a live-action model provided by the invention combines three-dimensional live-action modeling, target detection, Cesium video projection and related technologies to mine the position information implicit in surveillance video. It solves the problem that position information is difficult to extract from monocular surveillance video, makes the surveillance video three-dimensional and viewable in perspective, and makes intelligent management of the monitored site more convenient for managers.
The embodiments described above only illustrate the technical idea of the present invention; the scope of protection of the present invention is not limited thereto, and any modification made to the technical scheme on the basis of this technical idea falls within the scope of protection of the present invention.

Claims (6)

1. A method for extracting object position information from monocular video based on a live-action model, characterized by comprising the following steps: (1) acquiring three-dimensional data of the surveillance video area and constructing a three-dimensional live-action model from the acquired data; (2) obtaining the attitude parameters of the surveillance camera from the real-time surveillance video and projecting the surveillance video onto the live-action model; (3) capturing images of the video projected on the live-action model, performing target detection on the images with the constructed small-target detection network, and outputting the detected target categories and their positions on the images; (4) obtaining the geographic position of each target in the live-action model from the pixel coordinates of the detection results and converting it into geographic coordinates.
2. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that acquiring the three-dimensional data in step (1) specifically comprises selecting control points in the survey area, photographing the survey area from the air with an unmanned aerial vehicle to obtain oblique photogrammetry imagery, and constructing the three-dimensional live-action model with Context Capture, the three-dimensional live-action model being required to meet 1:500 accuracy and being refined afterwards.
3. The method for extracting object position information from monocular video based on a live-action model according to claim 2, characterized in that the three-dimensional live-action model in step (1) is specifically constructed by first selecting corresponding (homonymous) points across photos of different source data, resolutions and arbitrary data volumes, then obtaining a triangulated network model through multi-view matching and triangulation, and finally producing the three-dimensional live-action model of the monitored area through automatic texture mapping.
4. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that in step (2) the real-time surveillance video is fed, via H5Stream, into a Web system built on Cesium; screenshots of the three-dimensional live-action model of the monitored area are taken in multiple directions from a first-person view at the camera position to obtain the attitude parameters of the surveillance camera, and the corresponding Cesium attitude information is recorded; an image matching algorithm matches the video-stream frames against the screenshots to find the most similar one, the rough attitude of the video projection is obtained from that screenshot, and the surveillance video is then projected onto the surface of the three-dimensional live-action model as a video texture and manually fine-tuned.
5. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that step (3) uses the YOLOv3 general-purpose target detection algorithm and improves it to detect small targets in the surveillance images: the prior boxes are optimized with a DBSCAN + K-Means clustering algorithm, and the three feature layers with resolutions of 13 × 13, 26 × 26 and 52 × 52 are modified into the two larger-resolution feature layers of 26 × 26 and 52 × 52.
6. The method for extracting object position information from monocular video based on a live-action model according to claim 1, characterized in that step (4) converts the pixel coordinates twice using Cesium interfaces: the pixel coordinates are converted into rectangular coordinates, the rectangular coordinates are converted into geographic coordinates, and the position of the object in the real world is finally obtained.
CN202111451702.8A 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model Pending CN114241126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111451702.8A CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111451702.8A CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Publications (1)

Publication Number Publication Date
CN114241126A (en) 2022-03-25

Family

ID=80752591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111451702.8A Pending CN114241126A (en) 2021-12-01 2021-12-01 Method for extracting object position information in monocular video based on live-action model

Country Status (1)

Country Link
CN (1) CN114241126A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination