WO2023178951A1 - Image analysis method and apparatus, model training method and apparatus, and device, medium and program - Google Patents

Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Info

Publication number
WO2023178951A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
image
optical flow
updated
pixel
Prior art date
Application number
PCT/CN2022/119646
Other languages
French (fr)
Chinese (zh)
Inventor
章国锋
鲍虎军
叶伟才
余星源
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2023178951A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image analysis method and apparatus, a model training method and apparatus, and a device, a medium and a program. The image analysis method comprises: acquiring an image sequence, optical flow data, and reference data of each image in the image sequence (S11), wherein each image comprises a first image and a second image, which have a co-visibility relationship, the optical flow data comprises a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by the movement of a photographic device, the overall optical flow is caused by both the movement of the photographic device and the movement of a photographic subject, and the reference data comprises a pose and a depth; on the basis of the image sequence and the optical flow data, performing prediction to obtain an analysis result (S12), wherein the analysis result comprises optical flow calibration data of the static optical flow; and on the basis of the static optical flow and the optical flow calibration data, optimizing the pose and the depth, so as to obtain an updated pose and an updated depth (S13). By means of the solution, the precision of a pose and a depth can be improved in a dynamic scenario.

Description

Image analysis method, model training method, apparatus, device, medium and program
Cross-reference to related applications
This disclosure claims priority to Chinese patent application No. 202210307855.3, filed on March 25, 2022 by 浙江商汤科技开发有限公司 (Zhejiang SenseTime Technology Development Co., Ltd.) and entitled "Image analysis method and related model training method, apparatus, device and medium", the entire contents of which are incorporated into this disclosure by reference.
Technical field
The present disclosure relates to the field of computer vision, and in particular to an image analysis method, a model training method, and an apparatus, device, medium and program.
Background
Simultaneous Localization and Mapping (SLAM) is one of the most fundamental tasks in computer vision and robotics, with applications that include, but are not limited to, augmented reality (AR), virtual reality (VR) and autonomous driving. Among SLAM variants, monocular dense SLAM has attracted much attention because monocular video is simple to acquire, but compared with dense SLAM based on depth (Red Green Blue-Depth, RGB-D) images it is a difficult task. Research has found that building a robust and reliable SLAM system remains challenging: in dynamic scenes in particular, existing SLAM systems still have major problems and cannot obtain accurate poses and depths.
Summary
Embodiments of the present disclosure provide an image analysis method, a model training method, an apparatus, a device, a medium and a program.
An embodiment of the present disclosure provides an image analysis method, including: acquiring an image sequence, optical flow data, and reference data of each image in the image sequence, where the images include a first image and a second image that have a co-visibility relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth; predicting an analysis result based on the image sequence and the optical flow data, where the analysis result includes optical flow calibration data of the static optical flow; and optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
In this way, the image sequence, the optical flow data and the reference data of each image in the image sequence are acquired, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by camera motion, the overall optical flow is caused jointly by camera motion and object motion, and the reference data includes pose and depth. On this basis, an analysis result is predicted from the image sequence and the optical flow data, the analysis result including optical flow calibration data of the static optical flow, and the pose and depth are optimized based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. By imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are both consulted to predict calibration data for the static optical flow. In the subsequent pose and depth optimization, the static optical flow and its calibration data can then be combined to reduce, as far as possible, the influence of object motion, thereby improving the accuracy of the pose and depth.
An embodiment of the present disclosure provides a training method for an image analysis model, including: acquiring a sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence, where the sample images include a first sample image and a second sample image with a co-visibility relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first and second sample images, the sample static optical flow is caused by motion of the camera device, the sample overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the sample reference data includes a sample pose and a sample depth; analyzing and predicting the sample image sequence and the sample optical flow data with the image analysis model to obtain a sample analysis result, where the sample analysis result includes sample optical flow calibration data of the sample static optical flow; optimizing the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; measuring a loss based on the updated sample pose and the updated sample depth to obtain a prediction loss of the image analysis model; and adjusting network parameters of the image analysis model based on the prediction loss.
Thus, similarly to the inference stage, by imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are consulted to predict calibration data for the static optical flow. In the subsequent pose and depth optimization, combining the static optical flow and its calibration data reduces the influence of object motion as far as possible, which improves the performance of the image analysis model, helps improve the accuracy of the analysis results obtained with the model at inference time, and in turn improves the accuracy of the pose and depth at inference time.
An embodiment of the present disclosure provides an image analysis apparatus, including: an acquisition part configured to acquire an image sequence, optical flow data, and reference data of each image in the image sequence, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth; an analysis part configured to predict an analysis result based on the image sequence and the optical flow data, where the analysis result includes optical flow calibration data of the static optical flow; and an optimization part configured to optimize the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
An embodiment of the present disclosure provides a training apparatus for an image analysis model, including: a sample acquisition part configured to acquire a sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence, where the sample images include a first sample image and a second sample image with a co-visibility relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first and second sample images, the sample static optical flow is caused by motion of the camera device, the sample overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the sample reference data includes a sample pose and a sample depth; a sample analysis part configured to analyze and predict the sample image sequence and the sample optical flow data with the image analysis model to obtain a sample analysis result, where the sample analysis result includes sample optical flow calibration data of the sample static optical flow; a sample optimization part configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; a loss measurement part configured to measure a loss based on the updated sample pose and the updated sample depth to obtain a prediction loss of the image analysis model; and a parameter adjustment part configured to adjust network parameters of the image analysis model based on the prediction loss.
An embodiment of the present disclosure provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the above image analysis method or the above training method for an image analysis model.
An embodiment of the present disclosure provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the above image analysis method or the above training method for an image analysis model is implemented.
An embodiment of the present disclosure provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the code to implement the above image analysis method or the above training method for an image analysis model.
With the image analysis method, model training method, apparatus, device, medium and program provided by the embodiments of the present disclosure, the image sequence, the optical flow data and the reference data of each image in the image sequence are first acquired, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth. On this basis, an analysis result is predicted from the image sequence and the optical flow data, the analysis result including optical flow calibration data of the static optical flow, and the pose and depth are optimized based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. By imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are both consulted to predict calibration data for the static optical flow, so that in the subsequent pose and depth optimization the static optical flow and its calibration data can be combined to reduce the influence of object motion as far as possible, thereby improving the accuracy of the pose and depth.
To make the above objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are described below. The drawings are incorporated into and form a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Figure 1 is a schematic flowchart of an embodiment of the image analysis method of the present disclosure;
Figure 2 is a schematic diagram of an embodiment of overall optical flow decomposition;
Figure 3a is a schematic process diagram of an embodiment of the image analysis method of the present disclosure;
Figure 3b is a schematic framework diagram of an embodiment of a dynamic update network;
Figure 4a is a schematic comparison, for one embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 4b is a schematic comparison, for another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5a is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5b is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5c is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5d is a schematic diagram of map reconstruction when the image analysis method of the present disclosure is applied to various datasets;
Figure 5e is a schematic diagram of the image analysis method of the present disclosure applied to a motion segmentation task;
Figure 5f is a schematic comparison of the image analysis method of the present disclosure and the prior art when each is applied to AR;
Figure 6 is a schematic flowchart of an embodiment of the training method of the image analysis model of the present disclosure;
Figure 7 is a schematic diagram of an embodiment of a dynamic scene;
Figure 8 is a schematic framework diagram of an embodiment of the image analysis apparatus of the present disclosure;
Figure 9 is a schematic framework diagram of an embodiment of the training apparatus of the image analysis model of the present disclosure;
Figure 10 is a schematic framework diagram of an embodiment of the electronic device of the present disclosure;
Figure 11 is a schematic framework diagram of an embodiment of the computer-readable storage medium of the present disclosure.
Detailed description
The solutions of the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
In the following description, specific details such as particular system structures, interfaces and techniques are set forth for the purpose of explanation rather than limitation, so as to provide a thorough understanding of the present disclosure.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects. Furthermore, "multiple" herein means two or more than two.
Please refer to Figure 1, which is a schematic flowchart of an embodiment of the image analysis method of the present disclosure. The method may include the following steps:
Step S11: acquire an image sequence, optical flow data, and reference data of each image in the image sequence.
In the embodiments of the present disclosure, the images include a first image and a second image that have a co-visibility relationship. Specifically, if a pixel in the first image is back-projected to a three-dimensional point in space and that three-dimensional point can also be projected into the second image, the first image and the second image can be considered to have a co-visibility relationship; in other words, if a certain three-dimensional point in space is visible in both the first image and the second image, the two images have a co-visibility relationship. That is, when the fields of view of the first image and the second image at least partially overlap, the two images can be considered co-visible. In addition, during analysis, there may be one or more second images that are co-visible with the first image, and the first image together with at least one second image can form the image sequence.
In the embodiments of the present disclosure, the optical flow data may include a static optical flow and an overall optical flow between the first image and the second image; the static optical flow is caused by motion of the camera device, while the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object. For example, suppose a three-dimensional point in space is located at P1(u1, v1) in the first image captured by the camera at time t1 and the object to which it belongs is stationary. If, due to the motion of the camera itself, this point is located at P2(u2, v2) in the second image captured at time t2, then the static optical flow value at pixel position P1(u1, v1) can be written as (u2-u1, v2-v1). The static optical flow between the first image and the second image contains the static optical flow value of every pixel in the first image, so adding a pixel's position in the first image to its static optical flow value gives the pixel position in the second image that the corresponding three-dimensional point theoretically maps to under the camera's own motion; if that three-dimensional point lies on a stationary object and the static optical flow is perfectly accurate, this position is also the projection of the three-dimensional point in the second image. Alternatively, again taking a three-dimensional point located at P1(u1, v1) in the first image captured at time t1, if the object to which it belongs is moving and, due to both the camera's motion and the object's motion, the point is located at P3(u3, v3) in the second image captured at time t2, then the overall optical flow value at pixel position P1(u1, v1) can be written as (u3-u1, v3-v1). The overall optical flow between the first image and the second image contains the overall optical flow value of every pixel in the first image, so adding a pixel's position in the first image to its overall optical flow value gives the pixel position in the second image that the corresponding three-dimensional point theoretically maps to under the combined motion of the camera and the object; if the overall optical flow is perfectly accurate, this position is also the projection of that three-dimensional point in the second image.
In one implementation scenario, denote the first image as image i and the second image as image j. After the coordinate transformation induced by the static optical flow caused by camera motion, each pixel in the first image corresponds to a pixel position in the second image; if the pixel belongs to a stationary object and the static optical flow is perfectly accurate, the pixel in the first image and the corresponding pixel obtained in the second image after this transformation correspond to the same three-dimensional point in space. For ease of description, the static optical flow is denoted F_sij. Similarly, after the coordinate transformation induced by the overall optical flow caused jointly by camera motion and object motion, each pixel in the first image corresponds to a pixel position in the second image; if the overall optical flow is perfectly accurate, the pixel in the first image and the transformed pixel in the second image correspond to the same three-dimensional point in space. For ease of description, the overall optical flow is denoted F_oij.
In the embodiments of the present disclosure, the reference data includes pose and depth. Still denoting the first image as image i and the second image as image j, the reference data may include the pose G_i of the first image and the pose G_j of the second image, as well as the depth value of each pixel in the first image i and in the second image j: the depth of the first image contains the depth value of every pixel in the first image, and the depth of the second image contains the depth value of every pixel in the second image. For ease of description, the depth of the first image is denoted d_i and, similarly, the depth of the second image is denoted d_j. Note that "pose" refers jointly to position and orientation, and describes the transformation between the world coordinate system and the camera coordinate system. Depth represents the distance from an object to the camera device; in the embodiments of the present disclosure, depth may be represented using inverse depth parameterization.
In one implementation scenario, the embodiments of the present disclosure may iterate N times (e.g., 10 or 15 times) to optimize the depth and pose as much as possible and improve their accuracy, in which case the pose can be given an initial value in the first iteration. For example, the pose may be represented by a 4*4 matrix and can be initialized as the matrix whose main-diagonal elements are 1 and whose other elements are 0. In subsequent iterations, the pose input to the i-th iteration may be the pose output by the (i-1)-th iteration.
In one implementation scenario, the depth can be given an initial value in a similar way in the first iteration; the specific value is not limited here. For example, stationary objects (e.g., buildings, street lamps) in the first and second images may first be identified, and feature matching may then be performed between the first and second images based on those stationary objects to obtain a number of matching point pairs, where each matching point pair contains a first pixel belonging to a stationary object in the first image and a second pixel belonging to a stationary object in the second image, and the first and second pixels correspond to the same three-dimensional point in space. On this basis, the three-dimensional position of the first pixel can be determined from the pose of the first image, the depth value of the first pixel, and the pixel position of the first pixel in the first image; meanwhile, the three-dimensional position of the second pixel can be determined from the pose of the second image, the depth value of the second pixel in the same matching pair, and its pixel position in the second image. Since the three-dimensional positions corresponding to the first pixel and the second pixel should be identical, a set of equations with the depth values of the first and second pixels as unknowns can be constructed from the matching point pairs; solving these equations yields the depth values of the first and second pixels, which then serve as the initial values of the depth of the first image and the depth of the second image in the first iteration. In subsequent iterations, the depth input to the i-th iteration may be the depth output by the (i-1)-th iteration.
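The following is a minimal, illustrative sketch of the initialization and iteration scheme just described; the `one_iteration` stub is a hypothetical placeholder for one analysis-and-optimization pass (steps S12 and S13), and the constant initial inverse depth simply stands in for the feature-matching-based initialization above.

```python
# Minimal sketch, assuming 4x4 matrix poses and per-pixel inverse depth.
import torch

H, W, N = 48, 64, 10  # illustrative image size and number of iterations

def one_iteration(pose, inv_depth):
    # Stub: a real pass would predict flow calibration data and re-optimize
    # the pose and depth; here the inputs are returned unchanged.
    return pose, inv_depth

# First iteration: pose initialized to the matrix with 1s on the main diagonal
# and 0s elsewhere; inverse depth given a placeholder constant initial value.
pose = torch.eye(4)
inv_depth = torch.full((H, W), 0.5)

for k in range(N):
    # The pose/depth fed into iteration k is the pose/depth output by iteration k-1.
    pose, inv_depth = one_iteration(pose, inv_depth)
```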
In one implementation scenario, after obtaining the pose and depth for the first iteration, the pixel positions p_i of the pixels in the first image i can be projected based on the depths d_i and the relative pose G_ij between the first image and the second image, yielding the pixel positions p_ij at which the pixels of the first image project into the second image, as shown in Formula (1):

p_ij = ∏_c(G_ij ∘ ∏_c^{-1}(p_i, d_i))    ... Formula (1);

In Formula (1), ∏_c denotes the camera model used to map three-dimensional points onto the image, ∏_c^{-1} denotes the back-projection function used to map two-dimensional points to three-dimensional points based on the pixel positions p_i and the depths d_i, and the operator ∘ denotes the Hadamard product. The relative pose G_ij can be expressed as:

G_ij = G_j ∘ G_i^{-1}    ... Formula (2);

In addition, taking the first image i and the second image j as two-dimensional images of width W and height H as an example, the pixel positions p_i of the pixels in the first image i can be represented as an H*W two-channel image, i.e., p_i ∈ R^{H×W×2}; similarly, the positions p_ij at which pixels of the first image project into the second image can also be represented as an H*W two-channel image, i.e., p_ij ∈ R^{H×W×2}. On this basis, in the first iteration, for the pixel position p_i of any pixel in the first image i, its corresponding position p_j in the second image can be obtained, where this corresponding position is the pixel position at which the spatial point (i.e., the three-dimensional point) to which the pixel belongs would project into the second image if the camera device had not moved. The static optical flow F_sij can then be obtained from the difference between each pixel's corresponding position p_j in the second image and its projected position p_ij in the second image:

F_sij = p_ij - p_j    ... Formula (3).
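To make Formulas (1)-(3) concrete, the following is a minimal sketch under a simple pinhole-camera assumption; the intrinsic matrix K, the tensor layouts and the function names are illustrative assumptions rather than the patent's actual implementation, and the relative pose is applied here as a rigid-body transform of the back-projected points, which is one common reading of Formula (1).

```python
import torch

def backproject(p, inv_depth, K):
    """Pi_c^{-1}: map pixel grid p (H,W,2) and inverse depth (H,W) to 3D points (H,W,3)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = 1.0 / inv_depth.clamp(min=1e-6)
    x = (p[..., 0] - cx) / fx * z
    y = (p[..., 1] - cy) / fy * z
    return torch.stack([x, y, z], dim=-1)

def project(X, K):
    """Pi_c: map 3D points (H,W,3) back onto the image plane (H,W,2)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = X[..., 2].clamp(min=1e-6)
    u = fx * X[..., 0] / z + cx
    v = fy * X[..., 1] / z + cy
    return torch.stack([u, v], dim=-1)

def static_flow(p_i, inv_depth_i, G_ij, K):
    """Formulas (1) and (3): project pixels of image i into image j with the
    relative pose G_ij (4x4), then subtract the unmoved pixel grid (p_j)."""
    X_i = backproject(p_i, inv_depth_i, K)                     # 3D points in camera i
    X_h = torch.cat([X_i, torch.ones_like(X_i[..., :1])], -1)  # homogeneous coordinates
    X_j = (X_h @ G_ij.T)[..., :3]                              # points expressed in camera j
    p_ij = project(X_j, K)                                     # Formula (1)
    return p_ij - p_i                                          # Formula (3), with p_j taken as the unmoved grid
```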
In one implementation scenario, as mentioned above, the overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the flow caused by camera motion is called the static optical flow; for ease of distinction, the flow caused by object motion is called the dynamic optical flow. In the first iteration, the dynamic optical flow can be initialized as an all-zero matrix, which can likewise be represented as an H*W two-channel image. On this basis, in the first iteration, the aforementioned static optical flow F_sij can be added to the dynamic optical flow represented by the all-zero matrix to obtain the overall optical flow F_oij. In other words, in this embodiment, the overall optical flow can be decomposed into a static optical flow and a dynamic optical flow. Similarly, the sample overall optical flow in the embodiments disclosed below can also be decomposed into a sample static optical flow and a sample dynamic optical flow. Please refer to Figure 2, which is a schematic diagram of an embodiment of overall optical flow decomposition: the flow caused jointly by camera motion and object motion (i.e., the overall optical flow) can be decomposed into the flow caused by camera motion (i.e., the static optical flow) and the flow caused by object motion (i.e., the dynamic optical flow).
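As a minimal illustration of the decomposition in Figure 2, the overall optical flow can be formed by adding the static and dynamic components, each stored as an H*W two-channel array; the array sizes below are arbitrary placeholders.

```python
import torch

H, W = 48, 64
static_flow_sij = torch.zeros(H, W, 2)   # F_s: flow caused by camera motion
dynamic_flow = torch.zeros(H, W, 2)      # flow caused by object motion (all zeros in the first iteration)

# Overall flow F_o = static flow + dynamic flow; on the first iteration it equals F_s.
overall_flow_oij = static_flow_sij + dynamic_flow
```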
Step S12: predict an analysis result based on the image sequence and the optical flow data.
In the embodiments of the present disclosure, the analysis result includes optical flow calibration data of the static optical flow, and the optical flow calibration data may include a calibration value for each static optical flow value in the static optical flow. As mentioned above, the static optical flow can be represented as an H*W two-channel image, so the optical flow calibration data can likewise be represented as an H*W two-channel image. For ease of description, the optical flow calibration data is denoted r_sij ∈ R^{H×W×2}.
In one implementation scenario, feature correlation data between the first image and the second image can be obtained based on the image features of the first image and the image features of the second image, and the pixels of the first image can be projected based on the static optical flow to obtain the first projected positions of the pixels of the first image in the second image. On this basis, the feature correlation data can be searched based on the first projected positions to obtain target correlation data, and the analysis result can be obtained based on the target correlation data, the static optical flow and the overall optical flow. In this way, when searching the feature correlation data of the two images for the target correlation data, referring to the static optical flow caused by camera motion can reduce the influence of object motion, which in turn helps improve the accuracy of the subsequently optimized pose and depth.
In one implementation scenario, please refer to Figure 3a, which is a schematic process diagram of an embodiment of the image analysis method of the present disclosure. As shown in Figure 3a, to improve the efficiency of image analysis, an image analysis model can be trained in advance, and the image analysis model may include an image encoder 301 for feature-encoding the first image i and an image encoder 302 for feature-encoding the second image j. The two image encoders 301 and 302 may share network parameters. Image encoders 301 and 302 may contain several (e.g., 6 or 7) residual blocks and several (e.g., 3 or 4) downsampling layers; the network structure of the encoders is not limited here. In addition, for example, the resolution of the image features obtained after processing by image encoders 301 and 302 may be 1/8, 1/12 or 1/16 of the input image, which is not limited here.
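The following is a rough PyTorch sketch of a shared-weight feature encoder of the kind just described: a few residual blocks with three stride-2 stages, giving features at 1/8 of the input resolution. The channel sizes and layer counts are assumptions and are not the patent's actual encoders 301/302.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class FeatureEncoder(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 7, stride=2, padding=3)                               # 1/2 resolution
        self.stage1 = nn.Sequential(ResidualBlock(32),
                                    nn.Conv2d(32, 64, 3, stride=2, padding=1))             # 1/4 resolution
        self.stage2 = nn.Sequential(ResidualBlock(64),
                                    nn.Conv2d(64, out_dim, 3, stride=2, padding=1))        # 1/8 resolution
        self.head = ResidualBlock(out_dim)

    def forward(self, image):
        return self.head(self.stage2(self.stage1(self.stem(image))))

# The two encoders share parameters, so a single instance can encode both images.
encoder = FeatureEncoder()
feat_i = encoder(torch.randn(1, 3, 384, 512))   # -> (1, 128, 48, 64)
feat_j = encoder(torch.randn(1, 3, 384, 512))
```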
In one implementation scenario, the feature correlation data can be obtained by taking the dot product of the image features of the first image i and the image features of the second image j, and the feature correlation data can be represented as a four-dimensional volume. For example, denote the image features of the first image as g_i and the image features of the second image as g_j; on this basis, the feature correlation data C_ij can be computed via the dot product:

C_ij(u_i, v_i, u_j, v_j) = ⟨g_i(u_i, v_i), g_j(u_j, v_j)⟩    ... Formula (4);
In Formula (4), u_i, v_i and u_j, v_j denote pixel coordinates in the first image i and the second image j, respectively, and ⟨,⟩ denotes the dot product. To account for objects at different scales, the last two dimensions of the above feature correlation volume can be processed by average pooling with different kernel sizes (e.g., 1, 2, 4, 8) to form a multi-level feature correlation pyramid, which serves as the feature correlation data. For the details of the feature correlation step, refer to the technical details of RAFT (Recurrent All-Pairs Field Transforms for Optical Flow). The feature correlation data C_ij can be regarded as a measure of the visual consistency between the first image i and the second image j.
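The following is a minimal sketch of the correlation computation in Formula (4) together with the average-pooling pyramid mentioned above (the RAFT-style construction the text refers to); the tensor shapes and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_volume(feat_i, feat_j):
    """feat_i, feat_j: (D, H, W) feature maps -> correlation volume C_ij of shape (H, W, H, W)."""
    D, H, W = feat_i.shape
    fi = feat_i.reshape(D, H * W)
    fj = feat_j.reshape(D, H * W)
    corr = fi.t() @ fj                      # <g_i(u_i, v_i), g_j(u_j, v_j)> for every pair of pixels
    return corr.reshape(H, W, H, W)

def correlation_pyramid(corr, num_levels=4):
    """Average-pool the last two dimensions to handle objects at different scales."""
    H, W = corr.shape[:2]
    pyramid = [corr]
    x = corr.reshape(H * W, 1, H, W)        # treat the (u_j, v_j) dimensions as a spatial map
    for _ in range(num_levels - 1):
        x = F.avg_pool2d(x, kernel_size=2, stride=2)
        pyramid.append(x.reshape(H, W, x.shape[-2], x.shape[-1]))
    return pyramid
```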
In one implementation scenario, a correlation lookup function can be defined whose input parameters include a coordinate grid and a radius r; based on this, the target correlation data L_r can be retrieved. The function takes an H×W coordinate grid as input, matching the image dimensions of the static optical flow. Specifically, the pixel coordinates of each pixel in the first image can be added directly to that pixel's value in the static optical flow to obtain the pixel's first projected position in the second image. On this basis, the target correlation data can be retrieved from the feature correlation data by linear interpolation. The correlation lookup function is applied to every level of the aforementioned feature correlation pyramid, and the target correlation data retrieved from each level can be concatenated to obtain the final target correlation data. For the details of the lookup procedure, refer to the technical details of RAFT.
In one implementation scenario, as mentioned above, an image analysis model can be trained in advance to improve the efficiency of image analysis. In addition, as shown in Figure 3b, the image analysis model may include a dynamic update network 303, and the dynamic update network 303 may include, but is not limited to, a semantic extraction sub-network 3033 such as a ConvGRU (a gated recurrent unit combined with convolution); the network structure of the dynamic update network 303 is not limited here. After the target correlation data (retrieved from the feature correlation data 305 by linear interpolation), the static optical flow 3063 and the overall optical flow 3062 are obtained, they can be input into the dynamic update network 303 to obtain the analysis result. Please also refer to Figure 3b, which is a schematic framework diagram of an embodiment of the dynamic update network. As shown in Figure 3b, the dynamic update network 303 may include an optical flow encoder 3031 and a correlation encoder 3032, so that encoding can be performed based on the target correlation data to obtain a first encoded feature, encoding can be performed based on the static optical flow 3063 and the overall optical flow 3062 to obtain a second encoded feature, and the analysis result is then predicted from the first encoded feature and the second encoded feature. Specifically, the first encoded feature and the second encoded feature can be fed together into the convolutional gated recurrent unit (ConvGRU) to obtain deep semantic features, and the analysis result is predicted based on the deep semantic features. Since the ConvGRU is a local operation with a relatively small receptive field, the hidden-state vector can be averaged over the spatial dimensions of the image to serve as a global context feature, which is then provided to the ConvGRU as an additional input. For ease of description, the global context feature at the (k+1)-th iteration is denoted h^(k+1). In this way, encoding the target correlation data yields the first encoded feature and encoding the static and overall optical flow yields the second encoded feature, and the analysis result is predicted from these two encoded features, so that the deep feature information of the optical flow data and the correlation data can be extracted separately before prediction, which helps improve the accuracy of the subsequent predictive analysis.
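The following is a rough, illustrative PyTorch sketch of an update block organized along the lines just described: a correlation encoder, an optical flow encoder (here taking the static flow, overall flow and dynamic mask as input, as in the variant described further below), a ConvGRU fed with a spatially averaged global context, and prediction heads. All layer sizes, channel counts (e.g., the assumed 196-channel correlation input) and names are assumptions and do not reproduce the actual dynamic update network 303.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden=128, inp=256):
        super().__init__()
        self.convz = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convr = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convq = nn.Conv2d(hidden + inp, hidden, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

class DynamicUpdateBlock(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Optical flow encoder: static flow (2) + overall flow (2) + dynamic mask (2) channels.
        self.flow_enc = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(64, 64, 3, padding=1))
        # Correlation encoder: assumed 196 channels of target correlation data.
        self.corr_enc = nn.Sequential(nn.Conv2d(196, 96, 1), nn.ReLU(),
                                      nn.Conv2d(96, 64, 3, padding=1))
        self.gru = ConvGRU(hidden, inp=64 + 64 + hidden)
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)     # optical flow calibration r_s
        self.mask_head = nn.Conv2d(hidden, 2, 3, padding=1)     # mask calibration delta M
        self.weight_head = nn.Conv2d(hidden, 2, 3, padding=1)   # confidence map w

    def forward(self, h, corr_input, flow_input):
        c = self.corr_enc(corr_input)                            # first encoded feature
        f = self.flow_enc(flow_input)                            # second encoded feature
        ctx = h.mean(dim=(2, 3), keepdim=True).expand_as(h)      # global context feature
        h = self.gru(h, torch.cat([c, f, ctx], dim=1))           # deep semantic features
        return h, self.flow_head(h), self.mask_head(h), torch.sigmoid(self.weight_head(h))
```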
In one implementation scenario, continuing to refer to Figure 3b, the dynamic update network 303 may also include a static optical flow convolution layer 3035; processing the aforementioned deep semantic features with the static optical flow convolution layer 3035 yields the optical flow calibration data 3066 of the static optical flow 3063. In one implementation scenario, to improve the accuracy of the pose and depth optimization, the reference data may further include a dynamic mask, which can be used to indicate moving objects in the image. For example, if a pixel in the image belongs to a moving object, the value at the corresponding position in the image's dynamic mask may take a first value; conversely, if the pixel does not belong to a moving object, the value at the corresponding position may take a second value, where the first and second values differ (e.g., the first value may be set to 0 and the second value to 1). In the first iteration, the dynamic mask can be initialized as an all-zero matrix. For ease of description, still taking the first image i and the second image j as W*H two-dimensional images, the dynamic mask can be represented as an H*W two-channel image, i.e., the dynamic mask M_dij ∈ R^{H×W×2}. Referring to Figure 3a or Figure 3b, in contrast to the preceding approach of retrieving the target correlation data by searching the feature correlation data 305 and predicting the analysis result from the target correlation data, the static optical flow 3063 and the overall optical flow 3062, the analysis result can instead be predicted from the target correlation data, the static optical flow 3063, the overall optical flow 3062 and the dynamic mask 3061, and the analysis result may include mask calibration data 3064 for the dynamic mask 3061. In this way, the dynamic mask is consulted during the dynamic update process, and since the dynamic mask indicates moving objects in the image, it can provide guidance for the subsequent optical flow decomposition, which helps improve the accuracy of the optimized pose and depth.
In one implementation scenario, still taking the first image i and the second image j as W*H two-dimensional images, the mask calibration data of the dynamic mask may include a mask calibration value for each mask value in the dynamic masks of the first and second images, so the mask calibration data can also be represented as an H*W two-channel image, i.e., the mask calibration data ΔM_dij ∈ R^{H×W×2}. On this basis, as shown in Figure 3b, the dynamic mask 3061 can be added to the mask calibration data 3064 of the dynamic mask to obtain an updated dynamic mask 3065. Accordingly, the dynamic mask that needs to be input at the i-th iteration may be the updated dynamic mask output at the (i-1)-th iteration.
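As a tiny illustration, the mask update just described amounts to an element-wise addition of the predicted mask calibration data to the current dynamic mask; the tensors below are placeholders.

```python
import torch

H, W = 48, 64
dynamic_mask = torch.zeros(H, W, 2)        # first iteration: all-zero matrix M_dij
mask_calibration = torch.zeros(H, W, 2)    # delta M_dij predicted by the update network

# Updated mask = current mask + mask calibration data; the result is the
# dynamic mask fed into the next iteration.
updated_mask = dynamic_mask + mask_calibration
```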
在一个实施场景中,如图3b所示,为了提升图像分析的效率,可以预先训练一个图像分析模型。可以参阅前述相关描述。与前述相关描述不同的是,对于动态更新网络303中的光流编码器3031而言,可以基于静态光流3063、整体光流3062和动态掩膜3061进行编码,得到第二编码特征。In an implementation scenario, as shown in Figure 3b, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. Please refer to the above related descriptions. Different from the foregoing related descriptions, for the optical flow encoder 3031 in the dynamic update network 303, encoding can be performed based on the static optical flow 3063, the overall optical flow 3062 and the dynamic mask 3061 to obtain the second encoding feature.
在一个实施场景中,如图3b所示,为了提升图像分析的效率,可以预先训练一个图像分析模型,具体可以参阅前述相关描述。与前述相关描述不同的是,动态更新网络303还可以包括卷积层,其可以对ConvGRU输出的深层语义特征进行处理,得到动态掩膜3061的掩膜校准数据3064。In an implementation scenario, as shown in Figure 3b, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. For details, please refer to the relevant description above. Different from the foregoing related descriptions, the dynamic update network 303 can also include a convolution layer, which can process the deep semantic features output by ConvGRU to obtain the mask calibration data 3064 of the dynamic mask 3061.
步骤S13:基于静态光流和光流校准数据,对位姿和深度进行优化,得到更新的位姿和更新的深度。Step S13: Based on the static optical flow and optical flow calibration data, optimize the pose and depth to obtain an updated pose and an updated depth.
在一个实施场景中,分析结果还可以包括置信度图,且置信度图包括图像中各像素点的置信度。仍以第一图像i和第二图像j均为H*W的二维图像为例,置信度图可以表示为H*W的二通道图像,即置信度图w ij∈R H×W×2。在得到静态光流的光流校准数据之后,可以基于光流校准数据对第一投影位置进行校准,得到校准位置。其中,第一投影位置为第一图像中像素点基于静态光流投影在第二图像的像素位置。为了便于描述,第一投影位置可以记为p sij,且如前所述,静态光流的光流校准数据可以记为r sij,则校准位置可以表示为p * sij=r sij+p sij,即可以对于图像中各像素点而言,可以直接将其第一投影位置加上该像素点在光流校准数据中查询到的光流校准值即可。在此基础上,可以基于校准位置,优化得到更新的位姿和更新的深度。示例性地,可以基于校准位置p * sij,构建以更新的位姿和更新的深度作为优化对象的优化函数: In an implementation scenario, the analysis results may also include a confidence map, and the confidence map includes the confidence of each pixel in the image. Still taking the first image i and the second image j as two-dimensional images of H*W as an example, the confidence map can be expressed as a two-channel image of H*W, that is, the confidence map w ij ∈R H×W×2 . After obtaining the optical flow calibration data of the static optical flow, the first projection position can be calibrated based on the optical flow calibration data to obtain the calibration position. Wherein, the first projection position is the pixel position of the pixel in the first image projected on the second image based on static optical flow. For the convenience of description, the first projection position can be recorded as p sij , and as mentioned above, the optical flow calibration data of the static optical flow can be recorded as r sij , then the calibration position can be expressed as p * sij =r sij +p sij , That is, for each pixel in the image, its first projection position can be directly added to the optical flow calibration value queried in the optical flow calibration data for the pixel. On this basis, the updated pose and updated depth can be optimized based on the calibration position. For example, based on the calibration position p * sij , an optimization function with the updated pose and updated depth as the optimization object can be constructed:
$$E(\mathbf{G}',\mathbf{d}')=\sum_{(i,j)\in\varepsilon}\left\|p^{*}_{sij}-\Pi_c\big(\mathbf{G}'_{ij}\circ\Pi_c^{-1}(p_i,d'_i)\big)\right\|^{2}_{\Sigma_{ij}}\quad\text{...Formula (5)};$$

$$\Sigma_{ij}=\operatorname{diag}(w_{ij})\quad\text{...Formula (6)};$$

In the above formulas (5) and (6), diag denotes taking the elements on the main diagonal of the matrix, $\mathbf{G}'_{ij}$ denotes the relative pose between the updated pose of the first image and the updated pose of the second image, and $d'_i$ denotes the updated depth of the first image. In addition, the meanings of the projection function $\Pi_c$ and the back-projection function $\Pi_c^{-1}$ can be found in the foregoing related descriptions. $\|\cdot\|_{\Sigma}$ denotes the Mahalanobis distance, for which reference may be made to the relevant technical details of the Mahalanobis distance. $(i,j)\in\varepsilon$ denotes a first image i and a second image j having a co-visibility relationship.
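The confidence-weighted reprojection cost of formulas (5) and (6) can be sketched as below. This is an illustrative sketch only: the pinhole helpers `project` and `back_project`, the variable names, and the per-pixel weight layout are assumptions, and a real solver would optimize pose and depth jointly (e.g. by Gauss-Newton) rather than merely evaluating the cost.

```python
import numpy as np

def project(points_cam, K):
    """Pinhole projection of 3-D camera-frame points (N, 3) to pixel coordinates (N, 2)."""
    z = points_cam[..., 2:3]
    return (points_cam[..., :2] / z) @ np.diag([K[0, 0], K[1, 1]]) + K[:2, 2]

def back_project(pixels, depth, K):
    """Lift pixels (N, 2) with per-pixel depth (N,) to 3-D points in the camera frame."""
    rays = np.concatenate(
        [(pixels - K[:2, 2]) / np.array([K[0, 0], K[1, 1]]), np.ones_like(pixels[..., :1])], -1)
    return rays * depth[..., None]

def residual_cost(p_star, pixels_i, depth_i, R_ij, t_ij, K, w_ij):
    """Confidence-weighted reprojection error, echoing formulas (5) and (6).

    p_star plays the role of the calibrated position p*_sij = p_sij + r_sij,
    and w_ij (shape (N, 2)) plays the role of the diagonal weights Sigma_ij.
    """
    pts_i = back_project(pixels_i, depth_i, K)
    pts_j = pts_i @ R_ij.T + t_ij          # relative pose applied to the lifted points
    diff = p_star - project(pts_j, K)
    return float(np.sum(w_ij * diff ** 2))

# Tiny usage example with synthetic data.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pixels = np.random.rand(50, 2) * [64, 48]
depth = np.random.rand(50) * 4 + 1
p_star = pixels + np.random.randn(50, 2) * 0.1     # calibrated positions
w = np.random.rand(50, 2)                          # per-pixel confidence weights
R, t = np.eye(3), np.array([0.05, 0.0, 0.0])
print(residual_cost(p_star, pixels, depth, R, t, K, w))
```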
在一个实施场景中,请结合参阅图3b,如前所述,为了提升图像分析的效率,可以预先训练一 个图像分析模型,可以参阅前述相关描述。与前述描述不同的是,动态更新网络303可以包括卷积层,用于对ConvGRU提取得到的深层语义特征进行处理,得到置信度图w ij3034。 In an implementation scenario, please refer to Figure 3b. As mentioned above, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. Please refer to the relevant description above. Different from the foregoing description, the dynamic update network 303 may include a convolutional layer for processing the deep semantic features extracted by ConvGRU to obtain the confidence map w ij 3034 .
In one implementation scenario, the Gauss-Newton algorithm can be used to solve the optimization and obtain the changes in depth and pose. The Schur complement can be used to first compute the change in pose and then compute the change in depth. The change in depth can be denoted as Δd, and the change in pose as Δξ. On this basis, for the depth, the following formula (7) can be used to obtain the updated depth:
Ξ (k+1)=ΔΞ (k)(k)……公式(7); Ξ (k+1) =ΔΞ (k)(k) ...Formula (7);
上述公式(7)中,Ξ (k)表示输入第k次循环迭代的深度,ΔΞ (k)表示第k次循环迭代输出的深度的变化量,Ξ (k+1)表示输入第k+1次循环迭代的深度,即更新的深度。即对于深度而言,可以直接将深度加上深度的变化量,得到更新的深度。与深度不同的是,可以采用如下方式得到更新的位姿: In the above formula (7), Ξ (k) represents the input depth of the k-th loop iteration, ΔΞ (k) represents the change in depth of the k-th loop iteration output, Ξ (k+1) represents the input k+1 The depth of the loop iteration, that is, the depth of the update. That is, for the depth, the depth can be directly added to the depth change to obtain the updated depth. Different from depth, the updated pose can be obtained in the following ways:
Figure PCTCN2022119646-appb-000011
Figure PCTCN2022119646-appb-000011
上述公式(8)中,G (k)表示输入第k次循环迭代的位姿,G (k+1)表示输入第k+1次循环迭代的位姿,即更新的位姿。也就是说,对于位姿而言,需要基于位姿的变化量在SE3流形对位姿进行拉伸。 In the above formula (8), G (k) represents the input pose of the k-th loop iteration, and G (k+1) represents the input pose of the k+1-th loop iteration, that is, the updated pose. In other words, for the pose, the pose needs to be stretched in the SE3 manifold based on the change in the pose.
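The two update rules of formulas (7) and (8) can be sketched as follows: the depth is updated by simple addition, while the pose increment is applied through an SE(3) exponential map. The exp-map implementation below is a standard textbook form included only for illustration; it is not asserted to be the exact parameterization of the disclosed embodiments.

```python
import numpy as np

def so3_hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def se3_exp(xi):
    """Exponential map from a 6-vector (rho, phi) to a 4x4 SE(3) matrix."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    Phi = so3_hat(phi)
    if theta < 1e-8:
        R, V = np.eye(3) + Phi, np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * Phi
             + (1 - np.cos(theta)) / theta ** 2 * Phi @ Phi)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta ** 2 * Phi
             + (theta - np.sin(theta)) / theta ** 3 * Phi @ Phi)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

# Formula (7): the depth is updated by simple addition.
depth = np.full((4, 4), 2.0)
delta_depth = 0.1 * np.ones((4, 4))
depth_updated = depth + delta_depth

# Formula (8): the pose is retracted on the SE(3) manifold with the increment delta_xi.
pose = np.eye(4)                                          # G^(k)
delta_xi = np.array([0.01, 0.0, 0.0, 0.0, 0.002, 0.0])
pose_updated = se3_exp(delta_xi) @ pose                   # G^(k+1)
```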
在一个实施场景中,与前述方式不同的是,参考数据还可以包括动态掩膜,且分析结果还可以包括动态掩膜的掩膜校准数据,其可以参阅前述相关描述。在此基础上,可以基于动态掩膜、掩膜校准数据和置信度图进行融合,得到重要度图,并基于光流校准数据对第一投影位置进行校准,得到校准位置。基于此,再基于校准位置和重要度图,优化得到更新的位姿和更新的深度。上述方式,在位姿和深度的优化过程中,引入用于指示运动对象的动态掩膜,并结合置信度图得到重要度图,以为后续光流分解提供指导,有利于提升优化位姿和深度的精度。In an implementation scenario, different from the above method, the reference data may also include a dynamic mask, and the analysis result may also include mask calibration data of the dynamic mask, for which please refer to the above related description. On this basis, the dynamic mask, mask calibration data and confidence map can be fused to obtain the importance map, and the first projection position can be calibrated based on the optical flow calibration data to obtain the calibration position. Based on this, based on the calibration position and importance map, the updated pose and updated depth are optimized. In the above method, during the optimization process of pose and depth, dynamic masks used to indicate moving objects are introduced, and the importance map is obtained by combining the confidence map to provide guidance for subsequent optical flow decomposition, which is beneficial to improving the optimization of pose and depth. accuracy.
In one implementation scenario, as mentioned above, the optical flow calibration data includes the calibration optical flow of the pixels in the first image; the calibration optical flow of a pixel in the first image can then be added to that pixel's first projection position in the second image to obtain the pixel's calibration position. Reference may be made to the foregoing related descriptions. In this way, by directly predicting the calibration optical flow of the pixels in the first image, the calibration position of a pixel after motion caused only by the camera device can be obtained through a simple addition, which greatly reduces the computational complexity of determining that position and helps improve the efficiency of optimizing the pose and depth.
在一个实施场景中,可以基于掩膜校准数据对动态掩膜进行校准,得到校准掩膜,且校准掩膜包括图像中像素点与运动对象的相关度,相关度与图像中像素点属于运动对象的可能性正相关,即像素点属于运动对象的可能性越高,相关度越大,反之,像素点属于运动对象的可能性越低,相关度越小。在此基础上,可以基于置信度图和校准掩膜进行融合,得到重要度图。其可以将置信度图和校准掩膜进行加权,并进行归一化处理,得到重要度图。其中,第一图像i和第二图像j的重要度图w dij可以表示为: In an implementation scenario, the dynamic mask can be calibrated based on the mask calibration data to obtain a calibration mask, and the calibration mask includes the correlation between the pixels in the image and the moving objects, and the correlation is related to the correlation between the pixels in the image belonging to the moving objects. The possibility is positively correlated, that is, the higher the possibility that a pixel belongs to a moving object, the greater the correlation. On the contrary, the lower the possibility that a pixel belongs to a moving object, the smaller the correlation. On this basis, the importance map can be obtained by fusion based on the confidence map and the calibration mask. It can weight and normalize the confidence map and calibration mask to obtain the importance map. Among them, the importance map w dij of the first image i and the second image j can be expressed as:
$$w_{dij}=\mathrm{sigmoid}\big(w_{ij}+(1-M_{dij})\cdot\eta\big)\quad\text{...Formula (9)};$$
In the above formula (9), sigmoid denotes the normalization function, and $M_{dij}$ denotes the updated dynamic mask, obtained by adding the mask calibration data $\Delta M_{dij}$ to the dynamic mask $M_{dij}$; that is, the updated dynamic mask can be obtained by analogy with the above formula (7), except that in this case $\Xi^{(k)}$ in formula (7) denotes the dynamic mask input to the k-th loop iteration, $\Delta\Xi^{(k)}$ denotes the mask calibration data output by the k-th loop iteration, and $\Xi^{(k+1)}$ denotes the dynamic mask input to the (k+1)-th loop iteration, i.e. the updated dynamic mask. In addition, $1-M_{dij}$ denotes the calibration mask, $w_{ij}$ denotes the confidence map, and $\eta$ denotes a weighting coefficient, which may for example be set to 10 or 20 and is not limited here. In this way, the importance of a pixel is measured jointly from the confidence of the pixel itself and the correlation between the pixel and moving objects, which helps improve the accuracy of the subsequent optimization of pose and depth.
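The fusion of formula (9) can be sketched as follows; the sigmoid and the weighting coefficient follow the text above, while the array shapes and random inputs are assumptions used only for illustration.

```python
import numpy as np

def importance_map(confidence, dynamic_mask, eta=10.0):
    """Fuse the confidence map with the calibration mask, as in formula (9).

    (1 - dynamic_mask) acts as the calibration mask: it is large for pixels
    likely to belong to moving objects, and eta controls its weight.
    """
    return 1.0 / (1.0 + np.exp(-(confidence + (1.0 - dynamic_mask) * eta)))

H, W = 48, 64
w_ij = np.random.randn(H, W, 2).astype(np.float32)    # confidence map
M_dij = np.random.rand(H, W, 2).astype(np.float32)    # updated dynamic mask
w_dij = importance_map(w_ij, M_dij, eta=10.0)
```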
在一个实施场景中,在得到校准位置和重要度图之后,可以参照上述公式(5)和公式(6)所提供的实施方式,构建优化函数,在此基础上,即可求解得到更新的深度和更新的位姿。需要说明的是,重要度图去除了对运动对象的抑制,增加了优化函数中可用像素点的数量。此外,置信度图可以负责去除一些其他影响计算的像素点,如光照效果不好等其他原因的像素点。In an implementation scenario, after obtaining the calibration position and importance map, the optimization function can be constructed by referring to the implementation provided by the above formula (5) and formula (6). On this basis, the updated depth can be obtained by solving and updated poses. It should be noted that the importance map removes the suppression of moving objects and increases the number of pixels available in the optimization function. In addition, the confidence map can be responsible for removing some other pixels that affect the calculation, such as pixels caused by poor lighting effects and other reasons.
In one implementation scenario, after the updated pose and the updated depth are obtained, a new round of loop iteration can be started. Referring to Figures 3a and 3b, the analysis result may further include a dynamic optical flow 3081, and the dynamic optical flow is caused by the motion of the photographed object. On this basis, an updated static optical flow 3082 can be obtained based on the updated pose 3072 and the updated depth 3073; an updated overall optical flow 3071 can be obtained based on the dynamic optical flow 3081 and the updated static optical flow 3082; updated optical flow data can be obtained based on the updated static optical flow 3082 and the updated overall optical flow 3071; and updated reference data can be obtained based on the updated pose 3072 and the updated depth 3073. The foregoing step of predicting the analysis result based on the image sequence and the optical flow data, together with the subsequent steps, can then be re-executed until the number of re-executions satisfies a preset condition. In this way, during image analysis, the overall optical flow is decomposed into a static optical flow and a dynamic optical flow, and the optimization steps are iterated multiple times to overcome the limited effect of a single optimization; using the old variables as input to guide the generation of the new variables makes the input features more diverse, which helps improve the accuracy of the pose and depth.
在一个实施场景中,可以基于更新的位姿、更新的深度和第一图像中像素点的像素位置进行投影,得到第一图像中像素点投影在第二图像的第二投影位置,并基于第一图像中像素点投影在第二图像的第二投影位置和第一图像中像素点在第二图像中的对应位置之间的差异,得到更新的静态光流,且对应位置为在摄像器件未运动的情况下,第一图像中像素点所属的空间点投影在第二图像的像素位置。其可以参阅前述公式(3)及其相关描述。上述方式,在循环迭代过程中,通过更新的位姿和更新的深度重新投影,并在摄像器件未运动的假设前提下,确定出第一图像中像素点所属的空间点投影在第二图像的像素位置,从而结合重新投影位置确定出更新的静态光流,有利于提升更新的静态光流的准确性。In an implementation scenario, projection can be performed based on the updated pose, the updated depth and the pixel position of the pixel in the first image, to obtain the second projection position of the pixel in the first image projected on the second image, and based on the The difference between the second projection position of the pixels in one image in the second image and the corresponding position of the pixels in the first image in the second image is used to obtain an updated static optical flow, and the corresponding position is the position of the pixel that is not in the camera device. In the case of motion, the spatial point to which the pixel point in the first image belongs is projected on the pixel position of the second image. You can refer to the aforementioned formula (3) and its related descriptions. In the above method, during the loop iteration process, through the updated pose and updated depth re-projection, and under the assumption that the camera device does not move, it is determined that the spatial point to which the pixel point in the first image belongs is projected on the second image. The pixel position is thus combined with the reprojection position to determine the updated static optical flow, which is beneficial to improving the accuracy of the updated static optical flow.
在一个实施场景中,在更新的静态光流之后,可以直接将分析结果中预测得到的动态光流加上更新的静态光流,得到更新的整体光流,即:In an implementation scenario, after the updated static optical flow, the dynamic optical flow predicted in the analysis results can be directly added to the updated static optical flow to obtain the updated overall optical flow, that is:
$$F_{ot}=F_{st}+F_{dt}\quad\text{...Formula (10)};$$
上述公式(10)中,F st表示更新的静态光流,F dt表示分析结果中预测得到的动态光流,F ot表示更新的整体光流。上述方式,将预测得到的动态光流和更新的静态光流相加,即可得到更新的整体光流,即通过简单加法运算即可确定出更新的整体光流,有利于提升优化位姿和深度的效率。 In the above formula (10), F st represents the updated static optical flow, F dt represents the dynamic optical flow predicted in the analysis results, and F ot represents the updated overall optical flow. In the above method, the updated overall optical flow can be obtained by adding the predicted dynamic optical flow and the updated static optical flow. That is, the updated overall optical flow can be determined through a simple addition operation, which is beneficial to improving the optimization pose and posture. Deep efficiency.
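The assembly of the next iteration's flow inputs can be sketched as below: the updated static flow is the reprojected position (obtained with the updated pose and depth) minus the correspondence under the no-camera-motion assumption, and the overall flow of formula (10) is its sum with the predicted dynamic flow. The grid construction and the stand-in reprojection are assumptions for illustration.

```python
import numpy as np

def updated_static_flow(p_proj_updated, p_static_corr):
    """Static flow after the update: reprojected positions (using the updated pose
    and depth) minus the correspondence assuming the camera device did not move."""
    return p_proj_updated - p_static_corr

def updated_overall_flow(static_flow, dynamic_flow):
    """Formula (10): F_ot = F_st + F_dt."""
    return static_flow + dynamic_flow

H, W = 48, 64
p_grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1).astype(np.float32)
p_proj = p_grid + np.random.randn(H, W, 2) * 0.5   # stand-in for the reprojection
F_st = updated_static_flow(p_proj, p_grid)          # correspondence = same pixel if camera is still
F_dt = np.random.randn(H, W, 2).astype(np.float32)  # predicted dynamic flow
F_ot = updated_overall_flow(F_st, F_dt)
```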
在一个实施场景中,预设条件可以设置为包括:重新执行的次数不小于预设阈值(如,9、10等),从而可以通过多次循环迭代,不断优化位姿和深度,提升位姿和深度的精度。In an implementation scenario, the preset conditions can be set to include: the number of re-executions is not less than the preset threshold (such as 9, 10, etc.), so that the pose and depth can be continuously optimized and the pose can be continuously optimized through multiple loop iterations. and depth accuracy.
在一个实施场景中,请参阅图4a和图4b,图4a是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹一实施例的对比示意图,图4b是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹另一实施例的对比示意图。如图4a所示为在自动驾驶场景下的计算机视觉算法评测数据集(KITTI数据集)中图像序列09的测试结果,图4b所示为KITTI数据集中图像序列10的测试结果。其中,图像序列09和图像序列10中均包含运动对象,属于难度较大的动态场景,且虚线表示摄像器件在拍摄过程中的实际轨迹,深色线条表示现有技术确定轨迹,浅色线条表示通过本公开图像分析方法确定轨迹。如图4a所示,在动态场景下,本公开图像分析方法其精度几乎是现有技术的两倍,且在KITTI数据集图像序列10的测试场景中,本公开图像分析方法确定轨迹几乎与实际轨迹重合。In an implementation scenario, please refer to Figure 4a and Figure 4b. Figure 4a is a schematic diagram comparing the trajectory determined by the image analysis method of the present disclosure with the actual trajectory and the trajectory determined by the prior art. Figure 4b is the trajectory determined by the image analysis method of the disclosure. A schematic diagram comparing another embodiment of the actual trajectory and the trajectory determined by the prior art. Figure 4a shows the test results of image sequence 09 in the computer vision algorithm evaluation data set (KITTI data set) in the autonomous driving scenario, and Figure 4b shows the test results of image sequence 10 in the KITTI data set. Among them, both image sequence 09 and image sequence 10 contain moving objects, which are difficult dynamic scenes, and the dotted line represents the actual trajectory of the camera device during the shooting process, the dark line represents the trajectory determined by the existing technology, and the light line represents Trajectories are determined by the disclosed image analysis method. As shown in Figure 4a, in dynamic scenarios, the accuracy of the disclosed image analysis method is almost twice that of the existing technology, and in the test scene of the KITTI data set image sequence 10, the trajectory determined by the disclosed image analysis method is almost the same as the actual one. The trajectories coincide.
在一个实施场景中,请参阅图5a、图5b和图5c,图5a是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图,图5b是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图,图5c是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图。如图5a所示为KITTI数据集中图像序列01的测试结果,图5b所示为KITTI数据集中图像序列02的测试结果,图5c所示为KITTI数据集中图像序列20的测试结果。其中,图像序列01、图像序列02和图像序列20中均包含运动对象,属于难度较大的动态场景,其中,虚线表示摄像器件在拍摄过程中的实际轨迹,深色线条表示现有技术确定轨迹,浅色线条表示通过本公开图像分析方法确定轨迹。如图5a和图5c所示,在KITTI数据集中图像序列01和图像序列20的测试场景中,本公开图像分析方法确定轨迹、现有技术确定轨迹均与实际轨迹保持较为一致的趋势,但本公开图像分析方法确定轨迹更接近于实际轨迹;与此同时,如图5b所示,在KITTI数据集中图像序列02的测试场景中,本公开图像分析方法确定轨迹与实际轨迹保持较为一致的趋势,而现有技术确定轨迹与实际轨迹已经难以保持较为一致的趋势,在多处出现严重失真。In an implementation scenario, please refer to Figure 5a, Figure 5b and Figure 5c. Figure 5a is a schematic diagram comparing the trajectory determined by the image analysis method of the present disclosure with the actual trajectory and the trajectory determined by the prior art in another embodiment. Figure 5b is an image of the disclosure. Figure 5c is a schematic diagram comparing the trajectory determined by the image analysis method and the actual trajectory and the trajectory determined by the prior art according to another embodiment of the disclosure. Figure 5a shows the test results of image sequence 01 in the KITTI data set, Figure 5b shows the test results of image sequence 02 in the KITTI data set, and Figure 5c shows the test results of image sequence 20 in the KITTI data set. Among them, image sequence 01, image sequence 02 and image sequence 20 all contain moving objects, which are difficult dynamic scenes. Among them, the dotted line represents the actual trajectory of the camera device during the shooting process, and the dark line represents the trajectory determined by the existing technology. , the light-colored lines represent the trajectories determined by the image analysis method of the present disclosure. As shown in Figure 5a and Figure 5c, in the test scenarios of image sequence 01 and image sequence 20 in the KITTI data set, the trajectory determined by the image analysis method of the present disclosure and the trajectory determined by the prior art maintain a relatively consistent trend with the actual trajectory, but this method The trajectory determined by the public image analysis method is closer to the actual trajectory; at the same time, as shown in Figure 5b, in the test scene of image sequence 02 in the KITTI data set, the trajectory determined by the disclosed image analysis method maintains a relatively consistent trend with the actual trajectory. However, it is difficult to maintain a consistent trend between the trajectory determined by the existing technology and the actual trajectory, and serious distortion occurs in many places.
本公开实施例可以应用于SLAM系统前端,以实时确定图像的位姿和深度,或者也可以应用于SLAM系统后端,以对各图像的位姿和深度进行全局优化。其中,SLAM系统可以包含前端线程和后端线程,两者可以同时运行。其中,前端线程的任务是接收新的图像,并选择关键帧,在此基础上,通过本公开实施例获取关键帧的位姿、深度等变量结果,后端线程的任务是在全局范围内通过本公开实施例对各关键帧的位姿、深度等变量结果进行全局优化,从而在此基础上,可以构建出摄像器件所扫描环境的三维地图。The disclosed embodiments can be applied to the front end of the SLAM system to determine the pose and depth of the image in real time, or can also be applied to the back end of the SLAM system to globally optimize the pose and depth of each image. Among them, the SLAM system can include front-end threads and back-end threads, both of which can run at the same time. Among them, the task of the front-end thread is to receive new images and select key frames. On this basis, obtain the pose, depth and other variable results of the key frames through the embodiment of the present disclosure. The task of the back-end thread is to globally pass Embodiments of the present disclosure perform global optimization on variable results such as pose and depth of each key frame, so that on this basis, a three-dimensional map of the environment scanned by the camera device can be constructed.
在一个实施场景中,本公开实施例SLAM系统在初始化时,会不断收集图像,直到收集M(如,12等)帧为止。其中,本公开实施例SLAM系统仅在当前帧估计的平均静态光流大于第一数值(如,16等)像素时才留存当前帧。一旦累积收集M帧,SLAM系统则会在这些帧之间创建边来初始化因 子图304(如图3a所示)。因子图304中节点表示各帧图像,创建边的节点所对应的图像是时间上差距应在第二数值(如,3等)个时间步之内。之后,SLAM系统会采用本公开图像分析方法对图像序列中图像的位姿和深度进行动态更新。In an implementation scenario, during initialization, the SLAM system according to the embodiment of the present disclosure will continuously collect images until M (eg, 12, etc.) frames are collected. Among them, the SLAM system of the embodiment of the present disclosure only retains the current frame when the estimated average static optical flow of the current frame is greater than a first numerical value (eg, 16, etc.) pixels. Once M frames are accumulated, the SLAM system creates edges between these frames to initialize the factor graph 304 (as shown in Figure 3a). The nodes in the factor graph 304 represent each frame image, and the time difference between the images corresponding to the nodes that create edges should be within a second numerical value (eg, 3, etc.) time steps. Afterwards, the SLAM system will use the disclosed image analysis method to dynamically update the pose and depth of the images in the image sequence.
在一个实施场景中,本公开实施例SLAM系统前端可以直接处理传入的图像序列,其在相互可见的关键帧之间维护一组关键帧和一个存储边的因子图。关键帧的位姿和深度会不断进行优化。当新的一帧输入时,SLAM系统会提取其特征图,然后使用L(如,3等)帧最邻近的帧构建因子图。如前所述,衡量帧间距离的方式可以为帧间的平均静态光流。新的输入帧对应的位姿可以由线性运动模型赋予初值。随后,SLAM系统通过几次循环迭代以优化关键帧对应的位姿和深度。其中,可以固定前两帧对应的位姿以消除尺度不确定性。处理完新帧之后,可以基于静态光流的距离来删除冗余帧。如果没有合适的帧可以删除,SLAM系统可以删除最旧的关键帧。In one implementation scenario, the front end of the SLAM system of the embodiment of the present disclosure can directly process the incoming image sequence, and it maintains a set of key frames and a factor graph that stores edges between mutually visible key frames. The pose and depth of keyframes are continuously optimized. When a new frame is input, the SLAM system extracts its feature map and then uses the nearest neighbor frames of L (e.g., 3, etc.) frames to construct a factor map. As mentioned before, the distance between frames can be measured as the average static optical flow between frames. The pose corresponding to the new input frame can be given an initial value by the linear motion model. Subsequently, the SLAM system iterates through several loops to optimize the pose and depth corresponding to the key frame. Among them, the poses corresponding to the first two frames can be fixed to eliminate scale uncertainty. After processing new frames, redundant frames can be deleted based on distance from static optical flow. If there are no suitable frames to delete, the SLAM system can delete the oldest keyframes.
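The front-end policy described above can be sketched as a toy selection routine. The distance function, the threshold of 16 pixels, the window of 3 nearest neighbours, and the synthetic "trajectory" below are illustrative assumptions; they only show the shape of the keyframe-retention and edge-building logic, not the front end of the disclosed embodiments.

```python
import numpy as np

def select_keyframes_and_edges(frames, flow_distance, keep_thresh=16.0, num_neighbours=3):
    """Keep a frame only if its mean static flow w.r.t. the last kept keyframe
    exceeds keep_thresh, and connect each new keyframe to its num_neighbours
    closest keyframes (closeness measured by mean static flow).

    flow_distance(a, b) returns the mean static-flow magnitude (in pixels)
    between frames a and b; it is supplied by the caller.
    """
    keyframes, edges = [], []
    for frame in frames:
        if keyframes and flow_distance(keyframes[-1], frame) <= keep_thresh:
            continue  # too little camera motion: skip this frame
        dists = sorted((flow_distance(kf, frame), i) for i, kf in enumerate(keyframes))
        for _, i in dists[:num_neighbours]:
            edges.append((i, len(keyframes)))
        keyframes.append(frame)
    return keyframes, edges

# Usage with a synthetic 1-D "trajectory": the flow distance is mocked as the
# absolute difference of camera positions, purely for illustration.
positions = np.cumsum(np.random.rand(30) * 12)
kf, edges = select_keyframes_and_edges(list(positions), lambda a, b: abs(a - b))
print(len(kf), edges[:5])
```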
在一个实施场景中,本公开实施例SLAM系统后端可以在所有关键帧的集合上进行全局优化。可以使用各关键帧之间的平均静态光流作为帧间的距离,生成一个帧间距离矩阵以方便查找。在每次循环迭代过程中,可以根据距离矩阵重建因子图。示例性地,可以首先选择在时间上相邻的帧所组成的边,加入到因子图中;然后根据距离矩阵选择新边,距离越小越优先考虑。除此之外,当两条边所对应的帧的索引相邻过近时,可以将这些边对应的帧间距加大,从而抑制这些边的效果;最后可以使用本公开实施例对因子图中所有边进行优化,以更新所有帧的位姿和深度。In an implementation scenario, the backend of the SLAM system according to the embodiment of the present disclosure can perform global optimization on a set of all key frames. The average static optical flow between key frames can be used as the distance between frames to generate an inter-frame distance matrix for easy search. During each loop iteration, the factor graph can be reconstructed based on the distance matrix. For example, you can first select edges composed of temporally adjacent frames and add them to the factor graph; then select new edges based on the distance matrix, with smaller distances being given priority. In addition, when the indexes of the frames corresponding to two edges are too close to each other, the frame spacing corresponding to these edges can be increased to suppress the effect of these edges; finally, the embodiments of the present disclosure can be used to modify the factor graph All edges are optimized to update pose and depth for all frames.
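The back-end edge selection can be sketched as below: temporally adjacent edges are added first, and remaining edges are taken in order of increasing inter-frame distance, while candidates whose endpoints nearly coincide with an already selected edge are skipped, approximating the distance-inflation suppression described above. The distance matrix, thresholds and skipping rule are assumptions for illustration.

```python
import numpy as np

def rebuild_factor_graph(dist, max_edges=40, min_index_gap=2):
    """Select factor-graph edges from a symmetric inter-frame distance matrix,
    where dist[i, j] stands for the mean static flow between keyframes i and j."""
    n = dist.shape[0]
    edges = [(i, i + 1) for i in range(n - 1)]  # temporally adjacent edges first
    candidates = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 2, n))
    for _, i, j in candidates:
        if len(edges) >= max_edges:
            break
        # Skip candidates too close (in frame index) to an edge already kept.
        if any(abs(i - a) < min_index_gap and abs(j - b) < min_index_gap for a, b in edges):
            continue
        edges.append((i, j))
    return edges

rng = np.random.default_rng(0)
d = rng.random((10, 10))
d = (d + d.T) / 2
print(rebuild_factor_graph(d))
```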
在一个实施场景中,请结合参阅图5d,图5d是本公开图像分析方法应用于各种数据集的地图重建示意图。如图5d所示,在诸如KITTI、Virtual KITTI2(即VKITTI2)等存在运动物体的自动驾驶场景的数据集,以及诸如EuRoc等存在剧烈运动且具有显著光照变化的无人机场景的数据集,以及诸如TUM RGB-D等存在运动模糊且剧烈旋转的手持式SLAM的数据集,本公开实施例在上述数据集上均可以得到较好的推广应用。In an implementation scenario, please refer to Figure 5d, which is a schematic diagram of map reconstruction of the image analysis method of the present disclosure applied to various data sets. As shown in Figure 5d, in the data sets of autonomous driving scenes with moving objects such as KITTI and Virtual KITTI2 (i.e. VKITTI2), as well as the data sets of UAV scenes with violent motion and significant illumination changes such as EuRoc, and For handheld SLAM data sets with motion blur and violent rotation, such as TUM RGB-D, the embodiments of the present disclosure can be well promoted and applied on the above data sets.
此外,本公开实施例除了应用于SLAM系统,还可以应用于运动分割任务,即分割出图像中的运动对象,且本公开实施例具有显著分割效果。其中,在执行运动分割任务过程中,只需为运动设置一个阈值,并将大于该阈值的动态场的像素点可视化,即可获得运动分割结果。请结合参阅图5e,图5e是本公开图像分析方法应用于运动分割任务的示意图。如图5e所示,左侧一列表示真实动态掩膜,右侧一列表示预测动态掩膜,由图5e可见,由本公开实施例预测出来的动态掩膜已经十分接近于真实的动态掩膜,即本公开实施例在运动对象分割任务上可以取得显著效果。同时本公开实施例还可以在应用于AR,请结合参阅图5f,图5f是本公开图像分析方法与现有技术分别应用于AR的对比示意图。如图5f所示,右下角表示摄像器件拍摄得到的原始图像501,左上角表示在原始图像中添加虚拟物体(如虚线框所含的树)的期望效果502,右上角表示本公开实施例在原始图像中添加虚拟物体的效果示意503,左下角表示现有技术在原始图像中添加虚拟物体的效果示意504。显然,与现有技术相比,本公开在运动场景中通过精准定位所添加虚拟物体之后的效果示意与期望效果较为接近,而通过现有技术在原始图像中添加虚拟物体产生了严重漂移。In addition, in addition to being applied to SLAM systems, the embodiments of the present disclosure can also be applied to motion segmentation tasks, that is, to segment moving objects in images, and the embodiments of the present disclosure have significant segmentation effects. Among them, during the execution of the motion segmentation task, you only need to set a threshold for the motion and visualize the pixels of the dynamic field greater than the threshold to obtain the motion segmentation result. Please refer to Figure 5e, which is a schematic diagram of the image analysis method of the present disclosure applied to the motion segmentation task. As shown in Figure 5e, the column on the left represents the real dynamic mask, and the column on the right represents the predicted dynamic mask. As can be seen from Figure 5e, the dynamic mask predicted by the embodiment of the present disclosure is very close to the real dynamic mask, that is, Embodiments of the present disclosure can achieve remarkable results in moving object segmentation tasks. At the same time, embodiments of the present disclosure can also be applied to AR. Please refer to FIG. 5f. FIG. 5f is a schematic comparison diagram of the image analysis method of the present disclosure and the prior art applied to AR respectively. As shown in Figure 5f, the lower right corner represents the original image 501 captured by the camera device, the upper left corner represents the desired effect 502 of adding a virtual object (such as the tree contained in the dotted box) in the original image, and the upper right corner represents the embodiment of the present disclosure. The effect of adding a virtual object to the original image is shown 503. The lower left corner shows the effect of adding a virtual object to the original image using the prior art 504. Obviously, compared with the existing technology, the effect of the present disclosure after adding virtual objects through precise positioning in a sports scene is closer to the expected effect. However, adding virtual objects to the original image through the existing technology produces serious drift.
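The motion-segmentation use described above only requires thresholding the dynamic field; a minimal sketch follows, where the threshold value and array shapes are illustrative assumptions.

```python
import numpy as np

def motion_segmentation(dynamic_flow, threshold=1.0):
    """Mark as 'moving' every pixel whose dynamic-flow magnitude exceeds the threshold."""
    magnitude = np.linalg.norm(dynamic_flow, axis=-1)
    return magnitude > threshold

H, W = 48, 64
F_d = np.random.randn(H, W, 2).astype(np.float32) * 0.8
moving_mask = motion_segmentation(F_d, threshold=1.0)
print(moving_mask.mean())  # fraction of pixels segmented as moving
```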
本公开实施例通过光流分解实现即使在运动场景中也能够精准定位,并可以广泛应用于诸如上述SLAM系统、运动分割任务、场景编辑(如图5f所示的AR应用)等。Embodiments of the present disclosure achieve precise positioning even in moving scenes through optical flow decomposition, and can be widely used in such things as the above-mentioned SLAM system, motion segmentation tasks, scene editing (AR applications as shown in Figure 5f), etc.
上述方案,通过模仿人类感知现实世界的方式,将整体光流视为由摄像器件运动和拍摄对象运动共同引起,并在图像分析过程中,参考整体光流和由摄像器件运动引起的静态光流,预测出静态光流的光流校准数据,从而能够在后续位姿和深度优化过程中,结合静态光流及其光流校准数据尽可能地降低拍摄对象运动所导致的影响,进而能够提升位姿和深度的精度。The above scheme, by imitating the way humans perceive the real world, regards the overall optical flow as caused by the movement of the camera device and the movement of the photographed object, and during the image analysis process, reference is made to the overall optical flow and the static optical flow caused by the movement of the camera device , predict the optical flow calibration data of the static optical flow, so that in the subsequent pose and depth optimization process, the static optical flow and its optical flow calibration data can be combined to reduce the impact caused by the motion of the subject as much as possible, thereby improving the position. pose and depth accuracy.
请参阅图6,图6是本公开图像分析模型的训练方法一实施例的流程示意图。可以包括如下步骤:Please refer to FIG. 6 , which is a schematic flow chart of an embodiment of the training method of the image analysis model of the present disclosure. May include the following steps:
步骤S61:获取样本图像序列、样本光流数据和样本图像序列中各个样本图像的样本参考数据。Step S61: Obtain the sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence.
本公开实施例中,各个样本图像包括具有共视关系的第一样本图像和第二样本图像,样本光流数据包括第一样本图像与第二样本图像之间的样本静态光流和样本整体光流,样本静态光流由摄像器件运动引起,样本整体光流由摄像器件运动和拍摄对象运动共同引起,且样本参考数据包括样本位姿和样本深度。其可参阅上述关于“获取图像序列、光流数据和图像序列中各个图像的参考数据”的描述。In the embodiment of the present disclosure, each sample image includes a first sample image and a second sample image that have a common view relationship, and the sample optical flow data includes the sample static optical flow and sample between the first sample image and the second sample image. Overall optical flow, the static optical flow of the sample is caused by the motion of the camera device, the overall optical flow of the sample is caused by the motion of the camera device and the motion of the photographed object, and the sample reference data includes the sample pose and sample depth. For this, please refer to the above description about "obtaining the image sequence, optical flow data and reference data of each image in the image sequence".
步骤S62:基于图像分析模型对样本图像序列和样本光流数据进行分析预测,得到样本分析结果。Step S62: Analyze and predict the sample image sequence and sample optical flow data based on the image analysis model to obtain sample analysis results.
本公开实施例中,样本分析结果包括样本静态光流的样本光流校准数据。其可以参阅前述公开实施例中关于“基于图像序列和光流数据,预测得到分析结果”相关描述。In the embodiment of the present disclosure, the sample analysis results include sample optical flow calibration data of the sample static optical flow. For details, please refer to the relevant description of "predicting the analysis results based on the image sequence and optical flow data" in the aforementioned disclosed embodiments.
步骤S63:基于样本静态光流和样本光流校准数据,对样本位姿和样本深度进行优化,得到更新的样本位姿和更新的样本深度。Step S63: Based on the sample static optical flow and the sample optical flow calibration data, optimize the sample pose and sample depth to obtain an updated sample pose and an updated sample depth.
这里可以参阅前述公开实施例中关于“基于静态光流和光流校准数据,对位姿和深度进行优化, 得到更新的位姿和更新的深度”相关描述。Here, please refer to the relevant description of "optimizing the pose and depth based on the static optical flow and optical flow calibration data to obtain an updated pose and an updated depth" in the aforementioned disclosed embodiments.
步骤S64:基于更新的样本位姿和更新的样本深度进行损失度量,得到图像分析模型的预测损失。Step S64: Perform loss measurement based on the updated sample pose and updated sample depth to obtain the predicted loss of the image analysis model.
在一个实施场景中,与前述公开实施例类似地,样本参考数据还可以包括样本动态掩膜,样本动态掩膜用于指示样本图像中的运动对象,样本分析结果还包括样本动态光流和样本动态掩膜的样本掩膜校准数据,且样本动态光流由拍摄对象运动引起,预测损失可以包括掩膜预测损失。为了便于描述,掩膜预测损失可以记为L art_mask。此外,关于样本动态掩膜、样本动态光流、样本掩膜校准数据的具体含义,可以分别参阅前述公开实施例中关于动态掩膜、动态光流、掩膜校准数据的相关描述。在此基础上,可以先基于样本动态光流、更新的样本位姿和更新的样本深度,得到更新的样本整体光流。基于此,一方面可以基于样本掩膜校准数据和样本动态掩膜,得到样本动态掩膜在模型维度更新得到的第一预测掩膜,另一方面可以基于更新的样本整体光流、更新的样本位姿和更新的样本深度,得到样本动态掩膜在光流维度更新得到的第二预测掩膜,从而可以基于第一预测掩膜和第二预测掩膜之间的差异,得到掩膜预测损失。上述方式,在训练过程中不具备真实动态掩膜的情况下,也能够通过更新的样本整体光流、更新的样本位姿和更新的样本深度构造出动态掩膜标签,以实现自监督训练,有利于在提升模型性能的前提下,降低训练过程对样本标注的要求。 In one implementation scenario, similar to the aforementioned disclosed embodiments, the sample reference data may also include a sample dynamic mask, which is used to indicate moving objects in the sample image, and the sample analysis results also include sample dynamic optical flow and sample Sample mask calibration data of the dynamic mask, and the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss may include a mask prediction loss. For ease of description, the mask prediction loss can be denoted as L art_mask . In addition, regarding the specific meanings of sample dynamic mask, sample dynamic optical flow, and sample mask calibration data, please refer to the relevant descriptions of dynamic mask, dynamic optical flow, and mask calibration data in the aforementioned disclosed embodiments respectively. On this basis, the updated overall optical flow of the sample can be obtained based on the sample dynamic optical flow, updated sample pose and updated sample depth. Based on this, on the one hand, the first prediction mask obtained by updating the sample dynamic mask in the model dimension can be obtained based on the sample mask calibration data and the sample dynamic mask. On the other hand, it can be based on the updated sample overall optical flow and updated sample pose and the updated sample depth, the second prediction mask obtained by updating the sample dynamic mask in the optical flow dimension is obtained, so that the mask prediction loss can be obtained based on the difference between the first prediction mask and the second prediction mask. . In the above method, even if a real dynamic mask is not available during the training process, dynamic mask labels can be constructed through the updated overall optical flow of the sample, the updated sample pose, and the updated sample depth to achieve self-supervised training. It is conducive to reducing the requirements for sample annotation during the training process on the premise of improving model performance.
在一个实施场景中,获取更新的整体光流类似地,其过程可参阅上述中“基于更新的位姿和更新的深度,获取更新的静态光流,并基于动态光流和更新的静态光流,得到更新的整体光流”相关描述。In an implementation scenario, obtaining the updated overall optical flow is similar. The process can be referred to the above "Based on the updated pose and updated depth, obtaining the updated static optical flow, and based on the dynamic optical flow and the updated static optical flow." , get the updated overall optical flow" related description.
在一个实施场景中,获取更新的动态掩膜类似地,其可以参阅前述公开实施例中关于获取更新的动态掩膜的相关描述。为了便于描述,可以将第一预测掩膜记为M diIn an implementation scenario, obtaining an updated dynamic mask is similar. For reference, please refer to the related descriptions about obtaining an updated dynamic mask in the aforementioned disclosed embodiments. For convenience of description, the first prediction mask can be denoted as M di .
在一个实施场景中,对于第二预测掩膜而言,其可以基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置p camIn an implementation scenario, for the second prediction mask, it can be projected based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel point in the first sample image to obtain the first sample The sample pixel point in the image is projected at the first sample projection position p cam of the second sample image:
$$p_{cam}=\Pi_c\big(\mathbf{G}_{ij}\circ\Pi_c^{-1}(p_i,d'_i)\big)\quad\text{...Formula (11)};$$

In the above formula (11), $\mathbf{G}_{ij}$ denotes the relative pose between the updated pose of the first sample image and the updated pose of the second sample image, which can be obtained as described in the foregoing disclosed embodiments for the relative pose of the first image and the second image; $p_i$ denotes the sample pixel position of a sample pixel in the first sample image, and $d'_i$ denotes the updated depth of the sample pixel in the first sample image. In addition, the specific meanings of the projection function $\Pi_c$, the back-projection function $\Pi_c^{-1}$ and the operator $\circ$ can be found in the related descriptions of the foregoing disclosed embodiments. Meanwhile, projection can be performed based on the updated sample overall optical flow and the sample pixel position of the sample pixel in the first sample image, to obtain the second sample projection position $p_{flow}$ at which the sample pixel in the first sample image is projected onto the second sample image:

$$p_{flow}=p_i+F_{oij}\quad\text{...Formula (12)};$$
In the above formula (12), $F_{oij}$ denotes the updated sample overall optical flow. That is, the sample overall optical flow value corresponding to a sample pixel can be looked up directly in the updated sample overall optical flow and added to the sample pixel position of that sample pixel to obtain the second sample projection position. On this basis, the second prediction mask can be obtained based on the difference between the first sample projection position and the second sample projection position, so that sample pixels belonging to moving objects are identified from the discrepancy between the position projected using the pose and depth and the position projected using the overall optical flow, which helps improve the accuracy of the constructed dynamic mask label. Exemplarily, the sample mask value of a sample pixel can be obtained by comparing the distance between the first sample projection position and the second sample projection position against a preset threshold, where the sample mask value indicates whether the sample pixel belongs to a moving object. For example, when the distance between the first sample projection position and the second sample projection position is greater than the preset threshold, the sample pixel can be considered to belong to a moving object, and its sample mask value is determined as a first value (e.g., 0); conversely, when the distance is not greater than the preset threshold, the sample pixel can be considered not to belong to a moving object, and its sample mask value is determined as a second value (e.g., 1). On this basis, the second prediction mask $\widetilde{M}_{di}$ can be obtained from the sample mask values of the individual sample pixels:

$$\widetilde{M}_{di}(p_i)=\begin{cases}0, & \|p_{cam}-p_{flow}\|_2>\mu\\ 1, & \text{otherwise}\end{cases}\quad\text{...Formula (13)};$$

In the above formula (13), $\mu$ denotes the preset threshold and $\|\cdot\|_2$ denotes the Euclidean distance; exemplarily, the preset threshold $\mu$ may be set to 0.5, which is not limited here. After obtaining the first prediction mask $M_{di}$ and the second prediction mask $\widetilde{M}_{di}$, the mask prediction loss $L_{art\_mask}$ can be obtained based on the difference between them. Exemplarily, a cross-entropy loss function can be used to measure the difference between the first prediction mask $M_{di}$ and the second prediction mask $\widetilde{M}_{di}$, obtaining the mask prediction loss $L_{art\_mask}$:

$$L_{art\_mask}=-\frac{1}{|N|}\sum_{p\in N}\Big[\widetilde{M}_{di}(p)\log M_{di}(p)+\big(1-\widetilde{M}_{di}(p)\big)\log\big(1-M_{di}(p)\big)\Big]\quad\text{...Formula (14)};$$

In the above formula (14), $N$ denotes the set of pixels in the first prediction mask or the second prediction mask, and $|N|$ denotes the total number of pixels in the first prediction mask or the second prediction mask.
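The self-supervised mask supervision of formulas (11) to (14) can be sketched as follows: the pseudo-label (second prediction mask) marks pixels whose pose/depth reprojection and overall-flow correspondence disagree by more than μ, and the mask prediction loss is a per-pixel binary cross-entropy against the network's first prediction mask. The single-channel mask layout, the random stand-ins for the two projections, and the exact cross-entropy form are assumptions for illustration.

```python
import numpy as np

def second_prediction_mask(p_cam, p_flow, mu=0.5):
    """Formula (13): 0 where the two projections disagree by more than mu
    (the pixel is treated as moving), 1 otherwise."""
    dist = np.linalg.norm(p_cam - p_flow, axis=-1)
    return (dist <= mu).astype(np.float32)

def mask_prediction_loss(pred_mask, pseudo_label, eps=1e-6):
    """Formula (14), sketched as an averaged binary cross-entropy."""
    p = np.clip(pred_mask, eps, 1.0 - eps)
    ce = -(pseudo_label * np.log(p) + (1.0 - pseudo_label) * np.log(1.0 - p))
    return float(ce.mean())

H, W = 48, 64
p_i = np.stack(np.meshgrid(np.arange(W), np.arange(H)), -1).astype(np.float32)
p_cam = p_i + np.random.randn(H, W, 2)               # reprojection with updated pose/depth
F_oij = np.random.randn(H, W, 2).astype(np.float32)
p_flow = p_i + F_oij                                  # formula (12)
pseudo = second_prediction_mask(p_cam, p_flow, mu=0.5)
pred = np.random.rand(H, W)                           # first prediction mask (single channel here)
print(mask_prediction_loss(pred, pseudo))
```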
在一个实施场景中,与前述自监督训练方式人工构造掩膜标签不同的是,若训练过程中存在真实动态掩膜,则可以通过有监督训练的方式,来监督模型训练。在存在真实动态掩膜的情况下,可以基于第一预测掩膜与真实动态掩膜之间的差异,得到掩膜预测损失。也可以采用交叉熵损失函数度量第一预测掩膜与真实动态掩膜之间的差异,得到掩膜预测损失。为了便于区分前述自监督训练中的掩膜预测损失与有监督训练中的掩膜预测损失,可以将有监督训练中的掩膜预测损失记为L gt_maskIn an implementation scenario, unlike the aforementioned self-supervised training method of manually constructing mask labels, if there are real dynamic masks during the training process, the model training can be supervised through supervised training. In the case where a real dynamic mask exists, the mask prediction loss can be obtained based on the difference between the first predicted mask and the real dynamic mask. The cross-entropy loss function can also be used to measure the difference between the first predicted mask and the real dynamic mask to obtain the mask prediction loss. In order to facilitate the distinction between the aforementioned mask prediction loss in self-supervised training and the mask prediction loss in supervised training, the mask prediction loss in supervised training can be recorded as L gt_mask :
$$L_{gt\_mask}=-\frac{1}{|N|}\sum_{p\in N}\Big[M_i(p)\log M_{di}(p)+\big(1-M_i(p)\big)\log\big(1-M_{di}(p)\big)\Big]\quad\text{...Formula (15)};$$

In the above formula (15), $M_i$ denotes the ground-truth dynamic mask; the other parameters are as described above for the self-supervised training.
在一个实施场景中,如前所述,样本参考数据还包括样本动态掩膜,样本动态掩膜用于指示样本图像中的运动对象,且预测损失包括几何光度损失。为了便于描述,几何光度损失可以记为L geo_ph。此外,关于样本动态掩膜,可以参阅前述公开实施例中关于动态掩膜的相关描述。请结合参阅图7,图7是动态场景一实施例的示意图。如图7所示,在自监督训练模式中,当使用光度误差来监督位姿和深度时,直接使用静态光流可能会导致像素不匹配(如,打叉的一对像素)即701,因为运动对象本身的运动会导致静态光流中像素的遮挡,这样会使光度误差的准确度下降。有鉴于此,可以基于各个与第一样本图像具有共视关系的第二样本图像的样本动态掩膜进行融合,得到样本融合掩膜。在此基础上,可以基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置即702。基于此,可以基于第一样本图像中样本像素点的样本像素位置,得到第一样本图像中样本像素点的第一样本像素值,并基于第一样本图像中样本像素点的第一样本投影位置,得到第一样本图像中样本像素点的第二样本像素值,以及基于样本融合掩膜,得到第一样本图像中样本像素点的融合掩膜值,从而可以基于第一样本像素值、第二样本像素值和融合掩膜值,得到几何光度损失。上述方式,通过融合与第一样本图像具有共视关系的第二样本图像的样本动态掩膜,得到样本融合掩膜,并在几何光度损失度量过程中考虑该样本融合掩膜,有利于通过样本融合掩膜尽可能地消除由于像素遮挡而导致的错误像素光度匹配,能够提升几何光度损失的度量精度,有利于提升图像分析模型的模型性能。 In one implementation scenario, as mentioned above, the sample reference data also includes a sample dynamic mask, the sample dynamic mask is used to indicate moving objects in the sample image, and the predicted loss includes a geometric photometric loss. For ease of description, the geometric photometric loss can be denoted as L geo_ph . In addition, regarding the sample dynamic mask, please refer to the relevant descriptions about the dynamic mask in the aforementioned disclosed embodiments. Please refer to FIG. 7 , which is a schematic diagram of an embodiment of a dynamic scene. As shown in Figure 7, in the self-supervised training mode, when using photometric errors to supervise pose and depth, directly using static optical flow may lead to pixel mismatch (e.g., a pair of crossed pixels), which is 701, because The movement of the moving object itself will cause the occlusion of pixels in the static optical flow, which will reduce the accuracy of the photometric error. In view of this, the sample fusion mask can be obtained by fusion based on the sample dynamic masks of the second sample images that have a common view relationship with the first sample image. On this basis, projection can be performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, to obtain the projection of the sample pixel in the first sample image on the second sample image. The first sample projection position is 702. Based on this, the first sample pixel value of the sample pixel point in the first sample image can be obtained based on the sample pixel position of the sample pixel point in the first sample image, and based on the first sample pixel value of the sample pixel point in the first sample image A sample projection position is used to obtain the second sample pixel value of the sample pixel point in the first sample image, and based on the sample fusion mask, the fusion mask value of the sample pixel point in the first sample image is obtained, so that the second sample pixel value of the sample pixel point in the first sample image is obtained. One sample pixel value, the second sample pixel value and the fused mask value are used to obtain the geometric photometric loss. 
In the above method, the sample fusion mask is obtained by fusing the sample dynamic mask of the second sample image that has a common view relationship with the first sample image, and the sample fusion mask is considered in the geometric photometric loss measurement process, which is beneficial to the The sample fusion mask eliminates erroneous pixel photometric matching due to pixel occlusion as much as possible, can improve the measurement accuracy of geometric photometric loss, and is beneficial to improving the model performance of the image analysis model.
In one implementation scenario, for the second sample images that each have a co-visibility relationship with the first sample image, the sample dynamic masks of these second sample images can be aggregated to obtain the sample fusion mask. Exemplarily, the specific aggregation operation may include, but is not limited to, taking a union, which is not limited here. For ease of description, the sample fusion mask can be denoted as $M^{fuse}_{di}$. Meanwhile, for the first sample projection position, reference may be made to the related description of the foregoing mask prediction loss.
在一个实施场景中,可以直接根据第一样本图像中样本像素点的样本像素位置,在第一样本图像中查询该样本像素位置处的像素值,得到第一样本像素值,其中,可以将第一样本像素值记为I i。此外,在得到第一样本投影位置之后,可以在第二样本图像通过双线性插值得到第二样本像素值I j→iIn one implementation scenario, the first sample pixel value can be obtained by querying the pixel value at the sample pixel position in the first sample image directly based on the sample pixel position of the sample pixel point in the first sample image, where, The first sample pixel value may be denoted I i . In addition, after obtaining the first sample projection position, the second sample pixel value I j→i can be obtained through bilinear interpolation in the second sample image:
$$I_{j\to i}=I_j\langle p_{cam}\rangle\quad\text{...Formula (16)};$$

In the above formula (16), $p_{cam}$ denotes the first sample projection position, $I_j$ denotes the second sample image, and $I_j\langle\cdot\rangle$ denotes interpolation within the second sample image.
Here, after obtaining the first sample pixel value $I_i$ and the second sample pixel value $I_{j\to i}$, the pixel difference $pe(I_i,I_{j\to i})$ between the first sample pixel value and the second sample pixel value can be obtained, and then weighted by the fusion mask value $M^{fuse}_{di}$ of the sample pixel to obtain the weighted difference $M^{fuse}_{di}\cdot pe(I_i,I_{j\to i})$. On this basis, the geometric photometric loss $L_{geo\_ph}$ is obtained from the weighted differences of the individual sample pixels:

$$L_{geo\_ph}=\frac{1}{N'}\sum_{ij}M^{fuse}_{di}\cdot pe(I_i,I_{j\to i})\quad\text{...Formula (17)};$$

In the above formula (17), $N'$ denotes the total number of pixels in the sample fusion mask that belong to stationary objects. In this way, by weighting the pixel differences with the fusion mask values, erroneous pixel photometric matches caused by pixel occlusion can be quickly filtered out, which helps reduce the complexity of measuring the geometric photometric loss. In addition, to improve the accuracy of the geometric photometric loss, the pixel difference $pe(I_i,I_{j\to i})$ between the first sample pixel value and the second sample pixel value can be measured in various ways. For example, the first sample pixel value and the second sample pixel value can be measured based on structural similarity to obtain a first difference, and measured based on absolute deviation to obtain a second difference; on this basis, the first difference and the second difference are weighted to obtain the pixel difference $pe(I_i,I_{j\to i})$:
$$pe(I_i,I_{j\to i})=\alpha\,\frac{1-\mathrm{SSIM}(I_i,I_{j\to i})}{2}+(1-\alpha)\,\|I_i-I_{j\to i}\|_1\quad\text{...Formula (18)};$$

In the above formula (18), SSIM denotes the structural similarity measure, $\|\cdot\|_1$ denotes the absolute deviation measure, and $\alpha$ and $(1-\alpha)$ denote the weights of the first difference and the second difference, respectively. Exemplarily, $\alpha$ may be set to 0.85, which is not limited here. In this way, in the process of measuring the pixel difference, structural similarity and absolute deviation are combined, which helps improve the accuracy of the pixel difference as much as possible.
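The photometric terms of formulas (17) and (18) can be sketched as below for a single image pair. The SSIM here is a simplified global-statistics form (a windowed SSIM is normally used), and the per-pixel layout, constants c1/c2 and the random inputs are assumptions for illustration.

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from global image statistics; returns a single scalar."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_error(I_i, I_warp, alpha=0.85):
    """Formula (18): weighted mix of an SSIM-based difference and an L1 difference."""
    d_ssim = (1.0 - ssim_global(I_i, I_warp)) / 2.0
    d_l1 = np.abs(I_i - I_warp).mean()
    return alpha * d_ssim + (1.0 - alpha) * d_l1

def geometric_photometric_loss(I_i, I_warp, fusion_mask, alpha=0.85):
    """Formula (17), sketched per pixel: the error is weighted by the fusion-mask
    value (about 1 for static pixels) and averaged over the static pixels N'."""
    per_pixel = (alpha * (1.0 - ssim_global(I_i, I_warp)) / 2.0
                 + (1.0 - alpha) * np.abs(I_i - I_warp))
    n_static = max(float(fusion_mask.sum()), 1.0)
    return float((fusion_mask * per_pixel).sum() / n_static)

H, W = 48, 64
I_i = np.random.rand(H, W)
I_j_to_i = I_i + np.random.randn(H, W) * 0.05               # image warped via formula (16)
mask = (np.random.rand(H, W) > 0.1).astype(np.float32)       # sample fusion mask
print(photometric_error(I_i, I_j_to_i))
print(geometric_photometric_loss(I_i, I_j_to_i, mask))
```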
在一个实施场景中,与前述结合样本融合掩膜度量几何光度损失不同的是,在对损失度量的精度较为宽松的情况下,也可以不考虑样本融合掩膜来度量几何光度损失。在此情况下,几何光度损失L geo_ph可以表示为公式(19)中,N表示样本像素点总数: In an implementation scenario, unlike the aforementioned measurement of geometric photometric loss in combination with the sample fusion mask, when the accuracy of the loss measurement is relatively loose, the geometric photometric loss can also be measured without considering the sample fusion mask. In this case, the geometric photometric loss L geo_ph can be expressed as formula (19), where N represents the total number of sample pixels:
$$L_{geo\_ph}=\frac{1}{N}\sum_{ij}pe(I_i,I_{j\to i})\quad\text{...Formula (19)};$$
在一个实施场景中,为了提升损失度量的准确性,预测损失还可以包括光流光度损失,其中,可以将光流光度损失记为L flow_ph。此外,样本分析结果还可以包括样本动态光流,其可以参阅前述掩膜预测损失中相关描述。基于此,可以基于样本动态光流、更新的样本位姿和更新的样本深度,得到更新的样本整体光流,其可以参阅前述掩膜预测损失中相关描述。在此基础上,可以基于更新的样本整体光流和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第二样本投影位置。示例性地,可以直接在更新的样本整体光流中查询样本像素点的样本整体光流值,再加上样本像素点的样本像素位置,得到第二样本投影位置,其可以参阅前述掩膜预测损失中相关描述。与前述几何光度损失中类似地,在得到第二样本投影位置之后,可以在第二样本图像通过双线性插值得到第二样本像素值I j→iIn an implementation scenario, in order to improve the accuracy of the loss measurement, the predicted loss may also include optical flow photometric loss, where the optical flow photometric loss may be recorded as L flow_ph . In addition, the sample analysis results may also include sample dynamic optical flow, which may be described in the aforementioned mask prediction loss. Based on this, the updated overall optical flow of the sample can be obtained based on the sample dynamic optical flow, the updated sample pose and the updated sample depth. For this, please refer to the relevant description in the aforementioned mask prediction loss. On this basis, projection can be performed based on the updated overall optical flow of the sample and the sample pixel position of the sample pixel point in the first sample image, to obtain a second sample in which the sample pixel point in the first sample image is projected on the second sample image. Projection position. For example, the sample overall optical flow value of the sample pixel point can be directly queried in the updated sample overall optical flow, plus the sample pixel position of the sample pixel point, to obtain the second sample projection position, which can be referred to the aforementioned mask prediction. Related descriptions in losses. Similar to the aforementioned geometric photometric loss, after obtaining the second sample projection position, the second sample pixel value I j→i can be obtained through bilinear interpolation in the second sample image:
$$I_{j\to i}=I_j\langle F_{oij}+p_i\rangle\quad\text{...Formula (20)};$$
上述公式(20)中,I j<·>表示在第二样本图像I j进行插值计算。在此基础上,与前述几何光度损失类似地,可以基于第一样本像素值与第二样本像素值之间的差异,得到光流光度损失。示例性地,可以基于结构相似性度量第一样本像素值和第二样本像素值,得到第一差值,并基于绝对值偏差度量第一样本像素值和第二样本像素值,得到第二差值,再基于第一差值和第二差值进行加权,得到像素差值,从而可以基于各个样本像素点的像素差值,得到光流光度损失L flow_phIn the above formula (20), I j <·> indicates that interpolation calculation is performed on the second sample image I j . On this basis, similar to the aforementioned geometric photometric loss, the optical flow photometric loss can be obtained based on the difference between the first sample pixel value and the second sample pixel value. For example, the first sample pixel value and the second sample pixel value can be measured based on structural similarity to obtain the first difference value, and the first sample pixel value and the second sample pixel value can be measured based on the absolute value deviation to obtain the first difference value. The two differences are then weighted based on the first difference and the second difference to obtain the pixel difference, so that the optical flow photometric loss L flow_ph can be obtained based on the pixel difference of each sample pixel:
L flow_ph=∑ ijpe(I i,I j→i)……公式(21)。 L flow_ph =∑ ij pe(I i ,I j→i )...Formula (21).
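Purely as an illustration of formulas (20) and (21), the following Python sketch warps the second sample image into the first sample frame with the updated overall optical flow via bilinear interpolation and sums a per-pixel photometric error. The flow layout (x-displacement in channel 0, y-displacement in channel 1) and the plain L1 error used as pe(·,·) here are assumptions; the weighted SSIM/absolute-deviation form of the pixel difference is sketched further below.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image (H, W) at float coordinates (x, y)."""
    h, w = img.shape
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)

def flow_photometric_loss(img_i, img_j, overall_flow):
    """L_flow_ph: warp img_j to frame i with the overall flow, compare to img_i."""
    h, w = img_i.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # formula (20): I_{j->i} = I_j< F_oij + p_i >, sampled by bilinear interpolation
    warped = bilinear_sample(img_j, xs + overall_flow[..., 0], ys + overall_flow[..., 1])
    # formula (21): sum of per-pixel photometric errors (L1 used as a stand-in for pe)
    return np.abs(img_i - warped).sum()
```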
步骤S65:基于预测损失,调整图像分析模型的网络参数。Step S65: Based on the prediction loss, adjust the network parameters of the image analysis model.
在一个实施场景中,在通过自监督方式训练网络模型的情况下,预测损失可以包括前述掩膜预测损失、几何光度损失、光流光度损失中至少一者。示例性地,预测损失可以包括前述掩膜预测损失、几何光度损失和光流光度损失,则可以基于这三者进行加权,得到预测损失L self_supIn one implementation scenario, when the network model is trained in a self-supervised manner, the prediction loss may include at least one of the aforementioned mask prediction loss, geometric photometric loss, and optical flow photometric loss. For example, the prediction loss can include the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss. Then the prediction loss L self_sup can be obtained by weighting based on these three:
L self_sup=λ 0L geo_ph1L flow_ph2L art_mask……公式(22); L self_sup0 L geo_ph1 L flow_ph2 L art_mask ...Formula (22);
上述公式(22)中,λ 0,λ 1,λ 2均表示加权系数,示例性地,可以分别设置为100、5、0.05,在此不做限定。请结合参阅表1,表1是本公开图像分析模型采用自监督方式训练之后的测试性能与现有技术的测试性能一实施例的对比表。 In the above formula (22), λ 0 , λ 1 , and λ 2 all represent weighting coefficients. For example, they can be set to 100, 5, and 0.05 respectively, which are not limited here. Please refer to Table 1. Table 1 is a comparison table between the test performance of the disclosed image analysis model after training in a self-supervised manner and the test performance of the prior art in an embodiment.
表1分析模型采用自监督方式训练之后的测试性能与现有技术的测试性能一实施例的对比表Table 1 Comparison of the test performance of the analysis model after training in a self-supervised manner and the test performance of the existing technology according to an embodiment
分析方式 Analysis method | K09 | K10 | VK01 | VK02 | VK06 | VK18 | VK20
现有技术1 Existing technology 1 | 28.1 | 24.0 | - | - | - | - | -
现有技术2 Existing technology 2 | 41.91 | 7.519 | 27.830 | X | X | X | 2.807
现有技术3 Existing technology 3 | 47.1 | 11.0 | 2.259 | 0.049 | 0.136 | 1.170 | 6.998
本公开 This disclosure | 27.8 | 4.2 | 0.591 | 0.021 | 0.13 | 0.400 | 1.039
其中，K09和K10表示KITTI数据集中图像序列09和图像序列10的测试场景下，不同技术方案的测试性能，VK01、VK02、VK06、VK18、VK20表示KITTI2数据集中图像序列01、图像序列02、图像序列06、图像序列18和图像序列20的测试场景下，不同技术方案的测试性能。由表1可见，本公开自监督方式训练得到的图像分析模型在诸多测试场景下较其他现有技术均具有极为显著的模型性能。Here, K09 and K10 denote the test performance of the different technical solutions in the test scenarios of image sequence 09 and image sequence 10 of the KITTI dataset, and VK01, VK02, VK06, VK18 and VK20 denote the test performance of the different technical solutions in the test scenarios of image sequence 01, image sequence 02, image sequence 06, image sequence 18 and image sequence 20 of the KITTI2 dataset. As can be seen from Table 1, the image analysis model obtained by self-supervised training in the present disclosure shows significantly better model performance than the other prior-art solutions in many test scenarios.
在一个实施场景中，与前述通过自监督训练网络模型类似地，通过有监督方式训练网络模型的情况下，预测损失可以包括前述掩膜预测损失、几何光度损失、光流光度损失中至少一者。示例性地，预测损失可以包括前述掩膜预测损失、几何光度损失和光流光度损失，则可以基于这三者进行加权，得到预测损失L semi_sup：In an implementation scenario, similar to the aforementioned self-supervised training of the network model, when the network model is trained in a supervised manner, the prediction loss may include at least one of the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss. For example, the prediction loss may include the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss, and the prediction loss L semi_sup can then be obtained by weighting these three:
L semi_sup=λ 0L geo_ph1L flow_ph2L art_mask……公式(23); L semi_sup0 L geo_ph1 L flow_ph2 L art_mask ...Formula (23);
上述公式(23)中,λ 0,λ 1,λ 2均表示加权系数,示例性地,可以分别设置为100、5、0.05,在此不做限定。 In the above formula (23), λ 0 , λ 1 , and λ 2 all represent weighting coefficients. For example, they can be set to 100, 5, and 0.05 respectively, which are not limited here.
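A minimal Python sketch of the weighted combination in formulas (22)/(23); the default weights follow the example values 100, 5 and 0.05 given above and are not fixed by the disclosure.

```python
def total_prediction_loss(l_geo_ph, l_flow_ph, l_art_mask,
                          lambda0=100.0, lambda1=5.0, lambda2=0.05):
    # weighted sum of geometric photometric, optical flow photometric and mask losses
    return lambda0 * l_geo_ph + lambda1 * l_flow_ph + lambda2 * l_art_mask
```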
在一个实施场景中,在得到预测损失之后,可以通过诸如梯度下降等优化方式,调整图像分析模型的网络参数,其过程可以参阅梯度下降等优化方式的技术细节。In an implementation scenario, after obtaining the prediction loss, the network parameters of the image analysis model can be adjusted through optimization methods such as gradient descent. For the process, please refer to the technical details of optimization methods such as gradient descent.
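A hedged PyTorch-style sketch of one parameter-adjustment step by gradient descent; the optimizer choice and learning rate are illustrative assumptions rather than values from the disclosure.

```python
import torch

def train_step(optimizer, prediction_loss):
    optimizer.zero_grad()        # clear gradients from the previous step
    prediction_loss.backward()   # back-propagate the prediction loss
    optimizer.step()             # gradient-descent style update of network parameters

# e.g. optimizer = torch.optim.Adam(image_analysis_model.parameters(), lr=1e-4)
```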
上述方案，与推理阶段类似地，通过模仿人类感知现实世界的方式，将整体光流视为由摄像器件运动和拍摄对象运动共同引起，并在图像分析过程中，参考整体光流和由摄像器件运动引起的静态光流，预测出静态光流的光流校准数据，从而能够在后续位姿和深度优化过程中，结合静态光流及其光流校准数据尽可能地降低拍摄对象运动所导致的影响，能够提升图像分析模型的模型性能，有利于提升利用图像分析模型在推理阶段得到分析结果的准确性，进而能够提升推理阶段位姿和深度的精度。In the above solution, similar to the inference stage, by imitating the way humans perceive the real world, the overall optical flow is regarded as being caused jointly by the motion of the camera device and the motion of the photographed object. During image analysis, the optical flow calibration data of the static optical flow is predicted with reference to the overall optical flow and the static optical flow caused by the motion of the camera device, so that in the subsequent pose and depth optimization the static optical flow and its optical flow calibration data can be combined to reduce the influence of subject motion as much as possible. This improves the model performance of the image analysis model, helps improve the accuracy of the analysis results obtained with the image analysis model in the inference stage, and in turn improves the accuracy of the pose and depth in the inference stage.
请参阅图8,图8是本公开图像分析装置80一实施例的框架示意图。图像分析装置80包括:获取部分81,被配置为获取图像序列、光流数据和图像序列中各个图像的参考数据;其中,各个图像包括具有共视关系的第一图像和第二图像,光流数据包括第一图像与第二图像之间的静态光流和整体光流,静态光流由摄像器件运动引起,整体光流由摄像器件运动和拍摄对象运动共同引起,且参考数据包括位姿和深度;分析部分82,被配置为基于图像序列和光流数据,预测得到分析结果;其中,分析结果包括静态光流的光流校准数据;优化部分83,被配置为基于静态光流和光流校准数据,对位姿和深度进行优化,得到更新的位姿和更新的深度。Please refer to FIG. 8 , which is a schematic framework diagram of an embodiment of the image analysis device 80 of the present disclosure. The image analysis device 80 includes: an acquisition part 81 configured to acquire an image sequence, optical flow data, and reference data of each image in the image sequence; wherein each image includes a first image and a second image having a common view relationship, and the optical flow The data includes static optical flow and overall optical flow between the first image and the second image. The static optical flow is caused by the movement of the camera device, the overall optical flow is caused by the movement of the camera device and the movement of the photographed object, and the reference data includes pose and depth; the analysis part 82 is configured to predict the analysis results based on the image sequence and optical flow data; wherein the analysis results include optical flow calibration data of static optical flow; the optimization part 83 is configured to predict based on the static optical flow and optical flow calibration data , optimize the pose and depth to obtain an updated pose and an updated depth.
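A structural sketch of how the three parts of the image analysis device 80 chain together; all function names below are illustrative placeholders, not interfaces defined by the disclosure.

```python
class ImageAnalysisDevice:
    """Acquisition part 81 -> analysis part 82 -> optimization part 83 (sketch)."""

    def __init__(self, acquire_fn, analyze_fn, optimize_fn):
        self.acquire = acquire_fn    # returns images, static/overall flow, pose, depth
        self.analyze = analyze_fn    # predicts optical flow calibration data
        self.optimize = optimize_fn  # refines pose and depth

    def run(self):
        images, static_flow, overall_flow, pose, depth = self.acquire()
        calibration = self.analyze(images, static_flow, overall_flow)
        return self.optimize(static_flow, calibration, pose, depth)
```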
在一些公开实施例中，分析部分82包括：特征相关子部分，被配置为基于第一图像的图像特征和第二图像的图像特征，得到第一图像与第二图像之间的特征相关数据；第一投影子部分，被配置为基于静态光流将第一图像中像素点进行投影，得到第一图像中像素点在第二图像中的第一投影位置；特征搜索子部分，被配置为基于第一投影位置在特征相关数据中搜索，得到目标相关数据；数据分析子部分，被配置为基于目标相关数据、静态光流和整体光流，得到分析结果。In some disclosed embodiments, the analysis part 82 includes: a feature correlation sub-part configured to obtain feature correlation data between the first image and the second image based on the image features of the first image and the image features of the second image; a first projection sub-part configured to project the pixel points of the first image based on the static optical flow to obtain first projection positions of the pixel points of the first image in the second image; a feature search sub-part configured to search the feature correlation data based on the first projection positions to obtain target correlation data; and a data analysis sub-part configured to obtain the analysis result based on the target correlation data, the static optical flow and the overall optical flow.
在一些公开实施例中，数据分析子部分包括：第一编码子部分，被配置为基于目标相关数据进行编码，得到第一编码特征；第二编码子部分，被配置为基于静态光流和整体光流进行编码，得到第二编码特征；预测子部分，被配置为基于第一编码特征和第二编码特征，预测得到分析结果。In some disclosed embodiments, the data analysis sub-part includes: a first encoding sub-part configured to perform encoding based on the target correlation data to obtain a first encoding feature; a second encoding sub-part configured to perform encoding based on the static optical flow and the overall optical flow to obtain a second encoding feature; and a prediction sub-part configured to predict the analysis result based on the first encoding feature and the second encoding feature.
在一些公开实施例中，参考数据还包括动态掩膜，动态掩膜用于指示图像中的运动对象，分析结果还包括置信度图和动态掩膜的掩膜校准数据，置信度图包括图像中各像素点的置信度；优化部分83包括：图像融合子部分，被配置为基于动态掩膜、掩膜校准数据和置信度图进行融合，得到重要度图；位置校准子部分，被配置为基于光流校准数据对第一投影位置进行校准，得到校准位置；其中，重要度图包括图像中各像素点的重要度，第一投影位置为第一图像中像素点基于静态光流投影在第二图像的像素位置；数据优化子部分，被配置为基于校准位置和重要度图，优化得到更新的位姿和更新的深度。In some disclosed embodiments, the reference data further includes a dynamic mask used to indicate moving objects in the image, and the analysis result further includes a confidence map and mask calibration data of the dynamic mask, the confidence map including the confidence of each pixel point in the image; the optimization part 83 includes: an image fusion sub-part configured to perform fusion based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map; a position calibration sub-part configured to calibrate the first projection positions based on the optical flow calibration data to obtain calibrated positions, wherein the importance map includes the importance of each pixel point in the image, and the first projection position is the pixel position at which a pixel point of the first image is projected onto the second image based on the static optical flow; and a data optimization sub-part configured to obtain the updated pose and the updated depth through optimization based on the calibrated positions and the importance map.
在一些公开实施例中，光流校准数据包括第一图像中像素点的校准光流，位置校准子部分，还被配置为将第一图像中像素点的校准光流加上像素点在第二图像中的第一投影位置，得到像素点的校准位置。In some disclosed embodiments, the optical flow calibration data includes the calibration optical flow of the pixel points in the first image, and the position calibration sub-part is further configured to add the calibration optical flow of a pixel point of the first image to the first projection position of that pixel point in the second image, to obtain the calibrated position of the pixel point.
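A minimal numpy sketch of this calibration step: the calibrated position is the pixel's calibration optical flow added to its first projection position in the second image. The (H, W, 2) array layout is an assumption.

```python
import numpy as np

def calibrated_positions(first_projection, calibration_flow):
    # both arrays assumed shaped (H, W, 2): an (x, y) pair per pixel
    return np.asarray(first_projection) + np.asarray(calibration_flow)
```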
在一些公开实施例中，图像融合子部分包括：校准子部分，被配置为基于掩膜校准数据对动态掩膜进行校准，得到校准掩膜；其中，校准掩膜包括图像中像素点与运动对象的相关度，且相关度与图像中像素点属于运动对象的可能性正相关；融合子部分，被配置为基于置信度图和校准掩膜进行融合，得到重要度图。In some disclosed embodiments, the image fusion sub-part includes: a calibration sub-part configured to calibrate the dynamic mask based on the mask calibration data to obtain a calibrated mask, wherein the calibrated mask includes the correlation between each pixel point in the image and the moving object, and the correlation is positively related to the possibility that the pixel point belongs to the moving object; and a fusion sub-part configured to perform fusion based on the confidence map and the calibrated mask to obtain the importance map.
在一些公开实施例中，分析结果还包括动态光流，动态光流由拍摄对象运动引起；图像分析装置80包括：静态光流更新部分，被配置为基于更新的位姿和更新的深度，获取更新的静态光流；整体光流更新部分，被配置为基于动态光流和更新的静态光流，得到更新的整体光流；数据更新部分，被配置为基于更新的静态光流和更新的整体光流，得到更新的光流数据，并基于更新的位姿和更新的深度，得到更新的参考数据；循环部分，被配置为结合分析部分82和优化部分83重新执行基于图像序列和光流数据，预测得到分析结果的步骤以及后续步骤，直至重新执行的次数满足预设条件为止。In some disclosed embodiments, the analysis result further includes dynamic optical flow, the dynamic optical flow being caused by the motion of the photographed object; the image analysis device 80 includes: a static optical flow update part configured to obtain an updated static optical flow based on the updated pose and the updated depth; an overall optical flow update part configured to obtain an updated overall optical flow based on the dynamic optical flow and the updated static optical flow; a data update part configured to obtain updated optical flow data based on the updated static optical flow and the updated overall optical flow, and to obtain updated reference data based on the updated pose and the updated depth; and a loop part configured to re-execute, in combination with the analysis part 82 and the optimization part 83, the step of predicting the analysis result based on the image sequence and the optical flow data as well as the subsequent steps, until the number of re-executions satisfies a preset condition.
在一些公开实施例中，静态光流更新部分包括：第二投影子部分，被配置为基于更新的位姿、更新的深度和第一图像中像素点的像素位置进行投影，得到第一图像中像素点投影在第二图像的第二投影位置；光流更新子部分，被配置为基于第一图像中像素点投影在第二图像的第二投影位置和第一图像中像素点在第二图像中的对应位置之间的差异，得到更新的静态光流；其中，对应位置为在假设摄像器件未运动的情况下，第一图像中像素点所属的空间点投影在第二图像的像素位置。In some disclosed embodiments, the static optical flow update part includes: a second projection sub-part configured to perform projection based on the updated pose, the updated depth and the pixel positions of the pixel points in the first image, to obtain second projection positions at which the pixel points of the first image are projected onto the second image; and an optical flow update sub-part configured to obtain the updated static optical flow based on the difference between the second projection position of a pixel point of the first image in the second image and the corresponding position of that pixel point in the second image, wherein the corresponding position is the pixel position at which the spatial point to which the pixel point of the first image belongs would be projected onto the second image assuming the camera device has not moved.
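A minimal numpy sketch of the static-flow update described above, assuming a pinhole camera with intrinsics K and an updated relative pose (R, t). Under the no-motion assumption the spatial point projects back to the pixel's own coordinates, so the corresponding position is taken as the pixel position itself; these modelling choices are assumptions for illustration.

```python
import numpy as np

def updated_static_flow(depth, R, t, K):
    """Updated static flow = projection with updated pose/depth minus corresponding position."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixel coordinates
    rays = pix @ np.linalg.inv(K).T                        # back-project to camera rays
    points = rays * depth[..., None]                       # 3D points in the first frame
    cam_j = points @ R.T + t                               # transform with the updated pose
    proj = cam_j @ K.T
    proj = proj[..., :2] / proj[..., 2:3]                  # second projection position
    corresponding = np.stack([xs, ys], axis=-1)            # position if the camera had not moved
    return proj - corresponding                            # updated static optical flow (x, y)
```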
在一些公开实施例中,整体光流更新部分,还被配置为将动态光流和更新的静态光流相加,得到更新的整体光流。In some disclosed embodiments, the overall optical flow updating part is also configured to add the dynamic optical flow and the updated static optical flow to obtain an updated overall optical flow.
请参阅图9，图9是图像分析模型的训练装置90一实施例的框架示意图。图像分析模型的训练装置90包括：样本获取部分91，被配置为获取样本图像序列、样本光流数据和样本图像序列中各个样本图像的样本参考数据；其中，各个样本图像包括具有共视关系的第一样本图像和第二样本图像，样本光流数据包括第一样本图像与第二样本图像之间的样本静态光流和样本整体光流，样本静态光流由摄像器件运动引起，样本整体光流由摄像器件运动和拍摄对象运动共同引起，且样本参考数据包括样本位姿和样本深度；样本分析部分92，被配置为基于图像分析模型对样本图像序列和样本光流数据进行分析预测，得到样本分析结果；其中，样本分析结果包括样本静态光流的样本光流校准数据；样本优化部分93，被配置为基于样本静态光流和样本光流校准数据，对样本位姿和样本深度进行优化，得到更新的样本位姿和更新的样本深度；损失度量部分94，被配置为基于更新的样本位姿和更新的样本深度进行损失度量，得到图像分析模型的预测损失；参数调整部分95，被配置为基于预测损失，调整图像分析模型的网络参数。Please refer to FIG. 9, which is a schematic framework diagram of an embodiment of a training device 90 for an image analysis model. The training device 90 includes: a sample acquisition part 91 configured to acquire a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes the sample static optical flow and the sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by the motion of the camera device, the sample overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the sample reference data includes the sample pose and the sample depth; a sample analysis part 92 configured to analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result, wherein the sample analysis result includes sample optical flow calibration data of the sample static optical flow; a sample optimization part 93 configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; a loss measurement part 94 configured to perform loss measurement based on the updated sample pose and the updated sample depth to obtain the prediction loss of the image analysis model; and a parameter adjustment part 95 configured to adjust the network parameters of the image analysis model based on the prediction loss.
在一些公开实施例中，样本参考数据还包括样本动态掩膜，样本动态掩膜用于指示样本图像中的运动对象，样本分析结果还包括样本动态光流和样本动态掩膜的样本掩膜校准数据，且样本动态光流由拍摄对象运动引起，预测损失包括掩膜预测损失；图像分析模型的训练装置90还包括：样本整体光流更新部分，被配置为基于样本动态光流、更新的样本位姿和更新的样本深度，得到更新的样本整体光流；损失度量部分94包括：第一掩膜更新子部分，被配置为基于样本掩膜校准数据和样本动态掩膜，得到样本动态掩膜在模型维度更新得到的第一预测掩膜；第二掩膜更新子部分，被配置为基于更新的样本整体光流、更新的样本位姿和更新的样本深度，得到样本动态掩膜在光流维度更新得到的第二预测掩膜；掩膜损失度量子部分，被配置为基于第一预测掩膜和第二预测掩膜之间的差异，得到掩膜预测损失。In some disclosed embodiments, the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, the sample analysis result further includes the sample dynamic optical flow and sample mask calibration data of the sample dynamic mask, the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss includes a mask prediction loss; the training device 90 further includes a sample overall optical flow update part configured to obtain an updated sample overall optical flow based on the sample dynamic optical flow, the updated sample pose and the updated sample depth; the loss measurement part 94 includes: a first mask update sub-part configured to obtain, based on the sample mask calibration data and the sample dynamic mask, a first prediction mask resulting from updating the sample dynamic mask in the model dimension; a second mask update sub-part configured to obtain, based on the updated sample overall optical flow, the updated sample pose and the updated sample depth, a second prediction mask resulting from updating the sample dynamic mask in the optical flow dimension; and a mask loss measurement sub-part configured to obtain the mask prediction loss based on the difference between the first prediction mask and the second prediction mask.
在一些公开实施例中，第二掩膜更新子部分包括：第一样本投影子部分，被配置为基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置；第二样本投影子部分，被配置为基于更新的样本整体光流和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第二样本投影位置；掩膜确定子部分，被配置为基于第一样本投影位置和第二样本投影位置之间的差异，得到第二预测掩膜。In some disclosed embodiments, the second mask update sub-part includes: a first sample projection sub-part configured to perform projection based on the updated sample pose, the updated sample depth and the sample pixel positions of the sample pixel points in the first sample image, to obtain first sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; a second sample projection sub-part configured to perform projection based on the updated sample overall optical flow and the sample pixel positions of the sample pixel points in the first sample image, to obtain second sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; and a mask determination sub-part configured to obtain the second prediction mask based on the difference between the first sample projection positions and the second sample projection positions.
在一些公开实施例中，掩膜确定子部分包括：距离对比子部分，被配置为基于第一样本投影位置与第二样本投影位置之间的距离对比预设阈值，得到样本像素点的样本掩膜值；其中，样本掩膜值用于表示样本像素点是否属于运动对象；掩膜获取子单元，被配置为基于各个样本像素点的样本掩膜值，得到第二预测掩膜。In some disclosed embodiments, the mask determination sub-part includes: a distance comparison sub-part configured to compare the distance between the first sample projection position and the second sample projection position with a preset threshold to obtain the sample mask value of the sample pixel point, wherein the sample mask value indicates whether the sample pixel point belongs to a moving object; and a mask acquisition sub-unit configured to obtain the second prediction mask based on the sample mask values of the respective sample pixel points.
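A minimal numpy sketch of the second prediction mask: a sample pixel is flagged as belonging to a moving object when its projection under the updated pose/depth and its projection under the updated sample overall optical flow disagree by more than a preset threshold. The threshold value and the 1 = moving / 0 = static encoding are assumptions.

```python
import numpy as np

def second_prediction_mask(first_sample_proj, second_sample_proj, threshold=1.0):
    # both arrays: (H, W, 2) projected positions in the second sample image
    dist = np.linalg.norm(first_sample_proj - second_sample_proj, axis=-1)
    return (dist > threshold).astype(np.float32)   # 1 = moving object, 0 = static
```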
在一些公开实施例中，样本参考数据还包括样本动态掩膜，样本动态掩膜用于指示样本图像中的运动对象，且预测损失包括几何光度损失；图像分析模型的训练装置90还包括样本掩膜聚合部分，被配置为基于各个与第一样本图像具有共视关系的第二样本图像的样本动态掩膜进行融合，得到样本融合掩膜；损失度量部分94包括：第一样本投影子部分，被配置为基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置；第一像素值确定子部分，被配置为基于第一样本图像中样本像素点的样本像素位置，得到第一样本图像中样本像素点的第一样本像素值；第二像素值确定子部分，被配置为基于第一样本图像中样本像素点的第一样本投影位置，得到第一样本图像中样本像素点的第二样本像素值；融合掩膜值获取子部分，被配置为基于样本融合掩膜，得到第一样本图像中样本像素点的融合掩膜值；光度损失度量子部分，被配置为基于第一样本像素值、第二样本像素值和融合掩膜值，得到几何光度损失。In some disclosed embodiments, the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, and the prediction loss includes a geometric photometric loss; the training device 90 further includes a sample mask aggregation part configured to perform fusion based on the sample dynamic masks of the respective second sample images having the common-view relationship with the first sample image, to obtain a sample fusion mask; the loss measurement part 94 includes: a first sample projection sub-part configured to perform projection based on the updated sample pose, the updated sample depth and the sample pixel positions of the sample pixel points in the first sample image, to obtain first sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; a first pixel value determination sub-part configured to obtain the first sample pixel values of the sample pixel points in the first sample image based on their sample pixel positions; a second pixel value determination sub-part configured to obtain the second sample pixel values of the sample pixel points in the first sample image based on their first sample projection positions; a fusion mask value acquisition sub-part configured to obtain the fusion mask values of the sample pixel points in the first sample image based on the sample fusion mask; and a photometric loss measurement sub-part configured to obtain the geometric photometric loss based on the first sample pixel values, the second sample pixel values and the fusion mask values.
在一些公开实施例中，光度损失度量子部分包括：像素差值获取子部分，被配置为获取第一样本像素值和第二样本像素值之间的像素差值；数值加权子部分，被配置为利用融合掩膜值对像素差值进行加权，得到加权差值；损失获取子部分，被配置为基于各个样本像素点的加权差值，得到几何光度损失。In some disclosed embodiments, the photometric loss measurement sub-part includes: a pixel difference acquisition sub-part configured to acquire the pixel difference between the first sample pixel value and the second sample pixel value; a value weighting sub-part configured to weight the pixel difference with the fusion mask value to obtain a weighted difference; and a loss acquisition sub-part configured to obtain the geometric photometric loss based on the weighted differences of the respective sample pixel points.
在一些公开实施例中，像素差值获取子部分包括：第一差值子部分，被配置为基于结构相似性度量第一样本像素值和第二样本像素值，得到第一差值；第二差值子部分，被配置为基于绝对值偏差度量第一样本像素值和第二样本像素值，得到第二差值；差值加权子部分，被配置为基于第一差值和第二差值进行加权，得到像素差值。In some disclosed embodiments, the pixel difference acquisition sub-part includes: a first difference sub-part configured to measure the first sample pixel value and the second sample pixel value based on structural similarity to obtain a first difference value; a second difference sub-part configured to measure the first sample pixel value and the second sample pixel value based on absolute deviation to obtain a second difference value; and a difference weighting sub-part configured to perform weighting based on the first difference value and the second difference value to obtain the pixel difference value.
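A hedged numpy/scipy sketch of this pixel difference and of the fusion-mask-weighted geometric photometric loss: the difference combines a structural-similarity term and an absolute-deviation term, and each pixel's difference is weighted by its fusion mask value before summation. The weight alpha = 0.85, the 3x3 box-filtered SSIM and images normalised to [0, 1] are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2, size=3):
    """Per-pixel SSIM with box-filtered local statistics (sketch)."""
    mx, my = uniform_filter(x, size), uniform_filter(y, size)
    sx = uniform_filter(x * x, size) - mx * mx
    sy = uniform_filter(y * y, size) - my * my
    sxy = uniform_filter(x * y, size) - mx * my
    return ((2 * mx * my + c1) * (2 * sxy + c2)) / ((mx ** 2 + my ** 2 + c1) * (sx + sy + c2))

def geometric_photometric_loss(first_vals, second_vals, fusion_mask, alpha=0.85):
    d_ssim = (1.0 - ssim_map(first_vals, second_vals)) / 2.0   # first difference (structural)
    d_abs = np.abs(first_vals - second_vals)                    # second difference (absolute)
    pixel_diff = alpha * d_ssim + (1.0 - alpha) * d_abs         # weighted pixel difference
    return (fusion_mask * pixel_diff).sum()                     # fusion-mask weighted sum
```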
关于装置中的各模块的处理流程、以及各模块之间的交互流程的描述可以参照上述方法实施例中的相关说明,这里不再详述。For a description of the processing flow of each module in the device and the interaction flow between the modules, please refer to the relevant descriptions in the above method embodiments, and will not be described in detail here.
请参阅图10,图10是本公开电子设备100一实施例的框架示意图。电子设备100包括相互耦接的存储器101和处理器102,处理器102被配置为执行存储器101中存储的程序指令,以实现上述任一图像分析方法,或任一图像分析模型的训练方法。其中,电子设备100可以包括但不限于:微型计算机、服务器,此外,电子设备100还可以包括笔记本电脑、平板电脑等移动设备,在此不做限定。Please refer to FIG. 10 , which is a schematic framework diagram of an embodiment of the electronic device 100 of the present disclosure. The electronic device 100 includes a memory 101 and a processor 102 coupled to each other. The processor 102 is configured to execute program instructions stored in the memory 101 to implement any of the above image analysis methods or any image analysis model training method. The electronic device 100 may include but is not limited to: a microcomputer and a server. In addition, the electronic device 100 may also include mobile devices such as laptop computers and tablet computers, which are not limited here.
这里,处理器102被配置为控制其自身以及存储器101以实现上述任一图像分析方法,或实现上述任一图像分析模型的训练方法。处理器102还可以称为中央处理单元(Central Processing Unit,CPU)。处理器102可能是一种集成电路芯片,具有信号的处理能力。处理器102还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,处理器102可以由集成电路芯片共同实现。Here, the processor 102 is configured to control itself and the memory 101 to implement any of the above image analysis methods, or to implement any of the above image analysis model training methods. The processor 102 may also be called a central processing unit (Central Processing Unit, CPU). The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. In addition, the processor 102 may be implemented by an integrated circuit chip.
请参阅图11,图11为本公开计算机可读存储介质110一实施例的框架示意图。计算机可读存储介质110存储有能够被处理器运行的程序指令111,程序指令111被配置为实现上述任一图像分析方法,或实现上述任一图像分析模型的训练方法。Please refer to FIG. 11 , which is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of the present disclosure. The computer-readable storage medium 110 stores program instructions 111 that can be executed by the processor. The program instructions 111 are configured to implement any of the above image analysis methods, or to implement any of the above image analysis model training methods.
本公开实施例还提供一种计算机程序，所述计算机程序包括计算机可读代码，在所述计算机可读代码在电子设备中运行的情况下，所述电子设备的处理器执行用于上述任一图像分析方法，或实现上述任一图像分析模型的训练方法。Embodiments of the present disclosure further provide a computer program including computer readable code; when the computer readable code is run in an electronic device, the processor of the electronic device executes instructions for implementing any of the above image analysis methods, or for implementing any of the above training methods of the image analysis model.
在本公开所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,部分或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。In the several embodiments provided in this disclosure, it should be understood that the disclosed methods and devices can be implemented in other ways. For example, the device implementation described above is only illustrative. For example, the division of parts or units is only a logical function division. In actual implementation, there may be other division methods. For example, units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to network units. Some or all of the units can be selected according to actual needs to achieve the purpose of this embodiment.
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本公开各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、 磁碟或者光盘等各种可以存储程序代码的介质。Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on this understanding, the technical solution of the present disclosure is essentially or contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the various implementation methods of the present disclosure. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code. .
本公开涉及增强现实领域，通过获取现实环境中的目标对象的图像信息，进而借助各类视觉相关算法实现对目标对象的相关特征、状态及属性进行检测或识别处理，从而得到与具体应用匹配的虚拟与现实相结合的AR效果。示例性的，目标对象可涉及与人体相关的脸部、肢体、手势、动作等，或者与物体相关的标识物、标志物，或者与场馆或场所相关的沙盘、展示区域或展示物品等。视觉相关算法可涉及视觉定位、SLAM、三维重建、图像注册、背景分割、对象的关键点提取及跟踪、对象的位姿或深度检测等。具体应用不仅可以涉及跟真实场景或物品相关的导览、导航、讲解、重建、虚拟效果叠加展示等交互场景，还可以涉及与人相关的特效处理，比如妆容美化、肢体美化、特效展示、虚拟模型展示等交互场景。The present disclosure relates to the field of augmented reality. By acquiring image information of a target object in the real environment and then detecting or recognizing relevant features, states and attributes of the target object with various vision-related algorithms, an AR effect combining virtuality and reality that matches the specific application can be obtained. For example, the target object may involve a face, limbs, gestures or actions related to the human body, markers or signs related to objects, or sand tables, display areas or display items related to venues or places. Vision-related algorithms may involve visual positioning, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. Specific applications may involve not only interactive scenarios related to real scenes or objects, such as tours, navigation, explanations, reconstruction and virtual-effect overlay display, but also special-effect processing related to people, such as makeup beautification, body beautification, special-effect display and virtual model display.
可通过卷积神经网络,实现对目标对象的相关特征、状态及属性进行检测或识别处理。上述卷积神经网络是基于深度学习框架进行模型训练而得到的网络模型。Convolutional neural networks can be used to detect or identify the relevant features, states and attributes of target objects. The above-mentioned convolutional neural network is a network model obtained through model training based on a deep learning framework.
工业实用性Industrial applicability
本公开实施例提供了一种图像分析方法、模型的训练方法、装置、设备、介质及程序，其中，图像分析方法包括：获取图像序列、光流数据和图像序列中各个图像的参考数据；其中，各个图像包括具有共视关系的第一图像和第二图像，光流数据包括第一图像与第二图像之间的静态光流和整体光流，静态光流由摄像器件运动引起，整体光流由摄像器件运动和拍摄对象运动共同引起，且参考数据包括位姿和深度；基于图像序列和光流数据，预测得到分析结果；其中，分析结果包括静态光流的光流校准数据；基于静态光流和光流校准数据，对位姿和深度进行优化，得到更新的位姿和更新的深度。通过上述方案，能够在动态场景下，提升位姿和深度的精度。Embodiments of the present disclosure provide an image analysis method, a model training method, an apparatus, a device, a medium and a program. The image analysis method includes: acquiring an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by the motion of the camera device, the overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the reference data includes pose and depth; predicting an analysis result based on the image sequence and the optical flow data, wherein the analysis result includes optical flow calibration data of the static optical flow; and optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. With the above solution, the accuracy of pose and depth can be improved in dynamic scenes.

Claims (21)

  1. 一种图像分析方法,包括:An image analysis method including:
获取图像序列、光流数据和所述图像序列中各个图像的参考数据；其中，所述各个图像包括具有共视关系的第一图像和第二图像，所述光流数据包括所述第一图像与所述第二图像之间的静态光流和整体光流，所述静态光流由摄像器件运动引起，所述整体光流由摄像器件运动和拍摄对象运动共同引起，且所述参考数据包括位姿和深度；acquiring an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of a camera device, the overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the reference data includes a pose and a depth;
    基于所述图像序列和所述光流数据,预测得到分析结果;其中,所述分析结果包括所述静态光流的光流校准数据;Based on the image sequence and the optical flow data, an analysis result is predicted; wherein the analysis result includes optical flow calibration data of the static optical flow;
    基于所述静态光流和所述光流校准数据,对所述位姿和所述深度进行优化,得到更新的位姿和更新的深度。Based on the static optical flow and the optical flow calibration data, the pose and the depth are optimized to obtain an updated pose and an updated depth.
  2. 根据权利要求1所述的方法,其中,所述基于所述图像序列和所述光流数据,预测得到分析结果,包括:The method according to claim 1, wherein the predicting the analysis result based on the image sequence and the optical flow data includes:
基于所述第一图像的图像特征和所述第二图像的图像特征，得到所述第一图像与所述第二图像之间的特征相关数据，并基于所述静态光流将所述第一图像中像素点进行投影，得到所述第一图像中像素点在所述第二图像中的第一投影位置；obtaining feature correlation data between the first image and the second image based on image features of the first image and image features of the second image, and projecting pixel points of the first image based on the static optical flow to obtain first projection positions of the pixel points of the first image in the second image;
    基于所述第一投影位置在所述特征相关数据中搜索,得到目标相关数据;Search the feature-related data based on the first projection position to obtain target-related data;
    基于所述目标相关数据、所述静态光流和所述整体光流,得到所述分析结果。The analysis result is obtained based on the target-related data, the static optical flow and the overall optical flow.
  3. 根据权利要求2所述的方法,其中,所述基于所述目标相关数据、所述静态光流和所述整体光流,得到所述分析结果,包括:The method according to claim 2, wherein obtaining the analysis result based on the target-related data, the static optical flow and the overall optical flow includes:
    基于所述目标相关数据进行编码,得到第一编码特征,并基于所述静态光流和所述整体光流进行编码,得到第二编码特征;Encoding is performed based on the target-related data to obtain a first encoding feature, and encoding is performed based on the static optical flow and the overall optical flow to obtain a second encoding feature;
    基于所述第一编码特征和所述第二编码特征,预测得到所述分析结果。The analysis result is predicted based on the first encoding feature and the second encoding feature.
  4. 根据权利要求1至3任一项所述的方法，其中，所述参考数据还包括动态掩膜，所述动态掩膜用于指示所述图像中的运动对象，所述分析结果还包括置信度图和所述动态掩膜的掩膜校准数据，所述置信度图包括所述图像中各像素点的置信度；所述基于所述静态光流和所述光流校准数据，对所述位姿和所述深度进行优化，得到更新的位姿和更新的深度，包括：The method according to any one of claims 1 to 3, wherein the reference data further includes a dynamic mask used to indicate moving objects in the image, the analysis result further includes a confidence map and mask calibration data of the dynamic mask, and the confidence map includes a confidence of each pixel point in the image; the optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain the updated pose and the updated depth includes:
    基于所述动态掩膜、所述掩膜校准数据和所述置信度图进行融合,得到重要度图,并基于所述光流校准数据对第一投影位置进行校准,得到校准位置;其中,所述重要度图包括所述图像中各像素点的重要度,所述第一投影位置为所述第一图像中像素点基于所述静态光流投影在所述第二图像的像素位置;Fusion is performed based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map, and the first projection position is calibrated based on the optical flow calibration data to obtain a calibration position; wherein, The importance map includes the importance of each pixel in the image, and the first projection position is the pixel position of the pixel in the first image projected on the second image based on the static optical flow;
    基于所述校准位置和所述重要度图,优化得到所述更新的位姿和所述更新的深度。Based on the calibration position and the importance map, the updated pose and the updated depth are optimized.
  5. 根据权利要求4所述的方法,其中,所述光流校准数据包括所述第一图像中像素点的校准光流,所述基于所述光流校准数据对第一投影位置进行校准,得到校准位置,包括:The method according to claim 4, wherein the optical flow calibration data includes the calibration optical flow of pixels in the first image, and the first projection position is calibrated based on the optical flow calibration data to obtain the calibration Locations, including:
    将所述第一图像中像素点的校准光流加上所述像素点在所述第二图像中的第一投影位置,得到所述像素点的校准位置。The calibrated optical flow of the pixel in the first image is added to the first projection position of the pixel in the second image to obtain the calibrated position of the pixel.
  6. 根据权利要求4所述的方法,其中,所述基于所述动态掩膜、所述掩膜校准数据和所述置信度图进行融合,得到重要度图,包括:The method according to claim 4, wherein the fusion based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map includes:
基于所述掩膜校准数据对所述动态掩膜进行校准，得到校准掩膜；其中，所述校准掩膜包括所述图像中像素点与所述运动对象的相关度，且所述相关度与所述图像中像素点属于所述运动对象的可能性正相关；calibrating the dynamic mask based on the mask calibration data to obtain a calibrated mask, wherein the calibrated mask includes a correlation between each pixel point in the image and the moving object, and the correlation is positively related to the possibility that the pixel point in the image belongs to the moving object;
    基于所述置信度图和所述校准掩膜进行融合,得到所述重要度图。Fusion is performed based on the confidence map and the calibration mask to obtain the importance map.
  7. 根据权利要求1至6任一项所述的方法，其中，所述分析结果还包括动态光流，所述动态光流由所述拍摄对象运动引起；在所述基于所述静态光流和所述光流校准数据，对所述位姿和所述深度进行优化，得到更新的位姿和更新的深度之后，所述方法还包括：The method according to any one of claims 1 to 6, wherein the analysis result further includes a dynamic optical flow, the dynamic optical flow being caused by the motion of the photographed object; after the optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain the updated pose and the updated depth, the method further includes:
    基于所述更新的位姿和所述更新的深度,获取更新的静态光流,并基于所述动态光流和所述更新的静态光流,得到更新的整体光流;Based on the updated pose and the updated depth, obtain an updated static optical flow, and obtain an updated overall optical flow based on the dynamic optical flow and the updated static optical flow;
    基于所述更新的静态光流和所述更新的整体光流,得到更新的光流数据,并基于所述更新的位姿和更新的深度,得到更新的参考数据;Based on the updated static optical flow and the updated overall optical flow, obtain updated optical flow data, and obtain updated reference data based on the updated pose and updated depth;
    重新执行所述基于所述图像序列和所述光流数据,预测得到分析结果的步骤以及后续步骤。Re-execute the step of predicting and obtaining the analysis result based on the image sequence and the optical flow data and subsequent steps.
  8. 根据权利要求7所述的方法,其中,所述基于所述更新的位姿和所述更新的深度,获取更新 的静态光流,包括:The method according to claim 7, wherein said obtaining an updated static optical flow based on the updated pose and the updated depth includes:
    基于所述更新的位姿、所述更新的深度和所述第一图像中像素点的像素位置进行投影,得到所述第一图像中像素点投影在所述第二图像的第二投影位置;Projection is performed based on the updated pose, the updated depth and the pixel position of the pixel in the first image, to obtain a second projection position of the pixel in the first image projected on the second image;
基于所述第一图像中像素点投影在所述第二图像的第二投影位置和所述第一图像中像素点在所述第二图像中的对应位置之间的差异，得到所述更新的静态光流；其中，所述对应位置为在摄像器件未运动的情况下，所述第一图像中像素点所属的空间点投影在所述第二图像的像素位置。obtaining the updated static optical flow based on a difference between the second projection position at which a pixel point of the first image is projected onto the second image and a corresponding position of the pixel point of the first image in the second image, wherein the corresponding position is the pixel position at which the spatial point to which the pixel point of the first image belongs is projected onto the second image when the camera device has not moved.
  9. 根据权利要求7所述的方法,其中,所述基于所述动态光流和所述更新的静态光流,得到更新的整体光流,包括:The method of claim 7, wherein obtaining an updated overall optical flow based on the dynamic optical flow and the updated static optical flow includes:
    将所述动态光流和所述更新的静态光流相加,得到所述更新的整体光流。The dynamic optical flow and the updated static optical flow are added to obtain the updated overall optical flow.
  10. 一种图像分析模型的训练方法,包括:A training method for an image analysis model, including:
获取样本图像序列、样本光流数据和所述样本图像序列中各个样本图像的样本参考数据；其中，所述各个样本图像包括具有共视关系的第一样本图像和第二样本图像，所述样本光流数据包括所述第一样本图像与所述第二样本图像之间的样本静态光流和样本整体光流，所述样本静态光流由摄像器件运动引起，所述样本整体光流由摄像器件运动和拍摄对象运动共同引起，且所述样本参考数据包括样本位姿和样本深度；acquiring a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by motion of a camera device, the sample overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the sample reference data includes a sample pose and a sample depth;
    基于所述图像分析模型对所述样本图像序列和所述样本光流数据进行分析预测,得到样本分析结果;其中,所述样本分析结果包括所述样本静态光流的样本光流校准数据;Analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result; wherein the sample analysis result includes sample optical flow calibration data of the sample static optical flow;
    基于所述样本静态光流和所述样本光流校准数据,对所述样本位姿和所述样本深度进行优化,得到更新的样本位姿和更新的样本深度;Based on the sample static optical flow and the sample optical flow calibration data, optimize the sample pose and the sample depth to obtain an updated sample pose and an updated sample depth;
    基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失;Perform loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model;
    基于所述预测损失,调整所述图像分析模型的网络参数。Based on the prediction loss, network parameters of the image analysis model are adjusted.
  11. 根据权利要求10所述的方法，其中，所述样本参考数据还包括样本动态掩膜，所述样本动态掩膜用于指示所述样本图像中的运动对象，所述样本分析结果还包括样本动态光流和所述样本动态掩膜的样本掩膜校准数据，且所述样本动态光流由拍摄对象运动引起，所述预测损失包括掩膜预测损失；在所述基于所述样本静态光流和所述样本光流校准数据，对所述样本位姿和所述样本深度进行优化，得到更新的样本位姿和更新的样本深度之后，所述方法还包括：The method according to claim 10, wherein the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, the sample analysis result further includes the sample dynamic optical flow and sample mask calibration data of the sample dynamic mask, the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss includes a mask prediction loss; after the optimizing the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain the updated sample pose and the updated sample depth, the method further includes:
    基于所述样本动态光流、所述更新的样本位姿和所述更新的样本深度,得到更新的样本整体光流;Based on the sample dynamic optical flow, the updated sample pose and the updated sample depth, an updated sample overall optical flow is obtained;
    所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失,包括:The loss measurement is performed based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, including:
基于所述样本掩膜校准数据和所述样本动态掩膜，得到所述样本动态掩膜在模型维度更新得到的第一预测掩膜，并基于所述更新的样本整体光流、所述更新的样本位姿和所述更新的样本深度，得到所述样本动态掩膜在光流维度更新得到的第二预测掩膜；obtaining, based on the sample mask calibration data and the sample dynamic mask, a first prediction mask resulting from updating the sample dynamic mask in a model dimension, and obtaining, based on the updated sample overall optical flow, the updated sample pose and the updated sample depth, a second prediction mask resulting from updating the sample dynamic mask in an optical flow dimension;
    基于所述第一预测掩膜和所述第二预测掩膜之间的差异,得到所述掩膜预测损失。The mask prediction loss is obtained based on the difference between the first prediction mask and the second prediction mask.
  12. 根据权利要求11所述的方法,其中,所述基于所述更新的样本整体光流、所述更新的样本位姿和所述更新的样本深度,得到所述样本动态掩膜在光流维度更新得到的第二预测掩膜,包括:The method according to claim 11, wherein the sample dynamic mask is updated in the optical flow dimension based on the updated sample overall optical flow, the updated sample pose and the updated sample depth. The resulting second prediction mask includes:
    基于所述更新的样本位姿、所述更新的样本深度和所述第一样本图像中样本像素点的样本像素位置进行投影,得到所述第一样本图像中样本像素点投影在所述第二样本图像的第一样本投影位置;以及,Projection is performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, and the projection of the sample pixel in the first sample image is obtained. the first sample projection position of the second sample image; and,
基于所述更新的样本整体光流和所述第一样本图像中样本像素点的样本像素位置进行投影，得到所述第一样本图像中样本像素点投影在所述第二样本图像的第二样本投影位置；performing projection based on the updated sample overall optical flow and the sample pixel position of the sample pixel point in the first sample image, to obtain a second sample projection position at which the sample pixel point of the first sample image is projected onto the second sample image;
    基于所述第一样本投影位置和所述第二样本投影位置之间的差异,得到所述第二预测掩膜。The second prediction mask is obtained based on the difference between the first sample projection position and the second sample projection position.
  13. 根据权利要求12所述的方法,其中,所述基于所述第一样本投影位置和所述第二样本投影位置之间的差异,得到所述第二预测掩膜,包括:The method of claim 12, wherein obtaining the second prediction mask based on the difference between the first sample projection position and the second sample projection position includes:
基于所述第一样本投影位置与所述第二样本投影位置之间的距离对比预设阈值，得到所述样本像素点的样本掩膜值；其中，所述样本掩膜值用于表示所述样本像素点是否属于所述运动对象；comparing a distance between the first sample projection position and the second sample projection position with a preset threshold to obtain a sample mask value of the sample pixel point, wherein the sample mask value indicates whether the sample pixel point belongs to the moving object;
    基于各个所述样本像素点的样本掩膜值,得到所述第二预测掩膜。The second prediction mask is obtained based on the sample mask value of each sample pixel point.
  14. 根据权利要求10所述的方法,其中,所述样本参考数据还包括样本动态掩膜,所述样本动态掩膜用于指示所述样本图像中的运动对象,且所述预测损失包括几何光度损失;在所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失之前,所述方法还包括:The method of claim 10, wherein the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, and the prediction loss includes a geometric photometric loss ; Before performing loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, the method further includes:
    基于各个与所述第一样本图像具有所述共视关系的第二样本图像的样本动态掩膜进行融合,得到样本融合掩膜;Fusion is performed based on the sample dynamic masks of each second sample image having the common view relationship with the first sample image to obtain a sample fusion mask;
    所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失,包括:The loss measurement is performed based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, including:
    基于所述更新的样本位姿、所述更新的样本深度和所述第一样本图像中样本像素点的样本像素位置进行投影,得到所述第一样本图像中样本像素点投影在所述第二样本图像的第一样本投影位置;Projection is performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, and the projection of the sample pixel in the first sample image is obtained. the first sample projection position of the second sample image;
基于所述第一样本图像中样本像素点的样本像素位置，得到所述第一样本图像中样本像素点的第一样本像素值，并基于所述第一样本图像中样本像素点的第一样本投影位置，得到所述第一样本图像中样本像素点的第二样本像素值，以及基于所述样本融合掩膜，得到所述第一样本图像中样本像素点的融合掩膜值；obtaining a first sample pixel value of the sample pixel point in the first sample image based on the sample pixel position of the sample pixel point in the first sample image, obtaining a second sample pixel value of the sample pixel point in the first sample image based on the first sample projection position of the sample pixel point in the first sample image, and obtaining a fusion mask value of the sample pixel point in the first sample image based on the sample fusion mask;
    基于所述第一样本像素值、所述第二样本像素值和所述融合掩膜值,得到所述几何光度损失。The geometric photometric loss is obtained based on the first sample pixel value, the second sample pixel value and the fused mask value.
  15. 根据权利要求14所述的方法,其中,所述基于所述第一样本像素值、所述第二样本像素值和所述融合掩膜值,得到所述几何光度损失,包括:The method of claim 14, wherein obtaining the geometric photometric loss based on the first sample pixel value, the second sample pixel value and the fusion mask value includes:
    获取所述第一样本像素值和所述第二样本像素值之间的像素差值;Obtain the pixel difference between the first sample pixel value and the second sample pixel value;
    利用所述融合掩膜值对所述像素差值进行加权,得到加权差值;Use the fusion mask value to weight the pixel difference value to obtain a weighted difference value;
    基于各个所述样本像素点的加权差值,得到所述几何光度损失。The geometric photometric loss is obtained based on the weighted difference value of each sample pixel point.
  16. 根据权利要求15所述的方法,其中,所述获取所述第一样本像素值和所述第二样本像素值之间的像素差值,包括:The method according to claim 15, wherein said obtaining the pixel difference value between the first sample pixel value and the second sample pixel value includes:
基于结构相似性度量所述第一样本像素值和所述第二样本像素值，得到第一差值，并基于绝对值偏差度量所述第一样本像素值和所述第二样本像素值，得到第二差值；measuring the first sample pixel value and the second sample pixel value based on structural similarity to obtain a first difference value, and measuring the first sample pixel value and the second sample pixel value based on absolute deviation to obtain a second difference value;
    基于所述第一差值和所述第二差值进行加权,得到所述像素差值。Weighting is performed based on the first difference value and the second difference value to obtain the pixel difference value.
  17. 一种图像分析装置,包括:An image analysis device, including:
获取部分，被配置为获取图像序列、光流数据和所述图像序列中各个图像的参考数据；其中，所述各个图像包括具有共视关系的第一图像和第二图像，所述光流数据包括所述第一图像与所述第二图像之间的静态光流和整体光流，所述静态光流由摄像器件运动引起，所述整体光流由摄像器件运动和拍摄对象运动共同引起，且所述参考数据包括位姿和深度；an acquisition part configured to acquire an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of a camera device, the overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the reference data includes a pose and a depth;
    分析部分,被配置为基于所述图像序列和所述光流数据,预测得到分析结果;其中,所述分析结果包括所述静态光流的光流校准数据;An analysis part configured to predict an analysis result based on the image sequence and the optical flow data; wherein the analysis result includes optical flow calibration data of the static optical flow;
    优化部分,被配置为基于所述静态光流和所述光流校准数据,对所述位姿和所述深度进行优化,得到更新的位姿和更新的深度。The optimization part is configured to optimize the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
  18. 一种图像分析模型的训练装置,包括:An image analysis model training device, including:
样本获取部分，被配置为获取样本图像序列、样本光流数据和所述样本图像序列中各个样本图像的样本参考数据；其中，所述各个样本图像包括具有共视关系的第一样本图像和第二样本图像，所述样本光流数据包括所述第一样本图像与所述第二样本图像之间的样本静态光流和样本整体光流，所述样本静态光流由摄像器件运动引起，所述样本整体光流由摄像器件运动和拍摄对象运动共同引起，且所述样本参考数据包括样本位姿和样本深度；a sample acquisition part configured to acquire a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by motion of a camera device, the sample overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the sample reference data includes a sample pose and a sample depth;
    样本分析部分,被配置为基于所述图像分析模型对所述样本图像序列和所述样本光流数据进行分析预测,得到样本分析结果;其中,所述样本分析结果包括所述样本静态光流的样本光流校准数据;The sample analysis part is configured to analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result; wherein the sample analysis result includes the static optical flow of the sample. Sample optical flow calibration data;
    样本优化部分,被配置为基于所述样本静态光流和所述样本光流校准数据,对所述样本位姿和所述样本深度进行优化,得到更新的样本位姿和更新的样本深度;A sample optimization part configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth;
    损失度量部分,被配置为基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失;a loss measurement part configured to perform loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model;
    参数调整部分,被配置为基于所述预测损失,调整所述图像分析模型的网络参数。The parameter adjustment part is configured to adjust the network parameters of the image analysis model based on the prediction loss.
  19. 一种电子设备，包括相互耦接的存储器和处理器，所述处理器用于执行所述存储器中存储的程序指令，以实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。An electronic device, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the image analysis method according to any one of claims 1 to 9, or to implement the training method of the image analysis model according to any one of claims 10 to 16.
  20. 一种计算机可读存储介质，其上存储有程序指令，所述程序指令被处理器执行时实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。A computer-readable storage medium having program instructions stored thereon, the program instructions, when executed by a processor, implementing the image analysis method according to any one of claims 1 to 9, or implementing the training method of the image analysis model according to any one of claims 10 to 16.
  21. 一种计算机程序，所述计算机程序包括计算机可读代码，在所述计算机可读代码在电子设备中运行的情况下，所述电子设备的处理器执行用于实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。A computer program comprising computer readable code, wherein, when the computer readable code is run in an electronic device, a processor of the electronic device executes instructions for implementing the image analysis method according to any one of claims 1 to 9, or for implementing the training method of the image analysis model according to any one of claims 10 to 16.
PCT/CN2022/119646 2022-03-25 2022-09-19 Image analysis method and apparatus, model training method and apparatus, and device, medium and program WO2023178951A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210307855.3 2022-03-25
CN202210307855.3A CN114612545A (en) 2022-03-25 2022-03-25 Image analysis method and training method, device, equipment and medium of related model

Publications (1)

Publication Number Publication Date
WO2023178951A1 true WO2023178951A1 (en) 2023-09-28

Family

ID=81867129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119646 WO2023178951A1 (en) 2022-03-25 2022-09-19 Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Country Status (2)

Country Link
CN (1) CN114612545A (en)
WO (1) WO2023178951A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612545A (en) * 2022-03-25 2022-06-10 浙江商汤科技开发有限公司 Image analysis method and training method, device, equipment and medium of related model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311664A (en) * 2020-03-03 2020-06-19 上海交通大学 Joint unsupervised estimation method and system for depth, pose and scene stream
US20200211206A1 (en) * 2018-12-27 2020-07-02 Baidu Usa Llc Joint learning of geometry and motion with three-dimensional holistic understanding
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN111797688A (en) * 2020-06-02 2020-10-20 武汉大学 Visual SLAM method based on optical flow and semantic segmentation
CN112686952A (en) * 2020-12-10 2021-04-20 中国科学院深圳先进技术研究院 Image optical flow computing system, method and application
CN112884813A (en) * 2021-02-18 2021-06-01 北京小米松果电子有限公司 Image processing method, device and storage medium
CN114612545A (en) * 2022-03-25 2022-06-10 浙江商汤科技开发有限公司 Image analysis method and training method, device, equipment and medium of related model

Also Published As

Publication number Publication date
CN114612545A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
Dai et al. Rgb-d slam in dynamic environments using point correlations
JP7009399B2 (en) Detection of objects in video data
Walch et al. Image-based localization using lstms for structured feature correlation
CN107980150B (en) Modeling three-dimensional space
Baak et al. A data-driven approach for real-time full body pose reconstruction from a depth camera
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
Dockstader et al. Multiple camera tracking of interacting and occluded human motion
US20130335528A1 (en) Imaging device capable of producing three dimensional representations and methods of use
Boniardi et al. Robot localization in floor plans using a room layout edge extraction network
Wang et al. Tracking everything everywhere all at once
CN108229347B (en) Method and apparatus for deep replacement of quasi-Gibbs structure sampling for human recognition
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
Labbé et al. Single-view robot pose and joint angle estimation via render & compare
Košecka Detecting changes in images of street scenes
Liu et al. Single-view 3D scene reconstruction and parsing by attribute grammar
WO2022252487A1 (en) Pose acquisition method, apparatus, electronic device, storage medium, and program
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
CN110070578B (en) Loop detection method
Zhang et al. Hand-held monocular SLAM based on line segments
CN111105439A (en) Synchronous positioning and mapping method using residual attention mechanism network
WO2023178951A1 (en) Image analysis method and apparatus, model training method and apparatus, and device, medium and program
Phalak et al. Scan2plan: Efficient floorplan generation from 3d scans of indoor scenes
US11188787B1 (en) End-to-end room layout estimation
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
Chen et al. StateNet: Deep state learning for robust feature matching of remote sensing images

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22933003

Country of ref document: EP

Kind code of ref document: A1