WO2023178951A1 - Image analysis method and apparatus, model training method and apparatus, and device, medium and program - Google Patents

Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Info

Publication number
WO2023178951A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
image
optical flow
updated
pixel
Prior art date
Application number
PCT/CN2022/119646
Other languages
French (fr)
Chinese (zh)
Inventor
章国锋
鲍虎军
叶伟才
余星源
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2023178951A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image analysis method and apparatus, a model training method and apparatus, and a device, a medium and a program. The image analysis method comprises: acquiring an image sequence, optical flow data, and reference data of each image in the image sequence (S11), wherein each image comprises a first image and a second image, which have a co-visibility relationship, the optical flow data comprises a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by the movement of a photographic device, the overall optical flow is caused by both the movement of the photographic device and the movement of a photographic subject, and the reference data comprises a pose and a depth; on the basis of the image sequence and the optical flow data, performing prediction to obtain an analysis result (S12), wherein the analysis result comprises optical flow calibration data of the static optical flow; and on the basis of the static optical flow and the optical flow calibration data, optimizing the pose and the depth, so as to obtain an updated pose and an updated depth (S13). By means of the solution, the precision of a pose and a depth can be improved in a dynamic scenario.

Description

Image analysis method, model training method, apparatus, device, medium and program
Cross-reference to related applications
This disclosure claims priority to Chinese patent application No. 202210307855.3, filed on March 25, 2022 by 浙江商汤科技开发有限公司 (Zhejiang SenseTime Technology Development Co., Ltd.) and entitled "Image analysis method and related model training method, apparatus, device and medium", the entire contents of which are incorporated into this disclosure by reference.
Technical field
The present disclosure relates to the field of computer vision, and in particular to an image analysis method, a model training method, and an apparatus, device, medium and program.
Background
Simultaneous Localization and Mapping (SLAM) is one of the most fundamental tasks in computer vision and robotics, with applications that include, but are not limited to, augmented reality (AR), virtual reality (VR) and autonomous driving. Among SLAM variants, monocular dense SLAM has attracted much attention because monocular video is simple to acquire, but compared with dense SLAM based on depth (Red Green Blue-Depth, RGB-D) images it is a difficult task. Research has found that building a robust and reliable SLAM system remains challenging: in dynamic scenes in particular, existing SLAM systems still have major problems and cannot obtain accurate poses and depths.
Summary
Embodiments of the present disclosure provide an image analysis method, a model training method, an apparatus, a device, a medium and a program.
An embodiment of the present disclosure provides an image analysis method, including: acquiring an image sequence, optical flow data, and reference data of each image in the image sequence, where the images include a first image and a second image that have a co-visibility relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth; predicting an analysis result based on the image sequence and the optical flow data, where the analysis result includes optical flow calibration data of the static optical flow; and optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
In this way, the image sequence, the optical flow data and the reference data of each image in the image sequence are acquired, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by camera motion, the overall optical flow is caused jointly by camera motion and object motion, and the reference data includes pose and depth. On this basis, an analysis result is predicted from the image sequence and the optical flow data, the analysis result including optical flow calibration data of the static optical flow, and the pose and depth are optimized based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. By imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are both consulted to predict calibration data for the static optical flow. In the subsequent pose and depth optimization, the static optical flow and its calibration data can then be combined to reduce, as far as possible, the influence of object motion, thereby improving the accuracy of the pose and depth.
An embodiment of the present disclosure provides a training method for an image analysis model, including: acquiring a sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence, where the sample images include a first sample image and a second sample image with a co-visibility relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first and second sample images, the sample static optical flow is caused by motion of the camera device, the sample overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the sample reference data includes a sample pose and a sample depth; analyzing and predicting the sample image sequence and the sample optical flow data with the image analysis model to obtain a sample analysis result, where the sample analysis result includes sample optical flow calibration data of the sample static optical flow; optimizing the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; measuring a loss based on the updated sample pose and the updated sample depth to obtain a prediction loss of the image analysis model; and adjusting network parameters of the image analysis model based on the prediction loss.
Thus, similarly to the inference stage, by imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are consulted to predict calibration data for the static optical flow. In the subsequent pose and depth optimization, combining the static optical flow and its calibration data reduces the influence of object motion as far as possible, which improves the performance of the image analysis model, helps improve the accuracy of the analysis results obtained with the model at inference time, and in turn improves the accuracy of the pose and depth at inference time.
An embodiment of the present disclosure provides an image analysis apparatus, including: an acquisition part configured to acquire an image sequence, optical flow data, and reference data of each image in the image sequence, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth; an analysis part configured to predict an analysis result based on the image sequence and the optical flow data, where the analysis result includes optical flow calibration data of the static optical flow; and an optimization part configured to optimize the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
An embodiment of the present disclosure provides a training apparatus for an image analysis model, including: a sample acquisition part configured to acquire a sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence, where the sample images include a first sample image and a second sample image with a co-visibility relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first and second sample images, the sample static optical flow is caused by motion of the camera device, the sample overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the sample reference data includes a sample pose and a sample depth; a sample analysis part configured to analyze and predict the sample image sequence and the sample optical flow data with the image analysis model to obtain a sample analysis result, where the sample analysis result includes sample optical flow calibration data of the sample static optical flow; a sample optimization part configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; a loss measurement part configured to measure a loss based on the updated sample pose and the updated sample depth to obtain a prediction loss of the image analysis model; and a parameter adjustment part configured to adjust network parameters of the image analysis model based on the prediction loss.
An embodiment of the present disclosure provides an electronic device, including a memory and a processor coupled to each other, where the processor is configured to execute program instructions stored in the memory to implement the above image analysis method or the above training method for an image analysis model.
An embodiment of the present disclosure provides a computer-readable storage medium on which program instructions are stored; when the program instructions are executed by a processor, the above image analysis method or the above training method for an image analysis model is implemented.
An embodiment of the present disclosure provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the code to implement the above image analysis method or the above training method for an image analysis model.
With the image analysis method, model training method, apparatus, device, medium and program provided by the embodiments of the present disclosure, the image sequence, the optical flow data and the reference data of each image in the image sequence are first acquired, where the images include a first image and a second image with a co-visibility relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by motion of the camera device, the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object, and the reference data includes pose and depth. On this basis, an analysis result is predicted from the image sequence and the optical flow data, the analysis result including optical flow calibration data of the static optical flow, and the pose and depth are optimized based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. By imitating the way humans perceive the real world, the overall optical flow is treated as being caused jointly by camera motion and object motion, and during image analysis the overall optical flow and the camera-induced static optical flow are both consulted to predict calibration data for the static optical flow, so that in the subsequent pose and depth optimization the static optical flow and its calibration data can be combined to reduce the influence of object motion as far as possible, thereby improving the accuracy of the pose and depth.
To make the above objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are described below. The drawings are incorporated into and form a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Figure 1 is a schematic flowchart of an embodiment of the image analysis method of the present disclosure;
Figure 2 is a schematic diagram of an embodiment of overall optical flow decomposition;
Figure 3a is a schematic process diagram of an embodiment of the image analysis method of the present disclosure;
Figure 3b is a schematic framework diagram of an embodiment of a dynamic update network;
Figure 4a is a schematic comparison, for one embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 4b is a schematic comparison, for another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5a is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5b is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5c is a schematic comparison, for yet another embodiment, of the trajectory determined by the image analysis method of the present disclosure, the actual trajectory, and the trajectory determined by the prior art;
Figure 5d is a schematic diagram of map reconstruction when the image analysis method of the present disclosure is applied to various datasets;
Figure 5e is a schematic diagram of the image analysis method of the present disclosure applied to a motion segmentation task;
Figure 5f is a schematic comparison of the image analysis method of the present disclosure and the prior art when each is applied to AR;
Figure 6 is a schematic flowchart of an embodiment of the training method of the image analysis model of the present disclosure;
Figure 7 is a schematic diagram of an embodiment of a dynamic scene;
Figure 8 is a schematic framework diagram of an embodiment of the image analysis apparatus of the present disclosure;
Figure 9 is a schematic framework diagram of an embodiment of the training apparatus of the image analysis model of the present disclosure;
Figure 10 is a schematic framework diagram of an embodiment of the electronic device of the present disclosure;
Figure 11 is a schematic framework diagram of an embodiment of the computer-readable storage medium of the present disclosure.
Detailed description
The solutions of the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
In the following description, specific details such as particular system structures, interfaces and techniques are set forth for the purpose of explanation rather than limitation, so as to provide a thorough understanding of the present disclosure.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects. Furthermore, "multiple" herein means two or more than two.
Please refer to Figure 1, which is a schematic flowchart of an embodiment of the image analysis method of the present disclosure. The method may include the following steps:
Step S11: acquire an image sequence, optical flow data, and reference data of each image in the image sequence.
In the embodiments of the present disclosure, the images include a first image and a second image that have a co-visibility relationship. Specifically, if a pixel in the first image is back-projected to a three-dimensional point in space and that three-dimensional point can also be projected into the second image, the first image and the second image can be considered to have a co-visibility relationship; in other words, if a certain three-dimensional point in space is visible in both the first image and the second image, the two images have a co-visibility relationship. That is, when the fields of view of the first image and the second image at least partially overlap, the two images can be considered co-visible. In addition, during analysis, there may be one or more second images that are co-visible with the first image, and the first image together with at least one second image can form the image sequence.
In the embodiments of the present disclosure, the optical flow data may include a static optical flow and an overall optical flow between the first image and the second image; the static optical flow is caused by motion of the camera device, while the overall optical flow is caused jointly by motion of the camera device and motion of the photographed object. For example, suppose a three-dimensional point in space is located at P1(u1, v1) in the first image captured by the camera at time t1 and the object to which it belongs is stationary. If, due to the motion of the camera itself, this point is located at P2(u2, v2) in the second image captured at time t2, then the static optical flow value at pixel position P1(u1, v1) can be written as (u2-u1, v2-v1). The static optical flow between the first image and the second image contains the static optical flow value of every pixel in the first image, so adding a pixel's position in the first image to its static optical flow value gives the pixel position in the second image that the corresponding three-dimensional point theoretically maps to under the camera's own motion; if that three-dimensional point lies on a stationary object and the static optical flow is perfectly accurate, this position is also the projection of the three-dimensional point in the second image. Alternatively, again taking a three-dimensional point located at P1(u1, v1) in the first image captured at time t1, if the object to which it belongs is moving and, due to both the camera's motion and the object's motion, the point is located at P3(u3, v3) in the second image captured at time t2, then the overall optical flow value at pixel position P1(u1, v1) can be written as (u3-u1, v3-v1). The overall optical flow between the first image and the second image contains the overall optical flow value of every pixel in the first image, so adding a pixel's position in the first image to its overall optical flow value gives the pixel position in the second image that the corresponding three-dimensional point theoretically maps to under the combined motion of the camera and the object; if the overall optical flow is perfectly accurate, this position is also the projection of that three-dimensional point in the second image.
In one implementation scenario, denote the first image as image i and the second image as image j. After the coordinate transformation induced by the static optical flow caused by camera motion, each pixel in the first image corresponds to a pixel position in the second image; if the pixel belongs to a stationary object and the static optical flow is perfectly accurate, the pixel in the first image and the corresponding pixel obtained in the second image after this transformation correspond to the same three-dimensional point in space. For ease of description, the static optical flow is denoted F_sij. Similarly, after the coordinate transformation induced by the overall optical flow caused jointly by camera motion and object motion, each pixel in the first image corresponds to a pixel position in the second image; if the overall optical flow is perfectly accurate, the pixel in the first image and the transformed pixel in the second image correspond to the same three-dimensional point in space. For ease of description, the overall optical flow is denoted F_oij.
In the embodiments of the present disclosure, the reference data includes pose and depth. Still denoting the first image as image i and the second image as image j, the reference data may include the pose G_i of the first image and the pose G_j of the second image, as well as the depth value of each pixel in the first image i and in the second image j: the depth of the first image contains the depth value of every pixel in the first image, and the depth of the second image contains the depth value of every pixel in the second image. For ease of description, the depth of the first image is denoted d_i and, similarly, the depth of the second image is denoted d_j. Note that "pose" refers jointly to position and orientation, and describes the transformation between the world coordinate system and the camera coordinate system. Depth represents the distance from an object to the camera device; in the embodiments of the present disclosure, depth may be represented using inverse depth parameterization.
In one implementation scenario, the embodiments of the present disclosure may iterate N times (e.g., 10 or 15 times) to optimize the depth and pose as much as possible and improve their accuracy, in which case the pose can be given an initial value in the first iteration. For example, the pose may be represented by a 4*4 matrix and can be initialized as the matrix whose main-diagonal elements are 1 and whose other elements are 0. In subsequent iterations, the pose input to the i-th iteration may be the pose output by the (i-1)-th iteration.
In one implementation scenario, the depth can be given an initial value in a similar way in the first iteration; the specific value is not limited here. For example, stationary objects (e.g., buildings, street lamps) in the first and second images may first be identified, and feature matching may then be performed between the first and second images based on those stationary objects to obtain a number of matching point pairs, where each matching point pair contains a first pixel belonging to a stationary object in the first image and a second pixel belonging to a stationary object in the second image, and the first and second pixels correspond to the same three-dimensional point in space. On this basis, the three-dimensional position of the first pixel can be determined from the pose of the first image, the depth value of the first pixel, and the pixel position of the first pixel in the first image; meanwhile, the three-dimensional position of the second pixel can be determined from the pose of the second image, the depth value of the second pixel in the same matching pair, and its pixel position in the second image. Since the three-dimensional positions corresponding to the first pixel and the second pixel should be identical, a set of equations with the depth values of the first and second pixels as unknowns can be constructed from the matching point pairs; solving these equations yields the depth values of the first and second pixels, which then serve as the initial values of the depth of the first image and the depth of the second image in the first iteration. In subsequent iterations, the depth input to the i-th iteration may be the depth output by the (i-1)-th iteration.
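The following is a minimal, illustrative sketch of the initialization and iteration scheme just described; the `one_iteration` stub is a hypothetical placeholder for one analysis-and-optimization pass (steps S12 and S13), and the constant initial inverse depth simply stands in for the feature-matching-based initialization above.

```python
# Minimal sketch, assuming 4x4 matrix poses and per-pixel inverse depth.
import torch

H, W, N = 48, 64, 10  # illustrative image size and number of iterations

def one_iteration(pose, inv_depth):
    # Stub: a real pass would predict flow calibration data and re-optimize
    # the pose and depth; here the inputs are returned unchanged.
    return pose, inv_depth

# First iteration: pose initialized to the matrix with 1s on the main diagonal
# and 0s elsewhere; inverse depth given a placeholder constant initial value.
pose = torch.eye(4)
inv_depth = torch.full((H, W), 0.5)

for k in range(N):
    # The pose/depth fed into iteration k is the pose/depth output by iteration k-1.
    pose, inv_depth = one_iteration(pose, inv_depth)
```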
In one implementation scenario, after obtaining the pose and depth for the first iteration, the pixel positions p_i of the pixels in the first image i can be projected based on the depths d_i and the relative pose G_ij between the first image and the second image, yielding the pixel positions p_ij at which the pixels of the first image project into the second image, as shown in Formula (1):

p_ij = ∏_c(G_ij ∘ ∏_c^{-1}(p_i, d_i))    ... Formula (1);

In Formula (1), ∏_c denotes the camera model used to map three-dimensional points onto the image, ∏_c^{-1} denotes the back-projection function used to map two-dimensional points to three-dimensional points based on the pixel positions p_i and the depths d_i, and the operator ∘ denotes the Hadamard product. The relative pose G_ij can be expressed as:

G_ij = G_j ∘ G_i^{-1}    ... Formula (2);

In addition, taking the first image i and the second image j as two-dimensional images of width W and height H as an example, the pixel positions p_i of the pixels in the first image i can be represented as an H*W two-channel image, i.e., p_i ∈ R^{H×W×2}; similarly, the positions p_ij at which pixels of the first image project into the second image can also be represented as an H*W two-channel image, i.e., p_ij ∈ R^{H×W×2}. On this basis, in the first iteration, for the pixel position p_i of any pixel in the first image i, its corresponding position p_j in the second image can be obtained, where this corresponding position is the pixel position at which the spatial point (i.e., the three-dimensional point) to which the pixel belongs would project into the second image if the camera device had not moved. The static optical flow F_sij can then be obtained from the difference between each pixel's corresponding position p_j in the second image and its projected position p_ij in the second image:

F_sij = p_ij - p_j    ... Formula (3).
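To make Formulas (1)-(3) concrete, the following is a minimal sketch under a simple pinhole-camera assumption; the intrinsic matrix K, the tensor layouts and the function names are illustrative assumptions rather than the patent's actual implementation, and the relative pose is applied here as a rigid-body transform of the back-projected points, which is one common reading of Formula (1).

```python
import torch

def backproject(p, inv_depth, K):
    """Pi_c^{-1}: map pixel grid p (H,W,2) and inverse depth (H,W) to 3D points (H,W,3)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = 1.0 / inv_depth.clamp(min=1e-6)
    x = (p[..., 0] - cx) / fx * z
    y = (p[..., 1] - cy) / fy * z
    return torch.stack([x, y, z], dim=-1)

def project(X, K):
    """Pi_c: map 3D points (H,W,3) back onto the image plane (H,W,2)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = X[..., 2].clamp(min=1e-6)
    u = fx * X[..., 0] / z + cx
    v = fy * X[..., 1] / z + cy
    return torch.stack([u, v], dim=-1)

def static_flow(p_i, inv_depth_i, G_ij, K):
    """Formulas (1) and (3): project pixels of image i into image j with the
    relative pose G_ij (4x4), then subtract the unmoved pixel grid (p_j)."""
    X_i = backproject(p_i, inv_depth_i, K)                     # 3D points in camera i
    X_h = torch.cat([X_i, torch.ones_like(X_i[..., :1])], -1)  # homogeneous coordinates
    X_j = (X_h @ G_ij.T)[..., :3]                              # points expressed in camera j
    p_ij = project(X_j, K)                                     # Formula (1)
    return p_ij - p_i                                          # Formula (3), with p_j taken as the unmoved grid
```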
In one implementation scenario, as mentioned above, the overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the flow caused by camera motion is called the static optical flow; for ease of distinction, the flow caused by object motion is called the dynamic optical flow. In the first iteration, the dynamic optical flow can be initialized as an all-zero matrix, which can likewise be represented as an H*W two-channel image. On this basis, in the first iteration, the aforementioned static optical flow F_sij can be added to the dynamic optical flow represented by the all-zero matrix to obtain the overall optical flow F_oij. In other words, in this embodiment, the overall optical flow can be decomposed into a static optical flow and a dynamic optical flow. Similarly, the sample overall optical flow in the embodiments disclosed below can also be decomposed into a sample static optical flow and a sample dynamic optical flow. Please refer to Figure 2, which is a schematic diagram of an embodiment of overall optical flow decomposition: the flow caused jointly by camera motion and object motion (i.e., the overall optical flow) can be decomposed into the flow caused by camera motion (i.e., the static optical flow) and the flow caused by object motion (i.e., the dynamic optical flow).
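As a minimal illustration of the decomposition in Figure 2, the overall optical flow can be formed by adding the static and dynamic components, each stored as an H*W two-channel array; the array sizes below are arbitrary placeholders.

```python
import torch

H, W = 48, 64
static_flow_sij = torch.zeros(H, W, 2)   # F_s: flow caused by camera motion
dynamic_flow = torch.zeros(H, W, 2)      # flow caused by object motion (all zeros in the first iteration)

# Overall flow F_o = static flow + dynamic flow; on the first iteration it equals F_s.
overall_flow_oij = static_flow_sij + dynamic_flow
```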
Step S12: predict an analysis result based on the image sequence and the optical flow data.
In the embodiments of the present disclosure, the analysis result includes optical flow calibration data of the static optical flow, and the optical flow calibration data may include a calibration value for each static optical flow value in the static optical flow. As mentioned above, the static optical flow can be represented as an H*W two-channel image, so the optical flow calibration data can likewise be represented as an H*W two-channel image. For ease of description, the optical flow calibration data is denoted r_sij ∈ R^{H×W×2}.
In one implementation scenario, feature correlation data between the first image and the second image can be obtained based on the image features of the first image and the image features of the second image, and the pixels of the first image can be projected based on the static optical flow to obtain the first projected positions of the pixels of the first image in the second image. On this basis, the feature correlation data can be searched based on the first projected positions to obtain target correlation data, and the analysis result can be obtained based on the target correlation data, the static optical flow and the overall optical flow. In this way, when searching the feature correlation data of the two images for the target correlation data, referring to the static optical flow caused by camera motion can reduce the influence of object motion, which in turn helps improve the accuracy of the subsequently optimized pose and depth.
In one implementation scenario, please refer to Figure 3a, which is a schematic process diagram of an embodiment of the image analysis method of the present disclosure. As shown in Figure 3a, to improve the efficiency of image analysis, an image analysis model can be trained in advance, and the image analysis model may include an image encoder 301 for feature-encoding the first image i and an image encoder 302 for feature-encoding the second image j. The two image encoders 301 and 302 may share network parameters. Image encoders 301 and 302 may contain several (e.g., 6 or 7) residual blocks and several (e.g., 3 or 4) downsampling layers; the network structure of the encoders is not limited here. In addition, for example, the resolution of the image features obtained after processing by image encoders 301 and 302 may be 1/8, 1/12 or 1/16 of the input image, which is not limited here.
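The following is a rough PyTorch sketch of a shared-weight feature encoder of the kind just described: a few residual blocks with three stride-2 stages, giving features at 1/8 of the input resolution. The channel sizes and layer counts are assumptions and are not the patent's actual encoders 301/302.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class FeatureEncoder(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 7, stride=2, padding=3)                               # 1/2 resolution
        self.stage1 = nn.Sequential(ResidualBlock(32),
                                    nn.Conv2d(32, 64, 3, stride=2, padding=1))             # 1/4 resolution
        self.stage2 = nn.Sequential(ResidualBlock(64),
                                    nn.Conv2d(64, out_dim, 3, stride=2, padding=1))        # 1/8 resolution
        self.head = ResidualBlock(out_dim)

    def forward(self, image):
        return self.head(self.stage2(self.stage1(self.stem(image))))

# The two encoders share parameters, so a single instance can encode both images.
encoder = FeatureEncoder()
feat_i = encoder(torch.randn(1, 3, 384, 512))   # -> (1, 128, 48, 64)
feat_j = encoder(torch.randn(1, 3, 384, 512))
```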
In one implementation scenario, the feature correlation data can be obtained by taking the dot product of the image features of the first image i and the image features of the second image j, and the feature correlation data can be represented as a four-dimensional volume. For example, denote the image features of the first image as g_i and the image features of the second image as g_j; on this basis, the feature correlation data C_ij can be computed via the dot product:

C_ij(u_i, v_i, u_j, v_j) = ⟨g_i(u_i, v_i), g_j(u_j, v_j)⟩    ... Formula (4);
In Formula (4), u_i, v_i and u_j, v_j denote pixel coordinates in the first image i and the second image j, respectively, and ⟨,⟩ denotes the dot product. To account for objects at different scales, the last two dimensions of the above feature correlation volume can be processed by average pooling with different kernel sizes (e.g., 1, 2, 4, 8) to form a multi-level feature correlation pyramid, which serves as the feature correlation data. For the details of the feature correlation step, refer to the technical details of RAFT (Recurrent All-Pairs Field Transforms for Optical Flow). The feature correlation data C_ij can be regarded as a measure of the visual consistency between the first image i and the second image j.
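The following is a minimal sketch of the correlation computation in Formula (4) together with the average-pooling pyramid mentioned above (the RAFT-style construction the text refers to); the tensor shapes and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_volume(feat_i, feat_j):
    """feat_i, feat_j: (D, H, W) feature maps -> correlation volume C_ij of shape (H, W, H, W)."""
    D, H, W = feat_i.shape
    fi = feat_i.reshape(D, H * W)
    fj = feat_j.reshape(D, H * W)
    corr = fi.t() @ fj                      # <g_i(u_i, v_i), g_j(u_j, v_j)> for every pair of pixels
    return corr.reshape(H, W, H, W)

def correlation_pyramid(corr, num_levels=4):
    """Average-pool the last two dimensions to handle objects at different scales."""
    H, W = corr.shape[:2]
    pyramid = [corr]
    x = corr.reshape(H * W, 1, H, W)        # treat the (u_j, v_j) dimensions as a spatial map
    for _ in range(num_levels - 1):
        x = F.avg_pool2d(x, kernel_size=2, stride=2)
        pyramid.append(x.reshape(H, W, x.shape[-2], x.shape[-1]))
    return pyramid
```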
In one implementation scenario, a correlation lookup function can be defined whose input parameters include a coordinate grid and a radius r; based on this, the target correlation data L_r can be retrieved. The function takes an H×W coordinate grid as input, matching the image dimensions of the static optical flow. Specifically, the pixel coordinates of each pixel in the first image can be added directly to that pixel's value in the static optical flow to obtain the pixel's first projected position in the second image. On this basis, the target correlation data can be retrieved from the feature correlation data by linear interpolation. The correlation lookup function is applied to every level of the aforementioned feature correlation pyramid, and the target correlation data retrieved from each level can be concatenated to obtain the final target correlation data. For the details of the lookup procedure, refer to the technical details of RAFT.
In one implementation scenario, as mentioned above, an image analysis model can be trained in advance to improve the efficiency of image analysis. In addition, as shown in Figure 3b, the image analysis model may include a dynamic update network 303, and the dynamic update network 303 may include, but is not limited to, a semantic extraction sub-network 3033 such as a ConvGRU (a gated recurrent unit combined with convolution); the network structure of the dynamic update network 303 is not limited here. After the target correlation data (retrieved from the feature correlation data 305 by linear interpolation), the static optical flow 3063 and the overall optical flow 3062 are obtained, they can be input into the dynamic update network 303 to obtain the analysis result. Please also refer to Figure 3b, which is a schematic framework diagram of an embodiment of the dynamic update network. As shown in Figure 3b, the dynamic update network 303 may include an optical flow encoder 3031 and a correlation encoder 3032, so that encoding can be performed based on the target correlation data to obtain a first encoded feature, encoding can be performed based on the static optical flow 3063 and the overall optical flow 3062 to obtain a second encoded feature, and the analysis result is then predicted from the first encoded feature and the second encoded feature. Specifically, the first encoded feature and the second encoded feature can be fed together into the convolutional gated recurrent unit (ConvGRU) to obtain deep semantic features, and the analysis result is predicted based on the deep semantic features. Since the ConvGRU is a local operation with a relatively small receptive field, the hidden-state vector can be averaged over the spatial dimensions of the image to serve as a global context feature, which is then provided to the ConvGRU as an additional input. For ease of description, the global context feature at the (k+1)-th iteration is denoted h^(k+1). In this way, encoding the target correlation data yields the first encoded feature and encoding the static and overall optical flow yields the second encoded feature, and the analysis result is predicted from these two encoded features, so that the deep feature information of the optical flow data and the correlation data can be extracted separately before prediction, which helps improve the accuracy of the subsequent predictive analysis.
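The following is a rough, illustrative PyTorch sketch of an update block organized along the lines just described: a correlation encoder, an optical flow encoder (here taking the static flow, overall flow and dynamic mask as input, as in the variant described further below), a ConvGRU fed with a spatially averaged global context, and prediction heads. All layer sizes, channel counts (e.g., the assumed 196-channel correlation input) and names are assumptions and do not reproduce the actual dynamic update network 303.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden=128, inp=256):
        super().__init__()
        self.convz = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convr = nn.Conv2d(hidden + inp, hidden, 3, padding=1)
        self.convq = nn.Conv2d(hidden + inp, hidden, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

class DynamicUpdateBlock(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Optical flow encoder: static flow (2) + overall flow (2) + dynamic mask (2) channels.
        self.flow_enc = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(64, 64, 3, padding=1))
        # Correlation encoder: assumed 196 channels of target correlation data.
        self.corr_enc = nn.Sequential(nn.Conv2d(196, 96, 1), nn.ReLU(),
                                      nn.Conv2d(96, 64, 3, padding=1))
        self.gru = ConvGRU(hidden, inp=64 + 64 + hidden)
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)     # optical flow calibration r_s
        self.mask_head = nn.Conv2d(hidden, 2, 3, padding=1)     # mask calibration delta M
        self.weight_head = nn.Conv2d(hidden, 2, 3, padding=1)   # confidence map w

    def forward(self, h, corr_input, flow_input):
        c = self.corr_enc(corr_input)                            # first encoded feature
        f = self.flow_enc(flow_input)                            # second encoded feature
        ctx = h.mean(dim=(2, 3), keepdim=True).expand_as(h)      # global context feature
        h = self.gru(h, torch.cat([c, f, ctx], dim=1))           # deep semantic features
        return h, self.flow_head(h), self.mask_head(h), torch.sigmoid(self.weight_head(h))
```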
In one implementation scenario, continuing to refer to Figure 3b, the dynamic update network 303 may also include a static optical flow convolution layer 3035; processing the aforementioned deep semantic features with the static optical flow convolution layer 3035 yields the optical flow calibration data 3066 of the static optical flow 3063. In one implementation scenario, to improve the accuracy of the pose and depth optimization, the reference data may further include a dynamic mask, which can be used to indicate moving objects in the image. For example, if a pixel in the image belongs to a moving object, the value at the corresponding position in the image's dynamic mask may take a first value; conversely, if the pixel does not belong to a moving object, the value at the corresponding position may take a second value, where the first and second values differ (e.g., the first value may be set to 0 and the second value to 1). In the first iteration, the dynamic mask can be initialized as an all-zero matrix. For ease of description, still taking the first image i and the second image j as W*H two-dimensional images, the dynamic mask can be represented as an H*W two-channel image, i.e., the dynamic mask M_dij ∈ R^{H×W×2}. Referring to Figure 3a or Figure 3b, in contrast to the preceding approach of retrieving the target correlation data by searching the feature correlation data 305 and predicting the analysis result from the target correlation data, the static optical flow 3063 and the overall optical flow 3062, the analysis result can instead be predicted from the target correlation data, the static optical flow 3063, the overall optical flow 3062 and the dynamic mask 3061, and the analysis result may include mask calibration data 3064 for the dynamic mask 3061. In this way, the dynamic mask is consulted during the dynamic update process, and since the dynamic mask indicates moving objects in the image, it can provide guidance for the subsequent optical flow decomposition, which helps improve the accuracy of the optimized pose and depth.
In one implementation scenario, still taking the first image i and the second image j as W*H two-dimensional images, the mask calibration data of the dynamic mask may include a mask calibration value for each mask value in the dynamic masks of the first and second images, so the mask calibration data can also be represented as an H*W two-channel image, i.e., the mask calibration data ΔM_dij ∈ R^{H×W×2}. On this basis, as shown in Figure 3b, the dynamic mask 3061 can be added to the mask calibration data 3064 of the dynamic mask to obtain an updated dynamic mask 3065. Accordingly, the dynamic mask that needs to be input at the i-th iteration may be the updated dynamic mask output at the (i-1)-th iteration.
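As a tiny illustration, the mask update just described amounts to an element-wise addition of the predicted mask calibration data to the current dynamic mask; the tensors below are placeholders.

```python
import torch

H, W = 48, 64
dynamic_mask = torch.zeros(H, W, 2)        # first iteration: all-zero matrix M_dij
mask_calibration = torch.zeros(H, W, 2)    # delta M_dij predicted by the update network

# Updated mask = current mask + mask calibration data; the result is the
# dynamic mask fed into the next iteration.
updated_mask = dynamic_mask + mask_calibration
```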
在一个实施场景中,如图3b所示,为了提升图像分析的效率,可以预先训练一个图像分析模型。可以参阅前述相关描述。与前述相关描述不同的是,对于动态更新网络303中的光流编码器3031而言,可以基于静态光流3063、整体光流3062和动态掩膜3061进行编码,得到第二编码特征。In an implementation scenario, as shown in Figure 3b, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. Please refer to the above related descriptions. Different from the foregoing related descriptions, for the optical flow encoder 3031 in the dynamic update network 303, encoding can be performed based on the static optical flow 3063, the overall optical flow 3062 and the dynamic mask 3061 to obtain the second encoding feature.
在一个实施场景中,如图3b所示,为了提升图像分析的效率,可以预先训练一个图像分析模型,具体可以参阅前述相关描述。与前述相关描述不同的是,动态更新网络303还可以包括卷积层,其可以对ConvGRU输出的深层语义特征进行处理,得到动态掩膜3061的掩膜校准数据3064。In an implementation scenario, as shown in Figure 3b, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. For details, please refer to the relevant description above. Different from the foregoing related descriptions, the dynamic update network 303 can also include a convolution layer, which can process the deep semantic features output by ConvGRU to obtain the mask calibration data 3064 of the dynamic mask 3061.
步骤S13:基于静态光流和光流校准数据,对位姿和深度进行优化,得到更新的位姿和更新的深度。Step S13: Based on the static optical flow and optical flow calibration data, optimize the pose and depth to obtain an updated pose and an updated depth.
在一个实施场景中,分析结果还可以包括置信度图,且置信度图包括图像中各像素点的置信度。仍以第一图像i和第二图像j均为H*W的二维图像为例,置信度图可以表示为H*W的二通道图像,即置信度图w ij∈R H×W×2。在得到静态光流的光流校准数据之后,可以基于光流校准数据对第一投影位置进行校准,得到校准位置。其中,第一投影位置为第一图像中像素点基于静态光流投影在第二图像的像素位置。为了便于描述,第一投影位置可以记为p sij,且如前所述,静态光流的光流校准数据可以记为r sij,则校准位置可以表示为p * sij=r sij+p sij,即可以对于图像中各像素点而言,可以直接将其第一投影位置加上该像素点在光流校准数据中查询到的光流校准值即可。在此基础上,可以基于校准位置,优化得到更新的位姿和更新的深度。示例性地,可以基于校准位置p * sij,构建以更新的位姿和更新的深度作为优化对象的优化函数: In an implementation scenario, the analysis results may also include a confidence map, and the confidence map includes the confidence of each pixel in the image. Still taking the first image i and the second image j as two-dimensional images of H*W as an example, the confidence map can be expressed as a two-channel image of H*W, that is, the confidence map w ij ∈R H×W×2 . After obtaining the optical flow calibration data of the static optical flow, the first projection position can be calibrated based on the optical flow calibration data to obtain the calibration position. Wherein, the first projection position is the pixel position of the pixel in the first image projected on the second image based on static optical flow. For the convenience of description, the first projection position can be recorded as p sij , and as mentioned above, the optical flow calibration data of the static optical flow can be recorded as r sij , then the calibration position can be expressed as p * sij =r sij +p sij , That is, for each pixel in the image, its first projection position can be directly added to the optical flow calibration value queried in the optical flow calibration data for the pixel. On this basis, the updated pose and updated depth can be optimized based on the calibration position. For example, based on the calibration position p * sij , an optimization function with the updated pose and updated depth as the optimization object can be constructed:
$$E(\mathbf{G}',\mathbf{d}')=\sum_{(i,j)\in\varepsilon}\left\|p^{*}_{sij}-\Pi_c\big(\mathbf{G}'_{ij}\circ\Pi_c^{-1}(p_i,d'_i)\big)\right\|^{2}_{\Sigma_{ij}}\quad\text{...Formula (5)};$$

$$\Sigma_{ij}=\operatorname{diag}(w_{ij})\quad\text{...Formula (6)};$$

In the above formulas (5) and (6), diag denotes taking the elements on the main diagonal of the matrix, $\mathbf{G}'_{ij}$ denotes the relative pose between the updated pose of the first image and the updated pose of the second image, and $d'_i$ denotes the updated depth of the first image. In addition, the meanings of the projection function $\Pi_c$ and the back-projection function $\Pi_c^{-1}$ can be found in the foregoing related descriptions. $\|\cdot\|_{\Sigma}$ denotes the Mahalanobis distance, for which reference may be made to the relevant technical details of the Mahalanobis distance. $(i,j)\in\varepsilon$ denotes a first image i and a second image j having a co-visibility relationship.
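The confidence-weighted reprojection cost of formulas (5) and (6) can be sketched as below. This is an illustrative sketch only: the pinhole helpers `project` and `back_project`, the variable names, and the per-pixel weight layout are assumptions, and a real solver would optimize pose and depth jointly (e.g. by Gauss-Newton) rather than merely evaluating the cost.

```python
import numpy as np

def project(points_cam, K):
    """Pinhole projection of 3-D camera-frame points (N, 3) to pixel coordinates (N, 2)."""
    z = points_cam[..., 2:3]
    return (points_cam[..., :2] / z) @ np.diag([K[0, 0], K[1, 1]]) + K[:2, 2]

def back_project(pixels, depth, K):
    """Lift pixels (N, 2) with per-pixel depth (N,) to 3-D points in the camera frame."""
    rays = np.concatenate(
        [(pixels - K[:2, 2]) / np.array([K[0, 0], K[1, 1]]), np.ones_like(pixels[..., :1])], -1)
    return rays * depth[..., None]

def residual_cost(p_star, pixels_i, depth_i, R_ij, t_ij, K, w_ij):
    """Confidence-weighted reprojection error, echoing formulas (5) and (6).

    p_star plays the role of the calibrated position p*_sij = p_sij + r_sij,
    and w_ij (shape (N, 2)) plays the role of the diagonal weights Sigma_ij.
    """
    pts_i = back_project(pixels_i, depth_i, K)
    pts_j = pts_i @ R_ij.T + t_ij          # relative pose applied to the lifted points
    diff = p_star - project(pts_j, K)
    return float(np.sum(w_ij * diff ** 2))

# Tiny usage example with synthetic data.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pixels = np.random.rand(50, 2) * [64, 48]
depth = np.random.rand(50) * 4 + 1
p_star = pixels + np.random.randn(50, 2) * 0.1     # calibrated positions
w = np.random.rand(50, 2)                          # per-pixel confidence weights
R, t = np.eye(3), np.array([0.05, 0.0, 0.0])
print(residual_cost(p_star, pixels, depth, R, t, K, w))
```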
在一个实施场景中,请结合参阅图3b,如前所述,为了提升图像分析的效率,可以预先训练一 个图像分析模型,可以参阅前述相关描述。与前述描述不同的是,动态更新网络303可以包括卷积层,用于对ConvGRU提取得到的深层语义特征进行处理,得到置信度图w ij3034。 In an implementation scenario, please refer to Figure 3b. As mentioned above, in order to improve the efficiency of image analysis, an image analysis model can be pre-trained. Please refer to the relevant description above. Different from the foregoing description, the dynamic update network 303 may include a convolutional layer for processing the deep semantic features extracted by ConvGRU to obtain the confidence map w ij 3034 .
In one implementation scenario, the Gauss-Newton algorithm can be used to solve the optimization and obtain the changes in depth and pose. The Schur complement can be used to first compute the change in pose and then compute the change in depth. The change in depth can be denoted as Δd, and the change in pose as Δξ. On this basis, for the depth, the following formula (7) can be used to obtain the updated depth:
Ξ (k+1)=ΔΞ (k)(k)……公式(7); Ξ (k+1) =ΔΞ (k)(k) ...Formula (7);
上述公式(7)中,Ξ (k)表示输入第k次循环迭代的深度,ΔΞ (k)表示第k次循环迭代输出的深度的变化量,Ξ (k+1)表示输入第k+1次循环迭代的深度,即更新的深度。即对于深度而言,可以直接将深度加上深度的变化量,得到更新的深度。与深度不同的是,可以采用如下方式得到更新的位姿: In the above formula (7), Ξ (k) represents the input depth of the k-th loop iteration, ΔΞ (k) represents the change in depth of the k-th loop iteration output, Ξ (k+1) represents the input k+1 The depth of the loop iteration, that is, the depth of the update. That is, for the depth, the depth can be directly added to the depth change to obtain the updated depth. Different from depth, the updated pose can be obtained in the following ways:
Figure PCTCN2022119646-appb-000011
Figure PCTCN2022119646-appb-000011
上述公式(8)中,G (k)表示输入第k次循环迭代的位姿,G (k+1)表示输入第k+1次循环迭代的位姿,即更新的位姿。也就是说,对于位姿而言,需要基于位姿的变化量在SE3流形对位姿进行拉伸。 In the above formula (8), G (k) represents the input pose of the k-th loop iteration, and G (k+1) represents the input pose of the k+1-th loop iteration, that is, the updated pose. In other words, for the pose, the pose needs to be stretched in the SE3 manifold based on the change in the pose.
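The two update rules of formulas (7) and (8) can be sketched as follows: the depth is updated by simple addition, while the pose increment is applied through an SE(3) exponential map. The exp-map implementation below is a standard textbook form included only for illustration; it is not asserted to be the exact parameterization of the disclosed embodiments.

```python
import numpy as np

def so3_hat(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def se3_exp(xi):
    """Exponential map from a 6-vector (rho, phi) to a 4x4 SE(3) matrix."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    Phi = so3_hat(phi)
    if theta < 1e-8:
        R, V = np.eye(3) + Phi, np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * Phi
             + (1 - np.cos(theta)) / theta ** 2 * Phi @ Phi)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta ** 2 * Phi
             + (theta - np.sin(theta)) / theta ** 3 * Phi @ Phi)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

# Formula (7): the depth is updated by simple addition.
depth = np.full((4, 4), 2.0)
delta_depth = 0.1 * np.ones((4, 4))
depth_updated = depth + delta_depth

# Formula (8): the pose is retracted on the SE(3) manifold with the increment delta_xi.
pose = np.eye(4)                                          # G^(k)
delta_xi = np.array([0.01, 0.0, 0.0, 0.0, 0.002, 0.0])
pose_updated = se3_exp(delta_xi) @ pose                   # G^(k+1)
```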
在一个实施场景中,与前述方式不同的是,参考数据还可以包括动态掩膜,且分析结果还可以包括动态掩膜的掩膜校准数据,其可以参阅前述相关描述。在此基础上,可以基于动态掩膜、掩膜校准数据和置信度图进行融合,得到重要度图,并基于光流校准数据对第一投影位置进行校准,得到校准位置。基于此,再基于校准位置和重要度图,优化得到更新的位姿和更新的深度。上述方式,在位姿和深度的优化过程中,引入用于指示运动对象的动态掩膜,并结合置信度图得到重要度图,以为后续光流分解提供指导,有利于提升优化位姿和深度的精度。In an implementation scenario, different from the above method, the reference data may also include a dynamic mask, and the analysis result may also include mask calibration data of the dynamic mask, for which please refer to the above related description. On this basis, the dynamic mask, mask calibration data and confidence map can be fused to obtain the importance map, and the first projection position can be calibrated based on the optical flow calibration data to obtain the calibration position. Based on this, based on the calibration position and importance map, the updated pose and updated depth are optimized. In the above method, during the optimization process of pose and depth, dynamic masks used to indicate moving objects are introduced, and the importance map is obtained by combining the confidence map to provide guidance for subsequent optical flow decomposition, which is beneficial to improving the optimization of pose and depth. accuracy.
In one implementation scenario, as mentioned above, the optical flow calibration data includes the calibration optical flow of the pixels in the first image; the calibration optical flow of a pixel in the first image can then be added to that pixel's first projection position in the second image to obtain the pixel's calibration position. Reference may be made to the foregoing related descriptions. In this way, by directly predicting the calibration optical flow of the pixels in the first image, the calibration position of a pixel after motion caused only by the camera device can be obtained through a simple addition, which greatly reduces the computational complexity of determining that position and helps improve the efficiency of optimizing the pose and depth.
在一个实施场景中,可以基于掩膜校准数据对动态掩膜进行校准,得到校准掩膜,且校准掩膜包括图像中像素点与运动对象的相关度,相关度与图像中像素点属于运动对象的可能性正相关,即像素点属于运动对象的可能性越高,相关度越大,反之,像素点属于运动对象的可能性越低,相关度越小。在此基础上,可以基于置信度图和校准掩膜进行融合,得到重要度图。其可以将置信度图和校准掩膜进行加权,并进行归一化处理,得到重要度图。其中,第一图像i和第二图像j的重要度图w dij可以表示为: In an implementation scenario, the dynamic mask can be calibrated based on the mask calibration data to obtain a calibration mask, and the calibration mask includes the correlation between the pixels in the image and the moving objects, and the correlation is related to the correlation between the pixels in the image belonging to the moving objects. The possibility is positively correlated, that is, the higher the possibility that a pixel belongs to a moving object, the greater the correlation. On the contrary, the lower the possibility that a pixel belongs to a moving object, the smaller the correlation. On this basis, the importance map can be obtained by fusion based on the confidence map and the calibration mask. It can weight and normalize the confidence map and calibration mask to obtain the importance map. Among them, the importance map w dij of the first image i and the second image j can be expressed as:
$$w_{dij}=\mathrm{sigmoid}\big(w_{ij}+(1-M_{dij})\cdot\eta\big)\quad\text{...Formula (9)};$$
In the above formula (9), sigmoid denotes the normalization function, and $M_{dij}$ denotes the updated dynamic mask, obtained by adding the mask calibration data $\Delta M_{dij}$ to the dynamic mask $M_{dij}$; that is, the updated dynamic mask can be obtained by analogy with the above formula (7), except that in this case $\Xi^{(k)}$ in formula (7) denotes the dynamic mask input to the k-th loop iteration, $\Delta\Xi^{(k)}$ denotes the mask calibration data output by the k-th loop iteration, and $\Xi^{(k+1)}$ denotes the dynamic mask input to the (k+1)-th loop iteration, i.e. the updated dynamic mask. In addition, $1-M_{dij}$ denotes the calibration mask, $w_{ij}$ denotes the confidence map, and $\eta$ denotes a weighting coefficient, which may for example be set to 10 or 20 and is not limited here. In this way, the importance of a pixel is measured jointly from the confidence of the pixel itself and the correlation between the pixel and moving objects, which helps improve the accuracy of the subsequent optimization of pose and depth.
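The fusion of formula (9) can be sketched as follows; the sigmoid and the weighting coefficient follow the text above, while the array shapes and random inputs are assumptions used only for illustration.

```python
import numpy as np

def importance_map(confidence, dynamic_mask, eta=10.0):
    """Fuse the confidence map with the calibration mask, as in formula (9).

    (1 - dynamic_mask) acts as the calibration mask: it is large for pixels
    likely to belong to moving objects, and eta controls its weight.
    """
    return 1.0 / (1.0 + np.exp(-(confidence + (1.0 - dynamic_mask) * eta)))

H, W = 48, 64
w_ij = np.random.randn(H, W, 2).astype(np.float32)    # confidence map
M_dij = np.random.rand(H, W, 2).astype(np.float32)    # updated dynamic mask
w_dij = importance_map(w_ij, M_dij, eta=10.0)
```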
在一个实施场景中,在得到校准位置和重要度图之后,可以参照上述公式(5)和公式(6)所提供的实施方式,构建优化函数,在此基础上,即可求解得到更新的深度和更新的位姿。需要说明的是,重要度图去除了对运动对象的抑制,增加了优化函数中可用像素点的数量。此外,置信度图可以负责去除一些其他影响计算的像素点,如光照效果不好等其他原因的像素点。In an implementation scenario, after obtaining the calibration position and importance map, the optimization function can be constructed by referring to the implementation provided by the above formula (5) and formula (6). On this basis, the updated depth can be obtained by solving and updated poses. It should be noted that the importance map removes the suppression of moving objects and increases the number of pixels available in the optimization function. In addition, the confidence map can be responsible for removing some other pixels that affect the calculation, such as pixels caused by poor lighting effects and other reasons.
In one implementation scenario, after the updated pose and the updated depth are obtained, a new round of loop iteration can be started. Referring to Figures 3a and 3b, the analysis result may further include a dynamic optical flow 3081, and the dynamic optical flow is caused by the motion of the photographed object. On this basis, an updated static optical flow 3082 can be obtained based on the updated pose 3072 and the updated depth 3073; an updated overall optical flow 3071 can be obtained based on the dynamic optical flow 3081 and the updated static optical flow 3082; updated optical flow data can be obtained based on the updated static optical flow 3082 and the updated overall optical flow 3071; and updated reference data can be obtained based on the updated pose 3072 and the updated depth 3073. The foregoing step of predicting the analysis result based on the image sequence and the optical flow data, together with the subsequent steps, can then be re-executed until the number of re-executions satisfies a preset condition. In this way, during image analysis, the overall optical flow is decomposed into a static optical flow and a dynamic optical flow, and the optimization steps are iterated multiple times to overcome the limited effect of a single optimization; using the old variables as input to guide the generation of the new variables makes the input features more diverse, which helps improve the accuracy of the pose and depth.
在一个实施场景中,可以基于更新的位姿、更新的深度和第一图像中像素点的像素位置进行投影,得到第一图像中像素点投影在第二图像的第二投影位置,并基于第一图像中像素点投影在第二图像的第二投影位置和第一图像中像素点在第二图像中的对应位置之间的差异,得到更新的静态光流,且对应位置为在摄像器件未运动的情况下,第一图像中像素点所属的空间点投影在第二图像的像素位置。其可以参阅前述公式(3)及其相关描述。上述方式,在循环迭代过程中,通过更新的位姿和更新的深度重新投影,并在摄像器件未运动的假设前提下,确定出第一图像中像素点所属的空间点投影在第二图像的像素位置,从而结合重新投影位置确定出更新的静态光流,有利于提升更新的静态光流的准确性。In an implementation scenario, projection can be performed based on the updated pose, the updated depth and the pixel position of the pixel in the first image, to obtain the second projection position of the pixel in the first image projected on the second image, and based on the The difference between the second projection position of the pixels in one image in the second image and the corresponding position of the pixels in the first image in the second image is used to obtain an updated static optical flow, and the corresponding position is the position of the pixel that is not in the camera device. In the case of motion, the spatial point to which the pixel point in the first image belongs is projected on the pixel position of the second image. You can refer to the aforementioned formula (3) and its related descriptions. In the above method, during the loop iteration process, through the updated pose and updated depth re-projection, and under the assumption that the camera device does not move, it is determined that the spatial point to which the pixel point in the first image belongs is projected on the second image. The pixel position is thus combined with the reprojection position to determine the updated static optical flow, which is beneficial to improving the accuracy of the updated static optical flow.
在一个实施场景中,在更新的静态光流之后,可以直接将分析结果中预测得到的动态光流加上更新的静态光流,得到更新的整体光流,即:In an implementation scenario, after the updated static optical flow, the dynamic optical flow predicted in the analysis results can be directly added to the updated static optical flow to obtain the updated overall optical flow, that is:
$$F_{ot}=F_{st}+F_{dt}\quad\text{...Formula (10)};$$
上述公式(10)中,F st表示更新的静态光流,F dt表示分析结果中预测得到的动态光流,F ot表示更新的整体光流。上述方式,将预测得到的动态光流和更新的静态光流相加,即可得到更新的整体光流,即通过简单加法运算即可确定出更新的整体光流,有利于提升优化位姿和深度的效率。 In the above formula (10), F st represents the updated static optical flow, F dt represents the dynamic optical flow predicted in the analysis results, and F ot represents the updated overall optical flow. In the above method, the updated overall optical flow can be obtained by adding the predicted dynamic optical flow and the updated static optical flow. That is, the updated overall optical flow can be determined through a simple addition operation, which is beneficial to improving the optimization pose and posture. Deep efficiency.
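The assembly of the next iteration's flow inputs can be sketched as below: the updated static flow is the reprojected position (obtained with the updated pose and depth) minus the correspondence under the no-camera-motion assumption, and the overall flow of formula (10) is its sum with the predicted dynamic flow. The grid construction and the stand-in reprojection are assumptions for illustration.

```python
import numpy as np

def updated_static_flow(p_proj_updated, p_static_corr):
    """Static flow after the update: reprojected positions (using the updated pose
    and depth) minus the correspondence assuming the camera device did not move."""
    return p_proj_updated - p_static_corr

def updated_overall_flow(static_flow, dynamic_flow):
    """Formula (10): F_ot = F_st + F_dt."""
    return static_flow + dynamic_flow

H, W = 48, 64
p_grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1).astype(np.float32)
p_proj = p_grid + np.random.randn(H, W, 2) * 0.5   # stand-in for the reprojection
F_st = updated_static_flow(p_proj, p_grid)          # correspondence = same pixel if camera is still
F_dt = np.random.randn(H, W, 2).astype(np.float32)  # predicted dynamic flow
F_ot = updated_overall_flow(F_st, F_dt)
```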
在一个实施场景中,预设条件可以设置为包括:重新执行的次数不小于预设阈值(如,9、10等),从而可以通过多次循环迭代,不断优化位姿和深度,提升位姿和深度的精度。In an implementation scenario, the preset conditions can be set to include: the number of re-executions is not less than the preset threshold (such as 9, 10, etc.), so that the pose and depth can be continuously optimized and the pose can be continuously optimized through multiple loop iterations. and depth accuracy.
在一个实施场景中,请参阅图4a和图4b,图4a是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹一实施例的对比示意图,图4b是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹另一实施例的对比示意图。如图4a所示为在自动驾驶场景下的计算机视觉算法评测数据集(KITTI数据集)中图像序列09的测试结果,图4b所示为KITTI数据集中图像序列10的测试结果。其中,图像序列09和图像序列10中均包含运动对象,属于难度较大的动态场景,且虚线表示摄像器件在拍摄过程中的实际轨迹,深色线条表示现有技术确定轨迹,浅色线条表示通过本公开图像分析方法确定轨迹。如图4a所示,在动态场景下,本公开图像分析方法其精度几乎是现有技术的两倍,且在KITTI数据集图像序列10的测试场景中,本公开图像分析方法确定轨迹几乎与实际轨迹重合。In an implementation scenario, please refer to Figure 4a and Figure 4b. Figure 4a is a schematic diagram comparing the trajectory determined by the image analysis method of the present disclosure with the actual trajectory and the trajectory determined by the prior art. Figure 4b is the trajectory determined by the image analysis method of the disclosure. A schematic diagram comparing another embodiment of the actual trajectory and the trajectory determined by the prior art. Figure 4a shows the test results of image sequence 09 in the computer vision algorithm evaluation data set (KITTI data set) in the autonomous driving scenario, and Figure 4b shows the test results of image sequence 10 in the KITTI data set. Among them, both image sequence 09 and image sequence 10 contain moving objects, which are difficult dynamic scenes, and the dotted line represents the actual trajectory of the camera device during the shooting process, the dark line represents the trajectory determined by the existing technology, and the light line represents Trajectories are determined by the disclosed image analysis method. As shown in Figure 4a, in dynamic scenarios, the accuracy of the disclosed image analysis method is almost twice that of the existing technology, and in the test scene of the KITTI data set image sequence 10, the trajectory determined by the disclosed image analysis method is almost the same as the actual one. The trajectories coincide.
在一个实施场景中,请参阅图5a、图5b和图5c,图5a是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图,图5b是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图,图5c是本公开图像分析方法确定轨迹与实际轨迹、现有技术确定轨迹又一实施例的对比示意图。如图5a所示为KITTI数据集中图像序列01的测试结果,图5b所示为KITTI数据集中图像序列02的测试结果,图5c所示为KITTI数据集中图像序列20的测试结果。其中,图像序列01、图像序列02和图像序列20中均包含运动对象,属于难度较大的动态场景,其中,虚线表示摄像器件在拍摄过程中的实际轨迹,深色线条表示现有技术确定轨迹,浅色线条表示通过本公开图像分析方法确定轨迹。如图5a和图5c所示,在KITTI数据集中图像序列01和图像序列20的测试场景中,本公开图像分析方法确定轨迹、现有技术确定轨迹均与实际轨迹保持较为一致的趋势,但本公开图像分析方法确定轨迹更接近于实际轨迹;与此同时,如图5b所示,在KITTI数据集中图像序列02的测试场景中,本公开图像分析方法确定轨迹与实际轨迹保持较为一致的趋势,而现有技术确定轨迹与实际轨迹已经难以保持较为一致的趋势,在多处出现严重失真。In an implementation scenario, please refer to Figure 5a, Figure 5b and Figure 5c. Figure 5a is a schematic diagram comparing the trajectory determined by the image analysis method of the present disclosure with the actual trajectory and the trajectory determined by the prior art in another embodiment. Figure 5b is an image of the disclosure. Figure 5c is a schematic diagram comparing the trajectory determined by the image analysis method and the actual trajectory and the trajectory determined by the prior art according to another embodiment of the disclosure. Figure 5a shows the test results of image sequence 01 in the KITTI data set, Figure 5b shows the test results of image sequence 02 in the KITTI data set, and Figure 5c shows the test results of image sequence 20 in the KITTI data set. Among them, image sequence 01, image sequence 02 and image sequence 20 all contain moving objects, which are difficult dynamic scenes. Among them, the dotted line represents the actual trajectory of the camera device during the shooting process, and the dark line represents the trajectory determined by the existing technology. , the light-colored lines represent the trajectories determined by the image analysis method of the present disclosure. As shown in Figure 5a and Figure 5c, in the test scenarios of image sequence 01 and image sequence 20 in the KITTI data set, the trajectory determined by the image analysis method of the present disclosure and the trajectory determined by the prior art maintain a relatively consistent trend with the actual trajectory, but this method The trajectory determined by the public image analysis method is closer to the actual trajectory; at the same time, as shown in Figure 5b, in the test scene of image sequence 02 in the KITTI data set, the trajectory determined by the disclosed image analysis method maintains a relatively consistent trend with the actual trajectory. However, it is difficult to maintain a consistent trend between the trajectory determined by the existing technology and the actual trajectory, and serious distortion occurs in many places.
本公开实施例可以应用于SLAM系统前端,以实时确定图像的位姿和深度,或者也可以应用于SLAM系统后端,以对各图像的位姿和深度进行全局优化。其中,SLAM系统可以包含前端线程和后端线程,两者可以同时运行。其中,前端线程的任务是接收新的图像,并选择关键帧,在此基础上,通过本公开实施例获取关键帧的位姿、深度等变量结果,后端线程的任务是在全局范围内通过本公开实施例对各关键帧的位姿、深度等变量结果进行全局优化,从而在此基础上,可以构建出摄像器件所扫描环境的三维地图。The disclosed embodiments can be applied to the front end of the SLAM system to determine the pose and depth of the image in real time, or can also be applied to the back end of the SLAM system to globally optimize the pose and depth of each image. Among them, the SLAM system can include front-end threads and back-end threads, both of which can run at the same time. Among them, the task of the front-end thread is to receive new images and select key frames. On this basis, obtain the pose, depth and other variable results of the key frames through the embodiment of the present disclosure. The task of the back-end thread is to globally pass Embodiments of the present disclosure perform global optimization on variable results such as pose and depth of each key frame, so that on this basis, a three-dimensional map of the environment scanned by the camera device can be constructed.
在一个实施场景中,本公开实施例SLAM系统在初始化时,会不断收集图像,直到收集M(如,12等)帧为止。其中,本公开实施例SLAM系统仅在当前帧估计的平均静态光流大于第一数值(如,16等)像素时才留存当前帧。一旦累积收集M帧,SLAM系统则会在这些帧之间创建边来初始化因 子图304(如图3a所示)。因子图304中节点表示各帧图像,创建边的节点所对应的图像是时间上差距应在第二数值(如,3等)个时间步之内。之后,SLAM系统会采用本公开图像分析方法对图像序列中图像的位姿和深度进行动态更新。In an implementation scenario, during initialization, the SLAM system according to the embodiment of the present disclosure will continuously collect images until M (eg, 12, etc.) frames are collected. Among them, the SLAM system of the embodiment of the present disclosure only retains the current frame when the estimated average static optical flow of the current frame is greater than a first numerical value (eg, 16, etc.) pixels. Once M frames are accumulated, the SLAM system creates edges between these frames to initialize the factor graph 304 (as shown in Figure 3a). The nodes in the factor graph 304 represent each frame image, and the time difference between the images corresponding to the nodes that create edges should be within a second numerical value (eg, 3, etc.) time steps. Afterwards, the SLAM system will use the disclosed image analysis method to dynamically update the pose and depth of the images in the image sequence.
在一个实施场景中,本公开实施例SLAM系统前端可以直接处理传入的图像序列,其在相互可见的关键帧之间维护一组关键帧和一个存储边的因子图。关键帧的位姿和深度会不断进行优化。当新的一帧输入时,SLAM系统会提取其特征图,然后使用L(如,3等)帧最邻近的帧构建因子图。如前所述,衡量帧间距离的方式可以为帧间的平均静态光流。新的输入帧对应的位姿可以由线性运动模型赋予初值。随后,SLAM系统通过几次循环迭代以优化关键帧对应的位姿和深度。其中,可以固定前两帧对应的位姿以消除尺度不确定性。处理完新帧之后,可以基于静态光流的距离来删除冗余帧。如果没有合适的帧可以删除,SLAM系统可以删除最旧的关键帧。In one implementation scenario, the front end of the SLAM system of the embodiment of the present disclosure can directly process the incoming image sequence, and it maintains a set of key frames and a factor graph that stores edges between mutually visible key frames. The pose and depth of keyframes are continuously optimized. When a new frame is input, the SLAM system extracts its feature map and then uses the nearest neighbor frames of L (e.g., 3, etc.) frames to construct a factor map. As mentioned before, the distance between frames can be measured as the average static optical flow between frames. The pose corresponding to the new input frame can be given an initial value by the linear motion model. Subsequently, the SLAM system iterates through several loops to optimize the pose and depth corresponding to the key frame. Among them, the poses corresponding to the first two frames can be fixed to eliminate scale uncertainty. After processing new frames, redundant frames can be deleted based on distance from static optical flow. If there are no suitable frames to delete, the SLAM system can delete the oldest keyframes.
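The front-end policy described above can be sketched as a toy selection routine. The distance function, the threshold of 16 pixels, the window of 3 nearest neighbours, and the synthetic "trajectory" below are illustrative assumptions; they only show the shape of the keyframe-retention and edge-building logic, not the front end of the disclosed embodiments.

```python
import numpy as np

def select_keyframes_and_edges(frames, flow_distance, keep_thresh=16.0, num_neighbours=3):
    """Keep a frame only if its mean static flow w.r.t. the last kept keyframe
    exceeds keep_thresh, and connect each new keyframe to its num_neighbours
    closest keyframes (closeness measured by mean static flow).

    flow_distance(a, b) returns the mean static-flow magnitude (in pixels)
    between frames a and b; it is supplied by the caller.
    """
    keyframes, edges = [], []
    for frame in frames:
        if keyframes and flow_distance(keyframes[-1], frame) <= keep_thresh:
            continue  # too little camera motion: skip this frame
        dists = sorted((flow_distance(kf, frame), i) for i, kf in enumerate(keyframes))
        for _, i in dists[:num_neighbours]:
            edges.append((i, len(keyframes)))
        keyframes.append(frame)
    return keyframes, edges

# Usage with a synthetic 1-D "trajectory": the flow distance is mocked as the
# absolute difference of camera positions, purely for illustration.
positions = np.cumsum(np.random.rand(30) * 12)
kf, edges = select_keyframes_and_edges(list(positions), lambda a, b: abs(a - b))
print(len(kf), edges[:5])
```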
在一个实施场景中,本公开实施例SLAM系统后端可以在所有关键帧的集合上进行全局优化。可以使用各关键帧之间的平均静态光流作为帧间的距离,生成一个帧间距离矩阵以方便查找。在每次循环迭代过程中,可以根据距离矩阵重建因子图。示例性地,可以首先选择在时间上相邻的帧所组成的边,加入到因子图中;然后根据距离矩阵选择新边,距离越小越优先考虑。除此之外,当两条边所对应的帧的索引相邻过近时,可以将这些边对应的帧间距加大,从而抑制这些边的效果;最后可以使用本公开实施例对因子图中所有边进行优化,以更新所有帧的位姿和深度。In an implementation scenario, the backend of the SLAM system according to the embodiment of the present disclosure can perform global optimization on a set of all key frames. The average static optical flow between key frames can be used as the distance between frames to generate an inter-frame distance matrix for easy search. During each loop iteration, the factor graph can be reconstructed based on the distance matrix. For example, you can first select edges composed of temporally adjacent frames and add them to the factor graph; then select new edges based on the distance matrix, with smaller distances being given priority. In addition, when the indexes of the frames corresponding to two edges are too close to each other, the frame spacing corresponding to these edges can be increased to suppress the effect of these edges; finally, the embodiments of the present disclosure can be used to modify the factor graph All edges are optimized to update pose and depth for all frames.
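The back-end edge selection can be sketched as below: temporally adjacent edges are added first, and remaining edges are taken in order of increasing inter-frame distance, while candidates whose endpoints nearly coincide with an already selected edge are skipped, approximating the distance-inflation suppression described above. The distance matrix, thresholds and skipping rule are assumptions for illustration.

```python
import numpy as np

def rebuild_factor_graph(dist, max_edges=40, min_index_gap=2):
    """Select factor-graph edges from a symmetric inter-frame distance matrix,
    where dist[i, j] stands for the mean static flow between keyframes i and j."""
    n = dist.shape[0]
    edges = [(i, i + 1) for i in range(n - 1)]  # temporally adjacent edges first
    candidates = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 2, n))
    for _, i, j in candidates:
        if len(edges) >= max_edges:
            break
        # Skip candidates too close (in frame index) to an edge already kept.
        if any(abs(i - a) < min_index_gap and abs(j - b) < min_index_gap for a, b in edges):
            continue
        edges.append((i, j))
    return edges

rng = np.random.default_rng(0)
d = rng.random((10, 10))
d = (d + d.T) / 2
print(rebuild_factor_graph(d))
```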
在一个实施场景中,请结合参阅图5d,图5d是本公开图像分析方法应用于各种数据集的地图重建示意图。如图5d所示,在诸如KITTI、Virtual KITTI2(即VKITTI2)等存在运动物体的自动驾驶场景的数据集,以及诸如EuRoc等存在剧烈运动且具有显著光照变化的无人机场景的数据集,以及诸如TUM RGB-D等存在运动模糊且剧烈旋转的手持式SLAM的数据集,本公开实施例在上述数据集上均可以得到较好的推广应用。In an implementation scenario, please refer to Figure 5d, which is a schematic diagram of map reconstruction of the image analysis method of the present disclosure applied to various data sets. As shown in Figure 5d, in the data sets of autonomous driving scenes with moving objects such as KITTI and Virtual KITTI2 (i.e. VKITTI2), as well as the data sets of UAV scenes with violent motion and significant illumination changes such as EuRoc, and For handheld SLAM data sets with motion blur and violent rotation, such as TUM RGB-D, the embodiments of the present disclosure can be well promoted and applied on the above data sets.
此外,本公开实施例除了应用于SLAM系统,还可以应用于运动分割任务,即分割出图像中的运动对象,且本公开实施例具有显著分割效果。其中,在执行运动分割任务过程中,只需为运动设置一个阈值,并将大于该阈值的动态场的像素点可视化,即可获得运动分割结果。请结合参阅图5e,图5e是本公开图像分析方法应用于运动分割任务的示意图。如图5e所示,左侧一列表示真实动态掩膜,右侧一列表示预测动态掩膜,由图5e可见,由本公开实施例预测出来的动态掩膜已经十分接近于真实的动态掩膜,即本公开实施例在运动对象分割任务上可以取得显著效果。同时本公开实施例还可以在应用于AR,请结合参阅图5f,图5f是本公开图像分析方法与现有技术分别应用于AR的对比示意图。如图5f所示,右下角表示摄像器件拍摄得到的原始图像501,左上角表示在原始图像中添加虚拟物体(如虚线框所含的树)的期望效果502,右上角表示本公开实施例在原始图像中添加虚拟物体的效果示意503,左下角表示现有技术在原始图像中添加虚拟物体的效果示意504。显然,与现有技术相比,本公开在运动场景中通过精准定位所添加虚拟物体之后的效果示意与期望效果较为接近,而通过现有技术在原始图像中添加虚拟物体产生了严重漂移。In addition, in addition to being applied to SLAM systems, the embodiments of the present disclosure can also be applied to motion segmentation tasks, that is, to segment moving objects in images, and the embodiments of the present disclosure have significant segmentation effects. Among them, during the execution of the motion segmentation task, you only need to set a threshold for the motion and visualize the pixels of the dynamic field greater than the threshold to obtain the motion segmentation result. Please refer to Figure 5e, which is a schematic diagram of the image analysis method of the present disclosure applied to the motion segmentation task. As shown in Figure 5e, the column on the left represents the real dynamic mask, and the column on the right represents the predicted dynamic mask. As can be seen from Figure 5e, the dynamic mask predicted by the embodiment of the present disclosure is very close to the real dynamic mask, that is, Embodiments of the present disclosure can achieve remarkable results in moving object segmentation tasks. At the same time, embodiments of the present disclosure can also be applied to AR. Please refer to FIG. 5f. FIG. 5f is a schematic comparison diagram of the image analysis method of the present disclosure and the prior art applied to AR respectively. As shown in Figure 5f, the lower right corner represents the original image 501 captured by the camera device, the upper left corner represents the desired effect 502 of adding a virtual object (such as the tree contained in the dotted box) in the original image, and the upper right corner represents the embodiment of the present disclosure. The effect of adding a virtual object to the original image is shown 503. The lower left corner shows the effect of adding a virtual object to the original image using the prior art 504. Obviously, compared with the existing technology, the effect of the present disclosure after adding virtual objects through precise positioning in a sports scene is closer to the expected effect. However, adding virtual objects to the original image through the existing technology produces serious drift.
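The motion-segmentation use described above only requires thresholding the dynamic field; a minimal sketch follows, where the threshold value and array shapes are illustrative assumptions.

```python
import numpy as np

def motion_segmentation(dynamic_flow, threshold=1.0):
    """Mark as 'moving' every pixel whose dynamic-flow magnitude exceeds the threshold."""
    magnitude = np.linalg.norm(dynamic_flow, axis=-1)
    return magnitude > threshold

H, W = 48, 64
F_d = np.random.randn(H, W, 2).astype(np.float32) * 0.8
moving_mask = motion_segmentation(F_d, threshold=1.0)
print(moving_mask.mean())  # fraction of pixels segmented as moving
```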
本公开实施例通过光流分解实现即使在运动场景中也能够精准定位,并可以广泛应用于诸如上述SLAM系统、运动分割任务、场景编辑(如图5f所示的AR应用)等。Embodiments of the present disclosure achieve precise positioning even in moving scenes through optical flow decomposition, and can be widely used in such things as the above-mentioned SLAM system, motion segmentation tasks, scene editing (AR applications as shown in Figure 5f), etc.
上述方案,通过模仿人类感知现实世界的方式,将整体光流视为由摄像器件运动和拍摄对象运动共同引起,并在图像分析过程中,参考整体光流和由摄像器件运动引起的静态光流,预测出静态光流的光流校准数据,从而能够在后续位姿和深度优化过程中,结合静态光流及其光流校准数据尽可能地降低拍摄对象运动所导致的影响,进而能够提升位姿和深度的精度。The above scheme, by imitating the way humans perceive the real world, regards the overall optical flow as caused by the movement of the camera device and the movement of the photographed object, and during the image analysis process, reference is made to the overall optical flow and the static optical flow caused by the movement of the camera device , predict the optical flow calibration data of the static optical flow, so that in the subsequent pose and depth optimization process, the static optical flow and its optical flow calibration data can be combined to reduce the impact caused by the motion of the subject as much as possible, thereby improving the position. pose and depth accuracy.
请参阅图6,图6是本公开图像分析模型的训练方法一实施例的流程示意图。可以包括如下步骤:Please refer to FIG. 6 , which is a schematic flow chart of an embodiment of the training method of the image analysis model of the present disclosure. May include the following steps:
步骤S61:获取样本图像序列、样本光流数据和样本图像序列中各个样本图像的样本参考数据。Step S61: Obtain the sample image sequence, sample optical flow data, and sample reference data of each sample image in the sample image sequence.
本公开实施例中,各个样本图像包括具有共视关系的第一样本图像和第二样本图像,样本光流数据包括第一样本图像与第二样本图像之间的样本静态光流和样本整体光流,样本静态光流由摄像器件运动引起,样本整体光流由摄像器件运动和拍摄对象运动共同引起,且样本参考数据包括样本位姿和样本深度。其可参阅上述关于“获取图像序列、光流数据和图像序列中各个图像的参考数据”的描述。In the embodiment of the present disclosure, each sample image includes a first sample image and a second sample image that have a common view relationship, and the sample optical flow data includes the sample static optical flow and sample between the first sample image and the second sample image. Overall optical flow, the static optical flow of the sample is caused by the motion of the camera device, the overall optical flow of the sample is caused by the motion of the camera device and the motion of the photographed object, and the sample reference data includes the sample pose and sample depth. For this, please refer to the above description about "obtaining the image sequence, optical flow data and reference data of each image in the image sequence".
步骤S62:基于图像分析模型对样本图像序列和样本光流数据进行分析预测,得到样本分析结果。Step S62: Analyze and predict the sample image sequence and sample optical flow data based on the image analysis model to obtain sample analysis results.
本公开实施例中,样本分析结果包括样本静态光流的样本光流校准数据。其可以参阅前述公开实施例中关于“基于图像序列和光流数据,预测得到分析结果”相关描述。In the embodiment of the present disclosure, the sample analysis results include sample optical flow calibration data of the sample static optical flow. For details, please refer to the relevant description of "predicting the analysis results based on the image sequence and optical flow data" in the aforementioned disclosed embodiments.
步骤S63:基于样本静态光流和样本光流校准数据,对样本位姿和样本深度进行优化,得到更新的样本位姿和更新的样本深度。Step S63: Based on the sample static optical flow and the sample optical flow calibration data, optimize the sample pose and sample depth to obtain an updated sample pose and an updated sample depth.
这里可以参阅前述公开实施例中关于“基于静态光流和光流校准数据,对位姿和深度进行优化, 得到更新的位姿和更新的深度”相关描述。Here, please refer to the relevant description of "optimizing the pose and depth based on the static optical flow and optical flow calibration data to obtain an updated pose and an updated depth" in the aforementioned disclosed embodiments.
步骤S64:基于更新的样本位姿和更新的样本深度进行损失度量,得到图像分析模型的预测损失。Step S64: Perform loss measurement based on the updated sample pose and updated sample depth to obtain the predicted loss of the image analysis model.
在一个实施场景中,与前述公开实施例类似地,样本参考数据还可以包括样本动态掩膜,样本动态掩膜用于指示样本图像中的运动对象,样本分析结果还包括样本动态光流和样本动态掩膜的样本掩膜校准数据,且样本动态光流由拍摄对象运动引起,预测损失可以包括掩膜预测损失。为了便于描述,掩膜预测损失可以记为L art_mask。此外,关于样本动态掩膜、样本动态光流、样本掩膜校准数据的具体含义,可以分别参阅前述公开实施例中关于动态掩膜、动态光流、掩膜校准数据的相关描述。在此基础上,可以先基于样本动态光流、更新的样本位姿和更新的样本深度,得到更新的样本整体光流。基于此,一方面可以基于样本掩膜校准数据和样本动态掩膜,得到样本动态掩膜在模型维度更新得到的第一预测掩膜,另一方面可以基于更新的样本整体光流、更新的样本位姿和更新的样本深度,得到样本动态掩膜在光流维度更新得到的第二预测掩膜,从而可以基于第一预测掩膜和第二预测掩膜之间的差异,得到掩膜预测损失。上述方式,在训练过程中不具备真实动态掩膜的情况下,也能够通过更新的样本整体光流、更新的样本位姿和更新的样本深度构造出动态掩膜标签,以实现自监督训练,有利于在提升模型性能的前提下,降低训练过程对样本标注的要求。 In one implementation scenario, similar to the aforementioned disclosed embodiments, the sample reference data may also include a sample dynamic mask, which is used to indicate moving objects in the sample image, and the sample analysis results also include sample dynamic optical flow and sample Sample mask calibration data of the dynamic mask, and the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss may include a mask prediction loss. For ease of description, the mask prediction loss can be denoted as L art_mask . In addition, regarding the specific meanings of sample dynamic mask, sample dynamic optical flow, and sample mask calibration data, please refer to the relevant descriptions of dynamic mask, dynamic optical flow, and mask calibration data in the aforementioned disclosed embodiments respectively. On this basis, the updated overall optical flow of the sample can be obtained based on the sample dynamic optical flow, updated sample pose and updated sample depth. Based on this, on the one hand, the first prediction mask obtained by updating the sample dynamic mask in the model dimension can be obtained based on the sample mask calibration data and the sample dynamic mask. On the other hand, it can be based on the updated sample overall optical flow and updated sample pose and the updated sample depth, the second prediction mask obtained by updating the sample dynamic mask in the optical flow dimension is obtained, so that the mask prediction loss can be obtained based on the difference between the first prediction mask and the second prediction mask. . In the above method, even if a real dynamic mask is not available during the training process, dynamic mask labels can be constructed through the updated overall optical flow of the sample, the updated sample pose, and the updated sample depth to achieve self-supervised training. It is conducive to reducing the requirements for sample annotation during the training process on the premise of improving model performance.
在一个实施场景中,获取更新的整体光流类似地,其过程可参阅上述中“基于更新的位姿和更新的深度,获取更新的静态光流,并基于动态光流和更新的静态光流,得到更新的整体光流”相关描述。In an implementation scenario, obtaining the updated overall optical flow is similar. The process can be referred to the above "Based on the updated pose and updated depth, obtaining the updated static optical flow, and based on the dynamic optical flow and the updated static optical flow." , get the updated overall optical flow" related description.
在一个实施场景中,获取更新的动态掩膜类似地,其可以参阅前述公开实施例中关于获取更新的动态掩膜的相关描述。为了便于描述,可以将第一预测掩膜记为M diIn an implementation scenario, obtaining an updated dynamic mask is similar. For reference, please refer to the related descriptions about obtaining an updated dynamic mask in the aforementioned disclosed embodiments. For convenience of description, the first prediction mask can be denoted as M di .
在一个实施场景中,对于第二预测掩膜而言,其可以基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置p camIn an implementation scenario, for the second prediction mask, it can be projected based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel point in the first sample image to obtain the first sample The sample pixel point in the image is projected at the first sample projection position p cam of the second sample image:
$$p_{cam}=\Pi_c\big(\mathbf{G}_{ij}\circ\Pi_c^{-1}(p_i,d'_i)\big)\quad\text{...Formula (11)};$$

In the above formula (11), $\mathbf{G}_{ij}$ denotes the relative pose between the updated pose of the first sample image and the updated pose of the second sample image, which can be obtained as described in the foregoing disclosed embodiments for the relative pose of the first image and the second image; $p_i$ denotes the sample pixel position of a sample pixel in the first sample image, and $d'_i$ denotes the updated depth of the sample pixel in the first sample image. In addition, the specific meanings of the projection function $\Pi_c$, the back-projection function $\Pi_c^{-1}$ and the operator $\circ$ can be found in the related descriptions of the foregoing disclosed embodiments. Meanwhile, projection can be performed based on the updated sample overall optical flow and the sample pixel position of the sample pixel in the first sample image, to obtain the second sample projection position $p_{flow}$ at which the sample pixel in the first sample image is projected onto the second sample image:

$$p_{flow}=p_i+F_{oij}\quad\text{...Formula (12)};$$
In the above formula (12), $F_{oij}$ denotes the updated sample overall optical flow. That is, the sample overall optical flow value corresponding to a sample pixel can be looked up directly in the updated sample overall optical flow and added to the sample pixel position of that sample pixel to obtain the second sample projection position. On this basis, the second prediction mask can be obtained based on the difference between the first sample projection position and the second sample projection position, so that sample pixels belonging to moving objects are identified from the discrepancy between the position projected using the pose and depth and the position projected using the overall optical flow, which helps improve the accuracy of the constructed dynamic mask label. Exemplarily, the sample mask value of a sample pixel can be obtained by comparing the distance between the first sample projection position and the second sample projection position against a preset threshold, where the sample mask value indicates whether the sample pixel belongs to a moving object. For example, when the distance between the first sample projection position and the second sample projection position is greater than the preset threshold, the sample pixel can be considered to belong to a moving object, and its sample mask value is determined as a first value (e.g., 0); conversely, when the distance is not greater than the preset threshold, the sample pixel can be considered not to belong to a moving object, and its sample mask value is determined as a second value (e.g., 1). On this basis, the second prediction mask $\widetilde{M}_{di}$ can be obtained from the sample mask values of the individual sample pixels:

$$\widetilde{M}_{di}(p_i)=\begin{cases}0, & \|p_{cam}-p_{flow}\|_2>\mu\\ 1, & \text{otherwise}\end{cases}\quad\text{...Formula (13)};$$

In the above formula (13), $\mu$ denotes the preset threshold and $\|\cdot\|_2$ denotes the Euclidean distance; exemplarily, the preset threshold $\mu$ may be set to 0.5, which is not limited here. After obtaining the first prediction mask $M_{di}$ and the second prediction mask $\widetilde{M}_{di}$, the mask prediction loss $L_{art\_mask}$ can be obtained based on the difference between them. Exemplarily, a cross-entropy loss function can be used to measure the difference between the first prediction mask $M_{di}$ and the second prediction mask $\widetilde{M}_{di}$, obtaining the mask prediction loss $L_{art\_mask}$:

$$L_{art\_mask}=-\frac{1}{|N|}\sum_{p\in N}\Big[\widetilde{M}_{di}(p)\log M_{di}(p)+\big(1-\widetilde{M}_{di}(p)\big)\log\big(1-M_{di}(p)\big)\Big]\quad\text{...Formula (14)};$$

In the above formula (14), $N$ denotes the set of pixels in the first prediction mask or the second prediction mask, and $|N|$ denotes the total number of pixels in the first prediction mask or the second prediction mask.
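The self-supervised mask supervision of formulas (11) to (14) can be sketched as follows: the pseudo-label (second prediction mask) marks pixels whose pose/depth reprojection and overall-flow correspondence disagree by more than μ, and the mask prediction loss is a per-pixel binary cross-entropy against the network's first prediction mask. The single-channel mask layout, the random stand-ins for the two projections, and the exact cross-entropy form are assumptions for illustration.

```python
import numpy as np

def second_prediction_mask(p_cam, p_flow, mu=0.5):
    """Formula (13): 0 where the two projections disagree by more than mu
    (the pixel is treated as moving), 1 otherwise."""
    dist = np.linalg.norm(p_cam - p_flow, axis=-1)
    return (dist <= mu).astype(np.float32)

def mask_prediction_loss(pred_mask, pseudo_label, eps=1e-6):
    """Formula (14), sketched as an averaged binary cross-entropy."""
    p = np.clip(pred_mask, eps, 1.0 - eps)
    ce = -(pseudo_label * np.log(p) + (1.0 - pseudo_label) * np.log(1.0 - p))
    return float(ce.mean())

H, W = 48, 64
p_i = np.stack(np.meshgrid(np.arange(W), np.arange(H)), -1).astype(np.float32)
p_cam = p_i + np.random.randn(H, W, 2)               # reprojection with updated pose/depth
F_oij = np.random.randn(H, W, 2).astype(np.float32)
p_flow = p_i + F_oij                                  # formula (12)
pseudo = second_prediction_mask(p_cam, p_flow, mu=0.5)
pred = np.random.rand(H, W)                           # first prediction mask (single channel here)
print(mask_prediction_loss(pred, pseudo))
```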
在一个实施场景中,与前述自监督训练方式人工构造掩膜标签不同的是,若训练过程中存在真实动态掩膜,则可以通过有监督训练的方式,来监督模型训练。在存在真实动态掩膜的情况下,可以基于第一预测掩膜与真实动态掩膜之间的差异,得到掩膜预测损失。也可以采用交叉熵损失函数度量第一预测掩膜与真实动态掩膜之间的差异,得到掩膜预测损失。为了便于区分前述自监督训练中的掩膜预测损失与有监督训练中的掩膜预测损失,可以将有监督训练中的掩膜预测损失记为L gt_maskIn an implementation scenario, unlike the aforementioned self-supervised training method of manually constructing mask labels, if there are real dynamic masks during the training process, the model training can be supervised through supervised training. In the case where a real dynamic mask exists, the mask prediction loss can be obtained based on the difference between the first predicted mask and the real dynamic mask. The cross-entropy loss function can also be used to measure the difference between the first predicted mask and the real dynamic mask to obtain the mask prediction loss. In order to facilitate the distinction between the aforementioned mask prediction loss in self-supervised training and the mask prediction loss in supervised training, the mask prediction loss in supervised training can be recorded as L gt_mask :
$$L_{gt\_mask}=-\frac{1}{|N|}\sum_{p\in N}\Big[M_i(p)\log M_{di}(p)+\big(1-M_i(p)\big)\log\big(1-M_{di}(p)\big)\Big]\quad\text{...Formula (15)};$$

In the above formula (15), $M_i$ denotes the ground-truth dynamic mask; the other parameters are as described above for the self-supervised training.
在一个实施场景中,如前所述,样本参考数据还包括样本动态掩膜,样本动态掩膜用于指示样本图像中的运动对象,且预测损失包括几何光度损失。为了便于描述,几何光度损失可以记为L geo_ph。此外,关于样本动态掩膜,可以参阅前述公开实施例中关于动态掩膜的相关描述。请结合参阅图7,图7是动态场景一实施例的示意图。如图7所示,在自监督训练模式中,当使用光度误差来监督位姿和深度时,直接使用静态光流可能会导致像素不匹配(如,打叉的一对像素)即701,因为运动对象本身的运动会导致静态光流中像素的遮挡,这样会使光度误差的准确度下降。有鉴于此,可以基于各个与第一样本图像具有共视关系的第二样本图像的样本动态掩膜进行融合,得到样本融合掩膜。在此基础上,可以基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置即702。基于此,可以基于第一样本图像中样本像素点的样本像素位置,得到第一样本图像中样本像素点的第一样本像素值,并基于第一样本图像中样本像素点的第一样本投影位置,得到第一样本图像中样本像素点的第二样本像素值,以及基于样本融合掩膜,得到第一样本图像中样本像素点的融合掩膜值,从而可以基于第一样本像素值、第二样本像素值和融合掩膜值,得到几何光度损失。上述方式,通过融合与第一样本图像具有共视关系的第二样本图像的样本动态掩膜,得到样本融合掩膜,并在几何光度损失度量过程中考虑该样本融合掩膜,有利于通过样本融合掩膜尽可能地消除由于像素遮挡而导致的错误像素光度匹配,能够提升几何光度损失的度量精度,有利于提升图像分析模型的模型性能。 In one implementation scenario, as mentioned above, the sample reference data also includes a sample dynamic mask, the sample dynamic mask is used to indicate moving objects in the sample image, and the predicted loss includes a geometric photometric loss. For ease of description, the geometric photometric loss can be denoted as L geo_ph . In addition, regarding the sample dynamic mask, please refer to the relevant descriptions about the dynamic mask in the aforementioned disclosed embodiments. Please refer to FIG. 7 , which is a schematic diagram of an embodiment of a dynamic scene. As shown in Figure 7, in the self-supervised training mode, when using photometric errors to supervise pose and depth, directly using static optical flow may lead to pixel mismatch (e.g., a pair of crossed pixels), which is 701, because The movement of the moving object itself will cause the occlusion of pixels in the static optical flow, which will reduce the accuracy of the photometric error. In view of this, the sample fusion mask can be obtained by fusion based on the sample dynamic masks of the second sample images that have a common view relationship with the first sample image. On this basis, projection can be performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, to obtain the projection of the sample pixel in the first sample image on the second sample image. The first sample projection position is 702. Based on this, the first sample pixel value of the sample pixel point in the first sample image can be obtained based on the sample pixel position of the sample pixel point in the first sample image, and based on the first sample pixel value of the sample pixel point in the first sample image A sample projection position is used to obtain the second sample pixel value of the sample pixel point in the first sample image, and based on the sample fusion mask, the fusion mask value of the sample pixel point in the first sample image is obtained, so that the second sample pixel value of the sample pixel point in the first sample image is obtained. One sample pixel value, the second sample pixel value and the fused mask value are used to obtain the geometric photometric loss. 
In the above method, the sample fusion mask is obtained by fusing the sample dynamic mask of the second sample image that has a common view relationship with the first sample image, and the sample fusion mask is considered in the geometric photometric loss measurement process, which is beneficial to the The sample fusion mask eliminates erroneous pixel photometric matching due to pixel occlusion as much as possible, can improve the measurement accuracy of geometric photometric loss, and is beneficial to improving the model performance of the image analysis model.
In one implementation scenario, for the second sample images that each have a co-visibility relationship with the first sample image, the sample dynamic masks of these second sample images can be aggregated to obtain the sample fusion mask. Exemplarily, the specific aggregation operation may include, but is not limited to, taking a union, which is not limited here. For ease of description, the sample fusion mask can be denoted as $M^{fuse}_{di}$. Meanwhile, for the first sample projection position, reference may be made to the related description of the foregoing mask prediction loss.
在一个实施场景中,可以直接根据第一样本图像中样本像素点的样本像素位置,在第一样本图像中查询该样本像素位置处的像素值,得到第一样本像素值,其中,可以将第一样本像素值记为I i。此外,在得到第一样本投影位置之后,可以在第二样本图像通过双线性插值得到第二样本像素值I j→iIn one implementation scenario, the first sample pixel value can be obtained by querying the pixel value at the sample pixel position in the first sample image directly based on the sample pixel position of the sample pixel point in the first sample image, where, The first sample pixel value may be denoted I i . In addition, after obtaining the first sample projection position, the second sample pixel value I j→i can be obtained through bilinear interpolation in the second sample image:
$$I_{j\to i}=I_j\langle p_{cam}\rangle\quad\text{...Formula (16)};$$

In the above formula (16), $p_{cam}$ denotes the first sample projection position, $I_j$ denotes the second sample image, and $I_j\langle\cdot\rangle$ denotes interpolation within the second sample image.
Here, after obtaining the first sample pixel value $I_i$ and the second sample pixel value $I_{j\to i}$, the pixel difference $pe(I_i,I_{j\to i})$ between the first sample pixel value and the second sample pixel value can be obtained, and then weighted by the fusion mask value $M^{fuse}_{di}$ of the sample pixel to obtain the weighted difference $M^{fuse}_{di}\cdot pe(I_i,I_{j\to i})$. On this basis, the geometric photometric loss $L_{geo\_ph}$ is obtained from the weighted differences of the individual sample pixels:

$$L_{geo\_ph}=\frac{1}{N'}\sum_{ij}M^{fuse}_{di}\cdot pe(I_i,I_{j\to i})\quad\text{...Formula (17)};$$

In the above formula (17), $N'$ denotes the total number of pixels in the sample fusion mask that belong to stationary objects. In this way, by weighting the pixel differences with the fusion mask values, erroneous pixel photometric matches caused by pixel occlusion can be quickly filtered out, which helps reduce the complexity of measuring the geometric photometric loss. In addition, to improve the accuracy of the geometric photometric loss, the pixel difference $pe(I_i,I_{j\to i})$ between the first sample pixel value and the second sample pixel value can be measured in various ways. For example, the first sample pixel value and the second sample pixel value can be measured based on structural similarity to obtain a first difference, and measured based on absolute deviation to obtain a second difference; on this basis, the first difference and the second difference are weighted to obtain the pixel difference $pe(I_i,I_{j\to i})$:
$$pe(I_i,I_{j\to i})=\alpha\,\frac{1-\mathrm{SSIM}(I_i,I_{j\to i})}{2}+(1-\alpha)\,\|I_i-I_{j\to i}\|_1\quad\text{...Formula (18)};$$

In the above formula (18), SSIM denotes the structural similarity measure, $\|\cdot\|_1$ denotes the absolute deviation measure, and $\alpha$ and $(1-\alpha)$ denote the weights of the first difference and the second difference, respectively. Exemplarily, $\alpha$ may be set to 0.85, which is not limited here. In this way, in the process of measuring the pixel difference, structural similarity and absolute deviation are combined, which helps improve the accuracy of the pixel difference as much as possible.
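The photometric terms of formulas (17) and (18) can be sketched as below for a single image pair. The SSIM here is a simplified global-statistics form (a windowed SSIM is normally used), and the per-pixel layout, constants c1/c2 and the random inputs are assumptions for illustration.

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from global image statistics; returns a single scalar."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_error(I_i, I_warp, alpha=0.85):
    """Formula (18): weighted mix of an SSIM-based difference and an L1 difference."""
    d_ssim = (1.0 - ssim_global(I_i, I_warp)) / 2.0
    d_l1 = np.abs(I_i - I_warp).mean()
    return alpha * d_ssim + (1.0 - alpha) * d_l1

def geometric_photometric_loss(I_i, I_warp, fusion_mask, alpha=0.85):
    """Formula (17), sketched per pixel: the error is weighted by the fusion-mask
    value (about 1 for static pixels) and averaged over the static pixels N'."""
    per_pixel = (alpha * (1.0 - ssim_global(I_i, I_warp)) / 2.0
                 + (1.0 - alpha) * np.abs(I_i - I_warp))
    n_static = max(float(fusion_mask.sum()), 1.0)
    return float((fusion_mask * per_pixel).sum() / n_static)

H, W = 48, 64
I_i = np.random.rand(H, W)
I_j_to_i = I_i + np.random.randn(H, W) * 0.05               # image warped via formula (16)
mask = (np.random.rand(H, W) > 0.1).astype(np.float32)       # sample fusion mask
print(photometric_error(I_i, I_j_to_i))
print(geometric_photometric_loss(I_i, I_j_to_i, mask))
```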
在一个实施场景中,与前述结合样本融合掩膜度量几何光度损失不同的是,在对损失度量的精度较为宽松的情况下,也可以不考虑样本融合掩膜来度量几何光度损失。在此情况下,几何光度损失L geo_ph可以表示为公式(19)中,N表示样本像素点总数: In an implementation scenario, unlike the aforementioned measurement of geometric photometric loss in combination with the sample fusion mask, when the accuracy of the loss measurement is relatively loose, the geometric photometric loss can also be measured without considering the sample fusion mask. In this case, the geometric photometric loss L geo_ph can be expressed as formula (19), where N represents the total number of sample pixels:
$$L_{geo\_ph}=\frac{1}{N}\sum_{ij}pe(I_i,I_{j\to i})\quad\text{...Formula (19)};$$
在一个实施场景中,为了提升损失度量的准确性,预测损失还可以包括光流光度损失,其中,可以将光流光度损失记为L flow_ph。此外,样本分析结果还可以包括样本动态光流,其可以参阅前述掩膜预测损失中相关描述。基于此,可以基于样本动态光流、更新的样本位姿和更新的样本深度,得到更新的样本整体光流,其可以参阅前述掩膜预测损失中相关描述。在此基础上,可以基于更新的样本整体光流和第一样本图像中样本像素点的样本像素位置进行投影,得到第一样本图像中样本像素点投影在第二样本图像的第二样本投影位置。示例性地,可以直接在更新的样本整体光流中查询样本像素点的样本整体光流值,再加上样本像素点的样本像素位置,得到第二样本投影位置,其可以参阅前述掩膜预测损失中相关描述。与前述几何光度损失中类似地,在得到第二样本投影位置之后,可以在第二样本图像通过双线性插值得到第二样本像素值I j→iIn an implementation scenario, in order to improve the accuracy of the loss measurement, the predicted loss may also include optical flow photometric loss, where the optical flow photometric loss may be recorded as L flow_ph . In addition, the sample analysis results may also include sample dynamic optical flow, which may be described in the aforementioned mask prediction loss. Based on this, the updated overall optical flow of the sample can be obtained based on the sample dynamic optical flow, the updated sample pose and the updated sample depth. For this, please refer to the relevant description in the aforementioned mask prediction loss. On this basis, projection can be performed based on the updated overall optical flow of the sample and the sample pixel position of the sample pixel point in the first sample image, to obtain a second sample in which the sample pixel point in the first sample image is projected on the second sample image. Projection position. For example, the sample overall optical flow value of the sample pixel point can be directly queried in the updated sample overall optical flow, plus the sample pixel position of the sample pixel point, to obtain the second sample projection position, which can be referred to the aforementioned mask prediction. Related descriptions in losses. Similar to the aforementioned geometric photometric loss, after obtaining the second sample projection position, the second sample pixel value I j→i can be obtained through bilinear interpolation in the second sample image:
$$I_{j\to i}=I_j\langle F_{oij}+p_i\rangle\quad\text{...Formula (20)};$$
上述公式(20)中,I j<·>表示在第二样本图像I j进行插值计算。在此基础上,与前述几何光度损失类似地,可以基于第一样本像素值与第二样本像素值之间的差异,得到光流光度损失。示例性地,可以基于结构相似性度量第一样本像素值和第二样本像素值,得到第一差值,并基于绝对值偏差度量第一样本像素值和第二样本像素值,得到第二差值,再基于第一差值和第二差值进行加权,得到像素差值,从而可以基于各个样本像素点的像素差值,得到光流光度损失L flow_phIn the above formula (20), I j <·> indicates that interpolation calculation is performed on the second sample image I j . On this basis, similar to the aforementioned geometric photometric loss, the optical flow photometric loss can be obtained based on the difference between the first sample pixel value and the second sample pixel value. For example, the first sample pixel value and the second sample pixel value can be measured based on structural similarity to obtain the first difference value, and the first sample pixel value and the second sample pixel value can be measured based on the absolute value deviation to obtain the first difference value. The two differences are then weighted based on the first difference and the second difference to obtain the pixel difference, so that the optical flow photometric loss L flow_ph can be obtained based on the pixel difference of each sample pixel:
L flow_ph=∑ ijpe(I i,I j→i)……公式(21)。 L flow_ph =∑ ij pe(I i ,I j→i )...Formula (21).
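Purely as an illustration of formulas (20) and (21), the following Python sketch warps the second sample image into the first sample frame with the updated overall optical flow via bilinear interpolation and sums a per-pixel photometric error. The flow layout (x-displacement in channel 0, y-displacement in channel 1) and the plain L1 error used as pe(·,·) here are assumptions; the weighted SSIM/absolute-deviation form of the pixel difference is sketched further below.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a single-channel image (H, W) at float coordinates (x, y)."""
    h, w = img.shape
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)

def flow_photometric_loss(img_i, img_j, overall_flow):
    """L_flow_ph: warp img_j to frame i with the overall flow, compare to img_i."""
    h, w = img_i.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # formula (20): I_{j->i} = I_j< F_oij + p_i >, sampled by bilinear interpolation
    warped = bilinear_sample(img_j, xs + overall_flow[..., 0], ys + overall_flow[..., 1])
    # formula (21): sum of per-pixel photometric errors (L1 used as a stand-in for pe)
    return np.abs(img_i - warped).sum()
```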
步骤S65:基于预测损失,调整图像分析模型的网络参数。Step S65: Based on the prediction loss, adjust the network parameters of the image analysis model.
在一个实施场景中,在通过自监督方式训练网络模型的情况下,预测损失可以包括前述掩膜预测损失、几何光度损失、光流光度损失中至少一者。示例性地,预测损失可以包括前述掩膜预测损失、几何光度损失和光流光度损失,则可以基于这三者进行加权,得到预测损失L self_supIn one implementation scenario, when the network model is trained in a self-supervised manner, the prediction loss may include at least one of the aforementioned mask prediction loss, geometric photometric loss, and optical flow photometric loss. For example, the prediction loss can include the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss. Then the prediction loss L self_sup can be obtained by weighting based on these three:
L self_sup=λ 0L geo_ph1L flow_ph2L art_mask……公式(22); L self_sup0 L geo_ph1 L flow_ph2 L art_mask ...Formula (22);
上述公式(22)中,λ 0,λ 1,λ 2均表示加权系数,示例性地,可以分别设置为100、5、0.05,在此不做限定。请结合参阅表1,表1是本公开图像分析模型采用自监督方式训练之后的测试性能与现有技术的测试性能一实施例的对比表。 In the above formula (22), λ 0 , λ 1 , and λ 2 all represent weighting coefficients. For example, they can be set to 100, 5, and 0.05 respectively, which are not limited here. Please refer to Table 1. Table 1 is a comparison table between the test performance of the disclosed image analysis model after training in a self-supervised manner and the test performance of the prior art in an embodiment.
表1分析模型采用自监督方式训练之后的测试性能与现有技术的测试性能一实施例的对比表Table 1 Comparison of the test performance of the analysis model after training in a self-supervised manner and the test performance of the existing technology according to an embodiment
分析方式 Analysis method | K09 | K10 | VK01 | VK02 | VK06 | VK18 | VK20
现有技术1 Existing technology 1 | 28.1 | 24.0 | - | - | - | - | -
现有技术2 Existing technology 2 | 41.91 | 7.519 | 27.830 | X | X | X | 2.807
现有技术3 Existing technology 3 | 47.1 | 11.0 | 2.259 | 0.049 | 0.136 | 1.170 | 6.998
本公开 This disclosure | 27.8 | 4.2 | 0.591 | 0.021 | 0.13 | 0.400 | 1.039
其中，K09和K10表示KITTI数据集中图像序列09和图像序列10的测试场景下，不同技术方案的测试性能，VK01、VK02、VK06、VK18、VK20表示KITTI2数据集中图像序列01、图像序列02、图像序列06、图像序列18和图像序列20的测试场景下，不同技术方案的测试性能。由表1可见，本公开自监督方式训练得到的图像分析模型在诸多测试场景下较其他现有技术均具有极为显著的模型性能。Here, K09 and K10 denote the test performance of the different technical solutions in the test scenarios of image sequence 09 and image sequence 10 of the KITTI dataset, and VK01, VK02, VK06, VK18 and VK20 denote the test performance of the different technical solutions in the test scenarios of image sequence 01, image sequence 02, image sequence 06, image sequence 18 and image sequence 20 of the KITTI2 dataset. As can be seen from Table 1, the image analysis model obtained by self-supervised training in the present disclosure shows significantly better model performance than the other prior-art solutions in many test scenarios.
在一个实施场景中，与前述通过自监督训练网络模型类似地，通过有监督方式训练网络模型的情况下，预测损失可以包括前述掩膜预测损失、几何光度损失、光流光度损失中至少一者。示例性地，预测损失可以包括前述掩膜预测损失、几何光度损失和光流光度损失，则可以基于这三者进行加权，得到预测损失L semi_sup：In an implementation scenario, similar to the aforementioned self-supervised training of the network model, when the network model is trained in a supervised manner, the prediction loss may include at least one of the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss. For example, the prediction loss may include the aforementioned mask prediction loss, geometric photometric loss and optical flow photometric loss, and the prediction loss L semi_sup can then be obtained by weighting these three:
L semi_sup=λ 0L geo_ph1L flow_ph2L art_mask……公式(23); L semi_sup0 L geo_ph1 L flow_ph2 L art_mask ...Formula (23);
上述公式(23)中,λ 0,λ 1,λ 2均表示加权系数,示例性地,可以分别设置为100、5、0.05,在此不做限定。 In the above formula (23), λ 0 , λ 1 , and λ 2 all represent weighting coefficients. For example, they can be set to 100, 5, and 0.05 respectively, which are not limited here.
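A minimal Python sketch of the weighted combination in formulas (22)/(23); the default weights follow the example values 100, 5 and 0.05 given above and are not fixed by the disclosure.

```python
def total_prediction_loss(l_geo_ph, l_flow_ph, l_art_mask,
                          lambda0=100.0, lambda1=5.0, lambda2=0.05):
    # weighted sum of geometric photometric, optical flow photometric and mask losses
    return lambda0 * l_geo_ph + lambda1 * l_flow_ph + lambda2 * l_art_mask
```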
在一个实施场景中,在得到预测损失之后,可以通过诸如梯度下降等优化方式,调整图像分析模型的网络参数,其过程可以参阅梯度下降等优化方式的技术细节。In an implementation scenario, after obtaining the prediction loss, the network parameters of the image analysis model can be adjusted through optimization methods such as gradient descent. For the process, please refer to the technical details of optimization methods such as gradient descent.
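A hedged PyTorch-style sketch of one parameter-adjustment step by gradient descent; the optimizer choice and learning rate are illustrative assumptions rather than values from the disclosure.

```python
import torch

def train_step(optimizer, prediction_loss):
    optimizer.zero_grad()        # clear gradients from the previous step
    prediction_loss.backward()   # back-propagate the prediction loss
    optimizer.step()             # gradient-descent style update of network parameters

# e.g. optimizer = torch.optim.Adam(image_analysis_model.parameters(), lr=1e-4)
```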
上述方案，与推理阶段类似地，通过模仿人类感知现实世界的方式，将整体光流视为由摄像器件运动和拍摄对象运动共同引起，并在图像分析过程中，参考整体光流和由摄像器件运动引起的静态光流，预测出静态光流的光流校准数据，从而能够在后续位姿和深度优化过程中，结合静态光流及其光流校准数据尽可能地降低拍摄对象运动所导致的影响，能够提升图像分析模型的模型性能，有利于提升利用图像分析模型在推理阶段得到分析结果的准确性，进而能够提升推理阶段位姿和深度的精度。In the above solution, similar to the inference stage, by imitating the way humans perceive the real world, the overall optical flow is regarded as being caused jointly by the motion of the camera device and the motion of the photographed object. During image analysis, the optical flow calibration data of the static optical flow is predicted with reference to the overall optical flow and the static optical flow caused by the motion of the camera device, so that in the subsequent pose and depth optimization the static optical flow and its optical flow calibration data can be combined to reduce the influence of subject motion as much as possible. This improves the model performance of the image analysis model, helps improve the accuracy of the analysis results obtained with the image analysis model in the inference stage, and in turn improves the accuracy of the pose and depth in the inference stage.
请参阅图8,图8是本公开图像分析装置80一实施例的框架示意图。图像分析装置80包括:获取部分81,被配置为获取图像序列、光流数据和图像序列中各个图像的参考数据;其中,各个图像包括具有共视关系的第一图像和第二图像,光流数据包括第一图像与第二图像之间的静态光流和整体光流,静态光流由摄像器件运动引起,整体光流由摄像器件运动和拍摄对象运动共同引起,且参考数据包括位姿和深度;分析部分82,被配置为基于图像序列和光流数据,预测得到分析结果;其中,分析结果包括静态光流的光流校准数据;优化部分83,被配置为基于静态光流和光流校准数据,对位姿和深度进行优化,得到更新的位姿和更新的深度。Please refer to FIG. 8 , which is a schematic framework diagram of an embodiment of the image analysis device 80 of the present disclosure. The image analysis device 80 includes: an acquisition part 81 configured to acquire an image sequence, optical flow data, and reference data of each image in the image sequence; wherein each image includes a first image and a second image having a common view relationship, and the optical flow The data includes static optical flow and overall optical flow between the first image and the second image. The static optical flow is caused by the movement of the camera device, the overall optical flow is caused by the movement of the camera device and the movement of the photographed object, and the reference data includes pose and depth; the analysis part 82 is configured to predict the analysis results based on the image sequence and optical flow data; wherein the analysis results include optical flow calibration data of static optical flow; the optimization part 83 is configured to predict based on the static optical flow and optical flow calibration data , optimize the pose and depth to obtain an updated pose and an updated depth.
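A structural sketch of how the three parts of the image analysis device 80 chain together; all function names below are illustrative placeholders, not interfaces defined by the disclosure.

```python
class ImageAnalysisDevice:
    """Acquisition part 81 -> analysis part 82 -> optimization part 83 (sketch)."""

    def __init__(self, acquire_fn, analyze_fn, optimize_fn):
        self.acquire = acquire_fn    # returns images, static/overall flow, pose, depth
        self.analyze = analyze_fn    # predicts optical flow calibration data
        self.optimize = optimize_fn  # refines pose and depth

    def run(self):
        images, static_flow, overall_flow, pose, depth = self.acquire()
        calibration = self.analyze(images, static_flow, overall_flow)
        return self.optimize(static_flow, calibration, pose, depth)
```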
在一些公开实施例中，分析部分82包括：特征相关子部分，被配置为基于第一图像的图像特征和第二图像的图像特征，得到第一图像与第二图像之间的特征相关数据；第一投影子部分，被配置为基于静态光流将第一图像中像素点进行投影，得到第一图像中像素点在第二图像中的第一投影位置；特征搜索子部分，被配置为基于第一投影位置在特征相关数据中搜索，得到目标相关数据；数据分析子部分，被配置为基于目标相关数据、静态光流和整体光流，得到分析结果。In some disclosed embodiments, the analysis part 82 includes: a feature correlation sub-part configured to obtain feature correlation data between the first image and the second image based on the image features of the first image and the image features of the second image; a first projection sub-part configured to project the pixel points of the first image based on the static optical flow to obtain first projection positions of the pixel points of the first image in the second image; a feature search sub-part configured to search the feature correlation data based on the first projection positions to obtain target correlation data; and a data analysis sub-part configured to obtain the analysis result based on the target correlation data, the static optical flow and the overall optical flow.
在一些公开实施例中，数据分析子部分包括：第一编码子部分，被配置为基于目标相关数据进行编码，得到第一编码特征；第二编码子部分，被配置为基于静态光流和整体光流进行编码，得到第二编码特征；预测子部分，被配置为基于第一编码特征和第二编码特征，预测得到分析结果。In some disclosed embodiments, the data analysis sub-part includes: a first encoding sub-part configured to perform encoding based on the target correlation data to obtain a first encoding feature; a second encoding sub-part configured to perform encoding based on the static optical flow and the overall optical flow to obtain a second encoding feature; and a prediction sub-part configured to predict the analysis result based on the first encoding feature and the second encoding feature.
在一些公开实施例中，参考数据还包括动态掩膜，动态掩膜用于指示图像中的运动对象，分析结果还包括置信度图和动态掩膜的掩膜校准数据，置信度图包括图像中各像素点的置信度；优化部分83包括：图像融合子部分，被配置为基于动态掩膜、掩膜校准数据和置信度图进行融合，得到重要度图；位置校准子部分，被配置为基于光流校准数据对第一投影位置进行校准，得到校准位置；其中，重要度图包括图像中各像素点的重要度，第一投影位置为第一图像中像素点基于静态光流投影在第二图像的像素位置；数据优化子部分，被配置为基于校准位置和重要度图，优化得到更新的位姿和更新的深度。In some disclosed embodiments, the reference data further includes a dynamic mask used to indicate moving objects in the image, and the analysis result further includes a confidence map and mask calibration data of the dynamic mask, the confidence map including the confidence of each pixel point in the image; the optimization part 83 includes: an image fusion sub-part configured to perform fusion based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map; a position calibration sub-part configured to calibrate the first projection positions based on the optical flow calibration data to obtain calibrated positions, wherein the importance map includes the importance of each pixel point in the image, and the first projection position is the pixel position at which a pixel point of the first image is projected onto the second image based on the static optical flow; and a data optimization sub-part configured to obtain the updated pose and the updated depth through optimization based on the calibrated positions and the importance map.
在一些公开实施例中，光流校准数据包括第一图像中像素点的校准光流，位置校准子部分，还被配置为将第一图像中像素点的校准光流加上像素点在第二图像中的第一投影位置，得到像素点的校准位置。In some disclosed embodiments, the optical flow calibration data includes the calibration optical flow of the pixel points in the first image, and the position calibration sub-part is further configured to add the calibration optical flow of a pixel point of the first image to the first projection position of that pixel point in the second image, to obtain the calibrated position of the pixel point.
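A minimal numpy sketch of this calibration step: the calibrated position is the pixel's calibration optical flow added to its first projection position in the second image. The (H, W, 2) array layout is an assumption.

```python
import numpy as np

def calibrated_positions(first_projection, calibration_flow):
    # both arrays assumed shaped (H, W, 2): an (x, y) pair per pixel
    return np.asarray(first_projection) + np.asarray(calibration_flow)
```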
在一些公开实施例中，图像融合子部分包括：校准子部分，被配置为基于掩膜校准数据对动态掩膜进行校准，得到校准掩膜；其中，校准掩膜包括图像中像素点与运动对象的相关度，且相关度与图像中像素点属于运动对象的可能性正相关；融合子部分，被配置为基于置信度图和校准掩膜进行融合，得到重要度图。In some disclosed embodiments, the image fusion sub-part includes: a calibration sub-part configured to calibrate the dynamic mask based on the mask calibration data to obtain a calibrated mask, wherein the calibrated mask includes the correlation between each pixel point in the image and the moving object, and the correlation is positively related to the possibility that the pixel point belongs to the moving object; and a fusion sub-part configured to perform fusion based on the confidence map and the calibrated mask to obtain the importance map.
在一些公开实施例中，分析结果还包括动态光流，动态光流由拍摄对象运动引起；图像分析装置80包括：静态光流更新部分，被配置为基于更新的位姿和更新的深度，获取更新的静态光流；整体光流更新部分，被配置为基于动态光流和更新的静态光流，得到更新的整体光流；数据更新部分，被配置为基于更新的静态光流和更新的整体光流，得到更新的光流数据，并基于更新的位姿和更新的深度，得到更新的参考数据；循环部分，被配置为结合分析部分82和优化部分83重新执行基于图像序列和光流数据，预测得到分析结果的步骤以及后续步骤，直至重新执行的次数满足预设条件为止。In some disclosed embodiments, the analysis result further includes dynamic optical flow, the dynamic optical flow being caused by the motion of the photographed object; the image analysis device 80 includes: a static optical flow update part configured to obtain an updated static optical flow based on the updated pose and the updated depth; an overall optical flow update part configured to obtain an updated overall optical flow based on the dynamic optical flow and the updated static optical flow; a data update part configured to obtain updated optical flow data based on the updated static optical flow and the updated overall optical flow, and to obtain updated reference data based on the updated pose and the updated depth; and a loop part configured to re-execute, in combination with the analysis part 82 and the optimization part 83, the step of predicting the analysis result based on the image sequence and the optical flow data as well as the subsequent steps, until the number of re-executions satisfies a preset condition.
在一些公开实施例中，静态光流更新部分包括：第二投影子部分，被配置为基于更新的位姿、更新的深度和第一图像中像素点的像素位置进行投影，得到第一图像中像素点投影在第二图像的第二投影位置；光流更新子部分，被配置为基于第一图像中像素点投影在第二图像的第二投影位置和第一图像中像素点在第二图像中的对应位置之间的差异，得到更新的静态光流；其中，对应位置为在假设摄像器件未运动的情况下，第一图像中像素点所属的空间点投影在第二图像的像素位置。In some disclosed embodiments, the static optical flow update part includes: a second projection sub-part configured to perform projection based on the updated pose, the updated depth and the pixel positions of the pixel points in the first image, to obtain second projection positions at which the pixel points of the first image are projected onto the second image; and an optical flow update sub-part configured to obtain the updated static optical flow based on the difference between the second projection position of a pixel point of the first image in the second image and the corresponding position of that pixel point in the second image, wherein the corresponding position is the pixel position at which the spatial point to which the pixel point of the first image belongs would be projected onto the second image assuming the camera device has not moved.
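A minimal numpy sketch of the static-flow update described above, assuming a pinhole camera with intrinsics K and an updated relative pose (R, t). Under the no-motion assumption the spatial point projects back to the pixel's own coordinates, so the corresponding position is taken as the pixel position itself; these modelling choices are assumptions for illustration.

```python
import numpy as np

def updated_static_flow(depth, R, t, K):
    """Updated static flow = projection with updated pose/depth minus corresponding position."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixel coordinates
    rays = pix @ np.linalg.inv(K).T                        # back-project to camera rays
    points = rays * depth[..., None]                       # 3D points in the first frame
    cam_j = points @ R.T + t                               # transform with the updated pose
    proj = cam_j @ K.T
    proj = proj[..., :2] / proj[..., 2:3]                  # second projection position
    corresponding = np.stack([xs, ys], axis=-1)            # position if the camera had not moved
    return proj - corresponding                            # updated static optical flow (x, y)
```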
在一些公开实施例中,整体光流更新部分,还被配置为将动态光流和更新的静态光流相加,得到更新的整体光流。In some disclosed embodiments, the overall optical flow updating part is also configured to add the dynamic optical flow and the updated static optical flow to obtain an updated overall optical flow.
请参阅图9，图9是图像分析模型的训练装置90一实施例的框架示意图。图像分析模型的训练装置90包括：样本获取部分91，被配置为获取样本图像序列、样本光流数据和样本图像序列中各个样本图像的样本参考数据；其中，各个样本图像包括具有共视关系的第一样本图像和第二样本图像，样本光流数据包括第一样本图像与第二样本图像之间的样本静态光流和样本整体光流，样本静态光流由摄像器件运动引起，样本整体光流由摄像器件运动和拍摄对象运动共同引起，且样本参考数据包括样本位姿和样本深度；样本分析部分92，被配置为基于图像分析模型对样本图像序列和样本光流数据进行分析预测，得到样本分析结果；其中，样本分析结果包括样本静态光流的样本光流校准数据；样本优化部分93，被配置为基于样本静态光流和样本光流校准数据，对样本位姿和样本深度进行优化，得到更新的样本位姿和更新的样本深度；损失度量部分94，被配置为基于更新的样本位姿和更新的样本深度进行损失度量，得到图像分析模型的预测损失；参数调整部分95，被配置为基于预测损失，调整图像分析模型的网络参数。Please refer to FIG. 9, which is a schematic framework diagram of an embodiment of a training device 90 for an image analysis model. The training device 90 includes: a sample acquisition part 91 configured to acquire a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes the sample static optical flow and the sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by the motion of the camera device, the sample overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the sample reference data includes the sample pose and the sample depth; a sample analysis part 92 configured to analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result, wherein the sample analysis result includes sample optical flow calibration data of the sample static optical flow; a sample optimization part 93 configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth; a loss measurement part 94 configured to perform loss measurement based on the updated sample pose and the updated sample depth to obtain the prediction loss of the image analysis model; and a parameter adjustment part 95 configured to adjust the network parameters of the image analysis model based on the prediction loss.
在一些公开实施例中，样本参考数据还包括样本动态掩膜，样本动态掩膜用于指示样本图像中的运动对象，样本分析结果还包括样本动态光流和样本动态掩膜的样本掩膜校准数据，且样本动态光流由拍摄对象运动引起，预测损失包括掩膜预测损失；图像分析模型的训练装置90还包括：样本整体光流更新部分，被配置为基于样本动态光流、更新的样本位姿和更新的样本深度，得到更新的样本整体光流；损失度量部分94包括：第一掩膜更新子部分，被配置为基于样本掩膜校准数据和样本动态掩膜，得到样本动态掩膜在模型维度更新得到的第一预测掩膜；第二掩膜更新子部分，被配置为基于更新的样本整体光流、更新的样本位姿和更新的样本深度，得到样本动态掩膜在光流维度更新得到的第二预测掩膜；掩膜损失度量子部分，被配置为基于第一预测掩膜和第二预测掩膜之间的差异，得到掩膜预测损失。In some disclosed embodiments, the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, the sample analysis result further includes the sample dynamic optical flow and sample mask calibration data of the sample dynamic mask, the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss includes a mask prediction loss; the training device 90 further includes a sample overall optical flow update part configured to obtain an updated sample overall optical flow based on the sample dynamic optical flow, the updated sample pose and the updated sample depth; the loss measurement part 94 includes: a first mask update sub-part configured to obtain, based on the sample mask calibration data and the sample dynamic mask, a first prediction mask resulting from updating the sample dynamic mask in the model dimension; a second mask update sub-part configured to obtain, based on the updated sample overall optical flow, the updated sample pose and the updated sample depth, a second prediction mask resulting from updating the sample dynamic mask in the optical flow dimension; and a mask loss measurement sub-part configured to obtain the mask prediction loss based on the difference between the first prediction mask and the second prediction mask.
在一些公开实施例中，第二掩膜更新子部分包括：第一样本投影子部分，被配置为基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置；第二样本投影子部分，被配置为基于更新的样本整体光流和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第二样本投影位置；掩膜确定子部分，被配置为基于第一样本投影位置和第二样本投影位置之间的差异，得到第二预测掩膜。In some disclosed embodiments, the second mask update sub-part includes: a first sample projection sub-part configured to perform projection based on the updated sample pose, the updated sample depth and the sample pixel positions of the sample pixel points in the first sample image, to obtain first sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; a second sample projection sub-part configured to perform projection based on the updated sample overall optical flow and the sample pixel positions of the sample pixel points in the first sample image, to obtain second sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; and a mask determination sub-part configured to obtain the second prediction mask based on the difference between the first sample projection positions and the second sample projection positions.
在一些公开实施例中，掩膜确定子部分包括：距离对比子部分，被配置为基于第一样本投影位置与第二样本投影位置之间的距离对比预设阈值，得到样本像素点的样本掩膜值；其中，样本掩膜值用于表示样本像素点是否属于运动对象；掩膜获取子单元，被配置为基于各个样本像素点的样本掩膜值，得到第二预测掩膜。In some disclosed embodiments, the mask determination sub-part includes: a distance comparison sub-part configured to compare the distance between the first sample projection position and the second sample projection position with a preset threshold to obtain the sample mask value of the sample pixel point, wherein the sample mask value indicates whether the sample pixel point belongs to a moving object; and a mask acquisition sub-unit configured to obtain the second prediction mask based on the sample mask values of the respective sample pixel points.
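A minimal numpy sketch of the second prediction mask: a sample pixel is flagged as belonging to a moving object when its projection under the updated pose/depth and its projection under the updated sample overall optical flow disagree by more than a preset threshold. The threshold value and the 1 = moving / 0 = static encoding are assumptions.

```python
import numpy as np

def second_prediction_mask(first_sample_proj, second_sample_proj, threshold=1.0):
    # both arrays: (H, W, 2) projected positions in the second sample image
    dist = np.linalg.norm(first_sample_proj - second_sample_proj, axis=-1)
    return (dist > threshold).astype(np.float32)   # 1 = moving object, 0 = static
```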
在一些公开实施例中，样本参考数据还包括样本动态掩膜，样本动态掩膜用于指示样本图像中的运动对象，且预测损失包括几何光度损失；图像分析模型的训练装置90还包括样本掩膜聚合部分，被配置为基于各个与第一样本图像具有共视关系的第二样本图像的样本动态掩膜进行融合，得到样本融合掩膜；损失度量部分94包括：第一样本投影子部分，被配置为基于更新的样本位姿、更新的样本深度和第一样本图像中样本像素点的样本像素位置进行投影，得到第一样本图像中样本像素点投影在第二样本图像的第一样本投影位置；第一像素值确定子部分，被配置为基于第一样本图像中样本像素点的样本像素位置，得到第一样本图像中样本像素点的第一样本像素值；第二像素值确定子部分，被配置为基于第一样本图像中样本像素点的第一样本投影位置，得到第一样本图像中样本像素点的第二样本像素值；融合掩膜值获取子部分，被配置为基于样本融合掩膜，得到第一样本图像中样本像素点的融合掩膜值；光度损失度量子部分，被配置为基于第一样本像素值、第二样本像素值和融合掩膜值，得到几何光度损失。In some disclosed embodiments, the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, and the prediction loss includes a geometric photometric loss; the training device 90 further includes a sample mask aggregation part configured to perform fusion based on the sample dynamic masks of the respective second sample images having the common-view relationship with the first sample image, to obtain a sample fusion mask; the loss measurement part 94 includes: a first sample projection sub-part configured to perform projection based on the updated sample pose, the updated sample depth and the sample pixel positions of the sample pixel points in the first sample image, to obtain first sample projection positions at which the sample pixel points of the first sample image are projected onto the second sample image; a first pixel value determination sub-part configured to obtain the first sample pixel values of the sample pixel points in the first sample image based on their sample pixel positions; a second pixel value determination sub-part configured to obtain the second sample pixel values of the sample pixel points in the first sample image based on their first sample projection positions; a fusion mask value acquisition sub-part configured to obtain the fusion mask values of the sample pixel points in the first sample image based on the sample fusion mask; and a photometric loss measurement sub-part configured to obtain the geometric photometric loss based on the first sample pixel values, the second sample pixel values and the fusion mask values.
在一些公开实施例中，光度损失度量子部分包括：像素差值获取子部分，被配置为获取第一样本像素值和第二样本像素值之间的像素差值；数值加权子部分，被配置为利用融合掩膜值对像素差值进行加权，得到加权差值；损失获取子部分，被配置为基于各个样本像素点的加权差值，得到几何光度损失。In some disclosed embodiments, the photometric loss measurement sub-part includes: a pixel difference acquisition sub-part configured to acquire the pixel difference between the first sample pixel value and the second sample pixel value; a value weighting sub-part configured to weight the pixel difference with the fusion mask value to obtain a weighted difference; and a loss acquisition sub-part configured to obtain the geometric photometric loss based on the weighted differences of the respective sample pixel points.
在一些公开实施例中，像素差值获取子部分包括：第一差值子部分，被配置为基于结构相似性度量第一样本像素值和第二样本像素值，得到第一差值；第二差值子部分，被配置为基于绝对值偏差度量第一样本像素值和第二样本像素值，得到第二差值；差值加权子部分，被配置为基于第一差值和第二差值进行加权，得到像素差值。In some disclosed embodiments, the pixel difference acquisition sub-part includes: a first difference sub-part configured to measure the first sample pixel value and the second sample pixel value based on structural similarity to obtain a first difference value; a second difference sub-part configured to measure the first sample pixel value and the second sample pixel value based on absolute deviation to obtain a second difference value; and a difference weighting sub-part configured to perform weighting based on the first difference value and the second difference value to obtain the pixel difference value.
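A hedged numpy/scipy sketch of this pixel difference and of the fusion-mask-weighted geometric photometric loss: the difference combines a structural-similarity term and an absolute-deviation term, and each pixel's difference is weighted by its fusion mask value before summation. The weight alpha = 0.85, the 3x3 box-filtered SSIM and images normalised to [0, 1] are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2, size=3):
    """Per-pixel SSIM with box-filtered local statistics (sketch)."""
    mx, my = uniform_filter(x, size), uniform_filter(y, size)
    sx = uniform_filter(x * x, size) - mx * mx
    sy = uniform_filter(y * y, size) - my * my
    sxy = uniform_filter(x * y, size) - mx * my
    return ((2 * mx * my + c1) * (2 * sxy + c2)) / ((mx ** 2 + my ** 2 + c1) * (sx + sy + c2))

def geometric_photometric_loss(first_vals, second_vals, fusion_mask, alpha=0.85):
    d_ssim = (1.0 - ssim_map(first_vals, second_vals)) / 2.0   # first difference (structural)
    d_abs = np.abs(first_vals - second_vals)                    # second difference (absolute)
    pixel_diff = alpha * d_ssim + (1.0 - alpha) * d_abs         # weighted pixel difference
    return (fusion_mask * pixel_diff).sum()                     # fusion-mask weighted sum
```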
关于装置中的各模块的处理流程、以及各模块之间的交互流程的描述可以参照上述方法实施例中的相关说明,这里不再详述。For a description of the processing flow of each module in the device and the interaction flow between the modules, please refer to the relevant descriptions in the above method embodiments, and will not be described in detail here.
请参阅图10,图10是本公开电子设备100一实施例的框架示意图。电子设备100包括相互耦接的存储器101和处理器102,处理器102被配置为执行存储器101中存储的程序指令,以实现上述任一图像分析方法,或任一图像分析模型的训练方法。其中,电子设备100可以包括但不限于:微型计算机、服务器,此外,电子设备100还可以包括笔记本电脑、平板电脑等移动设备,在此不做限定。Please refer to FIG. 10 , which is a schematic framework diagram of an embodiment of the electronic device 100 of the present disclosure. The electronic device 100 includes a memory 101 and a processor 102 coupled to each other. The processor 102 is configured to execute program instructions stored in the memory 101 to implement any of the above image analysis methods or any image analysis model training method. The electronic device 100 may include but is not limited to: a microcomputer and a server. In addition, the electronic device 100 may also include mobile devices such as laptop computers and tablet computers, which are not limited here.
这里,处理器102被配置为控制其自身以及存储器101以实现上述任一图像分析方法,或实现上述任一图像分析模型的训练方法。处理器102还可以称为中央处理单元(Central Processing Unit,CPU)。处理器102可能是一种集成电路芯片,具有信号的处理能力。处理器102还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,处理器102可以由集成电路芯片共同实现。Here, the processor 102 is configured to control itself and the memory 101 to implement any of the above image analysis methods, or to implement any of the above image analysis model training methods. The processor 102 may also be called a central processing unit (Central Processing Unit, CPU). The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. In addition, the processor 102 may be implemented by an integrated circuit chip.
请参阅图11,图11为本公开计算机可读存储介质110一实施例的框架示意图。计算机可读存储介质110存储有能够被处理器运行的程序指令111,程序指令111被配置为实现上述任一图像分析方法,或实现上述任一图像分析模型的训练方法。Please refer to FIG. 11 , which is a schematic diagram of a framework of an embodiment of the computer-readable storage medium 110 of the present disclosure. The computer-readable storage medium 110 stores program instructions 111 that can be executed by the processor. The program instructions 111 are configured to implement any of the above image analysis methods, or to implement any of the above image analysis model training methods.
本公开实施例还提供一种计算机程序，所述计算机程序包括计算机可读代码，在所述计算机可读代码在电子设备中运行的情况下，所述电子设备的处理器执行用于上述任一图像分析方法，或实现上述任一图像分析模型的训练方法。Embodiments of the present disclosure further provide a computer program including computer readable code; when the computer readable code is run in an electronic device, the processor of the electronic device executes instructions for implementing any of the above image analysis methods, or for implementing any of the above training methods of the image analysis model.
在本公开所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,部分或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。In the several embodiments provided in this disclosure, it should be understood that the disclosed methods and devices can be implemented in other ways. For example, the device implementation described above is only illustrative. For example, the division of parts or units is only a logical function division. In actual implementation, there may be other division methods. For example, units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to network units. Some or all of the units can be selected according to actual needs to achieve the purpose of this embodiment.
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本公开各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、 磁碟或者光盘等各种可以存储程序代码的介质。Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on this understanding, the technical solution of the present disclosure is essentially or contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the various implementation methods of the present disclosure. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code. .
本公开涉及增强现实领域，通过获取现实环境中的目标对象的图像信息，进而借助各类视觉相关算法实现对目标对象的相关特征、状态及属性进行检测或识别处理，从而得到与具体应用匹配的虚拟与现实相结合的AR效果。示例性的，目标对象可涉及与人体相关的脸部、肢体、手势、动作等，或者与物体相关的标识物、标志物，或者与场馆或场所相关的沙盘、展示区域或展示物品等。视觉相关算法可涉及视觉定位、SLAM、三维重建、图像注册、背景分割、对象的关键点提取及跟踪、对象的位姿或深度检测等。具体应用不仅可以涉及跟真实场景或物品相关的导览、导航、讲解、重建、虚拟效果叠加展示等交互场景，还可以涉及与人相关的特效处理，比如妆容美化、肢体美化、特效展示、虚拟模型展示等交互场景。The present disclosure relates to the field of augmented reality. By acquiring image information of a target object in the real environment and then detecting or recognizing relevant features, states and attributes of the target object with various vision-related algorithms, an AR effect combining virtuality and reality that matches the specific application can be obtained. For example, the target object may involve a face, limbs, gestures or actions related to the human body, markers or signs related to objects, or sand tables, display areas or display items related to venues or places. Vision-related algorithms may involve visual positioning, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. Specific applications may involve not only interactive scenarios related to real scenes or objects, such as tours, navigation, explanations, reconstruction and virtual-effect overlay display, but also special-effect processing related to people, such as makeup beautification, body beautification, special-effect display and virtual model display.
可通过卷积神经网络,实现对目标对象的相关特征、状态及属性进行检测或识别处理。上述卷积神经网络是基于深度学习框架进行模型训练而得到的网络模型。Convolutional neural networks can be used to detect or identify the relevant features, states and attributes of target objects. The above-mentioned convolutional neural network is a network model obtained through model training based on a deep learning framework.
工业实用性Industrial applicability
本公开实施例提供了一种图像分析方法、模型的训练方法、装置、设备、介质及程序，其中，图像分析方法包括：获取图像序列、光流数据和图像序列中各个图像的参考数据；其中，各个图像包括具有共视关系的第一图像和第二图像，光流数据包括第一图像与第二图像之间的静态光流和整体光流，静态光流由摄像器件运动引起，整体光流由摄像器件运动和拍摄对象运动共同引起，且参考数据包括位姿和深度；基于图像序列和光流数据，预测得到分析结果；其中，分析结果包括静态光流的光流校准数据；基于静态光流和光流校准数据，对位姿和深度进行优化，得到更新的位姿和更新的深度。通过上述方案，能够在动态场景下，提升位姿和深度的精度。Embodiments of the present disclosure provide an image analysis method, a model training method, an apparatus, a device, a medium and a program. The image analysis method includes: acquiring an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes the static optical flow and the overall optical flow between the first image and the second image, the static optical flow is caused by the motion of the camera device, the overall optical flow is caused jointly by the motion of the camera device and the motion of the photographed object, and the reference data includes pose and depth; predicting an analysis result based on the image sequence and the optical flow data, wherein the analysis result includes optical flow calibration data of the static optical flow; and optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth. With the above solution, the accuracy of pose and depth can be improved in dynamic scenes.

Claims (21)

  1. 一种图像分析方法,包括:An image analysis method including:
获取图像序列、光流数据和所述图像序列中各个图像的参考数据；其中，所述各个图像包括具有共视关系的第一图像和第二图像，所述光流数据包括所述第一图像与所述第二图像之间的静态光流和整体光流，所述静态光流由摄像器件运动引起，所述整体光流由摄像器件运动和拍摄对象运动共同引起，且所述参考数据包括位姿和深度；acquiring an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of a camera device, the overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the reference data includes a pose and a depth;
    基于所述图像序列和所述光流数据,预测得到分析结果;其中,所述分析结果包括所述静态光流的光流校准数据;Based on the image sequence and the optical flow data, an analysis result is predicted; wherein the analysis result includes optical flow calibration data of the static optical flow;
    基于所述静态光流和所述光流校准数据,对所述位姿和所述深度进行优化,得到更新的位姿和更新的深度。Based on the static optical flow and the optical flow calibration data, the pose and the depth are optimized to obtain an updated pose and an updated depth.
  2. 根据权利要求1所述的方法,其中,所述基于所述图像序列和所述光流数据,预测得到分析结果,包括:The method according to claim 1, wherein the predicting the analysis result based on the image sequence and the optical flow data includes:
基于所述第一图像的图像特征和所述第二图像的图像特征，得到所述第一图像与所述第二图像之间的特征相关数据，并基于所述静态光流将所述第一图像中像素点进行投影，得到所述第一图像中像素点在所述第二图像中的第一投影位置；obtaining feature correlation data between the first image and the second image based on image features of the first image and image features of the second image, and projecting pixel points of the first image based on the static optical flow to obtain first projection positions of the pixel points of the first image in the second image;
    基于所述第一投影位置在所述特征相关数据中搜索,得到目标相关数据;Search the feature-related data based on the first projection position to obtain target-related data;
    基于所述目标相关数据、所述静态光流和所述整体光流,得到所述分析结果。The analysis result is obtained based on the target-related data, the static optical flow and the overall optical flow.
  3. 根据权利要求2所述的方法,其中,所述基于所述目标相关数据、所述静态光流和所述整体光流,得到所述分析结果,包括:The method according to claim 2, wherein obtaining the analysis result based on the target-related data, the static optical flow and the overall optical flow includes:
    基于所述目标相关数据进行编码,得到第一编码特征,并基于所述静态光流和所述整体光流进行编码,得到第二编码特征;Encoding is performed based on the target-related data to obtain a first encoding feature, and encoding is performed based on the static optical flow and the overall optical flow to obtain a second encoding feature;
    基于所述第一编码特征和所述第二编码特征,预测得到所述分析结果。The analysis result is predicted based on the first encoding feature and the second encoding feature.
  4. 根据权利要求1至3任一项所述的方法，其中，所述参考数据还包括动态掩膜，所述动态掩膜用于指示所述图像中的运动对象，所述分析结果还包括置信度图和所述动态掩膜的掩膜校准数据，所述置信度图包括所述图像中各像素点的置信度；所述基于所述静态光流和所述光流校准数据，对所述位姿和所述深度进行优化，得到更新的位姿和更新的深度，包括：The method according to any one of claims 1 to 3, wherein the reference data further includes a dynamic mask used to indicate moving objects in the image, the analysis result further includes a confidence map and mask calibration data of the dynamic mask, and the confidence map includes a confidence of each pixel point in the image; the optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain the updated pose and the updated depth includes:
    基于所述动态掩膜、所述掩膜校准数据和所述置信度图进行融合,得到重要度图,并基于所述光流校准数据对第一投影位置进行校准,得到校准位置;其中,所述重要度图包括所述图像中各像素点的重要度,所述第一投影位置为所述第一图像中像素点基于所述静态光流投影在所述第二图像的像素位置;Fusion is performed based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map, and the first projection position is calibrated based on the optical flow calibration data to obtain a calibration position; wherein, The importance map includes the importance of each pixel in the image, and the first projection position is the pixel position of the pixel in the first image projected on the second image based on the static optical flow;
    基于所述校准位置和所述重要度图,优化得到所述更新的位姿和所述更新的深度。Based on the calibration position and the importance map, the updated pose and the updated depth are optimized.
  5. 根据权利要求4所述的方法,其中,所述光流校准数据包括所述第一图像中像素点的校准光流,所述基于所述光流校准数据对第一投影位置进行校准,得到校准位置,包括:The method according to claim 4, wherein the optical flow calibration data includes the calibration optical flow of pixels in the first image, and the first projection position is calibrated based on the optical flow calibration data to obtain the calibration Locations, including:
    将所述第一图像中像素点的校准光流加上所述像素点在所述第二图像中的第一投影位置,得到所述像素点的校准位置。The calibrated optical flow of the pixel in the first image is added to the first projection position of the pixel in the second image to obtain the calibrated position of the pixel.
  6. 根据权利要求4所述的方法,其中,所述基于所述动态掩膜、所述掩膜校准数据和所述置信度图进行融合,得到重要度图,包括:The method according to claim 4, wherein the fusion based on the dynamic mask, the mask calibration data and the confidence map to obtain an importance map includes:
基于所述掩膜校准数据对所述动态掩膜进行校准，得到校准掩膜；其中，所述校准掩膜包括所述图像中像素点与所述运动对象的相关度，且所述相关度与所述图像中像素点属于所述运动对象的可能性正相关；calibrating the dynamic mask based on the mask calibration data to obtain a calibrated mask, wherein the calibrated mask includes a correlation between each pixel point in the image and the moving object, and the correlation is positively related to the possibility that the pixel point in the image belongs to the moving object;
    基于所述置信度图和所述校准掩膜进行融合,得到所述重要度图。Fusion is performed based on the confidence map and the calibration mask to obtain the importance map.
  7. 根据权利要求1至6任一项所述的方法，其中，所述分析结果还包括动态光流，所述动态光流由所述拍摄对象运动引起；在所述基于所述静态光流和所述光流校准数据，对所述位姿和所述深度进行优化，得到更新的位姿和更新的深度之后，所述方法还包括：The method according to any one of claims 1 to 6, wherein the analysis result further includes a dynamic optical flow, the dynamic optical flow being caused by the motion of the photographed object; after the optimizing the pose and the depth based on the static optical flow and the optical flow calibration data to obtain the updated pose and the updated depth, the method further includes:
    基于所述更新的位姿和所述更新的深度,获取更新的静态光流,并基于所述动态光流和所述更新的静态光流,得到更新的整体光流;Based on the updated pose and the updated depth, obtain an updated static optical flow, and obtain an updated overall optical flow based on the dynamic optical flow and the updated static optical flow;
    基于所述更新的静态光流和所述更新的整体光流,得到更新的光流数据,并基于所述更新的位姿和更新的深度,得到更新的参考数据;Based on the updated static optical flow and the updated overall optical flow, obtain updated optical flow data, and obtain updated reference data based on the updated pose and updated depth;
    重新执行所述基于所述图像序列和所述光流数据,预测得到分析结果的步骤以及后续步骤。Re-execute the step of predicting and obtaining the analysis result based on the image sequence and the optical flow data and subsequent steps.
  8. 根据权利要求7所述的方法,其中,所述基于所述更新的位姿和所述更新的深度,获取更新 的静态光流,包括:The method according to claim 7, wherein said obtaining an updated static optical flow based on the updated pose and the updated depth includes:
    基于所述更新的位姿、所述更新的深度和所述第一图像中像素点的像素位置进行投影,得到所述第一图像中像素点投影在所述第二图像的第二投影位置;Projection is performed based on the updated pose, the updated depth and the pixel position of the pixel in the first image, to obtain a second projection position of the pixel in the first image projected on the second image;
基于所述第一图像中像素点投影在所述第二图像的第二投影位置和所述第一图像中像素点在所述第二图像中的对应位置之间的差异，得到所述更新的静态光流；其中，所述对应位置为在摄像器件未运动的情况下，所述第一图像中像素点所属的空间点投影在所述第二图像的像素位置。obtaining the updated static optical flow based on a difference between the second projection position at which a pixel point of the first image is projected onto the second image and a corresponding position of the pixel point of the first image in the second image, wherein the corresponding position is the pixel position at which the spatial point to which the pixel point of the first image belongs is projected onto the second image when the camera device has not moved.
  9. 根据权利要求7所述的方法,其中,所述基于所述动态光流和所述更新的静态光流,得到更新的整体光流,包括:The method of claim 7, wherein obtaining an updated overall optical flow based on the dynamic optical flow and the updated static optical flow includes:
    将所述动态光流和所述更新的静态光流相加,得到所述更新的整体光流。The dynamic optical flow and the updated static optical flow are added to obtain the updated overall optical flow.
  10. 一种图像分析模型的训练方法,包括:A training method for an image analysis model, including:
获取样本图像序列、样本光流数据和所述样本图像序列中各个样本图像的样本参考数据；其中，所述各个样本图像包括具有共视关系的第一样本图像和第二样本图像，所述样本光流数据包括所述第一样本图像与所述第二样本图像之间的样本静态光流和样本整体光流，所述样本静态光流由摄像器件运动引起，所述样本整体光流由摄像器件运动和拍摄对象运动共同引起，且所述样本参考数据包括样本位姿和样本深度；acquiring a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by motion of a camera device, the sample overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the sample reference data includes a sample pose and a sample depth;
    基于所述图像分析模型对所述样本图像序列和所述样本光流数据进行分析预测,得到样本分析结果;其中,所述样本分析结果包括所述样本静态光流的样本光流校准数据;Analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result; wherein the sample analysis result includes sample optical flow calibration data of the sample static optical flow;
    基于所述样本静态光流和所述样本光流校准数据,对所述样本位姿和所述样本深度进行优化,得到更新的样本位姿和更新的样本深度;Based on the sample static optical flow and the sample optical flow calibration data, optimize the sample pose and the sample depth to obtain an updated sample pose and an updated sample depth;
    基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失;Perform loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model;
    基于所述预测损失,调整所述图像分析模型的网络参数。Based on the prediction loss, network parameters of the image analysis model are adjusted.
  11. 根据权利要求10所述的方法，其中，所述样本参考数据还包括样本动态掩膜，所述样本动态掩膜用于指示所述样本图像中的运动对象，所述样本分析结果还包括样本动态光流和所述样本动态掩膜的样本掩膜校准数据，且所述样本动态光流由拍摄对象运动引起，所述预测损失包括掩膜预测损失；在所述基于所述样本静态光流和所述样本光流校准数据，对所述样本位姿和所述样本深度进行优化，得到更新的样本位姿和更新的样本深度之后，所述方法还包括：The method according to claim 10, wherein the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, the sample analysis result further includes the sample dynamic optical flow and sample mask calibration data of the sample dynamic mask, the sample dynamic optical flow is caused by the motion of the photographed object, and the prediction loss includes a mask prediction loss; after the optimizing the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain the updated sample pose and the updated sample depth, the method further includes:
    基于所述样本动态光流、所述更新的样本位姿和所述更新的样本深度,得到更新的样本整体光流;Based on the sample dynamic optical flow, the updated sample pose and the updated sample depth, an updated sample overall optical flow is obtained;
    所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失,包括:The loss measurement is performed based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, including:
基于所述样本掩膜校准数据和所述样本动态掩膜，得到所述样本动态掩膜在模型维度更新得到的第一预测掩膜，并基于所述更新的样本整体光流、所述更新的样本位姿和所述更新的样本深度，得到所述样本动态掩膜在光流维度更新得到的第二预测掩膜；obtaining, based on the sample mask calibration data and the sample dynamic mask, a first prediction mask resulting from updating the sample dynamic mask in a model dimension, and obtaining, based on the updated sample overall optical flow, the updated sample pose and the updated sample depth, a second prediction mask resulting from updating the sample dynamic mask in an optical flow dimension;
    基于所述第一预测掩膜和所述第二预测掩膜之间的差异,得到所述掩膜预测损失。The mask prediction loss is obtained based on the difference between the first prediction mask and the second prediction mask.
  12. 根据权利要求11所述的方法,其中,所述基于所述更新的样本整体光流、所述更新的样本位姿和所述更新的样本深度,得到所述样本动态掩膜在光流维度更新得到的第二预测掩膜,包括:The method according to claim 11, wherein the sample dynamic mask is updated in the optical flow dimension based on the updated sample overall optical flow, the updated sample pose and the updated sample depth. The resulting second prediction mask includes:
    基于所述更新的样本位姿、所述更新的样本深度和所述第一样本图像中样本像素点的样本像素位置进行投影,得到所述第一样本图像中样本像素点投影在所述第二样本图像的第一样本投影位置;以及,Projection is performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, and the projection of the sample pixel in the first sample image is obtained. the first sample projection position of the second sample image; and,
基于所述更新的样本整体光流和所述第一样本图像中样本像素点的样本像素位置进行投影，得到所述第一样本图像中样本像素点投影在所述第二样本图像的第二样本投影位置；performing projection based on the updated sample overall optical flow and the sample pixel position of the sample pixel point in the first sample image, to obtain a second sample projection position at which the sample pixel point of the first sample image is projected onto the second sample image;
    基于所述第一样本投影位置和所述第二样本投影位置之间的差异,得到所述第二预测掩膜。The second prediction mask is obtained based on the difference between the first sample projection position and the second sample projection position.
  13. 根据权利要求12所述的方法,其中,所述基于所述第一样本投影位置和所述第二样本投影位置之间的差异,得到所述第二预测掩膜,包括:The method of claim 12, wherein obtaining the second prediction mask based on the difference between the first sample projection position and the second sample projection position includes:
基于所述第一样本投影位置与所述第二样本投影位置之间的距离对比预设阈值，得到所述样本像素点的样本掩膜值；其中，所述样本掩膜值用于表示所述样本像素点是否属于所述运动对象；comparing a distance between the first sample projection position and the second sample projection position with a preset threshold to obtain a sample mask value of the sample pixel point, wherein the sample mask value indicates whether the sample pixel point belongs to the moving object;
    基于各个所述样本像素点的样本掩膜值,得到所述第二预测掩膜。The second prediction mask is obtained based on the sample mask value of each sample pixel point.
  14. 根据权利要求10所述的方法,其中,所述样本参考数据还包括样本动态掩膜,所述样本动态掩膜用于指示所述样本图像中的运动对象,且所述预测损失包括几何光度损失;在所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失之前,所述方法还包括:The method of claim 10, wherein the sample reference data further includes a sample dynamic mask used to indicate moving objects in the sample image, and the prediction loss includes a geometric photometric loss ; Before performing loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, the method further includes:
    基于各个与所述第一样本图像具有所述共视关系的第二样本图像的样本动态掩膜进行融合,得到样本融合掩膜;Fusion is performed based on the sample dynamic masks of each second sample image having the common view relationship with the first sample image to obtain a sample fusion mask;
    所述基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失,包括:The loss measurement is performed based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model, including:
    基于所述更新的样本位姿、所述更新的样本深度和所述第一样本图像中样本像素点的样本像素位置进行投影,得到所述第一样本图像中样本像素点投影在所述第二样本图像的第一样本投影位置;Projection is performed based on the updated sample pose, the updated sample depth and the sample pixel position of the sample pixel in the first sample image, and the projection of the sample pixel in the first sample image is obtained. the first sample projection position of the second sample image;
基于所述第一样本图像中样本像素点的样本像素位置，得到所述第一样本图像中样本像素点的第一样本像素值，并基于所述第一样本图像中样本像素点的第一样本投影位置，得到所述第一样本图像中样本像素点的第二样本像素值，以及基于所述样本融合掩膜，得到所述第一样本图像中样本像素点的融合掩膜值；obtaining a first sample pixel value of the sample pixel point in the first sample image based on the sample pixel position of the sample pixel point in the first sample image, obtaining a second sample pixel value of the sample pixel point in the first sample image based on the first sample projection position of the sample pixel point in the first sample image, and obtaining a fusion mask value of the sample pixel point in the first sample image based on the sample fusion mask;
    基于所述第一样本像素值、所述第二样本像素值和所述融合掩膜值,得到所述几何光度损失。The geometric photometric loss is obtained based on the first sample pixel value, the second sample pixel value and the fused mask value.
  15. 根据权利要求14所述的方法,其中,所述基于所述第一样本像素值、所述第二样本像素值和所述融合掩膜值,得到所述几何光度损失,包括:The method of claim 14, wherein obtaining the geometric photometric loss based on the first sample pixel value, the second sample pixel value and the fusion mask value includes:
    获取所述第一样本像素值和所述第二样本像素值之间的像素差值;Obtain the pixel difference between the first sample pixel value and the second sample pixel value;
    利用所述融合掩膜值对所述像素差值进行加权,得到加权差值;Use the fusion mask value to weight the pixel difference value to obtain a weighted difference value;
    基于各个所述样本像素点的加权差值,得到所述几何光度损失。The geometric photometric loss is obtained based on the weighted difference value of each sample pixel point.
  16. 根据权利要求15所述的方法,其中,所述获取所述第一样本像素值和所述第二样本像素值之间的像素差值,包括:The method according to claim 15, wherein said obtaining the pixel difference value between the first sample pixel value and the second sample pixel value includes:
基于结构相似性度量所述第一样本像素值和所述第二样本像素值，得到第一差值，并基于绝对值偏差度量所述第一样本像素值和所述第二样本像素值，得到第二差值；measuring the first sample pixel value and the second sample pixel value based on structural similarity to obtain a first difference value, and measuring the first sample pixel value and the second sample pixel value based on absolute deviation to obtain a second difference value;
    基于所述第一差值和所述第二差值进行加权,得到所述像素差值。Weighting is performed based on the first difference value and the second difference value to obtain the pixel difference value.
  17. 一种图像分析装置,包括:An image analysis device, including:
获取部分，被配置为获取图像序列、光流数据和所述图像序列中各个图像的参考数据；其中，所述各个图像包括具有共视关系的第一图像和第二图像，所述光流数据包括所述第一图像与所述第二图像之间的静态光流和整体光流，所述静态光流由摄像器件运动引起，所述整体光流由摄像器件运动和拍摄对象运动共同引起，且所述参考数据包括位姿和深度；an acquisition part configured to acquire an image sequence, optical flow data and reference data of each image in the image sequence, wherein each image includes a first image and a second image having a common-view relationship, the optical flow data includes a static optical flow and an overall optical flow between the first image and the second image, the static optical flow is caused by motion of a camera device, the overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the reference data includes a pose and a depth;
    分析部分,被配置为基于所述图像序列和所述光流数据,预测得到分析结果;其中,所述分析结果包括所述静态光流的光流校准数据;An analysis part configured to predict an analysis result based on the image sequence and the optical flow data; wherein the analysis result includes optical flow calibration data of the static optical flow;
    优化部分,被配置为基于所述静态光流和所述光流校准数据,对所述位姿和所述深度进行优化,得到更新的位姿和更新的深度。The optimization part is configured to optimize the pose and the depth based on the static optical flow and the optical flow calibration data to obtain an updated pose and an updated depth.
  18. 一种图像分析模型的训练装置,包括:An image analysis model training device, including:
样本获取部分，被配置为获取样本图像序列、样本光流数据和所述样本图像序列中各个样本图像的样本参考数据；其中，所述各个样本图像包括具有共视关系的第一样本图像和第二样本图像，所述样本光流数据包括所述第一样本图像与所述第二样本图像之间的样本静态光流和样本整体光流，所述样本静态光流由摄像器件运动引起，所述样本整体光流由摄像器件运动和拍摄对象运动共同引起，且所述样本参考数据包括样本位姿和样本深度；a sample acquisition part configured to acquire a sample image sequence, sample optical flow data and sample reference data of each sample image in the sample image sequence, wherein each sample image includes a first sample image and a second sample image having a common-view relationship, the sample optical flow data includes a sample static optical flow and a sample overall optical flow between the first sample image and the second sample image, the sample static optical flow is caused by motion of a camera device, the sample overall optical flow is caused jointly by the motion of the camera device and motion of a photographed object, and the sample reference data includes a sample pose and a sample depth;
    样本分析部分,被配置为基于所述图像分析模型对所述样本图像序列和所述样本光流数据进行分析预测,得到样本分析结果;其中,所述样本分析结果包括所述样本静态光流的样本光流校准数据;The sample analysis part is configured to analyze and predict the sample image sequence and the sample optical flow data based on the image analysis model to obtain a sample analysis result; wherein the sample analysis result includes the static optical flow of the sample. Sample optical flow calibration data;
    样本优化部分,被配置为基于所述样本静态光流和所述样本光流校准数据,对所述样本位姿和所述样本深度进行优化,得到更新的样本位姿和更新的样本深度;A sample optimization part configured to optimize the sample pose and the sample depth based on the sample static optical flow and the sample optical flow calibration data to obtain an updated sample pose and an updated sample depth;
    损失度量部分,被配置为基于所述更新的样本位姿和所述更新的样本深度进行损失度量,得到所述图像分析模型的预测损失;a loss measurement part configured to perform loss measurement based on the updated sample pose and the updated sample depth to obtain the predicted loss of the image analysis model;
    参数调整部分,被配置为基于所述预测损失,调整所述图像分析模型的网络参数。The parameter adjustment part is configured to adjust the network parameters of the image analysis model based on the prediction loss.
  19. 一种电子设备，包括相互耦接的存储器和处理器，所述处理器用于执行所述存储器中存储的程序指令，以实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。An electronic device, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the image analysis method according to any one of claims 1 to 9, or to implement the training method of the image analysis model according to any one of claims 10 to 16.
  20. 一种计算机可读存储介质，其上存储有程序指令，所述程序指令被处理器执行时实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。A computer-readable storage medium having program instructions stored thereon, the program instructions, when executed by a processor, implementing the image analysis method according to any one of claims 1 to 9, or implementing the training method of the image analysis model according to any one of claims 10 to 16.
  21. 一种计算机程序，所述计算机程序包括计算机可读代码，在所述计算机可读代码在电子设备中运行的情况下，所述电子设备的处理器执行用于实现权利要求1至9任一项所述的图像分析方法，或实现权利要求10至16任一项所述的图像分析模型的训练方法。A computer program comprising computer readable code, wherein, when the computer readable code is run in an electronic device, a processor of the electronic device executes instructions for implementing the image analysis method according to any one of claims 1 to 9, or for implementing the training method of the image analysis model according to any one of claims 10 to 16.
PCT/CN2022/119646 2022-03-25 2022-09-19 Image analysis method and apparatus, model training method and apparatus, and device, medium and program WO2023178951A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210307855.3 2022-03-25
CN202210307855.3A CN114612545A (en) 2022-03-25 2022-03-25 Image analysis method and training method, device, equipment and medium of related model

Publications (1)

Publication Number Publication Date
WO2023178951A1 true WO2023178951A1 (en) 2023-09-28

Family

ID=81867129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119646 WO2023178951A1 (en) 2022-03-25 2022-09-19 Image analysis method and apparatus, model training method and apparatus, and device, medium and program

Country Status (2)

Country Link
CN (1) CN114612545A (en)
WO (1) WO2023178951A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612545A (en) * 2022-03-25 2022-06-10 浙江商汤科技开发有限公司 Image analysis method and training method, device, equipment and medium of related model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311664A (en) * 2020-03-03 2020-06-19 上海交通大学 Joint unsupervised estimation method and system for depth, pose and scene stream
US20200211206A1 (en) * 2018-12-27 2020-07-02 Baidu Usa Llc Joint learning of geometry and motion with three-dimensional holistic understanding
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN111797688A (en) * 2020-06-02 2020-10-20 武汉大学 Visual SLAM method based on optical flow and semantic segmentation
CN112686952A (en) * 2020-12-10 2021-04-20 中国科学院深圳先进技术研究院 Image optical flow computing system, method and application
CN112884813A (en) * 2021-02-18 2021-06-01 北京小米松果电子有限公司 Image processing method, device and storage medium
CN114612545A (en) * 2022-03-25 2022-06-10 浙江商汤科技开发有限公司 Image analysis method and training method, device, equipment and medium of related model

Also Published As

Publication number Publication date
CN114612545A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
Dai et al. Rgb-d slam in dynamic environments using point correlations
JP7009399B2 (en) Detection of objects in video data
Walch et al. Image-based localization using lstms for structured feature correlation
CN107980150B (en) Modeling three-dimensional space
Baak et al. A data-driven approach for real-time full body pose reconstruction from a depth camera
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
Dockstader et al. Multiple camera tracking of interacting and occluded human motion
US20130335528A1 (en) Imaging device capable of producing three dimensional representations and methods of use
Boniardi et al. Robot localization in floor plans using a room layout edge extraction network
Wang et al. Tracking everything everywhere all at once
CN108229347B (en) Method and apparatus for deep replacement of quasi-Gibbs structure sampling for human recognition
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
Labbé et al. Single-view robot pose and joint angle estimation via render & compare
Košecka Detecting changes in images of street scenes
Liu et al. Single-view 3D scene reconstruction and parsing by attribute grammar
WO2022252487A1 (en) Pose acquisition method, apparatus, electronic device, storage medium, and program
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
CN110070578B (en) Loop detection method
Zhang et al. Hand-held monocular SLAM based on line segments
CN111105439A (en) Synchronous positioning and mapping method using residual attention mechanism network
WO2023178951A1 (en) Image analysis method and apparatus, model training method and apparatus, and device, medium and program
Phalak et al. Scan2plan: Efficient floorplan generation from 3d scans of indoor scenes
US11188787B1 (en) End-to-end room layout estimation
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
Chen et al. StateNet: Deep state learning for robust feature matching of remote sensing images

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22933003

Country of ref document: EP

Kind code of ref document: A1