CN117853645A - Image rendering method based on cross-view bundling cross-perception neural radiance field - Google Patents

Image rendering method based on cross-view bundling cross-perception neural radiance field

Info

Publication number
CN117853645A
CN117853645A (application number CN202410242047.2A)
Authority
CN
China
Prior art keywords
view
cross
feature
information
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410242047.2A
Other languages
Chinese (zh)
Other versions
CN117853645B (en)
Inventor
曹明伟
王凤娜
黄宝龙
曹志伟
赵海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202410242047.2A
Priority claimed from CN202410242047.2A
Publication of CN117853645A
Application granted
Publication of CN117853645B
Legal status: Active
Anticipated expiration

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image rendering method based on a cross-view bundling cross-perception neural radiance field. A lightweight bundle feature extraction module is used to compute multi-scale bundled convolution features; a hybrid cross-perception mechanism refines the convolution features, aggregates cross-view interaction information and supplements contextual feature information in a timely manner. Similarity-embedded volume rendering is then performed, in which the residual rotation similarity derived from geometric prior conditions serves as an explicit matching embedding cue and captures similarity matching information for the predicted color-density field. Finally, the color and content consistency loss is minimized by adding the difference of the logarithmic absolute values of the rendered view and the real instantaneous view within the same timestamp, which alleviates artifacts and blurred textures under conditions of large scene spans and scarce feature information.

Description

Image rendering method based on cross-view bundling cross-perception neural radiance field
Technical Field
The invention relates to computer vision and computer graphics, and in particular to an image rendering method based on a cross-view bundling cross-perception neural radiance field.
Background
Neural Radiance Fields (NeRF) are a view synthesis technique. Research work extending neural radiance fields falls mainly into the following categories: (1) dynamic scene rendering; (2) occluded scene rendering; (3) large-scale outdoor scene rendering; (4) cross-scene rendering; (5) sparse-input scene rendering. However, some of this work requires large view datasets with dense camera poses as training sets, while other work uses only a small portion of real-scene data together with a large amount of synthetic data, which has shortcomings in practical applications: (1) collecting large datasets is very time-consuming and expensive; (2) most views in real application scenarios are unconstrained; (3) synthetic datasets contain little noise, which is detrimental to reproducing texture details on real-world datasets. Therefore, achieving realistic rendering of real scenes from sparse views and generalizing the neural radiance field into a cross-scene framework are gradually becoming the mainstream directions for a breakthrough in practicality. In practical applications, the sparse-view constraint poses new challenges to generalized neural radiance fields: sparse input of one or a few views cannot satisfy the requirements of the basic neural radiance field framework, so the model overfits or produces severe depth-estimation offsets under sparse input, which affects the final rendering quality. The lack of 3D scene information caused by sparse input degrades depth-estimation accuracy and in turn seriously degrades the rendering quality of the generalized neural radiance field framework. Current methods for addressing these problems fall roughly into two categories: (a) using additional regularization supervision; (b) using a pre-trained model. The regularization approach does not require dense view input for training; it continuously optimizes each scene and obtains a high-quality synthesized view through fine-tuning, avoiding the deceptively good results caused by overfitting during training. Existing regularization methods introduce additional prior information or data to supervise their models, such as extra training modules, dense camera-pose data, depth maps of unknown regions, or better data structures; although this solves part of the rendering problem, it makes the whole rendering pipeline heavier and consumes extra computing resources. The pre-training approach trains on a large dataset in an early stage so that the model learns scenes and can later generalize to different scenes; this process takes a great deal of time, and collecting large view datasets is difficult. Neither approach is therefore satisfactory in terms of efficiency and practical generalizability. A lightweight, low-cost, high-quality generalized neural radiance field method that achieves realistic novel view rendering from sparse input is urgently needed to realize high-fidelity cross-scene view rendering.
In recent years, neural radiance field technology has developed greatly in the field of neural rendering, and some researchers have extended it to cross-scene generalized neural radiance fields using sparse input, providing new possibilities for optimizing neural radiance fields and breaking their limitations. Related research papers include: [1] IBRNet: Learning Multi-View Image-Based Rendering, [2] pixelNeRF: Neural Radiance Fields from One or Few Images, [3] ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning. Paper [1], published at the Computer Vision and Pattern Recognition (CVPR) conference in 2021, is a cross-scene generalized neural radiance field method based on sparse view interpolation; paper [2], published at CVPR 2021, is a cross-scene generalized neural radiance field method based on fully convolutional networks; paper [3], published at CVPR 2023, is a cross-scene generalized neural radiance field method based on geometry-aware learning, mainly addressing the artifacts that appear when transferring from synthetic views to real-world views. The core idea of these generalized neural radiance field methods is to extend the neural radiance field to cross-scene applications so as to improve training efficiency and view rendering quality.
Nevertheless, existing cross-scene generalized neural radiance field approaches still face the following challenges in real-world applications: (1) during training, existing generalized neural radiance field methods for novel view rendering use dense views with strong consistency constraints, which makes it difficult to meet the application requirements of fields such as the metaverse, digital twins, digital protection of cultural heritage, virtual reality and augmented reality; (2) conventional generalized neural radiance field methods based on regularization or pre-training require a large amount of additional training resources or consume a large amount of training time, contrary to the goals of lightweight and efficient generalization; (3) conventional view rendering methods based on generalized neural radiance fields rely too heavily on simple feature concatenation or on unrestricted feature enhancement with a Transformer, ignoring the effect of cross-scene interaction feature information and thus making it difficult to obtain high-quality texture detail information.
Disclosure of Invention
Purpose of the invention: the invention aims to overcome the defects of the prior art and provides an image rendering method based on a cross-view bundling cross-perception neural radiance field.
Technical scheme: the image rendering method based on a cross-view bundling cross-perception neural radiance field disclosed by the invention comprises the following steps:
S1, constructing a bundle feature extraction module BUF to extract view features, specifically as follows: for the view dataset I, a lightweight feature pyramid network (Feature Pyramid Network, FPN) is first used to extract coarse-grained global features at the coarser resolution scales; a two-dimensional convolutional neural network (Convolutional Neural Networks, CNN) is then used to extract fine-grained local features of the original images at the finer resolution scales; in this coarse-to-fine, incremental manner, the set of convolution features sampled at the different resolutions for the N views of the dataset I is obtained, abbreviated as F;
wherein N denotes the number of views in the view dataset I, r denotes the ratio relative to the original image resolution, i denotes the index of a view, and f_i^r denotes the convolution feature of the i-th view at resolution r; the bundle feature extraction module BUF keeps the whole training model lightweight and thus more efficient, reducing memory occupancy and computational resource consumption;
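For illustration only, the following PyTorch-style sketch shows one way such a coarse-to-fine bundle feature extractor could be organized, with strided convolutions standing in for the lightweight FPN branch and a shallow CNN branch for the fine-grained local features; the class name BundleFeatureExtractor, the channel width and the strides are assumptions made for exposition, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class BundleFeatureExtractor(nn.Module):
    """Hypothetical BUF: coarse global features (FPN-like) + fine local features (CNN)."""
    def __init__(self, in_ch: int = 3, dim: int = 32):
        super().__init__()
        # Coarse branch: strided convolutions standing in for a lightweight FPN.
        self.coarse = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Fine branch: shallow 2D CNN near the original resolution.
        self.fine = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, views: torch.Tensor):
        """views: (N, 3, H, W) -> dict of multi-scale per-view convolution features."""
        return {"coarse": self.coarse(views), "fine": self.fine(views)}
```

A call such as BundleFeatureExtractor()(views) with views of shape (N, 3, H, W) would then yield the multi-scale per-view convolution features that step S2 samples along viewing directions.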
S2, processing the matching pairs of the view dataset I with the hybrid cross-perception mechanism (SCC) to aggregate cross-view interaction information, the specific process being as follows:
the view-dependent features and the cross-view interaction features that are independent of any single view are aggregated to obtain mixed features;
wherein the view-dependent features refer to the sampled convolution feature set obtained by the bundle feature extraction module, abbreviated as F, and the cross-view interaction features are the image convolution features of F further enhanced by the hybrid cross-perception mechanism;
since each view is paired with its possible offset views and no fixed paired view is designated, matching pairs of views and their feature vectors are formed over the whole dataset; i and j both denote view indices, N denotes the number of views in the view dataset I, v denotes the feature vector used to compute the residual rotation similarity between views, and d is the viewing direction;
S3, constructing a multi-dimensional cascaded hybrid cross-attention module to compute contextual feature information: during the multi-dimensional cascading of the hybrid cross-perception mechanism, multi-granularity convolution features are fed between the cascade modules, and each cascade module feeds its own coarse- and fine-granularity sampled convolution features to the next cascade, supplementing contextual feature information in time for the later computation of high-resolution features;
S4, performing similarity-embedded volume rendering: the residual rotation similarity derived from geometric prior conditions is embedded, as an explicit matching embedding cue, into the 5D geometric information (x, d) used in neural radiance field rendering, enriching the geometric prior information and capturing similarity matching information for the predicted color-density field; here x denotes the three-dimensional coordinates (x, y, z) and d denotes the viewing direction;
S5, minimizing the color and content consistency loss: the new regularized baseline loss is used for end-to-end training of the model, continuously optimizing the model parameters without additional external ground-truth data or extra training resources:
(1)
wherein λ is a weighting constant, L_color is the color cross-entropy loss, and L_content is the content cycle-consistency loss;
S6, for any input viewing-angle image, a new image with a high-fidelity visual effect is rendered using the trained bundling cross-perception neural radiance field (comprising the BUF, SCC and other modules) and the predicted scene color-density field.
Further, in step S2, bilinear sampling is performed on the view feature set along a particular viewing direction d, and the hybrid cross-perception mechanism SCC processes the matching pairs of the view dataset I to aggregate cross-view interaction information;
wherein the cross-view interaction feature that is independent of any single view is computed as follows:
in the above formula, f_{i,d} denotes the convolution feature information of image i sampled in direction d, and f_{j,d} denotes the convolution feature information of image j sampled in direction d; a shared mapping processes f_{i,d} and f_{j,d} separately for feature enhancement and context information supplementation; cos(·,·) is the cosine similarity function; i and j both denote view indices.
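Purely as an illustrative sketch of the aggregation idea (not the patent's exact formula), the following PyTorch-style function combines per-view features sampled along a direction d into a cross-view interaction feature using cosine similarity between view pairs; the function name aggregate_cross_view, the tensor shapes and the softmax-weighted mean over pairs are assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_cross_view(feats: torch.Tensor) -> torch.Tensor:
    """Cross-view interaction feature independent of any single view.

    feats: (N, P, C) convolution features of N views sampled at P points along
           a viewing direction d (shape and names are assumptions).
    Returns: (P, C) aggregated interaction feature.
    """
    n, p, c = feats.shape
    normed = F.normalize(feats, dim=-1)                           # unit-norm per feature
    sim = torch.einsum('ipc,jpc->ijp', normed, normed)            # pairwise cosine similarity
    eye = torch.eye(n, dtype=torch.bool).unsqueeze(-1)            # (N, N, 1) self-pairs
    weights = torch.softmax(sim.masked_fill(eye, -1e9), dim=1)    # weight the other views only
    mixed = torch.einsum('ijp,jpc->ipc', weights, feats)          # similarity-weighted mixture
    return mixed.mean(dim=0)                                      # average over source views
```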
Further, when computing the contextual feature information in step S3, the hybrid cross-perception mechanism is a multi-dimensional cascading process, and each cascade uses two modules in sequence: a hybrid cross-attention module CS and a cross-attention module cross. The hybrid cross-attention module CS comprises a self-attention layer and a cross-attention layer; the cross-attention layer balances the self-attention layer, preventing the bias caused by over-reliance on local features when learning them, which better suits sparse-input conditions and prevents overfitting. The cross-attention module cross cooperates with the hybrid cross-attention module CS to better learn interactive global features.
During the multi-dimensional cascading of the hybrid cross-perception mechanism, multi-granularity convolution features are fed between the cascade modules: each cascade module feeds its own coarse- and fine-granularity sampled convolution features to the next cascade, supplementing contextual feature information in time for the later computation of high-resolution features; the resource consumption of this process is negligible.
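As a sketch only, the following PyTorch-style code illustrates one possible arrangement of such a cascade: a hybrid cross-attention (CS) block containing a self-attention layer balanced by a cross-attention layer, followed by a plain cross-attention block, with both coarse and fine features passed on to the next cascade; the class names, head counts and use of nn.MultiheadAttention are assumptions, not the patent's actual modules.

```python
import torch
import torch.nn as nn

class HybridCrossAttention(nn.Module):
    """CS block: a self-attention layer balanced by a cross-attention layer."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attn(x, x, x)[0]                # local (self) attention
        x = x + self.cross_attn(x, context, context)[0]   # balanced by cross attention
        return x

class CascadeStage(nn.Module):
    """One cascade: CS block followed by a plain cross-attention ('cross') block."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cs = HybridCrossAttention(dim, heads)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor):
        mixed = self.cs(fine, coarse)                      # refine fine features with coarse context
        mixed = mixed + self.cross(mixed, coarse, coarse)[0]
        # Feed both granularities forward so the next cascade keeps the context information.
        return coarse, mixed
```

Stacking several CascadeStage instances and threading the (coarse, fine) pair through them corresponds to the multi-granularity feature feeding described above.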
To prevent a large amount of related information from being lost in the frequent cross-dimension information conversion steps, the method computes the residual rotation similarity of the cross-perception features in groups, calculating the similarity within each group separately so that more feature information is retained. Step S4, similarity-embedded volume rendering, proceeds as follows:
S4.1, the feature matching pair information of the i-th view and the j-th view is divided into M groups;
(2)
(3)
wherein x denotes the three-dimensional coordinates (x, y, z) and d denotes the viewing direction; the m-th group refers to the m-th group of the feature matching pair of the i-th view and the j-th view; the residual rotation similarity set collects the per-group similarities of the i-th view and the j-th view; the similarity function computes the residual rotation similarity of the m-th group of feature information of the matching pair formed by the i-th and j-th view features, from the m-th group of feature information of the i-th view feature and the m-th group of feature information of the j-th view feature; the result denotes the residual rotation similarity of the M-th group of feature information of the matching pair of the i-th and j-th view features;
S4.2, feature refinement and aggregation are carried out at the two resolutions, and residual rotation similarity matching of the features is performed at each of the two resolutions to obtain the corresponding residual rotation similarity subsets;
S4.3, finally, step S4.2 is repeated in a loop over the N views of the dataset I, and the average is taken to obtain the final residual rotation similarity set and residual rotation similarity:
(4)
The similarity obtained here is embedded into the original 5D geometric information as an explicit matching condition cue; as an important newly added geometric prior condition, it captures similarity matching information for the predicted color-density field and thereby enhances the generalization capability of the model.
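To illustrate the grouping idea, the sketch below splits the channel dimension of two views' sampled features into M groups, computes a cosine-style similarity per group, and averages over view pairs; the group count, tensor shapes and the use of plain cosine similarity as a stand-in for the residual rotation similarity are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def grouped_similarity(feat_i: torch.Tensor, feat_j: torch.Tensor, groups: int) -> torch.Tensor:
    """Split the channel dimension into groups and compute one similarity per group.

    feat_i, feat_j: (P, C) sampled features of views i and j at P points;
    assumes C is divisible by `groups`.
    Returns: (P, groups) per-group similarities (embedded later as matching cues).
    """
    p, c = feat_i.shape
    gi = feat_i.view(p, groups, c // groups)    # (P, M, C/M)
    gj = feat_j.view(p, groups, c // groups)
    return F.cosine_similarity(gi, gj, dim=-1)  # (P, M)

def similarity_embedding(feats: torch.Tensor, groups: int = 8) -> torch.Tensor:
    """Average the grouped similarities over all view pairs (i, j), i != j.

    feats: (N, P, C) features of N views; the (P, groups) output would be
    concatenated with the 5D inputs (x, d) before the radiance-field network.
    """
    n = feats.shape[0]
    sims = [grouped_similarity(feats[i], feats[j], groups)
            for i in range(n) for j in range(n) if i != j]
    return torch.stack(sims, dim=0).mean(dim=0)
```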
To prevent a large amount of related information from being lost in the frequent cross-dimension information conversion steps, the invention adopts this grouping scheme when computing the residual rotation similarity of the cross-perception features of step S3, computing the similarity group by group so that more feature information is retained; step S4 further optimizes this similarity computation.
Introducing gamma correction into a sparse-input rendering framework that uses only the traditional color loss addresses the problem that the views and camera poses of pretrained real-world standard datasets carry many constraints and limits inconsistent with most real application conditions, and prevents deceptively good metrics from arising during training. When truly facing sparse input with a large ray time span, the traditional color rendering loss alone is insufficient to support the overall rendering quality and the reproduction of fine textures. The detailed calculation of step S5 is as follows:
S5.1, the color loss is the sum of the cross-entropy losses between the rendered colors of the coarse-granularity and fine-granularity stages and the ground-truth color:
(5)
wherein R is the set of rays in a batch, C_c(r) and C_f(r) denote the rendered colors predicted by the coarse stage and the fine stage along a ray r, and C_gt(r) denotes the ground-truth color;
S5.2, gamma correction is introduced into the content loss function, so that rendering quality is guaranteed under sparse-input cross-scene conditions while no extra supervision data or resources are used. The gamma correction introduces a gamma correction value for each piece of instantaneous image data and obtains its linear color through a logarithmic transformation of the view, thereby establishing a relation between the instantaneous image and the observation image in the picture dataset, specifically as follows:
the image dataset is treated as a set of tuples each containing a timestamp, a 2D coordinate point and a polarity; the timestamp here is an implicit variable that does not denote a specific temporal order, but is only used to judge, from whether each picture occurs, whether a view and an instantaneous view fall within the same timestamp, yielding the difference between the logarithmic absolute value of the instantaneous intensity image and the logarithmic absolute value of the observation image at that timestamp:
(6)
wherein the gamma-corrected linear color defined above is substituted into formula (6) to give:
(7)
The difference between the logarithmic absolute values of the rendered view and the real instantaneous view within the same timestamp establishes the connection between them, which better handles artifacts and blurred textures under real application conditions with large scene spans and scarce feature information;
S5.3, the new regularized baseline loss function is finally obtained, comprising the color cross-entropy loss and the content cycle-consistency loss, with the expression as in formula (1);
wherein λ = 0.1 is the weighting constant, L_color is the color cross-entropy loss, and L_content is the content cycle-consistency loss.
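As an illustrative sketch only, the following PyTorch-style code shows one plausible way to combine a coarse/fine color loss with a gamma-corrected log-difference content term as described above; the function names, the binary-cross-entropy reading of "color cross entropy", the gamma value and the weight 0.1 applied to the content term are assumptions, not the patent's exact formulas.

```python
import torch
import torch.nn.functional as F

def color_loss(coarse_rgb, fine_rgb, gt_rgb):
    # Sum of cross-entropy losses of the coarse and fine stages against ground truth,
    # with RGB values in [0, 1] treated as probabilities (an interpretive assumption).
    return (F.binary_cross_entropy(coarse_rgb.clamp(1e-6, 1 - 1e-6), gt_rgb)
            + F.binary_cross_entropy(fine_rgb.clamp(1e-6, 1 - 1e-6), gt_rgb))

def content_loss(rendered, instantaneous, gamma=2.2, eps=1e-6):
    # Gamma-correct both images, then compare the absolute values of their logarithms
    # within the same timestamp, in the spirit of formulas (6)-(7).
    log_r = torch.log(rendered.clamp_min(eps) ** (1.0 / gamma))
    log_i = torch.log(instantaneous.clamp_min(eps) ** (1.0 / gamma))
    return (log_r.abs() - log_i.abs()).abs().mean()

def baseline_loss(coarse_rgb, fine_rgb, gt_rgb, rendered_view, instant_view, lam=0.1):
    # New regularized baseline: color cross-entropy plus weighted content cycle-consistency.
    return color_loss(coarse_rgb, fine_rgb, gt_rgb) + lam * content_loss(rendered_view, instant_view)
```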
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
(1) The invention constructs a lightweight bundle feature extraction module (Bundle Feature extraction module, BUF) that processes feature information incrementally, reducing memory occupancy and computational resource consumption. The lightweight BUF first uses a lightweight FPN to extract coarse-grained global features at the coarser resolutions, and then extracts fine-grained local features at the finer resolutions;
(2) Through the hybrid cross-perception attention mechanism, the invention refines features and assists the sparse-input novel view synthesis task, obtaining high-quality rendering in cross-scene novel view synthesis challenges with insufficient geometric information, and providing new possibilities for improving the practical applicability and generalization of the generalized neural radiance field framework.
(3) The invention uses a new regularized baseline to increase the robustness of color rendering for scenes with offset viewing angles, thereby optimizing scene texture details.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall network framework of the present invention;
FIG. 3 is a schematic diagram of an embodiment of the present invention;
FIG. 4 is a new view of an embodiment of the invention;
FIG. 5 is a diagram showing an experimental comparison between the present invention and prior-art schemes.
Detailed Description
The technical scheme of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
Existing advanced methods for cross-scene generalized neural radiance fields usually ignore the importance of cross-scene interaction feature information when predicting the color-density field of a scene. Simple feature concatenation alone, or unrestricted feature enhancement with a Transformer alone, cannot render high-quality scene texture detail information.
As shown in FIG. 1 and FIG. 2, the image rendering method based on the cross-view bundling cross-perception neural radiance field of this embodiment comprises the following steps:
S1, constructing a bundle feature extraction module BUF to extract view features, specifically as follows: for the view dataset I, a lightweight feature pyramid network FPN is first used to extract coarse-grained global features at the coarser resolution scales; a two-dimensional convolutional neural network CNN is then used to extract fine-grained local features of the original images at the finer resolution scales; in this coarse-to-fine, incremental manner, the set of convolution features sampled at the different resolutions for the N views of the dataset I is obtained, abbreviated as F;
wherein N denotes the number of views in the view dataset I, r denotes the ratio relative to the original image resolution, i denotes the index of a view, and f_i^r denotes the convolution feature of the i-th view at resolution r;
S2, processing the matching pairs of the view dataset I with the hybrid cross-perception mechanism SCC to aggregate cross-view interaction information, the specific process being as follows:
the view-dependent features and the cross-view interaction features that are independent of any single view are aggregated to obtain mixed features; since each view is paired with its possible offset views and no fixed paired view is designated, matching pairs of views and their feature vectors are formed over the whole dataset;
wherein the cross-view interaction features are the image convolution features of F further enhanced by the hybrid cross-perception mechanism; j is the index of a view, v denotes the feature vector used to compute the residual rotation similarity between views, and d is the viewing direction;
S3, constructing a multi-dimensional cascaded hybrid cross-attention module to compute contextual feature information: during the multi-dimensional cascading of the hybrid cross-perception mechanism, multi-granularity convolution features are fed between the cascade modules, and each cascade module feeds its own coarse- and fine-granularity sampled convolution features to the next cascade, supplementing contextual feature information in time for the later computation of high-resolution features;
S4, performing similarity-embedded volume rendering: the residual rotation similarity derived from geometric prior conditions is embedded, as an explicit matching embedding cue, into the 5D geometric information (x, d) used in neural radiance field rendering, capturing similarity matching information for the predicted color-density field; here x denotes the three-dimensional coordinates (x, y, z) and d denotes the viewing direction;
S5, computing and minimizing the color and content consistency loss: the new regularized baseline loss is used for end-to-end training of the model, continuously optimizing the model parameters without additional external ground-truth data or extra training resources:
(1)
wherein λ = 0.1 is the weighting constant, L_color is the color cross-entropy loss, and L_content is the content cycle-consistency loss;
S6, for any input viewing-angle image, a new image with a high-fidelity visual effect is rendered using the trained bundling cross-perception neural radiance field and the predicted scene color-density field.
In step S2 of this embodiment, bilinear sampling is performed on the view feature set along a particular viewing direction d, and the hybrid cross-perception mechanism SCC processes the matching pairs of the view dataset I to aggregate cross-view interaction information;
wherein the cross-view interaction feature that is independent of any single view is computed as follows:
in the above formula, f_{i,d} denotes the convolution feature information of image i sampled in direction d, and f_{j,d} denotes the convolution feature information of image j sampled in direction d; a shared mapping processes f_{i,d} and f_{j,d} separately for feature enhancement and context information supplementation; cos(·,·) is the cosine similarity function; i and j both denote view indices.
During the multi-dimensional cascading of the multi-dimensional cascaded hybrid cross-attention module, multi-granularity convolution features are fed between the cascade modules, supplementing contextual feature information in time for the later computation of high-resolution features; the resource consumption of this process is negligible. Each cascade uses two modules in sequence: a hybrid cross-attention module CS (Hybrid cross-attention module) and a cross-attention module cross. The hybrid cross-attention module CS comprises a self-attention layer and a cross-attention layer; the cross-attention layer balances the self-attention layer, preventing the bias caused by over-reliance on local features when learning them, which better suits sparse-input conditions and prevents overfitting. The cross-attention module cross cooperates with the hybrid cross-attention module CS to better learn interactive global features.
The multi-dimensional cascaded hybrid cross-attention module combines the hybrid cross-attention module CS with the cross-attention module cross; it supports the image feature processing and learning of the whole framework under sparse-input conditions, and performs multi-granularity convolution feature feeding between the cascade modules while enhancing and processing the multi-scale convolution features, that is, the contextual information is supplemented in time.
Examples
A sample of the input image data of this embodiment is shown in FIG. 3, and a new view output by this embodiment is shown in FIG. 4; it can be seen that the newly synthesized image has clear texture information.
As can be seen from the above embodiment, the invention first inputs a pre-training view sequence; then extracts multi-scale bundled convolution features; next aggregates cross-view interaction information; then supplements contextual feature information; then performs similarity-embedded rendering; then minimizes the color and content consistency loss; and finally outputs a high-fidelity rendered image (as shown in FIG. 4).
The invention makes full use of the bundle feature extraction scheme, processes feature information incrementally, refines features through the hybrid cross-perception attention mechanism, aggregates cross-view interaction feature information, assists the sparse-input novel view synthesis task in which geometric information is scarce, performs end-to-end model training with the new regularized baseline and continuously optimizes the model parameters, so that a more accurate scene color-density field and a higher-quality rendered image can be generated in cross-scene situations; the sketch below illustrates the overall flow.
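As an overall illustration, the pseudocode-style sketch below strings the modules together for one training iteration, reusing the color_loss and content_loss sketches given earlier; every name here (buf, scc, radiance_field, train_step) is a hypothetical placeholder for the modules of FIG. 2 rather than the actual implementation.

```python
import torch

def train_step(views, poses, gt_colors, buf, scc, radiance_field, optimizer, lam=0.1):
    """One hypothetical training iteration of the bundled cross-perception NeRF.

    views: (N, 3, H, W) sparse input views; poses: camera poses; gt_colors: ground truth.
    buf, scc, radiance_field: placeholder BUF, SCC and radiance-field modules.
    """
    feats = buf(views)                           # S1: multi-scale bundled convolution features
    mixed, similarity = scc(feats, poses)        # S2-S4: cross-view interaction + grouped similarity
    coarse_rgb, fine_rgb = radiance_field(poses, mixed, similarity)  # similarity-embedded rendering
    # S5: regularized baseline loss (ground-truth image used as the instantaneous view here
    # purely for simplicity of the sketch).
    loss = (color_loss(coarse_rgb, fine_rgb, gt_colors)
            + lam * content_loss(fine_rgb, gt_colors))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```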
Further comparing the technical scheme of the invention with several prior-art schemes (MVSNeRF, GeoNeRF and MatchNeRF, with GroundTruth as the reference), the final experimental results are shown in FIG. 5, where the "Ours" row shows the visual results of the technical scheme of the invention (BCS-NeRF); the boxed regions in each image are the parts compared with the other methods and are shown enlarged, from which it can be clearly observed that the technical scheme of the invention reproduces texture details better.

Claims (5)

1. An image rendering method based on a cross-view bundling cross-perception neural radiance field, characterized by comprising the following steps:
S1, constructing a bundle feature extraction module BUF to extract view features, specifically as follows: for the view dataset I, a lightweight feature pyramid network FPN is first used to extract coarse-grained global features at the coarser resolution scales; a two-dimensional convolutional neural network CNN is then used to extract fine-grained local features of the original images at the finer resolution scales; in this coarse-to-fine, incremental manner, the set of convolution features sampled at the different resolutions for the N views of the dataset I is obtained, abbreviated as F;
wherein N denotes the number of views in the view dataset I, r denotes the ratio relative to the original image resolution, i denotes the index of a view, and f_i^r denotes the convolution feature of the i-th view at resolution r;
S2, processing the matching pairs of the view dataset I with the hybrid cross-perception mechanism SCC to aggregate cross-view interaction information, the specific process being as follows:
the view-dependent features and the cross-view interaction features that are independent of any single view are aggregated to obtain mixed features; since each view is paired with its possible offset views and no fixed paired view is designated, matching pairs of views and their feature vectors are formed over the whole dataset;
wherein the cross-view interaction features are the image convolution features of F further enhanced by the hybrid cross-perception mechanism; j is the index of a view, v denotes the feature vector used to compute the residual rotation similarity between views, and d is the viewing direction;
S3, constructing a multi-dimensional cascaded hybrid cross-attention module to compute contextual feature information, realizing multi-granularity convolution feature feeding between the cascade modules, wherein each cascade module feeds its own coarse- and fine-granularity sampled convolution features to the next cascade, supplementing contextual feature information in time for the later computation of high-resolution features;
S4, performing similarity-embedded volume rendering: the residual rotation similarity derived from geometric prior conditions is embedded, as an explicit matching embedding cue, into the 5D geometric information (x, d) used in neural radiance field rendering, capturing similarity matching information for the predicted color-density field; here x denotes the three-dimensional coordinates (x, y, z) and d denotes the viewing direction;
S5, computing and minimizing the color and content consistency loss: the new regularized baseline loss is used for end-to-end training of the model, continuously optimizing the model parameters without additional external ground-truth data or extra training resources:
(1)
wherein λ is a weighting constant, L_color is the color cross-entropy loss, and L_content is the content cycle-consistency loss;
S6, for any input viewing-angle image, a new image with a high-fidelity visual effect is rendered using the trained bundling cross-perception neural radiance field and the predicted scene color-density field.
2. The image rendering method based on a cross-view bundling cross-perception neural radiance field according to claim 1, wherein in step S2, bilinear sampling is performed on the view feature set along a particular viewing direction d, and the hybrid cross-perception mechanism SCC processes the matching pairs of the view dataset I to aggregate cross-view interaction information;
wherein the cross-view interaction feature that is independent of any single view is computed as follows:
in the above formula, f_{i,d} denotes the convolution feature information of image i sampled in direction d, and f_{j,d} denotes the convolution feature information of image j sampled in direction d; a shared mapping processes f_{i,d} and f_{j,d} separately for feature enhancement and context information supplementation; cos(·,·) is the cosine similarity function.
3. The image rendering method based on a cross-view bundling cross-perception neural radiance field according to claim 1, characterized in that: in step S3, each cascading process of the multi-dimensional cascaded hybrid cross-attention module uses two modules in sequence: a hybrid cross-attention module CS and a cross-attention module cross, the hybrid cross-attention module CS comprising a self-attention layer and a cross-attention layer, the cross-attention layer being used to balance the self-attention layer; the cross-attention module cross cooperates with the hybrid cross-attention module CS to better learn interactive global features.
4. The image rendering method based on a cross-view bundling cross-perception neural radiance field according to claim 1, characterized in that the specific steps of the similarity-embedded volume rendering of step S4 are as follows:
S4.1, the feature matching pair information of the i-th view and the j-th view is divided into M groups;
(2)
(3)
wherein the m-th group refers to the m-th group of the feature matching pair of the i-th view and the j-th view, and the residual rotation similarity set collects the per-group similarities of the i-th view and the j-th view; the similarity function computes the residual rotation similarity of the m-th group of feature information of the matching pair formed by the i-th and j-th view features, from the m-th group of feature information of the i-th view feature and the m-th group of feature information of the j-th view feature; the result denotes the residual rotation similarity of the M-th group of feature information of the matching pair of the i-th and j-th view features;
S4.2, feature refinement and aggregation are carried out at the two resolutions, and residual rotation similarity matching of the features is performed at each of the two resolutions to obtain the corresponding residual rotation similarity subsets;
S4.3, finally, step S4.2 is repeated in a loop over the N views of the dataset I, and the average is taken to obtain the final residual rotation similarity set and residual rotation similarity:
(4)。
5. The image rendering method based on a cross-view bundling cross-perception neural radiance field according to claim 1, characterized in that the detailed calculation of step S5 is as follows:
S5.1, the color loss is the sum of the cross-entropy losses between the rendered colors of the coarse-granularity and fine-granularity stages and the ground-truth color:
(5)
wherein R is the set of rays in a batch, C_c(r) and C_f(r) denote the rendered colors predicted by the coarse stage and the fine stage along a ray r, and C_gt(r) denotes the ground-truth color;
S5.2, gamma correction is introduced into the content loss function; a gamma correction value is introduced for each piece of instantaneous image data and its linear color is obtained through a logarithmic transformation of the view, thereby establishing the connection between the instantaneous image and the observation image in the picture dataset, as follows:
the image dataset is treated as a set of tuples each containing a timestamp, a 2D coordinate point and a polarity; the timestamp is an implicit variable, used only to judge, from whether each picture occurs, whether a view and an instantaneous view fall within the same timestamp, yielding the difference between the logarithmic absolute value of the instantaneous intensity image and the logarithmic absolute value of the observation image at that timestamp:
(6)
wherein the gamma-corrected linear color defined above is substituted into formula (6) to give:
(7)
S5.3, the new regularized baseline loss function is finally obtained, with the expression as in formula (1).
CN202410242047.2A 2024-03-04 Image rendering method based on cross-view bundling cross-perception neural radiance field Active CN117853645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410242047.2A CN117853645B (en) 2024-03-04 Image rendering method based on cross-view bundling cross-perception neural radiance field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410242047.2A CN117853645B (en) 2024-03-04 Image rendering method based on cross-view bundling cross-perception neural radiance field

Publications (2)

Publication Number Publication Date
CN117853645A true CN117853645A (en) 2024-04-09
CN117853645B CN117853645B (en) 2024-05-28



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240005590A1 (en) * 2020-11-16 2024-01-04 Google Llc Deformable neural radiance fields
CN116802688A (en) * 2021-02-01 2023-09-22 雷克格尼申福克斯有限公司 Apparatus and method for correspondence analysis within an image
US20230027890A1 (en) * 2021-05-03 2023-01-26 University Of Southern California Systems and methods for physically-based neural face shader via volumetric lightmaps
WO2023085854A1 (en) * 2021-11-12 2023-05-19 Samsung Electronics Co., Ltd. Method for building scene representation with feed-forward correction for real-time view synthesis
US20230362347A1 (en) * 2022-05-09 2023-11-09 The Regents Of The University Of Michigan Real-Time Novel View Synthesis With Forward Warping And Depth
CN117635801A (en) * 2023-12-11 2024-03-01 中国传媒大学 New view synthesis method and system based on real-time rendering generalizable nerve radiation field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGWEI CAO: "An End-to-End Approach to Reconstructing 3D Model From Image Set", 《IEEE ACCESS》, 19 October 2020 (2020-10-19), pages 193268 - 193284, XP011818598, DOI: 10.1109/ACCESS.2020.3032169 *

Similar Documents

Publication Publication Date Title
CN109671023B (en) Face image super-resolution secondary reconstruction method
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN112102472B (en) Sparse three-dimensional point cloud densification method
CN113096017B (en) Image super-resolution reconstruction method based on depth coordinate attention network model
CN110232653A (en) The quick light-duty intensive residual error network of super-resolution rebuilding
CN110363068B (en) High-resolution pedestrian image generation method based on multiscale circulation generation type countermeasure network
CN110060286B (en) Monocular depth estimation method
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
Li et al. DLGSANet: lightweight dynamic local and global self-attention networks for image super-resolution
Kang et al. Ddcolor: Towards photo-realistic image colorization via dual decoders
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
Mu et al. Neural 3D reconstruction from sparse views using geometric priors
Zhaoa et al. Semantic segmentation by improved generative adversarial networks
CN112785502A (en) Light field image super-resolution method of hybrid camera based on texture migration
Han Texture image compression algorithm based on self-organizing neural network
CN116071504B (en) Multi-view three-dimensional reconstruction method for high-resolution image
CN117853645B (en) Image rendering method based on cross-view bundling cross-perception neural radiance field
CN117853645A (en) Image rendering method based on cross-view bundling cross-perception neural radiance field
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN109087247A (en) The method that a kind of pair of stereo-picture carries out oversubscription
Shao et al. SRWGANTV: image super-resolution through wasserstein generative adversarial networks with total variational regularization
CN115170921A (en) Binocular stereo matching method based on bilateral grid learning and edge loss
Wang et al. Lightweight single image super-resolution with similar feature fusion block
Zhuo et al. ISP-GAN: inception sub-pixel deconvolution-based lightweight GANs for colorization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant