CN113689548B - A 3D reconstruction method of medical images based on mutual attention Transformer - Google Patents

A 3D reconstruction method of medical images based on mutual attention Transformer

Info

Publication number
CN113689548B
CN113689548B (application CN202110881635.7A)
Authority
CN
China
Prior art keywords
image
stage
coding
characteristic
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110881635.7A
Other languages
Chinese (zh)
Other versions
CN113689548A (en)
Inventor
全红艳
董家顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202110881635.7A
Publication of CN113689548A
Application granted
Publication of CN113689548B
Status: Active

Classifications

    • G06T17/00 — Three-dimensional [3D] modelling for computer graphics
    • G06N3/045 — Combinations of networks (neural network architecture, e.g. interconnection topology)
    • G06N3/08 — Learning methods
    • G06N3/088 — Non-supervised learning, e.g. competitive learning
    • G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G16H30/20 — ICT specially adapted for handling medical images, e.g. DICOM, HL7 or PACS
    • G06T2207/10081 — Computed x-ray tomography [CT] (image acquisition modality)
    • G06T2207/10136 — 3D ultrasound image (image acquisition modality)
    • G06T2207/20081 — Training; Learning (special algorithmic details)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method for medical images based on a mutual-attention Transformer. The method is an unsupervised learning approach built on the mutual-attention Transformer: according to the characteristics of ultrasound or CT images, a convolutional neural network structure based on a mutual-attention mechanism is designed, and 3D reconstruction of ultrasound images is achieved through transfer learning under an unsupervised mechanism. The invention can effectively and accurately predict the three-dimensional geometric information of ultrasound or CT images, provides an effective 3D reconstruction solution for AI-assisted medical diagnosis in clinical practice, and further improves the efficiency of AI-assisted diagnosis.

Description

A 3D reconstruction method for medical images based on a mutual-attention Transformer

Technical Field

The invention belongs to the field of computer technology and relates to three-dimensional reconstruction of ultrasound or CT images in intelligent medical auxiliary diagnosis. Drawing on the imaging laws of natural images, it learns with a deep-learning mechanism, adopts an artificial-intelligence transfer-learning strategy and a mutual-attention Transformer encoding technique, and establishes an effective network structure that can reconstruct the three-dimensional geometric information of ultrasound or CT images.

Background Art

In recent years, artificial intelligence technology has developed rapidly, and in intelligent medical auxiliary diagnosis, 3D visualization can assist diagnosis in modern clinical medicine. At the same time, medical images contain little texture and much noise, and recovering the parameters of an ultrasound camera is particularly difficult, so research on 3D reconstruction of ultrasound or CT images still faces considerable difficulties, which makes 3D reconstruction of medical images a challenging research topic.

Meanwhile, the advanced artificial-intelligence techniques that have emerged in recent years make it possible to solve the problem of 3D reconstruction of ultrasound or CT images by building an effective deep-learning encoding model. Because of its strong feature-perception capability, the Transformer model is now widely used in medical image analysis.

Summary of the Invention

The purpose of the present invention is to provide a 3D reconstruction method for medical images based on a mutual-attention Transformer. The method adopts a multi-scale Transformer encoding structure and designs a multi-branch network structure; in addition, it is designed according to the characteristics of geometric imaging in computer vision, and uses a mutual-attention mechanism to fully exploit the interaction between different views, which improves the accuracy of 3D reconstruction. The invention can obtain a relatively fine three-dimensional structure of a medical target and has high practical value.

The specific technical scheme that realizes the purpose of the invention is as follows:

A 3D reconstruction method for medical images based on a mutual-attention Transformer. The method takes as input an ultrasound or CT image sequence whose image resolution is M×N, with 100≤M≤2000 and 100≤N≤2000. The 3D reconstruction process specifically includes the following steps:

Step 1: Construct the datasets

(a) Construct the natural-image dataset

Select a natural-image website that provides image sequences together with the corresponding camera intrinsic parameters, and download a image sequences and their intrinsic parameters from it, with 1≤a≤20. For each image sequence, every three adjacent frames are denoted image b, image c and image d. Image b and image d are spliced along the color channels to obtain image τ; image c and image τ form one data element, with image c as the natural target image and the sampling viewpoint of image c as the target viewpoint. The intrinsic parameters of images b, c and d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the natural-image dataset, which contains f elements with 3000≤f≤20000;
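
A minimal sketch of this grouping in Python/NumPy, assuming the frames of one sequence are already loaded as H×W×3 arrays and reading "every three adjacent frames" as non-overlapping triples; the function and field names (build_elements, "target", "neighbors") are illustrative, not from the patent:

```python
import numpy as np

def build_elements(frames, intrinsics):
    """Group one image sequence into (target image c, spliced image tau) data elements.

    frames     : list of H x W x 3 arrays (one downloaded image sequence)
    intrinsics : (e1, e2, e3, e4) shared by all frames of the sequence
    """
    elements = []
    usable = len(frames) - len(frames) % 3          # fewer than 3 leftover frames are discarded
    for s in range(0, usable, 3):
        b, c, d = frames[s], frames[s + 1], frames[s + 2]
        tau = np.concatenate([b, d], axis=-1)       # splice b and d along the color channels -> H x W x 6
        elements.append({"target": c, "neighbors": tau, "intrinsics": intrinsics})
    return elements
```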

(b) Construct the ultrasound-image dataset

Sample g ultrasound image sequences, with 1≤g≤20. For each sequence, every three adjacent frames are denoted image i, image j and image k. Image i and image k are spliced along the color channels to obtain image π; image j and image π form one data element, with image j as the ultrasound target image and its sampling viewpoint as the target viewpoint. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the ultrasound-image dataset, which contains F elements with 1000≤F≤20000;

(c) Construct the CT-image dataset

Sample h CT image sequences, with 1≤h≤20. For each sequence, every three adjacent frames are denoted image l, image m and image n. Image l and image n are spliced along the color channels to obtain image σ; image m and image σ form one data element, with image m as the CT target image and its sampling viewpoint as the target viewpoint. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the CT-image dataset, which contains ξ elements with 1000≤ξ≤20000;

Step 2: Construct the neural networks

The resolution of every image fed to the neural networks is p×o, where p is the width and o is the height in pixels, with 100≤o≤2000 and 100≤p≤2000;

(1) Depth-information encoding network

Tensor H is the input, with shape α×o×p×3; tensor I is the output, with shape α×o×p×1, where α is the batch size;

The depth-information encoding network consists of an encoder and a decoder; tensor H is encoded and then decoded to obtain the output tensor I;

The encoder consists of five units. The first unit is a convolution unit and the second to fifth units are composed of residual modules. The first unit has 64 convolution kernels, all of shape 7×7, with a stride of 2 in both the horizontal and vertical directions, followed by one max-pooling operation. The second to fifth units contain 3, 4, 6 and 3 residual modules, respectively; each residual module performs three convolutions with 3×3 kernels, and the numbers of kernels are 64, 128, 256 and 512, respectively;

The decoder consists of six decoding units, each of which performs a deconvolution followed by a convolution with the same kernel shape and number of kernels. In the first to sixth decoding units the kernels are all 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16, respectively. Cross-layer (skip) connections are made between the encoder and decoder layers with the correspondence: 1 with 4, 2 with 3, 3 with 2, and 4 with 1;
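
A compact sketch of this encoder-decoder wiring is given below in tf.keras (available in the TensorFlow 1.14 environment named in the embodiment). It is an illustration under stated assumptions, not the patented implementation: the residual shortcuts, padding, the stride of the last decoding unit and the final sigmoid depth activation are choices made here so that the sketch builds and returns an o×p×1 output; all helper names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_bn(x, filters, kernel, stride=1):
    """convolution -> activation -> batch normalization"""
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def residual_unit(x, filters, n_modules, first_stride):
    """n_modules residual modules; each module applies three 3x3 convolutions."""
    for m in range(n_modules):
        s = first_stride if m == 0 else 1
        shortcut = layers.Conv2D(filters, 1, strides=s, padding="same")(x)
        y = conv_bn(x, filters, 3, s)
        y = conv_bn(y, filters, 3)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        x = layers.ReLU()(layers.Add()([shortcut, y]))
    return x

def depth_network(o=192, p=256):
    inp = layers.Input((o, p, 3))                    # tensor H: alpha x o x p x 3
    # encoder: unit 1 = 64 kernels of 7x7, stride 2, followed by max pooling
    e1 = conv_bn(inp, 64, 7, 2)
    x = layers.MaxPooling2D(2)(e1)
    e2 = residual_unit(x, 64, 3, 1)                  # units 2-5: 3, 4, 6, 3 residual modules
    e3 = residual_unit(e2, 128, 4, 2)
    e4 = residual_unit(e3, 256, 6, 2)
    e5 = residual_unit(e4, 512, 3, 2)
    # decoder: 6 units of deconvolution + convolution, skip links 1-4, 2-3, 3-2, 4-1
    skips = {0: e4, 1: e3, 2: e2, 3: e1}
    x = e5
    for i, f in enumerate([512, 256, 128, 64, 32, 16]):
        s = 2 if i < 5 else 1                        # assumption: last unit keeps the o x p resolution
        x = layers.Conv2DTranspose(f, 3, strides=s, padding="same", activation="relu")(x)
        if i in skips:
            x = layers.Concatenate()([x, skips[i]])
        x = conv_bn(x, f, 3)
    depth = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)   # tensor I: alpha x o x p x 1
    return Model(inp, depth)
```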

(2) Mutual-attention Transformer learning network

The mutual-attention Transformer learning network consists of one backbone network and four network branches, which predict tensor L, tensor O, tensor D and tensor B, respectively;

Tensor J and tensor C are the inputs, with shapes α×o×p×3 and α×o×p×6, respectively. The outputs are tensor L, tensor O, tensor D and tensor B, with shapes α×2×6, α×4×1, α×3 and α×o×p×4, respectively, where α is the batch size;

The backbone network is designed as three stages of cross-view encoding:

1) The first stage of cross-view encoding consists of first-stage embedding encoding and first-stage attention encoding

In the first-stage embedding encoding, convolution is applied separately to tensor J, to the first three feature components of the last dimension of tensor C, and to the last three feature components of the last dimension of tensor C; the kernels are all 7×7 and the number of feature channels is 24. Serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding first-stage embedding code 1, first-stage embedding code 2 and first-stage embedding code 3, respectively;

In the first-stage attention encoding, first-stage embedding code 1 and first-stage embedding code 2 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 1; first-stage embedding code 1 and first-stage embedding code 3 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 2; first-stage embedding code 2 and first-stage embedding code 1 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 3; first-stage embedding code 3 and first-stage embedding code 1 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 4. Attention encoding is then performed on these four input features: for each first-stage attention-encoding input feature, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 24 feature channels, and a stride of 1 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (1 head, 24 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, each attention weight matrix is added to the target encoding feature of the corresponding input feature, giving four first-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the first-stage cross-view cross-layer feature. The first-stage cross-view cross-layer feature, the third first-stage cross-view encoding feature and the fourth first-stage cross-view encoding feature together form the first-stage cross-view encoding result. The first-stage cross-view encoding result is used as the input of the second-stage cross-view encoding, and its components are concatenated along the last dimension to obtain the first-stage concatenated encoding result;
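
A minimal sketch of one such mutual-attention block follows, written as TF-2-style eager/functional code for brevity. It assumes the block receives a spatial feature map whose channel axis holds the target half followed by the source half, derives K and V from the target half and Q from the source half through separable convolutions, computes single-head scaled dot-product attention over the serialized spatial positions, and adds the attention output back to the target encoding; the softmax formulation and the residual placement are interpretive choices, and all names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mutual_attention_block(x, channels=24, stride=1):
    """x: (B, H, W, 2C) concatenation of [target | source] features along the channel axis."""
    c = x.shape[-1] // 2
    target, source = x[..., :c], x[..., c:]
    # separable convolutions: the target result supplies K and V, the source result supplies Q
    tgt = layers.SeparableConv2D(channels, 3, strides=stride, padding="same")(target)
    src = layers.SeparableConv2D(channels, 3, strides=stride, padding="same")(source)
    shp = tf.shape(tgt)                                  # dynamic (B, H', W', channels)
    k = tf.reshape(tgt, (shp[0], -1, channels))          # serialize the spatial grid into a sequence
    v = k                                                # values share the target encoding
    q = tf.reshape(src, (shp[0], -1, channels))          # queries come from the other view
    attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) / float(channels) ** 0.5, axis=-1)
    out = tf.reshape(tf.matmul(attn, v), shp)
    return tgt + out                                     # attention result added to the target encoding
```

The second and third stages follow the same pattern with 64 channels / 3 heads and 128 channels / 6 heads, respectively, and a stride of 2 in the separable convolutions.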

2) The second stage of cross-view encoding consists of second-stage embedding encoding and second-stage attention encoding

In the second-stage embedding encoding, each feature in the first-stage cross-view encoding result is embedded: the convolution uses 64 feature channels, 3×3 kernels and a stride of 2 in both the horizontal and vertical directions; serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding second-stage embedding code 1, second-stage embedding code 2 and second-stage embedding code 3;

In the second-stage attention encoding, second-stage embedding code 1 and second-stage embedding code 2 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 1; second-stage embedding code 1 and second-stage embedding code 3 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 2; second-stage embedding code 2 and second-stage embedding code 1 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 3; second-stage embedding code 3 and second-stage embedding code 1 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 64 feature channels, and a stride of 2 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (3 heads, 64 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, the attention weight matrix of each input feature is added to its target encoding feature, giving four second-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the second-stage cross-view cross-layer feature. The second-stage cross-view cross-layer feature, the third second-stage cross-view encoding feature and the fourth second-stage cross-view encoding feature together form the second-stage cross-view encoding result. The second-stage cross-view encoding result is used as the input of the third-stage cross-view encoding, and its components are concatenated along the last dimension to obtain the second-stage concatenated encoding result;

3) The third stage of cross-view encoding consists of third-stage embedding encoding and third-stage attention encoding

In the third-stage embedding encoding, each feature in the second-stage cross-view encoding result is embedded: the convolution uses 3×3 kernels, 128 feature channels and a stride of 2 in both the horizontal and vertical directions; serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding third-stage embedding code 1, third-stage embedding code 2 and third-stage embedding code 3;

In the third-stage attention encoding, third-stage embedding code 1 and third-stage embedding code 2 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 1; third-stage embedding code 1 and third-stage embedding code 3 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 2; third-stage embedding code 2 and third-stage embedding code 1 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 3; third-stage embedding code 3 and third-stage embedding code 1 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 128 feature channels, and a stride of 2 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (6 heads, 128 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, the weight matrix of each third-stage attention-encoding input feature is added to its target encoding feature, giving four third-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the third-stage cross-view cross-layer feature. The third-stage cross-view cross-layer feature, the third third-stage cross-view encoding feature and the fourth third-stage cross-view encoding feature together form the third-stage cross-view encoding result, which is concatenated along the last dimension to obtain the third-stage concatenated encoding result;

For the first network branch, the first-stage concatenated encoding result is processed by two successive units: in the first unit, the convolution has 16 feature channels, 7×7 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then processed by two further units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization; in the second unit, the convolution has 64 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the third-stage concatenated encoding result and processed by the following three units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 2, followed by feature activation and batch normalization; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 12 feature channels, 1×1 kernels and a stride of 1, followed by feature activation and batch normalization. The resulting 12-channel features are predicted in the form 2×6 to give tensor L;
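
The repeating "convolution, feature activation, batch normalization" unit that the branches reuse, and the final reshaping of the 12-channel output into the α×2×6 pose tensor L, can be sketched as below. The reduction from a spatial map to one vector per image is not fixed by the text and is done here with global average pooling; the sketch also assumes the two concatenated feature maps already share a spatial size at the join. All names are illustrative.

```python
from tensorflow.keras import layers

def branch_unit(x, filters, kernel, stride):
    """one branch unit: convolution -> feature activation -> batch normalization"""
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def pose_branch(stage1_concat, stage3_concat):
    x = branch_unit(stage1_concat, 16, 7, 1)
    x = branch_unit(x, 32, 3, 2)
    x = branch_unit(x, 32, 7, 1)
    x = branch_unit(x, 64, 3, 2)
    x = layers.Concatenate()([x, stage3_concat])     # join with the stage-3 concatenated encoding result
    x = branch_unit(x, 64, 7, 2)
    x = branch_unit(x, 128, 3, 2)
    x = branch_unit(x, 12, 1, 1)                     # 12 channels = 2 relative poses x 6 parameters
    x = layers.GlobalAveragePooling2D()(x)           # assumption: pool to one 12-vector per image
    return layers.Reshape((2, 6))(x)                 # tensor L: alpha x 2 x 6
```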

For the second network branch, the first-stage concatenated encoding result is processed by two successive units: in the first unit, the convolution has 16 feature channels, 7×7 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the second-stage concatenated encoding result and processed by the following two units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the third-stage concatenated encoding result and processed by three units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 2, followed by feature activation and batch normalization; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 4 feature channels, 1×1 kernels and a stride of 1, followed by feature activation and batch normalization. The resulting 4-channel features are taken as tensor O;

For the third network branch, the third-stage concatenated encoding result is processed by the following four units: in the first unit, the convolution has 256 feature channels, 3×3 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 512 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 1024 feature channels, 3×3 kernels and a stride of 2; in the fourth unit, the convolution has 3 feature channels, 1×1 kernels and a stride of 1. The resulting features are taken as tensor D;

For the fourth network branch, the first-stage cross-view cross-layer feature first undergoes one deconvolution, feature activation and batch normalization; the deconvolution has 16 feature channels, 3×3 kernels and a stride of 2 in both directions, and the result is recorded as decoder cross-layer feature 1. The first-stage cross-view cross-layer feature is then processed by the following two units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization, and the processed feature is recorded as decoder cross-layer feature 2; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are concatenated with the second-stage cross-view cross-layer feature, and the concatenation is processed by the following two units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 1, and the processed feature is recorded as decoder cross-layer feature 3; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2. The resulting features are then concatenated with the third-stage cross-view cross-layer feature and processed by the following three units: in the first unit, the convolution has 128 feature channels, 7×7 kernels and a stride of 1, and the processed feature is recorded as decoder cross-layer feature 4; in the second unit, the convolution has 256 feature channels, 3×3 kernels and a stride of 2, and the processed feature is recorded as decoder cross-layer feature 5; in the third unit, the convolution has 512 feature channels, 3×3 kernels and a stride of 2. After this processing, the fourth-branch encoding feature is obtained;

Decoding then proceeds as follows. The fourth-branch encoding feature undergoes one deconvolution (256 feature channels, 3×3 kernels, stride 2 in both directions), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and passed through one convolution (512 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization. The result then undergoes a deconvolution (256 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; it is concatenated with decoder cross-layer feature 4 and passed through one convolution (256 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization. The result then undergoes a deconvolution (128 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; it is concatenated with decoder cross-layer feature 3 and passed through one convolution (128 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the fourth-scale result of tensor B. At the same time, these features undergo one deconvolution (64 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 2 and passed through one convolution (64 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the third-scale result of tensor B. At the same time, these features undergo one deconvolution (32 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 1 and passed through one convolution (32 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the second-scale result of tensor B. At the same time, these features undergo one deconvolution (16 feature channels, 7×7 kernels, stride 2), feature activation and batch normalization; the result is concatenated with the upsampled third-scale features and passed through one convolution (16 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the first-scale result of tensor B. The four scale results of tensor B form the output of the fourth network branch;

Step 3: Training the neural networks

The samples in the natural-image, ultrasound-image and CT-image datasets are each split 9:1 into a training set and a test set; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding dataset, uniformly scaled to resolution p×o, and fed into the corresponding network; the optimization is iterative, and the network model parameters are continually updated so that the loss of each batch is minimized;
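
The 9:1 split can be sketched in plain Python; the shuffle and the fixed seed are illustrative choices not specified in the text:

```python
import random

def split_9_to_1(elements, seed=0):
    """Split a dataset's elements 9:1 into a training set and a test set."""
    elements = list(elements)
    random.Random(seed).shuffle(elements)
    n_train = int(round(0.9 * len(elements)))
    return elements[:n_train], elements[n_train:]
```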

The losses used during training are computed as follows:

Intrinsics-supervised synthesis loss: when training the network model on natural images, the tensor I output by the depth-information encoding network is taken as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the intrinsic-parameter labels e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera intrinsic parameters, respectively. Following the principles of computer vision, image b and image d are each used to synthesize an image at the viewpoint of image c, and the loss is computed as the sum, over all pixels and color channels, of the intensity differences between image c and each of the two synthesized images;
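
The two ingredients of this loss — warping a neighboring view into the target viewpoint from depth, pose and intrinsics, and summing per-pixel, per-channel intensity differences — can be sketched in NumPy as below. The intrinsic matrix is assembled from (e1, e2, e3, e4) in the usual pinhole form, the 4×4 relative pose T is assumed to come from the predicted pose parameters, and the bilinear sampling that turns the projected coordinates into a synthesized image is omitted; all names are illustrative.

```python
import numpy as np

def photometric_loss(target, synthesized):
    """Sum over pixels and color channels of absolute intensity differences."""
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

def reproject(depth, K, T, u, v):
    """Map pixel (u, v) of the target view into a neighboring view for synthesis.

    depth : predicted depth of the pixel in the target view
    K     : 3x3 intrinsic matrix [[e1, 0, e3], [0, e2, e4], [0, 0, 1]]
    T     : 4x4 relative pose (target -> neighbor) taken from the predicted pose parameters
    """
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # back-project the pixel to 3D
    p_nbr = (T @ np.append(p_cam, 1.0))[:3]                      # move it into the neighbor's frame
    uvw = K @ p_nbr                                              # project into the neighbor image
    return uvw[:2] / uvw[2]                                      # sampling location in the neighbor
```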

Unsupervised synthesis loss: when training the network model on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. Following computer-vision algorithms, the two images adjacent to the target image are each used to synthesize an image at the viewpoint of the target image, and the loss is computed as the sum, over all pixels and color channels, of the intensity differences between the target image and each synthesized image;

Intrinsic-parameter error loss: when training the network model on natural images, the loss is computed as the sum of the absolute differences between the components of the output tensor O of the second branch of the mutual-attention Transformer learning network and the intrinsic-parameter labels e_t (t = 1, 2, 3, 4) of the training data;
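
As a one-function sketch (NumPy, illustrative names), this loss is simply the sum of absolute component differences:

```python
import numpy as np

def intrinsics_error_loss(pred_o, label_e):
    """Sum of absolute differences between predicted intrinsics O and labels e_t, t = 1..4."""
    return np.abs(np.asarray(pred_o, dtype=float) - np.asarray(label_e, dtype=float)).sum()
```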

Spatial-structure error loss: when training the network model on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. Following computer-vision algorithms, the two images adjacent to the target-viewpoint image are used to reconstruct the three-dimensional coordinates of the target-viewpoint image, and the RANSAC algorithm is used to fit a spatial structure to the reconstructed points. The spatial-structure error loss is computed as the cosine distance between the normal vector obtained from the fit and the output tensor D of the third branch of the mutual-attention Transformer learning network;
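
A minimal sketch of this loss follows: a small RANSAC loop fits a plane to the reconstructed 3D points and the cosine distance compares its normal with the predicted tensor D. The plane model, inlier threshold and iteration count are assumptions made for the sketch; the patent only states that RANSAC fits a spatial structure.

```python
import numpy as np

def ransac_plane_normal(points, n_iter=100, thresh=0.01, seed=0):
    """points: (N, 3) reconstructed coordinates; returns the unit normal of the best-fitting plane."""
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, -1
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(normal) < 1e-8:
            continue                                        # degenerate (collinear) sample
        normal = normal / np.linalg.norm(normal)
        inliers = (np.abs((points - sample[0]) @ normal) < thresh).sum()
        if inliers > best_inliers:
            best_normal, best_inliers = normal, inliers
    return best_normal

def spatial_structure_loss(points, pred_d):
    """Cosine distance between the fitted normal and the third-branch output D."""
    n = ransac_plane_normal(points)
    d = np.asarray(pred_d, dtype=float)
    cos = (n @ d) / (np.linalg.norm(n) * np.linalg.norm(d) + 1e-8)
    return 1.0 - cos
```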

Transformation synthesis loss: when training the network parameters on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. The two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image; for each of these synthesized images, after the position of each pixel has been obtained in the synthesis process, the output tensor B of the fourth branch is used as the displacement for spatial deformation of the synthesized image, giving the final synthesized result. The loss is then computed as the sum, over all pixels and color channels, of the intensity differences between the target-viewpoint image and each synthesized result;

Specific training steps:

(1) On the natural-image dataset, train the depth-information encoding network together with the backbone and the first branch of the mutual-attention Transformer learning network for 80,000 iterations

Each time, training data are taken from the natural-image dataset and uniformly scaled to resolution p×o; image c is fed into the depth-information encoding network, and image c and image τ are fed into the mutual-attention Transformer learning network. The depth-information encoding network and the backbone and first branch of the mutual-attention Transformer learning network are trained for 80,000 iterations, and the training loss of each batch is computed from the intrinsics-supervised synthesis loss;

(2) On the natural-image dataset, train the second branch of the mutual-attention Transformer learning network for 50,000 iterations

Each time, training data are taken from the natural-image dataset and uniformly scaled to resolution p×o; image c is fed into the depth-information encoding network, and image c and image τ are fed into the mutual-attention Transformer learning network. The second branch is trained, and the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the intrinsic-parameter error loss;

(3) On the ultrasound-image dataset, train the depth-information encoding network and the backbone and branches 1-4 of the mutual-attention Transformer learning network for 80,000 iterations, obtaining model parameters ρ

Each time, ultrasound training data are taken from the ultrasound-image dataset and uniformly scaled to resolution p×o; image j is fed into the depth-information encoding network, and image j and image π are fed into the mutual-attention Transformer learning network. The depth-information encoding network and the backbone and branches 1-4 of the mutual-attention Transformer learning network are trained, and the training loss of each batch is computed as the sum of the transformation synthesis loss and the spatial-structure error loss;

(4) On the CT-image dataset, train the mutual-attention Transformer learning network for 60,000 iterations, obtaining model parameters ρ′

Each time, CT training data are taken from the CT-image dataset and uniformly scaled to resolution p×o, and image m and image σ are fed into the mutual-attention Transformer learning network. The output of the depth-information encoding network is taken as the depth, the outputs of the first and second network branches are taken as the pose parameters and the camera intrinsic parameters, respectively, and the output tensor B of the fourth branch of the mutual-attention Transformer learning network is taken as the displacement for spatial deformation. Two images at the viewpoint of image m are synthesized from image l and image n, respectively. The network is trained by continually updating its parameters and optimizing iteratively so that the loss of every image in each batch is minimized, yielding the optimal network model parameters ρ′. When computing the loss for this optimization, a camera translation-motion loss is added to the transformation synthesis loss and the spatial-structure error loss;

Step 4: 3D reconstruction of ultrasound or CT images

Using a self-sampled ultrasound or CT image sequence, 3D reconstruction is achieved by carrying out the following three processes simultaneously:

(1) For any target image in the sequence, compute its three-dimensional coordinates in the camera coordinate system as follows: scale the image to resolution p×o; for an ultrasound sequence, feed image j into the depth-information encoding network and feed image j and image π into the mutual-attention Transformer learning network; for a CT sequence, feed image m into the depth-information encoding network and feed image m and image σ into the mutual-attention Transformer learning network; use model parameters ρ and ρ′, respectively, for prediction. The depth of each target frame is obtained from the depth-information encoding network, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the camera pose parameters and the camera intrinsic parameters, respectively. From the depth information of the target image and the camera intrinsic parameters, and following the principles of computer vision, the three-dimensional coordinates of the target image in the camera coordinate system are computed;
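
The per-pixel computation of camera-frame coordinates from the predicted depth map and intrinsics is the standard pinhole back-projection; a vectorized NumPy sketch with illustrative names:

```python
import numpy as np

def backproject_depth(depth, e1, e2, e3, e4):
    """depth: (H, W) predicted depth map -> (H, W, 3) points in the camera coordinate system."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    x = (u - e3) / e1 * depth                        # X = (u - cx) / fx * Z
    y = (v - e4) / e2 * depth                        # Y = (v - cy) / fy * Z
    return np.stack([x, y, depth], axis=-1)
```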

(2) During 3D reconstruction of the sequence, build a keyframe sequence: take the first frame of the sequence as the first frame of the keyframe sequence and as the current keyframe, and treat the frames after the current keyframe as target frames; new keyframes are selected dynamically in target-frame order. First, the pose-parameter matrix of the target frame relative to the current keyframe is initialized with the identity matrix. For any target frame, this pose-parameter matrix is cumulatively multiplied by the camera pose parameters of the target frame; the accumulated result, together with the intrinsic parameters and depth information of the target frame, is used to synthesize an image at the viewpoint of the target frame, and the error λ is computed from the sum, over all pixels and color channels, of the intensity differences between this synthesized image and the target frame. Then, from the frames adjacent to the target frame, an image at the viewpoint of the target frame is synthesized using the camera pose parameters and intrinsic parameters, and the error γ is computed from the sum, over all pixels and color channels, of the intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then computed by formula (1):

Figure BDA0003192217030000111 — formula (1), defining the synthesis error ratio Z in terms of λ and γ (given as an image in the original)

When Z is greater than a threshold η, with 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this is iterated until the key-frame sequence is established;

(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. For any target image, its resolution is scaled to M×N; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information output by the network, and the three-dimensional coordinates of every pixel of the target frame in the world coordinate system are then computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key-frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
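As a concrete illustration of step (1) above, the following minimal NumPy sketch lifts every pixel of a predicted depth map to 3D coordinates in the camera frame. It assumes the four intrinsic values predicted by the second branch are ordered (fx, fy, cx, cy); that ordering, the function names and the dummy values are illustrative assumptions rather than part of the described method.

import numpy as np

def backproject_to_camera(depth, intrinsics):
    """Lift every pixel of a depth map to 3D camera coordinates.

    depth      : (H, W) array of per-pixel depth values (network output I).
    intrinsics : iterable (fx, fy, cx, cy) -- assumed ordering of tensor O.
    returns    : (H, W, 3) array of 3D points in the camera frame.
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) / fx * depth                        # X = (u - cx) * Z / fx
    y = (v - cy) / fy * depth                        # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

# Toy usage at the working resolution 416x128 (width x height):
depth = np.full((128, 416), 2.0)                     # placeholder depth map
pts_cam = backproject_to_camera(depth, (400.0, 400.0, 208.0, 64.0))
print(pts_cam.shape)                                 # (128, 416, 3)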

Beneficial effects of the present invention:

The present invention designs a Transformer network model based on mutual attention and learns with a mutual-attention mechanism between different views, so that the intelligent perception capability of deep learning is fully exploited in the three-dimensional reconstruction of medical images and three-dimensional geometric information can be obtained automatically from two-dimensional ultrasound or CT images. The invention can be used to visualize clinical diagnostic targets, provides an effective 3D reconstruction solution for AI-assisted medical diagnosis, and improves the efficiency of AI-assisted medical diagnosis.

Description of drawings

Fig. 1 shows the three-dimensional reconstruction results of ultrasound images obtained with the present invention;

Fig. 2 shows the three-dimensional reconstruction results of CT images obtained with the present invention.

Detailed description of the embodiments

The present invention is further described below in conjunction with the accompanying drawings and an embodiment.

Embodiment

This embodiment is implemented on a PC under the Windows 10 64-bit operating system, with an Intel Core i7-9700F CPU, 16 GB of memory and an NVIDIA GeForce RTX 2070 8 GB GPU; the deep learning library is TensorFlow 1.14 and programming is done in Python 3.7.

A medical image 3D reconstruction method based on a mutual-attention Transformer: the method takes as input an ultrasound or CT image sequence with resolution M×N, where M = 450 and N = 300 for ultrasound images and M = N = 512 for CT images. The three-dimensional reconstruction process specifically includes the following steps:

Step 1: Construct the data sets

(a) Constructing the natural image data set

A natural image website that provides image sequences together with the corresponding camera internal parameters is selected, and 19 image sequences and the internal parameters corresponding to the sequences are downloaded from it. For each image sequence, every 3 adjacent frames are recorded as image b, image c and image d; image b and image d are concatenated along the colour channels to obtain image τ, and image c together with image τ forms one data element, where image c is the natural target image and the sampling viewpoint of image c is the target viewpoint. The internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. A natural image data set with 3600 elements is constructed from all the sequences (a minimal sketch of this triplet grouping is given after item (c) below);

(b) Constructing the ultrasound image data set

10 ultrasound image sequences are sampled. For each sequence, every 3 adjacent frames are recorded as image i, image j and image k; image i and image k are concatenated along the colour channels to obtain image π, and image j together with image π forms one data element, where image j is the ultrasound target image and the sampling viewpoint of image j is the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. An ultrasound image data set with 1600 elements is constructed from all the sequences;

(c) Constructing the CT image data set

1 CT image sequence is sampled. For this sequence, every 3 adjacent frames are recorded as image l, image m and image n; image l and image n are concatenated along the colour channels to obtain image σ, and image m together with image σ forms one data element, where image m is the CT target image and the sampling viewpoint of image m is the target viewpoint. If fewer than 3 frames remain at the end of the sequence, they are discarded. A CT image data set with 2000 elements is constructed from all the sequences;
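The three data sets above are built by the same grouping rule. A minimal sketch of how one sequence could be turned into data elements is given below; it assumes disjoint groups of three consecutive frames (the text does not state whether the triples overlap) and channel-wise concatenation of the two outer frames, and all names are illustrative.

import numpy as np

def make_triplet_elements(frames):
    """Group a sequence into (target, concatenated-neighbours) data elements.

    frames : list of (H, W, 3) arrays, one per frame of a sequence.
    Each group of 3 consecutive frames (b, c, d) yields one element:
    c is the target image and tau = concat(b, d) along the channel axis,
    so tau has 6 channels.  Leftover frames (< 3) at the end are discarded.
    """
    elements = []
    for s in range(0, len(frames) - len(frames) % 3, 3):
        b, c, d = frames[s], frames[s + 1], frames[s + 2]
        tau = np.concatenate([b, d], axis=-1)   # (H, W, 6)
        elements.append((c, tau))
    return elements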

Step 2: Construct the neural networks

The images processed by the neural networks all have a resolution of 416×128, where 416 is the width and 128 is the height, in pixels;

(1) Structure of the depth information encoding network

Tensor H is the input, with shape 4×128×416×3, and tensor I is the output, with shape 4×128×416×1;

The depth information encoding network consists of an encoder and a decoder; tensor H is encoded and then decoded to obtain the output tensor I;

The encoder consists of 5 units. The first unit is a convolution unit, and the 2nd to 5th units are composed of residual modules. The first unit has 64 convolution kernels, all of shape 7×7, with horizontal and vertical strides of 2, followed by one max-pooling operation. The 2nd to 5th units contain 3, 4, 6 and 3 residual modules respectively; each residual module performs 3 convolutions with 3×3 kernels, and the numbers of kernels are 64, 128, 256 and 512 respectively;

The decoder consists of 6 decoding units, each of which includes a deconvolution and a convolution with the same kernel shape and number of kernels. In the 1st to 6th decoding units the kernels are all 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16 respectively. Skip connections are made between the network layers of the encoder and the decoder, with the correspondence: 1 with 4, 2 with 3, 3 with 2, 4 with 1;
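A rough tf.keras sketch of the encoder described above is given below. It assumes ReLU activations and 'same' padding (neither is stated in the text), reproduces the kernel counts and residual-module layout, and omits the decoder, the downsampling between residual stages and the exact skip-connection wiring; it is a sketch under those assumptions, not the patented network.

import tensorflow as tf
from tensorflow.keras import layers

def residual_module(x, filters):
    """One residual module with three 3x3 convolutions (simplified sketch)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    if int(shortcut.shape[-1]) != filters:            # match channel count if needed
        shortcut = layers.Conv2D(filters, 1, padding='same')(shortcut)
    return layers.Activation('relu')(layers.Add()([y, shortcut]))

def build_depth_encoder(h=128, w=416):
    """Encoder stem plus residual stages, following the counts in the text."""
    inp = layers.Input(shape=(h, w, 3))                # one sample of tensor H
    x = layers.Conv2D(64, 7, strides=2, padding='same', activation='relu')(inp)
    x = layers.MaxPooling2D(pool_size=2)(x)
    for filters, n_blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for _ in range(n_blocks):
            x = residual_module(x, filters)
    return tf.keras.Model(inp, x)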

(2) Mutual-attention Transformer learning network

The mutual-attention Transformer learning network consists of one backbone network and 4 network branches; the 4 branches predict tensor L, tensor O, tensor D and tensor B respectively;

Tensor J and tensor C are the inputs, with shapes 4×128×416×3 and 4×128×416×6 respectively; the outputs are tensor L, tensor O, tensor D and tensor B, with shapes 4×2×6, 4×4×1, 4×3 and 4×128×416×4 respectively;

The backbone network is designed as 3 stages of cross-view encoding:

1) The cross-view encoding of the 1st stage includes the 1st-stage embedding encoding and the 1st-stage attention encoding

For the 1st-stage embedding encoding, tensor J, the first 3 feature components of the last dimension of tensor C, and the last 3 feature components of the last dimension of tensor C are each subjected to a convolution with 7×7 kernels, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization, yielding the 1st-stage embedding encodings 1, 2 and 3 respectively;

For the 1st-stage attention encoding, the 1st-stage embedding encoding 1 is concatenated with the 1st-stage embedding encoding 2 along the last dimension to obtain attention-encoding input feature 1; the 1st-stage embedding encoding 1 is concatenated with the 1st-stage embedding encoding 3 along the last dimension to obtain the 1st-stage attention-encoding input feature 2; the 1st-stage embedding encoding 2 is concatenated with the 1st-stage embedding encoding 1 along the last dimension to obtain the 1st-stage attention-encoding input feature 3; and the 1st-stage embedding encoding 3 is concatenated with the 1st-stage embedding encoding 1 along the last dimension to obtain the 1st-stage attention-encoding input feature 4. Each of these 4 input features of the 1st-stage attention encoding is then attention-encoded as follows (a minimal sketch of this cross-view attention step is given below). First, the multi-head self-attention method is used to compute the attention weight matrix of the 1st-stage attention-encoding input feature: for each input feature, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the first half and the second half of the channel features are each passed through a separable convolution with 3×3 kernels, 24 feature channels and horizontal and vertical strides of 1; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix is computed with the multi-head attention method, with 1 head and 24 feature channels; finally, the 1st-stage attention weight matrix is added to the target encoding feature to obtain the 1st-stage attention encoding. After the 4 attention-encoding input features of the 1st stage have been attention-encoded in this way, the 4 cross-view encoding features of the 1st stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 1st-stage cross-view cross-layer feature; the 1st-stage cross-view cross-layer feature, the 3rd 1st-stage cross-view encoding feature and the 4th 1st-stage cross-view encoding feature together form the 1st-stage cross-view encoding result, which is used as the input of the 2nd-stage cross-view encoding; the 1st-stage cross-view encoding result is concatenated along the last dimension to obtain the 1st-stage concatenated encoding result;
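The core of each attention-encoding step is a cross-view attention in which the queries come from the source half of the channels while the keys and values come from the target half. The following NumPy sketch shows that computation for a single head on already-serialized features; the separable convolutions, the scaling convention and the multi-head splitting are simplified, the text speaks of adding the attention weight matrix to the target feature whereas this sketch adds the attention output, and all names are illustrative.

import numpy as np

def cross_view_attention(target_seq, source_seq):
    """Single-head cross-view attention (simplified sketch).

    target_seq : (n, c) serialized target encoding features -> keys K and values V.
    source_seq : (n, c) serialized source encoding features -> queries Q.
    Returns the attention result added back onto the target features.
    """
    k = target_seq                       # stand-ins for the separable-conv outputs
    v = target_seq
    q = source_seq
    scale = np.sqrt(q.shape[-1])
    logits = q @ k.T / scale             # (n, n) attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    attended = weights @ v               # (n, c) attention-weighted values
    return target_seq + attended         # residual addition onto the target feature

# Toy usage: 8 sequence positions, 24 channels (the stage-1 channel count).
rng = np.random.default_rng(0)
out = cross_view_attention(rng.normal(size=(8, 24)), rng.normal(size=(8, 24)))
print(out.shape)                         # (8, 24)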

2) The cross-view encoding of the 2nd stage includes the 2nd-stage embedding encoding and the 2nd-stage attention encoding

For the 2nd-stage embedding encoding, each feature in the 1st-stage cross-view encoding result is embedded: a convolution with 64 feature channels, 3×3 kernels and horizontal and vertical strides of 2, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization of the features, yielding the 2nd-stage embedding encodings 1, 2 and 3;

For the 2nd-stage attention encoding, the 2nd-stage embedding encoding 1 is concatenated with the 2nd-stage embedding encoding 2 along the last dimension to obtain the 2nd-stage attention-encoding input feature 1; the 2nd-stage embedding encoding 1 is concatenated with the 2nd-stage embedding encoding 3 along the last dimension to obtain the 2nd-stage attention-encoding input feature 2; the 2nd-stage embedding encoding 2 is concatenated with the 2nd-stage embedding encoding 1 along the last dimension to obtain the 2nd-stage attention-encoding input feature 3; and the 2nd-stage embedding encoding 3 is concatenated with the 2nd-stage embedding encoding 1 along the last dimension to obtain the 2nd-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the target and source encoding features are each passed through a separable convolution with 3×3 kernels, 64 feature channels and horizontal and vertical strides of 2; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix of the feature is computed with the multi-head attention method, with 3 heads and 64 feature channels; finally, the 2nd-stage attention weight matrix is added to the target encoding feature to obtain the 2nd-stage attention encoding. After the 4 attention-encoding input features of the 2nd stage have been attention-encoded, the 4 cross-view encoding features of the 2nd stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 2nd-stage cross-view cross-layer feature; the 2nd-stage cross-view cross-layer feature, the 3rd 2nd-stage cross-view encoding feature and the 4th 2nd-stage cross-view encoding feature together form the 2nd-stage cross-view encoding result, which is used as the input of the 3rd-stage cross-view encoding; the 2nd-stage cross-view encoding result is concatenated along the last dimension to obtain the 2nd-stage concatenated encoding result;

3) The cross-view encoding of the 3rd stage includes the 3rd-stage embedding encoding and the 3rd-stage attention encoding

For the 3rd-stage embedding encoding, each feature in the 2nd-stage cross-view encoding result is embedded: a convolution with 3×3 kernels, 128 feature channels and horizontal and vertical strides of 2, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization of the features, yielding the 3rd-stage embedding encodings 1, 2 and 3;

For the 3rd-stage attention encoding, the 3rd-stage embedding encoding 1 is concatenated with the 3rd-stage embedding encoding 2 along the last dimension to obtain the 3rd-stage attention-encoding input feature 1; the 3rd-stage embedding encoding 1 is concatenated with the 3rd-stage embedding encoding 3 along the last dimension to obtain the 3rd-stage attention-encoding input feature 2; the 3rd-stage embedding encoding 2 is concatenated with the 3rd-stage embedding encoding 1 along the last dimension to obtain the 3rd-stage attention-encoding input feature 3; and the 3rd-stage embedding encoding 3 is concatenated with the 3rd-stage embedding encoding 1 along the last dimension to obtain the 3rd-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the target and source encoding features are each passed through a separable convolution with 3×3 kernels, 128 feature channels and horizontal and vertical strides of 2; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix of the feature is computed with the multi-head attention method, with 6 heads and 128 feature channels; finally, the 3rd-stage attention weight matrix is added to the target encoding feature to obtain the 3rd-stage attention encoding. In this way, after the 4 attention-encoding input features of the 3rd stage have gone through the embedding encoding and attention encoding, the 4 cross-view encoding features of the 3rd stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 3rd-stage cross-view cross-layer feature; the 3rd-stage cross-view cross-layer feature, the 3rd 3rd-stage cross-view encoding feature and the 4th 3rd-stage cross-view encoding feature together form the 3rd-stage cross-view encoding result; the 3rd-stage cross-view encoding result is concatenated along the last dimension to obtain the 3rd-stage concatenated encoding result;

For the 1st network branch, the 1st-stage concatenated encoding result is processed by 2 successive units: in the 1st unit, a convolution with 16 feature channels, 7×7 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are processed by 2 further units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the 3rd-stage concatenated encoding result and processed by the following 3 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 12 feature channels, 1×1 kernels and strides of 1, followed by feature activation and batch normalization. The resulting 12-channel features are predicted in the form 2×6 to give the result of tensor L;
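The 2×6 output of this branch is naturally read as one 6-DoF pose per neighbouring view. How the 6 numbers are parameterized is not stated in the text; the sketch below assumes the common (tx, ty, tz, rx, ry, rz) layout with Euler angles and converts it to a 4×4 transformation matrix, so the layout and function names should be treated as assumptions.

import numpy as np

def pose_vec_to_matrix(pose_vec):
    """Convert a 6-DoF pose vector to a 4x4 transform (assumed layout).

    pose_vec : (6,) array assumed to be (tx, ty, tz, rx, ry, rz),
               with rotations given as Euler angles in radians.
    """
    tx, ty, tz, rx, ry, rz = pose_vec
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    transform = np.eye(4)
    transform[:3, :3] = rot_z @ rot_y @ rot_x
    transform[:3, 3] = [tx, ty, tz]
    return transform

# Tensor L has shape (batch, 2, 6): one pose per adjacent frame of the target.
poses = np.zeros((2, 6))
matrices = [pose_vec_to_matrix(p) for p in poses]    # two identity transforms here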

For the 2nd network branch, the 1st-stage concatenated encoding result is processed by 2 successive units: in the 1st unit, a convolution with 16 feature channels, 7×7 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the 2nd-stage concatenated encoding result and processed by the following 2 units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are concatenated with the 3rd-stage concatenated encoding result and processed by the following 3 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 4 feature channels, 1×1 kernels and strides of 1, followed by feature activation and batch normalization. The resulting 4-channel features are taken as the result of tensor O;

For the 3rd network branch, the 3rd-stage concatenated encoding result is processed by the following 4 units: in the 1st unit, a convolution with 256 feature channels, 3×3 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 512 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 1024 feature channels, 3×3 kernels and strides of 2; in the 4th unit, a convolution with 3 feature channels, 1×1 kernels and strides of 1. The resulting features are taken as the result of tensor D;

For the 4th network branch, the 1st-stage cross-view cross-layer feature is first processed by one deconvolution, feature activation and batch normalization; the deconvolution has 16 feature channels, 3×3 kernels and horizontal and vertical strides of 2, and the result is recorded as decoder cross-layer feature 1. The 1st-stage cross-view cross-layer feature is then processed by the following 2 units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, feature activation and batch normalization, and the processed feature is recorded as decoder cross-layer feature 2; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization. The resulting features are concatenated with the 2nd-stage cross-view cross-layer feature, and the concatenation result is processed by the following 2 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 1, and the processed feature is recorded as decoder cross-layer feature 3; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2. The resulting features are then concatenated with the 3rd-stage cross-view cross-layer feature and processed by the following 3 units, each consisting of a convolution, feature activation and batch normalization: in the 1st unit, a convolution with 128 feature channels, 7×7 kernels and strides of 1, and the processed feature is recorded as decoder cross-layer feature 4; in the 2nd unit, a convolution with 256 feature channels, 3×3 kernels and strides of 2, and the processed feature is recorded as decoder cross-layer feature 5; in the 3rd unit, a convolution with 512 feature channels, 3×3 kernels and strides of 2. After this processing the 4th-branch encoding feature is obtained;

Decoding then proceeds as follows. The 4th-branch encoding feature is processed by one deconvolution with 256 feature channels, 3×3 kernels and horizontal and vertical strides of 2, feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and passed through one convolution with 512 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization. The result is processed by a deconvolution with 256 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 4, and passed through one convolution with 256 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization. The result is processed by a deconvolution with 128 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 3, and passed through one convolution with 128 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 4th-scale result of tensor B. At the same time, these features are processed by one deconvolution with 64 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 2, and passed through one convolution with 64 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 3rd-scale result of tensor B. At the same time, these features are processed by one deconvolution with 32 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 1, and passed through one convolution with 32 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 2nd-scale result of tensor B. At the same time, these features are processed by one deconvolution with 16 feature channels, 7×7 kernels and strides of 2, feature activation and batch normalization, concatenated with the upsampling result of the 3rd-scale features, and passed through one convolution with 16 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 1st-scale result of tensor B. The output of the 4th branch is obtained from the 4 scale results of tensor B;

Step 3: Training of the neural networks

The samples in the natural image data set, the ultrasound image data set and the CT image data set are each split 9:1 into a training set and a test set; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding data set, uniformly scaled to the resolution 416×128 and fed into the corresponding network; optimization proceeds iteratively, continuously modifying the network model parameters so that the loss of each batch is minimized;
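A minimal sketch of this split-and-resize preprocessing is given below. Whether the 9:1 split is random or sequential is not stated in the text, and the choice of OpenCV for resizing is an assumption; names and the fixed random seed are illustrative.

import numpy as np
import cv2   # any image library could be used for resizing; OpenCV is one choice

def split_and_resize(samples, train_ratio=0.9, size=(416, 128)):
    """Split samples 9:1 into train/test and resize images to 416x128 (w, h)."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(samples))
    n_train = int(len(samples) * train_ratio)
    resized = [cv2.resize(img, size) for img in samples]
    train = [resized[i] for i in order[:n_train]]
    test = [resized[i] for i in order[n_train:]]
    return train, test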

During training, the individual losses are computed as follows:

Internal-parameter supervised synthesis loss: in the network model training on natural images, the tensor I output by the depth information encoding network is taken as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the internal-parameter labels e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera internal parameters respectively. Following the principles of computer vision, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between image c and each of the two synthesized images;
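Several of the losses in this section reduce to the same photometric term: a sum of per-pixel, per-colour-channel intensity differences between a target image and a view synthesized at its viewpoint. A minimal NumPy sketch of that term is given below; it assumes absolute differences (the text only says "intensity differences"), takes already-aligned arrays of the same shape, and leaves the view-synthesis (warping) step out of scope.

import numpy as np

def photometric_loss(target, synthesized):
    """Sum of per-pixel, per-colour-channel absolute intensity differences.

    target, synthesized : (H, W, 3) arrays at the same viewpoint and resolution.
    """
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

# The supervised synthesis loss uses two synthesized views of image c:
# loss = photometric_loss(c, synth_from_b) + photometric_loss(c, synth_from_d)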

Unsupervised synthesis loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. Following computer vision algorithms, the two images adjacent to the target image are used to synthesize images at the viewpoint of the target image, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each of these synthesized images;

Internal-parameter error loss: in the network model training on natural images, the loss is computed as the sum of the absolute values of the component-wise differences between the output tensor O of the second network branch of the mutual-attention Transformer learning network and the internal-parameter labels e_t (t = 1, 2, 3, 4) of the training data;

Spatial structure error loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. Following computer vision algorithms, the two images adjacent to the image at the target viewpoint are used to reconstruct the three-dimensional coordinates of the image at the target viewpoint, the RANSAC algorithm is used to fit a spatial structure to the reconstructed points, and the spatial structure error loss is computed as the cosine distance between the normal vector obtained from the fit and the output tensor D of the mutual-attention Transformer learning network;
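A hedged sketch of this loss follows: a small RANSAC-style plane fit over the reconstructed 3D points, followed by the cosine distance between the fitted normal and the predicted normal D. The inlier threshold, iteration count and the exact definition of the fitted "spatial structure" are not specified in the text, so the values and names below are illustrative.

import numpy as np

def fit_plane_normal_ransac(points, n_iters=100, threshold=0.01, seed=0):
    """Estimate a plane normal from (N, 3) points with a simple RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, -1
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(normal) < 1e-8:          # degenerate sample, skip
            continue
        normal = normal / np.linalg.norm(normal)
        dists = np.abs((points - p0) @ normal)     # point-to-plane distances
        inliers = int((dists < threshold).sum())
        if inliers > best_inliers:
            best_normal, best_inliers = normal, inliers
    return best_normal

def cosine_distance(a, b):
    """1 - cosine similarity, e.g. between the fitted normal and tensor D."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))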

Transformation synthesis loss: in the network parameter training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. The two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image; for each of these synthesized images, after the position of every pixel has been obtained in the synthesis process, the output tensor B of the fourth network branch is applied as the displacement of the spatial-domain deformation of the synthesized image to form the synthesis result image, and the loss is then computed as the sum of per-pixel, per-colour-channel intensity differences between the image at the target viewpoint and the synthesis results at the target viewpoint;

Specific training steps:

(1) On the natural image data set, the depth information encoding network and the backbone network and first network branch of the mutual-attention Transformer learning network are trained for 80000 iterations

Each time, training data are taken from the natural image data set and uniformly scaled to the resolution 416×128; image c is input into the depth information encoding network, and image c and image τ are input into the mutual-attention Transformer learning network; the depth information encoding network and the backbone network and first network branch of the mutual-attention Transformer learning network are trained for 80000 iterations, and the training loss of each batch is computed from the internal-parameter supervised synthesis loss;

(2) On the natural image data set, the second network branch of the mutual-attention Transformer learning network is trained for 50000 iterations

Each time, training data are taken from the natural image data set and uniformly scaled to the resolution 416×128; image c is input into the depth information encoding network, image c and image τ are input into the mutual-attention Transformer learning network, and the second network branch is trained; the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the internal-parameter error loss;

(3) On the ultrasound image data set, the depth information encoding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained for 80000 iterations, giving the model parameters ρ

Each time, ultrasound training data are taken from the ultrasound image data set and uniformly scaled to the resolution 416×128; image j is input into the depth information encoding network, image j and image π are input into the mutual-attention Transformer learning network, and the depth information encoding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained; the training loss of each batch is computed as the sum of the transformation synthesis loss and the spatial structure error loss;

(4) On the CT image data set, the mutual-attention Transformer learning network is trained for 60000 iterations, giving the model parameters ρ′

Each time, CT training data are taken from the CT image data set and uniformly scaled to the resolution 416×128; image m and image σ are input into the mutual-attention Transformer learning network; the output of the depth information encoding network is taken as the depth, the outputs of the backbone network together with the first and second network branches are taken as the pose parameters and the camera internal parameters respectively, and the output tensor B of the fourth network branch of the mutual-attention Transformer learning network is taken as the displacement of the spatial-domain deformation. Two images at the viewpoint of image m are synthesized from image l and image n respectively, and the network is trained by continuously modifying its parameters and optimizing iteratively so that the loss of every image in every batch is minimized; the optimal network model parameters ρ′ are obtained after the iterations. When computing the loss for network optimization, a loss on the camera translation motion is added to the transformation synthesis loss and the spatial structure error loss;
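The exact form of the camera-translation term is not given in the text, so the following sketch only illustrates how the three terms could be combined; the translation term here simply penalizes the magnitude of the predicted translations, and the weights w1-w3 are placeholders, not values from the patent.

import numpy as np

def ct_training_loss(transform_synth_loss, structure_loss, pose_vecs,
                     w1=1.0, w2=1.0, w3=1.0):
    """Combine the CT-stage loss terms (weights and translation form assumed).

    pose_vecs : (2, 6) array from tensor L; the first three entries of each
                pose vector are assumed to be the translation components.
    """
    translation_loss = float(np.linalg.norm(pose_vecs[:, :3], axis=1).sum())
    return w1 * transform_synth_loss + w2 * structure_loss + w3 * translation_loss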

Step 4: 3D reconstruction of ultrasound or CT images

Using a self-sampled ultrasound or CT image sequence, the following three processes are carried out simultaneously to achieve the 3D reconstruction:

(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are computed as follows: the image is scaled to the resolution 416×128; for an ultrasound sequence, image j is input into the depth information encoding network and image j together with image π is input into the mutual-attention Transformer learning network; for a CT sequence, image m is input into the depth information encoding network and image m together with image σ is input into the mutual-attention Transformer learning network; prediction is performed with the model parameters ρ (for ultrasound) and ρ′ (for CT) respectively; the depth of each target frame is obtained from the depth information encoding network, the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network give the camera pose parameters and the camera internal parameters respectively, and the three-dimensional coordinates of the target image in the camera coordinate system are computed from the depth information and the camera internal parameters according to the principles of computer vision;

(2) During the 3D reconstruction of the image sequence, a key-frame sequence is built: the first frame of the sequence is taken as the first frame of the key-frame sequence and as the current key frame; the frames after the current key frame are taken as target frames, and new key frames are selected dynamically in target-frame order (a skeleton of this selection loop is given after item (3) below). First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix. For any target frame, this pose parameter matrix is cumulatively multiplied by the camera pose parameters of the target frame; using the accumulated result together with the internal parameters and depth information of the target frame, the image at the viewpoint of the target frame is synthesized, and the error λ is computed as the sum of per-pixel, per-colour-channel intensity differences between the synthesized image and the target frame. Then, from the adjacent frames of the target frame, the image at the viewpoint of the target frame is synthesized using the camera pose parameters and internal parameters, and the error γ is computed as the sum of per-pixel, per-colour-channel intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then computed with formula (1):

Figure BDA0003192217030000211 — formula (1) for the synthesis error ratio Z (given as an image in the original)

When Z is greater than 1.2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this is iterated until the key-frame sequence is established;

(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. For any target frame, its resolution is scaled to M×N, where M = 450 and N = 300 for ultrasound images and M = N = 512 for CT images; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information output by the network, and the three-dimensional coordinates of every pixel of the target frame in the world coordinate system are then computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key-frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
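A skeleton of the key-frame selection loop of step (2) is sketched below. Formula (1) is only given as an image in the patent, so the exact form of Z cannot be reproduced here; the sketch assumes it is the ratio of the two synthesis errors, and it assumes the accumulated relative pose restarts whenever a new key frame is chosen. All names, the callables and both assumptions are illustrative, not a definitive implementation.

import numpy as np

def build_keyframes(frames, pose_matrices, synth_error, adjacent_error, eta=1.2):
    """Skeleton of the key-frame selection loop.

    frames         : list of frame indices; frames[0] is the first key frame.
    pose_matrices  : list of 4x4 camera pose matrices, one per frame.
    synth_error    : callable(frame, accumulated_pose) -> lambda-type error.
    adjacent_error : callable(frame) -> gamma-type error from adjacent frames.
    eta            : threshold on the synthesis error ratio Z (1.2 in the embodiment).
    """
    keyframes = [frames[0]]
    accumulated = np.eye(4)                 # pose relative to the current key frame
    for idx in frames[1:]:
        accumulated = accumulated @ pose_matrices[idx]
        lam = synth_error(idx, accumulated)
        gam = adjacent_error(idx)
        z = lam / gam                       # assumed form of formula (1)
        if z > eta:
            keyframes.append(idx)
            accumulated = np.eye(4)         # the new key frame becomes current
    return keyframes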

In this embodiment, the networks are trained on the constructed natural image, ultrasound image and CT image training sets, and tested on 10 ultrasound image sequences and 1 CT image sequence from public data sets, with the error computed using the transformation synthesis loss. In the error calculation for ultrasound or CT images, the two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image, and the error is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each of the two synthesized images at the target viewpoint.

Table 1 gives the errors computed for the reconstruction of the ultrasound image sequences, and Table 2 gives the errors computed for the reconstruction of the CT image sequence. In this embodiment, DenseNet is used to segment the ultrasound or CT images before the 3D reconstruction. Fig. 1 shows the three-dimensional reconstruction results of ultrasound images obtained with the present invention, and Fig. 2 shows the three-dimensional reconstruction results of CT images obtained with the present invention; it can be seen that the present invention obtains fairly accurate reconstruction results.

Table 1

No.    Error
1      0.1299164668818727
2      0.0368915316811806
3      0.07339861854471304
4      0.09744906178316476
5      0.1018028589374692
6      0.08109420171719985
7      0.051973303110074524
8      0.0988887820759697
9      0.10880799129583894
10     0.06647273849340957

Table 2

Sequence No.    Error
1               0.058544783606001315
2               0.0667200513419954
3               0.06821816611230745
4               0.06780729271604191
5               0.11862437423632731
6               0.10054601129420655
7               0.12442189492200881
8               0.15065656014245987
9               0.10756279393662936
10              0.11451064929672831

Claims (1)

1. A medical image three-dimensional reconstruction method based on a mutual-attention Transformer, characterized in that an ultrasound or CT image sequence is input with image resolution M×N, where 100 ≤ M ≤ 2000 and 100 ≤ N ≤ 2000, and the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
Selecting a natural image website that provides image sequences and the corresponding camera internal parameters; downloading a image sequences and the internal parameters corresponding to each sequence from the website, where 1 ≤ a ≤ 20; for each image sequence, every 3 adjacent frames are recorded as image b, image c and image d; image b and image d are concatenated along the colour channels to obtain an image τ, and image c and image τ form one data element, wherein image c is the natural target image, the sampling viewpoint of image c serves as the target viewpoint, and the internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates; if fewer than 3 frames remain at the end of an image sequence, they are discarded; a natural image data set is constructed from all sequences, the constructed natural image data set containing f elements with 3000 ≤ f ≤ 20000;
(b) Constructing ultrasound image datasets
Sampling g ultrasonic image sequences, wherein g is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames of images as an image i, an image j and an image k, splicing the image i and the image k according to color channels to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, the sampling viewpoint of the image j is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, and constructing an ultrasonic image data set by utilizing all the sequences, wherein F elements are contained in the constructed ultrasonic image data set, and F is more than or equal to 1000 and less than or equal to 20000;
(c) Constructing CT image datasets
Sampling h CT image sequences, wherein h is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames as an image l, an image m and an image n, splicing the image l and the image n according to a color channel to obtain an image sigma, forming a data element by the image m and the image sigma, wherein the image m is a CT target image, a sampling viewpoint of the image m is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, constructing a CT image data set by utilizing all the sequences, wherein xi elements are in the constructed CT image data set, and the xi is more than or equal to 1000 and less than or equal to 20000;
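
In each of the three data sets, a data element therefore pairs a target frame with its two neighbours stacked along the colour channels. A minimal sketch of that stacking (Python with NumPy; names are illustrative, not from the claim):

    import numpy as np

    def make_element(prev_frame, target_frame, next_frame):
        # stack the two neighbouring frames along the colour channels: H x W x 6
        stacked = np.concatenate([prev_frame, next_frame], axis=-1)
        # the element is the pair (target image, stacked neighbour image)
        return target_frame, stacked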
Step 2: construction of neural networks
The resolution of the image input to the neural network is p×o, where p is the width and o is the height, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000;
(1) Depth information coding network
Tensor H is used as input, the scale is alpha x o x p x 3, tensor I is used as output, the scale is alpha x o x p x 1, and alpha is the batch number;
the depth information coding network consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after coding and decoding processing in sequence;
the encoder consists of 5 units, wherein the first unit is a convolution unit, the 2 nd to 5 th units are all composed of residual error modules, in the first unit, 64 convolution kernels are formed, the shapes of the convolution kernels are 7 multiplied by 7, the step sizes of the convolution in the horizontal direction and the vertical direction are 2, the maximum pooling treatment is carried out once after the convolution, the 2 nd to 5 th units respectively comprise 3,4,6,3 residual error modules, each residual error module carries out 3 times of convolution, the shapes of the convolution kernels are 3 multiplied by 3, and the numbers of the convolution kernels are 64, 128, 256 and 512;
the decoder consists of 6 decoding units, each decoding unit comprises deconvolution and convolution processing, the deconvolution and convolution processing have the same shape and number of convolution kernels, the shape of the convolution kernels in the 1 st to 6 th decoding units is 3 multiplied by 3, the number of the convolution kernels is 512, 256, 128, 64, 32 and 16 respectively, the encoder and the network layer of the decoder are connected in a cross-layer manner, and the corresponding relationship of the cross-layer connection is as follows: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
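
As a hedged sketch of the decoder structure (PyTorch-style Python; padding, activation and the exact cross-layer pairing are assumptions, only the deconvolution-plus-convolution pattern with a cross-layer connection follows the claim):

    import torch
    import torch.nn as nn

    class DecodeUnit(nn.Module):
        # one decoding unit: deconvolution, cross-layer concatenation, convolution
        def __init__(self, in_ch, out_ch, skip_ch):
            super().__init__()
            self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                             padding=1, output_padding=1)
            self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x, skip):
            x = self.act(self.deconv(x))
            x = torch.cat([x, skip], dim=1)   # cross-layer connection from the encoder
            return self.act(self.conv(x))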
(2) Mutual-attention Transformer learning network
The mutual-attention Transformer learning network consists of a backbone network and 4 network branches, and the 4 network branches are used to predict the tensors L, O, D and B respectively;
tensor J and tensor C are used as input, the scales are alpha x O x p x 3 and alpha x O x p x 6 respectively, the outputs are tensor L, tensor O, tensor D and tensor B, the tensor L scale is alpha x 2 x 6, the tensor O scale is alpha x 4 x 1, the tensor D scale is alpha x 3, the tensor B scale is alpha x O x p x 4, alpha is the batch number,
the backbone network is designed for 3-phase cross-view coding:
1) The cross-view coding of the 1 st stage comprises embedded coding of the 1 st stage and attention coding of the 1 st stage
The embedded coding of the 1 st stage respectively carries out convolution operation on the first 3 characteristic components of the last dimension of the tensor J and the last 3 characteristic components of the last dimension of the tensor C, the convolution kernel scale is 7 multiplied by 7, the characteristic channel number is 24, the coding characteristics are transformed into a sequence structure from the spatial domain shape of the image characteristics by the serialization processing, and the 1 st stage embedded coding 1, the 1 st stage embedded coding 2 and the 1 st stage embedded coding 3 are respectively obtained by the layer normalization processing;
The attention code of the 1 st stage is obtained by concatenating the embedded code 1 of the 1 st stage and the embedded code 2 of the 1 st stage according to the last dimension; concatenating the 1 st stage embedded code 1 and the 1 st stage embedded code 3 according to the last dimension to obtain a 1 st stage attention code input feature 2; concatenating the 1 st stage embedded code 2 and the 1 st stage embedded code 1 according to the last dimension to obtain a 1 st stage attention code input characteristic 3; concatenating the 1 st stage embedded code 3 and the 1 st stage embedded code 1 according to the last dimension to obtain a 1 st stage attention code input characteristic 4; -attention encoding the 4 input features of the 1 st phase attention encoding: taking a first half channel characteristic as a target coding characteristic, a second half channel characteristic as a source coding characteristic and then carrying out separable convolution operation on the target coding characteristic and the source coding characteristic according to a last dimension in the 1 st stage, wherein the convolution kernel scale is 3 multiplied by 3, the characteristic channel number is 24, the step sizes in the horizontal direction and the vertical direction are 1, the processing result of the target coding characteristic is taken as a query keyword K coding vector and a numerical value V coding vector for attention learning, the processing result of the source coding characteristic is taken as a query Q coding vector for attention learning, then, the attention weight matrix of each attention coding input characteristic is calculated by utilizing a multi-head attention method, the number of heads is 1, the characteristic channel number is 24, finally, each attention weight matrix is added with the target coding characteristic of each attention coding input characteristic to obtain 4 cross-view coding characteristics in the 1 st stage, and the average characteristic of the 1 st and 2 nd cross-view coding characteristics of the 4 cross-view coding characteristics is taken as a 1 st stage cross-view cross-layer characteristic; taking the 1 st stage cross-view cross-layer feature, the 1 st stage 3 rd cross-view coding feature and the 1 st stage 4 th cross-view coding feature as 1 st stage cross-view coding results; taking the 1 st stage cross-view coding result as a 2 nd stage cross-view coding input, and concatenating the 1 st stage cross-view coding result according to the last dimension to obtain a 1 st stage concatenated coding result;
2) The cross-view coding of phase 2 includes embedded coding of phase 2 and attention coding of phase 2
The embedded coding of the 2 nd stage, the embedded coding of each feature in the cross-view coding result of the 1 st stage is carried out, the number of feature channels of convolution operation is 64, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, the serialization processing transforms coding features from the spatial domain shape of image features into a sequence structure, and the layer normalization processing of the features obtains the 2 nd stage embedded coding 1, the 2 nd stage embedded coding 2 and the 2 nd stage embedded coding 3;
the attention code of the 2 nd stage, the embedded code 1 of the 2 nd stage and the embedded code 2 of the 2 nd stage are connected in series according to the last dimension to obtain the input characteristic 1 of the attention code of the 2 nd stage; concatenating the 2 nd stage embedded code 1 and the 2 nd stage embedded code 3 according to the last dimension to obtain a 2 nd stage attention code input feature 2; concatenating the 2 nd stage embedded code 2 and the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input characteristic 3; concatenating the 2 nd stage embedded code 3 with the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input feature 4, taking each input feature as a target code feature according to the last dimension, taking the first half channel feature as a target code feature, taking the second half channel feature as a source code feature, respectively carrying out separable convolution operation on the target code feature and the source code feature, wherein the convolution kernel dimensions are 3×3, the feature channel number is 64, the step sizes in the horizontal direction and the vertical direction are 2, the processing result of the target code feature is taken as a query keyword K code vector and a numerical value V code vector for attention learning, the processing result of the source code feature is taken as a query Q code vector for attention learning, then, calculating an attention weight matrix of each attention code input feature by utilizing a multi-head attention method, the number of heads is 3, the feature channel number is 64, finally, adding the attention weight of each attention code input feature and the target code feature of each attention code input feature to 4 cross-view code features, and utilizing the 1 st cross-view feature and the 2 nd stage cross-view code feature as an average cross-view feature; taking the 2 nd stage cross-view cross-layer feature, the 2 nd stage 3 rd cross-view coding feature and the 2 nd stage 4 th cross-view coding feature as 2 nd stage cross-view coding results; taking the 2 nd stage cross-view coding result as a 3 rd stage cross-view coding input, and concatenating the 2 nd stage cross-view coding result according to the last dimension to obtain a 2 nd stage concatenated coding result;
3) The 3 rd stage cross-view coding includes 3 rd stage embedded coding and 3 rd stage attention coding
The embedded coding of the 3 rd stage, each feature in the cross-view coding result of the 2 nd stage is subjected to embedded coding processing, convolution operation is carried out, the convolution kernel scale is 3 multiplied by 3, the number of feature channels is 128, the step length in the horizontal direction and the step length in the vertical direction are 2, the serialization processing transforms coding features from the spatial domain shape of the image features into a sequence structure, and the layer normalization processing of the features is carried out to obtain a 3 rd stage embedded coding 1, a 3 rd stage embedded coding 2 and a 3 rd stage embedded coding 3;
the 3 rd stage attention code, the 3 rd stage embedded code 1 and the 3 rd stage embedded code 2 are connected in series according to the last dimension to obtain the 3 rd stage attention code input characteristic 1; concatenating the 3 rd stage embedded code 1 and the 3 rd stage embedded code 3 according to the last dimension to obtain a 3 rd stage attention code input feature 2; concatenating the 3 rd stage embedded code 2 and the 3 rd stage embedded code 1 according to the last dimension to obtain a 3 rd stage attention code input characteristic 3; concatenating the 3 rd stage embedded code 3 and the 3 rd stage embedded code 1 according to the last dimension to obtain a 3 rd stage attention code input feature 4; taking the first half channel characteristic as a target coding characteristic, the second half channel characteristic as a source coding characteristic, respectively carrying out separable convolution operation on the target coding characteristic and the source coding characteristic, wherein the convolution kernel scale is 3 multiplied by 3, the characteristic channel number is 128, the step length in the horizontal direction and the step length in the vertical direction are 2, taking the processing result of the target coding characteristic as a query keyword K coding vector and a numerical V coding vector for attention learning, taking the processing result of the source coding characteristic as a query Q coding vector for attention learning, then calculating an attention weight matrix of each attention coding input characteristic by utilizing a multi-head attention method, the number of heads is 6, the characteristic channel number is 128, finally adding the weight matrix of each attention coding input characteristic in the 3 rd stage with the target coding characteristic of each attention coding input characteristic to obtain 4 cross-view coding characteristics in the 3 rd stage, and taking the average characteristics of the 1 st and 2 nd characteristics of the cross-view coding characteristics as cross-view cross-layer characteristics in the 3 rd stage; taking the 3 rd-stage cross-view cross-layer feature, the 3 rd-stage 3 rd cross-view coding feature and the 3 rd-stage 4 th cross-view coding feature as 3 rd-stage cross-view coding results; concatenating the 3 rd stage cross-view coding result according to the last dimension to obtain a 3 rd stage concatenated coding result;
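
An illustrative sketch of the mutual-attention step shared by the three stages (PyTorch-style Python; the stage-1 width of 24 channels and 1 head are used as defaults, and the padding, sequence layout and normalisation details are assumptions):

    import torch
    import torch.nn as nn

    class MutualAttention(nn.Module):
        def __init__(self, ch=24, heads=1):
            super().__init__()
            # depthwise-separable convolutions producing the K/V and Q encodings
            self.sep_target = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                                            nn.Conv2d(ch, ch, 1))
            self.sep_source = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                                            nn.Conv2d(ch, ch, 1))
            self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

        def forward(self, x):
            # first half of the channels: target coding feature; second half: source
            target, source = x.chunk(2, dim=1)
            kv = self.sep_target(target).flatten(2).transpose(1, 2)  # K and V
            q = self.sep_source(source).flatten(2).transpose(1, 2)   # Q
            out, _ = self.attn(q, kv, kv)
            out = out.transpose(1, 2).reshape_as(target)
            # the attention result is added to the target coding feature
            return out + target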
For the 1 st network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the resulting features were sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained features are concatenated with the 3 rd stage concatenated coding result, and the following 3 unit processes are performed: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 12, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; predicting the obtained characteristic results of the 12 channels according to a 2 multiplied by 6 form to obtain a tensor L result;
For the 2 nd network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then the obtained characteristics are connected with the 2 nd stage serial connection coding result in series, and the following 2 unit processing is carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the obtained characteristics are connected with the 3 rd stage serial connection coding result in series, and 2 unit processing is carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 4, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; taking the obtained 4-channel characteristics as the result of tensor O;
For the 3 rd network branch, the 3 rd stage concatenated code result is processed by the following 4 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 512, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 1024, the convolution kernel scales are 3×3, the step sizes in the horizontal direction and the vertical direction are 2, in the 4 th unit processing, the number of characteristic channels of convolution operation is 3, the convolution kernel scales are 1×1, the step sizes in the horizontal direction and the vertical direction are 1, and the obtained characteristics are used as the result of tensor D;
for the 4 th network branch, performing one-time deconvolution operation, feature activation and batch normalization processing on the cross-layer features of the cross-view in the 1 st stage, wherein in the deconvolution operation, the number of the convolved feature channels is 16, the convolution kernel scales are 3 multiplied by 3, and the step sizes in the horizontal direction and the vertical direction are 2; the obtained result is marked as a decoder cross-layer characteristic 1, and the cross-view cross-layer characteristic of the 1 st stage is processed by the following 2 units: when the 1 st unit is processed, the number of convolution operation characteristic channels is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the processing characteristic is marked as a decoder cross-layer characteristic 2; processing the 2 nd unit, carrying out convolution operation, wherein the number of characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, carrying out characteristic activation and batch normalization processing, carrying out series connection on the obtained characteristic and the 2 nd stage cross-view cross-layer characteristic, and carrying out the processing of the following 2 units on the series connection result: when the 1 st unit is processed, the number of characteristic channels of convolution is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristics are marked as decoder cross-layer characteristics 3; when the 2 nd unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, then the obtained characteristic is connected with the 3 rd stage cross-view cross-layer characteristic in series, the following 3 unit processes are carried out, when the 1 st unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristic is marked as the decoder cross-layer characteristic 4; when the 2 nd unit is processed, the number of the characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the processing characteristics are marked as decoder cross-layer characteristics 5; when the 3 rd unit is processed, the number of the convolved characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, and the 4 th network branch coding characteristic is obtained after the processing;
Decoding is further carried out, and deconvolution operation is carried out on the 4 th network branch coding feature for 1 time: the number of characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 2, the characteristics are activated and normalized in batches, the obtained result is connected with the cross-layer characteristics 5 of the decoder in series, and one convolution operation is carried out: the number of the characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, and deconvolution operation is carried out on the obtained result: the number of the characteristic channels is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristic activation and batch normalization are carried out, the obtained result is connected with the cross-layer characteristic 4 of the decoder in series, and one convolution operation is carried out: the number of characteristic channels is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the obtained result is subjected to deconvolution operation once: the number of the characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristic activation and batch normalization are carried out, the obtained result is connected with the cross-layer characteristic 3 of the decoder in series, and one convolution operation is carried out: the number of characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 1, the characteristics are activated and subjected to batch normalization processing, the obtained characteristics are used as the 4 th scale result of tensor B, meanwhile, 1 deconvolution operation is carried out on the obtained characteristics, the number of deconvoluted characteristic channels is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 2, the characteristics are activated and subjected to batch normalization processing, the obtained characteristics are connected with cross-layer characteristics 2 of a decoder in series, and one convolution operation is carried out: the number of the characteristic channels is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, the obtained characteristic is used as the 3 rd scale result of the tensor B, and meanwhile, the obtained characteristic is subjected to 1 deconvolution operation: the number of deconvolution characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristics are activated and normalized in batches, the obtained characteristics are connected with the cross-layer characteristics 1 of the 
decoder in series, and then one convolution operation is carried out: the number of the characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, the obtained characteristic is used as the 2 nd scale result of the tensor B, and meanwhile, the obtained characteristic is subjected to 1 deconvolution operation: the number of the characteristic channels is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, the characteristics are activated and subjected to batch normalization, the obtained characteristics are connected with the up-sampling result of the 3 rd scale characteristics in series, and then one convolution operation is carried out: the number of the characteristic channels is 16, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristics are activated and subjected to batch normalization, the obtained characteristics are used as the 1 st scale result of the tensor B, and the 4 th scale result of the tensor B is utilized to obtain the output of the 4 th network branch;
Step 3: training of neural networks
The samples in the natural image data set, the ultrasound image data set and the CT image data set are each divided into a training set and a testing set at a ratio of 9:1, where the data in the training set are used for training and the data in the testing set for testing; during training, the training data are taken from the corresponding data set, uniformly scaled to resolution p×o and input into the corresponding network, and iterative optimization is performed, minimizing the loss of each batch by continuously modifying the network model parameters;
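
A hedged sketch of the 9:1 partition (Python; the shuffling and fixed seed are assumptions, only the ratio follows the text):

    import random

    def split_dataset(samples, ratio=0.9, seed=0):
        items = list(samples)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * ratio)
        # first part for training, remainder for testing
        return items[:cut], items[cut:]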
in the training process, the calculation method of each loss comprises the following steps:
Internal parameter supervised synthesis loss: in the network model training on natural images, the tensor I output by the depth information coding network is used as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the internal parameter label e_t (t = 1, 2, 3, 4) of the training data are used as the pose parameters and the camera internal parameters respectively; according to the computer vision principle algorithm, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between image c and the two synthesized images;
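
As an illustrative sketch of such a synthesis (Python with NumPy; nearest-neighbour sampling and the variable names are assumptions, the actual sampling scheme may differ), a source image is warped into the target viewpoint using the predicted depth, a 4×4 relative pose and the intrinsic matrix:

    import numpy as np

    def synthesize_view(src, depth, K, T_src_tgt):
        # warp the source image into the target viewpoint (nearest-neighbour sampling)
        h, w = depth.shape
        out = np.zeros_like(src)
        K_inv = np.linalg.inv(K)
        for v in range(h):
            for u in range(w):
                p_tgt = depth[v, u] * (K_inv @ np.array([u, v, 1.0]))  # target camera frame
                p_src = T_src_tgt[:3, :3] @ p_tgt + T_src_tgt[:3, 3]   # source camera frame
                uv = K @ p_src
                us, vs = int(round(uv[0] / uv[2])), int(round(uv[1] / uv[2]))
                if 0 <= us < w and 0 <= vs < h:
                    out[v, u] = src[vs, us]
        return out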
Unsupervised synthesis loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; according to the computer vision algorithm, images at the target image viewpoint are synthesized from the two neighbouring images of the target image, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each synthesized image;
Internal parameter error loss: in the network model training on natural images, the loss is computed as the sum of the absolute values of the component-wise differences between the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network and the internal parameter label e_t (t = 1, 2, 3, 4) of the training data;
Spatial structure error loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; according to the computer vision algorithm, the three-dimensional coordinates of the image at the target viewpoint are reconstructed from its two neighbouring images, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the spatial structure error loss is computed from the normal vector obtained by the fitting and the output tensor D of the 3rd network branch of the mutual-attention Transformer learning network;
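
An illustrative sketch of the plane fitting step (Python with NumPy; the plane model, iteration count, inlier threshold and the cosine-based comparison are assumptions, since the claim only specifies RANSAC fitting and a comparison with the predicted normal):

    import numpy as np

    def ransac_plane_normal(points, iters=200, thresh=0.01, seed=0):
        # fit a plane to N x 3 reconstructed points and return its unit normal
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        best_normal, best_inliers = None, -1
        for _ in range(iters):
            p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-8:
                continue
            n /= np.linalg.norm(n)
            inliers = int((np.abs((points - p0) @ n) < thresh).sum())
            if inliers > best_inliers:
                best_inliers, best_normal = inliers, n
        return best_normal

    def structure_error(points, predicted_normal):
        # one plausible error: 1 - |cosine| between fitted and predicted normals
        n = ransac_plane_normal(points)
        p = predicted_normal / np.linalg.norm(predicted_normal)
        return 1.0 - abs(float(n @ p))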
Conversion synthesis loss: in the network parameter training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; two synthesized images at the target image viewpoint are constructed from the two neighbouring images of the target image; for each of the synthesized images, the output tensor B of the 4th network branch is used as the spatial-domain deformation displacement of each pixel position obtained during synthesis, forming a synthesized result image; the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each synthesized result at the target viewpoint;
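
A minimal sketch of applying such a displacement field to a synthesized image before the photometric comparison (Python with NumPy; a two-component per-pixel displacement and nearest-neighbour resampling are assumptions for illustration):

    import numpy as np

    def apply_displacement(image, flow):
        # shift each pixel of the synthesized image by its predicted displacement
        # flow[..., 0] = horizontal offset, flow[..., 1] = vertical offset
        h, w = image.shape[:2]
        out = np.zeros_like(image)
        for v in range(h):
            for u in range(w):
                us = int(round(u + flow[v, u, 0]))
                vs = int(round(v + flow[v, u, 1]))
                if 0 <= us < w and 0 <= vs < h:
                    out[v, u] = image[vs, us]
        return out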
The specific training steps are as follows:
(1) On the natural image data set, the depth information coding network, the backbone network of the mutual-attention Transformer learning network and the 1st network branch are each trained 80000 times
Training data are taken from the natural image data set each time and uniformly scaled to resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the mutual-attention Transformer learning network; the depth information coding network, the backbone network of the mutual-attention Transformer learning network and the 1st network branch are trained 80000 times, and the training loss of each batch is computed with the internal parameter supervised synthesis loss;
(2) On the natural image data set, the 2nd network branch of the mutual-attention Transformer learning network is trained 50000 times
Training data are taken from the natural image data set each time and uniformly scaled to resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the mutual-attention Transformer learning network; the 2nd network branch is trained, and the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the internal parameter error loss;
(3) On the ultrasound image data set, the depth information coding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained 80000 times to obtain the model parameters ρ
Ultrasound training data are taken from the ultrasound image data set each time and uniformly scaled to resolution p×o; image j is input into the depth information coding network, and image j and image π are input into the mutual-attention Transformer learning network; the depth information coding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained, and the training loss of each batch is computed as the sum of the conversion synthesis loss and the spatial structure error loss;
(4) On the CT image data set, the mutual-attention Transformer learning network is trained 60000 times to obtain the model parameters ρ'
CT image training data are taken from the CT image data set each time and uniformly scaled to resolution p×o; image m and image σ are input into the mutual-attention Transformer learning network; the output of the depth information coding network is used as the depth, the outputs of the backbone network and of the 1st and 2nd network branches are used as the pose parameters and the camera internal parameters respectively, and the output tensor B of the 4th network branch of the mutual-attention Transformer learning network is used as the spatial-domain deformation displacement; two images at the viewpoint of image m are synthesized from image l and image n respectively; the network is trained by continuously modifying its parameters and iterating the optimization so that the loss of each image of each batch is minimized, and the optimal network model parameters ρ' are obtained after the iteration; when computing the network optimization loss, a camera translational motion loss is added in addition to the conversion synthesis loss and the spatial structure error loss;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Using an ultrasound or CT sequence image from the sample, three-dimensional reconstruction is achieved by simultaneously performing the following 3 processes:
(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are computed as follows: the image is scaled to resolution p×o; for an ultrasound sequence image, image j is input into the depth information coding network, and image j and image π are input into the mutual-attention Transformer learning network; for a CT sequence image, image m is input into the depth information coding network, and image m and image σ are input into the mutual-attention Transformer learning network; prediction is performed with the model parameters ρ and ρ' respectively, the depth of each target frame is obtained from the depth information coding network, the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the camera pose parameters and the camera internal parameters respectively, and the three-dimensional coordinates of the target image in the camera coordinate system are computed from the depth information and the camera internal parameters of the target image according to the principles of computer vision;
(2) During the three-dimensional reconstruction of the sequence image, a key frame sequence is established: the first frame of the sequence is taken as the first frame of the key frame sequence and as the current key frame, the frames after the current key frame are taken as target frames, and new key frames are selected dynamically in the order of the target frames: the pose parameter matrix of the target frame relative to the current key frame is first initialized with the identity matrix; for any target frame, this pose parameter matrix is multiplied by the camera pose parameters of the target frame, and the result, combined with the internal parameters and depth information of the target frame, is used to synthesize an image at the target frame viewpoint; an error λ is computed as the sum of per-pixel, per-colour-channel intensity differences between the synthesized image and the target frame; an image at the target frame viewpoint is also synthesized from an adjacent frame of the target frame using the pose parameters and camera internal parameters, and an error γ is computed as the sum of per-pixel, per-colour-channel intensity differences between this synthesized image and the target frame; the synthesis error ratio Z is then computed with formula (1):
[Formula (1): the synthesis error ratio Z, computed from the errors λ and γ; the formula is shown as an image in the original publication]
When Z is greater than a threshold η, where 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this iteration is repeated until the key frame sequence is established;
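
A hedged sketch of this key-frame selection loop (Python with NumPy; the accumulation of the relative pose matrix, the frame attributes and the two synthesis routines are placeholders and assumptions, and the form Z = λ/γ for formula (1) is likewise an assumption consistent with the threshold test):

    import numpy as np

    def build_keyframes(frames, synth_via_keyframe, synth_via_neighbor, eta=1.2):
        # frames[i].pose: predicted camera pose (4x4); frames[i].image: the frame itself
        keyframes = [(frames[0], np.eye(4))]
        T_rel = np.eye(4)                       # pose of target frame w.r.t. current key frame
        for frame in frames[1:]:
            T_rel = T_rel @ frame.pose          # multiply by the target frame camera pose
            lam = np.abs(synth_via_keyframe(frame, T_rel).astype(float) -
                         frame.image.astype(float)).sum()
            gam = np.abs(synth_via_neighbor(frame).astype(float) -
                         frame.image.astype(float)).sum()
            z = lam / max(gam, 1e-8)            # assumed form of the synthesis error ratio
            if z > eta:                         # 1 < eta < 2
                keyframes.append((frame, T_rel))
                T_rel = np.eye(4)               # the target frame becomes the current key frame
        return keyframes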
(3) The viewpoint of the first frame of the sequence image is taken as the origin of the world coordinate system; the resolution of any target image is scaled to M×N; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information obtained from the network output; and the three-dimensional coordinates in the world coordinate system of each pixel of the target frame are computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
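
A minimal sketch of this final world-coordinate computation (Python with NumPy; the composition order of the key-frame poses and the relative pose matrix is an assumption):

    import numpy as np

    def camera_to_world(points_cam, keyframe_poses, T_target_rel):
        # compose the poses of the key frames up to the current one with the
        # target frame's pose matrix relative to the current key frame
        T = np.eye(4)
        for T_key in keyframe_poses:
            T = T @ T_key
        T = T @ T_target_rel
        pts_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
        return (pts_h @ T.T)[:, :3]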
CN202110881635.7A 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer Active CN113689548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110881635.7A CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110881635.7A CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Publications (2)

Publication Number Publication Date
CN113689548A CN113689548A (en) 2021-11-23
CN113689548B true CN113689548B (en) 2023-06-23

Family

ID=78578764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110881635.7A Active CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Country Status (1)

Country Link
CN (1) CN113689548B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952966B (en) * 2024-03-26 2024-10-22 华南理工大学 Sinkhorn algorithm-based multi-mode fusion survival prediction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007282945A (en) * 2006-04-19 2007-11-01 Toshiba Corp Image processing device
CN112767467A (en) * 2021-01-25 2021-05-07 郑健青 Double-image depth estimation method based on self-supervision deep learning
CN112767532A (en) * 2020-12-30 2021-05-07 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN113076398A (en) * 2021-03-30 2021-07-06 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance

Also Published As

Publication number Publication date
CN113689548A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
Zheng et al. Smaformer: Synergistic multi-attention transformer for medical image segmentation
CN111145181B (en) Skeleton CT image three-dimensional segmentation method based on multi-view separation convolutional neural network
CN113689542B (en) A 3D reconstruction method of ultrasound or CT medical images based on self-attention Transformer
CN112767532B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN116129107A (en) Three-dimensional medical image segmentation method and system based on long-short-term memory self-attention model
CN113689545B (en) 2D-to-3D end-to-end ultrasound or CT medical image cross-modal reconstruction method
CN115908811B (en) CT image segmentation method based on transducer and convolution attention mechanism
CN116228823A (en) An artificial intelligence-based method for unsupervised cascade registration of magnetic resonance images
CN117036162B (en) Residual feature attention fusion method for lightweight chest CT image super-resolution
CN116309754B (en) Brain medical image registration method and system based on local-global information collaboration
CN117333750A (en) Spatial registration and local-global multi-scale multi-modal medical image fusion method
CN119785195B (en) Pathological hyperspectral image detection method based on trans-scale spatial spectrum feature fusion network
CN110415253A (en) A kind of point Interactive medical image dividing method based on deep neural network
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN117853730A (en) U-shaped fully convolutional medical image segmentation network based on convolution kernel attention mechanism
CN119600043A (en) Brain tumor MRI image segmentation model and method based on improved Swin UNETR network
CN118229695A (en) A medical image segmentation method based on PCCTrans
CN113689546B (en) A cross-modal 3D reconstruction method for ultrasound or CT images with two-view twin Transformers
CN113689544B (en) Cross-view geometric constraint medical image three-dimensional reconstruction method
Sun et al. Medical image super-resolution via transformer-based hierarchical encoder–decoder network
CN113689548B (en) A 3D reconstruction method of medical images based on mutual attention Transformer
CN112700534B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on feature migration
CN112734907B (en) Ultrasonic or CT medical image three-dimensional reconstruction method
Malczewski A framework for reconstructing super-resolution magnetic resonance images from sparse raw data using multilevel generative methods
CN113689547B (en) A method for 3D reconstruction of ultrasound or CT medical images based on cross-view visual Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant