CN113689548B - A 3D reconstruction method of medical images based on mutual attention Transformer - Google Patents

A 3D reconstruction method of medical images based on mutual attention Transformer

Info

Publication number
CN113689548B
CN113689548B (application CN202110881635.7A)
Authority
CN
China
Prior art keywords
image
stage
coding
characteristic
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110881635.7A
Other languages
Chinese (zh)
Other versions
CN113689548A (en)
Inventor
全红艳
董家顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202110881635.7A
Publication of CN113689548A
Application granted
Publication of CN113689548B
Status: Active

Classifications

    • G06T17/00 — Three-dimensional [3D] modelling for computer graphics
    • G06N3/045 — Combinations of networks (neural network architecture, e.g. interconnection topology)
    • G06N3/08 — Learning methods
    • G06N3/088 — Non-supervised learning, e.g. competitive learning
    • G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G16H30/20 — ICT specially adapted for handling medical images, e.g. DICOM, HL7 or PACS
    • G06T2207/10081 — Computed x-ray tomography [CT] (image acquisition modality)
    • G06T2207/10136 — 3D ultrasound image (image acquisition modality)
    • G06T2207/20081 — Training; Learning (special algorithmic details)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method for medical images based on a mutual-attention Transformer. The method is an unsupervised learning approach built on the mutual-attention Transformer: according to the characteristics of ultrasound or CT images, a convolutional neural network structure based on a mutual-attention mechanism is designed, and 3D reconstruction of ultrasound images is achieved through transfer learning under an unsupervised mechanism. The invention can effectively and accurately predict the three-dimensional geometric information of ultrasound or CT images, provides an effective 3D reconstruction solution for AI-assisted medical diagnosis in clinical practice, and further improves the efficiency of AI-assisted diagnosis.

Description

A 3D reconstruction method for medical images based on a mutual-attention Transformer

Technical Field

The invention belongs to the field of computer technology and relates to three-dimensional reconstruction of ultrasound or CT images in intelligent medical auxiliary diagnosis. Drawing on the imaging laws of natural images, it learns with a deep-learning mechanism, adopts an artificial-intelligence transfer-learning strategy and a mutual-attention Transformer encoding technique, and establishes an effective network structure that can reconstruct the three-dimensional geometric information of ultrasound or CT images.

Background Art

In recent years, artificial intelligence technology has developed rapidly, and in intelligent medical auxiliary diagnosis, 3D visualization can assist diagnosis in modern clinical medicine. At the same time, medical images contain little texture and much noise, and recovering the parameters of an ultrasound camera is particularly difficult, so research on 3D reconstruction of ultrasound or CT images still faces considerable difficulties, which makes 3D reconstruction of medical images a challenging research topic.

Meanwhile, the advanced artificial-intelligence techniques that have emerged in recent years make it possible to solve the problem of 3D reconstruction of ultrasound or CT images by building an effective deep-learning encoding model. Because of its strong feature-perception capability, the Transformer model is now widely used in medical image analysis.

Summary of the Invention

The purpose of the present invention is to provide a 3D reconstruction method for medical images based on a mutual-attention Transformer. The method adopts a multi-scale Transformer encoding structure and designs a multi-branch network structure; in addition, it is designed according to the characteristics of geometric imaging in computer vision, and uses a mutual-attention mechanism to fully exploit the interaction between different views, which improves the accuracy of 3D reconstruction. The invention can obtain a relatively fine three-dimensional structure of a medical target and has high practical value.

The specific technical scheme that realizes the purpose of the invention is as follows:

A 3D reconstruction method for medical images based on a mutual-attention Transformer. The method takes as input an ultrasound or CT image sequence whose image resolution is M×N, with 100≤M≤2000 and 100≤N≤2000. The 3D reconstruction process specifically includes the following steps:

Step 1: Construct the datasets

(a) Construct the natural-image dataset

Select a natural-image website that provides image sequences together with the corresponding camera intrinsic parameters, and download a image sequences and their intrinsic parameters from it, with 1≤a≤20. For each image sequence, every three adjacent frames are denoted image b, image c and image d. Image b and image d are spliced along the color channels to obtain image τ; image c and image τ form one data element, with image c as the natural target image and the sampling viewpoint of image c as the target viewpoint. The intrinsic parameters of images b, c and d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the natural-image dataset, which contains f elements with 3000≤f≤20000;
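
A minimal sketch of this grouping in Python/NumPy, assuming the frames of one sequence are already loaded as H×W×3 arrays and reading "every three adjacent frames" as non-overlapping triples; the function and field names (build_elements, "target", "neighbors") are illustrative, not from the patent:

```python
import numpy as np

def build_elements(frames, intrinsics):
    """Group one image sequence into (target image c, spliced image tau) data elements.

    frames     : list of H x W x 3 arrays (one downloaded image sequence)
    intrinsics : (e1, e2, e3, e4) shared by all frames of the sequence
    """
    elements = []
    usable = len(frames) - len(frames) % 3          # fewer than 3 leftover frames are discarded
    for s in range(0, usable, 3):
        b, c, d = frames[s], frames[s + 1], frames[s + 2]
        tau = np.concatenate([b, d], axis=-1)       # splice b and d along the color channels -> H x W x 6
        elements.append({"target": c, "neighbors": tau, "intrinsics": intrinsics})
    return elements
```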

(b) Construct the ultrasound-image dataset

Sample g ultrasound image sequences, with 1≤g≤20. For each sequence, every three adjacent frames are denoted image i, image j and image k. Image i and image k are spliced along the color channels to obtain image π; image j and image π form one data element, with image j as the ultrasound target image and its sampling viewpoint as the target viewpoint. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the ultrasound-image dataset, which contains F elements with 1000≤F≤20000;

(c) Construct the CT-image dataset

Sample h CT image sequences, with 1≤h≤20. For each sequence, every three adjacent frames are denoted image l, image m and image n. Image l and image n are spliced along the color channels to obtain image σ; image m and image σ form one data element, with image m as the CT target image and its sampling viewpoint as the target viewpoint. If fewer than three frames remain at the end of a sequence, they are discarded. All sequences are used to construct the CT-image dataset, which contains ξ elements with 1000≤ξ≤20000;

Step 2: Construct the neural networks

The resolution of every image fed to the neural networks is p×o, where p is the width and o is the height in pixels, with 100≤o≤2000 and 100≤p≤2000;

(1) Depth-information encoding network

Tensor H is the input, with shape α×o×p×3; tensor I is the output, with shape α×o×p×1, where α is the batch size;

The depth-information encoding network consists of an encoder and a decoder; tensor H is encoded and then decoded to obtain the output tensor I;

The encoder consists of five units. The first unit is a convolution unit and the second to fifth units are composed of residual modules. The first unit has 64 convolution kernels, all of shape 7×7, with a stride of 2 in both the horizontal and vertical directions, followed by one max-pooling operation. The second to fifth units contain 3, 4, 6 and 3 residual modules, respectively; each residual module performs three convolutions with 3×3 kernels, and the numbers of kernels are 64, 128, 256 and 512, respectively;

The decoder consists of six decoding units, each of which performs a deconvolution followed by a convolution with the same kernel shape and number of kernels. In the first to sixth decoding units the kernels are all 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16, respectively. Cross-layer (skip) connections are made between the encoder and decoder layers with the correspondence: 1 with 4, 2 with 3, 3 with 2, and 4 with 1;
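
A compact sketch of this encoder-decoder wiring is given below in tf.keras (available in the TensorFlow 1.14 environment named in the embodiment). It is an illustration under stated assumptions, not the patented implementation: the residual shortcuts, padding, the stride of the last decoding unit and the final sigmoid depth activation are choices made here so that the sketch builds and returns an o×p×1 output; all helper names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_bn(x, filters, kernel, stride=1):
    """convolution -> activation -> batch normalization"""
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def residual_unit(x, filters, n_modules, first_stride):
    """n_modules residual modules; each module applies three 3x3 convolutions."""
    for m in range(n_modules):
        s = first_stride if m == 0 else 1
        shortcut = layers.Conv2D(filters, 1, strides=s, padding="same")(x)
        y = conv_bn(x, filters, 3, s)
        y = conv_bn(y, filters, 3)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        x = layers.ReLU()(layers.Add()([shortcut, y]))
    return x

def depth_network(o=192, p=256):
    inp = layers.Input((o, p, 3))                    # tensor H: alpha x o x p x 3
    # encoder: unit 1 = 64 kernels of 7x7, stride 2, followed by max pooling
    e1 = conv_bn(inp, 64, 7, 2)
    x = layers.MaxPooling2D(2)(e1)
    e2 = residual_unit(x, 64, 3, 1)                  # units 2-5: 3, 4, 6, 3 residual modules
    e3 = residual_unit(e2, 128, 4, 2)
    e4 = residual_unit(e3, 256, 6, 2)
    e5 = residual_unit(e4, 512, 3, 2)
    # decoder: 6 units of deconvolution + convolution, skip links 1-4, 2-3, 3-2, 4-1
    skips = {0: e4, 1: e3, 2: e2, 3: e1}
    x = e5
    for i, f in enumerate([512, 256, 128, 64, 32, 16]):
        s = 2 if i < 5 else 1                        # assumption: last unit keeps the o x p resolution
        x = layers.Conv2DTranspose(f, 3, strides=s, padding="same", activation="relu")(x)
        if i in skips:
            x = layers.Concatenate()([x, skips[i]])
        x = conv_bn(x, f, 3)
    depth = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)   # tensor I: alpha x o x p x 1
    return Model(inp, depth)
```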

(2) Mutual-attention Transformer learning network

The mutual-attention Transformer learning network consists of one backbone network and four network branches, which predict tensor L, tensor O, tensor D and tensor B, respectively;

Tensor J and tensor C are the inputs, with shapes α×o×p×3 and α×o×p×6, respectively. The outputs are tensor L, tensor O, tensor D and tensor B, with shapes α×2×6, α×4×1, α×3 and α×o×p×4, respectively, where α is the batch size;

The backbone network is designed as three stages of cross-view encoding:

1) The first stage of cross-view encoding consists of first-stage embedding encoding and first-stage attention encoding

In the first-stage embedding encoding, convolution is applied separately to tensor J, to the first three feature components of the last dimension of tensor C, and to the last three feature components of the last dimension of tensor C; the kernels are all 7×7 and the number of feature channels is 24. Serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding first-stage embedding code 1, first-stage embedding code 2 and first-stage embedding code 3, respectively;

In the first-stage attention encoding, first-stage embedding code 1 and first-stage embedding code 2 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 1; first-stage embedding code 1 and first-stage embedding code 3 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 2; first-stage embedding code 2 and first-stage embedding code 1 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 3; first-stage embedding code 3 and first-stage embedding code 1 are concatenated along the last dimension to obtain first-stage attention-encoding input feature 4. Attention encoding is then performed on these four input features: for each first-stage attention-encoding input feature, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 24 feature channels, and a stride of 1 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (1 head, 24 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, each attention weight matrix is added to the target encoding feature of the corresponding input feature, giving four first-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the first-stage cross-view cross-layer feature. The first-stage cross-view cross-layer feature, the third first-stage cross-view encoding feature and the fourth first-stage cross-view encoding feature together form the first-stage cross-view encoding result. The first-stage cross-view encoding result is used as the input of the second-stage cross-view encoding, and its components are concatenated along the last dimension to obtain the first-stage concatenated encoding result;
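
A minimal sketch of one such mutual-attention block follows, written as TF-2-style eager/functional code for brevity. It assumes the block receives a spatial feature map whose channel axis holds the target half followed by the source half, derives K and V from the target half and Q from the source half through separable convolutions, computes single-head scaled dot-product attention over the serialized spatial positions, and adds the attention output back to the target encoding; the softmax formulation and the residual placement are interpretive choices, and all names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mutual_attention_block(x, channels=24, stride=1):
    """x: (B, H, W, 2C) concatenation of [target | source] features along the channel axis."""
    c = x.shape[-1] // 2
    target, source = x[..., :c], x[..., c:]
    # separable convolutions: the target result supplies K and V, the source result supplies Q
    tgt = layers.SeparableConv2D(channels, 3, strides=stride, padding="same")(target)
    src = layers.SeparableConv2D(channels, 3, strides=stride, padding="same")(source)
    shp = tf.shape(tgt)                                  # dynamic (B, H', W', channels)
    k = tf.reshape(tgt, (shp[0], -1, channels))          # serialize the spatial grid into a sequence
    v = k                                                # values share the target encoding
    q = tf.reshape(src, (shp[0], -1, channels))          # queries come from the other view
    attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) / float(channels) ** 0.5, axis=-1)
    out = tf.reshape(tf.matmul(attn, v), shp)
    return tgt + out                                     # attention result added to the target encoding
```

The second and third stages follow the same pattern with 64 channels / 3 heads and 128 channels / 6 heads, respectively, and a stride of 2 in the separable convolutions.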

2) The second stage of cross-view encoding consists of second-stage embedding encoding and second-stage attention encoding

In the second-stage embedding encoding, each feature in the first-stage cross-view encoding result is embedded: the convolution uses 64 feature channels, 3×3 kernels and a stride of 2 in both the horizontal and vertical directions; serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding second-stage embedding code 1, second-stage embedding code 2 and second-stage embedding code 3;

In the second-stage attention encoding, second-stage embedding code 1 and second-stage embedding code 2 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 1; second-stage embedding code 1 and second-stage embedding code 3 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 2; second-stage embedding code 2 and second-stage embedding code 1 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 3; second-stage embedding code 3 and second-stage embedding code 1 are concatenated along the last dimension to obtain second-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 64 feature channels, and a stride of 2 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (3 heads, 64 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, the attention weight matrix of each input feature is added to its target encoding feature, giving four second-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the second-stage cross-view cross-layer feature. The second-stage cross-view cross-layer feature, the third second-stage cross-view encoding feature and the fourth second-stage cross-view encoding feature together form the second-stage cross-view encoding result. The second-stage cross-view encoding result is used as the input of the third-stage cross-view encoding, and its components are concatenated along the last dimension to obtain the second-stage concatenated encoding result;

3) The third stage of cross-view encoding consists of third-stage embedding encoding and third-stage attention encoding

In the third-stage embedding encoding, each feature in the second-stage cross-view encoding result is embedded: the convolution uses 3×3 kernels, 128 feature channels and a stride of 2 in both the horizontal and vertical directions; serialization transforms the encoded features from the spatial layout of image features into a sequence structure, and layer normalization is applied, yielding third-stage embedding code 1, third-stage embedding code 2 and third-stage embedding code 3;

In the third-stage attention encoding, third-stage embedding code 1 and third-stage embedding code 2 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 1; third-stage embedding code 1 and third-stage embedding code 3 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 2; third-stage embedding code 2 and third-stage embedding code 1 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 3; third-stage embedding code 3 and third-stage embedding code 1 are concatenated along the last dimension to obtain third-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channels is taken as the target encoding feature and the second half as the source encoding feature; separable convolutions are applied to the target and source encoding features, with 3×3 kernels, 128 feature channels, and a stride of 2 in both the horizontal and vertical directions. The processed target encoding feature is used as the key (K) and value (V) vectors of attention learning, and the processed source encoding feature is used as the query (Q) vector. Multi-head attention (6 heads, 128 feature channels) is then used to compute the attention weight matrix of each attention-encoding input feature. Finally, the weight matrix of each third-stage attention-encoding input feature is added to its target encoding feature, giving four third-stage cross-view encoding features. The average of the first and second of these cross-view encoding features is used as the third-stage cross-view cross-layer feature. The third-stage cross-view cross-layer feature, the third third-stage cross-view encoding feature and the fourth third-stage cross-view encoding feature together form the third-stage cross-view encoding result, which is concatenated along the last dimension to obtain the third-stage concatenated encoding result;

For the first network branch, the first-stage concatenated encoding result is processed by two successive units: in the first unit, the convolution has 16 feature channels, 7×7 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then processed by two further units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization; in the second unit, the convolution has 64 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the third-stage concatenated encoding result and processed by the following three units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 2, followed by feature activation and batch normalization; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 12 feature channels, 1×1 kernels and a stride of 1, followed by feature activation and batch normalization. The resulting 12-channel features are predicted in the form 2×6 to give tensor L;
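
The repeating "convolution, feature activation, batch normalization" unit that the branches reuse, and the final reshaping of the 12-channel output into the α×2×6 pose tensor L, can be sketched as below. The reduction from a spatial map to one vector per image is not fixed by the text and is done here with global average pooling; the sketch also assumes the two concatenated feature maps already share a spatial size at the join. All names are illustrative.

```python
from tensorflow.keras import layers

def branch_unit(x, filters, kernel, stride):
    """one branch unit: convolution -> feature activation -> batch normalization"""
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def pose_branch(stage1_concat, stage3_concat):
    x = branch_unit(stage1_concat, 16, 7, 1)
    x = branch_unit(x, 32, 3, 2)
    x = branch_unit(x, 32, 7, 1)
    x = branch_unit(x, 64, 3, 2)
    x = layers.Concatenate()([x, stage3_concat])     # join with the stage-3 concatenated encoding result
    x = branch_unit(x, 64, 7, 2)
    x = branch_unit(x, 128, 3, 2)
    x = branch_unit(x, 12, 1, 1)                     # 12 channels = 2 relative poses x 6 parameters
    x = layers.GlobalAveragePooling2D()(x)           # assumption: pool to one 12-vector per image
    return layers.Reshape((2, 6))(x)                 # tensor L: alpha x 2 x 6
```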

For the second network branch, the first-stage concatenated encoding result is processed by two successive units: in the first unit, the convolution has 16 feature channels, 7×7 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the second-stage concatenated encoding result and processed by the following two units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the third-stage concatenated encoding result and processed by three units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 2, followed by feature activation and batch normalization; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 4 feature channels, 1×1 kernels and a stride of 1, followed by feature activation and batch normalization. The resulting 4-channel features are taken as tensor O;

For the third network branch, the third-stage concatenated encoding result is processed by the following four units: in the first unit, the convolution has 256 feature channels, 3×3 kernels and a stride of 1 in both directions, followed by feature activation and batch normalization; in the second unit, the convolution has 512 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization; in the third unit, the convolution has 1024 feature channels, 3×3 kernels and a stride of 2; in the fourth unit, the convolution has 3 feature channels, 1×1 kernels and a stride of 1. The resulting features are taken as tensor D;

For the fourth network branch, the first-stage cross-view cross-layer feature first undergoes one deconvolution, feature activation and batch normalization; the deconvolution has 16 feature channels, 3×3 kernels and a stride of 2 in both directions, and the result is recorded as decoder cross-layer feature 1. The first-stage cross-view cross-layer feature is then processed by the following two units: in the first unit, the convolution has 32 feature channels, 7×7 kernels and a stride of 1, followed by feature activation and batch normalization, and the processed feature is recorded as decoder cross-layer feature 2; in the second unit, the convolution has 32 feature channels, 3×3 kernels and a stride of 2, followed by feature activation and batch normalization. The resulting features are concatenated with the second-stage cross-view cross-layer feature, and the concatenation is processed by the following two units: in the first unit, the convolution has 64 feature channels, 7×7 kernels and a stride of 1, and the processed feature is recorded as decoder cross-layer feature 3; in the second unit, the convolution has 128 feature channels, 3×3 kernels and a stride of 2. The resulting features are then concatenated with the third-stage cross-view cross-layer feature and processed by the following three units: in the first unit, the convolution has 128 feature channels, 7×7 kernels and a stride of 1, and the processed feature is recorded as decoder cross-layer feature 4; in the second unit, the convolution has 256 feature channels, 3×3 kernels and a stride of 2, and the processed feature is recorded as decoder cross-layer feature 5; in the third unit, the convolution has 512 feature channels, 3×3 kernels and a stride of 2. After this processing, the fourth-branch encoding feature is obtained;

Decoding then proceeds as follows. The fourth-branch encoding feature undergoes one deconvolution (256 feature channels, 3×3 kernels, stride 2 in both directions), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and passed through one convolution (512 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization. The result then undergoes a deconvolution (256 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; it is concatenated with decoder cross-layer feature 4 and passed through one convolution (256 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization. The result then undergoes a deconvolution (128 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; it is concatenated with decoder cross-layer feature 3 and passed through one convolution (128 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the fourth-scale result of tensor B. At the same time, these features undergo one deconvolution (64 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 2 and passed through one convolution (64 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the third-scale result of tensor B. At the same time, these features undergo one deconvolution (32 feature channels, 3×3 kernels, stride 2), feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 1 and passed through one convolution (32 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the second-scale result of tensor B. At the same time, these features undergo one deconvolution (16 feature channels, 7×7 kernels, stride 2), feature activation and batch normalization; the result is concatenated with the upsampled third-scale features and passed through one convolution (16 feature channels, 3×3 kernels, stride 1), feature activation and batch normalization, and the resulting features are taken as the first-scale result of tensor B. The four scale results of tensor B form the output of the fourth network branch;

Step 3: Training the neural networks

The samples in the natural-image, ultrasound-image and CT-image datasets are each split 9:1 into a training set and a test set; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding dataset, uniformly scaled to resolution p×o, and fed into the corresponding network; the optimization is iterative, and the network model parameters are continually updated so that the loss of each batch is minimized;
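
The 9:1 split can be sketched in plain Python; the shuffle and the fixed seed are illustrative choices not specified in the text:

```python
import random

def split_9_to_1(elements, seed=0):
    """Split a dataset's elements 9:1 into a training set and a test set."""
    elements = list(elements)
    random.Random(seed).shuffle(elements)
    n_train = int(round(0.9 * len(elements)))
    return elements[:n_train], elements[n_train:]
```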

The losses used during training are computed as follows:

Intrinsics-supervised synthesis loss: when training the network model on natural images, the tensor I output by the depth-information encoding network is taken as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the intrinsic-parameter labels e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera intrinsic parameters, respectively. Following the principles of computer vision, image b and image d are each used to synthesize an image at the viewpoint of image c, and the loss is computed as the sum, over all pixels and color channels, of the intensity differences between image c and each of the two synthesized images;
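
The two ingredients of this loss — warping a neighboring view into the target viewpoint from depth, pose and intrinsics, and summing per-pixel, per-channel intensity differences — can be sketched in NumPy as below. The intrinsic matrix is assembled from (e1, e2, e3, e4) in the usual pinhole form, the 4×4 relative pose T is assumed to come from the predicted pose parameters, and the bilinear sampling that turns the projected coordinates into a synthesized image is omitted; all names are illustrative.

```python
import numpy as np

def photometric_loss(target, synthesized):
    """Sum over pixels and color channels of absolute intensity differences."""
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

def reproject(depth, K, T, u, v):
    """Map pixel (u, v) of the target view into a neighboring view for synthesis.

    depth : predicted depth of the pixel in the target view
    K     : 3x3 intrinsic matrix [[e1, 0, e3], [0, e2, e4], [0, 0, 1]]
    T     : 4x4 relative pose (target -> neighbor) taken from the predicted pose parameters
    """
    p_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # back-project the pixel to 3D
    p_nbr = (T @ np.append(p_cam, 1.0))[:3]                      # move it into the neighbor's frame
    uvw = K @ p_nbr                                              # project into the neighbor image
    return uvw[:2] / uvw[2]                                      # sampling location in the neighbor
```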

Unsupervised synthesis loss: when training the network model on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. Following computer-vision algorithms, the two images adjacent to the target image are each used to synthesize an image at the viewpoint of the target image, and the loss is computed as the sum, over all pixels and color channels, of the intensity differences between the target image and each synthesized image;

Intrinsic-parameter error loss: when training the network model on natural images, the loss is computed as the sum of the absolute differences between the components of the output tensor O of the second branch of the mutual-attention Transformer learning network and the intrinsic-parameter labels e_t (t = 1, 2, 3, 4) of the training data;
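
As a one-function sketch (NumPy, illustrative names), this loss is simply the sum of absolute component differences:

```python
import numpy as np

def intrinsics_error_loss(pred_o, label_e):
    """Sum of absolute differences between predicted intrinsics O and labels e_t, t = 1..4."""
    return np.abs(np.asarray(pred_o, dtype=float) - np.asarray(label_e, dtype=float)).sum()
```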

Spatial-structure error loss: when training the network model on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. Following computer-vision algorithms, the two images adjacent to the target-viewpoint image are used to reconstruct the three-dimensional coordinates of the target-viewpoint image, and the RANSAC algorithm is used to fit a spatial structure to the reconstructed points. The spatial-structure error loss is computed as the cosine distance between the normal vector obtained from the fit and the output tensor D of the third branch of the mutual-attention Transformer learning network;
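
A minimal sketch of this loss follows: a small RANSAC loop fits a plane to the reconstructed 3D points and the cosine distance compares its normal with the predicted tensor D. The plane model, inlier threshold and iteration count are assumptions made for the sketch; the patent only states that RANSAC fits a spatial structure.

```python
import numpy as np

def ransac_plane_normal(points, n_iter=100, thresh=0.01, seed=0):
    """points: (N, 3) reconstructed coordinates; returns the unit normal of the best-fitting plane."""
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, -1
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(normal) < 1e-8:
            continue                                        # degenerate (collinear) sample
        normal = normal / np.linalg.norm(normal)
        inliers = (np.abs((points - sample[0]) @ normal) < thresh).sum()
        if inliers > best_inliers:
            best_normal, best_inliers = normal, inliers
    return best_normal

def spatial_structure_loss(points, pred_d):
    """Cosine distance between the fitted normal and the third-branch output D."""
    n = ransac_plane_normal(points)
    d = np.asarray(pred_d, dtype=float)
    cos = (n @ d) / (np.linalg.norm(n) * np.linalg.norm(d) + 1e-8)
    return 1.0 - cos
```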

Transformation synthesis loss: when training the network parameters on ultrasound or CT images, the output tensor I of the depth-information encoding network is taken as the depth, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera intrinsic parameters, respectively. The two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image; for each of these synthesized images, after the position of each pixel has been obtained in the synthesis process, the output tensor B of the fourth branch is used as the displacement for spatial deformation of the synthesized image, giving the final synthesized result. The loss is then computed as the sum, over all pixels and color channels, of the intensity differences between the target-viewpoint image and each synthesized result;

Specific training steps:

(1) On the natural-image dataset, train the depth-information encoding network together with the backbone and the first branch of the mutual-attention Transformer learning network for 80,000 iterations

Each time, training data are taken from the natural-image dataset and uniformly scaled to resolution p×o; image c is fed into the depth-information encoding network, and image c and image τ are fed into the mutual-attention Transformer learning network. The depth-information encoding network and the backbone and first branch of the mutual-attention Transformer learning network are trained for 80,000 iterations, and the training loss of each batch is computed from the intrinsics-supervised synthesis loss;

(2) On the natural-image dataset, train the second branch of the mutual-attention Transformer learning network for 50,000 iterations

Each time, training data are taken from the natural-image dataset and uniformly scaled to resolution p×o; image c is fed into the depth-information encoding network, and image c and image τ are fed into the mutual-attention Transformer learning network. The second branch is trained, and the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the intrinsic-parameter error loss;

(3) On the ultrasound-image dataset, train the depth-information encoding network and the backbone and branches 1-4 of the mutual-attention Transformer learning network for 80,000 iterations, obtaining model parameters ρ

Each time, ultrasound training data are taken from the ultrasound-image dataset and uniformly scaled to resolution p×o; image j is fed into the depth-information encoding network, and image j and image π are fed into the mutual-attention Transformer learning network. The depth-information encoding network and the backbone and branches 1-4 of the mutual-attention Transformer learning network are trained, and the training loss of each batch is computed as the sum of the transformation synthesis loss and the spatial-structure error loss;

(4) On the CT-image dataset, train the mutual-attention Transformer learning network for 60,000 iterations, obtaining model parameters ρ′

Each time, CT training data are taken from the CT-image dataset and uniformly scaled to resolution p×o, and image m and image σ are fed into the mutual-attention Transformer learning network. The output of the depth-information encoding network is taken as the depth, the outputs of the first and second network branches are taken as the pose parameters and the camera intrinsic parameters, respectively, and the output tensor B of the fourth branch of the mutual-attention Transformer learning network is taken as the displacement for spatial deformation. Two images at the viewpoint of image m are synthesized from image l and image n, respectively. The network is trained by continually updating its parameters and optimizing iteratively so that the loss of every image in each batch is minimized, yielding the optimal network model parameters ρ′. When computing the loss for this optimization, a camera translation-motion loss is added to the transformation synthesis loss and the spatial-structure error loss;

Step 4: 3D reconstruction of ultrasound or CT images

Using a self-sampled ultrasound or CT image sequence, 3D reconstruction is achieved by carrying out the following three processes simultaneously:

(1) For any target image in the sequence, compute its three-dimensional coordinates in the camera coordinate system as follows: scale the image to resolution p×o; for an ultrasound sequence, feed image j into the depth-information encoding network and feed image j and image π into the mutual-attention Transformer learning network; for a CT sequence, feed image m into the depth-information encoding network and feed image m and image σ into the mutual-attention Transformer learning network; use model parameters ρ and ρ′, respectively, for prediction. The depth of each target frame is obtained from the depth-information encoding network, and the output tensor L of the first branch and the output tensor O of the second branch of the mutual-attention Transformer learning network are taken as the camera pose parameters and the camera intrinsic parameters, respectively. From the depth information of the target image and the camera intrinsic parameters, and following the principles of computer vision, the three-dimensional coordinates of the target image in the camera coordinate system are computed;
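
The per-pixel computation of camera-frame coordinates from the predicted depth map and intrinsics is the standard pinhole back-projection; a vectorized NumPy sketch with illustrative names:

```python
import numpy as np

def backproject_depth(depth, e1, e2, e3, e4):
    """depth: (H, W) predicted depth map -> (H, W, 3) points in the camera coordinate system."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    x = (u - e3) / e1 * depth                        # X = (u - cx) / fx * Z
    y = (v - e4) / e2 * depth                        # Y = (v - cy) / fy * Z
    return np.stack([x, y, depth], axis=-1)
```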

(2) During 3D reconstruction of the sequence, build a keyframe sequence: take the first frame of the sequence as the first frame of the keyframe sequence and as the current keyframe, and treat the frames after the current keyframe as target frames; new keyframes are selected dynamically in target-frame order. First, the pose-parameter matrix of the target frame relative to the current keyframe is initialized with the identity matrix. For any target frame, this pose-parameter matrix is cumulatively multiplied by the camera pose parameters of the target frame; the accumulated result, together with the intrinsic parameters and depth information of the target frame, is used to synthesize an image at the viewpoint of the target frame, and the error λ is computed from the sum, over all pixels and color channels, of the intensity differences between this synthesized image and the target frame. Then, from the frames adjacent to the target frame, an image at the viewpoint of the target frame is synthesized using the camera pose parameters and intrinsic parameters, and the error γ is computed from the sum, over all pixels and color channels, of the intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then computed by formula (1):

Figure BDA0003192217030000111 — formula (1), defining the synthesis error ratio Z in terms of λ and γ (given as an image in the original)

When Z is greater than a threshold η, with 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this is iterated until the key-frame sequence is established;

(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. For any target image, its resolution is scaled to M×N; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information output by the network, and the three-dimensional coordinates of every pixel of the target frame in the world coordinate system are then computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key-frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
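As a concrete illustration of step (1) above, the following minimal NumPy sketch lifts every pixel of a predicted depth map to 3D coordinates in the camera frame. It assumes the four intrinsic values predicted by the second branch are ordered (fx, fy, cx, cy); that ordering, the function names and the dummy values are illustrative assumptions rather than part of the described method.

import numpy as np

def backproject_to_camera(depth, intrinsics):
    """Lift every pixel of a depth map to 3D camera coordinates.

    depth      : (H, W) array of per-pixel depth values (network output I).
    intrinsics : iterable (fx, fy, cx, cy) -- assumed ordering of tensor O.
    returns    : (H, W, 3) array of 3D points in the camera frame.
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) / fx * depth                        # X = (u - cx) * Z / fx
    y = (v - cy) / fy * depth                        # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

# Toy usage at the working resolution 416x128 (width x height):
depth = np.full((128, 416), 2.0)                     # placeholder depth map
pts_cam = backproject_to_camera(depth, (400.0, 400.0, 208.0, 64.0))
print(pts_cam.shape)                                 # (128, 416, 3)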

Beneficial effects of the present invention:

The present invention designs a Transformer network model based on mutual attention and learns with a mutual-attention mechanism between different views, so that the intelligent perception capability of deep learning is fully exploited in the three-dimensional reconstruction of medical images and three-dimensional geometric information can be obtained automatically from two-dimensional ultrasound or CT images. The invention can be used to visualize clinical diagnostic targets, provides an effective 3D reconstruction solution for AI-assisted medical diagnosis, and improves the efficiency of AI-assisted medical diagnosis.

Description of drawings

Fig. 1 shows the three-dimensional reconstruction results of ultrasound images obtained with the present invention;

Fig. 2 shows the three-dimensional reconstruction results of CT images obtained with the present invention.

Detailed description of the embodiments

The present invention is further described below in conjunction with the accompanying drawings and an embodiment.

Embodiment

This embodiment is implemented on a PC under the Windows 10 64-bit operating system, with an Intel Core i7-9700F CPU, 16 GB of memory and an NVIDIA GeForce RTX 2070 8 GB GPU; the deep learning library is TensorFlow 1.14 and programming is done in Python 3.7.

A medical image 3D reconstruction method based on a mutual-attention Transformer: the method takes as input an ultrasound or CT image sequence with resolution M×N, where M = 450 and N = 300 for ultrasound images and M = N = 512 for CT images. The three-dimensional reconstruction process specifically includes the following steps:

Step 1: Construct the data sets

(a) Constructing the natural image data set

A natural image website that provides image sequences together with the corresponding camera internal parameters is selected, and 19 image sequences and the internal parameters corresponding to the sequences are downloaded from it. For each image sequence, every 3 adjacent frames are recorded as image b, image c and image d; image b and image d are concatenated along the colour channels to obtain image τ, and image c together with image τ forms one data element, where image c is the natural target image and the sampling viewpoint of image c is the target viewpoint. The internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. A natural image data set with 3600 elements is constructed from all the sequences (a minimal sketch of this triplet grouping is given after item (c) below);

(b) Constructing the ultrasound image data set

10 ultrasound image sequences are sampled. For each sequence, every 3 adjacent frames are recorded as image i, image j and image k; image i and image k are concatenated along the colour channels to obtain image π, and image j together with image π forms one data element, where image j is the ultrasound target image and the sampling viewpoint of image j is the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. An ultrasound image data set with 1600 elements is constructed from all the sequences;

(c) Constructing the CT image data set

1 CT image sequence is sampled. For this sequence, every 3 adjacent frames are recorded as image l, image m and image n; image l and image n are concatenated along the colour channels to obtain image σ, and image m together with image σ forms one data element, where image m is the CT target image and the sampling viewpoint of image m is the target viewpoint. If fewer than 3 frames remain at the end of the sequence, they are discarded. A CT image data set with 2000 elements is constructed from all the sequences;
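The three data sets above are built by the same grouping rule. A minimal sketch of how one sequence could be turned into data elements is given below; it assumes disjoint groups of three consecutive frames (the text does not state whether the triples overlap) and channel-wise concatenation of the two outer frames, and all names are illustrative.

import numpy as np

def make_triplet_elements(frames):
    """Group a sequence into (target, concatenated-neighbours) data elements.

    frames : list of (H, W, 3) arrays, one per frame of a sequence.
    Each group of 3 consecutive frames (b, c, d) yields one element:
    c is the target image and tau = concat(b, d) along the channel axis,
    so tau has 6 channels.  Leftover frames (< 3) at the end are discarded.
    """
    elements = []
    for s in range(0, len(frames) - len(frames) % 3, 3):
        b, c, d = frames[s], frames[s + 1], frames[s + 2]
        tau = np.concatenate([b, d], axis=-1)   # (H, W, 6)
        elements.append((c, tau))
    return elements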

Step 2: Construct the neural networks

The images processed by the neural networks all have a resolution of 416×128, where 416 is the width and 128 is the height, in pixels;

(1) Structure of the depth information encoding network

Tensor H is the input, with shape 4×128×416×3, and tensor I is the output, with shape 4×128×416×1;

The depth information encoding network consists of an encoder and a decoder; tensor H is encoded and then decoded to obtain the output tensor I;

The encoder consists of 5 units. The first unit is a convolution unit, and the 2nd to 5th units are composed of residual modules. The first unit has 64 convolution kernels, all of shape 7×7, with horizontal and vertical strides of 2, followed by one max-pooling operation. The 2nd to 5th units contain 3, 4, 6 and 3 residual modules respectively; each residual module performs 3 convolutions with 3×3 kernels, and the numbers of kernels are 64, 128, 256 and 512 respectively;

The decoder consists of 6 decoding units, each of which includes a deconvolution and a convolution with the same kernel shape and number of kernels. In the 1st to 6th decoding units the kernels are all 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16 respectively. Skip connections are made between the network layers of the encoder and the decoder, with the correspondence: 1 with 4, 2 with 3, 3 with 2, 4 with 1;
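A rough tf.keras sketch of the encoder described above is given below. It assumes ReLU activations and 'same' padding (neither is stated in the text), reproduces the kernel counts and residual-module layout, and omits the decoder, the downsampling between residual stages and the exact skip-connection wiring; it is a sketch under those assumptions, not the patented network.

import tensorflow as tf
from tensorflow.keras import layers

def residual_module(x, filters):
    """One residual module with three 3x3 convolutions (simplified sketch)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    if int(shortcut.shape[-1]) != filters:            # match channel count if needed
        shortcut = layers.Conv2D(filters, 1, padding='same')(shortcut)
    return layers.Activation('relu')(layers.Add()([y, shortcut]))

def build_depth_encoder(h=128, w=416):
    """Encoder stem plus residual stages, following the counts in the text."""
    inp = layers.Input(shape=(h, w, 3))                # one sample of tensor H
    x = layers.Conv2D(64, 7, strides=2, padding='same', activation='relu')(inp)
    x = layers.MaxPooling2D(pool_size=2)(x)
    for filters, n_blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for _ in range(n_blocks):
            x = residual_module(x, filters)
    return tf.keras.Model(inp, x)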

(2) Mutual-attention Transformer learning network

The mutual-attention Transformer learning network consists of one backbone network and 4 network branches; the 4 branches predict tensor L, tensor O, tensor D and tensor B respectively;

Tensor J and tensor C are the inputs, with shapes 4×128×416×3 and 4×128×416×6 respectively; the outputs are tensor L, tensor O, tensor D and tensor B, with shapes 4×2×6, 4×4×1, 4×3 and 4×128×416×4 respectively;

The backbone network is designed as 3 stages of cross-view encoding:

1) The cross-view encoding of the 1st stage includes the 1st-stage embedding encoding and the 1st-stage attention encoding

For the 1st-stage embedding encoding, tensor J, the first 3 feature components of the last dimension of tensor C, and the last 3 feature components of the last dimension of tensor C are each subjected to a convolution with 7×7 kernels, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization, yielding the 1st-stage embedding encodings 1, 2 and 3 respectively;

For the 1st-stage attention encoding, the 1st-stage embedding encoding 1 is concatenated with the 1st-stage embedding encoding 2 along the last dimension to obtain attention-encoding input feature 1; the 1st-stage embedding encoding 1 is concatenated with the 1st-stage embedding encoding 3 along the last dimension to obtain the 1st-stage attention-encoding input feature 2; the 1st-stage embedding encoding 2 is concatenated with the 1st-stage embedding encoding 1 along the last dimension to obtain the 1st-stage attention-encoding input feature 3; and the 1st-stage embedding encoding 3 is concatenated with the 1st-stage embedding encoding 1 along the last dimension to obtain the 1st-stage attention-encoding input feature 4. Each of these 4 input features of the 1st-stage attention encoding is then attention-encoded as follows (a minimal sketch of this cross-view attention step is given below). First, the multi-head self-attention method is used to compute the attention weight matrix of the 1st-stage attention-encoding input feature: for each input feature, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the first half and the second half of the channel features are each passed through a separable convolution with 3×3 kernels, 24 feature channels and horizontal and vertical strides of 1; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix is computed with the multi-head attention method, with 1 head and 24 feature channels; finally, the 1st-stage attention weight matrix is added to the target encoding feature to obtain the 1st-stage attention encoding. After the 4 attention-encoding input features of the 1st stage have been attention-encoded in this way, the 4 cross-view encoding features of the 1st stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 1st-stage cross-view cross-layer feature; the 1st-stage cross-view cross-layer feature, the 3rd 1st-stage cross-view encoding feature and the 4th 1st-stage cross-view encoding feature together form the 1st-stage cross-view encoding result, which is used as the input of the 2nd-stage cross-view encoding; the 1st-stage cross-view encoding result is concatenated along the last dimension to obtain the 1st-stage concatenated encoding result;
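The core of each attention-encoding step is a cross-view attention in which the queries come from the source half of the channels while the keys and values come from the target half. The following NumPy sketch shows that computation for a single head on already-serialized features; the separable convolutions, the scaling convention and the multi-head splitting are simplified, the text speaks of adding the attention weight matrix to the target feature whereas this sketch adds the attention output, and all names are illustrative.

import numpy as np

def cross_view_attention(target_seq, source_seq):
    """Single-head cross-view attention (simplified sketch).

    target_seq : (n, c) serialized target encoding features -> keys K and values V.
    source_seq : (n, c) serialized source encoding features -> queries Q.
    Returns the attention result added back onto the target features.
    """
    k = target_seq                       # stand-ins for the separable-conv outputs
    v = target_seq
    q = source_seq
    scale = np.sqrt(q.shape[-1])
    logits = q @ k.T / scale             # (n, n) attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    attended = weights @ v               # (n, c) attention-weighted values
    return target_seq + attended         # residual addition onto the target feature

# Toy usage: 8 sequence positions, 24 channels (the stage-1 channel count).
rng = np.random.default_rng(0)
out = cross_view_attention(rng.normal(size=(8, 24)), rng.normal(size=(8, 24)))
print(out.shape)                         # (8, 24)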

2) The cross-view encoding of the 2nd stage includes the 2nd-stage embedding encoding and the 2nd-stage attention encoding

For the 2nd-stage embedding encoding, each feature in the 1st-stage cross-view encoding result is embedded: a convolution with 64 feature channels, 3×3 kernels and horizontal and vertical strides of 2, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization of the features, yielding the 2nd-stage embedding encodings 1, 2 and 3;

For the 2nd-stage attention encoding, the 2nd-stage embedding encoding 1 is concatenated with the 2nd-stage embedding encoding 2 along the last dimension to obtain the 2nd-stage attention-encoding input feature 1; the 2nd-stage embedding encoding 1 is concatenated with the 2nd-stage embedding encoding 3 along the last dimension to obtain the 2nd-stage attention-encoding input feature 2; the 2nd-stage embedding encoding 2 is concatenated with the 2nd-stage embedding encoding 1 along the last dimension to obtain the 2nd-stage attention-encoding input feature 3; and the 2nd-stage embedding encoding 3 is concatenated with the 2nd-stage embedding encoding 1 along the last dimension to obtain the 2nd-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the target and source encoding features are each passed through a separable convolution with 3×3 kernels, 64 feature channels and horizontal and vertical strides of 2; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix of the feature is computed with the multi-head attention method, with 3 heads and 64 feature channels; finally, the 2nd-stage attention weight matrix is added to the target encoding feature to obtain the 2nd-stage attention encoding. After the 4 attention-encoding input features of the 2nd stage have been attention-encoded, the 4 cross-view encoding features of the 2nd stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 2nd-stage cross-view cross-layer feature; the 2nd-stage cross-view cross-layer feature, the 3rd 2nd-stage cross-view encoding feature and the 4th 2nd-stage cross-view encoding feature together form the 2nd-stage cross-view encoding result, which is used as the input of the 3rd-stage cross-view encoding; the 2nd-stage cross-view encoding result is concatenated along the last dimension to obtain the 2nd-stage concatenated encoding result;

3) The cross-view encoding of the 3rd stage includes the 3rd-stage embedding encoding and the 3rd-stage attention encoding

For the 3rd-stage embedding encoding, each feature in the 2nd-stage cross-view encoding result is embedded: a convolution with 3×3 kernels, 128 feature channels and horizontal and vertical strides of 2, a serialization step that transforms the encoded features from the spatial image-feature layout into a sequence structure, and layer normalization of the features, yielding the 3rd-stage embedding encodings 1, 2 and 3;

For the 3rd-stage attention encoding, the 3rd-stage embedding encoding 1 is concatenated with the 3rd-stage embedding encoding 2 along the last dimension to obtain the 3rd-stage attention-encoding input feature 1; the 3rd-stage embedding encoding 1 is concatenated with the 3rd-stage embedding encoding 3 along the last dimension to obtain the 3rd-stage attention-encoding input feature 2; the 3rd-stage embedding encoding 2 is concatenated with the 3rd-stage embedding encoding 1 along the last dimension to obtain the 3rd-stage attention-encoding input feature 3; and the 3rd-stage embedding encoding 3 is concatenated with the 3rd-stage embedding encoding 1 along the last dimension to obtain the 3rd-stage attention-encoding input feature 4. For each of these input features, along the last dimension the first half of the channel features is taken as the target encoding feature and the second half as the source encoding feature; the target and source encoding features are each passed through a separable convolution with 3×3 kernels, 128 feature channels and horizontal and vertical strides of 2; the processing result of the target encoding feature is used as the key (K) and value (V) encoding vectors for attention learning, and the processing result of the source encoding feature is used as the query (Q) encoding vector; then the attention weight matrix of the feature is computed with the multi-head attention method, with 6 heads and 128 feature channels; finally, the 3rd-stage attention weight matrix is added to the target encoding feature to obtain the 3rd-stage attention encoding. In this way, after the 4 attention-encoding input features of the 3rd stage have gone through the embedding encoding and attention encoding, the 4 cross-view encoding features of the 3rd stage are obtained. The average of the 1st and 2nd of these cross-view encoding features is used as the 3rd-stage cross-view cross-layer feature; the 3rd-stage cross-view cross-layer feature, the 3rd 3rd-stage cross-view encoding feature and the 4th 3rd-stage cross-view encoding feature together form the 3rd-stage cross-view encoding result; the 3rd-stage cross-view encoding result is concatenated along the last dimension to obtain the 3rd-stage concatenated encoding result;

For the 1st network branch, the 1st-stage concatenated encoding result is processed by 2 successive units: in the 1st unit, a convolution with 16 feature channels, 7×7 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are processed by 2 further units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the 3rd-stage concatenated encoding result and processed by the following 3 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 12 feature channels, 1×1 kernels and strides of 1, followed by feature activation and batch normalization. The resulting 12-channel features are predicted in the form 2×6 to give the result of tensor L;
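The 2×6 output of this branch is naturally read as one 6-DoF pose per neighbouring view. How the 6 numbers are parameterized is not stated in the text; the sketch below assumes the common (tx, ty, tz, rx, ry, rz) layout with Euler angles and converts it to a 4×4 transformation matrix, so the layout and function names should be treated as assumptions.

import numpy as np

def pose_vec_to_matrix(pose_vec):
    """Convert a 6-DoF pose vector to a 4x4 transform (assumed layout).

    pose_vec : (6,) array assumed to be (tx, ty, tz, rx, ry, rz),
               with rotations given as Euler angles in radians.
    """
    tx, ty, tz, rx, ry, rz = pose_vec
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    transform = np.eye(4)
    transform[:3, :3] = rot_z @ rot_y @ rot_x
    transform[:3, 3] = [tx, ty, tz]
    return transform

# Tensor L has shape (batch, 2, 6): one pose per adjacent frame of the target.
poses = np.zeros((2, 6))
matrices = [pose_vec_to_matrix(p) for p in poses]    # two identity transforms here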

For the 2nd network branch, the 1st-stage concatenated encoding result is processed by 2 successive units: in the 1st unit, a convolution with 16 feature channels, 7×7 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the 2nd-stage concatenated encoding result and processed by the following 2 units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization. The resulting features are concatenated with the 3rd-stage concatenated encoding result and processed by the following 3 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 4 feature channels, 1×1 kernels and strides of 1, followed by feature activation and batch normalization. The resulting 4-channel features are taken as the result of tensor O;

For the 3rd network branch, the 3rd-stage concatenated encoding result is processed by the following 4 units: in the 1st unit, a convolution with 256 feature channels, 3×3 kernels and horizontal and vertical strides of 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 512 feature channels, 3×3 kernels and strides of 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 1024 feature channels, 3×3 kernels and strides of 2; in the 4th unit, a convolution with 3 feature channels, 1×1 kernels and strides of 1. The resulting features are taken as the result of tensor D;

For the 4th network branch, the 1st-stage cross-view cross-layer feature is first processed by one deconvolution, feature activation and batch normalization; the deconvolution has 16 feature channels, 3×3 kernels and horizontal and vertical strides of 2, and the result is recorded as decoder cross-layer feature 1. The 1st-stage cross-view cross-layer feature is then processed by the following 2 units: in the 1st unit, a convolution with 32 feature channels, 7×7 kernels and strides of 1, feature activation and batch normalization, and the processed feature is recorded as decoder cross-layer feature 2; in the 2nd unit, a convolution with 32 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization. The resulting features are concatenated with the 2nd-stage cross-view cross-layer feature, and the concatenation result is processed by the following 2 units: in the 1st unit, a convolution with 64 feature channels, 7×7 kernels and strides of 1, and the processed feature is recorded as decoder cross-layer feature 3; in the 2nd unit, a convolution with 128 feature channels, 3×3 kernels and strides of 2. The resulting features are then concatenated with the 3rd-stage cross-view cross-layer feature and processed by the following 3 units, each consisting of a convolution, feature activation and batch normalization: in the 1st unit, a convolution with 128 feature channels, 7×7 kernels and strides of 1, and the processed feature is recorded as decoder cross-layer feature 4; in the 2nd unit, a convolution with 256 feature channels, 3×3 kernels and strides of 2, and the processed feature is recorded as decoder cross-layer feature 5; in the 3rd unit, a convolution with 512 feature channels, 3×3 kernels and strides of 2. After this processing the 4th-branch encoding feature is obtained;

Decoding then proceeds as follows. The 4th-branch encoding feature is processed by one deconvolution with 256 feature channels, 3×3 kernels and horizontal and vertical strides of 2, feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and passed through one convolution with 512 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization. The result is processed by a deconvolution with 256 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 4, and passed through one convolution with 256 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization. The result is processed by a deconvolution with 128 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 3, and passed through one convolution with 128 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 4th-scale result of tensor B. At the same time, these features are processed by one deconvolution with 64 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 2, and passed through one convolution with 64 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 3rd-scale result of tensor B. At the same time, these features are processed by one deconvolution with 32 feature channels, 3×3 kernels and strides of 2, feature activation and batch normalization, concatenated with decoder cross-layer feature 1, and passed through one convolution with 32 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 2nd-scale result of tensor B. At the same time, these features are processed by one deconvolution with 16 feature channels, 7×7 kernels and strides of 2, feature activation and batch normalization, concatenated with the upsampling result of the 3rd-scale features, and passed through one convolution with 16 feature channels, 3×3 kernels and strides of 1, feature activation and batch normalization; the resulting features are taken as the 1st-scale result of tensor B. The output of the 4th branch is obtained from the 4 scale results of tensor B;

Step 3: Training of the neural networks

The samples in the natural image data set, the ultrasound image data set and the CT image data set are each split 9:1 into a training set and a test set; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding data set, uniformly scaled to the resolution 416×128 and fed into the corresponding network; optimization proceeds iteratively, continuously modifying the network model parameters so that the loss of each batch is minimized;
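A minimal sketch of this split-and-resize preprocessing is given below. Whether the 9:1 split is random or sequential is not stated in the text, and the choice of OpenCV for resizing is an assumption; names and the fixed random seed are illustrative.

import numpy as np
import cv2   # any image library could be used for resizing; OpenCV is one choice

def split_and_resize(samples, train_ratio=0.9, size=(416, 128)):
    """Split samples 9:1 into train/test and resize images to 416x128 (w, h)."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(samples))
    n_train = int(len(samples) * train_ratio)
    resized = [cv2.resize(img, size) for img in samples]
    train = [resized[i] for i in order[:n_train]]
    test = [resized[i] for i in order[n_train:]]
    return train, test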

During training, the individual losses are computed as follows:

Internal-parameter supervised synthesis loss: in the network model training on natural images, the tensor I output by the depth information encoding network is taken as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the internal-parameter labels e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera internal parameters respectively. Following the principles of computer vision, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between image c and each of the two synthesized images;
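Several of the losses in this section reduce to the same photometric term: a sum of per-pixel, per-colour-channel intensity differences between a target image and a view synthesized at its viewpoint. A minimal NumPy sketch of that term is given below; it assumes absolute differences (the text only says "intensity differences"), takes already-aligned arrays of the same shape, and leaves the view-synthesis (warping) step out of scope.

import numpy as np

def photometric_loss(target, synthesized):
    """Sum of per-pixel, per-colour-channel absolute intensity differences.

    target, synthesized : (H, W, 3) arrays at the same viewpoint and resolution.
    """
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

# The supervised synthesis loss uses two synthesized views of image c:
# loss = photometric_loss(c, synth_from_b) + photometric_loss(c, synth_from_d)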

Unsupervised synthesis loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. Following computer vision algorithms, the two images adjacent to the target image are used to synthesize images at the viewpoint of the target image, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each of these synthesized images;

Internal-parameter error loss: in the network model training on natural images, the loss is computed as the sum of the absolute values of the component-wise differences between the output tensor O of the second network branch of the mutual-attention Transformer learning network and the internal-parameter labels e_t (t = 1, 2, 3, 4) of the training data;

Spatial structure error loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. Following computer vision algorithms, the two images adjacent to the image at the target viewpoint are used to reconstruct the three-dimensional coordinates of the image at the target viewpoint, the RANSAC algorithm is used to fit a spatial structure to the reconstructed points, and the spatial structure error loss is computed as the cosine distance between the normal vector obtained from the fit and the output tensor D of the mutual-attention Transformer learning network;
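A hedged sketch of this loss follows: a small RANSAC-style plane fit over the reconstructed 3D points, followed by the cosine distance between the fitted normal and the predicted normal D. The inlier threshold, iteration count and the exact definition of the fitted "spatial structure" are not specified in the text, so the values and names below are illustrative.

import numpy as np

def fit_plane_normal_ransac(points, n_iters=100, threshold=0.01, seed=0):
    """Estimate a plane normal from (N, 3) points with a simple RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, -1
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(normal) < 1e-8:          # degenerate sample, skip
            continue
        normal = normal / np.linalg.norm(normal)
        dists = np.abs((points - p0) @ normal)     # point-to-plane distances
        inliers = int((dists < threshold).sum())
        if inliers > best_inliers:
            best_normal, best_inliers = normal, inliers
    return best_normal

def cosine_distance(a, b):
    """1 - cosine similarity, e.g. between the fitted normal and tensor D."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))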

Transformation synthesis loss: in the network parameter training on ultrasound or CT images, the output tensor I of the depth information encoding network is taken as the depth, and the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network are taken as the pose parameters and the camera internal parameters respectively. The two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image; for each of these synthesized images, after the position of every pixel has been obtained in the synthesis process, the output tensor B of the fourth network branch is applied as the displacement of the spatial-domain deformation of the synthesized image to form the synthesis result image, and the loss is then computed as the sum of per-pixel, per-colour-channel intensity differences between the image at the target viewpoint and the synthesis results at the target viewpoint;

Specific training steps:

(1) On the natural image data set, the depth information encoding network and the backbone network and first network branch of the mutual-attention Transformer learning network are trained for 80000 iterations

Each time, training data are taken from the natural image data set and uniformly scaled to the resolution 416×128; image c is input into the depth information encoding network, and image c and image τ are input into the mutual-attention Transformer learning network; the depth information encoding network and the backbone network and first network branch of the mutual-attention Transformer learning network are trained for 80000 iterations, and the training loss of each batch is computed from the internal-parameter supervised synthesis loss;

(2) On the natural image data set, the second network branch of the mutual-attention Transformer learning network is trained for 50000 iterations

Each time, training data are taken from the natural image data set and uniformly scaled to the resolution 416×128; image c is input into the depth information encoding network, image c and image τ are input into the mutual-attention Transformer learning network, and the second network branch is trained; the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the internal-parameter error loss;

(3) On the ultrasound image data set, the depth information encoding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained for 80000 iterations, giving the model parameters ρ

Each time, ultrasound training data are taken from the ultrasound image data set and uniformly scaled to the resolution 416×128; image j is input into the depth information encoding network, image j and image π are input into the mutual-attention Transformer learning network, and the depth information encoding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained; the training loss of each batch is computed as the sum of the transformation synthesis loss and the spatial structure error loss;

(4) On the CT image data set, the mutual-attention Transformer learning network is trained for 60000 iterations, giving the model parameters ρ′

Each time, CT training data are taken from the CT image data set and uniformly scaled to the resolution 416×128; image m and image σ are input into the mutual-attention Transformer learning network; the output of the depth information encoding network is taken as the depth, the outputs of the backbone network together with the first and second network branches are taken as the pose parameters and the camera internal parameters respectively, and the output tensor B of the fourth network branch of the mutual-attention Transformer learning network is taken as the displacement of the spatial-domain deformation. Two images at the viewpoint of image m are synthesized from image l and image n respectively, and the network is trained by continuously modifying its parameters and optimizing iteratively so that the loss of every image in every batch is minimized; the optimal network model parameters ρ′ are obtained after the iterations. When computing the loss for network optimization, a loss on the camera translation motion is added to the transformation synthesis loss and the spatial structure error loss;
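The exact form of the camera-translation term is not given in the text, so the following sketch only illustrates how the three terms could be combined; the translation term here simply penalizes the magnitude of the predicted translations, and the weights w1-w3 are placeholders, not values from the patent.

import numpy as np

def ct_training_loss(transform_synth_loss, structure_loss, pose_vecs,
                     w1=1.0, w2=1.0, w3=1.0):
    """Combine the CT-stage loss terms (weights and translation form assumed).

    pose_vecs : (2, 6) array from tensor L; the first three entries of each
                pose vector are assumed to be the translation components.
    """
    translation_loss = float(np.linalg.norm(pose_vecs[:, :3], axis=1).sum())
    return w1 * transform_synth_loss + w2 * structure_loss + w3 * translation_loss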

Step 4: 3D reconstruction of ultrasound or CT images

Using a self-sampled ultrasound or CT image sequence, the following three processes are carried out simultaneously to achieve the 3D reconstruction:

(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are computed as follows: the image is scaled to the resolution 416×128; for an ultrasound sequence, image j is input into the depth information encoding network and image j together with image π is input into the mutual-attention Transformer learning network; for a CT sequence, image m is input into the depth information encoding network and image m together with image σ is input into the mutual-attention Transformer learning network; prediction is performed with the model parameters ρ (for ultrasound) and ρ′ (for CT) respectively; the depth of each target frame is obtained from the depth information encoding network, the output tensor L of the first network branch and the output tensor O of the second network branch of the mutual-attention Transformer learning network give the camera pose parameters and the camera internal parameters respectively, and the three-dimensional coordinates of the target image in the camera coordinate system are computed from the depth information and the camera internal parameters according to the principles of computer vision;

(2) During the 3D reconstruction of the image sequence, a key-frame sequence is built: the first frame of the sequence is taken as the first frame of the key-frame sequence and as the current key frame; the frames after the current key frame are taken as target frames, and new key frames are selected dynamically in target-frame order (a skeleton of this selection loop is given after item (3) below). First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix. For any target frame, this pose parameter matrix is cumulatively multiplied by the camera pose parameters of the target frame; using the accumulated result together with the internal parameters and depth information of the target frame, the image at the viewpoint of the target frame is synthesized, and the error λ is computed as the sum of per-pixel, per-colour-channel intensity differences between the synthesized image and the target frame. Then, from the adjacent frames of the target frame, the image at the viewpoint of the target frame is synthesized using the camera pose parameters and internal parameters, and the error γ is computed as the sum of per-pixel, per-colour-channel intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then computed with formula (1):

Figure BDA0003192217030000211 — formula (1) for the synthesis error ratio Z (given as an image in the original)

When Z is greater than 1.2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this is iterated until the key-frame sequence is established;

(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. For any target frame, its resolution is scaled to M×N, where M = 450 and N = 300 for ultrasound images and M = N = 512 for CT images; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information output by the network, and the three-dimensional coordinates of every pixel of the target frame in the world coordinate system are then computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key-frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
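A skeleton of the key-frame selection loop of step (2) is sketched below. Formula (1) is only given as an image in the patent, so the exact form of Z cannot be reproduced here; the sketch assumes it is the ratio of the two synthesis errors, and it assumes the accumulated relative pose restarts whenever a new key frame is chosen. All names, the callables and both assumptions are illustrative, not a definitive implementation.

import numpy as np

def build_keyframes(frames, pose_matrices, synth_error, adjacent_error, eta=1.2):
    """Skeleton of the key-frame selection loop.

    frames         : list of frame indices; frames[0] is the first key frame.
    pose_matrices  : list of 4x4 camera pose matrices, one per frame.
    synth_error    : callable(frame, accumulated_pose) -> lambda-type error.
    adjacent_error : callable(frame) -> gamma-type error from adjacent frames.
    eta            : threshold on the synthesis error ratio Z (1.2 in the embodiment).
    """
    keyframes = [frames[0]]
    accumulated = np.eye(4)                 # pose relative to the current key frame
    for idx in frames[1:]:
        accumulated = accumulated @ pose_matrices[idx]
        lam = synth_error(idx, accumulated)
        gam = adjacent_error(idx)
        z = lam / gam                       # assumed form of formula (1)
        if z > eta:
            keyframes.append(idx)
            accumulated = np.eye(4)         # the new key frame becomes current
    return keyframes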

In this embodiment, the networks are trained on the constructed natural image, ultrasound image and CT image training sets, and tested on 10 ultrasound image sequences and 1 CT image sequence from public data sets, with the error computed using the transformation synthesis loss. In the error calculation for ultrasound or CT images, the two images adjacent to the target image are used to construct two synthesized images at the viewpoint of the target image, and the error is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each of the two synthesized images at the target viewpoint.

Table 1 gives the errors computed for the reconstruction of the ultrasound image sequences, and Table 2 gives the errors computed for the reconstruction of the CT image sequence. In this embodiment, DenseNet is used to segment the ultrasound or CT images before the 3D reconstruction. Fig. 1 shows the three-dimensional reconstruction results of ultrasound images obtained with the present invention, and Fig. 2 shows the three-dimensional reconstruction results of CT images obtained with the present invention; it can be seen that the present invention obtains fairly accurate reconstruction results.

Table 1

No.    Error
1      0.1299164668818727
2      0.0368915316811806
3      0.07339861854471304
4      0.09744906178316476
5      0.1018028589374692
6      0.08109420171719985
7      0.051973303110074524
8      0.0988887820759697
9      0.10880799129583894
10     0.06647273849340957

Table 2

Sequence No.    Error
1               0.058544783606001315
2               0.0667200513419954
3               0.06821816611230745
4               0.06780729271604191
5               0.11862437423632731
6               0.10054601129420655
7               0.12442189492200881
8               0.15065656014245987
9               0.10756279393662936
10              0.11451064929672831

Claims (1)

1. A medical image three-dimensional reconstruction method based on a mutual-attention Transformer, characterized in that an ultrasound or CT image sequence is input with image resolution M×N, where 100 ≤ M ≤ 2000 and 100 ≤ N ≤ 2000, and the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
Selecting a natural image website that provides image sequences and the corresponding camera internal parameters; downloading a image sequences and the internal parameters corresponding to each sequence from the website, where 1 ≤ a ≤ 20; for each image sequence, every 3 adjacent frames are recorded as image b, image c and image d; image b and image d are concatenated along the colour channels to obtain an image τ, and image c and image τ form one data element, wherein image c is the natural target image, the sampling viewpoint of image c serves as the target viewpoint, and the internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates; if fewer than 3 frames remain at the end of an image sequence, they are discarded; a natural image data set is constructed from all sequences, the constructed natural image data set containing f elements with 3000 ≤ f ≤ 20000;
(b) Constructing ultrasound image datasets
Sampling g ultrasonic image sequences, wherein g is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames of images as an image i, an image j and an image k, splicing the image i and the image k according to color channels to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, the sampling viewpoint of the image j is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, and constructing an ultrasonic image data set by utilizing all the sequences, wherein F elements are contained in the constructed ultrasonic image data set, and F is more than or equal to 1000 and less than or equal to 20000;
(c) Constructing CT image datasets
Sampling h CT image sequences, wherein h is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames as an image l, an image m and an image n, splicing the image l and the image n according to a color channel to obtain an image sigma, forming a data element by the image m and the image sigma, wherein the image m is a CT target image, a sampling viewpoint of the image m is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, constructing a CT image data set by utilizing all the sequences, wherein xi elements are in the constructed CT image data set, and the xi is more than or equal to 1000 and less than or equal to 20000;
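
In each of the three data sets, a data element therefore pairs a target frame with its two neighbours stacked along the colour channels. A minimal sketch of that stacking (Python with NumPy; names are illustrative, not from the claim):

    import numpy as np

    def make_element(prev_frame, target_frame, next_frame):
        # stack the two neighbouring frames along the colour channels: H x W x 6
        stacked = np.concatenate([prev_frame, next_frame], axis=-1)
        # the element is the pair (target image, stacked neighbour image)
        return target_frame, stacked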
Step 2: construction of neural networks
The resolution of the image input to the neural network is p×o, where p is the width and o is the height, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000;
(1) Depth information coding network
Tensor H is used as input, the scale is alpha x o x p x 3, tensor I is used as output, the scale is alpha x o x p x 1, and alpha is the batch number;
the depth information coding network consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after coding and decoding processing in sequence;
the encoder consists of 5 units, wherein the first unit is a convolution unit, the 2 nd to 5 th units are all composed of residual error modules, in the first unit, 64 convolution kernels are formed, the shapes of the convolution kernels are 7 multiplied by 7, the step sizes of the convolution in the horizontal direction and the vertical direction are 2, the maximum pooling treatment is carried out once after the convolution, the 2 nd to 5 th units respectively comprise 3,4,6,3 residual error modules, each residual error module carries out 3 times of convolution, the shapes of the convolution kernels are 3 multiplied by 3, and the numbers of the convolution kernels are 64, 128, 256 and 512;
the decoder consists of 6 decoding units, each decoding unit comprises deconvolution and convolution processing, the deconvolution and convolution processing have the same shape and number of convolution kernels, the shape of the convolution kernels in the 1 st to 6 th decoding units is 3 multiplied by 3, the number of the convolution kernels is 512, 256, 128, 64, 32 and 16 respectively, the encoder and the network layer of the decoder are connected in a cross-layer manner, and the corresponding relationship of the cross-layer connection is as follows: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
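
As a hedged sketch of the decoder structure (PyTorch-style Python; padding, activation and the exact cross-layer pairing are assumptions, only the deconvolution-plus-convolution pattern with a cross-layer connection follows the claim):

    import torch
    import torch.nn as nn

    class DecodeUnit(nn.Module):
        # one decoding unit: deconvolution, cross-layer concatenation, convolution
        def __init__(self, in_ch, out_ch, skip_ch):
            super().__init__()
            self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                             padding=1, output_padding=1)
            self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x, skip):
            x = self.act(self.deconv(x))
            x = torch.cat([x, skip], dim=1)   # cross-layer connection from the encoder
            return self.act(self.conv(x))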
(2) Mutual-attention Transformer learning network
The mutual-attention Transformer learning network consists of a backbone network and 4 network branches, and the 4 network branches are used to predict the tensors L, O, D and B respectively;
tensor J and tensor C are used as input, the scales are alpha x O x p x 3 and alpha x O x p x 6 respectively, the outputs are tensor L, tensor O, tensor D and tensor B, the tensor L scale is alpha x 2 x 6, the tensor O scale is alpha x 4 x 1, the tensor D scale is alpha x 3, the tensor B scale is alpha x O x p x 4, alpha is the batch number,
the backbone network is designed for 3-phase cross-view coding:
1) The cross-view coding of the 1 st stage comprises embedded coding of the 1 st stage and attention coding of the 1 st stage
The embedded coding of the 1 st stage respectively carries out convolution operation on the first 3 characteristic components of the last dimension of the tensor J and the last 3 characteristic components of the last dimension of the tensor C, the convolution kernel scale is 7 multiplied by 7, the characteristic channel number is 24, the coding characteristics are transformed into a sequence structure from the spatial domain shape of the image characteristics by the serialization processing, and the 1 st stage embedded coding 1, the 1 st stage embedded coding 2 and the 1 st stage embedded coding 3 are respectively obtained by the layer normalization processing;
The attention code of the 1 st stage is obtained by concatenating the embedded code 1 of the 1 st stage and the embedded code 2 of the 1 st stage according to the last dimension; concatenating the 1 st stage embedded code 1 and the 1 st stage embedded code 3 according to the last dimension to obtain a 1 st stage attention code input feature 2; concatenating the 1 st stage embedded code 2 and the 1 st stage embedded code 1 according to the last dimension to obtain a 1 st stage attention code input characteristic 3; concatenating the 1 st stage embedded code 3 and the 1 st stage embedded code 1 according to the last dimension to obtain a 1 st stage attention code input characteristic 4; -attention encoding the 4 input features of the 1 st phase attention encoding: taking a first half channel characteristic as a target coding characteristic, a second half channel characteristic as a source coding characteristic and then carrying out separable convolution operation on the target coding characteristic and the source coding characteristic according to a last dimension in the 1 st stage, wherein the convolution kernel scale is 3 multiplied by 3, the characteristic channel number is 24, the step sizes in the horizontal direction and the vertical direction are 1, the processing result of the target coding characteristic is taken as a query keyword K coding vector and a numerical value V coding vector for attention learning, the processing result of the source coding characteristic is taken as a query Q coding vector for attention learning, then, the attention weight matrix of each attention coding input characteristic is calculated by utilizing a multi-head attention method, the number of heads is 1, the characteristic channel number is 24, finally, each attention weight matrix is added with the target coding characteristic of each attention coding input characteristic to obtain 4 cross-view coding characteristics in the 1 st stage, and the average characteristic of the 1 st and 2 nd cross-view coding characteristics of the 4 cross-view coding characteristics is taken as a 1 st stage cross-view cross-layer characteristic; taking the 1 st stage cross-view cross-layer feature, the 1 st stage 3 rd cross-view coding feature and the 1 st stage 4 th cross-view coding feature as 1 st stage cross-view coding results; taking the 1 st stage cross-view coding result as a 2 nd stage cross-view coding input, and concatenating the 1 st stage cross-view coding result according to the last dimension to obtain a 1 st stage concatenated coding result;
2) The cross-view coding of phase 2 includes embedded coding of phase 2 and attention coding of phase 2
The embedded coding of the 2 nd stage, the embedded coding of each feature in the cross-view coding result of the 1 st stage is carried out, the number of feature channels of convolution operation is 64, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, the serialization processing transforms coding features from the spatial domain shape of image features into a sequence structure, and the layer normalization processing of the features obtains the 2 nd stage embedded coding 1, the 2 nd stage embedded coding 2 and the 2 nd stage embedded coding 3;
the attention code of the 2 nd stage, the embedded code 1 of the 2 nd stage and the embedded code 2 of the 2 nd stage are connected in series according to the last dimension to obtain the input characteristic 1 of the attention code of the 2 nd stage; concatenating the 2 nd stage embedded code 1 and the 2 nd stage embedded code 3 according to the last dimension to obtain a 2 nd stage attention code input feature 2; concatenating the 2 nd stage embedded code 2 and the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input characteristic 3; concatenating the 2 nd stage embedded code 3 with the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input feature 4, taking each input feature as a target code feature according to the last dimension, taking the first half channel feature as a target code feature, taking the second half channel feature as a source code feature, respectively carrying out separable convolution operation on the target code feature and the source code feature, wherein the convolution kernel dimensions are 3×3, the feature channel number is 64, the step sizes in the horizontal direction and the vertical direction are 2, the processing result of the target code feature is taken as a query keyword K code vector and a numerical value V code vector for attention learning, the processing result of the source code feature is taken as a query Q code vector for attention learning, then, calculating an attention weight matrix of each attention code input feature by utilizing a multi-head attention method, the number of heads is 3, the feature channel number is 64, finally, adding the attention weight of each attention code input feature and the target code feature of each attention code input feature to 4 cross-view code features, and utilizing the 1 st cross-view feature and the 2 nd stage cross-view code feature as an average cross-view feature; taking the 2 nd stage cross-view cross-layer feature, the 2 nd stage 3 rd cross-view coding feature and the 2 nd stage 4 th cross-view coding feature as 2 nd stage cross-view coding results; taking the 2 nd stage cross-view coding result as a 3 rd stage cross-view coding input, and concatenating the 2 nd stage cross-view coding result according to the last dimension to obtain a 2 nd stage concatenated coding result;
3) The 3 rd stage cross-view coding includes 3 rd stage embedded coding and 3 rd stage attention coding
The embedded coding of the 3 rd stage, each feature in the cross-view coding result of the 2 nd stage is subjected to embedded coding processing, convolution operation is carried out, the convolution kernel scale is 3 multiplied by 3, the number of feature channels is 128, the step length in the horizontal direction and the step length in the vertical direction are 2, the serialization processing transforms coding features from the spatial domain shape of the image features into a sequence structure, and the layer normalization processing of the features is carried out to obtain a 3 rd stage embedded coding 1, a 3 rd stage embedded coding 2 and a 3 rd stage embedded coding 3;
the 3 rd stage attention code, the 3 rd stage embedded code 1 and the 3 rd stage embedded code 2 are connected in series according to the last dimension to obtain the 3 rd stage attention code input characteristic 1; concatenating the 3 rd stage embedded code 1 and the 3 rd stage embedded code 3 according to the last dimension to obtain a 3 rd stage attention code input feature 2; concatenating the 3 rd stage embedded code 2 and the 3 rd stage embedded code 1 according to the last dimension to obtain a 3 rd stage attention code input characteristic 3; concatenating the 3 rd stage embedded code 3 and the 3 rd stage embedded code 1 according to the last dimension to obtain a 3 rd stage attention code input feature 4; taking the first half channel characteristic as a target coding characteristic, the second half channel characteristic as a source coding characteristic, respectively carrying out separable convolution operation on the target coding characteristic and the source coding characteristic, wherein the convolution kernel scale is 3 multiplied by 3, the characteristic channel number is 128, the step length in the horizontal direction and the step length in the vertical direction are 2, taking the processing result of the target coding characteristic as a query keyword K coding vector and a numerical V coding vector for attention learning, taking the processing result of the source coding characteristic as a query Q coding vector for attention learning, then calculating an attention weight matrix of each attention coding input characteristic by utilizing a multi-head attention method, the number of heads is 6, the characteristic channel number is 128, finally adding the weight matrix of each attention coding input characteristic in the 3 rd stage with the target coding characteristic of each attention coding input characteristic to obtain 4 cross-view coding characteristics in the 3 rd stage, and taking the average characteristics of the 1 st and 2 nd characteristics of the cross-view coding characteristics as cross-view cross-layer characteristics in the 3 rd stage; taking the 3 rd-stage cross-view cross-layer feature, the 3 rd-stage 3 rd cross-view coding feature and the 3 rd-stage 4 th cross-view coding feature as 3 rd-stage cross-view coding results; concatenating the 3 rd stage cross-view coding result according to the last dimension to obtain a 3 rd stage concatenated coding result;
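
An illustrative sketch of the mutual-attention step shared by the three stages (PyTorch-style Python; the stage-1 width of 24 channels and 1 head are used as defaults, and the padding, sequence layout and normalisation details are assumptions):

    import torch
    import torch.nn as nn

    class MutualAttention(nn.Module):
        def __init__(self, ch=24, heads=1):
            super().__init__()
            # depthwise-separable convolutions producing the K/V and Q encodings
            self.sep_target = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                                            nn.Conv2d(ch, ch, 1))
            self.sep_source = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                                            nn.Conv2d(ch, ch, 1))
            self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

        def forward(self, x):
            # first half of the channels: target coding feature; second half: source
            target, source = x.chunk(2, dim=1)
            kv = self.sep_target(target).flatten(2).transpose(1, 2)  # K and V
            q = self.sep_source(source).flatten(2).transpose(1, 2)   # Q
            out, _ = self.attn(q, kv, kv)
            out = out.transpose(1, 2).reshape_as(target)
            # the attention result is added to the target coding feature
            return out + target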
For the 1 st network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the resulting features were sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained features are concatenated with the 3 rd stage concatenated coding result, and the following 3 unit processes are performed: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 12, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; predicting the obtained characteristic results of the 12 channels according to a 2 multiplied by 6 form to obtain a tensor L result;
For the 2 nd network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then the obtained characteristics are connected with the 2 nd stage serial connection coding result in series, and the following 2 unit processing is carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the obtained characteristics are connected with the 3 rd stage serial connection coding result in series, and 2 unit processing is carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 4, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; taking the obtained 4-channel characteristics as the result of tensor O;
For the 3 rd network branch, the 3 rd stage concatenated code result is processed by the following 4 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 512, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 1024, the convolution kernel scales are 3×3, the step sizes in the horizontal direction and the vertical direction are 2, in the 4 th unit processing, the number of characteristic channels of convolution operation is 3, the convolution kernel scales are 1×1, the step sizes in the horizontal direction and the vertical direction are 1, and the obtained characteristics are used as the result of tensor D;
for the 4 th network branch, performing one-time deconvolution operation, feature activation and batch normalization processing on the cross-layer features of the cross-view in the 1 st stage, wherein in the deconvolution operation, the number of the convolved feature channels is 16, the convolution kernel scales are 3 multiplied by 3, and the step sizes in the horizontal direction and the vertical direction are 2; the obtained result is marked as a decoder cross-layer characteristic 1, and the cross-view cross-layer characteristic of the 1 st stage is processed by the following 2 units: when the 1 st unit is processed, the number of convolution operation characteristic channels is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the processing characteristic is marked as a decoder cross-layer characteristic 2; processing the 2 nd unit, carrying out convolution operation, wherein the number of characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, carrying out characteristic activation and batch normalization processing, carrying out series connection on the obtained characteristic and the 2 nd stage cross-view cross-layer characteristic, and carrying out the processing of the following 2 units on the series connection result: when the 1 st unit is processed, the number of characteristic channels of convolution is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristics are marked as decoder cross-layer characteristics 3; when the 2 nd unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, then the obtained characteristic is connected with the 3 rd stage cross-view cross-layer characteristic in series, the following 3 unit processes are carried out, when the 1 st unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristic is marked as the decoder cross-layer characteristic 4; when the 2 nd unit is processed, the number of the characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the processing characteristics are marked as decoder cross-layer characteristics 5; when the 3 rd unit is processed, the number of the convolved characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, and the 4 th network branch coding characteristic is obtained after the processing;
Decoding is further carried out, and deconvolution operation is carried out on the 4 th network branch coding feature for 1 time: the number of characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 2, the characteristics are activated and normalized in batches, the obtained result is connected with the cross-layer characteristics 5 of the decoder in series, and one convolution operation is carried out: the number of the characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, and deconvolution operation is carried out on the obtained result: the number of the characteristic channels is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristic activation and batch normalization are carried out, the obtained result is connected with the cross-layer characteristic 4 of the decoder in series, and one convolution operation is carried out: the number of characteristic channels is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the obtained result is subjected to deconvolution operation once: the number of the characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristic activation and batch normalization are carried out, the obtained result is connected with the cross-layer characteristic 3 of the decoder in series, and one convolution operation is carried out: the number of characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 1, the characteristics are activated and subjected to batch normalization processing, the obtained characteristics are used as the 4 th scale result of tensor B, meanwhile, 1 deconvolution operation is carried out on the obtained characteristics, the number of deconvoluted characteristic channels is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes of the horizontal direction and the vertical direction are 2, the characteristics are activated and subjected to batch normalization processing, the obtained characteristics are connected with cross-layer characteristics 2 of a decoder in series, and one convolution operation is carried out: the number of the characteristic channels is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, the obtained characteristic is used as the 3 rd scale result of the tensor B, and meanwhile, the obtained characteristic is subjected to 1 deconvolution operation: the number of deconvolution characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, the characteristics are activated and normalized in batches, the obtained characteristics are connected with the cross-layer characteristics 1 of the 
decoder in series, and then one convolution operation is carried out: the number of the characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization are carried out, the obtained characteristic is used as the 2 nd scale result of the tensor B, and meanwhile, the obtained characteristic is subjected to 1 deconvolution operation: the number of the characteristic channels is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, the characteristics are activated and subjected to batch normalization, the obtained characteristics are connected with the up-sampling result of the 3 rd scale characteristics in series, and then one convolution operation is carried out: the number of the characteristic channels is 16, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, the characteristics are activated and subjected to batch normalization, the obtained characteristics are used as the 1 st scale result of the tensor B, and the 4 th scale result of the tensor B is utilized to obtain the output of the 4 th network branch;
Step 3: training of neural networks
The samples in the natural image data set, the ultrasound image data set and the CT image data set are each divided into a training set and a testing set at a ratio of 9:1, where the data in the training set are used for training and the data in the testing set for testing; during training, the training data are taken from the corresponding data set, uniformly scaled to resolution p×o and input into the corresponding network, and iterative optimization is performed, minimizing the loss of each batch by continuously modifying the network model parameters;
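
A hedged sketch of the 9:1 partition (Python; the shuffling and fixed seed are assumptions, only the ratio follows the text):

    import random

    def split_dataset(samples, ratio=0.9, seed=0):
        items = list(samples)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * ratio)
        # first part for training, remainder for testing
        return items[:cut], items[cut:]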
in the training process, the calculation method of each loss comprises the following steps:
Internal parameter supervised synthesis loss: in the network model training on natural images, the tensor I output by the depth information coding network is used as the depth, and the tensor L output by the mutual-attention Transformer learning network together with the internal parameter label e_t (t = 1, 2, 3, 4) of the training data are used as the pose parameters and the camera internal parameters respectively; according to the computer vision principle algorithm, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between image c and the two synthesized images;
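
As an illustrative sketch of such a synthesis (Python with NumPy; nearest-neighbour sampling and the variable names are assumptions, the actual sampling scheme may differ), a source image is warped into the target viewpoint using the predicted depth, a 4×4 relative pose and the intrinsic matrix:

    import numpy as np

    def synthesize_view(src, depth, K, T_src_tgt):
        # warp the source image into the target viewpoint (nearest-neighbour sampling)
        h, w = depth.shape
        out = np.zeros_like(src)
        K_inv = np.linalg.inv(K)
        for v in range(h):
            for u in range(w):
                p_tgt = depth[v, u] * (K_inv @ np.array([u, v, 1.0]))  # target camera frame
                p_src = T_src_tgt[:3, :3] @ p_tgt + T_src_tgt[:3, 3]   # source camera frame
                uv = K @ p_src
                us, vs = int(round(uv[0] / uv[2])), int(round(uv[1] / uv[2]))
                if 0 <= us < w and 0 <= vs < h:
                    out[v, u] = src[vs, us]
        return out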
Unsupervised synthesis loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; according to the computer vision algorithm, images at the target image viewpoint are synthesized from the two neighbouring images of the target image, and the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each synthesized image;
Internal parameter error loss: in the network model training on natural images, the loss is computed as the sum of the absolute values of the component-wise differences between the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network and the internal parameter label e_t (t = 1, 2, 3, 4) of the training data;
Spatial structure error loss: in the network model training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; according to the computer vision algorithm, the three-dimensional coordinates of the image at the target viewpoint are reconstructed from its two neighbouring images, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the spatial structure error loss is computed from the normal vector obtained by the fitting and the output tensor D of the 3rd network branch of the mutual-attention Transformer learning network;
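
An illustrative sketch of the plane fitting step (Python with NumPy; the plane model, iteration count, inlier threshold and the cosine-based comparison are assumptions, since the claim only specifies RANSAC fitting and a comparison with the predicted normal):

    import numpy as np

    def ransac_plane_normal(points, iters=200, thresh=0.01, seed=0):
        # fit a plane to N x 3 reconstructed points and return its unit normal
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        best_normal, best_inliers = None, -1
        for _ in range(iters):
            p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-8:
                continue
            n /= np.linalg.norm(n)
            inliers = int((np.abs((points - p0) @ n) < thresh).sum())
            if inliers > best_inliers:
                best_inliers, best_normal = inliers, n
        return best_normal

    def structure_error(points, predicted_normal):
        # one plausible error: 1 - |cosine| between fitted and predicted normals
        n = ransac_plane_normal(points)
        p = predicted_normal / np.linalg.norm(predicted_normal)
        return 1.0 - abs(float(n @ p))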
Conversion synthesis loss: in the network parameter training on ultrasound or CT images, the output tensor I of the depth information coding network is used as the depth, and the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the pose parameters and the camera internal parameters respectively; two synthesized images at the target image viewpoint are constructed from the two neighbouring images of the target image; for each of the synthesized images, the output tensor B of the 4th network branch is used as the spatial-domain deformation displacement of each pixel position obtained during synthesis, forming a synthesized result image; the loss is computed as the sum of per-pixel, per-colour-channel intensity differences between the target image and each synthesized result at the target viewpoint;
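
A minimal sketch of applying such a displacement field to a synthesized image before the photometric comparison (Python with NumPy; a two-component per-pixel displacement and nearest-neighbour resampling are assumptions for illustration):

    import numpy as np

    def apply_displacement(image, flow):
        # shift each pixel of the synthesized image by its predicted displacement
        # flow[..., 0] = horizontal offset, flow[..., 1] = vertical offset
        h, w = image.shape[:2]
        out = np.zeros_like(image)
        for v in range(h):
            for u in range(w):
                us = int(round(u + flow[v, u, 0]))
                vs = int(round(v + flow[v, u, 1]))
                if 0 <= us < w and 0 <= vs < h:
                    out[v, u] = image[vs, us]
        return out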
The specific training steps are as follows:
(1) On the natural image data set, the depth information coding network, the backbone network of the mutual-attention Transformer learning network and the 1st network branch are each trained 80000 times
Training data are taken from the natural image data set each time and uniformly scaled to resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the mutual-attention Transformer learning network; the depth information coding network, the backbone network of the mutual-attention Transformer learning network and the 1st network branch are trained 80000 times, and the training loss of each batch is computed with the internal parameter supervised synthesis loss;
(2) On the natural image data set, the 2nd network branch of the mutual-attention Transformer learning network is trained 50000 times
Training data are taken from the natural image data set each time and uniformly scaled to resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the mutual-attention Transformer learning network; the 2nd network branch is trained, and the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the internal parameter error loss;
(3) On the ultrasound image data set, the depth information coding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained 80000 times to obtain the model parameters ρ
Ultrasound training data are taken from the ultrasound image data set each time and uniformly scaled to resolution p×o; image j is input into the depth information coding network, and image j and image π are input into the mutual-attention Transformer learning network; the depth information coding network, the backbone network of the mutual-attention Transformer learning network and network branches 1-4 are trained, and the training loss of each batch is computed as the sum of the conversion synthesis loss and the spatial structure error loss;
(4) On the CT image data set, the mutual-attention Transformer learning network is trained 60000 times to obtain the model parameters ρ'
CT image training data are taken from the CT image data set each time and uniformly scaled to resolution p×o; image m and image σ are input into the mutual-attention Transformer learning network; the output of the depth information coding network is used as the depth, the outputs of the backbone network and of the 1st and 2nd network branches are used as the pose parameters and the camera internal parameters respectively, and the output tensor B of the 4th network branch of the mutual-attention Transformer learning network is used as the spatial-domain deformation displacement; two images at the viewpoint of image m are synthesized from image l and image n respectively; the network is trained by continuously modifying its parameters and iterating the optimization so that the loss of each image of each batch is minimized, and the optimal network model parameters ρ' are obtained after the iteration; when computing the network optimization loss, a camera translational motion loss is added in addition to the conversion synthesis loss and the spatial structure error loss;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Using an ultrasound or CT sequence image from the sample, three-dimensional reconstruction is achieved by simultaneously performing the following 3 processes:
(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are computed as follows: the image is scaled to resolution p×o; for an ultrasound sequence image, image j is input into the depth information coding network, and image j and image π are input into the mutual-attention Transformer learning network; for a CT sequence image, image m is input into the depth information coding network, and image m and image σ are input into the mutual-attention Transformer learning network; prediction is performed with the model parameters ρ and ρ' respectively, the depth of each target frame is obtained from the depth information coding network, the output tensor L of the 1st network branch and the output tensor O of the 2nd network branch of the mutual-attention Transformer learning network are used as the camera pose parameters and the camera internal parameters respectively, and the three-dimensional coordinates of the target image in the camera coordinate system are computed from the depth information and the camera internal parameters of the target image according to the principles of computer vision;
(2) During the three-dimensional reconstruction of the sequence image, a key frame sequence is established: the first frame of the sequence is taken as the first frame of the key frame sequence and as the current key frame, the frames after the current key frame are taken as target frames, and new key frames are selected dynamically in the order of the target frames: the pose parameter matrix of the target frame relative to the current key frame is first initialized with the identity matrix; for any target frame, this pose parameter matrix is multiplied by the camera pose parameters of the target frame, and the result, combined with the internal parameters and depth information of the target frame, is used to synthesize an image at the target frame viewpoint; an error λ is computed as the sum of per-pixel, per-colour-channel intensity differences between the synthesized image and the target frame; an image at the target frame viewpoint is also synthesized from an adjacent frame of the target frame using the pose parameters and camera internal parameters, and an error γ is computed as the sum of per-pixel, per-colour-channel intensity differences between this synthesized image and the target frame; the synthesis error ratio Z is then computed with formula (1):
[Formula (1): the synthesis error ratio Z, computed from the errors λ and γ; the formula is shown as an image in the original publication]
When Z is greater than a threshold η, where 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; this iteration is repeated until the key frame sequence is established;
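
A hedged sketch of this key-frame selection loop (Python with NumPy; the accumulation of the relative pose matrix, the frame attributes and the two synthesis routines are placeholders and assumptions, and the form Z = λ/γ for formula (1) is likewise an assumption consistent with the threshold test):

    import numpy as np

    def build_keyframes(frames, synth_via_keyframe, synth_via_neighbor, eta=1.2):
        # frames[i].pose: predicted camera pose (4x4); frames[i].image: the frame itself
        keyframes = [(frames[0], np.eye(4))]
        T_rel = np.eye(4)                       # pose of target frame w.r.t. current key frame
        for frame in frames[1:]:
            T_rel = T_rel @ frame.pose          # multiply by the target frame camera pose
            lam = np.abs(synth_via_keyframe(frame, T_rel).astype(float) -
                         frame.image.astype(float)).sum()
            gam = np.abs(synth_via_neighbor(frame).astype(float) -
                         frame.image.astype(float)).sum()
            z = lam / max(gam, 1e-8)            # assumed form of the synthesis error ratio
            if z > eta:                         # 1 < eta < 2
                keyframes.append((frame, T_rel))
                T_rel = np.eye(4)               # the target frame becomes the current key frame
        return keyframes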
(3) The viewpoint of the first frame of the sequence image is taken as the origin of the world coordinate system; the resolution of any target image is scaled to M×N; the three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information obtained from the network output; and the three-dimensional coordinates in the world coordinate system of each pixel of the target frame are computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
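
A minimal sketch of this final world-coordinate computation (Python with NumPy; the composition order of the key-frame poses and the relative pose matrix is an assumption):

    import numpy as np

    def camera_to_world(points_cam, keyframe_poses, T_target_rel):
        # compose the poses of the key frames up to the current one with the
        # target frame's pose matrix relative to the current key frame
        T = np.eye(4)
        for T_key in keyframe_poses:
            T = T @ T_key
        T = T @ T_target_rel
        pts_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
        return (pts_h @ T.T)[:, :3]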
CN202110881635.7A 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer Active CN113689548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110881635.7A CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110881635.7A CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Publications (2)

Publication Number Publication Date
CN113689548A CN113689548A (en) 2021-11-23
CN113689548B true CN113689548B (en) 2023-06-23

Family

ID=78578764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110881635.7A Active CN113689548B (en) 2021-08-02 2021-08-02 A 3D reconstruction method of medical images based on mutual attention Transformer

Country Status (1)

Country Link
CN (1) CN113689548B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952966B (en) * 2024-03-26 2024-10-22 华南理工大学 Sinkhorn algorithm-based multi-mode fusion survival prediction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007282945A (en) * 2006-04-19 2007-11-01 Toshiba Corp Image processing device
CN112767467A (en) * 2021-01-25 2021-05-07 郑健青 Double-image depth estimation method based on self-supervision deep learning
CN112767532A (en) * 2020-12-30 2021-05-07 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN113076398A (en) * 2021-03-30 2021-07-06 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance

Also Published As

Publication number Publication date
CN113689548A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
Zheng et al. Smaformer: Synergistic multi-attention transformer for medical image segmentation
CN111145181B (en) Skeleton CT image three-dimensional segmentation method based on multi-view separation convolutional neural network
CN113689542B (en) A 3D reconstruction method of ultrasound or CT medical images based on self-attention Transformer
CN112767532B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN116129107A (en) Three-dimensional medical image segmentation method and system based on long-short-term memory self-attention model
CN113689545B (en) 2D-to-3D end-to-end ultrasound or CT medical image cross-modal reconstruction method
CN115908811B (en) CT image segmentation method based on transducer and convolution attention mechanism
CN116228823A (en) An artificial intelligence-based method for unsupervised cascade registration of magnetic resonance images
CN117036162B (en) Residual feature attention fusion method for lightweight chest CT image super-resolution
CN116309754B (en) Brain medical image registration method and system based on local-global information collaboration
CN117333750A (en) Spatial registration and local-global multi-scale multi-modal medical image fusion method
CN119785195B (en) Pathological hyperspectral image detection method based on trans-scale spatial spectrum feature fusion network
CN110415253A (en) A kind of point Interactive medical image dividing method based on deep neural network
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN117853730A (en) U-shaped fully convolutional medical image segmentation network based on convolution kernel attention mechanism
CN119600043A (en) Brain tumor MRI image segmentation model and method based on improved Swin UNETR network
CN118229695A (en) A medical image segmentation method based on PCCTrans
CN113689546B (en) A cross-modal 3D reconstruction method for ultrasound or CT images with two-view twin Transformers
CN113689544B (en) Cross-view geometric constraint medical image three-dimensional reconstruction method
Sun et al. Medical image super-resolution via transformer-based hierarchical encoder–decoder network
CN113689548B (en) A 3D reconstruction method of medical images based on mutual attention Transformer
CN112700534B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on feature migration
CN112734907B (en) Ultrasonic or CT medical image three-dimensional reconstruction method
Malczewski A framework for reconstructing super-resolution magnetic resonance images from sparse raw data using multilevel generative methods
CN113689547B (en) A method for 3D reconstruction of ultrasound or CT medical images based on cross-view visual Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant