CN113689542B - Ultrasonic or CT medical image three-dimensional reconstruction method based on a self-attention Transformer - Google Patents

Ultrasonic or CT medical image three-dimensional reconstruction method based on a self-attention Transformer

Info

Publication number
CN113689542B
CN113689542B (application CN202110878837.6A)
Authority
CN
China
Prior art keywords
image
feature
network
attention
multiplied
Prior art date
Legal status
Active
Application number
CN202110878837.6A
Other languages
Chinese (zh)
Other versions
CN113689542A
Inventor
全红艳
董家顺
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202110878837.6A
Publication of CN113689542A
Application granted granted Critical
Publication of CN113689542B

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G16H30/20 ICT specially adapted for the handling or processing of medical images, e.g. DICOM, HL7 or PACS
    • G06T2207/10081 Computed x-ray tomography [CT]
    • G06T2207/10132 Ultrasound image
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method for ultrasonic or CT medical images based on a self-attention Transformer. An unsupervised learning mechanism based on multi-layer feature perception with a Transformer is adopted: a convolutional neural network structure based on a vision Transformer is designed according to the characteristics of ultrasonic or CT image acquisition data, a self-attention mechanism is employed, and three-dimensional reconstruction of ultrasonic images is realized with unsupervised measures through transfer learning. The method can effectively predict the three-dimensional geometric information of ultrasonic or CT images, can provide an effective 3D reconstruction solution for artificial intelligence medical auxiliary diagnosis, and improves the efficiency of artificial intelligence auxiliary diagnosis.

Description

Ultrasonic or CT medical image three-dimensional reconstruction method based on a self-attention Transformer
Technical Field
The invention belongs to the field of computer technology, relates to three-dimensional reconstruction of ultrasonic or CT images in medical auxiliary diagnosis, and in particular relates to a method that draws on the imaging rules of natural images, uses an artificial intelligence transfer learning strategy, and adopts a state-of-the-art self-attention Transformer coding technique for three-dimensional reconstruction of ultrasonic or CT images.
Background
In recent years, artificial intelligence technology has developed rapidly, and research on key technologies for intelligent medical auxiliary diagnosis is of great significance in modern clinical medicine. In current research on three-dimensional reconstruction of ultrasonic or CT images, the objective facts that medical images have little texture and much noise, and in particular that recovering the parameters of an ultrasonic camera is difficult, make the research challenging and complicate the three-dimensional reconstruction of a model, which hinders the development of clinical medical auxiliary diagnosis technology. How to build an effective network coding model for deep learning and solve the problem of rapid three-dimensional reconstruction of ultrasonic or CT images is a practical problem to be solved. Because it adopts a global context attention mechanism, the Transformer model has strong feature perception capability and is gradually seeing wide application in medical image analysis.
Disclosure of Invention
The invention aims to provide a three-dimensional reconstruction method for ultrasonic or CT medical images based on a self-attention Transformer. Combining the characteristics of medical images, it adopts a self-attention Transformer coding structure to fully learn the contextual characteristics of medical images, uses a convolutional neural network to construct a depth prediction model, and makes full use of the spatial structure characteristics of medical images as constraint conditions for optimizing the reconstruction process, so that a finer three-dimensional structure of the medical target can be obtained; the method therefore has high practical value.
The specific technical solution for achieving the object of the invention is as follows:
A self-attention-Transformer-based three-dimensional reconstruction method for ultrasonic or CT medical images takes as input an ultrasonic or CT image sequence whose image resolution is M×N, with 100 ≤ M ≤ 2000 and 100 ≤ N ≤ 2000. The three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
a) Constructing a natural image dataset
Select a natural image website that provides image sequences and the corresponding camera internal parameters, and download a image sequences and the corresponding internal parameters of those sequences from the website, where 1 ≤ a ≤ 20. For each image sequence, every 3 adjacent frames are denoted image b, image c and image d; image b and image d are spliced along the color channels to obtain an image τ, and image c and image τ form one data element, where image c is the natural target image and the sampling viewpoint of image c serves as the target viewpoint. The internal parameters of images b, c and d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. A natural image data set is constructed from all sequences; the constructed data set contains f elements, with 3000 ≤ f ≤ 20000;
b) Constructing ultrasound image datasets
Sample g ultrasonic image sequences, where 1 ≤ g ≤ 20. For each sequence, every 3 adjacent frames are denoted image i, image j and image k; image i and image k are spliced along the color channels to obtain an image π, and image j and image π form one data element, where image j is the ultrasonic target image and the sampling viewpoint of image j serves as the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. An ultrasonic image data set is constructed from all sequences; the constructed data set contains F elements, with 1000 ≤ F ≤ 20000;
c) Constructing CT image datasets
Sample h CT image sequences, where 1 ≤ h ≤ 20. For each sequence, every 3 adjacent frames are denoted image l, image m and image n; image l and image n are spliced along the color channels to obtain an image σ, and image m and image σ form one data element, where image m is the CT target image and the sampling viewpoint of image m serves as the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. A CT image data set is constructed from all sequences; the constructed data set contains ξ elements, with 1000 ≤ ξ ≤ 20000;
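The following Python sketch illustrates how one data element could be assembled from three adjacent frames as described above (shown for the natural-image case; the ultrasonic and CT sets are built the same way, without the intrinsic labels). The file handling, the load_image helper and the intrinsics layout are illustrative assumptions, not the patent's code.

```python
import numpy as np
from PIL import Image

def load_image(path, size=(416, 128)):
    """Read one frame and scale it to the network resolution p x o."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def make_element(path_prev, path_target, path_next, intrinsics=None):
    """Return (target image, spliced neighbour image, optional intrinsics e1..e4)."""
    prev_img, target, next_img = (load_image(p) for p in (path_prev, path_target, path_next))
    spliced = np.concatenate([prev_img, next_img], axis=-1)   # splice along the color channels -> H x W x 6
    return target, spliced, None if intrinsics is None else np.asarray(intrinsics, np.float32)

def make_dataset(frame_paths, intrinsics=None):
    """Group frames 3 at a time; a trailing remainder of fewer than 3 frames is discarded."""
    usable = len(frame_paths) - len(frame_paths) % 3
    return [make_element(*frame_paths[i:i + 3], intrinsics) for i in range(0, usable, 3)]
```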
Step 2: construction of neural networks
The resolution of the images input to the neural networks is p×o, where p is the width and o is the height, in pixels, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000;
(1) Structure of depth information coding network
Tensor H is the input, with scale α×o×p×3; tensor I is the output, with scale α×o×p×1, where α is the batch size;
the depth information coding network consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after coding and decoding processing in sequence;
The encoder consists of 5 units. The first unit is a convolution unit and units 2 to 5 are composed of residual modules. In the first unit there are 64 convolution kernels, the kernel shape is 7×7, the horizontal and vertical strides of the convolution are 2, and one max-pooling operation is performed after the convolution. Units 2 to 5 contain 3, 4, 6 and 3 residual modules respectively; each residual module performs 3 convolutions, the kernel shape is 3×3, and the numbers of kernels are 64, 128, 256 and 512 respectively;
The decoder consists of 6 decoding units. Each decoding unit comprises a deconvolution and a convolution with the same kernel shape and number; the kernel shape in decoding units 1 to 6 is 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16 respectively. The network layers of the encoder and decoder are connected across layers, with the correspondence: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
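A minimal tf.keras sketch of the depth information coding network follows, assuming a TF 2.x-style Keras API: a convolution unit plus four residual stages with 3, 4, 6 and 3 modules and 64/128/256/512 kernels for the encoder, and six deconvolution-plus-convolution decoding units with cross-layer connections. The exact skip wiring, activations and output scaling are illustrative assumptions, not the patented configuration.

```python
import tensorflow as tf
L = tf.keras.layers

def res_module(x, filters, stride=1):
    """Residual module: three 3x3 convolutions with a projection shortcut."""
    shortcut = L.Conv2D(filters, 1, strides=stride, padding="same")(x)
    x = L.Conv2D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    x = L.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = L.Conv2D(filters, 3, padding="same")(x)
    return L.ReLU()(L.Add()([x, shortcut]))

def build_depth_net(h=128, w=416):
    img = L.Input(shape=(h, w, 3))                                           # tensor H
    c1 = L.Conv2D(64, 7, strides=2, padding="same", activation="relu")(img)  # unit 1: 64 kernels, 7x7, stride 2
    x = L.MaxPool2D(3, strides=2, padding="same")(c1)
    skips = [c1]
    for i, (f, n) in enumerate([(64, 3), (128, 4), (256, 6), (512, 3)]):     # units 2-5: residual stages
        x = res_module(x, f, stride=1 if i == 0 else 2)
        for _ in range(n - 1):
            x = res_module(x, f)
        skips.append(x)
    for i, f in enumerate([512, 256, 128, 64, 32, 16]):                      # 6 decoding units
        x = L.Conv2DTranspose(f, 3, strides=2 if i < 5 else 1, padding="same", activation="relu")(x)
        x = L.Conv2D(f, 3, padding="same", activation="relu")(x)
        if i < 4:                                                            # cross-layer connections 1-4, 2-3, 3-2, 4-1
            x = L.Concatenate()([x, skips[3 - i]])
    depth = L.Conv2D(1, 3, padding="same", activation="sigmoid")(x)          # tensor I
    return tf.keras.Model(img, depth)
```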
(2) Structure of the vision Transformer parameter learning network
The vision Transformer parameter learning network is composed of a module W and a module G. For the module W, tensor J and tensor C are the inputs, with scales α×o×p×3 and α×o×p×6 respectively; the outputs are tensor L, tensor O and tensor D, where the scale of tensor L is α×2×6, the scale of tensor O is α×4×1, the scale of tensor D is α×3, and α is the batch size;
The module W consists of a backbone network and 3 network branches, which are used to predict tensors L, O and D respectively;
First, the backbone network encodes as follows: tensor J and tensor C are concatenated along the last channel and input into the backbone network, and 3 stages of encoding are performed in sequence, with 2, 3 and 4 attention heads respectively. Each stage of encoding is specifically as follows:
a) Embedded coding
In embedded coding, a convolution operation is performed first; the kernel scales in the 3 encoding stages are 7×7, 3×3 and 3×3, and the horizontal and vertical strides are 4, 2 and 2 respectively. The obtained coding features are then stretched from the spatial-domain shape of the image features into sequence form, and layer normalization is applied;
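A hedged sketch of one embedded-coding step is shown below (strided convolution, stretching the spatial feature map into a token sequence, then layer normalization); the kernel size and stride shown follow stage 1, and the tf.keras API usage is an assumption.

```python
import tensorflow as tf
L = tf.keras.layers

def embedded_encoding(x, filters=64, kernel=7, stride=4):
    """Convolution of the stage, spatial map stretched to a sequence, then layer normalization."""
    x = L.Conv2D(filters, kernel, strides=stride, padding="same")(x)
    b, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
    tokens = tf.reshape(x, [b, h * w, x.shape[-1]])   # stretch the spatial-domain shape into sequence form
    return L.LayerNormalization()(tokens)
```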
b) Transformer encoding for self-attention mechanisms
Layer normalization yields the intermediate processing features. These intermediate features are then processed three times in parallel, once each along the query, keyword and numerical-value dimensions. Each processing consists of a separable convolution (kernel scale 3×3, input feature dimension 64, horizontal and vertical strides 1), batch normalization, and a convolution unit (kernel scale 1×1, horizontal and vertical strides 1, output feature dimension equal to the number of attention heads multiplied by the input feature dimension); the resulting coding features are stretched from the spatial-domain shape of the image features into sequence form, giving the query Q coding vector, the keyword K coding vector and the numerical value V coding vector for attention learning;
From the query Q, keyword K and numerical value V coding vectors obtained in each of the three encoding stages, an attention weight matrix is computed for each stage by the self-attention learning method;
the 1 st stage attention weight matrix and the 1 st stage intermediate processing feature are added to obtain a 1 st stage backbone network coding feature, the 2 nd stage attention weight matrix and the 2 nd stage intermediate processing feature are added to obtain a 2 nd stage backbone network coding feature, and the 3 rd stage attention weight matrix and the 3 rd stage intermediate processing feature are added to obtain a 3 rd stage backbone network coding feature;
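The sketch below illustrates the convolutional Q/K/V projection and the self-attention computation of one backbone stage: separable 3x3 convolutions with batch normalization and a 1x1 convolution produce the query, keyword and value coding vectors, which feed multi-head attention, and the result is added back to the stage's intermediate features. The head count follows the text; the MultiHeadAttention layer, the projection back to the input width and the reshaping are assumptions.

```python
import tensorflow as tf
L = tf.keras.layers

def conv_projection(x, heads, dim=64):
    """Separable 3x3 conv + batch norm + 1x1 conv (output dim = heads x input dim), flattened to tokens."""
    x = L.SeparableConv2D(dim, 3, padding="same")(x)
    x = L.BatchNormalization()(x)
    x = L.Conv2D(heads * dim, 1)(x)
    return tf.reshape(x, [tf.shape(x)[0], -1, heads * dim])

def stage_attention(feature_map, heads, dim=64):
    """Self-attention over one stage; the attention result is added to the intermediate features."""
    q = conv_projection(feature_map, heads, dim)   # query Q coding vector
    k = conv_projection(feature_map, heads, dim)   # keyword K coding vector
    v = conv_projection(feature_map, heads, dim)   # numerical value V coding vector
    attn = L.MultiHeadAttention(num_heads=heads, key_dim=dim)(q, v, key=k)
    attn = L.Dense(dim)(attn)                      # project back to the stage feature width (assumption)
    attn = tf.reshape(attn, tf.shape(feature_map))
    return feature_map + attn                      # stage backbone coding feature
```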
Then, the 3 network branches are encoded in sequence:
For the 1st network branch, the stage-1 backbone coding features first pass through 2 units: in unit 1, the convolution has 16 feature channels, kernel scale 7×7 and horizontal and vertical strides 1, followed by feature activation and batch normalization; in unit 2, the convolution has 32 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization. The resulting features are concatenated with the stage-1 backbone coding features and pass through 2 more units: in unit 1, the convolution has 32 feature channels, kernel scale 7×7 and strides 1, followed by feature activation and batch normalization; in unit 2, the convolution has 64 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the stage-3 backbone coding features and pass through 3 units in sequence: in unit 1, the convolution has 64 feature channels, kernel scale 7×7 and strides 2, followed by feature activation and batch normalization; in unit 2, the convolution has 128 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization; in unit 3, the convolution has 12 feature channels, kernel scale 1×1 and strides 1, followed by feature activation and batch normalization. The obtained 12-channel feature results are predicted in the form 2×6 to give the result of tensor L;
For the 2nd network branch, the stage-1 backbone coding features first pass through 2 units: in unit 1, the convolution has 16 feature channels, kernel scale 7×7 and horizontal and vertical strides 1, followed by feature activation and batch normalization; in unit 2, the convolution has 32 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization. The resulting features are concatenated with the stage-2 backbone coding features and pass through 2 more units in sequence: in unit 1, the convolution has 32 feature channels, kernel scale 7×7 and strides 1, followed by feature activation and batch normalization; in unit 2, the convolution has 64 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization. The resulting features are concatenated with the stage-3 backbone coding features and pass through 3 units in sequence: in unit 1, the convolution has 64 feature channels, kernel scale 7×7 and strides 2, followed by feature activation and batch normalization; in unit 2, the horizontal and vertical strides are 2, followed by feature activation and batch normalization; in unit 3, the convolution has 4 feature channels, kernel scale 1×1 and strides 1, followed by feature activation and batch normalization. The obtained 4-channel feature results are taken as the result of tensor O;
For the 3 rd network branch, the 3 rd stage backbone network coding feature is sequentially subjected to 3 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scale is 3×3, the step sizes of the horizontal direction and the vertical direction are 2, then characteristic activation and batch normalization processing are carried out, in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 3×3, the step sizes of the horizontal direction and the vertical direction are 2, then characteristic activation and batch normalization processing are carried out, in the 3 rd unit processing, the number of characteristic channels of convolution operation is 3, the convolution kernel scale is 1×1, the step sizes of the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained characteristic is used as the result of the 3 rd channel and is used as the result of tensor D;
For the module G, tensor J and tensor C are the inputs and the output is tensor B, with scale α×o×p×4, where α is the batch size. The module G is designed to first perform cross-view embedded coding, then convolutional embedded coding, and finally decoding, specifically:
a) Cross-view embedded coding
First, cross-view embedded coding is applied separately to tensor J, to the first 3 feature components of the last dimension of tensor C, and to the last 3 feature components of the last dimension of tensor C: a convolution is applied (kernel scale 7×7, 32 feature channels, horizontal and vertical strides 4), the coding features are transformed from the spatial-domain shape of the image features into a sequence structure, and layer normalization yields cross-view embedded code 1, cross-view embedded code 2 and cross-view embedded code 3;
Then, the cross-view embedded code 1 and the cross-view embedded code 2 are connected in series according to the last dimension to obtain an attention code input feature 1, the cross-view embedded code 1 and the cross-view embedded code 3 are connected in series according to the last dimension to obtain an attention code input feature 2, the cross-view embedded code 2 and the cross-view embedded code 1 are connected in series according to the last dimension to obtain an attention code input feature 3, the cross-view embedded code 3 and the cross-view embedded code 1 are connected in series according to the last dimension to obtain an attention code input feature 4, and the 4 attention code input features are respectively subjected to attention code processing:
For each attention code input feature t (t = 1, 2, 3, 4), the first half of the features along the last channel is taken as target coding feature t and the second half as source coding feature t. A separable convolution (kernel scale 3×3, 32 feature channels, horizontal and vertical strides 1) is applied to the target coding feature, and the resulting coding features serve as the keyword K coding vector and the numerical value V coding vector for attention learning; a separable convolution with the same parameters is applied to the source coding feature, and the resulting coding feature serves as the query Q coding vector for attention learning. From the query Q, keyword K and numerical value V coding vectors, attention weight matrix t is then computed by the self-attention learning method;
Attention code input feature 1 is added to attention weight matrix 1 to obtain cross-view embedded coding feature 1; attention code input feature 2 is added to attention weight matrix 2 to obtain cross-view embedded coding feature 2; attention code input feature 3 is added to attention weight matrix 3 to obtain cross-view embedded coding feature 3; and attention code input feature 4 is added to attention weight matrix 4 to obtain cross-view embedded coding feature 4. The average of cross-view embedded coding feature 1 and cross-view embedded coding feature 2 is taken as cross-view cross-layer feature 1, and cross-view cross-layer feature 1, cross-view embedded coding feature 3 and cross-view embedded coding feature 4 are passed to the next convolutional embedded coding step;
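A sketch of one cross-view attention step follows: the first half of the concatenated channels (the target-view code) provides the keyword K and value V projections, the second half (the source-view code) provides the query Q, and the attention result is added back to the input feature. The separable-conv projections follow the text; the token reshaping and the projection back to the input channel width are assumptions.

```python
import tensorflow as tf
L = tf.keras.layers

def proj(x, channels=32):
    """Separable 3x3 convolution (stride 1), flattened from the spatial map into tokens."""
    x = L.SeparableConv2D(channels, 3, padding="same")(x)
    return tf.reshape(x, [tf.shape(x)[0], -1, channels])

def cross_view_attention(attn_input):
    """attn_input: concatenation of two cross-view embedded codes along the last channel."""
    half = attn_input.shape[-1] // 2
    target, source = attn_input[..., :half], attn_input[..., half:]
    k, v = proj(target), proj(target)                 # keyword K and value V from the target-view half
    q = proj(source)                                  # query Q from the source-view half
    attn = L.Attention(use_scale=True)([q, v, k])     # scaled dot-product attention
    attn = L.Dense(attn_input.shape[-1])(attn)        # widen back to the input channel count (assumption)
    b, h, w = tf.shape(attn_input)[0], tf.shape(attn_input)[1], tf.shape(attn_input)[2]
    attn = tf.reshape(attn, [b, h, w, attn_input.shape[-1]])
    return attn_input + attn                          # cross-view embedded coding feature
```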
b) Convolutional embedded coding
Cross-view cross-layer feature 1, cross-view embedded coding feature 3 and cross-view embedded coding feature 4 are each passed through 2 units in sequence: in unit 1, the convolution has 64 feature channels, kernel scale 3×3 and horizontal and vertical strides 2, followed by serialization and layer normalization; in unit 2, the convolution has 128 feature channels, kernel scale 3×3 and strides 2, followed by serialization and layer normalization. This yields 3 embedded coding features. The feature obtained after unit 1 processing of cross-view cross-layer feature 1 is taken as cross-view cross-layer feature 2, and the feature obtained after unit 2 processing of cross-view cross-layer feature 2 is taken as cross-view cross-layer feature 3. The 3 embedded coding features are concatenated along the last dimension to obtain the convolutional embedded coding result;
c) Decoding process
Cross-view cross-layer feature 1 is processed by a deconvolution unit: the deconvolution has 16 feature channels, kernel scale 3×3 and horizontal and vertical strides 2, followed by feature activation and batch normalization. The result is convolved once with 32 feature channels and kernel scale 3×3, followed by feature activation and batch normalization, and the obtained features are recorded as decoder cross-layer feature 1. Decoder cross-layer feature 1 is concatenated with cross-view cross-layer feature 2, and the concatenated result is convolved once with 128 feature channels, kernel scale 3×3 and strides 2, followed by feature activation and batch normalization. The result is concatenated with cross-view cross-layer feature 3 and processed by a deconvolution unit: 128 feature channels, deconvolution kernel scale 3×3, strides 2, followed by feature activation and batch normalization. The result is concatenated with decoder cross-layer feature 1, and the concatenation is processed by a convolution unit: 128 feature channels, kernel scale 3×3, strides 1, followed by feature activation and batch normalization; the obtained features are taken as the 4th-scale result of tensor B. The 4th-scale features are concatenated with cross-view cross-layer feature 1 and processed by a deconvolution unit: 64 feature channels, deconvolution kernel scale 3×3, strides 2, followed by feature activation and batch normalization; the obtained features are taken as the 3rd-scale result of tensor B. The 3rd-scale features are concatenated with cross-view cross-layer feature 2 and processed by a deconvolution unit: 32 feature channels, deconvolution kernel scale 3×3, strides 2, followed by feature activation and batch normalization; the obtained features are taken as the 2nd-scale result of tensor B. The 2nd-scale features are concatenated with cross-view cross-layer feature 1 and processed by a convolution unit: 16 feature channels, kernel scale 3×3, strides 1, followed by feature activation and batch normalization; the obtained features are taken as the 1st-scale result of tensor B;
Obtaining the output of the module G by using the 4 th scale result, the 3 rd scale result, the 2 nd scale result and the 1 st scale result of the tensor B;
step 3: training of neural networks
The samples in the natural image data set, the ultrasonic image data set and the CT image data set are each divided into a training set and a test set at a ratio of 9:1; data in the training set are used for training and data in the test set for testing. During training, training data are taken from the corresponding data set, uniformly scaled to the resolution p×o and input into the corresponding network, and iterative optimization is performed, minimizing the loss of each batch by continuously modifying the network model parameters;
In the training process, each loss is calculated as follows:
Internal-parameter-supervised synthesis loss: in network model training on natural images, the tensor I output by the depth information coding network is taken as the depth, and the tensor L output by the vision Transformer parameter learning network and the internal parameter label e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and camera internal parameters respectively. According to computer vision principles, image b and image d are each used to synthesize an image at the viewpoint of image c, and the loss is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between image c and the two synthesized images;
Unsupervised synthesis loss: in network model training on ultrasonic or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensor L and tensor O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and camera internal parameters respectively. According to a computer vision algorithm, the two images adjacent to the target image are each used to construct a synthesized image at the target viewpoint, and the loss is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between the synthesized images and the target image;
Internal parameter error loss: calculated as the sum of the absolute values of the component-wise differences between the tensor O output by the vision Transformer parameter learning network and the internal parameter label e_t (t = 1, 2, 3, 4) of the training data;
Spatial structure error loss: in network model training on ultrasonic or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensor L and tensor O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and camera internal parameters respectively. According to a computer vision algorithm, the two images adjacent to the target-viewpoint image are each used to reconstruct the three-dimensional coordinates of the image at the target viewpoint, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the loss is calculated as the cosine distance between the normal vector obtained by the fitting and the tensor D output by the vision Transformer parameter learning network;
Conversion synthesis loss: in network model training on ultrasonic or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensor L and tensor O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and camera internal parameters respectively. According to a computer vision algorithm, the two images adjacent to the target image are used to construct two synthesized images at the target image viewpoint; for each of the synthesized images, the tensor B output by module G is taken as the spatial-domain deformation displacement of each pixel position obtained during synthesis, and the loss is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between each of the two synthesized images at the target image viewpoint and the image at the target viewpoint;
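To make the loss definitions concrete, here is a hedged sketch of how they could be combined. synthesize_view (inverse-warping a neighbouring image into the target viewpoint from depth, pose and intrinsics) and fit_plane_normal_ransac are assumed helpers, not functions defined in the patent.

```python
import tensorflow as tf

def photometric_loss(img_a, img_b):
    """Sum of pixel-by-pixel, channel-by-channel intensity differences."""
    return tf.reduce_sum(tf.abs(img_a - img_b))

def unsupervised_synthesis_loss(target, sources, depth, poses, intrinsics, synthesize_view):
    """Each neighbouring view is warped to the target viewpoint and compared with the target image."""
    return tf.add_n([photometric_loss(target, synthesize_view(src, depth, pose, intrinsics))
                     for src, pose in zip(sources, poses)])

def intrinsic_error_loss(pred_intrinsics, label_intrinsics):
    """Sum of the absolute values of the component-wise differences of e1..e4."""
    return tf.reduce_sum(tf.abs(pred_intrinsics - label_intrinsics))

def spatial_structure_loss(points_3d, pred_normal, fit_plane_normal_ransac):
    """Cosine distance between the RANSAC-fitted plane normal and the predicted normal tensor D."""
    fitted = fit_plane_normal_ransac(points_3d)
    cos = tf.reduce_sum(fitted * pred_normal, axis=-1) / (
        tf.norm(fitted, axis=-1) * tf.norm(pred_normal, axis=-1) + 1e-8)
    return tf.reduce_mean(1.0 - cos)
```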
the specific training steps are as follows:
(1) On the natural image data set, the depth information coding network and the backbone network and 1st network branch of the vision Transformer parameter learning network are trained 60000 times
Each time, training data are taken from the natural image data set and uniformly scaled to the resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the vision Transformer parameter learning network. The depth information coding network and the backbone network and 1st network branch of the W module of the vision Transformer parameter learning network are trained 60000 times, and the training loss of each batch is calculated from the internal-parameter-supervised synthesis loss;
(2) On the natural image data set, the 2nd network branch of the W module of the vision Transformer parameter learning network is trained 50000 times
Each time, training data are taken from the natural image data set and uniformly scaled to the resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the vision Transformer parameter learning network. The 2nd network branch of the W module of the vision Transformer parameter learning network is trained, and the training loss of each batch is calculated as the sum of the unsupervised synthesis loss and the internal parameter error loss;
(3) On the ultrasonic image data set, the depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and the G module are trained 60000 times to obtain the network model parameters ρ
Each time, ultrasonic training data are taken from the ultrasonic image data set and uniformly scaled to the resolution p×o; image j is input into the depth information coding network, and image j and image π are input into the vision Transformer parameter learning network. The depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and the G module are trained, and the training loss of each batch is calculated as the sum of the conversion synthesis loss and the spatial structure error loss;
(4) On the CT image data set, the depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and the G module are trained 60000 times to obtain the model parameters ρ'
Each time, CT image training data are taken from the CT image data set and uniformly scaled to the resolution p×o; image m is input into the depth information coding network, and image m and image σ are input into the vision Transformer parameter learning network. The depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and the G module are trained for 60000 iterations, continuously modifying the network parameters so as to minimize the loss of each batch; when calculating the loss for network optimization, a camera translational motion loss is added to the conversion synthesis loss and the spatial structure error loss, and the model parameters ρ' are obtained;
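The four training stages can be summarized as a schedule; the table below mirrors steps (1)-(4), while the optimizer, learning rate and the run_stage helper are assumptions.

```python
# Schematic of the four-stage training schedule described in steps (1)-(4):
# each stage optimizes a subset of sub-networks on its own data set with its
# own loss terms and iteration count.
STAGES = [
    # (data set,    trainable parts,                                  loss terms,                                                        iterations)
    ("natural",     ["depth_net", "W.backbone", "W.branch1"],         ["intrinsic_supervised_synthesis"],                                60000),
    ("natural",     ["W.branch2"],                                    ["unsupervised_synthesis", "intrinsic_error"],                     50000),
    ("ultrasound",  ["depth_net", "W.backbone", "W.branch1-3", "G"],  ["conversion_synthesis", "spatial_structure"],                     60000),  # -> parameters rho
    ("ct",          ["depth_net", "W.backbone", "W.branch1-3", "G"],  ["conversion_synthesis", "spatial_structure", "camera_translation"], 60000),  # -> parameters rho'
]

def train_all(run_stage):
    """run_stage(dataset, parts, losses, iters) is an assumed helper that runs one training stage."""
    for dataset, parts, losses, iters in STAGES:
        run_stage(dataset, parts, losses, iters)
```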
step 4: three-dimensional reconstruction of ultrasound or CT images
Using a sampled ultrasonic or CT image sequence, three-dimensional reconstruction is achieved by performing the following 3 processes simultaneously:
(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are calculated as follows: the images are scaled to the resolution p×o; for an ultrasonic sequence, image j is input into the depth information coding network and image j and image π are input into the vision Transformer parameter learning network, and for a CT sequence, image m is input into the depth information coding network and image m and image σ are input into the vision Transformer parameter learning network; prediction uses the model parameters ρ and ρ' respectively. The depth of each target frame is obtained from the depth information coding network, and the tensor L output by the 1st network branch and the tensor O output by the 2nd network branch of the vision Transformer parameter learning network are taken as the camera pose parameters and camera internal parameters respectively; from the depth information of the target image and the camera internal parameters, the three-dimensional coordinates of the target image in the camera coordinate system are calculated according to computer vision principles;
(2) During the three-dimensional reconstruction of the sequence, a key frame sequence is established: the first frame of the image sequence is taken as the first frame of the key frame sequence and as the current key frame, the frames after the current key frame are taken as target frames, and new key frames are dynamically selected in the order of the target frames. First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix. For any target frame, this pose parameter matrix is multiplied by the camera pose parameters of the target frame, and the product, combined with the internal parameters and depth information of the target frame, is used to synthesize an image at the target frame viewpoint; the error λ is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between the synthesized image and the target frame. An image at the target frame viewpoint is also synthesized from the adjacent frame of the target frame using the pose parameters and camera internal parameters, and the error γ is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then calculated by formula (1):
Z = λ / γ        (1)

If Z is larger than a threshold value η, with 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameter of the new key frame, and the target frame is updated to be the current key frame; the key frame sequence is established by iterating this process;
(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. For any target image, the resolution is scaled back to M×N, the three-dimensional coordinates in the camera coordinate system are calculated from the camera internal parameters and depth information output by the network, and the three-dimensional coordinates of each pixel of the target frame in the world coordinate system are calculated from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
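The sketch below illustrates the key-frame selection rule of step (2), using the synthesis-error ratio Z = λ/γ as reconstructed in formula (1), and the back-projection of step (3) from depth and the intrinsics e1..e4 into the camera and world coordinate systems. The helper names, the 4x4 pose convention and the exact form of Z are assumptions.

```python
import numpy as np

def select_keyframes(frames, synth_from_keyframe, synth_from_neighbor, eta=1.5):
    """Step (2): promote a target frame to key frame when Z = lambda / gamma > eta (1 < eta < 2)."""
    keyframes = [0]                                   # the first frame starts the key-frame sequence
    for t in range(1, len(frames)):
        lam = np.abs(synth_from_keyframe(keyframes[-1], t) - frames[t]).sum()  # error lambda
        gam = np.abs(synth_from_neighbor(t) - frames[t]).sum()                 # error gamma
        if lam / max(gam, 1e-8) > eta:
            keyframes.append(t)
    return keyframes

def backproject(depth, intrinsics):
    """Per-pixel 3D coordinates in the camera frame. depth: (H, W); intrinsics: (e1, e2, e3, e4) = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def to_world(cam_points, keyframe_pose, relative_pose):
    """Step (3): chain the key-frame pose with the target frame's relative pose (both 4x4 matrices)."""
    pose = keyframe_pose @ relative_pose
    pts = np.concatenate([cam_points, np.ones(cam_points.shape[:-1] + (1,))], axis=-1)
    return (pts @ pose.T)[..., :3]
```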
The invention has the beneficial effects that:
The invention designs a vision Transformer deep learning model, establishes a neural network, learns the contextual characteristics of medical images and makes full use of the deep learning mechanism to realize automatic three-dimensional reconstruction of medical images. It can effectively obtain the three-dimensional geometric information of ultrasonic or CT images, helps realize three-dimensional visualization of the lesion area in clinical diagnosis, can provide an effective 3D reconstruction solution for artificial intelligence medical auxiliary diagnosis, and improves the efficiency of artificial-intelligence-assisted medical diagnosis.
Drawings
FIG. 1 is a three-dimensional reconstruction result graph of an ultrasound image of the present invention;
Fig. 2 is a three-dimensional reconstruction result diagram of a CT image according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Examples
The embodiment is implemented on a PC under the Windows 10 operating system; the hardware configuration is an Intel Core i7-9700F CPU, 16 GB of memory and an NVIDIA GeForce GTX 2070 8 GB GPU. The deep learning library is TensorFlow 1.14, and programming uses the Python 3.7 programming language.
A self-attention-Transformer-based three-dimensional reconstruction method for ultrasonic or CT medical images takes as input an ultrasonic or CT image sequence with resolution M×N; for the ultrasonic images M is 450 and N is 300, and for the CT images M and N are both 512. The three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
a) Constructing a natural image dataset
Select a natural image website that provides image sequences and the corresponding camera internal parameters, and download image sequences and the corresponding internal parameters from the website. For each image sequence, every 3 adjacent frames are denoted image b, image c and image d; image b and image d are spliced along the color channels to obtain an image τ, and image c and image τ form one data element, where image c is the natural target image and the sampling viewpoint of image c serves as the target viewpoint. The internal parameters of images b, c and d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. A natural image data set is constructed from all sequences; the data set has 3600 elements;
b) Constructing ultrasound image datasets
Sample 10 ultrasonic image sequences. For each sequence, every 3 adjacent frames are denoted image i, image j and image k; image i and image k are spliced along the color channels to obtain an image π, and image j and image π form one data element, where image j is the ultrasonic target image and the sampling viewpoint of image j serves as the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. An ultrasonic image data set is constructed from all sequences; the data set contains 1600 elements;
c) Constructing CT image datasets
Sample 1 CT image sequence. For the sequence, every 3 adjacent frames are denoted image l, image m and image n; image l and image n are spliced along the color channels to obtain an image σ, and image m and image σ form one data element, where image m is the CT target image and the sampling viewpoint of image m serves as the target viewpoint. If fewer than 3 frames remain at the end of the sequence, they are discarded. A CT image data set is constructed from the sequence; the data set contains 2000 elements;
Step 2: construction of neural networks
The resolution of the images processed by the neural networks is 416×128, where 416 is the width and 128 is the height, in pixels;
(1) Structure of depth information coding network
Tensor H is taken as input, the scale is 4×128×416×3, tensor I is taken as output, and the scale is 4×128×416×1;
the depth information coding network consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after coding and decoding processing in sequence;
The encoder consists of 5 units. The first unit is a convolution unit and units 2 to 5 are composed of residual modules. In the first unit there are 64 convolution kernels, the kernel shape is 7×7, the horizontal and vertical strides of the convolution are 2, and one max-pooling operation is performed after the convolution. Units 2 to 5 contain 3, 4, 6 and 3 residual modules respectively; each residual module performs 3 convolutions, the kernel shape is 3×3, and the numbers of kernels are 64, 128, 256 and 512 respectively;
The decoder consists of 6 decoding units. Each decoding unit comprises two steps, a deconvolution and a convolution, with the same kernel shape and number; the kernel shape in decoding units 1 to 6 is 3×3 and the numbers of kernels are 512, 256, 128, 64, 32 and 16 respectively. The network layers of the encoder and decoder are connected across layers, with the correspondence: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
(2) Structure of the vision Transformer parameter learning network
The vision Transformer parameter learning network is composed of a module W and a module G. For the module W, tensor J and tensor C are the inputs, with scales 4×128×416×3 and 4×128×416×6 respectively; the outputs are tensor L, tensor O and tensor D, where the scale of tensor L is 4×2×6, the scale of tensor O is 4×4×1, and the scale of tensor D is 4×3;
The module W consists of a backbone network and 3 network branches, which are used to predict tensors L, O and D respectively;
First, the backbone network encodes as follows: tensor J and tensor C are concatenated along the last channel and input into the backbone network, and 3 stages of encoding are performed in sequence, with 2, 3 and 4 attention heads respectively. Each stage of encoding is specifically as follows:
a) Embedded coding
In embedded coding, a convolution operation is performed first; the kernel scales in the 3 encoding stages are 7×7, 3×3 and 3×3, and the horizontal and vertical strides are 4, 2 and 2 respectively. The obtained coding features are then stretched from the spatial-domain shape of the image features into sequence form, and layer normalization is applied;
b) Transformer encoding for self-attention mechanisms
Layer normalization yields the intermediate processing features. These intermediate features are then processed three times in parallel, once each along the query, keyword and numerical-value dimensions. Each processing consists of a separable convolution (kernel scale 3×3, input feature dimension 64, horizontal and vertical strides 1), batch normalization, and a convolution unit (kernel scale 1×1, horizontal and vertical strides 1, output feature dimension equal to the number of attention heads multiplied by the input feature dimension); the resulting coding features are stretched from the spatial-domain shape of the image features into sequence form, giving the query Q coding vector, the keyword K coding vector and the numerical value V coding vector for attention learning;
from the query Q, keyword K and value V encoding vectors obtained at each of the three encoding stages, an attention weight matrix is computed for each stage using the self-attention mechanism;
the stage-1 attention weight matrix is added to the stage-1 intermediate processing features to obtain the stage-1 backbone network encoding features, the stage-2 attention weight matrix is added to the stage-2 intermediate processing features to obtain the stage-2 backbone network encoding features, and the stage-3 attention weight matrix is added to the stage-3 intermediate processing features to obtain the stage-3 backbone network encoding features;
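The following sketch illustrates one such backbone stage of step b), under the reading that "adding the attention weight matrix to the intermediate processing features" means adding the attention output back onto those features residually; the head-merging linear layer at the end is an assumption, and the head count (2, 3 or 4) is passed in per stage.

```python
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    """One backbone stage of step b): Q/K/V from separable convolutions, then multi-head self-attention."""
    def __init__(self, dim=64, heads=2):
        super().__init__()
        self.dim, self.heads = dim, heads
        def projection():                                        # 3x3 depthwise conv -> BN -> 1x1 conv
            return nn.Sequential(
                nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # separable convolution, stride 1
                nn.BatchNorm2d(dim),
                nn.Conv2d(dim, heads * dim, 1),                  # output dim = attention heads x input dim
            )
        self.q_proj, self.k_proj, self.v_proj = projection(), projection(), projection()
        self.merge = nn.Linear(heads * dim, dim)                 # assumed head-merging step

    def forward(self, feat):                                     # feat: (b, dim, h, w) intermediate features
        b, c, h, w = feat.shape
        def to_seq(t):                                           # stretch to sequence: (b, heads, h*w, dim)
            return t.flatten(2).transpose(1, 2).reshape(b, h * w, self.heads, self.dim).transpose(1, 2)
        q, k, v = to_seq(self.q_proj(feat)), to_seq(self.k_proj(feat)), to_seq(self.v_proj(feat))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, h * w, self.heads * self.dim)
        ctx = self.merge(ctx).transpose(1, 2).reshape(b, c, h, w)
        return feat + ctx                                        # add back onto the intermediate features
```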
Then, 3 network branch codes are sequentially performed:
for the 1st network branch, the stage-1 backbone network encoding features are passed through 2 processing units in sequence: in the 1st unit, a convolution with 16 feature channels, kernel scale 7×7 and stride 1 in both the horizontal and vertical directions, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, kernel scale 3×3 and stride 2 in both directions, followed by feature activation and batch normalization. The resulting features are then concatenated with the stage-1 backbone network encoding features and passed through 2 further units: in the 1st unit, a convolution with 32 feature channels, kernel scale 7×7 and stride 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the stage-3 backbone network encoding features and passed through 3 units in sequence: in the 1st unit, a convolution with 64 feature channels, kernel scale 7×7 and stride 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 12 feature channels, kernel scale 1×1 and stride 1, followed by feature activation and batch normalization. The resulting 12-channel feature result is predicted in the form 2×6 to give tensor L;
For the 2nd network branch, the stage-1 backbone network encoding features are passed through 2 processing units in sequence: in the 1st unit, a convolution with 16 feature channels, kernel scale 7×7 and stride 1 in both directions, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the stage-2 backbone network encoding features and passed through 2 further units in sequence: in the 1st unit, a convolution with 32 feature channels, kernel scale 7×7 and stride 1, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization. The resulting features are then concatenated with the stage-3 backbone network encoding features and passed through 3 units in sequence: in the 1st unit, a convolution with 64 feature channels, kernel scale 7×7 and stride 2, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 4 feature channels, kernel scale 1×1 and stride 1, followed by feature activation and batch normalization. The resulting 4-channel feature result is taken as tensor O;
For the 3rd network branch, the stage-3 backbone network encoding features are passed through 3 processing units in sequence: in the 1st unit, a convolution with 128 feature channels, kernel scale 3×3 and stride 2 in both directions, followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, kernel scale 3×3 and stride 2, followed by feature activation and batch normalization; in the 3rd unit, a convolution with 3 feature channels, kernel scale 1×1 and stride 1, followed by feature activation and batch normalization. The resulting 3-channel features are taken as tensor D;
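A compact sketch of how such a branch tail can turn its 12-channel 1×1 output into the 2×6 pose tensor L; the input channel count and the spatial pooling before the read-out are illustrative assumptions, not stated in the text.

```python
import torch.nn as nn

def conv_unit(c_in, c_out, k, s):
    """Convolution -> feature activation -> batch normalization, the repeated unit of the branches."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2),
                         nn.ReLU(inplace=True), nn.BatchNorm2d(c_out))

class PoseBranchTail(nn.Module):
    """Tail of branch 1: a 12-channel 1x1 unit whose output is read out as the 2x6 pose tensor L."""
    def __init__(self, c_in=64):                  # c_in is illustrative (depends on the concatenation)
        super().__init__()
        self.tail = nn.Sequential(conv_unit(c_in, 64, 7, 2),
                                  conv_unit(64, 128, 3, 2),
                                  conv_unit(128, 12, 1, 1))

    def forward(self, feat):                      # feat: features concatenated with the stage-3 encoding
        x = self.tail(feat)                       # (batch, 12, h', w')
        x = x.mean(dim=(2, 3))                    # assumed spatial pooling before the read-out
        return x.view(-1, 2, 6)                   # tensor L: 2 relative poses x 6 degrees of freedom
```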
for the module G, the tensor J and the tensor C are used as inputs, the output is tensor B, the scale is α×o×p×4, α is the number of batches, and the module G is designed to perform cross-view embedded encoding first, then convolutional embedded encoding, and finally perform decoding processing, specifically:
a) Cross-view embedded coding
Firstly, respectively performing cross-view embedded coding processing on the first 3 characteristic components of the last dimension of the tensor J and the last 3 characteristic components of the last dimension of the tensor C: the convolution operation, the convolution kernel scale is 7 multiplied by 7, the number of characteristic channels is 32, the step sizes in the horizontal direction and the vertical direction are all 4, coding characteristics are transformed from the spatial domain shape of the image characteristics into a sequence structure, and the cross-view embedded code 1, the cross-view embedded code 2 and the cross-view embedded code 3 are obtained through layer normalization processing;
The attention code input feature initialization is then performed as follows: concatenating the cross-view embedded code 1 and the cross-view embedded code 2 according to the last dimension to obtain an attention code input feature 1, concatenating the cross-view embedded code 1 and the cross-view embedded code 3 according to the last dimension to obtain an attention code input feature 2, concatenating the cross-view embedded code 2 and the cross-view embedded code 1 according to the last dimension to obtain an attention code input feature 3, concatenating the cross-view embedded code 3 and the cross-view embedded code 1 according to the last dimension to obtain an attention code input feature 4, and respectively performing attention code processing on the 4 attention code input features:
each of the 4 attention code input features is processed in the same way, for i = 1, 2, 3, 4: along the last channel, the first half of the channels of attention code input feature i is taken as target code feature i, and a separable convolution is applied with kernel scale 3×3, 32 feature channels and stride 1 in both the horizontal and vertical directions; the resulting encoding features are used as the keyword K encoding vector and the value V encoding vector for attention learning. Along the last channel, the second half of the channels of attention code input feature i is taken as source code feature i, and the same separable convolution (kernel scale 3×3, 32 feature channels, stride 1 in both directions) is applied; the resulting encoding features are used as the query Q encoding vector for attention learning. Attention weight matrix i is then computed from the query Q, keyword K and value V encoding vectors using the self-attention mechanism;
adding attention code input feature 1 and attention weight matrix 1 to obtain cross-view embedded coding feature 1, adding attention code input feature 2 and attention weight matrix 2 to obtain cross-view embedded coding feature 2, adding attention code input feature 3 and attention weight matrix 3 to obtain cross-view embedded coding feature 3, adding attention code input feature 4 and attention weight matrix 4 to obtain cross-view embedded coding feature 4, taking average features of cross-view embedded coding feature 1 and cross-view embedded coding feature 2 as cross-view cross-layer feature 1, and carrying out next convolution embedded coding processing on the cross-view cross-layer feature 1, cross-view embedded coding feature 3 and cross-view embedded coding feature 4;
b) Convolutional embedded coding
And respectively and sequentially carrying out 2 unit processes by using the cross-view cross-layer feature 1, the cross-view embedded coding feature 3 and the cross-view embedded coding feature 4: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 3×3, the step length in the horizontal direction and the step length in the vertical direction are 2, then the serialization processing is carried out, then the layer normalization processing is carried out, in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scale is 3×3, the step length in the horizontal direction and the step length in the vertical direction are 2, then the serialization processing is carried out, then the layer normalization processing is carried out, 3 embedded coding characteristics are obtained, the characteristics obtained after the 1 st unit processing of the cross-view cross-layer characteristics 1 are used as cross-view cross-layer characteristics 2, the characteristics obtained after the 2 nd unit processing of the cross-view cross-layer characteristics 2 are used as cross-view cross-layer characteristics 3, and the 3 embedded coding characteristics are connected in series according to the last dimension to obtain a convolution embedded coding result;
c) Decoding process
Deconvolution unit processing is applied to cross-view cross-layer feature 1: the number of deconvolution feature channels is 16, the deconvolution kernel scale is 3×3, and the stride is 2 in both the horizontal and vertical directions, followed by feature activation and batch normalization; one convolution operation is then applied to the result, with 32 convolution feature channels and a kernel scale of 3×3, followed by feature activation and batch normalization, and the resulting features are recorded as decoder cross-layer feature 1. Decoder cross-layer feature 1 is concatenated with cross-view cross-layer feature 2, and one convolution operation is applied to the concatenated result, with 128 convolution feature channels, a kernel scale of 3×3 and a stride of 2 in both directions, followed by feature activation and batch normalization; the result is concatenated with cross-view cross-layer feature 3, and deconvolution unit processing is applied to the concatenated result: 128 deconvolution feature channels, a deconvolution kernel scale of 3×3 and a stride of 2 in both directions, followed by feature activation and batch normalization. The result is concatenated with decoder cross-layer feature 1, and the concatenated result is processed by one convolution unit: 128 convolution feature channels, a kernel scale of 3×3 and a stride of 1 in both directions, followed by feature activation and batch normalization; the resulting features are taken as the 4th-scale result of tensor B. The 4th-scale features are then concatenated with cross-view cross-layer feature 1, and deconvolution unit processing is applied to the concatenated result: 64 deconvolution feature channels, a deconvolution kernel scale of 3×3 and a stride of 2 in both directions, followed by feature activation and batch normalization; the resulting features are taken as the 3rd-scale result of tensor B. The 3rd-scale features are concatenated with cross-view cross-layer feature 2, and deconvolution unit processing is applied to the concatenated result: 32 feature channels, a deconvolution kernel scale of 3×3 and a stride of 2 in both directions, followed by feature activation and batch normalization; the resulting features are taken as the 2nd-scale result of tensor B. The 2nd-scale features are concatenated with cross-view cross-layer feature 1, and the concatenated result is processed by a convolution unit: 16 feature channels, a kernel scale of 3×3 and a stride of 1 in both directions, followed by
feature activation and batch normalization; the resulting features are taken as the 1st-scale result of tensor B;
Obtaining the output of the module G by using the 4 th scale result, the 3 rd scale result, the 2 nd scale result and the 1 st scale result of the tensor B;
step 3: training of neural networks
Samples in the natural image data set, the ultrasound image data set and the CT image data set are divided into a training set and a test set at a ratio of 9:1; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding data set, uniformly scaled to a resolution of 416×128 and input into the corresponding network; iterative optimization is performed, and the network model parameters are continuously modified to minimize the loss of each batch;
in the training process, the calculation method of each loss comprises the following steps:
internal-parameter-supervised synthesis loss: in network model training on natural images, the tensor I output by the depth information coding network is used as the depth, and the tensor L output by the visual transducer parameter learning network and the internal parameter labels e_t (t = 1, 2, 3, 4) of the training data are used as the pose parameters and camera internal parameters respectively; according to the principles of computer vision, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is computed from image c and the two synthesized images as the sum of the pixel-by-pixel, per-color-channel intensity differences;
Unsupervised synthesis loss: in the network model training of ultrasonic or CT images, taking the output tensor I of a depth information coding network as depth, taking the output tensor L and tensor O of a visual transducer parameter learning network W module as pose parameters and camera internal parameters respectively, respectively constructing synthetic images at target viewpoints by using two adjacent images of the target images according to a computer visual algorithm, and calculating according to the sum of pixel-by-pixel and color-by-color channel intensity differences by using the synthetic images at the target viewpoints and the target images;
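Both synthesis losses above follow the usual view-synthesis recipe: back-project the target pixels with the predicted depth, re-project them into the adjacent view with the pose and intrinsics, sample that view, and compare intensities with the target image. A hedged PyTorch sketch (bilinear sampling via grid_sample is our choice; the patent does not specify the interpolation):

```python
import torch
import torch.nn.functional as F

def synthesis_loss(target, source, depth, pose, K):
    """Warp `source` to the target viewpoint with the predicted depth, relative pose (b,4,4) and
    intrinsics K (b,3,3), then compare the synthesis with the target pixel by pixel, per color channel."""
    b, _, h, w = target.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().view(1, 3, -1).to(target)  # homogeneous pixels
    cam = torch.linalg.inv(K) @ pix * depth.view(b, 1, -1)          # back-project to camera coordinates
    cam = torch.cat([cam, torch.ones(b, 1, h * w, device=cam.device)], dim=1)
    proj = K @ (pose @ cam)[:, :3]                                  # move into the source view and project
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,                 # normalized sampling grid for grid_sample
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    synthesized = F.grid_sample(source, grid, align_corners=True)   # image synthesized at the target viewpoint
    return (synthesized - target).abs().mean()                      # per-pixel, per-channel intensity difference
```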
internal parameter error loss: the loss is computed from the tensor O output by the visual transducer parameter learning network and the internal parameter labels e_t (t = 1, 2, 3, 4) of the training data, as the sum of the absolute values of the differences of the corresponding components;
spatial structure error loss: in network model training on ultrasound or CT images, the tensor I output by the depth information coding network is used as the depth, and the tensor L and tensor O output by module W of the visual transducer parameter learning network are used as the pose parameters and camera internal parameters respectively; according to the computer vision algorithm, the three-dimensional coordinates of the image at the target viewpoint are reconstructed from each of its two adjacent images, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the loss is computed as the cosine distance between the normal vector obtained by the fitting and the tensor D output by the visual transducer parameter learning network;
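A sketch of this loss under the assumption that the "spatial structure" fitted by RANSAC is a plane, so the fitted normal can be compared with tensor D by cosine distance; thresholds and iteration counts are illustrative.

```python
import torch
import torch.nn.functional as F

def plane_normal_ransac(points, iters=100, thresh=0.01):
    """Fit a plane to reconstructed 3-D points (N, 3) with a basic RANSAC loop; return its unit normal."""
    best_n, best_count = None, -1
    for _ in range(iters):
        p0, p1, p2 = points[torch.randperm(points.shape[0])[:3]]
        n = torch.linalg.cross(p1 - p0, p2 - p0)
        if torch.norm(n) < 1e-8:                     # degenerate (collinear) sample, try again
            continue
        n = n / torch.norm(n)
        count = (((points - p0) @ n).abs() < thresh).sum()
        if count > best_count:
            best_count, best_n = count, n
    return best_n

def spatial_structure_loss(points, d_pred):
    """Cosine distance between the RANSAC-fitted normal and the branch-3 prediction (tensor D)."""
    n = plane_normal_ransac(points)
    return 1.0 - F.cosine_similarity(n, d_pred, dim=0)
```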
Conversion synthesis loss: in the network model training of ultrasonic or CT images, taking an output tensor I of a depth information coding network as depth, taking an output tensor L and a tensor O of a visual transducer parameter learning network W module as pose parameters and camera internal parameters respectively, constructing two synthesized images at a target image viewpoint by using two adjacent images of the target image according to a computer visual algorithm, taking an output tensor B of a module G as displacement of spatial domain deformation of the synthesized image after each pixel position is obtained in the synthesis process for each image in the synthesized images, and calculating according to the sum of pixel-by-pixel and color-by-color channel intensity differences by utilizing the two synthesized images at the target image viewpoint and the images at the target viewpoint respectively;
the specific training steps are as follows:
(1) On the natural image data set, training 60000 times for a main network and a 1 st network branch of a depth information coding network and a visual transducer parameter learning network respectively
Taking out training data from a natural image data set each time, uniformly scaling to 416×128 resolution, inputting an image c into a depth information coding network, inputting an image c and an image tau into a visual transducer parameter learning network, and training a backbone network and a 1 st network branch of a W module of the depth information coding network and the visual transducer parameter learning network for 60000 times, wherein training loss of each batch is obtained by internal parameter supervision synthesis loss calculation;
(2) On the natural image dataset, training 50000 times on the 2 nd network branch of the visual transducer parameter learning network W module
Taking out training data from the natural image data set each time, uniformly scaling to 416×128 resolution, inputting an image c into a depth information coding network, inputting an image c and an image tau into a vision transducer parameter learning network, training a 2 nd network branch of a W module of the vision transducer parameter learning network, and calculating the training loss of each batch by the sum of unsupervised synthesis loss and internal parameter error loss;
(3) On an ultrasonic image data set, training a main network and network branches 1-3 of a depth information coding network and a visual transducer parameter learning network W module and a G module for 60000 times to obtain a network model parameter rho
Taking out ultrasonic training data from an ultrasonic image data set each time, uniformly scaling to 416 multiplied by 128 of resolution, inputting an image j into a depth information coding network, inputting the image j and the image pi into a visual transducer parameter learning network, training the depth information coding network, backbone network branches 1-3 of a visual transducer parameter learning network W module and a G module, wherein the training loss of each batch is calculated by the sum of conversion synthesis loss and space structure error loss;
(4) On the CT image data set, training a main network and network branches 1-3 of a depth information coding network and a visual transducer parameter learning network W module and a G module for 60000 times to obtain a model parameter rho'
CT image training data are taken from the CT image data set each time and uniformly scaled to the resolution p×o; the image m is input into the depth information coding network, and the image m and the image σ are input into the visual transducer parameter learning network; the depth information coding network, the backbone network and network branches 1–3 of module W of the visual transducer parameter learning network, and the module G are trained for 60000 iterations, with the network parameters continuously modified to minimize the loss of each batch; when computing the network optimization loss, the camera translational motion loss is added to the transformation synthesis loss and the spatial structure error loss, and the parameter model ρ' is obtained;
step 4: three-dimensional reconstruction of ultrasound or CT images
Using an ultrasound or CT sequence image from the sample, three-dimensional reconstruction is achieved by simultaneously performing the following 3 processes:
(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are computed as follows: the images are scaled to 416×128; for an ultrasound sequence, image j is input into the depth information coding network, and image j and image π are input into the visual transducer parameter learning network; for a CT sequence, image m is input into the depth information coding network, and image m and image σ are input into the visual transducer parameter learning network; prediction uses the model parameters ρ and ρ' respectively. The depth of each target frame is obtained from the depth information coding network, and the tensor L output by the 1st network branch and the tensor O output by the 2nd network branch of the visual transducer parameter learning network are taken as the camera pose parameters and the camera internal parameters respectively; from the depth information and the camera internal parameters of the target image, its three-dimensional coordinates in the camera coordinate system are computed according to the principles of computer vision;
(2) In the three-dimensional reconstruction process of the sequence image, a key frame sequence is established: taking the first frame of the sequence image as the first frame of the key frame sequence, taking the first frame of the sequence image as a current key frame, taking the frame after the current key frame as a target frame, and dynamically selecting new key frames in sequence according to the sequence of the target frames: firstly, initializing a pose parameter matrix of a target frame relative to a current key frame by using an identity matrix, multiplying the pose parameter matrix by a pose parameter of a target frame camera for any target frame, combining internal parameters and depth information of the target frame by using a multiplication result to synthesize an image at a target frame viewpoint, calculating an error lambda by using the sum of pixel-by-pixel color channel intensity differences between the synthesized image and the target frame, synthesizing an image at the target frame viewpoint by using the pose parameter and the internal parameters of the camera according to an adjacent frame of the target frame, calculating an error gamma by using the sum of pixel-by-pixel color channel intensity differences between the synthesized image and the target frame, and further calculating a synthesis error ratio Z by using a formula (1):
(Formula (1): the synthesis error ratio Z, computed from the error λ and the error γ.)
when Z is more than 1.2, taking the target frame as a new key frame, taking a pose parameter matrix of the target frame relative to the current key frame as a pose parameter of the new key frame, and simultaneously updating the target frame into the current key frame; finishing key frame sequence establishment by the iteration;
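A sketch of this key-frame selection loop follows. Since formula (1) is only given as an image, the ratio is assumed here to be Z = λ / γ; `synthesize` and `photo_error` stand for the view-synthesis and photometric-error steps described above and are illustrative names, not part of the patent text.

```python
import numpy as np

def build_keyframes(frames, rel_poses, synthesize, photo_error, z_thresh=1.2):
    """Key-frame selection sketch. rel_poses[t] is the 4x4 camera pose of frame t relative to frame t-1;
    synthesize(img, pose) warps an image to the target-frame viewpoint; photo_error is the sum of
    pixel-by-pixel color-channel intensity differences."""
    keyframes = [0]                            # the first frame starts the key-frame sequence
    T = np.eye(4)                              # pose matrix of the target frame w.r.t. the current key frame
    for t in range(1, len(frames)):
        T = T @ rel_poses[t]                   # accumulate the per-frame camera pose parameters
        err_kf = photo_error(synthesize(frames[keyframes[-1]], T), frames[t])       # error lambda
        err_adj = photo_error(synthesize(frames[t - 1], rel_poses[t]), frames[t])   # error gamma
        if err_kf / max(err_adj, 1e-12) > z_thresh:                                 # assumed Z = lambda / gamma
            keyframes.append(t)                # the target frame becomes the new current key frame
            T = np.eye(4)
    return keyframes
```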
(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. Any target image is scaled to a resolution of M×N, with M = 450 and N = 300 for ultrasound images and M = N = 512 for CT images; its three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and the depth information output by the network, and the three-dimensional world coordinates of each pixel of the target frame are then computed from the camera pose parameters output by the network, combining the pose parameters of each key frame in the key-frame sequence with the pose parameter matrix of the target frame relative to the current key frame.
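The per-pixel world coordinates follow from standard back-projection plus the accumulated camera-to-world pose, for example:

```python
import numpy as np

def pixels_to_world(depth, K, T_world_cam):
    """Back-project a depth map to camera coordinates with intrinsics K (3x3), then map the points into
    the world frame (viewpoint of the first frame) with the accumulated camera-to-world pose (4x4)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)          # homogeneous pixel coordinates
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                # 3-D coordinates in the camera frame
    cam = np.vstack([cam, np.ones((1, cam.shape[1]))])
    world = (T_world_cam @ cam)[:3]                                    # 3-D coordinates in the world frame
    return world.reshape(3, h, w)
```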
In this embodiment, the experimental hyper-parameters are as follows: the Adam optimizer is used, the network learning rate is 0.0002, and the momentum coefficient is 0.9.
In this embodiment, network training is performed on the constructed natural image, ultrasound image and CT image training sets, and testing uses 10 ultrasound image sequences and 1 CT image sequence from public data sets. Errors are computed with the transformation synthesis loss: for ultrasound or CT images, two synthesized images at the target image viewpoint are constructed from the two images adjacent to the target image, and the error is computed from the two synthesized images as the sum of the pixel-by-pixel, per-color-channel intensity differences.
Table 1 gives the errors computed during reconstruction of the ultrasound image sequences, and Table 2 the errors computed during reconstruction of the CT image sequence. In this embodiment, the ultrasound or CT images are segmented with DenseNet and then reconstructed in 3D. Fig. 1 shows the three-dimensional reconstruction result obtained for an ultrasound image using the present invention, and Fig. 2 the result obtained for a CT image; they show that the present invention obtains comparatively accurate reconstruction results.
TABLE 1
Sequence number    Error
1 0.10654667014503133
2 0.02526559898617755
3 0.053380661733795236
4 0.07186935243508444
5 0.055040699123203043
6 0.0569973246074301
7 0.031235526713007722
8 0.07208439064528675
9 0.08464272856695701
10 0.03252974517429145
TABLE 2
Sequence number    Error
1 0.05769914209578394
2 0.06644105676866426
3 0.06760795378867354
4 0.06723370896784081
5 0.12021887377061856
6 0.1024131896296913
7 0.12699357037032025
8 0.1531152112275075
9 0.10963905408322308
10 0.11539085665406078

Claims (1)

1. A self-attention transducer-based ultrasonic or CT medical image three-dimensional reconstruction method is characterized in that an ultrasonic or CT image sequence is input, the image resolution is MxN, M is more than or equal to 100 and less than or equal to 2000, N is more than or equal to 100 and less than or equal to 2000, and the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
Selecting a natural image website, which is required to have image sequences and the corresponding camera internal parameters, downloading a image sequences and the corresponding internal parameters of the sequences from the natural image website, wherein a is more than or equal to 1 and less than or equal to 20; for each image sequence, each adjacent 3 frames of images are marked as an image b, an image c and an image d, the image b and the image d are spliced according to color channels to obtain an image τ, and the image c and the image τ form a data element, wherein the image c is a natural target image and the sampling viewpoint of the image c is used as the target viewpoint; the internal parameters of the image b, the image c and the image d are all e_t (t = 1, 2, 3, 4), wherein e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates; discarding if the last remaining images in the same image sequence are less than 3 frames; constructing a natural image data set by utilizing all sequences, wherein there are f elements in the constructed natural image data set, and f is more than or equal to 3000 and less than or equal to 20000;
(b) Constructing ultrasound image datasets
Sampling g ultrasonic image sequences, wherein g is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames of images as an image i, an image j and an image k, splicing the image i and the image k according to color channels to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, the sampling viewpoint of the image j is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, and constructing an ultrasonic image data set by utilizing all the sequences, wherein F elements are contained in the constructed ultrasonic image data set, and F is more than or equal to 1000 and less than or equal to 20000;
(c) Constructing CT image datasets
Sampling h CT image sequences, wherein h is more than or equal to 1 and less than or equal to 20, for each sequence, marking every 3 adjacent frames as an image l, an image m and an image n, splicing the image l and the image n according to a color channel to obtain an image sigma, forming a data element by the image m and the image sigma, wherein the image m is a CT target image, a sampling viewpoint of the image m is used as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, constructing a CT image data set by utilizing all the sequences, wherein xi elements are in the constructed CT image data set, and the xi is more than or equal to 1000 and less than or equal to 20000;
Step 2: construction of neural networks
The resolution of the image or the image input by the neural network is p multiplied by o, p is the width, o is the height, and the pixel is 100-2000, and 100-2000;
(1) Structure of depth information coding network
Tensor H is used as input, the scale is alpha x o x p x 3, tensor I is used as output, the scale is alpha x o x p x 1, and alpha is the batch number;
the depth information coding network consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after coding and decoding processing in sequence;
the encoder consists of 5 units, wherein the first unit is a convolution unit, the 2 nd to 5 th units are all composed of residual error modules, in the first unit, 64 convolution kernels are formed, the shapes of the convolution kernels are 7 multiplied by 7, the step sizes of the convolution in the horizontal direction and the vertical direction are 2, the maximum pooling treatment is carried out once after the convolution, the 2 nd to 5 th units respectively comprise 3,4,6,3 residual error modules, each residual error module carries out 3 times of convolution, the shapes of the convolution kernels are 3 multiplied by 3, and the numbers of the convolution kernels are 64, 128, 256 and 512;
the decoder consists of 6 decoding units, each decoding unit comprises deconvolution and convolution processing, the deconvolution and convolution processing have the same shape and number of convolution kernels, the shape of the convolution kernels in the 1 st to 6 th decoding units is 3 multiplied by 3, the number of the convolution kernels is 512, 256, 128, 64, 32 and 16 respectively, the encoder and the network layer of the decoder are connected in a cross-layer manner, and the corresponding relationship of the cross-layer connection is as follows: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
(2) Structure of vision transducer parameter learning network
The visual transducer parameter learning network is composed of a module W and a module G, wherein for the module W, a tensor J and a tensor C are taken as input, the scales are alpha x O x p x 3 and alpha x O x p x 6 respectively, the output is a tensor L, a tensor O and a tensor D, and the tensor L is as follows: alpha x 2 x 6, tensor O scale is alpha x 4 x 1, tensor D scale is alpha x 3, alpha is batch number;
the module W is composed of a backbone network and 3 network branches, wherein the 3 network branches are used for predicting tensors L, O and D respectively;
the backbone network is encoded as follows: the tensor J and the tensor C are connected in series according to the last channel and then are input into a backbone network, 3 stages of codes are sequentially carried out, the number of attention heads is respectively 2, 3 and 4 when each stage of codes is carried out, and each stage of codes is specifically as follows:
a) Embedded coding
In embedded coding, firstly, carrying out convolution operation, wherein the convolution kernel scales are respectively 7 multiplied by 7, 3 multiplied by 3 and 3 multiplied by 3 when the coding is carried out in 3 stages, the step sizes in the horizontal direction and the vertical direction are respectively 4, 2 and 2, then, further stretching the obtained coding characteristics from the airspace shape of the image characteristics to a sequence form, and then carrying out layer normalization processing;
b) Transformer encoding for self-attention mechanisms
Performing layer normalization to obtain intermediate processing features, and performing separable convolution operation processing on the intermediate processing features according to query dimensions: the convolution kernel scale is 3 multiplied by 3, the input feature dimension is 64, the step sizes in the horizontal direction and the vertical direction are 1, then batch normalization is carried out, operation processing of a convolution unit is carried out, the convolution kernel scale is 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are 1, the output feature dimension is the number of attention heads multiplied by the input feature dimension, and the obtained coding feature is further stretched into a sequence form from the airspace shape of the image feature to be used as a query Q coding vector for attention learning;
and carrying out separable convolution operation processing on the intermediate processing characteristics according to the keyword dimension: the convolution kernel scale is 3 multiplied by 3, the input feature dimension is 64, the step sizes in the horizontal direction and the vertical direction are 1, then batch normalization is carried out, operation processing of a convolution unit is carried out, the convolution kernel scale is 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are 1, the output feature dimension is the number of attention heads multiplied by the input feature dimension, and the obtained coding feature is further stretched into a sequence form from the airspace shape of the image feature to be used as a keyword K coding vector for attention learning;
And carrying out separable convolution operation processing on the intermediate processing characteristics according to the numerical dimension: the convolution kernel scale is 3 multiplied by 3, the input feature dimension is 64, the step sizes in the horizontal direction and the vertical direction are 1, then batch normalization is carried out, operation processing of a convolution unit is carried out, the convolution kernel scale is 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are 1, the output feature dimension is the number of attention heads multiplied by the input feature dimension, and the obtained coding feature is further stretched into a sequence form from the airspace shape of the image feature to be used as a numerical value V coding vector for attention learning;
according to the query Q code vector, the keyword K code vector and the numerical value V code vector which are obtained by the three stage codes respectively, the attention weight matrix is calculated by a self-attention mechanics learning method respectively;
the 1 st stage attention weight matrix and the 1 st stage intermediate processing feature are added to obtain a 1 st stage backbone network coding feature, the 2 nd stage attention weight matrix and the 2 nd stage intermediate processing feature are added to obtain a 2 nd stage backbone network coding feature, and the 3 rd stage attention weight matrix and the 3 rd stage intermediate processing feature are added to obtain a 3 rd stage backbone network coding feature;
Then, 3 network branch codes are sequentially performed:
for the 1 st network branch, the 1 st stage backbone network coding feature is sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained characteristics are connected with the 1 st stage backbone network coding characteristics in series, and 2 unit processing is carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained features are connected with the encoding features of the backbone network in the 3 rd stage in series, and 3 unit processes are sequentially carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 12, the convolution kernel scale is 1 multiplied by 1, the step length in the horizontal direction and the step length in the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained characteristic results of the 12 channels are predicted in a 2 multiplied by 6 form to obtain the result of tensor L;
For the 2 nd network branch, the 1 st stage backbone network coding feature is sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; then the obtained characteristics are connected with the main network coding characteristics of the 2 nd stage in series, and then 2 unit treatments are sequentially carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; the obtained characteristics are connected with the encoding characteristics of the backbone network in the 3 rd stage in series, and then 3 unit processes are sequentially carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 4, the convolution kernel scale is 1 multiplied by 1, the step length in the horizontal direction and the step length in the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained characteristic result of the 4 channels is used as the result of tensor O;
For the 3 rd network branch, the 3 rd stage backbone network coding feature is sequentially subjected to 3 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scale is 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are both 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 3, the convolution kernel scale is 1 multiplied by 1, the step length in the horizontal direction and the step length in the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained characteristics are used as the result of tensor D;
for the module G, the tensor J and the tensor C are used as inputs, the output is tensor B, the scale is α×o×p×4, α is the number of batches, and the module G is designed to perform cross-view embedded encoding first, then convolutional embedded encoding, and finally perform decoding processing, specifically:
a) Cross-view embedded coding
Firstly, respectively performing cross-view embedded coding processing on the first 3 characteristic components of the last dimension of the tensor J and the last 3 characteristic components of the last dimension of the tensor C: convolution operation, namely, the convolution kernel scale is 7 multiplied by 7, the number of characteristic channels is 32, the step sizes in the horizontal direction and the vertical direction are 4, the obtained coding features are respectively transformed into a sequence structure from the spatial domain shape of the image features, and the cross-view embedded code 1, the cross-view embedded code 2 and the cross-view embedded code 3 are obtained through layer normalization processing;
The attention code input feature initialization is then performed as follows: concatenating the cross-view embedded code 1 and the cross-view embedded code 2 according to the last dimension to obtain an attention code input feature 1, concatenating the cross-view embedded code 1 and the cross-view embedded code 3 according to the last dimension to obtain an attention code input feature 2, concatenating the cross-view embedded code 2 and the cross-view embedded code 1 according to the last dimension to obtain an attention code input feature 3, concatenating the cross-view embedded code 3 and the cross-view embedded code 1 according to the last dimension to obtain an attention code input feature 4, and respectively performing attention code processing on the 4 attention code input features:
The attention code input feature 1 is split along its last (channel) dimension: the first half is taken as target code feature 1 and passed through a separable convolution with a 3×3 kernel, 32 feature channels and a stride of 1 in both the horizontal and vertical directions, and the resulting coding features are used, respectively, as the key (K) coding vector and the value (V) coding vector for attention learning; the second half is taken as source code feature 1 and passed through a separable convolution with the same configuration (3×3 kernel, 32 feature channels, stride 1 in both directions), and the resulting coding feature is used as the query (Q) coding vector for attention learning. Attention weight matrix 1 is then computed from the query (Q), key (K) and value (V) coding vectors with the self-attention mechanism;
The attention code input feature 2 is processed in the same way: along the last channel dimension, the first half of the channels is taken as target code feature 2 and the second half as source code feature 2; each half is passed through a separable convolution with a 3×3 kernel, 32 feature channels and a stride of 1 in both directions, the outputs of the target half serve as the key (K) and value (V) coding vectors and the output of the source half as the query (Q) coding vector, and attention weight matrix 2 is computed from them with the self-attention mechanism;
The attention code input feature 3 is processed likewise: the first half of the channels (target code feature 3) and the second half (source code feature 3) are each passed through a separable convolution with a 3×3 kernel, 32 feature channels and a stride of 1 in both directions, yielding the key (K) and value (V) coding vectors and the query (Q) coding vector, from which attention weight matrix 3 is computed with the self-attention mechanism;
The attention code input feature 4 is processed in the same way: the first half of the channels (target code feature 4) and the second half (source code feature 4) are each passed through a separable convolution with a 3×3 kernel, 32 feature channels and a stride of 1 in both directions, the resulting coding features serve as the key (K) and value (V) coding vectors and the query (Q) coding vector, and attention weight matrix 4 is computed from them with the self-attention mechanism;
Attention code input feature 1 is added to attention weight matrix 1 to obtain cross-view embedded coding feature 1; likewise, attention code input features 2, 3 and 4 are added to attention weight matrices 2, 3 and 4 to obtain cross-view embedded coding features 2, 3 and 4. The average of cross-view embedded coding features 1 and 2 is taken as cross-view cross-layer feature 1, and cross-view cross-layer feature 1, cross-view embedded coding feature 3 and cross-view embedded coding feature 4 are passed to the next convolutional embedded coding stage;
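As an illustrative sketch only (assuming PyTorch; the tensor shapes, the scaling of the dot product and the 1×1 projection back to the input channel width are assumptions, since the text above fixes only the 3×3 separable convolutions, the 32 Q/K/V channels, the self-attention computation and the residual addition), one attention coding unit of this kind could look as follows, with the averaging of two cross-view embedded coding features into cross-view cross-layer feature 1 as a usage example:

import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    # depthwise 3x3 convolution followed by a pointwise 1x1 convolution
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, stride, padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class CrossViewAttentionUnit(nn.Module):
    # one attention coding unit: first half of the channels -> K and V, second half -> Q,
    # scaled dot-product self-attention, residual addition back onto the input feature
    def __init__(self, in_ch, qkv_ch=32):
        super().__init__()
        half = in_ch // 2
        self.to_k = SeparableConv2d(half, qkv_ch)
        self.to_v = SeparableConv2d(half, qkv_ch)
        self.to_q = SeparableConv2d(half, qkv_ch)
        # projection back to the input width before the residual addition (an assumption;
        # the patent only states that the weight matrix is added to the input feature)
        self.proj = nn.Conv2d(qkv_ch, in_ch, kernel_size=1)

    def forward(self, feat):
        target, source = torch.chunk(feat, 2, dim=1)          # split along the channel axis
        k = self.to_k(target).flatten(2)                      # (B, C, HW)
        v = self.to_v(target).flatten(2).transpose(1, 2)      # (B, HW, C)
        q = self.to_q(source)
        b, c, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)                      # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)        # attention weights (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return feat + self.proj(out)                          # cross-view embedded coding feature

# e.g. features 1 and 2 -> cross-view cross-layer feature 1 by averaging (shapes hypothetical)
unit = CrossViewAttentionUnit(in_ch=64)
f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
cross_layer_1 = 0.5 * (unit(f1) + unit(f2))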
b) Convolutional embedded coding
The cross-view cross-layer feature 1, the cross-view embedded coding feature 3 and the cross-view embedded coding feature 4 are each passed, in turn, through 2 unit processes. In the 1st unit process, a convolution with 64 feature channels, a 3×3 kernel and a stride of 2 in both the horizontal and vertical directions is applied, followed by serialization and layer normalization; in the 2nd unit process, a convolution with 128 feature channels, a 3×3 kernel and a stride of 2 in both directions is applied, again followed by serialization and layer normalization, so that 3 embedded coding features are obtained. The feature obtained from cross-view cross-layer feature 1 after the 1st unit process is taken as cross-view cross-layer feature 2, and the feature obtained from cross-view cross-layer feature 2 after the 2nd unit process is taken as cross-view cross-layer feature 3. The 3 embedded coding features are concatenated along the last dimension to form the convolutional embedded coding result;
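A minimal sketch of one such embedded-coding unit (PyTorch assumed; serialization is read here as flattening the spatial positions into a token sequence, and the input channel width in the example is hypothetical):

import torch
import torch.nn as nn

class ConvEmbeddingUnit(nn.Module):
    # one embedded-coding unit: 3x3 convolution with stride 2, serialization into a
    # token sequence, then layer normalization
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        x = self.conv(x)                        # (B, out_ch, H/2, W/2)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # serialization: (B, H*W, out_ch)
        return self.norm(tokens), (h, w)

# two unit processes with 64 then 128 channels, as described above
unit1, unit2 = ConvEmbeddingUnit(32, 64), ConvEmbeddingUnit(64, 128)
feat = torch.randn(1, 32, 64, 64)
tokens1, (h1, w1) = unit1(feat)
feat1 = tokens1.transpose(1, 2).reshape(1, 64, h1, w1)  # back to a feature map before unit 2
tokens2, _ = unit2(feat1)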
c) Decoding process
A deconvolution unit process is applied to cross-view cross-layer feature 1: the deconvolution has 16 feature channels, a 3×3 kernel and a stride of 2 in both the horizontal and vertical directions, followed by feature activation and batch normalization. The result is passed through one convolution with 32 feature channels and a 3×3 kernel, followed by feature activation and batch normalization; the obtained feature is recorded as decoder cross-layer feature 1. Decoder cross-layer feature 1 is concatenated with cross-view cross-layer feature 2, and the concatenation is passed through one convolution with 128 feature channels, a 3×3 kernel and a stride of 2 in both directions, followed by feature activation and batch normalization. The result is concatenated with cross-view cross-layer feature 3, and a deconvolution unit process is applied to the concatenation: 128 deconvolution feature channels, a 3×3 kernel and a stride of 2 in both directions, followed by feature activation and batch normalization. The result is concatenated with decoder cross-layer feature 1, and the concatenation is passed through one convolution unit process: 128 convolution feature channels, a 3×3 kernel and a stride of 1 in both directions, followed by feature activation and batch normalization; the obtained feature is taken as the 4th scale result of tensor B. The 4th scale feature is then concatenated with cross-view cross-layer feature 1, and a deconvolution unit process is applied to the concatenation: 64 deconvolution feature channels, a 3×3 kernel and a stride of 2 in both directions, followed by feature activation and batch normalization; the obtained feature is taken as the 3rd scale result of tensor B. The 3rd scale feature is concatenated with cross-view cross-layer feature 2, and a deconvolution unit process is applied to the concatenation: 32 feature channels, a 3×3 kernel and a stride of 2 in both directions, followed by feature activation and batch normalization; the obtained feature is taken as the 2nd scale result of tensor B. The 2nd scale feature is concatenated with cross-view cross-layer feature 1, and the concatenation is passed through a convolution unit process: 16 feature channels, a 3×3 kernel and a stride of 1 in both directions, followed by feature activation and batch normalization; the obtained feature is taken as the 1st scale result of tensor B;
The output of module G is obtained from the 4th, 3rd, 2nd and 1st scale results of tensor B;
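The decoder above is assembled from two recurring blocks, a deconvolution unit (3×3 transposed convolution with stride 2, feature activation, batch normalization) and a convolution unit (3×3 convolution, feature activation, batch normalization), joined by channel-wise concatenation with the cross-layer features. A schematic sketch, with ReLU assumed for the unspecified activation and all shapes hypothetical:

import torch
import torch.nn as nn

def deconv_unit(in_ch, out_ch):
    # deconvolution unit: 3x3 transposed convolution with stride 2 (upsampling by 2),
    # feature activation (ReLU assumed) and batch normalization
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, output_padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

def conv_unit(in_ch, out_ch, stride=1):
    # convolution unit: 3x3 convolution, feature activation and batch normalization
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

# one decoder step as in the description: upsample, concatenate with a cross-layer feature
# along the channel axis, then a convolution unit (all channel counts are hypothetical)
up = deconv_unit(128, 64)(torch.randn(1, 128, 16, 16))        # (1, 64, 32, 32)
skip = torch.cat([up, torch.randn(1, 32, 32, 32)], dim=1)     # series (concatenation) step
scale_result = conv_unit(96, 64)(skip)                        # one scale result of tensor B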
step 3: training of neural networks
The samples in the natural image dataset, the ultrasound image dataset and the CT image dataset are each divided into a training set and a test set at a ratio of 9:1; the training-set data are used for training and the test-set data for testing. During training, training data are taken from the corresponding dataset, uniformly scaled to the resolution p×o and input into the corresponding network, and iterative optimization is performed, with the network model parameters continually adjusted so as to minimize the loss of each batch;
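A small sketch of the 9:1 split and of the uniform scaling to the resolution p×o (Python with OpenCV and NumPy assumed; the reading of p as height and o as width, and all helper names, are assumptions):

import random
import numpy as np
import cv2  # hypothetical choice of image library for loading and resizing

def split_dataset(samples, ratio=0.9, seed=0):
    # divide the samples of one dataset into a training set and a test set at 9:1
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * ratio)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def load_batch(paths, p, o):
    # uniformly scale a batch of images to the resolution p x o
    imgs = [cv2.resize(cv2.imread(path), (o, p)) for path in paths]
    return np.stack(imgs).astype(np.float32) / 255.0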
In the training process, each loss is calculated as follows (a schematic sketch of these loss forms is given after this list):
Internal parameter supervised synthesis loss: in the network model training on natural images, the tensor I output by the depth information coding network is taken as the depth, and the tensor L output by the vision Transformer parameter learning network and the internal parameter labels e_t (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera internal parameters, respectively. Following the computer vision principle algorithm, two images at the viewpoint of image c are synthesized from image b and image d, and the loss is calculated from image c and the two synthesized images as the sum of pixel-by-pixel, color-channel-by-channel intensity differences;
Unsupervised synthesis loss: in the network model training on ultrasound or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensors L and O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and the camera internal parameters, respectively. Following the computer vision algorithm, synthesized images at the target viewpoint are constructed from the two images adjacent to the target image, and the loss is calculated from these synthesized images and the target image as the sum of pixel-by-pixel, color-channel-by-channel intensity differences;
Internal parameter error loss: calculated as the sum of the absolute values of the component-wise differences between the tensor O output by the vision Transformer parameter learning network and the internal parameter labels e_t (t = 1, 2, 3, 4) of the training data;
Spatial structure error loss: in the network model training on ultrasound or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensors L and O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and the camera internal parameters, respectively. Following the computer vision algorithm, the three-dimensional coordinates of the image at the target viewpoint are reconstructed from each of its two adjacent images, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the loss is calculated as the cosine distance between the normal vector obtained from the fitting and the tensor D output by the vision Transformer parameter learning network;
Transform synthesis loss: in the network model training on ultrasound or CT images, the tensor I output by the depth information coding network is taken as the depth, and the tensors L and O output by the W module of the vision Transformer parameter learning network are taken as the pose parameters and the camera internal parameters, respectively. Following the computer vision algorithm, two synthesized images at the viewpoint of the target image are constructed from the two images adjacent to the target image; for each synthesized image, once the pixel positions have been obtained in the synthesis process, the tensor B output by module G is taken as the per-pixel displacement of a spatial-domain deformation applied to that synthesized image, and the loss is calculated from the two deformed synthesized images and the image at the target viewpoint as the sum of pixel-by-pixel, color-channel-by-channel intensity differences;
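All of the synthesis-style losses above share the same photometric core, a sum of pixel-by-pixel, color-channel-by-channel intensity differences, while the internal parameter error loss is an absolute-difference sum and the spatial structure error loss a cosine distance. A schematic sketch (PyTorch assumed; the reduction used for the cosine distance is an assumption):

import torch
import torch.nn.functional as F

def photometric_loss(synth, target):
    # sum of pixel-by-pixel, color-channel-by-channel intensity differences,
    # shared by the supervised, unsupervised and transform synthesis losses
    return (synth - target).abs().sum()

def intrinsic_error_loss(pred_intrinsics, label_intrinsics):
    # internal parameter error loss: sum of absolute component differences
    # between tensor O and the labels e_t (t = 1..4)
    return (pred_intrinsics - label_intrinsics).abs().sum()

def structure_error_loss(fitted_normal, predicted_d, eps=1e-8):
    # spatial structure error loss: cosine distance between the RANSAC-fitted
    # normal vector and the network output tensor D
    cos = F.cosine_similarity(fitted_normal, predicted_d, dim=-1, eps=eps)
    return (1.0 - cos).mean()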
the specific training steps are as follows:
(1) On the natural image dataset, the depth information coding network and the backbone network and 1st network branch of the W module of the vision Transformer parameter learning network are trained for 60000 iterations
Each time, training data are taken from the natural image dataset and uniformly scaled to the resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the vision Transformer parameter learning network. The depth information coding network and the backbone network and 1st network branch of the W module of the vision Transformer parameter learning network are trained for 60000 iterations, with the training loss of each batch given by the internal parameter supervised synthesis loss;
(2) On the natural image dataset, the 2nd network branch of the W module of the vision Transformer parameter learning network is trained for 50000 iterations
Each time, training data are taken from the natural image dataset and uniformly scaled to the resolution p×o; image c is input into the depth information coding network, and image c and image τ are input into the vision Transformer parameter learning network. The 2nd network branch of the W module of the vision Transformer parameter learning network is trained, with the training loss of each batch given by the sum of the unsupervised synthesis loss and the internal parameter error loss;
(3) On the ultrasound image dataset, the depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and module G are trained for 60000 iterations to obtain the network model parameters ρ
Each time, ultrasound training data are taken from the ultrasound image dataset and uniformly scaled to the resolution p×o; image j is input into the depth information coding network, and image j and image π are input into the vision Transformer parameter learning network. The depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and module G are trained, with the training loss of each batch given by the sum of the transform synthesis loss and the spatial structure error loss;
(4) On the CT image dataset, the depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and module G are trained for 60000 iterations to obtain the model parameters ρ'
Each time, CT image training data are taken from the CT image dataset and uniformly scaled to the resolution p×o; image m and image σ are input into the vision Transformer parameter learning network. The depth information coding network, the backbone network and network branches 1-3 of the W module of the vision Transformer parameter learning network, and module G are trained, with the network parameters continually adjusted so as to minimize the loss of each batch; when calculating the loss for network optimization, a camera translational motion loss is added to the transform synthesis loss and the spatial structure error loss. After 60000 training iterations the model parameters ρ' are obtained (a schematic sketch of this staged training follows);
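The four stages above differ only in the sub-networks being optimized, the number of iterations and the loss; a generic stage trainer could therefore be sketched as follows (PyTorch assumed; module and loss names are illustrative, not taken from the patent):

import torch
from itertools import cycle

def train_stage(modules, loss_fn, loader, steps, lr=1e-4):
    # generic stage trainer: optimize only the listed sub-networks for a fixed number
    # of iterations, minimizing the stage-specific batch loss
    params = [p for m in modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    batches = cycle(loader)
    for _ in range(steps):
        batch = next(batches)
        loss = loss_fn(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# stage (3), for example (names are hypothetical placeholders):
# train_stage([depth_net, w_backbone, w_branches_1_to_3, module_g],
#             lambda b: transform_synthesis_loss(b) + spatial_structure_loss(b),
#             ultrasound_loader, steps=60000)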
step 4: three-dimensional reconstruction of ultrasound or CT images
Three-dimensional reconstruction is realized by carrying out the following 3 processes simultaneously on a self-acquired ultrasound or CT image sequence:
(1) For any target image in the image sequence, its three-dimensional coordinates in the camera coordinate system are calculated as follows: the image is scaled to the resolution p×o; for an ultrasound sequence, image j is input into the depth information coding network and image j and image π are input into the vision Transformer parameter learning network, while for a CT sequence, image m is input into the depth information coding network and image m and image σ are input into the vision Transformer parameter learning network; prediction is performed with the model parameters ρ and ρ', respectively. The depth of each target frame is obtained from the depth information coding network, the tensor L output by the 1st network branch and the tensor O output by the 2nd network branch of the vision Transformer parameter learning network are taken as the camera pose parameters and the camera internal parameters, respectively, and the three-dimensional coordinates of the target image in the camera coordinate system are calculated from its depth information and the camera internal parameters according to the principles of computer vision (see the sketch after item (3));
(2) During the three-dimensional reconstruction of the image sequence, a key frame sequence is established: the first frame of the sequence is taken as the first frame of the key frame sequence and as the current key frame, the frames after the current key frame are treated as target frames, and new key frames are selected dynamically in target-frame order. First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix. For any target frame, this pose parameter matrix is multiplied by the camera pose parameters of the target frame, the product is combined with the internal parameters and the depth information of the target frame to synthesize an image at the target frame viewpoint, and the error λ is calculated as the sum of pixel-by-pixel color-channel intensity differences between the synthesized image and the target frame. An image at the target frame viewpoint is also synthesized from the frame adjacent to the target frame using the camera pose parameters and internal parameters, and the error γ is calculated as the sum of pixel-by-pixel color-channel intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then calculated with formula (1):
Z = λ / γ        (1)
When Z is greater than a threshold η, with 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameter of the new key frame, and the target frame is updated to be the current key frame; the key frame sequence is completed by this iteration (see also the sketch after item (3));
(3) The viewpoint of the first frame of the image sequence is taken as the origin of the world coordinate system. For any target image, the resolution is scaled to M×N, its three-dimensional coordinates in the camera coordinate system are calculated from the camera internal parameters and the depth information output by the network, and the three-dimensional coordinates of each pixel of the target frame in the world coordinate system are then obtained from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame.
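A sketch of the three sub-steps above, as referenced from items (1) and (2): back-projection of each pixel into the camera coordinate system from depth and internal parameters, the key-frame test on the synthesis error ratio Z, and the chaining of key-frame poses to reach world coordinates (NumPy assumed; the 4×4 homogeneous pose convention and the direction of the ratio follow formula (1) as reconstructed above and are assumptions):

import numpy as np

def backproject(depth, K):
    # per-pixel 3D coordinates in the camera coordinate system from a depth map
    # and the camera internal parameter matrix K (3x3)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                       # normalized camera rays
    return (rays * depth.reshape(1, -1)).T.reshape(h, w, 3)

def is_new_keyframe(lam, gamma, eta):
    # key frame test with the synthesis error ratio Z = lambda / gamma and 1 < eta < 2
    z = lam / max(gamma, 1e-8)
    return z > eta

def camera_to_world(points_cam, keyframe_poses, rel_pose):
    # chain the 4x4 pose matrices of the key frames (first frame viewpoint = world origin)
    # and the pose matrix of the target frame relative to the current key frame
    T = np.eye(4)
    for pose in keyframe_poses:
        T = T @ pose
    T = T @ rel_pose
    h, w, _ = points_cam.shape
    pts = np.concatenate([points_cam.reshape(-1, 3), np.ones((h * w, 1))], axis=1)
    return (pts @ T.T)[:, :3].reshape(h, w, 3)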
CN202110878837.6A 2021-08-02 2021-08-02 Ultrasonic or CT medical image three-dimensional reconstruction method based on self-attention transducer Active CN113689542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110878837.6A CN113689542B (en) 2021-08-02 2021-08-02 Ultrasonic or CT medical image three-dimensional reconstruction method based on self-attention transducer

Publications (2)

Publication Number Publication Date
CN113689542A CN113689542A (en) 2021-11-23
CN113689542B true CN113689542B (en) 2023-06-23

Family

ID=78578516

Country Status (1)

Country Link
CN (1) CN113689542B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740041B (en) * 2023-06-27 2024-04-26 新疆生产建设兵团医院 CTA scanning image analysis system and method based on machine vision

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007282945A (en) * 2006-04-19 2007-11-01 Toshiba Corp Image processor
AU2020103715A4 (en) * 2020-11-27 2021-02-11 Beijing University Of Posts And Telecommunications Method of monocular depth estimation based on joint self-attention mechanism
CN112767532A (en) * 2020-12-30 2021-05-07 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN113066028A (en) * 2021-03-31 2021-07-02 山东师范大学 Image defogging method based on Transformer deep neural network

Also Published As

Publication number Publication date
CN113689542A (en) 2021-11-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant