CN113689543B - Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method - Google Patents


Info

Publication number
CN113689543B
CN113689543B
Authority
CN
China
Prior art keywords
characteristic
image
stage
network
cross
Prior art date
Legal status
Active
Application number
CN202110881585.2A
Other languages
Chinese (zh)
Other versions
CN113689543A (en)
Inventor
全红艳 (Quan Hongyan)
董家顺 (Dong Jiashun)
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202110881585.2A
Publication of CN113689543A
Application granted
Publication of CN113689543B

Classifications

    • G06T 17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/08 — Neural network learning methods
    • G16H 30/20 — ICT specially adapted for handling medical images, e.g. DICOM, HL7 or PACS
    • G06T 2207/10081 — Image acquisition modality: computed x-ray tomography [CT]
    • G06T 2207/10136 — Image acquisition modality: 3D ultrasound image
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • Y02T 10/40 — Engine management systems


Abstract

The invention discloses an epipolar-constrained sparse attention mechanism method for three-dimensional reconstruction of medical images. A cross-view Transformer network structure is established, and unsupervised transfer learning is adopted that exploits the imaging rules of natural images. A cross-view Transformer based on a convolutional neural network structure is designed, and the epipolar constraint from multi-view geometry in computer vision is applied in the attention matrix calculation; under this constraint, the attention matrix learns accurate cross-view correspondences during feature learning, realizing three-dimensional geometric reconstruction of medical images.

Description

Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method
Technical Field
The invention belongs to the technical field of computers and relates to three-dimensional visualization of medical images for computer-aided diagnosis.
Background
In recent years, increasingly mature artificial intelligence technology for medicine has become an important means of supporting medical development, and intelligent medical image aided diagnosis plays a key role in modern clinical diagnosis. In particular, three-dimensional reconstruction of ultrasound or CT images can improve the efficiency of doctors in aided diagnosis and reduce the probability of misdiagnosis. However, medical images objectively have few textures and much noise, and the imaging parameters of ultrasound or CT images are especially difficult to recover, so research on three-dimensional reconstruction of ultrasound or CT images faces real difficulty. An effective network coding model for deep learning is therefore needed to solve the problem of geometric recovery in medical image reconstruction.
Disclosure of Invention
The invention aims to provide an epipolar-constrained sparse attention mechanism method for three-dimensional reconstruction of medical images. It adopts a cross-view visual Transformer as the base network and designs a geometry-constrained three-dimensional reconstruction method for ultrasound or CT medical images. By combining the visual constraints of epipolar geometry with a cross-view attention learning mechanism, a finer three-dimensional structure of the medical target can be obtained, giving the method high practical value.
The specific technical scheme for realizing the aim of the invention is as follows:
a three-dimensional reconstruction method of a polar line constrained sparse attention mechanism medical image is disclosed, wherein an ultrasonic or CT image sequence is input, the image resolution is MxN, M is more than or equal to 100 and less than or equal to 2000, N is more than or equal to 100 and less than or equal to 2000, and the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
Select a natural image website that provides image sequences with corresponding camera intrinsic parameters, and download a image sequences and their intrinsic parameters, where 1 ≤ a ≤ 20. For each image sequence, every 3 adjacent frames are denoted image b, image c and image d; image b and image d are spliced along the color channels to obtain image τ, and image c and image τ form one data element, where image c is the natural target image and its sampling viewpoint serves as the target viewpoint. The intrinsic parameters of images b, c and d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. All sequences are used to construct the natural image data set, which contains f elements, where 3000 ≤ f ≤ 20000;
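For concreteness, the following NumPy sketch illustrates the triplet assembly of step 1 (a); the function name build_elements is illustrative, and the non-overlapping grouping is an assumption consistent with discarding a final remainder of fewer than 3 frames. The same element layout applies to the ultrasound and CT sets in (b) and (c) below.

import numpy as np

def build_elements(frames):
    # frames: list of H x W x 3 images from one sequence
    elements = []
    usable = len(frames) - len(frames) % 3   # drop a trailing remainder < 3
    for i in range(0, usable, 3):
        b, c, d = frames[i], frames[i + 1], frames[i + 2]
        tau = np.concatenate([b, d], axis=-1)  # spliced H x W x 6 image
        elements.append((c, tau))              # (target image c, image tau)
    return elements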
(b) Constructing ultrasound image datasets
Sample g ultrasound image sequences, where 1 ≤ g ≤ 20. For each sequence, every 3 adjacent frames are denoted image i, image j and image k; image i and image k are spliced along the color channels to obtain image π, and image j and image π form one data element, where image j is the ultrasound target image and its sampling viewpoint serves as the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. All sequences are used to construct the ultrasound image data set, which contains F elements, where 1000 ≤ F ≤ 20000;
(c) Constructing CT image datasets
Sample h CT image sequences, where 1 ≤ h ≤ 20. For each sequence, every 3 adjacent frames are denoted image l, image m and image n; image l and image n are spliced along the color channels to obtain image σ, and image m and image σ form one data element, where image m is the CT target image and its sampling viewpoint serves as the target viewpoint. If fewer than 3 frames remain at the end of a sequence, they are discarded. All sequences are used to construct the CT image data set, which contains ξ elements, where 1000 ≤ ξ ≤ 20000;
Step 2: construction of neural networks
The resolution of the images input to the network is p × o, where p is the width and o is the height, in pixels, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000;
(1) Epipolar-constrained sparse attention learning network A
Network A serves as the basic structure of each branch of a two-branch twin network. Its structure consists of a backbone network and 5 network branches. The backbone network takes tensor J, tensor C and variable X as inputs, where the scales of tensor J and tensor C are α × o × p × 3 and α × o × p × 6 respectively and X is a Boolean variable. The 5 network branches predict tensor L, tensor O, tensor W, tensor B and tensor D respectively, with scales α × 2 × 6, α × 4 × 1, α × o × p × 1, α × o × p × 4 and α × 3, where α is the batch size;
the backbone network performs 3 stages of cross-view coding in sequence:
1) The cross-view coding of stage 1 includes the embedded coding of stage 1 and the attention coding of stage 1:
the embedded coding of the 1 st stage respectively carries out convolution operation on the first 3 characteristic components of the last dimension of the tensor J and the last 3 characteristic components of the last dimension of the tensor C, the convolution kernel scale is 7 multiplied by 7, the characteristic channel number is 32, the serialization processing transforms coding characteristics from the spatial domain shape of the image characteristics into a sequence structure, and the layer normalization processing respectively obtains the 1 st stage embedded coding 1, the 1 st stage embedded coding 2 and the 1 st stage embedded coding 3;
In the stage-1 attention coding, stage-1 embedded code 1 and stage-1 embedded code 2 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 1; stage-1 embedded code 1 and stage-1 embedded code 3 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 2; stage-1 embedded code 2 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 3; stage-1 embedded code 3 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 4. Each of the 4 stage-1 attention-coding input features is then processed by attention coding: along the last dimension, the first half of the channel features is taken as the target coding feature and the second half as the source coding feature; separable convolution operations are applied to the target and source coding features respectively, with kernel scale 3 × 3, 32 feature channels and horizontal and vertical strides of 1; the processing result of the target coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query key K coding vector and the value V coding vector for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query Q coding vector for attention learning;
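As an illustration of this Q/K/V construction, the following sketch (PyTorch is an assumption; the patent names no framework, and CrossViewQKV is a hypothetical name) splits a concatenated input feature into its target and source halves, applies a separable 3 × 3 convolution to each, and flattens the results into sequences.

import torch
import torch.nn as nn

class CrossViewQKV(nn.Module):
    # Split a concatenated (target|source) feature and produce Q, K, V.
    def __init__(self, ch=32, stride=1):
        super().__init__()
        # separable 3 x 3 convolution = depthwise + pointwise
        self.dw_t = nn.Conv2d(ch, ch, 3, stride, 1, groups=ch)
        self.pw_t = nn.Conv2d(ch, ch, 1)
        self.dw_s = nn.Conv2d(ch, ch, 3, stride, 1, groups=ch)
        self.pw_s = nn.Conv2d(ch, ch, 1)

    def forward(self, feat):                # feat: (B, 2*ch, H, W)
        ch = feat.shape[1] // 2
        target, source = feat[:, :ch], feat[:, ch:]
        t = self.pw_t(self.dw_t(target))    # target branch -> K and V
        s = self.pw_s(self.dw_s(source))    # source branch -> Q
        t = t.flatten(2).transpose(1, 2)    # (B, H*W, ch) serialized form
        s = s.flatten(2).transpose(1, 2)
        return s, t, t                      # Q, K, V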
When network A serves as branch 1 of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as branch 2, the input variable X is True and b) is executed. Executing a) or b) yields the cross-view coding feature of each stage-1 attention-coding input feature;
a) Calculate the attention weight matrix of each attention-coding input feature by the multi-head attention method, with 1 head and 32 feature channels; finally, the stage-1 attention output of each attention-coding input feature is added to its target coding feature to obtain the stage-1 cross-view coding feature of each attention-coding input feature;
b) First, a cross-view weighting matrix R is calculated:
the tensors L and O output by branch 1 of the twin network are used as the camera pose parameters and intrinsic parameters, the fundamental matrix U is calculated from them according to computer vision principles, and the cross-view epipolar line matrix Y is then calculated from U:
Y=xU (1)
where x is the spatial-domain position matrix of the source coding feature, of scale w × 3, with w being the length of the coding sequence after the source coding feature's processing result is serialized; the elements of x are the normalized coordinates, in the device coordinate system, of the pixel positions in the source coding feature's processing result; Y has scale w × 3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the source coding feature's processing result;
Calculating an error matrix E:
E=Yq (2)
where q is the transposed matrix of x, and the dimension of E is w×w;
according to the error matrix E, the maximum of all errors in E is calculated and half of this maximum is taken as the error threshold; each element of E greater than the error threshold is set to 1, and each element less than or equal to the error threshold is set to 0; the resulting binary matrix is taken as the cross-view weighting matrix R;
then the attention weight matrix of each attention-coding input feature is calculated by the multi-head attention method, with 1 head and 32 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R to obtain the stage-1 attention weight matrix of each attention-coding input feature, and the result is added to the target coding features to obtain the stage-1 cross-view coding features of the 4 attention-coding input features;
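The epipolar weighting of b) can be summarized in a few lines of NumPy, as sketched below under the text's stated rules: the position matrix of the serialized source coding feature is used for both factors, half of the maximum error is the threshold, and entries above the threshold become 1. The function name epipolar_mask is illustrative, and the fundamental matrix U is assumed to be computed beforehand from branch 1's tensors L and O.

import numpy as np

def epipolar_mask(x, U):
    # x: (w, 3) normalized homogeneous pixel positions of the serialized feature
    # U: (3, 3) fundamental matrix from the pose and intrinsics of branch 1
    Y = x @ U               # eq. (1): one epipolar line (3 coefficients) per position
    E = Y @ x.T             # eq. (2): (w, w) algebraic epipolar error matrix
    thresh = 0.5 * E.max()  # half of the maximum error, per the text
    return (E > thresh).astype(E.dtype)  # text's rule: > threshold -> 1, else 0

# Branch 2 multiplies the multi-head attention weights elementwise by this
# matrix R before adding the result to the target coding feature.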
the average of the 1st and 2nd of these 4 cross-view coding features is used as the stage-1 cross-view cross-layer feature; the stage-1 cross-view cross-layer feature, the 3rd stage-1 cross-view coding feature and the 4th stage-1 cross-view coding feature together form the stage-1 cross-view coding result; the stage-1 cross-view coding result serves as the input of the stage-2 cross-view coding, and is also concatenated along the last dimension to obtain the stage-1 concatenated coding result;
2) The cross-view coding of stage 2 includes the embedded coding of stage 2 and the attention coding of stage 2:
In the stage-2 embedded coding, each feature in the stage-1 cross-view coding result is embedded: a convolution operation with kernel scale 3 × 3, 64 feature channels and horizontal and vertical strides of 2 is applied, serialization transforms the coding features from the spatial-domain shape of image features into a sequence structure, and layer normalization yields stage-2 embedded code 1, stage-2 embedded code 2 and stage-2 embedded code 3;
In the stage-2 attention coding, stage-2 embedded code 1 and stage-2 embedded code 2 are concatenated along the last dimension to obtain stage-2 attention-coding input feature 1; stage-2 embedded code 1 and stage-2 embedded code 3 are concatenated along the last dimension to obtain stage-2 attention-coding input feature 2; stage-2 embedded code 2 and stage-2 embedded code 1 are concatenated along the last dimension to obtain stage-2 attention-coding input feature 3; stage-2 embedded code 3 and stage-2 embedded code 1 are concatenated along the last dimension to obtain stage-2 attention-coding input feature 4. For each attention-coding input feature, along the last dimension the first half of the channel features is taken as the target coding feature and the second half as the source coding feature; separable convolution operations are applied to the target and source coding features respectively, with kernel scale 3 × 3, 64 feature channels and horizontal and vertical strides of 2; the processing result of the target coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query key K coding vector and the value V coding vector for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query Q coding vector for attention learning;
When network A serves as branch 1 of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as branch 2, the input variable X is True and b) is executed. Executing a) or b) yields the cross-view coding feature of each stage-2 attention-coding input feature;
a) Calculate the attention weight matrix of each attention-coding input feature by the multi-head attention method, with 3 heads and 64 feature channels; finally, the stage-2 attention output of each attention-coding input feature is added to its target coding feature to obtain the stage-2 cross-view coding feature of each attention-coding input feature;
b) First, a cross-view weighting matrix R' is calculated:
the tensors L and O output by branch 1 of the twin network are used as the camera pose parameters and intrinsic parameters, the fundamental matrix U′ is calculated according to computer vision principles, and the cross-view epipolar line matrix Y′ is then calculated from U′:
Y′=x′U′ (3)
where x′ is the spatial-domain position matrix of the source coding feature, of scale w′ × 3, with w′ being the length of the coding sequence after the source coding feature's processing result is serialized; the elements of x′ are the normalized coordinates, in the device coordinate system, of the pixel positions in the source coding feature's processing result; Y′ has scale w′ × 3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the source coding feature's processing result;
Calculating an error matrix E':
E′=Y′q′ (4)
where q′ is the transpose of x′, and the scale of E′ is w′ × w′;
according to the error matrix E′, the maximum of all errors in E′ is calculated and half of this maximum is taken as the error threshold; each element of E′ greater than the error threshold is set to 1, and each element less than or equal to the error threshold is set to 0; the resulting binary matrix is taken as the cross-view weighting matrix R′;
then the attention weight matrix of each attention-coding input feature is calculated by the multi-head attention method, with 3 heads and 64 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R′ to obtain the stage-2 attention weight matrix of each attention-coding input feature, and the result is added to the target coding feature of each attention-coding input feature to obtain the stage-2 cross-view coding features of the 4 attention-coding input features;
the average of the 1st and 2nd of these 4 cross-view coding features is used as the stage-2 cross-view cross-layer feature; the stage-2 cross-view cross-layer feature, the 3rd stage-2 cross-view coding feature and the 4th stage-2 cross-view coding feature together form the stage-2 cross-view coding result; the stage-2 cross-view coding result serves as the input of the stage-3 cross-view coding, and is also concatenated along the last dimension to obtain the stage-2 concatenated coding result;
3) The cross-view coding of stage 3 includes the embedded coding of stage 3 and the attention coding of stage 3:
In the stage-3 embedded coding, each feature in the stage-2 cross-view coding result is embedded: a convolution operation with kernel scale 3 × 3, 128 feature channels and horizontal and vertical strides of 2 is applied, serialization transforms the coding features from the spatial-domain shape of image features into a sequence structure, and layer normalization yields stage-3 embedded code 1, stage-3 embedded code 2 and stage-3 embedded code 3;
In the stage-3 attention coding, stage-3 embedded code 1 and stage-3 embedded code 2 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 1; stage-3 embedded code 1 and stage-3 embedded code 3 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 2; stage-3 embedded code 2 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 3; stage-3 embedded code 3 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 4. For each attention-coding input feature, along the last dimension the first half of the channel features is taken as the target coding feature and the second half as the source coding feature; separable convolution operations are applied to the target and source coding features respectively, with kernel scale 3 × 3, 128 feature channels and horizontal and vertical strides of 2; the processing result of the target coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query key K coding vector and the value V coding vector for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of image features into serialized form and used as the query Q coding vector for attention learning;
When network A serves as branch 1 of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as branch 2, the input variable X is True and b) is executed. Executing a) or b) yields the cross-view coding feature of each stage-3 attention-coding input feature;
a) Calculate the attention weight matrix of each attention-coding input feature by the multi-head attention method, with 6 heads and 128 feature channels; finally, the stage-3 attention output of each attention-coding input feature is added to its target coding feature to obtain the stage-3 cross-view coding feature of each attention-coding input feature;
b) First, a cross-view weighting matrix R″ is calculated:
the tensors L and O output by branch 1 of the twin network are used as the camera pose parameters and intrinsic parameters, the fundamental matrix U″ is calculated according to computer vision principles, and the cross-view epipolar line matrix Y″ is then calculated from U″:
Y″=x″U″ (5)
where x″ is the spatial-domain position matrix of the source coding feature, of scale w″ × 3, with w″ being the length of the coding sequence after the source coding feature's processing result is serialized; the elements of x″ are the normalized coordinates, in the device coordinate system, of the pixel positions in the source coding feature's processing result; Y″ has scale w″ × 3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the source coding feature's processing result;
Calculating an error matrix E':
E″=Y″q″ (6)
where q "is the transposed matrix of x", and E "is the dimension w" x 3;
according to the error matrix E″, the maximum of all errors in E″ is calculated and half of this maximum is taken as the error threshold; each element of E″ greater than the error threshold is set to 1, and each element less than or equal to the error threshold is set to 0; the resulting binary matrix is taken as the cross-view weighting matrix R″;
then the attention weight matrix of each attention-coding input feature is calculated by the multi-head attention method, with 6 heads and 128 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R″ to obtain the stage-3 attention weight matrix of each attention-coding input feature, and the result is added to the target coding feature of each attention-coding input feature to obtain the stage-3 cross-view coding features of the 4 attention-coding input features;
the average of the 1st and 2nd of these 4 cross-view coding features is used as the stage-3 cross-view cross-layer feature; the stage-3 cross-view cross-layer feature, the 3rd stage-3 cross-view coding feature and the 4th stage-3 cross-view coding feature together form the stage-3 cross-view coding result; the stage-3 cross-view coding result is concatenated along the last dimension to obtain the stage-3 concatenated coding result;
For the 1st network branch, the stage-1 concatenated coding result is processed sequentially by 2 units: in the 1st unit, a convolution operation with 16 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; in the 2nd unit, a convolution operation with 32 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization. The resulting features are processed sequentially by 2 further units: in the 1st unit, a convolution operation with 32 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; in the 2nd unit, a convolution operation with 64 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization. The obtained features are then concatenated with the stage-3 concatenated coding result and processed by 3 units: in the 1st unit, a convolution operation with 64 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 2nd unit, a convolution operation with 128 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 3rd unit, a convolution operation with 12 feature channels, kernel scale 1 × 1 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization. The resulting 12-channel features are predicted in 2 × 6 form to obtain the result of tensor L;
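The final reshaping of branch 1 might be sketched as follows (PyTorch assumed, hypothetical names); the global average pooling that reduces the 12-channel map to one 2 × 6 vector per sample is an assumption made to match tensor L's declared scale of α × 2 × 6, not a step stated in the text.

import torch
import torch.nn as nn

class PoseHead(nn.Module):
    def __init__(self, in_ch=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 12, kernel_size=1)  # 3rd unit: 12 channels, 1 x 1

    def forward(self, feat):            # feat: (B, 128, h, w)
        p = self.conv(feat)             # (B, 12, h, w)
        p = p.mean(dim=(2, 3))          # spatial pooling (assumed)
        return p.view(-1, 2, 6)         # tensor L: (B, 2, 6)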
For the 2nd network branch, the stage-1 concatenated coding result is processed sequentially by 2 units: in the 1st unit, a convolution with 16 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; in the 2nd unit, a convolution with 32 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization. The obtained features are then concatenated with the stage-2 concatenated coding result and processed by 2 units: in the 1st unit, a convolution with 32 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; in the 2nd unit, a convolution with 64 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization. The obtained features are concatenated with the stage-3 concatenated coding result and processed by 3 units: in the 1st unit, a convolution with 64 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 2nd unit, a convolution with 128 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 3rd unit, a convolution with 4 feature channels, kernel scale 1 × 1 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization, and the obtained 4-channel features are used as the result of tensor O;
For the 3rd network branch, the stage-1 cross-view cross-layer feature is input and processed sequentially by 3 units: in the 1st unit, a convolution operation with 32 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; the 2nd and 3rd units are identical, each applying a convolution with 32 feature channels, kernel scale 3 × 3 and strides of 1 followed by feature activation and batch normalization. One deconvolution operation is then performed, with 16 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2, followed by batch normalization, giving the 1st cross-layer feature of the 3rd network branch;
The cross-layer features of the 3rd network branch are initialized as follows: the stage-1 cross-view cross-layer feature of the backbone network is taken as the 2nd cross-layer feature of the 3rd network branch; the stage-2 cross-view cross-layer feature is taken as the 3rd cross-layer feature; the stage-3 cross-view cross-layer feature is taken as the 4th cross-layer feature. The stage-1 cross-view cross-layer feature of the backbone network then undergoes one residual coding process comprising 3 convolution operations with 64, 64 and 256 channel features and kernel shapes of 1 × 1, 3 × 3 and 1 × 1 respectively, followed sequentially by 2 units: in the 1st unit, a convolution operation with 192 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 2nd unit, a convolution operation with 192 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization, and the result serves as the 5th cross-layer feature of the 3rd network branch. The 5th cross-layer feature undergoes one residual coding process, specifically 3 convolution operations with 512, 512 and 2048 channel features and kernel shapes of 1 × 1, 3 × 3 and 1 × 1 respectively; the obtained features are then processed sequentially by 6 units:
In the 1st unit, an upsampling convolution is performed with 512 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 5th cross-layer feature of the 3rd network branch, and the concatenated features undergo a convolution with 512 feature channels and kernel 3 × 3;

In the 2nd unit, an upsampling convolution is performed with 256 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 4th cross-layer feature of the 3rd network branch, and the concatenated features undergo a convolution with 256 feature channels and kernel 3 × 3;

In the 3rd unit, an upsampling convolution is performed with 128 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 3rd cross-layer feature of the 3rd network branch, and the concatenated features undergo a convolution with 128 feature channels and kernel 3 × 3; the obtained features are input to the 4th unit and, at the same time, predicted through a convolution with kernel 3 × 3 as the 4th-scale result of tensor W;

In the 4th unit, an upsampling convolution is performed with 64 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 2nd cross-layer feature of the 3rd network branch and the 2 × 2 upsampled features from the 3rd unit, and the concatenated features undergo a convolution with 64 feature channels and kernel 3 × 3; the obtained features are input to the 5th unit and, at the same time, predicted through a convolution with kernel 3 × 3 as the 3rd-scale result of tensor W;

In the 5th unit, an upsampling convolution is performed with 32 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 1st cross-layer feature of the 3rd network branch and the 2 × 2 upsampled features from the 4th unit; the concatenated features are input to the 6th unit and, at the same time, predicted through a convolution with kernel 3 × 3 as the 2nd-scale result of tensor W;

In the 6th unit, an upsampling convolution is performed with 16 feature channels, kernel 3 × 3 and upsampling scale 2 × 2; the obtained features are concatenated with the 2 × 2 upsampled features from the 5th unit, and the concatenated features are predicted through a convolution with kernel 3 × 3 as the 1st-scale result of tensor W;
The results of the 1st to 4th scales are used together as the result of tensor W;
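One decoder step of branch 3, comprising 2 × 2 upsampling, a 3 × 3 convolution, concatenation with the matching cross-layer feature, and a 3 × 3 single-channel prediction of one scale of tensor W, might be sketched as below (PyTorch assumed, hypothetical names; units 1 and 2 would simply omit the prediction head).

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpUnit(nn.Module):
    def __init__(self, in_ch, out_ch, skip_ch):
        super().__init__()
        self.up = nn.Conv2d(in_ch, out_ch, 3, padding=1)           # upsampling convolution
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.pred = nn.Conv2d(out_ch, 1, 3, padding=1)             # one scale of tensor W

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode='nearest')       # 2 x 2 upsampling
        x = self.up(x)
        x = self.fuse(torch.cat([x, skip], dim=1))                 # concat cross-layer feature
        return x, self.pred(x)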
For the 4th network branch, the stage-1 cross-view cross-layer feature first undergoes one deconvolution operation with 16 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2, followed by feature activation and batch normalization; the result is denoted decoder cross-layer feature 1. The stage-1 cross-view cross-layer feature is then processed by 2 units: in the 1st unit, a convolution operation with 32 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization, and the processed features are denoted decoder cross-layer feature 2; in the 2nd unit, a convolution operation with 32 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization. The obtained features are concatenated with the stage-2 cross-view cross-layer feature, and the concatenated result is processed sequentially by 2 units: in the 1st unit, a convolution with 64 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is applied, and the processed features are denoted decoder cross-layer feature 3; in the 2nd unit, a convolution with 128 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is applied. The obtained features are then concatenated with the stage-3 cross-view cross-layer feature and processed sequentially by 3 units: in the 1st unit, a convolution with 128 feature channels, kernel scale 7 × 7 and horizontal and vertical strides of 1 is applied, and the processed features are denoted decoder cross-layer feature 4; in the 2nd unit, a convolution with 256 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is applied, and the processed features are denoted decoder cross-layer feature 5; in the 3rd unit, a convolution with 512 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is applied, yielding the coding features of the 4th network branch;
The decoding process is as follows: one deconvolution operation is applied to the 4th-branch coding features (256 feature channels, kernel scale 3 × 3, horizontal and vertical strides of 2) with feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and one convolution operation is applied (512 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization. The result undergoes a deconvolution operation (256 feature channels, kernel scale 3 × 3, strides of 2) with feature activation and batch normalization, is concatenated with decoder cross-layer feature 4, and one convolution operation is applied (256 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization. The result undergoes another deconvolution operation (128 feature channels, kernel scale 3 × 3, strides of 2) with feature activation and batch normalization, is concatenated with decoder cross-layer feature 3, and one convolution operation is applied (128 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization; the obtained features are used as the 4th-scale result of tensor B. At the same time, the obtained features undergo one deconvolution operation (64 feature channels, kernel scale 3 × 3, strides of 2) with feature activation and batch normalization, are concatenated with decoder cross-layer feature 2, and one convolution operation is applied (64 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization; the obtained features are used as the 3rd-scale result of tensor B. At the same time, the obtained features undergo one deconvolution operation (32 feature channels, kernel scale 3 × 3, strides of 2) with feature activation and batch normalization, are concatenated with decoder cross-layer feature 1, and one convolution operation is applied (32 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization; the obtained features are used as the 2nd-scale result of tensor B. At the same time, the obtained features undergo one deconvolution operation (16 feature channels, kernel scale 7 × 7, strides of 2) with feature activation and batch normalization, are concatenated with the upsampled 3rd-scale features, and one convolution operation is applied (16 feature channels, kernel scale 3 × 3, strides of 1) with feature activation and batch normalization; the obtained features are used as the 1st-scale result of tensor B, and the 4th-scale result of tensor B is used to obtain the output of the 4th network branch;
For the 5th network branch, the stage-3 concatenated coding result is processed sequentially by 4 units: in the 1st unit, a convolution operation with 256 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 1 is followed by feature activation and batch normalization; in the 2nd unit, a convolution operation with 512 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is followed by feature activation and batch normalization; in the 3rd unit, a convolution operation with 1024 feature channels, kernel scale 3 × 3 and horizontal and vertical strides of 2 is applied; in the 4th unit, a convolution operation with 3 feature channels, kernel scale 1 × 1 and horizontal and vertical strides of 1 is applied, and the obtained features are used as the result of tensor D;
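Branch 5's last unit amounts to a 1 × 1 convolution down to 3 channels; the sketch below (PyTorch assumed) adds a spatial pooling step, which is an assumption made to match tensor D's declared scale of α × 3.

import torch
import torch.nn as nn

head_d = nn.Conv2d(1024, 3, kernel_size=1)  # 4th unit: 3 channels, 1 x 1 kernel

def predict_normal(feat):                   # feat: (B, 1024, h, w)
    return head_d(feat).mean(dim=(2, 3))    # tensor D: (B, 3); pooling assumed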
(2) Twin network S
The structure of the twin network S consists of twin branch 1 and twin branch 2, both of which use network A as their basic framework;

The twin network S takes tensor J and tensor C as inputs, with scales α × o × p × 3 and α × o × p × 6 respectively. The inputs are learned sequentially by twin branch 1 and twin branch 2, and the outputs obtained are tensor L, tensor O, tensor W, tensor B and tensor D, with scales α × 2 × 6, α × 4 × 1, α × o × p × 1, α × o × p × 4 and α × 3 respectively, where α is the batch size;
First, twin branch 1 learns from the input tensors J and C of the twin network S:

the Boolean variable X is set to False, and X, tensor J and tensor C are input into twin branch 1; after learning, the output of twin branch 1 is obtained;

then, twin branch 2 learns from the input tensors J and C of the twin network S:

the Boolean variable X is set to True, and X, tensor J and tensor C are input into twin branch 2 for learning. In twin branch 2's learning, the tensors L and O output by twin branch 1 are used as the pose parameters and camera intrinsic parameters in the calculation of the cross-view error matrix and cross-view weighting matrix; after twin branch 2's learning, the output of the twin network S is obtained;
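This two-pass scheme might be expressed as follows; NetworkA's calling interface is a placeholder (the keyword arguments are illustrative, not the patent's), with net_a being the shared-weight base network.

def twin_forward(net_a, J, C):
    # Pass 1: plain attention (X = False); outputs include pose L and intrinsics O
    L1, O1, W1, B1, D1 = net_a(J, C, X=False)
    # Pass 2: epipolar-weighted attention (X = True), reusing branch 1's L and O
    return net_a(J, C, X=True, pose=L1, intrinsics=O1)  # final L, O, W, B, D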
step 3: training of neural networks
The samples in the natural image data set, the ultrasound image data set and the CT image data set are each divided into a training set and a test set at a ratio of 9:1; data in the training sets are used for training and data in the test sets for testing. During training, training data are taken from the corresponding data set, uniformly scaled to resolution p × o and input into the corresponding network for iterative optimization, and the loss of each batch is minimized by continually updating the network model parameters;
In the training process, each loss is calculated as follows:
Intrinsics-supervised synthesis loss: in network model training on natural images, the tensor W output by the twin network S is taken as the depth, and the tensor L output by the twin network S and the intrinsic parameter labels e_t (t = 1, 2, 3, 4) of the training data serve as the pose parameters and camera intrinsic parameters respectively; according to computer vision principles, image b and image d are each used to synthesize an image at the viewpoint of image c, and the loss is calculated from image c and the two synthesized images as the sum of the pixel-by-pixel, channel-by-channel intensity differences;
Unsupervised synthesis loss: in network model training on ultrasound or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S serve as the pose parameters and camera intrinsic parameters respectively; according to a computer vision algorithm, the two images adjacent to the target image are each used to construct a synthesized image at the target viewpoint, and the loss is calculated from the target image and the two synthesized images as the sum of the pixel-by-pixel, channel-by-channel intensity differences;
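A compact sketch of this photometric loss is given below (PyTorch assumed). The helpers backproject and project are hypothetical pinhole-camera utilities (pixels to 3D points under the predicted depth, and 3D points to a sampling grid in the source view, respectively); the patent specifies only the synthesis-and-difference principle, not this decomposition.

import torch
import torch.nn.functional as F

def synthesis_loss(target, sources, depth, poses, K, K_inv, backproject, project):
    # target: (B, 3, H, W); sources: the two neighbouring views; poses: from tensor L
    loss = 0.0
    for src, pose in zip(sources, poses):
        cam_points = backproject(depth, K_inv)        # pixels -> 3D points
        grid = project(cam_points, K, pose)           # 3D points -> source pixels
        synth = F.grid_sample(src, grid, align_corners=False)
        loss = loss + (target - synth).abs().sum(dim=1).mean()  # per-pixel, per-channel
    return loss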
Intrinsic parameter error loss: calculated as the sum of the absolute values of the component-wise differences between the tensor O output by the twin network S and the intrinsic parameter labels e_t (t = 1, 2, 3, 4) of the training data;
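This is a plain L1 distance over the four intrinsic components; a one-function sketch (PyTorch assumed, with tensor O flattened to shape (B, 4)):

import torch

def intrinsic_loss(O_pred, e_label):
    # O_pred, e_label: (B, 4) = [fx, fy, cx, cy]
    return (O_pred - e_label).abs().sum(dim=1).mean()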
Spatial structure error loss: in network model training on ultrasound or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S serve as the pose parameters and camera intrinsic parameters respectively; according to a computer vision algorithm, the two images adjacent to the target-view image are used to reconstruct the three-dimensional coordinates of the image at the target viewpoint, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the loss is calculated as the cosine distance between the fitted normal vector and the tensor D output by the twin network S;
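The fitting step might look like the NumPy sketch below: a simple 3-point RANSAC loop estimates the dominant plane normal of the reconstructed points, and the loss compares it with the branch-5 prediction D by cosine distance. The iteration count, inlier tolerance and the plane model itself are assumptions; the text says only that a spatial structure is fitted with RANSAC.

import numpy as np

def plane_normal_ransac(points, iters=100, tol=0.01, seed=0):
    rng = np.random.default_rng(seed)
    best_n, best_inliers = None, -1
    for _ in range(iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        norm = np.linalg.norm(n)
        if norm < 1e-8:
            continue                                   # degenerate sample
        n = n / norm
        inliers = int((np.abs((points - p[0]) @ n) < tol).sum())
        if inliers > best_inliers:
            best_n, best_inliers = n, inliers
    return best_n

def structure_loss(points, D):
    n = plane_normal_ransac(points)
    cos = float(n @ D) / (np.linalg.norm(D) + 1e-8)
    return 1.0 - abs(cos)                              # cosine distance to tensor D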
Conversion synthesis loss: in network model training on ultrasound or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S serve as the pose parameters and camera intrinsic parameters respectively; according to a computer vision algorithm, the two images adjacent to the target image are used to construct two synthesized images at the target viewpoint; for each synthesized image, the tensor B output by the twin network S is taken as the per-pixel spatial-domain deformation displacement applied to the synthesized image after synthesis; the loss is then calculated from the two synthesized images at the target viewpoint and the target-view image as the sum of the pixel-by-pixel, channel-by-channel intensity differences;
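The deformation step might be sketched as below (PyTorch assumed): tensor B is read as per-pixel displacement fields, one 2-channel field per synthesized image (this packing of B's 4 channels into two fields is an assumption), applied by grid sampling before the photometric comparison.

import torch
import torch.nn.functional as F

def deform(synth, offsets):
    # synth: (N, 3, H, W) synthesized image; offsets: (N, H, W, 2) displacements
    # in normalized coordinates, taken from tensor B
    N, _, H, W = synth.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing='ij')
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(N, -1, -1, -1)
    return F.grid_sample(synth, base + offsets, align_corners=False)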
The specific training steps are as follows:
(1) On the natural image data set, the main network and the 1 st and 3 rd network branches of the network A are trained 50000 times by utilizing a twin network S
Training data are taken from the natural image dataset each time and uniformly scaled to resolution p×o; image c and image tau are input into the twin network S, the backbone network and the 1st and 3rd network branches of the network A are trained 50000 times, and the training loss of each batch is computed as the internal-parameter-supervised synthesis loss;
(2) On the natural image data set, the 2 nd network branch of the network A is trained 60000 times by utilizing the twin network S
Taking out training data from the natural image data set each time, uniformly scaling to resolution p multiplied by o, inputting an image c and an image tau into a twin network S, training a 2 nd network branch of the network A, and calculating the training loss of each batch by the sum of unsupervised synthesis loss and internal parameter error loss;
(3) On the ultrasonic image data set, utilizing a twin network S to train the 4th and 5th network branches of the network A 60000 times to obtain network model parameters
Taking out ultrasonic training data from an ultrasonic image data set each time, uniformly scaling to resolution p multiplied by o, inputting an image j and an image pi into a twin network S, training the 4 th and 5 th network branches of the network A, and calculating the training loss of each batch by the sum of conversion synthesis loss and space structure error loss;
(4) On the ultrasonic image data set, utilizing a twin network S to train the backbone network and the 1st to 5th network branches of the network A 30000 times to obtain the network model parameters rho
Taking out ultrasonic training data from an ultrasonic image data set each time, uniformly scaling to resolution p multiplied by o, inputting an image j and an image pi into a twin network S, training a main network of the network A and 1 st to 5 th network branches, and calculating the training loss of each batch by the sum of conversion synthesis loss and space structure error loss;
(5) Training the main network and the 1 st to 5 th network branches of the network A for 50000 times by utilizing a twin network S on the CT image data set to obtain a network model parameter rho'
CT image training data are taken from the CT image dataset each time and uniformly scaled to resolution p×o; image m and image sigma are input into the twin network S; the tensor W output by the twin network S is taken as the depth, the tensors L and O output by the network A are taken as the pose parameters and camera internal parameters respectively, and the tensor B output by the twin network S is taken as the spatial-domain deformation displacement of the synthesized images; two images at the viewpoint of image m are synthesized from image l and image n respectively, and the network is trained by continuously modifying its parameters so that the loss of each batch is minimized; in the loss calculation for network optimization, a camera translational-motion loss is added in addition to the conversion synthesis loss and the spatial structure error loss; after 50000 training iterations the network model parameters rho' are obtained;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Using an ultrasound or CT sequence image from the sample, three-dimensional reconstruction is achieved by simultaneously performing the following 3 processes:
(1) For any target image in the sequence, the three-dimensional coordinates in the camera coordinate system are calculated as follows: the images are scaled to resolution p×o; for an ultrasonic sequence, image j and image pi are input into twin branch 1 of the twin network S with the Boolean variable X set to False, and prediction uses the model parameters rho; for a CT sequence, image m and image sigma are input into twin branch 1 of the twin network S with the Boolean variable X set to False, and prediction uses the model parameters rho'; the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S are taken as the pose parameters and camera internal parameters; the three-dimensional coordinates of the target image in the camera coordinate system are then calculated from its depth information and the camera internal parameters according to computer-vision principles;
(2) A key frame sequence is established during the three-dimensional reconstruction of the sequence images: the first frame of the sequence is taken as the first key frame and as the current key frame; every frame after the current key frame is a target frame, and new key frames are selected dynamically in target-frame order. First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix; for any target frame, this matrix is multiplied by the camera pose parameters of the target frame, and the multiplication result, combined with the internal parameters and depth information of the target frame, is used to synthesize an image at the target frame's viewpoint; an error lambda is computed as the sum of pixel-by-pixel, per-color-channel intensity differences between the synthesized image and the target frame. An image at the target frame's viewpoint is also synthesized from the adjacent frame of the target frame using the pose parameters and camera internal parameters, and an error gamma is computed in the same way. The synthesis error ratio Z is then calculated by formula (7):
Z=λ/γ (7)
If Z is greater than a threshold eta (1 < eta < 2), the target frame is taken as a new key frame, its pose parameter matrix relative to the current key frame is taken as the pose parameter of the new key frame, and the target frame becomes the current key frame; iterating this process completes the establishment of the key frame sequence (see the sketch below);
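The key-frame selection loop can be sketched as below; this is a minimal sketch assuming hypothetical callbacks synth_from_keyframe and synth_from_neighbor that perform the two view syntheses described above:

```python
import numpy as np

def select_keyframes(frames, synth_from_keyframe, synth_from_neighbor, eta=1.5):
    """Dynamic key-frame selection. `frames` is a list of H x W x 3 arrays;
    the two callbacks return an image synthesized at frame f's viewpoint
    (from the current key-frame chain / from f's adjacent frame);
    eta is the threshold with 1 < eta < 2."""
    keyframes = [0]              # the first frame starts the key-frame sequence
    current = 0
    for f in range(1, len(frames)):
        lam = np.abs(synth_from_keyframe(current, f) - frames[f]).sum()
        gamma = np.abs(synth_from_neighbor(f) - frames[f]).sum()
        z = lam / max(gamma, 1e-12)      # synthesis error ratio Z, formula (7)
        if z > eta:                      # error grew too much:
            keyframes.append(f)          # promote the target frame to key frame
            current = f
    return keyframes
```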
(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. Any target image is scaled to resolution M×N; its three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and depth information output by the network, and the three-dimensional world coordinates of each pixel of the target frame are then computed from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame (see the sketch below).
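The per-pixel back-projection used in processes (1) and (3) amounts to standard pinhole-camera lifting; a minimal sketch, assuming intrinsics in a (fx, fy, cx, cy) layout as in tensor O and a 4×4 camera-to-world matrix composed from the chained key-frame poses:

```python
import numpy as np

def backproject_to_world(depth, intrinsics, cam_to_world):
    """Lift every pixel to camera coordinates using depth and intrinsics,
    then map into the world coordinate system (origin at the first frame's
    viewpoint). Returns an H x W x 3 array of world coordinates."""
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid
    x = (u - cx) / fx * depth                        # camera-frame X
    y = (v - cy) / fy * depth                        # camera-frame Y
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    world = pts.reshape(-1, 4) @ cam_to_world.T      # rigid transform
    return world[:, :3].reshape(h, w, 3)
```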
The invention has the beneficial effects that:
according to the invention, a Transformer network model with epipolar geometry constraints is adopted, and a cross-view attention learning network with epipolar geometry constraints is designed using the imaging constraints between cross views, fully exploiting the intelligent perception capability of deep learning in the three-dimensional reconstruction of medical images. The invention can effectively realize the reconstruction of three-dimensional spatial information from two-dimensional medical images, thereby obtaining the geometric structure of the target, and can provide an effective 3D reconstruction solution for artificial-intelligence medical auxiliary diagnosis.
Drawings
FIG. 1 is a three-dimensional reconstruction result graph of an ultrasound image of the present invention;
fig. 2 is a three-dimensional reconstruction result diagram of a CT image according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Examples
The embodiment is implemented on a PC under the 64-bit Windows 10 operating system, with an Intel i7-9700F CPU, 16 GB of memory, and an NVIDIA GeForce GTX 2070 GPU with 8 GB of video memory; the implementation is programmed in Python 3.7 using the TensorFlow 1.14 deep learning library.
A method for three-dimensional reconstruction of medical images with an epipolar-constrained sparse attention mechanism: an ultrasonic or CT image sequence with resolution M×N is input, where M is 450 and N is 300 for ultrasonic images, and M and N are both 512 for CT images; the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
A natural image website is selected that provides image sequences with corresponding camera internal parameters; 19 image sequences and their internal parameters are downloaded from the website. For each image sequence, every 3 adjacent frames are denoted image b, image c and image d; image b and image d are concatenated along the color channels to obtain an image tau, and image c and image tau form a data element, where image c is the natural target image and its sampling viewpoint is the target viewpoint. The internal parameters of images b, c and d are e_t (t=1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. All sequences together form the natural image dataset, which has 3600 elements (see the sketch below);
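The grouping of a sequence into data elements can be sketched as follows; whether the 3-frame windows overlap is not spelled out here, so this minimal sketch assumes non-overlapping windows, consistent with discarding a leftover of fewer than 3 frames:

```python
import numpy as np

def build_elements(sequence, intrinsic_labels):
    """Group a sequence into consecutive 3-frame windows (b, c, d),
    concatenate b and d along the color channel to form tau (H x W x 6),
    and pair each tau with the target image c and the labels e_t."""
    elements = []
    for i in range(0, len(sequence) - 2, 3):
        b, c, d = sequence[i], sequence[i + 1], sequence[i + 2]
        tau = np.concatenate([b, d], axis=-1)
        elements.append({"c": c, "tau": tau, "e": intrinsic_labels})
    return elements          # frames left over (< 3) are discarded
```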
(b) Constructing ultrasound image datasets
Sampling 10 ultrasonic image sequences, for each sequence, marking every 3 adjacent frames of images as an image i, an image j and an image k, splicing the image i and the image k according to color channels to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, the sampling viewpoint of the image j is taken as a target viewpoint, if the last remaining image in the same image sequence is less than 3 frames, discarding, and constructing an ultrasonic image data set by utilizing all the sequences, wherein the data set comprises 1600 elements;
(c) Constructing CT image datasets
1 CT image sequence is sampled; for the sequence, every 3 adjacent frames are denoted image l, image m and image n, and image l and image n are concatenated along the color channels to obtain an image sigma; image m and image sigma form a data element, where image m is the CT target image and its sampling viewpoint is the target viewpoint; if fewer than 3 frames remain at the end of the sequence, they are discarded; all such elements form the CT image dataset, which comprises 2000 elements.
Step 2: construction of the neural networks
The resolution of the images processed by the neural networks is 416×128, where 416 is the width and 128 is the height, in pixels;
(1) Epipolar-constrained sparse attention learning network A
The network A serves as the basic structure of each branch of the two-branch twin network; it consists of a backbone network and 5 network branches. The backbone network is a cross-view Transformer topology taking tensors J and C as inputs, with scales 4×128×416×3 and 4×128×416×6 respectively. The 5 network branches predict tensors L, O, W, B and D respectively: tensor L has scale 4×2×6, tensor O has scale 4×4×1, tensor W has scale 4×128×416×1, tensor B has scale 4×128×416×4, and tensor D has scale 4×3 (see the shape summary below);
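The input and output scales above can be made concrete with plain NumPy arrays (batch size 4, height 128, width 416); the variable names are assumptions for illustration only:

```python
import numpy as np

J = np.zeros((4, 128, 416, 3), np.float32)  # input: target image batch
C = np.zeros((4, 128, 416, 6), np.float32)  # input: two neighbors, channel-concatenated
L = np.zeros((4, 2, 6), np.float32)         # branch 1: pose parameters
O = np.zeros((4, 4, 1), np.float32)         # branch 2: camera internal parameters
W = np.zeros((4, 128, 416, 1), np.float32)  # branch 3: depth
B = np.zeros((4, 128, 416, 4), np.float32)  # branch 4: spatial deformation displacement
D = np.zeros((4, 3), np.float32)            # branch 5: structure normal vector
```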
the backbone network is designed for 3-stage cross-view coding, which proceeds in sequence:
1) The cross-view coding of stage 1 includes the embedded coding of stage 1 and the attention coding of stage 1:
the embedded coding of the 1st stage performs convolution operations on the tensor J, on the first 3 feature components of the last dimension of the tensor C, and on the last 3 feature components of the last dimension of the tensor C, with convolution kernel scale 7×7 and 32 feature channels; serialization transforms the coding features from the spatial-domain shape of the image features into a sequence structure, and layer normalization then yields the stage-1 embedded code 1, stage-1 embedded code 2 and stage-1 embedded code 3 respectively;
The attention coding of the 1st stage: stage-1 embedded code 1 and stage-1 embedded code 2 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 1; stage-1 embedded code 1 and stage-1 embedded code 3 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 2; stage-1 embedded code 2 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 3; stage-1 embedded code 3 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-coding input feature 4. The 4 stage-1 attention-coding input features are each subjected to attention coding: along the last dimension, the first half of the channels is taken as the target coding feature and the second half as the source coding feature; separable convolution operations are applied to the target and source coding features respectively, with convolution kernel scale 3×3, 32 feature channels, and horizontal and vertical strides of 1; the processing result of the target coding feature is stretched from the spatial-domain shape of the image feature into serialized form to serve as the query keyword K coding vector and the numerical value V coding vector for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape into serialized form to serve as the query Q coding vector for attention learning;
When the network A is used as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when the network A is used as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-1 attention-coding input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 1, the number of feature channels is 32, and finally, adding the attention weight matrix of each attention code input feature in the 1 st stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 1 st stage;
b) First, a cross-view weighting matrix R is calculated:
the tensors L and O output by the 1st branch of the twin network are used as the camera pose parameters and internal parameters, the fundamental matrix U is calculated according to computer-vision principles, and the cross-view epipolar line matrix Y is then calculated from U:
Y=xU (1)
where x is the spatial-domain position matrix of the source coding feature, with scale w×3, w being the length of the coding sequence obtained by serializing the processing result of the source coding feature; the elements of x are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y has scale w×3, and each row contains the coefficients of the epipolar-line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E:
E=Yq (2)
where q is the transposed matrix of x, and the dimension of E is w×w;
according to the error matrix E, the maximum of all errors in E is calculated and half of this maximum is taken as the error threshold; each element of E greater than the error threshold is set to 1 and each element less than or equal to the error threshold is set to 0, and the resulting binary matrix is taken as the cross-view weighting matrix R (a sketch of this computation follows the stage-1 description);
then, calculating an attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 1, the number of feature channels is 32, multiplying the obtained attention weight matrix by a cross-view weighting matrix R to obtain an attention weight matrix of each attention code input feature in the 1 st stage, and adding the attention weight matrix with the target code feature to obtain a cross-view code feature of each attention code input feature in the 1 st stage;
the average of the 1st and 2nd cross-view coding features is used as the stage-1 cross-view cross-layer feature; the stage-1 cross-view cross-layer feature together with the 3rd and 4th stage-1 cross-view coding features constitute the stage-1 cross-view coding result. The stage-1 cross-view coding result is used as the input of the stage-2 cross-view coding, and is also concatenated along the last dimension to obtain the stage-1 concatenated coding result;
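Formulas (1) and (2) and the binarization step can be sketched as follows; this minimal sketch assumes the fundamental matrix U (3×3) has already been computed from the branch-1 pose and intrinsics:

```python
import numpy as np

def cross_view_weighting(x, U):
    """Compute the cross-view weighting matrix R. `x` is the w x 3
    spatial-domain position matrix (normalized homogeneous pixel
    coordinates of the serialized source coding feature)."""
    Y = x @ U                # (1): epipolar-line coefficients, w x 3
    E = Y @ x.T              # (2): point-to-line error matrix, w x w
    threshold = 0.5 * E.max()                   # half the maximum error
    return (E > threshold).astype(np.float32)   # binarize, as in the text
```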
2) The cross-view coding of phase 2 includes embedded coding of phase 2 and attention coding of phase 2:
stage 2 embedded coding, namely performing embedded coding on each feature in the cross-view coding result of stage 1: the convolution kernel scale is 3 multiplied by 3, the number of characteristic channels is 64, the step sizes in the horizontal direction and the vertical direction are 2, coding characteristics are transformed from the spatial domain shape of the image characteristics into a sequence structure through serialization, and the layer normalization of the characteristics is carried out to obtain a 2 nd stage embedded code 1, a 2 nd stage embedded code 2 and a 2 nd stage embedded code 3;
the attention code of the 2 nd stage, the embedded code 1 of the 2 nd stage and the embedded code 2 of the 2 nd stage are connected in series according to the last dimension to obtain the input characteristic 1 of the attention code of the 2 nd stage; concatenating the 2 nd stage embedded code 1 and the 2 nd stage embedded code 3 according to the last dimension to obtain a 2 nd stage attention code input feature 2; concatenating the 2 nd stage embedded code 2 and the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input characteristic 3; concatenating the 2 nd stage embedded code 3 and the 2 nd stage embedded code 1 according to the last dimension to obtain a 2 nd stage attention code input characteristic 4; according to the last dimension, each attention code input feature takes the first half channel feature as a target code feature, the second half channel feature as a source code feature, the target code feature and the source code feature are subjected to separable convolution operation respectively, the convolution kernel scale is 3 multiplied by 3, the number of feature channels is 64, the step sizes in the horizontal direction and the vertical direction are 2, the processing result of the target code feature is stretched from the airspace shape of the image feature to a serialization form to serve as a query keyword K code vector and a numerical value V code vector for attention learning, and the processing result of the source code feature is stretched from the airspace shape of the image feature to a serialization form to serve as a query Q code vector for attention learning;
When the network A is used as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when the network A is used as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-2 attention-coding input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 3, the number of feature channels is 64, and finally, adding the attention weight matrix of each attention code input feature in the 2 nd stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 2 nd stage;
b) First, a cross-view weighting matrix R' is calculated:
the tensors L and O output by the 1st branch of the twin network are used as the camera pose parameters and internal parameters, the fundamental matrix U′ is calculated according to computer-vision principles, and the cross-view epipolar line matrix Y′ is then calculated from U′:
Y′=x′U′ (3)
where x′ is the spatial-domain position matrix of the source coding feature, with scale w′×3, w′ being the length of the coding sequence obtained by serializing the processing result of the source coding feature; the elements of x′ are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y′ has scale w′×3, and each row contains the coefficients of the epipolar-line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E':
E′=Y′q′ (4)
where q′ is the transposed matrix of x′, and the scale of E′ is w′×w′;
according to the error matrix E′, the maximum of all errors in E′ is calculated and half of this maximum is taken as the error threshold; each element of E′ greater than the error threshold is set to 1 and each element less than or equal to the error threshold is set to 0, giving the cross-view weighting matrix R′;
then, calculating an attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 3, the number of feature channels is 64, multiplying the obtained attention weight matrix by a cross-view weighting matrix R' to obtain an attention weight matrix of each attention code input feature in the 2 nd stage, and adding the attention weight matrix with a target code feature of each attention code input feature to obtain a cross-view code feature of each attention code input feature in the 2 nd stage;
the average of the 1st and 2nd cross-view coding features is used as the stage-2 cross-view cross-layer feature; the stage-2 cross-view cross-layer feature together with the 3rd and 4th stage-2 cross-view coding features constitute the stage-2 cross-view coding result. The stage-2 cross-view coding result is used as the input of the stage-3 cross-view coding, and is also concatenated along the last dimension to obtain the stage-2 concatenated coding result;
3) The 3 rd stage cross-view coding includes 3 rd stage embedded coding and 3 rd stage attention coding
Embedding and coding of the 3 rd stage, and carrying out embedding and coding processing on each feature in the 2 nd stage cross-view coding result: the convolution operation, the convolution kernel scale is 3 multiplied by 3, the number of characteristic channels is 128, the step sizes in the horizontal direction and the vertical direction are 2, the serialization processing transforms the coding characteristics from the spatial domain shape of the image characteristics into a sequence structure, and the layer normalization processing of the characteristics obtains a 3 rd stage embedded code 1, a 3 rd stage embedded code 2 and a 3 rd stage embedded code 3;
The attention coding of the 3rd stage: stage-3 embedded code 1 and stage-3 embedded code 2 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 1; stage-3 embedded code 1 and stage-3 embedded code 3 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 2; stage-3 embedded code 2 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 3; stage-3 embedded code 3 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-coding input feature 4. Along the last dimension, each attention-coding input feature takes the first half of the channels as the target coding feature and the second half as the source coding feature; separable convolution operations are applied to the target and source coding features respectively, with convolution kernel scale 3×3, 128 feature channels, and horizontal and vertical strides of 2; the processing result of the target coding feature is stretched from the spatial-domain shape of the image feature into serialized form to serve as the query keyword K coding vector and the numerical value V coding vector for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape into serialized form to serve as the query Q coding vector for attention learning;
When the network A is used as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when the network A is used as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-3 attention-coding input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 6, the number of feature channels is 128, and finally, adding the attention weight matrix of each attention code input feature in the 3 rd stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 3 rd stage;
b) First, a cross-view weighting matrix R″ is calculated:
the tensors L and O output by the 1st branch of the twin network are used as the camera pose parameters and internal parameters, the fundamental matrix U″ is calculated according to computer-vision principles, and the cross-view epipolar line matrix Y″ is then calculated from U″:
Y″=x″U″ (5)
where x″ is the spatial-domain position matrix of the source coding feature, with scale w″×3, w″ being the length of the coding sequence obtained by serializing the processing result of the source coding feature; the elements of x″ are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y″ has scale w″×3, and each row contains the coefficients of the epipolar-line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E':
E″=Y″q″ (6)
where q″ is the transposed matrix of x″, and the scale of E″ is w″×w″;
according to the error matrix E″, the maximum of all errors in E″ is calculated and half of this maximum is taken as the error threshold; each element of E″ greater than the error threshold is set to 1 and each element less than or equal to the error threshold is set to 0, giving the cross-view weighting matrix R″;
then, the attention weight matrix of each attention-coding input feature is calculated by the multi-head attention method with 6 heads and 128 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R″ to obtain the stage-3 attention weight matrix of each attention-coding input feature, which is added to the target coding feature of each attention-coding input feature to obtain the stage-3 cross-view coding feature of each attention-coding input feature (a sketch of this masked attention follows the stage-3 description);
the average of the 1st and 2nd cross-view coding features is used as the stage-3 cross-view cross-layer feature; the stage-3 cross-view cross-layer feature together with the 3rd and 4th stage-3 cross-view coding features constitute the stage-3 cross-view coding result, which is concatenated along the last dimension to obtain the stage-3 concatenated coding result;
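Putting the pieces together, the epipolar-masked attention used at each stage can be sketched single-headed as below (the per-stage head counts and channel widths are 1/32, 3/64 and 6/128); the residual addition with the target coding feature happens outside this function, and all names are illustrative:

```python
import numpy as np

def masked_attention(Q, K, V, R):
    """Scaled dot-product attention whose weight matrix is multiplied
    elementwise by the cross-view weighting matrix R before being
    applied to the value vectors V. Q, K, V are w x d; R is w x w."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return (weights * R) @ V                         # sparse epipolar weighting
```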
For the 1 st network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the resulting features were sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained features are concatenated with the 3 rd stage concatenated coding result, and 3 unit processing is performed: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 12, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; predicting the obtained characteristic results of the 12 channels according to a 2 multiplied by 6 form to obtain a tensor L result;
For the 2 nd network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of the convolved characteristic channels is 16, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of the convolved characteristic channels is 32, the convolution kernel scale is 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then the obtained characteristics are connected with the 2 nd stage serial connection coding result in series, and 2 units of processing are carried out: in the 1 st unit processing, the number of the convolved characteristic channels is 32, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in 2 unit processing, the number of characteristic channels of convolution is 64, the convolution kernel scale is 3×3, the step sizes in the horizontal direction and the vertical direction are 2, then characteristic activation and batch normalization processing are carried out, the obtained characteristics are connected with the 3 rd stage serial coding result in series, and 3 unit processing is carried out: in the 1 st unit processing, the number of the convolved characteristic channels is 64, the convolution kernel scale is 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of the convolved characteristic channels is 128, the convolution kernel scale is 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of the convolved characteristic channels is 4, the convolution kernel scale is 1 multiplied by 1, the step length in the horizontal direction and the step length in the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained 4-channel characteristics are used as the result of tensor O;
For the 3 rd network branch, inputting the 1 st stage cross-view cross-layer characteristics, and sequentially performing 3 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3×3, the step sizes of the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3×3, the step sizes of the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, in the 3 rd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3×3, the step sizes of the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, then 1 deconvolution operation is carried out, the number of characteristic channels of convolution is 16, the convolution kernel scales are 3×3, the step sizes of the horizontal direction and the vertical direction are 2, and the batch normalization processing is carried out, and the 1 st cross-layer characteristic of the 3 rd network branch is obtained;
initializing 3 rd network branch cross-layer characteristics: taking the cross-view cross-layer characteristic of the 1 st stage of the backbone network as the 2 nd cross-layer characteristic of the 3 rd network branch; taking the 2 nd stage cross-view cross-layer characteristic of the backbone network as the 3 rd cross-layer characteristic of the 3 rd network branch; taking the 3 rd stage cross-view cross-layer characteristic of the backbone network as the 4 th cross-layer characteristic of the 3 rd network branch; then the 1 st stage cross-view cross-layer characteristic of the backbone network is subjected to 1 st residual coding treatment, which comprises 3 convolution operations, wherein the channel characteristic numbers are 64, 64 and 256 respectively, the shape of the convolution kernel is 1×1, 3×3 and 1×1 respectively, and then 2 unit treatments are sequentially carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 192, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 192, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the characteristic activation and batch normalization processing are used as the 5 th cross-layer characteristic of the 3 rd network branch; carrying out 1-time residual coding treatment on the 5 th cross-layer feature, specifically carrying out 3-time convolution operations, wherein the number of channel features in the 3-time convolution operations is 512, 512 and 2048, the shapes of convolution kernels are 1×1, 3×3 and 1×1 respectively, and then sequentially carrying out 6 unit treatment processes on the obtained features:
During the processing of the 1 st unit, the up-sampling convolution processing is carried out, the number of characteristic channels is 512, the convolution kernel is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 5 th cross-layer characteristic of the 3 rd network branch in series, the convolution processing is further carried out on the series-connected characteristic, the number of the convolved characteristic channels is 512, and the convolution kernel is 3 multiplied by 3;
when the 2 nd unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 256, the convolution kernel is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 4 th cross-layer characteristic of the 3 rd network branch in series, the convolution processing is carried out on the obtained characteristic, the number of the convolved characteristic channels is 256, and the convolution kernel is 3 multiplied by 3;
when the 3 rd unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 128, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 3 rd cross-layer characteristic of the 3 rd network branch in series, the convolution processing is carried out on the series characteristic, the number of the convolved characteristic channels is 128, the convolution kernel shape is 3 multiplied by 3, the obtained characteristic is input to 4 units for processing, and meanwhile, the obtained characteristic is predicted to be the 4 th scale result of the tensor W through the convolution operation with the kernel of 3 multiplied by 3;
In the 4 th unit processing, the up-sampling convolution processing is carried out, the number of characteristic channels is 64, the convolution kernel shape is 3×3, the up-sampling scale is 2×2, the obtained characteristic is connected with the 2 nd cross-layer characteristic of the 3 rd network branch and the 2×2 up-sampling characteristic in the 3 rd unit processing in series, the convolution processing is carried out on the connected characteristic, the number of the convolution characteristic channels is 64, the convolution kernel shape is 3×3, the obtained characteristic is input to the 5 th unit processing, and meanwhile, the obtained characteristic is predicted to be the 3 rd scale result of tensor W through the convolution operation of the kernel of 3×3;
when the 5 th unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 32, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristics are respectively connected with the 1 st cross-layer characteristics of the 3 rd network branch and the 2 multiplied by 2 up-sampling characteristics when the 4 th unit is processed, the connected characteristics are input into the 6 th unit for processing, and meanwhile, the obtained characteristics are predicted to be the 2 nd scale result of tensor W through the convolution operation with the kernel of 3 multiplied by 3;
in the processing of the 6 th unit, the up-sampling convolution processing is carried out, the number of characteristic channels is 16, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 2 multiplied by 2 up-sampling characteristic in series in the processing of the 5 th unit, and then the series characteristic is predicted to be the 1 st scale result of the tensor W through the convolution operation with the kernel of 3 multiplied by 3;
Using the results of the 1 st to 4 th scales as the result of the tensor W;
for the 4 th network branch, performing one-time deconvolution operation, feature activation and batch normalization processing on the cross-layer features of the cross-view in the 1 st stage, wherein in the deconvolution operation, the number of the convolved feature channels is 16, the convolution kernel scales are 3 multiplied by 3, and the step sizes in the horizontal direction and the vertical direction are 2; the obtained result is marked as a decoder cross-layer characteristic 1, and the cross-view cross-layer characteristic of the 1 st stage is processed by the following 2 units: when the 1 st unit is processed, the number of convolution operation characteristic channels is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the processing characteristic is marked as a decoder cross-layer characteristic 2; processing the 2 nd unit, carrying out convolution operation, wherein the number of characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, carrying out characteristic activation and batch normalization processing, carrying out series connection on the obtained characteristic and the 2 nd stage cross-view cross-layer characteristic, and sequentially carrying out the processing of the following 2 units on the series connection result: when the 1 st unit is processed, the number of characteristic channels of convolution is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristics are marked as decoder cross-layer characteristics 3; when the 2 nd unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, then the obtained characteristic is connected with the 3 rd stage cross-view cross-layer characteristic in series, the following 3 unit processes are sequentially carried out, when the 1 st unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristic is marked as the decoder cross-layer characteristic 4; when the 2 nd unit is processed, the number of the characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the processing characteristics are marked as decoder cross-layer characteristics 5; when the 3 rd unit is processed, the number of the convolved characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, and the 4 th network branch coding characteristic is obtained after the processing;
The decoding process is as follows: the 4th network branch coding feature is subjected to 1 deconvolution operation (256 feature channels, 3×3 convolution kernels, horizontal and vertical strides of 2) with feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5 and convolved once (512 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization; the result is deconvolved (256 feature channels, 3×3 kernels, strides of 2) with feature activation and batch normalization, concatenated with decoder cross-layer feature 4, and convolved once (256 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization; the result is deconvolved once (128 feature channels, 3×3 kernels, strides of 2) with feature activation and batch normalization, concatenated with decoder cross-layer feature 3, and convolved once (128 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization, and the obtained feature is used as the 4th-scale result of tensor B; meanwhile, this feature is deconvolved once (64 feature channels, 3×3 kernels, strides of 2) with feature activation and batch normalization, concatenated with decoder cross-layer feature 2, and convolved once (64 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization, and the obtained feature is used as the 3rd-scale result of tensor B; meanwhile, this feature is deconvolved once (32 feature channels, 3×3 kernels, strides of 2) with feature activation and batch normalization, concatenated with decoder cross-layer feature 1, and convolved once (32 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization, and the obtained feature is used as the 2nd-scale result of tensor B; meanwhile, this feature is deconvolved once (16 feature channels, 7×7 kernels, strides of 2) with feature activation and batch normalization, concatenated with the up-sampled 3rd-scale feature, and convolved once (16 feature channels, 3×3 kernels, strides of 1) with feature activation and batch normalization, and the obtained feature is used as the 1st-scale result of tensor B; the 1st to 4th scale results of tensor B are used as the output of the 4th network branch;
For the 5 th network branch, the 3 rd stage concatenated coding result is sequentially processed by 4 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 512, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 1024, the convolution kernel scales are 3×3, and the step sizes in the horizontal direction and the vertical direction are 2; in the 4 th unit processing, the number of characteristic channels of convolution operation is 3, convolution kernel scales are 1×1, step sizes in the horizontal direction and the vertical direction are 1, and the obtained characteristics are used as a result of tensor D;
(2) Twin network S
The structure of the twin network S consists of a twin branch 1 and a twin branch 2, wherein the twin branch 1 and the twin branch 2 take a network A as a basic framework;
for the twin network S, the tensor J and the tensor C are taken as inputs, the scales of the tensor J and the tensor C are respectively 4×128×416×3 and 4×128×416×6, the input tensor of the twin network S is sequentially learned by adopting the twin branch 1 and the twin branch 2, and then the obtained outputs are tensor L, tensor O, tensor W, tensor B and tensor D, the tensor L scale is 4×2×6, the tensor O scale is 4×4×1, the tensor W scale is 4×128×416×1, the tensor B scale is 4×128×416×4, and the scale of tensor D is 4×3;
First, the input tensor J and tensor C of the twin network S are learned by the twin branch 1:
setting a Boolean type variable X as False, inputting the False type variable X, the tensor J and the tensor C into the twin branch 1, and obtaining the output of the twin branch 1 after learning;
then, the input tensor J and tensor C of the twin network S are learned by the twin branch 2:
setting a Boolean type variable X as True, inputting the True type variable X, the tensor J and the tensor C into a twin branch 2 for learning, adopting tensor L and tensor O output by the twin branch 1 as pose parameters and camera internal parameters respectively in the calculation of a cross-view error matrix and a cross-view weighting matrix in the learning process of the twin branch 2, and obtaining the output of a twin network S after the learning of the twin branch 2;
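The two-pass operation of the twin network S can be summarized as below; network_a is an assumed callable standing in for network A, returning the tuple (L, O, W, B, D):

```python
def twin_network_forward(network_a, J, C):
    """Branch 1 runs network A with X=False (plain multi-head attention)
    to obtain an initial pose L and intrinsics O; branch 2 re-runs
    network A with X=True, using the branch-1 L and O to build the
    cross-view error and weighting matrices."""
    L1, O1, _, _, _ = network_a(J, C, X=False, pose=None, intrinsics=None)
    return network_a(J, C, X=True, pose=L1, intrinsics=O1)
```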
step 3: training of neural networks
The samples in the natural image dataset, the ultrasonic image dataset and the CT image dataset are each divided into a training set and a testing set at a ratio of 9:1; the data in the training set are used for training and the data in the testing set for testing. During training, training data are taken from the corresponding dataset, uniformly scaled to a resolution of 416×128 and input into the corresponding network; iterative optimization is then performed, continuously modifying the network model parameters so as to minimize the loss of each batch;
In the training process, the calculation method of each loss comprises the following steps:
internal-parameter-supervised synthesis loss: in network model training on natural images, the tensor W output by the twin network S is taken as the depth, and the tensor L output by the twin network S together with the internal-parameter labels e_t (t=1, 2, 3, 4) of the training data are taken as the pose parameters and camera internal parameters respectively; two images at the viewpoint of image c are synthesized from image b and image d according to computer-vision principles, and the loss is computed as the sum of the pixel-by-pixel, per-color-channel intensity differences between image c and each of the two synthesized images;
unsupervised synthesis loss: in network model training on ultrasonic or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S are taken as the pose parameters and camera internal parameters respectively; a synthesized image at the target viewpoint is constructed from each of the two images adjacent to the target image according to computer-vision algorithms, and the loss is computed as the sum of the pixel-by-pixel, per-color-channel intensity differences between the target image and each of the two synthesized images;
internal parameter error loss: the loss is computed as the sum of the absolute values of the differences between corresponding components of the tensor O output by the twin network S and the internal-parameter labels e_t (t=1, 2, 3, 4) of the training data;
spatial structure error loss: in network model training on ultrasonic or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S are taken as the pose parameters and camera internal parameters respectively; the three-dimensional coordinates of the image at the target viewpoint are reconstructed from its two adjacent images according to computer-vision algorithms, a spatial structure is fitted to the reconstructed points with the RANSAC algorithm, and the loss is the cosine distance between the fitted normal vector and the tensor D output by the twin network S (see the sketch after this list);
conversion synthesis loss: in network model training on ultrasonic or CT images, the tensor W output by the twin network S is taken as the depth, and the tensors L and O output by the twin network S are taken as the pose parameters and camera internal parameters respectively; two synthesized images at the target-image viewpoint are constructed from the two images adjacent to the target image according to computer-vision algorithms, and for each synthesized image the tensor B output by the twin network S is applied as the spatial-domain deformation displacement of each pixel position obtained during synthesis; the loss is then computed as the sum of the pixel-by-pixel, per-color-channel intensity differences between the two synthesized images and the image at the target viewpoint;
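The spatial-structure fitting step referenced above can be sketched as follows; the RANSAC iteration count and inlier tolerance are free choices not fixed by the text:

```python
import numpy as np

def plane_normal_ransac(points, iters=200, tol=0.01, seed=0):
    """Fit a plane to reconstructed 3-D points (N x 3) by RANSAC and
    return its unit normal vector."""
    rng = np.random.default_rng(seed)
    best_n, best_count = None, -1
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-8:
            continue                      # degenerate sample, resample
        n = n / np.linalg.norm(n)
        count = (np.abs((points - p0) @ n) < tol).sum()   # inliers
        if count > best_count:
            best_n, best_count = n, count
    return best_n

def spatial_structure_loss(normal, D):
    """Cosine distance between the fitted normal and the predicted tensor D."""
    cos = normal @ D / (np.linalg.norm(normal) * np.linalg.norm(D))
    return 1.0 - cos
```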
The specific training steps are as follows:
(1) On the natural image data set, the main network and the 1 st and 3 rd network branches of the network A are trained 50000 times by utilizing a twin network S
Training data are taken from the natural image dataset each time and uniformly scaled to a resolution of 416×128; image c and image τ are input into the twin network S, the backbone network of network A and its 1st and 3rd network branches are trained 50000 times, and the training loss of each batch is computed from the internal-parameter-supervised synthesis loss;
(2) On the natural image data set, the 2 nd network branch of the network A is trained 60000 times by utilizing the twin network S
Training data are taken from the natural image dataset each time and uniformly scaled to 416×128; image c and image τ are input into the twin network S, the 2nd network branch of network A is trained, and the training loss of each batch is computed as the sum of the unsupervised synthesis loss and the internal parameter error loss;
(3) On the ultrasound image dataset, the 4th and 5th network branches of network A are trained 60000 times using the twin network S to obtain network model parameters
Ultrasound training data are taken from the ultrasound image dataset each time and uniformly scaled to a resolution of 416×128; image j and image π are input into the twin network S, the 4th and 5th network branches of network A are trained, and the training loss of each batch is computed as the sum of the conversion synthesis loss and the spatial structure error loss;
(4) On the ultrasound image dataset, the backbone network and the 1st to 5th network branches of network A are trained 30000 times using the twin network S to obtain the network model parameter ρ
Ultrasound training data are taken from the ultrasound image dataset each time and uniformly scaled to 416×128; image j and image π are input into the twin network S, the backbone network of network A and its 1st to 5th network branches are trained, and the training loss of each batch is computed as the sum of the conversion synthesis loss and the spatial structure error loss;
(5) On the CT image dataset, the backbone network and the 1st to 5th network branches of network A are trained 50000 times using the twin network S to obtain the network model parameter ρ'
CT image training data are taken from the CT image dataset each time and uniformly scaled to 416×128; image m and image σ are input into the twin network S, with the tensor W output by the twin network S taken as depth, the tensors L and O output by network A as the pose parameters and camera internal parameters respectively, and the tensor B output by the twin network S as the spatial-domain deformation displacement of the synthesized images; two images at the viewpoint of image m are synthesized from image l and image n, and the network parameters are continuously updated so as to minimize the loss of each batch. In the loss calculation for this optimization, a camera translational-motion loss is added to the conversion synthesis loss and the spatial structure error loss; after 50000 training iterations the network model parameters ρ' are obtained;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Given a sampled ultrasound or CT image sequence, three-dimensional reconstruction is achieved by performing the following 3 processes simultaneously:
(1) For any target image in the sequence, three-dimensional coordinates in the camera coordinate system are computed as follows: the images are scaled to 416×128; for an ultrasound sequence, image j and image π are input into twin branch 1 of the twin network S with the Boolean variable X set to False; for a CT sequence, image m and image σ are input into twin branch 1 of the twin network S with the Boolean variable X set to False; prediction uses the model parameters ρ and ρ', respectively. The output tensor W of the twin network S is taken as depth, and the tensors L and O output by the twin network S as the pose parameters and camera internal parameters; the three-dimensional coordinates of the target image in the camera coordinate system are then computed from its depth information and the camera internal parameters according to computer vision principles, as sketched below;
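As an illustration of this back-projection step, the following sketch assumes a pinhole camera with intrinsics (fx, fy, cx, cy) as in tensor O and a per-pixel depth map from tensor W; the function name is hypothetical.

```python
import numpy as np

def backproject_to_camera(depth, fx, fy, cx, cy):
    """Pinhole back-projection: per-pixel 3D coordinates (H x W x 3) in the
    camera coordinate system from a depth map (H x W)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, shape (h, w)
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)
```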
(2) During three-dimensional reconstruction of the sequence, a key-frame sequence is established: the first frame of the sequence is taken as the first key frame and as the current key frame, each frame after the current key frame is taken as a target frame, and new key frames are selected dynamically in target-frame order. First, the pose parameter matrix of the target frame relative to the current key frame is initialized with the identity matrix; for any target frame, this matrix is multiplied by the camera pose parameters of the target frame, and the product, combined with the internal parameters and the depth information of the target frame, is used to synthesize an image at the target-frame viewpoint; an error λ is computed as the sum of per-pixel, per-color-channel intensity differences between this synthesized image and the target frame. An image at the target-frame viewpoint is also synthesized from the adjacent frame of the target frame using the pose parameters and camera internal parameters, and an error γ is computed likewise; the synthesis error ratio Z is then computed by formula (4):
Z = λ / γ (4)
When Z > 1.2, the target frame is taken as a new key frame, its pose parameter matrix relative to the current key frame is taken as the pose parameters of the new key frame, and the current key frame is updated to this target frame; iterating in this way completes the establishment of the key-frame sequence, as in the sketch below;
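A hedged sketch of this key-frame rule, with lambda_err and gamma_err the two photometric errors defined above and the 1.2 threshold taken from the text:

```python
def is_new_keyframe(lambda_err, gamma_err, threshold=1.2):
    """Return True when the synthesis error ratio Z = lambda/gamma (formula (4))
    exceeds the threshold, i.e. synthesizing from the current key frame has
    become markedly worse than synthesizing from the adjacent frame."""
    z = lambda_err / gamma_err
    return z > threshold
```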
(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system; any target frame is scaled to resolution M×N, with M=450 and N=300 for an ultrasound image and M=N=512 for a CT image; its three-dimensional coordinates in the camera coordinate system are computed from the camera internal parameters and the depth information output by the network, and the three-dimensional coordinates in the world coordinate system of each pixel of the target frame are then computed from the camera pose parameters output by the network, combining the pose parameters of each key frame in the key-frame sequence with the pose parameter matrix of the target frame relative to the current key frame, as sketched below.
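The chaining of poses into world coordinates can be sketched as follows, assuming 4×4 homogeneous pose matrices (the patent does not fix a matrix convention; names are illustrative):

```python
import numpy as np

def camera_to_world(points_cam, keyframe_poses, relative_pose):
    """points_cam: (N, 3) camera-frame coordinates; keyframe_poses: list of
    4x4 pose matrices accumulated from the first frame along the key-frame
    sequence; relative_pose: 4x4 pose of the target frame w.r.t. the current
    key frame. Returns (N, 3) world coordinates."""
    pose = np.eye(4)
    for kf in keyframe_poses:      # accumulate the key-frame chain
        pose = pose @ kf
    pose = pose @ relative_pose    # then the target frame's relative pose
    pts_h = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    return (pts_h @ pose.T)[:, :3]
```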
In this example, the experimental hyper-parameters are: the Adam optimizer, a network learning rate of 0.0002, and a momentum coefficient of 0.9.
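For concreteness, these hyper-parameters could be set as follows in PyTorch; the patent does not name a framework, betas[0]=0.9 matches the stated momentum coefficient, and the model below is a stand-in for the twin network S:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the twin network S
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002, betas=(0.9, 0.999))
```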
In this embodiment, the networks are trained on the constructed natural image, ultrasound image and CT image training sets, and tested on 10 ultrasound sampling sequences and 1 CT image sequence from public datasets. The conversion synthesis loss is used for error calculation: for ultrasound or CT images, two synthesized images at the target-image viewpoint are constructed from the two images adjacent to the target image, and the error between each synthesized image and the image at the target viewpoint is computed as the sum of per-pixel, per-color-channel intensity differences.
Table 1 lists the errors computed when reconstructing the ultrasound image sequences; each row corresponds to one sampling sequence in the ultrasound public dataset. When reconstructing the CT image sequence, its frames are divided into 10 groups of 40 CT images each; Table 2 lists the reconstruction errors of the 10 groups, with each row corresponding to one group of CT images.
In this embodiment, DenseNet is used to segment the ultrasound or CT images before 3D reconstruction. FIG. 1 shows the three-dimensional reconstruction result obtained by the present invention for an ultrasound image, and FIG. 2 shows that for a CT image; it can be seen that the present invention obtains comparatively accurate reconstruction results.
TABLE 1
Sequence number    Error
1 0.16795025690556517
2 0.07588948248992386
3 0.10406341926572499
4 0.11806257064506355
5 0.11147846461056986
6 0.11861705677945202
7 0.11786212592937022
8 0.06985992697810901
9 0.1378135516760461
10 0.07952738195851675
TABLE 2
Sequence number    Error
1 0.05569306584866662
2 0.06458857968174544
3 0.06570206329233383
4 0.06511348561945766
5 0.11852502653788335
6 0.10058261585285906
7 0.1247837253715424
8 0.15029050346342082
9 0.10726172543388322
10 0.1132056428828783

Claims (1)

1. An epipolar-constrained sparse attention mechanism medical image three-dimensional reconstruction method, characterized in that an ultrasound or CT image sequence is input with image resolution M×N, where 100 ≤ M ≤ 2000 and 100 ≤ N ≤ 2000, and the three-dimensional reconstruction process specifically comprises the following steps:
step 1: constructing a dataset
(a) Constructing a natural image dataset
Selecting a natural image website that provides image sequences and the corresponding camera internal parameters; a image sequences and the corresponding internal parameters are downloaded from the website, where 1 ≤ a ≤ 20. For each image sequence, every 3 adjacent frames are denoted image b, image c and image d; image b and image d are concatenated along the color channels to obtain image τ, and image c together with image τ forms a data element, where image c is the natural target image and the sampling viewpoint of image c serves as the target viewpoint; the internal parameters of image b, image c and image d are all e_t (t=1,2,3,4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates; if fewer than 3 frames remain at the end of an image sequence, they are discarded; a natural image dataset is constructed from all the sequences, containing f elements, where 3000 ≤ f ≤ 20000;
(b) Constructing ultrasound image datasets
Sampling g ultrasound image sequences, where 1 ≤ g ≤ 20. For each sequence, every 3 adjacent frames are denoted image i, image j and image k; image i and image k are concatenated along the color channels to obtain image π, and image j together with image π forms a data element, where image j is the ultrasound target image and the sampling viewpoint of image j serves as the target viewpoint; if fewer than 3 frames remain at the end of a sequence, they are discarded; an ultrasound image dataset is constructed from all the sequences, containing F elements, where 1000 ≤ F ≤ 20000;
(c) Constructing CT image datasets
Sampling h CT image sequences, where 1 ≤ h ≤ 20. For each sequence, every 3 adjacent frames are denoted image l, image m and image n; image l and image n are concatenated along the color channels to obtain image σ, and image m together with image σ forms a data element, where image m is the CT target image and the sampling viewpoint of image m serves as the target viewpoint; if fewer than 3 frames remain at the end of a sequence, they are discarded; a CT image dataset is constructed from all the sequences, containing ξ elements, where 1000 ≤ ξ ≤ 20000;
Step 2: construction of neural networks
The resolution of the images input to the network is p×o, where p is the width and o is the height, in pixels, with 100 ≤ o ≤ 2000 and 100 ≤ p ≤ 2000;
(1) Epipolar-constrained sparse attention learning network A
Network A serves as the basic structure of the two-branch twin network; its structure consists of a backbone network and 5 network branches. The backbone network takes tensor J, tensor C and variable X as inputs, where the scales of tensor J and tensor C are α×o×p×3 and α×o×p×6 respectively and X is a Boolean variable; the 5 network branches predict tensor L, tensor O, tensor W, tensor B and tensor D respectively, with scales α×2×6, α×4×1, α×o×p×1, α×o×p×4 and α×3, where α is the batch size;
the backbone network is designed for 3-stage cross-view coding, which proceeds in sequence:
1) The cross-view coding of stage 1 includes the embedded coding of stage 1 and the attention coding of stage 1:
the stage-1 embedded coding performs convolution separately on tensor J, on the first 3 feature components of the last dimension of tensor C, and on the last 3 feature components of the last dimension of tensor C, with convolution kernel scale 7×7 and 32 feature channels; serialization transforms the coded features from the spatial-domain shape of the image features into a sequence structure, and layer normalization yields stage-1 embedded code 1, stage-1 embedded code 2 and stage-1 embedded code 3, respectively;
For the stage-1 attention coding, stage-1 embedded code 1 and stage-1 embedded code 2 are concatenated along the last dimension to obtain stage-1 attention-code input feature 1; stage-1 embedded code 1 and stage-1 embedded code 3 are concatenated along the last dimension to obtain stage-1 attention-code input feature 2; stage-1 embedded code 2 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-code input feature 3; stage-1 embedded code 3 and stage-1 embedded code 1 are concatenated along the last dimension to obtain stage-1 attention-code input feature 4. Each of the 4 stage-1 attention-code input features is then processed by attention coding: along the last dimension, the first half of the channel features is taken as the target coding feature and the second half as the source coding feature; separable convolution is applied to the target and source coding features respectively, with kernel scale 3×3, 32 feature channels, and strides of 1 in the horizontal and vertical directions; the processing result of the target coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query keyword K and value V coding vectors for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query Q coding vector for attention learning;
When network A serves as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-1 attention-code input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 1, the number of feature channels is 32, and finally, adding the attention weight matrix of each attention code input feature in the 1 st stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 1 st stage;
b) First, a cross-view weighting matrix R is calculated:
the tensors L and O output by the 1st branch of the twin network are taken as the pose parameters and camera internal parameters; a fundamental matrix U is computed according to computer vision principles, and the cross-view epipolar line matrix Y is then computed from U:
Y=xU (1)
where x is the spatial-domain position matrix of the source coding feature, of scale w×3, with w being the length of the coding sequence after the processing result of the source coding feature is serialized; the elements of x are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y has scale w×3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E:
E=Yq (2)
where q is the transposed matrix of x, and the dimension of E is w×w;
according to the error matrix E, the maximum of all errors in E is computed and half of this maximum is taken as the error threshold; each element of E greater than the threshold is set to 1 and each element less than or equal to it is set to 0, and the resulting binary matrix is taken as the cross-view weighting matrix R;
then the attention weight matrix of each attention-code input feature is computed with the multi-head attention method, with 1 head and 32 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R to obtain the stage-1 attention weight matrix of each attention-code input feature, which is added to the target coding feature of each attention-code input feature to obtain the cross-view coding features of the 4 stage-1 attention-code input features; a sketch of this weighting follows;
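A minimal sketch of formulas (1)-(2) and the thresholding step, assuming x is already the w×3 matrix of normalized homogeneous pixel positions and U the 3×3 fundamental matrix obtained from the branch-1 pose and intrinsics; how x and U are assembled is not shown, and the function name is illustrative:

```python
import numpy as np

def cross_view_weighting(x, U):
    """x: w x 3 normalized homogeneous positions of the source coding feature;
    U: 3 x 3 fundamental matrix. Y = x U (formula (1)) gives one epipolar line
    per position; E = Y q with q = x^T (formula (2)) scores every position
    pair; elements above half the maximum error become 1, the rest 0, giving
    the binary cross-view weighting matrix R that masks the attention weights."""
    Y = x @ U                # formula (1): epipolar line coefficients, w x 3
    E = Y @ x.T              # formula (2): pairwise error matrix, w x w
    thresh = 0.5 * E.max()   # half of the maximum error as the threshold
    return (E > thresh).astype(np.float32)
```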
the average of the 1st and 2nd cross-view coding features of the 4 attention-code input features is used as the stage-1 cross-view cross-layer feature; the stage-1 cross-view cross-layer feature, the stage-1 3rd cross-view coding feature and the stage-1 4th cross-view coding feature are taken as the stage-1 cross-view coding result; the stage-1 cross-view coding result serves as the stage-2 cross-view coding input and is concatenated along the last dimension to obtain the stage-1 concatenated coding result;
2) The cross-view coding of phase 2 includes embedded coding of phase 2 and attention coding of phase 2:
stage-2 embedded coding applies embedded coding to each feature in the stage-1 cross-view coding result: convolution with kernel scale 3×3, 64 feature channels and strides of 2 in the horizontal and vertical directions; serialization transforms the coded features from the spatial-domain shape of the image features into a sequence structure, and layer normalization yields stage-2 embedded code 1, stage-2 embedded code 2 and stage-2 embedded code 3;
for the stage-2 attention coding, stage-2 embedded code 1 and stage-2 embedded code 2 are concatenated along the last dimension to obtain stage-2 attention-code input feature 1; stage-2 embedded code 1 and stage-2 embedded code 3 are concatenated along the last dimension to obtain stage-2 attention-code input feature 2; stage-2 embedded code 2 and stage-2 embedded code 1 are concatenated along the last dimension to obtain stage-2 attention-code input feature 3; stage-2 embedded code 3 and stage-2 embedded code 1 are concatenated along the last dimension to obtain stage-2 attention-code input feature 4. Along the last dimension, each attention-code input feature takes the first half of the channel features as the target coding feature and the second half as the source coding feature; separable convolution is applied to the target and source coding features respectively, with kernel scale 3×3, 64 feature channels, and strides of 2 in the horizontal and vertical directions; the processing result of the target coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query keyword K and value V coding vectors for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query Q coding vector for attention learning;
When network A serves as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-2 attention-code input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 3, the number of feature channels is 64, and finally, adding the attention weight matrix of each attention code input feature in the 2 nd stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 2 nd stage;
b) First, a cross-view weighting matrix R' is calculated:
the tensors L and O output by the 1st branch of the twin network are taken as the pose parameters and camera internal parameters; a fundamental matrix U′ is computed according to computer vision principles, and the cross-view epipolar line matrix Y′ is then computed from U′:
Y′=x′U′ (3)
where x′ is the spatial-domain position matrix of the source coding feature, of scale w′×3, with w′ being the length of the coding sequence after the processing result of the source coding feature is serialized; the elements of x′ are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y′ has scale w′×3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E':
E′=Y′q′ (4)
where q′ is the transposed matrix of x′, and the scale of E′ is w′×w′;
according to the error matrix E′, the maximum of all errors in E′ is computed and half of this maximum is taken as the error threshold; each element of E′ greater than the threshold is set to 1 and each element less than or equal to it is set to 0, and the resulting binary matrix is taken as the cross-view weighting matrix R′;
then, calculating an attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 3, the number of feature channels is 64, multiplying the obtained attention weight matrix by a cross-view weighting matrix R' to obtain an attention weight matrix of each attention code input feature in the 2 nd stage, and adding the attention weight matrix with a target code feature of each attention code input feature to obtain cross-view code features of 4 attention code input features in the 2 nd stage respectively;
the average of the 1st and 2nd cross-view coding features of the 4 attention-code input features is used as the stage-2 cross-view cross-layer feature; the stage-2 cross-view cross-layer feature, the stage-2 3rd cross-view coding feature and the stage-2 4th cross-view coding feature are taken as the stage-2 cross-view coding result; the stage-2 cross-view coding result serves as the stage-3 cross-view coding input and is concatenated along the last dimension to obtain the stage-2 concatenated coding result;
3) The 3 rd stage cross-view coding includes 3 rd stage embedded coding and 3 rd stage attention coding
stage-3 embedded coding applies embedded coding to each feature in the stage-2 cross-view coding result: convolution with kernel scale 3×3, 128 feature channels and strides of 2 in the horizontal and vertical directions; serialization transforms the coded features from the spatial-domain shape of the image features into a sequence structure, and layer normalization yields stage-3 embedded code 1, stage-3 embedded code 2 and stage-3 embedded code 3;
for the stage-3 attention coding, stage-3 embedded code 1 and stage-3 embedded code 2 are concatenated along the last dimension to obtain stage-3 attention-code input feature 1; stage-3 embedded code 1 and stage-3 embedded code 3 are concatenated along the last dimension to obtain stage-3 attention-code input feature 2; stage-3 embedded code 2 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-code input feature 3; stage-3 embedded code 3 and stage-3 embedded code 1 are concatenated along the last dimension to obtain stage-3 attention-code input feature 4. Along the last dimension, each attention-code input feature takes the first half of the channel features as the target coding feature and the second half as the source coding feature; separable convolution is applied to the target and source coding features respectively, with kernel scale 3×3, 128 feature channels, and strides of 2 in the horizontal and vertical directions; the processing result of the target coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query keyword K and value V coding vectors for attention learning, and the processing result of the source coding feature is stretched from the spatial-domain shape of the image features into serialized form and serves as the query Q coding vector for attention learning;
When network A serves as the 1st branch of the two-branch twin network, the input variable X is False and a) is executed; when network A serves as the 2nd branch, the input variable X is True and b) is executed; executing a) or b) yields the cross-view coding feature of each stage-3 attention-code input feature;
a) Calculating the attention weight matrix of each attention code input feature by utilizing a multi-head attention method, wherein the number of heads is 6, the number of feature channels is 128, and finally, adding the attention weight matrix of each attention code input feature in the 3 rd stage and the target code feature of each attention code input feature to obtain the cross-view code feature of each attention code input feature in the 3 rd stage;
b) First, a cross-view weighting matrix R″ is calculated:
the tensors L and O output by the 1st branch of the twin network are taken as the pose parameters and camera internal parameters; a fundamental matrix U″ is computed according to computer vision principles, and the cross-view epipolar line matrix Y″ is then computed from U″:
Y″=x″U″ (5)
where x″ is the spatial-domain position matrix of the source coding feature, of scale w″×3, with w″ being the length of the coding sequence after the processing result of the source coding feature is serialized; the elements of x″ are the normalized coordinates, in the device coordinate system, of the pixel positions in the processing result of the source coding feature; Y″ has scale w″×3, and each row contains the coefficients of the epipolar line equation corresponding to a pixel position in the processing result of the source coding feature;
Calculating an error matrix E″:
E″=Y″q″ (6)
where q "is the transposed matrix of x", and E "is the dimension w" x 3;
according to the error matrix E″, the maximum of all errors in E″ is computed and half of this maximum is taken as the error threshold; each element of E″ greater than the threshold is set to 1 and each element less than or equal to it is set to 0, and the resulting binary matrix is taken as the cross-view weighting matrix R″;
then the attention weight matrix of each attention-code input feature is computed with the multi-head attention method, with 6 heads and 128 feature channels; the obtained attention weight matrix is multiplied by the cross-view weighting matrix R″ to obtain the stage-3 attention weight matrix of each attention-code input feature, which is added to the target coding feature of each attention-code input feature to obtain the cross-view coding features of the 4 stage-3 attention-code input features;
the average of the 1st and 2nd cross-view coding features of the 4 attention-code input features is used as the stage-3 cross-view cross-layer feature; the stage-3 cross-view cross-layer feature, the stage-3 3rd cross-view coding feature and the stage-3 4th cross-view coding feature are taken as the stage-3 cross-view coding result; the stage-3 cross-view coding result is concatenated along the last dimension to obtain the stage-3 concatenated coding result;
For the 1 st network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of characteristic channels of convolution operation is 16, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; the resulting features were sequentially subjected to 2 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then, the obtained features are concatenated with the 3 rd stage concatenated coding result, and 3 unit processing is performed: in the 1 st unit processing, the number of characteristic channels of convolution operation is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 12, the convolution kernel scales are all 1 multiplied by 1, the step sizes in the horizontal direction and the vertical direction are all 1, and then characteristic activation and batch normalization processing are carried out; predicting the obtained characteristic results of the 12 channels according to a 2 multiplied by 6 form to obtain a tensor L result;
For the 2 nd network branch, the 1 st stage concatenated coding result is sequentially processed by 2 units: in the 1 st unit processing, the number of the convolved characteristic channels is 16, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of the convolved characteristic channels is 32, the convolution kernel scale is 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; then the obtained characteristics are connected with the 2 nd stage serial connection coding result in series, and 2 units of processing are carried out: in the 1 st unit processing, the number of the convolved characteristic channels is 32, the convolution kernel scale is 7 multiplied by 7, the step length in the horizontal direction and the step length in the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in 2 unit processing, the number of characteristic channels of convolution is 64, the convolution kernel scale is 3×3, the step sizes in the horizontal direction and the vertical direction are 2, then characteristic activation and batch normalization processing are carried out, the obtained characteristics are connected with the 3 rd stage serial coding result in series, and 3 unit processing is carried out: in the 1 st unit processing, the number of the convolved characteristic channels is 64, the convolution kernel scale is 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of the convolved characteristic channels is 128, the convolution kernel scale is 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of the convolved characteristic channels is 4, the convolution kernel scale is 1 multiplied by 1, the step length in the horizontal direction and the step length in the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, and the obtained 4-channel characteristics are used as the result of tensor O;
For the 3 rd network branch, inputting the 1 st stage cross-view cross-layer characteristics, and sequentially performing 3 unit processes: in the 1 st unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, in the 2 nd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, and then characteristic activation and batch normalization processing are carried out; in the 3 rd unit processing, the number of characteristic channels of convolution operation is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 1, then characteristic activation and batch normalization processing are carried out, then 1 deconvolution operation is carried out, the number of the characteristic channels of convolution is 16, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the characteristic activation and batch normalization processing are carried out to obtain the 1 st cross-layer characteristic of the 3 rd network branch;
initializing 3 rd network branch cross-layer characteristics: taking the 1 st stage cross-view cross-layer characteristic of the backbone network as the 2 nd cross-layer characteristic of the 3 rd network branch; taking the 2 nd stage cross-view cross-layer characteristic of the backbone network as the 3 rd cross-layer characteristic of the 3 rd network branch; taking the 3 rd stage cross-view cross-layer characteristic of the backbone network as the 4 th cross-layer characteristic of the 3 rd network branch; then the 1 st stage cross-view cross-layer characteristic of the backbone network is subjected to 1 st residual coding treatment, which comprises 3 convolution operations, wherein the channel characteristic numbers are 64, 64 and 256 respectively, the shape of the convolution kernel is 1×1, 3×3 and 1×1 respectively, and then 2 unit treatments are sequentially carried out: in the 1 st unit processing, the number of characteristic channels of convolution operation is 192, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and then characteristic activation and batch normalization processing are carried out; in the 2 nd unit processing, the number of characteristic channels of convolution operation is 192, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the characteristic activation and batch normalization processing are used as the 5 th cross-layer characteristic of the 3 rd network branch; carrying out 1-time residual coding treatment on the 5 th cross-layer feature, specifically carrying out 3-time convolution operations, wherein the number of channel features in the 3-time convolution operations is 512, 512 and 2048, the shapes of convolution kernels are 1×1, 3×3 and 1×1 respectively, and then sequentially carrying out 6 unit treatment processes on the obtained features:
During the processing of the 1 st unit, the up-sampling convolution processing is carried out, the number of characteristic channels is 512, the convolution kernel is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 5 th cross-layer characteristic of the 3 rd network branch in series, the convolution processing is further carried out on the series-connected characteristic, the number of the convolved characteristic channels is 512, and the convolution kernel is 3 multiplied by 3;
when the 2 nd unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 256, the convolution kernel is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 4 th cross-layer characteristic of the 3 rd network branch in series, the convolution processing is carried out on the obtained characteristic, the number of the convolved characteristic channels is 256, and the convolution kernel is 3 multiplied by 3;
when the 3 rd unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 128, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 3 rd cross-layer characteristic of the 3 rd network branch in series, the convolution processing is carried out on the series characteristic, the number of the convolved characteristic channels is 128, the convolution kernel shape is 3 multiplied by 3, the obtained characteristic is input to 4 units for processing, and meanwhile, the obtained characteristic is predicted to be the 4 th scale result of the tensor W through the convolution operation with the kernel of 3 multiplied by 3;
In the 4 th unit processing, the up-sampling convolution processing is carried out, the number of characteristic channels is 64, the convolution kernel shape is 3×3, the up-sampling scale is 2×2, the obtained characteristic is connected with the 2 nd cross-layer characteristic of the 3 rd network branch and the 2×2 up-sampling characteristic in the 3 rd unit processing in series, the convolution processing is carried out on the connected characteristic, the number of the convolution characteristic channels is 64, the convolution kernel shape is 3×3, the obtained characteristic is input to the 5 th unit processing, and meanwhile, the obtained characteristic is predicted to be the 3 rd scale result of tensor W through the convolution operation of the kernel of 3×3;
when the 5 th unit is processed, the up-sampling convolution processing is carried out, the number of characteristic channels is 32, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristics are respectively connected with the 1 st cross-layer characteristics of the 3 rd network branch and the 2 multiplied by 2 up-sampling characteristics when the 4 th unit is processed, the connected characteristics are input into the 6 th unit for processing, and meanwhile, the obtained characteristics are predicted to be the 2 nd scale result of tensor W through the convolution operation with the kernel of 3 multiplied by 3;
in the processing of the 6 th unit, the up-sampling convolution processing is carried out, the number of characteristic channels is 16, the convolution kernel shape is 3 multiplied by 3, the up-sampling scale is 2 multiplied by 2, the obtained characteristic is connected with the 2 multiplied by 2 up-sampling characteristic in series in the processing of the 5 th unit, and then the series characteristic is predicted to be the 1 st scale result of the tensor W through the convolution operation with the kernel of 3 multiplied by 3;
Using the results of the 1 st to 4 th scales as the result of the tensor W;
for the 4 th network branch, performing one-time deconvolution operation, feature activation and batch normalization processing on the cross-layer features of the cross-view in the 1 st stage, wherein in the deconvolution operation, the number of the convolved feature channels is 16, the convolution kernel scales are 3 multiplied by 3, and the step sizes in the horizontal direction and the vertical direction are 2; the obtained result is marked as a decoder cross-layer characteristic 1, and the cross-view cross-layer characteristic of the 1 st stage is processed by the following 2 units: when the 1 st unit is processed, the number of convolution operation characteristic channels is 32, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, the characteristic activation and batch normalization processing are carried out, and the processing characteristic is marked as a decoder cross-layer characteristic 2; processing the 2 nd unit, carrying out convolution operation, wherein the number of characteristic channels is 32, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, carrying out characteristic activation and batch normalization processing, carrying out series connection on the obtained characteristic and the 2 nd stage cross-view cross-layer characteristic, and sequentially carrying out the processing of the following 2 units on the series connection result: when the 1 st unit is processed, the number of characteristic channels of convolution is 64, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristics are marked as decoder cross-layer characteristics 3; when the 2 nd unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, then the obtained characteristic is connected with the 3 rd stage cross-view cross-layer characteristic in series, the following 3 unit processes are sequentially carried out, when the 1 st unit is processed, the number of the convolved characteristic channels is 128, the convolution kernel scales are 7 multiplied by 7, the step sizes in the horizontal direction and the vertical direction are 1, and the processing characteristic is marked as the decoder cross-layer characteristic 4; when the 2 nd unit is processed, the number of the characteristic channels of convolution is 256, the convolution kernel scales are 3 multiplied by 3, the step sizes in the horizontal direction and the vertical direction are 2, and the processing characteristics are marked as decoder cross-layer characteristics 5; when the 3 rd unit is processed, the number of the convolved characteristic channels is 512, the convolution kernel scales are 3 multiplied by 3, the step length in the horizontal direction and the step length in the vertical direction are 2, and the 4 th network branch coding characteristic is obtained after the processing;
The decoding process is as follows: 1 deconvolution operation is performed on the 4th network branch coding feature: 256 feature channels, kernel scales 3×3, strides of 2 in the horizontal and vertical directions, with feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 5, and one convolution operation is performed: 512 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the result undergoes a deconvolution operation: 256 feature channels, kernel scales 3×3, strides of 2 in both directions, with feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 4, and one convolution operation is performed: 256 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the result undergoes one deconvolution operation: 128 feature channels, kernel scales 3×3, strides of 2 in both directions, with feature activation and batch normalization; the result is concatenated with decoder cross-layer feature 3, and one convolution operation is performed: 128 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the obtained feature is taken as the 4th-scale result of tensor B, and at the same time 1 deconvolution operation is performed on it: 64 deconvolution feature channels, kernel scales 3×3, strides of 2 in both directions, with feature activation and batch normalization; the obtained feature is concatenated with decoder cross-layer feature 2, and one convolution operation is performed: 64 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the obtained feature is taken as the 3rd-scale result of tensor B, and at the same time 1 deconvolution operation is performed on it: 32 deconvolution feature channels, kernel scales 3×3, strides of 2 in both directions, with feature activation and batch normalization; the obtained feature is concatenated with decoder cross-layer feature 1, and then one convolution operation is performed: 32 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the obtained feature is taken as the 2nd-scale result of tensor B, and at the same time 1 deconvolution operation is performed on it: 16 feature channels, kernel scales 7×7, strides of 2 in both directions, with feature activation and batch normalization; the obtained feature is concatenated with the up-sampling result of the 3rd-scale feature, and then one convolution operation is performed: 16 feature channels, kernel scales 3×3, strides of 1 in both directions, with feature activation and batch normalization; the obtained feature is taken as the 1st-scale result of tensor B; the 4 scale results of tensor B constitute the output of the 4th network branch;
For the 5th network branch, the stage-3 concatenated coding result is processed by 4 units in sequence: in unit 1, convolution with 256 feature channels, kernel scales 3×3 and strides of 1 in the horizontal and vertical directions, followed by feature activation and batch normalization; in unit 2, convolution with 512 feature channels, kernel scales 3×3 and strides of 2, followed by feature activation and batch normalization; in unit 3, convolution with 1024 feature channels, kernel scales 3×3 and strides of 2; in unit 4, convolution with 3 feature channels, kernel scales 1×1 and strides of 1; the obtained features are taken as the result of tensor D;
(2) Twin network S
The structure of the twin network S consists of a twin branch 1 and a twin branch 2, wherein the twin branch 1 and the twin branch 2 take a network A as a basic framework;
for the twin network S, tensor J and tensor C serve as inputs, with scales α×o×p×3 and α×o×p×6 respectively; the input tensors of the twin network S are learned sequentially by twin branch 1 and twin branch 2, and the outputs obtained are tensor L, tensor O, tensor W, tensor B and tensor D, with scales α×2×6, α×4×1, α×o×p×1, α×o×p×4 and α×3 respectively, where α is the batch size;
First, the input tensor J and tensor C of the twin network S are learned by the twin branch 1:
the Boolean variable X is set to False; X together with tensor J and tensor C is input into twin branch 1, and the output of twin branch 1 is obtained after learning;
then, the input tensor J and tensor C of the twin network S are learned by the twin branch 2:
the Boolean variable X is set to True; X together with tensor J and tensor C is input into twin branch 2 for learning; in the calculation of the cross-view error matrix and cross-view weighting matrix during the learning of twin branch 2, the tensors L and O output by twin branch 1 are used as the pose parameters and camera internal parameters respectively, and after the learning of twin branch 2 the output of the twin network S is obtained; a structural sketch follows;
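Structurally, this two-pass twin forward can be sketched as follows, with net_A a callable of hypothetical signature; the patent defines network A's layers but no code-level interface:

```python
def twin_forward(net_A, J, C):
    # Twin branch 1: X=False, so attention uses plain multi-head weights (step a).
    L1, O1, W1, B1, D1 = net_A(J, C, X=False, pose_prior=None)
    # Twin branch 2: X=True, so attention is epipolar-weighted (step b), reusing
    # branch 1's pose parameters L1 and camera intrinsics O1 for the weighting.
    return net_A(J, C, X=True, pose_prior=(L1, O1))
```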
Step 3: Training of neural networks
Samples in the natural image dataset, the ultrasound image dataset and the CT image dataset are each divided into a training set and a test set at a ratio of 9:1; the training-set data are used for training and the test-set data for testing. During training, training data are drawn from the corresponding dataset, uniformly scaled to resolution p×o and input into the corresponding network, and iterative optimization is performed, continuously updating the network model parameters so as to minimize the loss of each batch;
In the training process, the calculation method of each loss comprises the following steps:
internal-parameter-supervised synthesis loss: in network model training on natural images, the tensor W output by the twin network S is taken as depth, and the tensor L output by the twin network S together with the internal parameter labels e_t (t=1,2,3,4) of the training data are taken as the pose parameters and camera internal parameters, respectively; two images at the viewpoint of image c are synthesized from image b and image d according to computer vision principles, and the loss is computed from image c and each of the two synthesized images as the sum of per-pixel, per-color-channel intensity differences;
unsupervised synthesis loss: in network model training on ultrasound or CT images, the tensor W output by the twin network S is taken as depth, and the tensors L and O output by the twin network S are taken as the pose parameters and camera internal parameters, respectively; synthesized images at the target viewpoint are constructed from the two images adjacent to the target image according to computer vision algorithms, and the loss is computed from the target image and each of the two synthesized images as the sum of per-pixel, per-color-channel intensity differences;
internal parameter error loss: tensor O output by twin network S and internal parameter label e of training data t (t=1, 2,3, 4) is calculated as the sum of the absolute values of the respective component differences;
spatial structure error loss: in the network model training of ultrasonic or CT images, taking tensor W output by a twin network S as depth, taking tensor L and tensor O output by the twin network S as pose parameters and camera internal parameters respectively, reconstructing three-dimensional coordinates of images at a target viewpoint by using two adjacent images of the images at the target viewpoint according to a computer vision algorithm, performing space structure fitting on reconstructed points by using a RANSAC algorithm, and calculating by using a cosine distance from a normal vector obtained by fitting to tensor D output by the twin network S;
conversion synthesis loss: in the network model training of ultrasonic or CT images, taking tensor W output by a twinning network S as depth, taking tensor L and tensor O output by the twinning network S as pose parameters and camera internal parameters respectively, constructing two synthesized images at the target image view point by using two adjacent images of the target image according to a computer vision algorithm, taking tensor B output by the twinning network S as displacement of spatial domain deformation of the synthesized image after each pixel position is obtained in the synthesis process for each image in the synthesized image, and calculating according to the sum of pixel-by-pixel and color channel intensity differences by utilizing the synthesized image at the two target view points and the image at the target view point;
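A minimal numpy sketch of these loss terms, under stated assumptions: the synthesis losses reduce to a sum of absolute per-pixel, per-channel intensity differences; the intrinsic error loss is a component-wise L1 sum; the spatial structure term uses a small RANSAC plane fit plus a cosine distance; and the conversion term applies the displacement from tensor B (assumed here to carry the x/y offsets in its first two channels) before the photometric comparison. None of this reproduces the patent's exact implementation.

```python
import numpy as np

def synthesis_loss(target, synthesized):
    """Sum over pixels and color channels of absolute intensity differences
    (the form shared by the supervised and unsupervised synthesis losses)."""
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

def intrinsic_error_loss(O_pred, e_labels):
    """Sum of absolute differences between the predicted intrinsic
    components and the labels e_t (t = 1..4)."""
    return np.abs(O_pred.reshape(-1) - e_labels.reshape(-1)).sum()

def ransac_plane_normal(points, iters=200, tol=0.01, seed=0):
    """Minimal RANSAC plane fit over an (N, 3) point cloud; returns the
    unit normal of the plane supported by the most inliers."""
    rng = np.random.default_rng(seed)
    best_n, best_count = np.array([0.0, 0.0, 1.0]), -1
    for _ in range(iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-8:
            continue                      # degenerate sample, retry
        n = n / np.linalg.norm(n)
        count = int((np.abs((points - p[0]) @ n) < tol).sum())
        if count > best_count:
            best_count, best_n = count, n
    return best_n

def spatial_structure_loss(points, D_pred):
    """Cosine distance between the RANSAC-fitted normal and tensor D."""
    n = ransac_plane_normal(points)
    d = D_pred / (np.linalg.norm(D_pred) + 1e-8)
    return 1.0 - float(np.dot(n, d))

def deform(img, B):
    """Nearest-neighbour spatial-domain deformation of a synthesized image,
    reading the per-pixel x/y displacement from B's first two channels
    (an assumption about B's layout)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.round(xs + B[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.round(ys + B[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def conversion_synthesis_loss(target, synthesized, B):
    """Photometric comparison after deforming the synthesized image."""
    return synthesis_loss(target, deform(synthesized, B))
```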
The specific training steps are as follows (the five stages are summarized as configuration data after this list):
(1) On the natural image dataset, train the trunk network and the 1st and 3rd network branches of network A 50000 times using the twin network S
Each time, training data are taken from the natural image dataset and uniformly scaled to resolution p×o; image c and image τ are input into the twin network S, and the trunk network and the 1st and 3rd network branches of network A are trained for 50000 iterations. The training loss of each batch is the intrinsic-supervised synthesis loss;
(2) On the natural image dataset, train the 2nd network branch of network A 60000 times using the twin network S
Each time, training data are taken from the natural image dataset and uniformly scaled to resolution p×o; image c and image τ are input into the twin network S, and the 2nd network branch of network A is trained. The training loss of each batch is the sum of the unsupervised synthesis loss and the intrinsic error loss;
(3) On the ultrasound image dataset, train the 4th and 5th network branches of network A 60000 times using the twin network S to obtain network model parameters
Each time, ultrasound training data are taken from the ultrasound image dataset and uniformly scaled to resolution p×o; image j and image π are input into the twin network S, and the 4th and 5th network branches of network A are trained. The training loss of each batch is the sum of the conversion synthesis loss and the spatial structure error loss;
(4) On the ultrasound image dataset, train the trunk network and the 1st to 5th network branches of network A 30000 times using the twin network S to obtain the network model parameters ρ
Each time, ultrasound training data are taken from the ultrasound image dataset and uniformly scaled to resolution p×o; image j and image π are input into the twin network S, and the trunk network and the 1st to 5th network branches of network A are trained. The training loss of each batch is the sum of the conversion synthesis loss and the spatial structure error loss;
(5) On the CT image dataset, train the trunk network and the 1st to 5th network branches of network A 50000 times using the twin network S to obtain the network model parameters ρ′
Each time, CT training data are taken from the CT image dataset and uniformly scaled to resolution p×o; image m and image σ are input into the twin network S. Tensor W output by the twin network S is taken as the depth, tensors L and O as the pose parameters and camera intrinsics, respectively, and tensor B as the displacement for the spatial-domain deformation of the synthesized images; image l and image n are each used to synthesize an image at the viewpoint of image m. The network parameters are continuously updated so that the loss of each batch is minimized; in the optimization loss, a camera translational-motion loss is added to the conversion synthesis loss and the spatial structure error loss. After 50000 training iterations, the network model parameters ρ′ are obtained;
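The five stages above can be condensed into configuration data. A minimal sketch with illustrative names only; the loss keys refer to the terms defined earlier.

```python
TRAINING_STAGES = [
    # (1) natural images: trunk + branches 1 and 3
    dict(dataset="natural", parts=["trunk", "branch1", "branch3"], iters=50000,
         losses=["intrinsic_supervised_synthesis"]),
    # (2) natural images: branch 2
    dict(dataset="natural", parts=["branch2"], iters=60000,
         losses=["unsupervised_synthesis", "intrinsic_error"]),
    # (3) ultrasound: branches 4 and 5
    dict(dataset="ultrasound", parts=["branch4", "branch5"], iters=60000,
         losses=["conversion_synthesis", "spatial_structure"]),
    # (4) ultrasound: trunk + branches 1-5 -> parameters rho
    dict(dataset="ultrasound",
         parts=["trunk"] + [f"branch{i}" for i in range(1, 6)], iters=30000,
         losses=["conversion_synthesis", "spatial_structure"]),
    # (5) CT: trunk + branches 1-5, adds camera-translation loss -> rho'
    dict(dataset="ct",
         parts=["trunk"] + [f"branch{i}" for i in range(1, 6)], iters=50000,
         losses=["conversion_synthesis", "spatial_structure",
                 "camera_translation"]),
]
```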
Step 4: three-dimensional reconstruction of ultrasound or CT images
For an ultrasound or CT sequence image from the samples, three-dimensional reconstruction is achieved by simultaneously performing the following three processes (a geometric sketch in numpy follows this list):
(1) For any target image in the sequence, its three-dimensional coordinates in the camera coordinate system are calculated as follows: the image is scaled to resolution p×o; for an ultrasound sequence, image j and image π are input into twin branch 1 of the twin network S with the Boolean variable X set to False, and for a CT sequence, image m and image σ are input into twin branch 1 of the twin network S with X set to False; prediction uses the model parameters ρ and ρ′, respectively. Tensor W output by the twin network S is taken as the depth, and tensors L and O as the pose parameters and camera intrinsics; from the depth information and camera intrinsics of the target image, its three-dimensional coordinates in the camera coordinate system are calculated according to the principles of computer vision;
(2) During the three-dimensional reconstruction of the sequence, a key frame sequence is established: the first frame of the sequence is taken as the first key frame and as the current key frame, the frames after the current key frame are taken as target frames, and new key frames are selected dynamically in target-frame order. The pose parameter matrix of a target frame relative to the current key frame is first initialized with the identity matrix. For any target frame, this matrix is multiplied by the camera pose parameters of the target frame, and the result, combined with the camera intrinsics and the depth information of the target frame, is used to synthesize an image at the target frame's viewpoint; the error λ is the sum of the pixel-by-pixel color-channel intensity differences between this synthesized image and the target frame. An image at the target frame's viewpoint is also synthesized from an adjacent frame of the target frame using the pose parameters and camera intrinsics; the error γ is the sum of the pixel-by-pixel color-channel intensity differences between this synthesized image and the target frame. The synthesis error ratio Z is then calculated by formula (7):
Z = λ / γ  (7)
If Z exceeds a threshold η, with 1 < η < 2, the target frame is taken as a new key frame, the pose parameter matrix of the target frame relative to the current key frame is taken as the pose parameters of the new key frame, and the target frame is updated to be the current key frame; iterating in this way completes the establishment of the key frame sequence;
(3) The viewpoint of the first frame of the sequence is taken as the origin of the world coordinate system. Any target image is scaled to resolution M×N; from the camera intrinsics and depth information output by the network, its three-dimensional coordinates in the camera coordinate system are calculated; then, from the camera pose parameters output by the network, combined with the pose parameters of each key frame in the key frame sequence and the pose parameter matrix of the target frame relative to the current key frame, the three-dimensional coordinates of each pixel of the target frame in the world coordinate system are obtained.
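The three processes rest on standard pinhole geometry. Below is a minimal numpy sketch: back-projection of a depth map through the intrinsic matrix, the key-frame test of formula (7) with Z reconstructed as λ/γ, and the chaining of key-frame and relative poses into world coordinates. The function names and the 4×4 homogeneous pose representation are assumptions for illustration.

```python
import numpy as np

def backproject(depth, K):
    """Per-pixel 3D coordinates in the camera frame from a depth map and a
    3x3 intrinsic matrix K (standard pinhole model)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # normalized viewing rays
    return rays * depth[..., None]        # (h, w, 3) camera coordinates

def is_new_keyframe(lam, gamma, eta=1.5):
    """Formula (7): Z = lambda / gamma; a new key frame is declared when Z
    exceeds the threshold eta, with 1 < eta < 2."""
    Z = lam / max(gamma, 1e-12)
    return Z > eta

def to_world(points_cam, T_key_world, T_target_key):
    """World coordinates of camera-frame points, chaining the key frame's
    pose with the target frame's pose relative to that key frame (both as
    4x4 homogeneous matrices)."""
    T = T_key_world @ T_target_key
    pts = points_cam.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    return (pts_h @ T.T)[:, :3].reshape(points_cam.shape)
```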
CN202110881585.2A 2021-08-02 2021-08-02 Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method Active CN113689543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110881585.2A CN113689543B (en) 2021-08-02 2021-08-02 Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method

Publications (2)

Publication Number Publication Date
CN113689543A CN113689543A (en) 2021-11-23
CN113689543B true CN113689543B (en) 2023-06-27

Family

ID=78578772

Country Status (1)

Country Link
CN (1) CN113689543B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090070500A * 2007-12-27 2009-07-01 Chung-Ang University Industry-Academic Cooperation Foundation Mechanism for reconstructing full 3d model using single-axis turntable images
JP2017051418A (en) * 2015-09-09 2017-03-16 株式会社島津製作所 Radiographic apparatus
CN112767532A (en) * 2020-12-30 2021-05-07 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN112819876A (en) * 2021-02-13 2021-05-18 西北工业大学 Monocular vision depth estimation method based on deep learning

Similar Documents

Publication Publication Date Title
CN113689545B (en) 2D-to-3D end-to-end ultrasound or CT medical image cross-modal reconstruction method
CN112767532B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN111310707A (en) Skeleton-based method and system for recognizing attention network actions
CN113689517B (en) Image texture synthesis method and system for multi-scale channel attention network
CN113689546B (en) Cross-modal three-dimensional reconstruction method for ultrasound or CT image of two-view twin transducer
CN114565816B (en) Multi-mode medical image fusion method based on global information fusion
CN117036162B (en) Residual feature attention fusion method for super-resolution of lightweight chest CT image
CN114170671A (en) Massage manipulation identification method based on deep learning
CN116823625B (en) Cross-contrast magnetic resonance super-resolution method and system based on variational self-encoder
CN116385454A (en) Medical image segmentation method based on multi-stage aggregation
CN113689544B (en) Cross-view geometric constraint medical image three-dimensional reconstruction method
CN116188509A (en) High-efficiency three-dimensional image segmentation method
CN113689548B (en) Medical image three-dimensional reconstruction method based on mutual attention transducer
CN113689542B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on self-attention transducer
Dharejo et al. SwinWave-SR: Multi-scale lightweight underwater image super-resolution
Lan et al. BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation
CN113689543B (en) Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method
CN112734906B (en) Three-dimensional reconstruction method of ultrasonic or CT medical image based on knowledge distillation
CN112700534B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on feature migration
CN112700535B (en) Ultrasonic image three-dimensional reconstruction method for intelligent medical auxiliary diagnosis
CN116309754A (en) Brain medical image registration method and system based on local-global information collaboration
CN113689547B (en) Ultrasonic or CT medical image three-dimensional reconstruction method of cross-view visual transducer
Gao et al. Rank-one network: An effective framework for image restoration
CN112734907B (en) Ultrasonic or CT medical image three-dimensional reconstruction method
CN116309507A (en) AIS focus prediction method for performing feature fusion on CTP under attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant