WO2022206036A1

WO2022206036A1 - Soft tissue motion prediction method and apparatus, terminal device, and readable storage medium

Info

Publication number: WO2022206036A1
Application number: PCT/CN2021/138567
Authority: WO
Inventors: 张嘉乐; 廖祥云; 王琼; 王平安
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-03-29
Filing date: 2021-12-15
Publication date: 2022-10-06
Also published as: CN112967275B; CN112967275A

Abstract

The present application is applicable to the technical field of image processing, and in particular to a soft tissue motion prediction method and apparatus, a terminal device, and a readable storage medium. The soft tissue motion prediction method comprises: obtaining an original image sequence, the original image sequence being used for describing a motion track of a soft tissue in a first time period; and inputting the original image sequence into a preset soft tissue motion prediction model for processing to obtain a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used for describing a predicted motion track of the soft tissue in a second time period adjacent to the first time period, wherein the soft tissue motion prediction model comprises multiple layers of long short-term memory network units that are stacked, the long short-term memory network units transmit target temporal and spatial features across layers according to a time sequence, and each long short-term memory network unit comprises a self-attention module. By means of the soft tissue motion prediction method provided by the present application, the effect and precision of soft tissue motion prediction can be effectively improved.

Description

Soft tissue motion prediction method, device, terminal device and readable storage medium

Technical field

The present application belongs to the technical field of image processing, and in particular, relates to a soft tissue motion prediction method, apparatus, terminal device, and computer-readable storage medium.

Background art

In high-intensity focused ultrasound (HIFU) image-guided therapy, soft tissue motion can negatively impact therapy. Therefore, soft tissue motion prediction needs to be performed in advance. In the prior art, traditional methods such as a tracking method without model matching and a tracking method based on model matching can be used to predict the motion of soft tissue, but these traditional methods have problems of poor motion prediction effect and low accuracy.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a soft tissue motion prediction method, apparatus, terminal device, and computer-readable storage medium, which can effectively improve the effect and accuracy of soft tissue motion prediction.

In a first aspect, an embodiment of the present application provides a soft tissue motion prediction method, which may include:

acquiring an original image sequence, the original image sequence being used to describe the motion trajectory of soft tissue in a first time period;

inputting the original image sequence into a preset soft tissue motion prediction model for processing, and obtaining a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes multiple stacked layers of long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time sequence, and each long short-term memory network unit includes a self-attention module.

With the above soft tissue motion prediction method, context information of the global space can be obtained through the self-attention module, and spatiotemporal features can be transmitted across layers according to the time sequence, which enhances the transmission of spatiotemporal information between images at different times. The soft tissue motion prediction model thus has stronger spatial correlation, short-term modeling ability, and long-term modeling ability, which can greatly improve its prediction effect and accuracy, thereby improving the effect and accuracy of soft tissue motion prediction.

Exemplarily, the long short-term memory network units transmitting the target spatiotemporal features across layers according to a time sequence may include:

The (l+1)-th layer long short-term memory network unit transmits the target spatiotemporal feature map generated at time t-1 to the l-th layer long short-term memory network unit at time t, where 1≤l<L and L is the total number of layers of long short-term memory network units contained in the soft tissue motion prediction model.

Optionally, the self-attention module includes a first self-attention module and a second self-attention module connected in parallel, where the first self-attention module is used to generate candidate spatiotemporal feature maps and the second self-attention module is used to generate candidate spatial feature maps.

Exemplarily, the first self-attention module can generate the candidate spatiotemporal feature map according to the following formula:

[formula not reproduced in the source text]

where [symbol missing] is the candidate spatiotemporal feature map generated by the first self-attention module in the l-th layer long short-term memory network unit at time t; W_f, W_lv, W_xo, W_ho, and W_co are preset weight matrices; [symbol missing] is the input feature map corresponding to the first self-attention module in the l-th layer long short-term memory network unit at time t; Z_l is the intermediate feature map generated by the first self-attention module based on [symbol missing]; Z_{l;i} is the i-th element of Z_l; a_{l;i,j} is the similarity between the i-th element and the j-th element of [symbol missing]; [symbol missing] is the j-th element of [symbol missing]; N is the total number of elements contained in [symbol missing]; σ is the sigmoid function; x_t is the original image at time t; [symbol missing] is the target spatiotemporal feature map transmitted by the (l+1)-th layer long short-term memory network unit at time t-1; [symbol missing] is the target temporal feature map generated by the l-th layer long short-term memory network unit at time t; and b_o is a preset bias term.
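The definitions above describe an intermediate feature map whose i-th element aggregates the elements of an input feature map according to their pairwise similarities. The following is a minimal sketch of that general idea only, not the formula of the present application, which is not reproduced in this text. It assumes a softmax-normalised dot-product similarity and a single value-projection matrix; the names `self_attention`, `features`, and `w_v` are hypothetical.

```python
import math


def self_attention(features, w_v):
    """Aggregate N feature vectors with softmax dot-product attention.

    features: list of N vectors (lists of floats), one per spatial position.
    w_v: weight matrix (list of rows) applied to every element before
         aggregation; its role here is an assumption, loosely playing the
         part of a preset weight matrix.
    Returns the intermediate feature map as a list of N vectors.
    """
    n = len(features)
    # Project every element j with the value weights.
    values = [[sum(w * x for w, x in zip(row, f)) for row in w_v] for f in features]
    z = []
    for i in range(n):
        # Raw similarity between element i and every element j
        # (plain dot product: an assumption, not taken from the source).
        scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in range(n)]
        # Softmax so that the similarities for element i sum to 1 over j.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        a_i = [e / total for e in exps]
        # Element i of the output is the similarity-weighted sum of all elements.
        z.append([sum(a_i[j] * values[j][k] for j in range(n)) for k in range(len(w_v))])
    return z


identity = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = self_attention(feats, identity)
```

Because every output element depends on all N input elements, each position can draw on context from the whole feature map, which is the global-context property the description attributes to the self-attention module.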

Exemplarily, the second self-attention module can generate the candidate spatial feature map according to the following formula:

[formula not reproduced in the source text]

where [symbol missing] is the candidate spatial feature map generated by the second self-attention module in the l-th layer long short-term memory network unit at time t; W_z and W_mv are preset weight matrices; [symbol missing] is the target spatial feature map output by the (l-1)-th layer long short-term memory network unit at time t; Z_m is the intermediate feature map generated by the second self-attention module based on [symbol missing]; Z_{m;i} is the i-th element of Z_m; a_{m;i,j} is the similarity between the i-th element and the j-th element of [symbol missing]; [symbol missing] is the j-th element of [symbol missing]; and R is the total number of elements contained in [symbol missing].

Specifically, the long short-term memory network unit can process the candidate spatiotemporal feature map generated by the first self-attention module and the candidate spatial feature map generated by the second self-attention module according to the following formula, and obtain the target spatiotemporal feature map and target spatial feature map output by the long short-term memory network unit:

[formula not reproduced in the source text]

where [symbol missing] is the target spatiotemporal feature map output by the l-th layer long short-term memory network unit at time t; [symbol missing] is the target spatial feature map output by the l-th layer long short-term memory network unit at time t; [symbol missing] is the candidate spatiotemporal feature map generated by the first self-attention module in the l-th layer long short-term memory network unit at time t; [symbol missing] is the candidate spatial feature map generated by the second self-attention module in the l-th layer long short-term memory network unit at time t; ⊙ denotes the element-wise product; σ is the sigmoid function; W_ho' and W_mg are preset weight matrices; and b_o' and b_g' are preset bias terms.

In a second aspect, an embodiment of the present application provides a soft tissue motion prediction device, which may include:

an image sequence acquisition module, configured to acquire an original image sequence, the original image sequence being used to describe the motion trajectory of soft tissue in a first time period;

a soft tissue motion prediction module, configured to input the original image sequence into a preset soft tissue motion prediction model for processing, and obtain a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes multiple stacked layers of long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time sequence, and each long short-term memory network unit includes a self-attention module.

In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the soft tissue motion prediction method according to any one of the implementations of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the soft tissue motion prediction method according to any one of the implementations of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the soft tissue motion prediction method described in any one of the first aspects above.

It can be understood that, for the beneficial effects of the second aspect to the fifth aspect, reference may be made to the relevant description in the first aspect, which is not repeated here.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.

FIG. 1 is a schematic flowchart of a soft tissue motion prediction method provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a soft tissue motion prediction model provided by an embodiment of the present application, unrolled along the time sequence;

FIG. 3 is a schematic structural diagram of a long short-term memory network unit provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a self-attention module provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a soft tissue motion prediction device provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed description

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It is to be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.

It will also be understood that, as used in this specification and the appended claims, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in the specification of this application and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to the determination", "once the [described condition or event] is detected", or "in response to detection of the [described condition or event]".

In addition, in the description of the specification of the present application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and should not be construed as indicating or implying relative importance.

References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise. The terms "including", "comprising", "having", and their variants mean "including but not limited to" unless specifically emphasized otherwise.

HIFU therapy has become a common method for treating thoracic and abdominal cancer owing to its non-invasiveness, high efficiency, and low cost. Its core technology is to accurately locate the target area, while taking into account the heterogeneity of human body structure and the nonlinear relationship between the high-precision scalpel and soft tissue motion, and to achieve precise spatiotemporal control of the surgical system. During HIFU ultrasound image-guided therapy, soft tissue motion can negatively impact the therapy. Here, the soft tissue refers to the soft tissue in the target area. The motion of the soft tissue may include elastic deformation caused by needle puncture of the soft tissue, displacement of the soft tissue caused by the movement of organs or tissues due to the patient's breathing or body movement, and the like. Once the target soft tissue moves, it is often difficult for the treatment system to track the target area in time, which leads to under-dosing of the treatment target area or damage to surrounding normal tissues or organs, and thus to unnecessary treatment side effects.

Therefore, it is necessary to predict the motion of the target soft tissue in advance. In the prior art, tracking methods without model matching and tracking methods based on model matching can be used to predict the motion of soft tissue. Among the model-free tracking prediction methods, the block matching method is the most widely used. The block matching method uses the local structural information of the image to estimate the state of the target soft tissue for tracking; its main idea is to find the neighboring image blocks closest to a query block by matching the query block against the neighboring blocks. However, the block matching method cannot handle the instability of local image structure well and cannot make full use of the prior information of the image. Model-based matching tracking methods can include real-time tracking methods for non-rigid objects based on active shape models and nonlinear state space tracking methods. Such methods can use the prior information of medical image sequences to construct a mathematical prediction network model of medical organs, and enhance robustness by optimizing the model parameters. However, most existing model-based matching tracking methods regard the target tissue as a rigid whole or a point and cannot accurately locate the region and boundary of the target tissue, so they cannot accurately predict the motion of soft tissue.
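As an illustration of the block matching idea described above, the following minimal sketch searches a small window in the next frame for the block that best matches a query block from the previous frame. It is a generic textbook form, not a method taken from the present application: the sum-of-absolute-differences cost, the exhaustive search, returning only the single best match, and all function names are assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(image, top, left, size):
    """Cut a size x size block out of a 2-D image given as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def match_block(prev_frame, next_frame, top, left, size, radius):
    """Estimate where the block at (top, left) in prev_frame moved to.

    Every candidate position within +/- radius pixels in next_frame is
    compared with the query block; the displacement (d_row, d_col) with the
    smallest cost is returned.
    """
    query = crop(prev_frame, top, left, size)
    height, width = len(next_frame), len(next_frame[0])
    best_cost, best_shift = None, (0, 0)
    for d_row in range(-radius, radius + 1):
        for d_col in range(-radius, radius + 1):
            r, c = top + d_row, left + d_col
            # Skip candidates that would fall outside the image.
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue
            cost = sad(query, crop(next_frame, r, c, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_shift = cost, (d_row, d_col)
    return best_shift


# Toy frames: a bright 2x2 patch moves one pixel down and two pixels right.
frame_a = [[0] * 6 for _ in range(6)]
frame_b = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        frame_a[1 + r][1 + c] = 9
        frame_b[2 + r][3 + c] = 9
shift = match_block(frame_a, frame_b, top=1, left=1, size=2, radius=2)
```

The sketch relies only on local pixel values inside the search window, which is why a method of this kind is sensitive to unstable local structure and ignores prior information about the sequence.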

In addition, these traditional methods have the following disadvantages in the tracking and prediction of medical image sequences (such as ultrasound image sequences): sharp changes in the contour of the tracked target may lead to poor contour tracking; and if the target displacement between frames is too large, or the scale and direction of the target are estimated incorrectly, the tracked target may be lost.

Deep learning methods are well suited to processing ultrasound image sequences because of their strong nonlinear modeling capability and their ability to exploit the spatiotemporal information of image sequences. At present, many deep learning-based methods have been applied to the motion prediction of soft tissue in dynamic environments.

For example, one method combines a population-based statistical motion model with information from two-dimensional ultrasound sequences and uses artificial neural networks (ANNs) to predict respiratory motion in the right hepatic lobe, extending a spatial prediction method with a temporal predictor of the liver position. However, this method trains the model using clinical data from only a limited number of patients; that is, it explores only the specific motion of specific soft tissue based on limited features and does not consider the complexity of the motion of different soft tissues. When it is applied to motion prediction of other soft tissues, its prediction effect and accuracy are therefore poor.

As another example, stacked recurrent networks for video prediction use convolutional long short-term memory (ConvLSTM) as the recurrent unit. ConvLSTM aims to correctly retain and forget past information through a gated structure and then fuse it with the current spatial representation to predict video frames. However, stacked ConvLSTM adds no additional modeling capacity for step-to-step recurrent state transitions, so its short-term dynamic modeling ability is poor and it has difficulty capturing the long-term correlation of the input image sequence, resulting in poor prediction effect and low prediction accuracy.

As a further example, one method uses multi-scale convolution operations to extract features from input images, learns dense deformations between images in the input sequence, and uses cascaded spatial transformer networks (STNs) to generate future image sequences. However, this method is not effective for images with large changes in breathing motion, and it lacks global dependence among the features extracted from image sequences, resulting in poor prediction effect and low prediction accuracy.

In order to solve the above problems, an embodiment of the present application provides a soft tissue motion prediction method. The method can acquire an original image sequence, the original image sequence being used to describe the motion trajectory of soft tissue in a first time period, and input the original image sequence into a preset soft tissue motion prediction model for processing to obtain a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period. The soft tissue motion prediction model includes multiple stacked layers of long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time sequence, and each long short-term memory network unit includes a self-attention module. That is, in the embodiment of the present application, context information of the global space can be obtained through the self-attention module, and spatiotemporal features can be transmitted across layers according to the time sequence, which enhances the transmission of spatiotemporal information between images at different times. The soft tissue motion prediction model thus has stronger spatial correlation, short-term modeling ability, and long-term modeling ability, which can greatly improve its prediction effect and accuracy, thereby improving the effect and accuracy of soft tissue motion prediction, with strong ease of use and practicability.

Referring to FIG. 1 , FIG. 1 shows a schematic flowchart of a soft tissue motion prediction method provided by an embodiment of the present application. The soft tissue motion prediction method may be applied to terminal devices such as mobile phones, tablet computers, notebook computers, and desktop computers, and the embodiment of the present application does not specifically limit the types of terminal devices. As shown in Figure 1, the soft tissue motion prediction method may include:

S101. Acquire an original image sequence, where the original image sequence is used to describe the motion trajectory of soft tissue in a first time period;

The soft tissue may be the soft tissue in the target area of the HIFU treatment. The original image sequence may be an ultrasound image sequence, which may be acquired by an ultrasound image acquisition device. The ultrasound image acquisition device may be communicatively connected to the terminal device; when the ultrasound image acquisition device acquires an ultrasound image sequence that includes the soft tissue, it may send the acquired ultrasound image sequence to the terminal device so that the terminal device can perform soft tissue motion prediction.

S102. Input the original image sequence into a preset soft tissue motion prediction model for processing, and obtain a predicted image sequence output by the soft tissue motion prediction model, where the predicted image sequence is used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes multiple stacked layers of long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time sequence, and each long short-term memory network unit includes a self-attention module.

In this embodiment of the present application, the original image sequence may include multiple original images, and the predicted image sequence may include one or more predicted images; a predicted image may reflect the motion of the soft tissue at subsequent moments. The number of images included in the original image sequence and the number of images included in the predicted image sequence may be set according to actual conditions and are not specifically limited in this embodiment of the present application.

Specifically, when a predicted image sequence of length m in a future second time period needs to be predicted from an original image sequence of length n in the first time period, where the second time period immediately follows the first time period, the terminal device can input the original images x_1, x_2, ..., x_n into the soft tissue motion prediction model. Based on the original images x_1, x_2, ..., x_n, the predicted images x_2', x_3', ..., x_{n+1}', ..., x_{n+m}' can be obtained. Here, x_{n+1}', x_{n+2}', ..., x_{n+m}' form the predicted image sequence.
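The indexing in the preceding paragraph can be sketched as a simple rollout loop: the n observed frames are fed in one by one, and each later prediction is fed back as the next input until m future frames exist. This is an illustrative sketch under assumptions, not the implementation of the present application: `step_model` is a hypothetical stand-in for one time step of the trained model, and feeding predictions back after the observed frames is assumed from the index range x_2', ..., x_{n+m}'.

```python
def rollout(step_model, originals, m):
    """Return the m predicted frames x'_{n+1}, ..., x'_{n+m}.

    step_model: hypothetical callable (frame, state) -> (next_frame, state)
                standing in for one time step of the trained model.
    originals:  list of the n observed frames x_1, ..., x_n.
    m:          number of future frames to predict (at least 1).
    """
    if m < 1 or not originals:
        raise ValueError("need at least one observed frame and m >= 1")
    state = None
    prediction = None
    all_predictions = []  # collects x'_2, ..., x'_{n+m}
    # Warm-up phase: feed the observed frames x_1, ..., x_n.
    for frame in originals:
        prediction, state = step_model(frame, state)
        all_predictions.append(prediction)
    # Forecast phase: feed each prediction back in as the next input.
    for _ in range(m - 1):
        prediction, state = step_model(prediction, state)
        all_predictions.append(prediction)
    # The last m entries are x'_{n+1}, ..., x'_{n+m}.
    return all_predictions[-m:]


def toy_step(frame, state):
    """Toy model: the 'image' is a number and the next frame is frame + 1."""
    steps_seen = 0 if state is None else state
    return frame + 1, steps_seen + 1


future = rollout(toy_step, [1, 2, 3], m=2)
```

With the toy model, the observed frames 1, 2, 3 play the role of x_1, x_2, x_3, and `future` holds the stand-ins for x_4' and x_5'.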

The prediction process of the soft tissue motion prediction model will be described in detail below with reference to the network structure of the soft tissue motion prediction model.

Please refer to FIG. 2 to FIG. 4 together. FIG. 2 shows a schematic structural diagram of a soft tissue motion prediction model provided by an embodiment of the present application, unrolled along the time sequence; FIG. 3 shows a schematic structural diagram of a long short-term memory network unit provided by an embodiment of the present application; and FIG. 4 shows a schematic structural diagram of a self-attention module provided by an embodiment of the present application. As shown in FIG. 2 and FIG. 3, the soft tissue motion prediction model may include multiple stacked layers of long short-term memory (LSTM) network units; the long short-term memory network units of all layers have the same structure, and each may include a self-attention (SA) module. It should be understood that the embodiments of the present application do not specifically limit the total number of layers of long short-term memory network units. The following description takes a soft tissue motion prediction model that includes four layers of long short-term memory network units as an example.

As shown in FIG. 2, the soft tissue motion prediction model may include a first-layer long short-term memory network unit 201, a second-layer long short-term memory network unit 202, a third-layer long short-term memory network unit 203, and a fourth-layer long short-term memory network unit 204 connected in sequence. The first-layer long short-term memory network unit 201 is used to perform processing such as feature extraction and fusion on the original image x_t in the original image sequence to obtain a first spatial feature map, and to input the first spatial feature map to the second-layer long short-term memory network unit 202. The second-layer long short-term memory network unit 202 can perform feature extraction, fusion, and the like on the first spatial feature map to obtain a second spatial feature map, and input the second spatial feature map to the third-layer long short-term memory network unit 203. Similarly, the third-layer long short-term memory network unit 203 can perform feature extraction, fusion, and the like on the second spatial feature map to obtain a third spatial feature map, and input the third spatial feature map to the fourth-layer long short-term memory network unit 204. The fourth-layer long short-term memory network unit 204 can perform feature extraction, fusion, and the like on the third spatial feature map to obtain the predicted image x_{t+1}' predicted by the soft tissue motion prediction model at time t; that is, the predicted image x_{t+1}' is the image corresponding to time t+1 predicted at time t.

In the embodiment of the present application, the soft tissue motion prediction model is a trained model. During training, a scheduled sampling method can be used to handle the relationship between the predicted image sequence and the training image sequence. Because the soft tissue motion prediction model uses a stacked structure, the predicted image x_{t+2}' predicted at the next moment (such as time t+1) needs to be based on the predicted image x_{t+1}' predicted at the previous moment (such as time t). When the predicted image at the previous moment (that is, time t) is wrong, the subsequent predicted images will also be wrong, which affects the effect and accuracy of soft tissue motion prediction. To solve this problem, during training, the similarity between the predicted image x_{t+1}' predicted at time t and the real image x_{t+1} at time t+1 can be evaluated, and the weight of the predicted image x_{t+1}' at time t+1 can be set based on the similarity. That is, when the similarity is large, the weight of the real image x_{t+1} can be reduced and the weight of the predicted image x_{t+1}' can be increased; when the similarity is small, the weight of the real image x_{t+1} can be increased and the weight of the predicted image x_{t+1}' can be reduced. Whether the similarity is large or small can be determined with a preset similarity threshold: when the similarity is greater than or equal to the similarity threshold, the similarity is determined to be large; when the similarity is less than the similarity threshold, the similarity is determined to be small. The similarity threshold may be set according to the actual situation.
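The threshold rule described above can be sketched as follows. The description does not specify the similarity measure, the weight values, or how the weights are applied, so everything concrete here is an assumption made for illustration: similarity is derived from the mean squared error, the weights are fixed constants, and the next training input is a weighted blend of the predicted and real images (a sampling probability would be another possible reading). Images are represented as flat lists of pixel values.

```python
def similarity(predicted, real):
    """Map the mean squared error between two frames to a score in (0, 1].

    The choice of measure is an assumption; the source does not name one.
    """
    mse = sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(real)
    return 1.0 / (1.0 + mse)


def next_input(predicted, real, threshold=0.8, high=0.9, low=0.1):
    """Blend the predicted and real frames for the next training step.

    threshold, high, and low are illustrative values, not taken from the source.
    """
    # Large similarity: raise the weight of the prediction, lower the real image.
    # Small similarity: lower the weight of the prediction, raise the real image.
    w_pred = high if similarity(predicted, real) >= threshold else low
    w_real = 1.0 - w_pred
    return [w_pred * p + w_real * r for p, r in zip(predicted, real)]


real_frame = [1.0, 1.0]
good = next_input([1.0, 1.0], real_frame)  # accurate prediction, mostly trusted
bad = next_input([5.0, 5.0], real_frame)   # poor prediction, mostly replaced
```

In the second call the prediction is far from the real image, so the blended input stays close to the real image, which limits how far an early error can propagate into later time steps during training.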

In this embodiment of the present application, the long short-term memory network unit may include two types of feature maps: temporal feature maps (also referred to as temporal memory) [symbol missing] and spatial feature maps (also referred to as spatial memory) [symbol missing], where t is the time and l is the layer index of the long short-term memory network unit. In the l-th layer long short-term memory network unit, the temporal feature map at time t [symbol missing] depends directly on the temporal feature map of the previous moment (that is, time t-1) [symbol missing], and is controlled by the forget gate f_t, the input gate i_t, and the output gate g_t at time t. In the l-th layer long short-term memory network unit, the spatial feature map at time t [symbol missing] depends on the spatial feature map [symbol missing] of the (l-1)-th layer long short-term memory network unit. For the first-layer long short-term memory network unit, the spatial feature map at time t [symbol missing] can instead depend on the spatial feature map generated by the last-layer long short-term memory network unit at the previous moment (that is, time t-1). In this case, the input [symbol missing] to the first-layer long short-term memory network unit can be determined as [symbol missing]. That is, when l=1, the spatial feature map input to the l-th layer long short-term memory network unit is [expression missing], where 1≤l<L, L is the total number of layers of long short-term memory network units included in the soft tissue motion prediction model, and in the embodiment of the present application, L may be 4.

It should be noted that, to enhance the transmission of spatiotemporal information between images at different times, so that the spatiotemporal information of the original image sequence can be deeply extracted and the motion prediction effect of the soft tissue motion prediction model can be improved, the long short-term memory network units of each layer can transmit target spatiotemporal features across layers according to the time sequence. The time sequence may be the time sequence corresponding to the original image sequence or the time sequence corresponding to the predicted image sequence. Specifically, the (l+1)-th layer long short-term memory network unit can transmit the target spatiotemporal feature map generated at time t-1 to the l-th layer long short-term memory network unit at time t.

That is, as shown in FIG. 2, the fourth-layer long short-term memory network unit 204 can transmit the target spatiotemporal feature map it generated at time t-1 to the third-layer long short-term memory network unit 203 at time t. The third-layer long short-term memory network unit 203 can transmit the target spatiotemporal feature map it generated at time t-1 to the second-layer long short-term memory network unit 202 at time t. The second-layer long short-term memory network unit 202 can transmit the target spatiotemporal feature map it generated at time t-1 to the first-layer long short-term memory network unit 201 at time t. Optionally, the target spatiotemporal feature map transmitted to the fourth-layer long short-term memory network unit 204 may be set to 0.
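The cross-layer transfer described above can be sketched as a small routing table. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the function name `spatiotemporal_sources` and the use of `None` to stand for the zero-valued input of the top layer are hypothetical, and the default of four layers follows the value of L given in the embodiment.

```python
from typing import Dict, Optional, Tuple


def spatiotemporal_sources(
    t: int, num_layers: int = 4
) -> Dict[int, Optional[Tuple[int, int]]]:
    """Hypothetical helper: for each layer l at time t, report which
    (layer, time) pair produced the target spatiotemporal feature map
    that layer l receives."""
    sources: Dict[int, Optional[Tuple[int, int]]] = {}
    for layer in range(1, num_layers + 1):
        if layer < num_layers:
            # Layer l+1 at time t-1 feeds layer l at time t (1 <= l < L).
            sources[layer] = (layer + 1, t - 1)
        else:
            # The top layer has no layer above it; the text allows its
            # incoming spatiotemporal feature map to be set to 0.
            sources[layer] = None
    return sources


if __name__ == "__main__":
    for layer, source in spatiotemporal_sources(t=5).items():
        print(f"layer {layer} at t=5 receives from {source}")
```

With four layers this reproduces the chain 204 to 203, 203 to 202 and 202 to 201 described for FIG. 2, each shifted by one time step.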

As shown in FIG. 2, each long short-term memory network unit can process the input temporal feature map, spatial feature map and spatiotemporal feature map to obtain the target temporal feature map, target spatial feature map and target spatiotemporal feature map corresponding to that long short-term memory network unit.

At the initial moment, that is, the moment at which no temporal feature map, spatial feature map or spatiotemporal feature map from a previous moment is available (for example, when the original image x_1 of the original image sequence is input into the soft tissue motion prediction model), the terminal device can use a random initialization method to initialize the temporal feature map, spatial feature map and spatiotemporal feature map transmitted to each long short-term memory network unit. Each long short-term memory network unit can then combine the randomly generated temporal feature map, spatial feature map and spatiotemporal feature map to generate the target temporal feature map, target spatial feature map and target spatiotemporal feature map corresponding to that unit at this moment.
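The random initialization at the initial moment might look like the following sketch. Several details are assumptions made only for illustration: the text does not name a distribution (a standard normal is used here), the maps are reduced to single-channel 2D grids, and the names `random_map` and `init_states` and the dictionary keys are hypothetical.

```python
import random
from typing import Dict, List

Grid = List[List[float]]


def random_map(height: int, width: int, rng: random.Random) -> Grid:
    """One randomly initialized feature map (assumed standard normal)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def init_states(
    num_layers: int, height: int, width: int, seed: int = 0
) -> List[Dict[str, Grid]]:
    """Create the three feature maps passed to each layer at the initial moment."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [
        {
            "temporal": random_map(height, width, rng),
            "spatial": random_map(height, width, rng),
            "spatiotemporal": random_map(height, width, rng),
        }
        for _ in range(num_layers)
    ]


if __name__ == "__main__":
    states = init_states(num_layers=4, height=2, width=3)
    print(len(states), sorted(states[0]))
```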

That is, by adding additional connections between different time steps, the embodiment of the present application can pursue both long-term coherence and short-term recurrence depth, and can learn complex nonlinear transition functions between nearby frames over a short time span, which can significantly improve its short-term dynamic modeling ability. In addition, by using a triple memory mechanism with simple gated connections, in which the temporal feature map is updated horizontally, the spatial feature map is updated in a zigzag direction, and the spatiotemporal feature map is updated across time steps and layers, the spatiotemporal information of the sequence can be deeply extracted, so that the soft tissue motion prediction model has a strong dynamic modeling ability, which can effectively improve the motion prediction effect of the soft tissue motion prediction model.

The generation of the target temporal feature map, the target spatial feature map, and the target spatiotemporal feature map generated by the long short-term memory network unit will be described in detail below.

As shown in Figure 3, the update equation of the long short-term memory network unit can be:

Here, W_xg, W_hg, W_xi, W_hi, W_xf, W_hf, W_xo, W_ho and W_co are preset weight matrices; b_g, b_i, b_f and b_o are preset bias terms; σ is the sigmoid function; and x_t is the original image at time t. The update equation further involves: the target temporal feature map generated by the l-th layer long short-term memory network unit at time t; the target temporal feature map generated by the l-th layer long short-term memory network unit at time t-1; the input feature map of the self-attention module (that is, the feature map input to the self-attention module), which can be aggregated from temporal feature maps and spatiotemporal feature maps; SA, the processing of the self-attention module; the candidate spatial feature map and the candidate spatiotemporal feature map aggregated by the self-attention module; and the target spatial feature map output by the (l-1)-th layer long short-term memory network unit at time t. It should be understood that when l=1, the last of these is the target spatial feature map generated by the last layer of long short-term memory network units at time t-1.

The process by which the self-attention module aggregates its inputs to obtain the candidate spatial feature map and the candidate spatiotemporal feature map is described below.

As shown in FIG. 4, the self-attention module may include a first self-attention module 401 and a second self-attention module 402. The first self-attention module 401 and the second self-attention module 402 are connected in parallel, and the first self-attention module 401 shares its Query with the second self-attention module 402. The first self-attention module 401 is used to generate candidate spatiotemporal feature maps, and the second self-attention module 402 is used to generate candidate spatial feature maps.

As shown in FIG. 4, the first self-attention module 401 can first map its input feature map to the feature spaces Query, Key and Value by 1×1 convolutions whose preset weight matrices are W_lq, W_lk and W_lv, respectively, which yields the Query Q_c, the Key K_l and the Value. Here, C is the number of channels of the input feature map, Q_c and K_l have the same number of channels as each other, and N is the number of elements of the input feature map.

Then, the similarity between every two elements of the input feature map can be calculated by multiplying Q_c and K_l, which gives the similarity between the i-th element and the j-th element for every pair (i, j). The softmax function can then be used to normalize the similarities to obtain a_l. In the corresponding formulas, T denotes matrix transpose, L_t,i is the i-th element of the input feature map, and L_t,j is its j-th element; L_t,i and L_t,j are feature vectors of size C×1.
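The similarity and normalization steps described for the first self-attention module match the usual dot-product form of self-attention. The block below is a hedged reconstruction under that assumption, not the original formulas: the symbol e_{l;i,j} is chosen by analogy with the e_{m;i,j} named for the second module, the summation index k is introduced here, and normalizing over the second index is assumed so that the weights applied to the N positions sum to one.

```latex
e_{l;i,j} = \left( W_{lq}\, L_{t,i} \right)^{T} \left( W_{lk}\, L_{t,j} \right),
\qquad i, j \in \{1, \dots, N\}

a_{l;i,j} = \frac{\exp\left( e_{l;i,j} \right)}{\sum_{k=1}^{N} \exp\left( e_{l;i,k} \right)}
```

Under this reading, each L_{t,i} is a C×1 vector, and the two 1×1 convolutions project it to the common channel count shared by Q_c and K_l before the inner product is taken.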

As shown in FIG. 4 , the first self-attention module 401 can generate the candidate spatiotemporal feature map according to the following formula:

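where a_l;i,j is the similarity between the i-th element and the j-th element of the input feature map, the term weighted by a_l;i,j is the j-th element of the feature map being aggregated, and N is the total number of elements contained in the input feature map.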

As shown in FIG. 4, the second self-attention module 402 can map its input feature map to the Key K_m and the Value V_m by 1×1 convolutions with W_mk and W_mv as the respective weight matrices. Then, the similarity e_m;i,j between the i-th element and the j-th element can be calculated by multiplying the Query Q_c and the Key K_m. The softmax function can then be used to normalize the similarities to obtain a_m.

Specifically, the second self-attention module 402 can generate the candidate space feature map according to the following formula:

where a_m;i,j is the similarity between the i-th element and the j-th element, the term weighted by a_m;i,j is the j-th element of the feature map being aggregated, and R is the total number of elements contained.

That is, the feature value of the ith element in the intermediate feature map Z _m can be calculated by the weighted sum of all N positions in the value V _m .
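The two parallel branches with a shared Query can be illustrated with a minimal standard-library sketch. It is not the applicant's implementation, and several choices are assumptions: feature maps are flattened to lists of N channel vectors, a 1×1 convolution is written as the same linear map applied at every position, and the names `project`, `attend` and `dual_branch_attention` are hypothetical. The sketch stops at the intermediate feature maps Z_l and Z_m and does not model the later gating.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def project(weight: Matrix, feats: List[Vector]) -> List[Vector]:
    """1x1 convolution: apply the same linear map at every position."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in feats]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of similarities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[Vector], key: List[Vector], value: List[Vector]) -> List[Vector]:
    """For each position i, a weighted sum of all positions of `value`."""
    channels = len(value[0])
    out = []
    for q in query:
        sims = [sum(a * b for a, b in zip(q, k)) for k in key]  # dot products
        weights = softmax(sims)
        out.append([sum(w * v[c] for w, v in zip(weights, value)) for c in range(channels)])
    return out


def dual_branch_attention(
    feat_l: List[Vector], feat_m: List[Vector],
    w_lq: Matrix, w_lk: Matrix, w_lv: Matrix, w_mk: Matrix, w_mv: Matrix,
) -> Tuple[List[Vector], List[Vector]]:
    q_c = project(w_lq, feat_l)  # Query computed once and shared by both branches
    z_l = attend(q_c, project(w_lk, feat_l), project(w_lv, feat_l))
    z_m = attend(q_c, project(w_mk, feat_m), project(w_mv, feat_m))
    return z_l, z_m


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    z_l, z_m = dual_branch_attention(
        [[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]], eye, eye, eye, eye, eye
    )
    print(z_l, z_m)
```

Computing the Query once and reusing it keeps the two branches aligned position by position, so that the i-th rows of Z_l and Z_m describe the same location.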

Exemplarily, as shown in FIG. 3, the long short-term memory network unit may process the candidate spatiotemporal feature map generated by the first self-attention module and the candidate spatial feature map generated by the second self-attention module according to the following formula, to obtain the target spatiotemporal feature map and the target spatial feature map output by the long short-term memory network unit:

It can be understood that after the last layer of long short-term memory network units obtains the target temporal feature map, the target spatial feature map and the target spatiotemporal feature map, these feature maps can be mapped back to the pixel space to obtain the predicted image output by the soft tissue motion prediction model. In addition, each long short-term memory network unit can transmit the obtained target temporal feature map, target spatiotemporal feature map and target spatial feature map to the long short-term memory network units at the next moment, so as to perform image prediction for the next moment.

In the embodiment of the present application, an original image sequence may be obtained, the original image sequence being used to describe the motion trajectory of the soft tissue in a first time period. The original image sequence is input into a preset soft tissue motion prediction model for processing, and a predicted image sequence output by the soft tissue motion prediction model is obtained, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period. The soft tissue motion prediction model includes stacked multi-layer long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time series, and each long short-term memory network unit includes a self-attention module. That is, in the embodiment of the present application, the context information of the global space can be obtained through the self-attention module, and the spatiotemporal features can be transmitted across layers according to the time series, so as to enhance the transmission of spatiotemporal information between images at different times. As a result, the soft tissue motion prediction model has stronger spatial correlation, short-term modeling ability and long-term modeling ability, which can greatly improve the prediction effect and accuracy of the soft tissue motion prediction model, thereby improving the effect and accuracy of soft tissue motion prediction.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Corresponding to the soft tissue motion prediction method described in the above embodiments, FIG. 5 shows a structural block diagram of the soft tissue motion prediction apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.

Referring to FIG. 5, the soft tissue motion prediction apparatus may include:

an image sequence acquisition module 501, configured to acquire an original image sequence, the original image sequence being used to describe the motion trajectory of the soft tissue in the first time period;

a soft tissue motion prediction module 502, configured to input the original image sequence into a preset soft tissue motion prediction model for processing, and obtain a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes stacked multi-layer long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time series, and the long short-term memory network unit includes a self-attention module.
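The division into an acquisition module and a prediction module can be sketched as two small classes. This is a structural illustration only: the class names, the `run_apparatus` helper and the `repeat_last_frame` stand-in model are hypothetical, and the stand-in simply repeats the last frame rather than implementing the stacked long short-term memory model described in the text.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]


@dataclass
class ImageSequenceAcquisitionModule:
    """Counterpart of module 501: supplies the original image sequence."""
    source: Callable[[], List[Image]]

    def acquire(self) -> List[Image]:
        return list(self.source())


@dataclass
class SoftTissueMotionPredictionModule:
    """Counterpart of module 502: wraps a preset prediction model."""
    model: Callable[[List[Image]], List[Image]]

    def predict(self, original: List[Image]) -> List[Image]:
        return self.model(original)


def repeat_last_frame(frames: List[Image]) -> List[Image]:
    """Stand-in model (NOT the model of the text): repeats the last frame twice."""
    return [frames[-1], frames[-1]]


def run_apparatus(
    acquisition: ImageSequenceAcquisitionModule,
    prediction: SoftTissueMotionPredictionModule,
) -> List[Image]:
    """Acquire the original sequence, then predict the following sequence."""
    return prediction.predict(acquisition.acquire())


if __name__ == "__main__":
    frames = [[[0.0]], [[1.0]]]
    acq = ImageSequenceAcquisitionModule(lambda: frames)
    pred = SoftTissueMotionPredictionModule(repeat_last_frame)
    print(run_apparatus(acq, pred))
```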

Optionally, in the soft tissue motion prediction model, the (l+1)-th layer long short-term memory network unit transmits the target spatiotemporal feature map generated at time t-1 to the l-th layer long short-term memory network unit at time t, 1≤l<L, where L is the total number of layers of long short-term memory network units included in the soft tissue motion prediction model.

In a possible implementation manner, the self-attention module may include a first self-attention module and a second self-attention module, the first self-attention module is connected in parallel with the second self-attention module, the first self-attention module is used to generate candidate spatiotemporal feature maps, and the second self-attention module is used to generate candidate spatial feature maps.

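In the formula by which the first self-attention module generates the candidate spatiotemporal feature map, the weights are the similarities between the i-th element and the j-th element of the input feature map, the weighted term is the j-th element of the feature map being aggregated, and N is the total number of elements contained.

In the formula by which the second self-attention module generates the candidate spatial feature map, the weights are the similarities between the i-th element and the j-th element, the weighted term is the j-th element of the feature map being aggregated, and R is the total number of elements contained.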

It can be understood that the long short-term memory network unit can process the candidate spatiotemporal feature map generated by the first self-attention module and the candidate spatial feature map generated by the second self-attention module according to the following formula, to obtain the target spatiotemporal feature map and the target spatial feature map output by the long short-term memory network unit:

It should be noted that the information exchange, execution process and other contents between the above-mentioned devices/units are based on the same concept as the method embodiments of the present application. For specific functions and technical effects, please refer to the method embodiments section. It is not repeated here.

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit, and the above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 6 in this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, the steps in any of the foregoing embodiments of the soft tissue motion prediction method are implemented.

The terminal device 6 may be a computing device such as a desktop computer, a notebook computer or a palmtop computer. The terminal device may include, but is not limited to, a processor 60 and a memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown, combine some components, or have different components; for example, it may also include input and output devices, network access devices, and the like.

The processor 60 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6 in some embodiments, such as a hard disk or a memory of the terminal device 6. In other embodiments, the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk equipped on the terminal device 6, a smart media card (SMC), a secure digital (SD) card, a flash card, etc. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store an operating system, an application program, a boot loader, data, and other programs, such as the program code of the computer program. The memory 61 can also be used to temporarily store data that has been output or will be output.

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.

The embodiments of the present application further provide a computer program product. When the computer program product runs on a terminal device, the terminal device is caused to implement the steps in the foregoing method embodiments.

The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application can be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when it is executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file or some intermediate form, and the like. The computer-readable storage medium may include at least: any entity or device capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk or an optical disc. In some jurisdictions, under legislation and patent practice, computer-readable storage media may not be electrical carrier signals and telecommunication signals.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division, and in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above-mentioned embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above-mentioned embodiments can still be modified, or some technical features thereof can be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the present application, and should be included within the scope of protection of this application.

Claims

A soft tissue motion prediction method, comprising:

acquiring an original image sequence, the original image sequence is used to describe the motion trajectory of the soft tissue in the first time period;

inputting the original image sequence into a preset soft tissue motion prediction model for processing, and obtaining a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes stacked multi-layer long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time series, and the long short-term memory network unit includes a self-attention module.
The soft tissue motion prediction method according to claim 1, wherein the long short-term memory network unit transmits the target spatiotemporal feature across layers according to a time series, comprising:

the (l+1)-th layer long short-term memory network unit transmits the target spatiotemporal feature map generated at time t-1 to the l-th layer long short-term memory network unit at time t, 1≤l<L, where L is the total number of layers of long short-term memory network units included in the soft tissue motion prediction model.
The soft tissue motion prediction method according to claim 1 or 2, wherein the self-attention module comprises a first self-attention module and a second self-attention module, the first self-attention module and the second self-attention module are connected in parallel, the first self-attention module is used for generating candidate spatiotemporal feature maps, and the second self-attention module is used for generating candidate spatial feature maps.
The soft tissue motion prediction method according to claim 3, wherein the first self-attention module generates the candidate spatiotemporal feature map according to the following formula:

where the formula involves: the candidate spatiotemporal feature map generated by the first self-attention module in the l-th layer long short-term memory network unit at time t; the preset weight matrices W_f, W_lv, W_xo, W_ho and W_co; the input feature map corresponding to the first self-attention module in the l-th layer long short-term memory network unit at time t; Z_l, the intermediate feature map generated by the first self-attention module from that input feature map; Z_l;i, the i-th element of Z_l; a_l;i,j, the similarity between the i-th element and the j-th element of the input feature map; the j-th element of the feature map being aggregated; N, the total number of elements contained; the sigmoid function σ; x_t, the original image at time t; the target spatiotemporal feature map transmitted by the (l+1)-th layer long short-term memory network unit at time t-1; the target temporal feature map generated by the l-th layer long short-term memory network unit at time t; and the preset bias term b_o.
The soft tissue motion prediction method according to claim 3, wherein the second self-attention module generates the candidate space feature map according to the following formula:

where the formula involves: the candidate spatial feature map generated by the second self-attention module in the l-th layer long short-term memory network unit at time t; the preset weight matrices W_z and W_mv; the target spatial feature map output by the (l-1)-th layer long short-term memory network unit at time t; Z_m, the intermediate feature map generated by the second self-attention module from that target spatial feature map; Z_m;i, the i-th element of Z_m; a_m;i,j, the similarity between the i-th element and the j-th element; the j-th element of the feature map being aggregated; and R, the total number of elements contained.
The soft tissue motion prediction method according to claim 3, characterized in that the long short-term memory network unit processes the candidate spatiotemporal feature map generated by the first self-attention module and the candidate spatial feature map generated by the second self-attention module according to the following formula, to obtain the target spatiotemporal feature map and the target spatial feature map output by the long short-term memory network unit:

where the formula involves: the target spatiotemporal feature map output by the l-th layer long short-term memory network unit at time t; the target spatial feature map output by the l-th layer long short-term memory network unit at time t; the candidate spatiotemporal feature map generated by the first self-attention module in the l-th layer long short-term memory network unit at time t; the candidate spatial feature map generated by the second self-attention module in the l-th layer long short-term memory network unit at time t; the element-wise product; the sigmoid function σ; the preset weight matrices W_ho' and W_mg; and the preset bias terms b_o' and b_g'.
A soft tissue motion prediction device, comprising:

an image sequence acquisition module, configured to acquire an original image sequence, the original image sequence being used to describe the motion trajectory of the soft tissue in the first time period;

a soft tissue motion prediction module, configured to input the original image sequence into a preset soft tissue motion prediction model for processing, and obtain a predicted image sequence output by the soft tissue motion prediction model, the predicted image sequence being used to describe the predicted motion trajectory of the soft tissue in a second time period adjacent to the first time period; wherein the soft tissue motion prediction model includes stacked multi-layer long short-term memory network units, the long short-term memory network units transmit target spatiotemporal features across layers according to a time series, and the long short-term memory network unit includes a self-attention module.
The soft tissue motion prediction device according to claim 7, wherein in the soft tissue motion prediction model, the (l+1)-th layer long short-term memory network unit transmits the target spatiotemporal feature map generated at time t-1 to the l-th layer long short-term memory network unit at time t, 1≤l<L, where L is the total number of layers of long short-term memory network units included in the soft tissue motion prediction model.
A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the soft tissue motion prediction method according to any one of claims 1 to 6 is implemented.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the soft tissue motion prediction method according to any one of claims 1 to 6 is implemented.