GB2576945A - Image processing methods - Google Patents

Image processing methods

Info

Publication number
GB2576945A
Authority
GB
United Kingdom
Prior art keywords
representation
image
spatial vector
vector space
under
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1814649.8A
Other versions
GB201814649D0 (en)
Inventor
Schlemper Jo
Rueckert Daniel
Oktay Ozan
Duan Jinming
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Imperial College of Science Technology and Medicine
Original Assignee
Imperial College of Science Technology and Medicine
Imperial College of Science and Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Imperial College of Science Technology and Medicine, Imperial College of Science and Medicine filed Critical Imperial College of Science Technology and Medicine
Priority to GB1814649.8A priority Critical patent/GB2576945A/en
Publication of GB201814649D0 publication Critical patent/GB201814649D0/en
Publication of GB2576945A publication Critical patent/GB2576945A/en
Withdrawn legal-status Critical Current


Classifications

All classifications fall under G (Physics), G06 (Computing; Calculating or Counting):

    • G06T 7/11 Region-based segmentation
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/168 Segmentation; Edge detection involving transform domain methods
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G06N 20/00 Machine learning
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G06T 2207/10116 X-ray image
    • G06T 2207/20048 Transform domain processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30108 Industrial image inspection

Abstract

Receiving image data corresponding to an under-sampling of an image, which may be (a) represented in a non-spatial vector space (such as Fourier space, frequency space or k-space), wherein a first transform (e.g. a Fourier transform) exists for mapping the non-spatial representation of the image to a representation in a first spatial vector space (e.g. the standard image space). Preferably this representation comes from under-sampled magnetic resonance (MRI) data. Alternatively, the under-sampling of the image may be (b) represented in a number of second spatial vector spaces, wherein a second transform (for instance a back projection) exists for mapping the image representation to a representation in the first spatial vector space. Alternatively, the under-sampling of the image may be (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) or transforming representation (b) using the respective first or second transform. Preferably the network is a convolutional neural network, or a TL-net architecture. The trained machine-learning model is configured to determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled image.

Description

(71) Applicant(s):
Imperial College of Science, Technology and Medicine, South Kensington Campus, Faculty Building, Exhibition Road, London, SW7 2AZ, United Kingdom

(72) Inventor(s):
Jo Schlemper
Daniel Rueckert
Ozan Oktay
Jinming Duan

(51) INT CL:
G06T 7/10 (2017.01)

(56) Documents Cited:
EP 3382417 A2
CN 107292275 A
Usman et al., "Compressive manifold learning: Estimating one-dimensional respiratory motion directly from undersampled k-space data", Magnetic Resonance in Medicine 72:1130-1140 (2013)

(58) Field of Search:
INT CL A61B, G06F, G06K, G06N, G06T
Other: EPODOC, WPI, INSPEC, Internet, Patent Fulltext

(74) Agent and/or Address for Service:
Venner Shipley LLP, 200 Aldersgate, London, EC1A 4HD, United Kingdom

(54) Title of the Invention: Image processing methods

Abstract Title: Using or training a machine learning model to output a parameter or segmentation map of an undersampled image without reconstructing a fully-sampled image

(57) Receiving image data corresponding to an under-sampling of an image, which may be (a) represented in a non-spatial vector space (such as Fourier space, frequency space or k-space), wherein a first transform (e.g. a Fourier transform) exists for mapping the non-spatial representation of the image to a representation in a first spatial vector space (e.g. the standard image space). Preferably this representation comes from under-sampled magnetic resonance (MRI) data. Alternatively, the under-sampling of the image may be (b) represented in a number of second spatial vector spaces, wherein a second transform (for instance a back projection) exists for mapping the image representation to a representation in the first spatial vector space. Alternatively, the under-sampling of the image may be (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) or transforming representation (b) using the respective first or second transform. Preferably the network is a convolutional neural network, or a TL-net architecture. The trained machine-learning model is configured to determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled image.
[Drawing sheets 1/15 to 15/15: flow charts for Figures 2, 7 and 8 (steps S0 to S27), k-space and image-space axes (wavevector ky; position y), an autoencoder network diagram, Dice-score plots, PCA and t-SNE component plots, and results tables for left- and right-ventricular end-diastolic and end-systolic volumes, LV mass and ejection fraction.]
Image processing methods
Field of the invention
The present invention relates to methods of training and using a machine learning model to extract one or more parameters and/or segmentation maps from undersampled image data.
Background
Advanced volume imaging techniques such as Magnetic Resonance Imaging or X-ray computed tomography may reveal additional information as compared to two-dimensional imaging projections. However, obtaining tomographic images, for example three-dimensional volume images, is often time consuming because a large number of input measurements or images may be needed before the desired image or view may be reconstructed to a desired resolution. For example, in X-ray computed tomography, X-ray projection, or shadow, images for a large number of different angles are needed. If a smaller number of images are used, a heavily streaked image is obtained on the reconstruction plane, which is likely to require significant post-processing before any useful features may be discerned. In magnetic resonance imaging, the frequency, or k-space, grid needs to be populated before reconstruction of a real image. If the k-space grid is not fully populated, then the resulting heavily aliased image is likely to require significant post-processing before any useful features may be discerned. The acquisition rate of volume imaging techniques such as Magnetic Resonance Imaging or X-ray computed tomography may lead to high costs and low throughputs, limiting potential applications.
One exemplary application of advanced volume imaging techniques is the use of Magnetic Resonance Imaging for cardiac imaging. Cardiovascular magnetic resonance (CMR) imaging enables accurate quantification of cardiac chamber volume, ejection fraction and myocardial mass, which are important for diagnosing, assessing and monitoring cardiovascular diseases (CVDs), a leading cause of death globally. However, a limitation of CMR is the slow acquisition time. A routine CMR protocol can take from 20 to 60 minutes, which makes the tool costly and less accessible to worldwide populations. In addition, CMR often requires breath-holds which can be difficult for patients; therefore, accelerating the CMR acquisition has been investigated.
Approaches have been proposed for accelerated MR imaging. Lustig, M., Donoho, D., Pauly, J.M., "Sparse MRI: The application of compressed sensing for rapid MR imaging", Magnetic Resonance in Medicine 58(6), 1182-1195 (2007), discusses the applications of compressed sensing. Hammernik, K., Klatzer, T., Kobler, E., Recht, M.P., Sodickson, D.K., Pock, T., Knoll, F., "Learning a variational network for reconstruction of accelerated MRI data", Magnetic Resonance in Medicine (2017), discusses deep learning approaches.
Reconstructing images from accelerated and under-sampled MRI is an ill-posed problem, and approaches try to exploit redundancies or assumptions on the underlying data to try and resolve the aliasing caused by sub-Nyquist sampling. In the case of dynamic cardiac cine (video) reconstructions, high spatiotemporal redundancy and sparsity can be exploited; however, it has been reported that the acceleration factor for a near-perfect reconstruction is currently limited to 9, see Schlemper, J., Caballero, J., Hajnal, J.V., Price, A., Rueckert, D., "A deep cascade of convolutional neural networks for dynamic MR image reconstruction", IEEE Transactions on Medical Imaging 37 (2017).
Post-processing methods for computed tomography have also been reported which aim to allow reconstructing or estimating a fully sampled image using back-projection methods. For example, "A deep learning architecture for limited-angle computed tomography reconstruction", Hammernik, Kerstin, et al., Bildverarbeitung für die Medizin 2017, Springer Vieweg, Berlin, Heidelberg, 2017, 92-97.
Summary
According to a first aspect of the invention, there is provided a method which includes receiving image data corresponding to an under-sampling of an image. The under-sampling of the image may be (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space. Alternatively, the under-sampling of the image may be (b) represented in a number of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces to a representation in the first spatial vector space. Alternatively, the under-sampling of the image may be (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by transforming representation (b) using the second transform. The method also includes using the image data as input to a trained machine-learning model configured to determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled representation of the image in the first spatial vector space. The method also includes obtaining, as output of the trained machine-learning model, the one or more parameters and/or segmentation maps.
The one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space may be output. The first spatial vector space may be normal three-dimensional space, sometimes also referred to as real space or as the desired image plane or volume. The first spatial vector space may be a two-dimensional plane within normal three-dimensional space. The first spatial vector space may be a three-dimensional volume.
No individual second spatial vector space spans the first spatial vector space. A combination of two or more of the second spatial vector spaces may span the first spatial vector space.
The non-spatial vector space may be frequency space (also referred to as “k-space”).
The representation of the image in the non-spatial vector space may be the reciprocal lattice of the representation of the image in the first spatial vector space.
The method may include receiving the image data corresponding to representation (a) or (b), generating image data corresponding to representation (c), and using the image data corresponding to representation (c) as input to the trained machine-learning model.
The image data according to representation (c) may correspond to an aliased image obtained by performing an inverse Fourier transform on a part-filled frequency space, for example by filling un-measured frequencies with zeros. The image data according to representation (c) may correspond to a streaked image obtained by back-projecting an insufficient, or relatively low, number of shadow images, for example X-ray shadow images, onto a reconstruction plane. The image data according to representation (c) may be any real-space image which exhibits degraded integrity and/or quality as a consequence of performing the first or second transform on under-sampled image data in representation (a) or representation (b).
The method may be applied to each frame of video data which includes a number of frames. Each frame may correspond to image data. The method may be applied to a plurality of frames of video data concurrently.
The image data may correspond to representation (a). The first transform may be an inverse Fourier transform.
The image data may correspond to representation (b). The second transform may be a back-projection.
The image data may be magnetic resonance image data, and the non-spatial vector space may correspond to frequency space.
The image data may be X-ray transmission data. The number of second spatial vector spaces may correspond to a number of X-ray projection images obtained from different angles of incidence to an imaged object.
The first spatial vector space may encompass a two-dimensional slice of an imaged object, and each second spatial vector space may correspond to a projection of the two-dimensional slice onto a line oriented in a particular direction. The first spatial vector space may encompass a three-dimensional volume of an imaged object, and each second spatial vector space may correspond to a projection of the three-dimensional volume onto a plane oriented in a particular direction.
The method may also include measuring the image data.
The image data may correspond to a manufactured item. One or more parameters output by the trained machine-learning model may include one or more of a dimension of a feature of the manufactured item, an area of a feature of the manufactured item, a volume of a feature of the manufactured item and/or a deviation of the shape of the whole or any part of the manufactured item from an expected (or designed for) shape.
Segmentation maps may correspond to one or more specific features of the manufactured item. Segmentation maps may correspond to one or more undesired features of the manufactured item, for example, voids.
The one or more parameters and/or segmentation maps may be used for quality control of the manufactured items. The image data may be measured as part of a manufacturing line. In response to the one or more parameters and/or segmentation maps indicating that one or more features of a manufactured item exceeds a predetermined tolerance, the manufactured item may be labelled or tagged, either physically or electronically, as requiring more detailed scanning and/or human inspection. In response to the one or more parameters and/or segmentation maps indicating that one or more features of a manufactured item exceeds a predetermined tolerance, the manufactured item may be diverted off the manufacturing line for more detailed scanning and/or human inspection.
The image data may correspond to the whole or any part of a human or animal anatomy. The image data may correspond to an organ such as a heart. One or more parameters may include left ventricular (LV) end-systolic (ES) and/or end-diastolic (ED) volumes (ESV/EDV), right ventricular (RV) ESV and/or EDV, LV mass (LVM) and/or ejection fraction (EF).
The image data may correspond to a random under-sampling of the image when represented in (a) the non-spatial vector space, or (b) the number of second spatial vector spaces.
The trained machine-learning model may take the form of a convolutional neural network. The trained machine-learning model may take the form of a TL-network.
According to a second aspect of the invention, there is provided a computer program product storing a machine-learning model for use in the method.
According to a third aspect of the invention, there is provided a method of training a machine-learning model to receive input in the form of image data corresponding to an under-sampling of an image which is (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space, or (b) represented in a number of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces to a representation in the first spatial vector space, or (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by transforming representation (b) using the second transform, and to determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled representation of the image in the first spatial vector space. The method of training the machine-learning model includes receiving a training set, the training set including a number of training images and, for each training image, a corresponding one or more parameters and/or segmentation maps of the representation of the training image in the first spatial vector space. The method of training the machine-learning model also includes, for each training image, generating a corresponding under-sampled training image based on an under-sampling fraction. All of the under-sampled training images are (a) represented in the non-spatial vector space, (b) represented in the plurality of second vector spaces, or (c) represented in the first spatial vector space, wherein each representation in the first spatial vector space is obtained by transforming an under-sampled training image from representation (a) using the first transform or by transforming an under-sampled training image from representation (b) using the second transform. The method of training the machine-learning model also includes training the machine-learning model using the under-sampled training images as inputs and the corresponding one or more parameters and/or segmentation maps as the target output.
The under-sampling fraction may be greater than zero and less than one. The training images may correspond to multiple frames of one or a number of videos.
The method of training the machine-learning model may also include training the machine-learning model across a plurality of epochs. For each epoch the under-sampled training images may be generated according to a random sampling of the corresponding training images. Random may encompass pseudo-random.
An under-sampled training image may be generated according to a random sampling obtained in representation (a) or representation (b). An under-sampled training image in representation (c) may be obtained by transforming a random sampling obtained in representation (a) using the first transform. An under-sampled training image in representation (c) may be obtained by transforming a random sampling obtained in representation (b) using the second transform.
The method of training the machine-learning model may also include training the machine learning model using the complete training images and the corresponding one or more parameters and/or segmentation maps, before training the machine learning model using the under-sampled training images as inputs. If the machine learning model is intended to process image data in representation (a), then complete training images in representation (a) may be used for initial training. Similarly, if the machine learning model is intended to process image data in representations (b) or (c), then complete training images in representation (b) or (c) may be used for initial training as appropriate.
The under-sampling fraction may be decreased from unity to a target under-sampling fraction across a plurality of training epochs. The under-sampling fraction may be decreased at fixed intervals.
The under-sampling fraction may be decreased in dependence upon a prediction accuracy of the machine-learning model when applied to a testing set which includes a plurality of under-sampled test images and, for each test image, a corresponding one or more parameters and/or segmentation maps of the representation of the test image in the first spatial vector space. The under-sampled test images may be generated based on sampling test images according to the target under-sampling fraction.
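Purely by way of illustration, a decreasing under-sampling schedule of this kind could be realised in many ways. The short Python sketch below shows the fixed-interval variant; the function name, the step size and the interval are assumptions for the example only and are not taken from the specification.

```python
# Minimal sketch (assumption): an under-sampling fraction schedule that starts
# at 1.0 (fully sampled) and decays towards a target fraction at fixed epoch
# intervals, as one possible realisation of the curriculum described above.
def undersampling_schedule(n_epochs, target_fraction, decay_every=5, step=0.1):
    """Yield one under-sampling fraction per training epoch."""
    fraction = 1.0
    for epoch in range(n_epochs):
        yield fraction
        if (epoch + 1) % decay_every == 0:
            fraction = max(target_fraction, fraction - step)

if __name__ == "__main__":
    for epoch, f in enumerate(undersampling_schedule(30, target_fraction=0.25)):
        print(f"epoch {epoch:2d}: sampling fraction = {f:.2f}")
```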
The machine-learning model may take the form of a convolutional neural network. The trained machine-learning model may take the form of a TL-network.
According to a fourth aspect of the invention, there is provided a computer program product storing a machine-learning model trained according to the method.
According to a fifth aspect of the invention, there is provided apparatus including measurement means configured to obtain image data corresponding to an under-sampling of an image. The under-sampling of the image may be (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space. The under-sampling of the image may be (b) represented in a number of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces to a representation in the first spatial vector space. The under-sampling of the image may be (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by transforming representation (b) using the second transform. The apparatus also includes a controller configured to process the image data according to the method.
In any aspect of the invention, an under-sampling fraction for under-sampled image data may be less than or equal to 0.01, less than or equal to 0.05, less than or equal to 0.1, less than or equal to 0.25, less than or equal to 0.5, or less than or equal to 0.75.
Brief Description of the Drawings
Certain embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figure 1 illustrates an imaging system.
Figure 2 is a process flow diagram of a method of processing image data.
Figure 3 illustrates a spatial vector space and a non-spatial vector space connected by a transform;
Figure 4 illustrates a first example of under-sampling image data in frequency space;
Figure 5 illustrates a second example of under-sampling image data in frequency space;
Figure 6 illustrates an example of under-sampling image data for a back-projection method;
Figure 7 is a process flow diagram of a first method of training a machine learning model;
Figure 8 is a process flow diagram of a second method of training a machine learning model;
Figure 9 illustrates a first exemplary configuration of a machine learning model;
Figure 10 illustrates a second exemplary configuration of a machine learning model;
Figure 11 presents a comparison of Dice scores for the machine learning models shown in Figures 9 and 10, when applied to data for the left ventricle of a human heart;
Figure 12 presents a comparison of Dice scores for the machine learning models shown in Figures 9 and 10, when applied to data for the right ventricle of a human heart;
Figure 13 presents a comparison of Dice scores for the machine learning models shown in Figures 9 and 10, when applied to data for the myocardium of a human heart;
Figures 14A to 14C present segmentation maps obtained using the machine learning model shown in Figure 10;
Figures 15A to 15C present segmentation maps obtained using the machine learning model shown in Figure 9;
Figure 16 presents a ground truth segmentation of the data shown in Figures 14A to 15C;
Figures 17A to 17C present temporally averaged images corresponding to 1, 10 or 20 lines of k-space data;
Figures 18 and 19 present principal component analyses of results obtained using the machine learning model shown in Figure 10; and
Figures 20 and 21 present principal component analyses of results obtained using the machine learning model shown in Figure 9.
Detailed Description of Certain Embodiments
In the following, like parts are denoted by like reference numbers.
Advanced volume imaging techniques such as Magnetic Resonance Imaging or X-ray computed tomography may reveal additional information as compared to two-dimensional imaging projections. However, obtaining tomographic images, for example three-dimensional volume images, is often time consuming because a large number of input measurements or images are needed before the desired image may be reconstructed. For example, in X-ray computed tomography, X-ray projection images (also referred to as shadow images) for a large number of different angles are needed.
In magnetic resonance imaging, the frequency, or k-space, grid needs to be populated before transformation to a real image.
In this specification, methods are described which take a different approach to accelerating the acquisition of relevant information from under-sampled image data.
This approach may be referred to as "application driven imaging", and may allow for significant improvements in acceleration factor. An important point is the realisation that, in many cases, the images acquired using, for example, X-ray computed tomography or magnetic resonance imaging, are not an end in themselves. Instead, images are often obtained with the objective of analysing the images in order to extract particular parameters or other information which is embodied in the image. For example, in medical applications, imaging is usually concerned with extracting clinically relevant parameters which may be used by medical practitioners for the purposes of diagnosis or guidance. In other application areas, for example material safety testing or industrial quality control, the objective may be to find and measure cracks, voids or parameters of other geometric features. Such parameters are usually extracted from images using post-processing methods such as image segmentation.
According to methods of the present specification, under-sampled image data is acquired then processed in a single step to directly obtain the desired outputs, bypassing the need to reconstruct or estimate a full resolution image. This may be possible because the desired outputs, for example one or more parameters or segmentation maps of an imaged object, may typically be more compressible than a full resolution image.
Referring to Figure 1, an imaging system is shown.
The imaging system 1 includes measurement means 2, in the form of imaging equipment, configured to obtain image data 3 of an imaged object 4. For example, the imaging equipment 2 may be a magnetic resonance imaging machine, or an X-ray computed tomography machine (allowing capture of X-ray images at a range of angles).
The imaging equipment 2 operates by directing electromagnetic radiation 5 at the imaged object 4, and the image data 3 corresponds to electromagnetic radiation 5 which is transmitted, reflected, attenuated, scattered, and so forth, by interactions with the imaged object 4. The image data 3 often requires further processing before an image of the object 4 may be formed, depending on the imaging technique employed.
The image data 3 is provided to a controller 6 which includes at least machine-readable memory 7 and one or more processors 8. The one or more processors 8 retrieve a trained machine-learning model 9 stored in the memory 7. The image data 3 is provided as input to the trained machine-learning model 9. The trained machine-learning model 9 is configured to determine one or more parameters and/or segmentation maps 10 based on the image data 3. The controller 6 outputs the determined one or more parameters and/or segmentation maps 10 for subsequent analysis. Optionally, the one or more parameters and/or segmentation maps 10 may be provided to a process control module 11 in order to permit the operation of equipment handling or processing the imaged object 4 to be controlled at least partially in dependence upon the one or more parameters and/or segmentation maps 10.
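As a rough illustration only (the specification does not prescribe any particular software framework, and the file name, model format and helper below are assumptions), the role of the controller 6 might be expressed in Python as follows:

```python
# Illustrative sketch: run under-sampled image data through a trained model and
# return the predicted parameters and/or segmentation map, without ever forming
# a fully-sampled image.  "trained_model.pt" is a hypothetical file name.
import numpy as np
import torch

def process_image_data(image_data: np.ndarray, model_path: str = "trained_model.pt"):
    model = torch.jit.load(model_path)                       # trained machine-learning model 9
    model.eval()
    x = torch.from_numpy(image_data).float().unsqueeze(0)    # add a batch dimension
    with torch.no_grad():
        output = model(x)                                    # parameters and/or segmentation maps 10
    return output.squeeze(0).numpy()
```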
The present specification is concerned with methods of obtaining one or more parameters and/or segmentation maps 10 directly from under-sampled (or sparsely sampled) image data 3, without reconstructing or estimating a fully-sampled (or full resolution) image of the imaged object 4.
Method of obtaining one or more parameters and/or segmentation maps
Referring also to Figure 2, a process flow diagram of the method of obtaining one or more parameters and/or segmentation maps 10 directly from under-sampled image data 3 will be explained.
Image data 3 is received by the controller 6 (step S1). The image data 3 may be received directly from the imaging equipment 2, for example in real time.
Alternatively, the image data 3 may have been acquired at an earlier time, or at a remote location, or both.
In this specification, the image data 3 corresponds to an under-sampling of an image of the imaged object 4 which is:
(a) represented in a non-spatial vector space 13 (Figure 3), wherein a first transform 14 (Figure 3) exists for mapping a representation of the image in the non-spatial vector space 13 (Figure 3) to a representation in a first spatial vector space 12 (Figure 3);
(b) represented in a plurality of second spatial vector spaces 23 (Figure 6), wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces 23 (Figure 6) to a representation in the first spatial vector space 12 (Figure 3); or (c) represented in the first spatial vector space 12 (Figure 3), wherein the representation in the first spatial vector space 12 (Figure 3) has been obtained by transforming representation (a) using the first transform 14 (Figure 3) or by transforming representation (b) using the second transform.
Receiving the image data 3 may be a two-stage process. For example, the controller 6 may receive image data 3 corresponding to representation (a) or representation (b). The controller 6 may then convert the image data 3 to representation (c). The image data corresponding to representation (c) may then be used for subsequent processing by the trained machine-learning model 9.
The image data 3 may correspond to a random under-sampling of an image of the imaged object 4 represented in (a) the non-spatial vector space, or (b) the plurality of second spatial vector spaces. Further details and examples of non-spatial and second spatial vector spaces 13, 23 (Figures 3, 6) are explained hereinafter. Examples of under-sampling image data 3 are also explained hereinafter.
Under-sampled image data 3 represented in the first spatial vector space 12 (Figure 3), in other words an under-sampled real image of the desired plane or volume of space, differs from a fully sampled image represented in the first spatial vector space 12 (Figure 3). For example, image data 3 according to representation (c) may correspond to a heavily aliased image obtained by performing an inverse Fourier transform on a part-filled frequency space. Such an inverse Fourier transform may be performed on a partly filled frequency space by filling the un-sampled frequencies with zeros.
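For representation (c) derived from k-space data, this zero-filling step can be sketched in a few lines of numpy. The snippet is illustrative only; `kspace` and `mask` are assumed inputs, with `mask` equal to 1 where a frequency was actually measured and 0 elsewhere.

```python
# Minimal sketch: zero-fill un-sampled k-space entries and apply the inverse
# Fourier transform to obtain the aliased representation (c) image.
import numpy as np

def zero_filled_reconstruction(kspace: np.ndarray, mask: np.ndarray) -> np.ndarray:
    kspace_undersampled = kspace * mask                       # un-measured frequencies set to zero
    image = np.fft.ifft2(np.fft.ifftshift(kspace_undersampled))
    return np.abs(image)                                      # aliased real-space image
```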
Referring also to Figure 16, a full resolution cardiac MRI image is shown. Referring also to Figures 17A to 17C, corresponding under-sampled cardiac MRI images are shown.
The image shown in Figure 17A corresponds to a heavy under-sampling of the full resolution image of Figure 16. Significant aliasing artefacts are evident in Figure 17A. The image data 3 shown in Figure 17A was obtained by randomly sampling individual lines in frequency space, or k-space, from multiple frames of a cardiac MRI video, combining the sampled individual lines into a single k-space grid, and applying the inverse Fourier transform. Figures 17B and 17C show similar images corresponding to 10 and 20 lines sampled at random. It may be observed that the extent of degradation in the under-sampled images of Figures 17A, 17B and 17C decreases with the increased sampling fraction.
Alternatively, for X-ray projection data, the image data 3 according to representation (c) may correspond to a streaked image (for example as illustrated in Figure 6) obtained by back-projecting an insufficient, or relatively low, number of X-ray transmission images onto a reconstruction plane 24 (Figure 6). In general, the image data 3 according to representation (c) may be any real-space image which exhibits degraded integrity and/or quality as a consequence of performing the first transform 14 or second transform on under-sampled image data 3 corresponding to representation (a) or representation (b).
The image data 3 is prepared into a suitable format for processing by the trained machine learning model 9 (step S2).
The trained machine learning model 9 is executed on the formatted input data 3 (step S3). The trained machine-learning model 9 is configured to determine one or more parameters and/or segmentation maps 10 of the representation of the image in the first spatial vector space 12 (Figure 3). The trained machine learning model 9 does not reconstruct or estimate a fully-sampled (or full resolution) representation of the image in the first spatial vector space 12. Instead, the trained machine-learning model 9 is configured to determine one or more parameters and/or segmentation maps 10 directly from the under-sampled image data 3. The input under-sampled image data 3 may correspond to any one of representations (a), (b) or (c). However, a given machine learning model 9 will need to be trained and operated using a single, consistent input representation.
The one or more parameters and/or segmentation maps 10 are obtained as output(s) of the trained machine learning model (step S4).
The output one or more parameters and/or segmentation maps 10 may be stored for subsequent examination, or may be output to one or more local or remote computers for additional processing and/or analysis.
Optionally, in some applications, further actions may be carried out based on the output one or more parameters and/or segmentation maps 10 (step S5).
In some examples, the imaged object 4 may be a manufactured item. One or more parameters 10 output by the trained machine-learning model 9 may include one or more of a dimension of a feature of the manufactured item, an area of a feature of the manufactured item, a volume of a feature of the manufactured item and/or a deviation of the shape of the whole or any part of the manufactured item from an expected (or 20 designed for) shape. Segmentation maps 10 may correspond to one or more specific features of the manufactured item. Segmentation maps may correspond to one or more undesired features of the manufactured item, for example, voids. A segmentation map may take the form of, for example, an image in which each pixel holds an integer value. The integer values may signify the type of material or feature in a corresponding image 25 of the imaged object 4. For example, a segmentation map might use a value of 0 (zero) for empty space exterior to the imaged object 4, a value of 1 (one) for plastic, a value of 2 (two) for steel, and a value of 3 (three) for internal voids of the imaged object 4.
In such examples, the one or more parameters and/or segmentation maps 10 may be used for quality control of the imaged object 4. The image data 3 may be measured as part of a manufacturing line. In response to the one or more parameters and/or segmentation maps 10 indicating that one or more features of an imaged object exceeds a predetermined tolerance, the imaged object 4 maybe labelled or tagged, either physically or electronically, as requiring more detailed scanning and/or human inspection. In response to the one or more parameters and/or segmentation maps 10 indicating that one or more features of an imaged object 4 exceeds a predetermined
-15 tolerance, the imaged object 4 maybe re-directed for more detailed scanning and/or human inspection (step S5).
If there is additional data, then the process is repeated (step S6).
For example, in some implementations of the method, each frame of video data comprising a plurality of frames may be processed according to the method. In such an example, each frame of the video will correspond to image data as defined hereinbefore. Alternatively, multiple frames of video data may be input to the machine learning 10 model 9 simultaneously. In other words, the input data may be, for example, twodimensional plus time or even three-dimensional plus time. When trained to process multiple frames of video data as a block, the machine learning model 9 may exploit spatio-temporal redundancies inherent in video data. In this way, the image data 3 may be under-sampled to a greater degree than for processing of single frames.
Optionally, the image data 3 may be measured using the imaging equipment 2 (step So). The imaging equipment may a magnetic resonance imager, an X-ray computed tomography system, or any other imaging equipment suitable for obtaining image data 3 which required a transform to be mapped into a first spatial vector space. The 20 imaged object. 4 corresponding to the image data 3 may be the whole or any part of a human or animal anatomy. The imaged object may be a heart, or a portion or chamber of a heart. The imaged object may be another organ or a human or animal body. The imaged object corresponding to the image data 3 may be the whole or any part of a manufactured item.
The trained machine learning model 9 may take the form of a convolutional neural network. The trained machine learning model 9 may take the form of an auto-encoder and a predictor network, for example a so called “TL-NET”. The name “TL-net” (alternatively TL-embedding network) refers to the overall shape of the drawn network 30 architecture, in particular, it looks like, an inverted T or an L depending on which part of the network is used. A TL-net architecture is described in “Learning a Predictable and Generative Vector Representation for Objects”, Rohit Girdhar, David F. Fouhey Mikel Rodriguez, Abhinav Gupta, arXiv:i6o3.o8637v2,31 Aug 2016.
-16 Example of a non-spatial vector space
For example, referring also to Figure 3, examples of a first spatial vector space 12 and a non-spatial vector space 13 connected by a first transform 14 are shown.
Examples of a first spatial vector space 12 include normal three-dimensional space, or a two-dimensional plane within normal three-dimensional space (also referred to herein as a Cartesian space, since in the applications with which this specification is concerned relativistic effects are not anticipated). For example, an image represented in a twodimensional first spatial vector space 12 may take the form of a regular lattice of pixel 10 positions 15, each of which Is associated with one or more image values. Such an image may also be referred to as a normal or desired image. The pixel positions 15 are spaced in a first, x, direction by increments δχ and in a second, y, direction by increments Sy. Similarly, an image represented in a three-dimensional first spatial vector space 12 may take the form of a regular lattice of voxel positions, each of which is associated with one 15 or more image values, and spaced by increments δχ, 6y, δζ.
An example of a non-spatial vector space 13 which permits representation of an image is frequency space, often referred to as k-space (wherein the “k” refers to wavevector). A representation of an image in normal space (i.e. an image which is recognisable by a 20 human) has an equivalent representation in frequency space. For example, for a nonspatial vector space 13 in the form of two-dimensional frequency space, an image may be represented as a regular lattice of frequencies 16, each of which is associated with one or more amplitudes and phases. The frequencies 16 are spaced in a first, kx, direction by increments 5kx and in a second, ky, direction by increments 8ktJ. Similarly, 25 an image represented in a three-dimensional first spatial vector space 12 may take the form of a regular lattice of frequencies, each of which is associated with one or more amplitudes and phases, and spaced by increments $kx, Sky, Skz.
When the non-spatial vector space 13 is frequency space and the first spatial vector 12 30 space is normal space, a representation of an image in frequency space may be mapped to the equivalent representation in the first spatial vector space 12 using a first transform 14 in the form of an inverse Fourier transform. In such an example, the first transform 14 is invertible, and an image represented in the first spatial vector space 12 may be mapped to the equivalent representation in frequency space by a Fourier 35 transform. The lattice of frequencies 16 is sometimes referred to as the reciprocal lattice” of the lattice of pixel positions 15. When the first transform 14 is an inverse
-17 Fourier transform, the spacing increments in the first spatial vector space 12 and the non-spatial vector space 13 are inversely proportional:
1 1 δχ oc ---— δν oc —-—, δζ oc ---— δ/ίχ dky δΚζ (1)
Referring also to Figure 4, the concept of under-sampling of the image data 3 will discussed with reference to an example where the non-spatial vector space 13 is frequency space and the first spatial vector space 12 is normal space.
As an example, Figure 4 shows an illustration of a non-spatial vector space 13 in the form of a two-dimensional frequency space in which discrete frequencies 16 are arranged. Each frequency corresponds to a Fourier transformed representation in frequency space of an image in normal space. The frequencies 16 corresponding to an image maybe viewed as forming a reciprocal lattice having rows 17 and columns 18.
In the example of under-sampling shown in Figure 4, the frequencies 16 in three columns 17-3,171,174 are sampled frequencies 19 (shown shaded) whilst the remaining frequencies 16 are un-sampled frequencies 20 (shown unshaded). Thus, the image data 3 shown in Figure 4 is under-sampled because the reciprocal lattice is only partially sampled. Applying the first transform 14, in this case the inverse Fourier transform (using zeros for the un-sampled frequencies 20), would result in an image representation in the first spatial vector space, in this case normal space, which would include substantial aliasing artefacts. Such an aliased image is an example of undersampled image data 3 corresponding to representation (c). This type of undersampling may be performed using, for example, imaging equipment 2 in the form of a magnetic resonance imaging system.
The under-sampling of image data 3 representing an image in a non-spatial vector space is not limited to sampling specific rows of a reciprocal lattice. For example, a number of columns 18 maybe sampled, or a fraction of the frequencies 16 may be sampled either at random or according to a predetermined pattern.
Substantial work has been carried out in relation to reconstructing or estimating a fully sampled (or full resolution) representation of an image in normal space (i.e. the first spatial vector space 12) based on under-sampled image data 3, for example, "Sparse MRI: The application of compressed sensing for rapid MR imaging", Lustig, M., Donoho, D., Pauly, J.M., Magnetic Resonance in Medicine 58(6), 1182-1195 (2007).
However, according to the methods of the present specification, the reconstruction or estimation of a fully sampled (full resolution) representation of an image in the first spatial vector space 12 is not required. Instead, the trained machine learning model 9 extracts the one or more parameters and/or segmentation maps 10 directly from the under-sampled image data 3, whether represented in the non-spatial vector space 13 or the first spatial vector space 12.
Referring also to Figure 5, the sampled frequencies 19 are not required to correspond to the frequencies 16 of a reciprocal lattice corresponding to a fully sampled image.
In the example shown in Figure 5, the non-spatial vector space 13 is frequency space, and un-sampled frequencies 16, 20 are shown which correspond to the reciprocal lattice of a fully sampled representation of an image in normal space (the first spatial vector space 12 in this example). A pair of sampled lines 21₁, 21₂ are shown, each of which includes sampled frequencies 19 spread along its length. The sampled lines 21₁, 21₂ do not need to be aligned with the reciprocal lattice, and in general may be oriented at any angle with respect to the reciprocal lattice.
One example which is analogous to the illustration in Figure 5 is the reconstruction of
X-ray computed tomography images using a Fourier transform method. For example, the reciprocal lattice of frequencies 16 may correspond to the frequency space representation of a slice in the computed tomography reconstruction volume and each of the sampled lines 2¼ 212 may correspond to the Fourier transform of one line of a shadow projection obtained from a particular direction. In such an example, the computed tomography reconstruction volume, or slices thereof, correspond to the first spatial vector space 12. Similarly, the projected images, or shadow images, obtained from a number of different directions, or lines thereof, correspond to a plurality of second spatial vector spaces.
Conventionally, the method of obtaining an image in the first spatial vector space would be to continue to populate the frequency space with sampled lines 2I1, 2i2 obtained by 35 taking the Fourier transform of larger numbers of proj ection or shadow images. Once enough sampled lines 2ii, 212 have been added, amplitudes corresponding to each
-19frequency 16 of the reciprocal lattice are interpolated, before applying an inverse Fourier transform to obtain an image representation in normal space (i.e. a computed tomographic slice image). The reconstruction image may then be analysed to measure parameters, identify features and/or generate segmentation maps. Alternatively, by applying significant post-processing to the under-sampled image data 3, a reconstruction or estimate of a fully sampled image in normal space (i.e. the first vector space 12) may be obtained. The reconstruction or estimate of the fully sampled image may then be analysed conventionally.
By contrast, applying the methods of the present specification, the need for larger numbers of measurements and/or post-processing to obtain a reconstruction or estimate of the fully sampled image may be removed, because the machine learning model may be trained to extract the desired one or more parameters and/or segmentation maps directly from the under-sampled image data 3 and without needing to reconstruct or estimate a fully-sampled representation of the image in normal space (i.e. the first spatial vector space 12). Under-sampling of the image data 3 corresponds to a situation in which the number of sampled frequencies 19, across all sampled lines 2I1, 21a, is less than the number of points in the reciprocal lattice frequencies 16. The methods of the present specification maybe applied to the under-sampled image data
3, whether represented in the non-spatial vector space 13 or the first spatial vector space 12.
The methods of the present specification are not limited to image data 3 obtained in a non-spatial vector space 13, such as frequency space, and a first transform 14 in the form of an inverse Fourier transform.
Example of second spatial vector spaces
For example, referring also to Figure 6, under-sampled image data 3 will be explained with reference to a back-projection method.
Similarly to MRI imaging, post-processing methods have also been developed to allow reconstructing or estimating a fully sampled image using back-projection methods commonly applied in computed tomographic imaging. For example, "A deep learning architecture for limited-angle computed tomography reconstruction", Hammernik, Kerstin, et al., Bildverarbeitung für die Medizin 2017, Springer Vieweg, Berlin, Heidelberg, 2017, 92-97.
However, applying the methods of the present specification may allow bypassing the extensive post-processing needed to reconstruct or estimate a fully sampled image from a relatively small number of tomographic projection images. The desired one or more parameters and/or segmentation maps may be directly extracted from the tomographic projection images (i.e. the second spatial vector spaces), or from a heavily streaked image projected onto a reconstruction plane 24 (for example as shown in Figure 6).
Using imaging equipment 2 in the form of a computed tomography system, image data 3 is acquired in the form of a number of X-ray projection (or shadow) images 22 of an imaged object 4. In the example depicted in Figure 6, each X-ray projection image 22 is captured in a different direction oriented in the x-y plane, and may be represented in a plane perpendicular to the projection direction. Each different z-axis coordinate in the projection images 22 may correspond to a one-dimensional second spatial vector space 23. In other words, the first spatial vector space 12 may encompass a two-dimensional slice of an imaged object 4, and each second spatial vector space 23 may correspond to a projection of said two-dimensional slice onto a line oriented in a particular direction.
For example, in Figure 6, the lines of individual X-ray projection images 22 are shown as image values plotted against an independent axis which is perpendicular to both the z-axis and the respective projection direction. A corresponding x-y plane 24 for reconstruction of a slice of the computed tomography data may correspond to a two-dimensional first spatial vector space 12. The second transform in such an example may be back-projection of the X-ray projection images 22 onto the reconstruction plane 24.
The image values of the X-ray projection images 22 are projected across the reconstruction plane 24, and image values are summed in those regions which intersect each other. As a greater number of X-ray projection images 22 are added, the portions of the reconstruction plane 24 corresponding to a slice through the imaged object 4 are further reinforced, building up the image of the imaged object 4. Usually, a very large number of X-ray projection images 22 are required before a useful reconstruction of the imaged object 4 may be obtained.
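The summation just described is, in essence, unfiltered back-projection. A compact sketch is given below as an illustration only (it assumes parallel-beam projections and uses scipy's rotate to smear each projection across the reconstruction plane); with only a few projection angles the result is the heavily streaked image discussed above.

```python
# Illustrative sketch of direct, unfiltered back-projection onto the
# reconstruction plane 24.
import numpy as np
from scipy.ndimage import rotate

def back_project(projections, angles_deg, size):
    """projections: list of 1-D arrays of length `size`, one per angle."""
    reconstruction = np.zeros((size, size))
    for p, angle in zip(projections, angles_deg):
        smear = np.tile(p, (size, 1))     # spread the projection across the plane
        reconstruction += rotate(smear, angle, reshape=False, order=1)
    return reconstruction
```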
However, the nature of X-ray imaging is that each X-ray projection image 22 includes information about every point in the imaged object 4. This is because each image value, i.e. each pixel, of an X-ray projection image 22 has a magnitude which is determined by the total attenuation of the material or materials the X-rays passed through. In consequence, X-ray projection images 22 possess a significant quantity of latent information about the through-thickness structure of an imaged object 4. Whilst this information may not be sufficient to recover a detailed image of the imaged object 4, it is expected that even a relatively small number of X-ray projection images 22 may include sufficient latent information to enable extraction of one or more parameters and/or segmentation maps 10 useful for characterising the imaged object 4. In applications where there is prior knowledge about the expected overall shape of an imaged object 4, for example in quality control of manufactured items, it may be possible to further reduce the number of X-ray projection images 22. Thus, under-sampled image data 3 may correspond to a smaller number of X-ray projection images 22 than would conventionally be captured to obtain a conventional computed tomographic reconstruction. As already explained, the methods of the present specification need not actually involve obtaining a reconstructed image in the reconstruction plane 24. Instead, one or more parameters and/or segmentation maps 10 could be generated directly from the X-ray projection images 22.
Alternatively, one or more parameters and/or segmentation maps 10 may be generated directly from a heavily streaked back-projected image on the reconstruction plane 24.
The entire reconstruction volume may alternatively be viewed as a three-dimensional first spatial vector space 12, and each X-ray projection image 22 may be viewed as a two-dimensional second spatial vector space 23. In other words, the first spatial vector space 12 may encompass a three-dimensional volume of an imaged object 4, and each second spatial vector space 23 may correspond to a projection of the three-dimensional volume onto a plane oriented in a particular direction.
Typically, no individual second spatial vector space 23 spans the first spatial vector space 12. However, a combination of two or more of the second spatial vector spaces 23 may span the first spatial vector space 12.
Although the back-projection example explained hereinbefore is with reference to a direct back-projection method, the methods of the present specification are also expected to be applicable to filtered back-projection methods.
First method of training the machine learning model
Referring also to Figure 7, a process-flow diagram of a first method of training the machine learning model 9 will be explained.
A training set is received which includes a number of training images (step S8) and, for each training image, a corresponding one or more parameters and/or segmentation maps (step S9). The training images forming the training set may be represented in (a) the non-spatial vector space 13, (b) the plurality of second vector spaces 23, or (c) the first vector space 12. The training images may correspond to multiple frames from one or more videos.
Alternatively, in some examples the one or more parameters and/or segmentation maps may be generated based on the received training images, using conventional feature extraction techniques.
An under-sampled training image is generated corresponding to each training image (step S10). All of the under-sampled training images are (a) represented in the non-spatial vector space 13, (b) represented in the plurality of second vector spaces 23, or (c) represented in the first spatial vector space 12. Each representation (c) in the first spatial vector space 12 is obtained by transforming an under-sampled training image from representation (a) using the first transform 14, or by transforming an under-sampled training image from representation (b) using the second transform. The choice of representation, for example (a) or (c), or (b) or (c), depends on the input image data 3 which the machine learning model 9 will receive in use. The under-sampled training images are formatted for input to the machine-learning model 9 (step S11).
The machine-learning model 9 is trained using the under-sampled training images as the input, and the corresponding one or more parameters and/or segmentation maps as the ground truth (also referred to as target output) (step S12). Errors for back-propagation and adjustment of the machine-learning model 9 weights are generated based on discrepancies between the output estimated one or more parameters and/or segmentation maps 10, and the ground truth one or more parameters and/or segmentation maps received as part of the training set.
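As a concrete illustration of step S12, a minimal training step in PyTorch (the library used in the experimental section below) might look as follows. The model, data loader and use of a pixel-wise cross-entropy loss are placeholders for the purpose of the sketch, not the specific architecture of Figure 9.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimiser, device="cpu"):
    """One epoch of step S12: forward pass, error against the ground truth,
    back-propagation and adjustment of the model weights."""
    criterion = nn.CrossEntropyLoss()  # pixel-wise CE for segmentation maps
    model.train()
    for under_sampled, ground_truth in loader:
        under_sampled = under_sampled.to(device)    # formatted input (step S11)
        ground_truth = ground_truth.to(device)      # target segmentation map
        optimiser.zero_grad()
        prediction = model(under_sampled)           # estimated maps 10
        loss = criterion(prediction, ground_truth)  # discrepancy from ground truth
        loss.backward()                             # error back-propagation
        optimiser.step()                            # adjust model weights
```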
The machine-learning model 9 may be trained over multiple epochs using a relatively low training rate, so as to avoid overfitting to the training set data (step S14). During each epoch, the machine-learning model 9 may be trained using the same under-sampled training images (step S12), and the model weights adjusted based on determined errors with respect to the ground truth one or more parameters and/or segmentation maps received as part of the training set (i.e. looping through steps S12 to S14).
Alternatively, for each training epoch, new under-sampled training images may be generated according to a random sampling of the corresponding training images (i.e. looping through steps S10 to S14). Herein, the term “random” also encompasses pseudo-random processes. An under-sampling fraction used for generating under-sampled training images may be greater than zero and less than one.
For example, during the generation of under-sampled training images (step S10), rows 17 of a representation in frequency space may be randomly selected up to a desired under-sampling fraction. Equivalently, columns 18, or even individual frequencies 16, of a representation in frequency space may be randomly selected up to a desired under-sampling fraction.
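A minimal sketch of random row selection up to a desired under-sampling fraction is given below; the uniform choice of rows and the function name are assumptions for the sketch, since the specification equally allows column-wise or per-frequency sampling and other sampling densities.

```python
import numpy as np

def under_sample_rows(kspace, fraction, rng=None):
    """Keep only a random `fraction` of k-space rows; zero out the rest."""
    rng = rng or np.random.default_rng()
    n_rows = kspace.shape[0]
    n_keep = max(1, int(round(fraction * n_rows)))
    keep = rng.choice(n_rows, size=n_keep, replace=False)
    mask = np.zeros(n_rows, dtype=bool)
    mask[keep] = True
    out = np.zeros_like(kspace)
    out[mask] = kspace[mask]
    return out, mask
```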
In an alternative example, during the generation of under-sampled training images (step S10), X-ray projection images 22 may be selected from a complete set of equiangular X-ray projection images 22, up to a desired under-sampling fraction.
Generation of under-sampled training images is not limited to samplings obtained in representation (a) or representation (b). In other examples, an under-sampled training image in representation (c) may be obtained by transforming a random sampling obtained in representation (a) using the first transform 14. For example, an intermediate under-sampled image may be obtained in frequency space by sampling rows 17 or columns 18 of a fully sampled training image up to a desired under-sampling fraction. An under-sampled training image may then be obtained by applying the inverse Fourier transform to the intermediate under-sampled image.
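This route from a fully sampled training image to representation (c) can be sketched as follows, assuming a two-dimensional Fourier transform pair as the first transform 14; the row-wise masking and the magnitude operation at the end are illustrative assumptions (the magnitude is taken only because the sketch starts from a real-valued image).

```python
import numpy as np

def under_sampled_training_image(image, fraction, rng=None):
    """Generate an under-sampled training image in representation (c)."""
    rng = rng or np.random.default_rng()
    kspace = np.fft.fft2(image)                      # fully sampled frequency space
    n_rows = kspace.shape[0]
    keep = rng.choice(n_rows, size=max(1, int(fraction * n_rows)), replace=False)
    mask = np.zeros(n_rows, dtype=bool)
    mask[keep] = True
    kspace_u = np.where(mask[:, None], kspace, 0)    # intermediate under-sampled image
    # Apply the first transform 14 (inverse Fourier transform) to reach
    # representation (c) in the first spatial vector space 12.
    return np.abs(np.fft.ifft2(kspace_u))
```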
In further examples, an under-sampled training image in representation (c) may be obtained by transforming a random sampling obtained in representation (b) using the second transform.
As mentioned hereinbefore, the trained machine-learning model 9 may take the form of a convolutional neural network. Alternatively, the trained machine-learning model 9 may take the form of an auto-encoder and a predictor network (supervised auto-encoder).
Optionally, the training method may include obtaining the training set, including the training images and the corresponding ground truth one or more parameters and/or segmentation maps (step S7). For example, the imaging equipment 2 may be used to obtain training images.
Optionally, the accuracy of the machine-learning model 9 may be evaluated throughout training (step S13). The testing should be carried out by applying the machine-learning model 9 to a validation set (sometimes also termed a testing set) which includes a number of testing images, each testing image corresponding to one or more parameters and/or segmentation maps. Under-sampled testing images may be generated from the testing images in the same way as under-sampled training images. An error or errors between the machine-learning model 9 output and the one or more parameters and/or segmentation maps from the testing set may be compared against a predetermined threshold or thresholds. Whether or not to execute further training epochs (step S14) may be decided in dependence on the present accuracy of the machine-learning model 9. In other words, additional epochs may be run until either a maximum number of epochs is exceeded, or the error or errors of the machine-learning model 9 when applied to the testing set is less than the predetermined threshold or thresholds.
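A hedged sketch of this stopping logic (steps S12 to S14) is given below, reusing the train_one_epoch sketch above; the evaluate function, the maximum epoch count and the error threshold are placeholders rather than values taken from the specification.

```python
def train_until_converged(model, train_loader, val_loader, optimiser, evaluate,
                          max_epochs=100, error_threshold=0.05):
    """Steps S12 to S14: train for an epoch, test on the validation set, and
    decide whether a further epoch is needed.

    evaluate(model, loader) is a caller-supplied placeholder returning the
    error of the model on the validation (testing) set.
    """
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimiser)   # step S12 (sketched above)
        error = evaluate(model, val_loader)               # step S13
        if error < error_threshold:                       # step S14
            break
    return model
```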
Second method of training the machine learning model
Referring also to Figure 8, a process-flow diagram of a second method of training the machine learning model 9 will be explained.
The training images and corresponding ground truth one or more parameters and/or segmentation maps are received (steps S16, S17), and optionally obtained (step S15), in the same way as for the first method (steps S7 to S9).
Unlike the first method, in the second method the machine-learning model 9 is initially trained using the fully sampled training images, before gradually reducing the under-sampling fraction.
For example, the fully sampled training images may be formatted for input to the machine-learning model (step S18), and the machine-learning model 9 may be trained using the formatted fully sampled training images as input and the corresponding one or more parameters and/or segmentation maps as ground truth (step S19). Several epochs of training may be carried out (step S21), for example, until a predetermined number of epochs have been run, or until an error or errors determined by applying the machine-learning model 9 to the testing set is less than the predetermined threshold or thresholds (step S20).
If the machine learning model 9 is intended to process image data 3 corresponding to representation (a), then fully sampled training images in representation (a) may be used for initial training. Similarly, if the machine learning model 9 is intended to process image data in representations (b) or (c), then fully sampled training images in representation (b) or (c) may be used for initial training.
Once the machine-learning model 9 has been trained using the fully sampled training images, the first method of training may be applied unmodified. This has the advantage that the weights of the machine-learning model 9 start from an initial configuration corresponding to an accurate solution for the fully sampled training images. In this way, the machine-learning model 9 may be trained more rapidly. The resulting trained machine-learning model 9 may also display improved accuracy.
Alternatively, in the second method, the under-sampling fraction used to generate under-sampled training images may be incrementally decreased to the desired under-sampling fraction.
For example, an initial under-sampling fraction may be set (step S22). The initial under-sampling fraction may be just slightly less than 1 (one, or unity).
Under-sampled training images are generated (step S23), formatted (step S24) and used as input for training the machine-learning model 9 (step S25), in the same way as in the first method (steps S10 to S12). Optionally, the current accuracy of the machine-learning model 9 may be tested by applying the machine-learning model 9 to the testing images.
As with the first method, the training may be conducted across multiple epochs (step
S27). Each under-sampled training image is preferably generated for each epoch by random sampling of the corresponding fully sampled training image.
Between epochs, the under-sampling fraction may be updated (step S28). For example, the under-sampling fraction may be decremented each epoch by a small amount. In an alternative example, the under-sampling fraction may be decreased after fixed numbers of epochs, e.g. every tenth epoch, and so forth. In another example, the under-sampling fraction may be decreased in dependence upon the prediction accuracy of the machine-learning model when applied to the testing set (step S26).
In this way, the under-sampling fraction may be decreased from unity to a target under-sampling fraction across a plurality of training epochs, whilst minimising the possibility of the machine-learning model moving away from the solutions found using the complete training images.
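A minimal sketch of one such schedule (step S28) is given below; the starting value, decrement per epoch and target fraction are assumptions chosen only to illustrate the idea of a gradual decrease from near unity towards the target.

```python
def under_sampling_schedule(epoch, start=0.95, target=0.1, step=0.05):
    """Step S28: decrease the under-sampling fraction towards the target.

    The fraction starts just below unity and is decremented by a small
    amount each epoch, never falling below the target fraction.
    """
    return max(target, start - step * epoch)

# Example: fractions used in step S23 for the first few epochs,
# approximately [0.95, 0.90, 0.85, 0.80, 0.75].
fractions = [under_sampling_schedule(e) for e in range(5)]
```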
Experimental data
Two examples of the methods of the present specification were applied to magnetic resonance imaging data of human hearts, in order to obtain segmentation maps of the left and right ventricles.
Experiments were performed using 5000 short-axis cardiac cine (video) magnetic resonance images from a UK Biobank study (Petersen, S.E., et al.: UK Biobank's cardiovascular magnetic resonance protocol. Journal of Cardiovascular Magnetic Resonance 18(1), 8 (Feb 2016)). These cardiac cine (video) magnetic resonance images were acquired using a balanced steady-state free precession (bSSFP) sequence, a matrix size of Nx × Ny × T = 208 × 187 × 50, a pixel resolution of 1.8 × 1.8 × 10.0 mm³ and a temporal resolution of 31.56 ms. Since the manual annotations were only available at end-systolic (ES) and end-diastolic (ED) frames, segmentation maps 10 were generated for the left-ventricular (LV) cavity, the myocardium and the right-ventricular (RV) cavity for all time-frames including apical, mid and basal slices, according to the methods described in Bai, W., Sinclair, M., Tarroni, G., Oktay, O., Rajchl, M., Vaillant, G., Lee, A.M., Aung, N., Lukaschuk, E., Sanghvi, M.M., et al.: Human-level CMR image analysis with deep fully convolutional networks. arXiv preprint arXiv:1710.09289 (2017). The segmentation maps generated in this way agree well with the manual segmentations, and were then treated as the ground truth labels for the training and testing sets.
The data was divided into 4000 training images and 1000 validation (test) images. Random under-sampling of the training images to generate the under-sampled training images was conducted using variable-density one-dimensional under-sampling masks. In other words, the training images were represented in a non-spatial vector space 13 in the form of frequency space, or k-space, and the first transform 14 for these examples would be the inverse Fourier transform. The one-dimensional under-sampling masks had the effect of sampling a certain fraction of rows 17 (or columns 18) of the training images. These under-sampling masks were generated on-the-fly, i.e. the under-sampling masks were not predetermined prior to training the machine-learning model 9. As only the magnitude images were available, we synthetically generated the phase maps (smoothly varying 2D sinusoid waves) on-the-fly to make the simulation more realistic by removing the conjugate symmetry in k-space. Different levels of acceleration were considered, corresponding to the number of sampled lines per time-frame nl ∈ [1, 168]. For a fully-sampled image, nl = 168.
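A sketch of how such synthetic phase maps might be generated on-the-fly is given below. The particular frequency range and amplitude are assumptions; the only property taken from the text is that the phase is a smoothly varying two-dimensional sinusoid attached to a magnitude-only image so that the conjugate symmetry of k-space is removed.

```python
import numpy as np

def synthetic_phase_map(shape, rng=None):
    """Smoothly varying 2D sinusoidal phase map, in radians."""
    rng = rng or np.random.default_rng()
    ny, nx = shape
    y, x = np.meshgrid(np.linspace(0, 1, ny), np.linspace(0, 1, nx),
                       indexing="ij")
    fy, fx = rng.uniform(0.5, 2.0, size=2)   # assumed low spatial frequencies
    offset = rng.uniform(0, 2 * np.pi)
    return np.pi * np.sin(2 * np.pi * (fx * x + fy * y) + offset)

def to_complex(magnitude_image, rng=None):
    """Attach a synthetic phase to a magnitude-only image before simulating
    k-space, so that k-space is no longer conjugate-symmetric."""
    phase = synthetic_phase_map(magnitude_image.shape, rng)
    return magnitude_image * np.exp(1j * phase)
```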
The examples used different network architectures to learn the mapping from the under-sampled image data 3 to the output segmentation maps. The first example is termed “end-to-end synthesis network” or “Syn-net” for short. The Syn-net example may exploit spatiotemporal redundancies in the input to directly generate a segmentation map 10. The second example is termed “LI-net”. The LI-net example first predicts the low-dimensional latent code of the corresponding segmentation map 10, which is subsequently decoded.
Syn-net
Referring also to Figure 9, the architecture of the machine-learning model 9 for the Syn-net example is shown. In Figure 9, the changes in the number of features are shown above the respective tensors.
Let D ⊆ X × Y be a dataset of fully-sampled complex-valued (dynamic) images x = {xi ∈ ℂ | i ∈ S}, where S denotes indices on a pixel grid, and the corresponding segmentation map 10 labels y = {yi | i ∈ S} representing different tissue types, with yi ∈ {1, 2, ..., C}. Let u = {ui ∈ ℂ | u = Fu^H Fu x} denote under-sampled image data 3, where Fu is the under-sampling Fourier encoding matrix and Fu^H is the Hermitian conjugate of Fu. Let p(yi | x) be the true distribution of the i-th pixel label given an image, and let r(u | x, M) represent the sampling distribution of the under-sampled images given an image x and a (pseudo-)random under-sampling mask generator M.
The objective is to train a synthesis network q(yi | u, θ), termed Syn-net, which uses a convolutional neural network (CNN) to model the probability distribution of segmentation maps 10 given the under-sampled image data 3, parameterised by θ. The machine learning model 9 in the form of the CNN was trained using the following modified cross-entropy (CE) loss:
LCE(θ) = − Σ(x,y)∈D E u∼r(u|x,M) [ Σi∈S log q(yi | u, θ) ]    (2)

in which the expectation is taken over differently under-sampled images. In practice, a different under-sampling pattern was generated for each mini-batch during training as an approximation to the expectation. The network architecture used is that shown in Figure 9.
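A sketch of how this modified CE loss might be approximated in PyTorch is shown below, assuming (as described above) that a single freshly drawn under-sampling pattern per mini-batch stands in for the expectation over r(u | x, M); the resample function is a placeholder for the under-sampling mask generator M.

```python
import torch.nn.functional as F

def syn_net_loss(model, x, y, resample):
    """Approximate Equation (2) for one mini-batch.

    x: fully sampled images, y: ground-truth segmentation labels (N, H, W).
    resample(x) draws one under-sampled version u ~ r(u | x, M).
    """
    u = resample(x)                       # one under-sampling pattern per batch
    logits = model(u)                     # q(y_i | u, theta), shape (N, C, H, W)
    return F.cross_entropy(logits, y)     # pixel-wise CE, averaged over the batch
```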
LI-net
Referring also to Figure 10, the architecture of the machine-learning model 9 for the LI-net example is shown. In Figure 10, the two-stage training strategy is illustrated. The same encoder and decoder as Syn-net can be used for LI-net.
The Syn-net example assumes that the under-sampled image data 3 contains sufficient geometrical information to generate the one or more parameters and/or segmentation maps 10. For heavily under-sampled (and therefore aliased) images, this assumption may not be optimal, as the effects of aliasing might in some cases prevent the network from identifying the correct boundaries.
In the latter case, determination of one or more parameters and/or segmentation maps 10 is still possible as long as the target domain has a compact, discriminative latent representation h ∈ H that can be predicted. Such a network can be trained in the following stages:
Stage 1: an auto-encoder (AE) is trained in the target domain, reconstructing y ∈ Y as Ψ(Φ(y; θenc); θdec), which is a composition of an encoder Φ: Y → H and a decoder Ψ: H → Y, parameterised by θenc and θdec respectively, where H is a low-dimensional latent space. The AE can be trained using the ℓ2 norm or cross-entropy (CE) loss.
Stage 2: a predictor network Π: X → H is trained, parameterised by θpred. For a given input-target pair (x, y), the predictor attempts to predict the latent code h = Φ(y; θenc) from x. This is trained using the ℓ2 norm in the latent space: dH(y, x) = ||Φ(y; θenc) − Π(x; θpred)||₂². Once the predictor is trained, an input-output mapping can be obtained by the composition Ψ(Π(x; θpred); θdec).
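A compressed sketch of these two stages is given below; the encoder, decoder and predictor modules are placeholders sharing only the interfaces described above, and the optimiser handling and one-hot encoding of the label maps are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_stage1(encoder, decoder, loader, optimiser, num_classes):
    """Stage 1: auto-encoder in the target (segmentation) domain, CE loss."""
    for _, y in loader:                                   # y: label maps (N, H, W)
        optimiser.zero_grad()
        y_onehot = F.one_hot(y, num_classes).permute(0, 3, 1, 2).float()
        h = encoder(y_onehot)                             # Phi(y; theta_enc)
        logits = decoder(h)                               # Psi(h; theta_dec)
        F.cross_entropy(logits, y).backward()
        optimiser.step()

def train_stage2(encoder, predictor, loader, optimiser, num_classes):
    """Stage 2: predictor Pi regresses the latent code from the input image."""
    for u, y in loader:                                   # u: under-sampled image data 3
        optimiser.zero_grad()
        with torch.no_grad():                             # encoder is kept fixed here
            y_onehot = F.one_hot(y, num_classes).permute(0, 3, 1, 2).float()
            h_target = encoder(y_onehot)                  # Phi(y; theta_enc)
        h_pred = predictor(u)                             # Pi(x; theta_pred)
        F.mse_loss(h_pred, h_target).backward()           # l2 norm in the latent space
        optimiser.step()
```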
According to the LI-net example of the methods of the present specification, the AE is trained to learn the compact representation of segmentation maps and the predictor is trained to interpolate these from input under-sampled image data. The terminology of a latent feature interpolation network, or LI-net, is derived from these features.
In stage 1 of training the LI-net example, the AE is trained using CE loss. In stage 2 of training the LI-net example, the network is further encouraged to produce a consistent prediction for differently under-sampled versions of the same training image. This constraint may be implemented by forcing the network to produce the same latent code for under-sampled image data 3 as for the fully-sampled image.
Additionally, a CE term dCE(y, Ψ ∘ Π(u)) is added, to ensure that an accurate segmentation map 10 may be obtained from the latent code. Therefore, the objective term is as follows (here the λi are hyper-parameters to be adjusted based on the preferred end-goal):
E u∼r(u|x,M) [ dH(y, u) + λ1 dH(y, x) + λ2 ||Π(x) − Π(u)||₂² + λ3 dCE(y, Ψ ∘ Π(u)) ]    (3)

Modelling and Results
The input to the network (Syn-net or LI-net) is the under-sampled image data 3. For the experimental data presented herein, the under-sampled image data is two-dimensional plus time data, and the output is a sequence of segmentation maps 10. Each z-slice was processed separately due to the relatively large slice thickness.
Referring again to Figure 9, the architecture of the Syn-net model is shown. To make a fair comparison between the Syn-net and LI-net architectures, we used the encoding path of Syn-net as both encoder Φ and predictor Π, and the decoding path as decoder Ψ. The size of the latent code was set to be |h| = 1024. Fully-connected layers were used to join the encoder, the latent code and the decoder. They were trained with mini-batch size 8 using Adam (see Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)) with initial learning rate 10⁻⁴, which was reduced by a factor of 0.8 every 2 epochs. The AE in LI-net was trained for 30 epochs to ensure that the Dice scores (alternatively known as “F1 scores”, or “Sørensen-Dice coefficients”) for each class reached 0.95. For both models, the network was initially trained to perform segmentation from fully-sampled data in order to provide a starting point for training using under-sampled image data 3. The number of lines sampled (i.e. rows 17) was gradually reduced and by the 10th epoch, the number of lines nl was sampled uniformly from [0, 167]. In this way, the under-sampling fraction was nl/168.
The training error for both the Syn-net and LI-net examples was observed to plateau within 50 epochs. For the LI-net example, the hyper-parameters for the loss function (Equation (3)) were empirically chosen to be λ1 = 1, λ2 = 10⁻⁴, λ3 = 10, which was found to perform adequately. For data augmentation, we generated affine transformations on-the-fly. We used PyTorch for the implementation. Data augmentation has the usual meaning of artificially generating new versions of the training samples by rotating or distorting the original training samples. For these data, geometrically transformed images were generated for each epoch during training. The affine transformations applied included rotation, translation and scaling of the training images. The parameters for the affine transformation were chosen arbitrarily (using pseudo-random methods).
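A sketch of such on-the-fly affine augmentation is given below; the parameter ranges are assumptions, and the only deliberate design point is that the same randomly drawn rotation, translation and scaling are applied to a training image and to its segmentation map so that the pair stays aligned (nearest-neighbour interpolation for the labels).

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def random_affine_pair(image, label):
    """Apply one random rotation, translation and scaling to an image/label
    pair of (C, H, W) tensors, keeping them geometrically aligned."""
    angle = random.uniform(-15.0, 15.0)               # degrees (assumed range)
    translate = [random.randint(-8, 8), random.randint(-8, 8)]
    scale = random.uniform(0.9, 1.1)
    image = TF.affine(image, angle, translate, scale, shear=0.0,
                      interpolation=InterpolationMode.BILINEAR)
    label = TF.affine(label, angle, translate, scale, shear=0.0,
                      interpolation=InterpolationMode.NEAREST)
    return image, label
```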
Referring also to Figure 11, a comparison of Dice scores is shown for the Syn-net and LI-net examples when applied to data for the left ventricle (LV) of the hearts in the dataset. The solid line plots the mean scores for Syn-net, and the corresponding error bars indicate standard deviation. The dashed line plots the mean score for LI-net, and the corresponding error bars indicate standard deviation. The independent variable in both plots is the number of k-space rows used for the under-sampling, nl ∈ [1, 168].
Referring also to Figure 12, a comparison of Dice scores is shown for the Syn-net and LI-net examples when applied to data for the right ventricle (RV) of the hearts in the dataset. The solid line plots the mean scores for Syn-net, and the corresponding error bars indicate standard deviation. The dashed line plots the mean score for LI-net, and the corresponding error bars indicate standard deviation. The independent variable in both plots is the number of k-space rows used for the under-sampling, nl ∈ [1, 168].
Referring also to Figure 13, a comparison of Dice scores is shown for the Syn-net and LI-net examples when applied to data for the myocardium (Myo) of the hearts in the dataset. The solid line plots the mean scores for Syn-net, and the corresponding error bars indicate standard deviation. The dashed line plots the mean score for LI-net, and the corresponding error bars indicate standard deviation. The independent variable in both plots is the number of k-space rows used for the under-sampling, nl ∈ [1, 168].
For each subject, the data is a two-dimensional stack of short-axis slices for one cardiac cycle, which contains multiple time-frames. To compute the Dice scores, we only included the ES and ED frames, but aggregated the results across all short-axis slices. When calculating the Dice scores, the ES and ED frames were used for all slices. For example, if there are 10 short-axis slices and 50 time-frames, then there are 10 × 50 = 500 images originally. However, for measuring the performance using Dice scores, only 10 × 2 = 20 images were used. The Dice scores are plotted versus the number of acquired k-space rows 17 in Figures 11 to 13. The Syn-net and LI-net examples both maintained performance down to about 20 lines per frame out of 168, demonstrating the capability of the models to directly interpolate an anatomical boundary even in the presence of aliasing artefacts.
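For reference, the per-class Dice score used throughout Figures 11 to 13 can be computed as in the following sketch; the small smoothing term added to avoid division by zero is an implementation detail rather than part of the specification.

```python
import numpy as np

def dice_score(prediction, ground_truth, label, eps=1e-7):
    """Sørensen-Dice coefficient for one tissue class of a segmentation map."""
    p = (prediction == label)
    g = (ground_truth == label)
    intersection = np.logical_and(p, g).sum()
    return (2.0 * intersection + eps) / (p.sum() + g.sum() + eps)
```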
In general, the Syn-net example was observed to provide superior performance, indicating that the extracted spatiotemporal features are directly useful for segmentation.
Without wishing to be bound by theory, it is thought that the LI-net example relatively underperformed because it does not employ the skip-connections which the Syn-net example does. This is thought to limit how accurately the LI-net example can delineate boundaries from the under-sampled image data 3. It may be possible that increasing the capacity of the network might improve the results.
It may also be observed that for the case of obtaining a segmentation map 10 from a single row 17 of the k-space data, the LI-net example outperformed the Syn-net example. This may suggest that in more challenging domains the approach of LI-net to interpolate the latent code remains a viable option.
In a further experiment, the Syn-net and LI-net models were further fine-tuned for a fixed number of lines, for nl ∈ {1, 10, 20} separately. From the obtained segmentation maps 10, we computed parameters 10 in the form of left ventricle (LV) end-systolic (ES) and end-diastolic (ED) volumes (ESV/EDV), right ventricle (RV) ESV/EDV, LV mass (LVM) and ejection fraction (EF). The mean percentage errors across all test subjects are set out in Table 1. It may be observed that the Syn-net example consistently performs better than the LI-net example, and has relatively small errors (< 7.7%) for all values for nl ∈ {10, 20}. Both examples showed low error for EF, where the correlation coefficient was 0.81 for both models for nl = 20.
Referring also to Figures 14A to 14C, examples of segmentation maps obtained using the LI-net example are shown for nl = 1, nl = 10 and nl = 20 respectively. For visualisation purposes, the segmentation maps 10 are shown overlaid on the fully sampled images.
Referring also to Figures 15A to 15C, examples of segmentation maps obtained using the Syn-net example are shown for nl = 1, nl = 10 and nl = 20 respectively. For visualisation purposes, the segmentation maps 10 are shown overlaid on the fully sampled images.
Referring also to Figure 16, the ground truth segmentation is shown for comparison. This is an example of the segmentation maps used to train the machine-learning model 9.
Referring also to Figures 17A to 17C, the input image data 3 for nl = 1, nl = 10 and nl = 20 are shown. In other words, the under-sampled image data 3 were input using representation (c). These images correspond to temporally averaged images for the x-y plane, which were obtained by combining all k-space lines across the temporal axis into a single kx-ky grid, then applying the first transform 14 in the form of the inverse Fourier transform.
It may be observed that the LI-net example produced more anatomically regularised, consistent segmentation maps. It may be observed that the Syn-net example produced segmentation maps that are occasionally anatomically implausible but more faithful to the boundary.
Although in theory it might be expected that the segmentation map would be independent of the aliasing artefact present in the input image data 3, this is not always the case. In order to obtain a measure of such variability, “within subject distances” may be defined: a fully-sampled image may be under-sampled differently for ntrial = 100 times. From the predicted segmentation maps 10 given by a model (e.g. Syn-net or LI-net), the mean shape may be computed, against which the mean contour distance (MD) and Hausdorff distance (HD) of the individual predictions may be calculated. Small distances indicate that the segmentation maps are consistent. However, if the machine learning model 9 simply produces a population mean shape independent of the input image data 3, then the above distances can be very low even without producing useful segmentations. To get a better picture, the between subject distances may also be measured, which computes MD and HD between the population mean shape (a mean predicted shape across all subjects) and the individual subject mean shapes. For the experiments described herein, nsubject = 100 subjects were used and the averaged distances are shown in Table 2.
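A sketch of the two contour distances is given below, assuming the contours are supplied as N×2 arrays of boundary point coordinates; extracting those boundary points from a segmentation map (for example with a marching-squares routine) is outside the scope of the sketch.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(contour_a, contour_b):
    """Symmetric Hausdorff distance (HD) between two contours."""
    return max(directed_hausdorff(contour_a, contour_b)[0],
               directed_hausdorff(contour_b, contour_a)[0])

def mean_contour_distance(contour_a, contour_b):
    """Symmetric mean contour distance (MD) between two contours."""
    d_ab, _ = cKDTree(contour_b).query(contour_a)   # nearest-neighbour distances
    d_ba, _ = cKDTree(contour_a).query(contour_b)
    return 0.5 * (d_ab.mean() + d_ba.mean())
```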
It may be observed that the LI-net example shows lower values for within subject distances, indicating that it produces more consistent segmentations than the Syn-net example (p ≪ 0.01, Wilcoxon rank-sum). High between subject distances indicate that both the Syn-net example and the LI-net example are generating segmentation maps closer to subject-specific means than to the population mean.
The latent space of the Syn-net and LI-net machine learning models 9 was also investigated. For 5 subjects, 50 under-sampled images were generated for each number of lines nl ∈ {1, 5, 10, 15, 20}. All under-sampled images had the same target segmentation map 10 per subject. For the LI-net example the predicted latent code h ∈ H was plotted for these images. For the Syn-net example, the activation map before the first up-sampling layer was plotted to see whether the network exploits any latent space structure for generating the segmentation maps 10.
Referring also to Figures 18 to 21, the results of the latent space investigation are visualised using Principal Component Analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) with d = 2. Data for each subject are plotted using distinctly shaped markers, and the size of the markers corresponds to the under-sampling fraction (i.e. a smaller marker corresponds to a smaller under-sampling fraction).
It may be observed that for the LI-net example, both for PCA and t-SNE, the latent space is clearly clustered by individual subjects, indicating that the predictions are indeed consistent for different under-sampling patterns. In addition, as the latent code is discriminative for each subject, it may enable fitting a classifier for subject-based prediction tasks.
On the other hand, for the Syn-net example, although there are per-subject clusters, there is also an observed tendency to favour clustering the data points by the under-sampling fraction, as seen in the t-SNE plot. It is thought that because Syn-net also exploits skip connections, the Syn-net example may exploit different reconstruction strategies for different acceleration factors. It may additionally be observed that in the PCA for the Syn-net example, the distances between all the points are reduced as the acceleration factor is increased. This is thought to mean that the latent features for Syn-net are less discriminative when images are heavily aliased. However, the extracted features appear to become gradually more discriminative as more rows 17 are sampled.
Modifications
It will be appreciated that many modifications may be made to the embodiments hereinbefore described. Such modifications may involve equivalent and other features which are already known in the design, training and application of machine-learning methods for image processing, and which may be used instead of or in addition to features already described herein. Features of one embodiment may be replaced or supplemented by features of another embodiment.
Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present invention also includes any novel features or any novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as does the present invention. The applicant hereby gives notice that new claims may be formulated to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims (21)

  1. A method, comprising:
    receiving image data corresponding to an under-sampling of an image which is:
    (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space;
    (b) represented in a plurality of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in
    the plurality of second spatial vector spaces to a representation in the first spatial vector space; or (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by
    transforming representation (b) using the second transform;
    using the image data as input to a trained machine-learning model configured to determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled representation of the image in the first spatial vector space;
    obtaining, as output of the trained machine-learning model, the one or more parameters and/or segmentation maps.
  2. A method according to claim 1, comprising:
    receiving the image data corresponding to representation (a) or (b);
    generating image data corresponding to representation (c); and using the image data corresponding to representation (c) as input to the trained machine-learning model.
  3. A method comprising applying the method of claims 1 or 2 to each frame of
    video data comprising a plurality of frames, wherein each frame corresponds to image data.
  4. A method comprising applying the method of claims 1 or 2 to a plurality of frames of video data concurrently, wherein each frame corresponds to image data.
  5. A method according to any one of claims 1 to 4, wherein the first transform is an inverse Fourier transform.
  6. A method according to any one of claims 1 to 4, wherein the second transform is a back-projection.
  7. A method according to any one of claims 1 to 5, wherein the image data is magnetic resonance image data, and wherein the non-spatial vector space corresponds to frequency space.
  8. A method according to any one of claims 1 to 6, wherein the image data is X-ray transmission data, and wherein the plurality of second spatial vector spaces correspond to a plurality of X-ray projection images obtained from different angles of incidence to an imaged object.
  9. A method according to any one of claims 1 to 8, further comprising measuring the image data.
  10. A method according to any one of claims 1 to 9, wherein the image data corresponds to a random under-sampling of the image when represented in:
    (a) the non-spatial vector space; or (b) the plurality of second spatial vector spaces.
  11. A method according to any one of claims 1 to 10, wherein the trained machine-learning model comprises a convolutional neural network.
  13. A method according to any one of claims 1 to 10, wherein the trained machine-learning model comprises a TL-network.
  14. A computer program product storing a machine-learning model for use in any one of claims 1 to 13.
  15. A method of training a machine-learning model to:
    receive input in the form of image data corresponding to an under-sampling of an image which is:
    (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space;
    (b) represented in a plurality of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces to a representation in the first spatial vector space; or (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by transforming representation (b) using the second transform;
    determine one or more parameters and/or segmentation maps of the representation of the image in the first spatial vector space, without reconstructing or estimating a fully-sampled representation of the image in the first spatial vector space;
    the method of training the machine-learning model comprising: receiving a training set comprising:
    a plurality of training images;
    for each training image, a corresponding one or more parameters and/or segmentation maps of the representation of the training image in the first spatial vector space;
    for each training image, generating a corresponding under-sampled training image based on an under-sampling fraction;
    wherein all of the under-sampled training images are:
    (a) represented in the non-spatial vector space;
    (b) represented in the plurality of second spatial vector spaces; or (c) represented in the first spatial vector space, wherein each representation in the first spatial vector space is obtained by transforming an under-sampled training image from representation (a) using the first transform or by transforming an under-sampled training image from representation (b) using the second transform;
    training the machine-learning model using the under-sampled training images as inputs and the corresponding one or more parameters and/or segmentation maps as the target output.
  16. A method according to claim 15, further comprising training the machine-learning model across a plurality of epochs, wherein for each epoch the under-sampled training images are generated according to a random sampling of the corresponding training images.
  17. A method comprising:
    training the machine learning model using the complete training images and the corresponding one or more parameters and/or segmentation maps;
    training the machine learning model according to claims 15 or 16.
  18. A method according to claim 17, wherein the under-sampling fraction is decreased from unity to a target under-sampling fraction across a plurality of training epochs.
  19. A method according to any one of claims 15 to 18, wherein the machine-learning model comprises a convolutional neural network.
  20. A method according to any one of claims 15 to 18, wherein the trained machine-learning model comprises a TL-network.
  21. A computer program product storing a machine-learning model trained according to any one of claims 15 to 20.
  22. Apparatus comprising:
    measurement means configured to obtain image data corresponding to an under-sampling of an image which is:
    (a) represented in a non-spatial vector space, wherein a first transform exists for mapping a representation of the image in the non-spatial vector space to a representation in a first spatial vector space;
    (b) represented in a plurality of second spatial vector spaces, wherein a second transform exists for mapping a representation of the image in the plurality of second spatial vector spaces to a representation in the first spatial vector space; or (c) represented in the first spatial vector space, wherein the representation in the first spatial vector space has been obtained by transforming representation (a) using the first transform or by transforming representation (b) using the second transform;
    a controller configured to process the image data according to the method of any one of claims 1 to 11.
GB1814649.8A 2018-09-10 2018-09-10 Image processing methods Withdrawn GB2576945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1814649.8A GB2576945A (en) 2018-09-10 2018-09-10 Image processing methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1814649.8A GB2576945A (en) 2018-09-10 2018-09-10 Image processing methods

Publications (2)

Publication Number Publication Date
GB201814649D0 GB201814649D0 (en) 2018-10-24
GB2576945A true GB2576945A (en) 2020-03-11

Family

ID=63921250

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1814649.8A Withdrawn GB2576945A (en) 2018-09-10 2018-09-10 Image processing methods

Country Status (1)

Country Link
GB (1) GB2576945A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428484A (en) * 2020-04-14 2020-07-17 广州云从鼎望科技有限公司 Information management method, system, device and medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112802024B (en) * 2021-01-11 2024-02-06 国创育成医疗器械发展(深圳)有限公司 Magnetic resonance blood vessel wall image segmentation method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292275A (en) * 2017-06-28 2017-10-24 北京飞搜科技有限公司 Face characteristic recognition methods and system that a kind of frequency domain is divided
EP3382417A2 (en) * 2017-03-28 2018-10-03 Siemens Healthcare GmbH Magnetic resonance image reconstruction system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3382417A2 (en) * 2017-03-28 2018-10-03 Siemens Healthcare GmbH Magnetic resonance image reconstruction system and method
CN107292275A (en) * 2017-06-28 2017-10-24 北京飞搜科技有限公司 Face characteristic recognition methods and system that a kind of frequency domain is divided

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
(USMAN et al) "Compressive manifold learning: Estimating one-dimensional respiratory motion directly from undersampled k-space data" Magnetic Resonance in Medicine 72:1130-1140 (2013) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428484A (en) * 2020-04-14 2020-07-17 广州云从鼎望科技有限公司 Information management method, system, device and medium

Also Published As

Publication number Publication date
GB201814649D0 (en) 2018-10-24

Similar Documents

Publication Publication Date Title
Xiang et al. Deep-learning-based multi-modal fusion for fast MR reconstruction
EP3123447B1 (en) Systems and methods for data and model-driven image reconstruction and enhancement
CN109523584B (en) Image processing method and device, multi-modality imaging system, storage medium and equipment
Terpstra et al. Deep learning-based image reconstruction and motion estimation from undersampled radial k-space for real-time MRI-guided radiotherapy
Wang et al. Review and prospect: artificial intelligence in advanced medical imaging
US11372066B2 (en) Multi-resolution quantitative susceptibility mapping with magnetic resonance imaging
Schlemper et al. Cardiac MR segmentation from undersampled k-space using deep latent representation learning
Zha et al. Naf: Neural attenuation fields for sparse-view cbct reconstruction
Yang et al. Generative Adversarial Networks (GAN) Powered Fast Magnetic Resonance Imaging--Mini Review, Comparison and Perspectives
US20220375073A1 (en) Technique for Assigning a Perfusion Metric to DCE MR Images
Mohebbian et al. Classifying MRI motion severity using a stacked ensemble approach
Cui et al. Motion artifact reduction for magnetic resonance imaging with deep learning and k-space analysis
GB2576945A (en) Image processing methods
WO2020113148A1 (en) Single or a few views computed tomography imaging with deep neural network
Terada et al. Clinical evaluation of super-resolution for brain MRI images based on generative adversarial networks
Kunz et al. Implicit neural networks with fourier-feature inputs for free-breathing cardiac MRI reconstruction
Nabavi et al. Automatic multi-class cardiovascular magnetic resonance image quality assessment using unsupervised domain adaptation in spatial and frequency domains
Sander et al. Reconstruction and completion of high-resolution 3D cardiac shapes using anisotropic CMRI segmentations and continuous implicit neural representations
US11846692B2 (en) Motion compensation for MRI imaging
Baumgartner et al. Fully convolutional networks in medical imaging: Applications to image enhancement and recognition
Bergvall et al. A fast and highly automated approach to myocardial motion analysis using phase contrast magnetic resonance imaging
CN115908610A (en) Method for obtaining attenuation correction coefficient image based on single-mode PET image
Xiao et al. A novel hybrid generative adversarial network for CT and MRI super-resolution reconstruction
CN116420165A (en) Detection of anatomical anomalies by segmentation results with and without shape priors
Yang et al. Generative Adversarial Network Powered Fast Magnetic Resonance Imaging—Comparative Study and New Perspectives

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: IMPERIAL COLLEGE OF SCIENCE, TECHNOLOGY AND MEDICI

Free format text: FORMER OWNER: IMPERIAL INNOVATIONS LIMITED

WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)