WO2020077198A1 - Image-based models for real-time marker-less tracking of biometric and motion information in imaging applications - Google Patents

Image-based models for real-time marker-less tracking of biometric and motion information in imaging applications

Info

Publication number
WO2020077198A1
WO2020077198A1 · PCT/US2019/055819 · US2019055819W
Authority
WO
WIPO (PCT)
Prior art keywords
features
motion
model
feature
data
Prior art date
Application number
PCT/US2019/055819
Other languages
English (en)
Inventor
Lalit Keshav MESTHA
Jeffrey N. Yu
Michael G. ENGELMANN
Original Assignee
Kineticor, Inc.
Priority date
Filing date
Publication date
Application filed by Kineticor, Inc. filed Critical Kineticor, Inc.
Publication of WO2020077198A1


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 - ICT specially adapted for the handling or processing of medical images
    • G16H30/40 - ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Definitions

  • the present disclosure relates to systems, methods, and techniques for training, validating, and applying image-based (e.g., feature-based) models used for biometrics and marker-less motion tracking. More specifically, these models may include machine-learning models that are capable of assessing images of a person captured in real-time in order to accurately determine and track the person’s movements and biometrics without the use of markers or sensors applied to the person. Such models may be especially useful for determining and tracking a patient’s movement and biometrics during a medical imaging and/or therapeutic procedure. The present disclosure also relates to systems, methods, and techniques for further improving the accuracy of these image-based models.
  • MRI: magnetic resonance imaging; CAT: computed axial tomography; PET: positron emission tomography; SPECT: single-photon emission computed tomography
  • the technique involves the use of an MRI scanner, which is a device with a powerful magnet.
  • a patient or a portion of the patient’s body is positioned within the MRI scanner, such that the magnetic field from the magnet can be used to align the magnetization of some atomic nuclei (usually hydrogen nuclei - protons) in the patient’s body.
  • Radio frequency magnetic fields are applied to systematically alter the alignment of the magnetization and cause those nuclei to produce a rotating magnetic field that the MRI scanner can detect and record in order to construct an image of the scanned region of the patient’s body. These scanning procedures may last from several minutes up to one hour, and any movement in the patient’s body can degrade or ruin the resulting images, which may require the scanning procedure to be repeated.
  • while radiation therapies can be applied to a targeted tissue region that is static, in some instances, radiation therapy can be dynamically applied in response to patient movement.
  • this dynamic application of radiation therapy may not have a high degree of accuracy. Accordingly, the use of radiation therapy in this manner can result in the unintentional application of radiation to non-targeted, healthy tissue regions. This issue may also be present for proton therapies and other therapeutic procedures.
  • One way to adapt these medical imaging techniques and therapeutic procedures for patient movement is to use motion tracking technology or techniques that accurately and precisely track the patient’s movement in real-time during the medical imaging and/or therapeutic procedure, and then continuously alter the medical imaging and/or therapeutic procedure in a way that eliminates or accounts for the patient’s movements.
  • one technique applicable for medical imaging is to place or affix one or more markers to one or more portions of a patient’s body, and to use one or more detectors (e.g., cameras) to observe those markers.
  • the markers may be configured to be uniquely identifiable and distinguishable by the detectors. This allows the markers to be used to determine the real-time movement (e.g., translation and rotation) of the body parts to which the markers are affixed.
  • in some cases, however, affixing a marker to a subject may not be possible.
  • Described herein are systems, methods, and techniques related to machine-learning techniques and image-based models that can be used for tracking and sensing (e.g., accurately predicting) the motion coordinates and biometrics of a patient’s body (in real-time) from captured video images, without the use of an external marker affixed to the patient.
  • These systems, methods, and techniques combine multiple processes and components together in order to maximize accuracy in determining (e.g., predicting) patient movement from sequential video images of the patient without an external marker.
  • the systems, methods, and techniques described herein may apply various image processing and filtering techniques on the captured video images before using information in the processed images and other forms of input data to obtain a multi-modal, multi-disciplinary (MMMD) feature representation.
  • the systems, methods, and techniques described herein may determine the multi-modal, multi-disciplinary (MMMD) feature representation using a MMMD Feature Engineering Framework.
  • the MMMD Framework may be capable of utilizing manual (e.g., hand-crafted) features as well as learned features, which involve features determined using automated feature learning techniques, such as deep feature learning algorithms.
  • Dimensionality reduction techniques such as feature extraction can be used in order to reduce the number of features (e.g., by generating a smaller set of higher-order features) and obtain a feature vector.
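  • As an illustration of this kind of dimensionality reduction, the following is a minimal sketch using principal component analysis from scikit-learn as a stand-in for whichever feature extraction technique is chosen; the array shapes and names are assumptions, not values from the disclosure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assume each row is the full set of raw features extracted for one
# video window (e.g., pixel statistics, learned features, patient data).
raw_features = np.random.rand(500, 2048)   # 500 samples, 2048 raw features

# Reduce to a smaller set of higher-order features that preserve most
# of the variance in the original feature set.
pca = PCA(n_components=64)
feature_vectors = pca.fit_transform(raw_features)   # shape: (500, 64)

print(feature_vectors.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```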
  • the systems, methods, and techniques described herein may provide the feature vector as an input to a motion and biometric signal detection model (e.g., a trained machine-learning regression model), which can map the features in the feature vector into motion coordinates in six degrees-of-freedom (6DOF) and biometric waveforms for the patient.
  • This determination can be performed in real-time (e.g., while a medical imaging technique or therapeutic procedure is being performed), which enables the output results to be instantaneously used to reduce or eliminate imaging artifacts or make corrections to the therapeutic procedure.
  • the outputted biometrics may be waveforms, which would not only inform about values, but also timing, phase, cycle, and amplitude. For instance, an outputted cardiac-related waveform may be analyzed in order to determine timing, phase, cycle, amplitude, and so forth.
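  • For instance, analysis of a predicted cardiac waveform for cycle timing, rate, and amplitude could proceed along the following lines (an illustrative SciPy sketch, not the method of the disclosure; the sampling rate and synthetic waveform are assumptions).

```python
import numpy as np
from scipy.signal import find_peaks

fs = 30.0                                   # frames per second (assumed camera rate)
t = np.arange(0, 20, 1 / fs)                # 20 seconds of samples
waveform = np.sin(2 * np.pi * 1.2 * t)      # stand-in for a predicted cardiac waveform (~72 bpm)

# Peak detection gives the timing of each cardiac cycle.
peaks, _ = find_peaks(waveform, distance=fs * 0.4)
cycle_lengths = np.diff(peaks) / fs         # seconds per cycle
heart_rate_bpm = 60.0 / cycle_lengths.mean()

# Peak-to-trough difference approximates the waveform amplitude.
amplitude = waveform.max() - waveform.min()

print(f"cycles: {len(peaks)}, heart rate ~{heart_rate_bpm:.1f} bpm, amplitude {amplitude:.2f}")
```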
  • a marker-less motion tracking system comprising: one or more cameras each configured to generate video data of an object, the video data comprising sequential image frames; a medical imaging device configured to perform imaging of the object in order to generate medical image data of the object; a machine-learning based algorithm configured to process the video data generated by each of the one or more cameras, determine features informative on motion or biometrics of the object during imaging performed by the medical imaging device, and generate object tracking data representative of at least one of: (i) motion of the object during imaging performed by the medical imaging device, and (ii) biometrics of the object during imaging performed by the medical imaging device; and an object tracking engine configured to provide the object tracking data to the medical imaging device to adjust imaging performed by the medical imaging device to account for the motion or biometrics of the object during the imaging.
  • the features include at least one of: (i) manual features, (ii) learned features from a deep feature learning algorithm, (iii) higher-order features, and (iv) dimensionality-reduced features generated from a dimensionality reduction technique.
  • the object tracking data is representative of the motion of the object during imaging and includes at least one of: (i) X-axis translation value, (ii) Y-axis translation value, (iii) Z-axis translation value, (iv) X-axis rotation value, (v) Y-axis rotation value, and (vi) Z-axis rotation value.
  • the system of claim 3, wherein the machine-learning based algorithm is associated with any of: (i) extreme learning machine, (ii) linear regression, (iii) logistic regression, (iv) autoregressive model, (v) autoregressive moving average model, (vi) convolution neural network model, (vii) autoencoder model, and (viii) sequential learning model.
  • the object is a human subject, and the features associated with biometrics of the subject during imaging are further associated with any of: (i) subject age, (ii) subject BMI, (iii) range of heart rate, (iv) range of respiration rate, (v) gender, (vi) feature maps, (vii) feature parameters, and (viii) learning algorithm parameters.
  • the machine-learning based algorithm generates the object tracking data by at least: processing the image frames in the video data generated by each of the one or more cameras to obtain processed image frames; determining a feature vector comprising the features, wherein the features are at least partially learned using a deep feature learning technique, and wherein the features include at least one feature determined from the processed image frames; and generating the object tracking data by mapping the feature vector to the object tracking data using a set of parameters determined through training.
  • processing the image frames comprises applying a spatial-temporal filter to remove non-rigid-body motion from the image frames.
  • one or more of the features are learned through feature learning.
  • the feature learning involves offline learning during training.
  • the machine-learning based algorithm generates object tracking data representative of motion of the object during imaging by converting a position and orientation of the object, both based on appearance of the object to each of the one or more cameras, to 6DOF motion coordinates or biometrics for the object.
  • the object is a human subject, and the features associated with motion of the subject during imaging are further associated with any of: (i) the video data generated by the one or more cameras, and (ii) multi-modal and multi-disciplinary features.
  • a method of motion compensation for a medical imaging device comprising: receiving, in real-time from each of one or more cameras associated with the medical imaging device, streamed video data for an object that contains motion of the object during imaging performed by the medical imaging device, wherein the video data comprises sequential image frames; processing the image frames in the streamed video data generated by each of the one or more cameras to obtain processed image frames; determining a feature vector comprising features informative on motion or biometrics of the object during the imaging performed by the medical imaging device, wherein the features are at least partially learned using a feature learning technique, and wherein the features include at least one feature determined from the processed image frames; generating, based on the feature vector, object tracking data representative of the motion of the object during the imaging; and causing an adjustment to the imaging of the object performed by the medical imaging device, wherein the adjustment accounts for the motion of the object during the imaging.
  • the features include at least one of: (i) manual features, (ii) learned features from a deep feature learning algorithm, (iii) higher-order features, and (iv) dimensionality-reduced features generated from a dimensionality reduction technique.
  • the object tracking data includes at least one of: (i) X-axis translation value, (ii) Y-axis translation value, (iii) Z-axis translation value, (iv) X-axis rotation value, (v) Y-axis rotation value, and (vi) Z-axis rotation value.
  • the object tracking data is generated using a machine-learning based algorithm associated with any of: (i) extreme learning machine, (ii) linear regression, (iii) logistic regression, (iv) autoregressive model, (v) autoregressive moving average model, (vi) convolution neural network model, (vii) autoencoder model, and (viii) sequential learning model.
  • the object is a human subject, and at least one of the features is associated with biometrics of the subject during imaging based on at least one of: (i) subject age, (ii) subject BMI, (iii) range of heart rate, (iv) range of respiration rate, (v) gender, (vi) feature maps, (vii) feature parameters, and (viii) learning algorithm parameters.
  • processing the image frames comprises applying a spatial-temporal filter to remove non-rigid-body motion from the image frames.
  • the feature learning technique involves offline learning during training.
  • the object tracking data is generated by converting a position and orientation of the object, both based on appearance of the object to each of the one or more cameras, to 6DOF motion coordinates or biometrics for the object.
  • the object is a human subject, and at least one of the features is associated with motion of the subject during imaging based on at least one of: (i) the video data generated by the one or more cameras, and (ii) multi-modal and multi-disciplinary features.
  • FIG. 1 is a system diagram that provides an overview of how an image-based biometrics and marker-less motion tracking system could be built and used, in accordance with embodiments of the present disclosure.
  • FIG. 2 is a block diagram of a simplified data processing pipeline applicable towards the initialization and real-time use of a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 3 illustrates an example full-size image frame captured using a camera, which can be used as raw input data for a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 4 illustrates an example of removing head coil pixels from a set of images, which can be used with a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 5 is a flow diagram for a process of performing spatial-temporal filtering for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 6 is a flow diagram for a process of performing spatial-temporal filtering for biometrics, marker-less motion, and marker-based motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 7 illustrates example graphs of filtered and unfiltered average pixel intensity signals for a video stream of a patient exhibiting both non-rigid-body and rigid-body motion, in accordance with embodiments of the present disclosure.
  • FIG. 8 illustrates example graphs of filtered and unfiltered individual pixel intensities and average pixel intensity signals associated with eye-blinking, in accordance with embodiments of the present disclosure.
  • FIG. 9 illustrates example image frames of a patient, both before and after application of the spatial-temporal filter associated with FIG. 8.
  • FIG. 10 is a diagram of a Multi-Modal and Multi-Disciplinary (MMMD) Feature Engineering Framework, in accordance with embodiments of the present disclosure.
  • FIG. 11 is a diagram of both an encoder architecture and a decoder architecture associated with an autoencoder, in accordance with embodiments of the present disclosure.
  • FIG. 12 is a schematic view of an example convolution operation performed in a convolution layer using one filter in a 2D Convolution Neural Network (CNN), in accordance with embodiments of the present disclosure.
  • FIG. 13 illustrates the difference in how video images would be used as input data between a 2D CNN and a 3D CNN, in accordance with embodiments of the present disclosure.
  • FIG. 14 illustrates example convolution operations performed within a 3D CNN, in accordance with embodiments of the present disclosure.
  • FIGS. 15-16 illustrate example feature maps that were obtained for various filters used in 3D CNNs, in accordance with embodiments of the present disclosure.
  • FIG. 17 illustrates an example set of the final features that were obtained in connection with the 3D CNNs associated with FIGS. 15-16.
  • FIG. 18 illustrates an example architecture for a single layer ELM network, in accordance with embodiments of the present disclosure.
  • FIG. 19 illustrates an example frame of sensor data including four images captured by their respective cameras and a region of interest (ROI) within the frame, in accordance with embodiments of the present disclosure.
  • FIG. 20 illustrates an example plot of normalized pixel intensities for a ROI over a period of time, in accordance with embodiments of the present disclosure.
  • FIG. 21 illustrates an example network structure of a deep feature learning algorithm employing CNN and a motion and biometric signal detection model employing ELM, in accordance with embodiments of the present disclosure.
  • FIG. 22 illustrates an example 2D visual representation of the features learned by a CNN, in accordance with embodiments of the present disclosure.
  • FIG. 23 illustrates an example plot of prediction results obtained from a model employing CNN and ELM, in accordance with embodiments of the present disclosure.
  • FIG. 24 is a block diagram of a stacked 2-layer autoencoder, in accordance with embodiments of the present disclosure.
  • FIGS. 25-26 illustrate example blood volume waveform prediction results based on feature learnings from two different deep learning algorithms and two different detection models, in accordance with embodiments of the present disclosure.
  • FIG. 27 illustrates an example architecture of a model for biometrics and marker-less motion tracking that employs a Convolution Neural Network, in accordance with embodiments of the present disclosure.
  • FIGS. 28-30 illustrate example results associated with a biometrics and marker-less motion tracking system.
  • FIG. 31 illustrates an example full-size image frame captured using four cameras, which can be used as raw input data for a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 32 illustrates an example architecture of a model for biometrics and marker-less motion tracking that employs an autoencoder, in accordance with embodiments of the present disclosure.
  • FIG. 33 is a block diagram of an encoder-decoder network, in accordance with embodiments of the present disclosure.
  • FIG. 34 illustrates an example plot of validation results of a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 35 illustrates an example table of key statistics associated with the results of a model for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 36 illustrates an example cross-correlation plot matrix for both marker-based and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • FIG. 37 illustrates a set of four images used in a training sample, in accordance with embodiments of the present disclosure.
  • FIG. 38 illustrates a set of sixty images used in a training sample, in accordance with embodiments of the present disclosure.
  • FIG. 39 illustrates example highlights of feature maps obtained from a CNN, in accordance with embodiments of the present disclosure.
  • FIG. 40 illustrates example highlights of feature maps obtained from a CNN, in accordance with embodiments of the present disclosure.
  • FIG. 41 illustrates example highlights of feature maps obtained from a CNN, in accordance with embodiments of the present disclosure.
  • FIG. 42 illustrates a set of four images used in a validation sample, in accordance with embodiments of the present disclosure.
  • FIG. 43 illustrates an example calibration pattern, in accordance with embodiments of the present disclosure.
  • FIG. 44 is a flow chart illustrating part of a process of developing a camera pose model, in accordance with embodiments of the present disclosure.
  • FIG. 45 illustrates an example table of characterized camera pose data, in accordance with embodiments of the present disclosure.
  • FIG. 46 is a flow chart illustrating part of a process of developing a camera pose model, in accordance with embodiments of the present disclosure.
  • FIG. 47 is a flow chart illustrating part of a process of developing a camera aggregation model, in accordance with embodiments of the present disclosure.
  • FIG. 48 is a block diagram of part of a process of developing a camera pose model, in accordance with embodiments of the present disclosure.
  • FIG. 49 is a block diagram of part of a process of developing a camera aggregation model, in accordance with embodiments of the present disclosure.
  • FIG. 50 illustrates an example plot of 6DOF coordinates generated by an AI-based camera pose model, in accordance with embodiments of the present disclosure.
  • FIG. 51 illustrates an example graph of results from an AI-based camera aggregation model, in accordance with embodiments of the present disclosure.
  • FIG. 52 illustrates an example table summarizing the validation statistics and efficacy from using a camera pose model with a camera aggregation model, in accordance with embodiments of the present disclosure.
  • FIG. 53 is a flow diagram illustrating an example method of motion detection, in accordance with embodiments of the present disclosure.
  • FIG. 54 is a block diagram depicting an illustrative computing device that can be used in accordance with embodiments of the present disclosure.
  • FIG. 55 illustrates the coordinate frames of a system for real-time adaptive medical scanning, in accordance with embodiments of the present disclosure.
  • Motion tracking techniques can be used that accurately and precisely track the patient’s movement in real-time during the medical imaging and/or therapeutic procedure itself, and then the medical imaging and/or therapeutic procedure can be altered in a way that eliminates or accounts for the patient’s movements.
  • Such high accuracy tracking can improve the imaging quality obtained and produced by diagnostic equipment, such as through prospective motion correction during a medical imaging scan.
  • Motion tracking can be used to track the head or brain of a patient, and the scan planes can be adjusted in real-time or near real-time such that they follow the movement, resulting in images without motion artifacts.
  • one available technique for motion tracking that is applicable for medical imaging is to place or affix one or more markers to one or more portions of a patient’s body, and to use one or more detectors (e.g., cameras) to observe those markers.
  • the markers may be configured to be uniquely identifiable and distinguishable by the detectors. This allows the markers to be used to determine the real-time movement (e.g., translation and rotation) of the body parts to which the markers are affixed.
  • this kind of motion tracking requires having an accurate set of reference points (e.g., provided by the markers), and affixing a marker to a subject may not always be possible.
  • these systems, methods, and techniques similarly utilize detectors to capture video or sequential image frames of the patient’s body as they move around. This image data, along with additional data from other sources, may be used to obtain features that capture all the relevant information embedded within all the available input data.
  • a trained model may be able to utilize this information in order to make a determination of the patient’s motion coordinates and biometrics based on relationships it uncovered from training data (e.g., transform the translations and rotations of a patient’s body part observed in the images into motion coordinates in actual space as required by the scanner and/or the therapeutic device).
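  • To make the last step concrete, mapping a position observed in a camera’s frame into the scanner’s coordinate frame can be expressed as a homogeneous transform, as in the generic sketch below; this is not the patent’s calibration method, and the transform and point values are placeholders.

```python
import numpy as np

# 4x4 homogeneous transform from the camera frame to the scanner frame,
# typically obtained from a calibration procedure (placeholder values here).
T_scanner_from_camera = np.array([
    [0.0, -1.0, 0.0, 120.0],
    [1.0,  0.0, 0.0, -35.0],
    [0.0,  0.0, 1.0, 400.0],
    [0.0,  0.0, 0.0,   1.0],
])

# A head position estimated in the camera frame (millimeters, placeholder).
p_camera = np.array([10.0, 4.0, 250.0, 1.0])

# The same point expressed in scanner coordinates, as needed by the scanner
# or therapeutic device for prospective correction.
p_scanner = T_scanner_from_camera @ p_camera
print(p_scanner[:3])
```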
  • image-based models can include feature-based models, and thus this disclosure is inclusive of feature-based biometrics and marker-less motion tracking, which may be referred to as dynamic feature-based motion control.
  • a predictive model may consist of some function with parameters (e.g., coefficients - often referred to as weights and biases within a machine-learning context) that can be applied to a set of one or more inputs in order to generate an output (e.g., a predictive value).
  • the values for the parameters of this predictive model are initially unknown. They can be determined through a process called training, in which the predictive model will receive training data and then select parameter values that best “fit” or make sense of the received training data.
  • the training data may include numerous data points, with each data point consisting of one or more input values and the corresponding one or more output values obtained from those input values.
  • the training data can be made up of actual data points collected from, and associated with, actual, real-world observations so that the predictive model can be modelled to reflect real-world occurrences.
  • the training data can be made up of data points considered to contain the best-case or ideal output values, so that parameter values can be selected to make the applied predictive model generate outputs (e.g., predictions) that are as close to the best-case or ideal outputs as possible.
  • the term “ground truth” is often used in the machine-learning context to refer to the observed or measured outputs used for the training data, which serve as the baseline towards which the predictive model aspires in its predictions.
  • the optimal parameter values that best fit the training data will not be immediately apparent. Instead, the optimal parameter values will have to be estimated (e.g., guessed), and there may be many different training techniques that can be used to train the predictive model and estimate the optimal parameter values. For instance, even for a simple linear model like in this example, there exist many different linear regression techniques, such as the method of ordinary least squares.
  • the underlying concept of many of these parameter estimation techniques is to select parameter values that would minimize the “error” between the ground truth (e.g., the data points in the training data) and the outputs generated from the application of the predictive model.
  • parameter values should be selected so that the predictive model generates predictions that are as close to the ground truth as possible.
  • An intuitive way to understand this is to visualize a scenario in which a line is fitted through a set of data points. The line (which will be captured by its parameter values) should be drawn in a way that minimizes the “spread” (e.g., the error) between the data points and this line.
  • these parameter values are often estimated through an iterative process, and often with the help of a computer (especially as the model becomes more complex or the amount of training data increases).
  • an equation can be constructed to calculate the “error” between the ground truth and the outputs of the predictive model, and an iterative optimization algorithm can be used to select parameters for the model that would minimize that equation (e.g., get values as close to zero as possible).
  • as the number of iterations performed by the computer increases, better and better guesses for the parameter values can be made, which will converge towards the optimal parameter values to use.
  • the best parameter values can then be put into the model.
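  • A minimal sketch of this kind of iterative parameter estimation for a simple linear model is shown below, using gradient descent on a mean-squared-error cost; the data are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=200)   # ground truth: w=3.0, b=0.5

w, b = 0.0, 0.0                # initial guesses for the parameters
lr = 0.1                       # learning rate

for _ in range(1000):          # iteratively refine the guesses
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the mean-squared "error" with respect to w and b.
    w -= lr * 2 * np.mean(error * x)
    b -= lr * 2 * np.mean(error)

print(f"estimated w={w:.3f}, b={b:.3f}")   # converges toward 3.0 and 0.5
```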
  • the validation data set will be made up of data points of actual, real-world observations for which the actual input and output values are known. In some cases, the validation data set can even be a portion of the training data set that was separated from the training set prior to training (and thus, not used in the training).
  • the input values for that data point can be provided to the model to use as inputs for generating a predicted output. That predicted output can then be compared against the actual output values for that data point, in order to determine how closely the predicted outputs match up to the actual outputs for the data points.
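  • Validation of such a model can be as simple as comparing predictions on held-out data points against their known outputs; the sketch below uses root-mean-square error and correlation as the comparison metrics, which is an assumption rather than anything specified in the disclosure.

```python
import numpy as np

def validate(model_fn, x_val, y_val):
    """Compare predicted outputs against known outputs for held-out data."""
    y_pred = model_fn(x_val)
    rmse = np.sqrt(np.mean((y_pred - y_val) ** 2))
    corr = np.corrcoef(y_pred, y_val)[0, 1]
    return rmse, corr

# Held-out data points with known outputs, and a trained linear model
# (placeholder parameter values, e.g., from the gradient-descent sketch above).
x_val = np.linspace(-1, 1, 50)
y_val = 3.0 * x_val + 0.5
rmse, corr = validate(lambda x: 3.01 * x + 0.49, x_val, y_val)
print(f"validation RMSE={rmse:.3f}, correlation={corr:.3f}")
```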
  • if the model is sufficiently accurate (e.g., the correct parameter values and function were chosen, the correct assumptions were made about the nature of the data, and so forth), then the model can be applied in real-time to make predictions from input data.
  • some kind of function can be selected for a model based on the suspected characteristics and nature of the training data being dealt with (e.g., a linear function if the data exhibits linearity).
  • the parameters for this function are unknown, but the inputs (e.g., the independent variables) and the outputs (e.g., the dependent variables) for the function can be defined based on some suspected relationship. Data points with the actual values for those inputs and outputs must also be available in the training data.
  • the model is then trained against the training data in order to determine the best parameter values for the function that provide the best mapping of inputs to outputs (e.g., minimizes the “error” between predictions and the ground truth).
  • the selection of the best parameter values requires a sufficiently large amount of training data and is often performed iteratively.
  • once the optimal parameter values have been determined (e.g., the model is trained), they can be added to the model, which is then validated to determine whether the trained model makes accurate predictions. If so, then the trained model may be trusted enough to be applied in real-time and make predictions from available input data.
  • This paradigm is generally the same one applied in many machine-learning techniques (e.g., supervised learning algorithms) that are directed to regression.
  • supervised learning algorithms are used to build a mathematical model from training data containing a large number of training examples (e.g., data points) that provide both the inputs and the desired outputs, and the resulting model can be applied to make predictions based on relationships that have been uncovered in the training data.
  • the supervised learning algorithm chosen for the task may depend on many factors, such as the overall context surrounding the model, the model’s intended applications, the nature and characteristics of the training data, the suspected relationships between the inputs and outputs, and so forth.
  • the inputs used in the mathematical model are referred to as “features” (e.g., analogous to the independent or explanatory variables used in a linear regression).
  • a feature can be thought of as an individual measurable property or characteristic of a phenomenon being observed, or as some type of information which is relevant for solving the computational task related to a certain application (e.g., it is information that has some kind of predictive value in determining the desired output).
  • each training example (e.g., data point) may be representable by an array or vector of features, which is sometimes called a feature vector.
  • the supervised learning algorithm can determine the parameters of the mathematical model that can be used to predict outputs associated with new input values it has not seen before (e.g., that were not a part of the training data). The accuracy of these predictions may improve over time as new training examples are added to the training data and the parameters are updated.
  • a specific structure in the image may be used as a feature, such as points, edges, or even objects in the image.
  • An“interesting” part of an image can be used as a feature, such as features related to motion in image sequences, to shapes defined in terms of curves or boundaries between different image regions or to properties of such a region, to edge direction, and to changing intensity.
  • Some features may be even more abstract (e.g., abstractions of image information), such as features that are the result of a general neighborhood operation or feature detection applied to the image.
  • Feature detection is a technique used to make local decisions at every image point (e.g., point by point) on whether there is an image feature of a particular type at that point, for image features such as edges, corners, blobs, and ridges.
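  • A sketch of this kind of local feature detection using OpenCV is shown below (edges and corners); the threshold and corner-quality parameters are illustrative assumptions, and a random image stands in for a captured frame.

```python
import cv2
import numpy as np

# A stand-in grayscale frame (in practice, one captured camera frame).
frame = (np.random.rand(240, 320) * 255).astype(np.uint8)

# Edge features: local decisions about strong intensity changes at each pixel.
edges = cv2.Canny(frame, threshold1=50, threshold2=150)

# Corner features: points where intensity changes strongly in two directions.
corners = cv2.goodFeaturesToTrack(frame, maxCorners=100,
                                  qualityLevel=0.01, minDistance=5)

print("edge pixels:", int((edges > 0).sum()))
print("corners found:", 0 if corners is None else len(corners))
```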
  • Higher-order features can also be used (e.g., a feature that is derived from another feature or a combination of features), particularly if there are too many available features to choose from and some of them are redundant. This is useful because having too many features (e.g., inputs) in the model increases its complexity and the amount of memory and computation power required to train the model. There may also be a problem of overfitting the model to the training data when too many features are used as inputs, which may result in a model that has poor predictive power when applied to new data.
  • the redundancies in the features can be eliminated using techniques such as feature selection or feature extraction, in order to select a combination/subset of features or to generate a reduced set of derived features.
  • the resulting reduced set of features is expected to still contain all the relevant information among the entire set of features and convey existing relationships in the data with sufficient accuracy, which can facilitate training efficiency and improve the generalization of the model to new data.
  • These techniques of feature selection or feature extraction are similar to the concept of dimensionality reduction.
  • the markers serve as unique reference points that enable sequentially-captured video images of the patient’s head having the marker visible to be used to determine the movement of the patient’s head in six degrees-of-freedom (also referred to herein as 6DOF, which consists of X, Y, and Z translation and X, Y, and Z rotation) over a period of time, based on its relationship to the measurable positions and angles of the markers over that period of time.
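  • For reference, this is roughly how a marker-based system can recover a rigid-body transform (and hence 6DOF motion) from tracked marker positions; the sketch below is a generic Kabsch-style estimation, not the specific method of the cited patents, and the marker coordinates are placeholders.

```python
import numpy as np

def rigid_transform(ref_pts, cur_pts):
    """Estimate rotation R and translation t mapping reference marker
    positions to their current positions (Kabsch algorithm)."""
    ref_c, cur_c = ref_pts.mean(axis=0), cur_pts.mean(axis=0)
    H = (ref_pts - ref_c).T @ (cur_pts - cur_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # correct for an improper (reflected) solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cur_c - R @ ref_c
    return R, t

# Three non-collinear marker positions (mm) at the reference time...
ref = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
# ...and the same markers after a small head movement (placeholder values).
cur = np.array([[1.0, 0.2, 0.0], [10.9, 0.9, 0.0], [0.3, 10.2, 0.1]])

R, t = rigid_transform(ref, cur)
print("translation (X, Y, Z):", t)
# Z rotation in degrees extracted from the rotation matrix (one of the three rotations).
print("Z rotation:", np.degrees(np.arctan2(R[1, 0], R[0, 0])))
```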
  • the features (e.g., inputs) used in the model for determining (e.g., predicting) movement of a patient’s head would have to be associated with something else, such as an observable landmark that is common across the images captured for all the patients.
  • the nose of a patient could be relied on in lieu of a marker, since a patient’s nose is affixed to the head and there should be a relationship between nose movement and head movement.
  • a machine-learning model could be trained that utilizes higher-order features associated with the positioning and movement of the nose of the patient, and that model may be able to account for the differences in the noses of different patients and reliably identify the nose and its movement and map that to head movement.
  • it may be important for the model to have extremely accurate predictive capability (e.g., accurate within a number of microns) due to its potential applications in medical imaging and therapeutic procedures (e.g., potentially life-or-death situations).
  • a model that takes into account the positioning and movement of the nose of the patient may generate results that are close to the desired output for patient movement, but still not close enough for use in medical imaging and therapeutic procedures.
  • ideally, the model would consider the information among all the available, relevant features (e.g., in this example, there may be additional features beyond those associated with positioning/movement of the nose that could be added to the model in order to improve the accuracy of the predictions).
  • Feature engineering is a term used to describe the process of using domain knowledge of the data (e.g., its nature and characteristics) in order to create and select the features (e.g., inputs) that make machine-learning algorithms work well.
  • the process of feature engineering is performed iteratively and consists of the steps of: brainstorming or testing features, deciding what features to create, creating those features, checking how those features work with the model, improving the features if needed, and then going back to brainstorm/create/implement additional features until the desired result is achieved (e.g., the model’s predictions attain a desired level of accuracy). This process can be both difficult and time consuming.
  • as an alternative, automated feature learning (e.g., techniques enabling a computing system to automatically learn and discover features to use for the model on its own behalf) may be performed using machine-learning techniques (e.g., neural networks).
  • the systems, methods, and techniques may apply various image processing and filtering techniques on the captured video images before using information in the processed images to obtain a multi-modal, multi-disciplinary (MMMD) feature representation (determined using automated feature learning and feature extraction) that can be input into a motion and biometric signal detection model (e.g., a machine-learning regression model), which is capable of mapping the features into motion coordinates in six degrees-of-freedom (6DOF) and biometric waveforms for the patient.
  • FIG. 1 illustrates a system diagram that provides an overview of how an image- based motion and biometrics tracking system could be built and used, in accordance with embodiments of the present disclosure.
  • FIG. 1 illustrates a medical imaging system 150 with a medical imaging device 152, a patient 154, and one or more detectors 156.
  • the medical imaging device 152 is configured to scan a volume 158 (which contains a portion of the patient 154, e.g., the patient’s head during a medical imaging procedure) and generate medical images of the portion of the patient 154 contained in the volume 158.
  • the medical imaging device 152 may include various components (not shown) for controlling the medical imaging procedure, generating the medical images, and reducing image artifacts on the basis of patient movement.
  • the one or more detectors 156 may be configured to collect images of the patient 154 in real-time, such as by capturing video that comprises sequential video images of the portion of the patient 154 in volume 158.
  • the one or more detectors 156 may be cameras for collecting video or images of the patient 154. In some implementations, there may be a total of four or six cameras used. When multiple detectors are used, they can be arranged in various positions in order to simultaneously capture images of the patient 154 from different angles.
  • the relative position of the one or more detectors 156 to the medical imaging device 152 and the volume 158 may be known beforehand.
  • the relative positions of the one or more detectors 156 can be the same as the relative positions of the detectors used to collect the training data 120, which may serve to reduce the number of variables that must be considered for tracking patient motion and biometrics.
  • the medical imaging system 150 may send the images collected by the one or more detectors 156 of the patient 154 to an image-based motion and biometrics tracking system 100 as the images are collected (e.g., streamed in real time).
  • the image-based motion and biometrics tracking system 100 may include various components, such as an image processing module 102; a trained feature transformation model 104 and a corresponding feature engineering module 108; and a trained regression model 106 and a corresponding regression model training module 110.
  • the image-based motion and biometrics tracking system 100 may be configured to ingest training data 120.
  • the training data 120 may include a large set of training examples that can be used by the image-based motion and biometrics tracking system 100.
  • the nature of the training data 120 can vary between implementations, but in some implementations, the training data 120 may consist of video (e.g., consisting of sequential video images) captured of various patients (e.g., the portion of the patient in the volume 158 as they move around), using a similar configuration as the one or more detectors 156.
  • the image data in each instance (e.g., associated with a video) may serve as the raw input data for that training example.
  • the training data 120 may also include, as raw inputs, additional information or knowledge that was collected about the patient in each training example.
  • additional information or knowledge may include the patient’s age, body mass index (BMI), gender, height, range for normal resting heart rate, range for normal respiratory rate, whether the patient has a beard, and so forth.
  • This patient information can be relevant if these aspects of a patient’s physiology somehow affect how the biometric and motion coordinates are generated (e.g., these aspects of the patient’s physiology are an implicit part of the relationship that exists between sequential image data of the patient and the patient’s motion).
  • the training data 120 may also include the desired outputs for each instance or training example, which may be the actual, known motion coordinates (“actual” meaning best-available estimate, since an exact determination may not be possible) for that portion of the patient in 6DOF (e.g., X, Y, and Z translation and also X, Y, and Z rotation) over the period of time in the corresponding video.
  • These motion coordinates will serve as the ground truth that is used to teach the overall relationship that exists between sequential image data of a patient captured from the one or more detectors 156 and the 6DOF motion coordinates for the patient’s motion.
  • one method for obtaining the motion coordinates is to also affix a marker to the patient’s body when collecting the training data 120 for that patient.
  • the collected images can be processed to separate the parts of the images having the marker from the parts of the images without the marker (sometimes referred to herein as the “region of interest”, or ROI).
  • the images without the marker can serve as the raw input data for the training example, while the images with the marker can be used to determine the motion coordinates for the patient’s movement in 6DOF - such as by using techniques described in the families for U.S. Patent Nos. 8,121,361, 9,305,365, 9,717,461, 9,734,589, and 10,004,462.
  • the ground truth for training marker-less models could be motion coordinates of the patient that were determined from a marker-based model.
  • Other ground truths may include the use of FID navigators or programmable robots with precise motion that can be recorded.
  • the desired outputs could also include the known biometric waveforms (e.g., associated with heart rate, heart rate variability, pulse transit time, pulse wave velocity, blood pressure, oxygen saturation, prospective gating, psychophysiological state of the subject, etc.) of the patient over the period of time in the corresponding video.
  • biometric waveforms will serve as the ground truth that is used to teach the overall relationship that exists between sequential image data of a patient captured from the one or more detectors 156 and their biometric waveforms.
  • the shade or color of a patient’s face may change slightly over time in a way that would be imperceptible to the human eye (e.g., as blood pumped by the heart travels through vessels in the patient’s face or changes color based on oxygen saturation), but those tiny changes may be captured in the image data and compared against the actual, recorded biometric waveforms for the patient in order to teach how tiny changes in the shade/color of a patient may be related to changes in the patient’s biometrics.
  • one method for obtaining the actual biometric waveforms of a patient is to record the patient’s biometrics using the appropriate sensor and device (e.g., an EKG monitor, SpO2 monitor, etc.) while images of the patient are collected for the training data 120.
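  • The idea that subtle frame-to-frame color changes carry a pulse signal can be sketched as a generic remote-photoplethysmography computation, as below; this is only an illustration of the concept, not the model described in the disclosure, and the ROI coordinates, camera rate, and filter band are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 30.0  # assumed camera frame rate (frames per second)

def pulse_signal(frames, roi=(slice(100, 160), slice(80, 140))):
    """Average the green channel over a facial ROI in each frame, then
    band-pass filter to a typical heart-rate band (0.7-3.0 Hz)."""
    trace = np.array([frame[roi][:, :, 1].mean() for frame in frames])
    b, a = butter(3, [0.7, 3.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, trace - trace.mean())

# frames: list of HxWx3 video frames captured during the scan (synthetic here).
frames = [np.random.rand(240, 320, 3) for _ in range(int(fs) * 10)]
signal = pulse_signal(frames)
print(signal.shape)
```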
  • the image-based motion and biometrics tracking system 100 may process the training data 120 using an image processing module 102.
  • the image processing module 102 may be configured to perform various kinds of image processing and/or apply various kinds of filters to the images, in order to make the images easier to work with or remove noise from the images (thereby increasing accuracy). For instance, the image processing module 102 may be configured to resize images or remove irrelevant portions of the images, in order to reduce the size of the images and reduce the computational load on the image-based motion and biometrics tracking system 100.
  • the image processing module 102 may also be configured to filter out certain kinds of patient motion from the images, which may improve the results of the motion tracking and determination performed by the image-based motion and biometrics tracking system 100.
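  • One simple way to picture filtering out certain kinds of motion (such as brief, non-rigid events like eye blinks) is to smooth each pixel's intensity over time. The sketch below is only a schematic stand-in for the spatial-temporal filter described herein, with an assumed window length.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

# video: (num_frames, height, width) stack of grayscale frames.
video = np.random.rand(300, 86, 86)

# A temporal moving average over each pixel suppresses short-lived,
# non-rigid-body changes (e.g., eye blinks) while keeping slower
# rigid-body motion; a window of 15 frames (~0.5 s at 30 fps) is assumed.
filtered = uniform_filter1d(video, size=15, axis=0)

print(filtered.shape)   # same shape as the input video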
  • the feature engineering module 108 may be configured to perform one or more machine-learning techniques associated with automated feature learning (e.g., techniques enabling a computing system to automatically learn and discover features to use for the model on its own behalf) on the processed images and other available information in the training data (e.g., about patient physiology), in order to determine and select the best set of features (which together convey all the relevant information that is encapsulated in the images and any other available input data) to be used by a regression model for calculating motion coordinates and biometrics.
  • the feature engineering module 108 may make this determination of the best set of features to use through an iterative process. Some of these features may be higher-order features (e.g., derived in some manner from one or more other features).
  • the feature engineering module 108 may provide the algorithms for generating and extracting these features from processed image data (and any other available information, such as patient physiology) to the trained feature transformation model 104, which can apply those algorithms in real-time to obtain feature vectors from the processed images it receives.
  • the feature engineering module 108 may additionally be provided with a set of manually-determined features (e.g., associated with observable characteristics that we suspect may have predictive value), such as features associated with each patient’s physiology (when data for it is available in the training data 120).
  • the feature engineering module 108 may perform dimensionality reduction (e.g., feature selection and feature extraction) in order to make a determination of the best set of features to use from all the features available to it (e.g., from among both the set of manually-determined features and the set of deep-learning features it learned via automated feature learning).
  • the feature engineering module 108 may generate feature vectors of those features from the processed images of the training examples in training data 120.
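  • A schematic example of a learned feature extractor is shown below: a small PyTorch CNN that maps one processed image frame to a fixed-length vector of learned features. The layer sizes and feature count are illustrative assumptions, not the architecture of the disclosure.

```python
import torch
import torch.nn as nn

class LearnedFeatureExtractor(nn.Module):
    """Maps one processed grayscale frame to a vector of learned features."""
    def __init__(self, num_features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 21 * 21, num_features),   # 86x86 input -> 21x21 after two poolings
        )

    def forward(self, x):
        return self.net(x)

# One processed 86x86 frame (batch of 1, single channel).
frame = torch.rand(1, 1, 86, 86)
features = LearnedFeatureExtractor()(frame)
print(features.shape)   # torch.Size([1, 64])
```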
  • the feature engineering module 108 may provide the feature vectors to the regression model training module 110 for use as inputs.
  • there may be higher-order features used in the set of features, which preserve the relevant information (while eliminating any redundancies) that may exist between the image data obtained from different detectors (e.g., from different positions/angles).
  • the feature vector provided to the regression model training module 110 may inherently factor in the existence of multiple detectors by including the relevant information across the images recorded by all of the detectors.
  • the regression model training module 110 may utilize machine-learning techniques (e.g., regression-based supervised machine-learning techniques) for training a regression model to process the feature vectors (that it received from the feature engineering module 108) and desired outputs (e.g., ground truth) for each training example in the training data 120, in order to determine the best parameter values (e.g., weights and biases) to use for mapping feature vectors into 6DOF motion coordinates and biometrics.
  • the regression model training module 110 may make this determination of the optimal parameters to use through an iterative process.
  • the regression model training module 110 may also validate these parameters against validation data, which may be new data (similar in format to the training data 120) that was not used in the training, in order to see how well these parameters work for calculating motion coordinates and biometrics when generalized to new, never-before-seen data. Once the optimal parameters (e.g., weights and biases) have been determined, the regression model training module 110 may provide the parameters for use in the trained regression model 106, which can use those parameters in real-time to convert feature vectors that it receives into 6DOF motion coordinates and biometrics that can actually be interpreted.
  • in essence, the feature engineering module 108 is being used to determine a best set of initially-unknown features to provide to the regression model training module 110 based on processed images in the training data 120, while the regression model training module 110 is being used to determine a best set of initially-unknown parameters (e.g., weights and biases) for mapping (e.g., via a mapping function) the best set of initially-unknown features to the desired outputs in the training data 120. Since each affects the other and, also, given the amount of information that is initially unknown, the feature engineering module 108 and regression model training module 110 may be used together in combination (e.g., in a stacked configuration) to determine the best set of features and parameters to use.
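  • A minimal sketch of the regression step, trained to map feature vectors to 6DOF coordinates, is shown below; it uses a multi-output ridge regression from scikit-learn purely as a stand-in for whichever regression algorithm is chosen, and the shapes and data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Feature vectors produced by the feature engineering step, plus the
# corresponding ground-truth 6DOF coordinates (X/Y/Z translation + rotation).
X = np.random.rand(1000, 64)          # 1000 training examples, 64 features each
y = np.random.rand(1000, 6)           # synthetic ground truth for illustration

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0)              # Ridge supports multi-output regression directly
model.fit(X_train, y_train)           # "training": estimate weights and biases

pred = model.predict(X_val)           # validate against held-out examples
rmse = np.sqrt(np.mean((pred - y_val) ** 2, axis=0))
print("per-axis validation RMSE:", rmse)
```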
  • the medical imaging system 150 may send the images collected by the one or more detectors 156 of the patient 154 to the image-based motion and biometrics tracking system 100 as the images are collected during a medical imaging procedure (e.g., streamed in real-time).
  • the image processing module 102 may perform image processing and filtering on the received images, which may include the previously-mentioned examples described with regards to the processing and filtering of the training data 120 (e.g., resizing of images).
  • the processed images may be sent to the trained feature transformation model 104, which will extract feature vector(s) for the previously-determined best set of features from the image data.
  • the trained feature transformation model 104 will send the feature vector(s) to the trained regression model 106, which will apply a mapping function with the previously-determined best set of parameters on the feature vector(s) to generate 6DOF motion coordinates and biometrics of the patient (it should be noted that the trained regression model 106 is a predictive model, and thus, it may be more precise to say the trained regression model 106 is actually making “predictions” for what the 6DOF motion coordinates and biometrics of the patient should be).
  • the image-based motion and biometrics tracking system 100 can then send that information back to the medical imaging system 150, which may be able to use the 6DOF motion coordinates associated with the patient’s movement to make corrections to the medical imaging procedure, remove image artifacts, and so forth.
  • in FIG. 2, a simplified data processing pipeline is shown, for which the general structure is applicable towards the initialization scenario of feature learning/determination and model training, as well as the real-time processing of medical imaging data for the tracking of motion and biometric signals in a medical imaging procedure.
  • a feature extraction component and a trained regression model may be used.
  • the trained regression model may be referred to as a marker-less motion and biometric signal detection model.
  • there may also be an image processing component that performs some preprocessing and processing, such as resizing video frames, filtering noise, and so forth, before passing the video frames to the feature extraction component.
  • the feature extraction component processes the medical image data (e.g., the video frames) to extract salient features.
  • the feature extraction component relies on configuration information such as the types of features, the number of features, the structure of the algorithm, parameter values, and so forth.
  • this information is generated based on methods described herein (e.g., through a MMMD Feature Engineering Framework, which can utilize automated feature learning and dimensionality reduction).
  • the information may be stored in a data store or configuration as described.
  • the tuned model structure (e.g., of the marker-less motion and biometric signal detection model), the type of algorithm, and its parameters may be taken as additional inputs.
  • the system may generate features over a sliding window (sliding over time) for a pre-determined window width (e.g., 5 seconds, 10 seconds, etc.) and, in some instances, a pre-determined overlap time. Sliding can be done on a frame by frame basis or at time intervals (e.g., 1 second or less) depending on the bandwidth of output signals.
  • the training may identify a region of interest within a frame for feature extraction.
  • the feature extraction component may pass a feature vector to the trained motion and biometric signal detection model, which is configured to take the feature vector and convert it into 6DOF motion coordinates and biometric signals for output.
  • image processing may be performed on the image data obtained from the cameras (e.g., the detectors 156) in order to make the images easier to work with (thereby improving computational efficiency and speed, which benefits real-time applications) or to remove noise from the images (thereby improving accuracy of the results).
  • Image processing may be applied to the image data in the training data (e.g., the training data 120) for model training, but it can also be applied in real-time applications (e.g., to process images as they are received from the detectors and/or the medical imaging device). There may be numerous image processing and filtering steps that can be applied in combination, and some of the steps may be optional.
  • the images may be resized to a smaller size in order to reduce the size of the data.
  • an image may be resized to be 86 x 86 pixels.
  • images may also be down-sampled or downgraded in image quality in order to make the images smaller and/or faster to process. In some cases, this may result in pixelation in the resulting image.
  • the approach described herein for biometrics and marker-less motion tracking may obtain accurate results despite this pixelation in the images.
  • image data may be normalized prior to its use as input to various machine-learning algorithms, which may rely on normalized inputs (e.g., values between 0 and 1). In some cases, the outputs of these machine-learning algorithms may be de-normalized. A sketch of these resizing and normalization steps follows below.
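  • As an illustrative, non-limiting sketch of the resizing, down-sampling, and normalization steps described above (the use of OpenCV/NumPy and the 86 x 86 target size are assumptions drawn from the examples in this disclosure, not a required implementation):

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 86) -> np.ndarray:
    """Resize a camera frame and normalize pixel intensities to [0, 1]."""
    # Collapse color channels to a single intensity channel if present.
    if frame.ndim == 3:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Down-sample to a small fixed size; some pixelation is acceptable here.
    small = cv2.resize(frame, (size, size), interpolation=cv2.INTER_AREA)
    # Normalize 8-bit intensities to the [0, 1] range expected by many models.
    return small.astype(np.float32) / 255.0
```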
  • images of patients may include markers affixed to the patient, which would not be used in predicting patient motion. Instead, the markers may be used for determining ground truth (e.g., if the images are to be added to training data).
  • the markers can be cropped from the image frames (e.g., to be separately saved) to leave images without the markers, such that no information associated with the marker can be used as inputs in the models for biometrics and marker-less motion tracking.
  • FIG. 3 illustrates the removal of markers from a set of four images captured by four cameras, resulting in the processed set of four images without the markers.
  • the marker-less motion and biometrics tracking may be applied to medical imaging performed using clinical MRI scanners.
  • There can be many head coils that are used in clinical MRI scanners to provide patient comfort during the medical imaging procedure. These head coils and any other obstructions may be in view of the cameras and end up in the images captured of the patient.
  • head coils are affixed to the patient table or medical imaging device and they do not move when the patient moves.
  • head coils in view of a camera should remain static in the images captured by the camera over the duration of the medical imaging procedure.
  • the pixels for the head coils can be removed from the images in order to reduce the amount of information in the images that is irrelevant to the problem of marker-less motion and biometrics tracking.
  • FIG. 4 illustrates the removal of head coil pixels from a set of four images 402 captured by four cameras, resulting in the processed set of four images 404 without the head coil pixels.
  • a technician may be able to use an application on a computer in order to trace the outline of the head coil in one frame of video data at the beginning of the medical imaging procedure and create a mask for removing those pixels from all the other frames.
  • an algorithm can be used to go through all the frames of an entire video, identify which pixels remain static for the entire duration of the video, and remove those pixels from all the frames of the video.
  • an edge-finding or edge-detection algorithm can be used to determine the boundaries of the head coil in the images in order to remove the head coil pixels from the images.
  • the cameras associated with various medical imaging devices can be similarly positioned, and the locations of the head coil pixels in images collected by the cameras of one of those medical imaging devices may be the same as with the other medical imaging devices (e.g., of that model), which allows for the locations of the head coil pixels to be known by default.
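  • One possible realization of the static-pixel approach described above (identifying pixels that remain static for the entire duration of a video and removing them) is sketched below; the use of a temporal-variance threshold and NumPy is an assumption for illustration only:

```python
import numpy as np

def static_pixel_mask(frames: np.ndarray, var_threshold: float = 1.0) -> np.ndarray:
    """Return a boolean mask of pixels that stay (nearly) constant across a video.

    frames: array of shape (num_frames, height, width) of pixel intensities.
    Static pixels (e.g., a head coil fixed to the patient table) have very low
    temporal variance over the duration of the video.
    """
    return frames.var(axis=0) < var_threshold

def remove_static_pixels(frames: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the masked (static) pixels in every frame."""
    cleaned = frames.astype(np.float32)
    cleaned[:, mask] = 0.0
    return cleaned
```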
  • non-rigid-body motion is essentially motion in the video that does not correspond to motion of the patient's head or body parts (e.g., changes in their position or rotation).
  • Examples of non-rigid-body motion include eyebrow raising, eye blinking, eye left-right movement, nose squinting, frowning, motion due to coughing, forehead scrunching, shuddering, crossing legs, and so forth.
  • In MRI imaging, the presence of non-rigid-body motion in the motion signal can result in artifacts.
  • the present disclosure provides a method to filter out non-rigid-body motion directly from the video stream and retain rigid-body motion. This method is based on the underlying assumption that, for any particular pixel in a video, very sudden changes in that pixel’s intensity are associated with non-rigid-body motion, whereas more gradual changes in pixel intensity are associated with rigid-body motion.
  • a smoothing function can be applied to reduce any sudden changes in pixel intensity without affecting any gradual changes in pixel intensity that would be associated with rigid-body motion.
  • a spatial-temporal filter can be used to perform filtering in both spatial and temporal dimensions. It can also be used in both prospective and retrospective motion correction systems (retrospective motion correction modifies the image data during the reconstruction using measured motion coordinates while prospective motion correction performs an adaptive update during the medical imaging procedure, e.g., in real-time).
  • a spatial-temporal filter may also be used to remove non-rigid-body motion for both marker-less and marker-based motion tracking, with the distinction that the filtering for non-rigid-body motion is carried out at the output stage (e.g., after the computing of 6DOF motion coordinates) in marker-based tracking due to various hardware constraints. This distinction can be better understood by comparing FIGS. 5 and 6.
  • FIG. 5 is a flow diagram for a process of performing spatial-temporal filtering for biometrics and marker-less motion tracking, in accordance with embodiments of the present disclosure.
  • video frames (e.g., a video consisting of sequential images) are received, either from training data (e.g., if processing the training data for model building) or from a streaming device or camera system associated with a medical imaging device (e.g., if processing for a real-time application, such as an ongoing medical imaging procedure).
  • a region of interest (ROI) consisting of a group of pixels is selected from the video frames for the non-rigid-body motion correction to be performed.
  • the intensity of one pixel, or the mean pixel intensity of a group of pixels, is selected for filtering.
  • a spatial-temporal filter is applied to the intensity associated with the selected pixel (or group of pixels).
  • This spatial-temporal filter is configured with appropriate parameters to specifically remove non-rigid-body motion from the intensity associated with that pixel (or group of pixels).
  • Blocks 506, 507, and 508 loop application of the spatial-temporal filter until the intensity for all the pixels (or groups of pixels) in the ROI has been filtered.
  • the filtered pixel intensity values can then be used to obtain features for use in generating 6DOF motion coordinates and biometrics values associated with the video frames.
  • An example spatial-temporal filter for this marker-less motion tracking may be as follows.
  • Consider a pixel intensity signal represented by a variable, $z_{ij}$, composed of two components for pixel location $ij$, where $i$ is the x-spatial pixel coordinate and $j$ is the y-spatial pixel coordinate:

    $z_{ij} = z_{ij(\mathrm{filtered})} + z_{ij(\mathrm{noise})}$

  • $z_{ij(\mathrm{filtered})}$ is the temporal signal we would like to retain and $z_{ij(\mathrm{noise})}$ is the temporal noise component we would like to reject.
  • For marker-less tracking, the signal is one-dimensional and is modeled by a parameter vector, $\theta_{ij}$, chosen to minimize the cost function of Equation (3), which is a linear combination of a goodness-of-fit term and a smoothness term weighted by a regularization parameter $\lambda_f$:

    $J(i,j) = \left\| z_{ij} - \theta_{ij} \right\|^2 + \lambda_f \left\| D_d\,\theta_{ij} \right\|^2 \qquad (3)$

  • $D_d$ is the approximation used to represent the $d$-th derivative operator.
  • Using matrix calculus (algebra not shown because of complexity), we can solve Equation (3) and obtain the least squares solution as follows:

    $\hat{\theta}_{ij} = \left( I + \lambda_f D_d^{\top} D_d \right)^{-1} z_{ij}$

  • For a temporal window of $N$ samples, $D_2$ will be of size $N \times N$.
  • The parameter vector then becomes the filtered signal, which is given by:

    $z_{ij(\mathrm{filtered})} = \hat{\theta}_{ij}$
  • the parameter, $\lambda_f$, is used to adjust the time-varying frequency response of the filter.
  • This parameter adjusts the frequency response of the signals so that it can be programmed to have spatial-temporal dependency and will be designed to automatically vary based on pixel intensity.
  • This spatial filter operates like a time-varying high pass Finite Impulse Response (FIR) filter by removing lower frequency components.
  • Faster response is useful for filtering lower frequencies and retaining higher frequency components during fast transition in pixel intensity (e.g., associated with fast motion).
  • lower values of $\lambda_f$ are more desirable for fast motion and higher values for slow/no motion.
  • the function is approximated in the temporal direction and the smoothing parameter is made a function of pixel intensity to provide filtering in all three directions (the x- and y-directions for spatial and time for the temporal direction). As the intensity changes due to motion, the smoothing parameter will be automatically adapted to the varying intensity of each pixel.
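  • A minimal sketch of this kind of regularized least-squares (Whittaker-style) smoother is shown below, assuming a second-difference operator and a smoothing parameter `lam` that could be made intensity-dependent as described above; the exact operator order and adaptation rule used in a given embodiment may differ:

```python
import numpy as np

def difference_matrix(n: int, order: int = 2) -> np.ndarray:
    """Build a d-th order difference operator approximating the d-th derivative."""
    d = np.eye(n)
    for _ in range(order):
        d = np.diff(d, axis=0)
    return d

def smooth_pixel_signal(z: np.ndarray, lam: float) -> np.ndarray:
    """Regularized least-squares smoother: theta = (I + lam * D^T D)^(-1) z.

    z:   temporal intensity signal for one pixel (or a pixel-group mean).
    lam: smoothing parameter; larger values smooth more (slow/no motion),
         smaller values follow the signal more closely (fast motion).
    """
    n = len(z)
    d = difference_matrix(n, order=2)
    return np.linalg.solve(np.eye(n) + lam * d.T @ d, np.asarray(z, dtype=float))
```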
  • FIG. 6 is a flow diagram for a process of performing spatial-temporal filtering for biometrics, marker-less motion, and marker-based motion tracking, in accordance with embodiments of the present disclosure. Since marker-based motion tracking provides the ground truth data, it is often necessary to apply spatial-temporal filtering to remove any undesirable motion due to non-rigid-body motion present in the ground truth data.
  • video frames (e.g., a video consisting of sequential images) are received, either from training data (e.g., if processing the training data for model building) or from a streaming device or camera system associated with a medical imaging device (e.g., if processing for a real-time application, such as an ongoing medical imaging procedure).
  • a region of interest (ROI) consisting of a group of pixels is selected from the video frames for the non-rigid-body motion correction to be performed.
  • a pixel group is selected within the scanline for spatial grouping.
  • Temporal grouping is done with multiple frames. In some implementations, especially for prospective motion correction, temporal grouping may not be available. Under those circumstances, only spatial grouping is done. Multiple scan lines within the image frame will then become the signal of interest.
  • a spatial-temporal filter is applied to the pixel group.
  • This spatial-temporal filter is configured with appropriate parameters to specifically remove non-rigid-body motion from the intensity associated with the pixel group.
  • Blocks 606, 607, and 608 loop application of the spatial-temporal filter until the filter has been applied to every pixel group within the ROI.
  • the filtered pixel intensity values can then be used to obtain features for use in generating marker centroids, and at block 610, the marker centroids can be converted to 6DOF motion coordinates.
  • An example spatial-temporal filter for this marker-based motion tracking may be as follows.
  • the spatial-temporal filter used in marker-less tracking can be reformulated, this time with a two-dimensional signal so that the filtering is affected by one smoothness parameter in both directions (the x and y directions, with N as the number of pixels in the x direction and M as the number of pixels in the y direction).
  • Equation (3) is a linear combination of two cost functions with a regularization term using a parameter $\lambda_f$:

    $J(i,j) = J_1(i,j) + \lambda_f J_2(i,j) \qquad (3)$

  • The cost function, $J_1(i,j)$, is used to represent how best to follow the signal (or goodness of fit) and is given by:

    $J_1(i,j) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( z_{ij} - \theta_{ij} \right)^2$

  • $J_2(i,j)$ is used to represent the measure of smoothness of the signal and is given by:

    $J_2(i,j) = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( D_d\,\theta_{ij} \right)^2$
  • FIG. 7 illustrates graphs of filtered and unfiltered average pixel intensity signals for a video stream of a patient exhibiting both non-rigid-body and rigid-body motion.
  • FIG. 7 shows the average pixel intensity signal (e.g., the average intensity of all the pixels in a frame) calculated after the filtering occurs, since that is easier to interpret.
  • the ROI selected was of size ⁇ 128 x 128 ⁇ pixels to show the efficacy of the approach.
  • FIG. 8 illustrates graphs of the filtered and unfiltered individual pixel intensities associated with eye-blinking, as well as a graph of the filtered and unfiltered average pixel intensity signal associated with eye-blinking.
  • graph 802 shows the unfiltered individual pixel intensity signals over the duration of a video in which there is eye-blinking
  • graph 804 shows the individual pixel intensity signals over the duration of the video after a spatial-temporal filter has been applied.
  • the individual pixel intensity signals have been significantly smoothed out by the spatial-temporal filter.
  • Graph 806 shows the unfiltered and filtered average pixel intensity signals, which show a similar effect: the average pixel intensity signal after filtering has been smoothed out.
  • FIG. 9 illustrates the actual image frames associated with FIG. 8, before the spatial-temporal filter is applied and after the spatial-temporal filter is applied.
  • the eye blink in the patient is removed and those frames look like the patient has their eye open the entire time. This shows the end result from removing non-rigid-body motion.
  • feature engineering is the process of using domain knowledge of the data (e.g., its nature and characteristics) in order to create and select the features (e.g., inputs) that make machine-learning algorithms work well.
  • Feature engineering can be done manually, but it also can be done using automated feature learning (e.g., techniques enabling a computing system to automatically learn and discover features to use for the model on its own behalf).
  • In some embodiments, a Multi-Modal and Multi-Disciplinary (MMMD) Feature Engineering Framework may be used for feature engineering.
  • the MMMD Feature Engineering Framework may be utilized in an image-based marker-less motion and biometrics tracking system, such as by the feature engineering module 108 in FIG. 1.
  • FIG. 10 is a diagram of the Multi-Modal and Multi-Disciplinary (MMMD) Feature Engineering Framework, in accordance with embodiments of the present disclosure.
  • the depicted MMMD Framework shows how information contained across heterogeneous data (e.g., data obtained from different sources and having varying characteristics, such as images captured from many different cameras) can be used and considered for the marker-less motion and biometric sensing problem.
  • there may be information available about the internal state of the cameras during image capture (e.g., on-times).
  • there may be information known about a patient's physiology, which can be collected by hand or through various medical devices.
  • a large, initial set of lower-level features may be directly obtained from all the information available within the available pool of heterogeneous data, and this entire set of lower-level features may numerically or symbolically capture all the information within the data that may be relevant to the marker-less motion and biometric sensing problem (e.g., inform about the relationship that exists between sequentially captured images and actual motion coordinates and biometric waveforms).
  • Some of the features in this initial set of lower-level features may be manual features (e.g., features that have been brainstormed by a person on the basis of their prior domain knowledge of the data) that have been provided to the MMMD framework. Some of these manual features may be traditional knowledge-based features that are based on available knowledge about the patient, the cameras, and so forth. Some of these manual features may be statistical features that are calculated from the image data. These manual features may be selected based on existing domain knowledge of motion tracking used in MR/CT/PET/Proton therapy systems, and they are discussed in more detail herein. Examples include manual features obtained from patient information (e.g., associated with the patient's age, body mass index, gender, height, range for normal resting heart rate, range for normal respiratory rate, whether the patient has a beard, and so forth) and manual features obtained from the video and image data (e.g., statistical descriptors, optical flow vectors, biomarkers correlated to motion, and so forth).
  • some of the features in the initial set of lower-level features may be learned features (e.g., features that are automatically learned and discovered through an automated feature learning technique), such as features learned from the video or image data using shallow or deep learning (e.g., artificial neural network) techniques.
  • deep feature learning algorithms that are well-suited for learning features from the image data may include convolutional neural networks (CNNs) and autoencoders, and some of their various implementations will be discussed in further detail herein.
  • dimensionality reduction techniques may be performed on the initial set of lower-level features, such as feature selection (filtering redundant features by selecting a subset of the initial set of features) and/or feature extraction (creating a smaller set of brand new higher-order features from the initial set of features). For instance, feature extraction may be used in order to generate a smaller set of higher-order features that are derived from the features in the initial set of low-level features (e.g., "features of features"), which would reduce the dimensionality in part by removing redundancies among the initial set of lower-level features.
  • any raw input data may be transformed into its set of higher-order features and used to generate a feature map (e.g., basis vectors, feature vector dimension, feature parameters, feature types, etc.) for mapping the transformation of raw input data into a feature vector that a regression model can use.
  • the feature map can be used later in order to transform raw input data into feature vectors for the training data or input data received in real-time.
  • raw input data in training data may be transformed into a feature vector that includes the set of higher-order features, or during the real-time application of an image-based marker-less motion and biometrics tracking system, input data (e.g., streamed video or images) received in real-time may be transformed into a feature vector that includes the set of higher-order features based on the feature map.
  • the feature vector may be considered an output that is in a form usable by a regression model.
  • In some cases, only the learned features (e.g., those resulting from the deep feature learning algorithm) may be used; the deep feature learning algorithm may be used to automatically select the learned features, and effort can instead be spent on designing the algorithm and its associated network for performing automatic feature engineering.
  • the manual features may be derived from a variety of input data, including image data (e.g., pixel intensities), physical geometries (e.g., wing size), logicals with semantic abstractions such as yes and no, interaction features, and other biological variables (e.g., gender).
  • Manual features that can be obtained from the video or image data may include statistical descriptors (e.g., max, min, mean) associated with the image data, as well as different orders of moments calculated over a window of pixel time-series data and its corresponding Fourier (FFT) spectrum.
  • each camera image consists of pixels of varying pixel intensities. Since motion contains a temporal element, in order to capture the temporal or dynamic aspects of the underlying measurement system, the feature calculations may be performed over a sliding window "w" (e.g., sliding over time) of pre-determined width (e.g., 10 seconds). For a 60 frames per second (fps) camera system, each sliding window would contain 600 frames of data.
  • Features can be calculated for the images associated with each individual camera (e.g., images for one camera, then images for a second camera, and so forth) to obtain individual or univariate features, or they may be calculated based on the combined images of all the cameras (e.g., images for all four cameras) in order to obtain multivariate/interactive features.
  • Statistical descriptors can be calculated by grouping pixels for this window segment of camera pixel intensities.
  • Several examples of the statistical features which may be generated include median pixel intensity, standard deviation of pixel intensities, maximum pixel intensity, difference between maximum pixel intensity and minimum pixel intensity, or maximum absolute difference in total pixel intensities across / between successive frames.
  • Other statistical descriptors that are computationally convenient to process, such as skewness, kurtosis, and so forth, can also be used; a sketch of how such window-level statistics might be computed follows below.
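  • The window-level statistical descriptors listed above might be computed as sketched below; the 600 x 86 x 86 window shape mirrors the 10-second, 60 fps example and is an assumption, not a requirement:

```python
import numpy as np

def window_statistics(window: np.ndarray) -> dict:
    """Statistical descriptor features for one sliding window of frames.

    window: array of shape (num_frames, height, width), e.g., 600 x 86 x 86
    for a 10-second window at 60 fps.
    """
    totals = window.reshape(len(window), -1).sum(axis=1)  # total intensity per frame
    return {
        "median_intensity": float(np.median(window)),
        "std_intensity": float(window.std()),
        "max_intensity": float(window.max()),
        "max_minus_min": float(window.max() - window.min()),
        "max_abs_frame_diff": float(np.abs(np.diff(totals)).max()),
    }
```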
  • Some of the univariate features calculated for images associated with a particular camera can include a few principal components obtained by performing singular value decomposition on a covariance matrix containing spatial-temporal pixel intensities within the sliding window.
  • Other univariate features from the images may include optical flow vector features.
  • Optical flow vectors generally approximate how much each pixel’s brightness changes between adjacent images as the patient moves. Assuming all temporal changes in pixel brightness (e.g., over time) are due to motion only, the system can compute differential optical flow using known algorithms (e.g., Lucas and Kanade or Horn and Schunck). Since the patient motion does not result in shape changes, optical flow captures real motions of the patient rather than expansions, contractions, deformations and/or shears.
  • Flow vectors in 2D contain magnitude and angle, and both of these components may be incorporated in optical flow vector features that are associated with a selected region of interest in every frame.
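  • As a hedged illustration, dense optical flow between consecutive frames can be computed with an off-the-shelf routine and summarized over a region of interest; the Farneback algorithm is used here purely for convenience (the disclosure mentions Lucas-Kanade and Horn-Schunck as known alternatives), and the ROI summary by mean magnitude and angle is an assumption:

```python
import cv2
import numpy as np

def optical_flow_features(prev_frame: np.ndarray, next_frame: np.ndarray,
                          roi: tuple) -> np.ndarray:
    """Mean flow magnitude and angle within a region of interest.

    prev_frame, next_frame: consecutive 8-bit grayscale frames.
    roi: (row_start, row_end, col_start, col_end) of the region of interest.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    r0, r1, c0, c1 = roi
    fx = np.ascontiguousarray(flow[r0:r1, c0:c1, 0])
    fy = np.ascontiguousarray(flow[r0:r1, c0:c1, 1])
    magnitude, angle = cv2.cartToPolar(fx, fy)
    return np.array([magnitude.mean(), angle.mean()])
```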
  • Multivariate features may be calculated based on the combined images of all the cameras in order to capture the relationships that may exist among multiple cameras and multiple groups of features; these relationships can be determined from offline training, along with pixel intensities from each camera.
  • a first group of multivariate-based features may be directed to capturing the relationships between pairs of cameras, and can be defined using previous domain knowledge or learned from the image data.
  • the relationships can be, for example, the frame- by-frame difference or the ratio of total pixel intensities within a predetermined region of interest (ROI) or the ratio of principal components, and can also be covariance of pixels within a region of interest between two cameras.
  • Region of interest may be determined based on one or many factors such as: (1) target size (e.g., pixels) covered by the ROI; (2) regions containing more structure (e.g., eye brows, eyes, hairs, nose, mouth, chin); and (3) regions without markers.
  • the number of features resulting in this group is equal to the number of relations defined.
  • the relations may be generated by grouping pixels from an entire sliding window, depending on computational convenience, the required motion bandwidth, and the sensitivity to 6DOF quantities or biometric waveforms. The grouping can be learned or predetermined with offline training and associated with a specific subject or instrument.
  • Learned Features (e.g., Deep Feature Learning Algorithms)
  • Some of the features in the initial set of lower-level features may be automatically learned and discovered from the available video or image data, such as by using deep learning techniques to drive automated feature learning algorithms.
  • Deep learning has attracted tremendous research attention and has proven to achieve outstanding performance in many applications in domains such as computer vision, speech recognition, and natural language processing.
  • deep learning involves learning good representations (e.g., features) of data through multiple levels of abstraction.
  • By hierarchically learning features layer-by-layer, with higher-level features representing more abstract aspects of the data, deep learning can discover sophisticated underlying relationships in the data and features that are more informative and more robust to variations.
  • Examples of deep learning techniques that are especially well-suited for learning features from the image data are various types of artificial neural networks, such as autoencoders and convolutional neural networks (CNNs).
  • An autoencoder is a type of artificial neural network that is used to learn a representation (e.g., an encoding or feature) for a set of data in an unsupervised manner.
  • Autoencoders are often used for dimensionality reduction and may serve as improved alternatives to other known dimensionality reduction techniques, such as Principal Component Analysis (PCA), because autoencoders are capable of learning non-linear features as opposed to linear principal components.
  • an autoencoder could be adapted for use as a deep feature learning algorithm on image data in order to generate a set of learned features from the image data (e.g., for the first stage 1010 of the MMMD Framework shown in FIG. 10).
  • the overall principle used in an autoencoder is to first perform a reduction (e.g., downsampling) on the original input data into a reduced encoding (e.g., features) of the original input data, and then subsequently perform a reconstruction (e.g., upsampling) on the reduced encoding in order to generate from it a representation that is as close as possible to the original input data.
  • an autoencoder algorithm generally has two parts: the encoder and decoder.
  • the encoder encodes the input ($x$) into a hidden or feature representation ($y$) and the decoder decodes that representation back into a reconstructed input ($\hat{x}$).
  • FIG. 11 shows an encoder architecture 1110 and a decoder architecture 1120 for a deep learning algorithm with three hidden layers.
  • The encoder function may be expressed as in Equation 1:

    $y = f(Wx + b) \qquad (1)$

    where $W$ is an $r \times n$ weight matrix, $b$ is a vector of dimension $r$, and $f(\cdot)$ in Equation 1 represents the non-linear activation function.
  • The activation function may be expressed as a logistic sigmoid function such as that shown in Equation 2:

    $f(a) = \frac{1}{1 + e^{-a}} \qquad (2)$

  • Equation 3 is an example expression of a decoder function:

    $\hat{x} = f(W'y + b') \qquad (3)$

  • A decoder function maps the hidden representation back to the original input space (e.g., in order to validate how well the hidden representation captures the original input), where $W'$ is an $n \times r$ weight matrix and $b'$ is a vector of dimension $n$.
  • The optimal solution may be generated for an optimization equation such as that shown in Equation 4:

    $\min_{W,\,b,\,W',\,b'} \; \sum_{k=1}^{N} \left\| x_k - \hat{x}_k \right\|^2 \qquad (4)$

  • The term inside the summation of Equation 4 represents the error between the original input ($x$) and the reconstructed input ($\hat{x}$).
  • The optimization occurs over $N$ training points.
  • the autoencoder setup described and shown in FIG. 11 may be only one type amongst many variants of autoencoders, including the stacked denoising autoencoder, where multiple encoding and decoding layers are stacked one on top of the other to obtain deep layers, advocating denoising as part of the training criteria to achieve better representation capabilities.
  • Decoder equations may be associated with each of the three hidden layers shown in FIG. 11. Example expressions are shown in Equations 5 and 6.
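  • The encoder/decoder structure described by Equations 1 through 4 can be sketched in plain NumPy as follows; the layer size, random initialization, and the absence of a training loop are illustrative simplifications, not the tuned configuration discussed later:

```python
import numpy as np

def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))

class SimpleAutoencoder:
    """Single hidden layer autoencoder: y = f(Wx + b), x_hat = f(W'y + b')."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_hidden, n_inputs))        # r x n
        self.b = np.zeros(n_hidden)                                      # dimension r
        self.W_prime = rng.normal(scale=0.1, size=(n_inputs, n_hidden))  # n x r
        self.b_prime = np.zeros(n_inputs)                                # dimension n

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Equation 1: hidden/feature representation y = f(Wx + b)
        return sigmoid(self.W @ x + self.b)

    def decode(self, y: np.ndarray) -> np.ndarray:
        # Equation 3: reconstruction x_hat = f(W'y + b')
        return sigmoid(self.W_prime @ y + self.b_prime)

    def reconstruction_error(self, x: np.ndarray) -> float:
        # Term inside the summation of Equation 4
        x_hat = self.decode(self.encode(x))
        return float(np.sum((x - x_hat) ** 2))
```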
  • Convolutional Neural Networks are another type of deep learning algorithm, and they are most commonly applied to analyzing visual imagery and have proven to be very successful frameworks for image recognition and other computer vision tasks, such as the recognition of handwritten characters in optical character recognition (OCR).
  • the potential is there to adapt CNNs for use as a deep feature learning algorithm on image data in order to generate a set of learned features from the image data (e.g., for the first stage 1010 of the MMMD Framework shown in FIG. 10).
  • a convolution layer consists of a set of learnable convolution kernels (e.g., filters) that have small receptive fields, which are trained to identify the presence of specific types of image features at the same spatial position in the input data.
  • the kernels extract low dimensional features from the input data.
  • one convolution kernel may be configured to detect any horizontal edges in the input data (e.g., an image) and will identify all the spatial positions where a horizontal edge exists.
  • the Convolution Neural Network can use different convolution kernels or filters in an early convolution layer in order to identify the existence and positions of different low-level image features, such as edges and curves, and then build up to identifying features associated with more abstract concepts through additional series of convolutional layers.
  • CNN By hierarchically learning features, layer by layer, with higher-level features representing more abstract aspects of the data, CNN can be used to generate very sophisticated features and insights about the input data.
  • the input volume received by a convolutional layer can be convolved with a set of convolution kernels or filters (e.g., the initially-unknown parameters that are learned via training).
  • Each filter will be configured to detect if a specific type of feature is present at a specific spatial position in the input volume, and an activation map for that filter will be generated that indicates the presence of the specific type of feature in different spatial positions. Stacking the activation maps for all the filters will form the output volume of that convolution layer, which can be passed on to the next layer.
  • FIG. 12 illustrates a schematic view of an example convolution operation performed in a convolution layer using one filter in a 2D Convolution Neural Network, in which a 2D convolution is applied in the 2D space to capture spatial features in the input image.
  • an input image 1210 and a normalized input image 1220 of size ⁇ 86 x 86 ⁇ pixels is shown.
  • the CNN would not "see" either of those two images. Instead, it would see an array or matrix 1230 of pixel values (in this case, an 86 x 86 matrix), such as pixel intensities (with a value from 0 to 255 describing the pixel intensity at that point).
  • An element-wise convolution operation can be performed between the matrix 1230 of pixel intensities and a convolution filter 1240.
  • the filter 1240 will also be an array of numbers (which are the parameters in this instance), such as a 3x3 filter.
  • the size of the filter must be the same as the size of the receptive field of the matrix 1230 that the values in the filter 1240 are multiplied against.
  • the filter is then convolved, or slid, around the matrix 1230 to calculate the dot product of values in the filter with the corresponding values in the matrix 1230 (e.g., element wise multiplications). This is performed for every possible location for the matrix 1230.
  • the matrix 1230 is 86x86 and the filter 1240 is 3x3, so there would be a total of 84x84 locations that the filter 1240 could be.
  • a convolved image window 1250 (also referred to as an activation map or feature map) is obtained.
  • the final convolved image 1260 corresponding to the convolved image window 1250 is also shown.
  • the resulting convolved image 1260 is the image feature map which preserves the spatial relationship between pixels.
  • a convolutional layer usually contains multiple feature maps so that multiple features can be detected. Depending on the type of filter chosen, the feature map could result in an image containing only edges, curves, and so forth.
  • the values for the convolution kernels and filters used (e.g., convolution filter 1240) in the convolution layers can be manually specified based on pre-existing knowledge of good parameters to use for this kind of image classification, but more likely, the values for the convolution kernels/filters will be initially unknown. Thus, the convolution kernels/filters will not be configured to specifically look for special image features such as edges and curves. Instead, the parameters for the convolution kernels/filters can be determined as in previous machine-learning models, e.g., determined iteratively from a large set of training data through a process called backpropagation, in which an optimization algorithm is used to select the best parameters over a number of iterations.
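  • The element-wise convolution described above (sliding a 3x3 filter over an 86x86 matrix of pixel intensities to produce an 84x84 activation map) can be sketched as follows; this is a deliberately unoptimized "valid" convolution written out for clarity, and the horizontal-edge filter values are an illustrative assumption:

```python
import numpy as np

def convolve2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a small filter over the image and take element-wise dot products.

    For an 86x86 image and a 3x3 kernel this yields an 84x84 activation map.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            activation[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return activation

# Example: a simple horizontal-edge filter applied to a normalized 86x86 frame.
edge_filter = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=np.float32)
activation_map = convolve2d_valid(np.random.rand(86, 86).astype(np.float32), edge_filter)
```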
  • a 2D CNN may yield suitable enough results to be applied in the context of medical imaging and therapeutic procedures.
  • individual image frames can be convolved into a corresponding convolved image capturing the important, relevant information contained in that specific image frame.
  • the 2D CNN is not considering the dimension of time since it is processing the image data frame-by-frame and each frame is a single slice in time. This may pose a problem because movement is something that occurs over time.
  • the images provided to the 2D CNN may be video frames that consist of contiguous 2D frames captured one after another at fixed frame rate, and performing 2D convolution on multiple sequential images does not model motion accurately since the operation does not consider changes from neighboring image frames at the time of convolution operation.
  • a 3D Convolution Neural Network may be used instead of a 2D CNN, which may yield even better results (e.g., more accurate 6DOF motion coordinates and biometrics) because a 3D CNN is more effective for modeling volumetric data, which contiguous image frames may be regarded as.
  • FIG. 13 illustrates the difference in how video images would be used as input data between a 2D CNN and a 3D CNN.
  • For a 2D CNN, the 2D image frames 1320 taken from a video are each individually used to generate a corresponding convolved image.
  • sequential image blocks 1320 can be formed by stacking multiple contiguous video frames together (e.g., if a 2D image frame is 86x86, then Z sequential image frames can be stacked together to form a 86x86xZ cuboid).
  • Each image block can serve as a cuboid kernel used as the input volume for the 3D CNN.
  • additional image blocks can be generated in a staggered manner (e.g., a second image block may include some of the image frames in the first image block, and so forth).
  • FIG. 14 illustrates example convolution operations performed within a 3D CNN.
  • a cuboid kernel (e.g., 86x86x86 in size) may be convolved with a cuboid convolution kernel/filter (e.g., 3x3x3).
  • convolution is performed using the 3x3x3 filter simultaneously in the spatial and time dimensions.
  • the filter slides in all 3-directions (x & y - spatial, and z - time) to compute feature maps instead of only 2- directions (x & y - spatial only) as in 2D CNN.
  • the output is a 3-dimensional volume space such as cube or cuboid. Therefore, a 3D CNN operating on several contiguous 2D image frames can model the time evolution more accurately.
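  • A sketch of stacking contiguous frames into cuboid kernels and applying a 3D convolution is shown below; the block depth, stagger, and the use of PyTorch's Conv3d are assumptions for illustration rather than the specific network used in the experiments described later:

```python
import numpy as np
import torch
import torch.nn as nn

def stack_image_blocks(frames: np.ndarray, depth: int, stride: int) -> np.ndarray:
    """Stack contiguous 2D frames into overlapping cuboids.

    frames: (num_frames, 86, 86) -> (num_blocks, depth, 86, 86), with
    consecutive blocks staggered by `stride` frames.
    """
    starts = range(0, len(frames) - depth + 1, stride)
    return np.stack([frames[s:s + depth] for s in starts])

# A 3D convolution layer with 3x3x3 cuboid filters; each filter slides in the
# x and y (spatial) directions and the z (time) direction simultaneously.
conv3d = nn.Conv3d(in_channels=1, out_channels=10, kernel_size=3)

frames = np.zeros((120, 86, 86), dtype=np.float32)
blocks = stack_image_blocks(frames, depth=30, stride=15)
volume = torch.from_numpy(blocks).unsqueeze(1)   # (num_blocks, 1, 30, 86, 86)
feature_maps = conv3d(volume)                    # (num_blocks, 10, 28, 84, 84)
```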
  • FIGS. 15-16 illustrate example feature maps that were obtained for various filters used in 3D CNNs.
  • the output 1510 shows feature maps generated from 10 filters in the spatial direction (e.g., for 86x86 image frames), for just one image frame.
  • the number of filters depends on the number of frames. Since 30 frames were used for each cuboid kernel in the time direction, the output 1520 resulted in 27 activation maps generated for just one of the filters of the 3D CNN.
  • FIG. 16 shows the resulting time evolution of the feature maps for 10 different filters for one of the planes (the 40th) superimposed with normalized motion coordinates. It can be seen that the features change with time during the segment where there is motion and remain constant in the absence of motion. These features are further processed inside two fully connected networks (the first with dimension 20 and the second with dimension 6) as in a normal convolutional neural network.
  • FIG. 17 illustrates the final 6 features obtained from the 3D CNN associated with FIGS. 15-16. These 6 learned features can be used for training or validating a regression model. For instance, these 6 learned features generated by the 3D CNN can be combined with additional features, such as manual features for centroid of pixels (F3) and centroid of pixel light intensity (F7), via the first stage 1010 of the MMMD Framework shown in FIG. 10. Additional dimensionality reduction may also be performed on the combined set of features, and the generated feature vector can be converted into 6DOF motion coordinates and biometrics through the trained regression model.
  • FIG. 17 illustrates graphs of the 6DOF motion coordinates calculated using the features against ground truth motion coordinates, which in this case are normalized motion coordinates obtained from a marker-based system. As shown, the 6DOF motion coordinates derived from the features match closely with the normalized motion coordinates obtained from the marker-based system.
  • a 4D Convolution Neural Network may even be used.
  • a 4D Convolution Neural Network could utilize an additional dimension in addition to the dimensions of spatial and time, such as dimension of channels (e.g., RGB) or a dimension of wavelengths (e.g., if each of the cameras utilized a different wavelength, including infrared). This can be imagined visually as a cuboid kernel for which each point stores an array of values instead of a single value.
  • the convolution kernel/filter used in a 4D CNN would also be 4D.
  • the value in using a 4D CNN is the ability to use features obtained from the dimensions of space, time, geometry, wavelength, and so forth (e.g., useful if the images in different wavelengths captured by the four cameras combined would contain more information about a patient’s movement or biometrics than images captured by the four cameras in the same wavelength). For instance, certain short region wavelengths (e.g., 805 nm, 750 nm, and so forth) may have interactions with blood (e.g., due to differences in absorption of oxygenated/deoxygenated blood) in a way that provides a better signal for calculating biometrics (e.g., pulse).
  • the features may be concatenated (e.g., at the end of the first stage 1010 of the MMMD Framework in FIG. 10) to produce an initial feature set.
  • various dimensionality reductions techniques may be used to reduce the number of features and redundancies that may exist within the features.
  • There are many computational techniques based on statistics and information theory that deal with the dimensionality reduction problem.
  • C.O.S. Sorzano, J. Vargas, A. Pascual-Montano,“A survey of dimensionality reduction techniques” (March 2014) provides a survey of such techniques and is hereby incorporated by reference in its entirety.
  • These techniques may include feature selection techniques (for selecting a subset of the features) and feature extraction techniques (for generating new, higher-order features that are derived from the existing features).
  • For example, Principal Component Analysis (PCA) may be used to project the initial set of features onto a new coordinate system (e.g., basis vectors).
  • the initial set of features can be transformed into a set of new, higher-order features that are derived from the features in the initial set (e.g.,“features of features”).
  • a feature vector capturing this new set of higher-order features can be used to calculate 6DOF motion coordinates and/or biometric signals using a trained regression model.
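  • A minimal sketch of this dimensionality-reduction stage, assuming scikit-learn's PCA as one of the surveyed techniques; the feature counts and the number of retained components are arbitrary illustrations:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
manual_features = rng.normal(size=(500, 40))    # e.g., statistical descriptors
learned_features = rng.normal(size=(500, 60))   # e.g., CNN/autoencoder outputs

# Concatenate the heterogeneous lower-level features into one initial set.
initial_features = np.hstack([manual_features, learned_features])

# Derive a smaller set of higher-order "features of features"; the fitted PCA
# (its basis vectors) acts as the feature map reused on new data at run time.
pca = PCA(n_components=20)
higher_order_features = pca.fit_transform(initial_features)

# New raw features (e.g., for a streamed sliding window) use the same mapping.
new_window_features = rng.normal(size=(1, 100))
feature_vector = pca.transform(new_window_features)
```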
  • Detection modeling may refer to the combined process of building and training a regression-based model for motion and biometrics signal detection, as well as creating the mapping function/algorithms used to map a feature vector into 6DOF motion coordinates.
  • a more extended version of the model involves additional biometric outputs for the desired output. For instance, there could be an additional two biometric outputs, such as cardiopulmonary waveforms (e.g., a videoplethysmographic (VPG) waveform and a respiratory waveform), resulting in a total of 8 outputs of the regression model (e.g., 6 motion signals and 2 biometric signals).
  • The VPG waveform can be used to determine the subject's cardiac state, and the respiratory waveform can be used to determine the respiratory rate.
  • the mapping function for the regression-based signal detection model is a multi-dimensional input-output function due to there being many features and many outputs, and it provides a good mathematical relationship between the features and outputs.
  • one or more regression methods may be selected for determining the best parameters for the signal detection model, including regression methods such as Linear Regression, Logistic Regression, Polynomial Regression, Stepwise Regression, Ridge Regression, Lasso Regression, ElasticNet Regression.
  • In some embodiments, an Extreme Learning Machine (ELM) may be used as the regression technique.
  • other regression techniques may be used without departing from the scope of the disclosure.
  • FIG. 18 shows an example architecture for a single layer ELM network.
  • the network includes L number of neurons in the hidden layer designed to produce a mapping function to transform the feature vector (e.g., produced by the MMMD Framework in FIG. 10) to 6DOF motion coordinates and/or biometric signals.
  • Training an ELM in order to determine a set of optimal parameters (e.g., a matrix of weighting values) that best fit the training examples available in the training data may be a linear least squares problem with constrained optimization, whose solution is an analytical expression involving the generalized inverse of the hidden-layer output matrix that can be used during real-time operation. Equation 7 provides one example expression of a mapping function for an ELM that may be used for the signal detection model.
  • Here, the training data set consists of $N$ pairs $\{(x_j, y_j)\}_{j=1}^{N}$, and $k$ is the number of outputs for the ELM network, which includes $L$ hidden neurons.
  • Equations (8) through (10) provide example expressions for generating the matrix of weighting values.
  • C represents a predetermined regularization constant empirically generated during training. The constant is used to control the tradeoff between the output weights and the training error.
  • T represents the transpose of a matrix or vector of values.
  • H represents a matrix of size N x L and Y represents a matrix of size N x 8 (where 8 is the number of outputs desired).
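  • A sketch of single-hidden-layer ELM training consistent with the description above is shown below, using randomly assigned hidden-layer weights, a sigmoid activation, and regularized least-squares output weights of the form beta = (H^T H + I/C)^(-1) H^T Y; the exact generalized-inverse expression used in Equations (8) through (10) may differ slightly from this assumption:

```python
import numpy as np

def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))

class ELMRegressor:
    """Single hidden layer ELM mapping feature vectors to k outputs
    (e.g., 6 motion coordinates plus 2 biometric signals)."""

    def __init__(self, n_features: int, n_hidden: int = 500, C: float = 1e3,
                 seed: int = 0):
        rng = np.random.default_rng(seed)
        # Hidden-layer weights and biases are randomly assigned, not trained.
        self.W = rng.normal(size=(n_features, n_hidden))
        self.bias = rng.normal(size=n_hidden)
        self.C = C
        self.beta = None

    def _hidden(self, X: np.ndarray) -> np.ndarray:
        return sigmoid(X @ self.W + self.bias)              # H, shape (N, L)

    def fit(self, X: np.ndarray, Y: np.ndarray) -> None:
        H = self._hidden(X)
        L = H.shape[1]
        # Analytic, regularized least-squares output weights (lookup table).
        self.beta = np.linalg.solve(H.T @ H + np.eye(L) / self.C, H.T @ Y)

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self._hidden(X) @ self.beta                  # shape (N, k)
```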
  • A multilayer ELM (ML-ELM) or deep ELM can also be used to improve the regression accuracy.
  • a biometric signal detection experiment was performed using the various systems, methods, and techniques described herein.
  • a small region of interest (ROI) of size ⁇ 191 x 217 ⁇ pixels was selected on the forehead of a patient.
  • FIG. 19 shows a frame of sensor data including four images captured by their respective cameras, as well as the ROI in the frame. The ROI was further reduced to {28 x 28} pixels by averaging sub-blocks of size {7 x 8} pixels. Video frames were captured at a speed of 60 fps.
  • FIG. 20 shows a plot of normalized pixel intensities for ROI coordinates ⁇ 1,1 ⁇ over a period of ten seconds, which is compared to representations of the 600 consecutive image frames that were recorded over that period of ten seconds. Accordingly, the 600 consecutive image frames over the ten seconds can be thought of as data that spans the spatial and temporal dimensions, which may be captured by a pixel data cube of size 28 x 28 x 600.
  • FIG. 21 illustrates the network structure of a deep feature learning algorithm employing CNN and a motion and biometric signal detection model employing ELM for the experiment.
  • FIG. 21 represents one possible structure and approach for implementing the data processing pipeline shown in FIG. 2.
  • a Convolution Neural Network was used for the deep feature learning algorithm, with one convolution layer having 71 neurons, stride 1, and a filter size equal to 21.
  • the convolution layer convolves the input by moving the filters along the input vertically and horizontally and computes the dot product of the weights and the input, and then adds a bias term.
  • the experiment did not use other types of layers that may be included in some embodiments, such as batch normalization, ReLU, or pooling.
  • the final output is obtained through two fully connected layers in sequence, one with 20 outputs and the other with one output.
  • the activation function in each layer is a sigmoid function. None of the parameters of the network were optimized for best performance. A simple regression was included to map the learned features to the blood volume waveform.
  • FIG. 22 shows a visual representation of the features in 2D learned by the Convolution Neural Network algorithm for the experiment. This feature map is the most strongly activated map at the end of the second fully connected layer of FIG. 21. These features are abstract and are difficult to interpret from human inspection. However, the prediction results shown in FIG. 23 are better than prior experiments using mean pixel intensities even with a non-optimized deep learning algorithm.
  • FIG. 24 is a block diagram of a stacked 2-layer autoencoder generated by the system during an experiment.
  • the autoencoder includes two layers having 100 and 150 neurons, respectively.
  • the autoencoder receives an input vector of 784 values (for a ROI having size 28x28 pixels) and provides an output feature vector having 150 elements.
  • Stacked autoencoder features are tuned by using both stacked encoder and decoder (not shown in FIG. 24) structure in an optimization equation (4).
  • the decoder is required only during tuning of the parameters of the autoencoder features, such as the number of neurons, weights, and bias values, and will not be used in real-time. After tuning the parameters of the encoder with the configuration presented in FIG. 24, only the encoder parameters are retained and used for tuning the ELM detection model.
  • the elements of encoder outputs are then provided as input to an ELM detection model, such as that shown in FIG. 18.
  • the ELM detection model provides an output vector indicating motion or biometrics predicted for the provided feature vector output from the encoder.
  • all the parameters of the ELM detection model (e.g., the number of neurons in the hidden layer, the weight matrix as in Equation (8)) are obtained during offline training and then used as lookup tables for real-time tracking.
  • FIGS. 25 and 26 show an example of blood volume waveform results based on feature learnings from two different deep learning algorithms and two different detection models obtained with experimental data from one human subject. Similar waveforms are obtained without using any feature learning and with the ELM detection model.
  • the top plot shows a PPG waveform, a VPG waveform computed with the mean-ROI approach, and a VPG waveform computed with deep learning.
  • the bottom plot shows power spectral density curves corresponding to the curves in the top plot.
  • the mean square error values for PPG-VPG using two different deep learning algorithms, and for PPG-VPG without feature learning and with mean ROI, are shown in Table 1. Both deep learning-based implementations provided lower error than the mean ROI without feature-based learning, thus showing an improvement in the ability to detect biometric signals. As expected, errors are higher for the case without feature-based learning.
  TABLE 1
  • FIG. 27 illustrates an example architecture of a model for biometrics and marker-less motion tracking that employs a Convolution Neural Network.
  • FIGS. 28-30 illustrate example results associated with a biometrics and marker-less motion tracking system.
  • FIG. 31 shows a full-size image (1024x1280 pixels) as captured with KinetiCor's four-camera, 60 fps NIR camera system.
  • the image shown in FIG. 31 represents an example input image to the architecture shown in FIG. 27.
  • the rectangular ROI in FIG. 31 where markers are located is blocked out during marker-less tracking.
  • the full-size image may be reduced to 86x86 pixels by performing sub-block averaging (a sketch of one way to perform such sub-block averaging follows below).
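  • A possible implementation of such sub-block averaging is sketched here; because 1024x1280 does not divide evenly into 86x86 blocks, this sketch simply crops the remainder, which is an assumption about how the reduction is handled:

```python
import numpy as np

def subblock_average(frame: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Reduce a frame by averaging non-overlapping sub-blocks.

    The frame is cropped so that it divides evenly into out_rows x out_cols
    blocks; other ways of handling the remainder are equally possible.
    """
    bh = frame.shape[0] // out_rows
    bw = frame.shape[1] // out_cols
    cropped = frame[:bh * out_rows, :bw * out_cols].astype(np.float32)
    return cropped.reshape(out_rows, bh, out_cols, bw).mean(axis=(1, 3))

# Example: reduce a full-size 1024 x 1280 frame to an 86 x 86 input image.
small = subblock_average(np.zeros((1024, 1280), dtype=np.uint8), 86, 86)
```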
  • the experiment used a robotic system with markers highlighted within the rectangular ROI (see FIG. 31) to simulate typical motion profiles.
  • the output of the Convolution Neural Network is the featurized image which is generated during real-time tracking with a reduced size image (86x86 pixel image sequence).
  • the experiment included recording seven motion coordinates (three X, Y, Z direction motion signals and four quaternions, Qr, Qx, Qy, Qz) under a variety of simulated motions such as dystonia, crossing legs, feeling uncomfortable, falling asleep, staying quiet, etc. All seven motion coordinates were measured simultaneously using the current marker-based motion tracking system.
  • FIG. 37 shows the first four training sample images
  • FIG. 38 shows sixty training sample images. The same training images are used for demonstrating effectiveness of feature-based approach with Convolution Neural Network with regression model and autoencoder with ELM model.
  • FIG. 39 shows highlights of feature maps (left side) learned in Layer 2 (a Convolution Layer with 71 neurons) of the Convolution Neural Network. The right side of FIG. 39 shows an activation map from these feature maps for an example input image.
  • FIG. 40 shows highlights of feature maps and corresponding activation maps for an example input image for Layer 3 (a Fully Connected Layer with 20 outputs).
  • FIG. 41 shows highlights of feature maps and corresponding activation maps for an example input image for Layer 4 (a Fully Connected Layer with 7 outputs).
  • FIGs. 39-41 illustrate how the features detected by the trained network evolve as the number of dimensions of data being analyzed are reduced.
  • the feature map from the Convolution Neural Network comprises a three-by-three grid of pixels.
  • the systems and methods described can provide accurate motion detection or biometric signal detection using such a reduced dimensional data set.
  • validation images were used to compare detected values measured using traditional methods with detected values generated by a feature-based detection model processing the trained features.
  • FIG. 42 shows four validation sample images.
  • FIG. 34 includes plots of the results for 5000 validation frames (different from the training image samples) captured sequentially at a 60 fps capture rate. For training, frame numbers 1 to 8072 were used. The first four training image frames are shown in FIG. 37. These images show blocked-out rectangular regions under each camera.
  • FIG. 42 shows four sample images for frames 4790 to 4793 from the validation pool.
  • FIG. 28 shows the trace of marker-based (green) and marker-less AI-based (red) tracking signals. AI-based tracking signal is generated with a network and detection model such as shown in FIG. 27.
  • FIG. 29 shows a table containing key statistics comparing different detection methods. Notably, the results in FIG. 29 provide data to compare marker based techniques with the detection using the features described in this application.
  • FIG. 30 shows cross correlation plot matrix for each measured motion coordinate using marker-based (x-axis) detection and marker-less (y-axis) detection. It should be noted that, based on experimental results, the correlation coefficient between results of marker-based detection and marker-less detection (when applied to robots or human patients) will often be over 0.98, with a small amount of absolute mean error between the results of the two (e.g., ⁇ 0.23 mm for translation, ⁇ 0.12 degrees for rotation).
  • In FIG. 32, a single-layer 71-neuron autoencoder network was trained with a 500-neuron ELM detection model.
  • FIG. 33 shows the logical view of the encoder-decoder network used during training. The output of the encoder is the activation map (or featurized image), which is generated during real-time tracking with a reduced-size image (in this case, an 86x86 pixel frame image sequence was again used). The featurized image becomes the input to the ELM detection model.
  • FIG. 34 shows the trace of marker-based (green) and marker-less AI-based (red) tracking signals using feature learnings from Autoencoder and the ELM detection model. AI-based tracking signal with autoencoder-ELM model is generated such as shown in FIG. 32.
  • FIG. 35 shows a table containing key statistics for this method comparing different detection methods, and FIG. 36 shows a cross-correlation plot matrix for each measured motion coordinate using marker-based (x-axis) detection and marker-less (y-axis) detection.
  • the results show that the AI-based marker-less tracking system works reasonably well to approximate the results from marker-based solutions.
  • the marker-less tracking system relies on features detected from the image data and subject variable data (also referred to as patient variable data, which may include information that varies from patient to patient, such as demographic information, age, gender, physiological information such as BMI, range of heart rate, range of respiration rate, and so forth) without relying on artificial markers to be added to the subject.
  • The ELM algorithm uses trained weights, biases, and an activation function, all obtained from offline training.
  • The ELM algorithm has an analytic solution. Its hidden-layer weights (in this simulation, the hidden layer contained 500 neurons) are randomly assigned rather than computed iteratively in real time via gradient-search algorithms such as back-propagation.
  • the output weights (equation (8)) form a look-up table computed during offline training; a minimal sketch of this offline training and real-time inference scheme follows below.
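  • As a non-limiting illustration of this analytic scheme (the array sizes and function names below are assumptions for illustration, not the implementation of equation (8) itself), the offline training step and the real-time inference step can be sketched in Python as follows:

```python
import numpy as np

def train_elm(X, Y, n_hidden=500, seed=0):
    """Offline ELM training: input-to-hidden weights and biases are random and
    fixed; only the output weights are solved analytically (least squares)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))    # random hidden-layer weights
    b = rng.normal(size=n_hidden)                  # random hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))         # sigmoid activations
    beta = np.linalg.pinv(H) @ Y                   # analytic output weights ("look-up table")
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Real-time inference: two matrix products and an activation, no iteration."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Hypothetical shapes only: 71 featurized inputs mapped to 6 motion coordinates.
X_train, Y_train = np.random.rand(1000, 71), np.random.rand(1000, 6)
W, b, beta = train_elm(X_train, Y_train)
print(elm_predict(X_train[:4], W, b, beta).shape)  # (4, 6)
```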
  • This disclosure has described the training of marker-less models (e.g., to determine the best set of parameters to use) using training data containing the desired outputs for the 6DOF motion coordinates used as ground truth. More specifically, those 6DOF motion coordinates may have been calculated for each training example using a marker-based method which involved affixing a marker to the patient. Thus, the 6DOF motion coordinates from the marker-based method were used as ground truth for the marker-less motion tracking model.
  • the inputs for the marker-less motion tracking model include image frames of patients (e.g., collected from medical imaging procedures), which means the training examples must include that data as inputs.
  • building the training data requires a training cohort of human patients of different ages, gender, ethnicity, etc.
  • there may be other approaches of characterizing a marker-less model and its training data which would not require training data to be collected using a large set of human patients.
  • the present disclosure provides a method to use a characterization approach in which a calibration pattern (e.g., chessboard patterns, square grid patterns, circle hexagonal patterns or circle regular grid patterns) or a specifically designed custom target pattern, such as the pattern shown in FIG. 43, is mounted to a rigid surface (e.g., flat, spherical, etc.).
  • the rigid surface may serve as a proxy for a patient’s head, and the rigid surface and the mounted calibration pattern can be adjusted into different positions and orientations that can be measured.
  • the calibration pattern can be imaged for those different positions and orientations by a single pinhole camera (pinhole cameras have significant distortions which depend on the geometry of the optical system, a characteristic which is beneficial to this particular context).
  • the appearance of the calibration pattern is imaged and saved for each corresponding position and orientation, which are measured and known. These serve as the training examples in the training data.
  • Machine-learning methods or a simple calibration method can then be used to train a pinhole camera model and learn the best parameters to use for determining the position and orientation of the rigid surface based on the appearance of the calibration pattern and how it is “seen” by the pinhole camera as the calibration pattern changes position and orientation relative to the pinhole camera.
  • the pinhole camera model represents most of the physics contained in the imaging system with one camera. Accuracy of coordinate values generated by the pinhole camera model is limited by the optical geometry of the single camera system.
  • the trained model can then be used to reconstruct a camera pose model with its own parameters.
  • the camera pose model will receive the image pixels as inputs and generate 6 coordinate values as outputs.
  • the camera pose model could be, for example, one of the feature-based machine learning models such as the CNN network (e.g., as shown in FIG. 48). It would be necessary to train a camera pose model to represent the characteristics of a physics-based model such as the pinhole camera model because the camera pose model is purely computation-based, whereas the pinhole camera model is physics-based.
  • the pinhole camera model can be used as an example to represent the physics of the optical system.
  • Any other suitable first-principles model can be used in place of the pinhole camera model, which can only process image pixels from specially designed targets (e.g., the calibration pattern shown in FIG. 43); it cannot process image pixels from a human face.
  • By combining the outputs of four such AI-based camera pose models (e.g., one for each camera if four cameras are used), 24 total 6DOF coordinate values can be produced.
  • These 24 coordinate values are combined to produce the final 6 output coordinates in the camera aggregation model, which reconciles the differences in perspective between the cameras. A detailed description of this method follows.
  • a camera pose model is a model, for a single camera, of the relationship between the 6DOF coordinates of a calibration image (associated with the position/orientation of the calibration image relative to the camera) and the image intensity captured of that calibration image by the camera.
  • the camera pose model is trained using well known physics-based models such as a pinhole camera model. Regression and calibration techniques are used to determine the best values of the parameters for the camera pose model based on the specific optical geometry in use.
  • a camera pose model “learns” the parameters (or the outputs generated by the pinhole camera model) associated with the pinhole camera model in order to take the relationship between how a calibration pattern appears to a camera and its position/orientation, and then generalize that relationship to be applied to any object or surface (and not just calibration patterns).
  • the camera pose model could take the understanding of perspective gained from the calibration pattern by the pinhole camera model (e.g., a particular part of the pattern will appear smaller if it is further away) and apply it to anything.
  • the camera pose model can be used to map a 3D scene to a 2D image plane and describe the mathematical relationship between the coordinates of a point in multi-dimensional space and its projection onto the 2D image plane.
  • This camera pose model would not consider time information, since it would be able to determine the spatial coordinates for an object in an image frame based on just its appearance in only that image frame.
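  • To illustrate the 3D-to-2D mapping referred to above, the following Python sketch applies an ideal pinhole projection (intrinsic matrix K and extrinsic rotation/translation R, t) to a few 3D points; the numeric values are illustrative assumptions, not calibrated parameters of the described system, and lens distortion is ignored:

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Ideal pinhole projection of 3D world points onto the 2D image plane:
    x ~ K [R | t] X (homogeneous), followed by the perspective divide."""
    X_cam = points_3d @ R.T + t        # world coordinates -> camera coordinates
    x = X_cam @ K.T                    # apply camera intrinsics
    return x[:, :2] / x[:, 2:3]        # perspective divide -> pixel coordinates

# Illustrative intrinsics and pose (assumed values, not a calibrated camera).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # target facing the camera, no rotation
t = np.array([0.0, 0.0, 500.0])        # target 500 mm in front of the camera
corners = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 30.0, 0.0]])
print(project_points(corners, K, R, t))
```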
  • the trained camera pose model can then be used in real-time to measure motion coordinates during a medical scan or therapeutic procedure without the use of special markers. This may involve tracking the object of interest, such as the head of a patient, and adjusting imaging planes in real-time or near real-time using the trained model for prospective motion correction such that the imaging planes follow the patient’s movement, resulting in captured images without motion artifacts.
  • the same motion data may also be applied for retrospective motion correction, in which motion artifacts are corrected after scan data collection but before image reconstruction.
  • the process of developing the camera pose model may involve multiple parts.
  • the first part involves the characterization of camera pose data for each camera being used (e.g., in the medical imaging procedure).
  • FIG. 44 is a flow chart illustrating the steps in this first part.
  • a patterned calibration target is selected (e.g., chessboard patterns, square grids patterns, circle hexagonal pattern or circle regular grid pattern).
  • the calibration target is held at different positions and/or orientations for generating translational and rotational pose data and images for each camera, which will be used as training data for the camera pose model.
  • the training data will consist of: (1) image intensities of the patterned calibration target as seen by each camera at each position and orientation; and (2) the corresponding camera pose data (i.e., 6DOF values from each camera) associated with the patterned calibration target at each position and orientation.
  • the training data should not only include images of adequate resolutions in different positions and/or orientations, but those positions and orientations should also cover the entire range over which the motion measurement is required (e.g., in real time applications of motion tracking, such as in a medical imaging procedure).
  • FIG. 45 shows an example table or grid that illustrates the characterized camera pose data collected (e.g., during characterization as shown in FIG. 44) and how that data can be structured.
  • This experimental grid contains 24 different positions/locations and orientations plus one center position/location. In order to improve resolution, finer positions and/or orientations of the calibration target can be added to this grid; a minimal sketch of such a grid enumeration is shown below.
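  • As a non-limiting illustration (the axes and step sizes are assumptions chosen only so the count matches the example grid), the following Python sketch enumerates a 5x5 set of target poses, i.e., 24 off-center poses plus the center pose, at which an image and the measured pose would be recorded for every camera:

```python
import itertools

translations_mm = [-10.0, -5.0, 0.0, 5.0, 10.0]   # steps along one axis (illustrative)
rotations_deg = [-5.0, -2.5, 0.0, 2.5, 5.0]       # steps about one axis (illustrative)

grid = list(itertools.product(translations_mm, rotations_deg))
print(len(grid))                  # 25 poses: 24 off-center plus the center (0.0, 0.0)
for tx_mm, rz_deg in grid:
    # At each grid point, the calibration target would be imaged by each camera
    # and the measured 6DOF pose recorded as a training example.
    pass
```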
  • the second part of developing the camera pose model may involve training a machine-learning model to convert image intensities (as seen by a single camera) into pose data (e.g., position/orientation information) based on the characteristics and features within the image intensities that it has determined are best suited for the task (e.g., via a deep learning algorithm).
  • FIG. 46 is a flow chart illustrating the steps in this second part of developing the camera pose model.
  • the images of the calibration pattern in the training data for the various positions/orientations are retrieved, and at block 4603 those images are processed and filtered.
  • an AI model (e.g., an AI-based camera pose model) is initialized and configured, which may utilize engineered features (knowledge-based, statistical, deep learning, etc.) and a parameterized regression model.
  • Features may be obtained automatically, as in Convolution Neural Networks (CNN), Extreme Learning Machines (ELM), or Auto Encoders (AE).
  • the best parameters for the regression model of the AI model can be determined from the collected training data (e.g., at block 4605), which consists of image intensities of the calibration target from all experiments and each camera and their corresponding pose data (i.e., 24 values per experiment for the 4-camera system).
  • This data is used as input for tuning the AI model parameters (e.g., at block 4608), and tuning may comprise comparing pose data in the training data (e.g., from block 205) to the calculated pose data obtained by the AI-based camera pose model for a given set of parameters until an optimal set of parameters is determined over a number of iterations.
  • the optimal set of parameters may minimize the error between the actual pose data in the training data and the calculated pose data from the AI-based camera pose model (e.g., at block 4607); the parameters are adjusted to minimize the error using learning/optimization techniques applicable to deep networks, depending on the network size and depth.
  • the final output (e.g., at block 4609) is a trained AI-based camera pose model that computes the 6DOF coordinates from the image pixel intensities captured by a camera.
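  • As a non-limiting illustration of such an AI-based camera pose model and its tuning loop (the layer sizes, optimizer, and tensor shapes below are assumptions for illustration, not the trained network described above), a small CNN regressor mapping a single camera frame to 6DOF coordinates can be sketched in Python as follows:

```python
import torch
import torch.nn as nn

class CameraPoseModel(nn.Module):
    """Small CNN regressor: one grayscale camera frame in, 6DOF coordinates out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 6),           # X, Y, Z, Rx, Ry, Rz
        )

    def forward(self, frames):
        return self.net(frames)

model = CameraPoseModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder training pairs: calibration-target frames and their measured poses.
frames = torch.rand(8, 1, 86, 86)
poses = torch.rand(8, 6)
for _ in range(10):                     # iterate until the pose error is minimized
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(frames), poses)
    loss.backward()
    optimizer.step()
```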
  • After the AI-based camera pose model is trained, it can be used in a third step for the training of another AI-based model that aggregates camera pose data from all cameras and then outputs final 6DOF motion coordinates (e.g., a motion coordinate output that factors in the differences between the multiple different perspectives of the different cameras).
  • This step may not be required when only one camera is used. However, the use of more than one camera may necessitate aggregating all camera pose data (i.e., 6DOF outputs from each camera) into final 6DOF motion coordinates.
  • this third step may still be required even when one camera is used, in order to customize the training for specific patient groups (e.g., children, adults, male, female etc.).
  • FIG. 47 is a flow chart illustrating the steps in this third part associated with an aggregate camera pose model.
  • a camera aggregation model could be trained through the use of human subject images.
  • the ground truth for the 6DOF motion coordinates can be associated with marker-based motion tracking, the use of robots with measurable motion coordinates, and so forth.
  • the ground truth is based on marker-based motion tracking (e.g., at block 4706), and the motion coordinate outputs generated from marker-based motion tracking may be compared at block 4707 to the outputs generated from an aggregate camera model (initialized at blocks 4703, 4704 and 4705) associated with the same human patient image (e.g., retrieved at block 4702).
  • the aggregation camera model can be saved (e.g., at block 4710) and used for computing motion coordinates in real-time without the use of the marker. These training steps would be performed offline, and despite the requirement of human patients for training this camera aggregation model, the number of patients required for this step would normally be small when compared to methods without the initial camera pose model; a minimal sketch of such an aggregation mapping follows below.
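  • As a non-limiting illustration of the aggregation step (a plain linear least-squares mapping stands in here for the ELM-trained aggregator, and all shapes and data are placeholders), the 24 per-camera pose values for a 4-camera system can be combined into final 6DOF coordinates as follows:

```python
import numpy as np

def train_aggregator(per_camera_poses, ground_truth):
    """Fit a linear mapping from the 24 per-camera pose values (4 cameras x 6DOF)
    to the final 6DOF motion coordinates, using ground-truth motion data."""
    X = np.hstack([per_camera_poses, np.ones((per_camera_poses.shape[0], 1))])  # add bias term
    W, *_ = np.linalg.lstsq(X, ground_truth, rcond=None)
    return W

def aggregate(per_camera_poses, W):
    """Apply the trained aggregation mapping in real time."""
    X = np.hstack([per_camera_poses, np.ones((per_camera_poses.shape[0], 1))])
    return X @ W

# Hypothetical training data shapes only (not experimental data).
X_train = np.random.rand(500, 24)     # 24 pose values per sample from 4 cameras
Y_train = np.random.rand(500, 6)      # ground-truth 6DOF coordinates per sample
W = train_aggregator(X_train, Y_train)
print(aggregate(X_train[:3], W).shape)   # (3, 6)
```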
  • FIGS. 48-49 illustrate block-diagram views of the steps associated with characterization for training marker-less tracking algorithms; when compared to FIGS. 45-47, which illustrate a different view of the steps, they may help facilitate understanding of the steps.
  • FIG. 50 illustrates the 6DOF coordinates produced by individual cameras using the trained AI-based camera pose model for a human patient exhibiting different motions in an experiment.
  • a CNN algorithm was used to train the AI-based camera pose model.
  • the 6DOF coordinates (X Y Z & Rx Ry Rz) generated by a marker-based approach are shown on the left, and the 6DOF motion coordinates generated by the trained AI-based camera pose model from image intensities captured by 4 cameras (Camera A, B, C and D) are shown on the right.
  • These signals are then aggregated by the AI-based camera aggregation model, which was trained using ELM.
  • FIG. 51 shows graphs of the resulting outputs from the AI-based camera aggregation model, as compared to the 6DOF coordinates generated by a marker-based approach. The final tracking coordinates are plotted with respect to frame number.
  • FIG. 52 shows a table summarizing the validation statistics and showing the efficacy of the approach of using a camera pose model with a camera aggregation model, as compared to a traditional marker-based motion tracking approach. Note that the image data in this instance did not use any spatial-temporal filter for removing non-rigid-body motion.
  • MRI machines and therapeutic devices may require data to be supplied in the form of quaternions instead of rotation angles Rx, Ry Rz in degrees.
  • the main disadvantages of Euler rotation angles are: (1) that certain important functions of Euler angles have singularities; and (2) that they are less accurate than unit quaternions when used to account for incremental changes in motion over time.
  • the rotation angles are converted to the 4 quaternion components using a well-known conversion formula; a sketch of such a conversion follows below.
  • the marker-less tracking method described here can produce 7 motion coordinates (3 translation coordinates and 4 rotation coordinates) for compensating motion artifacts in scanners and therapeutic procedures.
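  • As a non-limiting illustration, the following Python sketch converts rotation angles in degrees to a unit quaternion (w, x, y, z). An X-Y-Z rotation order is assumed here; the angle convention actually required by a given scanner or therapeutic device may differ and should be confirmed against its interface:

```python
import numpy as np

def euler_to_quaternion(rx_deg, ry_deg, rz_deg):
    """Convert rotation angles (degrees, assumed X-Y-Z order) to a unit
    quaternion (w, x, y, z)."""
    rx, ry, rz = np.radians([rx_deg, ry_deg, rz_deg])
    cx, sx = np.cos(rx / 2.0), np.sin(rx / 2.0)
    cy, sy = np.cos(ry / 2.0), np.sin(ry / 2.0)
    cz, sz = np.cos(rz / 2.0), np.sin(rz / 2.0)
    w = cx * cy * cz + sx * sy * sz
    x = sx * cy * cz - cx * sy * sz
    y = cx * sy * cz + sx * cy * sz
    z = cx * cy * sz - sx * sy * cz
    return np.array([w, x, y, z])

print(euler_to_quaternion(1.0, -0.5, 2.0))   # small head rotation, for illustration
```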
  • the motion data supplied to the scanner is relative to the initial position, as the method described above does not produce absolute coordinates.
  • a marker-based or other ground-truth system can be used to mark the coordinates of the initial position.

Addendum
  • FIG. 53 is a flow diagram illustrating an example method of motion detection, which may be coordinated in whole or in part by a device described herein.
  • an instrument such as an MRI machine may be configured to perform the method to provide motion and biometric detection for a subject under test.
  • the coordinating device may receive sensor data of the subject.
  • the sensor data may include video or a sequence of images showing the subject.
  • the sensor data may be received from a camera.
  • the sensor data may include acoustic data or other information detected from the subject while in the instrument performing the test.
  • the coordinating device may receive medical image data of the subject from the scanning instrument.
  • the medical image data may include MRI data or other information which will be used to evaluate a medical condition of the subject.
  • the coordinating device may receive, from a data store, subject variable data (also referred to as patient variable data).
  • Subject variable data generally refers to information that may vary from subject to subject. Such information may include demographic information (e.g., age, gender) or physiological information (e.g., BMI, range of heart rate, range of respiration rate).
  • the subject may also be associated with specific feature maps, feature parameters, or learning algorithm parameters which guide the detection model in use and in training.
  • the coordinating device may generate subject tracking data representative of any one of motion and/or biometric condition of the subject while being scanned by the instrument.
  • the generation of the subject tracking data may include processing the data received at blocks 5306 and 5308 using a trained neural network model.
  • the model may receive a vector of input values including one or more of: (i) knowledge-based features, (ii) statistical features, (iii) optical flow features, (iv) principal components, (v) deep learning features, (vi) interaction features, or (vii) features of features.
  • generating the subject tracking data may include mapping a set of features to at least one of: (i) Quaternions/6DOF quantities, or (ii) biometric waveforms. The mapping may include or be based upon: (i) an extreme learning machine, (ii) a linear regression, (iii) a logistic regression, (iv) an autoregressive model, or (v) an autoregressive moving average model.
  • the model may generate an output vector corresponding to motion or biometrics of the subject.
  • the coordinating device may cause an adjustment to the instrument based at least in part on the subject tracking data.
  • the adjustment may include transmitting one or more messages to adjust the focus of a scanner included in the instrument to account for motion of the subject while being scanned by the instrument. Further details of how the adjustment may be generated and applied are discussed in reference to, for example, FIG. 55.
  • the coordinating device may determine whether additional medical image data is available.
  • the determination may include monitoring a data stream for additional medical image data.
  • the determination may include identifying whether a session associated with an initially processed portion of the data is active.
  • a session generally refers to a series of communications between two or more devices or servers.
  • a session may be associated with a session identifier (e.g., a motion tracking identifier).
  • the session identifier may be included in messages exchanged to allow the session participants to associate specific messages with specific transactions. In this way, a server can concurrently provide motion or biometric detection services to multiple communication devices by associating devices with a unique session identifier.
  • the method may end. However, if additional medical image data is available for the subject, the method may return to block 5310 for processing of the new data.
  • FIG. 54 is a block diagram depicting an illustrative computing device that can implement the features described herein.
  • the computing device 5400 can be a server or other computing device, and can comprise a processing unit 5402, a motion processor 5430, a network interface 5404, a computer readable medium drive 5406, an input/output device interface 5408, and a memory 5410.
  • the network interface 5404 can provide connectivity to one or more networks or computing systems.
  • the processing unit 5402 can receive information and instructions from other computing systems or services via the network interface 5404.
  • the network interface 5404 can also store data directly to memory 5410 or other data store.
  • the processing unit 5402 can communicate to and from memory 5410 and output information to an optional display 5418 via the input/output device interface 5408.
  • the input/output device interface 5408 can also accept input from the optional input device 5420, such as a keyboard, mouse, digital pen, microphone, mass storage device, etc.
  • the memory 5410 contains specific computer program instructions that the processing unit 5402 may execute to implement one or more embodiments.
  • the memory 5410 may include RAM, ROM, and/or other persistent, non-transitory computer readable media.
  • the memory 5410 can store an operating system 5412 that provides computer program instructions for use by the processing unit 5402 or other elements included in the computing device in the general administration and operation of the computing device 5400.
  • the memory 5410 can further include computer program instructions and other information for implementing aspects of the present disclosure.
  • the memory 5410 includes a motion configuration 5414.
  • the motion configuration 5414 may include the thresholds, regions of interest, regularization constants, and other configurable or predetermined parameters to dynamically adjust the motion processor 5430 and/or the computing device 5400 to process image and scanning data described above.
  • the motion configuration 5414 may store specific values for a given configuration element.
  • the specific threshold value may be included in the motion configuration 5414.
  • the motion configuration 5414 may, in some implementations, store information for obtaining specific values for a given configuration element such as from a network location (e.g., URL).
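  • As a non-limiting illustration (the field names and values below are hypothetical, not the actual contents of the motion configuration 5414), a motion configuration record might be structured as follows:

```python
# Hypothetical motion configuration record (illustrative field names and values).
motion_configuration = {
    "motion_threshold_mm": 0.3,                # trigger a scanner adjustment above this displacement
    "region_of_interest": (120, 80, 86, 86),   # x, y, width, height in camera pixels
    "regularization_constant": 1e-3,           # used when solving for model output weights
    "parameter_source_url": None,              # optional network location (e.g., URL) for values
}
print(motion_configuration["motion_threshold_mm"])
```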
  • the memory 5410 may also include or communicate with one or more auxiliary data stores, such as data store 5422.
  • the data store 5422 may electronically store data regarding the sensor data being processed, characteristics of the sensor source, generated thresholds, image data, scanned data, biometric waveforms, and the like.
  • the elements included in the computing device 5400 may be coupled by a bus 5490.
  • the bus 5490 may be a data bus, communication bus, or other bus mechanism to enable the various components of the computing device 5400 to exchange information.
  • the computing device 5400 may include additional or fewer components than are shown in FIG. 54.
  • a computing device 5400 may include more than one processing unit 5402 and computer readable medium drive 5406.
  • the computing device 5400 may not be coupled to a display 5418 or an input device 5420.
  • two or more computing devices 5400 may together form a computer system for executing features of the present disclosure.
  • FIG. 55 illustrates the coordinate frames of a system for real-time adaptive Medical Scanning.
  • the system comprises a Motion Tracking System (preferably tracking motion in real time), such as marker-less tracking system, which produces timely measurements of the subject pose within a motion tracking coordinate frame 'c'.
  • the subject is imaged by a Medical Scanning system, such as an MR Scanner, which operates within a medical imaging coordinate frame 'M'.
  • Improved medical images are obtained if (real-time) Motion Information is available to the Medical Scanning system, but the Motion Information must be accurately translated (or transformed) from the real-time motion tracking system (coordinate frame 'c,') to the coordinate frame 'M' of the Medical Scanning system.
  • the motion tracking system is considered “calibrated” with respect to the MR system if the mathematical transformation leading from one coordinate system to the other coordinate system is known.
  • the calibration (or alignment) of the two coordinate systems can be lost, introducing inaccuracies, due to drift over time because of various factors, including heat and vibration.
  • Motion Information is transformed from frame 'c' to frame 'M' by a “coordinate transformation matrix”, or “co-registration transformation Tc→M.”
  • the “coordinate transformation matrix” converts or transforms motion information from one coordinate frame to another, such as from the motion tracking coordinate frame c to the medical imaging coordinate frame M. Loss of calibration due to drift, as well as other calibration inaccuracies, may result in a change over time of the coordinate transformation matrix, which in turn can lead to errors in the tracking information.
  • This variation can introduce error into the Transformed Real-time Motion Information for real-time adaptive Medical Scanning (over the course of many hours or days) due to temperature changes, vibrations and other effects; an illustrative sketch of applying the co-registration transform follows below.
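  • As a non-limiting illustration (the transform values are placeholders, not a calibrated co-registration), the following Python sketch maps a subject pose measured in the motion tracking frame 'c' into the medical imaging frame 'M' using a 4x4 homogeneous co-registration transform:

```python
import numpy as np

def transform_pose(T_Mc, pose_c):
    """Map a 4x4 homogeneous subject pose expressed in frame 'c' into frame 'M'."""
    return T_Mc @ pose_c

T_Mc = np.eye(4)                        # illustrative co-registration transform
T_Mc[:3, 3] = [120.0, -40.0, 15.0]      # example fixed offset between the two frames, mm

pose_c = np.eye(4)                      # subject pose measured by the tracking system
pose_c[:3, 3] = [1.0, 0.5, -0.2]        # measured head displacement in frame 'c', mm

print(transform_pose(T_Mc, pose_c)[:3, 3])   # the same pose expressed in frame 'M'
```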
  • a coordination device can be or include a microprocessor, but in the alternative, the coordination device can be or include a controller, microcontroller, or state machine, combinations of the same, or the like, configured to coordinate the processing of sensor and medical image data to generate the motion or biometric tracking information described herein.
  • a coordination device can include electrical circuitry configured to process computer-executable instructions.
  • a coordination device may also include primarily analog components.
  • some or all of the algorithms or interfaces described herein may be implemented in analog circuitry or mixed analog and digital circuitry.
  • a computing environment can include a specialized computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
  • a software module can reside in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of a non-transitory computer-readable storage medium.
  • An illustrative storage medium can be coupled to the coordination device such that the coordination device can read information from, and write information to, the storage medium.
  • the storage medium can be integral to the coordination device.
  • the coordination device and the storage medium can reside in an application specific integrated circuit (ASIC).
  • the ASIC can reside in an access device or other coordination device.
  • the coordination device and the storage medium can reside as discrete components in an access device or electronic communication device.
  • the method may be a computer-implemented method performed under the control of a computing device, such as an access device or electronic communication device, executing specific computer- executable instructions.
  • Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
  • Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
  • articles such as “a” or “an” should generally be interpreted to include one or more described items. Accordingly, phrases such as “a device configured to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations. For example, “a processor configured to carry out recitations A, B and C” can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.
  • the terms “determine” or “determining” encompass a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
  • a “selective” process may include determining one option from multiple options.
  • A “selective” process may include one or more of: dynamically determined inputs, preconfigured inputs, or user-initiated inputs for making the determination.
  • an n-input switch may be included to provide selective functionality where n is the number of inputs used to make the selection.
  • the terms “provide” or “providing” encompass a wide variety of actions.
  • “providing” may include storing a value in a location for subsequent retrieval, transmitting a value directly to the recipient, transmitting or storing a reference to a value, and the like.
  • “Providing” may also include encoding, decoding, encrypting, decrypting, validating, verifying, and the like.
  • the term “message” encompasses a wide variety of formats for communicating (e.g., transmitting or receiving) information.
  • a message may include a machine readable aggregation of information such as an XML document, fixed field message, comma separated message, or the like.
  • a message may, in some implementations, include a signal utilized to transmit one or more representations of the information. While recited in the singular, it will be understood that a message may be composed, transmitted, stored, received, etc. in multiple parts.
  • a “user interface” may refer to a network based interface including data fields and/or other controls for receiving input signals or providing electronic information and/or for providing information to the user in response to any received input signals.
  • a UI may be implemented in whole or in part using technologies such as hyper-text mark-up language (HTML), ADOBE® FLASH®, JAVA®, MICROSOFT® .NET®, web services, and rich site summary (RSS).
  • a UI may be included in a stand-alone client (for example, thick client, fat client) configured to communicate (e.g., send or receive data) in accordance with one or more of the aspects described.

Abstract

The invention relates to systems, methods, and techniques for machine-learning techniques and image-based models that can be used for tracking and detecting (e.g., accurately predicting) the motion coordinates and biometric information of a patient's body (in real time) from captured video images, without the use of an external marker affixed to the patient. These systems, methods, and techniques combine multiple processes and elements together in order to maximize the accuracy of determining (e.g., predicting) patient motion from sequential video images of the patient without an external marker.
PCT/US2019/055819 2018-10-12 2019-10-11 Image-based models for real-time, marker-less tracking of motion and biometric information in imaging applications WO2020077198A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862744918P 2018-10-12 2018-10-12
US62/744,918 2018-10-12

Publications (1)

Publication Number Publication Date
WO2020077198A1 true WO2020077198A1 (fr) 2020-04-16

Family

ID=70164094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/055819 WO2020077198A1 (fr) 2018-10-12 2019-10-11 Image-based models for real-time, marker-less tracking of motion and biometric information in imaging applications

Country Status (1)

Country Link
WO (1) WO2020077198A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130188830A1 (en) * 2006-05-19 2013-07-25 The Queen's Medical Center Motion tracking system for real time adaptive imaging and spectroscopy
US20080064952A1 (en) * 2006-08-18 2008-03-13 Dun Alex Li Systems and methods for on-line marker-less camera calibration using a position tracking system
US20150366527A1 (en) * 2013-02-01 2015-12-24 Kineticor, Inc. Motion tracking system for real time adaptive motion compensation in biomedical imaging
US20160035108A1 (en) * 2014-07-23 2016-02-04 Kineticor, Inc. Systems, devices, and methods for tracking and compensating for patient motion during a medical imaging scan
US20170200067A1 (en) * 2016-01-08 2017-07-13 Siemens Healthcare Gmbh Deep Image-to-Image Network Learning for Medical Image Analysis

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210398293A1 (en) * 2018-11-28 2021-12-23 Nippon Telegraph And Telephone Corporation Motion vector generation apparatus, projection image generation apparatus, motion vector generation method, and program
US11954867B2 (en) * 2018-11-28 2024-04-09 Nippon Telegraph And Telephone Corporation Motion vector generation apparatus, projection image generation apparatus, motion vector generation method, and program
US20200405179A1 (en) * 2019-06-26 2020-12-31 Siemens Healthcare Gmbh Determining a patient movement during a medical imaging measurement
US11980456B2 (en) * 2019-06-26 2024-05-14 Siemens Healthineers Ag Determining a patient movement during a medical imaging measurement
US11521743B2 (en) * 2019-10-21 2022-12-06 Tencent America LLC Framework for performing electrocardiography analysis
US20210304457A1 (en) * 2020-03-31 2021-09-30 The Regents Of The University Of California Using neural networks to estimate motion vectors for motion corrected pet image reconstruction
WO2021226292A1 (fr) * 2020-05-05 2021-11-11 Grosserode Stephen Système et procédé d'analyse de mouvement avec détection d'altération, de phase et d'événement
CN111967499B (zh) * 2020-07-21 2023-04-07 电子科技大学 基于自步学习的数据降维方法
CN111967499A (zh) * 2020-07-21 2020-11-20 电子科技大学 基于自步学习的数据降维方法
US11679276B2 (en) 2021-04-28 2023-06-20 Elekta, Inc. Real-time anatomic position monitoring for radiotherapy treatment control
WO2022232749A1 (fr) * 2021-04-28 2022-11-03 Elekta, Inc. Surveillance de position anatomique en temps réel pour traitement de radiothérapie
WO2022265643A1 (fr) * 2021-06-17 2022-12-22 Abb Schweiz Ag Systèmes robotiques et procédés utilisés pour mettre à jour l'entraînement d'un réseau neuronal sur la base de sorties de réseau neuronal
CN115695803B (zh) * 2023-01-03 2023-05-12 宁波康达凯能医疗科技有限公司 一种基于极限学习机的帧间图像编码方法
CN115695803A (zh) * 2023-01-03 2023-02-03 宁波康达凯能医疗科技有限公司 一种基于极限学习机的帧间图像编码方法
CN116228867A (zh) * 2023-03-15 2023-06-06 北京百度网讯科技有限公司 位姿确定方法、装置、电子设备、介质
CN116228867B (zh) * 2023-03-15 2024-04-05 北京百度网讯科技有限公司 位姿确定方法、装置、电子设备、介质

Similar Documents

Publication Publication Date Title
WO2020077198A1 (fr) Modèles fondés sur des images permettant un suivi des informations biométriques et de mouvement sans marqueur en temps réel dans des applications d'imagerie
Tan et al. Fully automated segmentation of the left ventricle in cine cardiac MRI using neural network regression
Ben Yedder et al. Deep learning for biomedical image reconstruction: A survey
US9892361B2 (en) Method and system for cross-domain synthesis of medical images using contextual deep network
Wang et al. Smartphone-based wound assessment system for patients with diabetes
Khagi et al. Pixel-label-based segmentation of cross-sectional brain MRI using simplified SegNet architecture-based CNN
Chaichulee et al. Cardio-respiratory signal extraction from video camera data for continuous non-contact vital sign monitoring using deep learning
Du et al. Cardiac-DeepIED: Automatic pixel-level deep segmentation for cardiac bi-ventricle using improved end-to-end encoder-decoder network
CN112750531A (zh) 一种中医自动化望诊系统、方法、设备和介质
Yan et al. Cine MRI analysis by deep learning of optical flow: Adding the temporal dimension
Tang et al. Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization
Lee et al. Lstc-rppg: Long short-term convolutional network for remote photoplethysmography
Whig et al. GAN for Augmenting Cardiac MRI Segmentation
Yu et al. Cardiac LGE MRI segmentation with cross-modality image augmentation and improved U-Net
US20230260652A1 (en) Self-Supervised Machine Learning for Medical Image Analysis
Almogadwy et al. A deep learning approach for slice to volume biomedical image integration
Romaszko et al. Direct learning left ventricular meshes from CMR images
Horng et al. The anomaly detection mechanism using deep learning in a limited amount of data for fog networking
Lee et al. Improving Remote Photoplethysmography Performance through Deep-Learning-Based Real-Time Skin Segmentation Network
US20220254012A1 (en) Methods, devices, and systems for determining presence of appendicitis
Zhou Deformable Image Registration Using Attentional Generative Adversarial Networks
US20230096850A1 (en) System for estimating a pose of a subject
Meng Graph representation learning for biometric and biomedical images analysis
Shen Prior-Informed Machine Learning for Biomedical Imaging and Perception
Tuhin et al. Detection and 3d visualization of brain tumor using deep learning and polynomial interpolation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871337

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19871337

Country of ref document: EP

Kind code of ref document: A1