WO2023091128A1 - Remote diagnostic system and method - Google Patents

Remote diagnostic system and method

Info

Publication number
WO2023091128A1
Authority
WO
WIPO (PCT)
Prior art keywords
remote diagnostic
image
object recognition
diagnostic system
recognition model
Application number
PCT/US2021/059585
Other languages
French (fr)
Inventor
Chao-Yuan Yeh
Ching-Yi Lee
Hsing-Hao Chen
Original Assignee
Aetherai Ip Holding Llc
Application filed by Aetherai Ip Holding Llc filed Critical Aetherai Ip Holding Llc
Priority to PCT/US2021/059585 priority Critical patent/WO2023091128A1/en
Publication of WO2023091128A1 publication Critical patent/WO2023091128A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning
    • G06N 3/088: Non-supervised learning, e.g. competitive learning

Definitions

  • the remote diagnostic system may further comprise a controller for receiving instructions from an operator. The controller receives input from the operator to determine a set of parameters or instructions regarding the processing of the image information to produce partially-processed image information such as a feature map.
  • because the object recognition model of the CNN is decentralized, with one portion of the object recognition model installed in the first edge device and another portion installed in the remote diagnostic module, the integrity of the entire object recognition model can be maintained, and the object recognition model is less vulnerable to piracy.
  • an AI aided remote diagnostic method comprises: receiving a partially-processed real-time image from a first edge device, which performs a first operation of an object recognition model on a real-time image received from a first image input by using multiple parameters; performing a second operation of the object recognition model on the partially-processed real-time image to produce an inference result which identifies a target topographic profile; and transmitting the inference result to the first edge device, wherein the partially-processed image is unconvertable back to the real-time image without the multiple parameters.
  • the AI aided diagnostic system in accordance with the present invention may be more applicable to hospitals or clinics having limited physical space.
  • the efficiency of object recognition may also be enhanced, since the speed of object recognition is no longer limited by the hardware limitations of the local device (edge device).
  • since the remote diagnostic module only receives feature maps of the real-time image generated by the first edge device, it may be difficult for a third party to intercept and interpret the image information, because the feature map acts as an encryption of the real-time image. As a result, the privacy of the patients can be secured.
  • FIG. 1 is a schematic view of the remote diagnostic system in accordance with the present invention.
  • FIG. 2 is a schematic view of the remote diagnostic system in accordance with a first embodiment of the present invention.
  • FIG. 3 is a schematic view of the remote diagnostic system in accordance with a second embodiment of the present invention.
  • FIG. 4 is a schematic view of the remote diagnostic system in accordance with a third embodiment of the present invention.
  • FIG. 5 is a schematic view illustrating an embodiment of the controller, the video and audio output, and the notification for indicating the location of the lesion.
  • FIG. 6 is a schematic view illustrating an embodiment of the first edge device and its physical dimension.
  • FIG. 7 is a schematic view illustrating the relation and operation between the first edge device and the remote diagnostic module.
  • FIG. 8 is a schematic view illustrating an embodiment in which multiple image inputs and edge devices share a centralized remote diagnostic module.
  • FIG. 9 is a flow chart illustrating a remote diagnostic method in accordance with the present invention.
  • FIG. 10 is another flow chart illustrating a remote diagnostic method in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • the modules and devices described herein may be implemented in hardware such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), or field-programmable gate arrays (FPGAs).
  • object recognition in the present disclosure refers to object/feature identification (including detection and localization) and/or classification from an image, including a real-time image and a stored image, or a video including multiple images.
  • object recognition model in the present disclosure refers to a black box comprising multiple parameters, algorithms and processes, implemented by software or hardware, to perform the task of object recognition.
  • the object recognition model may be a deep learning neural network, including various kinds of convolutional neural networks (“CNN”).
  • the remote diagnostic system 1 comprises a first image input 110, a first edge device 120, and a remote diagnostic module 130.
  • the first image input 110 may be an endoscope for capturing real-time image of the digestive system of a patient.
  • the real-time image may refer to a series of endoscopic images or videos captured in real-time while an operator of the first image input 110 is operating the first image input 110.
  • An endoscope is a long, thin, flexible tube that has one or more cameras and light sources at one end.
  • the camera may be a CCD camera, a fiber optic camera, or other video recording device.
  • the first image input 110 may be an image capturing device or any other device having image capturing capability for obtaining images of a human or animal organ, such as the esophagus, stomach, small intestine, large intestine, kidney, ureter, and bladder. Examples include magnetic resonance imaging devices, X-ray devices, or ultrasonic imaging devices, etc.
  • An image may refer to an image frame captured by the first image input 110 at a specific time and the related image information processed from the image frame.
  • the image, including real-time image(s) or video stream captured by the first image input 110 may be pixelized or digitized for the processes of object recognition.
  • the real-time images may include multiple image frames with a predetermined frame rate.
  • the frame rate for an endoscope may be 30 or 60 frames/second.
  • the frame rate of the first image input 110 is not limited to this frame rate.
  • the first edge device 120 receives images, such as real-time images, and related information from the first image input 110 and prepares the real-time images to facilitate the object recognition processes.
  • the preparation may include sharpness adjustment, pixel value scaling, resolution resizing, RGB color channel reorganization, normalization, and/or standardization.
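  • As an illustration of this preparation step, the following sketch (assuming PyTorch/torchvision; the target resolution and normalization statistics are hypothetical, not taken from this disclosure) resizes and standardizes a captured frame:

```python
# Hypothetical preparation pipeline: resizing plus per-channel standardization.
import numpy as np
import torchvision.transforms as T

prepare = T.Compose([
    T.ToTensor(),                    # HWC uint8 -> CHW float in [0, 1]
    T.Resize((384, 704)),            # resolution resizing (height, width)
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standardization; example stats
                std=[0.229, 0.224, 0.225]),
])

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # a dummy captured frame
prepared = prepare(frame)                           # shape: (3, 384, 704)
```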
  • the first edge device 120 may be a device at the same location where the action of capturing the images by the first image input 110 occurs, such as a room for endoscopy procedure.
  • the first edge device 120 may be communicatively connected to the first image input 110 via wired or wireless connection so that the first edge device 120 can receive the images and related information with relatively little latency.
  • the latency requirement in some embodiments may be 180 ms/frame or shorter from the time an image frame is captured by the first image input to the time the inference result is presented to the operator.
  • the first edge device 120 and the remote diagnostic module 130 alone or collectively may comprise an object recognition model to perform the object recognition which generates the inference result for the operator to aid his/her diagnosis.
  • the inference result may include identification and/or classification. Identification detects and localizes a target topographic profile; the target topographic profile is then marked by a bounding box in the image. The type of target topographic profile directly corresponds to the type of the first image input 110 and the part of the human organ being imaged.
  • for example, if the first image input 110 takes an image of a human lung, the target topographic profile may be tuberculosis in the lung; or, if the first image input 110 is an ultrasonic imaging device taking an image of a human kidney, the target topographic profile may be a kidney stone.
  • Each type of the first image input 110 may be capable of taking image of a particular human or animal organ, such as esophagus, stomach, small intestine, large intestine, kidney, ureter, and bladder... etc.
  • the first image input 110 may perform magnetic resonance imaging, X-ray imaging, or ultrasonic imaging for obtaining the image of the target topographic profile.
  • Classification provides the probability of a target topographic profile belonging to a specific category.
  • a target topographic profile may be classified to be either neoplastic or non-neoplastic.
  • a neoplastic target topographic profile may be classified to be either benign, or potentially malignant, or malignant.
  • the first edge device 120 alone may be able to include an object recognition model with relatively small size for the object recognition, which may promptly generate an inference result with limited computation power in the first edge device.
  • such an arrangement may serve as a backup solution when the communication connection between the first edge device 120 and the remote diagnostic module is abnormal.
  • the inference result may be generated immediately and locally although the inference result may be less accurate compared with the other two embodiments.
  • the remote diagnostic module 130 alone may be able to include an object recognition model with relatively large size for the object recognition, which may generate an inference result with higher accuracy and more comprehensive information using the intensive computation power in the remote diagnostic module.
  • as long as the communication between the first edge device 120 and the remote diagnostic module 130 does not raise any latency concern, this embodiment may generate a more accurate and comprehensive inference result.
  • patients' images and other related information, usually regulated by privacy protection laws such as the General Data Protection Regulation (“GDPR”) in the European Union, may not be allowed to be transmitted to a remote diagnostic module located outside a hospital or clinic without the patient's consent.
  • the first edge device 120 and the remote diagnostic module 130 collectively perform the object recognition.
  • a first portion of the object recognition model is installed in the first edge device 120 to perform a first operation and a second portion of the object recognition model is installed in the remote diagnostic module 130 to perform a second operation.
  • the first edge device 120 may perform a first operation of the object recognition model on the prepared image by using multiple parameters to prepare a partially-processed image, such as feature maps in a CNN.
  • the multiple parameters may be weights and biases for a convolutional layer in a CNN.
  • such a partially-processed image is transmitted from the first edge device 120 to the remote diagnostic module 130 for further processing.
  • the remote diagnostic module 130 performs a second operation of the object recognition model to produce an inference result which identifies, recognizes, or classifies a target topographic profile, and transmits the inference result back to the first edge device 120.
  • the combination of the first operation and the second operation forms a complete process of object recognition which produces an inference result.
  • the first operation may be related to processing the image frames with a number of feature filters (or neurons) in the CNN to extract specific features from the image frame to produce feature maps for the purpose of object recognition. Since the feature maps scramble the image frame so that it cannot be read or recognized by a third party, the first operation is similar to an encryption of the image frames.
  • the partially-processed image is very different from the image before the first operation and is unconvertable back to the real-time image without the multiple parameters.
  • any third party who intentionally or accidentally receives the partially-processed image from the first edge device 120 cannot convert it back to the image taken by the first image input 110.
  • the partially-processed image cannot be comprehended to have any meaning associated with a patient.
  • the partially-processed image may therefore be considered de-personalized information, free to be transmitted outside of a hospital or clinic without further consent from the patient.
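  • A minimal sketch of this split, assuming a PyTorch-style network (the layer sizes and split point are illustrative only): the edge-side layers produce the feature maps that are transmitted, and the remote-side layers complete the inference.

```python
# Illustrative split of one model into an edge part and a remote part.
import torch
import torch.nn as nn

full_model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # first operation (edge side)
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # second operation (remote side)
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 2),                            # e.g., lesion / no lesion
)

edge_part = full_model[:2]     # installed in the first edge device
remote_part = full_model[2:]   # installed in the remote diagnostic module

image = torch.randn(1, 3, 384, 704)     # one prepared image frame
feature_maps = edge_part(image)         # partially-processed image data
# feature_maps cross the network; without the edge-side weights they cannot
# be converted back into the original frame
logits = remote_part(feature_maps)      # inference result
```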
  • the remote diagnostic system 1 may further comprise a controller 140 for receiving an instruction from an operator and sending said instruction to the first edge device 120. More specifically, the controller 140 receives instruction input from the operator to determine the desired setting or operation manner of the remote diagnostic system 1.
  • the instructions may include focusing on a specific area of the image frame, zooming in and zooming out of the image, switching between different inference modes, selecting a specific inference mode, adjusting some parameters for the object recognition model such as sensitivity, selecting a scope of first operation performed at the first edge device 120, turning on or off the notification, or cropping and saving an image for further diagnostic analysis...etc.
  • the sensitivity of the object recognition model may affect the number of target topographic profiles detected by the object recognition model in the image. For example, when the sensitivity is high, features in the image having a relatively low possibility of being a lesion may be indicated by the object recognition model as a lesion in the inference result; when the sensitivity is low, only features in the image having a relatively high possibility of being a lesion may be indicated by the object recognition model as a lesion.
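  • A toy illustration of such a sensitivity setting (the detection format and the threshold mapping are assumptions, not the actual model interface):

```python
# Hypothetical post-filtering of detections by a sensitivity setting.
def filter_detections(detections, sensitivity):
    """detections: list of (bounding_box, lesion_probability) pairs."""
    threshold = 1.0 - sensitivity      # high sensitivity -> low threshold
    return [(box, p) for box, p in detections if p >= threshold]

# High sensitivity keeps low-probability candidates as well:
dets = [((10, 10, 40, 40), 0.35), ((80, 60, 120, 100), 0.92)]
print(filter_detections(dets, 0.9))    # keeps both detections
print(filter_detections(dets, 0.2))    # keeps only the 0.92 detection
```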
  • a person with ordinary skill in the art can conceive other instructions provided to the controller 140 for controlling the operation of the remote diagnostic system without deviating from the spirit of the present invention.
  • the controller 140 executes the operator’s instructions by transmitting signals and/or information to the first image input 110, the first edge device 120, and/or the remote diagnostic module 130.
  • the object recognition model may be a one-stage (such as Yolo), or two-stage (such as Faster RCNN) CNN.
  • the backbone of the object recognition model refers to the portion of the network which takes the image as input and extracts the feature maps upon which the rest of the network is based. Commonly used backbones include ResNet and DarkNet. Usually, the largest proportion of parameters in a CNN comes from the backbone.
  • the second component is a neck, attached to the backbone, that fuses multi-level features.
  • the third component is a head for classification and localization.
  • the CNN also includes a region proposal network (“RPN”) which generates prediction candidates on extracted features.
  • a typical RPN is lightweight and efficient, such as a convolution layer followed by two fully connected layers for region proposal classification and bounding box regression.
  • the one-stage detection model is able to detect whether there is a target object within the bounding box, and determine the proper geometric shape of the bounding box.
  • the two-stage detection model can implement a region proposal network (RPN) to find the bounding boxes possibly containing a target lesion, detect whether there is a target object within each bounding box, and then determine the proper geometric shape of the bounding box.
  • the first operation of the object recognition model may include a portion of the backbone component to be performed at the first edge device 120.
  • the backbone may use DarkNet 53 and the first operation may include the first one, two, or three convolutional layers.
  • the second operation may include the remaining portion of the backbone component, the neck, and the head, to be performed at the remote diagnostic module 130.
  • the first edge device may perform the first convolution layer operation on the prepared image to generate the partially-processed image, in the form of feature maps.
  • one convolution layer may have 64 filters, each of which is a neuron in the neural network.
  • the first edge device 120 may have a small physical size compared to most other AI aided diagnostic systems. In such an embodiment, due to the limited size and calculation power of the first edge device 120, most of the AI aided diagnostic functions may be implemented in the remote diagnostic module 130. Because the first edge device 120 may have a very small size, it can be easily installed to fit in most hospital and clinic settings.
  • FIG. 6 illustrates an exemplary embodiment in which the first edge device 120 has a dimension of 118.2mm x 135mm x 93.1mm. However, the physical dimension of the first edge device 120 is not limited to this example.
  • the partially-processed image information may comprise feature maps of the image.
  • a feature is an individual measurable property or characteristic of a unit of the image.
  • a unit of the image may be a pixel or a grid in an image frame.
  • the feature maps generated after the first convolutional layer operation may comprise 32 layers of feature maps, each of which contains 672 x 382 grids.
  • each feature map is a layer of 672 x 382 quantitative information, such as numbers, representing a measurement of a specific feature (e.g., color, intensity, etc.) for each grid.
  • the feature map includes information regarding these numerical expressions for each unit (e.g., pixel/grid) along with its corresponding location in a frame of the image. Therefore, for example, the feature map may comprise a plurality of numerical expressions of graphical properties for each unit. In general, multiple corresponding feature maps may be extracted from an image frame.
  • the feature map may contain results of mathematical operations performed on numerical expressions of graphical properties.
  • the mathematical operation may act as a feature detector for performing the function of object recognition, including identification and classification.
  • a feature map may contain results of mathematical operations for identifying edges of an object in the image.
  • the convolution operation will be able to identify the edges of a target topographic profile of human organs, such as a polyp in a colon, in the image.
  • the mathematical operation may also act as a feature detector on the image to highlight certain topographic profile of human organs.
  • the feature maps may be in the form of a multiple dimension matrix.
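  • The shapes quoted above can be checked with a single convolution (a sketch assuming PyTorch; the 3 x 3 kernel with stride 1 and padding 1, which preserves the spatial grid, is an assumption):

```python
# 32-filter convolution on a 672 x 382 frame -> 32 feature maps of 672 x 382.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3,
                  stride=1, padding=1)
frame = torch.randn(1, 3, 382, 672)    # (batch, RGB channels, height, width)
fmaps = conv1(frame)
print(fmaps.shape)                     # torch.Size([1, 32, 382, 672])
```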
  • the remote diagnostic module 130 receives the partially-processed image and then performs the second operation to identify and/or classify a target topographic profile as an inference result, by using the portion of the object recognition model 131 installed in the remote diagnostic module 130.
  • the CNN described in the present invention may be Yolov4 with a Darknet53 backbone, which comprises a plurality of convolution layers. Darknet53 may comprise 52 convolution layers.
  • the partially-processed image information may comprise a feature map generated from one or more of the convolution layers.
  • the first edge device 120 comprises a portion (e.g., at least one convolution layer) of the CNN, such as Yolov4 with Darknet 53 architecture, so a portion of the object recognition process may be carried out by the first operation at the first edge device 120.
  • the number of convolution layers contained in the first edge device 120 may be adjusted in different embodiments to balance the needs of high efficiency and low latency.
  • the related factors to be considered include the computation power and physical size of the first edge device 120, and the speed of wireless transmission between the first edge device 120 and the remote diagnostic module 130.
  • when the remote diagnostic module 130 is not available, for example due to unexpected network malfunction, disconnection, or jam, a less complicated object recognition model may be executed entirely in the first edge device 120.
  • under such a situation, the first edge device 120 alone may be able to implement a basic object recognition model, such as one with fewer convolution layers, to perform only the lesion identification function, without the lesion classification function which may rely on the calculation power of the remote diagnostic module 130.
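  • A sketch of this fallback behavior (the transport function and both models are placeholders; the error types depend on the actual transport used):

```python
# Try the remote second operation first; fall back to a small local model.
def run_inference(frame, edge_model, local_model, send_feature_maps):
    feature_maps = edge_model(frame)             # first operation, always local
    try:
        return send_feature_maps(feature_maps)   # second operation, remote
    except (ConnectionError, TimeoutError):
        # network malfunction, disconnection, or jam:
        # run a basic identification-only model entirely on the edge device
        return local_model(frame)
```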
  • a convolution layer is a core building block of a CNN; it carries out the feature detection function. Through multiple convolution layers, a CNN is able to identify the target topographic profile (e.g., a lesion on the human organ). In some embodiments, the CNN may have multiple feature detectors to perform different functions such as shape identification, color identification, or texture identification, etc. Each of the feature detectors is a set of algorithms performed by layers of neurons or convolution layers. The feature detectors act as feature filters and output the corresponding feature maps.
  • an object recognition model has to be trained before it is used to perform the object recognition function on a new image.
  • One type of training is referred to as a supervised learning.
  • Supervised learning is a subcategory of machine learning in artificial intelligence. Supervised learning uses labeled datasets to teach models to yield the desired output. This labeled dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
  • the object recognition model may use labeled datasets to train the algorithms of a neural network model for classifying data or predicting outcomes accurately.
  • the model adjusts its weights and parameters against the labeled outcomes until it has been fitted appropriately, which occurs as part of the cross-validation process.
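  • A generic supervised-learning loop of the kind described above (a sketch assuming a PyTorch classifier; the optimizer, loss function, and hyperparameters are illustrative):

```python
# Labeled images in, loss measured, weights adjusted until the error is small.
import torch

def train(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:               # labeled dataset
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)   # measure the error
            loss.backward()
            optimizer.step()                # adjust weights and parameters
```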
  • at least three types of labeled datasets may be used for training the object recognition model.
  • an image frame is labeled with one or more target topographic profiles (e.g., lesion on the human organ or more specifically, a polyp in a colon), each of which is labeled with a bounding box to identify its location.
  • an image frame is labeled with whether it contains one or more target topographic profiles, but with no bounding box at all.
  • for this type of dataset, a visualization method such as CAM, gradCAM, or gradCAM++ may be used.
  • a heatmap may be generated as an outcome of the fully-connected layer and global pooling layer. The heatmap serves the purpose of informing what features are important for detecting and localizing a target topographic profile. Using this type of dataset may effectively reduce the time and effort for physicians to label where a target topographic profile is located in each image frame.
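  • One possible realization of such a heatmap is a Grad-CAM-style computation (a sketch assuming PyTorch; the choice of convolutional layer and the class index are assumptions):

```python
# Weight the chosen layer's feature maps by their pooled gradients.
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, class_idx):
    feats, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.append(gout[0]))
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * feats[0]).sum(dim=1))      # weighted feature maps
    return cam / (cam.max() + 1e-8)                    # normalized heatmap
```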
  • a video (image bag) containing multiple image frames is labeled with whether it contains one or more target topographic profiles, without identifying which image frame contains a target topographic profile.
  • the model to be trained only knows that a video, for example containing 500 image frames, has one or more topographic profiles but does not know which image frame contains a target topographic profile.
  • multiple-instance learning (MIL) may be implemented to train the model.
  • a video (image bag) may be labeled negative if all the image frames (instances) in the video are negative.
  • a video is labeled positive if there is at least one image frame (instance) in the video, which is positive.
  • for a video labeled positive, the model is trained by using only the image frame predicted most likely to contain a target topographic profile; for a video labeled negative, the model is likewise trained by using only the image frame predicted most likely to contain a target topographic profile. From a collection of labeled videos, the model is trained to either (i) induce a concept that will label individual images correctly or (ii) learn how to label videos without inducing the concept. Using this type of dataset may substantially reduce the time and effort for physicians to label each individual image frame.
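  • A sketch of one MIL update under these assumptions (the model returns a single lesion score per frame; the dataset format and loss function are placeholders):

```python
# Use only the highest-scoring frame of each labeled video (bag) for training.
import torch

def mil_step(model, video_frames, bag_label, optimizer, loss_fn):
    with torch.no_grad():
        scores = torch.stack([model(f) for f in video_frames])
        key = int(scores.argmax())   # frame most likely to contain a profile
    optimizer.zero_grad()
    # the bag label (positive or negative) supervises only that key frame
    loss = loss_fn(model(video_frames[key]), bag_label)
    loss.backward()
    optimizer.step()
```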
  • All of the above three types of datasets may be used to train an object recognition model, such as a CNN model.
  • the images in these datasets are prepared and stored before they are fed to the model for training.
  • image data augmentation techniques such as cutmix or mixup, may be used to create more labeled images for training. Cutmix is a data augmentation technique that addresses the issue of information loss and inefficiency present in regional dropout situations.
  • cutmix replaces the removed regions with a patch from another image, while the ground truth labels are mixed proportionally to the number of pixels contributed by the combined images.
  • the added patches further enhance localization ability by requiring the model to identify the object from a partial view.
  • Mixup is a generic and straightforward data augmentation principle. In essence, mixup generates weighted combinations of random image pairs from the training data to train a neural network. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples.
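  • Compact sketches of both augmentations (assuming image batches `x` with one-hot label batches `y`; the Beta parameter and box sampling follow the published recipes only loosely):

```python
# mixup: weighted blend of image pairs; cutmix: patch pasted from another image.
import numpy as np
import torch

def mixup(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def cutmix(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    H, W = x.shape[2], x.shape[3]
    ch, cw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = max(cy - ch // 2, 0), min(cy + ch // 2, H)
    x1, x2 = max(cx - cw // 2, 0), min(cx + cw // 2, W)
    mixed = x.clone()
    mixed[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]   # pasted patch
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)  # labels mixed by pixel share
    return mixed, lam_adj * y + (1 - lam_adj) * y[perm]
```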
  • the object recognition model may be concurrently trained when the remote diagnostic system is on a working mode, if the operator provides feedback on the inference result generated by the object recognition model. For example, the operator labels a target topographic profile that the object recognition model does not identify or vice versa. Another example is that the operator reclassifies a target topographic profile from one category to another category.
  • the remote diagnostic system may store the images and the operator's feedback and re-train the object recognition model periodically when the system accumulates a certain amount of labeled data.
  • to produce bounding boxes, the CNN may perform a K-means operation using the intersection over union (IOU) principle on the images.
  • when determining the number of clusters C, the number of clusters corresponds to the number of target topographic profiles in an image.
  • this process is a single iteration.
  • the iteration can stop when: (a) the centroids of newly formed clusters do not change; (b) the anchor boxes assigned to the clusters remain the same; or (c) the maximum number of iterations is reached.
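  • A sketch of such clustering under these assumptions (boxes are given as an array of widths and heights; 1 - IOU is used as the distance, as in YOLO-style anchor generation):

```python
# K-means on box width/height with 1 - IOU as the distance measure.
import numpy as np

def iou_wh(boxes, anchors):
    """boxes: (N, 2) array of (w, h); anchors: (C, 2). Returns (N, C) IOUs."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, C, max_iter=100):
    anchors = boxes[np.random.choice(len(boxes), C, replace=False)].copy()
    assign = None
    for _ in range(max_iter):                    # stops at max iterations
        new_assign = (1 - iou_wh(boxes, anchors)).argmin(axis=1)
        if assign is not None and (new_assign == assign).all():
            break                                # assignments stopped changing
        assign = new_assign
        for c in range(C):                       # recompute cluster centroids
            if (assign == c).any():
                anchors[c] = boxes[assign == c].mean(axis=0)
    return anchors
```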
  • the CNN in accordance with the present invention can use inference results of the patients for training.
  • the capability of lesion recognition of the CNN can be updated continuously and in real-time, which is very beneficial in terms of machine learning.
  • the CNN may produce an inference result and present said result to the operator in real time.
  • the operator may manually correct the inference result if the operator finds it to be inaccurate or inadequate.
  • the inference result may be a location of a lesion indicated by a bounding box.
  • the object recognition model 131 uses the sets of rules and parameters obtained from training to identify and recognize a target topographic profile in the feature map.
  • the target topographic profile may be, but is not limited to, a lesion of the human organ, or a change in shape, coloration, evenness, composition, or temperature gradient of certain areas in the human organ, etc., depending on the type of the first image input 110.
  • the target topographic profile is the topographical information of an area of a human organ. Each target topographic profile may be associated with a specific type of lesion or medical condition in human organs.
  • the object recognition model 131 specifically looks for target topographic profile in the feature map to identify, recognize, and classify lesions, and thus helps diagnose patients.
  • after performing object recognition of the target topographic profile, the remote diagnostic module 130 produces an inference result and notifies the operator of the first image input 110 of the inference result.
  • the inference result may contain information such as whether a lesion is found, the location of the lesion, the probability of the target topography profile being a specific classification of lesion, and the classification of the lesion.
  • the inference result is transmitted by the remote diagnostic module 130 to the first edge device 120.
  • the parameters and/or algorithm for performing identification and recognition of the target topographic profile may be modified or altered as desired by the operator, such that the sensitivity for lesion recognition can be adjusted according to the needs of the operator.
  • the sensitivity may be adjusted to a higher level such that every abnormal topographic profile is marked and notified; or the sensitivity may be adjusted to a lower level such that only the abnormal topographic profile having a higher probability of being a specific type of lesion is marked and notified.
  • the remote diagnostic module 130 may comprise a plurality of different object recognition models 131.
  • each object recognition model 131 may correspond to a different organ of the human body and a different sensitivity for object recognition; or each of the object recognition models 131 may correspond to recognition of a different type of lesion. The operator can select the appropriate object recognition model 131 for operation.
  • the operator can select a non-inference mode in which no object recognition model 131 in the remote diagnostic module 130 is selected.
  • the operator can operate the first image input 110 and edge device 120 without the assistance of the object recognition models 131.
  • the first edge device 120 may be readily equipped with preliminary object recognition capability and does not need to rely on the remote diagnostic module 130 to perform object recognition completely. Equipping the first edge device 120 with at least a portion of the object recognition capability gives the present invention more flexibility and usability; it also makes the present invention less vulnerable to technical issues related to network communication between the first edge device 120 and the remote diagnostic module 130.
  • the first edge device 120 may selectively download recognition models 131 from the remote diagnostic module 130 so as to execute object recognition locally by the first edge device 120 alone.
  • the convolution rate (the amount of time spent generating an inference result from an image frame) of the first edge device 120 and/or the remote diagnostic module 130 may need to be faster than a predetermined value.
  • the predetermined value may be 33 ms/frame.
  • the resolution of the image frame may be degraded to, for example, 704 x 384, 704 x 352, or 672 x 352, as long as the accuracy of object recognition can be maintained.
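  • A simple way to check such a budget (the inference callable and the measurement method are illustrative only):

```python
# Average per-frame time over n runs, compared against a 33 ms/frame budget.
import time

def meets_budget(infer, frame, budget_s=0.033, n=100):
    start = time.perf_counter()
    for _ in range(n):
        infer(frame)
    return (time.perf_counter() - start) / n <= budget_s
```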
  • the remote diagnostic system 1 in accordance with the present invention further comprises a video and audio output 150 for outputting a notification to the operator based on the inference result.
  • the format of the notification is determined based on an input of the controller 140, which is determined or selected by the operator.
  • the notification may comprise visual annotators or acoustic annotators.
  • the visual annotator may be superimposed on the real-time image to form a final image in which the location of the lesion is indicated for the operator to see; the acoustic annotator may be played to alert the operator that a lesion has been identified in the real-time image.
  • the remote diagnostic module 130 generates information regarding the classification of the target topography profile and the probability of the target topography profile being a particular type of lesion.
  • the video and audio output 150 outputs a notification including the classification of the target topography profile and the probability of the target topography profile being a particular type of lesion to the operator.
  • since the object recognition model 131 performs lesion recognition on the feature map for each image frame, the object recognition model 131 needs to have an image frame processing rate higher than the predetermined frame rate in order to achieve real-time lesion recognition.
  • the remote diagnostic system 1 may comprise one centralized remote diagnostic module 130, a plurality of image inputs 110, and a plurality of edge devices 120.
  • the centralized remote diagnostic module 130 is in communication with the plurality of edge devices 120 and the plurality of image inputs 110 via a wired or wireless communication network (such as the Internet, Ethernet, WiFi, or Bluetooth, etc.).
  • the centralized remote diagnostic module 130 may comprise a plurality of object recognition models 131.
  • each of the plurality of object recognition models 131 is responsible for object recognition for a different organ of the body; or each of the plurality of object recognition models 131 may correspond to a respective type or manufacturer (e.g., Olympus, Pentax, or Fuji, etc.) of the edge devices 120 or image inputs.
  • the centralized remote diagnostic module 130 may be implemented as a server for respectively receiving different inference requests or feature maps from the plurality of image inputs 110 and the plurality of edge devices 120 at the same time. The centralized remote diagnostic module 130 finds the appropriate object recognition models 131 for the respective inference requests or feature maps. After performing object recognition, the centralized remote diagnostic module 130 provides a plurality of inference results to the plurality of edge devices 120 concurrently.
  • hospitals or clinics only need one designated area for placing the centralized remote diagnostic module 130 (which requires a relatively large amount of physical space), and the plurality of edge devices 120 and image inputs 110 can be installed in the respective operating rooms; in other instances, the centralized remote diagnostic module 130 may be provided at another remote location such that the edge devices 120 and image inputs 110 from different hospitals or clinics can be connected to the same centralized remote diagnostic module 130. In this way, maintenance cost and physical space can be saved.
  • the plurality of edge devices 120 may be manufactured by different manufacturers.
  • the centralized remote diagnostic module 130 may comprise a plurality of object recognition models 131 suitable for different types of edge devices 120. The operator may select the appropriate object recognition model 131 for the corresponding edge device 120 during operation.
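  • A sketch of such a dispatch (the registry keys, request format, and models are hypothetical):

```python
# Route each inference request to the model matching its organ and device type.
class RemoteDiagnosticModule:
    def __init__(self):
        self.models = {}    # e.g., ("colon", "olympus") -> model

    def register(self, organ, device_type, model):
        self.models[(organ, device_type)] = model

    def handle(self, request):
        # request carries feature maps plus metadata sent by an edge device
        model = self.models[(request["organ"], request["device_type"])]
        return model(request["feature_maps"])    # inference result
```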
  • the present invention discloses an AI aided remote diagnostic method. The steps of the remote diagnostic method are described in detail below.
  • Step S01: obtaining an image via the first image input 110.
  • the image mentioned herein refers to a series of images or videos captured in real-time while an operator is operating a first image input 110.
  • the first image input 110 may be, but is not limited to, a camera or any other device having image capturing capability for obtaining still images or real-time images of human organs; examples may be endoscopes, magnetic resonance imaging devices, X-ray devices, or ultrasonic imaging devices, etc.
  • the first image input 110 receives instructions from the controller 140 with regard to the operation settings of the first image input 110.
  • the instructions may include focusing on a target area of the image, zooming in and out of the image, turning an inference mode on or off, turning the notification on or off, or cropping and saving a specific section of the image for further diagnostic analysis, etc.
  • the inference mode refers to whether the AI aided diagnostic functions are activated.
  • the first image input 110 captures the image with a predetermined frame rate; thus, the image may consist of a plurality of image frames. In some instances, the image or each of the image frames may be pixelized or digitized for the subsequent processes of object recognition.
  • if the inference mode is on, the image is provided to the first edge device 120 for partial processing.
  • if the inference mode is off, the image is superimposed with basic system information (such as date, time, or type of inference mode) to form a final image, which is displayed to the operator via the video and audio output 150.
  • Step S02: partially processing the image via the first edge device 120 to generate a feature map.
  • the first edge device 120 is in communication with the first image input 110 and receives the image information contained in the image.
  • the first edge device 120 may be a device local to the place where the action of obtaining the image information by the first image input 110 occurs.
  • after receiving the image information contained in the image, the first edge device 120 partially processes the image to facilitate the subsequent object recognition processes.
  • the image is partially processed to generate partially-processed image information which contains a series of feature maps of the image information contained in the image.
  • An example of the partially-processed image information may be in the form of a feature map.
  • the image may consist of a plurality of image frames, and each of the image frames may be pixelized or digitized for facilitating the object recognition process.
  • the feature contained in the feature map may correspond to an individual measurable property or characteristic of a unit of the image.
  • the feature may refer to the measurable quantities which a pixel may comprise (e.g., color, intensity, etc.) in one image frame, which are often numerical quantities.
  • the feature map gives numerical expressions for each unit (e.g., pixel) of the image in a specific image frame.
  • the feature map may comprise a plurality of numerical expressions of graphical properties associated with multiple dimensional coordinates of each of the numerical expressions.
  • each image frame may have a corresponding feature map; or, in some other cases, more than one feature map may be extracted from one image frame.
  • the partially- processed image information may contain a plurality of feature maps.
  • the feature map may be the result of mathematical operations performed on the numerical expressions of graphical properties.
  • the mathematical operation may act as a feature detector or filter for, as an example, identifying edges in the image. These mathematical operations may be considered as a portion of the procedure for identifying the target topographic profile.
  • the mathematical operation may also act as an image filter on the image to highlight certain topographic profile of human organs.
  • the feature map may be in the form of a multiple dimension matrix.
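  • As a hand-crafted stand-in for such a detector (in the trained CNN these kernels are learned, not fixed), a Sobel-style kernel convolved with a frame yields a feature map that highlights vertical edges:

```python
# Fixed 3 x 3 edge-detecting kernel applied as a convolution.
import torch
import torch.nn.functional as F

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
gray = torch.randn(1, 1, 382, 672)           # a single-channel image frame
edges = F.conv2d(gray, sobel_x, padding=1)   # feature map highlighting edges
```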
  • Step S03: transmitting the feature map to a remote diagnostic module 130 for identifying a target topographic profile based on an object recognition model 131 to produce an inference result.
  • an inference request may be provided to the remote diagnostic module along with the feature map.
  • the remote diagnostic module 130 receives the partially-processed image information and identifies a target topographic profile based on an object recognition model 131 in the remote diagnostic module 130 to produce the inference result.
  • the target topographic profile may be, but is not limited to, a change in coloration, evenness, composition, or temperature gradient of certain areas in the human organ, depending on the type of the first image input.
  • the target topographic profile is topographical information of an area of a human organ.
  • the partially-processed image information contains feature maps of the image information in the image.
  • the object recognition model 131 may be a Convolutional Neural Network (CNN), which is a specialized neural network for processing data presented as a multi-dimensional matrix and can be used for object recognition.
  • CNN may have multiple feature detectors to perform different functions such as shape identification, colors identification, or classification of the target topographic profile...etc. Each of the feature detectors may utilize a different feature map produced by the first edge device 120.
  • the CNN may be Yolov4 for enhancing real-time object recognition capability.
  • the CNN in accordance with one embodiment of the present invention does not use the inference results of the patients for training; instead, the CNN is trained prior to installation in the remote diagnostic module 130.
  • the CNN may take multiple dimensional images of lesions as inputs and continuously perform data pattern extraction or rule extraction from the images.
  • the remote diagnostic module 130 only receives feature maps from the first edge devices 120, thus reducing privacy risks.
  • the CNN of the present invention can be trained and revised periodically from the server side to continuously improve the object recognition capability. As a result, the performance of the remote diagnostic system 1 in accordance with the present invention can be improved constantly.
  • the CNN in accordance with the present invention can use inference results of the patients for training.
  • the capability of lesion recognition of the CNN can be updated continuously and in real-time, which is very beneficial in terms of machine learning.
  • the object recognition model 131 uses the sets of rules and parameters obtained from training to identify and recognize a target topographic profile in the feature map. Each target topographic profile may be associated with a specific type of lesion or medical condition in human organs. The object recognition model 131 specifically looks for the target topographic profile in the feature map to identify, recognize, and classify lesions, and thus helps diagnose the patients. After performing identification and recognition of the target topographic profile, the remote diagnostic module 130 produces an inference result and notifies the operator of the first image input 110 of the inference result. For example, the inference result may contain information such as whether a lesion is found, the location of the lesion, and the classification of the lesion. The inference result comprises a probability of the target topographic profile being a lesion or a classification of the lesion.
  • the parameters and/or algorithm for performing identification and recognition of the target topographic profile may be modified or altered as desired by the operator, such that the sensitivity for lesion recognition can be adjusted according to the needs of the operator.
  • the sensitivity may be adjusted to a higher level such that every abnormal topographic profile is marked and notified; or the sensitivity may be adjusted to a lower level such that the abnormal topographic profile having a higher probability of being a specific type of lesion is marked and notified.
  • the remote diagnostic module 130 may comprise a plurality of object recognition models 131, each of which corresponds to object recognition for a portion of an organ. Since the object recognition model 131 performs lesion recognition on the feature map for each image frame, the object recognition model 131 needs to have an image frame processing rate higher than the predetermined frame rate in order to achieve real-time lesion recognition.
  • Step S04: transmitting an inference result to the first edge device 120, the first edge device 120 outputting a notification to an operator of the first image input 110 based on the inference result.
  • the remote diagnostic system 1 in accordance with the present invention comprises a video and audio output 150 for outputting the notification to the operator.
  • the format of the notification is determined based on an input of a controller 140, which is determined or selected by the operator. More specifically, the controller 140 receives instructions from the operator and forms an inference request, which contains a set of parameters regarding, for example, the type of lesion to be identified, the format of the notification when a lesion is found, etc. Then, the first edge device 120 receives the set of parameters specified by the inference request from the controller 140.
  • the notification may comprise visual annotators or acoustic annotators.
  • the visual annotator may be superimposed on the image to form a final image in which the location of the lesion is indicated for the operator to see; the acoustic annotator may be played to alert the operator that a lesion has been identified in the image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A remote diagnostic system and a remote diagnostic method are disclosed. The remote diagnostic system comprises a first edge device for receiving an image from a first image input capturing images of an organ and performing a first operation of an object recognition model on the image by using multiple parameters to produce partially-processed image data; and a remote diagnostic module receiving the partially-processed image data from the first edge device, performing a second operation of the object recognition model to produce an inference result which identifies a target topographic profile, and transmitting the inference result to the first edge device; wherein the partially-processed image is unconvertable back to the image before the first operation without the multiple parameters.

Description

REMOTE DIAGNOSTIC SYSTEM AND METHOD
FIELD OF THE INVENTION
The present invention is related to a remote artificial intelligence aided diagnostic system and method; more specifically, to a remote artificial intelligence aided diagnostic system and method capable of identifying and classifying lesions from video or images in real-time.
DESCRIPTION OF RELATED ART
The advancement of artificial intelligence (AI) enables it to be applied to a variety of industry sectors. In particular, AI with object recognition capability has been a target of interest for the medical field, especially for clinical practitioners. An AI aided diagnostic system may be used as an assistance to clinical doctors or examiners for identifying lesions within the digestive system by using graphic or object recognition. However, an AI aided diagnostic system may require a large amount of physical space, and thus it may not be suitable for hospitals or clinics having limited physical space. The present invention aims to provide an improved data processing architecture such that an AI aided diagnostic system can be widely implemented in different hospital or clinical settings without occupying a large amount of physical space.
SUMMARY
The implementation of AI aided diagnostic systems in hospitals or clinics is restricted by physical space and financial resources. Although it may be desirable to install multiple full AI aided diagnostic devices in a hospital or clinic, this is often difficult due to the factors mentioned above. In another aspect, although a fully remote AI aided diagnostic system may conserve physical space, its operation relies heavily on network stability, which makes the AI aided diagnostic system vulnerable to external influences. Therefore, designing an AI aided diagnostic system with operational flexibility is crucial for its development.
In order to facilitate the implementation of the AI aided diagnostic system in different hospital or clinical settings, in one embodiment of the present invention, the remote diagnostic system comprises a first edge device for receiving an image from a first image input capturing images of an organ and performing a first operation of an object recognition model on the image by using multiple parameters to produce partially-processed image data; and a remote diagnostic module, receiving the partially-processed image data from the first edge device, performing a second operation of the object recognition model to produce an inference result which identifies a target topographic profile, and transmitting the inference result to the first edge device. The partially-processed image is unconvertable back to the image before the first operation without the multiple parameters.
The first edge device and the remote diagnostic module alone or collectively may comprise an object recognition model to perform the object recognition which generates the inference result for the operator to aid his/her diagnosis. The inference result may include identification and/or classification. Identification detects and localizes a target topographic profile. Classification provides the probability of a target topographic profile belonging to a specific category.
In a first embodiment, the first edge device alone may include an object recognition model of relatively small size for the object recognition, which may promptly generate an inference result with the limited computation power of the first edge device. In particular, such an arrangement may serve as a backup solution when the communication connection between the first edge device and the remote diagnostic module is abnormal. In this embodiment, the inference result may be generated immediately and locally, although it may be less accurate compared with the other two embodiments.
In a second embodiment, the remote diagnostic module alone may include an object recognition model of relatively large size for the object recognition, which may generate an inference result with higher accuracy and more comprehensive information by using the intensive computation power of the remote diagnostic module. As long as the communication between the first edge device and the remote diagnostic module does not raise any latency concern, this embodiment may generate a more accurate and comprehensive inference result.
In a third embodiment, the first edge device and the remote diagnostic module collectively perform the object recognition. Thus, a first portion of the object recognition model is installed in the first edge device to perform a first operation and a second portion of the object recognition model is installed in the remote diagnostic module to perform a second operation. After image preparation, the first edge device may perform a first operation of the object recognition model on the prepared image by using multiple parameters to prepare a partially-processed image, such as feature maps in a CNN. The multiple parameters may be the weights and biases of a convolutional layer in the CNN. The partially-processed image is transmitted from the first edge device to the remote diagnostic module for further processing. The remote diagnostic module performs a second operation of the object recognition model to produce an inference result which identifies, recognizes, or classifies a target topographic profile, and transmits the inference result back to the first edge device. The combination of the first operation and the second operation forms a complete process of object recognition which produces an inference result.
According to one embodiment of the present invention, the remote diagnostic system may further comprise a controller for receiving instructions from an operator. The controller receives input from the operator to determine a set of parameters or instructions regarding the processing of the image information to produce partially-processed image information such as a feature map.
In one embodiment, the object recognition model may be a one-stage (such as Yolo), or two-stage (such as Faster RCNN) CNN. The backbone of the object recognition model refers to the portion of network which takes the image as input and extracts the feature maps upon which the rest of the network is based. Commonly used backbones include ResNet and DarkNet.
In one embodiment of the present invention, the partially-processed image information may comprise feature maps of the image. A feature is an individual measurable property or characteristic of a unit of the image. The remote diagnostic module receives the partially-processed image and then performs the second operation to identify and/or classify a target topographic profile as an inference result, by using the portion of the object recognition model installed in the remote diagnostic module.
In one embodiment, the CNN in accordance with the present invention does not use inference results of the patients for training; instead, the CNN is trained prior to installation in the remote diagnostic module. The remote diagnostic module only receives feature maps from the first edge devices. The CNN of the present invention can be trained and revised periodically from the server side to continuously improve the object recognition capability. In this embodiment, because the remote diagnostic module only receives feature maps, privacy concerns for the patients can be reduced. Furthermore, because the object recognition model of the CNN is decentralized, with one portion installed in the first edge device and another portion installed in the remote diagnostic system, the integrity of the entire object recognition model can be maintained, and the object recognition model is less vulnerable to piracy.
In other embodiments, with the agreement of the patients, the CNN in accordance with the present invention can use inference results of the patients for training. The capability of lesion recognition of the CNN can be updated continuously and in real-time.
In yet another embodiment of the present invention, the remote diagnostic system in accordance with the present invention further comprises a video and audio output for outputting the notification to the operator. The format of the notification is determined based on an input of the controller, which is determined or selected by the operator. The notification may comprise visual annotators or acoustic annotators.
In another aspect of the present invention, an AI aided remote diagnostic method is disclosed. The method comprises: receiving a partially-processed real-time image from a first edge device which performs a first operation of an object recognition model on a real-time image received from a first image input by using multiple parameters; performing a second operation of the object recognition model on the partially-processed real-time image to produce an inference result which identifies a target topographic profile; and transmitting the inference result to the first edge device; wherein the partially-processed image is unconvertable back to the real-time image without the multiple parameters. By dividing the tasks of object recognition in a manner such that one portion of the object recognition process is performed by the first edge device and the other portion is performed by the remote diagnostic module, the AI aided diagnostic system in accordance with the present invention may be more applicable to hospitals or clinics having limited physical space. The efficiency of object recognition may also be enhanced, since the speed of object recognition is no longer limited by the hardware of the local (edge) device. Furthermore, since the remote diagnostic module only receives feature maps of the real-time image generated by the first edge device, it is difficult for a third party to intercept and interpret the image information, since the feature map acts as an encryption of the real-time image. As a result, the privacy of the patients can be secured.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic view of the remote diagnostic system in accordance with the present invention.
FIG. 2 is a schematic view of the remote diagnostic system in accordance with a first embodiment of the present invention.
FIG. 3 is a schematic view of the remote diagnostic system in accordance with a second embodiment of the present invention.
FIG. 4 is a schematic view of the remote diagnostic system in accordance with a third embodiment of the present invention.
FIG. 5 is a schematic view illustrating an embodiment of the controller, the video and audio output, and the notification for indicating the location of the lesion.
FIG. 6 is a schematic view illustrating an embodiment of the first edge device and its physical dimension.
FIG. 7 is a schematic view illustrating the relation and operation between the first edge device and the remote diagnostic module.
FIG. 8 is a schematic view illustrating an embodiment in which multiple image inputs and edge devices share a centralized remote diagnostic module.
FIG. 9 is a flow chart illustrating a remote diagnostic method in accordance with the present invention.
FIG. 10 is another flow chart illustrating a remote diagnostic method in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The terminology used in the description presented below is intended to be interpreted in its broadest reasonable manner, even though it is used in conjunction with a detailed description of certain specific embodiments of the technology. Certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be specifically defined as such in this Detailed Description section. Embodiments of the present invention will be described, by way of examples only, with reference to the accompanying drawings.
The embodiments introduced below can be implemented by programmable circuitry programmed or configured by software and/or firmware, or entirely by special-purpose circuitry, or in a combination of such forms. Such special-purpose circuitry (if any) can be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.
The phrase “object recognition” in the present disclosure refers to object/feature identification (including detection and localization) and/or classification from an image, including a real-time image and a stored image, or a video including multiple images. The phrase “object recognition model” in the present disclosure refers to a black box comprising multiple parameters, algorithms, and processes, implemented by software or hardware, to perform the task of object recognition. The object recognition model may be a deep learning neural network, including various kinds of convolutional neural networks (“CNN”).
With reference to FIG. 1, in one embodiment of the present invention, the remote diagnostic system 1 comprises a first image input 110, a first edge device 120, and a remote diagnostic module 130. As an example, the first image input 110 may be an endoscope for capturing real-time images of the digestive system of a patient. The real-time image may refer to a series of real-time endoscopic images or videos captured while an operator of the first image input 110 is operating the first image input 110. An endoscope is a long, thin, flexible tube that has one or more cameras and light sources at one end. The camera may be a CCD camera, a fiber optic camera, or other video recording device. In other embodiments, the first image input 110 may be an image capturing device or any other device having image capturing capability for obtaining images of a human or animal organ, such as the esophagus, stomach, small intestine, large intestine, kidney, ureter, or bladder. Examples include magnetic resonance imaging devices, X-ray devices, or supersonic imaging devices, etc. An image may refer to an image frame captured by the first image input 110 at a specific time and the related image information processed from the image frame. In some instances, the image, including real-time image(s) or a video stream captured by the first image input 110, may be pixelized or digitized for the processes of object recognition. The real-time images may include multiple image frames with a predetermined frame rate. As an example, the frame rate for an endoscope may be 30 or 60 frames/second. However, the frame rate of the first image input 110 is not limited to this example.
The first edge device 120 receives images, such as real-time images, and related information from the first image input 110 and prepares the real-time images to facilitate the object recognition processes. The preparation may include sharpness adjustment, pixel value scaling, resolution resizing, RGB color channel reorganization, normalization, and/or standardization. In some embodiments, the first edge device 120 may be a device at the same location where the action of capturing the images by the first image input 110 occurs, such as a room for an endoscopy procedure. The first edge device 120 may be communicatively connected to the first image input 110 via a wired or wireless connection in order for the first edge device 120 to receive the images and related information with relatively little latency. For example, the latency requirement in some embodiments may be 180 ms/frame or shorter from the time an image frame is captured by the first image input to the time the inference result is presented to the operator.
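By way of a non-limiting illustration only, the preparation steps above might be implemented as in the following Python sketch. The disclosure does not specify a toolchain, so OpenCV/NumPy, the resize target, and the normalization constants below are assumptions of this example rather than part of the invention.

    import cv2
    import numpy as np

    # Hypothetical preparation constants -- not specified in this disclosure.
    TARGET_W, TARGET_H = 672, 382           # resolution resizing
    MEAN = np.array([0.485, 0.456, 0.406])  # per-channel normalization (assumed)
    STD = np.array([0.229, 0.224, 0.225])

    def prepare_frame(frame_bgr: np.ndarray) -> np.ndarray:
        """Prepare one captured image frame for the object recognition model."""
        # Sharpness adjustment (an unsharp mask is one possible choice).
        blurred = cv2.GaussianBlur(frame_bgr, (0, 0), sigmaX=3)
        sharpened = cv2.addWeighted(frame_bgr, 1.5, blurred, -0.5, 0)
        # Resolution resizing.
        resized = cv2.resize(sharpened, (TARGET_W, TARGET_H))
        # RGB color channel reorganization (OpenCV captures frames as BGR).
        rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        # Pixel value scaling to [0, 1], then normalization/standardization.
        scaled = rgb.astype(np.float32) / 255.0
        return (scaled - MEAN) / STD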
The first edge device 120 and the remote diagnostic module 130, alone or collectively, may comprise an object recognition model to perform the object recognition which generates the inference result for the operator to aid his/her diagnosis. The inference result may include identification and/or classification. Identification detects and localizes a target topographic profile; the target topographic profile is then marked by a bounding box in the image. The type of target topographic profile directly corresponds to the type of the first image input 110 and the part of the human organ. For example, if the first image input 110 is an X-ray camera taking an image of a human lung, the target topographic profile may be tuberculosis in the lung; or if the first image input 110 is an ultrasonic imaging device taking an image of a human kidney, the target topographic profile may be a kidney stone. Each type of the first image input 110 may be capable of taking images of a particular human or animal organ, such as the esophagus, stomach, small intestine, large intestine, kidney, ureter, or bladder. The first image input 110 may perform magnetic resonance imaging, X-ray imaging, or ultrasonic imaging for obtaining the image of the target topographic profile.
Classification provides the probability of a target topographic profile belonging to a specific category. For example, a target topographic profile may be classified to be either neoplastic or non-neoplastic. A neoplastic target topographic profile may be classified to be either benign, or potentially malignant, or malignant. Based on the roles of the first edge device 120 and the remote diagnostic module 130 in performing the object recognition function, there are three embodiments.
With reference to FIG. 2, in a first embodiment, the first edge device 120 alone may include an object recognition model of relatively small size for the object recognition, which may promptly generate an inference result with the limited computation power of the first edge device. In particular, such an arrangement may serve as a backup solution when the communication connection between the first edge device 120 and the remote diagnostic module is abnormal. In this embodiment, the inference result may be generated immediately and locally, although it may be less accurate compared with the other two embodiments.
With reference to FIG. 3, in a second embodiment, the remote diagnostic module 130 alone may include an object recognition model of relatively large size for the object recognition, which may generate an inference result with higher accuracy and more comprehensive information by using the intensive computation power of the remote diagnostic module. As long as the communication between the first edge device and the remote diagnostic module does not raise any latency concern, this embodiment may generate a more accurate and comprehensive inference result. However, patients' images and other related information, usually regulated by privacy protection law such as the General Data Protection Regulation (“GDPR”) in the European Union, may not be allowed to be transmitted to a remote diagnostic module located outside a hospital or clinic without the patient's consent.
With reference to FIG. 4, in a third embodiment, the first edge device 120 and the remote diagnostic module 130 collectively perform the object recognition. Thus, a first portion of the object recognition model is installed in the first edge device 120 to perform a first operation and a second portion of the object recognition model is installed in the remote diagnostic module 130 to perform a second operation. After image preparation, the first edge device 120 may perform a first operation of the object recognition model on the prepared image by using multiple parameters to prepare a partially-processed image, such as feature maps in a CNN. The multiple parameters may be the weights and biases of a convolutional layer in the CNN. The partially-processed image is transmitted from the first edge device 120 to the remote diagnostic module 130 for further processing. The remote diagnostic module 130 performs a second operation of the object recognition model to produce an inference result which identifies, recognizes, or classifies a target topographic profile, and transmits the inference result back to the first edge device 120. The combination of the first operation and the second operation forms a complete process of object recognition which produces an inference result. The first operation may be related to processing the image frames with a number of feature filters (or neurons) in the CNN to extract specific features from the image frame to produce feature maps for the purpose of object recognition. Because the feature maps scramble the image frame so that it cannot be read or recognized by a third party, the first operation is similar to an encryption of the image frames. Thus, the partially-processed image is very different from the image before the first operation and is unconvertable back to the real-time image without the multiple parameters. In other words, any third party who intentionally or accidentally receives the partially-processed image from the first edge device 120 cannot convert it back to the image taken by the first image input 110. In addition, similar to encrypted information, the partially-processed image cannot be comprehended to have any meaning associated with a patient. Thus, the partially-processed image may be considered de-personalized information and is free to be transmitted outside of a hospital or clinic without further consent from the patient.
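As a hedged sketch of this division of labor (the disclosure does not name a framework; PyTorch and all layer shapes below are assumptions of this example, not the claimed model), the first operation might run on the edge device and emit only feature maps:

    import torch
    import torch.nn as nn

    class EdgePortion(nn.Module):
        """First operation: runs on the first edge device 120."""
        def __init__(self):
            super().__init__()
            # The 'multiple parameters' are the weights and biases of this layer.
            self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        def forward(self, x):
            return torch.relu(self.conv1(x))  # partially-processed image (feature maps)

    class RemotePortion(nn.Module):
        """Second operation: runs on the remote diagnostic module 130."""
        def __init__(self, num_classes=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(128, num_classes)
        def forward(self, feature_maps):
            z = self.features(feature_maps).flatten(1)
            return self.head(z)  # inference result (e.g., class logits)

    edge, remote = EdgePortion(), RemotePortion()
    frame = torch.rand(1, 3, 382, 672)   # one prepared image frame
    feature_maps = edge(frame)           # only this tensor leaves the device
    logits = remote(feature_maps)        # sent back as the inference result
    # Without conv1's weights and biases, feature_maps cannot be inverted back
    # to the original frame, which is the privacy property described above.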
With reference to FIGS. 1-5, according to one embodiment of the present invention, the remote diagnostic system 1 may further comprise a controller 140 for receiving an instruction from an operator and sending said instruction to the first edge device 120. More specifically, the controller 140 receives instruction input from the operator to determine the desired setting or operation manner of the remote diagnostic system 1. For example, the instructions may include focusing on a specific area of the image frame, zooming in and out of the image, switching between different inference modes, selecting a specific inference mode, adjusting parameters of the object recognition model such as sensitivity, selecting a scope of the first operation performed at the first edge device 120, turning the notification on or off, or cropping and saving an image for further diagnostic analysis, etc. In particular, the sensitivity of the object recognition model may affect the number of target topographic profiles detected by the object recognition model in the image. For example, when the sensitivity is high, features in the image having a relatively low possibility of being a lesion may be indicated by the object recognition model as lesions in the inference result; when the sensitivity is low, only features in the image having a relatively high possibility of being a lesion may be indicated by the object recognition model as lesions. A person with ordinary skill in the art can conceive of other instructions provided to the controller 140 for controlling the operation of the remote diagnostic system without deviating from the spirit of the present invention. The controller 140 executes the operator's instructions by transmitting signals and/or information to the first image input 110, the first edge device 120, and/or the remote diagnostic module 130.
In one embodiment, the object recognition model may be a one-stage (such as Yolo) or two-stage (such as Faster RCNN) CNN. The backbone of the object recognition model refers to the portion of the network which takes the image as input and extracts the feature maps upon which the rest of the network is based. Commonly used backbones include ResNet and DarkNet. Usually, the largest proportion of parameters in a CNN comes from the backbone. The second component is a neck, attached to the backbone, that fuses multi-level features. The third component is a head for classification and localization. In another embodiment, the CNN also includes a region proposal network (“RPN”) which generates prediction candidates on extracted features. A typical RPN is lightweight and efficient, such as a convolution layer followed by two fully connected layers for region proposal classification and bounding box regression.
The one-stage detection model detects whether there is a target object within a bounding box and determines the proper geometric shape of the bounding box. The two-stage detection model can implement a selective search or region proposal network (RPN) to find candidate bounding boxes possibly containing a target lesion, detect whether there is a target object within each bounding box, and then determine the proper geometric shape of the bounding box.
As described in the third embodiment previously, the first operation of the object recognition model may include a portion of the backbone component to be performed at the first edge device 120. In one embodiment, the backbone may use DarkNet 53, and the first operation may include the first one, two, or three convolutional layers. The second operation may include the remaining portion of the backbone component, the neck, and the head, to be performed at the remote diagnostic module 130. In the above example, after image preparation, the first edge device may perform the first convolution layer operation on the prepared image to generate the partially-processed image in the form of feature maps. For example, one convolution layer may have 64 filters, each of which is a neuron in the neural network. With reference to FIG. 6, the first edge device 120 may have a small physical size compared to most other AI aided diagnostic systems. In such an embodiment, due to the limited size and calculation power of the first edge device 120, most of the AI aided diagnostic functions may be implemented in the remote diagnostic module 130. Because the first edge device 120 may have a very small size, it can be easily installed to fit most hospital and clinic settings. FIG. 6 illustrates an exemplary embodiment in which the first edge device 120 has a dimension of 118.2 mm x 135 mm x 93.1 mm. However, the physical dimension of the first edge device 120 is not limited to this example.
With reference to FIGS. 4 and 7, in one embodiment of the present invention, the partially-processed image information may comprise feature maps of the image. A feature is an individual measurable property or characteristic of a unit of the image. For example, a unit of the image may be a pixel or a grid in an image frame. For an image frame with 672 x 382 pixels having 3 layers (R, G, and B layers) as the input of a CNN model, the output generated after the first convolutional layer operation may have 32 layers of feature maps, each of which contains 672 x 382 grids. Each feature map is a layer of 672 x 382 pieces of quantitative information, such as numbers, representing a measurement of a specific feature (e.g., color, intensity, etc.) for each grid. The feature map includes this numerical expression for each unit (e.g., pixel/grid) along with its corresponding location in a frame of the image. Therefore, the feature map may comprise a plurality of numerical expressions of graphical properties for each unit. In general, multiple corresponding feature maps may be extracted from an image frame. In addition, the feature map may contain results of mathematical operations performed on numerical expressions of graphical properties. The mathematical operation may act as a feature detector for performing the function of object recognition, including identification and classification. For example, a feature map may contain results of mathematical operations for identifying edges of an object in the image. Thus, the convolution operation will be able to identify the edges of a target topographic profile of human organs, such as a polyp in a colon, in the image. The mathematical operation may also act as a feature detector on the image to highlight certain topographic profiles of human organs. As a result, in one embodiment, the feature maps may be in the form of a multiple dimension matrix.
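A minimal sketch, assuming PyTorch and an illustrative 3x3 kernel with padding (neither is prescribed by this disclosure), of how the quoted dimensions correspond to such a multiple dimension matrix:

    import torch
    import torch.nn as nn

    # One 672 x 382 RGB image frame: batch x channels x height x width.
    frame = torch.rand(1, 3, 382, 672)
    # A first convolutional layer with 32 filters (illustrative configuration);
    # padding=1 with a 3x3 kernel preserves the spatial grid.
    conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
    feature_maps = conv1(frame)
    print(feature_maps.shape)  # torch.Size([1, 32, 382, 672]): 32 layers of 672 x 382 grids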
With reference to FIG. 4, the remote diagnostic module 130 receives the partially-processed image and then performs the second operation to identify and/or classify a target topographic profile as an inference result, by using the portion of the object recognition model 131 installed in the remote diagnostic module 130. As an example, the CNN described in the present invention may be Yolov4 with a Darknet53 backbone, which comprises a plurality of convolution layers. Darknet53 may comprise 52 convolution layers. In this example, the partially-processed image information may comprise a feature map generated from one or more of the convolution layers. In this example, the first edge device 120 comprises a portion (e.g., at least one convolution layer) of the CNN, such as Yolov4 with the Darknet53 architecture, so a portion of the object recognition process may be carried out by the first operation at the first edge device 120. The number of convolution layers contained in the first edge device 120 may be adjusted in different embodiments to balance the needs of high efficiency and low latency. The related factors to be considered include the computation power and physical size of the first edge device 120, and the speed of wireless transmission between the first edge device 120 and the remote diagnostic module 130. In some embodiments, when the remote diagnostic module 130 is not available, for example due to unexpected network malfunction, disconnection, or jam, a less complicated object recognition model may be completely executed in the first edge device 120. In one embodiment, the first edge device 120 alone may implement a basic object recognition model, such as one with fewer convolution layers, to perform only the lesion identification function under such a situation, without the lesion classification function, which may rely on the calculation power of the remote diagnostic module 130.
A convolution layer is a core building block of a CNN; it carries out the feature detection function. Through multiple convolution layers, a CNN is able to identify the target topographic profile (e.g., a lesion on the human organ). In some embodiments, the CNN may have multiple feature detectors to perform different functions, such as shape identification, color identification, or texture identification, etc. Each of the feature detectors is a set of algorithms performed by layers of neurons or convolution layers. The feature detectors act as feature filters and output the corresponding feature maps.
In general, an object recognition model has to be trained before it is used to perform the object recognition function on a new image. One type of training is referred to as supervised learning. Supervised learning is a subcategory of machine learning in artificial intelligence. Supervised learning uses labeled datasets, which include inputs and correct outputs, to train the algorithms of a neural network model to classify data or predict outcomes accurately, allowing the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized. As input data is fed into the model, the labeled outcome adjusts its weights and parameters until the model has been fitted appropriately, which occurs as part of the cross-validation process. For supervised learning, at least three types of labeled datasets may be used for training the object recognition model.
First, an image frame is labeled with one or more target topographic profiles (e.g., a lesion on the human organ or, more specifically, a polyp in a colon), each of which is labeled with a bounding box to identify its location.
Second, an image frame is labeled with whether it contains one or more target topographic profiles but with no bounding box at all. Thus, the model to be trained only knows that a specific image frame has one or more target topographic profiles but does not know where they are located. In such a case, visualization methods (such as CAM, gradCAM, and gradCAM++) may be used to train the model by identifying which features have higher importance leading to the identification of a target topographic profile. For instance, a heatmap may be generated as an outcome of the fully-connected layer and global pooling layer. The heatmap serves the purpose of informing what features are important for detecting and localizing a target topographic profile. Using this type of dataset may effectively reduce the time and effort required for physicians to label where a target topographic profile is located in each image frame.
Third, a video (image bag) containing multiple image frames is labeled with whether it contains one or more target topographic profiles, without identifying which image frame contains a target topographic profile. Thus, the model to be trained only knows that a video, for example one containing 500 image frames, has one or more target topographic profiles but does not know which image frame contains one. In such a case, in addition to CAM, multiple-instance learning (MIL) may be implemented to train the model. As a simple case of multiple-instance binary classification, a video (image bag) may be labeled negative if all the image frames (instances) in the video are negative. On the other hand, a video is labeled positive if there is at least one image frame (instance) in the video which is positive. In one embodiment, for both a video labeled positive and a video labeled negative, the model is trained by using only the image frame predicted to most likely contain a target topographic profile. From a collection of labeled videos, the model is trained to either (i) induce a concept that will label individual images correctly or (ii) learn how to label videos without inducing the concept. Using this type of dataset may substantially reduce the time and effort required for physicians to label each individual image frame; a minimal sketch of this bag-level training follows.
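The sketch below assumes PyTorch and a hypothetical per-frame scoring model; the max-over-frames loss is one common MIL formulation, not a formulation prescribed by this disclosure.

    import torch

    def mil_video_loss(frame_scores: torch.Tensor, video_label: float) -> torch.Tensor:
        """Multiple-instance binary classification for one video (image bag).

        frame_scores: per-frame lesion logits, shape (num_frames,).
        video_label: 1.0 if at least one frame is positive, else 0.0.
        Only the frame predicted most likely to contain a target topographic
        profile contributes to the loss, for positive and negative videos alike.
        """
        top_score = frame_scores.max()
        target = torch.tensor(video_label)
        return torch.nn.functional.binary_cross_entropy_with_logits(top_score, target)

    # Example: a 500-frame video labeled positive, and one labeled negative.
    loss_pos = mil_video_loss(torch.randn(500), 1.0)
    loss_neg = mil_video_loss(torch.randn(500), 0.0)  # even the max frame must be negative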
All of the above three types of datasets may be used to train an object recognition model, such as a CNN model. The images in these datasets are prepared and stored before they are fed to the model for training. In addition to the preparation described before, such as sharpness adjustment, pixel value scaling, resolution resizing, RGB color channel reorganization, normalization, and/or standardization, image data augmentation techniques, such as cutmix or mixup, may be used to create more labeled images for training. Cutmix is a data augmentation technique that addresses the issue of information loss and inefficiency present in regional dropout situations. Instead of removing pixels and filling them with black or grey pixels or Gaussian noise, cutmix replaces the removed regions with a patch from another image, while the ground truth labels are mixed proportionally to the number of pixels of the combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. Mixup is a generic and straightforward data augmentation principle. In essence, mixup generates weighted combinations of random image pairs from the training data to train a neural network. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples.
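An illustrative sketch of mixup follows (assuming PyTorch; the alpha parameter is a conventional placeholder, not a value from this disclosure). Cutmix would instead swap a rectangular patch between the two images and mix the labels by patch area.

    import torch

    def mixup(x1, y1, x2, y2, alpha: float = 0.2):
        """Mixup: a weighted combination of a random image pair and its labels.

        x1, x2: image tensors of identical shape; y1, y2: one-hot label tensors.
        """
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        x = lam * x1 + (1.0 - lam) * x2   # blended training image
        y = lam * y1 + (1.0 - lam) * y2   # labels mixed in the same proportion
        return x, y

    # Example with two random frames and binary one-hot labels.
    xa, xb = torch.rand(3, 382, 672), torch.rand(3, 382, 672)
    ya, yb = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
    x_mixed, y_mixed = mixup(xa, ya, xb, yb)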
In one embodiment, the object recognition model may be trained concurrently while the remote diagnostic system is in a working mode, if the operator provides feedback on the inference result generated by the object recognition model. For example, the operator labels a target topographic profile that the object recognition model does not identify, or vice versa. Another example is that the operator reclassifies a target topographic profile from one category to another. In another embodiment, the remote diagnostic system may store the images and the operator's feedback and re-train the object recognition model periodically when the system accumulates a certain amount of labeled data.
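A hedged sketch of such periodic re-training, assuming PyTorch and a hypothetical accumulated feedback dataset of images and operator-corrected class labels; all hyperparameters below are illustrative:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def retrain(model, feedback_images, corrected_labels, epochs=3, lr=1e-4):
        """Fine-tune the model on operator-corrected inference results.

        feedback_images: float tensor (N, C, H, W) of stored frames.
        corrected_labels: long tensor (N,) of class indices from operator feedback.
        """
        loader = DataLoader(TensorDataset(feedback_images, corrected_labels),
                            batch_size=8, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)  # operator corrections act as ground truth
                loss.backward()
                optimizer.step()
        return model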
In other embodiments, the CNN may perform a K-means operation with the intersection over union (IOU) principle on the images to produce bounding boxes. The procedure for generating the bounding boxes involves:
1. determining the number of clusters C: the number of clusters corresponds to the number of target topographic profiles in an image;
2. selecting C random points as initial centroids: supposing there are 3 target topographic profiles in the image, we randomly select 3 centroids, one for each cluster (which means C=3);
3. calculating the IOU between each of the anchor boxes indicating the feature of the target topographic profile in all the clusters and the 3 centroids, to assign each anchor box to its closest cluster centroid;
4. recomputing the centroids of the newly formed clusters; and
5. repeating steps 3 and 4.
This process is a single iteration. The iteration can stop when: a. the centroids of newly formed clusters do not change; b. the anchor boxes assigned to the clusters remain the same; or c. the maximum number of iterations is reached.
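The iteration described in steps 1-5 might look like the following NumPy sketch, in which anchor boxes are represented as (width, height) pairs and IOU serves as the similarity measure; the data and cluster count are hypothetical, and the sketch assumes no cluster goes empty:

    import numpy as np

    def iou_wh(box, centroids):
        """IOU between one (w, h) anchor box and each (w, h) centroid,
        treating all boxes as sharing the same top-left corner."""
        inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
        union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes, C=3, max_iters=100):
        """Steps 1-5 above: cluster (w, h) anchor boxes by IOU similarity."""
        rng = np.random.default_rng(0)
        centroids = boxes[rng.choice(len(boxes), size=C, replace=False)]  # step 2
        assignment = np.full(len(boxes), -1)
        for _ in range(max_iters):                                         # stop condition c
            new_assignment = np.array(
                [np.argmax(iou_wh(b, centroids)) for b in boxes])          # step 3
            if np.array_equal(new_assignment, assignment):                 # stop condition b
                break
            assignment = new_assignment
            centroids = np.array(
                [boxes[assignment == k].mean(axis=0) for k in range(C)])   # step 4
        return centroids

    # Example: widths and heights of labeled lesion boxes (hypothetical data).
    boxes = np.abs(np.random.randn(50, 2)) * 100 + 20
    print(kmeans_anchors(boxes, C=3))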
In other embodiments, with the agreement of the patients, the CNN in accordance with the present invention can use inference results of the patients for training. In this embodiment, the lesion recognition capability of the CNN can be updated continuously and in real time, which is very beneficial in terms of machine learning. In some embodiments, when the operator performs endoscopy on a patient with the present invention, the CNN may produce an inference result and present said result to the operator in real time. However, the operator may manually correct the inference result if the operator finds it to be inaccurate or inadequate. For example, the inference result may be a location of a lesion indicated by a bounding box. If the bounding box inaccurately indicates the location of the lesion, the operator may drag the bounding box to the correct location by controlling the controller 140; or if the inference result falsely classifies a lesion, the operator may manually enter the correct type of lesion. These corrections or feedbacks from the operator may be used as ground truth for improving and tuning the object recognition accuracy of the CNN.
With reference to FIG. 7, the object recognition model 131 uses the sets of rules and parameters obtained from training to identify and recognize a target topographic profile in the feature map. The target topographic profile may be, but is not limited to, a lesion of the human organ, or a change in shape, coloration, evenness, composition, or temperature gradient of certain areas in the human organ, etc., depending on the type of first image input 110. The target topographic profile is the topographical information of an area of a human organ. Each target topographic profile may be associated with a specific type of lesion or medical condition in human organs. The object recognition model 131 specifically looks for target topographic profiles in the feature map to identify, recognize, and classify lesions, and thus helps diagnose patients. After performing object recognition of the target topographic profile, the remote diagnostic module 130 produces an inference result and notifies the operator of the first image input 110 of the inference result. For example, the inference result may contain information such as whether a lesion is found, the location of the lesion, the probability of the target topographic profile being a specific classification of lesion, and the classification of the lesion. In some embodiments, the inference result is transmitted by the remote diagnostic module 130 to the first edge device 120. In some embodiments of the present invention, the parameters and/or algorithm for performing identification and recognition of the target topographic profile may be modified or altered as desired by the operator of the present invention, such that the sensitivity of lesion recognition can be adjusted according to the needs of the operator. For example, the sensitivity may be adjusted to a higher level such that every abnormal topographic profile is marked and notified; or the sensitivity may be adjusted to a lower level such that only an abnormal topographic profile having a higher probability of being a specific type of lesion is marked and notified.
In the present invention, the remote diagnostic module 130 may comprise a plurality of different object recognition models 131. For example, each object recognition model 131 may correspond to a different organ of the human body and a different sensitivity for object recognition; or each object recognition model 131 may correspond to recognition of a different type of lesion. The operator can select the appropriate object recognition model 131 for operation.
In some instances, the operator can select a non-inference mode in which no object recognition model 131 in the remote diagnostic module 130 is selected. The operator can operate the first image input 110 and edge device 120 without the assistance of the object recognition models 131. In this case, the first edge device 120 may be readily equipped with preliminary object recognition capability and need not rely on the remote diagnostic module 130 to perform object recognition completely. Equipping the first edge device 120 with at least a portion of the object recognition capability gives the present invention more flexibility and usability; it also makes the present invention less vulnerable to technical issues related to network communication between the first edge device 120 and the remote diagnostic module 130. In some other instances, the first edge device 120 may selectively download object recognition models 131 from the remote diagnostic module 130 so as to execute object recognition locally by the first edge device 120 alone.
In one embodiment of the present invention, in order to achieve real-time object recognition, the convolution rate (the amount of time spent to generate an inference result from an image frame) of the first edge device 120 and/or the remote diagnostic module 130 may be faster than a predetermined value. As an example, the predetermined value may be 33 ms/frame. To increase the convolution rate, the resolution of the image frame may be degraded to, for example, 704 x 384, 704 x 352, or 672 x 352, as long as the accuracy of object recognition can be maintained.
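As an illustrative way to check this budget (assuming PyTorch; the stand-in model and run counts are placeholders, not part of this disclosure), one might measure the per-frame inference time and degrade the resolution when the budget is exceeded:

    import time
    import torch

    BUDGET_S = 0.033  # the predetermined value of 33 ms per frame

    def per_frame_time(model, frame, warmup=5, runs=50):
        """Estimate the time spent to generate an inference result from one frame."""
        model.eval()
        with torch.no_grad():
            for _ in range(warmup):
                model(frame)
            start = time.perf_counter()
            for _ in range(runs):
                model(frame)
        return (time.perf_counter() - start) / runs

    model = torch.nn.Sequential(torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU())
    frame = torch.rand(1, 3, 384, 704)  # a 704 x 384 frame
    if per_frame_time(model, frame) > BUDGET_S:
        # Degrade resolution, e.g., to 672 x 352, and measure again.
        frame = torch.nn.functional.interpolate(frame, size=(352, 672))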
In one embodiment of the present invention, the remote diagnostic system 1 in accordance with the present invention further comprises a video and audio output 150 for outputting a notification to the operator based on the inference result. The format of the notification is determined based on an input of the controller 140, which is determined or selected by the operator. The notification may comprise visual annotators or acoustic annotators. For example, the visual annotator may be superimposed on the real-time image to form a final image in which a location of the lesion is indicated for the operator to see; the acoustic annotator may be played to alert the operator that a lesion has been identified in the real-time image.
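A minimal sketch of superimposing such a visual annotator, assuming OpenCV and a hypothetical inference-result dictionary; the field names are illustrative, not part of this disclosure:

    import cv2

    def superimpose_notification(frame_bgr, inference):
        """Superimpose a visual annotator on the real-time image to form the final image.

        `inference` is a hypothetical dict, e.g.
        {"box": (x1, y1, x2, y2), "label": "polyp", "probability": 0.87}.
        """
        x1, y1, x2, y2 = inference["box"]
        cv2.rectangle(frame_bgr, (x1, y1), (x2, y2), (0, 0, 255), 2)  # lesion location
        text = f'{inference["label"]} {inference["probability"]:.0%}'
        cv2.putText(frame_bgr, text, (x1, max(y1 - 8, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
        return frame_bgr  # displayed via the video and audio output 150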
In another embodiment, the remote diagnostic module 130 generates information regarding the classification of the target topographic profile and the probability of the target topographic profile being a particular type of lesion. The video and audio output 150 outputs a notification including the classification of the target topographic profile and the probability of the target topographic profile being a particular type of lesion to the operator.
Since the object recognition model 131 performs lesion recognition on the feature map for each image frame, the object recognition model 131 needs to have an image frame processing rate higher than the predetermined frame rate in order to achieve real-time lesion recognition.
With reference to FIGS. 7 and 8, in one embodiment of the present invention, the remote diagnostic system 1 may comprise one centralized remote diagnostic module 130, a plurality of image inputs 110, and a plurality of edge devices 120. The centralized remote diagnostic module 130 is in communication with the plurality of edge devices 120 and the plurality of image inputs 110 via a wired or wireless communication network (such as the internet, Ethernet, WIFI, or Bluetooth, etc.). The centralized remote diagnostic module 130 may comprise a plurality of object recognition models 131. Each of the plurality of object recognition models 131 is responsible for object recognition for a different organ of the body; or each of the plurality of object recognition models 131 may correspond to a respective type or manufacturer (e.g., Olympus, Pentax, or Fuji, etc.) of the first edge devices 120 or image input. The centralized remote diagnostic module 130 may be implemented as a server for respectively receiving different inference requests or feature maps from the plurality of image inputs 110 and the plurality of edge devices 120 at the same time. The centralized remote diagnostic module 130 finds the appropriate object recognition model 131 for each respective inference request or feature map. After performing object recognition, the centralized remote diagnostic module 130 provides a plurality of inference results to the plurality of edge devices 120 concurrently. Under this type of architecture, hospitals or clinics only need one designated area for placing the centralized remote diagnostic module 130 (which requires a relatively large amount of physical space), and the plurality of edge devices 120 and image inputs 110 can be installed in respective operating rooms; in other instances, the centralized remote diagnostic module 130 may be provided at another remote location such that the plurality of edge devices 120 and image inputs 110 from different hospitals or clinics can be connected to the same centralized remote diagnostic module 130. Maintenance cost and physical space can thereby be saved.
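A hedged sketch of how such a centralized module might dispatch each inference request to the appropriate model; the registry keys and the stand-in model are purely illustrative assumptions, not part of this disclosure:

    from typing import Callable, Dict, Tuple

    # Each entry maps an (organ, manufacturer) key from an inference request to
    # the object recognition model 131 responsible for it; the loader is a
    # placeholder for however models are actually instantiated.
    ModelKey = Tuple[str, str]
    REGISTRY: Dict[ModelKey, Callable] = {}

    def register(organ: str, manufacturer: str, model: Callable) -> None:
        REGISTRY[(organ, manufacturer)] = model

    def handle_request(organ: str, manufacturer: str, feature_maps):
        """Find the appropriate object recognition model for an inference request."""
        model = REGISTRY.get((organ, manufacturer))
        if model is None:
            raise KeyError(f"No object recognition model for {organ}/{manufacturer}")
        return model(feature_maps)  # inference result returned to the edge device

    # Example registration with a stand-in model:
    register("colon", "Olympus", lambda fm: {"lesion": False})
    print(handle_request("colon", "Olympus", feature_maps=None))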
In some embodiments, the plurality of edge devices 120 may be manufactured by different manufacturers. The centralized remote diagnostic module 130 may comprise a plurality of object recognition models 131 suitable for different types of edge devices 120. The operator may select the appropriate object recognition model 131 for the corresponding edge device 120 during operation.
With reference to FIG. 9 and FIG. 10, the present invention discloses an AI aided remote diagnostic method. The steps of the remote diagnostic method are described in detail below.
Step S01: obtaining an image via the first image input 110. The image mentioned herein refers to a series of images or videos captured in real time while an operator is operating the first image input 110. The first image input 110 may be, but is not limited to, a camera or any other device having image capturing capability for obtaining still images or real-time images of a human organ, examples of which are endoscopes, magnetic resonance imaging devices, X-ray devices, or supersonic imaging devices, etc. In one embodiment, the first image input 110 receives instructions from the controller 140 with regard to the operation settings of the first image input 110. The instructions may include focusing on a target area of the image, zooming in and out of the image, turning an inference mode on or off, turning the notification on or off, or cropping and saving a specific section of the image for further diagnostic analysis, etc. More specifically, in some embodiments, the inference mode refers to whether to activate the AI aided diagnostic functions. The first image input 110 captures the image with a predetermined frame rate; thus, the image may consist of a plurality of image frames. In some instances, the image or each of the image frames may be pixelized or digitized for the subsequent processes of object recognition. In some embodiments, if the inference mode is on, the image is provided to the first edge device 120 for partial processing. On the other hand, if the inference mode is off, the image will be superimposed with basic system information (such as date, time, or type of inference mode) to form a final image to be displayed to the operator via the video and audio output 150.
Step S02: partially processing the image via the first edge device 120 to generate a feature map. The first edge device 120 is in communication with the first image input 110 and receives the image information contained in the image. The first edge device 120 may be a device local to the place where the action of obtaining the image information by the first image input 110 occurs. After receiving the image information contained in the image, the first edge device 120 partially processes the image to facilitate the subsequent object recognition processes. Particularly, the image is partially processed to generate partially-processed image information which contains a series of feature maps of the image information contained in the image. An example of the partially-processed image information may be in the form of a feature map. As mentioned earlier, the image may consist of a plurality of image frames, and each of the image frames may be pixelized or digitized to facilitate the object recognition process. The feature contained in the feature map may correspond to an individual measurable property or characteristic of a unit of the image. In the case in which the image is pixelized or digitized and comprises a plurality of image frames, the feature may refer to the measurable quantities which a pixel may comprise (e.g., color, intensity, etc.) in one image frame, which are often numerical quantities. The feature map gives numerical expressions for each unit (e.g., pixel) of the image in a specific image frame. Therefore, the feature map may comprise a plurality of numerical expressions of graphical properties associated with the multiple dimensional coordinates of each of the numerical expressions. In some embodiments, each image frame may have a corresponding feature map; or, in some other cases, more than one feature map may be extracted from one image frame. As a result, the partially-processed image information may contain a plurality of feature maps. In some other embodiments, the feature map may be the result of mathematical operations performed on each numerical expression of graphical properties. The mathematical operation may act as a feature detector or filter for, as an example, identifying edges in the image. These mathematical operations may be considered a portion of the procedure for identifying the target topographic profile. The mathematical operation may also act as an image filter on the image to highlight certain topographic profiles of human organs. In some instances, the feature map may be in the form of a multiple dimension matrix.
Step S03: transmitting the feature map to a remote diagnostic module 130 for identifying a target topographic profile based on an object recognition model 131 to produce an inference result. In some instances, an inference request may be provided to the remote diagnostic module along with the feature map. The remote diagnostic module 130 receives the partially-processed image information and identifies a target topographic profile based on an object recognition model 131 in the remote diagnostic module 130 to produce the inference result. The target topographic profile may be, but is not limited to, a change in coloration, evenness, composition, or temperature gradient of certain areas in the human organ, depending on the type of first image input. The target topographic profile is topographical information of an area of a human organ. As mentioned previously, in one embodiment of the present invention, the partially-processed image information contains feature maps of the image information in the image. The object recognition model 131 may be a Convolutional Neural Network (CNN), which is a specialized neural network for processing data that has an input of a multiple dimension matrix and can be used for object recognition. In the present embodiment, the CNN may have multiple feature detectors to perform different functions such as shape identification, color identification, or classification of the target topographic profile, etc. Each of the feature detectors may utilize a different feature map produced by the first edge device 120. In some embodiments of the present invention, the CNN may be Yolov4 for enhancing real-time object recognition capability.
In one embodiment, for reducing privacy concerns of the patients, the CNN in accordance with one embodiment of the present invention does not use inference results of the patients for training; instead, the CNN is trained prior to installation in the remote diagnostic module 130. During training, the CNN may take multiple dimensional images of lesions as inputs and continuously perform data pattern extraction or rule extraction from the images. Furthermore, the remote diagnostic module 130 only receives feature maps from the first edge devices 120, thus reducing privacy risks. The CNN of the present invention can be trained and revised periodically from the server side to continuously improve the object recognition capability. As a result, the performance of the remote diagnostic system 1 in accordance with the present invention can be improved constantly. In other embodiments, with the agreement of the patients, the CNN in accordance with the present invention can use inference results of the patients for training. In this embodiment, the lesion recognition capability of the CNN can be updated continuously and in real time, which is very beneficial in terms of machine learning.
The object recognition model 131 uses the sets of rules and parameters obtained from training to identify and recognize a target topographic profile in the feature map. Each target topographic profile may be associated with a specific type of lesion or medical condition in human organs. The object recognition model 131 specifically looks for target topographic profiles in the feature map to identify, recognize, and classify lesions, and thus helps diagnose the patients. After performing identification and recognition of the target topographic profile, the remote diagnostic module 130 produces an inference result and notifies the operator of the first image input 110 of the inference result. For example, the inference result may contain information such as whether a lesion is found, the location of the lesion, and the classification of the lesion. The inference result comprises a probability of the target topographic profile being a lesion or a classification of the lesion.
In some embodiments of the present invention, the parameters and/or algorithm for performing identification and recognition of the target topographic profile may be modified or altered as desired by the operator of the present invention, such that the sensitivity of lesion recognition can be adjusted according to the needs of the operator. For example, the sensitivity may be adjusted to a higher level such that every abnormal topographic profile is marked and notified; or the sensitivity may be adjusted to a lower level such that only an abnormal topographic profile having a higher probability of being a specific type of lesion is marked and notified.
In some instances, the remote diagnostic module 130 may comprise a plurality of object recognition models 131, each of which corresponds to object recognition for a portion of an organ. Since the object recognition model 131 performs lesion recognition on the feature map for each image frame, the object recognition model 131 needs to have an image frame processing rate higher than the predetermined frame rate in order to achieve real-time lesion recognition.
Step S04: transmitting an inference result to the first edge device 120, the first edge device 120 outputting a notification to an operator of the first image input 110 based on the inference result. In some embodiments, the remote diagnostic system 1 in accordance with the present invention comprises a video and audio output 150 for outputting the notification to the operator. The format of the notification is determined based on an input of a controller 140, which is determined or selected by the operator. More specifically, the controller 140 receives instructions from the operator and forms an inference request, which contains a set of parameters regarding, for example, the type of lesion to be identified, the format of the notification when a lesion is found, etc. Then, the first edge device 120 receives the set of parameters specified by the inference request from the controller 140. The notification may comprise visual annotators or acoustic annotators. For example, the visual annotator may be superimposed on the image to form a final image in which a location of the lesion is indicated for the operator to see; the acoustic annotator may be played to alert the operator that a lesion has been identified in the image.
Although particular embodiments of the present invention have been described in detail for purposes of illustration, various modifications and enhancements may be made without departing from the spirit and scope of the present invention. Accordingly, the present invention is not to be limited except as by the appended claims.
The foregoing description of embodiments is provided to enable any person skilled in the art to make and use the subject matter. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the novel principles and subject matter disclosed herein may be applied to other embodiments without the use of the innovative faculty. The claimed subject matter set forth in the claims is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is contemplated that additional embodiments are within the spirit and true scope of the disclosed subject matter. Thus, it is intended that the present invention covers modifications and variations that come within the scope of the appended claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A remote diagnostic system employing a neural network for object recognition, comprising:
a first edge device for receiving an image from a first image input capturing images of an organ and performing a first operation of an object recognition model on the image by using multiple parameters to produce partially-processed image data;
a remote diagnostic module, receiving the partially-processed image data from the first edge device, performing a second operation of the object recognition model to produce an inference result which identifies a target topographic profile, and transmitting the inference result to the first edge device; and
wherein the partially-processed image data is unconvertable back to the image before the first operation without the multiple parameters.
2. The remote diagnostic system of claim 1, wherein the organ is one of esophagus, stomach, small intestine, large intestine, kidney, ureter, and bladder.
3. The remote diagnostic system of claim 1, wherein the target topographic profile is a lesion, which is a polyp.
4. The remote diagnostic system of claim 1, wherein the inference result identifies a location of the target topographic profile with a bounding box.
5. The remote diagnostic system of claim 1, wherein the inference result classifies the target topographic profile to be either neoplastic or non-neoplastic.
6. The remote diagnostic system of claim 5, wherein the inference result classifies the neoplastic target topographic profile to be benign, potentially malignant, or malignant.
7. The remote diagnostic system of claim 1, wherein the inference result includes a probability of the target topographic profile being a lesion or being one of neoplastic lesion or non-neoplastic lesion.
8. The remote diagnostic system of claim 1, wherein the remote diagnostic module comprises multiple types of object recognition models respectively for different types of organs or different manufacturers of the first image input.
9. The remote diagnostic system of claim 4, wherein the object recognition model implements one or more of object detection, object localization, and object classification functions.
10. The remote diagnostic system of claim 1, wherein the object recognition model uses a one-stage model or a two-stage model to perform object detection and object localization.
11. The remote diagnostic system of claim 1, wherein the object recognition model comprises a convolutional neural network.
12. The remote diagnostic system of claim 1, wherein the object recognition model is a Yolo 4 model.
13. The remote diagnostic system of claim 1, wherein the object recognition model comprises the first operation performed at the first edge device and the second operation performed at the remote diagnostic module, and the first operation includes at least one convolution layer.
14. The remote diagnostic system of claim 1, further comprising a controller communicatively connected to the first edge device, the controller receiving an instruction from an operator for managing the remote diagnostic system.
15. The remote diagnostic system of claim 1, further comprising a video and audio output communicatively connected to the first edge device, the video and audio output providing a notification to the operator.
16. The remote diagnostic system of claim 15, wherein the notification is a visual notification or an acoustic notification.
17. The remote diagnostic system of claim 1, wherein the image comprises a plurality of image frames with a predetermined frame rate.
18. The remote diagnostic system of claim 17, wherein the object recognition model has an image frame process rate larger than the predetermined frame rate.
19. The remote diagnostic system of claim 1, wherein the object recognition model is trained with an operator's feedback on the inference result concurrently with the performing of the first operation or the second operation.

20. A remote diagnostic method comprising:
receiving a partially-processed real-time image from a first edge device which performs a first operation of an object recognition model on a real-time image received from a first image input by using multiple parameters;
performing a second operation of the object recognition model on the partially-processed real-time image to produce an inference result which identifies a target topographic profile;
transmitting the inference result to the first edge device; and
wherein the partially-processed image is unconvertable back to the real-time image without the multiple parameters.

21. The remote diagnostic method of claim 20, further comprising: receiving a model operating parameter from a controller operated by an operator.

22. The remote diagnostic method of claim 20, further comprising: preparing the real-time image received from the first image input by pixel value scaling, resolution resizing, or RGB color channel reorganization before performing the first operation of the object recognition model.
23. The remote diagnostic method of claim 20, wherein the inference result includes a probability of the target topographic profile being a lesion, or being one of neoplastic lesion or non-neoplastic lesion.
24. The remote diagnostic method of claim 20, wherein the remote diagnostic module comprises multiple types of object recognition models respectively for different types of organs or different manufacturers of the first image input.
25. The remote diagnostic method of claim 20, wherein a visual annotator is superimposed on the real-time image and indicates a location of a lesion.
26. A method for training an object recognition model used in a remote diagnostic system, comprising:
providing multiple labeled images;
preparing the multiple labeled images by pixel value scaling, resolution resizing, or RGB color channel reorganization to generate multiple prepared images;
augmenting the multiple prepared images by cutmix or mixup to generate multiple augmented images; and
providing the multiple augmented images to train a convolutional neural network.
27. The method of claim 26, wherein the convolutional neural network is a Yolo model.
28. The method of claim 26, wherein class activation mapping is used for training when the labeled image only informs the convolutional neural network whether the labeled image contains a target topographic profile, without a location of the target topographic profile.
29. The method of claim 26, wherein the multiple instance problem is used for training when the multiple labeled images only inform the convolutional neural network whether a bag of multiple images contains a target topographic profile.
PCT/US2021/059585 2021-11-16 2021-11-16 Remote diagnostic system and method WO2023091128A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/059585 WO2023091128A1 (en) 2021-11-16 2021-11-16 Remote diagnostic system and method

Publications (1)

Publication Number Publication Date
WO2023091128A1 true WO2023091128A1 (en) 2023-05-25

Family

ID=86397634

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/059585 WO2023091128A1 (en) 2021-11-16 2021-11-16 Remote diagnostic system and method

Country Status (1)

Country Link
WO (1) WO2023091128A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180214111A1 (en) * 2015-08-06 2018-08-02 Case Western Reserve University Decision support for disease characterization and treatment response with disease and peri-disease radiomics
US20200019844A1 (en) * 2017-04-28 2020-01-16 Intel Corporation Smart autonomous machines utilizing cloud, error corrections, and predictions
WO2021086498A1 (en) * 2019-10-28 2021-05-06 Aetherai Ip Holding Llc Enhancing memory utilization and throughput in executing a computational graph

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21964942

Country of ref document: EP

Kind code of ref document: A1