WO2017151757A1

WO2017151757A1 - Recurrent neural feedback model for automated image annotation

Info

Publication number: WO2017151757A1
Application number: PCT/US2017/020183
Authority: WO
Inventors: Hoo-Chang SHIN; Le LU; Ronald M. SUMMERS
Original assignee: The United States Of America, As Represented By The Secretary, Department Of Health And Human Services
Priority date: 2016-03-01
Filing date: 2017-03-01
Publication date: 2017-09-08

Abstract

A deep learning model is provided to efficiently detect disease from an image (e.g., an x-ray image) and annotate its contexts. In one example of the disclosed technology, a method of generating an annotation sequence describing an input image includes training a convolutional neural network (CNN) with a series of reference images and associated annotation sequences, training a recurrent neural network (RNN) by initializing the RNN with the trained CNN embedding of the reference image and a first word of an annotation sequence, sampling the CNN and RNN with a reference image, and producing a sequence of annotation describing the image, disease(s) in the image, one or more attributes or contexts. In one examples of the disclosed technology, mean pooling is applied to the state vectors of RNN to obtain a joint image/text context vector summarizing the contexts of image and text annotation.

Description

RECURRENT NEURAL FEEDBACK MODEL FOR

AUTOMATED IMAGE ANNOTATION

CROSS REFERENCE TO RELATED APPLICATION

[001] This application claims the benefit of and priority to U.S. Provisional Application No. 62/302,084, filed March 1, 2016, which application is incorporated by reference in its entirety.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

[002] This invention was made with government support under contract no.

HHSN263200900026I awarded by the National Institutes of Health. The government has certain rights in the invention.

SUMMARY

[003] Methods and apparatus are disclosed for machine learning using neural networks to analyze medical image text reports and generating annotations for medical images describing diseases and their contexts. Providing a description of a medical image's content similar to how a radiologist would describe an image can have a great impact. A person can better understand a disease in an image if it is presented with its context, e.g. , where the disease is, how severe it is, and which organ is affected. Furthermore, a large collection of medical images can be automatically annotated with the disease context and the images can be retrieved based on their context, with natural language queries such as "find me images with pulmonary disease in the upper right lobe."

[004] A deep learning model is provided to efficiently detect disease from an image (e.g. , an x-ray image) and annotate its contexts. In one example of the disclosed technology, a method of generating an annotation sequence describing an input image includes training a convolutional neural network (CNN) with a series of reference images and associated annotation sequences, training a recurrent neural network (RNN) by initializing the RNN with the trained CNN embedding of the reference image and a first word of an annotation sequence, sampling the CNN and RNN with a reference image, and producing a sequence of annotation describing the image, disease(s) in the image, one or more attributes or contexts. In some examples of the disclosed technology, mean pooling is applied to the state vectors of RNN to obtain a joint image/text context vector summarizing the contexts of image and text annotation. In one example, a clustering technique is applied to the obtained joint image/text context vector to assign more precise labels to the image taking the context into account. Training the CNN and RNN again with these more precise labels leads to generating more accurate annotations for a new unseen image. In some examples, images are selected for training the neural networks by adjusting the ratio of normal to diseased images. In some examples, the image training set is augmented by training the neural networks with randomly cropped versions of the training images, whereby images of normal cases are randomly selected to balance the number of diseased-to-normal cases during training.

[005] In some examples of the disclosed technology, a deep learning model is provided to efficiently detect diseases from an image (e.g. , an x-ray, magnetic resonance image, computerized axial tomography, or acoustic ultrasound scan of mammals including humans) and annotate its contexts (e.g. , location, severity level, and/or affected organs). In some examples, image annotations from a radiology dataset of medical images and associated reports are used to mine disease names to train convolutional neural networks (CNNs). In some examples, ImageNet- trained CNN features and regularization techniques are used to circumvent large normal- vs- diseased cases bias. In some examples, recurrent neural networks (RNNs) are then trained to describe contexts of a detected disease, based on deep CNN features. In some examples, feedback from an already-trained pair of CNN/RNNs is used with the domain- specific image/text dataset to infer joint image/text contexts for composite image labeling. Thus, in some examples, significantly improved image annotation results are demonstrated using a recurrent neural feedback model by taking joint image/text contexts into account.

[006] Methods and apparatus are disclosed for using a deep learning model to effectively and efficiently detect pathologies from an image and annotate its context (e.g. , pathology, organ, location, and severity of the detected pathology). In certain examples, a radiology database of chest x-rays and associated image annotations are used to mine disease names to train CNNs. RNNs are then trained to describe the context of a detected disease or pathology, building on top of the deep CNN features. Further, feedback from a previously-trained pair of CNNs and RNNs with a domain-specific image/text dataset can be used to infer joint image/text context that can be used for composite image labeling. Thus, image annotation results for images such as x-rays and other medical images can be produces using an RNN feedback model that takes into account joint image/text contextual information.

[007] In some examples of the disclosed technology, a method of generating an annotation sequence describing an input image includes training a CNN by applying a reference image and an associated annotation sequence as input to the CNN. The associated annotation sequence indicates diagnosis of each respective reference image. The method further includes training an RNN by initializing the RNN with the trained CNN embedding of the reference image and a first word of an annotation sequence, thus producing a first RNN state vector. The trained CNN can be sampled by applying an input image as input to the CNN, thereby producing a CNN embedding of the input image. The trained RNN can then be initialized by the CNN image embedding as the state vector of the RNN. A context vector can be produced by "unrolling" the RNN with the trained CNN embedding initialization and a sequence of words of the annotation sequence and, by averaging (mean pooling) the state vectors of RNNs in each unrolled state. The produced context vector summarizes the input image as well as the associated text annotation.

[008] In some examples, training data sets can be improved by normalizing the ratio of normal to diseased images used to train the CNNs and/or RNNs. In some examples, diseased images are augmented with randomly- selected, cropped portions of the image before training the CNNs and/or RNNs a number of times.

[009] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Any trademarks used herein remain the property of their respective owners. The foregoing and other objects, features, and advantages of the disclosed subject matter will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[010] FTG. 1 is a block diagram outlining an example system for performing image analysis, as can be used in certain examples of the disclosed technology.

[011] FTG. 2 is a diagram illustrating the use of neural networks to produce a context vector, as can be used in certain examples of the disclosed technology.

[012] FIG. 3 illustrates an example convolutional neural network, as can be used in certain examples of the disclosed technology.

[013] FIG. 4 is a flowchart outlining an example method of producing a joint text/image context vector, as can be performed in certain examples of the disclosed technology.

[014] FIG. 5 illustrates an x-ray image and associated annotation text sequences, as can be analyzed using certain disclosed methods and apparatus. [015] FIG. 6 is a diagram of a long short-term memory RNN, as can be used in certain examples of the disclosed technology.

[016] FIG. 7 is a diagram illustrating an example of a gated recurrent unit RNN, as can be used in certain examples of the disclosed technology.

[017] FIGS. 8A and 8B are depictions of a joint image/text context vector, as can be disclosed in certain examples of the disclosed technology.

[018] FIG. 9 is a diagram illustrating an example computing environment in which certain examples of the disclosed technology can be implemented.

[019] FIGS. 10-18 illustrate examples of annotation generations generated using an example of the disclosed technology compared to annotations provided by a human radiologist.

DETAILED DESCRIPTION

I. General Considerations

[020] This disclosure is set forth in the context of representative embodiments that are not intended to be limiting in any way.

[021] As used in this application the singular forms "a," "an," and "the" include the plural forms unless the context clearly dictates otherwise. Additionally, the term "includes" means

"comprises." Further, the term "coupled" encompasses mechanical, electrical, magnetic, optical, as well as other practical ways of coupling or linking items together, and does not exclude the presence of intermediate elements between the coupled items. Furthermore, as used herein, the term "and/or" means any one item or combination of items in the phrase.

[022] The systems, methods, and apparatus described herein should not be construed as being limiting in any way. Instead, this disclosure is directed toward all novel and non-obvious features and aspects of the various disclosed embodiments, alone and in various combinations and subcombinations with one another. The disclosed systems, methods, and apparatus are not limited to any specific aspect or feature or combinations thereof, nor do the disclosed things and methods require that any one or more specific advantages be present or problems be solved. Furthermore, any features or aspects of the disclosed embodiments can be used in various combinations and subcombinations with one another. [023] Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed things and methods can be used in conjunction with other things and methods. Additionally, the description sometimes uses terms like "produce,"

"generate," "display," "receive," "train," "sample," "initialize," "embed," "execute," and "initiate" to describe the disclosed methods. These terms are high-level descriptions of the actual operations that are performed. The actual operations that correspond to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art.

[024] Theories of operation, scientific principles, or other theoretical descriptions presented herein in reference to the apparatus or methods of this disclosure have been provided for the purposes of better understanding and are not intended to be limiting in scope. The apparatus and methods in the appended claims are not limited to those apparatus and methods that function in the manner described by such theories of operation.

[025] Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable media (e.g. , computer-readable media, such as one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives)) and executed on a computer (e.g. , any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques, as well as any data created and used during implementation of the disclosed embodiments, can be stored on one or more computer-readable media (e.g. , computer-readable storage media). The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g. , with a general -purpose and/or graphics processors executing on any suitable commercially available computer) or in a network

environment (e.g. , via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers. [026] For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C, C++, Java, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well-known and need not be set forth in detail in this disclosure.

[027] Furthermore, any of the software-based embodiments (comprising, for example, computer- executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.

II. Introduction to the Disclosed Technology

[028] Comprehensive image understanding typically requires more than single object classification. Automatic generation of image captions to describe image contents can be expanded to provide a more complete image understanding than classifying an image to a single object class. Image caption generation performance can be substantially improved with the use of improved databases (e.g. , an ImageNet database) and neural networks (e.g. , deep convolutional neural networks (CNNs), effectively learning to recognize the images with a large pool of hierarchical representations. Recurrent neural networks (RNNs) can be adapted using deep CNN features to generate image captions.

[029] Beyond general image recognition, automatic recognition and localization of specific diseases and organs can be performed, for example, using datasets where target objects are explicitly annotated. In some examples of the disclosed technology, CNNs and RNNs can be used to automatically annotate chest x-rays with pathologies or diseases along with describing context(s) of a disease, for example, data indicating: location, severity, and/or affected organs. In some examples, a collection of radiology images and associated annotation stored in a picture archiving and communication system (PACS) system is used. In other examples, a publicly- available radiology dataset containing chest x-ray images and reports published on the Web as a part of the Openl open source literature and biomedical image collections can be used to supplement, or instead of, data stored in a proprietary PACS database.

[030] Data bias is an issue in medical image analysis. When considering the whole population, diseased cases are typically much rarer than healthy cases, for example, in chest x-ray datasets. In one example dataset, normal cases account for 37% (2,696 images) of the entire dataset (7,284 images), compared to the most frequent disease case "opacity," which accounts for 12% (840 images), and the next frequent, "cardiomegaly," accounting for 9% (655 images). In addition, many diseases frequently appear in conjunction with others. For instance, 79% of "opacity" instances occur with other diseases and 75% of "cardiomegaly" instances co-exist with other diseases.

[031] In order to circumvent the normal- vs -diseased cases bias, various regularization techniques can be applied to CNN training, such as data dropout and batch normalization. To train CNNs with image labels, or to assign labels to the images to train CNNs with, a pattern mining approach is used to assign a disease label to an image. Disclosed image caption generation methods are applied to annotate the rest of the content of a chest x-ray image, for example, disease location, size, severity, etc. This can be conducted using recurrent neural networks (RNNs) to annotate any possible additional diseases and describe their contexts, based on the convolutional neural network (CNN) image encodings (or embeddings).

[032] In some examples, CNN models are trained with one disease label per chest x-ray inferred from image annotations, for example, "calcified granuloma," or "cardiomegaly." However, such single disease labels do not fully account for the context of a disease. For instance, "calcified granuloma in right upper lobe" will be labeled the same as the "small calcified granuloma in left lung base" or "multiple calcified granuloma."

[033] The trained RNNs can be employed to obtain the context of annotations, and recurrently used to infer the image labels with contexts as attributes. The CNNs are re-trained with the obtained joint image/text contexts and used to generate annotations based on the new CNN features. For example, images with "calcified granuloma in right upper lobe" and "small calcified granuloma in left lung base" will be assigned different labels. The CNNs can be re-trained using the newly-assigned labels. With this recurrent feedback model, image/text contexts are taken into account for CNN training to generate improved, more accurate image annotations. III. Example System for Performing Image Analysis and Annotation

[034] FIG. 1 is a block diagram 100 that outlines an example computing system that can be used to perform image analysis in some examples of the disclosed technology. For example, the illustrated system can be used to perform image classification using a database of x-ray images that are each associated with text annotations that describe pathologies, or the lack of pathologies, exhibited in the respective image. For example, the image annotations can be encoded according a standardized collection of terms (e.g., MeSH (medical subject headings)) and formal regulations (e.g. according to a grammar describing how attributes are listed, similar to human languages). Human readers should be able to infer from such annotations and understand where a disease is in the image, how severe it is, which organ is affected, etc. Such annotations can be generated by a radiologist for each image or from an already existing radiology report using a system to summarize it to a collection of standardized terms with pre-defined regulations. Such annotations can be generated by a radiologist for each image. Thus, new sample images can be annotated automatically based on neural learning performed with an existing dataset of image/annotation pairs.

[035] As shown in FIG. 1, a plurality of images 110 are collected and stored in an image database 115. Each of the images includes an associated annotation sequence of a plurality of annotation sequences 120 describing each respective image. For example, the annotation sequence can include labels indicating a diagnosis, an organ, an indication of severity of the disorder, and/or a location of the disorder. The words forming the annotation sequence can be arranged according to a predetermined order. The annotation sequences are stored in a corresponding image annotation database 125. In some examples, the images 110 and annotations sequences 120 are stored in the same database.

[036] As shown in FIG. 1, a convolutional neural network (CNN) 130 is trained by applying images from the image database 115 and their corresponding image label, extracted from the annotation database 125. Examples of suitable CNNs 130 that can be used to implement the trained neural network include, but are not limited to: network-in-network (NIN), AlexNet, and GoogLeNet architectures. The CNN 130 can be trained using hundreds, thousands, hundreds of thousands, or more images, depending on availability of databases with suitable images and annotations. In some examples, image and annotation data is anonymized prior to training the CNN 130. [037] Once the CNN 130 has been trained, an input image 140 (e.g. , an x-ray image), of one or more organs, including one or more unknown pathologies is provided as input to the CNN 130 producing a CNN embedding of the input image CNN(T) 145. The output of the embedding CNN is applied to a recurrent neural network 150 to be trained using one or more images from the image database 115 and respective associated image annotations 125. Examples of suitable recurrent neural networks (RNNs) that can be used include long short-term memory (LSTM) and gated recurrent unit (GRU) RNNs. The RNN 150 is initialized by embedding the output of the initialized CNN(7) 145 as an updated state vector of the RNN and applying a first word of an annotation sequence, thus producing a new candidate state vector stored within the RNN. A context vector can be produced by unrolling the RNN with an updated trained CNN embedding a new candidate state vector and a subsequent word of the annotation sequence. In the example shown, N input words 155 are applied from the input annotation sequence, producing N output words and and the same number N of state vectors from the RNN 150 are provided to a mean-pooling circuit 160, where N represents the number of unrollings of the RNN network. The mean-pooling circuit averages the state vector values. In other examples, the number of iterations of applying words of an annotation sequence to the RNN 150 is different. The mean-pooling circuit 160 in turn collects values output from the state vectors from each iteration of the RNN 150. This produces an image/text context vector 170, which can encode the existence of plural pathologies, their affected organs, severities, locations, and other pathology context data in a single vector. The image/text context vector 170 can thus be used to provide generated annotations that describe one or more pathologies present in the input image 140.

[038] The image/text joint context vector 170 is used to re-label input images. Once the

"image/text joint context vector" is produced, a clustering technique (e.g. , £-means, or over- segmented £-means followed by Regularized Information Maximization (RIM)) is applied, and new labels are assigned to the images. The image/text/joint context vector can be applied to retrain the CNN and/or RNN in an iterative process. After a satisfactory number of iterations, the neural network training converges and can be used for annotation sequence generation, producing the output words 159 that label the sample image. An example of such relabeling is discussed further below with respect to FIGS. 8 A and 8B.

[039] As will be readily understood to one of ordinary skill in the art having the benefit of the present disclosure, in some examples, the RNN 150 includes a single circuit or module implementing the sampling evaluation that is updated for each applied annotation sequence. In other examples, two or more similar circuits or modules implementing the RNN 150 can be provided in a pipeline, thereby improving throughput of the evaluated image data and annotated sequence.

IV. Example System for Image Analysis Using CNNs and RNNs

[040] FIG. 2 is a block diagram 200 outlining a further detailed example of a system that can be used to perform image analysis according to certain examples of the disclosed technology. As shown, an input image / 210 is applied to a trained CNN 220 thereby producing an embedding CNN output CNN(7) 225. This neural network output 225 is used to initialize a current state vector of an RNN. The illustrated RNN is an LSTM, but in other examples, other types of RNNs can be used (e.g. , a GRU). As will be discussed further below, the current state vector h_t=i is applied to a memory cell that is updated in part to applying an input word (INwordi) with the output of the initialized The memory cell can be updated using, for example, a sigmoid function. The output of the updated memory content is stored in a memory as an updated candidate state

The output of the new candidate state is combined at least in part with the current state using a forget gate to update the CNN state of the current iteration of the RNN. The resulting output word (OUT_Word2 240) is provided as shown. Further, the updated current state of the RNN 250 is provided as input to a mean pooling circuit 260. The RNN is "unrolled" by iterating computation of the RNN for each annotation word in a given sequence, where their state- vectors are being updated for each iteration. The input annotation word is updated for each iteration (e.g. , input words 230, 231, and 239). In some examples, the RNN 250 is unrolled by iterating the same circuit or module, while in other examples, the RNN circuit or module can be duplicated thereby providing pipelined output. As shown, the updated state vectors are provided from each RNN 250, 251, and 259 to the mean pooling circuit 260. The mean pooling circuit 260 in turn averages the received state to produce a joint image/text context vector Hi_m:text 270. Each of the output words 240, 241, and 249 can be used to describe the input image 210.

V. Example Convolutional Neural Network

[041] FIG. 3 is a diagram 300 outlining an example convolutional neural network (CNN), as can be used in certain examples of the disclosed technology. In some examples, the illustrated neural network can be implemented using one or more general-purpose microprocessors. In other examples, the illustrated neural network can be implemented using acceleration provided by graphics process units (GPU), field programmable gate arrays (FPGA), or other suitable acceleration technology. The illustrated neural network 310 of FIG. 3 can be deemed a network-in- network (NIN) topology. In other examples of the disclosed technology, other neural network architectures can be employed, including AlexNet, GoogLeNet, or other suitable architectures.

[042] As shown, an input image 320, selected according to the disclosed technologies is input to the NIN neural network 310, which includes a number of multilayer perceptron (MLP) convolutional layers 330, 331, and 332, and a global average pooling layer or fully connected layer 340. Use of multilayer perceptrons is compatible with the structure of convolutional neural networks and can be trained using back-propagation. The multilayer perceptron can be a deep model itself. In the illustrated example, the calculation performed by a multilayer perceptron layer is shown as follows:

where η is the number of layers in the multilayer perceptron. A rectified linear unit is used as the activation function in the multilayer perceptron. From a cross-channel pooling point of view, this calculation is equivalent to a cascaded cross-channel parametric pooling on a normal convolution layer. Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectifier linear unit. The cross-channel pooled feature maps are cross channel pooled repeatedly in the next layers. This cascaded cross-channel parametric pooling structure allows for complex and learnable interactions of cross channel information.

VI. Example Method of Generating an

Output Annotation Sequence Describing an Input Image

[043] FIG. 4 is a flowchart 400 outlining an example method of generating an output annotation sequence describing an input image, as can be performed in certain examples in the disclosed technology. For example, the system described above regarding FIG. 1 can be used to perform the illustrated method, although other suitable systems can be adapted to perform the illustrated method.

[044] At process block 410, a CNN is trained with a plurality of images and associated annotations. For example, a plurality of x-ray images that have been annotated with an annotation sequence describing pathologies in the image can be used. In some examples, some of the image annotations include the annotation normal, indicating that no pathology is present.

[045] At process block 420, a recurrent neural network (RNN) is trained by initializing the RNN with a trained CNN embedding, and unrolled over one or more words of an annotation sequence. Memory or other state elements within the RNN are used to store state vectors providing memory between iterations, or un-rollings, of the RNN. Examples of suitable RNNs that can be employed include LSTMS and GRUs, but as can be readily understood to one of ordinary skill in the relevant art, other suitable RNNs can be used.

[046] At process block 430, the trained CNN is sampled by applying an input image as input to the CNN that was trained at process block 420. In some examples, the output layer of the CNN is used as the sample output, while in other examples, internal nodes, or internal and output nodes are used as the sampling output.

[047] At process block 440, the RNN that was trained at process block 420 is initialed by embedding the sampled output from the initialized CNN that was produced at process block 430. The output of the initialized CNN is used as an updated state vector and a first word of an annotation sequence is applied as input to the RNN. Gating logic within the RNN can be used to update memory elements to perform, producing an updated state vector h.

[048] At process block 450, a context vector is produced by un-rolling the RNN starting with the CNN image embedding as the initial state vector, for a subsequent word of the annotation sequence. For example, the RNN can be unrolled and the embedding values in the CNN can be averaged by a mean pooling circuit. The mean pooling circuit combines values from the progressively changing state vectors of RNN (starting from the CNN image embedding as the initialization) to produce the context vector (hm-.τΕχτ)- In some examples, the mean pooling circuit combines the values from the state vectors by computing the mean of all the state vectors.

[049] At process block 460, the image/text joint context vectors generated at process block 460 are used to prepare data for retraining the CNN and/or the RNN. In one example, clustering is applied to the vectors and new labels are assigned to the images based on the clustering. The method then proceeds to process block 410 and 420 to re-train the CNN with the newly assigned image labels, and retrain the RNN with the new CNN embedding, respectively. This iterative process can repeat a number of times, including performance of the clustering and image labeling at process block 460. Once it is determined that the context vectors have converged, or are otherwise suitable for generating annotation sequences describing the input image, the method proceeds to process block 470.

[050] At process block 470, an output annotation sequence is generated describing the input image using the context vector that was produced at process block 450. For example, the output annotation sequence can include a description of a pathology, the affected organ, a level of severity, and a location, in which the described pathology is located within the input image.

[051] It should be noted that the process blocks can be re-executed using different input annotation sequences in order to determine different context vectors. Thus, the produced context vectors can describe more than one pathology for the same input image.

VII. Example Input Image and Associate Annotation

[052] FIG. 5 is a diagram 500 illustrating an example of an input image 510 and associated inputs 520, including a report and annotations that have been generated by a trained radiologist that can be obtained from the Openl collection. The pathologies of the input image 510 are described using two different annotation sequences: the first being "pulmonary atelectasis / lingula / focal," and the second being "calcinosis / lung / hilum / right." The annotation sequence is encoded using a standard MeSH format.

VIII. Example Long Short-Term Memory RNN

[053] FIG. 6 is a diagram 600 illustrating a simplified representation of an RNN suitable for use with certain examples of the disclosed technology. In particular, the RNN is a long short-term memory. The LSTM unit maintains a memory that is changed over time. The output, or activation of the LSTM unit, can be computed as a function of the stored memory values. For example, each output element H can be computed by applying an output gate 610 that modulates the amount that the memory content is exposed on the output. An intermediate function, for example a sigmoid or hyperbolic tangent, can be applied to values stored in the corresponding memory cell. The memory cell can then be updated at a next time unit by partially forgetting the existing memory value and adding new memory content through the input. The extent to which the existing memory is forgotten can be modulated by a forget gate 620 and the degree to which the new content is added to the memory cell can be modulated by an input gate 630. Gates can be computed using a matrix function. Any form of suitable memory (e.g. , latches, flip-flops, registers, addressable memories implemented with dynamic RAM (included embedded DRAM), static RAM, memristors) can be used to store data for the state vector h 640 and the new state vector h 650. A general-purpose processor, a co-processor (e.g., a GPU or neural network chip), an FPGA, or a system-on-chip (SoC) including such memory or coupled to such a memory can be adapted to provide the illustrated LSTM RNN. The LSTM unit is able to decide whether to keep the existing memory values via the introduced gates. Thus, if the LSTM unit detects an important feature from an input sequence from an early stage, it can easily carry this information (e.g. , the existence of the feature itself) for a long distance, hence capturing potential long distance dependencies.

[054] The LSTM unit can be used in several different applications, including speech recognition, sequence generation, machine translation, and image caption generation. In some examples of the disclosed technology, the operation of the LSTM unit can be described by the following equations:

where is the input gate,

the forget gate, ot the output gate, h_t the state vector (memory), the

new state vector (new memory), and m_t the output vector. W is a matrix of trained parameters (weights), and σ is the logistic sigmoid function. Θ represents the product of a vector with a gate value.

IX. Example Gated Recurrent Unit (GRU) RNN

[055] FIG. 7 is a diagram 700 outlining an example of a different time of RNN, a gated recurrent unit (GRU). The GRU allows each recurrent unit of the RNN to adaptively capture dependencies of different time scales. The GRU is similar to the LSTM unit in that there are gating units (e.g. , a reset gate 710 and an update gate 720 used to modulate the flow of information inside the unit. However, the GRU differs from the LSTM in that it does not have separate memory cells besides the current state h 730 and the candidate state h 740. Any form of suitable memory (e.g. , latches, flip-flops, registers, addressable memories implemented with dynamic RAM (included embedded DRAM), static RAM, memristors) can be used to store data for the current and candidate state, general-purpose processor, a co-processor (e.g., a GPU or neural network chip), an FPGA, or a system-on-chip (SoC) including such memory or coupled to such a memory can be adapted to provide the illustrated GRU RNN. Both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) implementations address the vanishing gradient problem exhibits in certain recurrent neural networks (RNNs). [056] Thus, the described procedure of taking a linear sum between an existing state and a newly computed state is similar in some fashion to the LSTM unit. The GRU RNN, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the state the whole time that the RNN is evaluated.

[057] In some examples of the disclosed technology, operation of a GRU can be described by the following equations:

where z_t is the update gate, r_t the reset gate, h_t the new state vector, and h_t the final state vector.

[058] It should be noted that these types of RNNs include an additive component when updating from time t to time i+1 that is not found in certain other types of RNNs. Thus, these types of RNNs keep existing content and add/combine new content with the existing content. This allows for each RNN unit to remember the existence of a specific feature in the input stream for a long series of steps. Thus, certain important features, as determined by a forget gate or an update gate, will not be overwritten. Further, this addition allows for the creation of shortcut paths that can bypass multiple temporal steps. This can allow for errors to be back propagated easily without vanishing too quickly as a result of passing through multiple bounded non-linearities, thereby reducing difficulties cause by multiple gradients.

[059] FIG. 8A depicts 800 a joint image/context vector describing a calcified granuloma. FIG. 8B depicts 810 a joint image/context vector describing opacity. Dimension reduction from a 1,024 dimensional domain to a two-dimensional domain can be performed using t-SNE to visualize the 1 ,024 dimensional vectors on the two-dimensional space.

[060] Each of the word sequences represent annotation pairing with an image. All of the cases in FIG. 8A were previously labeled as "calcified granuloma" when training the CNN (in the first phase). After a first-stage RNN training and generating the joint image/text context vector, a clustering technique is applied to these vectors such that images annotated with "multiple calcified granuloma in the lower lobe of lung (top)" and "small calcified granuloma in the right upper lobe" are given different labels when training CNN. Thus, the CNN learns to distinguish these differences of disease appearances (contexts) in the second phase of CNN training. The RNN is trained again using this CNN embedding trained from the second phase, and improved annotations will be generated when given a new image. Thus, annotations describing the same disease can be divided into different labels based on their joint image/text context.

X. Example Computing Environment

[061] FIG. 9 illustrates a generalized example of a suitable computing environment 900 in which described embodiments, techniques, and technologies, including image analysis using CNNs and RNNs, can be implemented. For example, the computing environment 900 can implement disclosed techniques for analyzing images by repeatedly applying a sequence of input words using an RNN, as described herein.

[062] The computing environment 900 is not intended to suggest any limitation as to scope of use or functionality of the technology, as the technology may be implemented in diverse general- purpose or special-purpose computing environments. For example, the disclosed technology may be implemented with other computer system configurations, including hand held devices, multiprocessor systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The disclosed technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[063] With reference to FIG. 9, the computing environment 900 includes at least one processing unit 910 and memory 920. In FIG. 9, this most basic configuration 930 is included within a dashed line. The processing unit 910 executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer- executable instructions to increase processing power and as such, multiple processors can be running simultaneously. The memory 920 may be volatile memory (e.g. , registers, cache, RAM), non-volatile memory (e.g. , ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory 920 stores software 980, images, and video that can, for example, implement the technologies described herein. A computing environment may have additional features. For example, one or more co-processing units 915 or accelerators, including graphics processing units (GPUs), can be used to accelerate certain functions, including implementation of CNNs and RNNs. The computing environment 900 may also include storage 940, one or more input device(s) 950, one or more output device(s) 960, and one or more communication connection(s) 970. An interconnection mechanism (not shown) such as a bus, a controller, or a network, interconnects the components of the computing environment 900. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 900, and coordinates activities of the components of the computing environment 900.

[064] The storage 940 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and that can be accessed within the computing environment 900. The storage 940 stores instructions for the software 980, image data, and annotation data, which can be used to implement technologies described herein.

[065] The input device(s) 950 may be a touch input device, such as a keyboard, keypad, mouse, touch screen display, pen, or trackball, a voice input device, a scanning device, or another device, that provides input to the computing environment 900. For audio, the input device(s) 950 may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment 900. The output device(s) 960 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 900.

[066] The communication connection(s) 970 enable communication over a communication medium (e.g. , a connecting network) to another computing entity. The communication medium conveys information such as computer-executable instructions, compressed graphics information, video, or other data in a modulated data signal. The communication connection(s) 970 are not limited to wired connections (e.g. , megabit or gigabit Ethernet, Infiniband, Fibre Channel over electrical or fiber optic connections) but also include wireless technologies (e.g. , RF connections via Bluetooth, WiFi (IEEE 802.11a/b/n), WiMax, cellular, satellite, laser, infrared) and other suitable communication connections for providing a network connection for the disclosed methods. In a virtual host environment, the communication(s) connections can be a virtualized network connection provided by the virtual host.

[067] Some embodiments of the disclosed methods can be performed using computer-executable instructions implementing all or a portion of the disclosed technology in a computing cloud 990. For example, disclosed compilers and/or processor servers are located in the computing

environment, or the disclosed compilers can be executed on servers located in the computing cloud 990. In some examples, the disclosed compilers execute on traditional central processing units (e.g. , RISC or CISC processors). [068] Computer-readable media are any available media that can be accessed within a computing environment 900. By way of example, and not limitation, with the computing environment 900, computer-readable media include memory 920 and/or storage 940. As should be readily understood, the term computer-readable storage media includes the media for data storage such as memory 920 and storage 940, and not transmission media such as modulated data signals.

XI. Example Image Analysis Results

[069] Example image analysis results (e.g. , text annotations describing a medical image) are disclosed in the section, as can be performed in certain examples of the disclosed technology. As will be readily apparent to one of ordinary skill in the relevant art, the example systems, methods, and neural networks described above regarding FIGS. 1-4 and 9 can be adapted to provide the disclosed analysis results. Further, the technologies described above can be modified to suit particular datasets, computing environments, and performance requirements.

A. Dataset

[070] In some examples, a publicly available radiology dataset of chest x-rays and reports is used that is a subset of the Openl open source literature and biomedical image collections. An example of this radiology database contains 3,955 radiology reports from the Indiana Network for Patient Care, and 7,470 associated chest x-rays from the hospitals' picture archiving systems. The dataset is fully anonymized via an aggressive anonymization scheme, which achieved 90% precision in de- identification. However, a few findings are rendered uninterpretable. An example case of the dataset is shown in FIG. 5.

[071] The data in the reports can be structured as comparison, indication, findings, and impression sections, in line with a common radiology reporting format for diagnostic chest x-rays. In the example shown in FIG. 5, there is an error resulting from an aggressive automated de-identification scheme. A word possibly indicating a disease was falsely detected as a personal information, and was thereby "anonymized" as "XXXX." While radiology reports contain comprehensive information about the image and the patient, they may also contain information that cannot be inferred from the image content. For instance, in the example shown in FIG. 5, it is probably impossible to determine that the image is of a Burmese male.

[072] On the other hand, a manual annotation of MEDLINE® citations with controlled vocabulary terms (Medical Subject Headings (MeSH®) is known to significantly improve the quality of the image retrieval results. MeSH terms for each radiology report in the Openl chest x- ray data are annotated and used to train neural networks.

[073] Nonetheless, it is not effective to assign a single image label based on MeSH and train a CNN to reproduce them, because MeSH terms seldom appear individually when describing an image. The twenty most frequent MeSH terms appear with other terms in more than 60% of the cases. Normal cases (term "normal") on the contrary, do not have any overlap, and account for 37% of the entire example dataset. The thirteen most frequent MeSH terms appearing more than 180 times are provided below in Table 1, along with the total number of cases in which they appear, the number of cases they overlap with in an image, and the overlap percentages. The x-ray images are provided in Portable Network Graphics (PNG) format, with sizes varying from 512x420 to 512x624. In the described example, CNN input training and testing images are scaled to a size of 256x256.

B. Disease Label Mining

[074] The CNN-RNN based image caption generation approaches uses a well-trained CNN to encode input images effectively. Unlike natural images that can simply be encoded by ImageNet- trained CNNs, medical images such as chest x-rays differ significantly from natural ImageNet images. A number of frequent annotation patterns are sampled with less overlaps for each image, in order to assign image labels to each chest x-ray image and train with cross-entropy criteria.

[075] In the example dataset, seventeen unique patterns of MeSH term combinations appeared in 30 or more cases. The dataset is split into training/validation/testing cases as 80%/10%/10% and at least 9 cases each are placed in the validation and testing sets. These include the terms shown in Table 1, as well as scoliosis, osteophyte, spondylosis, fractures/bone. MeSH terms appearing frequently but without unique appearance patterns include pulmonary atelectasis, aorta/tortuous, pleural effusion, cicatrix, etc. They often appear with other disease terms (e.g. consolidation, airspace disease, atherosclerosis). About 40% of the full dataset with this disease image label mining is retained, where the annotations for the remaining 60% of images are more complex (thereby making it difficult to assign a single disease label).

As shown in Table 1, the thirteen most frequent MeSH terms appear over 180 times, and the table further includes the number of the terms mentioned with other terms (overlap) in an image and their associated percentages.

C. Image Classification with CNN

[076] The aforementioned seventeen unique disease annotation patterns (in Table 1, and scoliosis, osteophyte, spondylosis, fractures/bone) are used to label the images and train CNNs. The adaptability of ImageNet-trained CNN features, as adopted using various regularization techniques to deal with the normal- vs-diseased cases bias is illustrated. For the default CNN model, the simple yet effective Network-In-Network (NIN) model can be used, as the model is small in size, fast to train, and achieves similar or better performance to other neural models (e.g. , the AlexNet model). Results are compared to a more complex CNN model, the GoogLeNet neural network.

[077] From the 17 chosen disease annotation patterns, normal cases account for 71% of all images, well above the numbers of cases for the remaining 16 disease annotation patterns. The number of samples is balanced for each case by augmenting the training images of the smaller cases by randomly cropping 224x224 size images from the original 256x256 size image.

D. Adaptability of ImageNet-Trained CNN Features

[078] While chest x-rays do not seem apparently related to the images from ImageNet, rich features learned from the large scale ImageNet dataset can help encode chest x-ray images. In order to validate the adaptability of the ImageNet-trained CNN features to chest x-rays is tested by fine-tuning the ImageNet-trained CNN to chest x-ray classification. For fine-tuning, the ImageNet- trained CNN weights of all the layers are re-used, except the last layer for classification, and trained with a learning rate one-tenth lower than the default learning rate.

[079] An example of results from training and validating accuracies of a NIN model, initializing its weights with normal random distribution, or fine-tuning from ImageNet-trained CNNs, are provided below in Table 2. We experiment with default learning rates ranging from 9^"3 to 9^"1 for fine-tuning, and from 9^~2 to 9° for training from random initialization. While the fine-tuned model could not exceed 15% training accuracy, random initialization yielded over 99% training accuracy. From these results, one can conclude that the CNN features specifically trained for chest x-rays are better suited than re-using the ImageNet-trained CNN features for the images of chest x-ray images, which have specific characteristics.

Table 2 provides data on training and validation accuracy of NIN model fine-tuned from

ImageNet-trained CNN and trained from random initialization.

E. Regularization by Batch Normalization and Data Dropout

[080] Even when the dataset is balanced by augmenting many diseased samples, it is difficult for a CNN to learn a good model to distinguish many diseased cases from normal cases, which have many variations on their original samples. Normalizing the set of images and annotations used to train the neural networks by applying mini-batch statistics during training can serve as an effective regularization technique to improve the performance of a CNN model. By normalizing via mini- batch statistics, the training network was shown not to produce deterministic values for a given training example, thereby regularizing the model to generalize better.

[081] Training deep neural networks can be complicated because the distribution of each layer's inputs changes during training. One way to address this is to slow down the training by requiring lower learning rates and careful parameter initialization. However, such an approach is less successful in models that have saturating nonlinearities— internal covariate shift. This issue can be addressed by normalizing layer inputs. In the disclosed example, normalization is a part of the model architecture, and is performed for each mini-batch. Such batch normalization allows the use of much higher learning rates and reduces the importance of selecting initialization values. In some examples, batch normalization can eliminate the use of dropout.

[082] Image analysis can be further improved by utilizing a data dropout technique. Data dropout and batch normalization can each be performed during training. In one example of data dropout, each image of a diseased case is augmented at least four times, and normal cases are randomly picked to match four times (due to the augmentation) the number of total diseased cases in a mini- batch. When training a CNN only for one epoch, it is likely that not all normal cases are seen because they are randomly picked during training. However, all the normal cases are likely to be seen (accounted for training) when training the CNN for many epochs (the chances are higher most of them are picked for when forming a mini-batch, when the training procedure is iterated a relatively large number of times).

[083] Both regularization techniques were tested to assess their effectiveness on our dataset. The training and validation accuracies for an example of the NIN model with batch-normalization, data- dropout, and both are provided below in Table 3. While batch- normalization and data-dropout alone do not significantly improve performance, combining both increases the validation accuracy by about 2%.

Table 3

Table 3 includes data on the training and validation accuracy of a NIN model with batch- normalization, data-dropout, and both batch-normalization and data-dropout. Diseased cases are very limited compared to normal cases, leading to overfitting, even with regularizations.

F. Effect of Model Complexity

[084] We also validate whether the dataset can benefit from a more complex neural network, GoogLeNet. For the described results, both batch-normalization and data-dropout are applied to: increase learning rate, remove dropout, and remove local response normalization. The

experimental training and validation accuracies using GoogLeNet model are provided below in Table 4, indicating a higher (~4%) accuracy result. We also observe a further ~3% increase in accuracy when the images are no longer cropped, but merely duplicated to balance the dataset. This may imply that, unlike the ImageNet images, the chest x-ray images contain some useful information near their boundary.

Table 4 provides training and validation accuracy of GoogLeNet model with batch- normalization, data-dropout, and without cropping the images for data augmentation.

XII. Example Annotation Generation using RNNs

[085] This section discloses results from using recurrent neural networks (RNNs) to learn the annotation sequence given input image CNN embeddings. As will be readily apparent to one of ordinary skill in the relevant art, the example systems, methods, and neural networks described above regarding FIGS. 1-4 and 9 can be adapted to provide the disclosed analysis results. Further, the technologies described above can be modified to suit particular datasets, computing

environments, and performance requirements.

A. Example RNN Training Method

[086] In the example annotation dataset, the number of standardized terms (in this example, MeSH terms) describing diseases ranges from 1 to 8 (except normal, which is one word), with a mean of 2.56 and standard deviation of 1.36. The majority of descriptions contain up to five words. Since only nine cases have images with descriptions longer than six words, these cases are ignored by constraining the RNNs to unroll up to five time steps. Annotations with less than five words are zero-padded with the end-of-sentence token used to fill in the five-word space.

[087] For the disclosed example, both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) implementations of RNNs. As discussed further above, simplified illustrations of LSTM and GRU RNNs used are shown in FIGS. 6 and 7, respectively. The parameters of the gates in LSTM and GRU decide whether to update their current state h to the new candidate state h, where these states are learned from the previous input sequences (for example, a previous iteration of evaluating the RNN, or from an initialized CNN). The initial state of RNNs is set as the CNN image embedding (CNN(/)), and the first annotation word /Nwordi as the initial input. See FIG. 2. The output of the RNNs are the following annotation word sequences, and the RNNs are trained by minimizing the negative log likelihood of output sequences and true sequences:

where y_t is the output word of RNN in time step t, st the correct word, CNN(T) is the CNN embedding of input image /, and N the number of words are in the annotation (for this example, N = 5 with any applicable end-of-sequence zero-padding). Equation 2 is not a true conditional probability (because the RNNs' state vector is initialized to be CNN(/)), but is a convenient way to describe the training procedure.

[088] The NIN and GoogLeNet models replace the fully-connected layers with average-pooling layers. The output of the last spatial average-pooling layer is used as the image embedding to initialize the RNN state vectors. The size of RNN state vectors used in to generate the example results are which is identical to the output size of the average-pooling layers from NIN and

GoogLeNet.

B. Sampling

[089] In sampling the trained RNN, the RNN state vectors are initialized with the CNN image embedding The CNN prediction of the input image is used as the first word as the

input to the RNN, to sample following sequences up to five words. In some examples, images are normalized by the batch statistics before being fed to the CNN.

C. Evaluation

[090] The annotation generation was evaluated using a calculated bilingual evaluation understudy (BLEU) score averaged over all of the images and their annotations in the training, validation, and test set. The BLEU scores evaluated are provided below in Table 5. The BLEU-N scores are evaluated for cases with > N words in the annotations, using the implementation of [4]. For the example dataset, the LSTM RNN was easier to train, while the example GRU RNN model yields better results with more carefully selected hyper-parameters. Thus, while it is difficult to conclude which model is better, the GRU model seems to achieve higher scores on average.

Table 5 provides BLEU scores validated on the training, validation, test set, using LSTM and GRU RNN models for the sequence generation.

XIII. Example Results using Recurrent Feedback Model for

Image Labeling with Joint Image/Text Context

[091] For the example results provided in this section, the CNN models are trained with disease labels only where the context of diseases are not considered. For instance, the same calcified granuloma label is assigned to all image cases that actually may describe the disease differently in a finer semantic level, such as "calcified granuloma in right upper lobe," "small calcified granuloma in left lung base," and "multiple calcified granuloma."

[092] Meanwhile, the RNNs encode the text annotation sequences given the CNN embedding of the image the annotation is describing. The already-trained CNN and RNN are used to infer better image labels, integrating the contexts of the image annotations beyond just the name of the disease. This is achieved by generating joint image/text context vectors that are computed by applying mean-pooling on the state vectors (h) of RNN at each iteration over the annotation sequence. It should be noted that the state vector of RNN is initialized with the CNN image embeddings (CNN(/)), and the RNN is unrolled over the annotation sequence, taking each word of the annotation as input. The procedure used is discussed above regarding FIG. 2, and the RNNs share the same parameters.

[093] The obtained joint image/text context vector encodes the image context as well as

the text context describing the image. Using a notation similar to Equation 1, the joint image/text context vector can be written as:

where x_t is the input word in the annotation sequence with N words. Different annotations describing a disease are thereby separated into different categories by the as shown in FIGS.

8 A and 8B. In FIGS. 8 A and 8B, the vectors of about fifty annotations describing calcified

granuloma and opacity are projected onto two-dimensional planes via dimensionality reduction using a t-distributed stochastic neighbor embedding (t-SNE) implementation. For

this example, the GRU implementation of the RNN is used because it showed better overall BLEU scores in Table 5. [094] From the generated for each of the image/annotation pair in the training and

validation sets, new image labels are obtained by taking disease context into account. In addition, the disease annotation is not limited to mostly describing a single disease. Thus, the joint image/text context vector

summarizes both the image's context and word sequence, so that annotations such as "calcified granuloma in right upper lobe," "small calcified granuloma in left lung base," and "multiple calcified granuloma" have different vectors based on their contexts.

[095] Additionally, the disease labels used with unique annotation patterns can now have more cases, as cases with a disease described by different annotation words are no longer filtered out. For example, calcified granuloma previously had only 139 cases because cases with multiple diseases mentioned or with long description sequences were filtered out. Using the joint image/text context vector cases are associated with calcified granuloma. Likewise, opacity now

has 207 cases, as opposed to the previous 65. The average number of cases of all first-mentioned disease labels has is 83.89, with a standard deviation of 86.07, a maximum of 414 (calcified granuloma), and a minimum of 18 (emphysema).

[096] For a disease label having more than 170 cases (n > 170 = (average+standard deviation)), the cases are divided into sub-groups of more than 50 cases by applying k-means clustering to the

with k = Round(n/50). The CNN is trained once more with the additional labels (57, compared to 17 used above), training the RNN with the new CNN image embedding, and finally generate image annotations. The new RNN training cost function (compared to Equation 2) can be expressed as:

where denotes the joint image/text context vector obtained from the first round (with

limited cases and image labels at 0th iteration) of CNN and RNN training. In the second CNN training round (first iteration), the RNN is fine-tuned from the previous CNNit_er=o, by replacing the last classification layer with the new set of labels (17→ 57) and training it with a lower learning rate (0.1), except for the classification layer. A. Evaluation

[097] The final evaluated BLEU scores are provided below in Table 6. As shown, using the joint image/text context, better overall BLEU scores are achieved than those in Table 5. Also, slightly better BLEU scores are obtained using GRU on average, although overall better BLEU-1 scores are acquired using LSTM. Examples of generated annotations on the chest x-ray images are shown in a number of images 1000 in FIG. 10. Each of the images includes a photograph of an input image (e.g. , input image 1010), text annotation for the "true annotation" generated by a radiologist (e.g. , true annotation 1012), and text annotation generated according to the disclosed technology (e.g. , generated annotation 1014), positioned above the true annotation. These annotations were generated using the joint image/context vector and a GRU model. FIGS. 11-18 illustrate additional examples of annotations generated for x-ray images (1100, 1200, 1300, 1400, 1500, 1600, 1700, and 1800) according to the disclosed techniques, including the use of a joint image/context vector.

[098] Table 6 provides BLEU scores validated on the training, validation, test set, using LSTM and GRU RNN models trained on the first iteration for the sequence generation.

[099] Thus, effective frameworks to learn, detect disease, and describe their contexts from the patient chest x-rays and their accompanying radiology reports with Medical Subject Headings (MeSH) annotations are disclosed. Furthermore, an approach is disclosed to mine joint contexts from a collection of images and their accompanying text, by summarizing the CNN/RNN outputs and their states on each of the image/text instances. Higher performance on text generation can be achieved on the test set if the joint image/text contexts are used to re-label the images and to train the proposed CNN/RNN framework subsequently.

[0100] While the examples discussed herein are based on a medical dataset, the suggested approaches could also be applied to other application scenarios with datasets containing coexisting pairs of images and text annotations, where, for example, the domain- specific images differ from those of the ImageNet. B. Additional Examples of Annotation Generation

[0101] More annotation generation examples are provided in FIGS. 15-18. Overall, the system generates promising results on predicting disease (labels) and its context (attributes) in the images. However, rare disease cases are more difficult to detect. For example, the cases pulmonary atelectasis, spondylosis, and density (FIGS. 15 and 16), as well as foreign bodies, atherosclerosis, costophrenic angle, and deformity (FIGS. 17 and 18) are much rarer in the data than calcified granuloma, cardiomegaly, and all the frequent cases listed above in Table 1.

[0102] Furthermore, it should be noted that the (left or right) location of the disease cannot be identified in a lateral view (obtained by scanning the patient from the side), as shown in FIGS. 17 and 18. Since the example dataset used to generate these figures contains a limited number of disease cases, each x-ray image and report is treated as a distinct sample, and different views of the same patient/condition are not taken into account in these exemplary results.

[0103] As will be readily understood to one of ordinary skill in the relevant art having the benefit of the present disclosure, prediction accuracy can be improved by both (a) accounting for different views of the same patient/condition, and (b) collecting a larger dataset to better account for rare diseases.

[0104] In view of the many possible embodiments to which the principles of the disclosed subject matter may be applied, it should be recognized that the illustrated embodiments are only preferred examples and should not be taken as limiting the scope of the claims to those preferred examples. Rather, the scope of the claimed subject matter is defined by the following claims. We therefore claim as our invention all that comes within the scope of these claims and their equivalents.

Claims

We claim:

1. A method of generating an annotation sequence describing an input image, the method comprising:

training a convolutional neural network (CNN) by applying reference images and using associated annotation sequences to train the CNN, the associated annotation sequences indicating a diagnosis for a respective one of the reference images;

training a recurrent neural network (RNN) by initializing the RNN with the trained CNN embedding of a reference image and a first word of an annotation sequence, producing a first RNN state vector;

initializing the RNN by the CNN image embedding;

training the initialized RNN with a subsequent word of the annotation sequence; and with the trained CNN and RNN, generating an output annotation sequence describing the input image.

2. The method of claim 1, further comprising:

using the trained CNN and trained RNN, generating joint image/text context vectors and text annotations summarizing context of the input image by averaging RNN state vectors rolled over each word of the associated annotation sequence.

3. The method of claim 2, further comprising relabeling the reference images by clustering the joint image/text context vector.

4. The method of claim 3, wherein the relabeling improves accuracy with which the CNN distinguishes cases with a same disease case and different contexts.

5. The method of claim 3, further comprising retraining the trained CNN and/or the trained RNN with newly-assigned image labels generated using the joint image/context vectors.

6. The method of claim 5, wherein the retrained CNN and/or retrained RNN exhibit improved accuracy of the output annotation sequence.

7. The method of claim 1, further comprising sampling the trained CNN by applying the reference image as input to the CNN.

8. The method of claim 1, wherein the training the initialized RNN comprises applying a first word of the annotation sequence, producing a new candidate state vector.

9. The method of claim 1, further comprising:

producing a context vector by re-initializing the RNN with: the trained CNN embedding the new candidate state vector and a subsequent word of the annotation sequence; and

wherein the generating an output annotation sequence comprises using the produced context vector.

10. The method of claim 1, wherein the input image and the reference image are x-ray images of a mammal.

11. The method of claim 1 , wherein the associated annotation sequence includes data indicating at least one or more of the following: a disease, a severity level of a disease, an affected organ, or a location.

12. The method of claim 1, wherein the associated annotation sequence is encoded according to a standardized set of terminologies.

13. The method of claim 12, wherein the standardized set of terminologies is a Medical Subject Headings (MeSH) vocabulary.

14. The method of claim 1, further comprising diagnosing a pathology for the input image based on the output annotation sequence.

15. The method of claim 1, further comprising diagnosing two or more pathologies for the input image based on the output annotation sequence.

16. The method of claim 1, further comprising diagnosing the input image as normal.

17. The method of claim 1, further comprising treating a patient associated with the input image based on a diagnosis generated with the output annotation sequence.

18. The method of claim 1, further comprising producing a database comprising a plurality of reference images, each of the plurality of images being associated with one or more annotation sequences describing an aspect of a pathology present in the respective reference image, wherein the training the CNN is performed by applying at least a portion of the reference images as input to the CNN.

19. The method of claim 18, further comprising selecting the at least a portion of the reference images by adjusting the relative number of normal images to diseased images applied for training the CNN.

20. The method of claim 18, wherein at least one of the associated annotation sequences describes aspects of two or more pathologies present in the respective reference image.

21. The method of claim 18, wherein each of a plurality of diseased images of the plurality of images is augmented two or more times with a randomly cropped version of the respective diseased image.

22. The method of claim 18, further comprising randomly or pseudo-randomly selecting one or more normal cases from the plurality of reference images to augment the database.

23. The method of claim 1, wherein the RNN comprises a Long Short-Term Memory (LSTM).

24. The method of claim 23, wherein the LSTM operates according to the following equations:

wherein i_t is the input gate,/; the forget gate, o_t the output gate, h_t the state vector (memory), h_t the new state vector (new memory), m_t the output vector, W is a matrix of trained parameters (weights), and σ is the logistic sigmoid function, and Θ represents the product of a vector with a gate value.

25. The method of claim 1, wherein the RNN comprises a Gated Recurrent Unit (GRU).

26. The method of claim 25, wherein the GRU operates according to the following equations:

wherein it is the update gate, the reset gate, the new state vector, and the final state

vector.

27. The method of claim 1, wherein the CNN and the RNN are implemented with a general-purpose microprocessor coupled to memory.

28. The method of claim 1, wherein the CNN and the RNN are implemented with a graphics processing unit (GPU).

29. One or more computer-readable storage media storing computer-readable instructions that when executed by a processor, cause the processor to perform the method of any one of claims 1-28.

30. A system, comprising:

memory;

one or more general purpose processors and/or graphics processing unit processors; and one or more computer-readable storage media storing computer-readable instructions that when executed by the processors, cause the processors to perform any one of the methods of claims

1-28.

31. A system, comprising:

a convolutional neural network (CNN) trained with images and respective associated image annotations;

a recurrent neural network (RNN) embedding the convolutional neural network, the RNN being configured to be trained to receive a sequence of input words and update memory elements of the RNN responsive to the RNN and output of the CNN; and

a mean pooling module configured to collect state vector data from the RNNs to generate context vectors summarizing contexts of the images and their respective associated image annotations; a clustering module configured to group the generated context vectors and assign improved labels to the images; and

a neural network architecture configured to generate an output annotation sequence responsive to an input image being sampled with the CNN.

32. The system of claim 31, wherein the CNN, the RNN, and the mean pooling module are implemented with a general-purpose processor coupled to memory.

33. The system of claim 31, wherein at least one of the CNN, the RNN, and the mean pooling module are implemented with a graphics processing unit.

34. The system of claim 31, wherein RNN comprises a Long Short-Term Memory (LSTM) configured to operate according to the following equations:

35. The system of claim 31, wherein the RNN comprises a Gated Recurrent Unit (GRU) configured to operate according to the following equations:

where it is the update gate, r_t the reset gate, h_t the new state vector, and h_t the final state vector.