WO2021062133A1 - Unsupervised and weakly-supervised anomaly detection and localization in images - Google Patents

Info

Publication number
WO2021062133A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
network
input image
anomalous
loss
Prior art date
Application number
PCT/US2020/052686
Other languages
French (fr)
Inventor
Shashanka VENKATARAMANAN
Rajat Vikram SINGH
Kuan-Chuan Peng
Original Assignee
Siemens Gas And Power Gmbh & Co. Kg
Siemens Corporation
Priority date
Filing date
Publication date
Application filed by Siemens Gas And Power Gmbh & Co. Kg, Siemens Corporation
Publication of WO2021062133A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/77 Determining position or orientation of objects or cameras using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20224 Image subtraction

Definitions

  • This application relates to machine learning applied to image processing. More particularly, this application relates to unsupervised and weakly-supervised anomaly detection and localization in images.
  • An anomaly is defined as any event or occurrence which does not follow expected or normal behavior. Defining the concept of the anomaly in the context of images can be very challenging and is critical to the success and effectiveness of an anomaly detector. An efficient anomaly detector should be capable of differentiating between anomalous and normal instances with high precision to avoid false alarms. Extending this further, localization of the anomaly (e.g., attention mapping) in an image is useful to reduce human effort. Anomaly localization has been applied in industrial inspection settings to segment defective product parts, in surveillance to locate intruders, in medical imaging to segment tumors in brain MRI or glaucoma in retina images, etc. There has been an increase in research on segmenting potentially anomalous regions in images.
  • A disclosed method trains a deep neural network on non-anomalous images, which encourages the latent space of the network to learn the distribution of non-anomalous images.
  • In the unsupervised setting, training does not use image-level labels, and instead uses activation maps obtained from the latent space to produce an attention map that localizes the anomaly in the image.
  • In the weakly supervised setting, image-level labels are used to train the deep learning network, and predictions by a classifier at the output of the latent space are used to compute an attention map. Since the precision of the attention map depends on classifier performance, the attention map is based on gradients for the images correctly predicted by the classifier. From this, the deep learning network localizes the anomaly with better accuracy.
  • FIG. 1 shows an example of a pipeline for unsupervised anomaly detection and localization in accordance with embodiments of this disclosure.
  • FIG. 2 shows an example of a single layer of a residual decoder in accordance with embodiments of this disclosure.
  • FIG. 3 shows an example of a pipeline for weakly supervised anomaly detection and localization according to embodiments of this disclosure.
  • FIG. 4 shows an example of an attention map indicating localization using the pipeline shown in FIG. 3 according to embodiments of this disclosure.
  • FIG. 5 illustrates an example of a computing environment within which embodiments of the disclosure may be implemented.
  • a disclosed framework can apply two different supervision training techniques for a machine learning-based solution.
  • the framework is an end-to-end convolutional trainable pipeline with attention guidance formed by a generative adversarial network (GAN) based model, such as a Convolutional Adversarial Variational Autoencoder with Guided Attention (CAVGA) model.
  • GAN generative adversarial network
  • CAVGA Convolutional Adversarial Variational Autoencoder with Guided Attention
  • the pipeline is trained only on non-anomalous images to encourage the latent space of the GAN based model to learn a distribution of non-anomalous images.
  • An attention expansion loss is used to encourage an attention map to cover the entire normal regions.
  • a complementary guided attention loss is used to minimize the anomalous attention and simultaneously expand the normal attention for the normal images correctly predicted by the classifier.
  • an attention map for a detected anomaly is usually generated by the technique of backpropagating the gradients corresponding to a specific class for the input image.
  • embodiments of this disclosure involve generating activation maps obtained from the latent space of a deep learning model to produce an attention map without image-level labels as the model is trained by unsupervised training, and without anomalous training images.
  • the attention maps describe the regions of the image that are highly discriminative.
  • a Grad-CAM algorithm is used to compute the attention map using gradient backpropagation. During inference, the inverse attention is used for localizing the anomalous objects.
  • weakly supervised image-level labels are leveraged to train the GAN based model. Predictions of a classifier are applied at the output of the latent space to compute the attention map. Since the precision of the attention map depends on the performance of the classifier, the attention map is generated based on gradients for the images that were correctly predicted by the classifier.
  • FIG. 1 shows an example of a pipeline for unsupervised anomaly detection and localization in accordance with embodiments of this disclosure.
  • a GAN based model, such as a CAVGA model
  • the GAN based model includes an encoder network 110, a residual decoder 112, and a discriminator.
  • input image x is passed through encoder network 110 (e.g., ResNet-18) where output z is the feature representation of x in the latent space.
  • Output z is used by residual decoder 112 to generate a reconstructed image $\hat{x}$, which is a reconstruction of original input image x.
  • The discriminator 114 is another convolutional network which determines whether the reconstructed image $\hat{x}$ is of the same distribution as that of input image x, thereby resulting in adversarial loss as output 116.
  • An objective function is used to derive the attention map A during gradient backpropagation (e.g., Grad-CAM) from output z with an objective to sharpen the reconstructed image $\hat{x}$.
  • The attention map A is computed with a normalization such that $A_{i,j} \in (0, 1)$, where $A_{i,j}$ is the (i, j) element of A.
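  • The attention computation described above can be sketched roughly as follows. This is a minimal Grad-CAM-style illustration in plain Python, not the disclosed implementation: the function name `grad_cam_attention`, the nested-list inputs, and the min-max normalization are assumptions, since the text only states that A is normalized.

```python
def grad_cam_attention(feature_maps, gradients, eps=1e-8):
    """Grad-CAM-style attention from K feature maps and their gradients.

    feature_maps, gradients: lists of K maps, each an H x W nested list.
    Returns an H x W map whose entries lie in [0, 1].
    """
    k = len(feature_maps)
    h = len(feature_maps[0])
    w = len(feature_maps[0][0])
    # Channel weight: spatial mean of the gradient for that channel.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum of the feature maps followed by ReLU.
    cam = [[max(0.0, sum(weights[c] * feature_maps[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    # Min-max normalization (an assumed choice of normalization).
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in cam]


# Hypothetical single-channel 2 x 2 example.
A = grad_cam_attention([[[1.0, 2.0], [3.0, 4.0]]], [[[1.0, 1.0], [1.0, 1.0]]])
```

  • In a real pipeline the gradients would come from automatic differentiation of the chosen objective with respect to the latent feature maps; here they are passed in directly to keep the sketch self-contained.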
  • The objective function is an attention loss L that can be expressed as follows in Equation 1:
  • KLD Kullback-Leibler
  • the encoder 110 loss can be expressed by the following:
  • Equation (3). The posterior $p_\theta(z \mid x)$ is modeled using a standard Gaussian distribution for prior $p(z)$ with the help of Kullback-Leibler (KL) divergence through $q_\phi(z \mid x)$.
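  • As an illustration of the KL term against a standard Gaussian prior, the closed form for a diagonal Gaussian can be sketched as below. It is assumed here, as in a typical variational autoencoder, that the encoder outputs a mean and a log-variance per latent element; the names `mu` and `logvar` are hypothetical and do not appear in the text.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent elements."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))


# The divergence vanishes when the approximate posterior equals the prior.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```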
  • Discriminator 114 determines adversarial loss 115 ($L_{adv}$), formulated as follows:
  • the disclosed embodiments use a convolutional latent variable to preserve the spatial relation between the input and the latent variable.
  • Attention A obtained from feature maps focuses on the regions of the image based on the activation of neurons in the feature maps and their respective importance. Due to the lack of prior knowledge about the anomaly, in general, humans need to look at the entire image to identify anomalous regions. Extending this concept to the disclosed framework, the feature representation of the entire normal image is learned by proposing an attention expansion loss 115, where the network is encouraged to generate an attention map covering all the normal regions. This attention expansion loss 115 for each image, $L_{ae,i}$, is defined as follows:
  • The final attention expansion loss $L_{ae}$ is the average of $L_{ae,i}$ over the N images. Since attention mechanisms locate the most salient regions in the image, which typically do not cover the entire image, the attention expansion loss $L_{ae}$ is used as additional supervision on the network, such that the trained network generates an attention map that covers all the normal regions. Without $L_{ae}$ (i.e., with unsupervised training of CAVGA using only adversarial learning, $L_{adv} + L$), not all the normal regions are encoded into the latent variable, and the attention map fails to cover the entire image. Furthermore, supervising on attention maps prevents the trained model from making inferences based on incorrect areas and also alleviates the need for a large amount of training data, which is not enforced in existing methods.
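  • The per-image formula is not reproduced in this text, so the following is only one form consistent with the description: it penalizes every attention entry that falls short of 1 and averages the result over the N images. The function name and the mean-of-(1 - A) form are assumptions.

```python
def attention_expansion_loss(attention_maps):
    """Average over N images of the per-image mean of (1 - A_ij).

    attention_maps: list of N maps, each an H x W nested list in [0, 1].
    The loss is 0 when every map covers the whole image with value 1.
    """
    per_image = []
    for a in attention_maps:
        values = [v for row in a for v in row]
        per_image.append(sum(1.0 - v for v in values) / len(values))
    return sum(per_image) / len(per_image)


# Hypothetical maps: one covers the whole image, one covers half of it.
full = [[1.0, 1.0], [1.0, 1.0]]
half = [[1.0, 0.0], [1.0, 0.0]]
loss = attention_expansion_loss([full, half])
```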
  • A final objective loss function $L_{final}$ is defined as follows:
  • $L_{final} = w_r L + w_{adv} L_{adv} + w_{ae} L_{ae}$   Equation (6)
  • $w_r$, $w_{adv}$, and $w_{ae}$ are empirically set as 1, 1, and 0.01 respectively.
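  • Equation (6) is a plain weighted sum, which can be sketched as follows. The function and argument names are hypothetical; the default weights are the empirically chosen values stated above.

```python
def final_loss(l, l_adv, l_ae, w_r=1.0, w_adv=1.0, w_ae=0.01):
    """Weighted sum of the loss terms in Equation (6).

    l: the term written L in the text, l_adv: adversarial loss,
    l_ae: attention expansion loss.
    """
    return w_r * l + w_adv * l_adv + w_ae * l_ae


# Illustrative values only.
total = final_loss(2.0, 0.5, 10.0)
```

  • With these weights the attention expansion term contributes two orders of magnitude less than the other terms for losses of equal magnitude.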
  • the GAN based model is trained only on non-anomalous images such that during inference time, when anomalous images are passed through to the network, the regions pertaining to the anomaly will not be reconstructed.
  • As a result, the anomalous score is higher than when a non-anomalous image is passed through.
  • a ResNet-18 convolutional neural network model pretrained on ImageNet training data may be used as the encoder which can be finetuned with available training data.
  • The trained pipeline 100 operates as follows. Image $x_{test}$ is fed into the encoder 110 followed by the decoder 112, which reconstructs an image $\hat{x}_{test}$. The pixel-wise difference between $\hat{x}_{test}$ and $x_{test}$ is computed as the anomalous score $s_a$. Intuitively, if $x_{test}$ is drawn from the learnt distribution of z, then $s_a$ is small. Without using any anomalous training images in the unsupervised setting, $s_a$ is normalized to [0, 1], and 0.5 is empirically set as the threshold to detect an image as anomalous.
  • The attention map $A_{test}$ is computed from z using backpropagation (e.g., Grad-CAM) and is inverted ($\mathbf{1} - A_{test}$) to obtain an anomalous attention map which localizes the anomaly.
  • $\mathbf{1}$ refers to a matrix of all ones with the same dimensions as $A_{test}$.
  • Threshold 0.5 is empirically chosen on the anomalous attention map to evaluate the localization performance.
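  • The inversion and thresholding can be sketched as below, assuming the attention map is already normalized to [0, 1]. The function names and the strict comparison are illustrative choices.

```python
def anomalous_attention(a_test):
    """Invert the normal attention map: every entry becomes 1 - A_ij."""
    return [[1.0 - v for v in row] for row in a_test]


def localization_mask(a_test, threshold=0.5):
    """Binary mask marking pixels whose anomalous attention exceeds the threshold."""
    return [[1 if v > threshold else 0 for v in row]
            for row in anomalous_attention(a_test)]


# Hypothetical attention map: only the top-right pixel is poorly attended.
mask = localization_mask([[0.9, 0.2], [1.0, 0.6]])
```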
  • FIG. 2 illustrates an example of a single layer for the residual decoder 112.
  • Layer 200 includes an upsampler 210, a BatchNorm unit 212, a ReLU 214, convolution operation 216, BatchNorm 218, and ReLU 220.
  • Input image 201 is processed by the decoder layer 200 to produce an output 202.
  • Discriminator 114 is used at the output 202 of the decoder to maintain the distribution of the input and reconstructed images, thereby enabling a sharper reconstruction.
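  • The order of operations in decoder layer 200 can be illustrated with a simplified single-channel sketch in plain Python. Several details are assumptions not stated in the text: nearest-neighbor 2x upsampling, a 3 x 3 kernel with zero padding, and a normalization without learned scale or shift computed over a single map rather than over a batch.

```python
import math


def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an H x W map."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def batch_norm(x, eps=1e-5):
    """Normalize a map to zero mean and unit variance."""
    vals = [v for row in x for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in x]


def relu(x):
    """Clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in x]


def conv3x3(x, kernel):
    """3 x 3 cross-correlation with zero padding and stride 1."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def decoder_layer(x, kernel):
    """Upsample -> BatchNorm -> ReLU -> Conv -> BatchNorm -> ReLU."""
    y = relu(batch_norm(upsample2x(x)))
    return relu(batch_norm(conv3x3(y, kernel)))


# Hypothetical 2 x 2 input and an identity kernel; the output is 4 x 4.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
out = decoder_layer([[0.0, 1.0], [2.0, 3.0]], identity)
```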
  • the GAN based model is end-to-end convolutional.
  • attention mapping is used to solve this problem.
  • attention maps are computed by backpropagating the gradients.
  • The attention area is maximized. The motivation for maximizing the attention is that, through the loss function (Equation 1), extra supervision is provided to the network to better attend to the non-anomalous regions of the image.
  • The attention map is obtained on the non-anomalous regions of the image, such that inverting the attention map results in an attention map highlighting the abnormal region of the image. This inverse attention thereby results in the localization of the anomalous region in the image.
  • FIG. 3 shows an example of a pipeline for weakly supervised anomaly detection and localization according to embodiments of this disclosure.
  • A weakly supervised approach is now described for training the GAN based network (e.g., CAVGA) to detect and localize anomalies by leveraging image-level labels.
  • Pipeline 300 includes encoder 310, classifier 311, decoder 112 and discriminator 114.
  • the localization is obtained using attention maps by backpropagating the gradients from the prediction of the classifier 311. Localization is improved by backpropagating only those gradients obtained from the correct prediction of the classifier 311.
  • This approach is also applicable in training a network with an objective not confined to the task of anomaly detection (e.g., novelty detection).
  • the encoder 310 (e.g., CAVGA) is trained on both anomalous and non-anomalous images.
  • A binary classifier 311, which is trained using a binary cross entropy loss 312, uses the output z of the latent space.
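  • For reference, the standard binary cross entropy for a single prediction can be sketched as follows. The convention that label 1 denotes the anomalous class, the clipping constant, and the function name are assumptions made for the illustration.

```python
import math


def binary_cross_entropy(p_anomalous, label, eps=1e-12):
    """Binary cross entropy for one image.

    p_anomalous: predicted probability of the anomalous class.
    label: 1 for an anomalous image, 0 for a normal image (assumed convention).
    """
    p = min(max(p_anomalous, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# A confident correct prediction costs less than a confident wrong one.
low = binary_cross_entropy(0.9, 1)
high = binary_cross_entropy(0.1, 1)
```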
  • the pipeline shown in FIG. 3 uses an objective function (Equations 7, 8 and 9).
  • the attention map is computed from the prediction of the classifier 311 by backpropagating the gradients in the encoder network 310. Since the precision of the attention map is dependent on the performance of the classifier 311, the attention map is computed using a selective gradient, in which only those gradients which result in the correct prediction by the classifier 311 are backpropagated and used for attention loss.
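  • The selective use of gradients can be approximated at the loss level by masking out images that the classifier predicted incorrectly, as in the sketch below. The function name, the string class labels, and averaging over all N images rather than only over the kept ones are assumptions.

```python
def selective_attention_loss(per_image_losses, predictions, labels):
    """Average attention loss where wrongly classified images contribute zero.

    Only images whose prediction matches the label keep their attention
    loss, so only they would contribute gradients to the attention term.
    """
    masked = [loss if pred == label else 0.0
              for loss, pred, label in zip(per_image_losses, predictions, labels)]
    return sum(masked) / len(masked)


# Hypothetical batch: the second image is misclassified and is ignored.
value = selective_attention_loss([0.2, 0.4, 0.6],
                                 ["cn", "ca", "cn"],
                                 ["cn", "cn", "cn"])
```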
  • An objective function to train the network 300 differs from that of the unsupervised approach shown in FIG. 1 in formulating the attention loss. A first part of the attention loss, $L_N$, is formulated to maximize the attention obtained from the non-anomalous (or normal) prediction on a non-anomalous image, called normal attention and represented by $A_x^{c_n}$, where the superscript represents the normal prediction and the subscript represents the non-anomalous image, as expressed by Equation (7):
  • Class Loss represents classification loss result 312.
  • A second part of the attention loss L formulation relates to an abnormal prediction on a non-anomalous image, called abnormal attention and represented by $A_x^{c_a}$, where the superscript represents the abnormal prediction and the subscript represents the non-anomalous image.
  • the objective involves minimizing the attention as expressed in Equation (8):
  • An objective function for attention loss $L_{AI}$ related to an abnormal image during training is represented by Equation (9):
  • $L_{AI} = \arg\min [\mathrm{BCE}(x, \hat{x}) + \mathrm{KLD}(z, N(0, 1)) + \text{Adv Loss} + \text{Class Loss}]$   Equation (9)
  • Classifier 311 prediction can be defined as $p \in \{c_a, c_n\}$, where $c_a$ and $c_n$ are anomalous and normal classes, respectively.
  • z is cloned into a new tensor and flattened to form a fully connected layer $z_{fc}$, and a 2-node output layer is added to form classifier 311.
  • Variables z and $z_{fc}$ share parameters. Flattening to $z_{fc}$ enables a higher magnitude of gradient backpropagation from prediction p.
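  • The classifier head described above can be sketched as a flatten step followed by a 2-node linear layer. This is a plain-Python illustration with hypothetical names and hand-picked weights; it does not model parameter sharing or gradient flow.

```python
def flatten(z):
    """Flatten a C x H x W latent tensor (nested lists) into a vector z_fc."""
    return [v for channel in z for row in channel for v in row]


def classify(z, weights, biases, classes=("ca", "cn")):
    """Apply a 2-node output layer to z_fc and return (prediction, logits).

    weights: two rows, one per output node, each as long as z_fc.
    biases: two bias values. Class order ("ca", "cn") is an assumption.
    """
    z_fc = flatten(z)
    logits = [sum(w * v for w, v in zip(w_row, z_fc)) + b
              for w_row, b in zip(weights, biases)]
    return classes[logits.index(max(logits))], logits


# Hypothetical 1 x 2 x 2 latent tensor and weights.
pred, logits = classify([[[1.0, 0.0], [0.0, 1.0]]],
                        [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]],
                        [0.0, 0.0])
```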
  • the disclosed embodiment proposing supervision on attention maps for anomaly localization in the weakly supervised setting is a novel approach. Since the attention map depends on the performance of classifier 311, a complementary guided attention loss Lcga based on classifier 311 prediction improves anomaly localization.
  • The proposed loss $L_{cga}$ minimizes the areas covered by $A_x^{c_a}$ but simultaneously enforces $A_x^{c_n}$ to cover the entire normal image.
  • A loss $L_{cga,i}$ is defined as the complementary guided attention loss for each image in the weakly supervised setting as follows:
  • Equation (10), where $\mathbb{1}(\cdot)$ is an indicator function.
  • The complementary guided attention loss $L_{cga}$ is the average of loss $L_{cga,i}$ over the N images.
  • The final objective loss function $L_{final}$ is then defined as follows:
  • Equation (11), where $L_c$ is the binary cross entropy loss of classifier 311, and $w_r$, $w_{adv}$, $w_c$, and $w_{cga}$ are empirically set as 1, 1, 0.001, and 0.01 respectively.
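  • Equations (10) and (11) are not reproduced in this text, so the sketch below shows only one form consistent with the description: the per-image loss is non-zero only for normal images correctly predicted as normal, it rewards large normal attention and small abnormal attention, and the final objective is assumed to be a weighted sum analogous to Equation (6). All names and the exact functional form are assumptions.

```python
def cga_loss_single(a_normal, a_abnormal, prediction, label, normal="cn"):
    """Complementary guided attention loss for one image (assumed form).

    a_normal: attention map from the normal prediction (H x W nested list).
    a_abnormal: attention map from the abnormal prediction (same shape).
    The indicator keeps the loss only when prediction == label == normal.
    """
    if not (prediction == label == normal):
        return 0.0
    count = sum(len(row) for row in a_normal)
    total = 0.0
    for row_n, row_a in zip(a_normal, a_abnormal):
        for v_n, v_a in zip(row_n, row_a):
            total += (1.0 - v_n) + v_a  # expand normal, shrink abnormal
    return total / count


def final_loss_weak(l, l_adv, l_c, l_cga,
                    w_r=1.0, w_adv=1.0, w_c=0.001, w_cga=0.01):
    """Weighted sum of the weakly supervised loss terms (assumed form)."""
    return w_r * l + w_adv * l_adv + w_c * l_c + w_cga * l_cga


# Ideal case: normal attention covers everything, abnormal attention is empty.
ideal = cga_loss_single([[1.0, 1.0]], [[0.0, 0.0]], "cn", "cn")
```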
  • trained pipeline 300 uses classifier 311 to predict the input image xtest as anomalous or normal.
  • Anomaly localization applies the same evaluation method as described above for the unsupervised pipeline 100.
  • FIG. 5 illustrates an example of a computing environment within which embodiments of the present disclosure may be implemented.
  • a computing environment 500 includes a computer system 510 that may include a communication mechanism such as a system bus 521 or other communication mechanism for communicating information within the computer system 510.
  • the computer system 510 further includes one or more processors 520 coupled with the system bus 521 for processing the information.
  • Computing environment 500 corresponds to a system for performing the above-described embodiments, in which the computer system 510 relates to a computer described below in greater detail.
  • the processors 520 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as described herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device.
  • CPUs central processing units
  • GPUs graphical processing units
  • a processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer.
  • a processor may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth.
  • RISC Reduced Instruction Set Computer
  • CISC Complex Instruction Set Computer
  • ASIC Application Specific Integrated Circuit
  • FPGA Field-Programmable Gate Array
  • SoC System-on-a-Chip
  • DSP digital signal processor
  • processor(s) 520 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like.
  • the microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets.
  • a processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between.
  • a user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof.
  • a user interface comprises one or more display images enabling user interaction with a processor or other device.
  • the system bus 521 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various components of the computer system 510.
  • the system bus 521 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth.
  • the system bus 521 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.
  • ISA Industry Standard Architecture
  • MCA Micro Channel Architecture
  • EISA Enhanced ISA
  • VESA Video Electronics Standards Association
  • AGP Accelerated Graphics Port
  • PCI Peripheral Component Interconnects
  • PCMCIA Personal Computer Memory Card International Association
  • USB Universal Serial Bus
  • the computer system 510 may also include a system memory 530 coupled to the system bus 521 for storing information and instructions to be executed by processors 520.
  • the system memory 530 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 531 and/or random access memory (RAM) 532.
  • the RAM 532 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM).
  • the ROM 531 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM).
  • system memory 530 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 520.
  • a basic input/output system 533 (BIOS) containing the basic routines that help to transfer information between elements within computer system 510, such as during start-up, may be stored in the ROM 531.
  • RAM 532 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 520.
  • System memory 530 may additionally include, for example, operating system 534, application modules 535, and other program modules 536.
  • Application modules 535 may include aforementioned modules described for FIG. 1 and may also include a user portal for development of the application program, allowing input parameters to be entered and modified as necessary.
  • the operating system 534 may be loaded into the memory 530 and may provide an interface between other application software executing on the computer system 510 and hardware resources of the computer system 510. More specifically, the operating system 534 may include a set of computer-executable instructions for managing hardware resources of the computer system 510 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 534 may control execution of one or more of the program modules depicted as being stored in the data storage 540.
  • the operating system 534 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.
  • the computer system 510 may also include a disk/media controller 543 coupled to the system bus 521 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 541 and/or a removable media drive 542 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive).
  • Storage devices 540 may be added to the computer system 510 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
  • Storage devices 541, 542 may be external to the computer system 510.
  • the computer system 510 may include a user interface module 560 to process user inputs from user input devices 561, which may comprise one or more devices such as a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 520.
  • user interface module 560 also processes system outputs to user display devices 562, (e.g., via an interactive GUI display).
  • the computer system 510 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 520 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 530. Such instructions may be read into the system memory 530 from another computer readable medium of storage 540, such as the magnetic hard disk 541 or the removable media drive 542.
  • the magnetic hard disk 541 and/or removable media drive 542 may contain one or more data stores and data files used by embodiments of the present disclosure.
  • The data stores 540 may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like. Data store contents and data files may be encrypted to improve security.
  • the processors 520 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 530.
  • hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
  • the computer system 510 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein.
  • the term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 520 for execution, and may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media.
  • Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 541 or removable media drive 542.
  • Non-limiting examples of volatile media include dynamic memory, such as system memory 530.
  • Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 521. Transmission media may also take the form of acoustic or light waves, such as in radio wave and infrared data communications.
  • Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
  • FPGA field-programmable gate arrays
  • PLA programmable logic arrays
  • the computing environment 500 may further include the computer system 510 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 573.
  • the network interface 570 may enable communication, for example, with other remote devices 573 or systems and/or the storage devices 541, 542 via the network 571.
  • Remote computing device 573 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 510.
  • Network 571 links remote data sources to computing device 510.
  • Remote sensing devices 574 e.g., cameras
  • Remote data repositories 575 may store images used for training the anomaly detection networks.
  • computer system 510 may include modem 572 for establishing communications over a network 571, such as the Internet.
  • Modem 572 may be connected to system bus 521 via user network interface 570, or via another appropriate mechanism.
  • Network 571 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 510 and other computers (e.g., remote computing device 573).
  • the network 571 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art.
  • Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 571.
  • program modules, applications, computer-executable instructions, code, or the like depicted in FIG. 5 as being stored in the system memory 530 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module.
  • various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 510, the remote device 573, and/or hosted on other computing device(s) accessible via one or more of the network(s) 571 may be provided to support functionality provided by the program modules, applications, or computer-executable code depicted in FIG.
  • functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 5 may be performed by a fewer or greater number of modules at least in part by another module.
  • program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer-to-peer model, and so forth.
  • any of the functionality described as being supported by any of the program modules depicted in FIG. 5 may be implemented, at least partially, in hardware and/or firmware across any number of devices.
  • Computer system 510 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 510 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 530, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above- mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality.
  • This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the Figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the illustrations can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Image Analysis (AREA)

Abstract

System and method for anomaly detection and localization in images is disclosed. An end-to-end convolutional pipeline includes a generative adversarial network (GAN) based model with an encoder network, decoder network and a discriminator network. The encoder network is trained to generate a latent space representation of an input image and to generate an attention map by backpropagating gradients using an objective function. Residual decoder network generates a reconstructed image of the input image from the latent space representation. Discriminator network determines whether the reconstruction image is of a same distribution as the input image, where the objective function sharpens the reconstructed image. Given an anomalous input image, the pipeline inverts the attention map to localize an anomalous region of the image.

Description

UNSUPERVISED AND WEAKLY-SUPERVISED ANOMALY DETECTION AND LOCALIZATION IN IMAGES
TECHNICAL FIELD
[0001] This application relates to machine learning applied to image processing. More particularly, this application relates to unsupervised and weakly-supervised anomaly detection and localization in images.
BACKGROUND
[0002] An anomaly is defined as any event or occurrence which does not follow expected or normal behavior. Defining the concept of the anomaly in the context of images can be very challenging and is critical to the success and effectiveness of an anomaly detector. An efficient anomaly detector should be capable of differentiating between anomalous and normal instances with high precision to avoid false alarms. Extending this further, localization of the anomaly (e.g., attention mapping) in an image is useful to reduce human efforts. Anomaly localization has been applied in industrial inspection settings to segment defective product parts, in surveillance to locate intruders, in medical imaging to segment tumor in brain MRI or glaucoma in retina images, etc. There has been an increase in analysis towards segmenting potential anomalous regions in images.
[0003] State-of-the-art localization methods are based on deep learning. Developing deep learning based algorithms can be challenging due to the small pixel coverage of the anomaly and lack of suitable data, as anomalous images are rarely available in the real world. A threshold score obtained from pixel-wise difference between the reconstructed image and input image is used to predict whether the input image is anomalous or not. Such approaches rely on this difference to try to identify the regions of the anomaly in the image, but they are vulnerable to noisy images which interferes with detecting an anomaly and can result in false alarms. Moreover, such methods need to determine class-specific thresholds using anomalous training images (i.e., supervised training) which are often unavailable in real-world scenarios.
SUMMARY
[0004] Method and system are disclosed for an unsupervised and weakly-supervised anomaly detection and localization in images. In contrast to current solutions which generate an attention map by backpropagating gradients corresponding to a specific class to an input image, a disclosed method trains a deep neural network on non-anomalous images which encourages the latent space of the network to learn distribution of non-anomalous images. In one embodiment, unsupervised training does not have image-level labels, and instead uses activation maps obtained from the latent space to produce an attention map that localizes the anomaly in the image. In another embodiment, weakly supervised image-level labels are used to train the deep learning network and predictions by a classifier at the output of latent space are used to compute an attention map. Since precision of attention map depends on classifier performance, the attention map is based on gradients for the images correctly predicted by the classifier. From this, the deep learning network localizes the anomaly with better accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] Non-limiting and non-exhaustive embodiments of the present embodiments are described with reference to the following FIGURES, wherein like reference numerals refer to like elements throughout the drawings unless otherwise specified.
FIG. 1 shows an example of a pipeline for unsupervised anomaly detection and localization in accordance with embodiments of this disclosure.
FIG. 2 shows an example of a single layer of a residual decoder in accordance with embodiments of this disclosure.
FIG. 3 shows an example of a pipeline for weakly supervised anomaly detection and localization according to embodiments of this disclosure.
FIG. 4 shows an example of an attention map indicating localization using the pipeline shown in FIG. 3 according to embodiments of this disclosure.
FIG. 5 illustrates an example of a computing environment within which embodiments of the disclosure may be implemented.
DETAILED DESCRIPTION
[0006] In order to solve the technical problem of anomaly detection in images, a disclosed framework can apply two different supervision training techniques for a machine learning-based solution. The framework is an end-to-end convolutional trainable pipeline with attention guidance formed by a generative adversarial network (GAN) based model, such as a Convolutional Adversarial Variational Autoencoder with Guided Attention (CAVGA) model. Firstly, in an unsupervised approach, the pipeline is trained only on non-anomalous images to encourage the latent space of the GAN based model to learn a distribution of non-anomalous images. An attention expansion loss is used to encourage an attention map to cover the entire normal regions. A complementary guided attention loss is used to minimize the anomalous attention and simultaneously expand the normal attention for the normal images correctly predicted by the classifier. In conventional deep learning based classifiers, an attention map for a detected anomaly is usually generated by the technique of backpropagating the gradients corresponding to a specific class for the input image. In contrast with such approaches which require anomalous images for training, embodiments of this disclosure involve generating activation maps obtained from the latent space of a deep learning model to produce an attention map without image-level labels as the model is trained by unsupervised training, and without anomalous training images. The attention maps describe the regions of the image that are highly discriminative. In an embodiment, a Grad-CAM algorithm is used to compute the attention map using gradient backpropagation. During inference, the inverse attention is used for localizing the anomalous objects.
[0007] In an embodiment according to a second approach, weakly supervised image-level labels are leveraged to train the GAN based model. Predictions of a classifier are applied at the output of the latent space to compute the attention map. Since the precision of the attention map depends on the performance of the classifier, the attention map is generated based on gradients for the images that were correctly predicted by the classifier.
[0008] FIG. 1 shows an example of a pipeline for unsupervised anomaly detection and localization in accordance with embodiments of this disclosure. According to the unsupervised training embodiment, a GAN based model, such as a CAVGA model, is trained to detect and localize anomalies. The anomaly localization is obtained using attention maps by backpropagating the gradients from the activation map of the latent space. This approach is also applicable in training a network with an objective not confined to the task of anomaly detection (e.g., novelty detection). The GAN based model includes an encoder network 110, a residual decoder 112, and a discriminator 114.
[0009] As shown in FIG. 1, input image x is passed through encoder network 110 (e.g., ResNet-18) where output z is the feature representation of x in the latent space. Output z is used by residual decoder 112 to generate a reconstructed image $\hat{x}$, which is a reconstruction of original input image x. The discriminator 114 is another convolutional network which determines whether the reconstructed image $\hat{x}$ is of the same distribution as that of input image x, thereby resulting in adversarial loss as output 116. An objective function is used to derive the attention map A during gradient backpropagation (e.g., Grad-CAM) from output z with an objective to sharpen the reconstructed image $\hat{x}$. The attention map A is normalized such that $A_{i,j} \in (0, 1)$, where $A_{i,j}$ is the (i, j) element of A. The objective function is an attention loss L that can be expressed as follows in Equation (1):
Figure imgf000007_0001
Equation (1) where:
- BCE represents Binary Cross Entropy Loss between the input image and reconstructed image,
- KLD denotes the Kullback-Leibler (KL) divergence Loss between the z and N(0, 1) which represents Normal Distribution,
- Adv Loss represents Adversarial loss between the input image and reconstructed image.
These losses (BCE, KLD, and Adv Loss) are jointly used to obtain a sharper reconstructed image $\hat{x}$ of the input image x.
[0010] A refinement to the above computation for attention loss can be applied as follows. The encoder 110 loss can be expressed by the following:
$L = L_R(x, \hat{x}) + KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$
Equation (2)
where the reconstruction loss $L_R$ between input images x and reconstructed images $\hat{x}$ for N total images is expressed by:
Figure imgf000007_0002
Equation (3)
The posterior $p_\theta(z|x)$ is modeled using a standard Gaussian distribution for prior p(z) with the help of Kullback-Leibler (KL) divergence through $q_\phi(z|x)$. To improve stability of the training and generate sharper reconstructed images $\hat{x}$, discriminator 114 determines adversarial loss 115 ($L_{adv}$) formulated as follows:
Figure imgf000008_0001
Equation (4)
Unlike traditional autoencoders where the latent variable is flattened, the disclosed embodiments use a convolutional latent variable to preserve the spatial relation between the input and the latent variable.
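The three loss terms described above can be illustrated with a minimal sketch. Because Equations (3) and (4) are not reproduced in the text, the sketch assumes the standard binary cross entropy for the reconstruction loss, the closed-form KL divergence of a diagonal Gaussian from N(0, 1), and the standard GAN discriminator loss. The function names, the `EPS` clamp, and the flattened-list inputs are hypothetical choices made for illustration, not taken from the disclosure.

```python
import math

EPS = 1e-7  # assumed clamp so that log() stays finite


def bce(x, x_hat):
    """Mean binary cross entropy between flattened pixel lists with values in [0, 1]."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, EPS), 1.0 - EPS)
        total += -(xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh))
    return total / len(x)


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent elements."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def adversarial_loss(d_real, d_fake):
    """Assumed standard GAN discriminator loss.

    d_real holds D(x) and d_fake holds D(x_hat), both as probabilities.
    """
    total = 0.0
    for dr, df in zip(d_real, d_fake):
        dr = min(max(dr, EPS), 1.0 - EPS)
        df = min(max(df, EPS), 1.0 - EPS)
        total += -(math.log(dr) + math.log(1.0 - df))
    return total / len(d_real)
```

With a perfect reconstruction the BCE term approaches zero, and a latent code with zero mean and unit variance gives a KL term of exactly zero, which is the behavior the encoder loss of Equation (2) rewards.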
[0011] Intuitively, attention A obtained from feature maps focuses on the regions of the image based on the activation of neurons in the feature maps and their respective importance. Due to the lack of prior knowledge about the anomaly, in general, humans need to look at the entire image to identify anomalous regions. Extending this concept to the disclosed framework, the feature representation of the entire normal image is learned by proposing an attention expansion loss 115, where the network is encouraged to generate an attention map covering all the normal regions. This attention expansion loss 115 for each image, $L_{ae,i}$, is defined as follows:
Figure imgf000008_0002
Equation (5)
where |A| is the total number of elements in A. The final attention expansion loss $L_{ae}$ is the average of $L_{ae,i}$ over the N images. Since attention mechanisms locate the most salient regions in the image, which typically do not cover the entire image, attention expansion loss $L_{ae}$ is used as additional supervision on the network, such that the trained network generates an attention map that covers all the normal regions. Without $L_{ae}$ (i.e., when CAVGA is trained in the unsupervised setting only with adversarial learning, $L_{adv} + L$), not all the normal regions are encoded into the latent variable, and the attention map fails to cover the entire image. Furthermore, supervising on attention maps prevents the trained model from making inferences based on incorrect areas and also alleviates the need for a large amount of training data, which is not enforced in existing methods.
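As a rough illustration of the attention expansion idea, the sketch below assumes the per-image loss is the mean of (1 - A) over all elements of a normalized attention map, so that the loss is zero when the map covers the whole image. This form is an assumption that is consistent with the description of |A| and of the averaging over N images; the function names are hypothetical.

```python
def attention_expansion_loss(attention):
    """Per-image loss: mean of (1 - A[i][j]) over all |A| elements (assumed form)."""
    values = [a for row in attention for a in row]
    return sum(1.0 - a for a in values) / len(values)


def batch_attention_expansion_loss(attention_maps):
    """Average of the per-image losses over the N images in a batch."""
    per_image = [attention_expansion_loss(a) for a in attention_maps]
    return sum(per_image) / len(per_image)
```

A map of all ones yields a loss of 0 and a map of all zeros yields a loss of 1, so minimizing this quantity pushes the attention to expand over the entire normal image.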
[0012] A final objective loss function $L_{final}$ is defined as follows:
$L_{final} = w_r L + w_{adv} L_{adv} + w_{ae} L_{ae}$
Equation (6)
where $w_r$, $w_{adv}$, and $w_{ae}$ are empirically set as 1, 1, and 0.01 respectively.
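The weighted sum of Equation (6) can be written directly. In the sketch below the default weights are the empirical values stated above, while the function and argument names are hypothetical.

```python
def final_unsupervised_loss(l_vae, l_adv, l_ae, w_r=1.0, w_adv=1.0, w_ae=0.01):
    """Weighted sum of the encoder loss L, adversarial loss, and attention expansion loss."""
    return w_r * l_vae + w_adv * l_adv + w_ae * l_ae


# Worked example with made-up loss values: 1.0 + 2.0 + 0.01 * 50.0
example = final_unsupervised_loss(1.0, 2.0, 50.0)
```

With these weights the attention expansion term contributes one hundredth of its raw value, so it acts as light additional supervision rather than dominating the reconstruction and adversarial terms.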
[0013] In order to first detect an anomaly in an unsupervised manner, the GAN based model is trained only on non-anomalous images such that during inference time, when anomalous images are passed through the network, the regions pertaining to the anomaly will not be reconstructed. When a pixel-wise difference is computed between the reconstructed image and the input anomalous image, the score is higher than when a non-anomalous image is passed through. In an aspect to overcome limited availability of training data, a ResNet-18 convolutional neural network model pretrained on ImageNet training data may be used as the encoder, which can be fine-tuned with available training data. Since the dataset contains images with a large amount of high frequency components, a residual decoder with skip connections and an interleaved convolutional layer is employed to preserve local information.
[0014] Following training, the trained pipeline 100 operates as follows. Image $x_{test}$ is fed into the encoder 110 followed by the decoder 112, which reconstructs an image $\hat{x}_{test}$. The pixel-wise difference is computed between $\hat{x}_{test}$ and $x_{test}$ as the anomalous score $s_a$. Intuitively, if $x_{test}$ is drawn from the learnt distribution of z, then $s_a$ is small. Without using any anomalous training images in the unsupervised setting, $s_a$ is normalized to [0, 1] and 0.5 is empirically set as the threshold to detect an image as anomalous. The attention map $A_{test}$ is computed from z using backpropagation (e.g., Grad-CAM) and is inverted ($\mathbf{1} - A_{test}$) to obtain an anomalous attention map which localizes the anomaly. Here, $\mathbf{1}$ refers to a matrix of all ones with the same dimensions as $A_{test}$. A threshold of 0.5 is empirically chosen on the anomalous attention map to evaluate the localization performance.
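The inference steps above can be sketched as follows. The text does not state how the pixel-wise difference is reduced to a single score or how it is normalized, so the sketch assumes pixel values in [0, 1] and uses the mean absolute difference, which then already lies in [0, 1]. All function names are hypothetical.

```python
def anomaly_score(x_test, x_hat_test):
    """Mean absolute pixel-wise difference (assumed reduction) for images in [0, 1]."""
    diffs = [abs(a - b)
             for row_x, row_h in zip(x_test, x_hat_test)
             for a, b in zip(row_x, row_h)]
    return sum(diffs) / len(diffs)


def is_anomalous(score, threshold=0.5):
    """Detection: compare the normalized score with the empirical 0.5 threshold."""
    return score > threshold


def invert_attention(attention):
    """Anomalous attention map: a matrix of ones minus the attention map."""
    return [[1.0 - a for a in row] for row in attention]


def localize(attention, threshold=0.5):
    """Binary localization mask from the inverted attention map."""
    inverted = invert_attention(attention)
    return [[1 if v > threshold else 0 for v in row] for row in inverted]
```

For an attention map such as `[[0.9, 0.2], [0.8, 0.1]]`, the weakly attended right-hand column becomes the localized region after inversion and thresholding.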
[0015] FIG. 2 illustrates an example of a single layer for the residual decoder 112. Layer 200 includes an upsampler 210, a BatchNorm unit 212, a ReLU 214, convolution operation 216, BatchNorm 218, and ReLU 220. Input image 201 is processed by the decoder layer 200 to produce an output 202. Discriminator 114 is used at the output 202 of the decoder, to maintain the distribution of the input and reconstructed image and thereby enables a sharper reconstruction. In order to preserve the spatial relation, the GAN based model is end-to-end convolutional.
[0016] Since an objective also involves localizing the anomaly without any prior information of where or what the anomaly is, attention mapping is used to solve this problem. Using the activation maps of the output of the encoder 110 as the latent space, attention maps are computed by backpropagating the gradients. In order to encourage the network to attend to the entire image (since training contains only non-anomalous images), the attention area is maximized. The motivation for maximizing the attention is that, through the loss function (Equation 1), extra supervision is provided to the network to better attend to the non-anomalous regions of the image. During operation of the trained pipeline 100, given an anomalous image, the attention map is obtained on the non-anomalous regions of the image, such that inverting the attention map results in an attention map highlighting the abnormal region of the image. This inverse attention thereby results in the localization of the anomalous region in the image.
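The attention computation can be illustrated with a minimal Grad-CAM style sketch. It assumes the activation maps of the latent space and their gradients have already been obtained from a backward pass and are passed in as nested lists; which scalar the gradients are taken from is not fixed here. The channel weighting by spatially averaged gradients follows the general Grad-CAM recipe, and the min-max normalization to [0, 1] and the function name are assumptions.

```python
def grad_cam(activations, gradients):
    """Grad-CAM style attention map.

    activations, gradients: lists of K channels, each an H x W list of lists.
    Returns an H x W map normalized to [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: spatial average of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum over channels followed by ReLU.
    cam = [[max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)]
           for i in range(height)]
    # Min-max normalization; a constant map is returned as all zeros.
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]
```

Subtracting the returned map from a matrix of ones gives the inverse attention used for localization.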
[0017] FIG. 3 shows an example of a pipeline for weakly supervised anomaly detection and localization according to embodiments of this disclosure. A weakly supervised approach is now described for training the GAN based network (e.g., CAVGA) to detect and localize anomalies by leveraging image-level labels. Pipeline 300 includes encoder 310, classifier 311, decoder 112 and discriminator 114. The localization is obtained using attention maps by backpropagating the gradients from the prediction of the classifier 311. Localization is improved by backpropagating only those gradients obtained from the correct prediction of the classifier 311. This approach is also applicable in training a network with an objective not confined to the task of anomaly detection (e.g., novelty detection).
[0018] In order to first detect anomaly in a weakly supervised manner, the encoder 310 (e.g., CAVGA) is trained on both anomalous and non-anomalous images. A binary classifier 311 uses the output z of the latent space, which is trained using a binary cross entropy loss 312.
[0019] Since an objective of pipeline 300 also involves localizing the anomaly, attention is used to solve this problem. The pipeline shown in FIG. 3 uses an objective function (Equations 7, 8 and 9). The attention map is computed from the prediction of the classifier 311 by backpropagating the gradients in the encoder network 310. Since the precision of the attention map is dependent on the performance of the classifier 311, the attention map is computed using a selective gradient, in which only those gradients which result in the correct prediction by the classifier 311 are backpropagated and used for attention loss. An objective function to train the network 300 differs from that of the unsupervised approach shown in FIG. 1 in formulating the attention loss. A first part of the attention loss, $L_N$, is formulated to maximize the attention obtained from the non-anomalous (or normal) prediction on a non-anomalous image, called a normal attention represented by A , where the superscript value represents the normal prediction and the subscript value represents the non-anomalous image, as expressed by Equation (7):
Figure imgf000012_0001
Equation (7) where:
Class Loss represents classification loss result 312.
A second part of the attention loss, $L_A$, relates to an abnormal prediction on a non-anomalous image, called abnormal attention represented by Aco, where the superscript represents the abnormal prediction and the subscript represents the non-anomalous image. For this attention loss formulation, the objective involves minimizing the attention as expressed in Equation (8):
Figure imgf000012_0002
Equation (8)
These two attention losses $L_N$, $L_A$ individually contribute towards learning a better localization of the anomaly along with the selective gradient technique. Since the attention loss for the abnormal image is not computed, an objective function for attention loss $L_{AI}$ related to an abnormal image during training is represented by Equation (9):
$L_{AI} = \arg\min\,[\mathrm{BCE}(x, \hat{x}) + \mathrm{KLD}(z, N(0, 1)) + \text{Adv Loss} + \text{Class Loss}]$
Equation (9)
[0020] A refinement to the above described computation of attention loss for the weakly-supervised pipeline 300 can be applied as follows. Given an image x and its ground truth label y, classifier 311 prediction can be defined as $p \in \{c_a, c_n\}$, where $c_a$ and $c_n$ are anomalous and normal classes, respectively. As shown in FIG. 3, z is cloned into a new tensor and flattened to form a fully connected layer $z_{fc}$, and a 2-node output layer is added to form classifier 311. Variables z and $z_{fc}$ share parameters. Flattening $z_{fc}$ enables a higher magnitude of gradient backpropagation from prediction p.
[0021] Although attention maps generated from a trained classifier have been used in weakly supervised semantic segmentation tasks, the disclosed embodiment proposing supervision on attention maps for anomaly localization in the weakly supervised setting is a novel approach. Since the attention map depends on the performance of classifier 311, a complementary guided attention loss $L_{cga}$ based on classifier 311 prediction improves anomaly localization. In an aspect, the Grad-CAM algorithm is used to compute the attention map for the anomalous class $A_x^{c_a}$ and the attention map for the normal class $A_x^{c_n}$ on the normal image x (y = $c_n$). Using $A_x^{c_a}$ and $A_x^{c_n}$, proposed loss $L_{cga}$ minimizes the areas covered by $A_x^{c_a}$ but simultaneously enforces $A_x^{c_n}$ to cover the entire normal image. Since the attention map is computed by backpropagating the gradients from prediction p, any incorrect prediction p would generate an undesired attention map. This would lead to the network learning to focus on erroneous areas of the image during training, which is avoided using the complementary guided attention loss $L_{cga}$. Proposed loss $L_{cga}$ is computed only for the normal images correctly classified by the classifier (i.e., if p = y = $c_n$). A loss $L_{cga,i}$ is defined as the complementary guided attention loss for each image, for the weakly supervised setting as follows:
Figure imgf000013_0001
Equation (10)
where $\mathbb{1}(\cdot)$ is an indicator function. The complementary guided attention loss $L_{cga}$ is the average of loss $L_{cga,i}$ over the N images. The final objective loss function $L_{final}$ is then defined as follows:
$L_{final} = w_r L + w_{adv} L_{adv} + w_c L_c + w_{cga} L_{cga}$
Equation (11)
where $L_c$ is the binary cross entropy loss of classifier 311, and $w_r$, $w_{adv}$, $w_c$, and $w_{cga}$ are empirically set as 1, 1, 0.001, and 0.01 respectively. During testing, trained pipeline 300 uses classifier 311 to predict the input image $x_{test}$ as anomalous or normal. The anomalous attention map $A_{test}$ of $x_{test}$ is computed when y = $c_a$. Anomaly localization applies the same evaluation method as described above for the unsupervised pipeline 100.
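The complementary guided attention loss and the weighted objective of Equation (11) can be sketched as follows. Since Equation (10) is not reproduced in the text, the per-image form used here, the mean of (1 - normal attention + anomalous attention) gated by the indicator p = y = c_n, is an assumption chosen to match the description: it shrinks the anomalous-class attention while expanding the normal-class attention, and only for correctly classified normal images. The class labels and function names are hypothetical; the default weights are the empirical values stated above.

```python
C_A, C_N = "anomalous", "normal"  # hypothetical class labels


def cga_loss(att_normal, att_anomalous, prediction, label):
    """Per-image complementary guided attention loss (assumed form).

    att_normal and att_anomalous are H x W attention maps in [0, 1] for the
    normal class and the anomalous class on the same image.
    """
    # Indicator: only correctly classified normal images contribute.
    if not (prediction == label == C_N):
        return 0.0
    total, count = 0.0, 0
    for row_n, row_a in zip(att_normal, att_anomalous):
        for a_n, a_a in zip(row_n, row_a):
            total += 1.0 - a_n + a_a
            count += 1
    return total / count


def final_weak_loss(l_vae, l_adv, l_c, l_cga,
                    w_r=1.0, w_adv=1.0, w_c=0.001, w_cga=0.01):
    """Weighted sum of Equation (11)."""
    return w_r * l_vae + w_adv * l_adv + w_c * l_c + w_cga * l_cga
```

Returning zero for misclassified or anomalous images mirrors the selective use of gradients described above, since such images then add nothing to the attention supervision.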
[0022] FIG. 5 illustrates an example of a computing environment within which embodiments of the present disclosure may be implemented. A computing environment 500 includes a computer system 510 that may include a communication mechanism such as a system bus 521 or other communication mechanism for communicating information within the computer system 510. The computer system 510 further includes one or more processors 520 coupled with the system bus 521 for processing the information. In an embodiment, computing environment 500 corresponds to system for performing the above described embodiments, in which the computer system 510 relates to a computer described below in greater detail.
[0023] The processors 520 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as described herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer. A processor may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field- Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth. Further, the processor(s) 520 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like. The microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. 
A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
[0024] The system bus 521 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer- executable code), signaling, etc.) between various components of the computer system 510. The system bus 521 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth. The system bus 521 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.
[0025] Continuing with reference to FIG. 5, the computer system 510 may also include a system memory 530 coupled to the system bus 521 for storing information and instructions to be executed by processors 520. The system memory 530 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 531 and/or random access memory (RAM) 532. The RAM 532 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The ROM 531 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 530 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 520. A basic input/output system 533 (BIOS) containing the basic routines that help to transfer information between elements within computer system 510, such as during start-up, may be stored in the ROM 531. RAM 532 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 520. System memory 530 may additionally include, for example, operating system 534, application modules 535, and other program modules 536. Application modules 535 may include aforementioned modules described for FIG. 1 and may also include a user portal for development of the application program, allowing input parameters to be entered and modified as necessary.
[0026] The operating system 534 may be loaded into the memory 530 and may provide an interface between other application software executing on the computer system 510 and hardware resources of the computer system 510. More specifically, the operating system 534 may include a set of computer-executable instructions for managing hardware resources of the computer system 510 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 534 may control execution of one or more of the program modules depicted as being stored in the data storage 540. The operating system 534 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.
[0027] The computer system 510 may also include a disk/media controller 543 coupled to the system bus 521 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 541 and/or a removable media drive 542 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive). Storage devices 540 may be added to the computer system 510 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire). Storage devices 541, 542 may be external to the computer system 510.
[0028] The computer system 510 may include a user interface module 560 to process user inputs from user input devices 561, which may comprise one or more devices such as a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 520. User interface module 560 also processes system outputs to user display devices 562, (e.g., via an interactive GUI display).
[0029] The computer system 510 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 520 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 530. Such instructions may be read into the system memory 530 from another computer readable medium of storage 540, such as the magnetic hard disk 541 or the removable media drive 542. The magnetic hard disk 541 and/or removable media drive 542 may contain one or more data stores and data files used by embodiments of the present disclosure. The data store 540 may include, but is not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like. Data store contents and data files may be encrypted to improve security. The processors 520 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 530. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
[0030] As stated above, the computer system 510 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 520 for execution, and may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 541 or removable media drive 542. Non-limiting examples of volatile media include dynamic memory, such as system memory 530. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 521. Transmission media may also take the form of acoustic or light waves, such as in radio wave and infrared data communications.
[0031] Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0032] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, systems, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.
[0033] The computing environment 500 may further include the computer system 510 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 573. The network interface 570 may enable communication, for example, with other remote devices 573 or systems and/or the storage devices 541, 542 via the network 571. Remote computing device 573 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 510. Network 571 links remote data sources to computing device 510. Remote sensing devices 574 (e.g., cameras) may generate the images used as inputs for the anomaly detection. Remote data repositories 575 may store images used for training the anomaly detection networks. When used in a networking environment, computer system 510 may include modem 572 for establishing communications over a network 571, such as the Internet. Modem 572 may be connected to system bus 521 via user network interface 570, or via another appropriate mechanism.
[0034] Network 571 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 510 and other computers (e.g., remote computing device 573). The network 571 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, Bluetooth, infrared, cellular networks, satellite, or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 571.
[0035] It should be appreciated that the program modules, applications, computer-executable instructions, code, or the like depicted in FIG. 5 as being stored in the system memory 530 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module. In addition, various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 510, the remote device 573, and/or hosted on other computing device(s) accessible via one or more of the network(s) 571, may be provided to support functionality provided by the program modules, applications, or computer-executable code depicted in FIG. 5 and/or additional or alternate functionality. Further, functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 5 may be performed by a fewer or greater number of modules at least in part by another module. In addition, program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer-to-peer model, and so forth. In addition, any of the functionality described as being supported by any of the program modules depicted in FIG. 5 may be implemented, at least partially, in hardware and/or firmware across any number of devices.
[0036] Computer system 510 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 510 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 530, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.
[0037] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like can be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”
[0038] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the illustrations can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Claims

What is claimed is:
1. A system for anomaly detection and localization in images, the system comprising: a processor; and a non-transitory memory having stored thereon an end-to-end convolutional pipeline comprising a generative adversarial network (GAN) based model executed by the processor, the model comprising: an encoder network trained to generate a latent space representation of an input image, and generate an attention map by backpropagating gradients using an objective function, a residual decoder network configured to generate a reconstructed image of the input image from the latent space representation, and a discriminator network configured to determine whether the reconstructed image is of a same distribution as the input image, wherein the objective function has an objective to sharpen the reconstructed image, and wherein given an anomalous input image, the pipeline is configured to invert the attention map to localize an anomalous region of the image.
2. The system of claim 1, wherein the GAN based model is configured as an adversarial variational autoencoder model.
3. The system of claim 1, wherein the pipeline is trained using only unsupervised non-anomalous images.
4. The system of claim 3, wherein a Grad-CAM algorithm is used to compute the attention map using gradient backpropagation.
5. The system of claim 3, wherein the objective is based on a binary cross entropy loss between the input image and the reconstructed image, a Kullback-Leibler divergence loss between the latent space representation and a normal distribution, and an adversarial loss between the input image and the reconstructed image.
6. The system of claim 1, further comprising a classifier network, wherein the pipeline is trained using weakly supervised anomalous and non-anomalous images.
7. The system of claim 6, wherein the attention map is computed by the encoder network using a selective gradient in which only gradients from a correct prediction of the classifier network are backpropagated.
8. The system of claim 6, wherein the objective is based on a binary cross entropy loss between the input image and the reconstructed image, a Kullback-Leibler divergence loss between the latent space representation and a normal distribution, an adversarial loss between the input image and the reconstructed image, and a loss result of the classifier network.
9. A method for anomaly detection and localization in images using a trained end-to-end convolutional pipeline comprising a generative adversarial network (GAN) based model, the method comprising: generating a latent space representation of an input image, generating an attention map by backpropagating gradients using an objective function, generating a reconstructed image of the input image from the latent space representation, and determining whether the reconstructed image is of a same distribution as the input image, wherein the objective function has an objective to sharpen the reconstructed image, and wherein responsive to receiving an anomalous input image, the pipeline inverts the attention map to localize an anomalous region of the image.
10. The method of claim 9, further comprising: training the pipeline using only unsupervised non-anomalous images.
11. The method of claim 10, wherein a Grad-CAM algorithm is used to compute the attention map using gradient backpropagation.
12. The method of claim 10, wherein the objective is based on a binary cross entropy loss between the input image and the reconstructed image, a Kullback-Leibler divergence loss between the latent space representation and a normal distribution, and an adversarial loss between the input image and the reconstructed image.
13. The method of claim 9, further comprising: training the pipeline using a classifier network with weakly supervised anomalous and non-anomalous images.
14. The method of claim 13, wherein the attention map is computed using a selective gradient in which only gradients from a correct prediction of the classifier network are backpropagated.
15. The method of claim 13, wherein the objective is based on a binary cross entropy loss between the input image and the reconstructed image, a Kullback-Leibler divergence loss between the latent space representation and a normal distribution, an adversarial loss between the input image and the reconstructed image, and a loss result of the classifier network.
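By way of non-limiting illustration only, the loss terms and attention-map operations recited in the claims above may be sketched in NumPy as follows. This sketch is not part of the claimed subject matter and is not taken from the disclosure: the function names, the loss weights `w_kl` and `w_adv`, the standard GAN form of the adversarial term, the scaling of the attention map to [0, 1], and the use of `1 - map` as the inversion are all assumptions made for the example. The weakly supervised variant would add a classifier loss term, which is omitted here.

```python
import numpy as np

EPS = 1e-7  # numerical guard for logarithms


def bce(target, prob):
    """Mean binary cross entropy between targets in [0, 1] and probabilities."""
    prob = np.clip(prob, EPS, 1.0 - EPS)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over latent dims, averaged over the batch."""
    return float(np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)))


def adversarial_loss(d_real, d_fake):
    """Assumed standard GAN form: discriminator scores in (0, 1), real -> 1, reconstructed -> 0."""
    return bce(np.ones_like(d_real), d_real) + bce(np.zeros_like(d_fake), d_fake)


def total_objective(x, x_hat, mu, logvar, d_real, d_fake, w_kl=1.0, w_adv=1.0):
    """Reconstruction + KL + adversarial terms; the weights are hypothetical."""
    return (bce(x, x_hat)
            + w_kl * kl_to_standard_normal(mu, logvar)
            + w_adv * adversarial_loss(d_real, d_fake))


def grad_cam(activations, gradients):
    """Grad-CAM for one image: activations and gradients both have shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))  # one weight per channel (global average pooling)
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # ReLU of weighted sum
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # scale to [0, 1] (assumption)


def keep_correct_gradients(gradients, predicted, labels):
    """Zero the per-sample gradients (N, C, H, W) whose classifier prediction is wrong."""
    mask = (predicted == labels).astype(gradients.dtype)
    return gradients * mask[:, None, None, None]


def invert_attention(cam):
    """Complement of a [0, 1] attention map; high values mark candidate anomalous regions."""
    return 1.0 - cam
```

In such a sketch, `grad_cam` would be applied to the encoder's feature maps and the gradients backpropagated to them, `keep_correct_gradients` would be applied first in the weakly supervised case, and `invert_attention` would be applied at inference time to the resulting map for an anomalous input image.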
PCT/US2020/052686 2019-09-25 2020-09-25 Unsupervised and weakly-supervised anomaly detection and localization in images WO2021062133A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962905447P 2019-09-25 2019-09-25
US62/905,447 2019-09-25

Publications (1)

Publication Number Publication Date
WO2021062133A1 (en) 2021-04-01

Family

ID=72812010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/052686 WO2021062133A1 (en) 2019-09-25 2020-09-25 Unsupervised and weakly-supervised anomaly detection and localization in images

Country Status (1)

Country Link
WO (1) WO2021062133A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018192672A1 (en) * 2017-04-19 2018-10-25 Siemens Healthcare Gmbh Target detection in latent space

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SAMET AKCAY ET AL: "GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 May 2018 (2018-05-17), XP081425566 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108122A1 (en) * 2020-10-02 2022-04-07 Element Ai Inc. Systems and computer-implemented methods for identifying anomalies in an object and training methods therefor
US11670072B2 (en) * 2020-10-02 2023-06-06 Servicenow Canada Inc. Systems and computer-implemented methods for identifying anomalies in an object and training methods therefor
US11989939B2 (en) 2021-03-17 2024-05-21 Samsung Electronics Co., Ltd. System and method for enhancing machine learning model for audio/video understanding using gated multi-level attention and temporal adversarial training
CN113139974B (en) * 2021-04-13 2023-08-22 广东工业大学 Focus segmentation model training and application method based on semi-supervised learning
CN113139974A (en) * 2021-04-13 2021-07-20 广东工业大学 Focus segmentation model training and application method based on semi-supervised learning
CN113572539A (en) * 2021-06-24 2021-10-29 西安电子科技大学 Storage-enhanced unsupervised spectrum anomaly detection method, system, device and medium
CN114092856A (en) * 2021-11-18 2022-02-25 西安交通大学 Video weak supervision abnormity detection system and method of confrontation and attention combined mechanism
CN114092856B (en) * 2021-11-18 2024-02-06 西安交通大学 Video weak supervision abnormality detection system and method for antagonism and attention combination mechanism
CN114117333A (en) * 2022-01-20 2022-03-01 南湖实验室 Countermeasure reconstruction network design, training method and detection method for anomaly detection
CN115345238B (en) * 2022-08-17 2023-04-07 中国人民解放军61741部队 Method and device for generating seawater transparency fusion data
CN115345238A (en) * 2022-08-17 2022-11-15 中国人民解放军61741部队 Method and device for generating seawater transparency fusion data
WO2024102565A1 (en) 2022-11-11 2024-05-16 Siemens Corporation System and method for joint detection, localization, segmentation and classification of anomalies in images
EP4375920A1 (en) * 2022-11-22 2024-05-29 Toyota Jidosha Kabushiki Kaisha Anomaly evaluation method and system, anomaly threshold determination method, computer program(s) and non-transitory computer readable medium
CN116343200A (en) * 2023-05-29 2023-06-27 安徽高哲信息技术有限公司 Abnormal grain detection method, abnormal grain detection device, computer readable medium and computer equipment
CN116343200B (en) * 2023-05-29 2023-09-19 安徽高哲信息技术有限公司 Abnormal grain detection method, abnormal grain detection device, computer readable medium and computer equipment

Similar Documents

Publication Publication Date Title
WO2021062133A1 (en) Unsupervised and weakly-supervised anomaly detection and localization in images
EP3655923B1 (en) Weakly supervised anomaly detection and segmentation in images
US11074687B2 (en) Deep convolutional neural network with self-transfer learning
US10755147B2 (en) Classification and localization based on annotation information
US20190122104A1 (en) Building a binary neural network architecture
US11216927B2 (en) Visual localization in images using weakly supervised neural network
US10755140B2 (en) Classification based on annotation information
US10885317B2 (en) Apparatuses and methods for recognizing object and facial expression robust against change in facial expression, and apparatuses and methods for training
US20200012904A1 (en) Classification based on annotation information
US20180189610A1 (en) Active machine learning for training an event classification
US20230021661A1 (en) Forgery detection of face image
US20190164057A1 (en) Mapping and quantification of influence of neural network features for explainable artificial intelligence
US11545266B2 (en) Medical imaging stroke model
US11331056B2 (en) Computed tomography medical imaging stroke model
US20230326195A1 (en) Incremental learning for anomaly detection and localization in images
WO2020097461A1 (en) Convolutional neural networks with reduced attention overlap
US11423262B2 (en) Automatically filtering out objects based on user preferences
US20210097678A1 (en) Computed tomography medical imaging spine model
Jeon et al. CutPaste-Based Anomaly Detection Model using Multi Scale Feature Extraction in Time Series Streaming Data.
US11842274B2 (en) Electronic apparatus and controlling method thereof
Wu et al. Pneumonia detection based on RSNA dataset and anchor-free deep learning detector
Wang et al. CD-GAN: A robust fusion-based generative adversarial network for unsupervised remote sensing change detection with heterogeneous sensors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20789385

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20789385

Country of ref document: EP

Kind code of ref document: A1