GB2542118A - A method, apparatus, system, and computer readable medium for detecting change to a structure - Google Patents

A method, apparatus, system, and computer readable medium for detecting change to a structure Download PDF

Info

Publication number
GB2542118A
Authority
GB
United Kingdom
Prior art keywords
images
change
changes
cnn
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1515742.3A
Other versions
GB2542118B (en)
GB201515742D0 (en)
Inventor
Stenger Bjorn
Gherardi Riccardo
Cipolla Roberto
Stent Simon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1515742.3A priority Critical patent/GB2542118B/en
Publication of GB201515742D0 publication Critical patent/GB201515742D0/en
Priority to JP2016165029A priority patent/JP6289564B2/en
Publication of GB2542118A publication Critical patent/GB2542118A/en
Application granted granted Critical
Publication of GB2542118B publication Critical patent/GB2542118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • G06T7/001Industrial image inspection using an image reference approach
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30108Industrial image inspection
    • G06T2207/30132Masonry; Concrete

Abstract

A two-channel convolutional neural network (CNN) is trained to compare first and second images of a structure (tunnel 110), where the images are obtained at first and second times separated by a time period (days, weeks, months etc.), and to detect differences between the pair of images that are due to changes in the structure, such as cracks or discolouration. The CNN is trained to distinguish between differences in the images that are due to structural changes, such as the widening of a crack, and differences that are not due to a change to the structure, such as different lighting conditions. The neural network provides a value indicative of the detected change in the structure and outputs a change map detailing the changes. This approach may be used to remotely inspect tunnels using a trolley 118 with an array of cameras 114 attached.

Description

A method, apparatus, system, and computer readable medium for detecting change to a structure
Field
This disclosure relates to change detection. In particular, but without limitation, this disclosure relates to the detection of temporal changes in a structure.
Background
Physical structures such as tunnels, bridges, dams, roads, and buildings can change over time. Some changes, such as a change of colour due to a watermark on a pipe, are not of concern to engineers. However, some changes, such as the appearance of a crack or a leak in a tunnel, are of great concern to engineers, and so structures may need to be regularly monitored in order to identify changes thereto. Visual inspection of a structure is a good way of identifying changes in that structure but can be highly labour intensive and susceptible to observer inconsistency.
An approach to reduce the labour intensive nature of manual inspection is to pass one or more image capture devices, such as a camera, along the structure so as to record the state of the structure during an initial time period. Images of the structure that are subsequently acquired can then be compared with the image data acquired during the initial time period.
Summary
Aspects and features of the invention are set out in the claims.
Brief description of the drawings
Examples of the present disclosure will now be described with reference to the accompanying drawings in which:
Figure 1 shows a cross-section through a tunnel lining in which an image capture device is positioned;
Figure 2 shows an exemplary block diagram of the macro components of the computer;
Figure 3 shows a flow diagram illustrating the steps of a method according to the present disclosure;
Figure 4 shows an overview of a machine vision system according to the present disclosure; in stage 4, changes are detected between registered sets of image mosaics captured at different times; the approach uses a two-channel Convolutional Neural Network (CNN) for change detection, and the network learns a model for the normal modes of image variation so as to detect abnormal changes with fewer false positives;
Figure 5 shows: (a) an array of randomly sampled 64 × 64 pixel patches from the dataset; (b) the same as (a), but each row contains 9 different viewpoints of the same unchanged patch, to illustrate the natural image variation and registration error; and (c) examples of changed patches, where the top rows are different viewpoints from tr and the bottom rows from tq;
Figure 6 shows a timeline and explains the datasets gathered for evaluation of the approaches described herein;
Figure 7 shows example training pairs for the neural networks described herein;
Figure 8 shows positive example images used for training the neural networks described herein;
Figure 9 shows the results of an evaluation of the change detection approach for the Dlong datasets; and
Figure 10 shows the results of an evaluation of the change detection approach for the Dshort datasets.
Detailed description
In the present disclosure, two images, for example images of a structure that have been taken during different time periods, are used to identify changes in the structure. This is achieved by using a neural network that has been trained to identify changes in images. The neural network has a CNN component which, when compared to a conventional fully connected neural network, is much less computationally intensive to use. The neural network is trained to be indifferent to differences in the images that are not due to change to the structure - for example as may arise from using a different camera to acquire the images, or from changes in illumination intensity. As an example, pairs of images that are acquired during the same time period, but using different cameras, may be used to train the neural network to be indifferent to such changes. As changes in structures occur only rarely, and so example images that demonstrate change are scarce, synthetic changes may be used to train the neural network to identify changes.
Figure 1 shows a cross-section through a tunnel lining 110 in which an example image capture device 112 is positioned. The image capture device 112 comprises a plurality of cameras 114 that are mounted to a body 116 of the image capture device 112 and which are arranged so as to capture overlapping images of the tunnel lining 110 when the image capture device 112 is present within the tunnel lining. The image capture device 112 further comprises a flat-bed trolley 118 upon which the image capture device 112 may ride so as to move longitudinally along the tunnel lining 110 thereby enabling the capture of images that overlap in both radial and longitudinal directions. The image capture device 112 further comprises a memory and communication module 120 that is arranged to record the captured images and subsequently communicate them wirelessly to a computer 122.
Figure 2 shows an exemplary block diagram of the macro components of the computer 122. The computer 122 comprises a micro-processor 210 arranged to execute computer readable instructions as may be provided to the computer 122 via one or more of: a network interface 212 arranged to enable the micro-processor 210 to communicate with an external network - for example the internet; a wireless interface 214; a plurality of input interfaces 216 including a keyboard, a mouse, a disk drive and a USB connection; and a memory 218 that is arranged to be able to retrieve and provide to the micro-processor 210 both instructions and data that have been stored in the memory 218. Further, the micro-processor 210 is coupled to a monitor 220 upon which a user interface may be displayed and further upon which the results of processing operations may be presented.
During operation, the image capture device 112 is traversed along the tunnel lining 110 whilst images are acquired by the plurality of cameras 114 and stored in the memory and communication module 120. Subsequently, the images recorded on the capture device are transmitted to the computer 122 and stored in the memory 218 thereof. Following such an initial scan of the tunnel lining 110, during a subsequent time period, for example when it is deemed to be time to again inspect the tunnel lining, the image capture device 112 is again positioned within the tunnel lining 110 and one or more further images are acquired. The further images are transmitted to the computer 122 so that they can be compared with the initially acquired images in order to identify whether any changes to the tunnel lining 110 have occurred.
Differences between initially acquired and subsequently acquired images may, in addition to being due to an underlying change to the structure, also be due to a number of other factors, such as misalignment between the images (caused, for example, by the images having been taken from different positions, or by the images having been taken at different time points but not having been properly aligned) and differences in the direction and strength of the illuminant used during image capture - as may occur when different lighting rigs are employed, when a flash bulb fades during its lifetime, or when different flash bulbs produce different amounts of light - which can result in different shading being present in different images. The approaches described herein use a trained CNN to identify changes in the structure despite the presence of image differences that are not due to changes in the structure.
Figure 3 shows a flow diagram illustrating the steps of a method according to the present disclosure. At step S310 the image capture device 112 is traversed along the structure (in this case the tunnel lining 110), during which a first set of images are captured by the plurality of cameras 114 and conveyed to and received by the computer 122. The first set of images are captured during (and therefore associated with) a first time period and represent a recordal of the state of the tunnel lining 110 at the first time. The first set of images are acquired with different ones of the plurality of cameras 114. Parts of some of the images of the first set will overlap and, where they do, multiple images will capture the same part of the tunnel (or structure). However, due at least to inconsistencies in camera calibrations, the portions of images that are of the same part of the tunnel are likely to be different - even though they were acquired at the same time.
Subsequently, the image capture device 112 again traverses along the structure and captures a second set of images during a second time period that represents a recordal of the state of the tunnel lining 110 during the second time period so that the second set of images are associated with the second time period. The second set of images are then conveyed to and received by the computer 122.
At step S312, the first set of images are processed relative to one another in chunks associated with sections of the traversal of the image capture device 112 within the structure. In particular, structure from motion analysis is used to return point clouds and camera pose estimations. The same is also done for the second set of images. Point clouds associated with chunks of the first set of images are then rigidly registered to point clouds associated with chunks of the second set of images so as to provide a coarse alignment between images of the first and second sets of images. In this case, a Procrustes registration approach is employed, although other registration approaches - whether chunk-based or otherwise - could equally be employed. At step S314, the images of each chunk of the second set of images are transformed by the results of the registration to form a transformed image set, which is received within the computer 122. As the second set of images is associated with the second time period, so is the transformed image set.
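By way of illustration only, the following is a minimal sketch of the kind of Procrustes (similarity-transform) alignment described above, assuming NumPy and point-cloud chunks that are already in correspondence; the function name and interface are hypothetical rather than taken from this disclosure.

```python
import numpy as np

def procrustes_align(source, target):
    """Estimate a similarity transform (scale, rotation, translation) that
    maps `source` points onto `target` points via orthogonal Procrustes.
    Both arrays are N x 3 and assumed to be in point-wise correspondence."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    # SVD of the cross-covariance gives the optimal rotation.
    U, S, Vt = np.linalg.svd(tgt.T @ src)
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (src ** 2).sum()
    t = mu_t - scale * R @ mu_s
    return scale, R, t

# A chunk is then transformed with: aligned = scale * points @ R.T + t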
In order to enable observers to visualize the aligned image sets, the transformed images are mosaiced into a single image. This is achieved by making a geometrical assumption about the shape of the surface of the structure (a cylinder in the case of tunnels) and projecting the transformed images onto the surface before blending them together.
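As a hedged illustration of the geometrical assumption just described, the sketch below maps 3D points near a cylindrical tunnel surface to 2D mosaic coordinates (arc length around the lining, distance along the axis); the helper and its parameters are assumptions for illustration, not part of this disclosure.

```python
import numpy as np

def cylinder_uv(points, axis_origin, axis_dir, radius):
    """Project points assumed to lie near a cylinder of the given radius onto
    2D mosaic coordinates: u = arc length around the lining, v = distance
    along the tunnel axis."""
    a = axis_dir / np.linalg.norm(axis_dir)
    # Build an orthonormal basis (e1, e2) for the plane perpendicular to the axis.
    e1 = np.cross(a, [0.0, 0.0, 1.0])
    if np.linalg.norm(e1) < 1e-6:              # axis happened to be vertical
        e1 = np.cross(a, [0.0, 1.0, 0.0])
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(a, e1)
    rel = points - axis_origin
    v = rel @ a                                # longitudinal coordinate
    theta = np.arctan2(rel @ e2, rel @ e1)     # angular position around the lining
    return np.stack([radius * theta, v], axis=1)
```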
At step S316, an image from the first set of images and a spatially corresponding image selected from the transformed image set are provided as first and second channel inputs to a two-channel CNN. In order to select a spatially corresponding image, the transformed image set is searched for an image that overlaps the image from the first set of images; optionally, the search may look for the image that has the greatest overlap with the image from the first set of images. Consequent to the provision of the first and second channel inputs, the CNN outputs a change mask indicative of the presence or absence of a change to the structure between the first and second time periods. As one possibility, the change mask is a binary array the same size as one or both of the images that were used as the first and second channel inputs and indicates, on a pixel-by-pixel basis, the presence of a change with a '1' and the absence of a change with a '0' (or vice versa). In cases where the images that were used as the first and second channel inputs only partially overlap or are of differing sizes, the change mask may be arranged so as to indicate the presence or absence of a change relative to one of the images that were used as the first and second channel inputs.
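A minimal sketch of how such a change mask might be assembled from per-patch classifications is given below, assuming PyTorch, a registered grey-scale image pair as NumPy arrays, and a model that maps a (N, 2, 64, 64) tensor to changed/unchanged logits; the helper name and the per-patch normalisation details are assumptions.

```python
import numpy as np
import torch

def change_mask(model, img_a, img_b, patch=64):
    """Tile a registered image pair into 64x64 patches, classify each patch
    pair with the two-channel CNN, and write the result into a binary mask."""
    h, w = img_a.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    model.eval()
    with torch.no_grad():
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                pair = np.stack([img_a[y:y + patch, x:x + patch],
                                 img_b[y:y + patch, x:x + patch]]).astype(np.float32)
                pair = (pair - pair.mean()) / (pair.std() + 1e-6)  # zero mean, unit variance
                logits = model(torch.from_numpy(pair)[None])       # shape (1, 2)
                if logits.argmax(dim=1).item() == 1:               # class 1 = changed
                    mask[y:y + patch, x:x + patch] = 1
    return mask
```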
Optionally, at step S318, one or both of the images that were used as the first and second channel inputs is classified as either being associated with a change to the structure or not being associated with a change. Those images that are so classified may then be manually inspected. As a further possibility, the CNN may be trained to provide different outputs in the mask that are indicative of different types of change. For example, the CNN could be trained using synthetic crack images to provide a value in the mask indicative of a crack change and also trained using synthetic discolouration images to provide a value in the mask indicative of a discoloration change.
The CNN architecture that is used employs a two-channel approach wherein the first layer is a convolutional layer whose filters are arranged to operate on the pixels of the images of both the first and second channel inputs. Optionally, the first convolutional layer is followed by three further convolutional layers, and the depths of the four layers may respectively be 32, 64, 128, and 512. Optionally, the convolutional layer (or layers) are followed by two fully connected layers, which may be of depth 512 and may be followed by a softmax layer to classify the input pair between changed and unchanged states. The first three convolutional layers may each be followed by 2 × 2 max pooling, and all hidden layers may be constrained by a ReLU non-linearity. The filters of the first layer may be 7 × 7 × 2 pixel filters operating directly on 64 × 64 pixel gray-scale patch inputs of both input channels, wherein the patch inputs are normalised to have zero mean and unit variance.
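For concreteness, a sketch of this optional architecture in PyTorch follows. Only the channel depths (32, 64, 128, 512), the 7 × 7 × 2 first-layer filters, the 2 × 2 max pooling after the first three convolutional layers, the ReLU non-linearities, the two depth-512 fully connected layers, the 50% dropout mentioned later in the examples, and the softmax classification come from this description; the kernel sizes of layers two to four and the padding scheme are assumptions.

```python
import torch
import torch.nn as nn

class TwoChannelChangeNet(nn.Module):
    """Sketch of the two-channel change detection CNN described above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=7, padding=3), nn.ReLU(),   # 7x7x2 filters on the patch pair
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # kernel size assumed
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 512 x 8 x 8 feature map for 64x64 inputs
            nn.Linear(512 * 8 * 8, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 2),                            # softmax over changed/unchanged at inference
        )

    def forward(self, x):          # x: (N, 2, 64, 64) normalised patch pairs
        return self.classifier(self.features(x))
```

At inference, `torch.softmax(model(x), dim=1)` then yields the changed/unchanged probabilities for a batch of normalised patch pairs.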
As one possibility, in order to train the CNN to indicate in the change mask the absence of a change, the CNN is provided with pairs of images (negative training images) that were captured during the same time period (a common time period) - for example, images that overlap and were captured by adjacent cameras during a traversal of a tunnel. As the images were captured during the same time period, any differences in a part of the structure that is imaged will not be due to change and will instead be due to other factors - for example differences in camera calibration, sensor response, illumination angle, etc. Such an approach therefore helps to reduce the sensitivity of the CNN to image differences that are not due to change to the structure.
As one possibility, in order to train the CNN to indicate in the change mask the presence of a change, the CNN is provided with pairs of images (positive training images) for which one of the images has been modified so as to simulate a change. For example, the appearance of, widening of, and/or extension of a crack, and/or the appearance or enlargement of a watermark or area of discolouration may be simulated in the modified image. Modification can be performed to additionally or alternatively simulate defacing, marking from an engineer, maintenance stickers, spalling, dirt, vegetation/mould growths, leaks, insects, footprints etc. The simulation may further comprise the application of a translation, rotation, flipping, texture, noise, illumination gradient, or illumination bias to the image, and/or the blending of the image with a background image. Figure 8 shows example images on the first row that were then modified by the addition of a simulated change (second row) along with, on the third row, difference images of the first two rows. The direction and size of simulated changes may be determined using a random or pseudo-random number generator, or another approach such as a fractal Brownian motion simulator. A benefit of using simulated changes is that the incidence of changes in real data may be very small. For example, imaging of a very large structure that might result in, say, twelve million images might only produce a few thousand images that have changes - which might not be enough to train a neural network.
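The sketch below illustrates, under stated assumptions, how a positive training pair might be synthesised from a single unchanged patch: a simulated change is blended in and nuisance transforms (translation, illumination gain and gradient, noise) are applied. The blending weight, jitter ranges, and function interface are all assumptions for illustration.

```python
import numpy as np

def simulate_change_pair(patch, change_mask, rng):
    """Create a (before, after) positive training pair from one unchanged
    patch (float array in [0, 1]) by blending in a simulated change and then
    applying nuisance transforms that mimic capture-to-capture variation."""
    before = patch.copy()
    after = patch.copy()
    after[change_mask > 0] *= 0.5                       # darken pixels under the simulated change
    dx, dy = rng.integers(-7, 8, size=2)                # small mis-registration
    after = np.roll(after, (dy, dx), axis=(0, 1))
    after *= rng.uniform(0.8, 1.2)                      # illumination gain
    after += np.linspace(-0.05, 0.05, after.shape[1])   # illumination gradient
    after += rng.normal(0.0, 0.01, after.shape)         # sensor noise
    return before, np.clip(after, 0.0, 1.0)
```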
Where mention has been made herein of time periods, it will be appreciated that a given time period may be so short as to relate to a single point in time only - as may be the case when a plurality of images are acquired instantaneously, but may span a number of minutes, hours, or even days - in order to reflect the amount of time taken for images to be acquired of a large structure - for example a tunnel that is tens of kilometres long or more. Furthermore, there will generally be a time gap between the first and second time periods - for example, the second time period may follow the first time period after a time gap of a day or less in cases where rapid structure change is expected or weeks, months, or even years in other cases.
Although the above has been described with reference to the CNN being provided with an image from the first set of images and a spatially corresponding image from the transformed image set, the CNN could equally be provided with an image from each of the first and second image sets, or simply with two images of a structure that have been acquired at different times.
Although the above has been described with reference to the images of the second set of images being transformed so as to form the transformed image set, as another possibility, the first set of images could instead be transformed and the CNN provided with an image from the second image set and a spatially corresponding image from the transformed image set.
Although the above has been described with reference to tunnels, the approaches described herein could equally be applied to other types of structure, for example aqueducts, roads, dams, and bridges.
The approaches described herein may be employed with images that are acquired in manners other than by passing a flat-bed trolley along a tunnel - for example one or more cameras may be suspended from a monorail or floated on a barge. Likewise, although the above has described the registration of chunks of image sets, different registration approaches could equally be employed, and the registration stage may even be omitted. Furthermore, as the images may be acquired by one party - for example a contractor assigned with collecting images of a sewer - before then being processed by a second party, some of the approaches described herein may be performed by a party without that party acquiring the images.
The above has been described in relation to images that have been acquired using a camera. As such the images may have been acquired in the human visible spectrum, and/or they may include light that was acquired beyond the range of human visibility, for example infrared or thermal images (possibly with a compensation applied for the expected temperature of the structure at the time of acquisition). As one possibility, the images could have been obtained using one or more gamma cameras or Geiger counters. In situations where there is not sufficient ambient light for the camera(s) to acquire images on their own, the cameras may be provided with one or more light sources, for example a permanent light or a timed flash.
Although the above has been described by way of example with reference to a tunnel lining, the approaches described herein could also be applied to other structures, including, but not limited to, bridges, dams, roads, and buildings.
Although the above has been described with reference to Figure 1, in which an image capture device comprises a plurality of cameras mounted on a flat-bed trolley, the approaches described herein could be applied to images acquired using other image capture and/or creation devices. Further, although the image capture device of Figure 1 has been described as being arranged to record the captured images and subsequently communicate them wirelessly to a computer, communication to the computer could be by other means, for example by way of a cable transfer and/or the physical transfer of a computer readable medium.
There is described herein the training of a two channel CNN to distinguish between differences between a pair of images of a structure that are due to changes in the structure and differences that are not due to a change to the structure. The neural network is then applied to images of a structure in order to identify changes in that structure.
The approaches described herein may be embodied on a computer readable medium, which may be a non-transitory computer readable medium, the computer readable medium carrying computer readable instructions arranged for execution upon a processor so as to cause the processor to carry out any or all of the methods described herein.
The term computer readable medium as used herein refers to any medium that stores data and/or instructions for causing a processor to operate in a specific manner. Such a storage medium may comprise non-volatile media and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory. Exemplary forms of storage medium include, a floppy disk, a flexible disk, a hard disk, a solid state drive, a magnetic tape, any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with one or more patterns of holes or protrusions, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, and any other memory chip or cartridge.
Where mention is made of change to a structure, the change may be within the structure itself - such as a crack that permeates through a volume of the structure, or may be on the structure - such as a discolouration or deposit or other accumulation on the surface of the structure.
Further, non-limiting, examples are described below.
Examples
There is described herein a system for the detection of changes in multiple views of a tunnel surface. From data gathered by a robotic inspection rig, a structure-from-motion pipeline is used to build panoramas of the surface and register images from different time instances. Reliably detecting changes such as hairline cracks, water ingress and other surface damage between the registered images is a challenging problem: achieving the best possible performance for a given set of data has previously required sub-pixel precision and careful modelling of the noise sources. The task is further complicated by factors such as unavoidable registration error and changes in image sensors, capture settings and lighting.
The approach described herein is to detect change using a two-channel CNN. The network accepts pairs of approximately registered image patches taken at different times and classifies them to detect anomalous changes. To train the network, advantage is taken of synthetically generated training examples and the homogeneity of the tunnel surfaces to eliminate most of the manual labelling effort. The method is evaluated on field data gathered from a live tunnel over several months, demonstrating it to outperform existing approaches.
1 Introduction
The issue of change detection between pairs of images taken at different times by a moving camera is addressed herein. The motivation is the development of a non-contact inspection system to be used for detecting anomalous visual changes on surfaces, and in particular tunnel linings; the approach is summarised in Figure 4. This application is of increasing social importance as infrastructure ages and requires more efficient maintenance than existing, frequently labour-intensive, methods can provide. The issue is challenging for several reasons:
i) Size and nature of changes. Changes of interest are often small and subtle - e.g. an increase in the width of a hairline crack or a patch of discolouration caused by water ingress, organic growth, rusting, and/or concrete spalling. This property emerges from the nature of the change detection problem: as the period over which change is measured decreases, any algorithm is pushed against the intrinsic limits set by image resolution and sensor noise. In the datasets examined here, fewer than 0.07% of the pixels were labelled as changes of interest, and in a different scenario the ratio could be several orders of magnitude lower. Furthermore, while certain changes such as cracks are known in advance and may be explicitly detected, others may be too infrequent for explicit modelling and only detectable as anomalous to natural modes of image variation.
ii) Nuisance factors. A sizeable proportion of the observed change over time is caused by nuisance factors, either internal to the acquisition system (such as different image sensors, capture settings or lighting setup) or due to external causes (for example, seasonal changes of temperature and humidity). While tunnels are relatively static in comparison to other environments such as outdoor scenes, external conditions such as humidity and dust levels can cause sufficient variation in visual appearance to shroud more important structural changes of interest. Figure 5(b) illustrates the variation in appearance of a random set of corresponding unchanged image patches taken at different times and under different conditions.
iii) Registration error. Achieving the pixel-accurate registration required for change detection is challenging because neither the sensor position nor the tunnel geometry can be reliably determined. Inaccurate or un-modelled geometry causes parallax errors when images are re-projected; in addition, a blanket change across the scene - caused for example by a change in tunnel humidity level - can make feature-based registration of any single image impossible.
The approach described herein circumvents the need for both finer registration and hand-crafted insensitivity to nuisance sources through machine learning. In the approach, a trained two-channel CNN takes as input a pair of image patches and returns a measure of dissimilarity or change. CNNs have recently been shown to be very effective at learning invariance to certain modes of image variability. They require, however, large amounts of labelled image data. By taking registered viewpoints from different cameras at the same time, near unlimited access to negative pairs (i.e. patches where no abnormal change has occurred) is provided. This can be supplemented with a smaller dataset of negative pairs across the different test times from regions where no changes of interest have occurred. This requires a limited effort in coarsely labelling a small subset of the test data. Together, these negative pairs capture much of the natural nuisance variation from lighting, registration errors and camera pose variation. For the positive (changed) pair generation, randomly sampled pairs are used as well as synthetically generated changes. The homogeneity of the tunnel environment - illustrated by Figure 5(a) - allows a network to generalize well from a manageable amount of labelled ground-truth.
The approach was evaluated using three sets of data from a live tunnel captured at different times. A trained inspector was tasked with simulating real changes in the tunnel between captures, and a set of ground truth change images was generated for testing. We compare against a known implementation and against the results of a manual inspection carried out by a second trained inspector in the field. The latter is of particular importance to industry, as it is commonly still the method of choice for tunnel inspection. To our knowledge, this is the first comparison of this kind reported.
2 Background
There follows a definition of the problem of change detection for multi-view surface inspection. Given a reference image Ir and a query image Iq taken of a surface from different positions and under different imaging conditions at times tr and tq respectively, a binary change mask, C, is sought which is 1 at every position in Iq that has undergone a change of interest and 0 elsewhere. In practice, it is assumed that the two images have been registered into a common 2D coordinate frame using a surface model of the scene, acquired in this case via surface fitting on geometry recovered from Structure-from-Motion (SfM).
The problem of change detection is then to determine:

P(C(p) = 1 | Ir(p), Iq(p)) = f(Ir(p), Iq(p))    (1)

for any pixel or patch of pixels p. The function f is a measure of change between the two image patches and can either be designed using domain knowledge or learned from a given dataset. The definition of change is always problem-specific; in this approach, local changes in the state of the surface such as cracks, water ingress, rust and surface damage are sought.
In many situations, including that of structure change detection, pixel-accurate registration is very difficult to achieve. In urban change detection, for example, camera pose, geometry and radiometric variation are often quite severe. While the approach described herein may, as a pre-processing step, use a geometric model for approximate registration, any need for finer registration or radiometric correction is sidestepped by using a CNN trained to detect unnatural changes between pairs of coarsely registered image patches. In particular, similarity functions f are learned in order to classify image patches using, for example, 64 × 64 pixel patches. The CNN is trained directly on a mixture of task and synthetic data to detect change. As one possibility, additional patches from larger scales are not incorporated in separate input channels, since by design all of the employed patch pairs have similar sizes (corresponding approximately to 20 × 20 mm).
3 System description
An outline of the main steps of the approach that may accompany the change detection stage will now be described with reference to Figure 4.
Image Capture. In stage 1, overlapping 360 degree rings of images are gathered by an autonomous calibrated camera system running along a monorail. The images are taken using polarised lighting and orthogonally polarised lens filters, to remove or attenuate image variation modes due to scene specularities.
Reconstruction and registration. Images from different times are processed independently via Structure-from-Motion (SfM) to return sparse point clouds (side views shown) and camera pose estimates in stage 2. The data is processed in overlapping parallel subsets corresponding to approximately 3 metre long sections. The pipeline of choice for 3D reconstruction is a visual structure from motion system using accelerated scale invariant feature transform features for matching, with ring-closure checks added to ensure complete reconstructions. Rings of images are treated independently given their immediate neighbouring rings, guaranteeing both efficiency and robustness during reconstruction. Neighbouring reconstructed subsets are registered across time in a piece-wise rigid fashion, using a similarity transform estimated via Procrustes alignment on a subset of confident feature correspondences. This global alignment on a large set of images ensures that single images can still be successfully registered even in the presence of large changes in appearance.
Mosaicing for visualisation. A surface model is then estimated for each reconstructed subset from tr using a cylindrical assumption. Points which lie close to the surface are projected directly onto the surface and individual camera poses are refined (resectioned) to reduce mosaic registration error. A mosaic is obtained by projecting all the images onto the surface model and blending them together. This can result in ghosting artefacts for areas which are off-surface but otherwise produces results which are sufficiently accurate for visual inspection of pixel-wide (0.3mm wide) cracks.
4 Change Detection Method
For change detection, a second set of mosaics is generated by dividing the mosaicing area into 64 × 64 pixel patches, and then for each patch projecting only the image from the nearest camera. Doing so achieves two goals: firstly, within each block the patches are free from compositing artefacts; secondly, it avoids the computational cost required to process all the available overlapping image pairs independently.
The CNN architecture is a two-channel approach that comprises four convolution layers of depths 32, 64, 128 and 512, and two fully connected layers of depth 512, with a softmax layer to classify the input pair between changed and unchanged states. The first three convolution layers are followed by 2 × 2 max pooling, and all hidden layers by a ReLU non-linearity. The input is two-channel, with the first layer of 7 × 7 × 2 pixel filters operating directly on both 64 × 64 pixel gray-scale patch inputs, normalised to have zero mean and unit variance. This may be preferable in practice to maintaining channel separation until a deeper layer; one likely reason is that high-frequency information can be immediately compared between the patches, providing valuable similarity information that might otherwise be lost through pooling.
4.1 Synthetic Crack Generation
Synthetic crack images are generated for training by blending real image patches with a crack mask. Each mask is created by randomly sampling a small set of crack support points within a region encompassing the image patch. A minimum spanning tree is formed over the support points, and branches from the tree are recursively subdivided to generate new support points, each of which is perturbed randomly according to a pre-generated Perlin noise map. The resulting crack map is rasterised, with width determined by a second Perlin noise map, resulting in a realistic random crack image generator.
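A minimal sketch of this generator is given below, assuming NumPy and SciPy. It follows the steps above (support points, minimum spanning tree, recursive subdivision), but simplifies two details: Gaussian jitter stands in for the pre-generated Perlin noise maps, and the rasterised crack is one pixel wide rather than having a noise-driven width.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def synthetic_crack_mask(size=64, n_support=5, depth=3, rng=None):
    """Generate a binary crack mask: sample support points, join them with a
    minimum spanning tree, recursively subdivide and perturb each branch,
    then rasterise the resulting polyline into the patch."""
    rng = rng or np.random.default_rng()
    pts = rng.uniform(-8, size + 8, size=(n_support, 2))  # region slightly larger than the patch
    mst = minimum_spanning_tree(squareform(pdist(pts))).toarray()
    segments = [(pts[i], pts[j]) for i, j in zip(*np.nonzero(mst))]
    for _ in range(depth):                                # recursive subdivision
        refined = []
        for a, b in segments:
            jitter = rng.normal(0.0, np.linalg.norm(b - a) / 8, size=2)
            mid = (a + b) / 2 + jitter                    # perturbed new support point
            refined += [(a, mid), (mid, b)]
        segments = refined
    mask = np.zeros((size, size), dtype=np.uint8)
    for a, b in segments:                                 # rasterise at single-pixel width
        n_steps = int(np.linalg.norm(b - a)) * 2 + 2
        for t in np.linspace(0.0, 1.0, n_steps):
            x, y = (a + t * (b - a)).astype(int)
            if 0 <= x < size and 0 <= y < size:
                mask[y, x] = 1
    return mask
```

5 Datasets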
Testing. Data was gathered and processed from the field to produce two different test datasets, with the schedule detailed in Figure 6, which shows the timeline and datasets gathered for evaluation. Artificial changes such as cracks, leaks, rust and stickers were applied to the tunnel surface before the capture of the query image sets. Some examples are shown in Figure 5(c). The changes were applied by a professional inspector and designed to be as realistic as possible. 90 changes were applied in total (45 in each instance), covering altogether less than 0.07% of all mosaicked pixels in the test set.
The resulting change detection datasets, Dlong and Dshort, compare changes over two months and one day respectively. The one-day dataset, Dshort, is more amenable to automatic change detection, since within a shorter time frame the chance for new defects to appear, other than those purposely introduced as part of the test protocol, is lower. The changes applied in this instance were subtle and harder to detect for human observers, including variations in crack width and length. Dlong is a more challenging dataset, using a different camera and lighting setup and more realistic temporal change over two months. The changes here also include the appearance of new cracks, objects or defects.
Manual inspections were carried out by a second professional inspector before each capture of the query image sets. The inspector was informed of what kind of changes to be aware of before each test, and during the second inspection was allowed to consult his own notes from the first.
CNN Training. Taking a single corresponding pair of mosaic images from tr and tq as a training set, four separate networks were trained with the architecture described in section 4 from random initialisations, each using one of the training sets (i, ii, iv and v) from Table 1. The training sets were split equally into positive (changed) and negative (unchanged) samples, with negative samples reused across training sets (i-iv) for fairness of comparison, and to gauge the effect of using different strategies for positive pair sampling on the network's performance.
Table 1: CNN training sets used. (i-iv) compare the effect of different positive pair generation methods; (v) compares the effect of training set size vs (iv).
Figure 7 illustrates various sets of training pairs and their differences. To generate each column of negative (unchanged) pairs in (a), a random location was sampled and two overlapping image patches were drawn from each of the tr and tq image datasets. Ground truth is required to avoid sampling locations which have changed; to create it, the training mosaic is assigned coarse labels, which are collected into a discrete change mask. In particular, Figure 7 shows: Sample training pairs (rows 1+2) and their difference images (row 3) from different training sets: (a) negative (unchanged) pairs; (b) positive (changed) random pairs, with both members chosen randomly (TS-R); (c) semi-random positive pairs, combining (a) and (b) (TS-SR); (d) positive crack pairs, including crack appearance/disappearance, extension and widening (TS-C); (e) negative crack pairs (TS-C).
To generate each positive pair in (b), a new random location is chosen in each of the tr and tq image datasets and patches are extracted. The semi-random patches in (c) take half of the random patches from (b) and half of the negative patches from (a), thus ensuring that a positive sample is tied to every negative sample in the dataset. Finally, (d) and (e) are generated using the synthetic crack generator described in section 4.1. Either an image pair from (a) was taken and a crack added to one of the pair, or a single base image was used which was arbitrarily translated to generate two patches. The translation was drawn from a uniform distribution over ±7 pixels in x and y, empirically accounting for the majority of surface registration errors. The translation being known, the crack appearance in either of the images can be modified to simulate crack extension or widening.
Each network was trained identically until convergence of a log loss cost function on the softmax output. Stochastic gradient descent with momentum was used for optimisation, and 50% dropout was applied in the two fully connected layers to reduce overfitting. The networks were implemented in MatConvNet with CuDNN support.
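Although the networks described here were implemented in MatConvNet, the training recipe itself is framework-agnostic; the sketch below expresses it in PyTorch for illustration. The learning rate, momentum value and epoch count are assumptions (the text says only "until convergence"), and the dropout is assumed to live inside the model, as in the architecture sketch given earlier.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=0.01, momentum=0.9):
    """Train the two-channel CNN with log loss (cross-entropy over the
    softmax output) using stochastic gradient descent with momentum."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loss_fn = nn.CrossEntropyLoss()          # log loss on the softmax output
    model.train()
    for epoch in range(epochs):
        for pairs, labels in loader:         # pairs: (N, 2, 64, 64); labels: 0/1
            opt.zero_grad()
            loss = loss_fn(model(pairs), labels)
            loss.backward()
            opt.step()
```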
6 Evaluation and Discussion
Our method was compared against both the manual inspection results and a known approach modified to run on the high-resolution test datasets. In all methods, a geometric prior was employed to restrict change detection to segments of the image that lie on the tunnel surface.
Quantitative Evaluation. Figures 9 and 10 illustrate change detection performance over the two test datasets. The x-axis represents the False Positive Rate (FPR), the proportion of actual negatives which are incorrectly assigned as positive. The y-axis shows the average ratio of pixels in each ground truth change that were correctly labelled as having changed. This metric was chosen in order to fairly represent all changes and to be fair to the human inspector, since the distribution of the area of changes is broad - from very small and thin cracks to large leaks. Manual refers to the manual inspection by a trained inspector, which uncovered 29% of changes in Dshort and 58% in Dlong; RGB shows the performance of pixel-to-pixel absolute differencing; and the known method is applied using NCC windows of varying sizes from 5 × 5 to 15 × 15 pixels.
In both datasets, the CNN approach, even when trained in a naive manner, outperforms the existing methods by a significant margin. The RGB and NCC methods both require good registration, which is not equally reliable throughout the datasets - especially in Dlong, where the capture setup varied significantly. While the manual method outperforms ours at very low FPR, it is not possible to retrospectively trade off FPR for TPR, so its performance is bounded below what the CNN can achieve in theory.
Among the CNN methods, the performance difference between training with random or semi-random positive pairs is negligible (CNN-TS-R vs. CNN-TS-SR), but performance can be seen to improve when the data is augmented with synthetic crack data (CNN-TS-SM). This is especially true of Dshort, where 27% of changes involve cracks expanding or extending (vs 0% in Dlong). Increasing the size of the training set (from CNN-TS-SM to CNN-TS-LM) improves performance significantly in Dlong but has little effect in Dshort. One possible explanation is that Dlong, which was captured over a longer time period and with a different capture setup, contains more nuisance variation and thus benefits from a larger training set to learn from.
Table 2 shows the percentage of detected changes at different FPR thresholds for various methods. A detected change is defined as one containing >50% of positive pixels. The described approach shows significant improvement over known approaches in both datasets, and over manual inspection in Dshort, though manual inspection discovers more changes at very low FPR settings. It should be noted that not all false positives are strictly misclassifications; many correspond to real anomalous changes that were not part of the labelled changes of interest.
Table 2: Percentage of artificial changes detected by the compared systems at different false positive rates. Changes are considered detected if more than 50% of their pixels are positively labelled.
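As an illustration of this detection criterion, the following sketch counts ground-truth changes with more than 50% of their pixels positively labelled; the function and its inputs are assumptions, with `gt_regions` taken to be one boolean array per artificial change.

```python
import numpy as np

def detected_changes(pred_mask, gt_regions, threshold=0.5):
    """Return the proportion of ground-truth changes that are 'detected',
    i.e. have more than `threshold` of their pixels flagged as changed in
    the predicted binary change mask."""
    hits = 0
    for region in gt_regions:
        coverage = pred_mask[region].mean()   # fraction of region pixels flagged
        if coverage > threshold:
            hits += 1
    return hits / len(gt_regions)
```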
Qualitative Evaluation between Automated and Manual Approaches. Several more factors are noteworthy when comparing the tested approaches. (i) Time required. The manual inspections took 70 minutes for Dlong and 30 minutes for Dshort, with several additional hours required to process the results. While the automated processes were not run exclusively on the test datasets from end to end in a single stream, processing them would take an order of magnitude extra time on a single desktop machine without employing significant parallelisation. (ii) Objectivity. Despite the cost and time for processing, the automated approach has numerous advantages - the foremost being that it is completely objective. The approach does not suffer from inattentional blindness and can view every point in the tunnel at the same resolution. (iii) Scalability. The performance of the automated approach scales favourably with data size, as Figure 10 demonstrates. Manual inspection performance drops with scale, due to human fatigue over a repetitive task. (iv) Visualisation. Automation allows data to be visualised at any later date. In contrast, manual inspection notes are gathered by hand and typed up to computer, and are difficult to cross-reference across time.
7 Conclusions
In the above, a novel approach to change detection using a two-channel CNN has been presented and its favourable performance on field data versus competing solutions demonstrated.
The approach can be straightforwardly adapted to different textured surfaces and new scenarios with minimal manual training effort. It is also very efficient for processing data on the scale of a working system, where there may be kilometres of data to survey.

Claims (14)

Claims
1. A method for detecting change to a structure, the method comprising: receiving first and second images representative of at least a part of a structure, wherein the first and second images are respectively associated with first and second time periods; providing the first and second images as first and second channel inputs to a two channel Convolutional Neural Network (CNN) that has been trained, upon the provision of such images as first and second channel inputs, to output a change mask indicative of the presence or absence of change to the structure between the first and second time periods.
2. The method of any preceding claim, wherein the CNN has been trained to indicate in the change mask the absence of change using pairs of first and second negative training images, wherein the first and second negative training images of each pair of first and second negative training images are associated with a respective common time period.
3. The method of claim 2, wherein the first and second negative training images of each pair of first and second negative training images represent images acquired with respective different image acquisition devices.
4. The method of any preceding claim, wherein the CNN has been trained to indicate in the change mask the presence of change using pairs of first and second positive training images, wherein in one of the first and second positive training images of each pair of first and second positive training images one or more changes have been simulated.
5. The method of claim 4, wherein the one or more changes are one or more of: the appearance of a crack, the widening of a crack, the extension of a crack, the appearance of an area of discolouration, the enlargement of an area of discolouration, and/or a colour change to an area of discolouration.
6. The method of any preceding claim, wherein the CNN has four convolutional layers.
7. The method of claim 6, wherein the first convolutional layer comprises a plurality of filters that are each arranged to operate on both the first and second channel inputs.
8. The method of claim 6 or 7, wherein the CNN has two fully connected layers that follow the convolutional layers.
9. The method of any preceding claim wherein the change mask is indicative of the presence or absence of change to the structure on a pixel-by-pixel basis with respect to one of the first and second images.
10. The method of any preceding claim, wherein the second time period follows the first time period and is spaced apart from the first time period by a time gap.
11. The method of any preceding claim, wherein the structure is a tunnel.
12. A computer readable medium carrying machine readable instructions arranged, when executed by a processor, to cause the processor to carry out the method of any preceding claim.
13. An apparatus arranged to perform the method of any of claims 1 to 11.
14. A method, apparatus, or computer readable medium substantially as described herein and with reference to the accompanying drawings.
GB1515742.3A 2015-09-04 2015-09-04 A method, apparatus, system, and computer readable medium for detecting change to a structure Active GB2542118B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1515742.3A GB2542118B (en) 2015-09-04 2015-09-04 A method, apparatus, system, and computer readable medium for detecting change to a structure
JP2016165029A JP6289564B2 (en) 2015-09-04 2016-08-25 Method, apparatus and computer readable medium for detecting changes to structures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1515742.3A GB2542118B (en) 2015-09-04 2015-09-04 A method, apparatus, system, and computer readable medium for detecting change to a structure

Publications (3)

Publication Number Publication Date
GB201515742D0 GB201515742D0 (en) 2015-10-21
GB2542118A true GB2542118A (en) 2017-03-15
GB2542118B GB2542118B (en) 2021-05-19

Family

ID=54345813

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1515742.3A Active GB2542118B (en) 2015-09-04 2015-09-04 A method, apparatus, system, and computer readable medium for detecting change to a structure

Country Status (2)

Country Link
JP (1) JP6289564B2 (en)
GB (1) GB2542118B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2560177A (en) 2017-03-01 2018-09-05 Thirdeye Labs Ltd Training a computational neural network
GB2560387B (en) 2017-03-10 2022-03-09 Standard Cognition Corp Action identification using neural networks
CN107358596B (en) * 2017-04-11 2020-09-18 阿里巴巴集团控股有限公司 Vehicle loss assessment method and device based on image, electronic equipment and system
CN107403424B (en) 2017-04-11 2020-09-18 阿里巴巴集团控股有限公司 Vehicle loss assessment method and device based on image and electronic equipment
CN107066995A (en) * 2017-05-25 2017-08-18 中国矿业大学 A kind of remote sensing images Bridges Detection based on convolutional neural networks
JP6784239B2 (en) * 2017-07-24 2020-11-11 株式会社大林組 Face evaluation support system, face evaluation support method and face evaluation support program
WO2019032304A1 (en) * 2017-08-07 2019-02-14 Standard Cognition Corp. Subject identification and tracking using image recognition
JP6664557B2 (en) * 2017-09-14 2020-03-13 三菱電機株式会社 Deformation detection device
KR102026449B1 (en) * 2018-01-12 2019-09-27 인하대학교 산학협력단 Simulation Data Preprocessing Technique for Development of Damage Detecting Method for Bridges Based on Convolutional Neural Network
CN108846829B (en) * 2018-05-23 2021-03-23 平安科技(深圳)有限公司 Lesion site recognition device, computer device, and readable storage medium
CN110378254B (en) * 2019-07-03 2022-04-19 中科软科技股份有限公司 Method and system for identifying vehicle damage image modification trace, electronic device and storage medium
JP7197021B2 (en) * 2019-08-19 2022-12-27 富士通株式会社 Information processing device, information processing program, and information processing method
US11544914B2 (en) 2021-02-18 2023-01-03 Inait Sa Annotation of 3D models with signs of use visible in 2D images
CN113450357B (en) * 2021-09-01 2021-12-17 南昌市建筑科学研究所(南昌市建筑工程质量检测中心) Segment image online analysis subsystem and subway shield detection system
WO2023204240A1 (en) * 2022-04-20 2023-10-26 パナソニックIpマネジメント株式会社 Processing method, and processing device using same

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5175528B2 (en) * 2007-11-29 2013-04-03 東海旅客鉄道株式会社 Tunnel lining crack inspection system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050197981A1 (en) * 2004-01-20 2005-09-08 Bingham Clifton W. Method for identifying unanticipated changes in multi-dimensional data sets
US20120020573A1 (en) * 2010-07-20 2012-01-26 Lockheed Martin Corporation Image analysis systems using non-linear data processing techniques and methods using same

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019055465A1 (en) * 2017-09-12 2019-03-21 Bhavsar Parth Systems and methods for data collection and performance monitoring of transportation infrastructure
US11538256B2 (en) 2017-09-12 2022-12-27 Rowan University Systems and methods for data collection and performance monitoring of transportation infrastructure
CN107945153A (en) * 2017-11-07 2018-04-20 广东广业开元科技有限公司 A kind of road surface crack detection method based on deep learning
US11714024B2 (en) 2017-11-30 2023-08-01 University Of Kansas Vision-based fatigue crack detection using feature tracking
US11354814B2 (en) 2018-03-23 2022-06-07 University Of Kansas Vision-based fastener loosening detection
CN108765386A (en) * 2018-05-16 2018-11-06 中铁科学技术开发公司 A kind of tunnel slot detection method, device, electronic equipment and storage medium
WO2020041319A1 (en) * 2018-08-21 2020-02-27 University Of Kansas Fatigue crack detection in civil infrastructure
US11954844B2 (en) 2018-08-21 2024-04-09 University Of Kansas Fatigue crack detection in civil infrastructure
CN109272039A (en) * 2018-09-19 2019-01-25 北京航空航天大学 A kind of dam periphery method for monitoring abnormality and device based on unmanned plane
US20210375006A1 (en) * 2018-10-04 2021-12-02 Nippon Telegraph And Telephone Corporation Difference detection apparatus and difference detection program
US11967121B2 (en) * 2018-10-04 2024-04-23 Nippon Telegraph And Telephone Corporation Difference detection apparatus and difference detection program
CN109767426B (en) * 2018-12-13 2021-11-09 同济大学 Shield tunnel water leakage detection method based on image feature recognition
CN109767426A (en) * 2018-12-13 2019-05-17 同济大学 A kind of shield tunnel percolating water detection method based on characteristics of image identification
CN110163842A (en) * 2019-04-15 2019-08-23 深圳高速工程检测有限公司 Building cracks detection method, device, computer equipment and storage medium
WO2021179033A1 (en) * 2020-03-09 2021-09-16 Vapar Pty Ltd Technology configured to enable fault detection and condition assessment of underground stormwater and sewer pipes

Also Published As

Publication number Publication date
GB2542118B (en) 2021-05-19
JP6289564B2 (en) 2018-03-07
JP2017062776A (en) 2017-03-30
GB201515742D0 (en) 2015-10-21

Similar Documents

Publication Publication Date Title
JP6289564B2 (en) Method, apparatus and computer readable medium for detecting changes to structures
Stent et al. Detecting change for multi-view, long-term surface inspection.
Prasanna et al. Automated crack detection on concrete bridges
Akagic et al. Pothole detection: An efficient vision based method using rgb color space image segmentation
US10930013B2 (en) Method and system for calibrating imaging system
JP7319432B2 (en) LEARNING DATA COLLECTION DEVICE, LEARNING DATA COLLECTION METHOD, AND PROGRAM
CN111914767B (en) Scattered sewage enterprise detection method and system based on multi-source remote sensing data
CN112966665A (en) Pavement disease detection model training method and device and computer equipment
CN113781537B (en) Rail elastic strip fastener defect identification method and device and computer equipment
Zhao et al. Detecting insulators in the image of overhead transmission lines
Peteler et al. Analyzing the evolution of deterioration patterns: A first step of an image-based approach for comparing multitemporal data sets
Guo et al. Surface defect detection of civil structures using images: Review from data perspective
Yao et al. Cracknex: a few-shot low-light crack segmentation model based on retinex theory for uav inspections
CN115597494A (en) Precision detection method and system for prefabricated part preformed hole based on point cloud
Bush et al. Image registration for bridge defect growth tracking
CA3219745A1 (en) Texture mapping to polygonal models for industrial inspections
Walicka et al. An automatic method for the measurement of coarse particle movement in a mountain riverbed
Myrans et al. Automatic identification of sewer fault types using CCTV footage
Galantucci et al. A rapid pipeline for periodic inspection and maintenance of architectural surfaces
Loverdos et al. Automation in documentation of ageing masonry infrastructure through image-based techniques and machine learning
CN116580026B (en) Automatic optical detection method, equipment and storage medium for appearance defects of precision parts
KR102621971B1 (en) System and method for establishing deep learning-based 3d digital exterior damage model of vertical structure
Buatik et al. Image-based multiview change detection in concrete structures
CN109255765B (en) Remote sensing image color correction method based on image type analysis
Fang et al. Segmentation method of laying hens in cages based on difference of color information