GB2592583A - Aligning images

Aligning images

Info

Publication number
GB2592583A
Authority
GB
United Kingdom
Prior art keywords
image
data
geometric
segmentation
cue
Prior art date
Legal status
Granted
Application number
GB2002884.1A
Other versions
GB202002884D0 (en)
GB2592583B (en)
Inventor
Davidson Benjamin
Current Assignee
Disperse Io Ltd
Original Assignee
Disperse Io Ltd
Priority date
Filing date
Publication date
Application filed by Disperse Io Ltd filed Critical Disperse Io Ltd
Priority to GB2002884.1A
Publication of GB202002884D0
Publication of GB2592583A
Application granted
Publication of GB2592583B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for aligning images to a reference plane are disclosed. In the method, image data for a spherical image captured from a scene is received and geometric cue data for the scene is obtained. The image data and the geometric cue data are input into a neural segmentation network for segmentation of the spherical image using the image data and the geometric cue data. A direction orthogonal to the reference plane is then determined based on the segmentation of the spherical image. The spherical image can be aligned based on the determined direction. Geometric cue data such as vanishing points may be obtained from the received image data. The neural segmentation network may be a Gated-Shape convolutional neural network configured to process the geometric cue data. An attention map may be generated which feeds into an atrous spatial pooling layer.

Description

Aligning images

This disclosure relates to aligning images, and more particularly to aligning spherical images. The ability of spherical, or 360°, imaging to capture an entire scene with a single capture makes it a powerful tool for rapidly documenting entire scenes. This can be used for example for creating virtual walk-throughs. For example, 360° imaging has been used to record crime scenes where it is vital to image the entire scene for evidence. Moreover, 360° images can be used to create virtual reality (VR) videos. With the availability of cheap 360° capture devices and the growth of VR headsets there is an increased demand for techniques to automatically analyse and process 360° images. Various industrial applications have also emerged. For example, the construction industry has been using applications where 360° imaging is utilised.
Spherical or panoramic 360° images taken under unconstrained conditions can however present a significant challenge to current recognition pipelines, since the assumption of a mostly upright camera is not necessarily valid. A problem for spherical images is a misalignment between the ground plane of the camera frames and the ground plane of the world frames. This problem has been addressed through use of sensors embedded with the camera equipment which estimate the pitch and roll of the camera relative to the ground plane at capture time. With this information, the image can be rotated so that it is level with the ground. However, in practical commercial cameras these on-board sensors are often inaccurate, leading to misaligned images.
Misalignment makes automatic processing of spherical images a challenging task for neural networks. For example, training a spherical object detector on misaligned images would require a neural network to learn a representation which was invariant to rotations away from the vertical axis. If all images were level, the representation could be sensitive to these rotations, simplifying the task to be learned by a neural network.
A prior art solution to neural network aided ground plane alignment has been to extract straight-line segments from an image and use these to estimate a vanishing point in the direction of the vertical axis. These methods rely on what are known as the Manhattan or Atlanta world assumptions, which assert that the scene that has been captured will contain some orthogonal structure, given the tendency in human construction to build at right angles. This assumption, however, does not always hold in practice. One typical way to extract this orthogonal structure is to determine the direction in which all straight lines in an image are pointing and have each line vote on vanishing point directions. The orthogonal directions of the scene can then be found by looking for the three orthogonal directions which together have the most votes. However, many scenes may not have this orthogonal structure, and it may not be possible to extract many straight-line segments from the image. Moreover, the maximal orthogonal set found by maximisation may not be the true orthogonal directions. Due to the many assumptions of this approach, it can be very brittle in practice, and is not necessarily reliable, despite the apparent strength of the vanishing point features it uses.
Deep learning solutions to ground plane alignment have been shown to be more robust than the vanishing point methods. The current methods are either some variation of a deep convolutional regression network or a classification network. The most recent regression network, commonly referred to as Deep360Up, outputs the vertical direction directly from a network ("DenseNet"), which is trained using the logarithm of the hyperbolic cosine between the estimated and ground truth vertical directions. A recent deep network approach is to use a coarse-to-fine classification network. This approach classifies the pitch and roll of an image as belonging to a 10° bin (coarse), adjusting the image to be within 10° of level, and then classifying the adjusted image to be within a 1° bin (fine). Another standard feature of such solutions is to generate training data from already levelled images.
These methods have demonstrated the power of neural networks. However, the inventor has found that it is possible to improve accuracy of the alignment.
It is noted that the above discussed issues are not limited to any particular imaging apparatus and data processing apparatus but may occur in any imaging system where images require alignment.
According to an aspect there is provided a method for aligning images to a reference plane, the method comprising: receiving image data for a spherical image captured from a scene, obtaining geometric cue data for the scene, inputting the image data and the geometric cue data into a neural segmentation network for segmentation of the spherical image using the image data and the geometric cue data, determining a direction orthogonal to the reference plane based on the segmentation of the spherical image, and aligning the spherical image based on the determined direction.
According to another aspect there is provided a data processing apparatus for aligning images to a reference plane, the data processing apparatus comprising at least one processor, and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to receive image data for a spherical image captured from a scene, obtain geometric cue data for the scene, input the received image data and the determined geometric cue data into a neural segmentation network for segmentation of the spherical image, determine a direction orthogonal to the reference plane based on the segmentation of the spherical image, and align the spherical image based on the determined direction.
In accordance with a more detailed aspect geometric cue data is obtained from the received image data.
The geometric cue data may comprise information of vanishing points. The vanishing points can be calculated directly from the image data.
A normal vector to the reference plane may be determined in camera coordinates.
The reference plane may comprise a ground plane. A vertical axis orthogonal to the ground plane can be determined. The determined vertical axis can be used to orient the image.
The neural segmentation network may comprise a Gated-Shape Convolutional Neural Network configured to process the geometric cue data.
The processing may comprise processing backbone features and geometric cue data to generate an attention map. The attention map may be fed into an atrous spatial pooling layer.
Information on geometric cues may comprise information derived from the edges of objects in the spherical image. A vanishing point image may be generated based on information of edges related to vanishing points.
The determining of the direction orthogonal to the reference plane may comprise determining a single direction.
All points on the sphere of the spherical image may be segmented within five degrees of the two points where the axis orthogonal to the reference plane intersects the sphere.
The neural segmentation network may be trained based on information of a weighted generalised dice loss on uniformly distributed points on the sphere. Training may comprise rotating an image to generate pairs of the image and a predefined direction.
A test set of data comprising information of unrotated images may be used to test the neural segmentation network. It can be tested that the neural segmentation network has not simply inverted a rotation in training data.
Computer software products may also be provided for implementing the herein described tasks.
Certain more detailed aspects are evident from the detailed description.
Various exemplifying embodiments of the invention are illustrated by the attached drawings. Steps and elements may be reordered, omitted, and combined to form new embodiments, and any step indicated as performed may be caused to be performed by another device or module. In the Figures: Fig. 1 illustrates an example of an environment where the invention can be embodied; Fig. 2 shows a flowchart in accordance with an embodiment; Fig. 3 illustrates the relationship between an equirectangular image and a sphere; Figs. 4A and 4B show an equirectangular image and corresponding cube faces; Fig. 5 shows a flowchart in accordance with an embodiment; Fig. 6 illustrates the effect of rotation on vanishing points; Figs. 7A and 7B show an image and its vanishing point image; Fig. 8 shows an example of calculating vanishing points; Fig. 9 shows a flowchart in accordance with an embodiment; Fig. 10 shows data processing apparatus; and Fig. 11 shows an example relating to testing a neural network.
Figure 1 is a schematic view of an example of an imaging system configured to capture spherical images of a scene. More particularly, Figure 1 shows camera apparatus 10 rotatably mounted on ground-plane 12. The rotation of the camera apparatus on a plane is denoted by arrow 18. The camera apparatus is configured to capture spherical images of the scene comprising objects 15, 16 and store the captured digital image data for later use. The camera apparatus can comprise any image capturing apparatus capable of producing image data of a scene. The objects may be for example in a closed space such as a room or floor, an open space such as an outdoor view and so forth. The camera apparatus is shown to comprise a data processing unit 14 configured to control the operation of the camera. The processor apparatus may also perform at least some of the image processing and alignment tasks described herein. The camera apparatus may be connected to a data communication system and external devices via appropriate interfaces (not shown).
Fig. 1 shows further a data processing apparatus 20 arranged to process image data captured by and obtained from the camera apparatus 10. Transfer of image data to the data processing apparatus 20 is denoted by dashed arrow 22. Image data may be transferred from the camera apparatus 10 to the data processing apparatus by a physical data carrier, e.g., by means of a memory card, USB stick and so forth. The data processing apparatus may also be connected to the camera 10. The connection may be through a data connection, e.g., via a data network connection or a direct connection. The connection may be at least in part wireless. The data processing apparatus 20 can be configured to perform the image alignment processing tasks described below. It is noted that at least a part of the processing may be distributed between separate data processing devices and/or may occur in the camera device.
Images to be processed are typically of a cubic or equirectangular format.
The cubic format consists of six undistorted, perspective images: Up, down, left, right, forward and backward. The equirectangular format is one single stitched image of 360° horizontally and 180° vertically. The cubic format is considered to suffer from less distortion than the equirectangular. However, the equirectangular is more popular and more widely supported, and is well defined and standardized. For example, the equirectangular format is the most common spherical projection used to store 360° images on a disc.
Aligning spherical images to the ground is an important pre-processing step for various downstream tasks. For example, object detectors and segmentation networks are typically trained on upright equirectangular images, and do not work under arbitrary rotations. Human visual recognition also degrades quickly with extreme rotations. There can also be classification problems that are impossible to solve under arbitrary rotations. An example of this is the distinction between the digits 9 and 6. Ground-plane alignment can also make pose estimation more robust, as estimating the pose of a levelled image requires two fewer DOF. Moreover, alignment is important for human-computer interfaces, making interactive browsing of 360° imagery more intuitive, as users generally expect images or photos to look upright. Therefore, automatically levelling these images may provide numerous benefits, both to solving technical problems and to increasing the utility of images when used by humans.
The inventor has found that a segmentation-based strategy may be used to increase performance and reduce errors. The method is based on input of RGB (red, green, and blue) image and purely geometric cues, such as apparent vanishing points to a neural network. Some learned semantic cues, such as the expectation that some visual elements (e.g. doors, windows, structural elements) have a natural upright position may be used to achieve more accurate alignment of images. A deep neural network can be trained to leverage these cues to segment the image-space endpoints of an imagined "vertical axis", which is orthogonal to the ground plane of a scene, thus levelling the camera.
The following focuses on examples where a ground-plane alignment system is configured to estimate two degrees of freedom (DOF). More specifically, roll and pitch of the images can be estimated. This contrasts with general camera pose estimation, which typically estimates six DOF (translation and rotation in 3D). Ground-plane alignment can be performed based on a single image, by using simple cues (e.g. vertical walls, ground or sky/ceiling positions). General camera pose estimation would typically require a second reference image, making that technique much less applicable.
According to an example, image data and vanishing point information are input into a convolutional backbone neural network for segmentation of the image. Convolutional neural networks (CNNs) have been used for tasks such as image classification and have proven to yield good results for such problems. CNNs can be used, e.g., to capture important features that help in identifying the object(s) in the image. For example, to identify a box the edges need to be identified first. From the output it is possible to determine a normal vector to the world ground plane in camera coordinates. With the camera coordinates of this vector, it is relatively straightforward to compute the rotation aligning the camera ground plane with the world ground plane. The image can then be warped (resampled) to undo this rotation, achieving an upright image orientation.
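As an illustration of this last step, the following is a minimal numpy sketch of computing the rotation that maps an estimated vertical direction onto the world up axis. It is not the patented implementation; the function name and the Rodrigues-formula construction are illustrative assumptions, and the resampling step it feeds into is sketched further below.

```python
import numpy as np

def rotation_aligning(u, z=np.array([0.0, 0.0, 1.0])):
    """Return a rotation matrix R such that R @ u equals z (Rodrigues' formula)."""
    u = u / np.linalg.norm(u)
    axis = np.cross(u, z)
    s, c = np.linalg.norm(axis), float(np.dot(u, z))   # sin and cos of the angle between u and z
    if s < 1e-8:                                       # u already (anti-)parallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    axis = axis / s
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])             # skew-symmetric cross-product matrix
    return np.eye(3) + s * K + (1 - c) * (K @ K)

# Usage sketch: given an estimated vertical direction u_est in camera coordinates,
# R = rotation_aligning(u_est), and the equirectangular image is then resampled with
# the inverse rotation to obtain an upright orientation.
```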
In accordance with a more detailed example a Gated-Shape Convolutional Neural Network (GSCNN) is used for the segmentation. This network architecture was originally developed for standard segmentation tasks such as the CityScapes dataset. Such a dataset is discussed, e.g., in the 2016 article by Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: 'The Cityscapes dataset for semantic urban scene understanding', in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
GSCNNs follow a fairly standard segmentation framework in that they consist of a fully convolutional backbone network and incorporate local and global context with an atrous spatial pyramid pooling (ASPP) layer. The GSCNN can be configured to provide what can be referred to herein as a shape stream. The shape stream can be configured to take as input intermediate backbone features and image gradient information derived from edges of objects that relate to the vanishing points. The information is processed to generate an attention map. The attention map is then used inside the ASPP.
Information on edges of objects relating to vanishing points can be an advantageous feature. The shape stream provides a way for the network to utilise the edge information, by deriving straight lines from the edges. A modification can be made to the network to enable utilization of vanishing point information instead of gradient information. The derived vanishing point information is relevant for segmenting vertical directions. A vanishing point image may be generated based on information of edges related to vanishing points.
It is possible to broadly split the approach into three stages: calculating the vanishing points, segmenting the image, and processing the segmentation into a single vertical direction. This is illustrated in Fig. 2 showing an overview of possible steps of correcting alignment of an input image. In the example equirectangular images can be segmented with a GSCNN using vanishing point features to assist the network. It shall be appreciated that other geometrical cues may also be used in addition, or as an alternative input of assisting information.
From the segmentation output the vertical axis can then be calculated. The calculated vertical axis can be used to orient the image upright.
In accordance with a possibility an axis orthogonal to the ground plane can be estimated by segmenting the unit sphere in the camera coordinate frame into likely candidates. In this description this axis is called the vertical axis, and its standard basis vector the vertical direction. As described in more detail below, the unit sphere can be segmented by segmenting an equirectangular image. A deep convolutional segmentation network may be created that takes an equirectangular image as input, and outputs a segmentation of the sphere. A relatively simple post-processing of the spherical segmentation may follow to provide the vertical direction. By posing ground plane alignment as a segmentation problem it is possible to use existing image segmentation techniques for the purpose of aligning the view and increasing performance.
The segmentation framework may use vanishing point features as inputs to the network. A detailed example of use of vanishing points will be discussed later with reference to Figures 7A and 7B. The vanishing point feature is useful since it contains a strong signal as to the vertical axis. A vanishing point is a point on the image plane of a perspective drawing where the two-dimensional perspective projections of mutually parallel lines in three-dimensional space appear to converge.
Prior art solutions to ground plane alignment often construct an equirectangular feature image which encodes the locations of vanishing points within the image, and the vertical axis is directly computed from the feature image itself. In the herein disclosed examples vanishing point information can be used as input to a deep neural network. The network can then decide what to do with this input. Using vanishing point feature as input to a deep neural network makes the most of classical feature engineering, whilst maintaining the robustness of deep learning approaches when faced with ambiguous inputs. Leveraging the power of feature engineering with segmentation techniques an accurate procedure for the alignment can be provided for various applications.
Before describing an exemplifying alignment method in more detail, some further background on equirectangular images and operations that can be applied to them is provided. An equirectangular image can be understood as a planar representation of an image on the sphere, where height and width correspond respectively to latitude and longitude. The explicit transformation ρ between pixel coordinates (x, y) and spherical coordinates (λ, φ) is:

(λ, φ) = ρ(x, y) = (2πx/w − π, πy/h − π/2) (1)

where w and h are the image dimensions in pixels.
It is noted that this is an invertible transformation and so it is possible to move from the image to the sphere and vice versa.
An equirectangular image will frequently be referred to as being on the sphere, by which is meant the projection ρ of the image to the sphere. Furthermore, spherical coordinates can be mapped to cartesian coordinates and vice versa using, e.g., the well-known spherical to cartesian transformation. This will be denoted herein with f. Fig. 3 illustrates the manner in which an equirectangular image relates to a sphere. By a simple change of basis, it is possible to move from pixels to spherical coordinates (see the top row of Fig. 3). These are then straightforward to map to cartesian points on the sphere (bottom row).
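The two maps just introduced can be sketched in a few lines of numpy. This is an illustrative sketch assuming the standard equirectangular convention of equation (1); the exact sign and offset conventions used in the disclosure may differ.

```python
import numpy as np

def pix_to_sph(x, y, w, h):
    """rho: pixel coordinates -> spherical coordinates (longitude lam, latitude phi)."""
    lam = (x / w) * 2.0 * np.pi - np.pi          # longitude in [-pi, pi)
    phi = (y / h) * np.pi - np.pi / 2.0          # latitude in [-pi/2, pi/2)
    return lam, phi

def sph_to_cart(lam, phi):
    """f: spherical coordinates -> unit vector on the sphere."""
    return np.stack([np.cos(phi) * np.cos(lam),
                     np.cos(phi) * np.sin(lam),
                     np.sin(phi)], axis=-1)

def cart_to_sph(v):
    """f^-1: unit vector on the sphere -> spherical coordinates."""
    lam = np.arctan2(v[..., 1], v[..., 0])
    phi = np.arcsin(np.clip(v[..., 2], -1.0, 1.0))
    return lam, phi
```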
ρ can be used to rotate an equirectangular image Isrc to create another image Idst of a different orientation, by rotating the sphere. Starting from a point xdst in the pixel space of Idst, xdst is projected to the sphere as ρ(xdst), and the sphere is rotated with a rotation matrix R ∈ SO(3). Note that R represents an arbitrary rotation in 3D space, with an axis of rotation not necessarily corresponding to latitude or longitude. Doing this gives the following relationship between coordinate systems:

xsrc = R f(ρ(xdst)) (2)

Following this transformation the algorithm can project back to image space, ρ⁻¹(f⁻¹(xsrc)), and re-sample from Isrc to create Idst. As can be seen in equation (2), the image may be rotated so that an equirectangular image of any orientation can be generated. This can be used to generate training data for the neural segmentation network.
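A minimal sketch of this resampling follows, assuming the coordinate helpers from the previous sketch and OpenCV for the interpolation; the function name and the use of cv2.remap are illustrative choices, not taken from the disclosure.

```python
import cv2
import numpy as np

def rotate_equirectangular(img_src, R):
    """Rotate an equirectangular image by R via equation (2) and resample."""
    h, w = img_src.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lam, phi = pix_to_sph(xs.astype(np.float64), ys.astype(np.float64), w, h)
    pts_dst = sph_to_cart(lam, phi).reshape(-1, 3)   # destination pixels on the sphere
    pts_src = pts_dst @ R.T                          # x_src = R f(rho(x_dst)), applied row-wise
    lam_s, phi_s = cart_to_sph(pts_src)
    map_x = ((lam_s + np.pi) / (2 * np.pi) * w).reshape(h, w).astype(np.float32)
    map_y = ((phi_s + np.pi / 2) / np.pi * h).reshape(h, w).astype(np.float32)
    return cv2.remap(img_src, map_x, map_y, cv2.INTER_LINEAR)
```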
One issue when using equirectangular images is how to extract the straight-line segments visible within the scene. Straight lines in the scene do not in general map to straight lines in an equirectangular image. This is illustrated by Figs. 4A and 4B showing an equirectangular image and its corresponding cube faces. The many curved lines in the equirectangular image of Fig. 4A become straight in the corresponding cube faces of Fig. 4B.
To recover straight lines from an equirectangular image, it can be converted to one or more perspective images. The full 360 degree view of Fig. 4A can be covered with perspective views, corresponding to six cube faces, as illustrated by Fig. 4B. Each one is produced by rendering the sphere (with the mapped texture) from six different points-of-view, at right angles. This kind of "cube mapping" is used in computer graphics to render far-away scenes. Following this any line segment detector can be applied.
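The following is a hedged sketch of that idea: render one perspective face from the equirectangular image with a simple 90-degree-FOV gnomonic projection, then detect edges and line segments on it. The face rotation matrices, the thresholds, and the use of Canny edge detection with a probabilistic Hough transform (mentioned later in this description) are illustrative assumptions.

```python
import cv2
import numpy as np

def render_face(equirect, face_R, size=512):
    """Render one 90-degree-FOV perspective face of the spherical image."""
    h, w = equirect.shape[:2]
    g = (np.arange(size) + 0.5) / size * 2.0 - 1.0           # pixel centres in [-1, 1]
    xx, yy = np.meshgrid(g, g)
    rays = np.stack([xx, yy, np.ones_like(xx)], axis=-1)      # pinhole rays for this face
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    rays = rays @ face_R.T                                    # orient the face; six R's give the cube
    lam, phi = cart_to_sph(rays)
    map_x = ((lam + np.pi) / (2 * np.pi) * w).astype(np.float32)
    map_y = ((phi + np.pi / 2) / np.pi * h).astype(np.float32)
    return cv2.remap(equirect, map_x, map_y, cv2.INTER_LINEAR)

def line_segments(face):
    """Detect line segments on a rendered face (expects an 8-bit grayscale image)."""
    edges = cv2.Canny(face, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                           minLineLength=40, maxLineGap=5)
    return [] if segs is None else segs.reshape(-1, 4)        # rows of (x1, y1, x2, y2)
```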
An example for the segmentation framework will be described next with reference to Fig. 5. In this example inputs to the neural network are the image and the vanishing point feature. For example, an InceptionV3 type architecture can be used as a backbone from which to extract features at various depths. Such an architecture has been described by Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: 'Rethinking the inception architecture for computer vision', 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (June 2016). These features can then be used in the vanishing point stream, and an ASPP layer can be used to construct the final segmentation.
The segmentation framework can be based on a variation of Gated-Shape CNNs. The Gated-Shape CNN is an example of a typical, deep, fully convolutional neural network, which can be configured to accept the vanishing point image V as input to an attention module. This is different from the traditional use of a Gated-Shape CNN, where image gradients are used as input to an attention module. The output of the network is a binary segmentation of the original equirectangular image, which via ρ is equivalent to a segmentation of the sphere into background and likely directions for the vertical axis. Specifically, it is possible to segment all points on the sphere within five degrees of the two points where the vertical axis intersects the sphere. By using the vanishing points within a GSCNN architecture, it is possible to train an accurate vertical axis segmentation network.
The network architecture of a GSCNN is a fully convolutional segmentation framework, designed to utilise object boundaries to improve performance on typical segmentation tasks. It consists of a backbone feature extractor, for example InceptionV3, an ASPP layer, and the shape stream. The shape stream can be configured to accept vanishing points and intermediate backbone features as input, and then outputs a single channel feature image. The output shape stream features can then be combined with other backbone features in the ASPP layer to generate a dense feature map of the same resolution as the input image. Each pixel in the feature image can then be classified with a fully connected layer.
The architecture modifies GSCNNs such that it would be more informative to call the shape stream the vanishing point stream, as image gradients are replaced with the vanishing point image V. The reasoning behind this is that the vanishing point image V is a feature with a lot of signal for the vertical axis, and it is considered helpful to let the network exploit this source of information. Also, feeding the vanishing point image V to the network in this manner allows use of a pre-trained backbone, which would not be possible by just concatenating the vanishing point image V to the channels of the image. Using a GSCNN enables introduction of information relating to vanishing points, whilst also retaining the ability to use pre-trained backbones.
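A much-simplified PyTorch sketch of the stream described above is given below: intermediate backbone features and the vanishing point image V are fused into a single-channel attention map that then modulates the features entering the ASPP layer. This is an illustrative stand-in rather than the Gated-Shape CNN architecture itself; the layer sizes, the sigmoid gating and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanishingPointStream(nn.Module):
    """Fuse intermediate backbone features with the vanishing point image V."""
    def __init__(self, feat_channels):
        super().__init__()
        self.reduce = nn.Conv2d(feat_channels, 16, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(16 + 1, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=1))

    def forward(self, backbone_feat, vp_image):
        f = self.reduce(backbone_feat)
        f = F.interpolate(f, size=vp_image.shape[-2:], mode="bilinear",
                          align_corners=False)
        attn = torch.sigmoid(self.fuse(torch.cat([f, vp_image], dim=1)))
        return attn                                  # (N, 1, H, W) attention map

# Usage sketch: the attention map gates the features fed to the ASPP layer, e.g.
#   gated = aspp_input * F.interpolate(attn, size=aspp_input.shape[-2:], mode="bilinear")
```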
Fig. 6 illustrates effects of image rotation. The vanishing point feature becomes more informative if the image is not nearly level. The two blocks in the top row illustrate a vanishing point feature for a level image, and its segmentation. In the bottom row the same image's feature is presented after rotating the image 20° from level. Two locations where many lines meet each other appear, indicating where the relevant vanishing point features become more localised after rotation. The circle highlights where the topmost intersection point is located.

Processing segmentations to vertical axes: once there is an equirectangular segmentation, it is possible to extract a vertical direction by looking for the most probable connected component and taking its centroid as the vertical direction. Given such a centroid c and equation (1), the vertical direction can be recovered via f(ρ(c)). In accordance with a possibility, processing equirectangular segmentations to vertical directions can be provided as follows. To lift an equirectangular segmentation to a uniformly distributed spherical segmentation, a subset of points in the equirectangular segmentation is projected to the sphere. This subset can be constructed by uniformly distributing nsp points on the surface of the upper hemisphere of the sphere, and by taking the antipodal point to each such point. Thus, 2nsp points are uniformly distributed across the surface. Following this, a projection is made to the equirectangular image and the values are interpolated, giving segmentation probabilities on the sphere. Sampling in this way, where points are antipodal to the original nsp points, allows assignment of a score to potential vertical axes easily by summing the probability of a point with its antipodal counterpart. Moreover, sampling the equirectangular segmentation uniformly like this gives equal weight to all directions. This would not be the case if the equirectangular segmentation were used directly, as undue weight would be placed on the poles due to over-sampling.
At this point a uniformly distributed spherical segmentation is available. For simplicity of computations, a single vector may be desired. To do this, the probabilities for all antipodal points can be summed, and this score assigned to the corresponding point in the upper hemisphere, discarding the points in the lower hemisphere.
It is noted that by doing this it has been assumed that the vertical direction is in the upper hemisphere. This may limit the method to correcting images for which the misalignment is less than 90°. In practice this may not matter much, as images are typically not taken upside down. Even if they are, for example on-board camera sensors can be used to roughly correct the image before applying the method. This gives a vertical direction score for all vectors in the upper hemisphere.
Next, all points with a score less than λscore can be discarded. To extract a single vector from the remaining candidates, all vectors can be grouped together which are within distance λdist of each other. For each such group, the mean score and mean vector can be calculated. The vertical direction is then the mean vector with maximal mean score.
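The post-processing described over the last few paragraphs can be sketched as follows in numpy. The sampling count and the two thresholds are placeholder values, the hemisphere sampling uses a Fibonacci lattice, and the probability lookup uses nearest-neighbour indexing where the text describes interpolation; cart_to_sph is the helper from the earlier coordinate sketch.

```python
import numpy as np

def fibonacci_hemisphere(n):
    """Roughly uniformly distributed points on the upper unit hemisphere."""
    k = np.arange(n)
    z = (k + 0.5) / n                                   # z in (0, 1): upper half only
    theta = np.pi * (1 + 5 ** 0.5) * k
    r = np.sqrt(1.0 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=-1)

def vertical_from_segmentation(prob, n_sp=5000, score_thresh=0.5, dist_thresh=0.1):
    """prob: (H, W) foreground probability map output by the segmentation network."""
    h, w = prob.shape
    up = fibonacci_hemisphere(n_sp)

    def sample(points):                                 # probability at given sphere points
        lam, phi = cart_to_sph(points)
        x = np.clip(((lam + np.pi) / (2 * np.pi) * w).astype(int), 0, w - 1)
        y = np.clip(((phi + np.pi / 2) / np.pi * h).astype(int), 0, h - 1)
        return prob[y, x]

    score = sample(up) + sample(-up)                    # sum each point with its antipode
    keep, kept_scores = up[score > score_thresh], score[score > score_thresh]
    if len(keep) == 0:
        return np.array([0.0, 0.0, 1.0])                # fall back to "already level"
    best, best_score, used = None, -np.inf, np.zeros(len(keep), dtype=bool)
    for i in range(len(keep)):                          # group nearby candidates
        if used[i]:
            continue
        group = np.linalg.norm(keep - keep[i], axis=1) < dist_thresh
        used |= group
        v, s = keep[group].mean(axis=0), kept_scores[group].mean()
        if s > best_score:
            best, best_score = v / np.linalg.norm(v), s
    return best                                         # mean vector with maximal mean score
```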
According to an aspect test time augmentation may be provided. The testing may be provided at the final stage of the alignment. The testing may rotate an image and rerun the segmentation. Let u be a candidate vertical direction obtained after running a single forward pass. If u is already close to level, i.e. if u is within a threshold number of degrees of z = (0, 0, 1), then the pitch of the image can be rotated by 20° and the entire inference and post-processing steps rerun to get a new u'. The reason for this is that, if the image is already close to level, the vanishing point features for the vertical axis are close to the points of most distortion: z (see Fig. 6). Following this, u' can be rotated back 20° and the resulting vector taken as the vertical direction.
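A short sketch of this test-time augmentation follows, reusing rotate_equirectangular from the earlier sketch; estimate_vertical stands for the whole inference and post-processing pipeline and, like the threshold value, is an assumed name. The direction of the final back-rotation depends on the rotation convention chosen in rotate_equirectangular.

```python
import numpy as np

def estimate_with_tta(img, estimate_vertical, threshold_deg=10.0):
    z = np.array([0.0, 0.0, 1.0])
    u = estimate_vertical(img)
    angle = np.degrees(np.arccos(np.clip(u @ z, -1.0, 1.0)))
    if angle > threshold_deg:
        return u                                        # far from level: keep the estimate
    a = np.radians(20.0)                                # pitch the image by 20 degrees
    R_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0, np.cos(a), -np.sin(a)],
                        [0.0, np.sin(a), np.cos(a)]])
    u_prime = estimate_vertical(rotate_equirectangular(img, R_pitch))
    return R_pitch.T @ u_prime                          # rotate the new estimate back
```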
A possible training regime will be disclosed next. To train the network, it is possible to use a weighted generalised dice loss on uniformly distributed points on the sphere. This contrasts with the typical GSCNN training, which utilises auxiliary and regularising losses. This change is possible because there is no direct corollary in the herein described procedure, as those losses are based on segmentation boundaries, which are not meaningful here. The generalised dice loss can be used as this has been shown to perform well in situations where there are large class imbalances between foreground and background classes. The generalised dice loss for a ground truth segmentation y and an estimated probabilistic segmentation ŷ is given by

GDL(y, ŷ) = 1 − 2 (Σc wc Σi yc,i ŷc,i) / (Σc wc Σi (yc,i + ŷc,i)),

where c runs over the foreground and background classes, i runs over the sampled points, and wc is a per-class weight. The loss is not computed directly on the equirectangular segmentation as this would over-sample the poles, thus disproportionately weighting vertical directions near them. Instead, points can be selected that are uniformly distributed around the sphere. These points can then be projected into the equirectangular segmentation, and ground truth. Finally, the values of each projected point can be interpolated to construct y and ŷ.
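A hedged PyTorch sketch of that loss is shown below: the network output and the ground truth are sampled at precomputed, uniformly distributed sphere points projected into the equirectangular grid, and a weighted generalised dice loss is computed on the sampled values. The "inverse squared class volume" weighting used here is a common choice from the generalised dice loss literature and is an assumption, not necessarily the disclosed weighting.

```python
import torch

def sampled_generalised_dice_loss(prob_map, gt_map, sample_y, sample_x, eps=1e-6):
    # prob_map: (H, W) predicted foreground probabilities; gt_map: (H, W) binary ground truth
    # sample_y, sample_x: integer pixel indices of uniformly distributed sphere points
    p = prob_map[sample_y, sample_x]
    y = gt_map[sample_y, sample_x].float()
    num = torch.zeros((), device=p.device)
    den = torch.zeros((), device=p.device)
    for cls_p, cls_y in ((p, y), (1.0 - p, 1.0 - y)):   # foreground, then background
        w = 1.0 / (cls_y.sum() ** 2 + eps)              # inverse squared class volume (assumed)
        num = num + w * (cls_p * cls_y).sum()
        den = den + w * (cls_p.sum() + cls_y.sum())
    return 1.0 - 2.0 * num / (den + eps)
```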
Training and validation data fed to the network during training can comprise equirectangular images, and ground truth equirectangular segmentations which are generated from already levelled equirectangular images. The dataset to begin with can comprise only levelled equirectangular images. For all of these images the vertical direction is z = (0, 0, 1). By rotating a levelled image by R and using equation (2), the resulting vertical direction of the rotated image can be assumed to be R⁻¹z. From this, training pairs of image and vertical direction can be generated. After the vertical direction is obtained, it is simple to construct a binary equirectangular segmentation.
Let u be the generic vertical direction for some image and I an equirectangular image. After applying f ∘ ρ to all pixel values in I, the i,jth pixel can be considered as sitting at xij on the sphere in R³. The segmentation sij is then

sij = 1 if |u · xij| ≥ cos α, and 0 otherwise, (5)

where α is the angular threshold (five degrees in the examples above). Based on this it is possible to consider pixels that project near to the vertical axis as foreground (1) and all others as background (0). Rotating level images whilst keeping track of the vertical axis allows construction of many pairs of image and segmentation from a single levelled image. To actually generate the dataset, nrot rotations are computed which will place the vertical axis uniformly around the sphere, and then a small offset rotation is applied. Performing these almost uniform rotations avoids using the same nrot rotations for every image, whilst keeping a good distribution around the sphere to learn from.
Uniform distribution of the vertical axis may comprise, before training, generating nrot rotations which place the vertical direction uniformly distributed around the unit sphere. To do this, first generate nrot points uniformly around the sphere and then construct a rotation that will place the vertical direction at each of these points. Let v be one such point; then the rotation which places z there is given by the Rodrigues vector

arccos(z · v) (z × v) / ‖z × v‖. (6)

After applying this rotation, a small random rotation can be applied. This is achieved by uniformly sampling a vector from the unit sphere and scaling it by a uniformly distributed number in the range of [0, r] to give a random Rodrigues rotation, which can then be applied after the initial rotation.
It is noted that there are infinitely many rotations placing the vertical direction at a specific point (as it is possible to spin around the vertical axis once there). This rotation can be incorporated online during training, as the rotation can be represented via rolling the equirectangular image along its width axis.
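The following sketch pulls the training-data generation together along the lines of equations (5) and (6): a near-uniform target for the vertical direction is chosen, the Rodrigues rotation taking z there is built and perturbed by a small random rotation, the levelled image is rotated, and the binary ground truth is rasterised. It reuses the earlier helper functions and OpenCV's Rodrigues conversion; the sign convention for the resulting vertical direction follows the text above, and everything else is an illustrative assumption.

```python
import cv2
import numpy as np

def training_pair(level_img, v_target, r_max=0.2, cone_deg=5.0):
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(z, v_target)
    axis = axis / (np.linalg.norm(axis) + 1e-12)
    rod = axis * np.arccos(np.clip(z @ v_target, -1.0, 1.0))         # equation (6)
    jitter = np.random.randn(3)
    jitter = jitter / np.linalg.norm(jitter) * np.random.uniform(0.0, r_max)
    R = cv2.Rodrigues(jitter)[0] @ cv2.Rodrigues(rod)[0]             # small offset after placement
    rotated = rotate_equirectangular(level_img, R)
    u = R.T @ z                 # per the text, the rotated image's vertical direction is R^-1 z
    h, w = rotated.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lam, phi = pix_to_sph(xs, ys, w, h)
    x_ij = sph_to_cart(lam, phi)                                     # f o rho for every pixel
    seg = (np.abs(x_ij @ u) >= np.cos(np.radians(cone_deg))).astype(np.uint8)   # equation (5)
    return rotated, seg
```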
Even when using high quality interpolation methods, rotational artifacts cannot be prevented from appearing in the rotated images. Experimentation described below shows that from these artifacts alone it is possible for the network to learn to invert the applied rotation, rather than find the vertical direction. However, this data leakage is not considered a problem, as the network still generalises to images where no rotations have been applied. It may be advantageous to have an unrotated test set to measure performance and to verify that the network has not simply learned the inverse rotation.
Fig. 7A shows an image and Fig. 7B the vanishing point image of Fig. 7A. As discussed above, vanishing points can be used as inputs to the network. The highlighted regions are areas which have received a lot of votes. Note that in this case the vanishing points correspond to the orthogonal layout of the room, as they point to the part of each wall which is closest to the camera. Vanishing points offer a strong geometric cue for many computer vision tasks, including ground plane alignment. In many scenes a horizon line and/or orthogonal structures such as the corners of buildings are visible. These structures are useful for determining the vertical axis and can be emphasised by calculating the vanishing points. The vanishing points can be computed directly from images, with no learning required.
To build the vanishing point image in Figure 7B, all straight lines in the scene are extracted and each line is used to vote on vanishing directions. The first step of this process is to project the equirectangular image to the six cube faces and extract line segments from each face. To extract the line segments, Canny edge detection combined with a probabilistic Hough transform can be used, for example. Then each line segment can be converted to a plane, defined by the line endpoints and the origin of the sphere. Let n be the normal vector to this plane. n can be used to vote for vanishing point locations, by voting for all directions on the sphere which are orthogonal to n. Geometrically this means all points on the great circle defined by the intersection of the plane and sphere receive a vote. In practice the sphere can be split into h × w bins by projecting each pixel of an equirectangular image I to the sphere and then voting via

Vn(i, j) = 1 if |n · xij| < ε, and 0 otherwise, (7)

where xij is the projection of pixel (i, j) to the sphere and ε is a small tolerance. An n can be calculated for every line segment and the votes accumulated by summing

V = Σn Vn. (8)

V can be normalised to be an intensity image with values in the range of [0, 255]. By constructing the array like this, a feature can be built which indicates the directions that straight-line segments point within a scene, and this can be used as a feature to assist the network in finding the vertical axis (see Fig. 8).
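An illustrative numpy sketch of this voting scheme is given below, following equations (7) and (8): each detected segment contributes a plane normal n, and every sphere bin whose direction is nearly orthogonal to n receives a vote. The orthogonality tolerance and the use of one bin per equirectangular pixel are assumptions; pix_to_sph and sph_to_cart are the helpers from the earlier sketch.

```python
import numpy as np

def vanishing_point_image(segment_normals, h, w, tol=0.01):
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lam, phi = pix_to_sph(xs, ys, w, h)
    dirs = sph_to_cart(lam, phi).reshape(-1, 3)              # one bin per pixel, as a sphere direction
    V = np.zeros(h * w)
    for n in segment_normals:                                # n: unit normal of the segment's plane
        V += (np.abs(dirs @ n) < tol).astype(np.float64)     # vote on the great circle, eqs. (7)-(8)
    V = V.reshape(h, w)
    return (255.0 * V / max(float(V.max()), 1e-12)).astype(np.uint8)   # normalise to [0, 255]

# The plane normal for a segment whose cube-face endpoints map to unit sphere
# directions p1 and p2 is simply n = cross(p1, p2) / norm(cross(p1, p2)).
```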
The above examples related to alignment relative to a ground plane.
However, the alignment of spherical images can be made relative to any reference plane. For example, a vertical plane can be used as the base plane.
Fig. 9 shows in general terms an aspect of the invention. In the method image data for a spherical image captured from a scene is received at 100.
Geometric cue data for the scene is obtained at 102. The received image data and the obtained geometric cue data is input at 104 into a neural segmentation network for segmentation of the spherical image based on the image data and the geometric cue data. A direction orthogonal to a reference plane is determined at 106 based on the segmentation of the spherical image. The spherical image can then be aligned at 108 based on the determined direction.
The geometric cue data may comprise information of vanishing points. The geometric cue data may be obtained directly from the received image data. The vanishing points may be calculated from the image data. A normal vector may be calculated to the reference plane in camera co-ordinates. The reference plane can be selected to comprise a ground plane. The method comprises determining a vertical axis orthogonal to the ground plane and using the vertical axis to orient the image.
Fig. 10 shows an example of data processing apparatus for providing the necessary data processing functions to implement the above described functions. The data processing apparatus 90 can be for example integrated with, coupled to and/or otherwise arranged for controlling processing of images captured by the camera apparatus 10 of Fig. 1. The data processing apparatus, or parts thereof, can be provided at a separate server apparatus or another external device, for example at the data processing apparatus 20 of Fig. 1. The apparatus comprises at least one memory 91, at least one data processing unit 92, 93 and at least one input/output interface 94. Via the interface the apparatus can be coupled to other entities of the camera device and/or external devices. The interface may also be provided with or connected to a reader of a physical data carrier. The apparatus can be configured to execute an appropriate software code to provide the necessary functions. The apparatus can also be interconnected with other control entities.
The various embodiments and their combinations or subdivisions may be implemented as methods, apparatuses, or computer program products. According to an aspect at least some of the data processing functionalities are provided in virtualised environment. Methods for downloading computer program code for performing the same may also be provided. Computer program products may be stored on non-transitory computer-readable media, such as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD, magnetic disk, or semiconductor memory. Method steps may be implemented using instructions operable to cause a computer to perform the method steps using processor apparatus and memory. Computer readable instructions may be stored on any computer-readable media, such as memory or non-volatile storage.
The data processing apparatus may be provided by means of one or more data processors. The data processing apparatus can be arranged remotely from the camera apparatus. The functions may be provided by separate processors or by an integrated processor. All or part of the processing may be integrated with the camera apparatus. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi core processor architecture, as non-limiting examples. The data processing may be distributed across several data processing modules.
Experiments: the herein disclosed method has been tested to see how it performs compared to existing methods. The results of the testing are shown in Tables 1 and 2 below. The herein disclosed method is labelled as "Ours" in the tables.
In the experiments the method was compared with a recent automatic alignment method known as Deep360Up. The method is described by Jung, R., Lee, A.S.J., Ashtari, A., Bazin, J. in "Deep360up: A deep learning-based approach for automatic vr image upright adjustment." 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). (March 2019).
An alignment method known as Coarse2Fine was also considered. This was only compared with the method on the TestDeep360Up dataset, as the Coarse2Fine method could not be retrained, so the results are copied, where possible, from the paper by Shan, Y., Li, S.: "Discrete spherical image representation for CNN-based inclination estimation", IEEE Access 8 (2020), p. 2008-2022.
Improved performance on the so-called Sun360 dataset was observed, as well as on a new dataset of collected construction images. The Sun360 dataset is explained by Xiao, J., Ehinger, K.A., Oliva, A., Torralba, A. in "Recognizing scene viewpoint using panoramic place representation", 2012 IEEE Conference on Computer Vision and Pattern Recognition (June 2012), p. 2695-2702.
Due to the synthetic rotation of images, it was ensured that the network was not cheating by simply using rotational artifacts to solve the problem. To ensure that the method was not simply learning rotational artifacts, a test set of unlevelled images was collected, where the vertical direction had been calculated manually. There can be rotational artifacts which alone can be used to determine the vertical direction, and therefore it is advantageous to have a test set for which no rotations have been applied, so that it can be checked that the network has solved the task rather than just learned to invert the rotation. To calculate the vertical direction manually, two vertical lines, vertical in the world frame, are manually identified, which allows construction of a plane parallel to the ground plane, by computing the normal vectors as explained above. This plane parallel to the ground trivially gives the vertical axis, as the axis orthogonal to the plane. By ensuring use of unrotated test images, data leakage due to rotational artifacts present in the images can be avoided.
Thus, to ensure that the network was solving the problem at hand, a dataset of images was collected from real life construction sites where raw captures could be taken, and the vertical axis of the raw capture was known. This allowed alignment of images where no artifacts were present.
The training and testing was based on three datasets: a synthetic dataset of noise, the Sun360 dataset, and finally the dataset of collected construction images. For all experiments the method was compared with that of Deep360Up and a baseline vanishing point (VP) method explained by Zhang, Y., Song, S., Tan, P., Xiao, J.: "PanoContext: A whole-room 3D context model for panoramic scene understanding", in Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.): Computer Vision - ECCV 2014, Cham, Springer International Publishing (2014) 668-686.
When possible the performance of the Coarse2Fine approach was also observed, by testing on the same test set.
A method that generates training data by rotating level images must avoid simply learning to invert the rotation applied to generate the data. Thus images of random noise were generated, and both the "Ours" method and the Deep360Up method were trained on them. In both cases the network could learn to undo the transformation, highlighting the advantage of an unrotated test set to be sure of the network's performance at test time. A new random noise image was generated at each training and validation step, meaning that this was not a result of over-fitting, as every image the network saw was different. For Deep360Up, the average angular error in this case was observed to be around 5 degrees, and for the herein disclosed method the generalised dice loss was noted to fall to 0.04. Both indicate that the network was able to significantly beat chance using only rotational artifacts.
An example of an input image and resulting segmentation can be seen in Figure 11. The process begins with an image of noise that is rotated to produce an input image and its corresponding ground truth segmentation. In the rotated image there are no cues as to the vertical direction. Despite this, the herein described approach, when trained on noise, is able to accurately segment the image.
Table 1 below shows the results of experimenting made based on "Sun360" dataset. This dataset consisted of 30,000 images. 24,000 images were used for training and 3000 images each for validation and testing. Every image in this dataset is already levelled, meaning that testing was not possible on any images which have no rotational artifacts. To get around this, the network was evaluated on unrotated images, as well as rotated images. Performance is reported on three subsets of data: the test set, the unrotated validation set, and a rotated validation set where all vertical directions are in the upper hemisphere. To compare the herein disclosed method with both Deep360Up and Coarse2Fine the report also includes results on a synthetically rotated test set, referred to as Test Deep360Up consisting of 17825 images that was previously used to evaluate both methods.
Evaluating on a combination of rotated and unrotated images demonstrated that the network has learned to solve the task at hand and was not simply inverting the rotational artifacts. Table 1 shows that the method is accurate, achieving 92.6% of estimates within 5 degrees for the rotated test set, and 97.8% for the flat test set.
(Table 1, reporting the Sun360 results for the compared methods, is not legibly reproduced in this text.)
Table 1
The third set of testing was done on a construction dataset, the results being shown in Table 2 below. This dataset consists of 10,054 images with a 90-10 split for training and validation, and 1006 images for testing. The imbalance in the number of images for training and validation compared to testing images came from how the data were collected. The training and validation data was obtained after a rotation had already been applied to level the images. In contrast, the testing data was gathered by hand so that testing could be done on unlevelled images, to ensure the network was not learning to invert the rotation. The herein disclosed method was found to be more accurate than existing techniques. Table 2 shows the approach is the most accurate on all datasets, achieving 98% of estimates within 5 degrees for the test set.
Percentage of estimated axes within an angular error threshold; thresholds increase from left to right (the exact threshold values are only partly legible in this reproduction, but include 5° and 10°):

Dataset      Method      (increasing angular error thresholds)
Val Rotated  Ours        23.1  59.0  79.2  88.2  93.2  96.4  97.2  97.5
             Deep360Up    3.0  12.1  27.4  42.7  58.3  82.5  91.6  95.7
             VP           4.2  13.4  22.3  28.9  33.3  38.9  41.6  43.0
Val Flat     Ours        25.3  66.3  87.4  93.5  96.0  97.9  98.3  98.7
             Deep360Up   12.4  25.3  39.6  50.8  61.5  82.4  93.5  96.7
             VP           2.8  11.6  25.7  30.1  47.4  60.8  66.0  73.2
Test         Ours        26.9  67.3  88.6  95.0  97.5  99.4  99.7  99.7
             Deep360Up    9.0  29.3  49.2  62.1  73.0  88.5  94.2  96.2
             VP           4.9  15.1  27.9  38.1  46.7  62.4  70.5  74.9
Table 2
The testing confirmed the accuracy of the auto-alignment method in which segmentation methods are combined with vanishing point features. Moreover, the testing demonstrated that casting the vertical axis estimation problem as a segmentation problem results in improved performance, whilst using segmentation techniques that are known as such.
One issue with the approach is the assumption that the vertical direction is already in the upper hemisphere. Though this is a reasonable assumption given how images are captured (where such misalignment is rarely an issue), and the availability of onboard sensors to roughly align an image, this can be remedied by segmenting the image into three classes: up, down and background. Doing so would allow calculation of a vertical axis as before, but then use the up or downness to vote for the up direction.
While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in various combinations in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The various aspects and features discussed above can be combined in manners not specifically shown by the drawings and/or described above.
The foregoing description provides by way of exemplary and non-limiting examples a full and informative description of exemplary embodiments and aspects of the invention. However, various modifications and adaptations falling within the scope of this disclosure may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims.

Claims (25)

  1. Claims 1. A method for aligning images to a reference plane, the method comprising: receiving image data for a spherical image captured from a scene, obtaining geometric cue data for the scene, inputting the image data and the geometric cue data into a neural segmentation network for segmentation of the spherical image using the image data and the geometric cue data, determining a direction orthogonal to the reference plane based on the segmentation of the spherical image, and aligning the spherical image based on the determined direction.
  2. 2. A method according to claim 1, comprising obtaining geometric cue data from 15 the received image data.
  3. 3. A method according to claim 1 or 2, wherein the geometric cue data comprises information of vanishing points.
  4. 4. A method according to claim 3, comprising calculating the vanishing points from the image data.
  5. 5. A method according to any preceding claim, comprising determining a normal vector to the reference plane in camera coordinates.
  6. 6. A method according to any preceding claim, wherein the reference plane comprises a ground plane, the method comprising: determining a vertical axis orthogonal to the ground plane, and using the vertical axis to orient the image.
  7. 7. A method according to any preceding claim, wherein the neural segmentation network comprises a Gated-Shape Convolutional Neural Network configured to process the geometric cue data.
  8. 8. A method according to any preceding claim, comprising processing backbone features and geometric cue data to generate an attention map.
  9. 9. A method according to claim 8, comprising feeding the attention map into an 5 atrous spatial pooling layer.
  10. 10. A method according to any preceding claim, comprising deriving the geometric cue data from information of edges of objects in the spherical image.to
  11. 11. A method according to any claim 10, comprising generating a vanishing point image based on information of edges related to vanishing points.
12. A method according to any preceding claim, wherein the determining of the direction orthogonal to the reference plane comprises processing the segmentation to determine a single direction.
13. A method according to any preceding claim, comprising segmenting all points on the sphere of the spherical image within five degrees of the two points where the axis orthogonal to the reference plane intersects the sphere.
14. A method according to any preceding claim, comprising training the neural segmentation network based on information of a weighted generalised dice loss on uniformly distributed points on the sphere.
15. A method according to claim 14, the training comprising rotating the image to generate pairs of the image and a predefined direction.
16. A method according to any preceding claim, comprising using a test set of data comprising information of unrotated images to test that the neural segmentation network has not inverted a rotation in training data.
17. A data processing apparatus for aligning images to a reference plane, the data processing apparatus comprising at least one processor, and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to:
receive image data for a spherical image captured from a scene,
obtain geometric cue data for the scene,
input the received image data and the determined geometric cue data into a neural segmentation network for segmentation of the spherical image,
determine a direction orthogonal to the reference plane based on the segmentation of the spherical image, and
align the spherical image based on the determined direction.
18. A data processing apparatus according to claim 17, configured to obtain the geometric cue data from the received image data by calculating the vanishing points from the image data.
19. A data processing apparatus according to claim 17 or 18, wherein the geometric cue data comprises information of vanishing points.
20. A data processing apparatus according to any of claims 17 to 19, wherein the reference plane comprises a ground plane, the apparatus being configured to determine a vertical axis orthogonal to the ground plane and use the vertical axis to orient the image.
21. A data processing apparatus according to any of claims 17 to 20, wherein the neural segmentation network comprises a Gated-Shape Convolutional Neural Network configured to process the geometric cue data.
22. A data processing apparatus according to any of claims 17 to 21, configured to at least one of:
process backbone features and geometric cue data to generate an attention map,
input an attention map generated based on geometric cue data into an atrous spatial pooling layer,
determine a normal vector to the reference plane in camera coordinates,
use information of edges of objects in the spherical image to determine a vanishing point image,
determine a single direction by the segmentation, and
segment all points on the sphere of the spherical image within five degrees of the two points where the axis orthogonal to the reference plane intersects the sphere.
23. A data processing apparatus according to any of claims 17 to 22, wherein training of the neural segmentation network is based on information of a weighted generalised dice loss on uniformly distributed points on the sphere.
24. A data processing apparatus according to claim 23, configured to rotate the image in training to generate pairs of the image and a predefined direction.
25. A data processing apparatus according to any of claims 17 to 24, configured to use a test set of data comprising information of unrotated images to test that the neural segmentation network has not inverted a rotation in training data.
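The claims above describe the processing chain at a high level. As a minimal sketch only, and not the patented implementation, the Python/NumPy outline below shows one way the steps of claims 1 and 17 could fit together for an equirectangular spherical image. The segmentation network is abstracted behind a caller-supplied segment_fn, and the helper names and the eigenvector-based direction estimate (claims 5 and 12) are illustrative assumptions.

    import numpy as np

    def pixel_directions(h, w):
        # Unit vector on the sphere for every pixel of an H x W equirectangular image.
        lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi      # +pi/2 (top row) .. -pi/2 (bottom row)
        lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi      # -pi .. +pi
        lat, lon = np.meshgrid(lat, lon, indexing="ij")
        return np.stack([np.cos(lat) * np.cos(lon),
                         np.cos(lat) * np.sin(lon),
                         np.sin(lat)], axis=-1)                 # shape (H, W, 3)

    def estimate_vertical(probs):
        # Reduce a per-pixel segmentation (probability of lying near the vertical axis)
        # to a single direction orthogonal to the reference plane (claims 5 and 12).
        h, w = probs.shape
        dirs = pixel_directions(h, w)
        scatter = np.einsum("hw,hwi,hwj->ij", probs, dirs, dirs)
        _, vecs = np.linalg.eigh(scatter)
        axis = vecs[:, -1]                                      # principal axis of the segmented points
        return axis if axis[2] >= 0 else -axis                  # sign choice here is a heuristic assumption

    def align(image, cue, segment_fn, rotate_fn):
        # Claims 1 and 17: image + geometric cue -> segmentation -> direction -> aligned image.
        probs = segment_fn(image, cue)                          # (H, W) network output, placeholder
        up = estimate_vertical(probs)
        return rotate_fn(image, up)                             # resample so 'up' maps to the canonical vertical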
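Claims 3-4 and 10-11 derive the geometric cue from edges and vanishing points. A full vanishing point image would cluster the great circles induced by detected line segments; as a much simpler stand-in, the sketch below (assuming OpenCV is available) only rasterises detected straight edges into an extra input channel, which is not the patent's cue computation.

    import cv2
    import numpy as np

    def edge_cue_channel(image_bgr):
        # Rasterise detected straight edges into a single-channel cue map in [0, 1].
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 50, 150)
        lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 80,
                                minLineLength=30, maxLineGap=10)
        cue = np.zeros(gray.shape, dtype=np.float32)
        if lines is not None:
            for x1, y1, x2, y2 in lines[:, 0]:
                cv2.line(cue, (int(x1), int(y1)), (int(x2), int(y2)), color=1.0, thickness=2)
        return cue

The resulting channel would then be concatenated with the image channels, or passed through the gate in the next sketch, before entering the segmentation network.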
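Claims 7-9 recite a Gated-Shape Convolutional Neural Network whose backbone features and geometric cue are combined into an attention map that feeds an atrous spatial pooling layer. The PyTorch module below is a small, generic gate-plus-atrous block in that spirit only; it is not the Gated-SCNN architecture itself, and the channel counts and dilation rates are arbitrary assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CueGatedASPP(nn.Module):
        def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12)):
            super().__init__()
            # Gate producing an attention map from backbone features + cue (claim 8).
            self.gate = nn.Sequential(
                nn.Conv2d(in_ch + 1, 64, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, kernel_size=1),
                nn.Sigmoid(),
            )
            # Parallel dilated convolutions standing in for the atrous spatial pooling layer (claim 9).
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
                for r in rates
            ])
            self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

        def forward(self, feats, cue):
            # feats: (B, C, h, w) backbone features; cue: (B, 1, H, W) geometric cue image.
            cue = F.interpolate(cue, size=feats.shape[-2:], mode="bilinear", align_corners=False)
            attention = self.gate(torch.cat([feats, cue], dim=1))   # (B, 1, h, w) attention map
            gated = feats * attention
            out = torch.cat([F.relu(b(gated)) for b in self.branches], dim=1)
            return self.project(out), attention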
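For training data (claims 13 and 15), each image can be randomly rotated to produce (rotated image, known vertical) pairs, and the target segmentation marks every sphere point within five degrees of either intersection of the vertical axis with the sphere. The nearest-neighbour resampling and the random-rotation construction below are illustrative assumptions; pixel_directions is the helper from the first sketch.

    import numpy as np

    def random_rotation():
        # Approximately uniform random 3x3 rotation via QR of a Gaussian matrix.
        q, _ = np.linalg.qr(np.random.normal(size=(3, 3)))
        if np.linalg.det(q) < 0:
            q[:, 0] = -q[:, 0]
        return q

    def rotate_equirect(img, rot):
        # Nearest-neighbour resampling: output pixel looking along d shows the input at rot^T d.
        h, w = img.shape[:2]
        src = pixel_directions(h, w).reshape(-1, 3) @ rot        # each row is rot^T applied to an output direction
        lat = np.arcsin(np.clip(src[:, 2], -1.0, 1.0))
        lon = np.arctan2(src[:, 1], src[:, 0])
        rows = np.clip(np.round((np.pi / 2 - lat) / np.pi * h - 0.5).astype(int), 0, h - 1)
        cols = np.clip(np.round((lon + np.pi) / (2 * np.pi) * w - 0.5).astype(int), 0, w - 1)
        return img[rows, cols].reshape(img.shape)

    def training_pair(upright_img):
        # Claim 15: rotate an upright image and record where the vertical axis ends up.
        rot = random_rotation()
        rotated = rotate_equirect(upright_img, rot)
        up = rot @ np.array([0.0, 0.0, 1.0])                     # vertical axis in the rotated frame
        dirs = pixel_directions(*upright_img.shape[:2])
        # Claim 13: label every point within five degrees of either axis-sphere intersection.
        label = (np.abs(dirs @ up) >= np.cos(np.deg2rad(5.0))).astype(np.float32)
        return rotated, up, label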
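Claims 14 and 23 train with a weighted generalised dice loss on uniformly distributed points on the sphere. One plausible reading, sketched below in PyTorch, is the generalised Dice loss of Sudre et al. with each equirectangular pixel weighted by its solid angle (cosine of latitude), so that pole pixels are not over-counted; this interpretation and the weighting are assumptions, not the patent's stated formula. Claim 16's check needs no extra machinery: a held-out set of images that were never rotated is scored to confirm the network predicts the canonical vertical rather than merely undoing the synthetic training rotations.

    import math
    import torch

    def solid_angle_weights(h, w):
        # Weight of each equirectangular pixel ~ cos(latitude), i.e. its solid angle,
        # approximating a loss over uniformly distributed sphere points.
        lat = math.pi / 2 - (torch.arange(h, dtype=torch.float32) + 0.5) / h * math.pi
        return torch.cos(lat).clamp(min=0.0).unsqueeze(1).expand(h, w)

    def weighted_generalised_dice_loss(probs, target, eps=1e-6):
        # probs, target: (B, C, H, W) predicted probabilities and one-hot masks.
        _, _, h, w = probs.shape
        a = solid_angle_weights(h, w).to(probs)                  # (H, W), broadcast over batch and class
        inter = (a * probs * target).sum(dim=(2, 3))             # (B, C)
        total = (a * (probs + target)).sum(dim=(2, 3))           # (B, C)
        class_w = 1.0 / ((a * target).sum(dim=(2, 3)) ** 2 + eps)
        dice = 2.0 * (class_w * inter).sum(dim=1) / ((class_w * total).sum(dim=1) + eps)
        return (1.0 - dice).mean()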
GB2002884.1A 2020-02-28 2020-02-28 Aligning images Active GB2592583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2002884.1A GB2592583B (en) 2020-02-28 2020-02-28 Aligning images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2002884.1A GB2592583B (en) 2020-02-28 2020-02-28 Aligning images

Publications (3)

Publication Number Publication Date
GB202002884D0 GB202002884D0 (en) 2020-04-15
GB2592583A 2021-09-08
GB2592583B GB2592583B (en) 2024-01-24

Family

ID=70278639

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2002884.1A Active GB2592583B (en) 2020-02-28 2020-02-28 Aligning images

Country Status (1)

Country Link
GB (1) GB2592583B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033395A (en) * 2021-03-25 2021-06-25 太原科技大学 Drivable region segmentation method based on DeFCN and vanishing point edge detection
CN115062673B (en) * 2022-07-28 2022-10-28 中国科学院自动化研究所 Image processing method, image processing device, electronic equipment and storage medium

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
CORDTS, M., OMRAN, M., RAMOS, S., REHFELD, T., ENZWEILER, M., BENENSON, R., FRANKE, U., ROTH, S., SCHIELE, B.: "The cityscapes dataset for semantic urban scene understanding", PROC. OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)
FLEET, D., PAJDLA, T., SCHIELE, B., TUYTELAARS, T. (eds): "Computer Vision - ECCV 2014", 2014, SPRINGER INTERNATIONAL PUBLISHING, pages: 668 - 686
JUNG RAEHYUK ET AL: "Deep360Up: A Deep Learning-Based Approach for Automatic VR Image Upright Adjustment", 2019 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES (VR), IEEE, 23 March 2019 (2019-03-23), pages 1 - 8, XP033597864, DOI: 10.1109/VR.2019.8798326 *
JUNG, R., LEE, A.S.J., ASHTARI, A., BAZIN, J.: "2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)", March 2019, article "Deep360up: A deep learning-based approach for automatic vr image upright adjustment"
LIU RUYU ET AL: "Absolute Orientation and Localization Estimation from an Omnidirectional Image", 27 July 2018, INTERNATIONAL CONFERENCE ON FINANCIAL CRYPTOGRAPHY AND DATA SECURITY; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 309 - 316, ISBN: 978-3-642-17318-9, XP047481195 *
SHAN YUHAO ET AL: "Discrete Spherical Image Representation for CNN-Based Inclination Estimation", IEEE ACCESS, IEEE, USA, vol. 8, 25 December 2019 (2019-12-25), pages 2008 - 2022, XP011766122, DOI: 10.1109/ACCESS.2019.2962133 *
SHAN, Y., LI, S.: "Discrete spherical image representation for cnn-based inclination estimation", IEEE ACCESS, vol. 8, 2020, pages 2008 - 2022, XP011766122, DOI: 10.1109/ACCESS.2019.2962133
SZEGEDY, C., VANHOUCKE, V., IOFFE, S., SHLENS, J., WOJNA, Z.: "2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)", June 2016, IEEE, article "Rethinking the inception architecture for computer vision"
TAKIKAWA TOWAKI ET AL: "Gated-SCNN: Gated Shape CNNs for Semantic Segmentation", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 5228 - 5237, XP033723593, DOI: 10.1109/ICCV.2019.00533 *
XIAO, J., EHINGER, K.A., OLIVA, A., TORRALBA, A.: "Recognizing scene viewpoint using panoramic place representation", 2012 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, June 2012 (2012-06-01), pages 2695 - 2702, XP032232390, DOI: 10.1109/CVPR.2012.6247991
YICHAO ZHOU ET AL: "NeurVPS: Neural Vanishing Point Scanning via Conic Convolution", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 October 2019 (2019-10-14), XP081575881 *
ZHANG, Y., SONG, S., TAN, P., XIAO, J.: "A whole-room 3d context model for panoramic scene understanding", PANOCONTEXT

Also Published As

Publication number Publication date
GB202002884D0 (en) 2020-04-15
GB2592583B (en) 2024-01-24

Similar Documents

Publication Publication Date Title
Henriques et al. Mapnet: An allocentric spatial memory for mapping environments
US11838606B2 (en) Methods and systems for large-scale determination of RGBD camera poses
US10254845B2 (en) Hand gesture recognition for cursor control
EP3576017A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
Saxena et al. 3-d depth reconstruction from a single still image
Robert et al. Learning multi-view aggregation in the wild for large-scale 3d semantic segmentation
Dou et al. Exploring high-level plane primitives for indoor 3d reconstruction with a hand-held rgb-d camera
CN110009732B (en) GMS feature matching-based three-dimensional reconstruction method for complex large-scale scene
Pintore et al. 3D floor plan recovery from overlapping spherical images
Zou et al. 3d manhattan room layout reconstruction from a single 360 image
Jin et al. Occlusion-aware unsupervised learning of depth from 4-d light fields
CN111524168A (en) Point cloud data registration method, system and device and computer storage medium
Davidson et al. 360 camera alignment via segmentation
Pintore et al. Recovering 3D existing-conditions of indoor structures from spherical images
Chen et al. A particle filtering framework for joint video tracking and pose estimation
US20230334727A1 (en) 2d and 3d floor plan generation
GB2592583A (en) Aligning images
WO2024109772A1 (en) Face posture estimation method and apparatus based on structured light system
CN102903101A (en) Method for carrying out water-surface data acquisition and reconstruction by using multiple cameras
CN112614167A (en) Rock slice image alignment method combining single-polarization and orthogonal-polarization images
Dai et al. Unbiased iou for spherical image object detection
Shan et al. Discrete spherical image representation for cnn-based inclination estimation
Gava et al. Dense scene reconstruction from spherical light fields
Kutulakos Shape from the light field boundary
Jiao et al. Robust localization for planar moving robot in changing environment: A perspective on density of correspondence and depth