GB2583774A - Stereo image processing - Google Patents

Stereo image processing

Info

Publication number: GB2583774A
Application number: GB1906652.1A
Authority: GB (United Kingdom)
Prior art keywords: image, stereo, determining, mesh, pixels
Legal status: Granted; Active
Other versions: GB2583774B (en), GB201906652D0 (en)
Inventor: Gao Chao
Current Assignee: Robok Ltd
Original Assignee: Robok Ltd

Application filed by Robok Ltd
Priority to GB1906652.1A
Publication of GB201906652D0
Publication of GB2583774A
Application granted
Publication of GB2583774B

Classifications

    • G06T 7/593 Depth or shape recovery from multiple images, from stereo images
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T 17/30 Polynomial surface description
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30236 Traffic on road, railway or crossing
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06T 2207/30261 Obstacle
    • B60K 35/00 Instruments specially adapted for vehicles; Arrangement of instruments in or on vehicles
    • B60K 35/211 Output arrangements using visual output producing three-dimensional [3D] effects, e.g. stereoscopic images
    • B60K 35/28 Output arrangements characterised by the type or purpose of the output information
    • B60K 2360/176 Camera images
    • H04N 13/128 Adjusting depth or disparity
    • H04N 13/239 Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • H04N 2013/0081 Depth or disparity estimation from stereoscopic image signals
    • H04N 2013/0092 Image segmentation from stereoscopic image signals

Abstract

A method of generating a three-dimensional representation of an environment from stereo image data 302 comprises: selecting 308 a set of feature points in each image, calculating the stereo disparities between the points, joining 310 the points in each image to form a polygonal mesh for each image, and using the disparity between corresponding points to determine 312 the plane parameters for each polygon of the mesh. The method may be used in an advanced driving system comprising a stereoscopic pair of cameras. A second method of processing stereo image data comprises: using stereo matching to generate a 3D representation of the environment for object detection, and using the representation to determine a transformation to be applied to the first image or the second image such that distant portions of the image are expanded and near portions are shrunk. A further method of processing stereo image data for application in an object detection system comprises: using stereo matching to produce a 3D representation, and determining, using the 3D representation, one or more 2D regions of interest, where each of the regions of interest is associated with a portion of the generated 3D representation.

Description

STEREO IMAGE PROCESSING
Technical Field
The present invention relates to stereo image data processing. The invention has particular, but not exclusive, relevance to the generation of input data for an advanced driver assistance system (ADAS) or an automated driving system (ADS).
Background
In order to prevent or mitigate risks associated with human driving errors or limitations, a vehicle may be fitted with an advanced driver assistance system (ADAS) which is capable of automatically performing collision avoidance actions such as emergency braking, pedestrian crash avoidance mitigation (PCAM), and automatic lane centring, as well as generating driver alerts such as lane departure warnings or proximity alerts. An autonomous vehicle, on the other hand, may be controlled by an automated driving system (ADS) without any need for human input.
In order to function effectively, an ADAS or ADS requires a high-frequency stream of input data including, for example, a three-dimensional representation of the environment surrounding the vehicle, and/or object detection data indicating objects in the vicinity of the vehicle. Various methods are known for determining distances to objects for the purpose of generating a three-dimensional representation of an environment, for example sonar, radar, velocity-based methods, and stereo matching. Stereo matching involves identifying corresponding pixels in a pair of stereo images captured by stereo cameras, and determining distances to objects in the stereo image pair using the stereo disparity of the corresponding pixels. Existing stereo matching methods are highly computationally expensive, and are typically incapable of being performed at a sufficiently high frequency to generate input data for an ADAS or ADS without the use of highly specialised and/or expensive hardware components such as graphics processing units (GPUs), which may not be practicable to incorporate into a vehicle.
Summary
According to a first aspect of the present invention, there is provided a method of processing stereo image data to generate a three-dimensional representation of an environment, wherein the stereo image data includes a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels. The method includes selecting a plurality of representative points to represent the environment, the representative points being associated with a subset of the pixels of at least one of the first image and the second image, and determining stereo disparities for the selected representative points. The method includes generating, by connecting the selected representative points, a mesh comprising a plurality of polygons, each having a respective plurality of mesh vertices, and determining, using the respective stereo disparities of the mesh vertices, plane parameters for each of the polygons of the generated mesh, whereby the plurality of polygons and the determined plane parameters provide the three-dimensional representation of the environment.
According to a second aspect of the present invention, there is provided a method of processing the stereo image data to generate input data for an object detection system, wherein the stereo image data includes a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels. The method includes processing the first image and the second image using stereo matching to generate a three-dimensional representation of the environment, and determining, using the generated three-dimensional representation, a transformation to be applied to the first image or the second image, wherein the transformation expands regions of the first image or the second image corresponding to distant portions of the three-dimensional representation and shrinks regions of the first image or the second image corresponding to nearby portions of the three-dimensional representation. The method further includes transforming the first image or the second image in accordance with the determined transformation, to generate a transformed image for input to the object detection system.
According to a third aspect of the present invention, there is provided a further method of processing stereo image data to generate input data for an object detection system, wherein the stereo image data includes a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels. The method includes processing the first image and the second image using stereo matching to generate a three-dimensional representation of the environment, and determining, using the generated three-dimensional representation of the environment, one or more two-dimensional regions of interest (ROIs) of the first image or the second image for input to the object detection system, wherein each of the one or more ROIs is associated with a respective predetermined three-dimensional portion of the generated three-dimensional representation of the environment.
Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.
Brief Description of the Drawings
Figure 1 is a schematic block diagram representing an image processing system in accordance with an embodiment of the present invention.
Figure 2 shows an example of a stereo image pair.
Figure 3 is a flow diagram representing a method of processing stereo image data in accordance with an embodiment of the present invention.
Figure 4 shows an example of an image patch used to determine a feature descriptor for a pixel in an image.
Figure 5 is a flow diagram representing a method of determining a set of support points in accordance with an embodiment of the present invention.
Figures 6a, 6b, and 6c show an example of a set of support points being determined for the stereo image pair of Figure 2.
Figures 7a and 7b show an example of outliers being removed from the set of support points of Figure 6.
Figure 8 shows an example of a mesh generated from the support points of Figure 7b.
Figure 9 shows an example of identifying a mesh vertex of the mesh of Figure 8 with an anomalous disparity.
Figure 10 shows an example of determining a disparity search range for the anomalous mesh vertex of Figure 9.
Figure 11 is a flow diagram representing a method of refining a mesh in accordance with an embodiment of the present invention.
Figure 12 shows an example of the mesh of Figure 8 being refined.
Figures 13a and 13b show an example of a mesh being generated to include a connected component.
Figure 14 is a flow diagram representing a method of generating object detection data in accordance with an embodiment of the present invention.
Figure 15 is a flow diagram representing a further method of generating object detection data in accordance with an embodiment of the present invention.
Figure 16 shows an example of a region of interest (ROI) of one of the images of the stereo image pair of Figure 2.
Figure 17 shows an example of enhanced object detection data for one of the images of the stereo image pair of Figure 2.
Detailed Description
As shown in Figure 1, according to an embodiment of the present invention, a vehicle 100 includes an image processing system 101, which is arranged to receive stereo image data from a left camera 102 and a right camera 104. In this example, the left camera 102 and the right camera 104 are mounted to the vehicle 100 at equal heights but having a known separation along a horizontal axis perpendicular to the forward direction of the vehicle. The left camera 102 and the right camera 104 are directed substantially in the forward direction of the vehicle, such that the cameras are able to capture simultaneous views of an environment in front of the vehicle from two horizontally-separated locations.
The image processing system 101 includes a left image pre-processing module 106 and a right image pre-processing module 108. The left image pre-processing module 106 and the right image pre-processing module 108 are arranged to receive a left image and a right image from the left camera 102 and the right camera 104 respectively, and to perform pre-processing operations including image rectification and down-scaling. The left image and the right image are a stereo image pair, and correspond to simultaneous views of an environment from the horizontally-separated locations.
During image rectification, the image pre-processing modules 106, 108 transform the left image and the right image to account for distortions resulting from the lenses of the left camera 102 and the right camera 104, and the fact that the left camera 102 and the right camera 104 may not be perfectly coplanar (for example, the left camera 102 and the right camera 104 may face slightly towards each other or away from each other, as opposed to facing in a direction perpendicular to the horizontal axis between the cameras). After image rectification has been performed, the left image and the right image are transformed so as to appear to be captured from coplanar, distortion-free cameras. It is noted that for coplanar, distortion-free cameras, image rectification may be unnecessary.
In the present example, the image pre-processing modules 106, 108 down-scale the left image and the right image using interpolation to reduce the resolution of the left image and the right image. In this example, downscaling is performed after image rectification. In other examples, downscaling may be performed before image rectification. In some examples, alternative down-scaling methods such as max pooling or average pooling may be used instead of interpolation. As will be explained in more detail hereafter, the present invention is able to generate a three-dimensional representation of an environment for overlaying on a full resolution input image, even when the resolution of the input images has been reduced, without the need for computationally expensive interpolation or filtering of a low resolution output.

The image processing system 101 includes a stereo image processing module 110. The stereo image processing module 110 is arranged to receive the pre-processed left image from the left image pre-processing module 106 and the pre-processed right image from the right image pre-processing module 108, and to perform stereo matching to generate a three-dimensional representation of the environment appearing in the stereo image pair. A detailed example of a specific stereo matching method will be described in detail hereafter with reference to Figure 3.
In the present example, the stereo image processing system 101 is arranged to generate input data for an advanced driver assistance system (ADAS) 112 of the vehicle 100 on which the cameras 102, 104 are mounted. The ADAS 112 is arranged to use the input data to make decisions regarding actions for preventing and/or mitigating dangers associated with human errors or limitations, and accordingly to improve safety for the driver of the vehicle and others in the vicinity of the vehicle. These decisions are used to generate control signals for one or more actuators 114, which control driving functionality of the vehicle, for example steering, throttle, brakes, and/or gears.
The ADAS 112 may, for example, send control signals to the actuators to perform collision avoidance actions such as automatic emergency braking, pedestrian crash avoidance mitigation (PCAM) actions, and automatic lane centring. The ADAS 112 may also be configured to generate driver alerts and/or generate lane departure warnings. The image processing system 101 must be able to generate input data for the ADAS 112 in near real-time, and accordingly must be able to process images at a high frequency, for example at a frequency higher than 10Hz, higher than 50Hz, or higher than 100Hz. Methods described herein allow for images to be processed at sufficiently high frequencies, even using relatively simple processing hardware.
In other examples in accordance with the present invention, an autonomous vehicle may be provided with components corresponding to those of the vehicle 100 of Figure 1, but with an automated driving system (ADS) instead of, or in addition to, an ADAS. In such examples, the ADS is arranged to generate control signals to actuators for controlling the entire driving functionality of the vehicle.
Figure 2 shows an example of a stereo image pair 200 including a left image 202 captured using the left camera 102, and a right image 204 captured using the right camera 104, after image rectification and down-scaling have been performed. The left image 202 and the right image 204 in this example are greyscale digital images containing equal numbers of pixels, each having an associated pixel value indicating a pixel intensity. Methods described herein are also applicable to colour images, for example images encoded using RGB or YUV colour formats, in which case each pixel is associated with multiple pixel values corresponding to multiple colour channels.
The left image 202 and the right image 204 correspond to simultaneous views of an environment from two horizontally-separated locations. The horizontal separation has been exaggerated in Figure 2 for ease of illustration.
In practice, the horizontal separation of the left camera 102 and the right camera 104 (referred to as the baseline) is chosen according to the application of the image processing system. A longer baseline results in distances to farther objects being resolvable. For applications in which the invention is implemented within a vehicle, for example to provide input data for an ADAS or ADS, the baseline is typically between around 5 centimetres and 50 centimetres, for example between 15 centimetres and 30 centimetres. In some examples, a vehicle may have multiple pairs of stereo cameras with different baselines for resolving objects at different distances within an environment.
As a result of image rectification, the left image 202 and the right image 204 appear to have been captured by coplanar cameras pointing in a common direction, and hence corresponding pixels in the left image 202 and the right image 204 are located at the same vertical position as each other, but at different horizontal positions. The horizontal separation of corresponding pixels in the left image 202 and the right image 204 (typically measured in terms of a number of pixels) is referred to as the stereo disparity of the corresponding pixels. Pixels corresponding to closer objects have a greater stereo disparity than pixels corresponding to more distant objects, and therefore by determining the stereo disparity for corresponding pixels in the stereo image pair 200, it is possible to determine distances of objects from a plane containing the left camera 102 and the right camera 104 and perpendicular to the apparent common direction of the cameras 102, 104. The process of identifying corresponding pixels in a stereo image pair, for example to determine distances to objects, is referred to as stereo matching.
Figure 3 shows an example of a method performed by the stereo image processing module 110 of Figure 1, in which the stereo image pair 200 is processed using stereo matching to generate a three-dimensional representation of the environment appearing in the stereo image pair 200. As mentioned above, prior to the method being performed, the stereo image pair 200 is pre-processed using the image pre-processing modules 106, 108, which includes image rectification and down-scaling. The stereo image processing module 110 receives, at S302, the pre-processed stereo image pair 200, and determines, at S304, gradient-based matching scores for pixels of the left image 202 and for pixels of the right image 204. The gradient-based matching score for a given pixel is determined on the basis of a pixel value gradient of the pixel. In the present example, the pixel value gradient (g_x[i,j], g_y[i,j])^T of a pixel with co-ordinates (i,j) in an image is determined as half of the difference in pixel values of the neighbouring pixels in the horizontal and vertical directions, as shown in Equation (1):

    g_x[i,j] = (pixelValue[i+1,j] - pixelValue[i-1,j]) / 2,
    g_y[i,j] = (pixelValue[i,j+1] - pixelValue[i,j-1]) / 2.    (1)

In other examples, alternative methods may be used to determine a pixel value gradient, for example using the Sobel operator or the Scharr operator. In the case of a colour image, a pixel gradient may be determined for one or more colour channels. A gradient-based matching score for a pixel may then be determined, for example, on the basis of a maximum pixel gradient for the pixel, or on a gradient of an absolute pixel value.
In the present example, the gradient-based matching score SM for a pixel with pixel value gradient (g_x, g_y)^T is given by:

    SM = 1 / ( a(1 + g_y^2/g_x^2) + 2b/g_x^2 ),

where a and b are known parameters used to describe the image noise level associated with the cameras 102, 104. Specifically, a is the variance of the positioning error of the epipolar line (where the positioning error is assumed to be an isotropic Gaussian error), and b is the variance of the image intensity noise (assumed to be isotropic Gaussian noise). The denominator of the right hand side of this expression represents an estimate of the variance of the stereo disparity for the pixel. As a result, the gradient-based matching score is higher for pixels that have a lower estimated observation variance, which are more likely to be uniquely identified among pixels in the image.
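By way of illustration only, the following Python/NumPy sketch computes the central-difference gradients of Equation (1) and the corresponding matching scores. The noise parameters a and b, the small constant eps added to avoid division by zero, and the indexing convention are illustrative assumptions rather than values taken from the description.

```python
import numpy as np

def pixel_gradients(image):
    """Central-difference pixel value gradients, as in Equation (1).

    The first array index is treated as the horizontal co-ordinate i and the
    second as the vertical co-ordinate j; this convention is illustrative.
    """
    g_x = np.zeros_like(image, dtype=np.float64)
    g_y = np.zeros_like(image, dtype=np.float64)
    g_x[1:-1, :] = (image[2:, :] - image[:-2, :]) / 2.0
    g_y[:, 1:-1] = (image[:, 2:] - image[:, :-2]) / 2.0
    return g_x, g_y

def matching_scores(g_x, g_y, a=0.5, b=4.0, eps=1e-6):
    """Reciprocal of the estimated disparity variance, used as the matching score."""
    variance = a * (1.0 + g_y ** 2 / (g_x ** 2 + eps)) + 2.0 * b / (g_x ** 2 + eps)
    return 1.0 / variance

# Example usage on a synthetic single-channel image.
image = np.random.rand(64, 64)
g_x, g_y = pixel_gradients(image)
scores = matching_scores(g_x, g_y)
```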
Having determined the gradient-based matching scores, the stereo image processing module 110 determines feature descriptors for the pixels of the left image 202 and pixels of the right image 204. In the present example, the feature descriptor for a pixel with co-ordinates (i,j) is given by a vector of pixel gradients for pixels in an N x N patch centred at the pixel with co-ordinates (i,j). Figure 4 shows a simplified example of an image 400 containing 9 x 9 pixels. The feature descriptor DF[4,3] for the pixel with co-ordinates (4,3), shown with solid fill in Figure 4, is given by an 18-component vector with components given by Equation (2):

    DF[4,3] = (g_x[3,2], g_y[3,2], ..., g_x[5,4], g_y[5,4])^T,    (2)

where the vector includes the horizontal and vertical components of the pixel gradients for each of the pixels in the 3 x 3 patch centred at (4,3), shown shaded in Figure 4. It will be appreciated that using a larger patch for a feature descriptor of a given pixel will result in a greater probability of the feature descriptor being unique among pixels of the image, at the expense of a higher computational cost for comparing feature descriptors.
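A minimal sketch of the descriptor construction is given below, assuming gradient arrays g_x and g_y of the kind computed above; border handling is omitted and the function name feature_descriptor is a placeholder.

```python
import numpy as np

def feature_descriptor(g_x, g_y, i, j, patch_size=3):
    """Concatenated horizontal and vertical gradients of the patch centred at (i, j).

    For patch_size=3 this yields an 18-component vector, as in Equation (2).
    Pixels near the image border are not handled in this sketch.
    """
    r = patch_size // 2
    gx_patch = g_x[i - r:i + r + 1, j - r:j + r + 1]
    gy_patch = g_y[i - r:i + r + 1, j - r:j + r + 1]
    # Interleave the g_x and g_y components pixel by pixel.
    return np.stack([gx_patch, gy_patch], axis=-1).reshape(-1)

# Example: descriptor for the pixel at (4, 3) of a synthetic 9 x 9 gradient field.
g_x = np.random.rand(9, 9)
g_y = np.random.rand(9, 9)
descriptor = feature_descriptor(g_x, g_y, 4, 3)   # 18 components
```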
It will be appreciated that other forms of feature descriptors may be used as alternatives to the feature descriptor described above, without departing from the scope of the invention. Other examples of feature descriptors include SIFT, SURF, ORB, and BRIEF. A feature descriptor should be chosen to have a high probability of uniqueness for pixels in the image, whilst being simple enough for comparisons to be made between feature descriptors of pixels at a relatively low computational cost.
Returning to Figure 3, the stereo image processing module 110 selects, at S308, a set of representative support points to represent the environment, and determines stereo disparities for the selected support points. In this example, the support points correspond to a subset of the pixels of the left image 202 and a subset of the pixels of the right image 204. In other examples, the support points may correspond to a subset of pixels from only one image of a stereo image pair. The set of support points is sparse in comparison to the full sets of pixels in the left image 202 and the right image 204, but is nevertheless determined so as to be representative of the environment, ensuring accuracy of the resulting three-dimensional representation of the environment. It is noted that neither the sparsity nor the optimality of the determined set of support points is affected by the down-scaling of the image by the image pre-processing modules 106, 108, and therefore down-scaling the image prior to determining the set of support points can result in a reduction in computational cost, without losing accuracy of the three-dimensional representation of the environment.
Figure 5 shows an example of a method performed by the stereo image processing module 110 for selecting a set of support points and determining stereo disparities for the support points. The stereo image processing module 110 divides, at S502, the images 202, 204 of the stereo image pair 200 into portions. Figures 6a and 6b show examples in which the left image 202 and the right image 204 have each been divided into nine portions using a rectangular 3 x 3 grid. The 3 x 3 grid has been chosen for ease of illustration, but in practice a grid with a significantly higher resolution should be used, resulting in a significantly higher number of portions for each image.
The stereo image processing module 110 determines, at S504, a candidate pixel within each of the determined portions. The candidate pixel within a portion is determined as the pixel with the highest gradient-based matching score in the portion, as determined at S304. The candidate pixel therefore represents the pixel estimated to have the highest probability of being uniquely identified, and therefore the pixel for which the probability of successfully identifying a corresponding pixel in the other image of the stereo image pair 200 is highest. The black and white circles in Figures 6a and 6b respectively show locations of candidate pixels within the portions of the left image 202 and the right image 204. By determining one candidate pixel for each portion, the set of candidate points for each of the images 202, 204 is guaranteed to be approximately evenly spread throughout the image. This ensures that the resulting three-dimensional representation accurately captures the environment throughout the extent of the stereo image pair 200.
The stereo image processing module 110 identifies, at S506, pixels in the right image 204 and the left image 202 that correspond to the determined candidate pixels of the left image 202 and the right image 204 respectively.
The corresponding pixels are determined on the basis of the determined feature descriptors. More precisely, in the present example a pixel p' in the right image 204 corresponding to a candidate pixel p in the left image 202 is determined as the pixel in a set Qp having a feature descriptor DF(p') with a minimum absolute difference from the feature descriptor DF(p) of the candidate pixel, as shown in Equation (3):

    p' = argmin_{q in Qp} |DF(p) - DF(q)|.    (3)

In cases where the minimum absolute difference is not unique up to a predetermined threshold, the pixel p' is unable to be uniquely identified. In the present example, due to image rectification performed by the image pre-processing modules 106, 108, the corresponding pixel to a given candidate pixel will appear on a horizontal line segment starting at the candidate pixel and extending in a known direction. For a candidate pixel in the left image 202, the location of the corresponding pixel in the right image 204 will be horizontally to the left of the location of the candidate pixel. For a candidate pixel in the right image 204, the location of the corresponding pixel in the left image 202 will be horizontally to the right of the location of the candidate pixel. The set Qp for the candidate pixel p therefore contains pixels lying on the appropriate horizontal line segment starting from the candidate pixel p. It is noted that for examples in which image rectification has not been performed, and where two cameras used to capture a stereo image pair are not perfectly coplanar, corresponding pixels will lie on slanted line segments as opposed to horizontal line segments.
Once the corresponding pixels in the right image 204 have been identified for the candidate pixels in the left image 202, and the corresponding pixels in the left image 202 have been identified for the candidate pixels in the right image 204, the stereo image processing module 110 determines, at S508, a respective stereo disparity for each of the determined candidate pixels using the respective corresponding pixels determined at S506. The stereo disparity for a given candidate pixel is determined as the distance, in number of pixels, between the location of the candidate pixel in one image of the stereo image pair 200 and the location of the respective corresponding pixel in the other image of the stereo image pair 200. As mentioned above, for certain candidate pixels, the stereo image processing module 110 may be unable to uniquely identify a corresponding pixel in the other image. The stereo disparities for such candidate pixels are determined to be unknown.
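The search for a corresponding pixel along the horizontal segment, and the resulting disparity, might be sketched as follows. The descriptor arrays desc_left and desc_right, the maximum disparity, and the uniqueness ratio used to reject ambiguous matches are illustrative assumptions; the description does not prescribe these values.

```python
import numpy as np

def match_disparity(desc_left, desc_right, i, j, max_disparity=64, uniqueness=0.9):
    """Disparity of candidate pixel (i, j) of the left image, found by minimising the
    absolute descriptor difference along the horizontal search segment (Equation (3)).
    Returns None when the minimum is not unique enough or the search leaves the image."""
    ref = desc_left[i, j]
    upper = min(max_disparity, i)          # the corresponding pixel lies to the left
    if upper < 1:
        return None
    costs = np.array([np.abs(ref - desc_right[i - d, j]).sum()
                      for d in range(upper + 1)])
    order = np.argsort(costs)
    best, second = order[0], order[1]
    if costs[best] > uniqueness * costs[second]:   # ambiguous match: reject
        return None
    return int(best)

# Example with random 9 x 9 grids of 18-component descriptors.
desc_left = np.random.rand(9, 9, 18)
desc_right = np.random.rand(9, 9, 18)
d = match_disparity(desc_left, desc_right, i=6, j=4, max_disparity=5)
```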
The candidate pixels of the left image 202 (the black circles in Figure 6a) and the candidate pixels of the right image 204 (the white circles in Figure 6b) correspond to different views of an environment. In order to determine a set of support points corresponding to a common view of the environment, the stereo image processing module 110 projects, at S510, either the candidate pixels of the left image 202 to the right image 204, or the candidate pixels of the right image 204 to the left image 202, using the respective stereo disparities determined at S508. In other words, the stereo image processing module 110 displaces the candidate pixels from one image by an amount and in a direction corresponding to the respective stereo disparities. Accordingly, the determined set of support points corresponds either to the candidate pixels of the left image 202 and the projected candidate pixels of the right image 204, or the candidate pixels of the right image 204 and the projected candidate pixels of the left image 202. Figure 6c shows an example of a set of support points corresponding to the candidate pixels of the left image 202 (the black circles) and the projected candidate pixels of the right image 204 (the white circles).
Note that the projection of the candidate pixels of the right image 204 has been exaggerated in Figure 6c for the purpose of illustration.
The stereo image processing module 110 removes, at S512, outliers from the determined set of support points. In the present example, removing the outliers includes removing support points from any of the determined portions containing only one support point. Figure 7a shows the example set of support points described above before any outliers have been removed. It is observed that the top-left portion includes only one support point. Figure 7b shows the example set of support points after outliers have been removed. The support point from the top-left portion has been removed.
Removing the outliers also includes removing support points from any of the portions containing support points determined to have a range of stereo disparities greater than a predetermined threshold range. In other words, when the difference between the maximum stereo disparity and the minimum stereo disparity determined for support points within a given portion exceeds the predetermined threshold range, support points within the given portion are removed as outliers. In the example set of support points in Figure 7b, the support points in the top-middle portion are determined to have a range of disparities greater than the predetermined threshold range, and therefore have been removed from the set of support points shown in Figure 7b.
Returning to the method of Figure 3, the stereo image processing module 110 generates, at S310, a mesh by connecting support points from the determined set of support points. The mesh is formed of a set of polygons, each having a respective set of mesh vertices. In the present example, the support points are connected by non-intersecting straight line segments such that each support point is connected to at least two other support points and each of the polygons in the generated mesh is a triangle. Figure 8 shows an example of an intermediate mesh 800 generated by joining the support points of Figure 7b (after outliers have been removed). In some examples of the present invention, no further processing is required to generate a final mesh, in which case the intermediate mesh 800 would be used to generate the three-dimensional representation of the environment. In the present example, however, generating the mesh further includes an additional refinement stage in which triangles with areas greater than a predetermined threshold area are divided into smaller triangles, as will be described in more detail hereafter with reference to Figure 11.
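The description does not prescribe a particular way of connecting the support points; a Delaunay triangulation is one common choice that yields non-intersecting triangles, and is used in the following sketch purely for illustration. The support point positions and disparities below are made-up values.

```python
import numpy as np
from scipy.spatial import Delaunay

# support_points: (M, 2) array of (x, y) pixel positions of the support points;
# disparities: length-M array of stereo disparities (np.nan where unknown).
support_points = np.array([[10, 12], [200, 15], [105, 180], [30, 170], [220, 160]], float)
disparities = np.array([12.0, 11.5, 20.0, np.nan, 18.0])

tri = Delaunay(support_points)
# tri.simplices is an (n_triangles, 3) array of indices into support_points;
# each row gives the three mesh vertices of one triangle of the mesh.
for simplex in tri.simplices:
    vertices = support_points[simplex]
    vertex_disparities = disparities[simplex]
```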
In some examples, the stereo disparities of all of the mesh vertices of the generated mesh are known. In other examples, one or more of the mesh vertices have unknown stereo disparities, and it is necessary to determine the stereo disparities of these mesh vertices. Examples of mesh vertices with unknown stereo disparities include mesh vertices for which corresponding pixels in the other image could not be uniquely identified. In the present example, the stereo image processing module 110 re-computes stereo disparities that are determined to be anomalous, as described hereafter with reference to Figure 9.

In order to determine anomalous stereo disparities of mesh vertices, the stereo image processing module 110 clusters neighbouring mesh vertices with known stereo disparities according to the known stereo disparities. In this example, the stereo image processing module 110 initializes a cluster at each mesh vertex, and iteratively merges any neighbouring clusters directly connected by two mesh vertices with a difference in stereo disparities determined to be less than a predetermined threshold difference. This iterative merging process is continued until no further clusters can be merged. Note that some mesh vertices may have unknown stereo disparities, and therefore do not belong to any clusters. In the mesh 800, reproduced in Figure 9, four clusters of mesh vertices are indicated by respective different point styles.
Having clustered the neighbouring mesh vertices with known stereo disparities, the stereo image processing module 110 determines any mesh vertex in a cluster containing fewer than a predetermined threshold number of mesh vertices as being anomalous, and sets the stereo disparity of any such mesh vertex to unknown. In the example of Figure 9, for the purpose of illustration, the predetermined threshold number is three. Accordingly, the stereo disparity of the mesh vertex 902, which is alone in a cluster containing only the mesh vertex 902, is set to unknown.
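A minimal sketch of the clustering and anomaly test is given below, assuming the mesh is supplied as an edge list over vertex indices. The union-find merge reaches the same fixed point as the iterative cluster merging described above; the threshold values are illustrative.

```python
import numpy as np

def flag_anomalous_vertices(edges, disparities, max_diff=3.0, min_cluster_size=3):
    """Merge neighbouring vertices whose disparities differ by less than max_diff,
    then mark every vertex in a cluster smaller than min_cluster_size as anomalous."""
    n = len(disparities)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:                                  # each edge joins two mesh vertices
        if np.isnan(disparities[u]) or np.isnan(disparities[v]):
            continue                                    # unknown disparities join no cluster
        if abs(disparities[u] - disparities[v]) < max_diff:
            parent[find(u)] = find(v)                   # merge the two clusters

    sizes = {}
    for v in range(n):
        if not np.isnan(disparities[v]):
            sizes[find(v)] = sizes.get(find(v), 0) + 1

    return [v for v in range(n)
            if not np.isnan(disparities[v]) and sizes[find(v)] < min_cluster_size]

# Example: a small mesh given as an edge list over five vertices.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
disparities = np.array([12.0, 12.5, 13.0, 40.0, np.nan])
print(flag_anomalous_vertices(edges, disparities))      # vertex 3 forms its own small cluster
```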
After determination of anomalous mesh vertices, the generated mesh includes a combination of mesh vertices with known stereo disparities and mesh vertices with unknown stereo disparities. For a mesh vertex with an unknown disparity, a stereo disparity is determined using the method described above with reference to S308, but with a restricted disparity search range determined in dependence on the stereo disparities of neighbouring mesh vertices with known stereo disparities. The neighbouring mesh vertices are those directly connected to the mesh vertex with the unknown stereo disparity. In the mesh 800, reproduced in Figure 10, the mesh vertex 902, which has an unknown stereo disparity, has five neighbouring mesh vertices with known disparities (shown as black circles in Figure 10). In the present example, the disparity search range is chosen to be slightly wider than the range of stereo disparities of the neighbouring mesh vertices. In particular, if the range of stereo disparities of the neighbouring mesh vertices is given by [dmin, dmax], the disparity search range is given by [(1 - x)dmin, (1 + x)dmax], where x is given by, for example, 0.02, 0.05, 0.1, 0.2, 0.5, or any other suitable value. Using a restricted disparity search range reduces the number of operations performed in identifying a corresponding pixel, and therefore reduces the overall computational cost of the stereo image processing method. It is noted that the greater the baseline for a pair of stereo cameras, the greater the stereo disparities of pixels in a stereo image pair captured by the pair of stereo cameras. This results in a higher sensitivity with regard to detecting distances to objects in an environment, but results in a greater disparity search range and hence a higher computational cost of determining the stereo disparities.
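A small sketch of the restricted search range, assuming the neighbours' known disparities are available as a list; the widening factor x = 0.1 is one of the example values mentioned above.

```python
def disparity_search_range(neighbour_disparities, x=0.1):
    """Search range for a vertex with unknown disparity: slightly wider than the range
    spanned by the known disparities of its directly connected neighbours."""
    d_min, d_max = min(neighbour_disparities), max(neighbour_disparities)
    return (1 - x) * d_min, (1 + x) * d_max

# e.g. neighbours with known disparities between 14 and 18 pixels
low, high = disparity_search_range([14.0, 15.5, 16.0, 17.2, 18.0])   # (12.6, 19.8)
```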
As mentioned above, generating a mesh in the present example includes an additional mesh refinement stage. As shown in Figure 11, the mesh refinement method includes identifying, at S1102, a triangle in the intermediate mesh having an area greater than a predetermined threshold area. The method includes identifying, at S1104, a pixel in a central region of the triangle determined to have a matching score greater than a predetermined threshold matching score. In the present example, the central region is a square region of predetermined size, centred at the centroid of the identified triangle. The method includes dividing, at S1106, the identified triangle into three smaller triangles each sharing two vertices with the identified triangle and having a third vertex corresponding to the identified pixel in the central region of the identified triangle. The method continues iteratively until the mesh contains no triangles with areas greater than the predetermined threshold area. In the present example, the resulting refined mesh therefore includes only a set of triangles with areas below the predetermined threshold area. Stereo matching is performed as described above to determine a stereo disparity of each of the mesh vertices generated during mesh refinement.
Figure 12 shows an example of a refined mesh 1200 generated from the intermediate mesh 800 of Figure 8, in accordance with the method of Figure 11. In this example, three triangles of the intermediate mesh 800 are identified with areas greater than the predetermined threshold area. Respective pixels (shown as striped circles in Figure 12) are identified in respective central regions (shown as dashed boxes in Figure 12) of each of the three identified triangles. The refined mesh 1200 is generated by dividing the three identified triangles as shown by the dashed lines in Figure 12.
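One refinement pass of the kind described above might be sketched as follows, assuming a per-pixel matching-score array and illustrative area, region and score thresholds; in practice the splitting is repeated until no triangle exceeds the area threshold, and bounds handling near the image border is omitted here.

```python
import numpy as np

def triangle_area(v0, v1, v2):
    """Area of a triangle from its three (x, y) vertices."""
    return 0.5 * abs((v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1]))

def split_large_triangle(vertices, scores, max_area=500.0, region=5, min_score=0.01):
    """If the triangle is too large, return three smaller triangles sharing a new vertex
    chosen near the centroid; otherwise return the triangle unchanged."""
    v0, v1, v2 = vertices
    if triangle_area(v0, v1, v2) <= max_area:
        return [vertices]
    cx, cy = np.mean(np.asarray(vertices, dtype=float), axis=0).astype(int)   # centroid
    r = region // 2
    window = scores[cx - r:cx + r + 1, cy - r:cy + r + 1]    # central square region
    if window.size == 0 or window.max() <= min_score:
        return [vertices]                                    # no suitable pixel found
    di, dj = np.unravel_index(np.argmax(window), window.shape)
    new_vertex = (cx - r + di, cy - r + dj)
    return [[v0, v1, new_vertex], [v1, v2, new_vertex], [v2, v0, new_vertex]]

# Example usage with a synthetic score array.
scores = np.random.rand(200, 200)
triangles = split_large_triangle([(10, 10), (150, 20), (60, 180)], scores)
```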
Returning to the method of Figure 3, the stereo image processing module 110 determines, at S312, plane parameters for each of the polygons of the generated mesh, using the respective stereo disparities determined for the mesh vertices. By treating each polygon of the generated mesh as a plane, the stereo disparities of pixels of the left image 202 can be inferred using Equation (4):

    d = ax + by + c,    (4)

where d is the stereo disparity of a pixel with position (x,y) in the left image 202, and a, b, c are plane parameters for the polygon containing the pixel. The pixel corresponds to a point in the environment a distance z from the plane containing the left camera 102 and the right camera 104 and perpendicular to the apparent common direction of the cameras 102, 104. The distance z is determined using the relationship z = bf/d, where b is the baseline and f is the focal length of the lenses of the left camera 102 and right camera 104.
In the present example, in which the generated mesh is formed of triangles each having three respective mesh vertices, the plane parameters of a given triangle are determined using the positions and stereo disparities of the three respective mesh vertices. Specifically, the plane parameters a, b, c are determined by solving a set of three simultaneous equations, each corresponding to Equation (4) with the respective known position and stereo disparity of one of the mesh vertices.
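A minimal sketch of the three-equation solve, followed by the disparity-to-distance conversion z = bf/d; the vertex positions, baseline, and focal length below are illustrative values only.

```python
import numpy as np

def plane_parameters(vertices, disparities):
    """Solve d = a*x + b*y + c for the three mesh vertices of a triangle."""
    A = np.array([[x, y, 1.0] for (x, y) in vertices])
    a, b, c = np.linalg.solve(A, np.asarray(disparities, dtype=float))
    return a, b, c

def disparity_at(a, b, c, x, y):
    """Inferred stereo disparity of a pixel inside the triangle (Equation (4))."""
    return a * x + b * y + c

# Example: a triangle with vertices at known disparities.
params = plane_parameters([(10, 12), (200, 15), (105, 180)], [12.0, 11.5, 20.0])
d = disparity_at(*params, x=100, y=60)
baseline, focal_length = 0.2, 800.0      # illustrative values (metres, pixels)
z = baseline * focal_length / d          # distance from the camera plane
```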
The method described above provides an efficient method of generating a three-dimensional representation of an environment, where the three-dimensional representation includes a mesh of polygons corresponding to planar regions with respective sets of plane parameters. In some examples, however, some of the processing performed by the stereo image processing module 110 can be avoided, thereby further improving the efficiency of the method. In particular, certain portions of an image may correspond to very distant regions of an environment (for example, sky regions), while other portions of the image may be untextured, and are therefore likely to be planar.
In accordance with an embodiment of the present invention, a method of detecting an untextured region within an image in a stereo image pair includes binarizing pixels of the image in dependence on the pixel values of the pixels. Pixels with pixel value gradients of magnitude greater than a predetermined threshold pixel value are assigned a first value (for example 0), and pixels with pixel value gradients of magnitude less than or equal to the predetermined threshold pixel value are assigned a second, different, value (for example 1). It is noted that, for colour images, pixel gradients may be determined for each colour channel, and binarization may be performed in dependence on the pixel value gradient with the greatest magnitude. A cluster of neighbouring pixels assigned the second value is determined to be untextured, and is an example of a connected component.
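A sketch of this binarisation is given below, using scipy.ndimage.label as one way of extracting the clusters of low-gradient pixels; the gradient threshold and minimum cluster size are illustrative assumptions rather than values from the description.

```python
import numpy as np
from scipy import ndimage

def untextured_components(g_x, g_y, gradient_threshold=2.0, min_pixels=50):
    """Label clusters of neighbouring pixels whose gradient magnitude is small."""
    magnitude = np.hypot(g_x, g_y)
    binary = magnitude <= gradient_threshold        # second value (1): low texture
    labels, count = ndimage.label(binary)           # connected components of the mask
    return [np.argwhere(labels == k) for k in range(1, count + 1)
            if np.sum(labels == k) >= min_pixels]

# Example on synthetic gradient fields.
g_x = np.random.rand(120, 160) * 5
g_y = np.random.rand(120, 160) * 5
regions = untextured_components(g_x, g_y)
```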
In accordance with an embodiment of the present invention, a method of detecting a distant region within an image in a stereo image pair includes subtracting pixel values or feature descriptors of pixels of one image of the stereo image pair from pixel values or feature descriptors of pixels of the other image of the stereo image pair to generate a delta image formed of a set of delta pixels. The method further includes binarizing the delta image in dependence on the delta pixel values, such that delta pixels with delta pixel values greater than a predetermined threshold pixel value are assigned a first value (for example 0), and delta pixels with delta pixel values less than or equal to the predetermined threshold pixel value are assigned a second, different, value (for example 1). In cases where the delta image is generated by subtracting multidimensional feature descriptors, each resulting delta pixel has multiple values corresponding to the dimensions of the feature descriptors. The image may then be binarized, for example, in dependence on the absolute value of the delta pixel values. It is noted that subtracting feature descriptors in this way results in more robust detection of distant regions, but a higher computational cost, than subtracting pixel values. A cluster of neighbouring pixels for which the corresponding delta pixel is assigned the second value is determined to be a distant region, and is an example of a connected component.
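The delta-image variant can be sketched similarly. In this sketch pixel values (rather than feature descriptors) are subtracted, and the difference threshold and minimum cluster size are again illustrative; the synthetic image pair is made up for the example.

```python
import numpy as np
from scipy import ndimage

def distant_components(left, right, delta_threshold=4.0, min_pixels=50):
    """Clusters of pixels whose left and right values nearly coincide (near-zero
    disparity), which correspond to very distant regions such as sky."""
    delta = np.abs(left.astype(np.float64) - right.astype(np.float64))
    binary = delta <= delta_threshold
    labels, count = ndimage.label(binary)
    return [np.argwhere(labels == k) for k in range(1, count + 1)
            if np.sum(labels == k) >= min_pixels]

# Example on a synthetic rectified pair with an identical (distant) band at the top.
left = np.random.randint(0, 255, (120, 160)).astype(np.float64)
right = np.roll(left, -8, axis=1)          # simulate an 8-pixel disparity elsewhere
left[:30], right[:30] = 200.0, 200.0       # identical "sky" region
regions = distant_components(left, right)
```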
It is noted that the methods described above for determining connected components corresponding to untextured or distant regions of an image may be performed on down-scaled images (for example, downscaled using interpolation or any other suitable down-scaling method), resulting in a reduction in computational cost. In some examples, an image is down-scaled prior to determining gradient-based matching scores and feature descriptors, as described above, and is further down-scaled prior to determining the connected components. For examples in which down-scaling is used, polygons of the resulting generated mesh must be scaled up to match the scale of the stereo image pair. Provided the down-scaling is linear, scaling up the polygons of the generated mesh only requires multiplication of the locations of the mesh vertices, and of the plane parameters, by a scale factor. Multiplying by a scale factor is significantly less computationally expensive than scaling up by interpolation or filtering of a low-resolution depth map, which represents an advantage of the present method over existing stereo matching methods.
Connected components, as described above, are polygons each having a set of vertices connected by a corresponding set of straight lines. Connected components correspond either to distant regions of an image, or candidate planar regions of an image. In either case, it is unnecessary to manually determine stereo disparities of pixels located within the connected component. For examples in which connected components are detected, the method of Figure 3 may still be applied, but selecting the set of support points at S308 includes expanding the original set of support points to include the vertices of the one or more determined connected components. In order to generate the mesh at S310 in this case, the support points of the expanded set of support points are connected by non-intersecting straight line segments such that each support point is connected to at least two other support points to form a triangle, where no line segment passes through a connected component. The resulting mesh is formed of a set of polygons including the one or more connected components and a set of triangles.
Figure 13a shows the left image 202 of the stereo image pair, in which vertices of a connected component corresponding to the sky are shown as white circles (only some of the vertices are shown in Figure 13a, for ease of illustration). Figure 13b shows an expanded set of support points including the original set of support points (shown as black circles) and the vertices of the connected components (shown as white circles). The resulting mesh 1300 is formed of a set of polygons including the connected component.
For examples in which a connected component is detected having more than three vertices, it is not possible to determine plane parameters for the connected component using all of the vertices, as the resulting system of simultaneous equations is overdetermined. Accordingly, only three vertices can be used to determine the plane parameters. If the connected component corresponds to a perfect plane, the same plane parameters will be determined if any three vertices are used. However, in most cases the connected component is unlikely to correspond to a perfect plane, and furthermore may include vertices with outlying stereo disparities, and therefore different plane parameters will be calculated for different choices of vertices. In order to determine the plane parameters, a best three vertices of the connected component are identified for estimating the plane parameters. The best three vertices may be identified, for example, using random sample consensus (RANSAC), a three-dimensional Hough transform method, or any other suitable method. The best three vertices are those for which the resulting set of plane parameters is most consistent with the disparities of the other vertices of the connected component. The plane parameters of the connected component are estimated using the respective stereo disparities of the identified best three vertices, and the disparities of the other vertices of the connected component are recomputed using Equation (4) with the estimated plane parameters, resulting in any outlying stereo disparities being removed.
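The selection of the best three vertices might be sketched as follows. For brevity this sketch evaluates every triple of vertices exhaustively rather than sampling triples at random as RANSAC would; the inlier tolerance and the example vertex data are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def fit_connected_component_plane(vertices, disparities, tolerance=1.0):
    """Pick the three vertices whose plane d = a*x + b*y + c is most consistent with
    the disparities of the other vertices, then recompute all vertex disparities
    from that plane."""
    pts = np.asarray(vertices, dtype=float)
    ds = np.asarray(disparities, dtype=float)
    best_params, best_inliers = None, -1
    for i, j, k in combinations(range(len(pts)), 3):
        A = np.column_stack([pts[[i, j, k], 0], pts[[i, j, k], 1], np.ones(3)])
        try:
            a, b, c = np.linalg.solve(A, ds[[i, j, k]])
        except np.linalg.LinAlgError:
            continue                                  # degenerate (collinear) triple
        predicted = a * pts[:, 0] + b * pts[:, 1] + c
        inliers = np.sum(np.abs(predicted - ds) < tolerance)
        if inliers > best_inliers:
            best_inliers, best_params = inliers, (a, b, c)
    a, b, c = best_params
    return a, b, c, a * pts[:, 0] + b * pts[:, 1] + c   # plane and recomputed disparities

# Example: a roughly planar component with one outlying vertex disparity.
verts = [(0, 0), (100, 0), (100, 80), (0, 80), (50, 40)]
disp = [10.0, 10.0, 18.0, 18.0, 25.0]                    # last vertex is an outlier
a, b, c, recomputed = fit_connected_component_plane(verts, disp)
```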
For some examples in which connected components are detected, the plane parameters of the connected components are determined before those of the other polygons in the mesh. In this way, stereo disparities of mesh vertices are recomputed so as to be consistent with the hypothesis that the connected components are planar. The plane parameters of the other polygons are then determined using the recomputed stereo disparities, such that the resulting three-dimensional representation of the environment is formed of a set of interconnected planes.
As mentioned above, the stereo image processing system 101 of Figure 1 is arranged to generate input data for the ADAS 112 of the vehicle on which the cameras 102, 104 are mounted. In some examples, the input data is generated using the three-dimensional representation of the environment. Using the three-dimensional representation, locations of points in the environment can be determined in three dimensions, and this information can be used by the ADAS 112 to make decisions regarding actions to be performed. In some examples, the input data for the ADAS 112 includes information representative of the entire three-dimensional representation of the environment. In other examples, the input data may include, for example, the locations of any points less than a predetermined threshold distance from the vehicle. The ADAS 112 may then provide control signals to the actuators 114 to control driving functionality of the vehicle 100. The ADAS 112 may, for example, actuate one or more of steering, brakes, throttle, signalling, and gears of the vehicle 100, in order to avoid a collision between the vehicle 100 and an object within the environment.
An advantage of generating input data for an ADAS or ADS using a three-dimensional mesh representation of an environment as described above, as opposed to generating input data from a dense pixel depth map, or a point cloud generated using, for example, lidar, is that the number of polygons in a mesh generated using the present method is typically several orders of magnitude less than the number of pixels (or points in the case of lidar). For most polygons, it is sufficient to check distances to the vertices of the polygon to determine whether a vehicle is at risk of colliding with an object. For other polygons, it is sufficient to determine a portion of the polygon that overlaps with a specific region in front of the vehicle, and to check distances to the vertices within the overlapped portion. Checking distances to these vertices is far less computationally expensive than performing a pixel-wise or point-wise distance check.
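As a rough illustration of this kind of coarse check, the sketch below flags any polygon with a vertex closer than a threshold distance to the vehicle; the threshold, the example mesh, and the use of the origin as the vehicle position are assumptions made purely for the example.

```python
import numpy as np

def polygons_within_range(polygons, threshold=10.0):
    """Indices of mesh polygons with any vertex closer than `threshold` metres to the
    origin (taken here as the position of the vehicle's stereo camera pair)."""
    close = []
    for idx, vertices in enumerate(polygons):
        distances = np.linalg.norm(np.asarray(vertices, dtype=float), axis=1)
        if np.any(distances < threshold):
            close.append(idx)
    return close

# Example: two triangles given by their 3D vertex positions (metres).
mesh = [[(1.0, 0.2, 4.0), (2.0, 0.1, 5.0), (1.5, -0.3, 6.0)],
        [(3.0, 1.0, 40.0), (4.0, 1.2, 45.0), (3.5, 0.8, 50.0)]]
print(polygons_within_range(mesh))   # [0]
```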
In some examples, a three-dimensional representation of an environment, such as that generated using the method described above with reference to Figure 3, can be used to generate input data for an object detection system. Figure 14 shows an example of a method implemented by the stereo image processing module 110 to generate input data for an object detection system.
The stereo image processing module 110 receives, at S1402, a stereo image pair including a left image and a right image, captured by the left camera 102 and the right camera 104 respectively and pre-processed by the left image pre-processing module 106 and the right image pre-processing module 108 as described above. The stereo image processing module processes, at S1404, the left image and the right image using stereo matching to generate a three-dimensional representation of an environment. In the present example, the stereo image processing module 110 implements the method described above with reference to Figure 3, though it will be appreciated that other stereo matching methods are known and may be used as alternatives to the method of Figure 3. In this example, the generated three-dimensional representation of the environment includes a mesh of polygons, each having associated plane parameters. In other examples, a three-dimensional representation may include, for example, depth data indicating depths associated with each pixel of an image.
The stereo image processing module determines, at S1406, a transformation to be applied to one of the images of the stereo image pair, using the generated three-dimensional representation of the environment. The transformation expands regions of the image corresponding to distant portions of the three-dimensional representation, and shrinks regions of the image corresponding to nearby regions of the three-dimensional representation. In this example, the transformation moves a point with three dimensional co-ordinates (x, y, z) to a new point with three-dimensional co-ordinates (x', y', z'), where z' is predetermined and x' and y' are given by Equation (5):

x' = (x - cx)bf / (dz') + cx,    y' = (y - cy)bf / (dz') + cy,    (5)

where d = bf/z is the stereo disparity for the point with co-ordinates (x, y, z).
The constants cx, cy are the x and y co-ordinates of the principal point of the camera that captured the image. The constants b, f, cx, cy are all determined during calibration of the camera, as will be understood by a person skilled in the art. In this example, the transformation of Equation (5) is applied to each pixel in the image, though alternative methods may be employed in other examples, for example applying a transformation such as the transformation of Equation (5) to different regions of interest (ROIs) within the image, as will be described in more detail hereinafter.
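A minimal sketch of the per-pixel coordinate mapping of Equation (5) is given below, assuming the calibration constants b, f, cx and cy are already known; the function name and the example parameter values are illustrative.

def transform_point(x, y, d, z_target, b, f, cx, cy):
    """Map image coordinates (x, y) with stereo disparity d to the coordinates
    the point would have if it lay at the predetermined depth z_target,
    following Equation (5). Since d = b*f / z, the scale factor equals
    z / z_target, so nearby points are drawn towards the principal point
    (regions shrink) and distant points are pushed away from it (regions expand)."""
    scale = (b * f) / (d * z_target)
    return (x - cx) * scale + cx, (y - cy) * scale + cy

# Example: a point at z = 60 m (d = b*f/60) remapped to a nominal depth of 25 m
b, f, cx, cy = 0.3, 1000.0, 640.0, 360.0
print(transform_point(900.0, 400.0, b * f / 60.0, 25.0, b, f, cx, cy))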
The stereo image processing module 110 transforms, at S1408, one of the images of the stereo image pair, in accordance with the transformation determined at S1406, thereby expanding portions of the image and/or shrinking portions of the image. The transformed image may then be used as input data for an object detection system. In the example of a driving environment, most objects appearing in an image are likely to be less than around 20 metres or 30 metres from a pair of stereo cameras mounted on a vehicle. Therefore, by moving objects to a predetermined distance z' of around 20 metres to 30 metres, most regions of the image will shrink. As a result, the transformed image will contain fewer pixels than the original image.
As mentioned above, the transformed image generated at S1408, may be input to an object detection system, which processes the transformed image to generate object detection data indicating objects detected in the transformed image. Any suitable object detection method may be used, for example You Only Look Once (YOLO), Region-Convolutional Neural Network (R-CNN), Fast R-CNN, or Single Shot Detector (SSD). Each of these object detection methods generates, for a given object in the image, a detection score (for example being indicative of a probability of an object being detected, or of a specific class of object being detected), a classification indicating a predicted class of the object (for example, human, car, or lorry), and a bounding box indicating a predicted location and dimensions of the object.
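For reference, the per-object output described above might be held in a simple record such as the following sketch; the field names are illustrative and are not tied to the API of any particular detector.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Detection:
    score: float                      # detection confidence
    label: str                        # predicted class, e.g. "human", "car", "lorry"
    box: Tuple[int, int, int, int]    # bounding box (x_min, y_min, x_max, y_max) in pixels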
The method described above with reference to Figure 14 utilises depth information within the three-dimensional representation of the environment to improve both the efficiency and the accuracy of object detection. In particular, nearby objects are scaled down to include a smaller number of pixels, resulting in more efficient detection of nearby objects. By contrast, distant objects are scaled up, resulting in improved detection performance for distant objects.
Object detection methods generally rely on detection anchors, which are candidate regions of an image in which an object may be detected. In order to detect all objects within an image, the detection anchors must span the spatial extent of the image, and must also span a range of scales and/or aspect ratios such that objects of different shapes and sizes (or at different depths within the image) are detected. In order to detect objects for which the size and/or shape is known approximately (for example, humans or vehicles), one way of improving efficiency would be to modify the object detection code in dependence on the three-dimensional representation of the environment such that larger detection anchors are used for nearer regions of the image and smaller detection anchors are used for more distant regions of the image. However, this approach requires significant modification of the object detection code, which may not always be practicable, particularly in examples where the object detection code is provided by a third party or where object detection is performed by a separate module. Furthermore, any change in the object detection code may require the object detector to be retrained, which may be highly inconvenient and/or impracticable. The method of Figure 14 provides an alternative approach in which an image is transformed to modify regions of the image such that objects appear at a predetermined depth in the transformed image, resulting in the objects being detected using a narrower range of detection anchors. The image is modified prior to object detection being performed, such that the depth information is taken into account without the need to modify or retrain the object detector. Accordingly, different object detectors may be used interchangeably.
Figure 15 shows a further example of a method implemented by the stereo image processing module 110 to generate input data for an object detection system. The stereo image processing module 110 receives, at S1502, a stereo image pair including a left image and a right image, captured by the left camera 102 and the right camera 104 respectively and pre-processed by the left image pre-processing module 106 and the right image pre-processing module 108 as described above. The stereo image processing module processes, at S1504, the left image and the right image using stereo matching to generate a three-dimensional representation of an environment. In the present example, the stereo image processing module 110 implements the method described above with reference to Figure 3, though it will be appreciated that other stereo matching methods are known and may be used as alternatives to the method of Figure 3.
The stereo image processing module 110 determines, at S1506, one or more two-dimensional regions of interest (ROIs) of the left image or the right image for input to the object detection system, each of the one or more ROIs being associated with a respective predetermined portion of the three-dimensional representation generated at S1504. Depending on the specific application, regions of an image corresponding to particular depth ranges may be considered as having higher importance than other regions of the image. For example, in the present example, in which the input data is generated for the ADAS 112, regions of the image corresponding to closer portions of the three-dimensional representation may be considered to have a higher importance than regions corresponding to more distant portions of the three-dimensional representation.
Figure 16 shows an example of an ROI of the left image 202 of the stereo image pair 200, corresponding to a portion of the three-dimensional representation of the environment. In this example, the ROI (the region of the image 202 below the dashed line) corresponds to a portion of the three-dimensional representation that is less than 6 metres from the plane containing the left camera 102 and the right camera 104. In some examples, a portion of a three-dimensional representation used to determine an ROI may be bounded in the plane of the image, as well as in the direction perpendicular to the plane of the image.
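The sketch below derives such an ROI from a per-pixel depth map, which is one possible form of the three-dimensional representation mentioned at S1504; the 6-metre threshold matches the example of Figure 16, while the function name and the synthetic depth map are illustrative.

import numpy as np

def roi_below_depth(depth_map, max_depth_m=6.0):
    """Bounding rectangle (top, bottom, left, right) of pixels whose depth is
    below max_depth_m, or None if no pixel qualifies."""
    mask = depth_map < max_depth_m
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Example with a synthetic depth map: only the lower half is within 6 m
depth = np.full((480, 640), 50.0)
depth[240:, :] = 4.0
print(roi_below_depth(depth))   # (240, 479, 0, 639)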
An object detection system may be arranged to detect objects within the determined ROIs to generate object detection data. By performing object detection on the ROIs, regions of the image that have higher importance, for example regions of the image more likely to be pertinent to decisions made by an ADAS, may be prioritised compared with other regions of the image. Alternatively, object detection may be performed exclusively for the ROIs, such that objects in regions of the image that are considered to be of low importance are not considered.
In some examples, the methods of Figures 14 and 15 may be combined, for example by transforming or rescaling one or more ROIs of an image in accordance with the corresponding distance ranges of the ROIs. For example, in a driving environment, an ROI corresponding to a portion of the environment lying between 90 metres and 150 metres from a vehicle may be scaled up, whereas an ROI corresponding to a portion of the environment lying between 5 metres and 20 metres may be scaled down. The rescaled ROIs may then be processed one by one by an object detector in accordance with a predetermined order of priority (for example, ROIs closer to the vehicle and/or directly in front of the vehicle may be allocated a higher priority than ROIs farther from the vehicle and/or to the side of the vehicle). The object detector therefore processes regions of higher importance before regions of lower importance, and all of the regions are appropriately scaled, allowing the object detector to be configured for a lower input resolution. Alternatively, the transformed or rescaled ROIs may be concatenated together to form a new image for input to the object detector. In either case, no modification or retraining of the object detector is required, but the object detector is implicitly guided to pay more attention to distant objects (by scaling up certain ROIs) and to spend reasonable computational resources to detect nearby objects (by scaling down certain ROIs). Either of these methods results in improved efficiency of object detection, whilst maintaining or improving detection accuracy.
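A sketch of this combined approach is given below, assuming OpenCV is available and that each ROI carries its pixel bounds and nominal distance range; the dictionary keys, the 25-metre target depth and the nearest-first priority rule are illustrative choices rather than details from the description.

import cv2

def prepare_rois(image, rois, z_target=25.0):
    """Rescale each ROI towards a nominal target depth and return the crops
    ordered nearest-first. Each entry in 'rois' is a dict with 'box' =
    (top, bottom, left, right) in pixels and 'z_range' = (z_near, z_far) in metres."""
    prepared = []
    for roi in sorted(rois, key=lambda r: r["z_range"][0]):      # nearer ROIs first
        top, bottom, left, right = roi["box"]
        crop = image[top:bottom, left:right]
        z_mid = 0.5 * (roi["z_range"][0] + roi["z_range"][1])
        scale = z_mid / z_target   # >1 scales distant ROIs up, <1 scales nearby ROIs down
        prepared.append(cv2.resize(crop, None, fx=scale, fy=scale,
                                   interpolation=cv2.INTER_LINEAR))
    return prepared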
The stereo image processing module 110 of Figure 1 may be configured to perform object detection by processing input data generated by either of the methods of Figure 14 or 15. The stereo image processing module 110 may further be configured to generate enhanced object detection data by combining the object detection data with a three-dimensional representation of an environment. For a given detected object, in addition to regular object detection data (for example, a detection score, a classification, and a bounding box), enhanced object detection data includes depth information indicating a distance to the detected object, determined using the relevant portion of the three-dimensional representation. Figure 16 shows an example of enhanced object detection data overlaid on the left image 202 of the stereo image pair 200. In this example, three objects have been detected: a human, a lorry, and a tree. For each object, a bounding box is shown, along with depth information indicating a distance of the object from the plane containing the left camera 102 and the right camera 104.
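The combination step might look like the sketch below, which attaches the median depth inside each bounding box as the object distance; a per-pixel depth map, the dictionary field names and the choice of the median as the statistic are assumptions made for illustration.

import numpy as np

def enhance_detections(detections, depth_map):
    """Attach a distance estimate to each detection by taking the median depth
    inside its bounding box. Each detection is a dict with a 'box' field
    (x_min, y_min, x_max, y_max); other fields are passed through unchanged."""
    enhanced = []
    for det in detections:
        x_min, y_min, x_max, y_max = det["box"]
        patch = depth_map[y_min:y_max, x_min:x_max]
        enhanced.append({**det, "distance_m": float(np.median(patch))})
    return enhanced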
In some examples, the stereo image processing module 110 generates input data for the ADAS 112 using enhanced object detection data. The enhanced object detection data includes depth information relating to the distance of objects from the vehicle. The ADAS 112 can make decisions regarding actions to be performed on the basis of the three-dimensional positions of objects of different classes. The ADAS 112 may then provide control signals to the actuators 114 to control driving functionality of the vehicle 100. The ADAS 112 may, for example, actuate one or more of steering, brakes, throttle, signalling, and gears of the vehicle 100, in order to avoid a collision between the vehicle 100 and a detected object.
The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, stereo cameras are not necessarily horizontally-separated, and more generally may be any two cameras with overlapping views of a common environment. In some examples, existing cameras may be paired and repurposed as stereo cameras. Furthermore, methods described herein may be applied to stereo images of virtual or simulated environments as opposed to stereo images of real environments captured using stereo cameras.
In some examples, a three-dimensional representation of an environment, and/or enhanced object detection data, may be used to provide input data for an ADS of a vehicle, for example a fully autonomous vehicle or a vehicle configured in an autonomous driving mode. The ADS may then perform driving functions of the vehicle, for example by actuating one or more of steering, brakes, signalling, throttle, and gears of the vehicle, such that no human input is required to drive the vehicle.
In some examples, a vehicle with an ADS and/or ADAS may have multiple stereo cameras facing in different directions, such that methods described herein may be used to generate a three-dimensional representation of an environment surrounding the vehicle, as opposed to only in front of the vehicle. In some examples, a vehicle may have multiple cameras facing in the same direction and separated by different baselines such that different pairs of cameras can resolve distances in different distance ranges.
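The reason different baselines suit different distance ranges can be seen from the standard stereo relation z = bf/d: differentiating gives a depth quantisation of roughly z²·Δd/(bf), as in the sketch below. The function name and the parameter values are illustrative.

def depth_resolution(z_m, baseline_m, focal_px, disparity_step_px=1.0):
    """Approximate depth quantisation at range z for a stereo pair with the
    given baseline and focal length: dz ~ z**2 * dd / (b * f)."""
    return (z_m ** 2) * disparity_step_px / (baseline_m * focal_px)

# A wider baseline resolves a given range more finely
print(depth_resolution(50.0, baseline_m=0.3, focal_px=1000.0))   # ~8.3 m per disparity step
print(depth_resolution(50.0, baseline_m=1.2, focal_px=1000.0))   # ~2.1 m per disparity step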
The hardware components described with reference to Figure 1 are exemplary, and methods described herein could be performed using any suitable device or system with appropriate processing circuitry and memory circuitry, for example a general purpose computer or a network-based computer system.
Although in the example of Figure 3, representative points within an image are selected using gradient-based matching scores, in other examples alternative methods may be used to select representative points, for example corner detection methods such as the Moravec, Harris & Stephens, Plessey, or Shi-Tomasi corner detection methods. The skilled person will appreciate that alternative stereo matching methods to those described herein may be used to determine stereo disparities for the selected representative points, including intensity-based and feature-based stereo matching methods. In some examples, multiresolution matching methods may be employed.
It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims (25)

  1. 1. A method of processing stereo image data to generate a three-dimensional representation of an environment, wherein the stereo image data comprises a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels, the method comprising: selecting a plurality of representative points to represent the environment, the representative points corresponding to a subset of the pixels of at least one of the first image and the second image; determining stereo disparities for the selected representative points; generating, by connecting the selected representative points, a mesh comprising a plurality of polygons, each having a respective plurality of mesh vertices; and determining, using the respective stereo disparities of the mesh vertices, plane parameters for each of the polygons of the generated mesh, whereby the plurality of polygons and the determined plane parameters provide the three-dimensional representation of the environment.
  2. 2. The method of claim 1, comprising: determining one or more connected components, each connected component having a respective plurality of vertices and corresponding to an untextured region or a distant region within the first image; and including the vertices of the one or more determined connected components within the selected plurality of representative points, wherein generating the mesh comprises connecting the selected plurality of representative points, such that the generated mesh comprises a plurality of polygons including the one or more connected components.
  3. 3. The method of claim 2, wherein determining the one or more connected components comprises determining an untextured region within the first image by: binarizing pixels of the first image in dependence on associated pixel gradients; and determining the untextured region on the basis of a cluster of the binarized pixels having the same binarized pixel value.
  4. 4. The method of claim 2 or 3, wherein determining the plane parameters for one of the one or more determined connected components comprises: identifying a best three vertices of the connected component for estimating the plane parameters for the connected component; estimating the plane parameters for the connected component using the respective stereo disparities of the identified best three vertices; and recomputing the disparities of the vertices of the connected component other than the identified best three mesh vertices, in accordance with the estimated plane parameters.
  5. 5. The method of any of claims 2 to 4, wherein determining the one or more connected components comprises determining a distant region within the first image by: subtracting pixel values or feature descriptors of pixels of the second image from pixel values or feature descriptors of identically-located pixels of the first image to generate a delta image comprising a plurality of delta pixel values; binarizing the generated delta image in dependence on the delta pixel values; and determining the distant region on the basis of a cluster of the binarized delta pixel values having the same binarized pixel value.
  6. 6. The method of any of claims 2 to 5, comprising scaling down at least one of the first image and the second image prior to determining the one or more connected components.
  7. 7. The method of any preceding claim, comprising scaling down the first image and the second image prior to selecting the plurality of representative points.
  8. 8. The method of claim 6 or 7, comprising enlarging the polygons of the generated mesh to match the scale of the first and second images.
  9. 9. The method of any preceding claim, comprising: determining gradient-based matching scores for pixels of the first image and pixels of the second image; and determining feature descriptors for pixels of the first image and pixels of the second image, wherein: the plurality of representative points are selected using the determined gradient-based matching scores; and the stereo disparities for the selected representative points are determined using the determined feature descriptors.
  10. 10. The method of claim 9, wherein selecting the plurality of representative points and determining the stereo disparities for the selected representative points comprises: dividing the first image into a plurality of first portions; dividing the second image into a plurality of second portions; determining, within each of the plurality of first portions and each of the plurality of second portions, a candidate pixel with a highest gradient-based matching score; for each of the determined candidate pixels in the first image, identifying a corresponding pixel in the second image on the basis of the determined feature descriptors; for each of the determined candidate pixels in the second image, identifying a corresponding pixel in the first image on the basis of the determined feature descriptors; determining, using said corresponding pixels, the respective stereo disparities of each of the determined candidate pixels; projecting the candidate pixels of the second image to the first image using the determined stereo disparities; and selecting points corresponding to the candidate pixels of the first image and the projected candidate pixels of the second image.
  11. 11. The method of claim 10, comprising removing outliers from the selected plurality of representative points, wherein removing the outliers comprises at least one of: discarding points from any of the first portions containing only one representative point; and discarding points from any of the first portions containing representative points determined to have a range of stereo disparities greater than a predetermined threshold range.
  12. 12. The method of any preceding claim, comprising: determining that one or more of the mesh vertices have unknown stereo disparities; and determining stereo disparities for the one or more mesh vertices determined to have unknown stereo disparities, wherein determining the stereo disparity of a mesh vertex with an unknown disparity and associated with a pixel in the first image comprises: determining a search range in dependence on known stereo disparities of neighbouring mesh vertices in the generated mesh; identifying, within the search range, a pixel in the second image corresponding to pixel in the first image associated with the mesh vertex with the unknown stereo disparity, using the determined feature descriptors; and determining, using the corresponding pixel identified within the search range, the respective stereo disparity for the mesh vertex with the unknown disparity.
  13. 13. The method of claim 12, wherein determining that the one or more mesh vertices have an unknown stereo disparity comprises: clustering neighbouring mesh vertices with known stereo disparities according to the known stereo disparities; and determining the stereo disparity of any mesh vertex in a cluster containing fewer than a predetermined threshold number of mesh vertices to be unknown.
  14. 14. The method of any preceding claim, wherein generating the mesh comprises: determining a set of triangles by connecting the selected representative points; and iteratively: identifying a triangle in the set with an area greater than a predetermined threshold area; identifying a pixel in a central region of the identified triangle determined to have a gradient-based matching score greater than a predetermined threshold matching score; and dividing the identified triangle into three smaller triangles each sharing two vertices with the identified triangle and having a third vertex associated with the pixel identified in the central region of the identified triangle, such that the plurality of polygons of the mesh includes a plurality of triangles with areas below the predetermined threshold area.
  15. 15. The method of any preceding claim, comprising generating input data for an advanced driver assistance system (ADAS) or an automated driving system (ADS) using the generated three-dimensional representation of the 30 environment.
  16. 16. The method of any previous claim, comprising generating input data for an object detection system, wherein generating the input data for the object detection system comprises: determining, using the generated three-dimensional representation, a transformation to be applied to the first image or the second image, wherein the transformation expands regions of the first image or the second image corresponding to distant portions of the three-dimensional representation and shrinks regions of the first image or the second image corresponding to nearby portions of the three-dimensional representation; and transforming the first image or the second image in accordance with the determined transformation, to generate a transformed image for input to the object detection system.
  17. 17. The method of any preceding claim, comprising generating input data for an object detection system, wherein generating the input data for the object detection system comprises: determining, using the generated three-dimensional representation of the environment, one or more two-dimensional regions of interest (ROIs) of the first image or the second image for input to the object detection system, wherein each of the one or more ROIs is associated with a respective predetermined three-dimensional portion of the generated three-dimensional representation of the environment.
  18. 18. A method of processing stereo image data to generate input data for an object detection system, wherein the stereo image data comprises a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels, the method comprising: processing the first image and the second image using stereo matching to generate a three-dimensional representation of the environment; determining, using the generated three-dimensional representation, a transformation to be applied to the first image or the second image, wherein the transformation expands regions of the first image or the second image corresponding to distant portions of the three-dimensional representation and shrinks regions of the first image or the second image corresponding to nearby portions of the three-dimensional representation; transforming the first image or the second image in accordance with the determined transformation, to generate a transformed image for input to the object detection system.
  19. 19. A method of processing stereo image data to generate input data for an object detection system, wherein the stereo image data comprises a first image and a second image corresponding to simultaneous views of an environment from two respective different locations and each comprising a plurality of pixels, the method comprising: processing the first image and the second image using stereo matching to generate a three-dimensional representation of the environment; determining, using the generated three-dimensional representation of the environment, one or more two-dimensional regions of interest (ROIs) of the first image or the second image for input to the object detection system, wherein each of the one or more ROIs is associated with a respective predetermined three-dimensional portion of the generated three-dimensional representation of the environment.
  20. 20. The method of any of claims 16 to 19, comprising: processing the input data for the object detection system using the object detection system to generate object detection data; and combining the generated object detection data with the generated three-dimensional representation of the environment, to generate enhanced object detection data indicating a distance to each of the detected objects.
  21. 21. The method of claim 20, comprising generating input data for an advanced driver assistance system (ADAS) or an automated driving system (ADS) using the generated enhanced object detection data.
  22. 22. A computer system comprising a processing unit and a memory unit, wherein the memory unit holds machine-readable instructions which, when executed by the processing unit, cause the computer system to perform the method of any of claims 1 to 21.
  23. 23. A computer program product comprising machine-readable instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1 to 21.
  24. 24. A vehicle comprising: one or more actuators for controlling driving functionality of the vehicle; an advanced driver assistance system (ADAS) arranged to process input data to generate control signals for the one or more actuators; a pair of stereo cameras arranged to generate stereo image data; and an image processing module arranged to process the stereo image data generated by the pair of stereo cameras to generate the input data for the ADAS in accordance with the method of claim 15 or claim 21.
  25. 25. A vehicle comprising: one or more actuators for controlling driving functionality of the vehicle; an automated driving system (ADS) arranged to process input data to generate control signals for the one or more actuators; a pair of stereo cameras arranged to generate stereo image data; and an image processing module arranged to process the stereo image data generated by the pair of stereo cameras to generate the input data for the ADS in accordance with the method of claim 15 or claim 21.
GB1906652.1A 2019-05-10 2019-05-10 Stereo image processing Active GB2583774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1906652.1A GB2583774B (en) 2019-05-10 2019-05-10 Stereo image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1906652.1A GB2583774B (en) 2019-05-10 2019-05-10 Stereo image processing

Publications (3)

Publication Number Publication Date
GB201906652D0 GB201906652D0 (en) 2019-06-26
GB2583774A true GB2583774A (en) 2020-11-11
GB2583774B GB2583774B (en) 2022-05-11

Family

ID=67384703

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1906652.1A Active GB2583774B (en) 2019-05-10 2019-05-10 Stereo image processing

Country Status (1)

Country Link
GB (1) GB2583774B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2754130A1 (en) * 2011-08-09 2014-07-16 Intel Corporation Image-based multi-view 3d face generation
US20160180181A1 (en) * 2014-12-22 2016-06-23 Hyundai Mobis Co., Ltd. Obstacle detecting apparatus and obstacle detecting method
CN106527426A (en) * 2016-10-17 2017-03-22 江苏大学 Indoor multi-target track planning system and method
CN107194350A (en) * 2017-05-19 2017-09-22 北京进化者机器人科技有限公司 Obstacle detection method, device and robot
WO2017163596A1 (en) * 2016-03-22 2017-09-28 Sharp Kabushiki Kaisha Autonomous navigation using visual odometry
US20180211400A1 (en) * 2017-01-26 2018-07-26 Samsung Electronics Co., Ltd. Stereo matching method and apparatus
WO2018209969A1 (en) * 2017-05-19 2018-11-22 成都通甲优博科技有限责任公司 Depth map creation method and system and image blurring method and system
CN108961283A (en) * 2018-06-05 2018-12-07 北京邮电大学 Based on the corresponding image distortion method of feature and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210264632A1 (en) * 2020-02-21 2021-08-26 Google Llc Real-time stereo matching using a hierarchical iterative refinement network
US11810313B2 (en) * 2020-02-21 2023-11-07 Google Llc Real-time stereo matching using a hierarchical iterative refinement network
