WO2020183202A1

WO2020183202A1 - Image processing method and system

Info

Publication number: WO2020183202A1
Application number: PCT/GB2020/050658
Authority: WO
Inventors: Michael Vi Nguyen TRUONG
Original assignee: Memento Ltd
Priority date: 2019-03-13
Filing date: 2020-03-13
Publication date: 2020-09-17
Also published as: GB201903431D0; GB2582164A

Abstract

A method of combining image data to generate a virtual reality representation of an event, the method comprising; receiving a first set of one or more images captured at the location of the event; generating the virtual reality representation in dependence on a first representation that has been generated in dependence on said first set of one or more images; receiving a second set of one or more images captured at the location of the event by one or more attendees of the event; and including the second set of one or more images in the virtual reality representation.

Description

IMAGE PROCESSING METHOD AND SYSTEM

The present invention relates to a method and a system for producing a virtual reality representation of an event.

At events, attendees often take images of themselves, other attendees or other interesting events that take place within the scope of the larger event using a mobile phone or other method of capturing images. These events can include personal events of importance to the attendees, such as weddings, christenings, birthdays or other celebrations and public events such as music festivals, concerts, parades, fashion shows and royal weddings or celebrations. There is an expectation that lots of images will be captured at such events. People can also be involved in events such as accidents, disasters, or other emergencies in which they may take images of the event. People at the event may also capture video footage of the event using a mobile phone or other method of acquiring video.

Virtual reality (VR) can be used to experience a location without needing to be physically present at the location. To create a VR representation, a number of images or a video of the location can be captured. By using a VR headset, the orientation and relative displacement of a user’s head can be tracked, and corresponding images can be displayed to the user to create the impression that the user is actually present at the location. There is a general need to improve VR representations of events.

According to a first aspect of the invention there is provided a method of combining image data to generate a virtual reality representation of an event, the method comprising; receiving a first set of one or more images captured at the location of the event; generating the virtual reality representation in dependence on a first representation that has been generated in dependence on said first set of one or more images; receiving a second set of one or more images captured at the location of the event by one or more attendees of the event; and including the second set of one or more images in the virtual reality representation.

The first representation may be a panoramic image or a 3D model.

The generation of the virtual reality representation of the event may be performed by a neutral network. The first set of one or more images may be captured before the event takes place.

The method may further comprise processing the one or more images in the second set to separate regions corresponding to the foreground and background of the one or more images in the second set; wherein the inclusion of the second set of one or more images may be performed based on the background of each image.

The generation of the first representation may be further dependent on the regions corresponding to the background of the one or more images of the second set.

The method may further comprise processing the background of the one or more images in the second set to minimize the difference in colour balance between the background of the one or more images in the second set and the first representation.

The method may further comprise processing the one or more images in the second set to separate regions corresponding to separate regions of the foreground at different focal planes; wherein the inclusion of the second set of one or more images may be performed based on one of the separate regions of the foreground of each image.

The method may further comprise processing the one or more images in the second set to separate regions corresponding to static and dynamic regions of each image; wherein the inclusion of the second set of one or more images may be performed based on one of the separate regions corresponding to the static regions of each image.

The first representation may be represented by a cube map; and the inclusion of the second set of one or more images in the virtual reality representation may comprise at least one of the steps of: comparing the one or more images in the second set to each horizontal face of the cube map and assigning each of the one or more images in the second set to a horizontal face of the cube map based on the comparison; and combining each of the one or more images in the second set with the region of the first representation corresponding to the horizontal face of the cube map to which each of the one or more images in the second set has been assigned.

The method may further comprise receiving contextual data associated with the one or more images of the second set; wherein the inclusion of the second set of one or more images may be determined based on the contextual data.

The contextual data may include sensor data captured by the device used to capture each image. The sensor data may include the location of the device used to capture each of the one or more images at the time when the image was captured.

The sensor data may include the orientation of the device used to capture each of the one or more images at the time when the image was captured.

The sensor data may include the time at which each of the one or more images was captured.

The method may further comprise at least one of the steps of: receiving a third set of one or more images captured at the location of the event, wherein the third set of one of more images may have been captured at a different position within the same location to the first set of one or more images; wherein the generation of the virtual reality representation of the event may be further in dependence on a second representation that has been generated in dependence on said third set of one or more images.

The method may further comprise, when the first representation and the second representation are both a panoramic image, obtaining a 3D model of the location of the event based on the first representation and the second representation; wherein the virtual reality representation may be generated based on the obtained 3D model of the location of the event.

The method may further comprise at least one of the steps of: generating a 3D model of an object in dependence on at least one of the second set of one or more images; and including the 3D model of the object in the VR representation.

According to a second aspect of the invention there is provided a system configured to produce a virtual reality representation of an event, the system comprising a remote processor, wherein the remote processor is configured to perform the method of the first aspect.

The system may further comprise a network of mobile devices wherein each mobile device is configured to capture images; wherein the network of mobile devices may be networked with the remote processor.

Embodiments of the present invention will now be described by way of non- limitative example with reference to the accompanying drawings, of which

Fig. 1 shows an example of a VR representation in which images of an event have been incorporated Fig. 2 is a schematic diagram of a system that may be used to perform the present invention

Fig. 3. is a flowchart showing the steps performed in an embodiment of the present invention.

It is desirable to create a VR representation that can reproduce the experience of being at an event at a location. However, the required processing to create the VR representation can be difficult, requiring large amounts of processing power or a long time to produce. It would therefore be desirable to increase the efficiency of the processing required to produce a VR representation of an event.

Figure 1 shows an example of a virtual reality (VR) representation of an event. The VR representation may reproduce the experience of attending the event. Examples of such events includes personal events such as weddings, christenings, birthday parties, funerals or other social events. Other events may include public events such as concerts, plays, operas, sporting events or other performances. VR representations may also be produced of spontaneous or unplanned events such as accidents, natural disasters, riots or other dangerous activities or situations. A VR representation may be produced of any event for which people were present at the location of the event and captured at least one image as the event is taking place.

People present at an event are usually able to capture images 10 or videos of an event taking place because they will have access to devices such as mobile phones or video recorders that are capable of capturing images or videos of the event taking place. In addition to capturing the image 10 or video of the event, the device may also be capable of acquiring further information at the time at which the image 10 or video is captured. For example, the device may be capable of determining the location of the device when the image 10 is captured by using a GPS or Wi-Fi-positioning system located on the device. The device may be capable of determining the orientation of device when the image 10 is captured, for example using a gyroscope or accelerometer located on the device. The device may be capable of determining the altitude of the device when the image is captured, for example using a barometer. Data from other sensors such as a magnetometer present on the device may also be recorded when an image 10 is captured.

A set of one or more images of a location at which an event may be captured prior to an event taking place. This set of one or more images may be referred to as a first set of images. Such a set of one or more images can be used to construct a first representation 20 of the location at which the event is taking place. The first representation may be a panoramic image or a 3D model, as described in greater detail below. Approximately 1 image per 25m² of the location of the event may be captured to form the set of one or more images. In an example, a set of one or more images may be captured using a 360° camera. The 360° camera may capture images of the location using two fish-eye lenses arranged back to back. The two images may then be combined to produce a panoramic image 20 of the location. Another method of obtaining a set of one or more images to produce a first representation 20 is to use a device to capture the set of one or more images of the location at different orientations from the same position. The one or more images can then be combined to produce a first representation 20 of the location.

In an embodiment, a set of one or more images obtained at a location of an event and one or more event images 10 captured by one or more attendees of the event are combined to produce a VR representation of the event. The one or more event images may be referred to as a second set of images. The VR representation may be generated in dependence on the first representation 20 that has been generated in dependence on the first set of images and further include the second set of images. An example of a viewpoint in a VR representation is shown in figure 1. The event image 10 captured during the event by an attendee at the event may be incorporated into the first

representation 20. The first representation 20 may have been created using one or more images of the location at which the event is occurring that have been captured before the event has occurred. Alternatively, the one or more images of the location may have been captured after the event has occurred.

The event image 10 captured by an attendee may include features of interest to the person attending the event. For example, the event image 10 may include other attendees of the event, as shown in Figure 1. The event image 10 may include other features of interest, such as important objects or decorations related to the event. The event image 10 may include features that are temporary and are not present at the location when the event is not taking place. The event image 10 may include features that are in motion. Regions of the event image 10 may be divided into foreground regions that contain temporary features or features of interest to the attendees, and background regions that contain permanent features that are present at the location of the event when the event is not taking place. The event image 10 may be divided into regions containing features at different focal planes and/or at different distances from the position at which the event image 10 is captured.

The set of one or more images obtained at the location of the event used to produce the first representation 20 may not contain temporary features that are only present at the location during the event. For example, the set of one or more images may only contain features associated with the building in which the event is being held, such as the walls, floor and ceiling of the building. If the event is being held outdoors, the set of one or more images may contain landscape features such as buildings, hills, mountains and/or rivers. The set of one or more images obtained at the location of the event may contain permanent features of the location that are not associated with the event.

The set of one or more images obtained at the location of the event and the event image 10 may be combined to produce a VR representation of the event. An example arrangement of a system that may be used to perform the method according to the present invention is shown in Figure 2. One or more devices 100 may be used by attendees to capture images of the event. The devices 100 may then transfer the images of the event to a cloud network or other communication network 200. The images of the event are then received by a server 300. The server 300 may be a remote processor or other computer system that is not located at the location at which the event is taking place.

The event images 10 may be automatically transferred to the communication network 200 by an app installed on the device 100 for the purpose of producing the VR representation of the event. While the app is running on the device 100, each image captured by the device may be transmitted automatically via the network 200 to the server 300. Additional sensor data captured by the device 100 as discussed above may also be transferred to the server 300 automatically by the app via the network 200.

Event images 10 may be manually sent to the server 300 by the user of the device 100 after the event has taken place. An app may be provided on the device 100 that allows the user of the device 100 to select particular images stored on the memory of the device 100 to be sent to the server 300. The selected images may be transferred from one device 100 to another device 100 prior to being sent to the server 300.

The set of one or more images that will be used to produce the first representation

20 may also be transferred to the server 300 via the network 200. The set of one or more images may be transferred before or after the event has taken place. The set of one or more images may have been combined to form the first representation 20 of the location of the event and the first representation 20 may be transferred via the network 200 to the server 300. The set of one or more images may have been captured using one of the devices 100 and sent to the server 300 from the device 100.

An example of the steps performed to implement the method of the present invention is shown in Figure 3. Each of the steps described below may be performed by the server 300.

In a first step 51, the server 300 may receive an event image 10 and a set of one or more images obtained at the location of the event. The server 300 may also receive further sensor data obtained by the device 100 used to capture the event image 10 at the time the event image 10 was captured.

In a second step 52, the server 300 may perform processing on the received images. In the case where the first representation 20 is a panoramic image, the server 300 may stitch the set of one or more images together to produce the panoramic image of the location of the event. The production of the panoramic image may comprise the steps of determining key points in each of the one or more images, deriving a descriptor associated with the local region of each key point in the image, matching descriptors between the images, calculating the position and orientation of the perspective of each image from the matching of the descriptors and applying a transformation to each of the images to obtain the panoramic image. An example of how to perform this technique may be found at https://www.pyimagesearch.eom/2016/01/l 1/opencv-panorama-stitching/ (accessed 8 January 2019). The panoramic image may be produced using orientation information associated with each of the one or more images of the set of one or more images. In this case, the orientation information may be used instead of or in addition to the determination of the descriptors.

The server 300 may also perform processing of the event image 10. For example, the server may process the event image 10 to define separate regions of the event image 10 as foreground regions or background regions. The classification of the separate regions of the event image 10 into foreground and background regions may be performed using image matting, image segmentation or disparity techniques. Once regions of the event image 10 have been separated into foreground and background regions, the colour balance of the background region of the event image 10 may be modified to reduce the difference between the colour balance of the background of the event image 10 and the first representation 20. For example, at least one of the hue, saturation and intensity of regions of the event image 10 may be modified to reduce the difference between the average value of these characteristics of the region and a corresponding region of the first representation 20. Alternatively, or additionally, hue, saturation and intensity of regions of the first representation 20 may also be modified to reduce the differences between the average value. Regions of interest may be matching patches of the wall, or a person’s hair or shirt.

Correspondence between regions may be established by, for example, first labelling regions within the event image 10 and the first representation 20 and then matching regions between the images which have the same label. Labels within the images can be established by, for example, using a convolutional neural network or other artificial intelligence or semantic analysis methods, or by manually labelling the regions within the images. This process is described in further detail below.

Reducing the difference in colour balance between the background region of the event image 10 and the first representation 20 may reduce the amount of processing in the combining step described below. The foreground regions of the event image 10 may further be classified into separate foreground regions at different focal planes or distances from the device 100 used to capture the event image 10.

In the case where the first representation 20 is a panoramic image, the panoramic image may be received in an equirectangular projection or as a cube map. If the panoramic image is received in a different form, the panoramic image may be processed by converting the panoramic image from the different form, such as an equirectangular projection, into a rectilinear cube map projection comprising six faces corresponding to four horizontal and two vertical directions. Cube mapping is a known technique in computer graphics, as described at the following web address:

https://en.wikipedia.org/wiki/Cube_mapping and https://docs.unity3d.com/Manual/class- Cubemap.html (both accessed 8 January 2019). The faces of the cube maps are also referred to as facets. The horizontal and vertical directions of the cube map may correspond to the horizontal and vertical directions as defined by the location at which the event is taking place. The panoramic image may be converted into one or more rectilinear cube maps, wherein the angle of orientation of the horizontal faces of the one or more cube maps is different to each other. For example, the panoramic image may be converted into two cube maps where the horizontal orientation of the first cube map is at a 45° angle to the horizontal orientation of the second cube map.

In a third step 53, the server 300 may combine the event image 10 and the first representation 20 to produce a VR image. The VR image may be referred to as an augmented image. To combine the event image 10 and the first representation 20, the event image 10 and the first representation 20 may be analysed to determine a region of the first representation 20 that corresponds to the event image 10. The event image 10 may then be inserted into the first representation 20 at the region that corresponds to the event image 10 to produce the VR image.

When the first representation 20 is a panoramic image represented as one or more cube maps, as an initial step the event image 10 may be assigned to a particular face of one of the cube maps. The assignment of an event image 10 to a particular face of one of the cube maps may be performed by comparing features between the event image 10 and each of the faces of the cube maps, determining which face of the cube map shares the most features with the event image 10 and assigning the event image 10 to that face. Only the section of the panoramic image corresponding to the cube face to which the event image has been assigned may be considered when determining the region of the panoramic image 20 that corresponds to the event image 10. Assigning the event image 10 to a face of a cube map prior to locating the event image 10 within the panoramic image 20 may allow the virtual reality representation of the event to be derived more quickly.

The assignment of an event image 10 to a particular face of one of the cube maps may be performed by considering only the horizontal faces of the cube maps. Event images 10 captured by attendees at the event may more often be captured in the horizontal plane or at an orientation close to that of the horizontal plane. Discarding the vertical faces of the cube maps may allow the VR representation to be derived more quickly.

Locating the region of the first representation 20 that corresponds to the event image 10 and inserting the event image 10 into the first representation 20 may be performed using trilateration to determine the location of features in the image at the location of the event. Trilateration may be performed using computer vision techniques. For example, one or more of the techniques of point detection and description, descriptor matching, 3D point reconstruction and bundle adjustment may be used. The trilateration technique may allow the correct position and size of features within the event image 10 within the first representation 20 to be determined by calculating the location of the features of the event image 10 within the location of the event. Location of the regions may be performed using the same processes as described above for determining correspondence between regions of the event image 10 and the first representation 20.

When the event image 10 is inserted into the first representation 20, the event image 10 may appear as a free-standing cut out in the first representation 20 in the VR image. A plurality of event images 10 may be incorporated into the first representation 20 to form the VR image. In the VR image, parallax image artefacts may occur due to regions of the image 10 not having the same distance to the camera as the focal plane at which the image 10 is positioned in the VR image.

When the event image 10 has been separated into background and foreground regions, the background region may be used to locate the position of the event image 10 within the first representation 20 and the foreground region may not be used. When the event image 10 has been separated into multiple foreground regions at separate focal planes and/or distances from the device 10 used to capture the image, the multiple foreground regions may be positioned at different locations and sizes relative to each other in the VR image based on the calculated location of the features of the event image 10 at the location of the event. This may reduce the incidence of parallax imaging artefacts as described above.

The further sensor data recorded by the device 100 at the time at which the event image 10 is captured may be used to calculate the position of the features of the event image 10 within the location of the event and locate the position of the event image 10 within the first representation 20. For example, information on the orientation of the device 100 at the time at which the event image 10 is captured can be used to restrict or reduce the regions of the first representation 20 that are compared to the event image 10. Information of the position of the device 100 at the time at which the event image 10 is captured can be used to restrict or reduce the regions of the first representation 20 that are compared to the event image. Any of the sensor data described above that may be obtained by the device 100 may be used to restrict or reduce the regions of the first representation 20 that are compared to the event image 10. Using the sensor data in this way may constrain the triangulation cost functions used in the trilateration process, may reduce the number of degrees of freedom of the search space of the first representation 20 and may limit the range of each degree of freedom. For example, at least one of GPS, Wi Fi positioning, altitude recordings from the barometer or other positional information may be used to determine event images 10 that are not within proximity to each other and therefore do not need to be matched, reducing the size of the search space. Information from at least one of the magnetometer and accelerometer may be used to determine event images 10 that are not within each other’s field of view and therefore will not have corresponding regions and therefore do not be matched. Use of the further sensor data recorded by the device 100 may therefore reduce the search space and computational cost.

Other information about the event image 100 may be used to locate the position of the event image 10 within the first representation 20. For example, contextual information such as a comment associated with the event image 10 may be used. The comment may include information about features in the event image 10 which may have been identified in the first representation 20 and the comparison of the event image 10 may be restricted to the region of the first representation 20 containing the identified feature. The contextual information may be used to restrict or reduce the regions of the first representation 20 that are compared to the event image 10.

The processing of the images discussed above in the third step 53 may be performed by a neural network or a plurality of distributed neural networks. For example, a neural network may be trained and used to detect and classify objects within the event image 10 and/or the first representation 20, such as a person, a chair, doors, walls of the room, or any other furnishings. This is described in detail at

https://medium.com/datadriveninvestor/deep-learning-for-image-segmentation- dl0dl9131113 (accessed 8 January 2019). Imaging processing techniques may be used to segment region associated with the object which may be semantically labelled by the neural network. Furthermore, properties of the object can be qualified by the neural network. For example, the neural network may identify if the object is associated with a static region within the local area of the event, such as flooring, walls, a ceiling, trees and buildings that would not be expected to significantly move during the event. If further images do not share corresponding static regions, for example if one image contains a house that another image does not, then more detailed correspondences or relative localisation between the two images may not need to be computed which may save computational time.

Regions of the event image 10 that are identified as static and/or background may be incorporated into the first representation 20 using the same techniques described above for the formation of the first representation 20.

In a further example, objects within regions of the event image 10 that are not identified as static may be identified as dynamic, for example, people or cars within the event image. Regions identified as containing a dynamic object may be used to generate a 3D model of that object by identifying multiple instances of that object within a plurality of event images 10. The event images 10 may have been captured near or at the same time. For example, if two event images 10 simultaneously capture a person, then a predetermined 3D parametric model of the person can be located in the VR image such that the 3D model of the object matches the segmented contour of the region associated with the person in each event image 10. A frame of a video recording at the event may also be used an event image 10 for the purposes of 3D object reconstruction.

A plurality of sets of one or more images of the location of the event may be received by the server 300, where each of the sets of one or more images of the location of the event may have been captured from different points within the location of the event. Each of the steps described above may be performed separately for each of the plurality of sets of one or more images. For example, a representation 20 may be obtained for each of the plurality of the sets of one or more images. The location of the event image 10 may be determined within each of the plurality of representations 20 and the event image 10 may be inserted into each of the plurality of representations 20 to produce a plurality of VR images. In the case where the representations are panoramic images, the position of the location at which each of the panoramic images 20 have been captured in relation to each other may be determined.

The plurality of panoramic images may be used to generate a 3D model of the location of the event. A 3D model may consist of textured triangles forming a 3D mesh which represents the surfaces of the location or objects within the location, for example the walls, ceilings, floors and furniture. Such a 3D model may be used as an alternative to or in addition to the panoramic images 20 to produce the plurality of VR images. For example, surface rendering techniques may be used to generate the VR images from the 3D model. The 3D model may also be obtained or refined using information obtain about the location of the event from other sources. For example, LIDAR scans or a predetermined 3D model may be used. A 3D model may be obtained using techniques such as photogrammetry, as described at

https://pdfs.semanticscholar.org/be55/bfb7a4e96952a27f7a2e475c88d60flfl459.pdf (accessed 4 March 2019).

In a fourth step 54, the server 300 may derive the VR representation using the one or more VR images produced in the third step. The VR representation may be a virtual scene produced using the one or more VR images. The VR representation may represent a 4D photo and/or video album of the event. The VR representation may be viewed using VR headset or as a dynamic 3D scene or interactive 3D video. When a user views the VR representation, the event image 10 and features within the event image 10 are shown at the appropriate position in the VR reproduction of the location. The event image 10 or features in the event image may be visible at all times during the VR representation.

Alternatively, the event image 10 and features within the event image 10 may only be shown when the user’s line of sight within the VR representation means that the position of the event image 10 and features within the event image 10 are close to the centre of view of the user of the VR representation. Event images 10 may appear and disappear as different parts of the location of the event are viewed by the user. The event images 10 within the VR representation may appear as a single cut-out within the VR representation such that an object of interest within the cut-out corresponding to the event image 10 is in the correct scale, position and orientation as discussed above. The event images 10 may also be separated into different layers according to the focal distance of each layer as discussed above and with each separated layer optionally placed within the VR

reproduction at the correct scale, position and orientation. 3D meshes or models of objects obtained as discussed above may be placed within the VR reproduction at the correct scale, position and orientation, which may be displayed, for example, using surface rendering techniques. As the user moves around the VR representation, VR images associated with the first representation 20 corresponding to the virtual location of the user may be displayed to the user to create to illusion of movement around the location of the event.

Information associated with the event image 10 may be used to derive the VR representation. For example, the time at which the event image 10 was captured may be used to derive the VR representation. The VR representation may include a time component where the event images 10 displayed in the VR representation depend on the amount of time that the user has been viewing the VR representation. Event images 10 captured at similar times may therefore be displayed at the same or similar times in the VR representation. For example, the event image 10 may be displayed to the user in the VR representation after a length of time corresponding to the length of time after the beginning of the event at which the event image 10 was captured. For example, 1 minute of time in the VR representation may correspond to 1 hour of time passing at the event. The correspondence in time between the VR representation and the event may be varied depending on the number of event images 10 associated with particular periods of time at the event. For example, time in the VR representation may pass more quickly in time periods where fewer event images 10 were captured at the event and pass more slowly in time periods where more event images 10 were captured at the event.

The above embodiments have been described in relation to images being captured at the event. Alternatively, or additionally, video, and corresponding audio, captured at the event may be included in the VR representation in the same way. For example, a single frame of the video may be used as an event image 10 to include the video in the VR representation.

Although the steps above are described as taking place at the server, processing steps may also be performed in other locations. For example, any of the processing steps discussed above may be performed on the devices 100 of the attendees of the event.

The flow charts and descriptions thereof herein should not be understood to prescribe a fixed order of performing the method steps described therein. Rather, the method steps may be performed in any order that is practicable. Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions, and alterations apparent to those skilled in the art can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

Methods and processes described herein can be embodied as code (e.g., software code) and/or data. Such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein can be performed by a processor (e.g., a processor of a computer system or data storage system). It should be appreciated by those skilled in the art that computer-readable media include removable and non-removable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagneti c/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that is capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any

propagating signals.

Claims

1. A method comprising:

receiving a first set of one or more images captured at the location of an event;

generating a virtual reality representation in dependence on a first representation that has been generated in dependence on said first set of one or more images;

receiving a second set of one or more images captured at the location of the event by one or more attendees of the event;

processing the one or more images in the second set to separate regions at at least two different focal planes;

including the second set of one or more images in the virtual reality representation based at least on one of the regions.

2. The method of claim 1, wherein the first representation is a panoramic image or a 3D model.

3. The method of claim 1 or claim 2, where the processing of the one or more images in the second set to separate the regions comprises processing of the one or more images by a neutral network to segment the regions.

4. The method of claims 1 to 3, wherein the first set of one or more images has been captured before the event takes place.

5. The method of any one of claims 1 to 4, wherein the regions correspond to the foreground and background of the one or more images in the second set,

wherein the including of the or each image of second set is performed based on the background or foreground of the or each image.

6. The method of claim 5, wherein the generation of the first representation is further dependent on the region corresponding to the background of one or each image of the second set.

7. The method of claim 5 or claim 6, further comprising processing the background of the one or more images in the second set to minimize the difference in colour balance between the background of the one or more images in the second set and the first

representation.

8. The method of any one of claims 1 to 7, wherein the regions correspond to separate regions of the foreground at different focal planes, wherein the inclusion of the or each image of the second set is performed based on one of the separate regions of the foreground of the or each image.

9. The method of any one of claims 1 to 4, wherein the regions correspond to static and dynamic regions of the or each image, wherein the inclusion of the or each image of the second set is performed based on one of the separate regions corresponding to the static region of the or each image.

10. The method of any one of claims 1 to 9, wherein the first representation is represented by a cube map; and

the inclusion of the second set of one or more images in the virtual reality

representation comprises:

comparing the one or more images in the second set to each face of the cube map and assigning each of the one or more images in the second set to a face of the cube map based on the comparing; and

combining each of the one or more images in the second set with the region of the first representation corresponding to the face of the cube map to which each of the one or more images in the second set has been assigned.

11. The method of any one of claims 1 to 10, further comprising receiving contextual data associated with the one or more images of the second set; wherein the inclusion of the second set of one or more images is determined based on the contextual data.

12. The method of any claim 11, wherein the contextual data includes sensor data captured by the device used to capture each image.

13. The method of claim 12, wherein the sensor data includes the location of the device used to capture each of the one or more images at the time when the image was captured.

14. The method of claims 12 and 13, wherein the sensor data includes the orientation of the device used to capture each of the one or more images at the time when the image was captured.

15. The method of any one of claims 12 to 14, wherein the sensor data includes the time at which each of the one or more images was captured.

16. The method of any one of claims 1 to 15, further comprising:

receiving a third set of one or more images captured at the location of the event wherein the third set of one of more images has been captured at a different position within the same location to the first set of one or more images;

wherein the generation of the virtual reality representation of the event is further in dependence on a second representation that has been generated in dependence on said third set of one or more images.

17. The method of claim 16, further comprising, when the first representation and the second representation are both a panoramic image, obtaining a 3D model of the location of the event based on the first representation and the second representation;

wherein the virtual reality representation is generated based on the obtained 3D model of the location of the event.

18. The method of any one of claims 1 to 17, further comprising: generating a 3D model of an object in dependence on at least one of the second set of one or more images; and including the 3D model of the object in the VR representation.

19. The method of any one of the preceding claims, further comprising:

using a trained neural network to segment at least one further region in the first representation and in the one or more of the images of the second set, wherein the further regions are associated with a respective object, and to semantically label the further region in association with the respective object;

matching the labels associated with objects in the or each second images with labels associated with objects in the one or images of the second set,

wherein the including the or each second image into the virtual reality representation is in dependence on a result of the matching.

20. A system configured to produce a virtual reality representation of an event, the system comprising a remote processor, wherein the remote processor is configured to perform the method of any one of claims 1 to 19.

21. The system of claim 20, further comprising a network of mobile devices wherein each mobile device is configured to capture images; wherein the network of mobile devices is networked with the remote processor.