WO2022261772A1 - Deep-learning method for automated content creation in augmented and virtual reality - Google Patents

Deep-learning method for automated content creation in augmented and virtual reality Download PDF

Info

Publication number
WO2022261772A1
WO2022261772A1 (PCT/CA2022/050963)
Authority
WO
WIPO (PCT)
Prior art keywords
objects
degree video
neural network
image data
image
Prior art date
Application number
PCT/CA2022/050963
Other languages
French (fr)
Inventor
Rajesh Nayak
Sierra FRANCIS
Yi Wang
Original Assignee
3Rdi Laboratory Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 3Rdi Laboratory Incorporated filed Critical 3Rdi Laboratory Incorporated
Publication of WO2022261772A1 publication Critical patent/WO2022261772A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes

Definitions

  • the embodiments described herein are generally directed to image processing for augmented and/or virtual reality, and, more particularly, to a deep-learning based method to automate content creation for objects in 360-degree video using object localization to attach relevant media assets without manual intervention.
  • AR augmented reality
  • VR virtual reality
  • common technology relies upon several manual interventions: locating objects in the virtual space; creating accessible “hot spots” on the objects; and assigning relevant media assets to each “hot spot.” This process is time-consuming and monetarily expensive.
  • FIG. 1 illustrates an example processing system, by which one or more of the processes described herein, may be executed, according to an embodiment
  • FIGS. 2 and 3 illustrate flowcharts of example processes for training a neural network for an AR/VR system, according to an embodiment
  • FIGS. 4 and 5 illustrate function diagrams for examples of automated content creation in an AR/VR system, according to an embodiment.
  • content creation in an AR/VR system may be automated by localizing objects in a video (e.g., conventional two-dimensional video in AR, 360-degree video in VR), recognizing each localized object, searching for media assets corresponding to the recognized objects, and “attaching” the media assets to the video, for example, using “hot spots.”
  • the result is an end-to-end, automatic process that receives raw video as an input, and outputs a processed video in which objects are highlighted with accessible hot spots from which corresponding media assets can be accessed.
  • content creation in an AR system may be automated by recognizing objects in a video (e.g., conventional two-dimensional video) in real time, and searching for media assets corresponding to the recognized objects.
  • VR virtual reality
  • AR augmented reality
  • an AR system should not overly rely upon processing resources to perform the automated content creation, since such systems are typically embodied in personal mobile devices (e.g., smart phones), which have limited resources, and must perform the automated content creation in real time.
  • FIG. 1 is a block diagram illustrating an example wired or wireless system 100 that may be used in connection with various embodiments described herein.
  • system 100 may be used as or in conjunction with one or more of the functions, processes, or methods (e.g., to store and/or execute one or more software modules) described herein.
  • System 100 can be a server or any conventional personal computer, or any other processor-enabled device that is capable of wired or wireless data communication.
  • Other computer systems and/or architectures may be also used, as will be clear to those skilled in the art.
  • System 100 preferably includes one or more processors 110.
  • Processor(s) 110 may comprise a central processing unit (CPU). Additional processors may be provided, such as a graphics processing unit (GPU), an auxiliary processor to manage input/output, an auxiliary processor to perform floating-point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal-processing algorithms (e.g., digital-signal processor), a slave processor subordinate to the main processing system (e.g., back-end processor), an additional microprocessor or controller for dual or multiple processor systems, and/or a coprocessor.
  • Such auxiliary processors may be discrete processors or may be integrated with processor 110. Examples of processors which may be used with system 100 include, without limitation, the Pentium® processor, Core i7® processor, and Xeon® processor, all of which are available from Intel Corporation of Santa Clara, California.
  • Communication bus 105 may include a data channel for facilitating information transfer between storage and other peripheral components of system 100. Furthermore, communication bus 105 may provide a set of signals used for communication with processor 110, including a data bus, address bus, and/or control bus (not shown). Communication bus 105 may comprise any standard or non-standard bus architecture such as, for example, bus architectures compliant with industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and/or the like.
  • ISA industry standard architecture
  • EISA extended industry standard architecture
  • MCA Micro Channel Architecture
  • PCI peripheral component interconnect
  • System 100 preferably includes a main memory 115 and may also include a secondary memory 120.
  • Main memory 115 provides storage of instructions and data for programs executing on processor 110, such as one or more of the functions and/or modules discussed herein. It should be understood that programs stored in the memory and executed by processor 110 may be written and/or compiled according to any suitable language, including without limitation C/C++, Java, JavaScript, Perl, Visual Basic, .NET, and the like.
  • Main memory 115 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM).
  • DRAM dynamic random access memory
  • SRAM static random access memory
  • Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, including read only memory (ROM).
  • SDRAM synchronous dynamic random access memory
  • RDRAM Rambus dynamic random access memory
  • FRAM ferroelectric random access memory
  • ROM read only memory
  • Secondary memory 120 may optionally include an internal medium 125 and/or a removable medium 130.
  • Removable medium 130 is read from and/or written to in any well-known manner.
  • Removable storage medium 130 may be, for example, a magnetic tape drive, a compact disc (CD) drive, a digital versatile disc (DVD) drive, other optical drive, a flash memory drive, and/or the like.
  • Secondary memory 120 is a non-transitory computer-readable medium having computer-executable code (e.g., disclosed software modules) and/or other data stored thereon.
  • the computer software or data stored on secondary memory 120 is read into main memory 115 for execution by processor 110.
  • secondary memory 120 may include other similar means for allowing computer programs or other data or instructions to be loaded into system 100. Such means may include, for example, a communication interface 140, which allows software and data to be transferred from external storage medium 145 to system 100. Examples of external storage medium 145 may include an external hard disk drive, an external optical drive, an external magneto-optical drive, and/or the like. Other examples of secondary memory 120 may include semiconductor-based memory, such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable read-only memory (EEPROM), and flash memory (block-oriented memory similar to EEPROM).
  • PROM programmable read-only memory
  • EPROM erasable programmable read-only memory
  • EEPROM electrically erasable read-only memory
  • flash memory block-oriented memory similar to EEPROM
  • system 100 may include a communication interface 140.
  • Communication interface 140 allows software and data to be transferred between system 100 and external devices (e.g. printers), networks, or other information sources.
  • computer software or data may be transferred to system 100, over one or more networks (e.g., including the Internet), from a network server via communication interface 140.
  • Examples of communication interface 140 include a built-in network adapter, network interface card (NIC), Personal Computer Memory Card International Association (PCMCIA) network card, card bus network adapter, wireless network adapter, Universal Serial Bus (USB) network adapter, modem, a wireless data card, a communications port, an infrared interface, an IEEE 1394 fire-wire, and any other device capable of interfacing system 100 with a network or another computing device.
  • NIC network interface card
  • PCMCIA Personal Computer Memory Card International Association
  • USB Universal Serial Bus
  • Communication interface 140 preferably implements industry-promulgated protocol standards, such as Ethernet IEEE 802 standards, Fiber Channel, digital subscriber line (DSL), asynchronous digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated digital services network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on, but may also implement customized or non-standard interface protocols as well.
  • Communication channel 150 may be a wired or wireless network, or any variety of other communication links.
  • Communication channel 150 carries signals 155 and can be implemented using a variety of wired or wireless communication means including wire or cable, fiber optics, conventional phone line, cellular phone link, wireless data communication link, radio frequency (“RF”) link, or infrared link, just to name a few.
  • RF radio frequency
  • Computer-executable code (e.g., computer programs, comprising one or more software modules) is stored in main memory 115 and/or secondary memory 120. Computer-executable code can also be received via communication interface 140 and stored in main memory 115 and/or secondary memory 120. Such computer-executable code, when executed, enables system 100 to perform the various functions of the disclosed embodiments as described elsewhere herein.
  • computer-readable medium is used to refer to any non- transitory computer-readable storage media used to provide computer-executable code and/or other data to or within system 100.
  • Examples of such media include main memory 115, secondary memory 120 (including internal memory 125, removable medium 130, and external storage medium 145), and any peripheral device communicatively coupled with communication interface 140 (including a network information server or other network device).
  • These non-transitory computer-readable media are means for providing executable code, programming instructions, software, and/or other data to system 100.
  • the software may be stored on a computer-readable medium and loaded into system 100 by way of removable medium 130, I/O interface 135, or communication interface 140.
  • the software is loaded into system 100 in the form of electrical communication signals 155.
  • the software when executed by processor 110, preferably causes processor 110 to perform one or more of the processes and functions described elsewhere herein.
  • I/O interface 135 provides an interface between one or more components of system 100 and one or more input and/or output devices.
  • Example input devices include, without limitation, sensors, keyboards, touch screens or other touch-sensitive devices, biometric sensing devices, computer mice, trackballs, pen-based pointing devices, and/or the like.
  • Examples of output devices include, without limitation, other processing devices, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), printers, vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), and/or the like.
  • CRTs cathode ray tubes
  • LED light-emitting diode
  • LCDs liquid crystal displays
  • VFDs vacuum fluorescent displays
  • SEDs surface-conduction electron-emitter displays
  • FEDs field emission displays
  • System 100 may also include optional wireless communication components that facilitate wireless communication over a voice network and/or a data network.
  • the wireless communication components comprise an antenna system 170, a radio system 165, and a baseband system 160.
  • RF radio frequency
  • antenna system 170 may comprise one or more antennae and one or more multiplexors (not shown) that perform a switching function to provide antenna system 170 with transmit and receive signal paths.
  • received RF signals can be coupled from a multiplexor to a low noise amplifier (not shown) that amplifies the received RF signal and sends the amplified signal to radio system 165.
  • radio system 165 may comprise one or more radios that are configured to communicate over various frequencies.
  • radio system 165 may combine a demodulator (not shown) and modulator (not shown) in one integrated circuit (IC). The demodulator and modulator can also be separate components. In the incoming path, the demodulator strips away the RF carrier signal leaving a baseband receive audio signal, which is sent from radio system 165 to baseband system 160.
  • baseband system 160 decodes the signal and converts it to an analog signal. Then the signal is amplified and sent to a speaker. Baseband system 160 also receives analog audio signals from a microphone. These analog audio signals are converted to digital signals and encoded by baseband system 160. Baseband system 160 also encodes the digital signals for transmission and generates a baseband transmit audio signal that is routed to the modulator portion of radio system 165.
  • the modulator mixes the baseband transmit audio signal with an RF carrier signal, generating an RF transmit signal that is routed to antenna system 170 and may pass through a power amplifier (not shown).
  • the power amplifier amplifies the RF transmit signal and routes it to antenna system 170, where the signal is switched to the antenna port for transmission.
  • Baseband system 160 is also communicatively coupled with processor 110, which may be a central processing unit (CPU).
  • Processor 110 has access to data storage areas 115 and 120.
  • Processor 110 is preferably configured to execute instructions (i.e., computer programs, such as the disclosed application, or software modules) that can be stored in main memory 115 or secondary memory 120.
  • Computer programs can also be received from baseband system 160 and stored in main memory 115 or in secondary memory 120, or executed upon receipt. Such computer programs, when executed, enable system 100 to perform the various functions of the disclosed embodiments.
  • system 100 is configured to automatically create content using an integrated or connected camera 175.
  • system 100 may be a mobile device, such as an AR or VR head-mounted device, smart phone, tablet computer, laptop computer, and/or the like, comprising an integrated camera 175 and/or display 180.
  • system 100 may also be a desktop computer or other non-mobile processing device that is connected to an external camera 175 and/or display 180.
  • I/O interface 135 may be configured to provide video data from camera 175 to processor(s) 110 and/or memory 115 or 120 (e.g., via communication bus 105), and from processor(s) 110 and/or memory 115 or 120 to display 180 (e.g., via communication bus 105).
  • camera 175 is a video camera that captures two-dimensional video frames that, when viewed as a video sequence, illustrate the movement of objects in a real-world environment.
  • camera 175 is a 360-degree video camera.
  • 360-degree video cameras capture a series of spherical video frames. These spherical video frames can be used to create an immersive viewing experience that enables viewers to visualize a scene in every direction as they turn their heads while wearing a VR head-mounted device.
  • the phrase “360-degree video frame” should be understood to refer to a spherical video frame, captured by a 360-degree video camera.
  • Each 360-degree video frame may be encoded to an RGB image.
  • Consecutive 360-degree video frames, captured by camera 175 of system 100, may be processed by processor(s) 110 of system 100, as described herein, and displayed on a display 180 of system 100.
  • Processor(s) 110 of system 100 may execute an operating system or application that is capable of processing video data (e.g., captured by camera 175), training neural networks, and/or applying neural networks.
  • processor(s) 110 may comprise a GPU, and the operating system or application may utilize the GPU to train and/or apply a neural network.
  • the GPU may provide mathematical computations in parallel to the operation of a CPU to accelerate processing time of the neural network during the training phase and/or operation phase.
  • memory 115 and/or 120 may be utilized to store the weights of one or more neural networks.
  • the stored weights are responsible for activating perceptrons of the neural network(s) prior to generating outputs, such as the predicted location or class name of an object in an image, such as a video frame.
  • Memory 115 and/or 120 may also be used to store media assets, such that processor(s) 110 may retrieve media assets from memory 115 and/or 120 after obtaining predicted class names from the neural network(s).
  • system 100 may communicate with an external system (e.g., via communication interface 140 over one or more networks), such as a remote server or cloud instance, that trains and/or applies a neural network on video captured by camera 175.
  • system 100 may communicate a stream of 360-degree video frames captured by camera 175 to the external system, and receive a stream of processed 360-degree video frames back from the external system to be displayed on display 180.
  • the external system may store the neural network and media assets in memory to which the external system has either local or remote access.
  • each system 100 may communicate wirelessly with a network of processing devices within a vicinity of system 100.
  • the network of processing devices may provide storage of media assets and streaming functionality to systems 100.
  • the network of processing devices may be used to accelerate the training times of neural networks on systems 100 and/or update the weights of neural networks stored on systems 100, over-the-air, when new objects must be localized and/or recognized.
  • the described processes may be implemented as a hardware component (e.g., general-purpose processor, integrated circuit (IC), application-specific integrated circuit (ASIC), digital signal processor (DSP), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, etc.), combination of hardware components, or combination of hardware and software components.
  • IC integrated circuit
  • ASIC application-specific integrated circuit
  • DSP digital signal processor
  • FPGA field-programmable gate array
  • each process may be implemented with fewer, more, or different subprocesses and a different arrangement and/or ordering of subprocesses.
  • any subprocess which does not depend on the completion of another subprocess may be executed before, after, or in parallel with that other independent subprocess, even if the subprocesses are described or illustrated in a particular order.
  • the automated content creation in an AR/VR system comprises: (1) localizing common objects in a 360-degree video (e.g., paintings or sculptures in the case of an art gallery); (2) recognizing the localized objects by providing concrete or specific object information (e.g., “Mona Lisa” by Leonardo Da Vinci or “David” by Michelangelo); and (3) creating hot spots, in the 360-degree video (e.g., on top of or otherwise near the recognized objects), by which users can access media assets associated with the corresponding objects.
  • two neural networks are utilized to provide joint object localization and recognition.
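  • As a non-limiting illustration of how the two networks could cooperate, the following Python sketch outlines the localize-recognize-attach loop described above. The helper names (localize_model, recognize_model, find_asset, attach_hot_spot) and the assumption of PIL-style frames with a crop() method are hypothetical and do not appear in this disclosure.

```python
# Illustrative sketch only; the helper names are hypothetical stand-ins for the
# two trained neural networks, the media-asset search, and hot-spot creation.

def create_content(frames, localize_model, recognize_model, find_asset, attach_hot_spot):
    processed = []
    for frame in frames:                                   # each 360-degree video frame
        for box, generic_name in localize_model(frame):    # (1) localize common objects
            patch = frame.crop(box)                        # assumes PIL-style frames
            specific_name = recognize_model(patch)         # (2) recognize the object
            if specific_name is None:                      # object not in the training set
                continue
            asset = find_asset(specific_name)              # look up the associated media asset
            if asset is not None:
                attach_hot_spot(frame, box, asset)         # (3) create an accessible hot spot
        processed.append(frame)
    return processed
```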
  • the automated content creation in an AR/VR system comprises: (1) recognizing objects by providing concrete or specific object information (e.g., “Mona Lisa” by Leonardo Da Vinci or “David” by Michelangelo); and (2) retrieving media assets associated with the recognized objects.
  • a first neural network may be explicitly applied for extremely accurate object localization.
  • a training dataset may be prepared to comprise or consist of red-green-blue (RGB) images captured as or extracted from frames of a 360-degree video.
  • RGB red-green-blue
  • Each image may contain one or more objects.
  • Each object in each image may be independently annotated or labeled with one or more classifications of the object by recording the object’s location in the image (e.g., coordinates of the object with respect to the image’s coordinate system).
  • the neural network can be applied, in a testing or operation phase, to localize objects on which it has been trained by outputting the objects’ coordinates in 360-degree video frames.
  • the coordinates can then be used to extract each object from the frames (e.g., by cropping the frame to exclude regions outside a bounding box around the object).
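  • For illustration only, the predicted corner coordinates could be used to extract object patches with simple array slicing, as in the sketch below; the (x1, y1, x2, y2) box format and the NumPy representation of a frame are assumptions rather than requirements of this disclosure.

```python
import numpy as np

def extract_objects(frame: np.ndarray, boxes):
    """Crop each predicted bounding box out of an H x W x 3 RGB frame.

    boxes is assumed to be an iterable of (x1, y1, x2, y2) corner coordinates
    expressed in the frame's coordinate system.
    """
    h, w = frame.shape[:2]
    patches = []
    for x1, y1, x2, y2 in boxes:
        # Clamp predictions to the image bounds so slicing never fails.
        x1, x2 = max(0, int(x1)), min(w, int(x2))
        y1, y2 = max(0, int(y1)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            patches.append(frame[y1:y2, x1:x2].copy())
    return patches
```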
  • a second neural network may be applied for recognizing concrete information about the identity of the extracted objects.
  • a training dataset may be prepared to comprise or consist of RGB images of the objects for which recognition is desired. This training dataset is used to train the neural network to understand the various spatial characteristics of each object in the training dataset.
  • the neural network can be applied, in a testing or operation phase, to recognize objects on which it has been trained by outputting a predicted class name of each object recognized in a two-dimensional or 360-degree video frame.
  • the network receives a video (e.g., two-dimensional video for AR, 360-degree video for VR) and outputs predicted class names for objects recognized in the frames of the video. These predicted class names, output by the network, can then be used as keywords to search an asset database (e.g., in local or remote storage).
  • the training phase and the operation phase of the neural network(s) may be executed by two different systems 100.
  • the training phase may be executed by a server, desktop computer, or other system with significant computing power to generate the neural network(s)
  • the operation phase may be executed by a lightweight device, such as a smart phone, AR head-mounted device, VR head-mounted device, and/or the like, to apply the neural network(s) to real-time video.
  • the lightweight device may be supported by one or more other systems, such as a network of over-the-air processing devices, a remote network server, and/or the like, which may provide updates to the neural network(s) and/or remote execution of one or more of the intermediate functions described herein.
  • FIG. 2 illustrates a flowchart of an example process for training an object localization model for an AR/VR system, according to an embodiment.
  • the process comprises a process 200 of training a neural network to generate an object localization model 250, which predicts the locations of one or more objects in a 360-degree video frame.
  • object localization model 250 is only responsible for general object localization and recognition.
  • a separate object recognition model is trained to predict the specific class name of the objects that are located and generically classified by object localization model 250.
  • object localization model 250 may locate the “Mona Lisa” in a 360-degree video frame and classify it as “painting.” Thereafter, the object recognition model may classify the located object as “Mona Lisa.”
  • Training process 200 for object localization model 250 may comprise a cube map conversion subprocess 220 that converts 360-degree video data 210 into cube map images.
  • 360-degree video data 210 may comprise a series of 360-degree video frames to be used for training.
  • Each 360-degree video frame may contain one or more objects that are labeled with their location and generic class name.
  • Cube map conversion subprocess 220 converts each 360-degree video frame into a cube map image containing the one or more objects.
  • Cube mapping is a method of mapping or projecting a 360-degree view of an environment, as represented by each 360-degree video frame, onto six faces of a cube.
  • each 360-degree video frame may be converted by cube map conversion subprocess 220 from a single flat image into six rectangular (e.g., square) images, representing views in six different directions (e.g., 4 sides, up, and down).
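  • One possible implementation of such a conversion is sketched below using nearest-neighbour sampling in NumPy; the particular face orientation convention, the default face size, and the function name are assumptions made for illustration and are not prescribed by this disclosure.

```python
import numpy as np

def equirect_to_cubemap(equi: np.ndarray, face_size: int = 512) -> dict:
    """Project one equirectangular 360-degree frame onto six cube faces.

    Nearest-neighbour sampling keeps the sketch short; the face orientation
    convention below is an assumption, not taken from this disclosure.
    """
    h, w = equi.shape[:2]
    # Pixel-centre grid in [-1, 1] for one face.
    rng = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    a, b = np.meshgrid(rng, rng)          # a: left -> right, b: top -> bottom
    ones = np.ones_like(a)

    faces_dirs = {
        "front": ( a,    b,  ones),       # +Z
        "right": ( ones, b, -a),          # +X
        "back":  (-a,    b, -ones),       # -Z
        "left":  (-ones, b,  a),          # -X
        "up":    ( a, -ones,  b),         # -Y (y axis points down)
        "down":  ( a,  ones, -b),         # +Y
    }

    faces = {}
    for name, (x, y, z) in faces_dirs.items():
        lon = np.arctan2(x, z)                          # [-pi, pi]
        lat = np.arctan2(y, np.sqrt(x * x + z * z))     # [-pi/2, pi/2]
        u = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
        v = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)
        faces[name] = equi[v, u]
    return faces
```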
  • Each object contained in each cube map image may be manually labeled by determining the object’s relative location with respect to the cube map image.
  • the object’s relative location may be defined by coordinates within a coordinate system of the cube map image. For instance, the relative locations of the objects may be obtained by a user drawing bounding boxes around the objects, and using the coordinates of the corners of the bounding boxes to represent the locations of the objects in the corresponding cube map image’s coordinate system.
  • each object may be manually assigned a generic class name using a naming scheme defined by a pyramid structure, as described elsewhere herein. The resulting location information and generic class name for each object in each cube map image may be stored in object location data 230.
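  • Although this disclosure does not prescribe a storage format for object location data 230, one illustrative per-image record might look like the following; the field names, file name, class names, and coordinate values are hypothetical.

```python
# Illustrative layout for one entry of object location data 230 (field names are assumed).
annotation = {
    "cube_map_image": "gallery_frame_0042_front.png",   # hypothetical file name
    "objects": [
        {
            "generic_class": "painting",                 # generic name from the pyramid naming scheme
            # Corner coordinates of the bounding box in the cube-map image's coordinate system.
            "bounding_box": {"x1": 312, "y1": 128, "x2": 590, "y2": 470},
        },
        {
            "generic_class": "sculpture",
            "bounding_box": {"x1": 710, "y1": 220, "x2": 805, "y2": 512},
        },
    ],
}
```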
  • Neural network 240 may comprise a convolutional neural network (CNN), such as a deep-learning CNN, which is trained on the labeled cube map images.
  • CNN convolutional neural network
  • the training dataset, comprising 360-degree video data 210, may be uploaded via a web-accessible interface to the system 100 that performs training 200 to generate object localization model 250, which may be a network-based (e.g., cloud-based) server.
  • cube map conversion 220 may be performed by the server, and object location data 230 may be generated via the web-accessible interface provided by the server.
  • Once object localization model 250 has been generated, it may be downloaded to and operated on mobile devices comprising different systems 100 to perform AR/VR in real time.
  • the system 100 that trains object localization model 250 may be different from the system 100 that operates object localization model 250.
  • FIG. 3 illustrates a flowchart of an example process for training an object recognition model for an AR/VR system, according to an embodiment.
  • the process comprises a process 300 of training a neural network to generate an object recognition model 350, which accepts an image or video frame from a video as input, and predicts the specific class name of one or more objects in the image frame as output.
  • one or more systems 100 may be used to process video data 310 in conjunction with ground-truth data 330, and execute a series of mathematical computations to generate object recognition model 350.
  • Video data 310 may comprise a series of video frames that illustrate consecutive object movement relative to camera 175.
  • Video data 310 may be acquired by a manual procedure, in which a video taker moves camera 175 around a single object to ensure that the object appears in each video frame.
  • Video frame processing 320 is executed on video data 310 to select effective video frames to be used to train a neural network toward accurate object recognition.
  • Video frame processing 320 may extract a collection of video frames that illustrate the object from different viewing angles.
  • the variation of an object’s movements may be measured by one or more motion estimation algorithms, such as a block matching algorithm (BMA), hierarchical motion estimation, sub-pixel motion estimation, differential pulse code modulation (DPCM), or optical flow.
  • BMA block matching algorithm
  • DPCM differential pulse code modulation
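  • As one concrete, non-limiting possibility, dense optical flow from OpenCV could serve as the motion-estimation algorithm, keeping only frames that add sufficient viewpoint variation; the function name and threshold value below are assumptions for illustration, and any of the other algorithms listed above could be substituted.

```python
import cv2
import numpy as np

def select_training_frames(frames, min_mean_flow: float = 2.0):
    """Keep frames whose dense optical flow relative to the last kept frame
    exceeds a (hypothetical) threshold, so the object is seen from varied angles.
    Assumes frames is a non-empty list of BGR images."""
    kept = [frames[0]]
    last_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(last_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        if np.linalg.norm(flow, axis=2).mean() > min_mean_flow:
            kept.append(frame)
            last_gray = gray
    return kept
```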
  • Ground-truth data 330 may comprise one or more class names that correspond to the objects appearing in video data 310.
  • the video frames in video data 310 that are output from video frame processing 320 may be manually labeled with the class names in ground-truth data 330.
  • class names are assigned to each of the objects in the processed video frames.
  • the number of class names (categories) in ground-truth data 330 represents the number of objects that can be recognized by object recognition model 350.
  • Neural network 340 may be trained by the video frames output by video frame processing 320 and labeled with ground-truth data 330.
  • Neural network 340 may be a CNN, such as a deep-learning CNN. The weights of neural network 340 are fine-tuned to provide an object recognition model 350 that is suitable to recognize objects in a video frame and output the corresponding specific class name of the object.
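  • A brief sketch of what such fine-tuning could look like in PyTorch is given below; the choice of a ResNet-18 backbone, the optimizer, and the hyperparameter values are illustrative assumptions, as this disclosure does not name a particular CNN architecture or framework.

```python
import torch
import torch.nn as nn
from torchvision import models  # assumes torchvision >= 0.13 for the weights API

def build_recognition_model(num_classes: int) -> nn.Module:
    """Start from an ImageNet-pretrained backbone and replace the classifier head
    so it outputs one score per specific class name in the ground-truth data."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def fine_tune(model, loader, epochs: int = 10, lr: float = 1e-4, device: str = "cpu"):
    """Plain supervised fine-tuning loop over labeled video frames."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:          # labels: indices of specific class names
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```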
  • the training dataset, comprising video data 310 and/or ground-truth data 330, may be uploaded to the system 100 that performs training 300 to generate object recognition model 350, which may be a network-based (e.g., cloud-based) server.
  • Once object recognition model 350 has been generated, it may be downloaded to and operated on mobile devices comprising different systems 100 to perform AR/VR in real time. In other words, the system 100 that trains object recognition model 350 may be different from the system 100 that operates object recognition model 350.
  • FIG. 4 illustrates a function diagram for an example of automated content creation in an AR/VR system, according to an embodiment.
  • One or more of the functions illustrated in FIG. 4 may be executed by processor(s) 110 of system 100.
  • System 100 may be a VR system comprising a 360-degree video camera 175 that captures 360-degree video of the surrounding environment in real time.
  • the captured 360-degree video comprises or is converted into 360-degree video data 410 comprising a plurality of 360-degree video frames that can be encoded and/or decoded by system 100 and/or other processing devices.
  • the 360-degree video frames of 360-degree video data 410 are converted to mapped video frame data 420.
  • the 360-degree video frames are converted to one or more cube map images by cube map conversion subprocess 422.
  • Cube map conversion subprocess 422 may be similar or identical to cube map conversion subprocess 220. Thus, any description of cube map conversion subprocess 220 can equally apply to cube map conversion subprocess 422, and vice versa.
  • the cube map images, output by cube map conversion subprocess 422, are sorted in a temporal domain with respect to their corresponding 360-degree video frames, such that a current frame 424 and a next frame 426 can be considered.
  • Current frame 424 and next frame 426 are two adjacent 360-degree video frames that have the same characteristics as the 360-degree video frames of 360-degree video data 210. It should be understood that next frame 426 is a 360-degree video frame that is immediately subsequent in the temporal domain to current frame 424.
  • object localization model 430 may be similar or identical to object localization model 250. Thus, any description of object localization model 250 can equally apply to object localization model 430, and vice versa.
  • Current frame 424 is provided to object localization model 430 in order to determine the locations of objects on which object localization model 430 was trained.
  • object localization model 430 comprises a neural network (e.g., CNN, such as a deep-learning CNN) that is loaded with weights 432 that were obtained during a training phase (e.g., comprising training process 200).
  • the output of object localization model 430 comprises the predicted locations of objects located in the cube map image(s) of current frame 424.
  • object localization model 430 may also determine the generic class names of objects in the cube map image(s) of current frame 424, in which case the output of object localization model 430 further comprises the generic class names of objects located in the cube map image(s) of current frame 424.
  • next frame 426 is set as current frame 424 and provided as input to object localization model 430.
  • object recognition model 440 determines a specific class name of the localized object prior to searching for a corresponding media asset.
  • each localized object, predicted by object localization model 430, is extracted from the cube map image(s) of current frame 424 (e.g., via cropping) into one or more image patches.
  • the image patch(es) are used as the input to object recognition model 440 to realize object recognition.
  • object recognition model 440 comprises a neural network (e.g., CNN, such as a deep-learning CNN) that is loaded with weights 442 that were obtained during a training phase.
  • Object recognition model 440 may be similar or identical to object recognition model 350. Thus, any description of object recognition model 350 can equally apply to object recognition model 440, and vice versa.
  • if object recognition model 440 recognizes the localized object that was input (i.e., “Yes” in subprocess 444), the output of object recognition model 440 comprises the predicted specific class name of the localized object that was input.
  • object recognition model 440 may output a special class name that indicates that no object was recognized or otherwise indicate that no object was recognized (e.g., because object recognition model 440 was not trained on the object).
  • object localization model 430 may predict the location of the sculpture in current frame 424 and predict a generic class name of “sculpture.” If object recognition model 440 has been trained to recognize “David,” object recognition model 440 may output the specific class name of “David.” On the other hand, if object recognition model 440 has not been trained to recognize “David,” object recognition model 440 will fail to recognize “David.” In this case, if there are no more localized objects output by object localization model 430 and remaining to be recognized, the current frame 424 is changed to the next frame 426 and the localization and recognition processes are repeated.
  • If object recognition model 440 outputs a specific class name for one or more objects (i.e., “Yes” in subprocess 444), thereby indicating that at least one object has been recognized in current frame 424, one or more media assets corresponding to the one or more objects may be retrieved in subprocess 450 from media asset data 452.
  • each specific class name that is output by object recognition model 440 may be used as a keyword or index to search media asset data 452.
  • the search result may comprise a media asset for each specific class name representing each recognized object.
  • Media asset data 452 may be stored in local storage of system 100 or in remote storage (e.g., in a network of processing devices that wirelessly communicate with system 100, in a remote server or cloud service, etc.).
  • Media asset data 452 may comprise, for each of a plurality of objects which object recognition model 440 is trained to recognize, one or more visual, audio, and/or text contents associated with the object.
  • the structure of media asset data 452 may be constructed using a common or standard query language, such as Structured Query Language (SQL), Data Mining Extensions (DMX), or Contextual Query Language (CQL), such that media assets are retrievable by specific class name.
  • SQL Structured Query Language
  • DMX Data Mining Extensions
  • CQL Contextual Query Language
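  • Using SQL as one of the query languages named above, retrieval in subprocess 450 could resemble the following sketch; the SQLite backend, table name, and column names are illustrative assumptions rather than details taken from this disclosure.

```python
import sqlite3

def fetch_media_assets(db_path: str, class_names):
    """Look up media asset records keyed by the specific class names predicted
    by the recognition model. Assumed schema:
        media_assets(class_name TEXT, asset_type TEXT, uri TEXT, caption TEXT)
    """
    results = {}
    with sqlite3.connect(db_path) as conn:
        for name in class_names:
            rows = conn.execute(
                "SELECT asset_type, uri, caption FROM media_assets WHERE class_name = ?",
                (name,),
            ).fetchall()
            if rows:
                results[name] = rows
    return results

# Example (hypothetical database file and class names):
# assets = fetch_media_assets("assets.db", ["Mona Lisa", "David"])
```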
  • Processed 360-degree video data 460 may comprise a 360- degree video that is processed such that the media asset(s) can be accessed by a user of system 100 (e.g., a VR head-mounted device) through the 360-degree video.
  • attaching media assets to a 360-degree video frame comprises, for each media asset, creating a hot spot on top of or near the location, as determined by object localization model 430, of each recognized object, in the 360-degree video frame.
  • the corresponding media asset can be accessed and provided to the user. For instance, visual or text content of the media asset may be displayed on display 180 at or near the location of the corresponding object, and/or audio content may be played via speakers of system 100.
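  • One way to represent the result of subprocess 454 is a per-frame list of hot-spot records, each pairing a recognized object's location with its media asset, which the rendering layer can then draw and make selectable; the data structure below is an assumed illustration, not a format defined in this disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class HotSpot:
    position: Tuple[int, int]   # centre of the recognized object (assumed convention)
    class_name: str             # e.g., "Mona Lisa"
    asset_uri: str              # visual, audio, or text content associated with the object

@dataclass
class ProcessedFrame:
    frame_index: int
    hot_spots: List[HotSpot] = field(default_factory=list)

def attach_hot_spot(processed: ProcessedFrame, box, class_name: str, asset_uri: str):
    """Place a hot spot at the centre of the object's bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    processed.hot_spots.append(
        HotSpot(position=((x1 + x2) // 2, (y1 + y2) // 2),
                class_name=class_name, asset_uri=asset_uri))
```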
  • automated content creation may refer to the user of an AR/VR system having no need to manually search for media assets corresponding to an object when that object appears in a video sequence.
  • FIG. 5 illustrates a function diagram for an example of automated content creation in an AR/VR system, according to an alternative embodiment.
  • One or more of the functions illustrated in FIG. 5 may be executed by processor(s) 110 of system 100.
  • System 100 may be an AR system or a personal device capable of performing AR (e.g., a smart phone) with an integrated camera 175 that is configured to capture video and display processed video on a display 180.
  • System 100 may use camera 175 to capture or record video in real time in a real-world environment.
  • the recorded video is converted to video data 510 that can be encoded and/or decoded by system 100 and/or other processing devices.
  • the decoded video data 510 comprises a plurality of video frames that, when viewed as a video sequence, visualize a temporally consecutive action. These video frames may be processed as video frame data 520, in which current frame 524 refers to a single video frame that is captured and/or considered in a present moment, and next frame 526 refers to a single video frame that is captured and/or considered immediately subsequent to current frame 524.
  • To realize object recognition, current frame 524 is input to object recognition model 440 to predict the specific class names of one or more objects in current frame 524.
  • elements 440-460 in FIG. 5 may be similar or identical to the respective same-numbered elements in FIG. 4. Thus, all descriptions of these elements with respect to FIG. 4 apply equally to these elements with respect to FIG. 5. Accordingly, these elements will not be redundantly described except to describe possible alternative implementations.
  • object recognition model 440 may be applied to two-dimensional video frames in the same or similar manner as it is applied to cropped object images from cube map images.
  • the attachment of media assets in subprocess 454 in an AR system or mobile device may differ from subprocess 454 in a VR system.
  • content from a media asset associated with each object in video data 460 may be shown directly on display 180 of system 100.
  • the automated content creation of these disclosed embodiments may be suitable for real-time processing.
  • the entire process is sufficiently fast that the corresponding media assets are immediately accessible when a user of the AR/VR system moves towards an object. For example, when the “Mona Lisa” appears in current frame 424 or 524, the user can access corresponding content of the media asset associated with the “Mona Lisa” without any substantial or noticeable delay.
  • A Deep-Learning Based Method for Automated Experience Creation in Virtual Reality: a method of creating Virtual Reality (VR) experiences (which display 360-degree images or video to a user via a headset or other device to simulate immersion in the image) by using deep learning to localize objects in 360-degree video, recognize those objects, and attach relevant media assets without user intervention.
  • This method comprises the following steps.
  • a web-accessible user interface for uploading identifying images used to train a neural network for recognition of specific 2D and 3D objects, and 360-degree images or videos that will form the Virtual Reality environment.
  • a method of creating Virtual Reality (VR) experiences, which display 360-degree images or videos to a user via a headset or other device to simulate immersion in the image, by using deep learning to localize objects in 360-degree video, recognize those objects, and attach relevant media assets without user intervention, comprising: a web-accessible user interface used to generate an image identifier, wherein the web-accessible interface allows a user to upload identifying images that are used to train a convolutional neural network (CNN); and the CNN trained to perform an object detection task comprising localizing and classifying user-specified objects in the 360-degree images or image frames.
  • CNN convolutional neural network
  • one output of the CNN comprises one or more bounding boxes around detected objects in the 360-degree images or image frames; another output consists of one or more feature vectors that identify 2D or 3D objects in the output bounding boxes.
  • One or more of the methods above further comprising an additional image processing step prior to sending a 360-degree image or video frame to the CNN for object detection, wherein the additional image processing step converts a 360-degree image or video frame into six-sided cube map images.
  • One or more of the methods above further comprising an additional image processing step after obtaining the prediction outcomes for the six-sided cube map images, wherein the additional image processing step uses the prediction outcomes of all six cube map images to localize and classify objects in a 360-degree image or video frame.
  • One or more of the methods above further comprising a server that is communicative with the CNN, wherein the CNN is configured to send its prediction results to the server, and wherein the server is configured to fetch digital media resources based on the CNN’s prediction results.
  • the digital media resources include images, text, scripts, or 3D models.
  • a method of creating Augmented Reality (AR) experiences via deep learning for real-time recognition and resource fetching, comprising: a web-accessible user interface used to generate an image identifier, wherein the web-accessible interface allows a user to upload identifying images that are used to train a convolutional neural network (CNN); and the CNN trained to perform an image classification task comprising generating a feature vector identifying 2D or 3D objects defined in the training data.
  • the method further comprising the step of image augmentation to generate the CNN training data from a single image, a video, or 3D data.
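  • Generating training data from a single identifying image, as recited in this step, could rely on standard augmentation transforms; the torchvision-based sketch below and its specific transform parameters are illustrative assumptions only.

```python
from torchvision import transforms
from PIL import Image

# Assumed augmentation pipeline for expanding one uploaded identifying image into
# many training samples; the chosen transforms and parameters are illustrative only.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomRotation(15),
])

def expand_training_set(image_path: str, num_samples: int = 200):
    """Create num_samples augmented variants of one identifying image."""
    image = Image.open(image_path).convert("RGB")
    return [augment(image) for _ in range(num_samples)]
```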
  • the method wherein the CNN is configured to use a single image for the prediction of 2D or 3D objects.
  • the method further comprising a video capture device communicative with the CNN, wherein the video capture device is configured to generate the input data of the CNN, which refers to each video frame or portions of frames captured by the video capture device.
  • the method further comprising a server that is communicative with the CNN, wherein the CNN is configured to send its prediction results to the server, and wherein the server is configured to fetch digital media resources based on the CNN’s prediction results.
  • the method further comprising a mobile device equipped with a video capture device, wherein inference of the CNN is executed on the mobile device.
  • the digital media resources include images, text, scripts, or 3D models.
  • the method wherein the CNN is deployed on a mobile device for real-time classification of 2D or 3D objects.
  • Combinations such as “at least one of A, B, or C,” “one or more of A, B, or C,” “at least one of A, B, and C,” “one or more of A, B, and C,” and “A, B, C, or any combination thereof” include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C. Specifically, any such combination may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, and may contain one or more members of its constituents A, B, and/or C.
  • For example, a combination of A and B may comprise one A and multiple B’s, multiple A’s and one B, or multiple A’s and multiple B’s.

Abstract

Deep-learning method for automated content creation in virtual reality. In an embodiment, a neural network (e.g., convolutional neural network) is applied to 360-degree video to provide an end-to-end AR/VR system that localizes and recognizes objects in the 360-degree video. The neural network accurately locates and recognizes objects in the 360-degree video, such that corresponding media assets may be automatically retrieved (i.e., without manual intervention) and attached to frames of the 360-degree video.

Description

DEEP-LEARNING METHOD FOR AUTOMATED CONTENT CREATION IN
AUGMENTED AND VIRTUAL REALITY
BACKGROUND
[1] Field of the Invention
[2] The embodiments described herein are generally directed to image processing for augmented and/or virtual reality, and, more particularly, to a deep-learning based method to automate content creation for objects in 360-degree video using object localization to attach relevant media assets without manual intervention.
[3] Description of the Related Art
[4] To create interactive experiences in augmented reality (AR) or virtual reality (VR), common technology relies upon several manual interventions: locating objects in the virtual space; creating accessible “hot spots” on the objects; and assigning relevant media assets to each “hot spot.” This process is time-consuming and monetarily expensive.
[5] To facilitate content creation for VR, some existing systems have provided a graphical user interface that supports a drag-and-drop method of attaching media assets to 360-degree video. However, this process still requires manually recognizing and locating the objects prior to attaching assets to the objects.
[6] Methods designed for AR function poorly when the low-level feature differences (e.g., shapes or colors) of target objects are insufficient to discriminate the target objects from each other, or when a target object produces significant visual differences from different viewing angles (e.g., a sculpture in an art gallery).
[7] Therefore, it would be desirable to automate content creation for AR and VR, end to end, without the need for manual intervention and without these problems.
SUMMARY
[8] Accordingly, systems, methods, and non-transitory computer-readable media are disclosed for automating content creation in VR using deep learning.
BRIEF DESCRIPTION OF THE DRAWINGS
[9] The details of the present invention, both as to its structure and operation, may be gleaned in part by study of the accompanying drawings, in which like reference numerals refer to like parts, and in which:
[10] FIG. 1 illustrates an example processing system, by which one or more of the processes described herein, may be executed, according to an embodiment;
[11] FIGS. 2 and 3 illustrate flowcharts of example processes for training a neural network for an AR/VR system, according to an embodiment; and
[12] FIGS. 4 and 5 illustrate function diagrams for examples of automated content creation in an AR/VR system, according to an embodiment.
DETAILED DESCRIPTION
[13] In an embodiment, systems, methods, and non-transitory computer-readable media are disclosed for automated content creation in AR or VR using deep learning. For example, content creation in an AR/VR system may be automated by localizing objects in a video (e.g., conventional two-dimensional video in AR, 360-degree video in VR), recognizing each localized object, searching for media assets corresponding to the recognized objects, and “attaching” the media assets to the video, for example, using “hot spots.” The result is an end-to-end, automatic process that receives raw video as an input, and outputs a processed video in which objects are highlighted with accessible hot spots from which corresponding media assets can be accessed. In an alternative embodiment, content creation in an AR system may be automated by recognizing objects in a video (e.g., conventional two-dimensional video) in real time, and searching for media assets corresponding to the recognized objects.
[14] As used herein, the term “virtual reality” or “VR” should be understood to refer to interactive 360-degree or spherical computer-generated image data (e.g., a 360-degree image or video), and “augmented reality” or “AR” should be understood to refer to the superimposition of one or more computer-generated images over a user’s view of reality (e.g., a two-dimensional image or video captured by a camera in real time) on a display. Ideally, an AR system should not overly rely upon processing resources to perform the automated content creation, since such systems are typically embodied in personal mobile devices (e.g., smart phones), which have limited resources, and must perform the automated content creation in real time.
[15] After reading this description, it will become apparent to one skilled in the art how to implement the invention in various alternative embodiments and alternative applications. However, although various embodiments of the present invention will be described herein, it is understood that these embodiments are presented by way of example and illustration only, and not limitation. As such, this detailed description of various embodiments should not be construed to limit the scope or breadth of the present invention as set forth in the appended claims.
[16] 1. System Overview
[17] 1.1. Example Processing Device
[18] FIG. 1 is a block diagram illustrating an example wired or wireless system 100 that may be used in connection with various embodiments described herein. For example, system 100 may be used as or in conjunction with one or more of the functions, processes, or methods (e.g., to store and/or execute one or more software modules) described herein. System 100 can be a server or any conventional personal computer, or any other processor-enabled device that is capable of wired or wireless data communication. Other computer systems and/or architectures may be also used, as will be clear to those skilled in the art.
[19] System 100 preferably includes one or more processors 110. Processor(s) 110 may comprise a central processing unit (CPU). Additional processors may be provided, such as a graphics processing unit (GPU), an auxiliary processor to manage input/output, an auxiliary processor to perform floating-point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal-processing algorithms (e.g., digital-signal processor), a slave processor subordinate to the main processing system (e.g., back-end processor), an additional microprocessor or controller for dual or multiple processor systems, and/or a coprocessor. Such auxiliary processors may be discrete processors or may be integrated with processor 110. Examples of processors which may be used with system 100 include, without limitation, the Pentium® processor, Core i7® processor, and Xeon® processor, all of which are available from Intel Corporation of Santa Clara, California.
[20] Processor 110 is preferably connected to a communication bus 105. Communication bus 105 may include a data channel for facilitating information transfer between storage and other peripheral components of system 100. Furthermore, communication bus 105 may provide a set of signals used for communication with processor 110, including a data bus, address bus, and/or control bus (not shown). Communication bus 105 may comprise any standard or non-standard bus architecture such as, for example, bus architectures compliant with industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and/or the like.
[21] System 100 preferably includes a main memory 115 and may also include a secondary memory 120. Main memory 115 provides storage of instructions and data for programs executing on processor 110, such as one or more of the functions and/or modules discussed herein. It should be understood that programs stored in the memory and executed by processor 110 may be written and/or compiled according to any suitable language, including without limitation C/C++, Java, JavaScript, Perl, Visual Basic, .NET, and the like. Main memory 115 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM). Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, including read only memory (ROM).
[22] Secondary memory 120 may optionally include an internal medium 125 and/or a removable medium 130. Removable medium 130 is read from and/or written to in any well-known manner. Removable storage medium 130 may be, for example, a magnetic tape drive, a compact disc (CD) drive, a digital versatile disc (DVD) drive, other optical drive, a flash memory drive, and/or the like.
[23] Secondary memory 120 is a non-transitory computer-readable medium having computer-executable code (e.g., disclosed software modules) and/or other data stored thereon. The computer software or data stored on secondary memory 120 is read into main memory 115 for execution by processor 110.
[24] In alternative embodiments, secondary memory 120 may include other similar means for allowing computer programs or other data or instructions to be loaded into system 100. Such means may include, for example, a communication interface 140, which allows software and data to be transferred from external storage medium 145 to system 100. Examples of external storage medium 145 may include an external hard disk drive, an external optical drive, an external magneto-optical drive, and/or the like. Other examples of secondary memory 120 may include semiconductor-based memory, such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable read-only memory (EEPROM), and flash memory (block-oriented memory similar to EEPROM).
[25] As mentioned above, system 100 may include a communication interface 140. Communication interface 140 allows software and data to be transferred between system 100 and external devices (e.g., printers), networks, or other information sources. For example, computer software or data may be transferred to system 100, over one or more networks (e.g., including the Internet), from a network server via communication interface 140. Examples of communication interface 140 include a built-in network adapter, network interface card (NIC), Personal Computer Memory Card International Association (PCMCIA) network card, card bus network adapter, wireless network adapter, Universal Serial Bus (USB) network adapter, modem, a wireless data card, a communications port, an infrared interface, an IEEE 1394 (FireWire) interface, and any other device capable of interfacing system 100 with a network or another computing device. Communication interface 140 preferably implements industry-promulgated protocol standards, such as Ethernet IEEE 802 standards, Fiber Channel, digital subscriber line (DSL), asynchronous digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated digital services network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on, but may also implement customized or non-standard interface protocols as well.
[26] Software and data transferred via communication interface 140 are generally in the form of electrical communication signals 155. These signals 155 may be provided to communication interface 140 via a communication channel 150. In an embodiment, communication channel 150 may be a wired or wireless network, or any variety of other communication links. Communication channel 150 carries signals 155 and can be implemented using a variety of wired or wireless communication means including wire or cable, fiber optics, conventional phone line, cellular phone link, wireless data communication link, radio frequency (“RF”) link, or infrared link, just to name a few.
[27] Computer-executable code (e.g., computer programs, comprising one or more software modules) is stored in main memory 115 and/or secondary memory 120. Computer-executable code can also be received via communication interface 140 and stored in main memory 115 and/or secondary memory 120. Such computer-executable code, when executed, enables system 100 to perform the various functions of the disclosed embodiments as described elsewhere herein.
[28] In this description, the term “computer-readable medium” is used to refer to any non-transitory computer-readable storage media used to provide computer-executable code and/or other data to or within system 100. Examples of such media include main memory 115, secondary memory 120 (including internal medium 125, removable medium 130, and external storage medium 145), and any peripheral device communicatively coupled with communication interface 140 (including a network information server or other network device). These non-transitory computer-readable media are means for providing executable code, programming instructions, software, and/or other data to system 100.
[29] In an embodiment that is implemented using software, the software may be stored on a computer-readable medium and loaded into system 100 by way of removable medium 130, I/O interface 135, or communication interface 140. In such an embodiment, the software is loaded into system 100 in the form of electrical communication signals 155. The software, when executed by processor 110, preferably causes processor 110 to perform one or more of the processes and functions described elsewhere herein.
[30] In an embodiment, I/O interface 135 provides an interface between one or more components of system 100 and one or more input and/or output devices. Example input devices include, without limitation, sensors, keyboards, touch screens or other touch-sensitive devices, biometric sensing devices, computer mice, trackballs, pen-based pointing devices, and/or the like. Examples of output devices include, without limitation, other processing devices, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), printers, vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), and/or the like. In some cases, an input and output device may be combined, such as in the case of a touch panel display (e.g., in a smartphone, tablet, or other mobile device). In a particular embodiment, I/O interface 135 provides an input interface for a camera 175 (e.g., a 360-degree video camera integrated into a VR system, a camera integrated into a mobile device or AR system, etc.), and an output to a display 180 (e.g., integrated into an AR/VR system, integral with a mobile device, etc.).

[31] System 100 may also include optional wireless communication components that facilitate wireless communication over a voice network and/or a data network. The wireless communication components comprise an antenna system 170, a radio system 165, and a baseband system 160. In system 100, radio frequency (RF) signals are transmitted and received over the air by antenna system 170 under the management of radio system 165.
[32] In an embodiment, antenna system 170 may comprise one or more antennae and one or more multiplexors (not shown) that perform a switching function to provide antenna system 170 with transmit and receive signal paths. In the receive path, received RF signals can be coupled from a multiplexor to a low noise amplifier (not shown) that amplifies the received RF signal and sends the amplified signal to radio system 165.
[33] In an alternative embodiment, radio system 165 may comprise one or more radios that are configured to communicate over various frequencies. In an embodiment, radio system 165 may combine a demodulator (not shown) and modulator (not shown) in one integrated circuit (IC). The demodulator and modulator can also be separate components. In the incoming path, the demodulator strips away the RF carrier signal leaving a baseband receive audio signal, which is sent from radio system 165 to baseband system 160.
[34] If the received signal contains audio information, then baseband system 160 decodes the signal and converts it to an analog signal. Then the signal is amplified and sent to a speaker. Baseband system 160 also receives analog audio signals from a microphone. These analog audio signals are converted to digital signals and encoded by baseband system 160. Baseband system 160 also encodes the digital signals for transmission and generates a baseband transmit audio signal that is routed to the modulator portion of radio system 165. The modulator mixes the baseband transmit audio signal with an RF carrier signal, generating an RF transmit signal that is routed to antenna system 170 and may pass through a power amplifier (not shown). The power amplifier amplifies the RF transmit signal and routes it to antenna system 170, where the signal is switched to the antenna port for transmission.
[35] Baseband system 160 is also communicatively coupled with processor 110, which may be a central processing unit (CPU). Processor 110 has access to data storage areas 115 and 120. Processor 110 is preferably configured to execute instructions (i.e., computer programs, such as the disclosed application, or software modules) that can be stored in main memory 115 or secondary memory 120. Computer programs can also be received from baseband system 160 and stored in main memory 115 or in secondary memory 120, or executed upon receipt. Such computer programs, when executed, enable system 100 to perform the various functions of the disclosed embodiments.
[36] 1.2. Example Implementation
[37] In an embodiment, system 100 is configured to automatically create content using an integrated or connected camera 175. For example, system 100 may be a mobile device, such as an AR or VR head-mounted device, smart phone, tablet computer, laptop computer, and/or the like, comprising an integrated camera 175 and/or display 180. However, it should be understood that system 100 may also be a desktop computer or other non-mobile processing device that is connected to an external camera 175 and/or display 180. In either case, I/O interface 135 may be configured to provide video data from camera 175 to processor(s) 110 and/or memory 115 or 120 (e.g., via communication bus 105), and from processor(s) 110 and/or memory 115 or 120 to display 180 (e.g., via communication bus 105).
[38] In an embodiment, camera 175 is a video camera that captures two-dimensional video frames that, when viewed as a video sequence, illustrate the movement of objects in a real-world environment. In an alternative embodiment, camera 175 is a 360-degree video camera. 360-degree video cameras capture a series of spherical video frames. These spherical video frames can be used to create an immersive viewing experience that enables viewers to visualize a scene in every direction as they turn their heads while wearing a VR head-mounted device. As used herein, the phrase “360-degree video frame” should be understood to refer to a spherical video frame, captured by a 360-degree video camera. Each 360-degree video frame may be encoded as an RGB image. Consecutive 360-degree video frames, captured by camera 175 of system 100, may be processed by processor(s) 110 of system 100, as described herein, and displayed on a display 180 of system 100.
[39] Processor(s) 110 of system 100 may execute an operating system or application that is capable of processing video data (e.g., captured by camera 175), training neural networks, and/or applying neural networks. In an embodiment, processor(s) 110 may comprise a GPU, and the operating system or application may utilize the GPU to train and/or apply a neural network. For example, the GPU may provide mathematical computations in parallel to the operation of a CPU to accelerate processing time of the neural network during the training phase and/or operation phase. In addition, memory 115 and/or 120 may be utilized to store the weights of one or more neural networks. The stored weights are responsible for activating perceptrons of the neural network(s) prior to generating outputs, such as the predicted location or class name of an object in an image, such as a video frame. Memory 115 and/or 120 may also be used to store media assets, such that processor(s) 110 may retrieve media assets from memory 115 and/or 120 after obtaining predicted class names from the neural network(s).
[40] Alternatively, system 100 may communicate with an external system (e.g., via communication interface 140 over one or more networks), such as a remote server or cloud instance, that trains and/or applies a neural network on video captured by camera 175. In this case, system 100 may communicate a stream of 360-degree video frames captured by camera 175 to the external system, and receive a stream of processed 360-degree video frames back from the external system to be displayed on display 180. It should be understood that, in such an embodiment, the external system may store the neural network and media assets in memory to which the external system has either local or remote access.
[41] In an embodiment, each system 100 may communicate wirelessly with a network of processing devices within a vicinity of system 100. The network of processing devices may provide storage of media assets and streaming functionality to systems 100. Depending on the requirements of the specific implementation, the network of processing devices may be used to accelerate the training times of neural networks on systems 100 and/or update the weights of neural networks stored on systems 100, over-the-air, when new objects must be localized and/or recognized.
[42] 2. Process Overview
[43] Embodiments of processes for automated content creation in AR/VR using deep learning will now be described in detail. It should be understood that the described processes may be embodied in one or more software modules that are executed by one or more hardware processors (e.g., processor 110), for example, as a computer program or software package. The described processes may be implemented as instructions represented in source code, object code, and/or machine code. These instructions may be executed directly by hardware processor(s) 110, or alternatively, may be executed by a virtual machine operating between the object code and hardware processors 110.

[44] Alternatively, the described processes may be implemented as a hardware component (e.g., general-purpose processor, integrated circuit (IC), application-specific integrated circuit (ASIC), digital signal processor (DSP), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, etc.), combination of hardware components, or combination of hardware and software components. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps are described herein generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled persons can implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the invention. In addition, the grouping of functions within a component, block, module, circuit, or step is for ease of description. Specific functions or steps can be moved from one component, block, module, circuit, or step to another without departing from the invention.
[45] Furthermore, while the processes, described herein, are illustrated with a certain arrangement and ordering of subprocesses, each process may be implemented with fewer, more, or different subprocesses and a different arrangement and/or ordering of subprocesses. In addition, it should be understood that any subprocess, which does not depend on the completion of another subprocess, may be executed before, after, or in parallel with that other independent subprocess, even if the subprocesses are described or illustrated in a particular order.
[46] 2.1. Introduction
[47] In a first embodiment, the automated content creation in an AR/VR system comprises: (1) localizing common objects in a 360-degree video (e.g., paintings or sculptures in the case of an art gallery); (2) recognizing the localized objects by providing concrete or specific object information (e.g., “Mona Lisa” by Leonardo Da Vinci or “David” by Michelangelo); and (3) creating hot spots, in the 360-degree video (e.g., on top of or otherwise near the recognized objects), by which users can access media assets associated with the corresponding objects. In an embodiment, two neural networks are utilized to provide joint object localization and recognition. In a second embodiment, the automated content creation in an AR/VR system comprises: (1) recognizing objects by providing concrete or specific object information (e.g., “Mona Lisa” by Leonardo Da Vinci or “David” by Michelangelo); and (2) retrieving media assets associated with the recognized objects.
[48] In the first embodiment, a first neural network may be explicitly applied for highly accurate object localization. A training dataset may be prepared to comprise or consist of red-green-blue (RGB) images captured as or extracted from frames of a 360-degree video. Each image may contain one or more objects. Each object in each image may be independently annotated or labeled with one or more classifications of the object, as well as by recording the object’s location in the image (e.g., coordinates of the object with respect to the image’s coordinate system). After the neural network has been trained on these labeled images, the neural network can be applied, in a testing or operation phase, to localize objects on which it has been trained by outputting the objects’ coordinates in 360-degree video frames. The coordinates can then be used to extract each object from the frames (e.g., by cropping the frame to exclude regions outside a bounding box around the object).
[49] In the first and second embodiments, a second neural network may be applied for recognizing concrete information about the identity of the extracted objects. A training dataset may be prepared to comprise or consist of RGB images of the objects for which recognition is desired. This training dataset is used to train the neural network to understand the various spatial characteristics of each object in the training dataset. After the neural network has been trained on the training dataset, the neural network can be applied, in a testing or operation phase, to recognize objects on which it has been trained by outputting a predicted class name of each object recognized in a two-dimensional or 360-degree video frame.
[50] In an embodiment, during the operation phase, the network (e.g., comprising the first neural network and the second neural network in the first embodiment, and comprising only the second neural network in the second embodiment), receives a video (e.g., two-dimensional video for AR, 360-degree video for VR) and outputs predicted class names for objects recognized in the frames of the video. These predicted class names, output by the network, can then be used as keywords to search an asset database (e.g., in local or remote storage).
[51] In an embodiment, the training phase and the operation phase of the neural network(s) may be executed by two different systems 100. For example, the training phase may be executed by a server, desktop computer, or other system with significant computing power to generate the neural network(s), whereas the operation phase may be executed by a lightweight device, such as a smart phone, AR head-mounted device, VR head-mounted device, and/or the like, to apply the neural network(s) to real-time video. However, it is contemplated that the lightweight device may be supported by one or more other systems, such as a network of over-the-air processing devices, a remote network server, and/or the like, which may provide updates to the neural network(s) and/or remote execution of one or more of the intermediate functions described herein.
[52] 2.2. Training Phase for Object Localization Model
[53] FIG. 2 illustrates a flowchart of an example process for training an object localization model for an AR/VR system, according to an embodiment. The process comprises a process 200 of training a neural network to generate an object localization model 250, which predicts the locations of one or more objects in a 360-degree video frame. In an embodiment, object localization model 250 is only responsible for general object localization and recognition. In such an embodiment, a separate object recognition model is trained to predict the specific class name of the objects that are located and generically classified by object localization model 250.
[54] For example, a famous painting, such as the “Mona Lisa,” can be classified in a hierarchical or pyramid structure, with a generic class name of “painting” and a unique, concrete, or otherwise specific class name of “Mona Lisa.” In this example, object localization model 250 may locate the “Mona Lisa” in a 360-degree video frame and classify it as “painting.” Thereafter, the object recognition model may classify the located object as “Mona Lisa.”
[55] Training process 200 for object localization model 250 may comprise a cube map conversion subprocess 220 that converts 360-degree video data 210 into cube map images. 360-degree video data 210 may comprise a series of 360-degree video frames to be used for training. Each 360-degree video frame may contain one or more objects that are labeled with their location and generic class name. Cube map conversion subprocess 220 converts each 360-degree video frame into a cube map image containing the one or more objects. Cube mapping is a method of mapping or projecting a 360-degree view of an environment, as represented by each 360-degree video frame, onto six faces of a cube. Specifically, each 360-degree video frame may be converted by cube map conversion subprocess 220 from a single flat image into six rectangular (e.g., square) images, representing views in six different directions (e.g., four sides, up, and down).
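By way of a non-limiting illustration, the sketch below shows one way cube map conversion subprocess 220 could be implemented, assuming each 360-degree video frame is supplied as an equirectangular RGB array. The face orientations, face size, and nearest-neighbour sampling are illustrative assumptions rather than details prescribed by this disclosure.

```python
# A minimal sketch of cube map conversion, assuming an equirectangular input.
import numpy as np

def equirect_to_cube_faces(equirect: np.ndarray, face_size: int = 512) -> dict:
    """Project one equirectangular frame (H x W x 3) onto six cube faces."""
    h, w, _ = equirect.shape
    # (u, v) grid in [-1, 1] parameterising each face of the unit cube.
    u, v = np.meshgrid(np.linspace(-1, 1, face_size), np.linspace(-1, 1, face_size))
    ones = np.ones_like(u)
    directions = {
        "front": (ones, u, -v), "back": (-ones, -u, -v),
        "right": (-u, ones, -v), "left": (u, -ones, -v),
        "up": (v, u, ones), "down": (-v, u, -ones),
    }
    faces = {}
    for name, (x, y, z) in directions.items():
        lon = np.arctan2(y, x)                      # longitude in [-pi, pi]
        lat = np.arctan2(z, np.sqrt(x**2 + y**2))   # latitude in [-pi/2, pi/2]
        # Map spherical coordinates to equirectangular pixel indices.
        px = ((lon / (2 * np.pi) + 0.5) * (w - 1)).astype(np.int32)
        py = ((0.5 - lat / np.pi) * (h - 1)).astype(np.int32)
        faces[name] = equirect[py, px]              # nearest-neighbour sampling
    return faces
```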
[56] Each object contained in each cube map image may be manually labeled by determining the object’s relative location with respect to the cube map image. The object’s relative location may be defined by coordinates within a coordinate system of the cube map image. For instance, the relative locations of the objects may be obtained by a user drawing bounding boxes around the objects, and using the coordinates of the corners of the bounding boxes to represent the locations of the objects in the corresponding cube map image’s coordinate system. In addition to location information, each object may be manually assigned a generic class name using a naming scheme defined by a pyramid structure, as described elsewhere herein. The resulting location information and generic class name for each object in each cube map image may be stored in object location data 230.
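For illustration only, one possible record layout for an entry of object location data 230 is sketched below; the field names and values are assumptions introduced for clarity and are not mandated by this disclosure.

```python
# Illustrative annotation record for one labeled object in object location data 230.
annotation = {
    "frame_id": 42,               # which 360-degree video frame the object came from
    "face": "front",              # which cube map face the object appears on
    "bbox": [120, 85, 310, 400],  # [x_min, y_min, x_max, y_max] in face pixel coordinates
    "generic_class": "painting",  # generic class name from the pyramid naming scheme
}
```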
[57] The cube map images output by cube map conversion subprocess 220 and object location data 230, which label object locations in the cube map images with generic class names, are used to fine-tune the weights of a neural network 240 to produce object localization model 250. Neural network 240 may comprise a convolutional neural network (CNN), such as a deep-learning CNN, which is trained on the labeled cube map images.
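A hedged sketch of this fine-tuning step follows, using a COCO-pretrained Faster R-CNN from the torchvision library (version 0.13 or later API assumed) as one plausible deep-learning CNN. The architecture, optimizer, and data-loader format are illustrative choices, not requirements of object localization model 250.

```python
# A minimal sketch of fine-tuning neural network 240 into object localization model 250.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_localization_model(num_generic_classes: int):
    # Start from COCO-pretrained weights and replace the box-prediction head.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # +1 accounts for the implicit background class used by torchvision detectors.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_generic_classes + 1)
    return model

def train_one_epoch(model, data_loader, optimizer, device="cuda"):
    model.to(device).train()
    for images, targets in data_loader:  # targets: dicts with "boxes" and "labels"
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)   # detection losses in training mode
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```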
[58] The training dataset, comprising 360-degree video data 210, may be uploaded by a user, via a web-accessible interface, over one or more networks, including the Internet. In this case, the system 100 that performs training 200 to generate object localization model 250 may be a network-based (e.g., cloud-based) server. For example, cube map conversion 220 may be performed by the server, and object location data 230 may be generated via the web-accessible interface provided by the server. Once object localization model 250 has been generated, it may be downloaded to and operated on mobile devices comprising different systems 100 to perform the AR/VR in real time. In other words, the system 100 that trains object localization model 250 may be different from the system 100 that operates object localization model 250.
[59] 2.3. Training Phase for Object Recognition Model
[60] FIG. 3 illustrates a flowchart of an example process for training an object recognition model for an AR/VR system, according to an embodiment. The process comprises a process 300 of training a neural network to generate an object recognition model 350, which accepts an image or video frame from a video as input, and predicts the specific class name of one or more objects in the image frame as output.
[61] In training process 300, one or more systems 100 may be used to process video data 310 in conjunction with ground-truth data 330, and execute a series of mathematical computations to generate object recognition model 350. Video data 310 may comprise a series of video frames that illustrate consecutive object movement relative to camera 175. Video data 310 may be acquired by a manual procedure, in which a video taker moves camera 175 around a single object to ensure that the object appears in each video frame.
[62] Video frame processing 320 is executed on video data 310 to select effective video frames to be used to train a neural network toward accurate object recognition. Video frame processing 320 may extract a collection of video frames that illustrate the object from different viewing angles. To ensure that the extracted video frames cover different viewing angles with little redundancy, the variation of an object’s movements may be measured by one or more motion estimation algorithms, such as a block matching algorithm (BMA), hierarchical motion estimation, sub-pixel motion estimation, differential pulse code modulation (DPCM), or optical flow.
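As one illustrative realization of video frame processing 320, the sketch below keeps a frame only when the accumulated dense optical-flow magnitude since the last kept frame exceeds a threshold, so the retained views of the object differ meaningfully. The Farneback flow estimator and the threshold value are assumptions, not requirements.

```python
# A minimal sketch of motion-based frame selection, assuming OpenCV is available.
import cv2
import numpy as np

def select_training_frames(video_path: str, motion_threshold: float = 20.0):
    cap = cv2.VideoCapture(video_path)
    kept, prev_gray, accumulated = [], None, 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None:
            kept.append(frame)                     # always keep the first frame
        else:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            accumulated += float(np.mean(np.linalg.norm(flow, axis=2)))
            if accumulated >= motion_threshold:    # enough viewpoint change since last keep
                kept.append(frame)
                accumulated = 0.0
        prev_gray = gray
    cap.release()
    return kept
```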
[63] Ground-truth data 330 may comprise one or more class names that correspond to the objects appearing in video data 310. The video frames in video data 310 that are output from video frame processing 320 may be manually labeled with the class names in ground-truth data 330. In other words, class names are assigned to each of the objects in the processed video frames. The number of class names (categories) in ground-truth data 330 represents the number of objects that can be recognized by object recognition model 350.
[64] Neural network 340 may be trained by the video frames output by video frame processing 320 and labeled with ground-truth data 330. Neural network 340 may be a CNN, such as a deep-learning CNN. The weights of neural network 340 are fine-tuned to provide an object recognition model 350 that is suitable to recognize objects in a video frame and output the corresponding specific class name of the object.
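The following sketch illustrates one plausible form of training process 300, fine-tuning a classification CNN so that its final layer predicts one specific class name per object in ground-truth data 330. The ResNet-18 backbone, optimizer, and hyperparameters are illustrative assumptions.

```python
# A minimal sketch of training object recognition model 350.
import torch
import torch.nn as nn
import torchvision

def build_recognition_model(num_specific_classes: int):
    # Start from ImageNet-pretrained weights (torchvision >= 0.13 API assumed).
    model = torchvision.models.resnet18(weights="DEFAULT")
    model.fc = nn.Linear(model.fc.in_features, num_specific_classes)
    return model

def train_recognition(model, data_loader, epochs: int = 10, device="cuda"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for frames, labels in data_loader:   # labels: indices of specific class names
            frames, labels = frames.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
```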
[65] The training dataset, comprising video data 310 and/or ground-truth data 330, may be uploaded by a user, via a web-accessible interface, over one or more networks, including the Internet. In this case, the system 100 that performs training 300 to generate object recognition model 350 may be a network-based (e.g., cloud-based) server. Once object recognition model 350 has been generated, it may be downloaded to and operated on mobile devices comprising different systems 100 to perform the AR/VR in real time. In other words, the system 100 that trains object recognition model 350 may be different from the system 100 that operates object recognition model 350.
[66] 2.4. Operation Phase
[67] FIG. 4 illustrates a function diagram for an example of automated content creation in an AR/VR system, according to an embodiment. One or more of the functions illustrated in FIG. 4 may be executed by processor(s) 110 of system 100. System 100 may be a VR system comprising a 360-degree video camera 175 that captures 360-degree video of the surrounding environment in real time. The captured 360-degree video comprises or is converted into 360-degree video data 410 comprising a plurality of 360-degree video frames that can be encoded and/or decoded by system 100 and/or other processing devices.
[68] The 360-degree video frames of 360-degree video data 410 are converted to mapped video frame data 420. To construct mapped video frame data 420, the 360-degree video frames are converted to one or more cube map images by cube map conversion subprocess 422. Cube map conversion subprocess 422 may be similar or identical to cube map conversion subprocess 220. Thus, any description of cube map conversion subprocess 220 can equally apply to cube map conversion subprocess 422, and vice versa. The cube map images, output by cube map conversion subprocess 422, are sorted in a temporal domain with respect to their corresponding 360-degree video frames, such that a current frame 424 and a next frame 426 can be considered. Current frame 424 and next frame 426 are two adjacent 360-degree video frames that have the same characteristics as the 360-degree video frames of 360-degree video data 210. It should be understood that next frame 426 is a 360-degree video frame that is immediately subsequent in the temporal domain to current frame 424.
[69] The cube map images are provided to object localization model 430. Object localization model 430 may be similar or identical to object localization model 250. Thus, any description of object localization model 250 can equally apply to object localization model 430, and vice versa. Current frame 424 is provided to object localization model 430 in order to determine the locations of objects on which object localization model 430 was trained. In an embodiment, object localization model 430 comprises a neural network (e.g., CNN, such as a deep-learning CNN) that is loaded with weights 432 that were obtained during a training phase (e.g., comprising training process 200). The output of object localization model 430 comprises the predicted locations of objects located in the cube map image(s) of current frame 424. Additionally, object localization model 430 may also determine the generic class names of objects in the cube map image(s) of current frame 424, in which case the output of object localization model 430 further comprises the generic class names of objects located in the cube map image(s) of current frame 424.
[70] If no object is localized in current frame 424 (i.e., “No” in subprocess 434), next frame 426 is set as current frame 424 and provided as input to object localization model 430. On the other hand, if at least one object is localized in current frame 424 (i.e., “Yes” in subprocess 434), each localized object (e.g., extracted from the cube map image(s) of current frame 424) is provided as input to object recognition model 440 to determine a specific class name of the localized object prior to searching for a corresponding media asset.
[71] In particular, each localized object, predicted by object localization model 430, is extracted from the cube map image(s) of current frame 424 (e.g., via cropping) into one or more image patches. The image patch(es) are used as the input to object recognition model 440 to realize object recognition. In an embodiment, object recognition model 440 comprises a neural network (e.g., CNN, such as a deep-learning CNN) that is loaded with weights 442 that were obtained during a training phase. Object recognition model 440 may be similar or identical to object recognition model 350. Thus, any description of object recognition model 350 can equally apply to object recognition model 440, and vice versa.
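A minimal sketch of this hand-off between the two models is shown below: each predicted bounding box is cropped out of the cube map image and classified. The score threshold, the 224×224 patch size, and the specific_class_names mapping are illustrative assumptions.

```python
# A hedged sketch of cropping localized objects and recognizing them.
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def recognize_localized_objects(localizer, recognizer, cube_face, specific_class_names,
                                score_threshold: float = 0.5, device="cuda"):
    localizer.eval(); recognizer.eval()
    image = TF.to_tensor(cube_face).to(device)     # HWC uint8 -> CHW float tensor
    detection = localizer([image])[0]              # dict with "boxes", "labels", "scores"
    results = []
    for box, score in zip(detection["boxes"], detection["scores"]):
        if score < score_threshold:
            continue
        x1, y1, x2, y2 = [int(v) for v in box]
        if x2 <= x1 or y2 <= y1:
            continue
        patch = image[:, y1:y2, x1:x2]             # crop the localized object
        patch = TF.resize(patch, [224, 224]).unsqueeze(0)
        logits = recognizer(patch)
        results.append((box.tolist(), specific_class_names[int(logits.argmax())]))
    return results
```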
[72] If object recognition model 440 recognizes the localized object that was input (i.e., “Yes” in subprocess 444), the output of object recognition model 440 comprises the predicted specific class name of the localized object that was input. On the other hand, if object recognition model 440 cannot recognize the localized object that was input (i.e., “No” in subprocess 444), object recognition model 440 may output a special class name that indicates that no object was recognized or otherwise indicate that no object was recognized (e.g., because object recognition model 440 was not trained on the object).
[73] As an example, when current frame 424 (e.g., cube map image(s) representing a single 360-degree video frame) contains the sculpture “David,” object localization model 430 may predict the location of the sculpture in current frame 424 and predict a generic class name of “sculpture.” If object recognition model 440 has been trained to recognize “David,” object recognition model 440 may output the specific class name of “David.” On the other hand, if object recognition model 440 has not been trained to recognize “David,” object recognition model 440 will fail to recognize “David.” In this case, if there are no more localized objects output by object localization model 430 and remaining to be recognized, the current frame 424 is changed to the next frame 426 and the localization and recognition processes are repeated.
[74] If object recognition model 440 outputs a specific class name for one or more objects (i.e., “Yes” in subprocess 444), thereby indicating that at least one object has been recognized in current frame 424, one or more media assets corresponding to the one or more objects may be retrieved in subprocess 450 from media asset data 452. In particular, each specific class name that is output by object recognition model 440 may be used as a keyword or index to search media asset data 452. The search result may comprise a media asset for each specific class name representing each recognized object.
[75] Media asset data 452 may be stored in local storage of system 100 or in remote storage (e.g., in a network of processing devices that wirelessly communicate with system 100, in a remote server or cloud service, etc.). Media asset data 452 may comprise, for each of a plurality of objects which object recognition model 440 is trained to recognize, one or more visual, audio, and/or text contents associated with the object. To facilitate queries, the structure of media asset data 452 may be constructed using a common or standard query language, such as Structured Query Language (SQL), Data Mining Extensions (DMX), or Contextual Query Language (CQL), such that media assets are retrievable by specific class name. Thus, when object recognition model 440 outputs a specific class name for an object, the corresponding media asset can be quickly retrieved using the specific class name as a keyword in a search query.
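By way of example, a SQLite-based retrieval for subprocess 450 might look like the following; the table and column names are assumptions introduced only for illustration, not the structure required of media asset data 452.

```python
# An illustrative lookup of a media asset keyed by the predicted specific class name.
import sqlite3

def fetch_media_asset(db_path: str, specific_class_name: str):
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT asset_type, asset_uri FROM media_assets WHERE class_name = ?",
            (specific_class_name,),
        ).fetchone()
    return row  # e.g., ("text", "assets/mona_lisa_description.txt") or None if no asset
```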
[76] After media asset(s) are retrieved for current frame 424, the retrieved media asset(s) may be attached to the corresponding 360-degree video frame in subprocess 454 to create processed 360-degree video data 460. Processed 360-degree video data 460 may comprise a 360- degree video that is processed such that the media asset(s) can be accessed by a user of system 100 (e.g., a VR head-mounted device) through the 360-degree video. For instance, in an embodiment, attaching media assets to a 360-degree video frame comprises, for each media asset, creating a hot spot on top of or near the location, as determined by object localization model 430, of each recognized object, in the 360-degree video frame. When a user of system 100 interacts with a hot spot, the corresponding media asset can be accessed and provided to the user. For instance, visual or text content of the media asset may be displayed on display 180 at or near the location of the corresponding object, and/or audio content may be played via speakers of system 100. Thus, as used herein, the phrase “automated content creation” may refer to the user of an AR/VR system having no need to manually search corresponding media assets for an object when that object appears in a video sequence.
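One possible, purely illustrative representation of the hot-spot attachment performed in subprocess 454 is sketched below; the field names are assumptions chosen for clarity.

```python
# A minimal sketch: a processed frame carries hot spots tying object locations to assets.
def attach_hot_spots(frame_id, detections, asset_lookup):
    """detections: [(bbox, specific_class_name), ...]; asset_lookup: name -> asset."""
    hot_spots = []
    for bbox, class_name in detections:
        asset = asset_lookup.get(class_name)
        if asset is None:
            continue                          # object recognized but no asset available
        hot_spots.append({
            "bbox": bbox,                     # where to anchor the hot spot in the frame
            "class_name": class_name,
            "asset": asset,                   # content provided when the user interacts
        })
    return {"frame_id": frame_id, "hot_spots": hot_spots}
```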
[77] FIG. 5 illustrates a function diagram for an example of automated content creation in an AR/VR system, according to an alternative embodiment. One or more of the functions illustrated in FIG. 5 may be executed by processor(s) 110 of system 100. System 100 may be an AR system or a personal device capable of performing AR (e.g., a smart phone) with an integrated camera 175 that is configured to capture video and display processed video on a display 180. System 100 may use camera 175 to capture or record video in real time in a real-world environment. The recorded video is converted to video data 510 that can be encoded and/or decoded by system 100 and/or other processing devices.
[78] The decoded video data 510 comprises a plurality of video frames that, when viewed as a video sequence, visualize a temporally consecutive action. These video frames may be processed as video frame data 520, in which current frame 524 refers to a single video frame that is captured and/or considered in a present moment, and next frame 526 refers to a single video frame that is captured and/or considered immediately subsequent to current frame 524.
[79] To realize object recognition, current frame 524 is input to object recognition model 440 to predict the specific class names of one or more objects in current frame 524. It should be understood that elements 440-460 in FIG. 5 may be similar or identical to the respective same-numbered elements in FIG. 4. Thus, all descriptions of these elements with respect to FIG. 4 apply equally to these elements with respect to FIG. 5. Accordingly, these elements will not be redundantly described except to describe possible alternative implementations. It should be understood that object recognition model 440 may be applied to two-dimensional video frames in the same or similar manner as it is applied to cropped object images from cube map images.
[80] In an embodiment, the attachment of media assets in subprocess 454 in an AR system or mobile device (e.g., smart phone) may differ from subprocess 454 in a VR system. For example, in an alternative embodiment, instead of incorporating hot spots into processed video data 460, content from a media asset associated with each object in video data 460 may be shown directly in display 180 of system 100.
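As an illustrative sketch of this alternative, retrieved text content could be drawn directly onto the displayed frame, for example with OpenCV; the drawing style and parameters are assumptions.

```python
# A minimal sketch of overlaying retrieved asset text on a recognized object for AR display.
import cv2

def overlay_asset_text(frame, bbox, text):
    x1, y1, x2, y2 = [int(v) for v in bbox]
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)       # outline the object
    cv2.putText(frame, text, (x1, max(y1 - 10, 15)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)     # asset text above the box
    return frame
```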
[81] The automated content creation of these disclosed embodiments may be suitable for real-time processing. In other words, the entire process is sufficiently fast that the corresponding media assets are immediately accessible when a user of the AR/VR system moves towards an object. For example, when the “Mona Lisa” appears in current frame 424 or 524, the user can access corresponding content of the media asset associated with the “Mona Lisa” without any substantial or noticeable delay.
[82] 3. Example Embodiments
[83] A Deep-Learning Based Method for Automated Experience Creation in Virtual Reality. A method of creating Virtual Reality (VR) experiences (which display 360-degree images or video to a user via a headset or other device to simulate immersion in the image), by using deep learning to localize objects in 360-degree video, recognize those objects, and attach relevant media assets without user intervention. This method comprises the following elements. A web-accessible user interface for uploading identifying images, used to train a neural network for recognition of specific 2D and 3D objects, and 360-degree images or videos that will form the Virtual Reality environment. A pre-processing method for generating recognition neural network training data from a single identifying image (or video, for 3D data). A pre-processing method for converting 360-degree images into six-sided ‘cube map’ images, allowing a neural network to accurately identify objects in a 360-degree image. A Convolutional Neural Network (CNN) deployed for accurate localization of objects-of-interest in a six-sided ‘cube map.’ A CNN deployed for classification of 2D and 3D objects defined in the recognition neural network’s training data. An API for fetching digital media resources based on the predictions from the two neural networks of an object’s location and identity.
[84] A method of creating Virtual Reality (VR) experiences, which display 360-degree images or videos to a user via a headset or other device to simulate immersion in the image, by using deep learning to localize objects in 360-degree video, recognize those objects, and attach relevant media assets without user intervention, comprising: a web-accessible user interface used to generate an image identifier, wherein the web-accessible interface allows a user to upload identifying images that are used to train a convolutional neural network (CNN); and the CNN trained to perform an object detection task comprising localizing and classifying user-specified objects in the 360-degree images or image frames.
[85] One or more of the methods above, wherein one output of the CNN comprises one or more bounding boxes around detected objects in the 360-degree images or image frames; another output consists of one or more feature vectors identifying 2D or 3D objects in the output bounding boxes.
[86] One or more of the methods above, further comprising an additional image processing step prior to sending a 360-degree image or video frame to the CNN for object detection, wherein the additional image processing step converts a 360-degree image or video frame into six-sided cube map images.
[87] One or more of the methods above, wherein the CNN is configured to take one single image as input and output the bounding boxes and detected objects’ names.
[88] One or more of the methods above, wherein performing complete object detection for a 360-degree image or video frame requires six CNN inference passes, one for each of the six-sided cube map images.
[89] One or more of the methods above, further comprising an additional image processing step after obtaining the prediction outcomes of the six-sided cube map images, wherein the additional image processing step uses the prediction outcomes of all six-sided cube map images to localize and classify objects in a 360-degree image or video frame.
[90] One or more of the methods above, further comprising a server that is communicative with the CNN, wherein the CNN is configured to send its prediction results to the server, and wherein the server is configured to fetch digital media resources based on the CNN’s prediction results.
[91] One or more of the methods above, wherein the digital media resources include images, text, scripts, or 3D models.
[92] One or more of the methods above, wherein the fetched media resources are overlaid on the objects detected by the CNN.
[93] A method of creating Augmented Reality (AR) experiences via deep learning for real-time recognition and resource fetching, comprising: a web-accessible user interface used to generate an image identifier, wherein the web-accessible interface allows a user to upload identifying images that are used to train a convolutional neural network (CNN); and the CNN trained to perform an image classification task comprising generating a feature vector identifying 2D or 3D objects defined in the training data. The method, further comprising the step of image augmentation to generate the CNN training data from a single image, a video, or 3D data. The method, wherein the CNN is configured to use a single image for the prediction of 2D or 3D objects. The method, further comprising a video capture device communicative with the CNN, wherein the video capture device is configured to generate the input data of the CNN from each video frame, or portions of frames, captured by the video capture device. The method, further comprising a server that is communicative with the CNN, wherein the CNN is configured to send its prediction results to the server, and wherein the server is configured to fetch digital media resources based on the CNN’s prediction results. The method, further comprising a mobile device equipped with a video capture device, wherein CNN inference is executed on the mobile device. The method, wherein the digital media resources include images, text, scripts, or 3D models. The method, wherein the CNN is deployed on a mobile device for real-time classification of 2D or 3D objects.
[95] The above description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles described herein can be applied to other embodiments without departing from the spirit or scope of the invention. Thus, it is to be understood that the description and drawings presented herein represent a presently preferred embodiment of the invention and are therefore representative of the subject matter which is broadly contemplated by the present invention. It is further understood that the scope of the present invention fully encompasses other embodiments that may become obvious to those skilled in the art and that the scope of the present invention is accordingly not limited.
[96] Combinations, described herein, such as “at least one of A, B, or C,” “one or more of A, B, or C,” “at least one of A, B, and C,” “one or more of A, B, and C,” and “A, B, C, or any combination thereof” include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C. Specifically, combinations such as “at least one of A, B, or C,” “one or more of A, B, or C,” “at least one of A, B, and C,” “one or more of A, B, and C,” and “A, B, C, or any combination thereof” may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, and any such combination may contain one or more members of its constituents A, B, and/or C. For example, a combination of A and B may comprise one A and multiple B’s, multiple A’s and one B, or multiple A’s and multiple B’s.

Claims

CLAIMS What is claimed is:
1. A method comprising using at least one hardware processor to: train an object localization model to, for each of one or more objects, locate the object in an image and classify the object with a generic class name; and train an object recognition model to, for each of the one or more objects, classify the object with a specific class name.
2. The method of Claim 1, wherein the object localization model comprises a first neural network, and wherein the object recognition model comprises a second neural network.
3. The method of Claim 2, wherein the first neural network and the second neural network each comprise a convolutional neural network.
4. The method of Claim 3, wherein the convolutional neural network is a deep learning convolutional neural network.
5. The method of Claim 1, wherein training the object localization model comprises: receiving 360-degree video data comprising a plurality of 360-degree video frames, wherein each of the plurality of 360-degree video frames contains one or more objects; and, for each of the plurality of 360-degree video frames, converting the 360-degree video frame into at least one cube map image; for each of the one or more objects contained in the 360-degree video frame, labeling the at least one cube map image with a location of the object in the at least one cube map image and a generic class name of the object.
6. A method comprising using at least one hardware processor to: receive image data; apply an object localization model to the image data to predict a location and generic class name for each of one or more objects in the image data; extract each of the one or more objects from the image data; apply an object recognition model to each of the one or more objects to predict a specific class name for the object; for each predicted class name, retrieve a media asset based on the predicted class name, and process the image data to add a hot spot to the image data at the predicted location of the object, for which the predicted class name was predicted, in the image data, wherein the hot spot is associated with the media asset; and display the processed image data on a display.
7. The method of Claim 6, wherein the image data comprises a plurality of 360-degree video frames from a 360-degree video.
8. The method of Claim 7, wherein applying the object localization model to the image data comprises, for each of the plurality of 360-degree video frames: converting the 360-degree video frame into at least one cube map image; and applying the object recognition model to the at least one cube map image.
9. The method of Claim 6, further comprising using the at least one hardware processor to: detect when a hot spot is within a field of view of a user; and, in response to detecting that the hot spot is within the field of view of the user, display content of the media asset associated with the hot spot.
10. A system comprising: at least one hardware processor; and one or more software modules that are configured to, when executed by the at least one hardware processor, perform the method of any one of the preceding claims.
11. The system of Claim 10, wherein the system comprises a virtual reality head-mounted device that comprises the at least one hardware processor, the one or more software modules, and a camera.
12. A non-transitory computer-readable medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to perform the method of any one of Claims 1-9.
PCT/CA2022/050963 2021-06-16 2022-06-16 Deep-learning method for automated content creation in augmented and virtual reality WO2022261772A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163211394P 2021-06-16 2021-06-16
US63/211,394 2021-06-16

Publications (1)

Publication Number Publication Date
WO2022261772A1 true WO2022261772A1 (en) 2022-12-22

Family

ID=84526065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2022/050963 WO2022261772A1 (en) 2021-06-16 2022-06-16 Deep-learning method for automated content creation in augmented and virtual reality

Country Status (1)

Country Link
WO (1) WO2022261772A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3043352A1 (en) * 2016-11-15 2018-05-24 Magic Leap, Inc. Deep learning system for cuboid detection
EP3612985A1 (en) * 2017-04-20 2020-02-26 HRL Laboratories, LLC Machine-vision system for discriminant localization of objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22823737

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE