WO2022059858A1 - Method and system to generate 3d audio from audio-visual multimedia content - Google Patents

Method and system to generate 3D audio from audio-visual multimedia content

Info

Publication number
WO2022059858A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
visual
data
points
content
Prior art date
Application number
PCT/KR2020/018079
Other languages
French (fr)
Inventor
Sai Teja K
Kaushik Saha
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2022059858A1 publication Critical patent/WO2022059858A1/en

Classifications

    • H - ELECTRICITY
      • H04 - ELECTRIC COMMUNICATION TECHNIQUE
        • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
              • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
                • H04N 21/233 - Processing of audio elementary streams
                  • H04N 21/2335 - involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
                • H04N 21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
                  • H04N 21/23418 - involving operations for analysing video streams, e.g. detecting features or characteristics
                  • H04N 21/2343 - involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
                    • H04N 21/234318 - by decomposing into objects, e.g. MPEG-4 objects
            • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
              • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N 21/439 - Processing of audio elementary streams
            • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
              • H04N 21/81 - Monomedia components thereof
                • H04N 21/8106 - involving special audio data, e.g. different tracks for different languages
                • H04N 21/816 - involving special video data, e.g. 3D video
        • H04S - STEREOPHONIC SYSTEMS
          • H04S 5/00 - Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
            • H04S 5/005 - of the pseudo five- or more-channel type, e.g. virtual surround
          • H04S 2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
            • H04S 2400/11 - Positioning of individual sound objects, e.g. moving airplane, within a sound field
          • H04S 2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
            • H04S 2420/07 - Synergistic effects of band splitting and sub-band processing
            • H04S 2420/11 - Application of ambisonics in stereophonic audio systems

Definitions

  • the present disclosure relates to interactive computing devices and in particular, relates to generation of audio content for a given audio-visual content.
  • State-of-the-art proprietary solutions (e.g. Dolby) provide an audio mixing interface where content creators have to record the sounds independently, and then mix/associate them with the video by properly assigning co-ordinates in space to the corresponding audio objects. More specifically, the content creator independently records individual sounds, also known as audio objects. Thereafter, as a part of associating audio with the video, the content mixers 'manually' associate each audio object to its estimated position in a scene, by assigning position co-ordinates in a 3D space using existing audio mixing criteria.
  • the mixed audio is then converted into a variety of multi-channel formats (e.g. 5.1, 7.1, etc.) specific to an audio playback system.
  • the audio received by the user is then decoded into the respective speaker configuration.
  • content creators recording a video with multi-channel sound recording in a sound lab.
  • the present subject matter refers to a method for generation of spatialized audio from an audio-visual multimedia content.
  • the method comprises receiving one or more audio-visual contents comprising one or more visual-objects and respective audio.
  • the visual objects are identified within one or more image-frames associated with the audio-visual content.
  • one or more motion-paths with spatiality are simulated from the audio-visual content.
  • An image sequence is reconstructed, such that the image sequence denotes a movement of one or more identified visual objects in accordance with said simulated motion paths.
  • Such image-sequence is associated with a time-interval associated with the at least one audio visual content and denotes a first data.
  • Audio from one or more audio-visual contents is combined to result in a mixed mono audio, thereby defining a second data.
  • One or more points in 3D space are defined as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths.
  • a 3D audio is reconstructed for the one or more visual objects as a third data.
  • Figure 1 illustrates method-steps in accordance with an embodiment of the present subject matter
  • Figure 3(3a and 3b) illustrates an example multimedia content required for synthesis of dataset, in accordance with an embodiment of the present subject matter
  • Figure 4(4a and 4b) illustrates another example multimedia content required for synthesis of dataset, in accordance with an embodiment of the subject matter
  • Figure 5 illustrates an example schematic diagram depicting a sub-process in accordance with an embodiment of the present subject matter
  • Figure 6(6a and 6b) illustrates an example image sequence depicting a sub-process in accordance with an embodiment of the present subject matter
  • Figure 7 illustrates another example schematic diagram depicting a sub-process in accordance with an embodiment of the present subject matter
  • Figure 8 illustrates an example system implementation of the process of Fig. 1, in accordance with an embodiment of the subject matter
  • Figure 9 illustrates method-steps in accordance with another embodiment of the present subject matter.
  • FIG. 10 illustrates an example sub-process in accordance with an embodiment of the present subject matter
  • Figure 14 illustrates an example scenario in accordance with an embodiment of the present subject matter
  • Figure 15 illustrates another example scenario in accordance with an embodiment of the present subject matter
  • Figure 17 illustrates another example scenario in accordance with an embodiment of the present subject matter
  • Figure 18(18a and 18b) illustrates another example scenario in accordance with an embodiment of the present subject matter
  • Figure 19 illustrates another system architecture implementing various modules and sub-modules in accordance with the implementation depicted in Fig. 8, Fig. 11 and Fig. 13;
  • Figure 20 illustrates a computing-device based implementation in accordance with an embodiment of the present subject matter.
  • any terms used herein such as but not limited to “includes,” “comprises,” “has,” “consists,” and grammatical variants thereof do NOT specify an exact limitation or restriction and certainly do NOT exclude the possible addition of one or more features or elements, unless otherwise stated, and furthermore must NOT be taken to exclude the possible removal of one or more of the listed features and elements, unless otherwise stated with the limiting language “MUST comprise” or “NEEDS TO include.”
  • FIG. 1 illustrates method-steps in accordance with an embodiment of the present subject matter.
  • the present subject matter refers to a method for the generation of spatialized audio from an audio-visual multimedia content.
  • the method comprises receiving (step 102) one or more audio-visual contents comprising one or more visual-objects and respective audio.
  • receiving the audio-visual content comprises receiving audio-visual content defined by one or more of: 2D video and mono audio, 2D video and stereo audio, 360 degree video and mono audio, and 360 degree video and stereo audio.
  • the method comprises identifying (step 104) said visual-objects within one or more image-frames associated with the audio-visual content.
  • the identifying of said visual objects comprises determining a region of interest (ROI) within each frame associated with the audio visual content, and generating an inverted representation of the ROI to remove background and thereby identify the visual objects.
  • ROI region of interest
  • the method comprises simulating (step 106) one or more motion-paths with spatiality from the audio-visual content.
  • the simulating of one or more motion-paths comprises simulating the motion exhibited within the audio-visual content based on a plurality of parameters comprising one or more of a variable pace, a variable step size, occlusion, a 3D environment boundary, and frames per second (fps).
  • the simulation of the motion comprises identifying a number of data points P for calculation. Thereafter, a subset of points F within said identified points P is calculated in 3D space. Another subset F' within said identified points P is calculated based on interpolating amongst the subset of points F. Finally, the calculation of the points P is concluded by iteratively interpolating amongst the already-calculated points within the subset F' to obtain the remaining points.
  • the method comprises reconstructing (step 108) an image sequence denoting a movement of one or more identified visual objects in accordance with said simulated motion paths.
  • Such sequence is associated with a time-interval associated with the at least one audio visual content and wherein said reconstructed image sequence denotes a first data.
  • the reconstructing of the image sequence comprises generating at least one frame of the image-sequence by orienting each of the identified visual objects against a plain background in accordance with a sub-time interval within the time-interval associated with the at least one audio visual content. Accordingly, a new scene with combined moving objects is created as the first data.
  • the region of interest is received for each frame within the reconstructed image sequence, wherein each frame in the reconstructed image sequence corresponds to a time instant within the time-interval of the reconstructed image sequence.
  • Each received ROI for each frame is combined to generate a combined ROI (i.e. a combined binary mask) for the reconstructed image sequence.
  • the combined ROI along with the reconstructed image sequence is used as the first data for extracting temporal features during training of the model.
  • the method comprises combining (step 110) audio from one or more audio visual contents to result in a mixed mono audio defining a second data.
  • the method comprises determining (step 112) one or more points in 3D space as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths.
  • Such determination comprises interpolating data among a plurality of image frames comprising the identified visual objects to result in an upgraded set of image-frames.
  • the upgraded set of image frames is mapped to a spherical configuration.
  • the simulated motion-paths are scaled to correspond to said spherical configuration.
  • a plurality of points corresponding to the scaled motion path are ascertained as the positions for the audio in 3D space.
  • the method comprises reconstructing (step 114) 3D audio for the one or more visual objects as a third data.
  • the 3D audio is reconstructed based on the audio within the audio- visual content and the determined positions in the 3D space.
  • the reconstructing of the 3D audio comprises generating 3D audio with respect to each visual object based on the determined positions in the 3D space and audio in the audio visual content.
  • the reconstructed 3D audio is mixed channel-wise. Background sound is added within the 3D audio at a position assumed to be behind a prospective listener.
  • the 3D audio corresponds to surround audio constructed in accordance with the position of all the objects.
  • Figure 2 illustrates an example implementation implementing the method steps of Fig. 1.
  • Fig. 2 refers to a synthetic dataset pipeline and involves generating a synthetic dataset, which facilitates use of the spatial data of visual objects as well as the audio to train a deep learning model. Based thereupon, a synthetic scene is generated by combining multiple videos. The new scene and the audios generated share the same spatial arrangements, thus providing a direct audio-visual association for 3D audio and 2D video, which can be further used for training the model.
  • any two sample videos with single audio source are received from an external source.
  • the videos containing different kinds of visual objects are downloaded and may be saved categorically based on the class.
  • the videos depicted in Fig. 3 represent the first set of videos for the class "Person" and the video represented in Fig. 4 represents a second set of videos for the class "Vehicles".
  • Such videos correspond to image frames at any frames per second (fps) 'F' that are extracted and saved as intermediate outputs.
  • Corresponding audio is also extracted and saved as intermediate outputs. Thereafter, the corresponding frames and audio are sent as input to next step 204.
  • an image mask processing is performed: this step involves visual mask extraction. Any standard image segmentation technique can be used to extract the binary mask of the target class. State-of-the-art mask extraction techniques, while segmenting most of the image, predict inaccurate mask boundaries, which results in poor segmentation. At least to avoid the above challenge, a mask extension technique is used to expand the predicted mask boundaries and to avoid data loss in regions of interest.
  • the component iteratively includes "d" additional neighboring unmasked pixels of the mask boundary in the final mask.
  • the frame is then cropped to the bounding box of the mask.
  • the binary mask is then inverted to get the background mask, which is used to remove the background.
  • the final image is then reshaped to a standard size H x W. Such actions are applied sequentially on all the frames of all the videos to be combined.
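  • By way of a non-limiting illustration of step 204, the following Python sketch shows one way to realize the mask extension, cropping, background removal and reshaping described above, assuming OpenCV and NumPy and a binary mask already produced by any standard segmentation model; the function names, the extension width d and the target size are illustrative assumptions rather than the claimed implementation.

```python
import cv2
import numpy as np

def extend_mask(mask: np.ndarray, d: int) -> np.ndarray:
    """Grow the predicted binary mask by d neighboring pixels (mask extension)."""
    kernel = np.ones((2 * d + 1, 2 * d + 1), np.uint8)
    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1)

def extract_visual_object(frame: np.ndarray, mask: np.ndarray,
                          d: int = 5, size=(224, 224)) -> np.ndarray:
    """Crop the frame to the (extended) mask, remove the background, reshape to H x W."""
    mask = extend_mask(mask, d)                 # expand inaccurate mask boundaries
    ys, xs = np.where(mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = frame[y0:y1, x0:x1].copy()           # crop to the bounding box of the mask
    background = (mask[y0:y1, x0:x1] == 0)      # inverted mask = background mask
    crop[background] = 0                        # remove the background
    return cv2.resize(crop, size)               # reshape to a standard size H x W
```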
  • step 206 has been illustrated subsequent to Fig. 3 and Fig. 4
  • Figure 3 illustrates an example multimedia content required for the synthesis of dataset, in accordance with an embodiment of the present subject matter.
  • Fig. 3a represents the image frames of the first set of video while
  • Fig. 3b represents the region of interest or the visual objects in Fig. 3a.
  • the ROIs in Fig. 3b are obtained based on subjecting the frames of Fig. 3a to image mask-processing of step 204.
  • Fig. 4a represents the image frames of the second set of videos while Fig. 4b represents the region of interest (i.e. visual objects present in Fig. 4a) based on subjecting the frames of Fig. 4a to image-mask processing of step 204.
  • the method in Fig. 2 further comprises step 206 that corresponds to step 106 of Fig. 1.
  • random motion paths are simulated.
  • the present step establishes and provides the data for spatializing the video and audio. Based on the number of videos selected for combining in step 202, a corresponding random path is generated.
  • various parameters like variable pace, variable step size, occlusion, 3D environment boundaries, and fps are introduced, to make the motion more realistic.
  • step 208 has been illustrated after Fig. 5.
  • Fig. 5 represents a twin-stage operation with respect to step 206 to generate a random 3D path.
  • the total number of data points "P" to be calculated are identified.
  • a fraction F of the P points is calculated using a random walk in 3 dimensions independently, confined within the standard image dimensions H x W along with defined Z limits. While constructing the above 3D walk, a random step size within a threshold is used to control the variable step length, which introduces the pace of movement. Calculating all the data points using the random walk may be avoided to minimize randomness in movement, which is highly unlikely to be exhibited by any moving object in an actual video. Occlusion is known to be caused when a visual object leaves and enters the scene, or when it is overlapped by a different object.
  • the former case of occlusion is handled by introducing a deviation factor D, which is an additional boundary that a random-walk point can move into.
  • the 3D walk also uses an additional parameter to freeze the motion by reusing the previous point for the next K points.
  • the freeze time, i.e. when it occurs and for how long, is also decided randomly.
  • the remaining data points are actually used to control smooth and consistent movement in a particular direction for a time interval.
  • the calculation of the remaining points is done as follows: a fraction F' of the remaining points is calculated by interpolating data between each of the F points, which were calculated using the random walk in the first sub-step. During each interpolation, a random smoothness factor is used. Accordingly, the present second sub-step is iteratively applied until all P points are calculated.
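  • A minimal Python sketch of the two sub-steps of Fig. 5 is given below, assuming NumPy; the parameter names (frac, max_step, deviation, freeze_prob, max_freeze) and their default values are hypothetical knobs standing in for the variable pace, step size, deviation factor D and freeze parameter K described above.

```python
import numpy as np

def simulate_motion_path(P, H, W, z_limits=(1.0, 10.0), frac=0.1,
                         max_step=20.0, deviation=50.0,
                         freeze_prob=0.1, max_freeze=5):
    """Simulate a 3D motion path: a sparse random walk gives the subset F of points,
    and the remaining points are filled in by iterative interpolation (subset F')."""
    n_anchor = max(2, int(frac * P))
    lo = np.array([-deviation, -deviation, z_limits[0]])      # deviation D lets the object leave the scene
    hi = np.array([W + deviation, H + deviation, z_limits[1]])
    pts = [np.array([W / 2.0, H / 2.0, float(np.mean(z_limits))])]
    while len(pts) < n_anchor:
        if np.random.rand() < freeze_prob:                    # freeze the motion for the next K points
            pts += [pts[-1]] * np.random.randint(1, max_freeze + 1)
        else:
            step = np.random.uniform(-max_step, max_step, 3)  # variable step size controls the pace
            pts.append(np.clip(pts[-1] + step, lo, hi))       # confined to H x W and the Z limits
    path = np.array(pts[:n_anchor])

    # Iteratively interpolate between the already-calculated points until all P points exist.
    while len(path) < P:
        t_old = np.linspace(0.0, 1.0, len(path))
        t_new = np.linspace(0.0, 1.0, min(P, 2 * len(path)))
        # A random smoothness factor could additionally low-pass filter each pass; omitted here.
        path = np.stack([np.interp(t_new, t_old, path[:, k]) for k in range(3)], axis=1)
    return path[:P]
```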
  • step 206, vide the two sub-steps in Fig. 5, renders a way to introduce motion and associated spatiality that can be used for both video and audio.
  • the method in Fig. 2 further comprises Step 208 that corresponds to step 108 of Fig. 1.
  • 3D to 2D Path Projection is implemented:
  • the points or paths that were calculated in step 206 are in three dimensions.
  • the 3D paths are projected and translated to the necessary properties of the corresponding visual mask or ROI, namely the size of the object when it moves in the Z direction, and the horizontal and vertical position of the visual object's center when it moves in the X and Y directions.
  • an inverse 'tangent function' may be used to translate the Z dimension to the size of the image. This produces the depth perception for the visual object as it moves away from or towards the viewer.
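  • One possible way to perform this projection in Python, assuming NumPy and the path produced by the simulation sketch above, is given here; the exact depth-to-size mapping (the 0.3 to 1.0 scale range and the arctangent normalization) is an illustrative assumption.

```python
import numpy as np

def project_path_to_2d(path_3d, base_size, z_limits=(1.0, 10.0)):
    """Project a 3D path onto 2D mask properties: X/Y give the visual object's center,
    and Z is mapped to an apparent size via an inverse tangent, producing depth
    perception as the object moves away from or towards the viewer."""
    x, y, z = path_3d[:, 0], path_3d[:, 1], path_3d[:, 2]
    z_norm = (z - z_limits[0]) / (z_limits[1] - z_limits[0])    # 0 = nearest, 1 = farthest
    scale = 1.0 - np.arctan(z_norm * np.pi) / np.arctan(np.pi)  # saturating fall-off with depth
    sizes = np.clip(base_size * (0.3 + 0.7 * scale), 1, None).astype(int)
    return x, y, sizes                                          # per-frame center and mask size
```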
  • in step 210 of Fig. 2, which corresponds to step 108 of Fig. 1, multi-object scene reconstruction takes place to render a reconstructed image sequence based on combining the videos of Fig. 3 and Fig. 4.
  • the corresponding visual object is pasted onto a white or any other background image, which acts as a blank slate. The same is repeated for all the visual objects corresponding to their time frame. Visual objects whose path moves outside the background image dimensions are ignored, to create the occlusion.
  • the second scenario of occlusion is handled by placing the closest visual object in terms of Z values, on the top of other visual objects, when their paths cross.
  • a combined binary mask for all corresponding visual objects for each time frame is calculated.
  • the masks for T time frames of the combined scene are converted into a single binary mask, by taking the max value from each pixel across the binary frames.
  • the combined binary mask thus obtained is a region of interest that the visual objects tend to cover in the videos of Fig. 3 and Fig. 4.
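  • The combined ROI can be obtained, for example, with a pixel-wise maximum over the per-frame binary masks, as in the following NumPy sketch (the function name is illustrative).

```python
import numpy as np

def combine_masks(masks_per_frame):
    """Collapse the per-frame binary masks of the reconstructed scene into a single
    combined ROI by taking the maximum value of each pixel across the T frames."""
    stacked = np.stack(masks_per_frame, axis=0)   # shape: T x H x W
    return stacked.max(axis=0)                    # shape: H x W, the combined binary mask
```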
  • step 212 has been illustrated subsequent to Fig. 6.
  • Figure 6 illustrates an example image sequence depicting a sub-process in accordance with an embodiment of the present subject matter.
  • Fig. 6 refers to an example synthetic scene after random-path generation with an fps of 1 and a time interval of 6 seconds, wherein image frames are depicted corresponding to time slots 0 to 5 within that interval.
  • Fig. 6b represents an example combined mask or the combined ROI as obtained by example operation of step 210.
  • Fig. 6b refers to a binary mask or region of interest corresponding to the reconstructed image sequence in Fig. 6a.
  • the method in Fig. 2 further comprises step 212, which corresponds to step 112 of Fig. 1 and refers to audio spatialization.
  • the random paths generated in step 206 include data sufficient for the number of frames in a video. Based on human visual perceptiveness, a certain number of video frames is enough to produce a motion in an object.
  • the approach for audio is unlike video, since the auditory perceptiveness of humans is high compared to the visual perceptiveness. Accordingly, the sampling rate used for audio is higher than the fps.
  • the further explanation in Fig. 7 refers to the generation of the positions for the audio using the motion points of the video.
  • step 214 has been illustrated subsequent to Fig. 7.
  • Figure 7 illustrates normalizing visual correspondence from motion-path of visual objects to audio in accordance with an embodiment of the present subject matter.
  • the data between each frame that is required for audio is interpolated using an interpolation function and a smoothness factor, to reduce abrupt changes in audio levels.
  • the motion vectors are scaled within a unit sphere in accordance with 3D audio specifications such as an Ambisonic specification.
  • the present step refers to mapping positions from a rectangular image frame to a unit Omni-sphere.
  • the forthcoming paragraph refers to scaling for converting and normalizing the motion vectors into a unit Omni-sphere.
  • in step 704, the values circumscribing the unit sphere are confined to the unit sphere.
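  • A hedged Python sketch of this interpolation and unit-sphere normalization, assuming NumPy and image-frame coordinates measured from the frame corner, is shown below; the helper name and the choice of mapping the frame center to the origin are assumptions for illustration.

```python
import numpy as np

def audio_positions_from_path(path_3d, n_audio_frames, H, W, z_max):
    """Interpolate the per-video-frame path up to the audio resolution and scale the
    motion vectors into a unit sphere (Omni-sphere) for use by an Ambisonic encoder."""
    t_old = np.linspace(0.0, 1.0, len(path_3d))
    t_new = np.linspace(0.0, 1.0, n_audio_frames)
    dense = np.stack([np.interp(t_new, t_old, path_3d[:, k]) for k in range(3)], axis=1)

    # Map image-frame coordinates to [-1, 1] per axis (center of the frame = origin).
    half = np.array([W / 2.0, H / 2.0, z_max / 2.0])
    unit = (dense - half) / half

    # Values circumscribing the unit sphere are confined back onto the sphere surface.
    norms = np.linalg.norm(unit, axis=1, keepdims=True)
    return np.where(norms > 1.0, unit / norms, unit)
```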
  • the method in Fig. 2 further comprises step 214, which corresponds to step 110 of Fig. 1.
  • Step 214 represents a procedure of "Mono Audio Mixing".
  • the corresponding audio STFTs or Fourier spectra are used to mix the audio in the frequency domain.
  • the mixed audio acts as the second data required for training.
  • Additional background audio is also added for the portions of the scene that do not have any visual objects, to simulate background sound. Accordingly, an ML model undergoing training is guided on how to place audio for which visual data or a visual object is not available.
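  • As a sketch of this mono mixing step, the per-object tracks (and optional background audio) can be summed in the frequency domain via their STFTs, for example with librosa as below; the FFT parameters and function name are illustrative assumptions.

```python
import numpy as np
import librosa

def mix_mono_audio(audio_tracks, background=None, n_fft=1022, hop=256):
    """Mix per-object mono tracks in the frequency domain by summing their STFTs;
    optional background audio is added for portions of the scene without visual objects."""
    length = min(len(a) for a in audio_tracks)
    stfts = [librosa.stft(a[:length], n_fft=n_fft, hop_length=hop) for a in audio_tracks]
    mixed = np.sum(stfts, axis=0)
    if background is not None:
        mixed = mixed + librosa.stft(background[:length], n_fft=n_fft, hop_length=hop)
    return mixed    # complex STFT of the mixed mono audio (the second data)
```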
  • Step 216 corresponds to step 114 and accordingly refers to 3D sound encoding into a desired specification, followed by audio mixing to achieve the third type of data:
  • the audio of each visual object is first spatialized using a standard encoding technique such as Ambisonic encoding.
  • the above step is repeated for all the audios to be mixed.
  • the encoded audios are then mixed channel-wise, in the frequency domain, by summing all the corresponding audio channels.
  • the same background audio added in step 214 is placed in the 3D audio as if it were behind the listener, by choosing the farthest point on the axis behind the listener, who is assumed to be at the center of the sphere.
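  • The following is a minimal first-order Ambisonic sketch of step 216, assuming NumPy, positions on the unit sphere from the scaling step above, and the ambiX (ACN/SN3D) channel convention; a production encoder would follow the chosen Ambisonic specification exactly, and all names here are illustrative.

```python
import numpy as np

def encode_first_order_ambisonic(mono, positions):
    """Encode a mono signal into first-order Ambisonics (ambiX order W, Y, Z, X) using a
    per-sample position on the unit sphere; +X is taken as the listener's front."""
    reps = int(np.ceil(len(mono) / len(positions)))
    pos = np.repeat(positions, reps, axis=0)[:len(mono)]
    az = np.arctan2(pos[:, 1], pos[:, 0])             # azimuth of the object
    el = np.arcsin(np.clip(pos[:, 2], -1.0, 1.0))     # elevation of the object
    return np.stack([mono,                            # W
                     mono * np.sin(az) * np.cos(el),  # Y
                     mono * np.sin(el),               # Z
                     mono * np.cos(az) * np.cos(el)   # X
                     ], axis=0)

def mix_ambisonics(encoded_objects, background=None):
    """Mix the encoded objects channel-wise; background audio (same length as the
    objects) is placed at the farthest point on the axis behind the listener."""
    mixed = np.sum(encoded_objects, axis=0)
    if background is not None:
        behind = np.array([[-1.0, 0.0, 0.0]])         # directly behind a centered listener
        mixed = mixed + encode_first_order_ambisonic(background, behind)
    return mixed
```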
  • Fig. 8 refers to a Mono-to-3D-Audio server 800. This component corresponds to the synthetic scene and audio reconstruction pipeline in accordance with Fig. 2, as a part of a processing layer 802. As a part of a hardware/hardware-interface layer 804, and in line with standard computing environment specifications, the server 800 comprises an operating system, an API, network communication, a GPU and an external interface for external device access.
  • Figure 9 illustrates processing of the data set generated from Fig. 2 for usage as a training data set for ML training purposes and thereby refers to a training data set generator 900.
  • a machine-learning (ML) model or a deep learning model 906 is trained based on a training data set defined by one or more of said first, second and third data to enable the generation of the 3D audio from another audio-visual multimedia content based on the ML model.
  • the training data set 904 may comprise the elements as referred to in the forthcoming paragraph.
  • a dense optical flow may be estimated with respect to the videos utilized in Fig. 3 and Fig. 4 based on using existing standard techniques.
  • the dense optical flow refers to the motion of pixels across the series of frames.
  • optical flow data augmentation for frames may be performed to change brightness, contrast, hue etc.
  • Another example data set refers to the reconstructed image sequence and combined binary mask from step 210, which also defines the first type of data in accordance with Fig. 1.
  • Another example data set refers to a Fourier spectrum or a Short-Time Fourier Transform (STFT) of the mixed mono audio obtained from step 214, which defines the second type of data in accordance with Fig. 1.
  • STFT Short Term Fourier transform
  • the third type of data from Fig. 1 or the reconstructed 3D audio from step 216 is utilized to calculate a target ROI or a target audio mask.
  • Such generation comprises determining a region of interest (ROI) associated with the 3D audio based on a Fourier spectrum (FT) value of one or more channels within the 3D audio and a Fourier spectrum value of the mixed mono audio defined as the second data.
  • ROI region of interest
  • FT Fourier spectrum
  • each channel of the Ambisonic audio other than a base channel may be used to calculate an audio mask, by dividing the STFT values of each Ambisonic channel of the 3D audio of step 216 by the mixed mono audio STFT from step 214.
  • the resulting ratio masks are provided to the model, as target data set 904 to draw prediction during training.
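  • A sketch of this target-mask computation, assuming librosa/NumPy and the Ambisonic channel layout used earlier, might look as follows; the epsilon guard and FFT parameters are assumptions.

```python
import numpy as np
import librosa

def target_ratio_masks(ambisonic_audio, mixed_mono_stft, n_fft=1022, hop=256, eps=1e-8):
    """Compute target audio masks: the STFT of every Ambisonic channel other than the
    base (W) channel divided by the STFT of the mixed mono audio."""
    masks = []
    for channel in ambisonic_audio[1:]:                    # skip the base channel W
        ch_stft = librosa.stft(channel, n_fft=n_fft, hop_length=hop)
        masks.append(ch_stft / (mixed_mono_stft + eps))    # complex ratio mask
    return np.stack(masks, axis=0)                         # (channels - 1) x F x T
```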
  • Table 1 lists the elements as the overall training data set.
  • Figure 10 illustrates the training of a machine-learning (ML) model in accordance with an embodiment of the present subject matter.
  • a first set of temporal features is extracted from the reconstructed image sequence defined as the first data.
  • the module extracts the overall visual features of an entire scene.
  • the input frames passed to this module are of the shape T x C x H x W, where T is total frames in the scene, C is number of image channels, H is the height of image, W is the width.
  • the input is reshaped to T*C x H x W, which makes the total number of image channels T*C.
  • This input is passed to a series of (Conv2D + Batch Normalization + Max Pooling) units.
  • the output of this module is a new array of shape 1 x H' x W'.
  • a second set of temporal features is extracted from the combined ROI. More specifically, the present step refers to a binary mask adaptive max pooling, wherein the binary masks of the scene are adaptively max-pooled to 1 x H' x W'. This mask is element-wise multiplied with the temporal features. The same at least reduces the background noise and guides the model to focus only on the regions where motion might occur.
  • the present step 1004 at least relies on the fact that convolution operations do not change the relative position of extracted features with respect to the original pixel positions.
  • in step 1006, the first set of temporal features of step 1002 and the second set of temporal features from step 1004 are subjected to a pooling operation to generate a third set of temporal features referred to as masked temporal features.
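  • Steps 1002 to 1006 can be sketched, for instance, as the following PyTorch module; the layer widths, depth and class name are illustrative assumptions, not the disclosed architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class TemporalFeatureExtractor(nn.Module):
    """Steps 1002-1006: reshape T x C x H x W frames to (T*C) x H x W, pass them through
    Conv2D + Batch Normalization + Max Pooling units, and element-wise multiply the
    result with the adaptively max-pooled binary mask of the scene."""
    def __init__(self, t_frames, channels=3, width=64, depth=3):
        super().__init__()
        layers, in_ch = [], t_frames * channels
        for _ in range(depth):
            layers += [nn.Conv2d(in_ch, width, kernel_size=3, padding=1),
                       nn.BatchNorm2d(width), nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            in_ch = width
        layers += [nn.Conv2d(in_ch, 1, kernel_size=1)]      # collapse to 1 x H' x W'
        self.net = nn.Sequential(*layers)

    def forward(self, frames, binary_mask):
        # frames: B x T x C x H x W, binary_mask: B x 1 x H x W (float tensor)
        b, t, c, h, w = frames.shape
        feats = self.net(frames.reshape(b, t * c, h, w))    # B x 1 x H' x W'
        mask = F.adaptive_max_pool2d(binary_mask, feats.shape[-2:])
        return feats * mask                                 # masked temporal features
```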
  • a set of optical features is extracted based on sensing optical flow across the frames within the reconstructed image sequence of step 210 defined as the first data. More specifically, as a part of dense optical flow estimation and feature extraction, the optical flow estimation is executed between adjacent frames using standard techniques. The output from dense optical flow estimation, between all adjacent frames, is (T-1) x 2 x H x W. This data is then reshaped to (T-1)*2 x H x W and passed to a series of (Conv2D + Batch Normalization + Max Pooling) units to yield optical flow features. The output of this module is a new array of shape 1 x H' x W', in line with the array shape of steps 1002 and 1004.
  • a feature map is created based on the third set of temporal-features and the set of optical-features obtained from step 1006 and 1008.
  • an ML model to be trained is provided wherein such ML model is defined by a convolutional neural network for image feature extraction.
  • the Fourier spectrum of the second data (i.e. the mixed mono audio STFT from step 214) and the feature map are processed by the ML model, and based thereupon the processed second data and the processed map are concatenated with layers of the ML model.
  • the ML model may be a 'UNET'-based fully convolutional network in line with deep-learning architectures.
  • the outputs from the step 1010 are flattened and stacked.
  • the flattened attention features are concatenated with every alternate up-conv layer of either a 5-layered or a 7-layered standard UNET.
  • the input and output channels of the up-conv layers are adjusted to accommodate the concatenated attention features.
  • an ROI for each of a plurality of channels associated with 3D audio is predicted, and based thereupon the operation of the ML model is optimized based on a comparison between the predicted ROI and the ROI forming a part of target dataset 904. More specifically, the model is trained to learn visual correspondence of audio from the 2D video to generate 3D Audio Masks.
  • the present step 1014 is directed to model predictions and loss optimization, wherein the output of the modified UNET in step 1012 is audio masks of 3D audio channels.
  • the targets for the model are the target 3D audio masks 904 earlier calculated in Fig. 9. Based on the prediction and the target data 904, the loss calculation and error backpropagation are performed to train the model or the UNET. Standard loss functions such as L1 or L2 may be used.
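  • A single optimization step under these assumptions (an L1 loss against the target masks of Fig. 9, any standard optimizer) could be sketched as below; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def training_step(model, optimizer, model_inputs, target_masks):
    """Predict 3D-audio channel masks and backpropagate an L1 loss against the targets."""
    optimizer.zero_grad()
    predicted_masks = model(*model_inputs)   # e.g. mixed mono STFT + visual attention map
    loss = F.l1_loss(predicted_masks, target_masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```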
  • Figure 11 refers to a Mono-to-3D-Audio server 1100 for training the model as referred to in Fig. 10.
  • This server also hosts the post-trained model for cloud-based result generation after training, in a scenario where a user chooses to perform a cloud-computing-enabled remote mono-to-3D-audio conversion instead of performing it on the client's device.
  • the server executes the flow of Fig. 10 for example optical flow estimation 1008, temporal feature extraction 1006, visual attention map generation 1010 based on the optical features and the temporal features and the modified UNET architecture 1012.
  • the server 1100 comprises operating system, API, network communication, GPU and an external interface for external device access.
  • Figure 12 refers to 3D audio generation on a user's data as a part of client device operation in accordance with an embodiment of the present subject matter. More specifically, the method comprises generation of 3D audio from a user-provided audio-visual multimedia content based on said ML model.
  • an audio-visual content is selected at the user end to be rendered at the user device with 3D audio effect.
  • Such audio-visual content may be different from the sample audio-visual content selected in Fig. 3 and Fig. 4 for generating the training dataset.
  • one or more of video frames and audio are extracted from the received audio-visual content.
  • the video frames are extracted at a certain frame per second (fps) along with the audio, from the video stream/file submitted by the user.
  • fps frame per second
  • if the extracted audio is in a non-mono format (e.g. in stereo format), the extracted audio is converted into mono audio.
  • a predetermined-condition is sensed.
  • the sensing of the predetermined condition comprises detecting one or more of: a contextual or scenic change with respect to the video frames within the received audio-visual content, and a buffer level exceeding a threshold. More specifically, as a part of the buffer-processing decision, the data in the buffer is sent for processing when the buffer gets full or a scene change is detected.
  • a standard scene detection technique may be used to identify a change in the scene, e.g. based on a histogram threshold or camera angle changes.
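  • One simple histogram-threshold detector of the kind referred to above could be sketched with OpenCV as follows; the bin counts and threshold are illustrative assumptions.

```python
import cv2

def is_scene_change(prev_frame, frame, threshold=0.5):
    """Flag a scene change when the correlation between the color histograms of two
    consecutive frames drops below a threshold."""
    hists = []
    for f in (prev_frame, frame):
        h = cv2.calcHist([f], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    similarity = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return similarity < threshold
```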
  • a sequence of data processing steps is activated as depicted in Fig. 10 with respect to the image frames of the audio-visual content rendered as input in step 1202 of Fig. 12.
  • the audio-visual content frames undergo the steps of capturing an optical flow across the video frames, computation of a Fourier spectrum of the mixed mono audio, and determination of a combined region of interest (ROI) with respect to the video frames in accordance with the steps 210, 212 and 214 of Fig. 2.
  • ROI region of interest
  • Such a set of temporal features is subjected to a pooling operation to generate pooled temporal features.
  • a feature map or a visual attention map is generated in line with the step 1010 based on pooled temporal features and the set of optical features.
  • a standard image segmentation technique may be used to first identify and segment the same object classes used to train the model. All the masks in a single time frame are applied on an array filled with zeros, having the same dimensions as the image. The above step is repeated on all frames received for processing. A combined binary mask or combined ROI is then calculated by taking the maximum value, pixel-wise. The steps 1002 to 1008 are then performed over the binary mask, the extracted temporal features, and the optical features to obtain the visual-attention features in line with step 1010.
  • an ROI for each of a plurality of channels associated with 3D audio is obtained by an on-device trained model (or a cloud services rendered model) based on the feature map and the mono audio of the received audio-visual content. Based thereupon, a Fourier spectrum for each channel is derived based on the predicted ROI. Accordingly, a 3D audio is generated based on inverse Fourier spectrum criteria applied to the Fourier spectrum for each channel.
  • the present step 1210 refers to model prediction and post-processing:
  • the trained model predicts the audio masks of each 3D audio channel (e.g. Ambisonic channel) for a corresponding mono audio input.
  • the audio masks are then multiplied with the input mono audio, to get the final STFT of the audio.
  • An inverse STFT on each channel is applied to get the final 3D audio.
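  • The post-processing of step 1210 can be sketched with librosa as below, assuming the base (W) channel is the mono audio itself and one predicted mask per remaining 3D-audio channel; the FFT parameters are assumptions consistent with the earlier sketches.

```python
import numpy as np
import librosa

def masks_to_3d_audio(predicted_masks, mono_audio, n_fft=1022, hop=256):
    """Multiply each predicted channel mask with the mono STFT, then apply an inverse
    STFT per channel to obtain the final 3D audio."""
    mono_stft = librosa.stft(mono_audio, n_fft=n_fft, hop_length=hop)
    channels = [mono_audio]                         # base channel W
    for mask in predicted_masks:                    # one mask per remaining channel
        ch_stft = mask * mono_stft                  # final STFT of the channel
        channels.append(librosa.istft(ch_stft, hop_length=hop, length=len(mono_audio)))
    return np.stack(channels, axis=0)               # channels x samples
```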
  • the output is delivered to the user as per the required format.
  • the 3D audio stream as obtained through step 1210 can be passed to an Ambisonic decoder, and the user receives direct audio playback through the connected speakers.
  • the users can directly stream Ambisonic audio from the server using any standard web-based delivery techniques and protocols, and thereby convert the received Ambisonic audio to standard audio layouts like 2.1, 5.1, 7.1, 9.1, 10.1, 11.1, 13.1, 22.1, 26.1 and 3D-over-headphone based audio layouts.
  • Figure 13 illustrates a client device 1300 or cloud server to process the inputs and produce 3D audio on a real video, by using processing signals of scene change detection and buffer management in accordance with Fig. 12.
  • the client device 1300 refers to the user's device, such as a mobile, PC, tablet or multimedia device like a TV, and is used for sending the video and audio data to be processed and receiving the outputs.
  • a pre-trained model and ancillary modules are also part of the user device.
  • the processing layer 1102 and the hardware/hardware-interface layer 1104 of the user device correspond to the Mono-to-3D-Audio server as depicted in Fig. 11.
  • memory 1302 stores intermediate outputs such as the processed input video frames 1304 corresponding to step 210 and the processed input mono audio 1306 corresponding to step 214, as well as final outputs such as the target 3D audio 1308.
  • the memory 1302 is used to store and provide data to the processing layer 1102 of the user device 1300 in case of on-device execution of the ML model.
  • the memory 1302 may belong to the cloud server and provides data to a processing-layer of the cloud server.
  • Figure 14 illustrates conversion from a Mono/Stereo audio format to multi-channel audio format.
  • the 3D audio generation as described in the preceding description can be used to convert a mono audio sound to 3D Audio using visual correspondence.
  • the user provides the video and audio to the system vide step 1402, along with the choice of the audio format.
  • the system either generates Ambisonic audio vide step 1404 or in turn converts the generated Ambisonic audio 1404 to multi-channel audio, based on the audio format choice.
  • the 3D audio generation may be extended to convert stereo audio to 3D Audio as well.
  • the user provides video along with stereo audio vide step 1406.
  • the system down mixes the stereo audio to mono audio vide step 1408 and applies the same operations as in Mono to 3D Audio conversion vide step 1410.
  • the user can choose the output format in this case as well.
  • cloud based streaming services or content producers may appropriate the 3D audio-generation of the present subject matter when the input audio stream of video is mono/stereo.
  • the present subject matter's 3D audio generator generates 3D sound as Ambisonic sound and it is streamed to the client device by streaming services.
  • the user device decodes the spatialized multichannel audio.
  • Figure 15 illustrates conversion of mono audio of a 360° video to 3D audio in accordance with the present subject matter.
  • a 360° video is a projection of a panoramic image onto an equirectangular representation.
  • the present subject matter's 3D audio generation may be construed to be operable with 360° video.
  • all of the corresponding frames of multiple, independent synthetic scenes generated as the training dataset in Fig. 2 and Fig. 9 can be projected together onto equirectangular or any standard projection used in 360° videos. Thereafter, binary masks of the scene are also projected onto the 360° equirectangular/standard projection.
  • the sound field of combined Ambisonic audio of each synthetic scene can be rotated based on the arrangement of scenes in 360°panoramic order, and then finally combined into single Ambisonic.
  • the rest of the stages of the training process such as the modified UNET architecture and associated training steps remain as provided in steps 1002 to 1012.
  • the user initializes the 360° video. Accordingly, the present subject matter eliminates the use of otherwise expensive 360° video and audio equipment for creating such content. In addition, the same addresses the challenge of creating a huge dataset manually.
  • Figure 16 illustrates an example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
  • the present subject matter at least advantageously benefits such scenarios and does away with the otherwise mandatory requirement of any recording devices or re-recording of sounds.
  • the 3D sound generator in accordance with the present subject matter renders the users with multi-channel/3D audio directly in near real-time.
  • the present 3D audio generator may be part of an RTMP pipeline (a protocol for live streaming media content), or be a part of transcoding pipelines.
  • FIG. 17 illustrates another example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
  • Since Ambisonic audio is speaker-agnostic, it is easy to decode to any speaker configuration in real-time by using standard layout angles of the speaker arrangement. While decoding the audio, the Ambisonic decoding specification also allows the inclusion of the position of the listener in the Omni-sphere. With a multimedia device that can track the user's location with respect to it, or let the user manually position himself through the device interface, a new sound experience can be provided to the user by using real-time sound field rotation. The real-time sound field rotation used in AR/VR technologies can now be extended to a general multimedia device based on the 3D audio generation in the Ambisonic format in accordance with the present subject matter.
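  • For first-order Ambisonics, the yaw component of such a sound-field rotation reduces to a 2D rotation of the X and Y channels, as in the hedged NumPy sketch below (ambiX W, Y, Z, X channel order assumed; the sign convention depends on whether the field or the listener is rotated).

```python
import numpy as np

def rotate_ambisonic_yaw(ambi, theta):
    """Rotate a first-order Ambisonic sound field (channels W, Y, Z, X) about the
    vertical axis by yaw angle theta in radians; W and Z are invariant under yaw."""
    w, y, z, x = ambi
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    x_rot = cos_t * x - sin_t * y
    y_rot = sin_t * x + cos_t * y
    return np.stack([w, y_rot, z, x_rot], axis=0)
```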
  • Figure 18 illustrates another example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
  • Fig. 18a illustrates usage of 3D audio generator as a plugin for existing manual audio mixing tools.
  • content creators spend a lot of time mixing the audio.
  • the present 3D audio generator may be added as fully automated or semi-automated plugin to the existing audio mixing tools.
  • audio mixing is entirely performed by the solution.
  • the content creator can further control additional audio, like background sounds, by choosing where to place the sound in 3D.
  • Fig. 18b illustrates appropriation of the present 3D audio generator within the functionality of audio transcoders while streaming.
  • a video streaming platform relies entirely on the content creator to provide a multi-channel audio.
  • Another challenge for the OTT platform is the heavy bandwidth requirement for delivering multi-channel audio, which in turn is difficult to download. Such large data inflicts buffering issues or heavy bandwidth costs on the user.
  • the present subject matter's 3D audio generator at least overcomes both the challenges, as Ambisonic content is comparatively small in size compared to 5.1 or 7.1 audio formats.
  • Streaming services can include present subject matter's mechanism as part of audio transcoding pipeline to generate and stream 3D Audio even if source audio is mono.
  • OTT platforms receive content from Content Providers.
  • the transcoding services prepare the content for streaming using the present subject matter's 3D audio generator.
  • the multimedia audio visual content packaged for streaming includes the 3D audio as generated in step 1804.
  • Such 3D audio is, for example, in Ambisonic format for ease of transmission and download.
  • Since Ambisonic audio can easily be converted to 5.1 or 7.1 layouts, the platforms can also convert it to the required formats and stream the content if there is a need to deliver only 5.1/7.1 audio.
  • Figure 19 illustrates a representative architecture 1900 to provide tools and development environment described herein for a technical-realization of the implementation Fig. 8, Fig. 11 and Fig. 13 through an audio-visual content processing based computing device.
  • Figure 19 is merely a non-limiting example, and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein.
  • the architecture may be executing on hardware such as a computing machine 2000 of Fig. 20 that includes, among other things, processors, memory, and various application-specific hardware components.
  • the architecture 1900 may include an operating-system, libraries, frameworks or middleware.
  • the operating system may manage hardware resources and provide common services.
  • the operating system may include, for example, a kernel, services, and drivers defining a hardware interface layer.
  • the drivers may be responsible for controlling or interfacing with the underlying hardware.
  • the drivers may include display drivers, camera drivers, Bluetooth® drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), Wi-Fi® drivers, audio drivers, power management drivers, and so forth depending on the hardware configuration.
  • USB Universal Serial Bus
  • a hardware interface layer includes libraries which may include system libraries such as file-system (e.g., C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like.
  • the libraries may include API libraries such as audio-visual media libraries (e.g., multimedia data libraries to support presentation and manipulation of various media format such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG), database libraries (e.g., SQLite that may provide various relational database functions), web libraries (e.g. WebKit that may provide web browsing functionality), and the like.
  • a middleware may provide a higher-level common infrastructure such as various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth.
  • GUI graphic user interface
  • the middleware may provide a broad spectrum of other APIs that may be utilized by the applications or other software components/modules, some of which may be specific to a particular operating system or platform.
  • module used in this disclosure may refer to a certain unit that includes one of hardware, software and firmware or any combination thereof.
  • the module may be interchangeably used with unit, logic, logical block, component, or circuit, for example.
  • the module may be the minimum unit, or part thereof, which performs one or more particular functions.
  • the module may be formed mechanically or electronically.
  • the module disclosed herein may include at least one of ASIC (Application-Specific Integrated Circuit) chip, FPGAs (Field-Programmable Gate Arrays), and programmable-logic device, which have been known or are to be developed.
  • ASIC Application-Specific Integrated Circuit
  • FPGAs Field-Programmable Gate Arrays
  • programmable-logic device which have been known or are to be developed.
  • a user interface defined as input and interaction 1901 refers to the overall input. It can include one or more of the following: touch screen, microphone, camera, etc.
  • a first hardware module 1902 depicts specialized hardware for ML/NLP based mechanisms.
  • the first hardware module 1902 comprises one or more of neural processors, FPGA, DSP, GPU etc.
  • a second hardware module 1912 depicts specialized hardware for executing the audio/video processing device related audio and video simulations.
  • ML/NLP based frameworks and APIs 1904 correspond to the hardware interface layer for executing the ML/NLP logic 1906 based on the underlying hardware.
  • the frameworks may be one or more of the following: Tensorflow, NLTK, GenSim, ARM Compute, etc.
  • Audio simulation frameworks and APIs 1914 may include one or more of: Audio Core, Audio Kit, Unity, Unreal, etc.
  • a database 1908 depicts a pre-trained voice feature database.
  • the database 1908 may be remotely accessible through cloud by the ML/NLP logic 1906.
  • the database 1908 may partly reside on cloud and partly on-device based on usage statistics.
  • Another database 1918 refers to the memory of the device 1300.
  • the database 1918 may be remotely accessible through cloud.
  • the database 1918 may partly reside on the cloud and partly on-device based on usage statistics.
  • a rendering module 1905 is provided for rendering audio output and trigger further utility operations.
  • the rendering module 1905 may be manifested as a display cum touch screen, monitor, speaker, projection screen, etc.
  • a general-purpose hardware and driver module 1903 corresponds to the computing device 2000 as referred in Fig. 20 and instantiates drivers for the general purpose hardware units as well as the application-specific units (1902, 1912).
  • the NLP/ML mechanism and audio simulations underlying the present architecture 1900 may be remotely accessible and cloud-based, thereby being remotely accessible through a network connection.
  • An audio/video processing device may be configured for remotely accessing the NLP/ML modules and simulation modules, and may comprise skeleton elements such as a microphone, a camera, a screen/monitor, a speaker, etc.
  • At-least one of the plurality of modules of Fig. 8, Fig. 11 and Fig. 13 or the Modified UNET architecture may be implemented through AI based on an ML/NLP logic 1906.
  • a function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor constituting the first hardware module 1902 i.e. specialized hardware for ML/NLP based mechanisms.
  • the processor may include one or a plurality of processors.
  • one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the aforesaid processors collectively correspond to the processor 2002 of Fig. 20.
  • the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory.
  • the predefined operating rule or artificial intelligence model is provided through training or learning.
  • ML/NLP logic 1906 may be configured to convert the speech into a computer-readable text using an automatic speech recognition (ASR) model.
  • a user's intent of utterance may be obtained by interpreting the converted-text using a natural language understanding (NLU) model.
  • NLU natural language understanding
  • the ASR model or NLU model may be an artificial intelligence model.
  • the artificial intelligence model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing.
  • the artificial intelligence model may be obtained by training.
  • ML/NLP logic 1906 may be an image recognition logic to obtain output data recognizing an image or a feature in the image by using image data as input data for an artificial intelligence model.
  • the artificial intelligence model may be obtained by training
  • the ML/NLP logic 1906 may be a reasoning or prediction logic and may use an artificial intelligence model to draw recommendations or predictions based on input data.
  • a pre-processing operation may be performed on the data to convert into a form appropriate for use as an input for the artificial intelligence model.
  • the artificial intelligence model may be obtained by training
  • a predefined operating rule or AI model of a desired characteristic is made.
  • Obtained by training means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training technique.
  • the learning may be performed in a device (i.e. the architecture 1900 or the device 2000) itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
  • the AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a neural network layer operation through calculation between a result of computation of a previous-layer and an operation of a plurality of weights.
  • Examples of neural-networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
  • the ML/NLP logic 1906 is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction.
  • learning techniques include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • Language understanding as performed by the ML/NLP logic 1906 may be a technique for recognizing and applying/processing human language/text and includes, e.g., natural language processing, machine translation, dialog system, question answering, or speech recognition/synthesis.
  • Visual understanding as performed by the ML/NLP logic 1906 may be a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
  • Reasoning prediction as performed by the ML/NLP logic 1906 may be a technique of logically reasoning and predicting by determining information and includes, e.g., knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.
  • Figure 20 shows yet another exemplary implementation in accordance with the embodiment, wherein yet another typical hardware configuration of the system depicted in Fig. 2, Fig. 11 and Fig. 13 is shown in the form of a computer system 2000.
  • the computer system 2000 can include a set of instructions that can be executed to cause the computer system 2000 to perform any one or more of the methods disclosed.
  • the computer system 2000 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices.
  • the computer system 2000 may operate in the capacity of a server or as a client user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment.
  • the computer system 2000 can also be implemented as or incorporated across various devices, such as a VR device, personal computer (PC), a tablet PC, a personal digital assistant (PDA), a mobile device, a palmtop computer, a communications device, a web appliance, or any other machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • the term "system” shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.
  • the computer system 2000 may include a processor 2002 e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both.
  • the processor 2002 may be a component in a variety of systems.
  • the processor 2002 may be part of a standard personal computer or a workstation.
  • the processor 2002 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analysing and processing data
  • the processor 2002 may implement a software program, such as code generated manually (i.e., programmed).
  • the computer system 2000 may include a memory 2004 that can communicate via a bus 2008.
  • the memory 2004 may include, but is not limited to computer readable storage media such as various types of volatile and non-volatile storage media, including but not limited to random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like.
  • the memory 2004 includes a cache or random access memory for the processor 2002.
  • the memory 2004 may be separate from the processor 2002, such as a cache memory of a processor, the system memory, or other memory.
  • the memory 2004 may be an external storage device or database for storing data.
  • the memory 2004 is operable to store instructions executable by the processor 2002.
  • the functions, acts or tasks illustrated in the figures or described may be performed by the programmed processor 2002 executing the instructions stored in the memory 2004.
  • the functions, acts or tasks are independent of the particular type of instructions set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firm-ware, micro-code and the like, operating alone or in combination.
  • processing strategies may include multiprocessing, multitasking, parallel processing and the like.
  • the computer system 2000 may further include a display unit 2010, such as a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, or any other now known or later developed display device for outputting determined information.
  • the display 2010 may act as an interface for the user to see the functioning of the processor 2002, or specifically as an interface with the software stored in the memory 2004 or in the drive unit 2016.
  • the computer system 2000 may include an input device 2012 configured to allow a user to interact with any of the components of system 2000.
  • the computer system 2000 may also include a disk or optical drive unit 2016.
  • the disk drive unit 2016 may include a computer-readable medium 2022 in which one or more sets of instructions 2024, e.g. software, can be embedded.
  • the instructions 2024 may embody one or more of the methods or logic as described. In a particular example, the instructions 2024 may reside completely, or at least partially, within the memory 2004 or the processor 2002 during execution by the computer system 2000.
  • the present invention contemplates a computer-readable medium that includes instructions 2024 or receives and executes instructions 2024 responsive to a propagated signal so that a device connected to a network 2026 can communicate voice, video, audio, images or any other data over the network 2026. Further, the instructions 2024 may be transmitted or received over the network 2026 via a communication port or interface 2020 or using a bus 2008.
  • the communication port or interface 2020 may be a part of the processor 2002 or may be a separate component.
  • the communication port 2020 may be created in software or may be a physical connection in hardware.
  • the communication port 2020 may be configured to connect with a network 2026, external media, the display 2010, or any other components in system 2000, or combinations thereof.
  • connection with the network 2026 may be a physical connection, such as a wired Ethernet connection or may be established wirelessly as discussed later.
  • additional connections with other components of the system 2000 may be physical or may be established wirelessly.
  • the network 2026 may alternatively be directly connected to the bus 2008.
  • the network 2026 may include wired networks, wireless networks, Ethernet AVB networks, or combinations thereof.
  • the wireless network may be a cellular telephone network, an 802.11, 802.16, 802.20, 802.1Q or WiMax network.
  • the network 2026 may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols.
  • the system is not limited to operation with any particular standards and protocols. For example, standards for Internet and other packet switched network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP) may be used.
  • the present subject matter refers to a deep learning model that uses just a regular 2D video to generate 3D audio from mono audio. Since 2D videos which have 3D audio are challenging to collect and are not directly available, the present subject matter provides an approach for synthetically generating such training data for the model.
  • the model may be hosted on the user's device and does away with the device's dependency on receiving an appropriately pre-processed audio format.
  • the 3D audio/multi-channel audio produced by the solution is generic to all existing devices' encoding and decoding techniques, and thus eliminates any additional major hardware or software requirements. It is a completely automated solution and can be integrated at various stages of content delivery pipelines, namely content creation, content streaming or on-device playback.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present subject matter refers to a method and system for generation of spatialized audio from an audio-visual multimedia content. The method comprises receiving one or more audio-visual contents comprising one or more visual objects and respective audio. The visual objects are identified within one or more image frames associated with the audio-visual content. Based thereupon, one or more motion paths with spatiality are simulated from the audio-visual content. An image sequence is reconstructed, such that the image sequence denotes a movement of one or more identified visual objects in accordance with said simulated motion paths. Such image sequence is associated with a time interval associated with the at least one audio-visual content and denotes a first data. Audio from the one or more audio-visual contents is combined to result in a mixed mono audio and thereby defines a second data. One or more points in 3D space are defined as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths. Finally, a 3D audio is reconstructed for the one or more visual objects as a third data.

Description

METHOD AND SYSTEM TO GENERATE 3D AUDIO FROM AUDIO-VISUAL MULTIMEDIA CONTENT
The present disclosure relates to interactive computing devices and in particular, relates to generation of audio content for a given audio-visual content.
The ever-expanding expectations of listeners for audio experience over the last few decades have created a necessity to create quality content using sophisticated technologies. At least one example of rendering a quality user experience is exhibiting 3D sound in audio-visual content.
State of the art proprietary solutions (e.g. Dolby) provide an audio mixing interface where content creators have to record the sounds independently, and then mix/associate them with the video by properly assigning co-ordinates in space to the corresponding audio objects. More specifically, the content creator independently records individual sounds, also known as audio objects. Thereafter, as a part of associating the audio with the video, the content mixers 'manually' associate each audio object to its estimated position in a scene, by assigning it position co-ordinates in a 3D space using existing audio mixing criteria.
The mixed audio is then converted into a variety of multi-channel formats (e.g. 5.1, 7.1 etc.) specific to an audio playback system. The audio received by the user is then decoded into the respective speaker configuration. In another example of a directly performed recording, content creators record a video with multi-channel sound recording in a sound lab.
Extensive production costs associated with the aforesaid procedures force both content creators as well as end-users to settle for substandard audio quality, thereby depriving them of the immersive audio experience. At least one reason is that premium audio encoding and decoding technologies are not available on all devices or require an additional setup of software and hardware components. Most content producers avail themselves of less costly audio solutions (such as stereo) which render only a minimal experience to the user, unlike the immersive experience rendered by 3D audio or surround sound. With over the top (OTT) media-based services and content becoming more and more prominent with each passing day, delivering multi-channel audio data incurs a lot of bandwidth usage costs for the end-user as well as enormous download time. Accordingly, OTT-based multimedia content streaming with associated multi-channel audio requires specific arrangements both at the media provider's end and the client's end.
Currently, there is no end-to-end automated solution for altogether generating an immersive audio. Unless a multi-channel audio stream exists in the playback, the user cannot experience it. Once audio is encoded in mono, none of the state of the art solutions can up-mix the audio to a multi-channel spatial sound. Also, standard spatial sound generation tools like Dolby require human intervention to create spatial sound.
In order to address this shortfall of overhead associated with generation of spatial sound, current devices use solutions like audio up-mixing or virtual surround to augment the sound effects and attempt to reach a quality at par with surround sound. However, such manoeuvres do not rely on a visual correspondence between the audio and video content and thus cannot produce a real surround effect. In other words, the state of the art solutions end up exhibiting a lack of co-ordination between visual and audio spatiality in the run-up to generating surround sound effects or audio spatiality.
There lies at least a need to generate a 3D audio from a given audio-visual content which originally does not have multi-channel audio content.
This summary is provided to introduce a selection of concepts in a simplified format that are further described in the detailed description of the invention. This summary is not intended to identify key or essential inventive concepts of the invention, nor is it intended for determining the scope of the invention.
The present subject matter refers to a method for generation of spatialized audio from an audio-visual multimedia content. The method comprises receiving one or more audio-visual contents comprising one or more visual objects and respective audio. The visual objects are identified within one or more image frames associated with the audio-visual content. Based thereupon, one or more motion paths with spatiality are simulated from the audio-visual content. An image sequence is reconstructed, such that the image sequence denotes a movement of one or more identified visual objects in accordance with said simulated motion paths. Such image sequence is associated with a time interval associated with the at least one audio-visual content and denotes a first data. Audio from the one or more audio-visual contents is combined to result in a mixed mono audio and thereby defines a second data. One or more points in 3D space are determined as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths. Finally, a 3D audio is reconstructed for the one or more visual objects as a third data.
To further clarify the advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which is illustrated in the appended drawing. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates method-steps in accordance with an embodiment of the present subject matter;
Figure 2 illustrates an implementation of the method-steps of Fig. 1, in accordance with an embodiment of the present subject matter;
Figure 3(3a and 3b) illustrates an example multimedia content required for synthesis of dataset, in accordance with an embodiment of the present subject matter;
Figure 4(4a and 4b) illustrates another example multimedia content required for synthesis of dataset, in accordance with an embodiment of the subject matter;
Figure 5 illustrates an example schematic diagram depicting a sub-process in accordance with an embodiment of the present subject matter;
Figure 6(6a and 6b) illustrates an example image sequence depicting a sub-process in accordance with an embodiment of the present subject matter;
Figure 7 illustrates another example schematic diagram depicting a sub-process in accordance with an embodiment of the present subject matter;
Figure 8 illustrates an example system implementation of the process of Fig. 1, in accordance with an embodiment of the subject matter;
Figure 9 illustrates method-steps in accordance with another embodiment of the present subject matter;
Figure 10 illustrates an example sub-process in accordance with an embodiment of the present subject matter;
Figure 11 illustrates an example system implementation of the process of Fig. 10 in accordance with an embodiment of the present subject matter;
Figure 12 illustrates method-steps in accordance with another embodiment of the present subject matter;
Figure 13 illustrates an example system implementation of the process of Fig. 12 in accordance with an embodiment of the present subject matter;
Figure 14 illustrates an example scenario in accordance with an embodiment of the present subject matter;
Figure 15 illustrates another example scenario in accordance with an embodiment of the present subject matter;
Figure 16 illustrates another example scenario in accordance with an embodiment of the present subject matter;
Figure 17 illustrates another example scenario in accordance with an embodiment of the present subject matter;
Figure 18(18a and 18b) illustrates another example scenario in accordance with an embodiment of the present subject matter;
Figure 19 illustrates another system architecture implementing various modules and sub-modules in accordance with the implementation depicted in Fig. 8, Fig. 11 and Fig. 13; and
Figure 20 illustrates a computing-device based implementation in accordance with an embodiment of the present subject matter.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
It should be understood that although illustrative implementations of the embodiments of the present disclosure are illustrated below, the present invention may be implemented using any number of techniques, whether currently known or in existence. The present disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary design and implementation illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
The term "some" as used herein is defined as "none, or one, or more than one, or all." Accordingly, the terms "none," "one," "more than one," "more than one, but not all" or "all" would all fall under the definition of "some." The term "some embodiments" may refer to no embodiments or to one embodiment or to several embodiments or to all embodiments. Accordingly, the term "some embodiments" is defined as meaning "no embodiment, or one embodiment, or more than one embodiment, or all embodiments."
The terminology and structure employed herein is for describing, teaching and illuminating some embodiments and their specific features and elements and does not limit, restrict or reduce the spirit and scope of the claims or their equivalents.
More specifically, any terms used herein such as but not limited to "includes," "comprises," "has," "consists," and grammatical variants thereof do NOT specify an exact limitation or restriction and certainly do NOT exclude the possible addition of one or more features or elements, unless otherwise stated, and furthermore must NOT be taken to exclude the possible removal of one or more of the listed features and elements, unless otherwise stated with the limiting language "MUST comprise" or "NEEDS TO include."
Whether or not a certain feature or element was limited to being used only once, either way it may still be referred to as "one or more features" or "one or more elements" or "at least one feature" or "at least one element." Furthermore, the use of the terms "one or more" or "at least one" feature or element do NOT preclude there being none of that feature or element, unless otherwise specified by limiting language such as "there NEEDS to be one or more . . ." or "one or more element is REQUIRED."
Unless otherwise defined, all terms, and especially any technical and/or scientific terms, used herein may be taken to have the same meaning as commonly understood by one having an ordinary skill in the art.
Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates method-steps in accordance with an embodiment of the present subject matter. The present subject matter refers to a method for the generation of spatialized audio from an audio-visual multimedia content. The method comprises receiving (step 102) one or more audio-visual contents comprising one or more visual objects and respective audio. The receiving of the audio-visual content comprises receiving the audio-visual content defined by one or more of: 2D video and mono audio, 2D video and stereo audio, 360 degree video and mono audio, and 360 degree video and stereo audio.
Further, the method comprises identifying (step 104) said visual-objects within one or more image-frames associated with the audio-visual content. The identifying of said visual objects comprises determining a region of interest (ROI) within each frame associated with the audio visual content, and generating an inverted representation of the ROI to remove background and thereby identify the visual objects.
Further, the method comprises simulating (step 106) one or more motion-paths with spatiality from the audio-visual content. The simulating of the one or more motion-paths comprises simulating the motion exhibited within the audio-visual content based on a plurality of parameters comprising one or more of a variable pace, a variable step size, occlusion, a 3D environment boundary, and frames per second (fps). In an example, the simulation of the motion comprises identifying a number of data points P for calculation. Thereafter, a subset of points F is calculated within said identified points P in 3D space. Another subset F' is calculated within said identified points P based on interpolating amongst the subset of points F. Finally, the calculation of the points P is concluded based on calculation of the remaining points by iteratively interpolating amongst the calculated set of points within the subset F'.
Further, the method comprises reconstructing (step 108) an image sequence denoting a movement of the one or more identified visual objects in accordance with said simulated motion paths. Such a sequence is associated with a time interval associated with the at least one audio-visual content, and said reconstructed image sequence denotes a first data. The reconstructing of the image sequence comprises generating at least one frame of the image sequence by orienting each of the identified visual objects against a plain background in accordance with a sub-time interval within the time interval associated with the at least one audio-visual content. Accordingly, a new scene with combined moving objects is created as the first data.
In addition, as a part of the reconstruction of the image sequence in step 108, the region of interest (ROI) is received for each frame within the reconstructed image sequence, wherein each frame in the reconstructed image sequence corresponds to a time instant within the time-interval of the reconstructed image sequence. Each received ROI for each frame is combined to generate a combined ROI (i.e. a combined binary mask) for the reconstructed image sequence. The combined ROI along with the reconstructed image sequence is used as the first data for extracting temporal features during training of the model.
Further, the method comprises combining (step 110) audio from one or more audio visual contents to result in a mixed mono audio defining a second data.
Further, the method comprises determining (step 112) one or more points in 3D space as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths. Such determination comprises interpolating data among a plurality of image frames comprising the identified visual objects to result in an upgraded set of image frames. The upgraded set of image frames is mapped to a spherical configuration. The simulated motion paths are scaled to correspond to said spherical configuration. A plurality of points corresponding to the scaled motion path is ascertained as the positions for the audio in 3D space.
Further, the method comprises reconstructing (step 114) 3D audio for the one or more visual objects as a third data. The 3D audio is reconstructed based on the audio within the audio-visual content and the determined positions in the 3D space. The reconstructing of the 3D audio comprises generating 3D audio with respect to each visual object based on the determined positions in the 3D space and the audio in the audio-visual content. The reconstructed 3D audio is mixed channel-wise. Background sound is added within the 3D audio at a position assumed to be behind a prospective listener. In an example, the 3D audio corresponds to surround audio constructed in accordance with the positions of all the objects.
Figure 2 illustrates an example implementation implementing the method steps of Fig. 1. In an example, Fig. 2 refers a synthetic dataset pipeline and involves generating synthetic dataset, which facilitates use of spatial data of visual objects as well as the audio, to train a deep learning model. Based thereupon, a synthetic scene is generated by combining multiple videos. The new scene and the audios generated share the same spatial arrangements, thus providing a direct audio visual association for 3D audio and 2D video, which can be further used for training the model.
At step 202, which corresponds to step 102, any two sample videos with a single audio source are received from an external source. The videos containing different kinds of visual objects are downloaded and may be saved categorically based on their class. In an example, the videos depicted in Fig. 3 represent a first set of videos for the class "Person" and the video represented in Fig. 4 represents a second set of videos for the class "Vehicles".
Such videos correspond to image frames with any frame per second (fps) 'F' that are extracted and saved as intermediate output. Corresponding audio is also extracted and saved as intermediate outputs. Thereafter, the corresponding frames and audio are sent as input to next step 204.
At step 204, which corresponds to step 104, image mask processing is performed. This step involves visual mask extraction. Any standard image segmentation technique can be used to extract the binary mask of the target class. The state of the art mask extraction techniques, while segmenting most of the image, predict inaccurate mask boundaries which result in poor segmentation. At least to avoid the above challenge, a mask extension technique is used to expand the predicted mask boundaries and to avoid data loss in regions of interest. The component iteratively includes "d" additional neighbouring unmasked pixels of the mask boundary in a final mask. The frame is then cropped to the bounding box of the mask. The binary mask is then inverted to get the background mask, which is used to remove the background. The final image is then reshaped to a standard size H x W. Such actions are applied sequentially on all the frames of all the videos to be combined.
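The following is a minimal sketch of the mask-extension and background-removal operation described above, assuming OpenCV and NumPy; the dilation kernel, the extension amount d and the target size are illustrative values rather than the disclosure's actual parameters.

```python
import cv2
import numpy as np

def extract_visual_object(frame, binary_mask, d=5, out_size=(224, 224)):
    """Expand a predicted mask by d pixels, crop to its bounding box,
    remove the background and resize to a standard H x W (illustrative)."""
    # Mask extension: iteratively include neighbouring unmasked pixels.
    kernel = np.ones((3, 3), np.uint8)
    extended = cv2.dilate(binary_mask.astype(np.uint8), kernel, iterations=d)

    # Crop the frame to the bounding box of the extended mask.
    ys, xs = np.where(extended > 0)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    cropped = frame[y0:y1, x0:x1]
    cropped_mask = extended[y0:y1, x0:x1]

    # Zero out the background (the inverted mask region) and resize.
    foreground = cropped * cropped_mask[..., None]
    return cv2.resize(foreground, out_size)
```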
The description of step 206 has been illustrated subsequent to Fig. 3 and Fig. 4
Figure 3 illustrates an example multimedia content required for the synthesis of dataset, in accordance with an embodiment of the present subject matter. Fig. 3a represents the image frames of the first set of video while Fig. 3b represents the region of interest or the visual objects in Fig. 3a. The ROIs in Fig. 3b are obtained based on subjecting the frames of Fig. 3a to image mask-processing of step 204.
Fig. 4a represents the image frames of the second set of videos while Fig. 4b represents the region of interest (i.e. the visual objects present in Fig. 4a) obtained based on subjecting the frames of Fig. 4a to the image-mask processing of step 204.
Continuing with the description of Fig. 2, the method in Fig. 2 further comprises step 206 that corresponds to step 106 of Fig. 1.
At step 206, random motion paths are simulated. The present step establishes and provides the data for spatializing the video and audio. Based on the number of videos selected for combining in step 202, a corresponding random path is generated. To simulate the motion, various parameters like variable pace, variable step size, occlusion, 3D environment boundaries, and fps are introduced, to make the motion more realistic.
The description of step 208 has been illustrated after Fig. 5.
Fig. 5 represents twin stage operation with respect to step 206 to generate random 3D path.
As a first sub-step concerning step 206, based on the fps and the length of the target scene, the total number of data points "P" to be calculated is identified. First, a fraction F of the P points is calculated using a random walk in 3 dimensions independently, confined within the standard image dimensions H x W along with defined Z limits. While constructing the above 3D walk, a random step size within a threshold is used to control the variable step length, which introduces the pace of movement. Calculating all the data points using the random walk may be avoided to minimize randomness in movement, which is highly unlikely to be exhibited by any moving object in an actual video. Occlusion is known to be caused when a visual object leaves and enters the scene, or when it is overlapped by a different object. The former case of occlusion is handled by introducing a deviation factor D, which is an additional boundary that a random walk point can move to. The 3D walk also uses an additional parameter to freeze the motion by reusing the previous point for the next few points K. The freeze time, i.e. when and for how long the motion is frozen, is also decided randomly.
As a second sub-step with respect to step 206, the remaining data points are used to produce smooth and consistent movement in a particular direction for a time interval. The calculation of the remaining points is done as follows. A fraction F' of the remaining points is calculated by interpolating data between each of the F points calculated using the random walk in the first sub-step. During each interpolation, a random smoothness factor is used. Accordingly, the present second sub-step is applied iteratively to calculate the remaining points until all P points are calculated.
Accordingly, step 206, vide the two sub-steps in Fig. 5, renders a way to introduce motion and the associated spatiality that can be used for both video and audio.
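A minimal sketch of the two sub-steps is given below, assuming NumPy; the boundary values, the deviation margin, the freeze probability and the simplification of the iterative interpolation into a single interpolation pass are assumptions made for illustration.

```python
import numpy as np

def random_3d_path(P, frac=0.2, bounds=(224, 224, 100), max_step=8.0,
                   deviation=20.0, freeze_prob=0.1, max_freeze=5):
    """Two-stage random 3D path: a bounded random walk for a fraction F of the
    P points, then interpolation to fill in the remaining points."""
    F = max(2, int(P * frac))
    limits = np.array(bounds, dtype=float)

    # Stage 1: random walk with variable step size, occasional freezes, and a
    # deviation margin D that lets the object temporarily leave the scene.
    points = [np.random.uniform(0.0, 1.0, size=3) * limits]
    while len(points) < F:
        if np.random.rand() < freeze_prob:
            # Freeze the motion by reusing the previous point for K steps.
            points += [points[-1]] * np.random.randint(1, max_freeze + 1)
            continue
        step = np.random.uniform(-max_step, max_step, size=3)   # variable pace
        points.append(np.clip(points[-1] + step, -deviation, limits + deviation))
    coarse = np.array(points[:F])

    # Stage 2: fill the remaining points by interpolating between the walk
    # points (the random per-segment smoothness factor is simplified away here).
    t_coarse = np.linspace(0.0, 1.0, F)
    t_fine = np.linspace(0.0, 1.0, P)
    return np.stack([np.interp(t_fine, t_coarse, coarse[:, k])
                     for k in range(3)], axis=1)   # shape (P, 3)
```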
Continuing with the description of Fig. 2, the method in Fig. 2 further comprises step 208 that corresponds to step 108 of Fig. 1. At step 208, a 3D to 2D path projection is implemented. The points or paths that were calculated in step 206 are in three dimensions. As the target videos are 2D, the 3D paths are projected and translated to the necessary properties of the corresponding visual mask or ROI, namely its size when the object is moving in the Z direction and the horizontal and vertical position of the visual object's centre when moving in the X and Y directions. In an example, an inverse tangent function may be used to translate the Z dimension to the size of the image. This produces the depth perception for the visual object as it moves away from or towards the viewer.
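A hedged sketch of the 3D-to-2D projection follows; the specific arctangent mapping from Z to object size and the base size value are illustrative choices, not the exact translation used in the disclosure.

```python
import numpy as np

def project_path_to_2d(path_3d, base_size=96, z_max=100.0):
    """Translate 3D path points into 2D placement properties: the x/y values
    become the object's centre, and the z value is mapped through an inverse
    tangent to a scale factor, shrinking the object as it moves away."""
    centers = path_3d[:, :2]                                  # pixel centre (x, y)
    # arctan maps z in [0, z_max] smoothly into a shrinking factor;
    # negative z (deviation region) simply enlarges the object slightly.
    scale = 1.0 - np.arctan(path_3d[:, 2] / z_max) / (np.pi / 2)
    sizes = np.maximum((base_size * scale).astype(int), 1)    # object size in pixels
    return centers, sizes
```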
At step 210 of Fig. 2, which corresponds to step 108 of Fig. 1, multi-object scene reconstruction takes place to render a reconstructed image sequence based on combining the videos of Fig. 3 and Fig. 4. After the above calculation of step 208, the corresponding visual object is pasted onto a white or any other background image which acts as a slate background image. The same is repeated for all the visual objects corresponding to their time frames. The visual objects of a path moving outside the background image dimensions are ignored to create the occlusion. The second scenario of occlusion is handled by placing the visual object closest in terms of Z value on top of the other visual objects when their paths cross.
Further, based on the inverted image masks obtained from step 204, a combined binary mask for all corresponding visual objects for each time frame is calculated. The masks for the T time frames of the combined scene are converted into a single binary mask by taking the maximum value of each pixel across the binary frames. The combined binary mask thus obtained is the region of interest that the visual objects tend to cover in the videos of Fig. 3 and Fig. 4.
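The pixel-wise maximum over the per-frame binary masks can be sketched as follows, assuming NumPy:

```python
import numpy as np

def combine_scene_masks(per_frame_masks):
    """Collapse the binary masks of T reconstructed frames into a single
    combined ROI by taking the maximum value of each pixel across frames."""
    stack = np.stack(per_frame_masks, axis=0)   # shape (T, H, W), values 0/1
    return stack.max(axis=0)                    # shape (H, W): combined binary mask
```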
The description of step 212 has been illustrated subsequent to Fig. 6.
Figure 6 illustrates an example image sequence depicting a sub-process in accordance with an embodiment of the present subject matter.
Fig. 6a illustrates an example reconstructed image sequence from t=0 to t=5 seconds as obtained by the example operation of step 210 over the combination of the two sets of videos depicted in Fig. 3 and Fig. 4 respectively. In other words, Fig. 6 refers to an example synthetic scene after a random path with an fps of 1 and a time interval of 6 seconds, wherein image frames are depicted corresponding to time slots 0 to 5 within the time interval of 6 seconds.
Fig. 6b represents an example combined mask or combined ROI as obtained by the example operation of step 210. Alternatively stated, Fig. 6b refers to a binary mask or region of interest corresponding to the reconstructed image sequence in Fig. 6a.
Continuing with the description of Fig. 2, the method in Fig. 2 further comprises step 212, which corresponds to step 112 of Fig. 1 and refers to audio spatialization. The random paths generated in step 206 include data sufficient for the number of frames in a video. Based on human visual perceptiveness, a certain number of video frames is enough to produce a motion in an object. However, with respect to audio, the approach is unlike video, since the auditory perceptiveness of humans is high compared to the visual perceptiveness. Accordingly, the sampling rate in audio is considered higher compared to the fps. The further explanation in Fig. 7 refers to the generation of the positions for the audio using the motion points of the video.
The description of step 214 has been illustrated subsequent to Fig. 7.
Figure 7 illustrates normalizing visual correspondence from motion-path of visual objects to audio in accordance with an embodiment of the present subject matter.
As a first step, the data between each frame that is required for audio is interpolated using an interpolation function and a smoothness factor, to reduce abrupt changes in audio levels.
As a second step, the motion vectors are scaled within a unit sphere in accordance with 3D audio specifications such as an Ambisonic specification. In contrast to the state of the art audio mixing tools which involve manual positioning of sound, the present step refers to mapping positions from a rectangular image frame to a unit Omni-sphere. The forthcoming paragraph refers to scaling for converting and normalizing the motion vectors into a unit Omni-sphere.
The final image frames are resized to be equal in all directions for attaining equality of the boundaries used for the 3D random path (as computed in step 206). Accordingly, as step 702 and a first-stage scaling, all random path values are divided by the length of the boundary in each direction to scale the values to unit scale. The values thus obtained lie within the cube circumscribing a unit sphere and denote the random path range after the first scaling.
As step 704 and a second-stage scaling, the values circumscribing the unit sphere are now confined to the unit sphere. For this purpose, the values are divided by sqrt(3), which scales them to be within a cube inscribed within the unit sphere. Accordingly, the confined values within the unit sphere are in line with the unit sphere equation defined by a² + b² + c² = 1. Accordingly, step 704 refers to the final path range after the second scaling.
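A small sketch of the two-stage scaling is shown below; centring the path on the listener before scaling is an added assumption, since the disclosure only states that the values are divided by the boundary length and then by sqrt(3).

```python
import numpy as np

def scale_path_to_unit_sphere(path_3d, boundary):
    """Two-stage scaling of a random path into the unit omni-sphere so that
    every point satisfies a^2 + b^2 + c^2 <= 1."""
    boundary = np.asarray(boundary, dtype=float)   # e.g. (W, H, Z_limit)
    centred = path_3d - boundary / 2.0             # assumption: listener at the centre
    first = centred / (boundary / 2.0)             # stage 1: cube circumscribing the unit sphere
    return first / np.sqrt(3.0)                    # stage 2: cube inscribed in the unit sphere
```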
Continuing with the description of Fig. 2, the method in Fig. 2 further comprises Step 214 that corresponds to step 110 of Fig. 1.
Step 214 represents a procedure of "Mono Audio Mixing". For the videos to be mixed, the corresponding audio STFTs or Fourier spectrums are used to mix the audio in the frequency domain. The mixed audio acts as the second data required for training. Additional background audio is also added for the portions of the scene that do not have any visual objects, to simulate background sound. Accordingly, an ML model undergoing training is guided on how to place audio for which visual data or a visual object is not available.
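A hedged sketch of the frequency-domain mono mixing follows, using librosa for the STFT; the sample rate, FFT size and hop length are illustrative values.

```python
import numpy as np
import librosa

def mix_mono_in_frequency_domain(audio_tracks, background=None,
                                 n_fft=1024, hop_length=256):
    """Mix the mono tracks of the combined videos in the frequency domain by
    summing their STFTs; an optional background track covers regions of the
    scene that have no visual object."""
    stfts = [librosa.stft(a, n_fft=n_fft, hop_length=hop_length) for a in audio_tracks]
    if background is not None:
        stfts.append(librosa.stft(background, n_fft=n_fft, hop_length=hop_length))
    T = min(s.shape[1] for s in stfts)               # align frame counts
    mixed_stft = sum(s[:, :T] for s in stfts)        # frequency-domain mix
    mixed_mono = librosa.istft(mixed_stft, hop_length=hop_length)
    return mixed_mono, mixed_stft                    # the "second data"
```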
Step 216 corresponds to step 114 and accordingly refers to 3D sound encoding into a desired specification and to audio mixing, so as to achieve the third type of data. The audio of each visual object is first spatialized using a standard encoding technique such as Ambisonic. The above step is repeated for all the audios to be mixed. The encoded audios are then mixed channel-wise in the frequency domain by summing all the corresponding audio channels. The same background audio added in step 214 is placed in the 3D audio assuming it is behind the listener, by choosing the farthest point on the axis behind the listener, assuming the listener is at the centre of the sphere.
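The sketch below illustrates first-order (B-format) Ambisonic encoding by direction cosines followed by channel-wise mixing; normalisation conventions (FuMa/SN3D), per-frame gain smoothing and the exact placement of the background source are simplified assumptions.

```python
import numpy as np

def encode_first_order_ambisonic(mono, positions_per_sample):
    """Hedged sketch of first-order (B-format) encoding: each mono sample is
    weighted by the direction cosines of its position on the unit sphere."""
    pos = np.asarray(positions_per_sample, dtype=float)        # shape (N, 3)
    d = pos / (np.linalg.norm(pos, axis=1, keepdims=True) + 1e-9)
    w = mono                                                   # omnidirectional channel
    x, y, z = mono * d[:, 0], mono * d[:, 1], mono * d[:, 2]   # directional channels
    return np.stack([w, x, y, z], axis=0)                      # shape (4, N)

def mix_ambisonics(encoded_sources):
    """Channel-wise mix: summation per Ambisonic channel (equal lengths assumed)."""
    return np.sum(np.stack(encoded_sources, axis=0), axis=0)
```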
Overall steps 210, 214 and 216 of Fig. 2 correspond to generating three types of synthetic dataset that enables correspondence between 2D visual objects and audio. Such dataset is synthesized for training a deep learning model.
Fig. 8 refers to a Mono to 3D Audio server 800. This component corresponds to the synthetic scene and audio reconstruction pipeline in accordance with Fig. 2 as a part of a processing layer 802. As a part of a hardware/hardware interface layer 804 and in line with standard computing environment specifications, the server 800 comprises an operating system, API, network communication, GPU and an external interface for external device access.
Figure 9 illustrates processing of the data set generated from Fig. 2 for usage as a training data set for ML training purposes and thereby refers to a training data set generator 900. A machine-learning (ML) model or a deep learning model 906 is trained based on a training data set defined by one or more of said first, second and third data to enable the generation of the 3D audio from another audio-visual multimedia content based on the ML model. The training data set 904 may comprise the elements referred to in the forthcoming paragraphs.
As one example element of the training data set, a dense optical flow may be estimated with respect to the videos utilized in Fig. 3 and Fig. 4 using existing standard techniques. The dense optical flow refers to the motion of pixels across the series of frames. In addition, as a part of determining the optical flow, data augmentation of the frames may be performed to change brightness, contrast, hue etc.
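A minimal sketch of dense optical flow estimation between adjacent frames using OpenCV's standard Farneback method is shown below; the Farneback parameters are illustrative defaults.

```python
import cv2
import numpy as np

def dense_optical_flow(frames):
    """Estimate dense optical flow between adjacent frames; the output layout
    matches the (T-1) x 2 x H x W shape referred to in the training pipeline."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2,
                                            flags=0)
        flows.append(flow.transpose(2, 0, 1))   # (2, H, W): dx, dy per pixel
        prev = nxt
    return np.stack(flows, axis=0)              # (T-1, 2, H, W)
```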
Another example data set element refers to the reconstructed image sequence and combined binary mask from step 210, which also defines the first type of data in accordance with Fig. 1.
Another example data set element refers to a Fourier spectrum or Short Term Fourier Transform (STFT) of the mixed mono audio obtained from step 214, which also defines the second type of data in accordance with Fig. 1.
As another example dataset, the third type of data from Fig. 1, i.e. the reconstructed 3D audio from step 216, is utilized to calculate a target ROI or target audio mask. Such generation comprises determining a region of interest (ROI) associated with the 3D audio based on a Fourier spectrum (FT) value of one or more channels within the 3D audio and a Fourier spectrum value of the mixed mono audio defined as the second data.
Considering the example of the Ambisonic specification, each channel of the Ambisonic audio other than a base channel may be used to calculate an audio mask, by dividing the STFT values of each Ambisonic channel of the 3D audio of step 216 by the mixed mono audio STFT from step 214. The resulting ratio masks are provided to the model as the target data set 904 against which predictions are drawn during training.
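A hedged sketch of the target ratio-mask computation follows; the use of STFT magnitudes, the epsilon and the clipping value are illustrative choices.

```python
import numpy as np
import librosa

def target_audio_masks(ambisonic_channels, mixed_mono, n_fft=1024, hop_length=256,
                       eps=1e-8, clip=10.0):
    """Compute target ratio masks: STFT magnitude of every Ambisonic channel
    except the base (omni) channel divided by the mixed mono STFT magnitude."""
    mono_mag = np.abs(librosa.stft(mixed_mono, n_fft=n_fft, hop_length=hop_length))
    masks = []
    for ch in ambisonic_channels[1:]:            # skip the base/omni channel
        ch_mag = np.abs(librosa.stft(ch, n_fft=n_fft, hop_length=hop_length))
        T = min(ch_mag.shape[1], mono_mag.shape[1])
        masks.append(np.clip(ch_mag[:, :T] / (mono_mag[:, :T] + eps), 0.0, clip))
    return np.stack(masks, axis=0)               # (C-1, freq_bins, time_frames)
```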
The following Table 1 lists the elements as the overall training data set.
(Table 1, listing the elements of the training data set, is provided as an image in the original publication.)
Figure 10 illustrates the training of a machine-learning (ML) model in accordance with an embodiment of the present subject matter.
At step 1002, a first set of temporal features is extracted from the reconstructed image sequence defined as the first data. Specifically, as a part of temporal feature extraction, the module extracts the overall visual features of an entire scene. The input frames passed to this module are of the shape T x C x H x W, where T is the total number of frames in the scene, C is the number of image channels, H is the height of the image, and W is the width. The input is reshaped to T*C x H x W, which makes the total number of image channels T*C. This input is passed to a series of units of (Conv2D + Batch Normalization + Max Pooling). The output of this module is a new array of shape 1 x H' x W'.
At step 1004, a second set of temporal features is extracted from the combined ROI. More specifically, the present step refers to a binary mask adaptive max pooling, wherein the binary masks of the scene are adaptively max pooled to 1 x H' x W'. This mask is element-wise multiplied with the temporal features. The same at least reduces the background noise and guides the model to focus only on the regions where motion might occur. The present step 1004 at least relies on the fact that convolution operations do not change the relative position of the extracted features with respect to the original pixel positions.
At step 1006, the first set of temporal features of step 1002 and the second set of temporal features from step 1004 are subjected to a pooling operation to generate a third set of temporal features referred to as masked temporal features.
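Steps 1002 to 1006 can be sketched in PyTorch as below; the channel widths, the number of convolutional units and the collapse to a single feature channel are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalFeatureExtractor(nn.Module):
    """Hedged sketch: frames of shape (T, C, H, W) are reshaped to (T*C, H, W)
    and passed through Conv2D + BatchNorm + MaxPool units; the combined binary
    mask is adaptively max-pooled to H' x W' and multiplied element-wise."""
    def __init__(self, t_frames=6, img_channels=3, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(t_frames * img_channels, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(hidden, 1, 3, padding=1),          # collapse to 1 x H' x W'
        )

    def forward(self, frames, combined_mask):
        # frames: (B, T, C, H, W); combined_mask: (B, 1, H, W) with values 0/1
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b, t * c, h, w))      # (B, 1, H', W')
        pooled_mask = nn.functional.adaptive_max_pool2d(
            combined_mask, feats.shape[-2:])                       # (B, 1, H', W')
        return feats * pooled_mask                                 # masked temporal features
```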
At step 1008, a set of optical features is extracted based on sensing the optical flow across the frames within the reconstructed image sequence of step 210 defined as the first data. More specifically, as a part of dense optical flow estimation and feature extraction, the optical flow estimation is executed between adjacent frames using standard techniques. The output from the dense optical flow estimation between all adjacent frames is (T-1) x 2 x H x W. This data is then reshaped to (T-1)*2 x H x W and passed to a series of units of (Conv2D + Batch Normalization + Max Pooling) to yield optical flow features. The output of this module is a new array of shape 1 x H' x W', in line with the array shape of steps 1002 and 1004.
At step 1010, a feature map is created based on the third set of temporal-features and the set of optical-features obtained from step 1006 and 1008.
At step 1012, an ML model to be trained is provided, wherein such ML model is defined by a convolutional neural network for image feature extraction. The Fourier spectrum of the second data (i.e. the mixed mono audio STFT from step 214) and the feature map are processed by the ML model, and based thereupon the processed second data and the processed map are concatenated with layers of the ML model. In an example, the ML model may be a 'UNET'-based fully convolutional network in line with deep-learning architecture. For rendering visual attention for the UNET, the outputs from step 1010 are flattened and stacked. The flattened attention features are concatenated with every alternate up-conv layer of either a 5-layered or 7-layered standard UNET. The input and output channels of the up-conv layers are adjusted to accommodate the concatenated attention features.
At step 1014, an ROI for each of a plurality of channels associated with 3D audio is predicted, and based thereupon the operation of the ML model is optimized based on a comparison between the predicted ROI and the ROI forming a part of target dataset 904. More specifically, the model is trained to learn visual correspondence of audio from the 2D video to generate 3D Audio Masks.
The present step 1014 is directed to model predictions and loss optimization, wherein the output of the modified UNET of step 1012 is the audio masks of the 3D audio channels. The targets for the model are the target 3D audio masks 904 earlier calculated in Fig. 9. Based on the prediction and the target data 904, the loss calculation and error backpropagation are performed to train the model or the UNET. Standard loss functions like L1 or L2 may be used for the loss.
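A minimal sketch of one training step with an L1 loss is given below; the model's call signature (mono STFT magnitude plus attention features) and the tensor shapes are assumptions for illustration.

```python
import torch

def training_step(model, optimizer, attention_features, mono_stft_mag, target_masks):
    """One optimisation step for the modified UNET: predict the 3D-audio channel
    masks and minimise an L1 loss against the target masks of Fig. 9."""
    optimizer.zero_grad()
    predicted_masks = model(mono_stft_mag, attention_features)   # (B, C-1, F, T), assumed
    loss = torch.nn.functional.l1_loss(predicted_masks, target_masks)
    loss.backward()                                              # error backpropagation
    optimizer.step()
    return loss.item()
```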
Figure 11 refers to a Mono to 3D Audio server 1100 for training the model as referred to in Fig. 10. This server also hosts the post-trained model for cloud-based result generation after training, in a scenario where a user chooses to perform a cloud-computing-enabled remote mono to 3D audio conversion instead of using the client's device.
As a part of a processing layer 1102, the server executes the flow of Fig. 10, for example the optical flow estimation 1008, the temporal feature extraction 1006, the visual attention map generation 1010 based on the optical features and the temporal features, and the modified UNET architecture 1012. As a part of a hardware/hardware interface layer 1104 and in line with standard computing environment specifications, the server 1100 comprises an operating system, API, network communication, GPU and an external interface for external device access.
Figure 12 refers to 3D audio generation on a user's data as a part of client device operation in accordance with an embodiment of the present subject matter. More specifically, the method comprises generation of 3D audio from a user-provided audio-visual multimedia content based on said ML model.
At step 1202, an audio-visual content is selected at the user end to be rendered at the user device with 3D audio effect. Such audio visual content may be different from the sample audio- visual content selected in Fig. 3 and Fig. 4 for generating the training dataset.
At step 1204, one or more of video frames and audio are extracted from the received audio-visual content. The video frames are extracted at a certain frame per second (fps) rate along with the audio, from the video stream/file submitted by the user. In case the extracted audio is in a non-mono format (e.g. in stereo format), the extracted audio is converted into mono audio.
At step 1206, a predetermined condition is sensed. The sensing of the predetermined condition comprises detecting one or more of a contextual or scenic change with respect to the video frames within the received audio-visual content, and a buffer level exceeding a threshold. More specifically, as a part of the buffer-processing decision, the data in the buffer is sent for processing when the buffer gets full or a scene change is detected. A standard scene detection technique may be used to identify a change in the scene, e.g. based on a histogram threshold or camera angle changes.
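A simple histogram-based scene-change check, as one possible standard technique, is sketched below; the histogram size and correlation threshold are illustrative.

```python
import cv2

def is_scene_change(prev_frame, curr_frame, threshold=0.5):
    """Compare grayscale histograms of adjacent frames; a low correlation
    suggests a scene change and triggers the buffer processing."""
    h1 = cv2.calcHist([cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)],
                      [0], None, [64], [0, 256])
    h2 = cv2.calcHist([cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)],
                      [0], None, [64], [0, 256])
    similarity = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
    return similarity < threshold
```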
At step 1208, based on a trigger raised by step 1206, a sequence of data processing steps is activated as depicted in Fig. 10 with respect to the image frames of the audio-visual content rendered as input in step 1202 of Fig. 12. In an example, the audio-visual content frames undergo the steps of capturing the optical flow across the video frames, computing the Fourier spectrum of the mixed mono audio, and determining a combined region of interest (ROI) with respect to the video frames in accordance with steps 210, 212 and 214 of Fig. 2. Thereafter, a set of temporal features is extracted from the combined ROI in line with steps 1002 and 1004. Such a set of temporal features is subjected to a pooling operation to generate pooled temporal features. Based thereupon, a feature map or visual attention map is generated in line with step 1010 based on the pooled temporal features and the set of optical features.
More specifically, for calculation of the binary image mask for the frames of the audio-visual content received in step 1202, a standard image segmentation technique may be used to first identify and segment the same object classes used to train the model. All the masks in a single time frame are applied on an array filled with zeros, having the same dimensions as the image. The above step is repeated on all frames received for processing. A combined binary mask or combined ROI is then calculated by taking the maximum value pixel-wise. The steps 1002 to 1008 are then performed over the binary mask, the extracted temporal features, and the optical features to obtain the visual attention features in line with step 1010.
At step 1210, an ROI for each of a plurality of channels associated with 3D audio is obtained by an on-device trained model (or a cloud services rendered model) based on the feature map and the mono audio of the received audio-visual content. Based thereupon, a Fourier spectrum for each channel is derived based on the predicted ROI. Accordingly, a 3D audio is generated based on inverse Fourier spectrum criteria applied to the Fourier spectrum for each channel.
The present step 1210 refers to model prediction and post-processing. The trained model predicts the audio masks of each 3D audio channel (e.g. Ambisonic channel) for a corresponding mono audio input. The audio masks are then multiplied with the STFT of the input mono audio to get the final STFT of each audio channel. An inverse STFT is applied on each channel to get the final 3D audio.
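The post-processing of step 1210 can be sketched as follows, assuming librosa; treating the base (omni) channel as the mono input itself is an added assumption.

```python
import numpy as np
import librosa

def masks_to_3d_audio(predicted_masks, mono_audio, n_fft=1024, hop_length=256):
    """Multiply each predicted channel mask with the mono STFT, then apply an
    inverse STFT per channel to obtain the multi-channel 3D audio."""
    mono_stft = librosa.stft(mono_audio, n_fft=n_fft, hop_length=hop_length)
    channels = [librosa.istft(mono_stft, hop_length=hop_length)]        # base channel (assumption)
    for mask in predicted_masks:                                        # (F, T) per channel
        T = min(mask.shape[1], mono_stft.shape[1])
        channel_stft = mask[:, :T] * mono_stft[:, :T]
        channels.append(librosa.istft(channel_stft, hop_length=hop_length))
    length = min(len(c) for c in channels)
    return np.stack([c[:length] for c in channels], axis=0)             # (C, N) 3D audio
```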
At step 1212, the output is delivered to the user in the required format. In the case of on-device processing, the 3D audio stream obtained through step 1210 can be passed to an ambisonic decoder and the user receives direct audio playback through the connected speakers. In the case of cloud-based processing, the user can directly stream ambisonic audio from the server using any standard web-based delivery techniques and protocols and thereby convert the received ambisonic audio to standard audio layouts like 2.1, 5.1, 7.1, 9.1, 10.1, 11.1, 13.1, 22.1, 26.1 and 3D over-headphone based audio layouts.
Figure 13 illustrates a client device 1300 or cloud server that processes the inputs and produces 3D audio on a real video, by using the processing signals of scene change detection and buffer management in accordance with Fig. 12. The client device 1300 refers to the user's device such as a mobile, PC, tablet or multimedia device like a TV, and is used for sending and receiving the outputs such as the video and audio data to be processed. In the case of on-device processing, a pre-trained model and ancillary modules are also part of the user device.
The processing layer 1102 and hardware/hardware interface layer 1104 of the user device correspond to those of the mono to 3D audio server as depicted in Fig. 11.
Further, the memory 1302 of the user device or external storage systems is used to store intermediate outputs, i.e. the processed input video frames 1304 corresponding to step 210 and the processed input mono audio 1306 corresponding to step 214, and final outputs, i.e. the target 3D audio 1308. In an example, the memory 1302 is used to store and provide data to the processing layer 1102 of the user device 1300 in the case of on-device execution of the ML model. In the case of cloud server (Mono to 3D Audio server) based execution of the ML model, the memory 1302 may belong to the cloud server and provides data to a processing layer of the cloud server.
Figure 14 illustrates conversion from a Mono/Stereo audio format to multi-channel audio format.
As per scenario 1, the 3D audio generation as described in the preceding description can be used to convert a mono audio sound to 3D Audio using visual correspondence. The user provides the video and audio to the system vide step 1402, along with the choice of the audio format. The system either generates Ambisonic audio vide step 1404 or in turn converts the generated Ambisonic audio 1404 to multi-channel audio, based on the audio format choice.
As per scenario 2, the 3D audio generation may be extended to convert stereo audio to 3D Audio as well. In this case, the user provides video along with stereo audio vide step 1406. The system down mixes the stereo audio to mono audio vide step 1408 and applies the same operations as in Mono to 3D Audio conversion vide step 1410. The user can choose the output format in this case as well.
Likewise, cloud based streaming services or content producers may appropriate the 3D audio-generation of the present subject matter when the input audio stream of video is mono/stereo. In an example, the present subject matter's 3D audio generator generates 3D sound as Ambisonic sound and it is streamed to the client device by streaming services. The user device decodes the spatialized multichannel audio.
Figure 15 illustrates conversion of the mono audio of a 360° video to 3D audio in accordance with the present subject matter. As a 360° video is a projection of a panoramic image onto an equirectangular representation, the present subject matter's 3D audio generation may be construed to be operable with 360° video. By considering multiple synthetic scenes as segments of a large panoramic view, a new synthetic dataset for 360° videos and 3D audio can be easily generated.
In operation and as a part of the training phase, all of the corresponding frames of the multiple, independent synthetic scenes generated as the training dataset in Fig. 2 and Fig. 9 can be projected together onto an equirectangular or any standard projection used in 360° videos. Thereafter, binary masks of the scene are also projected onto the 360° equirectangular/standard projection. The sound field of the combined Ambisonic audio of each synthetic scene can be rotated based on the arrangement of the scenes in 360° panoramic order, and then finally combined into a single Ambisonic audio. The rest of the stages of the training process, such as the modified UNET architecture and the associated training steps, remain as provided in steps 1002 to 1012.
As a part of the real-time operation of the trained model, i.e. real-time conversion of mono audio, the user initializes the 360° video. Accordingly, the present subject matter eliminates the use of otherwise expensive 360° video and audio equipment for creating such content. In addition, the same addresses the challenge of creating a huge dataset manually.
Figure 16 illustrates an example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
In state of the art scenarios where content is recorded in real-time and streamed live or near real-time, it is impossible to up-mix or provide surround/3D sound in the stream. The necessity to stream the content as a live broadcast also renders it impossible to mix the audio, in light of the completely manual process of the state of the art.
The present subject matter at least advantageously benefits such scenarios and does away with the otherwise mandatory requirement of any recording devices or re-recording of sounds. The 3D sound generator in accordance with the present subject matter renders the users a multi-channel/3D audio directly in near real-time. In an example, the present 3D audio generator may be part of an RTMP pipeline (a protocol for live streaming media content) or be a part of transcoding pipelines.
Figure 17 illustrates another example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
As Ambisonic audio is speaker agnostic, it is easy to decode to any speaker configuration in real-time by using standard layout angles of the speaker arrangement. While decoding the audio, the ambisonic decoding specification also allows the inclusion of the position of the listener in the omni-sphere. With a multimedia device that can track the user's location with respect to it, or that lets the user manually position himself through the device interface, a new sound experience can be provided to the user by using real-time sound field rotation. The real-time sound field rotation which is used in AR/VR technologies can now be extended to a general multimedia device based on the 3D audio generation in the Ambisonic format in accordance with the present subject matter.
Figure 18 illustrates another example working implementation of the present subject matter, in accordance with an embodiment of the present subject matter.
Fig. 18a illustrates usage of the 3D audio generator as a plugin for existing manual audio-mixing tools. In the current scenario, content creators spend a lot of time mixing audio. To reduce this manual effort, the present 3D audio generator may be added as a fully automated or semi-automated plugin to existing audio-mixing tools. In the fully automated case, audio mixing is performed entirely by the solution. In semi-automated mixing, the content creator retains control over additional audio, such as background sounds, and can place such sound wherever desired in 3D.
Fig. 18b illustrates incorporation of the present 3D audio generator within the functionality of audio transcoders during streaming.
In the state-of-the-art OTT scenario, a video streaming platform relies entirely on the content creator to provide multi-channel audio. Another challenge for the OTT platform is the heavy bandwidth requirement for delivering multi-channel audio, which in turn is difficult to download; such large data imposes buffering issues or heavy bandwidth costs on the user.
The present subject matter's 3D audio generator at least overcomes both challenges, as Ambisonic content is comparatively small in size relative to 5.1 or 7.1 audio formats. Streaming services can include the present subject matter's mechanism as part of the audio transcoding pipeline to generate and stream 3D audio even if the source audio is mono.
In operation, at step 1802, OTT platforms receive content from Content Providers.
At step 1804, the transcoding services prepare the content for streaming using the present subject matter's 3D audio generator.
At step 1806, the multimedia audio-visual content packaged for streaming includes the 3D audio generated in step 1804. Such 3D audio is, for example, in Ambisonic format for ease of transmission and download. However, since Ambisonic audio can easily be converted to 5.1 or 7.1 layouts, the platform can also convert it to the required format and stream the content if there is a need to deliver only 5.1/7.1 audio, as illustrated in the sketch below.
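For instance, a 5.1 bed could be derived from the Ambisonic stream with a basic sampling decode at the standard ITU loudspeaker azimuths. This is a minimal sketch under the same first-order (W, X, Y, Z) assumptions as the earlier examples; foa_to_5_1_bed and the LFE handling are illustrative choices, not part of the claimed method.

```python
import numpy as np

# ITU-R BS.775-style horizontal azimuths for 5.1 (degrees); LFE handled separately.
ITU_5_1_AZIMUTHS_DEG = [30.0, -30.0, 0.0, 110.0, -110.0]   # L, R, C, Ls, Rs

def foa_to_5_1_bed(wxyz: np.ndarray) -> np.ndarray:
    """Derive a 5.1 channel bed from horizontal first-order Ambisonics
    (W, X, Y, Z) with a basic sampling decode; the LFE channel is left
    silent here, as its derivation is content- and delivery-spec dependent."""
    w, x, y, _ = wxyz
    mains = [0.5 * w + 0.5 * (np.cos(az) * x + np.sin(az) * y)
             for az in np.radians(ITU_5_1_AZIMUTHS_DEG)]
    lfe = np.zeros_like(w)
    return np.stack(mains + [lfe])              # (6, num_samples): L, R, C, Ls, Rs, LFE
```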
Figure 19 illustrates a representative architecture 1900 to provide the tools and development environment described herein for a technical realization of the implementations of Fig. 8, Fig. 11 and Fig. 13 through an audio-visual content processing based computing device. Figure 19 is merely a non-limiting example, and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein. The architecture may execute on hardware such as the computing machine 2000 of Fig. 20 that includes, among other things, processors, memory, and various application-specific hardware components.
The architecture 1900 may include an operating-system, libraries, frameworks or middleware. The operating system may manage hardware resources and provide common services. The operating system may include, for example, a kernel, services, and drivers defining a hardware interface layer. The drivers may be responsible for controlling or interfacing with the underlying hardware. For instance, the drivers may include display drivers, camera drivers, Bluetooth® drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), Wi-Fi® drivers, audio drivers, power management drivers, and so forth depending on the hardware configuration.
A hardware interface layer includes libraries which may include system libraries such as a file-system library (e.g., the C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematical functions, and the like. In addition, the libraries may include API libraries such as audio-visual media libraries (e.g., multimedia data libraries to support presentation and manipulation of various media formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG), database libraries (e.g., SQLite that may provide various relational database functions), web libraries (e.g., WebKit that may provide web browsing functionality), and the like.
A middleware may provide a higher-level common infrastructure such as various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth. The middleware may provide a broad spectrum of other APIs that may be utilized by the applications or other software components/modules, some of which may be specific to a particular operating system or platform.
The term "module" used in this disclosure may refer to a certain unit that includes one of hardware, software and firmware or any combination thereof. The module may be interchangeably used with unit, logic, logical block, component, or circuit, for example. The module may be the minimum unit, or part thereof, which performs one or more particular functions. The module may be formed mechanically or electronically. For example, the module disclosed herein may include at least one of ASIC (Application-Specific Integrated Circuit) chip, FPGAs (Field-Programmable Gate Arrays), and programmable-logic device, which have been known or are to be developed.
Further, the architecture 1900 depicts an aggregation of audio/video processing device based mechanisms and ML/NLP based mechanisms in accordance with an embodiment of the present subject matter. A user-interface defined as input and interaction 1901 refers to overall input. It can include one or more of the following: a touch screen, a microphone, a camera, etc. A first hardware module 1902 depicts specialized hardware for ML/NLP based mechanisms. In an example, the first hardware module 1902 comprises one or more of neural processors, FPGAs, DSPs, GPUs, etc.
A second hardware module 1912 depicts specialized hardware for executing the audio/video processing device related audio and video simulations. ML/NLP based frameworks and APIs 1904 correspond to the hardware interface layer for executing the ML/NLP logic 1906 based on the underlying hardware. In an example, the frameworks may be one or more of the following: Tensorflow, the framework depicted in image PCTKR2020018079-appb-img-000002, NLTK, GenSim, ARM Compute, etc. Audio simulation frameworks and APIs 1914 may include one or more of Audio Core, Audio Kit, Unity, Unreal, etc.
A database 1908 depicts a pre-trained voice feature database. The database 1908 may be remotely accessible through the cloud by the ML/NLP logic 1906. In another example, the database 1908 may partly reside on the cloud and partly on-device based on usage statistics.
Another database 1918 refers to the memory of the device 1300. The database 1918 may be remotely accessible through the cloud. In another example, the database 1918 may partly reside on the cloud and partly on-device based on usage statistics.
A rendering module 1905 is provided for rendering audio output and triggering further utility operations. The rendering module 1905 may be manifested as a display cum touch screen, monitor, speaker, projection screen, etc.
A general-purpose hardware and driver module 1903 corresponds to the computing device 2000 as referred in Fig. 20 and instantiates drivers for the general purpose hardware units as well as the application-specific units (1902, 1912).
In an example, the NLP/ML mechanism and audio simulations underlying the present architecture 1900 may be cloud-based and thereby remotely accessible through a network connection. An audio/video processing device configured for remotely accessing the NLP/ML modules and simulation modules may comprise skeleton elements such as a microphone, a camera, a screen/monitor, a speaker, etc.
Further, at least one of the plurality of modules of Fig. 8, Fig. 11 and Fig. 13 or the Modified UNET architecture may be implemented through AI based on an ML/NLP logic 1906. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor constituting the first hardware module 1902, i.e. specialized hardware for ML/NLP based mechanisms. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The aforesaid processors collectively correspond to the processor 2002 of Fig. 20.
The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
In an implementation, ML/NLP logic 1906 may be configured to convert the speech into a computer-readable text using an automatic speech recognition (ASR) model. A user's intent of utterance may be obtained by interpreting the converted-text using a natural language understanding (NLU) model. The ASR model or NLU model may be an artificial intelligence model. The artificial intelligence model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training.
According to another embodiment, the ML/NLP logic 1906 may be an image recognition logic to obtain output data recognizing an image or a feature in the image by using image data as input data for an artificial intelligence model. The artificial intelligence model may be obtained by training.
According to yet another embodiment, the ML/NLP logic 1906 may be a reasoning or prediction logic and may use an artificial intelligence model to draw recommendations or predictions based on input data. A pre-processing operation may be performed on the data to convert it into a form appropriate for use as an input for the artificial intelligence model. The artificial intelligence model may be obtained by training.
Here, being provided through learning means that, by applying a learning logic/technique to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. "Obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training technique. The learning may be performed in a device (i.e. the architecture 1900 or the device 2000) itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its neural network layer operation by computing on the result of the previous layer using that plurality of weights, as illustrated in the sketch below. Examples of neural networks include, but are not limited to, convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), bidirectional recurrent deep neural networks (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
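Purely as an illustration of the layer operation just described, and not of the modified UNET architecture itself, a single convolution-plus-ReLU layer of the kind such networks stack might be sketched as follows; the function name conv2d_relu, the shapes, and the naive loop implementation are assumptions made for readability.

```python
import numpy as np

def conv2d_relu(prev_out: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """One neural-network layer: a 'valid' 2D convolution of the previous
    layer's output with this layer's weights, followed by a ReLU non-linearity.
    prev_out: (in_ch, H, W); weights: (out_ch, in_ch, kH, kW); bias: (out_ch,)."""
    out_ch, in_ch, kh, kw = weights.shape
    _, h, w = prev_out.shape
    out = np.zeros((out_ch, h - kh + 1, w - kw + 1))
    for o in range(out_ch):
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                patch = prev_out[:, i:i + kh, j:j + kw]          # receptive field
                out[o, i, j] = np.sum(patch * weights[o]) + bias[o]
    return np.maximum(out, 0.0)                                  # ReLU
```

In practice such layer operations run on the specialized hardware noted above (GPU, VPU or NPU) rather than in explicit Python loops.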
The ML/NLP logic 1906 may employ a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning techniques include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Language understanding as performed by the ML/NLP logic 1906 may be a technique for recognizing and applying/processing human language/text and includes, e.g., natural language processing, machine translation, dialog system, question answering, or speech recognition/synthesis.
Visual understanding as performed by the ML/NLP logic 1906 may be a technique for recognizing and processing things as does human vision and includes, e.g., object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
Reasoning prediction as performed by the ML/NLP logic 1906 may be a technique of logically reasoning and predicting by determining information and includes, e.g., knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.
Figure 20 shows yet another exemplary implementation in accordance with the embodiment: a typical hardware configuration of the system depicted in Fig. 2, Fig. 11 and Fig. 13, in the form of a computer system 2000. The computer system 2000 can include a set of instructions that can be executed to cause the computer system 2000 to perform any one or more of the methods disclosed. The computer system 2000 may operate as a standalone device or may be connected, e.g., using a network, to other computer systems or peripheral devices.
In a networked deployment, the computer system 2000 may operate in the capacity of a server or as a client user computer in a server-client user network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment. The computer system 2000 can also be implemented as or incorporated across various devices, such as a VR device, personal computer (PC), a tablet PC, a personal digital assistant (PDA), a mobile device, a palmtop computer, a communications device, a web appliance, or any other machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single computer system 2000 is illustrated, the term "system" shall also be taken to include any collection of systems or sub-systems that individually or jointly execute a set, or multiple sets, of instructions to perform one or more computer functions.
The computer system 2000 may include a processor 2002, e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both. The processor 2002 may be a component in a variety of systems. For example, the processor 2002 may be part of a standard personal computer or a workstation. The processor 2002 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analysing and processing data. The processor 2002 may implement a software program, such as code generated manually (i.e., programmed).
The computer system 2000 may include a memory 2004 that can communicate via a bus 2008. The memory 2004 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one example, the memory 2004 includes a cache or random access memory for the processor 2002. In alternative examples, the memory 2004 is separate from the processor 2002, such as a cache memory of a processor, the system memory, or other memory. The memory 2004 may be an external storage device or database for storing data. The memory 2004 is operable to store instructions executable by the processor 2002. The functions, acts or tasks illustrated in the figures or described may be performed by the programmed processor 2002 executing the instructions stored in the memory 2004. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like.
As shown, the computer system 2000 may or may not further include a display unit 2010, such as a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, or other now known or later developed display device for outputting determined information. The display 2010 may act as an interface for the user to see the functioning of the processor 2002, or specifically as an interface with the software stored in the memory 2004 or in the drive unit 2016.
Additionally, the computer system 2000 may include an input device 2012 configured to allow a user to interact with any of the components of system 2000. The computer system 2000 may also include a disk or optical drive unit 2016. The disk drive unit 2016 may include a computer-readable medium 2022 in which one or more sets of instructions 2024, e.g. software, can be embedded. Further, the instructions 2024 may embody one or more of the methods or logic as described. In a particular example, the instructions 2024 may reside completely, or at least partially, within the memory 2004 or the processor 2002 during execution by the computer system 2000.
The present invention contemplates a computer-readable medium that includes instructions 2024 or receives and executes instructions 2024 responsive to a propagated signal so that a device connected to a network 2026 can communicate voice, video, audio, images or any other data over the network 2026. Further, the instructions 2024 may be transmitted or received over the network 2026 via a communication port or interface 2020 or using a bus 2008. The communication port or interface 2020 may be a part of the processor 2002 or may be a separate component. The communication port 2020 may be created in software or may be a physical connection in hardware. The communication port 2020 may be configured to connect with a network 2026, external media, the display 2010, or any other components in system 2000, or combinations thereof. The connection with the network 2026 may be a physical connection, such as a wired Ethernet connection, or may be established wirelessly as discussed later. Likewise, the additional connections with other components of the system 2000 may be physical or may be established wirelessly. The network 2026 may alternatively be directly connected to the bus 2008.
The network 2026 may include wired networks, wireless networks, Ethernet AVB networks, or combinations thereof. The wireless network may be a cellular telephone network, an 802.11, 802.16, 802.20, 802.1Q or WiMax network. Further, the network 2026 may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols. The system is not limited to operation with any particular standards and protocols. For example, standards for Internet and other packet switched network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP) may be used.
At least by virtue of the preceding description, the present subject matter refers to a deep learning model that uses just a regular 2D video to generate 3D audio from mono audio. Since 2D videos paired with 3D audio are challenging to collect and not directly available, the present subject matter provides an approach for synthetically generating such training data for the model. The model may be hosted on the user's device and does away with the device's dependency on receiving an appropriately pre-processed audio format. The 3D audio/multi-channel audio produced by the solution is generic to all existing devices' encoding and decoding techniques, and thus eliminates any major additional hardware or software requirements. It is a completely automated solution and can be integrated at various stages of content delivery pipelines, namely content creation, content streaming, or on-device playback.
While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein.
Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims (15)

  1. A method for generation of spatialized audio from an audio-visual multimedia content, the method comprising:
    Receiving (102) one or more audio-visual contents comprising one or more visual-objects and respective audio;
    Identifying (104) said visual objects within one or more image-frames associated with the audio-visual content;
    Simulating (106) one or more motion-paths with spatiality from the audio-visual content;
    Reconstructing (108) an image sequence denoting a movement of one or more identified visual objects in accordance with said simulated motion paths, said sequence associated with a time-interval associated with the at least one audio visual content and wherein said reconstructed image sequence denotes a first data;
    Combining (110) audio from one or more audio visual contents to result in a mixed mono audio defining a second data;
    Determining (112) one or more points in 3D space as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths; and
    Reconstructing (114) a 3D audio for the one or more visual objects.
  2. The method as claimed in claim 1, wherein the 3D audio is reconstructed based on one or more of:
    a) the audio within the audio- visual content and
    b) said determined positions in the 3D space, wherein said reconstructed 3D audio represents a third data.
  3. The method as claimed in claim 1, further comprising training a machine-learning (ML) model based on a training data set defined by one or more of said first, second and third data to enable generation of the 3D audio from another audio-visual multimedia content based on said ML model.
  4. The method as claimed in claim 1, wherein said identifying of said visual objects comprises:
    determining a region of interest (ROI) within each frame associated with the audio visual content; and
    generating an inverted representation of the ROI to remove background and thereby identify the visual objects.
  5. The method as claimed in claim 4, further comprising:
    receiving the region of interest (ROI) for each frame within the reconstructed image sequence, wherein each frame in the reconstructed image sequence corresponds to a time instant within the time-interval of the reconstructed image sequence;
    combining each received ROI for each frame to generate a combined ROI for the reconstructed image sequence; and
    enabling usage of the combined ROI for extracting temporal features during training of the model.
  6. The method as claimed in claim 1, wherein receiving the audio-visual content comprises receiving the audio-visual content defined by one or more of:
    2D video and mono audio;
    2D video and stereo audio;
    360 degree video and mono audio; and
    360 degree video and stereo audio.
  7. The method as claimed in claim 1, wherein the simulating one or more motion-paths comprises simulating the motion exhibited within the audio-visual content based on a plurality of parameters comprising one or more of a variable pace, a variable step size, occlusion, a 3D environment boundary, and frames per second (fps).
  8. The method as claimed in claim 5, wherein the simulation of the motion comprises:
    identifying a number of data points P for calculation;
    calculating a subset of points F within said identified points P in 3D space;
    calculating another subset F' within said identified points P based on interpolating amongst the subset of points F; and
    concluding calculation of the points P based on calculation of remaining points by iteratively interpolating amongst the calculated set of points within the subset F'.
  9. The method as claimed in claim 1, wherein reconstructing the image sequence comprises:
    generating at least one frame of the image-sequence by orienting each of the identified visual objects against a plain background in accordance with a sub-time interval within the time-interval associated with the at least one audio visual content.
  10. The method as claimed in claim 1, wherein said determining of one or more points in 3D space for positioning the audio comprises:
    interpolating data among a plurality of image frames comprising the identified visual objects to result in an upgraded set of image-frames;
    mapping the upgraded set of image frames to a spherical configuration;
    scaling the simulated motion-paths to correspond to said spherical configuration; and
    ascertaining a plurality of points corresponding to the scaled motion path as the positions for the audio in 3D space.
  11. The method as claimed in claim 10, wherein the reconstructing of the 3D audio comprises:
    generating 3D audio with respect to each visual object based on the determined positions in the 3D space and audio in the audio visual content;
    mixing the 3D audio channel wise; and
    adding background sound within the 3D audio at a position assumed to be backside of a prospective listener.
  12. The method as claimed in claim 1, further comprising generating a target dataset from the third data based on:
    determining a region of interest associated with the 3D audio based on a Fourier spectrum (FT) value of one or more channels within the 3D audio and a Fourier spectrum value of the mixed mono audio defined as the second data.
  13. The method as claimed in claim 3, wherein said training of the machine-learning (ML) model comprises:
    extracting a first-set of temporal features from the reconstructed image sequence defined as the first data;
    extracting a second-set of temporal features from the combined ROI;
    subjecting the first and second set of temporal features to a pooling operation to generate a third set of temporal features;
    extracting a set of optical features based on sensing optical flow across the frames within the reconstructed image sequence defined as the first data; and
    creating a feature map based on the third set of temporal features and the set of optical features.
  14. The method as claimed in claim 13, further comprising:
    providing the ML model to be trained, said ML model defined by a convolutional neural network for image feature extraction;
    processing a Fourier spectrum of the second data and the feature map, and based thereupon concatenating the processed second data and the processed map with layers of the ML model;
    predicting an ROI for each of a plurality of channels associated with the 3D audio; and
    optimizing the operation of the ML model based on a comparison between the predicted ROI and the ROI forming a part of the target dataset.
  15. A system (800, 1100, 1300) for generation of spatialized audio from an audio-visual multimedia content, the system comprising:
    an image reconstruction module (210) for
    receiving one or more audio-visual contents comprising one or more visual-objects and respective audio;
    identifying said visual objects within one or more image-frames associated with the audio-visual content;
    simulating one or more motion-paths with spatiality from the audio-visual content;
    reconstructing an image sequence denoting a movement of one or more identified visual objects in accordance with said simulated motion paths, said sequence associated with a time-interval associated with the at least one audio visual content and wherein said reconstructed image sequence denotes a first data;
    an audio processing module (214) for combining audio from one or more audio visual contents to result in a mixed mono audio defining a second data;
    and
    an audio reconstruction module (216, 212) for:
    determining one or more points in 3D space as positions for the audio related to the one or more visual objects based on scaling the one or more simulated motion paths; and
    reconstructing a 3D audio for the one or more visual objects as third data.
PCT/KR2020/018079 2020-09-16 2020-12-10 Method and system to generate 3d audio from audio-visual multimedia content WO2022059858A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202011040125 2020-09-16
IN202011040125 2020-09-16

Publications (1)

Publication Number Publication Date
WO2022059858A1 true WO2022059858A1 (en) 2022-03-24

Family

ID=80776904

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/018079 WO2022059858A1 (en) 2020-09-16 2020-12-10 Method and system to generate 3d audio from audio-visual multimedia content

Country Status (1)

Country Link
WO (1) WO2022059858A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2597162B (en) * 2019-04-25 2023-03-15 Ibm Audiovisual source separation and localization using generative adversarial networks
CN115953303A (en) * 2023-03-14 2023-04-11 山东省计算中心(国家超级计算济南中心) Multi-scale image compressed sensing reconstruction method and system combining channel attention

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140119581A1 (en) * 2011-07-01 2014-05-01 Dolby Laboratories Licensing Corporation System and Tools for Enhanced 3D Audio Authoring and Rendering
US20150016641A1 (en) * 2013-07-09 2015-01-15 Nokia Corporation Audio processing apparatus
KR20150117797A (en) * 2014-04-11 2015-10-21 하수호 Method and Apparatus for Providing 3D Stereophonic Sound
US20170295318A1 (en) * 2014-03-04 2017-10-12 Gopro, Inc. Automatic generation of video from spherical content using audio/visual analysis
US20200245032A1 (en) * 2017-10-12 2020-07-30 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and apparatus for efficient delivery and usage of audio messages for high quality of experience

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20954252; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20954252; Country of ref document: EP; Kind code of ref document: A1)