WO2021258866A1 - Method and system for generating a background music for a video

Info

Publication number: WO2021258866A1
Application number: PCT/CN2021/092052
Authority: WO
Inventors: Prince Narula; Singh SHUBHAM KUMAR; Ashish Mishra
Original assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date: 2020-06-23
Filing date: 2021-05-07
Publication date: 2021-12-30

Abstract

The present disclosure relates to a system and method for generating background music for a video. The disclosure comprises receiving the video comprising one or more video frames. Thereafter, the one or more video frames are processed to identify at least one of a scene, an emotion and a text associated with the video. The method then determines at least one genre based on the at least one of the scene, the emotion and the text associated with the video; and then identifies one or more music samples based on at least one of the genre and a user preference. Subsequently, a background music for the video is generated based on the identified one or more music samples.

Description

METHOD AND SYSTEM FOR GENERATING A BACKGROUND MUSIC FOR A VIDEO

FIELD OF THE DISCLOSURE

The present disclosure generally relates to video processing and more particularly to a system and method for automatically generating a background music for a video.

BACKGROUND

The following description of related art is intended to provide background information pertaining to the field of the disclosure. This section may include certain aspects of the art that may be related to various features of the present disclosure. However, it should be appreciated that this section should be used only to enhance the reader's understanding of the present disclosure, and not as an admission of prior art.

With the advent of camera units in mobile devices and smartphones, the use of images and videos has become widespread in the past decade. People tend to capture relevant or important events by recording videos of them and relive those memories by watching such videos. Often, when a video is recorded, a lot of background noise is also recorded. For instance, if a user is recording a video of his child cutting a cake on the child's birthday, the recorded video may include background noise of people talking or kids shouting that is not relevant to the video. Such irrelevant background noise reduces the relevance of the memory being captured.

Sometimes users add background music to such videos to enhance the experience associated with the video. This is, however, a manual and cumbersome exercise. Existing solutions only provide for generation of music based on a syllable or based on existing music themes. Some existing solutions may also eliminate or reduce background noise in a video. This, however, does not enhance the user experience of a video.

In view of the above limitations, it is apparent that there exists a need in the art to enhance user experience with a video.

SUMMARY

This section is provided to introduce certain objects and aspects of the present disclosure in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.

In view of the above limitations, it is an object of the disclosure to provide a system and method for automatically generating a background music for a video. It is another object of the disclosure to enhance the user experience of a video by supplementing the video with a background music that captures the essence of the video. It is yet another object of the disclosure to reduce the irrelevant background noise from a video. It is also an object of the disclosure to provide a system and method for automatically generating a background music for a video that is relevant for the video based on the scenes, emotions and/or text in the video.

In order to achieve the above-mentioned objects, one aspect of the present disclosure provides a method for generating background music for a video. The method comprises receiving the video, at a user equipment, the video comprising one or more video frames. Thereafter, the one or more video frames are processed to identify at least one of a scene, an emotion and a text associated with the video. The method then determines at least one genre based on the at least one of the scene, the emotion and the text associated with the video; and then identifies one or more music samples based on at least one of the genre and a user preference. Subsequently, a background music for the video is generated based on the one or more music samples.

Another aspect of the disclosure pertains to a system for generating background music for a video, the system comprising a transceiver unit configured to receive the video comprising one or more video frames. The system further comprises a processing unit coupled to the transceiver unit, the processing unit configured to:

process the one or more video frames to identify at least one of a scene, an emotion and a text associated with the video, determine at least one genre based on the at least one of the scene, the emotion and the text associated with the video, and identify one or more music samples based on at least one of the genre and a user preference. The system also comprises a music generator coupled to the transceiver unit and the processing unit, the music generator configured to generate a background music for the video based on the one or more music samples.

Yet another aspect of the disclosure pertains to a user equipment comprising a system for generating background music for a video, the system further comprising: a transceiver unit configured to receive the video comprising one or more video frames. The system further comprises a processing unit coupled to the transceiver unit, the processing unit configured to: process the one or more video frames to identify at least one of a scene, an emotion and a text associated with the video, determine at least one genre based on the at least one of the scene, the emotion and the text associated with the video, and identify one or more music samples based on at least one of the genre and a user preference. The system also comprises a music generator coupled to the transceiver unit and the processing unit, the music generator configured to generate a background music for the video based on the one or more music samples.

BRIEF DESCRIPTION OF DRAWINGS

The accompanying drawings, which are incorporated herein, and constitute a part of this disclosure, illustrate exemplary embodiments of the disclosed methods and systems in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components, electronic components or circuitry commonly used to implement such components.

Figure 1 illustrates an exemplary architecture diagram for a system for generating background music for a video, in accordance with exemplary embodiments of the present disclosure.

Figure 2 illustrates an exemplary method flow diagram for a method for generating background music for a video, in accordance with exemplary embodiments of the present disclosure.

Figure 3 illustrates an exemplary user interface diagram illustrating an example implementation of the disclosure.

The foregoing shall be more apparent from the following more detailed description of the disclosure.

DETAILED DESCRIPTION

In the following description, for the purposes of explanation, numerous examples have been set forth in order to provide a brief description of the disclosure. It will be apparent however, that the disclosure may be practiced without these specific details, features and examples; and the scope of the present disclosure is not limited to the examples provided herein below.

As discussed briefly in the background section, the existing solutions for video editing do not provide any automatic solution for generating a relevant background music for a video that is currently being captured or is captured and stored in a user equipment. Thus, typically, while a video is recorded, the essence of the video is not captured due to the irrelevant background noise being recorded with the video. The present disclosure aims at enhancing the user experience of a video by automatically adding background music to a video that is relevant for that particular video.

Existing prior art solutions only discuss generation of music, without taking into account the user experience. For instance, some solutions offer music generation for a current segment based on a previous segment. This does not take into account the overall context of the video and the user experience associated with it. Similarly, other known solutions provide for mixing of music to generate a mashup; however, such generation fails to take into account user emotions or conversations.

The present disclosure proposes to enhance the user experience by taking into account the entire context of a video. The disclosure proposes to first determine the overall context of the video by determining a scene of each frame of the video, detecting one or more emotions of the user from each of the frames, and determining speech from the background noise. Thereafter, the scenes, emotions and speech/text identified from the video are used to map the video to one or more genres. Then a sample of music is selected for each of the determined genres, and a final background music for the video is determined based on such music samples.

The disclosure will now be described in more detail with reference to the drawings.

Fig. 1 illustrates an exemplary architecture diagram for a system for generating background music for a video, in accordance with exemplary embodiments of the present disclosure. As shown in Fig. 1, the system [100] comprises at least one transceiver unit [102], at least one processing unit [104], at least one music generator [106] and at least one storage unit [108], all components being connected to each other. Although only a few components are shown, the disclosure encompasses one or more of such units or other units necessary to implement the functionality of the disclosure.

As used herein, “couple”, “connect”, “associate” and their cognate terms, such as “coupled”, “connected”, “associated”, include a physical connection (such as a conductor), a virtual connection (such as through randomly assigned memory locations of a data memory device), a logical connection (such as through logical gates of a semiconducting device), other suitable connections, or a combination of such connections, as may be obvious to a skilled person.

The transceiver unit [102] is coupled to the processing unit [104], the music generator [106] and the storage unit [108]. The transceiver unit [102] comprises at least one transmitter and at least one receiver (not shown in the figure for clarity) to transmit and receive information, respectively. The transceiver unit [102] is configured to receive one or more videos, each video comprising one or more video frames. The number of video frames in a video depends on the length and quality of the video. The transceiver unit [102] may receive the video from a camera unit of a user equipment, from the storage unit [108] of the system [100], or from any other external storage unit. Further, the video received at the transceiver unit [102] may be in an encoded form.

The transceiver unit [102] is further configured to transmit the video to the processing unit [104] and/or the storage unit [108]. In an embodiment, the transceiver unit [102] may receive the one or more video frames in a sequential manner and temporarily store the same in a buffer before transmitting to any other unit.
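By way of a non-limiting illustration, the buffering of sequentially received video frames may be sketched as follows. The class name `FrameBuffer`, the fixed `batch_size` and the rule of releasing frames once a batch is full are assumptions made for this example only; the disclosure does not prescribe a buffer size or a release policy.

```python
from collections import deque
from typing import Any, List, Optional


class FrameBuffer:
    """Hypothetical buffer that holds frames until a batch is ready."""

    def __init__(self, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_size = batch_size
        self._frames: deque = deque()

    def push(self, frame: Any) -> Optional[List[Any]]:
        """Store one frame; return a full batch when available, else None."""
        self._frames.append(frame)
        if len(self._frames) >= self._batch_size:
            return self.flush()
        return None

    def flush(self) -> List[Any]:
        """Return whatever is buffered (possibly empty) and clear the buffer."""
        batch = list(self._frames)
        self._frames.clear()
        return batch
```

In such a sketch, `flush` would be called once at the end of a recording so that a final, partially filled batch is still forwarded for processing.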

The processing unit [104] is configured to receive the one or more video frames from the transceiver unit [102]. The processing unit [104] is further configured to process the video frames to identify at least one of a scene, an emotion and a text associated with the video. As used herein, a ‘scene’ refers to an event or view of an image or a video frame or video. For instance, a scene may be indoor or outdoor; mountains or sea; etc. Further, as used herein, ‘emotion’ refers to the feeling associated with an image or a video frame or video. For instance, emotion may include happy, sad, excited, worried, anxious, etc.

The processing unit [104] is further configured to determine at least one genre based on the at least one of the scene, the emotion and the text associated with the video. As used herein, ‘genre’ refers to a style or category of music. For instance, genre may include pop, rock, jazz, country, disco, folk, instrumental, dance, funk, soul, etc.

The processing unit [104] is also configured to identify one or more music samples based on at least one of the genre and a user preference. As used herein, ‘music sample’ refers to a music clip. For instance, a music clip may be a song or a part thereof. The length of the music clip may vary depending on the clip and may be measured in seconds.

As used herein, a “processing unit” or “processor” includes one or more processors, wherein processor refers to any logic circuitry for processing instructions. A processor may be a general-purpose processor, a special purpose processor, a conventional processor, a digital signal processor, a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits, Field Programmable Gate Array circuits, any other type of integrated circuits, etc. The processor may perform signal coding, data processing, input/output processing, and/or any other functionality that enables the working of the system according to the present disclosure. More specifically, the processor or processing unit is a hardware processor.

The music generator [106] is coupled to the transceiver unit [102] and the processing unit [104]. The music generator [106] is configured to generate a background music for the video based on the one or more music samples. The music generator [106] is further configured to mix the generated background music with the video to generate a modified video, wherein the modified video includes the same video frames as the original video but includes background music determined by the system [100].

The storage unit [108] is coupled to the transceiver unit [102], the processing unit [104], and the music generator [106], wherein the storage unit [108] is configured to store the video, one or more pre-defined scenes, one or more pre-defined emotions, one or more pre-defined keywords, one or more pre-defined genres and one or more pre-defined music samples. The storage unit [108] is further configured to store any other intermediate information generated by any of the units in the system [100]. The storage unit [108] is also configured to store the updated or modified video generated by the music generator [106].

Figure 2 illustrates an exemplary method flow diagram for a method for generating background music for a video, in accordance with exemplary embodiments of the present disclosure. As shown in Fig. 2, the method begins at 202 and proceeds to step 204.

At step 204, the video is received at the transceiver unit [102], wherein the video comprises one or more video frames. The video may be received at the transceiver unit [102] at step 204 from a camera unit, the storage unit [108] or any other external storage unit. For instance, when a user is recording a video using a camera unit of a user equipment, the video frames of the video may be received at the transceiver unit [102] in real-time for further processing. In another example, the video may be stored at the storage unit [108] of the system [100], such as a pre-recorded video or a pre-stored video. In yet another example, the video may be stored at an external storage unit, such as on a cloud server or on a server of a social networking website, from which the user may download it on his/her electronic device for further processing.

The disclosure also encompasses receiving the video in response to an action taken by the user. For instance, while recording a video on an electronic device, the user may take an action by selecting a smart sound mode on the user interface of the camera application of the electronic device for enabling the implementation of the disclosure. In another instance, the user may take an action to implement the disclosure by selecting a smart sound option for a pre-stored video.

As also discussed above, the video may be received at the transceiver unit [102] in entirety or in parts. For instance, if the video is received from the storage unit [108] or an external storage unit, then the video may be received in entirety at one instant. However, in another example, if the video is received at the transceiver unit [102] from a camera unit of an electronic device or user equipment, then such video may be received in parts.

Next, at step 206, the one or more video frames are processed by the processing unit [104] to identify at least one of a scene, an emotion and a text associated with the video. Step 206 includes receiving the video from the transceiver unit [102] prior to processing the video. The disclosure encompasses that the video is received by the processing unit [104] from the transceiver unit [102], in parts or in entirety at one instance.

The processing unit [104] identifies one or more scenes of the video using a scene detection model. Each video frame of the video received at the processing unit [104] is processed to identify a scene for that video frame. The scene detection model may be based on object detection. For instance, objects in a video frame may be identified and thereafter mapped to a pre-defined set of scenes. If, for example, a video frame comprises objects such as the sun and hills, the same may be mapped to the scene of mountains and/or an outdoor scene. The scene detection is based on one or more pre-defined scenes stored in the system [100].
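By way of a non-limiting illustration, mapping detected objects to a pre-defined scene may be sketched as follows. The lookup table `OBJECT_TO_SCENE`, its labels, the majority-vote rule and the function name are assumptions made for this example; the object detector itself is taken as given and only its output labels are used.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical table: detected object label -> pre-defined scene.
OBJECT_TO_SCENE = {
    "sun": "outdoor",
    "hill": "mountains",
    "wave": "sea",
    "sofa": "indoor",
}


def scene_for_frame(detected_objects: List[str]) -> Optional[str]:
    """Return the scene that most detected objects map to, or None."""
    votes = Counter(
        OBJECT_TO_SCENE[obj] for obj in detected_objects if obj in OBJECT_TO_SCENE
    )
    if not votes:
        return None  # no known object, so no scene is recorded for the frame
    return votes.most_common(1)[0][0]
```

A majority vote is only one possible mapping rule; a weighted or learned mapping could equally be used.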

The disclosure also encompasses detecting more than one scene in a video. For instance, a video may comprise a few video frames depicting the beach and the sea and thereafter a few video frames of the inside of an ice-cream parlour. In such a case, the disclosure identifies two distinct indoor and outdoor scenes from such video frames. In a scenario wherein two or more scenes are identified for a single video frame, the disclosure also encompasses resolving the conflict between the two or more scenes based on historical data or any other parameter. Historical data may include scenes detected in the previous frames of the same video. Other parameters may include information about the emotions from other sources such as social networking applications, text or speech included in the video frames, captions given by a user while saving or sharing such video, etc.

The disclosure also encompasses storing the scenes of each video frame or set of video frames for further processing. In an embodiment, no scene may be detected by the processing unit [104] for a particular frame, in which case the processing unit [104] may store a null value for the scene for the frame.

The processing unit [104] identifies one or more emotions based on an emotion detection model. Every video frame received at the processing unit [104] is processed to identify emotions. The emotion detection model may be based on facial emotion detection. For instance, each video frame may be processed to identify faces in the frame based on facial detection techniques, and thereafter the emotion may be detected from each identified face. Emotion detection may also be based on background sound or noise captured in the video frames. For instance, if a video frame captures a sound of people laughing, then a happy emotion may be interpreted by the processing unit [104]. The emotion detection is also based on one or more pre-defined emotions stored in the system [100].

The disclosure encompasses detecting or identifying more than one emotion in a particular video or even in a single frame. For instance, if the video frame comprises an image of people laughing but the background noise indicates shouting, then two emotions may be identified, i.e. happy and angry. In such a case, the disclosure also encompasses resolving the conflict between the two or more emotions based on historical data or any other parameter. Historical data may include emotions detected in the previous frames of the same video. Other parameters may include information about the emotions from other sources such as social networking applications, text or speech included in the video frames, captions given by a user while saving or sharing such video, etc.
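By way of a non-limiting illustration, resolving a conflict between candidate emotions using historical data may be sketched as follows. The function name, the frequency-based rule and the fallback to the first candidate are assumptions made for this example; the disclosure leaves the resolution rule open. The same approach could be applied to conflicting scenes.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_emotion(
    candidates: Sequence[str], history: Sequence[Optional[str]]
) -> Optional[str]:
    """Pick one candidate emotion, preferring the one seen most in earlier frames."""
    if not candidates:
        return None
    # Count emotions stored for previous frames, skipping null entries.
    past = Counter(e for e in history if e is not None)
    # max() keeps the first candidate when counts are tied (including all zero).
    return max(candidates, key=lambda e: past[e])
```

For a frame with candidates `["happy", "angry"]` in a video whose earlier frames were mostly labelled happy, this sketch keeps `"happy"`.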

The disclosure also encompasses storing the emotions of each video frame or set of video frames for further processing. In an embodiment, no emotion may be detected by the processing unit [104] for a particular frame, in which case the processing unit [104] may store a null value for the emotion for the frame.

The processing unit [104] identifies text from a video based on a speech-to-text model. Each video frame of the video is processed to identify sound/background noise from the video frame and the same is converted into text. The text may typically be long, including full sentences, etc., and may require further cleaning. The disclosure encompasses identifying keywords from the text detected from each video frame, wherein the keywords are words that are important for understanding the context of the video frame/video. The detection of text is also based on one or more pre-defined keywords stored in the system [100].

Since one sentence may be broken down or spread across various video frames, the identification of text from a video may be done in batches of one or more video frames, instead of processing each frame separately, to obtain more accurate results.
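By way of a non-limiting illustration, keyword identification over a batch of frames may be sketched as follows. The speech-to-text step is assumed to have already produced one text fragment per frame; the keyword set `PREDEFINED_KEYWORDS`, the punctuation stripping and the function name are assumptions made for this example.

```python
from typing import Iterable, List, Set

# Hypothetical set of pre-defined keywords stored in the system.
PREDEFINED_KEYWORDS: Set[str] = {"birthday", "wedding", "beach", "party"}


def keywords_for_batch(fragments: Iterable[str]) -> List[str]:
    """Join per-frame text fragments and keep pre-defined keywords in order."""
    text = " ".join(fragments).lower()
    found: List[str] = []
    for raw in text.split():
        word = raw.strip(".,!?;:\"'")  # basic cleaning of the transcript
        if word in PREDEFINED_KEYWORDS and word not in found:
            found.append(word)
    return found
```

Joining the fragments before matching is what lets a sentence that spans several frames be analysed as a whole.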

The disclosure also encompasses storing the text or keywords of each video frame or set of video frames for further processing. In an embodiment, no text may be detected by the processing unit [104] for a particular frame, in which case the processing unit [104] may store a null value for the text for the frame.

Thus, by the end of step 206, one or more scenes, emotions and text are identified for each video frame. Together, these comprise the context of the video being analysed.

Next, at step 208, the processing unit [104] determines at least one genre based on the at least one of the scene, the emotion and the text associated with the video. The processing unit [104] applies one or more models based on neural networks to identify at least one genre based on the context of the video. In an exemplary implementation, the output of the neural network can be considered to be a one-dimensional matrix where each value represents a specific genre. If the i-th value is one, the i-th genre suits the video context. If the i-th value is zero, the i-th genre should be ignored. The disclosure encompasses determining a genre based on a set of pre-defined genres.
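By way of a non-limiting illustration, decoding such a one-dimensional output into genre names may be sketched as follows. The genre list `GENRES`, its ordering and the function name are assumptions made for this example; the neural network that produces the output is not modelled here.

```python
from typing import List, Sequence

# Hypothetical ordered list of pre-defined genres; index i corresponds to value i.
GENRES = ["pop", "rock", "jazz", "country", "instrumental"]


def genres_from_output(output: Sequence[int]) -> List[str]:
    """Return the genres whose position in the output holds the value one."""
    if len(output) != len(GENRES):
        raise ValueError("output length must match the number of pre-defined genres")
    return [genre for genre, flag in zip(GENRES, output) if flag == 1]
```

With the assumed ordering, an output of `[1, 0, 0, 0, 1]` selects pop and instrumental and ignores the remaining genres.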

After determining the relevant genre, this information is further processed by the processing unit [104] to identify one or more music samples for each of the selected genres. The sample selection process can either be random or user-preference dependent. For instance, if the user likes a particular music sample (for a particular genre) on a social networking website, then such a sample may be given preference over other samples in the same genre.
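By way of a non-limiting illustration, selecting a sample for one genre either at random or according to user preference may be sketched as follows. The function name, the representation of samples as identifiers and the set of liked samples are assumptions made for this example.

```python
import random
from typing import Optional, Sequence, Set


def pick_sample(
    samples: Sequence[str], liked: Set[str], rng: Optional[random.Random] = None
) -> str:
    """Choose one sample of a genre, preferring samples the user has liked."""
    if not samples:
        raise ValueError("no samples available for this genre")
    chooser = rng if rng is not None else random
    preferred = [s for s in samples if s in liked]
    # Fall back to a purely random choice when the user has no liked sample.
    return chooser.choice(preferred or list(samples))
```

Passing a seeded `random.Random` instance makes the random branch reproducible, which may be convenient for testing.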

In an exemplary implementation, each music sample is represented as a matrix, and the output of the processing unit [104] from this step is a concatenated matrix of all sample matrices.
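By way of a non-limiting illustration, concatenating sample matrices may be sketched as follows. The representation of each matrix as a list of rows (one row per time step), the concatenation along the row axis and the function name are assumptions made for this example; the disclosure does not specify the matrix layout or the concatenation axis.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def concatenate_samples(matrices: Sequence[Matrix]) -> Matrix:
    """Stack sample matrices row-wise into one concatenated matrix."""
    widths = {len(row) for matrix in matrices for row in matrix}
    if len(widths) > 1:
        raise ValueError("all sample matrices must have the same row width")
    # Copy each row so the result does not alias the input matrices.
    return [list(row) for matrix in matrices for row in matrix]
```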

Subsequently, at step 212, a background music for the video is generated based on the one or more music samples. The background music generated by the music generator is an amalgamation of the music samples identified in the previous step. Next, the method ends at step 214. The disclosure encompasses mixing the generated background music with the video to generate a modified video.

The disclosure encompasses that when multiple and distinct scenes, emotions, text, etc. are identified, the video is broken down or separated into one or more parts, wherein each part comprises one or more video frames and each part has a unique video context distinct from another part. For example, in a video capturing a wedding, the video may comprise, for example, 200 frames of happy events and last 75 frames of sad events. For efficient processing and achieving the objectives of the disclosure, the video is broken into two parts, i.e. part one comprising first 200 frames and part two comprising subsequent and last 75 frames. This part one and part two are then processed separately to identify one or more genres and subsequently music samples for each part. Thereafter, the music samples identified for each part are mixed or amalgamated in sequence to generate a background music such that the background music for part one contains happy music samples while the background music for second part contains sad or instrumental music.
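By way of a non-limiting illustration, separating a video into parts having distinct contexts may be sketched as follows. The per-frame context label, the half-open frame ranges and the function name are assumptions made for this example; a single label per frame is used here for simplicity.

```python
from itertools import groupby
from typing import Hashable, List, Sequence, Tuple


def split_into_parts(
    frame_contexts: Sequence[Hashable],
) -> List[Tuple[Hashable, int, int]]:
    """Group consecutive frames with the same context into (context, start, end)."""
    parts = []
    start = 0
    for context, run in groupby(frame_contexts):
        length = sum(1 for _ in run)
        parts.append((context, start, start + length))  # end index is exclusive
        start += length
    return parts
```

Applied to 200 frames labelled happy followed by 75 frames labelled sad, the sketch returns two parts covering frames 0 to 199 and 200 to 274, each of which can then be processed separately for genre and sample selection.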

The system [100] may entirely or partly reside in a user equipment or electronic device. The device may be capable of capturing a video in real-time via a camera unit. As used herein, the user equipment or electronic device may include, but is not limited to, any electrical, electronic, electromechanical and computing device or equipment. The user equipment may also include, but is not limited to, a mobile phone, smart phone, laptop, a general-purpose computer, desktop, personal digital assistant, tablet computer, wearable device or any other computing device.

Figure 3 illustrates an exemplary user interface diagram of the user equipment illustrating an example implementation of the disclosure. As shown in Fig. 3, the interface for implementing the disclosure at a user equipment may comprise a preview [302] where a preview from the camera unit is shown to a user while capturing an image or a video using a camera application. The interface also includes an icon [304] to switch between front and rear camera. The interface also shows a capture/record icon [306] which, when initiated by the user, results in capturing of an image or a video. The interface typically also provides a thumbnail [310] that allows the user to quickly shift to the already recorded videos. The interface also shows a smart music icon [308], which when initiated by the user allows implementation of the present disclosure. For instance, the user may touch the icon [308] on the touch screen display of the user device, which will initiate the method as described with reference to Figure 2.

As evident from the above disclosure, the present disclosure aims to enhance video recording experience for users by supplementing it with background music relevant to the context of video/scenario. The disclosure helps to reduce the noise in videos. It also provides context relevant unique audio to enhance the captured memories of a user.

While considerable emphasis has been placed herein on the disclosed embodiments, it will be appreciated that many embodiments can be made and that many changes can be made to the embodiments without departing from the principles of the present disclosure. These and other changes in the embodiments of the present disclosure will be apparent to those skilled in the art, whereby it is to be understood that the foregoing descriptive matter to be implemented is illustrative and non-limiting.

Claims

1. A method for generating background music for a video, the method comprising:

- receiving, at a transceiver unit [102], the video comprising one or more video frames;

- processing, by a processing unit [104], the one or more video frames to identify at least one of a scene, an emotion and a text associated with the video;

- determining, by the processing unit [104], at least one genre based on at least one of the scene, the emotion and the text associated with the video;

- identifying, by the processing unit [104], one or more music samples based on at least one of the genre and a user preference; and

- generating, by a music generator [106], a background music for the video based on the one or more music samples.

2. The method as claimed in claim 1, further comprising mixing the generated background music with the video to generate a modified video.

3. The method as claimed in claim 1, wherein the video is received from one of a camera framework and a memory unit [108].

4. The method as claimed in claim 1, wherein processing, at the processing unit [104], the one or more video frames to identify at least one of the scene, the emotion and the text associated with the video is based on one or more pre-defined scenes, one or more pre-defined emotions and one or more pre-defined keywords.

5. The method as claimed in claim 1, wherein determining, at the processing unit [104], the at least one genre based on the identified at least one of the scene, the emotion and the text associated with the video, is further based on one or more pre-defined genres.

6. The method as claimed in claim 1, wherein identifying, at the processing unit [104], the one or more music samples based on at least one of the genre and a user preference, is further based on one or more pre-defined music samples.

7. A system for generating background music for a video, the system comprising:

- a transceiver unit [102] configured to receive the video comprising one or more video frames;

- a processing unit [104] coupled to the transceiver unit [102], the processing unit [104] configured to:

process the one or more video frames to identify at least one of a scene, an emotion and a text associated with the video,

determine at least one genre based on at least one of the scene, the emotion and the text associated with the video, and

identify one or more music samples based on at least one of the genre and a user preference; and

- a music generator [106] coupled to the transceiver unit [102] and the processing unit [104], the music generator [106] configured to generate a background music for the video based on the one or more music samples.

8. The system as claimed in claim 7, further comprising a storage unit [108] coupled to the transceiver unit [102], the processing unit [104], and the music generator [106], wherein the storage unit [108] is configured to store the video, one or more pre-defined scenes, one or more pre-defined emotions, one or more pre-defined keywords, one or more pre-defined genres and one or more pre-defined music samples.

9. The system as claimed in claim 7, wherein the transceiver unit [102] is configured to receive the video from one of a camera frame unit and the storage unit [108].

10. The system as claimed in claim 7, wherein the music generator [106] is configured to mix the generated background music with the video to generate a modified video.

11. A user equipment comprising:

a system [100] for generating background music for a video, the system further comprising:

- a transceiver unit [102] configured to receive the video comprising one or more video frames;

- a processing unit [104] coupled to the transceiver unit [102], the processing unit configured to:

process the one or more video frames to identify at least one of a scene, an emotion and a text associated with the video,

determine at least one genre based on the at least one of the scene, the emotion and the text associated with the video, and

identify one or more music samples based on at least one of the genre and a user preference; and

- a music generator [106] coupled to the transceiver unit [102] and the processing unit [104], the music generator [106] configured to generate a background music for the video based on the one or more music samples.