WO2020055173A1 - Method and system for audio content-based recommendations
- Publication number
- WO2020055173A1 (PCT/KR2019/011845)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio content
- audio
- feature set
- contents
- content
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/638—Presentation of query results
- G06F16/639—Presentation of query results using playlists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/635—Filtering based on additional data, e.g. user or group profiles
- G06F16/637—Administration of user profiles, e.g. generation, initialization, adaptation or distribution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/65—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the reducing of the feature set of each of the plurality of audio contents comprises determining a dimensionality reduction function for the feature set of each of the plurality of audio contents based on audio parameters and component factors which are associated with characteristics of audio contents, determining a weight factor associated with each of the features included in the feature set, and reducing a dimension of the feature set based on the dimensionality reduction function and the weight factor.
- the audio parameters comprise a bandwidth derived from sound frequency of the at least one audio content.
- the classifying of the at least one audio content comprises generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content, labeling the at least one audio content using a Cartesian plane in the MEL spectrogram based on the frequency analysis and a mapping function, and classifying the at least one audio content into the at least one cluster based on the labeling.
- the mapping function is used to classify a category of the at least one audio content using a weight factor and the frequency value.
- the providing of the recommendation comprises detecting the candidate audio content in the electronic device, performing a frequency analysis of a feature set of the candidate audio content, predicting at least one recommendation corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using a machine learning model, displaying an indication indicating an availability of the at least one recommendation, receiving an input on the indication, and displaying the at least one recommendation on a display of the electronic device.
- the providing of the recommendation further comprises detecting a user response to the at least one recommendation, and updating the availability of the at least one recommendation based on the user response.
- the predicting of the at least one recommendation comprises predicting the at least one recommendation corresponding to the candidate audio content based on the at least one cluster and determining whether the feature set of the at least one audio content is substantially similar to or the same as the feature set of the candidate audio content.
- an apparatus for audio content-based recommendations comprising a display; and a processor configured to receive a plurality of audio contents, decode the plurality of audio contents to extract a feature set of each of the plurality of audio contents, perform a frequency analysis of the feature set of each of the plurality of audio contents, reduce the feature set of each of the plurality of audio contents based on the frequency analysis, classify at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents, and provide, to the display, a recommendation of candidate audio content based on the at least one cluster.
- the performing of the frequency analysis comprises generating a frequency spectrogram for each of the plurality of audio contents by extracting features of at least one audio content among the plurality of audio contents, determining frequencies in a first time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the determined frequencies in the first time interval and the magnitudes for the at least one audio content, wherein the dominant group comprises the maximum number of frequencies with non-zero magnitudes.
- the classifying of the at least one audio content comprises generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content, labeling the at least one audio content using a Cartesian plane in the MEL spectrogram based on the frequency analysis and a mapping function, and classifying the at least one audio content into the at least one cluster based on the labeling.
- a non-transitory computer readable medium configured to store one or more computer programs including instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: receiving a plurality of audio contents; decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents; performing a frequency analysis of the feature set of each of the plurality of audio contents; reducing the feature set of each of the plurality of audio contents based on the frequency analysis; classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents; and providing a recommendation of candidate audio content based on the at least one cluster.
- the terms “1st” or “first” and “2nd” or “second” may modify corresponding components regardless of importance or order and are used to distinguish one component from another without limiting the components.
- Embodiments herein disclose a method for audio content-based recommendations.
- the method includes receiving, by an electronic device, a plurality of audio contents, decoding, by the electronic device, the plurality of audio contents to extract a feature set of each of the audio contents, performing, by the electronic device, a frequency analysis of the feature set of each of the audio contents, reducing, by the electronic device, the feature set of each of the audio contents based on the frequency analysis, clustering, by the electronic device, at least one audio content into at least one cluster based on the reduced feature set, and providing, by the electronic device, the audio content-based recommendations based on the at least one cluster.
- the proposed method can be used to provide a predictive analysis of the audio content-based recommendations (e.g., music recommendations, emotions, genres, sound effects, or the like) to a user based on a machine learning model in a cost-effective manner.
- the proposed method can be used to enhance the user's music listening experience through music content analysis based upon an artificial intelligence model (e.g., a machine learning model).
- the proposed method can be used to provide a content-based mechanism which analyzes the frequency distribution and audio parameters of the audio data/content of individual music files and employs supervised learning and deep learning models on an n-th vector and spectrogram produced for a predictive model. This can be used to achieve an automatic music recommendation model, a music beat labelling prediction model, a music emotion labelling prediction model, a music genre prediction model, and a default best-suited audio effect model for individual music content and for groups of songs in an effective manner.
- the method can be used to learn a taste of the user by redefining the groups of songs depending on the user's taste.
- for example, if the group including happy songs (the happy group) is classified as the dominating group, the electronic device will tend to add more songs to the happy group from the nearest neighbor group, i.e., the group at the smallest distance from the happy group.
- the method may include providing a self-adjusting model depending upon the user's taste. The method can be used to predict the emotions extracted from the song using the deep learning model.
- the method may include recommending the music depending upon the content present in the electronic device without user intervention.
- the method may include predicting genres of unknown songs without the user intervention.
- the method may present the audio data to the user according to the user's taste.
- the method may include detecting the user pattern of audio effects settings so that playlist creation and audio effects are provided in a dynamic manner.
- the method enables users to identify and remove duplicate audio content so as to reduce the memory management burden.
- the proposed method allows the user to obtain the most suitable set of songs, recommended genres, and emotions of songs for the available audio content in a cost-effective and quick manner.
- the method may include automatically suggesting various sound effects which enhance the user's music listening experience.
- the method may include providing full control over storing user preferences related to the sound effects, the presentation of contents, and the information predicted at run time as part of machine learning for the user's content, in an effective way.
- the method may include providing a self-learning model which reduces user manual efforts.
- the method may include providing a predictive analysis of the music recommendations, emotions, genres, and sound effects to the user without using metadata information, natural language processing (NLP), and/or database query options.
- the method may include providing the recommendation on the basis of audio parameters and frequency analysis of dominating groups so as to improve performance of the predictive model. This results in providing enriched data to the user in an interactive way.
- the method may include providing different sound effects to the user based on the audio content.
- the method may include providing seamless interaction with the content, where uninteresting content will be skipped or left unused automatically and the user can easily find content of her or his interest.
- the method may include providing full control over user preferences related to the sound effects and volume for each content item.
- the method may include providing the music content by reducing a mobile data usage.
- recommendations of songs may be made to the user on the basis of content analysis, so that the most preferred audio content is visible to the user first.
- the user can create a playlist on the basis of the content the user prefers to listen to. Further, a dynamic playlist can be created on the basis of content-based partitions.
- the predictive model may learn the user behavior and suggest content online for the user depending on which partition the user prefers to listen to. The predictive model may also learn the listening pattern of the user, thereby providing the user with the most appropriate neighbor partition, so that the user need not search for similar preferred songs. If the number of songs in the current partition is small, the partition will be recreated by absorbing songs from the neighbor partition. This results in extending the scope of a partition and aggregating partitions.
- Referring now to the drawings, and more particularly to FIGS. 1 through 16C, preferred embodiments are shown.
- FIG. 1 is a block diagram of an electronic device 100, according to an embodiment of the disclosure.
- the electronic device 100 may be, for example, but not limited to a cellular phone, a smart phone, a Personal Digital Assistant (PDA), a tablet computer, a laptop computer, a music player, a video player or the like.
- the electronic device 100 includes an audio content-based recommendation engine 110, a communicator 120, a memory 130, a processor 140, and a display 150.
- the audio content-based recommendation engine 110, the communicator 120, and the processor 140 may be implemented as one hardware processor.
- the audio content-based recommendation engine 110 may be configured to receive a plurality of audio contents and decode the plurality of audio contents to extract a feature set of each of the audio contents.
- the feature set may be, for example, a beat of audio content, an emotion embedded in audio content, a genre of audio content, a sound effect of audio content, and/or a playlist of audio content.
- the audio content-based recommendation engine 110 may be configured to process the currently playing song so that the frequency variation is observed over the whole song, in order to analyze the frequency variation and audio parameters.
- the audio parameter may include, for example, at least one selected from a group of a bandwidth, a sample rate, a bit depth, a flux, a root mean square, a variance, and a peak finder derived from the sound of audio content or the sound frequency of audio content.
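- To make the parameter list concrete, the following is a minimal sketch of extracting several of these parameters with the librosa library; the use of librosa and the aggregation choices are assumptions for illustration, and bit depth (a container property readable from file metadata) is omitted.

```python
# Sketch only: one plausible way to derive the named audio parameters.
import librosa
import numpy as np

def extract_audio_parameters(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=True)  # keep the native sample rate
    S = np.abs(librosa.stft(y))                     # magnitude spectrogram
    return {
        "sample_rate": sr,
        "bandwidth": float(librosa.feature.spectral_bandwidth(S=S, sr=sr).mean()),
        "flux": float(librosa.onset.onset_strength(y=y, sr=sr).mean()),  # spectral-flux proxy
        "rms": float(librosa.feature.rms(y=y).mean()),                   # root mean square
        "variance": float(np.var(y)),
        "peak": float(np.abs(y).max()),                                  # crude peak finder
    }
```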
- the audio content-based recommendation engine 110 may be configured to perform a frequency analysis of the feature set of each of the audio contents.
- the audio content-based recommendation engine 110 may be configured to perform the frequency analysis of the feature set of each of the audio contents by generating a frequency spectrogram for each of the audio contents by storing features of the at least one audio content as a frequency spectrum image, determining frequencies in at least one time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the frequencies in the at least one time interval and the magnitudes associated with each audio content.
- the dominant group includes the maximum number of frequencies with non-zero magnitudes.
- in an example, the weighted maximum frequency magnitude Fmg1 for the first audio content may be computed as Fmg1 = (f1*m1 + f2*m2 + ... + fn*mn) / (f1 + f2 + ... + fn), where f denotes frequency, m denotes magnitude, and f1, f2, ..., fn lie in the most dominating group.
- the frequency spectrum corresponding to a single audio content over the frequency range from 0 to 22,000 Hz is depicted. As depicted in FIG. 5, each frequency has a magnitude corresponding to its sound pressure level (SPL). Further, the audio content-based recommendation engine 110 may be configured to analyze the frequency magnitudes in windows of size 1,000 Hz, so that a total of 22 intervals are formed. Out of these 22 intervals, the interval that has the most non-zero magnitudes and whose average magnitude is greater than that of the other 21 groups is the most dominating group. Further, the audio content-based recommendation engine 110 obtains a total of 22 values of Fmg, one for each interval, to determine the maximum frequency magnitude, which is the global Fmgmax of the single song.
- that a group is the most dominating means that the group has the greatest number of frequencies with non-zero magnitudes and that the average magnitude of the group is also the highest.
- f1, f2, f3, ..., f1000 are the frequencies in the interval and m1, m2, m3, ..., m1000 are the magnitudes of those frequencies.
- the time will be the time value, in milliseconds of song duration, within the most dominating group interval. This phase will give Fmax and Time.
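- The interval analysis above can be summarized in code. The sketch below is a best-effort reading of the description: it computes Fmg for each of the 22 intervals of 1,000 Hz and picks the dominating interval; using a single whole-song FFT magnitude spectrum instead of a spectrogram is a simplifying assumption.

```python
# Sketch: dominant-group search and the weighted frequency magnitude Fmg.
import numpy as np

def analyze_intervals(y: np.ndarray, sr: int, window_hz: int = 1000, top_hz: int = 22000):
    spectrum = np.abs(np.fft.rfft(y))                 # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    fmgs, scores = [], []
    for lo in range(0, top_hz, window_hz):            # 22 intervals of 1,000 Hz
        mask = (freqs >= lo) & (freqs < lo + window_hz)
        f, m = freqs[mask], spectrum[mask]
        if f.size == 0 or f.sum() == 0:
            fmgs.append(0.0)
            scores.append((0, 0.0))
            continue
        # Fmg = (f1*m1 + ... + fn*mn) / (f1 + ... + fn), per the formula above
        fmgs.append(float((f * m).sum() / f.sum()))
        # dominance criteria: most non-zero magnitudes, then highest average magnitude
        scores.append((int(np.count_nonzero(m)), float(m.mean())))
    dominant = max(range(len(scores)), key=lambda i: scores[i])
    return dominant, max(fmgs)                        # dominating interval index, global Fmg_max
```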
- the audio content-based recommendation engine 110 is configured to reduce the feature set of each of the audio contents based on the frequency analysis.
- the reduction of the feature set may be a reduction of a dimension of the feature set.
- the feature set of each of the audio contents is reduced by determining a dimensionality reduction function for the feature set of each of the audio contents based on component factors and audio parameters, determining a weight factor associated with each feature in the feature set, and reducing the feature set of each of the audio contents based on the dimensionality reduction function and the weight factor.
- the component factors represent the actual contribution of the audio parameters in defining a particular characteristic of the audio content.
- the component factors are associated with a particular characteristic of the audio content.
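- A minimal sketch of this reduction step follows, assuming the dimensionality reduction function F(x) is a weighted linear combination of normalized features; the normalization and the interpretation of the component factors as relative weights are assumptions, not the disclosed function.

```python
import numpy as np

def reduce_features(X: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Reduce an M x N feature matrix to an M x 1 weighted-factor vector."""
    Xn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # normalize each feature column
    w = weights / weights.sum()        # component factors as relative weights
    return Xn @ w                      # F(x): one weighted factor per audio ID
```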
- the audio content-based recommendation engine 110 may be configured to cluster the at least one audio content into at least one cluster based on the reduced feature set.
- the clustering of the at least one audio content into the at least one cluster may be performed by creating a MEL spectrogram of the at least one audio content on a three-axis plane in which a first axis represents a frequency value derived from or associated with the at least one audio content, a second axis represents the reduced feature set, and a third axis represents time derived from or associated with the at least one audio content, labeling the at least one audio content using a Cartesian plane in the created MEL spectrogram based on the frequency analysis using a machine learning model and a mapping function, and clustering the at least one audio content into at least one cluster based on the labeling.
- FIG. 6 illustrates the MEL spectrogram representing an acoustic time-frequency representation of a sound, according to an embodiment of the disclosure.
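- For illustration, a standard two-dimensional MEL spectrogram (frequency by time, with magnitude as intensity) can be computed as below; the disclosure's third axis (the reduced feature set) would be layered on top of this. The use of librosa and the mel-band count are assumptions.

```python
# Sketch: log-scaled MEL spectrogram as an acoustic time-frequency representation.
import librosa
import numpy as np

def mel_spectrogram(path: str, n_mels: int = 128) -> np.ndarray:
    y, sr = librosa.load(path, sr=None, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)   # dB scale, as usually plotted
```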
- the mapping function is used to classify a category of the audio content in the Cartesian plane using the weighted factor and the frequency value.
- the weighted factor and the frequency value are obtained from the MEL spectrogram.
- the mapping function may determine to which category of beats, tempo, genre, or emotion the song belongs; the classification of the category of the audio content may be based on the weight factor and frequency values derived from the audio content.
- labeling may be performed relying upon the components identified during the content analysis.
- for example, if the song is classified as tempo, the song is labeled as a tempo-related song.
- FIG. 7 is a graph illustrating the content partition labeling and prediction modeling, according to an embodiment of the disclosure.
- a first quadrant, which contains the three values Tempo, lively, and stirring, can also contain multiple classification categories depending upon the exploration/extension of the content analysis.
- these four quadrants can be further expanded to 8, 16, 32, or more.
- the audio content-based recommendation engine 110 divides the Cartesian plane.
- the user of the audio content-based recommendation engine 110 may divide the plane along two axes, which is equivalent to 4 quadrants.
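- A sketch of such a quadrant-style mapping function is shown below; the quadrant labels and the choice of origin are hypothetical placeholders, not values from the disclosure.

```python
# Sketch: label a track by the quadrant of its (frequency value, weight factor) point.
QUADRANT_LABELS = {1: "tempo", 2: "happy", 3: "sad", 4: "relaxed"}  # hypothetical labels

def map_to_quadrant(freq_value: float, weight_factor: float,
                    f_origin: float, w_origin: float) -> str:
    right = freq_value >= f_origin        # x-axis: frequency value
    upper = weight_factor >= w_origin     # y-axis: weighted factor
    if right and upper:
        quadrant = 1
    elif not right and upper:
        quadrant = 2
    elif not right:
        quadrant = 3
    else:
        quadrant = 4
    return QUADRANT_LABELS[quadrant]
```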
- the audio content-based recommendation engine 110 may be configured to provide the audio content-based recommendations based on the at least one cluster.
- the audio content-based recommendations are provided by detecting a candidate audio content available in the electronic device 100, performing the frequency analysis of the feature set of the candidate audio content, predicting the audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using the machine learning model, displaying an indication indicating the availability of the audio content-based recommendations, detecting an input performed on the indication, and displaying the audio content-based recommendations on a display of the electronic device 100.
- the machine learning model can be, for example, but not limited to, a deep learning model.
- the audio content-based recommendations correspond to the at least one audio content having the feature set which is the same as or similar to the feature set of the candidate audio content, where the content-based recommendations correspond to at least one of the beat feature, the emotion feature, the genre feature, the sound effect, and/or the playlist.
- the cluster will represent the audio content list/song list of the same or similar frequency.
- in an example, a tempo label has been suggested by the prediction model depending upon the prediction and the labeled partition. Then, if the user listens to tempo-related songs, the audio content-based recommendation engine 110 may recommend other relevant tempo songs to the user.
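- This cluster-based recommendation step can be sketched as follows, assuming clusters are stored as simple label-to-track-list mappings; the data shapes are illustrative only.

```python
# Sketch: surface other tracks from the cluster of the currently playing song.
def recommend_from_cluster(playing_id: str,
                           clusters: dict[str, list[str]],
                           track_to_cluster: dict[str, str],
                           limit: int = 10) -> list[str]:
    label = track_to_cluster.get(playing_id)
    if label is None:
        return []                     # song not yet analyzed/clustered
    candidates = [t for t in clusters[label] if t != playing_id]
    return candidates[:limit]
```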
- FIG. 8 illustrates functions and operations of the machine learning model, according to an embodiment.
- the machine learning model can be a layer-based neural network (e.g., a 4-layer deep learning network for genre prediction, a 5-layer neural network for emotion prediction, or the like) and may train the neural network for a predictive analysis.
- the 4-layer neural network having 32, 16, 10, and 4 nodes at the respective levels may be implemented to achieve deep learning for emotion prediction.
- the predictive function may pass the feature vector set to the neural network and produce the output Yp, as shown in FIG. 8.
- the machine learning model may be created using the TensorFlow library. Initially, a data set of N songs may be built by obtaining tagging information from the Internet for emotion and genre information; the emotions and genres of these N songs are thus known on the basis of the tagging information available on the Internet. Further, the machine learning model may extract the features of the songs in the dataset. Using the TensorFlow library, the machine learning model correlates the current song's feature information with a category of songs. In an example, the emotions predicted through the machine learning model are sad, happy, party, relaxed, and angry. In another example, the genres predicted through the machine learning model are classical, metal, country, pop, rock, ballad, rhythm-and-blues, and disco.
- a 4-layer neural network may be used for genre prediction, in which the last (output) layer has 6 nodes for predicting genres, and a 4-layer neural network is used for emotion prediction, in which the last layer has 5 nodes for predicting emotions (the number of emotions is equal to the number of nodes in the last layer).
- there are only 4 layers for both the emotion prediction model and the genre prediction model.
- the neural network may include a 1st layer corresponding to 32 nodes, a 2nd layer corresponding to 16 nodes, and a last layer corresponding to 5 nodes (the number of emotions) for the emotion prediction model or 6 nodes (the number of genres) for the genre prediction model.
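- A minimal Keras sketch consistent with the quoted layer sizes (32 to 16 to a softmax output of 5 or 6 nodes) is given below; the input feature dimension and training settings are assumptions, not the disclosed configuration.

```python
import tensorflow as tf

def build_predictor(n_features: int, n_classes: int) -> tf.keras.Model:
    """n_classes = 5 for the emotion model, 6 for the genre model."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),  # softmax classifier
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

emotion_model = build_predictor(n_features=40, n_classes=5)  # sad/happy/party/relaxed/angry
genre_model = build_predictor(n_features=40, n_classes=6)    # six genre output nodes
```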
- the audio content-based recommendation engine 110 may be configured to present the nearest neighbor partitions together with the current partition, in order of increasing distance between the partitions.
- the first nearest partition is closest to the playing song partition as shown in FIG. 12.
- the labels may be created by the machine learning model on the basis of the extracted features provided to it and the categories assigned to the songs.
- in an example, if the audio content-based recommendation engine 110 first predicts the genre of the currently playing song as an unknown genre and next predicts the genre of the currently playing song as the POP genre, the audio content-based recommendation engine 110 recommends a next list of songs related to the POP genre.
- the electronic device 100 recommends a next list of song/songs related to sad emotion.
- the electronic device 100 may extract audio parameters from all audio files for content analysis. For each audio file, the electronic device 100 may analyze the frequency set to calculate the densest group having non-zero magnitudes. Further, the electronic device 100 computes the dimensionality reduction function F(x), which reduces an M*N matrix of M audio IDs having N features to an M*1 matrix of M audio IDs and a weighted factor. Further, the electronic device 100 may generate the MEL spectrogram as a three-dimensional graph using frequency, time, and audio vector components, and the MEL spectrogram is superimposed on the Cartesian plane. Further, the electronic device 100 may form N/K group partitions, where N is the total number of songs and K is the number of planes into which the songs have been partitioned. Further, the electronic device 100 may recommend a variety of songs to the user depending upon the partitions.
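- The final partitioning step can be sketched as follows, under the assumption that tracks sorted by their weighted factor are split into K contiguous groups of roughly N/K songs each; a contiguous split is a stand-in for the plane-based partitioning described above.

```python
import numpy as np

def partition_tracks(track_ids: list[str], weighted: np.ndarray, k: int = 4) -> list[list[str]]:
    """Split N tracks into K partitions of roughly N/K songs each."""
    order = np.argsort(weighted)          # similar weighted factors become neighbors
    groups = np.array_split(order, k)     # K contiguous groups
    return [[track_ids[i] for i in g] for g in groups]
```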
- the electronic device 100 may implement a softmax classifier for the training and implement deep learning logic using a Keras model.
- the processor 140 may be configured to execute instructions stored in the memory 130 and to perform various processes.
- the communicator 120 may be configured for communicating internally between internal hardware components and externally with external devices via one or more networks.
- the memory 130 may store instructions to be executed by the processor 140.
- the memory 130 also stores instructions to determine the location of the electronic device 100.
- the memory 130 may be a non-volatile memory which includes magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories.
- the memory 130 may, in some examples, be considered a non-transitory storage medium.
- the term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted to mean that the memory 130 is non-movable.
- in certain examples, the memory 130 can be configured to store larger amounts of information.
- a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
- FIG. 1 illustrates the electronic device 100 with various hardware components but it is to be understood that other embodiments are not limited thereto.
- the electronic device 100 may include a smaller or larger number of components.
- the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention.
- one or more components can be combined together to perform the same or a substantially similar function to perform the audio content-based recommendations in the electronic device 100.
- FIG. 2 is a block diagram of the audio content-based recommendation engine 110, according to an embodiment of the disclosure.
- the audio content-based recommendation engine 110 may include a feature extractor 110a, a frequency analyzer 110b, a dimensionality reducer 110c, a classifier model 110d, a learning model 110e and a prediction model 110f.
- the feature extractor 110a and the frequency analyzer 110b may be configured to receive the plurality of audio contents and to decode the plurality of audio contents to extract the feature set of each of the audio contents. Further, the feature extractor 110a and the frequency analyzer 110b may be configured to perform the frequency analysis of the feature set of each of the audio contents. Further, the dimensionality reducer 110c may be configured to reduce the feature set of each of the audio contents based on the frequency analysis. Further, the classifier model 110d and the learning model 110e may be configured to cluster the at least one audio content into at least one cluster based on the reduced feature set. Further, the prediction model 110f may be configured to provide the audio content-based recommendations based on the at least one cluster. Further, the audio content-based recommendation engine 110 may implement the softmax classifier for the training and implement deep learning logic using a Keras model. The softmax classifier may be used for emotion classification and for genre classification in the audio content.
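- For orientation, the components 110a through 110f could be chained roughly as in the sketch below; the class and callable signatures are hypothetical stand-ins, not the patented implementation.

```python
class RecommendationEngine:
    """Hypothetical wiring of the modules 110a-110f; callables are stand-ins."""

    def __init__(self, extractor, analyzer, reducer, classifier, predictor):
        self.extractor = extractor    # 110a: decodes files into feature sets
        self.analyzer = analyzer      # 110b: frequency analysis per feature set
        self.reducer = reducer        # 110c: feature sets -> weighted factors
        self.classifier = classifier  # 110d/110e: weighted factors -> clusters
        self.predictor = predictor    # 110f: clusters + playing song -> recommendations

    def process_library(self, paths):
        features = [self.extractor(p) for p in paths]
        analyses = [self.analyzer(f) for f in features]
        reduced = self.reducer(analyses)
        return self.classifier(reduced)

    def recommend(self, playing_id, clusters):
        return self.predictor(playing_id, clusters)
```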
- FIG. 2 illustrates various hardware components of the audio content-based recommendation engine 110 but it is to be understood that other embodiments are not limited thereto.
- the audio content-based recommendation engine 110 may include a smaller or larger number of components.
- the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention.
- one or more components can be combined together to perform the same or a substantially similar function to perform the audio content-based recommendations in the audio content-based recommendation engine 110.
- FIG. 3 is a flow diagram 300 illustrating a method for the audio content-based recommendations in the electronic device 100, according to an embodiment of the disclosure.
- the operations 302 through 312 may be performed by the audio content-based recommendation engine 110 or the processor 140.
- the method includes receiving the plurality of audio contents.
- the method includes decoding the plurality of audio contents to extract the feature set of each of the audio contents.
- the method includes performing the frequency analysis of the feature set of the each of the audio contents.
- the method includes reducing the feature set of each of the audio contents based on the frequency analysis.
- the method includes clustering or classifying the at least one audio content into at least one cluster based on the reduced feature set.
- the method includes providing the audio content-based recommendations based on the at least one cluster.
- FIG. 4 is a flow diagram 400 illustrating a method for the audio content-based recommendations in the electronic device 100, according to an embodiment of the disclosure.
- the operations 402 through 416 are performed by the audio content-based recommendation engine 110 or the processor 140.
- the method includes detecting the candidate audio content in the electronic device 100.
- the method includes performing the frequency analysis of the feature set of the candidate audio content.
- the method includes predicting the audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using the machine learning model.
- the method includes displaying the indication indicating the availability of the audio content-based recommendations.
- the method includes detecting the input performed on the indication.
- the method includes displaying the audio content-based recommendations.
- the method includes detecting the user response to the audio content-based recommendations.
- the method includes automatically updating the audio content available in the at least one cluster having the audio content-based recommendations based on the user response.
- FIG. 9 is an example scenario in which the electronic device 100 clusters the audio contents based on the frequency analysis of the feature set, according to an embodiment of the disclosure.
- the audio contents are labeled using the Cartesian plane in the created MEL spectrogram based on the frequency analysis using the machine learning model. Further, the audio contents are clustered based on the labeling.
- cluster 1 is formed (i.e., 3 happy songs are found on the basis of the same type of audio content, and the 3 songs lie in the same frequency spectrum), and similarly, cluster 2 is formed (i.e., 3 sad songs are found on the basis of the same type of audio content, and the 3 songs lie in the same frequency spectrum).
- FIG. 10 illustrates an example scenario in which the electronic device 100 recommends tracks and/or songs based on content analysis of current playing song, according to an embodiment of the disclosure.
- the electronic device 100 may be configured to detect the candidate audio content and to perform the frequency analysis of the feature set of the candidate audio content. Further, the electronic device 100 may be configured to predict the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using the machine learning model. Further, the electronic device 100 may be configured to display the indication indicating the availability of the audio content-based recommendations. Further, the electronic device 100 may be configured to detect an input performed on the indication. Further, the electronic device 100 may be configured to execute the audio content-based recommendations.
- the electronic device 100 may present the nearest neighbor partitions together with the current partition, in order of increasing distance between the partitions (i.e., the electronic device 100 presents the first nearest partition, which is closest to the playing song partition, and then the next available partition, based on the content analysis).
- the electronic device 100 learns and analyzes the responses from the user and presents the songs that may be the most preferred by the user. The recommendations will reflect changes in user preferences from time to time.
- the group most preferred by the user will always be at the top position of a display of the electronic device 100 and will constitute a dense set of songs, as the partition re-adjusts itself by fetching a few neighbor songs and increasing its density.
- in an example, the group has 2 songs, and after learning, the partition fetches 3 neighbor songs from the other partitions that are closest to the group.
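- This re-adjustment can be sketched as follows, assuming each track carries a single weighted factor and a sparse partition absorbs its nearest neighbors until it reaches a minimum density; the threshold is illustrative.

```python
# Sketch: grow a sparse partition by absorbing its nearest neighbor songs.
def absorb_neighbors(partition: list[str], others: list[str],
                     weighted: dict[str, float], min_size: int = 5) -> list[str]:
    if not partition or len(partition) >= min_size:
        return partition
    center = sum(weighted[t] for t in partition) / len(partition)
    nearest = sorted(others, key=lambda t: abs(weighted[t] - center))
    needed = min_size - len(partition)
    return partition + nearest[:needed]   # e.g., 2 songs + 3 fetched neighbors
```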
- the electronic device 100 learns from the user responses and presents the song most preferred by the user. This recommendation will keep on changing as user preferences change.
- the user's most preferred group will always be at the top position of the display of the electronic device 100 and will present a dense set of songs, as the partition re-adjusts emotions by fetching a few neighbor songs from the "other" group and increasing the density.
- in an example, the group has 2 sad songs, of which 2 songs are "classical" and 1 is of a different emotion; after learning, the partition fetches 3 neighbor songs with matching emotions from the other partitions that are closest to the user. This will enable the predictive model to retrain and re-predict through deep learning.
- FIG. 13A, 13B, 13C and 13D are example scenarios in which the electronic device recommends tracks and/or songs based on content analysis of current playing song, according to an embodiment of the disclosure.
- the electronic device 100 may be configured to detect the candidate audio content and perform the frequency analysis of the feature set of the candidate audio content. Further, the electronic device 100 may determine that the candidate audio content is related to happy songs, and accordingly, the electronic device 100 provides suggestions related to happy songs as shown in FIG. 13B.
- if the user of the electronic device 100 starts listening to sad songs as shown in FIG. 13C, the electronic device 100 provides suggestions related to sad songs as shown in FIG. 13D.
- FIG. 14 is an example scenario in which the electronic device 100 creates the dynamic playlists on the basis of content based partitions, according to an embodiment of the disclosure.
- the dynamic playlists can be created on the basis of content based partitions.
- for example, if 5 partitions exist in the electronic device 100, then 5 dynamic playlists may be created.
- in an example, a recommended playlist has 39 tracks in the partition.
- FIG. 15 is an example scenario in which the electronic device 100 shares the clusters with other users of the electronic device 100, according to an embodiment of the disclosure. All audio tracks in various clusters are shared with another user so that another user of the electronic device 100 is able to obtain the recommended songs based on the clusters.
- FIG. 16A, 16B and 16C are example scenarios in which the electronic device provides sound effects on the basis of content based partitions, according to an embodiment of the disclosure.
- the electronic device 100 may equalize the sound effects by reducing the treble and vocal levels, based on audio content analysis.
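- As an illustration of such an effect, the sketch below reduces treble with a low-pass filter and attenuates vocals via center-channel cancellation, a common stereo trick; the cutoff frequency and mixing gain are assumptions, not the disclosed equalizer settings.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def reduce_treble_and_vocal(stereo: np.ndarray, sr: int,
                            treble_cut_hz: float = 6000.0) -> np.ndarray:
    """stereo: float array of shape (2, n_samples); returns a mono signal."""
    left, right = stereo
    mid = (left + right) / 2.0         # center channel, where vocals usually sit
    side = (left - right) / 2.0        # sides, where vocals largely cancel
    mixed = 0.4 * mid + side           # attenuate the vocal-heavy center
    sos = butter(4, treble_cut_hz, btype="low", fs=sr, output="sos")
    return sosfilt(sos, mixed)         # low-pass to soften the treble
```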
- a method and a device for audio content-based recommendations are provided.
- the method includes receiving, by an electronic device, a plurality of audio contents. Further, the method includes decoding, by the electronic device, the plurality of audio contents to extract a feature set of each of the audio contents. Further, the method includes performing, by the electronic device, a frequency analysis of the feature set of the each of the audio contents. Further, the method includes reducing, by the electronic device, the feature set of each of the audio contents based on the frequency analysis. Further, the method includes clustering, by the electronic device, the at least one audio content into at least one cluster based on the reduced feature set. Further, the method includes providing, by the electronic device, the audio content-based recommendations based on the at least one cluster.
- performing the frequency analysis of the feature set of each of the audio contents includes generating a frequency spectrogram for each of the audio contents by storing features of the at least one audio content as a frequency spectrum image, determining frequencies in at least one time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the frequencies in the at least one time interval and the magnitudes associated with each audio content, where the dominant group includes the maximum number of frequencies with non-zero magnitudes.
- reducing the feature set of each of the audio contents based on the frequency analysis includes determining a dimensionality reduction function for the feature set of each of the audio contents based on component factors and audio parameters, determining a weight factor associated with each feature in the feature set, and reducing the feature set of each of the audio contents based on the dimensionality reduction function and the weight factor.
- the component factors are an actual contribution of audio parameters in defining a particular characteristic of the audio content. In an embodiment, the component factors are associated with a particular characteristic of the audio content.
- clustering the at least one audio content into at least one cluster based on the reduced feature set includes creating a MEL spectrogram of the at least one audio content on a three-axis plane in which a first axis represents a frequency value, a second axis represents the reduced feature set, and a third axis represents time, labeling the at least one audio content using a Cartesian plane in the created MEL spectrogram based on the frequency analysis using a machine learning model and a mapping function, and clustering the at least one audio content into the at least one cluster based on the labeling.
- the mapping function is used to classify a category of the audio content in the Cartesian plane using a weighted factor and the frequency value.
- providing the audio content-based recommendations based on the at least one cluster includes detecting a candidate audio content in the electronic device, performing a frequency analysis of a feature set of the candidate audio content, predicting audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using a machine learning model, displaying an indication indicating an availability of the audio content-based recommendations, detecting an input performed on the indication, and displaying the audio content-based recommendations.
- the audio content-based recommendations correspond to at least one audio content having a feature set that is the same as or similar to the feature set of the candidate audio content, where the content-based recommendations correspond to at least one of a beat feature, an emotion feature, a genre feature, a sound effect, and a playlist.
- the feature set includes at least one of a beat feature, an emotion feature, a genre feature, a sound effect and a playlist.
- embodiments herein disclose a method for audio content-based recommendations.
- the method includes detecting, by an electronic device, a candidate audio content. Further, the method includes performing, by the electronic device, a frequency analysis of a feature set of the candidate audio content. Further, the method includes predicting, by the electronic device, the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using a machine learning model. Further, the method includes displaying, by the electronic device, an indication indicating an availability of the audio content-based recommendations. Further, the method includes detecting, by the electronic device, an input performed on the indication. Further, the method includes causing to display, by the electronic device, the audio content-based recommendations.
- the method further includes detecting, by the electronic device, a user response to the audio content-based recommendations. Further, the method includes automatically updating, by the electronic device, the audio content available in the at least one cluster having the audio content-based recommendations based on the user response.
- an electronic device for audio content-based recommendations includes an audio content-based recommendation engine coupled to a memory and a processor.
- the audio content-based recommendation engine is configured to receive a plurality of audio contents and decode the plurality of audio contents to extract a feature set of each of the audio contents. Further, the audio content-based recommendation engine is configured to perform a frequency analysis of the feature set of the each of the audio contents and reduce the feature set of each of the audio contents based on the frequency analysis. Further, the audio content-based recommendation engine is configured to cluster the at least one audio content into at least one cluster based on the reduced feature set and provide the audio content-based recommendations based on the at least one cluster.
- an electronic device for audio content-based recommendations includes an audio content-based recommendations engine coupled to a memory and a processor.
- the audio content-based recommendation engine is configured to detect a candidate audio content in the electronic device and perform a frequency analysis of a feature set of the candidate audio content. Further, the audio content-based recommendation engine is configured to predict the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using a machine learning model. Further, the audio content-based recommendation engine is configured to display an indication indicating an availability of the audio content-based recommendations. Further, the audio content-based recommendation engine is configured to detect an input performed on the indication. Further, the audio content-based recommendation engine is configured to display the audio content-based recommendations.
- the embodiments disclosed herein can be implemented using at least one software program running on at least one hardware device and performing network management functions to control the elements.
Abstract
Provided are an apparatus and a method for audio content-based recommendations. The method includes receiving, by an electronic device, a plurality of audio contents, decoding, by the electronic device, the plurality of audio contents to extract a feature set of each of the audio contents, performing, by the electronic device, a frequency analysis of the feature set of each of the audio contents, reducing, by the electronic device, the feature set of each of the audio contents based on the frequency analysis, classifying, by the electronic device, at least one audio content into at least one cluster based on the reduced feature set, and providing, by the electronic device, the audio content-based recommendations based on the at least one cluster.
Description
The disclosure relates to a recommendation method and device. More particularly, the disclosure relates to a method and a device for audio content-based recommendations.
In general, music recommendation services are provided in many ways, but currently no artificial intelligence (AI) model exists that can interact with a user and learn from user responses while the user listens to music on an electronic device.
Additionally, there is no existing mechanism to suggest the most suitable audio contents based on a pattern found in a music library, and audio effects for a given audio file based on its content analysis, for use in preparing a predictive model.
Thus, it is desired to address the above mentioned disadvantages or other shortcomings or at least provide a useful alternative.
Provided is a method and an apparatus for audio content-based recommendations in an electronic device. The method comprises receiving a plurality of audio contents, decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents, performing a frequency analysis of the feature set of each of the plurality of audio contents, reducing the feature set of each of the plurality of audio contents based on the frequency analysis, classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents, and providing a recommendation of candidate audio content based on the at least one cluster.
The present disclosure provides an effective mechanism to suggest the most suitable audio contents based on a pattern found in a music library, and audio effects for a given audio file based on its content analysis, for use in preparing a predictive model.
The above and other aspects, features, and advantages of certain embodiments of the disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of an electronic device, according to an embodiment of the disclosure;
FIG. 2 is a block diagram of an audio content-based recommendation engine, according to an embodiment of the disclosure;
FIG. 3 and FIG. 4 are flow diagrams illustrating a method for audio content-based recommendations in the electronic device, according to an embodiment of the disclosure;
FIG. 5 illustrates a frequency analysis of a feature set of each of audio contents, according to an embodiment of the disclosure;
FIG. 6 illustrates an MEL spectrogram, according to an embodiment of the disclosure;
FIG. 7 illustrates a content partition labeling and prediction modeling, according to an embodiment of the disclosure;
FIG. 8 illustrates a function and operation of a machine learning model, according to an embodiment of the disclosure;
FIG. 9 is an example scenario in which the electronic device clusters the audio contents based on the frequency analysis of the feature set, according to an embodiment of the disclosure;
FIG. 10 illustrates an example scenario in which the electronic device 100 recommends tracks and/or songs based on content analysis of a currently playing song, according to an embodiment of the disclosure;
FIG. 11 illustrates another example scenario in which the electronic device 100 recommends tracks and/or songs based on content analysis of a currently playing song, according to an embodiment of the disclosure;
FIG. 12 illustrates another example scenario in which the electronic device 100 recommends tracks and/or songs based on content analysis of a currently playing song, according to an embodiment of the disclosure;
FIG. 13A, 13B, 13C and 13D are example scenarios in which the electronic device recommends tracks and/or songs based on content analysis of a currently playing song, according to an embodiment of the disclosure;
FIG. 14 is an example scenario in which the electronic device creates dynamic playlists on the basis of content-based partitions, according to an embodiment of the disclosure;
FIG. 15 is an example scenario in which the electronic device shares the clusters with another user of the electronic device, according to an embodiment of the disclosure; and
FIG. 16A, 16B and 16C are example scenarios in which the electronic device provides sound effects on the basis of content-based partitions, according to an embodiment of the disclosure.
Throughout the drawings, like reference numerals will be understood to refer to like parts, components, and structures.
Provided is a method and an apparatus for audio content-based recommendations in an electronic device. The method comprises receiving a plurality of audio contents, decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents, performing a frequency analysis of the feature set of each of the plurality of audio contents, reducing the feature set of each of the plurality of audio contents based on the frequency analysis, classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents, and providing a recommendation of candidate audio content based on the at least one cluster.
In an embodiment, the feature set comprises at least one selected from a group of a beat of audio content, an emotion embedded in audio content, a genre of audio content, a sound effect of audio content, and a playlist of audio content.
In an embodiment, the performing of the frequency analysis comprises generating a frequency spectrogram for each of the plurality of audio contents by extracting features of at least one audio content among the plurality of audio contents, determining frequencies in a first time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the determined frequencies in the first time interval and the magnitudes for the at least one audio content, wherein the dominant group comprises a maximum number of non-zero magnitudes of the frequencies.
In an embodiment, the features of each of the plurality of audio contents comprise a frequency spectrum image of the at least one audio content.
In an embodiment, the reducing of the feature set of each of the plurality of audio contents comprises determining a dimensionality reduction function for the feature set of each of the plurality of audio contents based on audio parameters and component factors which are associated with characteristics of audio contents, determining a weight factor associated with each of features included in the feature set, and reducing a dimension of the feature set based on the dimensionality reduction function and the weight factor.
In an embodiment, the audio parameters comprise a bandwidth derived from sound frequency of the at least one audio content.
In an embodiment, the classifying of the at least one audio content comprises generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content, labeling the at least one audio content using a Cartesian plane in the MEL spectrogram based on the frequency analysis and a mapping function, and classifying the at least one audio content into the at least one cluster based on the labeling.
In an embodiment, the mapping function is used to classify a category of the at least one audio content using a weight factor and the frequency value.
In an embodiment, the providing of the recommendation comprises detecting the candidate audio content in the electronic device, performing a frequency analysis of a feature set of the candidate audio content, predicting at least one recommendation corresponding to the candidate audio content based on the at least one cluster having candidate audio content using a machine learning model, displaying an indication indicating an availability of the at least one recommendation, receiving an input on the indication, and displaying the at least one recommendation on a display of the electronic device.
In an embodiment, the providing of the recommendation further comprises detecting a user response to the at least one recommendation, and updating the availability of the at least one recommendation based on the user response.
In an embodiment, the predicting of the at least one recommendation comprises predicting the at least one recommendation corresponding to the candidate audio content based on the at least one cluster, and determining whether the feature set of the at least one audio content is substantially similar to or the same as the feature set of the candidate audio content.
Provided is an apparatus for audio content-based recommendations, the apparatus comprising a display; and a processor configured to receive a plurality of audio contents, decode the plurality of audio contents to extract a feature set of each of the plurality of audio contents, perform a frequency analysis of the feature set of each of the plurality of audio contents, reduce the feature set of each of the plurality of audio contents based on the frequency analysis, classify at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents, and provide, to the display, a recommendation of candidate audio content based on the at least one cluster.
In an embodiment, the performing of the frequency analysis comprises generating a frequency spectrogram for each of the plurality of audio contents by extracting features of at least one audio content among the plurality of audio contents, determining frequencies in a first time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the determined frequencies in the first time interval and the magnitudes for the at least one audio content, wherein the dominant group comprises a maximum number of non-zero magnitudes of the frequencies.
In an embodiment, the classifying of the at least one audio content comprises generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content, labeling the at least one audio content using a Cartesian plane in the MEL spectrogram based on the frequency analysis and a mapping function, and classifying the at least one audio content into the at least one cluster based on the labeling.
Provided is a non-transitory computer readable medium configured to store one or more computer programs including instructions that, when executed by at least one processor, cause the at least one processor to control for receiving a plurality of audio contents; decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents; performing a frequency analysis of the feature set of each of the plurality of audio contents; reducing the feature set of each of the plurality of audio contents based on the frequency analysis; classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents; and providing a recommendation of candidate audio content based on the at least one cluster.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of various embodiments of the disclosure as defined by the claims and their equivalents. It includes various specific details to assist in that understanding but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the various embodiments described herein may be made without departing from the scope and spirit of the disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the bibliographical meanings, but, are merely used by the inventor to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the disclosure is provided for illustration purpose only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
It is to be understood that the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a component surface" includes reference to one or more of such surfaces.
As used herein, the terms "1st" or "first" and "2nd" or "second" may use corresponding components regardless of importance or order and are used to distinguish one component from another without limiting the components.
The accompanying drawings are used to help easily understand various technical features and it should be understood that the embodiments presented herein are not limited by the accompanying drawings. As such, the present disclosure should be construed to extend to any alterations, equivalents and substitutes in addition to those which are particularly set out in the accompanying drawings. Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are generally only used to distinguish one element from another.
Embodiments herein disclose a method for audio content-based recommendations. The method includes receiving, by an electronic device, a plurality of audio content, decoding, by the electronic device, the plurality of audio contents to extract a feature set of each of the audio contents, performing, by the electronic device, a frequency analysis of the feature set of the each of the audio contents, reducing, by the electronic device, the feature set of each of the audio contents based on the frequency analysis, clustering, by the electronic device, the at least one audio content into at least one cluster based on the reduced feature set, and providing, by the electronic device, the audio content-based recommendations based on the at least one cluster.
The proposed method can be used to provide a predictive analysis of the audio content-based recommendations (e.g., music recommendation, emotions, genres and sound effects or the like) to a user based on a machine learning model in a cost effective manner.
In addition, the proposed method can be used to enhance the user's music listening experience through music content analysis based upon an artificial intelligence model (e.g., a machine learning model). The proposed method can be used to enlist a content-based mechanism which analyzes the frequency distribution and audio parameters of the audio data/content of individual music files and employs supervised learning and deep learning models on an n-th vector and a spectrogram produced for a predictive model. This will be used to achieve an automatic music recommendation model, a music beat labelling prediction model, a music emotion labelling prediction model, a music genre prediction model, and a default best-suited audio effect model for individual music content and for groups of songs in an effective manner.
The method can be used to learn a taste of the user by redefining the groups of songs depending on the user's taste. In an example, if the electronic device of the user contains 50 happy songs in one group and 50 melody songs in other groups, and the user likes to hear only happy music, the group including happy songs (the happy group) is classified as the dominating group, and the electronic device will tend to add more songs to the happy group from the nearest neighbor group which is at a lesser distance from the happy group. The method may include providing a self-adjusting model depending upon the user's taste. The method can be used to predict the emotions extracted from the song using the deep learning model.
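As a non-limiting illustration of the group re-adjustment described above, the following Python sketch shows how a dominating group might absorb its closest neighbor songs; the function names, the point representation, and the Euclidean distance measure are assumptions for illustration only and are not part of the disclosure.

```python
# A minimal sketch of taste-based group re-adjustment: the dominating group
# (e.g., the happy group) absorbs the nearest songs from neighbor groups.
import numpy as np

def absorb_neighbors(dominant, neighbors, n_absorb=3):
    """dominant/neighbors hold (song_id, feature_point) pairs; the feature
    points are assumed to come from the reduced feature set."""
    center = np.mean([point for _, point in dominant], axis=0)
    candidates = sorted(
        (song for group in neighbors for song in group),
        key=lambda sp: np.linalg.norm(sp[1] - center))
    dominant.extend(candidates[:n_absorb])  # grow the preferred group
    return dominant
```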
The method may include recommending the music depending upon the content present in the electronic device without user intervention. The method may include predicting genres of unknown songs without the user intervention.
The method may present the audio data to the user according to the user's taste. The method may include detecting the user's pattern of audio effects settings so that playlist creation and audio effects are provided in a dynamic manner. The method enables users to remove duplicate audio content so as to reduce memory usage.
The proposed method allows the user to obtain the best suitable set of songs, recommended genres and emotions of songs for the available audio content in a cost effective and quick manner. The method may include automatically suggesting various sound effects which enhance the user listening experience of the music. The method may include providing a full control of storing user preference related to the sound effects, presentation of contents and predicted information at run time that has been predicted as part of machine learning for the user content in an effective way.
The method may include providing a self-learning model which reduces the user's manual efforts. The method may include providing a predictive analysis of the music recommendations, emotions, genres, and sound effects to the user without using metadata information, natural language processing (NLP), and/or database query options.
The method may include providing the recommendation on the basis of audio parameters and frequency analysis of dominating groups so as to improve performance of the predictive model. This results in providing enriched data to the user in an interactive way. The method may include providing different sound effects to the user based on the audio content.
The method may include providing seamless interaction with the content where disinterested content will be skipped or unused automatically and the user can easily find content of her or his interest. The method may include providing the full control for a user preference related to the sound effects and volume for each content. The method may include providing the music content by reducing a mobile data usage.
In the proposed methods, recommendations of songs may be made to the user on the basis of content analysis so that the most preferred audio content is visible to the user first. The user can create a playlist on the basis of the content to which the user prefers to listen. Further, a dynamic playlist can be created on the basis of content-based partitions. The predictive model may learn the user's behavior and suggest content online for the user depending on which partition the user prefers to listen to. The predictive model may also learn the listening pattern of the user, thereby providing the user with the most appropriate neighbor to which the user would like to listen, so that the user need not search for similar songs. If the number of songs is small in the current partition, the partition will be recreated by absorbing songs from the neighbor partition. This results in extending the scope of a partition and an aggregation among partitions.
Referring now to the drawings, and more particularly to FIGS. 1 through 16C, there are shown preferred embodiments.
FIG. 1 is a block diagram of an electronic device 100, according to an embodiment of the disclosure. The electronic device 100 may be, for example, but is not limited to, a cellular phone, a smart phone, a Personal Digital Assistant (PDA), a tablet computer, a laptop computer, a music player, a video player, or the like. In an embodiment, the electronic device 100 includes an audio content-based recommendation engine 110, a communicator 120, a memory 130, a processor 140, and a display 150. The audio content-based recommendation engine 110, the communicator 120, and the processor 140 may be implemented as one hardware processor.
Currently, there is no content-specific recommendation and prediction model which suggests to the user the most appropriate songs that the user likes to listen to, and preferred songs based on a frequency spectrum, feature specification, and the genre and emotion of the songs, based on a machine learning model. Further, no categorization of music based on the quality and content spectrum of the audio exists.
In an embodiment, to address the foregoing issue, the audio content-based recommendation engine 110 may be configured to receive a plurality of audio contents and decode the plurality of audio contents to extract a feature set of each of the audio contents. The feature set may be, for example, a beat of audio content, an emotion embedded in audio content, a genre of audio content, a sound effect of audio content, and/or a playlist of audio content.
In an example, the audio content-based recommendation engine 110 may be configured to extract a currently playing song so that the frequency variation is observed over the whole song to analyze the frequency variation and audio parameters. The audio parameters may include, for example, at least one selected from a group of a bandwidth, a sample rate, a bit depth, a flux, a root mean square, a variance, and a peak finder derived from the sound of the audio content or the sound frequency of the audio content.
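As a non-limiting illustration, such audio-parameter extraction may be sketched with the open-source librosa library as follows; the parameter set, the crude flux approximation, and the function names are assumptions for illustration only.

```python
# A minimal sketch of audio-parameter extraction for content analysis.
import librosa
import numpy as np

def extract_audio_parameters(path):
    y, sr = librosa.load(path, sr=None, mono=True)  # decode the audio file
    stft = np.abs(librosa.stft(y))                  # magnitude spectrogram
    return {
        "sample_rate": sr,
        "bandwidth": float(librosa.feature.spectral_bandwidth(S=stft, sr=sr).mean()),
        "rms": float(librosa.feature.rms(S=stft).mean()),    # root mean square level
        "variance": float(np.var(y)),                        # signal variance
        "flux": float(np.mean(np.diff(stft, axis=1) ** 2)),  # crude spectral flux
        "peak": float(np.max(np.abs(y))),                    # simple peak finder
    }
```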
Further, the audio content-based recommendation engine 110 may be configured to perform a frequency analysis of the feature set of the each of the audio contents.
In an embodiment, the audio content-based recommendation engine 110 may be configured to perform the frequency analysis of the feature set of each of the audio contents by generating a frequency spectrogram for each of the audio contents by storing features of the at least one audio content as a frequency spectrum image, determining frequencies in at least one time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the frequencies in the at least one time interval and the magnitudes associated with each of the audio contents. The dominant group includes a maximum number of non-zero magnitudes of frequencies.
The frequency analysis of the feature set of each of the audio contents is illustrated referring to FIG. 5. In an example, the weighted maximum frequency magnitude for the first audio content may be computed as below.

Fmg1 = (f1*m1 + f2*m2 + ... + fn*mn) / (f1 + f2 + ... + fn)

where f is frequency, m is magnitude, and f1, f2, ..., fn lie in the most dominating group.
In an example, referring to FIG. 5, the frequency spectrum corresponding to a single audio content over the frequency range from 0 to 22,000 Hz is depicted. As depicted in FIG. 5, each frequency has a magnitude corresponding to its sound pressure level (SPL). Further, the audio content-based recommendation engine 110 may be configured to analyze the frequency-value magnitudes in windows of size 1,000, so that a total of 22 intervals are formed. Out of these 22 intervals, the interval which has the most non-zero magnitudes, and whose average magnitude is greater than that of the other 21 groups, is the most dominating group. Further, the audio content-based recommendation engine 110 obtains a total of 22 Fmg1 values, one for each interval, to determine the maximum frequency magnitude, which will be the global Fmgmax of the single song.
That a group is most dominating means that the group has the largest number of non-zero magnitudes of frequencies and that the average magnitude of the group is also the highest. For example, f1, f2, f3, ..., f1000 are the frequencies in the interval and m1, m2, m3, ..., m1000 are the magnitudes of those frequencies.

Referring to FIG. 5, the time will be the time value, in milliseconds of song duration, within the most dominating group interval. This phase will give Fmax and Time.
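As a non-limiting illustration, the dominant-group selection and the weighted frequency magnitude Fmg described above may be sketched as follows, assuming a magnitude spectrum (frequencies and magnitudes) has already been computed; the names and the tie-breaking order are illustrative assumptions.

```python
# A minimal sketch of the dominant-group analysis over 1,000-unit intervals.
import numpy as np

def dominant_group(freqs, mags, window=1000, top=22000):
    """Split 0-22,000 into 22 intervals and pick the most dominating one,
    returning (non-zero count, average magnitude, Fmg, interval start)."""
    best = None
    for lo in range(0, top, window):
        mask = (freqs >= lo) & (freqs < lo + window)
        f, m = freqs[mask], mags[mask]
        nonzero = int(np.count_nonzero(m))
        avg = float(m.mean()) if m.size else 0.0
        # Weighted frequency magnitude: Fmg = sum(f*m) / sum(f)
        fmg = float((f * m).sum() / f.sum()) if f.sum() > 0 else 0.0
        if best is None or (nonzero, avg) > (best[0], best[1]):
            best = (nonzero, avg, fmg, lo)
    return best
```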
Further, the audio content-based recommendation engine 110 is configured to reduce the feature set of each of the audio contents based on the frequency analysis. In an embodiment, the reduction of the feature set may be a reduction of a dimension of the feature set. In an embodiment, the feature set of each of the audio contents is reduced by determining a dimensionality reduction function for the feature set of each of the audio contents based on component factors and audio parameters, determining a weight factor associated with each of the features in the feature set, and reducing the feature set of each of the audio contents based on the dimensionality reduction function and the weight factor. In an embodiment, the component factors are the actual contribution of the audio parameters in defining a particular characteristic of the audio content. In an embodiment, the component factors are associated with a particular characteristic of the audio content.
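As a non-limiting illustration, the reduction of an M*N feature matrix to an M*1 vector of weighted factors may be sketched as follows; treating the dimensionality reduction function as a weighted sum is an assumption for illustration, and the weights stand in for the component factors.

```python
# A minimal sketch of the feature-set dimensionality reduction step.
import numpy as np

def reduce_features(feature_matrix, weights):
    """feature_matrix: shape (M, N), one row of N features per audio ID.
    weights: shape (N,), the per-feature weight factors.
    Returns an (M,) vector: one weighted factor per audio ID."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize the contributions
    return feature_matrix @ weights    # (M, N) @ (N,) -> (M,)
```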
Further, the audio content-based recommendation engine 110 may be configured to cluster the at least one audio content into at least one cluster based on the reduced feature set. In an embodiment, the clustering of the at least one audio content into the at least one cluster may be performed by creating a MEL spectrogram of the at least one audio content on a three-axis plane, in which a first axis represents a frequency value derived from or associated with the at least one audio content, a second axis represents the reduced feature set, and a third axis represents time derived from or associated with the at least one audio content, labeling the at least one audio content using a Cartesian plane in the created MEL spectrogram based on the frequency analysis using a machine learning model and a mapping function, and clustering the at least one audio content into the at least one cluster based on the labeling.
FIG. 6 illustrates the MEL spectrogram representing an acoustic time-frequency representation of a sound, according to an embodiment of the disclosure.
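As a non-limiting illustration, such a MEL spectrogram may be generated with the librosa library as follows; the file name and the mel-band count are illustrative assumptions.

```python
# A minimal sketch of MEL-spectrogram generation for one audio content.
import librosa
import numpy as np

y, sr = librosa.load("song.mp3", mono=True)               # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)             # log-scaled magnitudes
# mel_db is a time-frequency matrix: one axis is mel frequency, the other is
# time, and the cell values supply the third (magnitude/feature) dimension.
```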
In an embodiment, the mapping function is used to classify a category of the audio content in the Cartesian plane using the weighted factor and the frequency value. The weighted factor and the frequency value are obtained from the MEL spectrogram. In an example, the mapping function is used to classify a category of the audio content or audio songs. For example, the mapping function may determine to which category of beats, tempo, genre, or emotions the song belongs. The classification of the category of the audio content may be based on a weight factor and frequency values derived from the audio content.
In an embodiment, labeling may be performed relying upon the components identified during the content analysis. In an example, if, based on the content analysis, the song is classified as tempo, the song is labeled as a tempo-related song.
FIG. 7 is a graph illustrating the content partition labeling and prediction modeling, according to an embodiment of the disclosure.
Referring to FIG. 7, a first quadrant which contains the three values Tempo, Lively, and Stirring can also contain multiple classification categories depending upon the exploration/extension of the analysis of the content. In an example, these four quadrants can be further expanded to 8, 16, 32, or the like. Depending upon the number of planes, the audio content-based recommendation engine 110 divides the Cartesian plane. For better understanding, the user of the audio content-based recommendation engine 110 may divide a plane into two, which is equivalent to 4 quadrants.
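As a non-limiting illustration, the quadrant-based labeling on the Cartesian plane may be sketched as follows, assuming each song has been reduced to a (frequency value, weighted factor) point; the origin and the quadrant labels other than the first are assumptions for illustration only.

```python
# A minimal sketch of mapping a song's point to one of four content partitions.
def label_quadrant(freq_value, weight_factor, f_origin=0.0, w_origin=0.0):
    if freq_value >= f_origin and weight_factor >= w_origin:
        return "tempo/lively/stirring"   # first quadrant (per FIG. 7)
    if freq_value < f_origin and weight_factor >= w_origin:
        return "happy/party"             # second quadrant (assumed label)
    if freq_value < f_origin and weight_factor < w_origin:
        return "sad/melody"              # third quadrant (assumed label)
    return "relaxed/calm"                # fourth quadrant (assumed label)
```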
Further, the audio content-based recommendation engine 110 may be configured to provide the audio content-based recommendations based on the at least one cluster. In an embodiment, the audio content-based recommendations are provided by detecting a candidate audio content available in the electronic device 100, performing the frequency analysis of the feature set of the candidate audio content, predicting the audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using the machine learning model, displaying an indication indicating the availability of the audio content-based recommendations, detecting an input performed on the indication, and displaying the audio content-based recommendations on a display of the electronic device 100. The machine learning model can be, for example, but is not limited to, a deep learning model.
In an embodiment, the audio content-based recommendations correspond to the at least one audio content having the feature set which is the same as or similar to the feature set of the candidate audio content, where the content-based recommendations correspond to at least one of the beat feature, the emotion feature, the genre feature, the sound effect, and/or the playlist.
In an embodiment, the predicting of the audio content-based recommendations includes predicting the audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster, and determining whether the feature set of the at least one audio content is substantially similar to or the same as the feature set of the candidate audio content.
In an embodiment, the cluster will represent an audio content list/song list of the same or a similar frequency.
In an example, a tempo label has been suggested by the prediction model depending upon the prediction and the labeled partition. Then, if the user listens to tempo-related songs, the audio content-based recommendation engine 110 may recommend other relevant tempo songs to the user.
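As a non-limiting illustration, the cluster-based recommendation may be sketched as follows, assuming each song is reduced to a point and the clusters are labeled partitions; the centroid-distance criterion is an assumption for illustration only.

```python
# A minimal sketch of recommending songs from the cluster nearest to the
# currently playing song.
import numpy as np

def recommend(current_point, clusters, k=5):
    """clusters: dict mapping a label to a list of (song_id, point) pairs."""
    def centroid(members):
        return np.mean([point for _, point in members], axis=0)

    nearest = min(clusters, key=lambda label: np.linalg.norm(
        centroid(clusters[label]) - current_point))
    members = sorted(clusters[nearest],
                     key=lambda sp: np.linalg.norm(sp[1] - current_point))
    return [song_id for song_id, _ in members[:k]]
```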
FIG. 8 illustrates functions and operations of the machine learning model, according to an embodiment of the disclosure.
Referring to FIG. 8, the machine learning model can be a layer-based neural network (e.g., a 4-layer deep learning network for genre prediction, a 5-layer neural network for emotion prediction, or the like), and the neural network may be trained for a predictive analysis. In an example, a 4-layer neural network having 32, 16, 10, and 4 nodes at each level may be implemented to achieve deep learning for emotion prediction. The predictive function may pass the feature vector set to the neural network and produce the output Yp, as shown in FIG. 8.
In an example, the machine learning model may be created with the TensorFlow library. Initially, the machine learning model may build a data set of N songs, created by obtaining tagging information from across the Internet for the emotion and genre information. The emotions and genres of these N songs are known on the basis of the tagging information available on the Internet. Further, the machine learning model may extract the features of the songs in the dataset. Using the TensorFlow library, the machine learning model will correlate the feature information of the current song with a category of songs. In an example, the emotions predicted through the machine learning model are sad, happy, party, relaxed, and angry. In another example, the genres predicted through the machine learning model are classical, metal, country, pop, rock, ballad, rhythm-and-blues, and disco.
In an embodiment, a 4-layer neural network may be used for genre prediction, in which the last (output) layer has 6 nodes for predicting genres, and a 4-layer neural network may be used for emotion prediction, in which the last layer has 5 nodes for predicting emotions (the number of emotions is equal to the number of nodes in the last layer). Further, there are only 4 layers for both the emotion prediction model and the genre prediction model. In an embodiment, the neural network may include a first layer corresponding to 32 nodes, a second layer corresponding to 16 nodes, and a last layer corresponding to 5 nodes (the number of emotions) for the emotion prediction model or 6 nodes (the number of genres) for the genre prediction model.
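As a non-limiting illustration, such a network may be sketched with the Keras API as follows. The hidden-layer sizes follow the 32/16/10 figures mentioned above, the output size is 5 for emotions or 6 for genres, and all other hyperparameters (activations, optimizer, loss, feature size) are assumptions for illustration only.

```python
# A minimal sketch of the 4-layer softmax prediction network.
from tensorflow import keras

def build_predictor(n_features, n_classes):
    """n_classes = 5 for the emotion model, 6 for the genre model."""
    return keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(10, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),  # softmax classifier
    ])

model = build_predictor(n_features=40, n_classes=5)  # assumed feature size
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```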
In an embodiment, the audio content-based recommendation engine 110 may be configured to present the nearest neighbor partition together with the current partition, in order of increasing distance between the partitions. The first nearest partition is closest to the playing song's partition, as shown in FIG. 12.
In a case where a random song occurs, the labels may be created by the machine learning model on the basis of the extracted features provided to it, and categories are assigned to the songs.
In an example, if the audio content-based recommendation engine 110 first predicts the genre of the currently playing song as an unknown genre, and next predicts the genre of the currently playing song as the POP genre, the audio content-based recommendation engine 110 recommends a next list of songs related to the POP genre.
In another example, if the electronic device 100 predicts the emotion of the currently playing song as sad, the electronic device 100 recommends a next list of songs related to the sad emotion.
In another embodiment, the electronic device 100 may extract audio parameters for all audio files for content analysis. For each audio file, the electronic device 100 may analyze the frequency set to calculate the most dense group having non-zero magnitudes. Further, the electronic device 100 computes the dimensionality reduction function F(x), which is used for dimensionality reduction to an array of size M (M audio IDs): an M*N matrix of M audio IDs having N features is reduced to an M*1 matrix having M audio IDs and a weighted factor. Further, the electronic device 100 may generate the MEL spectrogram as a 3-dimensional graph using frequency, time, and audio vector components. Further, the MEL spectrogram of the 3-dimensional graph is superimposed with the Cartesian plane. Further, the electronic device 100 may form N/K group partitions, where N is the total number of songs and K is the number of planes into which the songs have been partitioned. Further, the electronic device 100 may recommend to a user a variety of songs depending upon the partitions.
For the genres and the emotions, the electronic device 100 may implement a softmax classifier for the training and implement the deep learning logic using a Keras model.
The processor 140 may be configured to execute instructions stored in the memory 130 and to perform various processes. The communicator 120 may be configured to communicate internally between internal hardware components and externally with external devices via one or more networks.
The memory 130 may store instructions to be executed by the processor 140. The memory 130 also stores instructions to determine the location of the electronic device 100. The memory 130 may be a non-volatile memory, which includes magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory 130 may, in some examples, be considered a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory 130 is non-movable. In some examples, the memory 130 can be configured to store larger amounts of information than a volatile memory. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
Although FIG. 1 illustrates the electronic device 100 with various hardware components, it is to be understood that other embodiments are not limited thereto. In other embodiments, the electronic device 100 may include fewer or more components. Further, the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention. One or more components can be combined to perform the same or a substantially similar function to perform the audio content-based recommendations in the electronic device 100.
FIG. 2 is a block diagram of the audio content-based recommendation engine 110, according to an embodiment of the disclosure.
In an embodiment, the audio content-based recommendation engine 110 may include a feature extractor 110a, a frequency analyzer 110b, a dimensionality reducer 110c, a classifier model 110d, a learning model 110e and a prediction model 110f.
The feature extractor 110a and the frequency analyzer 110b may be configured to receive the plurality of audio contents and to decode the plurality of audio contents to extract the feature set of each of the audio contents. Further, the feature extractor 110a and the frequency analyzer 110b may be configured to perform the frequency analysis of the feature set of each of the audio contents. Further, the dimensionality reducer 110c may be configured to reduce the feature set of each of the audio contents based on the frequency analysis. Further, the classifier model 110d and the learning model 110e may be configured to cluster the at least one audio content into at least one cluster based on the reduced feature set. Further, the prediction model 110f may be configured to provide the audio content-based recommendations based on the at least one cluster. Further, the audio content-based recommendation engine 110 may implement the softmax classifier for the training and implement the deep learning logic using a Keras model. The softmax classifier may be used for emotion classification and for genre classification of the audio content.
Although FIG. 2 illustrates various hardware components of the audio content-based recommendation engine 110, it is to be understood that other embodiments are not limited thereto. In other embodiments, the audio content-based recommendation engine 110 may include fewer or more components. Further, the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention. One or more components can be combined to perform the same or a substantially similar function to perform the audio content-based recommendations in the audio content-based recommendation engine 110.
FIG. 3 is a flow diagram 300 illustrating a method for the audio content-based recommendations in the electronic device 100, according to an embodiment of the disclosure.
Referring to FIG. 3, the operations 302 through 312 may be performed by the audio content-based recommendation engine 110 or the processor 140.
At 302, the method includes receiving the plurality of audio contents. At 304, the method includes decoding the plurality of audio contents to extract the feature set of each of the audio contents. At 306, the method includes performing the frequency analysis of the feature set of the each of the audio contents. At 308, the method includes reducing the feature set of each of the audio contents based on the frequency analysis. At 310, the method includes clustering or classifying the at least one audio content into at least one cluster based on the reduced feature set. At 312, the method includes providing the audio content-based recommendations based on the at least one cluster.
FIG. 4 is a flow diagram 400 illustrating a method for the audio content-based recommendations in the electronic device 100, according to an embodiment of the disclosure.
Referring to FIG. 4, the operations 402 through 416 are performed by the audio content-based recommendation engine 110 or a processor 140. At 402, the method includes detecting the candidate audio content in the electronic device 100. At 404, the method includes performing the frequency analysis of the feature set of the candidate audio content. At 406, the method includes predicting the audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using the machine learning model. At 408, the method includes displaying the indication indicating the availability of the audio content-based recommendations. At 410, the method includes detecting the input performed on the indication. At 412, the method includes displaying the audio content-based recommendations. At 414, the method includes detecting the user response to the audio content-based recommendations. At 416, the method includes automatically updating the audio content available in the at least one cluster having the audio content-based recommendations based on the user response.
The various actions, acts, blocks, steps, or the like in the flow diagrams 300 and 400 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some of the actions, acts, blocks, steps, or the like may be omitted, added, modified, skipped, or the like without departing from the scope of the invention.
FIG. 9 is an example scenario in which the electronic device 100 clusters the audio contents based on the frequency analysis of the feature set, according to an embodiment of the disclosure.
In an example, the audio contents are labeled using the Cartesian plane in the created MEL spectrogram based on the frequency analysis using the machine learning model. Further, the audio contents are clustered based on the labeling. Cluster 1 is formed (i.e., there are 3 happy songs found on the basis of the same type of audio content, and the 3 songs lie in the same frequency spectrum), and similarly, cluster 2 is formed (i.e., there are 3 sad songs found on the basis of the same type of audio content, and the 3 songs lie in the same frequency spectrum).
FIG. 10 illustrates an example scenario in which the electronic device 100 recommends tracks and/or songs based on content analysis of current playing song, according to an embodiment of the disclosure.
Referring to FIG. 10, the electronic device 100 may be configured to detect the candidate audio content and to perform the frequency analysis of the feature set of the candidate audio content. Further, the electronic device 100 may be configured to predict the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using the machine learning model. Further, the electronic device 100 may be configured to display the indication indicating the availability of the audio content-based recommendations. Further, the electronic device 100 may be configured to detect an input performed on the indication. Further, the electronic device 100 may be configured to execute the audio content-based recommendations.
Referring to FIG. 11, the electronic device 100 may present the nearest neighbor partition together with the current partition, in order of increasing distance between the partitions (i.e., the electronic device 100 presents the first nearest partition that is closest to the playing song's partition, and then the next available partition, based on the content analysis).
Referring to FIG. 12, the electronic device 100 learns and analyzes the responses from the user and presents the songs which may be the most preferred by the user. This recommendation will reflect changes in user preferences from time to time. The group most preferred by the user will always be at the top position of the display of the electronic device 100 and will constitute a dense set of songs, as the partition will re-adjust itself by fetching a few neighbor songs and increasing its density. Suppose that previously the group has 2 songs; after learning, the partition fetches 3 neighbor songs from the other partitions which are closest to the group.
In another example, the electronic device 100 learns from the user responses and presents the songs most preferred by the user. This recommendation will keep changing as user preferences change. The user's most preferred group will always be at the top position of the display of the electronic device 100 and will present a dense set of songs, as the partition re-adjusts its emotions by fetching a few neighbor songs from the "other" group and increasing its density. Suppose that previously the group has 2 sad songs which are "classical" and 1 song of a different emotion; after learning, the partition fetches 3 neighbor songs with matching emotions from the other partitions which are closest to the user. This will enable the predictive model to retrain and re-predict through deep learning.
FIG. 13A, 13B, 13C and 13D are example scenarios in which the electronic device recommends tracks and/or songs based on content analysis of current playing song, according to an embodiment of the disclosure.
Referring to FIG. 13A, 13B, 13C and 13D, the electronic device 100 may be configured to detect the candidate audio content and perform the frequency analysis of the feature set of the candidate audio content. Further, the electronic device 100 may determine that the candidate audio content is related to a happy song, and accordingly the electronic device 100 provides suggestions related to happy songs, as shown in FIG. 13B.
If the user of the electronic device 100 starts listening to sad songs, as shown in FIG. 13C, the electronic device 100 provides suggestions related to sad songs, as shown in FIG. 13D.
FIG. 14 is an example scenario in which the electronic device 100 creates the dynamic playlists on the basis of content based partitions, according to an embodiment of the disclosure.
Referring to FIG. 14, the dynamic playlists can be created on the basis of content-based partitions. In an example, if 5 partitions exist in the electronic device 100, then 5 dynamic playlists may be created. In an example, a recommended playlist has 39 tracks in its partition.
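As a non-limiting illustration, creating one dynamic playlist per content-based partition may be sketched as follows; the partition labels and track identifiers are hypothetical.

```python
# A minimal sketch of dynamic playlist creation from content-based partitions.
def build_playlists(partitions):
    """partitions: dict mapping a partition label to a list of track IDs."""
    return {f"Playlist: {label}": list(track_ids)
            for label, track_ids in partitions.items()}

playlists = build_playlists({
    "happy": ["t01", "t02", "t03"],
    "sad": ["t04", "t05"],
})  # 2 partitions -> 2 dynamic playlists
```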
FIG. 15 is an example scenario in which the electronic device 100 shares the clusters with another user, according to an embodiment of the disclosure. All audio tracks in the various clusters are shared with the other user, so that the other user is able to obtain the recommended songs based on the clusters.
FIG. 16A, 16B and 16C are example scenarios in which the electronic device provides sound effects on the basis of content-based partitions, according to an embodiment of the disclosure.
Referring to FIG. 16A, 16B and 16C, suppose the partition has the following audio specifications: Bass = 20, Vocal = 30, Treble = 30, and Instrumental = 20. The electronic device 100 may equalize the sound effects by reducing the treble and the vocal levels based on the audio content analysis.
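As a non-limiting illustration, such content-based equalization may be sketched as follows; moving each component halfway toward an even share is an assumption for illustration, chosen so that the dominant treble and vocal levels in the example above are reduced.

```python
# A minimal sketch of flattening a partition's audio profile.
def equalize(spec, target_total=100):
    even = target_total / len(spec)
    # Move each component halfway toward the even share.
    return {name: round(level + 0.5 * (even - level), 1)
            for name, level in spec.items()}

print(equalize({"bass": 20, "vocal": 30, "treble": 30, "instrumental": 20}))
# -> {'bass': 22.5, 'vocal': 27.5, 'treble': 27.5, 'instrumental': 22.5}
```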
In accordance with an aspect of the disclosure, a method and a device for audio content-based recommendations are provided. The method includes receiving, by an electronic device, a plurality of audio contents. Further, the method includes decoding, by the electronic device, the plurality of audio contents to extract a feature set of each of the audio contents. Further, the method includes performing, by the electronic device, a frequency analysis of the feature set of the each of the audio contents. Further, the method includes reducing, by the electronic device, the feature set of each of the audio contents based on the frequency analysis. Further, the method includes clustering, by the electronic device, the at least one audio content into at least one cluster based on the reduced feature set. Further, the method includes providing, by the electronic device, the audio content-based recommendations based on the at least one cluster.
In an embodiment, performing the frequency analysis of the feature set of each of the audio contents includes generating a frequency spectrogram for each of the audio contents by storing features of the at least one audio content as a frequency spectrum image, determining frequencies in at least one time interval and magnitudes for the at least one audio content in the frequency spectrogram, and determining a dominant group based on the frequencies in the at least one time interval and the magnitudes associated with each of the audio contents, where the dominant group includes a maximum number of non-zero magnitudes of frequencies.
In an embodiment, reducing the feature set of each of the audio contents based on the frequency analysis includes determining a dimensionality reduction function for the feature set of each of the audio contents based on component factors and audio parameters, determining a weight factor associated with each of the features in the feature set, and reducing the feature set of each of the audio contents based on the dimensionality reduction function and the weight factor.
In an embodiment, the component factors are an actual contribution of audio parameters in defining a particular characteristic of the audio content. In an embodiment, the component factors are associated with a particular characteristic of the audio content.
In an embodiment, clustering the at least one audio content into at least one cluster based on the reduced feature set includes creating a MEL spectrogram of the at least one audio content on a three-axis plane, in which a first axis represents a frequency value, a second axis represents the reduced feature set, and a third axis represents time, labeling the at least one audio content using a Cartesian plane in the created MEL spectrogram based on the frequency analysis using a machine learning model and a mapping function, and clustering the at least one audio content into the at least one cluster based on the labeling.
In an embodiment, the mapping function is used to classify a category of the audio content in the Cartesian plane using a weighted factor and the frequency value.
In an embodiment, providing the audio content-based recommendations based on the at least one cluster includes detecting a candidate audio content in the electronic device, performing a frequency analysis of a feature set of the candidate audio content, predicting audio content-based recommendations corresponding to the candidate audio content based on the at least one cluster having the candidate audio content using a machine learning model, displaying an indication indicating an availability of the audio content-based recommendations, detecting an input performed on the indication, and displaying the audio content-based recommendations.
In an embodiment, the audio content-based recommendations correspond to at least one audio content having a feature set that is the same as or similar to the feature set of the candidate audio content, where the content-based recommendations correspond to at least one of a beat feature, an emotion feature, a genre feature, a sound effect, and a playlist.
In an embodiment, the feature set includes at least one of a beat feature, an emotion feature, a genre feature, a sound effect and a playlist.
Accordingly, embodiments herein disclose a method for audio content-based recommendations. The method includes detecting, by an electronic device, a candidate audio content. Further, the method includes performing, by the electronic device, a frequency analysis of a feature set of the candidate audio content. Further, the method includes predicting, by the electronic device, the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using a machine learning model. Further, the method includes displaying, by the electronic device, an indication indicating an availability of the audio content-based recommendations. Further, the method includes detecting, by the electronic device, an input performed on the indication. Further, the method includes causing to display, by the electronic device, the audio content-based recommendations.
In an embodiment, the method further includes detecting, by the electronic device, a user response to the audio content-based recommendations. Further, the method includes automatically updating, by the electronic device, the audio content available in the at least one cluster having the audio content-based recommendations based on the user response.
Accordingly, embodiments herein disclose an electronic device for audio content-based recommendations. The electronic device includes an audio content-based recommendation engine coupled to a memory and a processor. The audio content-based recommendation engine is configured to receive a plurality of audio contents and decode the plurality of audio contents to extract a feature set of each of the audio contents. Further, the audio content-based recommendation engine is configured to perform a frequency analysis of the feature set of the each of the audio contents and reduce the feature set of each of the audio contents based on the frequency analysis. Further, the audio content-based recommendation engine is configured to cluster the at least one audio content into at least one cluster based on the reduced feature set and provide the audio content-based recommendations based on the at least one cluster.
Accordingly, embodiments herein disclose an electronic device for audio content-based recommendations. The electronic device includes an audio content-based recommendations engine coupled to a memory and a processor. The audio content-based recommendation engine is configured to detect a candidate audio content in the electronic device and perform a frequency analysis of a feature set of the candidate audio content. Further, the audio content-based recommendation engine is configured to predict the audio content-based recommendations corresponding to the candidate audio content based on at least one cluster having the candidate audio content using a machine learning model. Further, the audio content-based recommendation engine is configured to display an indication indicating an availability of the audio content-based recommendations. Further, the audio content-based recommendation engine is configured to detect an input performed on the indication. Further, the audio content-based recommendation engine is configured to display the audio content-based recommendations.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
The embodiments disclosed herein can be implemented using at least one software program running on at least one hardware device and performing network management functions to control the elements.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
Claims (15)
- A method of audio content-based recommendations in an electronic device, the method comprising: receiving a plurality of audio contents; decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents; performing a frequency analysis of the feature set of each of the plurality of audio contents; reducing the feature set of each of the plurality of audio contents based on the frequency analysis; classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents; and providing a recommendation of candidate audio content based on the at least one cluster.
- The method of claim 1, wherein the feature set comprises at least one selected from a group of a beat of audio content, an emotion embedded in audio content, a genre of audio content, a sound effect of audio content, and a playlist of audio content.
- The method of claim 1, wherein the performing of the frequency analysis comprises: generating a frequency spectrogram for each of the plurality of audio contents by extracting features of at least one audio content among the plurality of audio contents; determining frequencies in a first time interval and magnitudes for the at least one audio content in the frequency spectrogram; and determining a dominant group based on the determined frequencies in the first time interval and the magnitudes for the at least one audio content, wherein the dominant group comprises a maximum number of non-zero magnitudes of the frequencies.
- The method of claim 3, wherein the features of each of the plurality of audio contents comprise a frequency spectrum image of the at least one audio content.
- The method of claim 1, wherein the reducing of the feature set of each of the plurality of audio contents comprises: determining a dimensionality reduction function for the feature set of each of the plurality of audio contents based on audio parameters and component factors which are associated with characteristics of audio contents; determining a weight factor associated with each of the features included in the feature set; and reducing a dimension of the feature set based on the dimensionality reduction function and the weight factor.
- The method of claim 5, wherein the audio parameters comprise a bandwidth derived from sound frequency of the at least one audio content.
- The method of claim 1, wherein the classifying of the at least one audio content comprises: generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content; labeling the at least one audio content using a Cartesian plane in the MEL spectrogram based on the frequency analysis and a mapping function; and classifying the at least one audio content into the at least one cluster based on the labeling.
- The method of claim 7, wherein the mapping function is used to classify a category of the at least one audio content using a weight factor and the frequency value.
- The method of claim 1, wherein the providing of the recommendation comprisesdetecting the candidate audio content in the electronic device;performing a frequency analysis of a feature set of the candidate audio content;predicting at least one recommendation corresponding to the candidate audio content based on the at least one cluster having candidate audio content using a machine learning model;displaying an indication indicating an availability of the at least one recommendation;receiving an input on the indication; anddisplaying the at least one recommendation on a display of the electronic device.
- The method of claim 9, wherein the providing of the recommendation further comprises:
detecting a user response to the at least one recommendation; and
updating the availability of the at least one recommendation based on the user response.
- The method of claim 9, wherein the predicting of the at least one recommendation comprises:
predicting the at least one recommendation corresponding to the candidate audio content based on the at least one cluster and a determination of whether the feature set of the at least one audio content is substantially similar to or the same as the feature set of the candidate audio content.
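One possible reading of this prediction step, sketched below, compares the candidate's feature vector to clustered tracks with cosine similarity; the threshold standing in for "substantially similar" is an assumed value, and the machine learning model of claim 9 is simplified here to a nearest-cluster rule.

```python
# Illustrative only: recommend clustered tracks whose feature vectors are
# cosine-similar to the candidate's. The 0.9 threshold is an assumption for
# "substantially similar"; the claimed ML model is simplified away here.
import numpy as np

def predict_recommendations(candidate_vec, track_vecs, labels, threshold=0.9):
    # track_vecs: (n_tracks, n_features); labels: (n_tracks,) cluster ids
    labels = np.asarray(labels)
    sims = track_vecs @ candidate_vec / (
        np.linalg.norm(track_vecs, axis=1) * np.linalg.norm(candidate_vec) + 1e-9
    )
    nearest_cluster = labels[int(np.argmax(sims))]   # cluster of the best match
    in_cluster = labels == nearest_cluster
    return np.flatnonzero(in_cluster & (sims >= threshold))
```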
- An apparatus for audio content-based recommendations, the apparatus comprising:
a display; and
a processor configured to:
receive a plurality of audio contents,
decode the plurality of audio contents to extract a feature set of each of the plurality of audio contents,
perform a frequency analysis of the feature set of each of the plurality of audio contents,
reduce the feature set of each of the plurality of audio contents based on the frequency analysis,
classify at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents, and
provide, to the display, a recommendation of candidate audio content based on the at least one cluster.
- The apparatus of claim 12, wherein the performing of the frequency analysis comprises:
generating a frequency spectrogram for each of the plurality of audio contents by extracting features of at least one audio content among the plurality of audio contents;
determining frequencies in a first time interval and magnitudes for the at least one audio content in the frequency spectrogram; and
determining a dominant group based on the determined frequencies in the first time interval and the magnitudes for the at least one audio content,
wherein the dominant group comprises a maximum number of non-zero magnitudes of the frequencies.
- The apparatus of claim 12, wherein the classifying of the at least one audio content comprises:
generating a MEL spectrogram of the at least one audio content on a three-axis plane, wherein a first axis represents a frequency value derived from the audio content, a second axis represents the reduced feature set, and a third axis represents time derived from the audio content;
labeling the at least one audio content using a Cartesian plane in the MEL spectrogram, based on the frequency analysis and a mapping function; and
classifying the at least one audio content into the at least one cluster based on the labeling.
- A non-transitory computer readable medium configured to store one or more computer programs including instructions that, when executed by at least one processor, cause the at least one processor to control for:
receiving a plurality of audio contents;
decoding the plurality of audio contents to extract a feature set of each of the plurality of audio contents;
performing a frequency analysis of the feature set of each of the plurality of audio contents;
reducing the feature set of each of the plurality of audio contents based on the frequency analysis;
classifying at least one audio content among the plurality of audio contents into at least one cluster based on the reduced feature set of each of the plurality of audio contents; and
providing a recommendation of candidate audio content based on the at least one cluster.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN201841034270 | 2018-09-11 | ||
IN201841034270 | 2018-09-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020055173A1 (en) | 2020-03-19 |
Family
ID=69777681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2019/011845 WO2020055173A1 (en) | 2018-09-11 | 2019-09-11 | Method and system for audio content-based recommendations |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020055173A1 (en) |
2019-09-11: WO PCT/KR2019/011845 patent/WO2020055173A1/en, active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060065102A1 (en) * | 2002-11-28 | 2006-03-30 | Changsheng Xu | Summarizing digital audio data |
US20070240557A1 (en) * | 2006-04-12 | 2007-10-18 | Whitman Brian A | Understanding Music |
US20080104111A1 (en) * | 2006-10-27 | 2008-05-01 | Yahoo! Inc. | Recommendation diversity |
KR101142244B1 (en) * | 2011-08-02 | 2012-05-21 | 주식회사 제이디사운드 | Automatic song selection device according to user preferences |
US20130091167A1 (en) * | 2011-10-05 | 2013-04-11 | The Trustees Of Columbia University In The City Of New York | Methods, systems, and media for identifying similar songs using jumpcodes |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115312074A (en) * | 2022-10-10 | 2022-11-08 | 江苏米笛声学科技有限公司 | Cloud server based on audio processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fonseca et al. | Fsd50k: an open dataset of human-labeled sound events | |
US10936653B2 (en) | Automatically predicting relevant contexts for media items | |
JP7283496B2 (en) | Information processing method, information processing device and program | |
CN112489676B (en) | Model training method, device, equipment and storage medium | |
US11636835B2 (en) | Spoken words analyzer | |
EP2159717A2 (en) | Hybrid audio-visual categorization system and method | |
KR20070121810A (en) | Composite news story collage | |
WO2022250439A1 (en) | Streaming data-based method, server, and computer program for recommending image edit point | |
CN111444379B (en) | Audio feature vector generation method and audio fragment representation model training method | |
Mesaros et al. | Datasets and evaluation | |
CN111309966A (en) | Audio matching method, device, equipment and storage medium | |
JP7140221B2 (en) | Information processing method, information processing device and program | |
Mounika et al. | Music genre classification using deep learning | |
KR20170136200A (en) | Method and system for generating playlist using sound source content and meta information | |
Krause et al. | Classifying Leitmotifs in Recordings of Operas by Richard Wagner. | |
Chowdhury et al. | Towards explaining expressive qualities in piano recordings: Transfer of explanatory features via acoustic domain adaptation | |
Wang et al. | A histogram density modeling approach to music emotion recognition | |
WO2020055173A1 (en) | Method and system for audio content-based recommendations | |
JP2022505875A (en) | How to Perform a Legal Authorization Review of Digital Content | |
JP2008084021A (en) | Movie scenario generation method, program, and apparatus | |
JP7428182B2 (en) | Information processing device, method, and program | |
WO2019088725A1 (en) | Method for automatically tagging metadata of music content using machine learning | |
Smith et al. | Classifying derivative works with search, text, audio and video features | |
CN114582360B (en) | Method, device and computer program product for identifying audio sensitive content | |
WO2020101411A1 (en) | Method for searching for contents having same voice as voice of target speaker, and apparatus for executing same |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19859746; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19859746; Country of ref document: EP; Kind code of ref document: A1