WO2016102738A1 - Similarity determination and selection of music - Google Patents

Similarity determination and selection of music

Info

Publication number
WO2016102738A1
Authority
WO
WIPO (PCT)
Prior art keywords
music tracks
tracks
music
properties
group
Prior art date
Application number
PCT/FI2014/051037
Other languages
French (fr)
Inventor
Antti Eronen
Jussi LEPPÄNEN
Pasi SAARI
Arto Lehtiniemi
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/FI2014/051037 priority Critical patent/WO2016102738A1/en
Publication of WO2016102738A1 publication Critical patent/WO2016102738A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/248 Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/036 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/041 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal based on mfcc [mel-frequency spectral coefficients]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121 Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131 Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141 Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process

Definitions

  • This disclosure relates to determining similarity and similarity-based selection of music tracks.
  • this disclosure relates to assessing and selecting music tracks from a database based on acoustic similarities.
  • Audio content databases, streaming services, online stores and media player software applications often include genre classifications, to allow a user to search for tracks to play, stream and/or download.
  • Some databases, services, stores and applications also include a facility for recommending music tracks to a user based on a history of music that they have accessed in conjunction with other data, such as rankings of tracks or artists from the user, history data from other users who have accessed the same or similar tracks in the user's history or otherwise have similar user profiles, metadata assigned to the tracks by experts and/or users, and so on.
  • an apparatus includes a controller and a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • the input information may, for example, indicate a name of a music track, an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style.
  • the first group of music tracks may contain music tracks by that artist and, optionally, the second group of music tracks may contain music tracks by one or more second artists. If the second group of music tracks contains music tracks by one second artist, then the group level similarity would indicate similarity between the first and second artists.
  • the input information may indicate a particular music track
  • the controller may obtain information regarding one or more of an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style from metadata associated with the particular music track or from extracting information from a local or remote database, and define the first group based on the obtained information.
  • the computer readable instructions when executed by the controller, may further cause the controller to determine a similarity between a first one of said first plurality of music tracks and one of the second plurality of music tracks where said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
  • the track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
  • the properties may include at least one property based on a musical instrument and at least one property based on a musical genre.
  • the properties may include probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
  • the computer readable instructions when executed by the controller, may further cause the controller to monitor a history of music tracks previously accessed by a user, revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and output said revised list.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property.
  • the ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user.
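  • As an illustration of such an adjustment, the Python sketch below re-ranks candidate tracks by boosting candidates that resemble played history tracks and penalising candidates that resemble skipped ones. It is not taken from the patent; the cosine affinity, the boost/penalty weights and the function names are assumptions for illustration only.

        import numpy as np

        def rerank(candidates, history, boost=0.1, penalty=0.1):
            """Adjust candidate similarity scores using a play/skip history.

            candidates: list of (track_id, similarity, property_vector) tuples
            history:    list of (property_vector, was_played) tuples
            Property vectors hold tag probabilities (instruments, genres); all
            parameter values here are illustrative assumptions.
            """
            adjusted = []
            for track_id, sim, props in candidates:
                score = sim
                for hist_props, was_played in history:
                    # Cosine affinity between candidate and history-track properties.
                    affinity = float(np.dot(props, hist_props) /
                                     (np.linalg.norm(props) * np.linalg.norm(hist_props) + 1e-12))
                    # Reward resemblance to played tracks, penalise resemblance to skipped ones.
                    score += boost * affinity if was_played else -penalty * affinity
                adjusted.append((track_id, score))
            # Highest adjusted score first.
            return sorted(adjusted, key=lambda t: t[1], reverse=True)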
  • the computer readable instructions when executed by the controller, may cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
  • the controller may be caused to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
  • a method includes receiving input information regarding at least one music track, determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, selecting one or more music tracks from the second group of music tracks based at least in part on said similarity and outputting a list of said selected music tracks.
  • Such a method may further include determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks, wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
  • the track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
  • the method may also include monitoring a history of music tracks previously accessed by a user, revising the list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and outputting said revised list.
  • the method may include ranking the selected music tracks in the list based on one or more of user preferences for the properties included in the received input, properties of previously accessed music tracks indicated in a user history and a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. For example, such ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user.
  • the method may include determining said properties by evaluating first level probabilities that a particular tag applies based on the extracted acoustic features and evaluating a second level probability that the particular tag applies based on the first level probability.
  • a computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform any of the above methods according to said aspect may also be provided.
  • a non-transitory tangible computer program product in which is stored computer readable instructions that, when executed by a computer, cause the computer to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • an apparatus configured to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • an apparatus includes an interface to receive input information regarding one or more music tracks and to output a list of selected music tracks, an extractor to extract track level attributes associated with a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and a second plurality of music tracks belonging to a second group of music tracks and to determine properties of the first plurality of music tracks and the second plurality of music tracks based on track level attributes of said music tracks, a similarity determination module to determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, and a recommendation engine to select the selected music tracks from the second group of music tracks based at least in part on said similarity.
  • Figure 1 is a schematic diagram of a system in which an embodiment may be included;
  • Figure 2 is a schematic diagram of components of an analysis server according to an embodiment, in the system of Figure 1;
  • Figure 3 is an overview of a method that may be performed by the analysis server of Figure 2;
  • Figure 4 is a flowchart of the method shown in overview in Figure 3;
  • Figure 5 depicts a user interface for use in the method of Figure 3
  • Figure 6 is a flowchart of a method of extracting acoustic features from an input signal, for use in the method of Figure 4;
  • Figure 7 depicts an example of a blocked and windowed input signal
  • Figure 8 depicts an example energy spectrum of a transformed input signal
  • Figure 9 depicts a frequency response of an example filter bank for filtering the transformed input signal shown in Figure 8.
  • Figure 10 depicts an example mel-energy spectrum output from the filter bank represented in Figure 9.
  • Figure 11 is an overview of a process for obtaining multiple types of acoustic features in the method of Figure 4;
  • Figure 12 is an overview of a method of obtaining tag probabilities for use in the method of Figure 4;
  • Figure 13 is a flowchart of the method of Figure 12;
  • Figure 14 shows example probability distributions for instrument-based tags;
  • Figure 15 shows the example probability distributions of Figure 14 after logarithmic transformation;
  • Figure 16 depicts an example track feature vector generated by the method of Figure 13;
  • Figure 17 is an overview of a recommendation procedure that may be performed as a part of the method of Figure 4;
  • Figure 18 is an overview of another recommendation procedure that may be performed as a part of the method of Figure 4;
  • Figure 19 is an overview of yet another recommendation procedure that may be performed as a part of the method of Figure 4.
  • Embodiments described herein concern assessing features of music tracks, determining similarities between music tracks and selecting music tracks based on such similarities, for example, for recommendation to a user.
  • an analysis server 100 is shown connected to a network 102, which can be any data network such as a Local Area Network (LAN), Wide Area Network (WAN) or the Internet.
  • the analysis server 100 is configured to receive and process requests for audio content from one or more terminals 104, 105 via the network 102.
  • two terminals 104, 105 are shown, each incorporating media playback hardware and software, such as a speaker (not shown) and/or audio output jack (not shown) and a processor (not shown) that executes a media player software application to stream and/or download audio content over the network 102 and to play audio content through the speaker.
  • the terminals 104, 105 may be capable of streaming or downloading video content over the network 102 and presenting the video content using the speaker and a display 106.
  • Suitable terminals 104, 105 will be familiar to persons skilled in the art. For instance a smart phone could serve as a terminal 104, 105 in the context of this application although a laptop, tablet or desktop computer may be used instead.
  • Such terminals 104, 105 include music and video playback and data storage functionality and can be connected to the music analysis server 100 via a cellular network, Wi-Fi, Bluetooth® or any other suitable connection such as a cable or wire.
  • the display 106 may be a touch screen display.
  • the analysis server 100 includes a controller 202, an input and output interface 204 configured to transmit and receive data via the network 102, a memory 206 and a mass storage device 208 for storing video and audio data.
  • the controller 202 is connected to each of the other components in order to control operation thereof.
  • the controller 202 may take any suitable form. For instance, it may be a processing arrangement that includes a microcontroller, plural microcontrollers, a processor, or plural processors.
  • the memory 206 and mass storage device 208 may be in the form of a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 206 stores, amongst other things, an operating system 210 and at least one software application 212 to be executed by the controller 202.
  • Random Access Memory (RAM) 214 is used by the controller 202 for the temporary storage of data.
  • the operating system 210 may contain code which, when executed by the controller 202 in conjunction with the RAM 214, controls operation of analysis server 100 and provides an environment in which the or each software application 212 can run.
  • Software application 212 is configured to control and perform audio and video information processing by the controller 202 of the analysis server 100 to determine similarities between music tracks and, optionally, to generate music recommendations.
  • the operation of this software application 212 according to a first embodiment will now be described in detail, with reference to Figures 3 to 4.
  • the accessed music tracks are referred to as input signals.
  • Figure 3 is an overview of a procedure for recommending music tracks to the user of the terminal 104, in which the controller 202 acts as an extractor 30 to extract track level attributes of a music track, a similarity assessment module 31 and a recommendation engine 32.
  • the basis for the recommendation procedure may be information provided by a user of the terminal 104 via the network 102.
  • the user may indicate an artist and/or a track so that similar artists and/or tracks can be identified.
  • the user may provide other information on which the recommendation procedure may be based.
  • the user may indicate one or more of an album, a performer, such as a particular musician, a producer, a record label, a playlist, a musical genre, sub- genre or style, and so on.
  • information regarding music tracks accessed by a user such as the tracks that have been accessed the greatest number of times, the artist with the greatest number of tracks in a library in the terminal 104, recently accessed tracks or recently purchased tracks, may be used to identify an artist and, optionally, a track, referred to as track 1, to use as a basis for generating the recommendations 42.
  • track 1 may be a track selected automatically, such as the track by that artist accessed the most times by the user, or a most popular track by that artist as indicated by a remote database, such as a streaming database, rankings for tracks in a digital music store or information obtained from social media.
  • a first group of music tracks, group 1 is defined based on the information. Where the user has input artist or performer information, or an artist or performer has been identified from other information input by the user or obtained from the user history, the first group may contain multiple tracks by that artist or performer. In another example, if the user has input information identifying an album or record label, the first group may include music tracks from that album or record label.
  • information that may be used as that basis can include one or more of an artist, an album, a performer, a producer, a record label, a playlist, a musical genre, sub-genre or style, and so on, with the first group of music tracks, group 1, being defined according to the basis provided.
  • Attributes 33a to 33c for a first music track 1 of the first group, group 1, and one or more further tracks 2...m of group 1 are obtained from the data stored in the mass storage device 208 or in a remote database.
  • the similarity assessment module 31 defines a combined vector 34 for the first group, group 1, based on some or all of the attributes 33a to 33c obtained for tracks 1...m.
  • attributes 35a to 35c are obtained for a plurality of tracks 1...n of the second group, group 2, from which the recommendations are to be drawn, and a combined vector 36 for group 2 is defined.
  • the second group may contain multiple tracks by a second artist or performer.
  • the second artist may be selected automatically, based on an analysis of attributes 33a to 33c obtained for track 1 of group 1 and/or information from streaming databases, rankings in digital music stores, social media information and so on.
  • such databases may indicate that users who listened to the first artist often listen to certain other artists and one of those other artists may be selected as the second artist and the second group, group 2, defined to include multiple tracks by that second artist.
  • the similarity assessment module 31 determines a group level similarity 37, based on the combined vectors 34, 36 for group 1 and group 2, that is, based on the plurality of tracks 1...m of group 1 and the plurality of tracks 1...n of group 2.
  • the similarity assessment module 31 may also determine one or more track level similarities 38, each based on a vector 39, 40 combining attributes 33a of an individual track of group 1 and the attributes 35a of an individual track of group 2 respectively.
  • a combined group and track similarity 41 may also be computed based on the group level similarity 37 and the track level similarity 38.
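  • The following Python sketch shows one way the group level and track level similarities could be combined: each group's combined vector is taken as the mean of its tracks' tag-probability vectors, cosine similarity is used at both levels, and the two are mixed with a weight. The mean, the cosine measure and the equal weighting are assumptions for illustration, not details specified by the patent.

        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def group_vector(track_vectors):
            # Combined vector for a group (cf. vectors 34 and 36): mean of its track vectors.
            return np.mean(np.asarray(track_vectors), axis=0)

        def combined_similarity(group1_tracks, group2_tracks, track1_vec, track2_vec, w=0.5):
            """Weighted combination of group level and track level similarity.

            group1_tracks, group2_tracks: per-track tag-probability vectors for the two groups
            track1_vec, track2_vec: vectors for the individual tracks being compared
            w: mixing weight (0.5 is an arbitrary illustrative choice)
            """
            group_sim = cosine(group_vector(group1_tracks), group_vector(group2_tracks))
            track_sim = cosine(track1_vec, track2_vec)
            return w * group_sim + (1.0 - w) * track_sim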
  • One or more of the group level similarity 37 and the combined group and track similarity 41 are input to the recommendation engine 32.
  • the recommendation engine may then select music tracks from the video/audio storage 208 or another database, for example, a remote database accessed via the network 102 or other network, as recommendations 42 of music tracks that the user of the terminal 104 might enjoy, based on the input similarities 37, 41 and, optionally, further input from the user of the terminal 104.
  • the recommendations 42 may be output via the I/O interface 204 and transmitted to the terminal 104 for presentation on the display 106.
  • Figure 4 is a flowchart showing further detail of the method described above in relation to Figure 3.
  • a basis for generating the recommendations 42 is obtained (step S4.1).
  • the basis may be provided by the user of the terminal 104.
  • Figure 5 depicts a user interface 50, that may be presented by the display 106, through which the user can provide the basis.
  • the user interface 50 includes fields 51, 52 in which a user can indicate a name for a first artist, and/or a name of a first music track by the first artist, track 1.
  • additional artist and track information may be obtained to supplement the user input as basis for the recommendations 42.
  • one or more sliders 53, 54 may be provided to allow the user to indicate preferences for the type of music tracks to be recommended.
  • sliders 53 are provided for indicating instrument-based preferences and sliders 54 are provided for indicating music genre-based preferences. While Figure 5 depicts sliders 53, 54, in other embodiments, alternative input techniques for obtaining user preferences may be used, such as numerical values indicating relative importance or rankings for the preferences or input arranging the preferences in order of importance to the user.
  • first and second groups of music tracks are defined according to the basis obtained in step S4.1. Examples of ways in which group 1 and group 2 may be defined are discussed above in relation to Figure 3. In one example, where the basis includes information identifying an artist, group 1 may contain one or more music tracks by that artist, while group 2 may contain one or more music tracks by a second artist.
  • the controller 202 obtains attributes of a plurality of tracks 1...m of group 1.
  • attributes may be obtained from metadata associated with the plurality of tracks 1...m indicating, for example, genre of musical tracks or type of artist, obtained from the data storage 208, or a remote database, or information from a streaming service or digital music store.
  • attributes may be obtained by analysing text in social media pages or other webpages. For example, where group 1, is a collection of music tracks by a first artist, an analysis of text on a website for the first artist or a label on which the first artist's music is released and/or reviews of the first artist's music on websites and/or blog pages may be performed and keywords extracted from that text.
  • Another option, which may be combined with one or both of such metadata and such keywords, is to extract acoustic features from audio data of tracks 1...m of group 1.
  • Figure 4 shows attributes 33a to 33c being obtained for individual tracks 1...m of group 1 one by one and used to form a vector 34 for group 1, before attributes 35a to 35c for individual tracks 1...n of group 2 are obtained in turn, in steps S4.4 to S4.8.
  • This sequence may be used, for example, where group 2 is selected based on results from the analysis of the tracks of group 1.
  • attributes 33a to 33c, 35a to 35c for the tracks from groups 1 and 2 may be obtained in a different order than shown in Figure 4, or in parallel.
  • the attributes 33a to 33c, 35a to 35c are acoustic features extracted from audio data of tracks 1...m of group 1 and tracks 1...n of group 2, in the form of probabilities 116 that the tracks include a particular instrument or belong to a particular music genre.
  • the attributes may include one or more of metadata obtained from a database 208, streaming service or digital music store, keywords extracted from text relating to track 1 of group 1 and other audio features of the tracks, as well as, or instead of such tag probabilities 116.
  • at step S6.0, if an input signal for track 1 of group 1 is in a compressed format, such as MPEG-1 Audio Layer 3 (MP3), Advanced Audio Coding (AAC) and so on, the input signal is decoded into pulse code modulation (PCM) data (step S6.1).
  • the samples for decoding are taken at a rate of 44.1 kHz and have a resolution of 16 bits.
  • the controller 202 may, optionally, resample the decoded input signal at a lower rate, such as 22050 Hz (step S6.2).
  • An optional "pre-emphasis" process is shown as step S6.3.
  • the pre-emphasis process filters the decoded input signal to flatten the spectrum of the decoded input signal.
  • the relatively low sensitivity of the human ear to low frequency sounds may be modelled by such flattening.
  • One example of a suitable filter for this purpose is a first-order Finite Impulse Response (FIR) filter with a transfer function of 1 - 0.98z^-1.
  • at step S6.4, the controller 202 blocks the input signal into frames.
  • the frames may include, for example, 1024 or 2048 samples of the input signal, and the subsequent frames may be overlapping or they may be adjacent to each other according to a hop-size of, for example, 50% and 0%, respectively. In other examples, the frames may be non-adjacent so that only part of the input signal is formed into frames.
  • Figure 7 depicts an example in which an input signal 70 is divided into blocks to produce adjacent frames of about 30 ms in length which overlap one another by 25%. However, frames of other lengths and/or overlaps may be used.
  • a Hamming window such as windows 72a to 72d, is applied to the frames at step S6.5, to reduce windowing artifacts.
  • An enlarged portion in Figure 7 depicts a frame 74 following the application of a window to the input signal 70.
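  • A minimal NumPy sketch of the pre-emphasis, blocking and windowing steps described above (steps S6.3 to S6.5) is given below; the frame length, hop size and pre-emphasis coefficient are illustrative values consistent with the examples in the text, not mandated by the patent.

        import numpy as np

        def preemphasise_and_frame(x, frame_len=2048, hop=1024, coeff=0.98):
            """Pre-emphasise the signal, block it into frames and apply a Hamming window.

            hop = frame_len // 2 gives 50% overlapping frames; hop = frame_len would
            give adjacent, non-overlapping frames.
            """
            # First-order FIR pre-emphasis: y[n] = x[n] - 0.98 * x[n-1]
            y = np.append(x[0], x[1:] - coeff * x[:-1])
            assert len(y) >= frame_len, "input shorter than one frame"
            window = np.hamming(frame_len)
            n_frames = 1 + (len(y) - frame_len) // hop
            frames = np.stack([y[i * hop:i * hop + frame_len] * window
                               for i in range(n_frames)])
            return frames  # shape: (n_frames, frame_len)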
  • at step S6.6, a Fast Fourier Transform (FFT) is applied to the windowed signal 74 to produce a magnitude spectrum of the input signal.
  • An example FFT spectrum is shown in Figure 8.
  • the FFT magnitudes may be squared to obtain a power spectrum of the signal for use in place of the magnitude spectrum in the following steps.
  • the spectrum produced by the FFT at step S6.6 may have a greater frequency resolution at high frequencies than is necessary, since the human auditory system is capable of better frequency resolution at lower frequencies but is capable of lower frequency resolution at higher frequencies. So, at step S6.7, the spectrum is filtered to simulate non-linear frequency resolution of the human ear.
  • the filtering at step S6.7 is performed using a filter bank having channels of equal bandwidths on the mel-frequency scale.
  • the mel-frequency scaling may be achieved by setting the channel centre frequencies equidistantly on a mel-frequency scale, given by Equation (1); in its commonly used form, mel(f) = 2595 * log10(1 + f/700), where f is the frequency in Hz.
  • each filtered channel is a sum of the FFT frequency bins belonging to that channel, weighted by a mel-scale frequency response.
  • the weights for filters in an example filter bank are shown in Figure 9.
  • 40 triangular-shaped bandpass filters are depicted whose center frequencies are evenly spaced on a perceptually motivated mel-frequency scale.
  • the filters may span frequencies from 30 Hz to 11025 Hz, in the case of the input signal having a sampling rate of 22050 Hz.
  • the filter heights in Figure 9 have been scaled to unity.
  • Variations may be made in the filter bank in other embodiments, such as spanning the band center frequencies linearly below 1000 Hz, scaling the filters such that they have unit area instead of unity height, varying the number of frequency bands, or changing the range of frequencies spanned by the filters.
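  • The sketch below builds such a triangular mel-spaced filter bank in NumPy, using the common 2595 * log10(1 + f/700) form of the mel scale (an assumption about Equation (1)) and the 40 filters spanning 30 Hz to 11025 Hz mentioned above; applying it to an FFT magnitude spectrum yields the mel-band energies.

        import numpy as np

        def hz_to_mel(f):
            # Common form of the mel scale (assumed for Equation (1)).
            return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

        def mel_to_hz(m):
            return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

        def mel_filterbank(n_filters=40, n_fft=2048, sr=22050, fmin=30.0, fmax=11025.0):
            """Triangular filters with centre frequencies equidistant on the mel scale."""
            mel_points = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
            bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
            fbank = np.zeros((n_filters, n_fft // 2 + 1))
            for i in range(1, n_filters + 1):
                left, centre, right = bin_points[i - 1], bin_points[i], bin_points[i + 1]
                for k in range(left, centre):
                    fbank[i - 1, k] = (k - left) / max(centre - left, 1)
                for k in range(centre, right):
                    fbank[i - 1, k] = (right - k) / max(right - centre, 1)
            return fbank

        # Mel-band energies for one windowed frame:
        # mel_energies = mel_filterbank() @ np.abs(np.fft.rfft(frame, n=2048))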
  • a logarithm, such as a logarithm of base 10, may be taken of the mel-band energies, producing log mel-band energies.
  • An example of a log mel-band energy spectrum is shown in Figure 10.
  • at step S6.9, the MFCCs are obtained by applying a discrete cosine transform to the log mel-band energies.
  • at step S6.10, further mathematical operations may be performed on the MFCCs produced at step S6.9, such as calculating a mean of the MFCCs and/or time derivatives of the MFCCs, to produce the required acoustic features 33a on which the calculation of the tag probabilities 116 will be based.
  • the acoustic features produced at step S6.10 include one or more of:
  • first and, optionally, second time derivatives of the MFCCs, also referred to as "delta MFCCs";
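  • As a compact stand-in for steps S6.6 to S6.10, the following sketch uses the librosa library to compute per-track MFCC statistics of the kind listed above (mean MFCCs, MFCC covariance and delta MFCCs). librosa's internal parameters will not match the patent's pipeline exactly; this is an illustration only.

        import numpy as np
        import librosa

        def mfcc_descriptors(path, sr=22050, n_mfcc=13):
            """Per-track MFCC statistics usable as acoustic features (cf. features 33a/35a)."""
            y, sr = librosa.load(path, sr=sr)                       # decode and resample
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
            delta = librosa.feature.delta(mfcc)                     # first time derivatives
            return {
                "mfcc_mean": mfcc.mean(axis=1),    # average MFCCs over the track
                "mfcc_cov": np.cov(mfcc),          # MFCC covariance matrix
                "delta_mean": delta.mean(axis=1),  # average delta MFCCs
            }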
  • the extracted features are then output (step S6.11).
  • the features output at step S6.11 may also include a fluctuation pattern and danceability features for the track, such as:
  • the mel-band energies calculated in step S6.8 may be used to calculate one or more of the fluctuation pattern features listed above in step S6.10.
  • a sequence of logarithmic domain mel-band magnitude frames is arranged into segments of a desired temporal duration and the number of frequency bands is reduced.
  • an FFT is applied over coefficients of each of the frequency bands across the frames of a segment to compute amplitude modulation frequencies of loudness in a prescribed range, for example, in a range of 1 to 10 Hz.
  • the amplitude modulation frequencies may be weighted and smoothing filters applied.
  • the results of the fluctuation pattern analysis for each segment may take the form of a matrix with rows corresponding to modulation frequencies and columns corresponding to the reduced frequency bands and/or a vector based on those parameters for the segment.
  • the vectors for multiple segments may be averaged to generate a fluctuation pattern vector to describe the music track.
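  • The following NumPy sketch follows that description: log mel-band frames are grouped into segments, the number of bands is reduced, an FFT across the frames of each band yields amplitude modulation spectra, modulation frequencies of roughly 1 to 10 Hz are kept, and the per-segment vectors are averaged. The segment length, the band reduction and the omission of the weighting and smoothing steps are simplifying assumptions.

        import numpy as np

        def fluctuation_pattern(log_mels, frame_rate, seg_frames=256, n_bands=12,
                                mod_lo=1.0, mod_hi=10.0):
            """log_mels: (n_frames, n_mel) log mel-band magnitudes; frame_rate in frames/s."""
            n_frames, n_mel = log_mels.shape
            # Reduce the number of frequency bands by averaging groups of mel bands.
            usable = n_bands * (n_mel // n_bands)
            reduced = log_mels[:, :usable].reshape(n_frames, n_bands, -1).mean(axis=2)
            vectors = []
            for start in range(0, n_frames - seg_frames + 1, seg_frames):
                seg = reduced[start:start + seg_frames]            # (seg_frames, n_bands)
                # FFT over time within each band -> amplitude modulation spectrum.
                mod = np.abs(np.fft.rfft(seg, axis=0))
                freqs = np.fft.rfftfreq(seg_frames, d=1.0 / frame_rate)
                keep = (freqs >= mod_lo) & (freqs <= mod_hi)
                # Matrix with rows = modulation frequencies, columns = reduced bands.
                vectors.append(mod[keep].flatten())
            # Average the per-segment vectors to describe the whole track.
            return np.mean(vectors, axis=0)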
  • Danceability features and club-likeness values are related to beat strength, which may be loosely defined as a rhythmic characteristic that allows discrimination between pieces of music, or segments thereof, having the same tempo. Briefly, a piece of music characterised by a higher beat strength would be assumed to exhibit perceptually stronger and more pronounced beats than another piece of music having a lower beat strength.
  • a danceability feature may be obtained at step S6.10 by detrended fluctuation analysis, which indicates correlations across different time scales, based on the mel-band energies obtained at step S6.8.
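  • A compact sketch of detrended fluctuation analysis is given below, applied to, for example, a summed mel-band energy envelope; it returns the scaling exponent that characterises correlations across time scales. The choice of window sizes is an illustrative assumption.

        import numpy as np

        def dfa_exponent(signal, scales=(16, 32, 64, 128, 256, 512)):
            """Detrended fluctuation analysis scaling exponent of a 1-D envelope."""
            profile = np.cumsum(signal - np.mean(signal))   # integrated, mean-removed profile
            flucts = []
            for s in scales:
                n_win = len(profile) // s
                sq_errs = []
                t = np.arange(s)
                for w in range(n_win):
                    seg = profile[w * s:(w + 1) * s]
                    # Remove the local linear trend within each window.
                    coeffs = np.polyfit(t, seg, 1)
                    sq_errs.append(np.mean((seg - np.polyval(coeffs, t)) ** 2))
                flucts.append(np.sqrt(np.mean(sq_errs)))
            # Slope of log F(s) versus log s gives the scaling exponent.
            return float(np.polyfit(np.log(scales), np.log(flucts), 1)[0])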
  • Examples of techniques of club-likeness analysis, fluctuation pattern analysis and detrended fluctuation analysis are disclosed in British patent application no.
  • the features obtained at step S6.10 may include features relating to tempo in beats per minute (BPM), such as: an average of an accent signal in a low, or lowest, frequency band;
  • a tempo indicator for indicating whether a tempo identified for the input signal is considered constant, or essentially constant, or is considered non-constant or ambiguous;
  • one or more accent signals are derived from the input signal 70, for detection of events and/or changes in the music track.
  • a first one of the accent signals may be a chroma accent signal based on fundamental frequency F0 salience estimation, while a second one of the accent signals may be based on a multi-rate filter bank.
  • a BPM estimate may be obtained based on a periodicity analysis for extraction of a sequence of periodicity vectors on the basis of the accent signals, where each periodicity vector includes a plurality of periodicity values, each periodicity value describing the strength of periodicity for a respective period length, or "lag".
  • a point-wise mean or median of the periodicity vectors over time may be used to indicate a single representative periodicity vector over a time period of the music track. For example, the time period may be over the whole duration of the music track. Then, an analysis can be performed on the periodicity vector to determine a most likely tempo for the music track.
  • One example approach comprises
  • Chorus-related features that may be obtained at step S6.10 include:
  • Example systems and methods that can be used to detect chorus related features are disclosed in US 2008/236371 A1, the disclosure of which is hereby incorporated by reference in its entirety.
  • an average brightness, or spectral centroid (SC), of the music track calculated as a spectral balancing point of a windowed FFT signal magnitude in frames of, for example, 40 ms in length;
  • an average low frequency ratio (LFR);
  • Figure 11 is an overview of a process of extracting multiple acoustic features, some or all of which may be obtained in steps S6.9 and S6.10.
  • Figure 11 shows how some input features are derived, at least in part, from computations of other input features.
  • the features shown in Figure 11 include the MFCCs, delta MFCCs and mel-band energies discussed above in relation to Figure 6, indicated using bold text and solid lines.
  • the dashed lines and standard text indicate other features that may be extracted and made available alongside, or instead of, the MFCCs, delta MFCCs and mel-band energies, for use in calculating the tag probabilities 116.
  • the mel-band energies may be used to calculate fluctuation pattern features and/or danceability features through detrended fluctuation analysis. Results of fluctuation pattern analysis and detrended fluctuation analysis may then be used to obtain a club-likeness value.
  • beat tracking features, labeled as "beat tracking 2" in Figure 11, may be calculated based, in part, on a chroma accent signal from an F0 salience estimation.
  • at step S6.12, tag probabilities 116 and an overall track vector 39 for track 1 of group 1 are evaluated.
  • An overview of an example method for obtaining tag probabilities 116 and creating a track vector 39 is shown in Figure 12.
  • the acoustic features 110 for track 1 of group 1 produced in steps S6.9 and S6.10 are input to first level classifiers 111 to generate first level probabilities for the music track.
  • the first level classifiers 111 include first classifiers 112 and second classifiers 113 to generate first and second probabilities respectively, the second classifiers 113 being different from the first classifiers 112.
  • the first classifiers 112 are non-probabilistic classifiers, while the second classifiers 113 are probabilistic classifiers.
  • the first and second classifiers 112, 113 compute first level probabilities that the music tracks include particular instruments and/or belong to particular musical genres.
  • probabilities based on other acoustic similarities may be included as will be noted hereinbelow.
  • the first level probabilities are input to at least one second level classifier 114.
  • the second level classifier 114 includes a third classifier 115, which may be a non-probabilistic classifier.
  • the third classifier 115 generates the tag probabilities 116 based, at least in part, on the first level probabilities output by the first level classifiers 111 and the second level probabilities output by the second classifiers 113.
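  • The two-level arrangement described above can be sketched with scikit-learn as below: an SVM (first classifier 112) and a pair of Gaussian mixture models (second classifier 113) produce first level probabilities for a tag, and a second SVM (third classifier 115) maps those probabilities to the final tag probability 116. The GMM likelihood-ratio conversion and all hyperparameters are assumptions for illustration, not details taken from the patent.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.mixture import GaussianMixture

        class TwoLevelTagClassifier:
            """First level: SVM + GMM pair per tag; second level: SVM on their outputs."""

            def __init__(self, n_components=8):
                self.svm1 = SVC(kernel="rbf", probability=True)             # first classifier (112)
                self.gmm_pos = GaussianMixture(n_components=n_components)   # second classifier (113)
                self.gmm_neg = GaussianMixture(n_components=n_components)
                self.svm2 = SVC(kernel="rbf", probability=True)             # third classifier (115)

            def _first_level(self, X):
                p_svm = self.svm1.predict_proba(X)[:, 1]
                # Convert the two GMM log-likelihoods into a probability-like score.
                ll_pos = self.gmm_pos.score_samples(X)
                ll_neg = self.gmm_neg.score_samples(X)
                p_gmm = 1.0 / (1.0 + np.exp(ll_neg - ll_pos))
                return np.column_stack([p_svm, p_gmm])

            def fit(self, X, y):
                self.svm1.fit(X, y)
                self.gmm_pos.fit(X[y == 1])
                self.gmm_neg.fit(X[y == 0])
                self.svm2.fit(self._first_level(X), y)
                return self

            def predict_proba(self, X):
                # Tag probability (cf. probabilities 116) from the second level classifier.
                return self.svm2.predict_proba(self._first_level(X))[:, 1]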
  • Figure 13 is a flowchart depicting the method of Figure 12 in more detail.
  • the first and third classifiers 112, 115 are support vector machine (SVM) classifiers and the second classifiers 113 are based on Gaussian Mixture Models (GMM).
  • SVM support vector machine
  • GMM Gaussian Mixture Models
  • following step S13.0, one, some or all of the extracted features 110, or descriptors, obtained in steps S6.9 and S6.10 are selected to be used as input to the first classifiers 112 (step S13.1) and, optionally, normalised (step S13.2).
  • a look up table 216 or database may be stored in the memory 206 of the analysis server 100 that, for each of the tag probabilities to be produced, provides a list of features to be used to generate each first classification and statistics, such as the mean and variance of the selected features, that can be used in normalisation of the extracted features 33a.
  • the controller 202 retrieves the list of features from the look up table 216, and accordingly selects and normalises the listed features for each of the first level probabilities to be generated.
  • the normalisation statistics for each first level probability in the database may be determined during training of the first classifiers 112.
  • the first classifiers 112 are SVM classifiers.
  • the SVM classifiers are trained using a database of music tracks for which information regarding musical instruments and genre is already available.
  • the database may include tens of thousands of tracks that provide examples for each particular musical instrument for which a tag probability 116 is to be evaluated.
  • Examples of musical instruments for which information may be provided in the database include:
  • the training database includes indications of genres that the music tracks belong to, as well as indications of genres that the music tracks do not belong to.
  • Examples of musical genres that may be indicated in the database include:
  • By analysing acoustic features extracted from the music tracks in the training database, for which instruments and/or genre are known, an SVM classifier can be trained to determine whether or not a music track includes a particular instrument, for example, an electric guitar. Similarly, another SVM classifier can be trained to determine whether or not the music track belongs to a particular genre, such as Metal.
  • the training database provides a highly imbalanced selection of music tracks, in that a set of tracks for training a given SVM classifier includes many more positive examples than negative ones.
  • a set of music tracks for training in which the number of tracks that include that instrument is significantly greater than the number of tracks that do not include that instrument will be used.
  • the set of music tracks for training might be selected so that the number of tracks that belong to that genre is significantly greater than the number of tracks that do not belong to that genre.
  • An error cost may be assigned to the different first level probabilities produced by the first classifiers 112 to take account of the imbalances in the training sets. For example, if a minority class of the training set for a particular first classification includes 400 songs and an associated majority class contains 10,000 tracks, an error cost of 1 may be assigned to the minority set and an error cost of 400/ 10,000 may be assigned to the majority class. This allows all of the training data to be retained, instead of downsampling data of the negative examples.
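  • In scikit-learn terms, such error costs can be expressed through per-class weights when training the SVM, for example as in the sketch below; the class labels and the exact weights are illustrative assumptions.

        from sklearn.svm import SVC

        # Error costs for the example above: 1.0 for the 400-track minority class
        # and 400/10,000 for the 10,000-track majority class.
        n_min, n_maj = 400, 10_000
        clf = SVC(kernel="rbf", probability=True,
                  class_weight={1: 1.0, 0: n_min / n_maj})  # 1 = minority, 0 = majority
        # clf.fit(X, y) with y containing labels 1 (minority) and 0 (majority).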
  • New SVM classifiers can be added by collecting new training data and training the new classifiers. Since the SVM classifiers are binary, new classifiers can be added alongside existing classifiers.
  • the training process can include determining a selection of one or more acoustic features to be used as input for particular first classifiers 112 and statistics for normalising those features.
  • the number of features available for selection, M, may be much greater than the number of features selected for determining a particular first classification, N; that is, M >> N.
  • the selection of features to be used is determined iteratively, based on a development set of music tracks for which the relevant instrument or genre information is available, as follows.
  • a check is made as to whether two or more of the features are so highly correlated that the inclusion of more than one of those features would not be beneficial. For example, if two features have a correlation coefficient that is larger than 0.9, then only one of those features is considered available for selection.
  • the feature selection training starts using an initial selection of features, such as the average MFCCs for music tracks in the development set or a single "best" feature for a given first classification. For instance, a feature that yields the largest classification accuracy when used individually may be selected as the "best" feature and used as the sole feature in an initial feature selection. An accuracy of the first classification based on the initial feature selection is determined. Further features are then added to the feature selection to determine whether or not the accuracy of the first classification is improved by their inclusion. Features to be tested for addition to the selection of features may be chosen using a method that combines forward feature selection and backward feature selection in a sequential floating feature selection. Such feature selection may be performed during the training stage, by evaluating the classification accuracy on a portion of the training set.
  • each of the features available for selection is added to the existing feature selection in turn, and the accuracy of the SVM classifier with each candidate feature added is assessed.
  • the feature selection is then updated to include the feature that, when added to the feature selection, provided the largest increase in accuracy for the development set.
  • the accuracy of the SVM classifier is reassessed, by generating probabilities based on edited feature selections in which each of the features in the feature selection is omitted in turn. If it is found that the omission of one or more features provides an improvement in the accuracy of a generated probability, then the feature that, when omitted, leads to the biggest improvement in accuracy is removed from the feature selection. If no such improvement is found, the feature selection is left unchanged.
  • the iterative process of adding and removing features to and from the feature selection continues until the addition of a further feature no longer provides a significant improvement in the accuracy of the SVM classifier. For example, if the improvement in accuracy falls below a given percentage, the iterative process may be considered complete, and the current selection of features is stored in the lookup table 216, for use in selecting features in step S13.1.
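  • The forward/backward floating selection just described can be sketched as follows; the evaluate callable (accuracy on the development set) and the minimum-gain stopping rule are assumptions standing in for the patent's training procedure.

        def floating_feature_selection(candidates, evaluate, initial, min_gain=0.005):
            """Sequential floating feature selection.

            candidates: list of available feature names
            evaluate:   callable mapping a feature subset -> classification accuracy
                        on the development set
            initial:    starting selection, e.g. ["mfcc_mean"] or the single best feature
            """
            selected = list(initial)
            best_acc = evaluate(selected)
            while True:
                # Forward step: try adding each remaining feature in turn.
                gains = {f: evaluate(selected + [f]) for f in candidates if f not in selected}
                if not gains:
                    break
                best_add, add_acc = max(gains.items(), key=lambda kv: kv[1])
                if add_acc - best_acc < min_gain:
                    break                      # no significant improvement: stop
                selected.append(best_add)
                best_acc = add_acc
                # Backward (floating) step: drop a feature if its omission improves accuracy.
                while len(selected) > 1:
                    drops = {f: evaluate([g for g in selected if g != f]) for f in selected}
                    best_drop, drop_acc = max(drops.items(), key=lambda kv: kv[1])
                    if drop_acc <= best_acc:
                        break
                    selected.remove(best_drop)
                    best_acc = drop_acc
            return selected, best_acc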
  • the normalisation of the selected features 110 at step S13.2 is optional. Where provided, it may be based on statistics, such as the mean and variance of the selected features, determined during training of the first classifiers 112 and stored in the look up table 216.
  • a linear feature transform may be applied to the available features 110 obtained in steps S6.9 and s6.io, instead of performing the feature selection procedure described above.
  • For example, using Partial Least Squares Discriminant Analysis (PLS-DA), a linear feature transform is applied to an initial high-dimensional set of features to arrive at a smaller set of features which provides a good discrimination between classes.
  • the initial set of features may include some or all of the available features, such as those shown in Figure 11, from which a reduced set of features can be selected based on the result of the transform.
  • the PLS-DA transform parameters may be optimized and stored in a training stage.
  • the transform parameters and its dimensionality may be optimized for each tag or output classification, such as an indication of an instrument or a genre.
  • the training of the system parameters can be done in a cross-validation manner, for example, as five-fold cross-validation, where all the available data is divided into five non-overlapping sets. At each fold, one of the sets is held out for evaluation and the four remaining sets are used for training. Furthermore, the division of folds may be specific for each tag or classification.
  • the training set is split into 50%-50% inner training-test folds.
  • the PLS-DA transform may be trained on the inner training-test folds and the SVM classifier may be trained on the obtained dimensions.
  • the accuracy of the SVM classifier using the transformed features may be evaluated on the inner test fold. It is noted that, when a feature vector (track) is tested, it is subjected to the same PLS-DA transform, the parameters of which were obtained during training.
  • an optimal dimensionality for the PLS-DA transform may be selected. For example, the dimensionality may be selected such that the area under the receiver operating characteristic (ROC) curve is maximized. In one example embodiment, an optimal dimensionality is selected among candidates between 5 to 40 dimensions.
  • ROC receiver operating characteristic
  • the PLS-DA transform is trained on the whole of the training set, using the optimal number of dimensions, and then used in training the SVM classifier.
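  • A scikit-learn sketch of selecting the PLS-DA dimensionality by the area under the ROC curve is given below, with PLSRegression on binary targets standing in for PLS-DA and a single 50%-50% split standing in for the inner folds; these substitutions are assumptions for illustration.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.svm import SVC
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        def select_plsda_dims(X, y, candidates=range(5, 41, 5)):
            """Pick the PLS-DA dimensionality that maximises ROC AUC for one tag."""
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                                      stratify=y, random_state=0)
            best_dim, best_auc = None, -np.inf
            for dim in candidates:
                pls = PLSRegression(n_components=dim).fit(X_tr, y_tr)  # PLS on binary targets
                clf = SVC(kernel="rbf", probability=True).fit(pls.transform(X_tr), y_tr)
                auc = roc_auc_score(y_te, clf.predict_proba(pls.transform(X_te))[:, 1])
                if auc > best_auc:
                    best_dim, best_auc = dim, auc
            return best_dim, best_auc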
  • other feature transforms such as Linear Discriminant Analysis (LDA), Principal Components Analysis (PCA), or Independent Component Analysis (ICA) could be used.
  • LDA Linear Discriminant Analysis
  • PCA Principal Components Analysis
  • ICA Independent Component Analysis
  • the selected features 110 on which the first classifications are based are the mean of the MFCCs of track 1 of group 1 and the covariance matrix of the MFCCs of track 1 of group 1, although in other examples alternative and/or additional features, such as the other features shown in Figure 11, may be used.
  • the controller 202 defines a feature vector based on each set of selected features 110 or selected combination of features 110 for track 1 of group 1.
  • the feature vectors may then be normalized to have a zero mean and a variance of 1, based on statistics determined and stored during the training process.
  • the controller 202 generates one or more first probabilities that track 1 of group 1 has a certain characteristic, based on the feature vector or vectors.
  • the first classifiers 112 are used to calculate respective probabilities for each feature vector defined in step S13.3.
  • the number of first classifiers 112 corresponds to the number of tag probabilities 116 to be predicted for the music track.
  • a probability is generated by a respective first classifier 112 for each instrument tag probability and for each genre tag probability to be predicted for the music track, based on the mean MFCCs and the MFCC covariance matrix.
  • a probability may be generated by the first classifiers 112 based on whether the music track is likely to be an instrumental track or a vocal track.
  • another probability may be generated by the first classifiers 112 based on whether the vocals are provided by a male or female vocalist.
  • the controller 202 may generate only one or some of these probabilities and/or calculate additional probabilities at step S13.4.
  • the different classifications may be based on respective selections of features from the available features 110 selected in step S13.1.
  • the first classifiers 112 may use a radial basis function (RBF) kernel K, commonly defined as K(x, x') = exp(-gamma * ||x - x'||^2), where gamma > 0 is a kernel width parameter.
  • the output from the first classifiers 112 may be in the form of first classifications based on an optimal predicted probability threshold that separates a positive prediction from a negative prediction for a particular tag probability, based on the probabilities computed by the first classifiers 112.
  • the setting of an optimal predicted probability threshold may be learned in the training procedure to be described later below. Where there is no imbalance in data used to train the first classifiers 112, the optimal predicted probability threshold may be 0.5.
  • the threshold p_thr may be set to a prior probability of a minority class P_min in the first classification, using Equation (4): p_thr = P_min = n_min / (n_min + n_maj), where, in the set of n tracks used to train the SVM classifiers, n_min is the number of tracks in the minority class and n_maj is the number of tracks in a majority class.
• the prior probability P_min may be learned as part of the training of the SVM classifiers.
  • Probability distributions for examples of possible first classifications, based on an evaluation of a number n of tracks, are shown in Figure 14.
• the nine examples in Figure 14 suggest a correspondence between a prior probability for a given first classification and its probability distribution based on the n tracks. Such a correspondence is particularly marked where the SVM classifier was trained with an imbalanced training set of tracks. Consequently, the predicted probability thresholds for the different examples vary over a considerable range.
• a logarithmic transformation may be applied to the probabilities produced by the first classifiers 112 at step S13.4, so that the probabilities are on the same scale and the optimal predicted probability threshold may correspond to a predetermined value, such as 0.5.
  • Equations (5) to (7) below provide an example normalization which adjusts the optimal predicted probability threshold to 0.5.
• if the probability output by an SVM classifier is p and the prior probability P of a particular tag being applicable to a track is greater than 0.5, then the normalized probability p_norm is given by Equations (5) to (7).
• Figure 15 depicts the example probability distributions of Figure 14 after a logarithmic transformation has been applied, on which optimal predicted probability thresholds of 0.5 are marked.
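Equations (5) to (7) are not reproduced in this extract. As a hedged sketch only, one monotone mapping with the required behaviour (it sends the learned threshold p_thr to exactly 0.5 while fixing 0 and 1) is the power law below; it is an assumed stand-in, not the patent's formula.

```python
import numpy as np

def normalise_probability(p, p_thr, eps=1e-12):
    """Map p in [0, 1] so that p == p_thr lands on 0.5 (assumed form, not Eq. (5)-(7)).

    Uses p_norm = p ** (ln 0.5 / ln p_thr), which is monotone increasing,
    fixes 0 and 1, and maps the learned threshold p_thr to exactly 0.5.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_thr = np.clip(p_thr, eps, 1.0 - eps)
    return p ** (np.log(0.5) / np.log(p_thr))
```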
• the probabilities output by the first classifiers 112 correspond to a normalised probability p_norm that a respective one of the tags to be considered applies to track 1 of group 1.
• the first classifications may include probabilities p_inst1 that a particular instrument is included in the music track and probabilities p_gen1 that the music track belongs to a particular genre.
• in steps S13.5 and S13.6, further first level probabilities are generated for the input signal by the second classifiers 113, based on the MFCCs and other parameters produced in step S4.4.
• although Figure 13 shows steps S13.3 to S13.6 being performed in sequence, in another embodiment steps S13.5 and S13.6 may be performed before, or in parallel with, steps S13.3 and S13.4.
• the acoustic features 110 of track 1 of group 1 on which the second classifications are based are the MFCC matrix and the first time derivatives of the MFCCs, and probabilities p_inst2, p_gen2 are generated for each instrument tag (step S13.5) and for each musical genre tag (step S13.6) to be predicted.
  • further probabilities may be generated based on whether the music track is likely to be an instrumental track or a vocal track and, for vocal tracks, another probability may be generated based on whether the vocals are provided by a male or female vocalist.
  • the controller 202 may generate only one or some of these second classifications and/or calculate additional second classifications at steps S13.5 and S13.6.
• the second classifiers 113 compute probabilities p_inst2, p_gen2 using probabilistic models that have been trained to represent the distribution of features extracted from audio signals captured from each instrument or genre. Such training can be performed using an expectation maximisation algorithm that iteratively adjusts the model parameters to maximise the likelihood of the model for a particular instrument or genre generating features matching one or more input features in the captured audio signals for that instrument or genre.
  • the parameters of the trained probabilistic models may be stored in a database, for example, in the database 208 of the analysis server, or in remote storage that is accessible to the analysis server 100 via a network, such as the network 102.
• the instrument-based probabilities p_inst2 are produced by the second classifiers 113 using first and second Gaussian Mixture Models (GMMs), based on the MFCCs and their first time derivatives calculated in step S13.5.
• the probabilities p_gen2 that the music track belongs to a particular musical genre are produced by the second classifiers 113 using third GMMs.
• the first and second GMMs used to compute the instrument-based probabilities p_inst2 may be trained and used slightly differently from the third GMMs used to compute the genre-based probabilities p_gen2, as will now be explained.
• the first and second GMMs used in step S13.5 may have been trained with an Expectation Maximisation algorithm using a training set containing examples which are known to include the instrument and examples which are known not to include the instrument. For each track in the training set, MFCC feature vectors and their corresponding first time derivatives are computed. The MFCC feature vectors for the examples in the training set that contain the instrument are used to train a first GMM for that instrument, while the MFCC feature vectors for the examples that do not contain the instrument are used to train a second GMM for that instrument. In this manner, for each instrument to be tagged, two GMMs are produced.
  • the first GMM is for a track that includes the instrument and is used to evaluate the likelihood L yes
  • the second GMM is for a track that does not include the instrument and is used to evaluate the likelihood L no .
  • the first and second GMMs each contain 64 component Gaussians.
  • the first and second GMMs may then be refined by discriminative training, for example using maximum mutual information (MMI) criterion on a balanced training set where, for each instrument to be tagged, the number of example tracks that contain the instrument is equal to the number of example tracks that do not contain the instrument.
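A minimal sketch of the EM training stage for the per-instrument "yes"/"no" GMMs, using scikit-learn, is shown below. The data layout and the diagonal covariance choice are assumptions, and the MMI-based discriminative refinement mentioned above is not sketched.

```python
# Hypothetical sketch: fit one GMM on frames from tracks containing the instrument
# and one on frames from tracks that do not. 64 components follows the example text.
from sklearn.mixture import GaussianMixture

def train_instrument_gmms(frames_with_instrument, frames_without_instrument,
                          n_components=64, seed=0):
    """Each argument is an (n_frames, n_features) array of stacked MFCC and
    delta-MFCC frame vectors pooled over the relevant training tracks."""
    gmm_yes = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=seed).fit(frames_with_instrument)
    gmm_no = GaussianMixture(n_components=n_components, covariance_type="diag",
                             random_state=seed).fit(frames_without_instrument)
    return gmm_yes, gmm_no
```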
• the two likelihoods L_yes, L_no are computed based on the first and second GMMs and the MFCCs for the music track.
• the first is the likelihood L_yes that the corresponding instrument is included in the music track, while the second is the likelihood L_no that the instrument is not included in the music track.
• the first and second likelihoods L_yes, L_no may be computed in a log-domain, and then converted to a linear domain.
• the first and second likelihoods L_yes, L_no are then mapped to a tag probability p_inst2 of the instrument being included in the track, as follows:
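The mapping equation itself is not reproduced in this extract. A minimal sketch, assuming the common choice p_inst2 = L_yes / (L_yes + L_no) evaluated from per-frame average log-likelihoods, is:

```python
import numpy as np

def instrument_probability(gmm_yes, gmm_no, frames):
    """Map the two GMM likelihoods to a tag probability p_inst2 (assumed mapping).

    Uses GaussianMixture.score, the average per-frame log-likelihood, and a
    max-shift before exponentiation for numerical stability.
    """
    log_l_yes = gmm_yes.score(frames)   # average per-frame log-likelihood
    log_l_no = gmm_no.score(frames)
    m = max(log_l_yes, log_l_no)
    l_yes, l_no = np.exp(log_l_yes - m), np.exp(log_l_no - m)
    return l_yes / (l_yes + l_no)
```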
• the third GMMs used to compute genre-based probabilities p_gen2 are trained differently from the first and second GMMs. For each genre to be considered, a third GMM is trained based on MFCCs for a training set of tracks known to belong to that genre. One third GMM is produced for each genre to be considered. In this example, the third GMM includes 64 component Gaussians.
  • a likelihood L gen is computed for the track 1 of group 1 belonging to that genre, based on the likelihood of each of the third GMMs being capable of outputting the MFCC feature vector of the music track. For example, to determine which of the eighteen genres in the list hereinabove might apply to the music track, eighteen likelihoods would be produced.
  • m is the number of genre tags to be considered.
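The normalisation over the m genre likelihoods is likewise not reproduced here. A hedged sketch, assuming each likelihood is divided by the sum over all m genre models, is:

```python
import numpy as np

def genre_probabilities(genre_gmms, frames):
    """Turn the m genre likelihoods L_gen into tag probabilities p_gen2 (assumed form)."""
    log_l = np.array([gmm.score(frames) for gmm in genre_gmms])  # one value per genre
    log_l -= log_l.max()                  # stabilise before exponentiation
    lik = np.exp(log_l)
    return lik / lik.sum()
```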
  • first and second GMMs may be trained and used in the manner described above for the third GMMs.
  • the GMMs used for analysing genre may be trained and used in the same manner, using either of techniques described in relation to the first, second and third GMMs above.
• the first classifications p_inst1 and p_gen1 and the second classifications p_inst2 and p_gen2 for track 1 of group 1 are then normalized to have a mean of zero and a variance of 1 (step S13.7) and collected to form a feature vector for input to the one or more second level classifiers 115 (step S13.8).
  • the second level classifiers 115 include third classifiers 116, as noted above, and the third classifiers 116 are non-probabilistic classifiers, such as SVM classifiers trained in a similar manner to that described above in relation to the first classifiers 112.
• the first classifiers 112 and the second classifiers 113 may be used to output probabilities p_inst1, p_gen1, p_inst2, p_gen2 for the training sets of example music tracks from the database.
  • the outputs from the first and second classifiers 112, 113 are then used as input data to train the third classifiers 116.
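A sketch of this second level ("stacking") arrangement is given below; the array shapes and the use of scikit-learn's StandardScaler and SVC are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch: stack the first level tag probabilities into one feature
# vector per track, normalise them, and train a further SVM per tag.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_third_classifier(p_inst1, p_gen1, p_inst2, p_gen2, labels):
    """Each p_* argument is an (n_tracks, n_tags) array of first level probabilities;
    labels is an (n_tracks,) array of 0/1 ground-truth values for one tag."""
    X = np.hstack([p_inst1, p_gen1, p_inst2, p_gen2])
    scaler = StandardScaler().fit(X)                  # zero mean, unit variance
    clf = SVC(kernel="rbf", probability=True).fit(scaler.transform(X), labels)
    return scaler, clf
```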
• the third classifiers 116 determine second level probabilities p_inst3 for whether track 1 of group 1 contains a particular instrument and/or second level probabilities p_gen3 for whether track 1 of group 1 belongs to a particular genre (step S13.9).
  • the third classifiers 116 are SVM classifiers
• the second level probabilities p_inst3, p_gen3 are generated in a similar manner to the first level probabilities p_inst1, p_gen1 computed by the first classifiers 112.
• the second level probabilities p_inst3, p_gen3 are then log normalised (step S13.11), as described above in relation to the first level probabilities p_inst1, p_gen1 from the first classifiers 112, and output as the tag probabilities 116 at step S13.11.
• tags based on the tag probabilities 116 may be associated with the music track at step S13.11.
• where the tag probabilities 116 exceed a probability threshold, such as 0.5 for normalised probabilities, tags corresponding to the instruments and/or genres may be stored in a database entry for the music track in the database 208.
• the track vector 39 is then generated at step S13.12 from the tag probabilities 116 output at step S13.11 and normalised.
  • An example of a track vector 39 is shown in Figure 16.
  • the track vector 39 reflects non-zero probabilities for the music track being a rock song including lead and backing vocals, bass, drums, electric guitar, keyboard and percussion.
• the first level and/or second level probabilities p_inst1, p_gen1, p_inst2, p_gen2, p_inst3, p_gen3 themselves and/or the features 110 extracted at steps S6.9 and S6.10 may be output for further analysis and/or storage.
  • the tag probability calculation process ends at step S13.13.
• steps S4.4 and S4.5 are repeated to obtain attributes 33b, 33c for tracks 2 to m of group 1, until no further tracks of group 1 remain to be analysed (step S4.5).
  • step S13.12 to create track vectors 39 for tracks 2...m of group 1 is optional.
• attributes 33a to 33c are obtained for tracks 1...m of group 1, while a track vector 39 may be generated for one, some or all of the tracks 1...m of group 1.
• the combined vector 34 for group 1 is then created (step S4.6), based on the tag probabilities 116 generated at step S4.4 for tracks 1...m of group 1.
  • the feature vector 34 may be created by summing the tag probabilities 116 for all of the analysed tracks 1 to m of group 1 and, optionally, normalising the sum.
• the attributes 35a to 35c are obtained in turn (steps S4.7, S4.8) and, for at least one of the tracks 1...n of group 2, a track vector 40 is created, as described above in relation to steps S4.4 to S4.6 and Figures 6 and 13, until no further tracks of group 2 remain to be analysed (step S4.8).
• the creation of a track vector 40 may be performed at step S13.12 for one, some or all of the tracks 1...n of group 2.
• a combined vector 36 for group 2 is then created (step S4.9), for example by summing the tag probabilities 116 for the tracks 1...n of group 2, and, optionally, normalising the sum.
• the group level similarity 37 for the tracks of groups 1 and 2 is calculated by evaluating the similarity between the combined vectors 34, 36 for groups 1 and 2 (step S4.10). For example, if the combined vectors 34, 36 for artists 1 and 2 are denoted by a and b, their similarity sim(a, b) can be measured with a cosine similarity defined as shown by Equation (11):
sim(a, b) = (a · b) / (‖a‖ ‖b‖)    (11)
• one alternative technique may include using the Euclidean distance and taking its inverse to obtain the similarity sim(a, b).
  • Another example technique for assessing similarity may use the Kullback-Leibler divergence.
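The three similarity measures mentioned above can be sketched as follows; the mapping of the Kullback-Leibler divergence to a similarity score via 1 / (1 + D) is an assumption, as the text here does not specify one.

```python
import numpy as np

def cosine_similarity(a, b):
    """Equation (11): sim(a, b) = a . b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def inverse_euclidean_similarity(a, b, eps=1e-9):
    """Alternative: inverse of the Euclidean distance (eps avoids division by zero)."""
    return 1.0 / (np.linalg.norm(a - b) + eps)

def kl_similarity(a, b, eps=1e-12):
    """Alternative: similarity from a symmetrised Kullback-Leibler divergence between
    the combined vectors (assumed non-negative), mapped to a score via 1 / (1 + D)."""
    p = a / a.sum() + eps
    q = b / b.sum() + eps
    d = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return 1.0 / (1.0 + d)
```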
• One or more track level similarities 38 are assessed at step S4.11, based on the similarity of the track vectors 39, 40.
  • the similarity of the track vectors 39, 40 may be assessed using Equation (11) above.
• a combined group and track similarity 41 may then be determined, for example, by summing the group level similarity 37 and the track level similarity 38 (step S4.12).
• the group level similarity 37, the track level similarity 38 and the combined group and track similarity 41 are normalised so that they have values in a range between 0 and 1.
  • Similarities between group 1 and one or more further groups of music tracks may be computed by repeating steps S4.7 to S4.12 for additional groups and generating respective group level similarities 37, track level similarities 38 and combined group and track similarities 41 for each additional group.
  • group 1 contains tracks by a first artist and group 2 contains tracks by a second artist
  • groups 3 and 4 may be defined, containing tracks by a third artist, a fourth artist respectively, and so on.
• a list of recommendations 42 of tracks is compiled from the music tracks of group 2 and, where provided, any further groups of music tracks that have been analysed, based on one or both of the group level similarity 37 and, optionally, the combined group and track similarity 41.
• a list of tracks exhibiting the highest combined group and track similarity 41 and/or other similarity 37, 38 may be compiled at step S4.13 and output to the user (step S4.14).
  • the list of recommendations 42 may be ranked and/or revised as part of the compilation (step S4.15). Examples of compilation procedures that may be performed at step S4.15 will now be described, with reference to Figures 17, 18 and 19.
• Figures 17 to 19 show example procedures for generating the list of recommendations 42.
  • the list of preliminary candidates is revised based on user preferences input by the user, for example, by using the sliders 53, 54 in the user interface 50 shown in Figure 5.
• the user may have indicated that they would like to receive recommendations of tracks that are jazzier and include more piano than a particular track indicated in field 52 of the user interface 50, corresponding to track 1 of group 1, but include fewer stringed instruments, using the sliders 53, 54.
  • the tag probability 116 corresponding to a first property indicated by the user is identified (step S17.3) and the relevant tag probabilities for the candidate tracks are retrieved or otherwise obtained (step S17.4) and adjusted as follows.
• if the user input indicated a positive contribution for the property (step S17.5), such as "more jazz", the tag probabilities 116 for a genre of "jazz" for the candidate tracks are added to the values calculated for one or more of the similarities between track 1 of group 1 and the candidate tracks (step S17.6). If the user input indicated a negative contribution for the property (step S17.5), such as "less strings", the tag probabilities 116 for stringed instruments for the candidate tracks are subtracted from the similarity values for the candidate tracks (step S17.7). If another preference has been indicated by the user (step S17.8), then steps S17.3 to S17.7 are repeated for the next preference, until the similarity values have been adjusted for all the received user preferences (step S17.8).
• the candidate tracks are then ranked based on their adjusted similarities (step S17.9), completing the procedure (step S17.10).
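A compact sketch of the Figure 17 adjustment loop is shown below. The dictionary-based data structures and the +1/-1 encoding of "more"/"less" preferences are illustrative assumptions.

```python
# Hypothetical sketch: each user preference adds or subtracts the corresponding tag
# probability to/from a candidate track's similarity score before re-ranking.
def adjust_and_rank(candidates, preferences):
    """candidates: list of dicts with keys "track", "similarity" and "tags"
    (mapping tag name to tag probability); preferences: dict mapping tag name
    to +1.0 ("more") or -1.0 ("less")."""
    adjusted = []
    for cand in candidates:
        score = cand["similarity"]
        for tag, sign in preferences.items():
            # Add or subtract the candidate's tag probability (steps S17.5 to S17.7)
            score += sign * cand["tags"].get(tag, 0.0)
        adjusted.append((score, cand["track"]))
    # Rank with the highest adjusted similarity first (step S17.9)
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in adjusted]
```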
  • the list of recommendations 42 output at step S4.14 may be based on a selected subset, or on all, of the candidate tracks in the ranked list of candidate tracks. For example, a predetermined number of the highest ranked candidate tracks may be selected for inclusion in the list of recommendations 42.
• a preliminary list of candidate tracks from group 2 is obtained (step S18.1), for example by compiling the list based on one or more of the similarities calculated in steps S4.10 to S4.12 as described above in relation to Figure 17.
  • the candidate tracks in the preliminary list are ranked based on user preferences as indicated by the user's listening history, as will now be explained.
• a user history is obtained in step S18.2.
• the user history may be based on the number of times the user has previously accessed music tracks stored on the terminal 104, another database or a streaming service, tracks ranked by the user on social media or an online music database, and/or on tracks purchased by the user from a digital music store.
  • the controller 202 obtains a tag indicating a user preference from the tracks in the history (step S18.3). For example, the controller 202 may determine the most common tag for the previously accessed tracks shown in the user history. The corresponding tag probabilities 116 for the candidate tracks from group 2 are then retrieved (step S18.4) .
• if the obtained tag seems to be viewed positively by the user (step S18.5), for example, if the tag occurs most often in previously accessed tracks that were played, downloaded or purchased by the user, then the tag probabilities 116 for the candidate tracks are added to one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.6).
• if the obtained tag seems to indicate a negative user preference, for example if the tag occurs most often in previously accessed tracks that were skipped by the user (step S18.5), then the tag probabilities 116 for the candidate tracks are subtracted from one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.7). Steps S18.3 to S18.7 may be repeated for further tags, if required (step S18.8).
  • the candidate tracks are then ranked based on their adjusted similarities (step S18.9), completing the procedure (step S18.10) .
  • the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks produced at step S18.9.
  • a list of preliminary candidates is obtained (step S19.1), as discussed above in relation to Figure 17.
  • the preliminary candidates are ranked based on properties other than those on which the tag probabilities 116 are based.
• the controller 202 determines whether the user history includes a previously accessed music track that has a tag that was not included in track 1 of group 1 (step S19.2).
• a previously accessed music track in the user history may include instruments that were not included in track 1 of group 1, or belong to a different genre from track 1 of group 1. In the following, such a tag is referred to as a "new tag".
• the tag probabilities 116 for the new tag are retrieved for the candidate tracks (step S19.3). If the user history indicates that the user listened to the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are added to their respective similarities (step S19.5).
• if the user history indicates that the user skipped the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are subtracted from their respective similarities (step S19.6).
• if there are further new tags in the previously accessed track (steps S19.7, S19.8), then tag probabilities 116 for the candidate tracks for the further new tags are also added to, or subtracted from, the similarities as appropriate (steps S19.5, S19.6). If required, the controller 202 may then search for another previously accessed track with at least one new tag in the user history (steps S19.9, S19.2), to further adjust the similarities of the candidate tracks (steps S19.3 to S19.9). The candidate tracks are then ranked based on their adjusted similarities (step S19.10), completing the procedure (step S19.11).
• the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks.
  • one or more of the methods described above with reference to Figures 17 to 19 may be used to compile the list of recommendations 42 at step S4.13.
• the list of recommendations 42 is output at step S4.14 via the I/O interface 204.
  • the list is transmitted to the user's terminal 104 via the network 102.
  • the terminal 104 may present the list of recommendations 42 to the user as a list of music tracks, optionally with links to access the recommended tracks from a streaming service or database or to purchase the recommended tracks from a digital music store.
• the recommendations include music tracks in a library accessible by the terminal 104, for example stored in storage of the terminal 104.
  • the list of recommendations 42 may include, or take the form of, a playlist to be followed by a media player software application stored in the terminal 104.
  • the procedure for recommending music tracks may end at this point (step S4.15)
• the analysis server 100 may receive and monitor user history information (step S4.16) from the terminal 104 after the list of recommendations 42 has been output (step S4.14) and determine whether the list of recommendations 42 should be revised (step S4.17).
  • the controller 202 may determine that revision is needed to adjust the recommendations 42 based on whether the user has listened to, or skipped, tracks in the existing list of recommendations 42.
• if revision is needed, the controller 202 revises the list of recommendations 42 (step S4.18).
• the controller 202 may update the list of recommendations 42 based on tags from the recommended tracks that the user has listened to, or skipped, by performing the method of Figure 18, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S18.1 and the received updated user history as the user history obtained in step S18.2.
  • the controller 202 may update the list of recommendations 42 if a previously accessed music track appearing in the updated user history includes a new tag, using the method of Figure 19, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S19.1.
  • the revised list of recommendations 42 based on the new rankings is then output (step S4.14) .
• steps S4.14 to S4.18 may continue until further revision is not needed (step S4.15), for example, if the user of the terminal 104 pauses or stops music playback, closes the media player application or switches off the terminal 104.
  • Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic.
  • the software, application logic and/or hardware may reside on memory, or any computer media.
  • the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media.
  • a "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
  • a computer-readable medium may comprise a computer-readable storage medium that may be any tangible media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer as defined previously.
  • the computer-readable medium may be a volatile medium or a non-volatile medium.
  • the computer program according to any of the above aspects may be implemented in a computer program product comprising a tangible computer-readable medium bearing computer program code embodied therein which can be used with the processor for the implementation of the functions described above.
• references to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuit" etc. should be understood to encompass not only computers having differing architectures such as single/multi processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices.
• references to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware, such as the programmable content of a hardware device, whether instructions for a processor or configured or configuration settings for a fixed function device, gate array, programmable logic device, etc.
  • the different functions discussed herein may be performed in a different order and/or concurrently with each other.
  • one or more of the above-described functions may be optional or may be combined.

Abstract

A method comprises determining properties of a first group of music tracks, e.g. by a first artist, and a second group of music tracks, e.g. by a second artist, based on track level attributes, determining a similarity between the first and second groups based at least in part on determined properties of the tracks, selecting one or more tracks from the second group based at least in part on said similarity, and outputting a list of said selected tracks. The similarity may include group level, track level, and combined group and track level similarities. The track level attributes may be acoustic features extracted from the tracks, tags, metadata or other data, such as keywords extracted from reviews of the tracks. The method may include ranking and/or revising the list based on one or more of user preferences, a user history and/or whether a user plays or skips the selected tracks.

Description

Similarity determination and selection of music
Field
This disclosure relates to determining similarity and similarity-based selection of music tracks. In particular, this disclosure relates to assessing and selecting music tracks from a database based on acoustic similarities.
Background
Audio content databases, streaming services, online stores and media player software applications often include genre classifications, to allow a user to search for tracks to play, stream and/or download.
Some databases, services, stores and applications also include a facility for recommending music tracks to a user based on a history of music that they have accessed in conjunction with other data, such as rankings of tracks or artists from the user, history data from other users who have accessed the same or similar tracks in the user's history or otherwise have similar user profiles, metadata assigned to the tracks by experts and/or users, and so on.
Summary
According to an aspect, an apparatus includes a controller and a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks. The input information may, for example, indicate a name of a music track, an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style. For example, where the input information indicates an artist, the first group of music tracks may contain music tracks by that artist and, optionally, the second group of music tracks may contain music tracks by one or more second artists. If the second group of music tracks contains music tracks by one second artist, then the group level similarity would indicate similarity between the first and second artists.
In some embodiments, the input information may indicate a particular music track, in which case the controller may obtain information regarding one or more of an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style from metadata associated with the particular music track or from extracting information from a local or remote database, and define the first group based on the obtained information.
The computer readable instructions, when executed by the controller, may further cause the controller to determine a similarity between a first one of said first plurality of music tracks and one of the second plurality of music tracks where said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
The track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
The properties may include at least one property based on a musical instrument and at least one property based on a musical genre. For example, the properties may include probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
The computer readable instructions, when executed by the controller, may further cause the controller to monitor a history of music tracks previously accessed by a user, revise said list of selected music tracks based on the properties of the selected music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and output said revised list.
The computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
The computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history. Alternatively, or additionally, the computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. Where the selected music tracks based at least in part on the user history, the ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user. The computer readable instructions, when executed by the controller, may cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability. For example, the controller may be caused to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a
probabilistic classifier. According to another aspect, a method includes receiving input information regarding at least one music track, determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, selecting one or more music tracks from the second group of music tracks based at least in part on said similarity and outputting a list of said selected music tracks.
Such a method may further include determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks, wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks. The track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks. The method may also include monitoring a history of music tracks previously accessed by a user, revising the list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and outputting said revised list.
The method may include ranking the selected music tracks in the list based on one or more of user preferences for the properties included in the received input, properties of previously accessed music tracks indicated in a user history and a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. For example, such ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user. In an example embodiment, determining said properties may comprise evaluating first level probabilities that a particular tag applies based on the extracted acoustic features and evaluating a second level probability that the particular tag applies based on the first level probability.
A computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform any of the above methods according to said aspect may also be provided.
According to yet another aspect, a non-transitory tangible computer program product in which is stored computer readable instructions that, when executed by a computer, cause the computer to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
According to a further aspect, an apparatus is configured to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks. According to a yet further aspect, an apparatus includes an interface to receive input information regarding one or more music tracks and to output a list of selected music tracks, an extractor to extract track level attributes associated with a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and a second plurality of music tracks belonging to a second group of music tracks and to determine properties of the first plurality of music tracks and the second plurality of music tracks based on track level attributes of said music tracks, a similarity determination module to determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks and a
recommendation engine to select the selected music tracks from the second group of music tracks based at least in part on said similarity.
Brief description of the drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, of which:
Figure 1 is a schematic diagram of a system in which an embodiment may be included;
Figure 2 is a schematic diagram of components of an analysis server according to an embodiment, in the system of Figure 1;
Figure 3 is an overview of a method that may be performed by the analysis server of Figure 2;
Figure 4 is a flowchart of the method shown in overview in Figure 3;
Figure 5 depicts a user interface for use in the method of Figure 3;
Figure 6 is a flowchart of a method of extracting acoustic features from an input signal, for use in the method of Figure 4;
Figure 7 depicts an example of a blocked and windowed input signal;
Figure 8 depicts an example energy spectrum of a transformed input signal;
Figure 9 depicts a frequency response of an example filter bank for filtering the transformed input signal shown in Figure 8;
Figure 10 depicts an example mel-energy spectrum output from the filter bank represented in Figure 9;
Figure 11 is an overview of a process for obtaining multiple types of acoustic features in the method of Figure 4;
Figure 12 is an overview of a method of obtaining tag probabilities for use in the method of Figure 4;
Figure 13 is a flowchart of the method of Figure 12;
Figure 14 shows example distributions for instrument-based tag probabilities;
Figure 15 shows the example probability distributions of Figure 14 after logarithmic transformation;
Figure 16 depicts an example track feature vector generated by the method of Figure 13;
Figure 17 is an overview of a recommendation procedure that may be performed as a part of the method of Figure 4;
Figure 18 is an overview of another recommendation procedure that may be performed as a part of the method of Figure 4;
Figure 19 is an overview of yet another recommendation procedure that may be performed as a part of the method of Figure 4.
Detailed description
Embodiments described herein concern assessing features of music tracks, determining similarities between music tracks and selecting music tracks based on such similarities, for example, for recommendation to a user.
Referring to Figure 1, an analysis server 100 is shown connected to a network 102, which can be any data network such as a Local Area Network (LAN), Wide Area Network (WAN) or the Internet. The analysis server 100 is configured to receive and process requests for audio content from one or more terminals 104, 105 via the network 102.
In the present example, two terminals 104, 105 are shown, each incorporating media playback hardware and software, such as a speaker (not shown) and/or audio output jack (not shown) and a processor (not shown) that executes a media player software application to stream and/or download audio content over the network 102 and to play audio content through the speaker. As well as audio content, the terminals 104, 105 may be capable of streaming or downloading video content over the network 102 and presenting the video content using the speaker and a display 106. Suitable terminals 104, 105 will be familiar to persons skilled in the art. For instance, a smart phone could serve as a terminal 104, 105 in the context of this application, although a laptop, tablet or desktop computer may be used instead. Such terminals 104, 105 include music and video playback and data storage functionality and can be connected to the music analysis server 100 via a cellular network, Wi-fi, Bluetooth® or any other suitable connection such as a cable or wire. Optionally, the display 106 may be a touch screen display.
As shown in Figure 2, the analysis server 100 includes a controller 202, an input and output interface 204 configured to transmit and receive data via the network 102, a memory 206 and a mass storage device 208 for storing video and audio data. The controller 202 is connected to each of the other components in order to control operation thereof. The controller 202 may take any suitable form. For instance, it may be a processing arrangement that includes a microcontroller, plural microcontrollers, a processor, or plural processors. The memory 206 and mass storage device 208 may be in the form of a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 206 stores, amongst other things, an operating system 210 and at least one software application 212 to be executed by the controller 202. Random Access Memory (RAM) 214 is used by the controller 202 for the temporary storage of data.
The operating system 210 may contain code which, when executed by the controller 202 in conjunction with the RAM 214, controls operation of analysis server 100 and provides an environment in which the or each software application 212 can run.
Software application 212 is configured to control and perform audio and video information processing by the controller 202 of the analysis server 100 to determine similarities between music tracks and, optionally, to generate music recommendations. The operation of this software application 212 according to a first embodiment will now be described in detail, with reference to Figures 3 to 4. In the following, the accessed music tracks are referred to as input signals.
Figure 3 is an overview of a procedure for recommending music tracks to the user of the terminal 104, in which the controller 202 acts as an extractor 30 to extract track level attributes of a music track, similarity assessment module 31 and recommendation engine 32. The basis for the recommendation procedure may be information provided by a user of the terminal 104 via the network 102. For example, the user may indicate an artist and/or a track so that similar artists and/or tracks can be identified. Alternatively, or additionally, the user may provide other information on which the recommendation procedure may be based. For example, the user may indicate one or more of an album, a performer, such as a particular musician, a producer, a record label, a playlist, a musical genre, sub- genre or style, and so on.
Alternatively, or additionally, information regarding music tracks accessed by a user, such as the tracks that have been accessed the greatest number of times, the artist with the greatest number of tracks in a library in the terminal 104, recently accessed tracks or recently purchased tracks, may be used to identify an artist and, optionally, a track, referred to as track 1, to use as a basis for generating
recommendations. If only a name for track 1 is indicated, an artist may be identified from metadata of track 1. If only an artist is indicated, track 1 may be a track selected automatically, such as the track by that artist accessed the most times by the user, or a most popular track by that artist as indicated by a remote database, such as a streaming database, rankings for tracks in a digital music store or information obtained from social media.
A first group of music tracks, group 1, is defined based on the information. Where the user has input artist or performer information, or an artist or performer has been identified from other information input by the user or obtained from the user history, the first group may contain multiple tracks by that artist or performer. In another example, if the user has input information identifying an album or record label, the first group may include music tracks from that album or record label.
As noted above, in other examples, information that may be used as that basis can include one or more of an artist, album, a performer, a producer, a record label, a playlist, a musical genre, sub-genre or style, and so on, by defining the first group of music tracks, group 1, according to the basis provided.
Attributes 33a to 33c for a first music track 1 of the first group, group 1, and one or more further tracks 2...m of group 1 are obtained from the data stored in video/audio storage 208 or a remote database, or obtained from social media information, other websites and so on. The similarity assessment module 31 defines a combined vector 34 for the first group, group 1, based on some or all of the attributes 33a to 33c obtained for tracks 1...m.
Similarly, attributes 35a to 35c are obtained for a plurality of tracks 1...n of the second group, group 2, from which the recommendations are to be drawn, and a combined vector 36 for group 2 is defined. For example, where the recommendation is to be based on a first artist, album or performer, the second group may contain multiple tracks by a second artist or performer. The second artist may be selected automatically, based on an analysis of attributes 33a to 33c obtained for track 1 of group 1 and/or information from streaming databases, rankings in digital music stores, social media information and so on. For example, such databases may indicate that users who listened to the first artist often listen to certain other artists and one of those other artists may be selected as the second artist and the second group, group 2, defined to include multiple tracks by that second artist.
The similarity assessment module 31 then determines a group level similarity 37, based on the combined vectors 34, 36 for group 1 and group 2, based on the plurality of tracks 1...m of group 1 and the plurality of tracks 1...n of group 2. The similarity assessment module 31 may also determine one or more track level similarities 38, each based on a vector 39, 40 combining attributes 33a of an individual track of group 1 and the attributes 35a of an individual track of group 2 respectively. A combined group and track similarity 41 may also be computed based on the group level similarity 37 and the track level similarity 38.
One or more of the group level similarity 37 and the combined group and track similarity 41 are input to the recommendation engine 32. The recommendation engine may then select music tracks from the video/audio storage 208 or another database, for example, a remote database accessed via the network 102 or other network, as recommendations 42 of music tracks that the user of the terminal 104 might enjoy, based on the input similarities 37, 41 and, optionally, further input from the user of the terminal 104. The recommendations 42 may be output via the I/O interface 204 and transmitted to the terminal 104 for presentation on the display 106.
Figure 4 is a flowchart showing further detail of the method described above in relation to Figure 3.
Beginning at step S4.0, a basis for generating the recommendations 42 is obtained (step S4.1) . In this example the basis may be provided by the user of the terminal 104. Figure 5 depicts a user interface 50, that may be presented by the display 106, through which the user can provide the basis. The user interface 50 includes fields 51, 52 in which a user can indicate a name for a first artist, and/or a name of a first music track by the first artist, track 1. As noted above, where a user indicates only one of the first artist and the first music track by the first artist, additional artist and track information may be obtained to supplement the user input as basis for the recommendations 42. Optionally, one or more sliders 53, 54 may be provided to allow the user to indicate preferences for the type of music tracks to be recommended. In this example, sliders 53 are provided for indicating instrument-based preferences and sliders 54 are provided for indicating music genre-based preferences. While Figure 5 depicts sliders 53, 54, in other embodiments, alternative input techniques for obtaining user preferences may be used, such as numerical values indicating relative importance or rankings for the preferences or input arranging the preferences in order of importance to the user.
In steps S4.2 and S4.3, first and second groups of music tracks are defined according to the basis obtained in step S4.1. Examples of ways in which group 1 and group 2 may be defined are discussed above in relation to Figure 3. In one example, where the basis includes information identifying an artist, group 1 may contain one or more music tracks by that artist, while group 2 may contain one or more music tracks by a second artist.
Next, in step S4.4, the controller 202 obtains attributes of a plurality of tracks 1...m of group 1. Such attributes may be obtained from metadata associated with the plurality of tracks 1...m indicating, for example, genre of musical tracks or type of artist, obtained from the data storage 208, or a remote database, or information from a streaming service or digital music store. Additionally, or alternatively, attributes may be obtained by analysing text in social media pages or other webpages. For example, where group 1 is a collection of music tracks by a first artist, an analysis of text on a website for the first artist or a label on which the first artist's music is released and/or reviews of the first artist's music on websites and/or blog pages may be performed and keywords extracted from that text.
Another option, which may be combined with one or both of such metadata and such keywords, is to extract acoustic features from audio data of tracks 1...m of group 1.
Figure 4 shows attributes 33a to 33c being obtained for individual tracks 1...m of group 1 one by one and used to form a vector 34 for group 1, before attributes 35a to 35c for individual tracks 1...n of group 2 are obtained in turn, in steps S4.4 to S4.8. This sequence may be used, for example, where group 2 is selected based on results from the analysis of the tracks of group 1. However, in other embodiments, attributes 33a to 33c, 35a to 35c for the tracks from groups 1 and 2 may be obtained in a different order than shown in Figure 4, or in parallel. In particular, it is not necessary to complete the obtaining of attributes 33a to 33c for the tracks 1...m from group 1 and/or create the vector 34 for group 1 before proceeding to obtain attributes 35a to 35c for tracks 1...n of group 2. Nor is it necessary to obtain the attributes of a particular track before beginning a process for obtaining attributes for another track.
In the following description, the attributes 33a to 33c, 35a to 35c are acoustic features extracted from audio data of tracks 1...m of group 1 and tracks 1...n of group 2, in the form of probabilities 116 that the tracks include a particular instrument or belong to a particular music genre. However, as noted above, the attributes may include one or more of metadata obtained from a database 208, streaming service or digital music store, keywords extracted from text relating to track 1 of group 1 and other audio features of the tracks, as well as, or instead of, such tag probabilities 116.
An example procedure for extracting acoustic features and obtaining tag
probabilities 116 at step S4.4 will now be described with reference to Figure 6.
Starting at step S6.0, if an input signal for track 1 of group 1 is in a compressed format, such as MPEG-1 Audio Layer 3 (MP3), Advanced Audio Coding (AAC) and so on, the input signal is decoded into pulse code modulation (PCM) data (step S6.1). In this particular example, the samples for decoding are taken at a rate of 44.1 kHz and have a resolution of 16 bits. The controller 202 may, optionally, resample the decoded input signal at a lower rate, such as 22050 Hz (step S6.2). An optional "pre-emphasis" process is shown as step S6.3. Since audio signals conveying music tend to have a large proportion of their energy at low frequencies, the pre-emphasis process filters the decoded input signal to flatten the spectrum of the decoded input signal. The relatively low sensitivity of the human ear to low frequency sounds may be modelled by such flattening. One example of a suitable filter for this purpose is a first-order Finite Impulse Response (FIR) filter with a transfer function of 1 − 0.98z⁻¹.
At step S6.4, the controller 202 blocks the input signal into frames. The frames may include, for example, 1024 or 2048 samples of the input signal, and the subsequent frames may be overlapping or they may be adjacent to each other according to a hop-size of, for example, 50% and 0%, respectively. In other examples, the frames may be non-adjacent so that only part of the input signal is formed into frames. Figure 7 depicts an example in which an input signal 70 is divided into blocks to produce adjacent frames of about 30 ms in length which overlap one another by 25%. However, frames of other lengths and/or overlaps may be used.
A Hamming window, such as windows 72a to 72d, is applied to the frames at step S6.5, to reduce windowing artifacts. An enlarged portion in Figure 7 depicts a frame 74 following the application of a window to the input signal 70.
At step S6.6, a Fast Fourier Transform (FFT) is applied to the windowed signal 74 to produce a magnitude spectrum of the input signal. An example FFT spectrum is shown in Figure 8. Optionally, the FFT magnitudes may be squared to obtain a power spectrum of the signal for use in place of the magnitude spectrum in the following steps.
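Steps S6.4 to S6.6 can be sketched in a few lines of NumPy; the frame length, hop size and use of a real FFT are illustrative choices, not values mandated by the text.

```python
# Hypothetical sketch of blocking, Hamming windowing and FFT magnitude computation.
import numpy as np

def frame_magnitude_spectra(signal, frame_length=2048, hop=1024):
    """Return an (n_frames, frame_length // 2 + 1) array of magnitude spectra."""
    window = np.hamming(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_length] * window
        spectra.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum per frame
    return np.array(spectra)
```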
The spectrum produced by the FFT at step S6.6 may have a greater frequency resolution at high frequencies than is necessary, since the human auditory system is capable of better frequency resolution at lower frequencies but is capable of lower frequency resolution at higher frequencies. So, at step S6.7, the spectrum is filtered to simulate non-linear frequency resolution of the human ear.
In this example, the filtering at step S6.7 is performed using a filter bank having channels of equal bandwidths on the mel-frequency scale. The mel-frequency scaling may be achieved by setting the channel centre frequencies equidistantly on a mel-frequency scale, given by Equation (1):
mel(f) = 2595 · log10(1 + f / 700)    (1)
where f is the frequency in Hertz.
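Equation (1) and its inverse (useful for placing the channel centre frequencies equidistantly on the mel scale) translate directly into code:

```python
import numpy as np

def hz_to_mel(f):
    """Equation (1): mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Equation (1), used when spacing filter centre frequencies
    evenly on the mel scale and converting them back to Hertz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```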
The output of each filtered channel is a sum of the FFT frequency bins belonging to that channel, weighted by a mel-scale frequency response. The weights for filters in an example filter bank are shown in Figure 9. In Figure 9, 40 triangular-shaped bandpass filters are depicted whose center frequencies are evenly spaced on a perceptually motivated mel-frequency scale. The filters may span frequencies from 30 Hz to 11025 Hz, in the case of the input signal having a sampling rate of 22050 Hz. For the sake of example, the filter heights in Figure 9 have been scaled to unity.
Variations may be made in the filter bank in other embodiments, such as spanning the band center frequencies linearly below 1000 Hz, scaling the filters such that they have unit area instead of unity height, varying the number of frequency bands, or changing the range of frequencies spanned by the filters.
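A minimal sketch of one possible triangular mel filter bank construction is given below, assuming 40 bands spanning 30 Hz to 11025 Hz at a 22050 Hz sampling rate and a 1024-point FFT; the bin mapping used here is one common convention, not necessarily that of the embodiment.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)          # Equation (1)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(n_filters=40, n_fft=1024, sr=22050, f_lo=30.0, f_hi=11025.0):
        """Triangular filters with centre frequencies equidistant on the mel scale,
        scaled to unit height (cf. Figure 9)."""
        mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
        bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for j in range(1, n_filters + 1):
            lo, c, hi = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
            fbank[j - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising slope
            fbank[j - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling slope
        return fbank

    # Mel-band energies m_j: weighted sums of the magnitude bins in each channel,
    # e.g. mel_energies = magnitude_spectra @ mel_filter_bank().T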
The weighted sum of the magnitudes from each of the filter bank channels may be referred to as mel-band energies m_j, where j = 1...N, N being the number of filters.
In step S6.8, a logarithm, such as a logarithm of base 10, may be taken from the mel-band energies m_j, producing log mel-band energies m̃_j. An example of a log mel-band energy spectrum is shown in Figure 10.
Next, at step S6.9, the MFCCs are obtained. In this particular example, a Discrete Cosine Transform is applied to a vector of the log mel-band energies m̃_j to obtain the MFCCs according to Equation (2),

c_mel(i) = Σ_{j=1}^{N} m̃_j · cos( (π · i / N) · (j − 0.5) )    (2)

where N is the number of filters, i = 0, ..., I and I is the number of MFCCs. In an exemplary embodiment, I = 20.
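A compact sketch of steps S6.8 and S6.9, taking the logarithm of the mel-band energies and applying the DCT of Equation (2), might read as follows; the floor value guarding the logarithm is an illustrative assumption.

    import numpy as np

    def mfcc_from_mel_energies(mel_energies, n_mfcc=20):
        """Log-compress the mel-band energies and apply a DCT (Equation (2)) to
        obtain the MFCCs c_mel(i), i = 0..I, for each frame."""
        log_mel = np.log10(np.maximum(mel_energies, 1e-10))        # log mel-band energies
        n_bands = log_mel.shape[1]
        j = np.arange(n_bands) + 0.5                                # j - 1/2, j = 1..N
        i = np.arange(n_mfcc + 1)[:, None]                          # i = 0..I
        dct_basis = np.cos(np.pi * i / n_bands * j)
        return log_mel @ dct_basis.T                                # frames x (I + 1)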
At step S6.10, further mathematical operations may be performed on the MFCCs produced at step S6.9, such as calculating a mean of the MFCCs and/or time derivatives of the MFCCs, to produce the required acoustic features 33a on which the calculation of the tag probabilities 116 will be based (a brief sketch of these summary statistics follows the list below).
In this particular embodiment, the acoustic features produced at step S6.10 include one or more of:
- a MFCC matrix for the music track;
- first and, optionally, second time derivatives of the MFCCs, also referred to as "delta MFCCs";
- a mean of the MFCCs of the music track;
- a covariance matrix of the MFCCs of the music track;
- an average of mel-band energies over the music track, based on output from the channels of the filter bank obtained in step S6.7;
- a standard deviation of the mel-band energies over the music track;
- an average logarithmic energy over the frames of the music track, obtained as an average of c_mel(0) over a period of time, for example, using Equation (2); and
- a standard deviation of the logarithmic energy.
The extracted features are then output (step S6.11). As noted above, the features output at step S6.11 may also include a fluctuation pattern and danceability features for the track, such as:
- a median fluctuation pattern over the song;
- a fluctuation pattern bass feature;
- a fluctuation pattern gravity feature;
- a fluctuation pattern focus feature;
- a fluctuation pattern maximum feature;
- a fluctuation pattern sum feature;
- a fluctuation pattern aggressiveness feature;
- a fluctuation pattern low-frequency domination feature;
- a danceability feature (detrended fluctuation analysis exponent for at least one predetermined time scale); and
- a club-likeness value.
The mel-band energies calculated in step S6.8 may be used to calculate one or more of the fluctuation pattern features listed above in step S6.10. In an example method of fluctuation pattern analysis, a sequence of logarithmic domain mel-band magnitude frames is arranged into segments of a desired temporal duration and the number of frequency bands is reduced. A FFT is applied over coefficients of each of the frequency bands across the frames of a segment to compute amplitude modulation frequencies of loudness in a desired range, for example, in a range of 1 to 10 Hz. The amplitude modulation frequencies may be weighted and smoothing filters applied. The results of the fluctuation pattern analysis for each segment may take the form of a matrix with rows corresponding to modulation frequencies and columns corresponding to the reduced frequency bands and/or a vector based on those parameters for the segment. The vectors for multiple segments may be averaged to generate a fluctuation pattern vector to describe the music track.

Danceability features and club-likeness values are related to beat strength, which may be loosely defined as a rhythmic characteristic that allows discrimination between pieces of music, or segments thereof, having the same tempo. Briefly, a piece of music characterised by a higher beat strength would be assumed to exhibit perceptually stronger and more pronounced beats than another piece of music having a lower beat strength. As noted above, a danceability feature may be obtained at step S6.10 by detrended fluctuation analysis, which indicates correlations across different time scales, based on the mel-band energies obtained at step S6.8. Examples of techniques of club-likeness analysis, fluctuation pattern analysis and detrended fluctuation analysis are disclosed in British patent application no. 1401626.5, as well as example methods for extracting MFCCs. The disclosure of GB 1401626.5 is incorporated herein by reference in its entirety.

The features obtained at step S6.10 may include features relating to tempo in beats per minute (BPM), such as:
- an average of an accent signal in a low, or lowest, frequency band;
" a standard deviation of said accent signal;
" a maximum value of a median or mean of periodicity vectors;
" a sum of values of the median or mean of the periodicity vectors;
" tempo indicator for indicating whether a tempo identified for the input signal is considered constant, or essentially constant, or is considered non-constant, or ambiguous;
" a first BPM estimate and its confidence;
" a second BPM estimate and its confidence;
" a tracked BPM estimate over the music track and its variation;
" a BPM estimate from a lightweight tempo estimator.
Example techniques for beat tracking, using accent information, are disclosed in US published patent application no. 2007/240558 A1, US patent application no. 14/302,057, and International (PCT) published patent application nos. WO2013/164661 A1 and WO2014/001849 A1, the disclosures of which are hereby incorporated by reference in their entireties.
In one example beat tracking method, described in GB 1401626.5, one or more accent signals are derived from the input signal 70, for detection of events and/or changes in the music track. A first one of the accent signals may be a chroma accent signal based on fundamental frequency F0 salience estimation, while a second one of the accent signals may be based on a multi-rate filter bank
decomposition of the input signal 70.
A BPM estimate may be obtained based on a periodicity analysis for extraction of a sequence of periodicity vectors on the basis of the accent signals, where each periodicity vector includes a plurality of periodicity values, each periodicity value describing the strength of periodicity for a respective period length, or "lag". A point-wise mean or median of the periodicity vectors over time may be used to indicate a single representative periodicity vector over a time period of the music track. For example, the time period may be over the whole duration of the music track. Then, an analysis can be performed on the periodicity vector to determine a most likely tempo for the music track. One example approach comprises
performing k-nearest neighbours regression to determine the tempo. In this case, the system is trained with representative music tracks with known tempo. The k-nearest neighbours regression is then used to predict the tempo value of the music track based on the tempi of the k nearest representative tracks. More details of such an approach are described in Eronen, Klapuri, "Music Tempo Estimation With k-NN Regression", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, Issue 1, pages 50-57, the disclosure of which is incorporated herein by reference.
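A minimal sketch of k-NN regression over representative periodicity vectors is given below, assuming a training set of such vectors with known tempi; the value of k and the distance metric are illustrative.

    import numpy as np

    def knn_tempo_estimate(periodicity_vec, train_vecs, train_tempi, k=5):
        """Predict the tempo of a track from its representative periodicity vector
        using k-nearest-neighbour regression over tracks with known tempo."""
        dists = np.linalg.norm(np.asarray(train_vecs) - periodicity_vec, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(np.mean(np.asarray(train_tempi)[nearest]))   # e.g. the mean of the k tempi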
Chorus related features that may be obtained at step S6.10 include:
- a chorus start time; and
- a chorus end time.
Example systems and methods that can be used to detect chorus related features are disclosed in US 2008/236371 A1, the disclosure of which is hereby incorporated by reference in its entirety.
Other features that may be obtained include:
- a duration of the music track in seconds;
- an A-weighted sound pressure level (SPL);
- a standard deviation of the SPL;
- an average brightness, or spectral centroid (SC), of the music track, calculated as a spectral balancing point of a windowed FFT signal magnitude in frames of, for example, 40 ms in length (a brief computation sketch follows the overview below);
- a standard deviation of the brightness;
- an average low frequency ratio (LFR), calculated as a ratio of energy of the input signal below 100 Hz to total energy of the input signal, using a windowed FFT signal magnitude in 40 ms frames; and
- a standard deviation of the low frequency ratio.

Figure 11 is an overview of a process of extracting multiple acoustic features, some or all of which may be obtained in steps S6.9 and S6.10. Figure 11 shows how some input features are derived, at least in part, from computations of other input features. The features shown in Figure 11 include the MFCCs, delta MFCCs and mel-band energies discussed above in relation to Figure 6, indicated using bold text and solid lines. The dashed lines and standard text indicate other features that may be extracted and made available alongside, or instead of, the MFCCs, delta MFCCs and mel-band energies, for use in calculating the tag probabilities 116. For example, as discussed above, the mel-band energies may be used to calculate fluctuation pattern features and/or danceability features through detrended fluctuation analysis. Results of fluctuation pattern analysis and detrended fluctuation analysis may then be used to obtain a club-likeness value. Also as noted above, beat tracking features, labeled as "beat tracking 2" in Figure 11, may be calculated based, in part, on a chroma accent signal from an F0 salience estimation.
Returning to Figure 6, at step S6.12, tag probabilities 116 and an overall track vector 39 for track 1 of group 1 are evaluated. An overview of an example method for obtaining tag probabilities 116 and creating a track vector 39 is shown in Figure 12.
The acoustic features 110 for track 1 of group 1 produced in steps S6.9 and S6.10 are input to first level classifiers 111 to generate first level probabilities for the music track. In this example, the first level classifiers 111 include first classifiers 112 and second classifiers 113 to generate first and second probabilities respectively, the second classifiers 113 being different from the first classifiers 112. In the
embodiments to be described below, the first classifiers 112 are non-probabilistic classifiers, while the second classifiers 113 are probabilistic classifiers.
In this embodiment, the first and second classifiers 112, 113 compute first level probabilities that the music tracks include particular instruments and/or belong to particular musical genres. Optionally, probabilities based on other acoustic similarities may be included as will be noted hereinbelow.
The first level probabilities are input to at least one second level classifier 114. In this embodiment, the second level classifier 114 includes a third classifier 115, which may be a non-probabilistic classifier. The third classifier 115 generates the tag probabilities 116 based, at least in part, on the first probabilities output by the first classifiers 112 and the second probabilities output by the second classifiers 113.
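Schematically, the two-level arrangement described above might be sketched as below, assuming scikit-learn-style second level classifiers exposing predict_proba; the stacking of the first level outputs into a single feature vector is the essential point.

    import numpy as np

    def second_level_input(first_svm_probs, gmm_probs):
        """Stack the first level outputs (from the SVM-based first classifiers 112
        and the GMM-based second classifiers 113) into one feature vector."""
        return np.concatenate([first_svm_probs, gmm_probs])

    def predict_tag_probabilities(second_level_classifiers, first_svm_probs, gmm_probs):
        """One second level classifier per tag; each outputs the probability that
        its tag (instrument or genre) applies to the track."""
        x = second_level_input(first_svm_probs, gmm_probs).reshape(1, -1)
        return {tag: clf.predict_proba(x)[0, 1] for tag, clf in second_level_classifiers.items()}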
Figure 13 is a flowchart depicting the method of Figure 12 in more detail. In this particular example, the first and third classifiers 112, 115 are support vector machine (SVM) classifiers and the second classifiers 113 are based on Gaussian Mixture Models (GMM).
Starting at step S13.0, one, some or all of the extracted features 110 or descriptors obtained in steps S6.9 and S6.10 are selected to be used as input to the first classifiers 112 (step S13.1) and, optionally, normalised (step S13.2). For example, a look up table 216 or database may be stored in the memory 206 that provides, for each of the tag probabilities to be produced by the analysis server 100, a list of features to be used to generate each first classification and statistics, such as the mean and variance of the selected features, that can be used in normalisation of the extracted features 33a. In such an example, the controller 202 retrieves the list of features from the look up table 216, and accordingly selects and normalises the listed features for each of the first level probabilities to be generated. The normalisation statistics for each first level probability in the database may be determined during training of the first classifiers 112.
As noted above, in this example, the first classifiers 112 are SVM classifiers. The SVM classifiers are trained using a database of music tracks for which information regarding musical instruments and genre is already available. The database may include tens of thousands of tracks that provide examples for each particular musical instrument for which a tag probability 116 is to be evaluated.
Examples of musical instruments for which information may be provided in the database include:
- Accordion;
- Acoustic guitar;
- Backing vocals;
- Banjo;
- Bass guitar;
- Bass synthesizer;
- Brass instruments;
- Glockenspiel;
- Drums;
- Eggs;
- Electric guitar;
- Electric piano;
- Guitar synthesizer;
- Keyboards;
- Lead vocals;
- Organ;
- Percussion;
- Piano;
- Saxophone;
- Stringed instruments;
- Synthesizer; and
- Woodwind instruments.
The training database includes indications of genres that the music tracks belong to, as well as indications of genres that the music tracks do not belong to.
Examples of musical genres that may be indicated in the database include:
- Ambient and new age;
- Blues;
- Classical;
- Country and western;
- Dance;
- Easy listening;
- Electronica;
- Folk and roots;
- Indie and alternative;
- Jazz;
- Latin;
- Metal;
- Pop;
- Rap and hip hop;
- Reggae;
- Rock;
- Soul, R&B and funk; and
- World music.

By analysing acoustic features extracted from the music tracks in the training database, for which instruments and/or genre are known, a SVM classifier can be trained to determine whether or not a music track includes a particular instrument, for example, an electric guitar. Similarly, another SVM classifier can be trained to determine whether or not the music track belongs to a particular genre, such as Metal.
In this embodiment, the training database provides a highly imbalanced selection of music tracks, in that a set of tracks for training a given SVM classifier includes many more positive examples than negative ones. In other words, for training a SVM classifier to detect the presence of a particular instrument, the set of music tracks used for training will include significantly more tracks that include that instrument than tracks that do not. Similarly, in an example where a SVM classifier is being trained to determine whether a music track belongs to a particular genre, the set of music tracks for training might be selected so that the number of tracks that belong to that genre is significantly greater than the number of tracks that do not belong to that genre.
An error cost may be assigned to the different first level probabilities produced by the first classifiers 112 to take account of the imbalances in the training sets. For example, if a minority class of the training set for a particular first classification includes 400 songs and an associated majority class contains 10,000 tracks, an error cost of 1 may be assigned to the minority class and an error cost of 400/10,000 may be assigned to the majority class. This allows all of the training data to be retained, instead of downsampling the data of the negative examples.
New SVM classifiers can be added by collecting new training data and training the new classifiers. Since the SVM classifiers are binary, new classifiers can be added alongside existing classifiers.
As mentioned above, the training process can include determining a selection of one or more acoustic features to be used as input for particular first classifiers 112, and statistics for normalising those features. The number of features available for selection, M, may be much greater than the number of features selected for determining a particular first classification, N; that is, M >> N. The selection of features to be used is determined iteratively, based on a development set of music tracks for which the relevant instrument or genre information is available, as follows.
Firstly, to reduce redundancy, a check is made as to whether two or more of the features are so highly correlated that the inclusion of more than one of those features would not be beneficial. For example, if two features have a correlation coefficient that is larger than 0.9, then only one of those features is considered available for selection.
The feature selection training starts using an initial selection of features, such as the average MFCCs for music tracks in the development set or a single "best" feature for a given first classification. For instance, a feature that yields the largest classification accuracy when used individually may be selected as the "best" feature and used as the sole feature in an initial feature selection. An accuracy of the first classification based on the initial feature selection is determined. Further features are then added to the feature selection to determine whether or not the accuracy of the first classification is improved by their inclusion. Features to be tested for addition to the selection of features may be chosen using a method that combines forward feature selection and backward feature selection in a sequential floating feature selection. Such feature selection may be performed during the training stage, by evaluating the classification accuracy on a portion of the training set.
In each iteration, each of the features available for selection is added to the existing feature selection in turn, and the accuracy of the SVM classifier with each
additional feature is determined. The feature selection is then updated to include the feature that, when added to the feature selection, provided the largest increase in accuracy for the development set. After a feature is added to the feature selection, the accuracy of the SVM classifier is reassessed, by generating probabilities based on edited feature selections in which each of the features in the feature selection is omitted in turn. If it is found that the omission of one or more features provides an improvement in the accuracy of a generated probability, then the feature that, when omitted, leads to the biggest improvement in accuracy is removed from the feature selection. If no
improvements are found when any of the existing features are left out, but the accuracy does not change when a particular feature is omitted, that feature may also be removed from the feature selection in order to reduce redundancy.
The iterative process of adding and removing features to and from the feature selection continues until the addition of a further feature no longer provides a significant improvement in the accuracy of the SVM classifier. For example, if the improvement in accuracy falls below a given percentage, the iterative process may be considered complete, and the current selection of features is stored in the lookup table 216, for use in selecting features in step S13.1. The normalisation of the selected features 110 at step S13.2 is optional. Where provided, the normalization of the selected features 110 at step S13.2 may
potentially improve the accuracy of the first classifiers 112.
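Referring back to the iterative selection described above, a compact sketch of a sequential floating selection loop is given below, assuming an evaluate_accuracy(features) callback that trains and scores a classifier on the development set; the callback and the stopping threshold are illustrative assumptions.

    def sequential_floating_selection(candidates, evaluate_accuracy, min_gain=0.001):
        """Greedy forward selection with backward pruning: add the feature giving the
        largest accuracy gain, then drop any feature whose removal does not hurt."""
        selected, best_acc = [], 0.0
        while True:
            # Forward step: try adding each remaining candidate in turn.
            gains = {f: evaluate_accuracy(selected + [f]) for f in candidates if f not in selected}
            if not gains:
                break
            best_f = max(gains, key=gains.get)
            if gains[best_f] - best_acc < min_gain:
                break                                    # no significant improvement: stop
            selected.append(best_f)
            best_acc = gains[best_f]
            # Backward step: try leaving each selected feature out in turn.
            while len(selected) > 1:
                drops = {f: evaluate_accuracy([g for g in selected if g != f]) for f in selected}
                worst_f = max(drops, key=drops.get)
                if drops[worst_f] >= best_acc:           # removal does not reduce accuracy
                    selected.remove(worst_f)
                    best_acc = drops[worst_f]
                else:
                    break
        return selected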
In another embodiment, at step S13.1, a linear feature transform may be applied to the available features 110 obtained in steps S6.9 and S6.10, instead of performing the feature selection procedure described above. For example, a Partial Least Squares Discriminant Analysis (PLS-DA) may be used to obtain a linear
combination of features for calculating a corresponding first classification. Instead of using the above iterative process to select N features from the set of M features, a linear feature transform is applied to an initial high-dimensional set of features to arrive at a smaller set of features which provides a good discrimination between classes. The initial set of features may include some or all of the available features, such as those shown in Figure 11, from which a reduced set of features can be selected based on the result of the transform.
The PLS-DA transform parameters may be optimized and stored in a training stage. During the training stage, the transform parameters and the dimensionality of the transform may be optimized for each tag or output classification, such as an indication of an instrument or a genre. More specifically, the training of the system parameters can be done in a cross-validation manner, for example, as five-fold cross-validation, where all the available data is divided into five non-overlapping sets. At each fold, one of the sets is held out for evaluation and the four remaining sets are used for training. Furthermore, the division of folds may be specific for each tag or classification.
For each fold and each tag or classification, the training set is split into 50%-50% inner training-test folds. Then, the PLS-DA transform may be trained on the inner training fold and the SVM classifier may be trained on the obtained dimensions. The accuracy of the SVM classifier using the transformed features may be evaluated on the inner test fold. It is noted that, when a feature vector (track) is tested, it is subjected to the same PLS-DA transform, the parameters of which were obtained during training. In this manner, an optimal dimensionality for the PLS-DA transform may be selected. For example, the dimensionality may be selected such that the area under the receiver operating characteristic (ROC) curve is maximized. In one example embodiment, an optimal dimensionality is selected among candidates between 5 and 40 dimensions. Finally, the PLS-DA transform is trained on the whole of the training set, using the optimal number of dimensions, and used in training the SVM classifier.

As an alternative to PLS-DA, other feature transforms such as Linear Discriminant Analysis (LDA), Principal Components Analysis (PCA), or Independent Component Analysis (ICA) could be used.
In the following, an example is discussed in which the selected features 110 on which the first classifications are based are the mean of the MFCCs of track 1 of group 1 and the covariance matrix of the MFCCs of track 1 of group 1, although in other examples alternative and/or additional features, such as the other features shown in Figure 11, may be used.
At step S13.3, the controller 202 defines a feature vector based on each set of selected features 110 or selected combination of features 110 for track 1 of group 1. The feature vectors may then be normalized to have a zero mean and a variance of 1, based on statistics determined and stored during the training process.
At step S13.4, the controller 202 generates one or more first probabilities that track 1 of group 1 has a certain characteristic, based on the feature vector or vectors. The first classifiers 112 are used to calculate respective probabilities for each feature vector defined in step S13.3. In this manner, the number of first classifiers 112 corresponds to the number of tag probabilities 116 to be predicted for the music track. In this particular example, a probability is generated by a respective first classifier 112 for each instrument tag probability and for each genre tag probability to be predicted for the music track, based on the mean MFCCs and the MFCC covariance matrix. In addition, a probability may be generated by the first classifiers 112 based on whether the music track is likely to be an instrumental track or a vocal track. Also, for vocal tracks, another probability may be generated by the first classifiers 112 based on whether the vocals are provided by a male or female vocalist. In other embodiments, the controller 202 may generate only one or some of these probabilities and/or calculate additional probabilities at step S13.4. The different classifications may be based on respective selections of features from the available features 110 selected in step S13.1.
The first classifiers 112 may use a radial basis function (RBF) kernel K, defined as:
K(u, v) = exp(−γ · ‖u − v‖²)    (3)

where the default γ parameter is the reciprocal of the number of features in the feature vector, u is the input feature vector and v is a support vector.
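For illustration, the kernel of Equation (3) with the default γ convention mentioned above might be written as follows.

    import numpy as np

    def rbf_kernel(u, v, gamma=None):
        """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2), Equation (3)."""
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        if gamma is None:
            gamma = 1.0 / u.shape[-1]      # default: reciprocal of the number of features
        return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))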
The output from the first classifiers 112 may be in the form of first classifications based on an optimal predicted probability threshold that separates a positive prediction from a negative prediction for a particular tag probability, based on the probabilities computed by the first classifiers 112. The setting of an optimal predicted probability threshold may be learned in the training procedure to be described later below. Where there is no imbalance in the data used to train the first classifiers 112, the optimal predicted probability threshold may be 0.5. However, where there is an imbalance between the number of tracks providing positive examples and the number of tracks providing negative examples in the training sets used to train the first classifiers 112, the threshold p_thr may be set to a prior probability of a minority class P_min in the first classification, using Equation (4) as follows:
p_thr = P_min = n_min / (n_min + n_maj)    (4)
where, in the set of n tracks used to train the SVM classifiers, n_min is the number of tracks in the minority class and n_maj is the number of tracks in the majority class. The prior probability P_min may be learned as part of the training of the SVM classifiers.
Probability distributions for examples of possible first classifications, based on an evaluation of a number n of tracks, are shown in Figure 14. The nine examples in Figure 14 suggest a correspondence between a prior probability for a given first classification and its probability distribution based on the n tracks. Such a correspondence is particularly marked where the SVM classifier was trained with an imbalanced training set of tracks. Consequently, the predicted probability thresholds for the different examples vary over a considerable range.
Optionally, a logarithmic transformation may be applied to the probabilities produced by the first classifiers 112 at step S13.4, so that the probabilities are on the same scale and the optimal predicted probability threshold may correspond to a predetermined value, such as 0.5.
Equations (5) to (8) below provide an example normalisation which adjusts the optimal predicted probability threshold to 0.5. Where the probability output by a SVM classifier is p and the prior probability P of a particular tag being applicable to a track is greater than 0.5, the normalised probability p_norm is given by:

p_norm = 1 − (1 − p)^L    (5)

where

L = log(0.5) / log(1 − P)    (6)
Meanwhile, where the prior probability P is less than or equal to 0.5, the normalised probability p_norm is given by:

p_norm = p^L′    (7)

where

L′ = log(0.5) / log(P)    (8)

Figure 15 depicts the example probability distributions of Figure 14 after a logarithmic transformation has been applied, on which optimal predicted probability thresholds of 0.5 are marked. The probabilities output by the first classifiers 112 correspond to a normalised probability p_norm that a respective one of the tags to be considered applies to track 1 of group 1. The first classifications may include probabilities p_inst1 that a particular instrument is included in the music track and probabilities p_gen1 that the music track belongs to a particular genre.
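A minimal sketch of this normalisation, as reconstructed in Equations (5) to (8) above, is given below; it assumes the prior probability P of each tag has been learned during training.

    import math

    def normalise_probability(p, prior):
        """Map an SVM output probability p so that the learned threshold `prior`
        corresponds to 0.5 (Equations (5) to (8) as reconstructed above)."""
        if prior > 0.5:
            L = math.log(0.5) / math.log(1.0 - prior)     # Equation (6)
            return 1.0 - (1.0 - p) ** L                   # Equation (5)
        L = math.log(0.5) / math.log(prior)               # Equation (8)
        return p ** L                                     # Equation (7)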
In steps S13.5 and S13.6, further first level probabilities are generated for the input signal by the second classifiers 113, based on the MFCCs and other parameters produced in step S4.4. Although Figure 13 shows steps S13.3 to S13.6 being performed in sequence, in another embodiment steps S13.5 and S13.6 may be performed before, or in parallel with, steps S13.3 and S13.4.
In this particular example, the acoustic features 110 of track 1 of group 1 on which the second classifications are based are the MFCC matrix and the first time derivatives of the MFCCs, and probabilities p_inst2, p_gen2 are generated for each instrument tag (step S13.5) and for each musical genre tag (step S13.6) to be predicted. Optionally, further probabilities may be generated based on whether the music track is likely to be an instrumental track or a vocal track and, for vocal tracks, another probability may be generated based on whether the vocals are provided by a male or female vocalist. In other embodiments, the controller 202 may generate only one or some of these second classifications and/or calculate additional second classifications at steps S13.5 and S13.6.
In this example, the second classifiers 113 compute probabilities p_inst2, p_gen2 using probabilistic models that have been trained to represent the distribution of features extracted from audio signals captured from each instrument or genre. Such training can be performed using an expectation maximisation algorithm that iteratively adjusts the model parameters to maximise the likelihood of the model for a particular instrument or genre generating features matching one or more input features in the captured audio signals for that instrument or genre. The parameters of the trained probabilistic models may be stored in a database, for example, in the database 208 of the analysis server, or in remote storage that is accessible to the analysis server 100 via a network, such as the network 102.

For each instrument or genre, likelihoods L_yes, L_no are evaluated that the respective probabilistic model could have generated the selected or transformed features from the input signal. In this embodiment, in steps S13.5 and S13.6, the instrument-based probabilities p_inst2 are produced by the second classifiers 113 using first and second Gaussian Mixture Models (GMMs), based on the MFCCs and their first time derivatives calculated in step S13.5. Meanwhile, the probabilities p_gen2 that the music track belongs to a particular musical genre are produced by the second classifiers 113 using third GMMs. However, the first and second GMMs used to compute the instrument-based probabilities p_inst2 may be trained and used slightly differently from the third GMMs used to compute the genre-based probabilities p_gen2, as will now be explained.
The first and second GMMs used in step S13.5 may have been trained with an Expectation Maximisation algorithm using a training set comprising examples which are known to include the instrument and examples which are known not to include the instrument. For each track in the training set, MFCC feature vectors and their corresponding first time derivatives are computed. The MFCC feature vectors for the examples in the training set that contain the instrument are used to train a first GMM for that instrument, while the MFCC feature vectors for the examples that do not contain the instrument are used to train a second GMM for that instrument. In this manner, for each instrument to be tagged, two GMMs are produced. The first GMM is for a track that includes the instrument and is used to evaluate the likelihood L_yes, while the second GMM is for a track that does not include the instrument and is used to evaluate the likelihood L_no. In this example, the first and second GMMs each contain 64 component Gaussians.
The first and second GMMs may then be refined by discriminative training, for example using maximum mutual information (MMI) criterion on a balanced training set where, for each instrument to be tagged, the number of example tracks that contain the instrument is equal to the number of example tracks that do not contain the instrument.
Returning to step S13.5, the two likelihoods L_yes, L_no are computed based on the first and second GMMs and the MFCCs for the music track. The first is the likelihood L_yes that the corresponding instrument is included in the music track, while the second is the likelihood L_no that the instrument is not included in the music track. The first and second likelihoods L_yes, L_no may be computed in a log domain, and then converted to a linear domain.
The first and second likelihoods L_yes, L_no are then mapped to a tag probability p_inst2 of the instrument being included in the track, as follows:

p_inst2 = L_yes / (L_yes + L_no)    (9)
As noted above, the third GMMs, used to compute genre-based probabilities p_gen2, are trained differently to the first and second GMMs. For each genre to be considered, a third GMM is trained based on MFCCs for a training set of tracks known to belong to that genre. One third GMM is produced for each genre to be considered. In this example, each third GMM includes 64 component Gaussians. In step S13.6, for each of the genres to be considered, a likelihood L_gen is computed for track 1 of group 1 belonging to that genre, based on the likelihood of each of the third GMMs being capable of outputting the MFCC feature vector of the music track. For example, to determine which of the eighteen genres in the list hereinabove might apply to the music track, eighteen likelihoods would be produced.
The genre likelihoods L_gen are then mapped to probabilities p_gen2, as follows:

p_gen2(k) = L_gen(k) / Σ_{j=1}^{m} L_gen(j)    (10)

where m is the number of genre tags to be considered.
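A sketch of the likelihood-to-probability mappings of Equations (9) and (10) is given below, assuming per-track log-likelihoods from the trained GMMs (for example, as returned by a scikit-learn GaussianMixture's score method); the max-subtraction is only for numerical stability and does not change the ratios.

    import numpy as np

    def instrument_probability(loglik_yes, loglik_no):
        """Equation (9): convert 'instrument present' / 'instrument absent'
        log-likelihoods into a tag probability."""
        m = max(loglik_yes, loglik_no)
        l_yes, l_no = np.exp(loglik_yes - m), np.exp(loglik_no - m)   # back to the linear domain
        return l_yes / (l_yes + l_no)

    def genre_probabilities(genre_logliks):
        """Equation (10): normalise the per-genre likelihoods so that they sum to one."""
        logliks = np.asarray(genre_logliks, dtype=float)
        liks = np.exp(logliks - logliks.max())
        return liks / liks.sum()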
In another embodiment, the first and second GMMs may be trained and used in the manner described above for the third GMMs. In yet further embodiments, the GMMs used for analysing genre may be trained and used in the same manner, using either of the techniques described in relation to the first, second and third GMMs above.

The first classifications p_inst1 and p_gen1 and the second classifications p_inst2 and p_gen2 for track 1 of group 1 are then normalized to have a mean of zero and a variance of 1 (step S13.7) and collected to form a feature vector for input to the one or more second level classifiers 114 (step S13.8). In this particular example, the second level classifiers 114 include third classifiers 115, as noted above, and the third classifiers 115 are non-probabilistic classifiers, such as SVM classifiers trained in a similar manner to that described above in relation to the first classifiers 112. At the training stage, the first classifiers 112 and the second classifiers 113 may be used to output probabilities p_inst1, p_gen1, p_inst2, p_gen2 for the training sets of example music tracks from the database. The outputs from the first and second classifiers 112, 113 are then used as input data to train the third classifiers 115.
The third classifiers 115 determine second level probabilities p_inst3 for whether track 1 of group 1 contains a particular instrument and/or second level probabilities p_gen3 for whether track 1 of group 1 belongs to a particular genre (step S13.9). In this example, where the third classifiers 115 are SVM classifiers, the second level probabilities p_inst3, p_gen3 are generated in a similar manner to the first level probabilities p_inst1, p_gen1 computed by the first classifiers 112. The second level probabilities p_inst3, p_gen3 are then log normalised (step S13.10), as described above in relation to the first level probabilities p_inst1, p_gen1 from the first classifiers 112, and output as the tag probabilities 116 at step S13.11.
Optionally, tags based on the tag probabilities 116 may be associated with the music track at step S13.11. For example, where the tag probabilities 116 exceed a probability threshold, such as 0.5 for normalised probabilities, tags corresponding to the relevant instruments and/or genres may be stored in an entry for the music track in the database 208.

The track vector 39 is then generated at step S13.12 from the tag probabilities 116 output at step S13.11 and normalised. An example of a track vector 39 is shown in Figure 16. In this particular example, the track vector 39 reflects non-zero probabilities for the music track being a rock song including lead and backing vocals, bass, drums, electric guitar, keyboard and percussion. Alternatively, or additionally, some or all of the first level and/or second level probabilities p_inst1, p_gen1, p_inst2, p_gen2, p_inst3, p_gen3 themselves and/or the features 110 extracted at steps S6.9 and S6.10 may be output for further analysis and/or storage. The tag probability calculation process ends at step S13.13.
Returning to Figure 4, steps S4.4 and S4.5, including the processes of Figures 6 and 13, are repeated to obtain attributes 33b, 33c for tracks 2 to m of group 1, until no further tracks of group 1 remain to be analysed (step S4.5). However, the repetition of step S13.12 to create track vectors 39 for tracks 2...m of group 1 is optional. In this manner, attributes 33a to 33c are obtained for tracks 1...m of group 1, while a track vector 39 may be generated for one, some or all of the tracks 1...m of group 1.

The combined vector 34 for group 1 is then created (step S4.6), based on the tag probabilities 116 generated at step S4.4 for tracks 1...m of group 1. For example, the combined vector 34 may be created by summing the tag probabilities 116 for all of the analysed tracks 1 to m of group 1 and, optionally, normalising the sum.

Next, for each of the tracks 1...n of group 2 to be analysed, the attributes 35a to 35c are obtained in turn (steps S4.7, S4.8) and, for at least one of the tracks 1...n of group 2, a track vector 40 is created, as described above in relation to steps S4.4 to S4.6 and Figures 6 and 13, until no further tracks of group 2 remain to be analysed (step S4.8). As discussed above in relation to the tracks 1...m of group 1, the creation of a track vector 40 may be performed at step S13.12 for one, some or all of the tracks 1...n of group 2.
A combined vector 36 for group 2 is then created (step S4.9), for example by summing the tag probabilities 116 for the tracks 1...n of group 2 and, optionally, normalising the sum.
The group level similarity 37 for the tracks of groups 1 and 2 is calculated by evaluating the similarity between the combined vectors 34, 36 for groups 1 and 2 (step S4.10). For example, if the combined vectors 34, 36 for artists 1 and 2 are denoted by a and b, their similarity sim(a, b) can be measured with a cosine similarity defined as shown by Equation (11):

sim(a, b) = (a · b) / (|a| × |b|)    (11)

In other embodiments, a different technique may be used to evaluate sim(a, b) instead of the cosine similarity shown in Equation (11). For example, one alternative technique may include using the Euclidean distance and taking its inverse to obtain the similarity sim(a, b). Another example technique for assessing similarity may use the Kullback-Leibler divergence.
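For illustration, the cosine similarity of Equation (11) applied to two combined tag-probability vectors might be computed as follows; the small constant guarding the division is an illustrative assumption.

    import numpy as np

    def cosine_similarity(a, b):
        """Equation (11): sim(a, b) = (a . b) / (|a| x |b|)."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))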
One or more track level similarities 38 are assessed at step S4.11, based on the similarity of the track vectors 39, 40. The similarity of the track vectors 39, 40 may be assessed using Equation (11) above.
A combined group and track similarity 41 may then be determined, for example, by summing the group level similarity 37 and the track level similarity 38 (step S4.12). In this particular example, the group level similarity 37, the track level similarity 38 and the combined group and track similarity 41 are normalised so that they have values in a range between 0 and 1.

Similarities between group 1 and one or more further groups of music tracks may be computed by repeating steps S4.7 to S4.12 for additional groups and generating respective group level similarities 37, track level similarities 38 and combined group and track similarities 41 for each additional group. For example, where group 1 contains tracks by a first artist and group 2 contains tracks by a second artist, further groups, such as groups 3 and 4, may be defined, containing tracks by a third artist and a fourth artist respectively, and so on.
In step S4.13, a list of recommendations 42 of tracks is compiled from the music tracks of group 2 and, where provided, any further groups of music tracks that have been analysed, based on one or both of the group level similarity 37 and, optionally, the combined group and track similarity 41. In one embodiment, a list of tracks exhibiting the highest combined group and track similarity 41 and/or other similarity 37, 38 may be compiled at step S4.13 and output to the user (step S4.14). Alternatively, the list of recommendations 42 may be ranked and/or revised as part of the compilation (step S4.15). Examples of compilation procedures that may be performed at step S4.15 will now be described, with reference to Figures 17, 18 and 19.
Figures 17 to 19 show example procedures for generating the list of
recommendations at step S4.15, in which other ranking techniques are employed.
Beginning with the example of Figure 17, which starts at step S17.0, a preliminary list of candidate tracks from group 2 and any further groups of music tracks is compiled (step S17.1), based on one or more of the similarities calculated in steps S4.10 to S4.12.
In the example method of Figure 17, the list of preliminary candidates is revised based on user preferences input by the user, for example, by using the sliders 53, 54 in the user interface 50 shown in Figure 5. For example, using the sliders 53, 54, the user may have indicated that they would like to receive recommendations of tracks that are jazzier and include more piano than a particular track indicated in field 52 of the user interface 50, corresponding to track 1 of group 1, but include fewer stringed instruments.

Where the user has provided input indicating a preference (step S17.2), the tag probability 116 corresponding to a first property indicated by the user is identified (step S17.3) and the relevant tag probabilities for the candidate tracks are retrieved or otherwise obtained (step S17.4) and adjusted as follows. If the user input indicated a positive contribution for the property (step S17.5), such as "more jazz", the tag probabilities 116 for a genre of "jazz" for the candidate tracks are added to the values calculated for one or more of the similarities between track 1 of group 1 and the candidate tracks. If the user input indicated a negative contribution for the property (step S17.5), such as "less strings", the tag probabilities 116 for stringed instruments are subtracted from the similarity values for the candidate tracks (step S17.7). If another preference has been indicated by the user (step S17.8), then steps S17.3 to S17.7 are repeated for the next preference, until the similarity values have been adjusted for all the received user preferences (step S17.8).

The candidate tracks are then ranked based on their adjusted similarities (step S17.9), completing the procedure (step S17.10). The list of recommendations 42 output at step S4.14 may be based on a selected subset, or on all, of the candidate tracks in the ranked list of candidate tracks. For example, a predetermined number of the highest ranked candidate tracks may be selected for inclusion in the list of recommendations 42.
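A minimal sketch of the preference adjustment of Figure 17 is given below, assuming each candidate track carries a dictionary of tag probabilities and a baseline similarity score; the data layout and the unit weighting of each preference are illustrative assumptions.

    def adjust_for_preferences(candidates, preferences):
        """Add or subtract the relevant tag probability for each user preference,
        then rank the candidates by the adjusted similarity (cf. steps S17.3 to S17.9).

        candidates:  list of dicts, e.g. {"id": ..., "similarity": 0.8, "tags": {"jazz": 0.6, ...}}
        preferences: dict mapping a tag name to +1 ("more") or -1 ("less")
        """
        for track in candidates:
            adjusted = track["similarity"]
            for tag, sign in preferences.items():
                adjusted += sign * track["tags"].get(tag, 0.0)
            track["adjusted_similarity"] = adjusted
        return sorted(candidates, key=lambda t: t["adjusted_similarity"], reverse=True)

    # Example: "more jazz", "more piano", "less strings"
    # ranked = adjust_for_preferences(candidates, {"jazz": +1, "piano": +1, "strings": -1})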
In another example method, shown in Figure 18, beginning at step S18.0, a preliminary list of candidate tracks from group 2 is obtained (step S18.1), for example by compiling the list based on one or more of the similarities calculated in steps S4.10 to S4.12, as described above in relation to Figure 17. In this example, the candidate tracks in the preliminary list are ranked based on user preferences as indicated by the user's listening history, as will now be explained.
A user history is obtained in step S18.2. The user history may be based on the number of times the user has previously accessed music tracks stored on the terminal 104, in another database or via a streaming service, on tracks ranked by the user on social media or in an online music database, and/or on tracks purchased by the user from a digital music store.
Next, the controller 202 obtains a tag indicating a user preference from the tracks in the history (step S18.3). For example, the controller 202 may determine the most common tag for the previously accessed tracks shown in the user history. The corresponding tag probabilities 116 for the candidate tracks from group 2 are then retrieved (step S18.4).
If the obtained tag seems to be viewed positively by the user (step S18.5), for example, if the tag occurs most often in previously accessed tracks that were played, downloaded or purchased by the user, then the tag probabilities 116 for the candidate tracks are added to one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.6).
If the obtained tag seems to indicate a negative user preference, for example if the tag occurs most often in previously accessed tracks that were skipped by the user (step S18.5), then the tag probabilities 116 for the candidate tracks are subtracted from one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.7). Steps S18.3 to S18.7 may be repeated for further tags, if required (step S18.8).
The candidate tracks are then ranked based on their adjusted similarities (step S18.9), completing the procedure (step S18.10). As noted above in relation to Figure 17, the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks produced at step S18.9.

In yet another example method, shown in Figure 19, starting at step S19.0, a list of preliminary candidates is obtained (step S19.1), as discussed above in relation to Figure 17. In this example method, the preliminary candidates are ranked based on properties other than those on which the tag probabilities 116 are based. In step S19.2, the controller 202 determines whether the user history includes a previously accessed music track that has a tag that was not included in track 1 of group 1. For example, a previously accessed music track in the user history may include instruments that were not included in track 1 of group 1, or belong to a different genre from track 1 of group 1. In the following, such a tag is referred to as a "new tag".
The tag probabilities 116 for the new tag are retrieved for the candidate tracks (step S19.3). If the user history indicates that the user listened to the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are added to their respective similarities (step S19.5).
If the user history indicates that the user skipped the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are subtracted from their respective similarities (step S19.6).
If there are further new tags in the previously accessed track (steps S19.7, S19.8), then tag probabilities 116 for the candidate tracks for the further new tags are also added to, or subtracted from, the similarities as appropriate (steps S19.5, S19.6). If required, the controller 202 may then search the user history for another previously accessed track with at least one new tag (steps S19.9, S19.2), to further adjust the similarities of the candidate tracks (steps S19.3 to S19.9). The candidate tracks are then ranked based on their adjusted similarities (step S19.10), completing the procedure (step S19.11). The list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks.

In yet another embodiment, one or more of the methods described above with reference to Figures 17 to 19 may be used to compile the list of recommendations 42 at step S4.13.
The list of recommendations 42 is output at step S4.14 via the interface 202. In this example, the list is transmitted to the user's terminal 104 via the network 102. The terminal 104 may present the list of recommendations 42 to the user as a list of music tracks, optionally with links to access the recommended tracks from a streaming service or database or to purchase the recommended tracks from a digital music store. Where at least some of the recommendations include music tracks in a library accessible by the terminal 104, for example stored in storage
208, the list of recommendations 42 may include, or take the form of, a playlist to be followed by a media player software application stored in the terminal 104.
The procedure for recommending music tracks may end at this point (step S4.15). Alternatively, the analysis server 100 may receive and monitor user history information (step S4.16) from the terminal 104 after the list of recommendations 42 has been output (step S4.14) and determine whether the list of recommendations 42 should be revised (step S4.17). For example, the controller 202 may determine that revision is needed to adjust the recommendations 42 based on whether the user has listened to, or skipped, tracks in the existing list of recommendations 42.
If revision is required (step S4.17), then the controller 202 revises the list of recommendations 42 (step S4.18). For example, the controller 202 may update the list of recommendations 42 based on tags from the recommended tracks that the user has listened to, or skipped, by performing the method of Figure 18, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S18.1 and the received updated user history as the user history obtained in step S18.2. Alternatively, or additionally, the controller 202 may update the list of recommendations 42 if a previously accessed music track appearing in the updated user history includes a new tag, using the method of Figure 19, with the existing list of recommendations 42 as the preliminary list of candidate tracks in step S19.1.
The revised list of recommendations 42 based on the new rankings is then output (step S4.14).
The monitoring of user history and revision of the list of recommendations 42
(steps S4.14 to S4.18) may continue until further revision is not needed (step S4.15), for example, if the user of the terminal 104 pauses or stops music playback, closes the media player application or switches off the terminal 104.
The procedure then ends at step S4.19.
It will be appreciated that the above-described embodiments are not limiting on the scope of the invention, which is defined by the appended claims and their alternatives. Various alternative implementations will be envisaged by the skilled person, and all such alternatives are intended to be within the scope of the claims.
It is noted that the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer. A computer-readable medium may comprise a computer-readable storage medium that may be any tangible media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer as defined previously. The computer-readable medium may be a volatile medium or a non-volatile medium.
According to various embodiments of the previous aspect of the present invention, the computer program according to any of the above aspects, may be implemented in a computer program product comprising a tangible computer-readable medium bearing computer program code embodied therein which can be used with the processor for the implementation of the functions described above.
Reference to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc, or a "processor" or "processing circuit" etc. should be understood to encompass not only computers having differing architectures such as single/multi processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays FPGA, application specify circuits ASIC, signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to express software for a programmable processor firmware such as the programmable content of a hardware device as instructions for a processor or configured or configuration settings for a fixed function device, gate array, programmable logic device, etc. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

Claims
1. An apparatus, comprising:
a controller; and
a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to:
receive input information regarding one or more music tracks;
determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
2. An apparatus according to claim 1, where said computer readable
instructions, when executed by the controller, further cause the controller to:
determine a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
3. An apparatus according to claim 1 or 2, wherein said track level attributes include acoustic features extracted from said first plurality of music tracks or said second plurality of music tracks.
4. An apparatus according to claim 1, 2 or 3, wherein said track level attributes include at least one of: tags associated with at least some of said first plurality of music tracks and said second plurality of music tracks;
metadata associated with at least some of said first plurality of music tracks and said second plurality of music tracks; and
keywords extracted from text associated with at least some of said first plurality of music tracks and said second plurality of music tracks.
5. An apparatus according to any of the preceding claims, wherein said properties include at least one property based on a musical instrument and at least one property based on a musical genre.
6. An apparatus according to claim 5, wherein said properties comprise probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
7. An apparatus according to any of the preceding claims, where said computer readable instructions, when executed by the controller, further cause the controller to:
monitor a history of music tracks previously accessed by a user;
revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the music tracks in the history were played or skipped; and
output said revised list.
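(Non-claim illustration.) A sketch, under assumptions, of the history-based revision in claim 7: candidates resembling tracks the user played are boosted and candidates resembling skipped tracks are penalised before the list is re-ordered; the step size and the cosine measure are illustrative choices, not taken from the claims.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def revise_list(candidates, base_scores, history, step=0.1):
        # candidates: {track_id: property vector}; history: [(property vector, was_played)]
        revised = dict(base_scores)
        for track_id, props in candidates.items():
            for hist_props, was_played in history:
                delta = step * cosine(props, hist_props)
                revised[track_id] += delta if was_played else -delta
        return sorted(candidates, key=lambda t: revised[t], reverse=True)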
8. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
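(Non-claim illustration.) The preference-based ranking of claim 8 can be sketched as a dot product between each candidate's property vector and a per-property preference weight vector derived from the received input; the weight vector is an assumed structure used only for illustration.

    import numpy as np

    def rank_by_preference(candidates, preference_weights):
        # candidates: {track_id: property vector}; preference_weights: per-property weights
        return sorted(candidates,
                      key=lambda t: float(np.dot(candidates[t], preference_weights)),
                      reverse=True)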
9. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
10. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on a further property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said further property.
11. An apparatus according to claim 9 or 10, wherein said computer readable instructions, when executed by the controller, cause the controller to rank the selected music tracks by adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
12. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
13. An apparatus according to claim 12, wherein said computer readable instructions, when executed by the controller, cause the controller to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
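(Non-claim illustration.) A hedged sketch of the two-level tag estimation in claims 12 and 13, using scikit-learn stand-ins: a linear SVM as the non-probabilistic first classifier, Gaussian naive Bayes as the probabilistic second classifier, and a further linear SVM as the non-probabilistic third classifier stacked on their outputs. The specific models and input features are assumptions, not taken from the claims.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC

    class TwoLevelTagger:
        def __init__(self):
            self.first = LinearSVC()      # first classifier (non-probabilistic)
            self.second = GaussianNB()    # second classifier (probabilistic)
            self.third = LinearSVC()      # third classifier (non-probabilistic)

        def fit(self, features, has_tag):
            self.first.fit(features, has_tag)
            self.second.fit(features, has_tag)
            self.third.fit(self._first_level(features), has_tag)

        def _first_level(self, features):
            # Stack the SVM decision value and the naive-Bayes posterior as the
            # first level outputs fed to the third classifier.
            svm_score = self.first.decision_function(features)
            nb_prob = self.second.predict_proba(features)[:, 1]
            return np.column_stack([svm_score, nb_prob])

        def score(self, features):
            # Second level decision value used as the per-track tag score.
            return self.third.decision_function(self._first_level(features))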
14. A method comprising:
receiving input information regarding at least one music track;
determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based at least in part on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
selecting one or more music tracks from the second group of music tracks based at least in part on said similarity; and
outputting a list of said selected music tracks.
15. A method according to claim 14, comprising:
determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
16. A method according to claim 14 or 15, wherein said track level attributes include acoustic features extracted from said first and second pluralities of music tracks.
17. A method according to claim 14, 15 or 16, comprising:
monitoring a history of music tracks accessed by a user;
revising the list of selected music tracks based on the properties of the previously accessed music tracks in said user history and on whether the previously accessed music tracks were played or skipped; and
outputting said revised list.
18. A method according to any of claims 14 to 17, comprising:
ranking the selected music tracks in the list based on one or more of:
user preferences for the properties included in the received input;
properties of previously accessed music tracks indicated in a user history; and
a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property.
19. A method according to any of claims 14 to 17, comprising:
ranking the selected music tracks in the list based on at least one property of at least one previously accessed music track indicated in a user history;
wherein said ranking the selected music tracks comprises adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
20. A computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform a method according to any of claims 14 to 19.
21. A non-transitory tangible computer program product in which are stored computer readable instructions that, when executed by a computer, cause the computer to:
receive input information regarding one or more music tracks;
determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based at least in part on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
22. An apparatus configured to:
receive as input a first group of music tracks;
extract track level attributes associated with a first plurality of music tracks belonging to the first group of music tracks and a second plurality of music tracks belonging to a second group of music tracks;
determine properties of the first plurality of music tracks and the second plurality of music tracks, based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
23. An apparatus according to claim 22, configured to:
determine a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
24. An apparatus according to claim 22 or 23, wherein said track level attributes include acoustic features extracted from said first plurality of music tracks or said second plurality of music tracks.
25. An apparatus according to claim 22, 23 or 24, wherein said track level attributes include at least one of:
tags associated with at least some of said first plurality of music tracks and said second plurality of music tracks;
metadata associated with at least some of said first plurality of music tracks and said second plurality of music tracks; and
keywords extracted from text associated with at least some of said first plurality of music tracks and said second plurality of music tracks.
26. An apparatus according to any of claims 22 to 25, wherein said properties include at least one property based on a musical instrument and at least one property based on a musical genre.
27. An apparatus according to claim 26, wherein said properties comprise probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
28. An apparatus according to any of claims 22 to 27, configured to:
monitor a history of music tracks previously accessed by a user;
revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the music tracks in the history were played or skipped; and
output said revised list.
29. An apparatus according to any of claims 22 to 28, configured to:
rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
30. An apparatus according to any of claims 22 to 29, configured to:
rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
31. An apparatus according to any of claims 22 to 30, configured to:
rank the selected music tracks in the list based, at least in part, on a further property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said further property.
32. An apparatus according to claim 30 or 31, configured to rank the selected music tracks by adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
33. An apparatus according to any of claims 22 to 32, configured to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
34. An apparatus according to claim 33, configured to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
PCT/FI2014/051037 2014-12-22 2014-12-22 Similarity determination and selection of music WO2016102738A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Publications (1)

Publication Number Publication Date
WO2016102738A1 true WO2016102738A1 (en) 2016-06-30

Family

ID=56149313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Country Status (1)

Country Link
WO (1) WO2016102738A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060107823A1 (en) * 2004-11-19 2006-05-25 Microsoft Corporation Constructing a table of music similarity vectors from a music similarity graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CASEY, M. ET AL.: "Content-based information retrieval: current directions and future challenges", PROCEEDINGS OF THE IEEE, vol. 96, no. 4, April 2008 (2008-04-01), pages 668-696 *
KNEES, P. ET AL.: "A music search engine built upon audio-based and web-based similarity measures", INT. ACM SIGIR CONF., 27 July 2007 (2007-07-27), 8 pp. *
PAMPALK, E. ET AL.: "MusicRainbow: a new user interface to discover artists using audio-based similarity and web-based labeling", INT. CONF. ON MUSIC INFORMATION RETRIEVAL (ISMIR 2006), 8 October 2006 (2006-10-08), Victoria, Canada *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10014841B2 (en) 2016-09-19 2018-07-03 Nokia Technologies Oy Method and apparatus for controlling audio playback based upon the instrument
CN110046048A (en) * 2019-04-18 2019-07-23 杭州电子科技大学 A kind of load-balancing method adaptively quickly reassigned based on workload
CN110046048B (en) * 2019-04-18 2021-09-28 杭州电子科技大学 Load balancing method based on workload self-adaptive fast redistribution
IT201900020486A1 (en) * 2019-11-06 2021-05-06 Luciano Nigro DIGITAL PLATFORM FOR REAL-TIME COMPARISON OF MUSICAL INSTRUMENT ELEMENTS
CN113220929A (en) * 2021-04-06 2021-08-06 辽宁工程技术大学 Music recommendation method based on time-staying and state-staying mixed model
CN113220929B (en) * 2021-04-06 2023-12-05 辽宁工程技术大学 Music recommendation method based on time residence and state residence mixed model
EP4336381A1 (en) * 2022-09-09 2024-03-13 Sparwk AS System and method for music entity matching

Similar Documents

Publication Publication Date Title
US11094309B2 (en) Audio processing techniques for semantic audio recognition and report generation
EP2659482B1 (en) Ranking representative segments in media data
GB2533654A (en) Analysing audio data
US8423356B2 (en) Method of deriving a set of features for an audio input signal
US10129314B2 (en) Media feature determination for internet-based media streaming
US10296959B1 (en) Automated recommendations of audio narrations
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
US9774948B2 (en) System and method for automatically remixing digital music
US8865993B2 (en) Musical composition processing system for processing musical composition for energy level and related methods
WO2016102738A1 (en) Similarity determination and selection of music
GB2522644A (en) Audio signal analysis
KR20160069784A (en) Method and device for generating music playlist
US20180173400A1 (en) Media Content Selection
Niyazov et al. Content-based music recommendation system
JP5345783B2 (en) How to generate a footprint for an audio signal
Yu et al. Sparse cepstral codes and power scale for instrument identification
Foucard et al. Multi-scale temporal fusion by boosting for music classification.
Foster et al. Sequential complexity as a descriptor for musical similarity
Krey et al. Music and timbre segmentation by recursive constrained K-means clustering
Pei et al. Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering
Tian A cross-cultural analysis of music structure
Hartmann Testing a spectral-based feature set for audio genre classification
Joseph Fernandez Comparison of Deep Learning and Machine Learning in Music Genre Categorization
Saikkonen Structural analysis of recorded music
Sandrock Multi-label feature selection with application to musical instrument recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14908893

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14908893

Country of ref document: EP

Kind code of ref document: A1