WO2016102738A1 - Similarity determination and selection of music - Google Patents

Similarity determination and selection of music

Info

Publication number
WO2016102738A1
Authority
WO
WIPO (PCT)
Prior art keywords
music tracks
tracks
music
properties
group
Prior art date
Application number
PCT/FI2014/051037
Other languages
French (fr)
Inventor
Antti Eronen
Jussi LEPPÄNEN
Pasi SAARI
Arto Lehtiniemi
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/FI2014/051037 priority Critical patent/WO2016102738A1/en
Publication of WO2016102738A1 publication Critical patent/WO2016102738A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/248 Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/036 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/041 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal based on mfcc [mel-frequency spectral coefficients]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121 Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131 Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141 Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process

Definitions

  • This disclosure relates to determining similarity and similarity-based selection of music tracks.
  • this disclosure relates to assessing and selecting music tracks from a database based on acoustic similarities.
  • Audio content databases, streaming services, online stores and media player software applications often include genre classifications, to allow a user to search for tracks to play, stream and/or download.
  • Some databases, services, stores and applications also include a facility for recommending music tracks to a user based on a history of music that they have accessed in conjunction with other data, such as rankings of tracks or artists from the user, history data from other users who have accessed the same or similar tracks in the user's history or otherwise have similar user profiles, metadata assigned to the tracks by experts and/or users, and so on.
  • an apparatus includes a controller and a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • the input information may, for example, indicate a name of a music track, an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style.
  • the first group of music tracks may contain music tracks by that artist and, optionally, the second group of music tracks may contain music tracks by one or more second artists. If the second group of music tracks contains music tracks by one second artist, then the group level similarity would indicate similarity between the first and second artists.
  • the input information may indicate a particular music track
  • the controller may obtain information regarding one or more of an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style from metadata associated with the particular music track or from extracting information from a local or remote database, and define the first group based on the obtained information.
  • the computer readable instructions when executed by the controller, may further cause the controller to determine a similarity between a first one of said first plurality of music tracks and one of the second plurality of music tracks where said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
  • the track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
  • the properties may include at least one property based on a musical instrument and at least one property based on a musical genre.
  • the properties may include probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
  • the computer readable instructions when executed by the controller, may further cause the controller to monitor a history of music tracks previously accessed by a user, revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and output said revised list.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
  • the computer readable instructions when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property.
  • the ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user.
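  • As an illustration of such an adjustment, the Python sketch below re-ranks candidate tracks by boosting candidates that resemble played history tracks and penalising candidates that resemble skipped ones. It is not taken from the patent; the cosine affinity, the boost/penalty weights and the function names are assumptions for illustration only.

        import numpy as np

        def rerank(candidates, history, boost=0.1, penalty=0.1):
            """Adjust candidate similarity scores using a play/skip history.

            candidates: list of (track_id, similarity, property_vector) tuples
            history:    list of (property_vector, was_played) tuples
            Property vectors hold tag probabilities (instruments, genres); all
            parameter values here are illustrative assumptions.
            """
            adjusted = []
            for track_id, sim, props in candidates:
                score = sim
                for hist_props, was_played in history:
                    # Cosine affinity between candidate and history-track properties.
                    affinity = float(np.dot(props, hist_props) /
                                     (np.linalg.norm(props) * np.linalg.norm(hist_props) + 1e-12))
                    # Reward resemblance to played tracks, penalise resemblance to skipped ones.
                    score += boost * affinity if was_played else -penalty * affinity
                adjusted.append((track_id, score))
            # Highest adjusted score first.
            return sorted(adjusted, key=lambda t: t[1], reverse=True)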
  • the computer readable instructions when executed by the controller, may cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
  • the controller may be caused to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
  • a method includes receiving input information regarding at least one music track, determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, selecting one or more music tracks from the second group of music tracks based at least in part on said similarity and outputting a list of said selected music tracks.
  • Such a method may further include determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks, wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
  • the track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
  • the method may also include monitoring a history of music tracks previously accessed by a user, revising the list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and outputting said revised list.
  • the method may include ranking the selected music tracks in the list based on one or more of user preferences for the properties included in the received input, properties of previously accessed music tracks indicated in a user history and a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. For example, such ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user.
  • the method may include determining said properties by evaluating first level probabilities that a particular tag applies based on the extracted acoustic features and evaluating a second level probability that the particular tag applies based on the first level probability.
  • a computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform any of the above methods according to said aspect may also be provided.
  • a non-transitory tangible computer program product in which is stored computer readable instructions that, when executed by a computer, cause the computer to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • an apparatus configured to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
  • an apparatus includes an interface to receive input information regarding one or more music tracks and to output a list of selected music tracks, an extractor to extract track level attributes associated with a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and a second plurality of music tracks belonging to a second group of music tracks and to determine properties of the first plurality of music tracks and the second plurality of music tracks based on track level attributes of said music tracks, a similarity determination module to determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, and a recommendation engine to select the selected music tracks from the second group of music tracks based at least in part on said similarity.
  • Figure 1 is a schematic diagram of a system in which an embodiment may be included;
  • Figure 2 is a schematic diagram of components of an analysis server according to an embodiment, in the system of Figure 1;
  • Figure 3 is an overview of a method that may be performed by the analysis server of Figure 2;
  • Figure 4 is a flowchart of the method shown in overview in Figure 3;
  • Figure 5 depicts a user interface for use in the method of Figure 3
  • Figure 6 is a flowchart of a method of extracting acoustic features from an input signal, for use in the method of Figure 4;
  • Figure 7 depicts an example of a blocked and windowed input signal
  • Figure 8 depicts an example energy spectrum of a transformed input signal
  • Figure 9 depicts a frequency response of an example filter bank for filtering the transformed input signal shown in Figure 8.
  • Figure 10 depicts an example mel-energy spectrum output from the filter bank represented in Figure 9.
  • Figure 11 is an overview of a process for obtaining multiple types of acoustic features in the method of Figure 4;
  • Figure 12 is an overview of a method of obtaining tag probabilities for use in the method of Figure 4;
  • Figure 13 is a flowchart of the method of Figure 12;
  • Figure 14 shows example probability distributions for instrument-based tags;
  • Figure 15 shows the example probability distributions of Figure 14 after logarithmic transformation;
  • Figure 16 depicts an example track feature vector generated by the method of Figure 13;
  • Figure 17 is an overview of a recommendation procedure that may be performed as a part of the method of Figure 4;
  • Figure 18 is an overview of another recommendation procedure that may be performed as a part of the method of Figure 4;
  • Figure 19 is an overview of yet another recommendation procedure that may be performed as a part of the method of Figure 4.
  • Embodiments described herein concern assessing features of music tracks, determining similarities between music tracks and selecting music tracks based on such similarities, for example, for recommendation to a user.
  • an analysis server 100 is shown connected to a network 102, which can be any data network such as a Local Area Network (LAN), Wide Area Network (WAN) or the Internet.
  • the analysis server 100 is configured to receive and process requests for audio content from one or more terminals 104, 105 via the network 102.
  • two terminals 104, 105 are shown, each incorporating media playback hardware and software, such as a speaker (not shown) and/or audio output jack (not shown) and a processor (not shown) that executes a media player software application to stream and/or download audio content over the network 102 and to play audio content through the speaker.
  • the terminals 104, 105 may be capable of streaming or downloading video content over the network 102 and presenting the video content using the speaker and a display 106.
  • Suitable terminals 104, 105 will be familiar to persons skilled in the art. For instance a smart phone could serve as a terminal 104, 105 in the context of this application although a laptop, tablet or desktop computer may be used instead.
  • Such terminals 104, 105 include music and video playback and data storage functionality and can be connected to the music analysis server 100 via a cellular network, Wi-Fi, Bluetooth® or any other suitable connection such as a cable or wire.
  • the display 106 may be a touch screen display.
  • the analysis server 100 includes a controller 202, an input and output interface 204 configured to transmit and receive data via the network 102, a memory 206 and a mass storage device 208 for storing video and audio data.
  • the controller 202 is connected to each of the other components in order to control operation thereof.
  • the controller 202 may take any suitable form. For instance, it may be a processing arrangement that includes a microcontroller, plural microcontrollers, a processor, or plural processors.
  • the memory 206 and mass storage device 208 may be in the form of a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 206 stores, amongst other things, an operating system 210 and at least one software application 212 to be executed by the controller 202.
  • Random Access Memory (RAM) 214 is used by the controller 202 for the temporary storage of data.
  • the operating system 210 may contain code which, when executed by the controller 202 in conjunction with the RAM 214, controls operation of analysis server 100 and provides an environment in which the or each software application 212 can run.
  • Software application 212 is configured to control and perform audio and video information processing by the controller 202 of the analysis server 100 to determine similarities between music tracks and, optionally, to generate music recommendations.
  • the operation of this software application 212 according to a first embodiment will now be described in detail, with reference to Figures 3 to 4.
  • the accessed music tracks are referred to as input signals.
  • Figure 3 is an overview of a procedure for recommending music tracks to the user of the terminal 104, in which the controller 202 acts as an extractor 30 to extract track level attributes of a music track, a similarity assessment module 31 and a recommendation engine 32.
  • the basis for the recommendation procedure may be information provided by a user of the terminal 104 via the network 102.
  • the user may indicate an artist and/or a track so that similar artists and/or tracks can be identified.
  • the user may provide other information on which the recommendation procedure may be based.
  • the user may indicate one or more of an album, a performer, such as a particular musician, a producer, a record label, a playlist, a musical genre, sub- genre or style, and so on.
  • information regarding music tracks accessed by a user such as the tracks that have been accessed the greatest number of times, the artist with the greatest number of tracks in a library in the terminal 104, recently accessed tracks or recently purchased tracks, may be used to identify an artist and, optionally, a track, referred to as track 1, to use as a basis for generating the recommendations 42.
  • track 1 may be a track selected automatically, such as the track by that artist accessed the most times by the user, or a most popular track by that artist as indicated by a remote database, such as a streaming database, rankings for tracks in a digital music store or information obtained from social media.
  • a first group of music tracks, group 1 is defined based on the information. Where the user has input artist or performer information, or an artist or performer has been identified from other information input by the user or obtained from the user history, the first group may contain multiple tracks by that artist or performer. In another example, if the user has input information identifying an album or record label, the first group may include music tracks from that album or record label.
  • information that may be used as that basis can include one or more of an artist, an album, a performer, a producer, a record label, a playlist, a musical genre, sub-genre or style, and so on, with the first group of music tracks, group 1, being defined according to the basis provided.
  • Attributes 33a to 33c for a first music track 1 of the first group, group 1, and one or more further tracks 2...m of group 1 are obtained from the data stored in the mass storage device 208 or in a remote database.
  • the similarity assessment module 31 defines a combined vector 34 for the first group, group 1, based on some or all of the attributes 33a to 33c obtained for tracks 1...m.
  • attributes 35a to 35c are obtained for a plurality of tracks 1...n of the second group, group 2, from which the recommendations are to be drawn, and a combined vector 36 for group 2 is defined.
  • the second group may contain multiple tracks by a second artist or performer.
  • the second artist may be selected automatically, based on an analysis of attributes 33a to 33c obtained for track 1 of group 1 and/or information from streaming databases, rankings in digital music stores, social media information and so on.
  • such databases may indicate that users who listened to the first artist often listen to certain other artists and one of those other artists may be selected as the second artist and the second group, group 2, defined to include multiple tracks by that second artist.
  • the similarity assessment module 31 determines a group level similarity 37, based on the combined vectors 34, 36 for group 1 and group 2, that is, based on the plurality of tracks 1...m of group 1 and the plurality of tracks 1...n of group 2.
  • the similarity assessment module 31 may also determine one or more track level similarities 38, each based on a vector 39, 40 combining attributes 33a of an individual track of group 1 and the attributes 35a of an individual track of group 2 respectively.
  • a combined group and track similarity 41 may also be computed based on the group level similarity 37 and the track level similarity 38.
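  • The following Python sketch shows one way the group level and track level similarities could be combined: each group's combined vector is taken as the mean of its tracks' tag-probability vectors, cosine similarity is used at both levels, and the two are mixed with a weight. The mean, the cosine measure and the equal weighting are assumptions for illustration, not details specified by the patent.

        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def group_vector(track_vectors):
            # Combined vector for a group (cf. vectors 34 and 36): mean of its track vectors.
            return np.mean(np.asarray(track_vectors), axis=0)

        def combined_similarity(group1_tracks, group2_tracks, track1_vec, track2_vec, w=0.5):
            """Weighted combination of group level and track level similarity.

            group1_tracks, group2_tracks: per-track tag-probability vectors for the two groups
            track1_vec, track2_vec: vectors for the individual tracks being compared
            w: mixing weight (0.5 is an arbitrary illustrative choice)
            """
            group_sim = cosine(group_vector(group1_tracks), group_vector(group2_tracks))
            track_sim = cosine(track1_vec, track2_vec)
            return w * group_sim + (1.0 - w) * track_sim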
  • One or more of the group level similarity 37 and the combined group and track similarity 41 are input to the recommendation engine 32.
  • the recommendation engine may then select music tracks from the video/audio storage 208 or another database, for example, a remote database accessed via the network 102 or other network, as recommendations 42 of music tracks that the user of the terminal 104 might enjoy, based on the input similarities 37, 41 and, optionally, further input from the user of the terminal 104.
  • the recommendations 42 may be output via the I/O interface 204 and transmitted to the terminal 104 for presentation on the display 106.
  • Figure 4 is a flowchart showing further detail of the method described above in relation to Figure 3.
  • a basis for generating the recommendations 42 is obtained (step S4.1).
  • the basis may be provided by the user of the terminal 104.
  • Figure 5 depicts a user interface 50, that may be presented by the display 106, through which the user can provide the basis.
  • the user interface 50 includes fields 51, 52 in which a user can indicate a name for a first artist, and/or a name of a first music track by the first artist, track 1.
  • additional artist and track information may be obtained to supplement the user input as basis for the recommendations 42.
  • one or more sliders 53, 54 may be provided to allow the user to indicate preferences for the type of music tracks to be recommended.
  • sliders 53 are provided for indicating instrument-based preferences and sliders 54 are provided for indicating music genre-based preferences. While Figure 5 depicts sliders 53, 54, in other embodiments, alternative input techniques for obtaining user preferences may be used, such as numerical values indicating relative importance or rankings for the preferences or input arranging the preferences in order of importance to the user.
  • first and second groups of music tracks are defined according to the basis obtained in step S4.1. Examples of ways in which group 1 and group 2 may be defined are discussed above in relation to Figure 3. In one example, where the basis includes information identifying an artist, group 1 may contain one or more music tracks by that artist, while group 2 may contain one or more music tracks by a second artist.
  • the controller 202 obtains attributes of a plurality of tracks 1...m of group 1.
  • attributes may be obtained from metadata associated with the plurality of tracks 1...m indicating, for example, genre of musical tracks or type of artist, obtained from the data storage 208, or a remote database, or information from a streaming service or digital music store.
  • attributes may be obtained by analysing text in social media pages or other webpages. For example, where group 1, is a collection of music tracks by a first artist, an analysis of text on a website for the first artist or a label on which the first artist's music is released and/or reviews of the first artist's music on websites and/or blog pages may be performed and keywords extracted from that text.
  • Another option, which may be combined with one or both of such metadata and such keywords, is to extract acoustic features from audio data of tracks 1...m of group 1.
  • Figure 4 shows attributes 33a to 33c being obtained for individual tracks 1...m of group 1 one by one and used to form a vector 34 for group 1, before attributes 35a to 35c for individual tracks 1...n of group 2 are obtained in turn, in steps S4.4 to S4.8.
  • This sequence may be used, for example, where group 2 is selected based on results from the analysis of the tracks of group 1.
  • attributes 33a to 33c, 35a to 35c for the tracks from groups 1 and 2 may be obtained in a different order than shown in Figure 4, or in parallel.
  • the attributes 33a to 33c, 35a to 35c are acoustic features extracted from audio data of tracks 1...m of group 1 and tracks 1...n of group 2, in the form of probabilities 116 that the tracks include a particular instrument or belong to a particular music genre.
  • the attributes may include one or more of metadata obtained from a database 208, streaming service or digital music store, keywords extracted from text relating to track 1 of group 1 and other audio features of the tracks, as well as, or instead of such tag probabilities 116.
  • at step S6.0, if an input signal for track 1 of group 1 is in a compressed format, such as MPEG-1 Audio Layer 3 (MP3), Advanced Audio Coding (AAC) and so on, the input signal is decoded into pulse code modulation (PCM) data (step S6.1).
  • the samples for decoding are taken at a rate of 44.1 kHz and have a resolution of 16 bits.
  • the controller 202 may, optionally, resample the decoded input signal at a lower rate, such as 22050 Hz (step S6.2).
  • An optional "pre-emphasis" process is shown as step S6.3.
  • the pre-emphasis process filters the decoded input signal to flatten the spectrum of the decoded input signal.
  • the relatively low sensitivity of the human ear to low frequency sounds may be modelled by such flattening.
  • One example of a suitable filter for this purpose is a first-order Finite Impulse Response (FIR) filter with a transfer function of 1 - 0.98z^-1.
  • at step S6.4, the controller 202 blocks the input signal into frames.
  • the frames may include, for example, 1024 or 2048 samples of the input signal, and the subsequent frames may be overlapping or they may be adjacent to each other according to a hop-size of, for example, 50% and 0%, respectively. In other examples, the frames may be non-adjacent so that only part of the input signal is formed into frames.
  • Figure 7 depicts an example in which an input signal 70 is divided into blocks to produce adjacent frames of about 30 ms in length which overlap one another by 25%. However, frames of other lengths and/or overlaps may be used.
  • a Hamming window such as windows 72a to 72d, is applied to the frames at step S6.5, to reduce windowing artifacts.
  • An enlarged portion in Figure 7 depicts a frame 74 following the application of a window to the input signal 70.
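  • A minimal NumPy sketch of the pre-emphasis, blocking and windowing steps described above (steps S6.3 to S6.5) is given below; the frame length, hop size and pre-emphasis coefficient are illustrative values consistent with the examples in the text, not mandated by the patent.

        import numpy as np

        def preemphasise_and_frame(x, frame_len=2048, hop=1024, coeff=0.98):
            """Pre-emphasise the signal, block it into frames and apply a Hamming window.

            hop = frame_len // 2 gives 50% overlapping frames; hop = frame_len would
            give adjacent, non-overlapping frames.
            """
            # First-order FIR pre-emphasis: y[n] = x[n] - 0.98 * x[n-1]
            y = np.append(x[0], x[1:] - coeff * x[:-1])
            assert len(y) >= frame_len, "input shorter than one frame"
            window = np.hamming(frame_len)
            n_frames = 1 + (len(y) - frame_len) // hop
            frames = np.stack([y[i * hop:i * hop + frame_len] * window
                               for i in range(n_frames)])
            return frames  # shape: (n_frames, frame_len)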
  • at step S6.6, a Fast Fourier Transform (FFT) is applied to the windowed signal 74 to produce a magnitude spectrum of the input signal.
  • An example FFT spectrum is shown in Figure 8.
  • the FFT magnitudes may be squared to obtain a power spectrum of the signal for use in place of the magnitude spectrum in the following steps.
  • the spectrum produced by the FFT at step S6.6 may have a greater frequency resolution at high frequencies than is necessary, since the human auditory system is capable of better frequency resolution at lower frequencies but is capable of lower frequency resolution at higher frequencies. So, at step S6.7, the spectrum is filtered to simulate non-linear frequency resolution of the human ear.
  • the filtering at step S6.7 is performed using a filter bank having channels of equal bandwidths on the mel-frequency scale.
  • the mel-frequency scaling may be achieved by setting the channel centre frequencies equidistantly on a mel-frequency scale, given by Equation (1); in its commonly used form, mel(f) = 2595 * log10(1 + f/700), where f is the frequency in Hz.
  • each filtered channel is a sum of the FFT frequency bins belonging to that channel, weighted by a mel-scale frequency response.
  • the weights for filters in an example filter bank are shown in Figure 9.
  • 40 triangular-shaped bandpass filters are depicted whose center frequencies are evenly spaced on a perceptually motivated mel-frequency scale.
  • the filters may span frequencies from 30 Hz to 11025 Hz, in the case of the input signal having a sampling rate of 22050 Hz.
  • the filter heights in Figure 9 have been scaled to unity.
  • Variations may be made in the filter bank in other embodiments, such as spanning the band center frequencies linearly below 1000 Hz, scaling the filters such that they have unit area instead of unity height, varying the number of frequency bands, or changing the range of frequencies spanned by the filters.
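  • The sketch below builds such a triangular mel-spaced filter bank in NumPy, using the common 2595 * log10(1 + f/700) form of the mel scale (an assumption about Equation (1)) and the 40 filters spanning 30 Hz to 11025 Hz mentioned above; applying it to an FFT magnitude spectrum yields the mel-band energies.

        import numpy as np

        def hz_to_mel(f):
            # Common form of the mel scale (assumed for Equation (1)).
            return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

        def mel_to_hz(m):
            return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

        def mel_filterbank(n_filters=40, n_fft=2048, sr=22050, fmin=30.0, fmax=11025.0):
            """Triangular filters with centre frequencies equidistant on the mel scale."""
            mel_points = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
            bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
            fbank = np.zeros((n_filters, n_fft // 2 + 1))
            for i in range(1, n_filters + 1):
                left, centre, right = bin_points[i - 1], bin_points[i], bin_points[i + 1]
                for k in range(left, centre):
                    fbank[i - 1, k] = (k - left) / max(centre - left, 1)
                for k in range(centre, right):
                    fbank[i - 1, k] = (right - k) / max(right - centre, 1)
            return fbank

        # Mel-band energies for one windowed frame:
        # mel_energies = mel_filterbank() @ np.abs(np.fft.rfft(frame, n=2048))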
  • a logarithm, such as a logarithm of base 10, may be taken of the mel-band energies, producing log mel-band energies.
  • An example of a log mel-band energy spectrum is shown in Figure 10.
  • at step S6.9, the MFCCs are obtained by applying a discrete cosine transform to the log mel-band energies.
  • at step S6.10, further mathematical operations may be performed on the MFCCs produced at step S6.9, such as calculating a mean of the MFCCs and/or time derivatives of the MFCCs, to produce the required acoustic features 33a on which the calculation of the tag probabilities 116 will be based.
  • the acoustic features produced at step S6.10 include one or more of:
  • first and, optionally, second time derivatives of the MFCCs, also referred to as "delta MFCCs";
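  • As a compact stand-in for steps S6.6 to S6.10, the following sketch uses the librosa library to compute per-track MFCC statistics of the kind listed above (mean MFCCs, MFCC covariance and delta MFCCs). librosa's internal parameters will not match the patent's pipeline exactly; this is an illustration only.

        import numpy as np
        import librosa

        def mfcc_descriptors(path, sr=22050, n_mfcc=13):
            """Per-track MFCC statistics usable as acoustic features (cf. features 33a/35a)."""
            y, sr = librosa.load(path, sr=sr)                       # decode and resample
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
            delta = librosa.feature.delta(mfcc)                     # first time derivatives
            return {
                "mfcc_mean": mfcc.mean(axis=1),    # average MFCCs over the track
                "mfcc_cov": np.cov(mfcc),          # MFCC covariance matrix
                "delta_mean": delta.mean(axis=1),  # average delta MFCCs
            }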
  • the extracted features are then output (step S6.11).
  • the features output at step S6.11 may also include a fluctuation pattern and danceability features for the track, such as:
  • the mel-band energies calculated in step S6.8 may be used to calculate one or more of the fluctuation pattern features listed above in step S6.10.
  • a sequence of logarithmic domain mel-band magnitude frames is arranged into segments of a desired temporal duration and the number of frequency bands is reduced.
  • an FFT is applied over coefficients of each of the frequency bands across the frames of a segment to compute amplitude modulation frequencies of loudness in a prescribed range, for example, in a range of 1 to 10 Hz.
  • the amplitude modulation frequencies may be weighted and smoothing filters applied.
  • the results of the fluctuation pattern analysis for each segment may take the form of a matrix with rows corresponding to modulation frequencies and columns corresponding to the reduced frequency bands and/or a vector based on those parameters for the segment.
  • the vectors for multiple segments may be averaged to generate a fluctuation pattern vector to describe the music track.
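  • The following NumPy sketch follows that description: log mel-band frames are grouped into segments, the number of bands is reduced, an FFT across the frames of each band yields amplitude modulation spectra, modulation frequencies of roughly 1 to 10 Hz are kept, and the per-segment vectors are averaged. The segment length, the band reduction and the omission of the weighting and smoothing steps are simplifying assumptions.

        import numpy as np

        def fluctuation_pattern(log_mels, frame_rate, seg_frames=256, n_bands=12,
                                mod_lo=1.0, mod_hi=10.0):
            """log_mels: (n_frames, n_mel) log mel-band magnitudes; frame_rate in frames/s."""
            n_frames, n_mel = log_mels.shape
            # Reduce the number of frequency bands by averaging groups of mel bands.
            usable = n_bands * (n_mel // n_bands)
            reduced = log_mels[:, :usable].reshape(n_frames, n_bands, -1).mean(axis=2)
            vectors = []
            for start in range(0, n_frames - seg_frames + 1, seg_frames):
                seg = reduced[start:start + seg_frames]            # (seg_frames, n_bands)
                # FFT over time within each band -> amplitude modulation spectrum.
                mod = np.abs(np.fft.rfft(seg, axis=0))
                freqs = np.fft.rfftfreq(seg_frames, d=1.0 / frame_rate)
                keep = (freqs >= mod_lo) & (freqs <= mod_hi)
                # Matrix with rows = modulation frequencies, columns = reduced bands.
                vectors.append(mod[keep].flatten())
            # Average the per-segment vectors to describe the whole track.
            return np.mean(vectors, axis=0)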
  • Danceability features and club-likeness values are related to beat strength, which may be loosely defined as a rhythmic characteristic that allows discrimination between pieces of music, or segments thereof, having the same tempo. Briefly, a piece of music characterised by a higher beat strength would be assumed to exhibit perceptually stronger and more pronounced beats than another piece of music having a lower beat strength.
  • a danceability feature may be obtained at step S6.10 by detrended fluctuation analysis, which indicates correlations across different time scales, based on the mel-band energies obtained at step S6.8.
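  • A compact sketch of detrended fluctuation analysis is given below, applied to, for example, a summed mel-band energy envelope; it returns the scaling exponent that characterises correlations across time scales. The choice of window sizes is an illustrative assumption.

        import numpy as np

        def dfa_exponent(signal, scales=(16, 32, 64, 128, 256, 512)):
            """Detrended fluctuation analysis scaling exponent of a 1-D envelope."""
            profile = np.cumsum(signal - np.mean(signal))   # integrated, mean-removed profile
            flucts = []
            for s in scales:
                n_win = len(profile) // s
                sq_errs = []
                t = np.arange(s)
                for w in range(n_win):
                    seg = profile[w * s:(w + 1) * s]
                    # Remove the local linear trend within each window.
                    coeffs = np.polyfit(t, seg, 1)
                    sq_errs.append(np.mean((seg - np.polyval(coeffs, t)) ** 2))
                flucts.append(np.sqrt(np.mean(sq_errs)))
            # Slope of log F(s) versus log s gives the scaling exponent.
            return float(np.polyfit(np.log(scales), np.log(flucts), 1)[0])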
  • Examples of techniques of club-likeness analysis, fluctuation pattern analysis and detrended fluctuation analysis are disclosed in British patent application no.
  • the features obtained at step S6.10 may include features relating to tempo in beats per minute (BPM), such as: an average of an accent signal in a low, or lowest, frequency band;
  • a tempo indicator for indicating whether a tempo identified for the input signal is considered constant, or essentially constant, or is considered non-constant or ambiguous;
  • one or more accent signals are derived from the input signal 70, for detection of events and/or changes in the music track.
  • a first one of the accent signals may be a chroma accent signal based on fundamental frequency F0 salience estimation, while a second one of the accent signals may be based on a multi-rate filter bank.
  • a BPM estimate may be obtained based on a periodicity analysis for extraction of a sequence of periodicity vectors on the basis of the accent signals, where each periodicity vector includes a plurality of periodicity values, each periodicity value describing the strength of periodicity for a respective period length, or "lag".
  • a point-wise mean or median of the periodicity vectors over time may be used to indicate a single representative periodicity vector over a time period of the music track. For example, the time period may be over the whole duration of the music track. Then, an analysis can be performed on the periodicity vector to determine a most likely tempo for the music track.
  • One example approach comprises
  • Chorus-related features that may be obtained at step S6.10 include:
  • Example systems and methods that can be used to detect chorus related features are disclosed in US 2008/236371 A1, the disclosure of which is hereby incorporated by reference in its entirety.
  • an average brightness, or spectral centroid (SC), of the music track calculated as a spectral balancing point of a windowed FFT signal magnitude in frames of, for example, 40 ms in length;
  • an average low frequency ratio (LFR);
  • Figure 11 is an overview of a process of extracting multiple acoustic features, some or all of which may be obtained in steps S6.9 and S6.10.
  • Figure 11 shows how some input features are derived, at least in part, from computations of other input features.
  • the features shown in Figure 11 include the MFCCs, delta MFCCs and mel-band energies discussed above in relation to Figure 6, indicated using bold text and solid lines.
  • the dashed lines and standard text indicate other features that may be extracted and made available alongside, or instead of, the MFCCs, delta MFCCs and mel-band energies, for use in calculating the tag probabilities 116.
  • the mel-band energies may be used to calculate fluctuation pattern features and/or danceability features through detrended fluctuation analysis. Results of fluctuation pattern analysis and detrended fluctuation analysis may then be used to obtain a club-likeness value.
  • beat tracking features, labeled as "beat tracking 2" in Figure 11, may be calculated based, in part, on a chroma accent signal from an F0 salience estimation.
  • at step S6.12, tag probabilities 116 and an overall track vector 39 for track 1 of group 1 are evaluated.
  • An overview of an example method for obtaining tag probabilities 116 and creating a track vector 39 is shown in Figure 12.
  • the acoustic features 110 for track 1 of group 1 produced in steps S6.9 and S6.10 are input to first level classifiers 111 to generate first level probabilities for the music track.
  • the first level classifiers 111 include first classifiers 112 and second classifiers 113 to generate first and second probabilities respectively, the second classifiers 113 being different from the first classifiers 112.
  • the first classifiers 112 are non-probabilistic classifiers, while the second classifiers 113 are probabilistic classifiers.
  • the first and second classifiers 112, 113 compute first level probabilities that the music tracks include particular instruments and/or belong to particular musical genres.
  • probabilities based on other acoustic similarities may be included as will be noted hereinbelow.
  • the first level probabilities are input to at least one second level classifier 114.
  • the second level classifier 114 includes a third classifier 115, which may be a non-probabilistic classifier.
  • the third classifier 115 generates the tag probabilities 116 based, at least in part, on the first level probabilities output by the first level classifiers 111 and the second level probabilities output by the second classifiers 113.
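  • The two-level arrangement described above can be sketched with scikit-learn as below: an SVM (first classifier 112) and a pair of Gaussian mixture models (second classifier 113) produce first level probabilities for a tag, and a second SVM (third classifier 115) maps those probabilities to the final tag probability 116. The GMM likelihood-ratio conversion and all hyperparameters are assumptions for illustration, not details taken from the patent.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.mixture import GaussianMixture

        class TwoLevelTagClassifier:
            """First level: SVM + GMM pair per tag; second level: SVM on their outputs."""

            def __init__(self, n_components=8):
                self.svm1 = SVC(kernel="rbf", probability=True)             # first classifier (112)
                self.gmm_pos = GaussianMixture(n_components=n_components)   # second classifier (113)
                self.gmm_neg = GaussianMixture(n_components=n_components)
                self.svm2 = SVC(kernel="rbf", probability=True)             # third classifier (115)

            def _first_level(self, X):
                p_svm = self.svm1.predict_proba(X)[:, 1]
                # Convert the two GMM log-likelihoods into a probability-like score.
                ll_pos = self.gmm_pos.score_samples(X)
                ll_neg = self.gmm_neg.score_samples(X)
                p_gmm = 1.0 / (1.0 + np.exp(ll_neg - ll_pos))
                return np.column_stack([p_svm, p_gmm])

            def fit(self, X, y):
                self.svm1.fit(X, y)
                self.gmm_pos.fit(X[y == 1])
                self.gmm_neg.fit(X[y == 0])
                self.svm2.fit(self._first_level(X), y)
                return self

            def predict_proba(self, X):
                # Tag probability (cf. probabilities 116) from the second level classifier.
                return self.svm2.predict_proba(self._first_level(X))[:, 1]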
  • Figure 13 is a flowchart depicting the method of Figure 12 in more detail.
  • the first and third classifiers 112, 115 are support vector machine (SVM) classifiers and the second classifiers 113 are based on Gaussian Mixture Models (GMM).
  • SVM support vector machine
  • GMM Gaussian Mixture Models
  • following step S13.0, one, some or all of the extracted features 110, or descriptors, obtained in steps S6.9 and S6.10 are selected to be used as input to the first classifiers 112 (step S13.1) and, optionally, normalised (step S13.2).
  • a look up table 216 or database may be stored in the memory 206 of the analysis server 100 that, for each of the tag probabilities to be produced, provides a list of features to be used to generate each first classification and statistics, such as the mean and variance of the selected features, that can be used in normalisation of the extracted features 33a.
  • the controller 202 retrieves the list of features from the look up table 216, and accordingly selects and normalises the listed features for each of the first level probabilities to be generated.
  • the normalisation statistics for each first level probability in the database may be determined during training of the first classifiers 112.
  • the first classifiers 112 are SVM classifiers.
  • the SVM classifiers are trained using a database of music tracks for which information regarding musical instruments and genre is already available.
  • the database may include tens of thousands of tracks that provide examples for each particular musical instrument for which a tag probability 116 is to be evaluated.
  • Examples of musical instruments for which information may be provided in the database include:
  • the training database includes indications of genres that the music tracks belong to, as well as indications of genres that the music tracks do not belong to.
  • Examples of musical genres that may be indicated in the database include:
  • By analysing acoustic features extracted from the music tracks in the training database, for which instruments and/or genre are known, an SVM classifier can be trained to determine whether or not a music track includes a particular instrument, for example, an electric guitar. Similarly, another SVM classifier can be trained to determine whether or not the music track belongs to a particular genre, such as Metal.
  • the training database provides a highly imbalanced selection of music tracks, in that a set of tracks for training a given SVM classifier includes many more positive examples than negative ones.
  • a set of music tracks for training in which the number of tracks that include that instrument is significantly greater than the number of tracks that do not include that instrument will be used.
  • the set of music tracks for training might be selected so that the number of tracks that belong to that genre is significantly greater than the number of tracks that do not belong to that genre.
  • An error cost may be assigned to the different first level probabilities produced by the first classifiers 112 to take account of the imbalances in the training sets. For example, if a minority class of the training set for a particular first classification includes 400 songs and an associated majority class contains 10,000 tracks, an error cost of 1 may be assigned to the minority set and an error cost of 400/ 10,000 may be assigned to the majority class. This allows all of the training data to be retained, instead of downsampling data of the negative examples.
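  • In scikit-learn terms, such error costs can be expressed through per-class weights when training the SVM, for example as in the sketch below; the class labels and the exact weights are illustrative assumptions.

        from sklearn.svm import SVC

        # Error costs for the example above: 1.0 for the 400-track minority class
        # and 400/10,000 for the 10,000-track majority class.
        n_min, n_maj = 400, 10_000
        clf = SVC(kernel="rbf", probability=True,
                  class_weight={1: 1.0, 0: n_min / n_maj})  # 1 = minority, 0 = majority
        # clf.fit(X, y) with y containing labels 1 (minority) and 0 (majority).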
  • New SVM classifiers can be added by collecting new training data and training the new classifiers. Since the SVM classifiers are binary, new classifiers can be added alongside existing classifiers.
  • the training process can include determining a selection of one or more acoustic features to be used as input for particular first classifiers 112 and statistics for normalising those features.
  • the number of features available for selection, M, may be much greater than the number of features selected for determining a particular first classification, N; that is, M >> N.
  • the selection of features to be used is determined iteratively, based on a development set of music tracks for which the relevant instrument or genre information is available, as follows.
  • a check is made as to whether two or more of the features are so highly correlated that the inclusion of more than one of those features would not be beneficial. For example, if two features have a correlation coefficient that is larger than 0.9, then only one of those features is considered available for selection.
  • the feature selection training starts using an initial selection of features, such as the average MFCCs for music tracks in the development set or a single "best" feature for a given first classification. For instance, a feature that yields the largest classification accuracy when used individually may be selected as the "best" feature and used as the sole feature in an initial feature selection. An accuracy of the first classification based on the initial feature selection is determined. Further features are then added to the feature selection to determine whether or not the accuracy of the first classification is improved by their inclusion. Features to be tested for addition to the selection of features may be chosen using a method that combines forward feature selection and backward feature selection in a sequential floating feature selection. Such feature selection may be performed during the training stage, by evaluating the classification accuracy on a portion of the training set.
  • each of the features available for selection is added to the existing feature selection in turn, and the accuracy of the SVM classifier with each candidate feature added is assessed.
  • the feature selection is then updated to include the feature that, when added to the feature selection, provided the largest increase in accuracy for the development set.
  • the accuracy of the SVM classifier is reassessed, by generating probabilities based on edited feature selections in which each of the features in the feature selection is omitted in turn. If it is found that the omission of one or more features provides an improvement in the accuracy of a generated probability, then the feature that, when omitted, leads to the biggest improvement in accuracy is removed from the feature selection. If no such improvement is found, the feature selection is left unchanged.
  • the iterative process of adding and removing features to and from the feature selection continues until the addition of a further feature no longer provides a significant improvement in the accuracy of the SVM classifier. For example, if the improvement in accuracy falls below a given percentage, the iterative process may be considered complete, and the current selection of features is stored in the lookup table 216, for use in selecting features in step S13.1.
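  • The forward/backward floating selection just described can be sketched as follows; the evaluate callable (accuracy on the development set) and the minimum-gain stopping rule are assumptions standing in for the patent's training procedure.

        def floating_feature_selection(candidates, evaluate, initial, min_gain=0.005):
            """Sequential floating feature selection.

            candidates: list of available feature names
            evaluate:   callable mapping a feature subset -> classification accuracy
                        on the development set
            initial:    starting selection, e.g. ["mfcc_mean"] or the single best feature
            """
            selected = list(initial)
            best_acc = evaluate(selected)
            while True:
                # Forward step: try adding each remaining feature in turn.
                gains = {f: evaluate(selected + [f]) for f in candidates if f not in selected}
                if not gains:
                    break
                best_add, add_acc = max(gains.items(), key=lambda kv: kv[1])
                if add_acc - best_acc < min_gain:
                    break                      # no significant improvement: stop
                selected.append(best_add)
                best_acc = add_acc
                # Backward (floating) step: drop a feature if its omission improves accuracy.
                while len(selected) > 1:
                    drops = {f: evaluate([g for g in selected if g != f]) for f in selected}
                    best_drop, drop_acc = max(drops.items(), key=lambda kv: kv[1])
                    if drop_acc <= best_acc:
                        break
                    selected.remove(best_drop)
                    best_acc = drop_acc
            return selected, best_acc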
  • the normalisation of the selected features 110 at step S13.2 is optional. Where provided, it may be based on statistics, such as the mean and variance of the selected features, determined during training of the first classifiers 112 and stored in the look up table 216.
  • a linear feature transform may be applied to the available features 110 obtained in steps S6.9 and s6.io, instead of performing the feature selection procedure described above.
  • For example, using Partial Least Squares Discriminant Analysis (PLS-DA), a linear feature transform is applied to an initial high-dimensional set of features to arrive at a smaller set of features which provides a good discrimination between classes.
  • the initial set of features may include some or all of the available features, such as those shown in Figure 11, from which a reduced set of features can be selected based on the result of the transform.
  • the PLS-DA transform parameters may be optimized and stored in a training stage.
  • the transform parameters and its dimensionality may be optimized for each tag or output classification, such as an indication of an instrument or a genre.
  • the training of the system parameters can be done in a cross-validation manner, for example, as five-fold cross-validation, where all the available data is divided into five non-overlapping sets. At each fold, one of the sets is held out for evaluation and the four remaining sets are used for training. Furthermore, the division of folds may be specific for each tag or classification.
  • the training set is split into 50%-50% inner training-test folds.
  • the PLS-DA transform may be trained on the inner training-test folds and the SVM classifier may be trained on the obtained dimensions.
  • the accuracy of the SVM classifier using the transformed features may be evaluated on the inner test fold. It is noted that, when a feature vector (track) is tested, it is subjected to the same PLS-DA transform, the parameters of which were obtained during training.
  • an optimal dimensionality for the PLS-DA transform may be selected. For example, the dimensionality may be selected such that the area under the receiver operating characteristic (ROC) curve is maximized. In one example embodiment, an optimal dimensionality is selected among candidates between 5 to 40 dimensions.
  • ROC receiver operating characteristic
  • the PLS-DA transform is trained on the whole of the training set, using the optimal number of dimensions, and then used in training the SVM classifier.
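  • A scikit-learn sketch of selecting the PLS-DA dimensionality by the area under the ROC curve is given below, with PLSRegression on binary targets standing in for PLS-DA and a single 50%-50% split standing in for the inner folds; these substitutions are assumptions for illustration.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.svm import SVC
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        def select_plsda_dims(X, y, candidates=range(5, 41, 5)):
            """Pick the PLS-DA dimensionality that maximises ROC AUC for one tag."""
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                                      stratify=y, random_state=0)
            best_dim, best_auc = None, -np.inf
            for dim in candidates:
                pls = PLSRegression(n_components=dim).fit(X_tr, y_tr)  # PLS on binary targets
                clf = SVC(kernel="rbf", probability=True).fit(pls.transform(X_tr), y_tr)
                auc = roc_auc_score(y_te, clf.predict_proba(pls.transform(X_te))[:, 1])
                if auc > best_auc:
                    best_dim, best_auc = dim, auc
            return best_dim, best_auc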
  • other feature transforms such as Linear Discriminant Analysis (LDA), Principal Components Analysis (PCA), or Independent Component Analysis (ICA) could be used.
  • LDA Linear Discriminant Analysis
  • PCA Principal Components Analysis
  • ICA Independent Component Analysis
  • the selected features 110 on which the first classifications are based are the mean of the MFCCs of track 1 of group 1 and the covariance matrix of the MFCCs of track 1 of group 1, although in other examples alternative and/or additional features, such as the other features shown in Figure 11, may be used.
  • the controller 202 defines a feature vector based on each set of selected features 110 or selected combination of features 110 for track 1 of group 1.
  • the feature vectors may then be normalized to have a zero mean and a variance of 1, based on statistics determined and stored during the training process.
  • the controller 202 generates one or more first probabilities that track 1 of group 1 has a certain characteristic, based on the feature vector or vectors.
  • the first classifiers 112 are used to calculate respective probabilities for each feature vector defined in step S13.3.
  • the number of first classifiers 112 corresponds to the number of tag probabilities 116 to be predicted for the music track.
  • a probability is generated by a respective first classifier 112 for each instrument tag probability and for each genre tag probability to be predicted for the music track, based on the mean MFCCs and the MFCC covariance matrix.
  • a probability may be generated by the first classifiers 112 based on whether the music track is likely to be an instrumental track or a vocal track.
  • another probability may be generated by the first classifiers 112 based on whether the vocals are provided by a male or female vocalist.
  • the controller 202 may generate only one or some of these probabilities and/or calculate additional probabilities at step S13.4.
  • the different classifications may be based on respective selections of features from the available features 110 selected in step S13.1.
  • the first classifiers 112 may use a radial basis function (RBF) kernel K, commonly defined as K(x, x') = exp(-gamma * ||x - x'||^2), where gamma > 0 is a kernel width parameter.
  • the output from the first classifiers 112 may be in the form of first classifications based on an optimal predicted probability threshold that separates a positive prediction from a negative prediction for a particular tag probability, based on the probabilities computed by the first classifiers 112.
  • the setting of an optimal predicted probability threshold may be learned in the training procedure to be described later below. Where there is no imbalance in data used to train the first classifiers 112, the optimal predicted probability threshold may be 0.5.
  • the threshold p_thr may be set to a prior probability of a minority class P_min in the first classification, using Equation (4): p_thr = P_min = n_min / (n_min + n_maj), where, in the set of n tracks used to train the SVM classifiers, n_min is the number of tracks in the minority class and n_maj is the number of tracks in a majority class.
• the prior probability P_min may be learned as part of the training of the SVM classifiers.
  • Probability distributions for examples of possible first classifications, based on an evaluation of a number n of tracks, are shown in Figure 14.
• the nine examples in Figure 14 suggest a correspondence between a prior probability for a given first classification and its probability distribution based on the n tracks. Such a correspondence is particularly marked where the SVM classifier was trained with an imbalanced training set of tracks. Consequently, the predicted probability thresholds for the different examples vary over a considerable range.
• a logarithmic transformation may be applied to the probabilities produced by the first classifiers 112 at step S13.4, so that the probabilities are on the same scale and the optimal predicted probability threshold may correspond to a predetermined value, such as 0.5.
  • Equations (5) to (7) below provide an example normalization which adjusts the optimal predicted probability threshold to 0.5.
• if the probability output by an SVM classifier is p and the prior probability P of a particular tag being applicable to a track is greater than 0.5, then the normalized probability p_norm is given by Equations (5) to (7).
• Figure 15 depicts the example probability distributions of Figure 14 after a logarithmic transformation has been applied, on which optimal predicted probability thresholds of 0.5 are marked.
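Equations (5) to (7) are not reproduced in this extract. As a hedged sketch only, one monotone mapping with the required behaviour (it sends the learned threshold p_thr to exactly 0.5 while fixing 0 and 1) is the power law below; it is an assumed stand-in, not the patent's formula.

```python
import numpy as np

def normalise_probability(p, p_thr, eps=1e-12):
    """Map p in [0, 1] so that p == p_thr lands on 0.5 (assumed form, not Eq. (5)-(7)).

    Uses p_norm = p ** (ln 0.5 / ln p_thr), which is monotone increasing,
    fixes 0 and 1, and maps the learned threshold p_thr to exactly 0.5.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_thr = np.clip(p_thr, eps, 1.0 - eps)
    return p ** (np.log(0.5) / np.log(p_thr))
```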
• the probabilities output by the first classifiers 112 correspond to a normalised probability p_norm that a respective one of the tags to be considered applies to track 1 of group 1.
• the first classifications may include probabilities p_inst1 that a particular instrument is included in the music track and probabilities p_gen1 that the music track belongs to a particular genre.
• in steps S13.5 and S13.6, further first level probabilities are generated for the input signal by the second classifiers 113, based on the MFCCs and other parameters produced in step S4.4.
• although Figure 13 shows steps S13.3 to S13.6 being performed in sequence, in another embodiment steps S13.5 and S13.6 may be performed before, or in parallel with, steps S13.3 and S13.4.
• the acoustic features 110 of track 1 of group 1 on which the second classifications are based are the MFCC matrix and the first time derivatives of the MFCCs, and probabilities p_inst2, p_gen2 are generated for each instrument tag (step S13.5) and for each musical genre tag (step S13.6) to be predicted.
  • further probabilities may be generated based on whether the music track is likely to be an instrumental track or a vocal track and, for vocal tracks, another probability may be generated based on whether the vocals are provided by a male or female vocalist.
  • the controller 202 may generate only one or some of these second classifications and/or calculate additional second classifications at steps S13.5 and S13.6.
• the second classifiers 113 compute probabilities p_inst2, p_gen2 using probabilistic models that have been trained to represent the distribution of features extracted from audio signals captured from each instrument or genre. Such training can be performed using an expectation maximisation algorithm that iteratively adjusts the model parameters to maximise the likelihood of the model for a particular instrument or genre generating features matching one or more input features in the captured audio signals for that instrument or genre.
  • the parameters of the trained probabilistic models may be stored in a database, for example, in the database 208 of the analysis server, or in remote storage that is accessible to the analysis server 100 via a network, such as the network 102.
• the instrument-based probabilities p_inst2 are produced by the second classifiers 113 using first and second Gaussian Mixture Models (GMMs), based on the MFCCs and their first time derivatives calculated in step S13.5.
• the probabilities p_gen2 that the music track belongs to a particular musical genre are produced by the second classifiers 113 using third GMMs.
• the first and second GMMs used to compute the instrument-based probabilities p_inst2 may be trained and used slightly differently from the third GMMs used to compute the genre-based probabilities p_gen2, as will now be explained.
• the first and second GMMs used in step S13.5 may have been trained with an Expectation Maximisation algorithm using a training set containing examples which are known to include the instrument and examples which are known not to include the instrument. For each track in the training set, MFCC feature vectors and their corresponding first time derivatives are computed. The MFCC feature vectors for the examples in the training set that contain the instrument are used to train a first GMM for that instrument, while the MFCC feature vectors for the examples that do not contain the instrument are used to train a second GMM for that instrument. In this manner, for each instrument to be tagged, two GMMs are produced.
  • the first GMM is for a track that includes the instrument and is used to evaluate the likelihood L yes
  • the second GMM is for a track that does not include the instrument and is used to evaluate the likelihood L no .
  • the first and second GMMs each contain 64 component Gaussians.
  • the first and second GMMs may then be refined by discriminative training, for example using maximum mutual information (MMI) criterion on a balanced training set where, for each instrument to be tagged, the number of example tracks that contain the instrument is equal to the number of example tracks that do not contain the instrument.
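A minimal sketch of the EM training stage for the per-instrument "yes"/"no" GMMs, using scikit-learn, is shown below. The data layout and the diagonal covariance choice are assumptions, and the MMI-based discriminative refinement mentioned above is not sketched.

```python
# Hypothetical sketch: fit one GMM on frames from tracks containing the instrument
# and one on frames from tracks that do not. 64 components follows the example text.
from sklearn.mixture import GaussianMixture

def train_instrument_gmms(frames_with_instrument, frames_without_instrument,
                          n_components=64, seed=0):
    """Each argument is an (n_frames, n_features) array of stacked MFCC and
    delta-MFCC frame vectors pooled over the relevant training tracks."""
    gmm_yes = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=seed).fit(frames_with_instrument)
    gmm_no = GaussianMixture(n_components=n_components, covariance_type="diag",
                             random_state=seed).fit(frames_without_instrument)
    return gmm_yes, gmm_no
```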
• the two likelihoods L_yes, L_no are computed based on the first and second GMMs and the MFCCs for the music track.
• the first is the likelihood L_yes that the corresponding instrument is included in the music track, while the second is the likelihood L_no that the instrument is not included in the music track.
• the first and second likelihoods L_yes, L_no may be computed in a log-domain, and then converted to a linear domain.
• the first and second likelihoods L_yes, L_no are then mapped to a tag probability p_inst2 of the instrument being included in the track, as follows:
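The mapping equation itself is not reproduced in this extract. A minimal sketch, assuming the common choice p_inst2 = L_yes / (L_yes + L_no) evaluated from per-frame average log-likelihoods, is:

```python
import numpy as np

def instrument_probability(gmm_yes, gmm_no, frames):
    """Map the two GMM likelihoods to a tag probability p_inst2 (assumed mapping).

    Uses GaussianMixture.score, the average per-frame log-likelihood, and a
    max-shift before exponentiation for numerical stability.
    """
    log_l_yes = gmm_yes.score(frames)   # average per-frame log-likelihood
    log_l_no = gmm_no.score(frames)
    m = max(log_l_yes, log_l_no)
    l_yes, l_no = np.exp(log_l_yes - m), np.exp(log_l_no - m)
    return l_yes / (l_yes + l_no)
```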
• the third GMMs used to compute genre-based probabilities p_gen2 are trained differently from the first and second GMMs. For each genre to be considered, a third GMM is trained based on MFCCs for a training set of tracks known to belong to that genre. One third GMM is produced for each genre to be considered. In this example, the third GMM includes 64 component Gaussians.
  • a likelihood L gen is computed for the track 1 of group 1 belonging to that genre, based on the likelihood of each of the third GMMs being capable of outputting the MFCC feature vector of the music track. For example, to determine which of the eighteen genres in the list hereinabove might apply to the music track, eighteen likelihoods would be produced.
  • m is the number of genre tags to be considered.
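The normalisation over the m genre likelihoods is likewise not reproduced here. A hedged sketch, assuming each likelihood is divided by the sum over all m genre models, is:

```python
import numpy as np

def genre_probabilities(genre_gmms, frames):
    """Turn the m genre likelihoods L_gen into tag probabilities p_gen2 (assumed form)."""
    log_l = np.array([gmm.score(frames) for gmm in genre_gmms])  # one value per genre
    log_l -= log_l.max()                  # stabilise before exponentiation
    lik = np.exp(log_l)
    return lik / lik.sum()
```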
  • first and second GMMs may be trained and used in the manner described above for the third GMMs.
  • the GMMs used for analysing genre may be trained and used in the same manner, using either of techniques described in relation to the first, second and third GMMs above.
• the first classifications p_inst1 and p_gen1 and the second classifications p_inst2 and p_gen2 for track 1 of group 1 are then normalized to have a mean of zero and a variance of 1 (step S13.7) and collected to form a feature vector for input to the one or more second level classifiers 115 (step S13.8).
  • the second level classifiers 115 include third classifiers 116, as noted above, and the third classifiers 116 are non-probabilistic classifiers, such as SVM classifiers trained in a similar manner to that described above in relation to the first classifiers 112.
• the first classifiers 112 and the second classifiers 113 may be used to output probabilities p_inst1, p_gen1, p_inst2, p_gen2 for the training sets of example music tracks from the database.
  • the outputs from the first and second classifiers 112, 113 are then used as input data to train the third classifiers 116.
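A sketch of this second level ("stacking") arrangement is given below; the array shapes and the use of scikit-learn's StandardScaler and SVC are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch: stack the first level tag probabilities into one feature
# vector per track, normalise them, and train a further SVM per tag.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_third_classifier(p_inst1, p_gen1, p_inst2, p_gen2, labels):
    """Each p_* argument is an (n_tracks, n_tags) array of first level probabilities;
    labels is an (n_tracks,) array of 0/1 ground-truth values for one tag."""
    X = np.hstack([p_inst1, p_gen1, p_inst2, p_gen2])
    scaler = StandardScaler().fit(X)                  # zero mean, unit variance
    clf = SVC(kernel="rbf", probability=True).fit(scaler.transform(X), labels)
    return scaler, clf
```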
• the third classifiers 116 determine second level probabilities p_inst3 for whether track 1 of group 1 contains a particular instrument and/or second level probabilities p_gen3 for whether track 1 of group 1 belongs to a particular genre (step S13.9).
  • the third classifiers 116 are SVM classifiers
• the second level probabilities p_inst3, p_gen3 are generated in a similar manner to the first level probabilities p_inst1, p_gen1 computed by the first classifiers 112.
• the second level probabilities p_inst3, p_gen3 are then log normalised (step S13.11), as described above in relation to the first level probabilities p_inst1, p_gen1 from the first classifiers 112, and output as the tag probabilities 116 at step S13.11.
• tags based on the tag probabilities 116 may be associated with the music track at step S13.11.
• where the tag probabilities 116 exceed a probability threshold, such as 0.5 for normalised probabilities, tags corresponding to the instruments and/or genres may be stored in a database entry for the music track in the database 208.
• the track vector 39 is then generated at step S13.12 from the tag probabilities 116 output at step S13.11 and normalised.
  • An example of a track vector 39 is shown in Figure 16.
  • the track vector 39 reflects non-zero probabilities for the music track being a rock song including lead and backing vocals, bass, drums, electric guitar, keyboard and percussion.
• the first level and/or second level probabilities p_inst1, p_gen1, p_inst2, p_gen2, p_inst3, p_gen3 themselves and/or the features 110 extracted at steps S6.9 and S6.10 may be output for further analysis and/or storage.
  • the tag probability calculation process ends at step S13.13.
• steps S4.4 and S4.5 are repeated to obtain attributes 33b, 33c for tracks 2 to m of group 1, until no further tracks of group 1 remain to be analysed (step S4.5).
  • step S13.12 to create track vectors 39 for tracks 2...m of group 1 is optional.
• attributes 33a to 33c are obtained for tracks 1...m of group 1, while a track vector 39 may be generated for one, some or all of the tracks 1...m of group 1.
• the combined vector 34 for group 1 is then created (step S4.6), based on the tag probabilities 116 generated at step S4.4 for tracks 1...m of group 1.
  • the feature vector 34 may be created by summing the tag probabilities 116 for all of the analysed tracks 1 to m of group 1 and, optionally, normalising the sum.
• the attributes 35a to 35c are obtained in turn (steps S4.7, S4.8) and, for at least one of the tracks 1...n of group 2, a track vector 40 is created, as described above in relation to steps S4.4 to S4.6 and Figures 6 and 13, until no further tracks of group 2 remain to be analysed (step S4.8).
• the creation of a track vector 40 may be performed at step S13.12 for one, some or all of the tracks 1...n of group 2.
• a combined vector 36 for group 2 is then created (step S4.9), for example by summing the tag probabilities 116 for the tracks 1...n of group 2, and, optionally, normalising the sum.
• the group level similarity 37 for the tracks of groups 1 and 2 is calculated by evaluating the similarity between the combined vectors 34, 36 for groups 1 and 2 (step S4.10). For example, if the combined vectors 34, 36 for artists 1 and 2 are denoted by a and b, their similarity sim(a, b) can be measured with a cosine similarity defined as shown by Equation (11):
sim(a, b) = (a · b) / (‖a‖ ‖b‖)    (11)
• one alternative technique may include using the Euclidean distance and taking its inverse to obtain the similarity sim(a, b).
  • Another example technique for assessing similarity may use the Kullback-Leibler divergence.
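The three similarity measures mentioned above can be sketched as follows; the mapping of the Kullback-Leibler divergence to a similarity score via 1 / (1 + D) is an assumption, as the text here does not specify one.

```python
import numpy as np

def cosine_similarity(a, b):
    """Equation (11): sim(a, b) = a . b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def inverse_euclidean_similarity(a, b, eps=1e-9):
    """Alternative: inverse of the Euclidean distance (eps avoids division by zero)."""
    return 1.0 / (np.linalg.norm(a - b) + eps)

def kl_similarity(a, b, eps=1e-12):
    """Alternative: similarity from a symmetrised Kullback-Leibler divergence between
    the combined vectors (assumed non-negative), mapped to a score via 1 / (1 + D)."""
    p = a / a.sum() + eps
    q = b / b.sum() + eps
    d = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return 1.0 / (1.0 + d)
```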
• One or more track level similarities 38 are assessed at step S4.11, based on the similarity of the track vectors 39, 40.
  • the similarity of the track vectors 39, 40 may be assessed using Equation (11) above.
• a combined group and track similarity 41 may then be determined, for example, by summing the group level similarity 37 and the track level similarity 38 (step S4.12).
• the group level similarity 37, the track level similarity 38 and the combined group and track similarity 41 are normalised so that they have values in a range between 0 and 1.
  • Similarities between group 1 and one or more further groups of music tracks may be computed by repeating steps S4.7 to S4.12 for additional groups and generating respective group level similarities 37, track level similarities 38 and combined group and track similarities 41 for each additional group.
  • group 1 contains tracks by a first artist and group 2 contains tracks by a second artist
  • groups 3 and 4 may be defined, containing tracks by a third artist, a fourth artist respectively, and so on.
• a list of recommendations 42 of tracks is compiled from the music tracks of group 2 and, where provided, any further groups of music tracks that have been analysed, based on one or both of the group level similarity 37 and, optionally, the combined group and track similarity 41.
• a list of tracks exhibiting the highest combined group and track similarity 41 and/or other similarity 37, 38 may be compiled at step S4.13 and output to the user (step S4.14).
  • the list of recommendations 42 may be ranked and/or revised as part of the compilation (step S4.15). Examples of compilation procedures that may be performed at step S4.15 will now be described, with reference to Figures 17, 18 and 19.
• Figures 17 to 19 show example procedures for generating the list of recommendations 42.
  • the list of preliminary candidates is revised based on user preferences input by the user, for example, by using the sliders 53, 54 in the user interface 50 shown in Figure 5.
• the user may have indicated that they would like to receive recommendations of tracks that are jazzier and include more piano than a particular track indicated in field 52 of the user interface 50, corresponding to track 1 of group 1, but include fewer stringed instruments, using the sliders 53, 54.
  • the tag probability 116 corresponding to a first property indicated by the user is identified (step S17.3) and the relevant tag probabilities for the candidate tracks are retrieved or otherwise obtained (step S17.4) and adjusted as follows.
• if the user input indicated a positive contribution for the property (step S17.5), such as "more jazz", the tag probabilities 116 for a genre of "jazz" for the candidate tracks are added to the values calculated for one or more of the similarities between track 1 of group 1 and the candidate tracks (step S17.6). If the user input indicated a negative contribution for the property (step S17.5), such as "less strings", the tag probabilities 116 for stringed instruments for the candidate tracks are subtracted from the similarity values for the candidate tracks (step S17.7). If another preference has been indicated by the user (step S17.8), then steps S17.3 to S17.7 are repeated for the next preference, until the similarity values have been adjusted for all the received user preferences (step S17.8).
• the candidate tracks are then ranked based on their adjusted similarities (step S17.9), completing the procedure (step S17.10).
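A compact sketch of the Figure 17 adjustment loop is shown below. The dictionary-based data structures and the +1/-1 encoding of "more"/"less" preferences are illustrative assumptions.

```python
# Hypothetical sketch: each user preference adds or subtracts the corresponding tag
# probability to/from a candidate track's similarity score before re-ranking.
def adjust_and_rank(candidates, preferences):
    """candidates: list of dicts with keys "track", "similarity" and "tags"
    (mapping tag name to tag probability); preferences: dict mapping tag name
    to +1.0 ("more") or -1.0 ("less")."""
    adjusted = []
    for cand in candidates:
        score = cand["similarity"]
        for tag, sign in preferences.items():
            # Add or subtract the candidate's tag probability (steps S17.5 to S17.7)
            score += sign * cand["tags"].get(tag, 0.0)
        adjusted.append((score, cand["track"]))
    # Rank with the highest adjusted similarity first (step S17.9)
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in adjusted]
```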
  • the list of recommendations 42 output at step S4.14 may be based on a selected subset, or on all, of the candidate tracks in the ranked list of candidate tracks. For example, a predetermined number of the highest ranked candidate tracks may be selected for inclusion in the list of recommendations 42.
• a preliminary list of candidate tracks from group 2 is obtained (step S18.1), for example by compiling the list based on one or more of the similarities calculated in steps S4.10 to S4.12 as described above in relation to Figure 17.
  • the candidate tracks in the preliminary list are ranked based on user preferences as indicated by the user's listening history, as will now be explained.
• a user history is obtained in step S18.2.
• the user history may be based on the number of times the user has previously accessed music tracks stored on the terminal 104, another database or a streaming service, tracks ranked by the user on social media or an online music database, and/or on tracks purchased by the user from a digital music store.
  • the controller 202 obtains a tag indicating a user preference from the tracks in the history (step S18.3). For example, the controller 202 may determine the most common tag for the previously accessed tracks shown in the user history. The corresponding tag probabilities 116 for the candidate tracks from group 2 are then retrieved (step S18.4) .
• if the obtained tag seems to be viewed positively by the user (step S18.5), for example, if the tag occurs most often in previously accessed tracks that were played, downloaded or purchased by the user, then the tag probabilities 116 for the candidate tracks are added to one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.6).
• if the obtained tag seems to indicate a negative user preference, for example if the tag occurs most often in previously accessed tracks that were skipped by the user (step S18.5), then the tag probabilities 116 for the candidate tracks are subtracted from one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.7). Steps S18.3 to S18.7 may be repeated for further tags, if required (step S18.8).
  • the candidate tracks are then ranked based on their adjusted similarities (step S18.9), completing the procedure (step S18.10) .
  • the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks produced at step S18.9.
  • a list of preliminary candidates is obtained (step S19.1), as discussed above in relation to Figure 17.
  • the preliminary candidates are ranked based on properties other than those on which the tag probabilities 116 are based.
• the controller 202 determines whether the user history includes a previously accessed music track that has a tag that was not included in track 1 of group 1 (step S19.2).
• a previously accessed music track in the user history may include instruments that were not included in track 1 of group 1, or belong to a different genre from track 1 of group 1. In the following, such a tag is referred to as a "new tag".
• the tag probabilities 116 for the new tag are retrieved for the candidate tracks (step S19.3). If the user history indicates that the user listened to the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are added to their respective similarities (step S19.5).
• if the user history indicates that the user skipped the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are subtracted from their respective similarities (step S19.6).
• if there are further new tags in the previously accessed track (steps S19.7, S19.8), then tag probabilities 116 for the candidate tracks for the further new tags are also added to, or subtracted from, the similarities as appropriate (steps S19.5, S19.6). If required, the controller 202 may then search for another previously accessed track with at least one new tag in the user history (steps S19.9, S19.2), to further adjust the similarities of the candidate tracks (steps S19.3 to S19.9). The candidate tracks are then ranked based on their adjusted similarities (step S19.10), completing the procedure (step S19.11).
• the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks.
  • one or more of the methods described above with reference to Figures 17 to 19 may be used to compile the list of recommendations 42 at step S4.13.
• the list of recommendations 42 is output at step S4.14 via the I/O interface 204.
  • the list is transmitted to the user's terminal 104 via the network 102.
  • the terminal 104 may present the list of recommendations 42 to the user as a list of music tracks, optionally with links to access the recommended tracks from a streaming service or database or to purchase the recommended tracks from a digital music store.
• the recommendations include music tracks in a library accessible by the terminal 104, for example stored in storage of the terminal 104.
  • the list of recommendations 42 may include, or take the form of, a playlist to be followed by a media player software application stored in the terminal 104.
  • the procedure for recommending music tracks may end at this point (step S4.15)
• the analysis server 100 may receive and monitor user history information (step S4.16) from the terminal 104 after the list of recommendations 42 has been output (step S4.14) and determine whether the list of recommendations 42 should be revised (step S4.17).
  • the controller 202 may determine that revision is needed to adjust the recommendations 42 based on whether the user has listened to, or skipped, tracks in the existing list of recommendations 42.
• if revision is needed, the controller 202 revises the list of recommendations 42 (step S4.18).
• the controller 202 may update the list of recommendations 42 based on tags from the recommended tracks that the user has listened to, or skipped, by performing the method of Figure 18, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S18.1 and the received updated user history as the user history obtained in step S18.2.
  • the controller 202 may update the list of recommendations 42 if a previously accessed music track appearing in the updated user history includes a new tag, using the method of Figure 19, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S19.1.
  • the revised list of recommendations 42 based on the new rankings is then output (step S4.14) .
• steps S4.14 to S4.18 may continue until further revision is not needed (step S4.15), for example, if the user of the terminal 104 pauses or stops music playback, closes the media player application or switches off the terminal 104.
  • Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic.
  • the software, application logic and/or hardware may reside on memory, or any computer media.
  • the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media.
  • a "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
  • a computer-readable medium may comprise a computer-readable storage medium that may be any tangible media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer as defined previously.
  • the computer-readable medium may be a volatile medium or a non-volatile medium.
  • the computer program according to any of the above aspects may be implemented in a computer program product comprising a tangible computer-readable medium bearing computer program code embodied therein which can be used with the processor for the implementation of the functions described above.
• references to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuit" etc. should be understood to encompass not only computers having differing architectures such as single/multi processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices.
• references to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware, such as the programmable content of a hardware device, whether instructions for a processor or configured or configuration settings for a fixed function device, gate array, programmable logic device, etc.
  • the different functions discussed herein may be performed in a different order and/or concurrently with each other.
  • one or more of the above-described functions may be optional or may be combined.

Abstract

A method comprises determining properties of a first group of music tracks, e.g. by a first artist, and a second group of music tracks, e.g. by a second artist, based on track level attributes, determining a similarity between the first and second groups based at least in part on determined properties of the tracks, selecting one or more tracks from the second group based at least in part on said similarity, and outputting a list of said selected tracks. The similarity may include group level, track level, and combined group and track level similarities. The track level attributes may be acoustic features extracted from the tracks, tags, metadata or other data, such as keywords extracted from reviews of the tracks. The method may include ranking and/or revising the list based on one or more of user preferences, a user history and/or whether a user plays or skips the selected tracks.

Description

Similarity determination and selection of music
Field
This disclosure relates to determining similarity and similarity-based selection of music tracks. In particular, this disclosure relates to assessing and selecting music tracks from a database based on acoustic similarities.
Background
Audio content databases, streaming services, online stores and media player software applications often include genre classifications, to allow a user to search for tracks to play, stream and/or download.
Some databases, services, stores and applications also include a facility for recommending music tracks to a user based on a history of music that they have accessed in conjunction with other data, such as rankings of tracks or artists from the user, history data from other users who have accessed the same or similar tracks in the user's history or otherwise have similar user profiles, metadata assigned to the tracks by experts and/or users, and so on.
Summary
According to an aspect, an apparatus includes a controller and a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks. The input information may, for example, indicate a name of a music track, an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style. For example, where the input information indicates an artist, the first group of music tracks may contain music tracks by that artist and, optionally, the second group of music tracks may contain music tracks by one or more second artists. If the second group of music tracks contains music tracks by one second artist, then the group level similarity would indicate similarity between the first and second artists.
In some embodiments, the input information may indicate a particular music track, in which case the controller may obtain information regarding one or more of an album, an artist, a performer, a record label, a playlist, a producer, a musical genre, sub-genre or style from metadata associated with the particular music track or from extracting information from a local or remote database, and define the first group based on the obtained information.
The computer readable instructions, when executed by the controller, may further cause the controller to determine a similarity between a first one of said first plurality of music tracks and one of the second plurality of music tracks where said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
The track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks.
The properties may include at least one property based on a musical instrument and at least one property based on a musical genre. For example, the properties may include probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
The computer readable instructions, when executed by the controller, may further cause the controller to monitor a history of music tracks previously accessed by a user, revise said list of selected music tracks based on the properties of the selected music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and output said revised list.
The computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
The computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history. Alternatively, or additionally, the computer readable instructions, when executed by the controller, may further cause the controller to rank the selected music tracks in the list based, at least in part, on a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. Where the selected music tracks based at least in part on the user history, the ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user. The computer readable instructions, when executed by the controller, may cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability. For example, the controller may be caused to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a
probabilistic classifier. According to another aspect, a method includes receiving input information regarding at least one music track, determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, selecting one or more music tracks from the second group of music tracks based at least in part on said similarity and outputting a list of said selected music tracks.
Such a method may further include determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks, wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks. The track level attributes may include acoustic features extracted from said music tracks and/or at least one of: tags associated with at least some of said music tracks, metadata associated with said music tracks and keywords extracted from text associated with said music tracks. The method may also include monitoring a history of music tracks previously accessed by a user, revising the list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the previously accessed music tracks in the history were played or skipped and outputting said revised list.
The method may include ranking the selected music tracks in the list based on one or more of user preferences for the properties included in the received input, properties of previously accessed music tracks indicated in a user history and a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property. For example, such ranking may include adjusting similarities for the selected music tracks according to whether said previously accessed music track, or tracks, indicated in the user history was played or skipped by the user. In an example embodiment, determining said properties may comprise evaluating first level probabilities that a particular tag applies based on the extracted acoustic features and evaluating a second level probability that the particular tag applies based on the first level probability.
A computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform any of the above methods according to said aspect may also be provided.
According to yet another aspect, a non-transitory tangible computer program product in which is stored computer readable instructions that, when executed by a computer, cause the computer to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks.
According to a further aspect, an apparatus is configured to receive input information regarding one or more music tracks, determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes, determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks, select one or more music tracks from the second group of music tracks based at least in part on said similarity and output a list of said selected music tracks. According to a yet further aspect, an apparatus includes an interface to receive input information regarding one or more music tracks and to output a list of selected music tracks, an extractor to extract track level attributes associated with a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and a second plurality of music tracks belonging to a second group of music tracks and to determine properties of the first plurality of music tracks and the second plurality of music tracks based on track level attributes of said music tracks, a similarity determination module to determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks and a
recommendation engine to select the selected music tracks from the second group of music tracks based at least in part on said similarity.
Brief description of the drawings
Embodiments will now be described by way of non-limiting examples with reference to the accompanying drawings, of which:
Figure 1 is a schematic diagram of a system in which an embodiment may be included;
Figure 2 is a schematic diagram of components of an analysis server according to an embodiment, in the system of Figure 1;
Figure 3 is an overview of a method that may be performed by the analysis server of Figure 2;
Figure 4 is a flowchart of the method shown in overview in Figure 3;
Figure 5 depicts a user interface for use in the method of Figure 3;
Figure 6 is a flowchart of a method of extracting acoustic features from an input signal, for use in the method of Figure 4;
Figure 7 depicts an example of a blocked and windowed input signal;
Figure 8 depicts an example energy spectrum of a transformed input signal;
Figure 9 depicts a frequency response of an example filter bank for filtering the transformed input signal shown in Figure 8;
Figure 10 depicts an example mel-energy spectrum output from the filter bank represented in Figure 9;
Figure 11 is an overview of a process for obtaining multiple types of acoustic features in the method of Figure 4;
Figure 12 is an overview of a method of obtaining tag probabilities for use in the method of Figure 4;
Figure 13 is a flowchart of the method of Figure 12;
Figure 14 shows example distributions for instrument-based tag probabilities;
Figure 15 shows the example probability distributions of Figure 14 after logarithmic transformation;
Figure 16 depicts an example track feature vector generated by the method of Figure 13;
Figure 17 is an overview of a recommendation procedure that may be performed as a part of the method of Figure 4;
Figure 18 is an overview of another recommendation procedure that may be performed as a part of the method of Figure 4;
Figure 19 is an overview of yet another recommendation procedure that may be performed as a part of the method of Figure 4.
Detailed description
Embodiments described herein concern assessing features of music tracks, determining similarities between music tracks and selecting music tracks based on such similarities, for example, for recommendation to a user.
Referring to Figure 1, an analysis server 100 is shown connected to a network 102, which can be any data network such as a Local Area Network (LAN), Wide Area Network (WAN) or the Internet. The analysis server 100 is configured to receive and process requests for audio content from one or more terminals 104, 105 via the network 102.
In the present example, two terminals 104, 105 are shown, each incorporating media playback hardware and software, such as a speaker (not shown) and/or audio output jack (not shown) and a processor (not shown) that executes a media player software application to stream and/or download audio content over the network 102 and to play audio content through the speaker. As well as audio content, the terminals 104, 105 may be capable of streaming or downloading video content over the network 102 and presenting the video content using the speaker and a display 106. Suitable terminals 104, 105 will be familiar to persons skilled in the art. For instance, a smart phone could serve as a terminal 104, 105 in the context of this application, although a laptop, tablet or desktop computer may be used instead. Such terminals 104, 105 include music and video playback and data storage functionality and can be connected to the music analysis server 100 via a cellular network, Wi-fi, Bluetooth® or any other suitable connection such as a cable or wire. Optionally, the display 106 may be a touch screen display.
As shown in Figure 2, the analysis server 100 includes a controller 202, an input and output interface 204 configured to transmit and receive data via the network 102, a memory 206 and a mass storage device 208 for storing video and audio data. The controller 202 is connected to each of the other components in order to control operation thereof. The controller 202 may take any suitable form. For instance, it may be a processing arrangement that includes a microcontroller, plural microcontrollers, a processor, or plural processors. The memory 206 and mass storage device 208 may be in the form of a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 206 stores, amongst other things, an operating system 210 and at least one software application 212 to be executed by the controller 202. Random Access Memory (RAM) 214 is used by the controller 202 for the temporary storage of data.
The operating system 210 may contain code which, when executed by the controller 202 in conjunction with the RAM 214, controls operation of analysis server 100 and provides an environment in which the or each software application 212 can run.
Software application 212 is configured to control and perform audio and video information processing by the controller 202 of the analysis server 100 to determine similarities between music tracks and, optionally, to generate music recommendations. The operation of this software application 212 according to a first embodiment will now be described in detail, with reference to Figures 3 to 4. In the following, the accessed music tracks are referred to as input signals.
Figure 3 is an overview of a procedure for recommending music tracks to the user of the terminal 104, in which the controller 202 acts as an extractor 30 to extract track level attributes of a music track, similarity assessment module 31 and recommendation engine 32. The basis for the recommendation procedure may be information provided by a user of the terminal 104 via the network 102. For example, the user may indicate an artist and/or a track so that similar artists and/or tracks can be identified. Alternatively, or additionally, the user may provide other information on which the recommendation procedure may be based. For example, the user may indicate one or more of an album, a performer, such as a particular musician, a producer, a record label, a playlist, a musical genre, sub- genre or style, and so on.
Alternatively, or additionally, information regarding music tracks accessed by a user, such as the tracks that have been accessed the greatest number of times, the artist with the greatest number of tracks in a library in the terminal 104, recently accessed tracks or recently purchased tracks, may be used to identify an artist and, optionally, a track, referred to as track 1, to use as a basis for generating
recommendations. If only a name for track 1 is indicated, an artist may be identified from metadata of track 1. If only an artist is indicated, track 1 may be a track selected automatically, such as the track by that artist accessed the most times by the user, or a most popular track by that artist as indicated by a remote database, such as a streaming database, rankings for tracks in a digital music store or information obtained from social media.
A first group of music tracks, group 1, is defined based on the information. Where the user has input artist or performer information, or an artist or performer has been identified from other information input by the user or obtained from the user history, the first group may contain multiple tracks by that artist or performer. In another example, if the user has input information identifying an album or record label, the first group may include music tracks from that album or record label.
As noted above, in other examples, information that may be used as that basis can include one or more of an artist, album, a performer, a producer, a record label, a playlist, a musical genre, sub-genre or style, and so on, by defining the first group of music tracks, group 1, according to the basis provided.
Attributes 33a to 33c for a first music track 1 of the first group, group 1, and one or more further tracks 2...m of group 1 are obtained from the data stored in video/audio storage 208 or a remote database, or obtained from social media information, other websites and so on. The similarity assessment module 31 defines a combined vector 34 for the first group, group 1, based on some or all of the attributes 33a to 33c obtained for tracks 1...m.
Similarly, attributes 35a to 35c are obtained for a plurality of tracks 1...n of the second group, group 2, from which the recommendations are to be drawn, and a combined vector 36 for group 2 is defined. For example, where the recommendation is to be based on a first artist, album or performer, the second group may contain multiple tracks by a second artist or performer. The second artist may be selected automatically, based on an analysis of attributes 33a to 33c obtained for track 1 of group 1 and/or information from streaming databases, rankings in digital music stores, social media information and so on. For example, such databases may indicate that users who listened to the first artist often listen to certain other artists and one of those other artists may be selected as the second artist and the second group, group 2, defined to include multiple tracks by that second artist.
The similarity assessment module 31 then determines a group level similarity 37, based on the combined vectors 34, 36 for group 1 and group 2, based on the plurality of tracks 1...m of group 1 and the plurality of tracks 1...n of group 2. The similarity assessment module 31 may also determine one or more track level similarities 38, each based on a vector 39, 40 combining attributes 33a of an individual track of group 1 and the attributes 35a of an individual track of group 2 respectively. A combined group and track similarity 41 may also be computed based on the group level similarity 37 and the track level similarity 38.
One or more of the group level similarity 37 and the combined group and track similarity 41 are input to the recommendation engine 32. The recommendation engine may then select music tracks from the video/audio storage 208 or another database, for example, a remote database accessed via the network 102 or other network, as recommendations 42 of music tracks that the user of the terminal 104 might enjoy, based on the input similarities 37, 41 and, optionally, further input from the user of the terminal 104. The recommendations 42 may be output via the I/O interface 204 and transmitted to the terminal 104 for presentation on the display 106.
Figure 4 is a flowchart showing further detail of the method described above in relation to Figure 3.
Beginning at step S4.0, a basis for generating the recommendations 42 is obtained (step S4.1) . In this example the basis may be provided by the user of the terminal 104. Figure 5 depicts a user interface 50, that may be presented by the display 106, through which the user can provide the basis. The user interface 50 includes fields 51, 52 in which a user can indicate a name for a first artist, and/or a name of a first music track by the first artist, track 1. As noted above, where a user indicates only one of the first artist and the first music track by the first artist, additional artist and track information may be obtained to supplement the user input as basis for the recommendations 42. Optionally, one or more sliders 53, 54 may be provided to allow the user to indicate preferences for the type of music tracks to be recommended. In this example, sliders 53 are provided for indicating instrument-based preferences and sliders 54 are provided for indicating music genre-based preferences. While Figure 5 depicts sliders 53, 54, in other embodiments, alternative input techniques for obtaining user preferences may be used, such as numerical values indicating relative importance or rankings for the preferences or input arranging the preferences in order of importance to the user.
In steps S4.2 and S4.3, first and second groups of music tracks are defined according to the basis obtained in step S4.1. Examples of ways in which group 1 and group 2 may be defined are discussed above in relation to Figure 3. In one example, where the basis includes information identifying an artist, group 1 may contain one or more music tracks by that artist, while group 2 may contain one or more music tracks by a second artist.
Next, in step S4.4, the controller 202 obtains attributes of a plurality of tracks 1...m of group 1. Such attributes may be obtained from metadata associated with the plurality of tracks 1...m indicating, for example, genre of musical tracks or type of artist, obtained from the data storage 208, or a remote database, or information from a streaming service or digital music store. Additionally, or alternatively, attributes may be obtained by analysing text in social media pages or other webpages. For example, where group 1 is a collection of music tracks by a first artist, an analysis of text on a website for the first artist or a label on which the first artist's music is released and/or reviews of the first artist's music on websites and/or blog pages may be performed and keywords extracted from that text.
Another option, which may be combined with one or both of such metadata and such keywords, is to extract acoustic features from audio data of tracks 1...m of group 1.
Figure 4 shows attributes 33a to 33c being obtained for individual tracks 1...m of group 1 one by one and used to form a vector 34 for group 1, before attributes 35a to 35c for individual tracks 1...n of group 2 are obtained in turn, in steps S4.4 to S4.8. This sequence may be used, for example, where group 2 is selected based on results from the analysis of the tracks of group 1. However, in other embodiments, attributes 33a to 33c, 35a to 35c for the tracks from groups 1 and 2 may be obtained in a different order than shown in Figure 4, or in parallel. In particular, it is not necessary to complete the obtaining of attributes 33a to 33c for the tracks 1...m from group 1 and/or create the vector 34 for group 1 before proceeding to obtain attributes 35a to 35c for tracks 1...n of group 2. Nor is it necessary to obtain the attributes of a particular track before beginning a process for obtaining attributes for another track.
In the following description, the attributes 33a to 33c, 35a to 35c are acoustic features extracted from audio data of tracks 1...m of group 1 and tracks 1...n of group 2, in the form of probabilities 116 that the tracks include a particular instrument or belong to a particular music genre. However, as noted above, the attributes may include one or more of metadata obtained from a database 208, streaming service or digital music store, keywords extracted from text relating to track 1 of group 1 and other audio features of the tracks, as well as, or instead of, such tag probabilities 116.
An example procedure for extracting acoustic features and obtaining tag
probabilities 116 at step S4.4 will now be described with reference to Figure 6.
Starting at step S6.0, if an input signal for track 1 of group 1 is in a compressed format, such as MPEG-1 Audio Layer 3 (MP3), Advanced Audio Coding (AAC) and so on, the input signal is decoded into pulse code modulation (PCM) data (step S6.1). In this particular example, the samples for decoding are taken at a rate of 44.1 kHz and have a resolution of 16 bits. The controller 202 may, optionally, resample the decoded input signal at a lower rate, such as 22050 Hz (step S6.2). An optional "pre-emphasis" process is shown as step S6.3. Since audio signals conveying music tend to have a large proportion of their energy at low frequencies, the pre-emphasis process filters the decoded input signal to flatten the spectrum of the decoded input signal. The relatively low sensitivity of the human ear to low frequency sounds may be modelled by such flattening. One example of a suitable filter for this purpose is a first-order Finite Impulse Response (FIR) filter with a transfer function of 1 − 0.98z⁻¹.
At step S6.4, the controller 202 blocks the input signal into frames. The frames may include, for example, 1024 or 2048 samples of the input signal, and the subsequent frames may be overlapping or they may be adjacent to each other according to a hop-size of, for example, 50% and 0%, respectively. In other examples, the frames may be non-adjacent so that only part of the input signal is formed into frames. Figure 7 depicts an example in which an input signal 70 is divided into blocks to produce adjacent frames of about 30 ms in length which overlap one another by 25%. However, frames of other lengths and/or overlaps may be used.
A Hamming window, such as windows 72a to 72d, is applied to the frames at step S6.5, to reduce windowing artifacts. An enlarged portion in Figure 7 depicts a frame 74 following the application of a window to the input signal 70.
At step S6.6, a Fast Fourier Transform (FFT) is applied to the windowed signal 74 to produce a magnitude spectrum of the input signal. An example FFT spectrum is shown in Figure 8. Optionally, the FFT magnitudes may be squared to obtain a power spectrum of the signal for use in place of the magnitude spectrum in the following steps.
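Steps S6.4 to S6.6 can be sketched in a few lines of NumPy; the frame length, hop size and use of a real FFT are illustrative choices, not values mandated by the text.

```python
# Hypothetical sketch of blocking, Hamming windowing and FFT magnitude computation.
import numpy as np

def frame_magnitude_spectra(signal, frame_length=2048, hop=1024):
    """Return an (n_frames, frame_length // 2 + 1) array of magnitude spectra."""
    window = np.hamming(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_length] * window
        spectra.append(np.abs(np.fft.rfft(frame)))    # magnitude spectrum per frame
    return np.array(spectra)
```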
The spectrum produced by the FFT at step S6.6 may have a greater frequency resolution at high frequencies than is necessary, since the human auditory system is capable of better frequency resolution at lower frequencies but is capable of lower frequency resolution at higher frequencies. So, at step S6.7, the spectrum is filtered to simulate non-linear frequency resolution of the human ear.
In this example, the filtering at step S6.7 is performed using a filter bank having channels of equal bandwidths on the mel-frequency scale. The mel-frequency scaling may be achieved by setting the channel centre frequencies equidistantly on a mel-frequency scale, given by Equation (1):
mel(f) = 2595 · log10(1 + f / 700)    (1)
where f is the frequency in Hertz.
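Equation (1) and its inverse (useful for placing the channel centre frequencies equidistantly on the mel scale) translate directly into code:

```python
import numpy as np

def hz_to_mel(f):
    """Equation (1): mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Equation (1), used when spacing filter centre frequencies
    evenly on the mel scale and converting them back to Hertz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```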
The output of each filtered channel is a sum of the FFT frequency bins belonging to that channel, weighted by a mel-scale frequency response. The weights for filters in an example filter bank are shown in Figure 9. In Figure 9, 40 triangular-shaped bandpass filters are depicted whose center frequencies are evenly spaced on a perceptually motivated mel-frequency scale. The filters may span frequencies from 30 Hz to 11025 Hz, in the case of the input signal having a sampling rate of 22050 Hz. For the sake of example, the filter heights in Figure 9 have been scaled to unity.
Variations may be made in the filter bank in other embodiments, such as spanning the band center frequencies linearly below 1000 Hz, scaling the filters such that they have unit area instead of unity height, varying the number of frequency bands, or changing the range of frequencies spanned by the filters.
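A minimal sketch of one possible triangular mel filter bank construction is given below, assuming 40 bands spanning 30 Hz to 11025 Hz at a 22050 Hz sampling rate and a 1024-point FFT; the bin mapping used here is one common convention, not necessarily that of the embodiment.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)          # Equation (1)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(n_filters=40, n_fft=1024, sr=22050, f_lo=30.0, f_hi=11025.0):
        """Triangular filters with centre frequencies equidistant on the mel scale,
        scaled to unit height (cf. Figure 9)."""
        mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
        bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for j in range(1, n_filters + 1):
            lo, c, hi = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
            fbank[j - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising slope
            fbank[j - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling slope
        return fbank

    # Mel-band energies m_j: weighted sums of the magnitude bins in each channel,
    # e.g. mel_energies = magnitude_spectra @ mel_filter_bank().T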
The weighted sum of the magnitudes from each of the filter bank channels may be referred to as mel-band energies m_j, where j = 1...N, N being the number of filters.
In step S6.8, a logarithm, such as a logarithm of base 10, may be taken from the mel-band energies m_j, producing log mel-band energies m̃_j. An example of a log mel-band energy spectrum is shown in Figure 10.
Next, at step S6.9, the MFCCs are obtained. In this particular example, a Discrete Cosine Transform is applied to a vector of the log mel-band energies m̃_j to obtain the MFCCs according to Equation (2),

c_mel(i) = Σ_{j=1}^{N} m̃_j · cos( (π · i / N) · (j − 0.5) )    (2)

where N is the number of filters, i = 0, ..., I and I is the number of MFCCs. In an exemplary embodiment, I = 20.
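A compact sketch of steps S6.8 and S6.9, taking the logarithm of the mel-band energies and applying the DCT of Equation (2), might read as follows; the floor value guarding the logarithm is an illustrative assumption.

    import numpy as np

    def mfcc_from_mel_energies(mel_energies, n_mfcc=20):
        """Log-compress the mel-band energies and apply a DCT (Equation (2)) to
        obtain the MFCCs c_mel(i), i = 0..I, for each frame."""
        log_mel = np.log10(np.maximum(mel_energies, 1e-10))        # log mel-band energies
        n_bands = log_mel.shape[1]
        j = np.arange(n_bands) + 0.5                                # j - 1/2, j = 1..N
        i = np.arange(n_mfcc + 1)[:, None]                          # i = 0..I
        dct_basis = np.cos(np.pi * i / n_bands * j)
        return log_mel @ dct_basis.T                                # frames x (I + 1)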
At step S6.10, further mathematical operations may be performed on the MFCCs produced at step S6.9, such as calculating a mean of the MFCCs and/or time derivatives of the MFCCs, to produce the required acoustic features 33a on which the calculation of the tag probabilities 116 will be based (a brief sketch of these summary statistics follows the list below).
In this particular embodiment, the acoustic features produced at step S6.10 include one or more of:
- a MFCC matrix for the music track;
- first and, optionally, second time derivatives of the MFCCs, also referred to as "delta MFCCs";
- a mean of the MFCCs of the music track;
- a covariance matrix of the MFCCs of the music track;
- an average of mel-band energies over the music track, based on output from the channels of the filter bank obtained in step S6.7;
- a standard deviation of the mel-band energies over the music track;
- an average logarithmic energy over the frames of the music track, obtained as an average of c_mel(0) over a period of time, for example, using Equation (2); and
- a standard deviation of the logarithmic energy.
The extracted features are then output (step S6.11). As noted above, the features output at step S6.11 may also include a fluctuation pattern and danceability features for the track, such as:
- a median fluctuation pattern over the song;
- a fluctuation pattern bass feature;
- a fluctuation pattern gravity feature;
- a fluctuation pattern focus feature;
- a fluctuation pattern maximum feature;
- a fluctuation pattern sum feature;
- a fluctuation pattern aggressiveness feature;
- a fluctuation pattern low-frequency domination feature;
- a danceability feature (detrended fluctuation analysis exponent for at least one predetermined time scale); and
- a club-likeness value.
The mel-band energies calculated in step S6.8 may be used to calculate one or more of the fluctuation pattern features listed above in step S6.10. In an example method of fluctuation pattern analysis, a sequence of logarithmic domain mel-band magnitude frames is arranged into segments of a desired temporal duration and the number of frequency bands is reduced. A FFT is applied over coefficients of each of the frequency bands across the frames of a segment to compute amplitude modulation frequencies of loudness in a desired range, for example, in a range of 1 to 10 Hz. The amplitude modulation frequencies may be weighted and smoothing filters applied. The results of the fluctuation pattern analysis for each segment may take the form of a matrix with rows corresponding to modulation frequencies and columns corresponding to the reduced frequency bands and/or a vector based on those parameters for the segment. The vectors for multiple segments may be averaged to generate a fluctuation pattern vector to describe the music track.

Danceability features and club-likeness values are related to beat strength, which may be loosely defined as a rhythmic characteristic that allows discrimination between pieces of music, or segments thereof, having the same tempo. Briefly, a piece of music characterised by a higher beat strength would be assumed to exhibit perceptually stronger and more pronounced beats than another piece of music having a lower beat strength. As noted above, a danceability feature may be obtained at step S6.10 by detrended fluctuation analysis, which indicates correlations across different time scales, based on the mel-band energies obtained at step S6.8. Examples of techniques of club-likeness analysis, fluctuation pattern analysis and detrended fluctuation analysis are disclosed in British patent application no. 1401626.5, as well as example methods for extracting MFCCs. The disclosure of GB 1401626.5 is incorporated herein by reference in its entirety.

The features obtained at step S6.10 may include features relating to tempo in beats per minute (BPM), such as:
- an average of an accent signal in a low, or lowest, frequency band;
" a standard deviation of said accent signal;
" a maximum value of a median or mean of periodicity vectors;
" a sum of values of the median or mean of the periodicity vectors;
" tempo indicator for indicating whether a tempo identified for the input signal is considered constant, or essentially constant, or is considered non-constant, or ambiguous;
" a first BPM estimate and its confidence;
" a second BPM estimate and its confidence;
" a tracked BPM estimate over the music track and its variation;
" a BPM estimate from a lightweight tempo estimator.
Example techniques for beat tracking, using accent information, are disclosed in US published patent application no. 2007/240558 A1, US patent application no. 14/302,057, and International (PCT) published patent application nos. WO2013/164661 A1 and WO2014/001849 A1, the disclosures of which are hereby incorporated by reference in their entireties.
In one example beat tracking method, described in GB 1401626.5, one or more accent signals are derived from the input signal 70, for detection of events and/or changes in the music track. A first one of the accent signals may be a chroma accent signal based on fundamental frequency F0 salience estimation, while a second one of the accent signals may be based on a multi-rate filter bank
decomposition of the input signal 70.
A BPM estimate may be obtained based on a periodicity analysis for extraction of a sequence of periodicity vectors on the basis of the accent signals, where each periodicity vector includes a plurality of periodicity values, each periodicity value describing the strength of periodicity for a respective period length, or "lag". A point-wise mean or median of the periodicity vectors over time may be used to indicate a single representative periodicity vector over a time period of the music track. For example, the time period may be over the whole duration of the music track. Then, an analysis can be performed on the periodicity vector to determine a most likely tempo for the music track. One example approach comprises
performing k-nearest neighbours regression to determine the tempo. In this case, the system is trained with representative music tracks with known tempo. The k-nearest neighbours regression is then used to predict the tempo value of the music track based on the tempi of the k nearest representative tracks. More details of such an approach are described in Eronen, Klapuri, "Music Tempo Estimation With k-NN Regression", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, Issue 1, pages 50-57, the disclosure of which is incorporated herein by reference.
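A minimal sketch of k-NN regression over representative periodicity vectors is given below, assuming a training set of such vectors with known tempi; the value of k and the distance metric are illustrative.

    import numpy as np

    def knn_tempo_estimate(periodicity_vec, train_vecs, train_tempi, k=5):
        """Predict the tempo of a track from its representative periodicity vector
        using k-nearest-neighbour regression over tracks with known tempo."""
        dists = np.linalg.norm(np.asarray(train_vecs) - periodicity_vec, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(np.mean(np.asarray(train_tempi)[nearest]))   # e.g. the mean of the k tempi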
Chorus related features that may be obtained at step S6.10 include:
- a chorus start time; and
- a chorus end time.
Example systems and methods that can be used to detect chorus related features are disclosed in US 2008/236371 A1, the disclosure of which is hereby incorporated by reference in its entirety.
Other features that may be obtained include:
- a duration of the music track in seconds;
- an A-weighted sound pressure level (SPL);
- a standard deviation of the SPL;
- an average brightness, or spectral centroid (SC), of the music track, calculated as a spectral balancing point of a windowed FFT signal magnitude in frames of, for example, 40 ms in length (a brief computation sketch follows the overview below);
- a standard deviation of the brightness;
- an average low frequency ratio (LFR), calculated as a ratio of energy of the input signal below 100 Hz to total energy of the input signal, using a windowed FFT signal magnitude in 40 ms frames; and
- a standard deviation of the low frequency ratio.

Figure 11 is an overview of a process of extracting multiple acoustic features, some or all of which may be obtained in steps S6.9 and S6.10. Figure 11 shows how some input features are derived, at least in part, from computations of other input features. The features shown in Figure 11 include the MFCCs, delta MFCCs and mel-band energies discussed above in relation to Figure 6, indicated using bold text and solid lines. The dashed lines and standard text indicate other features that may be extracted and made available alongside, or instead of, the MFCCs, delta MFCCs and mel-band energies, for use in calculating the tag probabilities 116. For example, as discussed above, the mel-band energies may be used to calculate fluctuation pattern features and/or danceability features through detrended fluctuation analysis. Results of fluctuation pattern analysis and detrended fluctuation analysis may then be used to obtain a club-likeness value. Also as noted above, beat tracking features, labeled as "beat tracking 2" in Figure 11, may be calculated based, in part, on a chroma accent signal from an F0 salience estimation.
Returning to Figure 6, at step S6.12, tag probabilities 116 and an overall track vector 39 for track 1 of group 1 are evaluated. An overview of an example method for obtaining tag probabilities 116 and creating a track vector 39 is shown in Figure 12.
The acoustic features 110 for track 1 of group 1 produced in steps S6.9 and S6.10 are input to first level classifiers 111 to generate first level probabilities for the music track. In this example, the first level classifiers 111 include first classifiers 112 and second classifiers 113 to generate first and second probabilities respectively, the second classifiers 113 being different from the first classifiers 112. In the
embodiments to be described below, the first classifiers 112 are non-probabilistic classifiers, while the second classifiers 113 are probabilistic classifiers.
In this embodiment, the first and second classifiers 112, 113 compute first level probabilities that the music tracks include particular instruments and/or belong to particular musical genres. Optionally, probabilities based on other acoustic similarities may be included as will be noted hereinbelow.
The first level probabilities are input to at least one second level classifier 114. In this embodiment, the second level classifier 114 includes a third classifier 115, which may be a non-probabilistic classifier. The third classifier 115 generates the tag probabilities 116 based, at least in part, on the first probabilities output by the first classifiers 112 and the second probabilities output by the second classifiers 113.
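Schematically, the two-level arrangement described above might be sketched as below, assuming scikit-learn-style second level classifiers exposing predict_proba; the stacking of the first level outputs into a single feature vector is the essential point.

    import numpy as np

    def second_level_input(first_svm_probs, gmm_probs):
        """Stack the first level outputs (from the SVM-based first classifiers 112
        and the GMM-based second classifiers 113) into one feature vector."""
        return np.concatenate([first_svm_probs, gmm_probs])

    def predict_tag_probabilities(second_level_classifiers, first_svm_probs, gmm_probs):
        """One second level classifier per tag; each outputs the probability that
        its tag (instrument or genre) applies to the track."""
        x = second_level_input(first_svm_probs, gmm_probs).reshape(1, -1)
        return {tag: clf.predict_proba(x)[0, 1] for tag, clf in second_level_classifiers.items()}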
Figure 13 is a flowchart depicting the method of Figure 12 in more detail. In this particular example, the first and third classifiers 112, 115 are support vector machine (SVM) classifiers and the second classifiers 113 are based on Gaussian Mixture Models (GMM).
Starting at step S13.0, one, some or all of the extracted features 110 or descriptors obtained in steps S6.9 and S6.10 are selected to be used as input to the first classifiers 112 (step S13.1) and, optionally, normalised (step S13.2). For example, a look up table 216 or database may be stored in the memory 206 that provides, for each of the tag probabilities to be produced by the analysis server 100, a list of features to be used to generate each first classification and statistics, such as the mean and variance of the selected features, that can be used in normalisation of the extracted features 33a. In such an example, the controller 202 retrieves the list of features from the look up table 216, and accordingly selects and normalises the listed features for each of the first level probabilities to be generated. The normalisation statistics for each first level probability in the database may be determined during training of the first classifiers 112.
As noted above, in this example, the first classifiers 112 are SVM classifiers. The SVM classifiers are trained using a database of music tracks for which information regarding musical instruments and genre is already available. The database may include tens of thousands of tracks that provide examples for each particular musical instrument for which a tag probability 116 is to be evaluated.
Examples of musical instruments for which information may be provided in the database include:
- Accordion;
- Acoustic guitar;
- Backing vocals;
- Banjo;
- Bass guitar;
- Bass synthesizer;
- Brass instruments;
- Glockenspiel;
- Drums;
- Eggs;
- Electric guitar;
- Electric piano;
- Guitar synthesizer;
- Keyboards;
- Lead vocals;
- Organ;
- Percussion;
- Piano;
- Saxophone;
- Stringed instruments;
- Synthesizer; and
- Woodwind instruments.
The training database includes indications of genres that the music tracks belong to, as well as indications of genres that the music tracks do not belong to.
Examples of musical genres that may be indicated in the database include:
- Ambient and new age;
- Blues;
- Classical;
- Country and western;
- Dance;
- Easy listening;
- Electronica;
- Folk and roots;
- Indie and alternative;
- Jazz;
- Latin;
- Metal;
- Pop;
- Rap and hip hop;
- Reggae;
- Rock;
- Soul, R&B and funk; and
- World music.

By analysing acoustic features extracted from the music tracks in the training database, for which instruments and/or genre are known, a SVM classifier can be trained to determine whether or not a music track includes a particular instrument, for example, an electric guitar. Similarly, another SVM classifier can be trained to determine whether or not the music track belongs to a particular genre, such as Metal.
In this embodiment, the training database provides a highly imbalanced selection of music tracks, in that a set of tracks for training a given SVM classifier includes many more positive examples than negative ones. In other words, for training a SVM classifier to detect the presence of a particular instrument, the set of music tracks used for training will include significantly more tracks that include that instrument than tracks that do not. Similarly, in an example where a SVM classifier is being trained to determine whether a music track belongs to a particular genre, the set of music tracks for training might be selected so that the number of tracks that belong to that genre is significantly greater than the number of tracks that do not belong to that genre.
An error cost may be assigned to the different first level probabilities produced by the first classifiers 112 to take account of the imbalances in the training sets. For example, if a minority class of the training set for a particular first classification includes 400 songs and an associated majority class contains 10,000 tracks, an error cost of 1 may be assigned to the minority class and an error cost of 400/10,000 may be assigned to the majority class. This allows all of the training data to be retained, instead of downsampling the data of the negative examples.
New SVM classifiers can be added by collecting new training data and training the new classifiers. Since the SVM classifiers are binary, new classifiers can be added alongside existing classifiers.
As mentioned above, the training process can include determining a selection of one or more acoustic features to be used as input for particular first classifiers 112, and statistics for normalising those features. The number of features available for selection, M, may be much greater than the number of features selected for determining a particular first classification, N; that is, M >> N. The selection of features to be used is determined iteratively, based on a development set of music tracks for which the relevant instrument or genre information is available, as follows.
Firstly, to reduce redundancy, a check is made as to whether two or more of the features are so highly correlated that the inclusion of more than one of those features would not be beneficial. For example, if two features have a correlation coefficient that is larger than 0.9, then only one of those features is considered available for selection.
The feature selection training starts using an initial selection of features, such as the average MFCCs for music tracks in the development set or a single "best" feature for a given first classification. For instance, a feature that yields the largest classification accuracy when used individually may be selected as the "best" feature and used as the sole feature in an initial feature selection. An accuracy of the first classification based on the initial feature selection is determined. Further features are then added to the feature selection to determine whether or not the accuracy of the first classification is improved by their inclusion. Features to be tested for addition to the selection of features may be chosen using a method that combines forward feature selection and backward feature selection in a sequential floating feature selection. Such feature selection may be performed during the training stage, by evaluating the classification accuracy on a portion of the training set.
In each iteration, each of the features available for selection is added to the existing feature selection in turn, and the accuracy of the SVM classifier with each
additional feature is determined. The feature selection is then updated to include the feature that, when added to the feature selection, provided the largest increase in accuracy for the development set. After a feature is added to the feature selection, the accuracy of the SVM classifier is reassessed, by generating probabilities based on edited feature selections in which each of the features in the feature selection is omitted in turn. If it is found that the omission of one or more features provides an improvement in the accuracy of a generated probability, then the feature that, when omitted, leads to the biggest improvement in accuracy is removed from the feature selection. If no
improvements are found when any of the existing features are left out, but the accuracy does not change when a particular feature is omitted, that feature may also be removed from the feature selection in order to reduce redundancy.
The iterative process of adding and removing features to and from the feature selection continues until the addition of a further feature no longer provides a significant improvement in the accuracy of the SVM classifier. For example, if the improvement in accuracy falls below a given percentage, the iterative process may be considered complete, and the current selection of features is stored in the lookup table 216, for use in selecting features in step S13.1. The normalisation of the selected features 110 at step S13.2 is optional. Where provided, the normalization of the selected features 110 at step S13.2 may
potentially improve the accuracy of the first classifiers 112.
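Referring back to the iterative selection described above, a compact sketch of a sequential floating selection loop is given below, assuming an evaluate_accuracy(features) callback that trains and scores a classifier on the development set; the callback and the stopping threshold are illustrative assumptions.

    def sequential_floating_selection(candidates, evaluate_accuracy, min_gain=0.001):
        """Greedy forward selection with backward pruning: add the feature giving the
        largest accuracy gain, then drop any feature whose removal does not hurt."""
        selected, best_acc = [], 0.0
        while True:
            # Forward step: try adding each remaining candidate in turn.
            gains = {f: evaluate_accuracy(selected + [f]) for f in candidates if f not in selected}
            if not gains:
                break
            best_f = max(gains, key=gains.get)
            if gains[best_f] - best_acc < min_gain:
                break                                    # no significant improvement: stop
            selected.append(best_f)
            best_acc = gains[best_f]
            # Backward step: try leaving each selected feature out in turn.
            while len(selected) > 1:
                drops = {f: evaluate_accuracy([g for g in selected if g != f]) for f in selected}
                worst_f = max(drops, key=drops.get)
                if drops[worst_f] >= best_acc:           # removal does not reduce accuracy
                    selected.remove(worst_f)
                    best_acc = drops[worst_f]
                else:
                    break
        return selected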
In another embodiment, at step S13.1, a linear feature transform may be applied to the available features 110 obtained in steps S6.9 and S6.10, instead of performing the feature selection procedure described above. For example, a Partial Least Squares Discriminant Analysis (PLS-DA) may be used to obtain a linear
combination of features for calculating a corresponding first classification. Instead of using the above iterative process to select N features from the set of M features, a linear feature transform is applied to an initial high-dimensional set of features to arrive at a smaller set of features which provides a good discrimination between classes. The initial set of features may include some or all of the available features, such as those shown in Figure 11, from which a reduced set of features can be selected based on the result of the transform.
The PLS-DA transform parameters may be optimized and stored in a training stage. During the training stage, the transform parameters and the dimensionality of the transform may be optimized for each tag or output classification, such as an indication of an instrument or a genre. More specifically, the training of the system parameters can be done in a cross-validation manner, for example, as five-fold cross-validation, where all the available data is divided into five non-overlapping sets. At each fold, one of the sets is held out for evaluation and the four remaining sets are used for training. Furthermore, the division of folds may be specific for each tag or classification.
For each fold and each tag or classification, the training set is split into 50%-50% inner training-test folds. Then, the PLS-DA transform may be trained on the inner training fold and the SVM classifier may be trained on the obtained dimensions. The accuracy of the SVM classifier using the transformed features may be evaluated on the inner test fold. It is noted that, when a feature vector (track) is tested, it is subjected to the same PLS-DA transform, the parameters of which were obtained during training. In this manner, an optimal dimensionality for the PLS-DA transform may be selected. For example, the dimensionality may be selected such that the area under the receiver operating characteristic (ROC) curve is maximized. In one example embodiment, an optimal dimensionality is selected among candidates between 5 and 40 dimensions. Finally, the PLS-DA transform is trained on the whole of the training set, using the optimal number of dimensions, and used in training the SVM classifier.

As an alternative to PLS-DA, other feature transforms such as Linear Discriminant Analysis (LDA), Principal Components Analysis (PCA), or Independent Component Analysis (ICA) could be used.
In the following, an example is discussed in which the selected features 110 on which the first classifications are based are the mean of the MFCCs of track 1 of group 1 and the covariance matrix of the MFCCs of track 1 of group 1, although in other examples alternative and/or additional features, such as the other features shown in Figure 11, may be used.
At step S13.3, the controller 202 defines a feature vector based on each set of selected features 110 or selected combination of features 110 for track 1 of group 1. The feature vectors may then be normalized to have a zero mean and a variance of 1, based on statistics determined and stored during the training process.
At step S13.4, the controller 202 generates one or more first probabilities that track 1 of group 1 has a certain characteristic, based on the feature vector or vectors. The first classifiers 112 are used to calculate respective probabilities for each feature vector defined in step S13.3. In this manner, the number of first classifiers 112 corresponds to the number of tag probabilities 116 to be predicted for the music track. In this particular example, a probability is generated by a respective first classifier 112 for each instrument tag probability and for each genre tag probability to be predicted for the music track, based on the mean MFCCs and the MFCC covariance matrix. In addition, a probability may be generated by the first classifiers 112 based on whether the music track is likely to be an instrumental track or a vocal track. Also, for vocal tracks, another probability may be generated by the first classifiers 112 based on whether the vocals are provided by a male or female vocalist. In other embodiments, the controller 202 may generate only one or some of these probabilities and/or calculate additional probabilities at step S13.4. The different classifications may be based on respective selections of features from the available features 110 selected in step S13.1.
The first classifiers 112 may use a radial basis function (RBF) kernel K, defined as:
K(u, v) = exp(−γ · ‖u − v‖²)    (3)

where the default γ parameter is the reciprocal of the number of features in the feature vector, u is the input feature vector and v is a support vector.
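For illustration, the kernel of Equation (3) with the default γ convention mentioned above might be written as follows.

    import numpy as np

    def rbf_kernel(u, v, gamma=None):
        """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2), Equation (3)."""
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        if gamma is None:
            gamma = 1.0 / u.shape[-1]      # default: reciprocal of the number of features
        return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))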
The output from the first classifiers 112 may be in the form of first classifications based on an optimal predicted probability threshold that separates a positive prediction from a negative prediction for a particular tag probability, based on the probabilities computed by the first classifiers 112. The setting of an optimal predicted probability threshold may be learned in the training procedure to be described later below. Where there is no imbalance in the data used to train the first classifiers 112, the optimal predicted probability threshold may be 0.5. However, where there is an imbalance between the number of tracks providing positive examples and the number of tracks providing negative examples in the training sets used to train the first classifiers 112, the threshold p_thr may be set to a prior probability of a minority class P_min in the first classification, using Equation (4) as follows:
p_thr = P_min = n_min / (n_min + n_maj)    (4)
where, in the set of n tracks used to train the SVM classifiers, n_min is the number of tracks in the minority class and n_maj is the number of tracks in the majority class. The prior probability P_min may be learned as part of the training of the SVM classifiers.
Probability distributions for examples of possible first classifications, based on an evaluation of a number n of tracks, are shown in Figure 14. The nine examples in Figure 14 suggest a correspondence between a prior probability for a given first classification and its probability distribution based on the n tracks. Such a correspondence is particularly marked where the SVM classifier was trained with an imbalanced training set of tracks. Consequently, the predicted probability thresholds for the different examples vary over a considerable range.
Optionally, a logarithmic transformation may be applied to the probabilities produced by the first classifiers 112 at step S13.4, so that the probabilities are on the same scale and the optimal predicted probability threshold may correspond to a predetermined value, such as 0.5.
Equations (5) to (8) below provide an example normalisation which adjusts the optimal predicted probability threshold to 0.5. Where the probability output by a SVM classifier is p and the prior probability P of a particular tag being applicable to a track is greater than 0.5, the normalised probability p_norm is given by:

p_norm = 1 − (1 − p)^L    (5)

where

L = log(0.5) / log(1 − P)    (6)
Meanwhile, where the prior probability P is less than or equal to 0.5, the normalised probability p_norm is given by:

p_norm = p^L′    (7)

where

L′ = log(0.5) / log(P)    (8)

Figure 15 depicts the example probability distributions of Figure 14 after a logarithmic transformation has been applied, on which optimal predicted probability thresholds of 0.5 are marked. The probabilities output by the first classifiers 112 correspond to a normalised probability p_norm that a respective one of the tags to be considered applies to track 1 of group 1. The first classifications may include probabilities p_inst1 that a particular instrument is included in the music track and probabilities p_gen1 that the music track belongs to a particular genre.
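A minimal sketch of this normalisation, as reconstructed in Equations (5) to (8) above, is given below; it assumes the prior probability P of each tag has been learned during training.

    import math

    def normalise_probability(p, prior):
        """Map an SVM output probability p so that the learned threshold `prior`
        corresponds to 0.5 (Equations (5) to (8) as reconstructed above)."""
        if prior > 0.5:
            L = math.log(0.5) / math.log(1.0 - prior)     # Equation (6)
            return 1.0 - (1.0 - p) ** L                   # Equation (5)
        L = math.log(0.5) / math.log(prior)               # Equation (8)
        return p ** L                                     # Equation (7)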
In steps S13.5 and S13.6, further first level probabilities are generated for the input signal by the second classifiers 113, based on the MFCCs and other parameters produced in step S4.4. Although Figure 13 shows steps S13.3 to S13.6 being performed in sequence, in another embodiment steps S13.5 and S13.6 may be performed before, or in parallel with, steps S13.3 and S13.4.
In this particular example, the acoustic features 110 of track 1 of group 1 on which the second classifications are based are the MFCC matrix and the first time derivatives of the MFCCs, and probabilities p_inst2, p_gen2 are generated for each instrument tag (step S13.5) and for each musical genre tag (step S13.6) to be predicted. Optionally, further probabilities may be generated based on whether the music track is likely to be an instrumental track or a vocal track and, for vocal tracks, another probability may be generated based on whether the vocals are provided by a male or female vocalist. In other embodiments, the controller 202 may generate only one or some of these second classifications and/or calculate additional second classifications at steps S13.5 and S13.6.
In this example, the second classifiers 113 compute probabilities p_inst2, p_gen2 using probabilistic models that have been trained to represent the distribution of features extracted from audio signals captured from each instrument or genre. Such training can be performed using an expectation maximisation algorithm that iteratively adjusts the model parameters to maximise the likelihood of the model for a particular instrument or genre generating features matching one or more input features in the captured audio signals for that instrument or genre. The parameters of the trained probabilistic models may be stored in a database, for example, in the database 208 of the analysis server, or in remote storage that is accessible to the analysis server 100 via a network, such as the network 102.

For each instrument or genre, likelihoods L_yes, L_no are evaluated that the respective probabilistic model could have generated the selected or transformed features from the input signal. In this embodiment, in steps S13.5 and S13.6, the instrument-based probabilities p_inst2 are produced by the second classifiers 113 using first and second Gaussian Mixture Models (GMMs), based on the MFCCs and their first time derivatives calculated in step S13.5. Meanwhile, the probabilities p_gen2 that the music track belongs to a particular musical genre are produced by the second classifiers 113 using third GMMs. However, the first and second GMMs used to compute the instrument-based probabilities p_inst2 may be trained and used slightly differently from the third GMMs used to compute the genre-based probabilities p_gen2, as will now be explained.
The first and second GMMs used in step S13.5 may have been trained with an Expectation Maximisation algorithm using a training set comprising examples which are known to include the instrument and examples which are known not to include the instrument. For each track in the training set, MFCC feature vectors and their corresponding first time derivatives are computed. The MFCC feature vectors for the examples in the training set that contain the instrument are used to train a first GMM for that instrument, while the MFCC feature vectors for the examples that do not contain the instrument are used to train a second GMM for that instrument. In this manner, for each instrument to be tagged, two GMMs are produced. The first GMM is for a track that includes the instrument and is used to evaluate the likelihood L_yes, while the second GMM is for a track that does not include the instrument and is used to evaluate the likelihood L_no. In this example, the first and second GMMs each contain 64 component Gaussians.
The first and second GMMs may then be refined by discriminative training, for example using maximum mutual information (MMI) criterion on a balanced training set where, for each instrument to be tagged, the number of example tracks that contain the instrument is equal to the number of example tracks that do not contain the instrument.
Returning to step S13.5, the two likelihoods L_yes, L_no are computed based on the first and second GMMs and the MFCCs for the music track. The first is the likelihood L_yes that the corresponding instrument is included in the music track, while the second is the likelihood L_no that the instrument is not included in the music track. The first and second likelihoods L_yes, L_no may be computed in a log domain, and then converted to a linear domain.
The first and second likelihoods L_yes, L_no are then mapped to a tag probability p_inst2 of the instrument being included in the track, as follows:

p_inst2 = L_yes / (L_yes + L_no)    (9)
As noted above, the third GMMs, used to compute genre-based probabilities p_gen2, are trained differently to the first and second GMMs. For each genre to be considered, a third GMM is trained based on MFCCs for a training set of tracks known to belong to that genre. One third GMM is produced for each genre to be considered. In this example, each third GMM includes 64 component Gaussians. In step S13.6, for each of the genres to be considered, a likelihood L_gen is computed for track 1 of group 1 belonging to that genre, based on the likelihood of each of the third GMMs being capable of outputting the MFCC feature vector of the music track. For example, to determine which of the eighteen genres in the list hereinabove might apply to the music track, eighteen likelihoods would be produced.
The genre likelihoods L_gen are then mapped to probabilities p_gen2, as follows:

p_gen2(k) = L_gen(k) / Σ_{j=1}^{m} L_gen(j)    (10)

where m is the number of genre tags to be considered.
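A sketch of the likelihood-to-probability mappings of Equations (9) and (10) is given below, assuming per-track log-likelihoods from the trained GMMs (for example, as returned by a scikit-learn GaussianMixture's score method); the max-subtraction is only for numerical stability and does not change the ratios.

    import numpy as np

    def instrument_probability(loglik_yes, loglik_no):
        """Equation (9): convert 'instrument present' / 'instrument absent'
        log-likelihoods into a tag probability."""
        m = max(loglik_yes, loglik_no)
        l_yes, l_no = np.exp(loglik_yes - m), np.exp(loglik_no - m)   # back to the linear domain
        return l_yes / (l_yes + l_no)

    def genre_probabilities(genre_logliks):
        """Equation (10): normalise the per-genre likelihoods so that they sum to one."""
        logliks = np.asarray(genre_logliks, dtype=float)
        liks = np.exp(logliks - logliks.max())
        return liks / liks.sum()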
In another embodiment, the first and second GMMs may be trained and used in the manner described above for the third GMMs. In yet further embodiments, the GMMs used for analysing genre may be trained and used in the same manner, using either of the techniques described in relation to the first, second and third GMMs above.

The first classifications p_inst1 and p_gen1 and the second classifications p_inst2 and p_gen2 for track 1 of group 1 are then normalized to have a mean of zero and a variance of 1 (step S13.7) and collected to form a feature vector for input to the one or more second level classifiers 114 (step S13.8). In this particular example, the second level classifiers 114 include third classifiers 115, as noted above, and the third classifiers 115 are non-probabilistic classifiers, such as SVM classifiers trained in a similar manner to that described above in relation to the first classifiers 112. At the training stage, the first classifiers 112 and the second classifiers 113 may be used to output probabilities p_inst1, p_gen1, p_inst2, p_gen2 for the training sets of example music tracks from the database. The outputs from the first and second classifiers 112, 113 are then used as input data to train the third classifiers 115.
The third classifiers 115 determine second level probabilities p_inst3 for whether track 1 of group 1 contains a particular instrument and/or second level probabilities p_gen3 for whether track 1 of group 1 belongs to a particular genre (step S13.9). In this example, where the third classifiers 115 are SVM classifiers, the second level probabilities p_inst3, p_gen3 are generated in a similar manner to the first level probabilities p_inst1, p_gen1 computed by the first classifiers 112. The second level probabilities p_inst3, p_gen3 are then log normalised (step S13.10), as described above in relation to the first level probabilities p_inst1, p_gen1 from the first classifiers 112, and output as the tag probabilities 116 at step S13.11.
Optionally, tags based on the tag probabilities 116 may be associated with the music track at step S13.11. For example, where the tag probabilities 116 exceed a probability threshold, such as 0.5 for normalised probabilities, tags corresponding to the relevant instruments and/or genres may be stored in an entry for the music track in the database 208.

The track vector 39 is then generated at step S13.12 from the tag probabilities 116 output at step S13.11 and normalised. An example of a track vector 39 is shown in Figure 16. In this particular example, the track vector 39 reflects non-zero probabilities for the music track being a rock song including lead and backing vocals, bass, drums, electric guitar, keyboard and percussion. Alternatively, or additionally, some or all of the first level and/or second level probabilities p_inst1, p_gen1, p_inst2, p_gen2, p_inst3, p_gen3 themselves and/or the features 110 extracted at steps S6.9 and S6.10 may be output for further analysis and/or storage. The tag probability calculation process ends at step S13.13.
Returning to Figure 4, steps S4.4 and S4.5, including the processes of Figures 6 and 13, are repeated to obtain attributes 33b, 33c for tracks 2 to m of group 1, until no further tracks of group 1 remain to be analysed (step S4.5). However, the repetition of step S13.12 to create track vectors 39 for tracks 2...m of group 1 is optional. In this manner, attributes 33a to 33c are obtained for tracks 1...m of group 1, while a track vector 39 may be generated for one, some or all of the tracks 1...m of group 1.

The combined vector 34 for group 1 is then created (step S4.6), based on the tag probabilities 116 generated at step S4.4 for tracks 1...m of group 1. For example, the combined vector 34 may be created by summing the tag probabilities 116 for all of the analysed tracks 1 to m of group 1 and, optionally, normalising the sum.

Next, for each of the tracks 1...n of group 2 to be analysed, the attributes 35a to 35c are obtained in turn (steps S4.7, S4.8) and, for at least one of the tracks 1...n of group 2, a track vector 40 is created, as described above in relation to steps S4.4 to S4.6 and Figures 6 and 13, until no further tracks of group 2 remain to be analysed (step S4.8). As discussed above in relation to the tracks 1...m of group 1, the creation of a track vector 40 may be performed at step S13.12 for one, some or all of the tracks 1...n of group 2.
A combined vector 36 for group 2 is then created (step S4.9), for example by summing the tag probabilities 116 for the tracks 1...n of group 2 and, optionally, normalising the sum.
The group level similarity 37 for the tracks of groups 1 and 2 is calculated by evaluating the similarity between the combined vectors 34, 36 for groups 1 and 2 (step S4.10). For example, if the combined vectors 34, 36 for artists 1 and 2 are denoted by a and b, their similarity sim(a, b) can be measured with a cosine similarity defined as shown by Equation (11):

sim(a, b) = (a · b) / (|a| × |b|)    (11)

In other embodiments, a different technique may be used to evaluate sim(a, b) instead of the cosine similarity shown in Equation (11). For example, one alternative technique may include using the Euclidean distance and taking its inverse to obtain the similarity sim(a, b). Another example technique for assessing similarity may use the Kullback-Leibler divergence.
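For illustration, the cosine similarity of Equation (11) applied to two combined tag-probability vectors might be computed as follows; the small constant guarding the division is an illustrative assumption.

    import numpy as np

    def cosine_similarity(a, b):
        """Equation (11): sim(a, b) = (a . b) / (|a| x |b|)."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))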
One or more track level similarities 38 are assessed at step S4.11, based on the similarity of the track vectors 39, 40. The similarity of the track vectors 39, 40 may be assessed using Equation (11) above.
A combined group and track similarity 41 may then be determined, for example, by summing the group level similarity 37 and the track level similarity 38 (step S4.12). In this particular example, the group level similarity 37, the track level similarity 38 and the combined group and track similarity 41 are normalised so that they have values in a range between 0 and 1.

Similarities between group 1 and one or more further groups of music tracks may be computed by repeating steps S4.7 to S4.12 for additional groups and generating respective group level similarities 37, track level similarities 38 and combined group and track similarities 41 for each additional group. For example, where group 1 contains tracks by a first artist and group 2 contains tracks by a second artist, further groups, such as groups 3 and 4, may be defined, containing tracks by a third artist and a fourth artist respectively, and so on.
In step S4.13, a list of recommendations 42 of tracks is compiled from the music tracks of group 2 and, where provided, any further groups of music tracks that have been analysed, based on one or both of the group level similarity 37 and, optionally, the combined group and track similarity 41. In one embodiment, a list of tracks exhibiting the highest combined group and track similarity 41 and/or other similarity 37, 38 may be compiled at step S4.13 and output to the user (step S4.14). Alternatively, the list of recommendations 42 may be ranked and/or revised as part of the compilation (step S4.15). Examples of compilation procedures that may be performed at step S4.15 will now be described, with reference to Figures 17, 18 and 19.
Figures 17 to 19 show example procedures for generating the list of
recommendations at step S4.15, in which other ranking techniques are employed.
Beginning with the example of Figure 17, which starts at step S17.0, a preliminary list of candidate tracks from group 2 and any further groups of music tracks is compiled (step S17.1), based on one or more of the similarities calculated in steps S4.10 to S4.12.
In the example method of Figure 17, the list of preliminary candidates is revised based on user preferences input by the user, for example, by using the sliders 53, 54 in the user interface 50 shown in Figure 5. For example, using the sliders 53, 54, the user may have indicated that they would like to receive recommendations of tracks that are jazzier and include more piano than a particular track indicated in field 52 of the user interface 50, corresponding to track 1 of group 1, but include fewer stringed instruments.

Where the user has provided input indicating a preference (step S17.2), the tag probability 116 corresponding to a first property indicated by the user is identified (step S17.3) and the relevant tag probabilities for the candidate tracks are retrieved or otherwise obtained (step S17.4) and adjusted as follows. If the user input indicated a positive contribution for the property (step S17.5), such as "more jazz", the tag probabilities 116 for a genre of "jazz" for the candidate tracks are added to the values calculated for one or more of the similarities between track 1 of group 1 and the candidate tracks. If the user input indicated a negative contribution for the property (step S17.5), such as "less strings", the tag probabilities 116 for stringed instruments are subtracted from the similarity values for the candidate tracks (step S17.7). If another preference has been indicated by the user (step S17.8), then steps S17.3 to S17.7 are repeated for the next preference, until the similarity values have been adjusted for all the received user preferences (step S17.8).

The candidate tracks are then ranked based on their adjusted similarities (step S17.9), completing the procedure (step S17.10). The list of recommendations 42 output at step S4.14 may be based on a selected subset, or on all, of the candidate tracks in the ranked list of candidate tracks. For example, a predetermined number of the highest ranked candidate tracks may be selected for inclusion in the list of recommendations 42.
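A minimal sketch of the preference adjustment of Figure 17 is given below, assuming each candidate track carries a dictionary of tag probabilities and a baseline similarity score; the data layout and the unit weighting of each preference are illustrative assumptions.

    def adjust_for_preferences(candidates, preferences):
        """Add or subtract the relevant tag probability for each user preference,
        then rank the candidates by the adjusted similarity (cf. steps S17.3 to S17.9).

        candidates:  list of dicts, e.g. {"id": ..., "similarity": 0.8, "tags": {"jazz": 0.6, ...}}
        preferences: dict mapping a tag name to +1 ("more") or -1 ("less")
        """
        for track in candidates:
            adjusted = track["similarity"]
            for tag, sign in preferences.items():
                adjusted += sign * track["tags"].get(tag, 0.0)
            track["adjusted_similarity"] = adjusted
        return sorted(candidates, key=lambda t: t["adjusted_similarity"], reverse=True)

    # Example: "more jazz", "more piano", "less strings"
    # ranked = adjust_for_preferences(candidates, {"jazz": +1, "piano": +1, "strings": -1})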
In another example method, shown in Figure 18, beginning at step S18.0, a preliminary list of candidate tracks from group 2 is obtained (step S18.1), for example by compiling the list based on one or more of the similarities calculated in steps S4.10 to S4.12, as described above in relation to Figure 17. In this example, the candidate tracks in the preliminary list are ranked based on user preferences as indicated by the user's listening history, as will now be explained.
A user history is obtained in step S18.2. The user history may be based on the number of times the user has previously accessed music tracks stored on the terminal 104, in another database or via a streaming service, on tracks ranked by the user on social media or in an online music database, and/or on tracks purchased by the user from a digital music store.
Next, the controller 202 obtains a tag indicating a user preference from the tracks in the history (step S18.3). For example, the controller 202 may determine the most common tag for the previously accessed tracks shown in the user history. The corresponding tag probabilities 116 for the candidate tracks from group 2 are then retrieved (step S18.4).
If the obtained tag seems to be viewed positively by the user (step S18.5), for example, if the tag occurs most often in previously accessed tracks that were played, downloaded or purchased by the user, then the tag probabilities 116 for the candidate tracks are added to one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.6).
If the obtained tag seems to indicate a negative user preference, for example if the tag occurs most often in previously accessed tracks that were skipped by the user (step S18.5), then the tag probabilities 116 for the candidate tracks are subtracted from one or more of the similarities calculated in steps S4.10 to S4.12 (step S18.7). Steps S18.3 to S18.7 may be repeated for further tags, if required (step S18.8).
The candidate tracks are then ranked based on their adjusted similarities (step S18.9), completing the procedure (step S18.10). As noted above in relation to Figure 17, the list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks produced at step S18.9.

In yet another example method, shown in Figure 19, starting at step S19.0, a list of preliminary candidates is obtained (step S19.1), as discussed above in relation to Figure 17. In this example method, the preliminary candidates are ranked based on properties other than those on which the tag probabilities 116 are based. In step S19.2, the controller 202 determines whether the user history includes a previously accessed music track that has a tag that was not included in track 1 of group 1. For example, a previously accessed music track in the user history may include instruments that were not included in track 1 of group 1, or belong to a different genre from track 1 of group 1. In the following, such a tag is referred to as a "new tag".
The tag probabilities 116 for the new tag are retrieved for the candidate tracks (step S19.3). If the user history indicates that the user listened to the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are added to their respective similarities (step S19.5).
If the user history indicates that the user skipped the previously accessed music track with the new tag (step S19.4), then the tag probabilities 116 for the new tag in the candidate tracks are subtracted from their respective similarities (step S19.6).
If there are further new tags in the previously accessed track (steps S19.7, S19.8), then tag probabilities 116 for the candidate tracks for the further new tags are also added to, or subtracted from, the similarities as appropriate (steps S19.5, S19.6). If required, the controller 202 may then search the user history for another previously accessed track with at least one new tag (steps S19.9, S19.2), to further adjust the similarities of the candidate tracks (steps S19.3 to S19.9). The candidate tracks are then ranked based on their adjusted similarities (step S19.10), completing the procedure (step S19.11). The list of recommendations 42 output at step S4.14 may be based on a subset, or on all, of the candidate tracks in the ranked list of candidate tracks.

In yet another embodiment, one or more of the methods described above with reference to Figures 17 to 19 may be used to compile the list of recommendations 42 at step S4.13.
The list of recommendations 42 is output at step S4.14 via the interface 202. In this example, the list is transmitted to the user's terminal 104 via the network 102. The terminal 104 may present the list of recommendations 42 to the user as a list of music tracks, optionally with links to access the recommended tracks from a streaming service or database or to purchase the recommended tracks from a digital music store. Where at least some of the recommendations include music tracks in a library accessible by the terminal 104, for example stored in storage
208, the list of recommendations 42 may include, or take the form of, a playlist to be followed by a media player software application stored in the terminal 104.
The procedure for recommending music tracks may end at this point (step S4.15). Alternatively, the analysis server 100 may receive and monitor user history information (step S4.16) from the terminal 104 after the list of recommendations 42 has been output (step S4.14) and determine whether the list of recommendations 42 should be revised (step S4.17). For example, the controller 202 may determine that revision is needed to adjust the recommendations 42 based on whether the user has listened to, or skipped, tracks in the existing list of recommendations 42.
If revision is required (step S4.17), then the controller 202 revises the list of recommendations 42 (step S4.18). For example, the controller 202 may update the list of recommendations 42 based on tags from the recommended tracks that the user has listened to, or skipped, by performing the method of Figure 18, using the existing list of recommendations 42 as the preliminary list of candidate tracks in step S18.1 and the received updated user history as the user history obtained in step S18.2. Alternatively, or additionally, the controller 202 may update the list of recommendations 42 if a previously accessed music track appearing in the updated user history includes a new tag, using the method of Figure 19, with the existing list of recommendations 42 as the preliminary list of candidate tracks in step S19.1.
The revised list of recommendations 42 based on the new rankings is then output (step S4.14).
The monitoring of user history and revision of the list of recommendations 42
(steps S4.14 to S4.18) may continue until further revision is not needed (step S4.15), for example, if the user of the terminal 104 pauses or stops music playback, closes the media player application or switches off the terminal 104.
The procedure then ends at step S4.19.
It will be appreciated that the above-described embodiments are not limiting on the scope of the invention, which is defined by the appended claims and their alternatives. Various alternative implementations will be envisaged by the skilled person, and all such alternatives are intended to be within the scope of the claims.
It is noted that the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer. A computer-readable medium may comprise a computer-readable storage medium that may be any tangible media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer as defined previously. The computer-readable medium may be a volatile medium or a non-volatile medium.
According to various embodiments of the previous aspect of the present invention, the computer program according to any of the above aspects, may be implemented in a computer program product comprising a tangible computer-readable medium bearing computer program code embodied therein which can be used with the processor for the implementation of the functions described above.
Reference to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc, or a "processor" or "processing circuit" etc. should be understood to encompass not only computers having differing architectures such as single/multi processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays FPGA, application specify circuits ASIC, signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to express software for a programmable processor firmware such as the programmable content of a hardware device as instructions for a processor or configured or configuration settings for a fixed function device, gate array, programmable logic device, etc. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

Claims
1. An apparatus, comprising:
a controller; and
a memory in which is stored computer readable instructions that, when executed by the controller, cause the controller to:
receive input information regarding one or more music tracks;
determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based, at least in part, on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
2. An apparatus according to claim 1, where said computer readable
instructions, when executed by the controller, further cause the controller to:
determine a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
3. An apparatus according to claim 1 or 2, wherein said track level attributes include acoustic features extracted from said first plurality of music tracks or said second plurality of music tracks.
4. An apparatus according to claim 1, 2 or 3, wherein said track level attributes include at least one of: tags associated with at least some of said first plurality of music tracks and said second plurality of music tracks;
metadata associated with at least some of said first plurality of music tracks and said second plurality of music tracks; and
keywords extracted from text associated with at least some of said first plurality of music tracks and said second plurality of music tracks.
5. An apparatus according to any of the preceding claims, wherein said properties include at least one property based on a musical instrument and at least one property based on a musical genre.
6. An apparatus according to claim 5, wherein said properties comprise probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
7. An apparatus according to any of the preceding claims, where said computer readable instructions, when executed by the controller, further cause the controller to:
monitor a history of music tracks previously accessed by a user;
revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the music tracks in the history were played or skipped; and
output said revised list.
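(Non-claim illustration.) A sketch, under assumptions, of the history-based revision in claim 7: candidates resembling tracks the user played are boosted and candidates resembling skipped tracks are penalised before the list is re-ordered; the step size and the cosine measure are illustrative choices, not taken from the claims.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def revise_list(candidates, base_scores, history, step=0.1):
        # candidates: {track_id: property vector}; history: [(property vector, was_played)]
        revised = dict(base_scores)
        for track_id, props in candidates.items():
            for hist_props, was_played in history:
                delta = step * cosine(props, hist_props)
                revised[track_id] += delta if was_played else -delta
        return sorted(candidates, key=lambda t: revised[t], reverse=True)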
8. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
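(Non-claim illustration.) The preference-based ranking of claim 8 can be sketched as a dot product between each candidate's property vector and a per-property preference weight vector derived from the received input; the weight vector is an assumed structure used only for illustration.

    import numpy as np

    def rank_by_preference(candidates, preference_weights):
        # candidates: {track_id: property vector}; preference_weights: per-property weights
        return sorted(candidates,
                      key=lambda t: float(np.dot(candidates[t], preference_weights)),
                      reverse=True)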
9. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
10. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, further cause the controller to:
rank the selected music tracks in the list based, at least in part, on a further property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said further property.
11. An apparatus according to claim 9 or 10, wherein said computer readable instructions, when executed by the controller, cause the controller to rank the selected music tracks by adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
12. An apparatus according to any of the preceding claims, wherein said computer readable instructions, when executed by the controller, cause the controller to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
13. An apparatus according to claim 12, wherein said computer readable instructions, when executed by the controller, cause the controller to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
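(Non-claim illustration.) A hedged sketch of the two-level tag estimation in claims 12 and 13, using scikit-learn stand-ins: a linear SVM as the non-probabilistic first classifier, Gaussian naive Bayes as the probabilistic second classifier, and a further linear SVM as the non-probabilistic third classifier stacked on their outputs. The specific models and input features are assumptions, not taken from the claims.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC

    class TwoLevelTagger:
        def __init__(self):
            self.first = LinearSVC()      # first classifier (non-probabilistic)
            self.second = GaussianNB()    # second classifier (probabilistic)
            self.third = LinearSVC()      # third classifier (non-probabilistic)

        def fit(self, features, has_tag):
            self.first.fit(features, has_tag)
            self.second.fit(features, has_tag)
            self.third.fit(self._first_level(features), has_tag)

        def _first_level(self, features):
            # Stack the SVM decision value and the naive-Bayes posterior as the
            # first level outputs fed to the third classifier.
            svm_score = self.first.decision_function(features)
            nb_prob = self.second.predict_proba(features)[:, 1]
            return np.column_stack([svm_score, nb_prob])

        def score(self, features):
            # Second level decision value used as the per-track tag score.
            return self.third.decision_function(self._first_level(features))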
14. A method comprising:
receiving input information regarding at least one music track;
determining properties of a first plurality of music tracks belonging to a first group of music tracks defined based at least in part on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determining a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
selecting one or more music tracks from the second group of music tracks based at least in part on said similarity; and
outputting a list of said selected music tracks.
15. A method according to claim 14, comprising:
determining a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
16. A method according to claim 14 or 15, wherein said track level attributes include acoustic features extracted from said first and second pluralities of music tracks.
17. A method according to claim 14, 15 or 16, comprising:
monitoring a history of music tracks accessed by a user;
revising the list of selected music tracks based on the properties of the previously accessed music tracks in said user history and on whether the previously accessed music tracks were played or skipped; and
outputting said revised list.
18. A method according to any of claims 14 to 17, comprising:
ranking the selected music tracks in the list based on one or more of:
user preferences for the properties included in the received input;
properties of previously accessed music tracks indicated in a user history; and
a property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said property.
19. A method according to any of claims 14 to 17, comprising:
ranking the selected music tracks in the list based on at least one property of at least one previously accessed music track indicated in a user history;
wherein said ranking the selected music tracks comprises adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
20. A computer program comprising computer readable instructions which, when executed by a computer, cause said computer to perform a method according to any of claims 14 to 19.
21. A non-transitory tangible computer program product in which are stored computer readable instructions that, when executed by a computer, cause the computer to:
receive input information regarding one or more music tracks;
determine properties of a first plurality of music tracks belonging to a first group of music tracks defined based at least in part on the input information and properties of a second plurality of music tracks belonging to a second group of music tracks, the properties of the first plurality of music tracks and the properties of the second plurality of music tracks being based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
22. An apparatus configured to:
receive as input a first group of music tracks;
extract track level attributes associated with a first plurality of music tracks belonging to the first group of music tracks and a second plurality of music tracks belonging to a second group of music tracks;
determine properties of the first plurality of music tracks and the second plurality of music tracks, based on track level attributes;
determine a similarity between the first group of music tracks and the second group of music tracks based at least in part on the determined properties of the first plurality of music tracks and the second plurality of music tracks;
select one or more music tracks from the second group of music tracks based at least in part on said similarity; and
output a list of said selected music tracks.
23. An apparatus according to claim 22, configured to:
determine a similarity between one of said first plurality of music tracks and one of the second plurality of music tracks;
wherein said similarity between the first group of music tracks and the second group of music tracks is a combination of a group level similarity based on the properties of the first plurality of music tracks and the properties of the second plurality of music tracks and a track level similarity based on the properties of said one of said first plurality of music tracks and the properties of said one of the second plurality of music tracks.
24. An apparatus according to claim 22 or 23, wherein said track level attributes include acoustic features extracted from said first plurality of music tracks or said second plurality of music tracks.
25. An apparatus according to claim 22, 23 or 24, wherein said track level attributes include at least one of:
tags associated with at least some of said first plurality of music tracks and said second plurality of music tracks;
metadata associated with at least some of said first plurality of music tracks and said second plurality of music tracks; and
keywords extracted from text associated with at least some of said first plurality of music tracks and said second plurality of music tracks.
26. An apparatus according to any of claims 22 to 25, wherein said properties include at least one property based on a musical instrument and at least one property based on a musical genre.
27. An apparatus according to claim 26, wherein said properties comprise probabilities that a tag for a musical instrument or genre applies to a respective one of the first and second pluralities of music tracks.
28. An apparatus according to any of claims 22 to 27, configured to:
monitor a history of music tracks previously accessed by a user;
revise said list of selected music tracks based on the properties of the previously accessed music tracks in said history and on whether the music tracks in the history were played or skipped; and
output said revised list.
29. An apparatus according to any of claims 22 to 28, configured to:
rank the selected music tracks in the list based, at least in part, on user preferences for the properties included in the received input.
30. An apparatus according to any of claims 22 to 29, configured to:
rank the selected music tracks in the list based, at least in part, on properties of previously accessed music tracks indicated in a user history.
31. An apparatus according to any of claims 22 to 30, configured to:
rank the selected music tracks in the list based, at least in part, on a further property of a previously accessed music track indicated in a user history, wherein said first plurality of music tracks does not include said further property.
32. An apparatus according to claim 30 or 31, configured to rank the selected music tracks by adjusting similarities for the selected music tracks according to whether said previously accessed music track indicated in the user history was played or skipped by the user.
33. An apparatus according to any of claims 22 to 32, configured to determine said properties by evaluating first level probabilities that a particular tag applies based on the track level attributes and evaluating a second level probability that the particular tag applies based on the first level probability.
34. An apparatus according to claim 33, configured to evaluate the first level probabilities using a first classifier and a second classifier and to evaluate the second level probabilities using a third classifier, wherein the first and third classifiers are non-probabilistic classifiers and the second classifier is a probabilistic classifier.
PCT/FI2014/051037 2014-12-22 2014-12-22 Similarity determination and selection of music WO2016102738A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Publications (1)

Publication Number Publication Date
WO2016102738A1 true WO2016102738A1 (en) 2016-06-30

Family

ID=56149313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2014/051037 WO2016102738A1 (en) 2014-12-22 2014-12-22 Similarity determination and selection of music

Country Status (1)

Country Link
WO (1) WO2016102738A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060107823A1 (en) * 2004-11-19 2006-05-25 Microsoft Corporation Constructing a table of music similarity vectors from a music similarity graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CASEY, M. ET AL.: "Content-based information retrieval: current directions and future challenges", PROCEEDINGS OF THE IEEE, vol. 96, no. 4, April 2008 (2008-04-01), pages 668-696 *
KNEES, P. ET AL.: "A music search engine built upon audio-based and web-based similarity measures", INT. ACM SIGIR CONF., 27 July 2007 (2007-07-27), 8 pp. *
PAMPALK, E. ET AL.: "MusicRainbow: a new user interface to discover artists using audio-based similarity and web-based labeling", INT. CONF. ON MUSIC INFORMATION RETRIEVAL (ISMIR 2006), 8 October 2006 (2006-10-08), Victoria, Canada *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10014841B2 (en) 2016-09-19 2018-07-03 Nokia Technologies Oy Method and apparatus for controlling audio playback based upon the instrument
CN110046048A (en) * 2019-04-18 2019-07-23 杭州电子科技大学 A kind of load-balancing method adaptively quickly reassigned based on workload
CN110046048B (en) * 2019-04-18 2021-09-28 杭州电子科技大学 Load balancing method based on workload self-adaptive fast redistribution
IT201900020486A1 (en) * 2019-11-06 2021-05-06 Luciano Nigro DIGITAL PLATFORM FOR REAL-TIME COMPARISON OF MUSICAL INSTRUMENT ELEMENTS
CN113220929A (en) * 2021-04-06 2021-08-06 辽宁工程技术大学 Music recommendation method based on time-staying and state-staying mixed model
CN113220929B (en) * 2021-04-06 2023-12-05 辽宁工程技术大学 Music recommendation method based on time residence and state residence mixed model
EP4336381A1 (en) * 2022-09-09 2024-03-13 Sparwk AS System and method for music entity matching

Similar Documents

Publication Publication Date Title
US11094309B2 (en) Audio processing techniques for semantic audio recognition and report generation
EP2659482B1 (en) Ranking representative segments in media data
GB2533654A (en) Analysing audio data
US8423356B2 (en) Method of deriving a set of features for an audio input signal
US10129314B2 (en) Media feature determination for internet-based media streaming
US10296959B1 (en) Automated recommendations of audio narrations
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
US9774948B2 (en) System and method for automatically remixing digital music
US8865993B2 (en) Musical composition processing system for processing musical composition for energy level and related methods
WO2016102738A1 (en) Similarity determination and selection of music
GB2522644A (en) Audio signal analysis
KR20160069784A (en) Method and device for generating music playlist
US20180173400A1 (en) Media Content Selection
Niyazov et al. Content-based music recommendation system
JP5345783B2 (en) How to generate a footprint for an audio signal
Yu et al. Sparse cepstral codes and power scale for instrument identification
Foucard et al. Multi-scale temporal fusion by boosting for music classification.
Foster et al. Sequential complexity as a descriptor for musical similarity
Krey et al. Music and timbre segmentation by recursive constrained K-means clustering
Pei et al. Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering
Tian A cross-cultural analysis of music structure
Hartmann Testing a spectral-based feature set for audio genre classification
Joseph Fernandez Comparison of Deep Learning and Machine Learning in Music Genre Categorization
Saikkonen Structural analysis of recorded music
Sandrock Multi-label feature selection with application to musical instrument recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14908893

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14908893

Country of ref document: EP

Kind code of ref document: A1