WO2015110823A1 - Processing audio data to produce metadata - Google Patents

Processing audio data to produce metadata

Info

Publication number
WO2015110823A1
Authority
WO
WIPO (PCT)
Prior art keywords
derived, dimensional, vector, keywords, dimensional vector
Application number
PCT/GB2015/050151
Other languages
French (fr)
Inventors
David Marston
Chris Baume
Panos Kudumakis
Mathieu Barthet
Gyorgy Fazekas
Andrew Hill
Mark Sandler
Original Assignee
British Broadcasting Corporation
Queen Mary University Of London
Broadchart International Limited
Priority date
2014-01-24
Filing date
2015-01-23
Publication date
2015-07-30
Application filed by
British Broadcasting Corporation, Queen Mary University Of London, Broadchart International Limited
Publication of WO2015110823A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075 Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/085 Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece


Abstract

A system for automated control of retrieval and output of music audio files comprises a training input for receiving music audio files each having one or more associated keywords from a set of keywords. An analyser is arranged to convert keywords to M dimensional vectors in a vector space, where M is less than the total number of distinct keywords in the set of keywords. A further analyser is arranged to sample features of the music audio files and to produce an F dimensional vector in a vector space representing each music audio file. A machine learning module is arranged to derive a conversion between M dimensional vectors and F dimensional vectors. A sample input is arranged to receive a sample audio file, to extract features and to produce a derived F dimensional vector in vector space. A converter is arranged to convert the F dimensional vector to a derived M dimensional vector in vector space using the derived conversion. An output is arranged to allow selection and retrieval of music audio files using the derived M dimensional vector.

Description

PROCESSING AUDIO DATA TO PRODUCE METADATA
BACKGROUND OF THE INVENTION
This invention relates to a system and method for processing audio data to produce metadata and for controlling retrieval and output of music audio files.
Audio content, such as music, may be stored in a variety of formats and can have accompanying metadata describing the content that may be stored with the content or separately. Recorded music comprises tracks, movements, albums or other useful divisions. For simplicity, we will refer to a portion of music or other audio data, however divided, as audio content.
It is convenient to store metadata related to audio content to assist in the storage and retrieval of audio content from databases for use with guides. Such metadata may be represented graphically for user selection, or may be used by systems for processing the audio content. Example metadata includes the content's title, textual description and genre. There can be problems in appropriately deriving and using metadata. Curated music collections require many man-hours of work to maintain. Further, the use of keywords alone can be an inaccurate representation of complex aspects of music. There can also be problems in the reliability of created metadata, particularly where the metadata requires some form of human intervention, rather than automated machine processing. If the metadata is not reliable, then the extraction process will again lead to poor result sets.
SUMMARY OF THE INVENTION
We have appreciated the need to convert between audio content and metadata using techniques that improve on the ability for the metadata to represent the associated audio content to allow improved retrieval and output.
In broad terms, the invention provides a system and method for converting between audio content and metadata using a conversion between an M- dimensional vector mood space, derived from metadata of audio content training data, and an F-dimensional vector feature space derived from features extracted from the audio content training data.
The use of mapping between the two vector spaces provides more accurate and faster searching and output techniques than prior arrangements.
The invention is defined in the claims to which reference is now directed.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described in more detail by way of example with reference to the drawings, in which:
Figure 1 is a diagram of the main functional components of a system embodying the invention;
Figure 2 is a diagrammatic representation of an algorithm embodying the invention;
Figure 3 shows an overview of one specific use case of an embodiment of the invention; and
Figure 4 is a graph showing an analysis of accuracy of the embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention may be embodied in a method and system for processing metadata related to audio data (which may also be referred to as audio content) to produce an output signal. The output signal may be used for controlling a display, initiating playback or controlling other hardware.
A system embodying the invention is shown in Figure 1. The system may be implemented as dedicated hardware or as a process within a larger system.
The system 100 embodying the invention has an input 2 for receiving audio content that is to be used for the purpose of deriving a machine learning conversion. The content input 2 may comprise an input from a database of audio content such as a music library of audio music tracks which may be gathered as single tracks or albums or otherwise categorised into a structured store. Each portion of audio content, such as a given music track, comprises audio data as well as metadata, particularly keywords used to tag the audio content. The audio content is first provided to a keyword analyser 4.
The keyword analyser 4 analyses the set of keywords found within all of the training audio content and derives a conversion between each keyword and a multidimensional vector M so that each keyword may be easily converted to such a multidimensional vector. In the example embodiment, the total set of keywords comprises 450 distinct words for the training content. The M dimensional space preferably has five dimensions and we will refer to this space as a "mood" space. Each axis of the mood space has no particular meaning and is simply a useful representation by which the keywords may be classified. The M dimensional vectors may be derived in a variety of ways, including techniques based on user input in which users are asked to score words for similarity, automated processes in which words are compared using libraries or a thesaurus, or a statistical process from which the relationship between the keywords can be inferred. The important point to note is that a large set of words is converted such that each word may be represented by a vector in an M dimensional space. The number of dimensions is less than the number of distinct words and preferably, for computational reduction, is significantly less than the number of words. As noted above, the number of dimensions preferred, based on analysis, is M = 5, considerably less than the 450 distinct training words.
The next component in the system 100 is a feature analyser 6 which also receives as an input the audio content at audio content input 2. The purpose of the feature analyser is to produce a feature vector in a feature vector space F of F dimensions. Features of audio content include matters such as tempo, loudness, spectral slope and other such aspects of audio understood in the art. The feature analyser extracts a set of features from each portion of audio content and converts these to an F dimensional vector in the F dimensional space. In the example embodiment, 63 differing features of audio music, discussed later, are applied to the feature analyser which reduces the set of features down to an F dimensional vector. The preferred number of dimensions for the F dimensional vector is 32.
The output from the two analyser stages, therefore, is an M dimensional vector representing the audio content in mood space and an F dimensional vector representing the content in feature space. Each portion of content has one M dimensional vector and one F dimensional vector. It would be possible to extend this to have more than one such vector for each piece of content, but one is preferred in the embodiment for simplicity. The machine learning stage 8 receives the F dimensional and M dimensional vectors for each portion of audio content and, using machine learning techniques, derives a conversion between the M dimensional vectors and F dimensional vectors. The preferred machine learning operation is discussed in greater detail later. In broad terms, the analysis involves deriving a conversion or mapping between the M dimensional vector and F dimensional vector by looking for a correlation between the M dimensional vector and F dimensional vector in the training data set. The larger the training data size, up to a threshold, the more accurate the correlation between the two data sets. The output of the machine learning module 8 is a mapping or conversion between the M dimensional vectors and F dimensional vectors which may be used for analysis of specific samples of audio content. The conversion or mapping is provided to a converter 10 which is arranged to operate a reverse process in which vectors in the F dimensional feature space can be converted to vectors in the M dimensional mood space. A given sample of audio content may then be tagged with one or more M dimensional vectors that represent the mood of the music. This representation of mood is a vector that does not directly have any meaning, but as it has been derived by a model that has used natural language and statistical processing from keywords that do have meaning, the mood vector will have a useful purpose in looking for similarities between portions of music.
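The learned conversion is described above only in functional terms. The following minimal sketch, assuming scikit-learn-style regressors, illustrates how a mapping from F dimensional feature vectors to M dimensional mood vectors could be fitted on the training pairs produced by the two analyser stages; the function name train_mood_mapping and the use of MultiOutputRegressor are illustrative choices, not details taken from the patent.

```python
# Minimal sketch of the machine learning stage 8: fit one regression model
# that maps F-dimensional feature vectors to M-dimensional mood vectors.
# (Illustrative only; the patent's preferred regressors are described later.)
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

def train_mood_mapping(feature_vectors, mood_vectors):
    """feature_vectors: (n_tracks, F) array; mood_vectors: (n_tracks, M) array."""
    model = MultiOutputRegressor(SVR(kernel="poly", degree=2))
    model.fit(feature_vectors, mood_vectors)
    return model

# One F-dimensional and one M-dimensional vector per training track
# (random placeholders here, with F = 32 and M = 5 as in the embodiment).
F_train = np.random.rand(1173, 32)
M_train = np.random.rand(1173, 5)
mapping = train_mood_mapping(F_train, M_train)
```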
A particular sample of audio content such as a music track may be input at sample input 12 to the converter 10 and the process of feature extraction operated within the converter 10 to derive the F dimensional vector. The conversion or mapping to an M dimensional vector is performed and an output asserted on output line 11. The output may comprise the M dimensional vector for that piece of audio content. Alternatively, the converter may further convert the M dimensional vector to one or more keywords using an M dimensional vector to keyword conversion also provided from the learning module 8. Alternatively, the output 11 may assert a signal back to the content input 2 to retrieve a similar piece of audio content directly from the body of audio content, or to a different music database comprising content that has either been tagged with keywords or that is automatically tagged with M dimensional vectors. In this way, the system may be implemented as part of a database system, a music player or other user device which can automatically retrieve, select, display in a list or play music tracks.
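As a companion to the training sketch above, the following hedged example shows how the converter 10 might apply the learned mapping to a new sample's feature vector and then use the resulting mood vector for retrieval, either of nearby catalogue tracks or of nearby keywords. The catalogue_moods and keyword_vectors inputs are placeholders for data structures the patent only describes in prose.

```python
# Sketch of the runtime converter 10 and output 11: map a sample's features
# into mood space, then retrieve similar tracks or descriptive keywords.
import numpy as np

def to_mood_vector(mapping, feature_vector):
    """Convert one F-dimensional feature vector to an M-dimensional mood vector."""
    return mapping.predict(np.asarray(feature_vector).reshape(1, -1))[0]

def nearest_tracks(mood_vector, catalogue_moods, k=10):
    """catalogue_moods: (n_tracks, M) mood vectors already assigned to a catalogue."""
    distances = np.linalg.norm(catalogue_moods - mood_vector, axis=1)
    return np.argsort(distances)[:k]          # indices of the k closest tracks

def nearest_keywords(mood_vector, keyword_vectors, k=3):
    """keyword_vectors: dict mapping keyword -> M-dimensional vector."""
    ranked = sorted(keyword_vectors.items(),
                    key=lambda kv: np.linalg.norm(kv[1] - mood_vector))
    return [keyword for keyword, _ in ranked[:k]]
```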
Figure 2 summarises the process operated by the above system, which may be conveniently split into two stages. The first stage is a training stage comprising analysis of keywords 20, analysis of features 22 and the derivation of a conversion 24 from a training set of data. The second stage then extracts features from a specific sample of audio content at a feature extraction stage 26, converts those features to vectors at a conversion stage 28 and asserts an output at output step 30 in the manner described above. The steps operated in Figure 2 may be operated within one device. Preferably, though, steps 20 to 24 are provided in systems for deriving a conversion and steps 26 to 30 are provided in a user device such as a music player.
A particular use case for the techniques described will now be set out in relation to Figure 3. The example is for generating mood metadata for commercial music, which itself has no metadata that can be used for creative purposes such as for music selection. The training data provided is so-called production music: music that is typically not sold to the public but is mainly used in the film and TV industry. Databases of such production music are available that have been manually catalogued with high quality keywords and so are useful in providing training data. Commercial music is generally available music for purchase, such as singles and albums available in hard copy or for download; it is not generally well catalogued and indexed with keywords. As shown in Figure 3, the production music training database is sampled, audio features are extracted, and training and tests are performed to derive the best audio features for selection. Alongside this, the editorial tags such as keywords are retrieved and filtered to remove redundant information, and a final set of editorial tags is derived to create mood values. The selected audio features and mood values are then input to a regressor training and testing module, and this process is repeated until the best performing features are found and final training models are derived, giving a track-to-mood mapping that may be used within a converter. Commercial music may then be input to a converter using the track-to-mood mapping, which may be effectively operated in reverse to give estimated mood tags as a result.
The feature extraction process will now be described in greater detail:
There are many feature extraction software packages available, including MATLAB-based ones such as MIR Toolbox, MA Toolbox and PsySound3, or open source C++/Python libraries such as Marsyas, CLAM, LibXtract, Aubio and YAAFE. However, it remains difficult to know whether audio features computed by different audio feature extraction and analysis tools are mutually compatible or interchangeable. Moreover, if different tools were used in the same experiment, the outputs typically need conversion to some sort of common format, and for reproducibility, this glue code needs to evolve with the changes of the tools themselves.
To resolve these issues, we used the Vamp plugin architecture developed at QMUL as a standardised way of housing feature extraction algorithms. A large number of algorithms are available as Vamp plugins, and as they can all be used from the command line using Sonic Annotator, it is easy to extract a wide variety of features. Five Vamp plugin collections were selected for use as part of the project - the BBC plugin set, the QMUL plugin set, NNLS Chroma, Mazurka, and LibXtract. The plugins developed and used in this work were released as open source software available online.
We analysed 63 features computed from these Vamp plugins by using them as the input to a four mood classifier. The results showed that some of the features (e.g. spectral kurtosis and skewness) had no correlation with the four basic moods which were considered in the experiments, so these were not included in further extraction processes. Of the remaining 47 algorithms, 40 were used with their default settings, while the remaining ones were set up with a variety of configurations, producing a total of 59 features. These are listed in Table 1.
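The features above are computed with Vamp plugins driven by Sonic Annotator. A sketch of how the per-track extraction step could be scripted and parallelised from Python is given below; the transform identifiers and command-line options shown are assumptions and should be checked against the installed plugin sets and the sonic-annotator version in use.

```python
# Sketch of batch feature extraction by shelling out to Sonic Annotator,
# one subprocess per (track, transform) pair.
import subprocess
from concurrent.futures import ProcessPoolExecutor

# Illustrative Vamp transform identifiers (assumed, not taken from the patent).
TRANSFORMS = [
    "vamp:qm-vamp-plugins:qm-tempotracker:tempo",
    "vamp:bbc-vamp-plugins:bbc-rhythm:onset",
]

def extract(track_path, out_dir="features"):
    for transform in TRANSFORMS:
        subprocess.run(
            ["sonic-annotator", "-d", transform,
             "-w", "csv", "--csv-basedir", out_dir, track_path],
            check=True)

def extract_collection(track_paths, workers=16):
    # Run one extraction task per track, in parallel.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(extract, track_paths))
```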
The inventors used a high-performance computing cluster, which houses over 5,000 Intel Sandy Bridge cores, to extract the features from 128,024 music tracks. The tracks were first down-mixed to mono in order to save time when transferring the files over FTP. Since all of the algorithms use a mono input, this does not affect the result. Once the files were on the cluster, a separate task was run for each music track in which Sonic Annotator was used to extract each feature. As a result of parallelisation, all designated features were extracted from the collection in less than seven hours. Example features are listed in Table 1 at the end of this description. The next stage, feature selection, will now be described.
Feature selection is the process of selecting a subset of features for the purpose of removing redundant data. By utilising this pre-processing stage, the accuracy and speed of machine learning-based systems can be improved. Generally speaking, feature selection involves choosing subsets of features and evaluating them until a stopping criterion is reached.
The different subset evaluation techniques broadly fall into three categories: the filter model, the wrapper model and the hybrid model. The filter model relies on general characteristics of the data to evaluate feature subsets, whereas the wrapper model uses the performance of a predetermined algorithm (such as a support vector machine) as the evaluation criterion. The wrapper model gives superior performance as it finds features best suited to the chosen algorithm, but it is more computationally expensive and specific to that algorithm. The hybrid model attempts to combine the advantages of both.
Most studies which have employed feature selection in the context of music mood/genre classification have used the filter model, with the ReliefF algorithm being particularly popular. Notably however, there are a couple of studies that have successfully used the wrapper method. Due to the superior performance of the wrapper method, this was chosen as the evaluation.
The music data used for evaluating the features was randomly selected from a production music library, with each track coming from a different album (to avoid skewing data with an 'album effect' due to similarity of tracks), being over 30 seconds in length (to avoid sound effects and shortened versions of tracks) and having explicitly labelled mood tags. This resulted in 1,760 tracks, whose features were scaled to fall between 0 and 1, before the tracks were randomly split into two-thirds training (1,173) and one-third testing (587). Where a feature is time-varying, the following six metrics were used to summarise the output: mean, standard deviation, minimum, maximum, median and mode.
Although some of these statistics assume a Gaussian distribution of audio features (an assumption which clearly does not hold in all cases), we found that the above combination of metrics provides a reasonable compromise. Using a bag-of-frames approach as an alternative would require storing large amounts of frame-wise feature data, which is not practical given the size of the target music collection in which our method will be applied.
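The data preparation just described, six summary metrics per time-varying feature, scaling to the range 0 to 1, and a two-thirds/one-third split, can be sketched as follows; the helper names are illustrative and the scaling and splitting utilities are assumed to come from scikit-learn.

```python
# Sketch of summarising time-varying features and preparing the train/test split.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

def summarise(series):
    """Six summary metrics for one time-varying feature of one track."""
    series = np.asarray(series, dtype=float)
    values, counts = np.unique(series, return_counts=True)
    mode = values[counts.argmax()]
    return np.array([series.mean(), series.std(), series.min(),
                     series.max(), np.median(series), mode])

def prepare(X, y, seed=0):
    """X: (n_tracks, n_summarised_features); y: (n_tracks, 5) mood targets."""
    X = MinMaxScaler().fit_transform(X)        # scale all features to [0, 1]
    return train_test_split(X, y, test_size=1/3, random_state=seed)
```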
The features were evaluated by using combinations of them as the input, and a five-dimensional mood representation as the output. The mood model used is based on the structure of the keywords in the production music database. The system was implemented by using five support vector regressors (SVR), each based on a polynomial kernel. The implementation used the scikit-learn Python module. Although the RBF kernel has often been used for music emotion recognition (MER), a recent study has shown the polynomial kernel to be faster and more accurate. Two-fold cross-validation was used with each regressor to perform a grid search. Each regressor was trained using the optimum parameters from cross-validation and evaluated against the test set, using absolute error as the performance metric. As there are over 5.7 x 10^17 different possible combinations of the 59 features, it would have been impractical to perform an exhaustive search, so a forward sequential search was chosen, where features are added successively.
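A sketch of how one of the five per-dimension regressors could be trained and scored, following the description above (polynomial-kernel SVR, two-fold cross-validated grid search, absolute error on the held-out test set). The parameter grid itself is an assumption, since the patent does not list the values that were searched.

```python
# Train and evaluate one support vector regressor for one mood dimension.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

def train_dimension_regressor(X_train, y_train, X_test, y_test):
    grid = GridSearchCV(
        SVR(kernel="poly"),
        param_grid={"C": [0.1, 1, 10], "degree": [2, 3], "epsilon": [0.01, 0.1]},
        cv=2,                                   # two-fold cross-validation
        scoring="neg_mean_absolute_error")
    grid.fit(X_train, y_train)                  # y_train: one mood dimension only
    best = grid.best_estimator_
    return best, mean_absolute_error(y_test, best.predict(X_test))
```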
To avoid the problem of the subset size increasing exponentially, the following algorithm was developed:
1. Start with a set containing every combination of N features
2. Evaluate the performance of each and choose the M best combinations
3. Generate a new set of combinations by adding every one of the 59 features to each of the top M combinations to make combinations with (N + 1) features
4. Repeat from step 2
To make best use of the computing time available, the parameters were set as N = 2 and M = 12. Figure 4 shows the best absolute error achieved for each regressor when combining up to 20 different features, with the minima marked as triangles. Table 2 shows which features were used to achieve those minima. The full results show the features that were used for every point of the graph in Figure 4. The overall minimum mean error achieved by using the best feature combinations for each regressor was 0.1699.
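The forward sequential search of steps 1 to 4 above, starting from every pair of features (N = 2) and keeping the M = 12 best combinations at each step, could be implemented roughly as below. Here evaluate is an assumed callback that trains the regressors on a candidate feature subset and returns its mean absolute error; it stands in for the training and testing described above rather than reproducing it.

```python
# Sketch of the beam-style forward sequential feature search.
from itertools import combinations

def forward_search(n_features, evaluate, N=2, M=12, max_size=20):
    """evaluate(frozenset_of_feature_indices) -> mean absolute error."""
    # Step 1: start with every combination of N features.
    candidates = [frozenset(c) for c in combinations(range(n_features), N)]
    best_subset, best_error = None, float("inf")
    size = N
    while size <= max_size:
        # Step 2: evaluate each candidate and keep the M best combinations.
        errors = {subset: evaluate(subset) for subset in candidates}
        top = sorted(candidates, key=errors.get)[:M]
        if errors[top[0]] < best_error:
            best_subset, best_error = top[0], errors[top[0]]
        # Step 3: add every feature to each of the top M combinations.
        candidates = list({subset | {f} for subset in top
                           for f in range(n_features) if f not in subset})
        size += 1                               # Step 4: repeat with larger subsets.
    return best_subset, best_error
```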
From the shape of the plots in Figure 4, we can see that using more features produces diminishing returns. In each case, the error reaches a minimum before reaching a baseline. This could be avoided by using cross-validation, but that would make the process somewhere in the order of 100 times slower, which is prohibitively slow given the mildness of the over-fitting.
Table 2 shows that mood prediction benefits from a wide variety of features (32 in this case) from every category. However, some of the regressors were more reliant on certain categories than others. For example, SVR1 uses harmonic and rhythmic features more than spectral ones. This suggests that it may be advantageous to optimise the features for individual dimensions. The design of specific mood models will now be described in greater detail
In the same way that a measure of similarity between tracks can be derived from tag co-occurrence counts, a measure of similarity between tags can be derived from their co-occurrence over tracks.
In the case of curated editorial metadata, tracks are associated with a list of unique tags judged to be the most appropriate by professional music experts. Hence, a given tag is only attributed once to a track, unlike for social tags, for which a large number of users set tags to tracks. Initially the mood tags were cleaned by correcting misspellings (100 errors out of 2,398 mood tags), removing duplicates (338 duplicates yielding 2,060 unique tags), and stripping white spaces and punctuation marks (e.g. '!'). Instead of following a bag-of-words approach, for which the meaning of certain tags with multiple words can be lost (e.g. "guilty pleasure"), we collated words of alternate forms to further process them as single entities (using a hyphen between the words). The vocabulary used in the editorial annotations is composed of conventional words and does not have the idiosyncrasies of social tags, which often include sentences, informal expressions (e.g. "good for dancing to in a goth bar") or artists' names. For this reason, we did not have to tokenise the tags with a stop-list (to remove words such as 'it', 'and', 'the', for instance). However, we used a stemmer algorithm to detect tags with similar base parts (e.g. "joyful" and "joy"), as these refer to identical emotional concepts. 1,873 mood-related
stems were obtained out of the 2,060 unique mood tags. In order to reduce the size of the stem vector while maintaining the richness of the track descriptions, we only kept stems which were associated with at least 100 tracks in further analyses. This stem filtering process yielded a list of 453 stems which provided at least one mood tag for each of the 183,176 tracks from the ILM dataset.
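A rough sketch of the tag cleaning and stem filtering described above is given below; the NLTK Porter stemmer is used as a stand-in for the unnamed stemmer algorithm, and the regular expressions are illustrative rather than the exact rules applied to the dataset.

```python
# Sketch of cleaning mood tags and keeping only well-supported stems.
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_tag(tag):
    tag = tag.strip().lower()
    tag = re.sub(r"[^\w\s-]", "", tag)      # strip punctuation such as '!'
    tag = re.sub(r"\s+", "-", tag)          # treat multi-word tags as one entity
    return stemmer.stem(tag)

def filter_stems(track_tags, min_tracks=100):
    """track_tags: dict track_id -> list of raw mood tags for that track."""
    stem_tracks = defaultdict(set)
    for track, tags in track_tags.items():
        for tag in tags:
            stem_tracks[clean_tag(tag)].add(track)
    # Keep only stems associated with at least `min_tracks` tracks.
    return {stem for stem, tracks in stem_tracks.items() if len(tracks) >= min_tracks}
```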
The associations between tracks and stems are provided in a document-term matrix. The stem pairwise co-occurrences over tracks, cij, are then given by cij = |{x_i} ∩ {x_j}|, where {x_i} is the set of tracks annotated with stem i and |·| is the cardinality operator. The measure of dissimilarity between stems, sij, is computed as follows:
sij = 1 - cij / Max(cij), where Max(cij) is the maximum of the pairwise stem co-occurrences in the ILM dataset (26,859). Non-metric multidimensional scaling (MDS) analyses were then applied to the stem dissimilarity matrix.
Four outlier stems presenting a null or very small co-occurrence measure with all the other stems were discarded so as not to bias the MDS analysis (this yielded a list of 449 stems).
We have plotted the evolution of Kruskal's stress-1 as the number of dimensions D increases from 1 to 13. Following a rule of thumb for MDS, acceptable, good and excellent representations are obtained for D = 3 (stress < 0.2), D = 5 (stress < 0.1) and D = 11 (stress < 0.05). Interestingly, five dimensions yield a good representation (elbow of the scree plot). This result suggests that more than three dimensions are required to accurately categorise mood terms in the context of production music, which contrasts with the classical three-dimensional emotion model (arousal, valence and dominance). In further analyses, we mapped the mood stems back to mood tags to uncover the meaning of the dimensions.
Interestingly, analysis revealed that three out of the five MDS dimensions are significantly correlated with the arousal and/or valence and/or dominance dimensions, showing that the 5-D MDS configuration captures aspects of the core emotion dimensions.
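The derivation of the mood space from stem co-occurrences could be sketched as below using scikit-learn's non-metric MDS. The dissimilarity follows the verbal description above (co-occurrence normalised by its maximum); since the original equations are given only as images, that exact form should be treated as an assumption.

```python
# Sketch: stem co-occurrence counts -> dissimilarity matrix -> non-metric MDS.
import numpy as np
from sklearn.manifold import MDS

def dissimilarity_matrix(track_stems, stems):
    """track_stems: dict track_id -> set of stems; stems: ordered list of stem names."""
    index = {s: i for i, s in enumerate(stems)}
    C = np.zeros((len(stems), len(stems)))
    for annotated in track_stems.values():
        present = [index[s] for s in annotated if s in index]
        for i in present:
            for j in present:
                if i != j:
                    C[i, j] += 1             # pairwise co-occurrence counts cij
    S = 1.0 - C / C.max()                    # assumed dissimilarity: 1 - cij / Max(cij)
    np.fill_diagonal(S, 0.0)
    return S

def mood_space(S, dims=5):
    mds = MDS(n_components=dims, metric=False,
              dissimilarity="precomputed", random_state=0)
    coordinates = mds.fit_transform(S)       # stem coordinates in the mood space
    return coordinates, mds.stress_          # stress value for this dimensionality
```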
We devised several methods to summarise the tags of a track in a given multidimensional mood space. Let X = (xij) denote the tag matrix representing the coordinates of the tags i of a track across the dimensions j of the mood space. For the methods described below, the tag summary matrix Y is obtained by multiplying the tag matrix with a weight matrix W of tag weights wi, i.e. Y = W X.
This method (denoted MTF) assumes that a track is best represented by the tag from its set of tags which has the highest term frequency (TF) in the dataset. The weights wi for the N tags of a track are as follows: wi = 1 for the tag with the highest term frequency and wi = 0 for the other tags.
This method (denoted CEN) summarises the tags of a track by their centroid or geometrical mean in the mood space. The tag weights are hence given by wi = 1/N.
The Term-Frequency Weighted Centroid (TFW) is a method that summarises the tags of a track by their centroid after attributing to each tag a weight proportional to its term frequency (TF): wi = TFi / Σk TFk.
Hence the centroid is attracted by the tag of highest term frequency.
Inverse Term-Frequency Weighted Centroid (ITF)
Conversely, this method attributes more weight to the tag of lowest term frequency, following the assumption that this tag may convey more specific information about the song: wi = (1/TFi) / Σk (1/TFk).
Rather than summarising the tags of a track by a point in the space, this method (denoted MVA) assumes that the tags can be represented by a Gaussian distribution. The tag summary matrix Y is given by the per-dimension mean and variance of the tag matrix: μj = (1/N) Σi xij and σj² = (1/N) Σi (xij - μj)².
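Written out from their verbal descriptions, the tag summary weightings might look as follows; because the weight equations appear only as images in the original publication, these formulations (particularly the ITF normalisation) are assumptions rather than the patent's exact definitions.

```python
# Sketch of the MTF, CEN, TFW, ITF and MVA tag summaries for one track.
import numpy as np

def summarise_track(tag_coords, term_freqs, method="CEN"):
    """tag_coords: (N, D) mood-space coordinates of a track's N tags;
       term_freqs: (N,) term frequencies of those tags in the dataset."""
    tf = np.asarray(term_freqs, dtype=float)
    N = len(tf)
    if method == "MTF":                      # keep only the most frequent tag
        w = np.zeros(N)
        w[tf.argmax()] = 1.0
    elif method == "CEN":                    # plain centroid of the tags
        w = np.full(N, 1.0 / N)
    elif method == "TFW":                    # centroid weighted by term frequency
        w = tf / tf.sum()
    elif method == "ITF":                    # centroid weighted by inverse term frequency
        w = (1.0 / tf) / (1.0 / tf).sum()
    elif method == "MVA":                    # Gaussian summary: per-dimension mean, variance
        return tag_coords.mean(axis=0), tag_coords.var(axis=0)
    return w @ tag_coords                    # weighted combination of the tag coordinates
```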
Model Derived from Mood Taxonomy (CLUST)
Popular mood keywords were added to an initial selection provided by QMUL to create a list of 355 mood words. Over 95% of the production music library contained at least one of these 355 words. Each of these words was placed in one of 27 categories, which became the starting point for a cluster-based model. Each category was treated as a cluster containing several mood words. Many of these clusters could be considered to overlap in their mood; some were clearly opposites, while others had little in common. To convert these clusters into dimensions, the overlapping ones were combined into single dimensions; any opposite clusters were converted into negative (-ve) values of the dimension they were opposite to; and the non-overlapping clusters were treated as new dimensions. Using this method, the 27 clusters were converted to 10 dimensions, giving each of the 355 mood words 10-dimensional mood values.
The allocation of words to clusters and clusters to dimensions was based on only one person's opinion. The choice of 10 dimensions was a compromise between combining clusters that are too dissimilar and having too sparse a model. To illustrate the process, the first three dimensions represent the following mood clusters: 1) Confident (+ve scale), Caution & Doubt (-ve scale), 2) Sad & Serious (+ve scale), Happy & Optimistic (-ve scale), and 3) Exciting (+ve scale), Calm (-ve scale). As each music track is associated with several mood tags, each mapped to 10-dimensional values, tags had to be combined. The most simple and obvious way would be to take the mean of all the mood values to generate a single 10-dimensional value for a track.
However, it was felt that a music track can be represented by moods that differ significantly, so combining them into a single mood would be too crude. Therefore, a method (denoted PEA) to generate two mood values per track was devised. This method uses clustering of the 10-D scores where close scores are combined together. The means of the two most significant clusters are then calculated, resulting in two 10-D mood values for each track. A weight was assigned to each value according to the size of the cluster. For the purposes of searching a database of tracks with mood values assigned to them, a distance measurement is required to find which tracks most closely match each other. For the MDS-based models, distances between tracks were obtained using either the Euclidean distance between tag summary vectors (methods MTF, CEN, TFW, ITF), or the Kullback-Leibler (KL) divergence between the Gaussian representations of the tags (method MVA). As the model described above allocates two 10-D mood values per track (method PEA), a weighted Euclidean measure was used which exploited the weighting values associated with each of the two 10-D mood values. This is shown in equation (5), where ms(i; k) is the mood of the seed track (where i is the value index, and k is the dimension index), mt(j; k) is the mood expressed by the test track (where j is the value index), ws(i) is the seed track weighting, and wt(j) is the test track weighting.
[Equation (5), the weighted Euclidean distance measure, is given as an image in the original publication and is not reproduced here.]
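Because equation (5) is not reproduced here, the sketch below gives one plausible reading of the PEA representation and its weighted distance: a track's 10-D tag scores are clustered into two weighted mood values, and the distance between two tracks combines the seed weights ws(i) and test weights wt(j) with Euclidean terms. Both the clustering choice (k-means) and the exact form of the distance are assumptions, not the patent's stated formula.

```python
# Sketch of the PEA two-mood-per-track representation and a weighted distance.
import numpy as np
from sklearn.cluster import KMeans

def two_mood_values(tag_scores):
    """Cluster a track's 10-D tag scores (at least two of them) into two weighted moods."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tag_scores)
    moods = km.cluster_centers_                         # two 10-D mood values
    weights = np.bincount(km.labels_, minlength=2) / len(km.labels_)
    return moods, weights

def weighted_distance(ms, ws, mt, wt):
    """ms, mt: (2, 10) mood values; ws, wt: (2,) weights for seed and test track."""
    d = 0.0
    for i in range(len(ws)):
        for j in range(len(wt)):
            d += ws[i] * wt[j] * np.linalg.norm(ms[i] - mt[j])
    return d
```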
The embodying system may be implemented in hardware in a number of ways. In a preferred embodiment, the content store 2 may be an online service such as cable television or Internet TV. Similarly, the general user parameters store 6 may be an online service provided as part of cable television delivery, periodically downloaded or provided once to the metadata processor and then retained by the processor. The remaining functionality may be implemented within a client device 100 such as a set top box, PC, TV or other client device for AV content.
Table 1
[Table 1, listing the audio features used, is given as images in the original publication and is not reproduced here.]

Claims

1. A system for automated control of retrieval and output of music audio files, comprising:
a training input for receiving music audio files each having one or more associated keywords from a set of keywords;
- an analyser arranged to convert keywords to M dimensional vectors in a vector space M, where M is less than the total number of distinct keywords in the set of keywords;
- an analyser arranged to sample features of the music audio files and to produce an F dimensional vector in a vector space F representing each music audio file;
- a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors;
a sample input arranged to receive a sample audio file, to extract features and to produce a derived F dimensional vector in vector space F;
- a converter arranged to convert the F dimensional vector to a derived M dimensional vector in vector space M using the derived conversion;
- an output arranged to allow selection and retrieval of music audio files using the derived M dimensional vector.
2. A system according to claim 1, wherein the converter is further arranged to derive one or more keywords from the derived M dimensional vector.
3. A system according to claim 2, wherein the output is arranged to allow selection and retrieval of music audio files using the derived one or more keywords.
4. A system according to any preceding claim, wherein the output is arranged to control a display to produce a list of titles of audio content.
5. A system according to any preceding claim, wherein the output is arranged to automatically retrieve audio content.
6. A system for automated retrieval of audio content, comprising:
a sample input arranged to receive a sample audio file, to extract features and to produce a derived F dimensional vector in vector space F;
a converter arranged to convert the F dimensional vector to a derived M dimensional vector in vector space M using a stored derived conversion;
an output arranged to allow selection and retrieval of music audio files using the derived M dimensional vector;
- wherein the stored derived conversion is derived using a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors from training audio content.
7. A method for automated retrieval and output of music audio files, comprising:
receiving music audio files each having one or more associated keywords from a set of keywords;
converting keywords to M dimensional vectors in a vector space M, where M is less than the total number of distinct keywords in the set of keywords;
sampling features of the music audio files to produce an F dimensional vector in a vector space F representing each music audio file;
deriving a conversion between M dimensional vectors and F dimensional vectors;
receiving a sample audio file, extracting features and producing a derived F dimensional vector in vector space F; - converting the F dimensional vector to a derived M dimensional vector in vector space M using the derived conversion; and
- selecting and retrieving of music audio files using the derived M dimensional vector.
8. A method according to claim 1, comprising deriving one or more keywords from the derived M dimensional vector.
9. A method according to claim 2, comprising providing selection and retrieval of music audio files using the derived one or more keywords.
10. A method according to any of claims 7 to 9, comprising controlling a display to produce a list of titles of audio content.
11. A method according to any preceding claim, comprising automatically retrieving audio content.
12. A method for automated retrieval of audio content, comprising:
receiving a sample audio file, extracting features and producing a derived F dimensional vector in vector space F;
converting the F dimensional vector to a derived M dimensional vector in vector space M using a derived conversion;
providing selection and retrieval of music audio files using the derived M dimensional vector;
- wherein the derived conversion is derived using a machine learning module arranged to derive a conversion between M dimensional vectors and F dimensional vectors from training audio content.
PCT/GB2015/050151 2014-01-24 2015-01-23 Processing audio data to produce metadata WO2015110823A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1401218.1 2014-01-24
GB1401218.1A GB2523730A (en) 2014-01-24 2014-01-24 Processing audio data to produce metadata

Publications (1)

Publication Number Publication Date
WO2015110823A1 (en) 2015-07-30

Family

ID=50287507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2015/050151 WO2015110823A1 (en) 2014-01-24 2015-01-23 Processing audio data to produce metadata

Country Status (2)

Country Link
GB (1) GB2523730A (en)
WO (1) WO2015110823A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589237B1 (en) * 2015-11-17 2017-03-07 Spotify Ab Systems, methods and computer products for recommending media suitable for a designated activity

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145892A1 (en) * 2008-12-10 2010-06-10 National Taiwan University Search device and associated methods
US20100325135A1 (en) * 2009-06-23 2010-12-23 Gracenote, Inc. Methods and apparatus for determining a mood profile associated with media data
US20110252947A1 (en) * 2010-04-16 2011-10-20 Sony Corporation Apparatus and method for classifying, displaying and selecting music files

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7022907B2 (en) * 2004-03-25 2006-04-04 Microsoft Corporation Automatic music mood detection
US7672916B2 (en) * 2005-08-16 2010-03-02 The Trustees Of Columbia University In The City Of New York Methods, systems, and media for music classification
GB2481185A (en) * 2010-05-28 2011-12-21 British Broadcasting Corp Processing audio-video data to produce multi-dimensional complex metadata

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145892A1 (en) * 2008-12-10 2010-06-10 National Taiwan University Search device and associated methods
US20100325135A1 (en) * 2009-06-23 2010-12-23 Gracenote, Inc. Methods and apparatus for determining a mood profile associated with media data
US20110252947A1 (en) * 2010-04-16 2011-10-20 Sony Corporation Apparatus and method for classifying, displaying and selecting music files

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CYRIL LAURIER ET AL: "Multimodal Music Mood Classification Using Audio and Lyrics", MACHINE LEARNING AND APPLICATIONS, 2008. ICMLA '08. SEVENTH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 11 December 2008 (2008-12-11), pages 688 - 693, XP031379484, ISBN: 978-0-7695-3495-4 *

Also Published As

Publication number Publication date
GB201401218D0 (en) 2014-03-12
GB2523730A (en) 2015-09-09


Legal Events

Code Title Description

121 - Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15702547; Country of ref document: EP; Kind code of ref document: A1)

NENP - Non-entry into the national phase (Ref country code: DE)

122 - Ep: PCT application non-entry in European phase (Ref document number: 15702547; Country of ref document: EP; Kind code of ref document: A1)