WO2014096832A1 - Audio analysis system and method using audio segment characterisation - Google Patents

Audio analysis system and method using audio segment characterisation

Info

Publication number
WO2014096832A1
WO2014096832A1 (PCT/GB2013/053362)
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
audio
input audio
audio signal
segments
Prior art date
Application number
PCT/GB2013/053362
Other languages
English (en)
Inventor
Michela Magas
Cyril Laurier
Original Assignee
Michela Magas
Cyril Laurier
Priority date
Filing date
Publication date
Priority claimed from GBGB1222951.4A (GB201222951D0)
Priority claimed from GB201312399A (GB201312399D0)
Application filed by Michela Magas, Cyril Laurier
Priority to GB1512636.0A (GB2523973B)
Publication of WO2014096832A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0008 - Associated control or indicating means
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/041 - Musical analysis based on MFCC [mel-frequency cepstral coefficients]
    • G10H2210/061 - Musical analysis for extraction of musical phrases, isolation of musically relevant segments, e.g. musical thumbnail generation, or for temporal structure analysis of a musical piece, e.g. determination of the movement sequence of a musical work
    • G10H2240/00 - Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075 - Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/081 - Genre classification, i.e. descriptive metadata for classification or selection of musical pieces according to style
    • G10H2240/085 - Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece
    • G10H2240/121 - Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131 - Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141 - Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
    • G10H2250/00 - Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131 - Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/215 - Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
    • G10H2250/235 - Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]

Definitions

  • the present invention relates to an audio analysis system and method.
  • the present invention relates to a method of analysing audio sources in order to extract audio features and parameters that may be used to search for similar audio data.
  • the present invention comprises a method of building a database of automatically extracted audio segment files and a method of analysing an input audio stream (a "seed" query) against the database in order to identify audio segment files that are similar to the seed query.
  • Methods utilising high-level data to map clusters of music for the purpose of efficient music searches may therefore rely on considerable human input of semantic descriptors and assessment values, including human-edited music similarity graphs, similarity graphs generated using collaborative filtering techniques, similarity graphs generated as a function of monitored radio or network broadcast playlists, and similarity graphs constructed from music metadata (e.g. US7571183). This may include focusing on a way to capture emotional data from users via a specific device (e.g. US2010145892); or focusing on a playlist generation method based on high-level descriptors (e.g. EP2410444). None of the above methods focus on the description of an individual musical phrase within a music piece, and instead attempt to assign an emotional measure to the track as a whole.
  • Methods of audio similarity measures based on the automatised generation of music metadata by segmentation focus on the identification of specific classes for the purpose of making corresponding metadata markers available to music search engines. This may include identifying repetition of set classes, such as 'stanza' or 'refrain' (e.g. US7345233); or creating a music summary, which makes classifiers such as 'sad' and 'jazz' available as metadata for a similarity music search, and clustering music pieces according to this classification (e.g. US7626111).
  • the above methods rely on high-level semantic metadata derived from audio analysis to be served to search engines, rather than a high-level audio feature-led search.
  • the object of the present invention is to provide a more effective audio search system.
  • a method of matching an input audio signal to audio files stored in a data store comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data related to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data; comparing the feature data of the input audio signal to pre-processed feature data relating to audio files stored in the data store in order to match one or more audio files within a similarity threshold of the input audio signal; and outputting a search result on the basis of the matched one or more audio files, wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique.
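  • By way of illustration only, the following Python sketch shows the shape of such a matching step, using a plain cosine similarity over combined feature vectors as a stand-in for the similarity measures described later; the function names and random data are hypothetical and not part of the claimed method.

```python
import numpy as np

def match_input(query_features, stored_features, threshold=0.8):
    """Rank stored segments whose pre-processed feature data falls within a
    similarity threshold of the query's combined (structural + semantic) features."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = {seg_id: cosine(query_features, vec) for seg_id, vec in stored_features.items()}
    matched = [(seg_id, s) for seg_id, s in scores.items() if s >= threshold]
    return sorted(matched, key=lambda item: item[1], reverse=True)   # best match first

# Hypothetical usage with random feature vectors:
rng = np.random.default_rng(0)
store = {f"segment_{i}": rng.random(16) for i in range(5)}
print(match_input(rng.random(16), store, threshold=0.0))
```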
  • An audio signal is taken and is then analysed in order to extract audio related feature data from the input audio.
  • the extracted features may be matched to features extracted from a plurality of segments of audio stored in a database and audio samples may be returned in order of best result.
  • the input audio may be an audio signal (e.g. of a music track playing on the radio/TV), may be a segment of audio (e.g. an extract of an audio track that is played) or may be a segment of audio that has been returned as an earlier search result and which is reused as a starting query.
  • the segment may be analysed in real time and audio features are extracted.
  • the system may generate audio segments marked by timestamps that have been arrived at using an automatic segmentation algorithm.
  • a segmentation algorithm may employ a principle such as significant change in a feature of audio to determine start and end points of an audio segment.
  • These audio segments (or "musical phrases") may be analysed to extract salient high and low level feature data (semantic and structural feature data respectively) which is associated with those audio segments and their value relationships, in order to describe those audio segments with semantic and human-related contextual meaning.
  • the value relationships may comprise relationships between various extracted feature data (e.g. where the extraction of feature data has identified a loudness rating and a musical type, then a value relationship between high values of "loud" and "metal" may be derived. By weighting results, further value relationships may be derived, e.g. defining a "young" or "old" target audience for the search results).
  • an audio segment, once fully analysed, may be associated with an audio segment characterisation.
  • each audio segment may be analysed according to a series of low level (e.g. objective measures such as tempo, rhythm descriptors, tonal descriptors etc.) audio and high level feature (e.g. subjective measures such as mood, style etc.) extraction algorithms.
  • the present invention provides an improved audio analysis system compared to known systems.
  • the present invention provides improved effectiveness by focussing on parts of audio tracks instead of complete audio tracks combined with the use of automatised weighting of high-level comparison measures, e.g. the mood of an audio extract/track.
  • the analysis methods according to embodiments of the present invention allow "on the fly" similarity queries to be run.
  • the first aspect of the present invention provides a method in which an input audio signal is processed to determine structural parameter feature data such as descriptors that are extracted from temporal and spectral representations of the input audio.
  • Such data is also referred to as "low level" feature data and includes, by way of example, the following audio features grouped by type:
  • HFC (High Frequency Content)
  • semantic feature data is then extracted from the "low level" feature data.
  • Semantic feature data (also referred to as "high-level features") may be defined as descriptors capturing the semantics of an audio sample, and their corresponding value relationships. These features can model concepts such as mood, style and a variety of others. High level feature data is extracted via a supervised learning technique.
  • the low and high level feature data determined and extracted from the input audio signal is then compared to pre-processed feature data relating to audio segments stored in the data store in order to determine one or more matching audio segments.
  • the input audio signal may comprise a complete audio/music track or an extract thereof, subject to the automatic identification of the length of the audio segment. Search results output by the method may also be fed back in as the input audio signal.
  • a single data store may store both the pre-processed feature data relating to the stored audio segments and the audio segment itself.
  • the data store may instead store the pre-processed feature data relating to the audio segment and the audio segment itself may be stored in a further data store (i.e. the actual sound file that the audible, playable component is within may be located in a different location).
  • the pre-processed feature data relating to the audio segment may further include links to the actual related audio segments/complete audio files.
  • the comparing step may comprise comparing the determined structural parameter feature data and extracted semantic audio feature data to pre-processed structural parameter feature data and semantic feature data relating to the plurality of audio segments in order to match one or more audio segments that have structural parameter feature data and semantic feature data that is within a similarity threshold of the determined structural parameter feature data and extracted semantic feature data.
  • the pre-processed feature data relating to the plurality of audio segments may conveniently be stored in a data store such as a database.
  • the data store may comprise a plurality of audio segment characterisations, each audio segment characterisation comprising the pre-processed feature data relating to an audio segment.
  • the audio segment characterisation may comprise further data identifying start/end points of the audio segment within a longer audio file.
  • the audio segment characterisation may also comprise meta-data that defines the relationships between one or more types of feature data.
  • the audio segment may be stored in the same data store as the audio segment characterisation.
  • the audio segment may be stored in a further data store (e.g. a third party's music library) and the audio segment characterisation may be stored in the data store with an identifier of the actual sound file (the audible, playable component) or a link or other direction to the third party database.
  • the search result output may comprise an actual audio segment (i.e. an actual sound file representing the audio segment, if available) or a link or direction to a further data store/database (if stored elsewhere).
  • the method may further conveniently comprise segmenting the input audio signal to identify one or more audio segments. Segmenting the input audio in this manner may conveniently reduce processor burden when determining feature data and improve effectiveness of analysis and comparison.
  • Segmenting the input audio in this manner may comprise determining feature data within the input audio signal. Furthermore, segmenting may comprise identifying candidate segments based on changes in the determined feature data over time. Segmenting may comprise identifying candidate segments using a novelty curve technique, or using a peak detection algorithm to identify novelty peaks that mark candidate audio segments.
  • a first segmentation process may determine an initial audio segment which may then be normalised in a further segmentation process. Normalisation may comprise analysing the feature data over a finer time scale than during the initial/first segmentation step.
  • Segmenting the input audio signal may also comprise filtering the identified audio segments on the basis of one or more heuristic rules.
  • Processing the input audio signal may comprise analysing the input audio signal waveform to extract temporal feature data. Analysing the input audio signal waveform may comprise measuring loudness with RMS.
  • Processing the input audio signal may also comprise performing a fast Fourier transform on the input audio signal in order to extract spectral feature data.
  • the method may further comprise analysing components of the fast Fourier transform to determine changes in frequency.
  • Processing the input audio signal may comprise generating a chromagram in order to extract tonal feature data.
  • the method may further comprise analysing chromas within the generated chromagram and extracting tonal feature data based on the distribution of chromas within the input audio signal.
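  • As an illustration of chromagram-based tonal feature extraction, the short sketch below uses librosa (an assumed tool, not mandated by the method); the file name and the particular descriptors derived from the chroma distribution are placeholders.

```python
import numpy as np
import librosa

y, sr = librosa.load("segment.wav", sr=22050)        # hypothetical audio segment file
chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # 12 x n_frames chromagram

# Tonal feature data based on the distribution of chromas over the segment.
chroma_mean = chroma.mean(axis=1)                      # average strength per pitch class
chroma_var = chroma.var(axis=1)
dominant_pitch_class = int(chroma_mean.argmax())       # crude key-related descriptor

tonal_features = np.concatenate([chroma_mean, chroma_var])
print(dominant_pitch_class, tonal_features.shape)
```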
  • the method may further comprise conducting a statistical analysis of the extracted feature data in order to determine structural parameter feature data.
  • the method may comprise processing identified audio segments.
  • Analysing the determined structural parameter feature data may comprise inputting the determined structural feature data into a supervised learning based classifier model in order to extract semantic feature data.
  • the supervised learning model may comprise a Support Vector Machine (SVM).
  • the classifier model may be arranged to output semantic feature data including one or more from the group of: musical style; mood of music; instruments used within the input audio signal.
  • comparing the feature data may comprise weighting feature data that is assessed using a similarity algorithm.
  • Two or more types of feature data may be used to compare the feature data of the input audio signal with the pre-processed feature data and the weighting given to each type of feature data may be customisable.
  • For example, two types of semantic feature data (such as mood and style) and one type of structural feature data (such as tone) may be used in the comparison.
  • the various types (groups) of feature data may be weighted relative to one another depending on the context of the search.
  • the method may further comprise pre-processing the identified structural feature data in order to normalise the feature data.
  • a system for matching an input audio signal to one or more audio segments within a plurality of audio segments comprising: an input arranged to receive the input audio signal; a processor arranged to: process the input audio signal to determine structural parameter feature data related to the received input audio signal; analyse the determined structural parameter feature data to extract semantic feature data; and compare the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal; an output arranged to output a search result on the basis of the matched one or more audio segments wherein the processor is arranged to extract semantic feature data from the structural parameter data using a supervised learning technique.
  • a method of matching an input audio file comprising: receiving the input audio file; determining the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique; comparing the feature data of the input audio file to pre-processed feature data relating to a plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file; and outputting a search result on the basis of the matched one or more audio segments.
  • a system for matching an input audio file comprising: an input arranged to receive the input audio file; a processor arranged to determine the structural parameter feature data and semantic feature data associated with the input audio file, the semantic feature data having been extracted from the structural parameter data using a supervised learning technique; and to compare the feature data of the input audio file to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio file; an output arranged to output a search result on the basis of the matched one or more audio segments.
  • a method of building a data store of audio segment characterisations comprising: receiving an input audio file; processing the input audio to determine structural parameter feature data related to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data wherein semantic feature data is extracted from the structural parameter data using a supervised learning technique; storing the determined structural parameter feature data and extracted semantic feature data in a data store.
  • a method of segmenting an input audio signal comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data related to the received input audio signal; segmenting the input audio signal to identify one or more audio segments wherein segmenting comprises identifying candidate audio segments based on changes in the determined feature data over time.
  • references to audio files within the third to sixth aspects of the present invention may include complete musical tracks or fragments thereof.
  • the present invention provides a method and system for analysing audio data.
  • the invention extends to the following:
  • a system for recognising similar features in at least two audio segments comprising: a sampler of an existing audio stream; a real time feature extractor; a media search engine connected to the feature extractor, able to match the features to a database of audio segment characterisations (including both low level and high level descriptors) and able to return audio samples of matching audio segments; and an audio visual display which allows the matching audio segments to be sampled.
  • Examples of embodiment of such a system include using a known audio sample to locate similar but unknown alternatives (such as, but not exclusively, music by unknown artists); using a catalogued audio sample to find versions of the same (such as, but not exclusively, live recordings of commercial music); using an audio sample to locate linking points in other audio (such as, but not exclusively, in music mixing); using an audio sample to discover music relationships between audio samples (such as, but not exclusively, in relationships between music from different world cultures).
  • The purpose of this module is to segment audio into meaningful and consistent segments.
  • the relevant "cutting point” is located to maximise the consistency of each part.
  • while the present invention is not limited to any particular type of segmentation, it is recommended to use a classifier-based segmentation algorithm.
  • the audio is first divided into frames from which descriptors are extracted.
  • a segmentation marker is generated.
  • the salient changes are evaluated using the accuracy of a classifier, by comparing descriptors prior to and following the candidate segmentation marker.
  • the segmentation marker method may use the following steps:
  • Embodiment of such a method may be in allowing very precise search of the desired mood in a particular piece of music, regardless of whether the audio segment comes from a 30-second production piece or from a concept album.
  • the audio features are variables extracted from the audio signal describing some aspect of the information it contains.
  • a rich set of audio features is extracted based on temporal and spectral representations of the audio signal.
  • the resulting values are stored into audio clips.
  • An audio clip is linked to an audio sample and contains all the audio features extracted from it. After the audio feature extraction step, each audio sample has an associated audio profile.
  • the audio features are divided into two types: low-level and high-level.
  • High-level stands for descriptors that are based on the output from a trained machine learning algorithm, using curated databases as described below.
  • HFC (High Frequency Content)
  • feature statistics are computed (such as but not limited to: minimum, maximum, mean, variance and derivatives). It is recommended to then standardise those values across the whole music collection values, easing their combination to build similarity measures.
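  • A minimal sketch of this step, assuming numpy and randomly generated descriptor tracks, is given below: per-segment statistics (minimum, maximum, mean, variance and derivatives) are computed and then standardised across the whole collection.

```python
import numpy as np

def summarise(frame_values):
    """Summary statistics for one descriptor over the frames of a segment."""
    v = np.asarray(frame_values, dtype=float)
    d = np.diff(v)                                   # frame-to-frame derivative
    return np.array([v.min(), v.max(), v.mean(), v.var(), d.mean(), d.var()])

# Hypothetical collection: one descriptor track (e.g. loudness) per segment.
rng = np.random.default_rng(1)
collection = [rng.random(200) for _ in range(100)]
stats = np.vstack([summarise(seg) for seg in collection])

# Standardise each statistic across the collection (zero mean, unit variance),
# easing the combination of descriptors into similarity measures.
standardised = (stats - stats.mean(axis=0)) / (stats.std(axis=0) + 1e-12)
```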
  • High-level features are defined as descriptors capturing the semantics of an audio sample. These features can model concepts such as mood, style and others.
  • the outputs of this process are classifiers of mood, style, instruments and others (the set can be extended indefinitely). Each classifier can be applied to get a set of probability estimations. For each audio sample, classification probabilities for all high-level features are computed and included in the audio profile.
  • the feature extraction process may group all audio descriptors (low-level and high-level feature vectors) into vectors, matrices or other data structures and store them in audio profiles. Audio profiles can be saved in memory, databases, files or any other available way to store data.
  • 4) a method for identifying similar features between audio segments.
  • the main objective of this system is to provide a good matching to a query.
  • the method to compute the matching is essential. Also, it has to be flexible enough to be adapted (manually or automatically) to as many use cases as possible.
  • the method employed is a mood matching similarity algorithm computing a measure used to compare instances and find the closest one (most similar).
  • the mood matching similarity measure is computed based on the extracted features contained in the audio profiles (both low-level and high-level).
  • This similarity measure is a weighted sum of several similarity measures. With a linear combination, a final similarity score is used to retrieve similar audio samples. This similarity measure can be customised with the coefficients of the linear combination for each component.
  • each descriptor is customisable with the coefficients of the weighted Pearson correlation measure on high-level features. This can be customised automatically, optimising the coefficients according to a set of rules and constraints.
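  • The sketch below shows one way a weighted Pearson correlation over high-level feature values could be computed; the feature names, values and weights are invented for illustration.

```python
import numpy as np

def weighted_pearson(a, b, w):
    """Pearson correlation between two high-level feature vectors,
    with one weight per descriptor (e.g. per mood/style probability)."""
    a, b, w = (np.asarray(x, dtype=float) for x in (a, b, w))
    w = w / w.sum()
    ma, mb = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - ma) * (b - mb))
    var_a = np.sum(w * (a - ma) ** 2)
    var_b = np.sum(w * (b - mb) ** 2)
    return cov / np.sqrt(var_a * var_b + 1e-12)

query   = [0.9, 0.1, 0.8, 0.2]    # e.g. happy, sad, rock, jazz probabilities
stored  = [0.8, 0.2, 0.7, 0.3]
weights = [2.0, 2.0, 1.0, 1.0]    # emphasise mood over genre in this use case
print(weighted_pearson(query, stored, weights))
```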
  • the similarity measure is computed and used to retrieve similar audio samples.
  • a Nearest Neighbour Search Algorithm can be used to find the most similar results.
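  • As a sketch of this retrieval step, scikit-learn's NearestNeighbors can index the stored audio profiles; the plain Euclidean metric here is a simplification of the weighted similarity measure described above, and the data is random.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

profiles = np.random.rand(1000, 32)              # hypothetical pre-processed feature vectors
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(profiles)

query = np.random.rand(1, 32)                    # seed query feature vector
distances, indices = index.kneighbors(query)
print(indices[0], distances[0])                  # ids and distances of the 5 closest segments
```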
  • the system can be used with a client/server approach.
  • the audio segment characterisations may be stored on a server, into a database, the audio segment characterisations containing both audio segments and metadata related to those segments.
  • the database is linked to the audio information and the server is able to stream audio data to the user.
  • a similarity query may be issued from any audio sample uploaded by the user or provided by any other means.
  • a method for extracting features from streaming audio in real time comprising a real-time features extractor.
  • the present invention may extend to a desktop application (or smart device application) which sits in the background and "listens" to what the user is listening to (e.g. YouTube, Spotify, iTunes Library), analyses a few seconds at a time, and sends real-time analysed data to a data store to find similar audio segments from a collection it is connected to.
  • Figure 1 shows a flow chart according to an embodiment of the present invention detailing how a user may search for audio samples
  • Figure 2 shows a flow chart according to an embodiment of the present invention that shows the search procedure of Figure 1 in more detail;
  • Figure 3 shows a flow chart according to an embodiment of the present invention detailing how a database of audio segment characterisations may be created
  • Figure 4 shows the process of segmenting an audio stream according to an embodiment of the present invention
  • Figure 5 shows the process according to an embodiment of the present invention of extracting low level audio features from an audio segment
  • Figure 6 shows the process according to an embodiment of the present invention of extracting high level audio features from an audio segment
  • Figure 7 shows the process of defining a feature vector for an audio segment according to an embodiment of the present invention
  • Figure 8 shows the process of comparing audio segment characterisations according to an embodiment of the present invention
  • Figure 9 illustrates the various components of an input audio signal and the associated storage within a database in accordance with an embodiment of the present invention.
  • the present invention is arranged to process an input audio signal and to match the input audio to one or more stored audio segments.
  • the input audio signal may comprise a complete music track, a part of a track or may be a constant input that is processed to match audio segments continuously (i.e. the method of the present invention may constantly "sniff" an input audio signal).
  • the input audio signal is also referred to as an "audio source" or "audio stream".
  • inputs to a search engine in accordance with embodiments of the present invention are variously referred to as "seed" inputs.
  • a "seed query" refers to an audio signal that has been processed to derive audio data.
  • (an input audio signal or "seed audio stream" may be processed to generate the seed query).
  • the seed query may not necessarily relate to a complete audio track but to a portion or segment thereof.
  • Generation of a "seed query" may therefore involve determining an audio segment which is then analysed to extract a number of feature vectors (both high level and low level features as mentioned above), the resulting audio segment being used as the seed query. Extracted feature vectors may also be used as the basis of a "seed query".
  • Low level features within the feature vectors described below relate to structural features of the input audio. "Low level" features and "low level feature vectors" are therefore equivalent terms to "structural parameter feature data" as used above. High level features within the feature vectors described below relate to semantic features (such as mood or genre). "High level" features and "high level feature vectors" are therefore equivalent terms to "semantic feature data" as used above.
  • audio segment characterisation refers to an (automatically) extracted audio segment and its associated high and low level feature vectors and any related metadata.
  • a similarity database may comprise a selection of audio segment characterisations.
  • Referring to FIG. 1, an example of the user experience of searching for audio samples according to an embodiment of the present invention is shown.
  • a seed audio stream is derived from an audio source (e.g. online streaming source or music library).
  • a similarity search is triggered (step 20) from the seed query into the similarity database of pre-processed audio segment characterisations. Similarity results are returned (step 30) by the audio segment characterisation database by order of relevance to the seed query and aligned according to the resulting clip starting point. Results afford sampling of matching audio segments in real time with fast audio loops. Any resulting audio segment can serve as the starting point for a new query (step 40).
  • the seed query takes the form of a segment of audio that has been processed to extract feature vectors.
  • An audio segment may be identified either by real-time analysis of a limited streaming window or by pre-processing a longer stream of audio with an automatic segmentation algorithm.
  • step 40 if a resulting audio segment characterisation is used as the starting point for a new query then the subsequent search will occur more quickly than the initial search because the seed query in the new search is using feature vector information retrieved from the database as part of the preceding search rather than having to process a new audio stream to derive such data.
  • FIG. 2 shows the process of searching in more detail.
  • An initial starting query comes, in step 50, from an audio stream (e.g. online streaming source or music library). It is noted that the audio stream is most likely an audio sample rather than a whole track. However, in some embodiments an entire track may be used.
  • if the audio stream is from an outside source (step 60), then the stream is segmented and similarity features are extracted in real time (step 70) to form the seed query (step 75).
  • the similarity features extracted in step 70 may take the form of one or more audio feature vectors.
  • the seed query is then sent to the similarity database for matching with pre-processed audio segments (step 20) and results are returned in step 30.
  • if the audio stream is from an inside source (step 80, i.e. the seed query has come from an earlier search as per step 40 above), then the corresponding audio feature vectors relating to the audio stream may be retrieved from an audio segment/audio segment characterisation database and matched to other audio segments from the database. Similarity measures may be performed and results sent back to the front end interface, aligned along the starting point of each matching audio segment.
  • Figure 3 relates to the process of creating a database of audio segment characterisations, that is to say a database of pre-processed feature data relating to audio segments that have been analysed and characterised for use in the search processes of Figures 1 and 2 above.
  • in step 85, an audio stream is used as a starting point for building the audio segment characterisation database.
  • the audio stream is segmented, in step 90, into audio segments using an automatic segmentation algorithm, indicating each segment start and end point.
  • Low level and high level similarity features are extracted, in step 100, from each segment and stored in the audio segment characterisation database (120) as an audio segment characterisation (the audio segment plus associated feature vectors).
  • the database is indexed, in step 110, according to each feature type for each audio segment characterisation.
  • the start and end points of the audio segment may be stored as part of the audio segment characterisation and the actual sound file (that the audible, playable component is within) may be located in a different location (e.g. the audio segment characterisation may be contained within the audio segment characterisation database and the original sound file may be stored within a third party's music library).
  • the term "audio segment characterisation" should be taken to encompass:
  • a similarity database where the audio segment characterisation comprises details of the start and end points of a segment within an audio track (the start/end points defining the audio segment) which is stored in a different location to the audio segment characterisation database and the feature vectors associated with the audio segment.
  • Figure 4 illustrates the process of audio segmentation. It is noted that the input audio stream (50, 85) may relate to a sampled audio stream (50) to be searched and matched against a database of pre-processed audio segment characterisations. However, Figure 4 would more commonly relate to the process of building the database of pre-processed audio segment characterisations and consequently the audio stream (85) in the description of Figure 4 relates to a new audio stream that is to be analysed and included in the audio segment characterisations database 120.
  • the process of Figure 4 aims to identify meaningful audio segments within the input audio stream by segmenting the input audio according to the most consistent groupings of musical features, along lines of significant transitions, including, but not exclusively, spectral distribution, tonal sequences, rhythm analysis, genre and mood.
  • the segmentation process comprises a feature extraction step 130 followed by a two stage segmentation determination process.
  • In step 130, the input audio stream (85) is analysed to extract features within the audio input. For example, the mood of the audio may be analysed. A rhythm analysis may also be performed. It is noted that the extraction step 130 may draw upon some or all of the low level and high level analysis techniques described in relation to Figures 5 and 6 below.
  • the input audio signal may be associated with an initial signal description according to the features that have been extracted.
  • In step 140, a first pass through the audio input signal is made to identify candidate segmentation points.
  • the signal description may be analysed throughout the time period (the length of the input audio signal) at a first level of temporal granularity of the input sample to identify changes in the signal description. For example, changes in mood, beat, harmony and rhythm descriptors may be identified.
  • Step 140 therefore represents a fast statistical analysis that shows time points with significant changes in the features indicating novelty and potential segment candidates.
  • Novelty curves may be computed in order to detect potential segment candidates.
  • Novelty curves are sequences of novelty estimations in time and may be computed using different techniques.
  • a moving window centred on the current analysed time may be used to compute a novelty estimation.
  • The two halves of the window can be used to train a binary classifier (such as, but not limited to, a Support Vector Machine).
  • the novelty estimation at this particular timestamp relies on the classifier's cross-validation value (how well the classifier can separate those two halves). At each timestamp, this process allows a novelty estimation to be computed.
  • the combined novelty estimations create a novelty curve. Multiple novelty curves may be analysed to combine different feature types.
  • a mood novelty curve (based on mood features) may be used to detect changes in mood, combined with a rhythm novelty curve (based on rhythm features) to detect changes in rhythm.
  • the segmentation process can be tuned to work on one or several aspects. Segment candidates can be identified and aggregated from both independent and combined novelty curves.
  • a peak detection algorithm may be performed on the novelty curve to identify novelty peaks and consequently detect segment candidates.
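  • A rough sketch of such classifier-based novelty estimation followed by peak picking is given below; the per-frame descriptors are random stand-ins, and scikit-learn/scipy are assumed tools rather than part of the method.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from scipy.signal import find_peaks

frames = np.random.rand(500, 13)       # hypothetical per-frame descriptors
half = 20                               # half-window length, in frames

novelty = np.zeros(len(frames))
for t in range(half, len(frames) - half):
    window = frames[t - half:t + half]
    labels = np.array([0] * half + [1] * half)          # left half vs right half
    scores = cross_val_score(SVC(kernel="linear"), window, labels, cv=4)
    novelty[t] = scores.mean()          # halves easy to separate -> high novelty

# Peaks in the novelty curve mark candidate segment boundaries.
peaks, _ = find_peaks(novelty, height=0.8, distance=half)
print("candidate boundaries (frame indices):", peaks)
```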
  • the output of Step 140 is a series of initial audio segment candidates. These are filtered in Step 150 to remove unlikely candidates.
  • the analysis in step 150 is a heuristics-based analysis that filters segment candidates according to a set of pre-determined rules that relate to audio processing in the context of musical tracks. For example, if the input audio sample has been identified as a "live" recording then the heuristics rules of step 150 may be designed to ignore segments containing clapping at the end of the track. Further rules may define the length of the introduction to the musical track (in other words the rules may prevent introductions from being too long relative to the length of the track) and may also define a minimum length of segment (e.g. no segments to be less than 3 seconds in length).
  • In step 160, a second pass through the audio input signal is made in order to fine tune the initial segmentation analysis of step 140.
  • This second pass comprises a granular statistical analysis that fixes the start and end points with precision.
  • the analysis in this step 160 may be at a second, finer level of time grain in order to fine tune the start and end points of the segment.
  • Steps 140 and 160 may therefore be seen as analogous to the process of searching through a video clip for the start of a scene.
  • An initial scan through the video at high speed may be made to identify the rough location for the start of the scene before being followed by a slower speed scan to identify the actual start point.
  • in step 180, the segments identified in step 170 may be cut according to the granular analysis, and individually labelled.
  • the audio segments (80) will form part of the audio segment characterisations in the database.
  • the audio segment characterisations will also be associated with low level and high level feature vectors as described in Figures 5 to 7 below.
  • in step 190, an audio stream is provided. It is noted that although the input audio at step 190 could comprise the whole of a track (i.e. the input audio could be the same as at step 85), it is preferred that the audio input at step 190 comprises a segment that has been identified in the process of Figure 4.
  • in step 200, the input audio segment is pre-processed.
  • Pre-processing may include sample-rate conversion or audio normalisation, to adjust audio streams to a consistent amplitude and time representation.
  • in step 210, a windowing process is undertaken in which the audio segment is subdivided further into audio frames (which are used as a unit of analysis).
  • Each frame 220 (sub-divided segment of audio) is described according to select parameters. For example:
  • temporal descriptors 230 measure levels of noise or loudness with RMS (root-mean-square).
  • spectral descriptors 240 require a Fourier transform to switch measurement from the temporal domain to the audio frequency domain. Spectral features map a change of bias, peaks or frequency.
  • tonal descriptors 250 analyse the chromagram of included chromas and extract tonal descriptors according to their distributions.
  • a chromagram is the map of the frequencies of the tonal scale.
  • exact rhythm descriptors 260 are drawn from spectral 240 and chroma 250 frequency representations.
  • a statistical summary (in step 270) of the descriptors is conducted by analysing spectral, tonal, temporal and rhythm values and indicating the mean, variance and derivatives; the values are then stored (step 280) as low level feature vectors that relate to the input audio segment.
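  • By way of example, the sketch below extracts a framewise temporal descriptor (RMS loudness) and a framewise spectral descriptor (FFT-based spectral centroid) with librosa and summarises them into a small low level feature vector; the descriptor choice and file name are assumptions for illustration.

```python
import numpy as np
import librosa

y, sr = librosa.load("segment.wav", sr=22050)    # hypothetical audio segment file

rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]          # temporal
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=512)[0]   # spectral

def stats(x):
    d = np.diff(x)
    return [x.mean(), x.var(), d.mean(), d.var()]  # mean, variance and derivatives

low_level_vector = np.array(stats(rms) + stats(centroid))
print(low_level_vector)
```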
  • the process of high level analysis is aided by previously trained classifiers with supervised learning algorithms for detection of high level concepts (e.g. mood, style), and application of the classifier model on the low level feature vectors.
  • This process requires some pre-processing of feature vectors (e.g. scaling of feature values for comparison).
  • the high level feature vectors are constructed from probability predictions derived from each classifier model.
  • any suitable supervised learning technique may be used in accordance with the process shown in Figure 6 in order to analyse the audio stream.
  • Support Vector Machine processes are currently preferred as the mechanism for modelling such high level features, but it is noted that other suitable processes may be used, e.g. artificial neural networks, decision trees, logistic regression etc.
  • the process of Figure 6 comprises two parts - an offline process 300 in which a supervised learning algorithm is used to build a classifier model and an online process 302 in which the classifier model is applied to the features that are input for analysis.
  • the offline process 300 comprises training the model with "ground truth" data in which examples of music in a plurality of categories are provided [the model may be trained using low level feature vectors extracted from music in accordance with the process of Figure 5]. Crowd-sourced data using semantic tagging (e.g. "happy", "sad" music) may also be incorporated at this point for training the model with the supervised learning algorithm. Once the classifier model is built, any audio segment may be input into the model in order to determine subjective features regarding the music.
  • the output of the objective analysis of Figure 5 is input (step 304) for analysis by the classifier model.
  • a pre-processing step 306 normalises the input data and selects certain low level features for analysis.
  • the classifier model analyses the low level features and outputs a number of predictions, e.g. the likely style of the music (represented as a probability), the likely mood of the music (again represented as a probability) and the likely instruments that appear in the music segment (again represented as a probability).
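  • A minimal sketch of this offline/online split using scikit-learn is shown below; the Support Vector Machine follows the preference stated above, but the mood class names and the random training data are placeholders rather than the system's actual training sets.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

classes = ["happy", "sad", "aggressive", "relaxed"]             # hypothetical mood labels
X_train = np.random.rand(400, 24)                                # low level feature vectors
y_train = np.random.randint(len(classes), size=400)

# Offline: train the classifier model (scaling plays the pre-processing role of step 306).
model = make_pipeline(StandardScaler(), SVC(probability=True))
model.fit(X_train, y_train)

# Online: apply the model to low level features of a new segment (steps 304-310).
new_segment = np.random.rand(1, 24)
probs = model.predict_proba(new_segment)[0]
high_level_vector = dict(zip(classes, probs))                    # probability per class
print(high_level_vector)
```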
  • in step 310, a series of high level feature vectors is output.
  • Figure 7 illustrates how the low level (280, 404) and high level feature vectors (310, 406) from the processes of Figures 5 and 6 may be combined into a single feature vector that describes the audio segment that was originally input into the process of Figure 5 (the audio segment in turn being the segment identified by the process of Figure 4).
  • the methodology comprises extracting low-level features and generating low level feature vectors. These allow extraction of high level features and generating high level feature vectors.
  • the feature vectors are then concatenated (step 320) and final values are presented.
  • Final values are stored into a feature vector containing all the descriptors.
  • Feature vectors can be stored in different formats including binary or text files. Below is an example of a simplified text file representing a single feature vector. It contains the low-level and high-level feature values of an audio segment. As noted above, the combination of an audio component (audio segment), or a reference to it, and related feature vectors is stored as an audio segment characterisation (410).
  • genre_rock: 0.85 genre_metal: 0.99
  • the low level feature vectors comprise all the entries up to, but not including, the "mood", "instrument" and "genre" entries.
  • the high level feature vectors comprise the "mood", "instrument" and "genre" entries.
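  • Purely as an illustration of how such a profile might be serialised and read back, the sketch below writes "name: value" lines; the exact file format is only shown in fragment above, so this layout and the feature names other than the genre entries are assumptions.

```python
low_level = {"tempo": 124.0, "spectral_centroid_mean": 1834.2}    # hypothetical low-level values
high_level = {"mood_happy": 0.31, "genre_rock": 0.85, "genre_metal": 0.99}

profile = {**low_level, **high_level}                  # concatenated feature values (step 320)

with open("segment_0001.profile", "w") as fh:          # hypothetical file name
    for name, value in profile.items():
        fh.write(f"{name}: {value}\n")

# Reading the profile back into a dictionary of feature values:
with open("segment_0001.profile") as fh:
    restored = {name: float(value) for name, value in (line.split(":") for line in fh)}
print(restored)
```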
  • Similarity is measured between feature vectors A and feature vectors B using a similarity algorithm.
  • the similarity algorithm is based on measuring A and B vector values within each feature group (e.g. mood, style, rhythm or tone) and weighting each similarity measure according to its relevance to the application. The weighting allows for the algorithm to be customised and adaptable to a variety of user scenarios. The fusion of values generates an overall similarity measure.
  • Figure 8 illustrates the process of comparing the feature vectors (feature vector A, 330) of a sampled audio segment (e.g. a seed audio stream 50, 400) with the feature vectors (feature vector B, 332) of a pre-processed audio segment in an audio segment characterisations database (120). It is noted that the seed segment (feature vector A) would be compared in a pair-wise manner with all the available segments within the database. It is noted however that multiple such comparisons could be conducted at once in order to speed the process. A similarity algorithm 334 then outputs a similarity measure 336.
  • the overall comparison is shown on the left hand side of Figure 8.
  • the process is shown in more detail on the right hand side of Figure 8 where the similarity computation, 334a, 334b, is shown being performed against different comparison groups 330a, 330b, 332a, 332b (e.g. tempo or spectrum could be compared from the low level feature vectors).
  • the results of the two similarity measures 334a and 334b are fused together in step 338 to provide the similarity measure output 336.
  • three feature groups are considered and compared: mood (high level feature), tone (a low level feature) and style (high level).
  • the weighting of these groups may be done automatically, by providing a set of constraints and using a parameter optimisation algorithm (such as, but not limited to, grid search), each parameter being the weight of each feature group.
  • Constraints are a set of rules defining positive and negative results to evaluate the search algorithm results.
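  • The following sketch illustrates one way such automatic weighting could be realised with a simple grid search: candidate weights for the mood, tone and style groups are scored against constraint pairs known to be similar or dissimilar. The group similarities, threshold and constraint pairs are invented for illustration.

```python
import numpy as np
from itertools import product

def fused_similarity(group_sims, weights):
    """Weighted sum of per-group similarity measures (mood, tone, style)."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(np.asarray(group_sims), w) / w.sum())

# Hypothetical constraints: per pair, the (mood, tone, style) similarities and a
# flag saying whether the pair should count as a positive (similar) result.
constraints = [
    (np.array([0.9, 0.4, 0.8]), True),
    (np.array([0.2, 0.7, 0.1]), False),
    (np.array([0.8, 0.3, 0.9]), True),
]

def score(weights):
    # A weight setting is good if positive pairs score above 0.5 and negative pairs below.
    return sum((fused_similarity(sims, weights) > 0.5) == positive
               for sims, positive in constraints)

grid = product([0.5, 1.0, 2.0], repeat=3)              # candidate weights per feature group
best = max(grid, key=score)
print("best (mood, tone, style) weights:", best)
```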
  • Real time analysis is aided by progress indicators at the front end and optimisation at the back end, for example: fast-forwarding the analysis of audio to the point where the results are still intelligible; and changing the parameters of the windowing process (a window is a small segment of audio data used as a unit of analysis).
  • Figure 9 shows the components of an input audio signal 400. As described above the input signal 400 is segmented into a number of audio segments 402 from which low level feature data 404 is extracted. High level feature data 406 is then derived from the extracted low level data.
  • the processed audio signal may be used to populate a data store/database 408.
  • the feature data 404, 406 may be stored as part of an audio segment characterisation entry 410.
  • the audio segment characterisation 410 may further comprise a segment identifier 412 to identify the audio file from which the audio segment derives and other meta data 414.
  • the other meta data 414 may comprise start/end times to identify where the audio segment is located within the complete audio file (that contains the audio segment)
  • the audio file which contains the audio segment in question may be stored in a second data store (not shown) in which case the other meta data 414 may also comprise a hyperlink or other suitable link to the audio file that contains the audio segment in question.
  • Box 10: seed audio stream (candidate identified from an outside source or from available audio segment characterisations)
  • Box 70: similarity features (extract audio features from audio segments)

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method of matching an input audio signal to one or more audio segments within a plurality of audio segments, the method comprising: receiving the input audio signal; processing the input audio signal to determine structural parameter feature data relating to the received input audio signal; analysing the determined structural parameter feature data to extract semantic feature data; comparing the feature data of the input audio signal to pre-processed feature data relating to the plurality of audio segments in order to match one or more audio segments within a similarity threshold of the input audio signal; and outputting a search result on the basis of the matched one or more audio segments, the semantic feature data being extracted from the structural parameter data by means of a supervised learning technique.
PCT/GB2013/053362 2012-12-19 2013-12-19 Système d'analyse audio et procédé utilisant une caractérisation de segment audio WO2014096832A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1512636.0A GB2523973B (en) 2012-12-19 2013-12-19 Audio analysis system and method using audio segment characterisation

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GBGB1222951.4A GB201222951D0 (en) 2012-12-19 2012-12-19 Audio analysis system and method
GB1222951.4 2012-12-19
GB201312399A GB201312399D0 (en) 2013-07-10 2013-07-10 Audio analysis system and method
GB1312399.7 2013-07-10

Publications (1)

Publication Number Publication Date
WO2014096832A1 true WO2014096832A1 (fr) 2014-06-26

Family

ID=49998568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2013/053362 WO2014096832A1 (fr) 2012-12-19 2013-12-19 Système d'analyse audio et procédé utilisant une caractérisation de segment audio

Country Status (2)

Country Link
GB (1) GB2523973B (fr)
WO (1) WO2014096832A1 (fr)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016003920A1 (fr) * 2014-06-29 2016-01-07 Google Inc. Dérivation de score probabiliste pour un alignement de séquences audio
WO2017214408A1 (fr) * 2016-06-09 2017-12-14 Tristan Jehan Identification d'un contenu multimédia
EP3267668A3 (fr) * 2016-07-05 2018-03-28 Dialogtech Inc. Système et procédé de détection automatique des appels indésirables
US10055413B2 (en) 2015-05-19 2018-08-21 Spotify Ab Identifying media content
EP3340238A4 (fr) * 2015-05-25 2019-06-05 Guangzhou Kugou Computer Technology Co., Ltd. Procédé et appareil de traitement audio, et terminal
US10372757B2 (en) 2015-05-19 2019-08-06 Spotify Ab Search media content based upon tempo
US11113346B2 (en) 2016-06-09 2021-09-07 Spotify Ab Search media content based upon tempo
CN114005464A (zh) * 2021-11-04 2022-02-01 深圳万兴软件有限公司 一种节拍速度估测方法、装置、计算机设备及存储介质
CN114205677A (zh) * 2021-11-30 2022-03-18 浙江大学 一种基于原型视频的短视频自动编辑方法
US20220238087A1 (en) * 2019-05-07 2022-07-28 Moodagent A/S Methods and systems for determining compact semantic representations of digital audio signals
CN114928755A (zh) * 2022-05-10 2022-08-19 咪咕文化科技有限公司 一种视频制作方法、电子设备及计算机可读存储介质
US20220310051A1 (en) * 2019-12-20 2022-09-29 Netease (Hangzhou) Network Co.,Ltd. Rhythm Point Detection Method and Apparatus and Electronic Device
US20230073174A1 (en) * 2021-07-02 2023-03-09 Brainfm, Inc. Neurostimulation Systems and Methods
US20240338408A1 (en) * 2023-04-04 2024-10-10 Roblox Corporation Digital content management in virtual environments

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060075886A1 (en) * 2004-10-08 2006-04-13 Markus Cremer Apparatus and method for generating an encoded rhythmic pattern
US20070240557A1 (en) * 2006-04-12 2007-10-18 Whitman Brian A Understanding Music
US20080300702A1 (en) * 2007-05-29 2008-12-04 Universitat Pompeu Fabra Music similarity systems and methods using descriptors

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100749045B1 (ko) * 2006-01-26 2007-08-13 삼성전자주식회사 음악 내용 요약본을 이용한 유사곡 검색 방법 및 그 장치
TW201022968A (en) * 2008-12-10 2010-06-16 Univ Nat Taiwan A multimedia searching system, a method of building the system and associate searching method thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060075886A1 (en) * 2004-10-08 2006-04-13 Markus Cremer Apparatus and method for generating an encoded rhythmic pattern
US20070240557A1 (en) * 2006-04-12 2007-10-18 Whitman Brian A Understanding Music
US20080300702A1 (en) * 2007-05-29 2008-12-04 Universitat Pompeu Fabra Music similarity systems and methods using descriptors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CASEY M A ET AL: "Content-Based Music Information Retrieval: Current Directions and Future Challenges", PROCEEDINGS OF THE IEEE, IEEE. NEW YORK, US, vol. 96, no. 4, 2 April 2008 (2008-04-02), pages 668 - 696, XP011206028, ISSN: 0018-9219 *
MICHELA MAGAS: "Michela Magas on music search and discovery -YouTube", BIG AWARDS' AT RAVENSBOURNE COLLEGE, GREENWICH, LONDON, 6 March 2012 (2012-03-06), Internet, XP055107138, Retrieved from the Internet <URL:http://www.youtube.com/watch?v=aCcJ2r0DPyI> [retrieved on 20140312] *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663110B (zh) * 2014-06-29 2020-09-15 谷歌有限责任公司 音频序列对准的概率评分的导出
US9384758B2 (en) 2014-06-29 2016-07-05 Google Inc. Derivation of probabilistic score for audio sequence alignment
CN106663110A (zh) * 2014-06-29 2017-05-10 谷歌公司 音频序列对准的概率评分的导出
WO2016003920A1 (fr) * 2014-06-29 2016-01-07 Google Inc. Dérivation de score probabiliste pour un alignement de séquences audio
EP3161689A4 (fr) * 2014-06-29 2018-03-07 Google LLC Dérivation de score probabiliste pour un alignement de séquences audio
US11048748B2 (en) 2015-05-19 2021-06-29 Spotify Ab Search media content based upon tempo
US10055413B2 (en) 2015-05-19 2018-08-21 Spotify Ab Identifying media content
US10372757B2 (en) 2015-05-19 2019-08-06 Spotify Ab Search media content based upon tempo
EP3340238A4 (fr) * 2015-05-25 2019-06-05 Guangzhou Kugou Computer Technology Co., Ltd. Procédé et appareil de traitement audio, et terminal
US10984035B2 (en) 2016-06-09 2021-04-20 Spotify Ab Identifying media content
WO2017214408A1 (fr) * 2016-06-09 2017-12-14 Tristan Jehan Identification d'un contenu multimédia
US11113346B2 (en) 2016-06-09 2021-09-07 Spotify Ab Search media content based upon tempo
US12032639B2 (en) 2016-06-09 2024-07-09 Spotify Ab Search media content based upon tempo
US12032620B2 (en) 2016-06-09 2024-07-09 Spotify Ab Identifying media content
US10194022B2 (en) 2016-07-05 2019-01-29 Dialogtech Inc. System and method for automatically detecting undesired calls
EP3267668A3 (fr) * 2016-07-05 2018-03-28 Dialogtech Inc. Système et procédé de détection automatique des appels indésirables
US20220238087A1 (en) * 2019-05-07 2022-07-28 Moodagent A/S Methods and systems for determining compact semantic representations of digital audio signals
US20220310051A1 (en) * 2019-12-20 2022-09-29 Netease (Hangzhou) Network Co.,Ltd. Rhythm Point Detection Method and Apparatus and Electronic Device
US12033605B2 (en) * 2019-12-20 2024-07-09 Netease (Hangzhou) Network Co., Ltd. Rhythm point detection method and apparatus and electronic device
US20230073174A1 (en) * 2021-07-02 2023-03-09 Brainfm, Inc. Neurostimulation Systems and Methods
CN114005464A (zh) * 2021-11-04 2022-02-01 深圳万兴软件有限公司 一种节拍速度估测方法、装置、计算机设备及存储介质
CN114205677A (zh) * 2021-11-30 2022-03-18 浙江大学 一种基于原型视频的短视频自动编辑方法
CN114928755A (zh) * 2022-05-10 2022-08-19 咪咕文化科技有限公司 一种视频制作方法、电子设备及计算机可读存储介质
CN114928755B (zh) * 2022-05-10 2023-10-20 咪咕文化科技有限公司 一种视频制作方法、电子设备及计算机可读存储介质
US20240338408A1 (en) * 2023-04-04 2024-10-10 Roblox Corporation Digital content management in virtual environments

Also Published As

Publication number Publication date
GB2523973A (en) 2015-09-09
GB201512636D0 (en) 2015-08-26
GB2523973B (en) 2017-08-02

Similar Documents

Publication Publication Date Title
WO2014096832A1 (fr) Système d'analyse audio et procédé utilisant une caractérisation de segment audio
Fu et al. A survey of audio-based music classification and annotation
Serra et al. Unsupervised music structure annotation by time series structure features and segment similarity
Gillet et al. On the correlation of automatic audio and visual segmentations of music videos
Fuhrmann et al. Polyphonic instrument recognition for exploring semantic similarities in music
Tzanetakis Song-specific bootstrapping of singing voice structure
Giannakopoulos Study and application of acoustic information for the detection of harmful content and fusion with visual information
Ghosal et al. Song/instrumental classification using spectrogram based contextual features
Álvarez et al. Feature subset selection based on evolutionary algorithms for automatic emotion recognition in spoken spanish and standard basque language
Kruspe et al. Automatic speech/music discrimination for broadcast signals
Siddiquee et al. An Effective Machine Learning Approach for Music Genre Classification with Mel Spectrograms and KNN
Nagavi et al. An extensive analysis of query by singing/humming system through query proportion
KR20200118587A (ko) 음악의 내재적 정보를 이용한 음악 추천 시스템
Dandashi et al. A survey on audio content-based classification
Patil et al. Content-based audio classification and retrieval: A novel approach
Yang Towards real-time music auto-tagging using sparse features
Peiris et al. Musical genre classification of recorded songs based on music structure similarity
Zhang et al. Automatic generation of music thumbnails
Peiris et al. Supervised learning approach for classification of Sri Lankan music based on music structure similarity
Aurchana et al. Musical instruments sound classification using GMM
Fuhrmann et al. Quantifying the Relevance of Locally Extracted Information for Musical Instrument Recognition from Entire Pieces of Music.
Simas Filho et al. Genre classification for brazilian music using independent and discriminant features
Chmulik et al. Continuous Music Emotion Recognition Using Selected Audio Features
Hsu et al. Deep Learning Based EDM Subgenre Classification using Mel-Spectrogram and Tempogram Features
Karunarathna et al. Classification of voice content in the context of public radio broadcasting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13821908

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 1512636

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20131219

WWE Wipo information: entry into national phase

Ref document number: 1512636.0

Country of ref document: GB

122 Ep: pct application non-entry in european phase

Ref document number: 13821908

Country of ref document: EP

Kind code of ref document: A1