GB2615321A - Methods and systems for analysing an audio track - Google Patents

Methods and systems for analysing an audio track

Info

Publication number
GB2615321A
Authority
GB
United Kingdom
Prior art keywords
audio
section transition
audio data
audio track
computer
Prior art date
Legal status
Pending
Application number
GB2201361.9A
Other versions
GB202201361D0 (en)
Inventor
Oliver Marsh Thomas
Current Assignee
Altered States Technologies Ltd
Original Assignee
Altered States Technologies Ltd
Priority date
Filing date
Publication date
Application filed by Altered States Technologies Ltd filed Critical Altered States Technologies Ltd
Priority to GB2201361.9A
Publication of GB202201361D0
Priority to PCT/EP2023/052613 (WO2023148299A1)
Publication of GB2615321A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/005 Reproducing at a different information rate from the information rate of recording
    • G11B 27/007 Reproducing at a different information rate from the information rate of recording, reproducing continuously a part of the information, i.e. repeating
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/102 Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B 27/34 Indicating arrangements
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 Querying
    • G06F 16/638 Presentation of query results

Abstract

A computer-implemented method and system for identifying section transition boundaries (224a, 224a', 224a'', fig. 2d) within an audio track, comprising: calculating first 204a and second 204b statistical analyses at a plurality of timesteps 206, the statistical analyses being performed on frequency bands within a specific energy band, where the first and second analyses are in different energy bands; identifying first 208a and second 208b sets of section transition markers by chronologically comparing the first and second statistical analyses respectively across these multiple timesteps; and defining a master set of section transition markers 210, each of which is present in both prior sets of section transition markers. The audio track may be sampled, or passed through a low-pass filter or band-pass filter to remove drums and rhythmic parts, to provide the musical audio information on which the algorithm is applied. The master set of section transition boundaries can be combined with a set of hitpoints (e.g. from a drum or a musical note being played) to find the hitpoints closest to the transition boundaries. Hitpoints may be found by autocorrelation. The transition markers may be used to allow a user to skip or loop within a music track (figs. 6a-6e) using a user interface (fig. 6a).

Description

METHODS AND SYSTEMS FOR ANALYSING AN AUDIO TRACK
Technical field
This disclosure relates to methods of analysing an audio track and, in particular, to a computer-implemented method of identifying section transition boundaries within an audio track.
Background
Current methods of consuming recorded music through conventional music playback applications (for example, cloud-based audio streaming platforms) involve listening to artist-released tracks, which comprise a continuous audio file spanning the duration of a recording.
These conventional music playback applications allow a user to control playback of the tracks via music player controls for play, pause, next track, previous track, return to start of current track, rewind, fast-forward, shuffle current playlist, loop current playlist, and loop current track. The music player controls may additionally include a representation of a timeline which indicates the current playback position and allows a user to scroll through the track.
However, conventional music playback applications provide no simple controls to enable playback to skip to or loop a specific, musically significant section (such as a specific verse or chorus). Accordingly, were a user to wish to skip to or loop a specific section of a track, they would need to scroll through the track manually to find that section themselves. Further, a user would not be able to loop that specific section without manually scrolling or rewinding to the start of the section on each pass.
Furthermore, conventional music playback applications do not allow a user to manipulate the ordering of sections or export a singular section within a song. For example, a user may wish to restructure a song to change its playback duration or to change its section playback order.
Alternatively, a user may wish to skip to or loop a specific section in order to learn how to play that section on an instrument or to play that section back in isolation.
A significant technological hurdle preventing conventional music playback applications from providing simple controls for skipping to or looping specific sections of a track, manipulating the ordering of sections, and exporting singular sections within a song is that it is difficult to identify the section boundaries within an audio track automatically and accurately without user input. Such user input is expensive and time-consuming. Due to the huge variety in types of music, general audio quality, and production, mixing and mastering techniques, there appears to be no concrete "one-size-fits-all" solution. Current best practice is a manual approach based on listening to the music and manually keying in points where there is a section boundary, or instead inspecting the signal shape in a digital audio workstation (DAW) to assess where changes in the signal shape occur.
DAWs are software used for recording, editing and producing audio files. DAWs allow multiple audio files to be manipulated as a project and then exported together as a single, master audio file. DAWs may enable audio files to be manipulated in any number of ways. For instance, DAWs allow a user to loop a specific region within a project or to skip playback to predetermined locations across a project. However, in order to do so, the user must manually define the boundaries of the loop or the location to which playback should skip. The user may do so by viewing the project, or waveform of a specific audio file within the project, to identify likely section transition boundaries by ear or by eye, using a certain amount of trial and error in the process. These manual methods of identifying section transition boundaries are time-consuming and often unreliable as the waveform may not appear to comprise a perceptible change at each section transition boundary.
Therefore, a need exists for an automated method of locating and categorising different sections of a musical piece without manual input, which would then permit the user to quickly manipulate the song arrangement, skip unwanted sections or even loop preferred sections. Such a development would be analogous to moving from a standard tape cassette to a CD when skipping to the next song (i.e. moving from scanning forward and stopping a few times until roughly the next song is found, to simply pressing a button). The difference is that this step is now applied within the song, or audio file, itself.
Furthermore, there is a need for applying an automated method of locating and categorising different sections of a musical piece in a music playback application.
Summary
According to a first aspect, there is provided a computer-implemented method of identifying section transition boundaries within an audio track. The audio track comprises first audio data. The first audio data comprises musical information across a plurality of discrete frequency bands and across a plurality of timesteps spanning a duration of the audio track. The method comprises: calculating, at a plurality of timesteps, a first statistical analysis of frequency bands in the plurality of discrete frequency bands within a first energy band of the first audio data, and a second statistical analysis of frequency bands in the plurality of discrete frequency bands within a second energy band of the first audio data. The second energy band is different from the first energy band. The method further comprises: identifying a first set of section transition markers by chronologically comparing the first statistical analysis across multiple timesteps; identifying a second set of section transition markers by chronologically comparing the second statistical analysis across multiple timesteps; and defining a master set of section transition markers comprising section transition markers which are present in both the first set of section transition markers and the second set of section transition markers.
The first and second statistical analyses may each return a quantity for a given timestep that represents the occupancy, fullness or frequency band occupancy of the respective energy band within which the respective statistical analysis is calculated. Additionally, section transition markers may be considered to be present in both the first set of section transition markers and the second set of section transition markers when a marker in the first set of section transition markers is within a threshold temporal distance from a marker in the second set of section transition markers.
Optionally, the master set of section transition markers comprises a set of section transition boundaries. Alternatively, the method may further comprise: prior to calculating the first statistical analysis and the second statistical analysis, calculating a temporal location of a set of hitpoints within the first audio data, and after defining a master set of section transition markers, calculating a set of section transition boundaries, the set of section transition boundaries comprising hitpoints from the set of hitpoints that are temporally closest to each section transition marker in the master set of section transition markers.
Optionally, the first statistical analysis and the second statistical analysis comprise a summation which may be a weighted summation.
Optionally, chronologically comparing the first statistical analysis across multiple timesteps comprises identifying a difference of the first statistical analysis across succeeding timesteps that exceeds a first threshold difference. Here, chronologically comparing the second statistical analysis across multiple timesteps comprises identifying a difference of the second statistical analysis across succeeding timesteps that exceeds a second threshold difference.
Optionally, the method further comprises, prior to calculating the first statistical analysis and the second statistical analysis, applying a lowpass filter to the first audio data. This lowpass filter may have a cut-off between 6 kHz and 11 kHz. This lowpass filter removes noise, high-frequency distortion and repeating rhythmical components (such as drum signals).
Optionally, the first energy band comprises a high-energy level band and a second energy band comprises a mid-energy level band. Alternatively, the first energy band may comprise a high-energy level band and a second energy band may comprise a high-energy level band. Alternatively, the first energy band may comprise a mid-energy level band and a second energy band may comprise a mid-energy level band.
Optionally, the musical information comprises a set of samples positioned sequentially to span the duration of the audio track.
Optionally, calculating the temporal location of the set of hitpoints within the audio data comprises: filtering the first audio data to produce a filtered audio data, and identifying local maxima in a third statistical analysis of frequency bands in the plurality of discrete frequency bands within a third energy band.
Optionally, filtering the first audio data comprises applying a band-pass filter to the first audio data.
Optionally, the filtered audio data comprises a substantially isolated rhythmically significant audio component. Here, filtering the first audio data comprises isolating a rhythmically significant audio component. An audio component may be considered to be rhythmically significant if it comprises substantially rhythmically repetitive elements. For example, the rhythmically significant audio component may comprise a drum stem.
Optionally, the method further comprises calculating a predominant repetition timescale. The predominant repetition timescale may be calculated by autocorrelating the set of hitpoints. Alternatively, the predominant repetition timescale may be calculated by autocorrelating the first audio data. In either case, an autocorrelation is generated comprising a set of autocorrelation peaks. Here, calculating the predominant repetition timescale further comprises defining the predominant repetition timescale as the time-lag between the time-zero peak and the strongest peak different from the time-zero peak in the set of autocorrelation peaks.
Optionally, the method further comprises checking the set of section transition boundaries by chronologically segmenting the first audio data into a plurality of succeeding temporal slices. Here, the start position of each slice is an integer number of predominant repetition timescales away from the first entry in the set of hitpoints. Here, the method further comprises, for each slice in the plurality of succeeding slices, calculating a statistical comparison of the energy content per frequency band of one slice and the energy content per frequency band of the next succeeding slice, and updating the set of section transition boundaries to include only entries which correspond to succeeding chronological slices at which the statistical comparison exceeds a threshold amount. Entries may correspond to succeeding chronological slices at which the statistical comparison exceeds a threshold amount when the entries fall within either of the succeeding chronological slices.
Optionally, each slice in the plurality of succeeding slices has a duration equal to a power of two multiple of the predominant repetition timescale, optionally the predominant repetition timescale, half the predominant repetition timescale, or double the predominant repetition timescale.
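By way of illustration only, the following Python sketch shows one way such a slice-by-slice check could be implemented. It is not the claimed method itself: the function and parameter names, the cosine-distance comparison and the 0.3 threshold are illustrative assumptions, and the slice length `period` may be the predominant repetition timescale or a power-of-two multiple of it as described above.

    import numpy as np

    def check_boundaries(power_spec, times, first_hit, period, boundaries,
                         threshold=0.3):
        # Slices start an integer number of `period`s after the first
        # hitpoint `first_hit`, and each slice spans one `period`.
        starts = np.arange(first_hit, times[-1] - period, period)
        profiles = []
        for s in starts:                      # energy content per frequency band
            mask = (times >= s) & (times < s + period)
            profiles.append(power_spec[:, mask].sum(axis=1))
        profiles = np.asarray(profiles)

        kept = []
        for b in boundaries:
            i = int((b - first_hit) // period)    # slice containing boundary b
            if 1 <= i < len(profiles):
                p, q = profiles[i - 1], profiles[i]
                cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)
                if 1.0 - cos > threshold:     # succeeding slices differ enough
                    kept.append(b)
        return kept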
Optionally, the method may further comprise repeating the method with a different audio data of the audio track. For instance, the audio track may further comprise second audio data, the second audio data comprising musical information across an additional plurality of discrete frequency bands and across the plurality of timesteps spanning the duration of the audio track.
Here, the method further comprises: calculating a temporal location of an additional set of hitpoints within the second audio data; calculating, at a plurality of timesteps: an additional first statistical analysis of frequency bands in the plurality of discrete frequency bands within a first energy band of the second audio data, and an additional second statistical analysis of frequency bands in the plurality of discrete frequency bands within a second energy band of the second audio data. The second energy band of the second audio data is different from the first energy band of the second audio data. The method further comprises: identifying an additional first set of section transition markers by chronologically comparing the additional first statistical analysis across multiple timesteps; identifying an additional second set of section transition markers by chronologically comparing the additional second statistical analysis across multiple timesteps; defining an additional master set of section transition markers comprising section transition markers which are present in both the additional first set of section transition markers and the additional second set of section transition markers; and updating the set of section transition boundaries. The updated set of section transition boundaries comprising an additional plurality of hitpoints from the additional set of hitpoints that are temporally closest to each section transition marker in the additional second set of section transition markers.
Optionally, the first audio data defines an audio component, and the second audio data defines a different audio component. Alternatively, the audio track may comprise a stereo audio track, in which case the first audio data may comprise a normalised sum of the left channel of the stereo audio track and the right channel of the stereo audio track and the second audio data may comprise a normalised difference of the left channel of the audio track and the right channel of the audio track. Alternatively still, the first audio data may comprise a filtered subset of the audio track and the second audio data may comprise a differently filtered subset of the audio track.
According to a second aspect, there is provided a computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output. The audio track comprises a plurality of temporally sequential sections defined by section transition boundaries. The section transition boundaries are identified by the computer-implemented method of the first aspect. The method comprises: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to skip to a temporally adjacent section of the audio track; and responsive to receiving the input from the user, skipping the presentation of the audio track at the audio output to the temporally adjacent section transition boundary.
According to a third aspect, there is provided a computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output. The audio track comprises a plurality of temporally sequential sections defined by section transition boundaries. The section transition boundaries are identified by the computer-implemented method of the first aspect. The method comprises: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to loop the playback of a section of the audio track; and responsive to receiving the input from the user, looping the playback back to the beginning of the section on arrival at the end of the section.
According to a fourth aspect, there is provided a computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output. The audio track comprises a plurality of temporally sequential sections defined by section transition boundaries. The section transition boundaries are identified by the computer-implemented method of the first aspect. The method comprises: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to reposition, remove or duplicate a section of the plurality of sections temporally across the duration of the audio track; responsive to receiving the input from the user, temporally repositioning, removing or duplicating the section across the duration of the audio track.
Optionally, the method further comprises: subsequent to temporally repositioning, removing or duplicating the section across the duration of the audio track, receiving a save input from a user via the input mechanism to save the audio track; and, responsive to receiving the save input from the user, saving the audio track which includes the temporally repositioned, removed or duplicated sections.
According to a fifth aspect, there is provided a computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output. The audio track comprises a plurality of temporally sequential sections defined by section transition boundaries. The section transition boundaries are identified by the computer-implemented method of the first aspect. The method comprises: receiving an input from a user via the input mechanism to save a section of the plurality of sections within the audio track; and responsive to receiving the input from the user, saving the section of the plurality of sections.
According to a sixth aspect, there is provided a computer readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method of any other aspect.
According to a seventh aspect, there is provided a system with an internal memory, a processor, an input mechanism, a display and an audio output. The processor is configured to perform the method of any other aspect.
Brief Description of the drawings
Specific implementations of the present disclosure are described below in the detailed description by way of example only and with reference to the accompanying drawings, in which:
Figure 1a illustrates an example horizontal, temporal structure of an audio file including a plurality of sections;
Figure 1b illustrates an example vertical, layered structure of an audio file comprising many audio data;
Figure 2a depicts a method of identifying section transition boundaries within an audio track;
Figure 2b illustrates a waveform of audio data;
Figure 2c depicts plots of two-dimensional arrays produced by calculating a statistical analysis on the waveform of figure 2b;
Figure 2d depicts plots of the moving average of the plots of figure 2c;
Figure 3 depicts a method of calculating a location of a set of hitpoints within audio data;
Figure 4a illustrates an autocorrelation of audio data of an audio track comprising a regular tempo;
Figure 4b illustrates an autocorrelation of audio data of an audio track comprising an irregular tempo;
Figure 4c depicts a method of checking section transition boundaries based on a predominant repetition timescale;
Figure 5 depicts an alternative method of checking section transition boundaries;
Figure 6a illustrates example section transport controls which may be provided within a GUI;
Figure 6b illustrates alternative example section transport controls which may be provided within a GUI;
Figure 6c illustrates elements of the example section transport controls of figure 6b after a section has been muted;
Figure 6d illustrates elements of the example section transport controls of figure 6b after sections have been deleted;
Figure 6e illustrates elements of the example section transport controls of figure 6b after sections have been duplicated and reordered; and
Figure 7 illustrates a block diagram of one implementation of a computing device.
Like reference numerals are used for like components throughout the drawings.
Detailed Description
In overview, and without limitation, the application discloses a computer-implemented method of identifying section transition boundaries within an audio track. The audio track is formed by first audio data and optionally second audio data, third audio data, fourth audio data, and so on. For example, each audio data may comprise a musical stem, e.g. a single audio source or a group of audio sources mixed together. Each audio data comprises musical information which spans a plurality of discrete frequency bands (or, in other words, a continuous range of frequencies across a plurality of discrete frequency bands) and a plurality of timesteps. The timesteps may span the duration of the audio track.
The section transition boundaries are identified by calculating, at a plurality of timesteps, a plurality of statistical analyses, each within a discrete energy band, of frequency bands of the first audio data. For instance, the statistical analyses may comprise a sum, or weighted sum, of the number of frequency bands which fall within the respective energy band. For each discrete energy band, the statistical analysis of each sequential timestep of the audio track is then compared and, based on that comparison, a plurality of sets of section transition markers are identified, each of which corresponds to a single statistical analysis. For example, the comparison may comprise identifying locations of local maxima in the rate of change of the statistical analysis across the plurality of timesteps. In this example, a set of section transition markers is identified as the locations of the local maxima in the rate of change of the statistical analysis.
Next, a master set of section transition markers is defined, comprising section transition markers which are present across at least two of the plurality of sets of section transition markers. For instance, section transition markers may be considered present in at least two of the plurality of sets if they are within a threshold temporal distance of each other. Finally, the set of section transition boundaries is calculated as the hitpoints (for instance, the transients) of the audio data that are temporally closest to each entry in the master set of section transition markers.
Accordingly, the present disclosure enables an efficient, reliable and more universally applicable automated method of locating and categorising different sections of a musical piece without manual input. Furthermore, the automated method can be effectively and flexibly applied across various types and genres of music, to audio files of varying audio quality, to audio files which have been produced, mixed, and mastered by any number of distinct techniques, and to audio files with different vertical, layered structures.
Figures 1a and 1b depict example structures for an audio track 100. Figure 1a depicts an example horizontal, temporal structure of an audio track 100 and figure 1b depicts an example vertical, layered structure of an audio track 100. The audio track 100 of figure 1a may be the same as the audio track 100 of figure 1b. As depicted in figure 1a, the audio track comprises a plurality of sections 102a, 102b, 102c, 102d, 102e, 102f, 102g, 102h. The sections are temporally sequential and defined by temporally-positioned section transition boundaries. The audio track comprises a duration. Thus, as viewed in figure 1a, the sections 102a, 102b, 102c, 102d, 102e, 102f, 102g, 102h are temporally ordered from left to right, such that the duration of the audio track spans from the beginning of 'A' section 102a to the end of 'D' section 102h.
Each section may comprise a complete musical idea. For instance, each section may be tonally and/or dynamically distinct from its preceding and succeeding sections. Additionally or alternatively, the most or least prominent frequency bands of each section may differ from the most or least prominent frequency bands of other sections. Additionally or alternatively, the occupancy of each frequency band in one section may differ from the occupancy of each frequency band respectively in another section. Additionally or alternatively, the most or least prominent energy bands, or the occupancy of energy bands in one section may differ from the most or least prominent energy bands, or the occupancy of energy bands in another section. Additionally or alternatively, each section may be tonally distinct from other sections distinguished by different lead or overarching melodies. In some implementations, some sections may comprise a different key to other sections. Musical sections may relate to lyrical content, in the case of popular music, and correspond to a "verse" or a "chorus" or "refrain".
For instance, in the specific implementation depicted in figure 1a, audio track 100 may comprise 'A' sections 102a, 102c (e.g. a verse), 'B' sections 102b, 102d, 102f, 102g (e.g. a chorus or refrain), a 'C' section 102e (e.g. a bridge), and a 'D' section 102h (e.g. an outro). Accordingly, the audio track 100 depicted in figure 1a is structured verse 102a, chorus 102b, verse 102c, chorus 102d, bridge 102e, double chorus 102f, 102g, outro 102h. This exemplary structure is a staple of rock and pop compositions, but it will be appreciated that many other structures (such as ABA, ABAB, AABA, AAA, ABCD, and so on) may be found in audio tracks of the same or different genres (such as classical, jazz, funk, and EDM). As well as verse, chorus, bridge and outro, sections of some audio tracks may further comprise: intro, pre-chorus, post-chorus, instrumental solo, elision, breakdown, shout chorus, exposition, development, recapitulation, drop, cadenza, coda, fadeout, middle-8 and/or interlude.
As depicted in figure 1b, the audio track 100 comprises a plurality of audio data 104a, 104b, 104n (e.g. a plurality of sets of audio data), where the plurality of audio data 104a, 104b, 104n are layered to comprise the audio track 100. Each audio data comprises musical information across at least one of a plurality of discrete frequency bands and across a plurality of timesteps spanning the duration of the audio track. For instance, each audio data 104a, 104b, 104n comprises a plurality of samples which are positioned sequentially to span the duration of the audio track. In some implementations each audio data 104a, 104b, 104n may, in turn, be subdivided into a plurality of audio data, where the subdivided audio data spans a portion of the duration of the audio track. The subdivided audio data may, in turn, be further subdivided, and so on.
Since many of the method steps disclosed herein may be carried out in the frequency-time domain, it may be desirable to optimise the duration of each timestep of the audio data prior to carrying out those steps. That is, the size of the timesteps must be small enough to achieve sufficient temporal resolution, but not so small as to reduce the clarity of peaks in the frequency-time domain. As such, the sample rate of the audio data 104a, 104b, 104n may be reduced (that is, the size of the timesteps may be increased) before carrying out the method(s) disclosed herein. The exact size of the timesteps for best results may be learnt and optimised by machine learning techniques, through the training of a computational neural network on example training data, for example. In some implementations, timesteps of approximately 0.005-0.05 seconds have been found to be particularly beneficial.
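Purely as an illustration of this timestep sizing, the following Python sketch derives a power spectrogram whose frames are approximately 0.01 s apart (a value inside the 0.005-0.05 s range above); the hop-to-window ratio is an assumption, not a value taken from this disclosure.

    import numpy as np
    from scipy.signal import stft

    def spectrogram_with_timestep(samples, sr, target_step=0.01):
        # Choose the STFT hop so successive frames are ~target_step seconds
        # apart; the window is four hops long, i.e. 75% overlap (assumed).
        hop = max(1, int(round(target_step * sr)))
        nperseg = 4 * hop
        freqs, times, Z = stft(samples, fs=sr, nperseg=nperseg,
                               noverlap=nperseg - hop)
        return freqs, times, np.abs(Z) ** 2    # power per bin per timestep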
For instance, first audio data 104a, second audio data 104b, and nth audio data 104n may each define an audio component, such as a stem. A stem is a grouped collection of audio sources or recordings mixed to form a logical whole (and may optionally be post-processed). For instance, the single stem may be one of a lead vocals stem, a backing vocals stem, a guitar stem (e.g. a lead guitar stem or a rhythm guitar stem), a bass stem, a percussion stem, a drum kit stem, a miscellaneous stem, and so on. In these implementations, the number of audio data equals the number of stems available. That is, if 8 stems are available, then the audio track may comprise first audio data 104a, second audio data 104b, third audio data (not shown in figure 1b), fourth audio data (not shown in figure 1b), fifth audio data (not shown in figure 1b), sixth audio data (not shown in figure 1b), seventh audio data (not shown in figure 1b), and eighth audio data (not shown in figure 1b). Equally, if 2 stems are available, then the audio track may only comprise first audio data 104a and second audio data 104b.
In some implementations, the audio data used for the methods disclosed herein may comprise audio components which are not initially provided as distinct audio components of the audio track 100. In these implementations, software may be used to split the audio track 100 into a plurality of audio components defined by first audio data 104a, second audio data 104b, nth audio data 104n (and so on) by removing the characteristic frequency fingerprint of the audio components from the audio track 100 as a whole. Accordingly, that characteristic frequency fingerprint comprises audio data which may correspond to a specific stem (e.g. the guitar stem). Third-party software for carrying out this functionality is known in the art; however, its use in the methods of the present application is beneficial since it allows the methods of the present application which rely on multiple audio data to be universally applied to all audio tracks, regardless of whether the audio components are initially provided.
Alternatively, in some implementations, first audio data 104a, second audio data 104b, nth audio data 104n (and so on) may each comprise a high-pass, band-pass, and/or low-pass filtered subset of the audio track. For instance, three audio data may be provided, each corresponding to a low-pass, band-pass, and high-pass filtered subset of the audio track as a whole respectively, where the low and high frequency parameters of the band-pass filter substantially equal the frequency parameters of the low-pass and high-pass filters respectively. Similarly, the audio data may comprise: a first audio data 104a which corresponds to the backing, by taking all frequencies but notching out high-energy signals in the range 250-2500 Hz; a second audio data 104b which corresponds to the bass, by taking frequencies in the range 50-200 Hz; a third audio data (not shown in figure 1) which corresponds to the vocals, by taking the high-energy signal in the range 300-2000 Hz; and a fourth audio data (not shown in figure 1) which corresponds to other instruments, by taking the high-energy signal in the range 3000-6000 Hz. Ultimately, the isolation of the elements may be optimised by machine learning techniques.
In yet other implementations, the audio track may comprise a stereo audio track. In these implementations there may be two audio data: the first audio data 104a may comprise a normalised sum of the left and right channels of the stereo audio track (which essentially gives a mono signal), and the second audio data 104b may comprise a normalised difference of the left and right channels (which gives a side mix). If the stereo audio track comprises stereo audio components, then the audio data may comprise a mono signal and side mix for each audio component. Using mono signals and side mixes of stereo audio tracks in this way is beneficial as many audio tracks are provided in stereo form. Therefore, by splitting the stereo audio into a mono signal and a side mix, the methods of the present application which rely on multiple audio data may be universally applied to all stereo audio tracks, without requiring use of third-party software and regardless of whether the audio components are provided.
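A minimal Python sketch of the mid/side split described above follows; the peak normalisation shown is one assumed interpretation of "normalised".

    import numpy as np

    def mid_side(left, right):
        # Normalised sum (mono/mid) and normalised difference (side) of a
        # stereo pair, giving two audio data for the method.
        mid = (left + right) / 2.0
        side = (left - right) / 2.0
        mid /= np.max(np.abs(mid)) + 1e-12     # peak-normalise each signal
        side /= np.max(np.abs(side)) + 1e-12
        return mid, side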
Figure 2a depicts a flowchart of a method 200 of identifying section transition boundaries within an audio track in accordance with an implementation of the present disclosure. Method 200 is computer-implemented and may be carried out on any suitable device comprising a processor and memory, such as the device described herein with respect to figure 7. For instance, the processor of a (computer) system comprising a processor, an input mechanism, a display, and an audio output may be configured to perform the method 200. Similarly, an (optionally non-transitory) computer readable medium may comprise instructions which, when executed by a processor, cause the processor to perform the method 200. Example systems which are suitable for performing method 200 are described below with respect to figure 7.
Returning to figure 2a, method 200 may begin at step 202 where the location of a set of hitpoints within audio data is calculated. The audio data may comprise any one of first audio data 104a, second audio data 104b, nth audio data 104n (and so on) described above with respect to figure 1b. The calculated location is a temporal location across the duration of the audio data. That is, the location may be encoded as a timestamp across the duration of the audio track or a sample number (e.g. a timestep index) within the audio data.
A hitpoint marks a musically and/or rhythmically significant position within the audio data. For instance, each hitpoint may comprise a local maximum in energy or amplitude above a threshold within the audio data. As an example, at least some or all of the hitpoints may correspond to a transient (e.g. a high-amplitude, short-duration peak) in the first audio data. For instance, the hitpoints may comprise an attack transient, that is, a brief peak in the audio signal where there is significant and rapid amplitude gain. Accordingly, the hitpoints may comprise a high degree of non-periodic components. The calculation of the location of the set of hitpoints within the audio data is described in more detail below with respect to figure 3. A hitpoint may literally correspond to a point at which a musical instrument (for example a drum, a piano key, or a set of guitar strings) is "hit" and a musical note is produced.
Returning to figure 2a, regardless of whether step 202 is present, the method 200 proceeds to steps 204a and 204b where a statistical analysis of frequency bands within each of a plurality of energy bands of the audio data is calculated. That is, at step 204a, a statistical analysis of frequency bands within a first energy band of the audio data is calculated. Similarly, at step 204b, a statistical analysis of frequency bands within a second energy band of the audio data is calculated. Steps 204a and 204b may be carried out simultaneously in parallel or consecutively in any order. Method 200 may further comprise any number of additional steps carried out in parallel or consecutively in any order with steps 204a and 204b, where each of the additional steps comprises calculating a statistical analysis of frequency bands within additional energy bands of the audio data.
The more energy bands within which a statistical analysis of frequency bands is calculated, the better the method 200 is at identifying section transition boundaries. However, as the number of energy bands increases, so too does the computational resource required to carry out the method. In some implementations, a statistical analysis of frequency bands may be calculated within 4 or 5 energy bands. The optimal number of energy bands may be learnt and optimised by machine learning techniques. In any case, the statistical analysis of frequency bands must be calculated within at least two energy bands in order to accurately identify section transition boundaries (and therefore the method need only be limited to as much).
The statistical analysis of frequency bands within a specific energy band may be carried out in the frequency-time domain, where a third dimension is energy (or power, if the energy is normalised per unit time). The terms power and energy may be used substantially interchangeably herein. Therefore, in order to carry out the statistical analysis, a spectrogram (e.g. a PSD plot) of the audio data may be computed by, for instance, conventional Fourier transform methods. The statistical analysis at steps 204a and 204b produces a quantity for a given timestep that represents the occupancy (e.g. the 'fullness' or the frequency band occupancy) of the respective energy band within which the statistical analysis is calculated. For instance, in some embodiments, the statistical analysis may comprise a summation or aggregation of the number of frequency bands within the plurality of discrete frequency bands of the musical information for which the energy in that frequency band falls within the respective energy band for a given timestep. Alternatively, the statistical analysis may comprise a weighted sum of the number of frequency bands which fall within the specific energy band for a given timestep, wherein each frequency band is weighted by its relative importance to the energy band. The specific energies of the energy bands used at steps 204a and 204b and the weightings of the weighted sum may be learnt and optimised by machine learning techniques.
Similarly, the parameters of the discrete frequency bands may be learnt and optimised by machine learning techniques. In general the first energy band may be a high-energy level band and the second energy band may be a mid-energy level band. Alternatively, both energy bands may be high-energy level bands. Alternatively still, both energy bands may be mid-energy level bands. In some implementations, example energy bands of 100% to 80% of the maximum energy of the audio track (high-energy level), 80% to 60% of the maximum energy of the audio track (high-energy level or mid-energy level), and/or 60% to 40% of the maximum energy of the audio track (mid-energy level) have been found to be particularly beneficial.
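For illustration, a Python sketch of one possible occupancy calculation follows, using the example energy bands above; the thresholding rule and the optional per-bin weighting are assumed implementations of the summation and weighted summation described, not the claimed method itself.

    import numpy as np

    def band_occupancy(power_spec, lo_frac, hi_frac, weights=None):
        # power_spec: (n_bins, n_frames) power spectrogram.
        # Count, per timestep, the frequency bins whose energy lies within
        # [lo_frac, hi_frac] of the track's maximum energy.
        e_max = power_spec.max()
        inside = (power_spec >= lo_frac * e_max) & (power_spec <= hi_frac * e_max)
        if weights is None:
            return inside.sum(axis=0)                   # plain count per timestep
        return (inside * weights[:, None]).sum(axis=0)  # weighted occupancy

    # Example use with the bands suggested above (illustrative thresholds):
    # high_band = band_occupancy(power, 0.8, 1.0)   # high-energy band, step 204a
    # mid_band = band_occupancy(power, 0.4, 0.6)    # mid-energy band, step 204b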
In alternative embodiments to those described above, the statistical analysis may comprise summing element values, summing the number of elements, computing matrix norms, calculating discrete values, and calculating deltas (differences) either per individual frequency, frequency band, or as an absolute value.
Alternatively, the statistical analyses performed at steps 204a and 204b may comprise a statistical analysis of energy bands within a frequency band of the audio data. For instance, the statistical analysis may produce a quantity representative of a measurement of the total energy within a frequency band at a given timestep, or the statistical analysis may produce a quantity for a given timestep that represents the energy band occupancy of the respective frequency band within which the statistical analysis is calculated. In these implementations, the quantities produced from steps 204a and 204b may be processed in substantially the same manner as discussed with respect to the quantities produced in other implementations of steps 204a and 204b discussed herein.
Optionally, prior to steps 204a and 204b, the method may comprise applying a lowpass filter to the audio data. The lowpass filter may be applied to the audio data in the frequency-time domain, e.g. by means of a spectrogram and conventional Fourier transform methods. The lowpass filter is applied at this stage to remove noise and high frequency distortion from the audio data. Furthermore, the lowpass filter is applied to remove transient elements of drum traces (which may be present as background noise in the audio data), or other substantially rhythmically repeating elements. This is beneficial because drum traces (and other substantially rhythmically repeating elements) are repetitive and often appear throughout an audio track without providing a distinction between sections. As such, the rhythmically repeating elements are not as relevant to the current method 200 of identifying section transition boundaries within an audio track. In some implementations, the lowpass filter applied here may have a cut-off between 6 kHz and 11 kHz; however, this value may be optimised and learnt by machine learning techniques.
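The sketch below shows an assumed time-domain alternative to the spectrogram-domain filtering described above: a zero-phase Butterworth low-pass with an 8 kHz cut-off, a value chosen from inside the 6-11 kHz range; the filter order is likewise an assumption.

    from scipy.signal import butter, sosfiltfilt

    def remove_highs(samples, sr, cutoff_hz=8000.0):
        # Zero-phase low-pass applied before steps 204a/204b to strip noise,
        # high-frequency distortion and bright drum transients.
        sos = butter(8, cutoff_hz, btype='low', fs=sr, output='sos')
        return sosfiltfilt(sos, samples)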
Next, the method 200 proceeds to step 206 where steps 204a and 204b (and any number of additional steps carried out alongside steps 204a and 204b) are repeated over a plurality of timesteps. For instance, steps 204a and 204b may be repeated over the entire audio track and therefore may be repeated at every timestep in the plurality of timesteps spanning the duration of the audio track. Alternatively, steps 204a and 204b may be repeated over a portion of the entire audio track (e.g. a user-selected range of timestamps across the audio track) and therefore may be repeated at every timestep across the portion of the audio track.
Consequently, iterating steps 204a and 204b at step 206 produces a two-dimensional array of data for each energy band which describes or signifies the occupancy of each energy band respectively at each timestep across the plurality of timesteps.
For example, figure 2b depicts a waveform 214 representing (a first) audio data in the amplitude-time domain. As can be seen in figure 2b, the waveform comprises an amplitude (as measured on the y-axis) and spans a duration (on the x-axis). The example waveform 214 comprises a visibly denser region 216a and a plurality of visibly less dense regions 216b. However, the boundary between the regions is not immediately apparent to a user. Furthermore, there is no guarantee that the visibly denser region 216a and visibly less dense regions 216b represent a complete description of the waveform's sections. Accordingly, a user, without access to the present disclosure's computer-implemented method 200 of identifying section transition boundaries, may search by eye and by ear for the section transition boundaries through trial and error, around the rough boundary between the denser region 216a and the less dense regions 216b. This trial and error process is time-consuming and imprecise.
Figure 2c depicts two plots of the two-dimensional arrays of data for two energy bands produced by carrying out steps 204a and 204b, iterated at step 206, over the waveform of figure 2b. Upper plot 218a represents the frequency content or occupancy at each timestep within an upper energy band and lower plot 218b represents the frequency content or occupancy at each timestep within a lower energy band. As can be seen in figure 2c, the plots highlight two less full or lower-occupancy sections 220a, 220a' (which may be, for instance, verses) and two fuller or higher-occupancy sections 220b, 220b' (which may be, for instance, choruses).
Returning to figure 2a, the method continues at steps 208a and 208b where a first set and a second set of section transition markers are identified. Each set of section transition markers comprises every possible location within the statistical analysis of frequency bands within a given energy band of the audio data which may potentially correspond to a section transition marker. In this way, the first and second sets of section transition markers identified at steps 208a and 208b may be associated with the first and second energy bands of steps 204a and 204b respectively. Alternatively, in some implementations 4 or 5 energy bands may be considered alongside steps 204a and 204b, and the two energy bands which produce the clearest results (e.g. the least noisy results or the most locally consistent two-dimensional array) may be taken forward at steps 208a and 208b. Alternatively still, the 4 or 5 energy bands may be considered alongside steps 204a and 204b, in which case 4 or 5 sets of section transition markers may be identified alongside steps 208a and 208b.
The sets of section transition markers may be identified by chronologically comparing entries in the two-dimensional array of data for each energy band which describe or signify the occupancy of each energy band respectively. For instance, a set of section transition markers may be identified by comparing the statistical analysis at each timestep with the same statistical analysis at a succeeding or the immediately succeeding timestep. Timestep pairs which correspond to large spikes in energy or a step-change in level are identified and flagged. The temporal locations of these timestep pairs correspond to the locations of the potential section transition boundaries.
The above chronological comparison may be carried out through substantially any statistical analysis method. For instance, a moving average may be taken of the two-dimensional array of data describing the occupancy of an energy band to filter out any noise or significant fluctuations in the array. Figure 2d depicts the results of such a moving average calculation carried out on the plots of figure 2c. That is, figure 2d depicts an upper moving average 222a of upper plot 218a and a lower moving average 222b of lower plot 218b. By taking a moving average, the section transition boundaries can be calculated more clearly and distinctly.
In this exemplary implementation, once a moving average has been taken, the rate of change of the moving average across all of the timesteps may be calculated. Then local maxima of that rate of change (e.g. local maxima above a threshold value, and/or local maxima which remain above a threshold value for at least a threshold number of timesteps) may be identified and assigned as the section transition markers.
As another example, the chronological comparison may comprise identifying a difference of the first statistical analysis across succeeding timesteps that exceeds a first threshold difference. This difference may be a difference of the moving average or the two-dimensional array itself or another smoothed or filtered version of the array.
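For illustration, a Python sketch of the moving-average variant follows: smooth an occupancy curve, take its rate of change, and mark local maxima above a threshold as candidate section transition markers. The window length and the mean-plus-two-standard-deviations threshold rule are assumptions, not values from this disclosure.

    import numpy as np
    from scipy.signal import find_peaks

    def transition_markers(occupancy, times, win=50, min_rate=None):
        # Moving average of the occupancy curve, then local maxima of its
        # absolute rate of change across timesteps.
        kernel = np.ones(win) / win
        smooth = np.convolve(occupancy, kernel, mode='same')
        rate = np.abs(np.diff(smooth))                 # change per timestep
        if min_rate is None:                           # assumed threshold rule
            min_rate = rate.mean() + 2.0 * rate.std()
        peaks, _ = find_peaks(rate, height=min_rate)
        return times[peaks + 1]                        # marker timestamps (s)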
Consequently, identifying a first set of section transition markers and a second set of section transition markers at steps 208a and 208b produces a first set of section transition markers and a second set of section transition markers. For instance, as labelled on figure 2d, section transition markers 224a, 224a' and 224a'' are identified for the upper moving average 222a, and section transition markers 224b, 224b' and 224b'' are identified for the lower moving average 222b. Each of section transition markers 224a, 224a' and 224a'' is located at substantially the same place as section transition markers 224b, 224b' and 224b'' respectively though, as depicted, they may not be located at the exact same timestep.
Furthermore, in some implementations, some entries in the first set of section transition markers may not be located at substantially the same place as entries in the second set of section transition markers and/or vice versa (not shown in figure 2d). This may be caused by, for instance, a single instrument entering or changing dynamics in the middle of a section (e.g. in the middle of a verse). Accordingly, were the method 200 to comprise calculating a statistical analysis at step 204a, 204b for a single energy band only, the entry of this single instrument might be mistakenly identified as a section transition change. Rather, to avoid such an incorrect section transition change being identified, the method 200 comprises calculating a statistical analysis within multiple energy bands which may then correspond to multiple sets of section transition markers.
Consequently, method 200 proceeds to step 210 where a master set of section transition markers is defined. The master set of section transition markers comprises section transition markers which are present in both the first set of section transition markers and the second set of section transition markers. That is, the master set of section transition markers comprises one marker for each pair of markers in the first set of section transition markers and the second set of section transition markers which are within a threshold temporal distance of each other. The exact location of each marker in the master set of section transition markers may be defined as the midpoint between the two markers in the respective pair, or may be taken as the location of one marker in the respective pair (for instance, the marker which corresponds to a higher energy band or contains more rhythmically repetitive content).
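A minimal sketch of step 210 under stated assumptions (a 0.5 s pairing tolerance and midpoint placement, both illustrative choices):

    import numpy as np

    def master_markers(markers_a, markers_b, tol=0.5):
        # Keep one marker per pair (a, b), one from each set, whose temporal
        # distance is within `tol` seconds; place it at the pair's midpoint.
        markers_b = np.asarray(markers_b)
        master = []
        for a in markers_a:
            if markers_b.size == 0:
                break
            j = np.argmin(np.abs(markers_b - a))
            if abs(markers_b[j] - a) <= tol:
                master.append((a + markers_b[j]) / 2.0)
        return master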
Accordingly, each entry in the master set of section transition markers corresponds to a section transition boundary. Therefore, if no hitpoints have been calculated or identified (e.g. if step 202 is not present), the master set of section transition markers may be taken to comprise a set of section transition boundaries. That is, each section transition marker in the master set is taken to be a section transition boundary.
However, if the location of a set of hitpoints was calculated at step 202, then the set of section transition boundaries may be calculated at step 212. Here, the set of section transition boundaries comprises hitpoints from the set of hitpoints that are temporally closest to each section transition marker in the master set of section transition markers. That is, each section transition boundary is located at the location of the hitpoint that is temporally closest to (e.g. within a threshold time of) each section transition marker in the master set. A benefit of step 212 is that, because the hitpoints mark musically and/or rhythmically significant positions across the audio track, each is likely to fall at such a position, such as a beat (or subdivision thereof), a downbeat, the beginning of a new bar, and so on. Therefore, locating each section transition boundary at the hitpoint that is temporally closest to each section transition marker in the master set ensures that the section transition boundaries sit at musically and/or rhythmically significant positions across the audio track. As such, the calculation of the locations of the section transition boundaries by method 200 is further improved.
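A sketch of step 212, snapping each master marker to its temporally closest hitpoint; the function name and the use of a sorted hitpoint array are illustrative.

    import numpy as np

    def snap_to_hitpoints(master, hitpoints):
        # Replace each master marker with the temporally closest hitpoint,
        # yielding the set of section transition boundaries (step 212).
        hits = np.sort(np.asarray(hitpoints))
        master = np.asarray(master)
        idx = np.searchsorted(hits, master).clip(1, len(hits) - 1)
        left, right = hits[idx - 1], hits[idx]
        snapped = np.where(master - left <= right - master, left, right)
        return np.unique(snapped)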
Figure 3 depicts a flowchart of an example method 300 of calculating a location of a set of hitpoints within audio data (e.g. step 202 of method 200).
As with method 200, the audio data used in method 300 may comprise any one of first audio data 104a, second audio data 104b, nth audio data 104n (and so on) described above with respect to figure 1b. The audio data may comprise the same audio data as used in method 200, or may comprise an alternate audio data from within the same audio track. In particular, the audio data may be specifically chosen as audio data which is best suited for hitpoint location calculations. For instance, the audio data may include audio data which comprises a greater degree of rhythmic information than melodic information, such as a rhythmically significant audio component (for example a drum or percussion stem). In other words, the audio data may include strong transients with a sharp attack and a slow decay.
At step 302, the method 300 begins by filtering audio data to produce filtered audio data. As outlined above, the audio data may be filtered in the frequency-time domain, e.g. by means of a spectrogram and conventional Fourier transform methods. The purpose of this filter is to isolate the repetitive rhythmical elements (transients), which generally have traces across most of the frequency range. For instance, the audio data may be filtered to substantially isolate a rhythmically significant audio component by applying a bandpass filter with a low frequency parameter of 4-7 kHz (which filters out typical vocal, synth, guitar and bass frequencies) and a high frequency parameter of 10-15 kHz (which filters out frequencies which are typically blurred with noise). Alternatively, filtering the audio data at step 302 may comprise isolating a rhythmically significant audio component (e.g. a drum stem). Step 302 may additionally comprise normalising the energy of the audio data.
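A minimal sketch of such frequency-time-domain filtering is given below, assuming SciPy's STFT routines, a 2048-sample window, and cut-offs of 5 kHz and 12 kHz chosen from within the ranges given above; these parameter choices are illustrative only.

```python
import numpy as np
from scipy.signal import stft, istft

def bandpass_filter(audio, sample_rate, low_hz=5000.0, high_hz=12000.0):
    """Zero out spectrogram bins outside [low_hz, high_hz], resynthesise,
    then normalise the energy of the result (cf. step 302)."""
    freqs, _, spec = stft(audio, fs=sample_rate, nperseg=2048)
    spec[(freqs < low_hz) | (freqs > high_hz), :] = 0.0   # frequency-time domain filter
    _, filtered = istft(spec, fs=sample_rate, nperseg=2048)
    return filtered / (np.max(np.abs(filtered)) + 1e-12)  # energy normalisation
```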
The method 300 proceeds to step 304 where a statistical analysis of frequency bands within a third energy band of the audio data is calculated. This statistical analysis may comprise any of the statistical analyses described with respect to steps 204a and 204b of method 200 of figure 2a above. Here, the third energy band may comprise a minimum energy threshold such that only the sharp attack of the rhythmical hits is measured in the statistical analysis. For instance, the minimum energy threshold may be around -30 dB to -50 dB.
Next, the method 300 proceeds to step 306 where step 304 is repeated over a plurality of timesteps (similar to step 204 in method 200). For instance, step 304 may be repeated over the entire audio track and therefore may be repeated at every timestep in the plurality of timesteps spanning the duration of the audio track. Alternatively, step 304 may be repeated over a portion of the entire audio track (e.g. a user-selected range of timestamps across the audio track) and therefore may be repeated at every timestep across the portion of the audio track.
Consequently, iterating step 304 at step 306 produces a two-dimensional array of data for the third energy band which describes or signifies the occupancy of the third energy band at each timestep across the plurality of timesteps.
This two-dimensional array of data is then analysed at step 308 where local maxima (e.g. local maxima above a threshold value) are identified. As in the method 200 of figure 2b, a moving average of the two-dimensional array may first be calculated prior to identifying the local maxima. The local maxima represent timestamps at which the occupancy of the third energy band is high and, as such, correspond to the timestamps (i.e. the temporal locations) of the hitpoints in the set of hitpoints. Accordingly, the coordinate (temporal) locations of the local maxima define the coordinate (temporal) locations of the hitpoints in the set of hitpoints. Once the locations of the hitpoints have been calculated at step 202, method 200 may proceed to steps 204a and 204b, as described above.
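The following sketch combines steps 302-308 under assumed parameters (a -40 dB minimum energy threshold, a five-frame moving average and a peak-height criterion of half the maximum occupancy); it is one possible realisation rather than the method itself.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def hitpoint_locations(filtered_audio, sample_rate, min_db=-40.0, smooth=5):
    """Occupancy of the third energy band per timestep, smoothed by a moving
    average; local maxima give the temporal locations of the hitpoints."""
    freqs, times, spec = stft(filtered_audio, fs=sample_rate, nperseg=2048)
    level_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    occupancy = (level_db > min_db).sum(axis=0).astype(float)  # bands above the threshold
    occupancy = np.convolve(occupancy, np.ones(smooth) / smooth, mode="same")
    peaks, _ = find_peaks(occupancy, height=0.5 * occupancy.max())  # maxima above a threshold
    return times[peaks]  # timestamps of the hitpoints
```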
Optionally, the computer-implemented method 200 of identifying section transition boundaries within an audio track disclosed herein may further comprise calculating a predominant repetition timescale and checking the identified section transition boundaries based on the predominant repetition timescale. The predominant repetition timescale defines a repetition timescale that is intrinsic to the audio track as a whole. For example, the predominant repetition timescale may comprise the length of a regular repeating motif or rhythm within the audio track. That is, the predominant repetition timescale comprises the typical shortest length of time between repeating patterns across the duration of the audio track. For instance, the predominant repetition timescale may be a single bar or a regular phrase (e.g. a regular 1/4-bar, 1/2-bar, 2-bar, 4-bar or 8-bar pattern) which recurs regularly within the audio track.
The predominant repetition timescale is calculated by autocorrelation methods. Autocorrelation of a signal comprises comparing and assessing the similarity of a signal to a copy of itself offset by various timesteps. In particular, methods of the present disclosure include calculating the predominant repetition timescale by either: autocorrelating the set of hitpoints (within audio data); or autocorrelating the audio data itself (such as any of first audio data 104a, second audio data 104b, nth audio data 104n, and so on, described above). Using the set of hitpoints often gives greater accuracy; however, in tracks with a regular tempo, using the audio data may be beneficial as the hitpoints need not be calculated. Furthermore, clearer autocorrelation results may be achieved if the audio data which is autocorrelated comprises a rhythmically significant audio component.
In either case, the autocorrelation generates a set of autocorrelation peaks. The method of calculating the predominant repetition timescale further comprises defining the predominant repetition timescale as the time-lag between the time-zero peak and the strongest peak different from the time-zero peak in the set of autocorrelation peaks.
For instance, figure 4a illustrates an autocorrelation 402 of audio data of an audio track which comprises a regular tempo and/or rhythm. The autocorrelation 402 includes a set of strong autocorrelation peaks 404a, 404b, 404c. An audio track may comprise a regular tempo, and thus the autocorrelation peaks may be strong, if that audio track was recorded to a grid or a metronome.
Similarly, figure 4b illustrates an autocorrelation 406 of audio data of an audio track which comprises an irregular tempo and/or rhythm. The autocorrelation 406 includes a set of weak autocorrelation peaks 408a, 408b, 408c. An audio track may comprise an irregular tempo, and thus the autocorrelation peaks may be weak, if that audio track was not recorded to a grid, for instance if that audio track is a live recording.
Returning to figure 4a, each peak 404a, 404b, 404c comprises a time coordinate (on the x-axis) and an autocorrelation strength coordinate (on the y-axis). Here, the time-zero peak 404a is the strongest peak and is given a time coordinate of zero. The next strongest peak 404b (i.e. the peak with the highest autocorrelation strength other than the time-zero peak) comprises a non-zero time coordinate. The difference between the time coordinate of the time-zero peak 404a and the next strongest peak 404b is taken to be the predominant repetition timescale.
For instance, in the implementation of figure 4a, peak 404b has a time coordinate of 1.2 seconds (or, more precisely, 1.20002 seconds) from the time-zero peak 404a and therefore the predominant repetition timescale comprises 1.2 seconds (or 1.20002 seconds). Alternatively, the predominant repetition timescale may be known and pre-encoded within the metadata of the audio track (e.g. as a tempo). Alternatively still, the predominant repetition timescale may be extracted from other, known, third-party software.
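An illustrative sketch of the autocorrelation-based calculation, including the 0.8 regularity heuristic discussed below, is given here; the guard region around the time-zero peak is an assumption, and a brute-force np.correlate is used for brevity where an FFT-based autocorrelation would be preferred in practice.

```python
import numpy as np

def predominant_repetition_timescale(signal, sample_rate, guard_s=0.05):
    """Time-lag between the time-zero peak and the strongest other peak of
    the normalised autocorrelation; `regular` applies the 0.8 heuristic."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags only
    ac = ac / (ac[0] + 1e-12)                          # time-zero peak scaled to 1
    guard = int(guard_s * sample_rate)                 # skip the flank of the zero peak
    lag = guard + int(np.argmax(ac[guard:]))           # strongest peak away from time zero
    return lag / sample_rate, bool(ac[lag] >= 0.8)     # timescale (s), regular-tempo flag
```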
Once a set of section transition boundaries have been calculated (e.g. by the method of figure 2a), the predominant repetition timescale may be calculated and subsequently used to check the section transition boundaries. Accordingly, figure 4c depicts a flowchart of a method 410 of checking the calculated section transition boundaries based on the predominant repetition timescale.
As a preliminary step to method 410 (not depicted in figure 4c), the method may include determining whether the audio track in question is suitable for checking based on the predominant repetition timescale. In particular, the method may determine whether the audio track has a regular tempo (in which case the audio track is determined to be suitable for predominant repetition timescale-based checking) or whether the audio track has an irregular tempo (in which case an arbitrary timescale, at least an order of magnitude greater than the initial timestep, may be selected instead). An audio track may be considered to have a regular tempo if it produces an autocorrelation peak away from the time-zero peak with an autocorrelation strength coordinate of substantially 0.8 or above.
Method 410 checks the section transition boundaries by determining whether a section transition boundary has been mistakenly identified due to, for instance, a sudden high-energy hit (or transient) from a vocal or instrument within a phrase.
Method 410 begins at step 412 where the audio data is chronologically segmented into a plurality of succeeding temporal slices. Step 412 does not necessarily comprise splitting the audio data into a plurality of different data structures; rather, step 412 may simply comprise temporally marking the audio data at the start position of each slice. The start position of each slice is an integer number of predominant repetition timescales away from the first entry in the set of hitpoints. In other words, step 412 may comprise propagating, from the first entry in the set of hitpoints, an integer number of predominant repetition timescales forward and backward throughout the audio track. It does not matter here whether the first entry in the set of hitpoints comprises the start of a bar or not. The integer multiple may comprise a power-of-two multiple of the predominant repetition timescale (optionally the predominant repetition timescale itself, half the predominant repetition timescale, or double the predominant repetition timescale) in order to ensure that at least a strong beat (e.g. the first beat of a bar) and a weak beat (e.g. the second beat of a bar) are considered.
The audio data used in method 410 may comprise the same audio data as used in method 200 of figure 2b or method 300 of figure 3. Equally, the audio data used in method 410 may comprise any of first audio data 104a, second audio data 104b, nth audio data 104n, and so on, as described above.
Next, at step 414, method 410 proceeds to calculate a statistical comparison of the energy content per frequency band of a slice and the energy content per frequency band of the next (temporally) succeeding slice. The energy content per frequency band of a slice represents the averaged energy content of that slice. This statistical comparison may comprise, for instance, a matrix norm calculation or a difference per frequency slice calculation. Any suitable statistical comparison may be used. In any case, this statistical comparison may be carried out in the frequency-time domain, e.g. by means of a spectrogram and conventional Fourier transform methods. Step 414 is then iterated at step 416 across each slice in the plurality of succeeding slices.
Consequently, iterating step 414 at step 416 produces a value for each temporal slice which indicates the similarity of that slice's energy content per frequency band to the succeeding slice's energy content per frequency band. As such, the data produced by iterating step 414 at step 416 comprises sharp peaks (for instance, values above a threshold amount) at slice boundaries where there is a fluctuation in frequency content and energy. In other words, the data produced by iterating step 414 at step 416 indicates the very rough location of likely section transition boundaries.
Next, at step 418, the set of section transition boundaries (as previously calculated) is updated. Here, the set of section transition boundaries is updated to discard entries which do not fall within a pair of succeeding chronological slices within which a rough location of a likely section transition boundary has been located. For instance, the set of section transition boundaries may be updated to include only entries which correspond to (that is, which fall within a pair of) succeeding chronological slices at which the statistical comparison exceeds a threshold amount.
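By way of illustration, method 410 might be sketched as below, assuming a spectrogram-based energy profile per slice and a matrix-norm comparison between succeeding slices; the threshold value, window length and the use of seconds for positions are placeholders rather than disclosed parameters.

```python
import numpy as np
from scipy.signal import stft

def check_boundaries(audio, sample_rate, boundaries, first_hitpoint, timescale,
                     threshold=1.0):
    """Discard boundaries that do not fall near a pair of succeeding slices
    whose per-band energy profiles differ by more than `threshold`."""
    # Step 412: slice starts propagated by integer multiples of the timescale
    # from the first hitpoint (the modulo realises the backward propagation).
    first_start = first_hitpoint % timescale
    starts = np.arange(first_start, len(audio) / sample_rate, timescale)
    freqs, times, spec = stft(audio, fs=sample_rate, nperseg=2048)
    magnitude = np.abs(spec)
    profiles = []
    for s in starts:  # averaged energy content per frequency band of each slice
        cols = (times >= s) & (times < s + timescale)
        profiles.append(magnitude[:, cols].mean(axis=1) if cols.any()
                        else np.zeros(len(freqs)))
    # Steps 414/416: matrix-norm comparison of each slice with its successor.
    diffs = np.linalg.norm(np.diff(np.asarray(profiles), axis=0), axis=1)
    kept = []
    for b in boundaries:  # step 418: keep boundaries at slice pairs with a sharp peak
        i = int(np.clip((b - first_start) // timescale, 1, len(diffs) - 1))
        if diffs[i - 1:i + 1].max() > threshold:
            kept.append(b)
    return kept
```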
Figure 5 depicts a flowchart of an alternative method 500 of checking the calculated section transition boundaries. The checking method 500 of figure 5 may be carried out instead of, or as well as (in parallel or sequentially in either order), the checking method 410 of figure 4c. The checking method 500 of figure 5 does not require the audio track to have a regular tempo.
Checking method 500 comprises, first, at step 502, calculating a set of section transition boundaries based on first audio data. Next, at step 504, another set of section transition boundaries is calculated based on second audio data. Next, at step 506, another set of section transition boundaries may optionally be calculated based on third audio data. Each of steps 502, 504, and 506 may be carried out simultaneously or sequentially in any order. Any number of additional sets of section transition boundaries may additionally be calculated in additional steps alongside steps 502, 504, 506, each for different audio data of the audio track.
Each of steps 502, 504, 506 (and any additional steps) corresponds to an iteration of method 200 of identifying section transition boundaries, using the first audio data, second audio data, and third audio data respectively. Accordingly, each of steps 502, 504, 506 (and any additional steps) returns a set of section transition boundaries identified from that step's audio data.
The first audio data, second audio data and (optional) third audio data (and so on) used in steps 502, 504, 506 (and so on) may comprise audio data which complementarily combine to comprise the audio track wholly or partially. As such, the audio data may be any of the groups described above with respect to figure 1b. For instance, each audio data may be a specific audio component, such as a specific stem. Alternatively, each audio data may comprise a filtered subset of the audio track, where each respective audio data is filtered differently to the other audio data. Alternatively still, if the audio track is a stereo audio track, the first audio data may comprise a normalised sum of the left and right channels of the stereo audio track, and the second audio data a normalised difference (side mix) of the left and right channels of the stereo audio track. Alternatively still, the first audio data may comprise a normalised sum of the left and right channels of stereo audio components, and the second audio data a normalised difference (side mix) of the left and right channels of the stereo audio components.
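For the stereo case, the normalised sum (mid) and normalised difference (side) may be derived as in this short sketch; the normalisation by peak absolute value is one possible convention.

```python
import numpy as np

def mid_side(left, right):
    """Normalised sum (mid) and normalised difference (side) of a stereo pair."""
    mid = (left + right).astype(float)    # sum of left and right channels
    side = (left - right).astype(float)   # difference (side mix) of the channels
    mid /= np.max(np.abs(mid)) + 1e-12
    side /= np.max(np.abs(side)) + 1e-12
    return mid, side
```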
Method 500 proceeds to step 508, where the set of section transition boundaries is updated in accordance with the sets of section transition boundaries calculated at steps 502, 504, 506 (and so on). Here, boundaries which have been identified in at least two of the sets of section transition boundaries calculated at steps 502, 504, 506 (e.g. boundaries that are within a threshold temporal distance of each other) are taken to comprise the updated set of section transition boundaries. However, boundaries which have been identified in only one of the sets calculated at steps 502, 504, 506 are discarded. Accordingly, the updated set of section transition boundaries calculated at step 508 comprises the section transition boundaries calculated at steps 504 or 506 which are temporally closest to the section transition boundaries calculated at any preceding step (e.g. step 502 for step 504, and step 502 or 504 for step 506 respectively). Accordingly, the updated set of section transition boundaries calculated at step 508 comprises section transition boundaries which tally between at least two audio data (e.g. at least two stems) within a threshold number of samples.
In some implementations, similar to step 210 of method 200 described above, the exact location of the section transition boundaries calculated at step 508 may be defined as the midpoint between the boundaries in each tallying pair calculated at steps 502, 504, and 506. Alternatively, the exact location of the section transition boundaries calculated at step 508 may be taken as the location of one section transition boundary in the respective pair. For instance, rhythmically significant components, such as bass and drum stems, may take precedence for the exact position of the section transition boundaries. By allowing rhythmically significant components to take precedence, the calculation of the ultimate section transition boundaries at step 508 is further improved because the rhythmically significant boundaries are more likely to occur at the start of the first bar of a new section.
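An illustrative sketch of the tallying at step 508 follows; the 0.25-second tolerance and the optional precedence index (e.g. pointing at a drum-stem boundary set) are assumptions for the example.

```python
def tally_boundaries(boundary_sets, tolerance=0.25, precedence=None):
    """Keep only boundaries confirmed by at least two of the per-audio-data
    sets; place each kept boundary at the midpoint of the tallying pair, or
    at the position from a `precedence` set (e.g. a rhythmically significant
    stem). Illustrative sketch only."""
    # Flatten the sets, tagging each boundary with the index of its set.
    tagged = sorted((b, i) for i, bs in enumerate(boundary_sets) for b in bs)
    updated = []
    for b, i in tagged:
        partners = [o for o, j in tagged if j != i and abs(o - b) <= tolerance]
        if not partners:
            continue  # identified in only one set: discarded
        nearest = min(partners, key=lambda o: abs(o - b))
        pos = b if precedence == i else (b + nearest) / 2.0
        if not updated or pos - updated[-1] > tolerance:  # avoid double-counting a pair
            updated.append(pos)
    return updated
```

For example, `tally_boundaries([guitar_b, bass_b, drums_b], precedence=2)` might keep only boundaries confirmed by at least two stems while preferring the drum-stem positions where the confirmed boundary originates from that stem.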
A benefit of checking the section transition boundaries against a multiplicity of audio data, as per the method 500 of figure 5, is that any section transition boundaries identified by considering one set of audio data are checked against section transition boundaries identified through another set of audio data. Thus, the checking method 500 of figure 5 filters out section transition boundaries which may appear to be present in one audio data but in actuality do not correspond to section transition boundaries of the audio track as a whole (or which correspond to higher order, more detailed, section transition boundaries). For instance, if the first audio data comprises a guitar stem and the guitar enters the audio track mid-verse, then the set of section transition boundaries calculated at step 502 may include a boundary at that mid-verse location. This mid-verse location may comprise a section transition when taken at a high level of detail. However, by checking the set of section transition boundaries calculated at step 502 against another set of section transition boundaries, based on second audio data (e.g. a bass stem) calculated at step 504, more detailed section transition boundaries than desired can be identified and discarded. In some implementations, a user may input the level of detail to which the section transition boundaries may be identified. For instance, this user input may identify a number of section transition boundaries to be returned (which may be a range).
Figures 6a and 6b illustrate example section transport controls which may be provided within a Graphical User Interface (GUI) according to some implementations. Figure 6a depicts section transport controls for looping the current section 602; return to start of current section 604; skip to end of current section 606; return to start of the track 608; or skip to start of the next track 610. These section transport controls may be presented within a music player application alongside the conventional play, pause, fast-forward, rewind controls and so on.
Additionally, as depicted in figure 6b, the sections within the song may be presented as a plurality of coloured boxes 612 running along the screen, with the width of each box corresponding to the duration of the associated section. Boxes coloured in the same or similar way (e.g. coloured with the same or similar hues) relate to sections that are similar within the song, such as verses and choruses.

The colour could relate to the relative energy or energy level occupancy within the different sections. Sections of the same type may be numbered or indexed to distinguish one similar section from another, such as verse 1 and verse 2. The plurality of coloured boxes 612 may be selected (via a user interface) to mute or unmute the associated section. A long click (or another distinct user input) may be used to loop an associated section. A dragged click (or another distinct user input) may be used to change the order of the sections. Furthermore, other distinct user inputs may be used to duplicate, save (locally or otherwise), or delete sections.
For instance, as depicted in figure 6c, responsive to an input from a user, one of the boxes 614 may be muted. To indicate that the section associated with box 614 is muted, the box may lose its colour or hue (e.g. turn black or grey). In this example, upon playback of the audio track, the section associated with box 614 may be muted or skipped.
In another example, as depicted in figure 6d, responsive to an input from a user, one or more of the boxes may be deleted. Consequently, the audio track may be shortened by the length of the section associated with the deleted box and the two sections either side of that section may adjoin.
In another example, as depicted in figure 6e, responsive to an input from a user, one or more of the boxes may additionally be repositioned or duplicated. Consequently, the audio track's arrangement upon playback is adapted accordingly.
If the audio track is modified (such as by deletion, repositioning or duplication of the boxes as per figures 6d and 6e), the new, modified audio track may be saved to a memory (local or otherwise) in response to a user input indicating that the modified audio track should be saved accordingly. Therefore, a user is able to return to and reload a previously modified audio track.
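A minimal sketch of such section-level editing on raw samples is shown below; boundary positions in seconds and the `order` list are illustrative conveniences rather than part of the disclosed interface.

```python
import numpy as np

def rearrange_sections(samples, boundaries_s, sample_rate, order):
    """Split an audio track at its section transition boundaries (in seconds)
    and reassemble the sections in `order`; omitting an index deletes a
    section, repeating an index duplicates it."""
    cuts = [0] + [int(b * sample_rate) for b in boundaries_s] + [len(samples)]
    sections = [samples[cuts[k]:cuts[k + 1]] for k in range(len(cuts) - 1)]
    return np.concatenate([sections[k] for k in order])
```

For instance, with three sections, `order=[0, 2, 2]` would delete the middle section and duplicate the final one, in the spirit of figures 6d and 6e; the result could then be written to a file to save the modified audio track.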
Accordingly, implementations of the present disclosure allow a computer to create a modified audio track which comprises metadata identifying the location of the section transition boundaries within the original audio track itself. The method of creating the modified audio track may include: identifying section transition boundaries of an audio track comprising a plurality of temporally sequential sections defined by the section transition boundaries by the methods disclosed herein; and creating a modified audio track which comprises the locations of the section transition boundaries. The method may further include saving the modified audio file.
Additionally, implementations of the present disclosure allow a computer to receive, while presenting the audio track at the audio output, an input from a user via an input mechanism to skip to a temporally adjacent section of the audio track. Responsive to receiving this input from the user, the computer may skip the presentation of the audio track at the audio output to the temporally adjacent section transition boundary.
Alternatively, implementations of the present disclosure allow a computer to receive, while presenting the audio track at the audio output, an input from a user via an input mechanism to loop the playback of a section of the audio track. Responsive to receiving the input from the user, the computer may loop playback back to the beginning of the section upon arrival at the end of the section.
Alternatively still, implementations of the present disclosure allow a computer to receive, while presenting the audio track at the audio output, an input from a user via an input mechanism to reposition, remove or duplicate a section of the plurality of sections temporally across the duration of the audio track. Responsive to receiving the input from the user, the computer may temporally reposition, remove or duplicate the section across the duration of the audio track respectively. Optionally, subsequent to temporally repositioning, removing or duplicating the section across the duration of the audio track, the computer may receive an input from a user via the input mechanism to save the updated audio track. Responsive to receiving this input to save the updated audio track, the computer may save the updated audio track (i.e. the audio track after the sections have been temporally repositioned, removed or duplicated).
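The transport behaviours described above might be sketched as follows; positions are in seconds, the boundary list is assumed sorted, and the function names are illustrative.

```python
def skip_target(position, boundaries, forward=True):
    """Timestamp playback should jump to when the user skips forward or back."""
    later = [b for b in boundaries if b > position]
    earlier = [b for b in boundaries if b < position]
    if forward:
        return later[0] if later else position     # skip to end of current section
    return earlier[-1] if earlier else 0.0         # return to start of current section

def advance(position, dt, loop_section=None):
    """Advance the playhead by `dt`; wrap to the section start when looping."""
    position += dt
    if loop_section is not None and position >= loop_section[1]:
        position = loop_section[0] + (position - loop_section[1])
    return position
```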
Alternatively still, implementations of the present disclosure allow a computer to receive an input from a user via the input mechanism to save a section of the plurality of sections within the audio track individually. Responsive to receiving the input from the user, the computer may save the section of the plurality of sections.
The approaches and methods described herein may be embodied on a computer-readable medium, which may be a non-transitory computer-readable medium. The computer-readable medium may carry computer-readable instructions arranged for execution upon a processor so as to make the processor carry out any or all of the methods described herein.
The term "computer-readable medium" as used herein refers to any medium that stores data and/or instructions for causing a processor to operate in a specific manner. Such storage medium may comprise non-volatile media and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory.
Exemplary forms of storage medium include a floppy disk, a flexible disk, a hard disk, a solid-state drive, a magnetic tape or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with one or more patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, and any other memory chip or cartridge.
Figure 7 illustrates a block diagram of one implementation of a computing device 700 within which a set of instructions, for causing the computing device to perform any one or more of the methodologies discussed herein, may be executed. In alternative implementations, the computing device may be connected (e.g., networked) to other machines in a Local Area Network (LAN), an intranet, an extranet, or the Internet. The computing device may operate in the capacity of a server or a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The computing device may be a personal computer (PC), a tablet computer, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single computing device is illustrated, the term "computing device" shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example computing device 700 includes a processing device 702, a main memory 704 (e.g., read-only memory (ROM), flash memory, dynamic random-access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 706 (e.g., flash memory, static random-access memory (SRAM), etc.), and a secondary memory (e.g., a data storage device 718), which communicate with each other via a bus 730.
Processing device 702 represents one or more general-purpose processors such as a microprocessor, central processing unit, or the like. More particularly, the processing device 702 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 702 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. Processing device 702 is configured to execute the processing logic (instructions 722) for performing the operations, methods and steps discussed herein.
The computing device 700 may further include a network interface device 708. The computing device 700 also may include a video display unit 710 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 712 (e.g., a keyboard or touchscreen), a cursor control device 714 (e.g., a mouse or touchscreen), and an audio device 716 (e.g., a speaker). The alphanumeric input device 712 and the cursor control device 714 may be considered together as a single input mechanism.
The data storage device 718 may include one or more machine-readable storage media (or more specifically one or more non-transitory computer-readable storage media) 728 on which is stored one or more sets of instructions 722 embodying any one or more of the methodologies or functions described herein. The instructions 722 may also reside, completely or at least partially, within the main memory 704 and/or within the processing device 702 during execution thereof by the computer system 700, the main memory 704 and the processing device 702 also constituting computer-readable storage media.
The various methods described above may be implemented by a computer program. The computer program may include computer code arranged to instruct a computer to perform the functions of one or more of the various methods described above. The computer program and/or the code for performing such methods may be provided to an apparatus, such as a computer, on one or more computer readable media or, more generally, a computer program product. The computer readable media may be transitory or non-transitory. The one or more computer readable media could be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium for data transmission, for example for downloading the code over the Internet. Alternatively, the one or more computer readable media could take the form of one or more physical computer readable media such as semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random-access memory (RAM), a read-only memory (ROM), a rigid magnetic disc, and an optical disk, such as a CD-ROM, CD-R/W or DVD.
In an implementation, the modules, components and other features described herein can be implemented as discrete components or integrated in the functionality of hardware components such as ASICs, FPGAs, DSPs or similar devices.
A "hardware component" is a tangible (e.g., non-transitory) physical component (e.g., a set of one or more processors) capable of performing certain operations and may be configured or arranged in a certain physical manner. A hardware component may include dedicated circuitry or logic that is permanently configured to perform certain operations. A hardware component may be or include a special-purpose processor, such as a field programmable gate array (FPGA) or an ASIC. A hardware component may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations.
Accordingly, the phrase "hardware component" should be understood to encompass a tangible entity that may be physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein.
In addition, the modules and components can be implemented as firmware or functional circuitry within hardware devices. Further, the modules and components can be implemented in any combination of hardware devices and software components, or only in software (e.g., code stored or otherwise embodied in a machine-readable medium or in a transmission medium).
Machine learning techniques may be employed to optimise any of the parameters of the present disclosure -such as any of the threshold values -through the training of a computational neural network on example training data, for example. As such, a database of past operations may be provided, either locally or at a remote content management system. Once the parameters have been trained by machine learning techniques for a given type or genre of audio track, further active machine learning need not be applied.
Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "receiving", "determining", "comparing", "enabling", "maintaining", "identifying" or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
It will be understood that certain terminology is used in the preceding description for convenience and is not limiting. The terms "a", "an" and "the" should be read as meaning "at least one" unless otherwise specified. The term "comprising" will be understood to mean "including but not limited to" such that systems or methods comprising a particular feature or step are not limited to only those features or steps listed but may also comprise features or steps not listed. Equally, terms such as "over", "under", "front", "back", "right", "left", "top", "bottom", "side", "clockwise", "anti-clockwise" and so on are used for convenience in interpreting the drawings and are not to be construed as limiting. Additionally, any method steps which are depicted in the figures as carried out sequentially, without causal connection, may alternatively be carried out in any order. Further, any method steps which are depicted as dashed or dotted flowchart boxes are to be understood as being optional.
The above description is intended to be illustrative, and not restrictive. Many other implementations will be apparent to those of skill in the art upon reading and understanding the above description. Although the present disclosure has been described with reference to specific example implementations, it will be recognized that the disclosure is not limited to the implementations described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.

Claims (25)

  1. A computer-implemented method of identifying section transition boundaries within an audio track, wherein the audio track comprises first audio data, the first audio data comprising musical information across a plurality of discrete frequency bands and across a plurality of timesteps spanning a duration of the audio track, the method comprising: calculating, at a plurality of timesteps: a first statistical analysis of frequency bands in the plurality of discrete frequency bands within a first energy band of the first audio data, and a second statistical analysis of frequency bands in the plurality of discrete frequency bands within a second energy band of the first audio data, the second energy band being different from the first energy band; the method further comprising: identifying a first set of section transition markers by chronologically comparing the first statistical analysis across multiple timesteps; identifying a second set of section transition markers by chronologically comparing the second statistical analysis across multiple timesteps; and defining a master set of section transition markers comprising section transition markers which are present in both the first set of section transition markers and the second set of section transition markers.
  2. The computer-implemented method of claim 1, wherein either: the master set of section transition markers comprises a set of section transition boundaries; or the method further comprises: prior to calculating the first statistical analysis and the second statistical analysis, calculating a temporal location of a set of hitpoints within the first audio data, and after defining a master set of section transition markers, calculating a set of section transition boundaries, the set of section transition boundaries comprising hitpoints from the set of hitpoints that are temporally closest to each section transition marker in the master set of section transition markers.
  3. The computer-implemented method of claim 1 or claim 2, wherein the first statistical analysis and the second statistical analysis comprise a summation.
  4. The computer-implemented method of any of claims 1-3, wherein chronologically comparing the first statistical analysis across multiple timesteps comprises identifying a difference of the first statistical analysis across succeeding timesteps that exceeds a first threshold difference, and chronologically comparing the second statistical analysis across multiple timesteps comprises identifying a difference of the second statistical analysis across succeeding timesteps that exceeds a second threshold difference.
  5. The computer-implemented method of any of claims 1-4, further comprising, prior to calculating the first statistical analysis and the second statistical analysis, applying a lowpass filter to the first audio data.
  6. The computer-implemented method of any preceding claim, wherein the first energy band comprises a high-energy level band and the second energy band comprises a mid-energy level band.
  7. The computer-implemented method of any preceding claim, wherein the musical information comprises a set of samples positioned sequentially to span the duration of the audio track.
  8. The computer-implemented method of any preceding claim, wherein calculating the temporal location of the set of hitpoints within the audio data comprises: filtering the first audio data to produce a filtered audio data; and identifying local maxima in a third statistical analysis of frequency bands in the plurality of discrete frequency bands within a third energy band.
  9. The computer-implemented method of claim 8, wherein filtering the first audio data comprises applying a band-pass filter to the first audio data.
  10. The computer-implemented method of any of claims 8 or 9, wherein the filtered audio data comprises a substantially isolated rhythmically significant audio component and wherein filtering the first audio data comprises isolating a rhythmically significant audio component from the first audio data.
  11. The computer-implemented method of any preceding claim, further comprising: calculating a predominant repetition timescale by either: autocorrelating the set of hitpoints, or autocorrelating the first audio data, to generate an autocorrelation comprising a set of autocorrelation peaks; and defining the predominant repetition timescale as the time-lag between the time-zero peak and the strongest peak different from the time-zero peak in the set of autocorrelation peaks.
  12. The computer-implemented method of claim 11, further comprising checking the set of section transition boundaries by: chronologically segmenting the first audio data into a plurality of succeeding temporal slices, wherein a start position of each slice is an integer number of predominant repetition timescales away from the first entry in the set of hitpoints; for each slice in the plurality of succeeding slices, calculating a statistical comparison of the energy content per frequency band of one slice and the energy content per frequency band of the next succeeding slice; and updating the set of section transition boundaries to include only entries which correspond to succeeding chronological slices at which the statistical comparison exceeds a threshold amount.
  13. The computer-implemented method of claim 12, wherein each slice in the plurality of succeeding slices has a duration equal to a power of two multiple of the predominant repetition timescale, optionally the predominant repetition timescale, half the predominant repetition timescale, or double the predominant repetition timescale.
  14. The computer-implemented method of any preceding claim, wherein the audio track further comprises second audio data, the second audio data comprising musical information across an additional plurality of discrete frequency bands and across the plurality of timesteps spanning the duration of the audio track, the method further comprising: calculating a temporal location of an additional set of hitpoints within the second audio data; calculating, at a plurality of timesteps: an additional first statistical analysis of frequency bands in the plurality of discrete frequency bands within a first energy band of the second audio data, and an additional second statistical analysis of frequency bands in the plurality of discrete frequency bands within a second energy band of the second audio data, the second energy band of the second audio data being different from the first energy band of the second audio data; the method further comprising: identifying an additional first set of section transition markers by chronologically comparing the additional first statistical analysis across multiple timesteps; identifying an additional second set of section transition markers by chronologically comparing the additional second statistical analysis across multiple timesteps; defining an additional master set of section transition markers comprising section transition markers which are present in both the additional first set of section transition markers and the additional second set of section transition markers; and updating the set of section transition boundaries, the updated set of section transition boundaries comprising an additional plurality of hitpoints from the additional set of hitpoints that are temporally closest to each section transition marker in the additional second set of section transition markers.
  15. The computer-implemented method of claim 14, wherein the first audio data defines an audio component, and the second audio data defines a different audio component.
  16. The computer-implemented method of claim 14, wherein the audio track comprises a stereo audio track, wherein the first audio data comprises a normalised sum of the left channel of the stereo audio track and the right channel of the stereo audio track, and the second audio data comprises a normalised difference of the left channel of the audio track and the right channel of the audio track.
  17. The computer-implemented method of claim 14, wherein the first audio data comprises a filtered subset of the audio track and the second audio data comprises a differently filtered subset of the audio track.
  18. The computer-implemented method of any preceding claim, further comprising creating a modified audio track which comprises the audio track and the section transition boundaries.
  19. A computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output, wherein the audio track comprises a plurality of temporally sequential sections defined by section transition boundaries, wherein the section transition boundaries are identified by the method of any of claims 1-18, the method comprising: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to skip to a temporally adjacent section of the audio track; and responsive to receiving the input from the user, skipping the presentation of the audio track at the audio output to the temporally adjacent section transition boundary.
  20. A computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output, wherein the audio track comprises a plurality of temporally sequential sections defined by section transition boundaries, wherein the section transition boundaries are identified by the method of any of claims 1-18, the method comprising: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to loop the playback of a section of the audio track; and responsive to receiving the input from the user, looping the playback back to the beginning of the section on arrival at the end of the section.
  21. A computer-implemented method for manipulating, via an input mechanism, an audio track being presented at an audio output, wherein the audio track comprises a plurality of temporally sequential sections defined by section transition boundaries, wherein the section transition boundaries are identified by the method of any of claims 1-18, the method comprising: receiving, while presenting the audio track at the audio output, an input from a user via the input mechanism to reposition, remove or duplicate a section of the plurality of sections temporally across the duration of the audio track; and responsive to receiving the input from the user, temporally repositioning, removing or duplicating the section across the duration of the audio track.
  22. The computer-implemented method of claim 21, further comprising: subsequent to temporally repositioning, removing or duplicating the section across the duration of the audio track, receiving a save input from a user via the input mechanism to save the audio track; and responsive to receiving the save input from the user, saving the audio track which includes the temporally repositioned, removed or duplicated sections.
  23. A computer-implemented method for manipulating, via an input mechanism, an audio track, wherein the audio track comprises a plurality of sections defined by section transition boundaries, wherein the section transition boundaries are identified by the method of any of claims 1-17, the method comprising: receiving an input from a user via the input mechanism to save a section of the plurality of sections within the audio track individually; and responsive to receiving the input from the user, saving the section of the plurality of sections.
  24. A computer readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method of any preceding claim.
  25. A system with an internal memory, a processor, an input mechanism, a display and an audio output, the processor configured to perform the method of any of claims 1-23.
GB2201361.9A 2022-02-02 2022-02-02 Methods and systems for analysing an audio track Pending GB2615321A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2201361.9A GB2615321A (en) 2022-02-02 2022-02-02 Methods and systems for analysing an audio track
PCT/EP2023/052613 WO2023148299A1 (en) 2022-02-02 2023-02-02 Methods and systems for analysing an audio track

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2201361.9A GB2615321A (en) 2022-02-02 2022-02-02 Methods and systems for analysing an audio track

Publications (2)

Publication Number Publication Date
GB202201361D0 GB202201361D0 (en) 2022-03-16
GB2615321A true GB2615321A (en) 2023-08-09

Family

ID=80621104

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2201361.9A Pending GB2615321A (en) 2022-02-02 2022-02-02 Methods and systems for analysing an audio track

Country Status (2)

Country Link
GB (1) GB2615321A (en)
WO (1) WO2023148299A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201042537A (en) * 2009-05-27 2010-12-01 Hon Hai Prec Ind Co Ltd Method for positioning playback of audio data and electronic system utilizing the same
WO2012091935A1 (en) * 2010-12-30 2012-07-05 Dolby Laboratories Licensing Corporation Repetition detection in media data
US9653095B1 (en) * 2016-08-30 2017-05-16 Gopro, Inc. Systems and methods for determining a repeatogram in a music composition using audio features

Also Published As

Publication number Publication date
GB202201361D0 (en) 2022-03-16
WO2023148299A1 (en) 2023-08-10
