WO2005101243A1 - Method and apparatus for identifying audio such as music - Google Patents

Method and apparatus for identifying audio such as music

Info

Publication number
WO2005101243A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
fingerprint
query
location
file
Prior art date
Application number
PCT/SG2004/000092
Other languages
French (fr)
Inventor
Kue Leong Yeo
Kok Seng Chong
Sua Hong Neo
Kim Hann Kuah
Original Assignee
Matsushita Electric Industrial Co. Ltd.
Application filed by Matsushita Electric Industrial Co. Ltd. filed Critical Matsushita Electric Industrial Co. Ltd.
Priority to PCT/SG2004/000092 priority Critical patent/WO2005101243A1/en
Publication of WO2005101243A1 publication Critical patent/WO2005101243A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 - Querying
    • G06F 16/632 - Query formulation
    • G06F 16/634 - Query by example, e.g. query by humming

Definitions

  • the invention relates to identifying audio, such as music, in particular by comparing certain properties of the sound with corresponding properties in a database derived from known sounds.
  • Descriptive information pertaining to audio data can be and is coupled with the media content. This is known as metadata. Such metadata may contain the artist name, composer name, title and other data. End users on the other hand can easily alter this data, thus rendering the method insecure.
  • Watermarking technologies have emerged as an alternative method. These embed digital information into the audio file without altering the acoustic content perceived by the end user. This digital information translates to a unique identifier of the song. The audio quality of the watermarked song versus the robustness of the watermarking algorithm against hacking has been the central issue. However, watermarking requires a prior embedding of digital information into an audio file, so all songs and other audio already on the market cannot now usefully be protected in this manner.
  • Content-based identification systems differ from the above technologies.
  • a content-based recognition system analyses the acoustic properties of the audio file and develops unique fingerprints. Training of the audio files is achieved by suitable extraction of unique acoustic properties, also known as audio fingerprints. Caution must be exercised in the selection of the acoustic properties as it affects the system's robustness and database size. Following the training process is the database building process, where all the audio fingerprints are indexed.
  • the matching process is similar to the training process.
  • An audio fingerprint is extracted from a clip of an original song.
  • the matching algorithm performs a calculation of the similarity between the extracted query audio fingerprint and the ones in the database by means of a similarity function.
  • the best-matched audio fingerprint yields the highest similarity measure and the corresponding song is deemed the matched song.
  • the best search strategy to employ is to do an exhaustive search.
  • an exhaustive search can be prohibitive for a large database.
  • a method of generating and matching hashes is described in U.S. Patent Application No. 2002/0,178,410A1, published on 28 November 2002 in the name of Haitsma et al.
  • the hashes are short summaries of data files, which can be used to identify the file.
  • the method generates robust hashes for multimedia content.
  • the audio clip is divided into successive frames. For each frame, the frequency spectrum is divided into bands.
  • a robust property of each band is computed and represented by a concatenation of binary hash words, one for each frame.
  • To identify a possibly compressed database audio file, a block of derived hash words is matched by a computer against a large database.
  • the database consists of a lookup table (LUT), which stores the entry of each hash word. Each entry points to the song and the location where the respective hash word occurs. From the lookup table, the Hamming distance between the query hash words and the database song is calculated. The Hamming distance determines whether the query clip is similar to the database song.
  • LUT lookup table
  • a recognition system should exhibit a fast matching speed. This parameter becomes one of the important factors to consider in implementing a matching algorithm. Coupled to the above problem is the consideration of storage requirements. Often, the storage requirements determine the exactness or the compactness of a certain audio feature.
  • a method of generating a database from audio files, for use in audio identification of a query file comprises receiving one or more audio files, for individual received audio files, extracting an audio fingerprint of the audio file and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint, and storing the one or more audio fingerprints and location tables in a database.
  • a method of matching an audio query against one or more known audio files comprises: receiving the query, extracting a query fingerprint of the query, receiving an audio fingerprint of an audio file, identifying one or more data strings of a predetermined category from the query fingerprint; identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
  • a method of matching an audio query file against an audio file comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; determining a similarity measure between the query fingerprint and the audio fingerprint for the one or more different alignments; and determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
  • a method of matching an audio query file against one or more known audio files comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
  • a computer program product having a computer program recorded on a computer readable medium, for instructing a computer to operate according to the method of any of the first to fourth aspects.
  • Figure 1 is a schematic view of sound identifying apparatus according to an embodiment of the invention.
  • Figure 2 is a flowchart illustrating an exemplary overview process of the training of an audio database according to an embodiment
  • Figure 3 is a flowchart illustrating an exemplary overview process of the matching of audio according to an embodiment
  • Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module of Figure 1 ;
  • Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file;
  • Figure 6 is an illustration of an exemplary arrangement of an audio fingerprint
  • Figure 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps
  • FIG. 8 is a flowchart relating to the generation of the TTS maps from an audio fingerprint
  • Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match between the query file and audio files in a database
  • Figure 10 is an illustration of a query fingerprint and certain aspects of an exemplary method of determining possible locations against which to match an audio file with the query file;
  • Figure 11 is a flowchart relating to a process for determining a file BER
  • Figure 12 is a flowchart of an alternative BER calculation procedure
  • Figure 13 is a flowchart of a process for determining a best match using a statistical matching process
  • Figure 14 is a flowchart relating to an example of a specific portion of the process of Figure 13;
  • Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a current count table interrelate in the processes of Figure 13 and 14; and Figure 16 is an illustration of a computer system for implementing the apparatus and processes associated with the exemplary embodiments
  • LSSM Long String of Sub-band Match
  • An audio identification process has two stages: a database generation stage and a matching stage.
  • in the database generation stage the system extracts an audio fingerprint and robust codes for a database of audio files.
  • a set of maps of all possible permutations of the codes is generated, storing all the locations of the different codes.
  • in the matching stage the system performs a similarity measure between the query and the database files by aligning the query at the locations indicated by the maps.
  • a bit error rate (BER) is computed for every audio file. The audio file with the lowest song BER is the matching audio file.
  • Figure 1 is a schematic view of sound identifying apparatus 10 according to an embodiment of the invention.
  • the sound identifying apparatus 10 has a feature extraction module 12 with two inputs, a training audio file input 14 for inputting an audio input signal SI and a query input 16 for inputting a query input signal S2.
  • the feature extraction module 12 also has two outputs, one carries an extracted audio feature signal S3 as an input to a feature database 18 and the other carries an extracted query feature signal S4 as an input to a matching module 20.
  • the feature database 18 contains an audio fingerprint database 22 and a Temporal Transition Code (TTS) map database 24.
  • TTS Temporal Transition Code
  • the features database 18 is also arranged to provide a selected extracted audio feature signal S5 as an input to the matching module 20.
  • the matching module 20 is arranged to output a match identification signal S6.
  • the audio identification process comprises two stages, a database generation stage, which is a training operation, and an identification stage.
  • a database generation stage which is a training operation
  • the database generation stage usually occurs only once, possibly with occasional updates, whilst the identification stage occurs every time a new item needs identifying.
  • the two stages may also occur some time apart.
  • Figure 2 is a flowchart illustrating an exemplary overview process (S100) of the training of an audio database, for instance in the apparatus of Figure 1.
  • This process (S100) generates the contents of the feature database 18 of Figure 1; this is the database generation stage.
  • Figure 3 is a flowchart illustrating an exemplary overview process (S120) of the matching of audio, for instance using the apparatus of Figure 1.
  • the process (S120) is used to identify an unknown piece of audio, a so-called query; this is the identification stage.
  • the sound identifying apparatus 10 receives a training audio file within the audio input signal S1 at the audio input 14 (step S102).
  • the feature extraction module 12 processes the input training audio file and translates relevant information to extract an input audio fingerprint (step S104).
  • the input training audio file may be converted into a mono-training audio file and sampled at a certain sampling frequency, such as 44.1kHz.
  • a set of TTS maps for the input training audio file is generated based on the extracted audio fingerprint (step S106).
  • the extracted audio fingerprint and the generated TTS maps are sent to storage in the audio fingerprint database 22 and the TTS Map database 24, respectively, within the feature database 18. Metadata identifying the file and its author may also be sent to and stored in the feature database 18.
  • A check is then made to determine if the current training audio file is the last one to be processed for training (step S110). If this is not the last file, the next file is selected (step S112) and the process returns to step S104 to process the next training audio file. This process is therefore repeated for as many training audio files as are input to the apparatus 10 or in the database from which the input training audio files are drawn. If the current training audio file is the last file, the training process ends.
  • the sound identifying apparatus 10 receives an input query file in a query input signal S2 at the query input 16 (step S122).
  • the feature extraction module 12 processes the input query file and translates relevant information to extract a query fingerprint (step S124).
  • the matching module 20 obtains necessary TTS Maps and audio fingerprints from the feature database (step S126).
  • the feature extraction module 12 also obtains certain query TTS codes from the query fingerprint (step S128).
  • the query fingerprint and query TTS codes are input to the matching module 20 as the extracted query feature signal S4.
  • the matching module 20 determines a best match between the query file and the training audio files used to generate the contents of the feature database 18 (step S130) and outputs the result. The matching process then ends.
  • a similarity measure such as a Bit Error Rate (BER) is used to determine the similarity between the query file and the training audio files whose audio fingerprints are in the database 18 by means of comparing the extracted features.
  • the BER for this purpose may be the ratio between the number of error bits and the total number of bits.
  • an audio finge ⁇ rint is extracted by calculating the difference in the multi-band energy between successive frames.
  • Each frame consists of several data samples. Consecutive frame features an overlapped in the data samples (previous samples and new samples) and each frame undergoes a Fast Fourier Transform (FFT). From the FFT, differential multi-band energies are extracted. Multi-band energies are calculated as the differences in energies between successive frames. These differential multi-band energies constitute the audio finge ⁇ rint processed at the particular frame, in short, the differential multi-band energies relating to time.
  • the audio finge ⁇ rint generated is the concatenation of several frames of multi-band energy in sequential order.
  • the audio identification process consists of two stages, namely a database generation stage where the system extracts an audio finge ⁇ rint and robust codes for every audio file. These codes are robust but they are not unique. Thus a set of maps comprising all possible permutations of the said codes are generated, storing all the locations of the different codes.
  • Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module 12 of Figure 1.
  • Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file, for instance as may be involved in step S104 of Figure 2. This is also an example of the operation of the feature extraction module 12 of Figure 1 and as shown in detail in Figure 4.
  • An audio file is received as a time domain, pulse code modulated (PCM) sample.
  • the received input audio file is divided into sample frames (step S142) by a framing circuit 22, with a subsequent frame overlapping a current frame to ensure high correlation between frames.
  • the preferred frame size is 1024 samples with each sample in the frame multiplied by a window function such as a Hamming window.
  • Each frame undergoes a spectral transformation (step S144), such as an FFT, by way of a transform module 24, to provide a spectral representation.
  • the spectral information may be band limited.
  • a recommended range is from 300 Hz to 3 kHz for audio.
  • the band limiting of the training audio file's spectral information into a specific frequency range has the advantage of reducing storage requirements by eliminating redundancy.
  • a range that is most relevant to the human auditory system is generally selected.
  • the frequency range obtained is further processed into absolute values (step S146) by a magnitude generator 26 and separated into frequency bands (step S148) by a band separator 28.
  • each band has the same bandwidth, for example of 450Hz.
  • Suitable audio properties are extracted (step S150) to represent the training audio file.
  • energy is used as it exhibits robustness to various processing distortions.
  • the absolute spectral values in the individual frequency bands are squared and summed to obtain the energies of each band separately, in an energy extractor module 30.
  • This generates a set of multiband energies 32 for that particular frame.
  • the multiband energies 32 are stored in a one-frame delay 34 for each band (step S152).
  • the multiband energy 32 for each band in the current frame N is also subtracted (step S154) from the energy value for the same band of the previous frame N-1, at an adder 36, to produce a set of difference values 38.
  • a threshold module 40 compares the difference value 38 for each band of the frame with a threshold value (step S156), for instance 0.1, to produce a binary representation of each band. If the energy difference value 38 is above the energy threshold, binary '1' is used to represent the particular band. If the energy difference value 38 is below the energy threshold, binary '0' is used to represent the particular band.
  • the thresholds are all the same, although they may be adaptively controlled, based on the distortion or background noise in the environment in which the audio file is recorded. This might be achieved by sampling the audio file at various points.
  • a frame concatenator 42 generates an x-bit code word 44 for the frame (step S158) by concatenating the binary representation of the energy difference across bands.
  • a fingerprint concatenator 46 forms the audio fingerprint 50 by concatenating the code words 44 from sequential frames in order (step S160) for the training audio file.
  • the number of bits, x, in the code word 44 is generally the same as the number of bands (although it may differ if the data from two or more bands merge or if a band splits).
  • a typical number of bands is 6, as this balances robustness in the system with computation requirements.
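As an illustration of the extraction steps just described (framing, windowing, FFT, band-limited energies, differential thresholding), the following minimal Python sketch produces one binary code word per frame. It assumes the example parameters mentioned above (1024-sample frames, a Hamming window, a 300 Hz to 3 kHz range split into six equal bands, a threshold of 0.1) plus a 50% frame overlap that the text does not specify; it is an illustrative sketch, not the patented implementation.

```python
import numpy as np

def extract_fingerprint(samples, sample_rate=44100, frame_size=1024, hop=512,
                        f_lo=300.0, f_hi=3000.0, n_bands=6, threshold=0.1):
    """Per-frame binary code words from differential multi-band energies.

    Returns a (n_frames - 1, n_bands) array of 0/1 values; row k is the
    code word for frame k+1 (band 0 in column 0)."""
    window = np.hamming(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    band_edges = np.linspace(f_lo, f_hi, n_bands + 1)

    energies = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        spectrum = np.abs(np.fft.rfft(samples[start:start + frame_size] * window))
        # Band-limit to f_lo..f_hi and sum squared magnitudes per sub-band.
        energies.append([np.sum(spectrum[(freqs >= lo) & (freqs < hi)] ** 2)
                         for lo, hi in zip(band_edges[:-1], band_edges[1:])])
    energies = np.asarray(energies)

    # Differential energy between successive frames, thresholded to bits.
    diff = energies[1:] - energies[:-1]
    return (diff > threshold).astype(np.uint8)
```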
  • Figure 6 is an illustration of an exemplary arrangement of an audio fingerprint 50.
  • the fingerprint is a two-dimensional array of binary values.
  • the horizontal direction 52 represents the differential energy contour for each frame and the vertical direction 54 represents the differential energy evolution across frames for a particular band.
  • the vertical direction is also synonymous with the time direction information.
  • the audio finge ⁇ rint's first column is labelled as band 0, followed by band 1, through to the last band, band 5.
  • the band relates to the energy at different parts of the frequency spectrum.
  • the first row is indexed as frame 0, followed by frame 1, etc.
  • the frame index relates to time dimension information, with the number of frame indices depending on the length of the training audio file.
  • TTS code Temporal Transition Code
  • a map is generated to locate all possible locations of a particular TTS code in a particular audio file.
  • this map is also known as the Temporal Transition Map (TTS Map).
  • TTS Map Temporal Transition Map
  • the frame index gives a time representation of the TTS code and is continuous until the end of the audio fingerprint 50.
  • FIG. 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps.
  • Figure 7 illustrates an example of a set of TTS maps 60, 62, 64, 66, 68, 70, a separate one for each band of the audio fingerprint 50.
  • the uppermost TTS map 60 is for band 0 of the audio fingerprint 50.
  • the second TTS map 62 is for band 1 of the audio fingerprint 50, and so on.
  • Each individual TTS map 60 - 70 is capable of recording all possible permutations of TTS codes of the respective band.
  • Each TTS map 60 - 70 contains details of the frame locations 72 of the individual TTS codes within the relevant band for the associated audio fingerprint 50.
  • the TTS codes are extracted from the audio fingerprint 50 relying on the principle of high correlation across frames.
  • the respective TTS codes are derived as a y-bit binary representation 74 consisting of a concatenation of the binary digit for that frame in the audio fingerprint 50 and the binary digits in the same band for the y-1 immediately succeeding frames.
  • the TTS maps 60 - 70 contain a list 80 of all possible values of the TTS code, from 0 to 2^y - 1.
  • the frame number of that frame is then associated with that code within the TTS map.
  • the same TTS code may appear more than once in the same band of an audio fingerprint 50. Thus more than one frame number may be associated with each TTS code value.
  • Figure 8 is a flowchart relating to the generation of the TTS codes from the audio fingerprint 50. This is an example of the process of step S106 within the flowchart of Figure 2. Figure 8 is described with reference to the illustration of Figure 7.
  • the binary representation of the TTS code 74 is extracted (step S174).
  • the TTS code value in the code value list 80 that corresponds to the current extracted TTS code for the current frame index is located.
  • the current frame index 72 is recorded, associated with the TTS code value for that frame index, in the TTS map (step S178).
  • the frame index value "0" is stored in the TTS map 60 associated with the TTS code value 333.
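To make the mapping concrete, here is a small Python sketch of TTS map construction along the lines described above. The 9-bit code length is inferred from the binary examples given elsewhere in the text (e.g. 010011010 = 154); the function and variable names are illustrative, and the input is assumed to be the frames-by-bands binary array from the earlier extraction sketch.

```python
from collections import defaultdict

def build_tts_maps(fingerprint, y=9):
    """One TTS map per band: {code value: [frame indices where it starts]}.

    'fingerprint' is the (frames x bands) binary array from the extraction
    sketch above; each code concatenates a band's bit for a frame with the
    bits of the y-1 immediately succeeding frames."""
    n_frames, n_bands = fingerprint.shape
    tts_maps = []
    for band in range(n_bands):
        band_bits = fingerprint[:, band]
        code_map = defaultdict(list)
        # A code can start at any frame that still has y-1 succeeding frames.
        for frame in range(n_frames - y + 1):
            code = 0
            for bit in band_bits[frame:frame + y]:
                code = (code << 1) | int(bit)
            code_map[code].append(frame)
        tts_maps.append(code_map)
    return tts_maps
```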
  • the training processes described above with reference to Figures 1, 2 and 4 to 8 encompass the training phase of the recognition system.
  • although processing complexity is encountered at certain phases of the training process, the increased complexity does not affect the matching time during a recognition operation, as the relevant processes are conducted prior to matching.
  • the main concern during the training phase is the selection of suitable audio fingerprints, as that contributes to the overall memory requirements of the system.
  • TTS maps and an audio fingerprint are created.
  • Each TTS map stores the TTS codes of each band of an audio fingerprint.
  • the TTS maps and the audio fingerprints make up the data of the features database 18 of the sound identifying apparatus 10 of this embodiment.
  • the TTS map may be loaded into RAM to achieve fast searching speed. If there is a memory constraint, the audio fingerprints need not be stored in RAM.
  • LSSM exploits the similarity between the actual TTS code and the query file's TTS codes to generate the TTS map.
  • a map of all possible 'correct' TTS codes is generated by concatenating the multi-band energies in time and shifting the TTS codes incrementally in time.
  • the result is a TTS map populated with enough information to enable efficient identification of the best match location of an audio file in milliseconds.
  • the TTS map streamlines the matching time required by the system as opposed to the exhaustive search.
  • Matching is a quick and simple process of accessing the TTS map with the help of the extracted TTS codes of the query file. From the extracted TTS codes, matching locations are located using the TTS map. Subsequently, a similarity measure is calculated between the query file and the database audio file. This process is looped for all the other extracted TTS codes.
  • the audio file that yields the highest similarity measure is the matched audio file.
  • a clip of unknown audio is input into the recognition system as a query file to identify that audio.
  • Matching is defined as locating a best matching database (training) audio file by determining the similarity between the query and the database audio files in the database.
  • an audio finge ⁇ rint is generated for the query, termed here a query finge ⁇ rint 100.
  • the query finge ⁇ rint 100 is extracted (in step SI 24 of Figure 2) by the same method as is used to extract the audio finge ⁇ rint 50.
  • the flowchart of Figure 5 and the associated description is applicable here too, with the file being processed being the query file, rather than a database audio file.
  • TTS codes for the query fingerprint 100 are also determined.
  • the query TTS codes are generated (in step S128 of Figure 3) by the same method as is used to extract the database audio TTS codes (step S174). They are extracted one by one as is described below; TTS maps do not need to be generated for the query file. However, if desired, all the query TTS codes could be extracted initially before any of them is used, possibly to be stored in a table.
  • prior to calculating a similarity measure between the query and the database audio files, the system determines the location at which the query best matches any database audio file. As the query may have been randomly extracted from an unknown source, the system has no idea where it should start its matching process. TTS codes are extracted from the query and possible locations for the query are determined by accessing the entries in the TTS maps.
  • Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match between the query file and audio files in the database.
  • Figure 9 illustrates an example of the processes of steps S126 to S130 of Figure 3.
  • the TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204).
  • a file BER for the current audio file is determined (step S206), for instance as is described later.
  • the file BER for the current database audio file is compared with the lowest file BER so far for this query (step S208). If the file BER for the current audio file is lower (for instance as would always happen if it is the first audio file being compared), then the file BER for the current audio file becomes the lowest BER (step S210). If the existing lowest file BER is lower, then that remains as the lowest file BER.
  • Figure 10 is an illustration of a query fingerprint 100 and certain aspects of an exemplary method of determining possible locations against which to match a database audio file with the query file.
  • Figure 11 is a flowchart relating to a process for determining a file BER (step S206 of Figure 9), including determining possible matching locations.
  • a query TTS code is obtained from the query fingerprint (step S224).
  • for audio file j, audio frame indices of the audio fingerprint having the same TTS code value as the query TTS code are identified, using the audio file TTS map for the same band (step S226).
  • the TTS map being used is also for band 0.
  • several possible frame indices in the audio fingerprint correspond to any particular TTS code value.
  • the TTS code for the first frame of the first band of the query fingerprint 100 is binary 010011010 (decimal 154).
  • the TTS code value 154 is shared by frame indices 1, 13 and 300.
  • a bit error rate is generated from the comparison of the two fingerprints (step S232).
  • the BER is defined as the number of mismatched bits divided by the total number of bits in the query fingerprint.
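In code form, the BER for one candidate alignment can be sketched as below; this is a straightforward rendering of the definition just given, assuming the query fingerprint falls entirely within the audio fingerprint at the chosen start frame (the names are illustrative).

```python
import numpy as np

def bit_error_rate(query_fp, audio_fp, start_frame):
    """BER of the query fingerprint overlaid on the audio fingerprint at
    start_frame; the caller ensures the query falls wholly inside the audio."""
    n = query_fp.shape[0]
    segment = audio_fp[start_frame:start_frame + n]
    return np.count_nonzero(query_fp != segment) / query_fp.size
```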
  • the system compares the just-determined, current BER with the file BER for this database audio file (step S234), that is with the lowest BER so far from the comparisons with this audio file. If the current BER is lower (as would always happen if it is the first BER determined for this particular pair of query and database audio file) then the current BER becomes the file BER (step S236). If the existing file BER is lower, then that remains as the file BER.
  • the system determines if there are any more locations to compare for the current query TTS code, that is whether the current audio match frame index is the last of the audio frame indices identified in step S226 (step S238). A similar check is made if step S231 determined that the query fingerprint did not fall entirely within the audio fingerprint.
  • If the current audio match frame index is not the last of the audio frame indices identified in step S226 and there are more locations to compare, the next of the audio frame indices identified in step S226 is set as the current audio match frame index (step S240). If the current audio match frame index is the last of the audio frame indices identified in step S226, then a check is made of whether the current query frame index is the last query frame index in the current query band (step S242). A similar check is made if step S227 determined that no frame indices had been identified in step S226. If the current query frame index is not the last query frame index in the current query band, the query frame index number is incremented by one, i.e. the next query frame index becomes the current one and the matching continues from step S224.
  • the operation proceeds for all frames and all bands of the query.
  • the location that yields the lowest BER is the location at which the query best matches the current database audio file.
  • comparison of fingerprints only occurs where the query fingerprint falls entirely within the audio fingerprint. In alternative embodiments, comparison may also occur where the query fingerprint falls partly outside the audio fingerprint (that is, it begins before the audio fingerprint or ends after the audio fingerprint). Such an operation might allow more robust processing for noisier queries.
  • the comparison is made at the overlapping areas and is made for that part of the query fingerprint that falls outside the audio fingerprint as if the audio fingerprint existed in that area with every bit equal to zero.
  • the TTS code 154 from the first frame of the query fingerprint 100 matches the TTS code for frame indices 1, 13 and 300 of the audio fingerprint.
  • the query fingerprint 100 is first overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 1 of the audio fingerprint, and a first BER is generated.
  • the query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 13 of the audio fingerprint, and a second BER is generated.
  • the query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 300 of the audio fingerprint, and a third BER is generated.
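Putting the pieces together, the per-file matching loop of Figure 11 can be sketched as follows: each query TTS code proposes candidate alignments via the stored TTS maps, a BER is computed at each in-range alignment, and the lowest value becomes the file BER. This is a minimal sketch under the same assumptions as the earlier snippets (9-bit codes, frames-by-bands binary fingerprint arrays); it is not the claimed implementation.

```python
import numpy as np

def file_bit_error_rate(query_fp, audio_fp, audio_tts_maps, y=9):
    """Lowest BER over all candidate alignments proposed by the TTS maps;
    returns 1.0 if no candidate keeps the query wholly inside the audio."""
    n_q = query_fp.shape[0]
    file_ber = 1.0
    for band, code_map in enumerate(audio_tts_maps):
        band_bits = query_fp[:, band]
        for q_frame in range(n_q - y + 1):
            # Query TTS code starting at q_frame in this band.
            code = 0
            for bit in band_bits[q_frame:q_frame + y]:
                code = (code << 1) | int(bit)
            for a_frame in code_map.get(code, []):
                start = a_frame - q_frame          # where query frame 0 would sit
                if start < 0 or start + n_q > audio_fp.shape[0]:
                    continue                       # query must fall wholly inside
                segment = audio_fp[start:start + n_q]
                ber = np.count_nonzero(query_fp != segment) / query_fp.size
                file_ber = min(file_ber, ber)
    return file_ber
```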
  • the system performs a similarity measure between the query file and the database audio files by aligning the query at the locations indicated by the maps.
  • a bit error rate (BER) is computed for every audio file. The audio file with the lowest audio file BER is the matching audio file.
  • the TTS code is extracted for every frame index of every band of the query fingerprint, and a BER is calculated for every match.
  • This can be computationally expensive.
  • the amount of matches tested can be reduced.
  • One way of doing this is to keep a record of every alignment made between the query fingerprint and the audio fingerprint, so that when a repeat alignment comes up, within the same band or between different bands, no BER is determined for the second or further occurrence of that alignment.
  • Another way of reducing the processing used is only to calculate the BER for alignments where the query fingerprint falls wholly within the audio fingerprint (in a time sense).
  • TTS codes might be taken from only one band, for instance the first, second or third band, etc.
  • This method is a progressive thresholding method, for determining a BER for different overlap alignments of the query fingerprint and the audio fingerprint (a substitute for step S232 of Figure 11).
  • the progressive thresholding method simplifies the calculation of the file BER by reducing the amount of unnecessary processing for BER calculations that takes place at clear non-matching locations.
  • FIG. 12 is a flowchart of a progressive threshold BER calculation procedure. The conceptual flow shown in Figure 12 is based on conducting an initial BER calculation on a smaller region of frames.
  • a number of frames M is initiated at a first predetermined level (step S252), which is less than the number of frames in the query fingerprint.
  • the number M may be determined as a percentage of the full query, for instance 10 - 15%, or set as a fixed number, e.g. 50 frames.
  • the frames that are used within the number of frames M do not need to include the matched frame. The more distorted the query is, the larger the number would ideally be.
  • the number can be adaptively controlled based on a determination of distortion in the query, based on a sampling of background noise. For the current overlap position of the query fingerprint and the audio fingerprint, the BER is calculated over a reduced area based on just the M frames (step S254).
  • the BER is calculated for a reduced area.
  • the M-based BER is compared with a BER threshold (step S256).
  • the BER threshold may also be fixed, for instance at 0.4, or adaptively controlled, based on the level of distortion in the query, especially if there is no adaptive control of the frame number M. If the M-based BER is not lower than the BER threshold, this sub-process ends, proceeding to step S238 of the process of Figure 11. If the M-based BER is lower than the BER threshold, a determination is then made as to whether the number M covers less than the whole of the query fingerprint (step S258).
  • the improved matching algorithm is used for every attempt at a match location and reduces the time required to perform BER calculations at locations that clearly do not match, by terminating the BER calculation prematurely.
  • the overall effect is an increase in matching speed.
  • the process proceeds from step S256 of Figure 12 to step S238 of Figure 11.
  • The current BER still stands as the BER for the reduced comparison set, but that is not a problem in step S234 of Figure 11, because it is still a high BER that would be bettered by any audio file for which the BER is calculated for the complete query fingerprint.
  • in one alternative, the number of frames M is incremented by M each time in step S260.
  • alternatively, the number of frames M is incremented immediately to the full number of frames in the query fingerprint the first time through step S260.
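A compact sketch of this progressive-threshold BER check is given below. The initial block size of 50 frames and the 0.4 threshold follow the examples in the text; the doubling growth schedule is only one reading of "incremented by M each time" (jumping straight to the full query, as just mentioned, is the other), and the caller is assumed to have verified that the query fits within the audio fingerprint at start_frame.

```python
import numpy as np

def progressive_ber(query_fp, audio_fp, start_frame,
                    initial_frames=50, ber_threshold=0.4):
    """Score a small block of frames first; abandon the location early if the
    partial BER already exceeds the threshold, otherwise grow the block."""
    n_q = query_fp.shape[0]
    m = min(initial_frames, n_q)
    while True:
        segment = audio_fp[start_frame:start_frame + m]
        ber = np.count_nonzero(query_fp[:m] != segment) / query_fp[:m].size
        if ber >= ber_threshold:
            return ber            # clear non-match: stop early
        if m == n_q:
            return ber            # full-query BER for a promising location
        m = min(2 * m, n_q)       # grow the compared region and re-check
```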
  • Figure 13 is a flowchart of a process for determining a best match using a statistical matching process.
  • Figure 14 is a flowchart relating to determining a best match score, for example for use in the process of Figure 13.
  • Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a newly mentioned current count table interrelate in this alternative approach.
  • the system locates the best matching location in every compared database audio file.
  • a current count table records the similarity between the query and the database audio file.
  • the system selects a list of top-ten audio files.
  • the top-ten audio files are chosen based on the magnitude of the entries in the audio index map.
  • a BER computation is done on the top-ten audio files by aligning the query at the location stored by the audio index map.
  • the matching audio file is identified as the audio file with the smallest BER.
  • Figure 13 is a flowchart of a process for determining a best match using a statistical matching process.
  • the process of Figure 13 may be used as step S130 of the flowchart of Figure 3.
  • the TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204).
  • These first two steps are the same as the first two steps shown in the flowchart of Figure 9.
  • a statistically determined best match is determined for the current database audio file number (step S272), providing a score for that database audio file.
  • the current score is checked against the ten best scores so far (step S274) for the previously checked audio files, which, within the exemplary embodiment described below, are the ten highest scores so far. If the current score is better (e.g. higher), the list of the ten best scores so far is updated (step S276), by removing the worst (e.g. lowest) of the scores within the top ten and storing the current score in the top ten.
  • the BER is generated for each of the ten database audio files for which an associated score is present in the final list of the ten best scores (step S282).
  • the query fingerprint is matched to the associated audio fingerprints in the arrangement that yields the score within the ten best scores, that is at the locations of the respective audio fingerprints that produced the scores in the final list of the ten best scores.
  • the database audio file of these ten that produces the lowest BER is deemed to be the matching database audio file for the query and is output as the best match audio file (step S284).
  • the process of Figure 14 is an example of a process that corresponds to step S272 of the process of Figure 13.
  • the approach used, in effect, matches frame 0 in the query fingerprint against a frame index in the audio fingerprint of a database audio file and determines how many matched TTS codes this results in for each frame index in the audio fingerprint.
  • the TTS code in the query fingerprint is obtained (step S304).
  • the frame indices in the audio fingerprint of that database audio file that have the same TTS code are identified (step S306).
  • A determination is made (step S307) as to whether any audio fingerprint frame indices were identified for the TTS code in step S306. Assuming there are any, the first of the identified frame indices is set as the current offset frame index (step S308). A determination is made of which frame index within the audio fingerprint the query fingerprint would have to start at, for the current query frame index to match with the current offset frame index (step S310). For the database audio file frame index within the audio fingerprint determined as the frame index at which the query fingerprint would start, in step S310, a count associated with that database audio file frame index (in a current count table) is incremented by one (step S312).
  • A check is made as to whether the current offset frame index is the last identified frame index from step S306 (step S314). If there are one or more further identified frame indices from step S306, the next identified frame index from step S306 is set as the current offset frame index (step S316).
  • the highest count associated with any audio file frame index in the final current count table is set as the best match score for audio file j (step S326), and the best match score is stored together with the identity of the associated database audio file frame index. This best match score may then be compared with the ten best scores in step S274 of the process of Figure 13. In this instance the ten best scores will be the ten highest best match scores.
  • the TTS code 102 for query frame 86 is extracted from the query fingerprint 100.
  • the TTS code is binary 100110100 (decimal 308).
  • at the entry for decimal 308 in the TTS map 60 for band 0 of the database audio file being compared, there are three frame indices: frame 3, frame 100 and frame 422.
  • the offset associated with query frame 86, relative to query frame 0 is 86 (that is 86 - 0).
  • query frame 0 would match with database audio file frame negative 83 (that is 3 - 86), which is not possible.
  • the next identified frame index is frame 100.
  • For query frame 86 to match database audio file frame 100, query frame 0 would match with database audio file frame 14 (that is, frame index less the offset, i.e. 100 - 86). As such, in a current count table 104, the current count associated with database audio file frame 14 is incremented by one. In this case, the count previously associated with database audio file frame 14 was eight and so its new count is nine. Likewise, for query frame 86 to match database audio file frame 422, query frame 0 would match with database audio file frame 336 (that is 422 - 86). As such, in the current count table 104, the current count associated with database audio file frame 336 is incremented by one.
  • the count previously associated with database audio file frame 336 was twenty-four and so its new count is twenty-five. After this has been done for all query frame indices, a complete or final count table 106 is generated. In this particular case, the highest count number is 85, associated with database audio file frame 130.
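The vote-counting logic of this worked example can be sketched as follows; query_tts_codes is assumed to hold the per-band query TTS codes computed as in the earlier sketches, and the function names are illustrative.

```python
from collections import Counter

def best_match_score(query_tts_codes, audio_tts_maps):
    """Count-table ('vote') matching: every database frame index at which
    query frame 0 could start receives one vote per matching TTS code.

    query_tts_codes[band][q_frame] is the TTS code starting at q_frame in
    that band of the query fingerprint."""
    counts = Counter()
    for band, codes in enumerate(query_tts_codes):
        code_map = audio_tts_maps[band]
        for q_frame, code in enumerate(codes):
            for a_frame in code_map.get(code, []):
                start = a_frame - q_frame      # where query frame 0 would sit
                if start >= 0:                 # negative start frames are impossible
                    counts[start] += 1
    if not counts:
        return 0, None
    start, score = counts.most_common(1)[0]
    return score, start
```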
  • Another mechanism for reducing the amount of processing necessary is to classify the audio in the database, for example between the spoken word and music, with music classified into vocal or instrumental, classical or modern, etc. These categorisations can be downloaded with the music audio itself when generating the database. When a query is input for matching, pre-processing occurs to determine what type of audio it is, with similar categorisations. In this way, only audio files from the same general category need to be checked, as sketched below.
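A trivial sketch of that pre-filtering step is shown below; the per-entry 'category' metadata and the upstream classifier that produces query_category are assumptions for illustration, not features spelled out in the text.

```python
def candidate_files(database, query_category):
    """Keep only database entries whose stored category matches the query's
    detected category, so that only those files are passed to the matcher."""
    return {name: entry for name, entry in database.items()
            if entry.get("category") == query_category}
```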
  • the apparatus and processes of the exemplary embodiments can be implemented on a computer system 400, for example as schematically shown in Figure 16.
  • the embodiments may be implemented as software, such as a computer program being executed within the computer system 400, and instructing the computer system 400 to conduct the method of the example embodiment.
  • the computer system 400 comprises a computer module 402, input modules such as a keyboard 404 and mouse 406 and a plurality of output devices such as a display 408, and printer 410.
  • the computer module 402 is connected to a computer network 412 via a suitable transceiver device 414, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
  • LAN Local Area Network
  • WAN Wide Area Network
  • the computer module 402 in the example includes a processor 418, a Random Access Memory (RAM) 420 and a Read Only Memory (ROM) 422.
  • the computer module 402 also includes a number of input/output (I/O) interfaces, for example an I/O interface 424 to the display 408, and an I/O interface 426 to the keyboard 404.
  • I/O input/output
  • the components of the computer module 402 typically communicate via an interconnected bus 428 and in a manner known to the person skilled in the relevant art.
  • the application program is typically supplied to the user of the computer system 400 encoded on a data storage medium such as a CD-ROM or floppy disc and read utilising a corresponding data storage medium drive of a data storage device 430.
  • the application program is read and controlled in its execution by the processor 418.
  • Intermediate storage of program data may be accomplished using the RAM 420.
  • the application program supplied may include just the mechanism for identifying query files, without a mechanism for generating the database of audio files. That database may be supplied as a fixed database, for example on the same CD-ROM or available over the Internet or some other network.
  • the above description is of audio files in general.
  • the embodiments are of particular use where the audio files are music files, for example songs, with the query also being of a song.
  • the described methods provide systematic approaches to identify an audio clip from a song database efficiently, by the generation of TTS maps and audio fingerprints.
  • the approaches also provide accurate and efficient methods of searching for the matching song from a song database.
  • the embodiments have the ability to handle non-linear distorted queries without sacrificing search speed and storage requirements.
  • the embodiments provide robust and efficient algorithms for an audio identification system.
  • the embodiments of the invention provide efficient search and identification methodologies for an audio identification system.
  • Exemplary applications of the audio identification system include a broadcast monitoring system, a mobile retrieval system and a content monitoring system amongst others.
  • a broadcast monitoring system has a central database server and a monitoring station.
  • the monitoring station transmits audio fingerprints extracted from the broadcast audio content to the central server.
  • the central server determines the identity of the song being broadcast through its large database of songs.
  • a clip of a song is recorded using a mobile telephone and transmitted through the mobile network.
  • a song information server processes the clip and, in return, messages the result back to the user.
  • An audio file transferred through the Internet is first screened by an audio server.
  • the audio server determines the audio content through audio identification rather than by filename or ID tag. With the information gathered, it matches the legal rights of the file on the distribution channel and blocks unauthorised transmission.

Abstract

A song identification process has two stages: a database generation stage and a matching stage. In the database generation stage the system extracts an audio fingerprint and robust codes for a database of songs. A set of maps of all possible permutations of the codes is generated, storing all the locations of the different codes. In the matching stage, the system performs a similarity measure between the query song and the database songs by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every song. The song with the lowest song BER is the matching song.

Description

METHOD AND APPARATUS FOR IDENTIFYING AUDIO SUCH AS MUSIC
Field of the Invention
The invention relates to identifying audio, such as music, in particular by comparing certain properties of the sound with corresponding properties in a database derived from known sounds.
Background
The revolution in Internet technologies provides convenience in communication between people in different parts of the globe. This revolution, coupled with new compression techniques, makes the transfer of media content across the globe easier. Tapping the infrastructure that has developed, the music industry is exploring ways to manage the distribution of media content, especially audio files. Effective management requires knowledge of the media being transferred. Several methods have been employed, the simplest being the verification of a filename. However, the ease of changing filenames has rendered this method inadequate.
Another solution would be to use file comparison. By comparing actual data with query data bit by bit, a file identity could be determined. However, there exist several file formats widely available on the Internet. In order to reduce the amount of storage necessary for this approach, the comparison could use checksums or one way hash functions such as Message Digest 5 (MD5). Nevertheless, checksums and MD5 are independent of the audio properties of the file and this method may prove insecure.
Descriptive information pertaining to audio data can be and is coupled with the media content. This is known as metadata. Such metadata may contain the artist name, composer name, title and other data. End users, on the other hand, can easily alter this data, thus rendering the method insecure. Watermarking technologies have emerged as an alternative method. These embed digital information into the audio file without altering the acoustic content perceived by the end user. This digital information translates to a unique identifier of the song. The audio quality of the watermarked song versus the robustness of the watermarking algorithm against hacking has been the central issue. However, watermarking requires a prior embedding of digital information into an audio file, thus all songs and other audio already on the market cannot now usefully be protected in this manner.
Content-based identification systems differ from the above technologies. A content-based recognition system analyses the acoustic properties of the audio file and develops unique fingerprints. Training of the audio files is achieved by suitable extraction of unique acoustic properties, also known as audio fingerprints. Caution must be exercised in the selection of the acoustic properties as it affects the system's robustness and database size. Following the training process is the database building process, where all the audio fingerprints are indexed.
For such approaches, the matching process is similar to the training process. An audio fingerprint is extracted from a clip of an original song. The matching algorithm performs a calculation of the similarity between the extracted query audio fingerprint and the ones in the database by means of a similarity function. The best-matched audio fingerprint yields the highest similarity measure and the corresponding song is deemed the matched song.
The best search strategy to employ is to do an exhaustive search. However, due to constraints in processing power, an exhaustive search can be prohibitive for a large database.
A method of generating and matching hashes is described in U.S. Patent Application No. 2002/0,178,410A1, published on 28 November 2002 in the name of Haitsma et al. The hashes are short summaries of data files, which can be used to identify the file. The method generates robust hashes for multimedia content. The audio clip is divided into successive frames. For each frame, the frequency spectrum is divided into bands. A robust property of each band is computed and represented by a concatenation of binary hash words, one for each frame. To identify a possibly compressed database audio file, a block of derived hash words is matched by a computer with a large database. The database consists of a lookup table (LUT), which stores the entry of each hash word. Each entry points to the song and the location where the respective hash word occurs. From the lookup table, the Hamming distance between the query hash words and the database song is calculated. The Hamming distance determines whether the query clip is similar to the database song.
An alternative method of recognising an audio clip is described in U.S. Patent Application No. 2002/0,083,060A1, published on 27 June 2002 in the name of Wang et al. The method recognises the audio clip by storing a set of landmark time points and associated fingerprints in the database. The landmarks occur at reproducible locations within the file. Fingerprints represent features of the signal at or near the landmark time points. To perform recognition, landmarks and fingerprints are computed for the query sample and used to retrieve matching fingerprints from the database. Therefore, if a large number of corresponding landmarks are linearly related, i.e. if equivalent fingerprints of the sample and the retrieved file have the same time evolution, then the file is identified as similar to the sample.
In U.S. Patent Application No. 2001/0,044,719A1, published on 22 November 2001 in the name of Casey, a computerised method of extracting and recognising features from an acoustic signal is described. A spectral envelope is produced by windowing the acoustic signal. The dimensionality of the spectral envelope is then reduced to produce a set of features for the acoustic signal. The features in the set are clustered to produce a group of features for each of the sources. Each group of features is a quantitative descriptor that is associated with a qualitative descriptor. Hidden Markov models are trained with a set of known features and stored in a database. Matching is simply done by retrieving features to determine the similarity. For all recognition systems, there exists limited tolerance to the degree of variations in the input signal. This is termed the robustness of the system. In most applications, the input signal may be subjected purely to compression distortions. However, in more severe cases, non-linear distortions due to air-microphone recording may affect the overall system performance. All recognition systems developed face the challenge of matching accuracy due to the above distortions.
More practically, a recognition system should exhibit a fast matching speed. This parameter becomes one of the important factors to consider in implementing a matching algorithm. Coupled to the above problem is the consideration of storage requirements. Often, the storage requirements determine the exactness or the compactness of a certain audio feature.
Summary
According to a first aspect of the present invention, there is provided a method of generating a database from audio files, for use in audio identification of a query file. The method comprises receiving one or more audio files, for individual received audio files, extracting an audio fingerprint of the audio file and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint, and storing the one or more audio fingerprints and location tables in a database.
According to a second aspect of the present invention, there is provided a method of matching an audio query against one or more known audio files. The method comprises: receiving the query, extracting a query fingerprint of the query, receiving an audio fingerprint of an audio file, identifying one or more data strings of a predetermined category from the query fingerprint; identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment. According to a third aspect of the present invention, there is provided a method of matching an audio query file against an audio file. The method comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; determining a similarity measure between the query fingerprint and the audio fingerprint for the one or more different alignments; and determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
According to a fourth aspect of the present invention, there is provided a method of matching an audio query file against one or more known audio files. The method comprises: receiving the query file; extracting a query fingerprint of the query file; receiving an audio fingerprint of a first audio file; identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
According to a further aspect of the present invention, there is provided apparatus for operating according to the method of any of the first to fourth aspects.
According to yet a further aspect, there is provided a computer program product having a computer program recorded on a computer readable medium, for instructing a computer to operate according to the method of any of the first to fourth aspects.
Introduction to the Drawings
The invention is described by way of non-limitative example, with reference to the accompanying drawings in which:-
Figure 1 is a schematic view of sound identifying apparatus according to an embodiment of the invention;
Figure 2 is a flowchart illustrating an exemplary overview process of the training of an audio database according to an embodiment;
Figure 3 is a flowchart illustrating an exemplary overview process of the matching of audio according to an embodiment;
Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module of Figure 1; Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file;
Figure 6 is an illustration of an exemplary arrangement of an audio fingerprint;
Figure 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps;
Figure 8 is a flowchart relating to the generation of the TTS maps from an audio fingerprint;
Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match between the query file and audio files in a database;
Figure 10 is an illustration of a query fingerprint and certain aspects of an exemplary method of determining possible locations against which to match an audio file with the query file;
Figure 11 is a flowchart relating to a process for determining a file BER;
Figure 12 is a flowchart of an alternative BER calculation procedure;
Figure 13 is a flowchart of a process for determining a best match using a statistical matching process;
Figure 14 is a flowchart relating to an example of a specific portion of the process of Figure 13;
Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a current count table interrelate in the processes of Figures 13 and 14; and Figure 16 is an illustration of a computer system for implementing the apparatus and processes associated with the exemplary embodiments.
Description
The below-described embodiments use what may be described as a Long String of Sub-band Match (LSSM) method. They attempt to tackle the problems associated with an exhaustive search by improving the indexing process. Instead of conducting an exhaustive search by matching the query audio at every possible location in the database files, LSSM provides an efficient way of locating an audio file in a database that matches a query, such as a query song.
An audio identification process has two stages: a database generation stage and a matching stage. In the database generation stage the system extracts an audio fingerprint and robust codes for a database of audio files. A set of maps of all possible permutations of the codes is generated, storing all the locations of the different codes. In the matching stage, the system performs a similarity measure between the query and the database files by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every audio file. The audio file with the lowest file BER is the matching audio file.
Figure 1 is a schematic view of sound identifying apparatus 10 according to an embodiment of the invention.
The sound identifying apparatus 10 has a feature extraction module 12 with two inputs, a training audio file input 14 for inputting an audio input signal S1 and a query input 16 for inputting a query input signal S2. The feature extraction module 12 also has two outputs: one carries an extracted audio feature signal S3 as an input to a feature database 18 and the other carries an extracted query feature signal S4 as an input to a matching module 20. The feature database 18 contains an audio fingerprint database 22 and a Temporal Transition Code (TTS) map database 24. The feature database 18 is also arranged to provide a selected extracted audio feature signal S5 as an input to the matching module 20. The matching module 20 is arranged to output a match identification signal S6.
The audio identification process comprises two stages, a database generation stage, which is a training operation, and an identification stage. Usually the training process occurs only once, possibly with occasional updates, whilst the identification stage occurs every time a new item needs identifying. The two stages may also occur some time apart.
Figure 2 is a flowchart illustrating an exemplary overview process (S100) of the training of an audio database, for instance in the apparatus of Figure 1. This process (S100) generates the contents of the feature database 18 of Figure 1; this is the database generation stage. Figure 3 is a flowchart illustrating an exemplary overview process (S120) of the matching of audio, for instance using the apparatus of Figure 1. The process (S120) is used to identify an unknown piece of audio, a so-called query; this is the identification stage.
The sound identifying apparatus 10 receives a training audio file within the audio input signal S1 at the audio input 14 (step S102). The feature extraction module 12 processes the input training audio file and translates relevant information to extract an input audio fingerprint (step S104). To reduce storage requirements, the input training audio file may be converted into a mono training audio file and sampled at a certain sampling frequency, such as 44.1 kHz. A set of TTS maps for the input training audio file is generated based on the extracted audio fingerprint (step S106). The extracted audio fingerprint and the generated TTS maps are sent for storage in the audio fingerprint database 22 and the TTS map database 24, respectively, within the feature database 18. Metadata identifying the file and its author may also be sent to and stored in the feature database 18. A check is then made to determine if the current training audio file is the last one to be processed for training (step S110). If this is not the last file, the next file is selected (step S112) and the process returns to step S104 to process the next training audio file. This process is therefore repeated for as many training audio files as are input to the apparatus 10 or in the database from which the input training audio files are drawn. If the current training audio file is the last file, the training process ends.
When an identification is to be made, the sound identifying apparatus 10 receives an input query file in a query input signal S2 at the query input 16 (step S122). The feature extraction module 12 processes the input query file and translates relevant information to extract a query fingerprint (step S124). The matching module 20 obtains the necessary TTS maps and audio fingerprints from the feature database (step S126). The feature extraction module 12 also obtains certain query TTS codes from the query fingerprint (step S128). The query fingerprint and query TTS codes are input to the matching module 20 as the extracted query feature signal S4. Based on the TTS maps and audio fingerprints and the query fingerprint and query TTS codes, the matching module 20 determines a best match between the query file and the training audio files used to generate the contents of the feature database 18 (step S130) and outputs the result. The matching process then ends.
As is described later, various methods can be used to find the best match. A similarity measure, such as a bit error rate (BER), is used to determine the similarity between the query file and the training audio files whose audio fingerprints are in the feature database 18, by comparing the extracted features. The BER for this purpose may be the ratio between the number of error bits and the total number of bits.
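For illustration only, the following is a minimal Python sketch of such a BER-style similarity measure. The function name and the representation of a fingerprint as a flat, equal-length sequence of bits are assumptions made for the example and are not part of the described apparatus.

    def bit_error_rate(query_bits, audio_bits):
        # Fraction of mismatched bits between two aligned, equal-length bit sequences.
        assert len(query_bits) == len(audio_bits)
        errors = sum(1 for q, a in zip(query_bits, audio_bits) if q != a)
        return errors / len(query_bits)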
Training
In the LSSM approach described below, an audio fingerprint is extracted by calculating the difference in the multi-band energy between successive frames. Each frame consists of several data samples. Consecutive frames overlap in their data samples (previous samples and new samples) and each frame undergoes a Fast Fourier Transform (FFT). From the FFT, multi-band energies are extracted, and differential multi-band energies are calculated as the differences in energy between successive frames. These differential multi-band energies constitute the portion of the audio fingerprint for the particular frame; in short, they are the differential multi-band energies over time. The audio fingerprint generated is the concatenation of the differential multi-band energies of successive frames in sequential order.
As noted above, the audio identification process consists of two stages. In the database generation stage the system extracts an audio fingerprint and robust codes for every audio file. These codes are robust but they are not unique. Thus a set of maps comprising all possible permutations of the said codes is generated, storing all the locations of the different codes.
Figure 4 is a schematic block diagram illustrating components of an exemplary embodiment of the feature extraction module 12 of Figure 1. Figure 5 is a flowchart illustrating an exemplary process for extracting an audio fingerprint from an input training audio file, for instance as may be involved in step S104 of Figure 2. This is also an example of the operation of the feature extraction module 12 of Figure 1, as shown in detail in Figure 4.
An audio file is received as time domain, pulse code modulated (PCM) samples. The received input audio file is divided into sample frames (step S142) by a framing circuit 22, with a subsequent frame overlapping a current frame to ensure high correlation between frames. The preferred frame size is 1024 samples, with each sample in the frame multiplied by a window function such as a Hamming window. Each frame undergoes a spectral transformation (step S144), such as an FFT, by way of a transform module 24, to provide a spectral representation. For computational simplicity and efficiency, the spectral information may be band limited. A recommended range is from 300 Hz to 3 kHz for audio. Band limiting the training audio file's spectral information to a specific frequency range has the advantage of reducing storage requirements by eliminating redundancy. A range that is most relevant to the human auditory system is generally selected.
The frequency range obtained is further processed into absolute values (step S146) by a magnitude generator 26 and separated into frequency bands (step S148) by a band separator 28. This involves dividing the band-limited spectrum, linearly or non-linearly, into frequency ranges. In a preferred embodiment, each band has the same bandwidth, for example 450 Hz.
Suitable audio properties are extracted (step S150) to represent the training audio file. In this embodiment energy is used, as it exhibits robustness to various processing distortions. To achieve this, the absolute spectral values in the individual frequency bands are squared and summed to obtain the energies of each band separately, in an energy extractor module 30. This generates a set of multiband energies 32 for that particular frame. The multiband energies 32 are stored in a one-frame delay 34 for each band (step S152). The multiband energy 32 for each band in the current frame N is also subtracted (step S154) from the energy value for the same band of the previous frame N-1, at an adder 36, to produce a set of difference values 38. A threshold module 40 compares the difference value 38 for each band of the frame with a threshold value (step S156), for instance 0.1, to produce a binary representation of each band. If the energy difference value 38 is above the energy threshold, binary '1' is used to represent the particular band. If the energy difference value 38 is below the energy threshold, binary '0' is used to represent the particular band. The thresholds are all the same, although they may be adaptively controlled, based on the distortion or background noise in the environment in which the audio file is recorded. This might be achieved by sampling the audio file at various points.
A frame concatenator 42 generates an x-bit code word 44 for the frame (step S158) by concatenating the binary representations of the energy differences across the bands. A fingerprint concatenator 46 forms the audio fingerprint 50 by concatenating the code words 44 from sequential frames in order (step S160) for the training audio file.
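By way of illustration, a minimal Python sketch of this fingerprint extraction pipeline is given below. It uses the figures mentioned above (1024-sample Hamming-windowed frames, six equal-width bands between roughly 300 Hz and 3 kHz, a fixed threshold of 0.1); the 50% frame overlap, the exact band edges and the sign convention of the energy difference are assumptions made for the example rather than details taken from the embodiment.

    import numpy as np

    def extract_fingerprint(samples, fs=44100, frame_size=1024, hop=512,
                            f_lo=300.0, f_hi=3000.0, n_bands=6, threshold=0.1):
        # Returns one n_bands-bit code word per frame: a list of frames,
        # each frame being a list of 0/1 band values (compare Figure 6).
        window = np.hamming(frame_size)
        freqs = np.fft.rfftfreq(frame_size, d=1.0 / fs)
        edges = np.linspace(f_lo, f_hi, n_bands + 1)
        prev_energy = None
        fingerprint = []
        for start in range(0, len(samples) - frame_size + 1, hop):
            spectrum = np.abs(np.fft.rfft(samples[start:start + frame_size] * window))
            # Band energies: sum of squared magnitudes within each band.
            energy = np.array([
                np.sum(spectrum[(freqs >= edges[b]) & (freqs < edges[b + 1])] ** 2)
                for b in range(n_bands)])
            if prev_energy is not None:
                diff = energy - prev_energy        # differential multi-band energy
                fingerprint.append([1 if d > threshold else 0 for d in diff])
            prev_energy = energy
        return fingerprint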
The number of bits, x, in the code word 44 is generally the same as the number of bands (although it may differ if the data from two or more bands merge or if a band splits). A typical number of bands is 6, as this balances robustness in the system with computation requirements. Figure 6 is an illustration of an exemplary arrangement of an audio fingeφrint 50. In this embodiment, the fingeφrint is a two-dimensional array of binary values. The horizontal direction 52 represents the differential energy contour for each frame and the vertical direction 54 represents the differential energy evolution across frames for a particular band. The vertical direction is also synonymous with the time direction information. For simple identification, the audio fingeφrint's first column is labelled as band 0, followed by band 1, through to the last band, band 5. The band relates to the energy at different parts of the frequency spectrum. The first row is indexed as frame 0, followed by frame 1, etc. The frame index relates to time dimension information, with the number of frame indices depending on the length of the training audio file.
In principle, there exists high correlation between successive energies in a particular band. Exploiting this, the consecutive band values are concatenated in the time direction to generate a code word, which is a string of data. This code captures the temporal envelope of the audio file at that particular time, and is termed here the Temporal Transition Code (TTS code).
In order to create an efficient method of indexing these TTS codes, a map is generated that locates all possible locations of a particular TTS code in a particular audio file. Generation of this map, also known as the Temporal Transition Map (TTS map), is achieved by concatenating the multi-band energies in time and shifting the TTS codes incrementally in the time direction. The associated time and band information of the particular TTS code is recorded into the different bands' TTS maps.
The frame index gives a time representation of the TTS code and is continuous until the end of the audio fingerprint 50.
Figure 7 is a schematic illustration of the use of an audio fingerprint in generating TTS maps. Figure 7 illustrates an example of a set of TTS maps 60, 62, 64, 66, 68, 70, a separate one for each band of the audio fingerprint 50. The uppermost TTS map 60 is for band 0 of the audio fingerprint 50. The second TTS map 62 is for band 1 of the audio fingerprint 50, and so on. Each individual TTS map 60 - 70 is capable of recording all possible permutations of TTS codes of the respective band.
Each TTS map 60 - 70 contains details of the frame locations 72 of the individual TTS codes within the relevant band for the associated audio fingerprint 50. The TTS codes are extracted from the audio fingerprint 50 relying on the principle of high correlation across frames. For each frame, the respective TTS code is derived as a y-bit binary representation 74 consisting of a concatenation of the binary digit for that frame in the audio fingerprint 50 and the binary digits in the same band for the y-1 immediately succeeding frames.
The TTS maps 60 - 70 contain a list 80 of all possible values of the TTS code, from 0 to 2^y - 1. When the TTS code 74 for a frame is determined, the frame number of that frame is then associated with that code within the TTS map. The same TTS code may appear more than once in the same band of an audio fingerprint 50. Thus more than one frame number may be associated with each TTS code value.
Figure 8 is a flowchart relating to the generation of the TTS codes from the audio fingerprint 50. This is an example of the process of step S106 within the flowchart of Figure 2. Figure 8 is described with reference to the illustration of Figure 7.
The process is initiated by setting the current band number, BN = 0, and the current frame index, FI = 0 (step S172). For the current band number and the current frame index the binary representation of the TTS code 74 is extracted (step S174). For FI = 0 this is achieved by concatenating y consecutive bits from frame 0 to frame index y-1. In this embodiment y = 9, as this is a suitable TTS code length for handling air-microphone distortions. Thus for BN = 0 and FI = 0, the relevant binary representation of the TTS code 74 for the audio fingerprint 50 of Figure 6 is 101001101, being a concatenation of the values for FI = 0 to FI = 8 in band 0.
Within the TTS map 60 corresponding to the current band number, the TTS code value in the code value list 80 that corresponds to the current extracted TTS code for the current frame index is located. The current frame index 72 is recorded, associated with the TTS code value for that frame index, in the TTS map (step S178). In the example shown in Figures 6 and 7, for BN = 0 and FI = 0, the TTS code 74 for FI = 0 is 333 (binary 101001101, given in decimal for easier reading in this description). Thus the frame index value "0" is stored in the TTS map 60 associated with the TTS code value 333.
Once the current frame index value has been stored in the TTS map, a determination is made whether the current frame index is the last within the current band (step S180). If the current frame index is not the last within the current band, the frame index is incremented by one, i.e. FI = FI + 1 (step S182), and the process returns to step S174. On the other hand, if the current frame index is the last within the current band at step S180, then a determination is made whether the current band number is the last band within the audio fingerprint (in the example of Figures 6 and 7, whether BN = 5) (step S184). If the current band number is not the last band, the band number is incremented by one, i.e. BN = BN + 1, the frame index reverts to zero, i.e. FI = 0 (step S186), and the process returns to step S174. On the other hand, if the current band number is the last within the current audio fingerprint, then this sub-process is finished, as this marks the end of the generation of the TTS maps for that audio fingerprint 50.
In the example of Figures 6 and 7, after the frame index value 0 has been stored with TTS code value 333 in the TTS map 60 for the first band, the TTS code 74 for the next frame, FI = 1, is extracted. This has a binary representation 010011010 (decimal 154), against which value the frame index value 1 is stored. The binary representation for FI = 2 is 100110100 (decimal 308), against which value the frame index value 2 is stored. In this example, the binary representation for FI = 6 is 101001101 (decimal 333), which is the same TTS code value as that of FI = 0. Thus this same value 333 has two frame locations stored against it.
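The following Python sketch illustrates the generation of TTS maps from such a fingerprint. It assumes the fingerprint is the frame-by-band bit array of the earlier sketch and represents each band's TTS map as a dictionary from the y-bit code value to the list of frame indices at which that code starts; the names are illustrative only.

    def build_tts_maps(fingerprint, y=9):
        # One map per band: {TTS code value: [frame indices where the code starts]}.
        n_frames, n_bands = len(fingerprint), len(fingerprint[0])
        tts_maps = [dict() for _ in range(n_bands)]
        for band in range(n_bands):
            for frame in range(n_frames - y + 1):
                bits = "".join(str(fingerprint[frame + k][band]) for k in range(y))
                code = int(bits, 2)                # e.g. '101001101' -> 333
                tts_maps[band].setdefault(code, []).append(frame)
        return tts_maps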
The training processes described above with reference to Figures 1, 2 and 4 to 8 encompass the training phase of the recognition system. Although processing complexity is encountered at certain phases of the training process, the increased complexity does not affect the matching time during a recognition operation, as the relevant processes are conducted prior to matching. The main concern during the training phase is the selection of suitable audio fingerprints, as that contributes to the overall memory requirements of the system.
For each training audio file in the database, TTS maps and an audio fingerprint are created. Each TTS map stores the TTS codes of one band of the audio fingerprint. The TTS maps and the audio fingerprints make up the data of the feature database 18 of the sound identifying apparatus 10 of this embodiment.
In the sound identifying apparatus 10, the TTS maps may be loaded into RAM to achieve fast searching speed. If there is a memory constraint, the audio fingerprints need not be stored in RAM.
Matching
In principle, LSSM exploits the similarity between the actual TTS codes of a database audio file and the query file's TTS codes by means of the TTS map. Effectively, owing to a priori knowledge of the database audio file, a map of all possible 'correct' TTS codes is generated by concatenating the multi-band energies in time and shifting the TTS codes incrementally in time. The result is a TTS map populated with enough information to enable efficient identification of the best match location of an audio file in milliseconds. The TTS map streamlines the matching time required by the system as compared with an exhaustive search.
Matching is a quick and simple process of accessing the TTS map with the help of the extracted TTS codes of the query file. From the extracted TTS codes, matching locations are located using the TTS map. Subsequently, a similarity measure is calculated between the query file and the database audio file. This process is looped for all the other extracted TTS codes.
Finally, the audio file that yields the highest similarity measure is the matched audio file. During the matching stage, a clip of unknown audio, the query audio, is input into the recognition system as a query file so that it can be identified. Matching is defined as locating a best matching database (training) audio file by determining the similarity between the query and the audio files in the database. For this to occur an audio fingerprint is generated for the query, termed here a query fingerprint 100. The query fingerprint 100 is extracted (in step S124 of Figure 3) by the same method as is used to extract the audio fingerprint 50. The flowchart of Figure 5 and the associated description are applicable here too, with the file being processed being the query file rather than a database audio file. Likewise the TTS codes for the query fingerprint 100 are also determined. The query TTS codes are generated (in step S128 of Figure 3) by the same method as is used to extract the database audio TTS codes (step S174). They are extracted one by one, as described below; TTS maps do not need to be generated for the query file. However, if desired, all the query TTS codes could be extracted initially before any of them is used, possibly to be stored in a table.
Prior to calculating a similarity measure between the query and the database audio files, the system determines the location at which the query best matches any database audio file. As the query may have been randomly extracted from an unknown source, the system has no idea where it should start its matching process. TTS codes are extracted from the query and possible locations for the query are determined by accessing the entries in the TTS maps.
Figure 9 is a flowchart illustrating an exemplary matching procedure for determining the best match between the query file and audio files in the database. Figure 9 illustrates an example of the processes of steps S126 to S130 of Figure 3.
A current audio file number is initiated, that is j = 0 (step S202). The TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204). A file BER for the current audio file is determined (step S206), for instance as is described later. The file BER for the current database audio file is compared with the lowest file BER so far for this query (step S208). If the file BER for the current audio file is lower (as would always happen, for instance, if it is the first audio file being compared), then the file BER for the current audio file becomes the lowest BER (step S210). If the existing lowest file BER is lower, then that remains as the lowest file BER. The system also determines if there are any more audio files to compare, that is whether j = the last relevant number (step S212). If there are no more audio files to compare, the audio file with the current lowest file BER is set and output as the best matching audio file for that query (step S214), after which the process ends. On the other hand, if there are more audio files, the current audio file number is incremented by one, i.e. j = j + 1 (step S216), and the process returns to step S204 to retrieve the features of the next audio file and determine the file BER for that next audio file.
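As an illustration of this outer loop only, the Python sketch below iterates over a list of database entries and keeps the lowest file BER; the per-file routine file_ber is sketched after the Figure 11 walkthrough below, and the structure of the database entries is an assumption made for the example.

    def best_match(query_fp, database):
        # database is assumed to be a list of (audio_fp, tts_maps, metadata) tuples.
        best_index, best_ber = None, float("inf")
        for j, (audio_fp, tts_maps, metadata) in enumerate(database):
            ber = file_ber(query_fp, audio_fp, tts_maps)   # see the sketch further below
            if ber < best_ber:
                best_index, best_ber = j, ber
        return best_index, best_ber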
Figure 10 is an illustration of a query fingerprint 100 and certain aspects of an exemplary method of determining possible locations against which to match a database audio file with the query file. Figure 11 is a flowchart relating to a process for determining a file BER (step S206 of Figure 9), including determining possible matching locations.
A query band number QBN, a query frame index QFI and the file BER are reset for a new audio file, that is QBN = 0, QFI = 0, file BER = max (step S222). For the current query band number and the current query frame index, a query TTS code is obtained from the query fingerprint (step S224). For audio file j, audio frame indices of the audio fingerprint having the same TTS code value as the query TTS code are identified, using the audio file TTS map for the same band (step S226). Thus if QBN = 0, the TTS map being used is also for band 0. Typically, several possible frame indices in the audio fingerprint correspond to any particular TTS code value. In the illustrated example in Figure 10, the TTS code for the first frame of the first band of the query fingerprint 100 is binary 010011010 (decimal 154). For the database audio file whose first band TTS map is shown in Figure 10, the TTS code value 154 is shared by frame indices 1, 13 and 300.
A determination is made (step S227) as to whether any audio fingerprint frame indices were identified for the TTS code. Assuming one or more potential frame locations have been identified, the first identified frame index from the audio fingerprint is set as the current audio match frame index (step S228). The query fingerprint is overlapped or compared with the audio fingerprint, with the current query frame index aligned with the current audio match frame index (step S230). These two frame indices have the same TTS code value as each other. A check is made as to whether the query fingerprint falls entirely within the audio fingerprint with the current alignment (step S231). If the query fingerprint falls entirely within the audio fingerprint, a bit error rate (BER) is generated from the comparison of the two fingerprints (step S232). The BER is defined as the number of mismatched bits divided by the total number of bits in the query fingerprint.
Once the BER is generated for a frame location with the shared TTS code, the system compares the just-determined, current BER with the file BER for this database audio file (step S234), that is, with the lowest BER so far from the comparisons with this audio file. If the current BER is the lowest (as would always happen if it is the first BER determined for this particular pair of query and database audio file), then the current BER becomes the lowest BER (step S236). If the existing file BER is lower, then that remains as the file BER.
The system determines if there are any more locations to compare for the current query TTS code, that is whether the current audio match frame index is the last of the audio frame indices identified in step S226 (step S238). A similar check is made if step S231 determined that the query fingerprint did not fall entirely within the audio fingerprint.
If the current audio match frame index is not the last of the audio frame indices identified in step S226 and there are more locations to compare, the next of the audio frame indices identified in step S226 is set as the current audio match frame index (step S240). If the current audio match frame index is the last of the audio frame indices identified in step S226, then a check is made of whether the current query frame index is the last query frame index in the current query band (step S242). A similar check is made if step S227 determined that no frame indices had been identified in step S226. If the current query frame index is not the last query frame index in the current query band, the query frame index number is incremented by one, i.e. QFI = QFI + 1 (step S244), and the process returns to step S224. If the current query frame index is the last query frame index in the current query band, then a check is made of whether the current query band number is the last query band in the query fingerprint (step S246). If the current query band number is not the last query band number in the query fingerprint, the query band number is incremented by one, i.e. QBN = QBN + 1, and the query frame index number is reset, i.e. QFI = 0 (step S248). After this the process returns to step S224. If the current query band number is the last query band number in the query fingerprint, this part of the overall process ends. The next step in the main process of Figure 9 is step S208.
In this way, the operation proceeds for all frames and all bands of the query. For each audio fingerprint, the location that yields the lowest BER is the location at which the query best matches the current database audio file.
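A compact Python sketch of this per-file matching is given below, using the fingerprint and TTS map representations of the earlier sketches. It skips alignments where the query would extend beyond the audio fingerprint, as in step S231, and does not include the optimisations discussed further below; all names are illustrative.

    def file_ber(query_fp, audio_fp, audio_tts_maps, y=9):
        # Lowest BER over all alignments suggested by matching TTS codes.
        best = float("inf")
        q_frames, n_bands = len(query_fp), len(query_fp[0])
        for band in range(n_bands):
            for q_frame in range(q_frames - y + 1):
                code = int("".join(str(query_fp[q_frame + k][band]) for k in range(y)), 2)
                for a_frame in audio_tts_maps[band].get(code, []):
                    start = a_frame - q_frame       # audio frame aligned with query frame 0
                    if start < 0 or start + q_frames > len(audio_fp):
                        continue                    # query would fall outside the audio fingerprint
                    errors = sum(qb != ab
                                 for qf, af in zip(query_fp, audio_fp[start:start + q_frames])
                                 for qb, ab in zip(qf, af))
                    best = min(best, errors / (q_frames * n_bands))
        return best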
In the above embodiment, comparison of fingerprints only occurs where the query fingerprint falls entirely within the audio fingerprint. In alternative embodiments, comparison may also occur where the query fingerprint falls partly outside the audio fingerprint (that is, it begins before the audio fingerprint or ends after it). Such an operation might allow more robust processing for noisier queries. The comparison is made at the overlapping areas and, for that part of the query fingerprint that falls outside the audio fingerprint, is made as if the audio fingerprint existed in that area with every bit equal to zero.
In the example of Figure 10, the TTS code 154 from the first frame of the query fingerprint 100 matches the TTS code for frame indices 1, 13 and 300 of the audio fingerprint. Thus the query fingerprint 100 is first overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 1 of the audio fingerprint, and a first BER is generated. The query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 13 of the audio fingerprint, and a second BER is generated. The query fingerprint 100 is next overlapped with the audio fingerprint such that frame 0 of the query fingerprint 100 is aligned with frame 300 of the audio fingerprint, and a third BER is generated.
In the matching stage as described with reference to Figures 9 to 11, the system performs a similarity measure between the query file and the database audio files by aligning the query at the locations indicated by the maps. A bit error rate (BER) is computed for every audio file. The audio file with the lowest audio file BER is the matching audio file.
In the exemplary process described with reference to Figure 11, the TTS code is extracted for every frame index of every band of the query fingerprint, and a BER is calculated for every match. This can be computationally expensive. To reduce the burden, the number of matches tested can be reduced. One way of doing this is to keep a record of every alignment made between the query fingerprint and the audio fingerprint, so that when a repeat alignment comes up, within the same band or between different bands, no BER is determined for the second or further occurrence of that alignment. Another way of reducing the processing used is only to calculate the BER for alignments where the query fingerprint falls wholly within the audio fingerprint (in a time sense). Further ways of reducing the amount of processing required might include testing only occasional query TTS codes, e.g. every second one, every third one, every fourth one, etc. Additionally or alternatively, the TTS codes might be taken from only one band, for instance the first, second or third band, etc.
As is mentioned above, the exemplary process described with reference to Figure 11 is computationally expensive. Moreover there is high processing complexity in calculating the file BER for the whole of the query for each possible location. An alternative or additional way of reducing the data processing needed is described now with reference to Figure 12. This method is a progressive thresholding method for determining a BER for different overlap alignments of the query fingerprint and the audio fingerprint (a substitute for step S232 of Figure 11). The progressive thresholding method simplifies the calculation of the file BER by reducing the amount of unnecessary processing for BER calculations that takes place at clearly non-matching locations.
The concept behind the progressive thresholding method is that, since the calculation of a BER involves the summation of all the mismatched bits in the designated region of matching, the process can be terminated prematurely if the mismatched bits exceed a certain threshold within a smaller region. This threshold is termed here the progressive threshold. Figure 12 is a flowchart of a progressive threshold BER calculation procedure. The conceptual flow shown in Figure 12 is based on conducting an initial BER calculation on a smaller region of frames.
A number of frames, M, is initiated at a first predetermined level (step S252), which is less than the number of frames in the query fingerprint. The number M may be determined as a percentage of the full query, for instance 10 - 15%, or set as a fixed number, e.g. 50 frames. The frames that are used within the number of frames M do not need to include the matched frame. The more distorted the query is, the larger the number would ideally be. The number can be adaptively controlled based on a determination of distortion in the query, based on a sampling of background noise. For the current overlap position of the query fingerprint and the audio fingerprint, the BER is calculated based on just the M frames (step S254). Thus, where M is less than the total number of frames in the query fingerprint, the BER is calculated for a reduced area. The M-based BER is compared with a BER threshold (step S256). The BER threshold may also be fixed, for instance at 0.4, or adaptively controlled, based on the level of distortion in the query, especially if there is no adaptive control of the frame number M. If the M-based BER is not lower than the BER threshold, this sub-process ends, proceeding to step S238 of the process of Figure 11. If the M-based BER is lower than the BER threshold, a determination is then made as to whether the number M covers less than the whole of the query fingerprint (step S258). If the number M does not cover less than the whole of the query fingerprint, then the process continues as with the rest of the operation of Figure 11, with step S234. If the number M covers less than the whole of the query fingerprint, the frame number M is incremented by the initial M (i.e. M = M + initial M) (step S260) and the process returns to step S254, where the BER is calculated for the increased number of frames. This process continues until a BER is produced for all the frames in the query fingerprint, effectively as in step S232 of the process of Figure 11.
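The Python sketch below illustrates the progressive-threshold calculation for one alignment, using the example values mentioned above (an initial block of 50 frames and a BER threshold of 0.4); the early-exit return value and the function name are assumptions of the example.

    def progressive_ber(query_fp, audio_fp, start, m_init=50, ber_threshold=0.4):
        # Grow the compared region by the initial M until the whole query is covered,
        # terminating early when the partial BER already exceeds the threshold.
        n_frames, n_bands = len(query_fp), len(query_fp[0])
        m = min(m_init, n_frames)
        while True:
            errors = sum(qb != ab
                         for qf, af in zip(query_fp[:m], audio_fp[start:start + m])
                         for qb, ab in zip(qf, af))
            ber = errors / (m * n_bands)
            if ber >= ber_threshold:
                return None                    # clear non-match: stop early (step S256)
            if m >= n_frames:
                return ber                     # BER over the complete query fingerprint
            m = min(m + m_init, n_frames)      # enlarge the compared region (step S260)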
The improved matching algorithm is used for every attempt at a match location and reduces the time required to perform BER calculations at locations that clearly do not match, by terminating the BER calculation prematurely. The overall effect is an increase in matching speed.
In the embodiment described with respect to Figure 12, where a match is rejected before the complete BER is calculated, the process proceeds from step S256 of Figure 12 to step S238 of Figure 11. In an alternative embodiment, where a match is rejected before the complete BER is calculated, the process proceeds from step S256 of Figure 12 to step S234 of Figure 11. The current BER still stands as the BER for the reduced comparison set, but that is not a problem in step S234, because it is still a high BER that would be bettered by any audio file for which the BER is calculated for the complete query fingerprint.
Additionally, the embodiment described with reference to Figure 12 has the number of frames M incremented by the initial M each time in step S260. In a further alternative embodiment, the number of frames M is incremented immediately to the full number of frames in the query fingerprint the first time through step S260.
The above approaches to determining the best matching audio file use two distinct steps for each file. Firstly, potential matching locations between the query and each database audio file are determined. Secondly, time is consumed by the calculation of some form of comparison BER for every (or almost every) potential match.
An alternative approach to calculating a BER for each potential match location is described with reference to Figures 13 to 15. Figure 13 is a flowchart of a process for determining a best match using a statistical matching process. Figure 14 is a flowchart relating to determining a best match score, for example for use in the process of Figure 13. Figure 15 is an illustration of an example of how the query fingerprint, TTS maps and a newly introduced current count table interrelate in this alternative approach.
To increase matching speed, statistical matching is introduced. During the initial matching phase, the system locates the best matching location in every compared database audio file. A current count table records the similarity between the query and the database audio file. By storing the number of times each time location is indexed, the system selects a list of the top ten audio files. The top ten audio files are chosen based on the magnitude of the entries in the audio index map. Finally, a BER computation is done on the top ten audio files by aligning the query at the location stored by the audio index map. The matching audio file is identified as the audio file with the smallest BER.
Figure 13 is a flowchart of a process for determining a best match using a statistical matching process. The process of Figure 13 may be used as step S130 of the flowchart of Figure 3.
A current audio file number is initiated, that is j = 0 (step S202). The TTS maps and the audio fingerprint are retrieved for the audio file corresponding to the current audio file number (step S204). These first two steps are the same as the first two steps shown in the flowchart of Figure 9. A statistically determined best match is determined for the current database audio file number (step S272), providing a score for that database audio file. The current score is checked against the ten best scores so far (step S274) for the previously checked audio files, which, within the exemplary embodiment described below, are the ten highest scores so far. If the current score is better (e.g. higher) than any existing score within the ten best scores so far, the list of the ten best scores so far is updated (step S276), by removing the worst (e.g. lowest) of the scores within the top ten and storing the current score in the top ten. After the list of the ten best scores so far is updated, or if the current score is deemed no better than any existing score within the ten best scores so far in step S274, a determination is made of whether all the database audio files have been checked, that is whether j = the last relevant number (step S278). If any have not yet been checked for a match, the database audio file number is incremented by one, i.e. j = j + 1 (step S280). If all the database audio files have been checked, the BER is generated for each of the ten database audio files for which an associated score is present in the final list of the ten best scores (step S282). The query fingerprint is matched to the associated audio fingerprints in the arrangement that yielded the score within the ten best scores, that is, at the locations of the respective audio fingerprints that produced the scores in the final list of the ten best scores. The database audio file of these ten that produces the lowest BER is deemed to be the matching database audio file for the query and is output as the best match audio file (step S284).
The process of Figure 14 is an example of a process that corresponds to step S272 of the process of Figure 13. The approach used, in effect, matches frame 0 in the query fingerprint against a frame index in the audio fingerprint of a database audio file and determines how many matched TTS codes this results in for each frame index in the audio fingerprint.
The process of Figure 14 is initiated by resetting a query band number to zero, i.e. QBN = 0, a query frame index number to zero, i.e. QFI = 0, and resetting a file score and all frame index counts in a current count table to zero (step S302). For the current query frame index of the current query band, the TTS code in the query fingerprint is obtained (step S304). Using the TTS band map (for the same band number as the current query band number) of the database audio file being compared, the frame indices in the audio fingerprint of that database audio file that have the same TTS code are identified (step S306). A determination is made (step S307) as to whether any audio fingerprint frame indices were identified for the TTS code in step S306. Assuming there are any, the first of the identified frame indices is set as the current offset frame index (step S308). A determination is made of which frame index within the audio fingerprint the query fingerprint would have to start at, for the current query frame index to match with the current offset frame index (step S310). For the database audio file frame index within the audio fingerprint determined, in step S310, as the frame index at which the query fingerprint would start, a count associated with that database audio file frame index (in the current count table) is incremented by one (step S312). A check is made as to whether the current offset frame index is the last identified frame index from step S306 (step S314). If there are one or more further identified frame indices from step S306, the next identified frame index from step S306 is set as the current offset frame index (step S316).
Once all the identified frame indices from step S306 have been processed, a determination is made as to whether the current query frame index number QFI is the number of the last query frame index (step S318). A similar check is made if step S307 determined that no frame indices had been identified in step S306. If the current query frame index number QFI is not the number of the last query frame index, the query frame index number is incremented by one, i.e. QFI = QFI + 1 (step S320), and the process returns to step S304 to obtain the TTS code for the new query frame index number. If the current query frame index number QFI is the number of the last query frame index, a determination is made as to whether the current query band number QBN is the number of the last query band (step S322). If the current query band number QBN is not the number of the last query band, the query band number is incremented by one, i.e. QBN = QBN + 1 and the query frame index number is reset to 0, i.e. QFI = 0 (step S324), and the process returns to step S304 to obtain the TTS code for the next band. Once all the bands have been processed, the highest count associated with any audio file frame index in the final current count table is set as the best match score for audio file j (step S326), and the best match score is stored together with the identity of the associated database audio file frame index. This best match score may then be compared with the ten best scores in step S274 of the process of Figure 13. In this instance the ten best scores will be the ten highest best match scores.
In Figure 15, the TTS code 102 for query frame 86 is extracted from the query fingerprint 100. The TTS code is binary 100110100 (decimal 308). For decimal 308 of the TTS map 60 for band 0 of the database audio file being compared, there are three entries: frame 3, frame 100 and frame 422. As the current query frame index is frame 86, the offset associated with query frame 86, relative to query frame 0, is 86 (that is, 86 - 0). Thus for query frame 86 to match database audio file frame 3, query frame 0 would have to match database audio file frame negative 83 (that is, 3 - 86), which is not possible. The next identified frame index is frame 100. For query frame 86 to match database audio file frame 100, query frame 0 would match database audio file frame 14 (that is, the frame index less the offset, i.e. 100 - 86). As such, in a current count table 104, the current count associated with database audio file frame 14 is incremented by one. In this case, the count previously associated with database audio file frame 14 was eight and so its new count is nine. Likewise, for query frame 86 to match database audio file frame 422, query frame 0 would match database audio file frame 336 (that is, 422 - 86). As such, in the current count table 104, the current count associated with database audio file frame 336 is incremented by one. In this case, the count previously associated with database audio file frame 336 was twenty-four and so its new count is twenty-five. After this has been done for all query frame indices, a complete or final count table 106 is generated. In this particular case, the highest count number is 85, associated with database audio file frame 130.
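The counting step of Figures 14 and 15 can be sketched in Python as follows: every query TTS code votes for the audio frame at which query frame 0 would have to start, and the frame with the most votes, together with its count, becomes the file's best match score. The process of Figure 13 would then keep the ten highest-scoring files and compute a full BER only for those, at the stored offsets. The representation and names are the same illustrative assumptions as in the earlier sketches.

    def best_match_score(query_fp, audio_tts_maps, y=9):
        counts = {}                                     # the current count table
        n_bands = len(query_fp[0])
        for band in range(n_bands):
            for q_frame in range(len(query_fp) - y + 1):
                code = int("".join(str(query_fp[q_frame + k][band]) for k in range(y)), 2)
                for a_frame in audio_tts_maps[band].get(code, []):
                    start = a_frame - q_frame           # where query frame 0 would fall
                    if start >= 0:                      # negative offsets are impossible
                        counts[start] = counts.get(start, 0) + 1
        if not counts:
            return 0, None
        best_start = max(counts, key=counts.get)
        return counts[best_start], best_start           # score and its alignment offset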
For the statistical processing, as with the other methods of finding a best match, it is possible to reduce the amount of processing that occurs. For example, in the approach described with reference to the flowchart of Figure 14, the process could rely just upon a count obtained from fewer than all the bands, even a single band, such as band 0.
Another mechanism for reducing the amount of processing necessary is to classify the audio in the database, for example between the spoken word and music, with music classified into vocal or instrumental, classical or modern, etc. These categorisations can be downloaded with the music audio itself when generating the database. When a query is input for matching, pre-processing occurs to determine what type of audio it is, with similar categorisations. In this way, only audio files from the same general category need to be checked.
The apparatus and processes of the exemplary embodiments can be implemented on a computer system 400, for example as schematically shown in Figure 16. The embodiments may be implemented as software, such as a computer program being executed within the computer system 400, and instructing the computer system 400 to conduct the method of the example embodiment. The computer system 400 comprises a computer module 402, input modules such as a keyboard 404 and mouse 406 and a plurality of output devices such as a display 408, and printer 410.
The computer module 402 is connected to a computer network 412 via a suitable transceiver device 414, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 402 in the example includes a processor 418, a Random Access Memory (RAM) 420 and a Read Only Memory (ROM) 422. The computer module 402 also includes a number of input/output (I/O) interfaces, for example an I/O interface 424 to the display 408, and an I/O interface 426 to the keyboard 404.
The components of the computer module 402 typically communicate via an interconnected bus 428 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 400 encoded on a data storage medium such as a CD-ROM or floppy disc and read utilising a corresponding data storage medium drive of a data storage device 430. The application program is read and controlled in its execution by the processor 418. Intermediate storage of program data may be accomplished using the RAM 420. The application program supplied may include just the mechanism for identifying query files, without a mechanism for generating the database of audio files. That database may be supplied as a fixed database, for example on the same CD-ROM or available over the Internet or some other network.
The above description is of audio files in general. However, the embodiments are of particular use where the audio files are music files, for example songs, with the query also being of a song. The described methods provide systematic approaches to identifying an audio clip from a song database efficiently, by the generation of TTS maps and audio fingerprints. The approaches also provide accurate and efficient methods of searching for the matching song in a song database. In addition, the embodiments have the ability to handle non-linearly distorted queries without sacrificing search speed or storage requirements. The embodiments provide robust and efficient algorithms for an audio identification system.
The embodiments of the invention provide efficient search and identification methodologies for an audio identification system. Exemplary applications of the audio identification system include a broadcast monitoring system, a mobile retrieval system and a content monitoring system amongst others.
Broadcast monitoring system: A broadcast monitoring system has a central database server and a monitoring station. The monitoring station transmits audio fingerprints extracted from the broadcast audio content to the central server. The central server determines the identity of the song being broadcast through its large database of songs.
Mobile retrieval system: A clip of a song is recorded using a mobile telephone and transmitted through the mobile network. A song information server processes the clip and, in return, messages the result back to the user.
Content monitoring system: An audio file transferred through the Internet is first screened by an audio server. The audio server determines the audio content through audio identification rather than by filename or ID tag. With the information gathered, it checks the legal rights to the file on the distribution channel and blocks unauthorised transmission.
In the foregoing manner, a method and apparatus for identifying audio such as music are disclosed. Only several embodiments are described but it will be apparent to one skilled in the art in view of this disclosure that numerous changes and/or modifications may be made without departing from the scope of the invention.

Claims

1. A method of generating a database from audio files, for use in audio identification of a query file, comprising: receiving one or more audio files; for individual received audio files extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and storing the one or more audio fingerprints and location tables in a database.
2. A method of matching an audio query against one or more known audio files, comprising: (a) receiving the query; (b) extracting a query fingeφrint of the query; (c) receiving an audio fingeφrint of an audio file; (d) identifying one or more data strings of a predetermined category from the query fingeφrint; (e) identifying a location within the audio fingeφrint having a data string corresponding to one of the one or more data strings of the query fingeφrint; (f) determining an alignment of the query fingeφrint and the audio fingeφrint that matches the position of the identified location within the audio fingeφrint with the location of the predetermined data string; and (g) determining a similarity measure between the query fingeφrint and the audio fingeφrint based on the alignment.
3. A method according to claim 2, wherein step (e) comprises using a location table mapping the locations of data strings of a predetermined category within the audio fingerprint.
4. A method according to claim 2, wherein the audio fingerprint of an audio file is received from a database of audio files.
5. A method according to claim 4, further comprising generating a database from audio files, comprising: receiving one or more audio files; for individual received audio files extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and storing the one or more audio fingerprints and location tables in the database; and wherein step (e) comprises using the location table for the received audio fingerprint.
6. A method according to any one of claims 2 to 5, wherein step (d) comprises identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; and step (e) comprises, for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the identified first data string.
7. A method according to claim 6, wherein step (f) comprises, for the or each second location within the audio fingeφrint, identified as having a second data string, aligning the query fingeφrint against the audio fingeφrint by matching the identified second location within the audio fingeφrint with the first location of the corresponding first data string within the query fingeφrint; step (g) comprises determining the similarity measure between the query fingeφrint and the audio fingeφrint for one or more different alignments generated in step (f); and further comprising (a) determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
8. A method of matching an audio query against an audio file, comprising: (a) receiving the query file; (b) extracting a query fingeφrint of the query file; (c) receiving an audio fingeφrint of a first audio file; (d) identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingeφrint; (e) for the or each identified first data string of the query fingeφrint, identifying one or more second locations, within the audio fingeφrint, having a second data string corresponding to the identified first data string; (f) for the or each second location, within the audio fingeφrint, identified as having a second data string, aligning the query fingeφrint against the audio fingeφrint by matching the identified second location, within the audio fingeφrint, with the first location of the corresponding first data string within the query fingeφrint; (g) determining a similarity measure between the query fingeφrint and the audio fingeφrint for one or more different alignments generated in step (f); and (h) determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
9. A method according to claim 7 or 8, wherein step (g) further comprises determining a first similarity measure based on a reduced area of the query fingerprint and comprises determining a second similarity measure based on a fuller or complete area of the query fingerprint only if the first similarity measure is on a first side of a first threshold.
10. A method according to any one of claims 2 to 9, further comprising: repeating steps (c) and (e) to (g) for a plurality of different audio files; and comparing the similarity measures of the plurality of audio files to determine a best match.
11. A method according to claim 6, wherein step (g) comprises, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and further comprising prior to step (g), (i) identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location; and wherein step (g) is carried out on the alignment where the predetermined fourth location of the query fingerprint is aligned with the fifth location of the audio fingerprint.
12. A method of matching an audio query file against one or more known audio files, comprising: (a) receiving the query file; (b) extracting a query fingerprint of the query file; (c) receiving an audio fingerprint of a first audio file; (d) identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
13. A method according to claim 12, further comprising, after step (i), (g) determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment where the predetermined fourth location of the query fingerprint is aligned with the fifth location of the audio fingerprint.
14. A method according to claim 11 or 13, further comprising repeating steps (c), (e), (f) and (i) a plurality of times, for a plurality of different audio files, prior to step (i).
15. A method according to claim 14, further comprising comparing the count from step (i) from each of the plurality of different audio files and conducting step (g) in respect of those audio files for which the count is one of those within a predetermined number of the highest counts.
16. A method according to claim 15, further comprising comparing the similarity measures of the plurality of audio files to determine a best match.
17. A method according to any one of claims 11 to 16, wherein the predetermined fourth location is the start of the query fingerprint.
18. A method according to any one of the preceding claims, wherein extracting a fingerprint comprises extracting multiband energies from a file.
19. A method according to claim 18, wherein extracting a fingerprint further comprises obtaining differential energies from the multiband energies.
20. A method according to claim 18 or 19, wherein the multiband energies comprise energies in 6 bands.
21. A method according to any one of claims 18 to 20, wherein extracting a fingerprint further comprises dividing the file into time domain samples.
22. A method according to any one of claims 18 to 21, wherein extracting multiband energies comprises converting samples of the audio file into frequency samples.
23. A method according to any one of claims 18 to 22, wherein extracting multiband energies further comprises grouping samples within different frequency ranges into different frequency bands.
24. A method according to claim 23, wherein extracting multiband energies further comprises, in individual frequency bands, calculating the sum of squares of the samples.
25. A method according to any one of claims 18 to 24, wherein obtaining differential energies from the multiband energies comprises determining a difference value between the multiband energies of a current sample and the multiband energies of a preceding sample.
26. A method according to claim 25, wherein the preceding sample comprises the immediately preceding sample.
27. A method according to any one of claims 18 to 26, further comprising obtaining binary outputs from the differential multiband energies.
28. A method according to claim 27, wherein obtaining binary outputs from the differential multiband energies comprises comparing the differential multiband energies with thresholds to provide binary outputs.
29. A method according to claim 27 or 28, further comprising concatenating the binary outputs from individual samples to provide differential energy contours.
30. A method according to any one of claims 27 to 29, further comprising, for individual bands, concatenating the binary outputs from successive samples.
31. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise data strings of a predetermined length.
32. A method according to claim 31, wherein the predetermined length is 9 bits.
33. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of contiguous data.
34. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of data in a time direction.
35. A method according to any one of the preceding claims, wherein the audio fingerprint comprises data derived from different frequency bands of the audio file and individual data strings of a predetermined category comprise strings of data from individual frequency bands.
36. A method according to claim 35 when dependent on at least claim 6, 8 or 11, wherein the first locations comprise, for at least one of the different frequency bands, every location in the audio fingerprint derived from the at least one frequency band.
37. A method according to any one of the preceding claims, wherein the data at individual locations within the audio fingerprint is present in a plurality of the data strings of a predetermined category.
38. A method according to claims 36 and 37, wherein the data at the starting location in the at least one frequency band is only used by one of the data strings of a predetermined category and the data at the end location in the at least one frequency band is only used by one of the data strings of a predetermined category.
39. A method according to any one of the preceding claims, wherein the data strings of a predetermined category comprise strings of data that overlap.
40. A method according to any one of claims 2 to 11 and 13 to 16 or according to any one of claims 17 to 39 when dependent on at least one of claims 2, 8 and 13, wherein the similarity measure is a bit error rate.
41. A method according to any one of the preceding claims, wherein the audio file comprises a music file.
42. A method according to any one of the preceding claims, wherein the query comprises a portion of a music file.
43. A database relating to a plurality of audio files, comprising: a plurality of audio fingerprints; and for individual audio fingerprints, a location table mapping the locations of data strings of a predetermined category within the audio fingerprint.
44. Apparatus operable according to the method of any one of the preceding claims.
45. Apparatus for generating a database from audio files, for use in audio identification of a query file, comprising: means for receiving one or more audio files; means for, for individual received audio files, extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and means for storing the one or more audio fingerprints and location tables in a database.
46. Apparatus for matching an audio query against one or more known audio files, comprising: (a) means for receiving the query; (b) means for extracting a query fingerprint of the query; (c) means for receiving an audio fingerprint of an audio file; (d) means for identifying one or more data strings of a predetermined category from the query fingerprint; (e) means for identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; (f) means for determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and (g) means for determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
47. Apparatus for matching an audio query file against an audio file, comprising: (a) means for receiving the query file; (b) means for extracting a query fingerprint of the query file; (c) means for receiving an audio fingerprint of a first audio file; (d) means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; (f) means for, for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; (g) means for determining a similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and (h) means for determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
48. Apparatus for matching an audio query file against one or more known audio files, comprising: (a) means for receiving the query file; (b) means for extracting a query fingerprint of the query file; (c) means for receiving an audio fingerprint of a first audio file; (d) means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) means for, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) means for identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
49. A computer program product having a computer program recorded on a computer readable medium, for instructing a computer to operate according to the method of any one of claims
50. A computer program product having a computer program recorded on a computer readable medium, for generating a database from audio files, for use in audio identification of a query file, said computer program product comprising: computer program code means for receiving one or more audio files; computer program code means for, for individual received audio files, extracting an audio fingerprint of the audio file; and generating a location table mapping the locations of data strings of a predetermined category within the audio fingerprint; and computer program code means for storing the one or more audio fingerprints and location tables in a database.
51. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query against one or more known audio files, said computer program product comprising: (a) computer program code means for receiving the query; (b) computer program code means for extracting a query fingerprint of the query; (c) computer program code means for receiving an audio fingerprint of an audio file; (d) computer program code means for identifying one or more data strings of a predetermined category from the query fingerprint; (e) computer program code means for identifying a location within the audio fingerprint having a data string corresponding to one of the one or more data strings of the query fingerprint; (f) computer program code means for determining an alignment of the query fingerprint and the audio fingerprint that matches the position of the identified location within the audio fingerprint with the location of the predetermined data string; and (g) computer program code means for determining a similarity measure between the query fingerprint and the audio fingerprint based on the alignment.
52. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query file against an audio file, said computer program product comprising: (a) computer program code means for receiving the query file; (b) computer program code means for extracting a query fingerprint of the query file; (c) computer program code means for receiving an audio fingerprint of a first audio file; (d) computer program code means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) computer program code means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, within the audio fingerprint, having a second data string corresponding to the identified first data string; (f) computer program code means for, for the or each second location, within the audio fingerprint, identified as having a second data string, aligning the query fingerprint against the audio fingerprint by matching the identified second location, within the audio fingerprint, with the first location of the corresponding first data string within the query fingerprint; (g) computer program code means for determining a similarity measure between the query fingerprint and the audio fingerprint for one or more different alignments generated in step (f); and (h) computer program code means for determining a best match measure for the received audio file from amongst the one or more determined similarity measures.
53. A computer program product having a computer program recorded on a computer readable medium, for matching an audio query against one or more known audio files, said computer program product comprising: (a) computer program code means for receiving the query file; (b) computer program code means for extracting a query fingerprint of the query file; (c) computer program code means for receiving an audio fingerprint of a first audio file; (d) computer program code means for identifying a set of one or more first data strings of a predetermined category, the first data strings being those data strings that are present at each of one or more first locations, the first locations being locations within the query fingerprint; (e) computer program code means for, for the or each identified first data string of the query fingerprint, identifying one or more second locations, the second locations being locations within the audio fingerprint having a second data string, a second data string being a data string corresponding to the first identified first data string; (f) computer program code means for, for the or each identified second location within the audio fingerprint, determining a third location, the third location being a location within the audio fingerprint that would correspond to a predetermined fourth location, within the query fingerprint, if the query fingerprint were aligned against the audio fingerprint by matching the respective second location, within the audio fingerprint, with the first identified location of the corresponding first identified first data string within the query fingerprint; and (i) computer program code means for identifying a fifth location and an associated count, within the audio fingerprint, the fifth location being the location of the audio fingerprint that is determined as the third location most often and the associated count being the number of times the fifth location is determined as the third location.
54. A method of generating a database from audio files, for use in audio identification of a query file, substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
55. A method of matching an audio query file against an audio file, substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
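By way of illustration only, the following Python sketch shows one possible realisation of the fingerprint extraction set out in claims 18 to 31: the file is divided into time-domain frames, converted to frequency samples, grouped into 6 bands (claim 20), the per-band sum of squares is differenced against the immediately preceding frame (claims 24 to 26) and thresholded to a binary output per band (claims 27 and 28), the successive bits forming the differential energy contours (claims 29 and 30). The sample rate, frame length, hop size, band edges and zero threshold are assumptions chosen for readability and are not taken from the patent.

    import numpy as np

    def extract_fingerprint(samples, sr=16000, frame=2048, hop=1024,
                            band_edges=(300, 600, 1200, 2400, 3600, 4800, 6000)):
        """Return a binary array of shape (frames - 1, 6): one bit per band per frame."""
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        band_of_bin = np.digitize(freqs, band_edges) - 1   # bins outside the edges map to -1 or 6
        energies = []
        for start in range(0, len(samples) - frame + 1, hop):
            spectrum = np.fft.rfft(samples[start:start + frame] * np.hanning(frame))
            power = np.abs(spectrum) ** 2
            # Sum of squares of the frequency samples grouped per band (claims 23 and 24).
            energies.append([power[band_of_bin == b].sum() for b in range(6)])
        energies = np.asarray(energies)
        # Differential energies against the immediately preceding frame (claims 25 and 26).
        diff = energies[1:] - energies[:-1]
        # Threshold to a binary output per band (claims 27 and 28); each column,
        # read down the frames, is one differential energy contour (claims 29 and 30).
        return (diff > 0).astype(np.uint8)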
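The location table of claims 3, 5 and 43 can be sketched as a lookup from every data string of the predetermined category to the frames at which it occurs. Here the strings are overlapping 9-bit runs taken in the time direction within each band (claims 31 to 35 and 39); the dictionary layout and the variable names are assumptions.

    from collections import defaultdict

    STRING_LEN = 9  # predetermined length of a data string (claim 32)

    def build_location_table(fingerprint):
        """fingerprint: binary array of shape (frames, bands) from extract_fingerprint."""
        table = defaultdict(list)
        frames, bands = fingerprint.shape
        for band in range(bands):
            contour = fingerprint[:, band]
            # Overlapping, contiguous strings in the time direction; the first and last
            # frames of a band each belong to only one string (claims 33, 34, 38 and 39).
            for t in range(frames - STRING_LEN + 1):
                table[(band, tuple(contour[t:t + STRING_LEN]))].append(t)
        return table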
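Claims 2, 6 to 8 and 40 describe aligning the query against a stored fingerprint wherever a query string reappears, and scoring each alignment, for example with a bit error rate. A minimal sketch of that loop, assuming the extractor and location table above:

    import numpy as np

    def bit_error_rate(query_fp, db_fp, offset):
        """Fraction of differing bits over the overlap when the query starts at `offset`."""
        n = min(len(query_fp), len(db_fp) - offset)
        if n <= 0:
            return 1.0
        q, d = query_fp[:n], db_fp[offset:offset + n]
        return float(np.count_nonzero(q != d)) / q.size

    def best_match(query_fp, db_fp, db_table, string_len=9):
        best = 1.0
        frames, bands = query_fp.shape
        for band in range(bands):
            for t in range(frames - string_len + 1):
                key = (band, tuple(query_fp[t:t + string_len, band]))
                for loc in db_table.get(key, []):
                    # Line the matched string in the stored fingerprint up with its
                    # position in the query (step (f)) and score the alignment (step (g)).
                    offset = loc - t
                    if offset >= 0:
                        best = min(best, bit_error_rate(query_fp, db_fp, offset))
        return best  # the smallest bit error rate found for this audio file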
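Claim 9's two-stage similarity check can be sketched as a quick score over a reduced area of the query, with the full comparison run only when the quick score falls on the accepting side of a threshold. The 25% prefix and the 0.35 threshold are illustrative assumptions, and bit_error_rate is the helper defined in the previous sketch.

    def two_stage_ber(query_fp, db_fp, offset, quick_fraction=0.25, quick_threshold=0.35):
        # First similarity measure on a reduced area of the query fingerprint.
        quick_len = max(1, int(len(query_fp) * quick_fraction))
        quick = bit_error_rate(query_fp[:quick_len], db_fp, offset)
        if quick > quick_threshold:
            return quick                    # reject cheaply; skip the full comparison
        # Second similarity measure on the complete query fingerprint.
        return bit_error_rate(query_fp, db_fp, offset)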
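Claims 11 to 17 replace exhaustive scoring with a vote: each table hit implies the position in the stored fingerprint that would line up with the start of the query (the fourth location, claim 17), the most-voted position (the fifth location) and its count are recorded, and only the files with the highest counts are scored in full (claims 15 and 16). A sketch under the same assumptions as above, with an arbitrary shortlist size of 5:

    from collections import Counter

    def vote_for_offsets(query_fp, db_table, string_len=9):
        votes = Counter()
        frames, bands = query_fp.shape
        for band in range(bands):
            for t in range(frames - string_len + 1):
                key = (band, tuple(query_fp[t:t + string_len, band]))
                for loc in db_table.get(key, []):
                    votes[loc - t] += 1   # position that would align with the start of the query
        return votes

    def shortlist(query_fp, database, top_n=5):
        """database: iterable of (file_id, db_fp, db_table) triples.
        Returns (count, file_id, offset) for the files with the highest vote counts."""
        ranked = []
        for file_id, db_fp, db_table in database:
            votes = vote_for_offsets(query_fp, db_table)
            if votes:
                offset, count = votes.most_common(1)[0]
                ranked.append((count, file_id, offset))
        ranked.sort(reverse=True)
        return ranked[:top_n]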
PCT/SG2004/000092 2004-04-13 2004-04-13 Method and apparatus for identifying audio such as music WO2005101243A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2004/000092 WO2005101243A1 (en) 2004-04-13 2004-04-13 Method and apparatus for identifying audio such as music

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2004/000092 WO2005101243A1 (en) 2004-04-13 2004-04-13 Method and apparatus for identifying audio such as music

Publications (1)

Publication Number Publication Date
WO2005101243A1 true WO2005101243A1 (en) 2005-10-27

Family

ID=35150189

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2004/000092 WO2005101243A1 (en) 2004-04-13 2004-04-13 Method and apparatus for identifying audio such as music

Country Status (1)

Country Link
WO (1) WO2005101243A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998025269A1 (en) * 1996-12-02 1998-06-11 Thomson Consumer Electronics, Inc. Apparatus and method for identifying the information stored on a medium
US6453252B1 (en) * 2000-05-15 2002-09-17 Creative Technology Ltd. Process for identifying audio content
WO2002008943A2 (en) * 2000-07-21 2002-01-31 Cddb, Inc. Method and system for finding match in database related to waveforms
WO2002011123A2 (en) * 2000-07-31 2002-02-07 Shazam Entertainment Limited Method for search in an audio database
US20030086341A1 (en) * 2001-07-20 2003-05-08 Gracenote, Inc. Automatic identification of sound recordings
US20030191764A1 (en) * 2002-08-06 2003-10-09 Isaac Richards System and method for acoustic fingerpringting

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CANO P. ET AL.: "A Review of Algorithms for Audio Fingerprinting", IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 9 December 2002 (2002-12-09), pages 169 - 173, XP010642539, Retrieved from the Internet <URL:http://www.iua.upf.es/mtg/publications/MMSP-2002-pcano.pdf> *
FREEDB.HOWTO, 8 January 2004 (2004-01-08), Retrieved from the Internet <URL:http://www.freedb.org/software/freedb.howto1.07.zip> *
HAITSMA J. ET AL.: "An Efficient Database Search Strategy For Audio Fingerprinting", IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 9 December 2002 (2002-12-09), pages 178 - 181, XP010642541 *

Cited By (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9100132B2 (en) 2002-07-26 2015-08-04 The Nielsen Company (Us), Llc Systems and methods for gathering audience measurement data
US8959016B2 (en) 2002-09-27 2015-02-17 The Nielsen Company (Us), Llc Activating functions in processing devices using start codes embedded in audio
US9711153B2 (en) 2002-09-27 2017-07-18 The Nielsen Company (Us), Llc Activating functions in processing devices using encoded audio and detecting audio signatures
US9900652B2 (en) 2002-12-27 2018-02-20 The Nielsen Company (Us), Llc Methods and apparatus for transcoding metadata
US9609034B2 (en) 2002-12-27 2017-03-28 The Nielsen Company (Us), Llc Methods and apparatus for transcoding metadata
KR100774585B1 (en) 2006-02-10 2007-11-09 삼성전자주식회사 Mehtod and apparatus for music retrieval using modulation spectrum
KR100862616B1 (en) * 2007-04-17 2008-10-09 한국전자통신연구원 Searching system and method of audio fingerprint by index information
WO2008127052A1 (en) * 2007-04-17 2008-10-23 Electronics And Telecommunications Research Institute System and method for searching audio fingerprint by index information
ITVE20080074A1 (en) * 2008-09-30 2008-12-30 Fausto Artico RECOGNITION ALGORITHM AND POSSIBLE RECEIVING OF DATA CLUSTERS IN BIGGER DATA SETS IN THE PRESENCE OF ANY NOISE OR OTHER FORMS OF DISORDER THAT CORRORATE THEM MORE OR LESS
WO2010038187A1 (en) * 2008-09-30 2010-04-08 Fausto Artico Method for data clusters indexing, recognition and retrieval in presence of noise
US9667365B2 (en) 2008-10-24 2017-05-30 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US8121830B2 (en) 2008-10-24 2012-02-21 The Nielsen Company (Us), Llc Methods and apparatus to extract data encoded in media content
US8554545B2 (en) 2008-10-24 2013-10-08 The Nielsen Company (Us), Llc Methods and apparatus to extract data encoded in media content
US11256740B2 (en) 2008-10-24 2022-02-22 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US8359205B2 (en) 2008-10-24 2013-01-22 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US11386908B2 (en) 2008-10-24 2022-07-12 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US10467286B2 (en) 2008-10-24 2019-11-05 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US11809489B2 (en) 2008-10-24 2023-11-07 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US10134408B2 (en) 2008-10-24 2018-11-20 The Nielsen Company (Us), Llc Methods and apparatus to perform audio watermarking and watermark detection and extraction
US8508357B2 (en) 2008-11-26 2013-08-13 The Nielsen Company (Us), Llc Methods and apparatus to encode and decode audio for shopper location and advertisement presentation tracking
KR101030306B1 (en) 2009-02-18 2011-04-19 한국과학기술원 Multimedia fingerprint producing method and system based on quantum hashing
US10003846B2 (en) 2009-05-01 2018-06-19 The Nielsen Company (Us), Llc Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US8666528B2 (en) 2009-05-01 2014-03-04 The Nielsen Company (Us), Llc Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US10555048B2 (en) 2009-05-01 2020-02-04 The Nielsen Company (Us), Llc Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US11004456B2 (en) 2009-05-01 2021-05-11 The Nielsen Company (Us), Llc Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US11948588B2 (en) 2009-05-01 2024-04-02 The Nielsen Company (Us), Llc Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US9380356B2 (en) 2011-04-12 2016-06-28 The Nielsen Company (Us), Llc Methods and apparatus to generate a tag for media content
US11296962B2 (en) 2011-06-21 2022-04-05 The Nielsen Company (Us), Llc Monitoring streaming media content
US10791042B2 (en) 2011-06-21 2020-09-29 The Nielsen Company (Us), Llc Monitoring streaming media content
US11784898B2 (en) 2011-06-21 2023-10-10 The Nielsen Company (Us), Llc Monitoring streaming media content
US11252062B2 (en) 2011-06-21 2022-02-15 The Nielsen Company (Us), Llc Monitoring streaming media content
US9515904B2 (en) 2011-06-21 2016-12-06 The Nielsen Company (Us), Llc Monitoring streaming media content
US9210208B2 (en) 2011-06-21 2015-12-08 The Nielsen Company (Us), Llc Monitoring streaming media content
US9838281B2 (en) 2011-06-21 2017-12-05 The Nielsen Company (Us), Llc Monitoring streaming media content
US9209978B2 (en) 2012-05-15 2015-12-08 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9197421B2 (en) 2012-05-15 2015-11-24 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9282366B2 (en) 2012-08-13 2016-03-08 The Nielsen Company (Us), Llc Methods and apparatus to communicate audience measurement information
US9313544B2 (en) 2013-02-14 2016-04-12 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9357261B2 (en) 2013-02-14 2016-05-31 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9711152B2 (en) 2013-07-31 2017-07-18 The Nielsen Company (Us), Llc Systems apparatus and methods for encoding/decoding persistent universal media codes to encoded audio
US9336784B2 (en) 2013-07-31 2016-05-10 The Nielsen Company (Us), Llc Apparatus, system and method for merging code layers for audio encoding and decoding and error correction thereof
US10721524B2 (en) 2014-04-30 2020-07-21 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9699499B2 (en) 2014-04-30 2017-07-04 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US11831950B2 (en) 2014-04-30 2023-11-28 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US10231013B2 (en) 2014-04-30 2019-03-12 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US11277662B2 (en) 2014-04-30 2022-03-15 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9881083B2 (en) 2014-08-14 2018-01-30 Yandex Europe Ag Method of and a system for indexing audio tracks using chromaprints
US9558272B2 (en) 2014-08-14 2017-01-31 Yandex Europe Ag Method of and a system for matching audio tracks using chromaprints with a fast candidate selection routine
US10698948B2 (en) 2015-01-02 2020-06-30 Gracenote, Inc. Audio matching based on harmonogram
US11366850B2 (en) 2015-01-02 2022-06-21 Gracenote, Inc. Audio matching based on harmonogram
US10282471B2 (en) 2015-01-02 2019-05-07 Gracenote, Inc. Audio matching based on harmonogram
US11057680B2 (en) 2015-05-29 2021-07-06 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US10694254B2 (en) 2015-05-29 2020-06-23 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US11689769B2 (en) 2015-05-29 2023-06-27 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US10299002B2 (en) 2015-05-29 2019-05-21 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US9762965B2 (en) 2015-05-29 2017-09-12 The Nielsen Company (Us), Llc Methods and apparatus to measure exposure to streaming media
US10719551B2 (en) 2016-04-19 2020-07-21 Tencent Technology (Shenzhen) Company Limited Song determining method and device and storage medium
WO2017181852A1 (en) * 2016-04-19 2017-10-26 腾讯科技(深圳)有限公司 Song determining method and device, and storage medium

Similar Documents

Publication Publication Date Title
WO2005101243A1 (en) Method and apparatus for identifying audio such as music
US7328153B2 (en) Automatic identification of sound recordings
JP4902565B2 (en) How to hash a digital signal
Cano et al. A review of algorithms for audio fingerprinting
JP4945877B2 (en) System and method for recognizing sound / musical signal under high noise / distortion environment
USRE46037E1 (en) Method, medium, and system for music retrieval using modulation spectrum
JP5907511B2 (en) System and method for audio media recognition
KR100862616B1 (en) Searching system and method of audio fingerprint by index information
US20060013451A1 (en) Audio data fingerprint searching
US20020126872A1 (en) Method, apparatus and programs for generating and utilizing content signatures
US20050141707A1 (en) Efficient storage of fingerprints
CN101014953A (en) Audio fingerprinting system and method
US10217469B2 (en) Generation of a signature of a musical audio signal
CN104866604B (en) A kind of information processing method and server
Jiao et al. Robust speech hashing for content authentication
Haitsma et al. An efficient database search strategy for audio fingerprinting
Kekre et al. A review of audio fingerprinting and comparison of algorithms
CN111382303B (en) Audio sample retrieval method based on fingerprint weight
CN111666928A (en) Computer file similarity recognition system and method based on image analysis
US10861436B1 (en) Audio call classification and survey system
CN110400578A (en) The generation of Hash codes and its matching process, device, electronic equipment and storage medium
Khemiri et al. A generic audio identification system for radio broadcast monitoring based on data-driven segmentation
JP2012068827A (en) Retrieval device for information related to acoustic data
Li et al. Content-based audio retrieval using perceptual hash
CN117932106A (en) Multi-level indexing method and system for audio fingerprint database data

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase