JP2014506686A - Extracting and matching feature fingerprints from speech signals - Google Patents

Info

Publication number
JP2014506686A
JP2014506686A (application JP2013553444A)
Authority
JP
Japan
Prior art keywords
audio signal
fingerprint
resampled
frequency
method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2013553444A
Other languages
Japanese (ja)
Other versions
JP5826291B2 (en)
Inventor
Sergiy Bilobrov (セルジー ビロブロフ)
Original Assignee
Yahoo! Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US13/025,060 (US9093120B2)
Application filed by Yahoo! Inc.
Priority to PCT/US2012/021303 (WO2012108975A2)
Publication of JP2014506686A
Application granted
Publication of JP5826291B2
Legal status: Active

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Abstract

An audio fingerprint is extracted from an audio sample and contains characteristic information about the content of the sample. The fingerprint can be generated by calculating the energy spectrum of the audio sample, resampling the energy spectrum, transforming the resampled energy spectrum to generate a series of feature vectors, and calculating the fingerprint by differential encoding of the feature vectors. The generated fingerprint can be compared against a set of reference fingerprints in a database to identify the original audio content.
[Selected figure] FIG. 1

Description

  The present invention relates generally to audio signal processing, and more particularly to extracting characteristic fingerprints from audio signals and searching a database for such fingerprints.

  Because file formats, compression techniques, and other data representation methods vary, identifying a data signal or comparing it with other data signals poses significant technical obstacles. For example, for digital music files on a computer, there are many formats for encoding and compressing music. In addition, music is often sampled in digital form at various data rates and has various characteristics (e.g., various waveforms). Furthermore, recorded analog audio contains noise and distortion. These significant waveform differences make direct comparison of such files a poor choice for efficient file or signal recognition or comparison. Direct comparison of files also cannot compare media encoded in different formats (e.g., comparing the same music encoded in MP3 and WAV).

  For these reasons, identifying and tracking media and other content distributed over the Internet is often done by attaching metadata, watermarks, or some other code containing identifying information to the media. However, this attached information is often incomplete and/or inaccurate. For example, metadata is rarely complete, and it is even rarer for file names to be consistent. In addition, techniques such as watermarking, which modify the original file by adding data or code, are invasive. Another disadvantage of this approach is that it is vulnerable to tampering. Even if a media file contains accurate identification data, such as metadata or a watermark, the file can be "unlocked" (and therefore pirated) if that information is successfully removed.

  To avoid such problems, other methods based on analyzing the content of the data signal itself have been developed. In one such method, an audio fingerprint is generated for an audio segment, where the audio fingerprint contains characteristic information about the audio that can be used to identify the original audio. In one embodiment, the audio fingerprint comprises a digital sequence that identifies an audio fragment. The process of generating an audio fingerprint is often based on the acoustic and perceptual properties of the audio from which the fingerprint is generated. Audio fingerprints are typically much smaller than the original audio content and can therefore be used as a convenient tool to identify, compare, and search for audio content. Audio fingerprinting can be used in a wide variety of applications, including broadcast monitoring, organizing audio content, filtering content on P2P networks, and identifying music or other audio content. When applied to these various fields, audio fingerprinting typically involves fingerprint extraction as well as fingerprint database search algorithms.

  Most existing fingerprinting techniques are based on extracting audio features from an audio sample in the frequency domain. The audio is first divided into frames, and a set of features is calculated for each frame. Audio features that can be used include fast Fourier transform (FFT) coefficients, mel-frequency cepstral coefficients (MFCC), spectral flatness, sharpness, entropy, and modulation frequency. The computed features are assembled into feature vectors, which are usually transformed using derivatives, means, or variances. The feature vector is mapped into a more compact representation using an algorithm such as principal component analysis, and then quantized to produce an audio fingerprint. Typically, a fingerprint obtained by processing a single audio frame has a relatively small size and may not be unique enough to identify the original audio sequence with the desired reliability. To improve the uniqueness of the fingerprint and thus increase the chance of correct recognition (and reduce the false positive rate), small sub-fingerprints can be combined into a larger block representing approximately 3 to 5 seconds of audio.

  One fingerprinting technique developed by Philips uses a short-time Fourier transform (STFT) to extract a 32-bit sub-fingerprint every 11.8 milliseconds of the audio signal. The audio signal is first divided into overlapping frames 0.37 seconds long, and the frames are weighted by a Hamming window with an overlap factor of 31/32 and converted to the frequency domain using an FFT. The resulting frequency-domain data can be represented as a spectrogram (e.g., a time-frequency diagram) in which the horizontal axis represents time and the vertical axis represents frequency. The spectrum of each frame (spectrogram column) is divided into 33 non-overlapping, logarithmically spaced frequency bands between 300 Hz and 2000 Hz. The spectral energy in each band is calculated, and a 32-bit sub-fingerprint is generated using the sign of the energy differences between consecutive bands along the time and frequency axes. If the energy difference between two bands in one frame is greater than the energy difference between the same bands in the previous frame, the algorithm outputs a "1" for the corresponding bit in the sub-fingerprint; otherwise, it outputs a "0". A fingerprint is constructed by combining 256 consecutive 32-bit sub-fingerprints into a single fingerprint block corresponding to 3 seconds of audio.
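  The sign-of-energy-difference rule described above can be illustrated with a short sketch. This is only an illustration of the prior-art scheme as summarized here, not of the invention; the function name and the random example energies are placeholders.

```python
import numpy as np

def philips_subfingerprint(energies, prev_energies):
    """Derive one 32-bit sub-fingerprint from the 33 band energies of the
    current frame and of the previous frame: a bit is 1 when the energy
    difference between adjacent bands grows relative to the previous frame."""
    diff_now = energies[:-1] - energies[1:]              # adjacent-band differences, current frame
    diff_prev = prev_energies[:-1] - prev_energies[1:]   # same differences, previous frame
    return (diff_now - diff_prev > 0).astype(np.uint8)   # 32 bits, one per band pair

rng = np.random.default_rng(0)
prev_frame, cur_frame = rng.random(33), rng.random(33)
print(philips_subfingerprint(cur_frame, prev_frame))
```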

  This algorithm is designed to be robust against common types of audio processing, noise, and distortion, but it is not very robust against large speed changes because of the resulting spectral scaling. A modified algorithm has therefore been proposed that extracts audio fingerprints in the scale-invariant Fourier-Mellin domain. The modified algorithm includes additional steps performed after converting the audio frame to the frequency domain. These additional steps consist of a logarithmic mapping of the spectrum followed by a second Fourier transform. Thus, a first FFT is applied to each frame, the result is log-mapped to obtain a power spectrum, and a second FFT is then applied. This can be described as a Fourier transform of a logarithmically resampled Fourier transform, similar to the well-known MFCC method widely used in speech recognition. The main difference is that the Fourier-Mellin transform uses log mapping of the full spectrum, whereas MFCC uses the mel frequency scale (linear up to 1 kHz, with logarithmic spacing at higher frequencies, based on simulating the characteristics of the human auditory system).

  The Philips algorithm is categorized as a so-called short-term analysis algorithm because the sub-fingerprinting process uses only two consecutive frames of spectral coefficients. There are other algorithms that extract spectral feature elements using multiple overlapping FFT frames of a spectrogram. Some of the methods based on evaluating multiple frames in time are known as long-term spectrogram analysis algorithms.

  For example, in "Modulation-Scale Analysis for Content Identification" by Sukittanon, IEEE Transactions on Signal Processing, vol. 52, no. 10 (October 2004), one long-term analysis algorithm is based on the estimation of modulation frequencies. In this algorithm, the audio is divided into frames and a spectrogram is calculated. A modulation spectrum is then calculated for each spectrogram band (e.g., each spectrogram frequency range) by applying a second transform along the spectrogram timeline (e.g., the horizontal axis). This differs from the modified Philips approach, where the second FFT is applied along the spectrogram frequency columns (e.g., the vertical axis). In this technique, the spectrogram is divided into N frequency bands, and the same number N of continuous wavelet transforms (CWT) are calculated, one for each band.

  Although the developers of this algorithm claim superior performance compared to the Philips algorithm, existing algorithms still exhibit drawbacks. For example, these algorithms may not be robust enough to reliably identify distorted speech and music, especially when the audio has been compressed with a CELP audio codec (e.g., as associated with mobile phone audio such as GSM®). Furthermore, these algorithms are generally susceptible to noise and analog distortion, such as those associated with microphone recording. Even if an algorithm can identify audio subjected to a single type of distortion (e.g., either GSM compression or microphone recording alone), it cannot handle combinations of distortions that are more general and closer to real-world conditions (e.g., audio recorded by a mobile phone microphone in a noisy room with reverberation and then GSM compressed).

  Thus, when applied to practical applications, existing fingerprinting schemes have unacceptably high error rates (e.g., false positives and false negatives), generate fingerprints that are too large, and/or are too slow to be implemented commercially. There is therefore a need to overcome the limitations that current audio recognition technology cannot solve.

"Modulation-Scale Analysis for Content Identification" by Sukitannon, IEEE Transactions on Signal Processing vol. 52 no. 10 October 2004

  The invention thus makes it possible to extract a characteristic fingerprint from an audio signal based on the content of the audio signal. This fingerprint can be matched against a set of reference fingerprints (e.g., in a database) to determine the identity of the signal or the similarity between two signals. Because of the nature of the fingerprint extraction algorithm, it is fast, efficient, highly accurate, scalable, and robust, without many of the problems that compromise existing solutions.

  In an embodiment of a method for generating an audio fingerprint, an audio signal is sampled and spectrogram information is computed from the signal. The spectrogram is divided into a plurality of frequency bands. The sample sequences within the bands are then resampled.

  In one embodiment, resampling the sequence samples includes logarithmically sampling the samples. In another embodiment, resampling the sequence samples includes scaling the size of the sequence samples in time based on the center frequency and/or frequency range of the corresponding frequency band, and sampling the scaled sequence samples. In another embodiment, resampling the sequence samples includes offsetting the sequence samples in time based on the center frequency and/or frequency range of the corresponding frequency band. In another embodiment, resampling the sequence samples includes sampling from different sequence samples (i.e., frequency bands) over time.

  A second transform is then applied to the resampled sequences to obtain a feature vector for each sequence. In one embodiment, the second transform comprises a transform along the time axis. In another embodiment, the second transform comprises a transform along the time axis followed by a transform along the frequency axis. In another embodiment, the second transform comprises a two-dimensional discrete cosine transform (2D DCT). The audio fingerprint is calculated based on the feature vectors. The audio fingerprint can be stored on a computer-readable medium or fixed temporarily as a transmittable signal.

  The various types of sequence-sample resampling described above make the algorithm less sensitive to variations in audio playback speed and to temporal compression and expansion. In this way, the fingerprint of an audio signal shows little or no variation regardless of changes in playback speed or variations due to time compression or expansion. Furthermore, the described resampling improves the low-frequency resolution of the second time-frequency transform. This allows a simple transform to be used instead of the complex wavelet transforms used elsewhere to analyze the spectrogram modulation spectrum, making the approach more efficient and faster than previous methods.

  Further, because of the described resampling, the majority of the samples in each band's output frame represent the beginning of the analyzed audio sequence. Thus, the resulting fingerprint is generated primarily from samples located at the beginning of the sequence. Because a relatively small portion of the audio sequence contributes most to the resulting fingerprint, the fingerprint can be used to match shorter audio sequences. In one implementation, for example, a fingerprint generated from a 5-second original audio frame can be reliably matched against samples obtained from an audio fragment of about half that length.

  Embodiments of the fingerprinting technique are also resistant to noise and signal distortion. One implementation can detect speech-like signals in the presence of 100% white noise (i.e., a 0 dB S/N ratio). Furthermore, the technique is resistant to filtering, compression, frequency equalization, and phase distortion.

  In another embodiment, an acoustic model is used to mark insignificant frequency bands when the generated fingerprint frame is formed using a specific number of frequency bands. Insignificant bands can include bands that add essentially no perceptual value for distinguishing audio samples. Because only the relevant frequency bands are processed, the S/N ratio is improved and the robustness of the whole fingerprint matching process is increased. In addition, eliminating irrelevant frequency bands can greatly improve recognition of band-limited audio content, for example audio encoded at very low bit rates or analog recordings made with low-speed tape.

  Furthermore, embodiments of the present invention provide fast indexing and efficient searching of fingerprints in large databases. For example, an index for each audio fingerprint can be calculated from a portion of the fingerprint content. In one embodiment, a set of bits from the fingerprint is used as the fingerprint index; these bits correspond to the lower-frequency coefficients, which are more stable because of the resampling. To match a test fingerprint against a set of fingerprints in the database, the test fingerprint can be matched against the index to obtain a group of fingerprint candidates. The test fingerprint is then matched against these fingerprint candidates, eliminating the need to match the test fingerprint against every fingerprint in the database.

  In another embodiment, an edge detection algorithm is used to determine the exact edge for the speech frame or fragment being analyzed. In some applications, it is important to know the position of the edge of the analyzed audio frame within the audio sample, especially if the audio sample differs only for a short period of the entire sample. Edge detection algorithms can use linear regression techniques to identify the edges of speech frames.

  Embodiments of the fingerprinting technique have a wide variety of applications, including real-time identification of audio streams and other audio content (e.g., streaming media, radio, advertisements, Internet broadcasts, music on CDs, MP3 files, or any other type of audio content). Thus, embodiments of the present invention enable efficient, real-time auditing of media content and other recordings.

FIG. 1 is a schematic diagram of a process for extracting fingerprints from an audio sample and using them, according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a fingerprint extraction system according to an embodiment of the present invention.
FIG. 3 is a flow diagram of a matching algorithm according to an embodiment of the present invention.
FIG. 4 illustrates an edge detection algorithm according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a fingerprint extraction system including a logarithmic resampler and a T-point transform module, according to an embodiment of the present invention.
FIGS. 6A and 6B show graphical representations of a fingerprint extraction algorithm according to some alternative embodiments of the invention.
FIGS. 7A and 7B show graphical representations of band-pass filters applied to an audio frame according to embodiments of the present invention.
FIGS. 8A, 8B, and 8C show graphical representations of resampling of subband sample sequences according to some embodiments of the invention.

(Overview)

  Embodiments of the present invention allow for extracting feature information (eg, a voice fingerprint) from a sample of speech, and using the extracted feature information to match or identify speech. As illustrated in FIG. 1, the audio frame 105 extracted from the audio sample 100 is input to the fingerprint extraction algorithm 110. The audio sample 100 can be provided by any of a wide range of sources. Using the sequence of speech frames 105, the fingerprint extraction algorithm 110 generates one or more fingerprints 115 that are characteristic of this sequence. The audio fingerprint 115 serves as a distinguishing identifier and provides information regarding identification or other characteristics about the sequence of frames 105 of the audio sample 100. In particular, one or more fingerprints 115 for the audio sample 100 can make the audio sample 100 uniquely identifiable. Embodiments of the fingerprint extraction algorithm 110 are described in more detail below.

  Once generated, the extracted fingerprint 115 can be used in further processing or stored on a medium for later use. For example, the fingerprint 115 can be used by a fingerprint matching algorithm 120, which compares the fingerprint 115 with entries in a fingerprint database 125 (a collection of audio fingerprints from known sources) to determine the identity of the audio sample 100. Various methods related to using the fingerprints are described below.

  The audio sample 100 can be obtained from any of a wide variety of sources, depending on the application of the fingerprinting system. In one embodiment, the audio sample 100 is sampled and digitized from a broadcast received from a media broadcaster. Alternatively, the media broadcaster can transmit the audio in digital form, eliminating the need for digitization. Types of media broadcasters include, but are not limited to, radio broadcasters, satellite broadcasters, and cable operators. The fingerprinting system can thus be used to audit these broadcasters to determine which audio is broadcast at what time. This allows an automated system to ensure compliance with broadcasting restrictions, licensing agreements, and the like. Because the fingerprint extraction algorithm 110 can operate without needing to know the precise start and end of the broadcast signals, it can operate without the cooperation or knowledge of the media broadcaster, ensuring independent and unbiased results.

  In another embodiment, a media server obtains audio files from a media library and distributes a digital broadcast over a network (e.g., the Internet) for use by the fingerprint extraction algorithm 110. Streaming Internet radio broadcasts are one example of this type of architecture, in which media, advertisements, and other content are delivered to individuals or groups of users. In such an embodiment, the fingerprint extraction algorithm 110 and the matching algorithm 120 typically have no information about the start and end times of the individual media items contained within the streaming content of the audio sample 100; however, the algorithms 110 and 120 do not need this information to identify the streaming content.

  In another embodiment, the fingerprint extraction algorithm 110 receives the audio sample 100, or a series of frames 105, from a client computer that has access to a storage device containing audio files. The client computer retrieves an individual audio file from the storage device and sends the file to the fingerprint extraction algorithm 110, which generates one or more fingerprints 115 from the file. Alternatively, the client computer can retrieve a batch of files from the storage device 140 and send them successively to the fingerprint extraction algorithm 110, which generates a set of fingerprints for each file. (As used herein, a "set" is understood to include any number of grouped items, including a single item.) The fingerprint extraction algorithm 110 can be executed on the client computer or by a remote server connected to the client computer over a network.

(Algorithm)
One embodiment of a fingerprint extraction system 200 that implements the fingerprint extraction algorithm 110 shown in FIG. 1 is illustrated in FIG. 2. The fingerprint extraction system 200 comprises an analysis filter bank 205 coupled to a plurality of processing channels (each including one or more processing modules, labeled here as elements 210 and 215), and the plurality of processing channels are coupled to a differential encoder 225 for generating the audio fingerprint 115. The fingerprint extraction system 200 is configured to receive the audio frames 105 from which an audio fingerprint will be generated.

  As described in more detail below, for each input audio frame 105, the analysis filter bank 205 generally computes power spectrum information for the received signal across a range of frequencies. In one embodiment, each processing channel corresponds to a frequency band within that range of frequencies; the bands may overlap. Accordingly, the channels divide the processing performed by the fingerprint extraction system 200 so that each channel performs the processing for its corresponding band. In another embodiment, each processing channel processes multiple frequency bands (i.e., multiple frequency bands are associated with each processing channel). In yet other embodiments, the processing for the multiple bands can be performed by a single module in a single channel, or the processing can be divided in any other configuration appropriate for the application and the technical limitations of the system.

  The analysis filter bank 205 receives an audio frame 105 (such as a frame 105 from the audio sample 100 illustrated in FIG. 1). The analysis filter bank 205 converts the audio frame 105 from the time domain to the frequency domain and computes power spectrum information for the frame 105 over a range of frequencies. In one embodiment, the power spectrum of the signal in the range of about 250 to 2250 Hz is divided into a number of frequency bands (e.g., Y bands, where Y = 13). The bands can have a linear or logarithmic center-frequency distribution (or any other scale) and may overlap. The output of the filter bank comprises a measure of the signal energy for each of the bands. In one embodiment, the energy measure is the cube root of the average spectral energy in the band.

  Various implementations of the analysis filter bank 205 are possible depending on software and hardware requirements and system limitations. In one embodiment, analysis filter bank 205 includes a plurality of bandpass filters that perform energy estimation and downsampling after separating the signal of speech frame 105 for each of the frequency bands. In one embodiment, the frequency that passes through each bandpass filter varies with time. In another embodiment, the frequency that passes through each bandpass filter is constant (ie, does not change with time). FIG. 7A shows a graphical representation of an embodiment in which the frequency passing through each bandpass filter does not change with time. Each rectangle in the graph 702 represents a signal of the audio frame 105 output by the bandpass filter. On the other hand, FIG. 7B shows a graphical representation of an embodiment in which the frequency passing through each bandpass filter varies with time. As can be seen in graph 704, in this example, the frequency passing through each bandpass filter decreases with time. In another embodiment, the passing frequency increases with time. After applying the bandpass filter to the audio frame 105, each frequency band includes a signal output by the corresponding bandpass filter.

  In another embodiment, the analysis filter bank 205 is implemented using a short-time fast Fourier transform (FFT). For example, audio 100 sampled at 8 kHz is divided into 64 ms frames 105 (e.g., 512 samples). The power spectrum of each 50%-overlapping segment spanning two audio frames 105 (e.g., 1024 samples) is then computed by applying a Hann window and performing an FFT, followed by band filtering using M uniformly or logarithmically spaced overlapping triangular windows.
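  The following sketch shows one way such an STFT-based filter bank could be put together with NumPy, also using the cube-root band-energy measure mentioned above. The function name, the linear spacing of the triangular band windows, and the example input are illustrative choices, not the patented implementation.

```python
import numpy as np

def analysis_filter_bank(audio, fs=8000, n_fft=1024, hop=512,
                         n_bands=13, f_lo=250.0, f_hi=2250.0):
    """Hann-windowed 1024-sample segments with 50% overlap, power spectrum,
    then Y = 13 overlapping triangular band windows between 250 and 2250 Hz.
    Returns a [n_bands x n_frames] array of cube-root band energies."""
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)

    centers = np.linspace(f_lo, f_hi, n_bands + 2)      # linear spacing (log also possible)
    bands = []
    for k in range(1, n_bands + 1):
        lo, c, hi = centers[k - 1], centers[k], centers[k + 1]
        tri = np.clip(np.minimum((freqs - lo) / (c - lo), (hi - freqs) / (hi - c)), 0, None)
        bands.append(tri)
    bands = np.array(bands)                              # [n_bands x n_bins]

    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + n_fft] * window)) ** 2
        band_energy = bands @ spec / np.maximum(bands.sum(axis=1), 1e-12)
        frames.append(np.cbrt(band_energy))              # cube root of average band energy
    return np.array(frames).T

subbands = analysis_filter_bank(np.random.randn(3 * 8000))   # 3 s of noise at 8 kHz
print(subbands.shape)                                         # (13, n_frames)
```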

  Various time-frequency domain transforms can be used instead of the FFT described above. For example, a modified discrete cosine transform (MDCT) may be used. One advantage of the MDCT is its lower complexity: it can be computed using a single n/4-point FFT plus some pre- and post-rotation of the samples. A filter bank 205 implemented with the MDCT is therefore expected to perform better than one implemented with the FFT, computing the transform, for example, about twice as fast.

  In another embodiment, the analysis filter bank 205 is implemented using an MP3 hybrid filter bank, which comprises a cascaded polyphase filter and an MDCT followed by aliasing cancellation. The MP3 filter bank produces 576 spectral coefficients for every frame of 576 audio samples. For audio sampled at 8 kHz, the resulting frame rate is 13.8 fps, compared with 15.625 fps for the 1024-point FFT filter bank described above. The difference in frame rate is compensated during the time-frequency analysis when the data is resampled, as described below. The analysis filter bank 205 can also be implemented using a quadrature mirror filter (QMF); the first stage of the MP3 hybrid filter bank uses a QMF with 32 equal-width bands. The 250 Hz to 2250 Hz frequency range of an audio signal sampled at 11,025 Hz can accordingly be divided into 13 bands.

  One advantage of the MP3 filter bank is its portability. Highly optimized implementations of the MP3 filter bank exist for different CPUs. Moreover, the fingerprint generation routine can easily be integrated with an MP3 encoder, obtaining the spectral coefficients from the MP3 filter bank without additional processing, or with an MP3 decoder, obtaining the spectral data directly from the MP3 bit stream without fully decoding it. Integration with other audio codecs is also possible.

  Once the subband samples have been determined, they are buffered and fed to one or more resamplers 210. A resampler 210 receives the subband samples and resamples them to produce a resampling sequence. In one embodiment, the resampler 210 resamples the subband samples in a non-uniform order, for example discontinuously or in an order reversed from the order in which the samples were obtained.

  In one embodiment, each resampler 210 corresponds to one of the Y frequency bands and receives a sequence of S samples that are linearly spaced in time for the corresponding frequency band (e.g., S = 64 to 80, depending on the filter bank implementation). In one embodiment, upon receiving a subband sample sequence, each resampler performs logarithmic resampling, scale resampling, or offset resampling on its subband sample sequence. As a result of the resampling, the resamplers 210 produce M resampling sequences for each audio frame.

  In one embodiment, log resampling includes a resampler 210 that log-maps the corresponding subband samples to generate a resampling sequence having T samples that are logarithmically spaced in time (eg, T = 64). Instead of logarithmic sampling, other types of non-linear sampling such as exponential resampling may be performed.
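  A minimal sketch of the logarithmic resampling described above is shown below, using simple linear interpolation onto logarithmically spaced time positions; the function name and the assumption that the early part of the sequence is sampled most densely (consistent with the earlier statement that most resampled points come from the beginning of the sequence) are illustrative.

```python
import numpy as np

def log_resample(subband_seq, T=64):
    """Resample an S-sample subband sequence onto T logarithmically spaced
    time positions, so the beginning of the sequence is sampled densely and
    the end sparsely."""
    S = len(subband_seq)
    positions = np.logspace(0, np.log10(S), T) - 1.0       # positions in [0, S-1]
    return np.interp(positions, np.arange(S), subband_seq)

print(log_resample(np.random.randn(80)).shape)             # (64,)
```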

  In one embodiment, scale resampling comprises the resampler 210 scaling the size (i.e., length) of each subband sample sequence in time. The subband sample sequence is scaled based on the center frequency and/or frequency range of the frequency band. For example, the scaling can be such that the higher the subband center frequency, the larger the size of the subband sample sequence. Alternatively, the scaling can be such that the higher the subband center frequency, the smaller the size of the subband sequence. The scaled subband sample sequence is resampled by the resampler 210 to produce a resampling sequence having T samples.

  In one embodiment, offset resampling comprises the resampler 210 offsetting (i.e., shifting) each subband sample sequence in time. The offset of the subband sequence is based on the center frequency and/or frequency range of the resampler's frequency band. For example, the time offset of the subband sample sequence can increase as the subband center frequency increases. The offset subband sample sequence is resampled by the resampler 210 to produce a resampling sequence having T samples.
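  The two variants just described can be sketched as follows. The mapping from band index to scale factor or offset is a stand-in for the center-frequency dependence described above, and the parameter values are placeholders.

```python
import numpy as np

def scale_resample(subband_seq, band_index, n_bands, T=64, min_frac=0.5):
    """Scale resampling sketch: keep a band-dependent fraction of the sequence
    (here, lower bands keep less and higher bands more; the opposite mapping
    is equally possible) and resample it to T samples."""
    S = len(subband_seq)
    frac = min_frac + (1.0 - min_frac) * band_index / max(n_bands - 1, 1)
    length = max(int(S * frac), 2)
    return np.interp(np.linspace(0, length - 1, T), np.arange(S), subband_seq)

def offset_resample(subband_seq, band_index, n_bands, T=64, max_offset=16):
    """Offset resampling sketch: shift the start of the sequence by an amount
    that grows with the band index, then resample to T samples."""
    S = len(subband_seq)
    offset = int(max_offset * band_index / max(n_bands - 1, 1))
    return np.interp(np.linspace(offset, S - 1, T), np.arange(S), subband_seq)

seq = np.random.randn(80)
print(scale_resample(seq, 3, 13).shape, offset_resample(seq, 3, 13).shape)
```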

  In another embodiment, each resampler 210 corresponds to multiple frequency bands and receives the subband sample sequences of those bands. The number of subband sample sequences received by each resampler 210 varies with the implementation. In one embodiment, the frequency bands corresponding to each resampler 210 are contiguous.

  Each resampler 210 then performs time-frequency resampling on its corresponding subband sequences. Time-frequency resampling comprises the resampler 210 sampling from a corresponding frequency band that changes over time, producing a resampling sequence having T samples. In one embodiment, the frequency from which the resampler 210 samples decreases over time. In another embodiment, it increases over time. As a result of the resampling, the resamplers 210 produce M resampling sequences for each audio frame.

  FIGS. 8A and 8B illustrate time-frequency resampling according to one embodiment. In FIG. 8A, each rectangle drawn with a gray outline in graph 802 represents a different frequency band (i.e., the sample sequence of a frequency band). Each black diagonal line represents a resampling sequence generated by a resampler 210 as a result of time-frequency resampling. As can be seen in graph 802, to generate a resampling sequence, each resampler 210 samples from a corresponding frequency band that changes over time. In the embodiment of graph 802, the frequency band sampled by the resampler 210 decreases over time. Graph 804 in FIG. 8B shows the resampling sequences of FIG. 8A without the frequency bands.

  Graph 806 in FIG. 8C illustrates the resampling sequences generated by the resamplers 210 in an embodiment where each resampler 210 corresponds to one of the Y frequency bands. As in FIG. 8A, each rectangle drawn with a gray outline in graph 806 represents a different frequency band, and each black line in the center of a rectangle represents a resampling sequence. As can be seen in FIG. 8C, the number of resampling sequences generated by the resamplers 210 in this embodiment is the same as the number of frequency bands (i.e., M = Y), because each resampler 210 samples only within its own frequency band.

  However, as can be seen in FIG. 8A, in an embodiment where each resampler 210 corresponds to multiple frequency bands and performs temporal frequency resampling, the number of resampling sequences is less than the number of frequency bands (ie, M <Y). In this embodiment, more frequency bands are required to ensure that each resampler 210 obtains samples from within the same time period and each resampling sequence includes T samples.

  After the resamplers 210 perform the resampling and M resampling sequences are generated, the resampling sequences can be stored in an [M × T] matrix, which corresponds to a sampled spectrogram with a time (horizontal) axis and a frequency (vertical) axis. The M resampling sequences are provided to one or more transform modules 215, which perform a transform on the samples.

  In one embodiment, the transform performed on each band of samples is a T-point transform applied along the time axis (e.g., to each row of the [M × T] matrix). In one embodiment, the T-point transform is a T-point FFT. The series of coefficients obtained from the FFT is called a feature vector. In one embodiment, the feature vector for each band comprises every other coefficient of the FFT, taken in order of increasing frequency for that band. Thus, each feature vector can contain N coefficients (e.g., N = T/2 = 32). In another embodiment, instead of a T-point FFT, a T-point discrete cosine transform (DCT), a T-point discrete Hartley transform (DHT), or a discrete wavelet transform (DWT) is performed. The resulting feature vectors are provided to the differential encoder 225.
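  A sketch of the T-point FFT variant is given below. Whether magnitudes, real parts, or complex coefficients are kept is not specified above, so the use of magnitudes here is an assumption, as is the function name.

```python
import numpy as np

def feature_vector(resampling_seq):
    """T-point transform along the time axis: FFT the T-sample resampling
    sequence and keep every other coefficient (here as a magnitude), in order
    of increasing frequency, giving N = T/2 values per band."""
    T = len(resampling_seq)                   # e.g. T = 64
    coeffs = np.fft.fft(resampling_seq)       # T complex coefficients
    return np.abs(coeffs[0:T:2])              # every other coefficient -> N = T/2

print(feature_vector(np.random.randn(64)).shape)   # (32,)
```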

  In another embodiment, the transform comprises a T-point transform followed by an M-point transform. The T-point transform is applied to the samples of each band as described above. After the T-point transform, the samples of each band are scaled in size and normalized by windowing. Following the scaling, windowing, and normalization, an M-point transform, which is a transform along the frequency axis (e.g., applied to each column of the [M × T] matrix), is applied to the samples. In one embodiment, the M-point transform is an FFT, DCT, DHT, or DWT along the frequency axis. The resulting feature vectors are provided to the differential encoder 225.

  In another embodiment, the transform is a two-dimensional discrete cosine transform (2D DCT). To perform this transform, the samples of each band are normalized. Once the samples are normalized, a one-dimensional DCT is performed along the time axis, followed by a one-dimensional DCT along the frequency axis. The resulting feature vectors are provided to the differential encoder 225.

  The differential encoder 225 generates the fingerprint 115 for the audio sample. In one embodiment, the differential encoder 225 subtracts the feature vectors corresponding to each pair of adjacent bands. With Y bands, there are Y−1 pairs of adjacent bands. Subtracting two feature vectors yields a vector of N difference values. For each of these difference values, the differential encoder 225 selects 1 if the difference is greater than or equal to 0, and 0 if the difference is less than 0. For each group of 4 bits in the resulting sequence, the encoder assigns a bit value according to a codebook table; the best codebook values are determined during tuning and training of the fingerprinting algorithm. Repeating this process for the feature vectors of each pair of consecutive bands yields a [(Y−1) × N/4] bit matrix. This matrix can be represented as a linear bit sequence and is used as the audio fingerprint 115. In an embodiment where Y = 13 and N = 32, the fingerprint 115 contains 12 bytes of information.
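  Under the [(Y−1) × N/4] description above, the encoder can be sketched as follows. The actual codebook is learned during tuning, so the parity mapping used here is purely a placeholder, and the function name and random input are illustrative.

```python
import numpy as np

def differential_encode(feature_vectors, codebook=None):
    """Subtract feature vectors of adjacent bands, threshold the differences
    at zero, and map each group of 4 bits through a 16-entry codebook to one
    output bit, yielding a [(Y-1) x N/4] bit matrix."""
    Y, N = feature_vectors.shape                          # e.g. Y = 13, N = 32
    if codebook is None:                                  # placeholder: parity of the 4-bit group
        codebook = np.array([bin(v).count("1") & 1 for v in range(16)], dtype=np.uint8)
    diffs = feature_vectors[:-1] - feature_vectors[1:]    # (Y-1) x N difference values
    bits = (diffs >= 0).astype(np.uint8)
    groups = bits.reshape(Y - 1, N // 4, 4)
    idx = groups @ np.array([8, 4, 2, 1])                 # each 4-bit group as a codebook index
    return codebook[idx]                                  # (Y-1) x (N/4) fingerprint bits

fingerprint = differential_encode(np.random.randn(13, 32))
print(fingerprint.shape)                                  # (12, 8) -> 96 bits = 12 bytes
```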

  In one embodiment, principal component analysis (PCA) is used to decorrelate the resulting feature vectors and reduce their size before quantization. Additionally or alternatively, other decorrelation techniques, such as the discrete cosine transform, can be used to eliminate redundancy and compress the feature vectors.

  In one embodiment, the fingerprint extraction system 200 generates multiple fingerprints for a series of highly overlapping audio frames within a particular audio signal. In one embodiment, each series of audio frames 105 processed by the system 200 comprises 3 seconds of the audio signal and starts 64 milliseconds after the start of the previous series. In this way, fingerprints are generated for 3-second portions of the audio signal starting every 64 milliseconds. To implement such a scheme, the fingerprint extraction system can include a memory buffer before and after the analysis filter bank 205, the buffer being updated with the next 64 ms of the audio signal as each new audio frame is received.

  FIGS. 6A and 6B show a graphical representation of the fingerprint extraction algorithm 110 according to some alternative embodiments of the present invention. In this process, the analysis filter bank 205 receives an audio frame; graph 602 shows the received audio frame in the time domain. As shown in graph 604, the analysis filter bank 205 performs an FFT on the audio frame to convert it from the time domain to the frequency domain, and power spectrum information is then computed for the frame in the frequency domain. As shown in graph 606, the analysis filter bank 205 applies a plurality of band-pass filters to separate the frame's signal into the frequency bands.

  The frequency band sub-band sample sequence is resampled by the resampler 210. FIG. 6B illustrates four alternative techniques (denoted A, B, C, and D) that can be performed by a resampler 210 that resamples a subband sample sequence. In one embodiment, techniques A, B, and C are techniques that can be performed when each resampler 210 corresponds to one frequency band. In one embodiment, technique D can be performed when each resampler 210 corresponds to multiple frequency bands and the resampler 210 is configured to perform temporal frequency resampling.

  In technique A, each resampler 210 performs logarithmic sampling on the corresponding subband sample sequence (graph 608) to generate a resampling sequence having T samples (graph 616). In technique B, each resampler 210 scales the size of its subband sample sequence based on the center frequency and/or frequency range of the subband. As shown in graph 610, in this example, the higher the subband center frequency and the wider the subband frequency range, the smaller the size of the subband sample sequence. The scaled subband sample sequences are resampled to produce resampling sequences each having T samples (graph 618).

  In technique C, each resampler 210 offsets its subband sample sequence in time based on the subband center frequency and/or frequency range. As shown in graph 612, in this embodiment, the higher the subband center frequency, the greater the offset of the subband sample sequence. The offset subband sample sequences are resampled to produce resampling sequences each having T samples (graph 620).

  In technique D, each resampler 210 performs time-frequency resampling on the corresponding subband sample sequence. Temporal frequency resampling is performed by a resampler 210 that samples from a corresponding different frequency band as time changes. As shown in the graph 614, in this embodiment, as time elapses, the frequency sampled by the resampler 210 decreases. Resampling produces resampling sequences that each have T samples (graph 622).

  The resampling sequence (M resampling sequence) generated by the resampler 210 is stored in an [M × T] matrix. Each conversion module 215 performs a conversion on the resampling sequence generated by the corresponding resampler 210 (ie, the resampler 210 in the same channel as the conversion module 215). FIG. 6 illustrates three alternative techniques (denoted as E, F, and G) that can be performed by a transform module 215 that transforms the resampling sequence to generate a feature element vector.

  In technique E, the transform module 215 performs a T-point transform, as illustrated in graph 624. In technique F, the transform module 215 performs a T-point transform along one dimension followed by a transform along the other dimension, as illustrated in graph 626. In technique G, the transform module 215 performs a two-dimensional DCT or other suitable two-dimensional transform, as illustrated in graph 628.

  Once the feature vectors have been obtained by transforming the subband samples, the differential encoder 225 generates the fingerprint 115 using the feature vectors produced by the transform modules 215.

  FIG. 5 shows one example of the fingerprint extraction system 200 in which the resampler 210 is a logarithmic resampler and the transform module 215 is a T-point transform module. The logarithmic resampler 210 performs logarithmic sampling as described above (technique A). It should be understood, however, that in other embodiments the logarithmic resampler 210 may be replaced by a resampler 210 that performs another resampling technique (i.e., technique B, C, or D).

  The T-point transform module 215 performs a T-point transform as described above (technique E). In other embodiments, however, the T-point transform module 215 can be replaced by a transform module that performs another transform technique (i.e., technique F or G).

(Acoustic model)
In various applications of the fingerprinting system, certain frequency bands may be insignificant because they are imperceptible, because the coding process applied to the audio sample removed the bands, or for other reasons. Accordingly, in one embodiment, an acoustic model 235 is used to identify and mark the insignificant frequency bands for a particular fingerprint. Acoustic models, such as psychoacoustic models, are well known in various fields of audio processing. A set of model parameters for the acoustic model 235 can be calculated for a high-quality reference sample while the fingerprint 115 is generated, and stored in the database 125. The insignificant bands of a fingerprint 115 can be marked by assigning them a special code or by zeroing the corresponding values (i.e., bits). The marked bands are then effectively ignored in any subsequent matching process, because only corresponding band pairs that both have non-zero values are used to distinguish the fingerprint 115 when matching it against a database record. In addition, the marked bands (i.e., those with zero values) can be excluded entirely from the comparison.
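  One way this masking could enter the comparison is sketched below: rows (band pairs) whose values are all zero in either fingerprint are skipped before computing the bit error rate used later for matching. The all-zero-row convention and the function name are illustrative assumptions.

```python
import numpy as np

def masked_bit_error_rate(fp_a, fp_b):
    """Compare two [(Y-1) x N/4] fingerprint bit matrices, ignoring band rows
    marked as insignificant (signalled here by an all-zero row)."""
    significant = ~(np.all(fp_a == 0, axis=1) | np.all(fp_b == 0, axis=1))
    a, b = fp_a[significant], fp_b[significant]
    if a.size == 0:
        return 0.5                       # nothing comparable; treat as uncorrelated
    return float(np.mean(a != b))

a = np.random.randint(0, 2, (12, 8))
b = a.copy()
b[0] = 0                                 # band pair 0 marked insignificant in one fingerprint
print(masked_bit_error_rate(a, b))       # 0.0: the masked row is ignored
```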

  In one embodiment, the acoustic model is a psychoacoustic model of the human auditory system. This model is useful when the fingerprinting system is used to recognize audio intended for the human auditory system. Such audio may have been compressed by one or more perceptual encoders that remove perceptually irrelevant audio information. Using a psychoacoustic model of human hearing, these irrelevant bands can be identified and excluded from the fingerprint.

  The psychoacoustic model, however, is only one type of acoustic model, suited to audio that has been perceptually encoded for human listeners. Another acoustic model is one that simulates the characteristics of a specific recording device; each band in such a recording-device acoustic model can be assigned a weighting factor according to its importance. Yet another acoustic model simulates the characteristics of a particular environment, such as the background noise found in a car or room. In such an embodiment, each band of the acoustic model can be assigned a weighting factor according to its importance in the environment for which the system is designed.

  In one embodiment, the parameters of the acoustic model 235 and of the filter bank 205 depend on the type and characteristics of the audio signal 100 being analyzed. Various parameters, including the set of subband weighting factors, the number of filter bank bands, and their frequency distribution, are chosen to better match the characteristics of the target audio signal. For example, for speech-like audio, the signal power is concentrated mainly in the low-frequency bands, whereas music may contain significant high-frequency components depending on the genre. In one embodiment, the parameters of the acoustic model are calculated from a reference audio signal and stored in the content database together with the generated fingerprint. In another embodiment, the parameters of the acoustic model are calculated dynamically based on the characteristics of the audio signal being analyzed during the matching process.

  Thus, possible applications of the acoustic model 235 include adjusting speech recognition parameters related to specific environments and / or recording devices and encoding algorithm characteristics. For example, knowing the acoustic characteristics of a cellular phone voice path (microphone characteristics, voice processing and compression algorithms, and the like) makes it possible to develop an acoustic model that simulates these characteristics. By using such a model during fingerprint comparison, the robustness of the generated fingerprint matching process can be significantly increased.

(Fingerprint indexing and matching)
In one embodiment, a fingerprint indexer 230 generates an index for each fingerprint 115. The fingerprints 115 are then stored in the fingerprint database 125, allowing efficient searching and matching of the contents of the fingerprint database 125. In one embodiment, the index for a fingerprint 115 comprises a portion or a hash of the fingerprint 115. Accordingly, the fingerprints 115 in the fingerprint database 125 are indexed according to useful identifying information about them.

  In the embodiment described above, where each fingerprint 115 comprises a [(Y−1) × N/4] bit matrix, the indexer 230 uses the leftmost columns of bits as the index. In an embodiment where each fingerprint 115 is a [12 × 8] bit matrix, the index for the fingerprint 115 may be the leftmost two columns of bits (24 bits in total). The bits used as the index for each fingerprint 115 are thus a subset of the fingerprint 115 based on the low-frequency spectral coefficients of the feature vectors used to compute the fingerprint 115. These bits correspond to the low-frequency components of the resampled and transformed spectrogram bands, which are stable and relatively insensitive to noise and distortion. With high probability, similar fingerprints will therefore have the same numerical index. In this way, the index can be used to label and group similar, likely matching fingerprints in the database.
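  A sketch of this index construction is shown below; packing the 24 bits into a single integer key is an implementation choice, and the function name is illustrative.

```python
import numpy as np

def fingerprint_index(fp_bits, n_cols=2):
    """Build the index from the leftmost columns of the [(Y-1) x N/4]
    fingerprint bit matrix: with a [12 x 8] matrix and two columns this packs
    24 bits into one integer key."""
    index_bits = fp_bits[:, :n_cols].flatten()            # 12 * 2 = 24 bits
    return int("".join(str(int(b)) for b in index_bits), 2)

fp = np.random.randint(0, 2, (12, 8))
print(hex(fingerprint_index(fp)))
```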

  FIG. 3 illustrates a method for matching a test fingerprint with the fingerprint database 125 using the above-described index, according to one embodiment of the present invention. In order to find a match in the fingerprint database 125 for the test fingerprint, the matching algorithm begins at step 310 where an index value for the test fingerprint is calculated as described above. This index value is used to obtain a group of fingerprint candidates that includes, for example, all of the fingerprints in the database 125 that have the same index value (320). As mentioned above, due to this method by which the index is calculated, any match in the database 125 is likely to be in this group of fingerprint candidates.

  To test for a match within the group of fingerprint candidates, a bit error rate (BER) between the test fingerprint and each fingerprint candidate is calculated (330). The BER between two fingerprints is the fraction of corresponding bits that do not match; for random, completely uncorrelated fingerprints the BER is expected to be 50%. In one embodiment, two fingerprints are considered matching when the BER is less than about 35%, although other thresholds can be used depending on the desired tolerance for false positives and/or false negatives. In addition, measures or criteria other than BER can be used to compare two fingerprints; for example, a match rate, the inverse measure of BER, can be used. Certain bits can also be weighted more heavily than others in the comparison of two fingerprints.
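  The candidate comparison of step 330 can be sketched as follows; the 35% threshold comes from the text above, while the function names and the toy database are placeholders.

```python
import numpy as np

def bit_error_rate(fp_a, fp_b):
    """Fraction of corresponding bits that differ (about 0.5 for unrelated fingerprints)."""
    return float(np.mean(fp_a != fp_b))

def match_candidates(test_fp, candidates, threshold=0.35):
    """Compare the test fingerprint against each candidate retrieved by index
    and keep those whose BER is below the matching threshold."""
    matches = []
    for cand_id, cand_fp in candidates:
        ber = bit_error_rate(test_fp, cand_fp)
        if ber < threshold:
            matches.append((cand_id, ber))
    return matches

test = np.random.randint(0, 2, 96)
database_candidates = [("item_a", test.copy()), ("item_b", np.random.randint(0, 2, 96))]
print(match_candidates(test, database_candidates))
```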

  If no match is found within the given matching criteria (340) and there are no more indexes to modify (350), the matching algorithm reports that it could not find a match for the test fingerprint in the database 125. The system can then continue the search (e.g., using less constrained criteria to obtain fingerprint candidates) or stop. If there are one or more matching fingerprints (340), a list of the matching fingerprints is returned (360).

  In one embodiment, the system may repeat the search described above after modifying (370) the calculated fingerprint index to obtain a different set of fingerprint candidates to search for a match. To modify (370) the calculated fingerprint index, one or more bits of the index can be flipped. In one embodiment in which the fingerprint index has 24 bits, if no match could be found using the original fingerprint index, the search step is repeated up to 24 times, each time flipping a different single bit of the 24-bit index. Various other techniques can also be used to expand the search space.
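  A sketch of this single-bit-flip expansion of the search space is shown below; the generator name is illustrative.

```python
def expanded_indexes(index, n_bits=24):
    """Yield the original 24-bit index followed by the 24 variants obtained by
    flipping one bit at a time, widening the candidate lookup when the
    original index produces no match."""
    yield index
    for bit in range(n_bits):
        yield index ^ (1 << bit)

print([hex(i) for i in expanded_indexes(0x000001)][:4])
```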

  In one embodiment, the fingerprint indexer 230 may generate one or more indexes by selecting index bits from one or more fingerprints 115 based on a set of frequency-band weighting factors calculated by the acoustic model 235 and previously stored in the database 125. When multiple indexes are used, including indexes obtained by bit flipping, the group of fingerprint candidates comprises all candidates obtained for each calculated index.

  In another embodiment, the search space can be narrowed by pre-screening and selecting only the fingerprint candidates found in most or all of the candidate groups obtained for the individual calculated indexes. Pre-screening the multiple fingerprint candidate groups obtained with multiple indexes, including those obtained by bit flipping, can significantly improve database search performance. In one embodiment, the indexes and references to the possible fingerprint candidates are stored in computer memory, allowing fast selection and pre-screening of fingerprint candidates. In the second step (step 320), only the fingerprint candidates that are most likely to match the given fingerprint are loaded into computer memory for comparison. This approach allows fast searching while keeping only a small index in computer memory and storing the larger fingerprints on a slower device (e.g., on a hard disk drive or over the network).

(Audio frame edge detection)
In some applications, it may be desirable to detect matching audio fragment edges. Edge detection allows the system to know exactly where a particular matching speech fragment is occurring in time. Depending on the quality of the speech to be analyzed, embodiments of the edge detection algorithm can detect matching speech fragment edges with an accuracy of about 0.1 to 0.5 seconds.

  As described above, embodiments of the fingerprinting technique accumulate audio samples in subband processing buffers. This buffering delays the output of the fingerprinting algorithm and blurs the edges of the audio fragment. This effect is illustrated in FIG. 4, which is a graph of the bit error rate (BER) over time between the reference fingerprint for an audio fragment and the series of fingerprints generated over time from the input sample audio stream. In the illustrated embodiment, the subband buffer holds 3 seconds of audio, and a match is indicated when two fingerprints have a bit error rate (BER) of 35% or less.

  Initially, at time T0, the subband processing buffer is empty, so the generated fingerprint has zero matching with the original audio (e.g., the BER is expected to be approximately 50%). As audio samples are added to the subband buffer, the BER decreases, indicating better matching. After sufficient time has elapsed, the BER falls below the 35% threshold at time T1, indicating a match. At time T2, the BER reaches a plateau as the buffer becomes filled with samples. When the fingerprinting algorithm passes the end of the corresponding audio fragment, it begins to generate fingerprints that match less well, so the BER increases from time T3 and reaches the recognition threshold of 35% at time T4. The duration of the resulting matching curve (T1 to T4) and of its plateau (T2 to T3) are each shorter than the duration of the matched audio fragment (T0 to T3).

  In one embodiment, an edge detection algorithm is used to determine the exact edges of a matching audio frame or fragment. First, a BER curve such as that shown in FIG. 4 is obtained. The BER curve is divided into ranges corresponding to the start of the match, where the BER is decreasing (e.g., T1 to T2), the plateau, where the BER is approximately constant (e.g., T2 to T3), and the end of the match, where the BER is increasing (e.g., T3 to T4). Because the actual BER curve is generally noisy, it is segmented using an appropriate technique such as regression analysis. In one embodiment, all samples that produce a BER greater than 35% are ignored because they may not be reliable. The start of the matching audio fragment (i.e., time T1) can then be calculated using linear regression as the intersection of the line that best fits the decreasing-BER range (e.g., T1 to T2) with the horizontal line corresponding to 50% BER. A similar approach can be applied to obtain an estimate of time T5 as the intersection of the line that best fits the increasing-BER range (e.g., T3 to T4) with the horizontal line corresponding to 50% BER. In this case, however, time T5 corresponds to the end of the fragment delayed by the subband buffer period B, not to the actual end of the matching audio fragment. The end position of the fragment (i.e., time T3) can be calculated by subtracting the subband buffer period B from the resulting estimate T5.
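  A minimal sketch of this estimation, assuming the BER curve is available as arrays of times and BER values and that the split between the falling and rising portions is taken at the minimum of the curve (a simplification; the specification contemplates regression-based segmentation of a noisy curve), could be:

import numpy as np

def estimate_edges(times, bers, buffer_period_b, threshold=0.35):
    # Fit straight lines to the falling and rising parts of the BER curve,
    # intersect each with the 50% BER level, and correct the end estimate
    # by the subband buffer period B.
    t = np.asarray(times, dtype=float)
    e = np.asarray(bers, dtype=float)
    keep = e <= threshold                  # ignore unreliable samples with BER > 35%
    t, e = t[keep], e[keep]
    split = t[np.argmin(e)]                # crude boundary between falling and rising parts

    def cross_50(ts, es):
        slope, intercept = np.polyfit(ts, es, 1)  # least-squares line fit
        return (0.5 - intercept) / slope          # time at which the fit reaches 50% BER

    t1 = cross_50(t[t <= split], e[t <= split])   # estimated start of the matching fragment
    t5 = cross_50(t[t >= split], e[t >= split])   # end estimate delayed by the buffer
    t3 = t5 - buffer_period_b                     # actual end after removing the delay
    return t1, t3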

  In another embodiment, the end of the matching audio fragment is estimated as the end of the range T2 to T3, and the start of the audio fragment is calculated by subtracting the duration of the subband buffer B from the time T2 corresponding to the start of that range.

(Conclusion)
Although described with respect to vectors and matrices, the information calculated for any fingerprint or sub-fingerprint can be stored and processed in any form, not just as a vector or matrix of values. Thus, the terms vector and matrix are used only as a convenient mechanism for representing data extracted from speech samples and are not meant to imply any other limitations. In addition, although the power spectrum has been described with respect to a spectrogram, it should be understood that data representing the power spectrum or spectral analysis of an audio signal can be represented and used not only as a spectrogram but also in any other suitable form.

  In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code that can be executed by a computer processor to perform any or all of the steps, operations, or processes described herein. Thus, any of the steps, operations, or processes described herein can be performed or implemented with one or more software or hardware modules, alone or in combination with other devices. Further, any part of the system described in terms of hardware elements can be implemented in software, and any part of the system described in terms of software elements can be implemented in hardware, for example hard-coded into a dedicated circuit. For example, the code for performing the described methods can be embedded in a hardware device such as an ASIC or other custom circuit. This allows the benefits of the present invention to be combined with the capabilities of many different devices.

  In another embodiment, the fingerprint processing algorithm is embedded in and executed by any of a variety of audio devices, such as mobile phones, personal digital assistants (PDAs), MP3 players and/or recorders, set-top boxes, television sets, game consoles, or any other device that stores, processes, or plays audio content. Incorporating the fingerprinting algorithm into such a device can provide several benefits. For example, generating audio fingerprints directly on a mobile phone can provide better results than sending compressed audio from the phone to a fingerprint processing server over the mobile network. Executing the algorithm on the phone eliminates the distortion caused by GSM compression, which is designed for speech and does not work well for music. This approach can therefore significantly improve the recognition of audio recorded by a mobile phone, and it also reduces network traffic and the load on the server.

  Another benefit of such an embedded approach is that the listening experience can be monitored without infringing on privacy and user rights. For example, the recording device records audio, generates a fingerprint, and then sends only the fingerprint to the server for analysis; the recorded audio never leaves the device. The server cannot recover the original audio from the fingerprint, but it can use the transmitted fingerprint to identify music or advertisements of interest.

  The above-described embodiments of the present invention have been presented for purposes of illustration and are not intended to be exhaustive or to limit the invention to the precise forms disclosed. One skilled in the art can appreciate that many modifications and variations are possible in light of the above techniques. Accordingly, it is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

100 Speech Sample
105 Speech Frame
110 Fingerprint Extraction Algorithm
115 Fingerprint
120 Fingerprint Matching Algorithm
125 Fingerprint Database
200 Fingerprint Extraction System
205 Analysis Filter Bank
210 Resampler
215 Transform Module
225 Differential Encoder
230 Fingerprint Indexer
235 Acoustic Model

Claims (21)

  1. A method for extracting an audio fingerprint from an audio frame, the method comprising:
    filtering the audio frame into a plurality of frequency bands to generate a corresponding plurality of filtered audio signals;
    scaling each of the filtered audio signals in time based on a frequency of the corresponding frequency band;
    resampling the scaled filtered audio signals to generate resampled audio signals;
    transforming the resampled audio signals to generate a feature vector for each of the resampled audio signals; and
    calculating the audio fingerprint based on the feature vectors.
  2. The method of claim 1, wherein the frequency on which the scaling step is based is at least one of a middle frequency and a frequency range of the corresponding frequency band.
  3. The method of claim 1, wherein transforming the resampled audio signals comprises performing a transform along a time axis.
  4. The method of claim 1, wherein transforming the resampled audio signals comprises:
    transforming the resampled audio signals along a time axis; and
    transforming the resampled audio signals along a frequency axis.
  5. The method of claim 4, further comprising scaling in magnitude, windowing, and normalizing the resampled audio signals.
  6. The method of claim 1, wherein transforming the resampled audio signals comprises performing a two-dimensional discrete cosine transform (2D DCT).
  7. A method for extracting an audio fingerprint from an audio frame, the method comprising:
    filtering the audio frame into a plurality of frequency bands to generate a corresponding plurality of filtered audio signals;
    offsetting each of the filtered audio signals in time based on the corresponding frequency band;
    resampling the offset filtered audio signals to generate resampled audio signals;
    transforming the resampled audio signals to generate a feature vector for each of the resampled audio signals; and
    calculating the audio fingerprint based on the feature vectors.
  8. The method of claim 7, wherein the frequency on which the offsetting step is based is at least one of a middle frequency and a frequency range of the corresponding frequency band.
  9. The method of claim 7, wherein transforming the resampled audio signals comprises performing a transform along a time axis.
  10. The method of claim 7, wherein transforming the resampled audio signals comprises:
    transforming the resampled audio signals along a time axis; and
    transforming the resampled audio signals along a frequency axis.
  11. The method of claim 10, further comprising scaling in magnitude, windowing, and normalizing the resampled audio signals.
  12. The method of claim 7, wherein transforming the resampled audio signals comprises performing a two-dimensional discrete cosine transform (2D DCT).
  13. A method for extracting an audio fingerprint from an audio frame, the method comprising:
    filtering the audio frame into a plurality of frequency bands to generate a corresponding plurality of filtered audio signals;
    performing time-frequency resampling on the filtered audio signals to generate resampled audio signals;
    transforming the resampled audio signals to generate a feature vector for each of the resampled audio signals; and
    calculating the audio fingerprint based on the feature vectors.
  14. The method of claim 13, wherein each filtered audio signal is associated with one or more processing channels, and wherein time-frequency resampling is performed on the filtered audio signals of each processing channel to generate the resampled audio signals.
  15. The method of claim 14, wherein time-frequency resampling the filtered audio signals of a processing channel comprises sampling from different filtered audio signals of the processing channel over time.
  16. The method of claim 13, wherein transforming the resampled audio signals comprises performing a transform along a time axis.
  17. The method of claim 13, wherein transforming the resampled audio signals comprises:
    transforming the resampled audio signals along a time axis; and
    transforming the resampled audio signals along a frequency axis.
  18. The method of claim 17, further comprising scaling in magnitude, windowing, and normalizing the resampled audio signals.
  19. The method of claim 13, wherein transforming the resampled audio signals comprises performing a two-dimensional discrete cosine transform (2D DCT).
  20. A method for extracting an audio fingerprint from an audio frame, the method comprising:
    filtering the audio frame into a plurality of frequency bands to generate a corresponding plurality of filtered audio signals;
    resampling the filtered audio signals in a non-uniform order to generate resampled audio signals;
    transforming the resampled audio signals to generate a feature vector for each of the resampled audio signals; and
    calculating the audio fingerprint based on the feature vectors.
  21. The method of claim 20, wherein resampling the filtered audio signals in the non-uniform order comprises resampling the filtered audio signals in an order opposite to the order in which the filtered audio signals were initially sampled.